Data from Martin Taylor martin.taylor@igmm.ed.ac.uk
July 2012

Generated using Ensembl version 67 EPO12 alignments.

For the projection between species, the peak (modal tag position) of the cluster is projected and the boundaries of the cluster are mapped into the identified genomic segment. BED files show the liftover of both the peak (thick line) and the tag cluster extent (thin line).

Files
#####

canFam3_projectedHumanPermissiveTSS.bed.gz  Human permissive projected into dog
canFam3_projectedMousePermissiveTSS.bed.gz  Mouse permissive projected into dog
hg19_mm9ProjectionOutcome.bed.gz            The outcome of projecting human into mouse
hg19_projectedMousePermissiveTSS.bed.gz     Mouse permissive projected into human
mm9_hg19ProjectionOutcome.bed.gz            The outcome of projecting mouse into human
mm9_projectedHumanPermissiveTSS.bed.gz      Human permissive projected into mouse
rheMac2_projectedHumanPermissiveTSS.bed.gz  Human permissive projected into macaque
rheMac2_projectedMousePermissiveTSS.bed.gz  Mouse permissive projected into macaque
rn4_projectedHumanPermissiveTSS.bed.gz      Human permissive projected into rat
rn4_projectedMousePermissiveTSS.bed.gz      Mouse permissive projected into rat
mouse_human_oneToOneOrthologyTSS.txt.gz     Pairs of mouse and human TSS assigned as orthologous

The "outcome" files contain details of orthologous sequence projection, simple expression analysis and TSS orthology assignment.

Details
#######

Files
-----

Data files (BEDs where possible) can be found at the WEBDAV area.

=======================================  ===========
File                                     Description
=======================================  ===========
hg19_projectedMousePermissiveTSS.bed.gz  Mouse TSS projected into human. Red forward strand, blue reverse. hg19 coordinates.
mm9_projectedHumanPermissiveTSS.bed.gz   Human TSS projected into mouse. Red forward strand, blue reverse. mm9 coordinates.
hg19_mm9ProjectionOutcome.bed.gz         Records the outcome of projecting human TSS into mouse. hg19 coordinates. See table below for 12 classifications of outcome.
mm9_hg19ProjectionOutcome.bed.gz         Records the outcome of projecting mouse TSS into human. mm9 coordinates. See table below for 12 classifications of outcome.
mouse_human_oneToOneOrthologyTSS.txt.gz  Two-column file giving 1:1 correspondence between mouse and human TSS orthologue pairs.
=======================================  ===========

Additional files are given for both mouse and human TSS projected into other species.

File names are underscore (_) delimited. The first part, e.g. mm9, denotes the genome that the coordinates correspond to. The second part denotes what has been projected. E.g. mm9_projectedHumanPermissiveTSS indicates that the human permissive set of TSS was projected into the mm9 mouse reference genome assembly.

ProjectionOutcome files

These files contain all permissive (thus also robust) TSS defined in the FANTOM5 freeze 1.1. They are partitioned into twelve color-coded categories based on both the projection and expression properties of the TSS.

Projections between species fall into one of four categories:

* Unaligned - TSS location is not aligned with any sequence from the target species. This does not discriminate between genomic gain or loss and technical problems (alignment error, genome assembly error, lack of read coverage in raw genomic sequence).
* Gap - TSS location projects into an alignment gap in the target species. This indicates the gain or loss of sequence over evolution.
* Aligned - An orthologous sequence for the TSS can be found in the target genome, but there is no FANTOM5-defined TSS at that position in the target genome. These include both the aligned and peakGap categories of projection discussed in the details below.
* Orthologous - An orthologous sequence for the TSS can be found in the target genome and there is a FANTOM5 TSS defined at that position.

Three expression categories have also been considered:

* non-robust - the permissive but not robust FANTOM5 TSS (maxcounts>=3 in any one library).
* robust - TSS classified as robust under FANTOM5 (maxcounts>=11 and maxTPM>=1 in any one library).
* robust-equiv - TSS meeting the robust criteria, but additionally exhibiting maxTPM>=3 in at least one of the CAGE libraries defined by Al as either approximately or exactly equivalent between mouse and human.

With four projection categories and three expression categories, this gives a total of 12 categories.

===========  ============  ================  ============
Projection   Expression    Color-code (RGB)  Color
===========  ============  ================  ============
Orthologous  robust-equiv  2,112,51          Dark green
Orthologous  robust        89,155,109        Green
Orthologous  non-robust    163,211,161       Light green
Aligned      robust-equiv  16,82,156         Dark blue
Aligned      robust        65,146,201        Blue
Aligned      non-robust    168,201,246       Light blue
Gap          robust-equiv  156,52,15         Dark orange
Gap          robust        230,118,54        Orange
Gap          non-robust    253,196,89        Light orange
Unaligned    robust-equiv  80,80,80          Dark grey
Unaligned    robust        130,130,130       Grey
Unaligned    non-robust    180,180,180       Light grey
===========  ============  ================  ============

Alignments
----------

We have used Ensembl's EPO alignments (pmid:18849524, pmid:19033362). They have several advantages over the widely used UCSC nets and chains:

* EPO are true multi-sequence alignments, so gap placement is optimised relative to all aligned species. This is not the case in the stacked pairwise nets-chains.
* EPO uses wider synteny data and phylogenetic trees inferred from local sequence to resolve ambiguities in matching segment placement. For example, non-orthologous processed pseudogenes often align as a top-level chain segment in UCSC nets. These are usually accurately resolved in EPO.
* Unlike nets-chains, there is no master reference sequence in EPO, so a sequence deleted in human can still align between dog and rat.
* As EPO is a true multi-sequence alignment, there is much better circular consistency in coordinate projection. E.g. projecting human->dog->mouse->human is likely (but, because of alignment gaps, not guaranteed) to give you back the original human coordinate, and it will be in the same locality. With nets-chains you could end up on a different chromosome.

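The circular-consistency point made above can be illustrated with a minimal sketch. It assumes each projection step is available as a function mapping a (chromosome, position) pair to another such pair, or to None where nothing aligns. The `shift` helper, the chromosome names and the offsets are hypothetical stand-ins for illustration only; they are not output from the EPO alignments or calls into the Ensembl API.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Locus = Tuple[str, int]  # (chromosome, position)
Projector = Callable[[Locus], Optional[Locus]]


def round_trip(start: Locus, steps: Sequence[Projector]) -> Optional[Locus]:
    """Apply each projection in turn; return None if any step fails to align."""
    locus: Optional[Locus] = start
    for step in steps:
        locus = step(locus)
        if locus is None:
            return None
    return locus


def same_locality(a: Locus, b: Locus, tolerance: int) -> bool:
    """True if both loci are on the same chromosome within `tolerance` nt."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance


def shift(chrom_map: Dict[str, str], offset: int) -> Projector:
    """Hypothetical toy projector: rename the chromosome and add an offset."""
    def project(locus: Locus) -> Optional[Locus]:
        chrom, pos = locus
        if chrom not in chrom_map:
            return None  # nothing aligns for this chromosome
        return (chrom_map[chrom], pos + offset)
    return project


# Toy human -> dog -> mouse -> human chain (made-up coordinates).
human_to_dog = shift({"chr1": "chr5"}, 200)
dog_to_mouse = shift({"chr5": "chr4"}, -50)
mouse_to_human = shift({"chr4": "chr1"}, -148)

start = ("chr1", 1000)
end = round_trip(start, [human_to_dog, dog_to_mouse, mouse_to_human])
consistent = end is not None and same_locality(start, end, tolerance=10)
```

In this toy chain the round trip returns to the starting chromosome a couple of nucleotides from the original position, which is the kind of small displacement an alignment gap could introduce.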
We have used the EPO12 alignments (eutherian mammals: homo_sapiens pan_troglodytes gorilla_gorilla pongo_abelii macaca_mulatta callithrix_jacchus mus_musculus rattus_norvegicus bos_taurus sus_scrofa canis_familiaris equus_caballus) from the Ensembl 67 release. Although all these species can be projected into, only species for which CAGE data has been generated are summarised here and included in the linked files. The Ensembl 67 API version was used for interaction with the data.

Coordinate projection
---------------------

For a human (or mouse) TSS, the boundaries of the TSS were used to define an "alignment slice" of the EPO12 alignments, using the alignSlice functions to resolve overlapping alignment blocks and to orient and order alignment blocks relative to the human (or mouse) genome. The reference position of the TSS (BED thick line) was projected through the alignment slice to obtain a projected reference position.

In cases where the projected reference falls in an alignment gap, the reference is projected onto the nucleotide at the closest edge of the gap, but still within the alignment slice. These are recorded as peakGap alignments.

In cases where the alignment slice is entirely gap, this is recorded and no coordinate projection is made. A GAP alignment indicates that the TSS position has been deleted/inserted during genome evolution. An option to add an arbitrary additional window around the alignment slice is implemented, which could allow the mapping of additional TSS to be recovered, but that option was not used for the data and results presented here.

In cases where the alignment slice cannot be projected into a genome at all, i.e. there is no syntenic interval that aligns across the interval, this is recorded as unaligned. Here we do not have evidence to discriminate the evolutionary gain or loss of sequence from technical difficulties such as alignment or genome assembly problems or the absence of read coverage in the raw genomic sequence.

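The aligned, peakGap, gap and unaligned outcomes described above can be sketched as follows. This is a simplified illustration, not the Ensembl alignSlice-based code used to generate the data. It assumes the alignment slice is supplied as two equal-length gapped strings already oriented to the source genome, and that a missing target row stands for "no alignment at all". The function name `project_position`, the outcome labels and the tie-breaking rule (ties snap to the left edge) are choices made here for illustration.

```python
from typing import Optional, Tuple

GAP = "-"


def project_position(
    src_row: str, tgt_row: Optional[str], src_offset: int
) -> Tuple[str, Optional[int]]:
    """Project the src_offset-th source base (0-based) onto the target row.

    Returns (outcome, target_offset), where target_offset is 0-based within
    the target's ungapped sequence in this slice, or None if not projected.
    """
    if tgt_row is None:
        return ("unaligned", None)  # no alignment covers the slice
    if len(src_row) != len(tgt_row):
        raise ValueError("alignment rows must have equal length")

    # Locate the alignment column holding the requested source base.
    seen = -1
    column = None
    for i, base in enumerate(src_row):
        if base != GAP:
            seen += 1
            if seen == src_offset:
                column = i
                break
    if column is None:
        raise IndexError("src_offset lies outside the alignment slice")

    # Columns in which the target has a real base.
    tgt_columns = [i for i, base in enumerate(tgt_row) if base != GAP]
    if not tgt_columns:
        return ("gap", None)  # slice is entirely gap in the target

    if tgt_row[column] != GAP:
        outcome, hit = "aligned", column
    else:
        # Snap to the nearest target base inside the slice; ties go left.
        hit = min(tgt_columns, key=lambda i: (abs(i - column), i))
        outcome = "peakGap"

    # Convert the alignment column back to an ungapped target offset.
    return (outcome, tgt_columns.index(hit))
```

For example, with source row `ACGTACGT` and target row `AC--ACGT`, source offset 2 falls in the target gap and is snapped to the base on the left edge of the gap, while source offset 3 is snapped to the base on the right edge.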
The outer margins of the TSS interval were mapped into the aligned sequences, requiring that they map into the same chromosomal locus as the projected reference position (+/-80 nt). In cases of genomic rearrangement between species, the projected interval was trimmed down to the boundary of the rearrangement.

Assignment of orthology/equivalence
-----------------------------------

TSS projected from mouse into human were progressively assigned to a human TSS. If the real human and projected mouse TSS intervals overlapped, the human TSS was assigned to the mouse TSS with the closest projected reference position to the real reference position (so a real TSS can only be involved in one mapping, whereas a projected one could be involved in multiple). Where a real TSS has no overlapping projected interval, the distance between the closest intervals is used in the same manner: only one mapping for a real TSS but possibly more for a projected one. An upper limit of 20 nt was applied to the distance between real and projected intervals.

An identical procedure was then performed for TSS projected from human into mouse. The final set of 1:1 orthology/equivalence mappings was obtained by identifying the reciprocal human-into-mouse and mouse-into-human mappings.

It was clear from this orthology/equivalence assignment procedure that the human TSS are more fragmentary than the mouse TSS. This suggests that for cross-species comparison of gene expression measures, some "clusters" of TSS will need to be grouped into merged TSS, particularly in the human data. These groupings can be obtained from the one-to-many relationships (real to projected) in the orthology/equivalence assignment.

Summary of orthology/equivalence assignment