Uploading UCSC repetitive elements track

This case study illustrates how to upload annotations from third-party sources into ZENBU, here the latest hg19 RepeatMasker track available from UCSC. The MySQL table dump from UCSC provides the name and genomic location of each repetitive element, as well as the class and family the repeat belongs to. We will show how to upload to ZENBU either a simple BED-based version of the track (containing only the repeat names and locations) or the more complete MySQL table dump, which carries all the available information (alignment scores, repeat classes and families, ...) and which ZENBU can take advantage of for manipulation and processing.

This example is also used as part of a more comprehensive case study focused on extracting the sub-cellular compartment-specific expression of repetitive elements from the ENCODE K562 cell line analyzed by CAGE.

Track description
The RepeatMasker (rmsk) track was created by using Arian Smit's RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences. The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track), as well as a modified version of the query sequence in which all the annotated repeats have been masked.

RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Data are generated using the RepeatMasker -s flag. UCSC also used the Tandem Repeat Finder (trf) program, masking out repeats of period 12 or less. The repeats are just "soft" masked. Alignments may extend through repeats, but are not permitted to initiate in them.

Track content
This track contains, among others, the following classes of repeats:
 * Short interspersed nuclear elements (SINE), which include ALUs
 * Long interspersed nuclear elements (LINE)
 * Long terminal repeat elements (LTR), which include retroposons
 * DNA repeat elements (DNA)
 * Simple repeats (micro-satellites)
 * Low complexity repeats
 * Satellite repeats
 * RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)
 * Other repeats, which includes class RC (Rolling Circle)
 * Unknown

A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed.

References and credits
Thanks to UCSC for providing the track and to Arian Smit and GIRI for providing the tools and repeat libraries used to generate it.

References
 * Smit, AFA, Hubley, R and Green, P. RepeatMasker Open-3.0. http://www.repeatmasker.org. 1996-2007.
 * RepBase is described in Jurka J. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420.

For a discussion of repeats in mammalian genomes, see:
 * Faulkner GJ et al. The regulated retrotransposon transcriptome of mammalian cells. Nature Genetics 41, 563-571 (2009).
 * Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6): 657-63.
 * Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8.

downloading UCSC rmsk data as BED
BED-formatted UCSC track content can be obtained from the UCSC Table Browser. The RepeatMasker (rmsk) track can be exported as a BED file. As we want the complete genome-wide set of repetitive elements to be loaded into ZENBU, and as ZENBU allows gzip-compressed BED files to be loaded directly, we select:
 * the assembly "Feb.2009 GRCh37/hg19"
 * the group "repeats and variations"
 * the track "RepeatMasker"
 * and finally the table "rmsk"
 * region: "genome"
 * output format: "BED - browser extensible data"
 * output file: we will name the file "UCSC_rmsk.hg19.bed.gz"
 * file type returned: "gzip compressed"



Finally we click "get output", which opens a new page offering to create "one BED record per: Whole Gene" or to extend each entry by a fixed-length segment. The compressed BED file should then be saved locally, ready to be transferred as is into ZENBU.
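Before uploading, it can be worth checking that the download is a valid gzip-compressed BED file. The following is a minimal sketch, not part of ZENBU or UCSC: the function name check_bed is hypothetical, and it only assumes a tab-delimited BED file whose second and third columns are the start and end coordinates.

```shell
# Hypothetical helper: summarize a gzip-compressed BED file before upload.
# Prints the number of records and how many of them have malformed
# coordinates (fewer than 3 columns, non-numeric values, or start >= end).
check_bed() {
    gzip -dc "$1" | awk -F'\t' '
        /^(track|browser|#)/ { next }   # skip optional UCSC header lines
        { n++ }
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad++ }
        END { printf "records=%d malformed=%d\n", n, bad }'
}
```

Calling check_bed UCSC_rmsk.hg19.bed.gz prints a single summary line; a non-zero malformed count suggests the download was truncated or is not in BED format.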

uploading UCSC rmsk downloaded data
In order to load annotation or expression/experiment data into ZENBU, we need to be logged in as a ZENBU user (as uploaded files need to have an owner).



Clicking the "User" tab of the ZENBU interface brings up the "user profile" page if we are already logged into ZENBU, or the log-in page otherwise. The "Data Upload" tab provides the interface for file uploading.



As we have named our file "UCSC_rmsk.hg19.bed.gz", ZENBU automatically recognizes it as a BED-formatted file. The UCSC table dump provides a score column containing the Smith-Waterman alignment score. Since we are only interested in the location of the repetitive elements, this score is not an expression value, and we do not want the data treated as a full-fledged "experiment" (in which case ZENBU provides automatic computation of per-million expression normalization, with or without multimapping correction). We therefore leave both check-boxes, "BED.score column has expression values" and "single-best-mapping expression", unchecked. Note that keeping the BED score as a mere score associated with each entry still enables us to use it (for example, to filter repeats on the basis of their SW alignment score). Once the file is uploaded, the "my data" section shows the uploaded BED file and offers the possibility to share it with collaborators.



Comprehensive OSC table based upload
When the data are retrieved from UCSC in BED format, only the repeat name (repName) is obtained along with the genomic location. Valuable information that ZENBU can use for manipulation and processing, such as the class (repClass) and the family (repFamily) each repeat belongs to, is not retrieved.

In order to get all the information provided within this track into ZENBU, one can alternatively upload the data as an OSCtable, which allows more information to be associated with each repeat (feature) and stored in ZENBU as (feature-associated) metadata.
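To get a feel for what the BED export leaves out, the full table dump (retrieved in the next section) can be tallied per repeat class. The sketch below is not part of ZENBU: the function name count_by_column is hypothetical, and it assumes a gzip-compressed, tab-delimited dump whose first line is a "#"-prefixed header naming the columns (repClass, repFamily, ...).

```shell
# Hypothetical helper: count rows per distinct value of a named column
# in a gzip-compressed, tab-delimited table with a '#'-prefixed header.
# Usage: count_by_column FILE.gz COLUMN_NAME
count_by_column() {
    gzip -dc "$1" | awk -F'\t' -v col="$2" '
        NR == 1 {
            sub(/^#/, "", $1)                      # drop the leading "#"
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { n[$c]++ }
        END { for (k in n) print n[k] "\t" k }' \
    | sort -t "$(printf '\t')" -k1,1nr -k2,2       # most frequent first
}
```

For example, count_by_column UCSC_rmsk.hg19.table.gz repClass lists each repeat class with its number of annotated elements, and the same call with repFamily does so per family.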

downloading UCSC rmsk mysql table dump
A full table dump of UCSC track content can be obtained from the UCSC Table Browser. The full data content can be seen by clicking "describe table schema".



The RepeatMasker (rmsk) track can be exported as a tab-delimited file. As we want the complete, genome-wide RepeatMasker track table dump to be loaded into ZENBU, we select:
 * the assembly "Feb.2009 GRCh37/hg19"
 * the group "repeats and variations"
 * the track "RepeatMasker"
 * and finally the table "rmsk"
 * region: "genome"
 * output format: "all fields from selected table"
 * output file: we will name the file "UCSC_rmsk.hg19.table.gz"
 * file type returned: "gzip compressed"



This will allow us to download locally the complete data available as a tab-delimited text file.
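The column names of the downloaded dump can be inspected from the command line before writing the OSCheader. This is a small sketch under the assumption that the first line of the file is a "#"-prefixed, tab-delimited header; the function name list_columns is hypothetical.

```shell
# Hypothetical helper: print the column names of a gzip-compressed,
# tab-delimited table dump, one per line, numbered from 1.
list_columns() {
    gzip -dc "$1" | head -n 1 \
        | sed -e 's/^#//' \
        | tr '\t' '\n' \
        | awk '{ print NR ": " $0 }'
}
```

Running list_columns UCSC_rmsk.hg19.table.gz should list the same fields as the "describe table schema" page, which makes it easy to spot the columns that need to be renamed.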

creating a custom OSCheader
In order to load the data as an OSCtable we need to prepend an OSCheader to the tab-delimited "UCSC_rmsk.hg19.table.gz" file that we have just retrieved from the UCSC table dump. Generic wrapping of standard formats (BED, GFF, ...) is described in the ZENBU interpretation of OSCtable files wrapping section. In this case, the UCSC table dump does not correspond to any of those generic formats. A quick look at the table schema (see screenshot above) tells us that the simplest OSCheader, defining each column in terms understood by the OSCtable parser, will be:
 * genoName -> chrom
 * genoStart -> start.0base (all start coordinates in the UCSC database are 0-based)
 * genoEnd -> end

If we want the repeat family to be the primary name of the feature in ZENBU, we then modify:
 * repFamily -> name

In addition, we may wish to ignore the columns "bin" and "id", which are internal to UCSC, by adding the prefix "ignore." to those columns:
 * bin -> ignore.bin
 * id -> ignore.id

To do so, edit the file with your favorite text editor and modify the first line (or add a new first line, since lines starting with # will be ignored). Here is an overview of the first lines of the modified file.



For the UNIX savvy, this can easily be done with the following simple commands:

  # build a gzip-compressed OSCheader from the first line of the table dump
  zcat UCSC_rmsk.hg19.table.gz | head -n 1 \
    | sed -e 's/#//' \
    | sed -e 's/bin/ignore.bin/' \
    | sed -e 's/genoName/chrom/' \
    | sed -e 's/genoStart/start.0base/' \
    | sed -e 's/genoEnd/end/' \
    | sed -e 's/repFamily/name/' \
    | sed -e 's/id/ignore.id/' \
    | gzip -c > UCSC_rmsk.hg19.table.oscheader.gz

  # prepend the header to the original dump and recompress the result
  zcat UCSC_rmsk.hg19.table.oscheader.gz UCSC_rmsk.hg19.table.gz \
    | gzip -c > UCSC_rmsk.hg19.table.osc.gz
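The sed substitutions above match substrings, so they depend on the exact spelling and order of the header (for instance, "id" must not appear inside an earlier column name). As an alternative sketch, the renaming can be done on whole column names with awk. The function name make_osc is hypothetical, and the approach assumes the first line of the dump is the "#"-prefixed header; unlike the commands above, it replaces that header line instead of keeping it as an ignored comment.

```shell
# Hypothetical helper: turn a UCSC rmsk table dump into an OSCtable by
# renaming whole header fields, then recompress the result.
# Usage: make_osc INPUT.gz OUTPUT.gz
make_osc() {
    gzip -dc "$1" | awk '
        BEGIN {
            FS = OFS = "\t"
            map["bin"] = "ignore.bin";   map["id"] = "ignore.id"
            map["genoName"] = "chrom";   map["genoStart"] = "start.0base"
            map["genoEnd"] = "end";      map["repFamily"] = "name"
        }
        NR == 1 {
            sub(/^#/, "", $1)            # drop the comment marker
            for (i = 1; i <= NF; i++) if ($i in map) $i = map[$i]
        }
        { print }' | gzip -c > "$2"
}
```

For example, make_osc UCSC_rmsk.hg19.table.gz UCSC_rmsk.hg19.table.osc.gz writes the file used in the upload step; columns absent from the mapping (swScore, repName, repClass, ...) keep their original names.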

uploading UCSC rmsk OSCtable
In order to load annotation or expression/experiment data into ZENBU, we need to be logged in as a ZENBU user (as uploaded files need to have an owner).



Clicking the "User" tab of the ZENBU interface brings up the "user profile" page if we are already logged into ZENBU, or the log-in page otherwise. The "Data Upload" tab provides the interface for file uploading.



As we have named our file "UCSC_rmsk.hg19.table.osc.gz", ZENBU automatically recognizes it as an OSCtable-formatted file. This data does not contain relevant expression information; we therefore leave the expression-related check-boxes, such as "single-best-mapping expression", unchecked. Once the file is uploaded, the "my data" section shows the uploaded OSCtable file and offers the possibility to share it with collaborators.