RNAseq

In this case study we will illustrate how to upload RNAseq data into ZENBU.
 * We will start off with raw read data obtained from the Short Read Archive.
 * We will map those reads onto the genome using TopHat, taking advantage of ZENBU to easily gather existing exon junctions from a large collection of sources, which TopHat will use to recover the coverage of exon-exon junctions by split-mapped reads.
 * We will then present a few of the ways ZENBU can render those mapped raw reads, readily giving insight into the data, among which how to:
   * visualize exonic signal
   * visualize intron usage
   * visualize and retrieve transcript-level expression
 * Finally, we will show how transcript-level expression can be exported from ZENBU to be further analyzed for differential expression.

Study description (summary)
The study pursued dual goals: to advance mRNA-seq bioinformatics towards unbiased transcriptome capture, and to demonstrate its potential for discovery in neuroscience by applying the approach to an in vivo model of neurological disease. The authors found that 12.4% of known genes were induced and 7% were suppressed in the dysfunctional (but anatomically intact) L4 dorsal root ganglion (DRG) 2 weeks after L5 spinal nerve ligation (SNL). A new algorithm for agnostic mapping of pre-mRNA splice junctions (SJ) achieved a precision of 97%. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20895

Overall design

 * mRNA-seq of L4 DRG 2 weeks and 2 months after L5 spinal nerve ligation.
 * CONTROL and SNL were used to identify differential gene expression between chronic pain and standard conditions in Rattus norvegicus.
 * CONTROL, SNL and PILOT were used to perform 'agnostic splice site discovery' in the nervous system transcriptome in Rattus norvegicus.

References and credits
Contributors: Hammer P, Banck MS, Amberg R, Wang C, Petznick G, Luo S, Khrebtukova I, Schroth GP, Beyerlein P, Beutler AS
Citation: Hammer P, Banck MS, Amberg R, Wang C et al. mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome Res 2010 Jun;20(6):847-60. PMID: 20452967

Aspera raw reads and experimental metadata download
We downloaded the reads for the entire study in SRA format using the Aspera plugin for fast download: http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP002416 Experimental metadata related to each sequencing run was retrieved from the associated Gene Expression Omnibus entries.
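As a sketch of how a per-run download is assembled, the path layout used at the time by NCBI's "sra-instant" FTP/Aspera mirror can be built from the run accession. Both the layout and the run accession below are assumptions made for illustration, not taken from the study page:

```shell
# Build the per-run path of NCBI's "sra-instant" mirror (layout assumed);
# the run accession is a hypothetical member of SRP002416.
run=SRR057000
p3=$(printf '%s' "$run" | cut -c1-3)    # SRR
p6=$(printf '%s' "$run" | cut -c1-6)    # SRR057
path="sra/sra-instant/reads/ByRun/sra/$p3/$p6/$run/$run.sra"
echo "$path"
# fetch it with the Aspera client, e.g. (key location is an assumption):
#   ascp -i "$ASPERA_KEY" -k1 -T anonftp@ftp.ncbi.nlm.nih.gov:/$path .
```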

Retrieving known exon junctions from zenbu
We will map the raw reads from SRP002416 / mRNA-seq with Agnostic Splice Site Discovery for CNS Transcriptomics Tested in Chronic Pain using TopHat.

TopHat -j/--raw-juncs <.juncs file> option
Since the goal here is not to thoroughly optimize the mapping procedure but to demonstrate how ZENBU can be used in that context, mapping with TopHat will be done using the default settings. The sole departure from this is to take advantage of known exon-exon junctions from the complete collection of GenBank mRNA, Ensembl, RefSeq and UCSC knownGene curated transcript models. Supplying TopHat with a list of raw junctions is achieved by invoking the -j/--raw-juncs <.juncs file> option. Junctions in the .juncs file are specified one per line, in a tab-delimited format:

    <chrom> <left> <right> <+/->

left and right are zero-based coordinates, and specify the last character of the left sequence to be spliced to the first character of the right sequence, inclusive. That is, the last and the first positions of the flanking exons. ZENBU provides a very easy way to extract introns via the CalcInterSubfeatures basic processing module, which can be combined with the ResizeFeatures basic processing module to obtain exactly these last and first positions of the flanking exons.
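As a worked example with made-up coordinates: an intron occupying 1-based positions 1001..1200 of chr1 on the '+' strand has its flanking exon bases at 1000 (left) and 1201 (right), i.e. 999 and 1200 once converted to zero-based coordinates, which is exactly what the .juncs line must contain:

```shell
# .juncs line for a hypothetical intron at 1-based chr1:1001-1200 (+):
#   left  = zero-based last base of the upstream exon    = 1001 - 2 = 999
#   right = zero-based first base of the downstream exon = 1200
istart=1001
iend=1200
junc=$(printf 'chr1\t%d\t%d\t+' "$((istart - 2))" "$iend")
echo "$junc"
```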

Getting the complete collection of GenBank mRNA, Ensembl, RefSeq and UCSC knownGene curated transcript models
We first use DEX to retrieve the various transcriptome models loaded in ZENBU, and build a single track containing all GenBank mRNA, Ensembl, RefSeq and UCSC knownGene curated transcript models. After hitting "visualize", we get to the glyph interface displaying a single track that we termed "rat transcriptome".

Processing the transcript models collection to extract junctions defined as the last and the first positions of the flanking exons
In order to process this track and extract junctions, we first duplicate the track (yellow square icon) and process the copy (grey gear icon). We edit the processing script from scratch to combine CalcInterSubfeatures (a processing module designed to work on features with subfeatures, such as split-mapped reads; it fills in the gaps between subfeatures with new subfeatures, whose category can be named and which can be filtered in follow-up calculations) and ResizeFeatures (designed to alter the genomic coordinates of features). Here we expand each intron produced by the preceding step by 1bp at its start and at its end. Together they form a script that extracts introns and extends them by 1bp on either side, in order to obtain the junction file to be fed to TopHat. Content of the "junction processing script":

    <zenbu_script>
      <stream_queue>
        <spstream module="CalcInterSubfeatures">
          <category>intron</category>
          <transfer_expression>true</transfer_expression>
          <ignore_strand>false</ignore_strand>
        </spstream>
        <spstream module="ResizeFeatures">
          <expand_start>1</expand_start>
          <expand_end>1</expand_end>
        </spstream>
      </stream_queue>
    </zenbu_script>

Downloading the junctions for TopHat
We then proceed to download the entire collection of junctions in BED6 format, columns 1, 2, 3 and 6 of which are the exact content/format of the junction file required by TopHat.
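The column extraction above can be sketched with awk; the two junction records, the file names and the TopHat index name below are all hypothetical:

```shell
# Keep columns 1, 2, 3 and 6 of the BED6 export to obtain the .juncs file
# (the two junction records are made up).
printf 'chr1\t999\t1200\tjunc1\t0\t+\n'  >  junctions.bed6
printf 'chr2\t4999\t6200\tjunc2\t0\t-\n' >> junctions.bed6
awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $6}' junctions.bed6 > known.juncs
cat known.juncs
# each sample can then be mapped with default settings plus the junctions,
# e.g. (bowtie index and read file names hypothetical):
#   tophat -j known.juncs -o SRRxxxxxx_tophat rn4_index SRRxxxxxx.fastq
```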

Uploading BAM-formatted TopHat mapped reads into ZENBU
Logging in to ZENBU allows users to upload annotation or expression files. Supported formats are BAM/SAM, BED and OSCtable. For BAM/SAM files, each mapped read counts as 1 expressed tag. ZENBU automates the computation of tag-per-million normalization, taking into account only tags whose mapping quality (MAPQ) is 20 or more.
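The tag counting behind that normalization can be sketched as follows: each mapped read is one tag, and only reads with MAPQ >= 20 (SAM column 5) enter the library size. The three SAM records below are made up; for a BAM file, the stream would first be piped through `samtools view`:

```shell
# Count tags with MAPQ >= 20 to obtain the library size used for
# tag-per-million normalization (SAM records are made up).
printf 'r1\t0\tchr1\t100\t30\t36M\t*\t0\t0\t*\t*\n' >  sample.sam
printf 'r2\t0\tchr1\t200\t19\t36M\t*\t0\t0\t*\t*\n' >> sample.sam
printf 'r3\t16\tchr2\t300\t42\t36M\t*\t0\t0\t*\t*\n' >> sample.sam
libsize=$(awk '!/^@/ && $5 >= 20' sample.sam | wc -l)
echo "$libsize"   # a region with n qualifying tags scores n * 1e6 / libsize
```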

The description box allows metadata to be added for easy retrieval and querying of the data.



 User : data upload panel, listing the data you have uploaded and allowing you to organize it into collaborations



DEX data selection
We can select the uploaded experiments through DEX. At first, let us just select the default visualization, which uses the entire span of each mapped read to draw a histogram-type plot of all 24 samples together. Selection of the experiments is done via the DEX expression/experiments tab. In addition we have added the Entrez gene and RefSeq transcript tracks (selected from the DEX annotation sets).

Accounting for the strandless nature of the RNAseq data
As is obvious from this representation of the signal, the RNAseq libraries were generated with a non-stranded protocol. We can re-specify the rendering of the data so that it does not take into account the (irrelevant) strand information.

Visualizing read (rather than read span) signal
The default visualization which we chose earlier uses the entire span of the read mapping to draw the histogram. Thus the gaps of split-mapped reads are counted in the computation of the local expression displayed by the histogram. To convince ourselves, we can display the actual reads rather than transforming them into a histogram.

Visualizing raw reads
This can be done very easily by modifying the rendering using the small grey gear icon in the upper right corner of the track. We first duplicate the track and then modify its processing/rendering by 1) turning off any processing and 2) rendering the data as thin-transcript (so as to visualize split-mapped reads).

Condensing redundancy
We note that the number of stacked-up reads makes this view relatively unappealing. By grouping identical reads (same mapping location) and color-coding the redundancy, we can improve the visualization. To this end we take advantage of the [UniqueFeature] module, which combines features matching an identity criterion, here sharing the same genomic location (chrom/start/end); in this particular usage of the module we ignore the irrelevant strand, since the RNAseq protocol is un-stranded. The processing script to be input in the custom script processing box is as follows; in addition we turn on the "color by score" check box.

    <zenbu_script>
      <stream_queue>
        <spstream module="UniqueFeature">
          <ignore_strand>true</ignore_strand>
        </spstream>
      </stream_queue>
    </zenbu_script>

Showing the most prominent / redundant reads
The height of the track is conveniently reduced, yet on highly expressed loci or large regions this may not be sufficient to provide a synthetic enough overview. The [TopHits] atomic processing unit allows one to display only the top "n" highest-scoring features (by first calling the [UniqueFeature] module, the number of reads clustered at a given location is aggregated into its score).

    <zenbu_script>
      <stream_queue>
        <spstream module="UniqueFeature">
          <ignore_strand>true</ignore_strand>
        </spstream>
        <spstream module="TopHits">
          <queue_length>10</queue_length>
        </spstream>
      </stream_queue>
    </zenbu_script>
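Outside ZENBU, the combined effect of UniqueFeature and TopHits can be sketched with coreutils: collapse reads sharing chrom/start/end (strand ignored), count the multiplicity, and keep the most redundant locations. The six BED records below are made up:

```shell
# Coreutils analogue of UniqueFeature followed by TopHits (records made up):
# collapse by (chrom,start,end), count duplicates, keep the top 2 locations.
printf 'chr1\t100\t136\tr%d\t0\t+\n' 1 2 3 >  reads.bed
printf 'chr1\t500\t536\tr%d\t0\t+\n' 4 5   >> reads.bed
printf 'chr1\t900\t936\tr6\t0\t+\n'        >> reads.bed
cut -f1-3 reads.bed | sort | uniq -c | sort -rn | head -n 2
```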

Read signal (rather than span) histogram
To accurately display "exonic" reads (i.e. no expression drawn for the gap between the parts of a split-mapped read), we need to send the block/exonic parts of the reads into the histogram binning. To do so, we call on the [StreamSubfeatures] processing module, designed to work on features with subfeatures: it exports the subfeatures onto the primary data stream, the subfeatures can be selected based on their category, and there are additional parameter options related to expression and uniquing. This simple processing step still needs to be associated with a histogram binning guide, which is brought up by combining [FeatureEmitter], creating regular grids of features dynamically, and [TemplateCluster], clustering and collating signal onto this grid.

    <zenbu_script>
      <track_defaults source_outmode="skip_metadata" hidezeroexps="false" glyphStyle="express" strandless="true" scorecolor="" height="50"/>
      <stream_stack>
        <spstream module="TemplateCluster">
          <overlap_mode>height</overlap_mode>
          <expression_mode>sum</expression_mode>
          <overlap_subfeatures>true</overlap_subfeatures>
          <ignore_strand>true</ignore_strand>
          <side_stream>
            <spstream module="FeatureEmitter">
              <num_per_region>970</num_per_region>
              <fixed_grid>true</fixed_grid>
              <both_strands>false</both_strands>
            </spstream>
          </side_stream>
        </spstream>
      </stream_stack>
    </zenbu_script>
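The FeatureEmitter/TemplateCluster combination amounts to summing block scores into a fixed grid of bins, which can be sketched in pure awk. The input columns (start, end as 0-based half-open, score), the bin width and the two blocks are all made up:

```shell
# Sum block scores into fixed w-bp bins, the idea behind the fixed-grid
# histogram binning (input columns: start, end [0-based half-open], score).
printf '50\t150\t2\n120\t180\t3\n' > blocks.txt
bins=$(awk -v w=100 'BEGIN{OFS="\t"}
  {for (b = int($1 / w); b <= int(($2 - 1) / w); b++) sum[b] += $3}
  END{for (b in sum) print b * w, sum[b]}' blocks.txt | sort -n)
echo "$bins"
```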



Contrasting all 24 experiments with the heatmap view
The previous representation of the data can be enhanced by displaying the content of the 24 experiments as a heatmap, each row representing the expression level in a particular condition. The heatmap can be generated using the following processing script.

    <zenbu_script>
      <track_defaults source_outmode="skip_metadata"/>
      <stream_stack>
        <spstream module="TemplateCluster">
          <overlap_mode>height</overlap_mode>
          <expression_mode>sum</expression_mode>
          <overlap_subfeatures>true</overlap_subfeatures>
          <ignore_strand>true</ignore_strand>
          <side_stream>
            <spstream module="FeatureEmitter">
              <num_per_region>970</num_per_region>
              <fixed_grid>true</fixed_grid>
              <both_strands>false</both_strands>
            </spstream>
          </side_stream>
        </spstream>
      </stream_stack>
    </zenbu_script>

Note that the ordering of the experiments is linked to their ordering in the glyph's bottom expression summary panel. Hence choosing to order the experiments by expression level rather than alphabetically will re-order the rows of the heatmap dynamically.

Visualizing intron usage
We have seen previously that introns could be extracted from full transcript models. The same is true of split-mapped reads, whose gaps indicate the presence of a splice junction. By counting the number of such gap-inducing split-mapped reads, we can infer intron usage in each of the 24 RNA-seq samples. The processing script required for this has been saved as a script called "intron support", which we can easily call from the glyph interface. For reference, the script is written as follows.

    <zenbu_script>
      <track_defaults source_outmode="skip_metadata"/>
      <stream_queue>
        <spstream module="CalcInterSubfeatures"/>
        <spstream module="StreamSubfeatures">
          <category_filter>intron</category_filter>
          <transfer_expression>true</transfer_expression>
          <ignore_strand>true</ignore_strand>
        </spstream>
        <spstream module="TopHits">
          <queue_length>200</queue_length>
        </spstream>
      </stream_queue>
    </zenbu_script>

The TopHits step at the bottom ensures that only the top 200 most used introns are displayed. Let's zoom out a little bit to get a better overview.
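The gap extraction that drives this track can be sketched directly on a SAM stream: walk each CIGAR string, report the reference span of every N (skip) operation, then count identical gaps. The SAM records below are made up; r1 and r2 support the same 1-based intron chr1:110-209:

```shell
# Extract intron spans from the N operations of split-mapped reads, then
# count identical gaps (SAM records are made up).
printf 'r1\t0\tchr1\t100\t50\t10M100N10M\t*\t0\t0\t*\t*\n' >  split.sam
printf 'r2\t0\tchr1\t95\t50\t15M100N20M\t*\t0\t0\t*\t*\n'  >> split.sam
printf 'r3\t0\tchr1\t400\t50\t20M\t*\t0\t0\t*\t*\n'        >> split.sam
support=$(awk '!/^@/ && $6 ~ /N/ {
    pos = $4; cig = $6
    while (match(cig, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(cig, 1, RLENGTH - 1) + 0
      op  = substr(cig, RLENGTH, 1)
      if (op == "N") print $3 "\t" pos "\t" pos + len - 1   # 1-based span
      if (op ~ /[MDN=X]/) pos += len                        # consumes reference
      cig = substr(cig, RLENGTH + 1)
    }
  }' split.sam | sort | uniq -c | sort -rn)
echo "$support"
```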

Clustering the expression onto transcript models
To cluster the read expression onto transcript models (here the UCSC rn4 RefSeq genes) and normalize it as RPKM, the following processing script is used.

    <zenbu_script>
      <datastream name="transcript">
        <source id="0583D02E-BA10-11DE-B45C-8D369A8382FD::62:::FeatureSource" name="UCSC_rn4_refgene"/>
      </datastream>
      <stream_queue>
        <spstream module="TemplateCluster">
          <overlap_mode>height</overlap_mode>
          <skip_empty_templates>false</skip_empty_templates>
          <expression_mode>sum</expression_mode>
          <overlap_subfeatures>true</overlap_subfeatures>
          <ignore_strand>true</ignore_strand>
          <side_stream>
            <spstream module="EEDB::SPStream::Proxy" name="transcript"/>
          </side_stream>
        </spstream>
        <spstream module="NormalizeRPKM"/>
        <spstream module="CalcFeatureSignificance"/>
      </stream_queue>
    </zenbu_script>
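The NormalizeRPKM step computes reads per kilobase of transcript per million mapped reads; as arithmetic it can be sketched as below (all three input numbers are made up):

```shell
# RPKM sketch: c reads on a transcript of l bp, with n mapped reads total.
rpkm() { awk -v c="$1" -v n="$2" -v l="$3" 'BEGIN{ printf "%.4f\n", c * 1e9 / (n * l) }'; }
rpkm 500 20000000 2000    # 500 reads, 20 M library, 2 kb transcript
# -> 12.5000
```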