Clustering CAGE along gene models

In this case study we illustrate how to visualize gene-level expression from CAGE signal (green background) by collating TSS signal (yellow background) onto a gene-specific (blue background) proximal promoter, defined as the 1kb region centered on the gene start position. The example below uses the FANTOM4 CAGE data and the defined expression cluster boundaries. http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=

Creating a Track with the expression datastream to be clustered
Using the Expression/Experiment tab in the Data Explorer Interface, we select all FANTOM4 (search keyword) CAGE experiments (further filtered by platform). We will name this track "FANTOM4 CAGE" and render it as a wiggle plot. http://fantom.gsc.riken.jp/zenbu/dex/#section=Experiments;search=CAGE%20FANTOM4



Creating a Track with the gene models against which CAGE data will be collated
Using the Expression/Experiment tab in the Data Explorer Interface, we select RefSeq genes. We will name this track "transcriptome models" and render it as a set of transcripts. http://fantom.gsc.riken.jp/zenbu/dex/#section=Experiments;search=RefSeq



Creating a mask Track with the FANTOM4 CAGE promoters L2 dataset
The FANTOM4 CAGE promoters L2 dataset results from the clustering of Transcriptional Start Sites whose expression patterns are similar throughout the studied response of THP1 cells to PMA stimulation. We will see how to use this set as a mask prior to collating CAGE-based expression levels on our transcript set. From the Annotation tab, we also select the FANTOM4 CAGE promoters L2 dataset to construct an additional track. http://fantom.gsc.riken.jp/zenbu/dex/#section=Annotation;search=%20FANTOM4



Creating the View from the set of selected/created Tracks
Pressing the View button in the shopping cart panel creates the view from the selected tracks and opens it in gLyphs.



Modifying the track to collate expression along FANTOM4 L2 clusters boundaries
The script "FANTOM4 (hg18) CAGE promoter expression collation" makes it possible to collate an expression stream along the boundaries defined as valid CAGE promoter regions by the FANTOM4 consortium (The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet. 2009 May;41(5):553-62). Users interested in applying the same approach to their own predefined set of regions can write their own TemplateCluster-based expression collation script.

Using the predefined "FANTOM4 (hg18) CAGE promoter expression collation" script
As a first step, let's duplicate the track (this is not strictly necessary, but it will help illustrate the outcome of the expression collation). On the duplicated track, clicking on the "configure track" (grey gear) icon opens up the track reconfigure panel. The section "Stream Processing script" describes how the CAGE data stream is currently processed as a wiggle plot. http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=9fOol9rSBRjlHDB0jdcDkB;loc=hg18::chr19:54853066..54862509

We change it so that the data is processed by collating CAGE-based expression levels along the FANTOM4 (hg18) CAGE promoter regions. A script performing this operation is already stored in ZENBU and can be recalled by selecting "predefined script" in the drop-down menu. The "predefined script" selection opens up a search panel which queries ZENBU for existing scripts matching the search criteria.



We select the processing script called "FANTOM4 (hg18) CAGE promoter expression collation". To reflect the different processing, we also modify the current title of the track and provide a quick description of the new content. Note that the processing steps of this script can be inspected by editing it (see below).

http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=zW_uLdNRF5FqL0Q5V3zYNC;loc=hg18::chr19:54853066..54862509

Zooming in on the region: http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=zW_uLdNRF5FqL0Q5V3zYNC;loc=hg18::chr19:54860464..54861262

Clicking on one of the clusters provides its expression level across all the streamed-in experiments.

Writing your own cluster/region based expression stream collation script
We provide an already written script which allows the exploration of any expression data stream arising from reads mapped onto hg18. http://fantom.gsc.riken.jp/zenbu/dex/#section=Scripts;search=FANTOM4%20CAGE%20promoter%20expression%20collation

The script, further detailed in the next paragraph of this page, is organized around the following commented steps:
 * track default visualization parameters: thick-arrow with color-coded total expression
 * Lists RefSeq transcript sources for all the main assemblies loaded in ZENBU
 * Get the Transcriptional Start Sites (TSS) revealed by the 5' extremity of CAGE derived reads (shrink_5prime)
 * Collate the CAGE TSS along regions defined by FANTOM4 L2 clusters
 * Sum up the expression over all samples and save the value as the refseq score to color it accordingly (sum)

Let's describe the construction of the "FANTOM4 (hg18) CAGE promoter expression collation" script, so that you can easily write your own cluster/region based expression stream collation script. The script uses the following processing modules:
 * The Proxy: a special placeholder processing module designed to work in coordination with the <datastream> section of the ZENBU scripting system. Each <datastream> has a name attribute and a pool of data sources with tag <source>. Each data source is defined by its ZENBU system id. The other attributes of each <source> are ignored, but can be helpful for script writers as comments.
 * The ResizeFeatures processing module: designed to work on Features to alter their genomic coordinates. The module will re-sort features on the data stream as needed to preserve the stream integrity. The typical use case illustrated here shrinks each feature to its 5' end and makes it 1bp wide (Transcriptional Start Sites revealed by the 5' extremity of CAGE reads).
 * The TemplateCluster processing module: takes a stream of template features on a side stream (defined here by the Proxy) and performs overlap comparison against features on the primary data stream (here the CAGE data stream). When an overlap occurs, the expression of the primary-stream feature is collated onto the overlapping template feature.
 * The CalcFeatureSignificance processing module: designed to sit in the middle of a processing stream and transform the multiple Experiment / Expression data of a Feature into a single significance for that Feature.

 * The first part of the ZENBU script defines the default visualization parameters that will be used to render the processed track:

<!-- track default visualization parameters : thick-arrow with color-coded total expression -->
<track_defaults source_outmode="skip_metadata" scorecolor="fire1" backColor="" hidezeroexps="true" glyphStyle="arrow"/>

The default rendering will be thick arrows, color-coded using the "fire1" scale (grey -> yellow -> orange -> red). Empty expression will automatically be hidden, and the metadata associated with the regions onto which the expression stream will be collated will not be reported.

 * The second part of the script defines the CAGE_L2_promoter regions onto which the expression stream will be collated:

<!-- Lists RefSeq transcript sources for all the main assemblies loaded in zenbu -->
<datastream name="cluster" output="simple_feature">
<source id="72DA22E8-B95F-48B8-B7E3-3698E820E331::48:::FeatureSource" category="L2_promoter" name="CAGE_L2_promoter_april2008"/>
</datastream>

The data source id(s) referencing the regions onto which the expression stream will be collated must be defined here. This datastream is used further down, in the processing directives, through a Proxy named "cluster". The source id of a data source can be obtained through DEX: http://fantom.gsc.riken.jp/zenbu/dex/#section=Annotation;search=FANTOM4%20CAGE

 * The third part of the script defines the sequential processing of the data:

<stream_processing> ... </stream_processing>

First, the stream is modified so that only the 5' end of each streamed feature (i.e. the TSS associated with a CAGE read) is considered: the ResizeFeatures module "resizes" the data to extract only the 5' extremity of the reads, using the shrink_5prime setting.

<!-- Get the Transcriptional Start Sites (TSS) revealed by the 5' extremity of CAGE derived reads -->
<spstream module="ResizeFeatures"> ... shrink_5prime ... </spstream>

Second, this 5'-end resized stream is intersected with the side stream defined by the datastream above. To do so, we employ the TemplateCluster module coupled with the Proxy module, which refers to the datastream by its name.

<!-- Collate the CAGE TSS along regions defined by FANTOM4 L2 clusters -->
<spstream module="TemplateCluster">
<ignore_strand value="false"/>
<side_stream>
<spstream module="Proxy" name="cluster"/>
</side_stream>
</spstream>

Finally, we modify the score of each cluster to be the sum of the expression over all experiments in the input stream (CalcFeatureSignificance). This score is then used to color code the relative intensity of the signal.

<!-- Sum up the expression over all samples and save the value as the refseq score to color it accordingly -->
<spstream module="CalcFeatureSignificance">
<expression_mode>sum</expression_mode>
</spstream>
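
Putting the three parts together, the script might be assembled as sketched below. This is a reconstruction from the fragments quoted on this page, not a copy of the script stored in ZENBU. Two names are assumptions, since neither appears in the fragments above: the <zenbu_script> root element, and the <mode> element wrapping the shrink_5prime value. Both should be checked against the stored script (which can be opened through DEX) before use.

```xml
<!-- Sketch only. Assumption: the <zenbu_script> root wrapper is not quoted on this page. -->
<zenbu_script>

  <!-- Part 1: track default visualization parameters
       (thick-arrow with color-coded total expression) -->
  <track_defaults source_outmode="skip_metadata" scorecolor="fire1" backColor="" hidezeroexps="true" glyphStyle="arrow"/>

  <!-- Part 2: regions onto which the expression stream is collated
       (FANTOM4 CAGE L2 promoters), exposed under the name "cluster" -->
  <datastream name="cluster" output="simple_feature">
    <source id="72DA22E8-B95F-48B8-B7E3-3698E820E331::48:::FeatureSource" category="L2_promoter" name="CAGE_L2_promoter_april2008"/>
  </datastream>

  <!-- Part 3: sequential processing of the primary (CAGE) data stream -->
  <stream_processing>

    <!-- Get the TSS revealed by the 5' extremity of CAGE derived reads -->
    <spstream module="ResizeFeatures">
      <!-- Assumption: the element holding this value is not shown on this page;
           <mode> is a guess, only the value shrink_5prime comes from the page. -->
      <mode>shrink_5prime</mode>
    </spstream>

    <!-- Collate the CAGE TSS along regions defined by FANTOM4 L2 clusters -->
    <spstream module="TemplateCluster">
      <ignore_strand value="false"/>
      <side_stream>
        <!-- The Proxy name must match the datastream name defined in part 2 -->
        <spstream module="Proxy" name="cluster"/>
      </side_stream>
    </spstream>

    <!-- Sum up the expression over all samples and use it as the feature score -->
    <spstream module="CalcFeatureSignificance">
      <expression_mode>sum</expression_mode>
    </spstream>

  </stream_processing>
</zenbu_script>
```

To collate along your own set of regions, the same skeleton should apply: replace the <source> id with the id of your own feature source (obtained through DEX), and keep the datastream name and the Proxy name identical so that the side stream resolves to it.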