Data Stream Processing

= Sorted Data Stream =

The central concept of any track in the ZENBU system is that all data comes through the system as a single stream. This single data stream is often the result of pooling multiple data sources together.

This central data stream concept means that any object of the Data Model can be passed on this stream. This gives the processing and visualization systems a great deal of flexibility since all information can be made available on the data stream.

For genomic Features, every data stream in the system preserves a region-location sort order. When multiple sources are merged together in a Pool, the Features are merge-sorted so that this sort order is preserved, and when Features are processed by the different processing modules the sort order is likewise preserved. Because every data stream is required to follow this sort order, it becomes very easy to write signal-processing modules which efficiently take advantage of it. This means that many processing operations can be performed without buffering data or requiring massive amounts of memory. This is one of the key features of the ZENBU system which allows it to work with terabytes of data yet still run on modest hardware.

The genomic location sort order for Features appearing on the stream is as follows (location takes priority over strand):
 * chrom_start
 * chrom_end
 * strand
One advantage of this sort order is that it becomes very easy to flip between stranded and strandless analysis without requiring buffering or resorting.
= Scripting =

ZENBU data processing scripts are written in an XML description language. The basic form of the script starts with an outer XML tag structure of <zenbu_script> ... </zenbu_script>.
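As a minimal sketch of this outer structure (based on ZENBU's published script examples; the datastream name and module names here are placeholders, not real configuration):

```xml
<zenbu_script>
  <!-- optional named side-stream pools -->
  <datastream name="my_side_pool"> ... </datastream>

  <!-- the chain of processing modules applied to the primary stream -->
  <stream_processing>
    <spstream module="SomeModule1"/>
    <spstream module="SomeModule2"/>
  </stream_processing>

  <!-- optional default track-configuration state -->
  <track_defaults glyphStyle="express"/>
</zenbu_script>
```

The sections within this structure are described below.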



Within that structure there are several sections:
 * <datastream> : allows specification of alternate "virtual Data Source pools" for use in coordination with Proxy modules. Each different "datastream" gets its own tag section.
 * <stream_processing> : defines the streaming chain of processing modules which are injected between the DataSources of the track and the Visualization. Data processing happens in a signal-processing style by daisy-chaining multiple processing modules together. Some processing modules operate by combining data from multiple data streams through the use of a side-stream specification inside the module configuration. For example, the data on the primary stream might first be processed by TemplateCluster against a side stream of gencode data sources, which collates the expression into the gencode annotation features; followed by NormalizeRPKM, which normalizes the expression; and then by CalcFeatureSignificance, which recalculates the combined expression of all experiments into the significance of each Feature on the stream.
 * <track_defaults> : defines default options in the track configuration panel when used with "sharing a predefined script". When a predefined script with a track_defaults section is loaded into a track, those parameters of the "Track configuration panel" are toggled into this new default state. This makes it easy for script writers to define a package of both processing and visualization as a "saved predefined script". Only one <track_defaults> tag is allowed inside a script. Attribute options are:
 * source_outmode : sets the "feature mode" in the Data Source section
 * datatype : sets the "data source type" in the Data Source section
 * glyphStyle : sets the visualization style
 * scorecolor : sets the score_color by name
 * backColor : sets the background color
 * hidezeroexps : sets the state of the hide zero experiments checkbox
 * exptype : sets the display datatype
 * height : sets the "track pixel height" for express style tracks
 * expscaling : sets the "express scale" option for express style tracks
 * strandless : sets the "strandless" option for express style tracks
 * logscale : sets the "log scale" option for express style tracks
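Putting the sections together, the gencode example described above might be written as follows. This is a hedged sketch: the side-stream reference syntax inside TemplateCluster and all parameter details are assumptions for illustration, not verified ZENBU syntax.

```xml
<zenbu_script>
  <!-- side stream pooling gencode annotation sources (source IDs omitted) -->
  <datastream name="gencode"> ... </datastream>

  <stream_processing>
    <!-- collate primary-stream expression onto the gencode template features -->
    <spstream module="TemplateCluster">
      <side_stream>gencode</side_stream>  <!-- hypothetical reference syntax -->
    </spstream>
    <!-- normalize the collated expression -->
    <spstream module="NormalizeRPKM"/>
    <!-- recalculate combined expression into per-Feature significance -->
    <spstream module="CalcFeatureSignificance"/>
  </stream_processing>

  <track_defaults glyphStyle="express" strandless="true" logscale="true"/>
</zenbu_script>
```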

After a script has been created and is working as desired, it can be saved and shared with other users through the save script button in the track reconfigure panel.

Please check out each of the processing modules below. Every module's wiki page includes an example script showing how that module can be used and illustrating the structure of the scripting XML language. Many module pages also contain a hyperlink to an active ZENBU view page as a live example of the script in action.

= Processing modules =

Processing is accomplished by chaining a series of processing modules (or plugins) together between the pooled data source and the visualization / data download output. In addition, some modules can side-chain additional data streams into the main signal-processing data stream. Side chains can themselves be simple or complex chains of processing modules.

The processing modules can be broken down into several conceptual categories:

Infrastructure modules
These modules provide access to additional data sources for use on side-streams.
 * Proxy: Provide security-checked access to data sources loaded into ZENBU.
 * FeatureEmitter: Create regular grids of features dynamically.
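As a hedged sketch of how an infrastructure module might populate a named side-stream (the placement of the module inside a datastream section and the grid_size parameter are assumptions for illustration only):

```xml
<datastream name="bins">  <!-- hypothetical side-stream name -->
  <!-- emit a regular grid of features, e.g. to serve as a clustering template -->
  <spstream module="FeatureEmitter">
    <grid_size>500</grid_size>  <!-- hypothetical parameter name -->
  </spstream>
</datastream>
```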

Clustering, collation, peak calling
These modules provide for high-level manipulations of data to reduce the number of features on the data stream by grouping them into related concepts.
 * TemplateCluster: Use a side-chain-stream as template to collate expression.
 * UniqueFeature: Cluster and count features matching 'unique' criteria.
 * Paraclu: Hierarchical clustering (peak calling) based on Martin Frith's paraClu algorithm (http://www.cbrc.jp/paraclu/).

Filtering
These modules remove data from the stream based on filtering criteria.
 * TemplateFilter: Use a side-chain-stream as a mask to filter features on the primary stream.
 * CutoffFilter: Filter features using simple cutoff filters (high pass, low pass, band pass).
 * ExpressionDatatypeFilter: Filter expression from features based on datatype.
 * FeatureLengthFilter: Filter Features based on min/max length criteria.
 * TopHits: Filter neighborhood-regions based on best feature significance.
 * NeighborCutoff: Noise filtering relative to strongest signal within a neighborhood-region.
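For instance, a simple high-pass filter step might be spliced into a processing chain like this (the min_score parameter name is a hypothetical illustration, not confirmed ZENBU syntax):

```xml
<stream_processing>
  <!-- drop features whose score falls below the cutoff (high pass) -->
  <spstream module="CutoffFilter">
    <min_score>10</min_score>  <!-- hypothetical parameter name -->
  </spstream>
</stream_processing>
```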

Data normalization and rescaling
These modules alter the expression in a stream based on normalization or rescaling algorithms.
 * NormalizeByFactor: Normalize expression with respect to the experiments' associated metadata.
 * NormalizePerMillion: Normalize expression with respect to the total expression of the associated experiments (stored as metadata at upload time).
 * NormalizeRPKM: Reads Per Kilobase per Million (RPKM) based expression normalization.
 * RescalePseudoLog: Pseudo-log transformation of expression values.

Metadata manipulation
These modules create, modify, or use the metadata attached to objects on the stream.
 * OverlapAnnotate: Transfer metadata between overlapping features.
 * MetadataFilter: Filter Features based on matching metadata.
 * RenameExperiments: Create a new Experiment name by concatenating some of its associated metadata.
 * FeatureRename: Rename the features of a stream as their FeatureSource name.

General manipulation
These modules are general-purpose lego blocks for manipulating objects on the stream, helping to get data into the right format for the next module in the chain.
 * CalcFeatureSignificance : Aggregate the associated expression values onto the score of a feature.
 * CalcInterSubfeatures : Stream the regions between subfeatures of a parent feature (e.g. introns).
 * StreamSubfeatures : Stream the sub-features rather than the parent feature.
 * FilterSubfeatures : Rebuild a feature/subfeature structure by filtering subfeatures.
 * ResizeFeatures : Alter the boundaries of a feature (shrink toward 5', 3', start and end).
 * MakeStrandless : Alter the strand of a feature.