ZDX file

The ZDX (ZENBU data exchange) file format was designed primarily for internal ZENBU use, providing a file-based persistence layer for the ZENBU track caching system. ZENBU was designed primarily as a dataflow system: processed and manipulated data exists only in computer memory as it flows through the stream-processing system. While this works extremely well for medium-weight tracks (dozens of data sources with simple data processing), we started to see that heavy tracks (thousands of data sources, or very complex scripting with many side-streams) could be slow for users who just wanted to see the results of someone else's views. We therefore designed the TrackCaching system and the ZDX file.

Because ZENBU is designed around random access, the ZDX file needed to provide very fast random-access write capabilities. For example, if a user configures a new track that the track-cache system has never seen before, looks at the data around EGR1 (hg19::chr5 137800224-137805959) and then at RUNX1 (hg19::chr21 36094723-364869689), the TrackBuilding system writes the results of these two streaming queries into the ZDX file while every other location in the genome remains empty.
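One way to picture this sparse cache is that each query region maps onto a set of fixed-size genomic segments, and only the segments a query touches ever get written. A minimal sketch of that mapping (the 100 kb segment size and the function name here are illustrative assumptions, not the actual ZDX parameters):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed fixed-size genomic cache segments (illustrative value only).
const int64_t kSegmentSize = 100000;

// Map a half-open genomic interval [start, end) onto the
// segment indices it touches on a given chromosome.
std::vector<int64_t> segmentsForRegion(int64_t start, int64_t end) {
  std::vector<int64_t> segs;
  for (int64_t s = start / kSegmentSize; s <= (end - 1) / kSegmentSize; ++s)
    segs.push_back(s);
  return segs;
}
```

A query around EGR1 would thus touch only one or two segments, and those are the only parts of the ZDX file that need to exist after the query completes.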

To enable writing data at specific genomic locations in a random-access manner, we designed the ZDX file as a binary format that borrows the computer-science principles behind filesystem design. In particular, we based the design around the ideas of inodes, directory tables, and file blocks.
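The filesystem analogy can be sketched as a toy "directory table plus data blocks" file: a fixed table at the head of the file maps each segment index to the byte offset of that segment's data block, and blocks are appended in whatever order queries arrive. This is only an illustrative sketch of the idea, not the real ZDX on-disk layout:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Toy sketch of a filesystem-style random-access file (names and layout
// are illustrative assumptions, not the actual ZDX binary format).
class ToyZdx {
 public:
  ToyZdx(const std::string& path, int64_t numSegments)
      : path_(path), dir_(numSegments, 0) {
    // Reserve the directory table at the head of the file;
    // offset 0 means "segment not yet written".
    std::ofstream f(path_, std::ios::binary | std::ios::trunc);
    f.write(reinterpret_cast<const char*>(dir_.data()),
            dir_.size() * sizeof(int64_t));
  }

  // Random-access write: append the block at the end of the file,
  // then record its byte offset in the directory table.
  void writeSegment(int64_t seg, const std::string& data) {
    std::fstream f(path_, std::ios::binary | std::ios::in | std::ios::out);
    f.seekp(0, std::ios::end);
    int64_t offset = f.tellp();
    int64_t len = static_cast<int64_t>(data.size());
    f.write(reinterpret_cast<const char*>(&len), sizeof(len));
    f.write(data.data(), len);
    dir_[seg] = offset;
    f.seekp(seg * sizeof(int64_t));
    f.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
  }

  // Random-access read: look up the offset, seek, read the block.
  // Returns an empty string for segments never written.
  std::string readSegment(int64_t seg) {
    std::ifstream f(path_, std::ios::binary);
    int64_t offset = 0;
    f.seekg(seg * sizeof(int64_t));
    f.read(reinterpret_cast<char*>(&offset), sizeof(offset));
    if (offset == 0) return "";
    int64_t len = 0;
    f.seekg(offset);
    f.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::string data(len, '\0');
    f.read(&data[0], len);
    return data;
  }

 private:
  std::string path_;
  std::vector<int64_t> dir_;
};
```

Because the directory table is fixed-size and written first, any segment can be filled in or looked up at any time, in any order, just as blocks of a file can be allocated in any order by a filesystem.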

= ZDX uses ZENBU data model =

Because the ZDX file is designed for ZENBU data persistence, it uses the ZENBU data model. The ZENBU data model exists as an abstract design, as a series of C++ classes for the server-side systems, and as an XML representation for data transport. The server-side classes (each implemented as a .cpp/.h pair) are: Assembly, Chrom, DataSource, Datatype, Edge, EdgeSet, EdgeSource, Experiment, Expression, Feature, FeatureSource, Metadata, MetadataSet, and Symbol.

For example, here is the ZENBU XML representation of a GENCODE gene:
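As a rough illustration, a gene feature in such an XML representation might look like the following sketch, using the EGR1 coordinates quoted earlier (the element and attribute names here are assumptions for illustration, not the actual ZENBU schema):

```xml
<!-- illustrative sketch only: tag and attribute names are assumed,
     not the real ZENBU XML schema -->
<feature chrom="chr5" start="137800224" end="137805959" strand="+">
  <name>EGR1</name>
  <source name="GENCODE" category="gene"/>
  <mdata tag="gene_type" value="protein_coding"/>
</feature>
```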

= LZ4 compression of XML =
Although XML appears very verbose in its uncompressed form, it compresses extremely well (sometimes up to a 20x compression ratio). For ZDX, ZENBU uses the LZ4 compression algorithm (http://code.google.com/p/lz4/) for reading and writing data into the blocks of the ZDX file. LZ4 is designed for exceptionally fast compression and decompression speeds with very good compression ratios. It was designed primarily for compressing data in transport, where the data is transient in nature. This fits the needs of ZDX perfectly: in a caching system, compression and decompression speed matter more than the absolute smallest file size.

Even so, LZ4 compression of ZENBU XML yields excellent final file sizes, actually similar in size to BAM files.

As a test we took a BAM alignment file from the ENCODE project, wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam, with 33,322,702 alignments and converted it into a ZDX file. Because the ZENBU data model is flexible, it is also possible to strip the alignments down to their minimal components (chrom, start, end, strand, name, score) and store those in a ZDX file.
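Such a stripped-down alignment can be pictured as a small record; a sketch of the minimal components named above, with a simple parser for a hypothetical tab-separated input line (the struct and field names are illustrative, not the actual ZENBU classes):

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Minimal alignment record as described in the text:
// chrom, start, end, strand, name, score (names are illustrative).
struct MinimalAlignment {
  std::string chrom;
  int64_t start = 0, end = 0;
  char strand = '+';
  std::string name;
  double score = 0.0;
};

// Parse one whitespace/tab-separated line:
//   chrom start end strand name score
MinimalAlignment parseAlignment(const std::string& line) {
  MinimalAlignment a;
  std::istringstream in(line);
  in >> a.chrom >> a.start >> a.end >> a.strand >> a.name >> a.score;
  return a;
}
```

Because the data model is not fixed to a schema, records like this can carry as many or as few fields as the track actually needs.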

= ZDX performance =

Due to the map-reduce parallel-processing capabilities of ZENBU and the ZDX file format, ZDX files can be created very quickly.

Taking the same example ENCODE BAM file wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam (33,322,702 alignments, 1.3 GB of file space), we ran the following timing tests. The first set of samtools commands shows equivalent steps on the file to give a perspective on ZENBU and ZDX build performance.

Although these tests were done on a single BAM file to allow for comparisons, ZDX is best suited to storing processed results, thanks to its completely flexible data modeling, rather than sequence alignments. In contrast, BAM was designed and optimized for the storage of sequence alignments.