Supplementary Table 2

A. Full-length evaluation
 
Classification | >98% similar to MGI | MGI confirmed | identical to | homolog to | similar to | related to
Complete CDS (redundant)* | 1,508 63.0% | 2,703 63.3% | 615 73.1% | 4,231 76.3% | 572 61.5% | 573 68.1%
Complete CDS (non-redundant)* |  | 1,636 68.5% | 385 78.9% | 2,781 78.3% | 449 63.9% | 408 72.3%
excluded at annotation+ |  | 53 2.2% | 47 9.6% | 251 7.1% | 34 4.8% | 17 0.3%
Including 5' UTR (redundant)* |  | 2,365 55.7% | 517 61.5% | 3,593 65.0% | 511 54.9% | 532 63.3%
Including 5' UTR (non-redundant)* |  | 1,416 59.2% | 330 67.6% | 2,392 67.4% | 401 57.0% | 379 67.2%
excluded at annotation+ |  | 18 0.75% | 18 3.7% | 99 2.8% | 22 3.1% | 11 2.0%
Positive+ |  | 2,159 90.3% | 364 74.6% | 2,911 81.2% | 540 76.8% | 507 89.9%
5' truncated+ | 230 *9.7% | 97 4.0% | 52 10.6% | 283 8.0% | 52 7.4% | 24 4.2%
3' truncated+ | 71 *3.1% | 23 1.0% | 12 2.5% | 132 3.7% | 31 4.4% | 12 2.1%
truncated+ |  | 0 0.0% | 0 0.0% | 0 0.0% | 0 0.0% | 3 0.5%
unspliced+ |  | 37 1.5% | 40 8.2% | 82 2.3% | 27 3.8% | 9 1.60%
5' truncated, unspliced+ |  | 7 0.3% | 2 0.4% | 7 0.2% | 3 0.4% | 
3' truncated, unspliced+ |  | 4 0.2% | 2 0.4% | 12 0.3% | 1 0.1% | 
3', 5' truncated, unspliced+ |  | 5 0.2% | 3 0.6% | 3 0.1% | 0 0% | 2 0.4%
chimera+ | NA | 8 0.3% | 0 0% | 8 0.2% | 3 0.4% | 0 0.0%
reverse+ | 52 *2.2% | 29 1.2% | 22 4% | 23 0.6% | 18 2.6% | 3 0.5%
5' immature* | 51 2.2% |  |  |  |  | 
3' immature* | 90 3.7% |  |  |  |  | 
alternative C-termini* | 149 6.2% |  |  |  |  | 
alternative N-termini* | 42 1.7% |  |  |  |  | 
predicted start* | 4 0.2% |  |  |  |  | 
predicted stop* | 83 3.5% |  |  |  |  | 
unclear* | 122 5.1% |  |  |  |  | 
alignment problems* |  | 1 0.0% | 1 0.2% | 2 0.1% | 1 0.1% | 
Total (redundant) | NA | 4,248 | 841 | 5,525 | 930 | 841
Total (non-redundant) | 2,388 | 2,390 | 488 | 3,550 | 703 | 564
(*) Determined computationally; (+) determined at FANTOM by curator annotation; NA: not applicable. Computationally determined classifications were based on the clustered, non-redundant set because not all members of clusters were annotated as single clones. "Complete CDS": evaluated computationally. "Excluded at annotation": clones that were annotated as truncated although they were predicted to be full-length by computer analysis. "Including 5' UTR": clones considered full-length only if they have a 5' UTR of 20 nt or longer upstream of the first ATG; "excluded at annotation" is as above. "Redundant" and "non-redundant" refer to the analysis of non-clustered and clustered clones, respectively. "Positive": clones without penalties at annotation (full-length candidates at annotation). "5' immature" and "3' immature": clones retaining one residual unspliced intron at either the 5' or the 3' side of the homology sequence and thus not full-coding. "Alternative C-termini" and "alternative N-termini": after the region of similarity the sequence diverges but still seems to code for a complete protein. "Predicted start": the stop codon was not found computationally, either because of truncation or, most commonly, because of sequencing errors. "Predicted stop": the start codon was not found computationally, either because of truncation or, most commonly, because of sequencing errors. The 5' truncated and 3' truncated rows are marked (+), but their values marked (*) in the ">98% similar to MGI" column were annotated computationally. The total number of clones is given for both the redundant data set (all clones) and the non-redundant data set (after clustering).

Additional comments: In the column ">98% similar to MGI", among the clones with partial inserts, 11.7% were truncated at the 5' end (presumably due to failure of cap selection), at the 3' end (commonly due to internal priming, which could be recognized in many clusters by long internal poly(A) sequences), or at both ends, and 5% were "unspliced" (cDNAs retaining residual introns) at either the 5' or the 3' end and were therefore non-coding. The number of "unspliced" clones in the collection could have been reduced by preparing libraries from cytoplasmic RNA; this will be done in the future. We distinguished "unspliced" cDNAs from those with "alternative N- or C-termini" (8.1% in total), which lacked splice consensus sequences and had long open reading frames that could reflect splice variation rather than immature or truncated cDNAs. The clones grouped as "predicted start" or "predicted stop" did not show the start or stop codon in otherwise very well matched hybrid sequences, and are likely to be full-length clones affected by sequencing errors.

It is noteworthy that the full-length rate in the non-redundant set is lower than that in the redundant set by only 2-6%.

Computational analysis of full-lengthness. In the category ">98% similarity to MGI", we aligned sequences with BLAST E-values of 1e-10 or less. For the analysis of full-lengthness in the other categories, the results used were obtained with bioSCOUT 1.5 employing BLAST 2.0 (version 2.0.12). BLASTX hits with an E-value < 1e-10 were used as the basis for the evaluation of potential full-length clones and for the functional annotation of queries by bioSCOUT. The database searched with BLASTX was a non-redundant protein database (nrdb) built from Swissprot, Swissnew, TrEMBL, TrEMBLnew, Pironly, Genpept, Genpeptnew (built from Genbank and Genbanknew CDS entries, respectively) and pdb. The tool employed for building the non-redundant protein database was nrdb from NCBI. The criteria for counting one of the above-mentioned hits as a "potential full-length clone" were: (1) a BLASTX hit with an E-value < 1e-10 that was used for functional annotation of the query by bioSCOUT 1.5; (2) the presence of a putative 5' UTR of at least 20 nucleotides preceding the alignment region, or the presence of a CDS, or the presence of the first ATG, depending on the evaluation type; (3) an alignment of > 300 bp (100 amino acids) on the query sequence.
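The three criteria can be sketched as a simple filter. This is only an illustration under stated assumptions, not the bioSCOUT procedure: the `BlastxHit` record, its field names and the 1-based nucleotide coordinates on the query are hypothetical, and criterion (2) is modeled only in its "including 5' UTR" form (the "presence of CDS" and "first ATG" variants are not modeled).

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-10   # criterion (1): E-value must be below this
MIN_5P_UTR_NT = 20       # criterion (2): at least 20 nt upstream of the alignment
MIN_ALIGNED_NT = 300     # criterion (3): alignment must exceed 300 bp on the query


@dataclass
class BlastxHit:
    """Hypothetical record for one BLASTX hit (names are assumptions)."""
    evalue: float
    query_start: int  # 1-based, inclusive, nucleotide position on the query
    query_end: int    # 1-based, inclusive


def aligned_length(hit: BlastxHit) -> int:
    """Length of the aligned region on the query, in nucleotides."""
    return hit.query_end - hit.query_start + 1


def is_potential_full_length(hit: BlastxHit, require_5p_utr: bool = True) -> bool:
    """Apply criteria (1) and (3), and optionally the 5' UTR form of (2)."""
    if not hit.evalue < E_VALUE_CUTOFF:
        return False
    if not aligned_length(hit) > MIN_ALIGNED_NT:
        return False
    if require_5p_utr:
        upstream_nt = hit.query_start - 1  # nucleotides preceding the alignment
        return upstream_nt >= MIN_5P_UTR_NT
    return True


# Example: 44 nt upstream, 600 nt aligned, strong hit.
print(is_potential_full_length(BlastxHit(1e-50, 45, 644)))
```

The `require_5p_utr` switch is a stand-in for the "evaluation type" mentioned in criterion (2); the other evaluation types would need coordinates of the predicted CDS or first ATG, which are not described here.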

The final estimates of the fraction of computationally predicted full coding sequences (CDS) in the clone set and of the fraction of clones annotated as full-length by curators (53% and 59%, respectively) were determined by averaging the coding rates of the "MGI confirmed", "identical to", "homolog to", "similar to" and "related to" categories together with the "motif containing protein" and "hypothetical protein" categories, because the average CDS length was the same; we attributed 0% coding to the "unclassifiable" and "unclassifiable transcript" categories.

To roughly estimate the minimum and maximum number of full-length cDNAs in the dataset, we took the minimum (53%, predicted by computer) and maximum (63%, predicted on the non-redundant MGI set with similarity of > 98%) full-length rates and recalculated the probability of finding at least one full-length member in each cluster. For instance, with 63% predicted full coding, the fraction of truncated clones is 0.37. Thus, in a cluster of two clones, the probability that both clones are truncated is 0.37 x 0.37 = 0.137 (an 86.3% probability of at least one full-length clone); in a cluster of three clones, the probability that all are truncated is 0.37 x 0.37 x 0.37 = 0.051, and so on. The same calculation was done for the 53% value. The values obtained for each cluster category (2, 3, 4, etc. clones) were multiplied by the number of clusters in the respective category to estimate the number of non-redundant full-length cDNAs in the clone set.
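The calculation described above can be sketched as follows. It assumes, as the text does implicitly, that clones in a cluster are truncated independently of each other. The cluster size distribution in the example is hypothetical and serves only to show the arithmetic; it is not the distribution of the clone set.

```python
def prob_at_least_one_full_length(full_length_rate: float, cluster_size: int) -> float:
    """Probability that a cluster of the given size has a full-length member,
    assuming each clone is truncated independently with rate 1 - full_length_rate."""
    truncated_rate = 1.0 - full_length_rate
    return 1.0 - truncated_rate ** cluster_size


def expected_full_length_clusters(full_length_rate: float,
                                  cluster_size_counts: dict) -> float:
    """Sum over cluster sizes of (number of clusters) x (probability above)."""
    return sum(
        n_clusters * prob_at_least_one_full_length(full_length_rate, size)
        for size, n_clusters in cluster_size_counts.items()
    )


# Hypothetical distribution: {cluster size: number of clusters}.
example_counts = {1: 1000, 2: 300, 3: 100}

for rate in (0.53, 0.63):  # minimum and maximum full-length rates from the text
    estimate = expected_full_length_clusters(rate, example_counts)
    print(f"rate={rate:.2f}: about {estimate:.0f} clusters with a full-length clone")
```

For a rate of 0.63, `prob_at_least_one_full_length(0.63, 2)` evaluates to about 0.863 and `prob_at_least_one_full_length(0.63, 3)` to about 0.949, matching the worked figures in the paragraph.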

B. Analysis of Alternative Splicing in Redundant Clone Set
 
Number of gaps per gap category
Gap category | 3n | 3n+1 | 3n+2 | Exon variant | Sum
Potential alternative splicing | 81 36.8% | 70 31.8% | 62 28.2% | 7 3.2% | 220 100%
Potential unspliced intron | 26 33.3% | 23 29.5% | 29 37.2% | NA | 78 100%
Unclassified | 94 44.3% | 56 26.4% | 62 29.2% |  | 212 100%
Total | 201 | 149 | 153 | 7 | 510

 
Clones of the redundant clone set were compared against each other with BLAST2 and clustered into 432 groups, the criterion being more than two similar aligned portions of at least 20 bp in length with more than 94% sequence identity. Sequences of the 432 alternative splicing candidate clusters were aligned using the CLUSTALW algorithm. The 503 gapped regions (excluding exon variants) were then analyzed by gap length and frame (3n, 3n+1, 3n+2), yielding 201, 149 and 153 gaps, respectively. The gap sites were placed into four categories based on the consensus sequences 5'-(C or A)AG|GT and AG|G-3' described by Mount (Nucleic Acids Res. 10(2): 459-72, 1982): 1) Potential alternative splicing (220): a gap that has consensus sequences around the gap sites but does not correspond to the GT-AG rule; 2) Potential unspliced intron (78): a gap that corresponds to the consensus sequence; 3) Unclassified (212): a gap that does not belong to either of the aforesaid categories; 4) Exon variant: cases of alternative selection between two potential exon cassettes. If multiple gap sites existed in a cluster and at least one of them was placed into the first category, the cluster was defined as a "Potential alternative splicing cluster". NA: not analyzed.
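The frame assignment used in the table (3n, 3n+1, 3n+2) depends only on the gap length modulo 3. A minimal sketch of that step is given below; the function names and the example gap lengths are hypothetical, and the consensus-based assignment to categories 1) to 4) is not reproduced here.

```python
from collections import Counter

FRAMES = ("3n", "3n+1", "3n+2")


def gap_frame(gap_length: int) -> str:
    """Return the frame class of a gap from its length in nucleotides."""
    if gap_length <= 0:
        raise ValueError("gap length must be a positive number of nucleotides")
    return FRAMES[gap_length % 3]


def tally_frames(gap_lengths):
    """Count gaps per frame class and give each count as a percentage of the total."""
    counts = Counter(gap_frame(n) for n in gap_lengths)
    total = sum(counts.values())
    return {
        frame: (counts[frame], 100.0 * counts[frame] / total if total else 0.0)
        for frame in FRAMES
    }


# Hypothetical gap lengths (nt) from one aligned cluster.
for frame, (count, percent) in tally_frames([3, 6, 7, 11]).items():
    print(f"{frame}: {count} ({percent:.1f}%)")
```

A gap of length 3n leaves the downstream reading frame unchanged, whereas 3n+1 and 3n+2 gaps shift it, which is why the frames are reported separately.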