A. Full-length evaluation
|Classification||>98% similar to MGI||MGI confirmed||identical to||homolog to||similar to||related to|
|Complete CDS (redundant)*||1,508||63.0%||2,703||63.3%||615||73.1%||4,231||76.3%||572||61.5%||573||68.1%|
|Complete CDS (non-redundant)*||1,636||68.5%||385||78.9%||2,781||78.3%||449||63.9%||408||72.3%|
|excluded at annotation+||53||2.2%||47||9.6%||251||7.1%||34||4.8%||17||0.3%|
|Including 5' UTR (redundant)*||2,365||55.7%||517||61.5%||3,593||65.0%||511||54.9%||532||63.3%|
|Including 5' UTR (non-redundant)*||1,416||59.2%||330||67.6%||2,392||67.4%||401||57.0%||379||67.2%|
|excluded at annotation+||18||0.75%||18||3.7%||99||2.8%||22||3.1%||11||2.0%|
|5' truncated, unspliced+||7||0.3%||2||0.4%||7||0.2%||3||0.4%|
|3' truncated, unspliced+||4||0.2%||2||0.4%||12||0.3%||1||0.1%|
|3', 5' truncated, unspliced+||5||0.2%||3||0.6%||3||0.1%||0||0%||2||0.4%|
or (+) determined at the Fantom by curator annotation, NA: Not applicable.
Computationally determined classifications were based on the clustered,
nonredundant set because not all members of clusters were annotated as
single clones. "Complete CDS": evaluated computationally; "excluded at
annotation": clones that were annotated as truncated although they were
predicted to be full-length by computer analysis. "Including 5' UTR": clones
considered full-length only if they have a 5' UTR of 20 nt or longer upstream
of the first ATG. "excluded at annotation": as above. "Redundant" and
"non-redundant" refer to the analysis of non-clustered and clustered clones, respectively.
"Positive": clones without penalties at the annotation (full-length candidates
at annotation). 5' and 3' immature, clones retaining one residual unspliced
intron either at the 5' or the 3' side of the homology sequence and thus
not full-coding. Alternative C- and N-termini: after the region of similarity
the sequence diverged and still seems to code for a complete protein. "Predicted
start": stop codon was not found computationally either because of truncation
or most commonly because of sequencing errors; "Predicted stop": start
codon was not found computationally because of truncation or most commonly
because of sequencing errors. Terms flanked by (+) in the 5' truncation
and 3' truncation rows but flanked by (*) in the >98% MGI were computationally
annotated. Total number of clones is given for both redundant data set
(all the clones) and non-redundant data set (after clustering).
Additional comments: In the column "MGI > 98 %", among the clones with partial inserts, 11.7% were truncated at the 5' end (presumably due to failure of cap selection), at the 3' end (commonly due to internal priming, which could be recognized in many clusters by the long internal polyA sequences), or at both ends, and 5% were "unspliced" (cDNAs retaining residual introns) at either the 5' or the 3' side and therefore non-coding. The number of "unspliced" clones in the collection could have been reduced by preparing libraries from cytoplasmic RNA; this will be done in the future. We distinguished between "unspliced" cDNAs and those with "Alternative N- or C-termini" (8.1% in total), which lacked splice consensus sequences and had long open reading frames that could reflect splice variation rather than immature or truncated cDNAs. The clones grouped as "predicted start or stop" did not show the start or stop codon sequence in otherwise very well matched hybrid sequences, and are likely to be full-length clones suffering from sequencing errors.
It is noteworthy that the full-length rate in the non-redundant set is only 2-6% lower than in the redundant set.
Computational analysis of full-lengthness. In the category ">98% similarity to MGI", we aligned sequences with BLAST E-values of 1e-10 or less. For the analysis of full-lengthness in the other categories, the results were obtained with bioSCOUT 1.5 employing BLAST 2.0 (Version 2.0.12). BLASTX hits with an E-value of < 1e-10 were used as the basis for the evaluation of potential full-length clones and for the functional annotation of queries by bioSCOUT. The database searched with BLASTX was a non-redundant protein database (nrdb) built from Swissprot, Swissnew, TrEMBL, TrEMBLnew, Pironly, Genpept, Genpeptnew (built from Genbank and Genbanknew cds entries, respectively) and pdb. The tool employed for building the non-redundant protein database was nrdb from NCBI. The criteria for counting one of the above-mentioned hits as a "potential full-length clone" were: (1) a BLASTX hit with an E-value < 1e-10 that was used for functional annotation of the query by bioSCOUT 1.5; (2) the presence of a putative 5' UTR of at least 20 nucleotides preceding the alignment region, or the presence of CDS, or the presence of the first ATG, depending on the evaluation type; (3) an alignment of > 300 bp (100 amino acids) on the query sequence.
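The three criteria above can be illustrated with a minimal sketch. It covers only the "including 5' UTR" evaluation type for criterion (2) and assumes a plus-strand hit with 1-based query coordinates. The class `BlastxHit`, its field names, and the function `is_potential_full_length` are hypothetical names introduced for illustration; they are not part of bioSCOUT or BLAST.

```python
from dataclasses import dataclass

# Thresholds taken from the criteria stated in the text.
E_VALUE_CUTOFF = 1e-10      # criterion (1): E-value must be below this
MIN_UTR_NT = 20             # criterion (2): minimum putative 5' UTR length
MIN_ALIGNMENT_NT = 300      # criterion (3): alignment must be longer than this


@dataclass
class BlastxHit:
    """Hypothetical container for one BLASTX hit on a query cDNA."""
    evalue: float
    query_start: int  # 1-based first aligned nucleotide on the query
    query_end: int    # 1-based last aligned nucleotide on the query


def is_potential_full_length(hit: BlastxHit) -> bool:
    """Apply the three criteria for the 'including 5' UTR' evaluation type."""
    significant = hit.evalue < E_VALUE_CUTOFF
    # Nucleotides preceding the alignment region are treated as putative 5' UTR.
    utr_length = hit.query_start - 1
    has_utr = utr_length >= MIN_UTR_NT
    aligned_nt = hit.query_end - hit.query_start + 1
    long_enough = aligned_nt > MIN_ALIGNMENT_NT
    return significant and has_utr and long_enough


if __name__ == "__main__":
    # Illustrative values only, not data from the clone set.
    print(is_potential_full_length(BlastxHit(1e-50, 45, 944)))
    print(is_potential_full_length(BlastxHit(1e-50, 5, 944)))
```

The other evaluation types (presence of CDS, presence of the first ATG) would replace the `has_utr` test with a different condition; they are omitted here because the text does not specify how they were computed.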
The final estimates of the computationally predicted full coding sequences (CDS) in the clone set and of the fraction of clones annotated as full-length by curators (53% and 59%) were determined by averaging the coding rate of the "MGI confirmed", "identical to", "homolog to", "similar to", and "related to" categories with that of the "motif containing protein" and "hypothetical protein" categories, because the average CDS length was the same, while we attributed 0% coding to the "unclassifiable and unclassifiable transcript".
To roughly estimate the minimum and maximum number of full-length cDNAs in the dataset, we took the minimum (53%, predicted by computer) and maximum (63%, predicted on the non-redundant MGI set with similarity of > 98%) rates and recalculated the probability of finding a full-length member in each cluster. For instance, with 63% predicted full coding, the fraction of truncated clones is 0.37. Thus, in a cluster of two clones, the probability that both are truncated is 0.37 x 0.37 = 0.137 (an 86.3% probability that at least one is full-length); in a cluster of three clones the probability is 0.37 x 0.37 x 0.37 = 0.051, and so on. The same calculation was done for the 53% value. The values obtained for each cluster category (2, 3, 4, etc. clones) were multiplied by the number of clusters in the respective category to estimate the number of non-redundant full-length cDNAs in the clone set.
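The calculation described above can be sketched as follows, assuming that clones within a cluster are truncated independently of one another. The cluster-size counts in `example_clusters` are hypothetical placeholders, since the actual distribution is not given here, and the function names are illustrative.

```python
def prob_cluster_has_full_length(p_full: float, cluster_size: int) -> float:
    """Probability that at least one clone in a cluster is full-length,
    assuming each clone is truncated independently with probability 1 - p_full."""
    return 1.0 - (1.0 - p_full) ** cluster_size


def expected_full_length_clusters(p_full: float, clusters_by_size: dict) -> float:
    """Sum, over cluster sizes, of (number of clusters) x (probability that
    a cluster of that size contains a full-length member)."""
    return sum(
        count * prob_cluster_has_full_length(p_full, size)
        for size, count in clusters_by_size.items()
    )


if __name__ == "__main__":
    # Hypothetical distribution: {cluster size: number of clusters}.
    example_clusters = {1: 1000, 2: 300, 3: 100}
    for p_full in (0.53, 0.63):  # minimum and maximum rates from the text
        estimate = expected_full_length_clusters(p_full, example_clusters)
        print(f"p_full={p_full}: {estimate:.1f} expected full-length clusters")
```

Running the two rates side by side gives the lower and upper bounds of the estimate for whatever cluster-size distribution is supplied.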
B. Analysis of Alternative Splicing in Redundant Clone Set
|Potential alternative splicing||81||36.8%||70||31.8%||62||28.2%||7||3.2%||220||100%|
|Potential unspliced intron||26||33.3%||23||29.5%||29||37.2%||NA||78||100%|
|Clones of the redundant clone set were compared against each other with BLAST2 and clustered into 432 groups based on the criterion of showing more than two aligned regions of at least 20 bp in length with more than 94% sequence identity. Sequences of the 432 alternative splicing candidate clusters were aligned using the CLUSTALW algorithm. 503 gapped regions (excluding exon variants) were then analyzed by gap length and frames 3N, 3N+1, 3N+2, yielding 201, 149 and 153 clusters, respectively. The gap sites were placed into four categories based upon the consensus sequences, 5'-(C or A)AG|GT and AG|G-3', described by Mount (Nucleic Acids Res. 10(2): 459-72, 1982). 1) Potential alternative splicing (220): gap which has consensus sequences around the gap sites but does not correspond to the GT-AG rule. 2) Potential unspliced intron (78): gap which corresponds to the consensus sequence. 3) Unclassified (212): gap which does not belong to any of the aforesaid categories. If multiple gap sites existed in a cluster and at least one of them was placed into the first category, the cluster was defined as a "Potential alternative splicing cluster". 4) "Exon variant" refers to cases of alternative selection of two potential exon cassettes. NA: not analyzed.|
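As an illustration of the frame analysis described above, the following minimal sketch bins gap lengths into the 3N, 3N+1 and 3N+2 frames and tests whether a gap sequence begins with GT and ends with AG. The example sequences and the function names are hypothetical, and the sketch does not reproduce the full four-category assignment, which also depended on the flanking consensus sequences.

```python
from collections import Counter


def gap_frame(gap_length: int) -> str:
    """Return the frame class of a gap: '3N', '3N+1' or '3N+2'."""
    remainder = gap_length % 3
    return "3N" if remainder == 0 else f"3N+{remainder}"


def follows_gt_ag(gap_sequence: str) -> bool:
    """True if the gap sequence starts with GT and ends with AG (GT-AG rule)."""
    seq = gap_sequence.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


if __name__ == "__main__":
    # Hypothetical gap sequences, for illustration only.
    gaps = ["GTAAGTCCTTAG", "GTACGATTCAG", "CCATGGATTA", "gtaagtag"]
    frame_counts = Counter(gap_frame(len(g)) for g in gaps)
    print(dict(frame_counts))
    print([follows_gt_ag(g) for g in gaps])
```

A gap in the 3N frame leaves the downstream reading frame intact, whereas 3N+1 and 3N+2 gaps shift it, which is why the three classes are tallied separately.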