Genome Exploration Research Group web site). Phase I of the project focused on collecting and sequencing more than one million mouse cDNAs. In Phase II, we re-arrayed the non-redundant clones and produced full-length sequences for them. Functional annotation of the full-length mouse cDNAs, and deposition of the annotated sequence data in the public databases, will contribute to the progress of science.
In order to assign functional annotation to uncharacterized cDNAs, we have been developing a semi-automatic annotation tool which refers to the results from the following:
1. homology searches, including searches against databases of orthologous species (human, rat, Drosophila, C. elegans, yeast,
2. searches for well-known protein motifs using Pfam and PROSITE
3. other data, such as expression data and protein-protein interaction data, where applicable.
We use the term "functional annotation of genes" to refer to the assignment of attributes to genes. The attributes include Gene Ontology terms, which are classified into three categories:
- molecular function,
- biological process and
- cellular component
However, our semi-automatic methods, like the similar methods used for other databases such as UniGene, have limits. For example, curation by biologists is always necessary when annotating genes for which BLAST searches return only low-similarity matches with weak E-values.
Given these issues, we believe it is necessary to discuss what the functional annotation of the mouse full-length cDNAs requires. Points for discussion include what biologists need to curate and the rules of functional annotation. During the proposed meeting, we then want to annotate the mouse full-length cDNAs as adequately as possible together with experts in bioinformatics, genome science, biology and other fields.
We therefore held a meeting to annotate our mouse full-length cDNAs, named the FANTOM (Functional ANnoTation Of Mouse) meeting.
A variety of tools, including BLASTN, BLASTX (http://www.ncbi.nlm.nih.gov/BLAST/), FASTA/FASTY (ftp://ftp.virginia.edu/pub/fasta/), DECODER, EST-WISE (http://www.sanger.ac.uk/Software/Wise2/), and HMMER (http://hmmer.wustl.edu/), were used to search a large number of databases, including NCBI-nr, LocusLink, SwissProt, SwissProt TrEMBL, TIGR nraa, Pfam, TIGRFAMs, UniGene, the TIGR Gene Indices, UTRdb and UTRsite, and a number of species-specific databases. Additional analyses were performed using the bioSCOUT® program from LION Bioscience. Protein domain analyses were conducted by EBI using InterPro.
FANTOM meeting web site
In the [...] of the meeting, strategies for annotating the sequences of 21,076 cDNAs were discussed by bioinformaticians and biologists. [...] PubMed), and grouped them on the basis of sequence using CAP3 (http://genome.cs.mtu.edu/cap/cap3.html), aligned them using CLUSTALW (ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/), and visually inspected the alignments. This placed 8,207 clones into 2,957 clusters, reducing the cDNA clone set to 15,826 unique genes and the MGI-confirmed set (see below) to 2,921 unique genes. Further analysis of RIKEN clones in the MGI-confirmed set revealed some instances where non-overlapping clones could be added to existing clusters or grouped together based on curatorial association with the same MGI gene. As a result, the number of genes in the MGI set was reduced from 2,921 to 2,390, and the total number of genes represented by the whole RIKEN set was reduced to 15,295.
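The clone and gene counts quoted above can be checked with simple arithmetic. The following is a minimal sketch, not part of the FANTOM tooling. The function names are hypothetical, and it assumes that every cluster and every unclustered clone counts as one gene.

```python
def genes_after_clustering(total_clones: int, clustered_clones: int, clusters: int) -> int:
    """Count unique genes, assuming each cluster and each singleton clone is one gene."""
    singletons = total_clones - clustered_clones
    return singletons + clusters


def genes_after_merging(genes: int, subset_before: int, subset_after: int) -> int:
    """Subtract the genes merged away when a subset shrinks from subset_before to subset_after."""
    return genes - (subset_before - subset_after)


if __name__ == "__main__":
    # Figures taken from the text: 21,076 cDNAs, 8,207 clones in 2,957 clusters.
    unique_genes = genes_after_clustering(21_076, 8_207, 2_957)
    print(unique_genes)  # 15826

    # The MGI-confirmed set shrank from 2,921 to 2,390 genes after curation.
    print(genes_after_merging(unique_genes, 2_921, 2_390))  # 15295
```

Both results agree with the totals reported in the text (15,826 and 15,295).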
For novel genes represented by RIKEN clusters, nomenclature will be taken from the Clone Identifiers of the representative clones for each cluster.
A supplementary RIKEN definition line (riken_def_suppl) was available in the interface for additional pertinent annotation. Annotation of RIKEN clones with significant similarity to known sequences was guided by the gene/gene product descriptors of the reference sequences to which the RIKEN clones were most similar. In general, the riken_def was derived from the gene descriptor of the reference sequence that had the highest similarity to the RIKEN clone sequence. When the RIKEN clone was highly similar to several genes, an annotation hierarchy was used to choose the riken_def, based on the species of origin and descriptor content for the candidate reference sequences.
Priority was given to reference sequence descriptors from which some functional information could be inferred for the RIKEN clones, even if sequences with less informative descriptors were more similar to the clones. Annotations from highly curated databases (MGI and SwissProt) were preferred and provided convenient entry points into the Gene Ontology vocabularies. Informative descriptors from mouse genes identical to RIKEN clones were the first choice for annotation. Official gene nomenclature was used preferentially for RIKEN clones found to be identical to mouse genes in the Mouse Genome Informatics (MGI) databases (the "MGI-confirmed" set). For RIKEN clones identical to mouse genes not represented in MGI, or with non-identical similarity to known genes, riken_defs were derived from informative gene descriptors according to the following species priority: identical mouse > non-identical mouse > non-mouse mammal > non-mammal. Controlled vocabulary prefix terms "similar to", "homolog to" or "related to" were used in the riken_def line to indicate that a gene descriptor was derived from non-identical mouse, non-mouse mammal, or non-mammal sources, respectively.
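The annotation hierarchy described above can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the tool used at the meeting. The `Hit` record, its `similarity` score and its `informative` flag are hypothetical stand-ins for the curators' judgement, and the fallback to uninformative descriptors when nothing better exists is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Species priority from the text: a lower rank is preferred.
PRIORITY = {
    "identical_mouse": 0,
    "non_identical_mouse": 1,
    "non_mouse_mammal": 2,
    "non_mammal": 3,
}

# Controlled vocabulary prefixes from the text; identical mouse genes get none.
PREFIX = {
    "identical_mouse": "",
    "non_identical_mouse": "similar to ",
    "non_mouse_mammal": "homolog to ",
    "non_mammal": "related to ",
}


@dataclass
class Hit:
    descriptor: str    # gene descriptor of the reference sequence
    source: str        # one of the PRIORITY keys
    similarity: float  # hypothetical score, higher means more similar
    informative: bool  # hypothetical flag: can function be inferred from the descriptor?


def choose_riken_def(hits: Sequence[Hit]) -> Optional[str]:
    """Pick a riken_def: informative descriptors first, then species priority, then similarity."""
    informative = [h for h in hits if h.informative]
    pool = informative or list(hits)  # assumption: fall back to any hit if none is informative
    if not pool:
        return None
    best = min(pool, key=lambda h: (PRIORITY[h.source], -h.similarity))
    return PREFIX[best.source] + best.descriptor


if __name__ == "__main__":
    hits = [
        Hit("hypothetical protein XYZ", "identical_mouse", 100.0, False),
        Hit("protein kinase C", "non_mouse_mammal", 92.0, True),
        Hit("serine/threonine kinase", "non_mammal", 95.0, True),
    ]
    print(choose_riken_def(hits))  # homolog to protein kinase C
```

In the usage example the identical mouse hit is passed over because its descriptor is uninformative, which mirrors the rule that informative descriptors take priority even when less informative ones are more similar.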
RIKEN clones with no significant sequence similarity to known genes were named based on coding potential, protein motif signature and representation in mouse, human or rat EST databases. RIKEN clones with no significant similarity to known sequences, but with predicted protein motifs found in Pfam and/or InterPro were named "<motif name> containing protein". Clones with no known sequence similarity or domain hits, but with coding potential equal to or greater than 100 amino acids and EST representation were named "hypothetical protein". Clones belonging to none of the above groups, but with matches to ESTs were referred to as "unclassifiable transcript". Clones with no EST matches were called "unclassifiable".
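The naming rules for clones without significant similarity to known genes amount to a small decision cascade. A minimal sketch follows, assuming the three inputs have already been computed elsewhere. The function and parameter names are hypothetical, and the rules are applied in the order in which the text lists them.

```python
from typing import Optional


def name_unmatched_clone(motif_name: Optional[str],
                         coding_length_aa: int,
                         has_est_match: bool) -> str:
    """Name a clone that has no significant similarity to known genes."""
    # A predicted Pfam/InterPro motif takes precedence.
    if motif_name:
        return f"{motif_name} containing protein"
    # Coding potential of at least 100 amino acids plus EST representation.
    if coding_length_aa >= 100 and has_est_match:
        return "hypothetical protein"
    # EST matches only.
    if has_est_match:
        return "unclassifiable transcript"
    # No EST matches at all.
    return "unclassifiable"


if __name__ == "__main__":
    print(name_unmatched_clone("zinc finger", 250, True))  # zinc finger containing protein
    print(name_unmatched_clone(None, 120, True))           # hypothetical protein
    print(name_unmatched_clone(None, 60, True))            # unclassifiable transcript
    print(name_unmatched_clone(None, 300, False))          # unclassifiable
```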
New mouse genes discovered in the RIKEN clone set will be assigned official nomenclature in MGI that follows a defined syntax: Gene Symbol = <Riken Clone Identifier> "Rik", Gene Name = "Riken cDNA" <Riken Clone Identifier> "gene" (e.g. 2610307C23Rik, Riken cDNA 2610307C23 gene). Information about RIKEN clones and genes is available through the Mouse Genome Informatics web site.
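The nomenclature syntax can be expressed as a short helper. This is a sketch only: the function name is hypothetical, and it does not validate the format of the clone identifier beyond rejecting an empty string.

```python
from typing import Tuple


def mgi_nomenclature(clone_id: str) -> Tuple[str, str]:
    """Build (gene symbol, gene name) from a RIKEN clone identifier."""
    clone_id = clone_id.strip()
    if not clone_id:
        raise ValueError("clone identifier must not be empty")
    symbol = f"{clone_id}Rik"
    name = f"Riken cDNA {clone_id} gene"
    return symbol, name


if __name__ == "__main__":
    # Example identifier taken from the text.
    print(mgi_nomenclature("2610307C23"))
    # ('2610307C23Rik', 'Riken cDNA 2610307C23 gene')
```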