Supplementary Table 7

A. Databases that were used for the functional annotation

Database or repository Description

Nucleotide sequence

Mouse Gene Index (in-house) redundant phase I clone sequencing data
nr-nt (in-house) non-redundant database built from Genbank, EMBL, DDBJ, and their cumulative daily-updated nucleotide sequences
tigr-mgi Nucleotide sequences from TIGR Mouse Gene Index
MGI integrated view of gene characterization, nomenclature, genetic markers, mapping, gene homologies, expression, phenotype and other biological data
est_mouse mouse EST sequences
UniGene clusters of ESTs and full-length mRNA sequences; each cluster represents a unique known or putative human gene
TIGR Gene Indices human and non-human TIGR and GenBank EST sequences assembled to tentative consensus sequences
UTRdB non-redundant 3' and 5' UTR sequences of eukaryotic mRNAs enriched with annotations about functional elements and repeats

Mapping

Whitehead Mouse RH dB T31 RH hybrid data of 20 mouse chromosomes
Jackson Laboratory T31 Mouse RH dB T31 RH data of 20 mouse chromosomes from various sources incl. WICGR mouse RH dB, The UK Mouse Genome Centre, Genoscope - CNS mapped together into a single comprehensive map
Refseq reference sequence standards for chromosomes, mRNAs, and proteins for the functional annotation of genome data
Ensembl human genome dataset containing confirmed and predicted genes, exons, transcripts, and contigs

Protein sequence

NCBI-nr non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
SwissProt annotated protein database with minimal redundancy, annotation incl. GO terms and functional sites
TrEMBL translations of all CDS present in the EMBL, which are not yet integrated into SWISS-PROT
TIGR's nr-aa non-redundant amino acid sequence database prepared at TIGR using data from EGAD, SwissProt, PDB and GenPept

Gene cluster

HomoloGene curated orthologs of mouse, rat, human, and zebrafish; calculated orthologs from sequence comparisons between all UniGene clusters for each pair of organisms
Pfam semi-automatic protein family database containing multiple protein alignments and profile-HMMs of these families
TIGRFAM a curated protein family database containing multiple protein alignments and profile HMMs of these families
InterPro integrated view of other domain and functional site databases (PROSITE, PRINTS, ProDom and Pfam)
UTRsite nucleotide sequence patterns of UTRs where a functional role has been shown experimentally

Pathway

KEGG metabolic and regulatory pathway maps

Disease

LocusLink annotated sequence and descriptive information about genetic loci
Refseq reference sequence standards for chromosomes, mRNAs, and proteins for the functional annotation of genome data
OMIM catalog of human genes and genetic disorders

Literature

PubMed abstracts and bibliographic information of journal articles and books

Gene Ontology

swp2go gene ontology index for mapping of SwissProt keywords to GO terms
egad2go gene ontology index for mapping of EGAD cellular roles to GO terms

B. Software that was used during full-length sequencing and the functional annotation

Software name Description

Functional Annotation

FANTOM+ web-based system for human curation of sequences

Database searching

NCBI-BLAST Basic Local Alignment Search Tool, which includes a set of similarity search programs (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX)
RepeatMasker screens DNA sequences against a library of repetitive elements, as well as for low complexity regions; it returns a masked query sequence ready for database searches
FASTA package that compares a sequence to another sequence or to a sequence database using the FASTA algorithm. In particular, the FASTY program was frequently used in the FANTOM meeting (FASTY compares a DNA sequence to a protein sequence database using the FASTA algorithm; it translates the DNA sequence in three forward (or reverse) frames and allows frameshifts)
FLAST (in house) DDS based program that compares a query sequence pairwise with a cDNA sequence database
Wise2 Wise2 is a package for comparing DNA and protein sequences. In the meeting, estwise in the Wise2 package was frequently used because it can compare a protein sequence against an EST/cDNA sequence with the option of using a protein profile HMM
HMMER profile hidden Markov models for biological sequence analysis; searches a sequence database with a profile HMM or builds a hidden Markov model from a sequence alignment
Patsearch finds functional elements in nucleotide and protein sequences and assesses their statistical significance

Gene structure; Open Reading Frame

GenScan determines the most likely gene structure (exon/intron) under a probabilistic model of the gene structural and compositional properties of the genomic DNA for a given organism 
ORF Finder finds all open reading frames of a selected minimum size in a sequence
DECODER (in house) extracts open reading frames from sequences and corrects frame-shifts

Multiple sequence alignment

CLUSTALW progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

Cluster analysis

Maximum density subgraph (in house) generates a linkage graph whose vertices are sequences and whose edges are pairwise similarities; it then finds subgraphs whose vertices are connected with a fraction 'p' of the other vertices until all sequences are covered and the maximum density (sum of similarities / number of nodes) is found

Assemble

Phred reads DNA sequencer trace data, calls bases, and assigns quality values to the bases
Phrap assembles shotgun DNA sequence data to a contig sequence
Consed edits sequence assemblies created by Phrap for reassembly of the same data set
CAP3 assembles sequences using base quality values in computation of overlaps between reads; construction of multiple sequence alignments of reads, and generation of consensus sequences

Others

bioSCOUT commercial software package for enhanced sequence analysis
experimental programs extraction and assignment of GO terms
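
Note on the density measure used in cluster analysis. The density named in the "Maximum density subgraph" entry (sum of similarities / number of nodes) can be illustrated with a small sketch. This is not the in-house program: the function names and the toy similarity values are hypothetical, the search uses greedy peeling (a generic approximation for densest-subgraph problems, swapped in here for illustration), and the fraction 'p' connectivity criterion is not modelled.

```python
from collections import defaultdict


def density(nodes, edges):
    """Sum of similarities on edges inside `nodes`, divided by the number of nodes."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    total = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return total / len(nodes)


def greedy_densest_subgraph(edges):
    """Greedy peeling (illustrative assumption, not the in-house method).

    Repeatedly removes the vertex with the lowest summed similarity and
    remembers the densest vertex set seen along the way.
    `edges` maps (sequence_a, sequence_b) pairs to a similarity score.
    """
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    remaining = set(adj)
    if not remaining:
        return set(), 0.0

    total = sum(edges.values())
    degree = {u: sum(nb.values()) for u, nb in adj.items()}
    best_nodes = set(remaining)
    best_density = total / len(remaining)

    while len(remaining) > 1:
        # Drop the vertex that contributes the least similarity.
        worst = min(remaining, key=lambda u: (degree[u], u))
        remaining.remove(worst)
        for v, w in adj[worst].items():
            if v in remaining:
                degree[v] -= w
                total -= w
        current = total / len(remaining)
        if current > best_density:
            best_nodes, best_density = set(remaining), current

    return best_nodes, best_density


# Toy pairwise similarities between four hypothetical sequences.
toy_edges = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("b", "c"): 0.7,
    ("c", "d"): 0.1,
}
cluster, cluster_density = greedy_densest_subgraph(toy_edges)
```

In the toy input, sequence "d" is only weakly linked to "c", so removing it raises the density and the remaining three sequences form the densest group.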