FANTOM

computational analyses and curation

Software name	Reference	Description
FANTOM cDNA annotation system (CAS)	T. Kasukawa et al. in preparation	web-based system for human curation of sequences
ITOP	T. Kasukawa et al. in preparation	displays sequencing quality (PHRED) scores
Homology Viewer	M. Furuno et al. in preparation	Graphical viewer that shows regions homologous to protein sequences and start/stop codons for each frame
ClusTrans	J. Adachi et al. in preparation	RIKEN cDNA sequence clustering, viewer, and editor
READ	Bono et al. Nucleic Acids Res. 30, 211-213. (2002)	RIKEN expression array database
Metabolomapper	H. Bono et al. in preparation	system to browse and map assigned EC numbers to KEGG metabolic pathways
FACTS	T. Nagashima et al. in preparation	system to explore and curate computational higher functional annotations (protein interactions and disease associations) of cDNA clones using text sources

Programs that were used during the full-length sequencing and functional annotation

Software name	Reference	Description
Database searching
NCBI-BLAST	Altschul et al. J. Mol. Biol. 215, 403-410. (1990)	Basic Local Alignment Search Tool; includes a set of similarity search programs (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX)
RepeatMasker	Smit, A.F.A. and Green, P. unpublished results	screens DNA sequences against a library of repetitive elements, as well as for low complexity regions; it returns a masked query sequence ready for database searches
Protein Sequence Analysis
FASTY	Pearson et al. Genomics 46, 24-36. (1997)	a program of the FASTA package that compares a DNA sequence to a protein sequence database using the FASTA algorithm; it translates the DNA sequence in three forward (or reverse) frames and allows frameshifts
HMMER	Eddy. Bioinformatics 14, 755-763. (1998)	profile hidden Markov models for biological sequence analysis; searches a sequence database with a profile HMM or builds a hidden Markov model from a sequence alignment
InterProScan	Zdobnov and Apweiler. Bioinformatics 17, 847-848. (2001)	SW-based InterPro motif search
iPSORT	Bannai et al. Bioinformatics 18, 298-305. (2002)	Predicts the subcellular location of proteins
TMHMM	A. Krogh et al. J. Mol. Biol. 305, 567-580. (2001)	Prediction of transmembrane helices in proteins
COILS	A. Lupas et al. Science 252, 1162-1164. (1991)	Prediction of coiled-coil conformation from protein sequences
SignalP	H. Nielsen et al. Proc Int Conf Intell Syst Mol Biol 6, 122-130. (1998)	Prediction of the presence and location of signal peptide cleavage sites in amino acid sequences
Gene structure; Open Reading Frame
DECODER (in house)	Fukunishi and Hayashizaki. Physiological Genomics 5, 81-87. (2001)	extracts open reading frames from sequences and corrects frame-shifts
rsCDS (in house)	M. Furuno et al. in preparation	CDS prediction completely based on homology search of protein sequences
ProCrest (in house)	J. Adachi et al. in preparation	CDS prediction based on coding potential in DNA sequences
NCBI CDS Predictor (in house)	L. Wagner (unpublished)	CDS prediction based on both protein homology and coding potential
Sequence assembly, clustering, Gene Index building
Phred	Ewing and Green. Genome Res. 8, 186-194. (1998)	reads DNA sequencer trace data, calls bases, and assigns quality values to the bases
Phrap		assembles shotgun DNA sequence data into a contig sequence
Consed	D. Gordon et al. Genome Res. 8, 195-202. (1998)	edits sequence assemblies created by Phrap for reassembling of the same data set
CAP3	X. Huang et al. Genome Res. 9, 868-877. (1999)	assembles sequences using base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences; integrated in the TIGR Gene Index assembly pipeline
Megablast		nucleotide sequence alignment search program, used for clustering in the TIGR Gene Index assembly
TGI assembly pipeline	J. Quackenbush et al. Nucleic Acids Res. 29, 159-164. (2001)	TIGR Gene Index assembly pipeline
Mapping and genomic alignments
TGI mapping pipeline		genomic alignment and grouping of tentative transcript sequences
blEST	L. Florea et al. Genome Res. 8, 967-974. (1998)	cDNA-genome alignment program integrated in TIGR Gene Index genomic mapping pipeline
SIM4	L. Florea et al. Genome Res. 8, 967-974. (1998)	aligns a cDNA sequence to a genomic sequence under the assumption that the differences between the two sequences are limited to introns in the genomic sequence and sequencing errors in either of the sequences
Gene Ontology Browser
GO around	J. Tanoue et al. Bioinformatics (in press)	Gene ontology viewer

Databases that were used for the annotation pipeline and curation

Database	Reference	Description
Nucleotide sequence
DDBJ	Tateno et al. Nucleic Acids Res. 30, 27-30. (2002)	all known nucleotide and protein sequences
EMBL	Stoesser et al. Nucleic Acids Res. 30, 21-26. (2002)	all known nucleotide and protein sequences
GenBank	Benson et al. Nucleic Acids Res. 30, 17-20. (2002)	all known nucleotide and protein sequences
Mouse Genome Informatics (MGI) - Mouse Genome Database (MGD)	Blake et al. Nucleic Acids Res. 30, 113-115. (2002)	model organism database for the laboratory mouse; gene, sequence, nomenclature, GO information among others
RefSeq/LocusLink	Pruitt et al. Nucleic Acids Res. 29, 137-140. (2001)	non-redundant collection of genes and reference sequence standards
dbEST (mouse division)		mouse EST sequences
UniGene	Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002)	clusters of ESTs and full-length mRNA sequences; each cluster represents a unique known or putative gene
TIGR Gene Indices	J. Quackenbush et al. Nucleic Acids Res. 29, 159-164. (2001)	TIGR and GenBank EST sequences assembled to tentative consensus sequences
nt(NCBI)	Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002)	all GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
Alternative splicing dB	Zavolan et al. manuscript in preparation	Database of alternatively spliced mouse transcripts
Mapping
MGSC v3	Mouse Genome Sequencing Consortium. Nature. (this issue) (2002)	mouse genome sequence assembly
Human "Golden Path"	International Human Genome Sequencing Consortium, Nature 409, 860-921. (2001)	human genome sequence assembly
Ensembl	Hubbard et al. Nucleic Acids Res. 30, 38-41. (2002)	genome dataset containing confirmed and predicted genes, exons, transcripts, and contigs
Riken-GenoMapper M. musculus cDNA mapping	H. Kiyosawa et al. in preparation	RIKEN clones mapped to mouse genome incl. information on disease, public mouse genes, markers and ESTs
Riken-GenoMapper H. sapiens cDNA mapping	H. Kiyosawa et al. in preparation	RIKEN clones mapped to human genome incl. information on disease, public mouse genes, markers and ESTs
Radiation Hybrid Map	I. Yamanaka et al. J. Struct. Func. Genomics 2, 23-28. (2002)	RIKEN clones mapped to mouse chromosomes based on sequence homology to ESTs of Whitehead mouse T31 radiation hybrid map
Protein sequence
nr (NCBI)	Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002)	non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
SPTR (SwissProt + TrEMBL non-redundant protein set)	Bairoch et al. Nucleic Acids Res. 28, 45-48. (2000)	annotated protein database with minimum redundancy, annotation incl. GO terms and functional sites
PIR NREF	Wu et al. Nucleic Acids Res. 30, 35-37. (2002)	non-redundant reference protein database that includes all sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, and PDB
Domains, motifs and superfamilies
SCOP	Lo Conte et al. Nucleic Acids Res. 30, 264-267. (2002)	structural classification of proteins
SUPERFAMILY	Gough et al. Nucleic Acids Res. 30, 268-272. (2002)	HMM based on the SCOP 'superfamily' level of protein domain classification
Pfam	Bateman et al. Nucleic Acids Res. 30, 276-280. (2002)	semi-automatic protein family database containing multiple protein alignments and profile-HMMs of these families
MDS	Kawaji et al. Genome Res. 12, 367-378. (2002)	novel motifs extracted from SPTR and FANTOM DB
InterPro	Apweiler et al. Nucleic Acids Res. 29, 37-40. (2001)	integrated view of other domain and functional site databases (PROSITE, PRINTS, ProDom and Pfam)
UTRsite and UTRdb	Pesole et al. Nucleic Acids Res. 30, 335-340. (2002)	UTRsite: nucleotide sequence patterns of UTRs where a functional role has been shown experimentally; UTRdb: non-redundant 3' and 5' UTR sequences of eukaryotic mRNAs enriched with annotations about functional elements and repeats
Pathway
KEGG	Kanehisa et al. Nucleic Acids Res. 30, 42-46. (2002)	metabolic and regulatory pathway maps
Disease
OMIM	Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002)	catalog of human genes and genetic disorders
Literature
PubMed		abstracts and bibliographic information of journal articles and books
Gene Ontology
GO database	Ashburner et al. Nat Genet. 25, 25-29. (2000)	gene ontology terms
SNP
dbSNP	Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002)	single nucleotide polymorphisms