EST resources, clone sets, and databases

Janet F. Kelso, University of the Western Cape, Bellville, South Africa


1. Introduction

Gene identification via complete or partial transcript capture has proved to be a valuable and rapid route to gene discovery (Boguski and Schuler, 1995; Schuler et al., 1996). The sequencing of Expressed Sequence Tags (ESTs) has been used in numerous pilot gene identification projects in which it has provided insight into the transcribed genomes of a wide range of organisms.

As a result, many groups, both academic and commercial, have contributed and continue to deposit many thousands of ESTs representing numerous organisms and expression states to public databases. In addition, many also distribute publicly the source clone sets from which these ESTs were generated, providing an experimental resource for expression studies.


2. EST databases

The public EST databases can be divided into (1) those that perform no processing of the incoming data and simply act as data repositories and (2) those that preprocess the data to reduce error and take advantage of sequence redundancy to increase quality.

2.1. Unprocessed EST databases
2.1.1. dbEST http://www.ncbi.nlm.nih.gov/dbEST/
dbEST, the EST division of Genbank, is a public repository for raw EST sequence and annotation information (Boguski et al., 1993; Boguski and Schuler, 1995). More than 20 million ESTs representing in excess of 650 organisms have been deposited in dbEST since its inception in 1991. The organisms most highly represented in dbEST are listed in Table 1. This EST data is available by anonymous ftp from ftp://ncbi.nlm.nih.gov/genbank/. Individual sequences and small batches can be obtained using Entrez (http://www.ncbi.nlm.nih.gov/entrez/).

 

Table 1. Top 10 organisms represented in dbEST (26 February 2004)

Organism                                         ESTs
Homo sapiens (human)                        5 472 005
Mus musculus + domesticus (mouse)           4 056 481
Rattus sp. (rat)                              583 841
Triticum aestivum (wheat)                     549 926
Ciona intestinalis                            492 511
Gallus gallus (chicken)                       460 385
Danio rerio (zebrafish)                       450 652
Zea mays (maize)                              391 417
Xenopus laevis (African clawed frog)          359 901
Hordeum vulgare + subsp. vulgare (barley)     352 924




The highly redundant EST data in dbEST is not clustered or assembled, and only a subset is grouped by species of origin. Unrestricted homology searches against dbEST will therefore commonly return numerous sequences that represent the same gene as the query, paralogous genes, and sequences from related species. Both the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/BLAST/) and the Swiss Institute of Bioinformatics (SIB) (http://www.ch.embnet.org/software/aBLAST.html) offer the ability to search subsets of dbEST restricted by species, with NCBI offering human, mouse, and “other” divisions and SIB offering the ability to select one or more from a large number of divisions including plants, prokaryotes, fungi, invertebrates, zebrafish, human, mouse, and rat.

2.2. Processed EST databases
EST data can be organized and presented in such a way as to produce valuable information about gene expression, including details of the location and timing of transcript expression, alternative splicing, and regulation. In general, the data stored in the large public EST databases is largely unorganized, sparsely annotated, and redundant. The sequences themselves are usually short, unprocessed, and error-prone.

However, the sheer volume of EST data generated by large-scale EST sequencing projects means that a significant improvement in reliability can be gained by taking advantage of EST sequence redundancy to reduce error and increase the length of represented transcripts (Jongeneel, 2000).
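The effect of redundancy on error can be illustrated with a minimal sketch. It assumes the overlapping ESTs have already been aligned and padded with gap characters, and it takes a simple column-wise majority vote; the function name and the toy reads are hypothetical, and production assemblers use considerably more elaborate, quality-aware methods.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over reads that are ALREADY aligned.

    Assumption: every read is padded with `gap` wherever it does not
    cover a column, so all strings have the same length.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Count only real bases; a gap means "no data from this read here".
        counts = Counter(base for base in column if base != gap)
        if counts:
            # Ties are broken by whichever base was seen first in the column.
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Toy example: three 8-base reads tiling a 12-base transcript.
reads = [
    "ACGTTGCA----",
    "--GTAGCATT--",  # contains one sequencing error (A instead of T)
    "----TGCATTGA",
]
print(majority_consensus(reads))
```

In this toy case the single erroneous base is outvoted by the two reads that agree, and the consensus is longer than any individual read, which is the gain the paragraph above describes.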

EST clustering systems preprocess, cluster, and postprocess EST data to yield higher-quality transcript information. Their aim is to construct gene indices: nonredundant catalogs in which all represented transcripts are partitioned into groups (clusters) such that transcripts are placed in the same cluster if they represent the same gene or gene isoform. Gene indices facilitate gene expression studies and novel transcript detection. In addition to clustering, many groups perform transcript reconstruction, using assembled clusters to build a consensus sequence that provides a longer and more accurate representation of the underlying transcript.
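The partitioning step can be sketched as follows. This is an illustration only, not the algorithm used by Unigene, TIGR, or STACK: it assumes that two ESTs belong together when they share a minimum number of exact k-mers, and it merges such pairs transitively with a union-find structure. The function names, the thresholds `k` and `min_shared`, and the toy sequences are all hypothetical, and reverse-complement matches are ignored.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Single-linkage clustering of ESTs by shared k-mer content.

    ests: dict mapping an EST identifier to its sequence.
    Two ESTs are linked when they share at least `min_shared` k-mers;
    links are transitive, so A-B and B-C place A, B, and C together.
    """
    ids = list(ests)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {i: kmers(ests[i], k) for i in ids}
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if len(profiles[a] & profiles[b]) >= min_shared:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: fragments sampled from two made-up transcripts.
gene_a = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
gene_b = "TTACGGATCCTAGGCTAACGTTAGCCATGCAATCGGTAC"
ests = {
    "est1": gene_a[0:24],
    "est2": gene_a[12:36],
    "est3": gene_a[20:39],
    "est4": gene_b[0:24],
    "est5": gene_b[10:39],
}
print(cluster_ests(ests))
```

Note that est1 and est3 overlap by too few bases to be linked directly; they end up in the same cluster only because each overlaps est2. The all-against-all comparison is quadratic in the number of sequences, so it serves to show the idea rather than to scale to millions of ESTs.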

Homology searching against clustered EST collections such as Unigene will result in a more concise report than searching dbEST. Homology searching against clustered databases that provide contigs and consensus sequences for each cluster is very rapid, though the accuracy of the contig production and consensus sequence generation may affect the quality of the matches obtained.

A number of gene indices have been produced using publicly available EST data. These aim to reduce the error and redundancy present in the raw data and thereby to enhance the useful transcript information that can be gleaned from ESTs. The Institute for Genomic Research (TIGR) gene indices (Quackenbush et al., 2000; Quackenbush et al., 2001) and Unigene (Wheeler et al., 2003), produced by NCBI, have focused on providing a reconstruction of the gene complement of various genomes. The STACK database (Christoffels et al., 2001; Miller et al., 1999) has focused on the detection and visualization of transcript variation and the production of accurate consensus sequences that represent the transcript variation in the context of tissue, developmental stage, and pathological states.

Both the TIGR and STACK databases offer BLASTable gene indices on their respective websites.

2.2.1. Unigene http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene
Unigene, based at the National Center for Biotechnology Information (NCBI), is one of the earliest and most enduring efforts for the automatic production of gene indices from Genbank sequences. Each Unigene cluster contains mRNA and EST sequences that represent a unique gene. Additional information such as the identity of the gene, chromosomal map location, and tissue types in which the gene is expressed (from SAGE and EST data) is also provided. NCBI does not generate contigs or consensus sequences for Unigene clusters.

Unigene databases are available for 38 organisms, including human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), zebrafish (Danio rerio), cow (Bos taurus), clawed frog (Xenopus laevis), Arabidopsis (Arabidopsis thaliana), wheat (Triticum aestivum), rice (Oryza sativa), barley (Hordeum vulgare), and maize (Zea mays). Databases are updated weekly with new ESTs and bimonthly with newly characterized sequences. All Unigene databases are available for download from ftp://ncbi.nlm.nih.gov/repository/UniGene/.

Unigene clusters can be searched by gene name, Unigene cluster ID, chromosomal location, cDNA library, accession number, and text terms. Sequence-based searches against Unigene human, rat, and mouse databases are available from the SIB at http://www.ch.embnet.org/.

Unigene has been used to select unique transcripts for the construction of a cDNA microarray for the large-scale analysis of gene expression, and to provide candidates for the production of a human gene map.

For information on the construction of Unigene, see http://www.ncbi.nlm.nih.gov/UniGene/build.html.

2.2.2. TIGR Gene Indices http://www.tigr.org/tdb/tgi.shtml
TIGR produces gene indices for more than 40 organisms including various animal, plant, protist, and fungal species (Quackenbush et al., 2001; Quackenbush et al., 2000). The TIGR indices incorporate ESTs sequenced at TIGR, ESTs from dbEST, and mRNAs from Genbank. Each TIGR cluster is represented by a Tentative Consensus sequence (TC, or THC in the case of Tentative Human Consensi), which is a FASTA-formatted sequence with a unique accession as well as additional information including details of the assembly, putative gene identification, and a list of tissues in which the transcript is expressed. Related databases generated by TIGR provide additional information about TCs. The Genomic Maps database provides genomic mapping for a subset of organisms for which TCs are available. The TIGR Orthologous Gene Alignment database (TOGA) (Lee et al., 2002) provides information about orthologous sequences between TCs for the organisms for which TIGR gene indices have been generated.
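Because TCs are distributed as FASTA-formatted sequences, a downloaded index can be read with a few lines of code. The sketch below is a generic FASTA reader; the accessions and header text in the example are made up, and the assumption that the accession is the first whitespace-delimited token of the header line should be checked against the actual release files.

```python
def _record(header, chunks):
    """Split a header into (identifier, description) and join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def read_fasta(lines):
    """Yield (identifier, description, sequence) from FASTA-formatted lines.

    Assumption: the identifier is the first token after '>' and the rest
    of the header line is free-text description.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


# Hypothetical entries; real TC headers carry more annotation.
example = """\
>TC0001 hypothetical example entry
ATGGCCATTGTAATGGGCC
GCTGAAAGGGTGCCCGATAG
>TC0002
TTACGGATCCTAGG
""".splitlines()

records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```

The function accepts any iterable of lines, so an open file handle can be passed in place of the in-memory example.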

The TIGR databases are freely available to researchers at nonprofit organizations at http://www.tigr.org/tdb/tgi.shtml. The TIGR Human Gene Index (HGI) is produced annually. The frequency of new releases varies between species and depends on the accumulation of new transcripts. The TIGR gene indices can be searched by nucleotide or protein sequence, EST, transcript or consensus identifiers, tissue, cDNA library name or library identifier, gene product name, or functional classification according to Gene Ontology (GO) terms (Ashburner et al., 2000).

2.2.3. STACK http://www.sanbi.ac.za/Dbases.html
The STACK human gene index is generated by clustering EST and mRNA data, and offers human transcript consensus sequences that reflect gene expression forms and alternate expression variants within 15 tissue-based categories and one disease category (Christoffels et al., 2001; Miller et al., 1999). This organization of transcripts by expression site presents the opportunity to explore transcript expression in specific tissues or subsets such as disease-related sequences.

Each STACK cluster contains alignments, consensus sequences, and assembly information, and is dynamically linked to the Unigene database. Web-based software allows for the visualization of clusters and alignments, and highlights transcript variation. STACK database releases are made available with varying frequency – on average twice a year. STACKdb and the stackPACK toolset used to generate STACK are freely available to academic groups and can be downloaded from http://www.sanbi.ac.za/CODES. Sequence-based searching of STACKdb tissue divisions is available at http://juju.egenetics.com/cgi-bin/stackpack/blast.py.

STACKdb has been used to support the detection of a novel retinal-specific gene responsible for retinitis pigmentosa. The stackPACK toolset has been used in the production of various gene indices and for the survey of genes in the malarial genome.

2.3. cDNA clone sets
The availability of clones representing partial or full-length transcripts in an organized public collection is a critical resource for ongoing genomic research. Clone sets have applications in gene discovery, a range of functional genetic analyses, and also as substrates for microarray expression studies.

2.3.1. I.M.A.G.E http://image.llnl.gov/
Recognizing the need for a publicly available clone collection, the Integrated Molecular Analysis of Genomes and their Expression (I.M.A.G.E.) consortium was formed in 1993. The aim of the group was to make cDNA clones representing all known genes, as well as their sequence, map, and expression information, publicly available in order to facilitate biological research (Lennon et al., 1996). To this end, the group has generated more than 7.5 million clones from over 882 cDNA libraries representing more than 50 human tissues, as well as mouse, rat, zebrafish, rhesus monkey, Fugu, and Xenopus.

EST projects that are part of the I.M.A.G.E. consortium are listed in Table 2.

 

Table 2. Some of the major I.M.A.G.E. consortium EST projects

Project: Merck/WashU EST Project
Date: 1995–1997
URL: http://genome.wustl.edu/est/index.php?human_merck=1
Description: A major early contribution to dbEST was the sequencing of 584 000 ESTs by Washington University under a project launched by Merck and Co. The cDNA libraries were constructed by Bento Soares at Columbia University, and arraying for high-throughput processing was performed by Greg Lennon at the Lawrence Livermore National Laboratory (Boguski and Schuler, 1995).

Project: HHMI/WashU Mouse EST Project
Date: 1996–1998
URL: http://genome.wustl.edu/est/index.php?mouse=1
Description: Approximately 400 000 mouse ESTs were contributed to dbEST by Washington University under the sponsorship of the Howard Hughes Medical Institute.

Project: Cancer Genome Anatomy Project (CGAP)
Date: 1997–(ongoing)
URL: http://cgap.nci.nih.gov
Description: Through funding from the National Cancer Institute (NCI), human clones, largely from NCI-CGAP and ORESTES libraries, have been sequenced by Washington University and the NIH Intramural Sequencing Center (NISC). The aim of CGAP is to determine the gene expression profiles of normal, precancer, and cancer cells with a view to improving cancer diagnosis and treatment. Since 1999, CGAP has also contributed libraries and sequences for mouse, rat, Xenopus, and primate.

Project: WashU Zebrafish EST Project
Date: 1997–2002
URL: http://genome.wustl.edu/est/index.php?zebrafish=1
Description: cDNA libraries produced by the zebrafish research community prior to 2002 have been arrayed by the I.M.A.G.E. consortium and sequenced at Washington University.

Project: NIH Zebrafish Gene Collection
Date: 2002–(ongoing)
URL: http://zgc.nci.nih.gov/
Description: More than 3400 full-length ORF zebrafish clones have been produced through this initiative sponsored by the National Institutes of Health and are available for public research.

Project: WashU Xenopus EST Project
Date: 1999–2002
URL: http://genome.wustl.edu/est/index.php?xenopus=1
Description: cDNA libraries produced by the Xenopus research community and CGAP prior to 2002 have been arrayed by the I.M.A.G.E. consortium and sequenced at Washington University.

Project: NIH Xenopus Gene Collection
Date: 2002–(ongoing)
URL: http://www.ncbi.nlm.nih.gov/genome/flcdna/prj.cgi?prjid=15
Description: More than 600 full-length ORF Xenopus clones have been produced through this initiative sponsored by the National Institutes of Health and are available for public research.

Project: University of Iowa Rat project
Date: 1998–(ongoing)
URL: http://ratest.eng.uiowa.edu/
Description: Clones from this project are arrayed and sequenced at the University of Iowa, sponsored by the National Institutes of Health. Unique clones are rearrayed and given I.M.A.G.E. clone IDs (located in the comment field of the Genbank entry). Of these sequence-verified clones, 25 000 are currently available through the I.M.A.G.E. distributors.

Project: The Mammalian Gene Collection (MGC)
Date: 1999–(ongoing)
URL: http://mgc.nci.nih.gov/
Description: Initiated in 1999 as a collaborative effort between various institutes of the NIH, the Mammalian Gene Collection project aims to provide a catalog of full-length mammalian genes (Strausberg et al., 1999). The project has focused initially on the production of full-length cDNAs for human and mouse, and will later extend to include other mammals. Clones produced by the project are prepared from high-quality mRNA extracted from cell lines or tissues. Clones are made available through the I.M.A.G.E. consortium, while 3′ and 5′ ESTs are generated and released to dbEST. An ongoing informatics challenge is the selection of clones likely to represent full-length transcripts. In the initial phases of the project, clones with inserts of up to 3 to 4 kb were sequenced using techniques such as shotgun sequencing, primer walking, and concatenation. Sequence data is generated to the same standards as those specified by the Human Genome Project – finished sequence is therefore 99.99% accurate. Annotation of the sequence data is also performed. As of January 2002, a nonredundant set of more than 20 000 putative full-length human and mouse clones has been identified, and full sequences for 9000 human and 4000 mouse clones have been produced. Of the selected clones, 75% contain full-length ORFs. Clone library lists, clone lists, and insert sequences in FASTA format are available for download from http://mgc.nci.nih.gov/. Sequenced clones can be searched using BLAST at the same site. Additionally, the genes represented by MGC clones can be searched by gene name or keyword at the website.




The sequences from the I.M.A.G.E. clones are submitted to dbEST, and the clones themselves are generally available royalty-free through a network of distributors in the United States and Europe (http://image.llnl.gov/image/html/idistributors.shtml). These distributors generally provide added services that allow users to select the most appropriate clones for their research. The I.M.A.G.E. consortium distributors in Europe and the United States are listed in Table 3.

 

Table 3. I.M.A.G.E. consortium clone distributors in Europe and the United States of America

Continent: Europe
Distributor: RZPD
URL: http://www.rzpd.de/products/clones/
Description: As a nonprofit service facility, the RZPD provides research materials and data including cDNA clone sets. Nonredundant, sequence-verified cDNA clone sets are available for human, mouse, and rat, and full-length and open-reading-frame (ORF) clones are available for more than 30 organisms. More than 35 000 000 clones representing over 1000 cDNA libraries, including the I.M.A.G.E. collection, are represented. The linking of various public databases to the available clone sets provides researchers with the ability to search for clones using Gene Data, Clone ID, Unigene Cluster, Chromosome Location, Genbank Accession ID, or Affymetrix Probeset ID.

Continent: Europe
Distributor: MRC geneservice
URL: http://www.hgmp.mrc.ac.uk/geneservice/index.shtml
Description: MRC geneservice distributes cDNA clone sets including the I.M.A.G.E., MGC, and NIA mouse cDNA collections, as well as other human, mouse, rat, Drosophila, Fugu, Xenopus, and chicken clones. Various software tools are provided to allow users to interrogate the clone sets.

Continent: USA
Distributor: ATCC
URL: http://www.atcc.org/
Description: ATCC is a commercial distributor of the I.M.A.G.E., MGC, and NIA mouse clone sets, as well as a variety of clones for human, mouse, rat, and pine. More than 500 000 partially sequenced cDNA clones representing the majority of human genes and more than 300 000 murine cDNAs and 20 000 rat cDNAs are represented in the collection.

Continent: USA
Distributor: Invitrogen
URL: http://clones.invitrogen.com/index.php
Description: The clones distributed by Invitrogen are assembled from public resources including I.M.A.G.E. Researchers can search for clones of interest using CloneRanger™ (http://clones.invitrogen.com/cloneranger.php). The collection comprises approximately 10 million clones across a wide range of species and can be searched by clone ID, NCBI accession, Unigene cluster ID, LocusLink ID, keyword, sequence, or plate ID.

Continent: USA
Distributor: Open Biosystems
URL: http://www.openbiosystems.com/clone_collections.php
Description: Open Biosystems distributes a large number of clone sets including the I.M.A.G.E. and Incyte sets. Organisms included are human, mouse, rat, dog, pig, Drosophila, Xenopus, C. elegans, monkey, and zebrafish. Clones can be searched by Genbank accession, gene name, and clone ID.




The I.M.A.G.E. consortium uses and provides IMAGEne, a toolset that clusters the sequences from I.M.A.G.E. cDNA clones (Cariaso et al., 1999). Known gene clusters are based on NCBI's RefSeq, and candidate gene clusters are those with no known gene association. Those clones whose sequences do not match any other cDNA are grouped as Singletons. Users of the system are able to query against these cluster sets to obtain a list of available I.M.A.G.E. clones aligned with the corresponding known gene or consensus sequence. By offering the best representative clones that are available for order from the I.M.A.G.E. clone set, IMAGEne provides a valuable laboratory research tool. IMAGEne is available via the web at http://image.llnl.gov/image/imagene/current/bin/search.
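The three cluster categories described above can be expressed as a simple rule. The sketch below is a simplification for illustration and is not IMAGEne's implementation: it assumes each cluster is given as a set of sequence identifiers that may include a reference sequence, and that membership in a supplied set of RefSeq identifiers is what marks a known gene. All identifiers in the example are hypothetical.

```python
def classify_cluster(members, refseq_ids):
    """Label a cluster as 'known gene', 'candidate gene', or 'singleton'.

    members:    set of sequence identifiers placed in one cluster
    refseq_ids: set of identifiers treated as known-gene references
    Assumption: a cluster anchored by any reference identifier is a
    known gene; otherwise its size decides between the other labels.
    """
    if members & refseq_ids:
        return "known gene"
    if len(members) == 1:
        return "singleton"
    return "candidate gene"


# Hypothetical identifiers, for illustration only.
known_refs = {"REF_0001"}
clusters = {
    "cluster_a": {"REF_0001", "clone_11", "clone_12"},
    "cluster_b": {"clone_21", "clone_22"},
    "cluster_c": {"clone_31"},
}
labels = {name: classify_cluster(m, known_refs) for name, m in clusters.items()}
print(labels)
```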

3. Conclusion

The EST databases provide an invaluable view of the transcriptomes of a large and growing number of organisms. The availability of EST data, and the associated annotation information, including details of the tissue source, provides an early expression map of the transcriptomes for these organisms. The public availability of clone sets representing a large number of organisms, tissues, diseases, and developmental stages is a valuable and ongoing resource for expression profiling and functional genomics studies.


4. Further reading

a) Raw EST resources

Raw EST data: dbEST
    dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
    Download all sequences: ftp://ncbi.nlm.nih.gov/genbank/
    Download individual sequences and small batches: http://www.ncbi.nlm.nih.gov/entrez/
    BLAST searchable dbEST: http://www.ncbi.nlm.nih.gov/BLAST/ and http://www.ch.embnet.org/software/aBLAST.html

EST Tracefile archives
    Washington University Traces Viewer: http://genome.wustl.edu/est/est_search/nci_viewer.html
    NCBI Trace Archive: http://www.ncbi.nlm.nih.gov/Traces/

General information about ESTs
    Washington University Genome Sequence Center: http://genome.wustl.edu/est/

b) Mining EST data

Jongeneel CV (2000) Searching the expressed sequence tag (EST) databases: panning for genes. Briefings in Bioinformatics, 1(1), 76–92.

c) Gene indices

Unigene
    Unigene build information: http://www.ncbi.nlm.nih.gov/UniGene/build.html
    Download Unigene: ftp://ncbi.nlm.nih.gov/repository/UniGene/
    BLAST searchable Unigene: http://www.ch.embnet.org/
    Pontius J, Wagner L and Schuler G (2002) Unigene: a unified view of the transcriptome. NCBI Handbook, http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch21d1.pdf

TIGR
    TIGR information and download: http://www.tigr.org/tdb/tgi.shtml
    BLAST searchable TIGR Gene Indices: http://tigrblast.tigr.org/tgi/

STACK
    STACK information and download: http://www.sanbi.ac.za/Dbases.html
    BLAST searchable STACKdb: http://juju.egenetics.com/stackpack/webblast.html

d) Gene indices incorporating genome data

    Ensembl: http://www.ensembl.org/
    RefSeq: http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html
    AllGenes: http://www.allgenes.org/
    Pruitt KD, Katz KS, Sicotte H and Maglott DR (2000) Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends in Genetics, 16, 44–47.