Sequence and Other Non-Bibliographic Databases

Introduction

see also Definitions, Glossaries, and Dictionaries
see also Guides, Tutorials and Primers
see also Recommended Reading

Sequence and other non-bibliographic databases are the central information resource in this field. The sheer number of databases makes selection confusing, and the databases themselves can be challenging to understand and navigate. Neither nomenclature nor data formats and metadata schemes are standardized. Databases struggle with data redundancy and with charges that they contain a lot of "junk." A great deal of interrelated information surrounds a gene (genome location, structure, sequence, expression information, chemistry, etc.) or a protein, which leads to somewhat complicated database structures and to links among related databases that may or may not be intuitive. Three-dimensional structures place additional requirements on metadata, and the volume of sequence data pouring into these databases keeps increasing (take a look at the exponential growth of GenBank at http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html). A great deal of this growth is in the form of direct submissions that may or may not have been peer reviewed. Whether peer reviewed or not, the data are subject to frequent changes and updates as new information becomes available. There are many approaches to solving these problems, which means multiple data structures, multiple search interfaces, and many specialized databases.

As mentioned earlier, there are hundreds of databases that might be considered relevant to bioinformatics. There are specialized databases for each species, and separate databases for different types of information (nucleic acid sequences, protein sequences, protein structures, biochemical and biophysical information, etc.). There is also a great redundancy of databases, with multiple databases covering nearly the same information for the same organisms. This situation arose in part from many researchers developing their own databases in their own formats over the years, and from databases developing in parallel in Europe, Japan and the United States. The situation is further complicated by the existence of several versions or mirrors of the same database on different servers (each with varying degrees of currency or completeness), and by the sharing of records between databases. For example, the Entrez search system draws data from SWISS-PROT but only includes SWISS-PROT records for proteins that are based upon nucleotide sequence data that meet the criteria for inclusion in GenBank. A search of SWISS-PROT through another interface may retrieve more records as well as more detail in each record.

The following list of databases is intended to orient the reader to the major databases. To become a proficient searcher in each database takes considerable training, which is beyond the scope of this guide. This database list is highly selective, including only a few representatives of each type. Emphasis is placed on the larger, better known databases, on the free public databases, and on those that cover human data. Grouping databases by type is a common and useful way of organizing them, but many databases provide more than one type of information to the user so bear in mind that this classification is not precise.

A review of the basic genetic terms and concepts is highly recommended before approaching the sequence databases. See the Definitions, Glossaries, and Dictionaries and the Guides, Tutorials and Primers sections of this guide for recommended sources.

Database Directories and Lists

see also Comprehensive Web Sites

The large number of databases naturally leads to the impulse to create database directories and lists. Many of the comprehensive web sites listed in this guide also provide such lists and are worth consulting.

Nucleic Acids Research: Annual Database Issue - http://www.nar.oupjournals.org/content/vol30/issue1/ [subscription required for access]
For the past several years the journal Nucleic Acids Research published by Oxford University Press has devoted the first issue of each year to listing and describing the many molecular biology and bioinformatics databases. This issue always includes many informative articles describing selected databases in depth (several of these are in the Recommended Reading section of this guide) as well as a comprehensive list of databases called the Molecular Biology Database Collection (see http://www.nar.oupjournals.org/cgi/content/full/30/1/1/DC1). In 2002 this list included 335 databases, up from 281 the year before. The list can be accessed by category/type of database (there are eighteen categories in 2002) or alphabetically by title. Click on the short description of each database to access a paragraph-long description written by a researcher familiar with the database.

Introduction to Molecular Biology Databases - {http://www.ebi.ac.uk/swissprot/Publications/mbd1.html}
Although not technically a directory, this article, written in 1999, is a very helpful introduction to the major databases, including many of the organism specific ones that are outside the scope of this guide. A good starting place for the non-specialist.

Sequence Retrieval System (SRS) database descriptions - {http://downloads.lionbio.co.uk/publicsrs.html}
This database of database descriptions can be accessed in a number of ways. From the Public SRS Servers List ({http://downloads.lionbio.co.uk/publicsrs.html}) you can view an alphabetical list of databases (scroll down past the list of servers at the top to see the list of databases or "libraries" in the middle of the page). Click on the link for the server that is hosting the database you are interested in to view the record for the version of that database as mounted on that server. Or, you can link to the database descriptions from the search interface of a particular server (for example, see the European Bioinformatics Institute server at http://srs6.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+top+-newId). This method is particularly helpful because here the databases are sorted by database type (e.g., sequence libraries, protein 3D structures, mutations, SNP, metabolic pathways, etc.). Click on the plus sign next to the type to expand the list, then click on the database name to access the description of that database as mounted on that server. You can also go one step further and actually initiate your search in those databases from this page as well.

Nucleotide Sequences

GenBank - http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html, and Entrez Nucleotides Database - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
GenBank is the nucleotide sequence database built and distributed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. As of this writing, GenBank contains more than 13 billion bases from over 100,000 species, and is growing exponentially (see http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html). The data are obtained through direct submission of sequence data from individual laboratories, from large-scale sequencing projects, and from the US Patent and Trademark Office. A little more than half of the total sequences in the database are from Homo sapiens.

There are two ways to search GenBank: a text-based query can be submitted through the Entrez system at {http://www.ncbi.nlm.nih.gov/Entrez/index.html}, or a sequence query can be submitted through the BLAST family of programs (see http://www.ncbi.nlm.nih.gov/BLAST/). To search GenBank through the Entrez system you would select the Nucleotides database from the menu. The Entrez Nucleotides Database is a collection of sequences from several sources, including GenBank, RefSeq, and the Protein Databank, so you don't actually search GenBank exclusively. Searches of the Entrez Nucleotides database query the text and numeric fields in the record, such as the accession number, definition, keyword, gene name, and organism fields, to name just a few. So, for example, you could enter the terms Bacillus anthracis and you would be presented with many records that contain and describe nucleotide or protein sequences related to the anthrax bacterium. The accession number is very handy, because it is a unique and persistent identifier for the GenBank entry as a whole and doesn't change even if there is a later change or update to the sequence or annotation. Nucleotide sequence records in the Nucleotides database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated. To become an effective searcher of this database takes study. For starters, take the Nucleotides database online tutorial that starts at {http://www.ncbi.nlm.nih.gov/Database/tut1.html}, and consult the other resources available from the NCBI Education Page at {http://www.ncbi.nlm.nih.gov/Education/}. See also the Recommended Reading section of this guide.
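For readers who prefer to script a text-based search rather than type it into the web form, the same kind of query can be expressed as a URL. The following is only a minimal sketch: it assumes NCBI's E-utilities "esearch" endpoint and its db, term and retmax parameters, none of which are described in this guide, and the function name build_nucleotide_query is hypothetical. The sketch builds the URL but does not send it.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities "esearch" service. This is an assumption of
# the sketch; the guide itself only describes the interactive Entrez pages.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_nucleotide_query(term, field=None, max_records=20):
    """Return an esearch URL for a text query against the nucleotide database.

    The function name and defaults are hypothetical, chosen for illustration.
    """
    # A field tag such as [Organism] or [Accession] restricts the query to
    # one field of the record instead of searching all text fields.
    if field:
        term = f"{term}[{field}]"
    params = {"db": "nucleotide", "term": term, "retmax": max_records}
    # urlencode escapes spaces and brackets so the URL is safe to send.
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # The organism example used in the text above.
    print(build_nucleotide_query("Bacillus anthracis", field="Organism"))
```

Restricting the query to the organism field mirrors the advice above about learning which fields a search actually touches; an unrestricted term would match any text field in the record.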

If you have obtained a record through a text-based Entrez Nucleotides Database search you can read the nucleotide sequence in the record. However, most researchers wish to submit a nucleotide sequence of interest to find the sequences that are most similar to theirs. This is done using the BLAST (Basic Local Alignment Search Tool) programs. You select the BLAST program you wish to use depending upon the type of comparison you are doing (nucleotide to nucleotide, or nucleotide to protein sequence, etc.) and then you select the database to run the query in (any of several nucleotide or protein databases). Many NCBI databases accept BLAST searches, as do many of the other databases covered elsewhere in this guide. The result is a detailed report that summarizes your query, provides a graphical overview of database matches, indicates the statistical significance of the matches and describes each significant alignment. From here you can link to the full database record for the individual matches. You can learn more about BLAST searching from the NCBI BLAST educational page at {http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html} (read the online tutorial).
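To give a feel for what "most similar" means in a sequence search, here is a small sketch of local alignment scoring. It is not BLAST: BLAST uses a faster heuristic, whereas this sketch uses the classic Smith-Waterman dynamic programming method, and it reports only the best score rather than the alignment itself. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score between a and b."""
    # prev is the previous row of the scoring matrix; column 0 is always 0.
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # ... or open a gap in one of the two sequences.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Two sequences that share the stretch ACGT but differ elsewhere.
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A higher score means a longer or cleaner shared stretch. A real BLAST report adds the statistical significance of each match, which this sketch does not attempt.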

EMBL Nucleotide Sequence Database - http://www.ebi.ac.uk/embl/
"The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis."

From the home page you can submit simple text searches to the EMBL Nucleotide Sequence Database, or to the Protein Databank (what you search when you select protein structures from the menu) or to a protein sequence database called Swall. For more complex searches, they recommend accessing the databases through the Sequence Retrieval System (SRS) server (http://srs.ebi.ac.uk/). SRS is a database querying / navigation system, similar in function to the Entrez system. It allows you to simultaneously search across several databases and to display the results in many ways. SRS can be used to access a large number of databases, including EMBL, SWISS-PROT and the Protein Databank, depending upon the configuration of the particular SRS server you are using. The structure and content of an EMBL Nucleotide record is very similar to that of an NCBI Entrez Nucleotide database record.

Genome Databases

Entrez Genome - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
"The whole genomes of over 800 organisms can be found in Entrez Genomes. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses and organelles." Text searches can be done from the main page. Data can also be accessed alphabetically by species {http://www.ncbi.nlm.nih.gov:80/PMGifs/Genomes/allorg.html}), or hierarchically by drilling down through a taxonomic list to a graphical overview for the genome of that organism, then to specific chromosomes, then on to specific genes. At each level are maps, pre-computed summaries, analysis appropriate to that level, and links to related records from a variety of other Entrez databases. BLAST searches of some genomes are also possible.

Very useful pages for some of the most commonly studied species (e.g., human, mouse, fruit fly, malarial parasite) can be found on the Genomic Biology page under "organism-specific resources" (http://www.ncbi.nlm.nih.gov/Genomes/). These pages are so detailed that each could be classified as a comprehensive web site in itself. Each one brings together links to the genomic data, useful tools, related data sources and news about the genome of that species. The Human Genome Guide (http://www.ncbi.nlm.nih.gov/genome/guide/human/) is particularly rich.

Human Genome Browser from UCSC - http://genome.ucsc.edu/
"The sequence of the human genome is too big to see at all at once; few people want to look at raw DNA sequence anyway. The alternative is the Human Genome Browser for a quick display of any requested portion of the genome at any scale, along with more than two dozen tracks of information (genes, ESTs, CpG islands, assembly gaps, chromosomal band, ...) associated with the complete human genome sequence... Clicking on a displayed feature opens a second window providing protein sequence, coordinates and accession numbers, as appropriate. Clicking in the corner of the display calls up raw DNA sequence corresponding to the display window boundaries. This look-up feature is far more convenient than manual retrieval of a precise coordinate range from GenBank entries."

The Genome Database (GDB) - {http://www.gdb.org/}
The Genome Database is the official central repository for genomic mapping data resulting from the Human Genome Initiative. The database contains three types of data: (1) regions of the human genome, including genes, clones, and ESTs, (2) maps of the human genome, including cytogenetic maps, linkage maps, radiation hybrid maps, content contig maps, and integrated maps (these maps can be displayed graphically via the Web), and (3) variations within the human genome including mutations and polymorphisms, plus allele frequency data. There are options to browse genes by chromosome, genes by symbol name, and genetic diseases by chromosome. There are multiple ways to search, including text-based searches for people, citations, segment names or accession numbers, and sequence searching via BLAST.

KEGG: Kyoto Encyclopedia of Genes and Genomes - {http://www.genome.jp/kegg/}
This database often appears in Google search results, so let's put it in context. Despite the name, this is actually a biochemical pathway database and gene catalog, not an encyclopedia in the book sense. "The primary objective of KEGG is to computerize the current knowledge of molecular interactions; namely, metabolic pathways, regulatory pathways, and molecular assemblies. At the same time, KEGG maintains gene catalogs for all the organisms that have been sequenced and links each gene product to a component on the pathway. Because we need an additional catalog of building blocks, KEGG also organizes a database of all chemical compounds in living cells and links each compound to a pathway component."

Protein Sequences

SWISS-PROT - {http://web.expasy.org/groups/swissprot/}
"SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases." "The data in Swiss-Prot are derived from translations of DNA sequences from the EMBL Nucleotide Sequence Database, adapted from the Protein Identification Resource (PIR) collection, extracted from the literature and directly submitted by researchers. It contains high-quality annotations, is non-redundant, and cross-referenced to several other databases, notably the EMBL nucleotide sequence database, PROSITE pattern database and PDB."

From the home page, a quick text search can be done by accession or ID number, description, gene name, or organism. By searching SWISS-PROT through the Sequence Retrieval System (SRS) more sophisticated searches can be performed and the format of the results can be customized. Access to SWISS-PROT (directly or via SRS) and links to many other proteomics resources are available from the ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) at {http://us.expasy.org/}. The SWISS-PROT records are quite detailed. Be advised that other databases or search systems that import SWISS-PROT data may not always provide access to the entire SWISS-PROT record.

Entrez Protein Database - http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?dB=Protein
"The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to PIR, SWISS-PROT, PRF, and the Protein Data Bank (PDB) (sequences from solved structures)." The native SWISS-PROT records usually contain more detailed annotations than will be obtained from Entrez Protein Database records derived from SWISS-PROT records. In typical Entrez fashion, results from a search of the Protein database link to PubMed, to the taxonomy database, to related sequences, and in some cases to pre-computed BLAST search results (look for BLink links).

Protein Information Resource - International Protein Sequence Database (PIR-PSD) - http://pir.georgetown.edu/
In 1988 the Protein Information Resource (PIR), which is affiliated with Georgetown University Medical Center, established a cooperative effort with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID) to collect, publish and distribute the PIR-International Protein Sequence Database (PIR-PSD). They describe the database as "a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain". Text searches can be done in the title, species, author, citation, keyword, superfamily, feature and gene name fields. Gapped-BLAST sequence similarity searches are also an option. Note that both SWISS-PROT and the Entrez Protein database contain data adapted from the PIR.

Protein Structure

Protein Data Bank (PDB) - http://www.rcsb.org/pdb/
The PDB was established at Brookhaven National Laboratory in 1971, making it the first public bioinformatics database. The PDB is now operated by the Research Collaboratory for Structural Bioinformatics (RCSB), which is a collaborative effort of the San Diego Supercomputer Center, Rutgers University, and the National Institute of Standards and Technology (NIST). The PDB is a repository of experimentally determined three-dimensional structures of biological macromolecules (proteins, enzymes, nucleic acids, protein-nucleic acid complexes, and viruses) derived from X-ray crystallography and NMR experiments (see {http://www.rcsb.org/pdb/experimental_methods.html} for a helpful overview of these methods). Depositing structures obtained from theoretical models is discouraged. Data are deposited by the international user community and maintained by the RCSB PDB staff. Approximately 50-100 new structures are deposited each week. A variety of information associated with each structure is available, including "sequence details, atomic coordinates, crystallization conditions, 3-D structure neighbors computed using various methods, derived geometric data, structure factors, 3-D images, and a variety of links to other resources."

There are three ways to search the PDB. The SearchLite interface accepts text queries using Boolean operators, and searches the text fields such as the author, compound, molecule class, and keywords fields. The SearchFields interface is an advanced search option that allows you to choose specific fields in which to search and to apply various limits. It also allows you to customize the format of the results. The third search method requires leaving the PDB site, going to the NCBI Entrez site and performing a NCBI BLAST sequence search with "pdb" selected as the target database. See the notes on protein BLAST searching at http://www.ncbi.nlm.nih.gov/blast/html/BLASThomehelp.html#AABLAST.

Since this is such an old database, historic inconsistencies in the way data are reported within PDB records may lead to unexpected or incomplete results when searching, particularly for text-based information. Certain keywords, like alpha, are not properly searchable. For example, looking for alpha hemolysin fails to find anything, but a search on hemolysin alone results in ten hits, including 7AHL, which is alpha hemolysin. The PDB file format itself also has numerous flaws, but remains the most widely accepted format for structural data. The database producers are aware of these problems and are working to solve them. Several software packages can be used to view PDB files in 3D, including the RasMol and Chime browser plug-ins and Deep-View. For more information see the PDB Query Tutorial at {http://www.rcsb.org/pdbstatic/tutorials/LargeBeta.swf} and the PDB Documentation and Information page at {http://www.rcsb.org/pdb/info.html#General_Information}. See also the entry for the MMDB below, which is a subset of the PDB with some added features.
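Because the PDB file format is a fixed-column text format, the atomic coordinates mentioned above can be read with simple string slicing. The sketch below assumes the column positions documented for ATOM and HETATM records in the PDB format guide. The sample record is built field by field so the column widths are visible; its values are illustrative and are not taken from any particular PDB entry. The function name parse_atom_line is hypothetical.

```python
# One illustrative ATOM record, written as adjacent string pieces so that
# each fixed-width field can be seen on its own (values are made up).
SAMPLE_ATOM_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # column 27 (insertion code) plus three blank columns
    "  27.340"    # columns 31-38: x coordinate
    "  24.430"    # columns 39-46: y coordinate
    "   2.614"    # columns 47-54: z coordinate
)


def parse_atom_line(line):
    """Return a dict of selected fields from an ATOM/HETATM line, else None."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    # Python slices are zero-based, so column 7 is index 6, and so on.
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_ATOM_LINE))
```

Splitting on whitespace would be simpler, but it breaks when adjacent fields run together with no space between them. Reading by column position avoids that problem.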

MMDB: Molecular Modeling DataBase - http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
The MMDB is NCBI's structure database. It is a subset of three-dimensional structures obtained from the Protein Databank (PDB), excluding theoretical models. MMDB adds value through the addition of explicit chemical graph information and through the cross-linking of structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. The explicit bond information makes for more consistent interpretation of the coordinate data by visualization software. MMDB can provide data for three different structure viewers: Cn3D, a viewer developed by the NCBI; RasMol; and MAGE. All three are available for a variety of platforms (Windows, MacOS, UNIX). After installing the software, the 3-dimensional structure can be viewed by clicking the button labeled View/Save Structure close to the bottom of each structure summary.

The structure database may be queried directly, using accession numbers or text terms such as author names, protein names, species names or publication dates. The result will yield "Structure Query" pages, providing access to entries which matched the keywords. From the Structure Summary pages of an individual matching entry one may access amino acid and nucleic acid sequences, retrieve PubMed documents, get taxonomy information, and launch the software to view the 3D image.

The MMDB documentation also notes that:

"The structure database is considerably smaller than Entrez's protein or nucleotide databases, but a large fraction of all known protein sequences have homologs in this set, and one may often learn more about a protein by examining 3-D structures of its homologs. Protein sequences from MMDB are extracted and available in the Entrez protein sequence database. They are linked to the 3-D structures, therefore it is possible to determine whether a protein sequence in Entrez has homologs amongst known structures by examining its Related Sequences or Protein Neighbors and checking whether this set has any Structure Links."

Software

Software Directories

Bioinformatics Software Resource (BISR) - {http://bioinfo.nist.gov/BISR/}
A catalog and clearinghouse of links to bioinformatics and computational biology software and resources. Over 400 packages are currently available, more than 70% of the software is free, and a variety of operating systems are supported. This database is maintained by the Chemical Science and Technology Laboratory of the National Institute of Standards and Technology (NIST).

Database Searching, Browsing and Analysis Tools - http://www.ebi.ac.uk/Tools/index.html
A list of software tools (programs) you can use via the web to submit queries to the sequence databases and to analyze the results of those queries. This list is from the European Bioinformatics Institute. See also the ExPASy Proteomics Tools list below.

ExPASy Proteomics Tools - {http://www.expasy.org/links.html}
Tools for proteomics that may be used over the web, covering such categories as protein identification and characterization, similarity searches, secondary structure prediction, and sequence alignment.

Genamics SoftwareSeek - http://genamics.com/software/index.htm
A repository and database of over 1200 free and commercial tools for use in molecular biology and biochemistry. Windows, MS-DOS, Mac, Unix and Linux platforms are supported, as well as online tools that run through your Internet browser. You may browse by category (such as DNA sequence analysis, molecular modeling, or protein structure prediction) or you may search by platform, program name or keyword.

Freshmeat Open Source Software Repository - http://freshmeat.net/
This database of UNIX and cross-platform open source software is a good source for molecular modeling and visualization programs and contains a smattering of bioinformatics applications. Each entry provides a history of the project's releases (very useful for spotting stale code) and a popularity ranking. See also Open Source Software Promoters.


Comprehensive Web Sites

see also Database Directories and Lists

Amos' WWW Links Page - {http://au.expasy.org/links.html}
This is a massive list of "information sources for life scientists with an interest in biological macromolecules" and one that is often referenced by other bioinformatics web sites. Amos' page is hosted by the ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics, yet the disclaimer points out that the content entirely reflects the interests of one man: Amos Bairoch, a researcher at the Swiss Institute of Bioinformatics. Annotations are extremely brief (sometimes only three words) so they provide little guidance for the novice. However, the sheer extent of the list (over 1000 entries) coupled with its logical headings, its "scanability" (the lack of annotations does make it easier to browse quickly), and its prominence in the field make it worth noting here.

GenomeWeb - {http://www.hgmp.mrc.ac.uk/GenomeWeb/}
This is a very extensive list of sites. When browsing, click on the "i" icon next to each title to access the annotations at the bottom of the page. While the directory organization is not as clear as one would like, there is a keyword search option that helps. GenomeWeb is to be commended for tackling the task of annotating every record and for verifying all of its links on a daily basis to ensure that each URL is valid and that the document hasn't disappeared or moved.

Human Genome Project
There are many sites on this topic. Here are three good ones in terms of their comprehensive nature and links to research at the university level:

Large-Scale Gene Expression and Microarray Links and Resources - {http://industry.ebi.ac.uk/~alan/MicroArray/}
To put microarrays into context with bioinformatics consider this quote from Gene-Chips.com (http://www.gene-chips.com/): "It is widely believed that thousands of genes and their products (i.e., RNA and proteins) in a given living organism function in a complicated and orchestrated way that creates the mystery of life. However, traditional methods in molecular biology generally work on a "one gene in one experiment" basis, which means that the throughput is very limited and the "whole picture" of gene function is hard to obtain. In the past several years, a new technology, called DNA microarray, has attracted tremendous interests among biologists. This technology promises to monitor the whole genome on a single chip so that researchers can have a better picture of the interactions among thousands of genes simultaneously. Terminologies that have been used in the literature to describe this technology include, but not limited to: biochip, DNA chip, DNA microarray, and gene array."

Bioinformatics comes into play in a microarray experiment in terms of image processing, robotics control, and analysis of the resulting raw data. This site is particularly useful because, by its own admission, its view of the subject is biased towards the development and application of bioinformatics to the technology of microarrays.

Nature's Genome Gateway - http://www.nature.com/genomics/
Produced by the journal Nature, access to all material at this site is free. It offers a nice collection of full text research articles from the Nature Publishing Group journals arranged by organism (the human category in this section is the most rich). An additional section of the site is completely devoted to the human genome. A post-genomics section looks at the applications of sequencing research. A well organized site with good information.

SMD Microarray Resources - {http://genome-www4.stanford.edu/MicroArray/SMD/resources.html}
Good content, nicely arranged. There are headings for microarray databases, software, companies and academic sites, as well as a good list of starting points under the general information heading. SMD stands for the Stanford Microarray Database, which stores microarray data and images. Some of the SMD data are available to the public.

Southwest Biotechnology and Informatics Center (SWBIC): Bioinformatics & Genomics - {http://www.swbic.org/links/1.php}
This is a site of stellar organization and considerable content. All the standard categories are well represented (e.g., Conferences, Databases, News, Online Journals) and the subject specific categories are outstanding (Hidden Markov Models, Genomics and DNA Sequence Analysis, Metabolic Pathway Databases, etc.). The quality of the entries is high, and each entry is annotated. A search capability is also offered as an alternative to browsing.

Visualisation Awareness Pages - {http://industry.ebi.ac.uk/~Alan/VisSupp/VisAware/index.html}
"These pages aim to catalogue visualisation sites, applications, techniques and papers that may be of interest to the Bioinformatics and Biological community." This is an extensive and varied directory created by Alan Robinson, a researcher at the European Bioinformatics Institute. Extensive data visualization resource directories are not common, so its especially nice to find one that focuses on bioinformatics visualization in particular.

Bibliographic Databases

PubMed - {http://www.ncbi.nlm.nih.gov/pubmed}
For medical bibliographic citations this is the place to go. PubMed is the public interface to the medical literature database (MEDLINE) produced by the National Library of Medicine. PubMed provides access to over 11 million MEDLINE citations for articles and conference papers back to the mid-1960's. There are links to many sites providing full text articles (some for free) and PubMed citations link to Entrez nucleotide, protein and structure records when available. Unfortunately, PubMed currently supports searching by Chemical Abstracts Service (CAS) Registry Numbers (RNs) in a very limited way. The dictionary of RNs supported in PubMed is limited and is not currently extended to sequences found in other parts of the Entrez system. The PubMed interface is a rich and somewhat complicated one that requires some study to use efficiently. PubMed with its Entrez links provides an almost one-stop-shopping experience, and is an amazingly rich resource for medical and genetics data.

INSPEC - {http://www.iee.org/publish/inspec/about/} [subscription required]
For citations to computer science literature, start with INSPEC (for noncommercial computer science articles freely available on the Internet, see ResearchIndex below). Produced by the Institution of Electrical Engineers (IEE), INSPEC is the leading bibliographic information service providing access to conference papers and journal articles in computer science and information technology as well as electrical engineering and physics. The database covers literature from 1969-present, and is available from a variety of database vendors, most of which will also provide links to your library's online journals.

Chemical Abstracts and the Registry File - http://www.cas.org/ [subscription required]
For access to the chemical literature this is the place to start. Produced by CAS (Chemical Abstracts Service), Chemical Abstracts is the leading bibliographic information service providing citations to conference papers, journal articles, patents and other documents pertinent to chemistry (and it is to our advantage that they define chemistry very broadly). In December of 2001 CAS extended the coverage of Chemical Abstracts to literature from 1907 to the present, which is a boon to researchers since the chemical literature becomes obsolete slowly, if at all.

"Substance identification is a special strength of CAS, which is widely known for the CAS Chemical Registry, the largest substance identification system in existence. When a chemical substance is newly encountered in the literature processed by CAS, its molecular structure diagram, systematic chemical name, molecular formula, and other identifying information are added to the Registry and assigned a unique CAS Registry Number." This is relevant to bioinformatics because currently about 45% of the Registry File consists of protein and nucleic acid sequences. CAS gets its sequence information from the chemistry journals, patents and other documents that CAS routinely covers, as well as from the GenBank database. 13% of the sequences in the Registry File are unique, while 87% overlap with GenBank. The RefSeq sequences generated from GenBank are not part of the CAS Registry. While most sequences reported in the Protein Data Bank are in Registry, Registry does not provide access to the 3D data that is available in the PDB.

The ability to perform nucleic acid and amino acid sequence similarity searching is highly desirable. BLAST similarity searching is currently possible in some versions of the Registry File (it is available in SciFinder 2001 and via STN on the Web) but unfortunately not in SciFinder Scholar, which is the version most commonly subscribed to by academic libraries. On STN, once a Registry sequence search is completed (via BLAST or a text based search) there are multiple files which can then be very easily searched with CAS Registry Numbers, e.g., BIOSIS, AGRICOLA, USPATFULL, CAplus (Chemical Abstracts), and others. (Other vendors support Registry Number searching in many of these databases as well, though they lack the Registry File itself.) Chemical Abstracts can also be searched directly via the usual bibliographic fields (author, title, etc.). If you are in need of access to the patent literature, or if you are studying protein chemistry, pharmacogenomics, or small molecules, then the CAS databases should be high on your priority list. But without access to BLAST searches in the Registry File the academic bioinformatics community will probably continue to rely heavily on other databases.

BIOSIS Previews - {http://www.biosis.org/products/previews/} [subscription required]
Start here for plant biology and other non-medical biology articles and conference papers from 1969 to the present (only 34% of the journals in BIOSIS overlap with MEDLINE). Since 1993 a "sequence data" field has been available, but rather than containing actual nucleotide or amino acid sequences this field contains the accession number for the sequence from databases such as GenBank, EMBL and SWISS-PROT, if the author included this number in the article text. A very small percentage of records actually use this field. BIOSIS is available from a number of database vendors, most of which will also provide links from the citation to your library's online journals.

ISI Web of Science - {http://apps.webofknowledge.com/} [subscription required]
This is a large, powerful and costly citation database. It is an index to scientific, commercially published journal articles from 1975 to the present that also allows you to search for citations to a particular article. You look up the reference to a work that you have identified to find other more recent journal articles that have cited it. Cited reference searching is a unique way to trace ideas and subjects from past research into the present day. Searchable by author, keyword, and cited reference. Computer scientists and biologists are quite interested in citation data. Web of Science doesn't index conferences as a primary literature source, which is a disadvantage in bioinformatics where conferences are so important. See also ResearchIndex below.
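
The idea behind cited reference searching can be illustrated with a minimal sketch. This is not how Web of Science is implemented; it only shows the underlying technique of inverting "article -> works it cites" records into a "work -> articles citing it" index. The function names and the sample records ("Paper A", "Paper B", "Paper C") are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_cited_reference_index(citations):
    """Invert 'citing article -> cited works' into 'cited work -> citing articles'."""
    index = defaultdict(set)
    for citing, cited_works in citations.items():
        for cited in cited_works:
            index[cited].add(citing)
    return index


def cited_by(index, work):
    """Return the articles that cite `work`, sorted for stable output."""
    return sorted(index.get(work, set()))


# Hypothetical reference lists, keyed by the citing article.
citations = {
    "Paper B (1998)": ["Paper A (1990)"],
    "Paper C (2001)": ["Paper A (1990)", "Paper B (1998)"],
}

index = build_cited_reference_index(citations)
print(cited_by(index, "Paper A (1990)"))
```

Starting from the older "Paper A", the lookup moves forward in time to the later articles that cite it, which is the direction an ordinary subject or author search cannot follow.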

ResearchIndex (formerly CiteSeer) - {http://citeseer.ist.psu.edu/}
This is a free, full-text index to the freely available research articles on the web. "Although availability varies greatly by discipline, over a million research articles are freely available on the web. Some journals and conferences provide free access online, others allow authors to post articles on the web, and others allow authors to purchase the right to post their articles on the web." (Lawrence 2001) This index is popular with computer scientists because a great deal of their literature is available this way, and because ResearchIndex also provides citation analysis. While Web of Science doesn't index conference papers (which are a mainstay in computer science), ResearchIndex does (if the proceedings are on the web for free). It also offers reference linking, extraction of citation context, related document detection and the BibTeX entry for each article.

Technical Reports and Preprints

GenomeBiology.com: Preprint Depository - {http://genomebiology.com/preprint/}
GenomeBiology.com is an online journal from the same publishing group that brings you BioMed Central (http://www.biomedcentral.com/), of which the free online journal BMC Bioinformatics is a part (http://www.biomedcentral.com/1471-2105/). GenomeBiology.com provides free access to its peer-reviewed articles and preprints, although it charges a subscription fee to access its reviews, reports, news and commentaries. The preprints in this depository are not peer reviewed. The only screening process is to ensure relevance of the preprint to GenomeBiology.com's scope and to avoid abusive, libelous or indecent articles.

NCSTRL - Networked Computer Science Technical Reference Library - {http://csetechrep.ucsd.edu/Dienst/htdocs/Welcome.html}
NCSTRL (pronounced "ancestral") is an international collection of technical reports from a selection of participating computer science and computer engineering departments, industrial and government research laboratories made available for noncommercial and educational use. Searchable by keyword, author, or title.

PrePrint Network from the Department of Energy - {http://www.osti.gov/preprints/}
The Department of Energy funds a great deal of bioinformatics research at US universities. They are particularly interested in protein structure, DNA repair of radiation damage, and bioremediation of polluted sites. The Preprint Network is the gateway to preprints in disciplines of interest to the DOE, including bioinformatics. The Network is a metasearch engine that searches across a number of preprint and technical report collections, including the Networked Computer Science Technical Reference Library (NCSTRL), among others. They also offer an update service that will e-mail you when new resources are added in your area of interest. "The Preprint Network is one leg of a triad of electronic products for the science information consumer. We also offer PubSCIENCE ({http://www.osti.gov/pubscience}), a gateway to journal literature, and the DOE Information Bridge (http://www.osti.gov/bridge), an on-line access route to full-text technical report literature of the Department of Energy."

Important Organizations

Guides, Tutorials and Primers

see also Definitions, Glossaries, and Dictionaries

Bioinformatics Frequently Asked Questions - http://bioinformatics.org/FAQ/
This is a scholarly yet pragmatic FAQ (filled with what the author calls "blunt opinions") that is rich in useful information and advice. It is written by Damian Counsell of the Institute of Cancer Research, UK, though he cautions readers that the FAQ doesn't represent the ICR's views. It begins with a wonderful overview of the field that helps put all the major pieces (definitions, programs, databases) into perspective. He goes on to answer questions about finding resources in the field, questions about careers and jobs, and many practical questions like "how can I align two sequences?," "how can I predict the function of a gene?," and "how do I write this up?" It is still a work in progress and the lists of books are now a bit dated, but nevertheless the FAQ is highly recommended reading.

The Bioinformatics Resource (TBR): Tutorials - {http://www.hgmp.mrc.ac.uk/CCP11/directory_tutorials.jsp?Rp=20}
This brand new database (launched January 25, 2002) covers a wide range of topics, contains substantial numbers of records, and is both searchable by keyword and browsable by topic. Sixty-three tutorials are currently cataloged. The list is very heavily weighted towards university course web pages, yet there are some real gems in here. TBR is the website of the CCP11 project (Collaborative Computational Project 11). CCP11 was established to foster bioinformatics in the UK research community, which explains the high number of UK resources listed.

Crystallography 101 - {http://www-structure.llnl.gov/Xray/101index.html}
Crystallography is important for the study of protein structures and bioinformatics is much concerned with the prediction and modeling of protein folding and structure. This is a substantial and well written tutorial on the subject by Bernhard Rupp, Professor of Molecular Structural Biology and Head of the Macromolecular Crystallography Group at the Lawrence Livermore National Laboratory.

MIT Biology Hypertextbook - {http://esg-www.mit.edu:8001/esgbio/7001main.html}
This is the often cited, extensive and well illustrated basic introductory molecular biology text that is used as a supplement to courses at MIT. It is arranged in chapters covering all the basic topics (such as cell biology, enzyme biochemistry, recombinant DNA), and includes a searchable index and practice problems.

NCBI: Education Page - {http://www.ncbi.nlm.nih.gov/Education}
Online education materials from the National Center for Biotechnology Information. Includes online tutorials for the BLAST search program and some of the Entrez Databases (PubMed, Nucleotides, Structures), as well as a useful essay on similarity searching and glossary of terms related to sequence searching.

NCBI: Medical Library Association's CE Course Manual: Molecular Biology Information Resources - http://www.ncbi.nlm.nih.gov/Class/MLACourse/
This is Renata McCarthy's manual from the excellent full day continuing education course for librarians on molecular biology information resources. The online notes, links and examples are very helpful even without taking the class in person. That is fortunate, since the course will no longer be offered in locations throughout the U.S. Due to the complexity of the material the course is being expanded to three days and will be offered several times per year only at the National Library of Medicine, starting in early 2002. Once the revised course is available, this page will contain a link to the new course web page, which will include a schedule of course dates and registration information.

Protein Data Bank (PDB): Education Resources - {http://www.rcsb.org/pdb/static.do?p=general_information/news_publications/newsletters/educationcorner.html}
A nicely organized directory of high quality educational sites related to proteins and nucleic acids, as well as pointers to tutorials on using the PDB itself. There is a section called "protein documentaries" that lists multimedia sites (VRML, RealPlayer and/or Chime plug-ins required) and an excellent selection of molecular modeling resources in the section called "Other Educational Resources." Also worth visiting is the link to "Links" in the upper right under "Other Information Resources" that takes you to their "Macromolecular Structure Related Resources" page which is a comprehensive web directory of its own.

Science Magazine: Functional Genomics Educational Resources - http://www.sciencemag.org/feature/plus/sfg/education/index.shtml
This site has a lot to recommend it. A "film festival" section provides RealTime movies and webcasts of press conferences and lectures. The glossary section has already been recommended earlier in this guide. There is an annotated list of "Ten Great Educational Websites" (which are very cool, though most seem to be aimed at the high school level) plus an education site of the month. And not to be missed are the three sites in the "A Little Base (Pair) Humor" section: Cartoonists' views of the Human Genome Project from Slate magazine, the DNA-O-Gram which allows you to send a nucleotide-encoded message to a friend, and Swiss-Jokes -- "The infamous random sampler of helvetian humor from ExPASy."

Recommended Reading

Baxevanis, A. D. & Ouellette, B. F. F. (eds). 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Methods of Biochemical Analysis, vol. 43, 2nd ed., New York: John Wiley & Sons, Inc.

Benson, D. A., et al. 2000. GenBank. Nucleic Acids Research 28(1): 15-18.

Berman, H. M., et al. 2000. The protein data bank. Nucleic Acids Research 28(1): 235-242.

Gibas, C. & Jambeck, P. 2001. Developing Bioinformatics Computer Skills. Sebastopol, CA: O'Reilly & Associates, Inc.

Nucleic Acids Research. 2002. 30(1). [Online]. Available: {http://nar.oxfordjournals.org/content/vol30/issue1/} [January 25, 2002].

Schuler, G. D. 1997. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. Journal of Molecular Medicine 75(10): 694-698.

Wang, Y., et al. 2000. MMDB: 3D structure data in Entrez. Nucleic Acids Research 28(1): 243-245.

Wheeler, D. L., et al. 2001. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 29(1): 11-16.

References

Counsell, Damian. 2001. Bioinformatics FAQ. [Online]. Available: http://bioinformatics.org/faq/ [January 16, 2002].

Doernberg, D. 1993. Computer Literacy Interview With Donald Knuth. [Online]. Available: {http://www1.fatbrain.com/interviews/knuth_interview.html} [November 19, 2001].

Lawrence, S. 2001. Online or Invisible? [Online]. Available: {http://citeseer.ist.psu.edu/online-nature01/} [November 19, 2001].

National Center for Biotechnology Information (NCBI). 2001. NCBI Education Site. [Online]. Available: http://www.ncbi.nlm.nih.gov/Education/ [November 19, 2001].

Nucleic Acids Research. 2002. 30(1). [Online]. Available: {http://nar.oxfordjournals.org/content/vol30/issue1/} [January 25, 2002].