See also: Definitions, Glossaries, and Dictionaries; Guides, Tutorials and Primers; Recommended Reading
Sequence and other non-bibliographic databases are the central, most important type
of information resource in this field. The multiplicity of databases makes
selection confusing, and the databases themselves can be challenging to understand
and navigate. Neither nomenclature nor data formats and metadata schemes are
standardized. Databases struggle with data redundancy and charges that they contain a
lot of "junk." There are a lot of interrelated pieces of information surrounding a
gene (genome location, structure, sequence, expression information, chemistry,
etc.) or a protein, which lead to somewhat complicated database structures and
links to related databases which may or may not be intuitive. There are the
additional requirements that 3D structures place on metadata and the increasing
volume of sequence data pouring into these databases (take a look at the
exponential growth of GenBank at http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html).
A great deal of this growth is in the form of direct submissions that may or may
not have been peer reviewed. Whether peer reviewed or not, the data are subject to
frequent changes and updates as new information becomes available. There are many
approaches to solving these problems, which means multiple data structures and
multiple search interfaces and many specialized databases.
As mentioned earlier, there are hundreds of databases that might be considered
relevant to bioinformatics. There are specialized databases for each species, and
separate databases for different types of information (nucleic acid sequences,
protein sequences, protein structures, biochemical and biophysical information,
etc.). There is also a great redundancy of databases, with multiple databases
covering nearly the same information for the same organisms. This situation arose
in part from many researchers developing their own databases in their own formats
over the years, and from databases developing in parallel in Europe, Japan and the
United States. The situation is further complicated by the existence of several
versions or mirrors of the same database on different servers (each with varying
degrees of currency or completeness), and by the sharing of records between
databases. For example, the Entrez search system draws data from SWISS-PROT but
only includes SWISS-PROT records for proteins that are based upon nucleotide
sequence data that meet the criteria for inclusion in GenBank. A search of
SWISS-PROT through another interface may retrieve more records as well as more
detail in each record.
The following list of databases is intended to orient the reader to the major
databases. To become a proficient searcher in each database takes considerable
training, which is beyond the scope of this guide. This database list is highly
selective, including only a few representatives of each type. Emphasis is placed on
the larger, better known databases, on the free public databases, and on those that
cover human data. Grouping databases by type is a common and useful way of
organizing them, but many databases provide more than one type of information to
the user, so bear in mind that this classification is not precise.
A review of the basic genetic terms and concepts is highly recommended before
approaching the sequence databases. See the Definitions,
Glossaries, and Dictionaries and the Guides, Tutorials and
Primers sections of this guide for recommended sources.
See also: Comprehensive Web Sites
The large number of databases naturally leads to the impulse to create database
directories and lists. Many of the comprehensive web sites listed in this guide
also provide such lists and are worth consulting.
- Nucleic Acids Research: Annual Database Issue -
http://www.nar.oupjournals.org/content/vol30/issue1/
[subscription required for access]
- For the past several years the journal Nucleic Acids Research, published by Oxford University Press, has devoted the first issue of each year to listing and describing the many molecular biology and bioinformatics databases. This issue always includes many informative articles describing selected databases in depth (several of these are in the Recommended Reading section of this guide) as well as a comprehensive list of databases called the Molecular Biology Database Collection (see http://www.nar.oupjournals.org/cgi/content/full/30/1/1/DC1). In 2002 this list included 335 databases, up from 281 the year before. The list can be accessed by category/type of database (there were eighteen categories in 2002) or alphabetically by title. Click on the short description of each database to access a paragraph-long description written by a researcher familiar with the database.
- Introduction to Molecular Biology Databases - {http://www.ebi.ac.uk/swissprot/Publications/mbd1.html}
- Although not technically a directory, this article, written in 1999, is a very
helpful introduction to the major databases, including many of the organism
specific ones that are outside the scope of this guide. A good starting place for
the non-specialist.
- Sequence Retrieval System (SRS) database descriptions - {http://downloads.lionbio.co.uk/publicsrs.html}
- This database of database descriptions can be accessed in a number of ways. From the Public SRS Servers List ({http://downloads.lionbio.co.uk/publicsrs.html}) you can view an alphabetical list of databases (scroll down past the list of servers at the top to see the list of databases or "libraries" in the middle of the page). Click on the link for the server that is hosting the database you are interested in to view the record for the version of that database as mounted on that server. Or, you can link to the database descriptions from the search interface of a particular server (for example, see the European Bioinformatics Institute server at http://srs6.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+top+-newId). This method is particularly helpful because here the databases are sorted by database type (e.g., sequence libraries, protein 3D structures, mutations, SNP, metabolic pathways, etc.). Click on the plus sign next to the type to expand the list, then click on the database name to access the description of that database as mounted on that server. You can also go one step further and actually initiate your search in those databases from this page as well.
- GenBank - http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html, and Entrez Nucleotides Database - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
- GenBank is the nucleotide sequence database built and distributed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. As of this writing, GenBank contains more than 13 billion bases from over 100,000 species, and is growing exponentially (see http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html). The data are obtained through direct submission of sequence data from individual laboratories, from large-scale sequencing projects, and from the US Patent and Trademark Office. A little more than half of the total sequences in the database are from Homo sapiens.
There are two ways to search GenBank: a text-based query can be submitted through the Entrez system at {http://www.ncbi.nlm.nih.gov/Entrez/index.html}, or a sequence query can be submitted through the BLAST family of programs (see http://www.ncbi.nlm.nih.gov/BLAST/). To search GenBank through the Entrez system, select the Nucleotides database from the menu. The Entrez Nucleotides Database is a collection of sequences from several sources, including GenBank, RefSeq, and the Protein Data Bank, so you don't actually search GenBank exclusively.
Searches of the Entrez Nucleotides database query the text and numeric fields in the record, such as the accession number, definition, keyword, gene name, and organism fields, to name just a few. For example, you could enter the terms Bacillus anthracis and you would be presented with many records that contain and describe nucleotide or protein sequences related to the anthrax bacterium. The accession number is very handy, because it is a unique and persistent identifier for the GenBank entry as a whole and doesn't change even if there is a later change or update to the sequence or annotation. Nucleotide sequence records in the Nucleotides database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
Becoming an effective searcher of this database takes study. For starters, take the Nucleotides database online tutorial that starts at {http://www.ncbi.nlm.nih.gov/Database/tut1.html}, and consult the other resources available from the NCBI Education Page at {http://www.ncbi.nlm.nih.gov/Education/}. See also the Recommended Reading section of this guide.
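The text-based query described above can also be written out as a URL. The following is a minimal sketch, assuming NCBI's E-utilities service (ESearch and EFetch), which this guide does not itself cover; the function names are hypothetical, and the accession U49845 is used purely as an illustration. The code only builds the URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities base address. The guide describes only the
# Entrez web forms, so this programmatic route is an illustration, not a
# procedure taken from the guide.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(term, db="nucleotide", retmax=5):
    """Return an ESearch URL for a text query against one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_fetch_url(accession, db="nucleotide"):
    """Return an EFetch URL requesting one record as GenBank flat-file text."""
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Text search on the organism name used as the example in this guide.
    print(build_search_url("Bacillus anthracis"))
    # Retrieval by accession number, the persistent identifier of an entry.
    print(build_fetch_url("U49845"))
```

Because the accession number does not change when a record is updated, a fetch URL built this way should keep pointing at the same entry over time, whereas the results of a text search can shift as records are added.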
If you have obtained a record through a text-based Entrez Nucleotides Database
search you can read the nucleotide sequence in the record. However, most
researchers wish to submit a nucleotide sequence of interest to find the sequences
that are most similar to theirs. This is done using the BLAST (Basic Local Alignment
Search Tool) programs. You select the BLAST
program you wish to use depending upon the type of comparison you are doing
(nucleotide to nucleotide, or nucleotide to protein sequence, etc.) and then you
select the database to run the query in (any of several nucleotide or protein
databases). Many NCBI databases accept BLAST searches, as do many of the other
databases covered elsewhere in this guide. The result is a detailed report that
summarizes your query, provides a graphical overview of database matches, indicates
the statistical significance of the matches and describes each significant
alignment. From here you can link to the full database record for the individual
matches. You can learn more about BLAST searching from the NCBI BLAST educational
page at {http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html}
(read the online tutorial).
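The choice of BLAST program described above depends on the type of the query and the type of the database. As a rough sketch, the standard pairing can be written as a lookup table; the function name and the `translate_both` flag are hypothetical helpers introduced here for illustration, while the program names (blastn, blastp, blastx, tblastn, tblastx) are the standard BLAST programs.

```python
# Standard BLAST programs keyed by (query type, database type).
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated before comparison
    ("protein", "nucleotide"): "tblastn",  # database is translated before comparison
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name for the given sequence types.

    translate_both selects tblastx, which compares translations of a
    nucleotide query against translations of a nucleotide database.
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} query, {database_type!r} database"
        ) from None


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```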
- EMBL Nucleotide Sequence Database - http://www.ebi.ac.uk/embl/
- "The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide
sequence resource. Main sources for DNA and RNA sequences are direct submissions
from individual researchers, genome sequencing projects and patent applications.
The database is produced in an international collaboration with GenBank (USA) and
the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of
the total sequence data reported worldwide, and all new and updated database
entries are exchanged between the groups on a daily basis."
From the home page you can submit simple text searches to the EMBL Nucleotide
Sequence Database, to the Protein Data Bank (what you search when you select
protein structures from the menu), or to a protein sequence database called Swall.
For more complex searches, the site recommends accessing the databases through the
Sequence Retrieval System (SRS) server (http://srs.ebi.ac.uk/). SRS is a database querying
/ navigation system, similar in function to the Entrez system. It allows you to
simultaneously search across several databases and to display the results in many
ways. SRS can be used to access a large number of databases, including EMBL,
SWISS-PROT and the Protein Data Bank, depending upon the configuration of the
particular SRS server you are using. The structure and content of an EMBL
Nucleotide record is very similar to that of an NCBI Entrez Nucleotide database
record.
- Entrez Genome - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
- "The whole genomes of over 800 organisms can be found in Entrez Genomes. The
genomes represent both completely sequenced organisms and those for which
sequencing is in progress. All three main domains of life - bacteria, archaea, and
eukaryota - are represented, as well as many viruses and organelles." Text searches
can be done from the main page. Data can also be accessed alphabetically by species
({http://www.ncbi.nlm.nih.gov:80/PMGifs/Genomes/allorg.html}),
or hierarchically by drilling down through a taxonomic list to a graphical overview
for the genome of that organism, then to specific chromosomes, then on to specific
genes. At each level are maps, pre-computed summaries, analysis appropriate to that
level, and links to related records from a variety of other Entrez databases.
BLAST searches of some genomes are also possible.
Very useful pages for some of the most commonly studied species (e.g., human,
mouse, fruit fly, malarial parasite) can be found on the Genomic Biology page under
"organism-specific resources" (http://www.ncbi.nlm.nih.gov/Genomes/).
These pages are so detailed that each could be classified as a comprehensive web
site in itself. Each one brings together links to the genomic data, useful tools,
related data sources and news about the genome of that species. The Human Genome
Guide (http://www.ncbi.nlm.nih.gov/genome/guide/human/)
is particularly rich.
- Human Genome Browser from UCSC - http://genome.ucsc.edu/
- "The sequence of the human genome is too big to see at all at once; few people
want to look at raw DNA sequence anyway. The alternative is the Human Genome
Browser for a quick display of any requested portion of the genome at any scale,
along with more than two dozen tracks of information (genes, ESTs, CpG islands,
assembly gaps, chromosomal band, ...) associated with the complete human genome
sequence... Clicking on a displayed feature opens a second window providing protein
sequence, coordinates and accession numbers, as appropriate. Clicking in the corner
of the display calls up raw DNA sequence corresponding to the display window
boundaries. This look-up feature is far more convenient than manual retrieval of a
precise coordinate range from GenBank entries."
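The coordinate-range lookup that the browser automates can be illustrated with a small sketch. It assumes a position string of the form chromosome:start-end with 1-based, inclusive coordinates and a sequence already held in memory; the function names are hypothetical and this is not the browser's own code.

```python
import re

# Pattern for a position string such as "chr7:1,001-1,010" (commas optional).
_REGION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = _REGION.match(region.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {region!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinate range: {start}-{end}")
    return match["chrom"], start, end


def extract_range(sequence, start, end):
    """Return the bases from start to end, both 1-based and inclusive."""
    if end > len(sequence):
        raise ValueError("range extends past the end of the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


if __name__ == "__main__":
    chrom, start, end = parse_region("chr7:3-6")
    print(chrom, extract_range("ACGTACGTAC", start, end))
```

The off-by-one conversion in `extract_range` is the step most easily mistaken when a range is pulled out of a database record by hand.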
- The Genome Database (GDB) - {http://www.gdb.org/}
- The Genome Database is the official central repository for genomic mapping data
resulting from the Human Genome Initiative. The database contains three types of
data: (1) regions of the human genome, including genes, clones, and ESTs, (2) maps
of the human genome, including cytogenetic maps, linkage maps, radiation hybrid
maps, content contig maps, and integrated maps (these maps can be displayed
graphically via the Web), and (3) variations within the human genome including
mutations and polymorphisms, plus allele frequency data. There are options to
browse genes by chromosome, genes by symbol name, and genetic diseases by
chromosome. There are multiple ways to search, including text-based searches for
people, citations, segment names or accession numbers, and sequence searching via
BLAST.
- KEGG: Kyoto Encyclopedia of Genes and Genomes -
{http://www.genome.jp/kegg/}
- This database often appears in Google search results, so let's put it in
context. Despite the name, this is actually a biochemical pathway database and
gene catalog, not an encyclopedia in the book sense. "The primary objective of KEGG
is to computerize the current knowledge of molecular interactions; namely,
metabolic pathways, regulatory pathways, and molecular assemblies. At the same
time, KEGG maintains gene catalogs for all the organisms that have been sequenced
and links each gene product to a component on the pathway. Because we need an
additional catalog of building blocks, KEGG also organizes a database of all
chemical compounds in living cells and links each compound to a pathway component."
- SWISS-PROT - {http://web.expasy.org/groups/swissprot/}
- "SWISS-PROT is a curated protein sequence database which strives to provide a
high level of annotation (such as the description of the function of a protein, its
domain structure, post-translational modifications, variants, etc.), a minimal
level of redundancy and a high level of integration with other databases." "The
data in Swiss-Prot are derived from translations of DNA sequences from the EMBL
Nucleotide Sequence Database, adapted from the Protein Identification Resource
(PIR) collection, extracted from the literature and directly submitted by
researchers. It contains high-quality annotations, is non-redundant, and
cross-referenced to several other databases, notably the EMBL nucleotide sequence
database, PROSITE pattern database and PDB."
From the home page, a quick text search can be done by accession or ID
number, description, gene name, or organism. By searching SWISS-PROT
through the Sequence Retrieval System (SRS) more sophisticated searches
can be performed and the format of the results can be customized. Access
to SWISS-PROT (directly or via SRS) and links to many other proteomics
resources are available from the ExPASy (Expert Protein Analysis System) proteomics server of
the Swiss Institute of Bioinformatics (SIB) at {http://us.expasy.org/}. The SWISS-PROT records are quite detailed. Be advised that other databases or search systems that import SWISS-PROT data may not always provide access to the entire SWISS-PROT record.
- Entrez Protein Database - http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?dB=Protein
- "The Protein database contains sequence data from the translated coding regions
from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted
to PIR, SWISS-PROT, PRF, and the Protein Data Bank (PDB) (sequences from solved
structures)." The native SWISS-PROT records usually contain more detailed
annotations than will be obtained from Entrez Protein Database records derived from
SWISS-PROT records. In typical Entrez fashion, results from a search of the Protein
database link to PubMed, to the taxonomy database, to related sequences, and in
some cases to pre-computed BLAST search results (look for BLink links).
- Protein Information Resource - International Protein Sequence Database
(PIR-PSD) - http://pir.georgetown.edu/
- In 1988 the Protein Information Resource (PIR), which is affiliated with
Georgetown University Medical Center, established a cooperative effort with the
Munich Information Center for Protein Sequences (MIPS) and the Japan International
Protein Information Database (JIPID) to collect, publish and distribute the
PIR-International Protein Sequence Database (PIR-PSD). They describe the database
as "a comprehensive, non-redundant, expertly annotated, fully classified and
extensively cross-referenced protein sequence database in the public domain". Text
searches can be done in the title, species, author, citation, keyword, superfamily,
feature and gene name fields. Gapped-BLAST sequence similarity searches are also an
option. Note that both SWISS-PROT and the Entrez Protein database contain data
adapted from the PIR.
- Protein Data Bank (PDB) - http://www.rcsb.org/pdb/
- The PDB was established at Brookhaven National Laboratory in 1971, making it
the first public bioinformatics database. The PDB is now operated by the Research
Collaboratory for Structural Bioinformatics (RCSB), which is a collaborative effort
of the San Diego Supercomputer Center, Rutgers University, and the National
Institute of Standards and Technology (NIST). The PDB is a repository of
experimentally determined three-dimensional structures of biological macromolecules
(proteins, enzymes, nucleic acids, protein-nucleic acid complexes, and viruses)
derived from X-ray crystallography and NMR experiments (see
{http://www.rcsb.org/pdb/experimental_methods.html}
for a helpful overview of these methods). Depositing structures obtained from
theoretical models is discouraged. Data are deposited by the international user
community and maintained by the RCSB PDB staff. Approximately 50-100 new structures
are deposited each week. A variety of information associated with each structure is
available, including "sequence details, atomic coordinates, crystallization
conditions, 3-D structure neighbors computed using various methods, derived
geometric data, structure factors, 3-D images, and a variety of links to other
resources."
There are three ways to search the PDB. The SearchLite interface accepts text
queries using Boolean operators, and searches the text fields such as the author,
compound, molecule class, and keywords fields. The SearchFields interface is an
advanced search option that allows you to choose specific fields in which to search
and to apply various limits. It also allows you to customize the format of the
results. The third search method requires leaving the PDB site, going to the NCBI
Entrez site and performing an NCBI BLAST sequence search with "pdb" selected as the
target database. See the notes on protein BLAST searching at http://www.ncbi.nlm.nih.gov/blast/html/BLASThomehelp.html#AABLAST.
Because the database is so old, historical inconsistencies in the way data are reported within PDB records may lead to unexpected or incomplete results when searching, particularly for text-based information. Certain keywords, like alpha, are not properly searchable. For example, looking for alpha hemolysin fails to find anything, but a search on hemolysin alone results in ten hits, including 7AHL, which is alpha hemolysin. The PDB file format itself also has numerous flaws, but remains the most widely accepted format for structural data. The database producers are aware of these problems and are working to solve them.
Several software packages can be used to view PDB files in 3D, including the RasMol and Chime browser plug-ins and Deep-View. For more information see the PDB Query Tutorial at {http://www.rcsb.org/pdbstatic/tutorials/LargeBeta.swf} and the PDB Documentation and Information page at {http://www.rcsb.org/pdb/info.html#General_Information}. See also the entry for the MMDB below, which is a subset of the PDB with some added features.
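The PDB file format mentioned above is a fixed-column text format, so the position of a value on the line, not a delimiter, determines its meaning. The following is a minimal sketch of reading one ATOM record by column position. The class and function names are hypothetical, the sample record is made up for illustration and assembled field by field, and real files contain many other record types and edge cases that this sketch ignores.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line):
    """Parse one ATOM or HETATM record using its fixed column positions."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),           # columns 7-11
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


# A made-up record, built piece by piece so the column widths are explicit.
SAMPLE_LINE = (
    "ATOM  "               # columns 1-6: record name
    + "1".rjust(5)         # columns 7-11: atom serial number
    + " "                  # column 12: blank
    + " N  "               # columns 13-16: atom name
    + " "                  # column 17: alternate location indicator
    + "ALA"                # columns 18-20: residue name
    + " "                  # column 21: blank
    + "A"                  # column 22: chain identifier
    + "1".rjust(4)         # columns 23-26: residue sequence number
    + " " * 4              # columns 27-30: insertion code and blanks
    + "11.104".rjust(8)    # columns 31-38: x coordinate
    + "6.134".rjust(8)     # columns 39-46: y coordinate
    + "-6.504".rjust(8)    # columns 47-54: z coordinate
)

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_LINE))
```

Splitting such a line on whitespace would often work, but it breaks when adjacent fields fill their columns completely and run together, which is one reason column slicing is used here.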
- MMDB: Molecular Modeling DataBase - http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
- The MMDB is NCBI's structure database. It is a subset of three-dimensional
structures obtained from the Protein Data Bank (PDB), excluding theoretical models.
MMDB adds value through the addition of explicit chemical graph information and
through the cross-linking of structural data to bibliographic information, to the
sequence databases, and to the NCBI taxonomy. The explicit bond information makes
for more consistent interpretation of the coordinate data by visualization
software. MMDB can provide data for three different structure viewers: Cn3D, a
viewer developed by the NCBI; RasMol; and MAGE. All three are available for a
variety of platforms (Windows, MacOS, UNIX). After installing the software, the
3-dimensional structure can be viewed by clicking the button labeled View/Save
Structure close to the bottom of each structure summary.
The structure database may be queried directly, using accession numbers or text
terms such as author names, protein names, species names or publication dates. The
result will yield "Structure Query" pages, providing access to entries which
matched the keywords. From the Structure Summary pages of an individual matching
entry one may access amino acid and nucleic acid sequences, retrieve PubMed
documents, get taxonomy information, and launch the software to view the 3D image.
The MMDB documentation also notes that:
"The structure database is considerably smaller than Entrez's protein or
nucleotide databases, but a large fraction of all known protein sequences have
homologs in this set, and one may often learn more about a protein by examining
3-D structures of its homologs. Protein sequences from MMDB are extracted and
available in the Entrez protein sequence database. They are linked to the 3-D
structures, therefore it is possible to determine whether a protein sequence in
Entrez has homologs amongst known structures by examining its Related Sequences or
Protein Neighbors and checking whether this set has any Structure Links."
- Bioinformatics Software Resource (BISR) - {http://bioinfo.nist.gov/BISR/}
- A catalog and clearinghouse of links to bioinformatics and
computational biology software and resources. Over 400 packages are
currently available, more than 70% of the software is free, and a variety
of operating systems are supported. This database is maintained by the
Chemical Science and Technology Laboratory of the National Institute of
Standards and Technology (NIST).
- Database Searching, Browsing and Analysis Tools - http://www.ebi.ac.uk/Tools/index.html
- A list of software tools (programs) you can use via the web to submit queries
to the sequence databases and to analyze the results of those queries. This list is
from the European Bioinformatics Institute. See also the ExPASy Proteomics Tools
list below.
- ExPASy Proteomics Tools -
{http://www.expasy.org/links.html}
- Tools for proteomics that may be used over the web, covering such categories as
protein identification and characterization, similarity searches, secondary
structure prediction, and sequence alignment.
- Genamics SoftwareSeek - http://genamics.com/software/index.htm
- A repository and database of over 1200 free and commercial tools for use in
molecular biology and biochemistry. Windows, MS-DOS, Mac, Unix and Linux platforms
are supported, as well as online tools that run through your Internet browser. You
may browse by category (such as DNA sequence analysis, molecular modeling, or
protein structure prediction) or you may search by platform, program name or
keyword.
- Freshmeat Open Source Software Repository - http://freshmeat.net/
- This database of UNIX and cross-platform open source software is a good source
for molecular modeling and visualization programs and contains a smattering of
bioinformatics applications. Each entry provides a history of the project's
releases (very useful for spotting stale code) and a popularity ranking. See also
Open Source Software Promoters.
See also: Database Directories and Lists
- Amos' WWW Links Page - {http://au.expasy.org/links.html}
- This is a massive list of "information sources for life scientists
with an interest in biological macromolecules" and one that is often
referenced by other bioinformatics web sites. Amos' page is hosted by the
ExPASy (Expert Protein Analysis System) proteomics server of the Swiss
Institute of Bioinformatics, yet the disclaimer points out that the
content entirely reflects the interests of one man: Amos Bairoch, a
researcher at the Swiss Institute of Bioinformatics. Annotations are
extremely brief (sometimes only three words) so they provide little
guidance for the novice. However, the sheer extent of the list (over 1000
entries) coupled with its logical headings, its "scanability" (the lack of
annotations does make it easier to browse quickly), and its prominence in
the field make it worth noting here.
- GenomeWeb - {http://www.hgmp.mrc.ac.uk/GenomeWeb/}
- This is a very extensive list of sites. When browsing click on the "i"
icon next to each title to access the annotations at the bottom of the
page. While the directory organization is not as clear as one would like,
there is a keyword search option that helps. GenomeWeb is to be commended
for tackling the task of annotating every record and for verifying all of
its links on a daily basis to ensure that each URL is valid and that the
document hasn't disappeared or moved.
- Human Genome Project
- There are many sites on this topic. Here are three good ones in terms
of their comprehensive nature and links to research at the university
level:
- Large-Scale Gene Expression and Microarray Links and
Resources - {http://industry.ebi.ac.uk/~alan/MicroArray/}
- To put microarrays into context with bioinformatics consider this quote from Gene-Chips.com (http://www.gene-chips.com/):
"It is widely believed that thousands of genes and their products (i.e., RNA and proteins) in a given living organism function in a complicated and orchestrated way that creates the mystery of life. However, traditional methods in molecular biology generally work on a "one gene in one experiment" basis, which means that the throughput is very limited and the "whole picture" of gene function is hard to obtain. In the past several years, a new technology, called DNA microarray, has attracted tremendous interests among biologists. This technology promises to monitor the whole genome on a single chip so that researchers can have a better picture of the interactions among thousands of genes simultaneously. Terminologies that have been used in the literature to describe this technology include, but not limited to: biochip, DNA chip, DNA microarray, and gene array."
- Bioinformatics comes into play in a microarray experiment in terms of
image processing, robotics control, and analysis of the resulting raw
data. This site is particularly useful because, by its own admission, its
view of the subject is biased towards the development and application of
bioinformatics to the technology of microarrays.
- Nature's Genome Gateway -
http://www.nature.com/genomics/
- This site is produced by the journal Nature, and access to all material is
free. It offers a nice collection of full text research articles from the
Nature Publishing Group journals arranged by organism (the human category
in this section is the richest). An additional section of the site is
completely devoted to the human genome. A post-genomics section looks at
the applications of sequencing research. A well organized site with good
information.
- SMD Microarray Resources -
{http://genome-www4.stanford.edu/MicroArray/SMD/resources.html}
- Good content, nicely arranged. There are headings for
microarray databases, software, companies and academic sites, as well as a
good list of starting points under the general information heading. SMD
stands for the Stanford Microarray Database, which stores microarray data
and images. Some of the SMD data is available to the public.
- Southwest Biotechnology and Informatics Center (SWBIC): Bioinformatics & Genomics - {http://www.swbic.org/links/1.php}
- This is a site of stellar organization and considerable content. All
the standard categories are well represented (e.g., Conferences,
Databases, News, Online Journals) and the subject specific categories are
outstanding (Hidden Markov Models, Genomics and DNA Sequence Analysis,
Metabolic Pathway Databases, etc.). The quality of the entries is high,
and each entry is annotated. A search capability is also offered as an
alternative to browsing.
- Visualisation Awareness Pages - {http://industry.ebi.ac.uk/~Alan/VisSupp/VisAware/index.html}
- "These pages aim to catalogue visualisation sites, applications,
techniques and papers that may be of interest to the Bioinformatics and
Biological community." This is an extensive and varied directory created
by Alan Robinson, a researcher at the European Bioinformatics Institute.
Extensive data visualization resource directories are not common, so it's
especially nice to find one that focuses on bioinformatics visualization
in particular.
- PubMed - {http://www.ncbi.nlm.nih.gov/pubmed}
- For medical bibliographic citations this is the place to go. PubMed is
the public interface to the medical literature database (MEDLINE) produced
by the National Library of Medicine. PubMed provides access to over 11
million MEDLINE citations for articles and conference papers back to the
mid-1960s. There are links to many sites providing full text articles
(some for free) and PubMed citations link to Entrez nucleotide, protein
and structure records when available. Unfortunately, PubMed currently
supports searching by Chemical Abstracts Service (CAS) Registry Numbers
(RNs) in a very limited way. The dictionary of RNs supported in PubMed is
limited and is not currently extended to sequences found in other parts of
the Entrez system. The PubMed interface is a rich and somewhat complicated
one that requires some study to use efficiently. PubMed with its Entrez
links provides an almost one-stop-shopping experience, and is an amazingly
rich resource for medical and genetics data.
- INSPEC - {http://www.iee.org/publish/inspec/about/} [subscription required]
- For citations to computer science literature, start with INSPEC (for
noncommercial computer science articles freely available on the Internet,
see ResearchIndex below). Produced by the
Institution of Electrical Engineers (IEE), INSPEC is the leading
bibliographic information service providing access to conference papers
and journal articles in computer science and information technology as
well as electrical engineering and physics. The database covers literature
from 1969-present, and is available from a variety of database vendors,
most of which will also provide links to your library's online journals.
- Chemical Abstracts and the Registry File - http://www.cas.org/ [subscription required]
- For access to the chemical literature this is the place to start.
Produced by CAS (Chemical Abstracts Service), Chemical Abstracts is the
leading bibliographic information service providing citations to
conference papers, journal articles, patents and other documents pertinent
to chemistry (and it is to our advantage that they define chemistry very
broadly). In December of 2001 CAS extended the coverage of Chemical
Abstracts to literature from 1907 to the present, which is a boon to
researchers since the chemical literature becomes obsolete slowly, if at
all.
"Substance identification is a special strength of CAS, which is widely
known for the CAS Chemical Registry, the largest substance identification
system in existence. When a chemical substance is newly encountered in the
literature processed by CAS, its molecular structure diagram, systematic
chemical name, molecular formula, and other identifying information are
added to the Registry and assigned a unique CAS Registry Number." This is
relevant to bioinformatics because currently about 45% of the Registry
File consists of protein and nucleic acid sequences. CAS gets its sequence
information from the chemistry journals, patents and other documents that
CAS routinely covers as well as sequences from the GenBank database. Of
the sequences in the Registry File, 13% are unique, while 87% overlap with
GenBank. The RefSeq sequences generated from GenBank are not part of the
CAS Registry. While most sequences reported in the Protein Data Bank are
in Registry, Registry does not provide access to the 3D data that is
available in the PDB.
The ability to perform nucleic acid and amino acid sequence similarity
searching is highly desirable. BLAST similarity searching is currently
possible in some versions of the Registry File (it is available in
SciFinder 2001 and via STN on the Web) but unfortunately not in SciFinder
Scholar, which is the version most commonly subscribed to by academic
libraries. On STN, once a Registry sequence search is completed (via BLAST
or a text based search) there are multiple files which can then be very
easily searched with CAS Registry Numbers, e.g., BIOSIS, AGRICOLA,
USPATFULL, CAplus (Chemical Abstracts), and others. (Other vendors support
Registry Number searching in many of these databases as well, though they
lack the Registry File itself.) Chemical Abstracts can also be searched
directly via the usual bibliographic fields (author, title, etc.). If you
are in need of access to the patent literature, or if you are studying
protein chemistry, pharmacogenomics, or small molecules then the CAS
databases should be high on your priority list. But without access to
BLAST searches in the Registry File the academic bioinformatics community
will probably continue to rely heavily on other databases.
- BIOSIS Previews - {http://www.biosis.org/products/previews/} [subscription required]
- Start here for plant biology and other non-medical biology articles
and conference papers from 1969 to the present (only 34% of the journals
in BIOSIS overlap with MEDLINE). Since 1993 a "sequence data" field has
been available but rather than containing actual nucleotide or amino acid
sequences this field contains the accession number for the sequence from
databases such as GenBank, EMBL and SwissProt, if the author included this
number in the article text. A very small percentage of records actually
use this field. BIOSIS is available from a number of database vendors,
most of which will also provide links from the citation to your library's
online journals.
- ISI Web of Science - {http://apps.webofknowledge.com/} [subscription required]
- This is a large, powerful and costly citation database. It is an index to scientific, commercially published journal articles from 1975 to the present that also allows you to search for citations to a particular article. You look up the reference to a work that you have identified to find other more recent journal articles that have cited it. Cited reference searching is a unique way to trace ideas and subjects from past research into the present day. Searchable by author, keyword, and cited reference. Computer scientists and biologists are quite interested in citation data. Web of Science doesn't index conferences as a primary literature source, which is a disadvantage in bioinformatics where conferences are so important. See also ResearchIndex below.
- ResearchIndex (formerly CiteSeer) - {http://citeseer.ist.psu.edu/}
- This is a free, full-text index to the freely available research
articles on the web. "Although availability varies greatly by discipline,
over a million research articles are freely available on the web. Some
journals and conferences provide free access online, others allow authors
to post articles on the web, and others allow authors to purchase the
right to post their articles on the web." (Lawrence 2001). This index is
popular with computer scientists because a great deal of their literature
is available this way, and because ResearchIndex also provides citation
analysis. While Web of Science doesn't index conference papers (which are
a mainstay in computer science), ResearchIndex does (if the proceedings
are on the web for free). It also offers reference linking, extraction of
citation context, related document detection and the BibTeX entry for each
article.
- GenomeBiology.com: Preprint Depository - {http://genomebiology.com/preprint/}
- GenomeBiology.com is an online journal from the same publishing group
that brings you BioMed Central (http://www.biomedcentral.com/),
of which the free online journal BMC Bioinformatics is a part (http://www.biomedcentral.com/1471-2105/).
GenomeBiology.com provides free access to its peer-reviewed articles and
preprints, although it charges a subscription fee to access its reviews,
reports, news and commentaries. The preprints in this depository are not
peer reviewed. The only screening process is to ensure relevance of the
preprint to GenomeBiology.com's scope and to avoid abusive, libelous or
indecent articles.
- NCSTRL - Networked Computer Science Technical Reference Library - {http://csetechrep.ucsd.edu/Dienst/htdocs/Welcome.html}
- NCSTRL (pronounced "ancestral") is an international collection of technical reports from a selection of participating computer science and computer engineering departments and industrial and government research laboratories, made available for noncommercial and educational use. Searchable by keyword, author, or title.
- PrePrint Network from the Department of Energy - {http://www.osti.gov/preprints/}
- The Department of Energy funds a great deal of bioinformatics research
at US universities. They are particularly interested in protein structure,
DNA repair of radiation damage, and bioremediation of polluted sites. The
Preprint Network is the gateway to preprints in disciplines of interest to
the DOE, including bioinformatics. The Network is a metasearch engine that
searches across a number of preprint and technical report collections,
including the Networked Computer Science Technical Reference Library
(NCSTRL), among others. They also offer an update service that will e-mail
you when new resources are added in your area of interest. "The Preprint
Network is one leg of a triad of electronic products for the science
information consumer. We also offer PubSCIENCE
({http://www.osti.gov/pubscience}), a gateway to journal literature, and the DOE Information Bridge (http://www.osti.gov/bridge), an on-line access route to full-text technical report literature of the Department of Energy."
see also
Definitions, Glossaries, and Dictionaries
- Bioinformatics Frequently Asked Questions -
http://bioinformatics.org/FAQ/
- This is a scholarly yet pragmatic FAQ (filled with what the author calls "blunt
opinions") that is rich in useful information and advice. It is written by Damian
Counsell of the Institute of Cancer Research, UK, though he cautions readers that
the FAQ doesn't represent the ICR's views. It begins with a wonderful overview of
the field that helps put all the major pieces (definitions, programs, databases)
into perspective. He goes on to answer questions about finding resources in the
field, questions about careers and jobs, and many practical questions like "how can
I align two sequences?," "how can I predict the function of a gene?," and "how do I
write this up?" It is still a work in progress and the lists of books are now a bit
dated, but nevertheless the FAQ is highly recommended reading.
- The Bioinformatics Resource (TBR): Tutorials -
{http://www.hgmp.mrc.ac.uk/CCP11/directory_tutorials.jsp?Rp=20}
- This brand new database (launched January 25, 2002) covers a wide
range of topics, contains substantial numbers of records, and is both
searchable by keyword and browsable by topic. Sixty-three tutorials are
currently cataloged. The list is very heavily weighted towards university
course web pages, yet there are some real gems in here. TBR is the website
of the CCP11 project (Collaborative Computational Project 11). CCP11 was
established to foster bioinformatics in the UK research community, thus
explaining the high number of UK resources listed.
- Crystallography 101 - {http://www-structure.llnl.gov/Xray/101index.html}
- Crystallography is important for the study of protein structures and
bioinformatics is much concerned with the prediction and modeling of protein
folding and structure. This is a substantial and well written tutorial on the
subject by Bernhard Rupp, Professor of Molecular Structural Biology and Head of the
Macromolecular Crystallography Group at the Lawrence Livermore National Laboratory.
- MIT Biology Hypertextbook -
{http://esg-www.mit.edu:8001/esgbio/7001main.html}
- This is the often cited, extensive and well illustrated basic
introductory molecular biology text that is used as a supplement to
courses at MIT. It is arranged in chapters covering all the basic topics
(such as cell biology, enzyme biochemistry, recombinant DNA), and includes
a searchable index and practice problems.
- NCBI: Education Page - {http://www.ncbi.nlm.nih.gov/Education}
- Online education materials from the National Center for Biotechnology
Information. Includes online tutorials for the BLAST search program and some of
the Entrez Databases (PubMed, Nucleotides, Structures), as well as a useful essay
on similarity searching and glossary of terms related to sequence searching.
- NCBI: Medical Library Association's CE Course Manual: Molecular Biology
Information Resources -
http://www.ncbi.nlm.nih.gov/Class/MLACourse/
- This is Renata McCarthy's manual from the excellent full-day continuing
education course for librarians on molecular biology information resources. The
online notes, links and examples are very helpful even without taking the class in
person, which is fortunate, since this course will no longer be
offered in locations throughout the U.S. Due to the complexity of the material, the
course is being expanded to three days and will be offered several times per year
only at the National Library of Medicine, starting in early 2002. Once the revised
course is available, this page will contain a link to the new course web page,
which will include a schedule of course dates and registration information.
- Protein Data Bank (PDB): Education Resources -
{http://www.rcsb.org/pdb/static.do?p=general_information/news_publications/newsletters/educationcorner.html}
- A nicely organized directory of high quality educational sites related to
proteins and nucleic acids, as well as pointers to tutorials on using the PDB
itself. There is a section called "protein documentaries" that lists multimedia
sites (VRML, RealPlayer and/or Chime plug-ins required) and an excellent selection
of molecular modeling resources in the section called "Other Educational
Resources." Also worth visiting is the link to "Links" in the upper right under
"Other Information Resources" that takes you to their "Macromolecular Structure
Related Resources" page which is a comprehensive web directory of its own.
- Science Magazine: Functional Genomics Educational Resources -
http://www.sciencemag.org/feature/plus/sfg/education/index.shtml
- This site has a lot to recommend it. A "film festival" section provides RealTime
movies and webcasts of press conferences and lectures. The glossary section has
already been recommended earlier in this guide. There is an annotated list of "Ten
Great Educational Websites" (which are very cool, though most seem to be aimed at
the high school level) plus an education site of the month. And not to be missed
are the three sites in the "A Little Base (Pair) Humor" section: Cartoonists'
views of the Human Genome Project from Slate magazine, the DNA-O-Gram which allows
you to send a nucleotide-encoded message to a friend, and Swiss-Jokes -- "The
infamous random sampler of helvetian humor from ExPASy."
Baxevanis, A. D. & Ouellette, B. F. F. (eds). 2001.
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins,
Methods of Biochemical Analysis, vol. 43, 2nd ed., New York: John Wiley &
Sons, Inc.
Benson, D. A., et al. 2000. GenBank. Nucleic Acids
Research 28(1): 15-18.
Berman, H. M., et al. 2000. The protein data bank. Nucleic
Acids Research 28(1): 235-242.
Gibas, C. & Jambeck, P. 2001. Developing Bioinformatics
Computer Skills. Sebastopol, CA: O'Reilly & Associates, Inc.
Nucleic Acids Research. 2002. 30(1). [Online.]
Available: {http://nar.oxfordjournals.org/content/vol30/issue1/}
[January 25, 2002].
Schuler, G. D. 1997. Pieces of the puzzle: expressed sequence tags
and the catalog of human genes. Journal of Molecular Medicine 75(10):
694-698.
Wang, Y., et al. 2000. MMDB: 3D structure data in Entrez.
Nucleic Acids Research 28(1): 243-245.
Wheeler, D. L., et al. 2001. Database resources of the National
Center for Biotechnology Information. Nucleic Acids Research 29(1):
11-16.
Counsell, Damian. 2001.
Bioinformatics FAQ. [Online]. Available:
http://bioinformatics.org/faq/ [January
16, 2002].
Doernberg, D. 1993.
Computer Literacy Interview With Donald Knuth. [Online].
Available:
{http://www1.fatbrain.com/interviews/knuth_interview.html}
[November 19, 2001].
Lawrence, S. 2001. Online or Invisible? [Online]. Available: {http://citeseer.ist.psu.edu/online-nature01/}
[November 19, 2001].
National Center for Biotechnology Information
(NCBI). 2001. NCBI Education Site. [Online]. Available:
http://www.ncbi.nlm.nih.gov/Education/
[November 19, 2001].
Nucleic Acids Research. 2002. 30(1).
[Online]. Available: {http://nar.oxfordjournals.org/content/vol30/issue1/}
[January 25, 2002].