Welcome to the BecA bioinformatics page

This portal introduces bioinformatics resources and tools to help ABCF placements analyse and interpret their data.

Click here for Building Phylogenetic trees using CLC Main workbench

Introduction to phylogeny

Phylogenetic reconstruction is an attempt to discern the ancestral relationship of a set of sequences. It involves the construction of a tree, where the nodes indicate separate evolutionary paths, and the lengths of the branches give an estimate of how distantly related the sequences represented by those branches are.
Terminology:
  • topology: the branching order of species, independent of branch length;
  • OTUs (operational taxonomic units): whatever group of organisms is under consideration;
  • outgroup: an OTU included in the study for the explicit purpose of finding the root of the tree;
  • homoplasies: convergences of a particular character at a site.
There are several methods of constructing phylogenetic trees - the most common are:
  • distance methods
  • parsimony methods
  • maximum likelihood methods
All these methods can only provide estimates of what a phylogenetic tree might look like for a given set of data. Most good methods also provide an indication of how much variation there is in these estimates.

Distance methods:

Preferred for work with immunological data, frequency data, or data obtained by methods with some imprecision. Distance methods are very rapid and readily permit statistical tests such as bootstrapping. They derive a measure of similarity or difference between the input sequences.
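
As a rough illustration of how a distance measure can be derived from aligned sequences, the sketch below computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction. The function names are hypothetical, the choice of Jukes-Cantor as the correction is an assumption made for the example, and gap handling (skipping any site with a gap) is one simple convention among several.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gapped sites skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore sites where either sequence has a gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor correction for multiple hits: d = -(3/4) ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Pairwise corrected distances for a dict mapping name -> aligned sequence."""
    names = list(alignment)
    return {
        (x, y): jukes_cantor_distance(p_distance(alignment[x], alignment[y]))
        for x in names
        for y in names
    }
```

For example, distance_matrix({"x": "ACGT", "y": "ACGA"}) gives a corrected distance between x and y that is slightly larger than the raw p-distance of 0.25, because the correction allows for multiple substitutions at the same site.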

UPGMA

A clustering algorithm. It links the least different pairs of sequences sequentially, so that once a pair is formed it is treated as a single entity. (Invalid) assumptions made: 1. The rate of change is equal among all sequences. 2. Branch lengths correlate with the expected phenotypic distance between sequences, which corresponds to a proportional measure of time.
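
The clustering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular package: the function name upgma, the input layout (distances keyed by index pairs), and the output layout (nested tuples of left subtree, right subtree, node height) are all hypothetical choices made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs with UPGMA.

    names: list of OTU labels.
    dist:  dict mapping (i, j) index pairs to the distance between names[i] and names[j].
    Returns a nested tuple (left, right, height) with leaf labels at the tips.
    """
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> number of leaves
    d = {tuple(sorted(pair)): value for pair, value in dist.items()}
    next_id = len(names)

    while len(trees) > 1:
        # Link the least different pair of clusters.
        i, j = min(combinations(sorted(trees), 2), key=lambda pair: d[pair])
        merged = (trees[i], trees[j], d[(i, j)] / 2.0)
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the distances from its two members.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = d[tuple(sorted((i, k)))]
            d_jk = d[tuple(sorted((j, k)))]
            d[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        trees[next_id] = merged
        sizes[next_id] = sizes[i] + sizes[j]
        for old in (i, j):
            del trees[old]
            del sizes[old]
        next_id += 1

    return next(iter(trees.values()))
```

Placing each node at half the distance between the clusters it joins is where the equal-rate assumption enters: every tip ends up the same distance from the root.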

Neighbor Joining

Corrects several assumptions made in the UPGMA method. Yields an unrooted tree.

Fitch and Margoliash

Does not try to find pairs of least different sequences, but tries to find trees that fulfil an optimum criterion. Yields an unrooted tree.

Parsimony methods:

Popular for reconstructing ancestral relationships.

Maximum parsimony

Evaluates all possible trees and infers the number of evolutionary events implied by each topology. The preferred tree is the one that requires the minimum number of evolutionary changes to explain the observed data. Problems: the most parsimonious tree may not be unique; it is difficult to make valid statistical statements if there are many steps in a tree; branches with particularly rapid rates of change tend to attract one another, especially when the sequences are short.

Maximum likelihood methods:

Very slow. Preferred when homoplasies (convergences of a particular character at a site) are expected to be concentrated in a few sites only, whose identities are known in advance.

The method works by estimating, for all nucleotide positions in a sequence, the probability of having a particular nucleotide at a particular site, based on whether or not its ancestors had it (and on the transition/transversion ratio). These probabilities are summed over the whole sequence, for both branches of a bifurcating tree. The product of the two probabilities gives the likelihood of the tree up to this point. With more sequences, the estimation is done recursively at every branch point. Since each site is assumed to evolve independently, the likelihood of the phylogeny can be estimated at every site.

This process can only be done in a reasonable amount of time with four sequences. If there are more than four sequences, basic trees can be made for sets of four sequences, and then extra sequences are added to the tree and the maximum likelihood is re-estimated. The order in which the sequences are added and the initial sequences chosen to start the process critically influence the resulting tree. To prevent any bias, the whole process is done multiple times with random choices for the order of the sequences. A majority rule consensus tree is then chosen as the final tree.

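
As a rough illustration of the parsimony criterion described under maximum parsimony, the sketch below counts the minimum number of changes a given topology requires, using Fitch's small-parsimony procedure. Fitch's procedure is a technique chosen for this example rather than one named on this page, and the function names and the nested-tuple tree layout are hypothetical. It scores one topology at a time; searching over topologies is a separate step.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree:   a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> character observed at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can share an ancestral state: no extra change needed.
        return common, left_cost + right_cost
    # Otherwise at least one change is required at this branch point.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Total minimum number of changes over all sites for a fixed topology."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total
```

With the alignment {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}, the topology (("A", "B"), ("C", "D")) needs fewer changes than (("A", "C"), ("B", "D")), so parsimony would prefer the former.
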
To create a phylogenetic tree, you must first have an alignment. This can be created using ClustalW. ClustalW can also create a tree file for you if you choose 'nj', 'phylip', or 'dist' from the "Tree type" pull-down menu. However, you have more control over the tree if you only create an alignment in ClustalW (do not choose a tree type in this case, because then the alignment itself will not be presented).
Copy the alignment, including the title so that the PHYLIP programs recognise the alignment format as ClustalW, and paste it into the text-entry box provided for alignments in one of the programs in the PHYLIP suite.
PHYLIP will convert your alignment to PHYLIP format automatically. Occasionally, especially when the alignment is very large, this automatic conversion may cause errors.
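
If the automatic conversion fails, the alignment can be written out in PHYLIP format by hand. The sketch below is a minimal example, assuming the alignment has already been read into a Python dict of name to aligned sequence; the function name to_phylip is hypothetical. It writes the strict sequential layout: a header line with the number of sequences and the number of sites, then one line per sequence with the name occupying the first ten characters.

```python
def to_phylip(alignment):
    """Format a dict of name -> aligned sequence as strict sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be present and have the same aligned length")
    n_sites = lengths.pop()

    # Strict PHYLIP keeps only ten characters of each name, so check for clashes.
    short_names = [name[:10] for name in alignment]
    if len(set(short_names)) != len(short_names):
        raise ValueError("names are not unique after truncation to 10 characters")

    lines = [f" {len(alignment)} {n_sites}"]
    for short, seq in zip(short_names, alignment.values()):
        lines.append(f"{short:<10}{seq}")
    return "\n".join(lines) + "\n"
```

Long sequence names are a common source of conversion problems, which is why the sketch refuses to continue when two names become identical after truncation.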

Rooted vs Unrooted tree

Rooted phylogenetic trees can serve as a pathway to understanding evolutionary history. The pathway can be traced from the origin of life to any individual species by navigating through the evolutionary branches between the two points. Also, by starting with a single species and tracing back towards the "trunk" of the tree, one can discover that species' ancestors, as well as where lineages share a common ancestry. In addition, the tree can be used to study entire groups of organisms.
Another point to mention on phylogenetic tree structure is that rotation at branch points does not change the information. For example, if a branch point was rotated and the taxon order changed, this would not alter the information because the evolution of each taxon from the branch point was independent of the other.
Many disciplines within the study of biology contribute to understanding how past and present life evolved over time; together, these disciplines contribute to building, updating, and maintaining the "tree of life." Information is used to organize and classify organisms based on evolutionary relationships in a scientific field called systematics. Data may be collected from fossils, from studying the structure of body parts or molecules used by an organism, and by DNA analysis. By combining data from many sources, scientists can put together the phylogeny of an organism. Since phylogenetic trees are hypotheses, they will continue to change as new types of life are discovered and new information is learned.

Building phylogenetic trees using CLC Main Workbench

CLC Main Workbench is used for DNA, RNA, and protein sequence data analysis, including gene expression analysis, primer design, molecular cloning, phylogenetic analyses, and sequence data management, among a wide variety of other features.

Nucleotide substitution models

The use of maximum likelihood (ML) algorithms in developing phylogenetic hypotheses requires a model of evolution. The frequently used General Time Reversible (GTR) family of nested models encompasses 64 models with different combinations of parameters for DNA site substitution. The models are listed here from the least complex to the most parameter rich.

Jukes-Cantor (JC, nst=1): equal base frequencies, all substitutions equally likely (PAUP* rate classification: aaaaaa, PAML: aaaaaa) (Jukes and Cantor 1969)

Felsenstein 1981 (F81, nst=1): variable base frequencies, all substitutions equally likely (PAUP*: aaaaaa, PAML: aaaaaa) (Felsenstein 1981)

Kimura 2-parameter (K80, nst=2): equal base frequencies, one transition rate and one transversion rate (PAUP*: abaaba, PAML: abbbba) (Kimura 1980)

Hasegawa-Kishino-Yano (HKY, nst=2): variable base frequencies, one transition rate and one transversion rate (PAUP*: abaaba, PAML: abbbba) (Hasegawa et al. 1985)

Tamura-Nei (TrN): variable base frequencies, equal transversion rates, variable transition rates (PAUP*: abaaea, PAML: abbbbf) (Tamura and Nei 1993)

Kimura 3-parameter (K3P): equal base frequencies, equal transition rates, two transversion rates (PAUP*: abccba, PAML: abccba) (Kimura 1981)

Transition model (TIM): variable base frequencies, variable transition rates, two transversion rates (PAUP*: abccea, PAML: abccbe)

Transversion model (TVM): variable base frequencies, variable transversion rates, transition rates equal (PAUP*: abcdbe, PAML: abcdea)

Symmetrical model (SYM): equal base frequencies, symmetrical substitution matrix (A to T = T to A) (PAUP*: abcdef, PAML: abcdef) (Zharkikh 1994)

General time reversible (GTR, nst=6): variable base frequencies, symmetrical substitution matrix (PAUP*: abcdef, PAML: abcdef) (e.g., Lanave et al. 1984, Tavare 1986, Rodriguez et al. 1990)

In addition to models describing the rates of change from one nucleotide to another, there are models to describe rate variation among sites in a sequence. The following are the two most commonly used models.

gamma distribution (G): gamma distributed rate variation among sites

proportion of invariable sites (I): extent of static, unchanging sites in a dataset


Substitutions are themselves grouped hierarchically: simple, general base substitution, transitions and transversions, purine to purine and pyrimidine to pyrimidine transitions, and AC/GT and AT/CG transversions. The groupings are symbolized as rate classifications according to the PAUP* and PAML matrices below. Substitution types that are constrained to be equal in rate assume the leftmost letter symbol.

PAUP* substitution rate matrix     PAML substitution rate matrix
    A  C  G  T                         T  C  A  G
A   -  a  b  c                     T   -  a  b  c
C      -  d  e                     C      -  d  e
G         -  f=1                   A         -  f=1
T            -                     G            -
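
To show how a six-letter rate classification maps onto the PAUP* matrix, the sketch below groups the six substitution types by the letter they are assigned. The function names are hypothetical, and the pair order (AC, AG, AT, CG, CT, GT) is read off the PAUP* matrix above; the PAML matrix uses a different base order, so PAML strings would need their own pair list.

```python
# Substitution types in the order a-f of the PAUP* matrix (rows A, C, G; columns C, G, T).
PAUP_PAIRS = ["AC", "AG", "AT", "CG", "CT", "GT"]


def rate_classes(code, pairs=PAUP_PAIRS):
    """Group substitution types that share a letter in a rate classification string."""
    if len(code) != len(pairs):
        raise ValueError("a rate classification needs one letter per substitution type")
    classes = {}
    for letter, pair in zip(code, pairs):
        classes.setdefault(letter, []).append(pair)
    return classes


def n_rate_classes(code):
    """Number of distinct substitution rates implied by the classification."""
    return len(set(code))
```

For example, rate_classes("abaaba") puts the two transitions (AG and CT) in one class and the four transversions in another, which is the two-rate structure of K80 and HKY.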

Using Modeltest software to find best model

Note: This program has been superseded by jModelTest. Modeltest 3.7 (Posada and Crandall 1998) is a program that, in conjunction with PAUP*, selects the best-fit nucleotide substitution model for a set of aligned sequences. This model can then be implemented in maximum-likelihood and Bayesian phylogenetic analyses. The aim of this software is to facilitate comparisons between 56 alternative models using different criteria.

Model selection can be conducted on the basis of hierarchical likelihood ratio tests (hLRT), Akaike Information Criterion (AIC = -2 lnL + 2K; Akaike 1974), corrected AIC (AICc = AIC + 2K(K+1)/(N-K-1); Hurvich and Tsai 1989, Sugiura 1978) or Bayesian Information Criterion (BIC = -2 lnL + K log N; Schwarz 1978) [L = model likelihood, K = number of estimable parameters, N = sample size]. AIC can be interpreted as the amount of information lost when we use a particular model to approximate the real process of nucleotide substitution; thus, the model with the smallest AIC is favored. Given equal priors for each of the competing models, the model with the smallest BIC is equivalent to the model with the maximum posterior probability.
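
The three criteria are simple to compute once the log-likelihood of each model is known. The sketch below implements the formulas exactly as given above; the function names are hypothetical, and the log is taken as the natural logarithm.

```python
import math


def aic(ln_l, k):
    """AIC = -2 lnL + 2K."""
    return -2.0 * ln_l + 2.0 * k


def aicc(ln_l, k, n):
    """AICc = AIC + 2K(K+1)/(N-K-1); needs N > K + 1."""
    if n - k - 1 <= 0:
        raise ValueError("sample size must exceed K + 1 for AICc")
    return aic(ln_l, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(ln_l, k, n):
    """BIC = -2 lnL + K log N (natural logarithm)."""
    return -2.0 * ln_l + k * math.log(n)
```

For a model with lnL = -1000 and K = 5 (made-up values for illustration), aic(-1000, 5) is 2010.0; with N = 100 the AICc is a little larger and the BIC larger still, because both penalise parameters more heavily than AIC at this sample size.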

Documentation: 

For further information about Modeltest 3.7 look at the manual or go to the Modeltest web page. For a discussion on the advantages and disadvantages of different model selection approaches in phylogenetics, see Posada and Buckley (2004).

If you are interested in selection of best-fit models of evolution for protein sequence alignments, see Abascal et al. (2005).

Input Format: 
Instructions for All: 

Running Modeltest through a terminal window

  1. Format your data into a NEXUS file. You can use this example dataset (download).
  2. Execute the NEXUS file in PAUP*.
  3. Execute the modelblock file within PAUP* by typing:

    execute modelblock.txt;

    This file tells PAUP* to compute likelihood scores for each of 56 models on the same neighbor-joining tree. When the computations are over you will see an output file named model.scores in your home directory.

  4. Save this file under a different name which is specific to your project; otherwise, Modeltest will not work the next time you run it.
  5. To run the computed tree scores in Modeltest, type:

    modeltest3.7 < infile > outfile1

    infile is the name of your input file — remember to change it from model.scores to something specific — and outfile1 is the name of your output file.

  6. By default, Modeltest will select the best-fit nucleotide substitution model using the likelihood ratio test and the AIC. Modeltest 3.7 also allows model selection based on the AICc and BIC. To do this, you must specify this option and also specify the sample size. Sample size for an alignment of DNA sequences is a difficult concept as it will depend on the number of characters, the number of taxa, and their correlation. You could specify the number of characters or the number of characters times the number of taxa, but probably none of these options is correct most of the time.

  7. To run the computed tree scores in Modeltest implementing AICc model selection, type:

    modeltest3.7 -n100 < infile > outfile2

    Replace 100 with your sample size.

  8. To run the computed tree scores in Modeltest implementing BIC model selection, type:

    modeltest3.7 -b -n100 < infile > outfile3

    Replace 100 with your sample size.

Although Modeltest will automatically create command blocks that can be pasted directly into PAUP* to set the parameters for maximum-likelihood analyses, it is best to first carefully interpret the results generated by the program. Note that hLRT, AIC, AICc and BIC may select different models; choosing among them is up to the user.

An important additional issue is taking into account the uncertainty in model selection. The output of Modeltest allows examining uncertainty on the basis of the AIC differences (deltas, or rescaled AICs), and the normalized relative AIC for each model (AIC weights). For cases in which support for a particular model is not overwhelming, users may want to consider the implementation of model averaging, a procedure that allows drawing inferences from several models simultaneously. By default, Modeltest 3.7 calculates model averaged estimates of parameters. This is accomplished by estimating parameters for each model and then averaging the estimates according to how likely each model is (i.e., based on Akaike weights).
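
The AIC differences, Akaike weights, and a model-averaged parameter estimate can be reproduced by hand as a check on the Modeltest output. The sketch below is an illustration under assumptions: the function names are hypothetical, the weights use the standard form exp(-delta/2) normalised over all models, and the averaging renormalises the weights over the models that actually include the parameter, which may differ in detail from what Modeltest 3.7 reports.

```python
import math


def akaike_weights(aic_scores):
    """Convert a dict of model -> AIC into a dict of model -> Akaike weight."""
    best = min(aic_scores.values())
    # AIC differences (deltas) relative to the best model.
    deltas = {model: score - best for model, score in aic_scores.items()}
    relative = {model: math.exp(-0.5 * delta) for model, delta in deltas.items()}
    total = sum(relative.values())
    return {model: value / total for model, value in relative.items()}


def model_average(estimates, weights):
    """Weighted average of a parameter over the models that estimate it.

    estimates: dict of model -> parameter estimate (only models that have the parameter).
    weights:   dict of model -> Akaike weight for all candidate models.
    """
    total_weight = sum(weights[model] for model in estimates)
    if total_weight == 0:
        raise ValueError("no weight on any model that includes this parameter")
    return sum(weights[model] * value for model, value in estimates.items()) / total_weight
```

When one model carries nearly all the weight, the averaged estimate is close to that model's own estimate; when the weights are spread out, the average reflects the uncertainty in model selection.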

Instructions for Windows: 

There is a tutorial available that has detailed instructions for running the Windows version of Modeltest.

Companion Programs: