Next: Protein analysis Up: Introduction to Sequence Analysis Previous: Working with sequences

Subsections

Pairwise sequence alignment

This chapter is about sequence similarity. Let us start with a warning: there is no unique, precise, or universally applicable notion of similarity. An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ. An optimal alignment, of course, is one that exhibits the most significant similarities, and the least differences. Broadly, there are three categories of methods for sequence comparison.

Segment methods compare all windows (overlapping segments of a predetermined length (e.g., 10 amino acids)) from one sequence to all segments from the other. This is the approach used in dotplots.
Optimal global alignment methods allow the best overall score for the comparison of the two sequences to be obtained, including a consideration of gaps.
Optimal local alignment algorithms seek to identify the best local similarities between two sequences but, unlike segment methods, include explicit consideration of gaps.

Dotplots

The most intuitive representation of the comparison between two sequences uses dot-plots. One sequence is represented on each axis and significant matching regions are distributed along diagonals in the matrix.

Exercise: Making a dotplot

unix % dottup
DNA sequence dot plot
Input sequence: embl:xl23808
Second sequence: embl:xlrhodop
Word size [4]: 10
Graph type [x11]:

A window will pop up on your screen that should look something like this:

$\begin{figure} \begin{center} \epsfig{figure=dottup.ps, height=5in,angle=90}\end{center}\end{figure}$

The diagonal lines represent areas where the two sequences align well. You can see that there are five clear diagonals. You will remember that we are aligning genomic and cDNA - these five diagonals represent the five exons of the gene! If you look at the original EMBL entry for the genomic sequence using SRS, you will see that the annotated entry says that there are five exons in this gene. So our results are in agreement.

The settings we have used for this example are those that give the best results. dottup looks for exact matches between sequences. As we expect the exon regions from the genomic sequence to exactly match the cDNA sequence we can use longer word lengths as we should still get exact matches. This gives a very clean plot. If you were to match the cDNA sequence against that of a related sequence, e.g. the rhodopsin from mouse (embl:m55171) then you wouldn't expect long exact matches so should use a shorter word length.

Exercise: Examining dotplot parameters

Repeat the previous example comparing the frog rhodopsin cDNA against the mouse genomic DNA:

unix % dottup embl:m55171 embl:xlrhodop
DNA sequence dot plot
Word size [4]: 10
Graph type [x11]:

Repeat this for both comparisons using different word sizes. What do you notice?
Which word sizes give the clearest plot?
Why are the diagonals in the first and last exons not so clear? (hint: look back at the results for showfeat)

The dotplot doesn't give us any detailed sequence information. For this, we need to use different programs. The algorithms we will be using are more rigorous than those used for searching databases; so even if you have retrieved a sequence from a database using something like BLAST, it will be well worth your while performing a careful pairwise alignment afterwards. The basic idea behind the sequence alignment programs is to align the two sequences in such a way as to produce the highest score - a scoring matrix is used to add points to the score for each match and subtract them for each mismatch. The matrices used for nucleic acid alignments tend to involve fairly simple match/mismatch scoring schemes, while the matrices commonly used for scoring protein alignments are more complex, with scores designed to reflect similarity between the different amino acids rather than simply scoring identities. Over time various mutations occur in sequences; the scoring matrices attempt to cope with mutations, but insertions and deletions require some extra parameters to allow the introduction of gaps in the alignment. There are penalties both for the creation of gaps and for the extension of existing ones; the default gap parameters given in alignment programs have been found to be empirically correct with test sequences but you should experiment with different gap penalties.

Global alignment

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program. The EMBOSS program needle is an implementation of the Needleman-Wunsch [] algorithm for global alignment; the computation is rigorous and needle can be time consuming to run if the sequences are long.

Exercise: `needle`

unix % needle
Needleman-Wunsch global alignment.
Input sequence:embl:xlrhodop
Second sequence:embl:xl23808
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output file [xlrhodop.needle]:

unix % more xlrhodop.needle

Global:	XLRHODOP vs XL23808
Score: 7471.00

XLRHODOP

XL23808		1	 cgtaactaggaccccaggtcgacacgacaccttccctttcccagt 45

XLRHODOP

XL23808		46	 tatttcccctgtagacgttagaaggggaaggggtgtacttatgtc 90

XLRHODOP

XL23808		91	 acgacgaactacgtccttgactacttagggccagagagacgaggt 135

$\vdots$

Note that as this is a global alignment, the entire genomic sequence is given in the output, even in regions where it does not line up with the cDNA. Scroll down the output until you reach an area of alignment.

XLRHODOP	1		   ggtagaacagcttcagttgggatcacaggcttcta 35
				    ||||||||||||||||||||||||||||||||||
XL23808		1171	 tgggtcatactgtagaacagcttcagttgggatcacaggcttcta 1215

XLRHODOP	36	 gggatcctttgggcaaaaaagaaacacagaaggcattctttctat 80
			 |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1216	 gggatcctttgggcaaaaaagaaacacagaaggcattctttctat 1260

XLRHODOP	81	 acaagaaaggactttatagagctgctaccatgaacggaacagaag 125
			 |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1261	 acaagaaaggactttatagagctgctaccatgaacggaacagaag 1305

XLRHODOP	126	 gtccaaatttttatgtccccatgtccaacaaaactggggtggtac 170
			 |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1306	 gtccaaatttttatgtccccatgtccaacaaaactggggtggtac 1350

$\vdots$

We've only shown part of the output as it is very long. You should look at the whole output and note that there are five aligned regions that represent the five exons as predicted from the dotplot.
Look carefully at the boundaries of the aligned regions. We know, as biologists, that intron/exon boundaries have a conserved gt..ag pair of dinucleotides delimiting the splice sites. It is most unlikely that needle has correctly aligned these boundaries as it has no understanding of models of gene structure. The scoring method it uses does not specifically treat splice sites. The program est2genome has an extra scoring factor that allows it to do a better job of aligning intron/exon boundaries.

Local alignment

As we mentioned above, global sequence alignment algorithms align sequences over their entire lengths. You do need to think about whether that type of alignment makes sense for your sequences. For our example, where we expect each exon to be represented in the sequences and in the same order, it has worked well - however, how well do you think this approach would work with, for example, multidomain proteins that share one domain but not others, or sequences where there have been regions of duplication? A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or when you do not know that the sequences are similar over their entire lengths. The EMBOSS program water is a rigorous implementation of the Smith Waterman algorithm for local alignments [].

Exercise: `water`

unix % water
Smith-Waterman local alignment.
Input sequence: embl:xlrhodop
Second sequence: embl:xl23808
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output file [xlrhodop.water]:

unix % more xlrhodop.water

Local: XLRHODOP	vs XL23808
Score: 7448.00

XLRHODOP	2     gtagaacagcttcagttgggatcacaggcttctagggatcctttg 46
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1182  gtagaacagcttcagttgggatcacaggcttctagggatcctttg 1226

XLRHODOP	47    ggcaaaaaagaaacacagaaggcattctttctatacaagaaagga 91
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1227  ggcaaaaaagaaacacagaaggcattctttctatacaagaaagga 1271

XLRHODOP	92    ctttatagagctgctaccatgaacggaacagaaggtccaaatttt 136
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1272  ctttatagagctgctaccatgaacggaacagaaggtccaaatttt 1316

XLRHODOP	137   tatgtccccatgtccaacaaaactggggtggtacgaagcccattc 181
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1317  tatgtccccatgtccaacaaaactggggtggtacgaagcccattc 1361

XLRHODOP	182   gattaccctcagtattacttagcagagccatggcaatattcagca 226
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1362  gattaccctcagtattacttagcagagccatggcaatattcagca 1406

XLRHODOP	227   ctggctgcttacatgttcctgctcatcctgcttgggttaccaatc 271
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1407  ctggctgcttacatgttcctgctcatcctgcttgggttaccaatc 1451

XLRHODOP	272   aacttcatgaccttgtttgttaccatccagcacaagaaactcaga 316
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1452  aacttcatgaccttgtttgttaccatccagcacaagaaactcaga 1496

XLRHODOP	317   acacccctaaactacatcctgctgaacctggtatttgccaatcac 361
		      |||||||||||||||||||||||||||||||||||||||||||||
XL23808		1497  acacccctaaactacatcctgctgaacctggtatttgccaatcac 1541

$\vdots$

Scroll down the entire output and again, note that five exons have been found.

In these cases we have not had to adjust the gap parameters from the defaults used in these programs. You should be aware that you may need to do so with your own sequences.

EMBOSS contains other pairwise alignment programs - stretcher and matcher are global and local alignment programs respectively that are less rigorous than needle and water and therefore run more quickly; they may be useful for database searching. supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation. The documentation pages for all these programs can be found at
http://emboss.sourceforge.net/apps/

Next: Protein analysis Up: Introduction to Sequence Analysis Previous: Working with sequences

Gary Williams
2003-04-29