next up previous
Next: Pairwise sequence alignment Up: Introduction to Sequence Analysis Previous: What is EMBOSS?

Subsections

   
Working with sequences

Throughout this tutorial, we're going to look at members of the rhodopsin family of G-protein coupled receptors. The general principles are, of course, applicable to any sequences you would like to analyse. We will be working with sequences retrieved from EMBL and SwissProt but you can also use EMBOSS with sequences in text files.

We will begin with two EMBL sequences whose identifiers are XL23808 and XLRHODOP; these sequences are the genomic and the corresponding cDNA sequence for Xenopus laevis rhodopsin.

You need to tell EMBOSS where to read the sequence(s) you want to analyse. EMBOSS can read sequences either from text files or directly from a sequence database. The easiest way to see this is with examples.

Retrieving sequences from databases

The EMBOSS programs can read sequences from various sequence databases provided the sequence is referred to in the form database:entry. This format is known as a USA (Uniform Sequence Address). Further information on USA's can be found on the EMBOSS website. You can see the databases we have set up for you using the program showdb:

Exercise: showdb

As an example, here are the first few databases available using EMBOSS at the HGMP. Your local site will probably have a different selection of databases depending on what the local EMBOSS maintainer has set up.

unix % showdb
Displays information on the currently available databases

#Name Type ID Qry All Comment
#==== ==== == === === =======
nbrf P OK OK OK PIR/NBRF
pir P OK OK OK PIR/NBRF
remtrembl P OK OK OK REMTREMBL sequences
sptrembl P OK OK OK SPTREMBL sequences
sw P OK OK OK SWISSPROT sequences
swissprot P OK OK OK SWISSPROT sequences
trarc P OK OK OK TREMBL ARC sequences
trembl P OK OK OK TREMBL sequences
tremblnew P OK OK OK New TREMBL sequences


showdb writes out a simple table displaying the names, contents and access methods for the databases.

ID allows programs to extract a single explicitly named entry from the database, for example: embl:x13776
Query indicates that programs can extract a set of matching wildcard entry names. For example: swissprot:pax*_human
All allows programs to analyse all the entries in the database sequentially. For example: embl:*

You can access EMBL by either identifier eg xlrhodop, or accession number eg L07770. Let's try these now.

seqret

seqret reads in a sequence, and writes it out, in effect being the EMBOSS equivalent of readseq. It is probably the most commonly used EMBOSS program.

Exercise: seqret

unix % seqret
Reads and writes (returns) a sequence
Input sequence: embl:xlrhodop
Output sequence [xlrhodop.fasta]:

unix % more xlrhodop.fasta

>XLRHODOP L07770 Xenopus laevis	rhodopsin mRNA,	complete cds.
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
$ \vdots$

Now let's retrieve the sequence using its the accession number:

unix % seqret
Reads and writes (returns) a sequence
Input sequence: embl:L07770
Output sequence [xlrhodop.fasta]: xlrhodop2.fasta

unix % more xlrhodop2.fasta

>XLRHODOP L07770 Xenopus laevis	rhodopsin mRNA,	complete cds.
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
$ \vdots$

You could also run this example entirely from the command line:

unix % seqret embl:xlrhodop -outseq xlrhodop.fasta

By default, seqret writes the sequence in fasta format. You can also tell it to use a different output format:

unix % seqret embl:L07770 -outseq gcg::xlrhodop.gcg

As an alternative to specifying the format in the output sequence USA you can use the -osformat qualifier. This command is identical in action to the previous one:

unix % seqret embl:L07770 -outseq xlrhodop.gcg -osformat gcg

unix % more xlrhodop.gcg

!!NA_SEQUENCE 1.0

Xenopus	laevis rhodopsin mRNA, complete	cds.

XLRHODOP  Length: 1684	Type: N	 Check:	9453 ..

    1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca

   51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
$ \vdots$

A list of the various formats that EMBOSS understands is given at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Reading sequences from files

EMBOSS can also read sequences from files. For example, if we wanted to reformat the fasta sequence we have downloaded into gcg format, we could say:

unix % seqret xlrhodop.fasta -outseq gcg::myseq.gcg

or

unix % seqret xlrhodop.fasta -outseq myseq.gcg -osformat gcg

Getting information about sequences

infoseq

infoseq is a small utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description. We can view this information for our sequence:

unix % infoseq embl:xlrhodop
Displays some simple information about sequences
# USA Name Accession Type Length GC Description
embl-id:XLRHODOP XLRHODOP L07770 N 1684 45.72 X.laevis rhodopsin

Sequence annotation

Sequence databases do not just contain sequences, they also contain a great deal of associated information (annotation) about the sequence entries. By default EMBOSS does not return all this information when you run seqret.

To retrieve the full entry for a sequence in it's original database form you can use the utility entret.

unix % entret embl:xl23808
Reads and writes (returns) flatfile entries
Output file [xl23808.entret]:

unix % more xl23808.entret
ID XL23808 standard; DNA; VRT; 4734 BP.
XX
AC U23808;
XX
SV U23808.1
XX
DT 23-APR-1995 (Rel. 43, Created)
DT 04-MAR-2000 (Rel. 63, Last updated, Version 7)
XX
DE Xenopus laevis rhodopsin gene, complete cds.
XX
KW .
XX
OS Xenopus laevis (African clawed frog)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Amphibia;
OC Batrachia; Anura; Mesobatrachia; Pipoidea; Pipidae; Xenopodinae; Xenopus.
XX
$ \vdots$

There is a lot of information here. Near the bottom, just above the sequence itself is a list of features associated with the sequence. A feature is any defined region of the sequence that has a particular description associated with it. We can view a simple graphical overview of the sequence features using the utility showfeat:

unix % showfeat embl:xl23808
Show features of a sequence.
Output file [xl23808.showfeat]:

unix % more xl23808.showfeat

XL23808
Xenopus laevis rhodopsin gene, complete cds.
|==========================================================| 4734
|----------------------------------------------------------> source
             |----->                                         mRNA
               |--->                                         CDS
                       |->                                   CDS
                       |->                                   mRNA
                                |->                          CDS
                                |->                          mRNA
                                      |-->                   CDS
                                      |-->                   mRNA
                                                  |>         CDS
                                                  |------->  mRNA
To retrieve a sequence with all its features we can use the program seqret with the option seqret -feature. This has some properties of which to be aware.

unix % seqret -feature embl:xl23808
Reads and writes (returns) one or more sequences
Output sequence [xl23808.fasta]:

This looks pretty much like running seqret except that a second file has been created, unknown.gff. Take a look at this file:

unix % more unknown.gff

##gff-version 2.0
##date 2003-02-21
##Type DNA XL23808
XL23808 EMBL source 1    4734 0.000 + . Sequence "XL23808.1" ; db_xref \
    "taxon:8355" ; organism "Xenopus laevis"
XL23808 EMBL  mRNA  1181 1650 0.000 + . Sequence "XL23808.2" ; FeatFlags \
    "0x100" ; product "rhodopsin" 
XL23808 EMBL mRNA   1899 2067 0.000 + . Sequence "XL23808.2" ; FeatFlags \
    "0x104"
XL23808 EMBL mRNA   2669 2834 0.000 + . Sequence "XL23808.2" ; FeatFlags \
    "0x104"
$ \vdots$

This is a list of the features in the database entry in GFF (General Feature Format). You can find out more about feature formats on the EMBOSS web site.
In order to change the feature formats and filename we need to use associated qualifiers when running seqret -feature. Lets save the features in EMBL format in the file rhodop.features:

unix % seqret -feature embl:xl23808 -offormat embl -ofname rhodop.features
Reads and writes (returns) one or more sequences
Output sequence [xl23808.fasta]:

And the file appears as expected
.

We could use a Uniform Feature Object (UFO) instead of the -offormat and -ofname qualifiers.

unix % seqret -feature embl:xl23808 -oufo embl::rhodop.features

Using multiple sequences

EMBOSS programs can also deal with multiple sequences. A quick search using SRS will tell you that the SwissProt sequence corresponding to the EMBL sequence we've been looking at has the identifier OPSD_XENLA. To retrieve the information about all the other OPSD sequences in SwissProt we can use the wild card character:

unix % infoseq
Displays some simple information about sequences
Input sequence(s): sw:opsd_*
# USA Name Accession Type Length Description
sw-id:OPSD_ABYKO OPSD_ABYKO O42294 P 289 RHODOPSIN (FRAGMENT).
sw-id:OPSD_ALLMI OPSD_ALLMI P52202 P 352 RHODOPSIN.
sw-id:OPSD_AMBTI OPSD_AMBTI Q90245 P 354 RHODOPSIN.
sw-id:OPSD_ANGAN OPSD_ANGAN Q90214 P 352 RHODOPSIN, DEEP-SEA
sw-id:OPSD_ANOCA OPSD_ANOCA P41591 P 352 RHODOPSIN.
sw-id:OPSD_APIME OPSD_APIME Q17053 P 377 RHODOPSIN.
sw-id:OPSD_ASTFA OPSD_ASTFA P41590 P 352 RHODOPSIN.
sw-id:OPSD_BATMU OPSD_BATMU O42300 P 289 RHODOPSIN (FRAGMENT).
sw-id:OPSD_BATNI OPSD_BATNI O42301 P 289 RHODOPSIN (FRAGMENT).
sw-id:OPSD_BOVIN OPSD_BOVIN P02699 P 348 RHODOPSIN.


We can also use the wild card character on the command line, but here we must enclose the specification in quotation marks:

unix % infoseq ``sw:opsd_*''

You can use seqret to retrieve multiple sequences into a file; for exmaple:

unix % seqret ``sw:opsd_a*'' -outseq opsd_a.seqs

retrieves all the sequences whose identifiers start ``opsd_a'' into a file called opsd_a.seqs. If we wanted to have each sequence in a separate file, we could type:

unix % seqret ``sw:opsd_a*'' -ossingle

Filenames are generated based on the identifiers of the sequences.

Listfiles

It is also possible to use list files within EMBOSS. Instead of containing the sequences themselves, a list file contains "references" to sequences - so, for example, you might include database entries, the names of files containing sequences, or even the names of other list files. You'll need to use a text editor such as pico to create the appropriate list files if you'd like to try this yourself.

Here's an example of a valid list file, called seq.list:

opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list

If you have created this file then you can read it using:

unix % more seq.list

This may look a bit odd, but it's really very straightforward; the file contains:

Notice the @ in front of the last entry. This is the way you tell EMBOSS that this file is a list file, not a regular sequence file. As an alternative to using the @ you can write list:filename instead. Let's demonstrate this by using this file as the input to seqret and get the sequences into a new file, perhaps for use in a multiple sequence alignment (see Section 5.3).

First of all, we'll make the file opsd_abyko.fasta using seqret:

unix % seqret sw:opsd_abyko -outseq opsd_abyko.fasta

Now let's look at another_list. Note that its structure is very similar to that of seq.list but this time only contains database references:

sw:opsd_anoca
sw:opsd_apime
sw:opsd_astfa

After you have created this file you will be able to view it using

unix % more another_list

Finally, let's run seqret with seq.list (not forgetting the @ sign) and look at the results:

unix % seqret @seq.list -outseq outfile

unix % more outfile

>OPSD_ABYKO O42294 RHODOPSIN (FRAGMENT).
YLVNPAAYAALGAYMFLLILIGFPINFLTLYVTLEHKKLRTPLNYILLNLAVANLFMVLG
GFTTTMYTSMHGYFVLGRLGCNLEAFFATLGGEIALWSLVVLAIERWIVVCKPISNFRFT
EDHAIMGLAFTWVMALACAVPPLVGWSRYIPEGMQCSCGVDYYTRAEGFNNESFVIYMFI
VHFLIPLSVIFFCYGRLLCAVKEAPAAQQESETTQRAEKEVSRMVVIMVIGFLVCWLPYA
SVAWWIFCNQGSDFGPIFMTLPSFFAKSAAIYNPMIYICMNKQFRHCMI
>OPSD_XENLA P29403 RHODOPSIN.
MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMFLLILLGLPINFMTLF
VTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFIFGPTGCYIEGFFATLG
GEVALWSLVVLAVERYIVVCKPMANFRFGENHAIMGVAFTWIMALSCAAPPLFGWSRYIP
EGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFCYGRLLCTVKEAAAQQQES
LTTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGSNFGPVFMTVPAFFAKSSAI
YNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSKTEASSVSSSQVSPA
>OPSD_CAMAB Q17292 RHODOPSIN.
MMSIASGPSHAAYTWASQGGGFGNQTVVDKVPPEMLHMVDAHWYQFPPMNPLWHALLGFV
IGVLGVISVIGNGMVIYIFTTTKSLRTPSNLLVVNLAISDFLMMLCMSPAMVINCYYETW
VLGPLFCELYGLAGSLFGCASIWTMTMIAFDRYNVIVKGLSAKPMTINGALIRILTIWFF
TLAWTIAPMFGWNRYVPEGNMTACGTDYLTKDLFSRSYILIYSIFVYFTPLFLIIYSYFF
IIQAVAAHEKNMREQAKKMNVASLRSAENQSTSAECKLAKVALMTISLWFMAWTPYLVIN
YSGIFETTKISPLFTIWGSLFAKANAVYNPIVYGISHPKYRAALFQKFPSLACTTEPTGA
DTMSTTTTVTEGNEKPAA
>OPSD_CAMHU O18312 RHODOPSIN (FRAGMENT).
LHMIHLHWYQYPPMNPMMYPLLLIFMLFTGILCLAGNFVTIWVFMNTKSLRTPANLLVVN
LAMSDFLMMFTMFPPMMVTCYYHTWTLGPTFCQVYAFLGNLCGCASIWTMVFITFDRYNV
IVKGVAGEPLSTKKASLWILSVWVLSTAWCIAPFFGWNHYVPEGNLTGCGTDYLSEDILS
RSYLYIYSTWVYFLPLAITIYCYVFIIKAVAAHEKGMRDQAKKMGIKSLRNEEAQKTSAE
CRLAKNAMTTVALWFIAWTPCLLINWVGMFARSYLSPVYTIWGYVFAKANAVYNPIVYAI
S
$ \vdots$

Note that the output file contains all the sequences we specified in seq.list, as we had expected.


next up previous
Next: Pairwise sequence alignment Up: Introduction to Sequence Analysis Previous: What is EMBOSS?
Gary Williams
2003-04-29