What is EMBOSS?

Since 1988, the sequence analysis package EGCG has provided extensions to the market-leading commercial sequence analysis package GCG. EGCG development was a collaboration of groups within EMBnet and elsewhere.

EGCG provided support for core sequence activities at the Sanger Centre, and has been the basis of new sequence analysis software for internal use, as well as providing advanced features in use at approximately 150 sites, and for more than 10,000 users of EMBnet national services.

That project has reached the limits of what can be achieved using the GCG package. Specifically, it is no longer possible to distribute academic software source code which uses the GCG libraries, and it has become difficult even to distribute binaries.

As a result, the former EGCG developers have been designing a totally new generation of academic sequence analysis software. This has resulted in the present EMBOSS project.

So, what is EMBOSS?

EMBOSS is a new, free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. Because extensive libraries are provided with the package, it is also a platform that allows other scientists to develop and release software in the true open source spirit. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole, and so breaks the historical trend towards commercial software packages.

The EMBOSS suite:

Within EMBOSS you will find over 150 programs (applications). These are just some of the areas covered:

More information about EMBOSS can be found at
http://emboss.sourceforge.net/

Working with EMBOSS

How this tutorial is organised

We assume that you are familiar with basic Unix commands for manipulating files and directories. EMBOSS contains many more applications than we can describe in the time available. We will introduce some of these and also show you how to find out about the others. There are many exercises for you to try, and we'll present the results you will see so that you know all is going well. Please feel free to experiment with the programs! That is definitely the best way to learn what they can do.

Much of the text in this document is what you will see on your screen; the Unix prompt is represented as unix % - don't type this in! The commands you need to type are printed in bold. If no input is specified, just press return. Pressing return will also dismiss graphics windows. The symbol ⋮ means we have truncated the program output to save space.

wossname: a first EMBOSS application

All EMBOSS programs run from the Unix command line. We'll introduce the basics with a specific example: the EMBOSS utility wossname will produce a list of all the various EMBOSS applications.

Exercise: wossname

Type wossname at the unix % prompt:

unix % wossname

EMBOSS programs start up with a one-line description and then prompt you for information. In this case, after typing protein as the keyword, you see:

Finds programs by keywords in their one-line documentation
Keyword to search for: protein
SEARCH FOR 'PROTEIN'

antigenic Finds antigenic sites in proteins
backtranseq Back translate a protein sequence
checktrans Reports STOP codons and ORF statistics of a protein sequence
emowse Protein identification by mass spectrometry
digest Protein proteolytic enzyme or reagent cleavage digest
eprotdist Protein distance algorithm
eprotpars Protein parsimony algorithm
fuzzpro Protein pattern search
fuzztran Protein pattern search after translation
garnier GARNIER predicts protein secondary structure.
iep Calculates the isoelectric point of a protein
octanol Displays protein hydropathy
oddcomp Finds protein sequence regions with a biased composition
patmatdb Search a protein sequence database with a motif
patmatmotifs Search a motif database with a protein sequence
pepnet Displays proteins as a helical net
pepstats Protein statistics
pepwheel Shows protein sequences as helices
pepwindow Displays protein hydropathy
pepwindowall Displays protein hydropathy of a set of sequences
preg Regular expression search of a protein sequence
pscan Scans proteins using PRINTS
sigcleave Reports protein signal cleavage sites
topo Draws an image of a transmembrane protein


Many EMBOSS programs have additional, optional parameters that offer more functionality. As a rule, you can make the program prompt you for these by appending the flag -opt to the program name as follows:

unix % wossname -opt

You will now be presented with a variety of additional options. The default value for each option is given in square brackets, and you can either press return to accept the default, or enter the value you require:

Keyword to search for: protein
Output program details to a file [stdout]: myfile
Format the output for HTML [N]: Y
String to form the first half of an HTML link:
String to form the second half on an HTML link:
Output only the group names [N]:
Output an alphabetic list of programs [N]:
Use the expanded group names [N]:

With these responses (protein, myfile and Y, with return pressed at the remaining prompts), wossname writes the list of programs to a file called myfile, in HTML format ready for viewing in a web browser.
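
If you already know the answers you want to give, you can usually supply them as flags on the command line instead of typing them at the prompts. The sketch below assumes that your installation's wossname accepts the qualifiers -search, -outfile and -html, and the general -auto flag (which turns off prompting). These names are not shown in the session above, so check them with wossname -help before relying on them. The program is only run if wossname is found on your path.

```shell
# Values typed at the prompts in the -opt session above
keyword=protein
outfile=myfile

# Assumed qualifier names; confirm them with "wossname -help"
cmd="wossname -search $keyword -outfile $outfile -html -auto"

# Show the command line that would be run
echo "$cmd"

# Run it only when EMBOSS is actually installed
if command -v wossname >/dev/null 2>&1; then
    $cmd
fi
```

If the qualifier names match your installation, myfile should end up with the same HTML listing as the interactive session produces.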

To produce a list of all the current EMBOSS programs, start up wossname again but instead of specifying a keyword, press return. A list of programs will scroll onto your screen, divided up into groups according to their functions. Scroll up and down to see them all. Can you think of how to get this data into a file? (Hint: use -opt)

If you append the flag -help to the name of any EMBOSS program you will see a list of all the command flags available for this program. For example:

unix % wossname -help

We'll see some more flags later. Let's move on to some sequence analysis ...


Gary Williams
2003-04-29