BecA Bioinformatics

BLAST database search

BLAST: Basic Local Alignment Search Tool.

The BLAST program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.
Establishing links between observed sequence variation and gene function is a major challenge when analyzing transcriptome data from non-model organisms. Here, the Basic Local Alignment Search Tool (BLAST) is used to compare your de novo assembled contigs to sequence databases in order to annotate them with similarity to known genes/proteins/functions. BLAST is a toolkit developed by the National Center for Biotechnology Information (NCBI), the US-based organization responsible for archiving and databasing the world's genetic sequence information.
BLAST homepage

A note on E-values

To determine whether matches to the databases are "significant", we use a threshold E-value. The E-value describes the number of hits one can expect to see by chance when searching a database of a particular size.
The lower the E-value, the more "significant" a match to a database sequence is (i.e. there is a smaller probability of finding a match just by chance).
However, the quality of a match also depends on the length of the alignment and the percentage similarity, so these statistics may also be considered when evaluating the significance of a match.
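
To make the dependence on database size concrete, here is a minimal sketch based on the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score of the alignment, m the query length and n the total length of the database. The function name and the numbers are illustrative assumptions. BLAST itself uses edge-corrected "effective" lengths, so the E-values it reports will differ somewhat from this approximation.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2**(-S'), using raw (not effective) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


# The same alignment (bit score 40, 200-residue query) against two database sizes.
small_db = expected_hits(40, 200, 1_000_000)      # database of 1 million residues
large_db = expected_hits(40, 200, 100_000_000)    # database of 100 million residues

print(f"E-value, small database: {small_db:.2e}")
print(f"E-value, large database: {large_db:.2e}")
```

The E-value grows in proportion to the database size, so the same alignment is 100 times less significant against the larger database. This is one reason why E-values from searches against different databases are not directly comparable.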

BLAST programs

Guide: how to do a protein sequence search

Program: BLASTP

Enter a protein sequence in the sequence window, for example the one below. Make sure to write the sequence in FASTA format (recall: the name line begins with ">").

>unknown_protein
MATGSRTSLLLAFGLLCLPWLQEGSAFPTIPLSRLFDNAMLRAHRLHQLAFDTYQEFE
EAYIPKEQKYSFLQNPQTSLCFSESIPTPSNREETQQKSNLELLRISLLLIQSWLEPV
QFLRSVFANSLVYGASDSNVYDLLKDLEEGIQTLMGRLEDGSPRTGQIFKQTYSKFDT
NSHNDDALLKNYGLLYCFRKDMDKVETFLRIVQCRSVEGSCGF

You can copy and paste this sequence.
Select "Show results in a new window", then search the database by clicking the big BLAST button.

Under "Algorithm parameters", at the bottom of the page, you can adjust optional parameters such as the E-value threshold, the maximum number of target sequences, etc.
- Based on the hits, can you predict the function of this unknown protein sequence?
Predicting the function of uncharacterised proteins by finding similar, known proteins in the database is probably the single most important bioinformatics method!
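
As a small illustration of the FASTA format used for the query above, the following sketch reads FASTA text into a dictionary of name and sequence. The function name `parse_fasta` is a hypothetical helper, not part of BLAST, and the example record is only the first 40 residues of the sequence above, deliberately broken up with spaces and line breaks.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence.

    Whitespace inside the sequence (line breaks, spaces) is dropped.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Name line: keep the first word after ">" as the identifier.
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty name line")
            name = fields[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before a '>' name line")
        else:
            records[name] += "".join(line.split())
    return records


example = """>unknown_protein
MATGSRTSLLLAFGLLCLPW LQEGSAFPTI
PLSRLFDNAM
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

A check like this is a quick way to confirm that a file really starts with a ">" name line before submitting it to BLAST.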

BLAST FOR BEGINNERS

Click here for a quick guide and introduction to BLAST [courtesy of Sandra Porter; digitalworldbiology]

Command line BLAST

While the NCBI web-based BLAST is graphical, i.e. it presents results as intuitive, color-coded graphics according to the score of the alignment, its reliance on an internet connection is a drawback. It is also slow and tedious when processing many sequences or contigs.

Download the sequences against which you will search your query (this may be at kingdom, phylum, class, order, family, genus, or species level).
Format the sequences as a database:

makeblastdb -in DATABASENAME.fasta -dbtype prot -out DATABASENAME

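-in: the input FASTA file containing the sequences to format

-dbtype: the molecule type of the database, either nucl (nucleotide) or prot (protein)

-out: the base name for the output database files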
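
When many databases need to be formatted, the makeblastdb call above can be assembled from a script. The sketch below is one possible way to do this in Python. The helper name `makeblastdb_command` is hypothetical, and only the three flags shown in the command above are used. The command is executed only if makeblastdb is found on the PATH and the input file exists.

```python
import os
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Build the argument list for makeblastdb, using the flags shown above."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


cmd = makeblastdb_command("DATABASENAME.fasta", "DATABASENAME")
print(" ".join(cmd))

# Only attempt to run if BLAST+ is installed and the input file is present.
if shutil.which("makeblastdb") and os.path.exists("DATABASENAME.fasta"):
    subprocess.run(cmd, check=True)
```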

Getting help with the BLAST programs

blastx -help

Running a query search:

blastx -query YOURASSEMBLY.fasta -db DBNAME -out YOURASSEMBLY_BLASTX2DBNAME -outfmt 6 -evalue 0.0001 -gapopen 11 -gapextend 1 -word_size 3 -matrix BLOSUM62 -num_threads 4
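
With -outfmt 6, BLAST writes one tab-separated line per hit; the default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. As a sketch of how such a table might be post-processed, the following keeps the highest-scoring hit per query below an E-value cutoff. The function name `best_hits` and the sample rows are made up for illustration; they are not real BLAST output.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-4):
    """Return the highest-bitscore hit per query with evalue <= max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows for illustration only.
sample = (
    "contig1\tprotA\t85.0\t120\t18\t0\t1\t360\t5\t124\t1e-60\t210\n"
    "contig1\tprotB\t40.0\t100\t60\t2\t10\t309\t1\t100\t2e-10\t62.0\n"
    "contig2\tprotC\t30.0\t50\t35\t1\t1\t150\t1\t50\t0.5\t25.0\n"
)

hits = best_hits(io.StringIO(sample))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

For a real search, the same function can be given an open file handle for the output file named in the command above (YOURASSEMBLY_BLASTX2DBNAME) instead of the in-memory sample.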

http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml