===== MPI Blast =====
Parallel implementation of NCBI's BLAST algorithm.
* http://wiki.bioinformatics.ucdavis.edu/index.php/MPI_Blast
$ mpiformatdb -i drosoph.nt -p F --nfrags=12
* **nfrags** specifies how many fragments to split the original database into. This should equal the number of nodes you want to run mpiblast on.
===== Notes on .ncbirc =====
Notes on setting up the ''~/.ncbirc'' file from the mpiBLAST installation page: http://www.mpiblast.org/Docs/Install#unix
Before running mpiBLAST, it is necessary to configure the shared and local storage paths that each node will use to access the database. A shared storage path is usually a path to a directory residing on a file server, such as NFS, AFS, or samba. The local storage path is typically a subdirectory within the /tmp directory, e.g. /tmp/mpiblast. As worker nodes search the database, they will copy fragments to the local storage directory. During subsequent searches of the same database, the fragments will already reside in local storage and thus will not need to be copied. Note that diskless nodes can be supported by setting the local storage path to be the same as the shared storage path. To configure mpiBLAST create a .ncbirc file in your home directory that looks like:
[NCBI]
Data=/path/to/ncbi/data
[BLAST]
BLASTDB=/path/to/shared/storage
BLASTMAT=/path/to/ncbi/data
[mpiBLAST]
Shared=/path/to/shared/storage
Local=/path/to/local/storage
The Data variable gives the location of the NCBI Data directory containing BLOSUM and PAM scoring matrices, among other things. The scoring matrix files are necessary for any type of protein BLAST search and should be accessible by all cluster nodes. The BLASTMAT variable also specifies the path to the scoring matrices, and will usually be identical to the Data variable. The BLASTDB variable tells standard NCBI blastall (not mpiBLAST) where to find BLAST databases. As previously mentioned, the Shared and Local variables give the shared and local database paths, respectively. By setting BLASTDB to the same path as Shared, it is possible for NCBI blastall to share the same databases that mpiBLAST uses. In such a configuration, be sure to format all databases with mpiformatdb rather than formatdb.
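As an illustration only (these paths are assumptions, not the ones used on this cluster), a filled-in ''.ncbirc'' for a cluster that keeps the formatted databases on an NFS export and uses per-node scratch space might look like:
[NCBI]
Data=/export/blastdb/ncbi/data
[BLAST]
BLASTDB=/export/blastdb
BLASTMAT=/export/blastdb/ncbi/data
[mpiBLAST]
Shared=/export/blastdb
Local=/tmp/mpiblast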
===== Frequently Asked Questions =====
Collection of the more-helpful questions and answers from the [[http://www.mpiblast.org/Docs/FAQ|mpiBLAST FAQ]].
====How do I format a huge database?====
Large databases like nt can consume several gigabytes of disk space and it is preferable to store them in compressed form. Starting with mpiBLAST 1.4.0 it is possible to pipe FastA formatted sequence data into mpiformatdb. This feature provides the ability to directly format a compressed (gzip/bzip etc.) database using command line syntax like:
$ zcat nt.gz | mpiformatdb -i stdin -N 100 -t nt -p F
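Since the feature simply reads FastA from stdin, a bzip2-compressed database could presumably be formatted the same way by swapping in ''bzcat'' (untested sketch):
$ bzcat nt.bz2 | mpiformatdb -i stdin -N 100 -t nt -p F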
==== SGE Support ====
See this FAQ entry: http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.2)
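The ''mpiblast_sge.sh'' script used for the cluster runs in the benchmarks below is not reproduced here; a minimal sketch of what such an Open MPI + SGE submission script could look like follows. The parallel environment name ''orte'' and the slot count are assumptions and will be site-specific.
#!/bin/bash
#$ -N mpiblast_sge
#$ -cwd
#$ -j y
#$ -pe orte 12
# SGE sets $NSLOTS to the number of granted slots; with gridengine support
# compiled into Open MPI, mpirun also picks up the host allocation automatically.
/opt/openmpi/bin/mpirun -np $NSLOTS /opt/Bio/mpiblast/bin/mpiblast \
    -d drosoph.nt -i drosoph.seq -p blastn -o mpi_drosoph_result.txt
Submit it with:
$ qsub mpiblast_sge.sh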
===== Benchmarks =====
==== Standard BLAST ====
$ time blastall -d drosoph.nt -p blastn -i drosoph.seq -o drosoph.result
real 7m48.052s
user 7m40.775s
sys 0m6.732s
==== MPI Blast with 4 jobs, 1 node ====
$ time /opt/openmpi/bin/mpirun -np 4 /opt/Bio/mpiblast/bin/mpiblast -d drosoph.nt -i drosoph.seq -p blastn -o mpi_drosoph_result.txt
Total Execution Time: 395.754
real 6m36.841s
user 12m13.891s
sys 0m56.631s
==== MPI Blast with 12 jobs, 6 nodes ====
$ less mpiblast_sge.sh.o5515
Total Execution Time: 98.3068
==== Paracel Blast ====
$ time pb blastall -d alan_drosoph -p blastn -i sequences/drosoph.seq -o drosoph.result
real 3m6.163s
user 0m0.046s
sys 0m1.423s
The number of processes for an MPI job should be one more than the number of CPUs, because one process acts as the master that coordinates the worker processes.
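As an illustration of that rule (a sketch, not a benchmarked run), a single node with 4 CPUs doing the searching would be invoked with 5 processes:
$ /opt/openmpi/bin/mpirun -np 5 /opt/Bio/mpiblast/bin/mpiblast -d drosoph.nt -i drosoph.seq -p blastn -o drosoph.result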
===== Random Notes =====
==== Number of Jobs ====
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2005-July/012726.html
%%With formatdb, the number of nodes refers to compute node number.
With mpiBLAST, np refers to number of processes, which isn't
necessarily linked to compute node or processor number. You can run
10 processes on 4 processors. But it's recommended to run a single
process per processor. But the minimum number of processes for
mpiBLAST is 3, no matter what your compute node number is.%%
==== Number of Fragments ====
Rule of thumb for large databases: one fragment for every gigabyte of database (e.g., a 144 GB database would be split into 144 fragments). See the sketch after the links below.
* http://lists.mpiblast.org/pipermail/users_lists.mpiblast.org/2009-August/000988.html
* https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-February/029231.html
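A rough sketch of applying that rule (the filename and size are illustrative, reusing the 144 GB example above):
$ du -BG nt.fasta
144G    nt.fasta
$ mpiformatdb -i nt.fasta -p F --nfrags=144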
==== Incorrect mpiBLAST Version ====
$ rpmquery -a mpiblast
mpiblast-1.5.0-pio
$ mpiblast --version
mpiblast version 1.4.0
You're not crazy; this is a known issue, and 1.5.0 reports itself as 1.4.0: http://lists.mpiblast.org/pipermail/users_lists.mpiblast.org/2009-February/000933.html
===== Links =====
* Submitting MPI jobs using SGE: http://www.shef.ac.uk/wrgrid/documents/gridengine.html
* mpiBLAST Guide: http://www.mpiblast.org/Docs/Guide
* Updating the BLAST databases: http://www.ncbi.nlm.nih.gov/blast/docs/update_blastdb.pl
* Rocks documentation on mpiBLAST: http://www.rocksclusters.org/roll-documentation/bio/5.2/mpiblast_usage.html
* wwwblast: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/
* OpenMPI FAQ: http://www.open-mpi.org/faq/