====== mpiBLAST ======
mpiBLAST is a parallel implementation of NCBI's BLAST algorithm.
  
  * http://wiki.bioinformatics.ucdavis.edu/index.php/MPI_Blast
  
<code>$ mpiformatdb -i drosoph.nt -p F --nfrags=12</code>
  
The Data variable gives the location of the NCBI Data directory containing BLOSUM and PAM scoring matrices, among other things. The scoring matrix files are necessary for any type of protein BLAST search and should be accessible by all cluster nodes. The BLASTMAT variable also specifies the path to the scoring matrices, and will usually be identical to the Data variable. The BLASTDB variable tells standard NCBI blastall (not mpiBLAST) where to find BLAST databases. As previously mentioned, the Shared and Local variables give the shared and local database paths, respectively. By setting BLASTDB to the same path as Shared, it is possible for NCBI blastall to share the same databases that mpiBLAST uses. In such a configuration, be sure to format all databases with mpiformatdb rather than formatdb.
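
As a rough sketch of how these variables fit together, assuming the .ncbirc-style configuration file that mpiBLAST 1.5 reads (the section layout follows the mpiBLAST guide; the paths below are placeholders, not this cluster's actual layout):

<code>; Example ~/.ncbirc -- paths are illustrative only
[NCBI]
Data=/opt/Bio/ncbi/data

[BLAST]
BLASTDB=/export/bio/mpiblast/shared
BLASTMAT=/opt/Bio/ncbi/data

[mpiBLAST]
Shared=/export/bio/mpiblast/shared
Local=/state/partition1/mpiblast</code>

With BLASTDB pointing at the same directory as Shared, plain blastall can search the mpiformatdb-formatted databases as well.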
  
  
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.2)</code>
===== Benchmarks =====

==== Standard BLAST ====
<code>$ time blastall -d drosoph.nt -p blastn -i drosoph.seq -o drosoph.result
                  
user    7m40.775s
sys     0m6.732s</code>
==== MPI Blast with 4 jobs, 1 node ====
<code>$ time /opt/openmpi/bin/mpirun -np 4 /opt/Bio/mpiblast/bin/mpiblast -d drosoph.nt -i drosoph.seq -p blastn -o mpi_drosoph_result.txt
Total Execution Time: 395.754
user    12m13.891s
sys     0m56.631s</code>

==== MPI Blast with 12 jobs, 6 nodes ====
<code>$ less mpiblast_sge.sh.o5515
Total Execution Time: 98.3068</code>

==== Paracel Blast ====
<code>$ time pb blastall -d alan_drosoph -p blastn -i sequences/drosoph.seq -o drosoph.result

real    3m6.163s
user    0m0.046s
sys     0m1.423s</code>

  
The number of processes for an MPI job should be one more than the number of CPUs, because one process acts as the master that controls the other jobs.
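
For reference, the mpiblast_sge.sh job in the benchmark above might look roughly like the sketch below. The parallel environment name, slot count, and submit options are assumptions for illustration, not taken from this cluster's configuration:

<code>#!/bin/bash
# Hypothetical SGE submit script (a sketch, not the actual mpiblast_sge.sh)
#$ -cwd
#$ -j y
# One slot per worker plus one extra slot for the master process, per the note above
#$ -pe orte 13
/opt/openmpi/bin/mpirun -np $NSLOTS /opt/Bio/mpiblast/bin/mpiblast \
    -d drosoph.nt -i drosoph.seq -p blastn -o mpi_drosoph_result.txt</code>

Submitting with qsub then leaves the "Total Execution Time" line in the job's output file, as shown in the benchmark above.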

===== Random Notes =====

==== Number of Jobs ====

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2005-July/012726.html

%%With formatdb, the number of nodes refers to compute node number. With mpiBLAST, np refers to number of processes, which isn't necessarily linked to compute node or processor number. You can run 10 processes on 4 processors. But it's recommended to run a single process per processor. But the minimum number of processes for mpiBLAST is 3, no matter what your compute node number is.%%

==== Number of Fragments ====
Rule of thumb for large databases: one fragment for every gigabyte (e.g. a 144 GB database gets 144 fragments); see the example after the links below.

  * http://lists.mpiblast.org/pipermail/users_lists.mpiblast.org/2009-August/000988.html
  * https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-February/029231.html
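
A quick sketch of applying that rule of thumb, assuming a hypothetical nucleotide database of roughly 144 GB (the database name and size are illustrative only):

<code>$ mpiformatdb -i nt -p F --nfrags=144</code>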
==== Incorrect mpiBLAST Version ====
<code>$ rpmquery -a mpiblast
mpiblast-1.5.0-pio
$ mpiblast --version
mpiblast version 1.4.0
</code>
You're not crazy; this is a known issue. The 1.5.0 release reports itself as 1.4.0: http://lists.mpiblast.org/pipermail/users_lists.mpiblast.org/2009-February/000933.html
===== Links =====
  * Submitting MPI jobs using SGE: http://www.shef.ac.uk/wrgrid/documents/gridengine.html
  * mpiBLAST Guide: http://www.mpiblast.org/Docs/Guide
  * Updating the BLAST databases: http://www.ncbi.nlm.nih.gov/blast/docs/update_blastdb.pl
  * Rocks documentation on mpiBLAST: http://www.rocksclusters.org/roll-documentation/bio/5.2/mpiblast_usage.html
  * wwwblast: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/
  * OpenMPI FAQ: http://www.open-mpi.org/faq/