Differences

This shows you the differences between two versions of the page.

--- mkatari-bioinformatics-august-2013-blastnotes [2013/08/13 15:18] – created mkatari
+++ mkatari-bioinformatics-august-2013-blastnotes [2015/06/04 12:38] (current) – mkatari
@@ Line 1: / Line 1: @@
-##########################
+[[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop Home]]
-# Run Blast using sbatch #
-##########################
-) create a new sbatch file (call it blast.sbatch) with the following heading. The #! MUST be the first line of the file. This tells the computer (or in this case snatch) to run the code using the bash shell interpreter.
+This page contains a short introduction on how to create a Blast database and also how to run Blast using a query. Please replace your sequences with the names provided for the scripts to work for you.
-''
+**Please remember** --- if you are planning to run Blast on command line, please do it in **interactive** mode. The recommended approach is to use sbatch scripts (described in more detail further down the page).
+====== Getting specific fasta sequences from reference file ======
+In case you need to retrieve a specific sequence from a larger fasta file, use one of my perl scripts. Simply provide a pattern that it needs to match in the definition field and it will retrieve the sequence. The **-f** is the reference file, **-o** is the output file name, and **-query** is the pattern it will try to match.
+<code>
+perl /home/mkatari/PerlScripts/getFastaName.pl \
+    -f /home/mkatari/blast/Mesculenta_147_v4 \
+    -o scaffold12498.fa \
+    -query scaffold12498
+</code>
+====== Creating the blast database ======
+Before we can actually perform the blast, we need to prepare the database using **makeblastdb**. The input (**-in**)for makeblastdb is a fasta file with the reference sequence, in this case the the cassava genomes which includes chromosomes and scaffolds. You have to define the database type (**-dbtype**) to be either "nucl" or "prot" and if your file does not have the correct extension, simply tell it the the file contains fasta sequences ( **-input_type** "fasta" )
+<code>
+makeblastdb -in cassavaV5 -input_type "fasta" -dbtype "nucl"
+</code>
+====== Running Blast ======
+Now to run blast we simply have to specific which blast we want to use and provide the respective arguments. For this example we are looking to see where the scaffold from an older version of the assembly is present in the new assembly. So we are aligning a nucleotide query to a nucleotide database ( **blastn** ). There are incredible number of options for blastn. To get a detailed description of all the different option type ( **blastn -help** ).
+Some of the options I find useful are:
+-query = the name of the query sequence
+-db = the name of the database
+-out = the name of the output file
+-outfmt = in which format to save the file. The default is the traditional output that shows alignments, but I also use value outfmt 6, which will save in tabular format.
+<code>
+blastn -query scaffold12498.fa \
+-db cassavaV5 \
+-outfmt 6 \
+-out scaffold12498.cassavaV5.bout
+</code>
+====== Run Blast using sbatch ======
+  * Create a new sbatch file (call it ''contigs.fa.nr.sbatch'') with the following heading. The ''#!'' MUST be the first line of the file. This tells the computer (or in this case snatch) to run the code using the ''bash'' shell interpreter.
+<code>
 #!/bin/bash
 #SBATCH -p highmem
 #SBATCH -n 8
-''
+</code>
-Normally any other # used in a bash script file means a comment follows. However since we are executing the sbatch file using sbatch command (see below) SBATCH knows to look for #SBATCH and use the information that follows. Here we are telling the SBATCH file to use the highmem partition (the mammoth server) and use 8 CPUs to perform the calculation. It is also important to tell that program you are running that it has 8 CPUs available so it will use them, else you will be reserving 8 CPUs but only using one.
+Normally any other ''#'' used in a bash script file means a comment follows. However since we are executing the sbatch file using sbatch command (see below) SBATCH knows to look for ''#SBATCH'' and use the information that follows. Here we are telling the SBATCH file to use the ''highmem'' partition (the mammoth server) and use ''8'' CPUs to perform the calculation. It is also important to tell that program you are running that it has 8 CPUs available so it will use them, else you will be reserving 8 CPUs but only using one.
-) There are several different job managers and different ways of setting up HPC computer systems. In practice I prefer not to assume that a job that is submitted to a different server will know how to find the different commands or even files. So I like to include the full paths for both. In order to find where the blastx command is located I simply type:
+  * There are several different job managers and different ways of setting up HPC computer systems. In practice I prefer not to assume that a job that is submitted to a different server will know how to find the different commands or even files. So I like to include the full paths for both. In order to find where the ''blastx'' command is located I simply type:
-----------------------------------------------
+<code>
 module load blast
-which blastx # note this will only work if you already have the blast module loaded.
+which blastn # note this will only work if you already have the blast module loaded.
-/export/apps/blast/2.2.28+/bin/blastx
+/export/apps/blast/2.2.28+/bin/blastn
-----------------------------------------------
+</code>
-From Alan's Blast sbatch script example on the wiki, I also know where the nr database is
+From Alan's Blast sbatch script example on [[using-slurm|How to Use Slurm]], I also know where the nr database is
------------------------------------------------
+<code>
 /export/data/bio/ncbi/blast/db/nr
------------------------------------------------
+</code>
-) Edit the contigs.fa.nr.sbatch file to include all changes. Final script looks like this:
+  * Edit the ''contigs.fa.nr.sbatch'' file to include all changes. Final script looks like this:
-----------------------------
+<code>
 #!/bin/bash
@@ Line 36: / Line 76: @@
 #SBATCH -n 8
-/export/apps/blast/2.2.28+/bin/blastx -db /export/data/bio/ncbi/blast/db/nr -query /home/mkatari/ndl06-132-velvet31/contigs.fa -out /home/mkatari/ndl06-132-velvet31/contigs.fa.nr -num_threads 8 -outfmt 6 -evalue 0.00001
+module load blast
-------------------------------
-) Run the sbatch file. As soon as you run the file a job id will be assigned to your submission.
+echo "Blast ready to run"
------------------------
+blastn -db cassavaV5 -query scaffold12498.fa \
+       -out scaffold12498.cassavaV5_2.bout \
+       --num_threads 8 -outfmt 6 -evalue 0.00001
+echo "Blast complete"
+</code>
+  * Run the sbatch file. As soon as you run the file a job id will be assigned to your submission.
+<code>
 sbatch blast.sbatch
------------------------
+</code>
 You can check the status of all jobs on the cluster by typing:
------------------------
+<code>
 squeue
------------------------
+</code>
 You can check the details of your specific job by typing:
-----------------------------------------------
+<code>
 scontrol show job <your jobid>
-----------------------------------------------
+</code>
 You can cancel your job by running
-----------------------------------------------
+<code>
 scancel <your jobid>
-----------------------------------------------
+</code>
 The standard output of your job is redirected to a file called
-----------------------------------------------
+<code>
 slurm-<your jobid>.out
-----------------------------------------------
+</code>
+  * Now imagine that you have to repeat this exact blast for many different sequences but you do not necessarily want to have to create a new batch file or keep editing the same one. The path to the input and output files in our current sbatch files are "hard coded". What we want is the ability to change the input and output name without having to edit the file.
+  * Lucky for us we can define variables in a bash script and provide the value of the variable from the command line. A modified version of the sbatch script is provided below.
+<code>
+#!/bin/bash
+#SBATCH -p batch
+#SBATCH -n 8
+INPUT=$1
+OUTPUT="$1".output
+echo $INPUT
+echo $OUTPUT
+/export/apps/blast/2.2.28+/bin/blastx \
+    -db /export/data/bio/ncbi/blast/db/nr \
+    -query $INPUT \
+    -out $OUTPUT  \
+    -num_threads 8 \
+    -outfmt 6 \
+    -evalue 0.00001
+# the different columns in this output format are :
+# Fields: query id,
+          subject id,
+          % identity,
+          alignment length,
+          mismatches,
+          gap opens,
+          query start,
+          query end,
+          subject start,
+          subject end,
+          evalue,
+          bit score
+</code>
+Arguments on a command line are interpreted by the bash script in sequence. The values automatically inherit the variable $1, $2, $3 ... as they are read from command line. It is a good idea to reassign these with variables that have names that make sense to us. Any string of characters (without spaces) provided after the script name will be assigned as $1 and then the variable INPUT will be assigned this value. In the script above we also see how to create a new variable OUTPUT which contains the same information as INPUT but now also contains a ".output"
+Now to refer to the value saved in the variables we simply put $ infront as shown in the blast command line.
+To execute this sbatch file you would simply provide the name of the input file as shown below.
+<code>
+sbatch /home/mkatari/blast.sbatch /home/mkatari/ndl06-132-velvet31/contigs.fa
+</code>