Differences

This shows you the differences between two versions of the page.

--- using-slurm [2017/06/07 06:28] – aorth
+++ using-slurm [2020/10/08 13:06] – [Batch jobs] jean-baka
@@ Line 8: / Line 8: @@
   * highmem
-"debug" is the default queue, which is useful for testing job parameters, program paths, etc. The run-time limit of the "debug" partition is 5 minutes, after which jobs are killed.
+"debug" is the default queue, which is useful for testing job parameters, program paths, etc. The run-time limit of the "debug" partition is 5 minutes, after which jobs are killed. The other partitions have no set time limit.
 To see more information about the queue configuration, use ''sinfo -lNe''.
+<code>[jbaka@hpc ~]$ sinfo -lNe
+Fri Feb  1 15:27:44 2019
+NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
+compute2       1     batch        idle   64   64:1:1      1        0     10   (null) none
+compute03      1     batch       mixed    8    8:1:1      1        0      5   (null) none
+compute03      1   highmem       mixed    8    8:1:1      1        0      5   (null) none
+compute04      1     batch       mixed    8    8:1:1      1        0      5   (null) none
+hpc            1    debug*        idle    4    4:1:1      1        0      1   (null) none
+mammoth        1   highmem        idle    8    8:1:1      1        0     30   (null) none
+taurus         1     batch       mixed   64   64:1:1      1        0     20   (null) none
+</code>
+The above tells you, for instance, that compute04 has 8 CPUs while compute2 has 64 CPUs. It also tells you that a job sent to the "highmem" partition (the SLURM term for what other schedulers, e.g. Sun Grid Engine, call a "queue") will end up being run on either compute03 or mammoth.
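As a rough illustration of reading that listing, the sketch below filters ''sinfo''-style rows to show which nodes serve a given partition and how many CPUs each one has. The sample rows are copied from the output above; the helper name ''nodes_in_partition'' and the column positions it relies on (1 = node, 3 = partition, 5 = CPUs) are assumptions made for this example and are not part of SLURM.

```shell
#!/usr/bin/env bash
# Sample rows copied from the `sinfo -lNe` listing above (date and header lines omitted).
sinfo_sample='compute2       1     batch        idle   64   64:1:1      1        0     10   (null) none
compute03      1     batch       mixed    8    8:1:1      1        0      5   (null) none
compute03      1   highmem       mixed    8    8:1:1      1        0      5   (null) none
compute04      1     batch       mixed    8    8:1:1      1        0      5   (null) none
hpc            1    debug*        idle    4    4:1:1      1        0      1   (null) none
mammoth        1   highmem        idle    8    8:1:1      1        0     30   (null) none
taurus         1     batch       mixed   64   64:1:1      1        0     20   (null) none'

# nodes_in_partition PARTITION (hypothetical helper)
# Reads sinfo-style rows on stdin and prints "node cpus" for every node in PARTITION.
# Assumes column 1 = node name, column 3 = partition, column 5 = CPU count.
# The trailing "*" that marks the default partition is stripped before comparing.
nodes_in_partition() {
    awk -v part="$1" '{ p = $3; sub(/\*$/, "", p); if (p == part) print $1, $5 }'
}

highmem_nodes=$(printf '%s\n' "$sinfo_sample" | nodes_in_partition highmem)
printf '%s\n' "$highmem_nodes"
```

On the cluster itself, the same helper could presumably be fed live data with ''sinfo -lNe | nodes_in_partition highmem'', since the date and header lines do not match any partition name in column 3.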
 ===== Submitting jobs =====
 ==== Interactive jobs ====
-How to get an interactive session, ie when you want to interact with a program (like R, etc):
+How to get an interactive session, i.e. when you want to interact with a program (like R, etc) for a limited amount of time, making the scheduler aware that you are requesting/using resources on the cluster:
 <code>[aorth@hpc: ~]$ interactive
 salloc: Granted job allocation 1080
 [aorth@taurus: ~]$</code>
-**NB:** interactive jobs have a time limit of 8 hours, if you need more then you should write a batch script.
+**NB:** interactive jobs have a time limit of 8 hours: if you need more, then you should write a batch script.
+You can also open an interactive session on a specific node of the cluster by specifying it through the ''-w'' command-line argument:
+<code>[jbaka@hpc ~]$ interactive -w compute03
+salloc: Granted job allocation 16349
+[jbaka@compute03 ~]$</code>
 ==== Batch jobs ====
-Request 4 CPUs for a NCBI BLAST+ job in the ''batch'' partition.  Create a file //blast.sbatch//:
+We are writing a SLURM script below. The parameters in its header request 4 CPUs in the ''batch'' partition and name our job "blastn". This name is only used internally by SLURM for reporting purposes. So let's go ahead and create a file //blast.sbatch//:
 <code>#!/usr/bin/env bash
 #SBATCH -p batch
@@ Line 42: / Line 62: @@
 Instead, you can use a local "scratch" folder on the compute nodes to alleviate this burden, for example:
-<code>#!/bin/env bash
+<code>#!/usr/bin/env bash
 #SBATCH -p batch
-#SBATCH -n 4
 #SBATCH -J blastn
+#SBATCH -n 4
 # load the blast module
@@ Line 62: / Line 82: @@
 blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4 -out blast.out</code>
-All output is directed to ''$WORKDIR/'', which is the temporary folder on the compute node. See these slides from [[http://alanorth.github.io/hpc-users-group3/#/2|HPC Users Group #3]] for more info.
+All output is directed to ''$WORKDIR/'', which is the temporary folder on the compute node. See these slides from [[https://alanorth.github.io/hpc-users-group3/#/2|HPC Users Group #3]] for more info.
 ==== Check queue status ====
-<code>squeue</code>
+''squeue'' is the command to use in order to get information about the different jobs that are running on the cluster, waiting in a queue for resources to become available, or halted for some reason:
+<code>[jbaka@compute03 ~]$ squeue
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+     batch interact  pyumbya  R    6:33:26      1 taurus
+     batch interact ckeambou  R    5:19:07      1 compute04
+     batch interact ckeambou  R    5:12:52      1 compute04
+     batch velvet_o  dkiambi  R    1:39:09      1 compute04
+     batch interact fkibegwa  R      22:38      1 taurus
+     batch interact    jbaka  R       3:27      1 compute03
+</code>
+In addition to the information above, it is sometimes useful to know the number of CPUs (computing cores) allocated to each job: the scheduler queues jobs that ask for resources that aren't available, most often because other jobs are using all the CPUs on the host. To get the number of CPUs for each job and display everything nicely, the command is slightly more involved:
-==== Receive mail notifications ====
-To receive mail notifications about the state of your job, add the following lines to your sbatch script: whereby <EMAIL_ADDRESS> is your email address<code>
-#SBATCH --mail-user <EMAIL_ADDRESS>
-#SBATCH --mail-type ALL</code>
-Notification mail types(--mail-type) can be BEGIN, END, FAIL, REQUEUE and ALL(any state change).
-Example:
-<code>
-#SBATCH --mail-user J.Doe@cgiar.org
-#SBATCH --mail-type ALL</code>
+<code>[jbaka@compute03 ~]$ squeue -o"%.7i %.9P %.16j %.8u %.2t %.10M %.6D %10N %C"
+  JOBID PARTITION             NAME     USER ST       TIME  NODES NODELIST   CPUS
+     batch      interactive  pyumbya  R    6:40:52      1 taurus     1
+     batch      interactive ckeambou  R    5:26:33      1 compute04  1
+     batch      interactive ckeambou  R    5:20:18      1 compute04  1
+     batch velvet_out_ra_10  dkiambi  R    1:46:35      1 compute04  2
+     batch      interactive fkibegwa  R      30:04      1 taurus     1
+     batch      interactive    jbaka  R      10:53      1 compute03  1
+</code>
+or, alternatively:
+<code>[jbaka@compute03 ~]$ squeue -O username,jobid,name,nodelist,numcpus
+USER                JOBID               NAME                NODELIST            CPUS
+pyumbya             16330               interactive         taurus              1
+ckeambou            16339               interactive         compute04           1
+ckeambou            16340               interactive         compute04           1
+dkiambi             16346               velvet_out_ra_109_vecompute04           2
+fkibegwa            16348               interactive         taurus              1
+jbaka               16349               interactive         compute03           1
+</code>
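To see at a glance how many CPUs are already taken on each host, the per-job CPU counts can be added up per node. The sketch below does this with ''awk'' on "node cpus" pairs taken from the listings above. The helper name ''cpus_per_node'' is made up for this example. It assumes every job runs on a single node, as in the listings; a multi-node job reports a node list rather than one host name and would need extra handling.

```shell
#!/usr/bin/env bash
# "node cpus" pairs taken from the squeue listings above (one line per job).
squeue_sample='taurus 1
compute04 1
compute04 1
compute04 2
taurus 1
compute03 1'

# cpus_per_node (hypothetical helper)
# Reads "node cpus" pairs on stdin and prints the total CPUs allocated per node,
# sorted by node name. Assumes single-node jobs only.
cpus_per_node() {
    awk '{ cpus[$1] += $2 } END { for (n in cpus) print n, cpus[n] }' | LC_ALL=C sort
}

cpu_totals=$(printf '%s\n' "$squeue_sample" | cpus_per_node)
printf '%s\n' "$cpu_totals"
```

On the cluster, the input could presumably come from ''squeue -h -o "%N %C"'', which uses the same ''%N'' and ''%C'' format fields as the command above and suppresses the header line. Comparing the totals with the CPU counts reported by ''sinfo -lNe'' gives a rough idea of which nodes still have free cores.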