  * highmem
  
-"​debug"​ is the default queue, which is useful for testing job parameters, program paths, etc. The run-time limit of the "​debug"​ partition is 5 minutes, after which jobs are killed.+"​debug"​ is the default queue, which is useful for testing job parameters, program paths, etc. The run-time limit of the "​debug"​ partition is 5 minutes, after which jobs are killed. The other partitions have no set time limit.
  
To see more information about the queue configuration, use ''sinfo -lNe''.

<code>[jbaka@compute03 ~]$ sinfo -lNe
Fri Feb  1 15:27:44 2019
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute2       1     batch        idle   64   64:1:1      1        0     10   (null) none
compute03      1     batch       mixed    8    8:1:1      1        0      5   (null) none
compute03      1   highmem       mixed    8    8:1:1      1        0      5   (null) none
compute04      1     batch       mixed    8    8:1:1      1        0      5   (null) none
hpc            1    debug*        idle    4    4:1:1      1        0      1   (null) none
mammoth        1   highmem        idle    8    8:1:1      1        0     30   (null) none
taurus         1     batch       mixed   64   64:1:1      1        0     20   (null) none
</code>

The above tells you, for instance, that compute04 has 8 CPUs while compute2 has 64 CPUs. It also shows that a job sent to the "highmem" partition ("partition" is the SLURM term for what other schedulers, e.g. Sun Grid Engine, call a "queue") will end up running on either compute03 or mammoth.
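
''sinfo'' can also be limited to a single partition with ''-p'', e.g. to list only the "highmem" nodes (the output below is illustrative):
<code>[jbaka@compute03 ~]$ sinfo -p highmem
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
highmem      up   infinite      1  mixed compute03
highmem      up   infinite      1   idle mammoth
</code>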
  
===== Submitting jobs =====
==== Interactive jobs ====
How to get an interactive session, i.e. when you want to interact with a program (like R, etc.) for a limited amount of time, making the scheduler aware that you are requesting/using resources on the cluster:
 <​code>​[aorth@hpc:​ ~]$ interactive ​ <​code>​[aorth@hpc:​ ~]$ interactive ​
 salloc: Granted job allocation 1080 salloc: Granted job allocation 1080
 [aorth@taurus:​ ~]$</​code>​ [aorth@taurus:​ ~]$</​code>​
  
**NB:** interactive jobs have a time limit of 8 hours; if you need more, you should write a batch script.

You can also open an interactive session on a specific node of the cluster by specifying it via the ''-w'' command-line argument:
<code>[jbaka@hpc ~]$ interactive -w compute03
salloc: Granted job allocation 16349
[jbaka@compute03 ~]$</code>
==== Batch jobs ====
Request 4 CPUs for an NCBI BLAST+ job in the ''batch'' partition. Create a file //blast.sbatch//:
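
A minimal sketch of what //blast.sbatch// could contain; the module name and the input/output file names are illustrative assumptions, not prescriptions:
<code>#!/usr/bin/env bash
#SBATCH -p batch      # run in the "batch" partition
#SBATCH -J blast      # job name as shown by squeue
#SBATCH -c 4          # 4 CPUs on one node for blastn's threads

# load a BLAST+ environment module (module name is an assumption)
module load blast

# give blastn as many threads as CPUs were requested above
blastn -query query.fasta -db nt -num_threads $SLURM_CPUS_PER_TASK -out blast.out</code>

Submit it with ''sbatch blast.sbatch''; its progress can then be followed with ''squeue'' as described below.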
  
==== Check queue status ====
-<​code>​squeue</​code>​+''​squeue''​ is the command to use in order to get information about the different jobs that are running on the cluster, waiting in a queue for resources to become available, or halted for some reason: 
 +<​code>​[jbaka@compute03 ~]$ squeue 
 +             JOBID PARTITION ​    ​NAME ​    USER ST       ​TIME ​ NODES NODELIST(REASON) 
 +             ​16330 ​    batch interact ​ pyumbya ​ R    6:​33:​26 ​     1 taurus 
 +             ​16339 ​    batch interact ckeambou ​ R    5:​19:​07 ​     1 compute04 
 +             ​16340 ​    batch interact ckeambou ​ R    5:​12:​52 ​     1 compute04 
 +             ​16346 ​    batch velvet_o ​ dkiambi ​ R    1:​39:​09 ​     1 compute04 
 +             ​16348 ​    batch interact fkibegwa ​ R      22:38      1 taurus 
 +             ​16349 ​    batch interact ​   jbaka  R       ​3:​27 ​     1 compute03 
 +</​code>​ 
 + 
In addition to the information above, it is sometimes useful to know the number of CPUs (computing cores) allocated to each job: the scheduler will queue jobs asking for resources that aren't available, most often because the other jobs are eating up all the CPUs available on the host. To get the number of CPUs for each job and display the whole thing nicely, the command is slightly more involved:

 +<​code>​[jbaka@compute03 ~]$ squeue -o"​%.7i %.9P %.16j %.8u %.2t %.10M %.6D %10N %C" 
 +  JOBID PARTITION ​            ​NAME ​    USER ST       ​TIME ​ NODES NODELIST ​  ​CPUS 
 +  16330     ​batch ​     interactive ​ pyumbya ​ R    6:​40:​52 ​     1 taurus ​    1 
 +  16339     ​batch ​     interactive ckeambou ​ R    5:​26:​33 ​     1 compute04 ​ 1 
 +  16340     ​batch ​     interactive ckeambou ​ R    5:​20:​18 ​     1 compute04 ​ 1 
 +  16346     batch velvet_out_ra_10 ​ dkiambi ​ R    1:​46:​35 ​     1 compute04 ​ 2 
 +  16348     ​batch ​     interactive fkibegwa ​ R      30:04      1 taurus ​    1 
 +  16349     ​batch ​     interactive ​   jbaka  R      10:53      1 compute03 ​ 1 
 +</​code>​ 

or, alternatively:

 +<​code>​[jbaka@compute03 ~]$ squeue ​-O username,​jobid,​name,​nodelist,​numcpus 
 +USER                JOBID               ​NAME ​               NODELIST ​           CPUS                 
 +pyumbya ​            ​16330 ​              ​interactive ​        ​taurus ​             1                    
 +ckeambou ​           16339               ​interactive ​        ​compute04 ​          ​1 ​                   
 +ckeambou ​           16340               ​interactive ​        ​compute04 ​          ​1 ​                   
 +dkiambi ​            ​16346 ​              ​velvet_out_ra_109_vecompute04 ​          ​2 ​                   
 +fkibegwa ​           16348               ​interactive ​        ​taurus ​             1                    
 +jbaka               ​16349 ​              ​interactive ​        ​compute03 ​          ​1 ​         
 +</​code>​ 