SLURM Workload Manager¶
SLURM is the workload manager and job scheduler used for Scicluster.
There are two ways of starting jobs with SLURM: either interactively with srun
or as a script with sbatch.
Interactive jobs are a good way to test your setup before you put it into a script, or to work with interactive applications like Python. You immediately see the results and can check whether all parts behave as you expected. See Interactive job for more details.
SLURM Parameters¶
SLURM supports a multitude of different parameters. This enables you to effectively tailor your script to your needs when using Scicluster, but also means that it is easy to get confused and waste your time and quota.
The following parameters can be used as command line parameters with sbatch and
srun or in a job script, see Job script examples.
To use a parameter in a job script, start a new line with #SBATCH followed by the parameter.
Replace <….> with the value you want, e.g. --job-name=test-job.
Basic settings:¶
| Parameter | Function |
|---|---|
| --job-name=<name> | Job name to be displayed by, for example, squeue |
| --output=<path> | Path to the file where the job output is written to |
| --error=<path> | Path to the file where the job error is written to |
| --mail-type=<type> | Turn on mail notification; type can be one of BEGIN, END, FAIL, REQUEUE or ALL |
| --mail-user=<email> | Email address to send notifications to |
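As a minimal sketch of how these basic settings might look in a job script, the header below sets a job name, separate output and error files, and mail notification. The job name, file names and email address are illustrative placeholders, not values defined by Scicluster; %j is SLURM's filename pattern for the job ID.

```shell
#!/bin/bash
# Basic settings; every value here is an illustrative placeholder.
#SBATCH --job-name=test-job
#SBATCH --output=test-job-%j.out    # %j is replaced by the job ID
#SBATCH --error=test-job-%j.err
#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.org

# SLURM sets SLURM_JOB_ID and SLURM_JOB_NAME inside a job. The fallbacks
# only let this sketch run outside SLURM for a quick check.
job_id="${SLURM_JOB_ID:-none}"
job_name="${SLURM_JOB_NAME:-test-job}"
echo "Job ${job_name} (ID ${job_id}) started on $(hostname)"
```

Saved under a hypothetical name such as basic.sh, the script would be submitted with sbatch basic.sh.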
Requesting Resources¶
| Parameter | Function |
|---|---|
| --time=<d-hh:mm:ss> | Time limit for the job. The job will be killed by SLURM after the time has run out. Format: days-hours:minutes:seconds |
| --nodes=<num_nodes> | Number of nodes. Multiple nodes are only useful for jobs with distributed memory (e.g. MPI). |
| --mem=<size> | Memory (RAM) per node. Number followed by unit prefix, e.g. 16G |
| --mem-per-cpu=<size> | Memory (RAM) per requested CPU core |
| --ntasks-per-node=<num_procs> | Number of (MPI) processes per node. More than one is useful only for MPI jobs. The maximum number is node dependent (number of cores) |
| --cpus-per-task=<num_threads> | CPU cores per task. For MPI use one. For multithreaded applications this is the number of threads. |
| --exclusive | Job will not share nodes with other running jobs. You will be charged for the complete nodes even if you asked for less. |
Accounting¶
See also Partitions (queues).
| Parameter | Function |
|---|---|
| --account=<name> | Project (not user) account the job should be charged to. |
| --partition=<name> | Partition/queue in which to run the job. |
| --qos=<qos> | low, normal or high |
Advanced Job Control¶
| Parameter | Function |
|---|---|
| --array=<indexes> | Submit a collection of similar jobs, e.g. --array=1-10 |
| --dependency=<state:jobid> | Wait with the start of the job until the specified dependencies have been satisfied, e.g. --dependency=afterok:123456 |
| --ntasks-per-core=2 | Enables hyperthreading. Only useful in special circumstances. |
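The following is a hedged sketch of a job array script. Each array task reads its own index from the SLURM_ARRAY_TASK_ID environment variable, which SLURM sets for array jobs. The array range, time limit and input file naming scheme are assumptions made for illustration only.

```shell
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=1-10            # ten array tasks with indices 1 to 10
#SBATCH --time=0-00:10:00

# SLURM sets SLURM_ARRAY_TASK_ID for each array task. The fallback to 1
# only lets this sketch run outside SLURM.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="input_${task_id}.dat"   # hypothetical naming scheme
echo "Array task ${task_id} would process ${input_file}"
```

A follow-up job that should only start after job 123456 has finished successfully could then be submitted with sbatch --dependency=afterok:123456 postprocess.sh, where postprocess.sh is a hypothetical script name.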
Differences between CPUs and tasks¶
For new users writing their first SLURM job script, the difference between
--ntasks and --cpus-per-task is typically quite confusing.
Assuming you want to run your program on a single node with 16 cores, which
SLURM parameters should you specify?
The answer depends on whether your application supports MPI. MPI (Message Passing Interface) is a communication interface used for developing parallel computing programs on distributed memory systems. It is necessary for applications running on multiple computers (nodes) to be able to share (intermediate) results.
To decide which set of parameters you should use, check whether your application utilizes MPI and would therefore benefit from running on multiple nodes simultaneously. If, on the other hand, you have a non-MPI-enabled application, or you have made a mistake in your setup, it doesn’t make sense to request more than one node.
Settings for OpenMP and MPI jobs¶
Single node jobs¶
Use single node jobs for applications that are not optimized for HPC (high performance computing) systems, such as simple Python or R scripts and much of the software that is optimized for desktop PCs.
Simple applications and scripts¶
Many simple tools and scripts are not parallelized at all and therefore won’t profit from more than one CPU core.
| Parameter | Function |
|---|---|
| --nodes=1 | Start an unparallelized job on only one node |
| --ntasks-per-node=1 | Only one task is necessary |
| --mem=<size> | Memory (RAM) for the job. Number followed by unit prefix, e.g. 16G |
If you are unsure whether your application can benefit from more cores, try a higher number and observe the load of your job. If the load stays at approximately one, there is no need to ask for more than one core.
OpenMP applications¶
OpenMP (Open Multi-Processing) is a multiprocessing API that is often used for programs on shared memory systems. Shared memory describes systems which share the memory between all processing units (CPU cores), so that each process can access all data on that system.
| Parameter | Function |
|---|---|
| --nodes=1 | Start a parallel job for a shared memory system on only one node |
| --ntasks-per-node=1 | For OpenMP, only one task is necessary |
| --cpus-per-task=<num_threads> | Number of threads (CPU cores) to use |
| --mem=<size> | Memory (RAM) for the job. Number followed by unit prefix, e.g. 16G |
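A minimal sketch of an OpenMP job script using these settings could look as follows. The thread count, memory size and the executable name my_openmp_program are illustrative assumptions. The script passes the number of allocated cores to OpenMP through the standard OMP_NUM_THREADS variable.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8       # illustrative thread count
#SBATCH --mem=16G

# Match the OpenMP thread count to the cores SLURM allocated.
# SLURM_CPUS_PER_TASK is set when --cpus-per-task is given; the fallback
# to 1 only lets this sketch run outside SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s)"

# my_openmp_program is a hypothetical executable; run it only if present.
if [ -x ./my_openmp_program ]; then
    ./my_openmp_program
fi
```

Setting OMP_NUM_THREADS from SLURM_CPUS_PER_TASK keeps the program from starting more threads than the cores it was given.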
Multiple node jobs (MPI)¶
Use multiple node jobs for MPI applications.
Depending on how often your processes communicate and how much bandwidth they need, you can either just start a number of MPI tasks or request whole nodes. While using whole nodes guarantees lower latency and higher bandwidth, it usually results in a longer queuing time compared to a cluster-wide job. With the latter, SLURM can distribute your tasks across all nodes of Scicluster and utilize otherwise unused cores on nodes which, for example, run a 6-core job on an 8-core node. This usually results in shorter queuing times but slower inter-process connection speeds.
To use whole nodes¶
| Parameter | Function |
|---|---|
| --nodes=<num_nodes> | Start a parallel job for a distributed memory system on several nodes |
| --ntasks-per-node=<num_procs> | Number of (MPI) processes per node. Maximum number depends on node type |
| --cpus-per-task=1 | Use one CPU core per task. |
| --exclusive | Job will not share nodes with other running jobs. You don’t need to specify memory as you will get all available on the node. |
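As a sketch under stated assumptions, a whole-node MPI job script might look like this. The node count, tasks per node, time limit and the executable name my_mpi_program are illustrative. The PARA partition and the normal QOS are chosen here because, according to the tables on this page, they are the ones that allow a two-node job.

```shell
#!/bin/bash
#SBATCH --partition=PARA         # partition that allows more than one node
#SBATCH --qos=normal             # QOS that allows 2 nodes for up to 1 day
#SBATCH --nodes=2                # illustrative node count
#SBATCH --ntasks-per-node=16     # must not exceed the cores per node
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --time=0-01:00:00

# SLURM_NTASKS holds the total number of tasks inside a job. The fallback
# (2 nodes x 16 tasks) only lets this sketch run outside SLURM.
total_tasks="${SLURM_NTASKS:-$((2 * 16))}"
echo "Starting ${total_tasks} MPI processes"

# srun launches one process per task across the allocated nodes.
# my_mpi_program is a hypothetical executable; run it only if present.
if [ -x ./my_mpi_program ]; then
    srun ./my_mpi_program
fi
```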
Cluster wide¶
| Parameter | Function |
|---|---|
| --ntasks=<num_procs> | Number of (MPI) processes in total. Equal to the number of cores |
| --mem-per-cpu=<size> | Memory (RAM) per requested CPU core. Number followed by unit prefix, e.g. 1G |
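For a cluster-wide job, only the total number of tasks and the memory per core are requested, and SLURM decides where the tasks run. The sketch below also works out the total memory this implies. The task count and time limit are illustrative assumptions, and the partition and QOS limits described on this page still apply.

```shell
#!/bin/bash
#SBATCH --ntasks=32              # illustrative total number of MPI processes
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-01:00:00

# Total memory is the number of tasks times the memory per core.
# The fallback of 32 only lets this sketch run outside SLURM.
ntasks="${SLURM_NTASKS:-32}"
mem_per_cpu_gb=1
total_mem_gb=$(( ntasks * mem_per_cpu_gb ))
echo "${ntasks} tasks x ${mem_per_cpu_gb}G = ${total_mem_gb}G requested in total"
```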
Scalability¶
You should run a few tests to find the best trade-off between minimizing runtime and making efficient use of your allocated CPU quota. That is, you should not ask for more CPUs for a job than you can really utilize efficiently. Try to run your job on 1, 2, 4, 8, 16, etc., cores to see when the runtime improvement starts tailing off. When you see less than a 30% improvement in runtime from doubling the CPU count, you should probably not go any further. Recommendations for a few of the most used applications can be found in Application guides.
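The 30% rule of thumb can be checked with a few lines of shell. The sketch below uses made-up runtimes, not measurements from Scicluster, and reports the last core count at which doubling still improved the runtime by at least 30%.

```shell
#!/bin/bash
# Hypothetical measurements as cores:runtime_in_seconds pairs.
measurements="1:1000 2:520 4:280 8:170 16:140"

recommended=""
prev_time=""
for pair in $measurements; do
    cores=${pair%%:*}
    runtime=${pair##*:}
    if [ -n "$prev_time" ]; then
        # Percentage reduction in runtime from doubling the cores (integer math).
        improvement=$(( (prev_time - runtime) * 100 / prev_time ))
        echo "${cores} cores: ${improvement}% faster than the previous run"
        # Stop at the first doubling that gains less than 30%.
        if [ "$improvement" -lt 30 ]; then
            break
        fi
    fi
    recommended=$cores
    prev_time=$runtime
done
echo "Recommended core count: ${recommended}"
```

With these example numbers, going from 8 to 16 cores improves the runtime by only 17%, so the sketch recommends 8 cores.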
Partitions (queues)¶
SLURM differs slightly from the previous Torque system with respect to
definitions of various parameters, and what was known as queues in Torque may
be covered by both --partition=... and --qos=....
We have the following partitions:
| Partition | MaxTime | DefaultTime | DefMemPerCPU | Max number of Nodes |
|---|---|---|---|---|
| short | 1 day | 30 min | 512 MB | 1 |
| long | 1 week | 30 min | 512 MB | 1 |
| PARA | 1 week | 30 min | 512 MB | 4 |
To display a straightforward summary of the available partitions, with their job size, status, time limit and node information as A/I/O/T (allocated, idle, other and total):
$ sinfo -o "%.10P %.15s %.10a %.10l %.15F"
The numbers set the field widths and should be chosen to properly accommodate the data.
See the About Scicluster chapter of the documentation if you need more information on the system architecture.
Quality of service (QOS)¶
We have also defined three QOSs (quality of service) for better management: low, normal and high.
| QOS | Max nodes per user | Max wall time (days) |
|---|---|---|
| low (this is the default QOS) | 1 | 7 |
| normal | 2 | 1 |
| high | 3 | 1 |
All members of the faculty of science have the low and normal QOS, which means they can use 1 node for 7 days or 2 nodes for 1 day. Currently, just for testing, all members also have the high QOS, i.e. they can use 3 nodes for 1 day. After about one month, this QOS will be assigned only to those users who report reasonable performance using 3 nodes.
## for 3 nodes
#SBATCH --qos=high
#SBATCH --ntasks=80
#SBATCH -w compute-0-[0,2,3]
#SBATCH --time=1-00:00:00 # maximum time for "high" QOS is 1 day
## for 2 nodes (e.g. compute-0-0 and 0-3)
#SBATCH --qos=normal
#SBATCH --ntasks=56 ##
#SBATCH -w compute-0-[0,3]
#SBATCH --time=1-00:00:00 # maximum time for "normal" QOS is 1 day
## for 1 node (e.g. compute-0-1)
#SBATCH --qos=low ## this is default, so you can ignore it
#SBATCH --ntasks=16 ##
#SBATCH -w compute-0-1
#SBATCH --time=7-00:00:00 # maximum time for "low" QOS is 7 days
Please note that compute-0-1 is currently NOT equipped with a 10 G adapter, so you cannot use it for distributed MPI parallel jobs.