Using SLURM on the Brazos Cluster

Overview

The Brazos Cluster uses SLURM (Simple Linux Utility for Resource Management), an open-source workload manager for batch scheduling. All processing on the cluster must run through the batch system. Do not run large-memory or long-running applications on the cluster's login nodes; they will be terminated without notice.

Torque to SLURM

See Translating Torque to SLURM.

SLURM Commands

User commands interact with the slurmctld process to submit, query, and delete jobs, as well as to query and configure the scheduler. Common commands include:

sbatch
- submit a batch script and allocate nodes
salloc
- allocate nodes for interactive use
srun
- run a command across the allocated nodes
squeue
- display job queue information
sinfo
- display node and partition information
scancel
- cancel jobs
scontrol
- view and modify SLURM configuration and job state (largely administrative)
sview
- graphical interface to the SLURM commands
sstat
- view status information about running jobs

Each command's documentation can also be found in the man pages available when logged into the Brazos cluster. For example, to view the man page for the sbatch command execute man sbatch.
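
For example, two quick checks from a login session (the partition name is only an illustration):

$ squeue -u $USER          # list only your own jobs
$ sinfo -p background      # show node states for the background partition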

Login

SLURM on the Brazos cluster will run in an Enterprise Linux 6 environment on login.brazos.tamu.edu. Use SSH to login to this system with your normal Brazos username and password (generally, your TAMU NetID and password).
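
For example (replace netid with your own Brazos username):

$ ssh netid@login.brazos.tamu.edu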

Simple Job Submission

As a simple example, we will request a single CPU core to run a single-threaded application, vmtest, from our home directory. Let's call this script vmtest.slrm. Like Torque, SLURM can use directives at the top of a script to set job parameters. These are denoted by #SBATCH. The options following the directive can also be specified on the sbatch command line.

#!/bin/bash
#SBATCH -J TestJob
#SBATCH -p background
#SBATCH --time=00:10:00
#SBATCH -n1
#SBATCH --mem-per-cpu=300
#SBATCH -o vmtest-%j.out
#SBATCH -e vmtest-%j.err

echo "starting at `date` on `hostname`"

# Print the SLURM job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

# Run the vmtest application
echo "running vmtest 256 100000"
$HOME/vmtest 256 100000

echo "ended at `date` on `hostname`"
exit 0
    

The #SBATCH directives used above are:

-J TestJob
Sets the job name to "TestJob"
-p background
Selects the "background" partition (queue)
--time=00:10:00
Sets the wallclock limit to 10 minutes
-n1
Requests a single task (core)
--mem-per-cpu=300
Requests 300MB of memory per CPU
-o vmtest-%j.out
Defines the job's stdout file
-e vmtest-%j.err
Defines the job's stderr file

By default, SLURM combines stdout and stderr into a single file. If a stdout file is not specified with -o outfile or a stderr file with -e errfile, then both streams are written to slurm-%j.out, where %j is the job ID.
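
Because the #SBATCH lines are ordinary sbatch options, the same job can also be submitted by giving the options on the command line, where they override the corresponding directives in the script. For example:

$ sbatch -J TestJob -p background --time=00:10:00 -n 1 --mem-per-cpu=300 -o vmtest-%j.out -e vmtest-%j.err vmtest.slrm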

The job is submitted with the sbatch command, its status checked with squeue, and it can be canceled with scancel. An example is shown below (jobID 38) that submits the job, checks it, cancels it, verifies it's gone, lists the stdout and stderr files, and displays the stderr file.

$ sbatch vmtest.slrm
Submitted batch job 38
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                38 background vmtest.s s-johnso R       0:02      1 c0101
$ scancel 38
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ ls vmtest-38.*
vmtest-38.err  vmtest-38.out
$ cat vmtest-38.err
slurmd[c0101]: *** JOB 38 CANCELLED AT 2014-01-07T11:31:51 ***
    

The #SBATCH options we have covered should be sufficient to get your single-core jobs running. Please consult the sbatch man page for more options.
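
While a job is running, scontrol and sstat can be used to inspect it. A sketch using the job ID from the example above (sstat reports on job steps, so the batch step is queried here; the format fields are only an illustration):

$ scontrol show job 38
$ sstat -j 38.batch --format=JobID,AveCPU,MaxRSS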

MPI

Good news, everyone! MPI applications "just work" with SLURM. The Brazos Cluster supports OpenMPI and MVAPICH2 for running MPI jobs in SLURM. MPI jobs launched using srun have a few required options depending on the MPI implementation being used.

openmpi
--mpi=pmi2 --resv-ports
mvapich2
--mpi=none
Note that the --mpi option sets the SLURM MPI plugin to be used. The none plugin does not mean "no MPI".

In the example below, we will use a few more #SBATCH directives: -p mpi-core8 to request an MPI partition, -N4 to request 4 nodes, and --ntasks-per-node=8 to request 8 tasks per node. The SLURM script, let's call it mpi.slrm, first runs /bin/hostname across all processes, then launches an Open MPI application twice (once with srun and once with mpirun), and finally runs an MVAPICH2 application with srun.
Submit with sbatch mpi.slrm.

#!/bin/bash
#SBATCH -J partest
#SBATCH -p mpi-core8
#SBATCH -N4
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=600
#SBATCH --time=00:30:00
#SBATCH -o partest-%J.out
#SBATCH -e partest-%J.err

echo "starting at `date` on `hostname`"

# Print the hostname from every process using srun.
echo "srun -l /bin/hostname"
srun -l /bin/hostname

# Load the openmpi module.
module load gcc openmpi

# Need to set OMPI_MCA_btl environment variable.
export OMPI_MCA_btl="openib,self"
echo "running vmtest-mpi-el6-openmpi with srun"
srun --mpi=pmi2 --resv-ports $HOME/vmtest-mpi-el6-openmpi 500 10000
echo "running vmtest-mpi-el6-openmpi with mpirun"
mpirun $HOME/vmtest-mpi-el6-openmpi 500 10000

# Switch from openmpi to mvapich2 environment
module swap openmpi mvapich2

echo "running vmtest-mpi-el6-mvapich2 with srun"
srun --mpi=none $HOME/vmtest-mpi-el6-mvapich2 500 10000

echo "ended at `date` on `hostname`"
exit 0
    

Both mpirun and srun can be used to launch Open MPI parallel applications. srun --mpi=pmi2 --resv-ports accepts a number of command-line options that control resource allocation right down to the processor and even the thread. MVAPICH2 applications are launched with srun --mpi=none. Note: our early testing has shown MVAPICH2 to outperform Open MPI on the Brazos cluster.
Please see the srun man page for more information.
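
For example, inside an allocation you might launch the Open MPI example on a subset of the allocated resources with an explicit task layout (the node and task counts are only an illustration):

srun --mpi=pmi2 --resv-ports -N 2 --ntasks-per-node=4 $HOME/vmtest-mpi-el6-openmpi 500 10000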

SLURM MPI guide

Job Arrays

A job array is created using the --array=<indices> option to sbatch. The indices can be a range, such as 0-7, or distinct values, such as 0,2,4,6,8. This is analogous to Torque's qsub -t option. The stdout and stderr files can use SLURM variables and can be specified as -o jarray-%A-%a.out, where %A represents the array's master job ID and %a is the task ID within the array.

Consider the example below, where a 3-task job array is submitted with sbatch --array=0-2. The script, jarray.slrm, prints the job-array environment variables that SLURM sets in each task:

#!/bin/bash
#SBATCH -J jarray
#SBATCH -p serial
#SBATCH -o jarray-%A-%a.out
#SBATCH --time=00:01:00
#SBATCH -N1
#SBATCH --ntasks-per-core=1
#SBATCH --mem-per-cpu=100

echo "starting at `date` on `hostname`"

echo "SLURM_JOBID=$SLURM_JOBID"
echo "SLURM_ARRAY_JOB_ID=$SLURM_ARRAY_JOB_ID"
echo "SLURM_ARRAY_TASK_ID=$SLURM_ARRAY_TASK_ID"

echo "srun -l /bin/hostname"
srun -l /bin/hostname
sleep 30
echo "ended at `date` on `hostname`"
exit 0
    

Here's the command sequence for submitting the job and examining the output:

$ sbatch --array=0-2 jarray.slrm
Submitted batch job 44
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              44_0    serial   jarray s-johnso  R       0:02      1 c0101
              44_1    serial   jarray s-johnso  R       0:02      1 c0101
              44_2    serial   jarray s-johnso  R       0:02      1 c0101
$ squeue   # no output, job is complete
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ ls jarray-*
jarray-44-0.out  jarray-44-1.out  jarray-44-2.out
$ cat jarray-44-1.out
starting at Tue Jan  7 14:13:57 CST 2014 on c0101
SLURM_JOBID=45
SLURM_ARRAY_JOB_ID=44
SLURM_ARRAY_TASK_ID=1
srun -l /bin/hostname
0: c0101
ended at Tue Jan  7 14:14:27 CST 2014 on c0101
    

Examining the output files more closely, we see that they share a common value for SLURM_ARRAY_JOB_ID, which is the base SLURM_JOBID. SLURM assigns each task within the array its own SLURM_JOBID and increments SLURM_ARRAY_TASK_ID. You may use these variables to define input files, output files, or to compute other values for your application (see the sketch after the listing below).

SLURM_JOBID=44
SLURM_ARRAY_JOB_ID=44
SLURM_ARRAY_TASK_ID=0

SLURM_JOBID=45
SLURM_ARRAY_JOB_ID=44
SLURM_ARRAY_TASK_ID=1

SLURM_JOBID=46
SLURM_ARRAY_JOB_ID=44
SLURM_ARRAY_TASK_ID=2
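
For example, a job array script might use SLURM_ARRAY_TASK_ID to select per-task input and output files. The application and file names below are hypothetical:

# Hypothetical per-task input/output selection inside a job array script.
INPUT=$HOME/data/input-${SLURM_ARRAY_TASK_ID}.dat
$HOME/myapp $INPUT > $HOME/results/output-${SLURM_ARRAY_TASK_ID}.txt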
    

To cancel one or more (or all) elements of a job array:

# Submit job array and check queue
$ sbatch --array=0-7 jarray.slrm
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              47_0    serial   jarray s-johnso  R       0:02      1 c0101
              47_1    serial   jarray s-johnso  R       0:02      1 c0101
              47_2    serial   jarray s-johnso  R       0:02      1 c0101
              47_3    serial   jarray s-johnso  R       0:02      1 c0101
              47_4    serial   jarray s-johnso  R       0:02      1 c0101
              47_5    serial   jarray s-johnso  R       0:02      1 c0101
              47_6    serial   jarray s-johnso  R       0:02      1 c0101
              47_7    serial   jarray s-johnso  R       0:02      1 c0101
# Cancel elements 5,6,7
$ scancel 47_[5-7]
# See that they're gone
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              47_0    serial   jarray s-johnso  R       0:16      1 c0101
              47_1    serial   jarray s-johnso  R       0:16      1 c0101
              47_2    serial   jarray s-johnso  R       0:16      1 c0101
              47_3    serial   jarray s-johnso  R       0:16      1 c0101
              47_4    serial   jarray s-johnso  R       0:16      1 c0101
# Cancel the entire remaining array
$ scancel 47
# All gone!
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    

SLURM Job Array Docs

Interactive Jobs

Interactive batch jobs can be launched using the sintr command. The sintr command will enable X11 forwarding to the batch job if your session on the login node contains the DISPLAY environment variable.

To launch an interactive job with the default options use this command:

$ sintr
salloc: Granted job allocation 13161
$ hostname
c0101.brazos.tamu.edu
$ srun hostname
c0101.brazos.tamu.edu
$ exit
salloc: Relinquishing job allocation 13161
salloc: Job allocation 13161 has been revoked.
    

This is an example of launching an interactive job that can use MPI:

$ sintr -N 2 -n 2
salloc: Granted job allocation 13164
$ hostname
c0133.brazos.tamu.edu
$ srun hostname
c0133.brazos.tamu.edu
c0134.brazos.tamu.edu
$ exit
salloc: Relinquishing job allocation 13164
salloc: Job allocation 13164 has been revoked.
    

The sintr -h command can be used to display all of the available options for sintr. There is no man page for sintr.

Partition Information

In SLURM a "partition" is the term used for queues. Each partition corresponds to a set of compute nodes with a specific set of parameters. For Brazos the partitions are also where the preemption logic is defined.

Partition       Access      Allowed QOS   Nodes Per Job   CPUs per node   Max mem per CPU   Time limit   Preemption
stakeholder     hepx,idhmc  hepx,idhmc    1               8/32            2000MB            120hrs       preemptor
stakeholder-4g  hepx,idhmc  hepx,idhmc    1               8/32            4000MB            120hrs       preemptor
serial **       all         general       1               8               4000MB            72hrs        preemptor
serial-long     all         long          1               8               4000MB            720hrs       -
mpi-core8       all         mpi           2+              8               4000MB            48hrs        preemptor
mpi-core28      all         mpi           2+              28/56           2000MB            48hrs        preemptor
mpi-core32      all         mpi           2+              32              2000MB            48hrs        preemptor
mpi-core32-4g   all         mpi           2+              32              4000MB            48hrs        preemptor
background      all         background    1               8/32            2000MB            96hrs        preemptee
background-4g   all         background    1               8/32            4000MB            96hrs        preemptee
interactive     all         interactive   1-2             8/32            2000MB            8hrs         preemptor

** serial is the default partition.
stakeholder
The stakeholder partition is usable only by members of the hepx and idhmc groups. This partition contains only nodes with 2GB of memory per CPU.
stakeholder-4g
The stakeholder-4g partition is usable only by members of the hepx and idhmc groups. This partition contains only nodes with 4GB of memory per CPU.
serial
This is the default partition available to all users on Brazos and is intended for single-node jobs. This partition contains 8-core Intel Harpertown nodes (32GB).
serial-long
The serial-long partition is intended for long running, single-node, jobs. This partition is available to all users and contains the same nodes as the serial partition.
mpi-core8
The mpi-core8 partition is intended for MPI jobs requesting two or more nodes. This partition is available to all users and contains 8-core AMD Shanghai (32GB) InfiniBand nodes.
mpi-core28
The mpi-core28 partition is intended for MPI jobs requesting two or more nodes. This partition is available to all users and contains 28-core/56-thread Intel Broadwell FDR InfiniBand nodes. See Broadwell Notes below.
mpi-core32
The mpi-core32 partition is intended for MPI jobs requesting two or more nodes. This partition is available to all users and contains 32-core AMD Bulldozer/Piledriver (64GB and 128GB) InfiniBand nodes.
mpi-core32-4g
The mpi-core32-4g partition is intended for MPI jobs requesting two or more nodes that may need more memory than is available in the mpi-core32 partition. This partition is available to all users and contains 32-core AMD Bulldozer/Piledriver (128GB) InfiniBand nodes.
background
The background partition is a preemptable partition available to all users and will use any available compute node on the cluster. This partition contains only nodes with 2GB of memory per CPU.
background-4g
The background-4g partition is a preemptable partition available to all users and will use any available compute node on the cluster. This partition contains only nodes with 4GB of memory per CPU.
interactive
The interactive partition is intended for interactive jobs. This partition is available to all users and will use any available compute node on the cluster.

Do not overallocate nodes for your job. Do not attempt to allocate two or more nodes from an MPI partition to run a single-node job. Your job will be deleted, and continued abuse of the MPI partitions in this way will result in loss of access to the cluster.
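
Current partition limits and node counts can also be queried directly with sinfo; the format fields below are only an illustration:

$ sinfo -o "%P %l %D %c %m"     # partition, time limit, node count, CPUs per node, memory per node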

Broadwell Notes

The Supermicro nodes added in Fall 2016 are the first nodes in the Brazos Cluster to have two execution threads per CPU core, using the Intel Xeon E5-2658v4 "Broadwell" processors. This architecture presents some scheduling challenges when integrating into an existing heterogeneous cluster. While a thread does not represent an entire execution unit, our initial tests demonstrated that mapping one Unix process to a thread provided better throughput for "single cpu" jobs.

If you have been running single-CPU jobs in the past using the -n 1 option to SLURM, then you can continue to use this option. If you want to explicitly request the new nodes, add -C broadwell to your SLURM options. If you prefer that your application use a full core, you can request two "cpus" per task, --cpus-per-task=2, where "cpus" in this context refers to the two threads. This works best if you can request an entire node and run 28 processes on it, so as to avoid contention for resources with other jobs: -p background -N 1 --ntasks-per-node=28 --cpus-per-task=2 -C broadwell
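
Pulling those options together, a minimal script for a single-process job that requests a full Broadwell core (two threads) might look like the sketch below; myapp, the memory request, and the time limit are placeholders:

#!/bin/bash
#SBATCH -J bwtest
#SBATCH -p background
#SBATCH -n 1
#SBATCH --cpus-per-task=2
#SBATCH -C broadwell
#SBATCH --mem-per-cpu=1900
#SBATCH --time=01:00:00

# Hypothetical single-process application; it gets both threads of one Broadwell core.
$HOME/myapp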

By default, like serial jobs in the background and stakeholder partitions, MPI jobs will see each thread on a node as an individual processor. Thus, the SLURM options -p mpi-core28 -N 6 -n 336 will allocate all threads on the 6 new nodes.

If you want to explore using the new Broadwell nodes with their two-thread-per-core topology in an MPI job, you can submit your job with -p mpi-core28 -N 6 --ntasks-per-node=28 --cpus-per-task=2. This will allocate all cores and threads across the 6 nodes. The mpirun command will see this as 6*28=168 processors. You can explicitly run this as mpirun -np 168 -npernode 28 ./myapp.

As an extension of the above example, if you are using the multi-threaded OpenBLAS library and will be running on the same 28 cores per node, but using the two threads on each core for the BLAS functions, your SLURM script would look something like this:

#!/bin/bash
#SBATCH -J partest
#SBATCH -p mpi-core28
#SBATCH -N 6
#SBATCH --ntasks-per-node=28
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=600
#SBATCH --time=00:30:00
#SBATCH -o partest-%J.out
#SBATCH -e partest-%J.err

# Load the openmpi and openblas modules.
module load gcc openmpi openblas

# Need to set OMPI_MCA_btl environment variable.
export OMPI_MCA_btl="openib,self"

# Run the application vmtest-mpi-el6-openmpi stored in $HOME
mpirun -np 168 -npernode 28 $HOME/vmtest-mpi-el6-openmpi 500 10000

exit 0
    

Resource Limits

Limits on CPU count, memory usage, and job count are applied on a per-partition basis. However, per-user limits cannot be set at the partition level; these are set on the account association, either the user record in the accounting database or the parent account of the user. We can also set limits using SLURM's QoS capabilities.
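
To see the limits attached to your own account association, you can query the accounting database with sacctmgr. The format fields below are only an example:

$ sacctmgr show associations user=$USER format=Cluster,Account,User,MaxJobs,MaxSubmit,GrpTRES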

Partition Resource Limits

The partition information is available using the scontrol show partitions command. For completeness, here's the entire output for the Brazos SLURM environment.

$ scontrol show partitions
PartitionName=admin
   AllowGroups=ALL AllowAccounts=ALL AllowQos=admin
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[101-132,133-138,207-232,234-241,407-440,507-532,533-540,933-936],c0[611-630,711-730,811-832,911-916,919-931]n[1-2]
   Priority=1001 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=3072 TotalNodes=306 SelectTypeParameters=N/A
   DefMemPerCPU=1900 MaxMemPerCPU=4000

PartitionName=interactive
   AllowGroups=ALL AllowAccounts=ALL AllowQos=interactive
   AllocNodes=ALL Default=NO
   DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=08:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[101-132,133-138,207-232,234-241,407-440,507-532,533-540,933-936],c0[611-630,711-730,811-832,911-916,919-931]n[1-2]
   Priority=1000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=3072 TotalNodes=306 SelectTypeParameters=N/A
   DefMemPerCPU=1900 MaxMemPerCPU=2000

PartitionName=stakeholder
   AllowGroups=ALL AllowAccounts=ALL AllowQos=idhmc,hepx,cms-local
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=5-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[101-132,407-440,507-532,533-540,933-936],sbi0631n[01-03,11-13]
   Priority=500 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=1120 TotalNodes=104 SelectTypeParameters=N/A
   DefMemPerCPU=1900 MaxMemPerCPU=2000

PartitionName=stakeholder-4g
   AllowGroups=ALL AllowAccounts=ALL AllowQos=idhmc,hepx,cms-local
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=5-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[133-138,207-232,234-241],c0[611-630,711-730,811-832,911-916,919-931]n[1-2]
   Priority=500 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=1952 TotalNodes=202 SelectTypeParameters=N/A
   DefMemPerCPU=3900 MaxMemPerCPU=4000

PartitionName=serial
   AllowGroups=ALL AllowAccounts=ALL AllowQos=general
   AllocNodes=ALL Default=YES
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[207-232]
   Priority=100 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=208 TotalNodes=26 SelectTypeParameters=N/A
   DefMemPerCPU=3900 MaxMemPerCPU=4000

PartitionName=serial-long
   AllowGroups=ALL AllowAccounts=ALL AllowQos=long
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=30-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[207-232]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=208 TotalNodes=26 SelectTypeParameters=N/A
   DefMemPerCPU=3900 MaxMemPerCPU=4000

PartitionName=mpi-core8
   AllowGroups=ALL AllowAccounts=ALL AllowQos=mpi
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[611-630,711-730,811-832,911-916,919-931]n[1-2]
   Priority=101 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=1296 TotalNodes=162 SelectTypeParameters=N/A
   DefMemPerCPU=3900 MaxMemPerCPU=4000

PartitionName=mpi-core28
   AllowGroups=ALL AllowAccounts=ALL AllowQos=mpi
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=sbi0631n[01-03,11-13]
   Priority=104 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=336 TotalNodes=6 SelectTypeParameters=N/A
   DefMemPerCPU=1900 MaxMemPerCPU=2000

PartitionName=mpi-core32
   AllowGroups=ALL AllowAccounts=ALL AllowQos=mpi
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[133-138,234-241,533-540,933-936]
   Priority=102 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=832 TotalNodes=26 SelectTypeParameters=N/A
   DefMemPerCPU=1900 MaxMemPerCPU=2000

PartitionName=mpi-core32-4g
   AllowGroups=ALL AllowAccounts=ALL AllowQos=mpi
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[133-138,234-241]
   Priority=103 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=448 TotalNodes=14 SelectTypeParameters=N/A
   DefMemPerCPU=3900 MaxMemPerCPU=4000

PartitionName=background
   AllowGroups=ALL AllowAccounts=ALL AllowQos=background
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=4-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[101-132,407-440,507-532,533-540,933-936],sbi0631n[01-03,11-13]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=1120 TotalNodes=104 SelectTypeParameters=N/A
   DefMemPerCPU=1900 MaxMemPerCPU=2000

PartitionName=background-4g
   AllowGroups=ALL AllowAccounts=ALL AllowQos=background
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=4-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c0[133-138,207-232,234-241],c0[611-630,711-730,811-832,911-916,919-931]n[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=1952 TotalNodes=202 SelectTypeParameters=N/A
   DefMemPerCPU=3900 MaxMemPerCPU=4000
    

Notice that each partition has a default memory limit per CPU of 1900MB or 3900MB (DefMemPerCPU). Because we are permitting multiple jobs to share the same node, it is essential to place a memory limit on all jobs to prevent thrashing. You may request more memory using the --mem-per-cpu=N option, where N is in megabytes. Please do not request more memory than your application needs, as this will cause underutilization of the cluster. We will delete any jobs that excessively overallocate memory.
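
For example, a job that needs roughly 3.5GB per CPU in a 4000MB-per-CPU partition could request (the value is only an illustration):

#SBATCH --mem-per-cpu=3500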

QOS Resource Limits

The SLURM QoS (Quality of Service) is used on the Brazos Cluster to set various limits as well as priorities.

QOS          Priority   Usage Factor   Max Running Jobs   Max Submit Jobs   Max CPUs       Max Time
                                       (Grp / User)       (Grp / User)      (Grp / User)
hepx         10000      0.5            - / -              5000 / -          700 / -        120hrs
idhmc        10000      0.5            - / -              1000 / -          128 / -        120hrs
general      5000       1.0            50 / 10            500 / 15          - / -          72hrs
long         4000       1.0            10 / -             100 / -           - / -          720hrs
mpi          5000       1.0            - / 30             500 / 40          1536 / -       48hrs
background   1000       0.25           - / 3000           10000 / 3000      - / -          96hrs
interactive  8000       1.0            - / 1              - / 1             - / -          8hrs

The QOS information is available using the sacctmgr show qos command. For completeness, here's the entire output for the SLURM test configuration on Brazos EL6. More columns are available when run without the format option.

$ sacctmgr show qos format=Name,Priority,UsageFactor,GrpTRES,MaxTRESPerJob,MaxTRESPerUser,GrpJobs,MaxJobsPerUser,GrpSubmit,MaxSubmitJobs,MaxWall
      Name   Priority UsageFactor       GrpTRES       MaxTRES     MaxTRESPU GrpJobs MaxJobsPU GrpSubmit MaxSubmit     MaxWall
---------- ---------- ----------- ------------- ------------- ------------- ------- --------- --------- --------- -----------
    normal          0    1.000000
      hepx      10000    0.500000       cpu=700        node=1                                      5000            5-00:00:00
     idhmc      10000    0.500000       cpu=128        node=1                                      1000            5-00:00:00
background       1000    0.250000                      node=1      cpu=3000                       10000      3000  4-00:00:00
   general       5000    1.000000                      node=1                    50        10       500        15  3-00:00:00
       mpi       5000    1.000000      cpu=1536                                            30       500        40  2-00:00:00
      grid        500    0.250000                      node=1                            1000      5000      2000  3-00:00:00
      long       4000    1.000000                      node=1                    10                 100           30-00:00:00
interacti+       8000    1.000000                      node=2        node=2                 1                   1    08:00:00
     admin     100000    1.000000
 cms-local      10000    0.500000       cpu=200                                 200        50      5000
    

Accounting

The sacct command is used to show accounting information. Without any arguments, it shows the current user's jobs (and the individual job steps within each job) from the current day. Below is the output from sacct, which shows the jobs from the examples above. The orted entries are the Open MPI runtime daemons.

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
42                 bash        mpi       math          2  COMPLETED      0:0 
42.0              orted                  math          2  COMPLETED      0:0 
42.1              orted                  math          2  COMPLETED      0:0 
42.2         vmtest-mpi                  math          2  COMPLETED      0:0 
42.3              orted                  math          2  COMPLETED      0:0 
42.4              orted                  math          2  COMPLETED      0:0 
42.5              orted                  math          2  COMPLETED      0:0 
42.6              orted                  math          2  COMPLETED      0:0 
42.7              orted                  math          2  COMPLETED      0:0 
42.8              orted                  math          2  COMPLETED      0:0 
43               jarray     serial       math          1  COMPLETED      0:0 
43.batch          batch                  math          1  COMPLETED      0:0 
43.0           hostname                  math          1  COMPLETED      0:0 
44               jarray     serial       math          1  COMPLETED      0:0 
44.batch          batch                  math          1  COMPLETED      0:0 
44.0           hostname                  math          1  COMPLETED      0:0 
45               jarray     serial       math          1  COMPLETED      0:0 
45.batch          batch                  math          1  COMPLETED      0:0 
45.0           hostname                  math          1  COMPLETED      0:0 
46               jarray     serial       math          1  COMPLETED      0:0 
46.batch          batch                  math          1  COMPLETED      0:0 
46.0           hostname                  math          1  COMPLETED      0:0 
47               jarray     serial       math          1 CANCELLED+      0:0 
47.batch          batch                  math          1  CANCELLED     0:15 
47.0           hostname                  math          1  COMPLETED      0:0 
48               jarray     serial       math          1 CANCELLED+      0:0 
48.batch          batch                  math          1  CANCELLED     0:15 
48.0           hostname                  math          1  COMPLETED      0:0 
    

A resource usage summary of jobs can be viewed by specifying the -X and --format options.

$ sacct  -X --format=jobid,ncpus,cputime,elapsed,state
       JobID      NCPUS    CPUTime    Elapsed      State 
------------ ---------- ---------- ---------- ---------- 
38                    1   00:00:06   00:00:06 CANCELLED+ 
39                    1   00:00:39   00:00:39 CANCELLED+ 
40                    2   00:00:20   00:00:10  COMPLETED 
41                   64   02:15:28   00:02:07  COMPLETED 
42                    2   04:48:18   02:24:09  COMPLETED 
43                    1   00:00:31   00:00:31  COMPLETED 
44                    1   00:00:31   00:00:31  COMPLETED 
45                    1   00:00:31   00:00:31  COMPLETED 
46                    1   00:00:31   00:00:31  COMPLETED 
47                    1   00:00:24   00:00:24 CANCELLED+ 
48                    1   00:00:24   00:00:24 CANCELLED+ 
49                    1   00:00:24   00:00:24 CANCELLED+ 
50                    1   00:00:24   00:00:24 CANCELLED+ 
51                    1   00:00:24   00:00:24 CANCELLED+ 
52                    1   00:00:13   00:00:13 CANCELLED+ 
53                    1   00:00:13   00:00:13 CANCELLED+ 
54                    1   00:00:13   00:00:13 CANCELLED+ 
    

The many options of sacct can be found at the sacct man page.
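
For example, to summarize jobs over a specific date range (the dates and format fields are only an illustration):

$ sacct -S 2014-01-01 -E 2014-01-08 -X --format=jobid,jobname,partition,ncpus,elapsed,state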

References