
Submitting jobs to different nodes

Since 2018 there has been only one default slurm partition (queue) on zeus, called “all”, see here
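
If you want to check the partition setup yourself, the standard sinfo command lists the configured partitions and their nodes, for example:

sinfo -s        # one-line summary of each partition
sinfo -p all    # list the nodes in the "all" partition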

See also: examples of SLURM submission scripts

Submit to ‘Broadwell’ nodes

The hostnames of Broadwell nodes are zeus[300-343] & zeus[400-409] & zeus[500-501] (56 in total). They have 32 CPU-cores (Broadwell Xeon) and 128GB of RAM. There are several ways to use these nodes.

  • submit directly to a compute node by specifying its hostname (not recommended; use this only if you need that exact node for some reason, e.g. you have a reservation there). For example, to request zeus300, in your slurm script use
    #SBATCH -w zeus300
  • or request that particular node during submission
    sbatch -w zeus300 slurmscriptname.slurm
  • or, in your slurm submission script, request the constraint “broadwell” with whatever number of tasks you require. For example, we can request one task that has access to all 32 CPUs of one Broadwell node (e.g. to run SMP code):
    #SBATCH --constraint=broadwell
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32

    or request 32 individual tasks (one task will be allocated per CPU):

    #SBATCH --constraint=broadwell
    #SBATCH -n32 -N1

    where -n32 is the total number of CPUs requested over one node (-N1), not the number of CPUs per node. A complete script sketch combining these directives follows this list.
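
Putting these pieces together, a complete Broadwell submission script might look like the minimal sketch below; the job name, output file and “./myexecutable” are placeholders to adapt to your own job:

#!/bin/bash
#SBATCH --job-name=broadwell-smp      # placeholder job name
#SBATCH --constraint=broadwell        # run on a 32-core Broadwell node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32            # give the single task all 32 cores
#SBATCH --time=04:00:00               # request 4 hours explicitly
#SBATCH --output=broadwell-smp.out    # placeholder output file

./myexecutable                        # placeholder for your (SMP) program

Save it, e.g. as broadwell.slurm, and submit it with “sbatch broadwell.slurm”.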

Submit to ‘Sandybridge’ 12-CPU nodes

GPU nodes of zeus (zeus[200-217]) have 12-core Sandybridge-family CPUs and NVIDIA K20 GPUs (2 per node). You can also use these nodes without the GPUs. To use them, do one of the following (a quick way of listing each node’s features and GPUs is sketched after this list):

  • specify
    --constraint=sandy

    or request k20 GPUs:

    --gres=gpu:K20:N

    (where N is the number of GPUs you need)

    sbatch --constraint=sandy slurmscriptname.slurm
    sbatch -N1 --gres=gpu:K20:2 slurmscriptname.slurm
  • you can also request these particular nodes (zeus200…zeus217) to be allocated to your job, e.g. we can ask for 2 of these nodes:
    sbatch -w zeus[200-201] slurmscriptname.slurm
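
If you are unsure which features (constraints) and GPUs a node advertises, the standard sinfo command can show them per node; only the zeus[200-217] node range below is specific to this example:

sinfo -N -n zeus[200-217] -o "%N %f %G"    # node name, features and generic resources (GPUs)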


Submit to Nehalem (8-CPU) nodes

These are the older Nehalem-based nodes of the Zeus HPC, each having 8 CPU-cores and 48GB of RAM: zeus[20-91] (floor 3 of ECB) and zeus[100-171] (Mezz floor of ECB).
If you want to use specifically these nodes, use

sbatch --constraint=nehalem slurmscriptname.slurm

when submitting your job with sbatch or srun, or specify it in your slurm sbatch script:

#SBATCH --constraint=nehalem

You can also request particular nodes (e.g. zeus34 and zeus56…zeus72 in the example below):

sbatch -w zeus[34,56-72] slurmscriptname.slurm

An important thing to remember is that these nodes cannot run more than 8 tasks per node; for example, to fill two of them completely you would request 8 tasks per node, as sketched below.
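
A minimal sketch of such a request, filling two Nehalem nodes with 16 tasks in total (“./myexecutable” is a placeholder for your own parallel program):

#SBATCH --constraint=nehalem
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8    # no more than 8 tasks fit on each 8-core node

srun ./myexecutable            # placeholder parallel program, one task per core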

Request nodes with GPU processors

  • In your slurm submission script specify (e.g. to ask for 2 GPUs of any kind)
    #SBATCH --gres=gpu:2
  • OR request a particular kind of GPU (K20 or K80) in your slurm submission script (the example below asks for 1 K80 GPU):
    #SBATCH --gres=gpu:k80:1
  • or e.g. 2 K20 GPUs on one node in exclusive mode (the node will not be shared with other jobs even if it has free resources, i.e. CPUs or GPUs not used by your job); a full script sketch follows this list:
    sbatch --gres=gpu:k20:2 -N1 --exclusive slurmscriptname.sh
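
As a fuller illustration, a minimal GPU job script might look like the sketch below; the job name, output file and “./mygpuexecutable” are placeholders, and nvidia-smi is only included as a quick check that a GPU is visible on the allocated node:

#!/bin/bash
#SBATCH --job-name=gpu-test        # placeholder job name
#SBATCH --gres=gpu:k20:1           # one K20 GPU (use gpu:k80:1 or gpu:1 as needed)
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --output=gpu-test.out      # placeholder output file

nvidia-smi                         # report the GPUs visible on the node
./mygpuexecutable                  # placeholder for your GPU program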

Launching a job without a slurm submission script

  • You can run any executable, e.g. “myexecutable”, on zeus’s nodes without any slurm submission script (the job will still be placed in the queue while it runs). Here we request 2 nodes and a total of 12 CPUs for 10 minutes:
    srun -t 00:10:00 -N2 -n12 ./myexecutable

    Note that when submitting a job with “srun”, the command line will not be “released” until the job completes and the output is written to STDOUT (unless specified otherwise). This approach is suitable if you expect your job to complete very quickly; this way you can also see the direct output of your executable in the terminal.
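
    While the job runs (whether launched with srun, sbatch or salloc), you can check its state from another terminal with the standard squeue command, for example:

    squeue -u $USER    # list your own pending and running jobs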

  • You can also reserve nodes for a specified time without launching any task on them (i.e. it will start an idle interactive bash session). The example below asks for 2 Broadwell nodes (32 CPUs/node) in exclusive mode (no resource sharing with other jobs) for 12 hours:
    salloc -N2 --constraint=broadwell --exclusive -t 12:00:00

    Similarly for GPUs, e.g. one Sandybridge node with 2 K20 GPUs for 30 minutes:

    salloc -N1 --gres=gpu:K20:2 --exclusive -t 30

    After submission of a “salloc” job, your console will start a dedicated bash session in interactive mode, which will have all allocated resources available. When you leave that bash session (Ctrl-D or “exit”), your “salloc” job/allocation will be terminated in the queue. To avoid this (job termination on exiting the bash session) you can use the “screen” utility, as sketched below.
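
    A minimal sketch of the “screen” approach (standard screen usage, nothing zeus-specific; “myalloc” is just a placeholder session name):

    screen -S myalloc                                  # start a named screen session
    salloc -N1 --constraint=broadwell -t 12:00:00      # request the allocation inside screen
    # ... work interactively inside the allocation ...
    # detach with Ctrl-A then D; the allocation keeps running
    screen -r myalloc                                  # reattach to the session later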

    For all possible “constraints”, see here: http://zeus.coventry.ac.uk/wordpress/?p=1094

  • Note that in any of the cases described above, if you do not specify the time required for your job, it will be assigned the default of 4 hours, after which the job will be terminated. You can specify the time with the “-t” flag; below we ask for 8 CPUs (on any available nodes) for 8 hours:
sbatch -n8 -t 8:00:00 slurmscriptname.slurm