05 Submitting jobs to different nodes

Since 2018 there has been only one default SLURM partition (queue) on zeus, called “all”, see here
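You can check the available partition(s) and the state of the nodes yourself with the standard SLURM sinfo command (a quick sketch; no zeus-specific options are needed):

    sinfo        # list partitions and node states
    sinfo -s     # condensed summary, one line per partition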

See also: examples of SLURM submission scripts

Submit to ‘Broadwell’ nodes

The hostnames of the Broadwell nodes are zeus[300-343], zeus[400-409] and zeus[500-501] (56 in total). Each node has 32 CPU-cores (Broadwell Xeon) and 128GB of RAM. There are several ways you can use these nodes.

  • submit directly to a compute node by specifying its hostname (not recommended; only do this if you need that exact node for some reason, e.g. you have a reservation there). For example, to request zeus300, in your slurm script use
    #SBATCH -w zeus300
  • or request that particular node during submission
    sbatch -w zeus300 slurmscriptname.slurm
  • or in your slurm submission script request the constraint “broadwell” with whatever number of tasks you require. For example, we can request one task that has access to all 32 CPUs of one Broadwell node (to run SMP code, for example); a complete script sketch is given after this list:
    #SBATCH --constraint=broadwell
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32

    or request 32 individual tasks (SLURM will allocate one task per CPU)

    #SBATCH --constraint=broadwell
    #SBATCH -n32 -N1

    where -n32 is the total number of CPUs requested across one node (-N1), not the number of CPUs per node.
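For reference, here is a minimal complete script for the SMP-style request above (the job name, time limit, output file and the executable ./my_smp_program are illustrative placeholders, not zeus defaults):

    #!/bin/bash
    #SBATCH --job-name=smp_test         # illustrative job name
    #SBATCH --constraint=broadwell      # run on a 32-core Broadwell node
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32          # give the single task all 32 cores
    #SBATCH --time=01:00:00             # 1 hour; the default is 4 hours
    #SBATCH --output=smp_test.%j.out    # %j expands to the job ID

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # useful if the code is OpenMP-threaded
    ./my_smp_program                              # placeholder for your own threaded executable

Save it as, e.g., myscript.slurm and submit it with “sbatch myscript.slurm”.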

Submit to ‘Sandybridge’ 12-CPU nodes

The GPU nodes of zeus (zeus[200-217]) have 12-core Sandybridge-family CPUs and two NVIDIA K20 GPUs per node. You can use these nodes without the GPUs as well. To use these nodes, do one of the following:

  • specify
    --constraint=sandy

    or request k20 GPUs:

    --gres=gpu:K20:N

    (where N is the number of GPUs you need), e.g.

    sbatch --constraint=sandy slurmscriptname.slurm
    sbatch -N1 --gres=gpu:K20:2 slurmscriptname.slurm
  • you can also request these particular nodes (zeus200…zeus217) to be allocated to your job, e.g. we can ask for 2 of these nodes (a complete GPU script sketch follows this list):
    sbatch -w zeus[200-201] slurmscriptname.slurm
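As with the Broadwell example above, here is a minimal sketch of a script that asks for one Sandybridge node with both of its K20 GPUs (the executable ./my_gpu_program is a placeholder, and “cuda” is an assumed module name; check “module avail” on zeus for the exact module):

    #!/bin/bash
    #SBATCH --constraint=sandy          # 12-core Sandybridge GPU node
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:K20:2            # both K20 GPUs of the node
    #SBATCH --time=02:00:00

    module load cuda                    # assumed module name; adjust as needed
    nvidia-smi                          # print the GPUs visible to this job
    ./my_gpu_program                    # placeholder for your CUDA/OpenCL executable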

     

Submit to Nehalem (8-CPU) nodes

These are the older Nehalem-based nodes of the Zeus HPC, each having 8 CPU-cores and 48GB of RAM: zeus[20-91] (floor 3 of ECB) and zeus[100-171] (Mezzanine floor of ECB).
If you want to use specifically these nodes, use

sbatch --constraint=nehalem slurmscriptname.slurm

when submitting your job with sbatch or srun or specify it in your slurm sbatch script

#SBATCH --constraint=nehalem

You can also explicitly request particular nodes (e.g. zeus34 and zeus56 to zeus72 in the example below):

sbatch -w zeus[34,56-72] slurmscriptname.slurm

An important thing to remember is that these nodes cannot allocate more than 8 tasks per node.
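For example, a minimal sketch of an MPI-style request that respects this 8-tasks-per-node limit (the executable ./my_mpi_program is a placeholder):

    #!/bin/bash
    #SBATCH --constraint=nehalem
    #SBATCH --nodes=2                   # two 8-core Nehalem nodes
    #SBATCH --ntasks-per-node=8         # at most 8 tasks per node here
    #SBATCH --time=04:00:00

    srun ./my_mpi_program               # launches 16 tasks, one per allocated core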

Request nodes with GPU processors

  • In your slurm submission script specify (e.g. to ask for 2 GPUs of any kind)
    #SBATCH --gres=gpu:2
  • OR request a particular kind of GPU (K20 or K80) in your slurm submission script (the example below asks for 1 K80 GPU):
    #SBATCH --gres=gpu:k80:1
  • or e.g. 2 K20 GPUs on one node in exclusive mode (the node will not be shared with other jobs even if it has free resources, i.e. CPUs or GPUs that are not used by your job):
    sbatch --gres=gpu:k20:2 -N1 --exclusive slurmscriptname.sh
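If you are not sure which GPU types live on which nodes, the standard sinfo format options can print the GRES column per node (a sketch; the sample value in the comment is illustrative):

    sinfo -N -o "%N %G"     # one line per node: hostname and its GPU gres, e.g. gpu:K20:2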

Launching a job without a slurm submission script

  • You can run any executable, e.g. “myexecutable”, on zeus’s compute nodes without any slurm submission script (the job will still be placed in the queue while it is running). Here we request 2 nodes and a total of 12 CPUs for 10 minutes:
    srun -t 00:10:00 -N2 -n12 ./myexecutable

    Note that when you submit a job with “srun”, the command line will not be “released” until the job completes and the output is produced on STDOUT (unless specified otherwise). This approach is suitable if you expect your “job” to complete very quickly; this way you can also see the direct output of your executable in the terminal.

  • You can also reserve nodes for a specified time without launching any task on them (i.e. it will launch an idle interactive bash session). The example below asks for 2 Broadwell nodes (32 CPUs/node, in exclusive mode, i.e. no resource sharing with other jobs) for 12 hours:
    salloc -N2 --constraint=broadwell --exclusive -t 12:00:00

    Similarly for GPUs, e.g. one Sandybridge node with 2 K20 GPUs for 30 minutes:

    salloc -N1 --gres=gpu:K20:2 --exclusive -t 30

    After submission of the “salloc” job, your console will start a dedicated bash session in interactive mode, which will have all the allocated resources available. When you leave that bash session (Ctrl-D or “exit”), your “salloc” job/allocation will be terminated in the queue. To avoid this (job termination on exiting the bash session) you can use the “screen” utility, as sketched below.
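    A minimal sketch of the “screen” workflow (the session name “hpc” is just an illustrative choice):

    screen -S hpc                                      # start a named screen session on the login node
    salloc -N1 --constraint=broadwell -t 12:00:00      # request resources from inside the screen session
    # ... work inside the allocation ...
    # detach with Ctrl-A then D: the session and the allocation keep running
    screen -r hpc                                      # later, reattach to the same session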

    For all possible “constraints”, see here: http://zeus.coventry.ac.uk/wordpress/?p=1094

  • Note that in any of the cases described above, if you do not specify the time required for your job, it will be assigned the default of 4 hours, after which the job will be terminated. You can specify the time with the “-t” flag (below we ask for 8 CPUs on any available nodes for 8 hours):
sbatch -n8 -t 8:00:00 slurmscriptname.slurm

Memory requirements when submitting a job

Dear All

To avoid HPC users bringing down compute nodes, memory limits have been introduced on the zeus HPC.

If your jobs are not that memory-hungry, you probably will not notice this at all. By “memory-hungry” we mean exceeding 4GB per CPU-core (the default value).

If your job requires more than that, you can request more memory using the --mem=... or --mem-per-cpu=... [MB] parameter with sbatch.

E.g.

1) Ask for the “full” 48 GB of memory to be available to your job, e.g. on the Nehalem (8-CPU) nodes:

sbatch -n8 -N1 --mem=48000 -t 8:00:00 myslurmscript.slurm

In this case one node is requested, with 8 tasks (CPUs), and the total job memory is 48GB (48000 MB).

If this is not specified, the maximum memory would be 4GB x 8 CPUs = 32 GB.

2) If you are using the 32-CPU Broadwell nodes, which have 128GB of RAM, the default value of 4GB/CPU is already the maximum possible when all CPUs are used. If you want more RAM per CPU, e.g. if you use only 2 CPU-tasks but need all of the node’s memory allocated to them, you can do:

sbatch -n2 -N1 --mem-per-cpu=64000 ...

3) If you need more than 128GB per node, you can use the SMP node (zeus15, max 512GB/node) by requesting “--constraint=smp” when submitting the job.
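For example, a sketch of such a request (the 500000 MB figure is an assumption chosen to stay below the 512GB physical limit; the exact requestable maximum may be slightly lower because some memory is reserved for the OS):

sbatch -n16 -N1 --constraint=smp --mem=500000 -t 24:00:00 myslurmscript.slurm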

If you request an amount of memory exceeding what is physically available, the job will fail to submit with the error messages “error: Memory specification can not be satisfied” and “error: Unable to allocate resources: Requested node configuration is not available”.

If the memory you requested for your job (or the default 4GB/CPU, if you left it unset) is exceeded during the run, slurm will terminate the job.
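To check how much memory a finished job actually used, you can query SLURM’s accounting records (a sketch; the job ID 123456 is a placeholder, and this assumes accounting is enabled on zeus):

sacct -j 123456 --format=JobID,JobName,ReqMem,MaxRSS,State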

Before this measure it was possible to “oversubscribe” (consume more memory than the available RAM by using disk swap space) and make a node unresponsive or slow, which resulted in job termination anyway, and in certain cases led to node failure.

Regards

Alex

SLURM Cheat Sheet

ReqNodeNotAvail: HPC shutdown scheduled on 26 October 2019

Dear HPC users,

All HPC infrastructure will be switched off on 26 & 27 October (the weekend) due to power works in EEC. Hence your jobs will get the (ReqNodeNotAvail, UnavailableNodes:zeus[27,30-31,43,50,59,62,70-…) status. They will resume on Monday.

Regards

Alex Pedcenko

Launching OpenFOAM on the Windows 10 Linux Subsystem (WSL)

Check the OpenFOAM-for-Windows screencasts at https://livecoventryac.sharepoint.com/portals/hub/_layouts/15/PointPublishing.aspx?app=video&p=c&chid=0aedf551-986d-4d5a-bd3a-43007bda3f64&s=0&t=av

Alex Pedcenko

Built-in SSH Commands in Windows 10

Windows 10 now has native support for SSH. To activate it, go to “Manage optional features” and click “Add a feature” to add the OpenSSH client (if it is not there already).

This command-line ssh client can be used instead of PuTTY to connect to the HPC terminal.

Once enabled, you can just open a command prompt (WinKey+R, then type cmd <Enter>). In the console type “ssh zeus.coventry.ac.uk”, or “ssh yourhpcusername@zeus.coventry.ac.uk” if your Windows user name is not the same as your HPC user name.
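Optionally, if you connect often, you can store the host details in an OpenSSH config file so that “ssh zeus” is enough (a sketch; the user name is a placeholder, and on Windows the file lives at %USERPROFILE%\.ssh\config):

Host zeus
    HostName zeus.coventry.ac.uk
    User yourhpcusername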

Alex

Learning Linux Command line at LinkedIn Learning

https://www.linkedin.com/learning/learning-linux-command-line-2?trk=share_android_course_learning

 

Scheduled shutdown 13 August 2019

Power will be switched off in ECB on the morning of 13 August. The HPCs will be switched off in the early morning of 13/08/2019 (5:00 am). The systems will be powered back on after 7 am on the same day (provided there is power in the building).

Regards

Alex

 

Simple MPI “Hello World!” on HPC

How to compile and launch simple MPI code on HPC (a sketch is given below):
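A minimal sketch, assuming an MPI module is available on zeus (the module name “mpi” and the file names are assumptions; check “module avail” for the exact module):

# write a tiny MPI hello-world source file
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello World from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
EOF

module load mpi                         # assumed module name; check "module avail" for the exact one
mpicc hello_mpi.c -o hello_mpi          # compile with the MPI compiler wrapper
srun -N2 -n16 -t 00:05:00 ./hello_mpi   # 16 ranks over 2 nodes, for up to 5 minutes

If srun does not start the MPI ranks correctly with your MPI build, launch the program with mpirun from inside an sbatch or salloc allocation instead.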
 

 

Connecting to HPC terminal with Google Chrome

Instead of using PuTTY, you can also connect to the HPC terminal in Google Chrome using an SSH extension; see the example below:
