Category Archives: Quick Guides

Running Docker job on EPYC HPC

Running a Docker container as an HPC job on one node (you obviously can't run a single container across multiple compute nodes):

On the Epyc & Pluto HPCs you can run a Docker container as a slurm job, for example:

sbatch -N1 -n 8 -t 8:00:00 docker.slurm

where docker.slurm can contain, for example, the following incantations:
———————————————-

#!/bin/bash

container="ashael/hpl"
container_name=docker_${SLURM_JOB_ID}


docker pull ${container} # pull from docker hub

# we mount current working folder to container under /scratch
docker run -it -d --cpus="$SLURM_JOB_CPUS_PER_NODE" -v ${SLURM_SUBMIT_DIR}:/scratch --name ${container_name} ${container} # start container in the background (-d "detached" mode)

MY_USERID=$(id -u $USER)

CONTAINER_ID=$(docker ps -aqf "name=${container_name}")

echo my container id is ${CONTAINER_ID}

docker exec  ${CONTAINER_ID} useradd --uid ${MY_USERID} --home $HOME $USER
docker exec  ${CONTAINER_ID} chown -R $USER:$USER /scratch

docker exec  -u ${MY_USERID}:${MY_USERID} ${CONTAINER_ID} mkdir -p /scratch/$SLURM_JOB_ID

# run executable inside your container, output may be written to /scratch/$SLURM_JOB_ID/ inside the container to appear in your current folder
docker exec -u ${MY_USERID}:${MY_USERID} ${CONTAINER_ID} uname --all

# OR execute an existing script runme.sh from current folder ($SLURM_SUBMIT_DIR) mounted inside your container in /scratch, can also pass a jobid and CPU Nrs to it as parameters if needed:
docker exec -u ${MY_USERID}:${MY_USERID} ${CONTAINER_ID} /scratch/runme.sh $SLURM_JOB_ID $SLURM_JOB_CPUS_PER_NODE

# stop the container and clean up
docker stop $CONTAINER_ID
docker ps -a # list all containers' states
docker rm $CONTAINER_ID # clean up, remove the container

Note that the local job submission folder is "mounted" into the container by the -v ${SLURM_SUBMIT_DIR}:/scratch option of the docker run command. This is one way to transfer files between the container and your session (there are other ways as well).


Alternatively, while your container is running on the target node, you can ssh to the node and attach to the container’s terminal, allowing you to run interactive commands inside it:

docker attach docker_XXXXXX

where docker_XXXXXX is the container name or ID (which you can also get from the "docker ps -a" output on the node).
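A rough sequence for this, assuming your job ID is 123456 and the job landed on a node called nodeNN (both are placeholders):

squeue -u $USER               # find out which node your job is running on
ssh nodeNN                    # ssh to that node (placeholder hostname)
docker ps -a                  # list containers and their states
docker attach docker_123456   # attach to your container's terminal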

If you want to terminate the container from inside it, press Ctrl+D.

If you want to exit the "interactive" container session while leaving the container running, press Ctrl+p and then Ctrl+q.

To start an exited (but still existing) container docker_XXXXXX (check "docker ps -a"), run the following from the compute node:

docker start docker_XXXXXX

To stop the container from outside (from the compute node), do

docker stop docker_XXXXXX

When you have finished with your container job, please remove it from the node:

docker rm docker_XXXXXX

where docker_XXXXXX is the container name or ID (see “docker ps -a”)

At the end, always check “docker ps -a” from the target node to make sure all your containers are stopped and destroyed before you free up the compute node.

Accessing CU network drives while working from home

If you are outside the CU campus network (I bet you are now) but want to access the W-drive, H-drive or R-drive, you need to connect to the CU VPN first. Then open "This PC" (a.k.a. My Computer or Windows Explorer) and enter one of the following addresses in the address bar:


  • W-drive:   \\coventry.ac.uk\csv\Students\Shared\EC\STUDENT\
  • H-Drive Students: \\coventry.ac.uk\csv\Students\Personal  (then check each folder inside to find the one that contains your Documents folder; you won't be able to see any of the others)
  • H-Drive Staff: \\coventry.ac.uk\csv\Staff\Personal  (then check each folder inside to find the one that contains your Documents folder; you won't be able to see any of the others)
  • R-Drive: \\coventry.ac.uk\csv\Research

You will need to authenticate with your CU username in the format COVENTRY\yourusername and your CU password. If you want to make this "permanent", you can map these folders on your Windows 10 PC: right-click "This PC" -> More -> Map Network Drive -> enter one of the addresses above -> tick "Connect using different credentials" (if your PC is not a CU computer) -> enter the username and password as described above.

Running Parallel Python3 Jupyter Notebook on zeus HPC

This is an approach to launching a Jupyter notebook on a compute node of the EEC HPC. Normally you launch Jupyter locally and then open the associated web interface on your local machine. This is also possible on the HPC; however, because the compute nodes of the cluster are mostly accessible via CLI only and are not "exposed" to the network outside the HPC, one needs to tunnel through the headnode in order to reach them. The provided set of 2 HTA scripts simplifies the procedure: the first script submits the Jupyter job to the HPC's queuing system (slurm) using HTML forms and an ssh command-line tool (plink.exe from PuTTY). The second script (again using plink) establishes the ssh tunnel to the target node where the Jupyter server is running, and starts the default browser on the client machine, pointing it to the local port that the tunnel forwards to the client. Since the compute nodes of the HPC have multiple CPUs (some up to 32 cores), it is also shown that the Jupyter notebook can use IPython's ipcluster to run notebook code on parallel threads.
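The HTA scripts automate this for Windows clients using plink; from a Linux/macOS (or WSL) client the same idea with plain ssh looks roughly like the sketch below (the jupyter.slurm script, node name and port are assumptions, not site specifics):

# 1. on the headnode, submit a Jupyter job; jupyter.slurm is assumed to run something like
#    "jupyter notebook --no-browser --ip=0.0.0.0 --port=8888"
sbatch -N1 -n8 -t 4:00:00 jupyter.slurm

# 2. find out which compute node the job is running on
squeue -u $USER

# 3. from your own machine, open an ssh tunnel through the headnode to that node
ssh -L 8888:zeusNNN:8888 yourhpcusername@zeus.coventry.ac.uk

# 4. point your local browser at http://localhost:8888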

link to scripts on coventry github


Video:

https://web.microsoftstream.com/embed/video/cb944d42-8d10-4784-8407-6e53fbaf3cbe?autoplay=false&showinfo=true


05 Submitting jobs to different nodes

Since 2018 we have only one default slurm partition (queue) on zeus called “all”, see here

See also: examples of SLURM submission scripts

Submit to ‘Broadwell’ nodes

The hostnames of the Broadwell nodes are zeus[300-343], zeus[400-409] and zeus[500-501] (56 in total). They have 32 CPU-cores (Broadwell Xeon) and 128GB of RAM. There are several ways to use these nodes.

  • submit directly to a compute node by specifying its hostname (not recommended; only if you need that exact node for some reason, e.g. you have a reservation there). For example, to request zeus300, in your slurm script use
    #SBATCH -w zeus300
  • or request that particular node during submission
    sbatch -w zeus300 slurmscriptname.slurm
  • or, in your slurm submission script, request the constraint "broadwell" with whatever number of tasks you require. For example, we can request one task that has access to all 32 CPUs of one Broadwell node (e.g. to run SMP code):
    #SBATCH --constraint=broadwell
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32

    or request 32 individual tasks (will allocate one task per CPU)

    #SBATCH --constraint=broadwell
    #SBATCH -n32 -N1

    where -n32 is the total number of tasks (CPUs) requested over one node (-N1), not the number of CPUs per node. A complete script putting these directives together is sketched below.
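Putting these directives together, a minimal complete Broadwell submission script might look like the sketch below (the executable name is just a placeholder):

#!/bin/bash
#SBATCH --constraint=broadwell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH -t 4:00:00

# let an OpenMP/SMP program use all 32 allocated cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_smp_program    # placeholder executable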

Submit to 'Sandybridge' 12-CPU nodes

The GPU nodes of zeus (zeus[200-217]) have 12-core Sandybridge-family CPUs and NVIDIA K20 GPUs (2 per node). You can also use these nodes without the GPUs. To use them, do either of the following:

  • specify
    --constraint=sandy

    or request k20 GPUs:

    --gres=gpu:K20:N

    (where N is the number of GPUs you need)

    sbatch --constraint=sandy slurmscriptname.slurm
    sbatch -N1 --gres=gpu:K20:2 slurmscriptname.slurm
  • you can also request these particular nodes (zeus200…zeus217) to be allocated to your job, e.g. we can ask for 2 of these nodes:
    sbatch -w zeus[200-201] slurmscriptname.slurm


Submit to Nehalem (8-CPU) nodes

These are the old Nehalem-CPU-based nodes of the Zeus HPC, each having 8 CPU-cores and 48GB of RAM: zeus[20-91] (floor 3 of ECB) and zeus[100-171] (Mezz floor of ECB).
If you want to use specifically these nodes you can use

sbatch --constraint=nehalem slurmscriptname.slurm

when submitting your job with sbatch or srun, or specify it in your slurm sbatch script:

#SBATCH --constraint=nehalem

You can also specifically request particular nodes (e.g. zeus34 and zeus56…zeus72 in the example below).

sbatch -w zeus[34,56-72] slurmscriptname.slurm

An important thing to remember is that these nodes cannot allocate more than 8 tasks per node.

Request nodes with GPU processors

  • In your slurm submission script specify (e.g. to ask for 2 GPUs of any kind)
    #SBATCH --gres=gpu:2
  • OR request a particular kind of GPU (K20 or K80) in your slurm submission script (the example below asks for 1 K80 GPU):
    #SBATCH --gres=gpu:k80:1
  • or e.g. 2 K20 GPUs on one node in exclusive mode (the node will not be shared with other jobs, even if it still has free resources, i.e. CPUs or GPUs not used by your job):
    sbatch --gres=gpu:k20:2 -N1 --exclusive slurmscriptname.sh
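For reference, a minimal GPU job script along these lines might look like the sketch below (the executable name is a placeholder, and it assumes the NVIDIA tools are on the PATH):

#!/bin/bash
#SBATCH -N1
#SBATCH --gres=gpu:k20:1
#SBATCH -t 2:00:00

nvidia-smi            # show the GPU(s) allocated to the job
./my_gpu_program      # placeholder executable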

Launching job without slurm submission script

  • You can run any executable, e.g. "myexecutable", on zeus's nodes without any slurm submission scripts (the job will still be placed in the queue while it is running). Here we request 2 nodes and a total of 12 CPUs for 10 minutes:
  • srun -t 00:10:00 -N2 -n12 ./myexecutable

    Note that when submitting a job with "srun", the command line will not be "released" until the job completes and the output is produced on STDOUT (unless specified otherwise). This approach is suitable if you expect your "job" to complete very quickly. This way you can also see the direct output of your executable in the terminal.

  • You can also reserve nodes for a specified time without launching any task on them (i.e. it will launch an idle interactive bash session). The example below asks for 2 Broadwell nodes (32 CPUs/node, in exclusive mode, i.e. no resource sharing with other jobs) for 12 hours:
    salloc -N2 --constraint=broadwell --exclusive -t 12:00:00

    Similarly for GPUs, e.g. one Sandybridge node with 2 K20 GPUs for 30 minutes:

    salloc -N1 --gres=gpu:K20:2 --exclusive -t 30

    After submission of the "salloc" job, your console will start a dedicated bash session in interactive mode, which will have all allocated resources available. When you leave that bash session (Ctrl-D or "exit"), your "salloc" job/allocation will be terminated in the queue. To avoid this (job termination on exiting the bash session) you can use the "screen" utility, as sketched at the end of this section.

    For all possible “constraints”, see here: http://zeus.coventry.ac.uk/wordpress/?p=1094

  • Note that in any of the cases described above, if you do not specify the time required for your job, it will be assigned the default 4 hours, after which the job will be terminated. You can specify the time limit with the "-t" flag (below we ask for 8 CPUs anywhere on the cluster for 8 hours):
sbatch -n8 -t 8:00:00 slurmscriptname.slurm
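A rough sketch of the "screen" workflow mentioned above (the session name is arbitrary):

# on the headnode, start a named screen session
screen -S myalloc

# inside the screen session, request your allocation as usual
salloc -N2 --constraint=broadwell --exclusive -t 12:00:00

# detach from the screen session with Ctrl-A then d; the allocation keeps running
# later, reattach to it from the headnode with:
screen -r myalloc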

Memory requirements when submitting a job

Dear All

To prevent HPC users from bringing down compute nodes, memory limits have been introduced on zeus HPC.

If your jobs are not that memory-hungry, you probably will not notice this at all. By "memory-hungry" we mean exceeding 4GB per CPU-core (the default value).

If your job requires more than that, you can request more memory using the --mem=... or --mem-per-cpu=... [MB] parameter with sbatch.

E.g.

1) ask for “full” 48 GB of memory to be available to your job e.g. on Nehalem (8-CPU) nodes:

sbatch -n8 -N1 --mem=48000 -t 8:00:00 myslurmscript.slurm

In this case one node is requested, 8 tasks (CPUs) and total job memory is 48GB (48000 MB)

If this is not specified, the maximum memory would be 4GB x 8 CPUs = 32GB.

2) If you are using the 32-CPU Broadwell nodes, which have 128GB of RAM, the default value of 4GB/CPU is already the maximum possible. If you want more RAM per CPU, e.g. if you use only 2 CPU-tasks but need all the node memory allocated to them, you can do:

sbatch -n2 -N1 --mem-per-cpu=64000 ...

3) If you need more than 128GB per node, you can use the SMP node (zeus15, max 512GB/node) by requesting --constraint=smp when submitting the job.
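For example (the memory figure here is just an illustration; anything up to the node's 512GB should fit):

sbatch -n8 -N1 --constraint=smp --mem=256000 -t 8:00:00 myslurmscript.slurm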

If you request an amount of memory exceeding what is physically available, the job will fail to submit with error messages such as "error: Memory specification can not be satisfied" or "error: Unable to allocate resources: Requested node configuration is not available".

If your job exceeds the memory you requested (or the default 4GB/CPU) during the run, slurm will terminate the job.

Before this measure, it was possible to "oversubscribe" (consume more memory than the available RAM by using disk swap space) and make a node unresponsive/slow, which resulted in job termination anyway, and in certain cases led to node failure.

Regards

Alex

SLURM Cheat Sheet

SLURM Cheat Sheet

Built-in SSH Commands in Windows 10

Windows 10 now has native support for SSH. To activate it, go to "Manage optional features" and click "Add a feature" to add the OpenSSH Client (if it is not there already).

This command line ssh client can be used instead of PuTTY to connect to HPC terminal.

Once enabled, you can just open a command prompt (WinKey+R, type cmd, press Enter). In the console type "ssh zeus.coventry.ac.uk", or "ssh yourhpcusername@zeus.coventry.ac.uk" if your Windows user name is not the same as your HPC user name.

Alex

Learning Linux Command line at LinkedIn Learning

https://www.linkedin.com/learning/learning-linux-command-line-2?trk=share_android_course_learning


Simple MPI “Hello World!” on HPC

How to compile and launch simple MPI code on HPC:
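A rough sketch of the typical workflow on the cluster, assuming an MPI toolchain module is available and your source file is hello_mpi.c (the module and file names are assumptions):

# on the headnode: load an MPI toolchain and compile
module load mpi
mpicc hello_mpi.c -o hello_mpi

# launch as a slurm job on 2 nodes with 16 MPI ranks in total, for 10 minutes
srun -N2 -n16 -t 00:10:00 ./hello_mpi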

Connecting to HPC terminal with Google Chrome

Instead of using PuTTY, you can also connect to the HPC terminal in Google Chrome using an SSH extension; see the example below:
