Category Archives: Quick Guides

Running Docker job on EPYC HPC

Running a Docker container as an HPC job on one node (you obviously can't run a single container across multiple compute nodes):

On the Epyc & Pluto HPCs you can run a Docker container as a slurm job, for example:

sbatch -N1 -n 8 -t 8:00:00 docker.slurm

where docker.slurm can contain, for example, the following incantations:
———————————————-

#!/bin/bash

container="ashael/hpl"
container_name=docker_${SLURM_JOB_ID}


docker pull ${container} # pull from docker hub

# we mount current working folder to container under /scratch
docker run -it -d --cpus="$SLURM_JOB_CPUS_PER_NODE" -v ${SLURM_SUBMIT_DIR}:/scratch --name ${container_name} ${container} # start container in the background (-d "detached" mode)

MY_USERID=$(id -u $USER)

CONTAINER_ID=$(docker ps -aqf "name=${container_name}")

echo my container id is ${CONTAINER_ID}

docker exec  ${CONTAINER_ID} useradd --uid ${MY_USERID} --home $HOME $USER
docker exec  ${CONTAINER_ID} chown -R $USER:$USER /scratch

docker exec  -u ${MY_USERID}:${MY_USERID} ${CONTAINER_ID} mkdir -p /scratch/$SLURM_JOB_ID

# run executable inside your container, output may be written to /scratch/$SLURM_JOB_ID/ inside the container to appear in your current folder
docker exec -u ${MY_USERID}:${MY_USERID} ${CONTAINER_ID} uname --all

# OR execute an existing script runme.sh from current folder ($SLURM_SUBMIT_DIR) mounted inside your container in /scratch, can also pass a jobid and CPU Nrs to it as parameters if needed:
docker exec -u ${MY_USERID}:${MY_USERID} ${CONTAINER_ID} /scratch/runme.sh $SLURM_JOB_ID $SLURM_JOB_CPUS_PER_NODE

# stop the container and clean up
docker stop $CONTAINER_ID
docker ps -a # list all containers' states
docker rm $CONTAINER_ID # clean up, remove the container

Note that the local job submission folder is "mounted" into the container by the -v ${SLURM_SUBMIT_DIR}:/scratch option of the docker run command. This is one way to transfer files between the container and your session (there are other ways as well).


Alternatively, while your container is running on the target node, you can ssh to the node and attach to the container’s terminal, allowing you to run interactive commands inside it:

docker attach docker_XXXXXX

where docker_XXXXXX is the container name or ID (which you can also get from the "docker ps -a" output on the node).
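A rough sequence for this, assuming your job ID is 123456 and the job landed on a node called nodeNN (both are placeholders):

squeue -u $USER               # find out which node your job is running on
ssh nodeNN                    # ssh to that node (placeholder hostname)
docker ps -a                  # list containers and their states
docker attach docker_123456   # attach to your container's terminal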

If you want to terminate the container from inside it, press Ctrl+D.

If you want to exit the "interactive" container session while leaving the container running, press Ctrl+p and then Ctrl+q.

To start an exited (but still existing) container docker_XXXXXX (check "docker ps -a"), run the following from the compute node:

docker start docker_XXXXXX

To stop the container from outside (from the compute node), do

docker stop docker_XXXXXX

When you have finished with your container job, please remove it from the node:

docker rm docker_XXXXXX

where docker_XXXXXX is the container name or ID (see “docker ps -a”)

At the end, always check “docker ps -a” from the target node to make sure all your containers are stopped and destroyed before you free up the compute node.

Accessing CU network drives while working from home

If you are outside the CU campus network (I bet you are now) but want to access the W-drive, H-drive or R-drive, you need to connect to the CU VPN first. Then open "This PC" (a.k.a. My Computer or Windows Explorer) and enter one of the following addresses in the address bar:


  • W-drive:   \\coventry.ac.uk\csv\Students\Shared\EC\STUDENT\
  • H-Drive Students: \\coventry.ac.uk\csv\Students\Personal  (then check each folder inside to find the one that contains your Documents folder; you won't be able to see any of the others)
  • H-Drive Staff: \\coventry.ac.uk\csv\Staff\Personal  (then check each folder inside to find the one that contains your Documents folder; you won't be able to see any of the others)
  • R-Drive: \\coventry.ac.uk\csv\Research

You will need to authenticate with your CU username in the format COVENTRY\yourusername and your CU password. If you want to make this "permanent", you can map these folders on your Windows 10 PC: right-click "This PC" -> More -> Map Network Drive -> enter one of the addresses above -> tick "Connect using different credentials" (if your PC is not a CU computer) -> enter the username and password as described above.

Running Parallel Python3 Jupyter Notebook on zeus HPC

This is an approach to launching a Jupyter notebook on a compute node of the EEC HPC. Normally you launch Jupyter locally and then open the associated web interface on your local machine. This is also possible on the HPC; however, because the compute nodes of the cluster are mostly accessible via CLI only and are not "exposed" to the network outside the HPC, one needs to tunnel through the headnode in order to reach them. The provided set of 2 HTA scripts simplifies the procedure: the first script submits the Jupyter job to the HPC's queuing system (slurm) using HTML forms and an ssh command-line tool (plink.exe from PuTTY). The second script (again using plink) establishes the ssh tunnel to the target node where the Jupyter server is running, and starts the default browser on the client machine, pointing it to the local port that the tunnel forwards to the client. Since the compute nodes of the HPC have multiple CPUs (some up to 32 cores), it is also shown that the Jupyter notebook can use IPython's ipcluster to run notebook code on parallel threads.
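The HTA scripts automate this for Windows clients using plink; from a Linux/macOS (or WSL) client the same idea with plain ssh looks roughly like the sketch below (the jupyter.slurm script, node name and port are assumptions, not site specifics):

# 1. on the headnode, submit a Jupyter job; jupyter.slurm is assumed to run something like
#    "jupyter notebook --no-browser --ip=0.0.0.0 --port=8888"
sbatch -N1 -n8 -t 4:00:00 jupyter.slurm

# 2. find out which compute node the job is running on
squeue -u $USER

# 3. from your own machine, open an ssh tunnel through the headnode to that node
ssh -L 8888:zeusNNN:8888 yourhpcusername@zeus.coventry.ac.uk

# 4. point your local browser at http://localhost:8888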

link to scripts on coventry github


Video:

https://web.microsoftstream.com/embed/video/cb944d42-8d10-4784-8407-6e53fbaf3cbe?autoplay=false&showinfo=true


05 Submitting jobs to different nodes

Since 2018 we have only one default slurm partition (queue) on zeus called “all”, see here

See also: examples of SLURM submission scripts

Submit to ‘Broadwell’ nodes

The hostnames of the Broadwell nodes are zeus[300-343], zeus[400-409] and zeus[500-501] (56 in total). They have 32 CPU-cores (Broadwell Xeon) and 128GB of RAM. There are several ways to use these nodes.

  • submit directly to a compute node by specifying its hostname (not recommended; only if you need that exact node for some reason, e.g. you have a reservation there). For example, to request zeus300, in your slurm script use
    #SBATCH -w zeus300
  • or request that particular node during submission
    sbatch -w zeus300 slurmscriptname.slurm
  • or, in your slurm submission script, request the constraint "broadwell" with whatever number of tasks you require. For example, we can request one task that has access to all 32 CPUs of one Broadwell node (e.g. to run SMP code):
    #SBATCH --constraint=broadwell
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32

    or request 32 individual tasks (will allocate one task per CPU)

    #SBATCH --constraint=broadwell
    #SBATCH -n32 -N1

    where -n32 is the total number of tasks (CPUs) requested over one node (-N1), not the number of CPUs per node. A complete script putting these directives together is sketched below.
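Putting these directives together, a minimal complete Broadwell submission script might look like the sketch below (the executable name is just a placeholder):

#!/bin/bash
#SBATCH --constraint=broadwell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH -t 4:00:00

# let an OpenMP/SMP program use all 32 allocated cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_smp_program    # placeholder executable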

Submit to 'Sandybridge' 12-CPU nodes

The GPU nodes of zeus (zeus[200-217]) have 12-core Sandybridge-family CPUs and NVIDIA K20 GPUs (2 per node). You can also use these nodes without the GPUs. To use them, do either of the following:

  • specify
    --constraint=sandy

    or request k20 GPUs:

    --gres=gpu:K20:N

    (where N is the number of GPUs you need)

    sbatch --constraint=sandy slurmscriptname.slurm
    sbatch -N1 --gres=gpu:K20:2 slurmscriptname.slurm
  • you can also request these particular nodes (zeus200…zeus217) to be allocated to your job, e.g. we can ask for 2 of these nodes:
    sbatch -w zeus[200-201] slurmscriptname.slurm


Submit to Nehalem (8-CPU) nodes

These are the old Nehalem-CPU-based nodes of the Zeus HPC, each having 8 CPU-cores and 48GB of RAM: zeus[20-91] (floor 3 of ECB) and zeus[100-171] (Mezz floor of ECB).
If you want to use specifically these nodes you can use

sbatch --constraint=nehalem slurmscriptname.slurm

when submitting your job with sbatch or srun, or specify it in your slurm sbatch script:

#SBATCH --constraint=nehalem

You can also specifically request particular nodes (e.g. zeus34 and zeus56…zeus72 in the example below).

sbatch -w zeus[34,56-72] slurmscriptname.slurm

An important thing to remember is that these nodes cannot allocate more than 8 tasks per node.

Request nodes with GPU processors

  • In your slurm submission script specify (e.g. to ask for 2 GPUs of any kind)
    #SBATCH --gres=gpu:2
  • OR request a particular kind of GPU (K20 or K80) in your slurm submission script (the example below asks for 1 K80 GPU):
    #SBATCH --gres=gpu:k80:1
  • or e.g. 2 K20 GPUs on one node in exclusive mode (the node will not be shared with other jobs, even if it still has free resources, i.e. CPUs or GPUs not used by your job):
    sbatch --gres=gpu:k20:2 -N1 --exclusive slurmscriptname.sh
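For reference, a minimal GPU job script along these lines might look like the sketch below (the executable name is a placeholder, and it assumes the NVIDIA tools are on the PATH):

#!/bin/bash
#SBATCH -N1
#SBATCH --gres=gpu:k20:1
#SBATCH -t 2:00:00

nvidia-smi            # show the GPU(s) allocated to the job
./my_gpu_program      # placeholder executable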

Launching job without slurm submission script

  • You can run any executable, e.g. "myexecutable", on zeus's nodes without any slurm submission scripts (the job will still be placed in the queue while it is running). Here we request 2 nodes and a total of 12 CPUs for 10 minutes:
  • srun -t 00:10:00 -N2 -n12 ./myexecutable

    Note that when submitting a job with "srun", the command line will not be "released" until the job completes and the output is produced on STDOUT (unless specified otherwise). This approach is suitable if you expect your "job" to complete very quickly. This way you can also see the direct output of your executable in the terminal.

  • You can also reserve nodes for a specified time without launching any task on them (i.e. it will launch an idle interactive bash session). The example below asks for 2 Broadwell nodes (32 CPUs/node, in exclusive mode, i.e. no resource sharing with other jobs) for 12 hours:
    salloc -N2 --constraint=broadwell --exclusive -t 12:00:00

    Similarly for GPUs, e.g. one Sandybridge node with 2 K20 GPUs for 30 minutes:

    salloc -N1 --gres=gpu:K20:2 --exclusive -t 30

    After submission of the "salloc" job, your console will start a dedicated bash session in interactive mode, which will have all allocated resources available. When you leave that bash session (Ctrl-D or "exit"), your "salloc" job/allocation will be terminated in the queue. To avoid this (job termination on exiting the bash session) you can use the "screen" utility, as sketched at the end of this section.

    For all possible “constraints”, see here: http://zeus.coventry.ac.uk/wordpress/?p=1094

  • Note that in any of the cases described above, if you do not specify the time required for your job, it will be assigned the default 4 hours, after which the job will be terminated. You can specify the time limit with the "-t" flag (below we ask for 8 CPUs anywhere on the cluster for 8 hours):
sbatch -n8 -t 8:00:00 slurmscriptname.slurm
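A rough sketch of the "screen" workflow mentioned above (the session name is arbitrary):

# on the headnode, start a named screen session
screen -S myalloc

# inside the screen session, request your allocation as usual
salloc -N2 --constraint=broadwell --exclusive -t 12:00:00

# detach from the screen session with Ctrl-A then d; the allocation keeps running
# later, reattach to it from the headnode with:
screen -r myalloc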

Memory requirements when submitting a job

Dear All

To prevent HPC users from bringing down compute nodes, memory limits have been introduced on zeus HPC.

If your jobs are not that memory-hungry, you probably will not notice this at all. By "memory-hungry" we mean exceeding 4GB per CPU-core (the default value).

If your job requires more than that, you can request more memory using the --mem=... or --mem-per-cpu=... [MB] parameter with sbatch.

E.g.

1) ask for “full” 48 GB of memory to be available to your job e.g. on Nehalem (8-CPU) nodes:

sbatch -n8 -N1 --mem=48000 -t 8:00:00 myslurmscript.slurm

In this case one node is requested, 8 tasks (CPUs) and total job memory is 48GB (48000 MB)

If this is not specified, the maximum memory would be 4GB x 8 CPUs = 32GB.

2) If you are using the 32-CPU Broadwell nodes, which have 128GB of RAM, the default value of 4GB/CPU is already the maximum possible. If you want more RAM per CPU, e.g. if you use only 2 CPU-tasks but need all the node memory allocated to them, you can do:

sbatch -n2 -N1 --mem-per-cpu=64000 ...

3) If you need more than 128GB per node, you can use the SMP node (zeus15, max 512GB/node) by requesting --constraint=smp when submitting the job.
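For example (the memory figure here is just an illustration; anything up to the node's 512GB should fit):

sbatch -n8 -N1 --constraint=smp --mem=256000 -t 8:00:00 myslurmscript.slurm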

If you request an amount of memory exceeding what is physically available, the job will fail to submit with error messages such as "error: Memory specification can not be satisfied" or "error: Unable to allocate resources: Requested node configuration is not available".

If your job exceeds the memory you requested (or the default 4GB/CPU) during the run, slurm will terminate the job.

Before this measure, it was possible to "oversubscribe" (consume more memory than the available RAM by using disk swap space) and make a node unresponsive/slow, which resulted in job termination anyway, and in certain cases led to node failure.

Regards

Alex

SLURM Cheat Sheet

SLURM Cheat Sheet

Built-in SSH Commands in Windows 10

Windows 10 now has native support for SSH. To activate it, go to "Manage optional features" and click "Add a feature" to add the OpenSSH Client (if it is not there already).

This command line ssh client can be used instead of PuTTY to connect to HPC terminal.

Once enabled, you can just open a command prompt (WinKey+R, type cmd, press Enter). In the console type "ssh zeus.coventry.ac.uk", or "ssh yourhpcusername@zeus.coventry.ac.uk" if your Windows user name is not the same as your HPC user name.

Alex

Learning Linux Command line at LinkedIn Learning

https://www.linkedin.com/learning/learning-linux-command-line-2?trk=share_android_course_learning


Simple MPI “Hello World!” on HPC

How to compile and launch simple MPI code on HPC:
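A rough sketch of the typical workflow on the cluster, assuming an MPI toolchain module is available and your source file is hello_mpi.c (the module and file names are assumptions):

# on the headnode: load an MPI toolchain and compile
module load mpi
mpicc hello_mpi.c -o hello_mpi

# launch as a slurm job on 2 nodes with 16 MPI ranks in total, for 10 minutes
srun -N2 -n16 -t 00:10:00 ./hello_mpi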

Connecting to HPC terminal with Google Chrome

Instead of using PuTTY, you can also connect to the HPC terminal in Google Chrome using an SSH extension; see the example below:
