Category Archives: HPC Announcements

SULIS expression of interest

Dear Colleagues,

Coventry University entered into an equipment bid with the Midlands+ group of universities for a Tier 2 HPC cluster. This bid was successful, and the new cluster, called SULIS, is due to go into production. As one of the consortium members in the bid, we have an allocation of core hours and GPU hours on the cluster. The cluster consists of 25,216 AMD EPYC compute cores configured as 167 dual-processor CPU compute nodes, plus 30 nodes each equipped with three Nvidia A100 40GB GPUs. A proportion of the cluster's time is reserved for the EPSRC to allocate outside the consortium, but consortium members are also allowed to bid for the EPSRC-allocated time.

 

SULIS will be going live on Monday 1st November, and we are now inviting expressions of interest. Please use the form linked here to indicate your interest in using the cluster, the time required on the cluster and your proposed usage – https://forms.office.com/r/WKpfznnJte

 

If you have any queries, please email HPC.MPCS@coventry.ac.uk

 

Best,

 

Damien Foster BSc (hons) (Edin) DPhil (Oxon) FIMA, MINSTP

Professor of Statistical Physics

Centre Director

Centre for Computational Science and Mathematical Modelling

Faculty of Engineering, Environment and Computing

Coventry University

Coventry CV1 5FB

 

T: 02477 659245 | M: 0797 498 4977 | E: ab5651@coventry.ac.uk

Power outage in ECB, 5 August 2021

Due to a power outage in the early morning of 5 August, all HPC infrastructure in ECB was powered off.

By the evening of the same day the power problems seemed to be resolved; however, the air-conditioning units in the Zeus HPC room are not working properly and one unit has failed. Until this is resolved, only part of the Zeus HPC compute nodes will be back online: compute nodes zeus[100…171] and zeus[200-217], which reside in a different room unaffected by the air-conditioning failure. I'll bring the rest of the compute nodes (the Broadwells and the other half of the Nehalems) back online as soon as the cooling problem is solved. Contractors are working on the fix.

Regards

Alex Pedcenko

Zeus HPC GPU usage survey

Dear All,

 

We are carrying out a survey of our HPC GPU (graphics processor) usage, which will help us to understand and plan for future upgrades of the hardware. If you have been using GPUs on Zeus HPC, or would like to use them in future, please complete a very short survey at the following link:

https://forms.office.com/Pages/ResponsePage.aspx?id=mqsYS2U3vkqsfA4NOYr9T9aUY6sEesBHsORvbIuKaq5UQUU0UjVWT1VZVko1WVBGNlFISElSVzdPMy4u

Many thanks

Alex Pedcenko

Folding@Home — Coventry HPC got into the top 1,500 “donors”

After about a year of operation within the Folding@Home project, we have reached the top 1,500 (out of 225,885).

See https://stats.foldingathome.org/team/259515 

 

 

Alex

Memory requirements when submitting a job

Dear All

To prevent users from bringing compute nodes down, memory limits have been introduced on Zeus HPC.

If your jobs are not that memory-hungry, you probably will not notice this at all. By “memory-hungry” we mean exceeding 4 GB per CPU core (the default value).

If your job requires more than that, you can request more memory using the --mem=... or --mem-per-cpu=... [MB] parameter with sbatch.

E.g.

1) Ask for the “full” 48 GB of memory to be available to your job, e.g. on the Nehalem (8-CPU) nodes:

sbatch -n8 -N1 --mem=48000 -t 8:00:00 myslurmscript.slurm

In this case one node and 8 tasks (CPUs) are requested, and the total job memory is 48 GB (48000 MB).

If the memory is not specified, the maximum available to the job would be 4 GB x 8 CPUs = 32 GB.

2) If you are using the 32-CPU Broadwell nodes, which have 128 GB of RAM, the default value of 4 GB/CPU is already the maximum possible. If you want more RAM per CPU, e.g. if you use only 2 CPU tasks but need all of the node's memory allocated to them, you can do:

sbatch -n2 -N1 --mem-per-cpu=64000 ...

3) If you need more than 128 GB per node, you can use the SMP node (zeus15, max 512 GB/node) by requesting --constraint=smp when submitting the job.
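For example, a request for a large-memory job on the SMP node might look like the sketch below (the task count, memory figure and time limit are only illustrative; adjust them to your own job):

sbatch -n4 -N1 --constraint=smp --mem=256000 -t 24:00:00 myslurmscript.slurm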

If you request an amount of memory exceeding what is physically available, the job will fail to submit with an error message such as “error: Memory specification can not be satisfied” or “error: Unable to allocate resources: Requested node configuration is not available”.

If the memory you requested for your job (or the default 4 GB/CPU if you left it unset) is exceeded during the run, Slurm will terminate the job.
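If you are unsure how much memory a finished job actually used (and therefore what to request next time), Slurm's accounting tools can help. A minimal sketch, assuming job accounting is enabled on Zeus and JOBID is your job's ID:

sacct -j JOBID --format=JobID,ReqMem,MaxRSS,State

Here ReqMem shows what was requested and MaxRSS shows the peak memory actually used by each job step.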

Before this measure, it was possible to “oversubscribe” (consume more memory than the available RAM by using disk swap space) and make a node unresponsive or slow, which resulted in job termination anyway and in certain cases led to node failure.

Regards

Alex

ReqNodeNotAvail: HPC shutdown scheduled on 26 October 2019

Dear HPC users,

All HPC infrastructure will be switched off on 26 & 27 October (the weekend) due to power works in EEC. Hence your jobs will get the (ReqNodeNotAvail, UnavailableNodes:zeus[27,30-31,43,50,59,62,70-….) status. They will resume on Monday.

Regards

Alex Pedcenko

Scheduled shutdown 13 August 2019

Power will be switched off in ECB on the morning of 13 August. The HPCs will be switched off in the early morning of 13/08/2019 (5:00 am). The systems will be powered back on after 7 am on the same day (provided there is power in the building).

Regards

Alex

 

Secure Shell extension for Google Chrome!

A terminal emulator and SSH client is now available as a Google Chrome browser extension: run your SSH session to the HPC from Chrome!

Secure Shell is an xterm-compatible terminal emulator and stand-alone SSH client for Chrome. It uses Native Client to connect directly to SSH servers without the need for external proxies.

https://chrome.google.com/webstore/detail/secure-shell-extension/iodihamcpbpeioajjeobimgagajmlibd

Alex

02 GUI HPC VNC jobs for Windows Clients

To run GUI jobs on an HPC compute node (rather than on the login node), you can use my new Windows HTA scripts from here: aa3025/hpc (coventry.ac.uk). They will also work via the University VPN (AnyConnect).

##### optional step – usually already done when your HPC account is created ########

Prior to all of the following, you need to set your VNC password on the HPC (you can also change your VNC password this way):

  • login via ssh to HPC,
  • issue the command “vncpasswd” to set or change your VNC password (the VNC password can be different from your HPC password, but it makes life easier if it is the same). Use this password for connecting to your VNC sessions in future.
####################################################

You can use these “scripts” to launch a VNC (Remote Desktop) job on one of the HPC compute nodes if you need access to a GUI (Graphical User Interface).

These scripts allow you to establish a VNC Desktop session on a compute node of the HPC (see also the video screencast here, or here):

0) Download the whole distribution from aa3025/hpc (coventry.ac.uk). You need to download at least 2 folders: “tools” and either “zeus” or “pluto”, depending on which HPC you use. You can use git to synchronise the whole distribution:

git clone https://github.coventry.ac.uk/aa3025/hpc.git
Alternatively, you can download the zip archive from HERE.

1) Depending on which HPC you are about to use, choose the zeus or pluto folder to launch a VNC session on the required HPC. Before proceeding further, check that the HPC has free (idle) compute nodes available (run “sinfo” in the HPC console, or check the Zeus website or Pluto website).

2) Launch only one script: 01_START_VNC_Session_….._1_node.hta (by double-clicking on it), a.k.a. the “submission app”.

3) Fill in your HPC username, password and the rest of the info, then hit the “Submit” button; the script will then submit the “VNC” job to the HPC queue in order to start a “vncserver” process [Remote Desktop session] on the assigned compute node.

4) Wait for your VNC job to start in the HPC queue (you will see the job listing in the newly opened black “plink” console window), and note the target node name (it is also shown in the last column of the small table at the bottom of the submission app).

5) Once the VNC job has started (“Running” state, “R”) in the black console, you can close the console. Make sure your VNC job really is in the Running state (also indicated by “R” against your job ID in the table at the bottom of the submission app); if it is still in the “PD” (“pending”) state, it is still waiting for resources to be allocated.

6) Once you have closed the first (job submission) app and the black console window, the second script will start automatically to get the ssh tunnel running between your PC and the allocated compute node and will connect you to the node's VNC session: fill in the target node name from the previous step and your HPC password. The VNC screen number is always :1 (you can't change it). A manual equivalent of this tunnel is sketched after step 8.

7) Now the VNC password window will pop up: enter your VNC password there (it can be different from your HPC password). If you do not have a VNC password, return to the top of this page (“Prior to all of the following…”) and set one up.

8) If all went well, the VNC Remote Desktop window will be presented. You can toggle it to full screen by pressing F8 on your keyboard (the F8 menu offers some other options as well).
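For reference, the ssh tunnel that the second script sets up in step 6 can also be created by hand with any ssh client. This is only a rough sketch: zeusNNN is a placeholder for the node name you noted in step 4, the login node address is an assumption, and VNC display :1 is assumed to listen on the usual port 5901:

ssh -L 5901:zeusNNN:5901 your_username@zeus.coventry.ac.uk

You would then point a VNC viewer at localhost:5901 and enter your VNC password as in step 7.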

By default the VNC job will last for 24 hours and will request 1 whole node; you can select a different time limit, fewer CPUs than a whole node, or more than one compute node to be allocated.

More than one compute node: if you specify more than one node in the VNC job submission app, the VNC Desktop session will start on the first allocated node, but the resources of all the other allocated nodes will still be available from your VNC Desktop session.

VNC job termination

When the Slurm VNC job expires (i.e. the time limit of your job, normally 24 hours if you did not change it, is reached), your VNC Desktop session will be terminated.

However, if you wish to terminate the VNC job before it expires (you no longer use it and want to free the HPC resources), you can either kill it from the ssh console with the “scancel JOBID” command or select “Logout” in your VNC Desktop session; either way your Slurm job with the VNC session will terminate.
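For example, a typical sequence from the ssh console looks like this (JOBID is the job ID that squeue reports for your VNC job):

squeue -u $USER
scancel JOBID

The first command lists your own jobs so you can find the VNC job's ID; the second removes it and frees the node.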

If you just close the VNC viewer window on your PC, the VNC Desktop session will remain running until its time limit is reached or you cancel the VNC job. Make sure you do not waste HPC resources by leaving idle VNC session jobs hanging in the queue!

Alex Pedcenko

Update on queues of Zeus HPC

Queuing model UPDATE June 2018

To simplify the usage of different queues, we have combined all nodes into a single default queue (Slurm partition) “all”. The usage limits are now solely user-based: each user has a default (for now) number of CPU*minutes that they can use at any moment in time (subject to available resources). If this number of CPU*minutes is reached, new jobs from that user will be queued until their running jobs free up resources. This is independent of the type of compute nodes. During this initial stage we will try to adapt the default CPU*minutes allowance to suit better and more effective HPC usage. The simple principle behind this is that a user can use more CPU cores for less time, or fewer CPU cores for a longer time. The run time of the job to be submitted is determined by the value you set in the --time or -t parameter during the submission of the job (e.g. -t 24:00:00).

If you require a particular type of compute node (CPU/GPU/Phi etc.), this can be specified in the submission script or during submission with the sbatch command, using the additional “constraint” parameter:

  • for the 56 Intel Broadwell-based nodes (128 GB RAM each) with 32 CPU cores, specify --constraint=broadwell
  • for the 144 Intel Nehalem-based nodes (48 GB RAM each) with 8 CPU cores, specify --constraint=nehalem
  • for the 18 Intel SandyBridge-based nodes (48 GB RAM each) with 12 CPU cores, specify --constraint=sandy
  • for the 1 x 32-CPU, 512 GB RAM SMP node, ask for --constraint=smp
  • for the 10 nodes with 2 NVidia Kepler K20 GPUs each, ask for --gres=gpu:K20:N (where N is the number of GPUs needed, max 2 GPUs/node)
  • for the 18 nodes with 2 NVidia Kepler K80 GPUs each, ask for --gres=gpu:K80:N (where N is the number of GPUs needed, max 2 GPUs/node)
  • for N Intel Phi cards, ask for --gres=mic:N or --constraint=phi.

For more details on Zeus’s CPUs and nodes see this post: http://zeus.coventry.ac.uk/wordpress/?p=336

If you have no particular preference on the type of CPU or compute node and are running a parallel job, please specify ONLY the TOTAL number of CPUs required, NOT the number of nodes: SLURM will assign the nodes automatically.

e.g. if I need 64 CPUs in total for 24 hours on whatever nodes are available, I submit my slurm script with:

sbatch -n 64 -t 24:00:00 myslurmscriptname.slurm

if I need 64 CPUs in total for 48 hours on Broadwell-based nodes (32 CPUs/node), I submit my slurm script with:

sbatch -n 64 -t 48:00:00 --constraint=broadwell myslurmscriptname.slurm

Finally, if I want 2 GPU nodes with 1 NVidia Kepler K80 GPU per node (2 GPUs in total) and 2 CPUs on each node for 36 hours, I do something like:

sbatch -N2 --ntasks-per-node=1 --cpus-per-task=2 --gres=gpu:K80:1 -t 36:00:00 mygpuscript.slurm

Certainly some variations of these sbatch commands are possible, and these flags can also be specified inside the slurm submission script itself. For the full list of possible sbatch options see the slurm docs: https://slurm.schedmd.com/sbatch.html
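As an illustration, the Broadwell example above could instead be written as a submission script, so that a plain “sbatch myslurmscriptname.slurm” is enough. This is only a sketch, and the program on the last line is a placeholder for your own executable:

#!/bin/bash
#SBATCH -n 64
#SBATCH -t 48:00:00
#SBATCH --constraint=broadwell
srun ./my_parallel_program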

 

Alex Pedcenko
