Queuing model UPDATE June 2018
To simplify the usage of different queues, we have combined all nodes into a single default queue (SLURM partition), “all”. The usage limits are now purely user-based: each user has a default (for now) number of CPU*minutes that they can use at any given moment. Once this number of CPU*minutes is reached, new jobs from that user are held in the queue until their running jobs free up resources. This is independent of the type of compute node.
If you require a particular type of compute node (CPU/GPU/Phi etc.), you can request it either in the submission script or on the sbatch command line by specifying an additional “constraint” parameter (a sketch of such a script follows the list below):
- for 32-CPU Broadwell nodes, specify --constraint=broadwell (54 nodes in total, 128 GB RAM each)
- for Nehalem (8-CPU) nodes, specify --constraint=nehalem (144 nodes in total, 48 GB RAM each)
- for SandyBridge nodes (12 CPUs/node), specify --constraint=sandy (18 nodes in total, 48 GB RAM each)
- for the 512 GB RAM SMP node (32 CPUs max), specify --constraint=smp (1 node in total)
- for N NVidia Kepler K20 GPUs, specify --gres=gpu:K20:N (18 nodes x 2 GPUs in total)
- for N NVidia Kepler K80 GPUs, specify --gres=gpu:K80:N (10 nodes x 2 GPUs in total)
- for N Intel Phi coprocessors, specify --gres=mic:N or --constraint=phi (2 nodes in total, also with 32 CPUs each)
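For example, a minimal sketch of a submission script that requests Broadwell nodes via the constraint directive (the job name and program are hypothetical):

#!/bin/bash
#SBATCH --job-name=myjob        # hypothetical job name
#SBATCH -n 32                   # total number of CPUs requested
#SBATCH -t 8:00:00              # wall time (otherwise the 4-hour default applies)
#SBATCH --constraint=broadwell  # run on Broadwell nodes only

# launch the (hypothetical) parallel program on the allocated CPUs
srun ./my_mpi_program

Submit it as usual with sbatch; note that options passed on the sbatch command line override the matching #SBATCH directives in the script.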
If you have no particular preference for the type of CPU or compute node and are running a parallel job, please specify ONLY the TOTAL number of CPUs required, NOT the number of nodes: SLURM will assign nodes automatically.
e.g. if I need 64 CPUs in total for 24 hours on whatever nodes are available, I submit my slurm script with:
sbatch -n 64 -t 24:00:00 myslurmscriptname.slurm
If I need 64 CPUs in total for 48 hours on Broadwell-based nodes (32 CPUs/node), I submit my slurm script with:
sbatch -n 64 -t 48:00:00 --constraint=broadwell myslurmscriptname.slurm
Finally, if I want 2 GPU nodes with 2 NVidia Kepler K80 GPUs in total (1 GPU/node) and 2 CPUs on each node, I do something like:
sbatch -N2 --ntasks-per-node=1 --cpus-per-task=2 --gres=gpu:K80:1 mygpuscript.slurm
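For reference, a minimal sketch of what mygpuscript.slurm might contain to match that command line (the program name is hypothetical):

#!/bin/bash
#SBATCH -N 2                  # 2 nodes
#SBATCH --ntasks-per-node=1   # one task per node
#SBATCH --cpus-per-task=2     # 2 CPUs per task
#SBATCH --gres=gpu:K80:1      # 1 K80 GPU per node
#SBATCH -t 12:00:00           # example wall time

# hypothetical GPU program; each of the two tasks drives one GPU
srun ./my_gpu_program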
The full list of possible sbatch options is available in the SLURM docs: https://slurm.schedmd.com/sbatch.html
Note that if you do not specify the required time for your job (e.g. “… -t 8:00:00 …” for 8 hours), your job will be terminated after the default time limit of 4 hours.
[Historical info below; we don’t have different queues at the moment]
Zeus has a few queues (partitions):
- “all” — default queue; includes all the nodes of zeus from Nr 22 up: zeus[22-91,100-171,200-217]. Wall time limit: 24 hours in this queue
- “long” — queue for long jobs (>24 hours); includes nodes zeus[22-80]. Infinite time limit
- “GPU” — queue for jobs on the 18 SandyBridge nodes zeus[200-217], which have GPUs (two K20 each). Wall time limit in this queue is 36 hours
- “SMP” — queue for one node, zeus5: 32 cores (4×8), 512 GB of RAM. Wall time: 48 hours
- “debug” — zeus[20-21] — for short tests; max wall time is 20 min.
%%%%%%%%%%%%% November 2016 UPDATE %%%%%%%%%%%%%%%%%
After a major upgrade, new partitions/queues were added; see all the queues in the output of the sinfo
command. See the current queue limitations here.
New queues:
New Broadwell-CPU-based nodes:
- 44 nodes with 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 CPU-cores/node), 128 GB RAM: queue Broadwell
- 10 nodes with 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 CPU-cores/node), 128 GB RAM + 2 x NVidia K80 GPUs: queue NGPU
- 2 nodes with 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 CPU-cores/node), 128 GB RAM + Xeon Phi coprocessor SE10/7120: queue Phi
%%%%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%%%%%%%%
To submit to a particular queue (below, the “SMP” queue; queue names are case sensitive!):
sbatch --partition SMP slurmscriptname
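Equivalently, the partition can be requested inside the submission script itself; a sketch, with a hypothetical program:

#!/bin/bash
#SBATCH --partition=SMP   # queue name is case sensitive
#SBATCH -t 24:00:00       # within the SMP queue's 48-hour limit

# hypothetical large-memory program run on the SMP node
./my_smp_program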
If you do not specify a walltime when submitting the job, your job will be terminated after 4 hours (in any queue apart from “debug”)!
To see the state of all queues:
sinfo
PARTITION AVAIL  TIMELIMIT   NODES STATE NODELIST
debug     up     20:00           2 idle  zeus[20-21]
all*      up     1-00:00:00     25 mix   zeus[22-35,37-40,74-80]
all*      up     1-00:00:00    134 idle  zeus[41-73,81-91,100-171,200-217]
all*      up     1-00:00:00      1 down  zeus36
long      up     infinite       25 mix   zeus[22-35,37-40,74-80]
long      up     infinite       33 idle  zeus[41-73]
long      up     infinite        1 down  zeus36
GPU       up     1-12:00:00     18 idle  zeus[200-217]
SMP       up     2-00:00:00      1 idle  zeus5
To see the running jobs in all the queues do:
squeue
[aa3025@zeus3 ~]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
258451       GPU      lm0   ab7585  R   21:29:33     18 zeus[200-217]
258486       GPU s455_tes  bakern6  R   12:20:38      1 zeus200
258489       SMP     bash   ab1083  R      43:51      1 zeus5
258460       all     Dyna  arestil  R   20:03:59      8 zeus[38-45]
258485       all  Dropped  malinl2  R   12:48:08      3 zeus[27-29]
258484       all  Ev_0.4m  malinl2  R   12:48:33      3 zeus[23-24,26]
258466       all Lotus_MI massawaj  R   18:41:18      3 zeus[31-33]
258478       all carcrash  lsdyna5  R   17:05:46      1 zeus72
258431      long    duct1   dmitry  R   23:41:26      4 zeus[126-129]
251602      long    duct1   dmitry  R 2-22:27:44      4 zeus[118-120,122]
253428      long    duct1   dmitry  R 2-00:48:01      3 zeus[155,157-158]
253427      long    duct1   dmitry  R 2-10:29:58      3 zeus[123-125]
251518      long convnn_1  hermesm  R 5-16:03:16      1 zeus54
251517      long convnn_1  hermesm  R 5-16:04:17      1 zeus54
251407      long p1822999  bakern6  R 6-18:29:31      1 zeus56
251406      long p1822300  bakern6  R 6-18:30:51      1 zeus59
To see only your jobs, do:
squeue -u $USER
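A few other standard squeue filters that may be handy:

squeue -u $USER -t PENDING   # only your pending jobs
squeue -p GPU                # only jobs in the GPU partition
squeue -l                    # long output format, including time limits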
Listing properties of a job
To see the detailed properties of a job, do:
scontrol show jobid 258486
scontrol show jobid 258486
JobId=258486 Name=s455_test
   UserId=bakern6(581) GroupId=bakern6(581)
   Priority=13 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=12:31:26 TimeLimit=1-12:00:00 TimeMin=N/A
   SubmitTime=2015-02-18T21:59:46 EligibleTime=2015-02-18T21:59:46
   StartTime=2015-02-18T21:59:46 EndTime=2015-02-20T09:59:46
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=GPU AllocNode:Sid=zeus3:9863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=zeus200
   BatchHost=zeus200
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/bakern6/new/vortex_solve/job_455_1000_50.q
   WorkDir=/home/bakern6/new/vortex_solve