
Queues of zeus [updated June 2018]

Queuing model UPDATE June 2018

To simplify the usage of different queues, we have combined all nodes into a single default queue (SLURM partition) “all”. The usage limits are now solely user-based: each user has a default (for now) number of CPU*minutes that they can use at any given moment. If this number of CPU*minutes is reached, new jobs from this user will be held in the queue until their running jobs free resources. This is independent of the type of compute nodes.
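
As a purely illustrative example (the actual per-user limit is not given here and may change): if the limit were 46,080 CPU*minutes, a single running job using 32 CPUs for 24 hours would account for 32 × 1440 min = 46,080 CPU*min, so any further jobs submitted by the same user would wait in the queue until part of that allocation is freed.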

If you require a particular type of compute node (CPU/GPU/Phi etc.), you can request it either in the submission script or on the sbatch command line by specifying the additional parameter “constraint” (see the example script after this list):

    • for 32xCPU Broadwell nodes, specify --constraint=broadwell (54 nodes in total, 128GB RAM each)
    • for Nehalem (8-CPU) nodes, specify --constraint=nehalem (144 nodes in total, 48 GB RAM each)
    • for SandyBridge nodes (12 CPUs/node), do --constraint=sandy (18 nodes in total, 48 GB RAM each)
    • for the 512GB RAM SMP node (32 CPUs max), ask for --constraint=smp (1 node in total)
    • for Nx NVidia Kepler K20 GPUs, ask for --gres=gpu:K20:N (18 nodes x 2 GPUs in total)
    • for Nx NVidia Kepler K80 GPUs, ask for --gres=gpu:K80:N (10 nodes x 2 GPUs in total)
    • for N Intel Phi, ask for --gres=mic:N or --constraint=phi (2 nodes in total, also have 32 CPUs each).
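
For example, a minimal submission script using such a constraint might look like the sketch below (the CPU count, time limit and executable are placeholders; adapt them to your own job):

#!/bin/bash
#SBATCH -n 32                   # total number of CPUs requested
#SBATCH -t 12:00:00             # wall time limit (here 12 hours)
#SBATCH --constraint=broadwell  # run only on Broadwell nodes

srun ./my_program               # placeholder: start your own program here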

If you have no particular preference for the type of CPU or compute node and are running a parallel job, please specify ONLY the TOTAL number of CPUs required, NOT the number of nodes: SLURM will assign nodes automatically.

E.g. if I need 64 CPUs in total for 24 hours on whatever nodes are available, I submit my SLURM script with:

sbatch -n 64 -t 24:00:00 myslurmscriptname.slurm

If I need 64 CPUs in total for 48 hours on Broadwell-based nodes (32 CPUs/node), I submit my SLURM script with:

sbatch -n 64 -t 48:00:00 --constraint=broadwell myslurmscriptname.slurm

Finally, if I want 2 GPU nodes, each with one NVidia Kepler K80 GPU and 2 CPUs (2 GPUs and 4 CPUs in total), I do something like:

sbatch -N2 --ntasks-per-node=1 --cpus-per-task=2 --gres=gpu:K80:1 mygpuscript.slurm
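
For reference, the script passed to sbatch is just an ordinary shell script. Since all resources are already requested on the sbatch command line above, a minimal sketch of mygpuscript.slurm could look like this (the module and program names are placeholders and will differ for your code):

#!/bin/bash
# nodes, CPUs and GPUs are requested on the sbatch command line,
# so the script only has to start the actual GPU program
module load cuda        # placeholder: load whatever GPU toolkit module your code needs
srun ./my_gpu_program   # placeholder executable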

The full list of possible sbatch options is available in the SLURM docs: https://slurm.schedmd.com/sbatch.html

Note that if you do not specify the required time for your job (e.g. for 8 hours: “… -t 8:00:00 …”), it will be terminated after the default time of 4 hours.


[Historical info below; we don’t have different queues at the moment]

Zeus has a few queues (partitions):

  • “all” — default queue, includes all the nodes of zeus from Nr 22 up: zeus[22-91,100-171,200-217]. Wall time limit in this queue: 24 hours
  • “long” — queue for long jobs (>24 hours), includes nodes zeus[22-80]. No time limit (infinite)
  • “GPU” — queue for jobs on the 18 SandyBridge nodes zeus[200-217], which have GPUs (two K20 each). Wall time limit in this queue is 36 hours
  • “SMP” — queue for one node, zeus5: 32 cores (4×8), 512GB of RAM. Wall time limit: 48 hours
  • “debug” — zeus[20-21] — for short tests, max wall time is 20 min.

%%%%%%%%%%%%% November 2016 UPDATE %%%%%%%%%%%%%%%%%

After the major upgrade, new partitions/queues have been added; see all the queues in the output of the sinfo command. See the current queue limitations here.

New queues:

New Broadwell CPU based nodes:

  • 44 nodes with 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 CPU-cores/node), 128GB RAM: queue Broadwell
  • 10 nodes with 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 CPU-cores/node), 128GB RAM + 2 x NVidia K80 GPU: queue NGPU
  • 2 nodes with 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 CPU-cores/node), 128GB RAM + Xeon Phi coprocessor SE10/7120: queue Phi

%%%%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%%%%%%%%

To submit to a specific queue (below is the “SMP” queue; queue names are case sensitive!):

sbatch --partition SMP slurmscriptname

If you do not specify the walltime when submitting a job, it will be terminated after 4 hours (in any queue apart from “debug”)!

To see the state of all queues:

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up      20:00      2   idle zeus[20-21]
all*         up 1-00:00:00     25    mix zeus[22-35,37-40,74-80]
all*         up 1-00:00:00    134   idle zeus[41-73,81-91,100-171,200-217]
all*         up 1-00:00:00      1   down zeus36
long         up   infinite     25    mix zeus[22-35,37-40,74-80]
long         up   infinite     33   idle zeus[41-73]
long         up   infinite      1   down zeus36
GPU          up 1-12:00:00     18   idle zeus[200-217]
SMP          up 2-00:00:00      1   idle zeus5

To see the running jobs in all the queues, do:

squeue

[aa3025@zeus3 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            258451       GPU      lm0   ab7585  R   21:29:33     18 zeus[200-217]
            258486       GPU s455_tes  bakern6  R   12:20:38      1 zeus200
            258489       SMP     bash   ab1083  R      43:51      1 zeus5
            258460       all     Dyna  arestil  R   20:03:59      8 zeus[38-45]
            258485       all  Dropped  malinl2  R   12:48:08      3 zeus[27-29]
            258484       all  Ev_0.4m  malinl2  R   12:48:33      3 zeus[23-24,26]
            258466       all Lotus_MI massawaj  R   18:41:18      3 zeus[31-33]
            258478       all carcrash  lsdyna5  R   17:05:46      1 zeus72
            258431      long    duct1   dmitry  R   23:41:26      4 zeus[126-129]
            251602      long    duct1   dmitry  R 2-22:27:44      4 zeus[118-120,122]
            253428      long    duct1   dmitry  R 2-00:48:01      3 zeus[155,157-158]
            253427      long    duct1   dmitry  R 2-10:29:58      3 zeus[123-125]
            251518      long convnn_1  hermesm  R 5-16:03:16      1 zeus54
            251517      long convnn_1  hermesm  R 5-16:04:17      1 zeus54
            251407      long p1822999  bakern6  R 6-18:29:31      1 zeus56
            251406      long p1822300  bakern6  R 6-18:30:51      1 zeus59

To see only your jobs, do:

squeue -u $USER

Listing properties of a job

To see the detailed properties of a job, do:

scontrol show jobid 258486

JobId=258486 Name=s455_test
   UserId=bakern6(581) GroupId=bakern6(581)
   Priority=13 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=12:31:26 TimeLimit=1-12:00:00 TimeMin=N/A
   SubmitTime=2015-02-18T21:59:46 EligibleTime=2015-02-18T21:59:46
   StartTime=2015-02-18T21:59:46 EndTime=2015-02-20T09:59:46
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=GPU AllocNode:Sid=zeus3:9863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=zeus200
   BatchHost=zeus200
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/bakern6/new/vortex_solve/job_455_1000_50.q
   WorkDir=/home/bakern6/new/vortex_solve

Intel Compilers updated to version 14.0.1 (2013) on zeus

If you use the Intel compilers, MKL, etc.: to load the Intel compiler environment variables, you have to do:

source /share/apps/intel/bin/compilervars.sh intel64

or

module load intel/13

For ver. 11 of the Intel compilers, do:

source /share/apps/intel/Compiler/11.1/073/bin/ifortvars.sh intel64
source /share/apps/intel/Compiler/11.1/073/bin/iccvars.sh intel64

If you need OpenMPI compiled with Intel, do:

module load intel/openmpi-1.6.5

You can put both commands in your .bashrc so that they are loaded automatically on the compute nodes when you submit a job.
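
For example, adding the following lines to the end of your ~/.bashrc loads the Intel environment and the Intel-built OpenMPI automatically (pick the variant matching the compiler version you actually use):

# load Intel compiler environment variables (64-bit) and Intel-built OpenMPI
source /share/apps/intel/bin/compilervars.sh intel64
module load intel/openmpi-1.6.5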

The compilers are installed in /share/apps/intel, which is shared across all the nodes of zeus.

Please note that you cannot use these compilers for any project that generates income; they are solely for personal, academic, non-profit use. Otherwise you have to obtain your own license and point the compilers to its location.

Alex
