You can reserve nodes of Zeus HPC with the salloc command, e.g.
salloc -t 2:30:00 -N3 -n24
This will reserve 24 CPU cores (-n24) on 3 compute nodes (-N3) for 2.5 hours.
You can also just specify how many CPU cores you want without giving a number of nodes, say 16 CPU cores anywhere:
salloc -t 2:30:00 -n16
The CPU cores will be allocated on whatever nodes are available, not necessarily contiguous ones.
You will get something like:
[aa3025@zeus2 arrays]$ salloc -t 2:30:00 -n16
salloc: Pending job allocation 296104
salloc: job 296104 queued and waiting for resources
salloc: job 296104 has been allocated resources
salloc: Granted job allocation 296104
[aa3025@zeus2 arrays]$
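Other standard SLURM options can be combined with salloc in the same way; for example (a sketch using stock SLURM flags, not Zeus-specific settings):
salloc -t 1:00:00 -N2 --ntasks-per-node=8
This would request 2 nodes with 8 tasks placed on each of them.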
Check which nodes your job was allocated on, if needed:
[aa3025@zeus2 arrays]$ qstat
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
296104 all bash aa3025 R 0:05 1 zeus15
[aa3025@zeus2 arrays]$
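The native SLURM command for this listing is squeue; for example, to show only your own jobs (standard SLURM syntax):
squeue -u $USER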
or, if you already have several jobs in the queue, run
scontrol show jobid=296104
to see the full specification of job 296104:
JobId=296104 JobName=bash
UserId=aa3025(500) GroupId=aa3025(500)
Priority=2483 Nice=0 Account=default QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:02:16 TimeLimit=02:30:00 TimeMin=N/A
SubmitTime=2017-02-24T11:59:53 EligibleTime=2017-02-24T11:59:53
StartTime=2017-02-24T12:00:02 EndTime=2017-02-24T14:30:02
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=short4 AllocNode:Sid=zeus2:20360
ReqNodeList=(null) ExcNodeList=(null)
NodeList=zeus15
BatchHost=zeus15
NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/aa3025/tests/arrays
from which we see that the node list is
...
NodeList=zeus15
...
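If you only need the node list, you can filter the scontrol output, e.g. (a plain grep one-liner, not Zeus-specific):
scontrol show jobid=296104 | grep NodeList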
Then you can use these nodes however you like, with “srun” or with MPI, e.g.
[aa3025@zeus2 arrays]$ srun hostname
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
zeus15
So we have 16 instances of the “hostname” command executed on the target node(s).
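You can also launch fewer tasks than you reserved; for example (standard srun behaviour within an allocation):
srun -n4 hostname
would run only 4 instances of “hostname” on the allocated node(s).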
With MPI, e.g. if I use Intel MPI-compiled code (say, a standard parallel matrix-multiplication code):
Compile with the mpicc wrapper:
[aa3025@zeus2 gcc]$ module load intel/13
[aa3025@zeus2 gcc]$ mpicc mmult.c -o mmult_mpi_icc.exe
...
[aa3025@zeus2 gcc]$ mpirun -np 16 ./mmult_mpi_icc.exe
Time taken = 7.069788 seconds
Time taken = 7.093434 seconds
Time taken = 7.129114 seconds
Time taken = 7.146285 seconds
Time taken = 7.139382 seconds
Time taken = 7.143934 seconds
Time taken = 7.152325 seconds
Time taken = 7.160740 seconds
Time taken = 7.162425 seconds
Time taken = 7.159303 seconds
Time taken = 7.182479 seconds
Time taken = 7.169939 seconds
Time taken = 7.194784 seconds
Time taken = 7.193553 seconds
Time taken = 7.187440 seconds
Time taken = 7.245136 seconds
.. which is one line of output from each of the 16 MPI processes.
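Depending on how the MPI library is integrated with SLURM, the ranks can often be launched with srun directly instead of mpirun (a sketch, assuming srun/MPI integration is enabled on Zeus):
srun -n 16 ./mmult_mpi_icc.exe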
When you have finished using your reservation before its expiry time, cancel it with “scancel” to free the resources (otherwise it will expire after the 2.5 hours you specified with “salloc”).
It will also be cancelled if you correctly “exit” from the salloc-ated terminal.
scancel 296104
where 296104 is the job ID.
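Inside the salloc-spawned shell the job ID is also available in the SLURM_JOB_ID environment variable (standard SLURM), so you can equivalently run
scancel $SLURM_JOB_ID
or simply type “exit” to release the allocation.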
Alex Pedcenko