
Submitting a sequence of jobs for long runs

It’s quite simple to increase your job’s run time when needed by using job dependencies, without compromising the priorities of the queue. We want to keep the queue very flexible, so we encourage users to submit relatively short, re-startable jobs.

If you need to submit a long re-startable job to the queue, you can use the job “dependency” option in Slurm.

For example, say the queue “all” has a time limit of 24 hours, but you need 3×24 hours for your calculation. You can then submit your job (the same submission script from the same job folder) three times:

sbatch  myslurmscript.sh

Submitted batch job 81962

sbatch -d afterok:81962 myslurmscript.sh

Submitted batch job 81963

sbatch -d afterok:81963 myslurmscript.sh

There are also “afterany” and “afternotok” varieties of the dependency option.

This way the first job (81962) starts and runs while the other two wait in the queue. After the first job completes, the next one starts, and so on. Each has a 24-hour run time.
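If you need many restarts, submitting the chain by hand becomes tedious, so it can be scripted. Below is a minimal shell sketch that resubmits the same script several times, reading each job ID from the “Submitted batch job …” line that sbatch prints, exactly as in the example above; the script name and the number of runs are just the placeholder values from that example.

#!/bin/bash
# Submit the same script several times, each job depending on the
# previous one (same idea as the manual chain above).
script=myslurmscript.sh   # your usual submission script
nruns=3                   # total number of 24-hour chunks

# sbatch prints "Submitted batch job <id>"; the 4th field is the job ID.
jobid=$(sbatch "$script" | awk '{print $4}')
echo "Submitted job $jobid"

for i in $(seq 2 "$nruns"); do
    jobid=$(sbatch -d afterok:"$jobid" "$script" | awk '{print $4}')
    echo "Submitted job $jobid (starts after the previous one completes)"
done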

This works only if the code you are running is re-startable, i.e. it can continue execution from the point where its previous instance stopped. When the queue server (Slurm) terminates a job, e.g. because the time limit has been reached, it sends a SIGTERM signal to your code’s processes on the nodes. Your code can catch this signal and save all the data it needs to restart from the same position. After 120 seconds Slurm sends SIGKILL and the processes are terminated. Normally two minutes is quite enough time to write the restart data to disk.
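As an illustration of how this can look from the batch-script side, here is a minimal sketch of a submission script that traps the SIGTERM and forwards it to the solver process, then waits for the solver to finish writing its restart files before the SIGKILL arrives. The solver name (my_solver) and its checkpoint-on-SIGTERM behaviour are assumptions for illustration only; the actual save-and-restart logic has to live in your own code.

#!/bin/bash
#SBATCH -p all
#SBATCH -t 24:00:00

# Forward Slurm's SIGTERM to the solver so it can write its restart files.
# (my_solver is a placeholder; replace it with your own executable.)
forward_term() {
    echo "Time limit reached, asking the solver to checkpoint" >&2
    kill -TERM "$solver_pid" 2>/dev/null
}
trap forward_term TERM

# Run the solver in the background so the trap can fire while it runs.
./my_solver input.dat &
solver_pid=$!

# The first wait returns early when the trap fires; the second one gives
# the solver the remaining grace period (about 2 minutes) to finish.
wait "$solver_pid"
wait "$solver_pid"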

Regards,
Alex Pedcenko
