PBS Pro


The default user environment is set up to use PBS Pro. Please take care when changing your environment settings. If you can no longer find the PBS binaries or man pages, reload the PBS module by running module load pbs.

Jobs should be run from the system's scratch directory. See the system-specific documentation for the scratch directory name. The default user environment provides a predefined $WORKDIR environment variable for use in user scripts and job submissions, as illustrated in the example below.

  • $WORKDIR
    • top level of user's scratch directory
  • $PBS_O_WORKDIR
    • working directory in which qsub was executed

Do not run jobs in your home directory.
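
A minimal sketch of a job script that uses both variables (the input file and executable names are placeholders, not site requirements): stage input from the submission directory into a run directory under $WORKDIR, then run from scratch.

#!/bin/csh
#PBS -l walltime=1:00:00
# create a per-job run directory on the scratch file system
mkdir -p $WORKDIR/run.$PBS_JOBID
# copy input from the directory where qsub was executed, then run from scratch
cp $PBS_O_WORKDIR/input.dat $WORKDIR/run.$PBS_JOBID
cd $WORKDIR/run.$PBS_JOBID
./my_executable < input.dat > output.dat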

On cluster systems, if a job does not request all the cores on a node, another job may be scheduled on the same node. To prevent this, the job must request exclusive use (see "Exclusive use of cluster nodes" below).


Common PBS commands

  • qsub: submit a batch job
  • qstat: display jobs and their status
  • qdel: delete a job
  • qalter: modify options on a pending job
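
Typical invocations look like the following (the script name and job ID are only placeholders):

qsub myscript.pbs
qstat -u $USER
qdel 12345
qalter -l walltime=8:00:00 12345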

How to submit a batch job

Use the qsub command to submit a script file. A PBS Pro job script allows options to be specified at the beginning of the file on lines prefaced by the #PBS delimiter, followed by the commands to run. Commands are executed in the order given in the script.

A script can include almost anything you could do during a terminal session, such as setting environment variables, changing directories, and moving files. See the qsub man page for a complete list of options.

qsub myscriptname

Example PBS scripts

On a cluster system, run an executable on 3 nodes using all 8 cores per node, with 8 MPI processes per node, for 5 hours:

#!/bin/csh
#PBS -j oe
#PBS -l walltime=5:00:00
#PBS -l select=3:ncpus=8:mpiprocs=8
#PBS -V
cd ${PBS_O_WORKDIR}
# launch the executable with your system's MPI launch command; the program name is a placeholder
mpiexec ./my_mpi_program

On a non-cluster system, run an executable on 24 cores for 5 hours:

#!/bin/csh
#PBS -j oe
#PBS -l walltime=5:00:00
#PBS -l ncpus=24
#PBS -V
cd ${PBS_O_WORKDIR}
# run the executable; the program name is a placeholder
./my_program

How long can a job run?

There is no limit on the amount of time a PBS job can run. If no time limit is specified, a default of 12 hours is assigned. Remember to checkpoint long-running jobs.

To specify a time, include the following line in the submission script:

#PBS -l walltime=hhhh:mm:ss
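
For example, to request 72 hours (the value is only illustrative):

#PBS -l walltime=72:00:00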

Exclusive use of cluster nodes

On cluster systems, if a job does not request all the cores on a node, it is possible another job will share the same node. To prevent this, request exclusive use.

#PBS -l place=excl
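
The placement request can also be combined with a select statement on a single resource line, as in this sketch (the chunk counts are only examples):

#PBS -l select=2:ncpus=8:mpiprocs=8,place=excl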

How to export interactive session environment to job

To include your interactive shell environment in the batch job, use the -V option either on the qsub command line or as a PBS directive:

qsub -V
        
#PBS -V

Hyper-Threading

On cluster systems with Hyper-Threading, Hyper-Threading is enabled by default. This allows users to run two tasks or threads per core instead of just one. Users can set ncpus and mpiprocs to 16 even though there are only 8 physical cores on each node.

#PBS -l select=2:ncpus=16:mpiprocs=16

However, if users specify only 8 ncpus and mpiprocs per node,

#PBS -l select=2:ncpus=8:mpiprocs=8

then all 16 tasks (or threads) in this example are placed on a single 8-core node, while the second node remains empty. This will result in your code running at about half the speed you anticipated.

To avoid this pitfall, use "place=scatter" or "place=scatter:excl", so that 8 tasks are placed on the first node and 8 on the second node.

#PBS -l select=2:ncpus=8:mpiprocs=8,place=scatter:excl

Users have routinely seen a 5% performance increase from Hyper-Threading when using 16 tasks or threads per node instead of 8.

For more on Hyper-Threading, see Intel's Hyper-Threading Technology

How to hold jobs until the first job finishes

Use the -W depend option to qsub, either on the command line or as a PBS directive:

qsub -W depend=<conditions>
 
#PBS -W depend=<conditions> 

See the qsub man page for a complete list of options
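
For example, to hold a job until a previously submitted job finishes successfully, the afterok condition can be used (the job ID and script name below are placeholders):

qsub -W depend=afterok:12345 second_job.pbs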

Suppose I want to do a parameter study. How do I submit all of these jobs, each with a different parameter value?

Use PBS Job Arrays. You can specify how many jobs to run by adding the directive

#PBS -J <range>

where range is X-Y:Z, meaning indices from X to Y with a step of Z. So 1-10:2 indicates every other index from 1 to 10, i.e., 1, 3, 5, 7, and 9.

PBS defines two environment variables:

PBS_ARRAY_INDEX     job array index
PBS_ARRAY_ID        job array id

These variables are also defined as attributes:

array_index
array_id

Based on the job array index, a different input file can be used for each job in the array. Below is an example job script that runs a two-subjob array, each subjob using one processor.

NOTE: The PBS directives specify the resources that will be used by EACH individual job, NOT all the jobs together.

#!/bin/csh
#PBS -V
#PBS -l select=1:ncpus=1:mpiprocs=1,walltime=48:00
#PBS -N Job_Array_Test
#PBS -j oe -o ja.^array_index^.pbs
#PBS -J 1-2
cd $PBS_O_WORKDIR
#
unset echo

echo $PBS_ARRAY_INDEX

# record the array id and index, then run the executable,
# appending each subjob's output to its own file
echo $PBS_ARRAY_ID $PBS_ARRAY_INDEX >> ja.$PBS_ARRAY_INDEX.out
echo ' '        >> ja.$PBS_ARRAY_INDEX.out
bin/pi < pi.inp >> ja.$PBS_ARRAY_INDEX.out

exit
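
If each subjob needs its own input, one variation (the input file names are hypothetical) is to key the input file off the array index as well:

# e.g., with input files pi.1.inp and pi.2.inp in the working directory
bin/pi < pi.$PBS_ARRAY_INDEX.inp >> ja.$PBS_ARRAY_INDEX.out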

How do I determine the number of cores I have requested in my job script?

To determine the number of cores requested, include the following line in the submission script:

set NCORES = `cat $PBS_NODEFILE | wc -l`

How do I determine the number of nodes I have requested in my job script?

To determine the number of nodes requested, include the following line in the submission script:

set NNODES  = `cat $PBS_NODEFILE | uniq | wc -l `
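
A minimal sketch showing both counts in use within a csh job script (the select values and the echo message are only illustrative):

#!/bin/csh
#PBS -l select=2:ncpus=8:mpiprocs=8
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists one host name per MPI rank; uniq collapses the repeated hosts
set NCORES = `cat $PBS_NODEFILE | wc -l`
set NNODES = `cat $PBS_NODEFILE | uniq | wc -l`
echo "Running on $NNODES nodes with $NCORES MPI ranks"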

A job submitted to the queue is not running

Look at the job status with qstat:

qstat -f yourjobid

The next-to-last line of the output is a comment describing the job status. Possible comments include:
  • Job run at date at time on hostname
    • Job is running OK

  • Not Running: No available resources on nodes
    • The job requires more memory, cores, or other resources than are currently available, or you have requested more resources than physically exist on the system.

  • Job held, too many failed attempts to run
    • Delete and resubmit the job
    • If this persists, contact User support

  • Not Running, Draining system to allow starving job to run
    • The job will not run yet because resources are being reserved for other jobs.

Queued Jobs stuck in an error state (E)

Try deleting the job with qdel:

qdel -Wforce yourjobid

If this does not work, contact User support.

