Scheduler Resources
Understanding the difference between nodes, tasks, and CPUs helps you allocate resources correctly and avoid common mistakes on HPC clusters.
Concepts
| Term | Meaning |
|---|---|
| Node | A physical compute machine |
| Task | An MPI process (one independent program rank) |
| CPU | Threads available to a single task |
A job with 2 nodes, 4 tasks per node, and 2 CPUs per task runs:
2 machines
4 MPI processes per machine (8 total)
Each process may use up to 2 threads
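The arithmetic behind this example can be sketched in a few lines of shell. `job_shape` is a hypothetical helper written for illustration; it is not part of CondaTainer or any scheduler.

```bash
# Print the totals implied by a nodes / tasks-per-node / CPUs-per-task request.
# job_shape is a hypothetical helper, not a CondaTainer or scheduler command.
job_shape() (
    nodes=$1 tasks_per_node=$2 cpus_per_task=$3
    total_tasks=$((nodes * tasks_per_node))       # MPI ranks across all nodes
    total_cpus=$((total_tasks * cpus_per_task))   # threads across all ranks
    echo "nodes=$nodes tasks=$total_tasks cpus=$total_cpus"
)

job_shape 2 4 2   # prints: nodes=2 tasks=8 cpus=16
```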
Single-Task vs Multi-Task (MPI) Jobs
Most jobs (Python, R, bash scripts) should request multiple threads, not multiple tasks.
Multiple tasks launch multiple independent copies of your program (MPI ranks). Unless your code explicitly calls MPI_Init (or, in Python, runs from mpi4py import MPI), extra tasks are a wasted allocation.
Correct: threaded job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=2:00:00
#DEP: myenv.img
python train_model.py
Inside the container, thread-parallel libraries will use all 8 CPUs.
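Many thread-parallel libraries size their thread pools from environment variables such as OMP_NUM_THREADS. If you prefer to make the thread count explicit rather than rely on defaults, a minimal sketch is shown below. It assumes SLURM, which exports SLURM_CPUS_PER_TASK when --cpus-per-task is requested; `set_thread_env` is a hypothetical helper, and whether CondaTainer sets any of these variables itself is not stated here.

```bash
# Hypothetical helper: pin common threading variables to the CPU allocation.
# SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task is set; fall back to 1.
set_thread_env() {
    threads=${SLURM_CPUS_PER_TASK:-1}
    export OMP_NUM_THREADS="$threads"        # OpenMP runtimes
    export OPENBLAS_NUM_THREADS="$threads"   # OpenBLAS
    export MKL_NUM_THREADS="$threads"        # Intel MKL
}

set_thread_env
echo "Using $OMP_NUM_THREADS thread(s)"
```

Call the helper before the line that starts your program (for example, before python train_model.py).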
Wrong: accidental multi-task
#SBATCH --ntasks=8 # Launches 8 copies of python; each does the full job
This starts 8 separate Python processes, each running train_model.py from scratch, wasting 7× the resources.
Correct: MPI job
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4 # 8 MPI ranks total across 2 nodes
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=1:00:00
#DEP: mpi.img
python mpi_simulation.py
HTCondor: Single-Task Only
HTCondor is inherently single-task (universe = vanilla is assumed for all jobs), so nodes and tasks-per-node do not apply.
When CondaTainer reads a .sub file, it takes the script path from the executable = ... line and parses the other resource requests. It then reads the overlay requirements from the script's #DEP: tags and launches the container with the appropriate resources.
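The two lookups described above can be approximated with sed. This is only an illustrative sketch: `sub_executable` and `script_deps` are hypothetical helpers, and CondaTainer's real parser is not shown in this document and may handle more cases (for example, different letter case or quoting).

```bash
# Hypothetical helpers; CondaTainer's real parser is not shown in this document.

# Print the value of the first "executable = ..." line in a submit file.
sub_executable() {
    sed -n 's/^[[:space:]]*executable[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Print every overlay named in the script's "#DEP:" tags, one per line.
script_deps() {
    sed -n 's/^#DEP:[[:space:]]*//p' "$1"
}
```

For a submit file job.sub, `script_deps "$(sub_executable job.sub)"` would then list the overlays the job script asks for, provided the executable path is resolvable from the current directory.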
Array Jobs
Use --array with condatainer run to submit the same script over a list of inputs. Each line in the input file becomes one subjob; its space-separated tokens arrive as positional arguments ($1, $2, …) inside the script.
sample1 condition_A
sample2 condition_B
The file should have no blank lines, and all non-empty lines must have the same number of shell-split tokens (quoted strings count as one). Blank lines are flagged on --dry-run and block submission; remove them before submitting.
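If you want to catch these problems before calling condatainer run at all, the two rules can be checked locally. The sketch below is a hypothetical pre-flight helper, not CondaTainer's own validator; it uses xargs to approximate shell splitting, which honours single and double quotes but may differ from the real rules in edge cases.

```bash
# Hypothetical pre-flight check, not CondaTainer's own validator.
# Fails on blank lines and on lines whose token count differs from the first line.
check_array_file() (
    file=$1
    lineno=0
    expected=
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        case $line in
            *[![:space:]]*) ;;   # has at least one non-space character
            *) echo "line $lineno: blank line" >&2; return 1 ;;
        esac
        # xargs splits on whitespace and honours quotes, so "a b" is one token.
        count=$(printf '%s\n' "$line" | xargs printf '%s\n' | wc -l | tr -d '[:space:]')
        if [ -z "$expected" ]; then
            expected=$count
        elif [ "$count" -ne "$expected" ]; then
            echo "line $lineno: $count tokens, expected $expected" >&2
            return 1
        fi
    done < "$file"
)
```

Running `check_array_file samples.txt` returns a non-zero status and names the first offending line when the file breaks either rule.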
condatainer run --array samples.txt quant.sh
condatainer run --array samples.txt --array-limit 4 quant.sh # max 4 concurrent
condatainer run --dry-run --array samples.txt quant.sh # preview without submitting
Job Chaining
CondaTainer supports three job dependency flags, each mapping to a native scheduler dependency type:
| Flag | Runs when upstream job… | SLURM / PBS | LSF |
|---|---|---|---|
| | Succeeds | | |
| | Fails | | |
| | Finishes (any outcome) | | |
TRIM=$(condatainer run trim.sh sample1)
ALIGN=$(condatainer run --afterok "$TRIM" align.sh sample1)
condatainer run --afterok "$ALIGN" quant.sh sample1
Multiple job IDs can be joined with colons: --afterok 123:456:789. All three flags can be combined in one submission.
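When the upstream job IDs live in separate shell variables, the colon-joined list can be built with a small helper. `join_ids` is a hypothetical function written for this example.

```bash
# Hypothetical helper: join job IDs with colons for a dependency flag.
join_ids() (
    IFS=:        # "$*" joins arguments with the first character of IFS
    echo "$*"
)

join_ids 123 456 789   # prints: 123:456:789
```

A downstream submission could then use `--afterok "$(join_ids "$TRIM" "$ALIGN")"`, reusing the variables from the example above.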
Note:
DAGMan is not supported. Please use DAGMan directly for complex workflows on HTCondor clusters.
Array Jobs + Chaining
--array and --afterok can be combined: each stage submits an array job and waits for the previous stage to succeed before starting. Use --afterany to proceed even if some subjobs failed. A final single job can collect results after all subjobs complete.
# Stage 1: trim all samples (no dependency)
JOB=$(condatainer run --array samples.txt --array-limit 10 trim.sh)
# Stage 2: align (waits for ALL trim subjobs to finish)
JOB=$(condatainer run --array samples.txt --array-limit 10 --afterok "$JOB" align.sh)
# Stage 3: quant (waits for ALL align subjobs to finish)
JOB=$(condatainer run --array samples.txt --array-limit 10 --afterok "$JOB" quant.sh)
# Final: single job collecting results (waits for ALL quant subjobs)
condatainer run --afterok "$JOB" collect_results.sh samples.txt
How Schedulers Define Resources
Each scheduler has its own syntax for the same underlying resource concepts. All directives below appear as in-script comments (#SBATCH, #PBS, #BSUB) for SLURM/PBS/LSF, or in a separate .sub file for HTCondor.
| Concept | SLURM | PBS ( | LSF | HTCondor |
|---|---|---|---|---|
| Nodes | | | (derived from | (single-task only) |
| Total tasks | | (Chunks × mpiprocs) | | (not applicable) |
| Tasks per node | | | | (not applicable) |
| CPUs per task | | | | |
| Memory | | | | |
| GPU | | | | |
| Wall time | | | | |
| Job name | | | | (not supported) |
| Stdout / Stderr | | | | |
| Email type | | | | |
| Email address | | | | |
PBS packs chunks, tasks-per-node, CPUs, and memory into a single -l select=N:ncpus=M:mpiprocs=P:mem=X specification. Multiple chunks separated by + allow different resource shapes on different nodes.
LSF uses a single -R string to express node constraint (span[hosts=1] for single-node), task distribution (span[ptile=N]), CPU affinity (affinity[cores(N)]), and memory allocation (rusage[mem=N]). -n N sets the total slot count. -M N is a per-slot memory ulimit and passes through separately.
GPU support
GPU resources are normalized across schedulers. SLURM and LSF support specifying a GPU model; PBS and HTCondor translate to count only.
| Scheduler | Num | Model | Directive |
|---|---|---|---|
| SLURM | Yes | Yes | |
| PBS Pro | Yes | No | |
| LSF | Yes | Yes | |
| HTCondor | Yes | No | |
How CondaTainer Handles Directives
When condatainer run is invoked, it reads the script's scheduler directives and processes them.
Normalizing Directives and Passthrough Mode
CondaTainer parses all directives into a normalized internal representation:
type ResourceSpec struct {
    Nodes        int           // Number of nodes
    Ntasks       int           // Total MPI tasks (0 = not set; use Nodes*TasksPerNode)
    TasksPerNode int           // MPI ranks per node (0 = non-uniform distribution, e.g. multi-chunk PBS)
    CpusPerTask  int           // CPU threads per task (OpenMP)
    MemPerCpuMB  int64         // RAM per logical CPU in MB (SLURM --mem-per-cpu, LSF rusage[mem])
    MemPerNodeMB int64         // Total RAM per node in MB (SLURM --mem, PBS mem=)
    Gpu          *GpuSpec      // GPU requirements (nil = no GPU; Count = per node)
    Time         time.Duration // Job walltime limit
    Exclusive    bool          // Request exclusive node access (no other jobs on the same node)
}
If a directive cannot be represented in this schema (for example, low-level CPU topology constraints), CondaTainer enters passthrough mode: the script's directives cannot be regenerated, and condatainer run rejects the submission with an error directing you to submit manually with the native scheduler command. See Manual Submission below.
For MPI jobs, all tasks must share the same CPUs-per-task and memory-per-CPU. CondaTainer stores a single CPUs-per-task value and a single memory-per-CPU value for the entire job. PBS can use multiple chunks (select=...+...) for uneven task distributions.
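Two of the normalization rules implied by the field comments (falling back to Nodes*TasksPerNode when Ntasks is 0, and storing memory in MB) can be illustrated in shell. Both functions are hypothetical sketches rather than CondaTainer code, and `to_mb` assumes SLURM-style suffixes where G means 1024 MB and a bare number is already in MB.

```bash
# Hypothetical helpers mirroring the comments on the ResourceSpec fields.

# Total tasks: an explicit ntasks wins; 0 means "derive from nodes * tasks-per-node".
effective_ntasks() (
    nodes=$1 ntasks=$2 tasks_per_node=$3
    if [ "$ntasks" -gt 0 ]; then
        echo "$ntasks"
    else
        echo $((nodes * tasks_per_node))
    fi
)

# Convert a size such as 32G, 100M, or 512 (plain MB) to whole megabytes.
to_mb() (
    case $1 in
        *[Gg]) echo $(( ${1%?} * 1024 )) ;;   # strip the suffix, scale GB to MB
        *[Mm]) echo "${1%?}" ;;               # strip the suffix only
        *)     echo "$1" ;;                   # assume the value is already MB
    esac
)

effective_ntasks 2 0 4   # prints: 8
to_mb 32G                # prints: 32768
```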
Cross-Scheduler Translation
Once normalized, CondaTainer can emit directives in any target scheduler's syntax, enabling scripts written for one scheduler to run on a cluster running a different one.
Example: SLURM script submitted on a PBS cluster
# Original SLURM directives:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=02:00:00
# PBS equivalent generated at submission time:
#PBS -l select=2:ncpus=8:mpiprocs=4:mem=32768mb
#PBS -l walltime=02:00:00
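The mapping in this example follows a simple rule: ncpus per chunk is tasks-per-node times CPUs-per-task, and per-node memory is expressed in MB. A minimal sketch of the uniform case is shown below; `pbs_select` is a hypothetical function, not CondaTainer's actual emitter.

```bash
# Hypothetical sketch of the uniform case; not CondaTainer's actual emitter.
# Arguments: nodes, tasks per node, CPUs per task, memory per node in MB.
pbs_select() (
    nodes=$1 tasks_per_node=$2 cpus_per_task=$3 mem_per_node_mb=$4
    ncpus=$((tasks_per_node * cpus_per_task))   # cores needed by each chunk
    echo "#PBS -l select=${nodes}:ncpus=${ncpus}:mpiprocs=${tasks_per_node}:mem=${mem_per_node_mb}mb"
)

pbs_select 2 4 2 32768
# prints: #PBS -l select=2:ncpus=8:mpiprocs=4:mem=32768mb
```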
Uneven MPI distributions are reconstructed as multi-chunk PBS select:
# SLURM: 8 tasks across 3 nodes (3+3+2)
#SBATCH --nodes=3
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100M
# PBS: two chunks (2 full nodes + 1 remainder)
#PBS -l select=2:ncpus=3:mpiprocs=3:mem=300mb+1:ncpus=2:mpiprocs=2:mem=200mb
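One way to arrive at such a two-chunk layout is to spread the tasks as evenly as possible: the first (ntasks mod nodes) nodes carry one extra task. The sketch below reproduces the 8-tasks-on-3-nodes result under that assumption; the function names are hypothetical, CondaTainer's own distribution logic may differ, and the sketch assumes ntasks is at least the number of nodes.

```bash
# Hypothetical sketch: spread ntasks over nodes as evenly as possible and emit
# one PBS chunk per distinct shape. Assumes ntasks >= nodes.

# pbs_chunk COUNT TASKS CPUS_PER_TASK MEM_PER_CPU_MB
pbs_chunk() (
    cpus=$(( $2 * $3 ))
    echo "$1:ncpus=${cpus}:mpiprocs=$2:mem=$(( cpus * $4 ))mb"
)

# pbs_select_uneven NODES NTASKS CPUS_PER_TASK MEM_PER_CPU_MB
pbs_select_uneven() (
    nodes=$1 ntasks=$2 cpus_per_task=$3 mem_per_cpu_mb=$4
    base=$(( ntasks / nodes ))    # tasks on the less loaded nodes
    extra=$(( ntasks % nodes ))   # number of nodes carrying one more task
    small=$(pbs_chunk $(( nodes - extra )) "$base" "$cpus_per_task" "$mem_per_cpu_mb")
    if [ "$extra" -eq 0 ]; then
        echo "#PBS -l select=${small}"
    else
        big=$(pbs_chunk "$extra" $(( base + 1 )) "$cpus_per_task" "$mem_per_cpu_mb")
        echo "#PBS -l select=${big}+${small}"
    fi
)

pbs_select_uneven 3 8 1 100
# prints: #PBS -l select=2:ncpus=3:mpiprocs=3:mem=300mb+1:ncpus=2:mpiprocs=2:mem=200mb
```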
Two fields are intentionally dropped during translation:
Partition / queue: queue names are site-specific.
Unrecognized flags: directives with no cross-scheduler equivalent are dropped with a warning.
MPI Auto-Detection
When CondaTainer detects ntasks > 1, it automatically wraps the command with mpiexec:
# Generated job command:
mpiexec condatainer run mpi_job.sh
mpiexec must be available in PATH at submission time (e.g. load your MPI module before calling condatainer run). If it is not found, submission fails with an error.
Each MPI rank launches its own container, all sharing the same MPI communicator via the scheduler's process management interface.
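The wrapping rule and the PATH requirement can be summarized in shell as follows. This is an illustrative sketch only: `job_command` and `require_mpiexec` are hypothetical names, and the unwrapped form for single-task jobs is an assumption inferred from the generated command shown above.

```bash
# Hypothetical sketch of the wrapping rule; CondaTainer's internals may differ.
job_command() (
    ntasks=$1 script=$2
    if [ "$ntasks" -gt 1 ]; then
        echo "mpiexec condatainer run $script"   # multi-task: wrap with mpiexec
    else
        echo "condatainer run $script"           # assumed form for single-task jobs
    fi
)

# Pre-flight check mirroring the documented requirement that mpiexec is in PATH.
require_mpiexec() {
    command -v mpiexec >/dev/null 2>&1 || {
        echo "mpiexec not found in PATH; load your MPI module first" >&2
        return 1
    }
}

job_command 8 mpi_job.sh   # prints: mpiexec condatainer run mpi_job.sh
```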
Important
The OpenMPI version inside the container must match the major and minor version on the host.
ml av openmpi
# openmpi/4.1.5
condatainer e mpi.img -- mm-install mpi4py openmpi=4.1 -y
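A quick way to check the major.minor rule before submitting is to compare the first two version components. `same_major_minor` is a hypothetical helper; how you obtain the two version strings (for example, from the module name on the host and the package version in the container) is left to you.

```bash
# Hypothetical helper: succeed when two versions share the same major.minor.
same_major_minor() (
    a=$(echo "$1" | cut -d. -f1,2)   # e.g. 4.1.5 -> 4.1
    b=$(echo "$2" | cut -d. -f1,2)
    [ "$a" = "$b" ]
)

same_major_minor 4.1.5 4.1.2 && echo "compatible"   # prints: compatible
same_major_minor 4.1.5 5.0.1 || echo "mismatch"     # prints: mismatch
```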
Manual Submission
When a script triggers passthrough mode, condatainer run rejects it with an error. You must submit directly with the native scheduler command (sbatch, qsub, bsub).
To still run your payload inside a CondaTainer container in this case, use the self-bootstrap pattern: the script detects whether it is already running inside a container ($IN_CONDATAINER) and, if not, re-invokes itself via condatainer run. CondaTainer then resolves the #DEP: tags and starts the container.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=200M
#SBATCH --time=02:00:00
#DEP: mpi.img
# Not yet inside a container: re-invoke this script through condatainer run.
if [ -z "$IN_CONDATAINER" ] && command -v condatainer >/dev/null 2>&1; then
    if [ -n "$SLURM_JOB_ID" ]; then
        # Under SLURM, ask the scheduler for the path of the submitted script.
        FULL_COMMAND=$(scontrol show job "$SLURM_JOB_ID" | awk -F= '/Command=/{print $2}' | head -n 1)
        ORIGINAL_SCRIPT_PATH=$(echo "$FULL_COMMAND" | awk '{print $1}')
    else
        ORIGINAL_SCRIPT_PATH=$(realpath "$0")
    fi
    module purge && module load openmpi/4.1.5 && \
        mpiexec condatainer run "$ORIGINAL_SCRIPT_PATH"
    exit $?
fi
# Already inside the container: run the payload.
python my_script.py
Submit with the native scheduler, not condatainer run:
sbatch my_script.sh