sdiag - Scheduling diagnostic tool for Slurm
sdiag shows information related to slurmctld execution, covering
threads, agents, jobs, and the scheduling algorithms. The goal is to obtain
data about slurmctld behaviour that can help in adjusting configuration
parameters or queue policies, which is mainly of interest on systems with a
high job throughput.
It has two execution modes. The default mode, --all, shows
the counters and statistics explained below, and the --reset option resets
those values.
Values are also reset automatically at midnight UTC by default.
The first block of information is related to global slurmctld
execution:
- Server thread count
- The number of currently active slurmctld threads. A high number would mean
a high load processing events such as job submissions, job dispatching and
job completion. If this is often close to MAX_SERVER_THREADS it could
point to a potential bottleneck.
- Agent queue size
- Slurm is designed with scalability in mind, and sending messages to
thousands of nodes is not a trivial task. The agent mechanism helps to
control communication between the Slurm daemons and the controller on a
best-effort basis. If this value is close to MAX_AGENT_CNT there could be
delays affecting job management.
- Agent count
- Number of active agent threads.
- DBD Agent queue size
- Slurm queues up the messages intended for the SlurmDBD and processes them
in a separate thread. If the SlurmDBD, or the database, is down then this
number will increase. The maximum queue size is calculated as:
MAX(10000, ((max_job_cnt * 2) + (node_record_count * 4)))
If this number grows beyond half of the maximum queue size, the slurmdbd and
the database should be investigated immediately. (A worked example of the
formula appears after this list.)
- Jobs submitted
- Number of jobs submitted since last reset
- Jobs started
- Number of jobs started since last reset. This includes backfilled jobs.
- Jobs completed
- Number of jobs completed since last reset.
- Jobs canceled
- Number of jobs canceled since last reset.
- Jobs failed
- Number of jobs failed due to slurmd or other internal issues since last
reset.
- Job states ts:
- Lists the timestamp of when the following job state counts were gathered.
- Jobs pending:
- Number of jobs pending at the time given by the time stamp above.
- Jobs running:
- Number of jobs running at the time given by the time stamp above.
- Jobs running ts:
- Time stamp of when the running job count was taken.
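The following shell sketch plugs hypothetical values into the DBD agent queue
limit formula shown above. The numbers are placeholders only, and the
variable names merely mirror those in the formula; max_job_cnt is assumed to
track the controller's configured maximum job count.

    # Hypothetical sizes; substitute your cluster's actual values.
    max_job_cnt=10000        # assumed to follow the configured job count limit
    node_record_count=1000   # number of node records known to slurmctld

    derived=$(( (max_job_cnt * 2) + (node_record_count * 4) ))   # 24000
    if [ "$derived" -gt 10000 ]; then limit=$derived; else limit=10000; fi
    echo "DBD agent queue limit: $limit"                          # 24000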
The second block of information is related to the main scheduling
algorithm, which is based on job priorities. A scheduling cycle acquires the
job_write_lock lock, then tries to allocate resources to pending jobs,
starting with the highest priority job and proceeding in descending priority
order. Once a job cannot get its resources, the loop keeps going but only for
jobs requesting other partitions. Jobs with dependencies or jobs affected by
account limits are not processed.
- Last cycle
- Time in microseconds for last scheduling cycle.
- Max cycle
- Time in microseconds for the maximum scheduling cycle since last reset.
- Total cycles
- Number of scheduling cycles since last reset. Scheduling is done
periodically and also when a job is submitted or completed.
- Mean cycle
- Mean of scheduling cycles since last reset
- Mean depth cycle
- Mean of cycle depth. Depth means the number of jobs processed in a
scheduling cycle.
- Cycles per minute
- Counter of scheduling executions per minute
- Last queue length
- Length of the pending jobs queue.
- Latency for gettimeofday()
- Latency of 1000 calls to the gettimeofday() syscall in microseconds, as
measured at controller startup.
The third block of information is related to the backfilling
scheduling algorithm. A backfilling scheduling cycle acquires locks for the
job, node and partition objects and then tries to allocate resources to
pending jobs. Jobs are processed in priority order. If a job cannot get
resources, the algorithm calculates when it could get them, obtaining a
future start time for the job. The next job is then processed; the algorithm
tries to allocate resources to it without delaying the previously planned
jobs, and again calculates a future start time if the resources are not
currently available. The backfilling algorithm takes more time for each
additional job it processes, since the higher priority jobs must not be
delayed. The algorithm itself takes measures to avoid excessively long
execution cycles and to avoid holding all the locks for too long.
- Total backfilled jobs (since last slurm start)
- Number of jobs started thanks to backfilling since last slurm start.
- Total backfilled jobs (since last stats cycle start)
- Number of jobs started thanks to backfilling since the last time statistics
were reset. By default these values are reset at midnight UTC time.
- Total backfilled heterogeneous job components
- Number of heterogeneous job components started thanks to backfilling since
last Slurm start.
- Total cycles
- Number of scheduling cycles since last reset
- Last cycle when
- Time when the last execution cycle happened, in the format "weekday Month
MonthDay hour:minute.seconds year".
- Last cycle
- Time in microseconds of the last backfilling cycle. It counts only
execution time, excluding the sleep periods inserted within a scheduling
cycle when it runs too long. Note that locks are released during the sleep
time so that other work can proceed.
- Max cycle
- Time in microseconds of the maximum backfilling cycle execution since last
reset. It counts only execution time, excluding the sleep periods inserted
within a scheduling cycle when it runs too long. Note that locks are released
during the sleep time so that other work can proceed.
- Mean cycle
- Mean of backfilling scheduling cycles in microseconds since last reset
- Last depth cycle
- Number of jobs processed during the last backfilling scheduling cycle. It
counts every job considered, even those with no chance to start due to
dependencies or limits.
- Last depth cycle (try sched)
- Number of jobs processed during the last backfilling scheduling cycle. It
counts only jobs with a chance to start while waiting for available
resources. These are the jobs that make the backfilling algorithm heavier.
- Depth Mean
- Mean number of jobs processed during backfilling scheduling cycles since last
reset. Jobs which are found to be ineligible to run when examined by the
backfill scheduler are not counted (e.g. jobs submitted to multiple
partitions and already started, jobs which have reached a QOS or account
limit such as maximum running jobs for an account, etc).
- Depth Mean (try sched)
- The subset of Depth Mean that the backfill scheduler attempted to
schedule.
- Last queue length
- Number of jobs pending to be processed by the backfilling algorithm. A job
is counted once for each partition it requested. A pending job array will
normally be counted as one job (tasks of a job array which have already been
started/requeued or individually modified will already have individual job
records and are each counted as a separate job).
- Queue length Mean
- Mean number of jobs pending to be processed by the backfilling algorithm. A
job is counted once for each partition it requested. A pending job array will
normally be counted as one job (tasks of a job array which have already been
started/requeued or individually modified will already have individual job
records and are each counted as a separate job).
The fourth and fifth blocks of information report the most
frequently issued remote procedure calls (RPCs), the calls made to the
slurmctld daemon asking it to perform some action. The fourth block reports
the RPCs issued by message type. You can look those RPC codes up in the Slurm
source code, in the file src/common/slurm_protocol_defs.h. The report
includes the number of times each RPC was invoked, the total time consumed by
all of those RPCs and the average time consumed by each RPC, in microseconds.
The fifth block reports the RPCs issued by user ID: the total number of RPCs
each user has issued, the total time consumed by all of those RPCs and the
average time consumed by each RPC, in microseconds.
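For example, a message type reported in the RPC block can be located in that
header with a simple text search. The command below is only a sketch: it
assumes a Slurm source tree is available in the current directory and uses
REQUEST_JOB_INFO merely as a sample message type name.

    # Look up a sample RPC message type in the Slurm source tree.
    grep -n "REQUEST_JOB_INFO" src/common/slurm_protocol_defs.h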
The sixth block of information, labeled Pending RPC Statistics,
shows information about pending outgoing RPCs on the slurmctld agent queue.
The first section of this block shows types of RPCs on the queue and the
count of each. The second section shows up to the first 25 individual RPCs
pending on the agent queue, including the type and the destination host
list. This information is cached and only refreshed at 30 second intervals.
- -a, --all
- Get and report information. This is the default mode of operation.
- -h, --help
- Print description of options and exit.
- -i, --sort-by-id
- Sort Remote Procedure Call (RPC) data by message type ID and user ID.
- -r, --reset
- Reset counters. Only supported for Slurm operators and administrators.
- -t, --sort-by-time
- Sort Remote Procedure Call (RPC) data by total run time.
- -T, --sort-by-time2
- Sort Remote Procedure Call (RPC) data by average run time.
- --usage
- Print list of options and exit.
- -V, --version
- Print current version number and exit.
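Typical invocations combining the options above might look like the following
(resetting the counters requires operator or administrator privileges):

    sdiag                    # report all statistics (the default, --all)
    sdiag --sort-by-time     # sort the RPC blocks by total run time
    sdiag -T                 # sort the RPC blocks by average run time
    sdiag --reset            # reset the counters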
Some sdiag options may be set via environment variables.
These environment variables, along with their corresponding options, are
listed below. (Note: command-line options will always override these
settings.)
- SLURM_CONF
- The location of the Slurm configuration file.
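For example, to run sdiag against a configuration file in a non-default
location (the path below is only an illustration):

    SLURM_CONF=/etc/slurm/slurm.conf.test sdiag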
Copyright (C) 2010-2011 Barcelona Supercomputing Center.
Copyright (C) 2010-2017 SchedMD LLC.
Slurm is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
Slurm is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.