toil - Toil Documentation
Toil is an open-source pure-Python workflow engine that lets people write better pipelines.
Check out our website for a comprehensive list of Toil's features and read our paper to learn what Toil can do in the real world. Please subscribe to our low-volume announce mailing list and feel free to also join us on GitHub and Gitter.
If using Toil for your research, please cite our paper.
The Common Workflow Language (CWL) is an emerging standard for writing workflows that are portable across multiple workflow engines and platforms. Running CWL workflows using Toil is easy.
Copy and paste the following code block into example.cwl:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout
and this code into example-job.yaml:
message: Hello world!
Now run the workflow:

$ toil-cwl-runner example.cwl example-job.yaml
Your output will be in output.txt:
$ cat output.txt
Hello world!
Congratulations! You've run your first Toil workflow using the default Batch System, single_machine, and the default file job store (which was placed in a temporary directory for you by toil-cwl-runner).
Toil uses batch systems to manage the jobs it creates.
The single_machine batch system is primarily used to prepare and debug workflows on a local machine. Once validated, try running them on a full-fledged batch system (see Batch System API). Toil supports many different batch systems such as Kubernetes and Grid Engine; its versatility makes it easy to run your workflow in all kinds of places.
Toil's CWL runner is totally customizable! Run toil-cwl-runner --help to see a complete list of available options.
To learn more about CWL, see the CWL User Guide (from where this example was shamelessly borrowed). For information on using CWL with Toil see the section CWL in Toil. And for an example of CWL on an AWS cluster, have a look at Running a CWL Workflow on AWS.
The Workflow Description Language (WDL) is another emerging language for writing workflows that are portable across multiple workflow engines and platforms. Running WDL workflows using Toil is still in alpha and considered experimental. Toil currently supports basic workflow syntax (see WDL in Toil for more details and examples). Here we go over running a basic WDL helloworld workflow. Copy and paste the following code block into wdl-helloworld.wdl:
workflow write_simple_file {
call write_file
}
task write_file {
String message
command { echo ${message} > wdl-helloworld-output.txt }
output { File test = "wdl-helloworld-output.txt" }
}
and this code into wdl-helloworld.json:
{
"write_simple_file.write_file.message": "Hello world!"
}
Then run the workflow:

$ toil-wdl-runner wdl-helloworld.wdl wdl-helloworld.json
Your output will be in wdl-helloworld-output.txt:
$ cat wdl-helloworld-output.txt
Hello world!
This will, like the CWL example above, use the single_machine batch system and an automatically-located file job store by default. You can customize Toil's execution of the workflow with command-line options; run toil-wdl-runner --help to learn about them.
To learn more about WDL in general, see the Terra WDL documentation . For more on using WDL in Toil, see WDL in Toil.
In addition to workflow languages like CWL and WDL, Toil supports running workflows written against its Python API.
An example Toil Python workflow can be run with just three steps: install Toil, copy and paste the following code block into a new file called helloWorld.py, and run it.

from toil.common import Toil
from toil.job import Job


def helloWorld(message, memory="1G", cores=1, disk="1G"):
    return f"Hello, world!, here's a message: {message}"


if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    options.clean = "always"

    with Toil(options) as toil:
        output = toil.start(Job.wrapFn(helloWorld, "You did it!"))
    print(output)
$ python3 helloWorld.py file:my-job-store
For something beyond a "Hello, world!" example, refer to A (more) real-world example.
Toil's customization options are available in Python workflows. Run python3 helloWorld.py --help to see a complete list of available options.
For a more detailed example and explanation, we've developed a sample pipeline that merge-sorts a temporary file. This is not supposed to be an efficient sorting program, but rather a fully worked example of what Toil is capable of. Run it with:
$ python3 sort.py file:jobStore
The workflow created a file called sortedFile.txt in your current directory. Have a look at it and notice that it contains a whole lot of sorted lines!
This workflow does a smart merge sort on a file it generates, fileToSort.txt. The sort is smart because each step of the process---splitting the file into separate chunks, sorting these chunks, and merging them back together---is compartmentalized into a job. Each job can specify its own resource requirements and will only be run after the jobs it depends upon have run. Jobs without dependencies will be run in parallel.
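To make the shape of such a workflow concrete, here is a minimal sketch (hypothetical job functions, not the actual sort.py code) of parallel children with per-job resource requirements and a follow-on:

from toil.common import Toil
from toil.job import Job


def split(job):
    job.log("splitting")


def sort_chunk(job, n):
    job.log(f"sorting chunk {n}")


def merge_chunks(job):
    job.log("merging")


if __name__ == "__main__":
    options = Job.Runner.getDefaultArgumentParser().parse_args()
    root = Job.wrapJobFn(split, memory="512M", cores=1, disk="1G")
    # The two children may run in parallel once split has finished.
    root.addChildJobFn(sort_chunk, 0, memory="1G")
    root.addChildJobFn(sort_chunk, 1, memory="1G")
    # The follow-on runs only after root and all of its children are done.
    root.addFollowOnJobFn(merge_chunks)
    with Toil(options) as toil:
        toil.start(root)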
NOTE:
The sort workflow can take command line options that customize the run; for example:
$ python3 sort.py file:jobStore \
--numLines=5000 \
--lineLength=10 \
--overwriteOutput=True \
--workDir=/tmp/
Here we see that we can add our own options to a Toil Python workflow. The first two options, --numLines and --lineLength, determine the number of lines and how many characters are in each line. --overwriteOutput causes the current contents of sortedFile.txt to be overwritten, if it already exists. The last option, --workDir, is an option built into Toil to specify where temporary files unique to a job are kept.
To understand the details of what's going on inside, let's start with the main() function. It looks like a lot of code, but don't worry---we'll break it down piece by piece.
def main(options=None):
if not options:
# deal with command line arguments
parser = ArgumentParser()
Job.Runner.addToilOptions(parser)
parser.add_argument(
"--numLines",
default=defaultLines,
help="Number of lines in file to sort.",
type=int,
)
parser.add_argument(
"--lineLength",
default=defaultLineLen,
help="Length of lines in file to sort.",
type=int,
)
parser.add_argument("--fileToSort", help="The file you wish to sort")
parser.add_argument("--outputFile", help="Where the sorted output will go")
parser.add_argument(
"--overwriteOutput",
help="Write over the output file if it already exists.",
default=True,
)
parser.add_argument(
"--N",
dest="N",
help="The threshold below which a serial sort function is used to sort file. "
"All lines must of length less than or equal to N or program will fail",
default=10000,
)
parser.add_argument(
"--downCheckpoints",
action="store_true",
help="If this option is set, the workflow will make checkpoints on its way through"
'the recursive "down" part of the sort',
)
parser.add_argument(
"--sortMemory",
dest="sortMemory",
help="Memory for jobs that sort chunks of the file.",
default=None,
)
parser.add_argument(
"--mergeMemory",
dest="mergeMemory",
help="Memory for jobs that collate results.",
default=None,
)
options = parser.parse_args()
if not hasattr(options, "sortMemory") or not options.sortMemory:
options.sortMemory = sortMemory
if not hasattr(options, "mergeMemory") or not options.mergeMemory:
options.mergeMemory = sortMemory
# do some input verification
sortedFileName = options.outputFile or "sortedFile.txt"
if not options.overwriteOutput and os.path.exists(sortedFileName):
print(
f"Output file {sortedFileName} already exists. "
f"Delete it to run the sort example again or use --overwriteOutput=True"
)
exit()
fileName = options.fileToSort
if options.fileToSort is None:
# make the file ourselves
fileName = "fileToSort.txt"
if os.path.exists(fileName):
print(f"Sorting existing file: {fileName}")
else:
print(
f"No sort file specified. Generating one automatically called: {fileName}."
)
makeFileToSort(
fileName=fileName, lines=options.numLines, lineLen=options.lineLength
)
else:
if not os.path.exists(options.fileToSort):
raise RuntimeError("File to sort does not exist: %s" % options.fileToSort)
if int(options.N) <= 0:
raise RuntimeError("Invalid value of N: %s" % options.N)
# Now we are ready to run
with Toil(options) as workflow:
sortedFileURL = "file://" + os.path.abspath(sortedFileName)
if not workflow.options.restart:
sortFileURL = "file://" + os.path.abspath(fileName)
sortFileID = workflow.importFile(sortFileURL)
sortedFileID = workflow.start(
Job.wrapJobFn(
setup,
sortFileID,
int(options.N),
options.downCheckpoints,
options=options,
memory=sortMemory,
)
)
else:
sortedFileID = workflow.restart()
workflow.exportFile(sortedFileID, sortedFileURL)
First we make a parser to process command line arguments using the argparse module. It's important that we add the call to Job.Runner.addToilOptions() to initialize our parser with all of Toil's default options. Then we add the command line arguments unique to this workflow, and parse the input. The help message listed with the arguments should give you a pretty good idea of what they can do.
Next we do a little bit of verification of the input arguments. The option --fileToSort allows you to specify a file that needs to be sorted. If this option isn't given, it's here that we make our own file with the call to makeFileToSort().
Finally we come to the context manager that initializes the workflow. We create a path to the input file prepended with 'file://' as per the documentation for toil.common.Toil() when staging a file that is stored locally. Notice that we have to check whether or not the workflow is restarting so that we don't import the file more than once. Finally we can kick off the workflow by calling toil.common.Toil.start() on the job setup. When the workflow ends we capture its output (the sorted file's fileID) and use that in toil.common.Toil.exportFile() to move the sorted file from the job store back into "userland".
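Distilled out of the sort example, the import/start-or-restart/export pattern looks roughly like this (file names are illustrative):

import os

from toil.common import Toil
from toil.job import Job


def process(job, input_file_id):
    # A stand-in for real work; returns a file ID for the output.
    return input_file_id


if __name__ == "__main__":
    options = Job.Runner.getDefaultArgumentParser().parse_args()
    with Toil(options) as workflow:
        if not workflow.options.restart:
            # Stage a local file into the job store before starting.
            input_id = workflow.importFile("file://" + os.path.abspath("input.txt"))
            output_id = workflow.start(Job.wrapJobFn(process, input_id))
        else:
            output_id = workflow.restart()
        # Copy the result back out of the job store into "userland".
        workflow.exportFile(output_id, "file://" + os.path.abspath("output.txt"))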
Next let's look at the job that begins the actual workflow, setup.
def setup(job, inputFile, N, downCheckpoints, options):
"""
Sets up the sort.
Returns the FileID of the sorted file
"""
RealtimeLogger.info("Starting the merge sort")
return job.addChildJobFn(
down,
inputFile,
N,
"root",
downCheckpoints,
options=options,
preemptible=True,
memory=sortMemory,
).rv()
setup really only does two things. First it writes to the log using RealtimeLogger.info() and then calls addChildJobFn(). Child jobs run directly after the current job. This function turns the 'job function' down into an actual job and passes in the inputs, including an optional resource requirement, memory. The call to Job.rv() returns a promise for the child job's return value; once the job down finishes, its output is substituted for the promise and returned here.
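In isolation, the child-plus-promise pattern looks like this (a minimal sketch, not part of sort.py):

from toil.common import Toil
from toil.job import Job


def greet(job, name):
    return f"Hello, {name}!"


def root(job):
    # greet becomes a child of root; rv() is a promise for its return
    # value, which is fulfilled once greet has actually run.
    return job.addChildJobFn(greet, "Toil", memory="512M").rv()


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./promiseJobStore")
    with Toil(options) as toil:
        print(toil.start(Job.wrapJobFn(root)))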
Now we can look at what down does.
def down(job, inputFileStoreID, N, path, downCheckpoints, options, memory=sortMemory):
"""
Input is a file, a subdivision size N, and a path in the hierarchy of jobs.
If the range is larger than a threshold N the range is divided recursively and
a follow on job is then created which merges back the results else
the file is sorted and placed in the output.
"""
RealtimeLogger.info("Down job starting: %s" % path)
# Read the file
inputFile = job.fileStore.readGlobalFile(inputFileStoreID, cache=False)
length = os.path.getsize(inputFile)
if length > N:
# We will subdivide the file
RealtimeLogger.critical(
"Splitting file: %s of size: %s" % (inputFileStoreID, length)
)
# Split the file into two copies
midPoint = getMidPoint(inputFile, 0, length)
t1 = job.fileStore.getLocalTempFile()
with open(t1, "w") as fH:
fH.write(copySubRangeOfFile(inputFile, 0, midPoint + 1))
t2 = job.fileStore.getLocalTempFile()
with open(t2, "w") as fH:
fH.write(copySubRangeOfFile(inputFile, midPoint + 1, length))
# Call down recursively. By giving the rv() of the two jobs as inputs to the follow-on job, up,
# we communicate the dependency without hindering concurrency.
result = job.addFollowOnJobFn(
up,
job.addChildJobFn(
down,
job.fileStore.writeGlobalFile(t1),
N,
path + "/0",
downCheckpoints,
checkpoint=downCheckpoints,
options=options,
preemptible=True,
memory=options.sortMemory,
).rv(),
job.addChildJobFn(
down,
job.fileStore.writeGlobalFile(t2),
N,
path + "/1",
downCheckpoints,
checkpoint=downCheckpoints,
options=options,
preemptible=True,
memory=options.mergeMemory,
).rv(),
path + "/up",
preemptible=True,
options=options,
memory=options.sortMemory,
).rv()
else:
# We can sort this bit of the file
RealtimeLogger.critical(
"Sorting file: %s of size: %s" % (inputFileStoreID, length)
)
# Sort the copy and write back to the fileStore
shutil.copyfile(inputFile, inputFile + ".sort")
sort(inputFile + ".sort")
result = job.fileStore.writeGlobalFile(inputFile + ".sort")
RealtimeLogger.info("Down job finished: %s" % path)
return result
Down is the recursive part of the workflow. First we read the file into the local filestore by calling job.fileStore.readGlobalFile(). This puts a copy of the file in the temp directory for this particular job. This storage will disappear once this job ends. For a detailed explanation of the filestore, job store, and their interfaces have a look at Managing files within a workflow.
Next down checks the base case of the recursion: is the length of the input file less than or equal to N (remember N was an option we added to the workflow in main)? In the base case, we just sort the file and return the file ID of this new sorted file.
If the base case fails, then the file is split into two new temporary files using job.fileStore.getLocalTempFile() and the helper function copySubRangeOfFile. Finally we add a follow-on job up with job.addFollowOnJobFn(). We've already seen child jobs. A follow-on job is a job that runs after the current job and all of its children (and their children and follow-ons) have completed. Using a follow-on makes sense because up is responsible for merging the files together and we don't want to merge the files together until we know they are sorted. Again, the return value of the follow-on job is requested using Job.rv().
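The same dependency pattern, reduced to its essentials (hypothetical job functions): the children's promises are handed to the follow-on, which runs only after both children have finished.

from toil.common import Toil
from toil.job import Job


def child(job, x):
    return x * 2


def combine(job, a, b):
    return a + b


def parent(job):
    p1 = job.addChildJobFn(child, 1).rv()
    p2 = job.addChildJobFn(child, 2).rv()
    # combine runs only after both children; their promises are fulfilled by then.
    return job.addFollowOnJobFn(combine, p1, p2).rv()


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./followOnJobStore")
    with Toil(options) as toil:
        print(toil.start(Job.wrapJobFn(parent)))  # prints 6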
Looking at up
def up(job, inputFileID1, inputFileID2, path, options, memory=sortMemory):
"""
Merges the two files and places them in the output.
"""
RealtimeLogger.info("Up job starting: %s" % path)
with job.fileStore.writeGlobalFileStream() as (fileHandle, outputFileStoreID):
fileHandle = codecs.getwriter("utf-8")(fileHandle)
with job.fileStore.readGlobalFileStream(inputFileID1) as inputFileHandle1:
inputFileHandle1 = codecs.getreader("utf-8")(inputFileHandle1)
with job.fileStore.readGlobalFileStream(inputFileID2) as inputFileHandle2:
inputFileHandle2 = codecs.getreader("utf-8")(inputFileHandle2)
RealtimeLogger.info(
"Merging %s and %s to %s"
% (inputFileID1, inputFileID2, outputFileStoreID)
)
merge(inputFileHandle1, inputFileHandle2, fileHandle)
    # Clean up the input files - these deletes will occur after the completion is successful.
job.fileStore.deleteGlobalFile(inputFileID1)
job.fileStore.deleteGlobalFile(inputFileID2)
RealtimeLogger.info("Up job finished: %s" % path)
return outputFileStoreID
we see that the two input files are merged together and the output is written to a new file using job.fileStore.writeGlobalFileStream(). After a little cleanup, the output file is returned.
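The helper functions sort(), merge(), copySubRangeOfFile(), and getMidPoint() are ordinary Python and are not shown in this walkthrough; as a rough sketch of what the first two might look like (not the exact code from the example):

def sort(in_file):
    """Sort the lines of the given file, in place."""
    with open(in_file) as f:
        lines = f.readlines()
    lines.sort()
    with open(in_file, "w") as f:
        f.writelines(lines)


def merge(handle1, handle2, out_handle):
    """Merge two sorted inputs into one sorted output, line by line."""
    line1, line2 = handle1.readline(), handle2.readline()
    while line1 and line2:
        if line1 <= line2:
            out_handle.write(line1)
            line1 = handle1.readline()
        else:
            out_handle.write(line2)
            line2 = handle2.readline()
    # One input is exhausted; drain the leftover line and the rest.
    for line in (line1, line2):
        if line:
            out_handle.write(line)
    out_handle.write(handle1.read())
    out_handle.write(handle2.read())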
Once the final up finishes and all of the rv() promises are fulfilled, main receives the sorted file's ID which it uses in exportFile to send it to the user.
There are other things in this example that we didn't go over such as Checkpoints and the details of much of the Toil Class API.
At the end of the script the lines
if __name__ == '__main__':
    main()
are included to ensure that the main function is only run once in the '__main__' process invoked by you, the user. In Toil terms, by invoking the script you created the leader process in which the main() function is run. A worker process is a separate process whose sole purpose is to host the execution of one or more jobs defined in that script. In any Toil workflow there is always one leader process, and potentially many worker processes.
When using the single-machine batch system (the default), the worker processes will be running on the same machine as the leader process. With full-fledged batch systems like Kubernetes the worker processes will typically be started on separate machines. The boilerplate ensures that the pipeline is only started once---on the leader---but not when its job functions are imported and executed on the individual workers.
Typing python3 sort.py --help will show the complete list of arguments for the workflow which includes both Toil's and ones defined inside sort.py. A complete explanation of Toil's arguments can be found in Commandline Options.
By default, Toil logs a lot of information related to the current environment in addition to messages from the batch system and jobs. This can be configured with the --logLevel flag. For example, to only log CRITICAL level messages to the screen:
$ python3 sort.py file:jobStore \
--logLevel=critical \
--overwriteOutput=True
This hides most of the information we get from the Toil run. For more detail, we can run the pipeline with --logLevel=debug to see a comprehensive output. For more information, see Commandline Options.
With Toil, you can recover gracefully from a bug in your pipeline without losing any progress from successfully completed jobs. To demonstrate this, let's add a bug to our example code to see how Toil handles a failure and how we can resume a pipeline after that happens. Add a bad assertion as the first line of down():
def down(job, inputFileStoreID, N, path, downCheckpoints, options, memory=sortMemory):
    assert 1 == 2, "Test error!"
    ...
When we run the pipeline, Toil will show a detailed failure log with a traceback:
$ python3 sort.py file:jobStore
...
---TOIL WORKER OUTPUT LOG---
...
m/j/jobonrSMP    Traceback (most recent call last):
m/j/jobonrSMP      File "toil/src/toil/worker.py", line 340, in main
m/j/jobonrSMP        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1270, in _runner
m/j/jobonrSMP        returnValues = self._run(jobGraph, fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1217, in _run
m/j/jobonrSMP        return self.run(fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1383, in run
m/j/jobonrSMP        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
m/j/jobonrSMP      File "toil/example.py", line 30, in down
m/j/jobonrSMP        assert 1 == 2, "Test error!"
m/j/jobonrSMP    AssertionError: Test error!
If we try and run the pipeline again, Toil will give us an error message saying that a job store of the same name already exists. By default, in the event of a failure, the job store is preserved so that the workflow can be restarted, starting from the previously failed jobs. We can restart the pipeline by running
$ python3 sort.py file:jobStore \
--restart \
--overwriteOutput=True
We can also change the number of times Toil will attempt to retry a failed job:
$ python3 sort.py file:jobStore \
--retryCount 2 \
--restart \
--overwriteOutput=True
You'll now see Toil attempt to rerun the failed job until it runs out of tries. --retryCount is useful for non-systemic errors, like downloading a file that may experience a sporadic interruption, or some other non-deterministic failure.
To successfully restart our pipeline, we can edit our script to comment out the assertion we added, or remove it, and then run
$ python3 sort.py file:jobStore \
--restart \
--overwriteOutput=True
The pipeline will run successfully, and the job store will be removed on the pipeline's completion.
Please see the Status Command section for more on gathering runtime and resource info on jobs.
After installing Toil with the aws extra (see Installation) and setting up AWS (see Preparing your AWS environment), you can run the basic helloWorld.py script (Running a basic Python workflow) on a VM in AWS just by modifying the run command.
Note that when running in AWS, users can either run the workflow on a single instance or run it on a cluster (which is running across multiple containers on multiple AWS instances). For more information on running Toil workflows on a cluster, see Running in AWS.
Also! Remember to use the Destroy-Cluster Command when finished, to destroy the cluster! Otherwise resources may not be cleaned up properly.
Launch a cluster in AWS using the Launch-Cluster Command:

$ toil launch-cluster <cluster-name> \
--clusterType kubernetes \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-2a
The arguments keyPairName, leaderNodeType, and zone are required to launch a cluster.
Copy helloWorld.py to the /tmp directory on the cluster leader node using the Rsync-Cluster Command:

$ toil rsync-cluster --zone us-west-2a <cluster-name> helloWorld.py :/tmp
Note that the command requires defining the file to copy as well as the target location on the cluster leader node.
Login to the cluster leader node using the Ssh-Cluster Command:

$ toil ssh-cluster --zone us-west-2a <cluster-name>
Note that this command will log you in as the root user.
Run the Python workflow on the cluster, using an AWS job store:

$ python3 /tmp/helloWorld.py aws:us-west-2:my-S3-bucket
In this particular case, we create an S3 bucket called my-S3-bucket in the us-west-2 region to store intermediate job results.
Along with some other INFO log messages, you should get the following output in your terminal window: Hello, world!, here's a message: You did it!.
Exit from the SSH connection:

$ exit
Use the Destroy-Cluster Command to destroy the cluster:

$ toil destroy-cluster --zone us-west-2a <cluster-name>
Note that this command will destroy the cluster leader node and any resources created to run the job, including the S3 bucket.
After installing Toil with the aws and cwl extras (see Installation) and setting up AWS (see Preparing your AWS environment), you can run a CWL workflow with Toil on AWS.
Also! Remember to use the Destroy-Cluster Command when finished, to destroy the cluster! Otherwise resources may not be cleaned up properly.
First, launch a cluster:

$ toil launch-cluster <cluster-name> \
--clusterType kubernetes \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-2a
Copy example.cwl and example-job.yaml to the cluster leader node:

$ toil rsync-cluster --zone us-west-2a <cluster-name> example.cwl :/tmp
$ toil rsync-cluster --zone us-west-2a <cluster-name> example-job.yaml :/tmp
$ toil ssh-cluster --zone us-west-2a <cluster-name>
$ sudo apt-get update
$ sudo apt-get -y upgrade
$ sudo apt-get -y dist-upgrade
$ sudo apt-get -y install git
$ virtualenv --system-site-packages venv
$ source venv/bin/activate
(venv) $ toil-cwl-runner \
--provisioner aws \
--batchSystem kubernetes \
--jobStore aws:us-west-2:any-name \
/tmp/example.cwl /tmp/example-job.yaml
TIP:
When finished, don't forget to destroy the cluster:

$ toil destroy-cluster --zone us-west-2a <cluster-name>
Cactus is a reference-free, whole-genome multiple alignment program that can be run on any of the cloud platforms Toil supports.
NOTE:
This example provides a "cloud agnostic" view of running Cactus with Toil. Most options will not change between cloud providers. However, each provisioner has unique inputs for --leaderNodeType, --nodeType and --zone. We recommend the following:
| Option           | Used in        | AWS        | GCE           |
| --leaderNodeType | launch-cluster | t2.medium  | n1-standard-1 |
| --zone           | launch-cluster | us-west-2a | us-west1-a    |
| --zone           | cactus         | us-west-2  | us-west1      |
| --nodeType       | cactus         | c3.4xlarge | n1-standard-8 |
When executing toil launch-cluster with gce specified for --provisioner, the option --boto must be specified and given a path to your .boto file. See Running in Google Compute Engine (GCE) for more information about the --boto option.
Also! Remember to use the Destroy-Cluster Command command when finished to destroy the cluster! Otherwise things may not be cleaned up properly.
$ toil launch-cluster <cluster-name> \
--provisioner <aws, gce> \
--keyPairName <key-pair-name> \
--leaderNodeType <type> \
--nodeType <type> \
-w 1-2 \
--zone <zone>
NOTE:
When using AWS, setting the TOIL_AWS_ZONE environment variable eliminates having to specify the --zone option for each command. This will be supported for GCE in the future.
$ export TOIL_AWS_ZONE=us-west-2c
$ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
$ mkdir /root/cact_ex
$ exit
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> pestis-short-aws-seqFile.txt :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000169655.1_ASM16965v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000006645.1_ASM664v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000182485.1_ASM18248v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000013805.1_ASM1380v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> setup_leaderNode.sh :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim1.xml :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim3.xml :/root/cact_ex
$ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
$ bash /root/cact_ex/setup_leaderNode.sh
$ source cact_venv/bin/activate
(cact_venv) $ cd cactus
(cact_venv) $ pip install --upgrade .
(cact_venv) $ cactus \
--retry 10 \
--batchSystem kubernetes \
--logDebug \
--logFile /logFile_pestis3 \
    --configFile /root/cact_ex/blockTrim3.xml \
    <aws, google>:<zone>:cactus-pestis \
/root/cact_ex/pestis-short-aws-seqFile.txt \
/root/cact_ex/pestis_output3.hal
NOTE:
--logDebug --- equivalent to --logLevel DEBUG.
--logFile /logFile_pestis3 --- writes logs to a file named logFile_pestis3 under the / directory.
--configFile --- optional; use it to run the alignment with a specific configuration file.
<aws, google>:<zone>:cactus-pestis --- creates a bucket, named cactus-pestis, with the specified cloud provider to store intermediate job files and metadata. NOTE: If you want to use a GCE-based jobstore, specify google here, not gce.
The result file, named pestis_output3.hal, is stored in the /root/cact_ex folder of the leader node.
Use cactus --help to see all the Cactus and Toil flags available.
(cact_venv) $ exit
(venv) $ toil rsync-cluster \
--provisioner <aws, gce> <cluster-name> \
:/root/cact_ex/pestis_output3.hal \
<path-of-folder-on-local-machine>
(venv) $ toil destroy-cluster --provisioner <aws, gce> <cluster-name>
The Common Workflow Language (CWL) is an emerging standard for writing workflows that are portable across multiple workflow engines and platforms. Toil has full support for the CWL v1.0, v1.1, and v1.2 standards.
You can use Toil to run CWL workflows or develop and test new ones.
The toil-cwl-runner command provides CWL parsing functionality using cwltool, and leverages the job-scheduling and batch system support of Toil. You can use it to run CWL workflows locally or in the cloud.
To run in local batch mode, provide the CWL file and the input object file:
$ toil-cwl-runner example.cwl example-job.yml
For a simple example of CWL with Toil see Running a basic CWL workflow.
When invoking CWL documents that make use of Docker containers, if you see errors that look like
docker: Error response from daemon: Mounts denied: The paths /var/...tmp are not shared from OS X and are not known to Docker.
you may need to add
export TMPDIR=/tmp/docker_tmp
either in your startup file (.bashrc) or manually in your shell before invoking Toil.
Help information can be found by using this toil command:
$ toil-cwl-runner -h
A more detailed example shows how we can specify both Toil and cwltool arguments for our workflow:
$ toil-cwl-runner \
--singularity \
--jobStore my_jobStore \
--batchSystem lsf \
--workDir `pwd` \
--outdir `pwd` \
--logFile cwltoil.log \
--writeLogs `pwd` \
--logLevel DEBUG \
--retryCount 2 \
--maxLogFileSize 20000000000 \
--stats \
standard_bam_processing.cwl \
inputs.yaml
In this example, we set the following options, which are all passed to Toil:
--singularity: Specifies that all jobs with Docker format containers specified should be run using the Singularity container engine instead of the Docker container engine.
--jobStore: Path to a folder which doesn't exist yet, which will contain the Toil jobstore and all related job-tracking information.
--batchSystem: Use the specified HPC or Cloud-based cluster platform.
--workDir: The directory where all temporary files will be created for the workflow. A subdirectory of this will be set as the $TMPDIR environment variable and this subdirectory can be referenced using the CWL parameter reference $(runtime.tmpdir) in CWL tools and workflows.
--outdir: Directory where final File and Directory outputs will be written. References to these and other output types will be in the JSON object printed to the stdout stream after workflow execution.
--logFile: Path to the main logfile.
--writeLogs: Directory where job logs will be stored. At DEBUG log level, this will contain logs for each Toil job run, as well as stdout/stderr logs for each CWL CommandLineTool that didn't use the stdout/stderr directives to redirect output.
--retryCount: How many times to retry each Toil job.
--maxLogFileSize: Logs that get larger than this value will be truncated.
--stats: Save resource usage in JSON files that can be collected with the toil stats command after the workflow is done.
Besides the normal Toil options and the options supported by cwltool, toil-cwl-runner adds some of its own options; run toil-cwl-runner --help to see them.
To run in cloud and HPC configurations, you may need to provide additional command line parameters to select and configure the batch system to use.
To run a CWL workflow in AWS with toil see Running a CWL Workflow on AWS.
Some CWL workflows use the InplaceUpdateRequirement feature, which requires that operations on files have visible side effects that Toil's file store cannot support. If you need to run a workflow like this, you can make sure that all of your worker nodes have a shared filesystem, and use the --bypass-file-store option to toil-cwl-runner. This will make it leave all CWL intermediate files on disk and share them between jobs using file paths, instead of storing them in the file store and downloading them when jobs need them.
See logs for just one job by using the full log file
This requires knowing the job's toil-generated ID, which can be found in the log files.
cat cwltoil.log | grep jobVM1fIs
Grep for full tool commands from toil logs
This gives you a more concise view of the commands being run (note that this information is only available from Toil when running with --logDebug).
pcregrep -M "\[job .*\.cwl.*$\n(.* .*$\n)*" cwltoil.log # ^allows for multiline matching
Find Bams that have been generated for specific step while pipeline is running:
find . | grep -P '^./out_tmpdir.*_MD\.bam$'
See what jobs have been run
cat log/cwltoil.log | grep -oP "\[job .*.cwl\]" | sort | uniq
or:
cat log/cwltoil.log | grep -i "issued job"
Get status of a workflow
$ toil status /home/johnsoni/TEST_RUNS_3/TEST_run/tmp/jobstore-09ae0acc-c800-11e8-9d09-70106fb1697e
<hostname> 2018-10-04 15:01:44,184 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
<hostname> 2018-10-04 15:01:44,185 MainThread INFO toil.utils.toilStatus: Parsed arguments
<hostname> 2018-10-04 15:01:47,081 MainThread INFO toil.utils.toilStatus: Traversing the job graph gathering jobs. This may take a couple of minutes.

Of the 286 jobs considered, there are 179 jobs with children, 107 jobs ready to run, 0 zombie jobs, 0 jobs with services, 0 services, and 0 jobs with log files currently in file:/home/user/jobstore-09ae0acc-c800-11e8-9d09-70106fb1697e.
Toil Stats
You can get run statistics broken down by CWL file. This only works once the workflow is finished:
$ toil stats /path/to/jobstore
This will report resource usage information for all the CWL jobs executed by the workflow.
See Stats Command for an explanation of what the different fields mean.
Understanding toil log files
There is a worker_log.txt file for each Toil job. This file is written to while the job is running, and uploaded at the end if the job finishes or if running at debug log level. If uploaded, the contents are printed to the main log file and transferred to a log file in the --logDir folder.
The new log file will be named something like:
CWLJob_<name of the CWL job>_<attempt number>.log
Standard output/error files will be named like:
<name of the CWL job>.stdout_<attempt number>.log
If you have a workflow revsort.cwl which has a step rev which calls the tool revtool.cwl, the CWL job name ends up being all those parts strung together with .: revsort.cwl.rev.revtool.cwl.
The Workflow Description Language (WDL) is a programming language designed for writing workflows that execute a set of tasks in a pipeline distributed across multiple computers. Workflows enable scientific analyses to be reproducible, by wrapping up a whole sequence of commands, whose outputs feed into other commands, into a workflow that can be executed the same way every time.
Toil can be used to run and to develop WDL workflows. The Toil team also maintains a set of WDL conformance tests for evaluating Toil and other WDL runners.
Toil has beta support for running WDL workflows, using the toil-wdl-runner command. This command comes with the [wdl] extra; see Installing Toil with Extra Features for how to install it if you do not have it.
You can run WDL workflows with toil-wdl-runner. Currently, toil-wdl-runner works by using MiniWDL to parse and interpret the WDL workflow, and has support for workflows in WDL 1.0 or later (which are required to declare a version, and which use inputs and outputs sections).
TIP:
Toil is, for compatible workflows, a drop-in replacement for the Cromwell WDL runner. Instead of running a workflow with Cromwell:
java -jar Cromwell.jar run myWorkflow.wdl --inputs myWorkflow_inputs.json
You can run the workflow with toil-wdl-runner:
toil-wdl-runner myWorkflow.wdl --input myWorkflow_inputs.json
(We're here running Toil with --input, but it can also accept the Cromwell-style --inputs.)
This will default to executing on the current machine, with a job store in an automatically determined temporary location, but you can add a few Toil options to use other Toil-supported batch systems, such as Kubernetes:
toil-wdl-runner --jobStore aws:us-west-2:wdl-job-store --batchSystem kubernetes myWorkflow.wdl --input myWorkflow_inputs.json
For Toil, the --input is optional, and inputs can be passed as a positional argument:
toil-wdl-runner myWorkflow.wdl myWorkflow_inputs.json
You can also run workflows from URLs. For example, to run the MiniWDL self test workflow, you can do:
toil-wdl-runner https://raw.githubusercontent.com/DataBiosphere/toil/36b54c45e8554ded5093bcdd03edb2f6b0d93887/src/toil/test/wdl/miniwdl_self_test/self_test.wdl https://raw.githubusercontent.com/DataBiosphere/toil/36b54c45e8554ded5093bcdd03edb2f6b0d93887/src/toil/test/wdl/miniwdl_self_test/inputs.json
--jobStore: Specifies where to keep the Toil state information while running the workflow. Must be accessible from all machines.
-o or --outputDirectory: Specifies the output folder or URI prefix to save workflow output files in. Defaults to a new directory in the current directory.
-m or --outputFile: Specifies a JSON file name or URI to save workflow output values at. Defaults to standard output.
-i, --input, or --inputs: Alternative to the positional argument for the input JSON file, for compatibility with other WDL runners.
--outputDialect: Specifies an output format dialect. Can be cromwell to just return the workflow's output values as JSON, or miniwdl to nest that under an outputs key and include a dir key.
--referenceInputs: Specifies whether input files to Toil should be passed around by URL reference instead of being imported into Toil's storage. Defaults to off. Can be True or False or other similar words.
--container: Specifies the container engine to use to run tasks. By default this is auto, which tries Singularity if it is installed and Docker if it isn't. Can also be set to docker or singularity explicitly.
--allCallOutputs: Specifies whether outputs from all calls in a workflow should be included alongside the outputs from the output section, when an output section is defined. For strict WDL spec compliance, should be set to False. Usually defaults to False. If the workflow includes metadata for the Cromwell Output Organizer (croo), will default to True.
Any number of other Toil options may also be specified. For defined Toil options, see Commandline Options.
At the default settings, if a WDL task succeeds, the standard output and standard error will be printed in the toil-wdl-runner output, unless they are captured by the workflow (with the stdout() and stderr() WDL built-in functions). If a WDL task fails, they will be printed whether they were meant to be captured or not. Complete logs from Toil for failed jobs will also be printed.
If you would like to save the logs organized by WDL task, you can use the --writeLogs or --writeLogsGzip options to specify a directory where the log files should be saved. Log files will be named after the same dotted, hierarchical workflow and task names used to set values from the input JSON, except that scatters will add an additional numerical component. In addition to the logs for WDL tasks, Toil job logs for failed jobs will also appear here when running at the default log level.
For example, if you run:
toil-wdl-runner --writeLogs logs https://raw.githubusercontent.com/DataBiosphere/toil/36b54c45e8554ded5093bcdd03edb2f6b0d93887/src/toil/test/wdl/miniwdl_self_test/self_test.wdl https://raw.githubusercontent.com/DataBiosphere/toil/36b54c45e8554ded5093bcdd03edb2f6b0d93887/src/toil/test/wdl/miniwdl_self_test/inputs.json
You will end up with a logs/ directory containing:
hello_caller.0.hello.stderr_000.log
hello_caller.1.hello.stderr_000.log
hello_caller.2.hello.stderr_000.log
The final number is a sequential counter: if a step has to be retried, or if you run the workflow multiple times without clearing out the logs directory, it will increment.
Toil can be used as a development tool for writing and locally testing WDL workflows. These workflows can then be run on Toil against a cloud or cluster backend, or used with other WDL implementations such as Terra, Cromwell, or MiniWDL.
The easiest way to get started with writing WDL workflows is by following a tutorial.
The UCSC Genomics Institute (home of the Toil project) has a tutorial on writing WDL workflows with Toil. You can follow this tutorial to be walked through writing your own WDL workflow with Toil. They also have tips on debugging WDL workflows with Toil.
These tutorials and tips are aimed at users looking to run WDL workflows with Toil in a Slurm environment, but they can also apply in other situations.
You can also learn to write WDL workflows for Toil by following the official WDL tutorials.
When you reach the point of executing your workflow, instead of running with Cromwell:
java -jar Cromwell.jar run myWorkflow.wdl --inputs myWorkflow_inputs.json
you can instead run with toil-wdl-runner:
toil-wdl-runner myWorkflow.wdl --input myWorkflow_inputs.json
For people who prefer video tutorials, Lynn Langit has a Learn WDL video course that will teach you how to write and run WDL workflows. The course is taught using Cromwell, but Toil should also be compatible with the course's workflows.
WDL language specifications can be found here: https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md
Toil is not yet fully conformant with the WDL specification, but it inherits most of the functionality of MiniWDL.
The Toil team maintains a set of WDL Conformance Tests. Much like the CWL Conformance Tests for CWL, the WDL Conformance Tests are useful for determining if a WDL implementation actually follows the WDL specification.
The WDL Conformance Tests include a runner harness that is able to test toil-wdl-runner, as well as Cromwell and MiniWDL, and supports testing conformance with the 1.1, 1.0, and draft-2 versions of WDL.
If you would like to evaluate Toil's WDL conformance for yourself, first make sure that you have toil-wdl-runner installed. It comes with the [wdl] extra; see Installing Toil with Extra Features.
Then, you can check out the test repository:
$ git clone https://github.com/DataBiosphere/wdl-conformance-tests
$ cd wdl-conformance-tests
Most tests will need a Docker daemon available, so make sure yours is working properly:
$ docker info
$ docker run --rm docker/whalesay cowsay "Docker is working"
Then, you can test toil-wdl-runner against a particular WDL spec version, say 1.1:
$ python3 run.py --runner toil-wdl-runner --versions 1.1
For any failed tests, the test number and the log of the failing test will be reported.
After the tests run, you can clean up intermediate files with:
$ make clean
For more options, see:
$ python3 run.py --help
Or, consult the conformance test documentation.
Toil runs in various environments, including locally and in the cloud (Amazon Web Services and Google Compute Engine). Toil also supports workflows written in two DSLs: CWL and WDL, as well as workflows written in Python (see Developing a Python Workflow).
Toil is built in a modular way so that it can be used on lots of different systems, and with different configurations. The three configurable pieces are the job store, the batch system, and the provisioner.
The job store is a storage abstraction which contains all of the information used in a Toil run. This centralizes all of the files used by jobs in the workflow and also the details of the progress of the run. If a workflow crashes or fails, the job store contains all of the information necessary to resume with minimal repetition of work.
Several different job stores are supported, including the file job store and cloud job stores. For information on developing job stores, see Job Store API.
The file job store is for use locally, and keeps the workflow information in a directory on the machine where the workflow is launched. This is the simplest and most convenient job store for testing or for small runs.
For an example that uses the file job store, see Running a basic CWL workflow.
Toil currently supports cloud storage systems such as Amazon S3 (aws: job stores) and Google Cloud Storage (google: job stores). These use cloud buckets to house all of the files. This is useful if there are several different worker machines all running jobs that need to access the job store.
A Toil batch system is either a local single machine (one computer) or a currently supported cluster of computers (lsf, mesos, slurm, torque, htcondor, or grid_engine). These environments manage individual worker nodes under a leader node to process the work required in a workflow. The leader and its workers all coordinate their tasks and files through a centralized job store location.
See Batch System API for a more detailed description of different batch systems, or information on developing batch systems.
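For Python workflows, the batch system can also be set programmatically on the options object before constructing Toil; a minimal sketch (the jobStore path and choice of slurm are illustrative):

from toil.common import Toil
from toil.job import Job


def hello(job):
    job.log("running on the chosen batch system")


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobStore")
    options.batchSystem = "slurm"  # or "kubernetes", "grid_engine", "single_machine", ...
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(hello))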
The Toil provisioner provides a tool set for running a Toil workflow on a particular cloud platform.
The Toil Cluster Utilities are command line tools used to provision nodes in your desired cloud platform. They allow you to launch nodes, ssh to the leader, and rsync files back and forth.
For detailed instructions for using the provisioner see Running in AWS or Running in Google Compute Engine (GCE).
A quick way to see all of Toil's commandline options is by executing the following on a workflow language front-end:
$ toil-wdl-runner --help
Or a Toil Python workflow:
$ python3 example.py --help
For a basic Toil workflow, Toil has one mandatory argument, the job store. All other arguments are optional.
Instead of changing the arguments on the command line, Toil offers support for using a configuration file.
Options are applied with the following priority: command line options first, then values from the configuration file, then Toil's built-in defaults.
You can manually generate an example configuration file to a path you select. To generate a configuration file, run:
$ toil config [filename].yaml
Then uncomment options as necessary and change/provide new values.
After editing the config file, you can run Toil with its settings by passing it on the command line:
$ python3 example.py --config=[filename].yaml
Alternatively, you can edit the default config file, which is located at $HOME/.toil/default.yaml
If CLI options are used in addition to the configuration file, the CLI options will overwrite the configuration file options. For example:
$ python3 example.py --config=[filename].yaml --defaultMemory 80Gi
This will result in a default memory per job of 80GiB no matter what is in the configuration file provided.
Running Toil workflows requires a file path or URL to a central location for all of the intermediate files for the workflow: the job store. For toil-cwl-runner and toil-wdl-runner a job store can often be selected automatically or can be specified with the --jobStore option; Toil Python workflows generally require the job store as a positional command line argument. To use the Python quickstart example, if you're on a node that has a large /scratch volume, you can specify that the jobstore be created there by executing: python3 helloWorld.py /scratch/my-job-store, or more explicitly, python3 helloWorld.py file:/scratch/my-job-store.
Syntax for specifying different job stores:
AWS: aws:region-here:job-store-name
Google: google:projectID-here:job-store-name
Different types of job store options can be found below.
Core Toil Options Options to specify the location of the Toil workflow and turn on stats collation about the performance of jobs.
Logging Options Toil hides stdout and stderr by default except in case of job failure. Log levels in Toil are based on priority from the logging module: CRITICAL, ERROR, WARNING, INFO, and DEBUG.
Batch System Options
Data Storage Options Allows configuring Toil's data storage.
Autoscaling Options Allows the specification of the minimum and maximum number of nodes in an autoscaled cluster, as well as parameters to control the level of provisioning.
Service Options Allows the specification of the maximum number of service jobs in a cluster. By keeping this limited we can avoid nodes occupied with services causing deadlocks. (Not for CWL).
Resource Options The options to specify default cores/memory requirements (if not specified by the jobs themselves), and to limit the total amount of memory/cores requested from the batch system.
Options for rescuing/killing/restarting jobs. The options for jobs that either run too long/fail or get lost (some batch systems have issues!).
Log Management Options
Miscellaneous Options
Debug Options Debug options for finding problems or helping with testing.
In the event of failure, Toil can resume the pipeline by adding the argument --restart and rerunning the workflow. Toil Python workflows (but not CWL or WDL workflows) can even be edited and resumed, which is useful for development or troubleshooting.
Toil supports jobs, or clusters of jobs, that run as services to other accessor jobs. Example services include server databases or Apache Spark clusters. As service jobs exist to provide services to accessor jobs, their runtime is dependent on the concurrent running of their accessor jobs. The dependencies between services and their accessor jobs can create potential deadlock scenarios, where the workflow hangs because only service jobs are running and their accessor jobs cannot be scheduled because there are too few resources to run both simultaneously. To cope with this situation Toil attempts to schedule services and accessors intelligently; however, to avoid a deadlock with workflows running service jobs it is advisable to use the following parameters:
Specifying these parameters so that at a maximum cluster size there will be sufficient resources to run accessors in addition to services will ensure that such a deadlock can not occur.
If too low a limit is specified then a deadlock can occur in which Toil cannot schedule sufficient service jobs concurrently to complete the workflow. Toil will detect this situation if it occurs and throw a toil.DeadlockException exception. Increasing the cluster size and these limits will resolve the issue.
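For reference, services are written against the toil.job.Job.Service class: a service implements start(), stop(), and check(), and an accessor receives whatever start() returns through the promise from addService(). A minimal sketch (a toy service, not a recommended production setup):

from toil.common import Toil
from toil.job import Job


class DemoService(Job.Service):
    def start(self, job):
        # Bring the service up; the return value is delivered to
        # accessors through the promise from addService().
        return "localhost:8080"

    def stop(self, job):
        # Shut the service down once all accessors have finished.
        pass

    def check(self):
        # Report whether the service is still healthy.
        return True


def accessor(job, address):
    job.log(f"using service at {address}")


def root(job):
    address = job.addService(DemoService())
    # The accessor child runs while the service stays up.
    job.addChildJobFn(accessor, address)


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./serviceJobStore")
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(root))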
It's good to remember that commandline options can be overridden in the code of a Python workflow. For example, toil.job.Job.Runner.getDefaultOptions() can be used to get the default Toil options, ignoring what was passed on the command line. In this example, this is used to ignore command-line options and always run with the "./toilWorkflow" directory as the jobstore:
options = Job.Runner.getDefaultOptions("./toilWorkflow") # Get the options object
with Toil(options) as toil:
toil.start(Job()) # Run the root job
However, each option can be explicitly set within the workflow by modifying the options object. In this example, we are setting logLevel = "DEBUG" (all log statements are shown) and clean="ALWAYS" (always delete the jobstore) like so:
options = Job.Runner.getDefaultOptions("./toilWorkflow") # Get the options object
options.logLevel = "DEBUG" # Set the log level to the debug level.
options.clean = "ALWAYS" # Always delete the jobStore after a run
with Toil(options) as toil:
toil.start(Job()) # Run the root job
However, the usual incantation is to accept commandline args from the user with the following:
parser = Job.Runner.getDefaultArgumentParser()  # Get the parser
options = parser.parse_args()  # Parse user args to create the options object

with Toil(options) as toil:
    toil.start(Job())  # Run the root job
We can also have code in the workflow to overwrite user supplied arguments:
parser = Job.Runner.getDefaultArgumentParser()  # Get the parser
options = parser.parse_args()  # Parse user args to create the options object
options.logLevel = "DEBUG"  # Set the log level to the debug level.
options.clean = "ALWAYS"  # Always delete the jobStore after a run

with Toil(options) as toil:
    toil.start(Job())  # Run the root job
Toil includes some utilities for inspecting or manipulating workflows during and after their execution. (There are additional Toil Cluster Utilities available for working with Toil-managed clusters in the cloud.)
The generic toil subcommand utilities are:
status --- Inspects a job store to see which jobs have failed, run successfully, etc.
debug-job --- Runs a failing job on your local machine.
clean --- Delete the job store used by a previous Toil workflow invocation.
kill --- Kills any running jobs in a rogue toil.
For information on a specific utility, run it with the --help option:
toil stats --help
To use the stats command, a workflow must first be run using the --stats option. Using this option makes certain that Toil does not delete the job store, no matter what other options are specified (i.e. normally the option --clean=always would delete the job store, but --stats will override this).
We can run an example workflow and record stats:
python3 tutorial_stats.py file:my-jobstore --stats
Where tutorial_stats.py is the following:
import math
import time
from multiprocessing import Process

from toil.common import Toil
from toil.job import Job


def think(seconds):
    start = time.time()
    while time.time() - start < seconds:
        # Use CPU
        math.sqrt(123456)


class TimeWaster(Job):
    def __init__(self, time_to_think, time_to_waste, space_to_waste, *args, **kwargs):
        self.time_to_think = time_to_think
        self.time_to_waste = time_to_waste
        self.space_to_waste = space_to_waste
        super().__init__(*args, **kwargs)

    def run(self, fileStore):
        # Waste some space
        file_path = fileStore.getLocalTempFile()
        with open(file_path, "w") as stream:
            for i in range(self.space_to_waste):
                stream.write("X")

        # Do some "useful" compute
        processes = []
        for core_number in range(max(1, self.cores)):
            # Use all the assigned cores to think
            p = Process(target=think, args=(self.time_to_think,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        # Also waste some time
        time.sleep(self.time_to_waste)


def main():
    options = Job.Runner.getDefaultArgumentParser().parse_args()

    job1 = TimeWaster(0, 0, 0, displayName="doNothing")
    job2 = TimeWaster(10, 0, 4096, displayName="efficientJob")
    job3 = TimeWaster(10, 0, 1024, cores=4, displayName="multithreadedJob")
    job4 = TimeWaster(1, 9, 65536, displayName="inefficientJob")

    job1.addChild(job2)
    job1.addChild(job3)
    job3.addChild(job4)

    with Toil(options) as toil:
        if not toil.options.restart:
            toil.start(job1)
        else:
            toil.restart()


if __name__ == "__main__":
    main()
Notice the displayName key, which can rename a job, giving it an alias when it is finally displayed in stats.
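Since TimeWaster forwards *args/**kwargs to Job.__init__, displayName here is just the standard Job option; it can be set on any job, for example:

from toil.job import Job

# Any job accepts displayName; it replaces the class or function name
# in the stats report.
j = Job(memory="1G", cores=1, displayName="myAlias")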
To see the runtime and resources used for each job when it was run, type
toil stats file:my-jobstore
This should output something like the following:
Batch System: single_machine
Default Cores: 1
Default Memory: 2097152KiB
Max Cores: unlimited
Local CPU Time: 55.54 core·s
Overall Runtime: 26.23 s

Worker
 Count |  Real Time (s)* |  CPU Time (core·s) |  CPU Wait (core·s) |  Memory (B) |  Disk (B)
     n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
     3 | 0.34 10.83 10.80 21.23 32.40 | 0.33 10.43 17.94 43.07 53.83 | 0.01 0.40 14.08 41.85 42.25 | 177168Ki 179312Ki 178730Ki 179712Ki 536192Ki | 0Ki 4Ki 22Ki 64Ki 68Ki

Job
 Worker Jobs | min med ave max
             |   1   1 1.3333 2

 Count |  Real Time (s)* |  CPU Time (core·s) |  CPU Wait (core·s) |  Memory (B) |  Disk (B)
     n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
     4 | 0.33 10.83 8.10 10.85 32.38 | 0.33 10.43 13.46 41.70 53.82 | 0.01 1.68 2.78 9.02 11.10 | 177168Ki 179488Ki 178916Ki 179696Ki 715664Ki | 0Ki 4Ki 18Ki 64Ki 72Ki
multithreadedJob
Total Cores: 4.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.85 10.85 10.85 10.85 10.85 | 41.70 41.70 41.70 41.70 41.70 | 1.68 1.68 1.68 1.68 1.68 | 179488Ki 179488Ki 179488Ki 179488Ki 179488Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
efficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.83 10.83 10.83 10.83 10.83 | 10.43 10.43 10.43 10.43 10.43 | 0.40 0.40 0.40 0.40 0.40 | 179312Ki 179312Ki 179312Ki 179312Ki 179312Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
inefficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.38 10.38 10.38 10.38 10.38 | 1.36 1.36 1.36 1.36 1.36 | 9.02 9.02 9.02 9.02 9.02 | 179696Ki 179696Ki 179696Ki 179696Ki 179696Ki | 64Ki 64Ki 64Ki 64Ki 64Ki
doNothing
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 0.33 0.33 0.33 0.33 0.33 | 0.33 0.33 0.33 0.33 0.33 | 0.01 0.01 0.01 0.01 0.01 | 177168Ki 177168Ki 177168Ki 177168Ki 177168Ki | 0Ki 0Ki 0Ki 0Ki 0Ki
This report gives information on the resources used by your workflow. Note that right now it does NOT track CPU and memory used inside Docker containers, only Singularity containers.
There are three parts to this report.
At the top is a section with overall summary statistics for the run:
Batch System: single_machine
Default Cores: 1
Default Memory: 2097152KiB
Max Cores: unlimited
Local CPU Time: 55.54 core·s
Overall Runtime: 26.23 s
This lists some important settings for the Toil batch system that actually executed jobs. It also lists the Local CPU Time (total CPU time used on the local machine) and the Overall Runtime (the wall-clock time of the run). These latter two numbers don't count some startup/shutdown time spent loading and saving files, so you still may want to use the time shell built-in to time your Toil runs overall.
After the overall summary, there is a section with statistics about the Toil worker processes, which Toil used to execute your workflow's jobs:
Worker
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
3 | 0.34 10.83 10.80 21.23 32.40 | 0.33 10.43 17.94 43.07 53.83 | 0.01 0.40 14.08 41.85 42.25 | 177168Ki 179312Ki 178730Ki 179712Ki 536192Ki | 0Ki 4Ki 22Ki 64Ki 68Ki
Finally, there is the breakdown of resource usage by jobs. This starts with a table summarizing the counts of jobs that ran on each worker:
Job
Worker Jobs | min med ave max
| 1 1 1.3333 2
In this example, most of the workers ran one job each, but one worker managed to run two jobs, via chaining. (Jobs will chain when a job has only one dependent job, which in turn depends on only that first job, and the second job needs no more resources than the first job did.)
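As a rough sketch (hypothetical job functions), a pipeline shaped like this is eligible to chain:

from toil.common import Toil
from toil.job import Job


def step_one(job):
    job.log("step one")


def step_two(job):
    job.log("step two")


if __name__ == "__main__":
    options = Job.Runner.getDefaultArgumentParser().parse_args()
    first = Job.wrapJobFn(step_one, memory="1G", cores=1)
    # step_two is first's only child and asks for no more resources than
    # first did, so the same worker process can run it next (chaining).
    first.addChildJobFn(step_two, memory="1G", cores=1)
    with Toil(options) as toil:
        toil.start(first)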
Next, we have statistics for resource usage over all jobs together:
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
4 | 0.33 10.83 8.10 10.85 32.38 | 0.33 10.43 13.46 41.70 53.82 | 0.01 1.68 2.78 9.02 11.10 | 177168Ki 179488Ki 178916Ki 179696Ki 715664Ki | 0Ki 4Ki 18Ki 64Ki 72Ki
And finally, for each kind of job (as determined by the job's displayName), we have statistics summarizing the resources used by the instances of that kind of job:
multithreadedJob
Total Cores: 4.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.85 10.85 10.85 10.85 10.85 | 41.70 41.70 41.70 41.70 41.70 | 1.68 1.68 1.68 1.68 1.68 | 179488Ki 179488Ki 179488Ki 179488Ki 179488Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
efficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.83 10.83 10.83 10.83 10.83 | 10.43 10.43 10.43 10.43 10.43 | 0.40 0.40 0.40 0.40 0.40 | 179312Ki 179312Ki 179312Ki 179312Ki 179312Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
inefficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.38 10.38 10.38 10.38 10.38 | 1.36 1.36 1.36 1.36 1.36 | 9.02 9.02 9.02 9.02 9.02 | 179696Ki 179696Ki 179696Ki 179696Ki 179696Ki | 64Ki 64Ki 64Ki 64Ki 64Ki
doNothing
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 0.33 0.33 0.33 0.33 0.33 | 0.33 0.33 0.33 0.33 0.33 | 0.01 0.01 0.01 0.01 0.01 | 177168Ki 177168Ki 177168Ki 177168Ki 177168Ki | 0Ki 0Ki 0Ki 0Ki 0Ki
For each job, we first list its name, and then the total cores that it asked for, summed across all instances of it. Then we show a table of statistics.
Here the * marker in the table headers becomes relevant; it shows that jobs are being sorted by the median of the real time used. You can control this with the --sortCategory option.
The column meanings are the same as for the workers.
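As a rough sanity check on how the columns relate, the numbers above are consistent (up to rounding) with CPU Wait being Total Cores × Real Time − CPU Time. This is an observation about the example report, not a stated definition:

# Inferred from the example report: CPU Wait ≈ Total Cores × Real Time − CPU Time.
jobs = {
    # name: (total_cores, real_time_s, cpu_time_core_s)
    "efficientJob":     (1.0, 10.83, 10.43),
    "inefficientJob":   (1.0, 10.38,  1.36),
    "multithreadedJob": (4.0, 10.85, 41.70),
    "doNothing":        (1.0,  0.33,  0.33),
}
for name, (cores, real, cpu) in jobs.items():
    # Prints 0.40, 9.02, 1.70, and 0.00, matching the report up to rounding.
    print(f"{name}: CPU Wait ≈ {cores * real - cpu:.2f} core·s")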
Once we're done looking at the stats, we can clean up the job store by running:
toil clean file:my-jobstore
Continuing the example from the stats section above, if we ran our workflow with the command
python3 tutorial_stats.py file:my-jobstore --stats
We could interrogate our jobstore with the status command, for example:
toil status file:my-jobstore
If the run was successful, this will not return much valuable information; just something like:
2018-01-11 19:31:29,739 - toil.lib.bioio - INFO - Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
2018-01-11 19:31:29,740 - toil.utils.toilStatus - INFO - Parsed arguments
2018-01-11 19:31:29,740 - toil.utils.toilStatus - INFO - Checking if we have files for Toil
The root job of the job store is absent, the workflow completed successfully.
Otherwise, the toil status command will return something like the following:
The toil status command supports several useful flags, including --perJob to get per-job status information, --logs to print stored worker logs, and --failed to list all failed jobs in the workflow. For more information, run toil status --help.
One use case of toil status is with the --printStatus argument. Running toil status --printStatus file:my-jobstore at any point of the workflow's lifecycle can tell you the progress of the workflow. Note: This command will output all current running jobs but not any finished or failed jobs.
For example, after running workflow.py in another terminal:
$ toil status --printStatus file:my-jobstore
[2024-05-31T13:59:13-0700] [MainThread] [I] [toil.utils.toilStatus] Traversing the job graph gathering jobs. This may take a couple of minutes.

Of the 2 jobs considered, there are 0 completely failed jobs, 1 jobs with children, 1 jobs ready to run, 0 zombie jobs, 0 jobs with services, 0 services, and 0 jobs with log files currently in FileJobStore(/path/to/my-jobstore).
Message bus path: /tmp/tmp9cnaq3bm
Job ID kind-TimeWaster/instance-zvdsdkm_ with name TimeWaster is running on SingleMachineBatchSystem as ID 101349.
Job ID kind-TimeWaster/instance-7clm8cv2 with name TimeWaster is running on SingleMachineBatchSystem as ID 101350.
At this moment in time, two jobs with the name "TimeWaster" are running on my local machine.
If a Toil pipeline didn't finish successfully, or was run using --clean=never or --stats, the job store will exist until it is deleted. toil clean <jobStore> ensures that all artifacts associated with a job store are removed. This is particularly useful for deleting AWS job stores, which reserve an SDB domain as well as an S3 bucket.
The deletion of the job store can be modified by the --clean argument, and may be set to always, onError, never, or onSuccess (default).
Temporary directories where jobs are running can also be saved from deletion using the --cleanWorkDir option, which takes the same values as --clean. This option should only be used when debugging, as intermediate jobs will fill up disk space.
If a Toil workflow fails, and it wasn't run with --clean=always, the failing job will be waiting in the job store to be debugged. (With WDL or CWL workflows, you may have needed to manually set a --jobStore location that you can find again.)
You can use toil debug-job on a job in the job store to run it on your local machine, to locally reproduce any error that may have happened during a remote workflow.
The toil debug-job command takes a job store, and the ID or name of a job in it. If multiple jobs match a job name, and only one seems to have run out of retries and completely failed, it will run that one.
You can also pass the --printJobInfo flag to dump information about the job instead of running it.
To kill all currently running jobs for a given jobstore, use the command
toil kill file:my-jobstore
Toil has a number of tools to assist in debugging. Here we provide help in working through potential problems that a user might encounter in attempting to run a workflow.
Usually, at the end of a failed Toil workflow, Toil will reproduce the job logs for the jobs that failed. You can look at the end of your workflow log and use the job logs to identify which jobs are failing and why.
The toil status command (Status Command) can be used with the --failed option to list all failed jobs in a Toil job store.
You can also use it with the --logs option to retrieve per-job logs from the job store, for failed jobs that left them. These logs might be useful for diagnosing and fixing the problem.
If you have a failing job's ID or name, you can reproduce its failure on your local machine with toil debug-job. See Debug Job Command.
For example, say you have this WDL workflow in test.wdl. This workflow cannot succeed, due to the typo in the echo command:
version 1.0
workflow test {
call hello
}
task hello {
input {
}
command <<<
set -e
echoo "Hello"
>>>
output {
}
}
You could try to run it with:
toil-wdl-runner --jobStore ./store test.wdl --retryCount 0
But it will fail.
If you want to reproduce the failure later, or on another machine, you can first find out what jobs failed with toil status:
toil status --failed --noAggStats ./store
This will produce something like:
In that output, we can see a failed job with the display name test.hello.command, which describes the job's location in the WDL workflow: the command section of the hello task, called from the test workflow. (If you are writing a Toil Python script, this is the job's displayName.) We can then run that job again locally by name with:
toil debug-job ./store test.hello.command
If there were multiple failed jobs with that name (perhaps because of a WDL scatter), we would need to select one by Toil job ID instead:
toil debug-job ./store kind-WDLTaskJob/instance-r9u6_dcs
And if we know there's only one failed WDL task, we can just tell Toil to rerun the failed WDLTaskJob by Python class name:
toil debug-job ./store WDLTaskJob
Any of these will run the job (including any containers) on the local machine, where its execution can be observed live or monitored with a debugger.
The --retrieveTaskDirectory option to toil debug-job allows you to send the input files for a job to a directory, and then stop running the job. It works for CWL and WDL jobs, and for Python workflows that call toil.job.Job.files_downloaded_hook() after downloading their files. It will make the worker work in the specified directory, so the job's temporary directory will be at worker/job inside it. For WDL and CWL jobs that mount files into containers, there will also be an inside directory populated with symlinks to the files as they would be visible from the root of the container's filesystem.
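For a Toil Python workflow, the hook call might look like this minimal sketch (the process function, its input, and the job store path are hypothetical; readGlobalFile, importFile, and files_downloaded_hook are the Toil API calls referenced above):

from toil.common import Toil
from toil.job import Job

def process(job, input_file_id):
    # Download the job's input from the job store to local disk.
    local_path = job.fileStore.readGlobalFile(input_file_id)
    # Signal that all inputs are now on disk. Under
    # `toil debug-job --retrieveTaskDirectory`, Toil can stop the job
    # here and hand you the populated task directory.
    job.files_downloaded_hook()
    with open(local_path) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./hook-example-store")
    with Toil(options) as workflow:
        file_id = workflow.importFile("file:///etc/hostname")  # any small input
        print(workflow.start(Job.wrapJobFn(process, file_id)))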
For example, say you have a broken WDL workflow named example_alwaysfail_with_files.wdl, like this:
version 1.0
workflow test {
call make_file as f1
call make_file as f2
call hello {
input:
name_file=f1.out,
unused_file=f2.out
}
}
task make_file {
input {
}
command <<<
echo "These are the contents" >test.txt
>>>
output {
File out = "test.txt"
}
}
task hello {
input {
File name_file
File? unused_file
}
command <<<
set -e
echoo "Hello" "$(cat ~{name_file})"
>>>
output {
File out = stdout()
}
}
You can try and fail to run it like this:
toil-wdl-runner --jobStore ./store example_alwaysfail_with_files.wdl --retryCount 0
If you then dump the files from the failing job:
toil debug-job ./store WDLTaskJob --retrieveTaskDirectory dumpdir
You will end up with a directory tree that looks, according to tree, something like this:
dumpdir
├── inside
│   └── mnt
│       └── miniwdl_task_container
│           └── work
│               └── _miniwdl_inputs
│                   ├── 0
│                   │   └── test.txt -> ../../../../../../worker/job/2c6b3dc4-1d21-4abf-9937-db475e6a6bc2/test.txt
│                   └── 1
│                       └── test.txt -> ../../../../../../worker/job/e3d724e1-e6cc-4165-97f1-6f62ab0fb1ef/test.txt
└── worker
    └── job
        ├── 2c6b3dc4-1d21-4abf-9937-db475e6a6bc2
        │   └── test.txt
        ├── e3d724e1-e6cc-4165-97f1-6f62ab0fb1ef
        │   └── test.txt
        ├── tmpr2j5yaic
        ├── tmpxqr9__y4
        └── work

15 directories, 4 files
You can see where Toil downloaded the input files for the job to the worker's temporary directory, and how they would be mounted into the container.
Say you have a broken WDL workflow that can't complete. Whenever you run tutorial_debugging_hangs.wdl, it hangs:
version 1.1
workflow TutorialDebugging {
input {
Array[String] messages = ["Uh-oh!", "Oh dear", "Oops"]
}
scatter(message in messages) {
call WhaleSay {
input:
message = message
}
call CountLines {
input:
to_count = WhaleSay.result
}
}
Array[File] to_compress = flatten([CountLines.result, WhaleSay.result])
call CompressFiles {
input:
files = to_compress
}
output {
File compressed = CompressFiles.result
}
}
# Draw ASCII art
task WhaleSay {
input {
String message
}
command <<<
cowsay "~{message}"
>>>
output {
File result = stdout()
}
runtime {
container: "docker/whalesay"
}
}
# Count the lines in a file
task CountLines {
input {
File to_count
}
command <<<
wc -l ~{to_count}
>>>
output {
File result = stdout()
}
runtime {
container: ["ubuntu:latest", "https://gcr.io/standard-images/ubuntu:latest"]
}
}
# Compress files into a ZIP
task CompressFiles {
input {
Array[File] files
}
command <<<
set -e
cat >script.py <<'EOF'
import sys
from zipfile import ZipFile
import os
# Interpret command line arguments
to_compress = list(reversed(sys.argv[1:]))
with ZipFile("compressed.zip", "w") as z:
while to_compress != []:
# Grab the file to add off the end of the list
input_filename = to_compress[-1]
# Now we need to write this to the zip file.
# What internal filename should we use?
basename = os.path.basename(input_filename)
disambiguation_number = 0
while True:
target_filename = str(disambiguation_number) + basename
try:
z.getinfo(target_filename)
except KeyError:
# Filename is free
break
# Otherwise try another name
disambiguation_number += 1
# Now we can actually make the compressed file
with z.open(target_filename, 'w') as out_stream:
with open(input_filename) as in_stream:
for line in in_stream:
# Prefix each line of text with the original input file
# it came from.
# Also remember to encode the text as the zip file
# stream is in binary mode.
out_stream.write(f"{basename}: {line}".encode("utf-8"))
EOF
python script.py ~{sep(" ", files)}
>>>
output {
File result = "compressed.zip"
}
runtime {
container: "python:3.11"
}
}
You can try to run it like this, using Docker containers. Pretend this was actually a run on a large cluster:
$ toil-wdl-runner --jobStore ./store tutorial_debugging_hangs.wdl --container docker
If you run this, it will hang at the TutorialDebugging.CompressFiles.command step:
[2024-06-18T12:12:49-0400] [MainThread] [I] [toil.leader] Issued job 'WDLTaskJob' TutorialDebugging.CompressFiles.command kind-WDLTaskJob/instance-y0ga_907 v1 with job batch system ID: 16 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
Workflow Progress 94%|██████████▎| 15/16 (0 failures) [00:36<00:02, 0.42 jobs/s]
Say you want to find out why it is stuck. First, you need to kill the workflow. Open a new shell in the same directory and run:
$ toil kill ./store
You can also hit Control+C in its terminal window and wait for it to stop.
Then, you need to use toil debug-job to run the stuck job on your local machine:
$ toil debug-job ./store TutorialDebugging.CompressFiles.command
This produces some more informative logging messages, showing that the Docker container is managing to start up, but that it stays running indefinitely, with a repeating message:
[2024-06-18T12:18:00-0400] [MainThread] [N] [MiniWDLContainers] docker task running :: service: "lhui2bdzmzmg", task: "sg371eb2yk", node: "zyu9drdp6a", message: "started"
[2024-06-18T12:18:01-0400] [MainThread] [D] [MiniWDLContainers] docker task status :: Timestamp: "2024-06-18T16:17:58.545272049Z", State: "running", Message: "started", ContainerStatus: {"ContainerID": "b7210b346637210b49e7b6353dd24108bc3632bbf2ce7479829d450df6ee453a", "PID": 36510, "ExitCode": 0}, PortStatus: {}
[2024-06-18T12:18:03-0400] [MainThread] [D] [MiniWDLContainers] docker task status :: Timestamp: "2024-06-18T16:17:58.545272049Z", State: "running", Message: "started", ContainerStatus: {"ContainerID": "b7210b346637210b49e7b6353dd24108bc3632bbf2ce7479829d450df6ee453a", "PID": 36510, "ExitCode": 0}, PortStatus: {}
[2024-06-18T12:18:04-0400] [MainThread] [D] [MiniWDLContainers] docker task status :: Timestamp: "2024-06-18T16:17:58.545272049Z", State: "running", Message: "started", ContainerStatus: {"ContainerID": "b7210b346637210b49e7b6353dd24108bc3632bbf2ce7479829d450df6ee453a", "PID": 36510, "ExitCode": 0}, PortStatus: {}
...
This also gives you the Docker container ID of the running container, b7210b346637210b49e7b6353dd24108bc3632bbf2ce7479829d450df6ee453a. You can use that to get a shell inside the running container:
$ docker exec -ti b7210b346637210b49e7b6353dd24108bc3632bbf2ce7479829d450df6ee453a bash
root@b7210b346637:/mnt/miniwdl_task_container/work#
Your shell is already in the working directory of the task, so we can inspect the files there to get an idea of how far the task has gotten. Has it managed to create script.py? Has the script managed to create compressed.zip? Let's check:
# ls -lah
total 6.1M
drwxrwxr-x 6 root root  192 Jun 18 16:17 .
drwxr-xr-x 3 root root 4.0K Jun 18 16:17 ..
drwxr-xr-x 3 root root   96 Jun 18 16:17 .toil_wdl_runtime
drwxrwxr-x 8 root root  256 Jun 18 16:17 _miniwdl_inputs
-rw-r--r-- 1 root root 6.0M Jun 18 16:23 compressed.zip
-rw-r--r-- 1 root root 1.3K Jun 18 16:17 script.py
So we can see that the script exists, and the zip file also exists. So maybe the script is still running? We can check with ps, but we need the -x option to include processes not under the current shell. We can also include the -u option to get statistics:
# ps -xu
USER   PID %CPU %MEM   VSZ   RSS TTY   STAT START  TIME COMMAND
root     1  0.0  0.0  2316   808 ?     Ss   16:17  0:00 /bin/sh -c /bin/
root     7  0.0  0.0  4208  3056 ?     S    16:17  0:00 /bin/bash ../com
root     8  0.1  0.0  4208  1924 ?     S    16:17  0:00 /bin/bash ../com
root    20 95.0  0.4 41096 36428 ?     R    16:17  7:09 python script.py
root   645  0.0  0.0  4472  3492 pts/0 Ss   16:21  0:00 bash
root  1379  0.0  0.0  2636   764 ?     S    16:25  0:00 sleep 1
root  1380  0.0  0.0  8584  3912 pts/0 R+   16:25  0:00 ps -xu
Here we can see that python is indeed running, and it is using 95% of a CPU core. So we can surmise that Python is probably stuck spinning around in an infinite loop. Let's look at our files again:
# ls -lah
total 8.1M
drwxrwxr-x 6 root root  192 Jun 18 16:17 .
drwxr-xr-x 3 root root 4.0K Jun 18 16:17 ..
drwxr-xr-x 3 root root   96 Jun 18 16:17 .toil_wdl_runtime
drwxrwxr-x 8 root root  256 Jun 18 16:17 _miniwdl_inputs
-rw-r--r-- 1 root root 7.6M Jun 18  2024 compressed.zip
-rw-r--r-- 1 root root 1.3K Jun 18 16:17 script.py
Note that, while we've been investigating, our compressed.zip file has grown from 6.0M to 7.6M. So we now know that, not only is the Python script stuck in a loop, it is also writing to the ZIP file inside that loop.
Let's inspect the inputs:
# ls -lah _miniwdl_inputs/*
_miniwdl_inputs/0:
total 4.0K
drwxrwxr-x 3 root root  96 Jun 18 16:17 .
drwxrwxr-x 8 root root 256 Jun 18 16:17 ..
-rw-r--r-- 1 root root  65 Jun 18 16:15 stdout.txt

_miniwdl_inputs/1:
total 4.0K
drwxrwxr-x 3 root root  96 Jun 18 16:17 .
drwxrwxr-x 8 root root 256 Jun 18 16:17 ..
-rw-r--r-- 1 root root  65 Jun 18 16:15 stdout.txt

_miniwdl_inputs/2:
total 4.0K
drwxrwxr-x 3 root root  96 Jun 18 16:17 .
drwxrwxr-x 8 root root 256 Jun 18 16:17 ..
-rw-r--r-- 1 root root  65 Jun 18 16:15 stdout.txt

_miniwdl_inputs/3:
total 4.0K
drwxrwxr-x 3 root root  96 Jun 18 16:17 .
drwxrwxr-x 8 root root 256 Jun 18 16:17 ..
-rw-r--r-- 1 root root 384 Jun 18 16:15 stdout.txt

_miniwdl_inputs/4:
total 4.0K
drwxrwxr-x 3 root root  96 Jun 18 16:17 .
drwxrwxr-x 8 root root 256 Jun 18 16:17 ..
-rw-r--r-- 1 root root 387 Jun 18 16:15 stdout.txt

_miniwdl_inputs/5:
total 4.0K
drwxrwxr-x 3 root root  96 Jun 18 16:17 .
drwxrwxr-x 8 root root 256 Jun 18 16:17 ..
-rw-r--r-- 1 root root 378 Jun 18 16:15 stdout.txt
There are the files that are meant to be being compressed into that ZIP file. But, hang on, there are only six of these files, and none of them is over 400 bytes in size. How did we get a multi-megabyte ZIP file? The script must be putting more data than we expected into the ZIP file it is writing.
Taking what we know, we can now inspect the Python script again and see if we can find a way in which it could get stuck in an infinite loop, writing much more data to the ZIP than is actually in the input files. We can also inspect it for WDL variable substitutions (there aren't any). Let's look at it with line numbers using the nl tool, numbering even blank lines with -b a:
# nl -b a script.py
     1  import sys
     2  from zipfile import ZipFile
     3  import os
     4
     5  # Interpret command line arguments
     6  to_compress = list(reversed(sys.argv[1:]))
     7
     8  with ZipFile("compressed.zip", "w") as z:
     9      while to_compress != []:
    10          # Grab the file to add off the end of the list
    11          input_filename = to_compress[-1]
    12          # Now we need to write this to the zip file.
    13          # What internal filename should we use?
    14          basename = os.path.basename(input_filename)
    15          disambiguation_number = 0
    16          while True:
    17              target_filename = str(disambiguation_number) + basename
    18              try:
    19                  z.getinfo(target_filename)
    20              except KeyError:
    21                  # Filename is free
    22                  break
    23              # Otherwise try another name
    24              disambiguation_number += 1
    25          # Now we can actually make the compressed file
    26          with z.open(target_filename, 'w') as out_stream:
    27              with open(input_filename) as in_stream:
    28                  for line in in_stream:
    29                      # Prefix each line of text with the original input file
    30                      # it came from.
    31                      # Also remember to encode the text as the zip file
    32                      # stream is in binary mode.
    33                      out_stream.write(f"{basename}: {line}".encode("utf-8"))
We have three loops here: while to_compress != [] on line 9, while True on line 16, and for line in in_stream on line 28.
The while True loop is immediately suspicious, but none of the code inside it writes to the ZIP file, so we know we can't be stuck in there.
The for line in in_stream loop contains the only call that writes data to the ZIP, so we must be spending time inside it, but it is constrained to loop over a single file at a time, so it can't be the infinite loop we're looking for.
So then we must be infinitely looping at while to_compress != [], and indeed we can see that to_compress is never modified, so it can never become [].
So now we have a theory as to what the problem is, and we can exit out of our shell in the container, and stop toil debug-job with Control+C. Then we can make the following change to our workflow, adding code to the script to actually pop the handled files off the end of the list:
--- tutorial_debugging_works.wdl        2024-06-18 12:03:32
+++ tutorial_debugging_hangs.wdl        2024-06-18 12:03:53
@@ -112,9 +112,6 @@
                 # Also remember to encode the text as the zip file
                 # stream is in binary mode.
                 out_stream.write(f"{basename}: {line}".encode("utf-8"))
-                # Even though we got distracted by zip file manipulation, remember
-                # to pop off the file we just did.
-                to_compress.pop()
     EOF
     python script.py ~{sep(" ", files)}
     >>>
If we apply that change and produce a new file, tutorial_debugging_works.wdl, we can clean up from the old failed run and run a new one:
$ toil clean ./store
$ toil-wdl-runner --jobStore ./store tutorial_debugging_works.wdl --container docker
This will produce a successful log, ending with something like:
[2024-06-18T12:42:20-0400] [MainThread] [I] [toil.leader] Finished toil run successfully.
Workflow Progress 100%|███████████| 17/17 (0 failures) [00:24<00:00, 0.72 jobs/s]
{"TutorialDebugging.compressed": "/Users/anovak/workspace/toil/src/toil/test/docs/scripts/wdl-out-u7fkgqbe/f5e16468-0cf6-4776-a5c1-d93d993c4db2/compressed.zip"}
[2024-06-18T12:42:20-0400] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/Users/anovak/workspace/toil/src/toil/test/docs/scripts/store)
Note the line to standard output giving us the path on disk where the TutorialDebugging.compressed output from the workflow is. If you look at that ZIP file, you can see it contains the expected files, such as 3stdout.txt, which should contain this suitably prefixed dismayed whale:
stdout.txt:  ________
stdout.txt: < Uh-oh! >
stdout.txt:  --------
stdout.txt:     \
stdout.txt:      \
stdout.txt:       \
stdout.txt:                     ##        .
stdout.txt:               ## ## ##       ==
stdout.txt:            ## ## ## ##      ===
stdout.txt:        /""""""""""""""""___/ ===
stdout.txt:   ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
stdout.txt:        \______ o          __/
stdout.txt:         \    \        __/
stdout.txt:           \____\______/
When we're done inspecting the output, and satisfied that the workflow now works, we might want to clean up all the auto-generated WDL output directories from the successful and failed run(s):
$ rm -Rf wdl-out-*
Note: Currently these features are only implemented for use locally (single machine) with the fileJobStore.
To view what files currently reside in the jobstore, run the following command:
$ toil debug-file file:path-to-jobstore-directory \
--listFilesInJobStore
When run from the commandline, this should generate a file containing the contents of the job store (in addition to displaying a series of log messages to the terminal). This file is named "jobstore_files.txt" by default and will be generated in the current working directory.
If one wishes to copy any of these files to a local directory, one can run for example:
$ toil debug-file file:path-to-jobstore \
--fetch overview.txt *.bam *.fastq \
--localFilePath=/home/user/localpath
This fetches overview.txt and all .bam and .fastq files. Fetching files this way can be used to recover previously used input and output files for debugging or for reuse in other workflows, or to check that certain outputs were actually imported into the jobStore.
See Stats Command and Status Command for more about gathering statistics about job success, runtime, and resource usage from workflows.
If you execute a workflow using the --debugWorker flag, or if you use toil debug-job, Toil will run the job in the process you started from the command line. This means you can either use pdb, or an IDE that supports debugging Python to interact with the Python process as it runs your job. Note that the --debugWorker flag will only work with the single_machine batch system (the default), and not any of the custom job schedulers.
Toil supports Amazon Web Services (AWS) and Google Compute Engine (GCE) in the cloud and has autoscaling capabilities that can adapt to the size of your workflow, whether your workflow requires 10 instances or 20,000.
Toil does this by creating a virtual cluster running Kubernetes. Kubernetes requires a leader node to coordinate the workflow, and worker nodes to execute the various tasks within the workflow. As the workflow runs, Kubernetes will "autoscale", creating and terminating workers as needed to meet the demands of the workflow. Historically, Toil has spun up clusters with Apache Mesos, but it is no longer recommended.
Once a user is familiar with the basics of running Toil locally (specifying a jobStore, and how to write a workflow), they can move on to the guides below to learn how to translate these workflows into cloud ready workflows.
Toil can launch and manage a cluster of virtual machines, using the provisioner, to run a workflow distributed over several nodes. The provisioner also has the ability to automatically scale the size of the cluster up or down to handle dynamic changes in computational demand (autoscaling). Currently we have working provisioners for AWS and GCE (Azure support has been deprecated).
Toil uses Kubernetes as the Batch System.
See here for instructions for Running in AWS.
See here for instructions for Running in Google Compute Engine (GCE).
Toil offers a suite of commands for using the provisioners to manage clusters.
In addition to the generic Toil Utilities, there are several utilities used for starting and managing a Toil cluster using the AWS or GCE provisioners. They are installed via the [aws] or [google] extra. For installation details see Toil Provisioner.
The toil cluster subcommands are:
launch-cluster --- For autoscaling. This is used to launch a toil leader instance with the specified provisioner.
rsync-cluster --- For autoscaling. Used to transfer files to a cluster launched with toil launch-cluster.
ssh-cluster --- SSHs into the toil appliance container running on the leader of the cluster.
destroy-cluster --- Used to tear down a cluster launched with toil launch-cluster, deleting all of its nodes and associated resources.
For information on a specific utility, run it with the --help option:
toil launch-cluster --help
The cluster utilities can be used for Running in Google Compute Engine (GCE) and Running in AWS.
Running toil launch-cluster starts up a leader for a cluster. Workers can be added to the initial cluster by specifying the -w option. An example would be
$ toil launch-cluster my-cluster \
--leaderNodeType t2.small -z us-west-2a \
--keyPairName your-AWS-key-pair-name \
--nodeTypes m3.large,t2.micro -w 1,4
Options are listed below. These can also be displayed by running
$ toil launch-cluster --help
launch-cluster's main positional argument is the clusterName. This is simply the name of your cluster. If it does not exist yet, Toil will create it for you.
Launch-Cluster Options
Logging Options
Toil provides the ability to ssh into the leader of the cluster. This can be done as follows:
$ toil ssh-cluster CLUSTER-NAME-HERE
This will open a shell on the Toil leader, and can be used to start a Running a Workflow with Autoscaling run. Issues with docker prevent using screen and tmux when sshing into the cluster (the shell doesn't know that it is a TTY, which prevents it from allocating a new screen session). This can be worked around via:
$ script
$ screen
Simply running screen within script will get things working properly again.
Finally, you can execute remote commands with the following syntax:
$ toil ssh-cluster CLUSTER-NAME-HERE remoteCommand
It is not advised that you run your Toil workflow using remote execution like this unless a tool like nohup is used to ensure the process does not die if the SSH connection is interrupted.
For an example usage, see Running a Workflow with Autoscaling.
The most frequent use case for the rsync-cluster utility is deploying your workflow code to the Toil leader. Note that the syntax is the same as traditional rsync with the exception of the hostname before the colon. This is not needed in toil rsync-cluster since the hostname is automatically determined by Toil.
Here is an example of its usage:
$ toil rsync-cluster CLUSTER-NAME-HERE \
~/localFile :/remoteDestination
The destroy-cluster command is the advised way to get rid of any Toil cluster launched using the Launch-Cluster Command. It ensures that all attached nodes, volumes, security groups, etc. are deleted. If a node or cluster is shut down using Amazon's online portal, residual resources may still be in use in the background. To delete a cluster, run:
$ toil destroy-cluster CLUSTER-NAME-HERE
Toil can make use of cloud storage such as AWS or Google buckets to take care of storage needs.
This is useful when running Toil in single machine mode on any cloud platform since it allows you to make use of their integrated storage systems.
For an overview of the job store see Job Store.
For instructions configuring a particular job store see:
Kubernetes is a very popular container orchestration tool that has become a de facto cross-cloud-provider API for accessing cloud resources. Major cloud providers like Amazon, Microsoft, Kubernetes owner Google, and DigitalOcean have invested heavily in making Kubernetes work well on their platforms, by writing their own deployment documentation and developing provider-managed Kubernetes-based products. Using minikube, Kubernetes can even be run on a single machine.
Toil supports running Toil workflows against a Kubernetes cluster, either in the cloud or deployed on user-owned hardware.
To run Toil workflows on Kubernetes, you need to have a Kubernetes cluster set up. This will not be covered here, but there are many options available, and which one you choose will depend on which cloud ecosystem, if any, you already use, and on pricing. If you are just following along with the documentation, use minikube on your local machine.
Alternatively, Toil can set up a Kubernetes cluster for you with the Toil provisioner. Follow this guide to get started with a Toil-managed Kubernetes cluster on AWS.
Note that currently the only way to run a Toil workflow on Kubernetes is to use the AWS Job Store, so your Kubernetes workflow will currently have to store its data in Amazon's cloud regardless of where you run it. This can result in significant egress charges from Amazon if you run it outside of Amazon.
Kubernetes Cluster Providers:
There are two main ways to run Toil workflows on Kubernetes. You can either run the Toil leader on a machine outside the cluster, with jobs submitted to and run on the cluster, or you can submit the Toil leader itself as a job and have it run inside the cluster. Either way, you will need to configure your own machine to be able to submit jobs to the Kubernetes cluster. Generally, this involves creating and populating a file named .kube/config in your user's home directory, and specifying the cluster to connect to, the certificate and token information needed for mutual authentication, and the Kubernetes namespace within which to work. However, Kubernetes configuration can also be picked up from other files in the .kube directory, environment variables, and the enclosing host when running inside a Kubernetes-managed container.
You will have to do different things here depending on where you got your Kubernetes cluster:
Toil's internal Kubernetes configuration logic mirrors that of the kubectl command. Toil workflows will use the current kubectl context to launch their Kubernetes jobs.
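As an illustration, the official Kubernetes Python client reads these same configuration sources; here is a rough sketch (not Toil's actual internal code) of picking up the current context:

from kubernetes import client, config

try:
    # Outside the cluster: read ~/.kube/config (or $KUBECONFIG),
    # honoring the current kubectl context.
    config.load_kube_config()
except config.ConfigException:
    # Inside a pod: fall back to the in-cluster service account.
    config.load_incluster_config()

v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)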
If you are going to run your workflow's leader within the Kubernetes cluster (see Option 1: Running the Leader Inside Kubernetes), you will need a service account in your chosen Kubernetes namespace. Most namespaces should have a service account named default which should work fine. If your cluster requires you to use a different service account, you will need to obtain its name and use it when launching the Kubernetes job containing the Toil leader.
Your local Kubernetes context and/or the service account you are using to run the leader in the cluster will need to have certain permissions in order to run the workflow. Toil needs to be able to interact with jobs and pods in the cluster, and to retrieve pod logs. You as a user may need permission to set up an AWS credentials secret, if one is not already available. Additionally, it is very useful for you as a user to have permission to interact with nodes, and to shell into pods.
The appropriate permissions may already be available to you and your service account by default, especially in managed or ease-of-use-optimized setups such as EKS or minikube.
However, if the appropriate permissions are not already available, you or your cluster administrator will have to grant them manually. The following Role (toil-user) and ClusterRole (node-reader), to be applied with kubectl apply -f filename.yaml, should grant sufficient permissions to run Toil workflows when bound to your account and the service account used by Toil workflows. Be sure to replace YOUR_NAMESPACE_HERE with the namespace you are running your workflows in:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: YOUR_NAMESPACE_HERE
  name: toil-user
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["explain", "get", "watch", "list", "describe", "logs", "attach", "exec", "port-forward", "proxy", "cp", "auth"]
- apiGroups: ["batch"]
  resources: ["*"]
  verbs: ["get", "watch", "list", "create", "run", "set", "delete"]
- apiGroups: [""]
  resources: ["secrets", "pods", "pods/attach", "podtemplates", "configmaps", "events", "services"]
  verbs: ["patch", "get", "update", "watch", "list", "create", "run", "set", "delete", "exec"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "describe"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list", "describe"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["*"]
  verbs: ["*"]
To bind a user or service account to the Role or ClusterRole and actually grant the permissions, you will need a RoleBinding and a ClusterRoleBinding, respectively. Make sure to fill in the namespace, username, and service account name, and add more user stanzas if your cluster is to support multiple Toil users.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: toil-developer-member
  namespace: toil
subjects:
- kind: User
  name: YOUR_KUBERNETES_USERNAME_HERE
  apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
  name: YOUR_SERVICE_ACCOUNT_NAME_HERE
  namespace: YOUR_NAMESPACE_HERE
roleRef:
  kind: Role
  name: toil-user
  apiGroup: rbac.authorization.k8s.io
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: read-nodes
subjects:
- kind: User
  name: YOUR_KUBERNETES_USERNAME_HERE
  apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
  name: YOUR_SERVICE_ACCOUNT_NAME_HERE
  namespace: YOUR_NAMESPACE_HERE
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io
Currently, the only job store (the storage Toil uses to exchange data between jobs) that works with jobs running on Kubernetes is the AWS Job Store. This requires that the Toil leader and the Kubernetes jobs be able to connect to and use Amazon S3 and Amazon SimpleDB. It also requires that you have an Amazon Web Services account.
In your AWS account, you need to create an AWS access key. First go to the IAM dashboard; for "us-west-1", the link would be:
https://console.aws.amazon.com/iam/home?region=us-west-1#/home
Then create an access key, and save the Access Key ID and the Secret Key. As documented in the AWS documentation:
Make sure that, if your AWS infrastructure requires your user to authenticate with a multi-factor authentication (MFA) token, you obtain a second secret key and access key that don't have this requirement. The secret key and access key used to populate the Kubernetes secret that allows the jobs to contact the job store need to be usable without human intervention.
This only really needs to happen if you run the leader on the local machine. But we need the files in place to fill in the secret in the next step. Run:
$ aws configure
Then when prompted, enter your secret key and access key. This should create a file ~/.aws/credentials that looks like this:
[default]
aws_access_key_id = BLAH
aws_secret_access_key = blahblahblah
$ cd ~/.aws
Then, create a Kubernetes secret that contains it. We'll call it aws-credentials:
$ kubectl create secret generic aws-credentials --from-file credentials
To configure your workflow to run on Kubernetes, you will have to configure several environment variables, in addition to passing the --batchSystem kubernetes option. Doing the research to figure out what values to give these variables may require talking to your cluster provider.
Note that Docker containers cannot be run inside of unprivileged Kubernetes pods (which are themselves containers). The Docker daemon does not (yet) support this. Other tools, such as Singularity in its user-namespace mode, are able to run containers from within containers. If using Singularity to run containerized tools, and you want downloaded container images to persist between Toil jobs, some setup may be required:
On non-Toil managed clusters: You will also want to set TOIL_KUBERNETES_HOST_PATH, and make sure that Singularity is downloading its containers under the Toil work directory (/var/lib/toil by default) by setting SINGULARITY_CACHEDIR.
On Toil-managed clusters: On clusters created with the launch-cluster command, no setup is required. TOIL_KUBERNETES_HOST_PATH is already set to /var/lib/toil. SINGULARITY_CACHEDIR is set to /var/lib/toil/singularity which is a shared location; however, you may need to implement Singularity locking as shown below or change the Singularity cache location to somewhere else.
If using toil-wdl-runner, all the necessary locking for Singularity is already in place and no work should be necessary. Else, for both Toil managed and non-Toil managed clusters, you will need to make sure that no two jobs try to download the same container at the same time; Singularity has no synchronization or locking around its cache, but the cache is also not safe for simultaneous access by multiple Singularity invocations. Some Toil workflows use their own custom workaround logic for this problem; for example, see this section in toil-wdl-runner.
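As an illustration of the kind of synchronization needed (a sketch of one possible approach, not the actual toil-wdl-runner logic), a workflow could hold an exclusive file lock for the duration of each image pull:

import fcntl
import os
import subprocess

def pull_with_lock(image, cache_dir="/var/lib/toil/singularity"):
    # Serialize Singularity pulls on this host so that concurrent jobs
    # don't corrupt the shared cache.
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, ".pull.lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            env = dict(os.environ, SINGULARITY_CACHEDIR=cache_dir)
            subprocess.run(["singularity", "pull", f"docker://{image}"],
                           env=env, check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

pull_with_lock("python:3.11")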
To run the workflow, you will need to run the Toil leader process somewhere. It can either be run inside Kubernetes as a Kubernetes job, or outside Kubernetes as a normal command.
Once you have determined a set of environment variable values for your workflow run, write a YAML file that defines a Kubernetes job to run your workflow with that configuration. Some configuration items (such as your username, and the name of your AWS credentials secret) need to be written into the YAML so that they can be used from the leader as well.
Note that the leader pod will need your workflow, its other dependencies, and Toil all installed. An easy way to get Toil installed is to start with the Toil appliance image for the version of Toil you want to use. In this example, we use quay.io/ucsc_cgl/toil:5.5.0.
Here's an example YAML file to run a test workflow:
apiVersion: batch/v1
kind: Job
metadata:
  # It is good practice to include your username in your job name.
  # Also specify it in TOIL_KUBERNETES_OWNER
  name: demo-user-toil-test
# Do not try and rerun the leader job if it fails
spec:
  backoffLimit: 0
template:
spec:
# Do not restart the pod when the job fails, but keep it around so the
# log can be retrieved
restartPolicy: Never
volumes:
- name: aws-credentials-vol
secret:
# Make sure the AWS credentials are available as a volume.
# This should match TOIL_AWS_SECRET_NAME
secretName: aws-credentials
# You may need to replace this with a different service account name as
# appropriate for your cluster.
serviceAccountName: default
containers:
- name: main
image: quay.io/ucsc_cgl/toil:5.5.0
env:
# Specify your username for inclusion in job names
- name: TOIL_KUBERNETES_OWNER
value: demo-user
# Specify where to find the AWS credentials to access the job store with
- name: TOIL_AWS_SECRET_NAME
value: aws-credentials
# Specify where per-host caches should be stored, on the Kubernetes hosts.
# Needs to be set for Toil's caching to be efficient.
- name: TOIL_KUBERNETES_HOST_PATH
value: /data/scratch
volumeMounts:
# Mount the AWS credentials volume
- mountPath: /root/.aws
name: aws-credentials-vol
resources:
# Make sure to set these resource limits to values large enough
# to accommodate the work your workflow does in the leader
# process, but small enough to fit on your cluster.
#
# Since no request values are specified, the limits are also used
# for the requests.
limits:
cpu: 2
memory: "4Gi"
ephemeral-storage: "10Gi"
command:
- /bin/bash
- -c
- |
# This Bash script will set up Toil and the workflow to run, and run them.
set -e
# We make sure to create a work directory; Toil can't hot-deploy a
# Python file from the root of the filesystem, which is where we start.
mkdir /tmp/work
cd /tmp/work
# We make a virtual environment to allow workflow dependencies to be
# hot-deployed.
#
# We don't really make use of it in this example, but for workflows
# that depend on PyPI packages we will need this.
#
# We use --system-site-packages so that the Toil installed in the
# appliance image is still available.
virtualenv --python python3 --system-site-packages venv
. venv/bin/activate
# Now we install the workflow. Here we're using a demo workflow
# from Toil itself.
wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0/src/toil/test/docs/scripts/tutorial_helloworld.py
# Now we run the workflow. We make sure to use the Kubernetes batch
# system and an AWS job store, and we set some generally useful
# logging options. We also make sure to enable caching.
python3 tutorial_helloworld.py \
aws:us-west-2:demouser-toil-test-jobstore \
--batchSystem kubernetes \
--realTimeLogging \
--logInfo
You can save this YAML as leader.yaml, and then run it on your Kubernetes installation with:
$ kubectl apply -f leader.yaml
To monitor the progress of the leader job, you will want to read its logs. If you are using a Kubernetes dashboard such as k9s, you can simply find the pod created for the job in the dashboard, and view its logs there. If not, you will need to locate the pod by hand.
The following techniques are most useful for looking at the pod which holds the Toil leader, but they can also be applied to individual Toil jobs on Kubernetes, even when the leader is outside the cluster.
Kubernetes names pods for jobs by appending a short random string to the name of the job. You can find the name of the pod for your job by doing:
$ kubectl get pods | grep demo-user-toil-test
demo-user-toil-test-g5496   1/1     Running     0          2m
Assuming you have set TOIL_KUBERNETES_OWNER correctly, you should be able to find all of your workflow's pods by searching for your username:
$ kubectl get pods | grep demo-user
If the status of a pod is anything other than Pending, you will be able to view its logs with:
$ kubectl logs demo-user-toil-test-g5496
This will dump the pod's logs from the beginning to now and terminate. To follow along with the logs from a running pod, add the -f option:
$ kubectl logs -f demo-user-toil-test-g5496
A status of ImagePullBackoff suggests that you have requested to use an image that is not available. Check the image section of your YAML if you are looking at a leader, or the value of TOIL_APPLIANCE_SELF if you are dealing with a worker job. You also might want to check your Kubernetes node's Internet connectivity and DNS function; in Kubernetes, DNS depends on system-level pods which can be terminated or evicted in cases of resource oversubscription, just like user workloads.
If your pod seems to be stuck in the Pending or ContainerCreating state, you can get information on what is wrong with it by using kubectl describe pod:
$ kubectl describe pod demo-user-toil-test-g5496
Pay particular attention to the Events: section at the end of the output. An indication that a job is too big for the available nodes on your cluster, or that your cluster is too busy for your jobs, is FailedScheduling events:
Type     Reason            Age                  From               Message
----     ------            ----                 ----               -------
Warning  FailedScheduling  13s (x79 over 100m)  default-scheduler  0/4 nodes are available: 1 Insufficient cpu, 1 Insufficient ephemeral-storage, 4 Insufficient memory.
If a pod is running but seems to be behaving erratically, or seems stuck, you can shell into it and look around:
$ kubectl exec -ti demo-user-toil-test-g5496 /bin/bash
One common cause of stuck pods is attempting to use more memory than allowed by Kubernetes (or by the Toil job's memory resource requirement), but in a way that does not trigger the Linux OOM killer to terminate the pod's processes. In these cases, the pod can remain stuck at nearly 100% memory usage more or less indefinitely, and attempting to shell into the pod (which needs to start a process within the pod, using some of its memory) will fail. In these cases, the recommended solution is to kill the offending pod and increase its (or its Toil job's) memory requirement, or reduce its memory needs by adapting user code.
The Toil Kubernetes batch system includes cleanup code to terminate worker jobs when the leader shuts down. However, if the leader pod is removed by Kubernetes, is forcibly killed or otherwise suffers a sudden existence failure, it can go away while its worker jobs live on. It is not recommended to restart a workflow in this state, as jobs from the previous invocation will remain running and will be trying to modify the job store concurrently with jobs from the new invocation.
To clean up dangling jobs, you can use the following snippet:
$ kubectl get jobs | grep demo-user | cut -f1 -d' ' | xargs -n10 kubectl delete job
This will delete all jobs with demo-user's username in their names, in batches of 10. You can also use the UUID that Toil assigns to a particular workflow invocation in the filter, to clean up only the jobs pertaining to that workflow invocation.
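If you would rather script the cleanup, a rough equivalent using the official Kubernetes Python client (assuming the jobs live in the default namespace and your username is demo-user) might look like:

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

for job in batch.list_namespaced_job(namespace="default").items:
    # Match the same jobs the grep above would: names containing the username.
    if "demo-user" in job.metadata.name:
        batch.delete_namespaced_job(
            name=job.metadata.name,
            namespace="default",
            propagation_policy="Background",  # also delete the job's pods
        )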
If you don't want to run your Toil leader inside Kubernetes, you can run it locally instead. This can be useful when developing a workflow; files can be hot-deployed from your local machine directly to Kubernetes. However, your local machine will have to have (ideally role-assumption- and MFA-free) access to AWS, and access to Kubernetes. Real time logging will not work unless your local machine is able to listen for incoming UDP packets on arbitrary ports on the address it uses to contact the IPv4 Internet; Toil does no NAT traversal or detection.
Note that if you set TOIL_WORKDIR when running your workflow like this, it will need to be a directory that exists both on the host and in the Toil appliance.
Here is an example of running our test workflow leader locally, outside of Kubernetes:
$ export TOIL_KUBERNETES_OWNER=demo-user  # This defaults to your local username if not set
$ export TOIL_AWS_SECRET_NAME=aws-credentials
$ export TOIL_KUBERNETES_HOST_PATH=/data/scratch
$ virtualenv --python python3 --system-site-packages venv
$ . venv/bin/activate
$ wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0/src/toil/test/docs/scripts/tutorial_helloworld.py
$ python3 tutorial_helloworld.py \
aws:us-west-2:demouser-toil-test-jobstore \
--batchSystem kubernetes \
--realTimeLogging \
--logInfo
Running CWL workflows on Kubernetes can be challenging, because executing CWL can require toil-cwl-runner to orchestrate containers of its own, within a Kubernetes job running in the Toil appliance container.
Normally, running a CWL workflow should Just Work, as long as the workflow's Docker containers are able to be executed with Singularity, your Kubernetes cluster does not impose extra capability-based confinement (i.e. SELinux, AppArmor) that interferes with Singularity's use of user-mode namespaces, and you make sure to configure Toil so that its workers know where to store their data within the Kubernetes pods (which would be done for you if using a Toil-managed cluster). For example, you should be able to run a CWL workflow like this:
$ export TOIL_KUBERNETES_OWNER=demo-user  # This defaults to your local username if not set
$ export TOIL_AWS_SECRET_NAME=aws-credentials
$ export TOIL_KUBERNETES_HOST_PATH=/data/scratch
$ virtualenv --python python3 --system-site-packages venv
$ . venv/bin/activate
$ pip install toil[kubernetes,cwl]==5.8.0
$ toil-cwl-runner \
--jobStore aws:us-west-2:demouser-toil-test-jobstore \
--batchSystem kubernetes \
--realTimeLogging \
--logInfo \
--disableCaching \
path/to/cwl/workflow \
path/to/cwl/input/object
Additional cwltool options that your workflow might require, such as --no-match-user, can be passed to toil-cwl-runner, which inherits most cwltool options.
Kubernetes clusters based on Ubuntu hosts often will have AppArmor enabled on the host. AppArmor is a capability-based security enhancement system that integrates with the Linux kernel to enforce lists of things which programs may or may not do, called profiles. For example, an AppArmor profile could be applied to a web server process to stop it from using the mount() system call to manipulate the filesystem, because it has no business doing that under normal circumstances but might attempt to do it if compromised by hackers.
Kubernetes clusters also often use Docker as the backing container runtime, to run pod containers. When AppArmor is enabled, Docker will load an AppArmor profile and apply it to all of its containers by default, with the ability for the profile to be overridden on a per-container basis. This profile unfortunately prevents some of the mount() system calls that Singularity uses to set up user-mode containers from working inside the pod, even though these calls would be allowed for an unprivileged user under normal circumstances.
On the UCSC Kubernetes cluster, we configure our Ubuntu hosts with an alternative default AppArmor profile for Docker containers which allows these calls. Other solutions include turning off AppArmor on the host, configuring Kubernetes with a container runtime other than Docker, or using Kubernetes's AppArmor integration to apply a more permissive profile or the unconfined profile to pods that Toil launches.
Toil does not yet have a way to apply a container.apparmor.security.beta.kubernetes.io/runner-container: unconfined annotation to its pods, as described in the Kubernetes AppArmor documentation. This feature is tracked in issue #4331.
Toil jobs can be run on a variety of cloud platforms. Of these, Amazon Web Services (AWS) is currently the best-supported solution. Toil provides the Toil Cluster Utilities to conveniently create AWS clusters, connect to the leader of the cluster, and then launch a workflow. The leader handles distributing the jobs over the worker nodes and autoscaling to optimize costs.
The Running a Workflow with Autoscaling section details how to create a cluster and run a workflow that will dynamically scale depending on the workflow's needs.
The Static Provisioning section explains how a static cluster (one that won't automatically change in size) can be created and provisioned (grown, shrunk, destroyed, etc.).
To use Amazon Web Services (AWS) to run Toil or to just use S3 to host the files during the computation of a workflow, first set up and configure an account with AWS:
$ ssh-keygen -t rsa
~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ eval `ssh-agent -s`
$ ssh-add
If your key has a passphrase, you will be prompted to enter it here once.
$ chmod 400 id_rsa
https://us-west-1.console.aws.amazon.com/ec2/v2/home?region=us-west-1#KeyPairs:sort=keyName
https://console.aws.amazon.com/iam/home?region=us-west-1#/home
$ pip install awscli --upgrade --user
$ aws configure
" AWS Access Key ID [****************Q65Q]: " " AWS Secret Access Key [****************G0ys]: " " Default region name [us-west-1]: " " Default output format [json]: "
This will create the files ~/.aws/config and ~/.aws/credentials.
$ virtualenv venv
$ source venv/bin/activate
$ pip install toil[all]==5.12.0
$ toil launch-cluster <cluster-name> \
--clusterType kubernetes \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-1a \
--keyPairName id_rsa
To further break down each of these commands:
<cluster-name> --- Just choose a name for your cluster.
--clusterType kubernetes --- Specify the type of cluster to coordinate and execute your workflow. Kubernetes is the recommended option.
--leaderNodeType t2.medium --- Specify the leader node type. Here we use a t2.medium (2 vCPU; 4 GB RAM; $0.0464/hour). A list of available AWS instance types: https://aws.amazon.com/ec2/pricing/on-demand/
--nodeTypes t2.medium -w 1 --- Specify the worker node type and the number of worker nodes to launch. The Kubernetes cluster requires at least 1 worker node.
--zone us-west-1a --- Specify the AWS zone you want to launch the instance in. Must have the same prefix as the zone in your awscli credentials (which, in the example of this tutorial is: "us-west-1").
--keyPairName id_rsa --- The name of your key pair, which should be "id_rsa" if you've followed this tutorial.
NOTE:
You can also set the TOIL_APPLIANCE_SELF environment variable to one of the Toil project's Docker images, if you would like to launch a cluster using a different version of Toil than the one you have installed.
Using the AWS job store is straightforward after you've finished Preparing your AWS environment; all you need to do is specify the prefix for the job store name.
To run the sort example with the AWS job store, you would type:
$ python3 sort.py aws:us-west-2:my-aws-sort-jobstore
The Toil provisioner is the component responsible for creating resources in Amazon's cloud. It is included in Toil alongside the [aws] extra and allows us to spin up a cluster.
Getting started with the provisioner is simple:
The Toil provisioner makes heavy use of the Toil Appliance, a Docker image that bundles Toil and all its requirements (e.g. Kubernetes). This makes deployment simple across platforms, and you can even simulate a cluster locally (see Developing with Docker for details).
When using the Toil provisioner, the appliance image will be automatically chosen based on the pip-installed version of Toil on your system. That choice can be overridden by setting the environment variables TOIL_DOCKER_REGISTRY and TOIL_DOCKER_NAME or TOIL_APPLIANCE_SELF. See Environment Variables for more information on these variables. If you are developing with autoscaling and want to test and build your own appliance have a look at Developing with Docker.
For information on using the Toil Provisioner have a look at Running a Workflow with Autoscaling.
Using the provisioner to launch a Toil leader instance is simple using the launch-cluster command. For example, to launch a Kubernetes cluster named "my-cluster" with a t2.medium leader in the us-west-2a zone, run
(venv) $ toil launch-cluster my-cluster \
--clusterType kubernetes \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-2a \
--keyPairName <AWS-key-pair-name>
The cluster name is used to uniquely identify your cluster and will be used to populate the instance's Name tag. Also, the Toil provisioner will automatically tag your cluster with an Owner tag that corresponds to your keypair name to facilitate cost tracking. In addition, the ToilNodeType tag can be used to filter "leader" vs. "worker" nodes in your cluster.
The leaderNodeType is an EC2 instance type. This only affects the leader node.
The --zone parameter specifies which EC2 availability zone to launch the cluster in. Alternatively, you can specify this option via the TOIL_AWS_ZONE environment variable. Note: the zone is different from an EC2 region. A region corresponds to a geographical area like us-west-2 (Oregon), and availability zones are partitions of this area like us-west-2a.
By default, Toil creates an IAM role for each cluster with sufficient permissions to perform cluster operations (e.g. full S3, EC2, and SDB access). If the default permissions are not sufficient for your use case (e.g. if you need access to ECR), you may create a custom IAM role with all necessary permissions and set the --awsEc2ProfileArn parameter when launching the cluster. Note that your custom role must at least have these permissions in order for the Toil cluster to function properly.
In addition, Toil creates a new security group with the same name as the cluster name with default rules (e.g. opens port 22 for SSH access). If you require additional security groups, you may use the --awsEc2ExtraSecurityGroupId parameter when launching the cluster. Note: Do not use the same name as the cluster name for the extra security groups as any security group matching the cluster name will be deleted once the cluster is destroyed.
For more information on options try:
(venv) $ toil launch-cluster --help
Toil can be used to manage a cluster in the cloud by using the Toil Cluster Utilities. The cluster utilities also make it easy to run a Toil workflow directly on this cluster. We call this static provisioning because the size of the cluster does not change. This is in contrast with Running a Workflow with Autoscaling.
To launch worker nodes alongside the leader we use the -w option:
(venv) $ toil launch-cluster my-cluster \
--clusterType kubernetes \
--leaderNodeType t2.small -z us-west-2a \
--keyPairName <AWS-key-pair-name> \
--nodeTypes m3.large,t2.micro -w 1,4 \
--zone us-west-2a
This will spin up a leader node of type t2.small with five additional workers --- one m3.large instance and four t2.micro instances.
Currently static provisioning is only possible during the cluster's creation. The ability to add new nodes and remove existing nodes via the native provisioner is in development. Of course the cluster can always be deleted with the Destroy-Cluster Command utility.
Now that our cluster is launched, we use the Rsync-Cluster Command utility to copy the workflow to the leader. For a simple workflow in a single file this might look like
(venv) $ toil rsync-cluster -z us-west-2a my-cluster toil-workflow.py :/
Toil can create an autoscaling Kubernetes cluster for you using the AWS provisioner. Autoscaling is a feature of running Toil in a cloud whereby additional cloud instances are launched as needed to run the workflow.
To set up a Kubernetes cluster, simply use the --clusterType=kubernetes command line option to toil launch-cluster. To make it autoscale, specify a range of possible node counts for a node type (such as -w 1-4). The cluster will automatically add and remove nodes, within that range, depending on how many seem to be needed to run the jobs submitted to the cluster.
For example, to launch a Toil cluster with a Kubernetes scheduler, run:
(venv) $ toil launch-cluster <cluster-name> \
--provisioner=aws \
--clusterType kubernetes \
--zone us-west-2a \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--leaderStorage 50 \
--nodeTypes t2.medium -w 1-4 \
--nodeStorage 20 \
--logDebug
Behind the scenes, Toil installs kubeadm and configures the kubelet on the Toil leader and all worker nodes. This Toil cluster can then schedule jobs using Kubernetes.
As a demonstration, we will use sort.py again, but run it on a Toil cluster with Kubernetes. First, download this file and put it in the current working directory.
We then need to copy over the workflow file and SSH into the cluster:
(venv) $ toil rsync-cluster -z us-west-2a <cluster-name> sort.py :/root
(venv) $ toil ssh-cluster -z us-west-2a <cluster-name>
Remember to replace <cluster-name> with your actual cluster name, and feel free to use your own cluster configuration and/or workflow files. For more information on this step, see the corresponding section of the Static Provisioning tutorial.
Now that we are inside the cluster, a Kubernetes environment should already be configured and running. To verify this, simply run:
$ kubectl get nodes
You should see a leader node with the Ready status. Depending on the number of worker nodes you set to create upfront, you should also see them displayed here.
Additionally, you can also verify that the metrics server is running:
$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
If there is a JSON response (similar to the output below), and you are not seeing any errors, that means the metrics server is set up and running, and you are good to start running workflows.
{"kind":"NodeMetricsList","apiVersion":"metrics.k8s.io/v1beta1", ...}
Now we can run the workflow:
$ python3 sort.py \
--batchSystem kubernetes \
aws:<region>:<job-store-name>
Make sure to replace <region> and <job-store-name>. It is required to use a cloud-accessible job store like AWS or Google when using the Kubernetes batch system.
The sort workflow should start running on the Kubernetes cluster set up by Toil. This workflow would take a while to execute, so you could put the job in the background and monitor the Kubernetes cluster using kubectl. For example, you can check out the pods that are running:
$ kubectl get pods
You should see an output like:
NAME                                                      READY   STATUS              RESTARTS   AGE
root-toil-a864e1b0-2e1f-48db-953c-038e5ad293c7-11-4cwdl   0/1     ContainerCreating   0          85s
root-toil-a864e1b0-2e1f-48db-953c-038e5ad293c7-14-5dqtk   0/1     Completed           0          18s
root-toil-a864e1b0-2e1f-48db-953c-038e5ad293c7-7-gkwc9    0/1     ContainerCreating   0          107s
root-toil-a864e1b0-2e1f-48db-953c-038e5ad293c7-9-t7vsb    1/1     Running             0          96s
If a pod failed for whatever reason or if you want to make sure a pod isn't stuck, you can use kubectl describe pod <pod-name> or kubectl logs <pod-name> to inspect the pod.
If everything is successful, you should be able to see an output file from the sort workflow:
$ head sortedFile.txt
You can now run your own workflows!
Toil can run on a heterogeneous cluster of both preemptible and non-preemptible nodes. A preemptible node may be shut down at any time, even while jobs are running; those jobs can then be restarted later somewhere else.
A node type can be specified as preemptible by adding a spot bid in dollars, after a colon, to its entry in the list of node types provided with the --nodeTypes flag. If spot instance prices rise above your bid, the preemptible nodes will be shut down.
For example, this cluster will have both preemptible and non-preemptible nodes:
(venv) $ toil launch-cluster <cluster-name> \
--provisioner=aws \
--clusterType kubernetes \
--zone us-west-2a \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--leaderStorage 50 \
--nodeTypes t2.medium -w 1-4 \
--nodeTypes t2.large:0.20 -w 1-4 \
--nodeStorage 20 \
--logDebug
Individual jobs can explicitly specify whether they should be run on preemptible nodes via the boolean preemptible resource requirement in Toil's Python API. In CWL, this is exposed as a hint UsePreemptible in the http://arvados.org/cwl# namespace (usually imported as arv). In WDL, this is exposed as a runtime attribute preemptible as recognized by Cromwell. Toil's Kubernetes batch system will prefer to schedule preemptible jobs on preemptible nodes.
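For example, here is a minimal Python sketch of requesting a preemptible node for a job (the job function and resource values are illustrative):

from toil.job import Job


def spotFriendlyJobFn(job):
    job.log("This job can tolerate being preempted and retried")


# preemptible=True is a resource requirement, like memory, cores, and disk.
j = Job.wrapJobFn(spotFriendlyJobFn, memory="2G", cores=1, disk="3G",
                  preemptible=True)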
If a job is not specified to be preemptible, the job will not run on preemptible nodes even if preemptible nodes are available, unless the workflow is run with the --defaultPreemptible flag. The --defaultPreemptible flag will allow jobs without an explicit preemptible requirement to run on preemptible machines. For example:
$ python3 /root/sort.py aws:us-west-2:<my-jobstore-name> \
--batchSystem kubernetes \
--defaultPreemptible
Ensure that your choices for --nodeTypes and --maxNodes make sense for your workflow and won't cause it to hang. Make sure the provisioner can create nodes large enough to run the largest job in the workflow, and that non-preemptible node types are allowed if the workflow contains non-preemptible jobs.
Toil can be configured to access files stored in an S3-compatible object store such as MinIO. The S3 connection used can be configured with the TOIL_S3_HOST, TOIL_S3_PORT, and TOIL_S3_USE_SSL environment variables. Examples:
TOIL_S3_HOST=127.0.0.1
TOIL_S3_PORT=9010
TOIL_S3_USE_SSL=False
Instead of the normal Kubernetes-based autoscaling, you can also use Toil's old Mesos-based autoscaling method, where the scaling logic runs inside the Toil workflow. With this approach, a Toil cluster can only run one workflow at a time. This method also does not work on the ARM architecture.
In this mode, the --preemptibleCompensation flag can be used to handle cases where preemptible nodes may not be available but are required for your workflow. With this flag enabled, the autoscaler will attempt to compensate for a shortage of preemptible nodes of a certain type by creating non-preemptible nodes of that type, if non-preemptible nodes of that type were specified in --nodeTypes.
For example, to launch a Mesos cluster and run the sort example on it:
(venv) $ toil launch-cluster <cluster-name> \
--clusterType mesos \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--zone us-west-2a
(venv) $ toil rsync-cluster -z us-west-2a <cluster-name> sort.py :/root
(venv) $ toil ssh-cluster -z us-west-2a <cluster-name>
$ python3 /root/sort.py aws:us-west-2:<my-jobstore-name> \
--provisioner aws \
--nodeTypes c3.large \
--maxNodes 2 \
--batchSystem mesos
Once the workflow completes, you can inspect the input and the sorted output:
$ head fileToSort.txt
$ head sortedFile.txt
Toil provides a dashboard for viewing the RAM and CPU usage of each node, the number of issued jobs of each type, the number of failed jobs, and the size of the jobs queue. To launch this dashboard for a Toil workflow, pass the --metrics flag on the workflow's command line. The dashboard can then be viewed in your browser at localhost:3000 while connected to the leader node through toil ssh-cluster.
To change the default port number, you can use the --grafana_port argument:
(venv) $ toil ssh-cluster -z us-west-2a --grafana_port 8000 <cluster-name>
On AWS, the dashboard keeps track of every node in the cluster to monitor CPU and RAM usage, but it can also be used while running a workflow on a single machine. The dashboard uses Grafana as the front end for displaying real-time plots, and Prometheus for tracking metrics exported by Toil.
In order to use the dashboard for a non-released Toil version, you will have to build the containers locally with make docker, since the prometheus, grafana, and mtail containers used in the dashboard are tied to a specific Toil version.
Toil supports a provisioner for Google Cloud, and a Google job store. To get started, follow the instructions for Preparing your Google environment.
Toil supports using the Google Cloud Platform. Setting this up is easy!
$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -C [USERNAME]
where [USERNAME] is something like jane@example.com. Make sure to leave the passphrase blank.
WARNING:
Make sure only you can read the SSH keys:
$ chmod 400 ~/.ssh/id_rsa ~/.ssh/id_rsa.pub
Near the top of the screen click on 'SSH Keys', then edit, add item, and paste the key. Then save.
For more details look at Google's instructions for adding SSH keys.
To use the Google Job Store you will need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable by following Google's instructions.
Then to run the sort example with the Google job store you would type
$ python3 sort.py google:my-project-id:my-google-sort-jobstore
The steps to run a GCE workflow are similar to those of AWS (Running a Workflow with Autoscaling), except you will need to explicitly specify the --provisioner gce option which otherwise defaults to aws.
(venv) $ toil launch-cluster <CLUSTER-NAME> \
--provisioner gce \
--leaderNodeType n1-standard-1 \
--keyPairName <SSH-KEYNAME> \
--zone us-west1-a
The --keyPairName option is for an SSH key that was added to the Google account; <SSH-KEYNAME> is the first part of the [USERNAME] used when setting up your SSH key. For example, if [USERNAME] was jane@example.com, then <SSH-KEYNAME> should be just jane.
(venv) $ toil rsync-cluster --provisioner gce <CLUSTER-NAME> sort.py :/root
(venv) $ toil ssh-cluster --provisioner gce <CLUSTER-NAME>
$ python3 /root/sort.py google:<PROJECT-ID>:<JOBSTORE-NAME> \
--provisioner gce \
--batchSystem mesos \
--nodeTypes n1-standard-2 \
--maxNodes 2
$ exit  # this exits the ssh from the leader node
(venv) $ toil destroy-cluster --provisioner gce <CLUSTER-NAME>
Toil is a flexible framework that can be leveraged in a variety of environments, including high-performance computing (HPC) environments. Toil provides support for a number of batch systems, including Grid Engine, Slurm, Torque and LSF, which are popular schedulers used in these environments. Toil also supports HTCondor, which is a popular scheduler for high-throughput computing (HTC). To use one of these batch systems specify the --batchSystem argument to the workflow.
Due to the cost and complexity of maintaining support for these schedulers, we currently consider all but Slurm to be "community supported": the core development team does not regularly test or develop support for these systems. However, members of the Toil community are deploying Toil in a wide variety of HPC environments, and we welcome external contributions.
Developing the support of a new or existing batch system involves extending the abstract batch system class toil.batchSystems.abstractBatchSystem.AbstractBatchSystem.
When running Toil workflows on Slurm, you usually want to run the workflow itself from the head node. Toil will take care of running all the required sbatch commands for you. You probably do not want to submit the Toil workflow as a Slurm job with sbatch (although you can if you have a large number of workflows to run). You also probably do not want to manually allocate resources with salloc.
To run a Toil workflow on Slurm, include --batchSystem slurm in your command line arguments. Generally Slurm clusters have shared filesystems, meaning the file job store would be appropriate. You want to make sure to use a job store location that is shared across your Slurm cluster. Additionally, you will likely want to provide another shared directory with the --batchLogsDir option, to allow the Slurm job logs to be retrieved by Toil in case something goes wrong with a job.
For example, to run the sort example on Slurm, assuming you are currently in a shared directory, you would type the following on the cluster head node:
$ mkdir -p logs
$ python3 sort.py ./store --batchSystem slurm --batchLogsDir ./logs
If your workflows use Singularity containers, you will likely also want to point the Singularity image caches at a shared location, for example:
$ echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
$ echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc
Then make sure to log out and back in again for the setting to take effect.
Standard output and error from batch system jobs (except for the Mesos batch system) are redirected to files in the toil-<workflowID> directory created within the temporary directory specified by the --workDir option; see Commandline Options. Each file is named as follows: toil_job_<Toil job ID>_batch_<name of batch system>_<job ID from batch system>_<file description>.log, where <file description> is std_output for standard output, and std_error for standard error. HTCondor will also write job event log files with <file description> = job_events.
If capturing standard output and error is desired, --workDir will generally need to be on a shared file system; otherwise if these are written to local temporary directories on each node (e.g. /tmp) Toil will not be able to retrieve them. Alternatively, the --noStdOutErr option forces Toil to discard all standard output and error from batch system jobs.
The GA4GH Workflow Execution Service (WES) is a standardized API for submitting and monitoring workflows. Toil has experimental support for setting up a WES server and executing CWL, WDL, and Toil workflows using the WES API. More information about the WES API specification can be found here.
To get started with the Toil WES server, make sure that the server extra (Installing Toil with Extra Features) is installed.
The WES server requires Celery to distribute and execute workflows. To set up Celery, first start a RabbitMQ message broker:
$ docker run -d --name wes-rabbitmq -p 5672:5672 rabbitmq:3.9.5
Then start a Celery worker:
$ celery -A toil.server.celery_app worker --loglevel=INFO
To start a WES server on the default port 8080, run the Toil command:
$ toil server
The WES API will be hosted on the following URL:
http://localhost:8080/ga4gh/wes/v1
To use another port, e.g. 3000, you can specify the --port argument:
$ toil server --port 3000
There are many other command line options. Help information can be found by using this command:
$ toil server --help
Instead of manually setting up the server components (toil server, RabbitMQ, and Celery), you can use the following docker-compose.yml file to orchestrate and link them together.
Make sure to change the credentials for basic authentication by updating the traefik.http.middlewares.auth.basicauth.users label. The passwords can be generated with tools like htpasswd like this. (Note that single $ signs need to be replaced with $$ in the yaml file).
When running on a host other than localhost, make sure to change the Host to your target host in the traefik.http.routers.wes.rule and traefik.http.routers.wespublic.rule labels.
You can also change /tmp/toil-workflows if you want Toil workflows to live somewhere else, and create the directory before starting the server.
In order to run workflows that require Docker, the docker.sock socket must be mounted as a volume for Celery. Additionally, the TOIL_WORKDIR directory (default: /var/lib/toil) and /var/lib/cwl (if running CWL workflows with DockerRequirement) should exist on the host and also be mounted as volumes.
Also make sure to run it behind a firewall; it opens up the Toil server on port 8080 to anyone who connects.
# docker-compose.yml
version: "3.8"
services:
rabbitmq:
image: rabbitmq:3.9.5
hostname: rabbitmq
celery:
image: ${TOIL_APPLIANCE_SELF}
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /var/lib/docker:/var/lib/docker
- /var/lib/toil:/var/lib/toil
- /var/lib/cwl:/var/lib/cwl
- /tmp/toil-workflows:/tmp/toil-workflows
command: celery --broker=amqp://guest:guest@rabbitmq:5672// -A toil.server.celery_app worker --loglevel=INFO
depends_on:
- rabbitmq
wes-server:
image: ${TOIL_APPLIANCE_SELF}
volumes:
- /tmp/toil-workflows:/tmp/toil-workflows
environment:
- TOIL_WES_BROKER_URL=amqp://guest:guest@rabbitmq:5672//
command: toil server --host 0.0.0.0 --port 8000 --work_dir /tmp/toil-workflows
expose:
- 8000
labels:
- "traefik.enable=true"
- "traefik.http.routers.wes.rule=Host(`localhost`)"
- "traefik.http.routers.wes.entrypoints=web"
- "traefik.http.routers.wes.middlewares=auth"
- "traefik.http.middlewares.auth.basicauth.users=test:$$2y$$12$$ci.4U63YX83CwkyUrjqxAucnmi2xXOIlEF6T/KdP9824f1Rf1iyNG"
- "traefik.http.routers.wespublic.rule=Host(`localhost`) && Path(`/ga4gh/wes/v1/service-info`)"
depends_on:
- rabbitmq
- celery
traefik:
image: traefik:v2.2
command:
- "--providers.docker"
- "--providers.docker.exposedbydefault=false"
- "--entrypoints.web.address=:8080"
ports:
- "8080:8080"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
Further customization can also be made as needed. For example, if you have a domain, you can set up HTTPS with Let's Encrypt.
Once everything is configured, simply run docker-compose up to start the containers. Run docker-compose down to stop and remove all containers.
NOTE:
To run the server on a Toil leader instance on EC2:
curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
# check installation
docker-compose --version
or, install a different version of Docker Compose by changing "1.29.2" to another version.
As defined by the GA4GH WES API specification, the following endpoints with base path ga4gh/wes/v1/ are supported by Toil:
| GET /service-info | Get information about the Workflow Execution Service. |
| GET /runs | List the workflow runs. |
| POST /runs | Run a workflow. This endpoint creates a new workflow run and returns a run_id to monitor its progress. |
| GET /runs/{run_id} | Get detailed info about a workflow run. |
| POST /runs/{run_id}/cancel | Cancel a running workflow. |
| GET /runs/{run_id}/status | Get the status (overall state) of a workflow run. |
When running the WES server with the docker-compose setup above, most endpoints (except GET /service-info) will be protected with basic authentication. Make sure to set the Authorization header with the correct credentials when submitting or retrieving a workflow.
Now that the WES API is up and running, we can submit and monitor workflows remotely using the WES API endpoints. A workflow can be submitted for execution using the POST /runs endpoint.
As a quick example, we can submit the example CWL workflow from Running a basic CWL workflow to our WES API:
# example.cwl
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout
using cURL:
$ curl --location --request POST 'http://localhost:8080/ga4gh/wes/v1/runs' \
--user test:test \
--form 'workflow_url="example.cwl"' \
--form 'workflow_type="cwl"' \
--form 'workflow_type_version="v1.0"' \
--form 'workflow_params="{\"message\": \"Hello world!\"}"' \
    --form 'workflow_attachment=@"./toil_test_files/example.cwl"'
{
    "run_id": "4deb8beb24894e9eb7c74b0f010305d1"
}
Note that the --user argument is used to attach the basic authentication credentials along with the request. Make sure to change test:test to the username and password you configured for your WES server. Alternatively, you can also set the Authorization header manually as "Authorization: Basic base64_encoded_auth".
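For example, here is a minimal Python sketch of building that header value (the test:test credentials are the compose file's defaults from above):

import base64

# HTTP basic auth: base64-encode "username:password" and prefix with "Basic ".
token = base64.b64encode(b"test:test").decode()
print(f"Authorization: Basic {token}")  # Authorization: Basic dGVzdDp0ZXN0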
If the workflow is submitted successfully, a JSON object containing a run_id will be returned. The run_id is a unique identifier of your requested workflow, which can be used to monitor or cancel the run.
A few parameters are required for all workflow submissions:
| workflow_url | The URL of the workflow to run. This can refer to a file from workflow_attachment. |
| workflow_type | The type of workflow language. Toil currently supports one of the following: "CWL", "WDL", or "py". To run a Toil Python workflow, set this to "py". |
| workflow_type_version | The version of the workflow language. Supported versions can be found by accessing the GET /service-info endpoint of your WES server. |
| workflow_params | A JSON object that specifies the inputs of the workflow. |
Additionally, the following optional parameters are also available:
| workflow_attachment | A list of files associated with the workflow run. |
| workflow_engine_parameters | A JSON key-value map of workflow engine parameters to send to the runner. Example: {"--logLevel": "INFO", "--workDir": "/tmp/"} |
| tags | A JSON key-value map of metadata associated with the workflow. |
For more details about these parameters, refer to the Run Workflow section in the WES API spec.
Looking at the body of the request of the previous example, note that the workflow_url is a relative URL that refers to the example.cwl file uploaded from the local path ./toil_test_files/example.cwl.
To specify the file name (or subdirectory) of the remote destination file, set the filename field in the Content-Disposition header. You could also upload more than one file by providing the workflow_attachment parameter multiple times with different files.
This can be shown by the following example:
$ curl --location --request POST 'http://localhost:8080/ga4gh/wes/v1/runs' \
--user test:test \
--form 'workflow_url="example.cwl"' \
--form 'workflow_type="cwl"' \
--form 'workflow_type_version="v1.0"' \
--form 'workflow_params="{\"message\": \"Hello world!\"}"' \
--form 'workflow_attachment=@"./toil_test_files/example.cwl"' \
--form 'workflow_attachment=@"./toil_test_files/2.fasta";filename=inputs/test.fasta' \
--form 'workflow_attachment=@"./toil_test_files/2.fastq";filename=inputs/test.fastq'
On the server, the execution directory would have the following structure from the above request:
execution/
├── example.cwl
├── inputs
│   ├── test.fasta
│   └── test.fastq
└── wes_inputs.json
To pass Toil-specific parameters to the workflow, you can include the workflow_engine_parameters parameter along with your request.
For example, to set the logging level to INFO, and change the working directory of the workflow, simply include the following as workflow_engine_parameters:
{"--logLevel": "INFO", "--workDir": "/tmp/"}
These options are appended to the end of the existing parameters during command construction, so, if provided, they override the default parameters. (Default parameters that can be passed multiple times are not overridden.)
With the run_id returned when submitting the workflow, we can check the status or get the full logs of the workflow run.
The GET /runs/{run_id}/status endpoint can be used to get a simple result with the overall state of your run:
$ curl --user test:test http://localhost:8080/ga4gh/wes/v1/runs/4deb8beb24894e9eb7c74b0f010305d1/status
{
"run_id": "4deb8beb24894e9eb7c74b0f010305d1",
"state": "RUNNING"
}
The possible states here are: QUEUED, INITIALIZING, RUNNING, COMPLETE, EXECUTOR_ERROR, SYSTEM_ERROR, CANCELING, and CANCELED.
To get the detailed information about a workflow run, use the GET /runs/{run_id} endpoint:
$ curl --user test:test http://localhost:8080/ga4gh/wes/v1/runs/4deb8beb24894e9eb7c74b0f010305d1
{
"run_id": "4deb8beb24894e9eb7c74b0f010305d1",
"request": {?
"workflow_attachment": [
"example.cwl"
],
"workflow_url": "example.cwl",
"workflow_type": "cwl",
"workflow_type_version": "v1.0",
"workflow_params": {
"message": "Hello world!"
}
},
"state": "RUNNING",
"run_log": {
"cmd": [
"toil-cwl-runner --outdir=/home/toil/workflows/4deb8beb24894e9eb7c74b0f010305d1/outputs --jobStore=file:/home/toil/workflows/4deb8beb24894e9eb7c74b0f010305d1/toil_job_store /home/toil/workflows/4deb8beb24894e9eb7c74b0f010305d1/execution/example.cwl /home/workflows/4deb8beb24894e9eb7c74b0f010305d1/execution/wes_inputs.json"
],
"start_time": "2021-08-30T17:35:50Z",
"end_time": null,
"stdout": null,
"stderr": null,
"exit_code": null
},
"task_logs": [],
"outputs": {}
}
To cancel a workflow run, use the POST /runs/{run_id}/cancel endpoint:
$ curl --location --request POST 'http://localhost:8080/ga4gh/wes/v1/runs/4deb8beb24894e9eb7c74b0f010305d1/cancel' \
    --user test:test
{
    "run_id": "4deb8beb24894e9eb7c74b0f010305d1"
}
This tutorial walks through the features of Toil necessary for developing a workflow using the Toil Python API.
To begin, consider this short Toil Python workflow which illustrates defining a workflow:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def helloWorld(message):
    return f"Hello, world!, here's a message: {message}"


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_quickstart")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "OFF"
options.clean = "always"
hello_job = Job.wrapFn(helloWorld, "Woot")
with Toil(options) as toil:
print(toil.start(hello_job)) # prints "Hello, world!, ..."
The workflow consists of a single job. The resource requirements for that job are (optionally) specified by keyword arguments (memory, cores, disk). The workflow is run using toil.job.Job.Runner.getDefaultOptions(). Below we explain the components of this code in detail.
The atomic unit of work in a Toil workflow is a Job. User code extends this base class, or uses helper methods like toil.job.Job.addChildJobFn(), to define units of work. For example, here is a more long-winded class-based version of the job in the quick start example:
from toil.job import Job


class HelloWorld(Job):
def __init__(self, message):
Job.__init__(self, memory="2G", cores=2, disk="3G")
self.message = message
def run(self, fileStore):
return f"Hello, world! Here's a message: {self.message}"
In the example a class, HelloWorld, is defined. The constructor requests 2 gigabytes of memory, 2 cores and 3 gigabytes of local disk to complete the work.
The toil.job.Job.run() method is the function the user overrides to get work done. Here it just returns a message.
It is also possible to log a message using toil.job.Job.log(), which will be registered in the log output of the leader process of the workflow:
...
def run(self, fileStore):
self.log(f"Hello, world! Here's a message: {self.message}")
We can add to the previous example to turn it into a complete workflow by adding the necessary function calls to create an instance of HelloWorld and to run this as a workflow containing a single job. For example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


class HelloWorld(Job):
def __init__(self, message):
Job.__init__(self)
self.message = message
def run(self, fileStore):
return f"Hello, world!, here's a message: {self.message}" if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_invokeworkflow")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "OFF"
options.clean = "always"
hello_job = HelloWorld("Woot")
with Toil(options) as toil:
print(toil.start(hello_job))
This uses the toil.common.Toil class, which is used to run and resume Toil workflows. It is used as a context manager and allows for preliminary setup, such as staging of files into the job store on the leader node. An instance of the class is initialized by specifying an options object. The actual workflow is then invoked by calling the toil.common.Toil.start() method, passing the root job of the workflow; if a workflow is being restarted, toil.common.Toil.restart() should be used instead. Note that the context manager should have explicit if/else branches addressing the restart and non-restart cases, keyed on the boolean toil.options.restart.
For example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


class HelloWorld(Job):
def __init__(self, message):
Job.__init__(self)
self.message = message
def run(self, fileStore):
return f"Hello, world!, I have a message: {self.message}" if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_invokeworkflow2")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
if not toil.options.restart:
job = HelloWorld("Woot!")
output = toil.start(job)
else:
output = toil.restart()
print(output)
The call to toil.job.Job.Runner.getDefaultOptions() creates a set of default options for the workflow. The only argument is a description of how to store the workflow's state in what we call a job-store. Here the job-store is a directory on the local file system (in these examples, a temporary directory created with mkdtemp). Alternatively this string can encode other ways to store the necessary state, e.g. an S3 bucket object store location. By default the job-store is deleted if the workflow completes successfully.
The workflow is executed in the final line, which creates an instance of HelloWorld and runs it as a workflow. Note all Toil workflows start from a single starting job, referred to as the root job. The return value of the root job is returned as the result of the completed workflow (see promises below to see how this is a useful feature!).
To allow command line control of the options we can use the toil.job.Job.Runner.getDefaultArgumentParser() method to create an argparse.ArgumentParser object which can be used to parse command line options for a Toil Python workflow. For example:
from toil.common import Toil
from toil.job import Job


class HelloWorld(Job):
def __init__(self, message):
Job.__init__(self)
self.message = message
def run(self, fileStore):
return "Hello, world!, here's a message: %s" % self.message if __name__ == "__main__":
parser = Job.Runner.getDefaultArgumentParser()
options = parser.parse_args()
options.logLevel = "OFF"
options.clean = "always"
hello_job = HelloWorld("Woot")
with Toil(options) as toil:
print(toil.start(hello_job))
This creates a fully fledged Toil Python workflow with all the options Toil exposes as command line arguments. Running this program with --help will print the full list of options.
Alternatively an existing argparse.ArgumentParser object can have Toil command line options added to it with the toil.job.Job.Runner.addToilOptions() method.
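A minimal sketch of that approach (the parser description is illustrative):

import argparse

from toil.job import Job

parser = argparse.ArgumentParser(description="My workflow")
# Add all of Toil's command line options to the existing parser.
Job.Runner.addToilOptions(parser)
options = parser.parse_args()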
In the event that a workflow fails, either because of programmatic error within the jobs being run or because of node failure, the workflow can be resumed. The only situation in which a workflow cannot be reliably resumed is if the job-store itself becomes corrupt.
Critical to resumption is that jobs can be rerun, even if they have apparently completed successfully. Put succinctly, a user defined job should not corrupt its input arguments. That way, regardless of node, network or leader failure the job can be restarted and the workflow resumed.
To resume a workflow specify the "restart" option in the options object passed to toil.common.Toil.start(). If node failures are expected it can also be useful to use the integer "retryCount" option, which will attempt to rerun a job retryCount number of times before marking it fully failed.
In the common scenario that a small subset of jobs fail (including retry attempts) within a workflow Toil will continue to run other jobs until it can do no more, at which point toil.common.Toil.start() will raise a toil.exceptions.FailedJobsException exception. Typically at this point the user can decide to fix the script and resume the workflow or delete the job-store manually and rerun the complete workflow.
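For example, a minimal sketch of handling that failure, assuming options and root_job are defined as in the earlier examples:

from toil.common import Toil
from toil.exceptions import FailedJobsException

try:
    with Toil(options) as toil:
        output = toil.start(root_job)
except FailedJobsException:
    # Fix the workflow code, then resume by running again with
    # options.restart = True (the "restart" option described above).
    raise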
Defining jobs by creating class definitions generally involves the boilerplate of creating a constructor. To avoid this the classes toil.job.FunctionWrappingJob and toil.job.JobFunctionWrappingJob allow functions to be directly converted to jobs. For example, the quick start example (repeated here):
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def helloWorld(message):
    return f"Hello, world!, here's a message: {message}"


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_quickstart")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "OFF"
options.clean = "always"
hello_job = Job.wrapFn(helloWorld, "Woot")
with Toil(options) as toil:
print(toil.start(hello_job)) # prints "Hello, world!, ..."
This is equivalent to the previous example, but uses a function to define the job.
The function call:
Job.wrapFn(helloWorld, "Woot")
creates an instance of toil.job.FunctionWrappingJob that wraps the function.
The keyword arguments memory, cores and disk allow resource requirements to be specified as before. Even if they are not declared in the function's own signature, they can be passed when wrapping the function as a job and will be used to specify its resource requirements.
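For instance, a minimal sketch passing requirements to the quick start's helloWorld() function, which does not declare them itself:

hello_job = Job.wrapFn(helloWorld, "Woot", memory="1G", cores=1, disk="1G")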
We can also use the function wrapping syntax with a job function, a function whose first argument is a reference to the wrapping job. Just like the self argument in a class, this allows access to the methods of the wrapping job; see toil.job.JobFunctionWrappingJob. For example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def helloWorld(job, message):
    job.log(f"Hello world, I have a message: {message}")


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_jobfunctions")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
hello_job = Job.wrapJobFn(helloWorld, "Woot!")
with Toil(options) as toil:
toil.start(hello_job)
Here helloWorld() is a job function. It uses toil.job.Job.log() to log a message that will be printed to the output console. The only subtle difference to note is the line:
hello_job = Job.wrapJobFn(helloWorld, "Woot!")
which uses toil.job.Job.wrapJobFn() to wrap the job function instead of toil.job.Job.wrapFn(), which wraps a vanilla function.
A parent job can have child jobs and follow-on jobs. These relationships are specified by methods of the job class, e.g. toil.job.Job.addChild() and toil.job.Job.addFollowOn().
Consider the jobs as the nodes of a job graph and the child and follow-on relationships as its directed edges. We say that a job B that is on a directed path of child/follow-on edges from a job A is a successor of A; similarly, A is a predecessor of B.
A parent job's child jobs are run directly after the parent job has completed, and in parallel. The follow-on jobs of a job are run after its child jobs and their successors have completed. They are also run in parallel. Follow-ons allow the easy specification of cleanup tasks that happen after a set of parallel child tasks. The following shows a simple example that uses the earlier helloWorld() job function:
from toil.common import Toil
from toil.job import Job


def helloWorld(job, message):
    job.log(f"Hello world, I have a message: {message}")


if __name__ == "__main__":
parser = Job.Runner.getDefaultArgumentParser()
options = parser.parse_args()
options.logLevel = "INFO"
options.clean = "always"
j1 = Job.wrapJobFn(helloWorld, "first")
j2 = Job.wrapJobFn(helloWorld, "second or third")
j3 = Job.wrapJobFn(helloWorld, "second or third")
j4 = Job.wrapJobFn(helloWorld, "last")
j1.addChild(j2)
j1.addChild(j3)
j1.addFollowOn(j4)
with Toil(options) as toil:
toil.start(j1)
In the example four jobs are created, first j1 is run, then j2 and j3 are run in parallel as children of j1, finally j4 is run as a follow-on of j1.
There are multiple shorthand functions to achieve the same workflow, for example:
from toil.common import Toil
from toil.job import Job


def helloWorld(job, message):
    job.log(f"Hello world, I have a message: {message}")


if __name__ == "__main__":
parser = Job.Runner.getDefaultArgumentParser()
options = parser.parse_args()
options.logLevel = "INFO"
options.clean = "always"
j1 = Job.wrapJobFn(helloWorld, "first")
j2 = j1.addChildJobFn(helloWorld, "second or third")
j3 = j1.addChildJobFn(helloWorld, "second or third")
j4 = j1.addFollowOnJobFn(helloWorld, "last")
with Toil(options) as toil:
toil.start(j1)
Equivalently defines the workflow, where the functions toil.job.Job.addChildJobFn() and toil.job.Job.addFollowOnJobFn() are used to create job functions as children or follow-ons of an earlier job.
Job graphs are not limited to trees, and can express arbitrary directed acyclic graphs. For a precise definition of legal graphs see toil.job.Job.checkJobGraphForDeadlocks(). The previous example could be specified as a DAG as follows:
from toil.common import Toil
from toil.job import Job


def helloWorld(job, message):
    job.log(f"Hello world, I have a message: {message}")


if __name__ == "__main__":
parser = Job.Runner.getDefaultArgumentParser()
options = parser.parse_args()
options.logLevel = "INFO"
options.clean = "always"
j1 = Job.wrapJobFn(helloWorld, "first")
j2 = j1.addChildJobFn(helloWorld, "second or third")
j3 = j1.addChildJobFn(helloWorld, "second or third")
j4 = j2.addChildJobFn(helloWorld, "last")
j3.addChild(j4)
with Toil(options) as toil:
toil.start(j1)
Note the use of an extra child edge to make j4 a child of both j2 and j3.
The previous examples show a workflow being defined outside of a job. However, Toil also allows jobs to be created dynamically within jobs. For example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def binaryStringFn(job, depth, message=""):
if depth > 0:
job.addChildJobFn(binaryStringFn, depth - 1, message + "0")
job.addChildJobFn(binaryStringFn, depth - 1, message + "1")
else:
job.log(f"Binary string: {message}") if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_dynamic")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
toil.start(Job.wrapJobFn(binaryStringFn, depth=5))
The job function binaryStringFn logs all possible binary strings of length n (here n=5), creating a total of 2^(n+1) - 1 jobs (63 for n=5) dynamically and recursively. Static and dynamic creation of jobs can be mixed in a Toil workflow, with jobs defined within a job or job function being created at run time.
The previous example of dynamic job creation shows variables from a parent job being passed to a child job. Such forward variable passing is naturally specified by recursive invocation of successor jobs within parent jobs. This can also be achieved statically by passing around references to the return variables of jobs. In Toil this is achieved with promises, as illustrated in the following example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def fn(job, i):
job.log("i is: %s" % i, level=100)
    return i + 1


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_promises")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
j1 = Job.wrapJobFn(fn, 1)
j2 = j1.addChildJobFn(fn, j1.rv())
j3 = j1.addFollowOnJobFn(fn, j2.rv())
with Toil(options) as toil:
toil.start(j1)
Running this workflow results in three log messages from the jobs: "i is: 1" from j1, "i is: 2" from j2 and "i is: 3" from j3.
The return value from the first job is promised to the second job by the call to toil.job.Job.rv() in the following line:
j2 = j1.addChildJobFn(fn, j1.rv())
The value of j1.rv() is a promise, rather than the actual return value of the function, because j1 has not yet been evaluated for the given input. A promise (toil.job.Promise) is essentially a pointer to the return value that is replaced by the actual return value once it has been evaluated. Therefore, when j2 is run the promise becomes 2.
Promises also support indexing of return values:
def parent(job):
    indexable = Job.wrapJobFn(fn)
    job.addChild(indexable)
    job.addFollowOnFn(raiseWrap, indexable.rv(2))


def raiseWrap(arg):
    raise RuntimeError(arg)  # raises "2"


def fn(job):
    return (0, 1, 2, 3)
Promises can be quite useful. For example, we can combine dynamic job creation with promises to achieve a job creation process that mimics the functional patterns possible in many programming languages:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def binaryStrings(job, depth, message=""):
if depth > 0:
s = [
job.addChildJobFn(binaryStrings, depth - 1, message + "0").rv(),
job.addChildJobFn(binaryStrings, depth - 1, message + "1").rv(),
]
return job.addFollowOnFn(merge, s).rv()
    return [message]


def merge(strings):
    return strings[0] + strings[1]


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_promises2")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.loglevel = "OFF"
options.clean = "always"
with Toil(options) as toil:
print(toil.start(Job.wrapJobFn(binaryStrings, depth=5)))
The return value of the workflow is a list of all binary strings of length 5, computed recursively. Although a toy example, it demonstrates how closely Toil workflows can mimic typical programming patterns.
Promised requirements are a special case of Promises that allow a job's return value to be used as another job's resource requirements.
This is useful when, for example, a job's storage requirement is determined by a file staged to the job store by an earlier job:
import os

from toil.common import Toil
from toil.job import Job, PromisedRequirement
from toil.lib.io import mkdtemp


def parentJob(job):
downloadJob = Job.wrapJobFn(
stageFn,
"file://" + os.path.realpath(__file__),
cores=0.1,
memory="32M",
disk="1M",
)
job.addChild(downloadJob)
analysis = Job.wrapJobFn(
analysisJob,
fileStoreID=downloadJob.rv(0),
disk=PromisedRequirement(downloadJob.rv(1)),
)
    job.addFollowOn(analysis)


def stageFn(job, url):
importedFile = job.fileStore.import_file(url)
    return importedFile, importedFile.size


def analysisJob(job, fileStoreID):
# now do some analysis on the file
    pass


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_requirements")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
toil.start(Job.wrapJobFn(parentJob))
Note that this also makes use of the size attribute of the FileID object. This promised requirements mechanism can also be used in combination with an aggregator for multiple jobs' output values:
def parentJob(job):
aggregator = []
for fileNum in range(0, 10):
downloadJob = Job.wrapJobFn(stageFn, "file://" + os.path.realpath(__file__), cores=0.1, memory='32M', disk='1M')
job.addChild(downloadJob)
aggregator.append(downloadJob)
analysis = Job.wrapJobFn(analysisJob,
fileStoreID=downloadJob.rv(0),
disk=PromisedRequirement(lambda xs: sum(xs), [j.rv(1) for j in aggregator]))
job.addFollowOn(analysis)
Just like regular promises, the return value must be determined prior to scheduling any job that depends on it. In our example above, notice how the dependent job is a follow-on of the parent while the promising jobs are children of the parent; this ordering ensures that all promises are properly fulfilled.
The toil.fileStore.FileID class is a small wrapper around Python's builtin string class. It is used to represent a file's ID in the file store, and has a size attribute that is the file's size in bytes. This object is returned by importFile and writeGlobalFile.
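For example, a minimal sketch of a job function (the function name is illustrative) that inspects the size attribute:

from toil.job import Job


def sizeAwareJobFn(job):
    # Write a scratch file into the file store and inspect the returned FileID.
    scratchFile = job.fileStore.getLocalTempFile()
    with open(scratchFile, "w") as f:
        f.write("some data")
    fileID = job.fileStore.writeGlobalFile(scratchFile)
    # FileID subclasses str, so it can be used anywhere a file ID string is
    # expected; its size attribute gives the file's size in bytes.
    job.log(f"Stored {fileID} ({fileID.size} bytes)")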
It is frequently the case that a workflow will want to create files, both persistent and temporary, during its run. The toil.fileStores.abstractFileStore.AbstractFileStore class is used by jobs to manage these files in a manner that guarantees cleanup and resumption on failure.
The toil.job.Job.run() method has a file store instance as an argument. The following example shows how this can be used to create temporary files that persist for the length of the job, be placed in a specified local disk of the node and that will be cleaned up, regardless of failure, when the job finishes:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


class LocalFileStoreJob(Job):
def run(self, fileStore):
# self.tempDir will always contain the name of a directory within the allocated disk space reserved for the job
scratchDir = self.tempDir
# Similarly create a temporary file.
        scratchFile = fileStore.getLocalTempFile()


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_managing")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
    # Create an instance of LocalFileStoreJob which will have at least 2 gigabytes of storage space.
j = LocalFileStoreJob(disk="2G")
# Run the workflow
with Toil(options) as toil:
toil.start(j)
Job functions can also access the file store for the job. The equivalent of the LocalFileStoreJob class is
def localFileStoreJobFn(job):
scratchDir = job.tempDir
scratchFile = job.fileStore.getLocalTempFile()
Note that the fileStore attribute is accessed as an attribute of the job argument.
In addition to temporary files that exist for the duration of a job, the file store allows the creation of files in a global store, which persist during the workflow and are globally accessible (hence the name) between jobs. For example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


def globalFileStoreJobFn(job):
job.log(
"The following example exercises all the methods provided "
"by the toil.fileStores.abstractFileStore.AbstractFileStore class"
)
# Create a local temporary file.
scratchFile = job.fileStore.getLocalTempFile()
# Write something in the scratch file.
with open(scratchFile, "w") as fH:
fH.write("What a tangled web we weave")
# Write a copy of the file into the file-store; fileID is the key that can be used to retrieve the file.
# This write is asynchronous by default
fileID = job.fileStore.writeGlobalFile(scratchFile)
# Write another file using a stream; fileID2 is the
# key for this second file.
with job.fileStore.writeGlobalFileStream(cleanup=True) as (fH, fileID2):
fH.write(b"Out brief candle")
# Now read the first file; scratchFile2 is a local copy of the file that is read-only by default.
scratchFile2 = job.fileStore.readGlobalFile(fileID)
# Read the second file to a desired location: scratchFile3.
scratchFile3 = os.path.join(job.tempDir, "foo.txt")
job.fileStore.readGlobalFile(fileID2, userPath=scratchFile3)
# Read the second file again using a stream.
with job.fileStore.readGlobalFileStream(fileID2) as fH:
print(fH.read()) # This prints "Out brief candle"
# Delete the first file from the global file-store.
job.fileStore.deleteGlobalFile(fileID)
# It is unnecessary to delete the file keyed by fileID2 because we used the cleanup flag,
    # which removes the file after this job and all its successors have run (if the file still exists)


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_managing2")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
toil.start(Job.wrapJobFn(globalFileStoreJobFn))
The example demonstrates the global read, write and delete functionality of the file-store, using both local copies of the files and streams to read and write the files. It covers all the methods provided by the file store interface.
Note that the file-store provides no functionality to update an existing "global" file; barring deletion, files are immutable. Also worth noting is that there is no file system hierarchy for files in the global file store. These limitations allow us to fairly easily support different object stores and to use caching to limit the amount of network file transfer between jobs.
External files can be imported into or exported out of the job store prior to running a workflow when the toil.common.Toil context manager is used on the leader. The context manager provides the methods toil.common.Toil.importFile() and toil.common.Toil.exportFile() for this purpose. The destination and source locations of such files are described with URLs passed to the two methods. Local files can be imported and exported as relative paths, which are interpreted relative to the directory from which the Toil workflow is initially run.
Using absolute paths and an appropriate schema where possible (prefixing with "file://" or "s3://", for example) makes imports and exports less ambiguous and is recommended.
A list of the currently supported URLs can be found at toil.jobStores.abstractJobStore.AbstractJobStore.importFile(). To import an external file into the job store as a shared file, pass the optional sharedFileName parameter to that method.
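A minimal sketch of importing under a shared name, assuming options is defined as in the examples above (the paths and the shared name are illustrative):

from toil.common import Toil

with Toil(options) as toil:
    # Import the file under a fixed, well-known shared name instead of an
    # auto-generated file ID.
    toil.importFile("file:///absolute/path/to/in.txt", sharedFileName="nickname.txt")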
If a workflow fails for any reason, an imported file acts like any other file in the job store. If the workflow was configured such that it is not cleaned up on a failed run, the file will persist in the job store and need not be staged again when the workflow is resumed.
Example:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp
from toil.test import get_data


class HelloWorld(Job):
def __init__(self, id):
Job.__init__(self)
self.inputFileID = id
def run(self, fileStore):
with fileStore.readGlobalFileStream(self.inputFileID, encoding="utf-8") as fi:
with fileStore.writeGlobalFileStream(encoding="utf-8") as (
fo,
outputFileID,
):
fo.write(fi.read() + "World!")
        return outputFileID


if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_staging")
tmp: str = mkdtemp("tutorial_staging_tmp")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
if not toil.options.restart:
inputFileID = toil.importFile(
"file://" + get_data("toil/test/docs/scripts/stagingExampleFiles/in.txt")
)
outputFileID = toil.start(HelloWorld(inputFileID))
else:
outputFileID = toil.restart()
toil.exportFile(
outputFileID,
"file://" + os.path.join(tmp + "out.txt"),
)
Docker containers are commonly used with Toil. The combination of Toil and Docker allows for pipelines to be fully portable between any platform that has both Toil and Docker installed. Docker eliminates the need for the user to do any other tool installation or environment setup.
In order to use Docker containers with Toil, Docker must be installed on all workers of the cluster. Instructions for installing Docker can be found on the Docker website.
When using Toil-based autoscaling, Docker will be automatically set up on the cluster's worker nodes, so no additional installation steps are necessary. Further information on using Toil-based autoscaling can be found in the Running a Workflow with Autoscaling documentation.
In order to use docker containers in a Toil workflow, the container can be built locally or downloaded in real time from an online docker repository like Quay. If the container is not in a repository, the container's layers must be accessible on each node of the cluster.
When invoking Docker containers from within a Toil workflow, it is strongly recommended that you use dockerCall(), a Toil job function provided in toil.lib.docker. dockerCall leverages Docker's own Python API and provides container cleanup on job failure. When Docker containers are run without this feature, failed jobs can result in resource leaks. Docker's API can be found at docker-py.
In order to use dockerCall, your installation of Docker must be set up to run without sudo. Instructions for setting this up can be found here.
An example of a basic dockerCall is below:
dockerCall(job=job,
tool='quay.io/ucsc_cgl/bwa',
workDir=job.tempDir,
parameters=['index', '/data/reference.fa'])
Note the assumption that reference.fa file is located in /data. This is Toil's standard convention as a mount location to reduce boilerplate when calling dockerCall. Users can choose their own mount locations by supplying a volumes kwarg to dockerCall, such as: volumes={working_dir: {'bind': '/data', 'mode': 'rw'}}, where working_dir is an absolute path on the user's filesystem.
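For instance, a minimal sketch of overriding the default mount, assuming job is the current job and working_dir is an absolute path on the host:

from toil.lib.docker import dockerCall

dockerCall(job=job,
           tool='quay.io/ucsc_cgl/bwa',
           workDir=job.tempDir,
           parameters=['index', '/data/reference.fa'],
           # Mount working_dir on the host at /data inside the container.
           volumes={working_dir: {'bind': '/data', 'mode': 'rw'}})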
dockerCall can also be added to workflows like any other job function:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.docker import apiDockerCall
from toil.lib.io import mkdtemp

align = Job.wrapJobFn(
    apiDockerCall, image="ubuntu", working_dir=os.getcwd(), parameters=["ls", "-lha"]
)

if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_docker")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
toil.start(align)
cgl-docker-lib contains dockerCall-compatible Dockerized tools that are commonly used in bioinformatics analysis.
The documentation provides guidelines for developing your own Docker containers that can be used with Toil and dockerCall. In order for a container to be compatible with dockerCall, it must have an ENTRYPOINT set to a wrapper script, as described in cgl-docker-lib containerization standards. This can be set by passing in the optional keyword argument 'entrypoint'. For example, a minimal sketch (the entrypoint value shown is illustrative):
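from toil.lib.docker import dockerCall

dockerCall(job=job,
           tool='quay.io/ucsc_cgl/bwa',
           workDir=job.tempDir,
           parameters=['index', '/data/reference.fa'],
           # Override the image's default ENTRYPOINT.
           entrypoint=['/bin/sh', '-c'])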
dockerCall currently supports the 75 keyword arguments found in the Python Docker API, under the 'run' command.
It is sometimes desirable to run services, such as a database or server, concurrently with a workflow. The toil.job.Job.Service class provides a simple mechanism for spawning such a service within a Toil workflow, allowing precise specification of the start and end time of the service, and providing start and end methods to use for initialization and cleanup. The following simple, conceptual example illustrates how services work:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp


class DemoService(Job.Service):
def start(self, fileStore):
# Start up a database/service here
# Return a value that enables another process to connect to the database
return "loginCredentials"
def check(self):
# A function that if it returns False causes the service to quit
# If it raises an exception the service is killed and an error is reported
return True
def stop(self, fileStore):
# Cleanup the database here
        pass


j = Job()
s = DemoService()
loginCredentialsPromise = j.addService(s)


def dbFn(loginCredentials):
# Use the login credentials returned from the service's start method to connect to the service
    pass


j.addChildFn(dbFn, loginCredentialsPromise)

if __name__ == "__main__":
jobstore: str = mkdtemp("tutorial_services")
os.rmdir(jobstore)
options = Job.Runner.getDefaultOptions(jobstore)
options.logLevel = "INFO"
options.clean = "always"
with Toil(options) as toil:
toil.start(j)
In this example the DemoService starts a database in its start method, returning an object that indicates how a client job would access the database. The service's stop method cleans up the database, while the service's check method is polled periodically to check the service is alive.
A DemoService instance is added as a service of the root job j, with resource requirements specified. The return value from toil.job.Job.addService() is a promise to the return value of the service's start method. When the promise is fulfilled it will represent how to connect to the database. The promise is passed to a child job of j, which uses it to make a database connection. The services of a job are started before any of its successors have been run and stopped after all the successors of the job have completed successfully.
Multiple services can be created per job, all run in parallel. Additionally, services can define sub-services using toil.job.Job.Service.addChild(). This allows complex networks of services to be created, e.g. Apache Spark clusters, within a workflow.
Services complicate resuming a workflow after failure, because they can create complex dependencies between jobs. For example, consider a service that provides a database that multiple jobs update. If the database service fails and loses state, it is not clear that just restarting the service will allow the workflow to be resumed, because jobs that created that state may have already finished. To get around this problem Toil supports checkpoint jobs, specified as the boolean keyword argument checkpoint to a job or wrapped function, e.g.:
j = Job(checkpoint=True)
A checkpoint job is rerun if one or more of its successors fails its retry attempts, until it itself has exhausted its retry attempts. Upon restarting a checkpoint job all its existing successors are first deleted, and then the job is rerun to define new successors. By checkpointing a job that defines a service, upon failure of the service the database and the jobs that access the service can be redefined and rerun.
To make the implementation of checkpoint jobs simple, a job can only be a checkpoint if when first defined it has no successors, i.e. it can only define successors within its run method.
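As a sketch of this pattern (the helper function below is a made-up placeholder), a checkpoint job is added to the graph as a leaf and defines its successors only inside its run method:

from toil.job import Job

def redoableWork():
    # Placeholder for work that can safely be redone after a restart
    pass

class CheckpointedSetup(Job):
    def __init__(self):
        # A checkpoint job must be a leaf when first added to the job graph
        super().__init__(checkpoint=True)

    def run(self, fileStore):
        # Successors (e.g. a service and the jobs that use it) are defined here,
        # so they can be deleted and recreated if the checkpoint is rerun
        self.addChildFn(redoableWork)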
Let A be a root job potentially with children and follow-ons. Without an encapsulated job the simplest way to specify a job B which runs after A and all its successors is to create a parent of A, call it Ap, and then make B a follow-on of Ap. e.g.:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp

if __name__ == "__main__":
    # A is a job with children and follow-ons, for example:
    A = Job()
    A.addChild(Job())
    A.addFollowOn(Job())

    # B is a job which needs to run after A and its successors
    B = Job()

    # The way to do this without encapsulation is to make a parent of A, Ap,
    # and make B a follow-on of Ap
    Ap = Job()
    Ap.addChild(A)
    Ap.addFollowOn(B)

    jobstore: str = mkdtemp("tutorial_encapsulations")
    os.rmdir(jobstore)
    options = Job.Runner.getDefaultOptions(jobstore)
    options.logLevel = "INFO"
    options.clean = "always"

    with Toil(options) as toil:
        print(toil.start(Ap))
Creating an encapsulated job E(A) of A saves us from making Ap; instead we can write:
import os

from toil.common import Toil
from toil.job import Job
from toil.lib.io import mkdtemp

if __name__ == "__main__":
    # A
    A = Job()
    A.addChild(Job())
    A.addFollowOn(Job())

    # Encapsulate A
    A = A.encapsulate()

    # B is a job which needs to run after A and its successors
    B = Job()

    # With encapsulation A and its successor subgraph appear to be a single job, hence:
    A.addChild(B)

    jobstore: str = mkdtemp("tutorial_encapsulations2")
    os.rmdir(jobstore)
    options = Job.Runner.getDefaultOptions(jobstore)
    options.logLevel = "INFO"
    options.clean = "always"

    with Toil(options) as toil:
        print(toil.start(A))
Note that the call to toil.job.Job.encapsulate() creates the toil.job.Job.EncapsulatedJob.
If you are packaging your workflow(s) as a pip-installable distribution on PyPI, you might be tempted to declare Toil as a dependency in your setup.py, via the install_requires keyword argument to setup(). Unfortunately, this does not work, for two reasons: For one, Toil uses Setuptools' extra mechanism to manage its own optional dependencies. If you explicitly declared a dependency on Toil, you would have to hard-code a particular combination of extras (or no extras at all), robbing the user of the choice of which Toil extras to install. Secondly, and more importantly, declaring a dependency on Toil would only lead to Toil being installed on the leader node of a cluster, but not the worker nodes. Auto-deployment does not work here because Toil cannot auto-deploy itself, the classic "Which came first, chicken or egg?" problem.
In other words, you shouldn't explicitly depend on Toil. Document the dependency instead (as in "This workflow needs Toil version X.Y.Z to be installed") and optionally add a version check to your setup.py. Refer to the check_version() function in the toil-lib project's setup.py for an example. Alternatively, you can also just depend on toil-lib and you'll get that check for free.
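As a minimal sketch of such a check in setup.py (assuming only that toil.version is importable; this is not the toil-lib check_version() implementation):

def check_toil_version(required='X.Y.Z'):  # document the version your workflow needs
    try:
        from toil.version import version
    except ImportError:
        raise RuntimeError('This workflow needs Toil version %s to be installed.' % required)
    if version != required:
        raise RuntimeError('This workflow needs Toil version %s, found %s.' % (required, version))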
If your workflow depends on a dependency of Toil, consider not making that dependency explicit either. If you do, you risk a version conflict between your project and Toil. The pip utility may silently ignore that conflict, breaking either Toil or your workflow. It is safest to simply assume that Toil installs that dependency for you. The only downside is that you are locked into the exact version of that dependency that Toil declares. But such is life with Python, which, unlike Java, has no means of isolating the dependencies of different software components within the same process, and whose favored software distribution utility is incapable of properly resolving overlapping dependencies and detecting conflicts.
Computational Genomics Lab's Dockstore-based production system provides workflow authors a way to run Dockerized versions of their pipelines in an automated, scalable fashion. To be compatible with this system, a workflow should meet the following requirements. In addition to the Docker container, a Common Workflow Language descriptor file is needed. For inputs:
For outputs:
The Toil class configures and starts a Toil run.
Specifically the batch system, job store, and its configuration.
Note that this is very light-weight and that the bulk of the work is done when the context is entered.
This method must be called in the body of a with Toil(...) as toil: statement. This method should not be called more than once for a workflow that has not finished.
By default, returns None if the file does not exist.
See toil.jobStores.abstractJobStore.AbstractJobStore.importFile() for a full description
See toil.jobStores.abstractJobStore.AbstractJobStore.exportFile() for a full description
This directory is always required to exist on a machine, even if the Toil worker has not run yet. If your workers and leader have different temp directories, you may need to set TOIL_WORKDIR.
Will be consistent for all processes on a given machine, and different for all processes on different machines.
If an in-memory filesystem is available, it is used. Otherwise, the local workflow directory, which may be on a shared network filesystem, is used.
The job store interface is an abstraction layer that hides the specific details of file storage, for example standard file systems, S3, etc. The AbstractJobStore API is implemented to support a given file store, e.g. S3. Implement this API to support a new file store.
JobStores are responsible for storing toil.job.JobDescription (which relate jobs to each other) and files.
Actual toil.job.Job objects are stored in files, referenced by JobDescriptions. All the non-file CRUD methods the JobStore provides deal in JobDescriptions and not full, executable Jobs.
To actually get ahold of a toil.job.Job, use toil.job.Job.loadJob() with a JobStore and the relevant JobDescription.
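As a sketch of that flow (Job.loadJob() is named above; the load_job() accessor used to fetch the JobDescription is an assumption about the job store API):

from toil.job import Job

# jobStore is an initialized job store instance, job_id a known job store ID
jobDesc = jobStore.load_job(job_id)   # assumed accessor returning a JobDescription
job = Job.loadJob(jobStore, jobDesc)  # yields the executable Job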
The instance will not be fully functional until either initialize() or resume() is invoked. Note that the destroy() method may be invoked on the object with or without prior invocation of either of these two methods.
Takes and stores the locator string for the job store, which will be accessible via self.locator.
Create the physical storage for this job store, allocate a workflow ID and persist the given Toil configuration to the store.
Raises an exception if the root job hasn't fulfilled its promise yet.
Currently supported schemes are:
Raises FileNotFoundError if the file does not exist.
Refer to AbstractJobStore.import_file() documentation for currently supported URL schemes.
Note that the helper method _exportFile is used to read from the source and write to destination. To implement any optimizations that circumvent this, the _exportFile method should be overridden by subclasses of AbstractJobStore.
May raise an error if file existence cannot be determined.
Currently supported schemes are:
Raises FileNotFoundError if the URL doesn't exist.
Raises FileNotFoundError if the URL doesn't exist.
Has a readable stream interface, unlike read_from_url() which takes a writable stream.
Fixes jobs that might have been partially updated. Resets the try counts and removes jobs that are not successors of the current root job.
Files associated with the assigned ID will be accepted even if the JobDescription has never been created or updated.
Must call jobDescription.pre_update_hook()
Returns a publicly accessible URL to the given file in the job store. The returned URL starts with 'http:', 'https:' or 'file:'. The returned URL may expire as early as 1h after it has been returned. Throws an exception if the file does not exist.
May declare the job to have failed (see toil.job.JobDescription.setupJobAfterFailure()) if there is evidence of a failed update attempt.
Must call jobDescription.pre_update_hook()
This operation is idempotent, i.e. deleting a job twice or deleting a non-existent job will succeed silently.
FIXME: some implementations may not raise this
FIXME: some implementations may not raise this
The file at the given local path may not be modified after this method returns!
Note! Implementations of readFile need to respect/provide the executable attribute on FileIDs.
Note that job stores which encrypt files might return overestimates of file sizes, since the encrypted file may have been padded to the nearest block, augmented with an initialization vector, etc.
Throws an exception if the file does not exist.
Only unread logs will be read unless the read_all parameter is set.
Overwriting the current contents of pid.log is a feature, not a bug of this method. Other methods will rely on always having the most current pid available. So far there is no reason to store any old pids.
The initialized file contains the characters "NO". This should only be changed when the user runs the "toil kill" command.
Changing this file to a "YES" triggers a kill of the leader process. The workers are expected to be cleaned up by the leader.
see https://github.com/DataBiosphere/toil/issues/4218
Functions to wrap jobs and return values (promises).
The subclass of Job for wrapping user functions.
The keywords memory, cores, disk, accelerators, preemptible and checkpoint are reserved keyword arguments that, if specified, will be used to determine the resources required for the job, as in toil.job.Job.__init__(). If they are keyword arguments to the function, they will be extracted from the function definition, but may be overridden by the user (as you would expect).
The subclass of FunctionWrappingJob for wrapping user job functions.
To enable the job function to get access to the toil.fileStores.abstractFileStore.AbstractFileStore instance (see toil.job.Job.run()), it is made a variable of the wrapping job called fileStore.
To specify a job's resource requirements the following default keyword arguments can be specified:
For example to wrap a function into a job we would call:
Job.wrapJobFn(myJob, memory='100k', disk='1M', cores=0.1)
The subclass of Job for encapsulating a job, allowing a subgraph of jobs to be treated as a single job.
Let A be the root job of a job subgraph and B be another job we'd like to run after A and all its successors have completed, for this use encapsulate:
# Job A and subgraph, Job B
A, B = A(), B()
Aprime = A.encapsulate()
Aprime.addChild(B)
# B will run after A and all its successors have completed, A and its subgraph of
# successors in effect appear to be just one job.
If the job being encapsulated has predecessors (e.g. is not the root job), then the encapsulated job will inherit these predecessors. If predecessors are added to the job being encapsulated after the encapsulated job is created then the encapsulating job will NOT inherit these predecessors automatically. Care should be exercised to ensure the encapsulated job has the proper set of predecessors.
The return value of an encapsulated job (as accessed by the toil.job.Job.rv() function) is the return value of the root job, e.g. A().encapsulate().rv() and A().rv() will resolve to the same value after A or A.encapsulate() has been run.
Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run.
Services allow things like databases and servers to be started and accessed by jobs in a workflow.
Follow-on jobs will be run after the child jobs and their successors have been run.
The "promise" representing a return value of the job's run method, or, in case of a function-wrapping job, the wrapped function's return value.
Prepare this job (the promisor) so that its promises can register themselves with it, when the jobs they are promised to (promisees) are serialized.
The promisee holds the reference to the promise (usually as part of the job arguments), and when it is pickled, so are the promises it refers to. Pickling a promise triggers it to be registered with the promisor.
The class used to reference return values of jobs/services not yet run/started.
References a return value from a toil.job.Job.run() or toil.job.Job.Service.start() method as a promise before the method itself is run.
Let T be a job. Instances of Promise (termed a promise) are returned by T.rv(), which is used to reference the return value of T's run function. When the promise is passed to the constructor (or as an argument to a wrapped function) of a different, successor job, the promise will be replaced by the actual referenced return value. This mechanism allows the return value from one job's run method to be an input argument to another job before the former job's run function has been executed.
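A small illustration of the mechanism (the function names are placeholders):

from toil.job import Job

def producer(job):
    return 42

def consumer(job, value):
    # By the time this runs, the promise has been replaced by the actual 42
    assert value == 42

root = Job.wrapJobFn(producer)
root.addChildJobFn(consumer, root.rv())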
(involving toil.job.Promise instances.)
Use when resource requirements depend on the return value of a parent function. PromisedRequirements can be modified by passing a function that takes the Promise as input.
For example, let f, g, and h be functions. Then a Toil workflow can be defined as follows:

A = Job.wrapFn(f)
B = A.addChildFn(g, cores=PromisedRequirement(A.rv()))
C = B.addChildFn(h, cores=PromisedRequirement(lambda x: 2 * x, B.rv()))
Converts Promise instance to PromisedRequirement.
Jobs are the units of work in Toil which are composed into workflows.
This method must be called by any overriding constructor.
This uses the fact that the self._description instance variable should always be set after __init__().
If __init__() has not been called, raise an error.
It will be used by various actions implemented inside the Job class.
Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
Follow-on jobs will be run after the child jobs and their successors have been run.
The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run.
Services allow things like databases and servers to be started and accessed by jobs in a workflow.
See toil.job.JobFunctionWrappingJob for a definition of a job function.
See toil.job.JobFunctionWrappingJob for a definition of a job function.
The temp dir is created on the first call and the same path is returned on the first and future calls. Returns the path to the temp dir; see job.fileStore.getLocalTempDir.
Convenience function for constructor of toil.job.FunctionWrappingJob.
Convenience function for constructor of toil.job.JobFunctionWrappingJob.
The "promise" representing a return value of the job's run method, or, in case of a function-wrapping job, the wrapped function's return value.
Prepare this job (the promisor) so that its promises can register themselves with it, when the jobs they are promised to (promisees) are serialized.
The promisee holds the reference to the promise (usually as part of the job arguments), and when it is pickled, so are the promises it refers to. Pickling a promise triggers it to be registered with the promisor.
See toil.job.Job.checkJobGraphConnected(), toil.job.Job.checkJobGraphAcyclic() and toil.job.Job.checkNewCheckpointsAreLeafVertices() for more info.
A root job is a job with no predecessors (i.e. which are not children, follow-ons, or services).
Only deals with jobs created here, rather than loaded from the job store.
As execution always starts from one root job, having multiple root jobs will cause a deadlock to occur.
Only deals with jobs created here, rather than loaded from the job store.
A follow-on edge (A, B) between two jobs A and B is equivalent to adding a child edge to B from (1) A, (2) each child of A, and (3) the successors of each child of A. We call each such edge an "implied" edge. The augmented job graph is the job graph including all the implied edges.
For a job graph G = (V, E) the algorithm is O(|V|^2). It is O(|V| + |E|) for a graph with no follow-ons. The former follow-on case could be improved!
Only deals with jobs created here, rather than loaded from the job store.
A job is a leaf if it has no successors.
A checkpoint job must be a leaf when initially added to the job graph. When its run method is invoked it can then create direct successors. This restriction is made to simplify implementation.
Only works on connected components of jobs not yet added to the JobStore.
Examples for deferred functions are ones that handle cleanup of resources external to Toil, like Docker containers, files outside the work directory, etc.
Only considers jobs in this job's subgraph that are newly added, not loaded from the job store.
Ignores service jobs.
The Job's JobDescription must have already had a real jobStoreID assigned to it.
Does not save the JobDescription.
Will abort the job if the "download_only" debug flag is set.
Can be hinted a list of file path pairs outside and inside the job container, in which case the container environment can be reconstructed.
The class used to store all the information that the Toil Leader ever needs to know about a Job.
Can be obtained from an actual (i.e. executable) Job object, and can be used to obtain the Job object from the JobStore.
Never contains other Jobs or JobDescriptions: all reference is by ID.
Subclassed into variants for checkpoint jobs and service jobs that have their specific parameters.
For each job, produces a named tuple with its various names and its original job store ID. The jobs in the chain are in execution order.
If the job hasn't run yet or it didn't chain, produces a one-item list.
(in the order they need to start in)
Follow-ons will come before children.
Phases execute higher numbers to lower numbers.
Will be empty if the job has no unfinished services.
Takes the file store ID that the body is stored at, and the required user script module.
The file store ID can also be "firstJob" for the root job, stored as a shared file instead.
Fails if no body is attached; check has_body() first.
If those jobs have multiple predecessor relationships, they may still be blocked on other jobs.
Returns None when at the final phase (all successors done), and an empty collection if there are more phases but they can't be entered yet (e.g. because we are waiting for the job itself to run).
The predicate function is called with the job's ID.
Treats all other successors as complete and forgets them.
The predicate function is called with the service host job's ID.
Treats all other services as complete and forgets them.
That is to say, all those that have been completed and removed.
When updated in the JobStore, we will save over the other JobDescription.
Useful for chaining jobs: the chained-to job can replace the parent job.
Merges cleanup state and successors other than this job from the job being replaced into this one.
If a parent ServiceHostJob ID is given, that parent service will be started first, and must have already been added.
Does not modify our own ID or those of finished predecessors. IDs not present in the renames dict are left as-is.
Called by the Job saving logic when this JobDescription meets the JobStore and has its ID assigned.
Overridden to perform setup work (like hooking up flag files for service jobs) that requires the JobStore.
Reduce the remainingTryCount if greater than zero and set the memory to be at least as big as the default memory (in case of exhaustion of memory, which is common).
Requires a configuration to have been assigned (see toil.job.Requirer.assignConfig()).
Assumes logJobStoreFileID is set.
The try count set on the JobDescription, or the default based on the retry count from the config if none is set.
Called by the job store.
The Runner contains the methods needed to configure and start a Toil run.
Deprecated by toil.common.Toil.start.
(see Job.Runner.getDefaultOptions and Job.Runner.addToilOptions) starting with this job. The job parameter is the root job of the workflow. Raises toil.exceptions.FailedJobsException if failed jobs remain at the end of the run. Returns the return value of the root job's run function.
The AbstractFileStore is an abstraction of a Toil run's shared storage.
Also provides the interface to other Toil facilities used by user code, including:
Stores user files in the jobStore, but keeps them separate from actual jobs.
May implement caching.
Passed as argument to the toil.job.Job.run() method.
Access to files is only permitted inside the context manager provided by toil.fileStores.abstractFileStore.AbstractFileStore.open().
Also responsible for committing completed jobs back to the job store with an update operation, and allowing that commit operation to be waited for.
This is a destructive operation and it is important to ensure that there are no other running processes on the system that are modifying or using the file store for this workflow.
This is intended to be the last call to the file store in a Toil run, called by the batch system cleanup function upon batch system shutdown.
File operations are only permitted inside the context manager.
Implementations must only yield from within with super().open(job):.
Disk usage is measured at the end of the job. TODO: Sample periodically and record peak usage.
The directory will only persist for the duration of the job.
If the file is in a FileStore-managed temporary directory (i.e. from toil.fileStores.abstractFileStore.AbstractFileStore.getLocalTempDir()), it will become a local copy of the file, eligible for deletion by toil.fileStores.abstractFileStore.AbstractFileStore.deleteLocalFile().
If an executable file on the local filesystem is uploaded, its executability will be preserved when it is downloaded again.
(to be announced if the job fails)
If destination is not None, it gives the path that the file was downloaded to. Otherwise, assumes that the file was streamed.
Must be called by readGlobalFile() and readGlobalFileStream() implementations.
If mutable is True, then a copy of the file will be created locally so that the original is not modified and does not change the file for other jobs. If mutable is False, then a link can be created to the file, saving disk resources. The file that is downloaded will be executable if and only if it was originally uploaded from an executable file on the local filesystem.
If a user path is specified, it is used as the destination. If a user path isn't specified, the file is stored in the local temp directory with an encoded name.
The destination file must not be deleted by the user; it can only be deleted through deleteLocalFile.
Implementations must call logAccess() to report the download.
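A sketch of the calling pattern described above, inside a job's run(fileStore) method (parameter names follow the description):

# file_id is a FileID previously returned by writeGlobalFile
shared = fileStore.readGlobalFile(file_id)                 # may be a link to the cached copy
private = fileStore.readGlobalFile(file_id, mutable=True)  # private, modifiable copy
fileStore.deleteLocalFile(file_id)                         # the only sanctioned way to delete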
The yielded file handle does not need to and should not be closed explicitly.
Implementations must call logAccess() to report the download.
If a FileID or something else with a non-None 'size' field, gets that.
Otherwise, asks the job store to poll the file's size.
Note that the job store may overestimate the file's size, for example if it is encrypted and had to be augmented with an IV or other encryption framing.
Raises an OSError with an errno of errno.ENOENT if no such local copies exist. Thus, cannot be called multiple times in succession.
The files deleted are all those previously read from this file ID via readGlobalFile by the current job into the job's file-store-provided temp directory, plus the file that was written to create the given file ID, if it was written by the current job from the job's file-store-provided temp directory.
To ensure that the job can be restarted if necessary, the delete will not happen until after the job's run method has completed.
Useful for things like the error logs of Docker containers. The leader will show it to the user or organize it appropriately for user-level log information.
May bump the version number of the job.
May start an asynchronous process. Call waitForCommit() to wait on that process. You must waitForCommit() before committing any further updates to the job. During the asynchronous process, it is safe to modify the job; modifications after this call will not be committed until the next call.
This function is called by this job's successor to ensure that it does not begin modifying the job store until after this job has finished doing so.
Might be called when startCommit is never called on a particular instance, in which case it does not block.
This is intended to be called on batch system shutdown.
It is used to represent a file's ID in the file store, and has a size attribute that is the file's size in bytes. This object is returned by importFile and writeGlobalFile.
Calls into the file store can use bare strings; size will be queried from the job store if unavailable in the ID.
The batch system interface is used by Toil to abstract over different ways of running batches of jobs, for example on Slurm clusters, Kubernetes clusters, or a single node. The toil.batchSystems.abstractBatchSystem.AbstractBatchSystem API is implemented to run jobs using a given job management system.
Environment variables allow passing of scheduler-specific parameters.
For SLURM there are two environment variables - the first applies to all jobs, while the second defines the partition to use for parallel jobs:
export TOIL_SLURM_ARGS="-t 1:00:00 -q fatq"
export TOIL_SLURM_PE='multicore'
For TORQUE there are two environment variables - one for everything but the resource requirements, and another for the resource requirements (without the -l prefix):
export TOIL_TORQUE_ARGS="-q fatq"
export TOIL_TORQUE_REQS="walltime=1:00:00"
For GridEngine (SGE, UGE), there is an additional environmental variable to define the parallel environment for running multicore jobs:
export TOIL_GRIDENGINE_PE='smp'
export TOIL_GRIDENGINE_ARGS='-q batch.q'
For HTCondor, additional parameters can be included in the submit file passed to condor_submit:
export TOIL_HTCONDOR_PARAMS='requirements = TARGET.has_sse4_2 == true; accounting_group = test'
The environment variable is parsed as a semicolon-separated string of parameter = value pairs.
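An illustrative parse of that format (this is not Toil's actual parser):

import os

raw = os.environ.get('TOIL_HTCONDOR_PARAMS', '')
params = {}
for pair in (p.strip() for p in raw.split(';')):
    if pair:
        key, _, value = pair.partition('=')
        params[key.strip()] = value.strip()
# e.g. {'requirements': 'TARGET.has_sse4_2 == true', 'accounting_group': 'test'}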
If it does, setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
Indicates whether this batch system invokes BatchSystemSupport.workerCleanup() after the last job for a particular workflow invocation finishes. Note that the term worker refers to an entire node, not just a worker process. A worker process may run more than one job sequentially, and more than one concurrent worker process may exist on a worker node, for the same workflow. The batch system is said to shut down after the last worker process terminates.
This method must be called before the first job is issued to this batch system, and only if supportsAutoDeployment() returns True, otherwise it will raise an exception.
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
If no useful message is available, return None.
This can be used to report what resource is the limiting factor when scheduling jobs, for example. If the leader thinks the workflow is stuck, the message can be displayed to the user to help them diagnose why it might be stuck.
The worker process will typically inherit the environment of the machine it is running on but this method makes it possible to override specific variables in that inherited environment before the worker is launched. Note that this mechanism is different to the one used by the worker internally to set up the environment of a job. A call to this method affects all jobs issued after this method returns. Note to implementors: This means that you would typically need to copy the variables before enqueuing a job.
If no value is provided it will be looked up from the current environment.
Can be used to ask the Toil worker to do things in-process (such as configuring environment variables, hot-deploying user scripts, or cleaning up a node) that would otherwise require a wrapping "executor" process.
The Service class allows databases and servers to be spawned within a Toil workflow.
Should be subclassed by the user to define services.
Is not executed as a job; runs within a ServiceHostJob.
Toil specific exceptions.
Test make targets, invoked as $ make <target>, subject to which environment variables are set (see Running Integration Tests).
| TARGET | DESCRIPTION |
| test | Invokes all tests. |
| integration_test | Invokes only the integration tests. |
| test_offline | Skips building the Docker appliance and only invokes tests that have no docker dependencies. |
| integration_test_local | Makes integration tests easier to debug locally by running them serially without redirecting output, so that it appears on the terminal as expected. |
Before running tests for the first time, initialize your virtual environment following the steps in Installing Plugins.
Run all tests (including slow tests):
$ make test
Run only quick tests (as of Jul 25, 2018, this was ~ 20 minutes):
$ export TOIL_TEST_QUICK=True; make test
Run an individual test with:
$ make test tests=src/toil/test/sort/sortTest.py::SortTest::testSort
The default value for tests is "src" which includes all tests in the src/ subdirectory of the project root. Tests that require a particular feature will be skipped implicitly. If you want to explicitly skip tests that depend on a currently installed feature, use
$ make test tests="-m 'not aws' src"
This will run only the tests that don't depend on the aws extra, even if that extra is currently installed. Note the distinction between the terms feature and extra. Every extra is a feature but there are features that are not extras, such as the gridengine feature. To skip tests involving both the gridengine feature and the aws extra, use the following:
$ make test tests="-m 'not aws and not gridengine' src"
Often it is simpler to use pytest directly, instead of calling the make wrapper. This usually works as expected, but some tests need some manual preparation. To run a specific test with pytest, use the following:
python3 -m pytest src/toil/test/sort/sortTest.py::SortTest::testSort
For more information, see the pytest documentation.
These tests are generally only run in our CI workflow due to their resource requirements and cost. However, they can be made available for local testing:
$ make push_docker
$ export TOIL_TEST_INTEGRATIVE=True
$ export TOIL_X_KEYNAME=[Your Keyname]
$ export TOIL_X_ZONE=[Desired Zone]
Where X is one of our currently supported cloud providers (GCE, AWS).
| TOIL_TEST_TEMP | An absolute path to a directory where Toil tests will write their temporary files. Defaults to the system's standard temporary directory. |
| TOIL_TEST_INTEGRATIVE | If True, this allows the integration tests to run. Only valid when running the tests from the source directory via make test or make test_parallel. |
| TOIL_AWS_KEYNAME | An AWS keyname (see Preparing your AWS environment), which is required to run the AWS tests. |
| TOIL_GOOGLE_PROJECTID | A Google Cloud account projectID (see Running in Google Compute Engine (GCE)), which is required to run the Google Cloud tests. |
| TOIL_TEST_QUICK | If True, long running tests are skipped. |
Some tests may fail with an ImportError if the required extras are not installed. Install Toil with all of the extras to prevent such errors.
Docker is needed for some of the tests. Follow the appropriate installation instructions for your system on their website to get started.
When running make test you might still get the following error:
$ make test
Please set TOIL_DOCKER_REGISTRY, e.g. to quay.io/USER.
To solve, make an account with Quay and specify it like so:
$ TOIL_DOCKER_REGISTRY=quay.io/USER make test
where USER is your Quay username.
For convenience you may want to add this variable to your bashrc by running
$ echo 'export TOIL_DOCKER_REGISTRY=quay.io/USER' >> $HOME/.bashrc
If you're running Toil's Mesos tests, be sure to create the virtualenv with --system-site-packages to include the Mesos Python bindings. Verify this by activating the virtualenv and running pip list | grep mesos. On macOS, this may come up empty. To fix it, run the following:
for i in /usr/local/lib/python2.7/site-packages/*mesos*; do ln -snf $i venv/lib/python2.7/site-packages/; done
To develop on features reliant on the Toil Appliance (the docker image toil uses for AWS autoscaling), you should consider setting up a personal registry on Quay or Docker Hub. Because the Toil Appliance images are tagged with the Git commit they are based on and because only commits on our master branch trigger an appliance build on Quay, as soon as a developer makes a commit or dirties the working copy they will no longer be able to rely on Toil to automatically detect the proper Toil Appliance image. Instead, developers wishing to test any appliance changes in autoscaling should build and push their own appliance image to a personal Docker registry. This is described in the next section.
Note! Toil checks if the docker image specified by TOIL_APPLIANCE_SELF exists prior to launching by using the docker v2 schema. This should be valid for any major docker repository, but there is an option to override this if desired using the option: --forceDockerAppliance.
Here is a general workflow (similar instructions apply when using Docker Hub):
$ make docker
to automatically build a docker image that can now be uploaded to your personal Quay account. On Docker Desktop, containerd may have to be enabled. If you have not installed the Toil source code yet, see Installing Plugins.
export TOIL_DOCKER_REGISTRY=quay.io/<MY_QUAY_USERNAME>
to your .bashrc or equivalent.
$ make push_docker
which will upload the docker image to your Quay account. Take note of the image's tag for the next step.
The Toil Appliance container can also be useful as a test environment since it can simulate a Toil cluster locally. An important caveat for this is autoscaling, since autoscaling will only work on an EC2 instance and cannot (at this time) be run on a local machine.
To spin up a local cluster, start by using the following Docker run command to launch a Toil leader container:
docker run \
--entrypoint=mesos-master \
--net=host \
-d \
--name=leader \
--volume=/home/jobStoreParentDir:/jobStoreParentDir \
quay.io/ucsc_cgl/toil:3.6.0 \
--registry=in_memory \
--ip=127.0.0.1 \
--port=5050 \
--allocation_interval=500ms
A couple notes on this command: the -d flag tells Docker to run in daemon mode so the container will run in the background. To verify that the container is running you can run docker ps to see all containers. If you want to run your own container rather than the official UCSC container you can simply replace the quay.io/ucsc_cgl/toil:3.6.0 parameter with your own container name.
Also note that we are not mounting the job store directory itself, but rather the location where the job store will be written. Due to complications with running Docker on MacOS, we recommend only mounting directories within your home directory. The next command will launch the Toil worker container with similar parameters:
docker run \
--entrypoint=mesos-slave \
--net=host \
-d \
--name=worker \
--volume=/home/jobStoreParentDir:/jobStoreParentDir \
quay.io/ucsc_cgl/toil:3.6.0 \
--work_dir=/var/lib/mesos \
--master=127.0.0.1:5050 \
--ip=127.0.0.1 \
--attributes=preemptable:False \
--resources=cpus:2
Note here that we are specifying 2 CPUs and a non-preemptable worker. We can easily change either or both of these in a logical way. To change the number of cores we can change the 2 to whatever number you like, and to change the worker to be preemptable we change preemptable:False to preemptable:True. Also note that the same volume is mounted into the worker. This is needed since both the leader and worker write and read from the job store. Now that your cluster is running, you can run
docker exec -it leader bash
to get a shell in your leader 'node'. You can also replace the leader parameter with worker to get shell access in your worker.
If you want to run Docker inside this Docker cluster (Dockerized tools, perhaps), you should also mount in the Docker socket via -v /var/run/docker.sock:/var/run/docker.sock. This will give the Docker client inside the Toil Appliance access to the Docker engine on the host. Client/engine version mismatches have been known to cause issues, so we recommend using Docker version 1.12.3 on the host to be compatible with the Docker client installed in the Appliance. Finally, be careful where you write files inside the Toil Appliance - 'child' Docker containers launched in the Appliance will actually be siblings to the Appliance since the Docker engine is located on the host. This means that the 'child' container can only mount in files from the Appliance if the files are located in a directory that was originally mounted into the Appliance from the host - that way the files are accessible to the sibling container. Note: if Docker can't find the file/directory on the host it will silently fail and mount in an empty directory.
When running toil-wdl-runner with Singularity, Singularity will decompress images to sandbox directories by default. This can take time if a workflow has lots of images. To avoid this, access to FUSE can be given to the Docker container at startup. There are two main ways to do this. Either run all the Docker containers in privileged mode:
docker run \
-d \
--name=toil_leader \
--privileged \
quay.io/ucsc_cgl/toil:6.2.0
Or pass through the /dev/fuse device node into the container:
docker run \
-d \
--name=toil_leader \
--device=/dev/fuse \
quay.io/ucsc_cgl/toil:6.2.0
toil-wdl-runner will handle the logic from there.
In general, as developers and maintainers of the code, we adhere to the following guidelines:
Say there is an issue numbered #123 titled Foo does not work. The branch name would be issues/123-fix-foo and the title of the commit would be Fix foo in case of bar (resolves #123).
./contrib/admin/test-pr theirusername their-branch issues/123-fix-description-here
This must be repeated every time the PR submitter updates their PR, after checking to see that the update is not malicious.
If there is no existing issue for the PR after which the branch can be named, the reviewer of the PR should first create one.
Developers who have push access to the main Toil repository are encouraged to make their pull requests from within the repository, to avoid this step.
When squashing a PR from multiple authors, please add Co-authored-by to give credit to all contributing authors.
See Issue #2816 for more details.
These are the steps to take to publish a Toil release:
baseVersion = 'X.Y.0a1'
Make it look like this instead:
baseVersion = 'X.Y.Z'
Commit your change to the branch.
baseVersion = 'X.Y+1.0a1'
Make sure to replace X and Y+1 with actual numbers.
In the contrib/hooks directory, there are two scripts, mypy-after-commit.py and mypy-before-push.py, that can be set up as Git hooks to make sure you don't accidentally push commits that would immediately fail type-checking. These are supposed to eliminate the need to run make mypy constantly. You can install them into your Git working copy like this:
ln -rs ./contrib/hooks/mypy-after-commit.py .git/hooks/post-commit
ln -rs ./contrib/hooks/mypy-before-push.py .git/hooks/pre-push
After you make a commit, the post-commit script will start type-checking it, and if it takes too long re-launch the process in the background. When you push, the pre-push script will see if the commit you are pushing type-checked successfully, and if it hasn't been type-checked but is currently checked out, it will be type-checked. If type-checking fails, the push will be aborted.
Type-checking will only be performed if you are in a Toil development virtual environment. If you aren't, the scripts won't do anything.
To bypass or override the pre-push hook, if it is wrong or if you need to push something that doesn't typecheck, you can git push --no-verify. If the scripts get confused about whether a commit actually typechecks, you can clear out the type-checking result cache, which is in /var/run/user/<your UID>/.mypy_toil_result_cache on Linux and in .mypy_toil_result_cache in the Toil repo on Mac.
To uninstall the scripts, delete .git/hooks/post-commit and .git/hooks/pre-push.
See toil.lib.retry.
retry() can be used to decorate any function based on the list of errors one wishes to retry on.
This list of errors can contain normal Exception objects, and/or RetryCondition objects wrapping Exceptions to include additional conditions.
For example, retrying on one Exception (HTTPError):
from requests import get
from requests.exceptions import HTTPError

from toil.lib.retry import retry


@retry(errors=[HTTPError])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
Or:
from requests import get
from requests.exceptions import HTTPError

from toil.lib.retry import retry


@retry(errors=[HTTPError, ValueError])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
The examples above will retry for the default interval on any errors specified in the errors= argument list.
To retry on specifically 500/502/503/504 errors, you could specify an ErrorCondition object instead, for example:
from requests import get
from requests.exceptions import HTTPError

from toil.lib.retry import ErrorCondition, retry


@retry(errors=[
    ErrorCondition(
        error=HTTPError,
        error_codes=[500, 502, 503, 504]
    )])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
To retry on specifically errors containing the phrase "NotFound":
from requests import get
from requests.exceptions import HTTPError

from toil.lib.retry import ErrorCondition, retry


@retry(errors=[
    ErrorCondition(
        error=HTTPError,
        error_message_must_include="NotFound"
    )])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
To retry on all HTTPError errors EXCEPT an HTTPError containing the phrase "NotFound":
from requests import get
from requests.exceptions import HTTPError

from toil.lib.retry import ErrorCondition, retry


@retry(errors=[
    HTTPError,
    ErrorCondition(
        error=HTTPError,
        error_message_must_include="NotFound",
        retry_on_this_condition=False
    )])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
To retry on boto3's specific status errors, an example of the implementation is:
import boto3
from botocore.exceptions import ClientError

from toil.lib.retry import ErrorCondition, retry


@retry(errors=[
    ErrorCondition(
        error=ClientError,
        boto_error_codes=["BucketNotFound"]
    )])
def boto_bucket(bucket_name):
    boto_session = boto3.session.Session()
    s3_resource = boto_session.resource('s3')
    return s3_resource.Bucket(bucket_name)
Any combination of these will also work, provided the codes are matched to the correct exceptions. A ValueError will not return a 404, for example.
The retry function as a decorator should make retrying functions easier and clearer. It also encourages smaller independent functions, as opposed to lumping many different things that may need to be retried on different conditions in the same function.
The ErrorCondition object tries to take some of the heavy lifting of writing specific retry conditions and boil it down to an API that covers all common use-cases without the user having to write any new bespoke functions.
Use-cases covered currently:
If new functionality is needed, it's currently best practice in Toil to add functionality to the ErrorCondition itself rather than making a new custom retry method.
This document contains checklists for dealing with PRs. More general PR information is available at Pull Requests.
This checklist is to be kept in sync with the checklist in the pull request template.
When reviewing a PR, do the following:
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
This checklist is to be kept in sync with the checklist in the pull request template.
When merging a PR, do the following:
The following diagram lays out the software architecture of Toil.
As noted in Job Basics, a job is the atomic unit of work in a Toil workflow. Workflows extend the Job class to define units of work. These jobs are pickled and stored in the job-store by the leader, and are retrieved and un-pickled by the worker when they are scheduled to run.
During scheduling, Toil does not work with the actual Job objects. Instead, JobDescription objects are used to store all the information that the Toil Leader ever needs to know about the Job. This includes requirements information, dependency information, body object to run, worker command to issue, etc.
Internally, the JobDescription object is referenced by its jobStoreID, which is often not human readable. However, the Job and JobDescription objects contain several human-readable names that are useful for logging and identification:
| jobName | Name of the kind of job this is. This may be used in job store IDs and logging. Also used to let the cluster scaler learn a model for how long the job will take. Defaults to the job class's name if no real user-defined name is available. For a FunctionWrappingJob, the jobName is replaced by the wrapped function's name. For a CWL workflow, the jobName is the class name of the internal job that is running the CWL workflow, such as "CWLJob". |
| unitName | Name of this instance of this kind of job. If set by the user, it will appear with the jobName in logging. For a CWL workflow, the unitName is the dotted path from the workflow down to the task being run, including numbers for scatter steps. |
| displayName | A human-readable name to identify this particular job instance. Used as an identifier of the job class in the stats report. Defaults to the job class's name if no real user-defined name is available. For CWL workflows, this includes the jobName and the unitName. |
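In a Python workflow, unitName and displayName can be supplied when constructing a job, while jobName defaults from the class; a small sketch:

from toil.job import Job

j = Job(unitName='sample1', displayName='AlignReads')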
Toil's statistics and logging system is implemented in a joint class StatsAndLogging. The class can be instantiated and run as a thread on the leader, where it polls for new log files in the job store with the read_logs() method. These are JSON files, which contain structured data. Structured log messages from user Python code, stored under workers.logs_to_leader, from the file store's log_to_leader() method, will be logged at the appropriate level. The text output that the worker captured for all its chained jobs, in logs.messages, will be logged at debug level in the worker's output. If --writeLogs or --writeLogsGzip is provided, the received worker logs will also be stored by the StatsAndLogging thread into per-job files inside the job store, using writeLogFiles().
Note that the worker only fills this in if running with debug logging on, or if --writeLogsFromAllJobs is set. Otherwise, logs from successful jobs are not persisted. Logs from failed jobs are persisted differently; they are written to the file store, and the log file is made available through toil.job.JobDescription.getLogFileHandle(). The leader thread retrieves these logs and calls back into StatsAndLogging to print or locally save them as appropriate.
The CWL and WDL interpreters use log_user_stream() to inject CWL and WDL task-level logs into the stats and logging system. The full text of those logs gets stored in the JSON stats files, and when the StatsAndLogging thread sees them it reports and saves them, similarly to how it treats Toil job logs.
To ship the statistics and the non-failed-job logs around, the job store has a logs mailbox system: the write_logs() method deposits a string, and the read_logs() method on the leader passes the strings to a callback. It tracks a concept of new and old, based on whether the string has been read already by anyone, and one can read only the new values, or all values observed. The stats and logging system uses this to pass around structured JSON holding both log data and worker-measured stats, and expects the StatsAndLogging thread to be the only live reader.
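A sketch of that mailbox pattern from the leader's side (the method names are as given above; the exact signatures are assumptions):

def handle_log_blob(blob):
    # Each blob is a JSON string holding structured log data and worker stats
    print('received %d bytes of stats/logging JSON' % len(blob))

jobStore.write_logs('...')                          # a worker deposits a message
jobStore.read_logs(handle_log_blob)                 # the leader drains unread messages
jobStore.read_logs(handle_log_blob, read_all=True)  # or replay everything observed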
Toil implements lots of optimizations designed for scalability. Here we detail some of the key optimizations.
The leader process is currently implemented as a single thread. Most of the leader's tasks revolve around processing the state of jobs, each stored as a file within the job-store. To minimise the load on this thread, each worker does as much work as possible to manage the state of the job it is running. As a result, with a couple of minor exceptions, the leader process never needs to write or update the state of a job within the job-store. For example, when a job is complete and has no further successors the responsible worker deletes the job from the job-store, marking it complete. The leader then only has to check for the existence of the file when it receives a signal from the batch-system to know that the job is complete. This off-loading of state management is orthogonal to future parallelization of the leader.
The scheduling of successor jobs is partially managed by the worker, reducing the number of individual jobs the leader needs to process. Currently this is very simple: if there is a single next successor job to run, and its resources fit within (and closely match) the resources of the current job, then the job is run immediately on the worker without returning to the leader. Further extensions of this strategy are possible, but for many workflows which define a series of serial successors (e.g. map sequencing reads, post-process mapped reads, etc.) this pattern is very effective at reducing leader workload.
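An illustrative sketch of that chaining decision (not Toil's actual code):

def can_chain(current, successors):
    # Chain only when exactly one successor remains and its requirements
    # fit within (and closely match) the current job's resources
    if len(successors) != 1:
        return False
    nxt = successors[0]
    return (nxt.cores <= current.cores
            and nxt.memory <= current.memory
            and nxt.disk <= current.disk)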
Critical to running at large-scale is dealing with intermittent node failures. Toil is therefore designed to always be resumable, provided the job-store does not become corrupt. This robustness allows Toil to run on preemptible nodes, which are only available when others are not willing to pay more to use them. Designing workflows that divide into many short individual jobs that can use preemptable nodes allows for workflows to be efficiently scheduled and executed.
Running bioinformatic pipelines often requires passing large datasets between jobs. Toil caches the results from jobs such that child jobs running on the same node can directly use the same file objects, thereby eliminating the need for an intermediary transfer to the job store. Caching also reduces the burden on the local disks, because multiple jobs can share a single file. The resulting drop in I/O allows pipelines to run faster, and, by the sharing of files, allows users to run more jobs in parallel by reducing overall disk requirements.
To demonstrate the efficiency of caching, we ran an experimental internal pipeline on 3 samples from the TCGA Lung Squamous Carcinoma (LUSC) dataset. The pipeline takes the tumor and normal exome fastqs, and the tumor RNA fastq, as input, and predicts MHC-presented neoepitopes in the patient that are potential targets for T-cell based immunotherapies. The pipeline was run individually on the samples on c3.8xlarge machines on AWS (60GB RAM, 600GB SSD storage, 32 cores). The pipeline aligns the data to hg19-based references, predicts MHC haplotypes using PHLAT, calls mutations using 2 callers (MuTect and RADIA) and annotates them using SnpEff, then predicts MHC:peptide binding using the IEDB suite of tools before running an in-house rank boosting algorithm on the final calls.
To optimize the time taken, the pipeline is written such that mutations are called on a per-chromosome basis from the whole-exome bams and are merged into a complete vcf. Running MuTect in parallel on whole-exome bams requires each MuTect job to download the complete tumor and normal bams to its working directory -- an operation that quickly fills the disk and limits the parallelizability of jobs. The workflow was run in Toil, with and without caching, and Figure 2 shows that the workflow finishes faster in the cached case while using less disk on average than the uncached run. We believe that the benefits of caching arising from file transfers will be much higher on magnetic disk-based storage systems as compared to the SSD systems we tested this on.
The CWL document and input document are loaded using the 'cwltool.load_tool' module. This performs normalization and URI expansion (for example, relative file references are turned into absolute file URIs), validates the document against the CWL schema, initializes Python objects corresponding to major document elements (command line tools, workflows, workflow steps), and performs static type checking that sources and sinks have compatible types.
Input files referenced by the CWL document and input document are imported into the Toil file store. CWL documents may use any URI scheme supported by Toil file store, including local files and object storage.
The 'location' fields of File references are updated to reflect the import token returned by the Toil file store.
For directory inputs, the directory listing is stored in a Directory object. Each individual file is imported into the Toil file store.
An initial workflow Job is created from the toplevel CWL document. Then, control passes to the Toil engine which schedules the initial workflow job to run.
When the toplevel workflow job runs, it traverses the CWL workflow and creates a Toil job for each step. The dependency graph is expressed by making downstream jobs children of upstream jobs, and initializing the child jobs with an input object containing the promises of output from upstream jobs.
Because Toil jobs have a single output, but CWL permits steps to have multiple output parameters that may feed into multiple other steps, the input to a CWLJob is expressed with an "indirect dictionary". This is a dictionary of input parameters, where each entry value is a tuple of a promise and a promise key. When the job runs, the indirect dictionary is turned into a concrete input object by resolving each promise into its actual value (which is always a dict), and then looking up the promise key to get the actual value for the input parameter.
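An illustrative resolution of an indirect dictionary (the names are made up; this is not toil.cwl's actual code):

def resolve_indirect(indirect):
    # Each value is a (promise, key) tuple; by run time each promise has
    # resolved to the upstream step's output dict
    return {name: outputs[key] for name, (outputs, key) in indirect.items()}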
If a workflow step specifies a scatter, then a scatter job is created and connected into the workflow graph as described above. When the scatter step runs, it creates child jobs for each parameterization of the scatter. A gather job is added as a follow-on to gather the outputs into arrays.
When running a command line tool, it first creates output and temporary directories under the Toil local temp dir. It runs the command line tool using the single_job_executor from CWLTool, providing a Toil-specific constructor for filesystem access, and overriding the default PathMapper to use ToilPathMapper.
The ToilPathMapper keeps track of a file's symbolic identifier (the Toil FileID), its local path on the host (the value returned by readGlobalFile) and the location of the file inside the Docker container.
After executing single_job_executor from CWLTool, it gets back the output object and status. If the underlying job failed, an exception is raised. Files from the output object are added to the file store using writeGlobalFile and the 'location' fields of File references are updated to reflect the token returned by the Toil file store.
When the workflow completes, it returns an indirect dictionary linking to the outputs of the job steps that contribute to the final output. This indirect dictionary is the value returned by toil.start() or toil.restart(), and it is resolved to get the final output object. The files in this object are exported from the file store to 'outdir' on the host file system, and the 'location' field of each File reference is updated to reflect the final exported location of the output files.
Toil requires at least the following permissions in an IAM role to operate on a cluster. These are added by default when launching a cluster. However, ensure that they are present if creating a custom IAM role when launching a cluster with the --awsEc2ProfileArn parameter.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:*",
"s3:*",
"sdb:*",
"iam:PassRole"
],
"Resource": "*"
}
]
}
If you want to run a Toil Python workflow in a distributed environment, on multiple worker machines, either in the cloud or on a bare-metal cluster, the Python code needs to be made available to those other machines. If the workflow's main module imports other modules, those modules also need to be made available on the workers. Toil can automatically do that for you, with a little help on your part. We call this feature auto-deployment of a workflow.
Let's first examine various scenarios of auto-deploying a workflow, including some that, as we'll see shortly, cannot be auto-deployed. Lastly, we'll deal with the issue of declaring Toil as a dependency of a workflow that is packaged as a setuptools distribution.
Toil can be easily deployed to a remote host. First, assuming you've followed our Preparing your AWS environment section to install Toil and use it to create a remote leader node on (in this example) AWS, you can now log into this node using the Ssh-Cluster Command and, once on the remote host, create and activate a virtualenv (making sure to use the --system-site-packages option!):
$ virtualenv --system-site-packages venv
$ . venv/bin/activate
Note the --system-site-packages option, which ensures that globally-installed packages are accessible inside the virtualenv. Do not (re)install Toil after this! The --system-site-packages option already exposes the globally-installed Toil and its dependencies inside the virtualenv for you.
From here, you can install a project and its dependencies:
$ tree
.
├── util
│   ├── __init__.py
│   └── sort
│       ├── __init__.py
│       └── quick.py
└── workflow
    ├── __init__.py
    └── main.py

3 directories, 5 files
$ pip install matplotlib
$ cp -R workflow util venv/lib/python3.9/site-packages
Ideally, your project would have a setup.py file (see setuptools) which streamlines the installation process:
$ tree
.
├── util
│   ├── __init__.py
│   └── sort
│       ├── __init__.py
│       └── quick.py
├── workflow
│   ├── __init__.py
│   └── main.py
└── setup.py

3 directories, 6 files
$ pip install .
Or, if your project has been published to PyPI:
$ pip install my-project
In each case, we have created a virtualenv with the --system-site-packages flag in the venv subdirectory, then installed the matplotlib distribution from PyPI along with the two packages that our project consists of. (Again, both Python and Toil are assumed to be present on the leader and all worker nodes.)
We can now run our workflow:
$ python3 main.py --batchSystem=kubernetes …
Note that using the --single-version-externally-managed flag with setup.py will prevent the installation of your package as an .egg. It will also disable the automatic installation of your project's dependencies.
This scenario applies if a Python workflow imports files that are its siblings:
$ cd my_project
$ ls
userScript.py utilities.py
$ ./userScript.py --batchSystem=kubernetes …
Here userScript.py imports additional functionality from utilities.py. Toil detects that userScript.py has sibling Python files and copies them to the workers alongside the main Python file. Note that sibling Python files will be auto-deployed regardless of whether they are actually imported by the workflow: all .py files residing in the same directory as the main workflow Python file will be auto-deployed.
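For example, userScript.py might simply import its sibling at the top level (hypothetical names):

# userScript.py
from utilities import sort_records  # utilities.py sits next to this file

# Because utilities.py is a sibling of the main workflow file, Toil
# copies it to each worker automatically; no packaging is required.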
This structure is a suitable method of organizing the source code of reasonably complicated workflows.
Recall that in Python, a package is a directory containing one or more .py files, one of which must be called __init__.py, and optionally other packages. For more involved workflows that contain a significant amount of code, this is the recommended way of organizing the source code. Because we use a package hierarchy, the main workflow file is actually a Python module. It is merely one of the modules in the package hierarchy. We need to inform Toil that we want to use a package hierarchy by invoking Python's -m option. This enables Toil to identify the entire set of modules belonging to the workflow and copy all of them to each worker. Note that while using the -m option is optional in the scenarios above, it is mandatory in this one.
The following shell session illustrates this:
$ cd my_project
$ tree
.
├── util
│   ├── __init__.py
│   └── sort
│       ├── __init__.py
│       └── quick.py
└── workflow
    ├── __init__.py
    └── main.py

3 directories, 5 files
$ python3 -m workflow.main --batchSystem=kubernetes …
Here the workflow entry point module main.py does not reside in the current directory, but is part of a package called workflow, in a subdirectory of the current directory. Additional functionality is in a separate module called util.sort.quick which corresponds to util/sort/quick.py. Because we invoke the workflow via python3 -m workflow.main, Toil can determine the root directory of the hierarchy (my_project in this case) and copy all Python modules underneath it to each worker. The -m option is documented in Python's command line documentation.
When -m is passed, Python adds the current working directory to sys.path, the list of root directories to be considered when resolving a module name like workflow.main. Without that added convenience we'd have to run the workflow as PYTHONPATH="$PWD" python3 -m workflow.main. This also means that Toil can detect the root directory of the invoked module's package hierarchy even if it isn't the current working directory. In other words we could do this:
$ cd my_project
$ export PYTHONPATH="$PWD"
$ cd /some/other/dir
$ python3 -m workflow.main --batchSystem=kubernetes …
Also note that the root directory itself must not be a package, i.e. it must not contain an __init__.py.
Bare-metal clusters typically mount a shared file system like NFS on each node. If every node has that file system mounted at the same path, you can place your project on that shared filesystem and run your Python workflow from there. Additionally, you can clone the Toil source tree into a directory on that shared file system and you won't even need to install Toil on every worker. Be sure to add both your project directory and the Toil clone to PYTHONPATH. Toil replicates PYTHONPATH from the leader to every worker.
Toil currently only supports a tempdir set to a local, non-shared directory.
The term Toil Appliance refers to the Ubuntu-based Docker image that Toil uses for the machines in Toil-managed clusters, and for executing jobs on Kubernetes. It's easily deployed, only needs Docker, and allows a consistent environment on all Toil clusters. To specify a different image, see the Toil Environment Variables section. For more information on the Toil Appliance, see the Running in AWS section.
There are several environment variables that affect the way Toil runs.
| TOIL_CHECK_ENV | A flag that determines whether Toil will try to refer back to a Python virtual environment in which it is installed when composing commands that may be run on other hosts. If set to True and Toil is installed in the current virtual environment, it will use absolute paths to its own executables (and the virtual environment must thus be available at the same path on all nodes). Otherwise, Toil internal commands such as _toil_worker will be resolved according to the PATH on the node where they are executed. This setting can be useful in a shared HPC environment, where users may have their own Toil installations in virtual environments. |
| TOIL_WORKDIR | An absolute path to a directory where Toil will write its temporary files. This directory must exist on each worker node and may be set to a different value on each worker. The --workDir command line option overrides this. When using the Toil docker container, such as on Kubernetes, this defaults to /var/lib/toil. When using Toil autoscaling with Mesos, this is somewhere inside the Mesos sandbox. In all other cases, the system's standard temporary directory is used. |
| TOIL_WORKDIR_OVERRIDE | An absolute path to a directory where Toil will write its temporary files. This overrides TOIL_WORKDIR and the --workDir command line option. |
| TOIL_COORDINATION_DIR | An absolute path to a directory where Toil will write its lock files. This directory must exist on each worker node and may be set to a different value on each worker. The --coordinationDir command line option overrides this. |
| TOIL_COORDINATION_DIR_OVERRIDE | An absolute path to a directory where Toil will write its lock files. This overrides TOIL_COORDINATION_DIR and the --coordinationDir command line option. |
| TOIL_BATCH_LOGS_DIR | A directory to save batch system logs into, where the leader can access them. The --batchLogsDir option overrides this. Only works for grid engine batch systems such as gridengine, htcondor, torque, slurm, and lsf. |
| TOIL_KUBERNETES_HOST_PATH | A path on Kubernetes hosts that will be mounted as the Toil work directory in the workers, to allow for shared caching. Will be created if it doesn't already exist. |
| TOIL_KUBERNETES_OWNER | A name prefix for easy identification of Kubernetes jobs. If not set, Toil will use the current user name. |
| TOIL_KUBERNETES_SERVICE_ACCOUNT | A service account name to apply when creating Kubernetes pods. |
| TOIL_KUBERNETES_POD_TIMEOUT | Seconds to wait for a scheduled Kubernetes pod to start running. |
| KUBE_WATCH_ENABLED | A boolean variable that allows users to use the Kubernetes watch stream feature instead of polling for running jobs. Default value is False. |
| TOIL_APPLIANCE_SELF | The fully qualified reference for the Toil Appliance you wish to use, in the form REPO/IMAGE:TAG. quay.io/ucsc_cgl/toil:3.6.0 and cket/toil:3.5.0 are both examples of valid options. Note that since Docker defaults to Dockerhub repos, only quay.io repos need to specify their registry. |
| TOIL_DOCKER_REGISTRY | The URL of the registry of the Toil Appliance image you wish to use. Docker will use Dockerhub by default, but the quay.io registry is also very popular and easily specifiable by setting this option to quay.io. |
| TOIL_DOCKER_NAME | The name of the Toil Appliance image you wish to use. Generally this is simply toil but this option is provided to override this, since the image can be built with arbitrary names. |
| TOIL_AWS_SECRET_NAME | For the Kubernetes batch system, the name of a Kubernetes secret which contains a credentials file granting access to AWS resources. Will be mounted as ~/.aws inside Kubernetes-managed Toil containers. Enables the AWSJobStore to be used with the Kubernetes batch system, if the credentials allow access to S3 and SimpleDB. |
| TOIL_AWS_ZONE | Zone to use when using AWS. Also determines region. Overrides TOIL_AWS_REGION. |
| TOIL_AWS_REGION | Region to use when using AWS. |
| TOIL_AWS_AMI | ID of the AMI to use in node provisioning. If in doubt, don't set this variable. |
| TOIL_AWS_NODE_DEBUG | Determines whether to preserve nodes that have failed health checks. If set to True, nodes that fail EC2 health checks won't immediately be terminated so they can be examined and the cause of failure determined. If any EC2 nodes are left behind in this manner, the security group will also be left behind by necessity as it cannot be deleted until all associated nodes have been terminated. |
| TOIL_AWS_BATCH_QUEUE | Name or ARN of an AWS Batch Queue to use with the AWS Batch batch system. |
| TOIL_AWS_BATCH_JOB_ROLE_ARN | ARN of an IAM role to run AWS Batch jobs as with the AWS Batch batch system. If the jobs are not run with an IAM role or on machines that have access to S3 and SimpleDB, the AWS job store will not be usable. |
| TOIL_GOOGLE_PROJECTID | The Google project ID to use when generating Google job store names for tests or CWL workflows. |
| TOIL_SLURM_ALLOCATE_MEM | Whether to allocate memory in Slurm with --mem. True by default. |
| TOIL_SLURM_ARGS | Arguments for sbatch for the slurm batch system. Do not pass CPU or memory specifications here. Instead, define resource requirements for the job. There is no default value for this variable. If neither --export nor --export-file is in the argument list, --export=ALL will be provided. |
| TOIL_SLURM_PE | Name of the slurm partition to use for parallel jobs. Useful for Slurm clusters that do not offer a partition accepting both single-core and multi-core jobs. There is no default value for this variable. |
| TOIL_SLURM_TIME | Slurm job time limit, in [DD-]HH:MM:SS format. For example, 2-07:15:30 for 2 days, 7 hours, 15 minutes and 30 seconds, or 4:00:00 for 4 hours. |
| TOIL_GRIDENGINE_ARGS | Arguments for qsub for the gridengine batch system. Do not pass CPU or memory specifications here. Instead, define resource requirements for the job. There is no default value for this variable. |
| TOIL_GRIDENGINE_PE | Parallel environment arguments for qsub for the gridengine batch system. There is no default value for this variable. |
| TOIL_TORQUE_ARGS | Arguments for qsub for the Torque batch system. Do not pass CPU or memory specifications here. Instead, define extra parameters for the job such as queue. Example: -q medium Use TOIL_TORQUE_REQS to pass extra values for the -l resource requirements parameter. There is no default value for this variable. |
| TOIL_TORQUE_REQS | Arguments for the resource requirements for Torque batch system. Do not pass CPU or memory specifications here. Instead, define extra resource requirements as a string that goes after the -l argument to qsub. Example: walltime=2:00:00,file=50gb There is no default value for this variable. |
| TOIL_LSF_ARGS | Additional arguments for LSF's bsub command. Do not pass CPU or memory specifications here. Instead, define extra parameters for the job such as queue. Example: -q medium. There is no default value for this variable. |
| TOIL_HTCONDOR_PARAMS | Additional parameters to include in the HTCondor submit file passed to condor_submit. Do not pass CPU or memory specifications here. Instead define extra parameters which may be required by HTCondor. This variable is parsed as a semicolon-separated string of parameter = value pairs. Example: requirements = TARGET.has_sse4_2 == true; accounting_group = test. There is no default value for this variable. |
| TOIL_CUSTOM_DOCKER_INIT_COMMAND | Any custom bash command to run in the Toil docker container prior to running the Toil services. Can be used for any custom initialization in the worker and/or primary nodes such as private docker authentication. Example for AWS ECR: pip install awscli && eval $(aws ecr get-login --no-include-email --region us-east-1). |
| TOIL_CUSTOM_INIT_COMMAND | Any custom bash command to run prior to starting the Toil appliance. Can be used for any custom initialization in the worker and/or primary nodes such as private docker authentication for the Toil appliance itself (i.e. from TOIL_APPLIANCE_SELF). |
| TOIL_S3_HOST | The IP address or hostname to use for connecting to S3. Example: TOIL_S3_HOST=127.0.0.1 |
| TOIL_S3_PORT | A port number to use for connecting to S3. Example: TOIL_S3_PORT=9001 |
| TOIL_S3_USE_SSL | Enable or disable the usage of SSL for connecting to S3 (True by default). Example: TOIL_S3_USE_SSL=False |
| TOIL_FTP_USER | The FTP username to override all FTP logins with. Example: TOIL_FTP_USER=ftp_user |
| TOIL_FTP_PASSWORD | The FTP password to override all FTP logins with. Example: TOIL_FTP_PASSWORD=ftp_password |
| TOIL_WES_BROKER_URL | An optional broker URL to use to communicate between the WES server and Celery task queue. If unset, amqp://guest:guest@localhost:5672// is used. |
| TOIL_WES_JOB_STORE_TYPE | Type of job store to use by default for workflows run via the WES server. Can be file, aws, or google. |
| TOIL_OWNER_TAG | This will tag cloud resources with a tag reading: "Owner: $TOIL_OWNER_TAG". This is used internally at UCSC to stop a bot we have that terminates untagged resources. |
| TOIL_AWS_PROFILE | The name of an AWS profile to run Toil with. |
| TOIL_AWS_TAGS | This will tag cloud resources with any arbitrary tags given in a JSON format. These are overwritten in favor of CLI options when using launch cluster. For information on valid AWS tags, see AWS Tags. |
| SINGULARITY_DOCKER_HUB_MIRROR | An http or https URL for the Singularity wrapper in the Toil Docker container to use as a mirror for Docker Hub. |
| OMP_NUM_THREADS | The number of cores set for OpenMP applications in the workers. If not set, Toil will use the number of job threads. |
| GUNICORN_CMD_ARGS | Specify additional Gunicorn configurations for the Toil WES server. See Gunicorn settings. |
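As a hypothetical illustration (the values below are made up), such variables can be exported in the shell before invoking Toil, or set from Python at the top of the workflow's main module before Toil starts:

import os

# Made-up example values; set these in the leader's environment before
# the workflow starts.
os.environ["TOIL_WORKDIR"] = "/scratch/toil"             # temporary files
os.environ["TOIL_COORDINATION_DIR"] = "/tmp/toil-locks"  # lock files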
This page contains auto-generated API reference documentation.
| logger | |
| EXIT_STATUS_UNAVAILABLE_VALUE |
| InsufficientSystemResources | Common base class for all non-exit exceptions. |
| AcquisitionTimeoutException | To be raised when a resource request times out. |
| BatchJobExitReason | Enum where members are also (and must be) ints |
| UpdatedBatchJobInfo | |
| WorkerCleanupInfo | |
| AbstractBatchSystem | An abstract base class to represent the interface the batch system must provide to Toil. |
| BatchSystemSupport | Partial implementation of AbstractBatchSystem, support methods. |
| NodeInfo | The coresUsed attribute is a floating point value between 0 (all cores idle) and 1 (all cores used). |
| AbstractScalableBatchSystem | A batch system that supports a variable number of worker nodes. |
| ResourcePool | Represents an integral amount of a resource (such as memory bytes). |
| ResourceSet | Represents a collection of distinct resources (such as accelerators). |
Enum where members are also (and must be) ints
Given an int that may be or may be equal to a value from the enum, produce the string value of its matching enum entry, or a stringified int.
EXIT_STATUS_UNAVAILABLE_VALUE is used when the exit status is not available (e.g. job is lost, or otherwise died but actual exit code was not reported).
An abstract base class to represent the interface the batch system must provide to Toil.
Whether this batch system supports auto-deployment of the user script itself.
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
Whether this batch system supports worker cleanup.
Indicates whether this batch system invokes BatchSystemSupport.workerCleanup() after the last job for a particular workflow invocation finishes. Note that the term worker refers to an entire node, not just a worker process. A worker process may run more than one job sequentially, and more than one concurrent worker process may exist on a worker node, for the same workflow. The batch system is said to shut down after the last worker process terminates.
This method must be called before the first job is issued to this batch system, and only if supportsAutoDeployment() returns True, otherwise it will raise an exception.
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
If no useful message is available, return None.
This can be used to report what resource is the limiting factor when scheduling jobs, for example. If the leader thinks the workflow is stuck, the message can be displayed to the user to help them diagnose why it might be stuck.
The worker process will typically inherit the environment of the machine it is running on but this method makes it possible to override specific variables in that inherited environment before the worker is launched. Note that this mechanism is different to the one used by the worker internally to set up the environment of a job. A call to this method affects all jobs issued after this method returns. Note to implementors: This means that you would typically need to copy the variables before enqueuing a job.
If no value is provided it will be looked up from the current environment.
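For example (a sketch; batch_system stands for any BatchSystemSupport instance):

# Every job issued after this call sees the overridden variable,
# regardless of the worker machine's own environment.
batch_system.setEnv("OMP_NUM_THREADS", "4")

# With no value given, the variable is looked up from the leader's
# current environment.
batch_system.setEnv("MY_API_TOKEN")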
Can be used to ask the Toil worker to do things in-process (such as configuring environment variables, hot-deploying user scripts, or cleaning up a node) that would otherwise require a wrapping "executor" process.
Partial implementation of AbstractBatchSystem, support methods.
If no value is provided it will be looked up from the current environment.
Only really makes sense if the backing batch system actually saves logs to a filesystem; Kubernetes for example does not. Ought to be a directory shared between the leader and the workers, if the backing batch system writes logs onto the worker's view of the filesystem, like many HPC schedulers do.
Files will be written to the batch logs directory (--batchLogsDir, defaulting to the Toil work directory) with names containing both the Toil and batch system job IDs, for ease of debugging job failures.
Also see supportsWorkerCleanup().
The memoryUsed attribute is a floating point value between 0 (no memory used) and 1 (all memory used), reflecting the memory pressure on the node.
The coresTotal and memoryTotal attributes are the node's resources, not just the used resources
The requestedCores and requestedMemory attributes are all the resources that Toil Jobs have reserved on the node, regardless of whether the resources are actually being used by the Jobs.
The workers attribute is an integer reflecting the number of workers currently active on the node.
A batch system that supports a variable number of worker nodes.
Used by toil.provisioners.clusterScaler.ClusterScaler to scale the number of worker nodes in the cluster up or down depending on overall load.
Common base class for all non-exit exceptions.
To be raised when a resource request times out.
| logger | |
| JobTuple |
| ExceededRetryAttempts | Common base class for all non-exit exceptions. |
| AbstractGridEngineBatchSystem | A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. |
Common base class for all non-exit exceptions.
A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. By default auto-deployment is not implemented.
Common base class for all non-exit exceptions.
A class that represents a thread of control.
This class can be safely subclassed in a limited fashion. There are two ways to specify the activity: by passing a callable object to the constructor, or by overriding the run() method in a subclass.
Note: for the moment this is the only consistent way to cleanly get the batch system job ID
Implementation-specific; called by GridEngineThread.run()
Respects statePollingWait and will return cached results if not within time period to talk with the scheduler.
Called by GridEngineThread.checkOnJobs().
The default implementation falls back on self.getJobExitCode and polls each job individually
Returns None if the job is still running.
If the job is not running but the exit code is not available, it will be EXIT_STATUS_UNAVAILABLE_VALUE. Implementation-specific; called by GridEngineThread.checkOnJobs().
The exit code will only be 0 if the job affirmatively succeeded.
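A sketch of interpreting the result (thread and batch_job_id are placeholder names):

code = thread.getJobExitCode(batch_job_id)
if code is None:
    pass  # job is still running
elif code == EXIT_STATUS_UNAVAILABLE_VALUE:
    pass  # job finished, but its exit status could not be determined
elif code == 0:
    pass  # job affirmatively succeeded
else:
    pass  # job failed with this exit code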
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
Respects statePollingWait and will return cached results if not within time period to talk with the scheduler.
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
If no value is provided it will be looked up from the current environment.
Batch system for running Toil workflows on AWS Batch.
Useful with the AWS job store.
AWS Batch has no means for scheduling based on disk usage, so the backing machines need to have "enough" disk and other constraints need to guarantee that disk does not fill.
Assumes that an AWS Batch Queue name or ARN is already provided.
Handles creating and destroying a JobDefinition for the workflow run.
Additional containers should be launched with Singularity, not Docker.
| logger | |
| STATE_TO_EXIT_REASON | |
| MAX_POLL_COUNT | |
| MIN_REQUESTABLE_MIB | |
| MIN_REQUESTABLE_CORES |
| AWSBatchBatchSystem | Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler. |
Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler.
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
This method must be called before the first job is issued to this batch system, and only if supportsAutoDeployment() returns True, otherwise it will raise an exception.
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
| logger |
| BatchSystemCleanupSupport | Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler. |
| WorkerCleanupContext | Context manager used by BatchSystemCleanupSupport to implement worker cleanup. |
Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler.
Indicates whether this batch system invokes BatchSystemSupport.workerCleanup() after the last job for a particular workflow invocation finishes. Note that the term worker refers to an entire node, not just a worker process. A worker process may run more than one job sequentially, and more than one concurrent worker process may exist on a worker node, for the same workflow. The batch system is said to shut down after the last worker process terminates.
Can be used to ask the Toil worker to do things in-process (such as configuring environment variables, hot-deploying user scripts, or cleaning up a node) that would otherwise require a wrapping "executor" process.
Gets wrapped around the worker's work.
Executor for running inside a container.
Useful for Kubernetes batch system and TES batch system plugin.
| logger |
| pack_job(command[, user_script, environment]) | Create a command that runs the given command in an environment. |
| executor() | Main function of the _toil_contained_executor entrypoint. |
Runs inside the Toil container.
Responsible for setting up the user script and running the command for the job (which may in turn invoke the Toil worker entrypoint).
| logger |
| GridEngineBatchSystem | A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. |
A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. By default auto-deployment is not implemented.
Grid Engine-specific AbstractGridEngineWorker methods
| logger | |
| JobTuple | |
| schedd_lock |
| HTCondorBatchSystem | A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. |
A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. By default auto-deployment is not implemented.
A class that represents a thread of control.
This class can be safely subclassed in a limited fashion. There are two ways to specify the activity: by passing a callable object to the constructor, or by overriding the run() method in a subclass.
Implementation-specific; called by GridEngineThread.run()
Returns None if the job is still running.
If the job is not running but the exit code is not available, it will be EXIT_STATUS_UNAVAILABLE_VALUE. Implementation-specific; called by GridEngineThread.checkOnJobs().
The exit code will only be 0 if the job affirmatively succeeded.
You can only use it inside the context. Handles locking to make sure that only one thread is trying to do this at a time.
This is used for arguments we pass to htcondor that need to be inside both double and single quote enclosures.
For examples of valid strings, see: http://research.cs.wisc.edu/htcondor/manual/current/condor_submit.html#man-condor-submit-environment
Batch system for running Toil workflows on Kubernetes.
Only useful with network-based job stores, like AWSJobStore.
Within non-privileged Kubernetes containers, additional Docker containers cannot yet be launched. That functionality will need to wait for user-mode Docker.
| logger | |
| retryable_kubernetes_errors | |
| KeyValuesList |
| KubernetesBatchSystem | Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler. |
| is_retryable_kubernetes_error(e) | A function that determines whether or not Toil should retry or stop given the error e. |
Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler.
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
This method must be called before the first job is issued to this batch system, and only if supportsAutoDeployment() returns True, otherwise it will raise an exception.
Preemptible jobs will be able to run on preemptible or non-preemptible nodes, and will prefer preemptible nodes if available.
Non-preemptible jobs will not be allowed to run on nodes that are marked as preemptible.
Understands the labeling scheme used by EKS, and the taint scheme used by GCE. The Toil-managed Kubernetes setup will mimic at least one of these.
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
Type-enforcing protocol for Toil configs that have the extra Kubernetes batch system fields.
TODO: Until MyPY lets protocols inherit from non-protocols, we will have to let the fact that this also has to be a Config just be manually enforced.
| logger |
| BatchSystemLocalSupport | Adds a local queue for helper jobs, useful for CWL & others. |
Adds a local queue for helper jobs, useful for CWL & others.
Returns the jobID if the jobDesc has been submitted to the local queue, otherwise returns None
To be called by killBatchJobs.
| logger |
| LSFBatchSystem | A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. |
A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. By default auto-deployment is not implemented.
LSF specific GridEngineThread methods.
Called by GridEngineThread.checkOnJobs().
The default implementation falls back on self.getJobExitCode and polls each job individually
Returns None if the job is still running.
If the job is not running but the exit code is not available, it will be EXIT_STATUS_UNAVAILABLE_VALUE. Implementation-specific; called by GridEngineThread.checkOnJobs().
The exit code will only be 0 if the job affirmatively succeeded.
| LSB_PARAMS_FILENAME | |
| LSF_CONF_FILENAME | |
| LSF_CONF_ENV | |
| DEFAULT_LSF_UNITS | |
| DEFAULT_RESOURCE_UNITS | |
| LSF_JSON_OUTPUT_MIN_VERSION | |
| logger |
| find(basedir, string) | walk basedir and return all files matching string |
| find_first_match(basedir, string) | return the first file that matches string starting from basedir |
| get_conf_file(filename, env) | |
| apply_conf_file(fn, conf_filename) | |
| per_core_reserve_from_stream(stream) | |
| get_lsf_units_from_stream(stream) | |
| tokenize_conf_stream(conf_handle) | convert the key=val pairs in a LSF config stream to tuples of tokens |
| apply_bparams(fn) | apply fn to each line of bparams, returning the result |
| apply_lsadmin(fn) | apply fn to each line of lsadmin, returning the result |
| get_lsf_units([resource]) | check if we can find LSF_UNITS_FOR_LIMITS in lsadmin and lsf.conf |
| parse_mem_and_cmd_from_output(output) | Use regex to find "MAX MEM" and "Command" inside of an output. |
| get_lsf_version() | Get current LSF version |
| check_lsf_json_output_supported() | Check if the current LSF system supports bjobs json output. |
| parse_memory(mem) | Parse memory parameter. |
| per_core_reservation() | returns True if the cluster is configured for reservations to be per core, |
| log |
| MesosBatchSystem | A Toil batch system implementation that uses Apache Mesos to distribute toil jobs as Mesos tasks over a cluster of agent nodes. |
A Toil batch system implementation that uses Apache Mesos to distribute toil jobs as Mesos tasks over a cluster of agent nodes. A Mesos framework consists of a scheduler and an executor. This class acts as the scheduler and is typically run on the master node that also runs the Mesos master process with which the scheduler communicates via a driver component. The executor is implemented in a separate class. It is run on each agent node and communicates with the Mesos agent process via another driver object. The scheduler may also be run on a separate node from the master, which we then call somewhat ambiguously the driver node.
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
Indicates whether this batch system invokes BatchSystemSupport.workerCleanup() after the last job for a particular workflow invocation finishes. Note that the term worker refers to an entire node, not just a worker process. A worker process may run more than one job sequentially, and more than one concurrent worker process may exist on a worker node, for the same workflow. The batch system is said to shut down after the last worker process terminates.
This method must be called before the first job is issued to this batch system, and only if supportsAutoDeployment() returns True, otherwise it will raise an exception.
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
| collect_ignore |
| log |
| MesosExecutor | Part of Toil's Mesos framework, runs on a Mesos agent. A Toil job is passed to it via the task.data field, and launched via call(toil.command). |
| main() |
Part of Toil's Mesos framework, runs on a Mesos agent. A Toil job is passed to it via the task.data field, and launched via call(toil.command).
| log |
| MesosTestSupport | Mixin for test cases that need a running Mesos master and agent on the local host. |
A thread whose join() method re-raises exceptions raised during run(). While join() is idempotent, the exception is only reraised by the first invocation of join() that successfully joined the thread. If join() times out, no exception will be reraised even though an exception might already have occurred in run().
When subclassing this thread, override tryRun() instead of run().
>>> def f():
... assert 0
>>> t = ExceptionalThread(target=f)
>>> t.start()
>>> t.join()
Traceback (most recent call last):
...
AssertionError
>>> class MyThread(ExceptionalThread):
... def tryRun( self ):
... assert 0
>>> t = MyThread()
>>> t.start()
>>> t.join()
Traceback (most recent call last):
...
AssertionError
| TaskData | |
| ToilJob |
| JobQueue | |
| MesosShape | Represents a job or a node's "shape", in terms of the dimensions of memory, cores, disk and wall-time allocation. |
Represents a job or a node's "shape", in terms of the dimensions of memory, cores, disk and wall-time allocation.
The wallTime attribute stores the number of seconds of a node allocation, e.g. 3600 for AWS. FIXME: and for jobs?
The memory and disk attributes store the number of bytes required by a job (or provided by a node) in RAM or on disk (SSD or HDD), respectively.
This is because jobTypes are sorted in decreasing order, and this was done to give expensive jobs priority.
| logger |
| OptionSetter | Protocol for the setOption function we get to let us set up CLI options for each batch system. |
| set_batchsystem_options(batch_system, set_option) | Call set_option for all the options for the given named batch system, or for all batch systems. |
| add_all_batchsystem_options(parser) |
Protocol for the setOption function we get to let us set up CLI options for each batch system.
Actual functionality is defined in the Config class.
| logger | |
| DEFAULT_BATCH_SYSTEM |
| add_batch_system_factory(key, class_factory) | Adds a batch system to the registry for workflow or plugin-supplied batch systems. |
| get_batch_systems() | Get the names of all the available batch systems. |
| get_batch_system(key) | Get a batch system class by name. |
| aws_batch_batch_system_factory() | |
| gridengine_batch_system_factory() | |
| lsf_batch_system_factory() | |
| single_machine_batch_system_factory() | |
| mesos_batch_system_factory() | |
| slurm_batch_system_factory() | |
| torque_batch_system_factory() | |
| htcondor_batch_system_factory() | |
| kubernetes_batch_system_factory() | |
| __getattr__(name) | Implement a fallback attribute getter to handle deprecated constants. |
| addBatchSystemFactory(key, batchSystemFactory) | Deprecated method to add a batch system. |
| save_batch_system_plugin_state() | Return a snapshot of the plugin registry that can be restored to remove added batch systems. |
| restore_batch_system_plugin_state(snapshot) | Restore the batch system registry state to a snapshot from save_batch_system_plugin_state(). |
See <https://stackoverflow.com/a/48242860>.
| logger |
| SingleMachineBatchSystem | The interface for running jobs on a single machine, runs all the jobs you give it as they come in, but in parallel. |
| Info | Record for a running job. |
The interface for running jobs on a single machine, runs all the jobs you give it as they come in, but in parallel.
Uses a single "daddy" thread to manage a fleet of child processes.
Communication with the daddy thread happens via two queues: one queue of jobs waiting to be run (the input queue), and one queue of jobs that are finished/stopped and need to be returned by getUpdatedBatchJob (the output queue).
When the batch system is shut down, the daddy thread is stopped.
If running in debug-worker mode, jobs are run immediately as they are sent to the batch system, in the sending thread, and the daddy thread is not run. But the queues are still used.
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
Indicates whether this batch system invokes BatchSystemSupport.workerCleanup() after the last job for a particular workflow invocation finishes. Note that the term worker refers to an entire node, not just a worker process. A worker process may run more than one job sequentially, and more than one concurrent worker process may exist on a worker node, for the same workflow. The batch system is said to shut down after the last worker process terminates.
Our job is to look at jobs from the input queue.
If a job fits in the available resources, we allocate resources for it and kick off a child process.
We also check on our children.
When a child finishes, we reap it, release its resources, and put its information in the output queue.
If no useful message is available, return None.
This can be used to report what resource is the limiting factor when scheduling jobs, for example. If the leader thinks the workflow is stuck, the message can be displayed to the user to help them diagnose why it might be stuck.
Stores the start time of the job, the Popen object representing its child (or None), the tuple of (coreFractions, memory, disk) it is using (or None), and whether the job is supposed to be being killed.
| logger | |
| TERMINAL_STATES | |
| NONTERMINAL_STATES |
| SlurmBatchSystem | A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. |
| parse_slurm_time(slurm_time) | Parse a Slurm-style time duration like 7-00:00:00 to a number of seconds. |
Raises ValueError if not parseable.
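For example (illustrative; the values follow the format described above):

from toil.batchSystems.slurm import parse_slurm_time

seconds = parse_slurm_time("7-00:00:00")  # 7 days -> 604800 seconds
parse_slurm_time("not a time")            # raises ValueError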
A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. By default auto-deployment is not implemented.
A class that represents a thread of control.
This class can be safely subclassed in a limited fashion. There are two ways to specify the activity: by passing a callable object to the constructor, or by overriding the run() method in a subclass.
| logger |
| TorqueBatchSystem | A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. |
A partial implementation of BatchSystemSupport for batch systems run on a standard HPC cluster. By default auto-deployment is not implemented.
A class that represents a thread of control.
This class can be safely subclassed in a limited fashion. There are two ways to specify the activity: by passing a callable object to the constructor, or by overriding the run() method in a subclass.
Returns None if the job is still running.
If the job is not running but the exit code is not available, it will be EXIT_STATUS_UNAVAILABLE_VALUE. Implementation-specific; called by GridEngineThread.checkOnJobs().
The exit code will only be 0 if the job affirmatively succeeded.
| DeadlockException | Exception thrown by the Leader or BatchSystem when a deadlock is encountered due to insufficient resources to run the workflow. |
Exception thrown by the Leader or BatchSystem when a deadlock is encountered due to insufficient resources to run the workflow
Message types and message bus for leader component coordination.
Historically, the Toil Leader has been organized around functions calling other functions to "handle" different things happening. Over time, it has become very brittle: exactly the right handling functions need to be called in exactly the right order, or it gets confused and does the wrong thing.
The MessageBus is meant to let the leader avoid this by more loosely coupling its components together, by having them communicate by sending messages instead of by calling functions.
When events occur (like a job coming back from the batch system with a failed exit status), this will be translated into a message that will be sent to the bus. Then, all the leader components that need to react to this message in some way (by, say, decrementing the retry count) would listen for the relevant messages on the bus and react to them. If a new component needs to be added, it can be plugged into the message bus and receive and react to messages without interfering with existing components' ability to react to the same messages.
Eventually, the different aspects of the Leader could become separate objects.
By default, messages stay entirely within the Toil leader process, and are not persisted anywhere, not even in the JobStore.
The Message Bus also provides an extension point: its messages can be serialized to a file by the leader (see the --writeMessages option), and they can then be decoded using MessageBus.scan_bus_messages() (as is done in the Toil WES server backend). By replaying the messages and tracking their effects on job state, you can get an up-to-date view of the state of the jobs in a workflow. This includes information, such as whether jobs are issued or running, or what jobs have completely finished, which is not persisted in the JobStore.
The MessageBus instance for the leader process is owned by the Toil leader, but the BatchSystem has an opportunity to connect to it, and can send (or listen for) messages. Right now the BatchSystem does not have to send or receive any messages; the Leader is responsible for polling it via the BatchSystem API and generating the events. But a BatchSystem implementation may send additional events (like JobAnnotationMessage).
Currently, the MessageBus is implemented using pypubsub, and so messages are always handled in a single Thread, the Toil leader's main loop thread. If other components send events, they will be shipped over to that thread inside the MessageBus. Communication between processes is allowed using MessageBus.connect_output_file() and MessageBus.scan_bus_messages().
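To make the pattern concrete, here is a schematic publish/subscribe bus in a few lines of Python. This illustrates the general idea only; it is not Toil's actual MessageBus API:

from collections import defaultdict

class TinyBus:
    # Toy message bus: components subscribe by message type, and every
    # published message is delivered to the handlers for its type.
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self._handlers[message_type].append(handler)

    def publish(self, message):
        for handler in self._handlers[type(message)]:
            handler(message)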
| logger | |
| MessageType |
| Names | Stores all the kinds of name a job can have. |
| JobIssuedMessage | Produced when a job is issued to run on the batch system. |
| JobUpdatedMessage | Produced when a job is "updated" and ready to have something happen to it. |
| JobCompletedMessage | Produced when a job is completed, whether successful or not. |
| JobFailedMessage | Produced when a job is completely failed, and will not be retried again. |
| JobMissingMessage | Produced when a job goes missing and should be in the batch system but isn't. |
| JobAnnotationMessage | Produced when extra information (such as an AWS Batch job ID from the AWSBatchBatchSystem) is available that goes with a job. |
| ExternalBatchIdMessage | Produced when using a batch system; links the Toil-assigned batch ID to the batch system's own ID. |
| QueueSizeMessage | Produced to describe the size of the queue of jobs issued but not yet completed. |
| ClusterSizeMessage | Produced by the Toil-integrated autoscaler to describe the number of instances of a certain type in a cluster. |
| ClusterDesiredSizeMessage | Produced by the Toil-integrated autoscaler to describe the number of instances of a certain type that it thinks will be needed. |
| MessageBus | Holds messages that should cause jobs to change their scheduling states. |
| MessageBusClient | Base class for clients (inboxes and outboxes) of a message bus. |
| MessageInbox | A buffered connection to a message bus that lets us receive messages. |
| MessageOutbox | A connection to a message bus that lets us publish messages. |
| MessageBusConnection | A two-way connection to a message bus. Buffers incoming messages until you are ready for them, and lets you send messages. |
| JobStatus | Records the status of a job. |
| get_job_kind(names) | Return an identifying string for the job. |
| message_to_bytes(message) | Convert a plain-old-data named tuple into a byte string. |
| bytes_to_message(message_type, data) | Convert bytes from message_to_bytes back to a message of the given type. |
| replay_message_bus(path) | Replay all the messages and work out what they mean for jobs. |
| gen_message_bus_path([tmpdir]) | Return a file path in tmp to store the message bus at. |
The result may contain spaces.
Produced when a job is issued to run on the batch system.
Produced when a job is "updated" and ready to have something happen to it.
Produced when a job is completed, whether successful or not.
Produced when a job is completely failed, and will not be retried again.
Produced when a job goes missing and should be in the batch system but isn't.
Produced when extra information (such as an AWS Batch job ID from the AWSBatchBatchSystem) is available that goes with a job.
Produced when using a batch system; links the Toil-assigned batch ID to the batch system's own ID (whatever is returned by the backing implementation: a PID, a batch ID, etc.).
Produced to describe the size of the queue of jobs issued but not yet completed. Theoretically recoverable from other messages.
Produced by the Toil-integrated autoscaler to describe the number of instances of a certain type in a cluster.
Produced by the Toil-integrated autoscaler to describe the number of instances of a certain type that it thinks will be needed.
All messages are NamedTuple objects of various subtypes.
Message order is guaranteed to be preserved within a type.
Returns connection data which must be kept alive for the connection to persist. That data is opaque: the user is not supposed to look at it or touch it or do anything with it other than store it somewhere or delete it.
A buffered connection to a message bus that lets us receive messages. Buffers incoming messages until you are ready for them. Does not preserve ordering between messages of different types.
Messages sent while this function is running will not be yielded by the current call.
A connection to a message bus that lets us publish messages.
We have this so you don't need to store both the bus and your connection.
A two-way connection to a message bus. Buffers incoming messages until you are ready for them, and lets you send messages.
When exit_code is -1, this means the job is either not observed or currently running.
We track the state and name of jobs here, by ID. We would use a list of two items but MyPy can't understand a list of items of multiple types, so we need to define a new class.
Returns a dictionary from the job_id to a dataclass, JobStatus. A JobStatus contains information about a job which we have gathered from the message bus, including the job store ID, the name of the job, the exit code, any associated annotations, the Toil batch ID, the external batch ID, and the batch system on which the job is running.
The tmpdir argument will override the directory that the message bus will be made in. If not provided, the standard tempfile order will be used.
| UUID_LENGTH | |
| logger | |
| TOIL_HOME_DIR | |
| DEFAULT_CONFIG_FILE |
| ToilRestartException | Common base class for all non-exit exceptions. |
| ToilContextManagerException | Common base class for all non-exit exceptions. |
| Config | Class to represent configuration operations for a toil workflow run. |
| Toil | A context manager that represents a Toil workflow. |
| ToilMetrics |
| check_and_create_toil_home_dir() | Ensure that TOIL_HOME_DIR exists. |
| check_and_create_default_config_file() | If the default config file does not exist, create it in the Toil home directory, creating the Toil home directory itself if needed. |
| check_and_create_config_file(filepath) | If the config file at the filepath does not exist, try creating it. |
| generate_config(filepath) | Write a Toil config file to the given path. |
| parser_with_common_options([provisioner_options, ...]) | |
| addOptions(parser[, jobstore_as_flag, cwl, wdl]) | Add all Toil command line options to a parser. |
| getNodeID() | Return unique ID of the current node (host). The resulting string will be convertible to a uuid.UUID. |
| cacheDirName(workflowID) | |
| getDirSizeRecursively(dirPath) | This method will return the cumulative number of bytes occupied by the files in the directory and its subdirectories. |
| getFileSystemSize(dirPath) | Return the free space, and total size of the file system hosting dirPath. |
| safeUnpickleFromStream(stream) |
Raises an error if it does not exist and cannot be created. Safe to run simultaneously in multiple processes.
Raises an error if the default config file cannot be created. Safe to run simultaneously in multiple processes. If this process runs this function, it will always see the default config file existing with parseable contents, even if other processes are racing to create it.
No process will see an empty or partially-written default config file.
Safe to run simultaneously in multiple processes. No process will see an empty or partially-written file at the given path.
Set include to "cwl" or "wdl" to include cwl options and wdl options respectively
Support for config files if using configargparse. This will also check and set up the default config file.
Tries several methods until success. The returned ID should be identical across calls from different processes on the same node at least until the next OS reboot.
The last resort method is uuid.getnode() that in some rare OS configurations may return a random ID each time it is called. However, this method should never be reached on a Linux system, because reading from /proc/sys/kernel/random/boot_id will be tried prior to that. If uuid.getnode() is reached, it will be called twice, and exception raised if the values are not identical.
A context manager that represents a Toil workflow.
Specifically the batch system, job store, and its configuration.
Then load the job store and, on restart, consolidate the derived configuration with the one from the previous invocation of the workflow.
Depending on the configuration, delete the job store.
This method must be called in the body of a with Toil(...) as toil: statement. This method should not be called more than once for a workflow that has not finished.
By default, returns None if the file does not exist.
See toil.jobStores.abstractJobStore.AbstractJobStore.importFile() for a full description
See toil.jobStores.abstractJobStore.AbstractJobStore.exportFile() for a full description
This directory is always required to exist on a machine, even if the Toil worker has not run yet. If your workers and leader have different temp directories, you may need to set TOIL_WORKDIR.
Will be consistent for all processes on a given machine, and different for all processes on different machines.
If an in-memory filesystem is available, it is used. Otherwise, the local workflow directory, which may be on a shared network filesystem, is used.
Common base class for all non-exit exceptions.
Common base class for all non-exit exceptions.
If the method is unable to access a file or directory (due to insufficient permissions, or due to the file or directory having been removed while this function was attempting to traverse it), the error will be handled internally, and a (possibly 0) lower bound on the size of the directory will be returned.
| collect_ignore |
Implemented support for Common Workflow Language (CWL) for Toil.
| logger | |
| DEFAULT_TMPDIR | |
| DEFAULT_TMPDIR_PREFIX | |
| MISSING_FILE | |
| DirectoryContents | |
| V | |
| ProcessType | |
| T | |
| usage_message |
| NoAvailableJobStoreException | Indicates that no job store name is available. |
| UnresolvedDict | Tag to indicate a dict contains promises that must be resolved. |
| SkipNull | Internal sentinel object. |
| Conditional | Object holding conditional expression until we are ready to evaluate it. |
| ResolveSource | Apply linkMerge and pickValue operators to values coming into a port. |
| StepValueFrom | A workflow step input which has a valueFrom expression attached to it. |
| DefaultWithSource | A workflow step input that has both a source and a default value. |
| JustAValue | A simple value masquerading as a 'resolve'-able object. |
| ToilPathMapper | Keeps track of files in a Toil way. |
| ToilSingleJobExecutor | A SingleJobExecutor that does not assume it is at the top level of the workflow. |
| ToilTool | Mixin to hook Toil into a cwltool tool type. |
| ToilCommandLineTool | Subclass the cwltool command line tool to provide the custom ToilPathMapper. |
| ToilExpressionTool | Subclass the cwltool expression tool to provide the custom ToilPathMapper. |
| ToilFsAccess | Custom filesystem access class which handles toil filestore references. |
| VisitFunc | |
| CWLNamedJob | Base class for all CWL jobs that do user work, to give them useful names. |
| ResolveIndirect | Helper Job. |
| CWLJobWrapper | Wrap a CWL job that uses dynamic resources requirement. |
| CWLJob | Execute a CWL tool using cwltool.executors.SingleJobExecutor. |
| CWLScatter | Implement workflow scatter step. |
| CWLGather | Follows on to a scatter Job. |
| SelfJob | Fake job object to facilitate implementation of CWLWorkflow.run(). |
| CWLWorkflow | Toil Job to convert a CWL workflow graph into a Toil job graph. |
| CWLInstallImportsJob | Class represents a unit of work in toil. |
| CWLImportWrapper | Job to organize importing files on workers instead of the leader. Responsible for extracting filenames and metadata. |
| CWLStartJob | Job responsible for starting the CWL workflow. |
| cwltoil_was_removed() | Complain about deprecated entrypoint. |
| filter_skip_null(name, value) | Recursively filter out SkipNull objects from 'value'. |
| ensure_no_collisions(directory[, dir_description]) | Make sure no items in the given CWL Directory have the same name. |
| try_prepull(cwl_tool_uri, runtime_context, batchsystem) | Try to prepull all containers in a CWL workflow with Singularity or Docker. |
| resolve_dict_w_promises(dict_w_promises[, file_store]) | Resolve a dictionary of promises and evaluate expressions to produce the actual values. |
| simplify_list(maybe_list) | Turn a length one list loaded by cwltool into a scalar. |
| toil_make_tool(toolpath_object, loadingContext) | Emit custom ToilCommandLineTools. |
| check_directory_dict_invariants(contents) | Make sure a directory structure dict makes sense; throws an error if it does not. |
| decode_directory(dir_path) | Decode a directory from a "toildir:" path to a directory (or a file in it). |
| encode_directory(contents) | Encode a directory contents dict into a "toildir:" URI. |
| toil_get_file(file_store, index, existing, uri[, ...]) | Set up the given file or directory from the Toil jobstore at a file URI where it can be accessed locally. |
| convert_file_uri_to_toil_uri(applyFunc, index, ...) | Given a file URI, convert it to a toil file URI. Uses applyFunc to handle the conversion. |
| path_to_loc(obj) | Make a path into a location. |
| extract_file_uri_once(fileindex, existing, file_metadata) | Extract the filename from a CWL file record. |
| visit_files(func, fs_access, fileindex, existing, ...) | Prepare all files and directories. |
| upload_directory(directory_metadata, directory_contents) | Upload a Directory object. |
| extract_and_convert_file_to_toil_uri(convertfunc, ...) | Extract the file URI out of a file object and convert it to a Toil URI. |
| writeGlobalFileWrapper(file_store, fileuri) | Wrap writeGlobalFile to accept file:// URIs. |
| remove_empty_listings(rec) | |
| toilStageFiles(toil, cwljob, outdir[, destBucket, ...]) | Copy input files out of the global file store and update location and path. |
| get_container_engine(runtime_context) | |
| makeRootJob(tool, jobobj, runtime_context, ...) | Create the Toil root Job object for the CWL tool. Is the same as makeJob() except this also handles import logic. |
| makeJob(tool, jobobj, runtime_context, parent_name, ...) | Create the correct Toil Job object for the CWL tool. |
| remove_pickle_problems(obj) | The doc_loader does not pickle correctly, causing Toil errors; remove it from objects. |
| extract_workflow_inputs(options, ...) | Collect all the workflow input files to import later. |
| import_workflow_inputs(jobstore, options, ...[, log_level]) | Import all workflow inputs on the leader. |
| visitSteps(cmdline_tool, op) | Iterate over a CWL Process object, running the op on each tool description CWL object. |
| rm_unprocessed_secondary_files(job_params) | |
| filtered_secondary_files(unfiltered_secondary_files) | Remove unprocessed secondary files. |
| scan_for_unsupported_requirements(tool[, ...]) | Scan the given CWL tool for any unsupported optional features. |
| determine_load_listing(tool) | Determine the directory.listing feature in CWL. |
| generate_default_job_store(batch_system_name, ...) | Choose a default job store appropriate to the requested batch system and provisioner. |
| get_options(args) | Parse given args and properly add non-Toil arguments into the cwljob of the Namespace. |
| main([args, stdout]) | Run the main loop for toil-cwl-runner. |
| find_default_container(args, builder) | Find the default constructor by consulting a Toil.options object. |
Tag to indicate a dict contains promises that must be resolved.
Indicates a null value produced by each port of a skipped conditional step. The CWL 1.2 specification calls for treating this exactly the same as a null value.
If any do, raise a WorkflowException about a "File staging conflict".
Does not recurse into subdirectories.
Evaluation occurs before the enclosing step's inputs are type-checked.
The valueFrom expression will be evaluated to produce the actual input object for the step.
The inputs must be associated with the StepValueFrom object's self.source.
Called when loadContents is specified.
(when the source can be resolved)
Anything else is passed as-is, by reference.
Keeps track of files in a Toil way.
Maps between the symbolic identifier of a file (the Toil FileID), its local path on the host (the value returned by readGlobalFile) and the location of the file inside the software container.
This is called on each File or Directory CWL object. The Files and Directories all have "location" fields. For the Files, these are from upload_file(), and for the Directories, these are from upload_directory() or cwltool internally. With upload_directory(), they and their children will be assigned locations based on listing the Directories using ToilFsAccess. With cwltool, locations will be set as absolute paths.
Produces one MapperEnt for every unique location for a File or Directory. These MapperEnt objects are instructions to cwltool's stage_files function: https://github.com/common-workflow-language/cwltool/blob/a3e3a5720f7b0131fa4f9c0b3f73b62a347278a6/cwltool/process.py#L254
The MapperEnt has fields:
resolved: An absolute local path anywhere on the filesystem where the file/directory can be found, or the contents of a file to populate it with if type is CreateWritableFile or CreateFile. Or, a URI understood by the StdFsAccess in use (for example, toilfile:).
target: An absolute path under stagedir that the file or directory will then be placed at by cwltool. Except if a File or Directory has a dirname field, giving its parent path, that is used instead.
type: One of:
CreateFile: cwltool will create the file at target, treating resolved as the contents.
WritableFile: cwltool will copy the file from resolved to target, making it writable.
CreateWritableFile: cwltool will create the file at target, treating resolved as the contents, and make it writable.
Directory: cwltool will copy or link the directory from resolved to target, if possible. Otherwise, cwltool will make the directory at target if resolved starts with "_:". Otherwise it will do nothing.
WritableDirectory: cwltool will copy the directory from resolved to target, if possible. Otherwise, cwltool will make the directory at target if resolved starts with "_:". Otherwise it will do nothing.
staged: if set to False, cwltool will not make or copy anything for this entry
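For illustration, a MapperEnt as cwltool defines it (a named tuple in cwltool.pathmapper) could be built like this; the URI and target path below are invented:

from cwltool.pathmapper import MapperEnt

entry = MapperEnt(
    resolved="toilfile:example-file-id",   # where the data actually lives (invented ID)
    target="/var/lib/cwl/stg0/input.txt",  # where cwltool will stage it (invented path)
    type="File",                           # copy or link an ordinary file into place
    staged=True,                           # actually materialize it
)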
A SingleJobExecutor that does not assume it is at the top level of the workflow.
We need this because otherwise every job thinks it is top level and tries to discover secondary files, which may exist when they haven't actually been passed at the top level and thus aren't supposed to be visible.
Subclass the cwltool command line tool to provide the custom ToilPathMapper.
Subclass the cwltool expression tool to provide the custom ToilPathMapper.
This factory function is meant to be passed to cwltool.load_tool().
Currently just checks to make sure no empty-string keys exist.
Returns the decoded directory dict, the remaining part of the path (which may be None), and the deduplication key string that uniquely identifies the directory.
Takes the directory dict, which is a dict from name to URI for a file or dict for a subdirectory.
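A hedged sketch of such a round trip, assuming (purely for illustration) that the contents dict is serialized as URL-safe base64 JSON behind the "toildir:" scheme; Toil's real encoding may differ in detail:

import base64
import json

def encode_dir(contents):
    # Hypothetical encoding of a name -> URI-or-subdict mapping.
    blob = base64.urlsafe_b64encode(json.dumps(contents).encode()).decode()
    return "toildir:" + blob

def decode_dir(uri):
    assert uri.startswith("toildir:")
    blob = uri[len("toildir:"):].split("/", 1)[0]  # drop any trailing in-directory path
    return json.loads(base64.urlsafe_b64decode(blob))

contents = {"a.txt": "toilfile:1234", "sub": {"b.txt": "toilfile:5678"}}
assert decode_dir(encode_dir(contents)) == contents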
Custom filesystem access class which handles toil filestore references.
Normal file paths will be resolved relative to basedir, but 'toilfile:' and 'toildir:' URIs will be fulfilled from the Toil file store.
Also supports URLs supported by Toil job store implementations.
Run as part of the tool setup, inside jobs on the workers. Also used as part of reorganizing files to get them uploaded at the end of a tool.
Runs once on every unique file URI.
'existing' is a set of files retrieved as inputs from toil_get_file. This ensures they are mapped back as the same name if passed through.
Returns a toil uri path to the object.
(If a CWL object has a "path" and not a "location")
This function matches the predefined function signature in visit_files, which ensures that this function is called on all files inside a CWL object.
Ensures no duplicate files are returned according to fileindex. If a file has not already been resolved (and had file:// prepended), symlinks are resolved.
:param fileindex: Forward mapping of filename.
:param existing: Reverse mapping of filename; this function does not use it.
:param file_metadata: CWL file record.
:param mark_broken: Whether files should be marked as missing.
:param skip_remote: Whether to skip remote files.
Will be executed from the leader or worker in the context of the given CWL tool, order, or output object to be used on the workers. Makes sure their sizes are set and imports all the files.
Recurses inside directories using the fs_access to find files to upload and subdirectory structure to encode, even if their listings are not set or not recursive.
Preserves any listing fields.
If a file cannot be found (like if it is an optional secondary file that doesn't exist), fails, unless mark_broken is set, in which case it applies a sentinel location.
Also does some miscellaneous normalization.
Ignores the listing (which may not be recursive and isn't safe or efficient to touch), and instead uses directory_contents, which is a recursive dict structure from filename to file URI or subdirectory contents dict.
Makes sure the directory actually exists, and rewrites its location to be something we can use on another machine.
If mark_broken is set, ignores missing directories and replaces them with directories containing the given (possibly empty) contents.
We can't rely on the directory's listing as visible to the next tool as a complete recursive description of the files we will need to present to the tool, since some tools require it to be cleared or single-level but still expect to see its contents in the filesystem.
Runs convertfunc on the file URI to handle conversion.
Is used to handle importing files into the jobstore.
If a file doesn't exist, fails with an error, unless mark_broken is set, in which case the missing file is given a special sentinel location.
Unless skip_remote is set, this also runs on remote files and sets their locations to Toil URIs as well.
Base class for all CWL jobs that do user work, to give them useful names.
Helper Job.
Accepts an unresolved dict (containing promises) and produces a dictionary of actual values.
Wrap a CWL job that uses dynamic resources requirement.
When executed, this creates a new child job which has the correct resource requirement set.
Execute a CWL tool using cwltool.executors.SingleJobExecutor.
Env vars specified in the CWL "requirements" section should already be loaded in self.cwltool.requirements, however those specified with "EnvVarRequirement" take precedence and are only populated here. Therefore, this not only returns a dictionary with all evaluated "EnvVarRequirement" env vars, but checks self.cwltool.requirements for any env vars with the same name and replaces their value with that found in the "EnvVarRequirement" env var if it exists.
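The precedence rule amounts to a dictionary merge in which the EnvVarRequirement entries win; a minimal sketch with invented names:

def merge_env(requirements_env, env_var_requirement_env):
    # Start from env vars in the tool's requirements, then let any
    # EnvVarRequirement entries with the same name override them.
    merged = dict(requirements_env)
    merged.update(env_var_requirement_env)
    return merged

merge_env({"PATH": "/usr/bin", "LANG": "C"}, {"LANG": "en_US.UTF-8"})
# -> {'PATH': '/usr/bin', 'LANG': 'en_US.UTF-8'}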
Actually creates what might be a subgraph of two jobs, the second of which may be the follow-on of the first. If only one job is created, it is returned twice.
Actually creates what might be a subgraph of two jobs, the second of which may be the follow-on of the first. If only one job is created, it is returned twice.
Types: workflow, job, or job wrapper for dynamic resource requirements.
Implement workflow scatter step.
When run, this creates a child job for each parameterization of the scatter.
Follows on to a scatter Job.
This gathers the outputs of each job in the scatter into an array for each output parameter.
If the object is a list, extract it from all members of the list.
Fake job object to facilitate implementation of CWLWorkflow.run().
See github issue: https://github.com/mypyc/mypyc/issues/804
Toil Job to convert a CWL workflow graph into a Toil job graph.
The Toil job graph will include the appropriate dependencies.
Always runs on the leader, because the batch system knows to schedule it as a local job.
Class represents a unit of work in toil.
Job to organize importing files on workers instead of the leader. Responsible for extracting filenames and metadata, calling ImportsJob, applying imports to the job objects, and scheduling the start workflow job.
This class is only used when runImportsOnWorkers is enabled.
Job responsible for starting the CWL workflow.
Takes in the workflow/tool and inputs after all files are imported and creates jobs to run those workflows.
Run when not importing on workers.
:param jobstore: Toil jobstore.
:param options: Namespace.
:param initialized_job_order: CWL object.
:param tool: CWL tool.
:param log_level: Log level.
Interpolated strings and optional inputs in secondary files were added to CWL in version 1.1.
The CWL libraries we call do successfully resolve the interpolated strings, but add the resolved fields to the list of unresolved fields so we remove them here after the fact.
We keep secondary files with anything other than MISSING_FILE as their location. The 'required' logic seems to be handled deeper in cwltool.builder.Builder(), which correctly determines which files should be imported. Therefore we remove the files here; if a file is SUPPOSED to exist, it will still give the appropriate 'file does not exist' error, just a bit further down the track.
If it has them, raise an informative UnsupportedRequirement.
In CWL, any input directory can have a DIRECTORY_NAME.listing (where DIRECTORY_NAME is any variable name) set to one of the following three options: no_listing, shallow_listing, or deep_listing.
See https://www.commonwl.org/v1.1/CommandLineTool.html#LoadListingRequirement and https://www.commonwl.org/v1.1/CommandLineTool.html#LoadListingEnum
DIRECTORY_NAME.listing should be determined first from loadListing. If that's not specified, from LoadListingRequirement. Else, default to "no_listing" if unspecified.
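That precedence chain can be sketched as a small helper (hypothetical names; not Toil's actual code):

def pick_load_listing(load_listing, load_listing_requirement):
    # Valid values: "no_listing", "shallow_listing", "deep_listing".
    if load_listing is not None:
        return load_listing              # explicit loadListing wins
    if load_listing_requirement is not None:
        return load_listing_requirement  # then LoadListingRequirement
    return "no_listing"                  # CWL default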
Indicates that no job store name is available.
Utility functions used for Toil's CWL interpreter.
| logger | |
| CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE | |
| CWL_UNSUPPORTED_REQUIREMENT_EXCEPTION | |
| DownReturnType | |
| UpReturnType | |
| DirectoryStructure |
| CWLUnsupportedException | Fallback exception. |
| visit_top_cwl_class(rec, classes, op) | Apply the given operation to all top-level CWL objects with the given named CWL class. |
| visit_cwl_class_and_reduce(rec, classes, op_down, op_up) | Apply the given operations to all CWL objects with the given named CWL class. |
| get_from_structure(dir_dict, path) | Given a relative path, follow it in the given directory structure. |
| download_structure(file_store, index, existing, ...) | Download nested dictionary from the Toil file store to a local path. |
Fallback exception.
Like cwltool's visit_class but doesn't look inside any object visited.
Applies the down operation top-down, and the up operation bottom-up, and passes the down operation's result and a list of the up operation results for all child keys (flattening across lists and collapsing nodes of non-matching classes) to the up operation.
Return the string URI for files, the directory dict for subdirectories, or None for nonexistent things.
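A sketch of that lookup over the nested contents dict (a hypothetical re-implementation, not the actual function):

def lookup_in_structure(dir_dict, path):
    node = dir_dict
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return None  # nonexistent
        node = node[part]
    return node  # str URI for a file, dict for a subdirectory

lookup_in_structure({"sub": {"b.txt": "toilfile:5678"}}, "sub/b.txt")
# -> 'toilfile:5678'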
Guaranteed to fill the structure with real files, and not symlinks out of it to elsewhere. File URIs may be toilfile: URIs or any other URI that Toil's job store system can read.
| check_cwltool_version() | Check if the installed cwltool version matches Toil's expected version. |
A warning is printed to standard error if the versions differ. We do not assume that logging is set up already. Safe to call repeatedly; only one warning will be printed.
| logger |
| DeferredFunction | |
| DeferredFunctionManager | Implements a deferred function system. Each Toil worker will have an instance of this class. |
>>> from collections import defaultdict
>>> df = DeferredFunction.create(defaultdict, None, {'x':1}, y=2)
>>> df
DeferredFunction(defaultdict, ...)
>>> df.invoke() == defaultdict(None, x=1, y=2)
True
If the Python process terminates before properly exiting the context manager and running the deferred functions, they will be picked up and run later: when some other worker process enters or exits the per-job context manager of this class, or when the DeferredFunctionManager is shut down on the worker.
Note that deferred function cleanup is on a best-effort basis, and deferred functions may end up getting executed multiple times.
Internally, deferred functions are serialized into files in the given directory, which are locked by the owning process.
If that process dies, other processes can detect that the files are able to be locked, and will take them over.
Not thread safe.
Neutral place for exceptions, to break import cycles.
| logger |
| FailedJobsException | Raised when a workflow ends with failed jobs. |
Raised when a workflow ends with failed jobs.
| logger |
| AbstractFileStore | Interface used to allow user code run by Toil to read and write files. |
Interface used to allow user code run by Toil to read and write files.
Also provides the interface to other Toil facilities used by user code, including:
Stores user files in the jobStore, but keeps them separate from actual jobs.
May implement caching.
Passed as argument to the toil.job.Job.run() method.
Access to files is only permitted inside the context manager provided by toil.fileStores.abstractFileStore.AbstractFileStore.open().
Also responsible for committing completed jobs back to the job store with an update operation, and allowing that commit operation to be waited for.
This is a destructive operation and it is important to ensure that there are no other running processes on the system that are modifying or using the file store for this workflow.
This is the intended to be the last call to the file store in a Toil run, called by the batch system cleanup function upon batch system shutdown.
File operations are only permitted inside the context manager.
Implementations must only yield from within with super().open(job):.
Disk usage is measured at the end of the job. TODO: Sample periodically and record peak usage.
The directory will only persist for the duration of the job.
If the file is in a FileStore-managed temporary directory (i.e. from toil.fileStores.abstractFileStore.AbstractFileStore.getLocalTempDir()), it will become a local copy of the file, eligible for deletion by toil.fileStores.abstractFileStore.AbstractFileStore.deleteLocalFile().
If an executable file on the local filesystem is uploaded, its executability will be preserved when it is downloaded again.
(to be announced if the job fails)
If destination is not None, it gives the path that the file was downloaded to. Otherwise, assumes that the file was streamed.
Must be called by readGlobalFile() and readGlobalFileStream() implementations.
If mutable is True, then a copy of the file will be created locally so that the original is not modified and does not change the file for other jobs. If mutable is False, then a link can be created to the file, saving disk resources. The file that is downloaded will be executable if and only if it was originally uploaded from an executable file on the local filesystem.
If a user path is specified, it is used as the destination. If a user path isn't specified, the file is stored in the local temp directory with an encoded name.
The destination file must not be deleted by the user; it can only be deleted through deleteLocalFile.
Implementations must call logAccess() to report the download.
The yielded file handle does not need to and should not be closed explicitly.
Implementations must call logAccess() to report the download.
If a FileID or something else with a non-None 'size' field, gets that.
Otherwise, asks the job store to poll the file's size.
Note that the job store may overestimate the file's size, for example if it is encrypted and had to be augmented with an IV or other encryption framing.
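A compact usage sketch of the write/read cycle inside a job (the job store locator is invented):

from toil.common import Toil
from toil.job import Job

def roundtrip(job, payload):
    # Write a local file into the job store, then read it back.
    local = job.fileStore.getLocalTempFile()
    with open(local, "w") as f:
        f.write(payload)
    file_id = job.fileStore.writeGlobalFile(local)  # returns a FileID with .size
    copy = job.fileStore.readGlobalFile(file_id, mutable=True)
    with open(copy) as f:
        return f.read()

if __name__ == "__main__":
    options = Job.Runner.getDefaultArgumentParser().parse_args(["file:my-jobstore"])
    with Toil(options) as toil:
        print(toil.start(Job.wrapJobFn(roundtrip, "hello")))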
Raises an OSError with an errno of errno.ENOENT if no such local copies exist. Thus, cannot be called multiple times in succession.
The files deleted are all those previously read from this file ID via readGlobalFile by the current job into the job's file-store-provided temp directory, plus the file that was written to create the given file ID, if it was written by the current job from the job's file-store-provided temp directory.
To ensure that the job can be restarted if necessary, the delete will not happen until after the job's run method has completed.
Useful for things like the error logs of Docker containers. The leader will show it to the user or organize it appropriately for user-level log information.
May bump the version number of the job.
May start an asynchronous process. Call waitForCommit() to wait on that process. You must waitForCommit() before committing any further updates to the job. During the asynchronous process, it is safe to modify the job; modifications after this call will not be committed until the next call.
This function is called by this job's successor to ensure that it does not begin modifying the job store until after this job has finished doing so.
Might be called when startCommit is never called on a particular instance, in which case it does not block.
Shutdown the filestore on this node.
This is intended to be called on batch system shutdown.
| logger | |
| SQLITE_TIMEOUT_SECS |
| CacheError | Error Raised if the user attempts to add a non-local file to cache |
| CacheUnbalancedError | Raised if file store can't free enough space for caching |
| IllegalDeletionCacheError | Error raised if the caching code discovers a file that represents a |
| InvalidSourceCacheError | Error raised if the user attempts to add a non-local file to cache |
| CachingFileStore | A cache-enabled file store. |
Error Raised if the user attempts to add a non-local file to cache
Raised if file store can't free enough space for caching
Error raised if the caching code discovers a file that represents a reference to a cached file to have gone missing.
This can be a big problem if a hard link is moved, because then the cache will be unable to evict the file it links to.
Remember that files read with readGlobalFile may not be deleted by the user and need to be deleted with deleteLocalFile.
Error raised if the user attempts to add a non-local file to cache
A cache-enabled file store.
Provides files that are read out as symlinks or hard links into a cache directory for the node, if permitted by the workflow.
Also attempts to write files back to the backing JobStore asynchronously, after quickly taking them into the cache. Writes are only required to finish when the job's actual state after running is committed back to the job store.
Internally, manages caching using a database. Each node has its own database, shared between all the workers on the node. The database contains several tables:
files contains one entry for each file in the cache. Each entry knows the path to its data on disk. It also knows its global file ID, its state, and its owning worker PID. If the owning worker dies, another worker will pick it up. It also knows its size.
File states are:
refs contains one entry for each outstanding reference to a cached file (hard link, symlink, or full copy). The table name is refs instead of references because references is an SQL reserved word. It remembers what job ID has the reference, and the path the reference is at. References have three states:
jobs contains one entry for each job currently running. It keeps track of the job's ID, the worker that is supposed to be running the job, the job's disk requirement, and the job's local temp dir path that will need to be cleaned up. When workers check for jobs whose workers have died, they null out the old worker, and grab ownership of and clean up jobs and their references until the null-worker jobs are gone.
properties contains key, value pairs for tracking total space available, and whether caching is free for this run.
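An illustrative rendering of that schema in sqlite3; the column names are guesses for the sake of the example, not the real definitions:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Illustrative only; Toil's caching code defines the real tables.
    CREATE TABLE files (id TEXT PRIMARY KEY, path TEXT, state TEXT,
                        owner INTEGER, size INTEGER);
    CREATE TABLE refs (path TEXT, file_id TEXT, job_id TEXT, state TEXT);
    CREATE TABLE jobs (id TEXT PRIMARY KEY, worker INTEGER,
                       disk INTEGER, tempdir TEXT);
    CREATE TABLE properties (name TEXT PRIMARY KEY, value INTEGER);
""")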
Yields the process's name in the caching database, and holds onto a lock while your thread has it.
If no limit is available, raises an error.
If no value is available, raises an error.
We can get into a situation where the jobs on the node take up all its space, but then they want to write to or read from the cache. So when that happens, we need to debit space from them somehow...
If no value is available, raises an error.
If no value is available, raises an error.
If not retrievable, raises an error.
Mutable references don't count, but immutable/uploading ones do.
If no value is available, raises an error.
Note that this can't really be relied upon because a file may go cached -> deleting after you look at it. If you need to do something with the file you need to do it in a transaction.
Counts mutable references too.
This will be true when working with certain job stores in certain configurations, most notably the FileJobStore.
If mutable is True, then a copy of the file will be created locally so that the original is not modified and does not change the file for other jobs. If mutable is False, then a link can be created to the file, saving disk resources. The file that is downloaded will be executable if and only if it was originally uploaded from an executable file on the local filesystem.
If a user path is specified, it is used as the destination. If a user path isn't specified, the file is stored in the local temp directory with an encoded name.
The destination file must not be deleted by the user; it can only be deleted through deleteLocalFile.
Implementations must call logAccess() to report the download.
The yielded file handle does not need to and should not be closed explicitly.
Implementations must call logAccess() to report the download.
Raises an OSError with an errno of errno.ENOENT if no such local copies exist. Thus, cannot be called multiple times in succession.
The files deleted are all those previously read from this file ID via readGlobalFile by the current job into the job's file-store-provided temp directory, plus the file that was written to create the given file ID, if it was written by the current job from the job's file-store-provided temp directory.
To ensure that the job can be restarted if necessary, the delete will not happen until after the job's run method has completed.
This function is called by this job's successor to ensure that it does not begin modifying the job store until after this job has finished doing so.
Might be called when startCommit is never called on a particular instance, in which case it does not block.
May bump the version number of the job.
May start an asynchronous process. Call waitForCommit() to wait on that process. You must waitForCommit() before committing any further updates to the job. During the asynchronous process, it is safe to modify the job; modifications after this call will not be committed until the next call.
Job local temp directories will be removed due to their appearance in the database.
| logger |
| NonCachingFileStore | Interface used to allow user code run by Toil to read and write files. |
Interface used to allow user code run by Toil to read and write files.
Also provides the interface to other Toil facilities used by user code, including:
Stores user files in the jobStore, but keeps them separate from actual jobs.
May implement caching.
Passed as argument to the toil.job.Job.run() method.
Access to files is only permitted inside the context manager provided by toil.fileStores.abstractFileStore.AbstractFileStore.open().
Also responsible for committing completed jobs back to the job store with an update operation, and allowing that commit operation to be waited for.
Slurm has been known to delete XDG_RUNTIME_DIR out from under processes it was promised to, so it is possible that in certain misconfigured environments the coordination directory and everything in it could go away unexpectedly. We are going to regularly make sure that the things we think should exist actually exist, and we are going to abort if they do not.
File operations are only permitted inside the context manager.
Implementations must only yield from within with super().open(job):.
If the file is in a FileStore-managed temporary directory (i.e. from toil.fileStores.abstractFileStore.AbstractFileStore.getLocalTempDir()), it will become a local copy of the file, eligible for deletion by toil.fileStores.abstractFileStore.AbstractFileStore.deleteLocalFile().
If an executable file on the local filesystem is uploaded, its executability will be preserved when it is downloaded again.
If mutable is True, then a copy of the file will be created locally so that the original is not modified and does not change the file for other jobs. If mutable is False, then a link can be created to the file, saving disk resources. The file that is downloaded will be executable if and only if it was originally uploaded from an executable file on the local filesystem.
If a user path is specified, it is used as the destination. If a user path isn't specified, the file is stored in the local temp directory with an encoded name.
The destination file must not be deleted by the user; it can only be deleted through deleteLocalFile.
Implementations must call logAccess() to report the download.
The yielded file handle does not need to and should not be closed explicitly.
Implementations must call logAccess() to report the download.
Raises an OSError with an errno of errno.ENOENT if no such local copies exist. Thus, cannot be called multiple times in succession.
The files deleted are all those previously read from this file ID via readGlobalFile by the current job into the job's file-store-provided temp directory, plus the file that was written to create the given file ID, if it was written by the current job from the job's file-store-provided temp directory.
To ensure that the job can be restarted if necessary, the delete will not happen until after the job's run method has completed.
This function is called by this job's successor to ensure that it does not begin modifying the job store until after this job has finished doing so.
Might be called when startCommit is never called on a particular instance, in which case it does not block.
May bump the version number of the job.
May start an asynchronous process. Call waitForCommit() to wait on that process. You must waitForCommit() before committing any further updates to the job. During the asynchronous process, it is safe to modify the job; modifications after this call will not be committed until the next call.
| FileID | A small wrapper around Python's builtin string class. |
A small wrapper around Python's builtin string class.
It is used to represent a file's ID in the file store, and has a size attribute that is the file's size in bytes. This object is returned by importFile and writeGlobalFile.
Calls into the file store can use bare strings; size will be queried from the job store if unavailable in the ID.
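For example (the ID string here is invented):

from toil.fileStores import FileID

fid = FileID("files/no-job/file-abc123", 2048)  # ID string plus size in bytes
assert isinstance(fid, str)  # usable anywhere a bare ID string is expected
assert fid.size == 2048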
| logger | |
| REQUIREMENT_NAMES | |
| ParsedRequirement | |
| ParseableIndivisibleResource | |
| ParseableDivisibleResource | |
| ParseableFlag | |
| ParseableAcceleratorRequirement | |
| ParseableRequirement | |
| T | |
| Promised |
| JobPromiseConstraintError | Error for a job being asked to promise its return value when it is not yet available. |
| ConflictingPredecessorError | Raised when a job is added as a predecessor of the same job more than once. |
| DebugStoppingPointReached | Raised when a job reaches a point at which it has been instructed to stop for debugging. |
| FilesDownloadedStoppingPointReached | Raised when a job stops because it was asked to download its files, and the files are downloaded. |
| JobException | General job exception. |
| JobGraphDeadlockException | An exception raised in the event that a workflow contains an unresolvable dependency, such as a cycle. See toil.job.Job.checkJobGraphForDeadlocks(). |
| TemporaryID | Placeholder for an unregistered job ID used by a JobDescription. |
| AcceleratorRequirement | Requirement for one or more computational accelerators, like a GPU or FPGA. |
| RequirementsDict | Typed storage for requirements for a job. |
| Requirer | Base class implementing the storage and presentation of requirements. |
| JobBodyReference | Reference from a job description to its body. |
| JobDescription | Stores all the information that the Toil Leader ever needs to know about a Job. |
| ServiceJobDescription | A description of a job that hosts a service. |
| CheckpointJobDescription | A description of a job that is a checkpoint. |
| Job | Class represents a unit of work in toil. |
| FunctionWrappingJob | Job used to wrap a function. In its run method the wrapped function is called. |
| JobFunctionWrappingJob | A job function is a function whose first argument is a Job instance that is the wrapping job for the function. |
| PromisedRequirementFunctionWrappingJob | Handles dynamic resource allocation using toil.job.Promise instances. |
| PromisedRequirementJobFunctionWrappingJob | Handles dynamic resource allocation for job functions. |
| EncapsulatedJob | A convenience Job class used to make a job subgraph appear to be a single job. |
| ServiceHostJob | Job that runs a service. Used internally by Toil. Users should subclass Service instead of using this. |
| FileMetadata | Metadata for a file. |
| CombineImportsJob | Combine the outputs of multiple WorkerImportJob instances into one promise. |
| WorkerImportJob | Job to do file imports on a worker instead of a leader. Assumes all local and cloud files are accessible. |
| ImportsJob | Job to organize and delegate files to individual WorkerImportJobs. |
| Promise | References a return value from a method as a promise before the method itself is run. |
| PromisedRequirement | Class for dynamically allocating job function resource requirements. |
| UnfulfilledPromiseSentinel | This should be overwritten by a proper promised value. |
| parse_accelerator(spec) | Parse an AcceleratorRequirement specified by user code. |
| accelerator_satisfies(candidate, requirement[, ignore]) | Test if candidate partially satisfies the given requirement. |
| accelerators_fully_satisfy(candidates, requirement[, ...]) | Determine if a set of accelerators satisfy a requirement. |
| potential_absolute_uris(uri, path[, importer, ...]) | Get potential absolute URIs to check for an imported file. |
| get_file_sizes(filenames, file_source[, search_paths, ...]) | Resolve relative-URI files in the given environment and turn them into absolute normalized URIs. Returns a dictionary mapping the string values from the WDL file values to their file metadata. |
| unwrap(p) | Function for ensuring you actually have a promised value, and not just a promise. |
| unwrap_all(p) | Function for ensuring you actually have a collection of promised values, |
Error for a job being asked to promise its return value when it is not yet available.
(The return value has not yet been reached in the topological order of the job graph.)
Raised when a job is added as a predecessor of the same job more than once.
Raised when a job reaches a point at which it has been instructed to stop for debugging.
Raised when a job stops because it was asked to download its files, and the files are downloaded.
Requirement for one or more computational accelerators, like a GPU or FPGA.
Supports formats like:
>>> parse_accelerator(8)
{'count': 8, 'kind': 'gpu'}
>>> parse_accelerator("1")
{'count': 1, 'kind': 'gpu'}
>>> parse_accelerator("nvidia-tesla-k80")
{'count': 1, 'kind': 'gpu', 'brand': 'nvidia', 'model': 'nvidia-tesla-k80'}
>>> parse_accelerator("nvidia-tesla-k80:2")
{'count': 2, 'kind': 'gpu', 'brand': 'nvidia', 'model': 'nvidia-tesla-k80'}
>>> parse_accelerator("gpu")
{'count': 1, 'kind': 'gpu'}
>>> parse_accelerator("cuda:1")
{'count': 1, 'kind': 'gpu', 'brand': 'nvidia', 'api': 'cuda'}
>>> parse_accelerator({"kind": "gpu"})
{'count': 1, 'kind': 'gpu'}
>>> parse_accelerator({"brand": "nvidia", "count": 5})
{'count': 5, 'kind': 'gpu', 'brand': 'nvidia'}
Assumes that if not specified, we are talking about GPUs, and about one of them. Knows that "gpu" is a kind, and "cuda" is an API, and "nvidia" is a brand.
Ignores fields specified in ignore.
Typed storage for requirements for a job.
Where requirement values are of different types depending on the requirement.
Has cores, memory, disk, and preemptability as properties.
Must be called exactly once on a loaded JobDescription before any requirements are queried.
Only works on requirements where that makes sense.
Reference from a job description to its body.
Stores all the information that the Toil Leader ever needs to know about a Job.
Can be obtained from an actual (i.e. executable) Job object, and can be used to obtain the Job object from the JobStore.
Never contains other Jobs or JobDescriptions: all reference is by ID.
Subclassed into variants for checkpoint jobs and service jobs that have their specific parameters.
For each job, produces a named tuple with its various names and its original job store ID. The jobs in the chain are in execution order.
If the job hasn't run yet or it didn't chain, produces a one-item list.
(in the order they need to start in)
Follow-ons will come before children.
Phases execute from higher numbers to lower numbers.
Will be empty if the job has no unfinished services.
Takes the file store ID that the body is stored at, and the required user script module.
The file store ID can also be "firstJob" for the root job, stored as a shared file instead.
Fails if no body is attached; check has_body() first.
If those jobs have multiple predecessor relationships, they may still be blocked on other jobs.
Returns None when at the final phase (all successors done), and an empty collection if there are more phases but they can't be entered yet (e.g. because we are waiting for the job itself to run).
The predicate function is called with the job's ID.
Treats all other successors as complete and forgets them.
The predicate function is called with the service host job's ID.
Treats all other services as complete and forgets them.
That is to say, all those that have been completed and removed.
When updated in the JobStore, we will save over the other JobDescription.
Useful for chaining jobs: the chained-to job can replace the parent job.
Merges cleanup state and successors other than this job from the job being replaced into this one.
If a parent ServiceHostJob ID is given, that parent service will be started first, and must have already been added.
Does not modify our own ID or those of finished predecessors. IDs not present in the renames dict are left as-is.
Called by the Job saving logic when this JobDescription meets the JobStore and has its ID assigned.
Overridden to perform setup work (like hooking up flag files for service jobs) that requires the JobStore.
Reduce the remainingTryCount if greater than zero and set the memory to be at least as big as the default memory (in case of exhaustion of memory, which is common).
Requires a configuration to have been assigned (see toil.job.Requirer.assignConfig()).
Assumes logJobStoreFileID is set.
The try count set on the JobDescription, or the default based on the retry count from the config if none is set.
Called by the job store.
A description of a job that hosts a service.
When a ServiceJobDescription first meets the JobStore, it needs to set up its flag files.
A description of a job that is a checkpoint.
Writes the changes to the jobStore immediately. All the checkpoint's successors will be deleted, but its try count will not be decreased.
Returns a list with the IDs of any successors deleted.
This uses the fact that the self._description instance variable should always be set after __init__().
If __init__() has not been called, raise an error.
It will be used by various actions implemented inside the Job class.
Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
Follow-on jobs will be run after the child jobs and their successors have been run.
The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run.
Services allow things like databases and servers to be started and accessed by jobs in a workflow.
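A minimal graph using these relationships (services omitted):

from toil.job import Job

def step(job, name):
    return name

root = Job.wrapJobFn(step, "root")
child = Job.wrapJobFn(step, "child")
cleanup = Job.wrapJobFn(step, "cleanup")

root.addChild(child)       # runs after root's run() completes
root.addFollowOn(cleanup)  # runs after child and all its successors finish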
See toil.job.JobFunctionWrappingJob for a definition of a job function.
See toil.job.JobFunctionWrappingJob for a definition of a job function.
The temp dir is created on the first call and the same path is returned on subsequent calls.
:return: Path to the temp dir. See job.fileStore.getLocalTempDir.
Convenience function for constructor of toil.job.FunctionWrappingJob.
Convenience function for constructor of toil.job.JobFunctionWrappingJob.
The "promise" representing a return value of the job's run method, or, in case of a function-wrapping job, the wrapped function's return value.
Prepare this job (the promisor) so that its promises can register themselves with it, when the jobs they are promised to (promisees) are serialized.
The promisee holds the reference to the promise (usually as part of the job arguments), and when it is pickled, so are the promises it refers to. Pickling a promise triggers it to be registered with the promisor.
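A small example of a promise being fulfilled between jobs:

from toil.job import Job

def produce(job):
    return 21

def consume(job, value):
    return value * 2  # receives 21 once the promise is fulfilled

parent = Job.wrapJobFn(produce)
# parent.rv() is a Promise; it is replaced by produce's return value
# before consume actually runs.
parent.addChildJobFn(consume, parent.rv())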
See toil.job.Job.checkJobGraphConnected(), toil.job.Job.checkJobGraphAcyclic() and toil.job.Job.checkNewCheckpointsAreLeafVertices() for more info.
A root job is a job with no predecessors (i.e. one that is not a child, follow-on, or service).
Only deals with jobs created here, rather than loaded from the job store.
As execution always starts from one root job, having multiple root jobs will cause a deadlock to occur.
Only deals with jobs created here, rather than loaded from the job store.
A follow-on edge (A, B) between two jobs A and B is equivalent to adding a child edge to B from (1) A, (2) each child of A, and (3) the successors of each child of A. We call each such edge an "implied" edge. The augmented job graph is a job graph including all the implied edges.
For a job graph G = (V, E) the algorithm is O(|V|^2). It is O(|V| + |E|) for a graph with no follow-ons. The former follow-on case could be improved!
Only deals with jobs created here, rather than loaded from the job store.
A job is a leaf if it has no successors.
A checkpoint job must be a leaf when initially added to the job graph. When its run method is invoked it can then create direct successors. This restriction is made to simplify implementation.
Only works on connected components of jobs not yet added to the JobStore.
Examples for deferred functions are ones that handle cleanup of resources external to Toil, like Docker containers, files outside the work directory, etc.
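A hedged sketch of deferring such cleanup with Job.defer; the container ID is invented, and as noted above the cleanup may run more than once:

import subprocess
from toil.job import Job

def stop_container(container_id):
    # Best-effort external cleanup; must tolerate repeat invocations.
    subprocess.run(["docker", "rm", "-f", container_id], check=False)

def run_with_container(job):
    container_id = "example-container"  # invented for illustration
    job.defer(stop_container, container_id)
    # ... do work that uses the container; cleanup runs even on failure ...
    # e.g. schedule with Job.wrapJobFn(run_with_container)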
Deprecated by toil.common.Toil.start.
(see Job.Runner.getDefaultOptions and Job.Runner.addToilOptions) starting with this job.
:param job: Root job of the workflow.
:raises: toil.exceptions.FailedJobsException if at the end of the function there remain failed jobs.
:return: The return value of the root job's run function.
Abstract class used to define the interface to a service.
Should be subclassed by the user to define services.
Is not executed as a job; runs within a ServiceHostJob.
Only considers jobs in this job's subgraph that are newly added, not loaded from the job store.
Ignores service jobs.
The Job's JobDescription must have already had a real jobStoreID assigned to it.
Does not save the JobDescription.
Will abort the job if the "download_only" debug flag is set.
Can be given a hint list of file path pairs outside and inside the job container, in which case the container environment can be reconstructed.
General job exception.
An exception raised in the event that a workflow contains an unresolvable dependency, such as a cycle. See toil.job.Job.checkJobGraphForDeadlocks().
Job used to wrap a function. In its run method the wrapped function is called.
A job function is a function whose first argument is a Job instance that is the wrapping job for the function. This can be used to add successor jobs for the function and perform all the functions the Job class provides.
To enable the job function to get access to the toil.fileStores.abstractFileStore.AbstractFileStore instance (see toil.job.Job.run()), it is made a variable of the wrapping job called fileStore.
To specify a job's resource requirements, the following default keyword arguments can be specified: memory, cores, disk, accelerators, and preemptible.
For example to wrap a function into a job we would call:
Job.wrapJobFn(myJob, memory='100k', disk='1M', cores=0.1)
Handles dynamic resource allocation using toil.job.Promise instances. Spawns child function using parent function parameters and fulfilled promised resource requirements.
Handles dynamic resource allocation for job functions. See toil.job.JobFunctionWrappingJob
A convenience Job class used to make a job subgraph appear to be a single job.
Let A be the root job of a job subgraph and B be another job we'd like to run after A and all its successors have completed, for this use encapsulate:
# Job A and subgraph, Job B
A, B = A(), B()
Aprime = A.encapsulate()
Aprime.addChild(B)
# B will run after A and all its successors have completed; A and its
# subgraph of successors in effect appear to be just one job.
If the job being encapsulated has predecessors (e.g. is not the root job), then the encapsulated job will inherit these predecessors. If predecessors are added to the job being encapsulated after the encapsulated job is created then the encapsulating job will NOT inherit these predecessors automatically. Care should be exercised to ensure the encapsulated job has the proper set of predecessors.
The return value of an encapsulated job (as accessed by the toil.job.Job.rv() function) is the return value of the root job, e.g. A().encapsulate().rv() and A().rv() will resolve to the same value after A or A.encapsulate() has been run.
Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run.
Services allow things like databases and servers to be started and accessed by jobs in a workflow.
Follow-on jobs will be run after the child jobs and their successors have been run.
The "promise" representing a return value of the job's run method, or, in case of a function-wrapping job, the wrapped function's return value.
Prepare this job (the promisor) so that its promises can register themselves with it, when the jobs they are promised to (promisees) are serialized.
The promisee holds the reference to the promise (usually as part of the job arguments), and when it is pickled, so are the promises it refers to. Pickling a promise triggers it to be registered with the promisor.
We don't want to pickle our internal references to the job we encapsulate, so we elide them here. When actually run, we're just a no-op job that can maybe chain.
Job that runs a service. Used internally by Toil. Users should subclass Service instead of using this.
Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
Follow-on jobs will be run after the child jobs and their successors have been run.
The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run.
Services allow things like databases and servers to be started and accessed by jobs in a workflow.
Metadata for a file.
source is the URL to grab the file from.
parent_dir is the parent directory of the source.
size is the size of the file, or None if the file size cannot be retrieved.
Given a URI or bare path, yield in turn all the URIs, with schemes, where we should actually try to find it, given that we want to search under/against the given paths or URIs, the current directory, and the given importing WDL document if any.
Combine the outputs of multiple WorkerImportJob instances into one promise.
Job to do file imports on a worker instead of a leader. Assumes all local and cloud files are accessible.
For the CWL/WDL runners, this class is only used when runImportsOnWorkers is enabled.
When stream is true but the import is not streamable, the worker will run out of disk space, and a new import job with enough disk space will be run instead.
:param files: List of files to import.
:param file_source: AbstractJobStore.
:return: Dictionary mapping filenames to associated jobstore FileIDs.
Job to organize and delegate files to individual WorkerImportJobs.
For the CWL/WDL runners, this is only used when runImportsOnWorkers is enabled
References a return value from a toil.job.Job.run() or toil.job.Job.Service.start() method as a promise before the method itself is run.
Let T be a job. Instances of Promise (termed a promise) are returned by T.rv(), which is used to reference the return value of T's run function. When the promise is passed to the constructor (or as an argument to a wrapped function) of a different, successor job, the promise will be replaced by the actual referenced return value. This mechanism allows the return value of one job's run method to be an input argument to another job before the former job's run function has been executed.
Called during pickling when a promise (an instance of this class) is about to be pickled. Returns the Promise class and construction arguments that will be evaluated during unpickling, namely the job store coordinates of a file that will hold the promised return value. By the time the promise is about to be unpickled, that file should be populated.
The "unwrap" terminology is borrowed from Rust.
The "unwrap" terminology is borrowed from Rust.
(involving toil.job.Promise instances.)
Use when resource requirements depend on the return value of a parent function. PromisedRequirements can be modified by passing a function that takes the Promise as input.
For example, let f, g, and h be functions. Then a Toil workflow can be defined as follows::
A = Job.wrapFn(f)
B = A.addChildFn(g, cores=PromisedRequirement(A.rv()))
C = B.addChildFn(h, cores=PromisedRequirement(lambda x: 2*x, B.rv()))
Converts Promise instance to PromisedRequirement.
Throws an exception when unpickled.
This won't be unpickled unless the promise wasn't resolved, so we throw an exception.
| logger |
| ProxyConnectionError | Dummy class. |
| LocatorException | Base exception class for all locator exceptions. |
| InvalidImportExportUrlException | Raised when an import or export URL is invalid. |
| NoSuchJobException | Indicates that the specified job does not exist. |
| ConcurrentFileModificationException | Indicates that the file was attempted to be modified by multiple processes at once. |
| NoSuchFileException | Indicates that the specified file does not exist. |
| NoSuchJobStoreException | Indicates that the specified job store does not exist. |
| JobStoreExistsException | Indicates that the specified job store already exists. |
| AbstractJobStore | Represents the physical storage for the jobs and files in a Toil workflow. |
| JobStoreSupport | A mostly fake JobStore to access URLs not really associated with real job stores. |
Dummy class.
Base exception class for all locator exceptions. For example, job store/aws bucket exceptions where they already exist
Raised when an import or export URL is invalid.
Indicates that the specified job does not exist.
Indicates that the file was attempted to be modified by multiple processes at once.
Indicates that the specified file does not exist.
Indicates that the specified job store does not exist.
Indicates that the specified job store already exists.
Represents the physical storage for the jobs and files in a Toil workflow.
JobStores are responsible for storing toil.job.JobDescription (which relate jobs to each other) and files.
Actual toil.job.Job objects are stored in files, referenced by JobDescriptions. All the non-file CRUD methods the JobStore provides deal in JobDescriptions and not full, executable Jobs.
To actually get ahold of a toil.job.Job, use toil.job.Job.loadJob() with a JobStore and the relevant JobDescription.
Create the physical storage for this job store, allocate a workflow ID and persist the given Toil configuration to the store.
Raises an exception if the root job hasn't fulfilled its promise yet.
Currently supported schemes are:
Raises FileNotFoundError if the file does not exist.
Refer to AbstractJobStore.import_file() documentation for currently supported URL schemes.
Note that the helper method _exportFile is used to read from the source and write to destination. To implement any optimizations that circumvent this, the _exportFile method should be overridden by subclasses of AbstractJobStore.
May raise an error if file existence cannot be determined.
Currently supported schemes are:
Raises FileNotFoundError if the URL doesn't exist.
Raises FileNotFoundError if the URL doesn't exist.
Has a readable stream interface, unlike read_from_url() which takes a writable stream.
Fixes jobs that might have been partially updated. Resets the try counts and removes jobs that are not successors of the current root job.
Files associated with the assigned ID will be accepted even if the JobDescription has never been created or updated.
Must call jobDescription.pre_update_hook()
Returns a publicly accessible URL to the given file in the job store. The returned URL starts with 'http:', 'https:' or 'file:'. The returned URL may expire as early as 1h after it has been returned. Throws an exception if the file does not exist.
May declare the job to have failed (see toil.job.JobDescription.setupJobAfterFailure()) if there is evidence of a failed update attempt.
Must call jobDescription.pre_update_hook()
This operation is idempotent, i.e. deleting a job twice or deleting a non-existent job will succeed silently.
FIXME: some implementations may not raise this
FIXME: some implementations may not raise this
The file at the given local path may not be modified after this method returns!
Note! Implementations of readFile need to respect/provide the executable attribute on FileIDs.
Note that job stores which encrypt files might return overestimates of file sizes, since the encrypted file may have been padded to the nearest block, augmented with an initialization vector, etc.
Throws an exception if the file does not exist.
Only unread logs will be read unless the read_all parameter is set.
Overwriting the current contents of pid.log is a feature, not a bug, of this method. Other methods will rely on always having the most current pid available. So far there is no reason to store any old pids.
The initialized file contains the characters "NO". This should only be changed when the user runs the "toil kill" command.
Changing this file to a "YES" triggers a kill of the leader process. The workers are expected to be cleaned up by the leader.
see https://github.com/DataBiosphere/toil/issues/4218
A mostly fake JobStore to access URLs not really associated with real job stores.
| boto3_session | |
| s3_boto3_resource | |
| s3_boto3_client | |
| logger | |
| CONSISTENCY_TICKS | |
| CONSISTENCY_TIME | |
| aRepr | |
| custom_repr |
| ChecksumError | Raised when a download from AWS does not contain the correct data. |
| DomainDoesNotExist | Raised when a domain that is expected to exist does not exist. |
| BucketLocationConflictException | Base exception class for all locator exceptions. |
| AWSJobStore | A job store that uses Amazon's S3 for file storage and SimpleDB for storing job info and enforcing strong consistency on the S3 file storage. |
Raised when a download from AWS does not contain the correct data.
Raised when a domain that is expected to exist does not exist.
A job store that uses Amazon's S3 for file storage and SimpleDB for storing job info and enforcing strong consistency on the S3 file storage. There will be SDB domains for jobs and files and a versioned S3 bucket for file contents. Job objects are pickled, compressed, partitioned into chunks of 1024 bytes, and each chunk is stored as an attribute of the SDB item representing the job. UUIDs are used to identify jobs and files.
Create the physical storage for this job store, allocate a workflow ID and persist the given Toil configuration to the store.
Files associated with the assigned ID will be accepted even if the JobDescription has never been created or updated.
Must call jobDescription.pre_update_hook()
May declare the job to have failed (see toil.job.JobDescription.setupJobAfterFailure()) if there is evidence of a failed update attempt.
Must call jobDescription.pre_update_hook()
This operation is idempotent, i.e. deleting a job twice or deleting a non-existent job will succeed silently.
FIXME: some implementations may not raise this
FIXME: some implementations may not raise this
Throws an exception if the file does not exist.
Note that job stores which encrypt files might return overestimates of file sizes, since the encrypted file may have been padded to the nearest block, augmented with an initialization vector, etc.
The file at the given local path may not be modified after this method returns!
Note! Implementations of readFile need to respect/provide the executable attribute on FileIDs.
Only unread logs will be read unless the read_all parameter is set.
Returns a publicly accessible URL to the given file in the job store. The returned URL starts with 'http:', 'https:' or 'file:'. The returned URL may expire as early as 1h after it's been returned. Throws an exception if the file does not exist.
Represents a file in this job store.
Base exception class for all locator exceptions, for example exceptions raised when a job store or AWS bucket already exists.
| logger | |
| DIAL_SPECIFIC_REGION_CONFIG |
| ServerSideCopyProhibitedError | Raised when AWS refuses to perform a server-side copy between S3 keys, and |
| SDBHelper | A mixin with methods for storing limited amounts of binary data in an SDB item |
| fileSizeAndTime(localFilePath) | |
| uploadFromPath(localFilePath, resource, bucketName, fileID) | Uploads a file to s3, using multipart uploading if applicable |
| uploadFile(readable, resource, bucketName, fileID[, ...]) | Upload a readable object to s3, using multipart uploading if applicable. |
| copyKeyMultipart(resource, srcBucketName, srcKeyName, ...) | Copies a key from a source key to a destination key in multiple parts. Note that if the |
| monkeyPatchSdbConnection(sdb) | |
| sdb_unavailable(e) | |
| no_such_sdb_domain(e) | |
| retryable_ssl_error(e) | |
| retryable_sdb_errors(e) | |
| retry_sdb([delays, timeout, predicate]) |
>>> import os
>>> H=SDBHelper
>>> H.presenceIndicator()
u'numChunks'
>>> H.binaryToAttributes(None)['numChunks']
0
>>> H.attributesToBinary({u'numChunks': 0})
(None, 0)
>>> H.binaryToAttributes(b'')
{u'000': b'VQ==', u'numChunks': 1}
>>> H.attributesToBinary({u'numChunks': 1, u'000': b'VQ=='})
(b'', 1)
Good pseudo-random data is very likely smaller than its bzip2ed form. Subtract 1 for the type character, i.e. 'C' or 'U', with which the string is prefixed. We should get one full chunk:
>>> s = os.urandom(H.maxRawValueSize-1)
>>> d = H.binaryToAttributes(s)
>>> len(d), len(d['000'])
(2, 1024)
>>> H.attributesToBinary(d) == (s, 1)
True
One byte more and we should overflow four bytes into the second chunk, two bytes for base64-encoding the additional character and two bytes for base64-padding to the next quartet.
>>> s += s[0:1]
>>> d = H.binaryToAttributes(s)
>>> len(d), len(d['000']), len(d['001'])
(3, 1024, 4)
>>> H.attributesToBinary(d) == (s, 2)
True
Raised when AWS refuses to perform a server-side copy between S3 keys, and insists that you pay to download and upload the data yourself instead.
This function will always do a fast, server-side copy, at least until/unless <https://github.com/boto/boto3/issues/3270> is fixed. In some situations, a fast, server-side copy is not actually possible. For example, when residing in an AWS VPC with an S3 VPC Endpoint configured, copying from a bucket in another region to a bucket in your own region cannot be performed server-side. This is because the VPC Endpoint S3 API servers refuse to perform server-side copies between regions, the source region's API servers refuse to initiate the copy and refer you to the destination bucket's region's API servers, and the VPC routing tables are configured to redirect all access to the current region's S3 API servers to the S3 Endpoint API servers instead.
If a fast server-side copy is not actually possible, a ServerSideCopyProhibitedError will be raised.
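A hedged usage sketch follows; the positional arguments shown follow the table above but are abbreviated there, so treat their exact order as an assumption, and the fallback helper is hypothetical.

try:
    copyKeyMultipart(resource, srcBucketName, srcKeyName,
                     dstBucketName, dstKeyName)
except ServerSideCopyProhibitedError:
    # AWS refused the server-side copy; the only option left is to
    # download the object and re-upload it to the destination.
    stream_copy_by_hand()  # hypothetical fallback, not part of this module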
| collect_ignore |
| logger |
| FileJobStore | A job store that uses a directory on a locally attached file system. To be compatible with |
A job store that uses a directory on a locally attached file system. To be compatible with distributed batch systems, that file system must be shared by all worker nodes.
see https://github.com/DataBiosphere/toil/issues/4218
Create the physical storage for this job store, allocate a workflow ID and persist the given Toil configuration to the store.
Files associated with the assigned ID will be accepted even if the JobDescription has never been created or updated.
Must call jobDescription.pre_update_hook()
Returns a publicly accessible URL to the given file in the job store. The returned URL starts with 'http:', 'https:' or 'file:'. The returned URL may expire as early as 1h after it's been returned. Throws an exception if the file does not exist.
May declare the job to have failed (see toil.job.JobDescription.setupJobAfterFailure()) if there is evidence of a failed update attempt.
Must call jobDescription.pre_update_hook()
This operation is idempotent, i.e. deleting a job twice or deleting a non-existent job will succeed silently.
FIXME: some implementations may not raise this
FIXME: some implementations may not raise this
Throws an exception if the file does not exist.
The file at the given local path may not be modified after this method returns!
Note! Implementations of readFile need to respect/provide the executable attribute on FileIDs.
Note that job stores which encrypt files might return overestimates of file sizes, since the encrypted file may have been padded to the nearest block, augmented with an initialization vector, etc.
Used for debugging.
Only unread logs will be read unless the read_all parameter is set.
| log | |
| GOOGLE_STORAGE | |
| MAX_BATCH_SIZE |
| GoogleJobStore | Represents the physical storage for the jobs and files in a Toil workflow. |
| google_retry_predicate(e) | necessary because under heavy load google may throw |
| google_retry(f) | This decorator retries the wrapped function if google throws any angry service |
Necessary because, under heavy load, Google may throw rate-limiting errors or numerous other server errors which need to be retried.
This decorator retries the wrapped function if Google throws any angry service errors. It should wrap any function that makes use of the Google Client API.
Represents the physical storage for the jobs and files in a Toil workflow.
JobStores are responsible for storing toil.job.JobDescription (which relate jobs to each other) and files.
Actual toil.job.Job objects are stored in files, referenced by JobDescriptions. All the non-file CRUD methods the JobStore provides deal in JobDescriptions and not full, executable Jobs.
To actually get ahold of a toil.job.Job, use toil.job.Job.loadJob() with a JobStore and the relevant JobDescription.
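For example, a minimal sketch, assuming a job store and a JobDescription (here job_store and job_description) are already in hand:

from toil.job import Job

# Rehydrate the full, executable job referenced by a JobDescription.
job = Job.loadJob(job_store, job_description)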
Fall back to anonymous access if no project is available, unlike the Google Storage module's behavior.
Warn if GOOGLE_APPLICATION_CREDENTIALS is set but not actually present.
Create the physical storage for this job store, allocate a workflow ID and persist the given Toil configuration to the store.
Files associated with the assigned ID will be accepted even if the JobDescription has never been created or updated.
Must call jobDescription.pre_update_hook()
Returns a publicly accessible URL to the given file in the job store. The returned URL starts with 'http:', 'https:' or 'file:'. The returned URL may expire as early as 1h after it's been returned. Throws an exception if the file does not exist.
May declare the job to have failed (see toil.job.JobDescription.setupJobAfterFailure()) if there is evidence of a failed update attempt.
Must call jobDescription.pre_update_hook()
This operation is idempotent, i.e. deleting a job twice or deleting a non-existent job will succeed silently.
FIXME: some implementations may not raise this
FIXME: some implementations may not raise this
The file at the given local path may not be modified after this method returns!
Note! Implementations of readFile need to respect/provide the executable attribute on FileIDs.
Note that job stores which encrypt files might return overestimates of file sizes, since the encrypted file may have been padded to the nearest block, augmented with an initialization vector, etc.
Throws an exception if the file does not exist.
Only unread logs will be read unless the read_all parameter is set.
| log |
| JobStoreUnavailableException | Raised when a particular type of job store is requested but can't be used. |
| WritablePipe | An object-oriented wrapper for os.pipe. Clients should subclass it, implement |
| ReadablePipe | An object-oriented wrapper for os.pipe. Clients should subclass it, implement |
| ReadableTransformingPipe | A pipe which is constructed around a readable stream, and which provides a |
| generate_locator(job_store_type[, local_suggestion, ...]) | Generate a random locator for a job store of the given type. Raises an |
An object-oriented wrapper for os.pipe. Clients should subclass it, implement readFrom() to consume the readable end of the pipe, then instantiate the class as a context manager to get the writable end. See the example below.
>>> import sys, shutil
>>> class MyPipe(WritablePipe):
... def readFrom(self, readable):
... shutil.copyfileobj(codecs.getreader('utf-8')(readable), sys.stdout)
>>> with MyPipe() as writable:
... _ = writable.write('Hello, world!\n'.encode('utf-8'))
Hello, world!
Each instance of this class creates a thread and invokes the readFrom method in that thread. The thread will be join()ed upon normal exit from the context manager, i.e. the body of the with statement. If an exception occurs, the thread will not be joined but a well-behaved readFrom() implementation will terminate shortly thereafter due to the pipe having been closed.
Now, exceptions in the reader thread will be reraised in the main thread:
>>> class MyPipe(WritablePipe):
... def readFrom(self, readable):
... raise RuntimeError('Hello, world!')
>>> with MyPipe() as writable:
... pass
Traceback (most recent call last):
...
RuntimeError: Hello, world!
More complicated, less illustrative tests:
Same as above, but proving that handles are closed:
>>> x = os.dup(0); os.close(x)
>>> class MyPipe(WritablePipe):
...     def readFrom(self, readable):
...         raise RuntimeError('Hello, world!')
>>> with MyPipe() as writable:
...     pass
Traceback (most recent call last):
...
RuntimeError: Hello, world!
>>> y = os.dup(0); os.close(y); x == y
True
Exceptions in the body of the with statement aren't masked, and handles are closed:
>>> x = os.dup(0); os.close(x)
>>> class MyPipe(WritablePipe):
...     def readFrom(self, readable):
...         pass
>>> with MyPipe() as writable:
...     raise RuntimeError('Hello, world!')
Traceback (most recent call last):
...
RuntimeError: Hello, world!
>>> y = os.dup(0); os.close(y); x == y
True
An object-oriented wrapper for os.pipe. Clients should subclass it, implement writeTo() to place data into the writable end of the pipe, then instantiate the class as a context manager to get the writable end. See the example below.
>>> import sys, shutil
>>> class MyPipe(ReadablePipe):
... def writeTo(self, writable):
... writable.write('Hello, world!\n'.encode('utf-8'))
>>> with MyPipe() as readable:
... shutil.copyfileobj(codecs.getreader('utf-8')(readable), sys.stdout)
Hello, world!
Each instance of this class creates a thread and invokes the writeTo() method in that thread. The thread will be join()ed upon normal exit from the context manager, i.e. the body of the with statement. If an exception occurs, the thread will not be joined but a well-behaved writeTo() implementation will terminate shortly thereafter due to the pipe having been closed.
Now, exceptions in the writer thread will be reraised in the main thread:
>>> class MyPipe(ReadablePipe):
... def writeTo(self, writable):
... raise RuntimeError('Hello, world!')
>>> with MyPipe() as readable:
... pass
Traceback (most recent call last):
...
RuntimeError: Hello, world!
More complicated, less illustrative tests:
Same as above, but proving that handles are closed:
>>> x = os.dup(0); os.close(x)
>>> class MyPipe(ReadablePipe):
...     def writeTo(self, writable):
...         raise RuntimeError('Hello, world!')
>>> with MyPipe() as readable:
...     pass
Traceback (most recent call last):
...
RuntimeError: Hello, world!
>>> y = os.dup(0); os.close(y); x == y
True
Exceptions in the body of the with statement aren't masked, and handles are closed:
>>> x = os.dup(0); os.close(x)
>>> class MyPipe(ReadablePipe):
...     def writeTo(self, writable):
...         pass
>>> with MyPipe() as readable:
...     raise RuntimeError('Hello, world!')
Traceback (most recent call last):
...
RuntimeError: Hello, world!
>>> y = os.dup(0); os.close(y); x == y
True
A pipe which is constructed around a readable stream, and which provides a context manager that gives a readable stream.
Useful as a base class for pipes which have to transform or otherwise visit bytes that flow through them, instead of just consuming or producing data.
Clients should subclass it and implement transform(), like so:
>>> import sys, shutil
>>> class MyPipe(ReadableTransformingPipe):
... def transform(self, readable, writable):
... writable.write(readable.read().decode('utf-8').upper().encode('utf-8'))
>>> class SourcePipe(ReadablePipe):
... def writeTo(self, writable):
... writable.write('Hello, world!\n'.encode('utf-8'))
>>> with SourcePipe() as source:
... with MyPipe(source) as transformed:
... shutil.copyfileobj(codecs.getreader('utf-8')(transformed), sys.stdout)
HELLO, WORLD!
The transform() method runs in its own thread, and should move data chunk by chunk instead of all at once. It should finish normally if it encounters either an EOF on the readable, or a BrokenPipeError on the writable. This means that it should make sure to actually catch a BrokenPipeError when writing.
See also: toil.lib.misc.WriteWatchingStream.
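For instance, a pass-through transform() that honors those constraints might look like this (a sketch; the class name is hypothetical):

class ChunkedPassthroughPipe(ReadableTransformingPipe):
    # Move data in bounded chunks, finishing normally on EOF from the
    # readable or on a consumer that closed its end of the pipe.
    def transform(self, readable, writable):
        while True:
            chunk = readable.read(64 * 1024)
            if not chunk:
                break  # EOF on the source stream
            try:
                writable.write(chunk)
            except BrokenPipeError:
                break  # downstream reader went away; stop cleanly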
Raised when a particular type of job store is requested but can't be used.
The leader script (of the leader/worker pair) for running jobs.
| logger |
| Leader | Represents the Toil leader. |
Responsible for determining what jobs are ready to be scheduled, by consulting the job store, and issuing them to the batch system.
This is the leader's main loop.
Put it on a queue if the maximum number of service jobs to be scheduled has been reached.
Returns the jobs that, upon processing, were reissued.
If a job has been running for longer than desirable, issue a kill instruction and wait for the job to die, then pass the job to process_finished_job.
If a job is missing, we mark it as such. If it stays missing for a number of runs of this function (say 10), we try deleting the job (though it's probably lost), wait, and then pass the job to process_finished_job.
Called when an attempt to run a job finishes, either successfully or otherwise.
Takes the job out of the issued state, and then works out what to do about the fact that it succeeded or failed.
If wall-clock time is available, informs the cluster scaler about the job finishing.
If the job failed and a batch system ID is available, checks for and reports batch system logs.
Checks if it succeeded and was removed, or if it failed and needs to be set up after failure, and dispatches to the appropriate function.
Accelerator (i.e. GPU) utilities for Toil
| have_working_nvidia_smi() | Return True if the nvidia-smi binary, from nvidia's CUDA userspace |
| get_host_accelerator_numbers() | Work out what accelerator is what. |
| have_working_nvidia_docker_runtime() | Return True if Docker exists and can handle an "nvidia" runtime and the "--gpus" option. |
| count_nvidia_gpus() | Return the number of nvidia GPUs seen by nvidia-smi, or 0 if it is not working. |
| count_amd_gpus() | Return the number of amd GPUs seen by rocm-smi, or 0 if it is not working. |
| get_individual_local_accelerators() | Determine all the local accelerators available. Report each with count 1, |
| get_restrictive_environment_for_local_accelerators(...) | Get environment variables which can be applied to a process to restrict it |
TODO: This isn't quite the same as the check that cwltool uses to decide if it can fulfill a CUDARequirement.
For each accelerator visible to us, returns the host-side (for example, outside-of-Slurm-job) number for that accelerator. It is often the same as the apparent number.
Can be used with Docker's --gpus='"device=#,#,#"' option to forward the right GPUs as seen from a Docker daemon.
TODO: How will numbers work with multiple types of accelerator? We need an accelerator assignment API.
The numbers are in the space of accelerators returned by get_individual_local_accelerators().
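A sketch of turning those numbers into the Docker option, assuming the function returns a list of integers:

numbers = get_host_accelerator_numbers()
# For example, [0, 2] becomes --gpus='"device=0,2"'
device_list = ','.join(str(n) for n in numbers)
gpus_option = "--gpus='\"device=%s\"'" % device_list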
| logger |
| ReleaseFeedUnavailableError | Raised when Flatcar releases can't be located. |
| get_flatcar_ami(ec2_client[, architecture]) | Retrieve the flatcar AMI image to use as the base for all Toil autoscaling instances. |
| flatcar_release_feed_ami(region[, architecture, source]) | Yield AMI IDs for the given architecture from the Flatcar release feed. |
| feed_flatcar_ami_release(ec2_client[, architecture, ...]) | Check a Flatcar release feed for the latest flatcar AMI. |
| aws_marketplace_flatcar_ami_search(ec2_client[, ...]) | Query AWS for all AMI names matching Flatcar-stable-* and return the most recent one. |
Raised when Flatcar releases can't be located.
AMI must be available to the user on AWS (attempting to launch will return a 403 otherwise).
Retries if the release feed cannot be fetched. If the release feed has a permanent error, yields nothing. If some entries in the release feed are unparseable, yields the others.
Verify it's on AWS.
Does not raise exceptions.
Does not raise exceptions.
| logger | |
| CLUSTER_LAUNCHING_PERMISSIONS | |
| AllowedActionCollection |
| delete_iam_instance_profile(instance_profile_name[, ...]) | |
| delete_iam_role(role_name[, region, quiet]) | Deletes an AWS IAM role. Any separate policies are detached from the role, and any inline policies are deleted. |
| create_iam_role(role_name, ...[, region]) | Creates an AWS IAM role. Any separate policies are detached from the role, and any inline policies are deleted. |
| init_action_collection() | Initialization of an action collection, an action collection contains allowed Actions and NotActions |
| add_to_action_collection(a, b) | Combines two action collections |
| policy_permissions_allow(given_permissions[, ...]) | Check whether given set of actions are a subset of another given set of actions, returns true if they are |
| permission_matches_any(perm, list_perms) | Takes a permission and checks whether it's contained within a list of given permissions |
| get_actions_from_policy_document(policy_doc) | Given a policy document, go through each statement and create an AllowedActionCollection representing the |
| allowed_actions_attached(iam, attached_policies) | Go through all attached policy documents and create an AllowedActionCollection representing granted permissions. |
| allowed_actions_roles(iam, policy_names, role_name) | Returns a dictionary containing a list of all aws actions allowed for a given role. |
| collect_policy_actions(policy_documents) | Collect all of the actions allowed by the given policy documents into one AllowedActionCollection. |
| allowed_actions_user(iam, policy_names, user_name) | Gets all allowed actions for a user given by user_name, returns a dictionary, keyed by resource, |
| allowed_actions_group(iam, policy_names, group_name) | Gets all allowed actions for a group given by group_name, returns a dictionary, keyed by resource, |
| get_policy_permissions(region) | Returns an action collection containing lists of all permission grant patterns keyed by resource |
| get_aws_account_num() | Returns AWS account num |
A NotAction explicitly allows all actions that don't match a specific pattern, e.g. iam:* allows all non-IAM actions.
| logger |
| list_multipart_uploads(bucket, region, prefix[, ...]) |
| logger |
| AWSConnectionManager | Class that represents a connection to AWS. Caches Boto 3 and Boto 2 objects |
| establish_boto3_session([region_name]) | Get a Boto 3 session usable by the current thread. |
| client(…) | Get a Boto 3 client for a particular AWS service, usable by the current thread. |
| resource(…) | Get a Boto 3 resource for a particular AWS service, usable by the current thread. |
Access to any kind of item goes through the particular method for the thing you want (session, resource, service, Boto2 Context), and then you pass the region you want to work in, and possibly the type of thing you want, as arguments.
This class is intended to eventually enable multi-region clusters, where connections to multiple regions may need to be managed in the same provisioner.
We also support None for a region, in which case no region will be passed to Boto/Boto3. The caller is responsible for implementing e.g. TOIL_AWS_REGION support.
Since connection objects may not be thread safe (see <https://boto3.amazonaws.com/v1/documentation/api/1.14.31/guide/session.html#multithreading-or-multiprocessing-with-sessions>), one is created for each thread that calls the relevant lookup method.
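A minimal sketch of that per-thread caching idea (illustrative only, not the class's actual internals):

import threading
import boto3

_thread_local = threading.local()

def session_for_region(region_name):
    # Keep one Boto 3 session per (thread, region), since sessions
    # may not be safe to share between threads.
    cache = getattr(_thread_local, 'sessions', None)
    if cache is None:
        cache = _thread_local.sessions = {}
    if region_name not in cache:
        cache[region_name] = boto3.session.Session(region_name=region_name)
    return cache[region_name]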
This function may not always establish a new session; it can be memoized.
Global alternative to AWSConnectionManager.
Global alternative to AWSConnectionManager.
| ClientError | |
| logger | |
| THROTTLED_ERROR_CODES |
| NoBucketLocationError | Error to represent that we could not get a location for a bucket. |
| delete_sdb_domain(sdb_domain_name[, region, quiet]) | |
| connection_reset(e) | Return true if an error is a connection reset error. |
| connection_error(e) | Return True if an error represents a failure to make a network connection. |
| retryable_s3_errors(e) | Return true if this is an error from S3 that looks like we ought to retry our request. |
| retry_s3([delays, timeout, predicate]) | Retry iterator of context managers specifically for S3 operations. |
| delete_s3_bucket(s3_resource, bucket[, quiet]) | Delete the given S3 bucket. |
| create_s3_bucket(s3_resource, bucket_name, region) | Create an AWS S3 bucket, using the given Boto3 S3 session, with the |
| enable_public_objects(bucket_name) | Enable a bucket to contain objects which are public. |
| get_bucket_region(bucket_name[, endpoint_url, ...]) | Get the AWS region name associated with the given S3 bucket, or raise NoBucketLocationError. |
| region_to_bucket_location(region) | |
| bucket_location_to_region(location) | |
| get_object_for_url(url[, existing]) | Extracts a key (object) from a given parsed s3:// URL. |
| list_objects_for_url(url) | Extracts a key (object) from a given parsed s3:// URL. The URL will be |
| flatten_tags(tags) | Convert tags from a key to value dict into a list of 'Key': xxx, 'Value': xxx dicts. |
| boto3_pager(requestor_callable, result_attribute_name, ...) | Yield all the results from calling the given Boto 3 method with the |
| get_item_from_attributes(attributes, name) | Given a list of attributes, find the attribute associated with the name and return its corresponding value. |
Supports the us-east-1 region, where bucket creation is special.
ALL S3 bucket creation should use this function.
This adjusts the bucket's Public Access Block setting to not block all public access, and also adjusts the bucket's Object Ownership setting to a setting which enables object ACLs.
Does not touch the account's Public Access Block setting, which can also interfere here. That is probably best left to the account administrator.
This configuration used to be the default, and is what most of Toil's code is written to expect, but it was changed so that new buckets default to the more restrictive setting <https://aws.amazon.com/about-aws/whats-new/2022/12/amazon-s3-automatically-enable-block-public-access-disable-access-control-lists-buckets-april-2023/>, with the expectation that people would write IAM policies for the buckets to allow public access if needed. Toil expects to be able to make arbitrary objects in arbitrary places public, and naming them all in an IAM policy would be a very awkward way to do it. So we restore the old behavior.
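The two adjustments described above correspond roughly to these Boto 3 calls (a sketch, not the function's exact body):

import boto3

def enable_public_objects_sketch(bucket_name):
    s3_client = boto3.client('s3')
    # Stop this bucket from blocking public ACLs and policies; the
    # account-level Public Access Block, if any, is left alone.
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            'BlockPublicAcls': False,
            'IgnorePublicAcls': False,
            'BlockPublicPolicy': False,
            'RestrictPublicBuckets': False,
        },
    )
    # Restore the older ownership setting so that object ACLs work.
    s3_client.put_bucket_ownership_controls(
        Bucket=bucket_name,
        OwnershipControls={'Rules': [{'ObjectOwnership': 'ObjectWriter'}]},
    )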
Error to represent that we could not get a location for a bucket.
Does not log at info level or above when this does not work; failures are expected in some contexts.
Takes an optional S3 API URL override.
If existing is true and the object does not exist, raises FileNotFoundError.
The attribute_list will be a list of TypedDicts (which Boto 3 SDB functions commonly return), where each TypedDict has a "Name" and a "Value" key-value pair. This function grabs the value out of the associated TypedDict.
If the attribute with the name does not exist, the function will return None.
| AWSRegionName | |
| AWSServerErrors | |
| logger |
| get_current_aws_region() | Return the AWS region that the currently configured AWS zone (see |
| get_aws_zone_from_environment() | Get the AWS zone from TOIL_AWS_ZONE if set. |
| get_aws_zone_from_metadata() | Get the AWS zone from instance metadata, if on EC2 and the boto module is |
| get_aws_zone_from_boto() | Get the AWS zone from the Boto3 config file or from AWS_DEFAULT_REGION, if it is configured and the |
| get_aws_zone_from_environment_region() | Pick an AWS zone in the region defined by TOIL_AWS_REGION, if it is set. |
| get_current_aws_zone() | Get the currently configured or occupied AWS zone to use. |
| zone_to_region(zone) | Get a region (e.g. us-west-2) from a zone (e.g. us-west-1c). |
| running_on_ec2() | Return True if we are currently running on EC2, and false otherwise. |
| running_on_ecs() | Return True if we are currently running on Amazon ECS, and false otherwise. |
| build_tag_dict_from_env([environment]) |
Reports the TOIL_AWS_ZONE environment variable if set.
Otherwise, if we have boto and are running on EC2, or if we are on ECS, reports the zone we are running in.
Otherwise, if we have the TOIL_AWS_REGION variable set, chooses a zone in that region.
Finally, if we have boto2, and a default region is configured in Boto 2, chooses a zone in that region.
Returns 'us-east-1a' if no method can produce a zone to use.
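That priority order amounts to the following chain over the functions listed above (a sketch of the documented behavior, not the actual implementation):

def get_current_aws_zone_sketch():
    return (get_aws_zone_from_environment()            # TOIL_AWS_ZONE
            or get_aws_zone_from_metadata()            # EC2/ECS metadata
            or get_aws_zone_from_environment_region()  # a zone in TOIL_AWS_REGION
            or get_aws_zone_from_boto()                # Boto config default region
            or 'us-east-1a')                           # documented last resort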
| system(command) | A convenience wrapper around subprocess.check_call that logs the command before passing it |
| getLogLevelString([logger]) | |
| setLoggingFromOptions(options) | |
| getTempFile([suffix, rootDir]) |
| deprecated(new_function_name) | |
| compat_bytes(s) | |
| compat_bytes_recursive(data) | Convert a tree of objects over bytes to objects over strings. |
Conversion utilities for mapping memory, disk, core declarations from strings to numbers and vice versa. Also contains general conversion functions
| BINARY_PREFIXES | |
| DECIMAL_PREFIXES | |
| VALID_PREFIXES |
| bytes_in_unit([unit]) | |
| convert_units(num, src_unit[, dst_unit]) | Returns a float representing the converted input in dst_units. |
| parse_memory_string(string) | Given a string representation of some memory (e.g. '1024 MiB'), return the |
| human2bytes(string) | Given a string representation of some memory (e.g. '1024 MiB'), return the |
| bytes2human(n) | Return a binary value as a human readable string with units. |
| b_to_mib(n) | Convert a number from bytes to mebibytes. |
| mib_to_b(n) | Convert a number from mebibytes to bytes. |
| hms_duration_to_seconds(hms) | Parses a given time string in hours:minutes:seconds, |
| strtobool(val) | Make a human-readable string into a bool. |
| opt_strtobool(b) | Convert an optional string representation of bool to None or bool |
Convert a string along the lines of "y", "1", "ON", "TrUe", or "Yes" to True, and the corresponding false-ish values to False.
| logger | |
| FORGO | |
| STOP | |
| RM |
| dockerCheckOutput(*args, **kwargs) | |
| dockerCall(*args, **kwargs) | |
| subprocessDockerCall(*args, **kwargs) | |
| apiDockerCall(job, image[, parameters, deferParam, ...]) | A toil wrapper for the python docker API. |
| dockerKill(container_name[, gentleKill, remove, timeout]) | Immediately kills a container. Equivalent to "docker kill": |
| dockerStop(container_name[, remove]) | Gracefully kills a container. Equivalent to "docker stop": |
| containerIsRunning(container_name[, timeout]) | Checks whether the container is running or not. |
| getContainerName(job) | Create a random string including the job name, and return it. Name will |
Docker API Docs: https://docker-py.readthedocs.io/en/stable/index.html
Docker API Code: https://github.com/docker/docker-py
This implements docker's python API within toil so that calls are run as jobs, with the intention that failed/orphaned docker jobs be handled appropriately.
Example of using apiDockerCall in toil to index a FASTA file with SAMtools:

def toil_job(job):
    working_dir = job.fileStore.getLocalTempDir()
    path = job.fileStore.readGlobalFile(ref_id,
                                        os.path.join(working_dir, 'ref.fasta'))
    parameters = ['faidx', path]
    apiDockerCall(job,
                  image='quay.io/ucsc_cgl/samtools:latest',
                  working_dir=working_dir,
                  parameters=parameters)
Note that when run with detach=False, or with detach=True and stdout=True or stderr=True, this is a blocking call. When run with detach=True and without output capture, the container is started and returned without waiting for it to finish.
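For example, a fire-and-forget container might be started like this (a hedged sketch; the image and command are placeholders):

# Start the container and return immediately, without capturing output.
container = apiDockerCall(job,
                          image='ubuntu:22.04',
                          parameters=['sleep', '60'],
                          detach=True)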
| a_short_time | |
| a_long_time | |
| logger | |
| INCONSISTENCY_ERRORS | |
| iam_client |
| UserError | Unspecified run-time error. |
| UnexpectedResourceState | Common base class for all non-exit exceptions. |
| not_found(e) | |
| inconsistencies_detected(e) | |
| retry_ec2([t, retry_for, retry_while]) | |
| wait_transition(boto3_ec2, resource, from_states, to_state) | Wait until the specified EC2 resource (instance, image, volume, ...) transitions from any |
| wait_instances_running(boto3_ec2, instances) | Wait until no instance in the given iterable is 'pending'. Yield every instance that |
| wait_spot_requests_active(boto3_ec2, requests[, ...]) | Wait until no spot request in the given iterator is in the 'open' state or, optionally, |
| create_spot_instances(boto3_ec2, price, image_id, spec) | Create instances on the spot market. |
| create_ondemand_instances(boto3_ec2, image_id, spec[, ...]) | Requests the RunInstances EC2 API call but accounts for the race between recently created |
| increase_instance_hop_limit(boto3_ec2, boto_instance_list) | Increase the default HTTP hop limit, as we are running Toil and Kubernetes inside a Docker container, so the default |
| prune(bushy) | Prune entries in the given dict with false-y values. |
| wait_until_instance_profile_arn_exists(...) | |
| create_instances(ec2_resource, image_id, key_name, ...) | Replaces create_ondemand_instances. Uses boto3 and returns a list of Boto3 instance dicts. |
| create_launch_template(ec2_client, template_name, ...) | Creates a launch template with the given name for launching instances with the given parameters. |
| create_auto_scaling_group(autoscaling_client, ...[, ...]) | Create a new Auto Scaling Group with the given name (which is also its |
Unspecified run-time error.
Common base class for all non-exit exceptions.
Must be called after the instances are guaranteed to be running.
Tags, if given, are applied to the instances, and all volumes.
We only ever use the default version of any launch template.
Internally calls https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html?highlight=create_launch_template#EC2.Client.create_launch_template
The default version of the launch template is used.
| logger | |
| manager | |
| dirname | |
| region_json_dirname | |
| EC2Regions |
| InstanceType |
| is_number(s) | Determines if a unicode string (that may include commas) is a number. |
| parse_storage(storage_info) | Parses EC2 JSON storage param string into a number. |
| parse_memory(mem_info) | Returns EC2 'memory' string as a float. |
| download_region_json(filename[, region]) | Downloads and writes the AWS Billing JSON to a file using the AWS pricing API. |
| reduce_region_json_size(filename) | Deletes information in the json file that we don't need, and rewrites it. This makes the file smaller. |
| updateStaticEC2Instances() | Generates a new python file of fetchable EC2 Instances by region with current prices and specs. |
Format should always be '#' GiB (example: '244 GiB' or '1,952 GiB'). Amazon loves to put commas in their numbers, so we have to accommodate that. If the syntax ever changes, this will raise.
See: https://aws.amazon.com/blogs/aws/new-aws-price-list-api/
The reason: we used to download the unified AWS Bulk API JSON, which eventually crept up to 5.6 GB, and loading it could not be done on a machine with 32 GB of RAM. Now we download each region's JSON individually (with AWS's new Query API), but even those may one day grow ridiculously large, so we do what we can to keep the file sizes down (and thus also the amount loaded into memory) to keep this script working for longer.
Takes a few (~3+) minutes to run (you'll need decent internet).
| collect_ignore |
| UnimplementedURLException | Unspecified run-time error. |
| panic | The Python idiom for reraising a primary exception fails when the except block raises a |
| raise_(exc_type, exc_value, traceback) |
This is a contextmanager that should be used like this:
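A minimal sketch, assuming a hypothetical primary operation and cleanup step:

try:
    do_dangerous_work()  # hypothetical primary operation
except:
    with panic(log):
        # Secondary exceptions raised here are logged (if a logger was
        # passed) or swallowed; the primary exception is reraised at
        # the end of the with block.
        clean_up()  # hypothetical cleanup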
If a logging logger is passed to panic(), any secondary Exception raised within the with block will be logged. Otherwise those exceptions are swallowed. At the end of the with block the primary exception will be reraised.
Unspecified run-time error.
| Expando | Pass initial attributes to the constructor: |
| MagicExpando | Use MagicExpando for chained attribute access. |
Pass initial attributes to the constructor:
>>> o = Expando(foo=42)
>>> o.foo
42
Dynamically create new attributes:
>>> o.bar = 'hi'
>>> o.bar
'hi'
Expando is a dictionary:
>>> isinstance(o,dict)
True
>>> o['foo']
42
Works great with JSON:
>>> import json
>>> s='{"foo":42}'
>>> o = json.loads(s,object_hook=Expando)
>>> o.foo
42
>>> o.bar = 'hi'
>>> o.bar
'hi'
And since Expando is a dict, it serializes back to JSON just fine:
>>> json.dumps(o, sort_keys=True)
'{"bar": "hi", "foo": 42}'
Attributes can be deleted, too:
>>> o = Expando(foo=42)
>>> o.foo
42
>>> del o.foo
>>> o.foo
Traceback (most recent call last):
...
AttributeError: 'Expando' object has no attribute 'foo'
>>> o['foo']
Traceback (most recent call last):
...
KeyError: 'foo'
>>> del o.foo
Traceback (most recent call last):
...
AttributeError: foo
And copied:
>>> o = Expando(foo=42)
>>> p = o.copy()
>>> isinstance(p,Expando)
True
>>> o == p
True
>>> o is p
False
Same with MagicExpando ...
>>> o = MagicExpando()
>>> o.foo.bar = 42
>>> p = o.copy()
>>> isinstance(p,MagicExpando)
True
>>> o == p
True
>>> o is p
False
... but the copy is shallow:
>>> o.foo is p.foo
True
Use MagicExpando for chained attribute access.
The first time a missing attribute is accessed, it will be set to a new child MagicExpando.
>>> o=MagicExpando()
>>> o.foo = 42
>>> o
{'foo': 42}
>>> o.bar.hello = 'hi'
>>> o.bar
{'hello': 'hi'}
| logger |
| FtpFsAccess | FTP access with upload. |
Taken and modified from https://github.com/ohsu-comp-bio/cwl-tes/blob/03f0096f9fae8acd527687d3460a726e09190c3a/cwl_tes/ftp.py#L37-L251
Only supports reading; no write support. Parameters: fn, the FTP URL, and mode, the mode in which to open the FTP URL.
| E2Instances | |
| regionDict | |
| ec2InstancesByRegion |
| logger |
| bytes2human(n) | Convert n bytes into a human readable string. |
| human2bytes(s) | Attempts to guess the string format based on default symbols |
When unable to recognize the format, a ValueError is raised.
Contains functions for integrating Toil with external services such as Dockstore.
| logger | |
| session |
| is_dockstore_workflow(workflow) | Returns True if a workflow string smells Dockstore-y. |
| find_trs_spec(workflow) | Parse a Dockstore workflow URL or TRS ID to a string that is definitely a TRS ID. |
| parse_trs_spec(trs_spec) | Parse a TRS ID to workflow and optional version. |
| get_workflow_root_from_dockstore(workflow[, ...]) | Given a Dockstore URL or TRS identifier, get the root WDL or CWL URL for the workflow. |
| resolve_workflow(workflow[, supported_languages]) | Find the real workflow URL or filename from a command line argument. |
Detects Dockstore page URLs and strings that could be Dockstore TRS IDs.
Accepts inputs like:
Assumes the input is actually one of the supported formats. See is_dockstore_workflow().
TODO: Needs to handle multi-workflow files if Dockstore can.
Transform a workflow URL or path that might actually be a Dockstore page URL or TRS specifier to an actual URL or path to a workflow document.
| logger | |
| TOIL_URI_SCHEME | |
| STANDARD_SCHEMES | |
| REMOTE_SCHEMES | |
| ALL_SCHEMES |
| WriteWatchingStream | A stream wrapping class that calls any functions passed to onWrite() with the number of bytes written for every write. |
| ReadableFileObj | Protocol that is more specific than what file_digest takes as an argument. |
| is_standard_url(filename) | |
| is_remote_url(filename) | Decide if a filename is a known, non-file kind of URL |
| is_any_url(filename) | Decide if a string is a URI like http:// or file://. |
| is_url_with_scheme(filename, schemes) | Return True if filename is a URL with any of the given schemes and False otherwise. |
| is_toil_url(filename) | |
| is_file_url(filename) | |
| mkdtemp([suffix, prefix, dir]) | Make a temporary directory like tempfile.mkdtemp, but with relaxed permissions. |
| robust_rmtree(path) | Robustly tries to delete paths. |
| atomic_tmp_file(final_path) | Return a tmp file name to use with atomic_install. This will be in the |
| atomic_install(tmp_path, final_path) | atomic install of tmp_path as final_path |
| AtomicFileCreate(final_path[, keep]) | Context manager to create a temporary file. Entering returns path to |
| atomic_copy(src_path, dest_path[, executable]) | Copy a file using posix atomic creations semantics. |
| atomic_copyobj(src_fh, dest_path[, length, executable]) | Copy an open file using posix atomic creations semantics. |
| make_public_dir(in_directory[, suggested_name]) | Make a publicly-accessible directory in the given directory. |
| try_path(path[, min_size]) | Try to use the given path. Return it if it exists or can be made, |
| file_digest(f, alg_name) | Polyfilled hashlib.file_digest that works on Python <3.11. |
Otherwise it might be a bare path.
The permissions on the directory will be 711 instead of 700, allowing the group and all other users to traverse the directory. This is necessary if the directory is on NFS and the Docker daemon would like to mount it or a file inside it into a container, because on NFS even the Docker daemon appears bound by the file permissions.
See <https://github.com/DataBiosphere/toil/issues/4644>, and <https://stackoverflow.com/a/67928880> which talks about a similar problem but in the context of user namespaces.
Continues silently if the path to be removed is already gone, or if it goes away while this function is executing.
May raise an error if a path changes between file and directory while the function is executing, or if a permission error is encountered.
Try to make a random directory name with length 4 that doesn't exist, with the given prefix. Otherwise, try length 5, length 6, etc, up to a max of 32 (len of uuid4 with dashes replaced). This function's purpose is mostly to avoid having long file names when generating directories. If somehow this fails, which should be incredibly unlikely, default to a normal uuid4, which was our old default.
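A sketch of that naming strategy (illustrative only; the real function's details may differ):

import os
import random
import string
import uuid

def make_public_dir_sketch(in_directory):
    # Try short random names first, growing the length on collision,
    # and fall back to a full uuid4 as the last resort.
    alphabet = string.ascii_letters + string.digits
    for length in range(4, 33):
        name = ''.join(random.choices(alphabet, k=length))
        path = os.path.join(in_directory, name)
        try:
            os.mkdir(path, 0o711)  # 711: traversable by others (see above)
            return path
        except FileExistsError:
            continue
    path = os.path.join(in_directory, str(uuid.uuid4()))
    os.mkdir(path, 0o711)
    return path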
Not seekable.
Protocol that is more specific than what file_digest takes as an argument. Also guarantees a read() method. Would extend the protocol from Typeshed for hashlib but those are only declared for 3.11+.
| IT |
| concat | A literal iterable to combine sequence literals (lists, set) with generators or list comprehensions. |
| flatten(iterables) | Flatten an iterable, except for string elements. |
Instead of
>>> [ -1 ] + [ x * 2 for x in range( 3 ) ] + [ -1 ]
[-1, 0, 2, 4, -1]
you can write
>>> list( concat( -1, ( x * 2 for x in range( 3 ) ), -1 ) )
[-1, 0, 2, 4, -1]
This is slightly shorter (not counting the list constructor) and does not involve array construction or concatenation.
Note that concat() flattens (or chains) all iterable arguments into a single result iterable:
>>> list( concat( 1, range( 2, 4 ), 4 ) )
[1, 2, 3, 4]
It only does so one level deep. If you need to recursively flatten a data structure, check out crush().
If you want to prevent that flattening for an iterable argument, wrap it in concat():
>>> list( concat( 1, concat( range( 2, 4 ) ), 4 ) )
[1, range(2, 4), 4]
Some more examples:
>>> list( concat() ) # empty concat
[]
>>> list( concat( 1 ) ) # non-iterable
[1]
>>> list( concat( concat() ) ) # empty iterable
[]
>>> list( concat( concat( 1 ) ) ) # singleton iterable
[1]
>>> list( concat( 1, concat( 2 ), 3 ) ) # flattened iterable
[1, 2, 3]
>>> list( concat( 1, [2], 3 ) ) # flattened iterable
[1, 2, 3]
>>> list( concat( 1, concat( [2] ), 3 ) ) # protecting an iterable from being flattened
[1, [2], 3]
>>> list( concat( 1, concat( [2], 3 ), 4 ) ) # protection only works with a single argument
[1, 2, 3, 4]
>>> list( concat( 1, 2, concat( 3, 4 ), 5, 6 ) )
[1, 2, 3, 4, 5, 6]
>>> list( concat( 1, 2, concat( [ 3, 4 ] ), 5, 6 ) )
[1, 2, [3, 4], 5, 6]
Note that while strings are technically iterable, concat() does not flatten them.
>>> list( concat( 'ab' ) )
['ab']
>>> list( concat( concat( 'ab' ) ) )
['ab']
| memoize | Memoize a function result based on its parameters using this decorator. |
| MAT | |
| MRT |
| sync_memoize(f) | Like memoize, but guarantees that decorated function is only called once, even when multiple |
| parse_iso_utc(s) | Parses an ISO time with a hard-coded Z for zulu-time (UTC) at the end. Other timezones are |
| strict_bool(s) | Variant of bool() that only accepts two possible string values. |
For example, this can be used in place of lazy initialization. If the decorating function is invoked by multiple threads, the decorated function may be called more than once with the same arguments.
>>> parse_iso_utc('2016-04-27T00:28:04.000Z')
datetime.datetime(2016, 4, 27, 0, 28, 4)
>>> parse_iso_utc('2016-04-27T00:28:04Z')
datetime.datetime(2016, 4, 27, 0, 28, 4)
>>> parse_iso_utc('2016-04-27T00:28:04X')
Traceback (most recent call last):
...
ValueError: Not a valid ISO datetime in UTC: 2016-04-27T00:28:04X
| logger |
| CalledProcessErrorStderr | Version of CalledProcessError that includes stderr in the error message if it is set |
| get_public_ip() | Get the IP that this machine uses to contact the internet. |
| get_user_name() | Get the current user name, or a suitable substitute string if the user name |
| utc_now() | Return a datetime in the UTC timezone corresponding to right now. |
| unix_now_ms() | Return the current time in milliseconds since the Unix epoch. |
| slow_down(seconds) | Toil jobs that have completed are not allowed to have taken 0 seconds, but |
| printq(msg, quiet[, log]) | This is for functions used simultaneously in Toil proper and in the admin scripts. |
| truncExpBackoff() | |
| call_command(cmd, *args[, input, timeout, useCLocale, ...]) | Simplified calling of external commands. |
If behind a NAT, this will still be this computer's IP, and not the router's.
This function takes a possibly 0 job length in seconds and enforces a minimum length to satisfy Toil.
Our admin scripts "print" to stdout, while Toil proper uses logging. For a script that, for example, cleans up IAM, EC2, etc. cruft leftover after failed CI runs, we can call an AWS delete IAM role function, and this prints or logs progress (unless quiet is True), depending on whether the function is called in, say, the jobstore or a script.
Version of CalledProcessError that includes stderr in the error message if it is set.
If the process fails, CalledProcessErrorStderr is raised.
The captured stderr is always printed, regardless of whether an exception occurs, so it can be logged.
Always logs the command at debug log level.
| InnerClass | Note that this is EXPERIMENTAL code. |
A nested class (the inner class) decorated with this will have an additional attribute called 'outer' referencing the instance of the nesting class (the outer class) that was used to create the inner class. The outer instance does not need to be passed to the inner class's constructor, it will be set magically. Shamelessly stolen from
http://stackoverflow.com/questions/2278426/inner-classes-how-can-i-get-the-outer-class-object-at-construction-time#answer-2278595.
with names made more descriptive (I hope) and added caching of the BoundInner classes.
Caveat: Within the inner class, self.__class__ will not be the inner class but a dynamically created subclass thereof. Its name will be the same as that of the inner class, but its __module__ will be different. There will be one such dynamic subclass per inner class and instance of the outer class, if that outer class instance created any instances of the inner class.
>>> class Outer(object):
... def new_inner(self):
... # self is an instance of the outer class
... inner = self.Inner()
... # the inner instance's 'outer' attribute is set to the outer instance
... assert inner.outer is self
... return inner
... @InnerClass
... class Inner(object):
... def get_outer(self):
... return self.outer
... @classmethod
... def new_inner(cls):
... return cls()
>>> o = Outer()
>>> i = o.new_inner()
>>> i
<toil.lib.objects.Inner...> bound to <toil.lib.objects.Outer object at ...>
>>> i.get_outer()
<toil.lib.objects.Outer object at ...>
Now with inheritance for both inner and outer:
>>> class DerivedOuter(Outer):
... def new_inner(self):
... return self.DerivedInner()
... @InnerClass
... class DerivedInner(Outer.Inner):
... def get_outer(self):
... assert super( DerivedOuter.DerivedInner, self ).get_outer() == self.outer
... return self.outer
>>> derived_outer = DerivedOuter()
>>> derived_inner = derived_outer.new_inner()
>>> derived_inner
<toil.lib.objects...> bound to <toil.lib.objects.DerivedOuter object at ...>
>>> derived_inner.get_outer()
<toil.lib.objects.DerivedOuter object at ...>
Test static references:

>>> Outer.Inner # doctest: +ELLIPSIS
<class 'toil.lib.objects...Inner'>
>>> DerivedOuter.Inner # doctest: +ELLIPSIS
<class 'toil.lib.objects...Inner'>
>>> DerivedOuter.DerivedInner # doctest: +ELLIPSIS
<class 'toil.lib.objects...DerivedInner'>

Can't decorate top-level classes. Unfortunately, this is detected when the instance is created, not when the class is defined.

>>> @InnerClass
... class Foo(object):
...     pass
>>> Foo()
Traceback (most recent call last):
...
RuntimeError: Inner classes must be nested in another class.

All inner instances should refer to a single outer instance:

>>> o = Outer()
>>> o.new_inner().outer == o == o.new_inner().outer
True

All inner instances should be of the same class ...

>>> o.new_inner().__class__ == o.new_inner().__class__
True

... but that class isn't the inner class ...

>>> o.new_inner().__class__ != Outer.Inner
True

... but a subclass of the inner class.

>>> isinstance( o.new_inner(), Outer.Inner )
True
Static and class methods should work, too:
>>> o.Inner.new_inner().outer == o
True
| ResourceMonitor | Global resource monitoring widget. |
| glob(glob_pattern, directoryname) | Walks through a directory and its subdirectories looking for files matching |
Presents class methods to get the resource usage of this process and child processes, and other class methods to adjust the statistics so they can account for e.g. resources used inside containers, or other resource usage that should be billable to the current process.
The memory will be treated as if it was used by a child process at the time our real child processes were also using their peak memory.
The CPU time will be treated as if it had been used by a child process.
This file holds the retry() decorator function and RetryCondition object.
retry() can be used to decorate any function based on the list of errors one wishes to retry on.
This list of errors can contain normal Exception objects, and/or RetryCondition objects wrapping Exceptions to include additional conditions.
For example, retrying on one Exception (HTTPError):
from requests import get
from requests.exceptions import HTTPError

@retry(errors=[HTTPError])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
Or:
from requests import get
from requests.exceptions import HTTPError

@retry(errors=[HTTPError, ValueError])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
The examples above will retry for the default interval on any errors specified in the "errors=" arg list.
To retry on specifically 500/502/503/504 errors, you could specify an ErrorCondition object instead, for example:
from requests import get
from requests.exceptions import HTTPError

@retry(errors=[
    ErrorCondition(
        error=HTTPError,
        error_codes=[500, 502, 503, 504]
    )])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
To retry on specifically errors containing the phrase "NotFound":
from requests import get
from requests.exceptions import HTTPError

@retry(errors=[
    ErrorCondition(
        error=HTTPError,
        error_message_must_include="NotFound"
    )])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
To retry on all HTTPError errors EXCEPT an HTTPError containing the phrase "NotFound":
from requests import get
from requests.exceptions import HTTPError

@retry(errors=[
    HTTPError,
    ErrorCondition(
        error=HTTPError,
        error_message_must_include="NotFound",
        retry_on_this_condition=False
    )])
def update_my_wallpaper():
    return get('https://www.deviantart.com/')
To retry on boto3's specific status errors, an example of the implementation is:
import boto3
from botocore.exceptions import ClientError

@retry(errors=[
    ErrorCondition(
        error=ClientError,
        boto_error_codes=["BucketNotFound"]
    )])
def boto_bucket(bucket_name):
    boto_session = boto3.session.Session()
    s3_resource = boto_session.resource('s3')
    return s3_resource.Bucket(bucket_name)
Any combination of these will also work, provided the codes are matched to the correct exceptions. A ValueError will not return a 404, for example.
The retry function as a decorator should make retrying functions easier and clearer. It also encourages smaller independent functions, as opposed to lumping many different things that may need to be retried on different conditions in the same function.
The ErrorCondition object tries to take some of the heavy lifting of writing specific retry conditions and boil it down to an API that covers all common use-cases without the user having to write any new bespoke functions.
Use-cases covered currently:
If new functionality is needed, it's currently best practice in Toil to add functionality to the ErrorCondition itself rather than making a new custom retry method.
| SUPPORTED_HTTP_ERRORS | |
| kubernetes | |
| botocore | |
| logger | |
| RT | |
| DEFAULT_DELAYS | |
| DEFAULT_TIMEOUT | |
| E | |
| retry_flaky_test |
| ErrorCondition | A wrapper describing an error condition. |
| retry([intervals, infinite_retries, errors, ...]) | Retry a function if it fails with any Exception defined in "errors". |
| return_status_code(e) | |
| get_error_code(e) | Get the error code name from a Boto 2 or 3 error, or compatible types. |
| get_error_message(e) | Get the error message string from a Boto 2 or 3 error, or compatible types. |
| get_error_status(e) | Get the HTTP status code from a compatible source. |
| get_error_body(e) | Get the body from a Boto 2 or 3 error, or compatible types. |
| meets_error_message_condition(e, error_message) | |
| meets_error_code_condition(e, error_codes) | These are expected to be normal HTTP error codes, like 404 or 500. |
| meets_boto_error_code_condition(e, boto_error_codes) | These are expected to be AWS's custom error aliases, like 'BucketNotFound' or 'AccessDenied'. |
| error_meets_conditions(e, error_conditions) | |
| old_retry([delays, timeout, predicate]) | Deprecated. |
ErrorCondition events may be used to define errors in more detail to determine whether to retry.
Does so every x seconds, where x is defined by a list of numbers (ints or floats) in "intervals". Also accepts ErrorCondition events for more detailed retry attempts.
A list of exceptions OR ErrorCondition objects to catch and retry on. ErrorCondition objects describe more detailed error event conditions than a plain error. An ErrorCondition specifies:
- Exception (required)
- Error codes that must match to be retried (optional; defaults to not checking)
- A string that must be in the error message to be retried (optional; defaults to not checking)
- A bool that can be set to False to always error on this condition.
If not specified, this will default to a generic Exception.
Returns empty string for other errors.
Note that error message conditions also check more than this; this function does not fall back to the traceback for incompatible types.
Such as a Boto 2 or 3 error, kubernetes.client.rest.ApiException, http.client.HTTPException, urllib3.exceptions.HTTPError, requests.exceptions.HTTPError, urllib.error.HTTPError, or compatible type
Returns 0 from other errors.
Returns the code and message if the error does not have a body.
Retry an operation while the failure matches a given predicate and until a given timeout expires, waiting a given amount of time in between attempts. This function is a generator that yields contextmanagers. See doctests below for example usage.
Retry for a limited amount of time:
>>> true = lambda _:True
>>> false = lambda _:False
>>> i = 0
>>> for attempt in old_retry( delays=[0], timeout=.1, predicate=true ):
... with attempt:
... i += 1
... raise RuntimeError('foo')
Traceback (most recent call last):
...
RuntimeError: foo
>>> i > 1
True
If timeout is 0, do exactly one attempt:
>>> i = 0
>>> for attempt in old_retry( timeout=0 ):
...     with attempt:
...         i += 1
...         raise RuntimeError( 'foo' )
Traceback (most recent call last):
...
RuntimeError: foo
>>> i
1
Don't retry on success:
>>> i = 0
>>> for attempt in old_retry( delays=[0], timeout=.1, predicate=true ):
...     with attempt:
...         i += 1
>>> i
1
Don't retry unless the predicate returns True:
>>> i = 0
>>> for attempt in old_retry( delays=[0], timeout=.1, predicate=false):
...     with attempt:
...         i += 1
...         raise RuntimeError( 'foo' )
Traceback (most recent call last):
...
RuntimeError: foo
>>> i
1
| logger | |
| current_process_name_lock | |
| current_process_name_for |
| ExceptionalThread | A thread whose join() method re-raises exceptions raised during run(). While join() is |
| LastProcessStandingArena | Class that lets a bunch of processes detect and elect a last process |
| ensure_filesystem_lockable(path[, timeout, hint]) | Make sure that the filesystem used at the given path is one where locks are safe to use. |
| safe_lock(fd[, block, shared]) | Get an fcntl lock, while retrying on IO errors. |
| safe_unlock_and_close(fd) | Release an fcntl lock and close the file descriptor, while handling fcntl IO errors. |
| cpu_count() | Get the rounded-up integer number of whole CPUs available. |
| collect_process_name_garbage() | Delete all the process names that point to files that don't exist anymore |
| destroy_all_process_names() | Delete all our process name files because our process is going away. |
| get_process_name(base_dir) | Return the name of the current process. Like a PID but visible between |
| process_name_exists(base_dir, name) | Return true if the process named by the given name (from process_name) exists, and false otherwise. |
| global_mutex(base_dir, mutex) | Context manager that locks a mutex. The mutex is identified by the given |
File locks are not safe to use on Ceph. See <https://github.com/DataBiosphere/toil/issues/4972>.
Raises an exception if the filesystem is detected as one where using locks is known to trigger bugs in the filesystem implementation. Also raises an exception if the given path does not exist, or if attempting to determine the filesystem type takes more than the timeout in seconds.
If the filesystem type cannot be determined, does nothing.
Raises OSError with EACCES or EAGAIN when a nonblocking lock is not immediately available.
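A usage sketch for the non-blocking case (the lock-file path is a placeholder):

import errno
import os

fd = os.open('/tmp/example.lock', os.O_CREAT | os.O_RDWR)
try:
    safe_lock(fd, block=False)
except OSError as e:
    if e.errno in (errno.EACCES, errno.EAGAIN):
        os.close(fd)  # somebody else holds the lock
    else:
        raise
else:
    try:
        pass  # critical section goes here
    finally:
        safe_unlock_and_close(fd)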
A thread whose join() method re-raises exceptions raised during run(). While join() is idempotent, the exception is only reraised during the first invocation of join() that successfully joins the thread. If join() times out, no exception will be reraised even though an exception might already have occurred in run().
When subclassing this thread, override tryRun() instead of run().
>>> def f():
... assert 0
>>> t = ExceptionalThread(target=f)
>>> t.start()
>>> t.join()
Traceback (most recent call last):
...
AssertionError
>>> class MyThread(ExceptionalThread):
... def tryRun( self ):
... assert 0
>>> t = MyThread()
>>> t.start()
>>> t.join()
Traceback (most recent call last):
...
AssertionError
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object's constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
This blocks the calling thread until the thread whose join() method is called terminates -- either normally or through an unhandled exception -- or until the optional timeout occurs.
When the timeout argument is present and not None, it should be a floating-point number specifying a timeout for the operation in seconds (or fractions thereof). As join() always returns None, you must call is_alive() after join() to decide whether a timeout happened -- if the thread is still alive, the join() call timed out.
When the timeout argument is not present or None, the operation will block until the thread terminates.
A thread can be join()ed many times.
join() raises a RuntimeError if an attempt is made to join the current thread as that would cause a deadlock. It is also an error to join() a thread before it has been started and attempts to do so raises the same exception.
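Since join() always returns None, detecting a timeout looks like this (standard library only):

import threading
import time

t = threading.Thread(target=time.sleep, args=(5,))
t.start()
t.join(timeout=0.1)  # returns None whether or not the thread finished
if t.is_alive():     # so is_alive() is the only way to detect a timeout
    print("join() timed out; the thread is still running")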
Counts hyperthreads as CPUs.
Uses the system's actual CPU count, or the current v1 cgroup's quota per period, if the quota is set.
Ignores the cgroup's cpu shares value, because it's extremely difficult to interpret. See https://github.com/kubernetes/kubernetes/issues/81021.
Caches result for efficiency.
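Usage is a single call (the import path is assumed to be toil.lib.threading, per the table above):

from toil.lib.threading import cpu_count

print(cpu_count())  # whole CPUs, cgroup-quota aware; cached after the first call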
Caller must hold current_process_name_lock.
We let all our FDs get closed by the process death.
We assume there is nobody else using the system during exit to race with.
Can see across container boundaries using the given node workflow directory.
Only works between processes, NOT between threads.
Processes enter and leave (sometimes due to sudden existence failure). We guarantee that the last process to leave, if it leaves properly, will get a chance to do some cleanup. If new processes try to enter during the cleanup, they will be delayed until after the cleanup has happened and the previous "last" process has finished leaving.
The user is responsible for making sure they always leave if they enter! Consider using a try/finally; this class is not a context manager.
You may not enter the arena again before leaving it.
Should be used in a loop:
| LocalThrottle | A thread-safe rate limiter that throttles each thread independently. Can be used as a |
| throttle | A context manager for ensuring that the execution of its body takes at least a given amount |
The use as a decorator is deprecated in favor of throttle().
If the wait parameter is False, this method immediately returns True (if at least the configured minimum interval has passed since the last time this method returned True in the current thread) or False otherwise.
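As a sketch, assuming LocalThrottle takes the minimum interval in seconds and exposes a throttle(wait=...) method as described:

from toil.lib.throttle import LocalThrottle  # assumed import path

limiter = LocalThrottle(min_interval=1)
if limiter.throttle(wait=False):  # nonblocking: True at most once per second per thread
    print("proceeding")
else:
    print("rate limited in this thread")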
Ensures that body takes at least the given amount of time.
>>> start = time.time()
>>> with throttle(1):
...     pass
>>> 1 <= time.time() - start <= 1.1
True
Ditto when used as a decorator.
>>> @throttle(1)
... def f():
...     pass
>>> start = time.time()
>>> f()
>>> 1 <= time.time() - start <= 1.1
True
If the body takes longer by itself, don't throttle.
>>> start = time.time()
>>> with throttle(1):
...     time.sleep(2)
>>> 2 <= time.time() - start <= 2.1
True
Ditto when used as a decorator.
>>> @throttle(1)
... def f():
...     time.sleep(2)
>>> start = time.time()
>>> f()
>>> 2 <= time.time() - start <= 2.1
True
If an exception occurs, don't throttle.
>>> start = time.time()
>>> try:
...     with throttle(1):
...         raise ValueError('foo')
... except ValueError:
...     end = time.time()
...     raise
Traceback (most recent call last):
...
ValueError: foo
>>> 0 <= end - start <= 0.1
True
Ditto when used as a decorator.
>>> @throttle(1)
... def f():
...     raise ValueError('foo')
>>> start = time.time()
>>> try:
...     f()
... except ValueError:
...     end = time.time()
...     raise
Traceback (most recent call last):
...
ValueError: foo
>>> 0 <= end - start <= 0.1
True
| logger | |
| defaultTargetTime | |
| SYS_MAX_SIZE | |
| JOBSTORE_HELP |
| parse_set_env(l) | Parse a list of strings of the form "NAME=VALUE" or just "NAME" into a dictionary. |
| parse_str_list(s) | |
| parse_int_list(s) | |
| iC(min_value[, max_value]) | Returns a function that checks if a given int is in the given half-open interval. |
| fC(minValue[, maxValue]) | Returns a function that checks if a given float is in the given half-open interval. |
| parse_accelerator_list(specs) | Parse a string description of one or more accelerator requirements. |
| parseBool(val) | |
| make_open_interval_action(min[, max]) | Returns an argparse action class to check if the input is within the given half-open interval. |
| parse_jobstore(jobstore_uri) | Turn the jobstore string into its corresponding URI |
| add_base_toil_options(parser[, jobstore_as_flag, cwl]) | Add base Toil command line options to the parser. |
Strings of the latter form will result in dictionary entries whose value is None.
>>> parse_set_env([])
{}
>>> parse_set_env(['a'])
{'a': None}
>>> parse_set_env(['a='])
{'a': ''}
>>> parse_set_env(['a=b'])
{'a': 'b'}
>>> parse_set_env(['a=a', 'a=b'])
{'a': 'b'}
>>> parse_set_env(['a=b', 'c=d'])
{'a': 'b', 'c': 'd'}
>>> parse_set_env(['a=b=c'])
{'a': 'b=c'}
>>> parse_set_env([''])
Traceback (most recent call last):
...
ValueError: Empty name
>>> parse_set_env(['=1'])
Traceback (most recent call last):
...
ValueError: Empty name
If the jobstore string already is a URI, return the jobstore unchanged: aws:/path/to/jobstore -> aws:/path/to/jobstore

:param jobstore_uri: string of the jobstore
:return: URI of the jobstore
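For example (hedged: parse_jobstore is assumed importable from toil.common, and the file: normalization is as described above):

from toil.common import parse_jobstore  # assumed import path

print(parse_jobstore("aws:us-west-2:my-jobstore"))  # already a URI; returned unchanged
print(parse_jobstore("./my-jobstore"))              # expected to gain the file: scheme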
"""The location of the job store for the workflow. A job store holds persistent information about the jobs, stats, and files in a workflow. If the workflow is run with a distributed batch system, the job store must be accessible by all worker nodes. Depending on the desired job store implementation, the location should be formatted according to one of the following schemes: file:<path> where <path> points to a directory on the file system aws:<region>:<prefix> where <region> is the name of an AWS region like us-west-2 and <prefix> will be prepended to the names of any top-level AWS resources in use by job store, e.g. S3 buckets.
google:<project_id>:<prefix> TODO: explain For backwards compatibility, you may also specify ./foo (equivalent to file:./foo or just file:foo) or /bar (equivalent to file:/bar)."""
| add_cwl_options(parser[, suppress]) | Add CWL options to the parser. This only adds nonpositional CWL arguments. |
| add_runner_options(parser[, cwl, wdl]) | Add options that are shared between the WDL and CWL runners |
| add_wdl_options(parser[, suppress]) | Add WDL options to a parser. This only adds nonpositional WDL arguments |
| a_short_time | |
| logger |
| ManagedNodesNotSupportedException | Raised when attempting to add managed nodes (which autoscale up and down by |
| Shape | Represents a job or a node's "shape", in terms of the dimensions of memory, cores, disk and |
| AbstractProvisioner | Interface for provisioning worker nodes to use in a Toil cluster. |
Raised when attempting to add managed nodes (which autoscale up and down by themselves, without the provisioner doing the work) to a provisioner that does not support them.
Polling with this and try/except is the Right Way to check if managed nodes are available from a provisioner.
The wallTime attribute stores the number of seconds of a node allocation, e.g. 3600 for AWS. FIXME: and for jobs?
The memory and disk attributes store the number of bytes required by a job (or provided by a node) in RAM or on disk (SSD or HDD), respectively.
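A hedged example of constructing a node Shape, assuming keyword arguments matching the attributes described above:

from toil.provisioners.abstractProvisioner import Shape  # assumed import path

node = Shape(
    wallTime=3600,        # seconds per allocation, e.g. one AWS billing hour
    memory=32 * 1024**3,  # bytes of RAM provided by the node
    cores=8,
    disk=100 * 1024**3,   # bytes of disk
    preemptible=False,
)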
Interface for provisioning worker nodes to use in a Toil cluster.
Implementations must call _setLeaderWorkerAuthentication().
Implementations must call _setLeaderWorkerAuthentication() with the leader so that workers can be launched.
Raises ManagedNodesNotSupportedException if the provisioner implementation or cluster configuration can't have managed nodes.
See the storage.files section: https://github.com/kinvolk/ignition/blob/flatcar-master/doc/configuration-v2_2.md
Will run Mesos master or agent as appropriate in Mesos clusters. For Kubernetes clusters, will just sleep to provide a place to shell into on the leader, and shouldn't run on the worker.
Should only be implemented if Kubernetes clusters are supported.
Defaults to None if not overridden, in which case no cloud provider integration will be used.
Authenticate back to the leader using the JOIN_TOKEN, JOIN_CERT_HASH, and JOIN_ENDPOINT set in the given authentication data dict.
| logger | |
| F |
| InvalidClusterStateException | Common base class for all non-exit exceptions. |
| AWSProvisioner | Interface for provisioning worker nodes to use in a Toil cluster. |
| awsRetryPredicate(e) | |
| expectedShutdownErrors(e) | Matches errors that we expect to occur during shutdown, and which indicate |
| awsRetry(f) | This decorator retries the wrapped function if aws throws unexpected errors. |
| awsFilterImpairedNodes(nodes, boto3_ec2) | |
| collapse_tags(instance_tags) | Collapse tags from boto3 format to node format |
Should not match any errors which indicate that an operation is impossible or unnecessary (such as errors resulting from a thing not existing to be deleted).
It should wrap any function that makes use of boto
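A sketch of the decorator in use (the module path and the exact retry behavior are assumptions based on the summary above):

from toil.provisioners.aws.awsProvisioner import awsRetry  # assumed import path

@awsRetry
def delete_security_group(boto3_ec2, group_id):
    # Retried automatically if AWS raises an unexpected, transient error.
    return boto3_ec2.delete_security_group(GroupId=group_id)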
Common base class for all non-exit exceptions.
Interface for provisioning worker nodes to use in a Toil cluster.
Raises ManagedNodesNotSupportedException if the provisioner implementation or cluster configuration can't have managed nodes.
These are mostly needed to support Kubernetes' AWS CloudProvider, and some are for the Kubernetes Cluster Autoscaler's AWS integration.
Some of these are really only needed on the leader.
| logger | |
| ZoneTuple |
| get_aws_zone_from_spot_market(spotBid, nodeType, ...) | If a spot bid, node type, and Boto2 EC2 connection are specified, picks a |
| get_best_aws_zone([spotBid, nodeType, boto3_ec2, ...]) | Get the right AWS zone to use. |
| choose_spot_zone(zones, bid, spot_history) | Returns the zone to put the spot request based on, in order of priority: |
| optimize_spot_bid(boto3_ec2, instance_type, spot_bid, ...) | Check whether the bid is in line with history and makes an effort to place |
In this case, zone_options can be used to restrict to a subset of the zones in the region.
Reports the TOIL_AWS_ZONE environment variable if set.
Otherwise, if we are running on EC2 or ECS, reports the zone we are running in.
Otherwise, if a spot bid, node type, and Boto2 EC2 connection are specified, picks a zone where instances are easy to buy from the zones in the region of the Boto2 connection. These parameters must always be specified together, or not at all.
In this case, zone_options can be used to restrict to a subset of the zones in the region.
Otherwise, if we have the TOIL_AWS_REGION variable set, chooses a zone in that region.
Finally, if a default region is configured in Boto 2, chooses a zone in that region.
Returns None if no method can produce a zone to use.
>>> from collections import namedtuple
>>> FauxHistory = namedtuple('FauxHistory', ['price', 'availability_zone'])
>>> zones = ['us-west-2a', 'us-west-2b']
>>> spot_history = [FauxHistory(0.1, 'us-west-2a'), FauxHistory(0.2, 'us-west-2a'), FauxHistory(0.3, 'us-west-2b'), FauxHistory(0.6, 'us-west-2b')]
>>> choose_spot_zone(zones, 0.15, spot_history)
'us-west-2a'
>>> spot_history=[FauxHistory(0.3, 'us-west-2a'), FauxHistory(0.2, 'us-west-2a'), FauxHistory(0.1, 'us-west-2b'), FauxHistory(0.6, 'us-west-2b')]
>>> choose_spot_zone(zones, 0.15, spot_history)
'us-west-2b'
>>> spot_history=[FauxHistory(0.1, 'us-west-2a'), FauxHistory(0.7, 'us-west-2a'), FauxHistory(0.1, 'us-west-2b'), FauxHistory(0.6, 'us-west-2b')]
>>> choose_spot_zone(zones, 0.15, spot_history)
'us-west-2b'
| logger | |
| EVICTION_THRESHOLD | |
| RESERVE_SMALL_LIMIT | |
| RESERVE_SMALL_AMOUNT | |
| RESERVE_BREAKPOINTS | |
| RESERVE_FRACTIONS | |
| OS_SIZE | |
| FailedConstraint |
| JobTooBigError | Raised in the scaler thread when a job cannot fit in any available node |
| BinPackedFit | If jobShapes is a set of tasks with run requirements (mem/disk/cpu), and nodeShapes is a sorted |
| NodeReservation | The amount of resources that we expect to be available on a given node at each point in time. |
| ClusterScaler | |
| ScalerThread | A thread that automatically scales the number of either preemptible or non-preemptible worker |
| ClusterStats |
| adjustEndingReservationForJob(reservation, jobShape, ...) | Add a job to an ending reservation that ends at wallTime. |
| split(nodeShape, jobShape, wallTime) | Partition a node allocation into two to fit the job. |
| binPacking(nodeShapes, jobShapes, goalTime) | Using the given node shape bins, pack the given job shapes into nodes to |
Uses a first-fit-decreasing (FFD)-like bin packing algorithm to calculate an approximate minimum number of nodes that will fit the given list of jobs. BinPackedFit assumes the list nodeShapes has already been ordered by "node preference" beforehand. So when virtually "creating" nodes, the first node within nodeShapes that fits the job is the one that's added.
Can be run multiple times.
Returns any distinct Shapes that did not fit, mapping to reasons they did not fit.
Returns the job shape again, and a list of failed constraints, if it did not fit.
To represent the resources available in a reservation, we represent a reservation as a linked list of NodeReservations, each giving the resources free within a single timeslice.
If the job does not fit, returns the failing constraints: the resources that can't be accommodated, and the limits that were hit.
If the job does fit, returns an empty list.
Must always agree with fits()! This codepath is slower and used for diagnosis.
jobShape is the Shape of the job requirements, nodeShape is the Shape of the node this is a reservation for, and targetTime is the maximum time to wait before starting this job.
(splitting the reservation if the job doesn't fill the entire timeslice)
Returns the modified shape of the node and a new node reservation for the extra time that the job didn't fill.
Returns a dict saying how many of each node will be needed, and a dict from job shapes that could not fit to reasons why.
These nodes are treated differently than auto-scaled nodes in that they should not be automatically terminated.
Returns an integer.
Returns a dict mapping from nodeShape to the number of nodes we want in the cluster right now, and a dict from job shapes that are too big to run on any node to reasons why.
Also attempts to remove ignored nodes that were marked for graceful removal.
Returns the new size of the cluster.
This method is the definitive source on nodes in the cluster, and is responsible for consolidating cluster state between the provisioner and the batch system.
Raised in the scaler thread when a job cannot fit in any available node type and is likely to lock up the workflow.
A thread that automatically scales the number of either preemptible or non-preemptible worker nodes according to the resource requirements of the queued jobs.
The scaling calculation is essentially as follows: start with 0 estimated worker nodes. For each queued job, check if we expect it can be scheduled into a worker node before a certain time (currently one hour). Otherwise, attempt to add a single new node of the smallest type that can fit that job.
At each scaling decision point a comparison between the current, C, and newly estimated number of nodes is made. If the absolute difference is less than beta * C then no change is made, else the size of the cluster is adapted. The beta factor is an inertia parameter that prevents continual fluctuations in the number of nodes.
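Schematically (illustrative only, not Toil's actual code):

def apply_inertia(current_nodes: int, estimated_nodes: int, beta: float) -> int:
    # Ignore estimates that differ from the current size C by less than beta * C.
    if abs(estimated_nodes - current_nodes) < beta * current_nodes:
        return current_nodes
    return estimated_nodes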
This ensures any exceptions raised in the threads are propagated in a timely fashion.
| logger |
| GCEProvisioner | Implements a Google Compute Engine Provisioner using libcloud. |
Implements a Google Compute Engine Provisioner using libcloud.
| a_short_time | |
| logger |
| Node |
Assumes a billing cycle of one hour.
kwargs: input, tty, appliance, collectStdout, sshOptions, strict
| logger |
| NoSuchClusterException | Indicates that the specified cluster does not exist. |
| NoSuchZoneException | Indicates that a valid zone could not be found. |
| ClusterTypeNotSupportedException | Indicates that a provisioner does not support a given cluster type. |
| ClusterCombinationNotSupportedException | Indicates that a provisioner does not support making a given type of cluster with a given architecture. |
| cluster_factory(provisioner[, clusterName, ...]) | Find and instantiate the appropriate provisioner instance to make clusters in the given cloud. |
| add_provisioner_options(parser) | |
| parse_node_types(node_type_specs) | Parse a specification for zero or more node types. |
| check_valid_node_types(provisioner, node_types) | Raises if an invalid nodeType is specified for aws or gce. |
Raises ClusterTypeNotSupportedException if the given provisioner does not implement clusters of the given type.
Takes a comma-separated list of node types. Each node type is a slash-separated list of at least one instance type name (like 'm5a.large' for AWS), and an optional bid in dollars after a colon.
Raises ValueError if a node type cannot be parsed.
Inputs should look something like this:
>>> parse_node_types('c5.4xlarge/c5a.4xlarge:0.42,t2.large')
[({'c5.4xlarge', 'c5a.4xlarge'}, 0.42), ({'t2.large'}, None)]
Indicates that the specified cluster does not exist.
Indicates that a valid zone could not be found.
Indicates that a provisioner does not support a given cluster type.
Indicates that a provisioner does not support making a given type of cluster with a given architecture.
Implements a real-time UDP-based logging system that user scripts can use for debugging.
| logger |
| LoggingDatagramHandler | Receive logging messages from the jobs and display them on the leader. |
| JSONDatagramHandler | Send logging records over UDP serialized as JSON. |
| RealtimeLoggerMetaclass | Metaclass for RealtimeLogger that lets add logging methods. |
| RealtimeLogger | Provide a logger that logs over UDP to the leader. |
Receive logging messages from the jobs and display them on the leader.
Uses bare JSON message encoding.
Messages are JSON-encoded logging module records.
Send logging records over UDP serialized as JSON.
They have to fit in a single UDP datagram, so don't try to log more than 64kb at once.
Metaclass for RealtimeLogger that lets us add logging methods.
Like RealtimeLogger.warning(), RealtimeLogger.info(), etc.
To use in a Toil job, do:
>>> from toil.realtimeLogger import RealtimeLogger
>>> RealtimeLogger.info("This logging message goes straight to the leader")
That's all a user of Toil would need to do. On the leader, Job.Runner.startToil() automatically starts the UDP server by using an instance of this class as a context manager.
Note that if the returned logger is used on the leader, you will see the message twice, since it still goes to the normal log handlers, too.
| logger |
| ResourceException | Common base class for all non-exit exceptions. |
| Resource | Represents a file or directory that will be deployed to each node before any jobs in the user script are invoked. |
| FileResource | A resource read from a file on the leader. |
| DirectoryResource | A resource read from a directory on the leader. |
| VirtualEnvResource | A resource read from a virtualenv on the leader. |
| ModuleDescriptor | A path to a Python module decomposed into a namedtuple of three elements |
Represents a file or directory that will be deployed to each node before any jobs in the user script are invoked.
Each instance is a namedtuple with the following elements:
The pathHash element contains the MD5 (in hexdigest form) of the path to the resource on the leader node. The path, and therefore its hash, is unique within a job store.
The url element is a "file:" or "http:" URL at which the resource can be obtained.
The contentHash element is an MD5 checksum of the resource, allowing for validation and caching of resources.
If the resource is a regular file, the type attribute will be 'file'.
If the resource is a directory, the type attribute will be 'dir' and the URL will point at a ZIP archive of that directory.
This method should be invoked on the worker. The given path does not need to refer to an existing file or directory on the worker, it only identifies the resource within an instance of toil. This method returns None if no resource for the given path exists.
This method should only be invoked on a worker node after the node was setup for accessing resources via prepareSystem().
Get the path to resource on the worker.
The file or directory at the returned path may or may not yet exist. Invoking download() will ensure that it does.
A resource read from a file on the leader.
The file or directory at the returned path may or may not yet exist. Invoking download() will ensure that it does.
A resource read from a directory on the leader.
The URL will point to a ZIP archive of the directory. All files in that directory (and any subdirectories) will be included. The directory may be a package but it does not need to be.
The file or directory at the returned path may or may not yet exist. Invoking download() will ensure that it does.
A resource read from a virtualenv on the leader.
All modules and packages found in the virtualenv's site-packages directory will be included.
A path to a Python module decomposed into a namedtuple of three elements
>>> import toil.resource
>>> ModuleDescriptor.forModule('toil.resource')
ModuleDescriptor(dirPath='/.../src', name='toil.resource', fromVirtualEnv=False)
>>> import subprocess, tempfile, os, sys
>>> dirPath = tempfile.mkdtemp()
>>> path = os.path.join( dirPath, 'foo.py' )
>>> with open(path,'w') as f:
... _ = f.write('from toil.resource import ModuleDescriptor\n'
... 'print(ModuleDescriptor.forModule(__name__))')
>>> subprocess.check_output([ sys.executable, path ])
b"ModuleDescriptor(dirPath='...', name='foo', fromVirtualEnv=False)\n"
>>> from shutil import rmtree
>>> rmtree( dirPath )
Now test a collision. 'collections' is part of the standard library in Python 2 and 3.

>>> dirPath = tempfile.mkdtemp()
>>> path = os.path.join( dirPath, 'collections.py' )
>>> with open(path,'w') as f:
...     _ = f.write('from toil.resource import ModuleDescriptor\n'
...                 'ModuleDescriptor.forModule(__name__)')

This should fail and return exit status 1 due to the collision with the built-in module:

>>> subprocess.call([ sys.executable, path ])
1

Clean up:

>>> rmtree( dirPath )
If the given module name is "__main__", it will be translated to the actual file name of the top-level script without the .py or .pyc extension. This method assumes that the module with the specified name has already been loaded.
If it was, return a new module descriptor that points to a local copy of that resource. Should only be called on a worker node. On the leader, this method returns this resource, i.e. self.
Common base class for all non-exit exceptions.
| logger |
| parser_with_server_options() | |
| create_app(args) | Create a "connexion.FlaskApp" instance with Toil server configurations. |
| start_server(args) | Start a Toil server. |
| celery |
| create_celery_app() |
| logger |
| WESClientWithWorkflowEngineParameters | A modified version of the WESClient from the wes-service package that |
| generate_attachment_path_names(paths) | Take in a list of path names and return a list of names with the common path |
| get_deps_from_cwltool(cwl_file[, input_file]) | Return a list of dependencies of the given workflow from cwltool. |
| submit_run(client, cwl_file[, input_file, engine_options]) | Given a CWL file, its input files, and an optional list of engine options, |
| poll_run(client, run_id) | Return True if the given workflow run is in a finished state. |
| print_logs_and_exit(client, run_id) | Fetch the workflow logs from the WES server, print the results, then exit |
| main() |
For example, for the following CWL workflow where "hello.yaml" references a file "message.txt",
Where "message.txt" is resolved to "../input_files/message.txt".
We'd send the workflow file as "workflows/hello.cwl", and send the inputs as "input_files/hello.yaml" and "input_files/message.txt".
A modified version of the WESClient from the wes-service package that includes workflow_engine_parameters support.
TODO: Propose a PR in wes-service to include workflow_engine_params.
This function also attempts to find the attachments from the CWL workflow and its input file, and attach them to the WES run request.
| HAVE_S3 | |
| logger | |
| state_store_cache | |
| TERMINAL_STATES | |
| MAX_CANCELING_SECONDS |
| MemoryStateCache | An in-memory place to store workflow state. |
| AbstractStateStore | A place for the WES server to keep its state: the set of workflows that |
| MemoryStateStore | An in-memory place to store workflow state, for testing. |
| FileStateStore | A place to store workflow state that uses a POSIX-compatible file system. |
| S3StateStore | A place to store workflow state that uses an S3-compatible object store. |
| WorkflowStateStore | Slice of a state store for the state of a particular workflow. |
| WorkflowStateMachine | Class for managing the WES workflow state machine. |
| get_iso_time() | Return the current time in ISO 8601 format. |
| link_file(src, dest) | Create a link to a file from src to dest. |
| download_file_from_internet(src, dest[, content_type]) | Download a file from the Internet and write it to dest. |
| download_file_from_s3(src, dest[, content_type]) | Download a file from Amazon S3 and write it to dest. |
| get_file_class(path) | Return the type of the file as a human readable string. |
| safe_read_file(file) | Safely read a file by acquiring a shared lock to prevent other processes |
| safe_write_file(file, s) | Safely write to a file by acquiring an exclusive lock to prevent other |
| connect_to_state_store(url) | Connect to a place to store state for workflows, defined by a URL. |
| connect_to_workflow_state_store(url, workflow_id) | Connect to a place to store state for the given workflow, in the state |
This is a key-value store, with keys namespaced by workflow ID. Concurrent access from multiple threads or processes is safe and globally consistent.
Keys and workflow IDs are restricted to [-a-zA-Z0-9_], because backends may use them as path or URL components.
Key values are either a string, or None if the key is not set.
Workflow existence isn't a thing; nonexistent workflows just have None for all keys.
Note that we don't yet have a cleanup operation: things are stored permanently. Even clearing all the keys may leave data behind.
Also handles storage for a local cache, with a separate key namespace (not a read/write-through cache).
TODO: Can we replace this with just using a JobStore eventually, when AWSJobStore no longer needs SimpleDB?
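A hedged sketch of these semantics, assuming the connect_to_workflow_state_store helper from the table above and simple get/set methods:

from toil.server.utils import connect_to_workflow_state_store  # assumed import path

store = connect_to_workflow_state_store("file:///tmp/wes-state", "run-123")
print(store.get("state"))     # None: unset keys (and unknown workflows) read as None
store.set("state", "QUEUED")  # keys are restricted to [-a-zA-Z0-9_]
print(store.get("state"))     # "QUEUED"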
An in-memory place to store workflow state, for testing.
Inherits from MemoryStateCache first to provide implementations for AbstractStateStore.
A place to store workflow state that uses a POSIX-compatible file system.
A place to store workflow state that uses an S3-compatible object store.
The URL may be a local file path, a file URL, or an S3 URL.
This is the authority on the WES "state" of a workflow. You need one to read or change the state.
Guaranteeing that only certain transitions can be observed is possible but not worth it. Instead, we just let updates clobber each other and grab and cache the first terminal state we see forever. If it becomes important that clients never see e.g. CANCELED -> COMPLETE or COMPLETE -> SYSTEM_ERROR, we can implement a real distributed state machine here.
We do handle making sure that tasks don't get stuck in CANCELING.
State can be:
"UNKNOWN" "QUEUED" "INITIALIZING" "RUNNING" "PAUSED" "COMPLETE" "EXECUTOR_ERROR" "SYSTEM_ERROR" "CANCELED" "CANCELING"
Uses the state store's local cache to prevent needing to read things we've seen already.
| logger | |
| TaskLog |
| VersionNotImplementedException | Raised when the requested workflow version is not implemented. |
| MalformedRequestException | Raised when the request is malformed. |
| WorkflowNotFoundException | Raised when the requested run ID is not found. |
| WorkflowConflictException | Raised when the requested workflow is not in the expected state. |
| OperationForbidden | Raised when the request is forbidden. |
| WorkflowExecutionException | Raised when an internal error occurred during the execution of the workflow. |
| WESBackend | A class to represent a GA4GH Workflow Execution Service (WES) API backend. |
| handle_errors(func) | This decorator catches errors from the wrapped function and returns a JSON |
Raised when the requested workflow version is not implemented.
Raised when the request is malformed.
Raised when the requested run ID is not found.
Raised when the requested workflow is not in the expected state.
Raised when the request is forbidden.
Raised when an internal error occurred during the execution of the workflow.
GET /service-info
GET /runs
POST /runs
GET /runs/{run_id}
POST /runs/{run_id}/cancel
GET /runs/{run_id}/status
| logger | |
| NOTICE |
| WorkflowPlan | These functions pass around dicts of a certain type, with data and files keys. |
| DataDict | Under data, there can be: |
| FilesDict | Under files, there can be: |
| parse_workflow_zip_file(file, workflow_type) | Processes a workflow zip bundle |
| parse_workflow_manifest_file(manifest_file) | Reads a MANIFEST.json file for a workflow zip bundle |
| workflow_manifest_url_to_path(url[, parent_dir]) | Interpret a possibly-relative parsed URL, relative to the given parent directory. |
| task_filter(task, job_status) | AGC requires task names to be annotated with an AWS Batch job ID that they |
""" Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. """
These functions pass around dicts of a certain type, with data and files keys.
Under data, there can be:

* workflowUrl (required if no workflowSource): URL to main workflow code.

Under files, there can be:

* workflowSource (required if no workflowUrl): Open binary-mode file for the main workflow code.
* workflowInputFiles: List of open binary-mode files for input files. Expected to be JSONs.
* workflowOptions: Open binary-mode file for a JSON of options sent along with the workflow.
* workflowDependencies: Open binary-mode file for the zip the workflow came in, if any.
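As a hedged illustration of that shape (key names come from the docstring above; the file contents here are placeholders):

plan = {
    "data": {
        "workflowUrl": "https://example.com/hello.cwl",
    },
    "files": {
        # Open binary-mode file objects, as described above.
        "workflowInputFiles": [open("inputs.json", "rb")],
    },
}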
If the zip only contains a single file, that file is set as workflowSource
If the zip contains multiple files with a MANIFEST.json file, the MANIFEST is used to determine appropriate data and file arguments. (See: parse_workflow_manifest_file())
If the zip contains multiple files, the original zip is set as workflowDependencies
MANIFEST.json is expected to be formatted like:
{
"mainWorkflowURL": "relpath/to/workflow",
"inputFileURLs": [
"relpath/to/input-file-1",
"relpath/to/input-file-2",
"relpath/to/input-file-3"
],
"optionsFileURL": "relpath/to/option-file"
}
The mainWorkflowURL property provides a relative file path in the zip to a workflow file, which will be set as workflowSource.
The inputFileURLs property is optional and provides a list of relative file paths in the zip to input.json files. The list is assumed to be in the order the inputs should be applied - e.g. higher list index is higher priority. If present, it will be used to set workflowInputs(_d) arguments.
The optionsFileURL property is optional and provides a relative file path in the zip to an options.json file. If present, it will be used to set workflowOptions.
This encodes the AWSBatchJobID annotation, from the AmazonBatchBatchSystem, into the task name of the given task, and returns the modified task. If no such annotation is available, the task is censored and None is returned.
| logger | |
| WAIT_FOR_DEATH_TIMEOUT | |
| run_wes |
| ToilWorkflowRunner | A class to represent a workflow runner to run the requested workflow. |
| TaskRunner | Abstraction over the Celery API. Runs our run_wes task and allows canceling it. |
| MultiprocessingTaskRunner | Version of TaskRunner that just runs tasks with Multiprocessing. |
| run_wes_task(base_scratch_dir, state_store_url, ...) | Run a requested workflow. |
| cancel_run(task_id) | Send a SIGTERM signal to the process that is running task_id. |
Responsible for parsing the user request into a shell command, executing that command, and collecting the outputs of the resulting workflow run.
We can swap this out in the server to allow testing without Celery.
Version of TaskRunner that just runs tasks with Multiprocessing.
Can't use threading because there's no way to send a cancel signal or exception to a Python thread, if loops in the task (i.e. ToilWorkflowRunner) don't poll for it.
If the process finishes successfully, it will clean up the log, but if the process crashes, the caller must clean up the log.
| logger |
| ToilWorkflow | |
| ToilBackend | WES backend implemented for Toil to run CWL, WDL, or Toil workflows. This |
Task names will be the job_type values from issued/completed/failed messages, with annotations from JobAnnotationMessage messages if available.
WES backend implemented for Toil to run CWL, WDL, or Toil workflows. This class is responsible for validating and executing submitted workflows.
| GunicornApplication | An entry point to integrate a Gunicorn WSGI server in Python. To start a |
| run_app(app[, options]) | Run a Gunicorn WSGI server. |
An entry point to integrate a Gunicorn WSGI server in Python. To start a WSGI application with callable app, run the following code:
GunicornApplication(app, options={
...
}).run()
For more details, see: https://docs.gunicorn.org/en/latest/custom.html
| logger |
| ServiceManager | Manages the scheduling of services. |
(services and their parent non-service jobs)
When the job's services are running, the ID for the job will be returned by toil.leader.ServiceManager.get_ready_client.
Will block until all services are started and blocked.
| logger | |
| root_logger | |
| toil_logger | |
| DEFAULT_LOGLEVEL | |
| TRACE |
| StatsAndLogging | A thread to aggregate statistics and logging. |
| set_log_level(level[, set_logger]) | Sets the root logger level to a given string level (like "INFO"). |
| install_log_color([set_logger]) | Make logs colored. |
| add_logging_options(parser[, default_level]) | Add logging options to set the global log level. |
| configure_root_logger() | Set up the root logger with handlers and formatting. |
| log_to_file(log_file, log_rotation) | |
| set_logging_from_options(options) | |
| suppress_exotic_logging(local_logger) | Attempts to suppress the loggers of all non-Toil packages by setting them to CRITICAL. |
We don't want to prefix every line of the job's log with our own logging info, or we get prefixes wider than any reasonable terminal and longer than the messages.
Should be called before any entry point tries to log anything, to ensure consistent formatting.
For example: 'requests_oauthlib', 'google', 'boto', 'websocket', 'oauthlib', etc.
This will only suppress loggers that have already been instantiated and can be seen in the environment, except for the list declared in "always_suppress".
This is important because some packages, particularly boto3, are not always instantiated yet in the environment when this is run, and so we create the logger and set the level preemptively.
Base testing class for Toil.
| logger | |
| numCores | |
| preemptible | |
| defaultRequirements |
| BatchSystemPluginTest | Class for testing batch system plugin functionality. |
| hidden | Hide abstract base class from unittest's test case loader |
| KubernetesBatchSystemTest | Tests against the Kubernetes batch system |
| KubernetesBatchSystemBenchTest | Kubernetes batch system unit tests that don't need to actually talk to a cluster. |
| AWSBatchBatchSystemTest | Tests against the AWS Batch batch system |
| MesosBatchSystemTest | Tests against the Mesos batch system |
| SingleMachineBatchSystemTest | Tests against the single-machine batch system |
| MaxCoresSingleMachineBatchSystemTest | This test ensures that single machine batch system doesn't exceed the configured number |
| Service | Abstract class used to define the interface to a service. |
| GridEngineBatchSystemTest | Tests against the GridEngine batch system |
| SlurmBatchSystemTest | Tests against the Slurm batch system |
| LSFBatchSystemTest | Tests against the LSF batch system |
| TorqueBatchSystemTest | Tests against the Torque batch system |
| HTCondorBatchSystemTest | Tests against the HTCondor batch system |
| SingleMachineBatchSystemJobTest | Tests Toil workflow against the SingleMachine batch system |
| MesosBatchSystemJobTest | Tests Toil workflow against the Mesos batch system |
| write_temp_file(s, temp_dir) | Dump a string into a temp file and return its path. |
| parentJob(job, cmd) | |
| childJob(job, cmd) | |
| grandChildJob(job, cmd) | |
| greatGrandChild(cmd) | |
| measureConcurrency(filepath[, sleep_time]) | Run in parallel to determine the number of concurrent tasks. |
| count(delta, file_path) | Increments counter file and returns the max number of times the file |
| getCounters(path) | |
| resetCounters(path) | |
| get_omp_threads() |
Class for testing batch system plugin functionality.
http://stackoverflow.com/questions/1323455/python-unit-test-with-base-and-sub-class#answer-25695512
A base test case with generic tests that every batch system should pass.
Cannot assume that the batch system actually executes commands on the local machine/filesystem.
An abstract base class for batch system tests that use a full Toil workflow rather than using the batch system directly.
An abstract class to reduce redundancy between Grid Engine, Slurm, and other similar batch systems
Tests against the Kubernetes batch system
Kubernetes batch system unit tests that don't need to actually talk to a cluster.
Tests against the AWS Batch batch system
Tests against the Mesos batch system
Tests against the single-machine batch system
If hide is true, will try to hide the child processes to make them hard to stop.
This test ensures that the single machine batch system doesn't exceed the configured number of cores.
Abstract class used to define the interface to a service.
Should be subclassed by the user to define services.
Is not executed as a job; runs within a ServiceHostJob.
Tests against the GridEngine batch system
Tests against the Slurm batch system
Tests against the LSF batch system
Tests against the Torque batch system
Tests against the HTCondor batch system
Tests Toil workflow against the SingleMachine batch system
Tests Toil workflow against the Mesos batch system
| logger |
| FakeBatchSystem | Adds cleanup support when the last running job leaves a node, for batch |
| BatchSystemPluginTest | A common base class for Toil tests. |
Adds cleanup support when the last running job leaves a node, for batch systems that can't provide it using the backing scheduler.
If it does, the setUserScript() can be invoked to set the resource object representing the user script.
Note to implementors: If your implementation returns True here, it should also override
Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| FakeBatchSystem | Class that implements a minimal Batch System, needed to create a Worker (see below). |
| GridEngineTest | Class for unit-testing GridEngineBatchSystem |
| call_qstat_or_qacct(args, **_) |
Class for unit-testing GridEngineBatchSystem
lsfHelper.py shouldn't need a batch system and so the unit tests here should aim to run on any system.
| LSFHelperTest | A common base class for Toil tests. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| FakeBatchSystem | Class that implements a minimal Batch System, needed to create a Worker (see below). |
| SlurmTest | Class for unit-testing SlurmBatchSystem |
| call_sacct(args, **_) | The arguments passed to call_command when executing sacct are: |
| call_scontrol(args, **_) | The arguments passed to call_command when executing scontrol are: |
| call_sacct_raises(*_) | Fake that the sacct command fails by raising a CalledProcessErrorStderr |
1234|COMPLETED|0:0
1234.batch|COMPLETED|0:0
1235|PENDING|0:0
1236|FAILED|0:2
1236.extern|COMPLETED|0:0
Class for unit-testing SlurmBatchSystem
| CactusIntegrationTest | Run the Cactus Integration test on a Kubernetes AWS cluster |
Run the Cactus Integration test on a Kubernetes AWS cluster
| collect_ignore |
| pkg_root | |
| log | |
| CONFORMANCE_TEST_TIMEOUT | |
| TesterFuncType |
| CWLWorkflowTest | CWL tests included in Toil that don't involve the whole CWL conformance |
| CWLv10Test | Run the CWL 1.0 conformance tests in various environments. |
| CWLv11Test | Run the CWL 1.1 conformance tests in various environments. |
| CWLv12Test | Run the CWL 1.2 conformance tests in various environments. |
| ImportWorkersMessageHandler | Detect the import workers log message and set a flag. |
| run_conformance_tests(workDir, yml[, runner, caching, ...]) | Run the CWL conformance tests. |
| test_workflow_echo_string_scatter_stderr_log_dir(tmp_path) | |
| test_log_dir_echo_no_output(tmp_path) | |
| test_log_dir_echo_stderr(tmp_path) | |
| test_filename_conflict_resolution(tmp_path) | |
| test_filename_conflict_resolution_3_or_more(tmp_path) | |
| test_filename_conflict_detection(tmp_path) | Make sure we don't just stage files over each other when using a container. |
| test_filename_conflict_detection_at_root(tmp_path) | Make sure we don't just stage files over each other. |
| test_pick_value_with_one_null_value(caplog) | Make sure toil-cwl-runner does not falsely log a warning when pickValue is |
| test_workflow_echo_string() | |
| test_workflow_echo_string_scatter_capture_stdout() | |
| test_visit_top_cwl_class() | |
| test_visit_cwl_class_and_reduce() | |
| test_download_structure(tmp_path) | Make sure that download_structure makes the right calls to what it thinks is the file store. |
| test_import_on_workers() |
CWL tests included in Toil that don't involve the whole CWL conformance test suite. Tests Toil-specific functions like URL types supported for inputs.
Run the CWL 1.0 conformance tests in various environments.
Run the CWL 1.1 conformance tests in various environments.
Run the CWL 1.2 conformance tests in various environments.
To run manually:
TOIL_WES_ENDPOINT=http://localhost:8080 TOIL_WES_USER=test TOIL_WES_PASSWORD=password python -m pytest src/toil/test/cwl/cwlTest.py::CWLv12Test::test_wes_server_cwl_conformance -vv --log-level INFO --log-cli-level INFO
Specifically, when using a container and the files are at the root of the work dir.
Detect the import workers log message and set a flag.
If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an 'encoding' attribute, it is used to determine how to do the output to the stream.
| pkg_root |
| ToilDocumentationTest | Tests for scripts in the toil tutorials. |
Tests for scripts in the toil tutorials.
| logger |
| AbstractJobStoreTest | Hide abstract base class from unittest's test case loader |
| AbstractEncryptedJobStoreTest | |
| FileJobStoreTest | A common base class for Toil tests. |
| GoogleJobStoreTest | A common base class for Toil tests. |
| AWSJobStoreTest | A common base class for Toil tests. |
| InvalidAWSJobStoreTest | A common base class for Toil tests. |
| EncryptedAWSJobStoreTest | A common base class for Toil tests. |
| StubHttpRequestHandler | Simple HTTP request handler with GET and HEAD commands. |
| google_retry(x) | |
| tearDownModule() |
http://stackoverflow.com/questions/1323455/python-unit-test-with-base-and-sub-class#answer-25695512
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Does the job exist in the jobstore it is supposed to be in? Are its attributes what is expected?
In setUp() self.jobstore1 is created and initialized. In this test, after creating newJobStore, .resume() will look for a previously instantiated job store and load its config options. This is expected to be equal but not the same object.
The following demonstrates the job update pattern, where files to be deleted atomically with a job update are referenced in "filesToDelete" array, which is persisted to disk first. If things go wrong during the update, this list of files to delete is used to ensure that the updated job and the files are never both visible at the same time.
A test of job stores that use encryption
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Simple HTTP request handler with GET and HEAD commands.
This serves files from the current directory and any of its subdirectories. The MIME type for files is determined by calling the .guess_type() method.
The GET and HEAD requests are identical except that the HEAD request omits the actual contents of the file.
| logger |
| IAMTest | Check that given permissions and associated functions perform correctly |
Check that given permissions and associated functions perform correctly
| logger |
| S3Test | Confirm the workarounds for us-east-1. |
Confirm the workarounds for us-east-1.
| logger |
| TagGenerationTest | Test for tag generation from environment variables |
Test for tag generation from environment variables
| logger |
| DockerTest | Tests dockerCall and ensures no containers are left around. |
Tests dockerCall and ensures no containers are left around. When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| logger |
| ConversionTest | A common base class for Toil tests. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| logger |
| FlatcarFeedTest | Test accessing the Flatcar AMI release feed, independent of the AWS API |
| AMITest | A common base class for Toil tests. |
Test accessing the Flatcar AMI release feed, independent of the AWS API
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| logger |
| DockstoreLookupTest | Make sure we can look up workflows on Dockstore. |
Make sure we can look up workflows on Dockstore.
Binary mode to allow testing for binary file support.
This lets us test that we have the right workflow contents and not care how we are being shown them.
| logger |
| UserNameAvailableTest | Make sure we can get user names when they are available. |
| UserNameUnvailableTest | Make sure we can get something for a user name when user names are not |
| UserNameVeryBrokenTest | Make sure we can get something for a user name when user name fetching is |
Make sure we can get user names when they are available.
Make sure we can get something for a user name when user names are not available.
Make sure we can get something for a user name when user name fetching is broken in ways we did not expect.
| DataStructuresTest | A common base class for Toil tests. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
A simple user script for Toil
| childMessage | |
| parentMessage |
| hello_world(job) | |
| hello_world_child(job, hw) | |
| main() |
| LongTestJob | Class represents a unit of work in toil. |
| LongTestFollowOn | Class represents a unit of work in toil. |
| HelloWorldJob | Class represents a unit of work in toil. |
| HelloWorldFollowOn | Class represents a unit of work in toil. |
| touchFile(fileStore) | |
| main(numJobs) |
Class represents a unit of work in toil.
| OptionsTest | Class to test functionality of all Toil options |
Class to test functionality of all Toil options
| log |
| AWSProvisionerBenchTest | Tests for the AWS provisioner that don't actually provision anything. |
| AbstractAWSAutoscaleTest | A common base class for Toil tests. |
| AWSAutoscaleTest | A common base class for Toil tests. |
| AWSStaticAutoscaleTest | Runs the tests on a statically provisioned cluster with autoscaling enabled. |
| AWSManagedAutoscaleTest | Runs the tests on a self-scaling Kubernetes cluster. |
| AWSAutoscaleTestMultipleNodeTypes | A common base class for Toil tests. |
| AWSRestartTest | This test ensures autoscaling works on a restarted Toil run. |
| PreemptibleDeficitCompensationTest | A common base class for Toil tests. |
Tests for the AWS provisioner that don't actually provision anything.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Runs the tests on a statically provisioned cluster with autoscaling enabled.
Runs the tests on a self-scaling Kubernetes cluster.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
This test ensures autoscaling works on a restarted Toil run.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| logger | |
| c4_8xlarge_preemptible | |
| c4_8xlarge | |
| r3_8xlarge | |
| r5_2xlarge | |
| r5_4xlarge | |
| t2_micro |
| BinPackingTest | A common base class for Toil tests. |
| ClusterScalerTest | A common base class for Toil tests. |
| ScalerThreadTest | A common base class for Toil tests. |
| MockBatchSystemAndProvisioner | Mimics a leader, job batcher, provisioner and scalable batch system. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Ideally, low targetTime means: Start quickly and maximize parallelization after the cpu/disk/mem have been packed.
Disk/cpu/mem packing is prioritized first, so we set job resource reqs so that each t2.micro (1 cpu/8G disk/1G RAM) can only run one job at a time with its resources.
Each job is parametrized to take 300 seconds, so (the minimum of) 1 of them should fit into each node's 0 second window, so we expect 1000 nodes.
Ideally, high targetTime means: Maximize packing within the targetTime after the cpu/disk/mem have been packed.
Disk/cpu/mem packing is prioritized first, so we set job resource reqs so that each t2.micro (1 cpu/8G disk/1G RAM) can only run one job at a time with its resources.
Each job is parametrized to take 300 seconds, so 12 of them should fit into each node's 3600 second window. 1000/12 = 83.33, so we expect 84 nodes.
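The node-count arithmetic above can be sketched directly; this is a back-of-the-envelope illustration using the numbers from these test descriptions, not the bin-packer's actual code:

import math

jobs = 1000         # queued jobs
job_runtime = 300   # estimated seconds per job
target_time = 3600  # window a job is allowed to wait

# Each node can chain target_time / job_runtime jobs within the window,
# but always at least one.
jobs_per_node = max(1, target_time // job_runtime)  # 12 here; 1 when target_time is 0
expected_nodes = math.ceil(jobs / jobs_per_node)    # ceil(1000 / 12) == 84
print(jobs_per_node, expected_nodes)                # -> 12 84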
Disk/cpu/mem packing is prioritized first, so we set job resource reqs so that each t2.micro (1 cpu/8G disk/1G RAM) can run a seemingly infinite number of jobs with its resources.
Since all jobs should pack cpu/disk/mem-wise on a t2.micro, we expect only one t2.micro to be provisioned. If we raise this, as in testLowTargetTime, it will launch 1000 t2.micros.
This is important, because services are one case where the degree of parallelization really, really matters. If you have multiple services, they may all need to be running simultaneously before any real work can be done.
Despite setting globalTargetTime=3600, this should launch 1000 t2.micros because each job's estimated runtime (30000 seconds) extends well beyond 3600 seconds.
If the reservation is extended to fit a long job, and the bin-packer naively searches through all the reservation slices to find the first slice that fits, it will happily assign the first slot that fits the job, even if that slot occurs days in the future.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Make sure this overhead is accounted for on large nodes.
Make sure this overhead is accounted for on small nodes.
Make sure this overhead is accounted for so that real-world observed failures cannot happen again.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Mimics a leader, job batcher, provisioner and scalable batch system.
Implementations must call _setLeaderWorkerAuthentication().
Implementations must call _setLeaderWorkerAuthentication() with the leader so that workers can be launched.
| log |
| AbstractClusterTest | A common base class for Toil tests. |
| CWLOnARMTest | Run the CWL 1.2 conformance tests on ARM specifically. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Succeeds if the cluster does not currently exist.
The cluster-side path should have a ':' in front of it.
Run the CWL 1.2 conformance tests on ARM specifically.
| log |
| AbstractGCEAutoscaleTest | A common base class for Toil tests. |
| GCEAutoscaleTest | A common base class for Toil tests. |
| GCEStaticAutoscaleTest | Runs the tests on a statically provisioned cluster with autoscaling enabled. |
| GCEAutoscaleTestMultipleNodeTypes | A common base class for Toil tests. |
| GCERestartTest | This test ensures autoscaling works on a restarted Toil run. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Runs the tests on a statically provisioned cluster with autoscaling enabled.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
This test ensures autoscaling works on a restarted Toil run.
| log |
| ProvisionerTest | A common base class for Toil tests. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| parser |
| f0(job) |
| logger |
| ToilServerUtilsTest | Tests for the utility functions used by the Toil server. |
| hidden | |
| FileStateStoreTest | Test file-based state storage. |
| FileStateStoreURLTest | Test file-based state storage using URLs instead of local paths. |
| BucketUsingTest | Base class for tests that need a bucket. |
| AWSStateStoreTest | Test AWS-based state storage. |
| AbstractToilWESServerTest | Class for server tests that provides a self.app in testing mode. |
| ToilWESServerBenchTest | Tests for Toil's Workflow Execution Service API that don't run workflows. |
| ToilWESServerWorkflowTest | Tests of the WES server running workflows. |
| ToilWESServerCeleryWorkflowTest | End-to-end workflow-running tests against Celery. |
| ToilWESServerCeleryS3StateWorkflowTest | Test the server with Celery and state stored in S3. |
Tests for the utility functions used by the Toil server.
Basic tests for state stores.
Test file-based state storage.
Test file-based state storage using URLs instead of local paths.
Base class for tests that need a bucket.
Test AWS-based state storage.
We don't really care about the exact internal structure, but we do care about actually being under the path we are supposed to use.
Class for server tests that provides a self.app in testing mode.
Tests for Toil's Workflow Execution Service API that don't run workflows.
Tests of the WES server running workflows.
If include_message is set to False, don't send a "message" argument in workflow_params. If include_params is also set to False, don't send workflow_params at all.
End-to-end workflow-running tests against Celery.
Test the server with Celery and state stored in S3.
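As a rough illustration of what these workflow tests send, here is a sketch of a run-submission request; the endpoint path and form fields follow the GA4GH WES specification, while the host, port, and workflow values are illustrative:

import json

import requests

# Submit a CWL workflow to a locally running WES server (illustrative address).
resp = requests.post(
    "http://localhost:8080/ga4gh/wes/v1/runs",
    files={
        "workflow_url": (None, "example.cwl"),
        "workflow_type": (None, "CWL"),
        "workflow_type_version": (None, "v1.0"),
        "workflow_params": (None, json.dumps({"message": "Hello world!"})),
    },
)
print(resp.json().get("run_id"))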
A demonstration of toil. Sorts the lines of a file into ascending order by doing a parallel merge sort. This is an intentionally buggy version that doesn't include restart() for testing purposes.
| defaultLines | |
| defaultLineLen | |
| sortMemory |
| setup(job, inputFile, N, downCheckpoints, options) | Sets up the sort. |
| down(job, inputFileStoreID, N, path, downCheckpoints, ...) | Input is a file, a subdivision size N, and a path in the hierarchy of jobs. |
| up(job, inputFileID1, inputFileID2, path, options[, ...]) | Merges the two files and places them in the output. |
| sort(file) | Sorts the given file. |
| merge(fileHandle1, fileHandle2, outputFileHandle) | Merges together two files maintaining sorted order. |
| copySubRangeOfFile(inputFile, fileStart, fileEnd) | Copies the range (in bytes) between fileStart and fileEnd to the given |
| getMidPoint(file, fileStart, fileEnd) | Finds the point in the file to split. |
| makeFileToSort(fileName[, lines, lineLen]) | |
| main([options]) |
All handles must be text-mode streams.
A demonstration of toil. Sorts the lines of a file into ascending order by doing a parallel merge sort.
| defaultLines | |
| defaultLineLen | |
| sortMemory |
| setup(job, inputFile, N, downCheckpoints, options) | Sets up the sort. |
| down(job, inputFileStoreID, N, path, downCheckpoints, ...) | Input is a file, a subdivision size N, and a path in the hierarchy of jobs. |
| up(job, inputFileID1, inputFileID2, path, options[, ...]) | Merges the two files and places them in the output. |
| sort(file) | Sorts the given file. |
| merge(fileHandle1, fileHandle2, outputFileHandle) | Merges together two files maintaining sorted order. |
| copySubRangeOfFile(inputFile, fileStart, fileEnd) | Copies the range (in bytes) between fileStart and fileEnd to the given |
| getMidPoint(file, fileStart, fileEnd) | Finds the point in the file to split. |
| makeFileToSort(fileName[, lines, lineLen]) | |
| main([options]) |
All handles must be text-mode streams.
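The merge step listed above (in both the buggy and regular variants of this demo) is an ordinary two-way merge over sorted text streams; a minimal sketch, assuming both inputs are text-mode handles over already-sorted lines per the note above, looks like this:

def merge(fileHandle1, fileHandle2, outputFileHandle):
    # Pull one line from each side and always emit the smaller one.
    line1, line2 = fileHandle1.readline(), fileHandle2.readline()
    while line1 and line2:
        if line1 <= line2:
            outputFileHandle.write(line1)
            line1 = fileHandle1.readline()
        else:
            outputFileHandle.write(line2)
            line2 = fileHandle2.readline()
    # One side is exhausted; drain whichever side still has lines.
    while line1:
        outputFileHandle.write(line1)
        line1 = fileHandle1.readline()
    while line2:
        outputFileHandle.write(line2)
        line2 = fileHandle2.readline()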
| logger | |
| defaultLineLen | |
| defaultLines | |
| defaultN |
| SortTest | Tests Toil by sorting a file in parallel on various combinations of job stores and batch |
| runMain(options) | Make sure the output file is deleted every time main is run. |
Tests Toil by sorting a file in parallel on various combinations of job stores and batch systems.
| logger |
| AutoDeploymentTest | Tests various auto-deployment scenarios. Using the appliance, i.e. a docker container, |
Tests various auto-deployment scenarios. Using the appliance, i.e. a docker container, for these tests allows for running worker processes on the same node as the leader process while keeping their file systems separate from each other and the leader process. Separate file systems are crucial to prove that auto-deployment does its job.
Mainly written to cover https://github.com/BD2KGenomics/toil/issues/1259 but then also revealed https://github.com/BD2KGenomics/toil/issues/1278.
                  ┌───────────┐
                  │ Root (W1) │
                  └───────────┘
                        │
             ┌──────────┴─────────┐
             ▼                    ▼
    ┌────────────────┐ ┌────────────────────┐
    │ Deferring (W2) │ │ Encapsulating (W3) │═══════════════╗
    └────────────────┘ └────────────────────┘               ║
                                 │                          ║
                                 ▼                          ▼
                      ┌───────────────────┐        ┌────────────────┐
                      │ Encapsulated (W3) │        │ Follow-on (W6) │
                      └───────────────────┘        └────────────────┘
                                 │                          │
                        ┌────────┴────────┐                 │
                        ▼                 ▼                 ▼
                ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
                │ Dummy 1 (W4) │ │ Dummy 2 (W5) │ │   Last (W6)  │
                └──────────────┘ └──────────────┘ └──────────────┘
The Wn numbers denote the worker processes that a particular job is run in. Deferring adds a deferred function and then runs for a long time. The deferred function will be present in the cache state for the duration of Deferring. Follow-on is the generic Job instance that's added by encapsulating a job. It runs on the same worker node but in a separate worker process, as the first job in that worker. Because it is the first job in its worker process (the user script has not been made available on the sys.path by a previous job in that worker), it might cause problems with deserializing a deferred function defined in the user script.
Encapsulated has two children to ensure that Follow-on is run in a separate worker.
                  ┌───────────┐
                  │ Root (W1) │
                  └───────────┘
                        │
             ┌──────────┴─────────┐
             ▼                    ▼
    ┌────────────────┐ ┌────────────────────┐
    │ Deferring (W2) │ │ Encapsulating (W3) │═══════════════════════╗
    └────────────────┘ └────────────────────┘                       ║
                                 │                                  ║
                                 ▼                                  ▼
                      ┌───────────────────┐             ┌────────────────┐
                      │ Encapsulated (W3) │════════╗    │ Follow-on (W7) │
                      └───────────────────┘        ║    └────────────────┘
                             │                     ║
                      ┌──────┴──────┐              ║
                      ▼             ▼              ▼
               ┌────────────┐┌────────────┐ ┌──────────────┐
               │ Dummy (W4) ││ Dummy (W5) │ │ Trigger (W6) │
               └────────────┘└────────────┘ └──────────────┘
Trigger causes Deferring to crash. Follow-on runs next, detects Deferring's left-overs and runs the deferred function. Follow-on is an instance of Job and the first job in its worker process. This test ensures that despite these circumstances, the user script is loaded before the deferred functions defined in it are being run.
Encapsulated has two children to ensure that Follow-on is run in a new worker. That's the only way to guarantee that the user script has not been loaded yet, which would cause the test to succeed coincidentally. We want to test that auto-deploying and loading of the user script are done properly before deferred functions are being run and before any jobs have been executed by that worker.
| logger |
| MessageBusTest | A common base class for Toil tests. |
| failing_job_fn(job) | This function is guaranteed to fail. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| CheckpointTest | A common base class for Toil tests. |
| CheckRetryCount | Fail N times, succeed on the next try. |
| AlwaysFail | Class represents a unit of work in toil. |
| CheckpointFailsFirstTime | Class represents a unit of work in toil. |
| FailOnce | Fail the first time the workflow is run, but succeed thereafter. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Fail N times, succeed on the next try.
Class represents a unit of work in toil.
Class represents a unit of work in toil.
Fail the first time the workflow is run, but succeed thereafter.
| logger |
| DeferredFunctionTest | Test the deferred function system. |
Test the deferred function system.
Initially the first file exists, so the job should SIGKILL itself and neither deferred function will run (in fact, the second should not even be registered). On the restart, the first deferred function should run and the first file should not exist, but the second one should. We assert the presence of the second, then register the second deferred function and exit normally. At the end of the test, neither file should exist.
Incidentally, this also tests for multiple registered deferred functions, and the case where a deferred function fails (since the first file doesn't exist on the retry).
Assert that the file is missing after the pipeline fails, because we're using a single-machine batch system and the leader's batch system cleanup will find and run the deferred function.
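For context, here is a hedged sketch of the deferred-function API these tests exercise (job store locator and helper names are illustrative): job.defer() registers a cleanup call that runs when the job finishes, or during cleanup if its worker dies.

import os

from toil.common import Toil
from toil.job import Job


def remove_scratch(path):
    # Deferred cleanup: runs after the job ends, even on failure.
    if os.path.exists(path):
        os.remove(path)


def work(job):
    scratch = job.fileStore.getLocalTempFileName()
    open(scratch, "w").close()
    job.defer(remove_scratch, scratch)
    # ... real work using the scratch file would go here ...


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("file:defer-jobstore")
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(work))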
| DockerCheckTest | Tests checking whether a docker image exists or not. |
Tests checking whether a docker image exists or not.
| logger |
| EnvironmentTest | Test to make sure that Toil's environment variable save and restore system |
| signal_leader(job) | Make a file in the file store that the leader can see. |
| check_environment(job, try_name) | Fail if the test environment is wrong. |
| wait_a_bit(job) | Toil job that waits. |
| check_environment_repeatedly(job) | Toil job that checks the environment, waits, and checks it again, as |
| main([options]) | Run the actual workflow with the given options. |
Test to make sure that Toil's environment variable save and restore system (environment.pickle) works.
The environment should be captured once at the start of the workflow and should be sent through based on that, not based on the leader's current environment when the job is launched.
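A small sketch of the behavior under test (the variable name and job store locator are illustrative): a value set before the workflow starts is what jobs should observe, regardless of later changes on the leader.

import os

from toil.common import Toil
from toil.job import Job


def check(job):
    # Should reflect the environment captured when the workflow started.
    job.log("MY_TEST_VAR = %s" % os.environ.get("MY_TEST_VAR"))


if __name__ == "__main__":
    os.environ["MY_TEST_VAR"] = "captured-at-start"
    options = Job.Runner.getDefaultOptions("file:env-jobstore")
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(check))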
| testingIsAutomatic | |
| logger |
| hidden | Hiding the abstract test classes from the Unittest loader so they can be inherited in different |
| NonCachingFileStoreTestWithFileJobStore | Abstract tests for the various functions in |
| CachingFileStoreTestWithFileJobStore | Abstract tests for the various cache-related functions in |
| NonCachingFileStoreTestWithAwsJobStore | Abstract tests for the various functions in |
| CachingFileStoreTestWithAwsJobStore | Abstract tests for the various cache-related functions in |
| NonCachingFileStoreTestWithGoogleJobStore | Abstract tests for the various functions in |
| CachingFileStoreTestWithGoogleJobStore | Abstract tests for the various cache-related functions in |
An abstract base class for testing the various general functions described in :class:toil.fileStores.abstractFileStore.AbstractFileStore
Abstract tests for the various functions in :class:toil.fileStores.nonCachingFileStore.NonCachingFileStore. These tests are general enough that they can also be used for :class:toil.fileStores.CachingFileStore.
Abstract tests for the various cache-related functions in :class:toil.fileStores.cachingFileStore.CachingFileStore.
Attempting to get the file from the jobstore should not fail.
Abstract tests for the various functions in :class:toil.fileStores.nonCachingFileStore.NonCachingFileStore. These tests are general enough that they can also be used for :class:toil.fileStores.CachingFileStore.
Abstract tests for the various cache-related functions in :class:toil.fileStores.cachingFileStore.CachingFileStore.
Abstract tests for the various functions in :class:toil.fileStores.nonCachingFileStore.NonCachingFileStore. These tests are general enough that they can also be used for :class:toil.fileStores.CachingFileStore.
Abstract tests for the various cache-related functions in :class:toil.fileStores.cachingFileStore.CachingFileStore.
Abstract tests for the various functions in :class:toil.fileStores.nonCachingFileStore.NonCachingFileStore. These tests are general enough that they can also be used for :class:toil.fileStores.CachingFileStore.
Abstract tests for the various cache-related functions in :class:toil.fileStores.cachingFileStore.CachingFileStore.
| HelloWorldTest | A common base class for Toil tests. |
| HelloWorld | Class represents a unit of work in toil. |
| FollowOn | Class represents a unit of work in toil. |
| childFn(job) |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Class represents a unit of work in toil.
Class represents a unit of work in toil.
| ImportExportFileTest | A common base class for Toil tests. |
| RestartingJob | Class represents a unit of work in toil. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Class represents a unit of work in toil.
| JobDescriptionTest | A common base class for Toil tests. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| JobEncapsulationTest | Tests of the EncapsulationJob class. |
| noOp() | |
| encapsulatedJobFn(job, string, outFile) |
Tests of the EncapsulationJob class.
| logger | |
| PREFIX_LENGTH | |
| fileStoreString | |
| streamingFileStoreString |
| JobFileStoreTest | Tests of the methods defined in :class:toil.fileStores.abstractFileStore.AbstractFileStore. |
| fileTestJob(job, inputFileStoreIDs, testStrings, ...) | Test job exercises toil.fileStores.abstractFileStore.AbstractFileStore functions |
| simpleFileStoreJob(job) | |
| fileStoreChild(job, testID1, testID2) |
Tests of the methods defined in :class:toil.fileStores.abstractFileStore.AbstractFileStore.
| logger |
| JobServiceTest | Tests of the Job.Service class |
| PerfectServiceTest | Tests of the Job.Service class |
| ToyService | Abstract class used to define the interface to a service. |
| ToySerializableService | Abstract class used to define the interface to a service. |
| serviceTest(job, outFile, messageInt) | Creates one service and one accessing job, which communicate with two files to establish |
| serviceTestRecursive(job, outFile, messages) | Creates a chain of services and accessing jobs, each paired together. |
| serviceTestParallelRecursive(job, outFiles, messageBundles) | Creates multiple chains of services and accessing jobs. |
| serviceAccessor(job, communicationFiles, outFile, ...) | Writes a random integer into the inJobStoreFileID file, then tries 10 times reading |
| fnTest(strings, outputFile) | Function concatenates the strings together and writes them to the output file |
Tests of the Job.Service class.
Tests of the Job.Service class.
Abstract class used to define the interface to a service.
Should be subclassed by the user to define services.
Is not executed as a job; runs within a ServiceHostJob.
Abstract class used to define the interface to a service.
Should be subclassed by the user to define services.
Is not executed as a job; runs within a ServiceHostJob.
| logger |
| JobTest | Tests the job class. |
| TrivialService | Abstract class used to define the interface to a service. |
| simpleJobFn(job, value) | |
| fn1Test(string, outputFile) | Function appends the next character after the last character in the given |
| fn2Test(pStrings, s, outputFile) | Function concatenates the strings in pStrings and s, in that order, and writes the result to |
| trivialParent(job) | |
| parent(job) | |
| diamond(job) | |
| child(job) | |
| errorChild(job) |
Tests the job class.
A -> F
\-------
B -> D
 \       \
  ------- C -> E
Follow on is marked by ->
A -> F
\-------
B -> D
 \       \
  ------- C -> E
Follow on is marked by ->
Test verification of new checkpoint jobs being leaf vertices, starting with the following baseline workflow:
Parent
  |
Child # Checkpoint=True
Test verification of a new checkpoint job being a leaf vertex, starting with a baseline workflow of a single, root job:
Root # Checkpoint=True
Function to create a new workflow and return a tuple of:
Abstract class used to define the interface to a service.
Should be subclassed by the user to define services.
Is not executed as a job; runs within a ServiceHostJob.
| log |
| MiscTests | This class contains miscellaneous tests that don't have enough content to be their own test |
| TestPanic | A common base class for Toil tests. |
This class contains miscellaneous tests that don't have enough content to be their own test file, and that don't logically fit in with any of the other test suites.
Disk space allocation varies from system to system. The computed value should always be equal to or slightly greater than the creation value. This test generates a number of random directories and randomly sized files to test this using getDirSizeRecursively.
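The invariant being checked can be sketched like so (a simplified stand-in for getDirSizeRecursively, not Toil's implementation; st_blocks is Unix-specific):

import os


def dir_size_recursively(root):
    # Sum allocated bytes for every file under root. File systems allocate
    # whole blocks, so this is >= the number of bytes actually written.
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            total += st.st_blocks * 512  # st_blocks counts 512-byte units
    return total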
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| log |
| hidden | Hide abstract base class from unittest's test case loader. |
| SingleMachinePromisedRequirementsTest | Tests against the SingleMachine batch system |
| MesosPromisedRequirementsTest | Tests against the Mesos batch system |
| maxConcurrency(job, cpuCount, filename, coresPerJob) | Returns the max number of concurrent tasks when using a PromisedRequirement instance |
| getOne() | |
| getThirtyTwoMb() | |
| logDiskUsage(job, funcName[, sleep]) | Logs the job's disk usage to master and sleeps for specified amount of time. |
http://stackoverflow.com/questions/1323455/python-unit-test-with-base-and-sub-class#answer-25695512
An abstract base class for testing Toil workflows with promised requirements.
Tests against the SingleMachine batch system
Tests against the Mesos batch system
| CachedUnpicklingJobStoreTest | A common base class for Toil tests. |
| ChainedIndexedPromisesTest | A common base class for Toil tests. |
| PathIndexingPromiseTest | Test support for indexing promises of arbitrarily nested data structures of lists, dicts and |
| parent(job) | |
| child() | |
| a(job) | |
| b(job) | |
| c() | |
| d(job) | |
| e() |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Test support for indexing promises of arbitrarily nested data structures of lists, dicts and tuples, or any other object supporting the __getitem__() protocol.
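For illustration, a sketch of the promise-indexing idiom this test covers (job store locator and return values are illustrative): rv() accepts a path of keys and indices applied to the child's eventual return value via __getitem__().

from toil.common import Toil
from toil.job import Job


def produce(job):
    return {"a": [10, 20], "b": (30,)}


def consume(job, value):
    job.log("got %s" % value)  # expected: 20


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("file:promise-jobstore")
    root = Job.wrapJobFn(produce)
    # rv('a', 1) promises produce()'s return value indexed as ['a'][1].
    root.addFollowOnJobFn(consume, root.rv("a", 1))
    with Toil(options) as toil:
        toil.start(root)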
| RealtimeLoggerTest | A common base class for Toil tests. |
| MessageDetector | Detect the secret message and set a flag. |
| LogTest | Class represents a unit of work in toil. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Detect the secret message and set a flag.
If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an 'encoding' attribute, it is used to determine how to do the output to the stream.
Class represents a unit of work in toil.
| logger |
| RegularLogTest | A common base class for Toil tests. |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
| ResourceTest | Test module descriptors and resources derived from them. |
| tempFileContaining(content[, suffix]) | Write a file with the given contents, and keep it on disk as long as the context is active. |
Test module descriptors and resources derived from them.
https://github.com/BD2KGenomics/toil/issues/631 and https://github.com/BD2KGenomics/toil/issues/858
| logger |
| RestartDAGTest | Tests that restarted job DAGs don't run children of jobs that failed in the first run till the |
| passingFn(job[, fileName]) | This function is guaranteed to pass as it does nothing out of the ordinary. If fileName is |
| failingFn(job, failType, fileName) | This function is guaranteed to fail via a raised assertion, or an os.kill |
Tests that restarted job DAGs don't run children of jobs that failed in the first run till the parent completes successfully in the restart.
| ResumabilityTest | https://github.com/BD2KGenomics/toil/issues/808 |
| parent(job) | Set up a bunch of dummy child jobs, and a bad job that needs to be |
| chaining_parent(job) | Set up a failing job to chain to. |
| goodChild(job) | Does nothing. |
| badChild(job) | Fails the first time it's run, succeeds the second time. |
https://github.com/BD2KGenomics/toil/issues/808
| CleanWorkDirTest | Tests of :class:toil.fileStores.abstractFileStore.AbstractFileStore |
| tempFileTestJob(job) | |
| tempFileTestErrorJob(job) |
Tests of :class:toil.fileStores.abstractFileStore.AbstractFileStore.
| SystemTest | Test various assumptions about the operating system's behavior. |
Test various assumptions about the operating system's behavior.
| log |
| ThreadingTest | Test Toil threading/synchronization tools. |
Test Toil threading/synchronization tools.
| ToilContextManagerTest | A common base class for Toil tests. |
| HelloWorld | Class represents a unit of work in toil. |
| FollowOn | Class represents a unit of work in toil. |
| childFn(job) |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Class represents a unit of work in toil.
Class represents a unit of work in toil.
| UserDefinedJobArgTypeTest | Test for issue #423 (Toil can't unpickle classes defined in user scripts) and variants |
| JobClass | Class represents a unit of work in toil. |
| Foo |
| jobFunction(job, level, foo) | |
| main() |
Test for issue #423 (Toil can't unpickle classes defined in user scripts) and variants thereof.
https://github.com/BD2KGenomics/toil/issues/423
Class represents a unit of work in toil.
| WorkerTests | Test miscellaneous units of the worker. |
Test miscellaneous units of the worker.
| logger |
| DebugJobTest | Test the toil debug-job command. |
| workflow_debug_jobstore() | |
| testJobStoreContents() | Test toilDebugFile.printContentsOfJobStore(). |
| fetchFiles(symLink, jobStoreDir, outputDir) | Fn for testFetchJobStoreFiles() and testFetchJobStoreFilesWSymlinks(). |
| testFetchJobStoreFiles() | Test toilDebugFile.fetchJobStoreFiles() symlinks. |
Runs a workflow that imports 'B.txt' and 'mkFile.py' into the jobStore. 'A.txt', 'C.txt', 'ABC.txt' are then created. This checks to make sure these contents are found in the jobStore and printed.
Runs a workflow that imports 'B.txt' and 'mkFile.py' into the jobStore. 'A.txt', 'C.txt', 'ABC.txt' are then created. This test then attempts to get a list of these files, copy them from the jobStore into our output directory, confirm that they are present, and then delete them.
Test the toil debug-job command.
| logger | |
| pkg_root |
| ToilKillTest | A set of test cases for "toil kill". |
| ToilKillTestWithAWSJobStore | A set of test cases for "toil kill" using the AWS job store. |
A set of test cases for "toil kill".
A set of test cases for "toil kill" using the AWS job store.
| pkg_root | |
| logger |
| UtilsTest | Tests the utilities that toil ships with, e.g. stats and status, in conjunction with restart |
| RunTwoJobsPerWorker | Runs a child job with the same resources as itself in an attempt to chain the jobs on the same worker |
| printUnicodeCharacter() |
Tests the utilities that toil ships with, e.g. stats and status, in conjunction with restart functionality.
Launches a cluster with custom tags. Verifies the tags exist. ssh's into the cluster. Does some weird string comparisons. Makes certain that TOIL_WORKDIR is set as expected in the ssh'ed cluster. Rsyncs a file and verifies it exists on the leader. Destroys the cluster.
Runs a child job with the same resources as itself in an attempt to chain the jobs on the same worker.
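A minimal sketch of that chaining idiom (function names are illustrative): the child asks for exactly the parent's resources, which makes it eligible to be chained onto the same worker.

from toil.job import Job


def child(job):
    job.log("may be chained onto the parent's worker")


def parent(job):
    # Request the same cores/memory/disk as the current job.
    job.addChildJobFn(child, cores=job.cores, memory=job.memory, disk=job.disk)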
| logger | |
| WDL_CONFORMANCE_TEST_REPO | |
| WDL_CONFORMANCE_TEST_COMMIT | |
| WDL_CONFORMANCE_TESTS_UNSUPPORTED_BY_TOIL | |
| WDL_UNIT_TESTS_UNSUPPORTED_BY_TOIL |
| BaseWDLTest | Base test class for WDL tests. |
| WDLConformanceTests | WDL conformance tests for Toil. |
| WDLTests | Tests for Toil's MiniWDL-based implementation. |
| WDLToilBenchTests | Tests for Toil's MiniWDL-based implementation that don't run workflows. |
Base test class for WDL tests.
WDL conformance tests for Toil.
Tests for Toil's MiniWDL-based implementation.
Tests for Toil's MiniWDL-based implementation that don't run workflows.
White box test; will need to be changed or removed if the WDL interpreter changes.
| WDLKubernetesClusterTest | Ensure WDL works on the Kubernetes batchsystem. |
Ensure WDL works on the Kubernetes batchsystem.
| logger | |
| MT | |
| methodNamePartRegex |
| ToilTest | A common base class for Toil tests. |
| ApplianceTestSupport | A Toil test that runs a user script on a minimal cluster of appliance containers. |
| get_data(filename) | Returns an absolute path for a file from this package. |
| get_temp_file([suffix, rootDir]) | Return a string representing a temporary file, that must be manually deleted. |
| needs_env_var(var_name[, comment]) | Use as a decorator before test classes or methods to run only if the given |
| needs_rsync3(test_item) | Decorate classes or methods that depend on any features from rsync version 3.0.0+. |
| needs_online(test_item) | Use as a decorator before test classes or methods to run only if we are meant to talk to the Internet. |
| needs_aws_s3(test_item) | Use as a decorator before test classes or methods to run only if AWS S3 is usable. |
| needs_aws_ec2(test_item) | Use as a decorator before test classes or methods to run only if AWS EC2 is usable. |
| needs_aws_batch(test_item) | Use as a decorator before test classes or methods to run only if AWS Batch |
| needs_google_storage(test_item) | Use as a decorator before test classes or methods to run only if Google |
| needs_google_project(test_item) | Use as a decorator before test classes or methods to run only if we have a Google Cloud project set. |
| needs_gridengine(test_item) | Use as a decorator before test classes or methods to run only if GridEngine is installed. |
| needs_torque(test_item) | Use as a decorator before test classes or methods to run only if PBS/Torque is installed. |
| needs_kubernetes_installed(test_item) | Use as a decorator before test classes or methods to run only if Kubernetes is installed. |
| needs_kubernetes(test_item) | Use as a decorator before test classes or methods to run only if Kubernetes is installed and configured. |
| needs_mesos(test_item) | Use as a decorator before test classes or methods to run only if Mesos is installed. |
| needs_slurm(test_item) | Use as a decorator before test classes or methods to run only if Slurm is installed. |
| needs_htcondor(test_item) | Use as a decorator before test classes or methods to run only if HTCondor is installed. |
| needs_lsf(test_item) | Use as a decorator before test classes or methods to only run them if LSF is installed. |
| needs_java(test_item) | Use as a test decorator to run only if java is installed. |
| needs_docker(test_item) | Use as a decorator before test classes or methods to only run them if |
| needs_singularity(test_item) | Use as a decorator before test classes or methods to only run them if |
| needs_singularity_or_docker(test_item) | Use as a decorator before test classes or methods to only run them if |
| needs_local_cuda(test_item) | Use as a decorator before test classes or methods to only run them if |
| needs_docker_cuda(test_item) | Use as a decorator before test classes or methods to only run them if |
| needs_encryption(test_item) | Use as a decorator before test classes or methods to only run them if PyNaCl is installed |
| needs_cwl(test_item) | Use as a decorator before test classes or methods to only run them if CWLTool is installed |
| needs_wdl(test_item) | Use as a decorator before test classes or methods to only run them if miniwdl is installed |
| needs_server(test_item) | Use as a decorator before test classes or methods to only run them if Connexion is installed. |
| needs_celery_broker(test_item) | Use as a decorator before test classes or methods to run only if RabbitMQ is set up to take Celery jobs. |
| needs_wes_server(test_item) | Use as a decorator before test classes or methods to run only if a WES |
| needs_local_appliance(test_item) | Use as a decorator before test classes or methods to only run them if |
| needs_fetchable_appliance(test_item) | Use as a decorator before test classes or methods to only run them if |
| integrative(test_item) | Use this to decorate integration tests so as to skip them during regular builds. |
| slow(test_item) | Use this decorator to identify tests that are slow and not critical. |
| timeLimit(seconds) | Use to limit the execution time of a function. |
| make_tests(generalMethod, targetClass, **kwargs) | This method dynamically generates test methods using the generalMethod as a template. Each |
A common base class for Toil tests.
Please have every test case directly or indirectly inherit this one.
When running tests you may optionally set the TOIL_TEST_TEMP environment variable to the path of a directory where you want temporary test files to be placed. The directory will be created if it doesn't exist. The path may be relative, in which case it will be assumed to be relative to the project root. If TOIL_TEST_TEMP is not defined, temporary files and directories will be created in the system's default location for such files, and any temporary files or directories left over from tests will be removed automatically during tear down. Otherwise, left-over files will not be removed.
Use us-west-2 unless running on EC2, in which case use the region in which the instance is located.
Necessary because utilsTest.testAWSProvisionerUtils() uses option --protect-args which is only available in rsync 3
We define integration tests as A) involving other, non-Toil software components that we develop and/or B) having a higher cost (time or money).
Raises an exception if the execution of the function takes more than the specified amount of time. See <http://stackoverflow.com/a/601168>.
>>> import time
>>> with timeLimit(2):
...     time.sleep(1)
>>> import time
>>> with timeLimit(1):
...     time.sleep(2)
Traceback (most recent call last):
    ...
RuntimeError: Timed out
The arguments following the generalMethodName should be a series of one or more dictionaries of the form {str : type, ...} where the key represents the name of the value. The names will be used to represent the permutation of values passed for each parameter in the generalMethod.
The generated method names will list the parameters in lexicographic order by parameter name.
>>> class Foo:
...     def has(self, num, letter):
...         return num, letter
...
...     def hasOne(self, num):
...         return num
>>> class Bar(Foo):
...     pass
>>> make_tests(Foo.has, Bar, num={'one':1, 'two':2}, letter={'a':'a', 'b':'b'})
>>> b = Bar()
Note that num comes lexicographically before letter and so appears first in the generated method names.
>>> assert b.test_has__letter_a__num_one() == b.has(1, 'a')
>>> assert b.test_has__letter_b__num_one() == b.has(1, 'b')
>>> assert b.test_has__letter_a__num_two() == b.has(2, 'a')
>>> assert b.test_has__letter_b__num_two() == b.has(2, 'b')
>>> f = Foo()
>>> hasattr(f, 'test_has__num_one__letter_a')  # should be false because Foo has no test methods
False
A Toil test that runs a user script on a minimal cluster of appliance containers.
i.e. one leader container and one worker container.
A thread whose join() method re-raises exceptions raised during run(). While join() is idempotent, the exception is only re-raised during the first invocation of join() that successfully joined the thread. If join() times out, no exception will be re-raised even though an exception might already have occurred in run().
When subclassing this thread, override tryRun() instead of run().
>>> def f():
...     assert 0
>>> t = ExceptionalThread(target=f)
>>> t.start()
>>> t.join()
Traceback (most recent call last):
...
AssertionError
>>> class MyThread(ExceptionalThread):
...     def tryRun(self):
...         assert 0
>>> t = MyThread()
>>> t.start()
>>> t.join()
Traceback (most recent call last):
...
AssertionError
| logger |
| ToilState | Holds the leader's scheduling information. |
But only that which does not need to be persisted back to the JobStore (such as information on completed and outstanding predecessors).
Holds the true single copies of all JobDescription objects that the Leader and ServiceManager will use. The leader and service manager shouldn't do their own load() and update() calls on the JobStore; they should go through this class.
Everything in the leader should reference JobDescriptions by ID.
Only holds JobDescription objects, not Job objects, and those JobDescription objects only exist in single copies.
If jobs are loaded that have updated and need to be dealt with by the leader, JobUpdatedMessage messages will be sent to the message bus.
The jobCache is a map from jobStoreID to JobDescription or None. It is used to speed up building the state when initially loading from the JobStore, and is not preserved.
Returns True if the given job exists right now, and False if it hasn't been created or it has been deleted elsewhere.
Doesn't guarantee that the job will or will not be gettable, if racing another process, or if it is still cached.
(one retrieved from get_job())
May raise an exception if the job could not be cleaned up (i.e. files belonging to it failed to delete).
Will make modifications from other hosts visible.
Will make modifications from other hosts visible.
Will wait for up to timeout seconds for a modification (or deletion) from another host to actually be visible.
Always replaces the JobDescription with what is stored in the job store, even if no modification ends up being visible.
Returns True if an update was detected in time, and False otherwise.
(that have not yet succeeded or failed.)
(because one has succeeded or failed.)
Pending successors are those which have not yet succeeded or failed.
Delete a job store used by a previous Toil workflow invocation.
| logger |
| main() |
Create a config file with all default Toil options.
| logger |
| main() |
Debug tool for copying files contained in a toil jobStore.
| logger |
| fetchJobStoreFiles(jobStore, options) | Takes a list of file names as glob patterns, searches for these within a |
| printContentsOfJobStore(job_store[, job_id]) | Fetch a list of all files contained in the job store if nameOfJob is not |
| main() |
Debug tool for running a toil job locally.
| logger |
| main() |
Terminates the specified cluster and associated resources.
| logger |
| main() |
Kills rogue toil processes.
| logger |
| main() |
Launches a toil leader instance with the specified provisioner.
| build_tag_dict_from_env | |
| logger |
| create_tags_dict(tags) | |
| main() |
| main() | |
| get_or_die(module, name) | Get an object from a module or complain that it is missing. |
| loadModules() | |
| printHelp(modules) | |
| printVersion() |
Rsyncs into the toil appliance container running on the leader of the cluster.
| logger |
| main() |
CLI entry for the Toil servers.
| logger |
| main() |
SSH into the toil appliance container running on the leader of the cluster.
| logger |
| main() |
Reports statistical data about a given Toil workflow.
| logger | |
| CATEGORIES | |
| CATEGORY_UNITS | |
| TITLES | |
| TIME_CATEGORIES | |
| SPACE_CATEGORIES | |
| COMPUTED_CATEGORIES | |
| LONG_FORMS | |
| sort_category_choices | |
| sort_field_choices |
| ColumnWidths | Convenience object that stores the width of columns for printing. Helps make things pretty. |
| pad_str(s[, field]) | Pad the beginning of a string with spaces, if necessary. |
| pretty_space(k[, field, alone]) | Given input k as kibibytes, return a nicely formatted string. |
| pretty_time(t[, field, unit, alone]) | Given input t as seconds, return a nicely formatted string. |
| report_unit(unit) | Format a unit name for display. |
| report_time(t, options[, field, unit, alone]) | Given t seconds, report back the correct format as string. |
| report_space(k, options[, field, unit, alone]) | Given k kibibytes, report back the correct format as string. |
| report_number(n[, field, nan_value]) | Given a number, report back the correct format as string. |
| report(v, category, options[, field, alone]) | Report a value of the given category formatted as a string. |
| sprint_tag(key, tag, options[, columnWidths]) | Generate a pretty-print ready string from a JTTag(). |
| decorate_title(category, title, options) | Add extra parts to the category titles. |
| decorate_subheader(category, columnWidths, options) | Add a marker to the correct field if the TITLE is sorted on. |
| get(tree, name) | Return a float value attribute NAME from TREE. |
| sort_jobs(jobTypes, options) | Return the jobTypes, sorted. |
| report_pretty_data(root, worker, job, job_types, options) | Print the important bits out. |
| compute_column_widths(job_types, worker, job, options) | Return a ColumnWidths() object with the correct max widths. |
| update_column_widths(tag, cw, options) | Update the column width attributes for this tag's fields. |
| build_element(element, items, item_name, defaults) | Create an element for output. |
| create_summary(element, containingItems, ...) | Figure out how many jobs (or contained items) ran on each worker (or containing item). |
| get_stats(jobStore) | Sum together all the stats information in the job store. |
| process_data(config, stats) | Collate the stats and report |
| report_data(tree, options) | |
| add_stats_options(parser) | |
| main() | Reports stats on the workflow, use with --stats option to toil. |
If unit is set to B, convert to KiB first.
If it is a NaN or None, use nan_value to represent it instead.
Uses the given field width if set.
If alone is set, the field is being formatted outside a table and might need a unit.
Add units to title if they won't appear in the formatted values. Add a marker to TITLE if the TITLE is sorted on.
Stick a bunch of xxx_number_per_xxx stats into element to describe this.
Produces one object containing lists of the values from all the summed objects.
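As a rough illustration of the space formatting described above, here is a pretty_space-style sketch (thresholds and unit labels are illustrative, not Toil's exact implementation):

def pretty_space(k):
    # Given k kibibytes, return a short human-readable string, e.g.
    # pretty_space(2048) -> '2.0Mi'.
    units = ["Ki", "Mi", "Gi", "Ti"]
    value = float(k)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return "%.1f%s" % (value, unit)
        value /= 1024.0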
Tool for reporting on job status.
| logger |
| ToilStatus | Tool for reporting on job status. |
| main() | Reports the state of a Toil workflow. |
Checks to see if a process exists or not.
If the jobstore does not exist, this returns 'QUEUED', assuming it has not been created yet.
Checks for the existence of files created in the toil.Leader.run(). In toil.Leader.run(), if a workflow completes with failed jobs, 'failed.log' is created, otherwise 'succeeded.log' is written. If neither of these exist, the leader is still running jobs.
Exactly the same as the jobStore.loadRootJob() function, but with a different exit message if the root job is not found, indicating that the workflow ran successfully to completion and that certain stats (such as which jobs are left to run) cannot be meaningfully gathered from it.
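The decision procedure described above can be pictured with a small sketch. It assumes a file job store laid out as a plain directory with the failed.log/succeeded.log markers at its top level; the real ToilStatus works against the job store API, and the state names other than QUEUED here are illustrative:

import os

def workflow_status_sketch(jobstore_dir: str) -> str:
    # Illustrative only: mirror the queued/running/finished decision above.
    if not os.path.isdir(jobstore_dir):
        return "QUEUED"  # job store not created yet
    if os.path.exists(os.path.join(jobstore_dir, "failed.log")):
        return "FAILED"  # leader finished, but some jobs failed
    if os.path.exists(os.path.join(jobstore_dir, "succeeded.log")):
        return "COMPLETED"  # leader finished cleanly
    return "RUNNING"  # no marker yet: the leader is still running jobs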
Updates Toil's internal list of EC2 instance types.
| logger |
| internet_connection() | Returns True if there is an internet connection present, and False otherwise. |
| main() |
| baseVersion | |
| cgcloudVersion | |
| version | |
| cacheTag | |
| mainCacheTag | |
| distVersion | |
| exactPython | |
| python | |
| dockerTag | |
| currentCommit | |
| dockerRegistry | |
| dockerName | |
| dirty | |
| cwltool_version |
| get_version(iterable) | Get the version of the WDL document. |
| logger | |
| file_digest | |
| WDLContext | |
| F | |
| WDLBindings | |
| SHARED_PATH_ATTR | |
| DirectoryNamingStateDict |
| InsufficientMountDiskSpace | Common base class for all non-exit exceptions. |
| ReadableFileObj | Protocol that is more specific than what file_digest takes as an argument. |
| FileDigester | Protocol for the features we need from hashlib.file_digest. |
| NonDownloadingSize | WDL size() implementation that avoids downloading files. |
| ToilWDLStdLibBase | Standard library implementation for WDL as run on Toil. |
| ToilWDLStdLibWorkflow | Standard library implementation for workflow scope. |
| ToilWDLStdLibTaskCommand | Standard library implementation to use inside a WDL task command evaluation. |
| ToilWDLStdLibTaskOutputs | Standard library implementation for WDL as run on Toil, with additional functions only allowed in task output sections. |
| WDLBaseJob | Base job class for all WDL-related jobs. |
| WDLTaskWrapperJob | Job that determines the resources needed to run a WDL job. |
| WDLTaskJob | Job that runs a WDL task. |
| WDLWorkflowNodeJob | Job that evaluates a WDL workflow node. |
| WDLWorkflowNodeListJob | Job that evaluates a list of WDL workflow nodes, which are in the same scope and in a topological dependency order, and which do not call out to any other workflows or tasks or sections. |
| WDLCombineBindingsJob | Job that collects the results from WDL workflow nodes and combines their environment changes. |
| WDLWorkflowGraph | Represents a graph of WDL WorkflowNodes. |
| WDLSectionJob | Job that can create more graph for a section of the workflow. |
| WDLScatterJob | Job that evaluates a scatter in a WDL workflow. Runs the body for each value in an array, and makes arrays of the new bindings created in each instance of the body. |
| WDLArrayBindingsJob | Job that takes all new bindings created in an array of input environments, relative to a base environment, and produces bindings where each new binding name is bound to an array of the values in all the input environments. |
| WDLConditionalJob | Job that evaluates a conditional in a WDL workflow. |
| WDLWorkflowJob | Job that evaluates an entire WDL workflow. |
| WDLOutputsJob | Job which evaluates an outputs section for a workflow. |
| WDLStartJob | Job that evaluates an entire WDL workflow, and returns the workflow outputs namespaced with the workflow name. |
| WDLInstallImportsJob | Class represents a unit of work in toil. |
| WDLImportWrapper | Job to organize importing files on workers instead of the leader. Responsible for extracting filenames and metadata, calling ImportsJob, applying imports to input bindings, and scheduling the start workflow job. |
| wdl_error_reporter(task[, exit, log]) | Run code in a context where WDL errors will be reported with pretty formatting. |
| report_wdl_errors(task[, exit, log]) | Create a decorator to report WDL errors with the given task message. |
| remove_common_leading_whitespace(expression[, ...]) | Remove "common leading whitespace" as defined in the WDL 1.1 spec. |
| toil_read_source(uri, path, importer) | Implementation of a MiniWDL read_source function that can use any filename or URL supported by Toil. |
| virtualized_equal(value1, value2) | Check if two WDL values are equal when taking into account file virtualization. |
| combine_bindings(all_bindings) | Combine variable bindings from multiple predecessor tasks into one set for the current task. |
| log_bindings(log_function, message, all_bindings) | Log bindings to the console, even if some are still promises. |
| get_supertype(types) | Get the supertype that can hold values of all the given types. |
| for_each_node(root) | Iterate over all WDL workflow nodes in the given node, including inputs, internal nodes of conditionals and scatters, and gather nodes. |
| recursive_dependencies(root) | Get the combined workflow_node_dependencies of root and everything under it. |
| parse_disks(spec, disks_spec) | Parse a WDL disk spec into a disk mount specification. |
| pack_toil_uri(file_id, task_path, dir_id, file_basename) | Encode a Toil file ID and metadata about who wrote it as a URI. |
| unpack_toil_uri(toil_uri) | Unpack a URI made by pack_toil_uri to retrieve the FileID and the basename of the file it references. |
| clone_metadata(old_file, new_file) | Copy all Toil metadata from one WDL File to another. |
| set_file_value(file, new_value) | Return a copy of a WDL File with all metadata intact but the value changed. |
| set_file_nonexistent(file, nonexistent) | Return a copy of a WDL File with all metadata intact but the nonexistent flag set to the given value. |
| get_file_nonexistent(file) | Return the nonexistent flag for a file. |
| set_file_virtualized_value(file, virtualized_value) | Return a copy of a WDL File with all metadata intact but the virtualized_value attribute set to the given value. |
| get_file_virtualized_value(file) | Get the virtualized storage location for a file. |
| get_shared_fs_path(file) | If a File has a shared filesystem path, get that path. |
| set_shared_fs_path(file, path) | Return a copy of the given File associated with the given shared filesystem path. |
| view_shared_fs_paths(bindings) | Given WDL bindings, return a copy where all files have their shared filesystem paths as their values. |
| poll_execution_cache(node, bindings) | Return the cached result of calling this workflow or task, and its key. |
| fill_execution_cache(cache_key, output_bindings, ...) | Cache the result of calling a workflow or task. |
| choose_human_readable_directory(root_dir, ...) | Select a good directory in which to save files from the given task and source directory. |
| evaluate_decls_to_bindings(decls, all_bindings, ...[, ...]) | Evaluate decls with a given bindings environment and standard library. |
| extract_workflow_inputs(environment) | |
| convert_files(environment, file_to_id, file_to_data, ...) | Resolve relative-URI files in the given environment and convert the file values to new values made from a given mapping. |
| convert_remote_files(environment, file_source, task_path) | Resolve relative-URI files in the given environment and import all files. |
| evaluate_named_expression(context, name, ...) | Evaluate an expression when we know the name of it. |
| evaluate_decl(node, environment, stdlib) | Evaluate the expression of a declaration node, or raise an error. |
| evaluate_call_inputs(context, expressions, ...[, ...]) | Evaluate a bunch of expressions with names, and make them into a fresh set of bindings; inputs_dict is a mapping of input names to their expected types. |
| evaluate_defaultable_decl(node, environment, stdlib) | If the name of the declaration is already defined in the environment, return its value. Otherwise, return the evaluated expression. |
| devirtualize_files(environment, stdlib) | Make sure all the File values embedded in the given bindings point to files that are actually available to command line commands. |
| virtualize_files(environment, stdlib[, enforce_existence]) | Make sure all the File values embedded in the given bindings point to files that are usable from other machines. |
| add_paths(task_container, host_paths) | Based off of WDL.runtime.task_container.add_paths from miniwdl |
| drop_if_missing(file, standard_library) | Return None if a file doesn't exist, or its path if it does. |
| drop_missing_files(environment, standard_library) | Make sure all the File values embedded in the given bindings point to files that exist, or are null. |
| get_file_paths_in_bindings(environment) | Get the paths of all files in the bindings. Doesn't guarantee that duplicates are removed. |
| map_over_files_in_bindings(environment, transform) | Run all File values embedded in the given bindings through the given transformation function. |
| map_over_files_in_binding(binding, transform) | Run all File values' types and values embedded in the given binding's value through the given transformation function. |
| map_over_typed_files_in_value(value, transform) | Run all File values embedded in the given value through the given transformation function. |
| ensure_null_files_are_nullable(value, original_value, ...) | Run through all nested values embedded in the given value and check that the null values are valid. |
| make_root_job(target, inputs, inputs_search_path, ...) | |
| main() | A Toil workflow to interpret WDL input files. |
Protocol that is more specific than what file_digest takes as an argument. Also guarantees a read() method.
Would extend the protocol from Typeshed for hashlib but those are only declared for 3.11+.
Protocol for the features we need from hashlib.file_digest.
Common base class for all non-exit exceptions.
Decorator can then be applied to a function, and if a WDL error happens it will say that it could not {task}.
See <https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#stripping-leading-whitespace>.
Operates on a WDL.Expr.String expression that has already been parsed.
Needs to be async because MiniWDL will await its result.
Treats virtualized and non-virtualized Files referring to the same underlying file as equal.
Useful because section nodes can have internal nodes with dependencies not reflected in those of the section node itself.
The URI will start with the scheme in TOIL_URI_SCHEME.
This will be the path the File was initially imported from, or the path that it has in the call cache.
This should be the path it was initially imported from, or the path that it has in the call cache.
Returns None and the key if the cache has no result for us.
Deals in un-namespaced bindings.
Deals in un-namespaced bindings.
The directories involved may not exist.
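To make the pack_toil_uri/unpack_toil_uri round trip concrete, here is a hedged sketch of the general technique: percent-encode each component so the joined string can be split apart again. The scheme string and component layout here are assumptions for illustration, not Toil's actual encoding:

from urllib.parse import quote, unquote

SCHEME = "toilfile:"  # stand-in; the real value comes from TOIL_URI_SCHEME

def pack_sketch(file_id: str, task_path: str, dir_id: str, basename: str) -> str:
    # Percent-encode each component so an embedded "/" cannot break the split.
    parts = [quote(p, safe="") for p in (file_id, task_path, dir_id, basename)]
    return SCHEME + "/".join(parts)

def unpack_sketch(uri: str) -> tuple:
    assert uri.startswith(SCHEME)
    return tuple(unquote(p) for p in uri[len(SCHEME):].split("/"))

uri = pack_sketch("abc123", "wf.step", "dir0", "reads.bam")
assert unpack_sketch(uri) == ("abc123", "wf.step", "dir0", "reads.bam")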
WDL size() implementation that avoids downloading files.
MiniWDL's default size() implementation downloads the whole file to get its size. We want to be able to get file sizes from code running on the leader, where there may not be space to download the whole file. So we override the fancy class that implements it so that we can handle sizes for FileIDs using the FileID's stored size info.
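A minimal sketch of that idea, assuming the value at hand may be a toil.fileStores.FileID, which carries the size recorded when the file entered the file store:

import os
from toil.fileStores import FileID

def size_without_download(value) -> int:
    # FileIDs remember the size recorded at import/write time, so no
    # download is needed; for an ordinary local path, fall back to stat().
    if isinstance(value, FileID):
        return value.size
    return os.stat(value).st_size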
Will return bindings with file values set to their corresponding relative-URI.
Returns an environment where each File's value is set to the URI it was found at, its virtualized value is set to what it was loaded into the filestore as (if applicable), and its shared filesystem path is set if it came from the local filesystem.
Standard library implementation for WDL as run on Toil.
The destination directory must already exist. No other devirtualize_to call may be writing to it, including the case of another workflow writing the same task to the same place in the call cache at the same time.
Makes sure sibling files stay siblings and files with the same name don't clobber each other. Called from within this class for tasks, and statically at the end of the workflow for outputs.
Returns the local path to the file. If the file is already a local path, or if it already has an entry in virtualized_to_devirtualized, that path will be re-used instead of creating a new copy in dest_dir.
The input filename could already be devirtualized. In this case, the filename should not be added to the cache.
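A sketch of the caching rule just described, with fetch_from_store as a hypothetical stand-in for the real job store download step:

import os

def fetch_from_store(virt_name: str, local_path: str) -> None:
    # Hypothetical stand-in for the real download from the job store.
    raise NotImplementedError

def devirtualize_sketch(virt_name: str, dest_dir: str, cache: dict) -> str:
    if os.path.exists(virt_name):
        return virt_name  # already a local path: re-use it, don't cache it
    if virt_name not in cache:
        local = os.path.join(dest_dir, os.path.basename(virt_name))
        fetch_from_store(virt_name, local)
        cache[virt_name] = local
    return cache[virt_name]  # later lookups of the same name re-use the copy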
Standard library implementation for workflow scope.
Handles deduplicating files generated by write_* calls at workflow scope with copies already in the call cache, so that tasks that depend on them can also be fulfilled from the cache.
Standard library implementation to use inside a WDL task command evaluation.
Expects all the filenames in variable bindings to be container-side paths; these are the "virtualized" filenames, while the "devirtualized" filenames are host-side paths.
Standard library implementation for WDL as run on Toil, with additional functions only allowed in task output sections.
filename represents a URI or file name belonging to a WDL value of type value_type. work_dir represents the current working directory of the job and is where all relative paths will be interpreted from.
Files must not be virtualized.
TODO: Duplicative with WDL.runtime.task._fspaths, except that is internal and supports Directory objects.
The transformation function must not mutate the original File.
TODO: Replace with WDL.Value.rewrite_env_paths or WDL.Value.rewrite_files
The transformation function must not mutate the original File.
The transformation function must not mutate the original File.
If the transform returns None, the file value is changed to Null.
The transform has access to the type information for the value, so it knows if it may return None, depending on if the value is optional or not.
The transform is allowed to return None only if the mapping result won't actually be used, to allow for scans. So error checking needs to be part of the transform itself.
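The null-for-None rule can be seen in a simplified sketch that walks a value tree of plain Python stand-ins; the real code operates on WDL.Value objects and their types:

def map_files_sketch(value, transform):
    # Files are represented here as {"type": "File", "value": path} dicts.
    if isinstance(value, dict) and value.get("type") == "File":
        new_path = transform(value["value"])
        if new_path is None:
            return {"type": "Null"}  # transform dropped the file -> Null
        return {**value, "value": new_path}
    if isinstance(value, list):
        return [map_files_sketch(item, transform) for item in value]
    return value

# Drop any file that is not under /data, keeping the rest unchanged:
print(map_files_sketch(
    [{"type": "File", "value": "/tmp/x"}, {"type": "File", "value": "/data/y"}],
    lambda p: p if p.startswith("/data") else None,
))
# [{'type': 'Null'}, {'type': 'File', 'value': '/data/y'}]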
If a null value is found that does not have a valid corresponding expected_type, an error is raised.
(This is currently only used to check that null values arising from File coercion are in locations with a nullable File? type. If this is to be used elsewhere, the error message should be changed to describe the appropriate types and not just talk about files.)
For example, if one of the nested values is null but the equivalent nested expected_type is not optional, a FileNotFoundError will be raised.
value: the WDL base value to check; this is the WDL value that has been transformed and has the null elements.
original_value: the original WDL base value prior to the transformation; only used for error messages.
expected_type: the WDL type of the value.
Base job class for all WDL-related jobs.
Responsible for post-processing returned bindings, to do things like add in null values for things not defined in a section. Post-processing operations can be added onto any job before it is saved, and will be applied as long as the job's run method calls postprocess().
Also responsible for remembering the Toil WDL configuration keys and values.
Remember to decorate non-trivial overrides with report_wdl_errors().
Should be applied by subclasses' run() implementations to their return values.
Use this when you are returning a promise for bindings, on the job that issues the promise.
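The pattern reads roughly like the following sketch of a Toil job that queues post-processing callbacks before it is saved and applies them from run(). The method names here are illustrative, not WDLBaseJob's actual interface:

from toil.job import Job

class PostprocessingJobSketch(Job):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._steps = []  # callbacks queued before the job is saved

    def then_apply(self, step):
        # Queue an operation to run on whatever run() returns.
        self._steps.append(step)

    def postprocess(self, bindings):
        for step in self._steps:
            bindings = step(bindings)
        return bindings

    def run(self, file_store):
        bindings = {"x": 1}  # pretend this came from evaluating WDL
        return self.postprocess(bindings)  # subclasses must make this call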
Job that determines the resources needed to run a WDL job.
Responsible for evaluating the input declarations for unspecified inputs, evaluating the runtime section, and scheduling or chaining to the real WDL job.
All bindings are in terms of task-internal names.
Job that runs a WDL task.
Responsible for re-evaluating input declarations for unspecified inputs, evaluating the runtime section, re-scheduling if resources are not available, running any command, and evaluating the outputs.
All bindings are in terms of task-internal names.
Currently doesn't implement the MiniWDL plugin system, but does add resource usage monitoring to Docker containers.
Takes the host-side path of the file.
So if Kubernetes is detected, this returns False.
Will check if the mount point source has the requested amount of space available.
Note: we are depending on Toil's job scheduling backend to error when the sum of multiple mount points' disk requests is greater than the total available. For example, if a task has two mount points requesting 100 GB each but only 100 GB is available, the df check may pass for each mount individually, but Toil should fail to schedule the jobs internally.
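A sketch of the per-mount-point check using shutil.disk_usage; note the caveat above that two mounts sharing one filesystem are checked independently:

import shutil

def mounts_fit_sketch(mount_requests: dict) -> bool:
    # mount_requests maps each mount point's source path -> requested bytes.
    for source, requested_bytes in mount_requests.items():
        if shutil.disk_usage(source).free < requested_bytes:
            return False  # this one mount cannot be satisfied
    # Each mount fits on its own; shared filesystems can still oversubscribe.
    return True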
Job that evaluates a WDL workflow node.
Job that evaluates a list of WDL workflow nodes, which are in the same scope and in a topological dependency order, and which do not call out to any other workflows or tasks or sections.
Job that collects the results from WDL workflow nodes and combines their environment changes.
Operates at a certain level of instantiation (i.e. sub-sections are represented by single nodes).
Assumes all relevant nodes are provided; dependencies outside the provided nodes are assumed to be satisfied already.
This elides/resolves gathers.
Produces dependencies after resolving gathers and internal-to-section dependencies, on nodes that are also in this graph.
Job that can create more graph for a section of the workflow.
These bindings can be overlaid with bindings from the actual execution, so that references to names defined in unexecuted code get a proper default undefined value, and not a KeyError at runtime.
The information to do this comes from MiniWDL's "gathers" system: <https://miniwdl.readthedocs.io/en/latest/WDL.html#WDL.Tree.WorkflowSection.gathers>
TODO: This approach will scale O(n^2) when run on n nested conditionals, because generating these bindings for the outer conditional will visit all the bindings from the inner ones.
Job that evaluates a scatter in a WDL workflow. Runs the body for each value in an array, and makes arrays of the new bindings created in each instance of the body. If an instance of the body doesn't create a binding, it gets a null value in the corresponding array.
Job that takes all new bindings created in an array of input environments, relative to a base environment, and produces bindings where each new binding name is bound to an array of the values in all the input environments.
Useful for producing the results of a scatter.
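In terms of plain dictionaries, the combination step looks like this sketch; the real job works on WDL binding environments and preserves type information:

def combine_scatter_sketch(base: dict, per_iteration: list) -> dict:
    # Names bound in any iteration but not in the base environment become
    # arrays; iterations that never bound a name contribute None.
    new_names = set()
    for bindings in per_iteration:
        new_names |= bindings.keys() - base.keys()
    return {name: [b.get(name) for b in per_iteration] for name in new_names}

print(combine_scatter_sketch({"n": 2}, [{"n": 2, "out": 10}, {"n": 2}]))
# {'out': [10, None]}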
Job that evaluates a conditional in a WDL workflow.
Job that evaluates an entire WDL workflow.
Job which evaluates an outputs section for a workflow.
Returns an environment with just the outputs bound, in no namespace.
Job that evaluates an entire WDL workflow, and returns the workflow outputs namespaced with the workflow name. Inputs may or may not be namespaced with the workflow name; both forms are accepted.
Class represents a unit of work in toil.
Job to organize importing files on workers instead of the leader. Responsible for extracting filenames and metadata, calling ImportsJob, applying imports to input bindings, and scheduling the start workflow job.
This class is only used when runImportsOnWorkers is enabled.
Remember to decorate non-trivial overrides with report_wdl_errors().
| logger |
| StatsDict | Subclass of MagicExpando for type-checking purposes. |
| nextChainable(predecessor, job_store, config) | Returns the next chainable job's JobDescription after the given predecessor JobDescription, if one exists, or None if the chain must terminate. |
| workerScript(job_store, config, job_name, job_store_id) | Worker process script, runs a job. |
| parse_args(args) | Parse command-line arguments to the worker. |
| in_contexts(contexts) | Unpickle and enter all the pickled, base64-encoded context managers in the given list. |
| main([argv]) |
Subclass of MagicExpando for type-checking purposes.
| log | |
| KNOWN_EXTANT_IMAGES | |
| cache_path |
| ApplianceImageNotFound | Error raised when using TOIL_APPLIANCE_SELF results in an HTTP error. |
| which(cmd[, mode, path]) | Return the path which conforms to the given mode on the PATH. |
| toilPackageDirPath() | Return the absolute path of the directory that corresponds to the top-level toil package. |
| inVirtualEnv() | Test if we are inside a virtualenv or Conda virtual environment. |
| resolveEntryPoint(entryPoint) | Find the path to the given entry point that should work on a worker. |
| physicalMemory() | Calculate the total amount of physical memory, in bytes. |
| physicalDisk(directory) | |
| applianceSelf([forceDockerAppliance]) | Return the fully qualified name of the Docker image to start Toil appliance containers from. |
| customDockerInitCmd() | Return the custom command set by the TOIL_CUSTOM_DOCKER_INIT_COMMAND environment variable. |
| customInitCmd() | Return the custom command set by the TOIL_CUSTOM_INIT_COMMAND environment variable. |
| lookupEnvVar(name, envName, defaultValue) | Look up environment variables that control Toil and log the result. |
| checkDockerImageExists(appliance) | Attempt to check a url registryName for the existence of a docker image with a given tag. |
| parseDockerAppliance(appliance) | Derive parsed registry, image reference, and tag from a docker image string. |
| checkDockerSchema(appliance) | |
| requestCheckRegularDocker(origAppliance, registryName, ...) | Check if an image exists using the requests library. |
| requestCheckDockerIo(origAppliance, imageName, tag) | Check docker.io to see if an image exists using the requests library. |
| logProcessContext(config) |
[Copy-pasted in from python3.6's shutil.which().]
mode defaults to os.F_OK | os.X_OK. path defaults to the result of os.environ.get("PATH"), or can be overridden with a custom search path.
The return value is guaranteed to end in '/toil'.
>>> n = physicalMemory()
>>> n > 0
True
>>> n == physicalMemory()
True
The result is determined by the current version of Toil and three environment variables: TOIL_DOCKER_REGISTRY, TOIL_DOCKER_NAME and TOIL_APPLIANCE_SELF.
TOIL_DOCKER_REGISTRY specifies an account on a publicly hosted docker registry like Quay or Docker Hub. The default is UCSC's CGL account on Quay.io, where the Toil team publishes the official appliance images. TOIL_DOCKER_NAME specifies the base name of the image; the default of toil will be adequate in most cases. TOIL_APPLIANCE_SELF fully qualifies the appliance image, complete with registry, image name and version tag, overriding both TOIL_DOCKER_NAME and TOIL_DOCKER_REGISTRY as well as the version tag of the image. Setting TOIL_APPLIANCE_SELF will not be necessary in most cases.
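The precedence can be summarized in a short sketch; the default registry and name follow the description above, but treat them as illustrative rather than authoritative:

import os

def appliance_self_sketch(version_tag: str) -> str:
    # TOIL_APPLIANCE_SELF wins outright; otherwise combine the registry,
    # the image name, and the current version tag.
    override = os.environ.get("TOIL_APPLIANCE_SELF")
    if override:
        return override
    registry = os.environ.get("TOIL_DOCKER_REGISTRY", "quay.io/ucsc_cgl")
    name = os.environ.get("TOIL_DOCKER_NAME", "toil")
    return f"{registry}/{name}:{version_tag}"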
The custom docker command is run prior to running the workers and/or the primary node's services.
This can be useful for doing any custom initialization on instances (e.g. authenticating to private docker registries). Any single quotes are escaped and the command cannot contain a set of blacklisted chars (newline or tab).
The custom init command is run prior to running Toil appliance itself in workers and/or the primary node (i.e. this is run one stage before TOIL_CUSTOM_DOCKER_INIT_COMMAND).
This can be useful for doing any custom initialization on instances (e.g. authenticating to private docker registries). Any single quotes are escaped and the command cannot contain a set of blacklisted chars (newline or tab).
Returns: the custom command, or an empty string if the environment variable is not set.
Example: "quay.io/ucsc_cgl/toil:latest" should return ("quay.io", "ucsc_cgl/toil", "latest").
If a registry is not defined, the default is "docker.io". If a tag is not defined, the default is "latest".
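A sketch of that parsing logic, including the docker.io/latest defaults; this is a simplified stand-in for parseDockerAppliance, not its exact rules:

def parse_appliance_sketch(appliance: str) -> tuple:
    tag = "latest"
    if ":" in appliance.rsplit("/", 1)[-1]:  # a tag can only follow the last "/"
        appliance, tag = appliance.rsplit(":", 1)
    registry, _, image = appliance.partition("/")
    if "." not in registry and registry != "localhost":
        registry, image = "docker.io", appliance  # no host part: default registry
    return registry, image, tag

assert parse_appliance_sketch("quay.io/ucsc_cgl/toil:latest") == \
    ("quay.io", "ucsc_cgl/toil", "latest")
assert parse_appliance_sketch("ubuntu") == ("docker.io", "ubuntu", "latest")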
Error raised when using TOIL_APPLIANCE_SELF results in an HTTP error.
URL is based on the docker v2 schema.
This has the following format: https://{websitehostname}.io/v2/{repo}/manifests/{tag}
Does not work with the official (docker.io) site, because they require an OAuth token, so a separate check is done for docker.io images.
URL is based on the docker v2 schema. Requires that an access token be fetched first.
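A sketch of the non-docker.io existence check, assuming the manifest URL shape given above and treating any 200 response as existence:

import requests

def image_exists_sketch(registry_host: str, repo: str, tag: str) -> bool:
    # HEAD the v2 manifest endpoint; a 200 means the tag exists. docker.io
    # would first need an OAuth token, so this sketch does not cover it.
    url = f"https://{registry_host}/v2/{repo}/manifests/{tag}"
    return requests.head(url, allow_redirects=True).status_code == 200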
UCSC Computational Genomics Lab
2015 – 2025 UCSC Computational Genomics Lab
| April 9, 2025 | 8.0.0 |