Azure Cloud
Requirements
Support for Azure Cloud requires Nextflow version 21.04.0 or later. If you don’t have it installed, use the following commands to download it to your computer:
curl get.nextflow.io | bash
./nextflow self-update
Azure Blob Storage
Nextflow has built-in support for Azure Blob Storage. Files stored in an Azure blob container can be accessed transparently in your pipeline script, like any other file in the local file system.
The Blob storage account name and key need to be provided in the Nextflow configuration file as shown below:
azure {
    storage {
        accountName = "<YOUR BLOB ACCOUNT NAME>"
        accountKey = "<YOUR BLOB ACCOUNT KEY>"
    }
}
As an alternative to the account key, a Shared Access Token can be used by specifying the sasToken setting in place of the accountKey attribute.
Tip
When creating the Shared Access Token, make sure to allow the resource types Container and Object and the permissions Read, Write, Delete, List, Add and Create.
Tip
The value of sasToken is the token without the ? character at its beginning.
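A minimal sketch of the storage configuration using a Shared Access Token instead of the account key. It only swaps accountKey for the sasToken setting described above; the placeholder values are assumptions to be replaced with your own.

```groovy
// nextflow.config
azure {
    storage {
        accountName = "<YOUR BLOB ACCOUNT NAME>"
        // Token value without the leading '?' character
        sasToken = "<YOUR SAS TOKEN>"
    }
}
```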
Once the Blob Storage credentials are set, you can access the files in the blob container like local files by prepending the az:// prefix and the container name to the file path. For example, given a container named my-data that contains a file named foo.txt, you can access the file in your Nextflow script using the fully qualified file path az://my-data/foo.txt.
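As an illustration of the path format described above, the following sketch resolves a blob with the standard file() function and prints its size. It assumes the storage credentials are already set in nextflow.config; my-data and foo.txt are the example names from the text, and the variable name is arbitrary.

```groovy
// main.nf (illustrative only)
// 'my-data' is the container and 'foo.txt' the blob, as in the example above
def foo = file('az://my-data/foo.txt')
println "Size of ${foo.name}: ${foo.size()} bytes"
```

The same az:// path can be used wherever a local path would be accepted, for example as a pipeline input parameter.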
Azure Batch
Azure Batch is a managed computing service that allows the execution of containerised workloads in the Azure cloud infrastructure.
Nextflow provides built-in support for Azure Batch, which allows the seamless deployment of a Nextflow pipeline in the cloud by offloading the process executions as Batch jobs.
Get started
Create a Batch account in Azure portal. Take note of the account name and key.
Make sure to adjust your quotas to the pipeline’s needs. There are limits on certain resources associated with the Batch account. Many of these limits are default quotas applied by Azure at the subscription or account level. Quotas affect the number of Pools, CPUs and Jobs you can create at any given time.
Create a Storage account and, within it, an Azure Blob Container in the same location where the Batch account was created. Take note of the account name and key.
If you plan to use Azure Files, create an Azure File share within the same Storage account and upload to it the data to mount on the pool nodes.
Associate the Storage account with the Azure Batch account.
Make sure your pipeline processes specify one or more Docker containers by using the container directive.
The container images need to be published in a Docker registry, such as Docker Hub, Quay or Azure Container Registry, that can be reached by the Azure Batch environment.
A minimal configuration looks like the following snippet:
process {
    executor = 'azurebatch'
}
azure {
    storage {
        accountName = "<YOUR STORAGE ACCOUNT NAME>"
        accountKey = "<YOUR STORAGE ACCOUNT KEY>"
    }
    batch {
        location = '<YOUR LOCATION>'
        accountName = '<YOUR BATCH ACCOUNT NAME>'
        accountKey = '<YOUR BATCH ACCOUNT KEY>'
        autoPoolMode = true
    }
}
In the above example, replace the location placeholder with the name of your Azure region and the account placeholders with the values corresponding to your configuration. Then save it to a file named nextflow.config.
Tip
The list of Azure regions can be found by executing the following Azure CLI command:
az account list-locations -o table
Given the previous configuration, launch the execution of the pipeline using the following command:
nextflow run <PIPELINE NAME> -w az://YOUR-CONTAINER/work
Replace <PIPELINE NAME> with a pipeline name, e.g. nextflow-io/rnaseq-nf, and YOUR-CONTAINER with a blob container in the storage account defined in the above configuration.
See the Batch documentation for further details about the configuration for the Azure Batch service.
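To illustrate the container requirement listed in the steps above, here is a sketch of a process that declares its image through the container directive. The process name and the image are assumptions chosen for illustration; any image that Azure Batch can reach would do.

```groovy
// Fragment of a pipeline script (illustrative only)
process sayHello {
    // Image pulled on the Batch node; 'ubuntu:20.04' is only an example
    container 'ubuntu:20.04'

    script:
    """
    echo 'Hello from Azure Batch'
    """
}
```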
Pools configuration
When using the autoPoolMode setting, Nextflow automatically creates a pool of computing nodes to execute the jobs run by your pipeline. By default, it uses only 1 compute node of type Standard_D4_v3.
The pool is not removed when the pipeline execution terminates, unless the configuration setting deletePoolsOnCompletion=true is added to your pipeline configuration file.
Pool-specific settings, e.g. VM type and count, should be provided in the auto pool configuration scope, e.g.
azure {
    batch {
        pools {
            auto {
                vmType = 'Standard_D2_v2'
                vmCount = 10
            }
        }
    }
}
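The settings discussed in this section can be combined in a single batch scope. The sketch below is one possible arrangement, not a recommended setup: it enables autoPoolMode, customises the auto pool and asks Nextflow to delete the pool on completion. The location and account settings are omitted for brevity and are assumed to be defined as in the minimal configuration.

```groovy
// nextflow.config (fragment)
azure {
    batch {
        autoPoolMode = true
        // Remove the pools created by Nextflow when the run completes
        deletePoolsOnCompletion = true
        pools {
            auto {
                vmType = 'Standard_D2_v2'
                vmCount = 10
            }
        }
    }
}
```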
Warning
Don’t forget to clean up the Batch pools to avoid extra charges in the Batch account, or use the auto scaling feature.
Warning
Make sure your Batch account has enough resources to satisfy the pipeline’s requirements and the pool configuration.
Warning
Nextflow uses the same pool ID across pipeline executions if the pool features have not changed. Therefore, when using deletePoolsOnCompletion=true, make sure the pool is completely removed from the Azure Batch account before re-running the pipeline. The following message is returned when the pool is still shutting down:
Error executing process > '<process name> (1)'
Caused by:
Azure Batch pool '<pool name>' not in active state
Named pools
For more precise control over the compute node pools used by your pipeline, for example to use a different pool depending on the task, you can use the Nextflow queue directive to specify the ID of the Azure Batch compute pool that must be used to run that process’ tasks.
The pool is expected to be already available in the Batch environment, unless the setting allowPoolCreation=true is provided in the batch scope of the pipeline configuration file. In the latter case, Nextflow will create the pools on demand.
The configuration details for each pool can be specified using a snippet as shown below:
azure {
    batch {
        pools {
            foo {
                vmType = 'Standard_D2_v2'
                vmCount = 10
            }
            bar {
                vmType = 'Standard_E2_v3'
                vmCount = 5
            }
        }
    }
}
The above example defines the configuration for two node pools. The first will provision 10 compute nodes of type Standard_D2_v2, the second 5 nodes of type Standard_E2_v3. See the Advanced settings below for the complete list of available configuration options.
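To connect the named pools to individual processes, the queue directive can be set per process in the configuration. The following is a sketch under stated assumptions: align and assemble are hypothetical process names, while foo and bar are the pool IDs from the example above.

```groovy
// nextflow.config (fragment)
process {
    executor = 'azurebatch'
    // Hypothetical process names mapped to the pool IDs 'foo' and 'bar'
    withName: 'align' { queue = 'foo' }
    withName: 'assemble' { queue = 'bar' }
}

azure {
    batch {
        // Let Nextflow create 'foo' and 'bar' if they do not exist yet
        allowPoolCreation = true
    }
}
```

The same effect can be obtained by writing the queue directive directly in each process definition.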
Requirements on pre-existing named pools
When Nextflow is configured to use a pool already available in the Batch account, the target pool must satisfy the following requirements:
the pool must be declared as dockerCompatible (Container Type property)
the task slots per node must match the number of cores of the selected VM. Otherwise, Nextflow returns an error like “Azure Batch pool ‘ID’ slots per node does not match the VM num cores (slots: N, cores: Y)”.
Pool autoscaling
Azure Batch can automatically scale pools based on parameters that you define, saving you time and money. With automatic scaling, Batch dynamically adds nodes to a pool as task demands increase, and removes compute nodes as task demands decrease.
To enable this feature for pools created by Nextflow, add the option autoScale = true to the corresponding pool configuration scope. For example, when using autoPoolMode, the setting looks like:
azure {
    batch {
        pools {
            auto {
                autoScale = true
                vmType = 'Standard_D2_v2'
                vmCount = 5
                maxVmCount = 50
            }
        }
    }
}
Nextflow uses the formula shown below to determine the number of VMs to be provisioned in the pool:
// Get pool lifetime since creation.
lifespan = time() - time("{{poolCreationTime}}");
interval = TimeInterval_Minute * {{scaleInterval}};
// Compute the target nodes based on pending tasks.
// $PendingTasks == The sum of $ActiveTasks and $RunningTasks
$samples = $PendingTasks.GetSamplePercent(interval);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max( $PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes/2);
targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));
// For the first interval deploy vmCount nodes, for other intervals scale up/down as per tasks.
$TargetDedicatedNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
$NodeDeallocationOption = taskcompletion;
The above formula initialises a pool with the number of VMs specified by the vmCount option, then scales the pool up on demand, based on the number of pending tasks, up to maxVmCount nodes. If no jobs are submitted for execution, it scales down to zero nodes automatically.
If you need a different strategy, you can provide your own formula using the scaleFormula option.
See the Azure Batch documentation for details.
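As a sketch of how a custom formula might be supplied through scaleFormula, the example below caps the pool at a fixed number of nodes. The formula is an untested illustration built from the same variables and functions that appear in the default formula; the cap of 20 is an arbitrary assumption. Whether the {{...}} placeholders are substituted in a user-supplied formula is not stated here, so the sketch uses literal values only.

```groovy
// nextflow.config (fragment)
azure {
    batch {
        pools {
            auto {
                autoScale = true
                vmType = 'Standard_D2_v2'
                vmCount = 5
                maxVmCount = 50
                // Single-quoted string, so Groovy does not try to interpolate
                // the '$' variables of the autoscale formula
                scaleFormula = '''
                    $tasks = max(0, $PendingTasks.GetSample(1));
                    $TargetDedicatedNodes = min($tasks, 20);
                    $NodeDeallocationOption = taskcompletion;
                '''
            }
        }
    }
}
```

Using single quotes matters here: in a double-quoted Groovy string, names such as $tasks would be treated as variable references by the configuration parser.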
Pool nodes
When Nextflow creates a pool of compute nodes, it selects:
the virtual machine image reference to be installed on the node
the Batch node agent SKU, a program that runs on each node and provides an interface between the node and the Batch service
Together, these settings determine the Operating System and version installed on each node.
By default, Nextflow creates CentOS 8-based pool nodes, but this behavior can be customised in the pool configuration. Below are the image reference/SKU configurations for two popular systems.
Ubuntu 20.04:
azure.batch.pools.<name>.sku = "batch.node.ubuntu 20.04"
azure.batch.pools.<name>.offer = "ubuntu-server-container"
azure.batch.pools.<name>.publisher = "microsoft-azure-batch"
CentOS 8 (default):
azure.batch.pools.<name>.sku = "batch.node.centos 8"
azure.batch.pools.<name>.offer = "centos-container"
azure.batch.pools.<name>.publisher = "microsoft-azure-batch"
In the above snippets, replace <name> with the name of your Azure node pool.
See the Advanced settings below and the Azure Batch nodes documentation for more details.
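The dotted settings above can also be written in the nested scope form used elsewhere on this page. The sketch below applies the Ubuntu 20.04 values to the auto pool, assuming autoPoolMode is in use; for a named pool, replace auto with the pool name.

```groovy
// nextflow.config (fragment)
azure {
    batch {
        pools {
            auto {
                // Ubuntu 20.04 image reference and node agent SKU, as listed above
                sku = "batch.node.ubuntu 20.04"
                offer = "ubuntu-server-container"
                publisher = "microsoft-azure-batch"
            }
        }
    }
}
```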
Private container registry
As of version 21.05.0-edge, a private container registry from which to pull Docker images can optionally be specified as follows:
azure {
    registry {
        server = '<YOUR REGISTRY SERVER>' // e.g.: docker.io, quay.io, <ACCOUNT>.azurecr.io, etc.
        userName = '<YOUR REGISTRY USER NAME>'
        password = '<YOUR REGISTRY PASSWORD>'
    }
}
The private registry is not exclusive, rather it is an addition to the configuration. Public images from other registries are still pulled (if requested by a Task) when a private registry is configured.
Note
When using containers hosted in a private registry, the registry name must also be provided in the container name specified via the container directive, using the format: [server]/[your-organization]/[your-image]:[tag].
Read more about fully qualified image names in the Docker documentation.
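The following sketch puts the registry settings and the fully qualified container name side by side. The organization, image name and tag (my-org, my-image, 1.0) are hypothetical values used only to show the format; the registry server is kept as a placeholder.

```groovy
// nextflow.config (fragment)
azure {
    registry {
        server = '<ACCOUNT>.azurecr.io'
        userName = '<YOUR REGISTRY USER NAME>'
        password = '<YOUR REGISTRY PASSWORD>'
    }
}

process {
    // Format: [server]/[your-organization]/[your-image]:[tag]
    container = '<ACCOUNT>.azurecr.io/my-org/my-image:1.0'
}
```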
Advanced settings
The following configuration options are available:
Name | Description |
---|---|
azure.storage.accountName | The blob storage account name |
azure.storage.accountKey | The blob storage account key |
azure.storage.sasToken | The blob storage shared access signature token. This can be provided as an alternative to the |
azure.storage.tokenDuration | The duration of the shared access signature token created by Nextflow when the |
azure.batch.accountName | The batch service account name. |
azure.batch.accountKey | The batch service account key. |
azure.batch.endpoint | The batch service endpoint e.g. |
azure.batch.location | The name of the batch service region, e.g. |
azure.batch.autoPoolMode | Enable the automatic creation of batch pools depending on the pipeline resources demand (default: |
azure.batch.allowPoolCreation | Enable the automatic creation of batch pools specified in the Nextflow configuration file (default: |
azure.batch.deleteJobsOnCompletion | Enable the automatic deletion of jobs created by the pipeline execution (default: |
azure.batch.deletePoolsOnCompletion | Enable the automatic deletion of compute node pools upon pipeline completion (default: |
azure.batch.copyToolInstallMode | Specify where the azcopy tool used by Nextflow. When |
azure.batch.pools.<name>.publisher | Specify the publisher of virtual machine type used by the pool identified with |
azure.batch.pools.<name>.offer | Specify the offer type of the virtual machine type used by the pool identified with |
azure.batch.pools.<name>.sku | Specify the ID of the Compute Node agent SKU which the pool identified with |
azure.batch.pools.<name>.vmType | Specify the virtual machine type used by the pool identified with |
azure.batch.pools.<name>.vmCount | Specify the number of virtual machines provisioned by the pool identified with |
azure.batch.pools.<name>.maxVmCount | Specify the maximum number of virtual machines when using the auto scale option. |
azure.batch.pools.<name>.autoScale | Enable autoscaling feature for the pool identified with |
azure.batch.pools.<name>.fileShareRootPath | If mounting File Shares, this is the internal root mounting point. Must be |
azure.batch.pools.<name>.scaleFormula | Specify the scale formula for the pool identified with |
azure.batch.pools.<name>.scaleInterval | Specify the interval at which to automatically adjust the Pool size according to the autoscale formula. The minimum and maximum values are 5 minutes and 168 hours respectively (default: 10 mins). |
azure.batch.pools.<name>.schedulePolicy | Specify the scheduling policy for the pool identified with |
azure.batch.pools.<name>.privileged | Enable the task to run with elevated access. Ignored if runAs is set (default: |
azure.batch.pools.<name>.runAs | Specify the username under which the task is run. The user must already exist on each node of the pool. |
azure.registry.server | Specify the container registry from which to pull the Docker images (default: |
azure.registry.userName | Specify the username to connect to a private container registry (requires |
azure.registry.password | Specify the password to connect to a private container registry (requires |
azure.retryPolicy.delay | Delay when retrying failed API requests (default: |
azure.retryPolicy.maxDelay | Max delay when retrying failed API requests (default: |
azure.retryPolicy.jitter | Jitter value when retrying failed API requests (default: |
azure.retryPolicy.maxAttempts | Max attempts when retrying failed API requests (default: |
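
As a sketch of how the retry options listed above might be set together, the fragment below uses the nested scope form. All values are illustrative assumptions and are not meant to reflect the defaults; the delay values assume that these options accept Nextflow's usual duration strings.

```groovy
// nextflow.config (fragment; values are illustrative, not the defaults)
azure {
    retryPolicy {
        delay = '1s'        // assumed duration string: wait before the first retry
        maxDelay = '60s'    // assumed duration string: upper bound between retries
        jitter = 0.5        // illustrative jitter value
        maxAttempts = 5     // give up after this many attempts
    }
}
```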