Google Cloud
Credentials
Credentials for submitting requests to the Google Cloud Batch and Cloud Life Sciences APIs are picked up from your environment using Application Default Credentials. Application Default Credentials are designed to use the credentials most natural to the environment in which a tool runs.
The most common case will be to pick up your end-user Google credentials from your workstation. You can create these by running the command:
gcloud auth application-default login
and running through the authentication flow. This will write a credential file to your gcloud configuration directory that will be used for any tool you run on your workstation that picks up default credentials.
The next most common case would be when running on a Compute Engine VM. In this case, Application Default Credentials will pick up the Compute Engine Service Account credentials for that VM.
See the Application Default Credentials documentation for how to enable other use cases.
Finally, the GOOGLE_APPLICATION_CREDENTIALS environment variable can be used to specify the location of the Google credentials file.
If you don't have one, the credentials file can be downloaded from the Google Cloud Console by following these steps:
Open the Google Cloud Console
Go to APIs & Services → Credentials
Click the Create credentials (blue) drop-down and choose Service account key
In the following page, select an existing Service account or create a new one if needed
Select JSON as Key type
Click the Create button and download the JSON file, giving it a name of your choice, e.g. creds.json
Then, define the following variable, replacing the path in the example with the path to the credentials file you just downloaded:
export GOOGLE_APPLICATION_CREDENTIALS=/path/your/file/creds.json
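To verify that the credentials are resolved correctly, you can ask the Google Cloud SDK to mint a token using Application Default Credentials (a quick sanity check, assuming the gcloud CLI is installed):
gcloud auth application-default print-access-token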
Cloud Batch
Google Cloud Batch is a managed computing service that allows the execution of containerized workloads in the Google Cloud Platform infrastructure.
Nextflow provides built-in support for Cloud Batch which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions through the Google Cloud service.
Requirements
The support for Google Batch requires Nextflow version 22.07.1-edge or later. If you already have Nextflow installed, make sure to update to the latest edge release using these commands:
export NXF_EDGE=1
nextflow -self-update
If you don't have Nextflow, install it with the command below:
curl get.nextflow.io | bash
When done, make sure to use the latest edge release by running the snippet in the previous paragraph.
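You can confirm the version in use by printing it; it should report 22.07.1-edge or later:
nextflow -version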
Configuration
Make sure the GOOGLE_APPLICATION_CREDENTIALS variable is defined in your environment. See the section Credentials for details.
Note
Make sure your Google account is allowed to access the Google Cloud Batch service by checking the APIs & Services dashboard.
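If the service is not enabled yet, it can also be enabled from the command line; for example, using the gcloud CLI (the service names below follow Google Cloud's published identifiers):
gcloud services enable batch.googleapis.com compute.googleapis.com storage.googleapis.com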
Create or edit the file nextflow.config
in your project root directory. The config must specify the following parameters:
Google Cloud Batch as the Nextflow executor, i.e. process.executor = 'google-batch'
The Docker container image to be used to run pipeline tasks, e.g. process.container = 'biocontainers/salmon:0.8.2--1'
The Google Cloud project ID to run in, e.g. google.project = 'rare-lattice-222412'
The Google location, e.g. google.location = 'us-central1'
Example:
process {
    executor = 'google-batch'
    container = 'your/container:latest'
}

google {
    project = 'your-project-id'
    location = 'us-central1'
}
Note
Make sure to specify the project ID, not the project name.
Note
Make sure to specify a location where Google Batch is available. Refer to the Google Batch documentation for region availability.
Note
A container image must be specified to deploy the process execution. You can use a different Docker image for each process using one or more Process selectors.
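For example, a minimal sketch assigning a dedicated image to a single process selected by name (the process name and images below are illustrative):
process {
    container = 'your/container:latest'
    withName: salmon_quant {
        container = 'biocontainers/salmon:0.8.2--1'
    }
}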
The following configuration options are available:
Name | Description
---|---
google.project | The Google project ID to use for the pipeline execution.
google.location | The Google location where the job executions are deployed (default: us-central1).
google.enableRequesterPaysBuckets | When true, the given Google Cloud project ID is used as the billing project for storage access. This is required when accessing data from buckets with requester pays enabled (default: false).
google.batch.allowedLocations | Define the set of allowed locations for VMs to be provisioned. See the Google documentation for details (default: no restriction; requires version 22.12.0-edge or later).
google.batch.bootDiskSize | Set the size of the virtual machine boot disk, e.g. 50.GB (default: none).
google.batch.cpuPlatform | Set the minimum CPU platform, e.g. 'Intel Skylake' (default: none).
google.batch.spot | When true, spot virtual machines are used (default: false).
google.batch.usePrivateAddress | When true, the VM is not provided with a public IP address and only has an internal IP address. If this option is enabled, the associated job can only pull Docker images from Google Container Registry, and the job executable cannot use external services other than Google APIs (default: false).
google.batch.network | The network name to attach the VM's network interface to. The value will be prefixed with global/networks/ unless it contains a /, in which case it is assumed to be a fully specified network resource URL. If unspecified, the global default network is used.
google.batch.serviceAccountEmail | The Google service account email to use for the pipeline execution. If not specified, the default Compute Engine service account for the project will be used.
google.batch.subnetwork | The name of the subnetwork to attach the instance to; must be specified when the network is configured for custom subnet creation. The value is prefixed with regions/subnetworks/ unless it contains a /, in which case it is assumed to be a fully specified subnetwork resource URL.
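For example, a sketch combining several of these options in one configuration (all values are illustrative):
google {
    project = 'your-project-id'
    location = 'us-central1'
    batch.spot = true
    batch.bootDiskSize = '50 GB'
    batch.usePrivateAddress = true
}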
Process definition
Processes can be defined as usual; by default, the cpus and memory directives are used to find the cheapest machine type available at the current location that fits the requested resources. If memory is not specified, 1 GB of memory is allocated per cpu.
As of version 23.02.0-edge, the process machineType directive can be a comma-separated list of patterns. A pattern can contain a * to match any number of characters and a ? to match any single character. Examples of valid patterns: c2-*, m?-standard*, n*.
Alternatively, it can be used to define a specific predefined Google Compute Platform machine type or a custom machine type.
Examples:
process automatic_resources_task {
    cpus 8
    memory '40 GB'

    """
    <Your script here>
    """
}

process allowing_some_series {
    cpus 8
    memory '20 GB'
    machineType 'n2-*,c2-*,m3-*'

    """
    <Your script here>
    """
}

process predefined_resources_task {
    machineType 'n1-highmem-8'

    """
    <Your script here>
    """
}
Pipeline execution
The pipeline can be launched either in a local computer or a cloud instance. Pipeline input data can be stored either locally or in a Google Storage bucket.
The pipeline execution must specify a Google Storage bucket where the workflow's intermediate results are stored, using the -work-dir command line option. For example:
nextflow run <script or project name> -work-dir gs://my-bucket/some/path
Tip
Any input data not stored in a Google Storage bucket will automatically be transferred to the pipeline work bucket. Use this feature with caution, being careful to avoid unnecessary data transfers.
Warning
The Google Storage path needs to contain at least one sub-directory. Don't use only the bucket name, e.g. gs://my-bucket.
Spot instances
Spot instances are supported by adding the following setting in the Nextflow config file:
google {
    batch.spot = true
}
Since this type of virtual machines can be retired by the provider before the job completion, it is advisable to add the following retry strategy to your config file to instruct Nextflow to automatically re-execute a job if the virtual machine was terminated preemptively:
process {
    errorStrategy = { task.exitStatus==14 ? 'retry' : 'terminate' }
    maxRetries = 5
}
Fusion file system
As of version 23.02.0-edge, the Google Batch executor supports the use of the Fusion file system. Fusion allows the use of Google Storage as a virtual distributed file system, optimising data transfer and speeding up most job I/O operations.
To enable the use of Fusion file system in your pipeline, add the following snippet in your Nextflow configuration file:
fusion.enabled = true
wave.enabled = true
process.scratch = false
tower.accessToken = '<YOUR ACCESS TOKEN>'
The Tower access token is optional, but it enables higher API rate limits for the Wave containers service required by Fusion.
Tip
When Fusion is enabled, by default only machine types that can attach local SSD disks will be used. If you specify your own machine type or machine series, make sure they allow attaching local SSD disks, otherwise job scheduling will fail.
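For example, a sketch restricting jobs to machine series that support local SSD disks (the families below are an assumption; check the Compute Engine documentation for current local SSD support):
process.machineType = 'n1-*,n2-*,c2-*'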
Supported directives
The integration with Google Batch is a developer preview feature. Currently the following Nextflow directives are supported:
Cloud Life Sciences
Requirements
The support for Google Cloud requires Nextflow version 20.01.0 or later. To install it, define the following variables in your system environment:
export NXF_VER=20.01.0
export NXF_MODE=google
Note
As of version 21.04.0, the above variables are no longer required and should not be used.
Cloud Life Sciences is a managed computing service that allows the execution of containerized workloads in the Google Cloud Platform infrastructure.
Nextflow provides built-in support for Cloud Life Sciences API which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions through the Google Cloud service.
Note
This feature requires Nextflow 20.01.0-edge or later.
Warning
This API works well for coarse-grained workloads, i.e. long-running jobs. It is not recommended to use this feature for pipelines that spawn many short-lived tasks.
Configuration
Make sure the GOOGLE_APPLICATION_CREDENTIALS variable is defined in your environment. See the section Credentials for details.
Tip
Make sure the Cloud Life Sciences API is enabled to use this feature. To learn how to enable it, follow this link.
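Alternatively, the API can be enabled from the command line; for example, using the gcloud CLI (the service name follows Google Cloud's published identifier):
gcloud services enable lifesciences.googleapis.com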
Create a nextflow.config
file in the project root directory. The config must specify the following parameters:
Google Life Sciences as the Nextflow executor, i.e. process.executor = 'google-lifesciences'
The Docker container image to be used to run pipeline tasks, e.g. process.container = 'biocontainers/salmon:0.8.2--1'
The Google Cloud project ID to run in, e.g. google.project = 'rare-lattice-222412'
The Google Cloud region or zone where the Compute Engine VMs will be started. You need to specify either one, not both. Multiple regions or zones can be specified by separating them with a comma, e.g. google.zone = 'us-central1-f,us-central1-b'
Example:
process {
    executor = 'google-lifesciences'
    container = 'your/container:latest'
}

google {
    project = 'your-project-id'
    zone = 'europe-west1-b'
}
Warning
Make sure to specify the project ID, not the project name.
Note
A container image must be specified to deploy the process execution. You can use a different Docker image for each process using one or more Process selectors.
The following configuration options are available:
Name | Description
---|---
google.project | The Google project ID to use for the pipeline execution.
google.region | The Google region where the computation is executed in Compute Engine VMs. Multiple regions can be provided, separated by a comma. Do not specify if a zone is provided. See the available Compute Engine regions and zones.
google.zone | The Google zone where the computation is executed in Compute Engine VMs. Multiple zones can be provided, separated by a comma. Do not specify if a region is provided. See the available Compute Engine regions and zones.
google.location | The Google location where the job executions are deployed to the Cloud Life Sciences API. See the available Cloud Life Sciences API locations (default: the same as the region or the zone specified).
google.enableRequesterPaysBuckets | When true, the given Google Cloud project ID is used as the billing project for storage access. This is required when accessing data from buckets with requester pays enabled (default: false).
google.lifeSciences.bootDiskSize | Set the size of the virtual machine boot disk, e.g. 50.GB (default: none).
google.lifeSciences.copyImage | The container image run to copy input and output files. It must include the gsutil tool (default: google/cloud-sdk:alpine).
google.lifeSciences.cpuPlatform | Set the minimum CPU platform, e.g. 'Intel Skylake' (default: none).
google.lifeSciences.debug | When true, the /google debug directory is copied into the task bucket directory (default: false).
google.lifeSciences.preemptible | When true, preemptible virtual machines are used (default: false).
google.lifeSciences.usePrivateAddress | When true, the VM is not provided with a public IP address and only has an internal IP address. If this option is enabled, the associated job can only pull Docker images from Google Container Registry, and the job executable cannot use external services other than Google APIs (default: false; requires version 20.03.0-edge or later).
google.lifeSciences.network | Set the network name to attach the VM's network interface to. The value will be prefixed with global/networks/ unless it contains a /, in which case it is assumed to be a fully specified network resource URL. If unspecified, the global default network is used (requires version 21.03.0-edge or later).
google.lifeSciences.serviceAccountEmail | Define the Google service account email to use for the pipeline execution. If not specified, the default Compute Engine service account for the project will be used (requires version 20.05.0-edge or later).
google.lifeSciences.subnetwork | Define the name of the subnetwork to attach the instance to; must be specified when the network is configured for custom subnet creation. The value is prefixed with regions/subnetworks/ unless it contains a /, in which case it is assumed to be a fully specified subnetwork resource URL (requires version 21.03.0-edge or later).
google.lifeSciences.sshDaemon | When true, an SSH daemon is run in the VM carrying out the job, to which you can connect for debugging purposes (default: false).
google.lifeSciences.sshImage | The container image used to run the SSH daemon (default: gcr.io/cloud-genomics-pipelines/tools).
google.lifeSciences.keepAliveOnFailure | When true and a task completes with an unexpected exit status, the associated computing node is kept up for 1 hour. This option implies sshDaemon=true (default: false; requires version 21.06.0-edge or later).
google.storage.delayBetweenAttempts | Delay between download attempts from Google Storage (default: 10 sec; requires version 21.06.0-edge or later).
google.storage.maxParallelTransfers | Max parallel upload/download transfer operations per job (default: 4; requires version 21.06.0-edge or later).
google.storage.maxTransferAttempts | Max number of download attempts from Google Storage (default: 1; requires version 21.06.0-edge or later).
google.storage.parallelThreadCount | Defines the value for the gsutil option GSUtil:parallel_thread_count used to transfer input and output data (default: 1; requires version 21.06.0-edge or later).
google.storage.downloadMaxComponents | Defines the value for the gsutil option GSUtil:sliced_object_download_max_components used to transfer input and output data (default: 8; requires version 21.06.0-edge or later).
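For example, a sketch combining several of these options in one configuration (all values are illustrative):
google {
    project = 'your-project-id'
    zone = 'europe-west1-b'
    lifeSciences.preemptible = true
    lifeSciences.bootDiskSize = '20 GB'
    lifeSciences.debug = true
}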
Process definition
Processes can be defined as usual; by default, the cpus and memory directives are used to instantiate a custom machine type with the specified compute resources. If memory is not specified, 1 GB of memory is allocated per cpu.
A persistent disk will be created with a size corresponding to the disk directive. If disk is not specified, the instance default is chosen to ensure reasonable I/O performance.
The process machineType directive may optionally be used to specify a predefined Google Compute Platform machine type. If specified, this value overrides the cpus and memory directives.
If the cpus and memory directives are used, the values must comply with the allowed custom machine type specifications. Extended memory is not directly supported; however, high-memory or high-cpu predefined instances may be used via the machineType directive.
Examples:
process custom_resources_task {
    cpus 8
    memory '40 GB'
    disk '200 GB'

    """
    <Your script here>
    """
}

process predefined_resources_task {
    machineType 'n1-highmem-8'

    """
    <Your script here>
    """
}
Note
This feature requires Nextflow 19.07.0 or later.
Pipeline execution
The pipeline can be launched either in a local computer or a cloud instance. Pipeline input data can be stored either locally or in a Google Storage bucket.
The pipeline execution must specify a Google Storage bucket where the workflow's intermediate results are stored, using the -work-dir command line option. For example:
nextflow run <script or project name> -work-dir gs://my-bucket/some/path
Tip
Any input data not stored in a Google Storage bucket will automatically be transferred to the pipeline work bucket. Use this feature with caution, being careful to avoid unnecessary data transfers.
Preemptible instances
Preemptible instances are supported by adding the following setting in the Nextflow config file:
google {
    lifeSciences.preemptible = true
}
Since this type of virtual machines can be retired by the provider before the job completion, it is advisable to add the following retry strategy to your config file to instruct Nextflow to automatically re-execute a job if the virtual machine was terminated preemptively:
process {
    errorStrategy = { task.exitStatus==14 ? 'retry' : 'terminate' }
    maxRetries = 5
}
Note
Preemptible instances have a runtime limit of 24 hours.
Tip
For an exhaustive list of all possible error codes, please refer to the official Google Life Sciences documentation.
Hybrid execution
Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed in the local computer or local computing cluster and some other jobs are offloaded to Google Pipelines service.
To enable this feature use one or more Process selectors in your Nextflow configuration file to apply the Google Pipelines executor only to a subset of processes in your workflow. For example:
process {
    withLabel: bigTask {
        executor = 'google-lifesciences'
        container = 'my/image:tag'
    }
}

google {
    project = 'your-project-id'
    zone = 'europe-west1-b'
}
Then deploy the workflow execution using the -bucket-dir option to specify a Google Storage path for the jobs computed by the Google Pipelines service and, optionally, the -work-dir option to specify the local storage for the jobs computed locally:
nextflow run <script or project name> -bucket-dir gs://my-bucket/some/path
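Optionally, the local work directory can be set explicitly as well (the local path below is illustrative):
nextflow run <script or project name> -bucket-dir gs://my-bucket/some/path -work-dir /local/scratch/work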
Warning
The Google Storage path needs to contain at least one sub-directory. Don't use only the bucket name, e.g. gs://my-bucket.
Quotas
Compute resources in Google Cloud are subject to resource quotas which may affect your ability to run pipelines at scale. You can request quota increases, and your quotas may automatically increase over time as you use the platform. In particular, GPU quotas are initially set to 0, so you must explicitly request a quota increase in order to use GPUs. Initially you can request an increase to 1 GPU at a time, and after one billing cycle you may be able to increase it further.
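Once a GPU quota has been granted, a process can request GPUs with the accelerator directive; a minimal sketch (the GPU type and count are illustrative, and the type must be available in the selected zone):
process gpu_task {
    accelerator 1, type: 'nvidia-tesla-t4'
    container 'my/gpu-image:tag'

    """
    <Your script here>
    """
}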
Limitations
Currently, it is not possible to specify a disk type different from the default one assigned by the service, which depends on the chosen instance type.
Troubleshooting
Make sure the Compute Engine API, the Life Sciences API and the Cloud Storage service are enabled in the APIs & Services Dashboard page.
Make sure you have enough compute resources to run your pipeline in your project Quotas (e.g. Compute Engine CPUs, Compute Engine Persistent Disk, Compute Engine In-use IP addresses, etc.).
Make sure your security credentials allow you to access any Google Storage bucket where input data and temporary files are stored.
Check the google/ directory created in the task work directory (in the bucket storage) on job failure; it contains useful information about the job execution. Its creation can be enabled by default by setting the option google.lifeSciences.debug = true in the Nextflow config file.
Enable the optional SSH daemon in the job VM using the option google.lifeSciences.sshDaemon = true.
Make sure you are choosing a location where the Cloud Life Sciences API is available, and a region or zone where Compute Engine is available.