Fusion file system
Introduction
Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.
It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations. Currently, it supports AWS S3 and Google Cloud Storage.
Getting started
Requirements
Fusion file system is designed to work with containerised workloads, therefore it requires the use of a container engine such as Docker or a container native platform for the execution of your pipeline, e.g. AWS Batch or Kubernetes.
It also requires the use of Wave containers and Nextflow version 22.10.0 or later. Support for Google Cloud Storage requires Nextflow version 23.02.0-edge or later.
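For example, a quick way to check the Nextflow version currently installed, and to update it if needed, is with the Nextflow CLI itself:
nextflow -version
# update to the latest release if the installed version is older than 22.10.0
nextflow self-update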
AWS S3 configuration
The AWS credentials used to access the S3 bucket must grant the following IAM permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME/*"
            ],
            "Effect": "Allow"
        }
    ]
}
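As an illustrative sketch, assuming the policy above is saved in a local file named fusion-s3-policy.json (a hypothetical file name), it could be attached to the IAM user that runs the pipeline using the AWS CLI:
# attach the policy to the IAM user used by the pipeline (names are placeholders)
aws iam put-user-policy \
    --user-name <YOUR IAM USER> \
    --policy-name fusion-s3-access \
    --policy-document file://fusion-s3-policy.json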
Use cases
Local execution with S3 bucket as work directory
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or a similar container engine) for the execution of your pipeline tasks.
The AWS S3 bucket credentials must be made accessible via the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
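For example, in the shell session used to launch the pipeline:
export AWS_ACCESS_KEY_ID=<YOUR AWS ACCESS KEY>
export AWS_SECRET_ACCESS_KEY=<YOUR AWS SECRET KEY>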
The following configuration must be added to your Nextflow configuration file:
docker {
    enabled = true
}

fusion {
    enabled = true
    exportAwsAccessKeys = true
}

wave {
    enabled = true
}
Then you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
Replace <YOUR PIPELINE> and <YOUR BUCKET> with a pipeline script and bucket of your choice. For example:
nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch
AWS Batch execution with S3 bucket as work directory
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor. The use of Fusion removes the need to create and configure a custom AMI that includes the aws command line tool when setting up the AWS Batch compute environment.
In this deployment scenario, the following configuration must be added to your Nextflow configuration file:
fusion {
    enabled = true
}

wave {
    enabled = true
}

process {
    executor = 'awsbatch'
    queue = '<YOUR BATCH QUEUE>'
}

aws {
    region = '<YOUR AWS REGION>'
}
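As an illustration, assuming a hypothetical Batch queue named nf-queue in the eu-west-1 region, the configuration might look like this:
fusion {
    enabled = true
}

wave {
    enabled = true
}

process {
    executor = 'awsbatch'
    queue = 'nf-queue'    // hypothetical Batch queue name
}

aws {
    region = 'eu-west-1'  // hypothetical AWS region
}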
Then you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
Kubernetes execution with S3 bucket as work directory
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.
The use of Fusion removes the need to create and manage a separate persistent volume and shared file system in the Kubernetes cluster.
In this deployment scenario, the following configuration must be added to your Nextflow configuration file:
wave {
    enabled = true
}

fusion {
    enabled = true
}

process {
    executor = 'k8s'
}

k8s {
    context = '<YOUR K8S CONFIGURATION CONTEXT>'
    namespace = '<YOUR K8S NAMESPACE>'
    serviceAccount = '<YOUR K8S SERVICE ACCOUNT>'
}
The k8s.context setting represents the Kubernetes configuration context to be used for the pipeline execution. It can be omitted if Nextflow itself runs as a pod in the Kubernetes cluster.
The k8s.namespace setting represents the Kubernetes namespace where the jobs submitted by the pipeline should be executed.
The k8s.serviceAccount setting represents the Kubernetes service account used to grant execution permissions to the jobs launched by Nextflow. See here for more information on configuring the service account.
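As a minimal sketch of such a setup, assuming the hypothetical names nextflow-ns and nextflow-sa and the built-in edit cluster role, the namespace and service account could be created with kubectl as follows:
# create the namespace and the service account used by Nextflow (names are placeholders)
kubectl create namespace nextflow-ns
kubectl create serviceaccount nextflow-sa --namespace nextflow-ns
# grant the service account permission to manage pods in the namespace
kubectl create rolebinding nextflow-sa-edit \
    --clusterrole=edit \
    --serviceaccount=nextflow-ns:nextflow-sa \
    --namespace nextflow-ns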
With the above configuration in place, you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
NVMe storage
Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in parallel to and from object storage into a container-local temporary folder. This temporary folder (/tmp in a default setup) is key for achieving maximum performance.
The temporary folder is used only as a cache, so the size of the volume can be much smaller than the actual needs of your pipeline processes. Fusion has a built-in garbage collector that constantly monitors the remaining disk space in the temporary folder and immediately evicts old cached entries when necessary.
The recommended setup for maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with the Nextflow scratch directive set to false, to avoid stage-out transfer time.
Extra configuration is needed when using AWS Batch with NVMe disks to maximize performance:
aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
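Equivalently, the same settings can be expressed with the block syntax in your Nextflow configuration file:
aws {
    batch {
        // mount the EC2 NVMe disk path as the container temporary folder
        volumes = '/path/to/ec2/nvme:/tmp'
    }
}

process {
    scratch = false
}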
More examples
Check out the Wave showcase repository for more examples of using the Fusion file system.