Fusion file system

Introduction

Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.

It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations. Currently, it supports AWS S3.

Getting started

Requirements

Fusion file system is designed to work with containerised workloads; therefore, it requires a container engine such as Docker, or a container-native platform such as AWS Batch or Kubernetes, for the execution of your pipeline.

It also requires the use of Wave containers and Nextflow version 22.10.0 or later.

AWS S3 configuration

The AWS S3 bucket should be configured with the following IAM permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME/*"
            ]
        }
    ]
}
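
For example, assuming the policy above is saved in a file named fusion-policy.json (a hypothetical name), one way to attach it to an existing IAM user is with the AWS CLI:

aws iam put-user-policy \
    --user-name <YOUR IAM USER> \
    --policy-name fusion-s3-access \
    --policy-document file://fusion-policy.json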

Use cases

Local execution with S3 bucket as work directory

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires Docker, or a similar container engine, for the execution of your pipeline tasks.

The AWS S3 bucket credentials should be made accessible via the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
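
For example, export them in the shell session used to launch the pipeline (the values shown are placeholders):

export AWS_ACCESS_KEY_ID=<YOUR ACCESS KEY>
export AWS_SECRET_ACCESS_KEY=<YOUR SECRET KEY>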

The following configuration should be added in your Nextflow configuration file:

docker {
  enabled = true
  envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY'
}

fusion {
  enabled = true
}

wave {
  enabled = true
}

Then you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch

Replace <YOUR PIPELINE> and <YOUR BUCKET> with a pipeline script and bucket of your choice, for example:

nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch

AWS Batch execution with S3 bucket as work directory

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor. Fusion eliminates the need to create and configure a custom AMI that includes the aws command line tool when setting up the AWS Batch compute environment.

The configuration for this deployment scenario looks like the following:

fusion {
  enabled = true
}

wave {
  enabled = true
}

process {
  executor = 'awsbatch'
  queue = '<YOUR BATCH QUEUE>'
}

aws {
  region = '<YOUR AWS REGION>'
}

Then you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
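
For example, using the same rnaseq-nf pipeline and bucket as in the local execution example above:

nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch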

Kubernetes execution with S3 bucket as work directory

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.

Fusion eliminates the need to create and manage a separate persistent volume and shared file system in the Kubernetes cluster.

The configuration for this deployment scenario looks like the following:

wave {
  enabled = true
}

fusion {
  enabled = true
}

process {
  executor = 'k8s'
}

k8s {
  context = '<YOUR K8S CONFIGURATION CONTEXT>'
  namespace = '<YOUR K8S NAMESPACE>'
  serviceAccount = '<YOUR K8S SERVICE ACCOUNT>'
}

The k8s.context setting represents the Kubernetes configuration context to be used for the pipeline execution. This setting can be omitted if Nextflow itself runs as a pod in the Kubernetes cluster.

The k8s.namespace setting represents the Kubernetes namespace where the jobs submitted by the pipeline execution should run.

The k8s.serviceAccount setting represents the Kubernetes service account that should be used to grant execution permissions to jobs launched by Nextflow. You can find more details on how to configure it in the Nextflow Kubernetes documentation.
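
If you do not already have a suitable service account, the following sketch shows one way to create and authorise it with kubectl. The role binding name nextflow-edit and the use of the built-in edit cluster role are assumptions; adjust them to your cluster's policies:

# Create the service account in the target namespace
kubectl create serviceaccount <YOUR K8S SERVICE ACCOUNT> --namespace <YOUR K8S NAMESPACE>

# Grant it permission to manage pods and jobs in that namespace
kubectl create rolebinding nextflow-edit \
    --clusterrole=edit \
    --serviceaccount=<YOUR K8S NAMESPACE>:<YOUR K8S SERVICE ACCOUNT> \
    --namespace=<YOUR K8S NAMESPACE>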

Having the above configuration in place, you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch

NVMe storage

Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in parallel to and from object storage into a container-local temporary folder. This means that the performance of the temporary folder inside the container (/tmp in a default setup) is key to achieving maximum performance.

The temporary folder is used only as a cache, so the size of the volume can be much smaller than the actual needs of your pipeline processes. Fusion has a built-in garbage collector that constantly monitors the remaining disk space in the temporary folder and immediately evicts old cached entries when necessary.

The recommended setup to achieve maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with the Nextflow scratch directive set to false, to also avoid stage-out transfer time.

The following extra configuration is needed when using AWS Batch with NVMe disks to maximize performance:

aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
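
For reference, a minimal sketch of formatting and mounting an instance NVMe disk on the EC2 host, typically done in the launch template user data. The device name /dev/nvme1n1 is an assumption and varies by instance type; check with lsblk:

# Format the instance NVMe disk
mkfs -t ext4 /dev/nvme1n1

# Mount it at the path referenced by aws.batch.volumes
mkdir -p /path/to/ec2/nvme
mount /dev/nvme1n1 /path/to/ec2/nvme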

More examples

Check out the Wave showcase repository for more examples of how to use the Fusion file system.