Deploy Dask Clusters

The dask.distributed scheduler works well on a single machine and scales to many machines in a cluster. We recommend using dask.distributed clusters at all scales for the following reasons:

  1. It provides access to asynchronous APIs, notably Futures (see the sketch after this list).

  2. It provides a diagnostic dashboard that can provide valuable insight on performance and progress (see Dashboard Diagnostics).

  3. It handles data locality with sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes.
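
For example, with a dask.distributed Client (described below) you can submit individual function calls and work with the resulting Futures. A minimal sketch:

from dask.distributed import Client

client = Client()  # Starts a local scheduler and workers

def inc(x):
    return x + 1

future = client.submit(inc, 10)  # Runs asynchronously on a worker
print(future.result())           # Blocks until the result arrives: 11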

This page describes various ways to set up Dask clusters on different hardware, either locally on your own machine or on a distributed cluster.


If you import Dask, set up a computation, and call compute, then you will use the single-machine scheduler by default.

import dask.dataframe as dd
df = dd.read_csv(...)
df.x.sum().compute()  # This uses the single-machine scheduler by default

To use the dask.distributed scheduler you must set up a Client.

from dask.distributed import Client
client = Client(...)  # Connect to distributed cluster and override default
df.x.sum().compute()  # This now runs on the distributed system

There are many ways to start the distributed scheduler and worker components, but the most straightforward way is to use a cluster manager utility class.

from dask.distributed import Client, LocalCluster
cluster = LocalCluster()  # Launches a scheduler and workers locally
client = Client(cluster)  # Connect to distributed cluster and override default
df.x.sum().compute()  # This now runs on the distributed system

These cluster managers deploy a scheduler and the necessary workers as determined by communicating with the resource manager. All cluster managers follow the same interface, but with platform-specific configuration options, so you can switch from your local machine to a remote cluster with minimal code changes.
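
In particular, cluster managers share methods such as scale and adapt for requesting workers. A minimal sketch using LocalCluster (other cluster managers work the same way):

from dask.distributed import LocalCluster

cluster = LocalCluster()
cluster.scale(4)                      # Ask the cluster for four workers
cluster.adapt(minimum=1, maximum=10)  # Or let the cluster resize itself based on load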

[Figure: An overview of cluster management with Dask distributed.]

Dask Jobqueue, for example, is a set of cluster managers for HPC users and works with job queueing systems (in this case, the resource manager) such as PBS, Slurm, and SGE. The queueing system then allocates physical hardware resources to the scheduler and workers.

from dask.distributed import Client
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=24, memory="100GB")  # Launches a scheduler and workers on HPC via PBS (resource values are illustrative)
cluster.scale(jobs=2)  # Ask PBS for two jobs' worth of workers
client = Client(cluster)  # Connect to distributed cluster and override default
df.x.sum().compute()  # This now runs on the distributed system

The following resources explain how to set up Dask on a variety of local and distributed hardware.

Single Machine

Dask runs perfectly well on a single machine with or without a distributed scheduler. But once you start using Dask in anger you'll find a lot of benefit, both in terms of scaling and debugging, from using the distributed scheduler.
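
For example, creating a Client with no arguments starts a local distributed cluster for you, including the diagnostic dashboard:

from dask.distributed import Client

client = Client()             # Creates a LocalCluster behind the scenes
print(client.dashboard_link)  # URL of the diagnostic dashboard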

  • Default Scheduler

    The no-setup default. Uses local threads or processes for larger-than-memory processing (see the sketch after this list).

  • dask.distributed

    The sophistication of the newer system on a single machine. This provides more advanced features while still requiring almost no setup.
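
As a sketch of the default scheduler, you can choose between local threads and local processes with the scheduler keyword to compute:

import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))
x.sum().compute(scheduler="threads")    # Local threads, the default for arrays and dataframes
x.sum().compute(scheduler="processes")  # Local processes instead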

Distributed Computing

There are a number of ways to run Dask on a distributed cluster (see the Beginner’s Guide to Configuring a Distributed Dask Cluster).

High Performance Computing

See High Performance Computers for more details.

  • Dask-Jobqueue

    Provides cluster managers for PBS, Slurm, LSF, SGE, and other resource managers.

  • Dask-MPI

    Deploy Dask from within an existing MPI environment (see the sketch after this list).

  • Dask Gateway for Jobqueue

    Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying HPC backend.
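
As a minimal sketch of the Dask-MPI approach, a script launched under MPI can bootstrap a cluster from the MPI ranks (the launch command and script name are illustrative; the rank layout follows dask_mpi.initialize defaults):

# Launched with, for example: mpirun -np 4 python my_script.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()       # Rank 0 becomes the scheduler, rank 1 runs this client code,
                   # and the remaining ranks become workers
client = Client()  # Connects to the scheduler started by initialize()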

Kubernetes

See Kubernetes for more details.
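
As a minimal sketch, assuming the dask-kubernetes package and its operator are installed on the cluster (the cluster name here is illustrative):

from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(name="my-dask-cluster")  # Illustrative name; launches pods via the operator
cluster.scale(3)                               # Ask for three worker pods
client = Client(cluster)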

Cloud

See Cloud for more details.

  • Dask-Yarn

    Deploy Dask on YARN clusters, such as those found in traditional Hadoop installations.

  • Dask Cloud Provider

    Construct and manage ephemeral Dask clusters on AWS, DigitalOcean, Google Cloud, Azure, and Hetzner (see the sketch after this list).

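As a minimal sketch of Dask Cloud Provider, assuming dask-cloudprovider is installed with its AWS extras and your AWS credentials are configured:

from dask_cloudprovider.aws import FargateCluster
from dask.distributed import Client

cluster = FargateCluster(n_workers=2)  # Launches a scheduler and two workers on AWS Fargate
client = Client(cluster)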

Ad-hoc deployments

  • Manual Setup

    The command line interface to set up dask-scheduler and dask-worker processes.

  • SSH

    Use SSH to set up Dask across an unmanaged cluster (see the first sketch after this list).

  • Python API (advanced)

    Create Scheduler and Worker objects from Python as part of a distributed Tornado TCP application (see the second sketch after this list).
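
As a minimal sketch of the SSH approach, assuming passwordless SSH access to the machines (the hostnames are illustrative; the first entry runs the scheduler and the rest run workers):

from dask.distributed import Client, SSHCluster

cluster = SSHCluster(["host1", "host2", "host3"])  # Scheduler on host1, workers on host2 and host3
client = Client(cluster)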
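
And as a minimal sketch of the advanced Python API, where the scheduler and worker are created directly as async context managers:

import asyncio
from dask.distributed import Client, Scheduler, Worker

async def main():
    async with Scheduler() as scheduler:       # Start a scheduler in this process
        async with Worker(scheduler.address):  # Attach one worker to it
            async with Client(scheduler.address, asynchronous=True) as client:
                future = client.submit(lambda x: x + 1, 10)
                print(await future)            # Prints 11

asyncio.run(main())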

Managed Solutions

  • You can use Coiled to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).

  • Domino Data Lab lets users create Dask clusters in a hosted platform.

  • Saturn Cloud lets users create Dask clusters in a hosted platform or within their own AWS accounts.