Deploy Dask Clusters
Contents
Deploy Dask Clusters¶
The dask.distributed
scheduler works well on a single machine and scales to many machines
in a cluster. We recommend using dask.distributed
clusters at all scales for the following
reasons:
It provides access to asynchronous APIs, notably Futures.
It provides a diagnostic dashboard that can provide valuable insight on performance and progress (see Dashboard Diagnostics).
It handles data locality with sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes.
This page describes various ways to set up Dask clusters on different hardware, either locally on your own machine or on a distributed cluster.
You can continue reading or watch the screencast below:
If you import Dask, set up a computation, and call compute
, then you
will use the single-machine scheduler by default.
import dask.dataframe as dd
df = dd.read_csv(...)
df.x.sum().compute() # This uses the single-machine scheduler by default
To use the dask.distributed
scheduler you must set up a Client
.
from dask.distributed import Client
client = Client(...) # Connect to distributed cluster and override default
df.x.sum().compute() # This now runs on the distributed system
There are many ways to start the distributed scheduler and worker components, however, the most straight forward way is to use a cluster manager utility class.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster() # Launches a scheduler and workers locally
client = Client(cluster) # Connect to distributed cluster and override default
df.x.sum().compute() # This now runs on the distributed system
These cluster managers deploy a scheduler and the necessary workers as determined by communicating with the resource manager. All cluster managers follow the same interface, but with platform-specific configuration options, so you can switch from your local machine to a remote cluster with very minimal code changes.
An overview of cluster management with Dask distributed.¶
Dask Jobqueue, for example, is a set of cluster managers for HPC users and works with job queueing systems (in this case, the resource manager) such as PBS, Slurm, and SGE. Those workers are then allocated physical hardware resources.
from dask.distributed import Client
from dask_jobqueue import PBSCluster
cluster = PBSCluster() # Launches a scheduler and workers on HPC via PBS
client = Client(cluster) # Connect to distributed cluster and override default
df.x.sum().compute() # This now runs on the distributed system
The following resources explain how to set up Dask on a variety of local and distributed hardware.
Single Machine¶
Dask runs perfectly well on a single machine with or without a distributed scheduler. But once you start using Dask in anger you’ll find a lot of benefit both in terms of scaling and debugging by using the distributed scheduler.
- Default Scheduler
The no-setup default. Uses local threads or processes for larger-than-memory processing
- dask.distributed
The sophistication of the newer system on a single machine. This provides more advanced features while still requiring almost no setup.
Distributed Computing¶
There are a number of ways to run Dask on a distributed cluster (see the Beginner’s Guide to Configuring a Distributed Dask Cluster).
High Performance Computing¶
See High Performance Computers for more details.
- Dask-Jobqueue
Provides cluster managers for PBS, SLURM, LSF, SGE and other resource managers.
- Dask-MPI
Deploy Dask from within an existing MPI environment.
- Dask Gateway for Jobqueue
Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying HPC backend.
Kubernetes¶
See Kubernetes for more details.
- Dask Kubernetes Operator
For native Kubernetes integration for fast moving or ephemeral deployments.
- Dask Gateway for Kubernetes
Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying Kubernetes backend.
- Single Cluster Helm Chart
Single Dask cluster and (optionally) Jupyter on deployed with Helm.
Cloud¶
See Cloud for more details.
- Dask-Yarn
Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.
- Dask Cloud Provider
Constructing and managing ephemeral Dask clusters on AWS, DigitalOcean, Google Cloud, Azure, and Hetzner
You can use Coiled, a commercial Dask deployment option, to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
Ad-hoc deployments¶
- Manual Setup
The command line interface to set up
dask-scheduler
anddask-worker
processes.
- SSH
Use SSH to set up Dask across an un-managed cluster.
- Python API (advanced)
Create
Scheduler
andWorker
objects from Python as part of a distributed Tornado TCP application.
Managed Solutions¶
You can use Coiled to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
Domino Data Lab lets users create Dask clusters in a hosted platform.
Saturn Cloud lets users create Dask clusters in a hosted platform or within their own AWS accounts.