Writing parallel code with Cython¶
Note
This page uses two different syntax variants:

- Cython specific cdef syntax, which was designed to make type declarations concise and easily readable from a C/C++ perspective.

- Pure Python syntax which allows static Cython type declarations in pure Python code, following PEP-484 type hints and PEP 526 variable annotations.

To make use of C data types in Python syntax, you need to import the special cython module in the Python module that you want to compile, e.g.

import cython

If you use the pure Python syntax we strongly recommend you use a recent Cython 3 release, since significant improvements have been made here compared to the 0.29.x releases.
One method of speeding up your Cython code is parallelization: you write code that can be run on multiple cores of your CPU simultaneously. For code that lends itself to parallelization this can produce quite dramatic speed-ups, up to the number of cores your CPU has (for example a 4× speed-up on a 4-core CPU).
This tutorial assumes that you are already familiar with Cython’s “typed memoryviews” (since code using memoryviews is often the sort of code that’s easy to parallelize with Cython), and also that you’re somewhat familiar with the pitfalls of writing parallel code in general (it aims to be a Cython tutorial rather than a complete introduction to parallel programming).
Before starting, a few notes:
- Not all code can be parallelized - for some code the algorithm simply relies on being executed in order and you should not attempt to parallelize it. A cumulative sum is a good example.

- Not all code is worth parallelizing. There's a reasonable amount of overhead in starting a parallel section and so you need to make sure that you're operating on enough data to make this overhead worthwhile. Additionally, make sure that you are doing actual work on the data! Multiple threads simply reading the same data tends not to parallelize too well. If in doubt, time it.

- Cython requires the contents of parallel blocks to be nogil. If your algorithm requires access to Python objects then it may not be suitable for parallelization.

- Cython's inbuilt parallelization uses the OpenMP constructs omp parallel for and omp parallel. These are ideal for parallelizing relatively small, self-contained blocks of code (especially loops). However, if you want to use other models of parallelization, such as spawning and waiting for tasks, or off-loading some "side work" to a continuously running secondary thread, then you might be better off using other methods (such as Python's threading module).

- Actually implementing your parallel Cython code should probably be one of the last steps in your optimization. You should start with some working serial code first. However, it's worth planning for early since it may affect your choice of algorithm.
This tutorial does not aim to explore all the options available to customize parallelization. See the main parallelism documentation for details. You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed, and if you want specific OpenMP behaviour that Cython doesn't provide by default you may be better off writing it in C yourself.
Compilation¶
OpenMP requires support from your C/C++ compiler. This support is usually enabled through a special command-line argument: on GCC this is -fopenmp while on MSVC it is /openmp. If your compiler doesn't support OpenMP (or if you forget to pass the argument) then the code will usually still compile but will not run in parallel.

The following setup.py file can be used to compile the examples in this tutorial:
from setuptools import Extension, setup
from Cython.Build import cythonize
import sys

if sys.platform.startswith("win"):
    openmp_arg = '/openmp'
else:
    openmp_arg = '-fopenmp'

ext_modules = [
    Extension(
        "*",
        ["*.pyx"],
        extra_compile_args=[openmp_arg],
        extra_link_args=[openmp_arg],
    ),
    Extension(
        "*",
        ["*.pyx"],
        extra_compile_args=[openmp_arg],
        extra_link_args=[openmp_arg],
    )
]

setup(
    name='parallel-tutorial',
    ext_modules=cythonize(ext_modules),
)
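With this setup.py in place, the extension modules can then be built in the usual setuptools way, for example with python setup.py build_ext --inplace.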
Element-wise parallel operations¶
The easiest and most common parallel operation in Cython is to iterate across an array element-wise, performing the same operation on each array element. In the simple example below we calculate the sin of every element in an array:
from cython.parallel cimport prange
cimport cython
from libc.math cimport sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine(double[:,:] input):
    cdef double[:,:] output = np.empty_like(input)
    cdef Py_ssize_t i, j

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)
from cython.parallel import prange
import cython
from cython.cimports.libc.math import sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine(input: cython.double[:,:]):
    output: cython.double[:,:] = np.empty_like(input)
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)
We parallelize the outermost loop. This is usually a good idea since there is some overhead to entering and leaving a parallel block. However, you should also consider the likely size of your arrays. If input usually had a size of (2, 10000000) then parallelizing over the dimension of length 2 would likely be a worse choice.
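As a hypothetical illustration (not part of the original example set; the function name do_sine_inner is invented here), in that situation one could instead keep the outer loop serial and parallelize the long inner dimension, accepting that a parallel region is then entered once per outer iteration:

import cython
import numpy as np
from cython.parallel import prange
from cython.cimports.libc.math import sin

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine_inner(input: cython.double[:,:]):
    output: cython.double[:,:] = np.empty_like(input)
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t
    for i in range(input.shape[0]):
        # parallelize the long inner dimension; a new parallel region
        # starts on every pass through the short outer loop
        for j in prange(input.shape[1], nogil=True):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)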
The body of the loop itself is nogil
- i.e. you cannot perform
“Python” operations. This is a fairly strong limitation and if you
find that you need to use the GIL then it is likely that Cython’s
parallelization features are not suitable for you. It is possible
to throw exceptions from within the loop, however – Cython simply
regains the GIL and raises an exception, then terminates the loop
on all threads.
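For instance, a value check can raise directly from inside the nogil loop. The sketch below (pure Python syntax, not taken from the original examples; the function name sum_non_negative is invented here) relies only on that documented behaviour:

import cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_non_negative(x: cython.double[:]):
    total: cython.double = 0
    i: cython.Py_ssize_t
    for i in prange(x.shape[0], nogil=True):
        if x[i] < 0:
            # Cython re-acquires the GIL to raise, then stops the loop
            raise ValueError("negative value encountered")
        total += x[i]
    return total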
It’s necessary to explicitly type the loop variable i as a C integer. For a non-parallel loop Cython can infer this, but it does not currently infer the loop variable for parallel loops, so not typing i will lead to compile errors since it will be a Python object and so unusable without the GIL.
The C code generated is shown below, for the benefit of experienced users of OpenMP. It is simplified a little for readability here:
#pragma omp parallel
{
    #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j)
    for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
        i = __pyx_t_8;
        /* body goes here */
    }
}
Private variables¶
One useful point to note from the generated C code above is that the variables used in the loops, such as i and j, are marked as firstprivate and lastprivate. Within the loop each thread has its own copy of the data, that copy is initialized according to the variable's value before the loop, and after the loop the “global” copy is set equal to the last iteration (i.e. as if the loop were run in serial).
The basic rules that Cython applies are:
- C scalar variables within a prange block are made firstprivate and lastprivate,

- C scalar variables assigned within a parallel block are private (which means they can’t be used to pass data in and out of the block),

- array variables (e.g. memoryviews) are not made private. Instead Cython assumes that you have structured your loop so that each iteration is acting on different data,

- Python objects are also not made private, although access to them is controlled via Python’s GIL.

Cython does not currently provide much opportunity for overriding these choices.
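As a small illustration of the first rule (a sketch in the pure Python syntax, not from the original tutorial; the function name last_element is invented here), a scalar assigned inside a prange loop ends up holding the value from the final iteration once the loop finishes, just as it would in a serial loop:

import cython
from cython.parallel import prange

def last_element(x: cython.double[:]):
    last: cython.double = 0
    i: cython.Py_ssize_t
    for i in prange(x.shape[0], nogil=True):
        # `last` is firstprivate/lastprivate: each thread updates its own
        # copy, and after the loop `last` holds the value from the final
        # iteration (x[x.shape[0] - 1]), as if the loop had run serially.
        last = x[i]
    return last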
Reductions¶
The second most common parallel operation in Cython is the “reduction” operation. A common example is to accumulate a sum over the whole array, such as in the calculation of a vector norm below:
from cython.parallel cimport prange
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def l2norm(double[:] x):
    cdef double total = 0
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True):
        total += x[i]*x[i]
    return sqrt(total)
from cython.parallel import prange
import cython
from cython.cimports.libc.math import sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def l2norm(x: cython.double[:]):
    total: cython.double = 0
    i: cython.Py_ssize_t
    for i in prange(x.shape[0], nogil=True):
        total += x[i]*x[i]
    return sqrt(total)
Cython is able to infer reductions for +=, *=, -=, &=, |=, and ^=. These only apply to C scalar variables, so you cannot easily reduce a 2D memoryview to a 1D memoryview, for example.
The C code generated is approximately:
#pragma omp parallel reduction(+:total)
{
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        i = __pyx_t_2;
        total = total + /* some indexing code */;
    }
}
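If you do need something like a 2D-to-1D reduction, one workaround (a sketch in the pure Python syntax, not from the original tutorial; the function name row_sums is invented here) is to parallelize over the output elements and accumulate each row into a thread-local scalar:

import cython
import numpy as np
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def row_sums(x: cython.double[:,:]):
    out: cython.double[::1] = np.empty(x.shape[0])
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t
    row_total: cython.double = 0
    for i in prange(x.shape[0], nogil=True):
        # Plain assignment (rather than +=) keeps row_total an ordinary
        # thread-local variable instead of a reduction, so it can be both
        # read and written freely within each iteration.
        row_total = 0
        for j in range(x.shape[1]):
            row_total = row_total + x[i, j]
        out[i] = row_total
    return np.asarray(out)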
parallel blocks¶
Much less frequently used than prange is Cython’s parallel operator. parallel generates a block of code that is run on multiple threads at once. Unlike prange, however, work is not automatically divided between threads.

Here we present three common uses for the parallel block:
Stringing together prange blocks¶
There is some overhead in entering and leaving a parallelized section. Therefore, if you have multiple parallel sections with small serial sections in between it can be more efficient to write one large parallel block. Any small serial sections are duplicated, but the overhead is reduced.
In the example below we do an in-place normalization of a vector. The first parallel loop calculates the norm, the second parallel loop applies the norm to the vector, and we avoid jumping in and out of serial code in between.
from cython.parallel cimport parallel, prange
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def normalize(double[:] x):
    cdef Py_ssize_t i
    cdef double total = 0
    cdef double norm
    with nogil, parallel():
        for i in prange(x.shape[0]):
            total += x[i]*x[i]
        norm = sqrt(total)
        for i in prange(x.shape[0]):
            x[i] /= norm
from cython.parallel import parallel, prange
import cython
from cython.cimports.libc.math import sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def normalize(x: cython.double[:]):
    i: cython.Py_ssize_t
    total: cython.double = 0
    norm: cython.double
    with cython.nogil, parallel():
        for i in prange(x.shape[0]):
            total += x[i]*x[i]
        norm = sqrt(total)
        for i in prange(x.shape[0]):
            x[i] /= norm
The C code is approximately:
#pragma omp parallel private(norm) reduction(+:total)
{
    /* some calculations of array size... */
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        /* ... */
    }
    norm = sqrt(total);
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        /* ... */
    }
}
Allocating “scratch space” for each thread¶
Suppose that each thread requires a small amount of scratch space to work in. They cannot share scratch space because that would lead to data races. In this case the allocation and deallocation is done in a parallel section (so occurs on a per-thread basis) surrounding a loop which then uses the scratch space.
Our example here uses C++ to find the median of each column in a 2D array (just a parallel version of numpy.median(x, axis=0)). We must reorder each column to find the median of it, but don’t want to modify the input array. Therefore, we allocate a C++ vector per thread to use as scratch space, and work in that. For efficiency the vector is allocated outside the prange loop.
# distutils: language = c++

from cython.parallel cimport parallel, prange
from libcpp.vector cimport vector
from libcpp.algorithm cimport nth_element
cimport cython
from cython.operator cimport dereference

import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def median_along_axis0(const double[:,:] x):
    cdef double[::1] out = np.empty(x.shape[1])
    cdef Py_ssize_t i, j

    cdef vector[double] *scratch
    cdef vector[double].iterator median_it
    with nogil, parallel():
        # allocate scratch space per loop
        scratch = new vector[double](x.shape[0])
        try:
            for i in prange(x.shape[1]):
                # copy row into scratch space
                for j in range(x.shape[0]):
                    dereference(scratch)[j] = x[j, i]
                median_it = scratch.begin() + scratch.size()//2
                nth_element(scratch.begin(), median_it, scratch.end())
                # for the sake of a simple example, don't handle even lengths...
                out[i] = dereference(median_it)
        finally:
            del scratch
    return np.asarray(out)
# distutils: language = c++

from cython.parallel import parallel, prange
from cython.cimports.libc.stdlib import malloc, free
from cython.cimports.libcpp.algorithm import nth_element
import cython
from cython.operator import dereference
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def median_along_axis0(x: cython.double[:,:]):
    out: cython.double[::1] = np.empty(x.shape[1])
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t

    scratch: cython.pointer(cython.double)
    median_it: cython.pointer(cython.double)
    with cython.nogil, parallel():
        # allocate scratch space per loop
        scratch = cython.cast(
            cython.pointer(cython.double),
            malloc(cython.sizeof(cython.double)*x.shape[0]))
        try:
            for i in prange(x.shape[1]):
                # copy row into scratch space
                for j in range(x.shape[0]):
                    scratch[j] = x[j, i]
                median_it = scratch + x.shape[0]//2
                nth_element(scratch, median_it, scratch + x.shape[0])
                # for the sake of a simple example, don't handle even lengths...
                out[i] = dereference(median_it)
        finally:
            free(scratch)
    return np.asarray(out)
Note
Pure and classic syntax examples are not quite identical since pure Python syntax does not support C++ “new”, so we allocate the scratch space slightly differently.
In the generated code the scratch variable is marked as private in the outer parallel block. A rough outline is:
#pragma omp parallel private(scratch)
{
    scratch = new std::vector<double> ((x.shape[0]))
    #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) lastprivate(median_it)
    for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_10; __pyx_t_9++){
        i = __pyx_t_9;
        /* implementation goes here */
    }
    /* some exception handling detail omitted */
    delete scratch;
}
Performing different tasks on each thread¶
Finally, if you manually specify the number of threads and then identify each thread using omp_get_thread_num() you can manually split work between threads. This is a fairly rare use-case in Cython, and probably suggests that the threading module is more suitable for what you’re trying to do. However, it is an option.
from cython.parallel cimport parallel
from openmp cimport omp_get_thread_num

cdef void long_running_task1() nogil:
    pass

cdef void long_running_task2() nogil:
    pass

def do_two_tasks():
    cdef int thread_num
    with nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            long_running_task1()
        elif thread_num == 1:
            long_running_task2()
from cython.parallel import parallel
from cython.cimports.openmp import omp_get_thread_num
import cython

@cython.cfunc
@cython.nogil
def long_running_task1() -> cython.void:
    pass

@cython.cfunc
@cython.nogil
def long_running_task2() -> cython.void:
    pass

def do_two_tasks():
    thread_num: cython.int
    with cython.nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            long_running_task1()
        elif thread_num == 1:
            long_running_task2()
The utility of this kind of block is limited by the fact that variables assigned to in the block are private to each thread, so cannot be accessed in the serial section afterwards.
The generated C code for the example above is fairly simple:
#pragma omp parallel private(thread_num)
{
    thread_num = omp_get_thread_num();
    switch (thread_num) {
        /* ... */
    }
}
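One common way around this restriction (a sketch in the pure Python syntax, not from the original tutorial; the function name do_two_tasks_with_results and the stored values are invented here) is to write each thread's result into a pre-allocated array element indexed by the thread number, since memoryviews are shared between threads rather than made private:

import cython
import numpy as np
from cython.parallel import parallel
from cython.cimports.openmp import omp_get_thread_num

def do_two_tasks_with_results():
    results: cython.double[::1] = np.zeros(2)
    thread_num: cython.int
    with cython.nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            results[0] = 1.0  # stand-in for the result of long_running_task1()
        elif thread_num == 1:
            results[1] = 2.0  # stand-in for the result of long_running_task2()
    # the shared array survives the parallel block, unlike private scalars
    return np.asarray(results)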