DOKK Library

Proceedings of the 21st Python in Science Conference

Authors: Chris Calloway, David Shupe, Dillon Niederhut, Meghann Agarwal

License CC-BY-3.0

Proceedings of the 21st Python in Science Conference
Edited by Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe.

SciPy 2022
Austin, Texas
July 11 - July 17, 2022

Copyright © 2022. The articles in the Proceedings of the Python in Science Conference are copyrighted and owned by their
original authors.
This is an open-access publication and is distributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
For more information, please see:


Conference Chairs
     Alexandre Chabot-Leclerc, Enthought, Inc.

Program Chairs
     Matt Haberland, Cal Poly
     Julie Hollek, Mozilla
     Madicken Munk, University of Illinois
     Guen Prawiroatmodjo, Microsoft Corp

    Matt Davis, Populus
    David Nicholson, Embedded Intelligence

Birds of a Feather
      Anastasiia Sarmakeeva, George Washington University

     Meghann Agarwal, Overhaul
     Chris Calloway, University of North Carolina
     Dillon Niederhut, Novi Labs
     David Shupe, Caltech's IPAC Astronomy Data Center

Financial Aid
     Scott Collis, Argonne National Laboratory
     Nadia Tahiri, Université de Montréal

      Logan Thomas, Enthought, Inc.

      Tania Allard, Quansight Labs
      Brigitta Sipőcz, Caltech/IPAC

      Celia Cintas, IBM Research Africa
      Bonny P McClain, O'Reilly Media
      Fatma Tarlaci, OpenTeams

      Paul Anzel, Codecov
      Inessa Pawson, Albus Code

     Kristen Leiser, Enthought, Inc.

     Chris Chan, Enthought, Inc.
     Bill Cowan, Enthought, Inc.
     Jodi Havranek, Enthought, Inc.

      Kristen Leiser, Enthought, Inc.
Proceedings Reviewers
     Brian Gue
     Ed Rogers
     Jyh-Miin Lin
     Kevin W. Beam

     Ralf Grosse-Kunstleve, Wenzel Jakob, Matthieu Darbois, Aaron Gokaslan, Jean-Christophe Fillion-Robin, Matt McCormick
     son P. Morgan, Kyle E. Niemeyer
     Awkward Packaging: Building Scikit-HEP, Henry Schreiner, Jim Pivarski, Eduardo Rodrigues
     Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk
     Carlo Library, W. Scott Shambaugh
     UFuncs and DTypes: New Possibilities in NumPy, Sebastian Berg, Stéfan van der Walt
     pyampute: A Python Library for Data Amputation, Rianne M Schouten, Davina Zamanzadeh, Prabhant
     Scientific Python: From GitHub to TikTok, Juanita Gomez Romero, Stéfan van der Walt, K. Jarrod Millman, Melissa Weber Mendonça, Inessa Pawson
     Scientific Python: By Maintainers, For Maintainers, Pamphile T. Roy, Stéfan van der Walt, K. Jarrod Millman, Melissa Weber Mendonça
     Matt Haberland, Christoph Baumgarten, Tirth Patel
     Thomas Nicholas, Julius Busecke, Ryan Abernathey
     in SciPy 1.9, Matt Haberland, Nicholas McKibben
     veloper, Daniel Althviz Moré
     run, Olivia P. Dizon-Paradis, Dan E. Capecci, Damon L. Woodard, Navid Asadizanjani
     Bioframe: Operating on Genomic Interval Dataframes, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra Galitsyna, Anton Goloborodko, Maxim Imakaev, Trevor Manz, Sergey V.
     James D. Gaboardi
     otika Singh
     Kiwi: Python Tool for TeX Processing and Classification, Neelima Pulagam, Sai Marasani, Brian
     My-Linh Luu, Nadia Tahiri
     Design of a Scientific Data Analysis Support Platform, Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams
     Planet's Largest Cloud Observatory, Zachary Sherman, Scott Collis, Max Grover, Robert Jackson, Adam Theisen

     SciPy Tools Plenary - CEL Team, Inessa Pawson
     SciPy Tools Plenary on Matplotlib, Elliott Sales de Andrade
     SciPy Tools Plenary - NumPy, Inessa Pawson



     Aman Goel, University of Delhi
     Anurag Saha Roy, Saarland University
     Isuru Fernando, University of Illinois at Urbana Champaign
     Kelly Meehan, US Forest Service
     Kadambari Devarajan, University of Rhode Island
     Krishna Katyal, Thapar Institute of Engineering and Technology
     Naman Gera, Sympy, LPython
     Rohit Goswami, University of Iceland
     Tanya Akumu, IBM Research
     Zuhal Cakir, Purdue University

The Advanced Scientific Data Format (ASDF): An Update                                                                  1
Perry Greenfield, Edward Slavich, William Jamieson, Nadia Dencheva

Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling                                           7
Nathan Jessurun, Daniel E. Capecci, Olivia P. Dizon-Paradis, Damon L. Woodard, Navid Asadizanjani

Galyleo: A General-Purpose Extensible Visualization Solution                                                          13
Rick McGeer, Andreas Bergen, Mahdiyar Biazi, Matt Hemmings, Robin Schreiber

USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application                                    22
Amanda Catlett, Theresa R. Coumbe, Scott D. Christensen, Mary A. Byrant

Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI                                                   26
Luigi Cruz, Wael Farah, Richard Elkins

Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration                  28
Pi-Yueh Chuang, Lorena A. Barba

atoMEC: An open-source average-atom Python code                                                                       37
Timothy J. Callow, Daniel Kotik, Eli Kraisler, Attila Cangi

Automatic random variate generation in Python                                                                         46
Christoph Baumgarten, Tirth Patel

Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger
Materials Suite                                                                                                      52
Alexandr Fonari, Farshad Fallah, Michael Rauch

A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space 60
Seyed Alireza Vaezi, Gianni Orlando, Mojtaba Fazli, Gary Ward, Silvia Moreno, Shannon Quinn

The myth of the normal curve and what to do about it                                                                  64
Allan Campopiano

Python for Global Applications: teaching scientific Python in context to law and diplomacy students                   69
Anna Haensch, Karin Knudson

Papyri: better documentation for the scientific ecosystem in Jupyter                                                  75
Matthias Bussonnier, Camille Carvalho

Bayesian Estimation and Forecasting of Time Series in statsmodels                                                     83
Chad Fulton

Python vs. the pandemic: a case study in high-stakes software development                                             90
Cliff C. Kerr, Robyn M. Stuart, Dina Mistry, Romesh G. Abeysuriya, Jamie A. Cohen, Lauren George, Michał
Jastrzebski, Michael Famulare, Edward Wenger, Daniel J. Klein

Pylira: deconvolution of images in the presence of Poisson noise                                                      98
Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk

Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels                                          105
Geoffrey M. Poore

Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder                110
Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn

Awkward Packaging: building Scikit-HEP                                                                               115
Henry Schreiner, Jim Pivarski, Eduardo Rodrigues
Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber                           121
Ido Michael

Likeness: a toolkit for connecting the social fabric of place to human dynamics                                    125
Joseph V. Tuccillo, James D. Gaboardi

poliastro: a Python library for interactive astrodynamics                                                          136
Juan Luis Cano Rodríguez, Jorge Martínez Garrido

A New Python API for Webots Robotics Simulations                                                                   147
Justin C. Fisher

pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling                             152
Jyotika Singh

Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2                                                159
Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri

Global optimization software library for research and education                                                    167
Nadia Udler

Temporal Word Embeddings Analysis for Disease Prevention                                                           171
Nathan Jacobi, Ivan Mo, Albert You, Krishi Kishore, Zane Page, Shannon P. Quinn, Tim Heckman

Design of a Scientific Data Analysis Support Platform                                                              179
Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams

The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python
Ecosystem                                                                                                     187
Orhan Eroglu, Anissa Zacharias, Michaela Sizemore, Alea Kootz, Heather Craker, John Clyne

popmon: Analysis Package for Dataset Shift Detection                                                               194
Simon Brugman, Tomas Sostak, Pradyot Patil, Max Baak

pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe
Willy Menacho, Gonzalo Marcelo Ramírez-Ávila, Horacio V. Guzman

Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods     210
Robert Jackson, Rebecca Gjini, Sri Hari Krishna Narayanan, Matt Menickelly, Paul Hovland, Jan Hückelheim, Scott

RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible       217
João Lemes Gribel Soares, Mateus Stano Junqueira, Oscar Mauricio Prada Ramirez, Patrick Sampaio dos Santos
Brandão, Adriano Augusto Antongiovanni, Guilherme Fernandes Alves, Giovani Hidalgo Ceotto

Wailord: Parsers and Reproducibility for Quantum Chemistry                                                         226
Rohit Goswami

Variational Autoencoders For Semi-Supervised Deep Metric Learning                                                  231
Nathan Safir, Meekail Zain, Curtis Godwin, Eric Miller, Bella Humphrey, Shannon P Quinn

A Python Pipeline for Rapid Application Development (RAD)                                                          240
Scott D. Christensen, Marvin S. Brown, Robert B. Haehnel, Joshua Q. Church, Amanda Catlett, Dallon C. Schofield,
Quyen T. Brannon, Stacy T. Smith

Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses                                  244
W. Scott Shambaugh

Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis                              251
Zachary del Rosario
Low Level Feature Extraction for Cilia Segmentation      259
Meekail Zain, Eric Miller, Shannon P Quinn, Cecilia Lo
PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)                                                                                                     1

      The Advanced Scientific Data Format (ASDF): An Update
                            Perry Greenfield‡∗ , Edward Slavich‡† , William Jamieson‡† , Nadia Dencheva‡†


Abstract—We report on progress in developing and extending the new ASDF
format we have developed for the data from the James Webb and Nancy Grace
Roman Space Telescopes since we reported on it at a previous SciPy. While
the format was developed as a replacement for the long-standard FITS format
used in astronomy, it is quite generic and not restricted to use with astronomical
data. We will briefly review the format, and extensions and changes made to
the standard itself, as well as to the reference Python implementation we have
developed to support it. The standard itself has been clarified in a number
of respects. Recent improvements to the Python implementation include an
improved framework for conversion between complex Python objects and ASDF,
better control of the configuration of extensions supported and versioning of
extensions, tools for display and searching of the structured metadata, better
developer documentation, tutorials, and a more maintainable and flexible
schema system. This has included a reorganization of the components to make
the standard free from astronomical assumptions. An important motivator for the
format was the ability to support serializing functional transforms in multiple
dimensions, as well as expressions built out of such transforms, which has now
been implemented. More generalized compression schemes are now enabled.
We are currently working on adding chunking support and will discuss our plan
for further enhancements.

Index Terms—data formats, standards, world coordinate systems, yaml

The Advanced Scientific Data Format (ASDF) was originally
developed in 2015. That original version was described in a paper
[Gre15]. That paper described the shortcomings of the widely used
astronomical standard format FITS [FIT16] as well as those of
existing potential alternatives. It is not the goal of this paper to
rehash those points in detail, though it is useful to summarize the
basic points here. The remainder of this paper will describe where
we are using ASDF, what lessons we have learned from using
ASDF for the James Webb Space Telescope, and summarize the
most important changes we have made to the standard, to the Python
library that we use to read and write ASDF files, and to best practices
for using the format.

    We will give an example of a more advanced use case that
illustrates some of the powerful advantages of ASDF, and that
its application is not limited to astronomy but is suitable for much
of scientific and engineering data, as well as models. We finish
by outlining our near term plans for further improvements and
extensions.

* Corresponding author:
‡ Space Telescope Science Institute
† These authors contributed equally.

Copyright © 2022 Perry Greenfield et al. This is an open-access article
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.

Summary of Motivations

    •   Suitable as an archival format:
            –   Old versions continue to be supported by libraries.
            –   Format is sufficiently transparent (e.g., not requiring
                extensive documentation to decode) for the fundamental
                set of capabilities.
            –   Metadata is easily viewed with any text editor.
    •   Intrinsically hierarchical.
    •   Avoids duplication of shared items.
    •   Based on existing standard(s) for metadata and structure.
    •   No tight constraints on attribute lengths or their values.
    •   Clearly versioned.
    •   Supports schemas for validating files for basic structure
        and value requirements.
    •   Easily extensible, both for the standard and for local or
        domain-specific conventions.

Basics of ASDF Format

    •   The format consists of a YAML header optionally followed by
        one or more binary blocks for containing binary data.
    •   The YAML header contains all the metadata and defines the
        structural relationship of all the data elements.
    •   YAML tags are used to indicate to libraries the semantics
        of subsections of the YAML header, which libraries can use to
        construct special software objects. For example, a tag for
        a data array would indicate to a Python library that it should
        be converted into a numpy array.
    •   YAML anchors and aliases are used to share common elements
        to avoid duplication.
    •   JSON Schema is used for schemas to define expectations for
        tag content and whole headers, combined with tools to validate
        actual ASDF files against these schemas.
    •   Binary blocks are referenced in the YAML to link binary
        data to YAML attributes.
    •   Arrays may be embedded in the YAML itself or stored in a
        binary block.
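To make the layout bullets above concrete, the following is a hand-written schematic of a small ASDF file. It is illustrative only: the exact header comment lines, tag versions, and attribute names vary with the library and standard version in use.

```yaml
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
--- !core/asdf-1.1.0
# Everything in this YAML document is the metadata header.  Tags such as
# !core/ndarray-1.0.0 tell a reading library what object a subtree represents.
data: !core/ndarray-1.0.0
  source: 0            # index of the binary block that holds the array bytes
  datatype: float64
  shape: [16]
...
```

In the actual file, the `...` document-end marker is followed by the binary block(s) that the `source` indices reference.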

    •    Streaming support for a single binary block.
    •    Permits local definitions of tags and schemas outside of the
         standard.
    •    While developed for astronomy, useful for general scien-
         tific or engineering use.
    •    Aims to be language neutral.

Current and planned uses

James Webb Space Telescope (JWST)
NASA requires JWST data products be made available in the
FITS format. Nevertheless, all the calibration pipelines operate
on the data using internal objects very close to the ASDF
representation. The JWST calibration pipeline uses ASDF to
serialize data that cannot be easily represented in FITS, such as
World Coordinate System information. The calibration software
is also capable of reading and producing data products as pure
ASDF files.

Nancy Grace Roman Space Telescope
This telescope, with the same mirror size as the Hubble Space
Telescope (HST) but a much larger field of view than HST, will
be launched in 2026 or thereabouts. It is to be used mostly in
survey mode and is capable of producing very large mosaicked
images. It will use ASDF as its primary data format.

Daniel K. Inouye Solar Telescope
This telescope is using ASDF for much of the early data products
to hold the metadata for a combined set of data, which can involve
many thousands of files. Furthermore, the World Coordinate
System information is stored using ASDF for all the referenced
data.

Vera Rubin Telescope (for World Coordinate System interchange)

There have been users outside of astronomy using ASDF, as well
as contributors to the source code.

Changes to the standard (completed and proposed)
These are based on lessons learned from usage.
    The current version of the standard is 1.5.0 (1.6.0 is being
developed).
    The following items reflect areas where we felt improvements
were needed.

Changes for 1.5

Moving the URI authority from to
This is to remove the standard from close association with STScI
and make it clear that the format is not intended to be controlled
by one institution.

Moving astronomy-specific schemas out of the standard
These primarily affect the previous inclusion of World Coordinate
Tags, which are strongly associated with astronomy. Remaining
are those related to time and unit standards, both of obvious gen-
erality, but the implementation must be based on some standards,
and currently the astropy-based ones are as good as or better than
any.

Changes for 1.6

Addition of the manifest mechanism
The manifest is a YAML document that explicitly lists the tags and
other features introduced by an extension to the ASDF standard.
It provides a more straightforward way of associating tags with
schemas, allowing multiple tags to share the same schema, and
generally making it simpler to visualize how tags and schemas
are associated (previously these associations were implied by the
Python implementation but were not documented elsewhere).

Handling of null values and their interpretation
The standard didn't previously specify the behavior regarding null
values. The Python library previously removed attributes from the
YAML tree when the corresponding Python attribute had a None
value upon writing to an ASDF file. On reading files where the
attribute was missing but the schema indicated a default value,
the library would create the Python attribute with the default. As
mentioned in the next item, we no longer use this mechanism;
now, when written, the attribute appears in the YAML tree with
a null value if the Python value is None and the schema permits
null values.

Interpretation of default values in schemas
The use of default values in schemas is discouraged, since the
interpretation by libraries is prone to confusion if the assemblage
of schemas conflicts with regard to the default. We have stopped
using defaults in the Python library and recommend that the ASDF
file always be explicit about the value rather than imply it through
the schema. If there are practical cases that preclude always
writing out all values (e.g., they are only relevant to one mode
and usually are irrelevant), it should be the library that manages
whether such attributes are written conditionally, rather than using
the schema default mechanism.

Add alternative tag URI scheme
We now recommend that tag URIs begin with asdf://

Be explicit about what kinds of complex YAML keys are supported
Not all legal YAML keys are supported. YAML arrays are not,
since they are not hashable in Python; likewise, general YAML
objects are not. The standard now limits keys to string, integer,
or boolean types. If more complex keys are required, they should
be encoded in strings.

Still to be done

Upgrade to JSON Schema draft-07
There is interest in some of the new features of this version.
However, this is problematic, since there are aspects of this version
that are incompatible with draft-04, thus requiring all previous
schemas to be updated.

Replace extensions section of file history
This section is considered too specific to the concept of Python
extensions, and is probably best replaced with a more flexible
system for listing extensions used.
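The YAML key restriction described under "Changes for 1.6" mirrors Python's own hashability rules. A minimal illustration in plain Python (independent of the asdf library):

```python
# The ASDF 1.6 standard limits YAML mapping keys to strings, integers,
# and booleans.  This mirrors what a Python library can support: dict
# keys must be hashable, so a YAML array (a Python list) cannot be a key.
valid = {"name": 1, 42: 2, True: 3}    # str, int, and bool keys all work
print(len(valid))                      # 3

try:
    bad = {["a", "b"]: "value"}        # a YAML array would arrive as a list
except TypeError as err:
    print("rejected:", err)            # unhashable type: 'list'

# The recommended workaround: encode the complex key as a string.
encoded = {str(["a", "b"]): "value"}
print(encoded)
```

This is why the standard settles on string, integer, and boolean keys only: they are the key types every mainstream language can represent directly.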
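The change described under "Handling of null values and their interpretation" can be sketched in plain Python. JSON is used here only as a stand-in serializer, since it renders None the same way YAML renders null; the attribute names are hypothetical and the filtering helper is a sketch of the old behavior, not actual asdf library code:

```python
import json

tree = {"target": "NGC 1234", "exposure_time": None}

# Old library behavior (sketch): attributes whose value is None were
# dropped from the tree, so the serialized header omitted them entirely.
old_style = {k: v for k, v in tree.items() if v is not None}
print(json.dumps(old_style))   # {"target": "NGC 1234"}

# New behavior (sketch): the attribute is kept and written as an explicit
# null, provided the schema permits null for that attribute.
print(json.dumps(tree))        # {"target": "NGC 1234", "exposure_time": null}
```

The advantage of the explicit form is that a reader can distinguish "this attribute was deliberately set to null" from "this attribute was never written".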

Changes to Python ASDF package

Easier and more flexible mechanism to create new extensions
The previous system for defining extensions to ASDF, now
deprecated, has been replaced by a new system that makes the
association between tags, schemas, and conversion code more
straightforward, provides more intuitive names for the methods
and attributes, and makes it easier to handle reference cycles if
they are present in the code (support for which was also added to
the original tag handling classes).

Introduced global configuration mechanism (2.8.0)
This reworks how ASDF resources are located, and makes it easier
to update the current configuration and to track down the location
of the needed resources (e.g., schemas and converters). It also
removes performance issues that previously required extracting
information from all the resource files, thus slowing the first call.

Added info/search methods and command line tools (2.6.0)
These allow displaying the hierarchical structure of the header and
the values and types of the attributes. Initially, such introspection
stopped at any tagged item. A subsequent change provides mech-
anisms to see into tagged items (next item). An example of these
tools is shown in a later section.

Added mechanism for info to display tagged item contents (2.9.0)
This allows the library that converts the YAML to Python objects
to expose a summary of the contents of the object by supplying
an optional "dunder" method that the info mechanism can take
advantage of.

Added documentation on how ASDF library internals work
These appear in the readthedocs under the heading "Developer
Overview".

Plugin API for block compressors (2.8.0)
This enables a localized extension to support further compression
options.

Support for asdf:// URI scheme (2.8.0)

Support for ASDF Standard 1.6.0 (2.8.0)
This is still subject to modifications to the 1.6.0 standard.

Modified handling of defaults in schemas and None values (2.8.0)
As described previously.

Using ASDF to store models

This section highlights one aspect of ASDF that few other formats
support in an archival way, i.e., not using a language-specific
mechanism such as Python's pickle. The astropy package contains
a modeling subpackage that defines a number of analytical, as well
as a few table-based, models that can be combined in many ways,
such as arithmetically, in composition, or multi-dimensionally. Thus
it is possible to define fairly complex multi-dimensional models,
many of which can use the built-in fitting machinery.

     These models, and their compound constructs, can be saved
in ASDF files and later read in to recreate the corresponding
astropy objects that were used to create the entries in the ASDF
file. This is made possible by the fact that expressions of models
are straightforward to represent in YAML structure.

     Despite the fact that the models are in some sense executable,
they are perfectly safe so long as the library they are implemented
in is safe (e.g., it doesn't implement an "execute any OS com-
mand" model). Furthermore, the representation in ASDF does not
explicitly use Python code. In principle it could be written or read
in any computer language.

     The following illustrates a relatively simple but not trivial
example. First we define a 1D model and plot it.

import numpy as np
import astropy.modeling.models as amm
import astropy.units as u
import asdf
from matplotlib import pyplot as plt

# Define 3 model components with units
g1 = amm.Gaussian1D(amplitude=100*u.Jy,
                    mean=120*u.MHz,
                    stddev=5.*u.MHz)
g2 = amm.Gaussian1D(65*u.Jy, 140*u.MHz, 3*u.MHz)
powerlaw = amm.PowerLaw1D(amplitude=10*u.Jy,
                          x_0=100*u.MHz,
                          alpha=3)
# Define a compound model
model = g1 + g2 + powerlaw
x = np.arange(50, 200) * u.MHz
plt.plot(x, model(x))

Fig. 1: A plot of the compound model defined in the first segment of
code.

The following code will save the model to an ASDF file, and read
it back in:

af = asdf.AsdfFile()
af.tree = {'model': model}
af.write_to('model.asdf')
af2 ='model.asdf')
model2 = af2['model']
model2 is model
    False
model2(103.5) == model(103.5)
    True

Listing the relevant part of the ASDF file illustrates how the model
has been saved in the YAML header (reformatted to fit in this paper
column).

model: !transform/add-1.2.0
  forward:
4                                                                                         PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

  - !transform/add-1.2.0
    forward:
    - !transform/gaussian1d-1.0.0
      amplitude: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 Jy, value: 100.0}
      bounding_box:
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 92.5}
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 147.5}
      bounds:
        stddev: [1.1754943508222875e-38, null]
      inputs: [x]
      mean: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 120.0}
      outputs: [y]
      stddev: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 5.0}
    - !transform/gaussian1d-1.0.0
      amplitude: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 Jy, value: 65.0}
      bounding_box:
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 123.5}
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 156.5}
      bounds:
        stddev: [1.1754943508222875e-38, null]
      inputs: [x]
      mean: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 140.0}
      outputs: [y]
      stddev: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 3.0}
    inputs: [x]
    outputs: [y]
  - !transform/power_law1d-1.0.0
    alpha: 3.0
    amplitude: !unit/quantity-1.1.0
      {unit: !unit/unit-1.0.0 Jy, value: 10.0}
    inputs: [x]
    outputs: [y]
    x_0: !unit/quantity-1.1.0
      {unit: !unit/unit-1.0.0 MHz, value: 100.0}
  inputs: [x]
  outputs: [y]
...

Note that there are extra pieces of information that define the
model more precisely. These include:

    •   many tags indicating special items, including different
        kinds of transforms (i.e., functions), quantities (i.e., num-
        bers with units), units, etc.
    •   definitions of the units used.
    •   indications of the valid ranges of the inputs or parameters.
    •   the mapping of the inputs and the naming of the outputs
        of each function.
    •   the addition operator, which is itself a transform.

    Without the use of units, the YAML would be simpler. But
the point is that the YAML easily accommodates expression trees.
The tags are used by the library to construct the astropy models,
units and quantities as Python objects. However, nothing in the
above requires the library to be written in Python.
    This machinery can handle multidimensional models and
supports both the combining of models with arithmetic operators
as well as pipelining the output of one model into another. This
system has been used to define complex coordinate transforms
from telescope detectors to sky coordinates for imaging, and
wavelengths for spectrographs, using over 100 model components,
something that the FITS format had no hope of managing, nor any
other scientific format that we are aware of.

Displaying the contents of ASDF files

Functionality has been added to display the structure and content
of the header (including data item properties), with a number of
options for what depth to display, how many lines to display, etc.
An example of info usage is shown in Figure 2.
    There is also functionality to search for items in the file by
attribute name and/or value, also using pattern matching for
either. The search results are shown as attribute paths to the items
that were found.

ASDF Extension/Converter System

There are a number of components involved. Converters
encapsulate the code that handles converting Python objects to
and from their ASDF representation. These are classes that inherit
from the basic Converter class and define two class attributes,
tags and types, each of which is a list of the associated tags and
classes that the specific converter class will handle (each converter
can handle more than one tag and more than one class). The
ASDF machinery uses this information to map tags to converters
when reading ASDF content, and to map types to converters when
saving these objects to an ASDF file.
    Each converter class is expected to supply two methods,
to_yaml_tree and from_yaml_tree, that construct the
YAML content and convert the YAML content to Python class
instances, respectively.
    A manifest file is used to associate tags and schema IDs
so that if a schema has been defined, the ASDF content
can be validated against the schema (as well as providing extra
information for the ASDF content in the info command). Normally
the converters and manifest are registered with the ASDF library
using standard functions, and this registration is normally (but is
not required to be) triggered by use of Python entry points defined
in the setup.cfg file so that the extension is automatically
recognized when the extension package is installed.
    One can of course write custom code to convert the
contents of ASDF files however they want. The advantage of the
tag/converter system is that the objects can be anywhere in the tree
structure and be properly saved and recovered without any
implied knowledge of what attribute or location the object is at.
Furthermore, it brings with it the ability to validate the contents
by use of schema files.
    Jupyter tutorials that show how to use converters can be found
at:

    •
    •   Your_second_ASDF_converter.ipynb

ASDF Roadmap for STScI Work

The planned enhancements to ASDF are understandably focussed
on the needs of STScI missions. Nevertheless, we are particularly
interested in areas that have wider benefit to the general scientific
and engineering community, and such considerations increase the
priority of items necessary to STScI. Furthermore, we are eager
to aid others working on ASDF by providing advice, reviews, and
THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE                                                                                           5

Fig. 2: Part of the output of the info command, showing the structure of a Roman Space Telescope test file (provided by the Roman
Telescopes Branch at STScI). Displayed are the relative depth of each item, its type, its value, and a title extracted from the associated
schema to be used as explanatory information.

possibly collaborative coding effort. STScI is committed to the
long-term support of ASDF.
    The following is a list of planned work, in order of decreasing
priority.

Chunking Support

Since the Roman mission is expected to deal with large data
sets and mosaicked images, support for chunking is considered
essential. We expect to layer the support in our Python library
on zarr [], with two different representations:
one where all data is contained within the ASDF file in separate
blocks, and one where the blocks are saved in individual files.
Both representations have important advantages and use cases.

Improvements to binary block management

These enhancements are needed to enable better chunking support
and other capabilities.

Redefining versioning semantics

Previously the meaning of the different levels of versioning
was unclear. The normal inclination is to treat schema
versions using the typical semantic versioning system defined
for software. But schemas are not software, and we are inclined
to use the proposed system for schemas
[schemaver-for-semantic-versioning-of-schemas/]. To summarize,
the three levels of versioning correspond to
Model.Revision.Addition, where a schema change at each level:

    •   [Model] prevents working with historical data
    •   [Revision] may prevent working with historical data
    •   [Addition] is compatible with all historical data

Integration into astronomy display tools

It is essential that astronomers be able to visualize the data
contained within ASDF files conveniently using commonly
available tools such as SAOImage DS9 [Joy03] and Ginga [Jes13].
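As an illustration, the Model.Revision.Addition rules above can be sketched as a small compatibility check. The helper below (compatible_with_historical_data) is a hypothetical function written only for this discussion; it is not part of the asdf library:

```python
def compatible_with_historical_data(old: str, new: str) -> bool:
    """Illustrative Model.Revision.Addition check.

    A Model bump always breaks historical data, a Revision bump may
    break it (so we conservatively report False), and an Addition
    bump is always compatible.
    """
    old_model, old_revision, _ = (int(p) for p in old.split("."))
    new_model, new_revision, _ = (int(p) for p in new.split("."))
    if new_model != old_model:
        return False   # [Model] change: prevents working with historical data
    if new_revision != old_revision:
        return False   # [Revision] change: may prevent working with historical data
    return True        # [Addition] change: compatible with all historical data
```

Under these semantics only the Addition component may change while guaranteeing that existing files remain readable; a Revision change is conservatively treated as breaking.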

Cloud optimized storage

Many of the future data processing operations for STScI are
expected to be performed on the cloud, so having ASDF efficiently
support such uses is important. An important element of this is
making the format work efficiently with object storage services
such as AWS S3 and Google Cloud Storage.

IDL support

While Python is rapidly surpassing the use of IDL in astronomy,
there is still much IDL code being used, and many of those still
using IDL are in more senior and thus influential positions (they
aren't quite dead yet). So making ASDF data at least readable to
IDL is a useful goal.

Support Rice compression

Rice compression [Pen09], [Pen10] has proven a useful lossy
compression algorithm for astronomical imaging data. Supporting
it will be useful to astronomers, particularly for downloading large
imaging data sets.

Pandas DataFrame support

Pandas [McK10] has proven to be a useful tool to many as-
tronomers, as well as many in the sciences and engineering, so
supporting it will enhance the uptake of ASDF.

Compact, easy-to-read schema summaries

Most scientists and even scientific software developers tend to
find JSON Schema files tedious to interpret. A more compact and
intuitive rendering of their contents would be very useful.

Independent implementation

Having ASDF accepted as a standard data format requires a library
that is divorced from a Python API. Initially this can be done most
easily by layering it on the Python library, but ultimately there
should be an independent implementation, which includes support
for C/C++ wrappers. This is by far the item that will require the
most effort, and it would benefit from outside involvement.

Provide interfaces to other popular packages

This is a catch-all for identifying where there would be significant
advantages to providing the ability to save and recover information
in the ASDF format as an interchange option.

Sources of Information

    •   ASDF Standard:
    •   Python ASDF package documentation: https://asdf.
    •   Repository:
    •   Tutorials:

References

[Gre15] P. Greenfield, M. Droettboom, E. Bray. ASDF: A new data format
        for astronomy, Astronomy and Computing, 12:240-251, September
        2015.
[FIT16] FITS Working Group. Definition of the Flexible Image Transport
        System, International Astronomical Union,
        fits_standard.html, July 2016.
[Jes13] E. Jeschke. Ginga: an open-source astronomical image viewer and
        toolkit, Proc. of the 12th Python in Science Conference, p58-64,
        January 2013.
[McK10] W. McKinney. Data structures for statistical computing in python,
        Proceedings of the 9th Python in Science Conference, p56-61, 2010.
[Pen09] W. Pence, R. Seaman, R. L. White. Lossless Astronomical Image
        Compression and the Effects of Noise, Publications of the Astro-
        nomical Society of the Pacific, 121:414-427, April 2009.
[Pen10] W. Pence, R. L. White, R. Seaman. Optimal Compression of Floating-
        Point Astronomical Images Without Significant Loss of Information,
        Publications of the Astronomical Society of the Pacific, 122:1065-
        1076, September 2010.
[Joy03] W. A. Joye, E. Mandel. New Features of SAOImage DS9, Astronomi-
        cal Data Analysis Software and Systems XII, ASP Conference Series,
        295:489, 2003.

  Semi-Supervised Semantic Annotator (S3A): Toward
             Efficient Semantic Labeling
       Nathan Jessurun‡∗ , Daniel E. Capecci‡ , Olivia P. Dizon-Paradis‡ , Damon L. Woodard‡ , Navid Asadizanjani‡


Abstract—Most semantic image annotation platforms suffer severe bottlenecks
when handling large images, complex regions of interest, or numerous distinct
foreground regions in a single image. We have developed the Semi-Supervised
Semantic Annotator (S3A) to address each of these issues and facilitate rapid
collection of ground truth pixel-level labeled data. Such a feat is accomplished
through a robust and easy-to-extend integration of arbitrary Python image
processing functions into the semantic labeling process. Importantly, the framework
devised for this application allows easy visualization and machine learning
prediction of arbitrary formats and amounts of per-component metadata. To our
knowledge, the ease and flexibility offered are unique to S3A among all open-
source alternatives.

Index Terms—Semantic annotation, Image labeling, Semi-supervised, Region
of interest

Fig. 1. Common use cases for semantic segmentation involve relatively few
foreground objects, low-resolution data, and limited complexity per object.
Images retrieved from

Labeled image data is essential for training, tuning, and evaluating
the performance of many machine learning applications. Such
labels are typically defined with simple polygons, ellipses, and
bounding boxes (i.e., "this rectangle contains a cat"). However,
this approach can misrepresent more complex shapes with holes
or multiple regions, as shown later in Figure 9. When high accuracy
is required, labels must be specified at or close to the pixel level,
a process known as semantic labeling or semantic segmentation.
A detailed description of this process is given in [CZF+ 18].
Examples can readily be found in several popular datasets such
as COCO, depicted in Figure 1.
    Semantic segmentation is important in numerous domains,
including printed circuit board assembly (PCBA) inspection (dis-
cussed later in the case study) [PJTA20], [AML+ 19], quality
control during manufacturing [FRLL18], [AVK+ 01], [AAV+ 02],
manuscript restoration / digitization [GNP+ 04], [KBO16], [JB92],
[TFJ89], [FNK92], and effective patient diagnosis [SKM+ 10],
[RLO+ 17], [YPH+ 06], [IGSM14]. In all these cases, imprecise
annotations severely limit the development of automated solutions
and can decrease the accuracy of standard trained segmentation
models.
    Quality semantic segmentation is difficult due to a reliance on
large, high-quality datasets, which are often created by manually
labeling each image. Manual annotation is error-prone, costly,
and greatly hinders scalability. As such, several tools have been
proposed to alleviate the burden of collecting these ground-truth
labels [itL18]. Unfortunately, existing tools are heavily biased
toward lower-resolution images with few regions of interest (ROI),
similar to Figure 1. While this may not be an issue for some
datasets, such assumptions are crippling for high-fidelity images
with hundreds of annotated ROIs [LSA+ 10], [WYZZ09].
    With improving hardware capabilities and increasing need for
high-resolution ground truth segmentation, there is a continu-
ally growing number of applications that require high-resolution
imaging with the previously described characteristics [MKS18],
[DS20]. In these cases, the existing annotation tooling greatly
impacts productivity due to the previously referenced assumptions
and lack of support [Spa20].
    In response to these bottlenecks, we present the Semi-
Supervised Semantic Annotator (S3A), an annotation and prototyping
platform which eases the process of pixel-level
labeling in large, complex scenes.1 Its graphical user interface is
shown in Figure 2. The software includes live app-level property
customization, real-time algorithm modification and feedback,
region prediction assistance, constrained component table editing
based on allowed data types, various data export formats, and a
highly adaptable set of plugin interfaces for domain-specific exten-
sions to S3A. Beyond software improvements, these features play
significant roles in bridging the gap between human annotation
efforts and scalable, automated segmentation methods [BWS+ 10].

* Corresponding author:
‡ University of Florida

Copyright © 2022 Nathan Jessurun et al. This is an open-access article
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
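To make the pixel-level distinction above concrete, the following toy sketch contrasts a pixel-level mask with the tightest bounding-box label for the same object. It uses only NumPy; the image size and L-shaped object are invented for illustration:

```python
import numpy as np

# A tiny 8x8 "image" containing an L-shaped foreground object.
mask = np.zeros((8, 8), dtype=bool)      # pixel-level semantic label
mask[2:6, 2:4] = True                    # vertical part of the L
mask[4:6, 4:7] = True                    # horizontal part of the L

# The tightest bounding-box label covering the same object.
rows, cols = np.nonzero(mask)
box = np.zeros((8, 8), dtype=bool)
box[rows.min():rows.max() + 1, cols.min():cols.max() + 1] = True

# The box claims background pixels that the mask correctly excludes.
extra_background = int(box.sum() - mask.sum())
print(mask.sum(), box.sum(), extra_background)   # prints: 14 20 6
```

Even for this tiny object the bounding box over-claims six background pixels; for the complex, hole-filled regions discussed above the discrepancy grows accordingly.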

Fig. 3. S3A can iteratively annotate, evaluate, and update its internals in real
time. (The figure shows a cycle: semi-supervised labeling, generate training
data, update models, improve segmentation techniques.)

Fig. 2. S3A's interface. The main view consists of an image to annotate, a
component table of prior annotations, and a toolbar which changes functionality
depending on context.

Application Overview

Design decisions throughout S3A's architecture have been driven
by the following objectives:

    •    Metadata should have significance rather than be treated
         as an afterthought,
    •    High-resolution images should have minimal impact on
         the annotation workflow,
    •    ROI density and complexity should not limit the annotation
         workflow, and
    •    Prototyping should not be hindered by application com-
         plexity.

    These motives were selected upon noticing the general lack
of solutions for related problems in previous literature and tool-
ing. Moreover, applications that do address multiple aspects of
complex region annotation often require an enterprise service and
cannot be accessed under open-source policies.
    While the first three points are highlighted in the case study,
the subsections below outline pieces of S3A's architecture that
prove useful for iterative algorithm prototyping and dataset gen-
eration as depicted in Figure 3. Note that beyond the facets
illustrated here, S3A possesses multiple additional characteris-
tics as outlined in its documentation (/wikis/docs/User's-Guide).

Processing Framework

At the root of S3A's functionality and configurability lies its
adaptive processing framework. Functions exposed within S3A are
thinly wrapped using a Process structure responsible for parsing
signature information to provide documentation, parameter infor-
mation, and more to the UI. Hence, all graphical depictions are
abstracted beyond the concern of the user while remaining trivial
to specify (but can be modified or customized if desired). As a re-
sult, incorporating additional/customized application functionality
can require as little as one line of code. Processes interface with
PyQtGraph parameters to gain access to data-customized widget
types and more (
    These processes can also be arbitrarily nested and chained,
which is critical for developing hierarchical image processing
models, an example of which is shown in Figure 4. This frame-
work is used for all image and region processing within S3A.
Note that for image processes, each portion of the hierarchy yields
intermediate outputs to determine which stage of the process flow
is responsible for various changes. This, in turn, reduces the
effort required to determine which parameters must be adjusted
to achieve optimal performance.

Plugins for User Extensions

The previous section briefly described how custom user functions
are easily wrapped within a process, exposing their parameters
within S3A in a GUI format. A rich plugin interface is built on top
of this capability in which custom functions, table field predictors,
default action hooks, and more can be directly integrated into S3A.
In all cases, only a few lines of code are required to achieve most
integrations between user code and plugin interface specifications.
The core plugin infrastructure consists of a function/property reg-
istration mechanism and an interaction window that shows them
in the UI. As such, arbitrary user functions can be "registered" in
one line of code to a plugin, where they will be effectively exposed
to the user within S3A. A trivial example is depicted in Figure 5, but
more complex behavior such as OCR integration is possible with
similar ease (see this snippet for an implementation leveraging
easyocr).
    Plugin features are heavily oriented toward easing the pro-
cess of automation both for general annotation needs and niche
datasets. In either case, incorporating existing library functions
becomes a trivial task, directly resulting in lower annotation
time and higher labeling accuracy.

Adaptable I/O

An extendable I/O framework allows annotations to be used in
a myriad of ways. Out-of-the-box, S3A easily supports instance-
level segmentation outputs, facilitating deep learning model train-
ing. As an example, Figure 6 illustrates how each instance in the
image becomes its own pair of image and mask data. When several
instances overlap, each is uniquely distinguishable depending
on the characteristic of their label field. Particularly helpful for

1. A preliminary version was introduced in an earlier publication [JPRA20],
but significant changes to the framework and tool capabilities have been
employed since then.

Fig. 4. Outputs of each processing stage can be quickly viewed in context after an iteration of annotating. Upon inspecting the results, it is clear the failure point is
a low k value during K-means clustering and segmentation. The woman's shirt is not sufficiently distinguishable from the background palette to denote a separate
entity. The red dot is an indicator of where the operator clicked during annotation.

from qtpy import QtWidgets
from s3a import S3A

def hello_world(win: S3A):
    QtWidgets.QMessageBox.information(
        win, "Hello World", "Hello World!")

Fig. 5. Simple standalone functions can be easily exposed to the user through
the random tools plugin. Note that if tunable parameters were included in the
function signature, pressing "Open Tools" (the top menu option) allows them to
be altered.

Fig. 6. Multiple export formats exist, among which is a utility that crops com-
ponents out of the image, optionally padding with scene pixels and resizing to
ensure all shapes are equal. Each sub-image and mask is saved accordingly,
which is useful for training multiple forms of machine learning models.

models with fixed input sizes, these exports can optionally be
forced to have a uniform shape (e.g., 512x512 pixels) while main-
taining their aspect ratio. This is accomplished by incorporating
additional scene pixels around each object until the appropriate
size is obtained. Models trained on these exports can be directly
plugged back into S3A's processing framework, allowing them
to generate new annotations or refine preliminary user efforts.
The described I/O framework is also heavily modularized such
that custom dataset specifications can easily be incorporated. In
this manner, future versions of S3A will facilitate interoperability
with popular formats such as COCO and Pascal VOC [LMB+ 14],
[EGW+ 10].

Deep, Portable Customizability

Beyond the features previously outlined, S3A provides numerous
avenues to configure shortcuts, color schemes, and algorithm
workflows. Several examples of each can be seen in the user
guide. Most customizable components prototyped within S3A can
also be easily ported to external workflows after development.
Hierarchical processes have states saved in YAML files describing
all parameters, which can be reloaded to create user profiles.
Alternatively, these same files can describe ideal parameter com-
binations for functions outside S3A in the event they are utilized
in a different framework.

Case Study

Both the inspiration and developing efforts for S3A were initially
driven by optical printed circuit board (PCB) assurance needs.
In this domain, high-resolution images can contain thousands
of complex objects in a scene, as seen in Figure 7. Moreover,
numerous components are not representable by cardinal shapes
such as rectangles, circles, etc. Hence, high-count polygonal
regions dominated a significant portion of the annotated regions.
The computational overhead from displaying large images and
substantial numbers of complex regions either crashed most anno-
tation platforms or prevented real-time interaction. In response,
S3A was designed to fill the gap in open-source annotation
platforms by addressing each issue while requiring minimal setup
and allowing easy prototyping of arbitrary image processing tasks.
The subsections below describe how the S3A labeling platform
was utilized to collect a large database of PCB annotations along
with their associated metadata2 .

Large Images with Many Annotations

In optical PCB assurance, one method of identifying component
defects is to localize and characterize all objects in the image. Each
10                                                                                                   PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

component can then be cross-referenced against genuine properties such as length/width, associated text, allowed orientations, etc. However, PCB surfaces can contain hundreds to thousands of components at several magnitudes of size, necessitating high-resolution images for in-line scanning. To handle this problem more generally, S3A separates the editing and viewing experiences. In other words, annotation time is orders of magnitude faster since only edits in one region at a time, on a small subset of the full image, are considered during assisted segmentation. All other annotations are read-only until selected for alteration. For instance, Figure 8 depicts user inputs on a small ROI out of a much larger image. The resulting component shape is proposed within seconds and can either be accepted or modified further by the user. While PCB annotations initially inspired this approach, it is worth noting that the architectural approach applies to arbitrary domains of image segmentation.

Fig. 7. Example PCB segmentation. In contrast to typical segmentation tasks, the scene contains over 4,000 objects with numerous complex shapes.

Fig. 8. Regardless of total image size and number of annotations, Python processing is limited to the ROI or viewbox size for just the selected object, based on user preferences. The depiction shows Grab Cut operating on a user-defined initial region within a much larger (8000x6000) image. The resulting region was available in 1.94 seconds on low-grade hardware.

Another key performance improvement comes from resizing the processed region to a user-defined maximum size. For instance, if an ROI is specified across a large portion of the image but the maximum processing size is 500x500 pixels, the processed area will be downsampled to a maximum dimension length of 500 before intensive algorithms are run. The final output will be upsampled back to the initial region size. In this manner, optionally sacrificing a small amount of output accuracy can drastically accelerate runtime performance for larger annotated regions.

Complex Vertices/Semantic Segmentation

Multiple types of PCB components possess complex shapes which might contain holes or noncontiguous regions. Hence, it is beneficial for software like S3A to represent these features inherently with a ComplexXYVertices object: that is, a collection of polygons which either describe foreground regions or holes. This is enabled by thinly wrapping opencv's contour and hierarchy logic. Example components difficult to accommodate with single-polygon annotation formats are illustrated in Figure 9.

Fig. 9. Annotated objects in S3A can incorporate both holes and distinct regions through a multi-polygon container. Holes are represented as polygons drawn on top of existing foreground, and can be arbitrarily nested (i.e., island foreground is also possible).

At the same time, S3A also supports high-count polygons with no performance losses. Since region edits are performed by image processing algorithms, there is no need for each vertex to be manually placed or altered by human input. Thus, such non-interactive shapes can simply be rendered as a filled path without a large number of event listeners present. This is the key performance improvement when thousands of regions (each with thousands of points) are in the same field of view. When low polygon counts are required, S3A also supports RDP polygon simplification down to a user-specified epsilon parameter [Ram].

Complex Metadata

Most annotation software supports robust implementation of image region, class, and various text tags ("metadata"). However, this paradigm makes collecting type-checked or input-sanitized metadata more difficult. This includes label categories such as object rotation, multiclass specifications, dropdown selections, and more. In contrast, S3A treats each metadata field the same way as object vertices: fields can be algorithm-assisted, directly input by the user, or part of a machine learning prediction framework. Note that simple properties such as text strings or numbers can be directly input in the table cells with minimal need for annotation assistance3. In contrast, custom fields can provide plugin specifications which allow more advanced user interaction. Finally, auto-populated fields like annotation timestamp or author can easily be constructed by providing a factory function instead of a default value in the parameter specification.

This capability is particularly relevant in the field of optical PCB assurance. White markings on the PCB surface, known as silkscreen, indicate important aspects of nearby components. Thus, understanding the silkscreen's orientation, alphanumeric characters, associated component, logos present, and more provides several methods by which to characterize/identify features of their respective devices. Both default and customized input validators were applied to each field using parameter specifications, custom plugins, or simple factories as described above. A summary of the metadata collected for one component is shown in Figure 10.
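The RDP simplification mentioned above follows the classic Ramer–Douglas–Peucker recursion: a vertex is kept whenever its distance from the chord between the segment endpoints exceeds epsilon. The sketch below is a minimal NumPy illustration of that algorithm, not S3A's internal implementation; the function name and array conventions are assumptions.

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: simplify an (N, 2) polyline so that no
    discarded vertex lies farther than `epsilon` from the kept segments."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts
    start, end = pts[0], pts[-1]
    dx, dy = end - start
    norm = np.hypot(dx, dy)
    if norm == 0:
        # degenerate chord: fall back to distance from the start point
        dists = np.hypot(pts[:, 0] - start[0], pts[:, 1] - start[1])
    else:
        # perpendicular distance of each vertex to the start-end chord
        dists = np.abs(dx * (pts[:, 1] - start[1])
                       - dy * (pts[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # farthest vertex is significant: recurse on both halves
        left = rdp(pts[: idx + 1], epsilon)
        right = rdp(pts[idx:], epsilon)
        return np.vstack([left[:-1], right])
    # every interior vertex is within epsilon: keep only the endpoints
    return np.vstack([start, end])
```

Applied after interactive edits, a routine like this keeps on-screen polygons lightweight while bounding the geometric error by epsilon pixels.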
SEMI-SUPERVISED SEMANTIC ANNOTATOR (S3A): TOWARD EFFICIENT SEMANTIC LABELING                                                                                   11

Fig. 10. Metadata can be collected, validated, and customized with ease. A mix of default properties (strings, numbers, booleans), factories (timestamp, author), and custom plugins (yellow circle representing the associated device) are present.

Conclusion and Future Work

The Semi-Supervised Semantic Annotator (S3A) is proposed to address the difficult task of pixel-level annotation of image data. For high-resolution images with numerous complex regions of interest, existing labeling software faces performance bottlenecks when attempting to extract ground-truth information. Moreover, there is a lack of capabilities for converting such a labeling workflow into an automated procedure with feedback at every step. Each of these challenges is overcome by various features within S3A specifically designed for such tasks. As a result, S3A provides not only tremendous time savings during ground-truth annotation, but also allows an annotation pipeline to be directly converted into a prediction scheme. Furthermore, the rapid feedback accessible at every stage of annotation expedites prototyping of novel solutions in imaging domains where few examples of prior work exist. Nonetheless, multiple avenues exist for improving S3A's capabilities in each of these areas. Several prominent future goals are highlighted in the following sections.

Dynamic Algorithm Builder

Presently, processing workflows can be specified in a sequential YAML file which describes each algorithm and its respective parameters. However, this is not easy to adapt within S3A, especially by inexperienced annotators. Future iterations of S3A will incorporate graphical flowcharts which make this process drastically more intuitive and provide faster feedback. Frameworks like Orange [DCE+] perform this task well, and S3A would strongly benefit from adding the relevant capabilities.

Image Navigation Assistance

Several aspects of image navigation can be incorporated to simplify the handling of large images. For instance, a "minimap" tool would allow users to maintain a global image perspective while making local edits. Furthermore, this sense of scale aids intuition of how many regions of similar component density, color, etc. exist within the entire image.

Second, multiple strategies for annotating large images leverage a windowing approach, where they divide the total image into several smaller pieces in a gridlike fashion. While this has its disadvantages, it is fast, easy to automate, and produces reasonable results depending on the initial image complexity [VGSG+19]. Hence, these methods would be significantly easier to incorporate into S3A if a generalized windowing framework were added which allows users to specify all necessary parameters such as window overlap, size, sampling frequency, etc. A preliminary version of this is implemented for categorical model prediction, but a more robust feature set for interactive segmentation is strongly preferable.

Aggregation of Human Annotation Habits

It has been noted several times that manual segmentation of image data is not a feasible or scalable approach for even remotely large datasets. However, there are multiple cases in which human intuition can greatly outperform even complex neural networks, depending on the specific segmentation challenge [RLFF15]. For this reason, it would be ideal to capture data points possessing information about the human decision-making process and apply them to images at scale. This may include taking into account human labeling time per class, hesitation between clicks, the relationship between shape boundary complexity and instance quantity, and more. By aggregating such statistics, a pattern may arise which can be leveraged as an additional automated annotation technique.

2. For those curious, the dataset and associated paper are accessible at
3. For a list of input validators and supported primitive types, refer to PyQtGraph's Parameter documentation.

REFERENCES

[AAV+02] C. Anagnostopoulos, I. Anagnostopoulos, D. Vergados, G. Kouzas, E. Kayafas, V. Loumos, and G. Stassinopoulos. High performance computing algorithms for textile quality control. Mathematics and Computers in Simulation, 60(3):389–400, September 2002.
[AML+19] Mukhil Azhagan, Dhwani Mehta, Hangwei Lu, Sudarshan Agrawal, Mark Tehranipoor, Damon L. Woodard, Navid Asadizanjani, and Praveen Chawla. A review on automatic bill of material generation and visual inspection on PCBs. In ISTFA 2019: Proceedings of the 45th International Symposium for Testing and Failure Analysis, page 256. ASM International, 2019.
[AVK+01] C. Anagnostopoulos, D. Vergados, E. Kayafas, V. Loumos, and G. Stassinopoulos. A computer vision approach for textile quality control. The Journal of Visualization and Computer Animation, 12(1):31–44, 2001. doi:10.1002/vis.245.
[BWS+10] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 438–451, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[CZF+18] Qimin Cheng, Qian Zhang, Peng Fu, Conghuan Tu, and Sen Li. A survey and analysis on automatic image annotation. Pattern Recognition, 79:242–259, 2018. doi:10.1016/j.patcog.2018.02.017.
[DCE+] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, and Anže Starič. Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14(1):2349–2353, 2013.
[DS20] Polina Demochkina and Andrey V. Savchenko. Improving the accuracy of one-shot detectors for small objects in x-ray images. In 2020 International Russian Automation Conference (RusAutoCon), pages 610–614. IEEE, September 2020. doi:10.1109/RusAutoCon49822.2020.9208097.
[EGW+10] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010.
[FNK92] H. Fujisawa, Y. Nakano, and K. Kurino. Segmentation methods for character recognition: From segmentation to document structure analysis. Proceedings of the IEEE, 80(7):1079–1092, July 1992. doi:10.1109/5.156471.
[FRLL18] Max K. Ferguson, Ak Ronay, Yung-Tsun Tina Lee, and Kincho H. Law. Detection and segmentation of manufacturing defects with convolutional neural networks and transfer learning. Smart and Sustainable Manufacturing Systems, 2, 2018. doi:10.1520/SSMS20180033.
[GNP+04] Basilios Gatos, Kostas Ntzios, Ioannis Pratikakis, Sergios Petridis, T. Konidaris, and Stavros J. Perantonis. A segmentation-free recognition technique to assist old Greek handwritten manuscript OCR. In Simone Marinai and Andreas R. Dengel, editors, Document Analysis Systems VI, Lecture Notes in Computer Science, pages 63–74, Berlin, Heidelberg, 2004. Springer. doi:10.1007/978-3-540-28640-0_7.
[IGSM14] D. K. Iakovidis, T. Goudas, C. Smailis, and I. Maglogiannis. Ratsnake: A versatile image annotation tool with application to computer-aided diagnosis, 2014. doi:10.1155/2014/286856.
[itL18] Humans in the Loop. The best image annotation platforms for computer vision (+ an honest review of each), October 2018. URL:
[JB92] Anil K. Jain and Sushil Bhattacharjee. Text segmentation using Gabor filters for automatic document processing. Machine Vision and Applications, 5(3):169–184, June 1992. doi:10.1007/BF02626996.
[JPRA20] Nathan Jessurun, Olivia Paradis, Alexandra Roberts, and Navid Asadizanjani. Component Detection and Evaluation Framework (CDEF): A semantic annotation tool. Microscopy and Microanalysis, 26(S2):1470–1474, August 2020. doi:10.1017/S1431927620018243.
[KBO16] Made Windu Antara Kesiman, Jean-Christophe Burie, and Jean-Marc Ogier. A new scheme for text line and character segmentation from gray scale images of palm leaf manuscript. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 325–330, October 2016.
[LMB+14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[LSA+10] Ľubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H. S. Torr. What, where and how many? Combining object detectors and CRFs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 424–437, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[MKS18] S. Mohajerani, T. A. Krammer, and P. Saeedi. A cloud detection algorithm for remote sensing images using fully convolutional neural networks. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pages 1–5, August 2018.
[PJTA20] Olivia P. Paradis, Nathan T. Jessurun, Mark Tehranipoor, and Navid Asadizanjani. Color normalization for robust automatic bill of materials generation and visual inspection of PCBs. In ISTFA 2020: Papers Accepted for the Planned 46th International Symposium for Testing and Failure Analysis, pages 172–179, 2020. URL:
[Ram] Urs Ramer. An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1(3):244–256, 1972.
[RLFF15] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2121–2131. IEEE, June 2015. URL:
[RLO+17] Martin Rajchl, Matthew C. H. Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A. Rutherford, Joseph V. Hajnal, Bernhard Kainz, and Daniel Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2):674–683, February 2017. doi:10.1109/TMI.2016.2621185.
[SKM+10] Sascha Seifert, Michael Kelm, Manuel Moeller, Saikat Mukherjee, Alexander Cavallaro, Martin Huber, and Dorin Comaniciu. Semantic annotation of medical images. In Brent J. Liu and William W. Boonn, editors, Medical Imaging 2010: Advanced PACS-based Imaging Informatics and Therapeutic Applications, volume 7628, pages 43–50. International Society for Optics and Photonics, SPIE, 2010. URL: doi:10.1117/12.844207.
[Spa20] SpaceNet. Multi-Temporal Urban Development Challenge. URL:, June 2020.
[TFJ89] T. Taxt, P.J. Flynn, and A.K. Jain. Segmentation of document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1322–1329, December 1989. doi:10.1109/34.41371.
[VGSG+19] Juan P. Vigueras-Guillén, Busra Sari, Stanley F. Goes, Hans G. Lemij, Jeroen van Rooij, Koenraad A. Vermeer, and Lucas J. van Vliet. Fully convolutional architecture vs sliding-window CNN for corneal endothelium cell segmentation. BMC Biomedical Engineering, 1(1):4, January 2019. doi:10.1186/s42490-019-0003-2.
[WYZZ09] C. Wang, Shuicheng Yan, Lei Zhang, and H. Zhang. Multi-label sparse coding for automatic image annotation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1643–1650, June 2009. doi:10.1109/CVPR.2009.5206866.
[YPH+06] Paul A. Yushkevich, Joseph Piven, Heather Cody Hazlett, Rachel Gimpel Smith, Sean Ho, James C. Gee, and Guido Gerig. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage, 31(3):1116–1128, July 2006. doi:10.1016/j.neuroimage.2006.01.015.
  Galyleo: A General-Purpose Extensible Visualization
                     Rick McGeer‡∗ , Andreas Bergen‡ , Mahdiyar Biazi‡ , Matt Hemmings‡ , Robin Schreiber‡


Abstract—Galyleo is an open-source, extensible dashboarding solution integrated with JupyterLab [jup]. Galyleo is a standalone web application integrated as an iframe [LS10] into a JupyterLab tab. Users generate data for the dashboard inside a Jupyter Notebook [KRKP+16], which transmits the data through message passing [mdn] to the dashboard; users use drag-and-drop operations to add widgets to filter, and charts to display, the data, as well as shapes, text, and images. The dashboard is saved as a JSON [Cro06] file in the user's filesystem in the same directory as the Notebook.

Index Terms—JupyterLab, JupyterLab extension, Data visualization

* Corresponding author:
‡ engageLively

Copyright © 2022 Rick McGeer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Current dashboarding solutions [hol22a] [hol22b] [plo] [pan22] for Jupyter either involve external, heavyweight tools, ingrained HTML/CSS coding, complex publication, or limited control over layout, and have restricted widget sets and visualization libraries. Graphics objects require a great deal of configuration: size, position, colors, and fonts must be specified for each object. Thus, library solutions involve a significant amount of fairly simple code. Conversely, visualization involves analytics, an inherently complex set of operations. Visualization tools such as Tableau [DGHP13] or Looker [loo] combine visualization and analytics in a single application presented through a point-and-click interface. Point-and-click interfaces are limited in the number and complexity of operations supported. The complexity of an operation isn't reduced by having a simple point-and-click interface; instead, the user is confronted with the challenge of trying to do something complicated by pointing. The result is that tools encapsulate complex operations in a few buttons, which leads to a limited number of operations with reduced options and/or tools with steep learning curves.

In contrast, Jupyter is simply a superior analytics environment in every respect over a standalone visualization tool: its various kernels and their libraries provide a much broader range of analytics capabilities; its programming interface is a much cleaner and simpler way to perform complex operations; hardware resources can scale far more easily than they can for a visualization tool; and connectors to data sources are both plentiful and extensible.

Both standalone visualization tools and Jupyter libraries have a limited set of visualizations. Jupyter is a server-side platform. Jupyter's web interface primarily offers textboxes for code entry. Entered code is sent to the server for evaluation and text/HTML results are returned. Visualization in a Jupyter Notebook is either given by images rendered server-side and returned as inline image tags, or by JavaScript/HTML5 libraries which have a corresponding server-side Python library. The Python library generates HTML5/JavaScript code for rendering.

The limiting factor is that the visualization library must be integrated with the Python backend by a developer, and only a subset of the rich array of visualization, charting, and mapping libraries available on the HTML5/JavaScript platform is integrated. The HTML5/JavaScript platform is as rich a client-side visualization platform as Python is a server-side platform.

Galyleo set out to offer the best of both worlds: Python, R, and Julia as a scalable analytics platform coupled with an extensible JavaScript/HTML5 visualization and interaction platform. It offers a no-code client-side environment, for several reasons:

1) The Jupyter analytics community is comfortable with server-side analytics environments (the 100+ kernels available in Jupyter, including Python, R, and Julia) but less so with the JavaScript visualization platform.
2) Configuration of graphical objects takes a lot of low-value configuration code; conversely, it is relatively easy to do by hand.

These insights led to a mixed interface, combining a drag-and-drop interface for the design and configuration of visual objects with a coding, server-side interface for analytics programs.

Extension of the widget set was an important consideration. A widget is a client-side object with a physical component. Galyleo is designed to be extensible both by adding new visualization libraries and components and by adding new widgets.

Publication of interactive dashboards has been a further challenge. A design goal of Galyleo was to offer a simple scheme, where a dashboard could be published to the web with a single click.

These, then, are the goals of Galyleo:

1) Simple, drag-and-drop design of interactive dashboards in a visual editor. The visual design of a Galyleo dashboard should be no more complex than the design of a PowerPoint or Google slide;
2) Radically simplify the dashboard-design interface by coupling it to a powerful Jupyter back end to do the analytics work, separating visualization and analytics concerns;
14                                                                                         PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

   3)   Maximize extensibility for visualization and widgets on the client side and analytics libraries, data sources, and hardware resources on the server side;
   4)   Easy, simple publication.

Fig. 1: A New Galyleo Dashboard

Fig. 2: The Galyleo Dashboard Studio

Fig. 3: Dataflow in Galyleo

Using Galyleo

The general usage model of Galyleo is that a Notebook is being edited and executed in one tab of JupyterLab, and a corresponding dashboard file is being edited and executed in another; as the Notebook executes, it uses the Galyleo Client library to send data to the dashboard file. To JupyterLab, the Galyleo Dashboard Studio is just another editor; it reads and writes .gd.json files in the current directory.

The Dashboard Studio

A new Galyleo Dashboard can be launched from the JupyterLab launcher or from the File>New menu, as shown in Figure 1.
    An existing dashboard is saved as a .gd.json file, and is denoted with the Galyleo star logo. It can be opened in the usual way, with a double-click.
    Once a file is opened, or a new file created, a new Galyleo tab opens onto it. It resembles a simplified form of a Tableau, Looker, or PowerBI editor (Figure 2). The collapsible right-hand sidebar offers the ability to view Tables, and to view, edit, or create Views, Filters, and Charts. The bottom half of the right sidebar gives controls for styling of text and shapes.
    The top bar handles the introduction of decorative and styling elements to the dashboard: labels and text, simple shapes such as ellipses, rectangles, polygons, lines, and images. All images are referenced by URL.
    As the user creates and manipulates the visual elements, the editor continuously saves the dashboard as a JSON file, which can also be edited with Jupyter's built-in text editor.
    The goal of Galyleo is simplicity and transparency. Data preparation is handled in Jupyter, and the basic abstract item, the GalyleoTable, is generally created and manipulated there, using an open-source Python library. When a table is ready, the GalyleoClient library is invoked to send it to the dashboard, where it appears in the table tab of the sidebar. The dashboard author then creates visual elements such as sliders, lists, dropdowns, etc., which select rows of the table, and uses these filtered lists as inputs to charts. The general idea is that the author should be able to move seamlessly between manipulating and creating data tables in the Notebook, and filtering and visualizing them in the dashboard.

Data Flow and Conceptual Picture

The Galyleo Data Model and Architecture is discussed in detail below. The central idea is to have a few orthogonal, easily-grasped concepts which make data manipulation easy and intuitive. The basic concepts are as follows:

   1)   Table: A Table is a list of records, equivalent to a Pandas DataFrame [pdt20] [WM10] or a SQL Table. In general, in Galyleo, a Table is expected to be produced by an external source, generally a Jupyter Notebook.
   2)   Filter: A Filter is a logical function which applies to a single column of a Table, and selects rows from the Table. Each Filter corresponds to a widget; widgets set the values Filters use to select Table rows.
   3)   View: A View is a subset of a Table selected by one or more Filters. To create a View, the user chooses a Table, and then chooses one or more Filters to apply to the Table to select the rows for the View. The user can also statically select a subset of the columns to include in the View.
   4)   Chart: A Chart is a generic term for an object that displays data graphically. Its input is a View or a Table. Each Chart has a single data source.

    The data flow (Figure 3) is straightforward. A Table is updated from an external source, or the user manipulates a widget. When this happens, the affected item signals the dashboard controller that it has been updated. The controller then signals all charts to redraw themselves.
GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION                                                                                15

Each Chart will then request updated data from its source Table or View. A View then requests its configured filters for their current logic functions, and passes these to the source Table with a request to apply the filters and return the rows which are selected by all the filters (in the future, a more general Boolean will be applied; the UI elements to construct this function are under design). The Table then returns the rows which pass the filters; the View selects the static subset of columns it supports, and passes this to its Charts, which then redraw themselves.
    Each item in this flow conceptually has a single data source, but multiple data targets. There can be multiple Views over a Table, but each View has a single Table as a source. There can be multiple Charts fed by a View, but each Chart has a single Table or View as a source.
    It's important to note that there are no special cases. There is no distinction, as there is in most visualization systems, between a "Dimension" and a "Measure"; there are simply columns of data, which can be either a value or category axis for any Chart. From this simplicity, significant generality is achieved. For example, a filter selects values from any column, whether that column is providing a value or a category. Applying a range filter to a category column gives natural telescoping and zooming on the x-axis of a chart, without change to the architecture.
    An important operation for any interactive dashboard is drill-downs: expanding detail for a datapoint on a chart. The user should be able to click on a chart and see a detailed view of the data underlying the datapoint. This was naturally implemented in our system by associating a filter with every chart: every chart in Galyleo is also a Select Filter, and it can be used as a Filter in a View, just as any other widget can be.

Publishing The Dashboard

Once the dashboard is complete, it can be published to the web simply by moving the dashboard file to any place it can get a URL (e.g. a GitHub repo). It can then be viewed by visiting the viewer page with dashboard=<url of dashboard file> as a query parameter. Figure 4 shows a published Galyleo Dashboard, which displays Florence Nightingale's famous Crimean War dataset. The double sliders underneath the column charts telescope the x-axes, effectively permitting zooming on a range; clicking on a column shows the detailed death statistics for that month in the pie chart above the column chart.

Fig. 4: A Published Galyleo Dashboard

No-Code, Low-Code, and Appropriate-Code

Galyleo is an appropriate-code environment, meaning that it offers efficient creation to developers at every step. It offers What-You-See-Is-What-You-Get (WYSIWYG) design tools where appropriate, low-code where appropriate, and full code creation tools where appropriate.
    No-code and low-code environments, where users construct applications through a visual interface, are popular for several reasons. The first is the assumption that coding is time-consuming and hard, which isn't always or necessarily true; the second is the assumption that coding is a skill known to only a small fraction of the population, which is becoming less true by the day: 40% of Berkeley undergraduates take Data 8, in which every assignment involves programming in a Jupyter Notebook. The third, particularly for graphics code, is that manual design and configuration gives instant feedback and tight control over appearance. For example, the authors of a LaTeX paper (including this one) can't control the placement of figures within the text. The fourth, which is correct, is that configuration code is more verbose, error-prone, and time-consuming than manual configuration.
    What is less often appreciated is that when operations become sufficiently complex, coding is a much simpler interface than manual configuration. For example, building a pivot table in a spreadsheet using point-and-click operations has "always had a reputation for being complicated" [Dev]. It's three lines of code in Python, even without using the Pandas pivot_table method. Most analytics procedures are far more easily done in code.
    As a result, Galyleo is an appropriate-code environment: an environment which combines a coding interface for complex, large-scale, or abstract operations with a point-and-click interface for simple, concrete, small-scale operations. Galyleo pairs broadly powerful Jupyter-based code and low-code libraries for analytics with fast GUI-based design and configuration for graphical elements and layout.

Galyleo Data Model And Architecture

The Galyleo Data Model and Architecture closely model the dashboard architecture discussed in the previous section. They are based on the idea of a few simple, generalizable structures, which are largely independent of each other and communicate through simple interfaces.

The GalyleoTable

A GalyleoTable is the fundamental data structure in Galyleo. It is a logical, not a physical abstraction; it simply responds to the GalyleoTable API. A GalyleoTable is a pair (columns, rows), where columns is a list of pairs (name, type), where type is one of {string, boolean, number, date}, and rows is a list of lists of primitive values, where the length of each component list is the length of the list of columns and the type of the kth entry in each list is the type specified by the kth column.
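The (columns, rows) pairing can be sketched directly as plain Python data. The checker below is purely illustrative (it is not part of the Galyleo API); it simply verifies that each row conforms to the declared column types:

```python
# Illustrative sketch of a GalyleoTable as plain Python data.
# The type names mirror the paper's {string, boolean, number, date};
# the checker itself is an assumption, not part of the Galyleo API.
from datetime import date

GALYLEO_TYPES = {
    "string": str,
    "boolean": bool,
    "number": (int, float),
    "date": date,
}

def check_table(columns, rows):
    """Return True if every row matches the declared column types."""
    for row in rows:
        if len(row) != len(columns):
            return False
        for value, (_name, col_type) in zip(row, columns):
            if not isinstance(value, GALYLEO_TYPES[col_type]):
                return False
    return True

columns = [("month", "string"), ("deaths", "number")]
rows = [["Jan 1855", 2761], ["Feb 1855", 2120]]
```

Real tables are built and sent with the open-source Python library mentioned above; this sketch only fixes the shape of the structure.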
Small, public tables may be contained in the dashboard file; these are called explicit tables. However, explicitly representing the table in the dashboard file has a number of disadvantages:

   1)   An explicit table is in the memory of the client viewing the dashboard; if it is too large, it may cause significant performance problems on the dashboard author or viewer's device;
   2)   Since the dashboard file is accessible on the web, any data within it is public;
   3)   The data may be continuously updated from a source, and it's inconvenient to re-run the Notebook to update the data.

    Therefore, the GalyleoTable can be one of three types:

   1)   A data server that implements the Table REST API;
   2)   A JavaScript object within the dashboard page itself;
   3)   A JavaScript messenger in the page that implements a messaging version of the API.

    An explicit table is simply a special case of (2) -- in this case, the JavaScript object is simply a linear list of rows.
    These are not exclusive. The JavaScript messenger case is designed to support the ability of a containing application within the browser to handle viewer authentication, shrinking the security vulnerability footprint and ensuring that the client application controls the data going to the dashboard. In general, aside from performing tasks like authentication, the messenger will call an external data server for the values themselves.
    Whether in a data server, a containing application, or a JavaScript object, Tables support three operations:

   1)   Get all the values for a specific column;
   2)   Get the max/min/increment for a specific numeric column;
   3)   Get the rows which match a boolean function, passed in as a parameter to the operation.

    Of course, (3) is the operation that we have seen above, to populate a View and a Chart. (1) and (2) populate widgets on the dashboard: (1) is designed for a select filter, which is a widget that lets a user pick a specific set of values for a column; (2) is an optimization for numeric filters, so that the entire list of values for the column need not be sent -- rather, only the start and end values, and the increment between them.
    Each type of table specifies a source, additional information (in the case of a data server, for example, any header variables that must be specified in order to fetch the data), and, optionally, a polling interval. The latter is designed to handle live data; the dashboard will query the data source at each polling interval to see if the data has changed.
    These three table instantiations (REST, JavaScript object, messenger) were chosen because they provide the key foundational building block for future extensions; it's easy to add a SQL connection on top of a REST interface, or a Python simulator.

Filters

Tables must be filtered in situ. One of the key motivators behind remote tables is keeping large amounts of data from hitting the browser; this is largely defeated if the entire table is sent to the dashboard and then filtered there. As a result, there is a Filter API together with the Table API wherever there are tables.
    The data flow of the previous section remains unchanged; it is simply that the filter functions are transmitted to wherever the tables happen to be. The dataflow in the case of remote tables (whether messenger-based or REST-based) is shown in Figure 5, with operations that are resident where the table is situated and operations resident on the dashboard clearly shown.

Fig. 5: Galyleo Dataflow with Remote Tables

Again, simplicity and orthogonality have shown tremendous benefits here. Though filters conceptually act as selectors on rows, they may perform a variety of roles in implementations. For example, a table produced by a simulator may be controlled by a parameter value given by a Filter function.

Extending Galyleo

Every element of the Galyleo system, whether it is a widget, Chart, Table Server, or Filter, is defined exclusively through a small set of public APIs. This is done to permit easy extension, by the Galyleo team, users, or third parties. A Chart is defined as an object which has a physical HTML representation and supports four JavaScript methods: redraw (draw the chart), set data (set the chart's data), set options (set the chart's options), and supports table (a boolean which returns true if and only if the chart can draw the passed-in data set). In addition, it exports a defined JSON structure which indicates what options it supports and the types of their values; this is used by the Chart Editor to display a configurator for the chart.
    Similarly, the underlying system supports user design of new filters. Again, a filter is simply an object with a physical presence, which the user can design in lively.next, and which supports a specific API -- broadly, set the choices and hand back the Boolean function as a JSON object which will be used to filter the data.
    Any system can be used to extend Galyleo; at the end of the day, all that need be done is to encapsulate a widget or chart in a snippet of HTML with a JavaScript interface that matches the Galyleo protocol. This is done most easily and quickly by using lively.next [SKH21]. lively.next is the latest in a line of Smalltalk- and Squeak-inspired [IKM+ 97] JavaScript/HTML integrated development environments that began with the Lively Kernel [IPU+ 08] [KIH+ 09] and continued through the Lively Web [LKI+ 12] [IFH+ 16] [TM17]. Galyleo is an application built in Lively, following the work done in [HIK+ 16].
    Lively shares with Jupyter an emphasis on live programming [KRB18], or a Read-Evaluate-Act Loop (REAL) programming style. It adds to that a combination of visual and text programming [ABF20], where physical objects are positioned and configured largely by hand, as done with any drawing or design program (e.g., PowerPoint, Illustrator, DrawPad, Google Draw), and programmed with a built-in editor and workspace, similar in concept if not form to a Jupyter Notebook.
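To make the Filter API described above concrete, the following sketch serializes a range filter and a select filter as JSON-style dictionaries and applies them where the table lives, so that only matching rows would travel to the browser. The field names (operator, column, min, max, values) are illustrative assumptions, not Galyleo's actual wire format:

```python
# Illustrative sketch of filters serialized as JSON-style dictionaries
# and applied in situ, where the table is resident. These dictionary
# shapes are assumptions, not Galyleo's actual wire format.

def apply_filter(table, filter_spec):
    """Return the rows of `table` selected by `filter_spec`."""
    col_names = [name for name, _type in table["columns"]]
    idx = col_names.index(filter_spec["column"])
    if filter_spec["operator"] == "IN_RANGE":      # range filter
        lo, hi = filter_spec["min"], filter_spec["max"]
        return [r for r in table["rows"] if lo <= r[idx] <= hi]
    if filter_spec["operator"] == "IN_LIST":       # select filter
        return [r for r in table["rows"] if r[idx] in filter_spec["values"]]
    raise ValueError(f"unknown operator {filter_spec['operator']}")

table = {
    "columns": [("month", "string"), ("deaths", "number")],
    "rows": [["Jan 1855", 2761], ["Feb 1855", 2120], ["Mar 1855", 1205]],
}
range_filter = {"operator": "IN_RANGE", "column": "deaths",
                "min": 2000, "max": 3000}
```

A messenger or REST table server would evaluate the same kind of dictionary on its side of the wire and return only the selected rows.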
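The options structure that a chart exports for the Chart Editor, described in the Extending Galyleo section above, can likewise be illustrated with plain dictionaries. The option names and the checker here are hypothetical, not the actual Galyleo structures:

```python
# Illustrative sketch of a chart's exported option specification and a
# matching options dictionary. The editor shows one UI widget per
# declared type. These names are assumptions, not Galyleo's format.

option_spec = {              # option name -> option type
    "title": "string",
    "backgroundColor": "color",
    "showLegend": "boolean",
}

def options_match_spec(options, spec):
    """True when `options` supplies a value for exactly the declared names."""
    return set(options) == set(spec)

options = {                  # same keys, with values in place of types
    "title": "Deaths by Month",
    "backgroundColor": "#ffffff",
    "showLegend": True,
}
```

This mirrors the optionSpec/options pairing: the specification carries types, the options dictionary carries values for the same keys.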
Fig. 6: The lively.next environment

Lively abstracts away HTML and CSS tags in graphical objects called "Morphs". Morphs [MS95] were invented as the user interface layer for Self [US87], and have been used as the foundation of the graphics system in Squeak and Scratch [MRR+ 10]. In this UI, every physical object is a Morph; these range from something as simple as a polygon or text string to a full application. Morphs are combined via composition, similar to the way that objects are grouped in a presentation or drawing program. The composition is simply another Morph, which in turn can be composed with other Morphs. In this manner, complex Morphs can be built up from collections of simpler ones. For example, a slider is simply the composition of a circle (the knob) with a thin, long rectangle (the bar). Each Morph can be individually programmed as a JavaScript object, or can inherit base-level behavior and extend it.
    In lively.next, each Morph turns into a snippet of HTML, CSS, and JavaScript code, and the entire application turns into a web page. The programmer doesn't see the HTML and CSS code directly; these are auto-generated. Instead, the programmer writes JavaScript code for both logic and configuration (to the extent that the configuration isn't done by hand). The code is bundled with the object and integrated in the web page.
    Morphs can be set as reusable components by a simple declaration. They can then be reused in any lively.next design.

Incorporating New Libraries

Libraries are typically incorporated into lively.next by attaching them to a convenient physical object, importing the library from a package manager such as npm, and then writing a small amount of code to expose the object's API. The simplest form of this is to assign the module to an instance variable so it has an addressable name, but typically a few convenience methods are written as well. In this way, a large number of libraries have been incorporated as reusable components in lively.next, including Google Maps, Google Charts [goo], Chart.js [cha], D3 [BOH11], Leaflet.js [lea], OpenLayers [ope], Cytoscape [ono], and many more.

Extending Galyleo's Charting and Visualization Capabilities

A Galyleo Chart is anything that changes its display based on tabular data from a Galyleo Table or Galyleo View. It responds to a specific API, which includes two principal methods:

   1)   drawChart: redraw the chart using the current tabular data from the input Table or View;
   2)   acceptsDataset(<Table or View>): returns a boolean depending on whether this chart can draw the data in this view. For example, a Table Chart can draw any tabular data; a Geo Chart typically requires that the first column be a place specifier.

    In addition, it has a read-only property:

   1)   optionSpec: A JSON structure describing the options for the chart. This is a dictionary which specifies the name of each option and its type (color, number, string, boolean, or enum with values given). Each type corresponds to a specific UI widget that the chart editor uses.

    And two read-write properties:

   1)   options: The current options, as a JSON dictionary. This matches exactly the JSON dictionary in optionSpec, with values in place of the types.
   2)   dataSource: A string, the name of the current Galyleo Table or Galyleo View.

    Typically, an extension to Galyleo's charting capabilities is done by incorporating the library as described in the previous section, implementing the API given in this section, and then publishing the result as a component.

Extending Galyleo's Widget Set

A widget is a graphical item used to filter data. It operates on a single column of any table in the current data set. It is either a range filter (which selects a range of numeric values) or a select filter (which selects a specific value, or a set of specific values). The API that is implemented consists only of properties:

   1)   valueChanged: a signal which is fired whenever the value of the widget is changed;
   2)   value: read-write. The current value of the widget;
   3)   filter: read-only. The current filter function, as a JSON object;
   4)   allValues: read-write, select filters only;
   5)   column: read-only. The name of the column of this widget. Set when the widget is created;
   6)   numericSpec: read-write. A dictionary containing the numeric specification for a numeric or date filter.

    Widgets are typically designed as a standard Lively graphical component, much as the slider described above.

Integration into JupyterLab: The Galyleo Extension

Galyleo is a standalone web application that is integrated into JupyterLab using an iframe inside a JupyterLab tab for physical design. A small JupyterLab extension was built that implements the JupyterLab editor API. The extension has two major functions: to handle read/write/undo requests from the JupyterLab menus and file browser, and to relay messages between the running Jupyter kernels and the Dashboard Studio -- both table updates sent from a kernel to the studio, and the reverse messages where the studio requests data from the kernel.
    Standard Jupyter and browser mechanisms are used. File system requests come to the extension from the standard Jupyter API, exactly the same requests and mechanisms that are sent to a Markdown or Notebook editor. The extension receives them, and then uses standard browser-based messaging (window.postMessage) to signal the standalone web app.
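The table-update path just described (kernel to extension to web app) can be sketched as a JSON envelope that the extension relays with window.postMessage. The field names here are hypothetical, not the actual Galyleo message format:

```python
import json

# Hypothetical sketch of the message a kernel-side client would hand to
# the JupyterLab extension for relay to the Dashboard Studio. The field
# names are assumptions, not the actual Galyleo message format.

def make_table_update(name, columns, rows):
    """Wrap a table in a JSON envelope for transmission to the studio."""
    return json.dumps({
        "method": "update_table",
        "name": name,
        "table": {"columns": columns, "rows": rows},
    })

message = make_table_update(
    "deaths",
    [["month", "string"], ["deaths", "number"]],
    [["Jan 1855", 2761]],
)
```

On the dashboard side, the studio would parse the envelope, replace the named table, and trigger the redraw cascade described earlier.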
Similarly, when the extension makes a request of JupyterLab, it does so through this mechanism, and a receiver in the extension gets it and makes the appropriate method calls within JupyterLab to achieve the objective.
    When a kernel makes a request through the Galyleo Client, this is handled exactly the same way. A Jupyter messaging server within the extension receives the message from the kernel, and then uses browser messaging to contact the application with the request, and does the reverse on a Galyleo message to the kernel.

Fig. 7: Galyleo Extension Architecture

This is a highly efficient method of interaction, since browser-

of environments hosted by a server is arbitrary, and the cost is only the cost of maintaining the Dockerfile for each environment.
    An environment is easy to design for a specific class, project, or task; it's simply adding libraries and executables to a base Dockerfile. It must be tested, of course, but everything must be. And once it is tested, the burden of software maintenance and installation is removed from the user; the user is already in a task-customized, curated environment. Of course, the usual installation tools (apt, pip, conda, easy_install) can be pre-loaded (they're just executables), so if the environment designer missed something it can be added by the end user.
    Though a user can only be in one environment at a time, persistent storage is shared across all environments, meaning switching environments is simply a question of swapping one environment out and starting another.
    Viewed in this light, a JupyterHub is a multi-purpose computer in the Cloud, with an easy-to-use UI that presents through a browser. JupyterLab isn't simply an IDE; it's the window system and user interface for this computer. The JupyterLab launcher is the desktop for this computer (and it changes what's presented, depending on the environment); the file browser is the computer's file browser; and the JupyterLab API is the equivalent of the Windows or MacOS desktop APIs and window system that permits third parties to build applications for this.
based messaging is in-memory transactions on the client machine.            This Jupyter Computer has a large number of advantages over
    It’s important to note that there is nothing Galyleo-specific      a standard desktop or laptop computer. It can be accessed from any
about the extension: the Galyleo Extension is a general method         device, anywhere on Earth with an Internet connection. Software
for any standalone web editor (e.g., a slide or drawing editor) to     installation and maintenance issues are nonexistent. Data loss due
be integrated into JupyterLab. The JupyterLab connection is a few      to hardware failure is extremely unlikely; backups are still required
tens of lines of code in the Galyleo Dashboard. The extension is       to prevent accidental data loss (e.g., erroneous file deletion), but
slightly more complex, but it can be configured for a different        they are far easier to do in a Cloud environment. Hardware
application with a simple data structure which specifies the URL       resources such as disk, RAM, and CPU can be added rapidly,
of the application, file type and extension to be manipulated, and     on a permanent or temporary basis. Relatively exotic resources
message list.                                                          (e.g., GPUs) can also be added, again on an on-demand, temporary
                                                                            The advantages go still further than that. Any resource that
The Jupyter Computer                                                   can be accessed over a network connection can be added to
The implications of the Galyleo Extension go well beyond vi-           the Jupyter Computer simply by adding the appropriate accessor
sualization and dashboards and easy publication in JupyterLab.         library to an environment’s Dockerfile. For example, a database
JupyterLab is billed as the next-generation integrated Develop-        solution such as Snowflake, BigQuery, or Amazon Aurora (or
ment Environment for Jupyter, but in fact it is substantially more     one of many others) can be "installed" by adding the relevant
than that. It is the user interface and windowing system for Cloud-    library module to the environment. Of course, the user will need
based personal computing. Inspired by previous extensions such         to order the database service from the relevant provider, and obtain
as the Vega Extension, the Galyleo Extensions seeks to provide         authentication tokens, and so on -- but this is far less troublesome
the final piece of the puzzle.                                         than even maintaining the library on the desktop.
    Consider a Jupyter server in the Cloud, served from a Jupyter-          However, to date the Jupyter Computer only supports a few
Hub such as the Berkeley Data Hub. It’s built from a base              window-based applications, and adding a new application is a
Ubuntu image, with the standard Jupyter libraries installed and,       time-consuming development task. The applications supported are
importantly, a UI that includes a Linux terminal interface. Any        familiar and easy to enumerate: a Notebook editor, of course; a
Linux executable can be installed in the Jupyter server image, as      Markdown Viewer; a CSV Viewer; a JSON Viewer (not inline
can any Jupyter kernel, and any collection of libraries. The Jupyter   editor), and a text editor that is generally used for everything from
server has per-user persistent storage, which is organized in a        Python files to Markdown to CSV.
standard Linux filesystem. This makes the Jupyter server a curated          This is a small subset of the rich range of JavaScript/HTML5
execution environment with a Linux command-line interface and          applications which have significant value for Jupyter Computer
a Notebook interface for Jupyter execution.                            users. For example, the Ace Code Editor supports over 110
    A JupyterHub similar to Berkeley Data Hub (essentially,            languages and has the functionality of popular desktop editors
anything built from Zero 2 Jupyter Hub or Q-Hub) comes with a          such as Vim and Sublime Text. There are over 1100 open-source
number of "environments". The user chooses the environment on          drawing applications on the JavaScript/HTML5 platform; multiple
startup. Each environment comes with a built-in set of libraries and   spreadsheet applications, the most notable being jExcel, and many
executables designed for a specific task or set of tasks. The number   more.
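The accessor-library pattern described above can be sketched in Python. Here the standard-library sqlite3 module stands in for a Cloud database client such as Snowflake or BigQuery (whose real connectors require a provisioned service and credentials); the environment "installs" the database simply by shipping the client module, and the notebook user only imports and queries. The table and query are illustrative, not from the paper.

```python
import sqlite3

def run_query(sql):
    """Run a query against a database reached through an accessor library.

    In a real environment the Dockerfile would pre-install a client such as
    a Snowflake or BigQuery connector; sqlite3 (in-memory) stands in here so
    the sketch is self-contained and runnable.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE waves (height REAL)")
    conn.executemany("INSERT INTO waves VALUES (?)", [(1.2,), (2.5,), (0.7,)])
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

print(run_query("SELECT COUNT(*), MAX(height) FROM waves"))  # [(3, 2.5)]
```

Swapping the connector module (and connection parameters) is the only change needed to point the same notebook code at a hosted service, which is the sense in which a database is "installed" by editing the environment's Dockerfile.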
GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION                                                                               19
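As described above, the Galyleo Extension is configured for a new standalone editor with a simple data structure giving the application's URL, the file type it manipulates, and related metadata. The paper does not publish the schema, so the field names below are assumptions for illustration; the sketch just builds and round-trips such a configuration as JSON.

```python
import json

# Hypothetical configuration for registering a standalone web editor with the
# Galyleo Extension. All field names are illustrative, not the actual schema.
config = {
    "url": "https://example.com/my-drawing-editor",   # where the app is served
    "extension": ".draw",                             # file type it reads/writes
    "launcher_icon": "https://example.com/icon.svg",  # image for the launcher
    "name": "Drawing Editor",                         # entry in the file menu
}

text = json.dumps(config, indent=2)
print(text)

# A registration routine could round-trip and sanity-check the file:
loaded = json.loads(text)
assert loaded["extension"].startswith(".")
```

The point of such a data-driven registration is that adding an application becomes editing a small JSON file rather than writing and compiling a TypeScript extension.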

  Fig. 8: Galyleo Extension Application-Side Messaging

  Fig. 9: Generations of Internet Computing

    Up until now, adding a new application to JupyterLab involved writing a hand-coded extension in TypeScript and compiling it into JupyterLab. However, the Galyleo Extension has been designed so that any HTML5/JavaScript application can be added easily, simply by configuring the Galyleo Extension with a small JSON file.
    The promise of the Galyleo Extension is that it can be adapted to any open-source JavaScript/HTML5 application very easily. The Galyleo Extension merely needs:

   •   the URL of the application
   •   the file extension that the application reads/writes
   •   the URL of an image for the launcher
   •   the name of the application for the file menu

    The application must implement a small messaging client, using the standard JavaScript messaging interface, and implement the calls the Galyleo Extension makes. The conceptual picture is shown in Figure 8. And it must support (at a minimum) messages to read and write the file being edited.

The Third Generation of Network Computing
The World-Wide Web and email comprised the first generation of Internet computing (the Internet had been around for a decade before the Web, and earlier networks dated from the sixties, but the Web and email were the first mass-market applications on the network), and they were very simple -- both were document-exchange applications, using slightly different protocols. The second generation of network applications were the siloed productivity applications, where standard desktop applications moved to the Cloud. The most famous examples are of course GSuite and Office 365, but there were and are many others -- Canva, Loom, Picasa, as well as a large number of social/chat/social-media applications. What they all had in common was that they were siloed applications which, with the exception of the office suites, didn't even share a common store. In many ways, this second generation of network applications recapitulates the era immediately prior to the introduction of the personal computer. That era was dominated by single-application computers such as word processors, which were simply computers with a hardcoded program loaded into ROM.
    The Word Processor era was due to technological limitations -- the processing power and memory to run multiple programs simply weren't available on low-end hardware, and PC operating systems didn't yet exist. In some sense, the current second generation of Internet Computing suffers from similar technological constraints. The "Operating System" for Internet Computing doesn't yet exist. The Jupyter Computer can provide it.
    To see the difference that this can make, consider LaTeX (perhaps preceded by Docutils, as is the case for SciPy) preparation of a document. On a personal computer, it's fairly straightforward: the user uses any of a wide variety of text editors to prepare the document, any of a wide variety of productivity and illustration programs to prepare the images, and runs these through a local sequence of commands (e.g., pdflatex paper; bibtex paper; pdflatex paper). Usually GitHub or another repository is used for storage and collaboration.
    In a Cloud service, this is another matter. There is at most one editor, selected by the service, on the site. There is no image-editing or illustration program that reads and writes files on the site. Auxiliary tools, such as a bibliography searcher, aren't present or aren't customizable. The service has its own siloed storage, its own text editor, and its own document-preparation pipeline. The tools (aside from the core document-preparation program) are primitive. The online service has two advantages over the personal-device experience: collaboration is generally built-in, with multiple people having access to the project, and the software need not be maintained. Aside from that, the personal-device experience is generally superior. In particular, the user is free to pick their own editor, and doesn't have to orchestrate multiple downloads and uploads from various websites. The usual collection of command-line utilities is available for small touchups.
    The third generation of Internet Computing is represented by the Jupyter Computer. It offers a Cloud experience similar to the personal computer, but with the scalability, reliability, and ease of collaboration of the Cloud.

Conclusion and Further Work
The vision of the Jupyter Computer, bringing the power of the Cloud to the personal computing experience, has been started with Galyleo. It will not end there. At the heart of it is a composition of two broadly popular platforms: HTML5/JavaScript for presentation and interaction, and the various Jupyter kernels for server-side analytics. Galyleo is a start at seamless interaction between these two platforms. Continuing and extending this is the further development of narrow-waist protocols to permit maximally independent development and extension.

Acknowledgements
The authors wish to thank Alex Yang and Diptorup Deb for their insightful comments, and Meghann Agarwal for stewardship. We have received invaluable help from Robert Krahn, Marko Röder, Jens Lincke and Linus Hagemann. We thank the engageLively team for all of their support and help: Tim Braman, Patrick Scaglia, Leighton Smith, Sharon Zehavi, Igor Zhukovsky, Deepak Gupta, Steve King, Rick Rasmussen, Patrick McCue, Jeff Wade, Tim Gibson. The JupyterLab development community has been helpful and supportive; we want to thank Tony Fast, Jason Grout, Mehmet Bektas, Isabela Presedo-Floyd, Brian
Granger, and Michal Krassowski. The engageLively Technology Advisory Board has helped shape these ideas: Ani Mardurkar, Priya Joseph, David Peterson, Sunil Joshi, Michael Czahor, Isha Oke, Petrus Zwart, Larry Rowe, Glenn Ricart, Antony Ng. We want to thank the people from the AWS team who have helped us tremendously: Matt Vail, Omar Valle, Pat Santora. Galyleo has been dramatically improved with the assistance of our Japanese colleagues at KCT and Pacific Rim Technologies: Yoshio Nakamura, Ted Okasaki, Ryder Saint, Yoshikazu Tokushige, and Naoyuki Shimazaki. Our understanding of Jupyter in an academic context came from our colleagues and friends at Berkeley, the University of Victoria, and UBC: Shawna Dark, Hausi Müller, Ulrike Stege, James Colliander, Chris Holdgraf, Nitesh Mor. Use of Jupyter in a research context was emphasized by Andrew Weidlea, Eli Dart, Jeff D'Ambrogia. We benefited enormously from the CITRIS Foundry: Alic Chen, Jing Ge, Peter Minor, Kyle Clark, Julie Sammons, Kira Gardner. The Alchemist Accelerator was central to making this product: Ravi Belani, Arianna Haider, Jasmine Sunga, Mia Scott, Kenn So, Aaron Kalb, Adam Frankl. Kris Singh was a constant source of inspiration and help. Larry Singer gave us tremendous help early on. Vibhu Mittal more than anyone inspired us to pursue this road. Ken Lutz has been a constant sounding board and inspiration, and worked hand-in-hand with us to develop this product. Our early customers and partners have been and continue to be a source of inspiration, support, and experience that is absolutely invaluable: Jonathan Tan, Roger Basu, Jason Koeller, Steve Schwab, Michael Collins, Alefiya Hussain, Geoff Lawler, Jim Chimiak, Fraukë Tillman, Andy Bavier, Andy Milburn, Augustine Bui. All of our customers are really partners, none more so than the fantastic teams at Tanjo AI and Ultisim: Bjorn Nordwall, Ken Lane, Jay Sanders, Eric Smith, Miguel Matos, Linda Bernard, Kevin Clark, and Richard Boyd. We want to especially thank our investors, who bet on this technology and company.
  USACE Coastal Engineering Toolkit and a Method of
        Creating a Web-Based Application
                           Amanda Catlett‡∗, Theresa R. Coumbe‡, Scott D. Christensen‡, Mary A. Byrant‡


Abstract—In the early 1990s the Automated Coastal Engineering System, ACES, was created with the goal of providing state-of-the-art computer-based tools to increase the accuracy, reliability, and cost-effectiveness of Corps coastal engineering endeavors. Over the past 30 years, ACES has become less and less accessible to engineers. An updated version of ACES was necessary for use in coastal engineering. Our goal was to bring the tools in ACES to a user-friendly web-based dashboard that would allow a wide range of users to easily and quickly visualize results. We will discuss how we restructured the code using class inheritance and the three libraries Param, Panel, and HoloViews to create an extensible, interactive, graphical user interface. We have created the USACE Coastal Engineering Toolkit, UCET, a web-based application that contains 20 of the tools in ACES. UCET serves as an outline for the process of taking a model or set of tools and developing a web-based application that can produce visualizations of the results.

Index Terms—GUI, Param, Panel, HoloViews

* Corresponding author:
‡ ERDC

Copyright © 2022 Amanda Catlett et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction
The Automated Coastal Engineering System (ACES) was developed in response to the charge by LTG E. R. Heiberg III, who was the Chief of Engineers at the time, to provide improved design capabilities to the Corps coastal specialists. [Leenknecht] In 1992, ACES was presented as an interactive computer-based design and analysis system in the field of coastal engineering. The tools consist of six functional areas: Wave Prediction, Wave Theory, Structural Design, Wave Runup Transmission and Overtopping, Littoral Processes, and Inlet Processes. These functional areas range from classical theory describing wave motion, to expressions resulting from tests of structures in wave flumes, to numerical models describing the exchange of energy from the atmosphere to the sea surface. The math behind these uses anything from simple algebraic expressions, both theoretical and empirical, to numerically intense algorithms. [Leenknecht][UG][shankar]
    Originally, ACES was written in FORTRAN 77, resulting in a decreased ability to use the tool as technology has evolved. In 2017, the codebase was converted from FORTRAN 77 to MATLAB and Python. This conversion ensured that coastal engineers using this tool base would not need training in yet another coding language. In 2020, the Engineered Resilient Systems (ERS) Rapid Application Development (RAD) team undertook the project with the goal of deploying the ACES tools as a web-based application, and ultimately renamed it the USACE Coastal Engineering Toolkit (UCET).
    The RAD team focused on updating the Python codebase utilizing Python's object-oriented programming and the newly developed HoloViz ecosystem. The team refactored the code to implement inheritance so the code is clean, readable, and scalable. The tools were given a Graphical User Interface (GUI) so the implementation as a web app would provide a user-friendly experience. This was done by using the HoloViz-maintained libraries: Param, Panel, and HoloViews.
    This paper will discuss some of the steps that were taken by the RAD team to update the Python codebase to create a Panel application of the coastal engineering tools: in particular, refactoring the input and output variables with the Param library, the class hierarchy used, and the utilization of Panel and HoloViews for a user-friendly experience.

Refactoring Using Param
Each coastal tool in UCET has two classes, the model class and the GUI class. The model class holds input and output variables and the methods needed to run the model, whereas the GUI class holds information for GUI visualization. To make implementation of the GUI more seamless we refactored model variables to utilize the Param library. Param is a library that has the goal of simplifying the codebase by letting the programmer explicitly declare the types and values of parameters accepted by the code. Param can also be seamlessly used when implementing the GUI through Panel and HoloViews.
    Each UCET tool's model class declares the input and output values used in the model as class parameters. Each input and output variable is declared and given the following metadata features:

   •   default: each input variable is defined as a Param with a default value defined from the 1992 ACES user manual
   •   bounds: each input variable is defined with range values defined in the 1992 ACES user manual
   •   doc or docstrings: input and output variables have the expected variable and description of the variable defined as a doc. This is used as a label over the input and output widgets. Most docstrings follow the pattern of <variable>: <description of variable [units, if any]>
   •   constant: the output variables all set constant equal to True, thereby restricting the user's ability to manipulate the
USACE COASTAL ENGINEERING TOOLKIT AND A METHOD OF CREATING A WEB-BASED APPLICATION                                                       23

        value. Note that when calculations are being done, they need to be performed inside a with param.edit_constant(self) block.
    •   precedence: input and output variables use precedence in instances where the variable does not need to be seen.

    The following is an example of an input parameter:

H = param.Number(
    doc='H: wave height [{distance_unit}]',
    default=6.3,
    bounds=(0.1, 200)
)

An example of an output variable is:

L = param.Number(
    doc='L: Wavelength [{distance_unit}]',
    constant=True
)

The model's main calculation functions mostly remained unchanged. However, the use of Param eliminated the need for code that handled type checking and bounds checking.

Class Hierarchy

UCET has twenty tools from six of the original seven functional areas of ACES. When we designed our class hierarchy, we focused on the visualization of the web application rather than on functional areas. Thus, each tool's class can be categorized as Base-Tool, Graph-Tool, Water-Tool, or Graph-Water-Tool. The Base-Tool holds the coastal engineering models that have no water property inputs (such as water density) in their calculations and no graphical output. The Graph-Tool holds the coastal engineering models that have no water property inputs in their calculations but do have graphical output. The Water-Tool holds the coastal engineering models that have water property inputs in their calculations and no graphical output. The Graph-Water-Tool holds the coastal engineering models that have both water property inputs in their calculations and graphical output. Figure 1 shows the flow of inheritance for each of those classes.
    There are two general categories for the classes in the UCET codebase: utility and tool-specific. Utility classes have methods and functions that are used across more than one tool. The utility classes are:

    •   BaseDriver: holds the methods and functions that each tool needs to collect data, run coastal engineering models, and print data.
    •   WaterDriver: has the methods that make water density and water weight available to the models that need those inputs for their calculations.
    •   BaseGui: has the functions and methods for the visualization and utilization of all inputs and outputs within each tool's GUI.
    •   WaterTypeGui: has the widget for water selection.
    •   TabulatorDataGui: holds the functions and methods used for visualizing plots and for downloading the data used for plotting.

    Each coastal tool in UCET has two classes: the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model. The model class inherits directly from either the BaseDriver or the WaterTypeDriver. The tool's GUI class holds information for GUI visualization that differs from the BaseGui, WaterTypeGui, and TabulatorDataGui classes. In Figure 1 the model classes are labeled as Base-Tool Class, Graph-Tool Class, Water-Tool Class, and Graph-Water-Tool Class, and each has a corresponding GUI class.
    Due to the inheritance in UCET, the first two questions to ask when adding a tool are: "Does this tool need water variables for the calculation?" and "Does this tool have a graph?". The developer can then add a model class and a GUI class and inherit based on Figure 1. For instance, Linear Wave Theory is an application that yields first-order approximations for various parameters of wave motion as predicted by the wave theory. It provides common items of interest such as water surface elevation, general wave properties, particle kinematics, and pressure as a function of wave height and period, water depth, and position in the wave form. This tool uses water density and has multiple graphs in its output. Therefore, Linear Wave Theory is considered a Graph-Water-Tool: its model class inherits from WaterTypeDriver, and its GUI class inherits from the Linear Wave Theory model class, WaterTypeGui, and TabularDataGui.

GUI Implementation Using Panel and HoloViews

Each UCET tool has a GUI class where the Panel and HoloViews libraries are used. Panel is a hierarchical container that can lay out panes, widgets, or other Panels in an arrangement that forms an app or dashboard. A Pane is used to render any widget-like object such as a Spinner, Tabulator, Button, Checkbox, Indicator, etc. Those widgets are used to gather user input and run the specific tool's model.
    UCET utilizes the following widgets to gather user input:

    •   Spinner: single numeric input values
    •   Tabulator: table input data
    •   Checkbox: true or false values
    •   Drop-down: items that have a list of pre-selected values, such as which units to use

    UCET utilizes indicators.Number, Tabulator, and graphs to visualize the outputs of the coastal engineering models. A single number is shown using indicators.Number, and the Tabulator widget displays the data behind each graph. The graphs are created using HoloViews and have tool options such as panning, zooming, and saving. Buttons are used to calculate, save the current run, and save the graph data.
    All of these widgets are organized into five panels: title, options, inputs, outputs, and graph. The BaseGui, WaterTypeGui, and TabularDataGui classes have methods that organize the widgets within the five panels that most tools follow. The "options" panel has a row that holds the drop-down selections for units and water type (if the tool is a Water-Tool). Some tools have a second row in the "options" panel with other drop-down options. The input panel has two columns of spinner widgets with a calculation button at the bottom left. The output panel has two columns of indicators.Number widgets for the single numeric output values. At the bottom of the output panel there is a button to "save the current profile". The graph panel is tabbed: the first tab shows the graph, and the second tab shows the data behind the graph. A visual outline of this can be seen in the following figure. Some of the UCET tools have more complicated input or output visualizations, and those tools' GUI classes add or modify methods to meet their needs.
    The general outline of a UCET tool for the GUI.
24                                                                                       PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

Current State

UCET approaches software development from the perspective of someone within the field of research and development. Each tool within UCET is not inherently complex from the traditional software perspective. However, this codebase enables researchers to execute complex coastal engineering models in a user-friendly environment by leveraging open-source libraries in the scientific Python ecosystem such as Param, Panel, and HoloViews.
    Currently, UCET is deployed only through a command-line `panel serve` command. UCET is awaiting the Security Technical Implementation Guide process before it can be launched as a website. As part of this security vetting process we plan to leverage continuous integration/continuous deployment (CI/CD) tools to automate the deployment process. While this process is under way, we have started to gather feedback from coastal engineers to improve the tools' usability and accuracy and to add suggested features. To minimize the amount of computer science knowledge the coastal engineers need, our team created a batch script. This script creates a conda environment, activates it, and runs the panel serve command to launch the app on a local host. The user only needs to click on the batch script for this to take place.
    Other tests are being created to ensure the accuracy of the tools, using a testing framework to compare output from UCET with that of the original FORTRAN code. The biggest barrier to this testing strategy is getting data from the FORTRAN to compare with the Python. Currently, there are tests for most of the tools that read a CSV file of input and output results from the FORTRAN and compare them with what the Python code calculates.
    Our team has also compiled an updated user guide on how to use the tool, what to expect from it, and a deeper description of any warning messages that might appear as the user enters input values. For example, if a user chooses input values such that the application does not make physical sense, a warning message will appear under the output header and replace all output values. For a more concrete example: Linear Wave Theory takes a vertical coordinate (z) and the water depth (d) as input values, and when their sum is less than zero the point is outside the waveform. Therefore, if a user enters a combination whose sum is less than zero, UCET will post a warning telling the user that the point is outside the waveform. See the figure below for an example. The developers have been documenting this project using GitHub and JIRA.
    An example of a warning message based on chosen inputs.

Results

Linear Wave Theory was described in the class hierarchy example. This Graph-Water-Tool utilizes most of the BaseGui methods. The biggest difference is that instead of having three graphs in the graph panel, there is a plot-selector drop-down where the user can select which graph to see.
    Windspeed Adjustment and Wave Growth provides a quick and simple estimate for wave growth over open-water and restricted fetches in deep and shallow water. This is a Base-Tool, as there are no graphs and no water variables in the calculations. This tool has four additional options in the options panel, where the user can select the wind observation type, fetch type, wave equation type, and whether knots are being used. Based on these selections, the input and output variables change so that only what is used or calculated for those selections is seen.

Conclusion and Future Work

Thirty years ago, ACES was developed to provide improved design capabilities to Corps coastal specialists, and while these tools are still used today, it became more and more difficult for users to access them. Five years ago, there was a push to update the codebase to languages coastal specialists would be more familiar with: MATLAB and Python. Within the last two years the RAD team was able to finalize the update so that users can access these tools without having years of programming experience. We were able to do this by utilizing classes, inheritance, and the Param, Panel, and HoloViews libraries. The use of inheritance has allowed for shorter codebases and has made it possible to add new tools to the toolkit. Param, Panel, and HoloViews work cohesively not only to run the models but to provide a simple interface.
    Future work will involve expanding UCET to include current coastal engineering models, and completing the security vetting process to deploy to a publicly accessible website. We plan to incorporate automated CI/CD to ensure smooth deployment of future versions. We will also continue to incorporate feedback from users and refine the code to ensure the application provides a quality user experience.
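The out-of-waveform warning described for Linear Wave Theory (posted when z + d < 0) can be sketched as a plain-Python check; the function name and message text here are illustrative, not UCET's actual API.

```python
def point_in_waveform_warning(z, d):
    """Return a warning string when the vertical coordinate (z) plus
    the water depth (d) is negative, meaning the point lies outside
    the waveform; return None when the inputs make physical sense."""
    if z + d < 0:
        return "Warning: the point is outside the waveform (z + d < 0)."
    return None

# In the GUI, a non-None result would replace all output values
# under the output header.
message = point_in_waveform_warning(z=-12.0, d=10.0)
```

Centralizing checks like this lets every tool replace its outputs with a consistent warning instead of presenting physically meaningless numbers.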

REFERENCES

[Leenknecht]  David A. Leenknecht, Andre Szuwalski, and Ann R. Sherlock. 1992. Automated Coastal Engineering System - Technical Reference. Technical report.
[panel]       "Panel: A High-Level App and Dashboarding Solution for Python." Panel 0.12.6 Documentation, Panel Contributors.
[holoviz]     "High-Level Tools to Simplify Visualization in Python." HoloViz 0.13.0 Documentation, HoloViz Authors, 2017.
[UG]          David A. Leenknecht, et al. "Automated Tools for Coastal Engineering." Journal of Coastal Research, vol. 11, no. 4, Coastal Education & Research Foundation, Inc., 1995, pp. 1108-24.
[shankar]     N. J. Shankar and M. P. R. Jayaratne. "Wave run-up and overtopping on smooth and rough slopes of coastal structures." Ocean Engineering, Volume 30, Issue 2, 2003, Pages 221-238.

            Fig. 1: Screen shot of Linear Wave Theory

  Fig. 2: Screen shot of Windspeed Adjustment and Wave Growth

               Search for Extraterrestrial Intelligence: GPU
                         Accelerated TurboSETI
                                                    Luigi Cruz‡∗ , Wael Farah‡ , Richard Elkins‡


Abstract—A common technique adopted by the Search For Extraterrestrial Intelligence (SETI) community is monitoring electromagnetic radiation for signs of extraterrestrial technosignatures using ground-based radio observatories. The analysis is done using Python-based software called TurboSETI to detect narrowband drifting signals inside the recordings that could indicate a technosignature. The data stream generated by a telescope can easily reach a rate of terabits per second. Our goal was to improve processing speeds by writing a GPU-accelerated backend in addition to the original CPU-based implementation of the de-doppler algorithm used to integrate the power of drifting signals. We discuss how we ported a CPU-only program to leverage the parallel capabilities of a GPU using CuPy, Numba, and custom CUDA kernels. The accelerated backend reached a speed-up of an order of magnitude over the CPU implementation.

Index Terms—gpu, numba, cupy, seti, turboseti

1. Introduction

The Search for Extraterrestrial Intelligence (SETI) is a broad term describing the effort to locate any scientific proof of past or present technology that originated beyond the bounds of Earth. SETI can be performed in a plethora of ways: actively, by deploying orbiters and rovers around planets and moons within the solar system, or passively, by searching for biosignatures in exoplanet atmospheres or "listening" for technologically-capable extraterrestrial civilizations. One of the most common techniques adopted by the SETI community is monitoring electromagnetic radiation for narrowband signs of technosignatures using ground-based radio observatories. This search can be performed in multiple ways: with equipment primarily built for this task, like the Allen Telescope Array (California, USA); by renting observation time; or in the background while the primary user is conducting other observations. Other radio observatories useful for this search include the MeerKAT Telescope (Northern Cape, South Africa), the Green Bank Telescope (West Virginia, USA), and the Parkes Telescope (New South Wales, Australia). The operation of a radio telescope is similar to that of an optical telescope. Instead of using optics to concentrate light onto an optical sensor, a radio telescope operates by concentrating electromagnetic waves onto an antenna using a large reflective structure called a "dish" ([Reb82]). The interaction between the metallic antenna and the electromagnetic wave generates a faint electrical current. This effect is then quantized by an analog-to-digital converter as voltages and transmitted to processing logic that extracts useful information from it. The data stream generated by a radio telescope can easily reach a rate of terabits per second because of the ultra-wide bandwidth of the radio spectrum. The current workflow utilized by Breakthrough Listen, the largest scientific research program aimed at finding evidence of extraterrestrial intelligence, consists of pre-processing and storing the incoming data as frequency-time binary files ([LCS+19]) in persistent storage for later analysis. This post-analysis is made possible using Python-based software called TurboSETI ([ESF+17]) to detect narrowband signals that could be drifting in frequency owing to the relative radial velocity between the observer on Earth and the transmitter. The offline processing speed of TurboSETI is directly related to the scientific output of an observation. Each voltage file ingested by TurboSETI is often on the order of a few hundred gigabytes. To process data efficiently without Python overhead, the program uses Numpy for near machine-level performance. To measure a potential signal's drift rate, TurboSETI uses a de-doppler algorithm to align the frequency axis according to a pre-set drift rate. Another algorithm called "hitsearch" ([ESF+17]) is then used to identify any signal present in the recorded spectrum. These two algorithms are the most resource-hungry elements of the pipeline, consuming almost 90% of the running time.

* Corresponding author:
‡ SETI Institute

Copyright © 2022 Luigi Cruz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. Approach

Multiple methods were utilized in this effort to write a GPU-accelerated backend and optimize the CPU implementation of TurboSETI. In this section, we describe the three main methods.

2.1. CuPy

The original implementation of TurboSETI heavily depends on Numpy ([HMvdW+20]) for data processing. To keep the number of modifications as low as possible, we implemented the GPU-accelerated backend using CuPy ([OUN+17]). This open-source library offers GPU acceleration backed by NVIDIA CUDA and AMD ROCm while using a Numpy-style API. This enabled us to reuse most of the code between the CPU- and GPU-based implementations.

2.2. Numba

Some computationally heavy methods of the original CPU-based implementation of TurboSETI were written in Cython. This approach has disadvantages: the developer has to be familiar with Cython syntax to alter the code, and the code requires additional logic

  Double-Precision (float64)                                              4. Conclusion
  Impl.        Device      File A      File B          File C             The original implementation of TurboSETI worked exclusively
  Cython       CPU         0.44 min    25.26 min       23.06 min          on the CPU to process data. We implemented a GPU-accelerated
  Numba        CPU         0.36 min    20.67 min       22.44 min          backend to leverage the massive parallelization capabilities of a
  CuPy         GPU         0.05 min    2.73 min        3.40 min           graphical device. The benchmark performed shows that the new
                                                                          CPU and GPU implementation takes significantly less time to
                                   TABLE 1                                process observation data resulting in more science being produced.
 Double precision processing time benchmark with Cython, Numba and CuPy   Based on the results, the recommended configuration to run the
                                                                          program is with single-precision calculations on a GPU device.

   Single-Precision (float32)                                             R EFERENCES
                                                                          [ESF+ 17]   J. Emilio Enriquez, Andrew Siemion, Griffin Foster, Vishal
   Impl.       Device      File A       File B          File C
                                                                                      Gajjar, Greg Hellbourg, Jack Hickish, Howard Isaacson,
   Numba       CPU         0.26 min     16.13 min       16.15 min                     Danny C. Price, Steve Croft, David DeBoer, Matt Lebof-
   CuPy        GPU         0.03 min     1.52 min        2.14 min                      sky, David H. E. MacMahon, and Dan Werthimer. The
                                                                                      breakthrough listen search for intelligent life: 1.1–1.9
                                   TABLE 2                                            ghz observations of 692 nearby stars.            The Astrophys-
     Single precision processing time benchmark with Numba and CuPy                   ical Journal, 849(2):104, Nov 2017.             URL: https://ui.
                              implementation.                               , doi:
                                                                          [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der
                                                                                      Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric
                                                                                      Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith,
to be compiled at installation time. Consequently, it was decided                     Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van
                                                                                      Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del
to replace Cython with pure Python methods decorated with the                         Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant,
Numba ([LPS15]) accelerator. By leveraging the power of the Just-                     Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer
In-Time (JIT) compiler from Low Level Virtual Machine (LLVM),                         Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro-
                                                                                      gramming with NumPy. Nature, 585(7825):357–362, Septem-
Numba can compile Python code into assembly code as well                              ber 2020. URL:,
as apply Single Instruction/Multiple Data (SIMD) acceleration                         doi:10.1038/s41586-020-2649-2.
instructions to achieve near machine-level speeds.                        [LCS 19]
                                                                              +       Matthew Lebofsky, Steve Croft, Andrew P. V. Siemion,
                                                                                      Danny C. Price, J. Emilio Enriquez, Howard Isaacson, David
                                                                                      H. E. MacMahon, David Anderson, Bryan Brzycki, Jeff Cobb,
2.2. Single-Precision Floating-Point                                                  Daniel Czech, David DeBoer, Julia DeMarines, Jamie Drew,
The original implementation of the software handled the input                         Griffin Foster, Vishal Gajjar, Nectaria Gizani, Greg Hellbourg,
                                                                                      Eric J. Korpela, and Brian Lacki. The breakthrough listen
data as double-precision floating-point numbers. This behavior                        search for intelligent life: Public data, formats, reduction, and
would cause all the mathematical operations to take significantly                     archiving. Publications of the Astronomical Society of the
longer to process because of the extended precision. The ultimate                     Pacific, 131(1006):124505, Nov 2019. URL:
                                                                                      abs/1906.07391, doi:10.1088/1538-3873/ab3e82.
precision of the output product is inherently limited by the preci-       [LPS15]     Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba:
sion of the original input data, which in most cases is represented by an 8-bit signed integer. Therefore, the addition of a single-precision floating-point number decreased the processing time without compromising the useful precision of the output data.

3. Results

To test the speed improvements between implementations, we used files from previous observations coming from different observatories. Table 1 indicates the times it took to process three different files in double-precision mode. The CPU implementation based on Numba is measurably faster than the original CPU implementation based on Cython. At the same time, the GPU-accelerated backend processed the data from 6.8 to 9.3 times faster than the original CPU-based implementation.
     Table 2 indicates the same results as Table 1 but with single-
precision floating points. The original Cython implementation was
left out because it doesn’t support single-precision mode. Here,
the same data was processed from 7.5 to 10.6 times faster than the
Numba CPU-based implementation.
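To make the precision trade-off concrete, here is a minimal, hypothetical NumPy sketch (our illustration, not TurboSETI's actual code). Since the raw samples are 8-bit integers, accumulating them in single precision halves the bytes moved per element, and for integer-valued sums of this magnitude (well below 2**24) the float32 result loses nothing against float64:

```python
import numpy as np

# Hypothetical sketch (not TurboSETI code): accumulate 8-bit integer samples
# in single vs. double precision. float32 halves the per-element storage,
# and integer-valued partial sums below 2**24 stay exact in float32.
rng = np.random.default_rng(seed=1)
raw = rng.integers(-128, 128, size=(1024, 4096), dtype=np.int8)

acc32 = raw.astype(np.float32).sum(axis=0)  # 4 bytes per element
acc64 = raw.astype(np.float64).sum(axis=0)  # 8 bytes per element

print(acc32.itemsize, acc64.itemsize)        # 4 8
print(float(np.max(np.abs(acc32 - acc64))))  # 0.0 -- no precision lost here
```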
     To illustrate the processing time improvement, a single obser-
vation containing 105 GB of data was processed in 12 hours by the
original CPU-based TurboSETI implementation on an i7-7700K
Intel CPU, and just 1 hour and 45 minutes by the GPU-accelerated
backend on a GTX 1070 Ti NVIDIA GPU.
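A large part of the GPU backend's appeal is that CuPy exposes a NumPy-compatible API, so the same array code can run on either device. The sketch below is a hypothetical illustration of that portability, using a naive brute-force drift sum as a toy workload (not TurboSETI's Taylor-tree de-Doppler algorithm): swapping the array module moves the whole computation to the GPU when CuPy is available.

```python
import numpy as np

try:
    import cupy as xp  # run on the GPU when CuPy is available
except ImportError:
    xp = np            # otherwise fall back to NumPy on the CPU

def drift_sums(spectra, max_drift):
    """Integrate power along linear drift lines (brute-force toy,
    not the Taylor-tree algorithm TurboSETI actually uses)."""
    n_t, n_f = spectra.shape
    out = xp.zeros((2 * max_drift + 1, n_f), dtype=spectra.dtype)
    for k, drift in enumerate(range(-max_drift, max_drift + 1)):
        acc = xp.zeros(n_f, dtype=spectra.dtype)
        for t in range(n_t):
            # frequency-bin shift of this time step along the drift line
            shift = round(drift * t / max(n_t - 1, 1))
            acc += xp.roll(spectra[t], -shift)
        out[k] = acc
    return out

rng = np.random.default_rng(0)
waterfall = xp.asarray(rng.random((16, 256), dtype=np.float32))
power = drift_sums(waterfall, max_drift=4)
print(power.shape)  # (9, 256)
```

Because the `xp` alias is resolved once at import time, the function body contains no device-specific branches; this is the same pattern that lets a NumPy-based pipeline gain GPU acceleration with minimal code changes.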
28                                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration

Pi-Yueh Chuang‡∗, Lorena A. Barba‡


Abstract—Though PINNs (physics-informed neural networks) are now deemed a complement to traditional CFD (computational fluid dynamics) solvers rather than a replacement, their ability to solve the Navier-Stokes equations without given data is still of great interest. This report presents our not-so-successful experiments of solving the Navier-Stokes equations with PINN as a replacement for traditional solvers. We aim, with our experiments, to prepare readers for the challenges they may face if they are interested in data-free PINN. In this work, we used two standard flow problems: the 2D Taylor-Green vortex at Re = 100 and 2D cylinder flow at Re = 200. The PINN method solved the 2D Taylor-Green vortex problem with acceptable results, and we used this flow as an accuracy and performance benchmark. About 32 hours of training were required for the PINN method's accuracy to match that of a 16 × 16 finite-difference simulation, which took less than 20 seconds. The 2D cylinder flow, on the other hand, did not produce a physical solution: the PINN method behaved like a steady-flow solver and did not capture the vortex shedding phenomenon. By sharing our experience, we would like to emphasize that the PINN method is still a work in progress, especially for solving flow problems without any given data. More work is needed to make PINN feasible for real-world problems in such applications. (Reproducibility package: [Chu22].)

Index Terms—computational fluid dynamics, deep learning, physics-informed neural network

* Corresponding author:
‡ Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, DC 20052, USA

Copyright © 2022 Pi-Yueh Chuang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. Introduction

Recent advances in computing and programming techniques have motivated practitioners to revisit deep learning applications in computational fluid dynamics (CFD). We use the verb "revisit" because deep learning applications in CFD already existed going back to at least the 1990s, for example, using neural networks as surrogate models ([LS], [FS]). Another example is the work of Lagaris and colleagues ([LLF]) on solving partial differential equations with fully-connected neural networks back in 1998. Similar work with radial basis function networks can be found in reference [LLQH]. Nevertheless, deep learning applications in CFD did not get much attention until this decade, thanks to modern computing technology, including GPUs, cloud computing, high-level libraries like PyTorch and TensorFlow, and their Python APIs.

Solving partial differential equations with deep learning is particularly interesting to CFD researchers and practitioners. The PINN (physics-informed neural network) method denotes an approach to incorporating deep learning in CFD applications in which solving partial differential equations plays the key role. These partial differential equations include the well-known Navier-Stokes equations, one of the Millennium Prize Problems. The universal approximation theorem ([Hor]) implies that neural networks can model the solution to the Navier-Stokes equations with high fidelity and capture complicated flow details as long as the networks are big enough. The idea of PINN methods can be traced back to [DPT], while the name PINN was coined in [RPK]. Human-provided data are not necessary to apply PINN [LMMK], making it a potential alternative to traditional CFD solvers. It is sometimes branded as unsupervised learning, since it does not rely on human-provided data, which makes it sound very "AI." It is now common to see headlines like "AI has cracked the Navier-Stokes equations" in recent popular science articles ([Hao]).

Though data-free PINN as an alternative to traditional CFD solvers may sound attractive, PINN can also be used in data-driven configurations, for which it is better suited. Cai et al. [CMW+] state that PINN is not meant to be a replacement for existing CFD solvers due to its inferior accuracy and efficiency. The most useful applications of PINN should be those with some given data, so that the models are trained against the data. For example, when we have experimental measurements or partial simulation results (coarse-grid data, limited numbers of snapshots, etc.) from traditional CFD solvers, PINN may be useful to reconstruct the flow or to serve as a surrogate model.

Nevertheless, data-free PINN may offer some advantages over traditional solvers, and using data-free PINN to replace traditional solvers is still of great interest to researchers (e.g., [KDYI]). First, it is a mesh-free scheme, which benefits engineering problems where fluid flows interact with objects of complicated geometries. Simulating these fluid flows with traditional numerical methods usually requires high-quality unstructured meshes, with time-consuming human intervention in the pre-processing stage before the actual simulations. The second benefit of PINN is that the trained models approximate the governing equations' general solutions, meaning there is no need to solve the equations repeatedly for different flow parameters. For example, a flow model taking boundary velocity profiles as its input arguments can predict flows under different boundary velocity profiles after training. Conventional numerical methods, on the contrary, require repeated simulations, each one covering one boundary velocity profile. This feature could help in situations like engineering design optimization: the process of running sets of experiments to conduct parameter sweeps and find the optimal values or geometries for

products. Given these benefits, researchers continue studying and improving the usability of data-free PINN (e.g., [WYP], [DZ], [WTP], [SS]).

Data-free PINN, however, is neither ready nor meant to replace traditional CFD solvers. This claim may be obvious to researchers experienced in PINN, but it may not be clear to others, especially to CFD end-users without ample expertise in numerical methods. Even in the literature that aims to improve PINN, it is common to see only success stories with simple CFD problems. Important information concerning the feasibility of PINN in practical and real-world applications is often missing from these success stories. For example, few reports discuss the required computing resources, the computational cost of training, the convergence properties, or the error analysis of PINN. PINN suffers from performance and solvability issues due to the need for high-order automatic differentiation and multi-objective nonlinear optimization. Evaluating high-order derivatives using automatic differentiation enlarges the computational graphs of neural networks. And the multi-objective optimization, which reduces all the residuals of the differential equations, initial conditions, and boundary conditions, makes the training difficult to converge to small-enough loss values. Fluid flows are sensitive nonlinear dynamical systems in which a small change or error in the inputs may produce a very different flow field. So to get correct solutions, the optimization in PINN needs to minimize the loss to values very close to zero, further compromising the method's solvability and performance.

This paper reports on our not-so-successful PINN story as a lesson learned for readers, so they can be aware of the challenges they may face if they consider using data-free PINN in real-world applications. Our story includes two computational experiments as case studies to benchmark the PINN method's accuracy and computational performance. The first case study is a Taylor-Green vortex, solved successfully though not to our complete satisfaction. We will discuss the performance of PINN using this case study. The second case study, flow over a cylinder, did not even result in a physical solution. We will discuss the frustration we encountered with PINN in this case study.

We built our PINN solver with the help of NVIDIA's Modulus library ([noa]). Modulus is a high-level Python package built on top of PyTorch that helps users develop PINN-based differential equation solvers. In each case study, we also carried out simulations with our CFD solver, PetIBM ([CMKAB18]). PetIBM is a traditional solver using staggered-grid finite difference methods with MPI parallelization and GPU computing. The PetIBM simulations in each case study served as baseline data. For all cases, configurations, post-processing scripts, and required Singularity image definitions can be found at reference [Chu22].

This paper is structured as follows: the second section briefly describes the PINN method and an analogy to traditional CFD methods. The third and fourth sections present our computational experiments with the Taylor-Green vortex in 2D and a 2D laminar cylinder flow with vortex shedding. Most discussions happen in the corresponding case studies. The last section presents the conclusion and discussions that did not fit into either of the cases.

2. Solving Navier-Stokes equations with PINN

The incompressible Navier-Stokes equations in vector form are composed of the continuity equation:

    \nabla \cdot \vec{U} = 0    (1)

and the momentum equations:

    \frac{\partial \vec{U}}{\partial t} + (\vec{U} \cdot \nabla)\vec{U} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \vec{U} + \vec{g}    (2)

where \rho = \rho(\vec{x}, t), \nu = \nu(\vec{x}, t), and p = p(\vec{x}, t) are scalar fields denoting density, kinematic viscosity, and pressure, respectively. \vec{x} denotes the spatial coordinate, and \vec{x} = [x, y]^T in two dimensions. The density and viscosity fields are usually known and given, while the pressure field is unknown. \vec{U} = \vec{U}(\vec{x}, t) = [u(x, y, t), v(x, y, t)]^T is a vector field for the flow velocity. All of them are functions of the spatial coordinate in the computational domain \Omega and time before a given limit T. The gravitational field \vec{g} may also be a function of space and time, though it is usually a constant. A solution to the Navier-Stokes equations is subjected to an initial condition and boundary conditions:

    \vec{U}(\vec{x}, t) = \vec{U}_0(\vec{x}),          \forall \vec{x} \in \Omega, \ t = 0
    \vec{U}(\vec{x}, t) = \vec{U}_\Gamma(\vec{x}, t),  \forall \vec{x} \in \Gamma, \ t \in [0, T]    (3)
    p(\vec{x}, t) = p_\Gamma(\vec{x}, t),              \forall \vec{x} \in \Gamma, \ t \in [0, T]

where \Gamma represents the boundary of the computational domain.

2.1. The PINN method

The basic form of the PINN method ([RPK], [CMW+]) starts from approximating \vec{U} and p with a neural network:

    \begin{bmatrix} \vec{U} \\ p \end{bmatrix}(\vec{x}, t) \approx G(\vec{x}, t; \Theta)    (4)

Here we use a single network that predicts both the pressure and velocity fields. It is also possible to use different networks for them separately. Later in this work, we will use G_{\vec{U}} and G_p to denote the predicted velocity and pressure from the neural network. \Theta at this point represents the free parameters of the network.

To determine the free parameters \Theta, ideally, we hope the approximate solution gives zero residuals for equations (1), (2), and (3). That is,

    r_1(\vec{x}, t; \Theta) \equiv \nabla \cdot G_{\vec{U}} = 0
    r_2(\vec{x}, t; \Theta) \equiv \frac{\partial G_{\vec{U}}}{\partial t} + (G_{\vec{U}} \cdot \nabla)G_{\vec{U}} + \frac{1}{\rho}\nabla G_p - \nu \nabla^2 G_{\vec{U}} - \vec{g} = 0
    r_3(\vec{x}; \Theta) \equiv G_{\vec{U}}\big|_{t=0} - \vec{U}_0 = 0    (5)
    r_4(\vec{x}, t; \Theta) \equiv G_{\vec{U}} - \vec{U}_\Gamma = 0, \ \forall \vec{x} \in \Gamma
    r_5(\vec{x}, t; \Theta) \equiv G_p - p_\Gamma = 0, \ \forall \vec{x} \in \Gamma

and the set of desired parameters, \Theta = \theta, is the common zero root of all the residuals.

The derivatives of G with respect to \vec{x} and t are usually obtained using automatic differentiation. Nevertheless, it is possible to use analytical derivatives when the chosen network architecture is simple enough, as reported in the early literature ([LLF], [LLQH]).

If the residuals in (5) are not complicated, and if the number of parameters, N_\Theta, is small enough, we may numerically find the zero root by solving a system of N_\Theta nonlinear equations generated from a suitable set of N_\Theta spatial-temporal points. However, this scenario rarely happens, as G is usually highly complicated and N_\Theta is large. Moreover, we do not even know whether such a zero root exists for the equations in (5).

Instead, in PINN, the condition is relaxed. We do not seek the zero root of (5) but just hope to find a set of parameters that make

the residuals sufficiently close to zero. Consider the sum of the l2 norms of the residuals:

    r(\vec{x}, t; \Theta = \theta) \equiv \sum_{i=1}^{5} \| r_i(\vec{x}, t; \Theta = \theta) \|, \ \forall \vec{x} \in \Omega, \ t \in [0, T]    (6)

The \theta that makes the residuals closest to zero (or even equal to zero, if such a \theta exists) also makes (6) minimal because r(\vec{x}, t; \Theta) \geq 0. In other words,

    \theta = \arg\min_{\Theta} r(\vec{x}, t; \Theta), \ \forall \vec{x} \in \Omega, \ t \in [0, T]    (7)

This poses a fundamental difference between the PINN method and traditional CFD schemes, making it potentially more difficult for the PINN method to achieve the same accuracy as the traditional schemes. We will discuss this more in section 3. Note that in practice, each loss term on the right-hand side of equation (6) is weighted. We ignore the weights here for demonstration purposes.

To solve (7), theoretically, we can use any number of spatial-temporal points, which eases the demand on computational resources compared to finding the zero root directly. Gradient-descent-based optimizers further reduce the computational cost, especially in terms of memory usage and the difficulty of parallelization. Alternatively, quasi-Newton methods may work, but only when N_\Theta is small enough.

However, even though equation (7) may be solvable, it is still a significantly expensive task. While typical data-driven learning requires one back-propagation pass on the derivatives of the loss function, here automatic differentiation is needed to evaluate the derivatives of G with respect to \vec{x} and t. The first-order derivatives require one back-propagation on the network, while the second-order derivatives present in the diffusion term \nabla^2 G_{\vec{U}} require an additional back-propagation on the first-order derivatives' computational graph. Finally, to update the parameters in an optimizer, the gradients of G with respect to the parameters \Theta require another back-propagation on the graph of the second-order derivatives. This all leads to a very large computational graph. We will see the performance of the PINN method in the case studies.

In summary, when viewing the PINN method as supervised machine learning, the inputs of a network are spatial-temporal coordinates, and the outputs are the physical quantities of our interest. The loss or objective functions in PINN are governing equations that regulate how the target physical quantities should behave. The use of governing equations eliminates the need for true answers. A trivial example is using Bernoulli's equation as the loss function, i.e., loss = u^2 / (2g) + p / (\rho g) + z(x) - H_0, where a neural network predicts the flow speed u and pressure p at a given location x along a streamline. (The gravitational acceleration g, density \rho, energy head H_0, and elevation z(x) are usually known and given.) Such a loss function regulates the relationship between the predicted u and p and does not need true answers for the two quantities. Unlike Bernoulli's equation, most governing equations in physics are differential equations (e.g., heat equations). The main difference is that the PINN method then needs automatic differentiation to evaluate the loss. Regardless of the form of the governing equations, spatial-temporal coordinates are the only data required during training. Hence, throughout this paper, training data means spatial-temporal points and does not involve any true answers to the predicted quantities. (Note that in some literature, the PINN method is applied to applications that do need true answers; see [CMW+]. These applications are out of scope here.)

2.2. An analogy to conventional numerical methods

For readers with a background in numerical methods for partial differential equations, we would like to make an analogy between traditional numerical methods and PINN.

In obtaining strong solutions to differential equations, we can describe the solution workflows of most numerical methods in five stages:

      1)   Designing the approximate solution with undetermined parameters
      2)   Choosing proper approximations for derivatives
      3)   Obtaining the so-called modified equation by substituting the approximate derivatives into the differential equations and initial/boundary conditions
      4)   Generating a system of linear/nonlinear algebraic equations
      5)   Solving the system of equations

For example, to solve \nabla^2 U(x) = s(x), the most naive spectral method ([Tre]) approximates the solution with U(x) \approx G(x) = \sum_{i=1}^{N} c_i \phi_i(x), where the c_i represent undetermined parameters and \phi_i(x) denotes a set of either polynomials, trigonometric functions, or complex exponentials. Next, obtaining the first derivative of U is straightforward: we can just assume U'(x) \approx G'(x) = \sum_{i=1}^{N} c_i \phi_i'(x). The second-order derivative may be more tricky. One can assume U''(x) \approx G''(x) = \sum_{i=1}^{N} c_i \phi_i''(x). Or, another choice for nodal bases (i.e., when \phi_i(x) is chosen to make c_i \equiv G(x_i)) is U''(x) \approx \sum_{i=1}^{N} c_i G'(x_i). Because \phi_i(x) is known, the derivatives are analytical. After substituting the approximate solution and derivatives into the target differential equation, we need to solve for the parameters c_1, ..., c_N. We do so by selecting N points from the computational domain and creating a system of N linear equations:

    \begin{bmatrix} \phi_1''(x_1) & \cdots & \phi_N''(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1''(x_N) & \cdots & \phi_N''(x_N) \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_N \end{bmatrix} - \begin{bmatrix} s(x_1) \\ \vdots \\ s(x_N) \end{bmatrix} = 0    (8)

Finally, we determine the parameters by solving this linear system. Though this example uses a spectral method, the workflow also applies to many other numerical methods, such as finite difference methods, which can be reformatted as a form of spectral method.

With this workflow in mind, it should be easy to see the analogy between PINN and conventional numerical methods. Aside from using much more complicated approximate solutions, the major difference lies in how to determine the unknown parameters in the approximate solutions. While traditional methods solve the zero-residual conditions, PINN relies on searching for the minimal residuals. A secondary difference is how the derivatives are approximated. Conventional numerical methods use analytical or numerical differentiation of the approximate solutions, while the PINN method usually depends on automatic differentiation. This difference may be minor, as we are still able to use analytical differentiation for simple network architectures with PINN. However, automatic differentiation is a major factor affecting PINN's performance.

3. Case 1: Taylor-Green vortex: accuracy and performance

3.1. 2D Taylor-Green vortex

The Taylor-Green vortex represents a family of flows with a specific form of analytical initial flow conditions in both 2D

Fig. 1: Contours of u and v at t = 32 to demonstrate the solution of the 2D Taylor-Green vortex.

Fig. 2: Total residuals (loss) with respect to training iterations.
and 3D. The 2D Taylor-Green vortex has closed-form analytical solutions with periodic boundary conditions, and hence they are standard benchmark cases for verifying CFD solvers. In this work, we used the following 2D Taylor-Green vortex:

    u(x, y, t) = V_0 \cos(\frac{x}{L}) \sin(\frac{y}{L}) \exp(-2\frac{\nu}{L^2}t)
    v(x, y, t) = -V_0 \sin(\frac{x}{L}) \cos(\frac{y}{L}) \exp(-2\frac{\nu}{L^2}t)    (9)
    p(x, y, t) = -\frac{\rho}{4} V_0^2 \left( \cos(\frac{2x}{L}) + \cos(\frac{2y}{L}) \right) \exp(-4\frac{\nu}{L^2}t)

where V_0 represents the peak (and also the lowest) velocity at t = 0. Other symbols carry the same meaning as those in section 2.

The periodic boundary conditions were applied at x = -L\pi, x = L\pi, y = -L\pi, and y = L\pi. We used the following parameters in this work: V_0 = L = \rho = 1.0 and \nu = 0.01. These parameters correspond to a Reynolds number Re = 100. Figure 1 shows a snapshot of the velocity at t = 32.

3.2. Solver and runtime configurations

The neural network used in the PINN solver is a fully-connected neural network with 6 hidden layers and 256 neurons per layer. The activation functions are SiLU ([HG]). We used Adam for optimization, and its initial parameters are the defaults from PyTorch. The learning rate decayed exponentially through PyTorch's ExponentialLR with gamma equal to 0.95^{1/10000}. Note that we did not conduct hyperparameter optimization, given the computational cost. The hyperparameters are mostly the defaults used by the 3D Taylor-Green example in Modulus ([noa]).

The training data were simply spatial-temporal coordinates. Before the training, the PINN solver pre-generated 18,432,000 spatial-temporal points to evaluate the residuals of the Navier-Stokes equations (r_1 and r_2 in equation (5)). These training points were randomly chosen from the spatial domain [-\pi, \pi] \times [-\pi, \pi] and the temporal domain (0, 100]. The solver used only 18,432 points in each training iteration, making it a batch training. For the residual of the initial condition (r_3), the solver also pre-generated 18,432,000 random spatial points and used only 18,432 per iteration. Note that for r_3, the points were distributed in space only, because t = 0 is a fixed condition. Because of the periodic boundary conditions, the solver did not require any training points for r_4 and r_5.

The hardware used for the PINN solver was a single node of NVIDIA's DGX-A100, equipped with 8 A100 GPUs (80GB variants). We carried out the training using different numbers of GPUs to investigate the performance of the PINN solver. All cases were trained up to 1 million iterations. Note that the parallelization was done with weak scaling, meaning that increasing the number of GPUs would not reduce the workload of each GPU. Instead, increasing the number of GPUs would increase the total and per-iteration numbers of training points. Therefore, our expected outcome was that all cases would require about the same wall time to finish, while the residual from using 8 GPUs would converge the fastest.

After training, the PINN solver's prediction errors (i.e., accuracy) were evaluated at the cell centers of a 512 × 512 Cartesian mesh against the analytical solution. With these spatially distributed errors, we calculated the L2 error norm for a given t:

    L_2 = \sqrt{\int_\Omega \mathrm{error}(x, y)^2 \, d\Omega} \approx \sqrt{\sum_i \sum_j \mathrm{error}_{i,j}^2 \, \Delta\Omega_{i,j}}    (10)

where i and j are the indices of a cell center in the Cartesian mesh, and \Delta\Omega_{i,j} is the corresponding cell area, 4\pi^2 / 512^2 in this case.

We compared accuracy and performance against results obtained with PetIBM. All PetIBM simulations in this section were done with 1 K40 GPU and 6 CPU cores (Intel i7-5930K) on our old lab workstation. We carried out 7 PetIBM simulations with different spatial resolutions: 2^k \times 2^k for k = 4, 5, ..., 10. The time step size for each spatial resolution was \Delta t = 0.1 / 2^{k-4}.

A special note should be made here: the PINN solver used single-precision floats, while PetIBM used double-precision floats. This might sound unfair. However, this discrepancy does not change the qualitative findings and conclusions, as we will see later.

3.3. Results

Figure 2 shows the convergence history of the total residuals (equation (6)). Using more GPUs in weak scaling (i.e., more training points) did not accelerate the convergence, contrary to what we expected. All cases converged at a similar rate. Though without a quantitative criterion or justification, we considered that further training would not improve the accuracy. Figure 3 gives a visual taste of what the predictions from the neural network look like.

The result visually agrees with that in figure 1. However, as shown in figure 4, the error magnitudes from the PINN solver are much higher than those from PetIBM. Figure 4 shows the prediction errors with respect to t. We only present the error in the u velocity, as those for v and p are similar. The accuracy of the PINN solver is similar to that of the 16 × 16 simulation with
32                                                                                      PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

PetIBM. Using more GPUs, which implies more training points, does not improve the accuracy.

Fig. 3: Contours of u and v at t = 32 from the PINN solver.

Fig. 4: L2 error norm versus simulation time.

    Regardless of the magnitudes, the trends of the errors with respect to t are similar for both PINN and PetIBM. For PetIBM, the trend shown in figure 4 indicates that the temporal error is bounded and the scheme is stable. However, this concept does not apply to PINN, as it does not use any time-marching scheme. What this means for PINN is still unclear to us. Nevertheless, it shows that PINN is able to propagate the influence of the initial conditions to later times, which is a crucial factor for solving hyperbolic partial differential equations.
    Figure 5 shows the computational cost of PINN and PetIBM in terms of the desired accuracy versus the required wall time. We only show the PINN results with 8 A100 GPUs in this figure. We believe this type of plot may help evaluate the computational cost in engineering applications. According to the figure, for example, achieving an accuracy of 10^(−3) at t = 2 requires less than 1 second for PetIBM with 1 K40 GPU and 6 CPU cores, but more than 8 hours for PINN with at least 1 A100 GPU.
    Table 1 lists the wall time per 1,000 iterations and the scaling efficiency. As indicated previously, weak scaling was used in PINN, which follows most machine learning applications.

                            1 GPU     2 GPUs    4 GPUs    8 GPUs
    Time (sec/1k iters)     85.0      87.7      89.1      90.1
    Efficiency (%)          100       97        95        94

TABLE 1: Weak scaling performance of the PINN solver using NVIDIA A100-80GB GPUs.

Fig. 5: L2 error norm versus wall time.

3.4. Discussion
A note should be made regarding the results: we do not claim that these results represent the most optimized configuration of the PINN method, nor do we claim that the qualitative conclusions apply to all other hyperparameter configurations. These results merely reflect the outcomes of our computational experiments with the specific configuration described above. They should be deemed experimental data rather than a thorough analysis of the method's characteristics.
    The Taylor-Green vortex serves as a good benchmark case because it reduces the number of required residual constraints: residuals r4 and r5 are excluded from r in equation (6). This means the optimizer can concentrate only on the residuals of the initial conditions and the Navier-Stokes equations.
    Using more GPUs (and thus more spatial-temporal training points) did not speed up the convergence, which may indicate that the per-iteration number of points on a single GPU is already big enough. The number of training points mainly affects the mean gradients of the residual with respect to the model parameters, which are then used to update the parameters by gradient-descent-based optimizers. If the number of points is already big enough on a single GPU, then using more points or more GPUs is unlikely to change the mean gradients significantly, causing the convergence to rely solely on the learning rate.
    The accuracy of the PINN solver was acceptable but not satisfying, especially considering how much time it took to achieve that accuracy. The low accuracy was, to some degree, not surprising. Recall the theory in section 2: the PINN method only seeks the minimal residual on the total residual's hyperplane. It does not try to find a zero root of the hyperplane, and does not even care whether such a zero root exists. Furthermore, by using a gradient-descent-based optimizer, the resulting minimum is likely just a local minimum. It makes sense that it is hard for the residual to be close to zero, meaning it is hard to make the errors small.
    Regarding the performance result in figure 5, we would like to avoid interpreting the result as one solver being better than the other. The proper conclusion drawn from the figure is as follows: when using the PINN solver as a CFD simulator for a specific flow condition, PetIBM outperforms the PINN solver. As stated in section 1, the PINN method can solve flows under different flow parameters in one run, a capability that PetIBM does not have. The performance result in figure 5 only considers a limited application of the PINN solver.
    One issue for this case study was how to fairly compare the PINN solver and PetIBM, especially when investigating the accuracy versus the workload/problem size or time-to-solution

versus problem size. Defining the problem size in PINN is not as straightforward as we thought. Let us start with degrees of freedom: in PINN, it is called the number of model parameters, and in traditional CFD solvers, it is called the number of unknowns. The PINN solver and traditional CFD solvers are all trying to determine the free parameters of models (that is, of approximate solutions). Hence, the number of degrees of freedom determines the problem size and workload. However, in PINN, the problem size and workload do not depend on the degrees of freedom alone: the number of training points also plays a critical role in the workload. We were not sure it made sense to define the problem size as the sum of the per-iteration number of training points and the number of model parameters. For example, 100 model parameters plus 100 training points is not equivalent to 150 model parameters plus 50 training points in terms of workload. So without a proper definition of problem size and workload, it was not clear how to fairly compare PINN and traditional CFD solvers.
    Nevertheless, the gap between the performances of PINN and PetIBM is too large, and no one can argue that using other metrics would change the conclusion. Not to mention that the PINN solver ran on A100 GPUs, while PetIBM ran on a single K40 GPU in our lab, a product from 2013. This is also not a surprising conclusion because, as indicated in section 2, the use of automatic differentiation for temporal and spatial derivatives results in a huge computational graph. In addition, the PINN solver uses a gradient-descent-based method, which is a first-order method and limits the convergence rate.
    Weak scaling is a natural choice for the PINN solver when it comes to distributed computing. As we do not know a proper way to define the workload, simply copying all model parameters to all processes and using the same number of training points on all processes works well.

4. Case 2: 2D cylinder flows: harder than we thought
This case study shows what really made us frustrated: a 2D cylinder flow at Reynolds number Re = 200. We failed to even produce a solution that qualitatively captures the key physical phenomenon of this flow: vortex shedding.

4.1. Problem description
The computational domain is [−8, 25] × [−8, 8], and a cylinder with a radius of 0.5 sits at coordinate (0, 0). The velocity boundary conditions are (u, v) = (1, 0) along x = −8, y = −8, and y = 8. On the cylinder surface is the no-slip condition, i.e., (u, v) = (0, 0). At the outlet (x = 25), we enforced a pressure boundary condition p = 0. The initial condition is (u, v) = (0, 0). Note that this initial condition is different from most traditional CFD simulations. Conventionally, CFD simulations use (u, v) = (1, 0) for cylinder flows. A uniform initial condition of u = 1 does not satisfy the Navier-Stokes equations due to the no-slip boundary on the cylinder surface. Conventional CFD solvers are usually able to correct the solution during time-marching by propagating boundary effects into the domain through numerical schemes' stencils. In our experience, using u = 1 or u = 0 did not matter for PINN, because neither gave reasonable results. Nevertheless, the PINN solver's results shown in this section were obtained using a uniform u = 0 for the initial condition.
    The density, ρ, is one, and the kinematic viscosity is ν = 0.005. These parameters correspond to Reynolds number Re = 200. Figure 6 shows the velocity and vorticity snapshots at t = 200. As shown in the figure, this type of flow displays a phenomenon called vortex shedding. Though vortex shedding makes the flow always unsteady, after a certain time the flow reaches a periodic stage, and the flow pattern repeats after a certain period.

Fig. 6: Demonstration of velocity and vorticity fields at t = 200 from a PetIBM simulation.

    The Navier-Stokes equations can be deemed a dynamical system. Under some flow conditions, instability appears in the flow and responds to small perturbations, causing the vortex shedding. In nature, the vortex shedding comes from the uncertainty and perturbation existing everywhere. In CFD simulations, the vortex shedding is triggered by small numerical and rounding errors in the calculations. Interested readers should consult reference [Wil].

4.2. Solver and runtime configurations
For the PINN solver, we tested two networks. Both were fully-connected neural networks: one with 256 neurons per layer, the other with 512 neurons per layer. All other network configurations were the same as those in section 3, except that we allowed human intervention to manually adjust the learning rate during training. Our intention for this case study was to obtain physical solutions from the PINN solver, rather than to conduct a performance and accuracy benchmark. Therefore, we would adjust the learning rate to accelerate the convergence or to escape from local minima. This decision is in line with common machine learning practice. We did not carry out hyperparameter optimization. These parameters were chosen because they work in Modulus' examples and in the Taylor-Green vortex experiment.
    The PINN solver pre-generated 40,960,000 spatial-temporal points from the spatial domain [−8, 25] × [−8, 8] and the temporal domain (0, 200] to evaluate the residuals of the Navier-Stokes equations, and used 40,960 points per iteration. The number of pre-generated points for the initial condition was 2,048,000, and the per-iteration number was 2,048. On each boundary, the numbers of pre-generated and per-iteration points were 8,192,000 and 8,192, respectively. Both cases used 8 A100 GPUs, which scaled these numbers up by a factor of 8. For example, during each iteration, a total of 327,680 points were actually used to evaluate the Navier-Stokes equations' residuals. Both cases ran up to 64 hours in wall time.
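To make the mismatch concrete, the parameter count of the section 3 network (3 inputs x, y, t; six hidden layers of 256 neurons; 3 outputs u, v, p) can be sketched as below. The layer sizes come from the text, but treating parameters as the analogue of CFD unknowns is exactly the comparison under discussion here, not an established equivalence.

```python
# Count trainable parameters of a fully-connected network: each layer
# contributes (fan_in * fan_out) weights plus fan_out biases.
def count_params(sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# 3 inputs (x, y, t), six hidden layers of 256 neurons, 3 outputs (u, v, p)
sizes = [3] + [256] * 6 + [3]
n_params = count_params(sizes)   # 330,755 free parameters
```

For comparison, the finest PetIBM grid in section 3 (1024 × 1024) carries on the order of 3 × 10^6 unknowns (u, v, p per cell), yet, as argued above, the two numbers do not measure the same workload.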
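The per-iteration batches described above feed a composite loss. The toy sketch below mimics only the bookkeeping: the residual arrays are random placeholders for what automatic differentiation of the network would produce, and summing mean-squared residuals is our reading of the total residual in equation (6), with no claim about Modulus' internal weighting.

```python
import numpy as np

rng = np.random.default_rng(7)

# Per-GPU, per-iteration batch sizes quoted in the text.
N_EQ, N_IC, N_BC, N_GPUS = 40_960, 2_048, 8_192, 8

# With weak scaling, each of the 8 GPUs draws its own batch of this size.
total_eq_points = N_GPUS * N_EQ          # 327,680, as stated in the text

def mse(residual):
    """Mean squared residual: one term of the total loss."""
    return float(np.mean(residual ** 2))

# Placeholder residual samples (stand-ins for r1/r2, r3, and one boundary term).
r_navier_stokes = rng.normal(size=N_EQ)
r_initial = rng.normal(size=N_IC)
r_boundary = rng.normal(size=N_BC)

loss = mse(r_navier_stokes) + mse(r_initial) + mse(r_boundary)
```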

Fig. 7: Training history of the 2D cylinder flow at Re = 200.

Fig. 8: Velocity and vorticity at t = 200 from PINN.

    One PetIBM simulation was carried out as a baseline. This simulation had a spatial resolution of 1485 × 720, and the time step size was 0.005. Figure 6 was rendered using this simulation. The hardware used was 1 K40 GPU plus 6 cores of an i7-5930K CPU. It took about 1.7 hours to finish.
    The quantity of interest is the drag coefficient. We consider both the friction drag and the pressure drag in the coefficient calculation as follows:

    C_D = \frac{2}{\rho U_0^2 D} \int_S \left( \rho\nu \frac{\partial(\vec{U}\cdot\vec{t})}{\partial\vec{n}} n_y - p\,n_x \right) \mathrm{d}S        (11)

Here, U0 = 1 is the inlet velocity, and D = 1 is the cylinder diameter. ~n = [nx, ny]^T and ~t = [ny, −nx]^T are the normal and tangent vectors, respectively. S represents the cylinder surface. The theoretical lift coefficient (CL) for this flow is zero due to the symmetrical geometry.
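Equation (11) can be checked with a short numerical quadrature over the cylinder surface. The surface-pressure profile below (p = −cos θ, zero friction) is a made-up test input chosen because the integral then reduces analytically to C_D = π; it is not a cylinder-flow solution.

```python
import numpy as np

# Discretize the cylinder surface of radius R = 0.5 (so D = 1).
R, RHO, U0, D, NU = 0.5, 1.0, 1.0, 1.0, 0.005
theta = np.linspace(0.0, 2.0 * np.pi, 3601)
nx, ny = np.cos(theta), np.sin(theta)     # outward unit normal

# Made-up surface data: pressure p = -cos(theta), no friction contribution.
p = -np.cos(theta)
dUt_dn = np.zeros_like(theta)             # d(U.t)/dn, the friction term

# Integrand of eq. (11); the surface element is dS = R dtheta.
integrand = (RHO * NU * dUt_dn * ny - p * nx) * R

# Composite trapezoidal rule, then the 2/(rho U0^2 D) prefactor.
dtheta = theta[1] - theta[0]
cd = 2.0 / (RHO * U0**2 * D) * float(
    np.sum(0.5 * (integrand[:-1] + integrand[1:])) * dtheta)   # -> pi
```

With the friction term zeroed out, only the pressure term −p n_x contributes, and the quadrature recovers the closed-form value.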

4.3. Results
Note that, as stated in section 3.4, we deem the results experimental data obtained under a specific experiment configuration. Hence, we do not claim that the results and qualitative conclusions apply to other hyperparameter configurations.
    Figure 7 shows the convergence history. The bumps in the history correspond to our manual adjustments of the learning rate. After 64 hours of training, the total loss had not converged to an obvious steady value. However, we decided not to continue the training because, as the later results show, it was our judgment call that the results would not be correct even if the training converged.
    Figure 8 provides a visualization of the predicted velocity and vorticity at t = 200, and figure 9 shows the drag and lift coefficients versus simulation time. From both figures, we could not see any sign of vortex shedding with the PINN solver.
    We provide a comparison against the values reported by others in table 2. References [GS74] and [For80] calculate the drag coefficients using steady flow simulations, which were popular decades ago because of their inexpensive computational cost. The actual flow is not a steady flow, and these steady-flow coefficient values are lower than the unsteady-flow predictions. The drag coefficient from the PINN solver is closer to the steady-flow values.

Fig. 9: Drag and lift coefficients with respect to t.

                          Unsteady simulations        Steady simulations
    PetIBM      PINN      [DSY07]      [RKM09]        [GS74]      [For80]
    1.38        0.95      1.25         1.34           0.97        0.83

TABLE 2: Comparison of drag coefficients, CD.

4.4. Discussion
While researchers may be interested in why the PINN solver behaves like a steady flow solver, in this section we would like to focus more on the user experience and the usability of PINN in practice. Our viewpoints may be subjective, and hence we leave them here in the discussion.
    Allow us to start this discussion with a hypothetical situation. If one asks why we chose such a spatial and temporal resolution for a conventional CFD simulation, we have mathematical or physical reasons to back our decision. However, if the person asks why we chose 6 hidden layers and 256 neurons per layer, we will not be able to justify it. "It worked in another case!" is probably the best answer we can offer. The situation also indicates that we have systematic approaches to improve a conventional simulation, but can only improve PINN's results through computer experiments.
    Most traditional numerical methods have rigorous analytical derivations and analyses. Each parameter used in a scheme has a meaning or a purpose in physical or numerical aspects. The simplest example is the spatial resolution in the finite difference method, which controls the truncation errors in derivatives. Or,

the choice of the limiters in finite volume methods, used to inhibit oscillations in solutions. So when a conventional CFD solver produces unsatisfying or even non-physical results, practitioners usually have systematic approaches to identify the cause or improve the outcome. Moreover, when necessary, practitioners know how to balance the computational cost and the accuracy, which is a critical point for using computer-aided engineering. Engineering always concerns the costs and outcomes.
    On the other hand, the PINN method lacks well-defined procedures to control the outcome. For example, we know the numbers of neurons and layers control the degrees of freedom in a model, and with more degrees of freedom, a neural network model can approximate a more complicated phenomenon. However, when we feel that a neural network is not complicated enough to capture a physical phenomenon, what strategy should we use to adjust the neurons and layers? Should we increase the neurons or the layers first? By how much?
    Moreover, when it comes to something non-numeric, it is even more challenging to know what to use and why. For instance, what activation function should we use, and why? Should we use the same activation everywhere? Not to mention that we are not yet even considering a different network architecture here.
    Ultimately, are we even sure that increasing the network's complexity is the right path? Our assumption that the network is not complicated enough may simply be wrong.
    The following situation happened in this case study. Before we realized the PINN solver behaved like a steady-flow solver, we attributed the cause to insufficient model complexity. We then faced the problem of how to increase the model complexity systematically. Theoretically, we could follow the practice of design of experiments (e.g., through grid search or Taguchi methods). However, given the computational cost and the number of hyperparameters/options of PINN, a proper design of experiments was not affordable for us. Furthermore, design of experiments requires the outcome to change with changes in the inputs; in our case, the vortex shedding remained absent regardless of how we changed the hyperparameters.
    Let us move back to the flow problem to conclude this case study. The model complexity may not be the culprit here. Vortex shedding is the product of the dynamical system of the Navier-Stokes equations and the perturbations from numerical calculations (which implicitly mimic the perturbations in nature). Suppose the PINN solver's prediction is the steady-state solution to the flow. Then we may need to introduce uncertainties and perturbations into the neural network or the training data, such as the perturbed initial condition described in [LD15]. As for why PINN predicts the steady-state solution, we cannot answer that currently.

5. Further discussion and conclusion
Because of widely available deep learning libraries, such as PyTorch, and the ease of Python, implementing a PINN solver is relatively straightforward nowadays. This may be one reason why the PINN method suddenly became so popular in recent years. This paper does not intend to discourage people from trying the PINN method. Instead, we share our failures and frustrations in using PINN so that interested readers may know what immediate challenges should be resolved for PINN.
    Our paper is limited to using the PINN solver as a replacement for traditional CFD solvers. However, as the first section indicates, PINN can do more than solve one specific flow under specific flow parameters. Moreover, PINN can also work with traditional CFD solvers. The literature shows researchers have shifted their attention to hybrid-mode applications. For example, in [JEA+20], the authors combined the concept of PINN and a traditional CFD solver to train a model that takes in low-resolution CFD simulation results and outputs high-resolution flow fields.
    For people with a strong background in numerical methods or CFD, we would suggest trying to think outside the box. During our work, we realized our mindset and ideas were limited by what we were used to in CFD. An example is the initial conditions. We were used to having only one set of initial conditions when the temporal derivative in the differential equations is only first-order. However, in PINN, nothing limits us from using more than one initial condition. We can generate results at t = 0, 1, ..., tn using a traditional CFD solver and add the residuals corresponding to these time snapshots to the total residual, so the PINN method may perform better in predicting t > tn. In other words, the PINN solver becomes the traditional CFD solvers' replacement only for t > tn ([noa]).
    As discussed in [THM+], solving partial differential equations with deep learning is still a work in progress. It may not work in many situations. Nevertheless, that does not mean we should stay away from PINN and discard the idea. Stepping away from a new thing gives it zero chance to evolve, and we would never know whether PINN can be improved to a mature state that works well. Of course, overly promoting its bright side with only success stories does not help, either. Rather, we should honestly face all the troubles, difficulties, and challenges. Knowing the problem is the first step to solving it.

Acknowledgements
We appreciate the support of NVIDIA through sponsoring access to its high-performance computing cluster.

REFERENCES
[Chu22]    Pi-Yueh Chuang. barbagroup/scipy-2022-repro-pack: 20220530, May 2022. doi:10.5281/zenodo.6592457.
[CMKAB18]  Pi-Yueh Chuang, Olivier Mesnard, Anush Krishnan, and Lorena A. Barba. PetIBM: toolbox and applications of the immersed-boundary method on distributed-memory architectures. Journal of Open Source Software, 3(25):558, May 2018. doi:10.21105/joss.00558.
[CMW+]     Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed neural networks (PINNs) for fluid mechanics: a review. 37(12):1727–1738. doi:10.1007/s10409-021-01148-1.
[DPT]      M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. 10(3):195–201. doi:10.1002/cnm.1640100303.
[DSY07]    Jian Deng, Xue-Ming Shao, and Zhao-Sheng Yu. Hydrodynamic studies on two traveling wavy foils in tandem arrangement. Physics of Fluids, 19(11):113104, November 2007. doi:10.1063/1.2814259.
[DZ]       Yifan Du and Tamer A. Zaki. Evolutional deep neural network. 104(4):045303. doi:10.1103/PhysRevE.104.045303.
[For80]    Bengt Fornberg. A numerical study of steady viscous flow past a circular cylinder. Journal of Fluid Mechanics, 98(04):819, June 1980. doi:10.1017/S0022112080000419.

[FS]       William E. Faller and Scott J. Schreck. Unsteady fluid mechanics applications of neural networks. 34(1):48–55. doi:10.2514/2.2134.
[GS74]     V. A. Gushchin and V. V. Shchennikov. A numerical method of solving the Navier-Stokes equations. USSR Computational Mathematics and Mathematical Physics, 14(2):242–250, January 1974. doi:10.1016/0041-5553(74)90061-5.
[Hao]      Karen Hao. AI has cracked a key mathematical puzzle for understanding our world. URL:
[HG]       Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). doi:10.48550/ARXIV.1606.08415.
[Hor]      Kurt Hornik. Approximation capabilities of multilayer feedforward networks. 4(2):251–257. doi:10.1016/0893-6080(91)90009-T.
[JEA+20]   Chiyu "Max" Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. MeshfreeFlowNet: a physics-constrained deep continuous space-time super-resolution framework. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2020.
[KDYI]     Hasan Karali, Umut M. Demirezen, Mahmut A. Yukselen, and Gokhan Inalhan. A novel physics informed deep learning method for simulation-based modelling. In AIAA Scitech 2021 Forum. American Institute of Aeronautics and Astronautics.
[LD15]     Mouna Laroussi and Mohamed Djebbi. Vortex shedding for flow past circular cylinder: effects of initial conditions. Universal Journal of Fluid Mechanics, 3:19–32, 2015.
[LLF]      I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. 9(5):987–1000. arXiv:physics/9705023.
[LLQH]     Jianyu Li, Siwei Luo, Yingjian Qi, and Yaping Huang. Numerical solution of elliptic partial differential equation using radial basis function neural networks. 16(5):729–734.
[LMMK]     Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. DeepXDE: a deep learning library for solving differential equations. 63(1):208–228. doi:10.1137/19M1274067.
[LS]       Dennis J. Linse and Robert F. Stengel. Identification of aerodynamic coefficients using computational neural networks. 16(6):1018–1025.
[noa]      Modulus. NVIDIA.
[RKM09]    B. N. Rajani, A. Kandasamy, and Sekhar Majumdar. Numerical simulation of laminar flow past a circular cylinder. Applied Mathematical Modelling, 33(3):1228–1247, March 2009. doi:10.1016/j.apm.2008.01.017.
[THM+]     Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, and Kiwon Um. Physics-based deep learning. arXiv:2109.05237.
[Tre]      Lloyd N. Trefethen. Spectral Methods in MATLAB. Software, Environments, Tools. Society for Industrial and Applied Mathematics. doi:10.1137/1.9780898719598.
[Wil]      C. H. K. Williamson. Vortex dynamics in the cylinder wake. 28(1):477–539. doi:10.1146/annurev.fl.28.010196.002401.
[WTP]      Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. 43(5):A3055–A3081. doi:10.1137/20M1318043.
[WYP]      Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: a neural tangent kernel perspective. 449:110768.
[RPK]      M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for
            solving forward and inverse problems involving nonlinear
            partial differential equations. 378:686–707. URL: https:
[SS]        Justin      Sirignano      and      Konstantinos      Spiliopoulos.
            DGM: A deep learning algorithm for solving partial
            differential equations.        375:1339–1364.         URL: https:
PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)                                                                                                    37

 atoMEC: An open-source average-atom Python code
                                      Timothy J. Callow‡§∗ , Daniel Kotik‡§ , Eli Kraisler¶ , Attila Cangi‡§


Abstract—Average-atom models are an important tool in studying matter under extreme conditions, such as those conditions experienced in planetary cores, brown and white dwarfs, and during inertial confinement fusion. In the right context, average-atom models can yield results with similar accuracy to simulations which require orders of magnitude more computing time, and thus can greatly reduce financial and environmental costs. Unfortunately, due to the wide range of possible models and approximations, and the lack of open-source codes, average-atom models can at times appear inaccessible. In this paper, we present our open-source average-atom code, atoMEC. We explain the aims and structure of atoMEC to illuminate the different stages and options in an average-atom calculation, and to facilitate community contributions. We also discuss the use of various open-source Python packages in atoMEC, which have expedited its development.

Index Terms—computational physics, plasma physics, atomic physics, materials science

Introduction

The study of matter under extreme conditions — materials exposed to high temperatures, high pressures, or strong electromagnetic fields — is critical to our understanding of many important scientific and technological processes, such as nuclear fusion and various astrophysical and planetary physics phenomena [GFG+ 16]. Of particular interest within this broad field is the warm dense matter (WDM) regime, which is typically characterized by temperatures in the range of 10³ − 10⁶ degrees (Kelvin), and densities ranging from dense gases to highly compressed solids (∼ 0.01 − 1000 g cm⁻³) [BDM+ 20]. In this regime, it is important to account for the quantum mechanical nature of the electrons (and in some cases, also the nuclei). Therefore conventional methods from plasma physics, which either neglect quantum effects or treat them coarsely, are usually not sufficiently accurate. On the other hand, methods from condensed-matter physics and quantum chemistry, which account fully for quantum interactions, typically target the ground-state only, and become computationally intractable for systems at high temperatures.

Nevertheless, there are methods which can, in principle, be applied to study materials at any given temperature and density whilst formally accounting for quantum interactions. These methods are often denoted as "first-principles" because, formally speaking, they yield the exact properties of the system, under certain well-founded theoretical approximations. Density-functional theory (DFT), initially developed as a ground-state theory [HK64], [KS65] but later extended to non-zero temperatures [Mer65], [PPF+ 11], is one such theory and has been used extensively to study materials under WDM conditions [GDRT14]. Even though DFT reformulates the Schrödinger equation in a computationally efficient manner [Koh99], the cost of running calculations becomes prohibitively expensive at higher temperatures. Formally, it scales as O(N³τ³), with N the particle number (which usually also increases with temperature) and τ the temperature [CRNB18]. This poses a serious computational challenge in the WDM regime. Furthermore, although DFT is a formally exact theory, in practice it relies on approximations for the so-called "exchange-correlation" energy, which is, roughly speaking, responsible for simulating all the quantum interactions between electrons. Existing exchange-correlation approximations have not been rigorously tested under WDM conditions. An alternative method used in the WDM community is path-integral Monte–Carlo [DGB18], which yields essentially exact properties; however, it is even more limited by computational cost than DFT, and becomes unfeasibly expensive at lower temperatures due to the fermion sign problem.

It is therefore of great interest to reduce the computational complexity of the aforementioned methods. The use of graphics processing units in DFT calculations is becoming increasingly common, and has been shown to offer significant speed-ups relative to conventional calculations using central processing units [MED11], [JFC+ 13]. Some other examples of promising developments to reduce the cost of DFT calculations include machine-learning-based solutions [SRH+ 12], [BVL+ 17], [EFP+ 21] and stochastic DFT [CRNB18], [BNR13]. However, in this paper, we focus on an alternative class of models known as "average-atom" models. Average-atom models have a long history in plasma physics [CHKC22]: they account for quantum effects, typically using DFT, but reduce the complex system of interacting electrons and nuclei to a single atom immersed in a plasma (the "average" atom). An illustration of this principle (reduced to two dimensions for visual purposes) is shown in Fig. 1. This significantly reduces the cost relative to a full DFT simulation, because the particle number is restricted to the number of electrons per nucleus, and spherical symmetry is exploited to reduce the three-dimensional problem to one dimension.

Naturally, to reduce the complexity of the problem as described, various approximations must be introduced. It is important to understand these approximations and their limitations for average-atom models to have genuine predictive capabilities. Unfortunately, this is not always the case: although average-atom

* Corresponding author:
‡ Center for Advanced Systems Understanding (CASUS), D-02826 Görlitz, Germany
§ Helmholtz-Zentrum Dresden-Rossendorf, D-01328 Dresden, Germany
¶ Fritz Haber Center for Molecular Dynamics and Institute of Chemistry, The Hebrew University of Jerusalem, 9091401 Jerusalem, Israel

Copyright © 2022 Timothy J. Callow et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Fig. 1: Illustration of the average-atom concept. The many-body and fully-interacting system of electron density (shaded blue) and nuclei (red points) on the left is mapped into the much simpler system of independent atoms on the right. Any of these identical atoms represents the "average-atom". The effects of interaction from neighboring atoms are implicitly accounted for in an approximate manner through the choice of boundary conditions.

models share common concepts, there is no unique formal theory underpinning them. Therefore a variety of models and codes exist, and it is not typically clear which models can be expected to perform most accurately under which conditions. In a previous paper [CHKC22], we addressed this issue by deriving an average-atom model from first principles, and comparing the impact of different approximations within this model on some common properties.

In this paper, we focus on computational aspects of average-atom models for WDM. We introduce atoMEC [CKTS+ 21]: an open-source average-atom code for studying Matter under Extreme Conditions. One of the main aims of atoMEC is to improve the accessibility and understanding of average-atom models. To the best of our knowledge, open-source average-atom codes are in scarce supply: with atoMEC, we aim to provide a tool that people can use to run average-atom simulations and also to add their own models, which should facilitate comparisons of different approximations. The relative simplicity of average-atom codes means that they are not only efficient to run, but also efficient to develop: this means, for example, that they can be used as a test-bed for new ideas that could be later implemented in full DFT codes, and are also accessible to those without extensive prior expertise, such as students. atoMEC aims to facilitate development by following good practice in software engineering (for example extensive documentation), a careful design structure, and of course through the choice of Python and its widely used scientific stack, in particular the NumPy [HMvdW+ 20] and SciPy [VGO+ 20] libraries.

This paper is structured as follows: in the next section, we briefly review the key theoretical points which are important to understand the functionality of atoMEC, assuming no prior physical knowledge of the reader. Following that, we present the key functionality of atoMEC, discuss the code structure and algorithms, and explain how these relate to the theoretical aspects introduced. Finally, we present an example case study: we consider helium under the conditions often experienced in the outer layers of a white dwarf star, and probe the behavior of a few important properties, namely the band-gap, pressure, and ionization degree.

Theoretical background

Properties of interest in the warm dense matter regime include the equation-of-state data, which is the relation between the density, energy, temperature and pressure of a material [HRD08]; the mean ionization state and the electron ionization energies, which tell us about how tightly bound the electrons are to the nuclei; and the electrical and thermal conductivities. These properties yield information pertinent to our understanding of stellar and planetary physics, the Earth's core, inertial confinement fusion, and more besides. To exactly obtain these properties, one needs (in theory) to determine the thermodynamic ensemble of the quantum states (the so-called wave-functions) representing the electrons and nuclei. Fortunately, they can be obtained with reasonable accuracy using models such as average-atom models; in this section, we elaborate on how this is done.

We shall briefly review the key theory underpinning the type of average-atom model implemented in atoMEC. This is intended for readers without a background in quantum mechanics, to give some context to the purposes and mechanisms of the code. For a comprehensive derivation of this average-atom model, we direct readers to Ref. [CHKC22]. The average-atom model we shall describe falls into a class of models known as ion-sphere models, which are the simplest (and still most widely used) class of average-atom model. There are alternative (more advanced) classes of model such as ion-correlation [Roz91] and neutral pseudo-atom models [SS14] which we have not yet implemented in atoMEC, and thus we do not elaborate on them here.

As demonstrated in Fig. 1, the idea of the ion-sphere model is to map a fully-interacting system of many electrons and nuclei into a set of independent atoms which do not interact explicitly with any of the other spheres. Naturally, this depends on several assumptions and approximations, but there is formal justification for such a mapping [CHKC22]. Furthermore, there are many examples in which average-atom models have shown good agreement with more accurate simulations and experimental data [FB19], which further justifies this mapping.

Although the average-atom picture is significantly simplified relative to the full many-body problem, even determining the wave-functions and their ensemble weights for an atom at finite temperature is a complex problem. Fortunately, DFT reduces this complexity further, by establishing that the electron density — a far less complex entity than the wave-functions — is sufficient to determine all physical observables. The most popular formulation of DFT, known as Kohn–Sham DFT (KS-DFT) [KS65], allows us to construct the fully-interacting density from a non-interacting system of electrons, simplifying the problem further still. Due to the spherical symmetry of the atom, the non-interacting electrons — known as KS electrons (or KS orbitals) — can be represented as a wave-function that is a product of radial and angular components,

    φ_nlm(r) = X_nl(r) Y_lm(θ, φ),    (1)

where n, l, and m are the quantum numbers of the orbitals, which come from the fact that the wave-function is an eigenfunction of the Hamiltonian operator, and Y_lm(θ, φ) are the spherical harmonic functions.¹ The radial coordinate r represents the absolute distance from the nucleus.

1. Please note that the notation in Eq. (1) does not imply Einstein summation; all summations in this paper are written explicitly.
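To make the decomposition in Eq. (1) concrete, the short sketch below builds the angular factor Y_lm from the associated Legendre function in SciPy (scipy.special.lpmv, which includes the Condon–Shortley phase) and multiplies it by a stand-in radial part. The hydrogenic 1s radial function here is purely illustrative: in atoMEC the radial part X_nl is obtained numerically from the radial KS equation, and this snippet is not atoMEC code.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def sph_harm_Y(l, m, theta, phi):
    """Spherical harmonic Y_lm(theta, phi) for m >= 0, built from the
    associated Legendre function P_l^m evaluated at cos(theta)."""
    norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                   * factorial(l - m) / factorial(l + m))
    return norm * lpmv(m, l, np.cos(theta)) * np.exp(1j * m * phi)

def radial_1s(r, Z=1.0):
    """Stand-in radial part: hydrogenic 1s orbital in atomic units.
    atoMEC computes X_nl numerically instead of using this closed form."""
    return 2.0 * Z**1.5 * np.exp(-Z * r)

def phi_nlm(r, theta, phi, l=0, m=0):
    # Eq. (1): the full orbital is a product of radial and angular parts.
    return radial_1s(r) * sph_harm_Y(l, m, theta, phi)

# Y_00 is constant over the sphere: 1/sqrt(4*pi), independent of angles.
print(sph_harm_Y(0, 0, 0.3, 1.2).real)
```

For negative m, one would use the standard symmetry relation of the spherical harmonics; it is omitted here to keep the sketch short.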
We therefore only need to determine the radial KS orbitals X_nl(r). These are determined by solving the radial KS equation, which is similar to the Schrödinger equation for a non-interacting system, with an additional term in the potential to mimic the effects of electron-electron interaction (within the single atom). The radial KS equation is given by:

    [ −(1/2)( d²/dr² + (2/r) d/dr − l(l+1)/r² ) + v_s[n](r) ] X_nl(r) = ε_nl X_nl(r).    (2)

We have written the above equation in a way that emphasizes that it is an eigenvalue equation, with the eigenvalues ε_nl being the energies of the KS orbitals.

On the left-hand side, the terms in the round brackets come from the kinetic energy operator acting on the orbitals. The v_s[n](r) term is the KS potential, which itself is composed of three different terms,

    v_s[n](r) = −Z/r + 4π ∫₀^R_WS dx n(x) x² / max(r, x) + δF_xc[n]/δn(r),    (3)

where R_WS is the radius of the atomic sphere, n(r) is the electron density, Z the nuclear charge, and F_xc[n] the exchange-correlation free energy functional. Thus the three terms in the potential are respectively the electron-nuclear attraction, the classical Hartree repulsion, and the exchange-correlation (xc) potential.

We note that the KS potential and its constituents are functionals of the electron density n(r). Were it not for this dependence on the density, solving Eq. (2) would just amount to solving an ordinary linear differential equation (ODE). However, the electron density is in fact constructed from the orbitals in the following way,

    n(r) = 2 Σ_nl (2l + 1) f_nl(ε_nl, μ, τ) |X_nl(r)|²,    (4)

where f_nl(ε_nl, μ, τ) is the Fermi–Dirac distribution, given by

    f_nl(ε_nl, μ, τ) = 1 / (1 + e^((ε_nl − μ)/τ)),    (5)

where τ is the temperature, and μ is the chemical potential, which is determined by fixing the number of electrons to be equal to a pre-determined value N_e (typically equal to the nuclear charge Z). The Fermi–Dirac distribution therefore assigns weights to the KS orbitals in the construction of the density, with the weight depending on their energy.

Therefore, the KS potential that determines the KS orbitals via the ODE (2) is itself dependent on the KS orbitals. Consequently, the KS orbitals and their dependent quantities (the density and KS potential) must be determined via a so-called self-consistent field (SCF) procedure. An initial guess for the orbitals, X_nl^(0)(r), is used to construct the initial density n^(0)(r) and potential v_s^(0)(r). The ODE (2) is then solved to update the orbitals. This process is iterated until some appropriately chosen quantities — in atoMEC the total free energy, density and KS potential — are converged, i.e. n^(i+1)(r) = n^(i)(r), v_s^(i+1)(r) = v_s^(i)(r), and F^(i+1) = F^(i), within some reasonable numerical tolerance. In Fig. 2, we illustrate the life-cycle of the average-atom model described so far, including the SCF procedure. On the left-hand side of this figure, we show the physical choices and mathematical operations, and on the right-hand side, the representative classes and functions in atoMEC. In the following section, we shall discuss some aspects of this figure in more detail.

Some quantities obtained from the completion of the SCF procedure are directly of interest. For example, the energy eigenvalues ε_nl are related to the electron ionization energies, i.e. the amount of energy required to excite an electron bound to the nucleus to being a free (conducting) electron. These predicted ionization energies can be used, for example, to help understand ionization potential depression, an important but somewhat controversial effect in WDM [STJ+ 14]. Another property that can be straightforwardly obtained from the energy levels and their occupation numbers is the mean ionization state Z̄,²

    Z̄ = Σ (2l + 1) f_nl(ε_nl, μ, τ),    (6)

which is an important input parameter for various models, such as adiabats which are used to model inertial confinement fusion [KDF+ 11].

Various other interesting properties can also be calculated following some post-processing of the output of an SCF calculation, for example the pressure exerted by the electrons and ions. Furthermore, response properties, i.e. those resulting from an external perturbation like a laser pulse, can also be obtained from the output of an SCF cycle. These properties include, for example, electrical conductivities [Sta16] and dynamical structure factors [SPS+ 14].

Code structure and details

In the following sections, we describe the structure of the code in relation to the physical problem being modeled. Average-atom models typically rely on various parameters and approximations. In atoMEC, we have tried to structure the code in a way that makes clear which parameters come from the physical problem studied, compared to choices of the model and numerical or algorithmic choices.

atoMEC.Atom: Physical parameters

The first step of any simulation in WDM (which also applies to simulations in science more generally) is to define the physical parameters of the problem. These parameters are unique in the sense that, if we had an exact method to simulate the real system, then for each combination of these parameters there would be a unique solution. In other words, regardless of the model — be it average atom or a different technique — these parameters are always required and are independent of the model.

In average-atom models, there are typically three parameters defining the physical problem, which are:

• the atomic species;
• the temperature of the material, τ;
• the mass density of the material, ρ_m.

The mass density also directly corresponds to the mean distance between two nuclei (atomic centers), which in the average-atom model is equal to twice the radius of the atomic sphere, R_WS. An additional physical parameter not mentioned above is the net charge of the material being considered, i.e. the difference between the nuclear charge Z and the electron number N_e. However, we usually assume zero net charge in average-atom simulations (i.e. the number of electrons is equal to the atomic charge).

In atoMEC, these physical parameters are controlled by the Atom object. As an example, we consider aluminum under ambient conditions, i.e. at room temperature, τ = 300 K, and normal metallic density, ρ_m = 2.7 g cm⁻³. We set this up as:

2. The summation in Eq. (6) is often shown as an integral because the energies above a certain threshold form a continuous distribution (in most models).
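As a concrete illustration of Eqs. (4) and (5), the self-contained sketch below computes Fermi–Dirac occupations for a small set of hard-coded, purely hypothetical eigenvalues ε_nl, and determines the chemical potential μ by root-finding so that the total electron number equals a chosen N_e, exactly as described above. None of the numbers correspond to a real calculation; atoMEC obtains the eigenvalues by solving the radial KS equation.

```python
import numpy as np
from scipy.optimize import brentq

def fermi_dirac(eps, mu, tau):
    # Eq. (5): occupation of an orbital with energy eps at temperature tau
    return 1.0 / (1.0 + np.exp((eps - mu) / tau))

def electron_count(mu, eigs_l, tau):
    # Spatial integral of Eq. (4): each (n, l) level holds up to
    # 2 * (2l + 1) electrons (spin times angular degeneracy)
    return sum(
        2 * (2 * l + 1) * fermi_dirac(eps, mu, tau) for (l, eps) in eigs_l
    )

# Hypothetical (l, eps_nl) pairs in Hartree units -- illustrative only.
eigs_l = [(0, -2.0), (0, -0.5), (1, -0.3), (1, 0.4), (2, 1.1)]
tau = 0.1   # temperature (Hartree)
N_e = 13    # electron number, e.g. aluminum with zero net charge

# mu is fixed by requiring electron_count(mu) = N_e; the count is a
# monotonically increasing function of mu, so bisection-type root
# finding on a wide bracket is reliable.
mu = brentq(lambda m: electron_count(m, eigs_l, tau) - N_e, -10.0, 10.0)
print(mu, electron_count(mu, eigs_l, tau))
```

Deeply bound levels (ε_nl far below μ) come out fully occupied, while levels far above μ are essentially empty, which is the weighting behavior described in the text.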
Fig. 2: Schematic of the average-atom model set-up and the self-consistent field (SCF) cycle. On the left-hand side, the physical choices and
mathematical operations that define the model and SCF cycle are shown. On the right-hand side, the (higher-order) functions and classes in
atoMEC corresponding to the items on the left-hand side are shown. Some liberties are taken with the code snippets in the right-hand column
of the figure to improve readability; more precisely, some non-crucial intermediate steps are not shown, and some parameters are also not
shown or simplified. The dotted lines represent operations that are taken care of within the models.CalcEnergy function, but are shown
nevertheless to improve understanding.
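The SCF cycle summarized in Fig. 2 follows a generic control flow that can be sketched independently of the physics: update the density from the current one, mix old and new iterates for stability, and stop when successive iterates agree within a tolerance. In the sketch below, a simple contractive toy map stands in for the expensive step (constructing v_s[n] and re-solving the radial KS equation); only the loop structure mirrors what atoMEC does, and all names here are illustrative.

```python
import numpy as np

def toy_density_update(n, r):
    """Stand-in for 'build v_s[n], solve the radial KS equation, rebuild n'.
    Any contractive map suffices to demonstrate the SCF control flow."""
    return np.exp(-r) * (1.0 + 0.5 * np.tanh(n))

def scf_loop(r, n_init, mix=0.3, tol=1e-10, max_iter=200):
    """Iterate the density to self-consistency with linear mixing."""
    n_old = n_init
    for it in range(1, max_iter + 1):
        n_new = toy_density_update(n_old, r)
        # Linear mixing damps oscillations between successive iterations.
        n_mixed = mix * n_new + (1.0 - mix) * n_old
        # Convergence test: successive densities agree pointwise.
        if np.max(np.abs(n_mixed - n_old)) < tol:
            return n_mixed, it, True
        n_old = n_mixed
    return n_old, max_iter, False

r = np.linspace(0.0, 5.0, 101)
n0 = np.ones_like(r)          # crude initial guess for the density
n_conv, iters, converged = scf_loop(r, n0)
print(converged, iters)
```

Linear mixing is the simplest damping scheme; production DFT codes typically monitor several quantities at once (in atoMEC the free energy, density, and KS potential, as described above) and often use more sophisticated mixing to accelerate convergence.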
Fig. 3: Auto-generated print statement from calling the atoMEC.Atom object.

from atoMEC import Atom
Al = Atom("Al", 300, density=2.7, units_temp="K")

By default, the above code automatically prints the output seen in Fig. 3. We see that the first two arguments of the Atom object are the chemical symbol of the element being studied, and the temperature. In addition, at least one of "density" or "radius" must be specified. In atoMEC, the default (and only permitted) units for the mass density are g cm⁻³; all other input and output units in atoMEC are by default Hartree atomic units, and hence we specify "K" for Kelvin.

The information in Fig. 3 displays the chosen parameters in units commonly used in the plasma and condensed-matter physics communities, as well as some other information directly obtained from these parameters. The chemical symbol ("Al" in this case) is passed to the mendeleev library [men14] to generate this data, which is used later in the calculation.

This initial stage of the average-atom calculation, i.e. the specification of physical parameters and initialization of the Atom object, is shown in the top row of Fig. 2.

atoMEC.models: Model parameters

After the physical parameters are set, the next stage of the average-atom calculation is to choose the model and the approximations within that class of model. As discussed, so far the only class of model implemented in atoMEC is the ion-sphere model. Within this model, there are still various choices to be made by the user. In some cases, these choices make little difference to the results, but in other cases they have significant impact. The user might have some physical intuition as to which is most important, or alternatively may want to run the same physical parameters with several different model parameters to examine the effects. Some choices available in atoMEC, listed approximately in decreasing order of impact (but this can depend strongly on the system under

Fig. 4: Auto-generated print statement from calling the models.ISModel object.

with a "quantum" treatment of the unbound electrons, and choose the LDA exchange functional (which is also the default). This model is set up as:

from atoMEC import models
model = models.ISModel(Al, bc="neumann",
            xfunc_id="lda_x", unbound="quantum")

By default, the above code prints the output shown in Fig. 4. The first (and only mandatory) input parameter to the models.ISModel object is the Atom object that we generated earlier. Together with the optional spinpol and spinmag parameters in the models.ISModel object, this sets either the total number of electrons (spinpol=False) or the number of electrons in each spin channel (spinpol=True).

The remaining information displayed in Fig. 4 shows directly the chosen model parameters, or the default values where these parameters are not specified. The exchange and correlation functionals, set by the parameters xfunc_id and cfunc_id, are passed to the LIBXC library [LSOM18] for processing. So far, only the "local density" family of approximations is available in atoMEC, and thus the default values are usually a sensible choice. For more information on exchange and correlation functionals, there are many reviews in the literature, for example Ref.

This stage of the average-atom calculation, i.e. the specification of the model and the choices of approximation within that, is shown in the second row of Fig. 2.

ISModel.CalcEnergy: SCF calculation and numerical parameters

Once the physical parameters and model have been defined, the next stage in the average-atom calculation (or indeed any DFT calculation) is the SCF procedure. In atoMEC, this is invoked by the ISModel.CalcEnergy function. This function is called CalcEnergy because it finds the KS orbitals (and associated KS density) which minimize the total free energy.

Clearly, there are various mathematical and algorithmic
consideration), are:                                                    choices in this calculation. These include, for example: the basis in
   •   the boundary conditions used to solve the KS equations;          which the KS orbitals and potential are represented, the algorithm
   •   the treatment of the unbound electrons, which means              used to solve the KS equations (2), and how to ensure smooth
       those electrons not tightly bound to the nucleus, but rather     convergence of the SCF cycle. In atoMEC, the SCF procedure
       delocalized over the whole atomic sphere;                        currently follows a single pre-determined algorithm, which we
   •   the choice of exchange and correlation functionals, the          briefly review below.
       central approximations of DFT [CMSY12];                               In atoMEC, we represent the radial KS quantities (orbitals,
   •   the spin polarization and magnetization.                         density and potential) on a logarithmic grid, i.e. x = log(r).
                                                                        Furthermore, we make a transformation of the orbitals Pnl (x) =
   We do not discuss the theory and impact of these different
                                                                        Xnl (x)ex/2 . Then the equations to be solved become:
choices in this paper. Rather, we direct readers to Refs. [CHKC22]
and [CKC22] in which all of these choices are discussed.                           d2 Pnl (x)
                                                                                              − 2e2x (W (x) − εnl )Pnl (x) = 0              (7)
   In atoMEC, the ion-sphere model is controlled by the                               dx2
models.ISModel object. Continuing with our aluminum ex-                                                   1        1 2 −2x
ample, we choose the so-called "neumann" boundary condition,                        W (x) = vs [n](x) +       l+         e .                (8)
                                                                                                          2        2
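Equation (7) can be solved numerically on the logarithmic grid. As a rough, self-contained illustration of the matrix Numerov discretization used by atoMEC (described in the next section), the sketch below solves Eq. (7) for a bare Coulomb potential v_s = −1/r with l = 0, for which the exact bound-state energies are −1/(2n²) Hartree. This is a standalone reimplementation with arbitrary grid parameters, not atoMEC's numerov module; for simplicity it also uses a dense solver rather than the sparse one atoMEC employs.

```python
import numpy as np
from scipy.linalg import eig

# Logarithmic grid x = log(r); domain and size are illustrative choices.
N = 600
x = np.linspace(-8.0, 4.0, N)
dx = x[1] - x[0]
r = np.exp(x)

# Numerov matrices built from lower-shift, identity, and upper-shift matrices.
I0 = np.eye(N)
I1 = np.eye(N, k=1)
Im1 = np.eye(N, k=-1)
A = (Im1 - 2.0 * I0 + I1) / dx**2       # second-derivative stencil
B = (Im1 + 10.0 * I0 + I1) / 12.0       # Numerov weighting

# Effective potential W(x) of Eq. (8) for hydrogen (v_s = -1/r), l = 0.
l = 0
W = -1.0 / r + 0.5 * (l + 0.5) ** 2 * np.exp(-2.0 * x)

# Numerov form of Eq. (7): A P = 2 B E2 (W - eps) P with E2 = diag(e^{2x}),
# rearranged into the generalized eigenproblem H P = eps M P.
E2 = np.diag(np.exp(2.0 * x))
H = -0.5 * A + B @ E2 @ np.diag(W)
M = B @ E2

eps = np.sort(eig(H, M, right=False).real)
bound = eps[eps < -0.01]                # discard discretized continuum states
print(bound[:3])                        # lowest s-levels; exact: -1/2, -1/8, -1/18 Ha
```

The computed eigenvalues should approach the exact hydrogen values as the grid is refined, since the Numerov discretization error decreases rapidly with the grid spacing.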
42                                                                                         PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

In atoMEC, we solve the KS equations using a matrix implementation of Numerov's algorithm [PGW12]. This means we diagonalize the following equation:

    Ĥ P⃗ = ε⃗ B̂ P⃗ ,  where                                     (9)

    Ĥ = T̂ + B̂ Ŵ_s(x⃗) ,                                       (10)

    T̂ = −(1/2) e^{−2x⃗} Â ,                                    (11)

    Â = (Î_{−1} − 2 Î_0 + Î_1) / dx² ,  and                    (12)

    B̂ = (Î_{−1} + 10 Î_0 + Î_1) / 12 .                        (13)

In the above, Î_{−1/0/1} are the lower shift, identity, and upper shift matrices.

The Hamiltonian matrix Ĥ is sparse and we only seek a subset of eigenstates with lower energies: therefore there is no need to perform a full diagonalization, which scales as O(N³), with N being the size of the radial grid. Instead, we use SciPy's sparse matrix diagonalization function scipy.sparse.linalg.eigs, which scales more efficiently and allows us to go to larger grid sizes.

After each step in the SCF cycle, the relative changes in the free energy F, density n(r) and potential v_s(r) are computed. Specifically, the quantities computed are

    ΔF = |F^i − F^{i−1}| / F^i ,                                (14)

    Δn = ∫dr |n^i(r) − n^{i−1}(r)| / ∫dr n^i(r) ,               (15)

    Δv = ∫dr |v_s^i(r) − v_s^{i−1}(r)| / ∫dr v_s^i(r) .         (16)

Once all three of these metrics fall below a certain threshold, the SCF cycle is considered converged and the calculation finishes.

The SCF cycle is an example of a non-linear system and is thus prone to chaotic (non-convergent) behavior. Consequently, a range of techniques has been developed to ensure convergence [SM91]. Fortunately, the tendency for calculations not to converge becomes less likely at temperatures above zero (and especially as temperatures increase). Therefore we have implemented only a simple linear mixing scheme in atoMEC. The potential used in each diagonalization step of the SCF cycle is not simply the one generated from the most recent density, but a mix of that potential and the previous one,

    v_s^{(i)}(r) = α v_s^i(r) + (1 − α) v_s^{i−1}(r) .          (17)

In general, a lower value of the mixing fraction α makes the SCF cycle more stable, but requires more iterations to converge. Typically a choice of α ≈ 0.5 gives a reasonable balance between speed and stability.

We can thus summarize the key parameters in an SCF calculation as follows:

   •   the maximum number of eigenstates to compute, in terms of both the principal and angular quantum numbers;
   •   the numerical grid parameters, in particular the grid size;
   •   the convergence tolerances, Eqs. (14) to (16);
   •   the SCF parameters, i.e. the mixing fraction and the maximum number of iterations.

The first three items in this list essentially control the accuracy of the calculation. In principle, for each SCF calculation — i.e. a unique set of physical and model inputs — these parameters should be independently varied until some property (such as the total free energy) is considered suitably converged with respect to that parameter. Changing the SCF parameters should not affect the final results (within the convergence tolerances), only the number of iterations in the SCF cycle.

Let us now consider an example SCF calculation, using the Atom and model objects we have already defined:

from atoMEC import config
config.numcores = -1 # parallelize

nmax = 3 # max value of principal quantum number
lmax = 3 # max value of angular quantum number

# run SCF calculation
scf_out = model.CalcEnergy(
    nmax,
    lmax,
    grid_params={"ngrid": 1500},
    scf_params={"mixfrac": 0.7},
)

We see that the first two parameters passed to the CalcEnergy function are the nmax and lmax quantum numbers, which specify the number of eigenstates to compute. Precisely speaking, there is a unique Hamiltonian for each value of the angular quantum number l (and, in a spin-polarized calculation, also for each spin quantum number). The sparse diagonalization routine then computes the first nmax eigenvalues for each Hamiltonian. In atoMEC, these diagonalizations can be run in parallel, since they are independent for each value of l. This is done by setting the config.numcores variable to the number of cores desired (config.numcores=-1 uses all the available cores) and is handled via the joblib library [Job20].

The remaining parameters passed to the CalcEnergy function are optional; in the above, we have specified a grid size of 1500 points and a mixing fraction α = 0.7. The above code automatically prints the output seen in Fig. 5. This output shows the SCF cycle and, upon completion, the breakdown of the total free energy into its various components, as well as other useful information such as the KS energy levels and their occupations.

Additionally, the output of the SCF function is a dictionary containing the staticKS.Orbitals, staticKS.Density, staticKS.Potential and staticKS.Energy objects. For example, one could extract the eigenfunctions as follows:

orbs = scf_out["orbitals"] # orbs object
ks_eigfuncs = orbs.eigfuncs # eigenfunctions

The initialization of the SCF procedure is shown in the third and fourth rows of Fig. 2, with the SCF procedure itself shown in the remaining rows.

This completes the section on the code structure and algorithmic details. As discussed, with the output of an SCF calculation, there are various kinds of post-processing one can perform to obtain other properties of interest. So far in atoMEC, these are limited to the computation of the pressure (ISModel.CalcPressure), the electron localization function (atoMEC.postprocess.ELFTools) and the Kubo–Greenwood conductivity (atoMEC.postprocess.conductivity). We refer readers to our pre-print [CKC22] for details on how the electron localization function and the Kubo–Greenwood conductivity can be used to improve predictions of the mean ionization state.
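The role of the mixing fraction in Eq. (17) can be demonstrated with a deliberately simple scalar fixed-point problem. The update g(v) below is a hypothetical stand-in for "generate a new potential from the old one" (it is not taken from atoMEC); it is unstable under plain iteration, so it only converges when the mixing fraction is small enough. The convergence test mirrors the relative-change metrics of Eqs. (14) to (16).

```python
# Toy illustration of linear mixing, Eq. (17). The update g(v) has fixed
# point v* = 1 but slope -2 there, so plain iteration (alpha = 1) diverges;
# sufficiently small alpha stabilizes the cycle.

def g(v):
    return -2.0 * v + 3.0  # hypothetical "new potential from old", v* = 1

def run_scf(alpha, v0=0.0, tol=1e-8, maxiter=1000):
    v = v0
    for i in range(1, maxiter + 1):
        v_new = g(v)
        v_mixed = alpha * v_new + (1.0 - alpha) * v  # Eq. (17)
        # relative change between iterations, cf. Eqs. (14)-(16)
        delta = abs(v_mixed - v) / max(abs(v_mixed), 1e-300)
        v = v_mixed
        if delta < tol:
            return i, v      # converged: iteration count and solution
    return None, v           # did not converge within maxiter

print(run_scf(alpha=0.5))    # converges close to v* = 1
print(run_scf(alpha=0.9))    # (None, ...): mixing too aggressive, diverges
```

In atoMEC itself, the mixing fraction is set through the scf_params={"mixfrac": ...} argument of ISModel.CalcEnergy shown above; when the unmixed update is already stable, a smaller mixing fraction simply slows convergence, as noted in the text.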
ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE                                                                                            43
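The orbital occupations discussed in this section (shown as black dots in Fig. 6) follow Fermi–Dirac statistics. A minimal, self-contained illustration of the occupation factor, in Hartree atomic units, is given below; the chemical potential value is purely illustrative, and the Kelvin-to-Hartree conversion uses 1 Ha ≈ 315775 K.

```python
import math

# Fermi-Dirac occupation of a level with energy eps, for chemical potential
# mu and temperature tau (all quantities in Hartree atomic units).
def fermi_dirac(eps, mu, tau):
    return 1.0 / (1.0 + math.exp((eps - mu) / tau))

tau = 50000.0 / 315775.0   # 50 kK expressed in Hartree
mu = 0.0                   # illustrative chemical potential

print(fermi_dirac(mu - 1.0, mu, tau))  # level well below mu: occupation ~ 1
print(fermi_dirac(mu, mu, tau))        # level at mu: occupation exactly 0.5
print(fermi_dirac(mu + 1.0, mu, tau))  # level well above mu: occupation ~ 0
```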

Fig. 5: Auto-generated print statement from calling the ISModel.CalcEnergy function.

Fig. 6: Helium density-of-states (DOS) as a function of energy, for different mass densities ρ_m, and at temperature τ = 50 kK. Black dots indicate the occupations of the electrons in the permitted energy ranges. Dashed black lines indicate the band-gap (the energy gap between the insulating and conducting bands). Between 5 and 6 g cm⁻³, the band-gap disappears.

Case-study: Helium

In this section, we consider an application of atoMEC in the WDM regime. Helium is the second most abundant element in the universe (after hydrogen), and therefore understanding its behavior under a wide range of conditions is important for our understanding of many astrophysical processes. Of particular interest are the conditions under which helium is expected to undergo a transition from insulating to metallic behavior in the outer layers of white dwarfs, which are characterized by densities of around 1 − 20 g cm⁻³ and temperatures of 10 − 50 kK [PR20]. These conditions are a typical example of the WDM regime. Besides predicting the point at which the insulator-to-metallic transition occurs in the density-temperature spectrum, other properties of interest include equation-of-state data (relating pressure, density and temperature) and electrical conductivity.

To calculate the insulator-to-metallic transition point, the key quantity is the electronic band-gap. The concept of band structures is a complicated topic, which we try to describe briefly in layman's terms. In solids, electrons can occupy certain energy ranges — we call these the energy bands. In insulating materials, there is a gap between these energy ranges that electrons are forbidden from occupying — this is the so-called band-gap. In conducting materials, there is no such gap, and therefore electrons can conduct electricity because they can be excited into any part of the energy spectrum. Therefore, a simple method to determine the insulator-to-metallic transition is to determine the density at which the band-gap becomes zero.

In Fig. 6, we plot the density-of-states (DOS) as a function of energy, for different densities and at fixed temperature τ = 50 kK. The DOS shows the energy ranges that the electrons are allowed to occupy; we also show the actual energies occupied by the electrons (according to Fermi–Dirac statistics) with the black dots. We can clearly see in this figure that the band-gap (the region where the DOS is zero) becomes smaller as a function of density. From this figure, it seems the transition from the insulating to the metallic state happens somewhere between 5 and 6 g cm⁻³.

In Fig. 7, we plot the band-gap as a function of density, for a fixed temperature τ = 50 kK. Visually, it appears that the relationship between band-gap and density is linear at this temperature. This is confirmed using a linear fit, which has a coefficient of determination of almost exactly one, R² = 0.9997. Using this fit, the band-gap is predicted to close at 5.5 g cm⁻³. Also in this figure, we show the fraction of ionized electrons, which is given by Z̄/N_e, using Eq. (6) to calculate Z̄, with N_e being the total electron number. The ionization fraction also relates to the conductivity of the material, because ionized electrons are not bound to any nuclei and are therefore free to conduct electricity. We see that the ionization fraction mostly increases with density (excepting some strange behavior around ρ_m = 1 g cm⁻³), which is further evidence of the transition from insulating to conducting behaviour with increasing density.

As a final analysis, we plot the pressure as a function of mass density and temperature in Fig. 8.
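The linear-fit analysis of the band-gap described above can be sketched in a few lines of NumPy. The band-gap values below are made-up placeholders chosen only to illustrate the procedure; they are not the helium data of Fig. 7.

```python
import numpy as np

# Hypothetical band-gap vs density data (NOT the paper's results).
rho = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # mass density (g cm^-3)
gap = np.array([0.62, 0.48, 0.35, 0.20, 0.07])  # band-gap (Ha), illustrative

m, b = np.polyfit(rho, gap, 1)   # least-squares linear fit: gap ~ m*rho + b
pred = m * rho + b
ss_res = np.sum((gap - pred) ** 2)
ss_tot = np.sum((gap - gap.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot       # coefficient of determination R^2
rho_close = -b / m               # density at which the fitted gap closes

print(f"R^2 = {r2:.4f}; fitted gap closes near {rho_close:.2f} g cm^-3")
```

The same two fitted numbers (R² and the zero-crossing density) are what the text reports for the actual helium calculations.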

The pressure is given by the sum of two terms: (i) the electronic pressure, calculated using the method described in Ref. [FB19], and (ii) the ionic pressure, calculated using the ideal gas law. We observe that the pressure increases with both density and temperature, which is the expected behavior. Under these conditions, the density dependence is much stronger, especially for higher densities.

Fig. 7: Band-gap (red circles) and ionization fraction (blue squares) for helium as a function of mass density, at temperature τ = 50 kK. The relationship between the band-gap and the density appears to be linear.

Fig. 8: Helium pressure (logarithmic scale) as a function of mass density and temperature. The pressure increases with density and temperature (as expected), with a stronger dependence on density.

The code required to generate the above results and plots can be found in this repository.

Conclusions and future work

In this paper, we have presented atoMEC: an average-atom Python code for studying materials under extreme conditions. The open-source nature of atoMEC, and the choice to use (pure) Python as the programming language, is designed to improve the accessibility of average-atom models.

We gave significant attention to the code structure in this paper, and tried as much as possible to connect the functions and objects in the code with the underlying theory. We hope that this not only improves atoMEC from a user perspective, but also facilitates new contributions from the wider average-atom, WDM and scientific Python communities. Another aim of the paper was to communicate how atoMEC benefits from a strong ecosystem of open-source scientific libraries — especially the Python libraries NumPy, SciPy, joblib and mendeleev, as well as LIBXC.

We finish this paper by emphasizing that atoMEC is still in the early stages of development, and there are many opportunities to improve and extend the code. These include, for example:

   •   adding new average-atom models, and different approximations to the existing models.ISModel model;
   •   optimizing the code, in particular the routines in the numerov module;
   •   adding new postprocessing functionality, for example to compute structure factors;
   •   improving the structure and design choices of the code.

Of course, these are just a snapshot of the avenues for future development in atoMEC. We are open to contributions in these areas and many more besides.

Acknowledgements

This work was partly funded by the Center for Advanced Systems Understanding (CASUS), which is financed by Germany's Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament.

REFERENCES

[BDM+20]   M. Bonitz, T. Dornheim, Zh. A. Moldabekov, S. Zhang, P. Hamann, H. Kählert, A. Filinov, K. Ramakrishna, and J. Vorberger. Ab initio simulation of warm dense matter. Phys. Plasmas, 27(4):042710, 2020. doi:10.1063/1.5143225.
[BNR13]    Roi Baer, Daniel Neuhauser, and Eran Rabani. Self-averaging stochastic Kohn-Sham density-functional theory. Phys. Rev. Lett., 111:106402, Sep 2013. doi:10.1103/PhysRevLett.111.106402.
[BVL+17]   Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn-Sham equations with machine learning. Nature Communications, 8(1):872, Oct 2017. doi:10.1038/s41467-017-
[CHKC22]   T. J. Callow, S. B. Hansen, E. Kraisler, and A. Cangi. First-principles derivation and properties of density-functional average-atom models. Phys. Rev. Research, 4:023055, Apr 2022. doi:10.1103/PhysRevResearch.4.023055.
[CKC22]    Timothy J. Callow, Eli Kraisler, and Attila Cangi. Accurate and efficient computation of mean ionization states with an average-atom Kubo-Greenwood approach, 2022. doi:10.48550/ARXIV.2203.05863.
[CKTS+21]  Timothy Callow, Daniel Kotik, Ekaterina Tsvetoslavova Stankulova, Eli Kraisler, and Attila Cangi. atoMEC, August 2021. If you use this software, please cite it using these metadata. doi:10.5281/zenodo.5205719.
[CMSY12]   Aron J. Cohen, Paula Mori-Sánchez, and Weitao Yang. Challenges for density functional theory. Chemical Reviews, 112(1):289–320, 2012. doi:10.1021/cr200107z.
[CRNB18]   Yael Cytter, Eran Rabani, Daniel Neuhauser, and Roi Baer. Stochastic density functional theory at finite temperatures. Phys. Rev. B, 97:115207, Mar 2018. doi:10.1103/PhysRevB.97.115207.
[DGB18]    Tobias Dornheim, Simon Groth, and Michael Bonitz. The uniform electron gas at warm dense matter conditions. Phys. Rep., 744:1–86, 2018. doi:10.1016/j.physrep.2018.04.001.
[EFP+21]   J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. Stephens, A. P. Thompson, A. Cangi, and S. Rajamanickam. Accelerating finite-temperature Kohn-Sham density functional theory with deep neural networks. Phys. Rev. B, 104:035120, Jul 2021. doi:10.1103/PhysRevB.104.035120.

[FB19]     Gérald Faussurier and Christophe Blancard. Pressure in warm and hot dense matter using the average-atom model. Phys. Rev. E, 99:053201, May 2019. doi:10.1103/PhysRevE.99.053201.
[GDRT14]   Frank Graziani, Michael P Desjarlais, Ronald Redmer, and Samuel B Trickey. Frontiers and challenges in warm dense matter, volume 96. Springer Science & Business, 2014. doi:10.1007/978-3-319-04912-0.
[GFG+16]   S H Glenzer, L B Fletcher, E Galtier, B Nagler, R Alonso-Mori, B Barbrel, S B Brown, D A Chapman, Z Chen, C B Curry, F Fiuza, E Gamboa, M Gauthier, D O Gericke, A Gleason, S Goede, E Granados, P Heimann, J Kim, D Kraus, M J MacDonald, A J Mackinnon, R Mishra, A Ravasio, C Roedel, P Sperling, W Schumaker, Y Y Tsui, J Vorberger, U Zastrau, A Fry, W E White, J B Hasting, and H J Lee. Matter under extreme conditions experiments at the Linac Coherent Light Source. J. Phys. B, 49(9):092001, Apr 2016. doi:10.1088/0953-4075/49/9/092001.
[HK64]     P. Hohenberg and W. Kohn. Inhomogeneous electron gas. Phys. Rev., 136(3B):B864–B871, Nov 1964. doi:10.1103/PhysRev.136.B864.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[HRD08]    Bastian Holst, Ronald Redmer, and Michael P. Desjarlais. Thermophysical properties of warm dense hydrogen using quantum molecular dynamics simulations. Phys. Rev. B, 77:184201, May 2008. doi:10.1103/PhysRevB.77.184201.
[JFC+13]   Weile Jia, Jiyun Fu, Zongyan Cao, Long Wang, Xuebin Chi, Weiguo Gao, and Lin-Wang Wang. Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines. Journal of Computational Physics, 251:102–115, 2013. doi:10.1016/
[Job20]    Joblib Development Team. Joblib: running Python functions as pipeline jobs, 2020.
[KDF+11]   A. L. Kritcher, T. Döppner, C. Fortmann, T. Ma, O. L. Landen, R. Wallace, and S. H. Glenzer. In-Flight Measurements of Capsule Shell Adiabats in Laser-Driven Implosions.
           temperature density-functional theory. Phys. Rev. Lett., 107:163001, Oct 2011. doi:10.1103/PhysRevLett.107.163001.
[PR20]     Martin Preising and Ronald Redmer. Metallization of dense fluid helium from ab initio simulations. Phys. Rev. B, 102:224107, Dec 2020. doi:10.1103/PhysRevB.102.224107.
[Roz91]    Balazs F. Rozsnyai. Photoabsorption in hot plasmas based on the ion-sphere and ion-correlation models. Phys. Rev. A, 43:3035–3042, Mar 1991. doi:10.1103/PhysRevA.43.3035.
[SM91]     H. B. Schlegel and J. J. W. McDouall. Do You Have SCF Stability and Convergence Problems?, pages 167–185. Springer Netherlands, Dordrecht, 1991. doi:10.1007/978-94-011-3262-6_2.
[SPS+14]   A. N. Souza, D. J. Perkins, C. E. Starrett, D. Saumon, and S. B. Hansen. Predictions of x-ray scattering spectra for warm dense matter. Phys. Rev. E, 89:023108, Feb 2014. doi:10.1103/PhysRevE.89.023108.
[SRH+12]   John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke. Finding density functionals with machine learning. Phys. Rev. Lett., 108:253002, Jun 2012. doi:10.1103/PhysRevLett.108.253002.
[SS14]     C.E. Starrett and D. Saumon. A simple method for determining the ionic structure of warm dense matter. High Energy Density Physics, 10:35–42, 2014. doi:10.1016/j.hedp.2013.12.001.
[Sta16]    C.E. Starrett. Kubo–Greenwood approach to conductivity in dense plasmas with average atom models. High Energy Density Physics, 19:58–64, 2016. doi:10.1016/j.hedp.2016.04.001.
[STJ+14]   Sang-Kil Son, Robert Thiele, Zoltan Jurek, Beata Ziaja, and Robin Santra. Quantum-mechanical calculation of ionization-potential lowering in dense plasmas. Phys. Rev. X, 4:031004, Jul 2014. doi:10.1103/PhysRevX.4.031004.
[VGO+20]   Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272,
           Phys. Rev. Lett., 107:015002, Jul 2011. doi:10.1103/
                                                                                       2020. doi:10.1038/s41592-019-0686-2.
[Koh99]    W. Kohn. Nobel lecture: Electronic structure of matter—wave
           functions and density functionals. Rev. Mod. Phys., 71:1253–
           1266, 10 1999. doi:10.1103/RevModPhys.71.1253.
[KS65]     W. Kohn and L. J. Sham. Self-consistent equations including
           exchange and correlation effects. Phys. Rev., 140(4A):A1133–
           A1138, Nov 1965.            doi:10.1103/PhysRev.140.
[LSOM18]   Susi Lehtola, Conrad Steigemann, Micael J.T. Oliveira, and
           Miguel A.L. Marques. Recent developments in LIBXC —
           A comprehensive library of functionals for density functional
           theory. SoftwareX, 7:1–5, 2018. doi:10.1016/j.softx.
[MED11]    Stefan Maintz, Bernhard Eck, and Richard Dronskowski.
           Speeding up plane-wave electronic-structure calculations us-
           ing graphics-processing units. Computer Physics Communi-
           cations, 182(7):1421–1427, 2011. doi:10.1016/j.cpc.
[men14]    mendeleev – A Python resource for properties of chemical
           elements, ions and isotopes, ver. 0.9.0.
           lmmentel/mendeleev, 2014.
[Mer65]    N. David Mermin. Thermal properties of the inhomogeneous
           electron gas. Phys. Rev., 137:A1441–A1443, Mar 1965. doi:
[PGW12]    Mohandas Pillai, Joshua Goglio, and Thad G. Walker. Matrix
           numerov method for solving schrödinger’s equation. Amer-
           ican Journal of Physics, 80(11):1017–1019, 2012. doi:
[PPF+ 11]  S. Pittalis, C. R. Proetto, A. Floris, A. Sanna, C. Bersier,
           K. Burke, and E. K. U. Gross. Exact conditions in finite-
46                                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

           Automatic random variate generation in Python
                                                          Christoph Baumgarten‡∗ , Tirth Patel


Abstract—The generation of random variates is an important tool that is required in many applications. Various software programs or packages contain generators for standard distributions like the normal, exponential or Gamma, e.g., the programming language R and the packages SciPy and NumPy in Python. However, it is not uncommon that sampling from new/non-standard distributions is required. Instead of deriving specific generators in such situations, so-called automatic or black-box methods have been developed. These allow the user to generate random variates from fairly large classes of distributions by only specifying some properties of the distributions (e.g., the density and/or cumulative distribution function). In this note, we describe the implementation of such methods from the C library UNU.RAN in the Python package SciPy and provide a brief overview of the functionality.

Index Terms—numerical inversion, generation of random variates

The generation of random variates is an important tool that is required in many applications. Various software programs or packages contain generators for standard distributions, e.g., R ([R C21]) and the packages SciPy ([VGO+ 20]) and NumPy ([HMvdW+ 20]) in Python. Standard references for these algorithms are the books [Dev86], [Dag88], [Gen03], and [Knu14]. An interested reader will find many references to the vast existing literature in these works. While relying on general methods such as the rejection principle, the algorithms for well-known distributions are often specifically designed for a particular distribution. This is also the case in the module stats in SciPy, which contains more than 100 distributions, and the module random in NumPy, with more than 30 distributions. However, there are also so-called automatic or black-box methods for sampling from large classes of distributions with a single piece of code. For such algorithms, information about the distribution such as the density, potentially together with its derivative, the cumulative distribution function (CDF), and/or the mode must be provided. See [HLD04] for a comprehensive overview of these methods. Although the development of such methods was originally motivated by the need to generate variates from non-standard distributions, these universal methods have advantages that make their usage attractive even for sampling from standard distributions. We mention some of the important properties (see [LH00], [HLD04], [DHL10]):

    •   The algorithms can be used to sample from truncated distributions.
    •   For inversion methods, the structural properties of the underlying uniform random number generator are preserved and the numerical accuracy of the methods can be controlled by a parameter. Therefore, inversion is usually the only method applied for simulations using quasi-Monte Carlo (QMC) methods.
    •   Depending on the use case, one can choose between a fast setup with slow marginal generation time and vice versa.

    The latter point is important depending on the use case: if a large number of samples is required for a given distribution with fixed shape parameters, a slower setup that only has to be run once can be accepted if the marginal generation times are low. If small to moderate sample sizes are required for many different shape parameters, then it is important to have a fast setup. The former situation is referred to as the fixed-parameter case and the latter as the varying parameter case.
    Implementations of various methods are available in the C library UNU.RAN ([HL07]) and in the associated R package Runuran ([TL03]). The aim of this note is to introduce the Python implementation in the SciPy package that makes some of the key methods in UNU.RAN available to Python users in SciPy 1.8.0. These general tools can be seen as a complement to the existing specific sampling methods: they might lead to better performance in specific situations compared to the existing generators, e.g., if a very large number of samples is required for a fixed parameter of a distribution or if the implemented sampling method relies on a slow default that is based on numerical inversion of the CDF. For advanced users, they also offer various options that allow fine-tuning of the generators (e.g., to control the time needed for the setup step).

Automatic algorithms in SciPy

Many of the automatic algorithms described in [HLD04] and [DHL10] are implemented in the ANSI C library UNU.RAN (Universal Non-Uniform RANdom variate generators). Our goal was to provide a Python interface to the most important methods from UNU.RAN to generate univariate discrete and continuous non-uniform random variates. The following generators have been implemented in SciPy 1.8.0:

    •   TransformedDensityRejection: Transformed Density Rejection (TDR) ([Hör95], [GW92])
    •   NumericalInverseHermite: Hermite interpolation based INVersion of CDF (HINV) ([HL03])
    •   NumericalInversePolynomial: Polynomial interpolation based INVersion of CDF (PINV) ([DHL10])

* Corresponding author:
‡ Unaffiliated

Copyright © 2022 Christoph Baumgarten et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

    •   SimpleRatioUniforms: Simple Ratio-Of-Uniforms (SROU) ([Ley01], [Ley03])
    •   DiscreteGuideTable: (Discrete) Guide Table method (DGT) ([CA74])
    •   DiscreteAliasUrn: (Discrete) Alias-Urn method (DAU) ([Wal77])

    Before describing the SciPy interface, we give a short introduction to random variate generation.
A very brief introduction to random variate generation

It is well-known that random variates can be generated by inversion of the CDF F of a distribution: if U is a uniform random number on (0, 1), X := F^{-1}(U) is distributed according to F. Unfortunately, the inverse CDF can be expressed in closed form only for very few distributions, e.g., the exponential or Cauchy distribution. If this is not the case, one needs to rely on implementations of special functions to compute the inverse CDF for standard distributions like the normal, Gamma or beta distributions, or on numerical methods for inverting the CDF. Such procedures, however, have the disadvantage that they may be slow or inaccurate, and developing fast and robust inversion algorithms such as HINV and PINV is a non-trivial task. HINV relies on Hermite interpolation of the inverse CDF and requires the CDF and PDF as input. PINV only requires the PDF. The algorithm then computes the CDF via adaptive Gauss-Lobatto integration and an approximation of the inverse CDF using Newton's polynomial interpolation. Note that an approximation of the inverse CDF can be achieved by interpolating the points (F(x_i), x_i) for points x_i in the domain of F, i.e., no evaluation of the inverse CDF is required.
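As a concrete illustration of the inversion principle (this sketch is our own and not part of the SciPy code under discussion), the standard exponential distribution has the closed-form inverse CDF F^{-1}(u) = -log(1 - u), so variates can be generated directly from uniform numbers:

```python
import numpy as np

def sample_exponential(size, rng=None):
    # inversion: F(x) = 1 - exp(-x) on (0, inf), so F^{-1}(u) = -log(1 - u)
    rng = np.random.default_rng(12345) if rng is None else rng
    u = rng.uniform(size=size)   # U ~ Uniform(0, 1)
    return -np.log1p(-u)         # X = F^{-1}(U)

x = sample_exponential(100_000)
# the sample mean is close to 1, the mean of the standard exponential
```

The same uniform sequence always maps to the same variates, which is exactly the structure-preserving property of inversion mentioned above.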
    For discrete distributions, F is a step function. To compute the inverse CDF F^{-1}(U), the simplest idea would be to apply sequential search: if X takes values 0, 1, 2, . . . with probabilities p_0, p_1, p_2, . . . , start with j = 0 and keep incrementing j until F(j) = p_0 + · · · + p_j ≥ U. When the search terminates, X = j = F^{-1}(U). Clearly, this approach is generally very slow and more efficient methods have been developed: if X takes L distinct values, DGT realizes very fast inversion using so-called guide tables / hash tables to find the index j. In contrast, DAU is not an inversion method but uses the alias method, i.e., tables are precomputed to write X as an equi-probable mixture of L two-point distributions (the alias values).
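The sequential search just described can be written in a few lines; the following sketch (our illustration, with an arbitrary probability vector) returns the smallest j with F(j) ≥ U:

```python
import numpy as np

def sequential_search(pv, u):
    # return the smallest j with p_0 + ... + p_j >= u
    cdf, j = 0.0, 0
    while True:
        cdf += pv[j]
        if cdf >= u:
            return j
        j += 1

pv = [0.1, 0.2, 0.3, 0.4]   # probabilities of the values 0, 1, 2, 3
rng = np.random.default_rng(1)
samples = [sequential_search(pv, rng.uniform()) for _ in range(10_000)]
# the empirical frequencies approach pv as the sample size grows
```

The expected number of iterations grows with the number of support points, which is why guide tables (DGT) or the alias method (DAU) are preferred in practice.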
    The rejection method was suggested in [VN51]. In its simplest form, assume that f is a bounded density on [a, b], i.e., f(x) ≤ M for all x ∈ [a, b]. Sample two independent uniform random variates U on [0, 1] and V on [a, b] until M · U ≤ f(V). Note that the accepted points (U, V) are uniformly distributed in the region between the x-axis and the graph of the PDF. Hence, X := V has the desired distribution f. This is a special case of the general version: if f, g are two densities on an interval J such that f(x) ≤ c · g(x) for all x ∈ J and a constant c ≥ 1, sample U uniformly distributed on [0, 1] and X distributed according to g until c · U · g(X) ≤ f(X). Then X has the desired distribution f. It can be shown that the expected number of iterations before the acceptance condition is met is equal to c. Hence, the main challenge is to find hat functions g for which c is small and from which random variates can be generated efficiently. TDR solves this problem by applying a transformation T to the density such that x ↦ T(f(x)) is concave. A hat function can then be found by computing tangents at suitable design points. Note that by its nature a rejection method does not always require the same number of uniform variates to generate one non-uniform variate; this makes the use of QMC and of some variance reduction methods more difficult or impossible. On the other hand, rejection is often the fastest choice for the varying parameter case.
    The Ratio-Of-Uniforms method (ROU, [KM77]) is another general method that relies on rejection. The underlying principle is that if (U, V) is uniformly distributed on the set A_f := {(u, v) : 0 < v ≤ √(f(u/v)), a < u/v < b} where f is a PDF with support (a, b), then X := U/V follows a distribution according to f. In general, it is not possible to sample uniform values on A_f directly. However, if A_f ⊂ R := [u−, u+] × [0, v+] for finite constants u−, u+, v+, one can apply the rejection method: generate uniform values (U, V) on the bounding rectangle R until (U, V) ∈ A_f and return X = U/V. Automatic methods relying on the ROU method such as SROU and automatic ROU ([Ley00]) need a setup step to find a suitable region S ⊂ R² such that A_f ⊂ S and such that one can generate (U, V) uniformly on S efficiently.
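The simple rejection principle described above can be sketched in a few lines of Python (our illustration; the density, bounds, and constant M below are chosen purely for demonstration):

```python
import numpy as np

def rejection_sample(f, a, b, M, size, rng=None):
    # accept V ~ Uniform(a, b) whenever M * U <= f(V), with U ~ Uniform(0, 1)
    rng = np.random.default_rng(0) if rng is None else rng
    out = []
    while len(out) < size:
        v = rng.uniform(a, b)
        u = rng.uniform()
        if M * u <= f(v):
            out.append(v)
    return np.asarray(out)

# density f(x) = 0.75 * (1 - x**2) on [-1, 1]; its maximum is M = 0.75
f = lambda x: 0.75 * (1.0 - x * x)
x = rejection_sample(f, -1.0, 1.0, 0.75, 10_000)
```

Here the hat is the constant M over [a, b], so the expected number of iterations per variate is c = M · (b − a) = 1.5; TDR's tangent-based hats make c much closer to 1 for a large class of densities.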

Description of the SciPy interface

SciPy provides an object-oriented API to UNU.RAN's methods. To initialize a generator, two steps are required:

   1)   creating a distribution class and object,
   2)   initializing the generator itself.

    In step 1, a distribution object must be created that implements the required methods (e.g., pdf, cdf). This can either be a custom object or a distribution object from the classes rv_continuous or rv_discrete in SciPy. Once the generator is initialized from the distribution object, it provides a rvs method to sample random variates from the given distribution. It also provides a ppf method that approximates the inverse CDF if the initialized generator uses an inversion method. The following example illustrates how to initialize the NumericalInversePolynomial (PINV) generator for the standard normal distribution:

import numpy as np
from scipy.stats import sampling
from math import exp

# create a distribution class with an implementation
# of the PDF. Note that the normalization constant
# is not required
class StandardNormal:
    def pdf(self, x):
        return exp(-0.5 * x**2)

# create a distribution object and initialize the
# generator
dist = StandardNormal()
rng = sampling.NumericalInversePolynomial(dist)

# sample 100,000 random variates from the given
# distribution
rvs = rng.rvs(100000)

As the NumericalInversePolynomial generator uses an inversion method, it also provides a ppf method that approximates the inverse CDF:

# evaluate the approximate PPF at a few points
ppf = rng.ppf([0.1, 0.5, 0.9])

It is also easy to sample from a truncated distribution by passing a domain argument to the constructor of the generator. For example, to sample from a truncated normal distribution:

# truncate the distribution by passing a
# `domain` argument
rng = sampling.NumericalInversePolynomial(
    dist, domain=(-1, 1)
)

While the default options of the generators should work well in many situations, we point out that there are various parameters that the user can modify, e.g., to provide further information about the distribution (such as the mode or center) or to control the numerical accuracy of the approximated PPF (u_resolution). Details can be found in the SciPy documentation (doc/scipy/reference/). The above code can easily be generalized to sample from parametrized distributions using instance attributes in the distribution class. For example, to sample from the gamma distribution with shape parameter alpha, we can create the distribution class with parameters as instance attributes:

class Gamma:
    def __init__(self, alpha):
        self.alpha = alpha

    def pdf(self, x):
        return x**(self.alpha-1) * exp(-x)

    def support(self):
        return 0, np.inf

# initialize a distribution object with varying
# parameters
dist1 = Gamma(2)
dist2 = Gamma(3)

# initialize a generator for each distribution
rng1 = sampling.NumericalInversePolynomial(dist1)
rng2 = sampling.NumericalInversePolynomial(dist2)

In the above example, the support method is used to set the domain of the distribution. This can alternatively be done by passing a domain parameter to the constructor.
    In addition to continuous distributions, two UNU.RAN methods have been added in SciPy to sample from discrete distributions. In this case, the distribution can either be represented using a probability vector (which is passed to the constructor as a Python list or NumPy array) or a Python object with the implementation of the probability mass function. In the latter case, a finite domain must be passed to the constructor or the object should implement the support method¹.

# Probability vector to represent a discrete
# distribution. Note that the probability vector
# need not be normalized
pv = [0.1, 9.0, 2.9, 3.4, 0.3]

# PCG64 uniform RNG with seed 123
urng = np.random.default_rng(123)
rng = sampling.DiscreteAliasUrn(
    pv, random_state=urng
)

# sample from the given discrete distribution
rvs = rng.rvs(100000)
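The PMF-based variant can be sketched as follows (our illustration; the Binomial class and its parameters are ours, not from SciPy). The support method supplies the required finite domain:

```python
import numpy as np
from math import comb
from scipy.stats import sampling

# hypothetical example: Binomial(10, 0.3) represented by its PMF
class Binomial:
    def __init__(self, n, p):
        self.n, self.p = n, p

    def pmf(self, k):
        k = int(k)  # UNU.RAN evaluates the PMF pointwise
        return comb(self.n, k) * self.p**k * (1 - self.p)**(self.n - k)

    def support(self):
        # finite domain required for PMF-based discrete distributions
        return 0, self.n

urng = np.random.default_rng(123)
rng = sampling.DiscreteAliasUrn(Binomial(10, 0.3), random_state=urng)
rvs = rng.rvs(10_000)
```

As with the probability vector, the PMF need not be normalized, since the setup computes the required tables from the relative weights.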
Underlying uniform pseudo-random number generators

NumPy provides several generators for uniform pseudo-random numbers². It is highly recommended to use NumPy's default random number generator np.random.PCG64 for better speed and performance, see [O'N14] and reference/random/bit_generators/index.html in the NumPy documentation. To change the uniform random number generator, a random_state parameter can be passed as shown in the example below:

# 64-bit PCG random number generator in NumPy
urng = np.random.Generator(np.random.PCG64())
# The above line can also be replaced by:
# ``urng = np.random.default_rng()``
# as PCG64 is the default generator starting
# from NumPy 1.19.0

# change the uniform random number generator by
# passing the `random_state` argument
rng = sampling.NumericalInversePolynomial(
    dist, random_state=urng
)

We also point out that the PPF of inversion methods can be applied to sequences of quasi-random numbers. SciPy provides different sequences in its QMC module (scipy.stats.qmc). NumericalInverseHermite provides a qrvs method which generates random variates using QMC methods present in SciPy (scipy.stats.qmc) as uniform random number generators³. The next example illustrates how to use qrvs with a generator created directly from a SciPy distribution object.

from scipy import stats
from scipy.stats import qmc

# 1D Halton sequence generator.
qrng = qmc.Halton(d=1)

rng = sampling.NumericalInverseHermite(stats.norm())

# generate quasi-random numbers using the Halton
# sequence as uniform variates
qrvs = rng.qrvs(size=100, qmc_engine=qrng)

  1. Support for discrete distributions with infinite domain hasn't been added yet.
  2. By default, NumPy's legacy random number generator, MT19937 (np.random.RandomState()), is used as the uniform random number generator for consistency with the stats module in SciPy.
  3. In SciPy 1.9.0, qrvs will be added to NumericalInversePolynomial.

Benchmarking

To analyze the performance of the implementation, we tested the methods applied to several standard distributions against the generators in NumPy and the original UNU.RAN C library. In addition, we selected one non-standard distribution to demonstrate that substantial reductions in the runtime can be achieved compared to other implementations. All the benchmarks were carried out using NumPy 1.22.4 and SciPy 1.8.1 running on a single core on Ubuntu 20.04.3 LTS with an Intel(R) Core(TM) i7-8750H CPU (2.20GHz clock speed, 16GB RAM). We ran the benchmarks with NumPy's MT19937 (Mersenne Twister) and PCG64 random number generators (np.random.MT19937 and np.random.PCG64) in Python and used NumPy's C implementation of MT19937 in the UNU.RAN C benchmarks. As explained above, the use of PCG64 is recommended, and MT19937 is only included to compare the speed of the Python implementation and the C library by relying on the same uniform number generator (i.e., differences in the performance of the uniform number generation are not taken into account). The code for all the benchmarks can be found online.
    The methods used in NumPy to generate normal, gamma, and beta random variates are:

    •   the ziggurat algorithm ([MT00b]) to sample from the standard normal distribution,

    •   the rejection algorithms in Chapter XII.2.6 in [Dev86] if α < 1 and in [MT00a] if α > 1 for the Gamma distribution,
    •   Jöhnk's algorithm ([Jöh64], Section IX.3.5 in [Dev86]) if max{α, β} ≤ 1, otherwise a ratio of two Gamma variates with shape parameters α and β (see Section IX.4.1 in [Dev86]) for the beta distribution.

Benchmarking against the normal, gamma, and beta distributions

Table 1 compares the performance for the standard normal, Gamma and beta distributions. We recall that the density of the Gamma distribution with shape parameter a > 0 is given by x ∈ (0, ∞) ↦ x^(a−1) e^(−x) / Γ(a) and the density of the beta distribution with shape parameters α, β > 0 is given by x ∈ (0, 1) ↦ x^(α−1) (1−x)^(β−1) / B(α, β), where Γ(·) and B(·, ·) are the Gamma and beta functions. The results are reported in Table 1.
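Timings of this kind can be set up with a few lines using timeit; the sketch below (our illustration, not the benchmark code used for Table 1; absolute numbers depend on hardware and library versions) compares PINV sampling with NumPy's specialized normal generator:

```python
import numpy as np
from math import exp
from timeit import timeit
from scipy.stats import sampling

class StandardNormal:
    def pdf(self, x):
        return exp(-0.5 * x**2)

urng = np.random.default_rng()
pinv = sampling.NumericalInversePolynomial(StandardNormal(), random_state=urng)

n = 100_000
t_pinv = timeit(lambda: pinv.rvs(n), number=5)                # PINV sampling
t_numpy = timeit(lambda: urng.standard_normal(n), number=5)   # ziggurat
print(f"PINV: {t_pinv:.4f}s, NumPy: {t_numpy:.4f}s")
```

Note that the setup cost of PINV is excluded here, which matches the fixed-parameter case where the generator is built once and then reused.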
    We summarize our main observations:

   1)   The setup step in Python is substantially slower than in C due to expensive Python callbacks, especially for PINV and HINV. However, the time taken for the setup is low compared to the sampling time if large samples are drawn. Note that, as expected, SROU has a very fast setup such that this method is suitable for the varying parameter case.
   2)   The sampling time in Python is slightly higher than in C for the MT19937 random number generator. If the recommended PCG64 generator is used, the sampling time in Python is slightly lower. The only exception is SROU: due to Python callbacks, the performance is substantially slower than in C. However, as the main advantage of SROU is the fast setup time, the main use case is the varying parameter case (i.e., the method is not supposed to be used to generate large samples).
   3)   PINV, HINV, and TDR are at most about 2x slower than the specialized NumPy implementation for the normal distribution. For the Gamma and beta distributions, they even perform better for some of the chosen shape parameters. These results underline the strong performance of these black-box approaches even for standard distributions.
   4)   While the application of PINV requires bounded densities, no issues are encountered for α = 0.05 since the unbounded part is cut off by the algorithm. However, the setup can fail for very small values of α.

70-200 times faster. This clearly shows the benefit of using a black-box algorithm.

Conclusion

The interface to UNU.RAN in SciPy provides easy access to different algorithms for non-uniform variate generation for large classes of univariate continuous and discrete distributions. We have shown that the methods are easy to use and that the algorithms perform very well both for standard and non-standard distributions. A comprehensive documentation suite, a tutorial and many examples are available at scipy/reference/stats.sampling.html and scipy/tutorial/stats/sampling.html in the SciPy documentation. Various methods have been implemented in SciPy, and if specific use cases require additional functionality from UNU.RAN, the methods can easily be added to SciPy given the flexible framework that has been developed. Another area of further development is to better integrate SciPy's QMC generators for the inversion methods.
    Finally, we point out that other sampling methods like Markov Chain Monte Carlo and copula methods are not part of SciPy. Relevant Python packages in that context are PyMC ([PHF10]), PyStan relying on Stan ([Tea21]), Copulas, and PyCopula.

Acknowledgments

The authors wish to thank Wolfgang Hörmann and Josef Leydold for agreeing to publish the library under a BSD license and for helpful feedback on the implementation and this note. In addition, we thank Ralf Gommers, Matt Haberland, Nicholas McKibben, Pamphile Roy, and Kai Striega for their code contributions, reviews, and helpful suggestions. The second author was supported by the Google Summer of Code 2021 program⁵.

REFERENCES

[CA74]     Hui-Chuan Chen and Yoshinori Asau. On generating random variates from an empirical distribution. AIIE Transactions, 6(2):163–166, 1974. doi:10.1080/05695557408974949.
[Dag88]    John Dagpunar. Principles of random variate generation. Oxford University Press, USA, 1988.
[Dev86]    Luc Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986. doi:10.1007/978-1-4613-8643-8.
[DHL10]    Gerhard Derflinger, Wolfgang Hörmann, and Josef Leydold. Random variate generation by numerical inversion when only the density is known. ACM Transactions on Modeling and
Benchmarking against a non-standard distribution                                        Computer Simulation (TOMACS), 20(4):1–25, 2010. doi:
We benchmark the performance of PINV to sample from the                                 10.1145/1842722.1842723.
                                                                         [Gen03]        James E Gentle. Random number generation and Monte Carlo
generalized normal distribution    ([Sub23]) whose density is given
                            p                                                           methods, volume 381. Springer, 2003. doi:10.1007/
by x ∈ (−∞, ∞) 7→ 2Γ(1/p)     against the method proposed in [NP09]                     b97336.
and against the implementation in SciPy’s gennorm distribu-              [GW92]         Walter R Gilks and Pascal Wild. Adaptive rejection sampling
                                                                                        for Gibbs sampling. Journal of the Royal Statistical Society:
tion. The approach in [NP09] relies on transforming Gamma                               Series C (Applied Statistics), 41(2):337–348, 1992. doi:10.
variates to the generalized normal distribution whereas SciPy                           2307/2347565.
relies on computing the inverse of CDF of the Gamma distri-              [H9̈5]         Wolfgang Hörmann. A rejection technique for sampling from
bution (                     T-concave distributions. ACM Trans. Math. Softw., 21(2):182–
                                                                                        193, 1995. doi:10.1145/203082.203089.
special.gammainccinv.html). The results for different values of p        [HL03]         Wolfgang Hörmann and Josef Leydold. Continuous random
are shown in Table 2.                                                                   variate generation by fast numerical inversion. ACM Trans-
    PINV is usually about twice as fast than the special-                               actions on Modeling and Computer Simulation (TOMACS),
                                                                                        13(4):347–362, 2003. doi:10.1145/945511.945517.
ized method and about 15-150 times faster than SciPy’s
implementation4 . We also found an R package pgnorm (https:                 4. In SciPy 1.9.0, the speed will be improved by implementing the method
// that implements vari-         from [NP09]
ous approaches from [KR13]. In that case, PINV is usually about             5.
50                                                                                                      PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

                                                           Python                                            C
    Distribution        Method        Setup    Sampling (PCG64)    Sampling (MT19937)      Setup    Sampling (MT19937)

    Standard normal     PINV           4.6          29.6                 36.5               0.27          32.4
                        HINV           2.5          33.7                 40.9               0.38          36.8
                        TDR            0.2          37.3                 47.8               0.02          41.4
                        SROU         8.7 µs         2510                 2160              0.5 µs          232
                        NumPy           -           17.6                 22.4                 -              -

    Gamma(0.05)         PINV         196.0          29.8                 37.2              37.9           32.5
                        HINV          24.5          36.1                 43.8               1.9           40.7
                        NumPy           -           55.0                 68.1                 -              -

    Gamma(0.5)          PINV          16.5          31.2                 38.6               2.0           34.5
                        HINV           4.9          34.2                 41.7               0.6           37.9
                        NumPy           -           86.4                 99.2                 -              -

                        PINV           5.3          30.8                 38.7               0.5           34.6
                        HINV           5.3          33                   40.6               0.4           36.8
                        TDR            0.2          38.8                 49.6              0.03            44
                        NumPy           -           36.5                 47.1                 -              -

    Beta(0.5, 0.5)      PINV          21.4          33.1                 39.9               2.4           37.3
                        HINV           2.1          38.4                 45.3               0.2            42
                        NumPy           -           101                  112                  -              -

    Beta(0.5, 1.0)      HINV           0.2          37                   44.3              0.01           41.1
                        NumPy           -           125                  138                  -              -

    Beta(1.3, 1.2)      PINV          15.7          30.5                 37.2               1.7           34.3
                        HINV           4.1          33.4                 40.8               0.4           37.1
                        TDR            0.2          46.8                 57.8              0.03            45
                        NumPy           -           74.3                 97                   -              -

    Beta(3.0, 2.0)      PINV           9.7          30.2                 38.2               0.9           33.8
                        HINV           5.8          33.7                 41.2               0.4           37.4
                        TDR            0.2          42.8                 52.8              0.02            44
                        NumPy           -           72.6                 92.8                 -              -

                                                          TABLE 1
Average time taken (reported in milliseconds, unless mentioned otherwise) to sample 1 million random variates from the listed
distributions. The mean is computed over 7 iterations. Standard deviations are not reported as they were very small (less than 1%
of the mean in the large majority of cases). Note that not all methods can always be applied, e.g., TDR cannot be applied to the
Gamma distribution if the shape parameter a < 1 since the PDF is not log-concave in that case. As NumPy uses rejection
algorithms with precomputed constants, no setup time is reported.
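The Python columns in Table 1 are obtained through the scipy.stats.sampling interface (available since SciPy 1.8). The following is a minimal sketch, not the authors' benchmark script, of how the setup and sampling phases are separated for the standard-normal row; the StdNorm class and the seeds are our own illustrative choices:

```python
import numpy as np
from scipy.stats.sampling import NumericalInversePolynomial

# A custom distribution only needs to expose a pdf; the normalization
# constant may be dropped, as PINV does not require a normalized density.
class StdNorm:
    def pdf(self, x):
        return np.exp(-0.5 * x * x)

# Setup phase: builds the polynomial approximation of the inverse CDF.
# This runs Python callbacks and is the expensive step measured in Table 1.
rng_pcg = NumericalInversePolynomial(StdNorm(),
                                     random_state=np.random.default_rng(12345))

# Sampling phase: cheap inverse-CDF evaluations, reusable for many draws.
sample = rng_pcg.rvs(1_000_000)

# The uniform bit generator can be swapped, e.g. the legacy MT19937
# instead of the default PCG64, to reproduce the generator comparison.
mt = np.random.Generator(np.random.MT19937(12345))
rng_mt = NumericalInversePolynomial(StdNorm(), random_state=mt)
sample_mt = rng_mt.rvs(1_000_000)
```

Timing the constructor call and the rvs call separately (e.g. with timeit) reproduces the setup/sampling split reported in the table.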

    p                                       0.25    0.45    0.75      1     1.5      2      5      8

    Nardon and Pianca (2009)                 100     101     101     45     148    120    128    122
    SciPy's gennorm distribution             832    1000    1110    559    5240   6720   6230   5950
    Python (PINV method, PCG64 urng)          50      47      45     41      40     37     38     38

                                                          TABLE 2
Comparing SciPy's implementation and a specialized method against PINV to sample 1 million variates from the generalized normal
distribution for different values of the parameter p. Time reported in milliseconds. The mean is computed over 7 iterations.
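A comparison like the one in Table 2 can be set up entirely with public SciPy APIs. The sketch below, under our own choice of p, seed, and sample size (the timings in Table 2 come from the authors' benchmark, not from this snippet), builds a PINV sampler from the unnormalized generalized-normal density and draws from SciPy's gennorm for reference:

```python
import numpy as np
from math import gamma
from scipy.stats import gennorm
from scipy.stats.sampling import NumericalInversePolynomial

class GenNorm:
    """Generalized normal (Subbotin) distribution with shape parameter p."""
    def __init__(self, p):
        self.p = p

    def pdf(self, x):
        # Density is p / (2 * Gamma(1/p)) * exp(-|x|**p); the constant
        # factor is omitted since PINV accepts unnormalized densities.
        return np.exp(-np.abs(x) ** self.p)

p = 1.5
urng = np.random.default_rng(98765)

# Black-box PINV sampler built only from the density.
sampler = NumericalInversePolynomial(GenNorm(p), random_state=urng)
x_pinv = sampler.rvs(100_000)

# SciPy's own sampler for the same distribution, for comparison.
x_scipy = gennorm.rvs(p, size=100_000, random_state=urng)
```

Both samples should agree in distribution; for example, their variances should both be close to the exact value Γ(3/p)/Γ(1/p).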

[HL07]      Wolfgang Hörmann and Josef Leydold. UNU.RAN - Universal Non-Uniform RANdom
            number generators, 2007.
[HLD04]     Wolfgang Hörmann, Josef Leydold, and Gerhard Derflinger. Automatic nonuniform
            random variate generation. Springer, 2004. doi:10.1007/978-3-662-05946-3.
[HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers,
            Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg,
            Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van
            Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe,
            Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren
            Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array
            programming with NumPy. Nature, 585(7825):357–362, 2020.
            doi:10.1038/s41586-020-2649-2.
[Jöh64]     MD Jöhnk. Erzeugung von betaverteilten und gammaverteilten Zufallszahlen.
            Metrika, 8(1):5–15, 1964. doi:10.1007/bf02613706.
[KM77]      Albert J Kinderman and John F Monahan. Computer generation of random
            variables using the ratio of uniform deviates. ACM Transactions on
            Mathematical Software (TOMS), 3(3):257–260, 1977. doi:10.1145/355744.355750.
[Knu14]     Donald E Knuth. The Art of Computer Programming, Volume 2: Seminumerical
            algorithms. Addison-Wesley Professional, 2014. doi:10.2307/2317055.
[KR13]      Steve Kalke and W-D Richter. Simulation of the p-generalized Gaussian
            distribution. Journal of Statistical Computation and Simulation,
            83(4):641–667, 2013. doi:10.1080/00949655.2011.631187.
[Ley00]     Josef Leydold. Automatic sampling with the ratio-of-uniforms method. ACM
            Transactions on Mathematical Software (TOMS), 26(1):78–98, 2000.
            doi:10.1145/347837.347863.
[Ley01]     Josef Leydold. A simple universal generator for continuous and discrete
            univariate T-concave distributions. ACM Transactions on Mathematical Software
            (TOMS), 27(1):66–82, 2001. doi:10.1145/382043.382322.
[Ley03]     Josef Leydold. Short universal generators via generalized ratio-of-uniforms
            method. Mathematics of Computation, 72(243):1453–1471, 2003.
            doi:10.1090/s0025-5718-03-01511-4.
AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON                                  51

[LH00]       Josef Leydold and Wolfgang Hörmann. Universal algorithms
             as an alternative for generating non-uniform continuous ran-
             dom variates. In Proceedings of the International Conference
             on Monte Carlo Simulation 2000., pages 177–183, 2000.
[MT00a]      George Marsaglia and Wai Wan Tsang. A simple method for
             generating gamma variables. ACM Transactions on Math-
             ematical Software (TOMS), 26(3):363–372, 2000.
[MT00b]      George Marsaglia and Wai Wan Tsang. The ziggurat method
             for generating random variables. Journal of statistical soft-
             ware, 5(1):1–7, 2000. doi:10.18637/jss.v005.i08.
[NP09]       Martina Nardon and Paolo Pianca. Simulation techniques
             for generalized Gaussian densities. Journal of Statistical
             Computation and Simulation, 79(11):1317–1329, 2009.
[O’N14]      Melissa E. O’Neill. PCG: A family of simple fast space-
             efficient statistically good algorithms for random number gen-
             eration. Technical Report HMC-CS-2014-0905, Harvey Mudd
             College, Claremont, CA, September 2014.
[PHF10]      Anand Patil, David Huard, and Christopher J Fonnesbeck.
             PyMC: Bayesian stochastic modelling in Python. Journal of
             Statistical Software, 35(4):1, 2010. doi:10.18637/jss.
[R C21]      R Core Team. R: A language and environment for statistical
             computing, 2021.
[Sub23]      M.T. Subbotin. On the law of frequency of error. Mat. Sbornik,
             31(2):296–301, 1923.
[Tea21]      Stan Development Team. Stan modeling language users guide
             and reference manual, version 2.28., 2021.
[TL03]       Günter Tirler and Josef Leydold. Automatic non-uniform
             random variate generation in R. In Proceedings of DSC, page 2, 2003.
[VGO+ 20]    Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt
             Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski,
             Pearu Peterson, Warren Weckesser, Jonathan Bright, et al.
             SciPy 1.0: fundamental algorithms for scientific computing in
             Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
[VN51]       John Von Neumann. Various techniques used in connection
             with random digits. Appl. Math Ser, 12(36-38):3, 1951.
[Wal77]      Alastair J Walker. An efficient method for generating discrete
             random variables with general distributions. ACM Transac-
             tions on Mathematical Software (TOMS), 3(3):253–256, 1977.

     Utilizing SciPy and other open source packages to provide a powerful API for
                materials manipulation in the Schrödinger Materials Suite
                                            Alexandr Fonari‡∗ , Farshad Fallah‡ , Michael Rauch‡


Abstract—The use of several open source scientific packages in the Schrödinger Materials
Science Suite will be discussed. A typical workflow for materials discovery will be
described, discussing how open source packages have been incorporated at every stage.
Some recent implementations of machine learning for materials discovery will be
discussed, as well as how open source packages were leveraged to achieve results faster
and more efficiently.

Index Terms—materials, active learning, OLED, deposition, evaporation

* Corresponding author:
‡ Schrödinger Inc., 1540 Broadway, 24th Floor. New York, NY 10036

Copyright © 2022 Alexandr Fonari et al. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits unrestricted use,
distribution, and reproduction in any medium, provided the original author and source are
credited.

Introduction

A common materials discovery practice or workflow is to start with reading an
experimental structure of a material or generating a structure in silico, computing its
properties of interest (e.g. elastic constants, electrical conductivity), tuning the
material by modifying its structure (e.g. doping) or adding and removing atoms
(deposition, evaporation), and then recomputing the properties of the modified material
(Figure 1). Computational materials discovery leverages such workflows to empower
researchers to explore vast design spaces and uncover root causes without (or in
conjunction with) laboratory experimentation.
    Software tools for computational materials discovery can be facilitated by utilizing
existing libraries that cover the fundamental mathematics used in the calculations in an
optimized fashion. This use of existing libraries allows developers to devote more time
to developing new features instead of re-inventing established methods. As a result, such
a complementary approach improves the performance of computational materials software and
reduces overall maintenance.
    The Schrödinger Materials Science Suite [LLC22] is a proprietary computational
chemistry/physics platform that streamlines materials discovery workflows into a single
graphical user interface (Materials Science Maestro). The interface is a single portal
for structure building and enumeration, physics-based modeling and machine learning,
visualization and analysis. Tying together the various modules are a wide variety of
scientific packages, some of which are proprietary to Schrödinger, Inc., some of which
are open-source, and many of which blend the two to optimize capabilities and efficiency.
For example, the main simulation engine for molecular quantum mechanics is the Jaguar
[BHH+ 13] proprietary code. The proprietary classical molecular dynamics code Desmond
(distributed by Schrödinger, Inc.) [SGB+ 14] is used to obtain physical properties of
soft materials, surfaces and polymers. For periodic quantum mechanics, the main
simulation engine is the open source code Quantum ESPRESSO (QE) [GAB+ 17]. One of the
co-authors of this paper (A. Fonari) contributes to the QE code in order to make
integration with the Materials Suite more seamless and less error-prone. As part of this
integration, support for using the portable XML format for input and output in QE has
been implemented in the open source Python package qeschema [BDBF].
    Figure 2 gives an overview of some of the various products that compose the
Schrödinger Materials Science Suite. The various workflows are implemented mainly in
Python (some of them described below), calling on proprietary or open-source code where
appropriate, to improve the performance of the software and reduce overall maintenance.
    The materials discovery cycle can be run in a high-throughput manner, enumerating
different structure modifications in a systematic fashion, such as doping ratio in a
semiconductor or depositing different adsorbates. As we will detail herein, there are
several open source packages that allow the user to generate a large number of
structures, run calculations in a high-throughput manner and analyze the results. For
example, the open source package pymatgen [ORJ+ 13] facilitates generation and analysis
of periodic structures. It can generate inputs for and read outputs of QE, the commercial
codes VASP and Gaussian, and several other formats. To run and manage workflow jobs in a
high-throughput manner, open source packages such as Custodian [ORJ+ 13] and AiiDA
[HZU+ 20] can be used.

Materials import and generation

For reading and writing of material structures, several open source packages (e.g.
OpenBabel [OBJ+ 11], RDKit [LTK+ 22]) have implemented functionality for working with
several commonly used formats (e.g. CIF, PDB, mol, xyz). Periodic structures of
materials, mainly coming from single crystal X-ray/neutron diffraction experiments, are
distributed in CIF (Crystallographic Information File), PDB (Protein Data Bank) and
lately mmCIF

                                  Fig. 1: Example of a workflow for computational materials discovery.

                          Fig. 2: Some example products that compose the Schrödinger Materials Science Suite.

formats [WF05]. Correctly reading experimental structures is of significant importance,
since the rest of the materials discovery workflow depends on it. In addition to atom
coordinates and periodic cell information, structural data also contains symmetry
operations (listed explicitly or by means of providing a space group) that can be used to
decrease the number of computations required for a particular system by accounting for
symmetry. This can be important, especially when scaling high-throughput calculations.
From file, a structure is read into a structure object through which atomic coordinates
(as a NumPy array) and chemical information of the material can be accessed and updated.
The structure object is similar to the one implemented in open source packages such as
pymatgen [ORJ+ 13] and ASE [LMB+ 17]. All the structure manipulations during the
workflows are done by using the structure object interface (see structure deformation
example below). Example of the Structure object definition in pymatgen:

class Structure:

    def __init__(self, lattice, species, coords, ...):
        """Create a periodic structure."""

One consideration of note is that the PDB, CIF and mmCIF structure formats allow
description of positional disorder (for example, a solvent molecule without a stable
position within the cell, which can be described by multiple sets of coordinates).
Another complication is that experimental data spans an interval of almost a century: one
of the oldest crystal structures deposited in the Cambridge Structural Database (CSD)
[GBLW16] dates to 1924 [HM24]. These nuances and others present nontrivial technical
challenges for developers. Thus, it has been a continuous effort by Schrödinger, Inc. (at
least 39 commits and several weeks of work went into this project) and others to
correctly read and convert periodic structures in OpenBabel. By version 3.1.1 (the most
recent at the time of writing), the authors are not aware of any structures read
incorrectly by OpenBabel. In general, non-periodic molecular formats are simpler to
handle because they only contain atom coordinates but no cell or symmetry information.
OpenBabel has Python bindings, but due to the GPL license limitation it is called as a
subprocess from the Schrödinger Materials Suite.
    Another important consideration in structure generation is modeling of substitutional
disorder in solid alloys and materials with point defects (intermetallics,
semiconductors, oxides and their crystalline surfaces). In such cases, the unit cell and
atomic sites of the crystal or surface slab are well defined while the chemical species
occupying the site may vary. In order to simulate substitutional disorder, one must
generate the ensemble of structures that includes all statistically significant atomic
distributions in a given unit cell. This can be achieved by a brute-force enumeration of
all symmetrically unique atomic structures with a given number of vacancies, impurities
or solute atoms. The open source library enumlib [HF08] implements algorithms for such a
systematic enumeration of periodic structures. The enumlib package consists of several
Fortran binaries and Python scripts that can be run as a subprocess (no Python bindings).
This allows the user to generate a large set of symmetrically nonequivalent materials
with different compositions (e.g. doping or defect concentration).
    Recently, we applied this approach in a simultaneous study of the activity and
stability of Pt-based core-shell type catalysts for the oxygen reduction reaction
[MGF+ 19]. We generated a set of stable doped Pt/transition metal/nitrogen surfaces using
periodic enumeration. Using QE to perform periodic density functional

theory (DFT) calculations, we assessed surface phase diagrams for Pt alloys and
identified avenues for stabilizing the cost-effective core-shell systems by a judicious
choice of the catalyst core material. Such catalysts may prove critical in
electrocatalysis for fuel cell applications.

           Fig. 3: Example of the job submission process.

Workflow capabilities

In the last section, we briefly described a complete workflow from structure generation
and enumeration to periodic DFT calculations to analysis. In order to be able to run a
massively parallel screening of materials, a highly scalable and stable queuing system
(job scheduler) is required. We have implemented a job queuing system on top of the most
used queuing systems (LSF, PBS, SGE, SLURM, TORQUE, UGE) and exposed a Python API to
submit and monitor jobs. In line with technological advancements, the cloud is also
supported by means of a virtual cluster configured with SLURM. This allows the user to
submit a large number of jobs, limited only by SLURM scheduling capabilities and cloud
resources. In order to accommodate job dependencies in workflows, a parent job (or
multiple parent jobs) can be defined for each job, forming a directed graph of jobs
(Figure 3).
    There could be several reasons for a job to fail. Depending on the reason of failure,
there are several restart and recovery

Jaguar that took 457,265 CPU hours (~52 years) [MAS+ 20]. Another similar case study is
the high-throughput molecular dynamics (MD) simulations of thermophysical properties of
polymers for various applications [ABG+ 21]. There, using Desmond, we computed the glass
transition temperature (Tg) of 315 polymers and compared the results with experimental
measurements [Bic02]. This study took advantage of GPU (graphics processing unit) support
as implemented in Desmond, as well as the job scheduler API described above.
    Other workflows implemented in the Schrödinger Materials Science Suite utilize open
source packages as well. For soft materials (polymers, organic small molecules and
substrates composed of soft molecules), convex hull and related mathematical methods are
important for finding possible accessible solvent voids (during submerging or sorption)
and adsorbate sites (during molecular deposition). These methods are conveniently
implemented in the open source SciPy [VGO+ 20] and NumPy [HMvdW+ 20] packages. Thus, we
implemented molecular deposition and evaporation workflows by using the Desmond MD engine
as the backend in tandem with the convex hull functionality. This workflow enables
simulation of the deposition and evaporation of small molecules on a substrate. We
utilized the aforementioned deposition workflow in the study of organic light-emitting
diodes (OLEDs), which are fabricated using a stepwise process, where new layers are
deposited on top of previous layers. Both vacuum and solution deposition processes have
been used to prepare these films, primarily as amorphous thin-film active layers lacking
long-range order. Each of these deposition techniques introduces changes to the film
structure and, consequently, different charge-transfer and luminescent properties
[WKB+ 22].
    As can be seen from the above, a workflow is usually some sort of structure
modification through the structure object with a subsequent call to a backend code and
analysis of its output if it succeeds. In some workflows, the input for the next
iteration depends on the output of the previous iteration. Due to the large chemical and
manipulation space of the materials, it is sometimes very tricky to keep the code for all
workflows following the same logic. For every workflow and/or functionality in the
Materials Science Suite, some sort of peer-reviewed material (publication, conference
presentation) is created where implemented algorithms
mechanisms in place. The lowest level is the restart mechanism         are described to facilitate reproducibility.
(in SLURM it is called requeue) which is performed by the
queuing system itself. This is triggered when a node goes down.
                                                                       Data fitting algorithms and use cases
On the cloud, preemptible instances (nodes) can go offline at any
moment. In addition, workflows implemented in the proprietary          Materials simulation engines for QM, periodic DFT, and classical
Schrödinger Materials Science Suite have built-in methods for          MD (referred to herein as backends) are frequently written in
handling various types of failure. For example, if the simulation      compiled languages with enabled parallelization for CPU or GPU
is not converging to a requested energy accuracy, it is wasteful       hardware. These backends are called from Python workflows
to blindly restart the calculation without changing some input         using the job queuing systems described above. Meanwhile, pack-
parameters. However, in the case of a failure due to full disk         ages such as SciPy and NumPy provide sophisticated numerical
space, it is reasonable to try restart with hopes to get a node with   function optimization and fitting capabilities. Here, we describe
more empty disk space. If a job fails (and cannot be restarted),       examples of how the Schrödinger suite can be used to combine
all its children (if any) will not start, thus saving queuing and      materials simulations with popular optimization routines in the
computational time.                                                    SciPy ecosystem.
    Having developed robust systems for running calculations, job          Recently      we     implemented       convex     analysis   of
queuing and troubleshooting (autonomously, when applicable),           the stress strain curve (as described here [PKD18]).
the developed workflows have allowed us and our customers to           scipy.optimize.minimize is used for a constrained
perform massive screenings of materials and their properties. For      minimization with boundary conditions of a function related to
example, we reported a massive screening of 250,000 charge-            the stress strain curve. The stress strain curve is obtained from a
conducting organic materials, totaling approximately 3,619,000         series of MD simulations on deformed cells (cell deformations
DFT SCF (self-consistent field) single-molecule calculations using     are defined by strain type and deformation step). The pressure

tensor of a deformed cell is related to stress. This analysis allowed prediction of the elongation at yield for high-density polyethylene polymer. Figure 4 shows a calculated yield of 10%, versus an experimental value within the 9-18% range [BAS+ 20].

The scipy.optimize package is used for a least-squares fit of the bulk energies at different cell volumes (compressed and expanded) in order to obtain the bulk modulus and equation of state (EOS) of a material. In the Schrödinger suite this was implemented as a part of an EOS workflow, in which fitting is performed on the results obtained from a series of QE calculations performed on the original as well as compressed and expanded (deformed) cells. An example of deformation applied to a structure in pymatgen:

from pymatgen.analysis.elasticity import strain
from pymatgen.core import lattice
from pymatgen.core import structure

deform = strain.Deformation([
    [1.0, 0.02, 0.02],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0]])

latt = lattice.Lattice([
    [3.84, 0.00, 0.00],
    [1.92, 3.326, 0.00],
    [0.00, -2.22, 3.14]])

st = structure.Structure(
    latt,
    ["Si", "Si"],
    [[0, 0, 0], [0.75, 0.5, 0.75]])

strained_st = deform.apply_to_structure(st)

This is also an example of loosely coupled (embarrassingly parallel) jobs. In particular, calculations of the deformed cells only depend on the bulk calculation and do not depend on each other. Thus, all the deformation jobs can be submitted in parallel, facilitating high-throughput runs.

Structure refinement from a powder diffraction experiment is another example where more complex optimization is used. Powder diffraction is a widely used method in drug discovery to assess the purity of a material and discover known or unknown crystal polymorphs [KBD+ 21]. In particular, there is interest in fitting the experimental powder diffraction intensity peaks to the indexed peaks (Pawley refinement) [JPS92]. Here we employed the open source lmfit package [NSA+ 16] to perform a minimization of the multivariable Voigt-like function that represents the entire diffraction spectrum. This allows the user to refine (optimize) unit cell parameters coming from the indexing data and, as a result, the goodness of fit (R-factor) between the experimental and simulated spectra is minimized.

Machine learning techniques

Of late, there is great interest in machine-learning-assisted materials discovery, which requires several components. In order to train a model, benchmark data from simulation and/or experiment is required. Besides benchmark data, computation of the relevant descriptors is required (see below). Finally, a model based on the benchmark data and descriptors is generated that allows prediction of properties for novel materials. There are several techniques to generate the model, ranging from linear or non-linear fitting to neural networks. Tools include the open source DeepChem [REW+ 19] and AutoQSAR [DDS+ 16] from the Schrödinger suite. Depending on the type of materials, benchmark data can be obtained using different codes available in the Schrödinger suite:

    •   small molecules and finite systems - Jaguar
    •   periodic systems - Quantum ESPRESSO
    •   larger polymeric and similar systems - Desmond

Different materials systems require different descriptors for featurization. For example, for crystalline periodic systems, we have implemented several sets of tailored descriptors. Generation of these descriptors again uses a mix of open source and Schrödinger proprietary tools. Specifically:

    •   elemental features such as atomic weight, number of valence electrons in s, p and d-shells, and electronegativity
    •   structural features such as density, volume per atom, and packing fraction descriptors implemented in the open source matminer package [WDF+ 18]
    •   intercalation descriptors such as cation and anion counts, crystal packing fraction, and average neighbor ionicity [SYC+ 17] implemented in the Schrödinger suite
    •   three-dimensional smooth overlap of atomic positions (SOAP) descriptors implemented in the open source DScribe package [HJM+ 20].

We are currently training models that use these descriptors to predict properties, such as bulk modulus, of a set of Li-containing battery-related compounds [Cha]. Several models will be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+ 11]) and AutoQSAR.

For isolated small molecules and extended non-periodic systems, RDKit can be used to generate a large number of atomic and molecular descriptors. A lot of effort has been devoted to ensuring that RDKit can be used on the wide variety of materials that are supported by the Schrödinger suite. At the time of writing, the 4th most active contributor to RDKit is Ricardo Rodriguez-Schmidt from Schrödinger [RDK].

Recently, active learning (AL) combined with DFT has received much attention to address the challenge of leveraging exhaustive libraries in materials informatics [VPB21], [SPA+ 19]. On our side, we have implemented a workflow that employs AL for intelligent and iterative identification of promising materials candidates within a large dataset. In the framework of AL, the predicted value with its associated uncertainty is considered to decide which materials to add in each iteration, aiming to improve the model performance in the next iteration (Figure 5).

Since it could be important to consider multiple properties simultaneously in materials discovery, multiple property optimization (MPO) has also been implemented as a part of the AL workflow [KAG+ 22]. MPO allows scaling and combining multiple properties into a single score. We employed the AL workflow to determine the top candidates for the hole (positively charged carrier) transport layer (HTL) by evaluating 550 molecules in 10 iterations using DFT calculations, for a dataset of ~9,000 molecules [AKA+ 22]. The resulting model was validated by randomly picking a molecule from the dataset, computing its properties with DFT and comparing those to the predicted values. According to the semi-classical Marcus equation [Mar93], high rates of hole transfer are inversely proportional to hole reorganization energies. Thus, MPO scores were computed based on minimizing the hole reorganization energy and targeting the oxidation potential to an appropriate level to ensure a low energy barrier for hole injection from the anode
56                                                                                           PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

Fig. 4: Left: The uniaxial stress-strain curve of a polymer calculated using Desmond through the stress-strain workflow. The dark grey band indicates an inflection that marks the yield point. Right: Constant strain simulation with convex analysis indicates elongation at yield. The red curve shows simulated stress versus strain. The blue curve shows the convex analysis.
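The least-squares EOS fit mentioned in the text can be sketched with scipy.optimize.curve_fit. This is a minimal illustration, not the Schrödinger implementation: the Birch-Murnaghan energy-volume form and the synthetic data points below are assumptions standing in for the energies obtained from the QE calculations on the original and deformed cells.

```python
import numpy as np
from scipy.optimize import curve_fit


def birch_murnaghan(v, e0, v0, b0, bp):
    """Birch-Murnaghan energy-volume equation of state."""
    eta = (v0 / v) ** (2.0 / 3.0)
    return e0 + 9.0 * v0 * b0 / 16.0 * (
        (eta - 1.0) ** 3 * bp + (eta - 1.0) ** 2 * (6.0 - 4.0 * eta))


# Synthetic E(V) points standing in for total energies of the
# original, compressed and expanded cells (made-up values).
volumes = np.linspace(14.0, 22.0, 9)
energies = birch_murnaghan(volumes, -10.0, 18.0, 0.5, 4.0)

# Least-squares fit; b0 plays the role of the bulk modulus
# in whatever units the energies and volumes are given.
popt, _ = curve_fit(birch_murnaghan, volumes, energies,
                    p0=(-9.0, 17.0, 0.4, 4.0))
e0, v0, b0, bp = popt
```

In the actual EOS workflow the (volume, energy) pairs would come from the backend calculations submitted as the parallel deformation jobs described earlier.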

                      Fig. 5: Active learning workflow for the design and discovery of novel optoelectronics molecules.
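One iteration of the kind of AL loop shown in Figure 5 can be sketched in plain Python. Everything here is an illustrative stand-in: evaluate() replaces the DFT backend, a trivial nearest-neighbour lookup replaces the Random Forest model, and the greedy acquisition ignores the uncertainty term that the real workflow considers.

```python
import random


def evaluate(x):
    """Stand-in for a DFT property calculation (illustrative)."""
    return -(x - 0.3) ** 2


def predict(x, labeled):
    """Trivial 1-nearest-neighbour predictor standing in for the
    Random Forest regressor mentioned in the text."""
    nearest = min(labeled, key=lambda pt: abs(pt[0] - x))
    return nearest[1]


random.seed(0)
pool = [i / 100.0 for i in range(100)]            # candidate descriptors
labeled = [(x, evaluate(x)) for x in random.sample(pool, 5)]

for _ in range(5):                                # AL iterations
    done = {x for x, _ in labeled}
    unlabeled = [x for x in pool if x not in done]
    # acquisition: pick the candidate with the best predicted score
    best = max(unlabeled, key=lambda x: predict(x, labeled))
    labeled.append((best, evaluate(best)))        # "run DFT" on it
```

Each pass retrains on the enlarged labeled set (implicitly here, since the predictor just looks at `labeled`), mirroring the feedback loop of the figure.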

into the emissive layer. In this workflow, we used RDKit to compute descriptors for the chemical structures. These descriptors, generated on the initial subset of structures, are given as vectors to an algorithm based on the Random Forest Regressor as implemented in scikit-learn. Bayesian optimization is employed to tune the hyperparameters of the model. In each iteration, a trained model is applied to make predictions on the remaining materials in the dataset. Figure 6 (A) displays MPO scores for the HTL dataset estimated by AL as a function of hole reorganization energies that are separately calculated for all the materials. This figure indicates that there are many materials in the dataset with the desired low hole reorganization energies that are nevertheless not suitable for the HTL due to their improper oxidation potentials, suggesting that MPO is important to evaluate the optoelectronic performance of the materials. Figure 6 (B) presents MPO scores of the materials used in the training dataset of AL, demonstrating that the feedback loop in the AL workflow efficiently guides the data collection as the size of the training set increases.

To appreciate the computational efficiency of such an approach, it is worth noting that performing DFT calculations for all of the 9,000 molecules in the dataset would increase the computational cost by a factor of 15 versus the AL workflow. The AL approach appears most useful in cases where the problem space is broad (like chemical space) but contains many clusters of similar items (similar molecules). In this case, benchmark data is only needed for a few representatives of each cluster. We are currently working on applying this approach to train models for predicting physical properties of soft materials (polymers).

Conclusions

We present several examples of how the Schrödinger Materials Science Suite integrates open source software packages. There is a wide range of applications in materials science that can benefit from already existing open source code. Where possible, we report issues to the package authors and submit improvements and bug fixes in the form of pull requests. We are thankful to all who have contributed to open source libraries and have made it possible for us to develop a platform for accelerating innovation in materials and drug discovery. We will continue contributing to these projects, and we hope to further give back to the scientific community by facilitating research in both academia and industry. We hope that this report will inspire other scientific companies to give back to the open source community in order to improve the computational materials field and make science more reproducible.

Acknowledgments

The authors acknowledge Bradley Dice and Wenduo Zhou for their valuable comments during the review of the manuscript.

Fig. 6: A: MPO scores of all materials in the HTL dataset. B: MPO scores of the materials used in the training set, as a function of the hole reorganization energy (λh).
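The idea of combining several scaled properties into a single MPO score, as plotted in Figure 6, can be sketched as follows. The scaling functions, property names, ranges, and equal weighting below are assumptions for illustration, not the MPO scheme of [KAG+ 22].

```python
def mpo_score(properties, objectives):
    """Combine scaled property values into one score in [0, 1].

    `objectives` maps a property name to (kind, params): "minimize"
    scales linearly over an expected (lo, hi) range, "target" rewards
    closeness to a target value within a tolerance. Both forms and
    the equal weighting are illustrative choices.
    """
    scores = []
    for name, value in properties.items():
        kind, params = objectives[name]
        if kind == "minimize":
            lo, hi = params
            scores.append(max(0.0, min(1.0, (hi - value) / (hi - lo))))
        elif kind == "target":
            target, tol = params
            scores.append(max(0.0, 1.0 - abs(value - target) / tol))
    return sum(scores) / len(scores)


# Hypothetical objectives for an HTL candidate: minimize the hole
# reorganization energy, target the oxidation potential (made-up numbers).
objectives = {
    "reorganization_energy": ("minimize", (0.05, 0.40)),
    "oxidation_potential": ("target", (5.0, 0.5)),
}
score = mpo_score(
    {"reorganization_energy": 0.12, "oxidation_potential": 5.1},
    objectives)
```

A single scalar like this lets the AL acquisition rank candidates even when several properties matter at once.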

References

[ABG+ 21] Mohammad Atif Faiz Afzal, Andrea R. Browning, Alexander Goldberg, Mathew D. Halls, Jacob L. Gavartin, Tsuguo Morisato, Thomas F. Hughes, David J. Giesen, and Joseph E. Goose. High-throughput molecular dynamics simulations and validation of thermophysical properties of polymers for various applications. ACS Applied Polymer Materials, 3, 2021. doi:10.1021/acsapm.0c00524.

[AKA+ 22] Hadi Abroshan, H. Shaun Kwak, Yuling An, Christopher Brown, Anand Chandrasekaran, Paul Winget, and Mathew D. Halls. Active learning accelerates design and optimization of hole-transporting materials for organic electronics. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800371.

[BAS+ 20] A. R. Browning, M. A. F. Afzal, J. Sanders, A. Goldberg, A. Chandrasekaran, and H. S. Kwak. Polyolefin molecular simulation for critical physical characteristics. International Polyolefins Conference, 2020.

[BDBF] Davide Brunato, Pietro Delugas, Giovanni Borghi, and Alexandr Fonari. qeschema.

[BHH+ 13] Art D. Bochevarov, Edward Harder, Thomas F. Hughes, Jeremy R. Greenwood, Dale A. Braden, Dean M. Philipp, David Rinaldo, Mathew D. Halls, Jing Zhang, and Richard A. Friesner. Jaguar: A high-performance quantum chemistry software program with strengths in life and materials sciences. International Journal of Quantum Chemistry, 113, 2013. doi:10.1002/qua.24481.

[Bic02] Jozef Bicerano. Prediction of Polymer Properties. CRC Press, 2002.

[Cha] A. Chandrasekaran. Active learning accelerated design of ionic materials. In progress.

[DDS+ 16] Steven L. Dixon, Jianxin Duan, Ethan Smith, Christopher D. Von Bargen, Woody Sherman, and Matthew P. Repasky. AutoQSAR: An automated machine learning tool for best-practice quantitative structure-activity relationship modeling. Future Medicinal Chemistry, 8, 2016. doi:10.4155/fmc-2016-0093.

[GAB+ 17] P. Giannozzi, O. Andreussi, T. Brumme, O. Bunau, M. Buongiorno Nardelli, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, M. Cococcioni, N. Colonna, I. Carnimeo, A. Dal Corso, S. De Gironcoli, P. Delugas, R. A. Distasio, A. Ferretti, A. Floris, G. Fratesi, G. Fugallo, R. Gebauer, U. Gerstmann, F. Giustino, T. Gorni, J. Jia, M. Kawamura, H. Y. Ko, A. Kokalj, E. Kücükbenli, M. Lazzeri, M. Marsili, N. Marzari, F. Mauri, N. L. Nguyen, H. V. Nguyen, A. Otero-De-La-Roza, L. Paulatto, S. Poncé, D. Rocca, R. Sabatini, B. Santra, M. Schlipf, A. P. Seitsonen, A. Smogunov, I. Timrov, T. Thonhauser, P. Umari, N. Vast, X. Wu, and S. Baroni. Advanced capabilities for materials modelling with Quantum ESPRESSO. Journal of Physics: Condensed Matter, 29, 2017. doi:10.1088/1361-648X/aa8f79.

[GBLW16] Colin R. Groom, Ian J. Bruno, Matthew P. Lightfoot, and Suzanna C. Ward. The Cambridge Structural Database. Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials, 72, 2016.

[HF08] Gus L.W. Hart and Rodney W. Forcade. Algorithm for generating derivative structures. Physical Review B - Condensed Matter and Materials Physics, 77, 2008. doi:10.1103/PhysRevB.77.224115.

[HJM+ 20] Lauri Himanen, Marc O.J. Jager, Eiaki V. Morooka, Filippo Federici Canova, Yashasvi S. Ranawat, David Z. Gao, Patrick Rinke, and Adam S. Foster. DScribe: Library of descriptors for machine learning in materials science. Computer Physics Communications, 247, 2020. doi:10.1016/j.cpc.2019.106949.

[HM24] O. Hassel and H. Mark. The crystal structure of graphite. Physik. Z., 25:317-337, 1924.

[HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy, 2020. doi:10.1038/s41586-020-2649-2.

[HZU+ 20] Sebastiaan P. Huber, Spyros Zoupanos, Martin Uhrin, Leopold Talirz, Leonid Kahle, Rico Hauselmann, Dominik Gresch, Tiziano Müller, Aliaksandr V. Yakutovich, Casper W. Andersen, Francisco F. Ramirez, Carl S. Adorf, Fernando Gargiulo, Snehal Kumbhar, Elsa Passaro, Conrad Johnston, Andrius Merkys, Andrea Cepellotti, Nicolas Mounet, Nicola Marzari, Boris Kozinsky, and Giovanni Pizzi. AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance. Scientific Data, 7, 2020. doi:10.1038/s41597-020-00638-4.

[JPS92] J. Jansen, R. Peschar, and H. Schenk. Determination of accurate intensities from powder diffraction data. I. Whole-pattern fitting with a least-squares procedure. Journal of Applied Crystallography, 25, 1992. doi:10.1107/S0021889891012104.

[KAG+ 22] H. Shaun Kwak, Yuling An, David J. Giesen, Thomas F. Hughes, Christopher T. Brown, Karl Leswing, Hadi Abroshan, and Mathew D. Halls. Design of organic electronic materials with a goal-directed generative model powered by deep neural networks and high-throughput molecular simulations. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800370.

[KBD+ 21] James A. Kaduk, Simon J. L. Billinge, Robert E. Dinnebier, Nathan Henderson, Ian Madsen, Radovan Černý, Matteo Leoni, Luca Lutterotti, Seema Thakral, and Daniel Chateigner. Powder diffraction. Nature Reviews Methods Primers, 1:77, 2021. doi:10.1038/s43586-021-00074-7.

[LLC22] Schrödinger, LLC. Schrödinger Release 2022-2: Materials Science Suite, 2022.

[LMB+ 17] Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist, Ivano E. Castelli, Rune Christensen, Marcin Dułak, Jesper Friis, Michael N. Groves, Bjørk Hammer, Cory Hargus, Eric D. Hermes, Paul C. Jennings, Peter Bjerre Jensen, James Kermode, John R. Kitchin, Esben Leonhard Kolsbjerg, Joseph Kubal, Kristen Kaasbjerg, Steen Lysgaard, Jón Bergmann Maronsson, Tristan Maxson, Thomas Olsen, Lars Pastewka, Andrew Peterson, Carsten Rostgaard, Jakob Schiøtz, Ole Schütt, Mikkel Strange, Kristian S. Thygesen, Tejs Vegge, Lasse Vilhelmsen, Michael Walter, Zhenhua Zeng, and Karsten W. Jacobsen. The Atomic Simulation Environment - a Python library for working with atoms, 2017. doi:10.1088/1361-648X/aa680e.

[LTK+ 22] Greg Landrum, Paolo Tosco, Brian Kelley, Ric, sriniker, gedeck, Riccardo Vianello, NadineSchneider, Eisuke Kawashima, Andrew Dalke, Dan N, David Cosgrove, Gareth Jones, Brian Cole, Matt Swain, Samo Turk, AlexanderSavelyev, Alain Vaucher, Maciej Wójcikowski, Ichiru Take, Daniel Probst, Kazuya Ujihara, Vincent F. Scalfani, guillaume godin, Axel Pahl, Francois Berenger, JLVarjo, strets123, JP, and DoliathGavid. RDKit, June 2022. doi:10.5281/ZENODO.6605135.

[Mar93] Rudolph A. Marcus. Electron transfer reactions in chemistry. Theory and experiment. Reviews of Modern Physics, 65, 1993. doi:10.1103/RevModPhys.65.599.

[MAS+ 20] Nobuyuki N. Matsuzawa, Hideyuki Arai, Masaru Sasago, Eiji Fujii, Alexander Goldberg, Thomas J. Mustard, H. Shaun Kwak, David J. Giesen, Fabio Ranalli, and Mathew D. Halls. Massive theoretical screen of hole conducting organic materials in the heteroacene family by using a cloud-computing environment. Journal of Physical Chemistry A, 124, 2020. doi:10.1021/acs.jpca.9b10998.

[MGF+ 19] Thomas Mustard, Jacob Gavartin, Alexandr Fonari, Caroline Krauter, Alexander Goldberg, H. Kwak, Tsuguo Morisato, Sudharsan Pandiyan, and Mathew Halls. Surface reactivity and stability of core-shell solid catalysts from ab initio combinatorial calculations. Volume 258, 2019.

[NSA+ 16] Matthew Newville, Till Stensitzki, Daniel B. Allen, Michal Rawlik, Antonino Ingargiola, and Andrew Nelson. Lmfit: Non-linear least-square minimization and curve-fitting for Python. Astrophysics Source Code Library, page ascl-1606, 2016.

[OBJ+ 11] Noel M. O'Boyle, Michael Banck, Craig A. James, Chris Morley, Tim Vandermeersch, and Geoffrey R. Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3, 2011. doi:10.1186/1758-2946-3-33.

[ORJ+ 13] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68, 2013. doi:10.1016/j.commatsci.2012.10.028.

[PKD18] Paul N. Patrone, Anthony J. Kearsley, and Andrew M. Dienstfrey. The role of data analysis in uncertainty quantification: Case studies for materials modeling. 2018. doi:10.2514/6.2018-0927.

[PVG+ 11] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011.

Ho, Douglas J. Ierardi, Lev Iserovich, Jeffrey S. Kuskin, Richard H. Larson, Timothy Layman, Li Siang Lee, Adam K. Lerer, Chester Li, Daniel Killebrew, Kenneth M. Mackenzie, Shark Yeuk Hai Mok, Mark A. Moraes, Rolf Mueller, Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, Daniel Ramot, John K. Salmon, Daniele P. Scarpazza, U. Ben Schafer, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, Ping Tak Peter Tang, Michael Theobald, Horia Toma, Brian Towles, Benjamin Vitale, Stanley C. Wang, and Cliff Young. Anton 2: Raising the bar for performance and programmability in a special-purpose molecular dynamics supercomputer. Volume 2015-January, 2014. doi:10.1109/SC.2014.9.

[SPA+ 19] Gabriel R. Schleder, Antonio C.M. Padilha, Carlos Mera Acosta, Marcio Costa, and Adalberto Fazzio. From DFT to machine learning: Recent approaches to materials science - a review. JPhys Materials, 2, 2019. doi:10.1088/2515-7639/ab084b.

[SYC+ 17] Austin D. Sendek, Qian Yang, Ekin D. Cubuk, Karel-Alexander N. Duerloo, Yi Cui, and Evan J. Reed. Holistic computational structure screening of more than 12000 candidates for solid lithium-ion conductor materials. Energy and Environmental Science, 10:306-320, 2017. doi:10.1039/c6ee02697d.

[VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R.J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, Aditya Vijaykumar, Alessandro Pietro Bardelli, Alex Rothberg, Andreas Hilboll, Andreas Kloeckner, Anthony Scopatz, Antony Lee, Ariel Rokem, C. Nathan Woods, Chad Fulton, Charles Masson, Christian Haggström, Clark Fitzgerald, David A. Nicholson, David R. Hagen, Dmitrii V. Pasechnik, Emanuele Olivetti, Eric Martin, Eric Wieser, Fabrice Silva, Felix Lenders, Florian Wilhelm, G. Young, Gavin A. Price, Gert Ludwig Ingold, Gregory E. Allen, Gregory R. Lee, Hervé Audren, Irvin Probst, Jörg P. Dietrich, Jacob Silterra, James T. Webber, Janko Slavič, Joel Nothman, Johannes Buchner, Johannes Kulick, Johannes L. Schönberger, José Vinícius de Miranda Cardoso, Joscha Reimer, Joseph Harrington, Juan Luis Cano Rodríguez, Juan Nunez-Iglesias, Justin Kuczynski, Kevin Tritz, Martin Thoma, Matthew Newville, Matthias Kümmerer, Maximilian Bolingbroke, Michael Tartre, Mikhail Pak, Nathaniel J. Smith, Nikolai Nowaczyk, Nikolay Shebanov, Oleksandr Pavlyk, Per A. Brodtkorb, Perry Lee, Robert T. McGibbon, Roman Feldbauer, Sam Lewis, Sam Tygier, Scott Sievert, Sebastiano Vigna, Stefan Peterson, Surhud More, Tadeusz Pudlik, Takuya Oshima, Thomas J. Pingel, Thomas P. Robitaille, Thomas Spura, Thouis R. Jones, Tim Cera, Tim Leslie, Tiziano Zito, Tom Krauss, Utkarsh Upadhyay, Yaroslav O. Halchenko, and Yoshiki Vázquez-Baeza. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17, 2020. doi:10.1038/s41592-019-0686-2.

[VPB21] Rama Vasudevan, Ghanshyam Pilania, and Prasanna V. Balachandran. Machine learning for materials design and discovery. Journal of Applied Physics, 129, 2021. doi:10.1063/5.0043300.

[WDF+ 18] Logan Ward, Alexander Dunn, Alireza Faghaninia, Nils E.R. Zimmermann, Saurabh Bajaj, Qi Wang, Joseph Montoya, Jiming Chen, Kyle Bystrom, Maxwell Dylla, Kyle Chard, Mark Asta, Kristin A. Persson, G. Jeffrey Snyder, Ian Foster,
[RDK]       Rdkit contributors.       URL:                  and Anubhav Jain. Matminer: An open source toolkit for
            graphs/contributors.                                                            materials data mining. Computational Materials Science,
[REW+ 19]   Bharath Ramsundar, Peter Eastman, Patrick Walters,                              152, 2018. URL:,
            Vijay Pande, Karl Leswing, and Zhenqin Wu.                   Deep               doi:10.1016/j.commatsci.2018.05.018.
            Learning for the Life Sciences. O’Reilly Media, 2019.               [WF05]      John D. Westbrook and Paula M.D. Fitzgerald. The pdb
                               format, mmcif formats, and other data formats, 2005. doi:
            Microscopy/dp/1492039837.                                                       10.1002/0471721204.ch8.
[SGB+ 14]   David E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Bat-         [WKB+ 22]   Paul Winget, H. Shaun Kwak, Christopher T. Brown, Alexandr
            son, J. Adam Butts, Jack C. Chao, Martin M. Deneroff, Ron O.                    Fonari, Kevin Tran, Alexander Goldberg, Andrea R. Brown-
            Dror, Amos Even, Christopher H. Fenton, Anthony Forte,                          ing, and Mathew D. Halls. Organic thin films for oled appli-
            Joseph Gagliardo, Gennette Gill, Brian Greskamp, C. Richard                     cations: Influence of molecular structure, deposition method,

             and deposition conditions. International Conference on the
             Science and Technology of Synthetic Metals, 2022.
60                                                                                                            PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

       A Novel Pipeline for Cell Instance Segmentation,
       Tracking and Motility Classification of Toxoplasma
                     Gondii in 3D Space
           Seyed Alireza Vaezi‡∗ , Gianni Orlando‡ , Mojtaba Fazli§ , Gary Ward¶ , Silvia Moreno‡ , Shannon Quinn‡


Abstract—Toxoplasma gondii is the parasitic protozoan that causes disseminated toxoplasmosis, a disease estimated to infect around one-third of the world's population. While the disease is commonly asymptomatic, the success of the parasite is in large part due to its ability to easily spread through nucleated cells. The virulence of T. gondii is predicated on the parasite's motility, so the inspection of motility patterns during its lytic cycle has become a topic of keen interest. Current cell tracking projects usually focus on cell images captured in 2D, which are not a true representation of the actual motion of a cell. Current 3D tracking projects lack a comprehensive pipeline covering all phases of preprocessing, cell detection, cell instance segmentation, tracking, and motion classification, and merely implement a subset of the phases. Moreover, current 3D segmentation and tracking pipelines are not targeted at users with less experience in deep learning packages. Our pipeline, TSeg, on the other hand, is developed for segmenting, tracking, and classifying the motility phenotypes of T. gondii in 3D microscopic images. Although TSeg was built with an initial focus on T. gondii, it provides generic functions so that users with similar but distinct applications can use it off the shelf. Interacting with all of TSeg's modules is possible through our Napari plugin, which is developed mainly on the familiar SciPy scientific stack. Additionally, our plugin is designed with a user-friendly GUI in Napari, which adds several benefits to each step of the pipeline, such as visualization and representation in 3D. TSeg achieves better generalization, making it capable of delivering accurate results with images of other cell types.

* Corresponding author:
‡ University of Georgia
§ Harvard University
¶ University of Vermont

Copyright © 2022 Seyed Alireza Vaezi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Quantitative cell research often requires the measurement of different cell properties, including size, shape, and motility. This step is facilitated using segmentation of imaged cells. With fluorescent markers, computational tools can be used to complete segmentation and identify cell features and positions over time. 2D measurements of cells can be useful, but the more difficult task of deriving 3D information from cell images is vital for metrics such as motility and volumetric qualities.

Toxoplasmosis is an infection caused by the intracellular parasite Toxoplasma gondii. T. gondii is one of the most successful parasites, infecting at least one-third of the world's population. Although toxoplasmosis is generally benign in healthy individuals, the infection has fatal implications in fetuses and immunocompromised individuals [SG12]. T. gondii's virulence is directly linked to its lytic cycle, which is comprised of invasion, replication, egress, and motility. Studying the motility of T. gondii is crucial to understanding its lytic cycle in order to develop potential treatments.

For this reason, we present a novel pipeline to detect, segment, track, and classify the motility pattern of T. gondii in 3D space. One of the main goals is to make our pipeline intuitive enough that users who are not experienced in the fields of machine learning (ML), deep learning (DL), or computer vision (CV) can still benefit from it. The other objective is to equip it with the most robust and accurate set of segmentation and detection tools, so that the end product generalizes broadly and performs accurately for various cell types right off the shelf.

PlantSeg uses a variant of 3D U-Net, called Residual 3D U-Net, for preprocessing and segmentation of multiple cell types [WCV+20]. PlantSeg performs best among deep learning algorithms for 3D instance segmentation and is very robust against image noise [KPR+21]. The segmentation module also includes the optional use of CellPose [SWMP21]. CellPose is a generalized segmentation algorithm trained on a wide range of cell types and is the first step toward increased optionality in TSeg. The cell tracking module consolidates the cell particles across the z-axis to materialize cells in 3D space and estimates centroids for each cell. The tracking module is also responsible for extracting the trajectories of cells based on the movements of centroids throughout consecutive video frames, which is eventually the input of the motion classifier module.

Most of the state-of-the-art pipelines are restricted to 2D space, which is not a true representation of the actual motion of the organism. Many of them require knowledge and expertise in programming, or in machine learning and deep learning models and frameworks, thus limiting the demographic of users that can use them. All of them include only a subset of the aforementioned modules (i.e., detection, segmentation, tracking, and classification) [SWMP21]. Many pipelines rely on the user to train their own model, hand-tailored for their specific application. This demands high levels of experience and skill in ML/DL and consequently undermines the possibility and feasibility of quickly utilizing an off-the-shelf pipeline and still getting good results.

To address these shortcomings we present TSeg.
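The four-module flow described above (pre-processing, detection/segmentation, 3D consolidation and tracking, then motion classification) can be sketched as a simple chain of stages. This is purely illustrative: every function below is a hypothetical stand-in chosen for this sketch, not TSeg's actual API, and the synthetic data replaces real microscopy stacks.

```python
import numpy as np

def preprocess(stack):
    """Normalize a (z, y, x) image stack to the [0, 1] range."""
    stack = stack.astype(float)
    span = np.ptp(stack)
    return (stack - stack.min()) / (span if span else 1.0)

def segment(stack, threshold=0.5):
    """Toy stand-in for the segmentation module: a global threshold."""
    return stack > threshold

def track(masks):
    """Toy stand-in for tracking: one 3D centroid per time frame."""
    return [np.argwhere(m).mean(axis=0) for m in masks]

def classify(trajectory):
    """Toy stand-in for motion classification: did the centroid move?"""
    displacement = np.linalg.norm(trajectory[-1] - trajectory[0])
    return "motile" if displacement > 1e-9 else "stationary"

# A synthetic two-frame video of (z, y, x) stacks, with a bright
# feature added in the second frame to shift the centroid.
rng = np.random.default_rng(0)
frames = [rng.random((4, 8, 8)) for _ in range(2)]
frames[1][:, 5:7, 5:7] += 2.0
masks = [segment(preprocess(f)) for f in frames]
label = classify(track(masks))
```

The point of the sketch is only the module boundaries: each stage consumes the previous stage's output, which is the structure the rest of the paper elaborates.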

TSeg segments T. gondii cells in 3D microscopic images, tracks their trajectories, and classifies the motion patterns observed throughout the 3D frames. TSeg is comprised of four modules: pre-processing, segmentation, tracking, and classification. We developed TSeg as a plugin for Napari [SLE+22], an open-source, fast, and interactive image viewer for Python designed for browsing, annotating, and analyzing large multi-dimensional images. Having TSeg implemented as a part of Napari not only provides a user-friendly design but also gives more advanced users the possibility to attach and execute their custom code and even interact with the steps of the pipeline if needed. The preprocessing module is equipped with basic and extra filters and functionalities to aid in the preparation of the input data. TSeg gives its users the advantage of utilizing the functionalities that PlantSeg and CellPose provide. These functionalities can be chosen in the pre-processing, detection, and segmentation steps. This brings forth a huge variety of algorithms and pre-built models to select from, making TSeg a great fit not only for T. gondii but also for a variety of different cell types.

Fig. 1: The overview of TSeg's architecture.

The rest of this paper is structured as follows: after briefly reviewing the literature in Related Work, we move on to thoroughly describe the details of our work in the Method section. Following that, the Results section depicts the results of comprehensive tests of our plugin on T. gondii cells.

Related Work

The recent solutions in generalized and automated segmentation tools are focused on 2D cell images. Segmentation of cellular structures in 2D is important but not representative of realistic environments. Microbiological organisms are free to move on the z-axis, and tracking without taking this factor into account cannot guarantee a full representation of the actual motility patterns. As an example, Fazli et al. [FVMQ18] identified three distinct motility types for T. gondii with two-dimensional data; however, they also acknowledge that, based on established heuristics from previous works, there are more than three motility phenotypes for T. gondii. The focus on 2D research is understandable due to several factors. 3D data is difficult to capture, as tools for capturing 3D slices and the computational resources for analyzing this data are not available in most research labs. Most segmentation tools are unable to track objects in 3D space, as the assignment of related centroids is more difficult. The additional noise from capture and focus increases the probability of incorrect assignment. 3D data also has issues with overlapping features and increased computation required per frame of time.

Fazli et al. [FVMQ18] study the motility patterns of T. gondii and provide a computational pipeline for identifying motility phenotypes of T. gondii in an unsupervised, data-driven way. In that work Ca2+ is added to T. gondii cells inside a Fetal Bovine Serum. T. gondii cells react to Ca2+ and become motile and fluorescent. The images of motile T. gondii cells were captured using an LSM 710 confocal microscope. They use Python 3 and associated scientific computing libraries (NumPy, SciPy, scikit-learn, matplotlib) in their pipeline to track and cluster the trajectories of T. gondii. Building on this work, Fazli et al. [FVM+18] developed another pipeline consisting of preprocessing, sparsification, cell detection, and cell tracking modules to track T. gondii in 3D video microscopy, where each frame of the video consists of image slices taken 1 micrometer of focal depth apart along the z-axis direction. In their latest work, Fazli et al. [FSA+19] developed a lightweight and scalable pipeline using task distribution and parallelism. Their pipeline consists of multiple modules: preprocessing, sparsification, cell detection, cell tracking, trajectory extraction, parametrization of the trajectories, and clustering. They could classify three distinct motion patterns in T. gondii using the same data from their previous work.

While combining open source tools is not a novel architecture, little has been done to integrate 3D cell tracking tools. Fazeli et al. [FRF+20], motivated by the same interest in providing better tools to non-software professionals, created a 2D cell tracking pipeline. This pipeline combines Stardist [WSH+20] and TrackMate [TPS+17] for automated cell tracking. It begins with the user loading cell images and centroid approximations to the ZeroCostDL4Mic [vCLJ+21] platform. ZeroCostDL4Mic is a deep learning training tool for those with no coding expertise. Once the platform is trained and masks for the training set are made for hand-drawn annotations, the training set can be input to Stardist. Stardist performs automated object detection using Euclidean distance to probabilistically determine cell pixels versus background pixels. Lastly, TrackMate uses segmentation images to track labels between timeframes and display analytics.

This Stardist pipeline is similar in concept to TSeg. Both create an automated segmentation and tracking pipeline, but TSeg is oriented to 3D data. Cells move in 3-dimensional space that is not represented in a flat plane. TSeg also does not require the manual training necessary for the other pipeline. Individuals with low technical expertise should not be expected to create masks for training or even understand the training of deep neural networks. Lastly, that pipeline does not account for imperfect datasets without the need for preprocessing. All implemented algorithms in TSeg account for microscopy images with some amount of noise.

Wen et al. [WMV+21] combine multiple existing new technologies, including deep learning, and present 3DeeCellTracker. 3DeeCellTracker segments and tracks cells on 3D time-lapse images. Using a small subset of their dataset, they train the deep learning architecture 3D U-Net for segmentation. For tracking, a combination of two strategies is used to increase accuracy: local cell region strategies and a spatial pattern strategy. Kapoor et al. [KC21] present VollSeg, which uses deep learning methods to segment, track, and analyze cells in 3D with irregular shape and intensity distribution. It is a Jupyter Notebook-based Python package and also has a UI in Napari. For tracking, custom tracking code is developed based on TrackMate.

Many segmentation tools require some amount of knowledge of machine or deep learning concepts. Training the neural network to create masks is a common step for open-source segmentation tools. Automating this process makes the pipeline more accessible to microbiology researchers.

Method

Our dataset consists of 11 videos of T. gondii cells under a microscope, obtained from different experiments with different numbers of cells. The videos are on average around 63 frames in length. Each frame has a stack of 41 image slices of size 500×502 pixels along the z-axis (z-slices). The z-slices are captured 1 µm apart in optical focal length, making them 402µm × 401µm × 40µm in volume. The slices were recorded in raw format as RGB TIF images but are converted to grayscale for our purpose. This data is captured using a PlanApo 20x objective (NA = 0.75) on a preheated Nikon Eclipse TE300 epifluorescence microscope. The image stacks were captured using an iXon 885 EMCCD camera (Andor Technology, Belfast, Ireland) cooled to −70°C and driven by NIS Elements software (Nikon Instruments, Melville, NY) as part of related research by Ward et al. [LRK+14]. The camera was set to frame transfer sensor mode, with a vertical pixel shift speed of 1.0 µs, vertical clock voltage amplitude of +1, readout speed of 35 MHz, conversion gain of 3.8×, an EM gain setting of 3, and 2×2 binning, and the z-slices were imaged with an exposure time of 16 ms.

Software

Napari Plugin: TSeg is developed as a plugin for Napari, a fast and interactive multi-dimensional image viewer for Python that allows volumetric viewing of 3D images [SLE+22]. Plugins enable developers to customize and extend the functionality of Napari. For every module of TSeg, we developed its corresponding widget in the GUI, plus a widget for file management. The widgets have self-explanatory interface elements with tooltips to guide the inexperienced user through the pipeline with ease. Layers in Napari are the basic viewable objects that can be shown in the Napari viewer. Seven different layer types are supported in Napari: Image, Labels, Points, Shapes, Surface, Tracks, and Vectors, each of which corresponds to a different data type, visualization, and interactivity [SLE+22]. After its execution, the viewable output of each widget gets added to the layers. This allows the user to evaluate and modify the parameters of the widget to get the best results before continuing to the next widget. Napari supports bidirectional communication between the viewer and the Python kernel and has a built-in console that allows users to control all the features of the viewer programmatically. This adds more flexibility and customizability to TSeg for the advanced user. The full code of TSeg is available on GitHub under the MIT open source license, and TSeg can be installed through Napari's plugins menu.

Computational Pipeline

Pre-Processing: Due to the fast imaging speed in data acquisition, the image slices inherently have a vignetting artifact, meaning that the corners of the images are slightly darker than the center of the image. To eliminate this artifact we added adaptive thresholding and logarithmic correction to the pre-processing module. Furthermore, another prevalent artifact in our dataset images was film-grain noise (also known as salt-and-pepper noise). To remove or reduce such noise, a simple Gaussian blur filter and a sharpening filter are included.

Cell Detection and Segmentation: TSeg's detection and segmentation modules are in fact backed by PlantSeg and CellPose. The detection module is built solely on PlantSeg's CNN detection module [WCV+20], and for the segmentation module, only one of the three tools can be selected to be executed as the segmentation tool in the pipeline. Naturally, each of the tools demands specific interface elements different from the others, since each accepts different input values and various parameters. TSeg orchestrates this and makes sure the arguments and parameters are passed to the corresponding selected segmentation tool properly and the execution is handled accordingly. The parameters include, but are not limited to, the input data location, the output directory, and the desired segmentation algorithm. This allows the end user complete control over the process and feedback from each step of the process. The preprocessed images and relevant parameters are sent to a modular segmentation controller script. As an effort to allow future development on TSeg, the segmentation controller script shows how the pipeline integrates two completely different segmentation packages. While both PlantSeg and CellPose use conda environments, PlantSeg requires modification of a YAML file for initialization, while CellPose initializes directly from command-line parameters. In order to implement PlantSeg, TSeg generates a YAML file based on GUI input elements. After parameters are aligned, the conda environment for the chosen segmentation algorithm is opened in a subprocess. The $CONDA_PREFIX environment variable allows the bash command to start conda and context-switch to the correct segmentation environment.

Tracking: Features in each segmented image are found using the SciPy label function. Any features under a minimum size are filtered out as leftover noise. After feature extraction, centroids are calculated using the center-of-mass function in SciPy. The centroid of the 3D cell can be used as a representation of the entire body during tracking. The tracking algorithm goes through each captured time instance and connects centroids to the likely next movement of the cell. Tracking involves a series of measures to avoid incorrect assignments, since an incorrect assignment could lead to inaccurate result sets and unrealistic motility patterns. If the same number of features in each frame of time could be guaranteed from segmentation, minimum distance could assign features rather accurately. Since this is not a guarantee, the Hungarian algorithm must be used to associate a cost with each candidate assignment. The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time.
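The labeling, size-filtering, centroid, and Hungarian-assignment steps just described can be sketched as follows. This is a minimal illustration using NumPy and SciPy only, not TSeg's actual code; the array shapes, the minimum-size threshold, and the distance cutoff are illustrative choices.

```python
import numpy as np
from scipy import ndimage
from scipy.optimize import linear_sum_assignment

def frame_centroids(volume, min_size=2):
    """Label connected features in a binary (z, y, x) volume, drop
    features smaller than min_size voxels, and return their centroids."""
    labels, n = ndimage.label(volume)
    sizes = ndimage.sum(volume, labels, index=range(1, n + 1))
    keep = [i + 1 for i, size in enumerate(sizes) if size >= min_size]
    return np.array(ndimage.center_of_mass(volume, labels, keep))

def match_frames(prev_pts, next_pts, max_dist=10.0):
    """Assign previous-frame centroids to next-frame centroids with the
    Hungarian algorithm; pairs farther than max_dist end their track."""
    cost = np.linalg.norm(prev_pts[:, None, :] - next_pts[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

# Two toy frames with a single feature drifting one voxel along y.
frame0 = np.zeros((5, 8, 8), dtype=int)
frame1 = np.zeros((5, 8, 8), dtype=int)
frame0[1:3, 1:3, 1:3] = 1
frame1[1:3, 2:4, 1:3] = 1
pairs = match_frames(frame_centroids(frame0), frame_centroids(frame1))
```

`scipy.optimize.linear_sum_assignment` solves the same assignment problem the Hungarian method addresses, which is why a single call suffices for the per-frame matching step.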

The cost for the tracking algorithm determines which feature is the next iteration of the cell's track through the complete time series: the combination of the distance between centroids for all previous points and the distance to the potential new centroid. If an optimal next centroid cannot be found within an acceptable distance of the current point, the tracking for the cell is considered complete. Likewise, if a feature is not assigned to a current centroid, this feature is considered a new object and is tracked as the algorithm progresses. The complete path for each feature is then stored for motility analysis.

Motion Classification: To classify the motility pattern of T. gondii in 3D space in an unsupervised fashion, we implement and use the method that Fazli et al. introduced [FSA+19]. In that work, they used an autoregressive (AR) model: a linear dynamical system that encodes a Markov-based transition prediction method. The reason is that although K-means is a favorable clustering algorithm, it and related conventional methods have drawbacks that render them impractical here. Firstly, K-means assumes Euclidean distance, but AR motion parameters are geodesics that do not reside in a Euclidean space; secondly, K-means assumes isotropic clusters, and although AR motion parameters may exhibit isotropy in their space, without a proper distance metric this cannot be clearly examined [FSA+19].

Conclusion and Discussion

TSeg is an easy-to-use pipeline designed to study the motility patterns of T. gondii in 3D space. It is developed as a plugin for Napari and is equipped with a variety of deep-learning-based segmentation tools borrowed from PlantSeg and CellPose, making it a suitable off-the-shelf tool for applications incorporating images of cell types not limited to T. gondii. Future work on TSeg includes the expansion of the implemented algorithms and tools in its preprocessing, segmentation, tracking, and clustering modules.

REFERENCES

[FRF+20] Elnaz Fazeli, Nathan H Roy, Gautier Follain, Romain F Laine,

[LRK+14] Jacqueline Leung, Mark Rould, Christoph Konradt, Christopher Hunter, and Gary Ward. Disruption of tgphil1 alters specific parameters of toxoplasma gondii motility measured in a quantitative, three-dimensional live motility assay. PloS one, 9:e85763, 01 2014. doi:10.1371/journal.pone.0085763.

[SG12] Geita Saadatnia and Majid Golkar. A review on human toxoplasmosis. Scandinavian journal of infectious diseases, 44(11):805–814, 2012. doi:10.3109/00365548.2012.693197.

[SLE+22] Nicholas Sofroniew, Talley Lambert, Kira Evans, Juan Nunez-Iglesias, Grzegorz Bokota, Philip Winston, Gonzalo Peña-Castellanos, Kevin Yamauchi, Matthias Bussonnier, Draga Doncila Pop, Ahmet Can Solak, Ziyang Liu, Pam Wadhwa, Alister Burt, Genevieve Buckley, Andrew Sweet, Lukasz Migas, Volker Hilsenstein, Lorenzo Gaifas, Jordão Bragantini, Jaime Rodríguez-Guerra, Hector Muñoz, Jeremy Freeman, Peter Boone, Alan Lowe, Christoph Gohlke, Loic Royer, Andrea Pierré, Hagai Har-Gil, and Abigail McGovern. napari: a multi-dimensional image viewer for Python, May 2022. doi:10.5281/zenodo.

[SWMP21] Carsen Stringer, Tim Wang, Michalis Michaelos, and Marius Pachitariu. Cellpose: a generalist algorithm for cellular segmentation. Nature methods, 18(1):100–106, 2021. doi:10.1101/2020.02.02.931238.

[TPS+17] Jean-Yves Tinevez, Nick Perry, Johannes Schindelin, Genevieve M. Hoopes, Gregory D. Reynolds, Emmanuel Laplantine, Sebastian Y. Bednarek, Spencer L. Shorte, and Kevin W. Eliceiri. Trackmate: An open and extensible platform for single-particle tracking. Methods, 115:80–90, 2017 (Image Processing for Biologists). URL: https://www.sciencedirect.com/science/article/pii/S1046202316303346.

[vCLJ+21] Lucas von Chamier, Romain F Laine, Johanna Jukkala, Christoph Spahn, Daniel Krentzel, Elias Nehme, Martina Lerche, Sara Hernández-Pérez, Pieta K Mattila, Eleni Karinou, et al. Democratising deep learning for microscopy with zerocostdl4mic. Nature communications, 12(1):1–18, 2021. doi:10.1038/s41467-021-22518-0.

[WCV+20] Adrian Wolny, Lorenzo Cerrone, Athul Vijayan, Rachele Tofanelli, Amaya Vilches Barro, Marion Louveaux, Christian Wenzl, Sören Strauss, David Wilson-Sánchez, Rena Lymbouridou, Susanne S Steigleder, Constantin Pape, Alberto Bailoni, Salva Duran-Nebreda, George W Bassel, Jan U Lohmann, Miltos Tsiantis, Fred A Hamprecht, Kay Schneitz, Alexis Maizel,
          Lucas von Chamier, Pekka E Hänninen, John E Eriksson, Jean-                   and Anna Kreshuk. Accurate and versatile 3d segmenta-
          Yves Tinevez, and Guillaume Jacquemet. Automated cell track-                  tion of plant tissues at cellular resolution. eLife, 9:e57613,
          ing using stardist and trackmate. F1000Research, 9, 2020.                     jul 2020. URL:, doi:10.
          doi:10.12688/f1000research.27019.1.                                           7554/eLife.57613.
[FSA+ 19] Mojtaba Sedigh Fazli, Rachel V Stadler, BahaaEddin Alaila,       [WMV+ 21]    Chentao Wen, Takuya Miura, Venkatakaushik Voleti, Kazushi
          Stephen A Vella, Silvia NJ Moreno, Gary E Ward, and Shannon                   Yamaguchi, Motosuke Tsutsumi, Kei Yamamoto, Kohei Otomo,
          Quinn. Lightweight and scalable particle tracking and motion                  Yukako Fujie, Takayuki Teramoto, Takeshi Ishihara, Kazuhiro
          clustering of 3d cell trajectories. In 2019 IEEE International                Aoki, Tomomi Nemoto, Elizabeth Mc Hillman, and Koutarou D
          Conference on Data Science and Advanced Analytics (DSAA),                     Kimura. 3DeeCellTracker, a deep learning-based pipeline for
          pages 412–421. IEEE, 2019. doi:10.1109/dsaa.2019.                             segmenting and tracking cells in 3D time lapse images. Elife, 10,
          00056.                                                                        March 2021. URL:, doi:
[FVM 18] Mojtaba S Fazli, Stephen A Vella, Silvia NJ Moreno, Gary E
          Ward, and Shannon P Quinn. Toward simple & scalable 3d cell      [WSH+ 20]    Martin Weigert, Uwe Schmidt, Robert Haase, Ko Sugawara,
          tracking. In 2018 IEEE International Conference on Big Data                   and Gene Myers. Star-convex polyhedra for 3d object detec-
          (Big Data), pages 3217–3225. IEEE, 2018. doi:10.1109/                         tion and segmentation in microscopy. In 2020 IEEE Winter
          BigData.2018.8622403.                                                         Conference on Applications of Computer Vision (WACV). IEEE,
[FVMQ18] Mojtaba S Fazli, Stephen A Velia, Silvia NJ Moreno, and                        mar 2020. URL:
          Shannon Quinn. Unsupervised discovery of toxoplasma gondii                    9093435, doi:10.1109/wacv45572.2020.9093435.
          motility phenotypes. In 2018 IEEE 15th International Sympo-
          sium on Biomedical Imaging (ISBI 2018), pages 981–984. IEEE,
          2018. doi:10.1109/isbi.2018.8363735.
[KC21]    Varun Kapoor and Claudia Carabaña. Cell tracking in 3d
          using deep learning segmentations. In Python in Science Con-
          ference, pages 154–161, 2021. doi:10.25080/majora-
[KPR+ 21] Anuradha Kar, Manuel Petit, Yassin Refahi, Guillaume
          Cerutti, Christophe Godin, and Jan Traas.          Assessment
          of deep learning algorithms for 3d instance segmentation
          of confocal image datasets. bioRxiv, 2021. URL: https:
          pdf, doi:10.1101/2021.06.09.447748.
64                                                                                                   PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

 The myth of the normal curve and what to do about it
                                                                Allan Campopiano∗


Index Terms—Python, R, robust statistics, bootstrapping, trimmed mean, data
science, hypothesis testing

* Corresponding author:
Copyright © 2022 Allan Campopiano. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Reliance on the normal curve as a tool for measurement is almost a given. It shapes our grading systems, our measures of intelligence, and, importantly, it forms the mathematical backbone of many of our inferential statistical tests and algorithms. Some even call it "God's curve" for its supposed presence in nature.

Scientific fields that deal in explanatory and predictive statistics make particular use of the normal curve, often using it to conveniently define thresholds beyond which a result is considered statistically significant (e.g., t-test, F-test). Even familiar machine learning models have, buried in their guts, an assumption of the normal curve (e.g., LDA, Gaussian naive Bayes, logistic & linear regression).

The normal curve has had a grip on us for some time; the aphorism by Cramér [Cra46] still rings true for many today:

    "Everyone believes in the [normal] law of errors, the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact."

Many students of statistics learn that N=40 is enough to ignore the violation of the assumption of normality. This belief stems from early research showing that the sampling distribution of the mean quickly approaches normal, even when drawing from non-normal distributions—as long as samples are sufficiently large. It is common to demonstrate this result by sampling from uniform and exponential distributions. Since these look nothing like the normal curve, it was assumed that N=40 must be enough to avoid practical issues when sampling from other types of non-normal distributions [Wil13]. (Others reached similar conclusions with different methodology [Gle93].)

Two practical issues have since been identified based on this early research: (1) the distributions under study were light tailed (they did not produce outliers), and (2) statistics other than the sample mean were not tested and may behave differently. In the half century following these early findings, many important discoveries have been made—calling into question the usefulness of the normal curve [Wil13].

The following sections uncover various pitfalls one might encounter when assuming normality—especially as they relate to hypothesis testing. To help researchers overcome these problems, a new Python library for robust hypothesis testing will be introduced along with an interactive tool for robust statistics education.

Fig. 1: Standard normal (orange) and contaminated normal (blue). The variance of the contaminated curve is more than 10 times that of the standard normal curve. This can cause serious issues with statistical power when using traditional hypothesis testing methods.

The contaminated normal

One of the most striking counterexamples of "N=40 is enough" is shown when sampling from the so-called contaminated normal [Tuk60][Tan82]. This distribution is also bell shaped and symmetrical, but it has slightly heavier tails when compared to the standard normal curve. That is, it contains outliers and is difficult to distinguish from a normal distribution with the naked eye. Consider the distributions in Figure 1. The variance of the normal distribution is 1, but the variance of the contaminated normal is 10.9!

The consequence of this inflated variance is apparent when examining statistical power. To demonstrate, Figure 2 shows two pairs of distributions: on the left, there are two normal distributions (variance 1), and on the right there are two contaminated distributions (variance 10.9). Both pairs of distributions have a mean difference of 0.8. Wilcox [Wil13] showed that by taking random samples of N=40 from each normal curve, and comparing them with Student's t-test, statistical power was approximately 0.94. However, when following this same procedure for the contaminated groups, statistical power was only 0.25.

The point here is that even small apparent departures from normality, especially in the tails, can have a large impact on commonly used statistics. The problems continue to get worse when examining effect sizes, but these findings are not discussed
THE MYTH OF THE NORMAL CURVE AND WHAT TO DO ABOUT IT                                                                                             65

in this article. Interested readers should see Wilcox's 1992 paper [Wil92].

Fig. 2: Two normal curves (left) and two contaminated normal curves (right). Despite the obvious effect sizes (∆ = 0.8 for both pairs) as well as the visual similarities of the distributions, power is only ~0.25 under contamination; however, power is ~0.94 under normality (using Student's t-test).

Perhaps one could argue that the contaminated normal distribution actually represents an extreme departure from normality and therefore should not be taken seriously; however, distributions that generate outliers are likely common in practice [HD82][Mic89][Wil09]. A reasonable goal would then be to choose methods that perform well under such situations and continue to perform well under normality. In addition, serious issues still exist even when examining light-tailed and skewed distributions (e.g., lognormal), and statistics other than the sample mean (e.g., T). These findings will be discussed in the following section.

Student's t-distribution

Another common statistic is the T value obtained from Student's t-test. As will be demonstrated, T is more sensitive to violations of normality than the sample mean (which has already been shown to not be robust). This is despite the fact that the t-distribution is also bell shaped, light tailed, and symmetrical—a close relative of the normal curve.

The assumption is that T follows a t-distribution (and with large samples it approaches normality). We can test this assumption by generating random samples from a lognormal distribution. Specifically, 5000 datasets of sample size 20 were randomly drawn from a lognormal distribution using SciPy's lognorm.rvs function. For each dataset, T was calculated and the resulting t-distribution was plotted. Figure 3 shows that the assumption that T follows a t-distribution does not hold.

Fig. 3: Actual t-distribution (orange) and assumed t-distribution (blue). When simulating a t-distribution based on a lognormal curve, T does not follow the assumed shape. This can cause poor probability coverage and increased Type I Error when using traditional hypothesis testing approaches.

With N=20, the assumption is that with a probability of 0.95, T will be between -2.09 and 2.09. However, when sampling from a lognormal distribution in the manner just described, there is actually a 0.95 probability that T will be between approximately -4.2 and 1.4 (i.e., the middle 95% of the actual t-distribution is much wider than the assumed t-distribution). Based on this result we can conclude that sampling from skewed distributions (e.g., lognormal) leads to increased Type I Error when using Student's t-test [Wil98].

    "Surely the hallowed bell-shaped curve has cracked from top to bottom. Perhaps, like the Liberty Bell, it should be enshrined somewhere as a memorial to more heroic days." — Earnest Ernest, Philadelphia Inquirer, 10 November 1974. [FG81]

Modern robust methods

When it comes to hypothesis testing, one intuitive way of dealing with the issues described above would be to (1) replace the sample mean (and standard deviation) with a robust alternative and (2) use a non-parametric resampling technique to estimate the sampling distribution (rather than assuming a theoretical shape)¹. Two such candidates are the 20% trimmed mean and the percentile bootstrap test, both of which have been shown to have practical value when dealing with issues of outliers and non-normality.

The trimmed mean

The trimmed mean is nothing more than sorting values, removing a proportion from each tail, and computing the mean on the remaining values. Formally,

    • Let X_1, ..., X_n be a random sample and X_(1) ≤ X_(2) ≤ ... ≤ X_(n) be the observations in ascending order
    • The proportion to trim is γ (0 ≤ γ ≤ .5)
    • Let g = ⌊γn⌋. That is, the proportion to trim multiplied by n, rounded down to the nearest integer

Then, in symbols, the trimmed mean can be expressed as follows:

    X̄_t = (X_(g+1) + ... + X_(n−g)) / (n − 2g)

If the proportion to trim is 0.2, more than twenty percent of the values would have to be altered to make the trimmed mean arbitrarily large or small. The sample mean, on the other hand, can be made to go to ±∞ (arbitrarily large or small) by changing a single value. The trimmed mean is more robust than the sample mean in all measures of robustness that have been studied [Wil13]. In particular the 20% trimmed mean has been shown to have practical value as it avoids issues associated with the median (not discussed here) and still protects against outliers.

1. Another option is to use a parametric test that assumes a different underlying model.

The percentile bootstrap test

In most traditional parametric tests, there is an assumption that the sampling distribution has a particular shape (normal, f-distribution, t-distribution, etc.). We can use these distributions to test the null hypothesis; however, as discussed, the theoretical distributions are not always approximated well when violations of assumptions occur. Non-parametric resampling techniques such as bootstrapping and permutation tests build empirical sampling distributions, and from these, one can robustly derive p-values and CIs. One example is the percentile bootstrap test [Efr92][TE93].

The percentile bootstrap test can be thought of as an algorithm that uses the data at hand to estimate the underlying sampling distribution of a statistic (pulling yourself up by your own bootstraps, as the saying goes). This approach is in contrast to traditional methods that assume the sampling distribution takes a particular shape. The percentile bootstrap test works well with small sample sizes, under normality, under non-normality, and it easily extends to multi-group tests (ANOVA) and measures of association (correlation, regression). For a two-sample case, the steps to compute the percentile bootstrap test can be described as follows:

    1) Randomly resample with replacement n values from group one
    2) Randomly resample with replacement n values from group two
    3) Compute X̄1 − X̄2 based on your new samples (the mean difference)
    4) Store the difference & repeat steps 1-3 many times (say, 1000)
    5) Consider the middle 95% of all differences (the confidence interval)
    6) If the confidence interval contains zero, there is no statistical difference; otherwise, you can reject the null hypothesis (there is a statistical difference)

Implementing and teaching modern robust methods

Despite over half a century of convincing findings, and thousands of papers, robust statistical methods are still not widely adopted in applied research [EHM08][Wil98]. This may be due to various false beliefs. For example,

    • Classical methods are robust to violations of assumptions
    • Correcting non-normal distributions by transforming the data will solve all issues
    • Traditional non-parametric tests are suitable replacements for parametric tests that violate assumptions

Perhaps the most obvious reason for the lack of adoption of modern methods is a lack of easy-to-use software and training resources. In the following sections, two resources will be presented: one for implementing robust methods and one for teaching them.

Robust statistics for Python

Hypothesize is a robust null hypothesis significance testing (NHST) library for Python [CW20]. It is based on Wilcox's WRS package for R, which contains hundreds of functions for computing robust measures of central tendency and hypothesis testing. At the time of this writing, the WRS library in R contains many more functions than Hypothesize, and its value to researchers who use inferential statistics cannot be overstated. WRS is best experienced in tandem with Wilcox's book "Introduction to Robust Estimation and Hypothesis Testing".

Hypothesize brings many of these functions into the open-source Python library ecosystem with the goal of lowering the barrier to modern robust methods—even for those who have not had extensive training in statistics or coding. With modern browser-based notebook environments (e.g., Deepnote), learning to use Hypothesize can be relatively straightforward. In fact, every statistical test listed in the docs is associated with a hosted notebook, pre-filled with sample data and code. But certainly, one can simply pip install Hypothesize to use it in any environment that supports Python. See van Noordt and Willoughby [vNW21] and van Noordt et al. [vNDTE22] for examples of Hypothesize being used in applied research.

The API for Hypothesize is organized by single- and two-factor tests, as well as measures of association. Input data for the groups, conditions, and measures are given in the form of a Pandas DataFrame [pdt20][WM10]. By way of example, one can compare two independent groups (e.g., placebo versus treatment) using the 20% trimmed mean and the percentile bootstrap test, as follows (note that Hypothesize uses the naming conventions found in WRS):

from hypothesize.utilities import trim_mean
from hypothesize.compare_groups_with_single_factor \
    import pb2gen

results = pb2gen(df.placebo, df.treatment, trim_mean)

As shown below, the results are returned as a Python dictionary containing the p-value, confidence intervals, and other important details.

{
'ci': [-0.22625614592148624, 0.06961754796950131],
'est_1': 0.43968438076483285,
'est_2': 0.5290985245430996,
'est_dif': -0.08941414377826673,
'n1': 50,
'n2': 50,
'p_value': 0.27,
'variance': 0.005787027326924963
}

For measuring associations, several options exist in Hypothesize. One example is the Winsorized correlation, which is a robust alternative to Pearson's R. For example,

from hypothesize.measuring_associations import wincor

results = wincor(df.height, df.weight, tr=.2)

returns the Winsorized correlation coefficient and other relevant statistics:

{
'cor': 0.08515087411576182,
'nval': 50,
'sig': 0.558539575073185,
'wcov': 0.004207827245660796
}

A case study using real-world data

It is helpful to demonstrate that robust methods in Hypothesize (and in other libraries) can make a practical difference when dealing with real-world data. In a study by Miller on sexual attitudes, 1327 men and 2282 women were asked how many sexual
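The six steps above can be sketched in plain Python using only the standard library. This is an illustrative toy version (the function name and example data are invented here), not the implementation used by Hypothesize:

```python
import random

def percentile_bootstrap(group1, group2, n_boot=1000, seed=0):
    """Two-sample percentile bootstrap test on the mean difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Steps 1-2: resample each group with replacement
        s1 = rng.choices(group1, k=len(group1))
        s2 = rng.choices(group2, k=len(group2))
        # Steps 3-4: compute and store the mean difference
        diffs.append(sum(s1) / len(s1) - sum(s2) / len(s2))
    # Step 5: the middle 95% of the bootstrap differences
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    # Step 6: reject the null when the interval excludes zero
    return (lo, hi), not (lo <= 0 <= hi)

g1 = [4.1, 5.0, 6.2, 5.5, 4.8, 5.9, 6.1, 5.2]
g2 = [1.1, 2.0, 1.7, 2.4, 1.5, 2.2, 1.9, 1.3]
ci, reject = percentile_bootstrap(g1, g2)
print(ci, reject)
```

Because the two toy groups are well separated, every bootstrap difference is positive, the confidence interval excludes zero, and the null hypothesis is rejected. Note that no theoretical sampling distribution is assumed anywhere in the procedure.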

partners they desired over the next 30 years (the data are available
from Rand R. Wilcox’s site). When comparing these groups using
Student’s t-test, we get the following results:
'ci': [-1491.09, 4823.24],
't_value': 1.035308,
'p_value': 0.300727

That is, we fail to reject the null hypothesis at the α = 0.05 level
using Student’s test for independent groups. However, if we switch
to a robust analogue of the t-test, one that utilizes bootstrapping
and trimmed means, we can indeed reject the null hypothesis.
Here are the corresponding results from Hypothesize’s yuenbt
test (based on [Yue74]):
from hypothesize.compare_groups_with_single_factor \
    import yuenbt
                                                                         Fig. 4: An example of the robust stats simulator in Deepnote’s hosted
                                                                         notebook environment. A minimalist UI can lower the barrier-to-entry
results = yuenbt(df.males, df.females,                                   to robust statistics concepts.
    tr=.2, alpha=.05)

{                                                                            The robust statistics simulator allows users to interact with the
'ci': [1.41, 2.11],                                                      following parameters:
'test_stat': 9.85,
'p_value': 0.0                                                              •    Distribution shape
                                                                            •    Level of contamination
The point here is that robust statistics can make a practi-                 •    Sample size
cal difference with real-world data (even when N is consid-                 •    Skew and heaviness of tails
ered large). Many other examples of robust statistics making a               Each of these characteristics can be adjusted independently in
practical difference with real-world data have been documented           order to compare classic approaches to their robust alternatives.
[HD82][Wil09][Wil01].                                                    The two measures that are used to evaluate the performance of
    It is important to note that robust methods may also fail to         classic and robust methods are the standard error and Type I Error.
reject when a traditional test rejects (remember that traditional            Standard error is a measure of how much an estimator varies
tests can suffer from increased Type I Error). It is also possible       across random samples from our population. We want to choose
that both approaches yield the same or similar conclusions. The          estimators that have a low standard error. Type I Error is also
exact pattern of results depends largely on the characteristics of the   known as False Positive Rate. We want to choose methods that
underlying population distribution. To be able to reason about how       keep Type I Error close to the nominal rate (usually 0.05). The
robust statistics behave when compared to traditional methods the        robust statistics simulator can guide these decisions by providing
robust statistics simulator has been created and is described in the     empirical evidence as to why particular estimators and statistical
next section.                                                            tests have been chosen.
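The standard-error comparison described here can be reproduced in a few lines of code. The sketch below uses an illustrative contamination scheme (all names are invented for this example): each draw has a 10% chance of coming from a normal with standard deviation 10, giving a population variance of 0.9·1 + 0.1·100 = 10.9, matching the contaminated normal discussed earlier.

```python
import random
import statistics

def contaminated_sample(n, rng, p=0.1, wide_sd=10.0):
    """Standard normal draws, each replaced by a wide normal with probability p."""
    return [rng.gauss(0, wide_sd if rng.random() < p else 1.0) for _ in range(n)]

def trimmed_mean(xs, gamma=0.2):
    """20% trimmed mean: drop the top and bottom gamma proportion, then average."""
    xs = sorted(xs)
    g = int(gamma * len(xs))
    return statistics.mean(xs[g:len(xs) - g])

rng = random.Random(1)
means, tmeans = [], []
for _ in range(2000):
    s = contaminated_sample(40, rng)     # N=40, as in the power example
    means.append(statistics.mean(s))
    tmeans.append(trimmed_mean(s))

# Standard error = spread of an estimator across repeated random samples
print(statistics.stdev(means))   # sample mean: inflated by outliers
print(statistics.stdev(tmeans))  # 20% trimmed mean: noticeably smaller
```

Across repeated samples from the contaminated population, the 20% trimmed mean varies far less than the sample mean, which is exactly the standard-error behavior the simulator is designed to make visible.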
Robust statistics simulator
Having a library of robust statistical functions is not enough to
make modern methods commonplace in applied research. Ed-                 This paper gives an overview of the issues associated with the
ucators and practitioners still need intuitive training tools that       normal curve. The concern with traditional methods, in terms of
demonstrate the core issues surrounding classical methods and            robustness to violations of normality, have been known for over
how robust analogues compare.                                            a half century and modern alternatives have been recommended;
    As mentioned, computational notebooks that run in the cloud          however, for various reasons that have been discussed, modern
offer a unique solution to learning beyond that of static textbooks      robust methods have not yet become commonplace in applied
and documentation. Learning can be interactive and exploratory           research settings.
since narration, visualization, widgets (e.g., buttons, slider bars),        One reason is the lack of easy-to-use software and teaching
and code can all be experienced in a ready-to-go compute envi-           resources for robust statistics. To help fill this gap, Hypothesize, a
ronment—with no overhead related to local environment setup.             peer-reviewed and open-source Python library was developed. In
    As a compendium to Hypothesize, and a resource for un-               addition, to help clearly demonstrate and visualize the advantages
derstanding and teaching robust statistics in general, the robust        of robust methods, the robust statistics simulator was created.
statistics simulator repository has been developed. It is a notebook-    Using these tools, practitioners can begin to integrate robust
based collection of interactive demonstrations aimed at clearly and      statistical methods into their inferential testing repertoire.
visually explaining the conditions under which classic methods
fail relative to robust methods. A hosted notebook with the              Acknowledgements
rendered visualizations of the simulations can be accessed here.         The author would like to thank Karlynn Chan and Rand R. Wilcox
and seen in Figure 4. Since the simulations run in the browser and       as well as Elizabeth Dlha and the entire Deepnote team for their
require very little understanding of code, students and teachers can     support of this project. In addition, the author would like to thank
easily onboard to the study of robust statistics.                        Kelvin Lee for his insightful review of this manuscript.


       Python for Global Applications: teaching scientific
       Python in context to law and diplomacy students
Anna Haensch‡§∗, Karin Knudson‡§


Abstract—For students across domains and disciplines, the message has been communicated loud and clear: data skills are an essential qualification for today’s job market. This includes not only the traditional introductory stats coursework but also machine learning, artificial intelligence, and programming in Python or R. Consequently, there has been significant student-initiated demand for data analytic and computational skills, sometimes with very clear objectives in mind, and other times guided by a vague sense of “the work I want to do will require this.” Now we have options. If we train students using “black box” algorithms without attending to the technical choices involved, then we run the risk of unleashing practitioners who might do more harm than good. On the other hand, courses that completely unpack the “black box” can be so steeped in theory that the barrier to entry becomes too high for students from social science and policy backgrounds, thereby excluding critical voices. In sum, both of these options lead to a pitfall that has gained significant media attention over recent years: the harms caused by algorithms that are implemented without sufficient attention to human context. In this paper, we - two mathematicians turned data scientists - present a framework for teaching introductory data science skills in a highly contextualized and domain-flexible environment. We will present example course outlines at the semester, weekly, and daily level, and share materials that we think hold promise.

Index Terms—computational social science, public policy, data science, teaching with Python

∗ Corresponding author:
‡ Tufts University
§ Data Intensive Studies Center

Copyright © 2022 Anna Haensch et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

As data science continues to gain prominence in the public eye, and as we become more aware of the many facets of our lives that intersect with data-driven technologies and policies every day, universities are broadening their academic offerings to keep up with what students and their future employers demand. Not only are students hoping to obtain more hard skills in data science (e.g. Python programming experience), but they are interested in applying tools of data science across domains that haven’t historically been part of the quantitative curriculum. The Master of Arts in Law and Diplomacy (MALD) is the flagship program of the Fletcher School of Law and International Diplomacy at Tufts University. Historically, the program has contained core elements of quantitative reasoning with a focus on business, finance, and international development, as is typical in graduate programs in international relations. Like academic institutions more broadly, the students and faculty at the Fletcher School are eager to seize upon our current data moment to expand their quantitative offerings. With this in mind, The Fletcher School reached out to the co-authors to develop a course in data science, situated in the context of international diplomacy.

In response, we developed the (Python-based) course, Data Science for Global Applications, which had its inaugural offering in the Spring semester of 2022. The course had 30 enrolled Fletcher School students, primarily from the MALD program. When the course was announced we had a flood of interest from Fletcher students who were extremely interested in broadening their studies with this course. With the goal of keeping a close interactive atmosphere, we capped enrollment at 30. To inform the direction of our course, we surveyed students on their background in programming (see Fig. 1) and on their motivations for learning data science (see Fig. 2). Students reported only very limited experience with programming - if any at all - with that experience primarily in Excel and Tableau. Student motivations varied, but the goal to get a job where they were able to make a meaningful social impact was the primary motivation.

Fig. 1: The majority of the 30 students enrolled in the course had little to no programming experience, and none reported having “a lot” of experience. Those who did have some experience were most likely to have worked in Excel or Tableau.

The MALD program, which is interdisciplinary by design, provides ample footholds for domain-specific data science. Keeping this in mind, as a throughline for the course, each student worked to develop their own quantitative policy project. Coursework and discussions were designed to move this project forward from initial policy question, to data sourcing and visualizing, and eventually to modeling and analysis.

In what follows we will describe how we structured our course with the goal of empowering beginner programmers to use Python for data science in the context of international relations

and diplomacy. We will also share details about course content and structure, methods of assessment, and Python programming resources that we deployed through Google Colab. All of the materials described here can be found on the public course page.

Fig. 2: The 30 enrolled students were asked to indicate which were relevant motivations for taking the course. Curiosity and a desire to make a meaningful social impact were among the top motivations our students expressed.

Course Philosophy and Goals

Our high level goals for the course were i) to empower students with the skills to gain insight from data using Python and ii) to deepen students’ understanding of how the use of data science affects society. As we sought to achieve these high level goals within the limited time scope of a single semester, the following core principles were essential in shaping our course design. Below, we briefly describe each of these principles and share some examples of how they were reflected in the course structure. In a subsequent section we will more precisely describe the content of the course, whereupon we will further elaborate on these principles and share instructional materials. But first, our core principles:

Connecting the Technical and Social

To understand the impact of data science on the world (and the potential policy implications of such impact), it helps to have hands-on practice with data science. Conversely, to effectively and ethically practice data science, it is important to understand how data science lives in the world. Thus, the “hard” skills of coding, wrangling data, visualizing, and modeling are best taught intertwined with a robust study of ways in which data science is used and misused.

There is an increasing need to educate future policy-makers with knowledge of how data science algorithms can be used and misused. One way to approach meeting this need, especially for students within a less technically-focused program, would be to teach students about how algorithms can be used without actually teaching them to use algorithms. However, we argue that students will gain a deeper understanding of the societal and ethical implications of data science if they also have practical data science skills. For example, a student could gain a broad understanding of how biased training data might lead to biased algorithmic predictions, but such understanding is likely to be deeper and more memorable when a student has actually practiced training a model using different training data. Similarly, someone might understand in the abstract that the way missing data are handled can substantially affect the outcome of an analysis, but will likely have a stronger understanding if they have had to consider how to deal with missing data in their own project.

We used several course structures to support connecting data science and Python “skills” with their context. Students had readings and journaling assignments throughout the semester on topics that connected data science with society. In their journal responses, students were asked to connect the ideas in the reading to their other academic/professional interests, or ideas from other classes, with the following prompt:

     Your reflection should be a 250-300 word narrative. Be sure to tie the reading back into your own studies, experiences, and areas of interest. For each reading, come up with 1-2 discussion questions based on the concepts discussed in the readings. This can be a curiosity question, where you’re interested in finding out more, a critical question, where you challenge the author’s assumptions or decisions, or an application question, where you think about how concepts from the reading would apply to a particular context you are interested in.1

1. This journaling prompt was developed by our colleague Desen Ozkan at Tufts University.

These readings (highlighted in gray in Fig. 3), assignments, and the related in-class discussions were interleaved among Python exercises meant to give students practice with skills including manipulating DataFrames in pandas [The22], [Mck10], plotting in Matplotlib [Hun07] and seaborn [Was21], mapping with GeoPandas [Jor21], and modeling with scikit-learn [Ped11]. Student projects included a thorough data audit component requiring students to explore data sources and their human context in detail. Precise details and language around the data audit can be found on the course website.

Managing Fears & Concerns Through Supported Programming

We surmised that students who are new to programming and possibly intimidated by learning the unfamiliar skill would do well in an environment that included plenty of what we call supported programming - that is, practicing programming in class with immediate access to instructor and peer support.

In the pre-course survey we created, many students identified concerns about their quantitative preparation, whether they would be able to keep up with the course, and how hard programming might be. We sought to acknowledge these concerns head-on, assure students of our full confidence in their ability to master the material, and provide them with all the resources they needed to succeed.

A key resource to which we thought all students needed access was instructor attention. In addition to keeping the class size capped at 30 people, with both co-instructors attending all course meetings, we structured class time to maximize the time students spent actually doing data science in class. We sought to keep demonstrations short, and intersperse them with coding exercises so that students could practice with new ideas right away. Our Colab notebooks included in the course materials show one way that we wove student practice time throughout. Drawing insight from social practice theory of learning (e.g. [Eng01], [Pen16]), we sought to keep in mind how individual practice and learning pathways develop in relation to their particular social and

institutional context. Crucially, we devoted a great deal of in-class time to students doing data science, and a great deal of energy into making this practice time a positive and empowering social experience. During student practice time, we were circulating throughout the room, answering student questions, helping students problem solve and debug, and encouraging students to work together and help each other. A small organizational change we made in the first weeks of the semester that proved to have outsized impact was moving our office hours to directly after class in an almost-adjacent room, to make it as easy as possible for students to attend. Students were vocal in their appreciation of office hours.

We contend that the value of supported programming time is two-fold. First, it helps beginning programmers learn more quickly. While learning to code necessarily involves challenges, students new to a language can sometimes struggle for an unproductively long time on things like simple syntax issues. When students have help available, they can move forward from minor issues faster and move more efficiently into building a meaningful understanding. Secondly, supported programming time helps students to understand that they are not alone in the challenges they are facing in learning to program. They can see other students learning and facing similar challenges, can have the empowering experience of helping each other out, and when asking for help can notice that even their instructors sometimes rely on resources like StackOverflow. An unforeseen benefit we believe co-teaching had was to give us as instructors the opportunity to consult with each other during class time and share different approaches. These instructor interactions modeled for students how even as experienced practitioners of data science, we too were constantly learning.

Lastly, a small but (we thought) important aspect of our setup was teaching students to set up a computing environment on their own laptops, with Python, conda [Ana16], and JupyterLab [Pro22]. Using the command line and moving from an environment like Google Colab to one’s own computer can both present significant barriers, but doing so successfully can be an important part of helping students feel like ‘real’ programmers. We devoted an entire class period to helping students with installation and setup on their own computers.

We considered it an important measure of success how many students told us at the end of the course that the class had helped them overcome sometimes longstanding feelings that technical skills like coding and modeling were not for them.

Leveraging Existing Strengths To Enhance Student Ownership

Even as beginning programmers, students are capable of creating a meaningful policy-related data science project within the semester, starting from formulating a question and finding relevant datasets. Working on the project throughout the semester (not just at the end) gave essential context to data science skills, as students could translate each new idea into what it might mean for "their" data. Giving students wide leeway in their project topic allowed the project to be a point of connection between new data science skills and their existing domain knowledge. Students chose projects within their particular areas of interest or expertise, and a number chose to additionally connect their project for this course to their degree capstone project.

Project benchmarks were placed throughout the semester (highlighted in green in Fig. 3), allowing students a concrete way to develop their new skills in identifying datasets, loading and preparing data for exploratory data analysis, visualizing and annotating data, and finally modeling and analyzing data. All of this was done with the goal of answering a policy question developed by the student, allowing the student to flex some domain expertise to supplement the (sometimes overwhelming!) programmatic components.

Our project explicitly required that students find two datasets of interest and merge them for the final analysis. This presented both logistical and technical challenges. As one student pointed out after finally finding open data: hearing people talk about the need for open data is one thing, but you really realize what that means when you’ve spent weeks trying to get access to data that you know exists. Understanding the provenance of the data they were working with helped students assess the biases and limitations, and also gave students a strong sense of ownership over their final projects. An unplanned consequence of the broad scope of the policy project was that we, the instructors, learned nearly as much about international diplomacy as the students learned about programming and data science, a bidirectional exchange of knowledge that we surmise contributed to students’ feelings of empowerment and a positive class environment.

Course Structure

We broke the course into three modules, each with focused reading/journaling topics, Python exercises, and policy project benchmarks: (i) getting and cleaning data, (ii) visualizing data, and (iii) modeling data. In what follows we will describe the key goals of each module and highlight the readings and exercises that we compiled to work towards these goals.

Getting and Cleaning Data

Getting, cleaning, and wrangling data typically make up a significant proportion of the time involved in a data science project. Therefore, we devoted significant time in our course to learning these skills, focusing on loading and manipulating data using pandas. Key skills included loading data into a pandas DataFrame, working with missing data, and slicing, grouping, and merging DataFrames in various ways. After initial exposure and practice with example datasets, students applied their skills to wrangling the diverse and sometimes messy and large datasets that they found for their individual projects. Since one requirement of the project was to integrate more than one dataset, merging was of particular importance.

During this portion of the course, students read and discussed Boyd and Crawford’s Critical Questions for Big Data [Boy12], which situates big data in the context of knowledge itself and raises important questions about access to data and privacy. Additional readings included selected chapters from D’Ignazio and Klein’s Data Feminism [Dig20], which highlights the importance of what we choose to count and what it means when data is missing.

Visualizing Data

A fundamental component of communicating findings from data is well-executed data visualization. We chose to place this module in the middle of the course, since it was important that students have a common language for interpreting and communicating their analysis before moving to the more complicated aspects of data modeling. In developing this common language, we used Wilke’s Fundamentals of Data Visualization [Wil19] and Cairo’s How

Fig. 3: Course outline for a 13-week semester with two 70 minute instructional blocks each week. Course readings are highlighted in gray and
policy project benchmarks are highlighted in green.

Chart’s Lie [Cai19] as a backbone for this section of the course. In addition to reading the text materials, students were tasked with finding visualizations “in the wild,” both good and bad. Course discussions centered on the found visualizations, with Wilke and Cairo’s writings as a common foundation. From the readings and discussions, students became comfortable with the language and taxonomy around visualizations and began to develop a better appreciation of what makes a visualization compelling and readable. Students were able to formulate a plan for how they could best visualize their data. The next task was to translate these plans into Python.

To help students gain a level of comfort with data visualization in Python, we provided instruction and examples of working with a variety of charts using Matplotlib and seaborn, as well as maps and choropleths using GeoPandas, and gave students programming assignments that involved writing code to create a visualization matching one in an image. With that practical grounding, students were ready to visualize their own project data using Python. Having the concrete target of how a student wanted their visualization to look seemed to be a motivating starting point from which to practice coding and debugging. We spent several class periods on supported programming time for students to develop their visualizations.

Working on building the narratives of their project and developing their own visualizations in the context of the course readings gave students a heightened sense of attention to detail. During one day of class when students shared visualizations and gave feedback to one another, students commented and inquired about incredibly small details of each other’s presentations, for example, how to adjust y-tick alignment on a horizontal bar chart. This sort of tiny detail is hard to convey in a lecture, but gains outsized importance when a student has personally wrestled with it.

Modeling Data

In this section we sought to expose students to introductory approaches in each of regression, classification, and clustering

in Python. Specifically, we practiced using scikit-learn to work with linear regression, logistic regression, decision trees, random forests, and Gaussian mixture models. Our focus was not on the theoretical underpinnings of any particular model, but rather on the kinds of problems that regression, classification, or clustering models, respectively, are able to solve, as well as some basic ideas about model assessment. The uniform and approachable scikit-learn API [Bui13] was crucial in supporting this focus, since it allowed us to focus less on syntax around any one model, and more on the larger contours of modeling, with all its associated promise and perils. We spent a good deal of time building an understanding of train-test splits and their role in model assessment.

Student projects were required to include a modeling component. Just the process of deciding which of regression, classification, or clustering is appropriate for a given dataset and policy question is highly non-trivial for beginners. The diversity of student projects and datasets meant students had to grapple with this decision process in its full complexity. We were delighted by the variety of modeling approaches students used in their projects, as well as by students’ thoughtful discussions of the limitations of their analyses.

To accompany this section of the course, students were assigned readings focusing on some of the societal impacts of data modeling and algorithms more broadly. These readings included a chapter from O’Neil’s Weapons of Math Destruction [One16] as well as Buolamwini and Gebru’s Gender Shades [Buo18]. Both of these readings emphasize the capacity of algorithms to exacerbate inequalities and highlight the importance of transparency and ethical data practices. These readings resonated especially strongly with our students, many of whom had recently taken courses in cyber policy and ethics in artificial intelligence.

And finally, to supplement the technical components of the course, we also had readings with associated journal entries submitted at a cadence of roughly two per module. Journal prompts are described above and available on the course website.

Conclusion

Various listings of key competencies in data science have been proposed [NAS18]. For example, [Dev17] suggests the following pillars for an undergraduate data science curriculum: computational and statistical thinking, mathematical foundations, model building and assessment, algorithms and software foundation, data curation, and knowledge transference—communication and responsibility. As we sought to contribute to the training of data-science informed practitioners of international relations, we focused on helping students build an initial competency especially in the last four of these.

We can point to several key aspects of the course that made it successful. Primary among them was the fact that the majority of class time was spent in supported programming. This meant that students were able to ask their instructors or peers as soon as questions arose. Novice programmers who aren’t part of a formal computer science program often don’t have immediate access to the resources necessary to get “unstuck.” For the novice programmer, even learning how to google technical terms can be a challenge. This sort of immediate debugging and feedback helped students remain confident and optimistic about their projects. This was made all the more effective since we were co-teaching the course and had double the resources to troubleshoot. Co-teaching also had the unforeseen benefit of making our classroom a place where the growth mindset was actively modeled and nurtured: where one instructor wasn’t able to answer a question, the other instructor often could. Finally, it was precisely the motivation of
Formal assessment was based on four components, already alluded         learning data science in context that allowed students to maintain a
to throughout this note. The largest was the ongoing policy             sense of ownership over their work and build connections between
project which had benchmarks with rolling due dates throughout          their other courses.
the semester. Moreover, time spent practicing coding skills in              Learning programming from the ground up is difficult. Stu-
class was often done in service of the project. For example, in         dents arrive excited to learn, but also nervous and occasionally
week 4, when students learned to set up their local computing           heavy with the baggage they carry from prior experience in
environments, they also had time to practice loading, reading, and      quantitative courses. However, with a sufficient supported learning
saving data files associated with their chosen project datasets. This   environment it’s possible to impart relevant skills. It was a measure
brought challenges, since often students sitting side-by-side were      of the success of the course how many students told us that the
dealing with different operating systems and data formats. But          course had helped them overcome negative prior beliefs about
from this challenge emerged many organic conversations about            their ability to code. Teaching data science skills in context and
file types and the importance of naming conventions. The rubric         with relevant projects that leverage students’ existing expertise and
for the final project is shown in Fig 4.                                outside reading situates the new knowledge in a place that feels
     The policy project culminated with in-class “micro presenta-       familiar and accessible to students. This contextualization allows
tions” and a policy paper. We dedicated two days of class in week       students to gain some mastery while simultaneously playing to
13 for in-class presentations, for which each student presented         their strengths and interests.
one slide consisting of a descriptive title, one visualization, and
several “key takeaways” from the project. This extremely restric-
tive format helped students to think critically about the narrative     R EFERENCES
information conveyed in a visualization, and was designed to
create time for robust conversation around each presentation.           [Ana16] Anaconda Software Distribution. Computer software. Vers. 2-2.4.0.
     In addition to the policy project, each of the three course                Anaconda, Nov. 2016. Web.
                                                                        [Boy12] Boyd, Danah, and Kate Crawford. Critical questions for big data:
modules also had an associated set of Python exercises (available               Provocations for a cultural, technological, and scholarly phe-
on the course website). Students were given ample time both in                  nomenon. Information, communication & society 15.5 (2012):662-
and out of class to ask questions about the exercises. Overall, these           679.
exercises proved to be the most technically challenging component       [Bui13] Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa,
                                                                                Andreas Mueller, Olivier Grisel, Vlad Niculae et al. API design for
of the course, but we invited students to resubmit after an initial             machine learning software: experiences from the scikit-learn project.
round of grading.                                                               arXiv preprint arXiv:1309.0238 (2013).
74                                                                                                  PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

        Fig. 4: Rubric for the policy project that formed a core component of the formal assessment of students throughout the course.

[Buo18] Buolamwini, Joy, and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency. PMLR, 2018.
[Cai19] Cairo, Alberto. How charts lie: Getting smarter about visual information. WW Norton & Company, 2019.
[Dev17] De Veaux, Richard D., Mahesh Agarwal, Maia Averett, Benjamin S. Baumer, Andrew Bray, Thomas C. Bressoud, Lance Bryant et al. Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application 4 (2017): 15-30.
[Dig20] D'Ignazio, Catherine, and Lauren F. Klein. Data Feminism. MIT Press, 2020.
[Eng01] Engeström, Yrjö. Expansive learning at work: Toward an activity theoretical reconceptualization. Journal of Education and Work 14, no. 1 (2001): 133-156.
[Hun07] Hunter, J.D. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, vol. 9, no. 3 (2007): 90-95.
[Jor21] Jordahl, Kelsey et al. 2021. geopandas/geopandas: v0.10.2. Zenodo.
[Mck10] McKinney, Wes. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, vol. 445, no. 1, pp. 51-56. 2010.
[NAS18] National Academies of Sciences, Engineering, and Medicine. Data science for undergraduates: Opportunities and options. National Academies Press, 2018.
[One16] O'Neil, Cathy. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016.
[Ped11] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011): 2825-2830.
[Pen16] Penuel, William R., Daniela K. DiGiacomo, Katie Van Horne, and Ben Kirshner. A Social Practice Theory of Learning and Becoming across Contexts and Time. Frontline Learning Research 4, no. 4 (2016): 30-38.
[Pro22] Project Jupyter, 2022. jupyterlab/jupyterlab: JupyterLab 3.4.3.
[The22] The Pandas Development Team, 2022. pandas-dev/pandas: Pandas 1.4.2. Zenodo.
[Was21] Waskom, Michael L. Seaborn: statistical data visualization. Journal of Open Source Software 6, no. 60 (2021): 3021.
[Wil19] Wilke, Claus O. Fundamentals of data visualization: a primer on making informative and compelling figures. O'Reilly Media, 2019.

             Papyri: better documentation for the scientific
                          ecosystem in Jupyter
Matthias Bussonnier‡§∗, Camille Carvalho¶‖


∗ Corresponding author:
‡ QuanSight, Inc
§ Digital Ours Lab, SARL.
¶ University of California Merced, Merced, CA, USA
‖ Univ Lyon, INSA Lyon, UJM, UCBL, ECL, CNRS UMR 5208, ICJ, F-69621, France

Copyright © 2022 Matthias Bussonnier et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—We present here the idea behind Papyri, a framework we are developing to provide a better documentation experience for the scientific ecosystem. In particular, we wish to provide a documentation browser (from within Jupyter or other IDEs and Python editors) that gives a unified experience, cross-library navigation, search, and indexing. By decoupling documentation generation from rendering, we hope this can help address some of the documentation accessibility concerns, and allow customisation based on users' preferences.

Index Terms—Documentation, Jupyter, ecosystem, accessibility

Introduction

Over the past decades, the Python ecosystem has grown rapidly, and one of the last bastions where some of the proprietary competing tools shine is integrated documentation. Indeed, open-source libraries are usually developed in distributed settings that can make it hard to develop coherent and integrated systems.

While a number of tools and documentations exist (and improvements are made every day), most efforts attempt to build documentation in an isolated way, inherently creating a heterogeneous framework. The consequences are twofold: (i) it becomes difficult for newcomers to grasp the tools properly, and (ii) there is a lack of cohesion and of a unified framework, due to library authors making their own choices as well as having to maintain build scripts or services.

Many users, colleagues, and members of the community have been frustrated with the documentation experience in the Python ecosystem. Given a library, who hasn't struggled to find the "official" website for its documentation? Often, users stumble across an old documentation version that is better ranked in their favorite search engine, and this significantly impacts the learning process of less experienced users.

On users' local machines, this process is affected by limited documentation rendering. Indeed, while in many Integrated Development Environments (IDEs) the inspector provides some documentation, users do not get access to the narrative, or the full documentation gallery. For Command Line Interface (CLI) users, documentation is often displayed as raw source where no navigation is possible. On the maintainers' side, the final documentation rendering is less of a priority. Rather, maintainers should aim at making users gain from improvements in the rendering without having to rebuild all the docs.

Conda-Forge [CFRG] has shown that concerted efforts can give a much better experience to end-users, and in today's world, where it is ubiquitous to share library sources on code platforms, perform continuous integration, and use many other tools, we believe a better documentation framework for many of the libraries of the scientific Python ecosystem should be available.

Thus, against all advice we received and based on our own experience, we have decided to rebuild an opinionated documentation framework, from scratch, and with minimal dependencies: Papyri. Papyri focuses on building an intermediate documentation representation format that lets us decouple building and rendering the docs. This greatly simplifies many operations and gives us access to many desired features that were not available up to now.

In what follows, we provide the framework in which Papyri has been created and present its objectives (context and goals), we describe the Papyri features (format, installation, and usage), then present its current implementation. We end this paper with comments on current challenges and future work.

Context and objectives

Throughout the paper, we will draw several comparisons between documentation building and compiled languages. Also, we will borrow and adapt commonly used terminology. In particular, similarities with "ahead-of-time" (AOT) [AOT], "just-in-time" (JIT) [JIT], intermediate representation (IR) [IR], link-time optimization (LTO) [LTO], and static vs dynamic linking will be highlighted. This allows us to clarify the presentation of the underlying architecture. However, there is no requirement to be familiar with the above to understand the concepts underneath Papyri. In that context, we wish to discuss documentation building as a process from a source code meant for a machine to a final output targeting the flesh and blood machine between the keyboard and the chair.

Current tools and limitations

In the scientific Python ecosystem, it is well known that Docutils [docutils] and Sphinx [sphinx] are major cornerstones for publishing HTML documentation for Python. In fact, they are used by all the libraries in this ecosystem. While a few alternatives exist, most tools and services have some internal knowledge of Sphinx. For instance, Read the Docs [RTD] provides a specific

Sphinx theme [RTD-theme] users can opt in to, Jupyter-Book [JPYBOOK] is built on top of Sphinx, and the MyST parser [MYST] (which is made to allow markdown in documentation) targets Sphinx as a backend, to name a few. All of the above provide "ahead-of-time" documentation compilation and rendering, which is slow and computationally intensive. When a project needs specific plugins, extensions, and configurations to properly build (which is almost always the case), it is relatively difficult to build documentation for a single object (like a single function, module, or class). This makes AOT tools difficult to use for interactive exploration. One can then consider a JIT approach, as done for Docrepr [DOCREPR] (integrated both in Jupyter and Spyder [Spyder]). However, in that case, interactive documentation lacks inline plots, crosslinks, indexing, search, and many custom directives.

Some of the above limitations are inherent to the design of documentation build tools that were intended for a separate documentation construction. While Sphinx does provide features like intersphinx, link resolutions are done at the documentation building phase. Thus, this is inherently unidirectional, and can break easily. To illustrate this, we consider NumPy [NP] and SciPy [SP], two extremely close libraries. In order to obtain properly cross-linked documentation, one is required to perform at least five steps:

• build NumPy documentation
• publish the NumPy object.inv file
• (re)build SciPy documentation using NumPy's obj.inv
• publish the SciPy object.inv file
• (re)build NumPy docs to make use of SciPy's obj.inv

Only then can both SciPy's and NumPy's documentation refer to each other. As one can expect, cross links break every time a new version of a library is published¹. Pre-produced HTML in IDEs and other tools is then prone to error and difficult to maintain. This also raises security issues: some institutions become reluctant to use tools like Docrepr or to view pre-produced HTML.

Docstrings format

The Numpydoc format is ubiquitous in the scientific ecosystem [NPDOC]. It is loosely based on reStructuredText (RST) syntax, and despite supporting full RST syntax, docstrings rarely contain full-featured directives. Maintainers are confronted with the following dilemma:

• keep the docstrings simple. This means mostly text-based docstrings with few directives, for efficient readability. The end-user may be exposed to the raw docstring, with no on-the-fly directive interpretation. This is the case for tools such as IPython and Jupyter.
• write an extensive docstring. This includes references and directives that potentially create graphics, tables, and more, allowing an enriched end-user experience. However, this may be computationally intensive, and executing code to view docs could be a security risk.

Other factors impact this choice: (i) users, (ii) format, (iii) runtime. IDE users or non-terminal users motivate pushing for extensive docstrings. Tools like Docrepr can mitigate this problem by allowing partial rendering. However, users are often exposed to raw docstrings (see for example the SymPy discussion² on how equations should be displayed in docstrings, and the left panel of Figure 1). In terms of format, markdown is appealing; however, inconsistencies in the rendering will be created between libraries. Finally, some libraries can dynamically modify their docstrings at runtime. While this sometimes avoids using directives, it ends up being more expensive (runtime costs, complex maintenance, and contribution costs).

Fig. 1: This screenshot shows the help for scipy.signal.dpss as currently accessible (left), and as shown by the Papyri extension for JupyterLab (right). An extended version of the right panel is displayed in Figure 4.

Objectives of the project

We now lay out the objectives of the Papyri documentation framework. Let us emphasize that the project is in no way intended to replace or cover many features included in well-established documentation tools such as Sphinx or Jupyter-Book. Those projects are extremely flexible and meet the needs of their users for publishing a standalone documentation website or PDFs. The Papyri project addresses the specific documentation challenges mentioned above; we present below what is (and what is not) in the scope of work.

Goal (a): design a non-generic (not fully customisable) website builder. When authors want or need complete control of the output, wide personalisation options, or branding, then Papyri is likely not the project to look at. That is to say, single-project websites where appearance, layout, and domain need to be controlled by the author are not part of the objectives.

Goal (b): create a uniform documentation structure and syntax. The Papyri project prescribes stricter requirements in terms of format, structure, and syntax compared to other tools such as Docutils and Sphinx. When possible, the documentation follows the Diátaxis framework [DT]. This provides a uniform documentation setup and syntax, simplifying contributions to the project and easing error catching at compile time. Such a strict environment is qualitatively supported by a number of documentation fixes done upstream during the development stage of the project³. Since Papyri is not fully customisable, users who are already using documentation tools such as Sphinx, MkDocs [mkdocs], and others should expect their project to require minor modifications to work with Papyri.

Goal (c): provide accessibility and user proficiency. Accessibility is a top priority of the project. To that aim, items are associated with semantic meaning as much as possible, and

1. ipython/ipython#12210, numpy/numpy#21016, & #29073
2. sympy/sympy#14963
3. Tests have been performed on NumPy, SciPy.

documentation rendering is separated from documentation build-         Intermediate Representation for Documentation (IRD)
ing phase. That way, accessibility features such as high contract
                                                                               IRD format: Papyri relies on standard interchangeable
themes (for better text-to-speech (TTS) raw data), early example
                                                                       "Intermediate Representation for Documentation" (IRD) format.
highlights (for newcomers) and type annotation (for advanced
                                                                       This allows to reduce operation complexity of the documentation
users) can be quickly available. With the uniform documentation
                                                                       build. For example, given M documentation producers and N
structure, this provides a coherent experience where users become
                                                                       renderers, a full documentation build would be O(MN) (each
more comfortable finding information in a single location (see
                                                                       renderer needs to understand each producer). If each producer only
Figure 1).
                                                                       cares about producing IRD, and if each renderer only consumes it,
    Goal (d): make documentation building simple, fast, and            then one can reduce to O(M+N). Additionally, one can take IRD
independent. One objective of the project is to make documenta-        from multiple producers at once, and render them all to a single
tion installation and rendering relatively straightforward and fast.   target, breaking the silos between libraries.
To that aim, the project includes relative independence of doc-
                                                                           At the moment, IRD files are currently separated into four
umentation building across libraries, allowing bidirectional cross
                                                                       main categories roughly following the Diátaxis framework [DT]
links (i.e. both forward and backward links between pages) to
                                                                       and some technical needs:
be maintained more easily. In other words, a single library can be
built without the need to access documentation from another. Also,         •   API files describe the documentation for a single ob-
the project should include straightforward lookup documentation                ject, expressed as a JSON object. When possible, the
for an object from the interactive read–eval–print loop (REPL).                information is encoded semantically (Objective (c)). Files
Finally, efforts are put to limit the installation speed (to avoid             are organized based on the fully-qualified name of the
polynomial growth when installing packages on large distributed                Python object they reference, and contain either absolute
systems).                                                                      reference to another object (library, version and identi-
                                                                               fier), or delayed references to objects that may exist in
                                                                               another library. Some extra per-object meta information
The Papyri solution
                                                                               like file/line number of definitions can be stored as well.
In this section we describe in more detail how Papyri has been             •   Narrative files are similar to API files, except that they do
implemented to address the objectives mentioned above.                         not represent a given object, but possess a previous/next
                                                                               page. They are organised in an ordered tree related to the
                                                                               table of content.
Making documentation a multi-step process
                                                                           •   Example files are a non-ordered collection of files.
When using current documentation tools, customisation made by              •   Assets files are untouched binary resource archive files that
maintainers usually falls into the following two categories:                   can be referenced by any of the above three ones. They are
                                                                               the only ones that contain backward references, and no
   •   simpler input convenience,                                              forward references.
   •   modification of final rendering.
                                                                           In addition to the four categories above, metadata about the
     This first category often requires arbitrary code execution and   current package is stored: this includes library name, current
must import the library currently being built. This is the case        version, PyPi name, GitHub repository slug4 , maintainers’ names,
for example for the use of .. code-block:::, or custom                 logo, issue tracker and others. In particular, metadata allows
:rc: directive. The second one offers a more user friendly en-         us to auto-generate links to issue trackers, and to source files
vironment. For example, sphinx-copybutton [sphinx-copybutton]          when rendering. In order to properly resolve some references and
adds a button to easily copy code snippets in a single click,          normalize links convention, we also store a mapping from fully
and pydata-sphinx-theme [pydata-sphinx-theme] or sphinx-rtd-           qualified names to canonical ones.
dark-mode provide a different appearance. As a consequence,
                                                                           Let us make some remarks about the current stage of IRD for-
developers must make choices on behalf of their end-users: this
                                                                       mat. The exact structure of package metadata has not been defined
may concern syntax highlights, type annotations display, light/dark
                                                                       yet. At the moment it is reduced to the minimum functionality.
                                                                       While formats such as codemeta [CODEMETA] could be adopted,
     Being able to modify extensions and re-render the documenta-      in order to avoid information duplication we rely on metadata
tion without the rebuilding and executing stage is quite appealing.    either present in the published packages already or extracted from
Thus, the building phase in Papyri (collecting documentation           Github repository sources. Also, IRD files must be standardized
information) is separated from the rendering phase (Objective (c)):    in order to achieve a uniform syntax structure (Objective (b)).
at this step, Papyri has no knowledge and no configuration options     In this paper, we do not discuss IRD files distribution. Last, the
that permit to modify the appearance of the final documentation.       final specification of IRD files is still in progress and regularly
Additionally, the optional rendering process has no knowledge of       undergoes major changes (even now). Thus, we invite contributors
the building step, and can be run without accessing the libraries      to consult the current state of implementation on the GitHub
involved.                                                              repository [Papyri]. Once the IRD format is more stable, this will
     This kind of technique is commonly used in the field of           be published as a JSON schema, with full specification and more
compilers with the usage of Single Compilation Unit [SCU] and          in-depth description.
Intermediate Representation [IR], but to our knowledge, it has not
been implemented for documentation in the Python ecosystem.
                                                                         4. "slug" is the common term that refers to the various combinations
As mentioned before, this separation is key to achieving many          of organization name/user name/repository name, that uniquely identifies a
features proposed in Objectives (c), (d) (see Figure 2).               repository on a platform like GitHub.
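The decoupling argument above can be sketched in a few lines of Python. This is an illustration only, not Papyri's actual API or IRD schema: all names (produce_ird, render_text, render_html, the dictionary keys) are hypothetical, and the real IRD format is richer and still evolving. The point is that each producer and each renderer only needs to know the shared intermediate representation, so M producers and N renderers require M+N implementations rather than M×N.

```python
# Hypothetical sketch of the producer -> IR -> renderer decoupling.
# None of these names come from Papyri itself.

import json

def produce_ird(qualname, summary, version, references=()):
    """A toy 'producer': emit one IRD-like record for one API object.

    Records are keyed by the fully-qualified name of the object they
    document; references are absolute (library, version, identifier)
    triples, mirroring the structure described in the text.
    """
    return {
        "kind": "api",
        "qualname": qualname,
        "version": version,
        "summary": summary,
        "references": [list(r) for r in references],
    }

def render_text(ird):
    """A toy 'renderer': consumes only the IR, never imports the library."""
    refs = ", ".join(r[2] for r in ird["references"]) or "none"
    return f"{ird['qualname']} ({ird['version']})\n{ird['summary']}\nSee also: {refs}"

def render_html(ird):
    """A second renderer reusing the exact same IR."""
    return f"<h1>{ird['qualname']}</h1><p>{ird['summary']}</p>"

# Two producers (two libraries) and two renderers: 2 + 2 implementations
# instead of 2 x 2, because both sides agree on the intermediate format.
numpy_ird = produce_ird("numpy.mean", "Compute the arithmetic mean.", "1.22",
                        references=[("scipy", "1.8", "scipy.stats.tmean")])
scipy_ird = produce_ird("scipy.stats.tmean", "Compute the trimmed mean.", "1.8")

for ird in (numpy_ird, scipy_ird):
    print(render_text(ird))
    print(render_html(ird))

# The IR round-trips through JSON, so building and rendering can run
# in separate processes, or on separate machines.
assert json.loads(json.dumps(numpy_ird)) == numpy_ird
```

Because the renderers never import NumPy or SciPy, new output targets (a terminal pager, an IDE panel) can be added without rebuilding any library's documentation, which is the property the IRD format is designed to provide.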
78                                                                                                  PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

Fig. 2: Sketch representing how to build documentation with Papyri. Step 1: each project builds an IRD bundle that contains semantic
information about the project documentation. Step 2: the IRD bundles are published online. Step 3: users install IRD bundles locally on their
machine; pages get cross-linked, indexed, etc. Step 4: IDEs render documentation on-the-fly, taking into consideration users' preferences.
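Because bundles carry no dependency information, the installation step of this sketch (Step 3) reduces to a plain download-and-unpack. A minimal stdlib illustration follows; the file names and bundle layout are made up and are not Papyri's actual format:

```python
import pathlib
import tempfile
import zipfile

def install_bundle(bundle_path, target_dir):
    """Install an IRD bundle by unpacking it: no dependency resolution,
    and no arbitrary code execution, unlike installing the package itself."""
    target = pathlib.Path(target_dir) / pathlib.Path(bundle_path).stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(bundle_path) as zf:
        zf.extractall(target)
    return target

# Build a tiny fake bundle, then "install" it.
tmp = pathlib.Path(tempfile.mkdtemp())
bundle = tmp / "mylib-1.0-ird.zip"
with zipfile.ZipFile(bundle, "w") as zf:
    zf.writestr("mylib.myfunc.json", '{"qualname": "mylib.myfunc"}')

installed = install_bundle(bundle, tmp / "ird")
assert (installed / "mylib.myfunc.json").exists()
```

Note that this sketch trusts the archive it unpacks; a real installer would validate member paths before extraction.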

     IRD bundles: Once a library has collected IRD representations for all documentation items (functions, classes, narrative sections, tutorials, examples), Papyri consolidates them into what we will refer to as IRD bundles. A bundle gathers all IRD files and metadata for a single version of a library⁵. Bundles are a convenient unit when speaking about publication, installation, or update of a given library's documentation files.
     Unlike package installation, IRD bundles have no notion of dependencies. Thus, a fully fledged package manager is not necessary, and one can simply download the corresponding files and unpack them at the installation phase.
     Additionally, IRD bundles for multiple versions of the same library (or for conflicting libraries) are not inherently problematic, as they can be shared across multiple environments.
     From a security standpoint, installing IRD bundles does not require the execution of arbitrary code. This is a critical element for adoption in deployments. There is also an opportunity to provide localized variants at IRD installation time (IRD bundle translations have not been explored exhaustively at the moment).

IRD and high level usage

Papyri-based documentation involves three broad categories of stakeholders (library maintainers, end-users, IDE developers) and processes. This leads to certain requirements for IRD files and bundles.
     On the maintainers' side, the goal is to ensure that Papyri can build IRD files and publish IRD bundles. Creation of IRD files and bundles is the most computationally intensive step. It may require complex dependencies or specific plugins. Thus, this can be a multi-step process, or one can use external tooling (neither related to Papyri nor using Python) to create them. Visual appearance and rendering of documentation are not taken into account in this process. Overall, building IRD files and bundles takes about the same amount of time as running a full Sphinx build. The limiting factor is often the execution of library examples and code snippets. For example, building the SciPy and NumPy documentation IRD files on a 2021 MacBook Pro M1 (base model), including executing examples in most docstrings and type-inferring most examples (with most variables semantically inferred), can take several minutes.
     End-users are responsible for installing desired IRD bundles. In most cases, these will be the IRD bundles of already installed libraries. While Papyri is not currently integrated with package managers or IDEs, one could imagine this process being automatic, or on demand. This step should be fairly efficient, as it mostly requires downloading and unpacking IRD files.
     Finally, IDE developers want to make sure IRD files can be properly rendered and browsed by their users when requested. This may take into account users' preferences, and may provide added value such as indexing, searching, and bookmarks, as seen in rustdoc.

Current implementation

We present here some of the technological choices made in the current Papyri implementation. At the moment, it only targets a subset of the projects and users that could make use of IRD files and bundles. As a consequence, it is constrained in order to minimize the current scope and development effort. Understanding the implementation is not necessary to use Papyri, either as a project maintainer or as a user, but it can help in understanding some of the current limitations.
     Additionally, nothing prevents alternative and complementary implementations with different choices: as long as other implementations can produce (or consume) IRD bundles, they should be perfectly compatible and work together.
     The following sections are thus mostly informative, describing the state of the current code base. In particular, we restricted ourselves to:

    •   Producing IRD bundles for the core scientific Python projects (NumPy, SciPy, Matplotlib...)
    •   Rendering IRD documentation for a single user on their local machine.

     Finally, some of the technological choices have no other justification than the main developer having an interest in them, or making iterations on the IRD format and main code base faster.

IRD files generation

The current implementation of Papyri only targets some compatibility with Sphinx (a website and PDF documentation builder), reStructuredText (RST) as the narrative documentation syntax, and Numpydoc (both a project and a standard for docstring formatting). These are widely used by the majority of the core scientific Python ecosystem, and thus having Papyri and IRD bundles compatible with existing projects is critical. We estimate that about 85%-90% of documentation pages currently built with Sphinx, RST and Numpydoc can be built with Papyri. Future work includes extensions to be compatible with MyST (a project to bring markdown syntax to Sphinx), but this is not a priority.

   5. One could have IRD bundles not attached to a particular library, for example if an author wishes to provide only a set of examples or tutorials. We will not discuss this case further here.
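For reference, a short function documented in the numpydoc style that this pipeline consumes might look as follows. The function itself is hypothetical; the section layout (Parameters, Returns, Examples) follows the numpydoc standard:

```python
def mean(a, axis=None):
    """Compute the arithmetic mean.

    Parameters
    ----------
    a : array_like
        Values to average.
    axis : int, optional
        Axis along which to average (ignored in this toy version).

    Returns
    -------
    float
        The arithmetic mean of the input values.

    Examples
    --------
    >>> mean([1, 2, 3])
    2.0
    """
    values = list(a)
    return sum(values) / len(values)

assert mean([1, 2, 3]) == 2.0
```

Papyri parses each such section into structured IRD data, and the Examples section is what gets executed and type-inferred at build time.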
PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER                                                                        79

Fig. 3: Sketch representing how Papyri stores information in three different formats depending on access patterns: a SQLite database for relationship information, on-disk CBOR files for more compact storage of IRD, and raw files (e.g. images). A GraphStore API abstracts all access and takes care of maintaining consistency.

     To understand the RST syntax in narrative documentation, RST documents need to be parsed. To do so, Papyri uses the tree-sitter [TS] and tree-sitter-rst [TSRST] projects, allowing us to extract an "Abstract Syntax Tree" (AST) from the text files. When using tree-sitter, AST nodes contain byte offsets into the original text buffer, so one can easily "unparse" an AST node when necessary. This is quite convenient for handling custom directives and edge cases (for instance, when projects rely on a loose definition of the RST syntax). Let us provide an example: RST directives are usually of the form:

.. directive:: arguments

While technically no space is allowed before the ::, Docutils and Sphinx will not emit errors when building documentation that contains one. Due to our choice of a rigid (but unified) structure, tree-sitter produces an error node when there is an extra space. This allows us to check for error nodes, unparse, apply heuristics to restore proper syntax, then parse again to obtain a new, valid node.
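The check-unparse-fix-reparse loop described above can be illustrated with a toy heuristic. The actual implementation operates on tree-sitter error nodes; this sketch uses a plain regular expression instead, purely to show the normalisation idea:

```python
import re

# Toy stand-in for the tree-sitter based pipeline: detect the loose
# ".. directive ::" form (extra space before "::") that Docutils tolerates,
# rewrite it to strict syntax, after which a strict parser can be re-run.
LOOSE_DIRECTIVE = re.compile(r"^(\s*)\.\.\s+(\S+)\s+::(.*)$")

def normalise_directive(line):
    m = LOOSE_DIRECTIVE.match(line)
    if m:
        indent, name, args = m.groups()
        return f"{indent}.. {name}::{args}"
    return line

assert normalise_directive(".. warning ::") == ".. warning::"
assert normalise_directive(".. note:: text") == ".. note:: text"  # already strict
```

In Papyri the equivalent repair happens on subtrees flagged as error nodes, so only the offending span is unparsed and reparsed, not the whole document.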
     Alternatively, a number of directives, such as warning and note admonitions, still contain valid RST. Instead of storing the directive with its raw text, we parse the full document (potentially finding invalid syntax), and unparse to the raw text only if the directive requires it.
     Serialisation of data structures into IRD files currently uses a custom serialiser. Future work may include switching to msgspec [msgspec]. The AST objects are completely typed; however, they contain a number of unions and sequences of unions. It turns out that many frameworks, like pydantic [pydantic], do not support sequences of unions where each item may be of a different type. To our knowledge, only a few other documentation-related projects treat the AST as an intermediate object with a stable format that can be manipulated by external tools. The most popular one is Pandoc [pandoc], a project meant to convert between many document types.
     The current Papyri strategy is to type-infer all code examples with Jedi [JEDI], and to pre-highlight their syntax using Pygments.

IRD File Installation

Download and installation of IRD files is done concurrently using httpx [httpx], with Trio [Trio] as the async framework.
     The current implementation of Papyri targets Python documentation and is written in Python. We can thus query the installed versions of Python libraries, and infer the appropriate version of the requested documentation. At the moment, the implementation tentatively guesses the relevant library version when the exact version number is missing from the install command.
     For convenience and performance, IRD bundles are post-processed and stored in a different format. For local rendering, we mostly need to perform the following operations:

    1)   Query graph information about cross-links across documents.
    2)   Render a single page.
    3)   Access raw data (e.g. images).

     We also assume that IRD files may be infrequently updated, that disk space is limited, and that services (like a database server) are not necessarily available to install or run. This provides an adapted framework to test Papyri on an end-user machine.
     With those requirements, we decided to use a combination of SQLite (an in-process database engine), Concise Binary Object Representation (CBOR), and raw storage, to better reflect the access patterns (see Figure 3).
     SQLite allows us to easily query for object existence and for graph information (relationships between objects) at runtime; it is well suited to our infrequent, read-mostly access pattern. Currently many queries are done at runtime, when rendering documentation. The goal is to move most of the SQLite information-resolving steps (such as looking up inter-library links) to installation time once the codebase and IRD format have stabilized. SQLite is less strongly typed than other relational or graph databases and needs custom logic, but it is ubiquitous and does not need a separate server process, making it an easy choice of database.
     CBOR is a more space-efficient alternative to JSON. In particular, keys in IRD files are often highly redundant, which CBOR encodes compactly. Storing IRD in CBOR thus reduces disk usage and allows faster deserialization, without requiring potentially CPU-intensive compression/decompression. This is a good compromise for potentially low-performance user machines.
     Raw storage is used for binary blobs which need to be accessed without further processing. This typically means images, and raw storage can be accessed with standard tools like image viewers.
     Finally, access to all of these resources is provided via an internal GraphStore API which is agnostic of the backend, but ensures the consistency of operations like adding/removing/replacing documents. Figure 3 summarizes this process.
     Of course, the above choices depend on the context in which documentation is rendered and viewed. For example, an online archive intended for browsing the documentation of multiple projects and versions may decide to use an actual graph database for object relationships, and store other files on a Content Delivery Network or in blob storage for random access.

Documentation Rendering

The current Papyri implementation includes a number of rendering engines (presented below). Each of them mostly consists of fetching a single page with its metadata, and walking

through the IRD AST tree, rendering each node according to users' preferences.

     •   An ASCII terminal renderer uses Jinja2 [Jinja2]. This can be useful for piping documentation to other tools like grep, less, or cat, and for working in highly restricted environments while making sure the documentation remains readable. This can also serve as a proxy for screen reading.
     •   A textual user interface browser renders using urwid. Navigation within the terminal is possible; one can reflow long lines on resized windows, and even open image files in external editors. Nonetheless, several bugs have been encountered in urwid. The project aims at replacing the urwid implementation of the CLI IPython question mark operator (obj?) interface (which currently only shows raw docstrings) with a new one written with Rich/Textual. For this interface, having images stored raw on disk is useful, as it allows us to directly call into a system image viewer to display them.
     •   A JIT rendering engine uses Jinja2, Quart [quart], and Trio (Quart is an async version of Flask [flask]). This option contains the most features and is therefore the main one used for development, as this environment lets us iterate rapidly on the rendering engine. When exploring the user interface design and navigation, we found that a plain list of back references has limited use. Indeed, it can be challenging to judge the relevance of back references, as well as their relationship to each other. By playing with a network graph visualisation (see Figure 5), we can identify clusters of similar information within back references. Of course, this identification has limits, especially when pages have a large number of back references (where the graph becomes too busy). This also illustrates a strength of the Papyri architecture: creating this network visualization did not require any regeneration of the documentation; one simply updates the template and re-renders the current page.
     •   A static AOT renderer for all pages that can be rendered ahead of time uses the same classes as the JIT renderer. Basically, it loops through all entries in the SQLite database and renders each item independently. This renderer is mostly used for exhaustive testing and performance measurements of Papyri. It can render most of the API documentation of IPython, Astropy [astropy], Dask and Distributed [Dask], Matplotlib [MPL], [MPL-DOI], NetworkX [NX], NumPy [NP], Pandas, Papyri, SciPy, scikit-image, and others. It can render ~28000 pages in ~60 seconds (that is, ~450 pages/s on a recent MacBook Pro M1).

     For all of the above renderers, profiling shows that documentation rendering is mostly limited by object deserialisation from disk and by the Jinja2 templating engine. In the early project development phase, we attempted to write a static HTML renderer in a compiled language (Rust, using compiled and type-checked templates). This provided a speedup of roughly a factor of 10; however, its implementation is now out of sync with the main Papyri code base.
     Finally, a JupyterLab extension is currently in progress. The documentation presents itself as a side panel and is capable of basic browsing and rendering (see Figure 1 and Figure 4). The extension uses TypeScript, React, and native JupyterLab components. Future goals include improving/replacing the JupyterLab question mark operator (obj?) and the JupyterLab Inspector (when possible). A screenshot of the current development version of the JupyterLab extension can be seen in Figure 4.

Challenges

We mentioned above some limitations we encountered (in rendering usage, for instance) and what will be done in the future to address them. We describe below some limitations related to syntax choices, as well as broader opportunities that arise from the Papyri project.

Limitations

The decoupling of the building and rendering phases is key in Papyri. However, it requires us to come up with a method that uniquely identifies each object. In particular, this is essential in order to link to any object's documentation without accessing the IRD bundles built from all the libraries. To that aim, we use the fully qualified name of each object: each object is identified by the concatenation of the module in which it is defined with its local name. Nonetheless, several particular cases need specific treatment.

     •   To mirror the Python syntax, it is easy to use . to concatenate both parts. Unfortunately, that leads to ambiguity when a module re-exports an object that has the same name as a submodule. For example, if one writes

         # module mylib/
         from .mything import mything

         then mylib.mything is ambiguous: it refers both to the mything submodule and to the re-exported object. In future versions, the chosen convention will use : as the module/name separator.
     •   Decorated functions or other dynamic approaches to exposing functions to users end up having <locals> in their fully qualified names, which is invalid.
     •   Many built-in functions (np.sin, np.cos, etc.) do not have a fully qualified name that can be extracted by object introspection. We believe it should be possible to identify those via other means, like a docstring hash (to be explored).
     •   Fully qualified names are often not the canonical names (i.e. the names typically used for imports). While we made efforts to create a mapping from one to the other, finding the canonical name automatically is not always straightforward.
     •   There are also challenges with case sensitivity. For example, on macOS file systems, objects whose names differ only by case may unfortunately refer to the same IRD file on disk. To address this, a case-sensitive hash is appended at the end of the filename.
     •   Many libraries have a syntax that looks right once rendered to HTML while not following proper RST syntax, or a syntax that relies on specificities of Docutils' and Sphinx's rendering/parsing.
     •   Many custom directive plugins cannot be reused from Sphinx. These will need to be reimplemented.

Future possibilities

Beyond what has been presented in this paper, there are several opportunities to improve and extend what Papyri can offer the scientific Python ecosystem.

Fig. 5: Local graph (made with D3.js [D3js]) representing the connections among the most important nodes around the current page across many libraries, when viewing numpy.ndarray. Nodes are sized with respect to the number of incoming links, and colored with respect to their library. This graph is generated at rendering time, and is updated depending on the libraries currently installed. It helps identify related functions and documentation, though it can become challenging to read for highly connected items, as seen here for numpy.ndarray.

     The first area is the ability to build IRD bundles on Continuous Integration platforms. Services like GitHub Actions, Azure Pipelines, and many others are already set up to test packages. We hope to leverage this infrastructure to build IRD files and make them available to users.
     A second area is the hosting of intermediate IRD files. While the current prototype is hosted as an HTTP index using GitHub Pages, this is likely not a sustainable hosting platform, as disk space is limited. Since, to our knowledge, IRD files are smaller in size than the corresponding HTML documentation, we hope that other platforms, such as Read the Docs, can be leveraged. This could provide a single domain that renders the documentation for multiple libraries, avoiding the proliferation of per-library subdomains and contributing to a more unified experience for users.
     It should also become possible for projects to avoid the dynamic docstring interpolation frequently used to document *args and **kwargs. This would make sources easier to read, and potentially speed up library imports.
     Once a library can assume that its users read documentation through an IDE that supports Papyri, its docstring syntax could even be exchanged for markdown.
     As IRD files are structured, it should be feasible to provide cross-version information in the documentation. For example, if one installs multiple versions of the IRD bundles of a library, then, assuming the user does not use the latest version, the renderer
could inspect IRD files from previous/future versions to indicate the range of versions for which the documentation has not changed. With additional effort, it should be possible to infer when a parameter was removed, or will be removed, or simply to display the difference between two versions.

Fig. 4: Example of the extended view of the Papyri documentation in the JupyterLab extension (here for SciPy). Code examples can now include plots. Most tokens in each example are linked to the corresponding page. An early navigation bar is visible at the top.
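One hypothetical way to compute such version ranges is to hash each page's IRD content across the installed bundle versions; equal hashes mean the documentation did not change. The snippet below is only a sketch of that idea, not Papyri's implementation:

```python
import hashlib

def page_digest(ird_content: str) -> str:
    # Content-addressed identity for one documentation page.
    return hashlib.sha256(ird_content.encode()).hexdigest()

# Hypothetical IRD content of one page across three installed bundle versions.
versions = {
    "1.0": "myfunc(x): compute the thing.",
    "1.1": "myfunc(x): compute the thing.",   # unchanged from 1.0
    "2.0": "myfunc(x, fast=True): compute the thing, faster.",
}

digests = {v: page_digest(c) for v, c in versions.items()}
# Versions whose documentation is identical to what the user is reading (1.1).
unchanged_since = [v for v in versions if digests[v] == digests["1.1"]]
assert unchanged_since == ["1.0", "1.1"]
```

A renderer could then display "unchanged since 1.0" next to the page, and fall back to a structural diff of the IRD trees when the hashes differ.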

Conclusion                                                                        [RTD-theme] 
To address some of the current limitations in documentation                       [SCU]       
accessibility, building and maintaining, we have provided a new                                         Unit
documentation framework called Papyri. We presented its features and underlying
implementation choices, such as crosslink maintenance, decoupling the building
and rendering phases, enriching the rendering features, and using the IRD format
to create a unified syntax structure. While the project is still at an early
stage, clear impacts can already be seen in the availability of high-quality
documentation for end-users and in the reduced workload for maintainers.
The IRD format opens a wide range of technical possibilities and contributes to
improving the user experience (and therefore to the success of the scientific
Python ecosystem). Such tooling may become necessary for users to navigate an
exponentially growing ecosystem.

Acknowledgments

The authors want to thank S. Gallegos (author of tree-sitter-rst), J. L. Cano
Rodríguez and E. Holscher (Read the Docs), C. Holdgraf (2i2c), B. Granger and
F. Pérez (Jupyter Project), and T. Allard and I. Presedo-Floyd (Quansight) for
their useful feedback and help on this project.

Funding

M. B. received a 2-year grant from the Chan Zuckerberg Initiative (CZI)
Essential Open Source Software for Science (EOSS) program – EOSS4-0000000017 –
via the NumFOCUS 501(c)(3) nonprofit to develop the Papyri project.
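The decoupling of the building and rendering phases mentioned above can be illustrated with a toy sketch. Note that this is a hypothetical illustration using invented names and plain JSON as the intermediate form; it is not Papyri's actual IRD format or API:

```python
import json

def build(docstring: str) -> str:
    """Parse documentation once into a serialized intermediate form."""
    ird = {
        "version": 1,
        "paragraphs": [p.strip() for p in docstring.split("\n\n") if p.strip()],
    }
    return json.dumps(ird)

def render_text(ird_blob: str) -> str:
    """Render from the intermediate form; no parsing logic is needed here."""
    ird = json.loads(ird_blob)
    return "\n\n".join(ird["paragraphs"])

blob = build("Compute the mean.\n\nReturns a float.")
print(render_text(blob))
```

The point of the split is that renderers only consume the intermediate representation, so multiple frontends can be built without re-implementing (or re-running) the parsing stage.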

PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)                                                                                                                83

Bayesian Estimation and Forecasting of Time Series in statsmodels

Chad Fulton‡∗


Abstract—Statsmodels, a Python library for statistical and econometric
analysis, has traditionally focused on frequentist inference, including in its
models for time series data. This paper introduces the powerful features for
Bayesian inference of time series models that exist in statsmodels, with
applications to model fitting, forecasting, time series decomposition, data
simulation, and impulse response functions.

Index Terms—time series, forecasting, Bayesian inference, Markov chain Monte
Carlo, statsmodels

* Corresponding author:
‡ Federal Reserve Board of Governors

Copyright © 2022 Chad Fulton. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.

Introduction

Statsmodels [SP10] is a well-established Python library for statistical and
econometric analysis, with support for a wide range of important model classes,
including linear regression, ANOVA, generalized linear models (GLM),
generalized additive models (GAM), mixed effects models, and time series
models, among many others. In most cases, model fitting proceeds by using
frequentist inference, such as maximum likelihood estimation (MLE). In this
paper, we focus on the class of time series models [MPS11], support for which
has grown substantially in statsmodels over the last decade. After introducing
several of the most important new model classes – which are by default fitted
using MLE – and their features – which include forecasting, time series
decomposition and seasonal adjustment, data simulation, and impulse response
analysis – we describe the powerful functions that enable users to apply
Bayesian methods to a wide range of time series models.

Support for Bayesian inference in Python outside of statsmodels has also grown
tremendously, particularly in the realm of probabilistic programming, and
includes powerful libraries such as PyMC3 [SWF16], PyStan [CGH+17], and
TensorFlow Probability [DLT+17]. Meanwhile, ArviZ [KCHM19] provides many
excellent tools for associated diagnostics and visualizations. The aim of these
libraries is to provide support for Bayesian analysis of a large class of
models, and they make available both advanced techniques, including auto-tuning
algorithms, and flexible model specification. By contrast, here we focus on
simpler techniques. However, while the libraries above do include some support
for time series models, this has not been their primary focus. As a result,
introducing Bayesian inference for the well-developed stable of time series
models in statsmodels, and providing access to the rich associated feature set
already mentioned, presents a complementary option to these more
general-purpose libraries.¹

1. In addition, it is possible to combine the sampling algorithms of PyMC3 with
the time series models of statsmodels, although we will not discuss this
approach in detail here. See, for example,
13.0/examples/notebooks/generated/statespace_sarimax_pymc3.html.

Time series analysis in statsmodels

A time series is a sequence of observations ordered in time, and time series
data appear commonly in statistics, economics, finance, climate science,
control systems, and signal processing, among many other fields. One
distinguishing characteristic of many time series is that observations that are
close in time tend to be more correlated, a feature known as autocorrelation.
While successful analyses of time series data must account for this,
statistical models can harness it to decompose a time series into trend,
seasonal, and cyclical components, produce forecasts of future data, and study
the propagation of shocks over time.

We now briefly review the models for time series data that are available in
statsmodels and describe their features.²

2. In addition to statistical models, statsmodels also provides a number of
tools for exploratory data analysis, diagnostics, and hypothesis testing
related to time series data; see the online documentation.

Exponential smoothing models

Exponential smoothing models are constructed by combining one or more simple
equations that each describe some aspect of the evolution of univariate time
series data. While originally somewhat ad hoc, these models can be defined in
terms of a proper statistical model (for example, see [HKOS08]). They have
enjoyed considerable popularity in forecasting (for example, see the
implementation in R described by [HA18]). A prototypical example that allows
for trending data and a seasonal component – often known as the additive
"Holt-Winters' method" – can be written as

    l_t = α(y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1})
    b_t = β(l_t − l_{t−1}) + (1 − β)b_{t−1}
    s_t = γ(y_t − l_{t−1} − b_{t−1}) + (1 − γ)s_{t−m}

where l_t is the level of the series, b_t is the trend, s_t is the seasonal
component of period m, and α, β, γ are parameters of the model. When augmented
with an error term with some given probability distribution (usually Gaussian),
likelihood-based inference can be used to estimate the parameters. In
statsmodels,

additive exponential smoothing models can be constructed using the
statespace.ExponentialSmoothing class.³ The following code shows how to apply
the additive Holt-Winters model above to model quarterly data on consumer
prices:

import numpy as np
import statsmodels.api as sm

# Load data
mdata = sm.datasets.macrodata.load().data
# Compute annualized consumer price inflation
y = np.log(mdata['cpi']).diff().iloc[1:] * 400

# Construct the Holt-Winters model
model_hw = sm.tsa.statespace.ExponentialSmoothing(
    y, trend=True, seasonal=4)

3. A second class, ETSModel, can also be used for both additive and
multiplicative models, and can exhibit superior performance with maximum
likelihood estimation. However, it lacks some of the features relevant for
Bayesian inference discussed in this paper.

Structural time series models

Structural time series models, introduced by [Har90] and also sometimes known
as unobserved components models, similarly decompose a univariate time series
into trend, seasonal, cyclical, and irregular components:

    y_t = µ_t + γ_t + c_t + ε_t

where µ_t is the trend, γ_t is the seasonal component, c_t is the cyclical
component, and ε_t ∼ N(0, σ²) is the error term. However, this equation can be
augmented in many ways, for example to include explanatory variables or an
autoregressive component. In addition, there are many possible specifications
for the trend, seasonal, and cyclical components, so that a wide variety of
time series characteristics can be accommodated. In statsmodels, these models
can be constructed from the UnobservedComponents class; a few examples are
given in the following code:

# "Local level" model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
# "Local linear trend", with seasonal component
model_lt = sm.tsa.UnobservedComponents(
    y, 'lltrend', seasonal=4)

These models have become popular for time series analysis and forecasting, as
they are flexible and the estimated components are intuitive. Indeed, Google's
CausalImpact library [BGK+15] uses a Bayesian structural time series approach
directly, and Facebook's Prophet library [TL17] uses a conceptually similar
framework and is estimated using PyStan.

Autoregressive moving-average models

Autoregressive moving-average (ARMA) models, ubiquitous in time series
applications, are well supported in statsmodels, including their
generalizations, abbreviated as "SARIMAX", that allow for integrated time
series data, explanatory variables, and seasonal effects.⁴ A general version of
this model, excluding integration, can be written as

    y_t = x_t β + ξ_t
    ξ_t = φ_1 ξ_{t−1} + · · · + φ_p ξ_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}

where ε_t ∼ N(0, σ²). These are constructed in statsmodels with the ARIMA
class; the following code shows how to construct a variety of autoregressive
moving-average models for consumer price data:

# AR(2) model
model_ar2 = sm.tsa.ARIMA(y, order=(2, 0, 0))
# ARMA(1, 1) model with explanatory variable
X = mdata['realint']
model_arma11 = sm.tsa.ARIMA(
    y, order=(1, 0, 1), exog=X)
# SARIMAX(p, d, q)x(P, D, Q, s) model
model_sarimax = sm.tsa.ARIMA(
    y, order=(p, d, q), seasonal_order=(P, D, Q, s))

4. Note that in statsmodels, models with explanatory variables are in the form
of "regression with SARIMA errors".

While this class of models often produces highly competitive forecasts, it does
not produce a decomposition of a time series into, for example, trend and
seasonal components.

Vector autoregressive models

While the SARIMAX models above handle univariate series, statsmodels also has
support for the multivariate generalization to vector autoregressive (VAR)
models.⁵ These models are written

    y_t = ν + Φ_1 y_{t−1} + · · · + Φ_p y_{t−p} + ε_t

where y_t is now considered as an m × 1 vector. As a result, the intercept ν is
also an m × 1 vector, the coefficients Φ_i are each m × m matrices, and the
error term is ε_t ∼ N(0_m, Ω), with Ω an m × m matrix. These models can be
constructed in statsmodels using the VARMAX class, as follows:⁶

# Multivariate dataset
z = (np.log(mdata[['realgdp', 'realcons', 'cpi']])
     .diff().iloc[1:] * 400)

# VAR(1) model
model_var = sm.tsa.VARMAX(z, order=(1, 0))

5. statsmodels also supports vector moving-average (VMA) models using the same
model class as described here for the VAR case, but, for brevity, we do not
explicitly discuss them here.

6. A second class, VAR, can also be used to fit VAR models, using least
squares. However, it lacks some of the features relevant for Bayesian inference
discussed in this paper.

Dynamic factor models

statsmodels also supports a second model for multivariate time series: the
dynamic factor model (DFM). These models, often used for dimension reduction,
posit a few unobserved factors, with autoregressive dynamics, that are used to
explain the variation in the observed dataset. In statsmodels, there are two
model classes, DynamicFactor and DynamicFactorMQ, that can fit versions of the
DFM. Here we focus on the DynamicFactor class, for which the model can be
written

    y_t = Λ f_t + ε_t
    f_t = Φ_1 f_{t−1} + · · · + Φ_p f_{t−p} + η_t

Here again, the observation y_t is assumed to be m × 1, but the factors f_t are
k × 1, where it is possible that k << m. As before, we assume conformable
coefficient matrices and Gaussian errors.

The following code shows how to construct a DFM in statsmodels:

# DFM with 2 factors that evolve as a VAR(3)
model_dfm = sm.tsa.DynamicFactor(
    z, k_factors=2, factor_order=3)

Linear Gaussian state space models

In statsmodels, each of the model classes introduced above
(statespace.ExponentialSmoothing, UnobservedComponents, ARIMA, VARMAX,

DynamicFactor, and DynamicFactorMQ) are implemented as part of a broader class
of models, referred to as linear Gaussian state space models (hereafter, for
brevity, simply "state space models" or SSM). This class of models can be
written as

    y_t = d_t + Z_t α_t + ε_t            ε_t ∼ N(0, H_t)
    α_{t+1} = c_t + T_t α_t + R_t η_t    η_t ∼ N(0, Q_t)

where α_t represents an unobserved vector containing the "state" of the dynamic
system. In general, the model is multivariate, with y_t and ε_t m × 1 vectors,
α_t k × 1, and η_t r × 1.

Fig. 1: Selected functionality of state space models in statsmodels.

Powerful tools exist for state space models to estimate the values of the
unobserved state vector, compute the value of the likelihood function for
frequentist inference, and perform posterior sampling for Bayesian inference.
These tools include the celebrated Kalman filter and smoother and a simulation
smoother, all of which are important for conducting Bayesian inference for
these models.⁷ The implementation in statsmodels largely follows the treatment
in [DK12], and is described in more detail in [Ful15].

7. Statsmodels currently contains two implementations of simulation smoothers
for the linear Gaussian state space model. The default is the "mean correction"
simulation smoother of [DK02]. The precision-based simulation smoother of
[CJ09] can alternatively be used by specifying method='cfa' when creating the
simulation smoother object.

In addition to these key tools, state space models also admit general
implementations of useful features such as forecasting, data simulation, time
series decomposition, and impulse response analysis. As a consequence, each of
these features extends to each of the time series models described above.
Figure 1 presents a diagram showing how to produce these features, and the code
below briefly introduces a subset of them.

# Construct the model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# Construct a simulation smoother
sim_ll = model_ll.simulation_smoother()

# Parameter values (variance of error and
# variance of level innovation, respectively)
params = [4, 0.75]

# Compute the log-likelihood of these parameters
llf = model_ll.loglike(params)

# `smooth` applies the Kalman filter and smoother
# with a given set of parameters and returns a
# Results object
results_ll = model_ll.smooth(params)

# Produce forecasts for the next 4 periods
fcast = results_ll.forecast(4)

# Produce a draw from the posterior distribution
# of the state vector
sim_ll.simulate()
draw = sim_ll.simulated_state

Nearly identical code could be used for any of the model classes introduced
above, since they are all implemented as part of the same state space model
framework. In the next section, we show how these features can be used to
perform Bayesian inference with these models.

Bayesian inference via Markov chain Monte Carlo

We begin by giving a cursory overview of the key elements of Bayesian inference
required for our purposes here.⁸ In brief, the Bayesian approach stems from
Bayes' theorem, in which the posterior distribution for an object of interest
is derived as proportional to the combination of a prior distribution and the
likelihood function

    p(A|B) ∝ p(B|A) × p(A)
    (posterior ∝ likelihood × prior)

Here, we will be interested in the posterior distribution of the parameters of
our model and of the unobserved states, conditional on the chosen model
specification and the observed time series data. While in most cases the form
of the posterior cannot be derived analytically, simulation-based methods such
as Markov chain Monte Carlo (MCMC) can be used to draw samples that approximate
the posterior distribution nonetheless. While PyMC3, PyStan, and TensorFlow
Probability emphasize Hamiltonian Monte Carlo (HMC) and no-U-turn sampling
(NUTS) MCMC methods, we focus on the simpler random walk Metropolis-Hastings
(MH) and Gibbs sampling (GS) methods. These are standard MCMC methods that have
enjoyed great success in time series applications and which are simple to
implement, given the state space framework already available in statsmodels. In
addition, the ArviZ library is designed to work with MCMC output from any
source, and we can easily adapt it to our use.

With either Metropolis-Hastings or Gibbs sampling, our procedure will produce a
sequence of sample values (of parameters and/or the unobserved state vector)
that approximate draws from the posterior distribution arbitrarily well as the
length of the chain of samples becomes very large.
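To give some intuition for the likelihood computation that the Kalman filter performs in the state space framework, the following stand-alone sketch computes the log-likelihood of a local level model in plain NumPy. This is a textbook scalar implementation with an ad hoc diffuse-style initialization, intended for illustration only; it is not the statsmodels code path, and the simulated series is our own, not the paper's data:

```python
import numpy as np

def local_level_loglike(y_obs, sigma2_eps, sigma2_eta):
    """Log-likelihood of a local level model via the Kalman filter.

    Model: y_t = mu_t + eps_t,       eps_t ~ N(0, sigma2_eps)
           mu_{t+1} = mu_t + eta_t,  eta_t ~ N(0, sigma2_eta)
    """
    a, P = y_obs[0], 1e6  # initial state mean and large (diffuse-style) variance
    llf = 0.0
    for obs in y_obs:
        v = obs - a                   # one-step-ahead forecast error
        F = P + sigma2_eps            # forecast error variance
        llf -= 0.5 * (np.log(2 * np.pi * F) + v**2 / F)
        K = P / F                     # Kalman gain
        a = a + K * v                 # updated state mean
        P = P * (1 - K) + sigma2_eta  # predicted state variance for t+1
    return llf

# Simulated data from a local level model with known variances
rng = np.random.default_rng(0)
mu_sim = np.cumsum(rng.normal(scale=0.5, size=200))
y_sim = mu_sim + rng.normal(scale=1.0, size=200)
print(local_level_loglike(y_sim, 1.0, 0.25))
```

Evaluating a routine like this at candidate parameter values is what a call such as model.loglike(params) does, with statsmodels handling the general multivariate case.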

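As a concrete, if brute-force, illustration of the proportionality above, the following toy example (our own, not from the paper) evaluates likelihood × prior for the variance of simulated Gaussian data on a grid of candidate values and normalizes the result. MCMC methods like those discussed next avoid this kind of exhaustive evaluation, which is infeasible in higher dimensions:

```python
import numpy as np
from scipy import stats

# Simulated i.i.d. mean-zero Gaussian data with true variance 4
rng = np.random.default_rng(12345)
data = rng.normal(scale=2.0, size=500)

# Grid of candidate variances and a flat prior over (0.0001, 100.0001)
grid = np.linspace(0.1, 20.0, 2000)
log_prior = stats.uniform(0.0001, 100).logpdf(grid)

# Gaussian log-likelihood of the data at each candidate variance
log_like = (-0.5 * (np.log(2 * np.pi * grid[:, None])
                    + data**2 / grid[:, None])).sum(axis=1)

# posterior ∝ likelihood × prior, normalized to a density on the grid
log_post = log_like + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * (grid[1] - grid[0])

# With a flat prior, the posterior mode sits at the sample second moment
print(grid[np.argmax(post)])
```

Subtracting log_post.max() before exponentiating is a standard numerical-stability trick; the unknown normalizing constant drops out when the grid is renormalized.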
8. While a detailed description of these issues is out of the scope of this
paper, there are many superb references on this topic. We refer the interested
reader to [WH99], which provides a book-length treatment of Bayesian inference
for state space models, and [KN99], which provides many examples and
applications.

Random walk Metropolis-Hastings

In random walk Metropolis-Hastings (MH), we begin with an arbitrary point as
the initial sample, and then iteratively construct new samples in the chain as
follows. At each iteration, (a) construct a proposal by perturbing the previous
sample by a Gaussian random variable, and then (b) accept the proposal with
some probability. If a proposal is accepted, it becomes the next sample in the
chain, while if it is rejected then the previous sample value is carried over.
Here, we show how to implement Metropolis-Hastings estimation of the variance
parameter in a simple model, which only requires the use of the log-likelihood
computation introduced above.

import arviz as az
from scipy import stats

# Construct the model
model_rw = sm.tsa.UnobservedComponents(y, 'rwalk')

# Specify the prior distribution. With MH, this
# can be freely chosen by the user
prior = stats.uniform(0.0001, 100)

# Specify the Gaussian perturbation distribution
perturb = stats.norm(scale=0.1)

# Storage
niter = 100000
samples_rw = np.zeros(niter + 1)

# Initialization
samples_rw[0] = y.diff().var()
llf = model_rw.loglike(samples_rw[0])
prior_llf = prior.logpdf(samples_rw[0])

# Iterations
for i in range(1, niter + 1):
    # Compute the proposal value
    proposal = samples_rw[i - 1] + perturb.rvs()

    # Compute the acceptance probability
    proposal_llf = model_rw.loglike(proposal)
    proposal_prior_llf = prior.logpdf(proposal)
    accept_prob = np.exp(
        proposal_llf - llf
        + proposal_prior_llf - prior_llf)

    # Accept or reject the value
    if accept_prob > stats.uniform.rvs():
        samples_rw[i] = proposal
        llf = proposal_llf
        prior_llf = proposal_prior_llf
    else:
        samples_rw[i] = samples_rw[i - 1]

# Convert for use with ArviZ and plot posterior
samples_rw = az.convert_to_inference_data(
    samples_rw)
# Eliminate the first 10000 samples as burn-in;
# thin by factor of 10 to reduce autocorrelation
az.plot_posterior(samples_rw.sel(
    {'draw': np.s_[10000::10]}), kind='hist',
    point_estimate='median')

The approximate posterior distribution, constructed from the sample chain, is
shown in Figure 2.

Fig. 2: Approximate posterior distribution of variance parameter, random walk
model, Metropolis-Hastings; U.S. Industrial Production.

Gibbs sampling

Gibbs sampling (GS) is a special case of Metropolis-Hastings (MH) that is
applicable when it is possible to produce draws directly from the conditional
distributions of every variable, even though it is still not possible to derive
the general form of the joint posterior. While this approach can be superior to
random walk MH when it is applicable, the ability to derive the conditional
distributions typically requires the use of a "conjugate" prior – i.e., a prior
from some specific family of distributions. For example, above we specified a
uniform distribution as the prior when sampling via MH, but that is not
possible with Gibbs sampling. Here, we show how to implement Gibbs sampling
estimation of the variance parameters, now making use of an inverse Gamma
prior, and the simulation smoother introduced above.

Fig. 3: Approximate posterior joint distribution of variance parameters, local
level model, Gibbs sampling; CPI inflation.

# Construct the model and simulation smoother
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
sim_ll = model_ll.simulation_smoother()

# Specify the prior distributions. With GS, we must
# choose an inverse Gamma prior for each variance
priors = [stats.invgamma(0.01, scale=0.01)] * 2

# Storage
niter = 100000
samples_ll = np.zeros((niter + 1, 2))

# Initialization
samples_ll[0] = [y.diff().var(), 1e-5]

# Iterations

for i in range(1, niter + 1):
   # (a) Update the model parameters
   model_ll.update(samples_ll[i - 1])

   # (b) Draw from the conditional posterior of
   # the state vector
   sample_state = sim_ll.simulated_state.T

   # (c) Compute / draw from conditional posterior
   # of the parameters:
   # ...observation error variance
   resid = y - sample_state[:, 0]
   post_shape = len(resid) / 2 + 0.01
   post_scale = np.sum(resid**2) / 2 + 0.01
   samples_ll[i, 0] = stats.invgamma(
      post_shape, scale=post_scale).rvs()

   # ...level error variance
   resid = sample_state[1:] - sample_state[:-1]
   post_shape = len(resid) / 2 + 0.01                                  Fig. 4: Data and forecast with 80% credible interval; U.S. Industrial
   post_scale = np.sum(resid**2) / 2 + 0.01                            Production.
   samples_ll[i, 1] = stats.invgamma(
      post_shape, scale=post_scale).rvs()

# Convert for use with ArviZ and plot posterior
samples_ll = az.convert_to_inference_data(
   {'parameters': samples_ll[None, ...]},
   coords={'parameter': model_ll.param_names},
   dims={'parameters': ['parameter']})
   {'draw': np.s_[10000::10]}), kind='hexbin');
The approximate posterior distribution, constructed from the sam-
ple chain, is shown in Figure 3.

Illustrative examples
For clarity and brevity, the examples in the previous section gave
results for simple cases. However, these basic methods carry
through to each of the models introduced earlier, including in cases
with multivariate data and hundreds of parameters. Moreover, the
Metropolis-Hastings approach can be combined with the Gibbs
sampling approach, so that if the end user wishes to use Gibbs
sampling for some parameters, they are not restricted to choose
only conjugate priors for all parameters.
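To sketch what such a hybrid step looks like, the following minimal
example (a hypothetical log-posterior, not drawn from the paper)
implements a random-walk Metropolis-Hastings update for a single
parameter; within the Gibbs loop shown earlier, a step like this would
simply replace the conjugate invgamma draw for that parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-posterior for one non-conjugate parameter
# (a standard normal here, purely for illustration)
def log_post(theta):
    return -0.5 * theta**2

# One random-walk Metropolis-Hastings step
def mh_step(theta, scale=1.0):
    proposal = theta + scale * rng.standard_normal()
    log_ratio = log_post(proposal) - log_post(theta)
    if np.log(rng.uniform()) < log_ratio:
        return proposal  # accept
    return theta         # reject: keep the current value

# Embed the MH step in a simple sampling loop
draws = np.empty(5000)
theta = 0.0
for i in range(draws.shape[0]):
    theta = mh_step(theta)
    draws[i] = theta

print(draws.mean(), draws.std())
```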
    In addition to sampling the posterior distributions of the
parameters, this method allows sampling other objects of inter-
est, including forecasts of observed variables, impulse response
functions, and the unobserved state vector. This last possibility
is especially useful in cases such as the structural time series
model, in which the unobserved states correspond to interpretable
elements such as the trend and seasonal components. We provide
several illustrative examples of the various types of analysis that
are possible.

Forecasting and Time Series Decomposition
In our first example, we apply the Gibbs sampling approach to
a structural time series model in order to forecast U.S. Industrial
Production and to produce a decomposition of the series into level,
trend, and seasonal components. The model is

                   yt = µt + γt + εt        observation equation
                   µt = βt + µt−1 + ζt                      level
                   βt = βt−1 + ξt                           trend
                   γt = γt−s + ηt                        seasonal

Here, we set the seasonal periodicity to s=12, since Industrial
Production is a monthly variable. We can construct this model
in Statsmodels as9

model = sm.tsa.UnobservedComponents(
   y, 'lltrend', seasonal=12)

To produce the time-series decomposition into level, trend, and
seasonal components, we will use samples from the posterior of
the state vector (µt, βt, γt) for each time period t. These are
immediately available when using the Gibbs sampling approach;
in the earlier example, the draw at each iteration was assigned
to the variable sample_state. To produce forecasts, we need to
draw from the posterior predictive distribution for horizons
h = 1, 2, . . . , H. This can be easily accomplished by using the
simulate method introduced earlier. To be concrete, we can
accomplish these tasks by modifying section (b) of our Gibbs
sampler iterations as follows:

Fig. 5: Estimated level, trend, and seasonal components, with 80%
credible interval; U.S. Industrial Production.

   9. This model is often referred to as a "local linear trend" model (with
additionally a seasonal component); lltrend is an abbreviation of this name.
88                                                                                                 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

                          Fig. 6: "Causal impact" of COVID-19 on U.S. Sales in Manufacturing and Trade Industries.

# (b') Draw from the conditional posterior of
# the state vector
model.update(params[i - 1])
sim.simulate()
# save the draw for use later in time series
# decomposition
states[i] = sim.simulated_state.T

# Draw from the posterior predictive distribution
# using the `simulate` method
n_fcast = 48
fcast[i] = model.simulate(
   params[i - 1], n_fcast,
   initial_state=states[i, -1]).to_frame()

These forecasts and the decomposition into level, trend, and
seasonal components are summarized in Figures 4 and 5, which
show the median values along with 80% credible intervals.
Notably, the intervals shown incorporate both the uncertainty
arising from the stochastic terms in the model and the need to
estimate the models' parameters.10

Causal impacts
A closely related procedure described in [BGK+15] uses a
Bayesian structural time series model to estimate the "causal
impact" of some event on some observed variable. This approach
stops estimation of the model just before the date of an event
and produces a forecast by drawing from the posterior predictive
density, using the procedure described just above. It then uses the
difference between the actual path of the data and the forecast to
estimate the impact of the event.
    An example of this approach is shown in Figure 6, in which
we use this method to illustrate the effect of the COVID-19
pandemic on U.S. Sales in Manufacturing and Trade Industries.11

Extensions
There are many extensions to the time series models presented
here that are made possible when using Bayesian inference.
First, it is easy to create custom state space models within the
statsmodels framework. As one example, the statsmodels
documentation describes how to create a model that extends the
typical VAR described above with time-varying parameters.12
These custom state space models automatically inherit all the
functionality described above, so that Bayesian inference can be
conducted in exactly the same way.
    Second, because the general state space model available in
statsmodels and introduced above allows for time-varying
system matrices, it is possible using Gibbs sampling methods
to introduce support for automatic outlier handling, stochastic
volatility, and regime switching models, even though these are
largely infeasible in statsmodels when using frequentist
methods such as maximum likelihood estimation.13

Conclusion
This paper introduces the suite of time series models available in
statsmodels and shows how Bayesian inference using Markov
chain Monte Carlo methods can be applied to estimate their
parameters and produce analyses of interest, including time series
decompositions and forecasts.

   10. The popular Prophet library, [TL17], similarly uses an additive model
combined with Bayesian sampling methods to produce forecasts and
decompositions, although its underlying model is a GAM rather than a state
space model.
   11. In this example, we used a local linear trend model with no seasonal
component.
   12. For details, see
generated/statespace_tvpvar_mcmc_cfa.html.
   13. See, for example, [SW16] for an application of these techniques that
handles outliers, [KSC98] for stochastic volatility, and [KN98] for an
application to dynamic factor models with regime switching.

REFERENCES

[BGK+15]  Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy,
          and Steven L. Scott. Inferring causal impact using Bayesian
          structural time-series models. Annals of Applied Statistics,
          9:247–274, 2015. doi:10.1214/14-aoas788.
[CGH+17]  Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel
          Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker,
          Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A Probabilistic
          Programming Language. Journal of Statistical Software, 76(1),
          January 2017.
[CJ09]    Joshua C.C. Chan and Ivan Jeliazkov. Efficient simulation and
          integrated likelihood estimation in state space models.
          International Journal of Mathematical Modelling and Numerical
          Optimisation, 1(1-2):101–120, January 2009.
[DK02]    J. Durbin and S. J. Koopman. A simple and efficient simulation
          smoother for state space time series analysis. Biometrika,
          89(3):603–616, August 2002. doi:10.1093/biomet/89.3.603.
[DK12]    James Durbin and Siem Jan Koopman. Time Series Analysis by
          State Space Methods: Second Edition. Oxford University Press,
          May 2012.
[DLT+17]  Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo,
          Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi,
          Matt Hoffman, and Rif A. Saurous. TensorFlow Distributions.
          Technical Report arXiv:1711.10604, arXiv, November 2017.
          doi:10.48550/arXiv.1711.10604.
[Ful15]   Chad Fulton. Estimating time series models by state space
          methods in Python: Statsmodels. 2015.
[HA18]    Rob J Hyndman and George Athanasopoulos. Forecasting:
          principles and practice. OTexts, 2018.
[Har90]   Andrew C. Harvey. Forecasting, Structural Time Series Models
          and the Kalman Filter. Cambridge University Press, 1990.
[HKOS08]  Rob Hyndman, Anne B. Koehler, J. Keith Ord, and Ralph D.
          Snyder. Forecasting with Exponential Smoothing: The State
          Space Approach. Springer Science & Business Media, June 2008.
[KCHM19]  Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo
          Martin. ArviZ a unified library for exploratory analysis of
          Bayesian models in Python. Journal of Open Source Software,
          4(33):1143, 2019. doi:10.21105/joss.01143.
[KN98]    Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning
          Points, A New Coincident Index, and Tests of Duration
          Dependence Based on a Dynamic Factor Model With Regime
          Switching. The Review of Economics and Statistics,
          80(2):188–201, May 1998. doi:10.1162/003465398557447.
[KN99]    Chang-Jin Kim and Charles R. Nelson. State-Space Models with
          Regime Switching: Classical and Gibbs-Sampling Approaches
          with Applications. MIT Press Books, The MIT Press, 1999.
[KSC98]   Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic
          Volatility: Likelihood Inference and Comparison with ARCH
          Models. The Review of Economic Studies, 65(3):361–393, July
          1998. doi:10.1111/1467-937X.00050.
[MPS11]   Wes McKinney, Josef Perktold, and Skipper Seabold. Time
          Series Analysis in Python with statsmodels. In Stéfan van der
          Walt and Jarrod Millman, editors, Proceedings of the 10th
          Python in Science Conference, pages 107–113, 2011.
[SP10]    Skipper Seabold and Josef Perktold. Statsmodels: Econometric
          and Statistical Modeling with Python. In Stéfan van der Walt
          and Jarrod Millman, editors, Proceedings of the 9th Python in
          Science Conference, pages 92–96, 2010.
[SW16]    James H. Stock and Mark W. Watson. Core Inflation and Trend
          Inflation. Review of Economics and Statistics, 98(4):770–784,
          March 2016. doi:10.1162/REST_a_00608.
[SWF16]   John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck.
          Probabilistic programming in Python using PyMC3. PeerJ
          Computer Science, 2:e55, April 2016. doi:10.7717/peerj-cs.55.
[TL17]    Sean J. Taylor and Benjamin Letham. Forecasting at scale.
          Technical Report e3190v2, PeerJ Inc., September 2017.
          doi:10.7287/peerj.preprints.3190v2.
[WH99]    Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic
          Models. Springer, New York, 2nd edition, March 1999.

 Python vs. the pandemic: a case study in high-stakes
                 software development
     Cliff C. Kerr‡§∗ , Robyn M. Stuart¶k , Dina Mistry∗∗ , Romesh G. Abeysuriyak , Jamie A. Cohen‡ , Lauren George†† ,
                         Michał Jastrzebski‡‡ , Michael Famulare‡ , Edward Wenger‡ , Daniel J. Klein‡


Abstract—When it became clear in early 2020 that COVID-19 was going to
be a major public health threat, politicians and public health officials turned to
academic disease modelers like us for urgent guidance. Academic software
development is typically a slow and haphazard process, and we realized that
business-as-usual would not suffice for dealing with this crisis. Here we describe
the case study of how we built Covasim (, an agent-based model
of COVID-19 epidemiology and public health interventions, by using standard
Python libraries like NumPy and Numba, along with less common ones like
Sciris ( Covasim was created in a few weeks, an order of magnitude
faster than the typical model development process, and achieves performance
comparable to C++ despite being written in pure Python. It has become one
of the most widely adopted COVID models, and is used by researchers and
policymakers in dozens of countries. Covasim's rapid development was enabled
not only by leveraging the Python scientific computing ecosystem, but also by
adopting coding practices and workflows that lowered the barriers to entry for
scientific contributors without sacrificing either performance or rigor.

Index Terms—COVID-19, SARS-CoV-2, Epidemiology, Mathematical modeling,
NumPy, Numba, Sciris

Background
For decades, scientists have been concerned about the possibility
of another global pandemic on the scale of the 1918 flu [Gar05].
Despite a number of "close calls" – including SARS in 2002
[AFG+04]; Ebola in 2014-2016 [Tea14]; and flu outbreaks
including 1957, 1968, and H1N1 in 2009 [SHK16], some of which
led to 1 million or more deaths – the last time we experienced
the emergence of a planetary-scale new pathogen was when HIV
spread globally in the 1980s [CHL+08].
    In 2015, Bill Gates gave a TED talk stating that the world was
not ready to deal with another pandemic [Hof20]. While the Bill
& Melinda Gates Foundation (BMGF) has not historically focused
on pandemic preparedness, its expertise in disease surveillance,
modeling, and drug discovery made it well placed to contribute to
a global pandemic response plan. Founded in 2008, the Institute
for Disease Modeling (IDM) has provided analytical support for
BMGF (which it has been a part of since 2020) and other global
health partners, with a focus on eradicating malaria and polio.
Since its creation, IDM has built up a portfolio of computational
tools to understand, analyze, and predict the dynamics of different
diseases.
    When "coronavirus disease 2019" (COVID-19) and the virus
that causes it (SARS-CoV-2) were first identified in late 2019,
our team began summarizing what was known about the virus
[Fam19]. By early February 2020, even though it was more than
a month before the World Health Organization (WHO) declared
a pandemic [Med20], it had become clear that COVID-19 would
become a major public health threat. The outbreak on the Diamond
Princess cruise ship [RSWS20] was the impetus for us to start
modeling COVID in detail. Specifically, we needed a tool to (a)
incorporate new data as soon as it became available, (b) explore
policy scenarios, and (c) predict likely future epidemic trajectories.
    The first step was to identify which software tool would form
the best starting point for our new COVID model. Infectious
disease models come in two major types: agent-based models track
the behavior of individual "people" (agents) in the simulation,
with each agent's behavior represented by a random (probabilistic)
process. Compartmental models track populations of people
over time, typically using deterministic difference equations. The
richest modeling framework used by IDM at the time was EMOD,
which is a multi-disease agent-based model written in C++ and
based on JSON configuration files [BGB+18]. We also considered
Atomica, a multi-disease compartmental model written in Python
and based on Excel input files [KAK+19]. However, both of
these options posed significant challenges: as a compartmental
model, Atomica would have been unable to capture the individual-
level detail necessary for modeling the Diamond Princess outbreak
(such as passenger-crew interactions); EMOD had sufficient
flexibility, but developing new disease modules had historically
required months rather than days.
    As a result, we instead started developing Covasim ("COVID-
19 Agent-based Simulator") [KSM+21] from a nascent agent-
based model written in Python, LEMOD-FP ("Light-EMOD for
Family Planning"). LEMOD-FP was used to model reproductive
health choices of women in Senegal; this model had in turn
been based on an even simpler agent-based model of measles
vaccination programs in Nigeria ("Value-of-Information Simulator"
or VoISim). We subsequently applied the lessons we learned

* Corresponding author:
‡ Institute for Disease Modeling, Bill & Melinda Gates Foundation, Seattle,
§ School of Physics, University of Sydney, Sydney, Australia
¶ Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
|| Burnet Institute, Melbourne, Australia
** Twitter, Seattle, USA
†† Microsoft, Seattle, USA
‡‡ GitHub, San Francisco, USA

Copyright © 2022 Cliff C. Kerr et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT                                                                       91

Fig. 1: Daily reported global COVID-19-related deaths (top;
smoothed with a one-week rolling window), relative to the timing of
known variants of concern (VOCs) and variants of interest (VOIs), as
well as Covasim releases (bottom).

from developing Covasim to turn LEMOD-FP into a new family
planning model, "FPsim", which will be launched later this year
[OVCC+22].
    Parallel to the development of Covasim, other research teams
at IDM developed their own COVID models, including one based
on the EMOD framework [SWC+22], and one based on an earlier
influenza model [COSF20]. However, while both of these models
saw use in academic contexts [KCP+20], neither were able to
incorporate new features quickly enough, or were easy enough to
use, for widespread external adoption in a policy context.
    Covasim, by contrast, had immediate real-world impact. The
first version was released on 10 March 2020, and on 12 March
2020, its output was presented by Washington State Governor Jay
Inslee during a press conference as justification for school closures
and social distancing measures [KMS+21].
    Since the early days of the pandemic, Covasim releases have
coincided with major events in the pandemic, especially the
identification of new variants of concern (Fig. 1). Covasim was
quickly adopted globally, including applications in the UK regarding
school closures [PGKS+20], Australia regarding outbreak control
[SAK+21], and Vietnam regarding lockdown measures [PSN+21].
    To date, Covasim has been downloaded from PyPI over
100,000 times [PeP22], has been used in dozens of academic
studies [KMS+21], and informed decision-making on every
continent (Fig. 2), making it one of the most widely used COVID
models [KSM+21]. We believe key elements of its success include
(a) the simplicity of its architecture; (b) its high performance,
enabled by the use of NumPy arrays and Numba decorators;
and (c) our emphasis on prioritizing usability, including flexible
type handling and careful choices of default settings. In the
remainder of this paper, we outline these principles in more detail,
in the hope that these will provide a useful roadmap for other
groups wanting to quickly develop high-performance, easy-to-use
scientific computing libraries.

Software architecture and implementation

Covasim conceptual design and usage
Covasim is a standard susceptible-exposed-infectious-recovered
(SEIR) model (Fig. 3). As noted above, it is an agent-based model,
meaning that individual people and their interactions with one
another are simulated explicitly (rather than implicitly, as in a
compartmental model).
    The fundamental calculation that Covasim performs is to
determine the probability that a given person, on a given time step,
will change from one state to another, such as from susceptible
to exposed (i.e., that person was infected), from undiagnosed to
diagnosed, or from critically ill to dead. Covasim is fully open-
source and available on GitHub ( and PyPI
(pip install covasim), and comes with comprehensive
documentation, including tutorials (
    The first principle of Covasim's design philosophy is that
"Common tasks should be simple" – for example, defining
parameters, running a simulation, and plotting results. The following
example illustrates this principle; it creates a simulation with a
custom parameter value, runs it, and plots the results:

import covasim as cv
cv.Sim(pop_size=100e3).run().plot()

The second principle of Covasim's design philosophy is "Un-
common tasks can't always be simple, but they still should be
possible." Examples include writing a custom goodness-of-fit
function or defining a new population structure. To some extent,
the second principle is at odds with the first, since the more
flexibility an interface has, typically the more complex it is as
well.
    To illustrate the tension between these two principles, the
following code shows how to run two simulations to determine the
impact of a custom intervention aimed at protecting the elderly in
Japan, with results shown in Fig. 4:

import covasim as cv

# Define a custom intervention
def elderly(sim, old=70):
    if sim.t =='2020-04-01'):
        elderly = sim.people.age > old
        sim.people.rel_sus[elderly] = 0.0

# Set custom parameters
pars = dict(
    pop_type = 'hybrid', # More realistic population
    location = 'japan', # Japan's population pyramid
    pop_size = 50e3, # Have 50,000 people total
    pop_infected = 100, # 100 infected people
    n_days = 90, # Run for 90 days
)

# Run multiple sims in parallel and plot key results
label = 'Protect the elderly'
s1 = cv.Sim(pars, label='Default')
s2 = cv.Sim(pars, interventions=elderly, label=label)
msim = cv.parallel(s1, s2)
msim.plot(['cum_deaths', 'cum_infections'])

Similar design philosophies have been articulated previously,
such as for Grails [AJ09] among others1.

   1. Other similar philosophical statements include "The manifesto of
Matplotlib is: simple and common tasks should be simple to perform; provide
options for more complex tasks" (Data Processing Using Python) and "Simple,
common tasks should be simple to perform; Options should be provided to
enable more complex tasks" (Instrumental).

                Fig. 2: Locations where Covasim has been used to help produce a paper, report, or policy recommendation.

Fig. 3: Basic Covasim disease model. The blue arrow shows the
process of reinfection.

Fig. 4: Illustrative result of a simulation in Covasim focused on
exploring an intervention for protecting the elderly.

Simplifications using Sciris
A key component of Covasim's architecture is heavy reliance
on Sciris ( [KAH+ng], a library of functions for
scientific computing that provide additional flexibility and ease-
of-use on top of NumPy, SciPy, and Matplotlib, including parallel
computing, array operations, and high-performance container
datatypes.
    As shown in Fig. 5, Sciris significantly reduces the number
of lines of code required to perform common scientific tasks,
allowing the user to focus on the code's scientific logic rather than
the low-level implementation. Key Covasim features that rely on
Sciris include: ensuring consistent dictionary, list, and array types
(e.g., allowing the user to provide inputs as either lists or arrays);
referencing ordered dictionary elements by index; handling and
interconverting dates (e.g., allowing the user to provide either a
date string or a datetime object); saving and loading files; and
running simulations in parallel.

Array-based architecture
In a typical agent-based simulation, the outermost loop is over
time, while the inner loops iterate over different agents and agent
states. For a simulation like Covasim, with roughly 700 (daily)
timesteps to represent the first two years of the pandemic, tens
or hundreds of thousands of agents, and several dozen states, this
requires on the order of one billion update steps.
    However, we can take advantage of the fact that each state
(such as agent age or their infection status) has the same data
type, and thus we can avoid an explicit loop over agents by instead
representing agents as entries in NumPy vectors, and performing
operations on these vectors. These two architectures are shown in

Fig. 5: Comparison of functionally identical code implemented without Sciris (left) and with (right). In this example, tasks that together take
30 lines of code without Sciris can be accomplished in 7 lines with it.

Fig. 6: The standard object-oriented approach for implementing
agent-based models (top), compared to the array-based approach
used in Covasim (bottom).

# Loop-based agent simulation

for t in self.time_vec:
    for person in self.people:
        if person.alive:

# Array-based agent simulation

class People:

    def age_people(self, inds):
        self.age[inds] += 1

    def check_died(self, inds):
        rands = np.random.rand(len(inds))
        died = rands < self.death_probs[inds]
        self.alive[inds[died]] = False

class Sim:

    def run(self):
        for t in self.time_vec:
            alive = sc.findinds(self.people.alive)

                                                                      Numba optimization
                                                                      Numba is a compiler that translates subsets of Python and NumPy
                                                                      into machine code [LPS15]. Each low-level numerical function
                                                                      was tested with and without Numba decoration; in some cases
                                                                      speed improvements were negligible, while in other cases they
                                                                      were considerable. For example, the following function is roughly
                                                                      10 times faster with the Numba decorator than without:
                                                                      import numpy as np
                                                                      import numba as nb

                                                                      @nb.njit((nb.int32, nb.int32), cache=True)
                                                                      def choose_r(max_n, n):
Fig. 7: Performance comparison for FPsim from an explicit loop-           return np.random.choice(max_n, n, replace=True)
based approach compared to an array-based approach, showing a
factor of ~70 speed improvement for large population sizes.           Since Covasim is stochastic, calculations rarely need to be exact;
                                                                      as a result, most numerical operations are performed as 32-bit
Fig. 6. Compared to the explicitly object-oriented implementation         Together, these speed optimizations allow Covasim to run at
of an agent-based model, the array-based version is 1-2 orders of     roughly 5-10 million simulated person-days per second of CPU
magnitude faster for population sizes larger than 10,000 agents.      time – a speed comparable to agent-based models implemented
The relative performance of these two approaches is shown in          purely in C or C++ [HPN+ 21]. Practically, this means that most
Fig. 7 for FPsim (which, like Covasim, was initially implemented      users can run Covasim analyses on their laptops without needing
using an object-oriented approach before being converted to an        to use cloud-based or HPC computing resources.
array-based approach). To illustrate the difference between object-
based and array-based implementations, the following example          Lessons for scientific software development
shows how aging and death would be implemented in each:
                                                                      Accessible coding and design
# Object-based agent simulation
                                                                      Since Covasim was designed to be used by scientists and health
class Person:                                                         officials, not developers, we made a number of design decisions
                                                                      that preferenced accessibility to our audience over other principles
     def age_person(self):
                                                                      of good software design.
         self.age += 1
         return                                                           First, Covasim is designed to have as flexible of user inputs
                                                                      as possible. For example, a date can be specified as an integer
     def check_died(self):                                            number of days from the start of the simulation, as a string (e.g.
         rand = np.random.random()
         if rand < self.death_prob:                                   '2020-04-04'), or as a datetime object. Similarly, numeric
             self.alive = False                                       inputs that can have either one or multiple values (such as the
         return                                                       change in transmission rate following one or multiple lockdowns)
                                                                      can be provided as a scalar, list, or NumPy array. As long as the
class Sim:
                                                                      input is unambiguous, we prioritized ease-of-use and simplicity
     def run(self):                                                   of the interface over rigorous type checking. Since Covasim is a
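The flexible date handling described above can be sketched with a
small stand-alone helper (an illustrative stand-in, not Covasim's
actual implementation):

```python
# Illustrative stand-in showing how a date given as an int, a
# 'YYYY-MM-DD' string, or a datetime object can be normalized to one
# internal representation: days since the simulation start.
import datetime as dt

def to_day(obj, start_day):
    """Convert obj to an integer number of days since start_day."""
    if isinstance(obj, int):              # already a day offset
        return obj
    if isinstance(obj, str):              # e.g. '2020-04-04'
        obj = dt.date.fromisoformat(obj)
    if isinstance(obj, dt.datetime):      # full datetime -> date
        obj = obj.date()
    return (obj - start_day).days

start = dt.date(2020, 3, 1)
assert to_day(34, start) == 34
assert to_day('2020-04-04', start) == 34
assert to_day(dt.datetime(2020, 4, 4, 12, 0), start) == 34
```

Normalizing at the boundary like this keeps the internal model logic
working with a single type while letting users supply whichever form
is most natural.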

    Since Covasim is a top-level library (i.e., it does not perform
low-level functions as part of other libraries), this prioritization
has been welcomed by its users.
    Second, "advanced" Python programming paradigms – such as method
and function decorators, lambda functions, multiple inheritance, and
"dunder" methods – have been avoided where possible, even when they
would otherwise be good coding practice. This is because a relatively
large fraction of Covasim users, including those with relatively
limited Python backgrounds, need to inspect and modify the source
code. A Covasim user coming from an R programming background, for
example, may not have encountered the NumPy function intersect1d()
before, but they can quickly look it up and understand it as being
equivalent to R's intersect() function. In contrast, an R user who has
not encountered method decorators before is unlikely to be able to
look them up and understand their meaning (indeed, they may not even
know what terms to search for). While Covasim does use each of the
"advanced" methods listed above (e.g., the Numba decorators described
earlier), they have been kept to a minimum and sequestered in
particular files that users are less likely to interact with.
    Third, testing for Covasim presented a major challenge. Given that
Covasim was being used to make decisions that affected tens of
millions of people, even the smallest errors could have potentially
catastrophic consequences. Furthermore, errors could arise not only in
the software logic, but also in an incorrectly entered parameter value
or a misinterpreted scientific study. Compounding these challenges,
features often had to be developed and used on a timescale of hours or
days to be of use to policymakers, a speed which was incompatible with
traditional software testing approaches. In addition, the rapidly
evolving codebase made it difficult to write even simple regression
tests. Our solution was to use a hierarchical testing approach:
low-level functions were tested through a standard software unit test
approach, while new features and higher-level outputs were tested
extensively by infectious disease modelers who varied inputs
corresponding to realistic scenarios and checked the outputs
(predominantly in the form of graphs) against their intuition. We
found that these high-level "sanity checks" were far more effective in
catching bugs than formal software tests, and as a result shifted the
emphasis of our test suite to prioritize the former. Public releases
of Covasim have held up well to extensive scrutiny, both by our
external collaborators and by "COVID skeptics" who were highly
critical of other COVID models [Den20].
    Finally, since much of our intended audience has little to no
Python experience, we provided as many alternative ways of accessing
Covasim as possible. For R users, we provide examples of how to run
Covasim using the reticulate package [AUTE17], which allows Python to
be called from within R. For specific applications, such as our
test-trace-quarantine work, we developed bespoke webapps via Jupyter
notebooks [GP21] and Voilà [Qua19]. To help non-experts gain intuition
about COVID epidemic dynamics, we also developed a generic
JavaScript-based webapp interface for Covasim, but it does not have
sufficient flexibility to answer real-world policy questions.

Workflow and team management

Covasim was developed by a team of roughly 75 people with widely
disparate backgrounds: from those with 20+ years of enterprise-level
software development experience and no public health background,
through to public health experts with virtually no prior experience in
Python. Roughly 45% of Covasim contributors had significant Python
expertise, while 60% had public health experience; only about half a
dozen contributors (<10%) had significant experience in both areas.
    These half-dozen contributors formed a core group (including the
authors of this paper) that oversaw overall Covasim development. Using
GitHub for both software and project management, we created issues and
assigned them to other contributors based on urgency and skillset
match. All pull requests were reviewed by at least one person from
this group, and often two, prior to merge. While the danger of
accepting changes from contributors with limited Python experience is
self-evident, considerable risks were also posed by contributors who
lacked epidemiological insight. For example, some of the proposed
tests were written based on assumptions that were true for a given
time and place, but which were not valid for other geographical
contexts.
    One surprising outcome was that even though Covasim is largely a
software project, after the initial phase of development (i.e., the
first 4-8 weeks), we found that relatively few tasks could be assigned
to the developers as opposed to the epidemiologists and infectious
disease modelers on the project. We believe there are several reasons
for this. First, epidemiologists tended to be much more aware of
knowledge they were missing (e.g., what a particular NumPy function
did), and were more readily able to fill that gap (e.g., by looking it
up in the documentation or on Stack Overflow). By contrast, developers
without expertise in epidemiology were less able to identify gaps in
their knowledge and address them (e.g., by finding a study on Google
Scholar). As a consequence, many of the epidemiologists' software
skills improved markedly over the first few months, while the
developers' epidemiology knowledge increased more slowly. Second, and
more importantly, we found that once transparent and performant coding
practices had been implemented, epidemiologists were able to
successfully adapt them to new contexts even without complete
understanding of the code. Thus, for developing a scientific software
tool, we propose that a successful staffing plan would consist of a
roughly equal ratio of developers and domain experts during the early
development phase, followed by a rapid (on a timescale of weeks)
ramp-down of developers and ramp-up of domain experts.
    Acknowledging that Covasim's potential user base includes many
people who have limited coding skills, we developed a three-tiered
support model to maximize Covasim's real-world policy impact (Fig. 8).
For "mode 1" engagements, we perform the analyses using Covasim
ourselves. While this mode typically ensures high quality and
efficiency, it is highly resource-constrained and thus used only for
our highest-profile engagements, such as with the Vietnam Ministry of
Health [PSN+ 21] and the Washington State Department of Health
[KMS+ 21]. For "mode 2" engagements, we offer our partners training on
how to use Covasim, and let them lead analyses with our feedback. This
is our preferred mode of engagement, since it balances efficiency and
sustainability; it has been used for contexts including the United
Kingdom [PGKS+ 20] and Australia [SLSS+ 22]. Finally, "mode 3"
partnerships, in which Covasim is downloaded and used without our
direct input, are of course the default approach in the open-source
software ecosystem, including for Python. While this mode is by far
the most scalable, in practice relatively few health departments or
ministries of health have the time and internal technical capacity to
use this mode; instead, most of the mode 3 uptake of Covasim has

been by academic groups [LG+ 21]. Thus, we provide mode 1 and mode 2
partnerships to make Covasim's impact more immediate and direct than
would be possible via mode 3 alone.

Future directions

While the need for COVID modeling is hopefully starting to decrease,
we and our collaborators are continuing development of Covasim by
updating parameters with the latest scientific evidence, implementing
new immune dynamics [CSN+ 21], and providing other usability and
bug-fix updates. We also continue to provide support and training
workshops (including in-person workshops, which were not possible
earlier in the pandemic).
    We are using what we learned during the development of Covasim to
build a broader suite of Python-based disease modeling tools
(tentatively named "*-sim" or "Starsim"). The suite of Starsim tools
under development includes models for family planning [OVCC+ 22],
polio, respiratory syncytial virus (RSV), and human papillomavirus
(HPV). To date, each tool in this suite uses an independent codebase,
and is related to Covasim only through the shared design principles
described above and by having used the Covasim codebase as the
starting point for development.
    A major open question is whether the disease dynamics implemented
in Covasim and these related models have sufficient overlap to be
refactored into a single disease-agnostic modeling library, which the
disease-specific modeling libraries would then import. This "core and
specialization" approach was adopted by EMOD and Atomica, and while
both frameworks continue to be used, no multi-disease modeling library
has yet seen widespread adoption within the disease modeling
community. The alternative approach, currently used by the Starsim
suite, is for each disease model to be a self-contained library. A
shared library would reduce code duplication and allow new features
and bug fixes to be rolled out to multiple models simultaneously.
However, it would also increase interdependencies, adding code
complexity and increasing the risk of introducing subtle bugs. Which
of these two options is preferable likely depends on the speed with
which new disease models need to be implemented. We hope that for the
foreseeable future, none will need to be implemented as quickly as
Covasim.

Acknowledgements

We thank additional contributors to Covasim, including Katherine
Rosenfeld, Gregory R. Hart, Rafael C. Núñez, Prashanth Selvaraj,
Brittany Hagedorn, Amanda S. Izzo, Greer Fowler, Anna Palmer, Dominic
Delport, Nick Scott, Sherrie L. Kelly, Caroline S. Bennette, Bradley
G. Wagner, Stewart T. Chang, Assaf P. Oron, Paula Sanz-Leon, and
Jasmina Panovska-Griffiths. We also wish to thank Maleknaz Nayebi and
Natalie Dean for helpful discussions on code architecture and workflow
practices, respectively.

REFERENCES

[AFG+ 04]  Roy M Anderson, Christophe Fraser, Azra C Ghani, Christl A
           Donnelly, Steven Riley, Neil M Ferguson, Gabriel M Leung,
           Tai H Lam, and Anthony J Hedley. Epidemiology, transmission
           dynamics and control of SARS: the 2002–2003 epidemic.
           Philosophical Transactions of the Royal Society of London.
           Series B: Biological Sciences, 359(1447):1091–1105, 2004.
           doi:10.1098/rstb.2004.1490.
[AJ09]     Bashar Abdul-Jawad. Groovy and Grails Recipes. Springer,
           2009.
[AUTE17]   JJ Allaire, Kevin Ushey, Yuan Tang, and Dirk Eddelbuettel.
           reticulate: R Interface to Python, 2017. URL: https://github.
[BGB+ 18]  Anna Bershteyn, Jaline Gerardin, Daniel Bridenbecker,
           Christopher W Lorton, Jonathan Bloedow, Robert S Baker,
           Guillaume Chabot-Couture, Ye Chen, Thomas Fischle, Kurt
           Frey, et al. Implementation and applications of EMOD, an
           individual-based multi-disease modeling platform. Pathogens
           and Disease, 76(5):fty059, 2018. doi:10.1093/femspd/fty059.
[CHL+ 08]  Myron S Cohen, Nick Hellmann, Jay A Levy, Kevin DeCock,
           Joep Lange, et al. The spread, treatment, and prevention of
           HIV-1: evolution of a global pandemic. The Journal of
           Clinical Investigation, 118(4):1244–1254, 2008.
           doi:10.1172/JCI34706.
[COSF20]   Dennis L Chao, Assaf P Oron, Devabhaktuni Srikrishna, and
           Michael Famulare. Modeling layered non-pharmaceutical
           interventions against SARS-CoV-2 in the United States with
           Corvid. medRxiv, 2020. doi:10.1101/2020.04.08.20058487.
[CSN+ 21]  Jamie A Cohen, Robyn Margaret Stuart, Rafael C Núñez,
           Katherine Rosenfeld, Bradley Wagner, Stewart Chang, Cliff
           Kerr, Michael Famulare, and Daniel J Klein. Mechanistic
           modeling of SARS-CoV-2 immune memory, variants, and
           vaccines. medRxiv, 2021. doi:10.1101/2021.05.31.21258018.
[Den20]    Denim, Sue. Another Computer Simulation, Another Alarmist
           Prediction, 2020. URL:
[Fam19]    Mike Famulare. nCoV: preliminary estimates of the
           confirmed-case-fatality-ratio and infection-fatality-ratio,
           and initial pandemic risk assessment. Institute for Disease
           Modeling, 2019.
[Gar05]    Laurie Garrett. The next pandemic. Foreign Aff., 84:3,
           2005. doi:10.2307/20034417.
[GP21]     Brian E. Granger and Fernando Pérez. Jupyter: Thinking and
           storytelling with code and data. Computing in Science &
           Engineering, 23(2):7–14, 2021.
           doi:10.1109/MCSE.2021.3059263.
[Hof20]    Bert Hofman. The global pandemic. Horizons: Journal of
           International Relations and Sustainable Development,
           (16):60–69, 2020.
[HPN+ 21]  Robert Hinch, William JM Probert, Anel Nurtay, Michelle
           Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana
           Bulas Cruz, Lele Zhao, Andrea Stewart, et al.
           OpenABM-Covid19—An agent-based model for non-pharmaceutical
           interventions against COVID-19 including contact tracing.
           PLoS Computational Biology, 17(7):e1009146, 2021.
           doi:10.1371/journal.pcbi.1009146.
[KAH+ ng]  Cliff C Kerr, Romesh G Abeysuriya, Vlad-Ștefan Harbuz,
           George L Chadderdon, Parham Saidi, Paula Sanz-Leon, James
           Jansson, Maria del Mar Quiroga, Sherrie Hughes, Rowan
           Martin-Kelly, Jamie Cohen, Robyn M Stuart, and Anna
           Nachesa. Sciris: a Python library to simplify scientific
           computing. Available at, 2022 (forthcoming).
[KAK+ 19]  David J Kedziora, Romesh Abeysuriya, Cliff C Kerr, George L
           Chadderdon, Vlad-Ștefan Harbuz, Sarah Metzger, David P
           Wilson, and Robyn M Stuart. The Cascade Analysis Tool:
           software to analyze and optimize care cascades. Gates Open
           Research, 3, 2019. doi:10.12688/gatesopenres.13031.2.
[KCP+ 20]  Joel R Koo, Alex R Cook, Minah Park, Yinxiaohe Sun, Haoyang
           Sun, Jue Tao Lim, Clarence Tam, and Borame L Dickens.
           Interventions to mitigate early spread of SARS-CoV-2 in
           Singapore: a modelling study. The Lancet Infectious
           Diseases, 20(6):678–688, 2020.
           doi:10.1016/S1473-3099(20)30162-6.
[KMS+ 21]  Cliff C Kerr, Dina Mistry, Robyn M Stuart, Katherine
           Rosenfeld, Gregory R Hart, Rafael C Núñez, Jamie A Cohen,
           Prashanth Selvaraj, Romesh G Abeysuriya, Michał Jastrzębski,
           et al. Controlling COVID-19 via test-trace-quarantine.
           Nature Communications, 12(1):1–12, 2021.
           doi:10.1038/s41467-021-
[KSM+ 21]  Cliff C Kerr, Robyn M Stuart, Dina Mistry, Romesh G
           Abeysuriya, Katherine Rosenfeld, Gregory R Hart, Rafael C
           Núñez, Jamie A Cohen, Prashanth Selvaraj, Brittany
           Hagedorn, et al. Covasim: an agent-based model of COVID-19
           dynamics and interventions. PLOS Computational Biology,
           17(7):e1009149, 2021. doi:10.1371/journal.pcbi.1009149.
[LG+ 21]   Junjiang Li, Philippe Giabbanelli, et al. Returning to a
           normal life via COVID-19 vaccines in the United States: a
           large-scale agent-based simulation study. JMIR Medical
           Informatics, 9(4):e27419, 2021. doi:10.2196/27419.

Fig. 8: The three pathways to impact with Covasim, from high bandwidth/small scale to low bandwidth/large scale. IDM: Institute for Disease
Modeling; OSS: open-source software; GPG: global public good; PyPI: Python Package Index.

[LPS15]    Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba:
           a LLVM-based Python JIT compiler. In Proceedings of the
           Second Workshop on the LLVM Compiler Infrastructure in HPC,
           pages 1–6, 2015. doi:10.1145/2833157.2833162.
[Med20]    The Lancet Respiratory Medicine. COVID-19: delay, mitigate,
           and communicate. The Lancet Respiratory Medicine, 8(4):321,
           2020. doi:10.1016/S2213-2600(20)30128-4.
[OVCC+ 22] Michelle L O'Brien, Annie Valente, Guillaume
           Chabot-Couture, Joshua Proctor, Daniel Klein, Cliff Kerr,
           and Marita Zimmermann. FPsim: An agent-based model of
           family planning for informed policy decision-making. In
           PAA 2022 Annual Meeting. PAA, 2022.
[PeP22]    PePy. PePy download statistics, 2022. URL:
[PGKS+ 20] Jasmina Panovska-Griffiths, Cliff C Kerr, Robyn M Stuart,
           Dina Mistry, Daniel J Klein, Russell M Viner, and Chris
           Bonell. Determining the optimal strategy for reopening
           schools, the impact of test and trace interventions, and
           the risk of occurrence of a second COVID-19 epidemic wave
           in the UK: a modelling study. The Lancet Child & Adolescent
           Health, 4(11):817–827, 2020.
           doi:10.1016/S2352-4642(20)30250-9.
[PSN+ 21]  Quang D Pham, Robyn M Stuart, Thuong V Nguyen, Quang C
           Luong, Quang D Tran, Thai Q Pham, Lan T Phan, Tan Q Dang,
           Duong N Tran, Hung T Do, et al. Estimating and mitigating
           the risk of COVID-19 epidemic rebound associated with
           reopening of international borders in Vietnam: a modelling
           study. The Lancet Global Health, 9(7):e916–e924, 2021.
           doi:10.1016/
[Qua19]    QuantStack. And voilà! Jupyter Blog, 2019. URL:
           https://blog.
[RSWS20]   Joacim Rocklöv, Henrik Sjödin, and Annelies Wilder-Smith.
           COVID-19 outbreak on the Diamond Princess cruise ship:
           estimating the epidemic potential and effectiveness of
           public health countermeasures. Journal of Travel Medicine,
           27(3):taaa030, 2020. doi:10.1093/jtm/taaa030.
[SAK+ 21]  Robyn M Stuart, Romesh G Abeysuriya, Cliff C Kerr, Dina
           Mistry, Dan J Klein, Richard T Gray, Margaret Hellard, and
           Nick Scott. Role of masks, testing and contact tracing in
           preventing COVID-19 resurgences: a case study from New
           South Wales, Australia. BMJ Open, 11(4):e045941, 2021.
[SHK16]    Patrick R Saunders-Hastings and Daniel Krewski. Reviewing
           the history of pandemic influenza: understanding patterns
           of emergence and transmission. Pathogens, 5(4):66, 2016.
[SLSS+ 22] Paula Sanz-Leon, Nathan J Stevenson, Robyn M Stuart,
           Romesh G Abeysuriya, James C Pang, Stephen B Lambert,
           Cliff C Kerr, and James A Roberts. Risk of sustained
           SARS-CoV-2 transmission in Queensland, Australia.
           Scientific Reports, 12(1):1–9, 2022.
           doi:10.1101/2021.06.08.21258599.
[SWC+ 22]  Prashanth Selvaraj, Bradley G Wagner, Dennis L Chao, Maïna
           L'Azou Jackson, J Gabrielle Breugelmans, Nicholas Jackson,
           and Stewart T Chang. Rural prioritization may increase the
           impact of COVID-19 vaccines in a representative COVAX AMC
           country setting due to ongoing internal migration: A
           modeling study. PLOS Global Public Health, 2(1):e0000053,
           2022. doi:10.1371/journal.pgph.0000053.
[Tea14]    WHO Ebola Response Team. Ebola virus disease in West
           Africa—the first 9 months of the epidemic and forward
           projections. New England Journal of Medicine,
           371(16):1481–1495, 2014. doi:10.1056/NEJMoa1411100.

      Pylira: deconvolution of images in the presence of
                        Poisson noise
Axel Donath‡∗ , Aneta Siemiginowska‡ , Vinay Kashyap‡ , Douglas Burke‡ , Karthik Reddy Solipuram§ , David van Dyk¶


Abstract—All physical and astronomical imaging observations are
degraded by the finite angular resolution of the camera and telescope
systems. The recovery of the true image is limited both by how well
the instrument characteristics are known and by the magnitude of
measurement noise. In the case of high signal-to-noise ratio data, the
image can be sharpened or “deconvolved” robustly by using established
standard methods such as the Richardson-Lucy method. However, the
situation changes for sparse data and the low signal-to-noise regime,
such as those frequently encountered in X-ray and gamma-ray astronomy,
where deconvolution leads inevitably to an amplification of noise and
poorly reconstructed images. However, the results in this regime can
be improved by making use of physically meaningful prior assumptions
and statistically principled modeling techniques. One proposed method
is the LIRA algorithm, which requires smoothness of the reconstructed
image at multiple scales. In this contribution, we introduce a new
Python package called Pylira, which exposes the original C
implementation of the LIRA algorithm to Python users. We briefly
describe the package structure and development setup, and show a
Chandra as well as a Fermi-LAT analysis example.

Index Terms—deconvolution, point spread function, Poisson, low counts,
X-ray, gamma-ray

Introduction

Any physical and astronomical imaging process is affected by the
limited angular resolution of the instrument or telescope. In
addition, the quality of the resulting image is also degraded by
background or instrumental measurement noise and non-uniform exposure.
For short wavelengths and the associated low intensities of the
signal, the imaging process consists of recording individual

of the signal intensity to the signal variance. Any statistically
correct post-processing or reconstruction method thus requires a
careful treatment of the Poisson nature of the measured image.
    To maximise the scientific use of the data, it is often desired to
correct the degradation introduced by the imaging process. Besides
correction for non-uniform exposure and background noise, this also
includes correction for the "blurring" introduced by the point spread
function (PSF) of the instrument; the latter process is often called
"deconvolution". Depending on whether the PSF of the instrument is
known or not, one distinguishes between "blind deconvolution" and
"non-blind deconvolution". For astronomical observations, the PSF can
often either be simulated, given a model of the telescope and
detector, or inferred directly from the data by observing far distant
objects, which appear as point sources to the instrument.
    While in other branches of astronomy deconvolution methods are
already part of the standard analysis, such as the CLEAN algorithm for
radio data, developed by [Hog74], this is not the case for X-ray and
gamma-ray astronomy. As any deconvolution method aims to enhance
small-scale structures in an image, it becomes increasingly hard to
solve in the regime of low signal-to-noise ratio, where small-scale
structures are more affected by noise.

The Deconvolution Problem

Basic Statistical Model

Assuming the data in each pixel d_i in the recorded counts image
photons (often called "events") originating from a source of                          follows a Poisson distribution, the total likelihood of obtaining the
interest. This imaging process is typical for X-ray and gamma-                        measured image from a model image of the expected counts λi
ray telescopes, but images taken by magnetic resonance imaging                        with N pixels is given by:
or fluorescence microscopy show Poisson noise too. For each
individual photon, the incident direction, energy and arrival time
                                                                                                                           N   exp −di λidi
                                                                                                          L (d|λ ) = ∏                                    (1)
is measured. Based on this information, the event can be binned                                                            i       di !
into two dimensional data structures to form an actual image.
                                                                                      By taking the logarithm, dropping the constant terms and inverting
    As a consequence of the low intensities associated to the                         the sign one can transform the product into a sum over pixels,
recording of individual events, the measured signal follows Pois-                     which is also often called the Cash [Cas79] fit statistics:
son statistics. This imposes a non-linear relationship between the
measured signal and true underlying intensity as well as a coupling                                       C (λ |d) = ∑(λi − di log λi )                   (2)
* Corresponding author:
‡ Center for Astrophysics | Harvard & Smithsonian                                     Where the expected counts λi are given by the convolution of the
§ University of Maryland Baltimore County                                             true underlying flux distribution xi with the PSF pk :
¶ Imperial College London
                                                                                                                 λi = ∑ xi pi−k                           (3)
Copyright © 2022 Axel Donath et al. This is an open-access article distributed                                             k
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the          This operation is often called "forward modelling" or "forward
original author and source are credited.                                              folding" with the instrument response.
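The statistical model above can be made concrete in a few lines of NumPy. The following is a minimal sketch (our own illustration, not part of the Pylira API): it forward folds a toy flux image with a normalised PSF via FFT convolution, assuming periodic boundaries for simplicity, and evaluates the Cash statistic of Eq. 2 for a Poisson realisation of the expected counts.

```python
import numpy as np

def forward_fold(flux, psf):
    # Expected counts: convolution of the true flux with the PSF (Eq. 3).
    # FFT-based circular convolution; periodic boundaries are an
    # illustrative simplification.
    return np.real(np.fft.ifft2(np.fft.fft2(flux) * np.fft.fft2(psf)))

def cash(expected, counts):
    # Cash fit statistic (Eq. 2); requires strictly positive expected counts
    return np.sum(expected - counts * np.log(expected))

# toy example: a single point source and a flat, normalised "PSF"
flux = np.zeros((32, 32))
flux[16, 16] = 100.0
psf = np.full((32, 32), 1.0 / 32**2)

expected = forward_fold(flux, psf)   # counts are spread uniformly here
counts = np.random.default_rng(42).poisson(expected)
stat = cash(expected, counts)
```

Minimising this statistic with respect to the flux pixels is exactly the optimization problem addressed in the following sections.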
PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE                                                                            99

Richardson Lucy (RL)

To obtain the most likely value of x given the data, one searches for a maximum of the total likelihood function, or equivalently for a minimum of C. This high-dimensional optimization problem can, e.g., be solved by a classic gradient descent approach. Treating the pixel values xi of the true image as independent parameters, one can take the derivative of Eq. 2 with respect to the individual xi. This way one obtains a rule for how to update the current set of pixel values in each iteration n of the optimization:

    x_i^{n+1} = x_i^n − α · ∂C(d|x) / ∂x_i        (4)

where α is a factor defining the step size. This method is in general equivalent to the gradient descent and backpropagation methods used in modern machine learning. This basic principle of solving the deconvolution problem for images with Poisson noise was proposed by [Ric72] and [Luc74]. Their method, named after the original authors, is often known as the Richardson & Lucy (RL) method. It was shown by [Ric72] that this converges to a maximum likelihood solution of Eq. 2. A Python implementation of the standard RL method is available, e.g., in the Scikit-Image package [vdWSN+14].

Instead of the iterative, gradient-descent-based optimization it is also possible to sample from the posterior distribution using a simple Metropolis-Hastings [Has70] approach and a uniform prior. This is demonstrated in one of the Pylira online tutorials (Introduction to Deconvolution using MCMC Methods).

RL Reconstruction Quality

While technically the RL method converges to a maximum likelihood solution, it mostly still results in poorly restored images, especially if extended emission regions are present in the image. The problem is illustrated in Fig. 1 using a simulated example image. While for a low number of iterations the RL method still results in a smooth intensity distribution, the structure of the image decomposes more and more into a set of point-like sources with a growing number of iterations.

Fig. 1: The images show the result of the RL algorithm applied to a simulated example dataset with varying numbers of iterations. The image in the upper left shows the simulated counts. Those have been derived from the ground truth (upper mid) by convolving with a Gaussian PSF of width σ = 3 pix and applying Poisson noise to it. The illustration uses the implementation of the RL algorithm from the Scikit-Image package [vdWSN+14].

Because of the PSF convolution, an extended emission region can decompose into multiple nearby point sources and still lead to a good model prediction when compared with the data. Those almost equally good solutions correspond to many narrow local minima or "spikes" in the global likelihood surface. Depending on the start estimate for the reconstructed image x, the RL method will follow the steepest gradient and converge towards the nearest narrow local minimum. This problem has been described by multiple authors, such as [PR94] and [FBPW95].

Multi-Scale Prior & LIRA

One solution to this problem was described in [ECKvD04] and [CSv+11]. First, the simple forward-folded model described in Eq. 3 can be extended by taking into account the non-uniform exposure e_k and an additional known background component b_k:

    λi = ∑_k e_k · (x_k + b_k) · p_{i−k}        (5)

The background b_k can be more generally understood as a "baseline" image and can thus include known structures which are not of interest for the deconvolution process, e.g., a bright point source modelling the core of an AGN while studying its jets.

Second, the authors proposed to extend the Poisson log-likelihood function (Eq. 2) by a log-prior term that controls the smoothness of the reconstructed image on multiple spatial scales. Starting from the full resolution, the image pixels xi are collected into 2 by 2 groups Qk. The four pixel values associated with each group are divided by their sum to obtain a grid of "split proportions" with respect to the image down-sized by a factor of two along both axes. This process is repeated using the down-sized image, whose pixel values equal the sums over the 2 by 2 groups of the full-resolution image, and continues until the resolution of the image is only a single pixel, containing the total sum of the full-resolution image. This multi-scale representation is illustrated in Fig. 2.

For each of the 2x2 groups of the re-normalized images a Dirichlet distribution is introduced as a prior:

    φk ∝ Dirichlet(αk, αk, αk, αk)        (6)

and multiplied across all 2x2 groups and resolution levels k. For each resolution level a smoothing parameter αk is introduced. These hyper-parameters can be interpreted as having an information content equivalent to adding αk "hallucinated" counts in each grouping. This effectively results in a smoothing of the image at the given resolution level. The distribution of α values at each resolution level is further described by a hyper-prior:

    p(αk) = exp(−δ αk³ / 3)        (7)

resulting in a fully hierarchical Bayesian model. A more complete and detailed description of the prior definition is given in [ECKvD04].

The problem is then solved using a Gibbs MCMC sampling approach. After a "burn-in" phase the sampling process typically reaches convergence and starts sampling from the posterior distribution. The reconstructed image is then computed as the mean of the posterior samples. As a full distribution of values is available for each pixel, this information can also be used to compute the associated error of the reconstructed value. This is another main advantage over RL or Maximum A-Posteriori (MAP) algorithms.
100                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)
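The multi-scale grouping described above can be sketched with plain NumPy reshaping (our own illustration of the decomposition, not the Pylira internals): at each level the image is partitioned into 2x2 groups, each group is normalised by its sum to give the split proportions, and the group sums form the next, down-sized level.

```python
import numpy as np

def multiscale_split(image):
    # Decompose a 2^n x 2^n image (assumed strictly positive) into
    # per-level "split proportions" and the remaining total sum.
    levels = []
    current = np.asarray(image, dtype=float)
    while current.shape[0] > 1:
        n = current.shape[0] // 2
        # group pixels into 2x2 blocks: resulting shape (n, n, 2, 2)
        blocks = current.reshape(n, 2, n, 2).swapaxes(1, 2)
        sums = blocks.sum(axis=(2, 3))
        levels.append(blocks / sums[..., None, None])  # split proportions
        current = sums                                 # next (coarser) level
    return levels, current[0, 0]  # proportions per level, total image sum

levels, total = multiscale_split(np.ones((4, 4)))
```

Each 2x2 block of proportions is exactly the quantity to which the Dirichlet prior of Eq. 6 is applied, with one smoothing parameter αk per resolution level.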

Fig. 2: The image illustrates the multi-scale decomposition used in the LIRA prior for a 4x4 pixel example image. Each quadrant of 2x2 sub-images is labelled with QN. The sub-pixels in each quadrant are labelled Λij.

The Pylira Package

Dependencies & Development

The Pylira package is a thin Python wrapper around the original LIRA implementation provided by the authors of [CSv+11]. The original algorithm was implemented in C and made available as a package for the R language [R C20]. Thus the implementation depends on the RMath library, which is still a required dependency of Pylira. The Python wrapper was built using the Pybind11 [JRM17] package, which allows the code overhead introduced by the wrapper to be reduced to a minimum. For the data handling, Pylira relies on Numpy [HMvdW+20] arrays and on Astropy [Col18] for the serialisation to the FITS data format. The (interactive) plotting functionality is achieved via Matplotlib [Hun07] and Ipywidgets [wc15], which are both optional dependencies. Pylira is openly developed on GitHub. It relies on GitHub Actions as a continuous integration service and uses the Read the Docs service to build and deploy the documentation, where the online documentation can be found. Pylira implements a set of unit tests to assure compatibility and reproducibility of the results with different versions of the dependencies and across different platforms. As Pylira relies on random sampling for the MCMC process, exact reproducibility of results is hard to achieve on different platforms; however, the agreement of results is at least guaranteed in the statistical limit of drawing many samples.

Installation

Pylira is available via the Python package index (PyPI), currently at version 0.1. As Pylira still depends on the RMath library, it is required to install this first. The recommended way to install Pylira on macOS is:

$ brew install r
$ pip install pylira

On Linux the RMath dependency can be installed using standard package managers. For example, on Ubuntu one would do:

$ sudo apt-get install r-base-dev r-base r-mathlib
$ pip install pylira

For more detailed instructions see the Pylira installation instructions.

API & Subpackages

Pylira is structured in multiple sub-packages. The pylira.src module contains the original C implementation and the Pybind11 wrapper code. The pylira.core sub-package contains the main Python API, pylira.utils includes utility functions for plotting and serialisation, and implements multiple pre-defined datasets for testing and tutorials.

Analysis Examples

Simple Point Source

Pylira was designed to offer a simple Python class-based user interface, which allows for a short learning curve for users who are familiar with Python in general and more specifically with Numpy. A typical complete usage example of the Pylira package is shown in the following:

import numpy as np
from pylira import LIRADeconvolver
from import point_source_gauss_psf

# create example dataset
data = point_source_gauss_psf()

# define initial flux image
data["flux_init"] = data["flux"]

deconvolve = LIRADeconvolver(
    n_iter_max=3_000,
    n_burn_in=500,
    alpha_init=np.ones(5)
)

result =

# plot pixel traces, result shown in Figure 3
result.plot_pixel_traces_region(
    center_pix=(16, 16), radius_pix=3
)

# plot parameter traces, result shown in Figure 4
result.plot_parameter_traces()

# finally serialise the result
result.write("result.fits")

The main interface is exposed via the LIRADeconvolver class, which takes the configuration of the algorithm on initialisation. Typical configuration parameters include the total number of iterations n_iter_max and the number of "burn-in" iterations, which are excluded from the posterior mean computation. The data, represented by a simple Python dict data structure, contains a "counts", "psf" and optionally "exposure" and "background" array. The dataset is then passed to the method to execute the deconvolution. The result is a LIRADeconvolverResult object, which features the possibility to write the result as a FITS file, as well as to inspect the result with diagnostic plots. The result of the computation is shown in the left panel of Fig. 3.

Diagnostic Plots

To validate the quality of the results Pylira provides many built-in diagnostic plots. One of these diagnostic plots is shown in the right panel of Fig. 3. The plot shows the image sampling trace
for a single pixel of interest and its surrounding circular region of interest. This visualisation allows the user to assess the stability of a small region in the image, e.g., an astronomical point source, during the MCMC sampling process. Due to the correlation with neighbouring pixels, the actual value of a pixel might vary in the sampling process, which appears as "dips" in the trace of the pixel of interest and anti-correlated "peaks" in one or multiple of the surrounding pixels. In the example, a stable state of the pixels of interest is reached after approximately 1000 iterations. This suggests that the number of burn-in iterations, which was defined beforehand, should be increased.

Fig. 3: The curves show the traces of the value of the pixel of interest for a simulated point source and its neighboring pixels (see code example). The image on the left shows the posterior mean. The white circle in the image shows the circular region defining the neighboring pixels. The blue line in the right plot shows the trace of the pixel of interest. The solid horizontal orange line shows the mean value (excluding burn-in) of the pixel across all iterations and the shaded orange area the 1 σ error region. The burn-in phase is shown in transparent blue and is ignored while computing the mean. The shaded gray lines show the traces of the neighboring pixels.

Pylira relies on an MCMC sampling approach to sample a series of reconstructed images from the posterior likelihood defined by Eq. 2. Along with the sampling, it marginalises over the smoothing hyper-parameters and optimizes them in the same process. To diagnose the validity of the results it is important to visualise the sampling traces of both the sampled images and the hyper-parameters.

Figure 4 shows another typical diagnostic plot created by the code example above. In a multi-panel figure, the user can inspect the traces of the total log-posterior as well as the traces of the smoothing parameters. Each panel corresponds to the smoothing hyper-parameter introduced for one level of the multi-scale representation of the reconstructed image. The figure also shows the mean value along with the 1 σ error region. In this case, the algorithm shows stable convergence after a burn-in phase of approximately 200 iterations for the log-posterior as well as for all of the multi-scale smoothing parameters.

Astronomical Analysis Examples

Both in the X-ray and in the gamma-ray regime, the Galactic Center is a complex emission region. It shows point sources, extended sources, as well as underlying diffuse emission and thus represents a challenge for any astronomical data analysis.

Chandra is a space-based X-ray observatory, which has been in operation since 1999. It consists of nested cylindrical paraboloid and hyperboloid surfaces, which form an imaging optical system for X-rays. In the focal plane, it has multiple instruments for different scientific purposes, including a high-resolution camera (HRC) and an Advanced CCD Imaging Spectrometer (ACIS). The typical angular resolution is 0.5 arcseconds and the covered energy range is 0.1 to 10 keV.

Figure 5 shows the result of the Pylira algorithm applied to Chandra data of the Galactic Center region between 0.5 and 7 keV. The PSF was obtained from simulations using the simulate_psf tool from the official Chandra science tools ciao 4.14 [FMA+06]. The algorithm achieves both an improved spatial resolution as well as a reduced noise level and higher contrast of the image in the right panel, compared to the unprocessed counts data shown in the left panel.

As a second example, we use data from the Fermi Large Area Telescope (LAT). The Fermi-LAT is a satellite-based imaging gamma-ray detector, which covers an energy range of 20 MeV to >300 GeV. The angular resolution varies strongly with energy and ranges from 0.1 to >10 degrees.

Figure 6 shows the result of the Pylira algorithm applied to Fermi-LAT data above 1 GeV of the region around the Galactic Center. The PSF was obtained from simulations using the gtpsf tool from the official Fermitools v2.0.19 [Fer19]. First, one can see that the algorithm again achieves a considerable improvement in the spatial resolution compared to the raw counts. It clearly resolves multiple point sources left of the bright Galactic Center source.

Summary & Outlook

The Pylira package provides Python wrappers for the LIRA algorithm. It allows the deconvolution of low-counts data following

Fig. 4: The curves show the traces of the log-posterior value as well as the traces of the prior parameter values. The SmoothingparamN parameters correspond to the smoothing parameters αN per multi-scale level. The solid horizontal orange lines show the mean value, the shaded orange areas the 1 σ error region. The burn-in phase is shown transparent and is ignored while estimating the mean.

Fig. 5: Pylira applied to Chandra ACIS data of the Galactic Center region, using the observation IDs 4683 and 4684. The image on the left shows the raw observed counts between 0.5 and 7 keV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included.
Fig. 6: Pylira applied to Fermi-LAT data from the Galactic Center region. The image on the left shows the raw measured counts between
5 and 1000 GeV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1,
ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included.

Poisson statistics using a Bayesian sampling approach and a multi-scale smoothing prior assumption. The results can easily be written to FITS files and inspected by plotting the trace of the sampling process. This allows users to check for general convergence as well as pixel-to-pixel correlations for selected regions of interest. The package is openly developed on GitHub and includes tests and documentation, so that it can be maintained and improved in the future while ensuring consistency of the results. It comes with multiple built-in test datasets and explanatory tutorials in the form of Jupyter notebooks. Future plans include support for parallelisation or distributed computing, more flexible prior definitions, and the possibility to account for systematic errors on the PSF during the sampling process.
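The trace inspection described above can be sketched with plain Python. The chain below is synthetic and the diagnostic (comparing early and late segment means, in the spirit of Geweke's test) is purely illustrative; it does not use Pylira's API.

```python
import math
import random
import statistics

def geweke_like_z(trace, first=0.1, last=0.5):
    """Compare means of the early and late parts of a sampler trace.

    A |z| well below ~2 suggests the two segments agree and the chain
    may have converged. Illustrative diagnostic, not Pylira's API.
    """
    n = len(trace)
    a = trace[: int(first * n)]
    b = trace[int((1 - last) * n):]
    # Standard error of the difference between the two segment means.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(0)
# Synthetic stationary trace, standing in for one pixel's posterior samples.
trace = [10 + random.gauss(0, 0.5) for _ in range(2000)]
print(f"z = {geweke_like_z(trace):.2f}")  # small |z| for a stationary chain
```

A trending chain (one that has not converged) yields a large |z|, which is the signal a user would look for when inspecting a plotted trace.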
Acknowledgements

This work was conducted under the auspices of the CHASC International Astrostatistics Center. CHASC is supported by NSF grants DMS-21-13615, DMS-21-13397, and DMS-21-13605; by the UK Engineering and Physical Sciences Research Council [EP/W015080/1]; and by NASA 18-APRA18-0019. We thank CHASC members for many helpful discussions, especially Xiao-Li Meng and Katy McKeough. DvD was also supported in part by a Marie Skłodowska-Curie RISE Grant (H2020-MSCA-RISE-2019-873089) provided by the European Commission. Aneta Siemiginowska, Vinay Kashyap, and Doug Burke further acknowledge support from NASA contract NAS8-03060 to the Chandra X-ray Center.
104                                                                         PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)


Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels

Geoffrey M. Poore‡*

Abstract—Codebraid Preview is a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Unlike typical Markdown previews, all Pandoc features are fully supported because Pandoc itself generates the preview. The Markdown source and the preview are fully integrated with features like bidirectional scroll sync. The preview supports LaTeX math via KaTeX. Code blocks and inline code can be executed with Codebraid, using either its built-in execution system or Jupyter kernels. For executed code, any combination of the code and its output can be displayed in the preview as well as the final document. Code execution is non-blocking, so the preview always remains live and up-to-date even while code is still running.

Index Terms—reproducibility, dynamic report generation, literate programming, Python, Pandoc, Markdown, Project Jupyter

Introduction

Pandoc [JM22] is increasingly a foundational tool for creating scientific and technical documents. It provides Pandoc's Markdown and other Markdown variants that add critical features absent in basic Markdown, such as citations, footnotes, mathematics, and tables. At the same time, Pandoc simplifies document creation by providing conversion from Markdown (and other formats) to formats like LaTeX, HTML, Microsoft Word, and PowerPoint. Pandoc is especially useful for documents with embedded code that is executed during the build process. RStudio's RMarkdown [RSt20] and more recently Quarto [RSt22] leverage Pandoc to convert Markdown documents to other formats, with code execution provided by knitr [YX15]. JupyterLab [GP21] centers the writing experience around an interactive, browser-based notebook instead of a Markdown document, but still relies on Pandoc for export to formats other than HTML [Jup22]. There are also ways to interact with a Jupyter notebook as a Markdown document, such as Jupytext [MWtJT20] and Pandoc's own native Jupyter notebook support.

Writing with Pandoc's Markdown or a similar Markdown variant has advantages when multiple output formats are required, since Pandoc provides the conversion capabilities. Pandoc Markdown variants can also serve as a simpler syntax when creating HTML, LaTeX, or similar documents. They allow HTML and LaTeX to be intermixed with Markdown syntax. They also support including raw chunks of text in other formats such as reStructuredText. When executable code is involved, the RMarkdown-style approach of Markdown with embedded code can sometimes be more convenient than a browser-based Jupyter notebook, since the writing process involves more direct interaction with the complete document source.

While using a Pandoc Markdown variant as a source format brings many advantages, the actual writing process itself can be less than ideal, especially when executable code is involved. Pandoc Markdown variants are so powerful precisely because they provide so many extensions to Markdown, but this also means that they can only be fully rendered by Pandoc itself. When text editors such as VS Code provide a built-in Markdown preview, typically only a small subset of Pandoc features is supported, so the representation of the document output will be inaccurate. Some editors provide a visual Markdown editing mode, in which a partially rendered version of the document is displayed in the editor and menus or keyboard shortcuts may replace the direct entry of Markdown syntax. These generally suffer from the same issue. This is only exacerbated when the document embeds code that is executed during the build process, since that goes even further beyond basic Markdown.

An alternative is to use Pandoc itself to generate HTML or PDF output, and then display this as a preview. Depending on the text editor used, the HTML or PDF might be displayed within the text editor in a panel beside the document source, or in a separate browser window or PDF viewer. For example, Quarto offers both possibilities, depending on whether RStudio, VS Code, or another editor is used.[1] While this approach resolves the inaccuracy issues of a basic Markdown preview, it also gives up features such as scroll sync that tightly integrate the Markdown source with the preview. In the case of executable code, there is the additional issue of a time delay in rendering the preview. Pandoc itself can typically convert even a relatively long document in under one second. However, when code is executed as part of the document build process, preview update is blocked until code execution finishes.

This paper introduces Codebraid Preview, a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Codebraid Preview provides a Pandoc-based preview while avoiding most of the traditional drawbacks of this approach. The next section provides an overview of features. This is followed by sections focusing on scroll sync, LaTeX support, and code execution as examples of solutions and remaining challenges in creating a better Pandoc writing experience.

* Corresponding author.
‡ Union University.
Copyright © 2022 Geoffrey M. Poore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

[1] The RStudio editor is unique in also offering a Pandoc-based visual editing mode, starting with version 1.4 from January 2021.
Overview of Codebraid Preview

Codebraid Preview can be installed through the VS Code extension manager. Development is at the codebraid-preview-vscode repository on GitHub. Pandoc must be installed separately. For code execution capabilities, Codebraid must also be installed.

The preview panel can be opened using the VS Code command palette, or by clicking the Codebraid Preview button that is visible when a Markdown document is open. The preview panel takes the document in its current state, converts it into HTML using Pandoc, and displays the result using a webview. An example is shown in Figure 1. Since the preview is generated by Pandoc, all Pandoc features are fully supported.

By default, the preview updates automatically whenever the Markdown source is changed. There is a short, user-configurable minimum update interval. For shorter documents, sub-second updates are typical.

The preview uses the same styling CSS as VS Code's built-in Markdown preview, so it automatically adjusts to the VS Code color theme. For example, changing between light and dark themes changes the background and text colors in the preview.

Codebraid Preview leverages recent Pandoc advances to provide bidirectional scroll sync between the Markdown source and the preview for all CommonMark-based Markdown variants that Pandoc supports (commonmark, gfm, commonmark_x). By default, Codebraid Preview treats Markdown documents as commonmark_x, which is CommonMark with Pandoc extensions for features like math, footnotes, and special list types. The preview still works for other Markdown variants, but scroll sync is disabled. By default, scroll sync is fully bidirectional, so scrolling either the source or the preview will cause the other to scroll to the corresponding location. Scroll sync can instead be configured to be only from source to preview or only from preview to source. As far as I am aware, this is the first time that scroll sync has been implemented in a Pandoc-based preview.

The same underlying features that make scroll sync possible are also used to provide other preview capabilities. Double-clicking in the preview moves the cursor in the editor to the corresponding line of the Markdown source.

Since many Markdown variants support LaTeX math, the preview includes math support via KaTeX [EA22].

Codebraid Preview can simply be used for writing plain Pandoc documents. Optional execution of embedded code is possible with Codebraid [GMP19], using its built-in code execution system or Jupyter kernels. When Jupyter kernels are used, it is possible to obtain the same output that would be present in a Jupyter notebook, including rich output such as plots and mathematics. It is also possible to specify a custom display so that only a selected combination of code, stdout, stderr, and rich output is shown while the rest are hidden. Code execution is decoupled from the preview process, so the Markdown source can be edited and the preview can update even while code is running in the background. As far as I am aware, no previous software for executing code in Markdown has supported building a document with partial code output before execution has completed.

There is also support for document export with Pandoc, using the VS Code command palette or the export-with-Pandoc button.

Scroll sync

Tight source-preview integration requires a source map, or a mapping from characters in the source to characters in the output. Due to Pandoc's parsing algorithms, tracking source location during parsing is not possible in the general case.

Pandoc 2.11.3, released in December 2020, added a sourcepos extension for CommonMark and formats based on it, including GitHub-Flavored Markdown (GFM) and commonmark_x (CommonMark plus extensions similar to Pandoc's Markdown). The CommonMark parser uses a different parsing algorithm from Pandoc's Markdown parser, and this algorithm permits tracking source location. For the first time, it was possible to construct a source map for a Pandoc input format.

Codebraid Preview defaults to commonmark_x as an input format, since it provides the most features of all CommonMark-based formats. Features continue to be added to commonmark_x, and it is gradually nearing feature parity with Pandoc's Markdown. Citations are perhaps the most important feature currently missing.[3]

Codebraid Preview provides full bidirectional scroll sync between source and preview for all CommonMark-based formats, using data provided by sourcepos. In the output HTML, the first image or inline text element created by each Markdown source line is given an id attribute corresponding to the source line number. When the source is scrolled to a given line range, the preview scrolls to the corresponding HTML elements using these id attributes. When the preview is scrolled, the visible HTML elements are detected via the Intersection Observer API.[4] Then their id attributes are used to determine the corresponding Markdown line range, and the source scrolls to those lines.

Scroll sync is slightly more complicated when working with output that is generated by executed code. For example, if a code block is executed and creates several plots in the preview, there isn't necessarily a way to trace each individual plot back to a particular line of code in the Markdown source. In such cases, the line range of the executed code is mapped proportionally to the vertical space occupied by its output.

Pandoc supports multi-file documents: it can be given a list of files to combine into a single output document, and Codebraid Preview provides scroll sync for multi-file documents. For example, suppose a document is divided into two files in the same directory. Treating these as a single document involves creating a YAML configuration file _codebraid_preview.yaml that lists the files:

input-files:
- <first file>
- <second file>

Now launching a preview from either file will display a preview that combines both files. When the preview is scrolled, the editor scrolls to the corresponding source location, automatically switching between the two files depending on the part of the preview that is visible.

[3] The Pandoc Roadmap summarizes current commonmark_x capabilities.
[4] See the Intersection Observer API documentation for technical details and an overview.
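The proportional mapping described above can be sketched in a few lines of Python. The data structures here are hypothetical stand-ins for what the extension derives from sourcepos and the rendered HTML, not Codebraid Preview's actual implementation.

```python
def preview_offset_to_source_line(y, region):
    """Map a vertical preview position to a Markdown source line.

    `region` is a hypothetical (y_top, y_bottom, first_line, last_line)
    tuple describing a block of output generated by executed code; its
    source line range is mapped proportionally onto the vertical space
    the output occupies in the preview.
    """
    y_top, y_bottom, first_line, last_line = region
    if y_bottom <= y_top:
        return first_line
    frac = (y - y_top) / (y_bottom - y_top)
    frac = min(max(frac, 0.0), 1.0)  # clamp positions outside the region
    return first_line + round(frac * (last_line - first_line))

# Output occupying preview pixels 100-500, produced by source lines 10-14.
region = (100, 500, 10, 14)
print(preview_offset_to_source_line(100, region))  # 10
print(preview_offset_to_source_line(300, region))  # 12
print(preview_offset_to_source_line(500, region))  # 14
```

Scrolling halfway through the output region thus lands the editor halfway through the code that produced it, which is the best available approximation when individual plots cannot be traced to individual lines.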
Fig. 1: Screenshot of a Markdown document with Codebraid Preview in VS Code. This document uses Codebraid to execute code with Jupyter kernels, so all plots and math visible in the preview are generated during document build.

The preview still works when the input format is set to a non-CommonMark format, but in that case scroll sync is disabled. If Pandoc adds sourcepos support for additional input formats in the future, scroll sync will work automatically once Codebraid Preview adds those formats to the supported list. It is possible to attempt to reconstruct a source map by performing a parallel string search on Pandoc output and the original source. This can be error-prone due to text manipulation during format conversion, but in the future it may be possible to construct a good enough source map to extend basic scroll sync support to additional input formats.

LaTeX support

Support for mathematics is one of the key features provided by many Markdown variants in Pandoc, including commonmark_x. Math support in the preview panel is supplied by KaTeX [EA22], which is a JavaScript library for rendering LaTeX math in the browser.

One of the disadvantages of using Pandoc to create the preview is that every update of the preview is a complete update. This makes the preview more sensitive to HTML rendering time. In contrast, in a Jupyter notebook, it is common to write Markdown in multiple cells, which are rendered separately and independently.

MathJax [Mat22] provides a broader range of LaTeX support than KaTeX, and is used in software such as JupyterLab and Quarto. While MathJax performance has improved significantly since the release of version 3.0 in 2019, KaTeX can still have a speed advantage, so it is currently the default due to the importance of HTML rendering. In the future, optional MathJax support may be needed to provide broader math support. For some applications, it may also be worth considering caching pre-rendered or image versions of equations to improve performance.

Code execution

Optional support for executing code embedded in Markdown documents is provided by Codebraid [GMP19]. Codebraid uses Pandoc to convert a document into an abstract syntax tree (AST), then extracts any inline or block code marked with Codebraid attributes from the AST, executes the code, and finally formats the code output so that Pandoc can use it to create the final output document. Code execution is performed with Codebraid's own built-in system or with Jupyter kernels. For example, the code block

```{.python .cb-run}
print("Hello *world!*")
```

would result in

Hello world!

after processing by Codebraid and finally Pandoc. Here .cb-run is a Codebraid attribute that marks the code block for execution and specifies the default display of code output. Further examples of Codebraid usage are visible in Figure 1.

Mixing a live preview with executable code presents potential usability and security challenges. By default, code only runs when the user selects execution in the VS Code command palette or clicks the Codebraid execute button. When the preview automatically updates as a result of Markdown source changes, it only uses cached code output.
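The hash-based stale-output detection just described can be sketched as follows. The cache layout and method names are hypothetical simplifications, not Codebraid's actual internals.

```python
import hashlib

def code_hash(code):
    """Key for cached output: a hash of the executed code chunk."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

class OutputCache:
    """Minimal sketch of an output cache keyed by code hashes."""

    def __init__(self):
        self._outputs = {}

    def store(self, code, output):
        self._outputs[code_hash(code)] = output

    def lookup(self, code):
        """Return (output, is_stale); output is stale if the code changed."""
        h = code_hash(code)
        if h in self._outputs:
            return self._outputs[h], False
        # No entry for this exact code: any previously shown output is stale.
        return None, True

cache = OutputCache()
cache.store('print("Hello *world!*")', "Hello world!")
print(cache.lookup('print("Hello *world!*")'))  # cached output, not stale
print(cache.lookup('print("Hello!")'))          # no output, stale
```

Because the key is the code itself, any edit to a chunk immediately invalidates its cached output, which the preview can then flag to the user.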
Stale cached output is detected by hashing the executed code, and is marked in the preview to alert the user.

The standard approach to executing code within Markdown documents blocks the document build process until all code has finished running. Code is extracted from the Markdown source and executed. Then the output is combined with the original source and passed on to Pandoc or another Markdown application for final conversion. This is the approach taken by RMarkdown, Quarto, and similar software, as well as by Codebraid until recently. This design works well for building a document a single time, but blocking until all code has executed is not ideal in the context of a document preview.

Codebraid now offers a new mode of code execution that allows a document to be rebuilt continuously during code execution, with each build including all code output available at that time. This process involves the following steps:

1) The user selects code execution. Codebraid Preview passes the document to Codebraid. Codebraid begins code execution.

2) As soon as any code output is available, Codebraid immediately streams it back to Codebraid Preview. The output is in a format compatible with the YAML metadata block at the start of Pandoc Markdown documents. The output includes a hash of the code that was executed, so that code changes can be detected later.

3) If the document is modified while code is running or if code output is received, Codebraid Preview rebuilds the preview. It creates a copy of the document with all current Codebraid output inserted into the YAML metadata block at the start of the document. This modified document is then passed to Pandoc. Pandoc runs with a Lua filter[5] that modifies the document AST before final conversion. The filter removes all code marked with Codebraid attributes from the AST, and replaces it with the corresponding code output stored in the AST metadata. If code has been modified since execution began, this is detected

While this build process is significantly more interactive than what has been possible previously, it also suggests additional avenues for future exploration. Codebraid's built-in code execution system is designed to execute a predefined sequence of code chunks and then exit. Jupyter kernels are currently used in the same manner to avoid any potential issues with out-of-order execution. However, Jupyter kernels can receive and execute code indefinitely, which is how they commonly function in Jupyter notebooks. Instead of starting a new Jupyter kernel at the beginning of each code execution cycle, it would be possible to keep the kernel from the previous execution cycle and only pass modified code chunks to it. This would allow the same out-of-order execution issues that are possible in a Jupyter notebook, yet it would make much more rapid code output possible, particularly in cases where large datasets must be loaded or significant preprocessing is required.

Conclusion

Codebraid Preview represents a significant advance in tools for writing with Pandoc. For the first time, it is possible to preview a Pandoc Markdown document using Pandoc itself while having features like scroll sync between the Markdown source and the preview. When embedded code needs to be executed, it is possible to see code output in the preview and to continue editing the document during code execution, instead of having to wait until code finishes running.

Codebraid Preview or future previewers that follow this approach may be perfectly adequate for shorter and even some longer documents, but at some point a combination of document length, document complexity, and mathematical content will strain what is possible and ultimately decrease preview update frequency. Every update of the preview involves converting the entire document with Pandoc and then rendering the resulting HTML.

On the parsing side, Pandoc's move toward CommonMark-based Markdown variants may eventually lead to enough standardization that other implementations with the same syntax and features are possible. This in turn might enable entirely new
           with the hash of the code, and an HTML class is added
                                                                                approaches. An ideal scenario would be a Pandoc-compatible
           to the output that will mark it visually as stale output.
                                                                                JavaScript-based parser that can parse multiple Markdown strings
           Code that does not yet have output is replaced by a
                                                                                while treating them as having a shared document state for things
           visible placeholder to indicate that code is still running.
                                                                                like labels, references, and numbering. For example, this could
           When the Lua filter finishes AST modifications, Pandoc
                                                                                allow Pandoc Markdown within a Jupyter notebook, with all
           completes the document build, and the preview updates.
                                                                                Markdown content sharing a single document state, maybe with
      4)   As long as code is executing, the previous process repeats
                                                                                each Markdown cell being automatically updated based on Mark-
           whenever the preview needs to be rebuilt.
                                                                                down changes elsewhere.
      5)   Once code execution completes, the most recent output is
                                                                                    Perhaps more practically, on the preview display side, there
           reused for all subsequent preview updates until the next
                                                                                may be ways to optimize how the HTML generated by Pandoc is
           time the user chooses to execute code. Any code changes
                                                                                loaded in the preview. A related consideration might be alternative
           continue to be detected by hashing the code during the
                                                                                preview formats. There is a significant tradition of tight source-
           build process, so that the output can be marked visually
                                                                                preview integration in LaTeX (for example, [Lau08]). In principle,
           as stale in the preview.
                                                                                Pandoc’s sourcepos extension should make possible Mark-
    The overall result of this process is twofold. First, building              down to PDF synchronization, using LaTeX as an intermediary.
a document involving executed code is nearly as fast as building
a plain Pandoc document. The additional output metadata plus                    R EFERENCES
the filter are the only extra elements involved in the document
                                                                                [EA22]     Emily Eisenberg and Sophie Alpert. KaTeX: The fastest math
build, and Pandoc Lua filters have excellent performance. Second,                          typesetting library for the web, 2022. URL:
the output for each code chunk appears in the preview almost                    [GMP19]    Geoffrey M. Poore. Codebraid: Live Code in Pandoc Mark-
immediately after the chunk finishes execution.                                            down. In Chris Calloway, David Lippa, Dillon Niederhut, and
                                                                                           David Shupe, editors, Proceedings of the 18th Python in Science
                                                                                           Conference, pages 54 – 61, 2019. doi:10.25080/Majora-
  5. For an overview of Lua filters, see              7ddc1dd1-008.

[GP21]     Brian E. Granger and Fernando Pérez. Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021.
[JM22]     John MacFarlane. Pandoc: a universal document converter, 2006–2022.
[Jup22]    Jupyter Development Team. nbconvert: Convert Notebooks to other formats, 2015–2022. URL: https://nbconvert.readthedocs.
[Lau08]    Jérôme Laurens. Direct and reverse synchronization with SyncTeX. TUGboat, 29(3):365–371, 2008.
[Mat22]    MathJax. MathJax: Beautiful and accessible math in all browsers, 2009–2022.
[MWtJT20]  Marc Wouts and the Jupytext Team. Jupyter notebooks as Markdown documents, Julia, Python or R scripts, 2018–2020.
[RSt20]    RStudio Inc. R Markdown, 2016–2020. URL: https://rmarkdown.
[RSt22]    RStudio Inc. Welcome to Quarto, 2022.
[YX15]     Yihui Xie. Dynamic Documents with R and knitr. Chapman & Hall/CRC Press, 2015.
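As a rough illustration of the build loop described in steps 1–5 above, the sketch below mimics the filter's per-chunk decision (current output, stale output, or placeholder) using a code hash. The function names and data shapes are illustrative only, not Codebraid's actual API:

```python
import hashlib

def code_hash(code):
    """Hash a code chunk so that later edits to it can be detected."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

def render_output(chunk, metadata):
    """Mimic the filter's per-chunk decision during a preview build.

    `chunk` is a dict like {"name": ..., "code": ...}; `metadata` maps
    chunk names to {"hash": ..., "output": ...} entries streamed back by
    the code runner.  These shapes are hypothetical, for illustration.
    """
    entry = metadata.get(chunk["name"])
    if entry is None:
        # No output yet: show a placeholder while code is still running.
        return "<executing...>"
    if entry["hash"] != code_hash(chunk["code"]):
        # Code changed since execution began: mark the output as stale.
        return '<span class="stale">' + entry["output"] + "</span>"
    return entry["output"]
```

Because staleness is decided per chunk from the stored hash, unmodified chunks keep their cached output across rebuilds, which is what makes each rebuild nearly as cheap as a plain Pandoc run.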
110                                                                                                        PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

            Incorporating Task-Agnostic Information in
          Task-Based Active Learning Using a Variational
                            Autoencoder
                      Curtis Godwin‡†∗ , Meekail Zain§†∗ , Nathan Safir‡ , Bella Humphrey§ , Shannon P Quinn§¶


Abstract—It is often much easier and less expensive to collect data than to label it. Active learning (AL) ([Set09]) responds to this issue by selecting which unlabeled data are best to label next. Standard approaches utilize task-aware AL, which identifies informative samples based on a trained supervised model. Task-agnostic AL ignores the task model and instead makes selections based on learned properties of the dataset. We seek to combine these approaches and measure the contribution of incorporating task-agnostic information into standard AL, with the suspicion that the extra information in the task-agnostic features may improve the selection process. We test this on various AL methods using a ResNet classifier with and without added unsupervised information from a variational autoencoder (VAE). Although the results do not show a significant improvement, we investigate the effects on the acquisition function and suggest potential approaches for extending the work.

Index Terms—active learning, variational autoencoder, deep learning, pytorch, semi-supervised learning, unsupervised learning

† These authors contributed equally.
∗ Corresponding author:
‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602
§ Department of Computer Science, University of Georgia, Athens, GA 30602
¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602

Copyright © 2022 Curtis Godwin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

In deep learning, the capacity for data gathering often significantly outpaces the labeling. This is easily observed in the field of bioimaging, where ground-truth labeling usually requires the expertise of a clinician. For example, producing a large quantity of CT scans is relatively simple, but having them labeled for COVID-19 by cardiologists takes much more time and money. These constraints ultimately limit the contribution of deep learning to many crucial research problems.
    This labeling issue has compelled advancements in the field of active learning (AL) ([Set09]). In a typical AL setting, there is a set of labeled data and a (usually larger) set of unlabeled data. A model is trained on the labeled data, then the model is analyzed to evaluate which unlabeled points should be labeled to best improve the loss objective after further training. AL acknowledges labeling constraints by specifying a budget of points that can be labeled at a time and evaluating against this budget.
    In AL, the model for which we select new labels is referred to as the task model. If this model is a classifier neural network, the space in which it maps inputs before classifying them is known as the latent space or representation space. A recent branch of AL ([SS18], [SCN+18], [YK19]), prominent for its applications to deep models, focuses on mapping unlabeled points into the task model's latent space before comparing them.
    These methods are limited in their analysis by the labeled data they must train on, failing to make use of potentially useful information embedded in the unlabeled data. We therefore suggest that this family of methods may be improved by extending their representation spaces to include unsupervised features learned over the entire dataset. For this purpose, we opt to use a variational autoencoder (VAE) ([KW13]), which is a prominent method for unsupervised representation learning. Our main contributions are (a) a new methodology for extending AL methods using VAE features and (b) an experiment comparing AL performance across two recent feature-based AL methods using the new method.

Related Literature

Active learning

Much of the early active learning (AL) literature is based on shallower, less computationally demanding networks since deeper architectures were not well-developed at the time. Settles ([Set09]) provides a review of these early methods. The modern approach uses an acquisition function, which involves ranking all available unlabeled points by some chosen heuristic H and choosing to label the points of highest ranking.
    The popularity of the acquisition approach has led to a widely-used evaluation procedure, which we describe in Algorithm 1.
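A minimal sketch of this pool-based evaluation loop (in the spirit of Algorithm 1) is given below. The `train`, `accuracy`, `oracle`, and `heuristic` callables are placeholders supplied by the caller, since the paper's own implementation is not reproduced here:

```python
def evaluate_acquisition(heuristic, train, accuracy, oracle,
                         model, labeled, unlabeled, test_set,
                         budget, n_rounds):
    """Pool-based AL evaluation loop in the spirit of Algorithm 1.

    Each round: train the task model on the labeled set, record its test
    accuracy, rank the unlabeled pool with the heuristic, and move the
    top-`budget` points (with oracle-provided labels) to the labeled set.
    All callables are supplied by the caller; nothing here is specific
    to any one heuristic.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    accuracies = []
    for _ in range(n_rounds):
        model = train(model, labeled)
        accuracies.append(accuracy(model, test_set))
        ranked = sorted(unlabeled,
                        key=lambda u: heuristic(model, u, labeled),
                        reverse=True)
        chosen = ranked[:budget]
        labeled += [(u, oracle(u)) for u in chosen]
        unlabeled = [u for u in unlabeled if u not in chosen]
    return accuracies
```

Running this loop once per candidate heuristic, with everything else frozen, yields the accuracy-versus-labels curves used to compare acquisition functions.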

This procedure trains a task model T on the initial labeled data, records its test accuracy, then uses H to label a set of unlabeled points. We then once again train T on the labeled data and record its accuracy. This is repeated until a desired number of labels is reached, and then the accuracies can be graphed against the number of available labels to demonstrate performance over the course of labeling. We can use this evaluation algorithm to separately evaluate multiple acquisition functions on their resulting accuracy graphs. This is utilized in many AL papers to show the efficacy of their suggested heuristics in comparison to others ([WZL+16], [SS18], [SCN+18], [YK19]).
    The prevailing approach to point selection has been to choose unlabeled points for which the model is most uncertain, the assumption being that uncertain points will be the most informative ([BRK21]). A popular early method was to label the unlabeled points of highest Shannon entropy ([Sha48]) under the task model, which is a measure of uncertainty between the classes of the data. This method is now more commonly used in combination with a representativeness measure ([WZL+16]) to avoid selecting condensed clusters of very similar points.

Recent heuristics using deep features

For convolutional neural networks (CNNs) in image classification settings, the task model T can be decomposed into a feature-generating module

    T_f : \mathbb{R}^n \to \mathbb{R}^f,

which maps the input data vectors to the output of the final fully connected layer before classification, and a classification module

    T_c : \mathbb{R}^f \to \{0, 1, \ldots, c\},

where c is the number of classes.
    Recent deep learning-based AL methods have approached the notion of model uncertainty in terms of the rich features generated by the learned model. Core-set ([SS18]) and MedAL ([SCN+18]) select unlabeled points that are the furthest from the labeled set in terms of L2 distance between the learned features. For core-set, each point constructing the set S in step 6 of Algorithm 1 is chosen by

    u^* = \operatorname{argmax}_{u \in U} \min_{\ell \in L} \lVert T_f(u) - T_f(\ell) \rVert_2,    (1)

where U is the unlabeled set and L is the labeled set. The analogous operation for MedAL is

    u^* = \operatorname{argmax}_{u \in U} \frac{1}{|L|} \sum_{i=1}^{|L|} \lVert T_f(u) - T_f(L_i) \rVert_2.    (2)

Note that after a point u^* is chosen, the selection of the next point assumes the previous u^* to be in the labeled set. This way we discourage choosing sets that are closely packed together, leading to sets that are more diverse in terms of their features. This effect is more pronounced in the core-set method since it takes the minimum distance whereas MedAL uses the average distance.
    Another recent method ([YK19]) trains a regression network to predict the loss of the task model, then takes the heuristic H in Algorithm 1 to select the unlabeled points of highest predicted loss. To implement this, the loss prediction network P is attached to a ResNet task model T and is trained jointly with T. The inputs to P are the features output by the ResNet's four residual blocks. These features are mapped into the same dimensionality via a fully connected layer and then concatenated to form a representation c. An additional fully connected layer then maps c into a single value constituting the loss prediction.
    When attempting to train a network to directly predict T's loss during training, the ground truth losses naturally decrease as T is optimized, resulting in a moving objective. The authors of ([YK19]) find that a more stable ground truth is the inequality between the losses of given pairs of points. In this case, P is trained on pairs of labeled points, so that P is penalized for producing predicted loss pairs that exhibit a different inequality than the corresponding true loss pair.
    More specifically, for each batch of labeled data L_{batch} ⊂ L that is propagated through T during training, the batch of true losses is computed and split randomly into a batch of pairs P_{batch}. The loss prediction network produces a corresponding batch of predicted loss pairs, denoted \tilde{P}_{batch}. The following pair loss is then computed given each p ∈ P_{batch} and its corresponding \tilde{p} ∈ \tilde{P}_{batch}:

    L_{pair}(p, \tilde{p}) = \max(0, -I(p) \cdot (\tilde{p}^{(1)} - \tilde{p}^{(2)}) + \xi),    (3)

where I is the following indicator function for pair inequality:

    I(p) = \begin{cases} 1, & p^{(1)} > p^{(2)} \\ -1, & p^{(1)} \le p^{(2)} \end{cases}.    (4)

Variational Autoencoders

Variational autoencoders (VAEs) ([KW13]) are an unsupervised method for modeling data using Bayesian posterior inference. We begin with the Bayesian assumption that the data is well-modeled by some distribution, often a multivariate Gaussian. We also assume that this data distribution can be inferred reasonably well by a lower dimensional random variable, also often modeled by a multivariate Gaussian.
    The inference process then consists of an encoding into the lower dimensional latent variable, followed by a decoding back into the data dimension. We parametrize both the encoder and the decoder as neural networks, jointly optimizing their parameters with the following loss function ([KW19]):

    L_{\theta,\phi}(x) = \log p_\theta(x|z) + [\log p_\theta(z) - \log q_\phi(z|x)],    (5)

where θ and φ are the parameters of the decoder and the encoder, respectively. The first term is the reconstruction error, penalizing the parameters for producing poor reconstructions of the input data. The second term is the regularization error, encouraging the encoding to resemble a pre-selected prior distribution, commonly a unit Gaussian prior.
    The encoder of a well-optimized VAE can be used to generate latent encodings with rich features which are sufficient to approximately reconstruct the data. The features also have some geometric consistency, in the sense that the encoder is encouraged to generate encodings in the pattern of a Gaussian distribution.

Methods

We observe that the notions of uncertainty developed in the core-set and MedAL methods rely on distances between feature vectors modeled by the task model T. Additionally, loss prediction relies on a fully connected layer mapping from a feature space to a single value, producing different predictions depending on the values of the relevant feature vector. Thus all of these methods utilize spatial reasoning in a vector space.
    Furthermore, in each of these methods, the heuristic H only has access to information learned by the task model, which is

trained only on the labeled points at a given timestep in the labeling procedure. Since variational autoencoder (VAE) encodings are not limited by the contents of the labeled set, we suggest that the aforementioned methods may benefit by expanding the vector spaces they investigate to include VAE features learned across the entire dataset, including the unlabeled data. These additional features will constitute representative and previously inaccessible information regarding the data, which may improve the active learning process.
    We implement this by first training a VAE model V on the given dataset. V can then be used as a function returning the VAE features for any given datapoint. We append these additional features to the relevant vector spaces using vector concatenation, an operation we denote with the symbol ⌢. The modified point selection operation in core-set then becomes

    u^* = \operatorname{argmax}_{u \in U} \min_{\ell \in L} \lVert [T_f(u) \frown \alpha V(u)] - [T_f(\ell) \frown \alpha V(\ell)] \rVert_2,    (6)

where α is a hyperparameter that scales the influence of the VAE features in computing the vector distance. To similarly modify the loss prediction method, we concatenate the VAE features to the final ResNet feature concatenation c before the loss prediction, so that the extra information is factored into the training of the prediction network P.

Experiments

In order to measure the efficacy of the newly proposed methods, we generate accuracy graphs using Algorithm 1, freezing all settings except the selection heuristic H. We then compare the performance of the core-set and loss prediction heuristics with their VAE-augmented counterparts.
    We use ResNet-18 pretrained on ImageNet as the task model, using the SGD optimizer with learning rate 0.001 and momentum 0.9. We train on the MNIST ([Den12]) and ChestMNIST ([YSN21]) datasets. ChestMNIST consists of 112,120 chest X-ray images resized to 28×28 and is one of several benchmark medical image datasets introduced in ([YSN21]).
    For both datasets we experiment on randomly selected subsets, using 25000 points for MNIST and 30000 points for ChestMNIST. In both cases we begin with 3000 initial labels and label 3000 points per active learning step. We opt to retrain the task model after each labeling step instead of fine-tuning.
    We use a similar training strategy as in ([SCN+18]), training the task model until >99% train accuracy before selecting new points to label. This ensures that the ResNet is similarly well fit to the labeled data at each labeling iteration. This is implemented by training for 10 epochs on the initial training set and increasing the training epochs by 5 after each labeling iteration.
    The VAEs used for the experiments are trained for 20 epochs using an Adam optimizer with learning rate 0.001 and weight decay 0.005. The VAE encoder architecture consists of four convolutional downsampling filters and two linear layers to learn the low dimensional mean and log variance. The decoder consists of an upsampling convolution and four size-preserving convolutions to learn the reconstruction.
    Experiments were run five times, each with a separate set of randomly chosen initial labels, with the displayed results showing the average validation accuracies across all runs. Figures 1 and 3 show the core-set results, while Figures 2 and 4 show the loss prediction results. In all cases, shared random seeds were used to ensure that the task models being compared were supplied with the same initial set of labels.
    With four NVIDIA 2080 GPUs, the total runtime for the MNIST experiments was 5113s for core-set and 4955s for loss prediction; for ChestMNIST, the total runtime was 7085s for core-set and 7209s for loss prediction.

Fig. 1: The average MNIST results using the core-set heuristic versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.

Fig. 2: The average MNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs.

Fig. 3: The average ChestMNIST results using the core-set heuristic versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.

    To investigate the qualitative difference between the VAE and non-VAE approaches, we performed an additional experiment

Fig. 4: The average ChestMNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs.

to visualize an example of core-set selection. We first train the ResNet-18 with the same hyperparameter settings on 1000 initial labels from the ChestMNIST dataset, then randomly choose 1556 (5%) of the unlabeled points from which to select 100 points to label. These smaller sizes were chosen to promote visual clarity in the output graphs.
    We use t-SNE ([VdMH08]) dimensionality reduction to show the ResNet features of the labeled set, the unlabeled set, and the points chosen to be labeled by core-set.

Fig. 5: A t-SNE visualization of the ChestMNIST points chosen by core-set.

Fig. 6: A t-SNE visualization of the ChestMNIST points chosen by core-set when the ResNet features are augmented with VAE features.

Discussion

Overall, the VAE-augmented active learning heuristics did not exhibit a significant performance difference when compared with their counterparts. The only case of a significant p-value (<0.05) occurred during loss prediction on the MNIST dataset at 21000 labels.
    The t-SNE visualizations in Figures 5 and 6 show some of the influence that the VAE features have on the core-set selection process. In Figure 5, the selected points tend to be more spread out, while in Figure 6 they cluster at one edge. This appears to mirror the transformation of the rest of the data, which is more spread out without the VAE features, but becomes condensed in the center when they are introduced, approaching the shape of a Gaussian distribution.
    It seems that with the added VAE features, the selected points are further out of distribution in the latent space. This makes sense because points tend to be more sparse at the tails of a Gaussian distribution and core-set prioritizes points that are well-isolated from other points.
    One reason for the lack of performance improvement may be the homogeneous nature of the VAE, where the optimization goal is reconstruction rather than classification. This could be improved by using a multimodal prior in the VAE, which may do a better job of modeling relevant differences between points.

Conclusion

Our original intuition was that additional unsupervised information may improve established active learning methods, especially when using a modern unsupervised representation method such as a VAE. The experimental results did not support this hypothesis, but additional investigation of the VAE features showed a notable change in the task model latent space. Though this did not result in superior point selections in our case, it is of interest whether different approaches to latent space augmentation in active learning may fare better.
    Future work may explore the use of class-conditional VAEs in a similar application, since a VAE that can utilize the available class labels may produce more effective representations, and it could be retrained along with the task model after each labeling step.

REFERENCES

[BRK21]    Samuel Budd, Emma C Robinson, and Bernhard Kainz. A survey on active learning and human-in-the-loop deep learning

           for medical image analysis. Medical Image Analysis, 71:102062, 2021. doi:10.1016/
[Den12]    Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. doi:10.1109/MSP.2012.2211477.
[KW13]     Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[KW19]     Diederik P. Kingma and Max Welling. An Introduction to Variational Autoencoders. Now Publishers, 2019.
[SCN+18]   Asim Smailagic, Pedro Costa, Hae Young Noh, Devesh Walawalkar, Kartik Khandelwal, Adrian Galdran, Mostafa Mirshekari, Jonathon Fagert, Susu Xu, Pei Zhang, et al. MedAL: Accurate and robust deep active learning for medical image analysis. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 481–488. IEEE, 2018.
[Set09]    Burr Settles. Active learning literature survey. 2009.
[Sha48]    Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
[SS18]     Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
[VdMH08]   Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[WZL+16]   Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016. doi:10.1109/tcsvt.2016.
[YK19]     Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.
[YSN21]    Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195, 2021.
                     Awkward Packaging: building Scikit-HEP
                                              Henry Schreiner‡∗ , Jim Pivarski‡ , Eduardo Rodrigues§


Abstract—Scikit-HEP has grown rapidly over the last few years, not just to serve the needs of the High Energy Physics (HEP) community, but in many ways, the Python ecosystem at large. AwkwardArray, boost-histogram/hist, and iminuit are examples of libraries that are used beyond the original HEP focus. In this paper we will look at key packages in the ecosystem, and how the collection of 30+ packages was developed and maintained. We will also look at some of the software ecosystem contributions made to packages like cibuildwheel, pybind11, nox, scikit-build, build, and pipx that support this effort. We will also discuss the Scikit-HEP developer pages and initial WebAssembly support.

Index Terms—packaging, ecosystem, high energy physics, community project

High Energy Physics (HEP) has always had intense computing needs due to the size and scale of the data collected. The World Wide Web was invented at the CERN physics laboratory in Switzerland in 1989 when scientists in the EU were trying to communicate results and datasets with scientists in the US, and vice versa [LCC+09]. Today, HEP has the largest scientific machine in the world, at CERN: the Large Hadron Collider (LHC), 27 km in circumference [EB08], with multiple experiments with thousands of collaborators processing over a petabyte of raw data every day, with 100 petabytes being stored per year at CERN. This is one of the largest scientific datasets in the world, of exabyte scale [PJ11], roughly comparable in order of magnitude to all of astronomy or YouTube [SLF+15].
    In the mid nineties, HEP users were beginning to look for a new language to replace Fortran. A few HEP scientists started investigating the use of Python around the release of 1.0.0 in 1994 [Tem22]. A year later, the ROOT project for an analysis toolkit (and framework) was released, quickly making C++ the main language for HEP. The ROOT project also needed an interpreted language to drive analysis code. Python was rejected for this role due to being "exotic" at the time, and because it was considered too much to ask physicists to code in two languages. Instead, ROOT provided a C++ interpreter, called CINT, which was later replaced with Cling, the basis for the clang-repl project in LLVM today [IVL22].
    Python would start showing up in the late 90's in experiment frameworks as a configuration language. These frameworks were primarily written in C++, but were made of many configurable parts [Lam98]. The glueing together of the system was done in Python, a model still popular today, though some experiments are now using Python + Numba as an alternative model, such as the Xenon1T experiment [RTA+17], [RS21].
    In the early 2000s, the use of Python in HEP exploded, heavily driven by experiments like LHCb developing frameworks and user tools for scripting. ROOT started providing Python bindings in 2004 [LGMM05] that were not considered Pythonic [GTW20], and still required a complex multi-hour build of ROOT to use¹. Analyses still consisted largely of ROOT, with Python sometimes showing up.
    By the mid 2010's, a marked change had occurred, driven by the success of Python in Data Science, especially in education. Many new students were coming into HEP with little or no C++ experience, but with existing knowledge of Python and the growing Python data science ecosystem, like NumPy and Pandas. Several HEP experiment analyses were performed in, or driven by, Python, with ROOT only being used for things that were not available in the Python ecosystem. Some of these were HEP specific: ROOT is also a data format, so users needed to be able to read data from ROOT files. Others were less specific: HEP users have intense histogram requirements due to the data sizes; large portions of HEP data are "jagged" rather than rectangular; vector manipulation was important (especially Lorentz vectors, four-dimensional relativistic vectors with a non-Euclidean metric); and data fitting was important, especially with complex models and accurate error estimation.

Beginnings of a scikit

In 2016, the ecosystem for Python in HEP was rather fragmented. Physicists were developing tools in isolation, without knowing about the overlaps with other tools, and without making them interoperable. There were a handful of popular packages that were useful in HEP spread around among different authors. The ROOTPy project had several packages that made the ROOT-Python bridge a little easier than the built-in PyROOT, such as the root-numpy and related root-pandas packages. The C++ MINUIT fitting library was integrated into ROOT, but the iminuit package [Dea20] provided an easy-to-install standalone Python package with an extracted copy of MINUIT. Several other specialized standalone C++ packages had bindings as well. Many of the initial authors were transitioning to a less code-centric role or leaving for industry, leaving projects like ROOTPy and iminuit without maintainers.

* Corresponding author:
‡ Princeton University
§ University of Liverpool

Copyright © 2022 Henry Schreiner et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

    1. Almost 20 years later ROOT's Python bindings have been rewritten for easier Pythonizations, and installing ROOT in Conda is now much easier, thanks in large part to efforts from Scikit-HEP developers.
[Figure 1 shows the network of Scikit-HEP and affiliated packages, including pyhepmc, nndrone, pylhe, hepunits, and uhi.]
Fig. 1: The Scikit-HEP ecosystem and affiliated packages.

    Eduardo Rodrigues, a scientist working on the LHCb experiment for the University of Cincinnati, started working on a package called scikit-hep that would provide a set of tools useful for physicists working on HEP analysis. The initial version of the scikit-hep package had a simple vector library, HEP related units and conversions, several useful statistical tools, and provenance recording functionality.
    He also placed the scikit-hep GitHub repository in a Scikit-HEP GitHub organization, and asked several of the other HEP related packages to join. The ROOTPy project was ending, with the primary author moving on, and so several of the then-popular packages² that were included in the ROOTPy organization were happily transferred to Scikit-HEP. Several other existing HEP libraries, primarily interfacing to existing C++ simulation and tracking frameworks, also joined, like PyJet and NumPythia. Some of these libraries have been retired or replaced today, but were an important part of Scikit-HEP's initial growth.

First initial success

In 2016, the largest barrier to using Python in HEP in a Pythonic way was ROOT. It was challenging to compile, had many non-Python dependencies, was huge compared to most Python libraries, and didn't play well with Python packaging. It was not Pythonic, meaning it had very little support for Python protocols like iteration, buffers, keyword arguments, tab completion, inspection, and dunder methods; it didn't follow conventions for useful reprs or Python naming conventions; it was simply a direct on-demand C++ binding, including pointers. Many Python analyses started with a "convert data" step using PyROOT to read ROOT files and convert them to a Python friendly format like HDF5. Then the bulk of the analysis would use reproducible Python virtual environments or Conda environments.
    This changed when Jim Pivarski introduced the Uproot package, a pure-Python implementation of a ROOT file reader (and later writer) that could remove the initial conversion environment by simply pip installing a package. It also had a simple, Pythonic interface and produced outputs Python users could immediately use, like NumPy arrays, instead of PyROOT's wrapped C++ objects.
    Uproot needed to do more than just be a file format reader/writer; it needed to provide a way to represent the special structure and common objects that ROOT files could contain. This led to the development of two related packages that would support uproot. One, uproot-methods, included Pythonic access to functionality provided by ROOT for its core classes, like spatial and Lorentz vectors. The other was AwkwardArray, which would grow to become one of the most important and most general packages in Scikit-HEP. This package allows NumPy-like idioms for array-at-a-time manipulation on jagged data structures. A jagged array is a (possibly structured) array with a variable-length dimension. These are very common and relevant in HEP; events have a variable number of tracks, tracks have a variable number of hits in the detector, etc. Many other fields also have jagged data structures. While there are formats to store such structures, computations on jagged structures have usually been closer to SQL queries on multiple tables than direct object manipulation. Pandas handles this through multiple indexing and a lot of duplication.
    Uproot was a huge hit with incoming HEP students (see Fig 2); suddenly they could access HEP data using a library installed with pip or conda and no external compiler or library requirements, and could easily use tools they already knew that were compatible with the Python buffer protocol, like NumPy, Pandas, and the rapidly growing machine learning frameworks. There were still some gaps and pain points in the ecosystem, but an analysis without writing C++ (interpreted or compiled) and without compiling ROOT manually was finally possible. Scikit-HEP did not and does not intend to replace ROOT, but it provides alternative solutions that work natively in the Python "Big Data" ecosystem.
    Several other useful HEP libraries were also written. Particle was written for accessing the Particle Data Group (PDG) particle data in a simple and Pythonic way. DecayLanguage originally provided tooling for decay definitions, but was quickly expanded to include tools to read and validate "DEC" decay files, an existing text format used to configure simulations in HEP.

Building compiled packages

In 2018, HEP physicist and programmer Hans Dembinski proposed a histogram library to the Boost libraries, the most influential C++ library collection; many additions to the standard library are based on Boost. Boost.Histogram provided a histogram-as-an-object concept from HEP, but was designed around C++14 templating, using composable axes and storage types. It originally had a Python binding written in Boost::Python. Henry Schreiner proposed the creation of a standalone binding to be written with pybind11 in Scikit-HEP. The original bindings were removed, Boost::Histogram was accepted into the Boost libraries, and work began on boost-histogram. IRIS-HEP, a multi-institution project for sustainable HEP software, had just started and was providing funding for several developers to work on Scikit-HEP project packages such as this one. This project would pioneer standalone C++ library development and deployment for Scikit-HEP.
    There were already a variety of attempts at histogram libraries, but none of them filled the requirements of HEP physicists:

    2. The primary package of the ROOTPy project, also called ROOTPy, was not transferred, but instead had a final release and then died. It was an inspiration for the new PyROOT bindings, and influenced later Scikit-HEP packages like mplhep. The transferred libraries have since been replaced by integrated ROOT functionality. All these packages required ROOT, which is not on PyPI, so were not suited for a Python-centric ecosystem.
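The jagged-array idea described above can be sketched in plain NumPy. The content/offsets layout below mirrors the general idea behind columnar jagged storage; the variable names and this minimal scheme are illustrative only, not AwkwardArray's actual API:

```python
import numpy as np

# A jagged array stored as a flat "content" buffer plus "offsets":
# event i owns content[offsets[i]:offsets[i+1]].
# (Hypothetical values: track pT for three events with 2, 0, and 3 tracks.)
content = np.array([20.1, 3.5, 42.0, 7.7, 1.2])
offsets = np.array([0, 2, 2, 5])

# One vectorized pass computes the selection for every track at once...
mask = content > 5.0

# ...and per-event results come from reducing between event boundaries.
counts = [int(mask[lo:hi].sum()) for lo, hi in zip(offsets[:-1], offsets[1:])]
print(counts)  # [1, 0, 2]
```

AwkwardArray generalizes this pattern so that slicing, masking, and reductions on jagged data all look like ordinary NumPy expressions, with the offset bookkeeping handled internally.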
[Figure 2 plots usage over time of ROOT (C++ and PyROOT) as a baseline for scale, CMSSW config (Python but not data analysis), PyROOT, and Scikit-HEP packages.]
Fig. 2: Adoption of scientific Python libraries and Scikit-HEP among members of the CMS experiment (one of the four major LHC experiments). CMS requires users to fork github:cms-sw/cmssw, which can be used to identify 3484 physicist users, who created 16656 non-fork repos. This plot quantifies adoption by counting "#include X", "import X", and "from X import" strings in the users' code to measure adoption of various libraries (most popular by category are shown).

[Figure 3 is an annotated timeline covering the histogram part of ROOT (395 C++ files), histograms in rootpy, YODA, histograms in Coffea, and boost-histogram/hist/mplhep, with mainstream Python adoption in HEP marking when many histogram libraries lived and died.]
Fig. 3: Developer activity on histogram libraries in HEP: number of unique committers to each library per month, smoothed (derived from git logs). Illustrates the convergence of a fractured community (around 2017) into a unified one (now).

fills on pre-existing histograms, simple manipulation of multi-dimensional histograms, competitive performance, and easy installation in clusters or for students. Any new attempt here would have to be clearly better than the existing collection of diverse attempts (see Fig 3). The development of a library with compiled components intended to be usable everywhere required good support for building libraries that was lacking both in Scikit-HEP and, to an extent, the broader Python ecosystem. There had been previous advancements in the packaging ecosystem, such as the wheel format for distributing binary platform-dependent Python packages and the manylinux specification and docker image that allowed a single compiled wheel to target many distributions of Linux, but there still were many challenges to making a library redistributable on all platforms.
    The boost-histogram library only depended on header-only components of the Boost libraries and the header-only pybind11 package, so it was able to avoid a separate compile step or linking to external dependencies, which simplified the initial build process. All needed files were collected from git submodules and packed into a source distribution (SDist), and everything was built using only setuptools, making build-from-source simple on any system supporting C++14. This did not include RHEL 7, a popular platform in HEP at the time, and on any platform building could take several minutes and required several gigabytes of memory to resolve the heavy C++ templating in the Boost libraries and pybind11.
    The first stand-alone development was azure-wheel-helpers, a set of files that helped produce wheels on the new Azure Pipelines platform. Building redistributable wheels requires a variety of techniques, even without shared libraries, that vary dramatically between platforms and were/are poorly documented. On Linux, everything needs to be built inside a controlled manylinux image and post-processed by the auditwheel tool. On macOS, this includes downloading an official CPython binary for Python to allow older versions of macOS to be targeted (10.9+), several special environment variables, especially when cross-compiling to Apple Silicon, and post-processing with the delocate tool. Windows is the simplest, as most versions of CPython work identically there. azure-wheel-helpers worked well, and was quickly adapted for the other packages in Scikit-HEP that included non-ROOT binary components. Work here would eventually be merged into the existing and general cibuildwheel package, which would become the build tool for all non-ROOT binary packages in Scikit-HEP, as well as over 600 other packages like matplotlib and numpy, and was accepted into the PyPA (Python Packaging Authority).
    The second major development was the upstreaming of CI and build system developments to pybind11. Pybind11 is a C++ API for Python designed for writing a binding to C++, and provided significant benefits to our packages over (mis)using Cython for bindings; Cython was designed to transpile a Python-
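The histogram-as-an-object concept that Boost.Histogram brought from HEP — axes plus storage that can be filled repeatedly — can be sketched in a few lines of NumPy. The class and method names here are illustrative, not the boost-histogram API:

```python
import numpy as np

# Minimal sketch of a histogram object with a regular axis and a counts
# storage that supports repeated fills (unlike np.histogram, which
# computes a one-shot result).
class Hist1D:
    def __init__(self, bins, lo, hi):
        self.edges = np.linspace(lo, hi, bins + 1)
        self.counts = np.zeros(bins, dtype=np.int64)

    def fill(self, data):
        # Locate the bin for each value; values outside [lo, hi) are dropped.
        idx = np.searchsorted(self.edges, data, side="right") - 1
        inside = (idx >= 0) & (idx < len(self.counts))
        # Accumulate into the existing storage: fills on a pre-existing
        # histogram, one of the HEP requirements discussed above.
        np.add.at(self.counts, idx[inside], 1)

h = Hist1D(4, 0.0, 4.0)
h.fill([0.5, 1.5, 1.7])
h.fill([3.2])  # a second fill on the same, pre-existing histogram
print(h.counts.tolist())  # [1, 2, 0, 1]
```

boost-histogram extends this idea with composable axis types (regular, variable, circular, categorical) and storage types (integer, weighted, mean) chosen independently.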
like language to C (or C++), and just happened to support bindings since you can call C and C++ from it, but that was not what it was designed for. Benefits of pybind11 included reduced code complexity and duplication, no pre-process step (cythonize), no need to pin NumPy when building, and a cross-package API. The iMinuit package was later moved from Cython to pybind11 as well, and pybind11 became the Scikit-HEP recommended binding tool. We contributed a variety of fixes and features to pybind11, including positional-only and keyword-only arguments, the option to prepend to the overload chain, and an API for type access and manipulation. We also completely redesigned the CMake integration, added a new pure-Setuptools helpers file, and completely redesigned the CI using GitHub Actions, running over 70 jobs on a variety of systems and compilers. We also helped modernize and improve all the example projects with simpler builds, new CI, and cibuildwheel support.
    This example of a project with binary components being usable everywhere then encouraged the development of Awkward 1.0, a rewrite of AwkwardArray replacing the Python-only code with compiled code using pybind11, fixing some long-standing limitations, like an inability to slice past two dimensions or select "n choose k" for k > 5; these simply could not be expressed using Awkward 0's NumPy expressions, but can be solved with custom compiled kernels. This also enabled further developments in backends [PEL20].

Broader ecosystem

Scikit-HEP had become a "toolset" for HEP analysis in Python, a collection of packages that worked together, instead of a "toolkit" like ROOT, which is one monopackage that tries to provide everything [R+20]. A toolset is more natural in the Python ecosystem, where we have good packaging tools and many existing libraries. Scikit-HEP only needed to fill existing gaps, instead of covering every possible aspect of an analysis like ROOT did. The original scikit-hep package had its functionality pulled out into existing or new separate packages such as HEPUnits and Vector, and the core scikit-hep package instead became a metapackage with no unique functionality on its own. Instead, it installs a useful subset of our libraries for a physicist wanting to quickly get started on a new analysis.
    Scikit-HEP was quickly becoming the center of HEP-specific Python software (see Fig. 1). Several other projects or packages joined Scikit-HEP. iMinuit, a popular HEP and astrophysics fitting library, was probably the most widely used single package to have joined. PyHF and cabinetry also joined; these were larger frameworks that could drive a significant part of an analysis internally using other Scikit-HEP tools.
    Other packages, like GooFit, Coffea, and zFit, were not added, but were built on Scikit-HEP packages and had developers working closely with Scikit-HEP maintainers. Scikit-HEP introduced an "affiliated" classification for these packages, which allowed an external package to be listed on the Scikit-HEP website and encouraged collaboration. Coffea had a strong influence on histogram design, and zFit has contributed code to Scikit-HEP. Currently all affiliated packages have at least one Scikit-HEP developer as a maintainer, though that is not a requirement. An affiliated package fills a particular need for the community. Scikit-HEP doesn't have to, or need to, attempt to develop a package that others are providing, but rather tries to ensure that the externally provided package works well with the broader HEP ecosystem. The affiliated classification is also used for broader ecosystem packages like pybind11 and cibuildwheel that we recommend and share maintainers with.

[Figure 4 diagrams the histogram packages and their roles, with labels including "thin wrapper", "fully featured", and "plotting in Matplotlib" (mplhep).]
Fig. 4: The collection of histogram packages and related packages in Scikit-HEP.

    Histogramming was designed to be a collection of specialized packages (see Fig. 4) with carefully defined interoperability: boost-histogram for manipulation and filling, Hist for a user-friendly interface and simple plotting tools, and histoprint for displaying histograms; the existing mplhep and uproot packages also needed to be able to work with histograms. This ecosystem was built and is held together with UHI, a formal specification agreed upon by several developers of different libraries, backed by a statically typed Protocol, for a PlottableHistogram object. Producers of histograms, like boost-histogram/hist and uproot, provide objects that follow this specification, and consumers of histograms, such as mplhep and histoprint, take any object that follows this specification. The UHI library is not required at runtime, though it does also provide a few simple utilities to help a library also accept ROOT histograms, which do not (currently) follow the Protocol, so several libraries have decided to include it at runtime too. By using a static type checker like MyPy to statically enforce a Protocol, libraries can communicate without depending on each other or on a shared runtime dependency and class inheritance. This has been a great success story for Scikit-HEP, and we expect Protocols to continue to be used in more places in the ecosystem.
    The design for Scikit-HEP as a toolset is of many parts that all work well together. One example of a package pulling together many components is uproot-browser, a tool that combines uproot, Hist, and Python libraries like textual and plotext to provide a terminal browser for ROOT files.
    Scikit-HEP's external contributions continued to grow. One of the most notable ones was our work on cibuildwheel. This was a Python package that supported building redistributable wheels on multiple CI systems. Unlike our own azure-wheel-helpers or the competing multibuild package, it was written in Python, so good practices in Python package design could apply, like unit and integration tests and static checks, and it was easy to remain independent of the underlying CI system. Building wheels on Linux requires a docker image, macOS requires the official Python, and Windows can use any copy of Python - cibuildwheel uses this to supply Python in all cases, which keeps it from
AWKWARD PACKAGING: BUILDING SCIKIT-HEP                                                                                                       119

depending on the CI’s support for a particular Python version. We       helpful for monitoring adoption of the developer pages, especially
merged our improvements to cibuildwheel, like better Windows            newer additions, across the Scikit-HEP packages. This package
support, VCS versioning support, and better PEP 518 support.            was then implemented directly into the Scikit-HEP pages, using
We dropped azure-wheel-helpers, and eventually a scikit-build           Pyodide to run Python in WebAssembly directly inside a user’s
maintainer joined the cibuildwheel project. cibuildwheel would          browser. Now anyone visiting the page can enter their repository
go on to join the PyPA, and is now in use in over 600 packages,         and branch, and see the adoption report in a couple of seconds.
including numpy, matplotlib, mypy, and scikit-learn.
    Our continued contributions to cibuildwheel included a              Working toward the future
TOML-based configuration system for cibuildwheel 2.0, an over-
                                                                        Scikit-HEP is looking toward the future in several different areas.
ride system to make supporting multiple manylinux and musllinux
                                                                        We have been working with the Pyodide developers to support
targets easier, a way to build directly from SDists, an option to use
                                                                        WebAssembly; boost-histogram is compiled into Pyodide 0.20,
build instead of pip, the automatic detection of Python version
                                                                        and Pyodide’s support for pybind11 packages is significantly bet-
requirements, and better globbing support for build specifiers. We
                                                                        ter due to that work, including adding support for C++ exception
also helped improve the code quality in various ways, including
                                                                        handling. PyHF’s documentation includes a live Pyodide kernel,
fully statically typing the codebase, applying various checks and
                                                                        and a try-pyhf site (based on the repo-review tool) lets users run
style controls, automating CI processes, and improving support for
                                                                        a model without installing anything - it can even be saved as a
special platforms like CPython 3.8 on macOS Apple Silicon.
                                                                        webapp on mobile devices.
    We also have helped with build, nox, pyodide, and many other
                                                                            We have also been working with Scikit-Build to try to provide
packages, improving the tooling we depend on to develop scikit-
                                                                        a modern build experience in Python using CMake. This project
build and giving back to the community.
                                                                        is just starting, but we expect over the next year or two that
                                                                        the usage of CMake as a first class build tool for binaries in
The Scikit-HEP Developer Pages                                          Python will be possible using modern developments and avoiding
A variety of packaging best practices were coming out of the            distutils/setuptools hacks.
boost-histogram work, supporting both ease of installation for
users as well as various static checks and styling to keep the          Summary
package easy to maintain and reduce bugs. These techniques              The Scikit-HEP project started in Autumn 2016 and has grown
would also be useful apply to Scikit-HEP’s nearly thirty other          to be a core component in many HEP analyses. It has also
packages, but applying them one-by-one was not scalable. The            provided packages that are growing in usage outside of HEP, like
development and adoption of azure-wheel-helpers included a se-          AwkwardArray, boost-histogram/Hist, and iMinuit. The tooling
ries of blog posts that covered the Azure Pipelines platform and        developed and improved by Scikit-HEP has helped Scikit-HEP
wheel building details. This ended up serving as the inspiration        developers as well as the broader Python community.
for a new set of pages on the Scikit-HEP website for developers
interested in making Python packages. Unlike blog posts, these
would be continuously maintained and extended over the years,           R EFERENCES
serving as a template and guide for updating and adding packages        [Dea20]   Hans Dembinski and Piti Ongmongkolkul et al.            scikit-
to Scikit-HEP, and educating new developers.                                      hep/iminuit. Dec 2020. URL:
                                                                                  3949207, doi:10.5281/zenodo.3949207.
    These pages grew to describe the best practices for developing
                                                                        [EB08]    Lyndon Evans and Philip Bryant. Lhc machine. Journal of
and maintaining a package, covering recommended configuration,                    instrumentation, 3(08):S08001, 2008.
style checking, testing, continuous integration setup, task runners,    [GTW20] Galli, Massimiliano, Tejedor, Enric, and Wunsch, Stefan. "a new
and more. Shortly after the introduction of the developer pages,                  pyroot: Modern, interoperable and more pythonic". EPJ Web
                                                                                  Conf., 245:06004, 2020. URL:
Scikit-HEP developers started asking for a template to quickly                    202024506004, doi:10.1051/epjconf/202024506004.
produce new packages following the guidelines. This was eventu-         [IVL22]   Ioana Ifrim, Vassil Vassilev, and David J Lange. GPU Ac-
ally produced; the "cookiecutter" based template is kept in sync                  celerated Automatic Differentiation With Clad. arXiv preprint
with the developer pages; any new addition to one is also added                   arXiv:2203.06139, 2022.
                                                                        [Lam98]   Stephan Lammel.         Computing models of cdf and dØ
to the other. The developer pages are also kept up to date using a                in run ii. Computer Physics Communications, 110(1):32–
CI job that bumps any GitHub Actions or pre-commit versions to                    37, 1998. URL:
the most recent versions weekly. Some portions of the developer                   pii/S0010465597001501, doi:10.1016/s0010-4655(97)
pages have been contributed to, as well.           [LCC+ 09] Barry M Leiner, Vinton G Cerf, David D Clark, Robert E
    The cookie cutter was developed to be able to support multiple                Kahn, Leonard Kleinrock, Daniel C Lynch, Jon Postel, Larry G
build backends; the original design was to target both pure Python                Roberts, and Stephen Wolff. A brief history of the internet.
and Pybind11 based binary builds. This has expanded to include                    ACM SIGCOMM Computer Communication Review, 39(5):22–
                                                                                  31, 2009.
11 different backends by mid 2022, including Rust extensions,           [LGMM05] W Lavrijsen, J Generowicz, M Marino, and P Mato. Reflection-
many PEP 621 based backends, and a Scikit-Build based backend                     Based Python-C++ Bindings. 2005. URL:
for pybind11 in addition to the classic Setuptools one. This has                  record/865620, doi:10.5170/CERN-2005-002.441.
                                                                        [PEL20]   Jim Pivarski, Peter Elmer, and David Lange. Awkward arrays
helped work out bugs and influence the design of several PEP                      in python, c++, and numba. In EPJ Web of Conferences,
621 packages, including helping with the addition of PEP 621 to                   volume 245, page 05023. EDP Sciences, 2020. doi:10.1051/
Setuptools.                                                                       epjconf/202024505023.
    The most recent addition to the pages was based on a new            [PJ11]    Andreas J Peters and Lukasz Janyst. Exabyte scale storage at
                                                                                  CERN. In Journal of Physics: Conference Series, volume 331,
repo-review package which evaluates and existing repository to                    page 052015. IOP Publishing, 2011. doi:10.1088/1742-
see what parts of the guidelines are being followed. This was                     6596/331/5/052015.
120                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

[R+20]     Eduardo Rodrigues et al. The Scikit-HEP Project – overview and prospects. EPJ Web of Conferences, 245:06028, 2020. arXiv:2007.03577, doi:10.1051/epjconf/202024506028.
[RS21]     Olivier Rousselle and Tom Sykora. Fast simulation of Time-of-Flight detectors at the LHC. In EPJ Web of Conferences, volume 251, page 03027. EDP Sciences, 2021.
[RTA+17]   D Remenska, C Tunnell, J Aalbers, S Verhoeven, J Maassen, and J Templon. Giving pandas ROOT to chew on: experiences with the XENON1T Dark Matter experiment. In Journal of Physics: Conference Series, volume 898, page 042003. IOP Publishing, 2017.
[SLF+15]   Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.
[Tem22]    Jeffrey Templon. Reflections on the uptake of the Python programming language in Nuclear and High-Energy Physics, March 2022.
  Keeping your Jupyter notebook code quality bar high
        (and production ready) with Ploomber
                                                                    Ido Michael‡∗


    This paper walks through this interactive tutorial. It is highly recommended to run it interactively so it is easier to follow and see the results in real time. There is a binder link in there as well, so you can launch it instantly.

* Corresponding author:
‡ Ploomber

Copyright © 2022 Ido Michael. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. Introduction
Notebooks are an excellent environment for data exploration: they allow us to write code interactively and get visual feedback, providing an unbeatable experience for understanding our data.
    However, this convenience comes at a cost; if we are not careful about adding and removing code cells, we may end up with an irreproducible notebook. Arbitrary execution order is a prevalent problem: a recent analysis found that about 36% of notebooks on GitHub did not execute in linear order. To ensure our notebooks run, we must continuously test them to catch these problems.
    A second notable problem is the size of notebooks: the more cells we have, the more difficult it is to debug, since there are more variables and code involved.
    Software engineers typically break down projects into multiple steps and test continuously to prevent broken and unmaintainable code. However, applying these ideas to data analysis requires extra work; multiple notebooks imply we have to ensure the output from one stage becomes the input for the next one. Furthermore, we can no longer press "Run all cells" in Jupyter to test our analysis from start to finish.
    Ploomber provides all the necessary tools to build multi-stage, reproducible pipelines in Jupyter that feel like a single notebook. Users can easily break down their analysis into multiple notebooks and execute them all with a single command.

2. Refactoring a legacy notebook
If you already have a Python project in a single notebook, you can use our tool Soorgeon to automatically refactor it into a Ploomber pipeline. Soorgeon statically analyzes your code, cleans up unnecessary imports, and makes sure your monolithic notebook is broken down into smaller components. It does that by scanning the Markdown in the notebook and analyzing the headers; each H2 header in our example marks a new self-contained task. In addition, it can transform a notebook into a single-task pipeline, and then the user can split it into smaller tasks as they see fit.
    To refactor the notebook, we use the soorgeon refactor command:

soorgeon refactor nb.ipynb

After running the refactor command, we can take a look at the local directory and see that we now have multiple Python tasks that are ready for production:

ls playground

We can see that we have a few new files. pipeline.yaml contains the pipeline declaration, and tasks/ contains the stages that Soorgeon identified based on our H2 Markdown headings:

ls playground/tasks

One of the best ways to onboard new people and explain what each workflow is doing is by plotting the pipeline (note that we're now using ploomber, which is the framework for developing pipelines):

ploomber plot

This command will generate the plot below for us, which will allow us to stay up to date with changes that are happening in our pipeline and get the current status of tasks that were executed or failed to execute.

Fig. 1: In this pipeline none of the tasks were executed - it's all red.

    Soorgeon correctly identified the stages in our original nb.ipynb notebook. It even detected that the last two tasks (linear-regression and random-forest-regressor) are independent of each other!
    We can also get a summary of the pipeline with ploomber status:

cd playground
ploomber status

3. The pipeline.yaml file
To develop a pipeline, users create a pipeline.yaml file and declare the tasks and their outputs as follows:

Fig. 2: In here we can see the status of each of our pipeline's tasks, runtime and location.

Fig. 3: Here we can see the build outputs.

tasks:
  - source:
    product:
      nb: output/executed.ipynb
      data: output/data.csv

  # more tasks here...

The previous pipeline has a single task and generates two outputs: output/executed.ipynb and output/data.csv. You may be wondering why we have a notebook as an output: Ploomber converts scripts to notebooks before execution; hence, our script is considered the source and the notebook a byproduct of the execution. Using scripts as sources (instead of notebooks) makes it simpler to use git. However, this does not mean you have to give up interactive development, since Ploomber integrates with Jupyter, allowing you to edit scripts as notebooks.

Fig. 4: These are the post build artifacts.

    In this case, since we used Soorgeon to refactor an existing notebook, we did not have to write the pipeline.yaml file.
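Since Ploomber opens scripts as notebooks, a task source is just a plain Python file with cell markers. Below is a minimal sketch of what such a script source might look like; the file contents, column names, and paths are illustrative, not the tutorial's actual files:

```python
# %% tags=["parameters"]
# Ploomber injects these values at runtime; the defaults below only
# make this sketch runnable on its own.
upstream = None
product = {"nb": "output/load.ipynb", "data": "output/data.csv"}

# %%
from pathlib import Path

# Build a tiny stand-in dataset instead of loading real data.
rows = ["x,y"] + [f"{i},{i * i}" for i in range(5)]

# %%
# Write the declared product so downstream tasks can consume it.
Path(product["data"]).parent.mkdir(parents=True, exist_ok=True)
Path(product["data"]).write_text("\n".join(rows) + "\n")
```

When Ploomber executes the task, the `parameters` cell is replaced with the values declared in pipeline.yaml, and the executed notebook is saved as the `nb` product.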
4. Building the pipeline
Let's build the pipeline (this will take ~30 seconds):

cd playground
ploomber build

We can see which tasks ran during this command, how long they took to execute, and the contribution of each task to the overall pipeline execution runtime.
    Navigate to playground/output/ and you'll see all the outputs: the executed notebooks, data files, and the trained model.

ls playground/output

In this figure, we can see all of the data that was collected during the pipeline, any artifacts that might be useful to the user, and some of the execution history that is saved in the notebook's context.

5. Testing and quality checks
Open tasks/ as a notebook by right-clicking on it and then Open With -> Notebook, and add the following code after the cell with # noqa:

Fig. 6: lab-open-with-notebook

# Sample data quality checks after loading the raw data
# Check nulls
assert not df['HouseAge'].isnull().values.any()

# Check a specific range - no outliers
assert df['HouseAge'].between(0, 100).all()

# Exact expected row count
assert len(df) == 11085

We'll do the same for tasks/; open the file and add the tests:

# Sample tests after the notebook ran
# Check the task's test input exists
assert Path(upstream['train-test-split']['X_test']).exists()

# Check the task's train input exists
assert Path(upstream['train-test-split']['y_train']).exists()

# Validate the output type
assert 'pkl' in upstream['train-test-split']['X_test']

Adding these snippets will allow us to validate that the data we're looking for exists and has the quality we expect. For instance, in the first snippet we check that there are no missing values, and that the data sample we have is for houses up to 100 years old. In the second snippet, we check that the train and test inputs, which are crucial for training the model, exist.

Fig. 5: Now we see an independent new task
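The upstream mapping used in these checks is, in essence, a dictionary of product paths keyed by task name. A toy sketch of the pattern (the paths and data here are hypothetical, not the tutorial's actual artifacts):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a product written by an upstream train-test-split task.
tmp = Path(tempfile.mkdtemp())
X_test = [[0.1, 0.2], [0.3, 0.4]]
(tmp / "X_test.pkl").write_bytes(pickle.dumps(X_test))

# The kind of mapping Ploomber injects into the downstream notebook.
upstream = {"train-test-split": {"X_test": str(tmp / "X_test.pkl")}}

# The same style of checks as above, plus loading the product back.
assert Path(upstream["train-test-split"]["X_test"]).exists()
assert "pkl" in upstream["train-test-split"]["X_test"]
loaded = pickle.loads(Path(upstream["train-test-split"]["X_test"]).read_bytes())
assert loaded == X_test
```

Because products are plain files on disk, these checks can run inside the task notebook itself or in a separate test suite.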

6. Maintaining the pipeline
Let's look again at our pipeline plot:
    The arrows in the diagram represent input/output dependencies and depict the execution order. For example, the first task (load) loads some data, then clean uses such data as input and processes it, then train-test-split splits our dataset into training and test sets. Finally, we use those datasets to train a linear regression and a random forest regressor.
    Soorgeon extracted and declared these dependencies for us, but if we want to modify the existing pipeline, we need to declare such dependencies ourselves. Let's see how.
    We can also see that the pipeline is green, meaning all of the tasks in it have been executed recently.

7. Adding a new task
Let's say we want to train another model and decide to try a Gradient Boosting Regressor. First, we modify the pipeline.yaml file and add a new task.
    Open playground/pipeline.yaml and add the following lines at the end:

- source: tasks/
  product:
    nb: output/gradient-boosting-regressor.ipynb

Now, let's create a base file by executing ploomber scaffold:

cd playground
ploomber scaffold

This is the output of the command:

Found spec at 'pipeline.yaml'
Adding /Users/ido/ploomber-workshop/playground/
Created 1 new task sources.

    We can see it created the task sources for our new task; we just have to fill those in right now.
    Let's see how the plot looks now:

cd playground
ploomber plot

You can see that Ploomber recognizes the new file, but it does not have any dependency, so let's tell Ploomber that it should execute after train-test-split.
    Open playground/tasks/ as a notebook by right-clicking on it and then Open With -> Notebook. At the top of the notebook, you'll see the following:

upstream = None

This special variable indicates which tasks should execute before the notebook we're currently working on. In this case, we want to get training data so we can train our new model, so we change the upstream variable:

upstream = ['train-test-split']

Let's generate the plot again:

cd playground
ploomber plot

Ploomber now recognizes our dependency declaration!

Fig. 7: The new task is attached to the pipeline

    Open playground/tasks/ as a notebook by right-clicking on it and then Open With -> Notebook, and add the following code:

from pathlib import Path
import pickle

import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor

y_train = pickle.loads(Path(
    upstream['train-test-split']['y_train']).read_bytes())
y_test = pickle.loads(Path(
    upstream['train-test-split']['y_test']).read_bytes())
X_test = pickle.loads(Path(
    upstream['train-test-split']['X_test']).read_bytes())
X_train = pickle.loads(Path(
    upstream['train-test-split']['X_train']).read_bytes())

gbr = GradientBoostingRegressor(), y_train)
y_pred = gbr.predict(X_test)
sns.scatterplot(x=y_test, y=y_pred)

8. Incremental builds
Data workflows require a lot of iteration. For example, you may want to generate a new feature or model. However, it's wasteful to re-execute every task with every minor change. Therefore, one of Ploomber's core features is incremental builds, which automatically skip tasks whose source code hasn't changed.
    Run the pipeline again:

cd playground
ploomber build

You can see that only the gradient-boosting-regressor task ran!

Fig. 8: We can see this pipeline has multiple new tasks.

    Incremental builds allow us to iterate faster without keeping track of task changes.
    Check out playground/output/, which contains the output notebooks with the model evaluation.

9. Parallel execution and Ploomber cloud execution
This section can run locally or on the cloud. To set up the cloud we'll need to register for an API key.
    Ploomber cloud allows you to scale your experiments into the cloud without provisioning machines and without dealing with infrastructure.
    Open playground/pipeline.yaml and add the following code instead of the source task:

- source: tasks/

This is how your task should look in the end:

- source: tasks/
  name: random-forest-
  product:
    nb: output/random-forest-regressor.ipynb
  grid:
    # creates 4 tasks (2 * 2)
    n_estimators: [5, 10]
    criterion: [gini, entropy]

In addition, we'll need to add a flag to tell the pipeline to execute in parallel. Open playground/pipeline.yaml and add the following code above the tasks section (line 1):

# Execute independent tasks in parallel
executor: parallel

ploomber plot

ploomber build

10. Execution in the cloud
When working with datasets that fit in memory, running your pipeline is simple enough, but sometimes you may need more computing power for your analysis. Ploomber makes it simple to execute your code in a distributed environment without code changes.
    Check out Soopervisor, the package that implements exporting Ploomber projects to the cloud with support for:

      •    Kubernetes (Argo Workflows)
      •    AWS Batch
      •    Airflow

11. Resources
Thanks for taking the time to go through this tutorial! We hope you consider using Ploomber for your next project. If you have any questions or need help, please reach out to us! (contact information below)
    Here are a few resources to dig deeper:

      •    GitHub
      •    Documentation
      •    Code examples
      •    JupyterCon 2020 talk
      •    Argo Community Meeting talk
      •    Pangeo Showcase talk (AWS Batch demo)
      •    Jupyter project

12. Contact

      •    Twitter
      •    Join us on Slack
      •    E-mail us
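The grid declaration from section 9 expands into one task per combination of parameter values. The expansion logic can be sketched in plain Python (illustrative only, not Ploomber's actual internals):

```python
from itertools import product

# The same parameter grid as in the pipeline.yaml snippet above.
grid = {"n_estimators": [5, 10], "criterion": ["gini", "entropy"]}

# One task per combination of parameter values: 2 * 2 = 4 tasks.
tasks = [dict(zip(grid, combo)) for combo in product(*grid.values())]

print(len(tasks))  # 4
```

Because the four resulting tasks share no dependencies among themselves, the parallel executor can run them concurrently.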

   Likeness: a toolkit for connecting the social fabric of
                place to human dynamics
                                                     Joseph V. Tuccillo‡∗ , James D. Gaboardi‡


Abstract—The ability to produce richly-attributed synthetic populations is key for understanding human dynamics, responding to emergencies, and preparing for future events, all while protecting individual privacy. The Likeness toolkit accomplishes these goals with a suite of Python packages: pymedm/pymedm_legacy, livelike, and actlike. This production process is initialized in pymedm (or pymedm_legacy), which utilizes census microdata records as the foundation on which disaggregated spatial allocation matrices are built. The next step, performed by livelike, is the generation of a fully autonomous agent population attributed with hundreds of demographic census variables. The agent population synthesized in livelike is then attributed with residential coordinates in actlike based on block assignment and, finally, allocated to an optimal daytime activity location via the street network. We present a case study in Knox County, Tennessee, synthesizing 30 populations of public K–12 school students & teachers and allocating them to schools. Validation of our results shows they are highly promising, replicating reported school enrollment and teacher capacity with a high degree of fidelity.

Index Terms—activity spaces, agent-based modeling, human dynamics, population synthesis

Introduction
Human security fundamentally involves the functional capacity that individuals possess to withstand adverse circumstances, mediated by the social and physical environments in which they live [Hew97]. Attention to human dynamics is a key piece of the human security puzzle, as it reveals the spatial policy interventions most appropriate to the ways in which people within a community behave and interact in daily life. For example, "one size fits all" solutions do not exist for mitigating disease spread, promoting physical activity, or enabling access to healthy food sources. Rather, understanding these outcomes requires examination of
    Modeling these processes at scale and with respect to individual privacy is most commonly achieved through agent-based simulations on synthetic populations [SEM14]. Synthetic populations consist of individual agents that, when viewed in aggregate, closely recreate the makeup of an area's observed population [HHSB12], [TMKD17]. Modeling human dynamics with synthetic populations is common across research areas including spatial epidemiology [DKA+08], [BBE+08], [HNB+11], [NCA13], [RSF+21], [SNGJ+09], public health [BCD+06], [BFH+17], [SPH11], [TCR08], [MCB+08], and transportation [BBM96], [ZFJ14]. However, a persistent limitation across these applications is that synthetic populations often do not capture a wide enough range of individual characteristics to assess how human dynamics are linked to human security problems (e.g., how a person's age, limited transportation access, and linguistic isolation may interact with their housing situation in a flood evacuation emergency).
    In this paper, we introduce Likeness [TG22], a Python toolkit for connecting the social fabric of place to human dynamics via models that support increased spatial, temporal, and demographic fidelity. Likeness is an extension of the UrbanPop framework developed at Oak Ridge National Laboratory (ORNL) that embraces a new paradigm of "vivid" synthetic populations [TM21], [Tuc21], in which individual agents may be attributed in potentially hundreds of ways, across subjects spanning demographics, socioeconomic status, housing, and health. Vivid synthetic populations benefit human dynamics research both by enabling more precise geolocation of population segments and by providing a deeper understanding of how individual and neighborhood characteristics are coupled. UrbanPop's early development was motivated
processes like residential sorting, mobility, and social transmis-                      by linking models of residential sorting and worker commute
sion.                                                                                   behaviors [MNP+ 17], [MPN+ 17], [ANM+ 18]. Likeness expands
                                                                                        upon the UrbanPop approach by providing a novel integrated
* Corresponding author:
‡ Oak Ridge National Laboratory                                                         model that pairs vivid residential synthetic populations with an
                                                                                        activity simulation model on real-world transportation networks,
Copyright © 2022 Oak Ridge National Laboratory. This is an open-access                  with travel destinations based on points of interest (POIs) curated
article distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any          from location services and federal critical facilities data.
medium, provided the original author and source are credited.
Notice: This manuscript has been authored by UT-Battelle, LLC under                         We first provide an overview of Likeness’ capabilities, then
Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.                      provide a more detailed walkthrough of its central workflow with
The United States Government retains and the publisher, by accepting the
article for publication, acknowledges that the United States Government                 respect to livelike, a package for population synthesis and
retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or         residential characterization, and actlike a package for activity
reproduce the published form of this manuscript, or allow others to do so, for          allocation. We provide preliminary usage examples for Likeness
United States Government purposes. The Department of Energy will provide                based on 1) social contact networks in POIs 2) 24-hour POI
public access to these results of federally sponsored research in accordance
with the DOE Public Access Plan (                occupancy characteristics. Finally, we discuss existing limitations
access-plan).                                                                           and the outlook for future development.
126                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

Overview of Core Capabilities and Workflow

UrbanPop initially combined the vivid synthetic populations produced from the
American Community Survey (ACS) using the Penalized-Maximum Entropy Dasymetric
Modeling (P-MEDM) method, which is detailed later, with a commute model based on
origin-destination flows, to generate a detailed dataset of daytime and
nighttime synthetic populations across the United States [MPN+17]. Our
development of Likeness is motivated by extending the existing capabilities of
UrbanPop to routing libraries available in Python like osmnx and pandana
[Boe17], [FW12]. In doing so, we are able to simulate travel to regular daytime
activities (work and school) based on real-world transportation networks.
Likeness continues to use the P-MEDM approach, but is fully integrated with the
U.S. Census Bureau's ACS Summary File (SF) and Census Microdata APIs, enabling
the production of activity models on-the-fly.

Likeness features three core capabilities supporting activity simulation with
vivid synthetic populations (Figure 1). The first, spatial allocation, is
provided by the pymedm and pmedm_legacy packages and uses Iterative Proportional
Fitting (IPF) to downscale census microdata records to small neighborhood areas,
providing a basis for population synthesis. Baseline residential synthetic
populations are then created and stratified into agent segments (e.g., grade 10
students, hospitality workers) using the livelike package. Finally, the actlike
package models travel across agent segments of interest to POIs outside places
of residence at varying times of day.

Spatial Allocation: the pymedm & pmedm_legacy packages

Synthetic populations are typically generated from census microdata, which
consists of a sample of publicly available long-form responses to official
statistical surveys. To preserve respondent confidentiality, census microdata is
often published at spatial scales the size of a city or larger. Spatial
allocation with IPF provides a maximum-likelihood estimator for microdata
responses in small (e.g., neighborhood) areas based on aggregate data published
about those areas (known as "constraints"), resulting in a baseline for
population synthesis [WCC+09], [BBM96], [TMKD17]. UrbanPop is built upon a
regularized implementation of IPF, the P-MEDM method, that permits many more
input census variables than traditional approaches [LNB13], [NBLS14]. The
P-MEDM objective function (Eq. 1) is written as:

    max − ∑_{it} (n w_it)/(N d_it) · log(w_it / d_it) − ∑_k e_k² / (2σ_k²)    (1)

where w_it is the estimate of variable i in zone t, d_it is the synthetic
estimate of variable i in location t, n is the number of microdata responses,
and N is the total population size. Uncertainty in variable estimates is handled
by adding an error term to the allocation, ∑_k e_k² / (2σ_k²), where e_k is the
error between the synthetic and published estimate of ACS variable k and σ_k is
the ACS standard error for the estimate of variable k. This is accomplished by
leveraging the uncertainty in the input variables: the "tighter" the margins of
error on the estimate of variable k in place t, the more leverage it holds upon
the solution [NBLS14].

The P-MEDM procedure outputs an allocation matrix that estimates the probability
of individuals matching responses from the ACS Public-Use Microdata Sample
(PUMS) at the scale of census block groups (typically 300–6,000 people) or
tracts (1,200–8,000 people), depending upon the use case.

Downscaling the PUMS from the Public-Use Microdata Area (PUMA) level at which it
is offered (100,000 or more people) to these neighborhood scales then enables us
to produce synthetic populations (the livelike package) and simulate their
travel to POIs (the actlike package) in an integrated model. This approach
provides a new means of modeling population mobility and activity spaces with
respect to real-world transportation networks and POIs, in turn enabling
investigation of social processes from the atomic (e.g., person) level in human
systems.

Likeness offers two implementations of P-MEDM. The first, the pymedm package, is
written natively in Python based on scipy.optimize.minimize, and while fully
operational remains in development and is currently suitable for one-off
simulations. The second, the pmedm_legacy package, uses rpy2 as a bridge to
[NBLS14]'s original implementation of P-MEDM in R/C++ and is currently more
stable and scalable. We offer conda environments specific to each package, based
on user preferences.

Each package's functionality centers around a PMEDM class, which contains the
information required to solve the P-MEDM problem:

   •   The individual (household) level constraints based on the ACS PUMS. To
       preserve households from the PUMS in the synthetic population, the
       person-level constraints describing household members are aggregated to
       the household level and merged with household-level constraints.
   •   PUMS household sample weights.
   •   The target (e.g., block group) and aggregate (e.g., tract) zone
       constraints based on population-level estimates available in the ACS SF.
   •   The target/aggregate zone 90% margins of error and associated standard
       errors (SE = MOE / 1.645).

The PMEDM classes feature a solve() method that returns an optimized P-MEDM
solution and allocation matrix. Through a diagnostics module, users may then
evaluate a P-MEDM solution based on the proportion of published 90% MOEs from
the summary-level ACS data preserved at the target (allocation) scale.

Population Synthesis: the livelike package

The livelike package generates baseline residential synthetic populations and
performs agent segmentation for activity simulation.

Specifying and Solving Spatial Allocation Problems

The livelike workflow is oriented around a user-specified constraints file
containing all of the information necessary to specify a P-MEDM problem for a
PUMA of interest. "Constraints" are variables from the ACS common among
people/households (PUMS) and populations (SF) that are used as both model inputs
and descriptors. The constraints file includes information for bridging PUMS
variable definitions with those from the SF using helper functions provided by
the livelike.pums module, including table IDs, sampling universe
(person/household), and tags for the range of ACS vintages (years) for which the
variables are relevant.

LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS                127
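To make the shape of the penalized objective in Eq. 1 concrete, the toy problem
below minimizes a small penalized-entropy allocation with
scipy.optimize.minimize, the routine on which pymedm is built. This is an
illustrative sketch only, not the pymedm implementation: the data, the single
constraint, and the exact form of the entropy term are simplified, and all
numbers are invented. Note the conversion SE = MOE / 1.645 from published 90%
margins of error to standard errors.

```python
import numpy as np
from scipy.optimize import minimize

# Toy penalized-entropy allocation in the spirit of Eq. 1 (illustrative
# sketch, NOT the pymedm implementation). n = 3 microdata responses are
# allocated across T = 2 zones for a population of N = 100.
n, T, N = 3, 2, 100
d = np.full((n, T), N / (n * T))   # design-based prior allocation
x = np.array([1.0, 0.0, 1.0])      # indicator: response i has the attribute
y = np.array([40.0, 10.0])         # published zone estimates for the attribute
moe = np.array([10.0, 5.0])        # published 90% margins of error
sigma = moe / 1.645                # standard errors implied by the 90% MOEs

def objective(w_flat):
    w = w_flat.reshape(n, T)
    divergence = np.sum((w / N) * np.log(w / d))  # entropy (KL-style) term
    e = y - x @ w                                 # zone-level constraint errors
    return divergence + np.sum(e**2 / (2 * sigma**2))

res = minimize(
    objective,
    d.ravel(),                                    # start from the prior
    method="SLSQP",
    bounds=[(1e-9, None)] * (n * T),
    constraints={"type": "eq", "fun": lambda w: w.sum() - N},
)
w_opt = res.x.reshape(n, T)
```

The optimizer shifts weight toward the responses carrying the attribute in the
zone with the larger published estimate, trading entropy (distance from the
prior d) against the MOE-weighted constraint error, exactly the balance Eq. 1
expresses.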

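The person-to-household constraint aggregation described in the first PMEDM
input bullet can be sketched with pandas. SERIALNO (household serial number) and
WGTP (household weight) are real PUMS fields; the indicator columns and their
values are invented for illustration.

```python
import pandas as pd

# Person-level constraints are aggregated to the household level and
# merged with household-level constraints (illustrative data; SERIALNO
# and WGTP are real PUMS fields, the indicators are made up).
persons = pd.DataFrame({
    "SERIALNO": ["A", "A", "B", "B", "B"],
    "school_age": [1, 0, 1, 1, 0],       # person-level indicator
    "speaks_spanish": [0, 0, 1, 1, 1],   # person-level indicator
})
households = pd.DataFrame({
    "SERIALNO": ["A", "B"],
    "renter": [0, 1],                    # household-level constraint
    "WGTP": [12, 7],                     # household sample weight
})

# Sum person indicators within each household, then join.
person_totals = persons.groupby("SERIALNO", as_index=False).sum()
constraints = households.merge(person_totals, on="SERIALNO", how="left")
```

Each row of `constraints` now describes one PUMS household with both its own
attributes and the aggregated attributes of its members, the form the PMEDM
classes expect for individual-level constraints.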
Fig. 1: Core capabilities and workflow of Likeness.

The primary livelike class is the acs.puma, which stores information about a
single PUMA necessary for spatial allocation of the PUMS data to block
groups/tracts with P-MEDM. The process of creating an acs.puma is integrated
with the U.S. Census Bureau's ACS SF and Census Microdata 5-Year Estimates (5YE)
APIs. This enables generation of an acs.puma class with a high-level call
involving just a few parameters: 1) the PUMA's Federal Information Processing
Standard (FIPS) code, 2) the constraints file, loaded as a pandas.DataFrame, and
3) the target ACS vintage (year). An example call to build an acs.puma for the
Knoxville City, TN PUMA (FIPS 4701603) using the ACS 2015–2019 5-Year Estimates
is:

    acs.puma(
        fips="4701603",
        year=2019
    )

The censusdata package is used internally to fetch population-level (SF)
constraints, standard errors, and MOEs from the ACS 5YE API, while the
acs.extract_pums_constraints function is used to fetch individual-level
constraints and weights from the Census Microdata 5YE API.

Spatial allocation is then carried out by passing the acs.puma attributes to a
pymedm.PMEDM or pmedm_legacy.PMEDM (depending on user preference).

Population Synthesis

The homesim module provides support for population synthesis on the spatial
allocation matrix within a solved P-MEDM object. The population synthesis
procedure involves converting the fractional estimates from the allocation
matrix (n household IDs by m zones) to an integer representation such that whole
people/households are preserved. The homesim module features an implementation
of [LB13]'s "Truncate, Replicate, Sample" (TRS) method. TRS works by separating
each cell of the allocation matrix into whole-number (integer) and fractional
components, then incrementing the whole-number estimates by a random sample of
unit weights performed with sampling probabilities based on the fractional
component. Because TRS is stochastic, the homesim.hsim() function generates
multiple (default 30) realizations of the residential population. The results
are provided as a pandas.DataFrame in long format, attributed by:

   •   PUMS Household ID (h_id)
   •   Simulation number (sim)
   •   Target zone FIPS code (geoid)
   •   Household count (count)

Since household and person-level attributes are combined when creating the
acs.puma class, person-level records from the PUMS are assumed to be joined to
the synthesized household IDs many-to-one. For example, if two people, A01 and
A03, in household A have some attribute of interest, and there are 3 households
of type A in zone G, then we estimate that a total of 6 people with that
attribute from household A reside in zone G.

Agent Generation

The synthetic populations can then be segmented into different groups of agents
(e.g., workers by industry, students by grade) for activity modeling with the
actlike package. Agent segments may be identified in several ways:

   •   Using acs.extract_pums_segment_ids() to fetch the person IDs (household
       serial number + person line number) from the Census Microdata API
       matching some criteria of interest (e.g., public school students in 10th
       grade).
   •   Using acs.extract_pums_descriptors() to fetch criteria that may be
       queried from the Census Microdata API. This is useful when dealing with
       criteria

more specific than can be directly controlled for in the P-MEDM problem (e.g.,
       detailed NAICS code of worker, exact number of hours worked).

The function est.tabulate_by_serial() is then used to tabulate agents by target
zone and simulation by appending them to the synthetic population based on
household ID, then aggregating the person-level counts. This routine is flexible
in that a user can use any set of criteria available from the PUMS to define
customized agents for mobility modeling purposes.

Other Capabilities

Population Statistics: In addition to agent creation, the livelike.est module
also supports the creation of population statistics. This can be used to
estimate the compositional characteristics of small neighborhood areas and POIs,
for example to simulate social contact networks (see Students). To accomplish
this, the results of est.tabulate_by_serial (see Agent Generation) are converted
to proportional estimates to facilitate comparison across POIs (est.to_prop()),
then averaged across simulations to produce Monte Carlo estimates and errors
(est.monte_carlo_estimate()).

Multiple ACS Vintages and PUMAs: The multi module extends the capabilities of
livelike to multiple ACS 5YE vintages (dating back to 2016), as well as multiple
PUMAs (e.g., a metropolitan area). Using multi.make_pumas() or
multi.make_multiyear_pumas(), multiple PUMAs/multiple years may be stored in a
dict that enables iterative runs for spatial allocation
(multi.make_pmedm_problems()), population synthesis (multi.homesim()), and agent
creation (multi.extract_pums_segment_ids(),
multi.extract_pums_segment_ids_multiyear(), multi.extract_pums_descriptors(),
and multi.extract_pums_descriptors_multiyear()). This functionality is currently
available for pmedm_legacy only.

Activity Allocation: the actlike package

The actlike package [GT22] allocates agents from synthetic populations generated
by livelike to POIs, like schools and workplaces, based on optimal allocation
about transportation networks derived from osmnx and pandana [Boe17], [FW12].
Solutions are the product of a modified integer program (Transportation Problem
[Hit41], [Koo49], [MS01], [MS15]) modeled in pulp or mip [MOD11], [ST20],
whereby supply (students/workers) is "shipped" to demand locations
(schools/workplaces), with potentially relaxed minimum and maximum capacity
constraints at demand locations. Impedance from nighttime to daytime locations
(Origin-Destination [OD] pairs) can be modeled by either network distance or
network travel time.

Location Synthesis

Following the generation of synthetic households for the study universe,
locations for all households across the 30 default simulations must be created.
In order to intelligently site pseudo-neighborhood clusters of random points, we
adopt a dasymetric [QC13] approach, which we term intelligent block-based (IBB)
allocation, whereby household locations are only placed within blocks known to
have been populated at a particular period in time and are placed with a greater
frequency proportional to reported household density [LB13]. We employ
population and housing counts within 2010 Decennial Census blocks to formulate a
modified Variable Size Bin Packing Problem [FL86], [CGSdG08] for each populated
block group, which allows for an optimal placement of household points and is
accomplished by the actlike.block_density_allocation() function that creates and
solves an actlike.block_allocation.BinPack instance.

Activity Allocation

Once household location attribution is complete, individual agents must be
allocated from households (nighttime locations) to probable activity spaces
(daytime locations). This is achieved through spatial network modeling over the
streets within a study area via OpenStreetMap, utilizing osmnx for network
extraction & preprocessing and pandana for shortest path and route calculations.
The underlying impedance metric for shortest path calculation, handled in
actlike.calc_cost_mtx() and associated internal functions, can take the form of
either distance or travel time. Moreover, household and activity locations must
be connected to nearby network edges for realistic representations within
network space [GFH20].

With a cost matrix from all residences to daytime locations calculated, the
simulated population can then be "sent" to the likely activity spaces by
utilizing an instance of actlike.ActivityAllocation to generate an adapted
Transportation Problem. This mixed integer program, solved using the solve()
method, optimally associates all population within an activity space with the
objective of minimizing the total cost of impedance (Eq. 2), subject to
potentially relaxed minimum and maximum capacity constraints (Eq. 4 & 5). Each
decision variable (x_ij) represents a potential allocation from origin i to
destination j that must be an integer greater than or equal to zero (Eq. 6 & 7).
The problem is formulated as follows:

    min ∑_{i∈I} ∑_{j∈J} c_ij · x_ij                     (2)

    s.t.  ∑_{j∈J} x_ij = O_i        ∀i ∈ I;             (3)

    s.t.  ∑_{i∈I} x_ij ≥ minD_j     ∀j ∈ J;             (4)

    s.t.  ∑_{i∈I} x_ij ≤ maxD_j     ∀j ∈ J;             (5)

    s.t.  x_ij ≥ 0                  ∀i ∈ I, ∀j ∈ J;     (6)

    s.t.  x_ij ∈ Z                  ∀i ∈ I, ∀j ∈ J.     (7)

where

    i ∈ I   = each household in the set of origins
    j ∈ J   = each school in the set of destinations
    x_ij    = allocation decision from i ∈ I to j ∈ J
    c_ij    = cost between all i, j pairs
    O_i     = population in origin i for i ∈ I
    minD_j  = minimum capacity of j for j ∈ J
    maxD_j  = maximum capacity of j for j ∈ J
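A tiny instance of the program in Eqs. 2–7 can be solved end-to-end. The sketch
below uses scipy.optimize.linprog rather than pulp or mip so it is
self-contained, with invented origin populations and capacities; because the
transportation constraint matrix is totally unimodular, the LP relaxation
already returns the integer solution required by Eq. 7.

```python
import numpy as np
from scipy.optimize import linprog

# Toy capacitated transportation problem (Eqs. 2-7) with invented data:
# 2 origins shipping O_i units to 2 destinations with [minD_j, maxD_j]
# capacity windows. Variables are ordered x = [x11, x12, x21, x22].
cost = np.array([[1.0, 4.0],    # c_ij: impedance origin i -> destination j
                 [2.0, 3.0]])
O = np.array([3.0, 2.0])        # origin populations (Eq. 3)
minD = np.array([1.0, 1.0])     # minimum destination capacities (Eq. 4)
maxD = np.array([4.0, 4.0])     # maximum destination capacities (Eq. 5)
nI, nJ = cost.shape

# Equality rows: each origin ships exactly O_i (Eq. 3).
A_eq = np.zeros((nI, nI * nJ))
for i in range(nI):
    A_eq[i, i * nJ:(i + 1) * nJ] = 1.0

# Inequality rows: destination totals within [minD_j, maxD_j] (Eqs. 4-5).
A_dest = np.zeros((nJ, nI * nJ))
for j in range(nJ):
    A_dest[j, j::nJ] = 1.0
A_ub = np.vstack([A_dest, -A_dest])
b_ub = np.concatenate([maxD, -minD])

res = linprog(cost.ravel(), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=O, bounds=(0, None))
x = res.x.reshape(nI, nJ)
```

Here the minimum capacity at destination 2 forces one unit away from the
cheapest destination, and the solver sends it from the origin for which the
detour is least costly, the same mechanism that lets actlike respect reported
school capacities.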

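The "Truncate, Replicate, Sample" (TRS) integerization used during population
synthesis (described earlier) is compact enough to sketch in full. The following
is a minimal NumPy illustration of [LB13]'s method, not the homesim
implementation:

```python
import numpy as np

def trs(allocation, seed=None):
    """Truncate, Replicate, Sample [LB13]: integerize a fractional
    allocation matrix while preserving its (rounded) total."""
    rng = np.random.default_rng(seed)
    flat = np.asarray(allocation, dtype=float).ravel()
    whole = np.floor(flat).astype(int)      # truncate: whole-number part
    frac = flat - whole                     # fractional remainders
    n_extra = int(round(frac.sum()))        # whole units still to place
    if n_extra > 0:
        # sample cells with probability proportional to their remainder
        picks = rng.choice(flat.size, size=n_extra, replace=False,
                           p=frac / frac.sum())
        whole[picks] += 1
    return whole.reshape(np.shape(allocation))

counts = np.array([[1.2, 2.7],
                   [0.6, 3.5]])
integerized = trs(counts, seed=42)
```

Every cell ends up at either its floor or its ceiling, and the matrix total is
preserved, so repeated draws (homesim's default is 30) yield an ensemble of
plausible whole-household realizations.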
The key to this adapted formulation of the classic Trans-                         Because school attendance in Knox County is restricted by
portation Problem is the utilization of minimum and maxi-                         district boundaries, we only placed student households in
mum capacity thresholds that are generated endogenously within                    the PUMAs intersecting with the district (FIPS 4701601,
actlike.ActivityAllocation and are tuned to reflect                               4701602, 4701603, 4701604). However, because educators
the uncertainty of both the population estimates generated by                     may live outside school district boundaries, we simulated
livelike and the reported (or predicted) capacities at activity                   their household locations throughout the Knoxville CBSA.
locations. Moreover, network impedance from origins to destina-            •      Used actlike to perform optimal allocation of
tions (ci j ) can be randomly reduced through an internal process                 workers and students about road networks in Knox
by passing in an integer value to the reduce_seed keyword ar-                     County/Knoxville CBSA. Across the 30 simulations and
gument. By triggering this functionality, the count and magnitude                 14 segments identified, we produced a total of 420 travel
of reduction is determined algorithmically. A random reduction                    simulations. Network impedance was measured in geo-
of this nature is beneficial in generating dispersed solutions that               graphic distance for all student simulations and travel time
do not resemble compact clusters, with an example being the                       for all educator simulations.
replication of a private school’s student body that does not adhere
                                                                           Figure 2 demonstrates the optimal allocations, routing, and
to public school attendance zones.
                                                                       network space for a single simulation of 10th grade public school
    After the optimal solution is found for an
                                                                       students in Knox County, TN. Students, shown in households
actlike.ActivityAllocation                   instance,     selected
                                                                       as small black dots, are associated with schools, represented by
decisions are isolated from non-zero decision variables
                                                                       transparent colored circles sized according to reported enrollment.
with the realized_allocations() method. These
                                                                       The network space connecting student residential locations to
allocations are then used to generate solution routes with the
                                                                       assigned schools is displayed in a matching color. Further, the
network_routes() function that represent the shortest path
                                                                       inset in Figure 2 provides the pseudo-school attendance zone for
along the network traversed from residential locations to assigned
                                                                       10th graders at one school in central Knoxville and demonstrates
activity spaces. Solutions can be further validated with Canonical
                                                                       the adherence to network space.
Correlation Analysis, in instances where the agent segments are
stratified, and simple linear regression for those where a single
segment of agents is used. Validation is discussed further in
Validation & Diagnostics.                                              Our study of K–12 students examines social contact networks
                                                                       with respect to potentially underserved student populations via
                                                                       the compositional characteristics of POIs (schools).
Case Study: K–12 Public Schools in Knox County, TN
To illustrate Likeness’ capability to simulate POI travel among specific population segments, we provide a case study of travel to POIs, in this case K–12 schools, in Knox County, TN. Our choice of K–12 schools was motivated by several factors. First, they serve as common destinations for the two major groups (workers and students) expected to consistently travel on a typical business day [RWM+17]. Second, a complete inventory of public school locations, as well as faculty and enrollment sizes, is publicly available through federal open data sources. In this case, we obtained school locations and faculty sizes from the Homeland Infrastructure Foundation-Level Database (HIFLD) and student enrollment sizes by grade from the National Center for Education Statistics (NCES) Common Core of Data.

    We chose the Knox County School District, which coincides with Knox County’s boundaries, as our study area. We used the livelike package to create 30 synthetic populations for the Knoxville Core-Based Statistical Area (CBSA), then for each simulation we:

   •   Isolated agent segments from the synthetic population. K–12 educators consist of full-time workers employed as primary and secondary education teachers (2018 Standard Occupation Classification System codes 2300–2320) in elementary and secondary schools (NAICS 6111). We separated out student agents by public school and by grade level (Kindergarten through Grade 12).
   •   Performed IBB allocation to simulate the household locations of workers and students. Our selection of household locations for workers and students varied geographically.

Students

We characterized each school’s student body by identifying student profiles based on several criteria: minority race/ethnicity, poverty status, single caregiver households, and unemployed caregiver households (householder and/or spouse/partner). We defined 6 student profiles using an implementation of the density-based K-Modes clustering algorithm [CLB09] with a distance heuristic designed to optimize cluster separation [NLHH07], available through the kmodes package [dV21]. Student profile labels were appended to the student travel simulation results, then used to produce Monte Carlo proportional estimates of profiles by school.

    The results in Figure 3 reveal strong dissimilarities in student makeup between schools on the periphery of Knox County and those nearer to Knoxville’s downtown core in the center of the county. We estimate that the former are largely composed of students in married families, above poverty, and with employed caregivers, whereas the latter are characterized more strongly by single caregiver living arrangements and, particularly in areas north of the downtown core, economic distress (pop-out map).

Workers (Educators)

We evaluated the results of our K–12 educator simulations with respect to POI occupancy characteristics, as informed by commute and work statistics obtained from the PUMS. Specifically, we used work arrival times associated with each synthetic worker (PUMS JWAP) to timestamp the start of each work day, and incremented this by daily hours worked (derived from PUMS WKHP) to create a second timestamp for work departure. The estimated departure time assumes that each educator travels to the school for a typical 5-day workweek, and is estimated as JWAP + WKHP / 5.
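The departure-time estimate can be written out directly. A minimal sketch, assuming the arrival time has already been decoded to minutes after midnight (real PUMS JWAP codes are categorical time intervals, so the value below is hypothetical):

```python
from datetime import datetime, timedelta

def work_interval(arrival_minutes, wkhp, day=datetime(2022, 7, 11)):
    """Return (arrival, departure) timestamps for one workday.

    arrival_minutes -- minutes after midnight (stand-in for a decoded JWAP value)
    wkhp            -- usual weekly hours worked (PUMS WKHP)

    Departure is estimated as arrival + WKHP / 5, i.e. a typical
    5-day workweek split evenly across days.
    """
    start = day + timedelta(minutes=arrival_minutes)
    end = start + timedelta(hours=wkhp / 5)
    return start, end

# An educator arriving at 7:25 AM who works 40 hours/week
# is estimated to depart 8 hours later, at 3:25 PM.
start, end = work_interval(7 * 60 + 25, 40)
```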
130                                                                                     PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

                 Fig. 2: Optimal allocations for one simulation of 10th grade public school students in Knox County, TN.

Fig. 3: Compositional characteristics of K–12 public schools in Knox County, TN based on 6 student profiles. Glyph plot methodology adapted from [GLC+15].
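The six student profiles underlying Figure 3 come from K-Modes clustering of categorical attributes. For illustration only, here is a bare-bones K-Modes loop over a toy binary matrix; the actual pipeline uses the kmodes package, whose Cao initialization [CLB09] and Ng et al. dissimilarity [NLHH07] are not reproduced in this sketch:

```python
import numpy as np

def kmodes(X, k, n_iter=10, seed=0):
    """Bare-bones K-Modes for categorical data (illustrative only)."""
    rng = np.random.default_rng(seed)
    # initialize modes from k distinct observations
    modes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # simple matching dissimilarity: count of mismatched attributes
        d = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # new mode: most frequent category per attribute
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

# Toy matrix: rows = students, columns = binary criteria standing in for
# minority status, poverty, single caregiver, caregiver unemployed.
X = np.array([[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0],
              [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 0, 1]])
labels, modes = kmodes(X, k=2)
```

The per-school profile proportions reported in the paper are then Monte Carlo estimates over such labels across the 30 simulations.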
LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS                                                        131

                            Fig. 4: Hourly worker occupancy estimates for K–12 schools in Knox County, TN.
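The hourly occupancy proportions of the kind plotted in Figure 4 can be tabulated directly from the simulated arrival/departure pairs. A sketch with hypothetical intervals (fractional hours) for five workers at one school:

```python
# Hypothetical arrival/departure hours standing in for the
# JWAP/WKHP-derived timestamps in the simulation.
intervals = [(7.5, 15.5), (7.0, 15.0), (8.0, 16.0), (7.5, 16.5), (9.0, 17.0)]

hours = range(7, 19)  # 7:00 AM (t700) through 6:00 PM (t1800)
present = [sum(a <= h < d for a, d in intervals) / len(intervals)
           for h in hours]
# present[i] is the proportion of workers on site during hour hours[i],
# the per-school quantity plotted in Figure 4.
```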

    Roughly 50 educator agents per simulation were not attributed with work arrival times, possibly due to the source PUMS respondents being away from their typical workplaces (e.g., on summer or winter break) but still working virtually when they were surveyed. We filled in these unknown arrival times with the modal arrival time observed across all simulations (7:25 AM).

    Figure 4 displays the hourly proportion of educators present at each school in Knox County between 7:00 AM (t700) and 6:00 PM (t1800). Morning worker arrivals occur more rapidly than afternoon departures. Between the hours of 7:00 AM and 9:00 AM (t700–t900), schools transition from nearly empty of workers to close to capacity. In the afternoon, workers begin to gradually depart at 3:00 PM (t1500), with somewhere between 50%–70% of workers still present by 4:00 PM (t1600); workers then depart in earnest from 5:00 PM into 6:00 PM (t1700–t1800), by which time most have returned home.

    Geographic differences are also visible and may be a function of (1) a higher concentration of a particular school type (e.g., elementary, middle, high) in a given area and (2) staggered start times between these types (to accommodate bus schedules, etc.), especially elementary schools starting much earlier than middle and high schools. For example, schools near the center of Knox County reach worker capacity more quickly in the morning, starting around 8:00 AM (t800), but also empty out more rapidly than schools in surrounding areas, beginning around 4:00 PM (t1600).

Validation & Diagnostics

A determination of modeling output robustness was needed to validate our results. Specifically, we aimed to ensure the preservation of relative facility size and composition. To perform this validation, we tested the optimal allocations generated by Likeness against the maximally adjusted reported enrollment & faculty employment counts. We used the maximum adjusted value to account for scenarios where the population synthesis phase resulted in a total demographic segment greater than the reported total facility capacity. We employed Canonical Correlation Analysis (CCA) [Kna78] for the K–12 public school student allocations due to their stratified nature, and ordinary least squares (OLS) simple linear regression for the educator allocations [PVG+11]. Because CCA is a multivariate measure, it is only a suitable diagnostic for activity allocation when multiple segments (e.g., students by grade) are of interest. For educators, which we treated as a single agent segment without stratification, we used OLS regression instead. The CCA for students was performed in two components: Between-Destination, which measures capacity across facilities, and Within-Destination, which measures capacity across strata.

    Descriptive Monte Carlo statistics from the 30 simulations were run on the resultant coefficients of determination (R²), which show goodness of fit (approaching 1). As seen in Table 1, all models performed exceedingly well, though the Within-Destination CCA performed slightly less well than both the Between-Destination CCA and the OLS linear regression. In fact, the global minimum of all R² scores approaches 0.99 (students – Within-Destination), which demonstrates robust preservation of

      K–12 segment                           R² Type                    Min      Median   Mean     Max
      Students (public schools)              Between-Destination CCA    0.9967   0.9974   0.9973   0.9976
      Students (public schools)              Within-Destination CCA     0.9883   0.9894   0.9896   0.9910
      Educators (public & private schools)   OLS Linear Regression      0.9977   0.9983   0.9983   0.9991

      TABLE 1: Validating optimal allocations considering reported enrollment at public schools & faculty employment at all schools.
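The OLS diagnostic used for educators reduces to a coefficient of determination between reported and allocated counts. A numpy-only sketch with hypothetical faculty sizes (the paper uses scikit-learn [PVG+11]; the multivariate CCA diagnostic for students is not shown here):

```python
import numpy as np

def r_squared(reported, allocated):
    """R² for a simple linear regression of allocated counts on
    reported counts (the OLS check used for educator allocations)."""
    x = np.asarray(reported, dtype=float)
    y = np.asarray(allocated, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Hypothetical faculty sizes: reported vs. one simulation's allocation.
reported = [42, 55, 61, 38, 70, 49]
allocated = [41, 56, 60, 39, 71, 48]
score = r_squared(reported, allocated)  # near 1 -> relative sizes preserved
```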

true capacities in our synthetic activity modeling. Furthermore, a global maximum of greater than 0.999 is seen for educators, which indicates a near-perfect replication of relative faculty sizes by school.

Discussion

Our Case Study demonstrates the twofold benefits of modeling human dynamics with vivid synthetic populations. Using Likeness, we are able both to produce a more reasoned estimate of the neighborhoods in which people reside and interact than existing synthetic population frameworks, and to support more nuanced characterization of human activities at specific POIs (e.g., social contact networks, occupancy).

    The examples provided in the Case Study show how this refined understanding of human dynamics can benefit planning applications. For example, in the event of a localized emergency, the results of Students could be used to identify schools for which rendezvous with caregivers might pose an added challenge for students (e.g., more students from single caregiver vs. married family households). Additionally, the POI occupancy dynamics demonstrated in Workers (Educators) could be used to assess the times at which worker commutes to/from places of employment might be most sensitive to a nearby disruption. Another application, in the public health sphere, might be to use occupancy estimates to anticipate the best time of day to reach workers, for example during a vaccination campaign.

    Our case study had several limitations that we plan to overcome in future work. First, we assumed that all travel within our study area occurs along road networks. While road-based travel is the dominant means of travel in the Knoxville CBSA, this assumption is not transferable to all other urban areas within the United States. Our eventual goal is to build in additional modes of travel like public transit, walk/bike, and ferries by expanding our ingest of OpenStreetMap features.

    Second, we do not yet offer direct support for non-traditional schools (e.g., populations with special needs, families on military bases). For example, the Tennessee School for the Deaf falls within our study area, and its compositional estimate could be refined if we reapportioned students more likely to be in attendance at that location.

    Third, we did not account for teachers in virtual schools, who may account for a portion of the missing work arrival times discussed in Workers (Educators). Work-from-home populations can be better incorporated into our travel simulations by applying work schedules from time-use surveys to probabilistically assign in-person or remote status based on occupation. We are particularly interested in using this technique with Likeness to better understand changing patterns of life during the COVID-19 pandemic in 2020.

Conclusion

The Likeness toolkit enhances agent creation for modeling human dynamics through its dual capabilities of high-fidelity ("vivid") agent characterization and travel along real-world transportation networks to POIs. These capabilities benefit planners and urban researchers by providing a richer understanding of how spatial policy interventions can be designed with respect to how people live, move, and interact. Likeness strives to be flexible toward a variety of research applications linked to human security, among them spatial epidemiology, transportation equity, and environmental hazards.

    Several ongoing developments will further Likeness’ capabilities. First, we plan to expand our support for POIs curated from location services (e.g., Google, Facebook, Here, TomTom, FourSquare) by the ORNL PlanetSense project [TBP+15], incorporating factors like facility size, hours of operation, and popularity curves to refine the destination capacity estimates required to perform actlike simulations. Second, along with multi-modal travel, we plan to incorporate multiple trip models based on large-scale human activity datasets like the American Time Use Survey and National Household Travel Survey. Together, these improvements will extend our travel simulations to "non-obligate" population segments traveling to civic, social, and recreational activities [BMWR22]. Third, the current procedure for spatial allocation uses block groups as the target scale for population synthesis. However, a limited number of constraining variables are available at the block group level. To include a larger volume of constraints (e.g., vehicle access, language), we are exploring an additional tract-level approach, in which P-MEDM is run on cross-covariances between tracts and "supertract" aggregations created with the max-p-regions problem [DAR12], [WRK21] implemented in PySAL’s spopt [RA07], [FGK+21], [RAA+21], [FBG+22].

    As a final note, the Likeness toolkit is being developed on top of key open source dependencies in the Scientific Python ecosystem, the core of which are, of course, numpy [HMvdW+20] and scipy [VGO+20]. Although an exhaustive list would be prohibitive, major packages not previously mentioned include geopandas [JdBF+21], matplotlib [Hun07], networkx [HSS08], pandas [pdt20], [WM10], and shapely [G+]. Our goal is to contribute to the community with releases of the packages comprising Likeness, but since this is an emerging project its development to date has been limited to researchers at ORNL. However, we plan to provide a fully open-sourced code base within the coming year through GitHub.

Acknowledgments

This material is based upon work supported by the U.S. Department of Energy under contract no. DE-AC05-00OR22725.

REFERENCES

[ANM+18]    H.M. Abdul Aziz, Nicholas N. Nagle, April M. Morton, Michael R. Hilliard, Devin A. White, and Robert N. Stewart. Exploring the impact of walk–bike infrastructure, safety perception, and built-environment on active transportation mode choice: a random parameter model using New York City commuter data. Transportation, 45(5):1207–1229, 2018. doi:10.1007/s11116-017-9760-8.
[BBE+08]    Christopher L. Barrett, Keith R. Bisset, Stephen G. Eubank, Xizhou Feng, and Madhav V. Marathe. EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12. IEEE, 2008. doi:10.1109/SC.2008.5214892.
[BBM96]     Richard J. Beckman, Keith A. Baggerly, and Michael D. McKay. Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice, 30(6):415–429, 1996. doi:10.1016/0965-8564(96)00004-3.
[BCD+06]    Dimitris Ballas, Graham Clarke, Danny Dorling, Jan Rigby, and Ben Wheeler. Using geographical information systems and spatial microsimulation for the analysis of health inequalities. Health Informatics Journal, 12(1):65–79, 2006. doi:10.1177/1460458206061217.
[BFH+17]    Komal Basra, M. Patricia Fabian, Raymond R. Holberger, Robert French, and Jonathan I. Levy. Community-engaged modeling of geographic and demographic patterns of multiple public health risk factors. International Journal of Environmental Research and Public Health, 14(7):730, 2017. doi:10.3390/ijerph14070730.
[BMWR22]    Christa Brelsford, Jessica J. Moehl, Eric M. Weber, and Amy N. Rose. Segmented Population Models: Improving the LandScan USA Non-Obligate Population Estimate (NOPE). American Association of Geographers 2022 Annual Meeting, 2022.
[Boe17]     Geoff Boeing. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems, 65:126–139, September 2017. doi:10.1016/j.compenvurbsys.2017.05.004.
[CGSdG08]   Isabel Correia, Luís Gouveia, and Francisco Saldanha-da-Gama. Solving the variable size bin packing problem with discretized formulations. Computers & Operations Research, 35(6):2103–2113, June 2008. doi:10.1016/j.cor.2006.10.014.
[CLB09]     Fuyuan Cao, Jiye Liang, and Liang Bai. A new initialization method for categorical data clustering. Expert Systems with Applications, 36(7):10223–10228, 2009. doi:10.1016/j.eswa.2009.01.060.
[DAR12]     Juan C. Duque, Luc Anselin, and Sergio J. Rey. The max-p-regions problem. Journal of Regional Science, 52(3):397–419, 2012. doi:10.1111/j.1467-9787.
[DKA+08]    M. Diaz, J.J. Kim, G. Albero, S. De Sanjose, G. Clifford, F.X. Bosch, and S.J. Goldie. Health and economic impact of HPV 16 and 18 vaccination and cervical cancer screening in India. British Journal of Cancer, 99(2):230–238, 2008. doi:10.
[dV21]      Nelis J. de Vos. kmodes categorical clustering library. https://, 2015–2021.
[FBG+22]    Xin Feng, Germano Barcelos, James D. Gaboardi, Elijah Knaap, Ran Wei, Levi J. Wolf, Qunshan Zhao, and Sergio J. Rey. spopt: a python package for solving spatial optimization problems in PySAL. Journal of Open Source Software, 7(74):3330, 2022. doi:10.21105/joss.03330.
[FGK+21]    Xin Feng, James D. Gaboardi, Elijah Knaap, Sergio J. Rey, and Ran Wei. pysal/spopt, January 2021. URL: pysal/spopt, doi:10.5281/zenodo.4444156.
[FL86]      D.K. Friesen and M.A. Langston. Variable Sized Bin Packing. SIAM Journal on Computing, 15(1):222–230, February 1986. doi:10.1137/0215016.
[FW12]      Fletcher Foti and Paul Waddell. A Generalized Computational Framework for Accessibility: From the Pedestrian to the Metropolitan Scale. In Transportation Research Board Annual Conference, pages 1–14, 2012. URL: 4thITM/Papers-A/0117-000062.pdf.
[G+]        Sean Gillies et al. Shapely: manipulation and analysis of geometric objects, 2007–. URL: shapely.
[GFH20]     James D. Gaboardi, David C. Folch, and Mark W. Horner. Connecting Points to Spatial Networks: Effects on Discrete Optimization Models. Geographical Analysis, 52(2):299–322, 2020. doi:10.1111/gean.12211.
[GLC+15]    Isabella Gollini, Binbin Lu, Martin Charlton, Christopher Brunsdon, and Paul Harris. GWmodel: An R package for exploring spatial heterogeneity using geographically weighted models. Journal of Statistical Software, 63(17):1–50, 2015. doi:10.18637/jss.v063.i17.
[GT22]      James D. Gaboardi and Joseph V. Tuccillo. Simulating Travel to Points of Interest for Demographically-rich Synthetic Populations, February 2022. American Association of Geographers Annual Meeting. doi:10.5281/zenodo.6335783.
[Hew97]     Kenneth Hewitt. Vulnerability Perspectives: the Human Ecology of Endangerment. In Regions of Risk: A Geographical Introduction to Disasters, chapter 6, pages 141–164. Addison Wesley Longman, 1997.
[HHSB12]    Kirk Harland, Alison Heppenstall, Dianna Smith, and Mark H. Birkin. Creating realistic synthetic populations at varying spatial scales: A comparative critique of population synthesis techniques. Journal of Artificial Societies and Social Simulation, 15(1):1, 2012. doi:10.18564/jasss.1909.
[Hit41]     Frank L. Hitchcock. The Distribution of a Product from Several Sources to Numerous Localities. Journal of Mathematics and Physics, 20(1-4):224–230, 1941. doi:10.1002/sapm1941201224.
[HMvdW+20]  Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[HNB+11]    Jan A.C. Hontelez, Nico Nagelkerke, Till Bärnighausen, Roel Bakker, Frank Tanser, Marie-Louise Newell, Mark N. Lurie, Rob Baltussen, and Sake J. de Vlas. The potential impact of RV144-like vaccines in rural South Africa: a study using the STDSIM microsimulation model. Vaccine, 29(36):6100–6106, 2011. doi:10.1016/j.vaccine.2011.06.059.
[HSS08]     Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring Network Structure, Dynamics, and Function using NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA, 2008. URL:
[Hun07]     J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[JdBF+21]   Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann, James McBride, Jacob Wasserman, Adrian Garcia Badaracco, Jeffrey Gerard, Alan D. Snow, Jeff Tratner, Matthew Perry, Carson Farmer, Geir Arne Hjelle, Micah Cochran, Sean Gillies, Lucas Culbertson, Matt Bartos, Brendan Ward, Giacomo Caria, Mike Taves, Nick Eubank, sangarshanan, John Flavin, Matt Richards, Sergio Rey, maxalbert, Aleksey Bilogur, Christopher Ren, Dani Arribas-Bel, Daniel Mesejo-León, and Leah Wasser. geopandas/geopandas: v0.10.2, October 2021. doi:10.5281/zenodo.5573592.
[Kna78]     Thomas R. Knapp. Canonical Correlation Analysis: A general parametric significance-testing system. Psychological Bulletin, 85(2):410–416, 1978. doi:10.1037/0033-2909.85.2.410.
[Koo49]     Tjalling C. Koopmans. Optimum Utilization of the Transportation System. Econometrica, 17:136–146, 1949. Publisher: [Wiley, Econometric Society]. doi:10.2307/1907301.
[LB13]      Robin Lovelace and Dimitris Ballas. ‘Truncate, replicate, sample’: A method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41:1–11, September 2013. doi:10.1016/j.compenvurbsys.2013.03.004.
[LNB13]     Stefan Leyk, Nicholas N. Nagle, and Barbara P. Buttenfield. Maximum Entropy Dasymetric Modeling for Demographic Small Area Estimation. Geographical Analysis, 45(3):285–306, July 2013. doi:10.1111/gean.12011.

[MCB+08]    Karyn Morrissey, Graham Clarke, Dimitris Ballas, Stephen Hynes, and Cathal O’Donoghue. Examining access to GP services in rural Ireland using microsimulation analysis. Area, 40(3):354–364, 2008. doi:10.1111/j.1475-4762.2008.00844.x.
[MNP+17]    April M. Morton, Nicholas N. Nagle, Jesse O. Piburn, Robert N. Stewart, and Ryan McManamay. A hybrid dasymetric and machine learning approach to high-resolution residential electricity consumption modeling. In Advances in Geocomputation, pages 47–58. Springer, 2017. doi:10.1007/978-3-319-22786-3_5.
[MOD11]     Stuart Mitchell, Michael O’Sullivan, and Iain Dunning. PuLP: A Linear Programming Toolkit for Python. Technical report, 2011. URL: 216/PAPERS/2011.%20PuLP%20-%20A%20Linear%20Programming%20Toolkit%20for%20Python.pdf.
[MPN+17]    April M. Morton, Jesse O. Piburn, Nicholas N. Nagle, H.M. Aziz, Samantha E. Duchscherer, and Robert N. Stewart. A simulation approach for modeling high-resolution daytime commuter travel flows and distributions of worker subpopulations. In GeoComputation 2017, Leeds, UK, pages 1–5, 2017.
[MS01]      Harvey J. Miller and Shih-Lung Shaw. Geographic Information Systems for Transportation: Principles and Applications. Oxford University Press, New York, 2001.
[MS15]      Harvey J. Miller and Shih-Lung Shaw. Geographic Information Systems for Transportation in the 21st Century. Geography Compass, 9(4):180–189, 2015. doi:10.1111/gec3.12204.
[NBLS14]    Nicholas N. Nagle, Barbara P. Buttenfield, Stefan Leyk, and Seth Spielman. Dasymetric modeling and uncertainty. Annals of the Association of American Geographers, 104(1):80–95, 2014. doi:10.1080/00045608.2013.843439.
[NCA13]     Markku Nurhonen, Allen C. Cheng, and Kari Auranen. Pneumococcal transmission and disease in silico: a microsimulation model of the indirect effects of vaccination. PloS one, 8(2):e56079, 2013. doi:10.1371/journal.pone.0056079.
[NLHH07]    Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Zengyou He. On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):503–507, 2007. doi:10.1109/TPAMI.2007.53.
[pdt20]     The pandas development team. pandas-dev/pandas: Pandas, February 2020. doi:10.5281/zenodo.3509134.
[PVG+11]    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. URL: papers/v12/pedregosa11a.html.
[QC13]      Fang Qiu and Robert Cromley. Areal Interpolation and Dasymetric Modeling: Areal Interpolation and Dasymetric Modeling. Geographical Analysis, 45(3):213–215, July 2013. doi:10.1111/gean.12016.
[RA07]      Sergio J. Rey and Luc Anselin. PySAL: A Python Library of Spatial Analytical Methods. The Review of Regional Studies, 37(1):5–27, 2007. URL: 8285.pdf, doi:10.52324/001c.8285.
            Scan USA 2016 [Data set]. Technical report, Oak Ridge National Laboratory, 2017. doi:10.48690/1523377.
[SEM14]     Samarth Swarup, Stephen G. Eubank, and Madhav V. Marathe. Computational epidemiology as a challenge domain for multi-agent systems. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1173–1176, 2014. URL: aamas2014/proceedings/aamas/p1173.pdf.
[SNGJ+09]   Beate Sander, Azhar Nizam, Louis P. Garrison Jr., Maarten J. Postma, M. Elizabeth Halloran, and Ira M. Longini Jr. Economic evaluation of influenza pandemic mitigation strategies in the United States using a stochastic microsimulation transmission model. Value in Health, 12(2):226–233, 2009. doi:10.1111/j.1524-4733.2008.00437.x.
[SPH11]     Dianna M. Smith, Jamie R. Pearce, and Kirk Harland. Can a deterministic spatial microsimulation model provide reliable small-area estimates of health behaviours? An example of smoking prevalence in New Zealand. Health & Place, 17(2):618–624, 2011. doi:10.1016/j.healthplace.2011.01.001.
[ST20]      Haroldo G. Santos and Túlio A.M. Toffolo. Mixed Integer Linear Programming with Python. Technical report, 2020. URL:
[TBP+15]    Gautam S. Thakur, Budhendra L. Bhaduri, Jesse O. Piburn, Kelly M. Sims, Robert N. Stewart, and Marie L. Urban. PlanetSense: a real-time streaming and spatio-temporal analytics platform for gathering geo-spatial intelligence from open source data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 1–4, 2015. doi:10.1145/2820783.2820882.
[TCR08]     Melanie N. Tomintz, Graham P. Clarke, and Janette E. Rigby. The geography of smoking in Leeds: estimating individual smoking rates and the implications for the location of stop smoking services. Area, 40(3):341–353, 2008. doi:10.1111/j.1475-4762.2008.00837.x.
[TG22]      Joseph V. Tuccillo and James D. Gaboardi. Connecting Vivid Population Data to Human Dynamics, June 2022. Distilling Diversity by Tapping High-Resolution Population and Survey Data. doi:10.5281/zenodo.6607533.
[TM21]      Joseph V. Tuccillo and Jessica Moehl. An Individual-Oriented Typology of Social Areas in the United States, May 2021. 2021 ACS Data Users Conference. doi:10.5281/zenodo.6672291.
[TMKD17]    Matthias Templ, Bernhard Meindl, Alexander Kowarik, and Olivier Dupriez. Simulation of synthetic complex data: The R package simPop. Journal of Statistical Software, 79:1–38, 2017. doi:10.18637/jss.v079.i10.
[Tuc21]     Joseph V. Tuccillo. An Individual-Centered Approach for Geodemographic Classification. In 11th International Conference on Geographic Information Science 2021 Short Paper Proceedings, pages 1–6, 2021. doi:10.25436/E2H59M.
[VGO+20]    Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C.J. Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H.
[RAA+ 21]   Sergio J. Rey, Luc Anselin, Pedro Amaral, Dani Arribas-                        Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy
            Bel, Renan Xavier Cortes, James David Gaboardi, Wei Kang,                      1.0 Contributors. SciPy 1.0: Fundamental Algorithms for
            Elijah Knaap, Ziqi Li, Stefanie Lumnitz, Taylor M. Oshan,                      Scientific Computing in Python. Nature Methods, 17:261–272,
            Hu Shao, and Levi John Wolf. The PySAL Ecosystem:                              2020. doi:10.1038/s41592-019-0686-2.
            Philosophy and Implementation. Geographical Analysis, 2021.       [WCC+ 09]    William D. Wheaton, James C. Cajka, Bernadette M. Chas-
            doi:10.1111/gean.12276.                                                        teen, Diane K. Wagener, Philip C. Cooley, Laxminarayana
[RSF+ 21]   Krishna P. Reddy, Fatma M. Shebl, Julia H.A. Foote, Guy                        Ganapathi, Douglas J. Roberts, and Justine L. Allpress.
            Harling, Justine A. Scott, Christopher Panella, Kieran P. Fitz-                Synthesized population databases: A US geospatial database
            maurice, Clare Flanagan, Emily P. Hyle, Anne M. Neilan, et al.                 for agent-based models.       Methods report (RTI Press),
            Cost-effectiveness of public health strategies for COVID-19                    2009(10):905, 2009. doi:10.3768/rtipress.2009.
            epidemic control in South Africa: a microsimulation modelling                  mr.0010.0905.
            study. The Lancet Global Health, 9(2):e120–e129, 2021.            [WM10]       Wes McKinney. Data Structures for Statistical Computing in
            doi:10.1016/S2214-109X(20)30452-6.                                             Python. In Stéfan van der Walt and Jarrod Millman, editors,
[RWM+ 17]   Amy N. Rose, Eric M. Weber, Jessica J. Moehl, Melanie L.                       Proceedings of the 9th Python in Science Conference, pages 56
            Laverdiere, Hsiu-Han Yang, Matthew C. Whitehead, Kelly M.                      – 61, 2010. doi:10.25080/Majora-92bf1922-00a.
            Sims, Nathan E. Trombley, and Budhendra L. Bhaduri. Land-         [WRK21]      Ran Wei, Sergio J. Rey, and Elijah Knaap. Efficient re-

             gionalization for spatially explicit neighborhood delineation.
             International Journal of Geographical Information Science,
             35(1):135–151, 2021. doi:10.1080/13658816.2020.
[ZFJ14]      Yi Zhu and Joseph Ferreira Jr. Synthetic population gener-
             ation at disaggregated spatial scales for land use and trans-
             portation microsimulation. Transportation Research Record,
             2429(1):168–177, 2014. doi:10.3141/2429-18.
136                                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

                      poliastro: a Python library for interactive astrodynamics

                       Juan Luis Cano Rodríguez‡∗, Jorge Martínez Garrido‡



Abstract—Space is more popular than ever, with the growing public awareness of interplanetary scientific missions, as well as the increasingly large number of satellite companies planning to deploy satellite constellations. Python has become a fundamental technology in the astronomical sciences, and it has also caught the attention of the Space Engineering community.
     One of the requirements for designing a space mission is studying the trajectories of satellites, probes, and other artificial objects, usually ignoring non-gravitational forces or treating them as perturbations: the so-called n-body problem. However, for preliminary design studies and most practical purposes, it is sufficient to consider only two bodies: the object under study and its attractor.
     Even though the two-body problem has many analytical solutions, orbit propagation (the initial value problem) and targeting (the boundary value problem) remain computationally intensive because of long propagation times, tight tolerances, and vast solution spaces. On the other hand, astrodynamics researchers often do not share the source code they used to run analyses and simulations, which makes it challenging to try out new solutions.
     This paper presents poliastro, an open-source Python library for interactive astrodynamics that features an easy-to-use API and tools for quick visualization. poliastro implements core astrodynamics algorithms (such as the resolution of the Kepler and Lambert problems) and leverages numba, a Just-in-Time compiler for scientific Python, to optimize the running time. Thanks to Astropy, poliastro can perform seamless coordinate frame conversions and use proper physical units and timescales. At the moment, poliastro is the longest-lived Python library for astrodynamics, has contributors from all around the world, and several New Space companies and people in academia use it.

Index Terms—astrodynamics, orbital mechanics, orbit propagation, orbit visualization, two-body problem

* Corresponding author:
‡ Unaffiliated

Copyright © 2022 Juan Luis Cano Rodríguez et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

History

The term "astrodynamics" was coined by the American astronomer Samuel Herrick, who received encouragement from the space pioneer Robert H. Goddard, and refers to the branch of space science dealing with the motion of artificial celestial bodies ([Dub73], [Her71]). However, the roots of its mathematical foundations go back several centuries.
    Kepler first introduced his laws of planetary motion in 1609 and 1619 and derived his famous transcendental equation (1), which we now see as capturing a restricted form of the two-body problem. This work was generalized by Newton to give birth to the n-body problem, and many other mathematicians worked on it throughout the centuries (Daniel and Johann Bernoulli, Euler, Gauss). Poincaré established in the 1890s that no general closed-form solution exists for the n-body problem, since the resulting dynamical system is chaotic [Bat99]. Sundman proved in the 1900s the existence of convergent solutions for a few restricted cases with n = 3.

                          M = E − e sin E                          (1)

In 1903 Tsiolkovsky evaluated the conditions required for artificial objects to leave the orbit of the Earth; this is considered a foundational contribution to the field of astrodynamics. Tsiolkovsky devised equation (2), which relates the increase in velocity to the effective exhaust velocity of the expelled gases and the fraction of used propellant.

                          ∆v = ve ln(m0/mf)                        (2)

Further developments by Kondratyuk, Hohmann, and Oberth in the early 20th century all added to the growing field of orbital mechanics, which in turn enabled the development of space flight in the USSR and the United States in the 1950s and 1960s.

The two-body problem

In a system of i ∈ 1, ..., n bodies subject to their mutual attraction, by application of Newton's law of universal gravitation, the total force fi affecting mi due to the presence of the other n − 1 masses is given by [Bat99]:

                          fi = −G ∑_{j≠i} (mi mj / |rij|³) rij

where G = 6.67430 · 10⁻¹¹ N m² kg⁻² is the universal gravitational constant, and rij denotes the position vector from mi to mj. Applying Newton's second law of motion results in a system of n differential equations:

                          d²ri/dt² = −G ∑_{j≠i} (mj / |rij|³) rij       (4)

By setting n = 2 in (4) and subtracting the two resulting equalities, one arrives at the fundamental equation of the two-body problem:

                          d²r/dt² = −(µ/r³) r                      (5)

where µ = G(m1 + m2) = G(M + m). When m ≪ M (for example, an artificial satellite orbiting a planet), one can consider µ = GM a property of the attractor.
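As a quick illustration of equation (5), the right-hand side is simple to evaluate numerically. The following standalone sketch (not part of poliastro; the value of µ for Earth is an assumed standard constant) computes the two-body acceleration for a position vector:

```python
import math

MU_EARTH = 3.986004418e14  # assumed mu = GM for Earth, in m^3/s^2


def two_body_acceleration(r, mu=MU_EARTH):
    """Right-hand side of eq. (5): d2r/dt2 = -mu * r / |r|^3.

    r is a 3-tuple of coordinates in meters; returns m/s^2 per axis.
    """
    norm = math.sqrt(sum(c * c for c in r))
    return tuple(-mu * c / norm**3 for c in r)


# Acceleration 7000 km from Earth's center, along the x axis
acc = two_body_acceleration((7000e3, 0.0, 0.0))
```

At that distance the magnitude reduces to µ/|r|², roughly 8.1 m/s², directed back toward the attractor, which is exactly the 1/r² falloff the equation encodes.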

Keplerian vs non-Keplerian motion

Conveniently manipulating equation 5 leads to several properties [Bat99] that were already published by Johannes Kepler in the 1610s, namely:

   1)   The orbit always describes a conic section (an ellipse, a parabola, or a hyperbola), with the attractor at one of the two foci, and can be written in polar coordinates as r = p/(1 + e cos ν) (Kepler's first law).
   2)   The magnitude of the specific angular momentum h = r² dθ/dt is constant and equal to two times the areal velocity (Kepler's second law).
   3)   For closed (circular and elliptical) orbits, the period is related to the size of the orbit through P = 2π √(a³/µ) (Kepler's third law).

    For many practical purposes it is usually sufficient to limit the study to one object orbiting an attractor and ignore all other external forces of the system, hence restricting the study to trajectories governed by equation 5. Such trajectories are called "Keplerian", and several problems can be formulated for them:

   •    The initial-value problem, which is usually called propagation, involves determining the position and velocity of an object after an elapsed period of time given some initial conditions.
   •    Preliminary orbit determination, which involves using exact or approximate methods to derive a Keplerian orbit from a set of observations.
   •    The boundary-value problem, often named the Lambert problem, which involves determining a Keplerian orbit from boundary conditions, usually departure and arrival position vectors and a time of flight.

    Fortunately, most of these problems boil down to finding numerical solutions to relatively simple algebraic relations between time and angular variables: for elliptic motion (0 ≤ e < 1) it is the Kepler equation, and equivalent relations exist for the other eccentricity regimes [Bat99]. Numerical solutions for these equations can be found in a number of different ways, each one with different complexity and precision tradeoffs. In the Methods section we list the ones implemented by poliastro.
    On the other hand, there are many situations in which natural and artificial orbital perturbations must be taken into account so that the actual non-Keplerian motion can be properly analyzed:

   •    Interplanetary travel in the proximity of other planets. In a first approximation it is usually enough to study the trajectory in segments and focus the analysis on the closest attractor, hence patching several Keplerian orbits along the way (the so-called "patched-conic approximation") [Bat99]. The boundary surface that separates one segment from the other is called the sphere of influence.
   •    Use of solar sails, electric propulsion, or other means of continuous thrust. Devising the optimal guidance laws that minimize travel time or fuel consumption under these conditions is usually treated as an optimization problem of a dynamical system, and as such it is particularly challenging [Con14].
   •    Artificial satellites in the vicinity of a planet. This is the regime in which all the commercial space industry operates, especially for those satellites in Low-Earth Orbit (LEO).

State of the art

In our view, at the time of creating poliastro there were a number of issues with existing open source astrodynamics software that posed a barrier of entry for novices and amateur practitioners. Most of these barriers still exist today and are described in the following paragraphs. The goals of the project can be condensed as follows:

   1)   Set an example of reproducibility and good coding practices in astrodynamics.
   2)   Become approachable software even for novices.
   3)   Offer performant software that can also be used in scripting and interactive workflows.

    The most mature software libraries for astrodynamics are arguably Orekit [noa22c], a "low level space dynamics library written in Java" with an open governance model, and SPICE [noa22d], a toolkit developed by NASA's Navigation and Ancillary Information Facility at the Jet Propulsion Laboratory. Other similar, smaller projects that appeared later on and that are still maintained to this day include PyKEP [IBD+20], beyond [noa22a], tudatpy [noa22e], sbpy [MKDVB+19], Skyfield [Rho20] (Python), CelestLab (Scilab) [noa22b], astrodynamics.jl (Julia) [noa], and Nyx (Rust) [noa21a]. In addition, there are some Graphical User Interface (GUI) based open source programs used for Mission Analysis and orbit visualization, such as GMAT [noa20] and gpredict [noa18], and complete web applications for tracking constellations of satellites like the SatNOGS project by the Libre Space Foundation [noa21b].
    The level of quality and maintenance of these packages is somewhat heterogeneous. Community-led projects with strong corporate backing like Orekit are in excellent health, while on the other hand smaller projects developed by volunteers (beyond, astrodynamics.jl) or with limited institutional support (PyKEP, GMAT) suffer from lack of maintenance. Part of the problem might stem from the fact that most scientists are never taught how to build software efficiently, let alone the skills to collaboratively develop software in the open [WAB+14], and astrodynamicists are no exception.
    On the other hand, it is often difficult to translate the advances in astrodynamics research into software. Classical algorithms developed throughout the 20th century are described in papers that are sometimes difficult to find, and source code or validation data is almost never available. When it comes to modern research carried out in the digital era, source code and validation data are still difficult to obtain, even though they are supposedly provided "upon reasonable request" [SSM18] [GBP22].
    It is no surprise that astrodynamics software often requires deep expertise. However, there are often implicit assumptions that are not documented with an adequate level of detail, which originate widespread misconceptions and lead even seasoned professionals to make conceptual mistakes. Some of the most notorious misconceptions arise around the use of general perturbations data (OMMs and TLEs) [Fin07], the geometric interpretation of the mean anomaly [Bat99], or coordinate transformations [VCHK06].
    Finally, few of the open source software libraries mentioned above are amenable to scripting or interactive use, as promoted by computational notebooks like Jupyter [KRKP+16].
    The following sections will now discuss the various areas of current research that an astrodynamicist will engage in, and how poliastro improves their workflow.
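To make the role of the Kepler equation concrete, a handful of Newton iterations is enough to solve M = E − e sin E in the elliptic regime. The following is an illustrative stdlib-only sketch, not poliastro's implementation (which provides several algorithms with different complexity and precision tradeoffs):

```python
import math


def solve_kepler(M, ecc, tol=1e-12, maxiter=50):
    """Solve Kepler's equation M = E - ecc*sin(E) for E (radians).

    Valid for elliptic motion, 0 <= ecc < 1, via Newton-Raphson.
    """
    # Common starting guess: E = M for moderate eccentricities
    E = M if ecc < 0.8 else math.pi
    for _ in range(maxiter):
        # Newton step on f(E) = E - ecc*sin(E) - M
        delta = (E - ecc * math.sin(E) - M) / (1.0 - ecc * math.cos(E))
        E -= delta
        if abs(delta) < tol:
            break
    return E


# Eccentric anomaly for a mean anomaly of 1 rad on an ecc = 0.3 orbit
E = solve_kepler(1.0, 0.3)
```

Substituting the result back into the equation should reproduce the mean anomaly to within the requested tolerance, which is a convenient self-check for any Kepler solver.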

Software Architecture

The architecture of poliastro emerges from the following set of conflicting requirements:

   1)   There should be a high-level API that enables users to perform orbital calculations in a straightforward way and prevents typical mistakes.
   2)   The running time of the algorithms should be within the same order of magnitude of existing compiled implementations.
   3)   The library should be written in a popular open-source language to maximize adoption and lower the barrier to external contributors.

    One of the most typical mistakes we set ourselves to prevent with the high-level API is dimensional errors. Addition and subtraction operations of physical quantities are defined only for quantities with the same units [Dro53]: for example, the operation 1 km + 100 m requires a scale transformation of at least one of the operands, since they have different units (kilometers and meters) but the same dimension (length), whereas the operation 1 km + 1 kg is directly not allowed because the dimensions are incompatible (length and mass). As such, software systems operating with physical quantities should raise exceptions when adding different dimensions, and transparently perform the required scale transformations when adding different units of the same dimension.
    With this in mind, we evaluated several Python packages for unit handling (see [JGAZJT+18] for a recent survey) and chose astropy.units [TPWS+18].

radius = 6000  # km
altitude = 500  # m

# Wrong!
distance = radius + altitude

from astropy import units as u

# Correct
distance = (radius << + (altitude << u.m)

This notion of providing a "safe" API extends to other parts of the library by leveraging other capabilities of the Astropy project. For example, timestamps use astropy.time objects, which take care of the appropriate handling of time scales (such as TDB or UTC), reference frame conversions leverage astropy.coordinates, and so forth.
    One of the drawbacks of existing unit packages is that they impose a significant performance penalty. Even though astropy.units is integrated with NumPy, hence allowing the creation of array quantities, all the unit compatibility checks are implemented in Python and require lots of introspection, and this can slow down mathematical operations by several orders of magnitude. As such, to fulfill our desired performance requirement for poliastro, we envisioned a two-layer architecture:

   •    The Core API follows a procedural style, and all the functions receive Python numerical types and NumPy arrays for maximum performance.
   •    The High level API is object-oriented, all the methods receive Astropy Quantity objects with physical units, and computations are deferred to the Core API.

Fig. 1: poliastro two-layer architecture (a "Nice, high level API" on top of the "Dangerous™ algorithms").

    Most of the methods of the High level API consist only of the necessary unit compatibility checks, plus a wrapper over the corresponding Core API function that performs the actual computation.

def E_to_nu(E, ecc):
    """True anomaly from eccentric anomaly."""
    return (
        E_to_nu_fast(
  , ecc.value
        ) << u.rad
    ).to(E.unit)

As a result, poliastro offers a unit-safe API that performs the least amount of computation possible to minimize the performance penalty of unit checks, and also a unit-unsafe API that offers maximum performance at the cost of not performing any unit validation checks.
    Finally, there are several options to write performant code that can be used from Python, and one of them is using a fast, compiled language for the CPU intensive parts. Successful examples of this include NumPy, written in C [HMvdW+20], SciPy, featuring a mix of FORTRAN, C, and C++ code [VGO+20], and pandas, making heavy use of Cython [BBC+11]. However, having to write code in two different languages hinders the development speed, makes debugging more difficult, and narrows the potential contributor base (what the Julia creators called "The Two Language Problem" [BEKS17]).
    As authors of poliastro we wanted to use Python as the sole programming language of the implementation, and the best solution we found to improve its performance was to use Numba, an LLVM-based Python JIT compiler [LPS15].

Usage

Basic Orbit and Ephem creation

The two central objects of the poliastro high level API are Orbit and Ephem:

   •    Orbit objects represent an osculating (hence Keplerian) orbit of a dimensionless object around an attractor at a given point in time and in a certain reference frame.
   •    Ephem objects represent an ephemeris, a sequence of spatial coordinates over a period of time in a certain reference frame.

    There are six parameters that uniquely determine a Keplerian orbit, plus the gravitational parameter of the corresponding attractor (k or µ). Optionally, an epoch that contextualizes the orbit can be included as well. This set of six parameters is not unique, and several of them have been developed over the years to serve different purposes. The most widely used ones are:

   •    Cartesian elements: Three components for the position (x, y, z) and three components for the velocity (vx, vy, vz). This set has no singularities.

   •    Classical Keplerian elements: Two components for the shape of the conic (usually the semimajor axis a or the semiparameter p, and the eccentricity e), three Euler angles for the orientation of the orbital plane in space (inclination i, right ascension of the ascending node Ω, and argument of periapsis ω), and one polar angle for the position of the body along the conic (usually the true anomaly f or ν). This set of elements has an easy geometrical interpretation and the advantage that, in pure two-body motion, five of them are fixed (a, e, i, Ω, ω) and only one is time-dependent (ν), which greatly simplifies the analytical treatment of orbital perturbations. However, they suffer from singularities stemming from the Euler angles ("gimbal lock"), and equations expressed in them are ill-conditioned near such singularities.
   •    Walker modified equinoctial elements: Six parameters (p, f, g, h, k, L). Only L is time-dependent and this set has no singularities; however, the geometrical interpretation of the rest of the elements is lost [WIO85].

    Here is how to create an Orbit from Cartesian and from classical Keplerian elements. Walker modified equinoctial elements are supported as well.

from astropy import units as u

from poliastro.bodies import Earth, Sun
from poliastro.twobody import Orbit
from poliastro.constants import J2000

# Data from Curtis, example 4.3
r = [-6045, -3490, 2500] <<
v = [-3.457, 6.618, 2.533] << / u.s

orb_curtis = Orbit.from_vectors(
   Earth,  # Attractor
   r, v  # Elements
)

# Data for Mars at J2000 from JPL HORIZONS
a = 1.523679 <<
ecc = 0.093315 <<
inc = 1.85 << u.deg
raan = 49.562 << u.deg
argp = 286.537 << u.deg
nu = 23.33 << u.deg

orb_mars = Orbit.from_classical(
   Sun,  # Attractor
   a, ecc, inc, raan, argp, nu,  # Elements
   J2000  # Epoch
)

When displayed on an interactive REPL, Orbit objects provide basic information about the geometry, the attractor, and the epoch:

>>> orb_curtis
7283 x 10293 km x 153.2 deg (GCRS) orbit
around Earth (♁) at epoch J2000.000 (TT)

The equivalent snippet for Ephem objects is as follows:

from astropy.coordinates import solar_system_ephemeris

from poliastro.bodies import Earth
from poliastro.ephem import Ephem

# Configure high fidelity ephemerides globally
# (requires network access)
solar_system_ephemeris.set("jpl")

# For predefined poliastro attractors
earth = Ephem.from_body(Earth,

# For the rest of the Solar System bodies
ceres = Ephem.from_horizons("Ceres",

There are some crucial differences between Orbit and Ephem objects:

   •    Orbit objects have an attractor, whereas Ephem objects do not. Ephemerides can originate from complex trajectories that don't necessarily conform to the ideal two-body problem.
   •    Orbit objects capture a precise instant in a two-body motion plus the necessary information to propagate it forward in time indefinitely, whereas Ephem objects represent a bounded time history of a trajectory. This is because the equations for the two-body motion are known, whereas an ephemeris is either an observation or a prediction that cannot be extrapolated in any case without external knowledge. As such, Orbit objects have a .propagate method, but Ephem ones do not. This prevents users from attempting to propagate the position of the planets, which will always yield poor results compared to the excellent ephemerides calculated by external entities.

    Finally, both types have methods to convert between them:

   •    Ephem.from_orbit is the equivalent of sampling a two-body motion over a given time interval. As explained above, the resulting Ephem loses the information about the original attractor.
   •    Orbit.from_ephem is the equivalent of calculating the osculating orbit at a certain point of a trajectory, assuming a given attractor. The resulting Orbit loses the information about the original, potentially complex trajectory.

Orbit propagation

Orbit objects have a .propagate method that takes an elapsed time and returns another Orbit with new orbital elements and an updated epoch:

>>> from poliastro.examples import iss
>>> iss
6772 x 6790 km x 51.6 deg (GCRS) ...
>>>
<Quantity 46.59580468 deg>

>>> orb_mars                                                            >>> iss_30m = iss.propagate(30 << u.min)
1 x 2 AU x 1.9 deg (HCRS) orbit
around Sun (X) at epoch J2000.000 (TT)                                  >>> (iss_30m.epoch - iss.epoch).datetime
Similarly, Ephem objects can be created using a variety of class-
methods as well. Thanks to astropy.coordinates built-in                 >>> ( -
                                                                        <Quantity 116.54513153 deg>
low-fidelity ephemerides, as well as its capability to remotely
                                                       The default propagation algorithm is an analytical procedure
access the JPL HORIZONS system, the user can seamlessly build
                                                       described in [FCM13] that works seamlessly in the near parabolic
an object that contains the time history of the position of any Solar
System body:                                           region. In addition, poliastro implements analytical propagation
from astropy.time import Time                          algorithms as described in [DB83], [OG86], [Mar95], [Mik87],
from astropy.coordinates import solar_system_ephemeris [PP13], [Cha22], and [VM07].
140                                                                                      PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)
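For elliptic orbits, the kind of analytical propagation just described boils down to solving Kepler's equation, M = E − e·sin E, for the eccentric anomaly. The following self-contained sketch is independent of poliastro: the gravitational parameter and the anomaly-conversion helpers are our own assumptions, not the library's API. It reproduces the ISS figures from the REPL session above to within a fraction of a degree:

```python
import math

MU_EARTH = 398600.4418  # km^3/s^2, assumed GM of the Earth


def solve_kepler(M, ecc, tol=1e-12):
    """Solve Kepler's equation M = E - ecc*sin(E) for E by Newton iteration."""
    E = M
    for _ in range(50):
        dE = (E - ecc * math.sin(E) - M) / (1.0 - ecc * math.cos(E))
        E -= dE
        if abs(dE) < tol:
            break
    return E


def true_anomaly_after(a, ecc, nu0, dt, k=MU_EARTH):
    """Advance the true anomaly of an elliptic orbit by dt seconds."""
    # true -> eccentric -> mean anomaly at the initial epoch
    E0 = 2.0 * math.atan2(
        math.sqrt(1.0 - ecc) * math.sin(nu0 / 2.0),
        math.sqrt(1.0 + ecc) * math.cos(nu0 / 2.0),
    )
    M0 = E0 - ecc * math.sin(E0)
    # advance the mean anomaly with the mean motion n = sqrt(k / a^3)
    M1 = M0 + math.sqrt(k / a**3) * dt
    # mean -> eccentric -> true anomaly at the final epoch
    E1 = solve_kepler(M1, ecc)
    return 2.0 * math.atan2(
        math.sqrt(1.0 + ecc) * math.sin(E1 / 2.0),
        math.sqrt(1.0 - ecc) * math.cos(E1 / 2.0),
    )


# ISS-like orbit, with radii taken from the 6772 x 6790 km repr above
r_p, r_a = 6772.0, 6790.0
a = (r_p + r_a) / 2.0
ecc = (r_a - r_p) / (r_a + r_p)
nu0 = math.radians(46.595805)
nu1 = true_anomaly_after(a, ecc, nu0, 30.0 * 60.0)
```

The swept true anomaly comes out close to the ~116.5 deg reported by .propagate; the residual difference stems from the exact constants the library uses.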

Fig. 2: Osculating (Keplerian) vs perturbed (true) orbit (source:
Wikipedia, CC BY-SA 3.0)

Natural perturbations

As showcased in Figure 2, at any point in a trajectory we can define
an ideal Keplerian orbit with the same position and velocity under
the attraction of a point mass: this is called the osculating orbit.
Some numerical propagation methods exist that model the true,
perturbed orbit as a deviation from an evolving, osculating orbit.
poliastro implements Cowell's method [CC10], which consists of adding
all the perturbation accelerations and then integrating the resulting
differential equation with any numerical method of choice:

                  d²r/dt² = −(μ/r³) r + a_d                      (6)

The resulting equation is usually integrated using high-order
numerical methods, since the integration times are quite large and
the tolerances comparatively tight. An in-depth discussion of such
methods can be found in [HNW09]. poliastro uses Dormand-Prince
8(5,3) (DOP853), a commonly used method available in SciPy
[HMvdW+20].

    There are several natural perturbations included: the J2 and J3
gravitational terms, several atmospheric drag models (exponential,
[Jac77], [AAAA62], [AAA+76]), and helpers for third-body
gravitational attraction and radiation pressure as described in [?].

@njit
def combined_a_d(
    t0, state, k, j2, r_eq, c_d, a_over_m, h0, rho0
):
    return (
        J2_perturbation(
            t0, state, k, j2, r_eq
        ) + atmospheric_drag_exponential(
            t0, state, k, r_eq, c_d, a_over_m, h0, rho0
        )
    )


def f(t0, state, k):
    du_kep = func_twobody(t0, state, k)
    ax, ay, az = combined_a_d(
        t0,
        state,
        k,
        j2=Earth.J2.value,
        r_eq=R,
        c_d=C_D,
        a_over_m=A_over_m,
        h0=H0,
        rho0=rho0,
    )
    du_ad = np.array([0, 0, 0, ax, ay, az])

    return du_kep + du_ad

Continuous thrust control laws

Beyond natural perturbations, spacecraft can modify their trajectory
on purpose by using impulsive maneuvers (as explained in the next
section) as well as continuous thrust guidance laws. The user can
define custom guidance laws by providing a perturbation acceleration
in the same way natural perturbations are used. In addition,
poliastro includes several analytical solutions for continuous
thrust guidance laws with specific purposes, as studied in [CR17]:
optimal transfer between circular coplanar orbits [Ede61] [Bur67],
optimal transfer between circular inclined orbits [Ede61] [Kec97],
quasi-optimal eccentricity-only change [Pol97], simultaneous
eccentricity and inclination change [Pol00], and argument of
periapsis adjustment [Pol98]. A much more rigorous analysis of a
similar set of laws can be found in [DCV21].

from poliastro.twobody.thrust import change_ecc_inc

ecc_f = 0.0 << u.one
inc_f = 20.0 << u.deg
f = 2.4e-6 << (u.km / u.s**2)

a_d, _, t_f = change_ecc_inc(orbit, ecc_f, inc_f, f)

Impulsive maneuvers

Impulsive maneuvers are modeled considering a change in the velocity
of a spacecraft while its position remains fixed. The
poliastro.maneuver.Maneuver class provides various constructors to
instantiate popular impulsive maneuvers in the framework of the
non-perturbed two-body problem:

   •   Maneuver.impulse
   •   Maneuver.hohmann
   •   Maneuver.bielliptic
   •   Maneuver.lambert

from poliastro.maneuver import Maneuver

orb_i = Orbit.circular(Earth, alt=700 << u.km)
hoh = Maneuver.hohmann(orb_i, r_f=36000 << u.km)

Once instantiated, Maneuver objects provide information regarding
total Δv and Δt:

>>> hoh.get_total_cost()
<Quantity 3.6173981270031357 km / s>

>>> hoh.get_total_time()
<Quantity 15729.741535747102 s>

Maneuver objects can be applied to Orbit instances using the
apply_maneuver method:

>>> orb_i
7078 x 7078 km x 0.0 deg (GCRS) orbit
around Earth (♁)
>>> orb_f = orb_i.apply_maneuver(hoh)
>>> orb_f
36000 x 36000 km x 0.0 deg (GCRS) orbit
around Earth (♁)
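The Δv and transfer-time figures quoted for the Hohmann maneuver can be cross-checked against the textbook vis-viva formulas. The sketch below is independent of poliastro; the gravitational parameter and the 7078 km initial radius are assumptions read off the REPL output, not the library's constants:

```python
import math

MU_EARTH = 398600.4418  # km^3/s^2, assumed GM of the Earth


def hohmann(r_i, r_f, k=MU_EARTH):
    """Total delta-v (km/s) and transfer time (s) of a Hohmann transfer
    between circular, coplanar orbits of radii r_i and r_f (km)."""
    a_t = (r_i + r_f) / 2.0  # semi-major axis of the transfer ellipse
    v_i = math.sqrt(k / r_i)  # initial circular speed
    v_f = math.sqrt(k / r_f)  # final circular speed
    # vis-viva speeds at both ends of the transfer ellipse
    v_per = math.sqrt(k * (2.0 / r_i - 1.0 / a_t))
    v_apo = math.sqrt(k * (2.0 / r_f - 1.0 / a_t))
    dv = abs(v_per - v_i) + abs(v_f - v_apo)
    t_trans = math.pi * math.sqrt(a_t**3 / k)  # half an orbital period
    return dv, t_trans


# Same geometry as the Maneuver.hohmann example above:
# ~7078 km circular initial orbit, 36000 km final radius
dv, t_trans = hohmann(7078.0, 36000.0)
```

The results land within a few 1e-4 km/s and a fraction of a second of the get_total_cost and get_total_time values shown above; the tiny gap comes from the exact Earth radius and μ the library uses.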
POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS                                                                                                                                                         141

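The Cowell formulation of equation (6) can also be exercised with a toy fixed-step integrator. The classical RK4 scheme below is only a stand-in for the DOP853 method the library actually uses, and all constants are assumed values; an unperturbed circular orbit must close on itself after one period:

```python
import math

MU_EARTH = 398600.4418  # km^3/s^2, assumed GM of the Earth


def two_body_rhs(state, a_d=(0.0, 0.0, 0.0), k=MU_EARTH):
    """Right-hand side of equation (6): Keplerian attraction
    plus an extra perturbing acceleration a_d (Cowell's approach)."""
    x, y, z, vx, vy, vz = state
    r3 = (x * x + y * y + z * z) ** 1.5
    return [
        vx,
        vy,
        vz,
        -k * x / r3 + a_d[0],
        -k * y / r3 + a_d[1],
        -k * z / r3 + a_d[2],
    ]


def rk4_step(f, state, h):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f([s + 0.5 * h * d for s, d in zip(state, k1)])
    k3 = f([s + 0.5 * h * d for s, d in zip(state, k2)])
    k4 = f([s + h * d for s, d in zip(state, k3)])
    return [
        s + h / 6.0 * (p + 2 * q + 2 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    ]


# Circular orbit at an arbitrary 7000 km radius, in the xy-plane
r0 = 7000.0  # km
v0 = math.sqrt(MU_EARTH / r0)  # circular speed, km/s
period = 2.0 * math.pi * math.sqrt(r0**3 / MU_EARTH)

state = [r0, 0.0, 0.0, 0.0, v0, 0.0]
n_steps = 2000
h = period / n_steps
for _ in range(n_steps):
    state = rk4_step(two_body_rhs, state, h)

# Distance between the final and initial positions, in km
closure_error = math.dist(state[:3], [r0, 0.0, 0.0])
```

Passing a nonzero a_d (for instance a J2 acceleration) to two_body_rhs is exactly the "sum of perturbations" step that the combined_a_d example above performs with poliastro's compiled helpers.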
Targeting

Targeting is the problem of finding the orbit connecting two
positions over a finite amount of time. Within the context of the
non-perturbed two-body problem, targeting is just a matter of
solving the boundary value problem, also known as Lambert's problem.
Because targeting tries to find an orbit, the problem is included in
the Initial Orbit Determination field.

    The poliastro.iod package contains the izzo and vallado modules.
These provide a lambert function for solving the targeting problem.
Nevertheless, a Maneuver.lambert constructor is also provided so
users can keep taking advantage of Orbit objects.

# Declare departure and arrival datetimes
date_launch = time.Time(
    '2011-11-26 15:02', scale='tdb'
)
date_arrival = time.Time(
    '2012-08-06 05:17', scale='tdb'
)

# Define initial and final orbits
orb_earth = Orbit.from_ephem(
    Sun, Ephem.from_body(Earth, date_launch),
    date_launch
)
orb_mars = Orbit.from_ephem(
    Sun, Ephem.from_body(Mars, date_arrival),
    date_arrival
)

# Compute targeting maneuver and apply it
man_lambert = Maneuver.lambert(orb_earth, orb_mars)
orb_trans, orb_target = orb_earth.apply_maneuver(
    man_lambert, intermediate=True
)

Targeting is closely related to quick mission design by means of
porkchop diagrams. These are contour plots showing all combinations
of departure and arrival dates with the specific energy of each
transfer orbit. They allow for quick identification of the optimal
transfer dates between two bodies.

    The poliastro.plotting.porkchop module provides the
PorkchopPlotter class, which allows the user to generate these
diagrams.

from poliastro.plotting.porkchop import (
    PorkchopPlotter
)
from poliastro.util import time_range

# Generate all launch and arrival dates
launch_span = time_range(
    "2020-03-01", end="2020-10-01", periods=150
)
arrival_span = time_range(
    "2020-10-01", end="2021-05-01", periods=150
)

# Create an instance of the porkchop and plot it
porkchop = PorkchopPlotter(
    Earth, Mars, launch_span, arrival_span
)

Previous code, with some additional customization, generates
figure 3.

Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy (C3)
showing latest missions to the Martian planet.

Plotting

For visualization purposes, poliastro provides the
poliastro.plotting package, which contains various utilities for
generating 2D and 3D graphics using different backends such as
matplotlib [Hun07] and Plotly [Inc15].

    Generated graphics can be static or interactive. The main
difference between these two is the ability to modify the camera
view in a dynamic way when using interactive plotters.

    The most important classes in the poliastro.plotting package are
StaticOrbitPlotter and OrbitPlotter3D. In addition, the
poliastro.plotting.misc module contains the plot_solar_system
function, which allows the user to visualize the inner and outer
Solar System both in 2D and 3D, as requested by users.

    The following example illustrates the plotting capabilities of
poliastro. First, the orbits to be plotted are computed and their
plotting styles are declared:

from astropy.time import Time

from poliastro.plotting.misc import plot_solar_system

# Current datetime
now = Time.now()

# Obtain Florence and Halley orbits
florence = Orbit.from_sbdb("Florence")
halley_1835_ephem = Ephem.from_horizons(
    "90000031", now
)
halley_1835 = Orbit.from_ephem(
    Sun, halley_1835_ephem, halley_1835_ephem.epochs[0]
)

# Define orbit labels and color style
florence_style = {"label": "Florence", "color": "#000000"}
halley_style = {"label": "Halley", "color": "#84B0B8"}

The static two-dimensional plot can be created using the following
code:

# Generate a static 2D figure
frame2D = plot_solar_system(
    epoch=now, outer=False
)
frame2D.plot(florence, **florence_style)
frame2D.plot(halley_1835, **halley_style)

As a result, figure 4 is obtained.

    The interactive three-dimensional plot can be created using the
following code:

Fig. 4: Two-dimensional view of the inner Solar System, Florence,
and Halley.

# Generate an interactive 3D figure
frame3D = plot_solar_system(
    epoch=now, outer=False,
    use_3d=True, interactive=True
)
frame3D.plot(florence, **florence_style)
frame3D.plot(halley_1835, **halley_style)

As a result, figure 5 is obtained.

Fig. 5: Three-dimensional view of the inner Solar System, Florence,
and Halley.

Commercial Earth satellites

Figure 6 gives a clear picture of the most important natural
perturbations affecting satellites in LEO, namely: the first
harmonic of the geopotential field J2 (representing the attractor
oblateness), the atmospheric drag, and the higher-order harmonics of
the geopotential field.

Fig. 6: Natural perturbations affecting Low-Earth Orbit (LEO) motion
(source: [VM07])

    At least the most significant of these perturbations need to be
taken into account when propagating LEO orbits, and therefore the
methods for purely Keplerian motion are not enough. As seen above,
poliastro already implements a number of these perturbations;
however, numerical methods are much slower than analytical ones, and
this can render them unsuitable for large-scale simulations,
satellite conjunction assessment, propagation on constrained
hardware, and so forth.

    To address this issue, semianalytical propagation methods were
devised that attempt to strike a balance between the fast running
times of analytical methods and the necessary inclusion of
perturbation forces. One such family of semianalytical methods is
the Simplified General Perturbation (SGP) models, first developed in
[HK66] and then refined in [LC69] into what we know these days as
the SGP4 propagator [HR80] [VCHK06]. Even though certain elements of
the reference frame used by SGP4 are not properly specified [VCHK06]
and its accuracy might still be too limited for certain applications
[Ko09] [Lar16], it is nowadays the most widely used propagation
method, thanks in large part to the dissemination of General
Perturbations orbital data by the US 501(c)(3) CelesTrak (which
itself obtains it from the 18th Space Defense Squadron of the US
Space Force).

    The starting point of SGP4 is a special element set that uses
Brouwer mean orbital elements [Bro59] plus a ballistic coefficient
based on an approximation of the atmospheric drag [LC69], and its
results are expressed in a special coordinate system called True
Equator Mean Equinox (TEME). Special care needs to be taken to avoid
mixing mean elements with osculating elements, and to convert the
output of the propagation to the appropriate reference frame. These
element sets have traditionally been distributed in a compact text
representation called Two-Line Element sets (TLEs) (see Figure 7 for
an example). However, this format is quite cryptic and suffers from
a number of shortcomings, so recently there has been a push to use
the Orbit Data Messages international standard developed by the
Consultative Committee for Space Data Systems (CCSDS 502.0-B-2).

1 25544U 98067A   22156.15037205 .00008547 00000+0 15823-3 0 9994
2 25544 51.6449   36.2070 0004577 196.3587 298.4146 15.49876730343319

Fig. 7: Two-Line Element set (TLE) for the ISS (retrieved on
2022-06-05)

    At the moment, general perturbations data both in OMM and TLE
format can be integrated with poliastro thanks to the sgp4 Python
library and the Ephem class as follows:

from warnings import warn

from astropy import units as u
from astropy.coordinates import (
    TEME, GCRS,
    CartesianRepresentation,
    CartesianDifferential,
)

from poliastro.ephem import Ephem
from poliastro.frames import Planes


def ephem_from_gp(sat, times):
    errors, rs, vs = sat.sgp4_array(times.jd1, times.jd2)
    if not (errors == 0).all():
        warn(
            "Some objects could not be propagated, "

               "proceeding with the rest",                              do not want to use some of the higher level poliastro abstractions
               stacklevel=2,                                            or drag its large number of heavy dependencies.
          rs = rs[errors == 0]
                                                                            Finally, the sustainability of the project cannot yet be taken for
          vs = vs[errors == 0]                                          granted: the project has reached a level of complexity that already
          times = times[errors == 0]                                    warrants dedicated development effort that cannot be covered with
                                                                        short-lived grants. Such funding could potentially come from the
     cart_teme = CartesianRepresentation(
         rs <<,                                                    private sector, but although there is evidence that several for-profit
         xyz_axis=-1,                                                   companies are using poliastro, we have very little information of
         differentials=CartesianDifferential(                           how is it being used and what problems are those users having,
             vs << ( / u.s),
                                                                        let alone what avenues for funded work could potentially work.
         ),                                                             Organizations like the Libre Space Foundation advocate for a
     )                                                                  strong copyleft licensing model to convince commercial actors to
     cart_gcrs = (                                                      contribute to the commons, but in principle that goes against the
         TEME(cart_teme, obstime=times)
         .transform_to(GCRS(obstime=times))                             permissive licensing that the wider Scientific Python ecosystem,
         .cartesian                                                     including poliastro, has adopted. With the advent of new business
     )                                                                  models and the ever increasing reliance in open source by the
                                                                        private sector, a variety of ways to engage commercial users and
     return Ephem(
         cart_gcrs,                                                     include them in the conversation exist. However, these have not
         times,                                                         been explored yet.
     )                                                                  Acknowledgements
However, no native integration with SGP4 has been implemented           The authors would like to thank Prof. Michèle Lavagna for her
yet in poliastro, for technical and non-technical reasons. On one       original guidance and inspiration, David A. Vallado for his en-
hand, this propagator is too different from the other methods, and      couragement and for publishing the source code for the algorithms
we have not yet devised how to add it to the library in a way           from his book for free, Dr. T.S. Kelso for his tireless efforts in
that does not create confusion. On the other hand, adding such          maintaining CelesTrak, Alejandro Sáez for sharing the dream of
a propagator to poliastro would probably open the flood gates of        a better way, Prof. Dr. Manuel Sanjurjo Rivo for believing in my
corporate users of the library, and we would like to first devise       work, Helge Eichhorn for his enthusiasm and decisive influence
a sustainability strategy for the project, which is addressed in the    in poliastro, the whole OpenAstronomy collaboration for opening
next section.                                                           the door for us, the NumFOCUS organization for their immense
                                                                        support, and Alexandra Elbakyan for enabling scientific progress
Future work
Despite the fact that poliastro has existed for almost a decade, for    R EFERENCES
most of its history it has been developed by volunteers on their        [AAA+ 76]   United States Committee on Extension to the Standard At-
free time, and only in the past five years it has received funding                  mosphere, United States National Aeronautics, Space Ad-
through various Summer of Code programs (SOCIS 2017, GSOC                           ministration, United States National Oceanic, Atmospheric
                                                                                    Administration, and United States Air Force. U.S. Stan-
2018-2021) and institutional grants (NumFOCUS 2020, 2021).                          dard Atmosphere, 1976. NOAA - SIT 76-1562. National
The funded work has had an overwhemingly positive impact on                         Oceanic and Amospheric [sic] Administration, 1976. URL:
the project, however the lack of a dedicated maintainer has caused        
                                                                        [AAAA62]    United States Committee on Extension to the Standard At-
some technical debt to accrue over the years, and some parts of
                                                                                    mosphere, United States National Aeronautics, Space Admin-
the project are in need of refactoring or better documentation.                     istration, and United States Environmental Science Services
    Historically, poliastro has tried to implement algorithms that are applicable to all the planets in the Solar System; however, some of them have proved very difficult to generalize for bodies other than the Earth. For cases like these, poliastro ships an Earth-specific package, but going forward we would like to continue embracing a generic approach that can serve other bodies as well.
    Several open source projects have successfully used poliastro or were created taking inspiration from it, like spacetech-ssa by IBM or mubody [BBVPFSC22]. AGI (previously Analytical Graphics, Inc., now Ansys Government Initiatives) published a series of scripts to automate the commercial tool STK from Python leveraging poliastro. However, we have observed that there is still a great deal of repeated code across similar open source libraries written in Python, which means there is an opportunity to provide a "kernel" of algorithms that can be easily reused. Although poliastro.core started as a separate layer to isolate fast, non-safe functions as described above, we think we could move it to an external package so that it can be depended upon by other projects.
            Administration. U.S. Standard Atmosphere, 1962: ICAO Standard Atmosphere to 20 Kilometers; Proposed ICAO Extension to 32 Kilometers; Tables and Data to 700 Kilometers. U.S. Government Printing Office, 1962.
[Bat99]     Richard H. Battin. An Introduction to the Mathematics and Methods of Astrodynamics, Revised Edition. American Institute of Aeronautics and Astronautics, Inc., Reston, VA, January 1999. doi:10.2514/4.861543.
[BBC+11]    Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(2):31–39, March 2011. doi:10.1109/MCSE.2010.118.
[BBVPFSC22] Juan Bermejo Ballesteros, José María Vergara Pérez, Alejandro Fernández Soler, and Javier Cubas. Mubody, an astrodynamics open-source Python library focused on libration points. Barcelona, Spain, April 2022.
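The layered design described above — fast, validation-free "core" functions wrapped by a safer high-level API — can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical and do not reproduce poliastro's actual API; the low-level function takes plain floats (the kind of function that is straightforward to JIT-compile, e.g. with numba, and to reuse from other projects), while the high-level wrapper performs validation and attaches unit information (shown here with plain strings; poliastro itself uses astropy.units).

```python
import math

# Low-level "core" layer: plain floats, no validation, no units.
# Hypothetical name, not the actual poliastro.core API.
def mean_motion_core(k, a):
    """Mean motion (rad/s) from gravitational parameter k (km^3/s^2)
    and semi-major axis a (km): n = sqrt(k / a**3)."""
    return math.sqrt(k / a**3)

# High-level layer: validates input and reports the unit explicitly.
def mean_motion(k_km3_s2, a_km):
    if a_km <= 0:
        raise ValueError("semi-major axis must be positive")
    return mean_motion_core(k_km3_s2, a_km), "rad / s"

# Example: Earth's gravitational parameter and a 7000 km orbit.
n, unit = mean_motion(398600.4418, 7000.0)
```

Keeping the core layer free of unit handling and input checks is what makes it cheap to depend on from other projects: downstream code can call the raw function in tight numerical loops, while end users get the safer wrapper.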
144                                                                                             PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

[BEKS17]    Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A Fresh Approach to Numerical Computing. SIAM Review, 59(1):65–98, January 2017. doi:10.1137/141000671.
[Bro59]     Dirk Brouwer. Solution of the problem of artificial satellite theory without drag. The Astronomical Journal, 64:378, November 1959. doi:10.1086/107958.
[Bur67]     E. G. C. Burt. On space manoeuvres with continuous thrust. Planetary and Space Science, 15(1):103–122, January 1967. doi:10.1016/0032-0633(67)90070-0.
[CC10]      Philip Herbert Cowell and Andrew Claude Crommelin. Investigation of the Motion of Halley's Comet from 1759 to 1910. Neill & Company, Limited, 1910.
[Cha22]     Kevin Charls. Recursive solution to Kepler's problem for elliptical orbits - application in robust Newton-Raphson and co-planar closest approach estimation. Unpublished, version 1, 2022. doi:10.13140/RG.2.2.18578.58563/1.
[Con14]     Bruce A. Conway. Spacecraft trajectory optimization. Number 29 in Cambridge Aerospace Series. Cambridge University Press, Cambridge (GB), 2014.
[CR17]      Juan Luis Cano Rodríguez. Study of analytical solutions for low-thrust trajectories. Master's thesis, Universidad Politécnica de Madrid, March 2017.
[DB83]      J. M. A. Danby and T. M. Burkardt. The solution of Kepler's equation, I. Celestial Mechanics, 31(2):95–107, October 1983. doi:10.1007/BF01686811.
[DCV21]     Marilena Di Carlo and Massimiliano Vasile. Analytical solutions for low-thrust orbit transfers. Celestial Mechanics and Dynamical Astronomy, 133(7):33, July 2021. doi:10.1007/s10569-021-10033-9.
[Dro53]     S. Drobot. On the foundations of Dimensional Analysis. Studia Mathematica, 14(1):84–99, 1953. doi:10.4064/sm-14-1-84-99.
[Dub73]     G. N. Duboshin. Book Review: Samuel Herrick, Astrodynamics. Soviet Astronomy, 16:1064, June 1973. ADS Bibcode: 1973SvA....16.1064D.
[Ede61]     Theodore N. Edelbaum. Propulsion Requirements for Controllable Satellites. ARS Journal, 31(8):1079–1089, August 1961. doi:10.2514/8.5723.
[FCM13]     Davide Farnocchia, Davide Bracali Cioci, and Andrea Milani. Robust resolution of Kepler's equation in all eccentricity regimes. Celestial Mechanics and Dynamical Astronomy, 116(1):21–34, May 2013. doi:10.1007/s10569-013-9476-9.
[Fin07]     D. Finkleman. "TLE or Not TLE?" That is the Question (AAS 07-126). Advances in the Astronautical Sciences, 127(1):401, 2007. Published for the American Astronautical Society by Univelt.
[GBP22]     Mirko Gabelica, Ružica Bojčić, and Livia Puljak. Many researchers were not compliant with their published data sharing statement: mixed-methods study. Journal of Clinical Epidemiology, article S089543562200141X, May 2022. doi:10.1016/j.jclinepi.2022.05.019.
[Her71]     Samuel Herrick. Astrodynamics. Van Nostrand Reinhold Co, London, New York, 1971.
[HK66]      C. G. Hilton and J. R. Kuhlman. Mathematical models for the space defense center. Philco-Ford Publication No. U-3871, 17:28, 1966.
[HMvdW+20]  Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[HNW09]     E. Hairer, S. P. Nørsett, and Gerhard Wanner. Solving ordinary differential equations I: nonstiff problems. Number 8 in Springer Series in Computational Mathematics. Springer, Heidelberg; London, 2nd rev. ed. edition, 2009. OCLC: ocn620251790.
[HR80]      Felix R. Hoots and Ronald L. Roehrich. Models for propagation of NORAD element sets. Technical report, Defense Technical Information Center, Fort Belvoir, VA, December 1980.
[Hun07]     J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[IBD+20]    Dario Izzo, Will Binns, Dariomm098, Alessio Mereta, Christopher Iliffe Sprague, Dhennes, Bert Van Den Abbeele, Chris Andre, Krzysztof Nowak, Nat Guy, Alberto Isaac Barquín Murguía, Pablo, Frédéric Chapoton, GiacomoAcciarini, Moritz V. Looz, Dietmarwo, Mike Heddes, Anatoli Babenia, Baptiste Fournier, Johannes Simon, Jonathan Willitts, Mateusz Polnik, Sanjeev Narayanaswamy, The Gitter Badger, and Jack Yarndley. esa/pykep: Optimize, October 2020. doi:10.5281/ZENODO.4091753.
[Inc15]     Plotly Technologies Inc. Collaborative data science, 2015. Montreal, QC: Plotly Technologies Inc.
[Jac77]     L. G. Jacchia. Thermospheric Temperature, Density, and Composition: New Models. SAO Special Report, 375, March 1977. ADS Bibcode: 1977SAOSR.375.....J.
[JGAZJT+18] Nathan J. Goldbaum, John A. ZuHone, Matthew J. Turk, Kacper Kowalik, and Anna L. Rosen. unyt: Handle, manipulate, and convert data with units in Python. Journal of Open Source Software, 3(28):809, August 2018. doi:10.21105/joss.00809.
[Kec97]     Jean Albert Kechichian. Reformulation of Edelbaum's Low-Thrust Transfer Problem Using Optimal Control Theory. Journal of Guidance, Control, and Dynamics, 20(5):988–994, September 1997. doi:10.2514/2.4145.
[Ko09]      T. S. Kelso and others. Analysis of the Iridium 33-Cosmos 2251 collision. Advances in the Astronautical Sciences, 135(2):1099–1112, 2009.
[KRKP+16]   Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E. Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B. Hamrick, Jason Grout, Sylvain Corlay, and others. Jupyter Notebooks - a publishing format for reproducible computational workflows. 2016.
[Lar16]     Martin Lara. Analytical and Semianalytical Propagation of Space Orbits: The Role of Polar-Nodal Variables. In Gerard Gómez and Josep J. Masdemont, editors, Astrodynamics Network AstroNet-II, volume 44 of Astrophysics and Space Science Proceedings, pages 151–166. Springer International Publishing, Cham, 2016. doi:10.1007/978-3-319-23986-6_11.
[LC69]      M. H. Lane and K. Cranford. An improved analytical drag theory for the artificial satellite problem. In Astrodynamics Conference, Princeton, NJ, U.S.A., August 1969. American Institute of Aeronautics and Astronautics. doi:10.2514/6.1969-925.
[LPS15]     Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC - LLVM '15, pages 1–6, Austin, Texas, 2015. ACM Press. doi:10.1145/2833157.2833162.
[Mar95]     F. Landis Markley. Kepler Equation solver. Celestial Mechanics & Dynamical Astronomy, 63(1):101–111,
POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS                                                                                             145

            1995. doi:10.1007/BF00691917.
[Mik87]     Seppo Mikkola. A cubic approximation for Kepler's equation. Celestial Mechanics, 40(3-4):329–334, September 1987. doi:10.1007/BF01235850.
[MKDVB+19]  Michael Mommert, Michael Kelley, Miguel De Val-Borro, Jian-Yang Li, Giannina Guzman, Brigitta Sipőcz, Josef Ďurech, Mikael Granvik, Will Grundy, Nick Moskovitz, Antti Penttilä, and Nalin Samarasinha. sbpy: A Python module for small-body planetary astronomy. Journal of Open Source Software, 4(38):1426, June 2019. doi:10.21105/joss.01426.
[noa]       Astrodynamics.jl.
[noa18]     gpredict, January 2018. Version 2.2.1.
[noa20]     GMAT, July 2020. Release R2020a.
[noa21a]    nyx, November 2021. Version 1.0.0.
[noa21b]    SatNOGS, October 2021. Version 1.7.
[noa22a]    beyond, January 2022. Version 0.7.4.
[noa22b]    celestlab, January 2022. Version 3.4.1.
[noa22c]    Orekit, June 2022. Version 11.2.
[noa22d]    SPICE, January 2022.
[noa22e]    tudatpy, January 2022. Version 0.6.0.
[OG86]      A. W. Odell and R. H. Gooding. Procedures for solving Kepler's equation. Celestial Mechanics, 38(4):307–334, April 1986. doi:10.1007/BF01238923.
[Pol97]     James E. Pollard. Simplified approach for assessment of low-thrust elliptical orbit transfers. In 25th International Electric Propulsion Conference, Cleveland, OH, pages 97–160, 1997.
[Pol98]     James Pollard. Evaluation of low-thrust orbital maneuvers. In 34th AIAA/ASME/SAE/ASEE Joint Propulsion Conference and Exhibit, Cleveland, OH, U.S.A., July 1998. American Institute of Aeronautics and Astronautics. doi:10.2514/6.1998-3486.
[Pol00]     J. E. Pollard. Simplified analysis of low-thrust orbital maneuvers. Technical report, Defense Technical Information Center, Fort Belvoir, VA, August 2000.
[PP13]      Adonis Reinier Pimienta-Penalver. Accurate Kepler equation solver without transcendental function evaluations. State University of New York at Buffalo, 2013.
[Rho20]     Brandon Rhodes. Skyfield: Generate high precision research-grade positions for stars, planets, moons, and Earth satellites, February 2020.
[SSM18]     Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11):2584–2589, March 2018. doi:10.1073/pnas.1708290115.
[TPWS+18]   The Astropy Collaboration, A. M. Price-Whelan, B. M. Sipőcz, H. M. Günther, P. L. Lim, S. M. Crawford, S. Conseil, D. L. Shupe, M. W. Craig, N. Dencheva, A. Ginsburg, J. T. VanderPlas, L. D. Bradley, D. Pérez-Suárez, M. de Val-Borro, (Primary Paper Contributors), T. L. Aldcroft, K. L. Cruz, T. P. Robitaille, E. J. Tollerud, (Astropy Coordination Committee), C. Ardelean, T. Babej, Y. P. Bach, M. Bachetti, A. V. Bakanov, S. P. Bamford, G. Barentsen, P. Barmby, A. Baumbach, K. L. Berry, F. Biscani, M. Boquien, K. A. Bostroem, L. G. Bouma, G. B. Brammer, E. M. Bray, H. Breytenbach, H. Buddelmeijer, D. J. Burke, G. Calderone, J. L. Cano Rodríguez, M. Cara, J. V. M. Cardoso, S. Cheedella, Y. Copin, L. Corrales, D. Crichton, D. D'Avella, C. Deil, É. Depagne, J. P. Dietrich, A. Donath, M. Droettboom, N. Earl, T. Erben, S. Fabbro, L. A. Ferreira, T. Finethy, R. T. Fox, L. H. Garrison, S. L. J. Gibbons, D. A. Goldstein, R. Gommers, J. P. Greco, P. Greenfield, A. M. Groener, F. Grollier, A. Hagen, P. Hirst, D. Homeier, A. J. Horton, G. Hosseinzadeh, L. Hu, J. S. Hunkeler, Ž. Ivezić, A. Jain, T. Jenness, G. Kanarek, S. Kendrew, N. S. Kern, W. E. Kerzendorf, A. Khvalko, J. King, D. Kirkby, A. M. Kulkarni, A. Kumar, A. Lee, D. Lenz, S. P. Littlefair, Z. Ma, D. M. Macleod, M. Mastropietro, C. McCully, S. Montagnac, B. M. Morris, M. Mueller, S. J. Mumford, D. Muna, N. A. Murphy, S. Nelson, G. H. Nguyen, J. P. Ninan, M. Nöthe, S. Ogaz, S. Oh, J. K. Parejko, N. Parley, S. Pascual, R. Patil, A. A. Patil, A. L. Plunkett, J. X. Prochaska, T. Rastogi, V. Reddy Janga, J. Sabater, P. Sakurikar, M. Seifert, L. E. Sherbert, H. Sherwood-Taylor, A. Y. Shih, J. Sick, M. T. Silbiger, S. Singanamalla, L. P. Singer, P. H. Sladen, K. A. Sooley, S. Sornarajah, O. Streicher, P. Teuben, S. W. Thomas, G. R. Tremblay, J. E. H. Turner, V. Terrón, M. H. van Kerkwijk, A. de la Vega, L. L. Watkins, B. A. Weaver, J. B. Whitmore, J. Woillez, V. Zabalza, and (Astropy Contributors). The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package. The Astronomical Journal, 156(3):123, August 2018. doi:10.3847/1538-3881/aabc4f.
[VCHK06]    David Vallado, Paul Crawford, Richard Hujsak, and T. S. Kelso. Revisiting Spacetrack Report #3. In AIAA/AAS Astrodynamics Specialist Conference and Exhibit, Keystone, Colorado, August 2006. American Institute of Aeronautics and Astronautics. doi:10.2514/6.2006-6753.
[VGO+20]    Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, SciPy 1.0 Contributors, Aditya Vijaykumar, Alessandro Pietro Bardelli, Alex Rothberg, Andreas Hilboll, Andreas Kloeckner, Anthony Scopatz, Antony Lee, Ariel Rokem, C. Nathan Woods, Chad Fulton, Charles Masson, Christian Häggström, Clark Fitzgerald, David A. Nicholson, David R. Hagen, Dmitrii V. Pasechnik, Emanuele Olivetti, Eric Martin, Eric Wieser, Fabrice Silva, Felix Lenders, Florian Wilhelm, G. Young, Gavin A. Price, Gert-Ludwig Ingold, Gregory E. Allen, Gregory R. Lee, Hervé Audren, Irvin Probst, Jörg P. Dietrich, Jacob Silterra, James T Webber, Janko Slavič, Joel Nothman, Johannes Buchner, Johannes Kulick, Johannes L. Schönberger, José Vinícius de Miranda Cardoso, Joscha Reimer, Joseph Harrington, Juan Luis Cano Rodríguez, Juan Nunez-Iglesias, Justin Kuczynski, Kevin Tritz, Martin Thoma, Matthew Newville, Matthias Kümmerer, Maximilian Bolingbroke, Michael Tartre, Mikhail Pak, Nathaniel J. Smith, Nikolai Nowaczyk, Nikolay Shebanov, Oleksandr Pavlyk, Per A. Brodtkorb, Perry Lee, Robert T. McGibbon, Roman Feldbauer, Sam Lewis, Sam Tygier, Scott Sievert, Sebastiano Vigna, Stefan Peterson, Surhud More, Tadeusz Pudlik, Takuya Oshima, Thomas J. Pingel, Thomas P. Robitaille, Thomas Spura, Thouis R. Jones, Tim Cera, Tim Leslie, Tiziano Zito, Tom Krauss, Utkarsh Upadhyay, Yaroslav O. Halchenko, and Yoshiki Vázquez-Baeza. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, March 2020. doi:10.1038/s41592-019-0686-2.
[VM07]      David A. Vallado and Wayne D. McClain. Fundamentals of astrodynamics and applications. Number 21 in Space Technology Library. Microcosm Press [u.a.], Hawthorne, Calif., 3rd ed., 1st printing edition, 2007.
[WAB+14]    Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Kathryn D. Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, and Paul Wilson. Best Practices for Scientific Computing. PLoS Biology, 12(1):e1001745, January 2014.

            doi:10.1371/journal.pbio.1001745.
[WIO85]     M. J. H. Walker, B. Ireland, and Joyce Owens. A set modified equinoctial orbit elements. Celestial Mechanics, 36(4):409–419, August 1985. doi:10.1007/BF01227493.

   A New Python API for Webots Robotics Simulations
                                                                   Justin C. Fisher‡∗


Abstract—Webots is a popular open-source package for 3D robotics simulations. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching, or games. Webots has provided a simple API allowing Python programs to control robots and/or the simulated world, but this API is inefficient and does not provide many "pythonic" conveniences. A new Python API for Webots is presented that is more efficient and provides a more intuitive, easily usable, and "pythonic" interface.

Index Terms—Webots, Python, Robotics, Robot Operating System (ROS), Open Dynamics Engine (ODE), 3D Physics Simulation

1. Introduction

Webots is a popular open-source package for 3D robotics simulations [Mic01], [Webots]. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching, or games. Webots uses the Open Dynamics Engine [ODE], which allows physical simulations of Newtonian bodies, collisions, joints, springs, friction, and fluid dynamics. Webots provides the means to simulate a wide variety of robot components, including motors, actuators, wheels, treads, grippers, light sensors, ultrasound sensors, pressure sensors, range finders, radar, lidar, and cameras (with many of these sensors drawing their inputs from GPU processing of the simulation). A typical simulation will involve one or more robots, each with somewhere between 3 and 30 moving parts (though more would be possible), each running its own controller program to process information taken in by its sensors to determine what control signals to send to its devices. A simulated world typically involves a ground surface (which may be

    In qualitative terms, the old API feels like one is awkwardly using Python to call C and C++ functions, whereas the new API feels much simpler, much easier, and like it is fully intended for Python. Here is a representative (but far from comprehensive) list of examples:

   •  Unlike the old API, the new API contains helpful Python type annotations and docstrings.
   •  Webots employs many vectors, e.g., for 3D positions, 4D rotations, and RGB colors. The old API typically treats these as lists or integers (24-bit colors). In the new API these are Vector objects, with conveniently addressable components (e.g. vector.x), convenient helper methods like vector.magnitude and vector.unit_vector, and overloaded vector arithmetic operations, akin to (and interoperable with) NumPy arrays.
   •  The new API also provides easy interfacing between high-resolution Webots sensors (like cameras and lidar) and NumPy arrays, to make it much more convenient to use Webots with popular Python packages like NumPy [NumPy], [Har01], SciPy [Scipy], [Vir01], PIL/Pillow [PIL], or OpenCV [OpenCV], [Brad01]. For example, converting a Webots camera image to a NumPy array is now as simple as camera.array, and this now allows the array to share memory with the camera, making this extremely fast regardless of image size.
   •  The old API often requires that all function parameters be