Authors: Chris Calloway, David Shupe, Dillon Niederhut, Meghann Agarwal
License CC-BY-3.0
Proceedings of the 21st Python in Science Conference
Edited by Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe.
SciPy 2022, Austin, Texas, July 11 - July 17, 2022

Copyright © 2022. The articles in the Proceedings of the Python in Science Conference are copyrighted and owned by their original authors. This is an open-access publication and is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For more information, please see: http://creativecommons.org/licenses/by/3.0/
ISSN: 2575-9752
https://doi.org/10.25080/majora-212e5952-046

ORGANIZATION

Conference Chairs: Jonathan Guyer, NIST; Alexandre Chabot-Leclerc, Enthought, Inc.
Program Chairs: Matt Haberland, Cal Poly; Julie Hollek, Mozilla; Madicken Munk, University of Illinois; Guen Prawiroatmodjo, Microsoft Corp
Communications: Arliss Collins, NumFOCUS; Matt Davis, Populus; David Nicholson, Embedded Intelligence
Birds of a Feather: Andrew Reid, NIST; Anastasiia Sarmakeeva, George Washington University
Proceedings: Meghann Agarwal, Overhaul; Chris Calloway, University of North Carolina; Dillon Niederhut, Novi Labs; David Shupe, Caltech's IPAC Astronomy Data Center
Financial Aid: Scott Collis, Argonne National Laboratory; Nadia Tahiri, Université de Montréal
Tutorials: Mike Hearne, USGS; Logan Thomas, Enthought, Inc.
Sprints: Tania Allard, Quansight Labs; Brigitta Sipőcz, Caltech/IPAC
Diversity: Celia Cintas, IBM Research Africa; Bonny P McClain, O'Reilly Media; Fatma Tarlaci, OpenTeams
Activities: Paul Anzel, Codecov; Inessa Pawson, Albus Code
Sponsors: Kristen Leiser, Enthought, Inc.
Financial: Chris Chan, Enthought, Inc.; Bill Cowan, Enthought, Inc.; Jodi Havranek, Enthought, Inc.
Logistics: Kristen Leiser, Enthought, Inc.

Proceedings Reviewers: Aileen Nielsen, Ajit Dhobale, Alejandro Coca-Castro, Alexander Yang, Bhupendra A Raut, Bradley Dice, Brian Gue, Cadiou Corentin, Carl Simon Adorf, Chen Zhang, Chiara Marmo, Chitaranjan Mahapatra, Chris Calloway, Daniel Wheeler, David Nicholson, David Shupe, Dillon Niederhut, Diptorup Deb, Jelena Milosevic, Michal Maciejewski, Ed Rogers, Himaghna Bhattacharjee, Hongsup Shin, Indraneil Paul, Ivan Marroquin, James Lamb, Jyh-Miin Lin, Jyotika Singh, Karthik Murugadoss, Kehinde Ajayi, Kelly L. Rowland, Kelvin Lee, Kevin Maik Jablonka, Kevin W. Beam, Kuntao Zhao, Maruthi NH, Matt Craig, Matthew Feickert, Meghann Agarwal, Melissa Weber Mendonça, Onuralp Soylemez, Rohit Goswami, Ryan Bunney, Shubham Sharma, Siddhartha Srivastava, Sushant More, Tetsuo Koyama, Thomas Nicholas, Victoria Adesoba, Vidhi Chugh, Vivek Sinha, Wenduo Zhou, Zuhal Cakir

ACCEPTED TALK SLIDES

Building Binary Extensions with pybind11, scikit-build, and cibuildwheel, Henry Schreiner, Joe Rickerby, Ralf Grosse-Kunstleve, Wenzel Jakob, Matthieu Darbois, Aaron Gokaslan, Jean-Christophe Fillion-Robin, and Matt McCormick. doi.org/10.25080/majora-212e5952-033
Python Development Schemes for Monte Carlo Neutronics on High Performance Computing, Jackson P. Morgan and Kyle E. Niemeyer. doi.org/10.25080/majora-212e5952-034
Awkward Packaging: Building Scikit-HEP, Henry Schreiner, Jim Pivarski, and Eduardo Rodrigues. doi.org/10.25080/majora-212e5952-035
Development of Accessible, Aesthetically-Pleasing Color Sequences, Matthew A. Petroff. doi.org/10.25080/majora-212e5952-036
Cutting Edge Climate Science in the Cloud with Pangeo, Julius Busecke. doi.org/10.25080/majora-212e5952-037
Pylira: Deconvolution of Images in the Presence of Poisson Noise, Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, and David van Dyk. doi.org/10.25080/majora-212e5952-038
Accelerating Science with the Generative Toolkit for Scientific Discovery (GT4SD), GT4SD team. doi.org/10.25080/majora-212e5952-039
MModel: A Modular Modeling Framework for Scientific Prototyping, Peter Sun and John A. Marohn. doi.org/10.25080/majora-212e5952-03a
Monaco: Quantify Uncertainty and Sensitivities in Your Computational Models with a Monte Carlo Library, W. Scott Shambaugh. doi.org/10.25080/majora-212e5952-03b
UFuncs and DTypes: New Possibilities in NumPy, Sebastian Berg and Stéfan van der Walt. doi.org/10.25080/majora-212e5952-03c
Per Python ad Astra: Interactive Astrodynamics with poliastro, Juan Luis Cano Rodríguez. doi.org/10.25080/majora-212e5952-03d
pyampute: A Python Library for Data Amputation, Rianne M Schouten, Davina Zamanzadeh, and Prabhant Singh. doi.org/10.25080/majora-212e5952-03e
Scientific Python: From GitHub to TikTok, Juanita Gomez Romero, Stéfan van der Walt, K. Jarrod Millman, Melissa Weber Mendonça, and Inessa Pawson. doi.org/10.25080/majora-212e5952-03f
Scientific Python: By Maintainers, For Maintainers, Pamphile T. Roy, Stéfan van der Walt, K. Jarrod Millman, and Melissa Weber Mendonça. doi.org/10.25080/majora-212e5952-040
Improving Random Sampling in Python: scipy.stats.sampling and scipy.stats.qmc, Pamphile T. Roy, Matt Haberland, Christoph Baumgarten, and Tirth Patel. doi.org/10.25080/majora-212e5952-041
Petabyte-Scale Ocean Data Analytics on Staggered Grids via the Grid UFunc Protocol in xGCM, Thomas Nicholas, Julius Busecke, and Ryan Abernathey. doi.org/10.25080/majora-212e5952-042

ACCEPTED POSTERS

Optimal Review Assignments for the SciPy Conference Using Binary Integer Linear Programming in SciPy 1.9, Matt Haberland and Nicholas McKibben. doi.org/10.25080/majora-212e5952-029
Contributing to Open Source Software: From Not Knowing Python to Becoming a Spyder Core Developer, Daniel Althviz Moré. doi.org/10.25080/majora-212e5952-02a
Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Image Labeling, Nathan Jessurun, Olivia P. Dizon-Paradis, Dan E. Capecci, Damon L. Woodard, and Navid Asadizanjani. doi.org/10.25080/majora-212e5952-02b
Bioframe: Operating on Genomic Interval Dataframes, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra Galitsyna, Anton Goloborodko, Maxim Imakaev, Trevor Manz, and Sergey V. Venev. doi.org/10.25080/majora-212e5952-02c
Likeness: A Toolkit for Connecting the Social Fabric of Place to Human Dynamics, Joseph V. Tuccillo and James D.
Gaboardi doi.org/10.25080/majora-212e5952-02d PYA UDIO P ROCESSING : A UDIO P ROCESSING , F EATURE E XTRACTION , AND M ACHINE L EARNING M ODELING , Jy- otika Singh doi.org/10.25080/majora-212e5952-02e K IWI : P YTHON T OOL FOR T EX P ROCESSING AND C LASSIFICATION, Neelima Pulagam, and Sai Marasani, and Brian Sass doi.org/10.25080/majora-212e5952-02f P HYLOGEOGRAPHY: A NALYSIS OF GENETIC AND CLIMATIC DATA OF SARS-C O V-2, Wanlin Li, and Aleksandr Koshkarov, and My-Linh Luu, and Nadia Tahiri doi.org/10.25080/majora-212e5952-030 D ESIGN OF A S CIENTIFIC D ATA A NALYSIS S UPPORT P LATFORM, Nathan Martindale, and Jason Hite, and Scott Stewart, and Mark Adams doi.org/10.25080/majora-212e5952-031 O PENING ARM: A PIVOT TO COMMUNITY SOFTWARE TO MEET THE NEEDS OF USERS AND STAKEHOLDERS OF THE PLANET ’ S LARGEST CLOUD OBSERVATORY , Zachary Sherman, and Scott Collis, and Max Grover, and Robert Jackson, and Adam Theisen doi.org/10.25080/majora-212e5952-032 S CI P Y TOOLS P LENARIES S CI P Y T OOLS P LENARY - CEL TEAM, Inessa Pawson doi.org/10.25080/majora-212e5952-043 S CI P Y T OOLS P LENARY ON M ATPLOTLIB, Elliott Sales de Andrade doi.org/10.25080/majora-212e5952-044 S CI P Y T OOLS P LENARY - N UM P Y, Inessa Pawson doi.org/10.25080/majora-212e5952-045 L IGHTNING TALKS D OWNSAMPLING T IME S ERIES D ATA FOR V ISUALIZATIONS, Delaina Moore doi.org/10.25080/majora-212e5952-027 A NALYSIS AS A PPLICATIONS : Q UICK INTRODUCTION TO LOCKFILES, Matthew Feickert doi.org/10.25080/majora-212e5952-028 S CHOLARSHIP R ECIPIENTS A MAN G OEL, University of Delhi A NURAG S AHA R OY, Saarland University I SURU F ERNANDO, University of Illinois at Urbana Champaign K ELLY M EEHAN, US Forest Service K ADAMBARI D EVARAJAN, University of Rhode Island K RISHNA K ATYAL, Thapar Institute of Engineering and Technology M ATTHEW M URRAY, Dask N AMAN G ERA, Sympy, LPython R OHIT G OSWAMI, University of Iceland S IMON C ROSS, QuTIP TANYA A KUMU, IBM Research Z UHAL C AKIR, Purdue University C ONTENTS The Advanced Scientific Data Format (ASDF): An Update 1 Perry Greenfield, Edward Slavich, William Jamieson, Nadia Dencheva Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling 7 Nathan Jessurun, Daniel E. Capecci, Olivia P. Dizon-Paradis, Damon L. Woodard, Navid Asadizanjani Galyleo: A General-Purpose Extensible Visualization Solution 13 Rick McGeer, Andreas Bergen, Mahdiyar Biazi, Matt Hemmings, Robin Schreiber USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application 22 Amanda Catlett, Theresa R. Coumbe, Scott D. Christensen, Mary A. Byrant Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI 26 Luigi Cruz, Wael Farah, Richard Elkins Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration 28 Pi-Yueh Chuang, Lorena A. Barba atoMEC: An open-source average-atom Python code 37 Timothy J. 
Callow, Daniel Kotik, Eli Kraisler, Attila Cangi Automatic random variate generation in Python 46 Christoph Baumgarten, Tirth Patel Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger Materials Suite 52 Alexandr Fonari, Farshad Fallah, Michael Rauch A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space 60 Seyed Alireza Vaezi, Gianni Orlando, Mojtaba Fazli, Gary Ward, Silvia Moreno, Shannon Quinn The myth of the normal curve and what to do about it 64 Allan Campopiano Python for Global Applications: teaching scientific Python in context to law and diplomacy students 69 Anna Haensch, Karin Knudson Papyri: better documentation for the scientific ecosystem in Jupyter 75 Matthias Bussonnier, Camille Carvalho Bayesian Estimation and Forecasting of Time Series in statsmodels 83 Chad Fulton Python vs. the pandemic: a case study in high-stakes software development 90 Cliff C. Kerr, Robyn M. Stuart, Dina Mistry, Romesh G. Abeysuriya, Jamie A. Cohen, Lauren George, Michał Jastrzebski, Michael Famulare, Edward Wenger, Daniel J. Klein Pylira: deconvolution of images in the presence of Poisson noise 98 Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels 105 Geoffrey M. Poore Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder 110 Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn Awkward Packaging: building Scikit-HEP 115 Henry Schreiner, Jim Pivarski, Eduardo Rodrigues Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber 121 Ido Michael Likeness: a toolkit for connecting the social fabric of place to human dynamics 125 Joseph V. Tuccillo, James D. Gaboardi poliastro: a Python library for interactive astrodynamics 136 Juan Luis Cano Rodrı́guez, Jorge Martı́nez Garrido A New Python API for Webots Robotics Simulations 147 Justin C. Fisher pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling 152 Jyotika Singh Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2 159 Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri Global optimization software library for research and education 167 Nadia Udler Temporal Word Embeddings Analysis for Disease Prevention 171 Nathan Jacobi, Ivan Mo, Albert You, Krishi Kishore, Zane Page, Shannon P. Quinn, Tim Heckman Design of a Scientific Data Analysis Support Platform 179 Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python Ecosystem 187 Orhan Eroglu, Anissa Zacharias, Michaela Sizemore, Alea Kootz, Heather Craker, John Clyne popmon: Analysis Package for Dataset Shift Detection 194 Simon Brugman, Tomas Sostak, Pradyot Patil, Max Baak pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe 202 Willy Menacho, Gonzalo Marcelo Ramı́rez-Ávila, Horacio V. 
Guzman Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods 210 Robert Jackson, Rebecca Gjini, Sri Hari Krishna Narayanan, Matt Menickelly, Paul Hovland, Jan Hückelheim, Scott Collis RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible 217 João Lemes Gribel Soares, Mateus Stano Junqueira, Oscar Mauricio Prada Ramirez, Patrick Sampaio dos Santos Brandão, Adriano Augusto Antongiovanni, Guilherme Fernandes Alves, Giovani Hidalgo Ceotto Wailord: Parsers and Reproducibility for Quantum Chemistry 226 Rohit Goswami Variational Autoencoders For Semi-Supervised Deep Metric Learning 231 Nathan Safir, Meekail Zain, Curtis Godwin, Eric Miller, Bella Humphrey, Shannon P Quinn A Python Pipeline for Rapid Application Development (RAD) 240 Scott D. Christensen, Marvin S. Brown, Robert B. Haehnel, Joshua Q. Church, Amanda Catlett, Dallon C. Schofield, Quyen T. Brannon, Stacy T. Smith Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses 244 W. Scott Shambaugh Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis 251 Zachary del Rosario Low Level Feature Extraction for Cilia Segmentation 259 Meekail Zain, Eric Miller, Shannon P Quinn, Cecilia Lo PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1 The Advanced Scientific Data Format (ASDF): An Update Perry Greenfield‡∗ , Edward Slavich‡† , William Jamieson‡† , Nadia Dencheva‡† F Abstract—We report on progress in developing and extending the new (ASDF) by outlining our near term plans for further improvements and format we have developed for the data from the James Webb and Nancy Grace extensions. Roman Space Telescopes since we reported on it at a previous Scipy. While the format was developed as a replacement for the long-standard FITS format Summary of Motivations used in astronomy, it is quite generic and not restricted to use with astronomical • Suitable as an archival format: data. We will briefly review the format, and extensions and changes made to the standard itself, as well as to the reference Python implementation we have – Old versions continue to be supported by developed to support it. The standard itself has been clarified in a number libraries. of respects. Recent improvements to the Python implementation include an – Format is sufficiently transparent (e.g., not improved framework for conversion between complex Python objects and ASDF, requiring extensive documentation to de- better control of the configuration of extensions supported and versioning of extensions, tools for display and searching of the structured metadata, bet- code) for the fundamental set of capabili- ter developer documentation, tutorials, and a more maintainable and flexible ties. schema system. This has included a reorganization of the components to make – Metadata is easily viewed with any text the standard free from astronomical assumptions. A important motivator for the editor. format was the ability to support serializing functional transforms in multiple dimensions as well as expressions built out of such transforms, which has now • Intrinsically hierarchical been implemented. More generalized compression schemes are now enabled. • Avoids duplication of shared items We are currently working on adding chunking support and will discuss our plan • Based on existing standard(s) for metadata and structure for further enhancements. • No tight constraints on attribute lengths or their values. 
• Clearly versioned Index Terms—data formats, standards, world coordinate systems, yaml • Supports schemas for validating files for basic structure and value requirements • Easily extensible, both for the standard, and for local or Introduction domain-specific conventions. The Advanced Scientific Data Format (ASDF) was originally developed in 2015. That original version was described in a paper Basics of ASDF Format [Gre15]. That paper described the shortcomings of the widely used • Format consists of a YAML header optionally followed by astronomical standard format FITS [FIT16] as well as those of one or more binary blocks for containing binary data. existing potential alternatives. It is not the goal of this paper to • The YAML [http://yaml.org] header contains all the meta- rehash those points in detail, though it is useful to summarize the data and defines the structural relationship of all the data basic points here. The remainder of this paper will describe where elements. we are using ASDF, what lessons we have learned from using • YAML tags are used to indicate to libraries the semantics ASDF for the James Webb Space Telescope, and summarize the of subsections of the YAML header that libraries can use to most important changes we have made to the standard, the Python construct special software objects. For example, a tag for library that we use to read and write ASDF files, and best practices a data array would indicate to a Python library to convert for using the format. it into a numpy array. We will give an example of a more advanced use case that • YAML anchors and alias are used to share common ele- illustrates some of the powerful advantages of ASDF, and that ments to avoid duplication. its application is not limited to astronomy, but suitable for much • JSON Schema [http://json-schema.org/specification.html], of scientific and engineering data, as well as models. We finish [http://json-schema.org/understanding-json-schema/] is used for schemas to define expectations for tag content * Corresponding author: perry@stsci.edu and whole headers combined with tools to validate actual ‡ Space Telescope Science Institute † These authors contributed equally. ASDF files against these schemas. • Binary blocks are referenced in the YAML to link binary Copyright © 2022 Perry Greenfield et al. This is an open-access article data to YAML attributes. distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • Support for arrays embedded in YAML or in a binary provided the original author and source are credited. block. 2 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) • Streaming support for a single binary block. Changes for 1.6 • Permit local definitions of tags and schemas outside of the Addition of the manifest mechanism standard. • While developed for astronomy, useful for general scien- The manifest is a YAML document that explicitly lists the tags and tific or engineering use. other features introduced by an extension to the ASDF standard. • Aims to be language neutral. It provides a more straightforward way of associating tags with schemas, allowing multiple tags to share the same schema, and generally making it simpler to visualize how tags and schemas Current and planned uses are associated (previously these associations were implied by the James Webb Space Telescope (JWST) Python implementation but were not documented elsewhere). 
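To make the YAML-header-plus-binary-block layout described above concrete, the following minimal sketch uses the reference Python library; the file name and tree contents are arbitrary examples, not taken from any mission data.

import asdf
import numpy as np

# An arbitrary example tree: the nested mapping is written into the
# YAML header, while the NumPy array is stored in a binary block.
tree = {
    "meta": {"instrument": "example", "exposure_time": 100.0},
    "data": np.arange(64, dtype="float32").reshape(8, 8),
}

af = asdf.AsdfFile(tree)
af.write_to("example.asdf")

with asdf.open("example.asdf") as af:
    print(af["meta"]["exposure_time"])   # metadata read from the YAML header
    data = np.asarray(af["data"])        # array loaded from the binary block
    print(data.shape, data.dtype)
    af.info()                            # display the hierarchical structure

Opening the resulting file in a text editor shows the metadata as ordinary YAML, which is one of the archival properties listed above.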
NASA requires JWST data products be made available in the FITS format. Nevertheless, all the calibration pipelines operate Handling of null values and their interpretation on the data using an internal objects very close to the the ASDF The standard didn’t previously specify the behavior regarding null representation. The JWST calibration pipeline uses ASDF to values. The Python library previously removed attributes from the serialize data that cannot be easily represented in FITS, such as YAML tree when the corresponding Python attribute has a None World Coordinate System information. The calibration software value upon writing to an ADSF file. On reading files where the is also capable of reading and producing data products as pure attribute was missing but the schema indicated a default value, ASDF files. the library would create the Python attribute with the default. As mentioned in the next item, we no longer use this mechanism, and Nancy Grace Roman Space Telescope now when written, the attribute appears in the YAML tree with This telescope, with the same mirror size as the Hubble Space a null value if the Python value is None and the schema permits Telescope (HST), but a much larger field of view than HST, will null values. be launched in 2026 or thereabouts. It is to be used mostly in survey mode and is capable of producing very large mosaicked Interpretation of default values in schema images. It will use ASDF as its primary data format. The use of default values in schemas is discouraged since the Daniel K Inoue Solar Telescope interpretation by libraries is prone to confusion if the assemblage This telescope is using ASDF for much of the early data products of schemas conflict with regard to the default. We have stopped to hold the metadata for a combined set of data which can involve using defaults in the Python library and recommend that the ASDF many thousands of files. Furthermore, the World Coordinate file always be explicit about the value rather than imply it through System information is stored using ASDF for all the referenced the schema. If there are practical cases that preclude always data. writing out all values (e.g., they are only relevant to one mode and usually are irrelevant), it should be the library that manages whether such attributes are written conditionally rather using the Vera Rubin Telescope (for World Coordinate System interchange) schema default mechanism. There have been users outside of astronomy using ASDF, as well as contributors to the source code. Add alternative tag URI scheme We now recommend that tag URIs begin with asdf:// Changes to the standard (completed and proposed) These are based on lessons learned from usage. Be explicit about what kind of complex YAML keys are supported The current version of the standard is 1.5.0 (1.6.0 being developed). For example, not all legal YAML keys are supported. Namely The following items reflect areas where we felt improvements YAML arrays, which are not hashable in Python. Likewise, were needed. general YAML objects are not either. The Standard now limits keys to string, integer, or boolean types. If more complex keys are Changes for 1.5 required, they should be encoded in strings. Moving the URI authority from stsci.edu to asdf-format.org Still to be done This is to remove the standard from close association with STScI Upgrade to JSON Schema draft-07 and make it clear that the format is not intended to be controlled There is interest in some of the new features of this version, by one institution. 
however, this is problematic since there are aspects of this version that are incompatible with draft-04, thus requiring all previous Moving astronomy-specific schemas out of standard schemas to be updated. These primarily affect the previous inclusion of World Coordinate Tags, which are strongly associated with astronomy. Remaining Replace extensions section of file history are those related to time and unit standards, both of obvious gen- erality, but the implementation must be based on some standards, This section is considered too specific to the concept of Python and currently the astropy-based ones are as good or better than extensions, and is probably best replaced with a more flexible any. system for listing extensions used. THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE 3 Changes to Python ASDF package Easier and more flexible mechanism to create new extensions (2.8.0) The previous system for defining extensions to ASDF, now deprecated, has been replaced by a new system that makes the association between tags, schemas, and conversion code more straightforward, as well as providing more intuitive names for the methods and attributes, and makes it easier to handle reference cycles if they are present in the code (also added to the original Tag handling classes). Introduced global configuration mechanism (2.8.0) This reworks how ASDF resources are located, and makes it easier to update the current configuration, as well as track down the location of the needed resources (e.g., schemas and converters), as well as removing performance issues that previously required extracting information from all the resource files thus slowing the Fig. 1: A plot of the compound model defined in the first segment of first asdf.open call. code. Added info/search methods and command line tools (2.6.0) These allow displaying the hierarchical structure of the header and file. This is made possible by the fact that expressions of models the values and types of the attributes. Initially, such introspection are straightforward to represent in YAML structure. stopped at any tagged item. A subsequent change provides mech- Despite the fact that the models are in some sense executable, anisms to see into tagged items (next item). An example of these they are perfectly safe so long as the library they are implemented tools is shown in a later section. in is safe (e.g., it doesn’t implement an "execute any OS com- mand" model). Furthermore, the representation in ASDF does not Added mechanism for info to display tagged item contents (2.9.0) explicitly use Python code. In principle it could be written or read This allows the library that converts the YAML to Python objects in any computer language. to expose a summary of the contents of the object by supplying The following illustrates a relatively simple but not trivial an optional "dunder" method that the info mechanism can take example. advantage of. First we define a 1D model and plot it. import numpy as np Added documentation on how ASDF library internals work import astropy.modeling.models as amm import astropy.units as u These appear in the readthedocs under the heading "Developer import asdf Overview". from matplotlib import pyplot as plt # Define 3 model components with units Plugin API for block compressors (2.8.0) g1 = amm.Gaussian1D(amplitude=100*u.Jy, This enables a localized extension to support further compression mean=120*u.MHz, options. 
stddev=5.*u.MHz) g2 = amm.Gaussian1D(65*u.Jy, 140*u.MHz, 3*u.MHz) powerlaw = amm.PowerLaw1D(amplitude=10*u.Jy, Support for asdf:// URI scheme (2.8.0) x_0=100*u.MHz, Support for ASDF Standard 1.6.0 (2.8.0) alpha=3) # Define a compound model This is still subject to modifications to the 1.6.0 standard. model = g1 + g2 + powerlaw x = np.arange(50, 200) * u.MHz Modified handling of defaults in schemas and None values (2.8.0) plt.plot(x, model(x)) As described previously. The following code will save the model to an ASDF file, and read it back in Using ASDF to store models af = asdf.AsdfFile() af.tree = {'model': model} This section highlights one aspect of ASDF that few other formats af.write_to('model.asdf') support in an archival way, e.g., not using a language-specific af2 = asdf.open('model.asdf') model2 = af2['model'] mechanism, such as Python’s pickle. The astropy package contains model2 is model a modeling subpackage that defines a number of analytical, as well False as a few table-based, models that can be combined in many ways, model2(103.5) == model(103.5) such as arithmetically, in composition, or multi-dimensional. Thus True it is possible to define fairly complex multi-dimensional models, Listing the relevant part of the ASDF file illustrates how the model many of which can use the built in fitting machinery. has been saved in the YAML header (reformatted to fit in this paper These models, and their compound constructs can be saved column). in ASDF files and later read in to recreate the corresponding model: !transform/add-1.2.0 astropy objects that were used to create the entries in the ASDF forward: 4 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) - !transform/add-1.2.0 something that the FITS format had no hope of managing, nor any forward: other scientific format that we are aware of. - !transform/gaussian1d-1.0.0 amplitude: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 Jy, value: 100.0} Displaying the contents of ASDF files bounding_box: - !unit/quantity-1.1.0 Functionality has been added to display the structure and content {unit: !unit/unit-1.0.0 MHz, value: 92.5} of the header (including data item properties), with a number of - !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 147.5} options of what depth to display, how many lines to display, etc. bounds: An example of the info use is shown in Figure 2. stddev: [1.1754943508222875e-38, null] There is also functionality to search for items in the file by inputs: [x] mean: !unit/quantity-1.1.0 attribute name and/or values, also using pattern matching for {unit: !unit/unit-1.0.0 MHz, value: 120.0} either. The search results are shown as attribute paths to the items outputs: [y] that were found. stddev: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 5.0} - !transform/gaussian1d-1.0.0 ASDF Extension/Converter System amplitude: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 Jy, value: 65.0} There are a number of components that are involved. Converters bounding_box: encapsulate the code that handles converting Python objects to - !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 123.5} and from their ASDF representation. 
These are classes that inherit - !unit/quantity-1.1.0 from the basic Converter class and define two Class attributes: {unit: !unit/unit-1.0.0 MHz, value: 156.5} tags, types each of which is a list of associated tag(s) and class(es) bounds: that the specific converter class will handle (each converter can stddev: [1.1754943508222875e-38, null] inputs: [x] handle more than one tag type and more than one class). The mean: !unit/quantity-1.1.0 ASDF machinery uses this information to map tags to converters {unit: !unit/unit-1.0.0 MHz, value: 140.0} when reading ASDF content, and to map types to converters when outputs: [y] stddev: !unit/quantity-1.1.0 saving these objects to an ASDF file. {unit: !unit/unit-1.0.0 MHz, value: 3.0} Each converter class is expected to supply two methods: inputs: [x] to_yaml_tree and from_yaml_tree that construct the outputs: [y] YAML content and convert the YAML content to Python class - !transform/power_law1d-1.0.0 alpha: 3.0 instances respectively. amplitude: !unit/quantity-1.1.0 A manifest file is used to associate tags and schema ID’s {unit: !unit/unit-1.0.0 Jy, value: 10.0} so that if a schema has been defined, that the ASDF content inputs: [x] outputs: [y] can be validated against the schema (as well as providing extra x_0: !unit/quantity-1.1.0 information for the ASDF content in the info command). Normally {unit: !unit/unit-1.0.0 MHz, value: 100.0} the converters and manifest are registered with the ASDF library inputs: [x] using standard functions, and this registration is normally (but is outputs: [y] ... not required to be) triggered by use of Python entry points defined in the setup.cfg file so that this extension is automatically Note that there are extra pieces of information that define the recognized when the extension package is installed. model more precisely. These include: One can of course write their own custom code to convert the contents of ASDF files however they want. The advantage of the • many tags indicating special items. These include different tag/converter system is that the objects can be anywhere in the tree kinds of transforms (i.e., functions), quantities (i.e., num- structure and be properly saved and recovered without having any bers with units), units, etc. implied knowledge of what attribute or location the object is at. • definitions of the units used. Furthermore, it brings with it the ability to validate the contents • indications of the valid range of the inputs or parameters by use of schema files. (bounds) Jupyter tutorials that show how to use converters can be found • each function shows the mapping of the inputs and the at: naming of the outputs of each function. • the addition operator is itself a transform. • https://github.com/asdf-format/tutorials/blob/master/ Your_first_ASDF_converter.ipynb Without the use of units, the YAML would be simpler. But • https://github.com/asdf-format/tutorials/blob/master/ the point is that the YAML easily accommodates expression trees. Your_second_ASDF_converter.ipynb The tags are used by the library to construct the astropy models, units and quantities as Python objects. However, nothing in the above requires the library to be written in Python. ASDF Roadmap for STScI Work This machinery can handle multidimensional models and sup- The planned enhancements to ASDF are understandably focussed ports both the combining of models with arithmetic operators as on the needs of STScI missions. Nevertheless, we are particularly well as pipelining the output of one model into another. 
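The following is a minimal sketch of the converter pattern described above, not code from the asdf package itself: the Rectangle class and the example.org tag and extension URIs are invented for illustration, and a real extension would normally also ship a schema and register itself through an entry point rather than at runtime.

import asdf
from asdf.extension import Converter, Extension

class Rectangle:
    # Hypothetical user object to be serialized to ASDF.
    def __init__(self, width, height):
        self.width = width
        self.height = height

class RectangleConverter(Converter):
    # Tag(s) this converter reads and Python type(s) it writes.
    tags = ["asdf://example.org/tags/rectangle-1.0.0"]
    types = [Rectangle]

    def to_yaml_tree(self, obj, tag, ctx):
        # Called on write: return plain YAML-serializable content.
        return {"width": obj.width, "height": obj.height}

    def from_yaml_tree(self, node, tag, ctx):
        # Called on read: rebuild the Python object from the YAML node.
        return Rectangle(node["width"], node["height"])

class RectangleExtension(Extension):
    extension_uri = "asdf://example.org/extensions/rectangle-1.0.0"
    tags = ["asdf://example.org/tags/rectangle-1.0.0"]
    converters = [RectangleConverter()]

# Register for this session; an installed package would normally do this
# through an asdf.extensions entry point instead.
asdf.get_config().add_extension(RectangleExtension())

asdf.AsdfFile({"shape": Rectangle(3.0, 4.0)}).write_to("rectangle.asdf")

Because the converter is looked up by type when writing and by tag when reading, the Rectangle object can sit anywhere in the tree and still round-trips without the caller knowing its location.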
This interested in areas that have wider benefit to the general scientific system has been used to define complex coordinate transforms and engineering community, and such considerations increase the from telescope detectors to sky coordinates for imaging, and priority of items necessary to STScI. Furthermore, we are eager wavelengths for spectrographs, using over 100 model components, to aid others working on ASDF by providing advice, reviews, and THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE 5 Fig. 2: This shows part of the output of the info command that shows the structure of a Roman Space Telescope test file (provided by the Roman Telescopes Branch at STScI). Displayed is the relative depth of the item, its type, value, and a title extracted from the associated schema to be used as explanatory information. possibly collaborative coding effort. STScI is committed to the Redefining versioning semantics long-term support of ADSF. Previously the meaning of different levels of versioning The following is a list of planned work, in order of decreasing were unclear. The normal inclination is to treat schema priority. version using the typical semantic versioning system de- fined for software. But schemas are not software and Chunking Support we are inclined to use the proposed system for schemas [url: https://snowplowanalytics.com/blog/2014/05/13/introducing- Since the Roman mission is expected to deal with large data schemaver-for-semantic-versioning-of-schemas/] To summarize: sets and mosaicked images, support for chunking is considered in this case the three levels of versioning correspond to: essential. We expect to layer the support in our Python library Model.Revision.Addition where a schema change: on zarr [https://zarr.dev/], with two different representations, one where all data is contained within the ADSF file in separate • [Model] prevents working with historical data blocks, and one where the blocks are saved in individual files. • [Revision] may prevent working with historical data Both representations have important advantages and use cases. • [Addition] is compatible with all historical data Integration into astronomy display tools Improvements to binary block management It is essential that astronomers be able to visualize the data These enhancements are needed to enable better chunking support contained within ASDF files conveniently using the commonly and other capabilities. available tool, such as SAOImage DS9 [Joy03] and Ginga [Jes13]. 6 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Cloud optimized storage [McK10] W. McKinney. Data structures for statistical computing in python, Proceedigns of the 9th Python in Science Conference, p56-61, 2010. Much of the future data processing operations for STScI are https://doi.org/10.25080/Majora-92bf1922-00a expected to be performed on the cloud, so having ASDF efficiently [Pen09] W. Pence, R. Seaman, R. L. White, Lossless Astronomical Image support such uses is important. An important element of this is Compression and the Effects of Noise, Publications of the Astro- making the format work efficiently with object storage services nomical Society of the Pacific, 121:414-427, April 2009. https: //doi.org/10.48550/arXiv.0903.2140 such as AWS S3 and Google Cloud Storage. [Pen10] W. Pence, R. L. White, R. Seaman. Optimal Compression of Floating- Point Astronomical Images Without Significant Loss of Information, IDL support Publications of the Astronomical Society of the Pacific, 122:1065- 1076, September 2010. 
https://doi.org/10.1086/656249 While Python is rapidly surpassing the use of IDL in astronomy, [Joy03] W. A. Joye, E. Mandel. New Features of SAOImage DS9, Astronomi- there is still much IDL code being used, and many of those still cal Data Analysis Software and Systems XII ASP Conference Series, using IDL are in more senior and thus influential positions (they 295:489, 2003. aren’t quite dead yet). So making ASDF data at least readable to IDL is a useful goal. Support Rice compression Rice compression [Pen09], [Pen10] has proven a useful lossy compression algorithm for astronomical imaging data. Supporting it will be useful to astronomers, particularly for downloading large imaging data sets. Pandas Dataframe support Pandas [McK10] has proven to be a useful tool to many as- tronomers, as well as many in the sciences and engineering, so support will enhance the uptake of ASDF. Compact, easy-to-read schema summaries Most scientists and even scientific software developers tend to find JSON Schema files tedious to interpret. A more compact, and intuitive rendering of the contents would be very useful. Independent implementation Having ASDF accepted as a standard data format requires a library that is divorced from a Python API. Initially this can be done most easily by layering it on the Python library, but ultimately there should be an independent implementation which includes support for C/C++ wrappers. This is by far the item that will require the most effort, and would benefit from outside involvement. Provide interfaces to other popular packages This is a catch all for identifying where there would be significant advantages to providing the ability to save and recover information in the ASDF format as an interchange option. Sources of Information • ASDF Standard: https://asdf-standard.readthedocs.io/en/ latest/ • Python ASDF package documentation: https://asdf. readthedocs.io/en/stable/ • Repository: https://github.com//asdf-format/asdf • Tutorials: https://github.com/asdf-format/tutorials R EFERENCES [Gre15] P. Greenfield, M. Droettboom, E. Bray. ASDF: A new data format for astronomy, Astronomy and Computing, 12:240-251, September 2015. https://doi.org/10.1016/j.ascom.2015.06.004 [FIT16] FITS Working Group. Definition of the Flexible Image Transport System, International Astronomical Union, http://fits.gsfc.nasa.gov/ fits_standard.html, July 2016. [Jes13] E. Jeschke. Ginga: an open-source astronomical image viewer and toolkit, Proc. of the 12th Python in Science Conference., p58- 64,January 2013. https://doi.org/10.25080/Majora-8b375195-00a PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 7 Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling Nathan Jessurun‡∗ , Daniel E. Capecci‡ , Olivia P. Dizon-Paradis‡ , Damon L. Woodard‡ , Navid Asadizanjani‡ F Abstract—Most semantic image annotation platforms suffer severe bottlenecks when handling large images, complex regions of interest, or numerous distinct foreground regions in a single image. We have developed the Semi-Supervised Semantic Annotator (S3A) to address each of these issues and facilitate rapid collection of ground truth pixel-level labeled data. Such a feat is accomplished through a robust and easy-to-extend integration of arbitrary python image pro- cessing functions into the semantic labeling process. Importantly, the framework devised for this application allows easy visualization and machine learning prediction of arbitrary formats and amounts of per-component metadata. 
To our knowledge, the ease and flexibility offered are unique to S3A among all open- source alternatives. Index Terms—Semantic annotation, Image labeling, Semi-supervised, Region of interest Introduction Labeled image data is essential for training, tuning, and evaluating Fig. 1. Common use cases for semantic segmentation involve relatively few fore- ground objects, low-resolution data, and limited complexity per object. Images the performance of many machine learning applications. Such retrieved from https://cocodataset.org/#explore. labels are typically defined with simple polygons, ellipses, and bounding boxes (i.e., "this rectangle contains a cat"). However, this approach can misrepresent more complex shapes with holes and greatly hinders scalability. As such, several tools have been or multiple regions as shown later in Figure 9. When high accuracy proposed to alleviate the burden of collecting these ground-truth is required, labels must be specified at or close to the pixel-level labels [itL18]. Unfortunately, existing tools are heavily biased - a process known as semantic labeling or semantic segmentation. toward lower-resolution images with few regions of interest (ROI), A detailed description of this process is given in [CZF+ 18]. similar to Figure 1. While this may not be an issue for some Examples can readily be found in several popular datasets such datasets, such assumptions are crippling for high-fidelity images as COCO, depicted in Figure 1. with hundreds of annotated ROIs [LSA+ 10], [WYZZ09]. Semantic segmentation is important in numerous domains With improving hardware capabilities and increasing need for including printed circuit board assembly (PCBA) inspection (dis- high-resolution ground truth segmentation, there are a continu- cussed later in the case study) [PJTA20], [AML+ 19], quality ally growing number of applications that require high-resolution control during manufacturing [FRLL18], [AVK+ 01], [AAV+ 02], imaging with the previously described characteristics [MKS18], manuscript restoration / digitization [GNP+ 04], [KBO16], [JB92], [DS20]. In these cases, the existing annotation tooling greatly [TFJ89], [FNK92], and effective patient diagnosis [SKM+ 10], impacts productivity due to the previously referenced assumptions [RLO+ 17], [YPH+ 06], [IGSM14]. In all these cases, imprecise and lack of support [Spa20]. annotations severely limit the development of automated solutions In response to these bottlenecks, we present the Semi- and can decrease the accuracy of standard trained segmentation Supervised Semantic Annotation (S3A) annotation and prototyping models. platform -- an application which eases the process of pixel-level Quality semantic segmentation is difficult due to a reliance on labeling in large, complex scenes.1 Its graphical user interface is large, high-quality datasets, which are often created by manually shown in Figure 2. The software includes live app-level property labeling each image. Manual annotation is error-prone, costly, customization, real-time algorithm modification and feedback, region prediction assistance, constrained component table editing * Corresponding author: njessurun@ufl.edu ‡ University of Florida based on allowed data types, various data export formats, and a highly adaptable set of plugin interfaces for domain-specific exten- Copyright © 2022 Nathan Jessurun et al. This is an open-access article sions to S3A. 
Beyond software improvements, these features play distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, significant roles in bridging the gap between human annotation provided the original author and source are credited. efforts and scalable, automated segmentation methods [BWS+ 10]. 8 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Improve Semi- segmentation supervised techniques labeling Update Generate models training data Fig. 3. S3A’s can iteratively annotate, evaluate, and update its internals in real- time. to specify (but can be modified or customized if desired). As a re- sult, incorporating additional/customized application functionality can require as little as one line of code. Processes interface with Fig. 2. S3A’s interface. The main view consists of an image to annotate, a PyQtGraph parameters to gain access to data-customized widget component table of prior annotations, and a toolbar which changes functionality types and more (https://github.com/pyqtgraph/pyqtgraph). depending on context. These processes can also be arbitrarily nested and chained, which is critical for developing hierarchical image processing models, an example of which is shown in Figure 4. This frame- work is used for all image and region processing within S3A. Note that for image processes, each portion of the hierarchy yields Application Overview intermediate outputs to determine which stage of the process flow Design decisions throughout S3A’s architecture have been driven is responsible for various changes. This, in turn, reduces the by the following objectives: effort required to determine which parameters must be adjusted to achieve optimal performance. • Metadata should have significance rather than be treated as an afterthought, Plugins for User Extensions • High-resolution images should have minimal impact on the annotation workflow, The previous section briefly described how custom user functions • ROI density and complexity should not limit annotation are easily wrapped within a process, exposing its parameters workflow, and within S3A in a GUI format. A rich plugin interface is built on top • Prototyping should not be hindered by application com- of this capability in which custom functions, table field predictors, plexity. default action hooks, and more can be directly integrated into S3A. In all cases, only a few lines of code are required to achieve most These motives were selected upon noticing the general lack integrations between user code and plugin interface specifications. of solutions for related problems in previous literature and tool- The core plugin infrastructure consists of a function/property reg- ing. Moreover, applications that do address multiple aspects of istration mechanism and an interaction window that shows them complex region annotation often require an enterprise service and in the UI. As such, arbitrary user functions can be "registered" in cannot be accessed under open-source policies. one line of code to a plugin, where it will be effectively exposed to While the first three points are highlighted in the case study, the user within S3A. 
A trivial example is depicted in Figure 5, but the subsections below outline pieces of S3A’s architecture that more complex behavior such as OCR integration is possible with prove useful for iterative algorithm prototyping and dataset gen- similar ease (see this snippet for an implementation leveraging eration as depicted in Figure 3. Note that beyond the facets easyocr). illustrated here, S3A possesses multiple additional characteris- Plugin features are heavily oriented toward easing the pro- tics as outlined in its documentation (https://gitlab.com/s3a/s3a/- cess of automation both for general annotation needs and niche /wikis/docs/User’s-Guide). datasets. In either case, incorporating existing library functions is converted into a trivial task directly resulting in lower annotation Processing Framework time and higher labeling accuracy. At the root of S3A’s functionality and configurability lies its adaptive processing framework. Functions exposed within S3A are Adaptable I/O thinly wrapped using a Process structure responsible for parsing An extendable I/O framework allows annotations to be used in signature information to provide documentation, parameter infor- a myriad of ways. Out-of-the-box, S3A easily supports instance- mation, and more to the UI. Hence, all graphical depictions are level segmentation outputs, facilitating deep learning model train- abstracted beyond the concern of the user while remaining trivial ing. As an example, Figure 6 illustrates how each instance in the image becomes its own pair of image and mask data. When several 1. A preliminary version was introduced in an earlier publication [JPRA20], but significant changes to the framework and tool capabilities have been instances overlap, each is uniquely distinguishable depending employed since then. on the characteristic of their label field. Particularly helpful for SEMI-SUPERVISED SEMANTIC ANNOTATOR (S3A): TOWARD EFFICIENT SEMANTIC LABELING 9 Fig. 4. Outputs of each processing stage can be quickly viewed in context after an iteration of annotating. Upon inspecting the results, it is clear the failure point is a low k value during K-means clustering and segmentation. The woman’s shirt is not sufficiently distinguishable from the background palette to denote a separate entity. The red dot is an indicator of where the operator clicked during annotation. from qtpy import QtWidgets from s3a import ( S3A, __main__, RandomToolsPlugin, ) def hello_world(win: S3A): QtWidgets.QMessageBox.information( win, "Hello World", "Hello World!" ) RandomToolsPlugin.deferredRegisterFunc( hello_world ) __main__.mainCli() Fig. 6. Multiple export formats exist, among which is a utility that crops com- ponents out of the image, optionally padding with scene pixels and resizing to Fig. 5. Simple standalone functions can be easily exposed to the user through ensure all shapes are equal. Each sub-image and mask is saved accordingly, the random tools plugin. Note that if tunable parameters were included in the which is useful for training on multiple forms of machine learning models. function signature, pressing "Open Tools" (the top menu option) allows them to be altered. binations for functions outside S3A in the event they are utilized in a different framework. models with fixed input sizes, these exports can optionally be forced to have a uniform shape (e.g., 512x512 pixels) while main- taining their aspect ratio. 
This is accomplished by incorporating Case Study additional scene pixels around each object until the appropriate Both the inspiration and developing efforts for S3A were initially size is obtained. Models trained on these exports can be directly driven by optical printed circuit board (PCB) assurance needs. plugged back into S3A’s processing framework, allowing them In this domain, high-resolution images can contain thousands to generate new annotations or refine preliminary user efforts. of complex objects in a scene, as seen in Figure 7. Moreover, The described I/O framework is also heavily modularized such numerous components are not representable by cardinal shapes that custom dataset specifications can easily be incorporated. In such as rectangles, circles, etc. Hence, high-count polygonal this manner, future versions of S3A will facilitate interoperability regions dominated a significant portion of the annotated regions. with popular formats such as COCO and Pascal VOC [LMB+ 14], The computational overhead from displaying large images and [EGW+ 10]. substantial numbers of complex regions either crashed most anno- tation platforms or prevented real-time interaction. In response, S3A was designed to fill the gap in open-source annotation Deep, Portable Customizability platforms that addressed each issue while requiring minimal setup Beyond the features previously outlined, S3A provides numerous and allowing easy prototyping of arbitrary image processing tasks. avenues to configure shortcuts, color schemes, and algorithm The subsections below describe how the S3A labeling platform workflows. Several examples of each can be seen in the user was utilized to collect a large database of PCB annotations along guide. Most customizable components prototyped within S3A can with their associated metadata2 . also be easily ported to external workflows after development. Hierarchical processes have states saved in YAML files describing Large Images with Many Annotations all parameters, which can be reloaded to create user profiles. In optical PCB assurance, one method of identifying component Alternatively, these same files can describe ideal parameter com- defects is to localize and characterize all objects in the image. Each 10 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 8. Regardless of total image size and number of annotations, Python processing is be limited to the ROI or viewbox size for just the selected object based on user preferences. The depiction shows Grab Cut operating on a user- defined initial region within a much larger (8000x6000) image. The resulting Fig. 7. Example PCB segmentation. In contrast to typical semgentation tasks, region was available in 1.94 seconds on low-grade hardware. the scene contains over 4,000 objects with numerous complex shapes. component can then be cross-referenced against genuine proper- ties such as length/width, associated text, allowed orientations, etc. However, PCB surfaces can contain hundreds to thousands of components at several magnitudes of size, necessitating high- resolution images for in-line scanning. To handle this problem more generally, S3A separates the editing and viewing experi- Fig. 9. Annotated objects in S3A can incorporate both holes and distinct regions ences. In other words, annotation time is orders of magnitude through a multi-polygon container. 
Holes are represented as polygons drawn on faster since only edits in one region at a time and on a small subset top of existing foreground, and can be arbitrarily nested (i.e. island foreground is of the full image are considered during assisted segmentation. All also possible). other annotations are read-only until selected for alteration. For instance, Figure 8 depicts user inputs on a small ROI out of a key performance improvement when thousands of regions (each much larger image. The resulting component shape is proposed with thousands of points) are in the same field of view. When within seconds and can either be accepted or modified further by low polygon counts are required, S3A also supports RDP polygon the user. While PCB annotations initially inspired this approach, it simplification down to a user-specified epsilon parameter [Ram]. is worth noting that the architectural approach applies to arbitrary domains of image segmentation. Complex Metadata Another key performance improvement comes from resizing Most annotation software support robust implementation of im- the processed region to a user-defined maximum size. For instance, age region, class, and various text tags ("metadata"). However, if an ROI is specified across a large portion of the image but this paradigm makes collecting type-checked or input-sanitized the maximum processing size is 500x500 pixels, the processed metadata more difficult. This includes label categories such as area will be downsampled to a maximum dimension length of object rotation, multiclass specifications, dropdown selections, 500 before intensive algorithms are run. The final output will and more. In contrast, S3A treats each metadata field the same be upsampled back to the initial region size. In this manner, way as object vertices, where they can be algorithm-assisted, optionally sacrificing a small amount of output accuracy can directly input by the user, or part of a machine learning prediction drastically accelerate runtime performance for larger annotated framework. Note that simple properties such as text strings or objects. numbers can be directly input in the table cells with minimal need for annotation assistance3 . In conrast, custom fields can provide Complex Vertices/Semantic Segmentation plugin specifications which allow more advanced user interaction. Multiple types of PCB components possess complex shapes which Finally, auto-populated fields like annotation timestamp or author might contain holes or noncontiguous regions. Hence, it is bene- can easily be constructed by providing a factory function instead ficial for software like S3A to represent these features inherently of default value in the parameter specification. with a ComplexXYVertices object: that is, a collection of This capability is particularly relevant in the field of optical polygons which either describe foreground regions or holes. This PCB assurance. White markings on the PCB surface, known is enabled by thinly wrapping opencv’s contour and hierarchy as silkscreen, indicate important aspects of nearby components. logic. Example components difficult to accomodate with single- Thus, understanding the silkscreen’s orientation, alphanumeric polygon annotation formats are illustrated in Figure 9. characters, associated component, logos present, and more provide At the same time, S3A also supports high-count polygons several methods by which to characterize / identify features with no performance losses. Since region edits are performed by of their respective devices. 
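As a rough sketch of the idea behind this hole-aware representation (generic OpenCV calls, not S3A's actual API), the following recovers foreground polygons and their holes from a binary mask via the contour hierarchy and simplifies each polygon with the Douglas-Peucker (RDP) algorithm at a chosen epsilon; the mask and tolerance are made-up examples.

import cv2
import numpy as np

# Example binary mask: a filled square with a smaller square hole in it.
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(mask, (20, 20), (180, 180), 255, thickness=-1)
cv2.rectangle(mask, (80, 80), (120, 120), 0, thickness=-1)

# RETR_CCOMP yields a two-level hierarchy: outer boundaries and the
# holes nested directly inside them.
contours, hierarchy = cv2.findContours(
    mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE
)

epsilon = 2.0  # user-tunable RDP tolerance, in pixels
for contour, info in zip(contours, hierarchy[0]):
    simplified = cv2.approxPolyDP(contour, epsilon, True)
    is_hole = info[3] != -1  # a parent contour exists, so this is a hole
    kind = "hole" if is_hole else "foreground"
    print(kind, "polygon with", len(simplified), "vertices")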
Complex Metadata

Most annotation software supports robust implementation of image region, class, and various text tags ("metadata"). However, this paradigm makes collecting type-checked or input-sanitized metadata more difficult. This includes label categories such as object rotation, multiclass specifications, dropdown selections, and more. In contrast, S3A treats each metadata field the same way as object vertices, where each can be algorithm-assisted, directly input by the user, or part of a machine learning prediction framework. Note that simple properties such as text strings or numbers can be directly input in the table cells with minimal need for annotation assistance³. In contrast, custom fields can provide plugin specifications which allow more advanced user interaction. Finally, auto-populated fields like annotation timestamp or author can easily be constructed by providing a factory function instead of a default value in the parameter specification; a schematic sketch of this appears below.

This capability is particularly relevant in the field of optical PCB assurance. White markings on the PCB surface, known as silkscreen, indicate important aspects of nearby components. Thus, understanding the silkscreen's orientation, alphanumeric characters, associated component, logos present, and more provides several methods by which to characterize and identify features of their respective devices. Both default and customized input validators were applied to each field using parameter specifications, custom plugins, or simple factories as described above. A summary of the metadata collected for one component is shown in Figure 10.

Fig. 10. Metadata can be collected, validated, and customized with ease. A mix of default properties (strings, numbers, booleans), factories (timestamp, author), and custom plugins (yellow circle representing associated device) are present.
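To make the field-specification idea concrete, the sketch below shows how plain defaults, validators, bounds, and factory functions can coexist in one specification. The dictionary layout, field names, and helper function are invented for illustration; S3A's real specifications are built on PyQtGraph's Parameter machinery (see footnote 3).

```python
from datetime import datetime, timezone
import getpass

# Schematic field specifications: plain defaults, bounded or validated values,
# and factory-generated fields. This layout is invented for illustration and
# is not S3A's parameter-specification format.
FIELD_SPECS = {
    "Designator": {"default": "", "validator": str},
    "Rotation":   {"default": 0.0, "validator": float, "bounds": (0.0, 360.0)},
    "Logo":       {"default": "none", "choices": ["none", "text", "graphic"]},
    # Factories run at annotation time instead of storing a fixed default.
    "Timestamp":  {"factory": lambda: datetime.now(timezone.utc).isoformat()},
    "Author":     {"factory": getpass.getuser},
}

def new_record(**overrides):
    """Build one metadata record, applying defaults, factories, and checks."""
    record = {}
    for name, spec in FIELD_SPECS.items():
        if name in overrides:
            value = spec.get("validator", lambda v: v)(overrides[name])
        elif "factory" in spec:
            value = spec["factory"]()
        else:
            value = spec["default"]
        bounds = spec.get("bounds")
        if bounds and not (bounds[0] <= value <= bounds[1]):
            raise ValueError(f"{name}={value!r} is outside {bounds}")
        if "choices" in spec and value not in spec["choices"]:
            raise ValueError(f"{name}={value!r} is not one of {spec['choices']}")
        record[name] = value
    return record

# Timestamp and Author are filled in automatically for every record.
print(new_record(Designator="C12", Rotation=90.0))
```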
Conclusion and Future Work

The Semi-Supervised Semantic Annotator (S3A) is proposed to address the difficult task of pixel-level annotation of image data. For high-resolution images with numerous complex regions of interest, existing labeling software faces performance bottlenecks attempting to extract ground-truth information. Moreover, there is a lack of capabilities to convert such a labeling workflow into an automated procedure with feedback at every step. Each of these challenges is overcome by various features within S3A specifically designed for such tasks. As a result, S3A provides not only tremendous time savings during ground truth annotation, but also allows an annotation pipeline to be directly converted into a prediction scheme. Furthermore, the rapid feedback accessible at every stage of annotation expedites prototyping of novel solutions to imaging domains in which few examples of prior work exist. Nonetheless, multiple avenues exist for improving S3A's capabilities in each of these areas. Several prominent future goals are highlighted in the following sections.

Dynamic Algorithm Builder

Presently, processing workflows can be specified in a sequential YAML file which describes each algorithm and its respective parameters. However, this is not easy to adapt within S3A, especially by inexperienced annotators. Future iterations of S3A will incorporate graphical flowcharts which make this process drastically more intuitive and provide faster feedback. Frameworks like Orange [DCE+] perform this task well, and S3A would strongly benefit from adding the relevant capabilities.

Image Navigation Assistance

Several aspects of image navigation can be incorporated to simplify the handling of large images. For instance, a "minimap" tool would allow users to maintain a global image perspective while making local edits. Furthermore, this sense of scale aids intuition of how many regions of similar component density, color, etc. exist within the entire image.

Second, multiple strategies for annotating large images leverage a windowing approach, where they divide the total image into several smaller pieces in a gridlike fashion. While this has its disadvantages, it is fast, easy to automate, and produces reasonable results depending on the initial image complexity [VGSG+19]. Hence, these methods would be significantly easier to incorporate into S3A if a generalized windowing framework were available which allows users to specify all necessary parameters such as window overlap, size, sampling frequency, etc. A preliminary version of this is implemented for categorical-based model prediction, but a more robust feature set for interactive segmentation is strongly preferable.

Aggregation of Human Annotation Habits

Several times, it has been noted that manual segmentation of image data is not a feasible or scalable approach for remotely large datasets. However, there are multiple cases in which human intuition can greatly outperform even complex neural networks, depending on the specific segmentation challenge [RLFF15]. For this reason, it would be ideal to capture data points possessing information about the human decision-making process and apply them to images at scale. This may include taking into account human labeling time per class, hesitation between clicks, the relationship between shape boundary complexity and instance quantity, and more. By aggregating such statistics, a pattern may arise which can be leveraged as an additional automated annotation technique.

2. For those curious, the dataset and associated paper are accessible at https://www.trust-hub.org/#/data/pcb-images.
3. For a list of input validators and supported primitive types, refer to PyQtGraph's Parameter documentation.

REFERENCES

[AAV+02] C Anagnostopoulos, I Anagnostopoulos, D Vergados, G Kouzas, E Kayafas, V Loumos, and G Stassinopoulos. High performance computing algorithms for textile quality control. Mathematics and Computers in Simulation, 60(3):389–400, September 2002. doi:10.1016/S0378-4754(02)00031-9.
[AML+19] Mukhil Azhagan, Dhwani Mehta, Hangwei Lu, Sudarshan Agrawal, Mark Tehranipoor, Damon L Woodard, Navid Asadizanjani, and Praveen Chawla. A review on automatic bill of material generation and visual inspection on PCBs. In ISTFA 2019: Proceedings of the 45th International Symposium for Testing and Failure Analysis, page 256. ASM International, 2019.
[AVK+01] C. Anagnostopoulos, D. Vergados, E. Kayafas, V. Loumos, and G. Stassinopoulos. A computer vision approach for textile quality control. The Journal of Visualization and Computer Animation, 12(1):31–44, 2001. doi:10.1002/vis.245.
[BWS+10] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 438–451, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[CZF+18] Qimin Cheng, Qian Zhang, Peng Fu, Conghuan Tu, and Sen Li. A survey and analysis on automatic image annotation. Pattern Recognition, 79:242–259, 2018. doi:10.1016/j.patcog.2018.02.017.
[DCE+] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, and Anže Starič. Orange: Data mining toolbox in Python. 14(1):2349–2353.
[DS20] Polina Demochkina and Andrey V. Savchenko. Improving the accuracy of one-shot detectors for small objects in x-ray images. In 2020 International Russian Automation Conference (RusAutoCon), pages 610–614. IEEE, September 2020. URL: https://ieeexplore.ieee.org/document/9208097/, doi:10.1109/RusAutoCon49822.2020.9208097.
[EGW+10] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010. URL: https://doi.org/10.1007/s11263-009-0275-4, doi:10.1007/s11263-009-0275-4.
[FNK92] H. Fujisawa, Y. Nakano, and K. Kurino. Segmentation methods for character recognition: From segmentation to document structure analysis. Proceedings of the IEEE, 80(7):1079–1092, July 1992. doi:10.1109/5.156471.
(SCIPY 2022) [FRLL18] Max K. Ferguson, Ak Ronay, Yung-Tsun Tina Lee, and Kin- IEEE Transactions on Medical Imaging, 36(2):674–683, Febru- cho. H. Law. Detection and segmentation of manufacturing ary 2017. doi:10.1109/TMI.2016.2621185. defects with convolutional neural networks and transfer learn- [SKM+ 10] Sascha Seifert, Michael Kelm, Manuel Moeller, Saikat Mukher- ing. Smart and sustainable manufacturing systems, 2, 2018. jee, Alexander Cavallaro, Martin Huber, and Dorin Comaniciu. doi:10.1520/SSMS20180033. Semantic annotation of medical images. In Brent J. Liu and [GNP+ 04] Basilios Gatos, Kostas Ntzios, Ioannis Pratikakis, Sergios William W. Boonn, editors, Medical Imaging 2010: Advanced Petridis, T. Konidaris, and Stavros J. Perantonis. A segmentation- PACS-based Imaging Informatics and Therapeutic Applications, free recognition technique to assist old greek handwritten volume 7628, pages 43 – 50. International Society for Optics and manuscript OCR. In Simone Marinai and Andreas R. Dengel, Photonics, SPIE, 2010. URL: https://doi.org/10.1117/12.844207, editors, Document Analysis Systems VI, Lecture Notes in Com- doi:10.1117/12.844207. puter Science, pages 63–74, Berlin, Heidelberg, 2004. Springer. [Spa20] SpaceNet. Multi-Temporal Urban Development Challenge. doi:10.1007/978-3-540-28640-0_7. https://spacenet.ai/sn7-challenge/, June 2020. [IGSM14] D. K. Iakovidis, T. Goudas, C. Smailis, and I. Maglogiannis. [TFJ89] T. Taxt, P.J. Flynn, and A.K. Jain. Segmentation of document Ratsnake: A versatile image annotation tool with application images. IEEE Transactions on Pattern Analysis and Machine to computer-aided diagnosis, 2014. doi:10.1155/2014/ Intelligence, 11(12):1322–1329, December 1989. doi:10. 286856. 1109/34.41371. [itL18] Humans in the Loop. The best image annotation platforms [VGSG+ 19] Juan P. Vigueras-Guillén, Busra Sari, Stanley F. Goes, Hans G. for computer vision (+ an honest review of each), October Lemij, Jeroen van Rooij, Koenraad A. Vermeer, and Lucas J. 2018. URL: https://hackernoon.com/the-best-image-annotation- van Vliet. Fully convolutional architecture vs sliding-window platforms-for-computer-vision-an-honest-review-of-each- cnn for corneal endothelium cell segmentation. BMC Biomedical dac7f565fea. Engineering, 1(1):4, January 2019. doi:10.1186/s42490- [JB92] Anil K. Jain and Sushil Bhattacharjee. Text segmentation using 019-0003-2. gabor filters for automatic document processing. Machine Vision [WYZZ09] C. Wang, Shuicheng Yan, Lei Zhang, and H. Zhang. Multi- and Applications, 5(3):169–184, June 1992. doi:10.1007/ label sparse coding for automatic image annotation. In 2009 BF02626996. IEEE Conference on Computer Vision and Pattern Recognition, [JPRA20] Nathan Jessurun, Olivia Paradis, Alexandra Roberts, and Navid page 1643–1650, June 2009. doi:10.1109/CVPR.2009. Asadizanjani. Component Detection and Evaluation Framework 5206866. (CDEF): A Semantic Annotation Tool. Microscopy and Micro- [YPH 06] Paul A. Yushkevich, Joseph Piven, Heather Cody Hazlett, + analysis, 26(S2):1470–1474, August 2020. doi:10.1017/ Rachel Gimpel Smith, Sean Ho, James C. Gee, and Guido S1431927620018243. Gerig. User-guided 3D active contour segmentation of anatom- [KBO16] Made Windu Antara Kesiman, Jean-Christophe Burie, and Jean- ical structures: Significantly improved efficiency and reliability. Marc Ogier. A new scheme for text line and character seg- NeuroImage, 31(3):1116–1128, July 2006. doi:10.1016/j. mentation from gray scale images of palm leaf manuscript. neuroimage.2006.01.015. 
In 2016 15th International Conference on Frontiers in Hand- writing Recognition (ICFHR), pages 325–330, October 2016. doi:10.1109/ICFHR.2016.0068. [LMB+ 14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Euro- pean conference on computer vision, pages 740–755. Springer, 2014. [LSA+ 10] L’ubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H. S. Torr. What, where and how many? combining object detectors and crfs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 424–437, Berlin, Heidelberg, 2010. Springer Berlin Hei- delberg. [MKS18] S. Mohajerani, T. A. Krammer, and P. Saeedi. A cloud detection algorithm for remote sensing images using fully convolutional neural networks. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), page 1–5, August 2018. doi:10.1109/MMSP.2018.8547095. [PJTA20] Olivia P Paradis, Nathan T Jessurun, Mark Tehranipoor, and Navid Asadizanjani. Color normalization for robust automatic bill of materials generation and visual inspection of pcbs. In ISTFA 2020: Papers Accepted for the Planned 46th International Symposium for Testing and Failure Analysis, International Symposium for Testing and Failure Analysis, pages 172–179, 2020. URL: https://doi.org/10.31399/asm.cp. istfa2020p0172https://dl.asminternational.org/istfa/proceedings- pdf/ISTFA2020/83348/172/425605/istfa2020p0172.pdf, doi:10.31399/asm.cp.istfa2020p0172. [Ram] Urs Ramer. An iterative procedure for the polygonal approx- imation of plane curves. 1(3):244–256. URL: https://www. sciencedirect.com/science/article/pii/S0146664X72800170, doi:10.1016/S0146-664X(72)80017-0. [RLFF15] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In 2015 IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), page 2121–2131. IEEE, June 2015. URL: http://ieeexplore.ieee.org/document/7298824/, doi:10. 1109/CVPR.2015.7298824. [RLO+ 17] Martin Rajchl, Matthew C. H. Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A. Rutherford, Joseph V. Hajnal, Bernhard Kainz, and Daniel Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 13 Galyleo: A General-Purpose Extensible Visualization Solution Rick McGeer‡∗ , Andreas Bergen‡ , Mahdiyar Biazi‡ , Matt Hemmings‡ , Robin Schreiber‡ F Abstract—Galyleo is an open-source, extensible dashboarding solution inte- Jupyter’s web interface is primarily to offer textboxes for code grated with JupyterLab [jup]. Galyleo is a standalone web application integrated entry. Entered code is sent to the server for evaluation and as an iframe [LS10] into a JupyterLab tab. Users generate data for the dash- text/HTML results returned. Visualization in a Jupyter Notebook board inside a Jupyter Notebook [KRKP+ 16], which transmits the data through is either given by images rendered server-side and returned as message passing [mdn] to the dashboard; users use drag-and-drop operations inline image tags, or by JavaScript/HTML5 libraries which have to add widgets to filter, and charts to display the data, shapes, text, and images. The dashboard is saved as a JSON [Cro06] file in the user’s filesystem in the a corresponding server-side Python library. 
The Python library same directory as the Notebook. generates HTML5/JavaScript code for rendering. The limiting factor is that the visualization library must be in- Index Terms—JupyterLab, JupyterLab extension, Data visualization tegrated with the Python backend by a developer, and only a subset of the rich array of visualization, charting, and mapping libraries Introduction available on the HTML5/JavaScript platform is integrated. The HTML5/JavaScript platform is as rich a client-side visualization Current dashboarding solutions [hol22a] [hol22b] [plo] [pan22] platform as Python is a server-side platform. for Jupyter either involve external, heavyweight tools, ingrained Galyleo set out to offer the best of both worlds: Python, R, and HTML/CSS coding, complex publication, or limited control over Julia as a scalable analytics platform coupled with an extensible layout, and have restricted widget sets and visualization libraries. JavaScript/HTML5 visualization and interaction platform. It offers Graphics objects require a great deal of configuration: size, posi- a no-code client-side environment, for several reasons. tion, colors, fonts must be specified for each object. Thus library solutions involve a significant amount of fairly simple code. Con- 1) The Jupyter analytics community is comfortable with versely, visualization involves analytics, an inherently complex server-side analytics environments (the 100+ kernels set of operations. Visualization tools such as Tableau [DGHP13] available in Jupyter, including Python, R and Julia) but or Looker [loo] combine visualization and analytics in a single less so with the JavaScript visualization platform. application presented through a point-and-click interface. Point- 2) Configuration of graphical objects takes a lot of low-value and-click interfaces are limited in the number and complexity configuration code; conversely, it is relatively easy to do of operations supported. The complexity of an operation isn’t by hand. reduced by having a simple point-and-click interface; instead, the user is confronted with the challenge of trying to do something These insights lead to a mixed interface, combining a drag- complicated by pointing. The result is that tools encapsulate and-drop interface for the design and configuration of visual complex operations in a few buttons, and that leads to a limited objects, and a coding, server-side interface for analytics programs. number of operations with reduced options and/or tools with steep Extension of the widget set was an important consideration. A learning curves. widget is a client-side object with a physical component. Galyleo In contrast, Jupyter is simply a superior analytics environment is designed to be extensible both by adding new visualization in every respect over a standalone visualization tool: its various libraries and components and by adding new widgets. kernels and their libraries provide a much broader range of analyt- Publication of interactive dashboards has been a further chal- ics capabilities; its programming interface is a much cleaner and lenge. A design goal of Galyleo was to offer a simple scheme, simpler way to perform complex operations; hardware resources where a dashboard could be published to the web with a single can scale far more easily than they can for a visualization tool; click. and connectors to data sources are both plentiful and extensible. 
These then, are the goals of Galyleo: Both standalone visualization tools and Jupyter libraries have a limited set of visualizations. Jupyter is a server-side platform. 1) Simple, drag-and-drop design of interactive dashboards in a visual editor. The visual design of a Galyleo dashboard * Corresponding author: rick.mcgeer@engageLively.com ‡ engageLively should be no more complex than design of a PowerPoint or Google slide; Copyright © 2022 Rick McGeer et al. This is an open-access article distributed 2) Radically simplify the dashboard-design interface by cou- under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the pling it to a powerful, Jupyter back end to do the analytics original author and source are credited. work, separating visualization and analytics concerns; 14 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 1: Figure 1. A New Galyleo Dashboard Fig. 3: Figure 3. Dataflow in Galyleo As the user creates and manipulates the visual elements, the editor continuously saves the table as a JSON file, which can also be edited with Jupyter’s built-in text editor. Workflow The goal of Galyleo is simplicity and transparency. Data prepa- ration is handled in Jupyter, and the basic abstract item, the GalyleoTable is generally created and manipulated there, using an open-source Python library. When a table is ready, the Galyleo- Client library is invoked to send it to the dashboard, where it appears in the table tab of the sidebar. The dashboard author then creates visual elements such as sliders, lists, dropdowns etc., Fig. 2: Figure 2. The Galyleo Dashboard Studio which select rows of the table, and uses these filtered lists as inputs to charts. The general idea is that the author should be 3) Maximimize extensibility for visualization and widgets able to seamlessly move between manipulating and creating data on the client side and analytics libraries, data sources and tables in the Notebook, and filtering and visualizing them in the hardware resources on the server side; dashboard. 4) Easy, simple publication; Data Flow and Conceptual Picture The Galyleo Data Model and Architecture is discussed in detail Using Galyleo below. The central idea is to have a few, orthogonal, easily-grasped The general usage model of Galyleo is that a Notebook is being concepts which make data manipulation easy and intuitive. The edited and executed in one tab of JupyterLab, and a corresponding basic concepts are as follows: dashboard file is being edited and executed in another; as the Notebook executes, it uses the Galyleo Client library to send 1) Table: A Table is a list of records, equivalent to a Pandas data to the dashboard file. To JupyterLab, the Galyleo Dashboard DataFrame [pdt20] [WM10] or a SQL Table. In general, Studio is just another editor; it reads and writes .gd.json files in in Galyleo, a Table is expected to be produced by an the current directory. external source, generally a Jupyter Notebook 2) Filter: A Filter is a logical function which applies to a The Dashboard Studio single column of a Table Table, and selects rows from the Table. Each Filter corresponds to a widget; widgets set A new Galyleo Dashboard can be launched from the JupyterLab the values Filter use to select Table rows launcher or from the File>New menu, as shown in Figure 1. 3) View A View is a subset of a Table selected by one or An existing dashboard is saved as a .gd.json file, and is more Filters. 
To create a view, the user chooses a Table, denoted with the Galyleo star logo. It can be opened in the usual and then chooses one or more Tilters to apply to the Table way, with a double-click. to select the rows for the View. The user can also statically Once a file is opened, or a new file created, a new Galyleo tab select a subset of the columns to include in the View. opens onto it. It resembles a simplified form of a Tableau, Looker, 4) Chart A Chart is a generic term for an object that displays or PowerBI editor. The collapsible right-hand sidebar offers the data graphically. Its input is a View or a Table. Each Chart ability to view Tables, and view, edit, or create Views, Filters, has a single data source. and Charts. The bottom half of the right sidebar gives controls for styling of text and shapes. The data flow is straightforward. A Table is updated from The top bar handles the introduction of decorative and styling an external source, or the user manipulates a widget. When this elements to the dashboard: labels and text, simple shapes such as happens, the affected item signals the dashboard controller that it ellipses, rectangles, polygons, lines, and images. All images are has been updated. The controller then signals all charts to redraw referenced by URL. themselves. Each Chart will then request updated data from its GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 15 source Table or View. A View then requests its configured filters for their current logic functions, and passes these to the source Table with a request to apply the filters and return the rows which are selected by all the filters (in the future, a more general Boolean will be applied; the UI elements to construct this function are under design). The Table then returns the rows which pass the filters; the View selects the static subset of columns it supports, and passes this to its Charts, which then redraw themselves. Each item in this flow conceptually has a single data source, but multiple data targets. There can be multiple Views over a Table, but each View has a single Table as a source. There can be multiple charts fed by a View, but each Chart has a single Table or View as a source. It’s important to note that there are no special cases. There is no distinction, as there is in most visualization systems, between a "Dimension" or a "Measure"; there are simply columns of data, Fig. 4: Figure 4. A Published Galyleo Dashboard which can be either a value or category axis for any Chart. From this simplicity, significant generality is achieved. For example, a filter selects values from any column, whether that column is and configuration gives instant feedback and tight control over providing value or category. Applying a range filter to a category appearance. For example, the authors of a LaTeX paper (including column gives natural telescoping and zooming on the x-axis of a this one) can’t control the placement of figures within the text. The chart, without change to the architecture. fourth, which is correct, is that configuration code is more verbose, error-prone, and time-consuming than manual configuration. Drilldowns What is less often appreciated is that when operations become An important operation for any interactive dashboard is drill- sufficiently complex, coding is a much simpler interface than downs: expanding detail for a datapoint on a chart. The user manual configuration. 
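The dataflow just described is compact enough to restate in code. The sketch below is a condensed Python illustration of that flow, not Galyleo's JavaScript implementation: the dashboard controller is folded into the update/refresh calls, and every class, method, and column name is invented for this example.

```python
# Condensed illustration of the Table -> Filter -> View -> Chart flow.
class Table:
    """A list of records: columns are (name, type) pairs, rows are lists."""
    def __init__(self, columns, rows):
        self.columns, self.rows, self.views = columns, rows, []

    def update(self, rows):
        """An external source pushed new data: signal every dependent view."""
        self.rows = rows
        for view in self.views:
            view.refresh()

    def select(self, logic_functions):
        """Return the rows selected by all of the supplied filter functions."""
        names = [name for name, _ in self.columns]
        return [row for row in self.rows
                if all(fn(dict(zip(names, row))) for fn in logic_functions)]

class RangeFilter:
    """Stands in for a widget: it holds the values the user picked."""
    def __init__(self, column, low, high):
        self.column, self.low, self.high = column, low, high

    def logic(self):
        return lambda record: self.low <= record[self.column] <= self.high

class View:
    """A subset of a Table selected by one or more filters."""
    def __init__(self, table, filters):
        self.table, self.filters, self.charts = table, filters, []
        table.views.append(self)

    def refresh(self):
        rows = self.table.select([f.logic() for f in self.filters])
        for chart in self.charts:  # one source, many targets
            chart.draw(rows)

class Chart:
    def __init__(self, view):
        view.charts.append(self)

    def draw(self, rows):
        print(f"redrawing with {len(rows)} rows")

deaths = Table([("month", "string"), ("count", "number")],
               [["Jan", 120], ["Feb", 310], ["Mar", 95]])
view = View(deaths, [RangeFilter("count", 100, 400)])
Chart(view)
deaths.update([["Apr", 80], ["May", 210], ["Jun", 330]])  # -> "redrawing with 2 rows"
```

A widget would work the same way: changing its value updates its filter's bounds and triggers refresh() on the views that use it, so each chart still redraws from its single data source.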
For example, building a pivot table in a should be able to click on a chart and see a detailed view of spreadsheet using point-and-click operations have "always had a the data underlying the datapoint. This was naturally implemented reputation for being complicated" [Dev]. It’s three lines of code in in our system by associating a filter with every chart: every chart Python, even without using the Pandas pivot_table method. Most in Galyleo is also a Select Filter, and it can be used as a Filter in analytics procedures are far more easily done in code. a view, just as any other widget can be. As a result, Galyleo is an appropriate-code environment, which is an environment which combines a coding interface Publishing The Dashboard for complex, large-scale, or abstract operations and a point- Once the dashboard is complete, it can be published to the and-click interface for simple, concrete, small-scale operations. web simply by moving the dashboard file to any place it get Galyleo combines broadly powerful Jupyter-based code and low- an URL (e.g. a github repo). It can then be viewed by visiting code libraries for analytics paired with fast GUI-based design and https://galyleobeta.engagelively.com/public/galyleo/index.html? configuration for graphical elements and layout. dashboard=<url of dashboard file>. The attached figure shows a published Galyleo Dashboard, which displays Florence Galyleo Data Model And Architecture Nightingale’s famous Crimean War dataset. Using the double sliders underneath the column charts telescope the x axes, The Galyleo data Model and architecture closely model the effectively permitting zooming on a range; clicking on a column dashboard architecture discussed in the previous section. They are shows the detailed death statistics for that month in the pie chart based on the idea of a few simple, generalizable structures, which above the column chart. are largely independent of each other and communicate through simple interfaces. No-Code, Low-Code, and Appropriate-Code The GalyleoTable Galyleo is an appropriate-code environment, meaning that it offers A GalyleoTable is the fundamental data structure in Galyleo. It efficient creation to developers at every step. It offers What-You- is a logical, not a physical abstraction; it simply responds to See-Is-What-You-Get (WYSIWYG) design tools where appro- the GalyleoTable API. A GalyleoTable is a pair (columns, rows), priate, low-code where appropriate, and full code creation tools where columns is a list of pairs (name, type), where type is one where appropriate. of {string, boolean, number, date}, and rows is a list of lists of No-code and low-code environments, where users construct primitive values, where the length of each component list is the applications through a visual interface, are popular for several length of the list of columns and the type of the kth entry in each reasons. The first is the assumption that coding is time-consuming list is the type specified by the kth column. and hard, which isn’t always or necessarily true; the second is Small, public tables may be contained in the dashboard file; the assumption that coding is a skill known to only a small these are called explicit tables. However, explicitly representing fraction of the population, which is becoming less true by the the table in the dashboard file has a number of disadvantages: day. 40% of Berkeley undergraduates take Data 8, in which every assignment involves programming in a Jupyter Notebook. 
1) An explicit table is in the memory of the client viewing The third, particularly for graphics code, is that manual design the dashboard; if it is too large, it may cause signifi- 16 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) cant performance problems on the dashboard author or viewer’s device 2) Since the dashboard file is accessible on the web, any data within it is public 3) The data may be continuously updated from a source, and it’s inconvenient to re-run the Notebook to update the data. Therefore, the GalyleoTable can be of one of three types: 1) A data server that implements the Table REST API 2) A JavaScript object within the dashboard page itself 3) A JavaScript messenger in the page that implements a messaging version of the API Fig. 5: Figure 5. Galyleo Dataflow with Remote Tables An explicit table is simply a special case of (2) -- in this case, the JavaScript object is simply a linear list of rows. Comments These are not exclusive. The JavaScript messenger case is designed to support the ability of a containing application within Again, simplicity and orthogonality have shown tremendous bene- the browser to handle viewer authentication, shrinking the security fits here. Though filters conceptually act as selectors on rows, they vulnerability footprint and ensuring that the client application may perform a variety of roles in implementations. For example, controls the data going to the dashboard. In general, aside from a table produced by a simulator may be controlled by a parameter performing tasks like authentication, the messenger will call an value given by a Filter function. external data server for the values themselves. Whether in a Data Server, a containing application, or a Extending Galyleo JavaScript object, Tables support three operations: Every element of the Galyleo system, whether it is a widget, Chart, Table Server, or Filter is defined exclusively through a small set 1) Get all the values for a specific column of public APIs. This is done to permit easy extension, by either 2) Get the max/min/increment for a specific numeric column the Galyleo team, users, or third parties. A Chart is defined as an 3) Get the rows which match a boolean function, passed in object which has a physical HTML representation, and it supports as a parameter to the operation four JavaScript methods: redraw (draw the chart), set data (set the Of course, (3) is the operation that we have seen above, to chart’s data), set options (set the chart’s options), and supports populate a view and a chart. (1) and (2) populate widgets on the table (a boolean which returns true if and only if the chart can dashboard; (1) is designed for a select filter, which is a widget draw the passed-in data set). In addition, it exports out a defined that lets a user pick a specific set of values for a column; (2) is JSON structure which indicates what options it supports and the an optimization for numeric filters, so that the entire list of values types of their values; this is used by the Chart Editor to display a for the column need not be sent -- rather, only the start and end configurator for the chart. values, and the increment between them. Similarly, the underlying lively.next system supports user design of new filters. 
Again, a filter is simply an object with a Each type of table specifies a source, additional information physical presence, that the user can design in lively, and supports a (in the case of a data server, for example, any header variables specific API -- broadly, set the choices and hand back the Boolean that must be specified in order to fetch the data), and, optionally, function as a JSON object which will be used to filter the data. a polling interval. The latter is designed to handle live data; the dashboard will query the data source at each polling interval to lively.next see if the data has changed. Any system can be used to extend Galyleo; at the end of the The choice of these three table instantiations (REST, day, all that need be done is encapsulate a widget or chart in JavaScript object, messenger) is that they provide the key founda- a snippet of HTML with a JavaScript interface that matches tional building block for future extensions; it’s easy to add a SQL the Galyleo protocol. This is done most easily and quickly connection on top of a REST interface, or a Python simulator. by using lively.next [SKH21]. lively.next is the latest in a line of Smalltalk- and Squeak-inspired [IKM+ 97] JavaScript/HTML Filters integrated development environments that began with the Lively Tables must be filtered in situ. One of the key motivators behind Kernel [IPU+ 08] [KIH+ 09] and continued through the Lively Web remote tables is in keeping large amounts of data from hitting the [LKI+ 12] [IFH+ 16] [TM17]. Galyleo is an application built in browser. This is largely defeated if the entire table is sent to the Lively, following the work done in [HIK+ 16]. dashboard and then filtered there. As a result, there is a Filter API Lively shares with Jupyter an emphasis on live programming together with the Table API whereever there are tables. [KRB18], orwhere a Read-Evaluate-Act Loop (REAL) program- The data flow of the previous section remains unchanged; ming style. It adds to that a combination of visual and text it is simply that the filter functions are transmitted to wherever programming [ABF20], where physical objects are positioned and the tables happen to be. The dataflow in the case of remote configured largely by hand as done with any drawing or design tables (whether messenger-based or REST-based) is shown here, program (e.g., PowerPoint, Illustrator, DrawPad, Google Draw) with operations that are resident where the table is situated and and programmed with a built-in editor and workspace, similar in operations resident on the dashboard clearly shown. concept if not form to a Jupyter Notebook. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 17 2) acceptsDataset(<Table or View>) returns a boolean de- pending on whether this chart can draw the data in this view. For example, a Table Chart can draw any tabular data; a Geo Chart typically requires that the first column be a place specifier. In addition, it has a read-only property: 1) optionSpec: A JSON structure describing the options for the chart. This is a dictionary, which specifies the name of each option, and its type (color, number, string, boolean, or enum with values given). Each type corresponds to a specific UI widget that the chart editor uses. And two read write properties: 1) options: The current options, as a JSON dictionary. This Fig. 6: Figure 6. The lively.next environment matches exactly the JSON dictionary in optionSpec, with values in place of the types. 
2) dataSource: a string, the name of the current Galyleo Lively abstracts away HTML and CSS tags in graphical Table or Galyleo View objects called "Morphs". Morphs [MS95] were invented as the user interface layer for Self [US87], and have been used as Typically, an extension to Galyleo’s charting capabilities is the foundation of the graphics system in Squeak and Scratch done by incorporating the library as described in the previous [MRR+ 10]. In this UI, every physical object is a Morph; these section, implementing the API given in this section, and then can be as simple as a simple polygon or text string to a full publishing the result as a component application. Morphs are combined via composition, similar to the way that objects are grouped in a presentation or drawing program. Extending Galyleo’s Widget Set The composition is simply another Morph, which in turn can be A widget is a graphical item used to filter data. It operates on a composed with other Morphs. In this manner, complex Morphs single column on any table in the current data set. It is either a can be built up from collections of simpler ones. For example, range filter (which selects a range of numeric values) or a select a slider is simply the composition of a circle (the knob) with a filter (which selects a specific value, or a set of specific values). thin, long rectangle (the bar). Each Morph can be individually The API that is implemented consists only of properties. programmed as a JavaScript object, or can inherit base level 1) valueChanged : a signal, which is fired whenever the behavior and extend it. value of the widget is changed In lively.next, each morph turns into a snippet of HTML, CSS, 2) value: read-write. The current value of the widget and JavaScript code and the entire application turns into a web 3) filter: read-only. The current filter function, as a JSON page. The programmer doesn’t see the HTML and CSS code structure directly; these are auto-generated. Instead, the programmer writes 4) allValues: read-write, select filters only. JavaScript code for both logic and configuration (to the extent that 5) column: read-only. The name of the column of this the configuration isn’t done by hand). The code is bundled with widget. Set when the widget is created the object and integrated in the web page. 6) numericSpec: read-write. A dictionary containing the Morphs can be set as reusable components by a simple numeric specification for a numeric or date filter declaration. They can then be reused in any lively design. Widgets are typically designed as a standard Lively graphical Incorporating New Libraries component, much as the slider described above. Libraries are typically incorporated into lively.next by attaching them to a convenient physical object, importing the library from a Integration into Jupyter Lab: The Galyleo Extension package manager such as npm, and then writing a small amount Galyleo is a standalone web application that is integrated into of code to expose the object’s API. The simplest form of this is to JupyterLab using an iframe inside a JupyterLab tab for physical assign the module to an instance variable so it has an addressable design. A small JupyterLab extension was built that implements name, but typically a few convenience methods are written as well. the JupyterLab editor API. 
The JupyterLab extension has two In this way, a large number of libraries have been incorporated major functions: to handle read/write/undo requests from the as reusable components in lively.next, including Google Maps, JupyterLab menus and file browser, and receive and transmit Google Charts [goo], Chartjs [cha], D3 [BOH11], Leaflet.js [lea], messages from the running Jupyter kernels to update tables on OpenLayers [ope], cytoscape:ono and many more. the Dashboard Studio, and to handle the reverse messages where Extending Galyleo’s Charting and Visualization capabilities the studio requests data from the kernel. Standard Jupyter and browser mechanisms are used. File sys- A Galyleo Chart is anything that changes its display based on tem requests come to the extension from the standard Jupyter API, tabular data from a Galyleo Table or Galyleo View. It responds to exactly the same requests and mechanisms that are sent to a Mark- a specific API, which includes two principal methods: down or Notebook editor. The extension receives them, and then 1) drawChart: redraw the chart using the current tabular data uses standard browser-based messaging (window.postMessage) to from the input or view signal the standalone web app. Similarly, when the extension 18 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) of environments hosted by a server is arbitrary, and the cost is only the cost of maintaining the Dockerfile for each environment. An environment is easy to design for a specific class, project, or task; it’s simply adding libraries and executables to a base Dockerfile. It must be tested, of course, but everything must be. And once it is tested, the burden of software maintenance and installation is removed from the user; the user is already in a task- customized, curated environment. Of course, the usual installation tools (apt, pip, conda, easy_install ) can be pre-loaded (they’re just executables) so if the environment designer missed something it can be added by the end user. Though a user can only be in one environment at a time, persistent storage is shared across all environments, meaning Fig. 7: Figure 7. Galyleo Extension Architecture switching environments is simply a question of swapping one environment out and starting another. Viewed in this light, a JupyterHub is a multi-purpose computer makes a request of JupyterLab, it does so through this mechanism in the Cloud, with an easy-to-use UI that presents through a and a receiver in the extension gets it and makes the appropriate browser. JupyterLab isn’t simply an IDE; it’s the window system method calls within JupyterLab to achieve the objective. and user interface for this computer. The JupyterLab launcher is When a kernel makes a request through the Galyleo Client, the desktop for this computer (and it changes what’s presented, this is handled exactly the same way. A Jupyter messaging server depending on the environment); the file browser is the computer’s within the extension receives the message from the kernel, and file browser, and the JupyterLab API is the equivalent of the Win- then uses browser messaging to contact the application with the dows or MacOS desktop APIs and window system that permits request, and does the reverse on a Galyleo message to the kernel. third parties to build applications for this. This is a highly efficient method of interaction, since browser- based messaging is in-memory transactions on the client machine. 
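To make the kernel-to-dashboard path concrete, the sketch below shapes a Pandas DataFrame into the (columns, rows) form described earlier and hands it to the front end over a Jupyter comm, one standard channel for kernel code to reach a JupyterLab extension. The comm target name and payload keys are assumptions made for this sketch, and Galyleo's own client library wraps this plumbing differently; treat it as an illustration of the relay, not the actual API.

```python
import pandas as pd
from ipykernel.comm import Comm  # assumes this runs inside a Jupyter kernel

def column_type(dtype) -> str:
    """Map Pandas dtypes onto the four Galyleo column types named in the text."""
    if pd.api.types.is_bool_dtype(dtype):
        return "boolean"
    if pd.api.types.is_numeric_dtype(dtype):
        return "number"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "date"
    return "string"

def to_table_payload(df: pd.DataFrame) -> dict:
    """Shape a DataFrame as a (columns, rows) table."""
    return {
        "columns": [[name, column_type(dtype)] for name, dtype in df.dtypes.items()],
        "rows": df.values.tolist(),
    }

df = pd.DataFrame({"month": ["Jan", "Feb"], "deaths": [120, 310]})
payload = {"name": "nightingale", "table": to_table_payload(df)}

# Hypothetical comm target: the extension side would receive this message
# and relay it to the dashboard iframe with window.postMessage.
comm = Comm(target_name="galyleo_table", data=payload)
comm.send(payload)
```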
This Jupyter Computer has a large number of advantages over It’s important to note that there is nothing Galyleo-specific a standard desktop or laptop computer. It can be accessed from any about the extension: the Galyleo Extension is a general method device, anywhere on Earth with an Internet connection. Software for any standalone web editor (e.g., a slide or drawing editor) to installation and maintenance issues are nonexistent. Data loss due be integrated into JupyterLab. The JupyterLab connection is a few to hardware failure is extremely unlikely; backups are still required tens of lines of code in the Galyleo Dashboard. The extension is to prevent accidental data loss (e.g., erroneous file deletion), but slightly more complex, but it can be configured for a different they are far easier to do in a Cloud environment. Hardware application with a simple data structure which specifies the URL resources such as disk, RAM, and CPU can be added rapidly, of the application, file type and extension to be manipulated, and on a permanent or temporary basis. Relatively exotic resources message list. (e.g., GPUs) can also be added, again on an on-demand, temporary basis. The advantages go still further than that. Any resource that The Jupyter Computer can be accessed over a network connection can be added to The implications of the Galyleo Extension go well beyond vi- the Jupyter Computer simply by adding the appropriate accessor sualization and dashboards and easy publication in JupyterLab. library to an environment’s Dockerfile. For example, a database JupyterLab is billed as the next-generation integrated Develop- solution such as Snowflake, BigQuery, or Amazon Aurora (or ment Environment for Jupyter, but in fact it is substantially more one of many others) can be "installed" by adding the relevant than that. It is the user interface and windowing system for Cloud- library module to the environment. Of course, the user will need based personal computing. Inspired by previous extensions such to order the database service from the relevant provider, and obtain as the Vega Extension, the Galyleo Extensions seeks to provide authentication tokens, and so on -- but this is far less troublesome the final piece of the puzzle. than even maintaining the library on the desktop. Consider a Jupyter server in the Cloud, served from a Jupyter- However, to date the Jupyter Computer only supports a few Hub such as the Berkeley Data Hub. It’s built from a base window-based applications, and adding a new application is a Ubuntu image, with the standard Jupyter libraries installed and, time-consuming development task. The applications supported are importantly, a UI that includes a Linux terminal interface. Any familiar and easy to enumerate: a Notebook editor, of course; a Linux executable can be installed in the Jupyter server image, as Markdown Viewer; a CSV Viewer; a JSON Viewer (not inline can any Jupyter kernel, and any collection of libraries. The Jupyter editor), and a text editor that is generally used for everything from server has per-user persistent storage, which is organized in a Python files to Markdown to CSV. standard Linux filesystem. This makes the Jupyter server a curated This is a small subset of the rich range of JavaScript/HTML5 execution environment with a Linux command-line interface and applications which have significant value for Jupyter Computer a Notebook interface for Jupyter execution. users. 
For example, the Ace Code Editor supports over 110 A JupyterHub similar to Berkeley Data Hub (essentially, languages and has the functionality of popular desktop editors anything built from Zero 2 Jupyter Hub or Q-Hub) comes with a such as Vim and Sublime Text. There are over 1100 open-source number of "environments". The user chooses the environment on drawing applications on the JavaScript/HTML5 platform; multiple startup. Each environment comes with a built-in set of libraries and spreadsheet applications, the most notable being jExcel, and many executables designed for a specific task or set of tasks. The number more. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 19 Fig. 8: Figure 8. Galyleo Extension Application-Side messaging Fig. 9: Figure 9. Generations of Internet Computing Up until now, adding a new application to JupyterLab involved writing a hand-coded extension in Typescript, and compiling it into JupyterLab. However, the Galyleo Extension has been the user uses any of a wide variety of text editors to prepare the designed so that any HTML5/JavaScript application can be added document, any of a wide variety of productivity and illustrator easily, simply by configuring the Galyleo Extension with a small programs to prepare the images, runs this through a local sequence JSON file. of commands (e.g., pdflatex paper; bibtex paper; pdflatex paper. The promise of the Galyleo Extension is that it can be adapted Usually Github or another repository is used for storage and to any open-source JavaScript/HTML5 application very easily. collaboration. The Galyleo Extension merely needs the: In a Cloud service, this is another matter. There is at most one editor, selected by the service, on the site. There is no • URL of the application image editing or illustrator program that reads and writes files • File extension that the application reads/writes on the site. Auxiliary tools, such as a bib searcher, aren’t present • URL of an image for the launcher or aren’t customizable. The service has its own siloed storage, • Name of the application for the file menu its own text editor, and its own document-preparation pipeline. The application must implement a small messaging client, The tools (aside from the core document-preparation program) using the standard JavaScript messaging interface, and implement are primitive. The online service has two advantages over the the calls the Galyleo Extension makes. The conceptual picture is personal-device service. Collaboration is generally built-in, with shown im Figure 8. multiple people having access to the project, and the software need And it must support (at a minimum) messages to read and not be maintained. Aside from that, the personal-device experience write the file being edited. is generally superior. In particular, the user is free to pick their own editor, and doesn’t have to orchestrate multiple downloads and The Third Generation of Network Computing uploads from various websites. The usual collection of command- The World-Wide Web and email comprised the first generation line utilities are available to small touchups. of Internet computing (the Internet had been around for a decade The third generation of Internet Computing represented by the before the Web, and earlier networks dated from the sixties, but Jupyter Computer. 
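The four items the extension needs from an application, listed above, amount to a small configuration record. The snippet below shows what such a record could look like for a hypothetical drawing editor; the key names and values are invented for illustration and are not the extension's actual schema.

```python
import json

# Hypothetical registration record for a standalone HTML5/JavaScript editor;
# the key names are invented and are not the extension's actual schema.
drawing_editor_config = {
    "applicationUrl": "https://example.org/draw/index.html",   # URL of the application
    "fileExtension": ".draw.json",                              # file type it reads and writes
    "launcherImageUrl": "https://example.org/draw/icon.svg",    # image for the launcher
    "applicationName": "Drawing Editor",                        # name for the File menu
}

print(json.dumps(drawing_editor_config, indent=2))
```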
This offers a Cloud experience similar to the the Web and email were the first mass-market applications on personal computer, but with the scalability, reliability, and ease of the network), and they were very simple -- both were document- collaboration of the Cloud. exchange applications, using slightly different protocols. The second generation of Network applications were the siloed pro- Conclusion and Further Work ductivity applications, where standard desktop applications moved The vision of the Jupyter Computer, bringing the power of the to the Cloud. The most famous example is of course GSuite Cloud to the personal computing experience has been started and Office 365, but there were and are many others -- Canva, with Galyleo. It will not end there. At the heart of it is a Loom, Picasa, as well as a large number of social/chat/social composition of two broadly popular platforms: HTML5/JavaScript media applications. What they all had in common was that they for presentation and interaction, and the various Jupyter kernels were siloed applications which, with the exception of the office for server-side analytics. Galyleo is a start at seamless interaction suites, didn’t even share a common store. In many ways, this of these two platforms. Continuing and extending this is further second generation of network applications recapitulates the era development of narrow-waist protocols to permit maximal inde- immediately prior to the introduction of the personal computer. pendent development and extension. That era was dominated by single-application computers such as word processors, which were simply computers with a hardcoded program loaded into ROM. Acknowledgements The Word Processor era was due to technological limitations The authors wish to thank Alex Yang, Diptorup Deb, and for -- the processing power and memory to run multiple programs their insightful comments, and Meghann Agarwal for stewardship. simply wasn’t available on low-end hardware, and PC operating We have received invaluable help from Robert Krahn, Marko systems didn’t yet exist. In some sense, the current second genera- Röder, Jens Lincke and Linus Hagemann. We thank the en- tion of Internet Computing suffers from similar technological con- gageLively team for all of their support and help: Tim Braman, straints. The "Operating System" for Internet Computing doesn’t Patrick Scaglia, Leighton Smith, Sharon Zehavi, Igor Zhukovsky, yet exist. The Jupyter Computer can provide it. Deepak Gupta, Steve King, Rick Rasmussen, Patrick McCue, To see the difference that this can make, consider LaTeX (per- Jeff Wade, Tim Gibson. The JupyterLab development commu- haps preceded by Docutils, as is the case for SciPy) preparation of nity has been helpful and supportive; we want to thank Tony a document. On a personal computer, it’s fairly straightforward; Fast, Jason Grout, Mehmet Bektas, Isabela Presedo-Floyd, Brian 20 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Granger, and Michal Krassowski. The engageLively Technology [hol22b] Installation - holoviews v1.14.9, May 2022. URL: https: Advisory Board has helped shape these ideas: Ani Mardurkar, //holoviews.org/. [IFH+ 16] Daniel Ingalls, Tim Felgentreff, Robert Hirschfeld, Robert Priya Joseph, David Peterson, Sunil Joshi, Michael Czahor, Isha Krahn, Jens Lincke, Marko Röder, Antero Taivalsaari, and Oke, Petrus Zwart, Larry Rowe, Glenn Ricart, Sunil Joshi, Antony Tommi Mikkonen. A world of active objects for work and play: Ng. 
We want to thank the people from the AWS team that have The first ten years of lively. In Proceedings of the 2016 ACM helped us tremendously: Matt Vail, Omar Valle, Pat Santora. International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2016, page Galyleo has been dramatically improved with the assistance of our 238–249, New York, NY, USA, 2016. Association for Comput- Japanese colleagues at KCT and Pacific Rim Technologies: Yoshio ing Machinery. URL: https://doi.org/10.1145/2986012.2986029, Nakamura, Ted Okasaki, Ryder Saint, Yoshikazu Tokushige, and doi:10.1145/2986012.2986029. Naoyuki Shimazaki. Our undestanding of Jupyter in an academic [IKM+ 97] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. Back to the future: The story of squeak, a prac- context came from our colleagues and friends at Berkeley, the tical smalltalk written in itself. In Proceedings of the 12th University of Victoria, and UBC: Shawna Dark, Hausi Müller, ACM SIGPLAN Conference on Object-Oriented Programming, Ulrike Stege, James Colliander, Chris Holdgraf, Nitesh Mor. Use Systems, Languages, and Applications, OOPSLA ’97, page 318–326, New York, NY, USA, 1997. Association for Comput- of Jupyter in a research context was emphasized by Andrew ing Machinery. URL: https://doi.org/10.1145/263698.263754, Weidlea, Eli Dart, Jeff D’Ambrogia. We benefitted enormously doi:10.1145/263698.263754. from the CITRIS Foundry: Alic Chen, Jing Ge, Peter Minor, Kyle [IPU+ 08] Daniel Ingalls, Krzysztof Palacz, Stephen Uhler, Antero Taival- Clark, Julie Sammons, Kira Gardner. The Alchemist Accelerator saari, and Tommi Mikkonen. The lively kernel a self-supporting system on a web page. In Workshop on Self-sustaining Systems, was central to making this product: Ravi Belani, Arianna Haider, pages 31–50. Springer, 2008. doi:10.1007/978-3-540- Jasmine Sunga, Mia Scott, Kenn So, Aaron Kalb, Adam Frankl. 89275-5_2. Kris Singh was a constant source of inspiration and help. Larry [jup] Jupyterlab documentation. URL: https://jupyterlab.readthedocs. Singer gave us tremendous help early on. Vibhu Mittal more io/en/stable/. than anyone inspired us to pursue this road. Ken Lutz has been [KIH+ 09] Robert Krahn, Dan Ingalls, Robert Hirschfeld, Jens Lincke, and Krzysztof Palacz. Lively wiki a development environment for a constant sounding board and inspiration, and worked hand-in- creating and sharing active web content. In Proceedings of the hand with us to develop this product. Our early customers and 5th International Symposium on Wikis and Open Collaboration, partners have been and continue to be a source of inspiration, WikiSym ’09, New York, NY, USA, 2009. Association for Computing Machinery. URL: https://doi.org/10.1145/1641309. support, and experience that is absolutely invaluable: Jonathan 1641324, doi:10.1145/1641309.1641324. Tan, Roger Basu, Jason Koeller, Steve Schwab, Michael Collins, [KRB18] Juraj Kubelka, Romain Robbes, and Alexandre Bergel. The road Alefiya Hussain, Geoff Lawler, Jim Chimiak, Fraukë Tillman, to live programming: Insights from the practice. In Proceedings Andy Bavier, Andy Milburn, Augustine Bui. All of our customers of the 40th International Conference on Software Engineering, ICSE ’18, page 1090–1101, New York, NY, USA, 2018. Associ- are really partners, none moreso than the fantastic teams at Tanjo ation for Computing Machinery. 
URL: https://doi.org/10.1145/ AI and Ultisim: Bjorn Nordwall, Ken Lane, Jay Sanders, Eric 3180155.3180200, doi:10.1145/3180155.3180200. Smith, Miguel Matos, Linda Bernard, Kevin Clark, and Richard [KRKP+ 16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Boyd. We want to especially thank our investors, who bet on this Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul technology and company. Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. Jupyter Notebooks - a publishing format for reproducible computational workflows. IOS Press, 2016. URL: R EFERENCES https://eprints.soton.ac.uk/403913/. [lea] An open-source javascript library for interactive maps. URL: [ABF20] Leif Andersen, Michael Ballantyne, and Matthias Felleisen. https://leafletjs.com/. Adding interactive visual syntax to textual code. Proc. ACM [LKI+ 12] Jens Lincke, Robert Krahn, Dan Ingalls, Marko Roder, and Program. Lang., 4(OOPSLA), nov 2020. URL: https://doi.org/ Robert Hirschfeld. The lively partsbin–a cloud-based repository 10.1145/3428290, doi:10.1145/3428290. for collaborative development of active web content. In 2012 [BOH11] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 data- 45th Hawaii International Conference on System Sciences, pages driven documents. IEEE Transactions on Visualization and Com- 693–701, 2012. doi:10.1109/HICSS.2012.42. puter Graphics, 17(12):2301–2309, dec 2011. URL: https://doi. [loo] Looker. URL: https://looker.com/. org/10.1109/TVCG.2011.185, doi:10.1109/TVCG.2011. [LS10] Bruce Lawson and Remy Sharp. Introducing HTML5. New 185. Riders Publishing, USA, 1st edition, 2010. [cha] Chart.js. URL: https://www.chartjs.org/. [mdn] Window.postmessage() - web apis: Mdn. URL: https://developer. [Cro06] D. Crockford. The application/json media type for javascript mozilla.org/en-US/docs/Web/API/Window/postMessage. object notation (json). RFC 4627, RFC Editor, July 2006. http:// [MRR+ 10] John Maloney, Mitchel Resnick, Natalie Rusk, Brian Silverman, www.rfc-editor.org/rfc/rfc4627.txt. URL: http://www.rfc-editor. and Evelyn Eastmond. The scratch programming language org/rfc/rfc4627.txt, doi:10.17487/rfc4627. and environment. ACM Transactions on Computing Educa- [Dev] Erik Devaney. How to create a pivot table in excel: A step-by- tion (TOCE), 10(4):1–15, 2010. URL: https://doi.org/10.1145/ step tutorial. URL: https://blog.hubspot.com/marketing/how-to- 1868358.1868363, doi:10.1145/1868358.1868363. create-pivot-table-tutorial-ht. [DGHP13] Marcello D’Agostino, Dov M Gabbay, Reiner Hähnle, and [MS95] John H Maloney and Randall B Smith. Directness and liveness in Joachim Posegga. Handbook of tableau methods. Springer the morphic user interface construction environment. In Proceed- Science & Business Media, 2013. ings of the 8th annual ACM symposium on User interface and software technology, pages 21–28, 1995. URL: https://doi.org/ [goo] Charts: google developers. URL: https://developers.google.com/ 10.1145/215585.215636, doi:10.1145/215585.215636. chart/. [HIK+ 16] Matthew Hemmings, Daniel Ingalls, Robert Krahn, Rick [ope] Openlayers. URL: https://openlayers.org/. McGeer, Glenn Ricart, Marko Röder, and Ulrike Stege. Livetalk: [pan22] Panel, May 2022. URL: https://panel.holoviz.org/. A framework for collaborative browser-based replicated- [pdt20] The pandas development team. pandas-dev/pandas: Pandas, computation applications. In 2016 28th International Tele- February 2020. 
USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application

Amanda Catlett‡∗, Theresa R. Coumbe‡, Scott D. Christensen‡, Mary A. Byrant‡

Abstract—In the early 1990s the Automated Coastal Engineering System, ACES, was created with the goal of providing state-of-the-art computer-based tools to increase the accuracy, reliability, and cost-effectiveness of Corps coastal engineering endeavors. Over the past 30 years, ACES has become less and less accessible to engineers. An updated version of ACES was necessary for use in coastal engineering. Our goal was to bring the tools in ACES to a user-friendly web-based dashboard that would allow a wide range of users to easily and quickly visualize results. We will discuss how we restructured the code using class inheritance and the three libraries Param, Panel, and HoloViews to create an extensible, interactive, graphical user interface. We have created the USACE Coastal Engineering Toolkit, UCET, which is a web-based application that contains 20 of the tools in ACES. UCET serves as an outline for the process of taking a model or set of tools and developing a web-based application that can produce visualizations of the results.

Index Terms—GUI, Param, Panel, HoloViews

* Corresponding author: amanda.r.catlett@erdc.dren.mil
‡ ERDC

Copyright © 2022 Amanda Catlett et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The Automated Coastal Engineering System (ACES) was developed in response to a charge by LTG E. R. Heiberg III, who was the Chief of Engineers at the time, to provide improved design capabilities to the Corps coastal specialists. [Leenknecht]
In 1992, ACES was presented as an interactive computer-based design and analysis system in the field of coastal engineering. The tools consist of seven functional areas: Wave Prediction, Wave Theory, Structural Design, Wave Runup Transmission and Overtopping, Littoral Process, and Inlet Processes. These functional areas range from classical theory describing wave motion, to expressions resulting from tests of structures in wave flumes, to numerical models describing the exchange of energy from the atmosphere to the sea surface. The math behind these uses anything from simple algebraic expressions, both theoretical and empirical, to numerically intense algorithms. [Leenknecht][UG][shankar]

Originally, ACES was written in FORTRAN 77, resulting in a decreased ability to use the tool as technology has evolved. In 2017, the codebase was converted from FORTRAN 77 to MATLAB and Python. This conversion ensured that coastal engineers using this tool base would not need training in yet another coding language. In 2020, the Engineered Resilient Systems (ERS) Rapid Application Development (RAD) team undertook the project with the goal of deploying the ACES tools as a web-based application, and ultimately renamed it the USACE Coastal Engineering Toolkit (UCET).

The RAD team focused on updating the Python codebase utilizing Python's object-oriented programming and the newly developed HoloViz ecosystem. The team refactored the code to implement inheritance so the code is clean, readable, and scalable. The tools were given a Graphical User Interface (GUI) so the implementation as a web app would provide a user-friendly experience. This was done by using the HoloViz-maintained libraries: Param, Panel, and HoloViews.

This paper will discuss some of the steps that were taken by the RAD team to update the Python codebase to create a Panel application of the coastal engineering tools: in particular, refactoring the input and output variables with the Param library, the class hierarchy used, and the utilization of Panel and HoloViews for a user-friendly experience.

Refactoring Using Param

Each coastal tool in UCET has two classes, the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model, whereas the GUI class holds information for GUI visualization. To make implementation of the GUI more seamless we refactored the model variables to utilize the Param library. Param is a library that has the goal of simplifying the codebase by letting the programmer explicitly declare the types and values of parameters accepted by the code. Param can also be used seamlessly when implementing the GUI through Panel and HoloViews.

Each UCET tool's model class declares the input and output values used in the model as class parameters. Each input and output variable is declared and given the following metadata features:

• default: each input variable is defined as a Param with a default value taken from the 1992 ACES user manual
• bounds: each input variable is defined with range values taken from the 1992 ACES user manual
• doc or docstrings: input and output variables have the expected variable and a description of the variable defined as a doc. This is used as a label over the input and output widgets. Most docstrings follow the pattern <variable>: <description of variable [units, if any]>
• constant: the output variables all set constant equal to True, thereby restricting the user's ability to manipulate the value. Note that when calculations are being done they need to happen inside a with param.edit_constant(self) block
• precedence: input and output variables use precedence when there are instances where the variable does not need to be seen
The following is an example of an input parameter:

    H = param.Number(
        doc='H: wave height [{distance_unit}]',
        default=6.3,
        bounds=(0.1, 200)
    )

An example of an output variable is:

    L = param.Number(
        doc='L: Wavelength [{distance_unit}]',
        constant=True
    )

The model's main calculation functions mostly remained unchanged. However, the use of Param eliminated the need for code that handled type checking and bounds checks.
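To make the pattern above concrete, the sketch below shows how such parameters might sit inside a Parameterized model class. It is a minimal illustration only: the class name, the wave-period input, and the deep-water wavelength formula are assumptions for demonstration, not code taken from UCET.

    import param

    class DeepWaterWaveModel(param.Parameterized):
        # Hypothetical input and output parameters in the style described above.
        T = param.Number(doc='T: wave period [s]', default=8.0, bounds=(1.0, 1000.0))
        L = param.Number(doc='L: wavelength [m]', default=0.0, constant=True)

        def run_model(self):
            # Outputs are declared constant, so they are updated inside
            # param.edit_constant, as noted in the bullet list above.
            with param.edit_constant(self):
                self.L = 9.81 * self.T ** 2 / (2.0 * 3.141592653589793)

    model = DeepWaterWaveModel(T=10.0)
    model.run_model()   # model.L now holds the computed wavelength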
Class Hierarchy

UCET has twenty tools from six of the original seven functional areas of ACES. When we designed our class hierarchy, we focused on the visualization of the web application rather than the functional areas. Thus, each tool's class can be categorized as Base-Tool, Graph-Tool, Water-Tool, or Graph-Water-Tool. The Base-Tool has the coastal engineering models that do not have any water property inputs (such as water density) in the calculations and no graphical output. The Graph-Tool has the coastal engineering models that do not have any water property inputs in the calculations but have a graphical output. The Water-Tool has the coastal engineering models that have water property inputs in the calculations and no graphical output. The Graph-Water-Tool has the coastal engineering models that have water property inputs in the calculations and a graphical output. Figure 1 shows the flow of inheritance for each of those classes.

There are two general categories of classes in the UCET codebase: utility and tool-specific. Utility classes have methods and functions that are utilized across more than one tool. The Utility classes are:

• BaseDriver: holds the methods and functions that each tool needs to collect data, run coastal engineering models, and print data.
• WaterDriver: has the methods that make water density and water weight available to the models that need those inputs for the calculations.
• BaseGui: has the functions and methods for the visualization and utilization of all inputs and outputs within each tool's GUI.
• WaterTypeGui: has the widget for water selection.
• TabulatorDataGui: holds the functions and methods used for visualizing plots and the ability to download the data that is used for plotting.

Each coastal tool in UCET has two classes, the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model; it inherits directly from either the BaseDriver or the WaterTypeDriver. The tool's GUI class holds whatever GUI visualization information differs from the BaseGui, WaterTypeGui, and TabulatorDataGui classes. In figure 1 the model classes are labeled as Base-Tool Class, Graph-Tool Class, Water-Tool Class, and Graph-Water-Tool Class, and each has a corresponding GUI class.

Due to the inheritance in UCET, the first two questions that can be asked when adding a tool are: "Does this tool need water variables for the calculation?" and "Does this tool have a graph?". The developer can then add a model class and a GUI class and inherit based on figure 1. For instance, Linear Wave Theory is an application that yields first-order approximations for various parameters of wave motion as predicted by wave theory. It provides common items of interest such as water surface elevation, general wave properties, particle kinematics, and pressure as a function of wave height and period, water depth, and position in the wave form. This tool uses water density and has multiple graphs in its output. Therefore, Linear Wave Theory is considered a Graph-Water-Tool: its model class inherits from WaterTypeDriver, and its GUI class inherits from the linear wave theory model class, WaterTypeGui, and TabulatorDataGui.

GUI Implementation Using Panel and HoloViews

Each UCET tool has a GUI class where the Panel and HoloViews libraries are implemented. Panel is a hierarchical container that can lay out panes, widgets, or other Panels in an arrangement that forms an app or dashboard. The Pane is used to render any widget-like object such as Spinner, Tabulator, Button, CheckBox, Indicators, etc. Those widgets are used to gather user input and run the specific tool's model. UCET utilizes the following widgets to gather user input:

• Spinner: single numeric input values
• Tabulator: table input data
• CheckBox: true or false values
• Drop down: items that have a list of pre-selected values, such as which units to use

UCET utilizes indicators.Number, Tabulator, and graphs to visualize the outputs of the coastal engineering models. A single number is shown using indicators.Number, and graph data is displayed using the Tabulator widget to show the data behind the graph. The graphs are created using HoloViews and have tool options such as panning, zooming, and saving. Buttons are used to calculate, save the current run, and save the graph data.

All of these widgets are organized into five panels: title, options, inputs, outputs, and graph. BaseGui, WaterTypeGui, and TabulatorDataGui have methods that organize the widgets within the five panels that most tools follow. The "options" panel has a row that holds the drop-down selections for units and water type (if the tool is a Water-Tool); some tools have a second row in the "options" panel with other drop-down options. The input panel has two columns of spinner widgets with a calculation button at the bottom left. The output panel has two columns of indicators.Number for the single numeric output values; at the bottom of the output panel there is a button to "save the current profile". The graph panel is tabbed: the first tab shows the graph and the second tab shows the data provided within the graph. A visual outline of this can be seen in the following figure. Some of the UCET tools have more complicated input or output visualizations, and those tools' GUI classes add or modify methods to meet their needs.

Fig.: The general outline of a UCET tool GUI.
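As a rough illustration of this five-panel arrangement, the sketch below builds a skeleton layout with Panel. The widget names, labels, and layout choices are assumptions made for demonstration; they are not UCET's actual GUI classes.

    import pandas as pd
    import panel as pn

    pn.extension('tabulator')

    # Illustrative stand-ins for one tool's inputs and outputs.
    units = pn.widgets.Select(name='Units', options=['metric', 'english'])
    height = pn.widgets.Spinner(name='H: wave height [m]', value=6.3, step=0.1)
    calculate = pn.widgets.Button(name='Calculate', button_type='primary')
    wavelength = pn.indicators.Number(name='L: wavelength [m]', value=0.0, format='{value:.2f}')

    # Title, options, inputs, outputs, and a tabbed graph panel.
    app = pn.Column(
        pn.pane.Markdown('## Linear Wave Theory'),
        pn.Row(units),                                   # options panel
        pn.Row(pn.Column(height, calculate)),            # inputs panel
        pn.Row(wavelength),                              # outputs panel
        pn.Tabs(('Graph', pn.Spacer()),                  # graph panel, first tab
                ('Data', pn.widgets.Tabulator(pd.DataFrame({'x': [], 'y': []})))),
    )
    # app.servable() would expose the layout through `panel serve`.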
Current State

The developers have been documenting this project using GitHub and JIRA. UCET approaches software development from the perspective of someone within the field of research and development. Each tool within UCET is not inherently complex from the traditional software perspective. However, this codebase enables researchers to execute complex coastal engineering models in a user-friendly environment by leveraging open-source libraries in the scientific Python ecosystem such as Param, Panel, and HoloViews.

Currently, UCET is only deployed through the command-line panel serve command. UCET is awaiting the Security Technical Implementation Guide process before it can be launched as a website. As part of this security vetting process we plan to leverage continuous integration/continuous deployment (CI/CD) tools to automate the deployment process. While this process is happening, we have started to get feedback from coastal engineers to improve the tools' usability and accuracy and to add suggested features. To minimize the amount of computer-science knowledge the coastal engineers need, our team created a batch script. This script creates a conda environment, activates it, and runs the panel serve command to launch the app on a local host. The user only needs to click on the batch script for this to take place.

Other tests are being created to ensure the accuracy of the tools, using a testing framework to compare output from UCET with that of the original FORTRAN code. The biggest barrier to this testing strategy is getting data from the FORTRAN to compare with Python. Currently, there are tests for most of the tools that read a CSV file of input and output results from FORTRAN and compare them with what the Python code calculates.

Our team has also compiled an updated user guide on how to use the tool, what to expect from it, and a deeper description of any warning messages that might appear as the user adds input values. An example of a warning message would be: if a user chooses input values with which the application does not make physical sense, a warning message will appear under the output header and replace all output values. For a more concrete example, Linear Wave Theory has the vertical coordinate (z) and the water depth (d) as input values, and when their sum is less than zero the point is outside the waveform. Therefore, if a user makes a combination where the sum is less than zero, UCET will post a warning to tell the user that the point is outside the waveform. See the figure below for an example.

Fig.: An example of a warning message based on chosen inputs.
Results

Linear Wave Theory was described in the class hierarchy example. This Graph-Water-Tool utilizes most of the BaseGui methods. The biggest difference is that instead of having three graphs in the graph panel there is a plot-selector drop-down where the user can select which graph they want to see.

Windspeed Adjustment and Wave Growth provides a quick and simple estimate for wave growth over open-water and restricted fetches in deep and shallow water. This is a Base-Tool, as there are no graphs and no water variables in the calculations. This tool has four additional options in the options panel, where the user can select the wind observation type, fetch type, wave equation type, and whether knots are being used. Based on the selection of these options, the input and output variables change so that only what is used or calculated for those selections is shown.

Conclusion and Future Work

Thirty years ago, ACES was developed to provide improved design capabilities to Corps coastal specialists, and while these tools are still used today, it became more and more difficult for users to access them. Five years ago, there was a push to update the code base to one that coastal specialists would be more familiar with: MATLAB and Python. Within the last two years the RAD team was able to finalize the update so that users can access these tools without having years of programming experience. We were able to do this by utilizing classes, inheritance, and the Param, Panel, and HoloViews libraries. The use of inheritance allowed for shorter code bases and has made it so new tools can be added to the toolkit. Param, Panel, and HoloViews work cohesively together to not only run the models but also provide a simple interface.

Future work will involve expanding UCET to include current coastal engineering models and completing the security vetting process to deploy to a publicly accessible website. We plan to incorporate automated CI/CD to ensure smooth deployment of future versions. We also will continue to incorporate feedback from users and refine the code to ensure the application provides a quality user experience.

REFERENCES

[Leenknecht] David A. Leenknecht, Andre Szuwalski, and Ann R. Sherlock. 1992. Automated Coastal Engineering System - Technical Reference. Technical report. https://usace.contentdm.oclc.org/digital/collection/p266001coll1/id/2321/
[panel] "Panel: A High-Level App and Dashboarding Solution for Python." Panel 0.12.6 Documentation, Panel Contributors, 2019, https://panel.holoviz.org/.
[holoviz] "High-Level Tools to Simplify Visualization in Python." HoloViz 0.13.0 Documentation, HoloViz Authors, 2017, https://holoviz.org.
[UG] David A. Leenknecht, et al. "Automated Tools for Coastal Engineering." Journal of Coastal Research, vol. 11, no. 4, Coastal Education & Research Foundation, Inc., 1995, pp. 1108-1124. https://usace.contentdm.oclc.org/digital/collection/p266001coll1/id/2321/
[shankar] N.J. Shankar, M.P.R. Jayaratne. Wave run-up and overtopping on smooth and rough slopes of coastal structures. Ocean Engineering, Volume 30, Issue 2, 2003, Pages 221-238, ISSN 0029-8018, https://doi.org/10.1016/S0029-8018(02)00016-1

Fig. 1: Screen shot of Linear Wave Theory.
Fig. 2: Screen shot of Windspeed Adjustment and Wave Growth.

Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI

Luigi Cruz‡∗, Wael Farah‡, Richard Elkins‡

Abstract—A common technique adopted by the Search For Extraterrestrial Intelligence (SETI) community is monitoring electromagnetic radiation for signs of extraterrestrial technosignatures using ground-based radio observatories. The analysis is made using Python-based software called TurboSETI to detect narrowband drifting signals inside the recordings that can indicate a technosignature. The data stream generated by a telescope can easily reach the rate of terabits per second. Our goal was to improve the processing speed by writing a GPU-accelerated backend in addition to the original CPU-based implementation of the de-doppler algorithm used to integrate the power of drifting signals. We discuss how we ported a CPU-only program to leverage the parallel capabilities of a GPU using CuPy, Numba, and custom CUDA kernels. The accelerated backend reached a speed-up of an order of magnitude over the CPU implementation.

Index Terms—gpu, numba, cupy, seti, turboseti

* Corresponding author: lfcruz@seti.org
‡ SETI Institute

Copyright © 2022 Luigi Cruz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
1. Introduction

The Search for Extraterrestrial Intelligence (SETI) is a broad term used to describe the effort of locating any scientific proof of past or present technology that originated beyond the bounds of Earth. SETI can be performed in a plethora of ways: actively, by deploying orbiters and rovers around planets and moons within the solar system, or passively, by either searching for biosignatures in exoplanet atmospheres or "listening" to technologically capable extraterrestrial civilizations. One of the most common techniques adopted by the SETI community is monitoring electromagnetic radiation for narrowband signs of technosignatures using ground-based radio observatories. This search can be performed in multiple ways: with equipment primarily built for this task, like the Allen Telescope Array (California, USA); by renting observation time; or in the background while the primary user is conducting other observations. Other radio observatories useful for this search include the MeerKAT Telescope (Northern Cape, South Africa), the Green Bank Telescope (West Virginia, USA), and the Parkes Telescope (New South Wales, Australia).

The operation of a radio telescope is similar to that of an optical telescope. Instead of using optics to concentrate light onto an optical sensor, a radio telescope operates by concentrating electromagnetic waves onto an antenna using a large reflective structure called a "dish" ([Reb82]). The interaction between the metallic antenna and the electromagnetic wave generates a faint electrical current. This effect is then quantized by an analog-to-digital converter as voltages and transmitted to processing logic that extracts useful information from it. The data stream generated by a radio telescope can easily reach the rate of terabits per second because of the ultra-wide bandwidth of the radio spectrum. The current workflow utilized by Breakthrough Listen, the largest scientific research program aimed at finding evidence of extraterrestrial intelligence, consists of pre-processing and storing the incoming data as frequency-time binary files ([LCS+19]) in persistent storage for later analysis. This post-analysis is made possible using Python-based software called TurboSETI ([ESF+17]) to detect narrowband signals that could be drifting in frequency owing to the relative radial velocity between the observer on Earth and the transmitter. The offline processing speed of TurboSETI is directly related to the scientific output of an observation. Each voltage file ingested by TurboSETI is often on the order of a few hundred gigabytes. To process data efficiently without Python overhead, the program uses NumPy for near machine-level performance. To measure a potential signal's drift rate, TurboSETI uses a de-doppler algorithm to align the frequency axis according to a pre-set drift rate. Another algorithm called "hitsearch" ([ESF+17]) is then utilized to identify any signal present in the recorded spectrum. These two algorithms are the most resource-hungry elements of the pipeline, consuming almost 90% of the running time.

2. Approach

Multiple methods were utilized in this effort to write a GPU-accelerated backend and optimize the CPU implementation of TurboSETI. In this section, we describe the three main methods.

2.1. CuPy

The original implementation of TurboSETI heavily depends on NumPy ([HMvdW+20]) for data processing. To keep the number of modifications as low as possible, we implemented the GPU-accelerated backend using CuPy ([OUN+17]). This open-source library offers GPU acceleration backed by NVIDIA CUDA and AMD ROCm while using a NumPy-style API. This enabled us to reuse most of the code between the CPU and GPU-based implementations.
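The sketch below illustrates the kind of array-module switch this approach enables. It is not TurboSETI's actual code: the function name, the integer-channel shift-and-sum, and the arguments are simplified assumptions meant only to show how a NumPy-style API lets one code path serve both backends.

    import numpy as np

    try:
        import cupy as cp
    except ImportError:
        cp = None                                     # CPU-only fallback

    def dedoppler_sum(spectrogram, drift, use_gpu=False):
        # Toy shift-and-sum over time for a single drift rate (illustrative only).
        xp = cp if (use_gpu and cp is not None) else np
        data = xp.asarray(spectrogram)                # shape: (n_time, n_freq)
        total = xp.zeros(data.shape[1], dtype=data.dtype)
        for t in range(data.shape[0]):
            # Shift each time step by an integer number of channels and accumulate.
            total += xp.roll(data[t], -int(round(drift * t)))
        return total if xp is np else cp.asnumpy(total)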
2.2. Numba

Some computationally heavy methods of the original CPU-based implementation of TurboSETI were written in Cython. This approach has disadvantages: the developer has to be familiar with Cython syntax to alter the code, and the code requires additional logic to be compiled at installation time. Consequently, it was decided to replace Cython with pure Python methods decorated with the Numba ([LPS15]) accelerator. By leveraging the Just-In-Time (JIT) compiler from the Low Level Virtual Machine (LLVM), Numba can compile Python code into assembly code as well as apply Single Instruction/Multiple Data (SIMD) acceleration instructions to achieve near machine-level speeds.
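As a minimal illustration of this pattern (not one of TurboSETI's actual routines), a hot loop written in plain Python can be handed to Numba's JIT with a single decorator; the function below and its threshold logic are assumptions for demonstration.

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def count_hits(snr, threshold):
        # Compiled to machine code by Numba's LLVM-based JIT on first call,
        # standing in for the kind of Cython routine that was replaced.
        hits = 0
        for i in range(snr.shape[0]):
            if snr[i] > threshold:
                hits += 1
        return hits

    # Example: count_hits(np.random.rand(10_000_000).astype(np.float32), 0.999)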
2.3. Single-Precision Floating Point

The original implementation of the software handled the input data as double-precision floating-point numbers. This caused all the mathematical operations to take significantly longer because of the extended precision. The ultimate precision of the output product is inherently limited by the precision of the original input data, which in most cases is represented by an 8-bit signed integer. Therefore, the addition of a single-precision floating-point mode decreased the processing time without compromising the useful precision of the output data.

3. Results

To test the speed improvements between implementations we used files from previous observations coming from different observatories. Table 1 shows the processing times for three different files in double-precision mode. We can see that the CPU implementation based on Numba is measurably faster than the original CPU implementation based on Cython. At the same time, the GPU-accelerated backend processed the data 6.8 to 9.3 times faster than the original CPU-based implementation.

TABLE 1: Double-precision (float64) processing time benchmark with the Cython, Numba, and CuPy implementations.

    Impl.    Device    File A      File B      File C
    Cython   CPU       0.44 min    25.26 min   23.06 min
    Numba    CPU       0.36 min    20.67 min   22.44 min
    CuPy     GPU       0.05 min    2.73 min    3.40 min

Table 2 shows the same results as Table 1 but with single-precision floating points. The original Cython implementation was left out because it does not support single-precision mode. Here, the same data was processed 7.5 to 10.6 times faster than with the Numba CPU-based implementation.

TABLE 2: Single-precision (float32) processing time benchmark with the Numba and CuPy implementations.

    Impl.    Device    File A      File B      File C
    Numba    CPU       0.26 min    16.13 min   16.15 min
    CuPy     GPU       0.03 min    1.52 min    2.14 min

To illustrate the processing time improvement, a single observation containing 105 GB of data was processed in 12 hours by the original CPU-based TurboSETI implementation on an Intel i7-7700K CPU, and in just 1 hour and 45 minutes by the GPU-accelerated backend on an NVIDIA GTX 1070 Ti GPU.

4. Conclusion

The original implementation of TurboSETI worked exclusively on the CPU to process data. We implemented a GPU-accelerated backend to leverage the massive parallelization capabilities of a graphical device. The benchmark performed shows that the new CPU and GPU implementations take significantly less time to process observation data, resulting in more science being produced. Based on the results, the recommended configuration to run the program is with single-precision calculations on a GPU device.

REFERENCES

[ESF+17] J. Emilio Enriquez, Andrew Siemion, Griffin Foster, Vishal Gajjar, Greg Hellbourg, Jack Hickish, Howard Isaacson, Danny C. Price, Steve Croft, David DeBoer, Matt Lebofsky, David H. E. MacMahon, and Dan Werthimer. The Breakthrough Listen search for intelligent life: 1.1–1.9 GHz observations of 692 nearby stars. The Astrophysical Journal, 849(2):104, Nov 2017. https://ui.adsabs.harvard.edu/abs/2017ApJ...849..104E/abstract, doi:10.3847/1538-4357/aa8d1b.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.
[LCS+19] Matthew Lebofsky, Steve Croft, Andrew P. V. Siemion, Danny C. Price, J. Emilio Enriquez, Howard Isaacson, David H. E. MacMahon, David Anderson, Bryan Brzycki, Jeff Cobb, Daniel Czech, David DeBoer, Julia DeMarines, Jamie Drew, Griffin Foster, Vishal Gajjar, Nectaria Gizani, Greg Hellbourg, Eric J. Korpela, and Brian Lacki. The Breakthrough Listen search for intelligent life: public data, formats, reduction, and archiving. Publications of the Astronomical Society of the Pacific, 131(1006):124505, Nov 2019. https://arxiv.org/abs/1906.07391, doi:10.1088/1538-3873/ab3e82.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM '15, New York, NY, USA, 2015. Association for Computing Machinery. https://doi.org/10.1145/2833157.2833162, doi:10.1145/2833157.2833162.
[OUN+17] Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido, and Crissman Loomis. CuPy: a NumPy-compatible library for NVIDIA GPU calculations. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017. http://learningsys.org/nips17/assets/papers/paper_16.pdf.
[Reb82] Grote Reber. Cosmic Static, pages 61–69. Springer Netherlands, Dordrecht, 1982. https://doi.org/10.1007/978-94-009-7752-5_6, doi:10.1007/978-94-009-7752-5_6.

Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration

Pi-Yueh Chuang‡∗, Lorena A. Barba‡

* Corresponding author: pychuang@gwu.edu
‡ Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, DC 20052, USA

Copyright © 2022 Pi-Yueh Chuang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract—Though PINNs (physics-informed neural networks) are now deemed a complement to traditional CFD (computational fluid dynamics) solvers rather than a replacement, their ability to solve the Navier-Stokes equations without given data is still of great interest. This report presents our not-so-successful experiments of solving the Navier-Stokes equations with PINN as a replacement for traditional solvers. We aim, with our experiments, to prepare readers for the challenges they may face if they are interested in data-free PINN. In this work, we used two standard flow problems: the 2D Taylor-Green vortex at Re = 100 and 2D cylinder flow at Re = 200. The PINN method solved the 2D Taylor-Green vortex problem with acceptable results, and we used this flow as an accuracy and performance benchmark. About 32 hours of training were required for the PINN method's accuracy to match the accuracy of a 16 × 16 finite-difference simulation, which took less than 20 seconds. The 2D cylinder flow, on the other hand, did not produce a physical solution. The PINN method behaved like a steady-flow solver and did not capture the vortex shedding phenomenon. By sharing our experience, we would like to emphasize that the PINN method is still a work in progress, especially in terms of solving flow problems without any given data. More work is needed to make PINN feasible for real-world problems in such applications. (Reproducibility package: [Chu22].)

Index Terms—computational fluid dynamics, deep learning, physics-informed neural network

1. Introduction

Recent advances in computing and programming techniques have motivated practitioners to revisit deep learning applications in computational fluid dynamics (CFD). We use the verb "revisit" because deep learning applications in CFD already existed going back to at least the 1990s, for example, using neural networks as surrogate models ([LS], [FS]). Another example is the work of Lagaris and colleagues ([LLF]) on solving partial differential equations with fully-connected neural networks back in 1998. Similar work with radial basis function networks can be found in reference [LLQH]. Nevertheless, deep learning applications in CFD did not get much attention until this decade, thanks to modern computing technology, including GPUs, cloud computing, high-level libraries like PyTorch and TensorFlow, and their Python APIs.

Solving partial differential equations with deep learning is particularly interesting to CFD researchers and practitioners. The PINN (physics-informed neural network) method denotes an approach to incorporating deep learning in CFD applications where solving partial differential equations plays the key role. These partial differential equations include the well-known Navier-Stokes equations—one of the Millennium Prize Problems. The universal approximation theorem ([Hor]) implies that neural networks can model the solution to the Navier-Stokes equations with high fidelity and capture complicated flow details as long as the networks are big enough. The idea of PINN methods can be traced back to [DPT], while the name PINN was coined in [RPK]. Human-provided data are not necessary in applying PINN [LMMK], making it a potential alternative to traditional CFD solvers. Sometimes it is branded as unsupervised learning—it does not rely on human-provided data, making it sound very "AI." It is now common to see headlines like "AI has cracked the Navier-Stokes equations" in recent popular science articles ([Hao]).

Though data-free PINN as an alternative to traditional CFD solvers may sound attractive, PINN can also be used in data-driven configurations, for which it is better suited. Cai et al. [CMW+] state that PINN is not meant to be a replacement for existing CFD solvers due to its inferior accuracy and efficiency. The most useful applications of PINN should be those with some given data, so that the models are trained against the data. For example, when we have experimental measurements or partial simulation results (coarse-grid data, limited numbers of snapshots, etc.) from traditional CFD solvers, PINN may be useful to reconstruct the flow or to serve as a surrogate model.
Nevertheless, data-free PINN may offer some advantages over traditional solvers, and using data-free PINN to replace traditional solvers is still of great interest to researchers (e.g., [KDYI]). First, it is a mesh-free scheme, which benefits engineering problems where fluid flows interact with objects of complicated geometries. Simulating these fluid flows with traditional numerical methods usually requires high-quality unstructured meshes with time-consuming human intervention in the pre-processing stage before the actual simulations. The second benefit of PINN is that the trained models approximate the governing equations' general solutions, meaning there is no need to solve the equations repeatedly for different flow parameters. For example, a flow model taking boundary velocity profiles as its input arguments can predict flows under different boundary velocity profiles after training. Conventional numerical methods, on the contrary, require repeated simulations, each one covering one boundary velocity profile. This feature could help in situations like engineering design optimization: the process of running sets of experiments to conduct parameter sweeps and find the optimal values or geometries for products. Given these benefits, researchers continue studying and improving the usability of data-free PINN (e.g., [WYP], [DZ], [WTP], [SS]).

Data-free PINN, however, is not ready nor meant to replace traditional CFD solvers. This claim may be obvious to researchers experienced in PINN, but it may not be clear to others, especially to CFD end-users without ample expertise in numerical methods. Even in literature that aims to improve PINN, it is common to see only the success stories with simple CFD problems. Important information concerning the feasibility of PINN in practical and real-world applications is often missing from these success stories. For example, few reports discuss the required computing resources, the computational cost of training, the convergence properties, or the error analysis of PINN. PINN suffers from performance and solvability issues due to the need for high-order automatic differentiation and multi-objective nonlinear optimization. Evaluating high-order derivatives using automatic differentiation enlarges the computational graphs of neural networks. And multi-objective optimization, which reduces all the residuals of the differential equations, initial conditions, and boundary conditions, makes the training difficult to converge to small-enough loss values. Fluid flows are sensitive nonlinear dynamical systems in which a small change or error in inputs may produce a very different flow field. So to get correct solutions, the optimization in PINN needs to minimize the loss to values very close to zero, further compromising the method's solvability and performance.

This paper reports on our not-so-successful PINN story as a lesson learned for readers, so they can be aware of the challenges they may face if they consider using data-free PINN in real-world applications. Our story includes two computational experiments as case studies to benchmark the PINN method's accuracy and computational performance. The first case study is a Taylor-Green vortex, solved successfully though not to our complete satisfaction. We will discuss the performance of PINN using this case study. The second case study, flow over a cylinder, did not even result in a physical solution. We will discuss the frustration we encountered with PINN in this case study.

We built our PINN solver with the help of NVIDIA's Modulus library ([noa]). Modulus is a high-level Python package built on top of PyTorch that helps users develop PINN-based differential equation solvers. In each case study, we also carried out simulations with our CFD solver, PetIBM ([CMKAB18]). PetIBM is a traditional solver using staggered-grid finite difference methods with MPI parallelization and GPU computing. The PetIBM simulations in each case study served as baseline data. For all cases, configurations, post-processing scripts, and required Singularity image definitions can be found at reference [Chu22].

This paper is structured as follows: the second section briefly describes the PINN method and an analogy to traditional CFD methods. The third and fourth sections present our computational experiments of the Taylor-Green vortex in 2D and a 2D laminar cylinder flow with vortex shedding. Most discussions happen in the corresponding case studies. The last section presents the conclusion and discussions that did not fit into either one of the cases.

2. Solving Navier-Stokes equations with PINN

The incompressible Navier-Stokes equations in vector form are composed of the continuity equation:

    \nabla \cdot \vec{U} = 0    (1)

and the momentum equation:

    \frac{\partial \vec{U}}{\partial t} + (\vec{U} \cdot \nabla)\vec{U} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \vec{U} + \vec{g}    (2)

where ρ = ρ(x, t), ν = ν(x, t), and p = p(x, t) are scalar fields denoting density, kinematic viscosity, and pressure, respectively. x denotes the spatial coordinate, and x = [x, y]^T in two dimensions. The density and viscosity fields are usually known and given, while the pressure field is unknown. U = U(x, t) = [u(x, y, t), v(x, y, t)]^T is the vector field of flow velocity. All of them are functions of the spatial coordinate in the computational domain Ω and of time up to a given limit T. The gravitational field g may also be a function of space and time, though it is usually a constant. A solution to the Navier-Stokes equations is subject to an initial condition and boundary conditions:

    \vec{U}(\vec{x}, t) = \vec{U}_0(\vec{x}), \quad \forall \vec{x} \in \Omega, \; t = 0
    \vec{U}(\vec{x}, t) = \vec{U}_\Gamma(\vec{x}, t), \quad \forall \vec{x} \in \Gamma, \; t \in [0, T]    (3)
    p(\vec{x}, t) = p_\Gamma(\vec{x}, t), \quad \forall \vec{x} \in \Gamma, \; t \in [0, T]

where Γ represents the boundary of the computational domain.
2.1. The PINN method

The basic form of the PINN method ([RPK], [CMW+]) starts from approximating U and p with a neural network:

    \begin{bmatrix} \vec{U} \\ p \end{bmatrix}(\vec{x}, t) \approx G(\vec{x}, t; \Theta)    (4)

Here we use a single network that predicts both the pressure and velocity fields. It is also possible to use separate networks for them. Later in this work, we will use G_U and G_p to denote the predicted velocity and pressure from the neural network. Θ at this point represents the free parameters of the network.

To determine the free parameters Θ, ideally, we hope the approximate solution gives zero residuals for equations (1), (2), and (3). That is,

    r_1(\vec{x}, t; \Theta) \equiv \nabla \cdot G_U = 0
    r_2(\vec{x}, t; \Theta) \equiv \frac{\partial G_U}{\partial t} + (G_U \cdot \nabla) G_U + \frac{1}{\rho}\nabla G_p - \nu \nabla^2 G_U - \vec{g} = 0
    r_3(\vec{x}; \Theta) \equiv G_U \big\rvert_{t=0} - \vec{U}_0 = 0    (5)
    r_4(\vec{x}, t; \Theta) \equiv G_U - \vec{U}_\Gamma = 0, \quad \forall \vec{x} \in \Gamma
    r_5(\vec{x}, t; \Theta) \equiv G_p - p_\Gamma = 0, \quad \forall \vec{x} \in \Gamma

The set of desired parameters, Θ = θ, is the common zero root of all the residuals. The derivatives of G with respect to x and t are usually obtained using automatic differentiation. Nevertheless, it is possible to use analytical derivatives when the chosen network architecture is simple enough, as reported in early literature ([LLF], [LLQH]).

If the residuals in (5) are not complicated, and if the number of parameters, N_Θ, is small enough, we may numerically find the zero root by solving a system of N_Θ nonlinear equations generated from a suitable set of N_Θ spatial-temporal points. However, this scenario rarely happens, as G is usually highly complicated and N_Θ is large. Moreover, we do not even know whether such a zero root exists for the equations in (5).

Instead, in PINN, the condition is relaxed. We do not seek the zero root of (5) but just hope to find a set of parameters that make the residuals sufficiently close to zero. Consider the sum of the l2 norms of the residuals:

    r(\vec{x}, t; \Theta = \theta) \equiv \sum_{i=1}^{5} \lVert r_i(\vec{x}, t; \Theta = \theta) \rVert_2, \quad \forall \vec{x} \in \Omega, \; t \in [0, T]    (6)

The θ that makes the residuals closest to zero (or even equal to zero if such a θ exists) also makes (6) minimal, because r(x, t; Θ) ≥ 0. In other words,

    \theta = \underset{\Theta}{\arg\min}\; r(\vec{x}, t; \Theta), \quad \forall \vec{x} \in \Omega, \; t \in [0, T]    (7)

This poses a fundamental difference between the PINN method and traditional CFD schemes, making it potentially more difficult for the PINN method to achieve the same accuracy as the traditional schemes. We will discuss this more in section 3. Note that in practice, each loss term on the right-hand side of equation (6) is weighted. We ignore the weights here for demonstration purposes.

To solve (7), theoretically, we can use any number of spatial-temporal points, which eases the need for computational resources compared to finding the zero root directly. Gradient-descent-based optimizers further reduce the computational cost, especially in terms of memory usage and the difficulty of parallelization. Alternatively, quasi-Newton methods may work, but only when N_Θ is small enough.

However, even though equation (7) may be solvable, it is still a significantly expensive task. While typical data-driven learning requires one back-propagation pass on the derivatives of the loss function, here automatic differentiation is needed to evaluate the derivatives of G with respect to x and t. The first-order derivatives require one back-propagation on the network, while the second-order derivatives present in the diffusion term ∇²G_U require an additional back-propagation on the first-order derivatives' computational graph. Finally, to update parameters in an optimizer, the gradients of G with respect to the parameters Θ require another back-propagation on the graph of the second-order derivatives. This all leads to a very large computational graph. We will see the performance of the PINN method in the case studies.

In summary, when viewing the PINN method as supervised machine learning, the inputs of the network are spatial-temporal coordinates, and the outputs are the physical quantities of interest. The loss or objective functions in PINN are governing equations that regulate how the target physical quantities should behave. The use of governing equations eliminates the need for true answers. A trivial example is using Bernoulli's equation as the loss function, i.e., loss = u²/(2g) + p/(ρg) − H_0 + z(x), where a neural network predicts the flow speed u and pressure p at a given location x along a streamline. (The gravitational acceleration g, density ρ, energy head H_0, and elevation z(x) are usually known and given.) Such a loss function regulates the relationship between the predicted u and p and does not need true answers for the two quantities. Unlike Bernoulli's equation, most governing equations in physics are differential equations (e.g., heat equations). The main difference is that the PINN method then needs automatic differentiation to evaluate the loss. Regardless of the form of the governing equations, spatial-temporal coordinates are the only data required during training. Hence, throughout this paper, training data means spatial-temporal points and does not involve any true answers to the predicted quantities. (Note that in some literature, the PINN method is applied to applications that do need true answers, see [CMW+]. These applications are out of scope here.)
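To make the chain of back-propagations concrete, the sketch below shows how one residual term could be assembled with PyTorch's automatic differentiation. It is a bare-bones illustration under assumed sizes and a toy network, not the Modulus-based solver used in this paper.

    import torch

    # Toy network G(x, y, t) -> (u, v, p), standing in for the solver's model.
    net = torch.nn.Sequential(
        torch.nn.Linear(3, 64), torch.nn.SiLU(),
        torch.nn.Linear(64, 64), torch.nn.SiLU(),
        torch.nn.Linear(64, 3),
    )

    xyt = torch.rand(1024, 3, requires_grad=True)         # batch of spatial-temporal points
    u, v, p = net(xyt).unbind(dim=1)

    grad = lambda f: torch.autograd.grad(f.sum(), xyt, create_graph=True)[0]
    du = grad(u)                                           # columns: du/dx, du/dy, du/dt
    dv = grad(v)
    d2u_dx2 = grad(du[:, 0])[:, 0]                         # second back-propagation, as needed by the diffusion term

    r1 = du[:, 0] + dv[:, 1]                               # continuity residual, r1 in eq. (5)
    loss = (r1 ** 2).mean()                                # one (unweighted) term of eq. (6)
    loss.backward()                                        # further pass: gradients w.r.t. the network parameters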
2.2. An analogy to conventional numerical methods

For readers with a background in numerical methods for partial differential equations, we would like to make an analogy between traditional numerical methods and PINN. In obtaining strong solutions to differential equations, we can describe the solution workflow of most numerical methods in five stages:

1) Designing the approximate solution with undetermined parameters
2) Choosing a proper approximation for the derivatives
3) Obtaining the so-called modified equation by substituting the approximate derivatives into the differential equations and initial/boundary conditions
4) Generating a system of linear/nonlinear algebraic equations
5) Solving the system of equations

For example, to solve ∇²U(x) = s(x), the most naive spectral method ([Tre]) approximates the solution with

    U(x) \approx G(x) = \sum_{i=1}^{N} c_i \phi_i(x)

where the c_i represent undetermined parameters and the φ_i(x) denote a set of either polynomials, trigonometric functions, or complex exponentials. Next, obtaining the first derivative of U is straightforward—we can just assume

    U'(x) \approx G'(x) = \sum_{i=1}^{N} c_i \phi_i'(x)

The second-order derivative may be more tricky. One can assume U''(x) ≈ G''(x) = \sum_{i=1}^{N} c_i \phi_i''(x); or, another choice for nodal bases (i.e., when the φ_i(x) are chosen such that c_i ≡ G(x_i)) is U''(x) ≈ \sum_{i=1}^{N} c_i G'(x_i). Because the φ_i(x) are known, the derivatives are analytical. After substituting the approximate solution and derivatives into the target differential equation, we need to solve for the parameters c_1, ..., c_N. We do so by selecting N points from the computational domain and creating a system of N linear equations:

    \begin{bmatrix} \phi_1''(x_1) & \cdots & \phi_N''(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1''(x_N) & \cdots & \phi_N''(x_N) \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_N \end{bmatrix} - \begin{bmatrix} s(x_1) \\ \vdots \\ s(x_N) \end{bmatrix} = 0    (8)

Finally, we determine the parameters by solving this linear system. Though this example uses a spectral method, the workflow also applies to many other numerical methods, such as finite difference methods, which can be recast as a form of spectral method.

With this workflow in mind, it should be easy to see the analogy between PINN and conventional numerical methods. Aside from using a much more complicated approximate solution, the major difference lies in how the unknown parameters in the approximate solution are determined. While traditional methods solve the zero-residual conditions, PINN relies on searching for the minimal residuals. A secondary difference is how derivatives are approximated. Conventional numerical methods use analytical or numerical differentiation of the approximate solutions, while the PINN method usually depends on automatic differentiation. This difference may be minor, as we are still able to use analytical differentiation for simple network architectures with PINN. However, automatic differentiation is a major factor affecting PINN's performance.
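A tiny numerical rendition of this workflow, under assumed choices (a sine basis on (0, 1), equispaced interior collocation points, and a manufactured right-hand side), might look like the following; it is only meant to make equation (8) concrete.

    import numpy as np

    # Basis phi_i(x) = sin(i*pi*x), so phi_i''(x) = -(i*pi)**2 * sin(i*pi*x).
    N = 6
    x = np.linspace(0.0, 1.0, N + 2)[1:-1]               # interior collocation points
    i = np.arange(1, N + 1)
    s = -(np.pi ** 2) * np.sin(np.pi * x)                # RHS chosen so the exact solution is sin(pi*x)

    A = -((i * np.pi) ** 2) * np.sin(np.pi * np.outer(x, i))   # A[j, k] = phi_k''(x_j), as in eq. (8)
    c = np.linalg.solve(A, s)                            # c is approximately [1, 0, 0, ...]

    approx = lambda xq: np.sin(np.pi * np.outer(np.atleast_1d(xq), i)) @ c   # G(x) = sum_i c_i phi_i(x)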
3. Case 1: Taylor-Green vortex: accuracy and performance

3.1. 2D Taylor-Green vortex

The Taylor-Green vortex represents a family of flows with a specific form of analytical initial flow conditions in both 2D and 3D. The 2D Taylor-Green vortex has closed-form analytical solutions with periodic boundary conditions, and hence it is a standard benchmark case for verifying CFD solvers. In this work, we used the following 2D Taylor-Green vortex:

    u(x, y, t) = V_0 \cos\!\left(\frac{x}{L}\right) \sin\!\left(\frac{y}{L}\right) \exp\!\left(-2 \frac{\nu}{L^2} t\right)
    v(x, y, t) = -V_0 \sin\!\left(\frac{x}{L}\right) \cos\!\left(\frac{y}{L}\right) \exp\!\left(-2 \frac{\nu}{L^2} t\right)    (9)
    p(x, y, t) = -\frac{\rho}{4} V_0^2 \left[\cos\!\left(\frac{2x}{L}\right) + \cos\!\left(\frac{2y}{L}\right)\right] \exp\!\left(-4 \frac{\nu}{L^2} t\right)

where V_0 represents the peak (and also the lowest) velocity at t = 0. Other symbols carry the same meaning as those in section 2. The periodic boundary conditions were applied at x = −Lπ, x = Lπ, y = −Lπ, and y = Lπ. We used the following parameters in this work: V_0 = L = ρ = 1.0 and ν = 0.01. These parameters correspond to Reynolds number Re = 100. Figure 1 shows a snapshot of the velocity at t = 32.

Fig. 1: Contours of u and v at t = 32 to demonstrate the solution of the 2D Taylor-Green vortex.
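For reference, equation (9) translates directly into a small NumPy helper (a convenience sketch, not code taken from the paper's reproducibility package):

    import numpy as np

    def taylor_green(x, y, t, V0=1.0, L=1.0, rho=1.0, nu=0.01):
        # Analytical 2D Taylor-Green vortex of equation (9); defaults correspond to Re = 100.
        decay = np.exp(-2.0 * nu * t / L ** 2)
        u = V0 * np.cos(x / L) * np.sin(y / L) * decay
        v = -V0 * np.sin(x / L) * np.cos(y / L) * decay
        p = -0.25 * rho * V0 ** 2 * (np.cos(2 * x / L) + np.cos(2 * y / L)) * decay ** 2
        return u, v, p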
3.2. Solver and runtime configurations

The neural network used in the PINN solver is a fully-connected neural network with 6 hidden layers and 256 neurons per layer. The activation functions are SiLU ([HG]). We used Adam for optimization, and its initial parameters are the defaults from PyTorch. The learning rate decayed exponentially through PyTorch's ExponentialLR with gamma equal to 0.95^(1/10000). Note that we did not conduct hyperparameter optimization, given the computational cost. The hyperparameters are mostly the defaults used by the 3D Taylor-Green example in Modulus ([noa]).

The training data were simply spatial-temporal coordinates. Before the training, the PINN solver pre-generated 18,432,000 spatial-temporal points to evaluate the residuals of the Navier-Stokes equations (r_1 and r_2 in equation (5)). These training points were randomly chosen from the spatial domain [−π, π] × [−π, π] and the temporal domain (0, 100]. The solver used only 18,432 points in each training iteration, making it a batch training. For the residual of the initial condition (r_3), the solver also pre-generated 18,432,000 random spatial points and used only 18,432 per iteration. Note that for r_3, the points were distributed in space only, because t = 0 is a fixed condition. Because of the periodic boundary conditions, the solver did not require any training points for r_4 and r_5.

The hardware used for the PINN solver was a single node of NVIDIA's DGX-A100, equipped with 8 A100 GPUs (80GB variants). We carried out the training using different numbers of GPUs to investigate the performance of the PINN solver. All cases were trained up to 1 million iterations. Note that the parallelization was done with weak scaling, meaning that increasing the number of GPUs does not reduce the workload of each GPU. Instead, increasing the number of GPUs increases the total and per-iteration numbers of training points. Therefore, our expected outcome was that all cases would require about the same wall time to finish, while the residual from using 8 GPUs would converge the fastest.

After training, the PINN solver's prediction errors (i.e., accuracy) were evaluated on the cell centers of a 512 × 512 Cartesian mesh against the analytical solution. With these spatially distributed errors, we calculated the L2 error norm for a given t:

    L_2 = \sqrt{\int_\Omega \mathrm{error}(x, y)^2 \, \mathrm{d}\Omega} \approx \sqrt{\sum_i \sum_j \mathrm{error}_{i,j}^2 \, \Delta\Omega_{i,j}}    (10)

where i and j are the indices of a cell center in the Cartesian mesh, and ΔΩ_{i,j} is the corresponding cell area, 4π²/512² in this case.

We compared accuracy and performance against results using PetIBM. All PetIBM simulations in this section were done with 1 K40 GPU and 6 CPU cores (Intel i7-5930K) on our old lab workstation. We carried out 7 PetIBM simulations with different spatial resolutions: 2^k × 2^k for k = 4, 5, ..., 10. The time step size for each spatial resolution was Δt = 0.1/2^(k−4).

A special note should be made here: the PINN solver used single-precision floats, while PetIBM used double-precision floats. This might sound unfair. However, this discrepancy does not change the qualitative findings and conclusions, as we will see later.
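The error metric in equation (10) amounts to a weighted root-sum-square on the evaluation mesh; a small stand-alone sketch (with the PINN prediction replaced by a placeholder array) is shown below.

    import numpy as np

    n, nu, t = 512, 0.01, 32.0
    xc = -np.pi + (np.arange(n) + 0.5) * (2.0 * np.pi / n)       # cell-centered coordinates
    X, Y = np.meshgrid(xc, xc)
    cell_area = (2.0 * np.pi / n) ** 2                            # = 4*pi**2 / 512**2

    u_exact = np.cos(X) * np.sin(Y) * np.exp(-2.0 * nu * t)       # analytical u from eq. (9), V0 = L = 1
    u_predicted = u_exact                                         # placeholder for the solver's prediction
    l2_error = np.sqrt(np.sum((u_predicted - u_exact) ** 2) * cell_area)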
3.3. Results

Figure 2 shows the convergence history of the total residuals (equation (6)). Using more GPUs in weak scaling (i.e., more training points) did not accelerate the convergence, contrary to what we expected: all cases converged at a similar rate. Though without a quantitative criterion or justification, we considered that further training would not improve the accuracy. Figure 3 gives a visual taste of what the predictions from the neural network look like.

Fig. 2: Total residuals (loss) with respect to training iterations.
Fig. 3: Contours of u and v at t = 32 from the PINN solver.

The result visually agrees with that in figure 1. However, as shown in figure 4, the error magnitudes from the PINN solver are much higher than those from PetIBM. Figure 4 shows the prediction errors with respect to t. We only present the error in the u velocity, as those for v and p are similar. The accuracy of the PINN solver is similar to that of the 16 × 16 simulation with PetIBM. Using more GPUs, which implies more training points, does not improve the accuracy.

Fig. 4: L2 error norm versus simulation time.

Regardless of the magnitudes, the trends of the errors with respect to t are similar for both PINN and PetIBM. For PetIBM, the trend shown in figure 4 indicates that the temporal error is bounded and the scheme is stable. However, this concept does not apply to PINN, as it does not use any time-marching scheme. What this means for PINN is still unclear to us. Nevertheless, it shows that PINN is able to propagate the influence of initial conditions to later times, which is a crucial factor for solving hyperbolic partial differential equations.

Figure 5 shows the computational cost of PINN and PetIBM in terms of the desired accuracy versus the required wall time. We only show the PINN results of 8 A100 GPUs in this figure. We believe this type of plot may help evaluate the computational cost in engineering applications. According to the figure, for example, achieving an accuracy of 10^-3 at t = 2 requires less than 1 second for PetIBM with 1 K40 GPU and 6 CPU cores, but it requires more than 8 hours for PINN with at least 1 A100 GPU.

Fig. 5: L2 error norm versus wall time.

Table 1 lists the wall time per 1 thousand iterations and the scaling efficiency. As indicated previously, weak scaling was used in PINN, which follows most machine learning applications.

TABLE 1: Weak scaling performance of the PINN solver using NVIDIA A100-80GB GPUs.

                           1 GPU    2 GPUs    4 GPUs    8 GPUs
    Time (sec/1k iters)    85.0     87.7      89.1      90.1
    Efficiency (%)         100      97        95        94
3.4. Discussion

A note should be made regarding the results: we do not claim that these results represent the most optimized configuration of the PINN method, nor do we claim that the qualitative conclusions apply to all other hyperparameter configurations. These results merely reflect the outcomes of our computational experiments with respect to the specific configuration described above. They should be deemed experimental data rather than a thorough analysis of the method's characteristics.

The Taylor-Green vortex serves as a good benchmark case because it reduces the number of required residual constraints: residuals r_4 and r_5 are excluded from r in equation (6). This means the optimizer can concentrate only on the residuals of the initial conditions and the Navier-Stokes equations.

Using more GPUs (and thus more training points, i.e., spatial-temporal points) did not speed up the convergence, which may indicate that the per-iteration number of points on a single GPU is already big enough. The number of training points mainly affects the mean gradients of the residual with respect to the model parameters, which are then used to update the parameters by gradient-descent-based optimizers. If the number of points is already big enough on a single GPU, then using more points or more GPUs is unlikely to change the mean gradients significantly, causing the convergence to rely solely on the learning rates.

The accuracy of the PINN solver was acceptable but not satisfying, especially when considering how much time it took to achieve such accuracy. The low accuracy was, to some degree, not surprising. Recall the theory in section 2: the PINN method only seeks the minimal residual on the total residual's hyperplane. It does not try to find the zero root of the hyperplane and does not even care whether such a zero root exists. Furthermore, by using a gradient-descent-based optimizer, the resulting minimum is likely just a local minimum. It makes sense that it is hard for the residual to get close to zero, meaning it is hard to make the errors small.

Regarding the performance result in figure 5, we would like to avoid interpreting the result as one solver being better than the other. The proper conclusion drawn from the figure is as follows: when using the PINN solver as a CFD simulator for a specific flow condition, PetIBM outperforms the PINN solver. As stated in section 1, the PINN method can solve flows under different flow parameters in one run—a capability that PetIBM does not have. The performance result in figure 5 only considers a limited application of the PINN solver.

One issue for this case study was how to fairly compare the PINN solver and PetIBM, especially when investigating accuracy versus workload/problem size or time-to-solution versus problem size. Defining the problem size in PINN is not as straightforward as we thought. Let us start with degrees of freedom—in PINN they are called the number of model parameters, and in traditional CFD solvers the number of unknowns. The PINN solver and traditional CFD solvers are all trying to determine the free parameters in models (that is, approximate solutions). Hence, the number of degrees of freedom determines the problem sizes and workloads. However, in PINN, problem sizes and workloads do not depend solely on the degrees of freedom: the number of training points also plays a critical role. We were not sure whether it made sense to define the problem size as the sum of the per-iteration number of training points and the number of model parameters. For example, 100 model parameters plus 100 training points is not equivalent to 150 model parameters plus 50 training points in terms of workload. So without a proper definition of problem size and workload, it was not clear how to fairly compare PINN and traditional CFD methods. Nevertheless, the gap between the performance of PINN and PetIBM is too large, and no one can argue that using other metrics would change the conclusion. Not to mention that the PINN solver ran on A100 GPUs, while PetIBM ran on a single K40 GPU in our lab, a product from 2013. This is also not a surprising conclusion because, as indicated in section 2, the use of automatic differentiation for temporal and spatial derivatives results in a huge computational graph. In addition, the PINN solver uses a gradient-descent-based method, which is a first-order method and limits the performance.

Weak scaling is a natural choice for the PINN solver when it comes to distributed computing. As we do not know a proper way to define workload, simply copying all model parameters to all processes and using the same number of training points on all processes works well.
4. Case 2: 2D cylinder flows: harder than we thought

This case study shows what really frustrated us: a 2D cylinder flow at Reynolds number Re = 200. We failed to produce even a solution that qualitatively captures the key physical phenomenon of this flow: vortex shedding.

4.1. Problem description

The computational domain is [−8, 25] × [−8, 8], and a cylinder with a radius of 0.5 sits at coordinate (0, 0). The velocity boundary conditions are (u, v) = (1, 0) along x = −8, y = −8, and y = 8. On the cylinder surface we have the no-slip condition, i.e., (u, v) = (0, 0). At the outlet (x = 25), we enforced a pressure boundary condition p = 0. The initial condition is (u, v) = (0, 0). Note that this initial condition is different from most traditional CFD simulations. Conventionally, CFD simulations use (u, v) = (1, 0) for cylinder flows. A uniform initial condition of u = 1 does not satisfy the Navier-Stokes equations because of the no-slip boundary on the cylinder surface. Conventional CFD solvers are usually able to correct the solution during time-marching by propagating the boundary effects into the domain through the numerical schemes' stencils. In our experience, using u = 1 or u = 0 did not matter for PINN, because neither gave reasonable results. Nevertheless, the PINN solver's results shown in this section were obtained using a uniform u = 0 as the initial condition.

The density ρ is one, and the kinematic viscosity is ν = 0.005. These parameters correspond to Reynolds number Re = 200. Figure 6 shows the velocity and vorticity snapshots at t = 200. As shown in the figure, this type of flow displays a phenomenon called vortex shedding. Though vortex shedding makes the flow unsteady at all times, after a certain time the flow reaches a periodic stage and the flow pattern repeats with a fixed period.

Fig. 6: Demonstration of the velocity and vorticity fields at t = 200 from a PetIBM simulation.

The Navier-Stokes equations can be deemed a dynamical system. Under some flow conditions, instability appears in the flow and responds to small perturbations, causing the vortex shedding. In nature, the vortex shedding comes from the uncertainty and perturbations existing everywhere. In CFD simulations, the vortex shedding is triggered by small numerical and rounding errors in the calculations. Interested readers should consult reference [Wil].
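For completeness, the quoted Reynolds number follows from these parameters with the cylinder diameter as the length scale (the diameter is implied by the radius of 0.5 and also appears as D in equation (11) below); the snippet is just this arithmetic.

```python
U0 = 1.0       # freestream / inlet velocity
D = 2 * 0.5    # cylinder diameter (radius 0.5)
nu = 0.005     # kinematic viscosity
Re = U0 * D / nu
print(Re)      # -> 200.0
```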
4.2. Solver and runtime configurations

For the PINN solver, we tested two networks. Both were fully-connected neural networks: one with 256 neurons per layer, the other with 512 neurons per layer. All other network configurations were the same as those in section 3, except that we allowed human intervention to manually adjust the learning rates during training. Our intention for this case study was to successfully obtain physical solutions from the PINN solver, rather than to conduct a performance and accuracy benchmark. Therefore, we would adjust the learning rate to accelerate the convergence or to escape from local minima. This decision was in line with common machine learning practice. We did not carry out hyperparameter optimization. These parameters were chosen because they work in Modulus' examples and in the Taylor-Green vortex experiment.

The PINN solver pre-generated 40,960,000 spatial-temporal points from the spatial domain [−8, 25] × [−8, 8] and the temporal domain (0, 200] to evaluate the residuals of the Navier-Stokes equations, and used 40,960 points per iteration. The number of pre-generated points for the initial condition was 2,048,000, and the per-iteration number was 2,048. On each boundary, the numbers of pre-generated and per-iteration points were 8,192,000 and 8,192. Both cases used 8 A100 GPUs, which scaled these numbers up by a factor of 8. For example, during each iteration, a total of 327,680 points were actually used to evaluate the Navier-Stokes equations' residuals. Both cases ran for up to 64 hours of wall time.

One PetIBM simulation was carried out as a baseline. This simulation had a spatial resolution of 1485 × 720, and the time step size was 0.005. Figure 6 was rendered using this simulation. The hardware used was 1 K40 GPU plus 6 cores of an i7-5930K CPU. It took about 1.7 hours to finish.

The quantity of interest is the drag coefficient. We consider both the friction drag and the pressure drag in the coefficient calculation, as follows:

C_D = \frac{2}{\rho U_0^2 D} \int_S \left( \rho \nu \frac{\partial (\vec{U} \cdot \vec{t})}{\partial \vec{n}} n_y - p\, n_x \right) \mathrm{d}S \qquad (11)

Here, U0 = 1 is the inlet velocity; n = [nx, ny]^T and t = [ny, −nx]^T are the normal and tangent vectors, respectively; and S represents the cylinder surface. The theoretical lift coefficient (CL) for this flow is zero due to the symmetric geometry.
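Equation (11) sums a friction and a pressure contribution over the cylinder surface. The sketch below shows one way to evaluate it by discrete integration over the surface angle, assuming the surface pressure and the wall-normal derivative of the tangential velocity have already been sampled around the cylinder; the sampling itself is not shown, and the function and argument names are made up for illustration.

```python
import numpy as np

def drag_coefficient(theta, p_s, dut_dn, rho=1.0, nu=0.005, U0=1.0, D=1.0, R=0.5):
    """Discrete evaluation of equation (11).

    theta  : uniformly spaced angles parameterizing the cylinder surface
    p_s    : surface pressure sampled at each angle
    dut_dn : wall-normal derivative of the tangential velocity at each angle
    """
    nx, ny = np.cos(theta), np.sin(theta)      # outward unit normal components
    integrand = rho * nu * dut_dn * ny - p_s * nx
    dtheta = theta[1] - theta[0]
    surface_integral = np.sum(integrand) * R * dtheta   # dS = R dtheta
    return 2.0 * surface_integral / (rho * U0 ** 2 * D)
```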
4.3. Results

Fig. 7: Training history of the 2D cylinder flow at Re = 200.

Note that, as stated in section 3.4, we treat these results as experimental data obtained under a specific experiment configuration. Hence, we do not claim that the results and qualitative conclusions apply to other hyperparameter configurations.

Figure 7 shows the convergence history. The bumps in the history correspond to our manual adjustments of the learning rates. After 64 hours of training, the total loss had not converged to an obvious steady value. However, we decided not to continue the training because, as the later results show, it is our judgment call that the results would not be correct even if the training converged.

Fig. 8: Velocity and vorticity at t = 200 from PINN.

Figure 8 provides a visualization of the predicted velocity and vorticity at t = 200, and figure 9 shows the drag and lift coefficients versus simulation time. From both figures, we could not see any sign of vortex shedding with the PINN solver.

Fig. 9: Drag and lift coefficients with respect to t.

We provide a comparison against the values reported by others in table 2. References [GS74] and [For80] calculate the drag coefficients using steady flow simulations, which were popular decades ago because of their inexpensive computational costs. The actual flow is not a steady flow, and these steady-flow coefficient values are lower than the unsteady-flow predictions. The drag coefficient from the PINN solver is closer to the steady-flow predictions.

                           Unsteady simulations      Steady simulations
        PetIBM    PINN     [DSY07]    [RKM09]        [GS74]    [For80]
  C_D   1.38      0.95     1.25       1.34           0.97      0.83

TABLE 2: Comparison of drag coefficients, CD.
4.4. Discussion

While researchers may be interested in why the PINN solver behaves like a steady-flow solver, in this section we would like to focus more on the user experience and the usability of PINN in practice. Our viewpoints may be subjective, and hence we leave them here in the discussion.

Allow us to start this discussion with a hypothetical situation. If one asks why we chose such a spatial and temporal resolution for a conventional CFD simulation, we have mathematical or physical reasons to back our decision. However, if the person asks why we chose 6 hidden layers and 256 neurons per layer, we will not be able to justify it. "It worked in another case!" is probably the best answer we can offer. The situation also indicates that we have systematic approaches to improve a conventional simulation, but can only improve PINN's results through computer experiments.

Most traditional numerical methods have rigorous analytical derivations and analyses. Each parameter used in a scheme has a meaning or a purpose in physical or numerical terms. The simplest example is the spatial resolution in the finite difference method, which controls the truncation errors of the derivatives. Another is the choice of the limiters in finite volume methods, used to inhibit oscillations in the solutions. So when a conventional CFD solver produces unsatisfying or even non-physical results, practitioners usually have systematic approaches to identify the cause or improve the outcomes. Moreover, when necessary, practitioners know how to balance the computational cost and the accuracy, which is a critical point for computer-aided engineering. Engineering always concerns costs and outcomes.

On the other hand, the PINN method lacks well-defined procedures to control the outcome. For example, we know the numbers of neurons and layers control the degrees of freedom in a model, and with more degrees of freedom a neural network model can approximate a more complicated phenomenon. However, when we feel that a neural network is not complicated enough to capture a physical phenomenon, what strategy should we use to adjust the neurons and layers? Should we increase neurons or layers first? By how much?

Moreover, when it comes to something non-numeric, it is even more challenging to know what to use and why. For instance, what activation function should we use, and why? Should we use the same activation everywhere? Not to mention that we are not yet even considering a different network architecture here.

Ultimately, are we even sure that increasing the network's complexity is the right path? Our assumption that the network is not complicated enough may just be wrong.

The following situation happened in this case study. Before we realized the PINN solver behaved like a steady-flow solver, we attributed the cause to model complexity. We then faced the problem of how to increase the model complexity systematically. Theoretically, we could follow the practice of design of experiments (e.g., through grid search or Taguchi methods). However, given the computational cost and the number of hyperparameters/options of PINN, a proper design of experiments was not affordable for us. Furthermore, design of experiments requires the outcome to change with changes in the inputs; in our case, the vortex shedding remained absent regardless of how we changed the hyperparameters.

Let us move back to the flow problem to conclude this case study. The model complexity may not be the culprit here. Vortex shedding is the product of the dynamical system of the Navier-Stokes equations and the perturbations from numerical calculations (which implicitly mimic the perturbations in nature). Suppose the PINN solver's prediction is the steady-state solution to the flow. Then we may need to introduce uncertainties and perturbations into the neural network or the training data, such as the perturbed initial condition described in [LD15]. As for why PINN predicts the steady-state solution, we cannot answer that currently.
5. Further discussion and conclusion

Because of widely available deep learning libraries, such as PyTorch, and the ease of Python, implementing a PINN solver is relatively straightforward nowadays. This may be one reason why the PINN method has suddenly become so popular in recent years. This paper does not intend to discourage people from trying the PINN method. Instead, we share our failures and frustration using PINN so that interested readers may know what immediate challenges should be resolved for PINN.

Our paper is limited to using the PINN solver as a replacement for traditional CFD solvers. However, as the first section indicates, PINN can do more than solve one specific flow under specific flow parameters. Moreover, PINN can also work with traditional CFD solvers. The literature shows that researchers have shifted their attention to hybrid-mode applications. For example, in [JEA+20], the authors combined the concept of PINN and a traditional CFD solver to train a model that takes in low-resolution CFD simulation results and outputs high-resolution flow fields.

For people with a strong background in numerical methods or CFD, we would suggest trying to think outside the box. During our work, we realized our mindset and ideas were limited by what we were used to in CFD. An example is the initial conditions. We were used to having only one set of initial conditions when the temporal derivative in the differential equations is only first-order. However, in PINN, nothing prevents us from using more than one initial condition. We can generate results at t = 0, 1, . . ., tn using a traditional CFD solver and add the residuals corresponding to these time snapshots to the total residual, so that the PINN method may perform better in predicting t > tn. In other words, the PINN solver becomes the traditional CFD solvers' replacement only for t > tn ([noa]).

As discussed in [THM+], solving partial differential equations with deep learning is still a work in progress. It may not work in many situations. Nevertheless, that does not mean we should stay away from PINN and discard the idea. Stepping away from a new thing gives it zero chance to evolve, and we would never know whether PINN can be improved to a mature state that works well. Of course, overly promoting its bright side with only success stories does not help, either. Rather, we should honestly face all the troubles, difficulties, and challenges. Knowing the problem is the first step to solving it.

Acknowledgements

We appreciate the support of NVIDIA, through sponsoring access to its high-performance computing cluster.
REFERENCES

[Chu22] Pi-Yueh Chuang. barbagroup/scipy-2022-repro-pack: 20220530, May 2022. doi:10.5281/zenodo.6592457.
[CMKAB18] Pi-Yueh Chuang, Olivier Mesnard, Anush Krishnan, and Lorena A. Barba. PetIBM: toolbox and applications of the immersed-boundary method on distributed-memory architectures. Journal of Open Source Software, 3(25):558, May 2018. doi:10.21105/joss.00558.
[CMW+] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed neural networks (PINNs) for fluid mechanics: a review. 37(12):1727–1738. doi:10.1007/s10409-021-01148-1.
[DPT] M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. 10(3):195–201. doi:10.1002/cnm.1640100303.
[DSY07] Jian Deng, Xue-Ming Shao, and Zhao-Sheng Yu. Hydrodynamic studies on two traveling wavy foils in tandem arrangement. Physics of Fluids, 19(11):113104, November 2007. doi:10.1063/1.2814259.
[DZ] Yifan Du and Tamer A. Zaki. Evolutional deep neural network. 104(4):045303. doi:10.1103/PhysRevE.104.045303.
[For80] Bengt Fornberg. A numerical study of steady viscous flow past a circular cylinder. Journal of Fluid Mechanics, 98(04):819, June 1980. doi:10.1017/S0022112080000419.
[FS] William E. Faller and Scott J. Schreck. Unsteady fluid mechanics applications of neural networks. 34(1):48–55. doi:10.2514/2.2134.
[GS74] V. A. Gushchin and V. V. Shchennikov. A numerical method of solving the Navier-Stokes equations. USSR Computational Mathematics and Mathematical Physics, 14(2):242–250, January 1974. doi:10.1016/0041-5553(74)90061-5.
[Hao] Karen Hao. AI has cracked a key mathematical puzzle for understanding our world. URL: https://www.technologyreview.com/2020/10/30/1011435/ai-fourier-neural-network-cracks-navier-stokes-and-partial-differential-equations/.
[HG] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). doi:10.48550/ARXIV.1606.08415.
[Hor] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. 4(2):251–257. doi:10.1016/0893-6080(91)90009-T.
[JEA+20] Chiyu "Max" Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. MeshfreeFlowNet: A physics-constrained deep continuous space-time super-resolution framework. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2020. doi:10.1109/SC41405.2020.00013.
[KDYI] Hasan Karali, Umut M. Demirezen, Mahmut A. Yukselen, and Gokhan Inalhan. A novel physics informed deep learning method for simulation-based modelling. In AIAA Scitech 2021 Forum. American Institute of Aeronautics and Astronautics. doi:10.2514/6.2021-0177.
[LD15] Mouna Laroussi and Mohamed Djebbi. Vortex shedding for flow past circular cylinder: Effects of initial conditions. Universal Journal of Fluid Mechanics, 3:19–32, 2015.
[LLF] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. 9(5):987–1000. arXiv:physics/9705023, doi:10.1109/72.712178.
[LLQH] Jianyu Li, Siwei Luo, Yingjian Qi, and Yaping Huang. Numerical solution of elliptic partial differential equation using radial basis function neural networks. 16(5):729–734. doi:10.1016/S0893-6080(03)00083-2.
[LMMK] Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. DeepXDE: A deep learning library for solving differential equations. 63(1):208–228. doi:10.1137/19M1274067.
[LS] Dennis J. Linse and Robert F. Stengel. Identification of aerodynamic coefficients using computational neural networks. 16(6):1018–1025. doi:10.2514/3.21122.
[noa] Modulus. URL: https://docs.nvidia.com/deeplearning/modulus/index.html.
[RKM09] B. N. Rajani, A. Kandasamy, and Sekhar Majumdar. Numerical simulation of laminar flow past a circular cylinder. Applied Mathematical Modelling, 33(3):1228–1247, March 2009. doi:10.1016/j.apm.2008.01.017.
[RPK] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. 378:686–707. doi:10.1016/j.jcp.2018.10.045.
[THM+] Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, and Kiwon Um. Physics-based deep learning. arXiv:2109.05237.
[Tre] Lloyd N. Trefethen. Spectral Methods in MATLAB. Software, Environments, Tools. Society for Industrial and Applied Mathematics. doi:10.1137/1.9780898719598.
[Wil] C. H. K. Williamson. Vortex dynamics in the cylinder wake. 28(1):477–539. doi:10.1146/annurev.fl.28.010196.002401.
[WTP] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. 43(5):A3055–A3081. doi:10.1137/20M1318043.
[WYP] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. 449:110768. doi:10.1016/j.jcp.2021.110768.
[SS] Justin Sirignano and Konstantinos Spiliopoulos.
DGM: A deep learning algorithm for solving partial differential equations. 375:1339–1364. URL: https: //linkinghub.elsevier.com/retrieve/pii/S0021999118305527, doi:10.1016/j.jcp.2018.08.029. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 37 atoMEC: An open-source average-atom Python code Timothy J. Callow‡§∗ , Daniel Kotik‡§ , Eli Kraisler¶ , Attila Cangi‡§ F Abstract—Average-atom models are an important tool in studying matter under methods are often denoted as "first-principles" because, formally extreme conditions, such as those conditions experienced in planetary cores, speaking, they yield the exact properties of the system, under cer- brown and white dwarfs, and during inertial confinement fusion. In the right tain well-founded theoretical approximations. Density-functional context, average-atom models can yield results with similar accuracy to simu- theory (DFT), initially developed as a ground-state theory [HK64], lations which require orders of magnitude more computing time, and thus can [KS65] but later extended to non-zero temperatures [Mer65], greatly reduce financial and environmental costs. Unfortunately, due to the wide range of possible models and approximations, and the lack of open-source [PPF+ 11], is one such theory and has been used extensively to codes, average-atom models can at times appear inaccessible. In this paper, we study materials under WDM conditions [GDRT14]. Even though present our open-source average-atom code, atoMEC. We explain the aims and DFT reformulates the Schrödinger equation in a computationally structure of atoMEC to illuminate the different stages and options in an average- efficient manner [Koh99], the cost of running calculations be- atom calculation, and to facilitate community contributions. We also discuss the comes prohibitively expensive at higher temperatures. Formally, use of various open-source Python packages in atoMEC, which have expedited it scales as O(N 3 τ 3 ), with N the particle number (which usually its development. also increases with temperature) and τ the temperature [CRNB18]. This poses a serious computational challenge in the WDM regime. Index Terms—computational physics, plasma physics, atomic physics, materi- Furthermore, although DFT is a formally exact theory, in prac- als science tice it relies on approximations for the so-called "exchange- correlation" energy, which is, roughly speaking, responsible for Introduction simulating all the quantum interactions between electrons. Exist- ing exchange-correlation approximations have not been rigorously The study of matter under extreme conditions — materials tested under WDM conditions. An alternative method used in exposed to high temperatures, high pressures, or strong elec- the WDM community is path-integral Monte–Carlo [DGB18], tromagnetic fields — is critical to our understanding of many which yields essentially exact properties; however, it is even more important scientific and technological processes, such as nuclear limited by computational cost than DFT, and becomes unfeasibly fusion and various astrophysical and planetary physics phenomena expensive at lower temperatures due to the fermion sign problem. [GFG+ 16]. Of particular interest within this broad field is the It is therefore of great interest to reduce the computational warm dense matter (WDM) regime, which is typically character- complexity of the aforementioned methods. 
The use of graphics ized by temperatures in the range of 103 − 106 degrees (Kelvin), processing units in DFT calculations is becomingly increasingly and densities ranging from dense gases to highly compressed common, and has been shown to offer significant speed-ups solids (∼ 0.01 − 1000 g cm−3 ) [BDM+ 20]. In this regime, it is relative to conventional calculations using central processing units important to account for the quantum mechanical nature of the [MED11], [JFC+ 13]. Some other examples of promising develop- electrons (and in some cases, also the nuclei). Therefore conven- ments to reduce the cost of DFT calculations include machine- tional methods from plasma physics, which either neglect quantum learning-based solutions [SRH+ 12], [BVL+ 17], [EFP+ 21] and effects or treat them coarsely, are usually not sufficiently accurate. stochastic DFT [CRNB18], [BNR13]. However, in this paper, On the other hand, methods from condensed-matter physics and we focus on an alternative class of models known as "average- quantum chemistry, which account fully for quantum interactions, atom" models. Average-atom models have a long history in plasma typically target the ground-state only, and become computationally physics [CHKC22]: they account for quantum effects, typically intractable for systems at high temperatures. using DFT, but reduce the complex system of interacting electrons Nevertheless, there are methods which can, in principle, be and nuclei to a single atom immersed in a plasma (the "average" applied to study materials at any given temperature and den- atom). An illustration of this principle (reduced to two dimensions sity whilst formally accounting for quantum interactions. These for visual purposes) is shown in Fig. 1. This significantly reduces * Corresponding author: t.callow@hzdr.de the cost relative to a full DFT simulation, because the particle ‡ Center for Advanced Systems Understanding (CASUS), D-02826 Görlitz, number is restricted to the number of electrons per nucleus, and Germany spherical symmetry is exploited to reduce the three-dimensional § Helmholtz-Zentrum Dresden-Rossendorf, D-01328 Dresden, Germany ¶ Fritz Haber Center for Molecular Dynamics and Institute of Chemistry, The problem to one dimension. Hebrew University of Jerusalem, 9091401 Jerusalem, Israel Naturally, to reduce the complexity of the problem as de- scribed, various approximations must be introduced. It is im- Copyright © 2022 Timothy J. Callow et al. This is an open-access article portant to understand these approximations and their limitations distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, for average-atom models to have genuine predictive capabilities. provided the original author and source are credited. Unfortunately, this is not always the case: although average-atom 38 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Theoretical background Properties of interest in the warm dense matter regime include the equation-of-state data, which is the relation between the density, energy, temperature and pressure of a material [HRD08]; the mean ionization state and the electron ionization energies, which tell us about how tightly bound the electrons are to the nuclei; and the electrical and thermal conductivities. These properties yield information pertinent to our understanding of stellar and planetary physics, the Earth’s core, inertial confinement fusion, and more besides. 
To exactly obtain these properties, one needs (in theory) to determine the thermodynamic ensemble of the quantum states (the so-called wave-functions) representing the electrons and nuclei. Fig. 1: Illustration of the average-atom concept. The many-body Fortunately, they can be obtained with reasonable accuracy using and fully-interacting system of electron density (shaded blue) and models such as average-atom models; in this section, we elaborate nuclei (red points) on the left is mapped into the much simpler system of independent atoms on the right. Any of these identical on how this is done. atoms represents the "average-atom". The effects of interaction from We shall briefly review the key theory underpinning the type of neighboring atoms are implicitly accounted for in an approximate average-atom model implemented in atoMEC. This is intended for manner through the choice of boundary conditions. readers without a background in quantum mechanics, to give some context to the purposes and mechanisms of the code. For a compre- hensive derivation of this average-atom model, we direct readers to Ref. [CHKC22]. The average-atom model we shall describe models share common concepts, there is no unique formal theory falls into a class of models known as ion-sphere models, which underpinning them. Therefore a variety of models and codes exist, are the simplest (and still most widely used) class of average-atom and it is not typically clear which models can be expected to model. There are alternative (more advanced) classes of model perform most accurately under which conditions. In a previous such as ion-correlation [Roz91] and neutral pseudo-atom models paper [CHKC22], we addressed this issue by deriving an average- [SS14] which we have not yet implemented in atoMEC, and thus atom model from first principles, and comparing the impact of we do not elaborate on them here. different approximations within this model on some common As demonstrated in Fig. 1, the idea of the ion-sphere model properties. is to map a fully-interacting system of many electrons and In this paper, we focus on computational aspects of average- nuclei into a set of independent atoms which do not interact atom models for WDM. We introduce atoMEC [CKTS+ 21]: explicitly with any of the other spheres. Naturally, this depends an open-source average-atom code for studying Matter under on several assumptions and approximations, but there is formal Extreme Conditions. One of the main aims of atoMEC is to im- justification for such a mapping [CHKC22]. Furthermore, there prove the accessibility and understanding of average-atom models. are many examples in which average-atom models have shown To the best of our knowledge, open-source average-atom codes good agreement with more accurate simulations and experimental are in scarce supply: with atoMEC, we aim to provide a tool that data [FB19], which further justifies this mapping. people can use to run average-atom simulations and also to add Although the average-atom picture is significantly simplified their own models, which should facilitate comparisons of different relative to the full many-body problem, even determining the approximations. The relative simplicity of average-atom codes wave-functions and their ensemble weights for an atom at finite means that they are not only efficient to run, but also efficient temperature is a complex problem. 
Fortunately, DFT reduces this to develop: this means, for example, that they can be used as a complexity further, by establishing that the electron density — a test-bed for new ideas that could be later implemented in full DFT far less complex entity than the wave-functions — is sufficient to codes, and are also accessible to those without extensive prior determine all physical observables. The most popular formulation expertise, such as students. atoMEC aims to facilitate development of DFT, known as Kohn–Sham DFT (KS-DFT) [KS65], allows us by following good practice in software engineering (for example to construct the fully-interacting density from a non-interacting extensive documentation), a careful design structure, and of course system of electrons, simplifying the problem further still. Due to through the choice of Python and its widely used scientific stack, the spherical symmetry of the atom, the non-interacting electrons in particular the NumPy [HMvdW+ 20] and SciPy [VGO+ 20] — known as KS electrons (or KS orbitals) — can be represented libraries. as a wave-function that is a product of radial and angular compo- nents, This paper is structured as follows: in the next section, we briefly review the key theoretical points which are important φnlm (r) = Xnl (r)Ylm (θ , φ ) , (1) to understand the functionality of atoMEC, assuming no prior where n, l, and m are the quantum numbers of the orbitals, which physical knowledge of the reader. Following that, we present come from the fact that the wave-function is an eigenfunction of the key functionality of atoMEC, discuss the code structure the Hamiltonian operator, and Ylm (θ , φ ) are the spherical harmonic and algorithms, and explain how these relate to the theoretical aspects introduced. Finally, we present an example case study: functions.1 The radial coordinate r represents the absolute distance we consider helium under the conditions often experienced in from the nucleus. the outer layers of a white dwarf star, and probe the behavior 1. Please note that the notation in Eq. (1) does not imply Einstein sum- of a few important properties, namely the band-gap, pressure, and mation notation. All summations in this paper are written explicitly; Einstein ionization degree. summation notation is not used. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 39 We therefore only need to determine the radial KS orbitals energy required to excite an electron bound to the nucleus to being Xnl (r). These are determined by solving the radial KS equation, a free (conducting) electron. These predicted ionization energies which is similar to the Schrödinger equation for a non-interacting can be used, for example, to help understand ionization potential system, with an additional term in the potential to mimic the depression, an important but somewhat controversial effect in effects of electron-electron interaction (within the single atom). WDM [STJ+ 14]. Another property that can be straightforwardly The radial KS equation is given by: obtained from the energy levels and their occupation numbers is 2 the mean ionization state Z̄ 2 , d 2 d l(l + 1) − + − + vs [n](r) Xnl (r) = εnl Xnl (r). (2) dr2 r dr r2 Z̄ = ∑(2l + 1) fnl (εnl , µ, τ) (6) n,l We have written the above equation in a way that emphasizes that it is an eigenvalue equation, with the eigenvalues εnl being the which is an important input parameter for various models, such energies of the KS orbitals. 
as adiabats which are used to model inertial confinement fusion On the left-hand side, the terms in the round brackets come [KDF+ 11]. from the kinetic energy operator acting on the orbitals. The vs [n](r) Various other interesting properties can also be calculated term is the KS potential, which itself is composed of three different following some post-processing of the output of an SCF cal- terms, culation, for example the pressure exerted by the electrons and Z RWS ions. Furthermore, response properties, i.e. those resulting from Z n(x)x2 δ Fxc [n] an external perturbation like a laser pulse, can also be obtained vs [n](r) = − + 4π dx + , (3) r 0 max(r, x) δ n(r) from the output of an SCF cycle. These properties include, for where RWS is the radius of the atomic sphere, n(r) is the electron example, electrical conductivities [Sta16] and dynamical structure density, Z the nuclear charge, and Fxc [n] the exchange-correlation factors [SPS+ 14]. free energy functional. Thus the three terms in the potential are respectively the electron-nuclear attraction, the classical Hartree Code structure and details repulsion, and the exchange-correlation (xc) potential. In the following sections, we describe the structure of the code We note that the KS potential and its constituents are function- in relation to the physical problem being modeled. Average-atom als of the electron density n(r). Were it not for this dependence models typically rely on various parameters and approximations. on the density, solving Eq. 2 just amounts to solving an ordinary In atoMEC, we have tried to structure the code in a way that makes linear differential equation (ODE). However, the electron density clear which parameters come from the physical problem studied is in fact constructed from the orbitals in the following way, compared to choices of the model and numerical or algorithmic n(r) = 2 ∑(2l + 1) fnl (εnl , µ, τ)|Xnl (r)|2 , (4) choices. nl atoMEC.Atom: Physical parameters where fnl (εnl , µ, τ) is the Fermi–Dirac distribution, given by The first step of any simulation in WDM (which also applies to 1 simulations in science more generally) is to define the physical fnl (εnl , µ, τ) = , (5) 1 + e(εnl −µ)/τ parameters of the problem. These parameters are unique in the where τ is the temperature, and µ is the chemical potential, which sense that, if we had an exact method to simulate the real system, is determined by fixing the number of electrons to be equal to then for each combination of these parameters there would be a a pre-determined value Ne (typically equal to the nuclear charge unique solution. In other words, regardless of the model — be Z). The Fermi–Dirac distribution therefore assigns weights to the it average atom or a different technique — these parameters are KS orbitals in the construction of the density, with the weight always required and are independent of the model. depending on their energy. In average-atom models, there are typically three parameters Therefore, the KS potential that determines the KS orbitals via defining the physical problem, which are: the ODE (2), is itself dependent on the KS orbitals. Consequently, • the atomic species; the KS orbitals and their dependent quantities (the density and • the temperature of the material, τ; KS potential) must be determined via a so-called self-consistent • the mass density of the material, ρm . field (SCF) procedure. An initial guess for the orbitals, Xnl0 (r), is used to construct the initial density n0 (r) and potential v0s (r). 
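A compact way to see how the orbitals enter the density is to code equations (4) and (5) directly. This is an illustrative sketch with made-up array shapes, not the routine atoMEC itself uses; eigenvalues, orbitals, chemical potential, and temperature are all taken to be in Hartree atomic units.

```python
import numpy as np

def fermi_dirac(eps, mu, tau):
    """Occupation factor f_nl of equation (5)."""
    return 1.0 / (1.0 + np.exp((eps - mu) / tau))

def density(eigvals, X_nl, mu, tau):
    """Electron density n(r) of equation (4).

    eigvals : eigenvalues eps_nl with shape (nmax, lmax)
    X_nl    : radial orbitals X_nl(r) with shape (nmax, lmax, ngrid)
    """
    n_r = np.zeros(X_nl.shape[-1])
    nmax, lmax = eigvals.shape
    for n in range(nmax):
        for l in range(lmax):
            occ = (2 * l + 1) * fermi_dirac(eigvals[n, l], mu, tau)
            n_r += 2.0 * occ * np.abs(X_nl[n, l]) ** 2
    return n_r
```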
The mass density also directly corresponds to the mean dis- The ODE (2) is then solved to update the orbitals. This process is tance between two nuclei (atomic centers), which in the average- iterated until some appropriately chosen quantities — in atoMEC atom model is equal to twice the radius of the atomic sphere, RWS . the total free energy, density and KS potential — are converged, An additional physical parameter not mentioned above is the net i.e. ni+1 (r) = ni (r), vi+1 i i+1 = F i , within some charge of the material being considered, i.e. the difference be- s (r) = vs (r), F reasonable numerical tolerance. In Fig. 2, we illustrate the life- tween the nuclear charge Z and the electron number Ne . However, cycle of the average-atom model described so far, including the we usually assume zero net charge in average-atom simulations SCF procedure. On the left-hand side of this figure, we show the (i.e. the number of electrons is equal to the atomic charge). physical choices and mathematical operations, and on the right- In atoMEC, these physical parameters are controlled by the hand side, the representative classes and functions in atoMEC. In Atom object. As an example, we consider aluminum under ambi- the following section, we shall discuss some aspects of this figure ent conditions, i.e. at room temperature, τ = 300 K, and normal in more detail. metallic density, ρm = 2.7 g cm−3 . We set this up as: Some quantities obtained from the completion of the SCF pro- 2. The summation in Eq. (6) is often shown as an integral because the cedure are directly of interest. For example, the energy eigenvalues energies above a certain threshold form a continuous distribution (in most εnl are related to the electron ionization energies, i.e. the amount of models). 40 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Schematic of the average-atom model set-up and the self-consistent field (SCF) cycle. On the left-hand side, the physical choices and mathematical operations that define the model and SCF cycle are shown. On the right-hand side, the (higher-order) functions and classes in atoMEC corresponding to the items on the left-hand side are shown. Some liberties are taken with the code snippets in the right-hand column of the figure to improve readability; more precisely, some non-crucial intermediate steps are not shown, and some parameters are also not shown or simplified. The dotted lines represent operations that are taken care of within the models.CalcEnergy function, but are shown nevertheless to improve understanding. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 41 Fig. 4: Auto-generated print statement from calling the models.ISModel object. Fig. 3: Auto-generated print statement from calling the atoMEC.Atom object. with a "quantum" treatment of the unbound electrons, and choose the LDA exchange functional (which is also the default). This from atoMEC import Atom Al = Atom("Al", 300, density=2.7, units_temp="K") model is set up as: from atoMEC import models By default, the above code automatically prints the output seen model = models.ISModel(Al, bc="neumann", in Fig. 3. We see that the first two arguments of the Atom object xfunc_id="lda_x", unbound="quantum") are the chemical symbol of the element being studied, and the By default, the above code prints the output shown in Fig. temperature. In addition, at least one of "density" or "radius" must 4. The first (and only mandatory) input parameter to the be specified. 
In atoMEC, the default (and only permitted) units for models.ISModel object is the Atom object that we generated the mass density are g cm−3 ; all other input and output units in earlier. Together with the optional spinpol and spinmag atoMEC are by default Hartree atomic units, and hence we specify parameters in the models.ISModel object, this sets either the "K" for Kelvin. total number of electrons (spinpol=False) or the number of The information in Fig. 3 displays the chosen parameters in electrons in each spin channel (spinpol=True). units commonly used in the plasma and condensed-matter physics The remaining information displayed in Fig. 4 shows directly communities, as well as some other information directly obtained the chosen model parameters, or the default values where these from these parameters. The chemical symbol ("Al" in this case) parameters are not specified. The exchange and correlation func- is passed to the mendeleev library [men14] to generate this data, tionals - set by the parameters xfunc_id and cfunc_id - are which is used later in the calculation. passed to the LIBXC library [LSOM18] for processing. So far, This initial stage of the average-atom calculation, i.e. the only the "local density" family of approximations is available specification of physical parameters and initialization of the Atom in atoMEC, and thus the default values are usually a sensible object, is shown in the top row at the top of Fig. 2. choice. For more information on exchange and correlation func- atoMEC.models: Model parameters tionals, there are many reviews in the literature, for example Ref. [CMSY12]. After the physical parameters are set, the next stage of the average- This stage of the average-atom calculation, i.e. the specifica- atom calculation is to choose the model and approximations within tion of the model and the choices of approximation within that, is that class of model. As discussed, so far the only class of model shown in the second row of Fig. 2. implemented in atoMEC is the ion-sphere model. Within this model, there are still various choices to be made by the user. ISModel.CalcEnergy: SCF calculation and numerical parameters In some cases, these choices make little difference to the results, Once the physical parameters and model have been defined, the but in other cases they have significant impact. The user might next stage in the average-atom calculation (or indeed any DFT have some physical intuition as to which is most important, or calculation) is the SCF procedure. In atoMEC, this is invoked alternatively may want to run the same physical parameters with by the ISModel.CalcEnergy function. This function is called several different model parameters to examine the effects. Some CalcEnergy because it finds the KS orbitals (and associated KS choices available in atoMEC, listed approximately in decreasing density) which minimize the total free energy. order of impact (but this can depend strongly on the system under Clearly, there are various mathematical and algorithmic consideration), are: choices in this calculation. These include, for example: the basis in • the boundary conditions used to solve the KS equations; which the KS orbitals and potential are represented, the algorithm • the treatment of the unbound electrons, which means used to solve the KS equations (2), and how to ensure smooth those electrons not tightly bound to the nucleus, but rather convergence of the SCF cycle. 
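The transformed equation (7) is an eigenvalue problem on the logarithmic grid. As a rough illustration of the "discretize and call a sparse eigensolver" idea described next in the text, the sketch below uses a plain second-order central difference instead of the matrix Numerov stencils of equations (9)-(13) that atoMEC's numerov module actually implements, and it assumes the simplest implicit zero boundary values rather than the selectable boundary conditions; only W(x) sampled on the grid is required.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigs

def radial_states(x, W, n_states=3):
    """Approximate the lowest eigenpairs of equation (7),
    P'' - 2 e^{2x} (W - eps) P = 0, on a uniform logarithmic grid x.

    Uses a central-difference Laplacian (not the Numerov stencils of
    eqs (9)-(13)) and implicit P = 0 at both ends of the grid.
    """
    dx = x[1] - x[0]
    ones = np.ones(x.size)
    # Second-derivative matrix d^2/dx^2.
    lap = sparse.diags([ones[:-1], -2.0 * ones, ones[:-1]], [-1, 0, 1]) / dx**2
    # Equation (7) rearranged as a standard eigenproblem:
    #   [-1/2 e^{-2x} d^2/dx^2 + W(x)] P = eps P
    H = -0.5 * sparse.diags(np.exp(-2.0 * x)) @ lap + sparse.diags(W)
    # Only a few low-lying states are needed, so use ARPACK in
    # shift-invert mode around a value below the potential minimum
    # instead of a full O(N^3) diagonalization.
    eps, P = eigs(H, k=n_states, sigma=np.min(W) - 1.0)
    order = np.argsort(eps.real)
    return eps.real[order], P[:, order].real
```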
In atoMEC, the SCF procedure delocalized over the whole atomic sphere; currently follows a single pre-determined algorithm, which we • the choice of exchange and correlation functionals, the briefly review below. central approximations of DFT [CMSY12]; In atoMEC, we represent the radial KS quantities (orbitals, • the spin polarization and magnetization. density and potential) on a logarithmic grid, i.e. x = log(r). Furthermore, we make a transformation of the orbitals Pnl (x) = We do not discuss the theory and impact of these different Xnl (x)ex/2 . Then the equations to be solved become: choices in this paper. Rather, we direct readers to Refs. [CHKC22] and [CKC22] in which all of these choices are discussed. d2 Pnl (x) − 2e2x (W (x) − εnl )Pnl (x) = 0 (7) In atoMEC, the ion-sphere model is controlled by the dx2 models.ISModel object. Continuing with our aluminum ex- 1 1 2 −2x ample, we choose the so-called "neumann" boundary condition, W (x) = vs [n](x) + l+ e . (8) 2 2 42 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) In atoMEC, we solve the KS equations using a matrix imple- a unique set of physical and model inputs — these parameters mentation of Numerov’s algorithm [PGW12]. This means we should be independently varied until some property (such as the diagonalize the following equation: total free energy) is considered suitably converged with respect to that parameter. Changing the SCF parameters should not affect the Ĥ ~P = ~ε B̂~P , where (9) final results (within the convergence tolerances), only the number Ĥ = T̂ + B̂ +Ws (~x) , (10) of iterations in the SCF cycle. 1 Let us now consider an example SCF calculation, using the T̂ = − e−2~x  , (11) 2 Atom and model objects we have already defined: Iˆ−1 − 2Iˆ0 + Iˆ1  = , and (12) from atoMEC import config dx2 config.numcores = -1 # parallelize Iˆ−1 + 10Iˆ0 + Iˆ1 B̂ = , (13) 12 nmax = 3 # max value of principal quantum number lmax = 3 # max value of angular quantum number In the above, Iˆ−1/0/1 are lower shift, identify, and upper shift matrices. # run SCF calculation The Hamiltonian matrix Ĥ is sparse and we only seek a subset scf_out = model.CalcEnergy( nmax, of eigenstates with lower energies: therefore there is no need to lmax, perform a full diagonalization, which scales as O(N 3 ), with N grid_params={"ngrid": 1500}, being the size of the radial grid. Instead, we use SciPy’s sparse ma- scf_params={"mixfrac": 0.7}, ) trix diagonalization function scipy.sparse.linalg.eigs, which scales more efficiently and allows us to go to larger grid We see that the first two parameters passed to the CalcEnergy sizes. function are the nmax and lmax quantum numbers, which specify After each step in the SCF cycle, the relative changes in the the number of eigenstates to compute. Precisely speaking, there free energy F, density n(r) and potential vs (r) are computed. is a unique Hamiltonian for each value of the angular quantum Specifically, the quantities computed are number l (and in a spin-polarized calculation, also for each F i − F i−1 spin quantum number). The sparse diagonalization routine then ∆F = (14) computes the first nmax eigenvalues for each Hamiltonian. In Fi R atoMEC, these diagonalizations can be run in parallel since they dr|ni (r) − ni−1 (r)| ∆n = R (15) are independent for each value of l. This is done by setting the drni (r) R config.numcores variable to the number of cores desired dr|vs (r) − vi−1 i s (r)| (config.numcores=-1 uses all the available cores) and han- ∆v = R i . 
(16) drvs (r) dled via the joblib library [Job20]. Once all three of these metrics fall below a certain threshold, the The remaining parameters passed to the CalcEnergy func- SCF cycle is considered converged and the calculation finishes. tion are optional; in the above, we have specified a grid size The SCF cycle is an example of a non-linear system and thus of 1500 points and a mixing fraction α = 0.7. The above code is prone to chaotic (non-convergent) behavior. Consequently a automatically prints the output seen in Fig. 5. This output shows range of techniques have been developed to ensure convergence the SCF cycle and, upon completion, the breakdown of the total [SM91]. Fortunately, the tendency for calculations not to converge free energy into its various components, as well as other useful becomes less likely for temperatures above zero (and especially information such as the KS energy levels and their occupations. as temperatures increase). Therefore we have implemented only Additionally, the output of the SCF function is a dictionary a simple linear mixing scheme in atoMEC. The potential used in containing the staticKS.Orbitals, staticKS.Density, each diagonalization step of the SCF cycle is not simply the one staticKS.Potential and staticKS.Density objects. generated from the most recent density, but a mix of that potential For example, one could extract the eigenfunctions as follows: and the previous one, orbs = scf_out["orbitals"] # orbs object vs (r) = αvis (r) + (1 − α)vi−1 (i) ks_eigfuncs = orbs.eigfuncs # eigenfunctions s (r) . (17) In general, a lower value of the mixing fraction α makes the The initialization of the SCF procedure is shown in the third and SCF cycle more stable, but requires more iterations to converge. fourth rows of Fig. 2, with the SCF procedure itself shown in the Typically a choice of α ≈ 0.5 gives a reasonable balance between remaining rows. speed and stability. This completes the section on the code structure and We can thus summarize the key parameters in an SCF calcu- algorithmic details. As discussed, with the output of an lation as follows: SCF calculation, there are various kinds of post-processing one can perform to obtain other properties of interest. So • the maximum number of eigenstates to compute, in terms far in atoMEC, these are limited to the computation of of both the principal and angular quantum numbers; the pressure (ISModel.CalcPressure), the electron • the numerical grid parameters, in particular the grid size; localization function (atoMEC.postprocess.ELFTools) • the convergence tolerances, Eqs. (14) to (16); and the Kubo–Greenwood conductivity • the SCF parameters, i.e. the mixing fraction and the (atoMEC.postprocess.conductivity). We refer maximum number of iterations. readers to our pre-print [CKC22] for details on how the electron The first three items in this list essentially control the accuracy localization function and the Kubo–Greenwood conductivity can of the calculation. In principle, for each SCF calculation — i.e. be used to improve predictions of the mean ionization state. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 43 Fig. 6: Helium density-of-states (DOS) as a function of energy, for different mass densities ρm , and at temperature τ = 50 kK. Black dots indicate the occupations of the electrons in the permitted energy ranges. Dashed black lines indicate the band-gap (the energy gap between the insulating and conducting bands). Between 5 and 6 g cm−3 , the band-gap disappears. 
and temperature) and electrical conductivity. To calculate the insulator-to-metallic transition point, the key quantity is the electronic band-gap. The concept of band- structures is a complicated topic, which we try to briefly describe in layman’s terms. In solids, electrons can occupy certain energy ranges — we call these the energy bands. In insulating materials, there is a gap between these energy ranges that electrons are forbidden from occupying — this is the so-called band-gap. In conducting materials, there is no such gap, and therefore electrons can conduct electricity because they can be excited into any part of the energy spectrum. Therefore, a simple method to determine the insulator-to-metallic transition is to determine the density at which the band-gap becomes zero. In Fig. 6, we plot the density-of-states (DOS) as a function of energy, for different densities and at fixed temperature τ = 50 kK. The DOS shows the energy ranges that the electrons are allowed to occupy; we also show the actual energies occupied by the electrons (according to Fermi–Dirac statistics) with the black dots. We can clearly see in this figure that the band-gap (the region where the DOS is zero) becomes smaller as a function of density. From Fig. 5: Auto-generated print statement from calling the this figure, it seems the transition from insulating to metallic state ISModel.CalcEnergy function happens somewhere between 5 and 6 g cm−3 . In Fig. 7, we plot the band-gap as a function of density, for a fixed temperature τ = 50 kK. Visually, it appears that the relation- Case-study: Helium ship between band-gap and density is linear at this temperature. In this section, we consider an application of atoMEC in the This is confirmed using a linear fit, which has a coefficient of WDM regime. Helium is the second most abundant element in the determination value of almost exactly one, R2 = 0.9997. Using this universe (after hydrogen) and therefore understanding its behavior fit, the band-gap is predicted to close at 5.5 g cm−3 . Also in this under a wide range of conditions is important for our under- figure, we show the fraction of ionized electrons, which is given by standing of many astrophysical processes. Of particular interest Z̄/Ne , using Eq. (6) to calculate Z̄, and Ne being the total electron are the conditions under which helium is expected to undergo a number. The ionization fraction also relates to the conductivity of transition from insulating to metallic behavior in the outer layers the material, because ionized electrons are not bound to any nuclei of white dwarfs, which are characterized by densities of around and therefore free to conduct electricity. We see that the ionization 1 − 20 g cm−3 and temperatures of 10 − 50 kK [PR20]. These fraction mostly increases with density (excepting some strange conditions are a typical example of the WDM regime. Besides behavior around ρm = 1 g cm−3 ), which is further evidence of the predicting the point at which the insulator-to-metallic transition transition from insulating to conducting behaviour with increasing occurs in the density-temperature spectrum, other properties of density. interest include equation-of-state data (relating pressure, density, As a final analysis, we plot the pressure as a function of mass 44 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) open-source scientific libraries — especially the Python libraries NumPy, SciPy, joblib and mendeleev, as well as LIBXC. 
We finish this paper by emphasizing that atoMEC is still in the early stages of development, and there are many opportunities to improve and extend the code. These include, for example: • adding new average-atom models, and different approxi- mations to the existing models.ISModel model; • optimizing the code, in particular the routines in the numerov module; • adding new postprocessing functionality, for example to compute structure factors; • improving the structure and design choices of the code. Fig. 7: Band-gap (red circles) and ionization fraction (blue squares) for helium as a function of mass density, at temperature τ = 50 kK. Of course, these are just a snapshot of the avenues for future The relationship between the band-gap and the density appears to be development in atoMEC. We are open to contributions in these linear. areas and many more besides. Acknowledgements This work was partly funded by the Center for Advanced Systems Understanding (CASUS) which is financed by Germany’s Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament. R EFERENCES [BDM+ 20] M. Bonitz, T. Dornheim, Zh. A. Moldabekov, S. Zhang, P. Hamann, H. Kählert, A. Filinov, K. Ramakrishna, and J. Vor- berger. Ab initio simulation of warm dense matter. Phys. Plas- mas, 27(4):042710, 2020. doi:10.1063/1.5143225. [BNR13] Roi Baer, Daniel Neuhauser, and Eran Rabani. Self- averaging stochastic Kohn-Sham density-functional theory. Fig. 8: Helium pressure (logarithmic scale) as a function of mass Phys. Rev. Lett., 111:106402, Sep 2013. doi:10.1103/ density and temperature. The pressure increases with density and PhysRevLett.111.106402. temperature (as expected), with a stronger dependence on density. [BVL+ 17] Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn- Sham equations with machine learning. Nature Communica- density and temperature in Fig. 8. The pressure is given by the tions, 8(1):872, Oct 2017. doi:10.1038/s41467-017- 00839-3. sum of two terms: (i) the electronic pressure, calculated using [CHKC22] T. J. Callow, S. B. Hansen, E. Kraisler, and A. Cangi. the method described in Ref. [FB19], and (ii) the ionic pressure, First-principles derivation and properties of density-functional calculated using the ideal gas law. We observe that the pressure average-atom models. Phys. Rev. Research, 4:023055, Apr 2022. doi:10.1103/PhysRevResearch.4.023055. increases with both density and temperature, which is the expected [CKC22] Timothy J. Callow, Eli Kraisler, and Attila Cangi. Accurate behavior. Under these conditions, the density dependence is much and efficient computation of mean ionization states with an stronger, especially for higher densities. average-atom Kubo-Greenwood approach, 2022. doi:10. The code required to generate the above results and plots can 48550/ARXIV.2203.05863. [CKTS+ 21] Timothy Callow, Daniel Kotik, Ekaterina Tsve- be found in this repository. toslavova Stankulova, Eli Kraisler, and Attila Cangi. atomec, August 2021. If you use this software, please cite it Conclusions and future work using these metadata. doi:10.5281/zenodo.5205719. [CMSY12] Aron J. Cohen, Paula Mori-Sánchez, and Weitao Yang. Chal- In this paper, we have presented atoMEC: an average-atom Python lenges for density functional theory. Chemical Reviews, code for studying materials under extreme conditions. 
The open- 112(1):289–320, 2012. doi:10.1021/cr200107z. [CRNB18] Yael Cytter, Eran Rabani, Daniel Neuhauser, and Roi Baer. source nature of atoMEC, and the choice to use (pure) Python as Stochastic density functional theory at finite temperatures. the programming language, is designed to improve the accessibil- Phys. Rev. B, 97:115207, Mar 2018. doi:10.1103/ ity of average-atom models. PhysRevB.97.115207. We gave significant attention to the code structure in this [DGB18] Tobias Dornheim, Simon Groth, and Michael Bonitz. The uniform electron gas at warm dense matter conditions. Phys. paper, and tried as much as possible to connect the functions Rep., 744:1 – 86, 2018. doi:10.1016/j.physrep. and objects in the code with the underlying theory. We hope that 2018.04.001. this not only improves atoMEC from a user perspective, but also [EFP+ 21] J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. facilitates new contributions from the wider average-atom, WDM Stephens, A. P. Thompson, A. Cangi, and S. Rajamanickam. Accelerating finite-temperature kohn-sham density functional and scientific Python communities. Another aim of the paper was theory with deep neural networks. Phys. Rev. B, 104:035120, to communicate how atoMEC benefits from a strong ecosystem of Jul 2021. doi:10.1103/PhysRevB.104.035120. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 45 [FB19] Gérald Faussurier and Christophe Blancard. Pressure in warm temperature density-functional theory. Phys. Rev. Lett., and hot dense matter using the average-atom model. Phys. Rev. 107:163001, Oct 2011. doi:10.1103/PhysRevLett. E, 99:053201, May 2019. doi:10.1103/PhysRevE.99. 107.163001. 053201. [PR20] Martin Preising and Ronald Redmer. Metallization of dense [GDRT14] Frank Graziani, Michael P Desjarlais, Ronald Redmer, and fluid helium from ab initio simulations. Phys. Rev. B, Samuel B Trickey. Frontiers and challenges in warm dense 102:224107, Dec 2020. doi:10.1103/PhysRevB.102. matter, volume 96. Springer Science & Business, 2014. doi: 224107. 10.1007/978-3-319-04912-0. [Roz91] Balazs F. Rozsnyai. Photoabsorption in hot plasmas based [GFG+ 16] S H Glenzer, L B Fletcher, E Galtier, B Nagler, R Alonso- on the ion-sphere and ion-correlation models. Phys. Rev. A, Mori, B Barbrel, S B Brown, D A Chapman, Z Chen, C B 43:3035–3042, Mar 1991. doi:10.1103/PhysRevA.43. Curry, F Fiuza, E Gamboa, M Gauthier, D O Gericke, A Glea- 3035. son, S Goede, E Granados, P Heimann, J Kim, D Kraus, [SM91] H. B. Schlegel and J. J. W. McDouall. Do You Have SCF Sta- M J MacDonald, A J Mackinnon, R Mishra, A Ravasio, bility and Convergence Problems?, pages 167–185. Springer C Roedel, P Sperling, W Schumaker, Y Y Tsui, J Vorberger, Netherlands, Dordrecht, 1991. doi:10.1007/978-94- U Zastrau, A Fry, W E White, J B Hasting, and H J Lee. 011-3262-6_2. Matter under extreme conditions experiments at the Linac [SPS+ 14] A. N. Souza, D. J. Perkins, C. E. Starrett, D. Saumon, and Coherent Light Source. J. Phys. B, 49(9):092001, apr 2016. S. B. Hansen. Predictions of x-ray scattering spectra for warm doi:10.1088/0953-4075/49/9/092001. dense matter. Phys. Rev. E, 89:023108, Feb 2014. doi: [HK64] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. 10.1103/PhysRevE.89.023108. Phys. Rev., 136(3B):B864–B871, Nov 1964. doi:10.1103/ [SRH+ 12] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert PhysRev.136.B864. Müller, and Kieron Burke. Finding density functionals with [HMvdW 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der + machine learning. Phys. Rev. 
Lett., 108:253002, Jun 2012. Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric doi:10.1103/PhysRevLett.108.253002. Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, [SS14] C.E. Starrett and D. Saumon. A simple method for determining Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk- the ionic structure of warm dense matter. High Energy Density wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Physics, 10:35–42, 2014. doi:10.1016/j.hedp.2013. Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin 12.001. Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, [Sta16] C.E. Starrett. Kubo–Greenwood approach to conductivity Christoph Gohlke, and Travis E. Oliphant. Array programming in dense plasmas with average atom models. High Energy with NumPy. Nature, 585(7825):357–362, September 2020. Density Physics, 19:58–64, 2016. doi:10.1016/j.hedp. doi:10.1038/s41586-020-2649-2. 2016.04.001. [HRD08] Bastian Holst, Ronald Redmer, and Michael P. Desjarlais. [STJ+ 14] Sang-Kil Son, Robert Thiele, Zoltan Jurek, Beata Ziaja, and Thermophysical properties of warm dense hydrogen using Robin Santra. Quantum-mechanical calculation of ionization- quantum molecular dynamics simulations. Phys. Rev. B, potential lowering in dense plasmas. Phys. Rev. X, 4:031004, 77:184201, May 2008. doi:10.1103/PhysRevB.77. Jul 2014. doi:10.1103/PhysRevX.4.031004. 184201. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt [JFC+ 13] Weile Jia, Jiyun Fu, Zongyan Cao, Long Wang, Xuebin Chi, Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Weiguo Gao, and Lin-Wang Wang. Fast plane wave density Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- functional theory molecular dynamics calculations on multi- fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- GPU machines. Journal of Computational Physics, 251:102– rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric 115, 2013. doi:10.1016/j.jcp.2013.05.005. Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, [Job20] Joblib Development Team. Joblib: running Python functions Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, as pipeline jobs. https://joblib.readthedocs.io/, 2020. Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- tero, Charles R. Harris, Anne M. Archibald, Antônio H. [KDF+ 11] A. L. Kritcher, T. Döppner, C. Fortmann, T. Ma, O. L. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy Landen, R. Wallace, and S. H. Glenzer. In-Flight Measure- 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for ments of Capsule Shell Adiabats in Laser-Driven Implosions. Scientific Computing in Python. Nature Methods, 17:261–272, Phys. Rev. Lett., 107:015002, Jul 2011. doi:10.1103/ 2020. doi:10.1038/s41592-019-0686-2. PhysRevLett.107.015002. [Koh99] W. Kohn. Nobel lecture: Electronic structure of matter—wave functions and density functionals. Rev. Mod. Phys., 71:1253– 1266, 10 1999. doi:10.1103/RevModPhys.71.1253. [KS65] W. Kohn and L. J. Sham. Self-consistent equations including exchange and correlation effects. Phys. Rev., 140(4A):A1133– A1138, Nov 1965. doi:10.1103/PhysRev.140. A1133. [LSOM18] Susi Lehtola, Conrad Steigemann, Micael J.T. Oliveira, and Miguel A.L. Marques. Recent developments in LIBXC — A comprehensive library of functionals for density functional theory. SoftwareX, 7:1–5, 2018. doi:10.1016/j.softx. 2017.11.002. [MED11] Stefan Maintz, Bernhard Eck, and Richard Dronskowski. Speeding up plane-wave electronic-structure calculations us- ing graphics-processing units. 
Computer Physics Communi- cations, 182(7):1421–1427, 2011. doi:10.1016/j.cpc. 2011.03.010. [men14] mendeleev – A Python resource for properties of chemical elements, ions and isotopes, ver. 0.9.0. https://github.com/ lmmentel/mendeleev, 2014. [Mer65] N. David Mermin. Thermal properties of the inhomogeneous electron gas. Phys. Rev., 137:A1441–A1443, Mar 1965. doi: 10.1103/PhysRev.137.A1441. [PGW12] Mohandas Pillai, Joshua Goglio, and Thad G. Walker. Matrix numerov method for solving schrödinger’s equation. Amer- ican Journal of Physics, 80(11):1017–1019, 2012. doi: 10.1119/1.4748813. [PPF+ 11] S. Pittalis, C. R. Proetto, A. Floris, A. Sanna, C. Bersier, K. Burke, and E. K. U. Gross. Exact conditions in finite- 46 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Automatic random variate generation in Python Christoph Baumgarten‡∗ , Tirth Patel F Abstract—The generation of random variates is an important tool that is re- • For inversion methods, the structural properties of the quired in many applications. Various software programs or packages contain underlying uniform random number generator are pre- generators for standard distributions like the normal, exponential or Gamma, served and the numerical accuracy of the methods can be e.g., the programming language R and the packages SciPy and NumPy in controlled by a parameter. Therefore, inversion is usually Python. However, it is not uncommon that sampling from new/non-standard dis- the only method applied for simulations using quasi-Monte tributions is required. Instead of deriving specific generators in such situations, so-called automatic or black-box methods have been developed. These allow Carlo (QMC) methods. the user to generate random variates from fairly large classes of distributions • Depending on the use case, one can choose between a fast by only specifying some properties of the distributions (e.g. the density and/or setup with slow marginal generation time and vice versa. cumulative distribution function). In this note, we describe the implementation of such methods from the C library UNU.RAN in the Python package SciPy and The latter point is important depending on the use case: if a provide a brief overview of the functionality. large number of samples is required for a given distribution with fixed shape parameters, a slower setup that only has to be run once Index Terms—numerical inversion, generation of random variates can be accepted if the marginal generation times are low. If small to moderate samples sizes are required for many different shape parameters, then it is important to have a fast setup. The former Introduction situation is referred to as the fixed-parameter case and the latter as The generation of random variates is an important tool that is the varying parameter case. required in many applications. Various software programs or Implementations of various methods are available in the packages contain generators for standard distributions, e.g., R C library UNU.RAN ([HL07]) and in the associated R pack- ([R C21]) and SciPy ([VGO+ 20]) and NumPy ([HMvdW+ 20]) age Runuran (https://cran.r-project.org/web/packages/Runuran/ in Python. Standard references for these algorithms are the books index.html, [TL03]). The aim of this note is to introduce the [Dev86], [Dag88], [Gen03], and [Knu14]. An interested reader Python implementation in the SciPy package that makes some will find many references to the vast existing literature in these of the key methods in UNU.RAN available to Python users in works. 
While relying on general methods such as the rejection SciPy 1.8.0. These general tools can be seen as a complement principle, the algorithms for well-known distributions are often to the existing specific sampling methods: they might lead to specifically designed for a particular distribution. This is also the better performance in specific situations compared to the existing case in the module stats in SciPy that contains more than 100 generators, e.g., if a very large number of samples are required for distributions and the module random in NumPy with more than a fixed parameter of a distribution or if the implemented sampling 30 distributions. However, there are also so-called automatic or method relies on a slow default that is based on numerical black-box methods for sampling from large classes of distributions inversion of the CDF. For advanced users, they also offer various with a single piece of code. For such algorithms, information options that allow to fine-tune the generators (e.g., to control the about the distribution such as the density, potentially together with time needed for the setup step). its derivative, the cumulative distribution function (CDF), and/or the mode must be provided. See [HLD04] for a comprehensive overview of these methods. Although the development of such Automatic algorithms in SciPy methods was originally motivated to generate variates from non- Many of the automatic algorithms described in [HLD04] and standard distributions, these universal methods have advantages [DHL10] are implemented in the ANSI C library, UNU.RAN that make their usage attractive even for sampling from standard (Universal Non-Uniform RANdom variate generators). Our goal distributions. We mention some of the important properties (see was to provide a Python interface to the most important methods [LH00], [HLD04], [DHL10]): from UNU.RAN to generate univariate discrete and continuous non-uniform random variates. The following generators have been • The algorithms can be used to sample from truncated implemented in SciPy 1.8.0: distributions. • TransformedDensityRejection: Transformed * Corresponding author: christoph.baumgarten@gmail.com ‡ Unaffiliated Density Rejection (TDR) ([H9̈5], [GW92]) • NumericalInverseHermite: Hermite interpolation Copyright © 2022 Christoph Baumgarten et al. This is an open-access article based INVersion of CDF (HINV) ([HL03]) distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • NumericalInversePolynomial: Polynomial inter- provided the original author and source are credited. polation based INVersion of CDF (PINV) ([DHL10]) AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 47 • SimpleRatioUniforms: Simple Ratio-Of-Uniforms by computing tangents at suitable design points. Note that by its (SROU) ([Ley01], [Ley03]) nature any rejection method requires not always the same number • DiscreteGuideTable: (Discrete) Guide Table of uniform variates to generate one non-uniform variate; this method (DGT) ([CA74]) makes the use of QMC and of some variance reduction methods • DiscreteAliasUrn: (Discrete) Alias-Urn method more difficult or impossible. On the other hand, rejection is often (DAU) ([Wal77]) the fastest choice for the varying parameter case. The Ratio-Of-Uniforms method (ROU, [KM77]) is another Before describing the implementation in SciPy in Section general method that relies on rejection. 
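Before turning to the ROU principle in detail, here is a small, hedged illustration of the first method in the list above. Assuming SciPy >= 1.8.0, where these classes live in scipy.stats.sampling, Transformed Density Rejection only needs the (unnormalized) density and its derivative; the class below is our own example, not code from the library:

    from scipy.stats import sampling
    from math import exp

    class StandardNormal:
        # TDR requires the unnormalized density and its derivative
        def pdf(self, x):
            return exp(-0.5 * x * x)

        def dpdf(self, x):
            return -x * exp(-0.5 * x * x)

    rng = sampling.TransformedDensityRejection(StandardNormal(), random_state=123)
    rvs = rng.rvs(10000)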
The underlying principle is that if (U, V) is uniformly distributed on the set A_f := {(u, v) : 0 < v ≤ √f(u/v), a < u/v < b}, where f is a PDF with support (a, b), then X := U/V follows a distribution according to f. In general, it is not possible to sample uniform values on A_f directly. However, if A_f ⊂ R := [u−, u+] × [0, v+] for finite constants u−, u+, v+, one can apply the rejection method: generate uniform values (U, V) on the bounding rectangle R until (U, V) ∈ A_f and return X = U/V. Automatic methods relying on the ROU method such as SROU and automatic ROU ([Ley00]) need a setup step to find a suitable region S ⊂ R² such that A_f ⊂ S and such that one can generate (U, V) uniformly on S efficiently.

Before describing the implementation in SciPy in Section scipy_impl, we give a short introduction to random variate generation in Section intro_rv_gen.

A very brief introduction to random variate generation

It is well known that random variates can be generated by inversion of the CDF F of a distribution: if U is a uniform random number on (0, 1), X := F⁻¹(U) is distributed according to F. Unfortunately, the inverse CDF can only be expressed in closed form for very few distributions, e.g., the exponential or Cauchy distribution. If this is not the case, one needs to rely on implementations of special functions to compute the inverse CDF for standard distributions like the normal, Gamma or beta distributions, or numerical methods for inverting the CDF are required. Such procedures, however, have the disadvantage that they may be slow or inaccurate, and developing fast and robust inversion algorithms such as HINV and PINV is a non-trivial task. HINV relies on Hermite interpolation of the inverse CDF and requires the CDF and PDF as an input. PINV only requires the PDF. The algorithm then computes the CDF via adaptive Gauss-Lobatto integration and an approximation of the inverse CDF using Newton's polynomial interpolation. Note that an approximation of the inverse CDF can be achieved by interpolating the points (F(x_i), x_i) for points x_i in the domain of F, i.e., no evaluation of the inverse CDF is required.

For discrete distributions, F is a step function. To compute the inverse CDF F⁻¹(U), the simplest idea would be to apply sequential search: if X takes values 0, 1, 2, ... with probabilities p_0, p_1, p_2, ..., start with j = 0 and keep incrementing j until F(j) = p_0 + ... + p_j ≥ U. When the search terminates, X = j = F⁻¹(U). Clearly, this approach is generally very slow and more efficient methods have been developed: if X takes L distinct values, DGT realizes very fast inversion using so-called guide tables / hash tables to find the index j. In contrast, DAU is not an inversion method but uses the alias method, i.e., tables are precomputed to write X as an equi-probable mixture of L two-point distributions (the alias values).

The rejection method was suggested in [VN51]. In its simplest form, assume that f is a bounded density on [a, b], i.e., f(x) ≤ M for all x ∈ [a, b]. Sample two independent uniform random variates U on [0, 1] and V on [a, b] until M · U ≤ f(V). Note that the accepted points (U, V) are uniformly distributed in the region between the x-axis and the graph of the PDF. Hence, X := V has the desired distribution f. This is a special case of the general version: if f, g are two densities on an interval J such that f(x) ≤ c · g(x) for all x ∈ J and a constant c ≥ 1, sample U uniformly distributed on [0, 1] and X distributed according to g until c · U · g(X) ≤ f(X). Then X has the desired distribution f. It can be shown that the expected number of iterations before the acceptance condition is met is equal to c. Hence, the main challenge is to find hat functions g for which c is small and from which random variates can be generated efficiently. TDR solves this problem by applying a transformation T to the density such that x ↦ T(f(x)) is concave; a hat function can then be found by computing tangents at suitable design points.

Description of the SciPy interface

SciPy provides an object-oriented API to UNU.RAN's methods. To initialize a generator, two steps are required:

1) creating a distribution class and object,
2) initializing the generator itself.

In step 1, a distribution object must be created that implements the required methods (e.g., pdf, cdf). This can either be a custom object or a distribution object from the classes rv_continuous or rv_discrete in SciPy. Once the generator is initialized from the distribution object, it provides a rvs method to sample random variates from the given distribution. It also provides a ppf method that approximates the inverse CDF if the initialized generator uses an inversion method. The following example illustrates how to initialize the NumericalInversePolynomial (PINV) generator for the standard normal distribution:

    import numpy as np
    from scipy.stats import sampling
    from math import exp

    # create a distribution class with an implementation
    # of the PDF. Note that the normalization constant
    # is not required
    class StandardNormal:
        def pdf(self, x):
            return exp(-0.5 * x**2)

    # create a distribution object and initialize the
    # generator
    dist = StandardNormal()
    rng = sampling.NumericalInversePolynomial(dist)

    # sample 100,000 random variates from the given
    # distribution
    rvs = rng.rvs(100000)

As the NumericalInversePolynomial generator uses an inversion method, it also provides a ppf method that approximates the inverse CDF:

    # evaluate the approximate PPF at a few points
    ppf = rng.ppf([0.1, 0.5, 0.9])

It is also easy to sample from a truncated distribution by passing a domain argument to the constructor of the generator. For example, to sample from the truncated normal distribution:

    # truncate the distribution by passing a
    # `domain` argument
    rng = sampling.NumericalInversePolynomial(
        dist, domain=(-1, 1)
    )

While the default options of the generators should work well in many situations, we point out that there are various parameters that the user can modify, e.g., to provide further information about the distribution (such as mode or center) or to control the numerical accuracy of the approximated PPF (u_resolution). Details can be found in the SciPy documentation https://docs.scipy.org/doc/scipy/reference/.

The above code can easily be generalized to sample from parametrized distributions using instance attributes in the distribution class. For example, to sample from the gamma distribution with shape parameter alpha, we can create the distribution class with parameters as instance attributes:

    class Gamma:
        def __init__(self, alpha):
            self.alpha = alpha

        def pdf(self, x):
            return x**(self.alpha - 1) * exp(-x)

        def support(self):
            return 0, np.inf

    # initialize a distribution object with varying
    # parameters
    dist1 = Gamma(2)
    dist2 = Gamma(3)

    # initialize a generator for each distribution
    rng1 = sampling.NumericalInversePolynomial(dist1)
    rng2 = sampling.NumericalInversePolynomial(dist2)

In the above example, the support method is used to set the domain of the distribution. This can alternatively be done by passing a domain parameter to the constructor.

In addition to continuous distributions, two UNU.RAN methods have been added in SciPy to sample from discrete distributions. In this case, the distribution can either be represented using a probability vector (which is passed to the constructor as a Python list or NumPy array) or a Python object with the implementation of the probability mass function. In the latter case, a finite domain must be passed to the constructor or the object should implement the support method¹.

    # Probability vector to represent a discrete
    # distribution. Note that the probability vector
    # need not be normalized
    pv = [0.1, 9.0, 2.9, 3.4, 0.3]

    # PCG64 uniform RNG with seed 123
    urng = np.random.default_rng(123)
    rng = sampling.DiscreteAliasUrn(
        pv, random_state=urng
    )

The available NumPy bit generators are documented at https://numpy.org/doc/stable/reference/random/bit_generators/index.html. To change the uniform random number generator, a random_state parameter can be passed as shown in the example below:

    # 64-bit PCG random number generator in NumPy
    urng = np.random.Generator(np.random.PCG64())
    # The above line can also be replaced by:
    # ``urng = np.random.default_rng()``
    # as PCG64 is the default generator starting
    # from NumPy 1.19.0

    # change the uniform random number generator by
    # passing the `random_state` argument
    rng = sampling.NumericalInversePolynomial(
        dist, random_state=urng
    )

We also point out that the PPF of inversion methods can be applied to sequences of quasi-random numbers. SciPy provides different sequences in its QMC module (scipy.stats.qmc). NumericalInverseHermite provides a qrvs method which generates random variates using QMC methods present in SciPy (scipy.stats.qmc) as uniform random number generators³. The next example illustrates how to use qrvs with a generator created directly from a SciPy distribution object.

    from scipy import stats
    from scipy.stats import qmc

    # 1D Halton sequence generator.
    qrng = qmc.Halton(d=1)

    rng = sampling.NumericalInverseHermite(stats.norm())

    # generate quasi random numbers using the Halton
    # sequence as uniform variates
    qrvs = rng.qrvs(size=100, qmc_engine=qrng)

Benchmarking

To analyze the performance of the implementation, we tested the methods applied to several standard distributions against the generators in NumPy and the original UNU.RAN C library. In addition, we selected one non-standard distribution to demonstrate that substantial reductions in the runtime can be achieved compared to other implementations. All the benchmarks were carried out using NumPy 1.22.4 and SciPy 1.8.1 running on a single core on Ubuntu 20.04.3 LTS with an Intel(R) Core(TM) i7-8750H CPU (2.20 GHz clock speed, 16 GB RAM). We ran the benchmarks with NumPy's MT19937 (Mersenne Twister) and PCG64 random number generators (np.random.MT19937 and np.random.PCG64) in Python and used NumPy's C implementation of MT19937 in the UNU.RAN C benchmarks. As explained above, the use of PCG64 is recommended, and MT19937 is only included to compare the speed of the Python implementation and the C library by relying on the same uniform number generator (i.e., differences in the performance of the uniform number generation are not taken into account).
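A simplified sketch of how such a setup-versus-sampling comparison can be timed (not the benchmark scripts themselves, which are linked below) is:

    import time
    import numpy as np
    from scipy.stats import sampling
    from math import exp

    class StandardNormal:
        def pdf(self, x):
            return exp(-0.5 * x * x)

    t0 = time.perf_counter()
    rng = sampling.NumericalInversePolynomial(
        StandardNormal(), random_state=np.random.default_rng(0)
    )
    t1 = time.perf_counter()
    rvs = rng.rvs(1_000_000)  # large sample: sampling time dominates the setup
    t2 = time.perf_counter()
    print(f"setup: {(t1 - t0) * 1e3:.1f} ms, sampling: {(t2 - t1) * 1e3:.1f} ms")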
The code for all the benchmarks can be found on # sample from the given discrete distribution rvs = rng.rvs(100000) https://github.com/tirthasheshpatel/unuran_benchmarks. The methods used in NumPy to generate normal, gamma, and beta random variates are: Underlying uniform pseudo-random number generators NumPy provides several generators for uniform pseudo-random • the ziggurat algorithm ([MT00b]) to sample from the numbers2 . It is highly recommended to use NumPy’s default standard normal distribution, random number generator np.random.PCG64 for better speed 2. By default, NumPy’s legacy random number generator, MT19937 and performance, see [O’N14] and https://numpy.org/doc/stable/ (np.random.RandomState()) is used as the uniform random number generator for consistency with the stats module in SciPy. 1. Support for discrete distributions with infinite domain hasn’t been added 3. In SciPy 1.9.0, qrvs will be added to yet. NumericalInversePolynomial. AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 49 • the rejection algorithms in Chapter XII.2.6 in [Dev86] if 70-200 times faster. This clearly shows the benefit of using a α < 1 and in [MT00a] if α > 1 for the Gamma distribution, black-box algorithm. • Johnk’s algorithm ([Jöh64], Section IX.3.5 in [Dev86]) if max{α, β } ≤ 1, otherwise a ratio of two Gamma variates Conclusion with shape parameter α and β (see Section IX.4.1 in The interface to UNU.RAN in SciPy provides easy access to [Dev86]) for the beta distribution. different algorithms for non-uniform variate generation for large Benchmarking against the normal, gamma, and beta distributions classes of univariate continuous and discrete distributions. We have shown that the methods are easy to use and that the al- Table 1 compares the performance for the standard normal, gorithms perform very well both for standard and non-standard Gamma and beta distributions. We recall that the density of the distributions. A comprehensive documentation suite, a tutorial Gamma distribution with shape parameter a > 0 is given by and many examples are available at https://docs.scipy.org/doc/ x ∈ (0, ∞) 7→ xa−1 e−x and the density of the beta distribution with α−1 (1−x)β −1 scipy/reference/stats.sampling.html and https://docs.scipy.org/doc/ shape parameters α, β > 0 is given by x ∈ (0, 1) 7→ x B(α,β ) scipy/tutorial/stats/sampling.html. Various methods have been im- where Γ(·) and B(·, ·) are the Gamma and beta functions. The plemented in SciPy, and if specific use cases require additional results are reported in Table 1. functionality from UNU.RAN, the methods can easily be added We summarize our main observations: to SciPy given the flexible framework that has been developed. 1) The setup step in Python is substantially slower than Another area of further development is to better integrate SciPy’s in C due to expensive Python callbacks, especially for QMC generators for the inversion methods. PINV and HINV. However, the time taken for the setup is Finally, we point out that other sampling methods like Markov low compared to the sampling time if large samples are Chain Monte Carlo and copula methods are not part of SciPy. Rel- drawn. Note that as expected, SROU has a very fast setup evant Python packages in that context are PyMC ([PHF10]), PyS- such that this method is suitable for the varying parameter tan relying on Stan ([Tea21]), Copulas (https://sdv.dev/Copulas/) case. and PyCopula (https://blent-ai.github.io/pycopula/). 
2) The sampling time in Python is slightly higher than in C for the MT19937 random number generator. If the Acknowledgments recommended PCG64 generator is used, the sampling The authors wish to thank Wolfgang Hörmann and Josef Leydold time in Python is slightly lower. The only exception for agreeing to publish the library under a BSD license and for is SROU: due to Python callbacks, the performance is helpful feedback on the implementation and this note. In addition, substantially slower than in C. However, as the main we thank Ralf Gommers, Matt Haberland, Nicholas McKibben, advantage of SROU is the fast setup time, the main use Pamphile Roy, and Kai Striega for their code contributions, re- case is the varying parameter case (i.e., the method is not views, and helpful suggestions. The second author was supported supposed to be used to generate large samples). by the Google Summer of Code 2021 program5 . 3) PINV, HINV, and TDR are at most about 2x slower than the specialized NumPy implementation for the normal R EFERENCES distribution. For the Gamma and beta distribution, they even perform better for some of the chosen shape pa- [CA74] Hui-Chuan Chen and Yoshinori Asau. On gener- ating random variates from an empirical distribution. rameters. These results underline the strong performance AIIE Transactions, 6(2):163–166, 1974. doi:10.1080/ of these black-box approaches even for standard distribu- 05695557408974949. tions. [Dag88] John Dagpunar. Principles of random variate generation. 4) While the application of PINV requires bounded densi- Oxford University Press, USA, 1988. [Dev86] Luc Devroye. Non-Uniform Random Variate Generation. ties, no issues are encountered for α = 0.05 since the Springer-Verlag, New York, 1986. doi:10.1007/978-1- unbounded part is cut off by the algorithm. However, the 4613-8643-8. setup can fail for very small values of α. [DHL10] Gerhard Derflinger, Wolfgang Hörmann, and Josef Leydold. Random variate generation by numerical inversion when only the density is known. ACM Transactions on Modeling and Benchmarking against a non-standard distribution Computer Simulation (TOMACS), 20(4):1–25, 2010. doi: We benchmark the performance of PINV to sample from the 10.1145/1842722.1842723. [Gen03] James E Gentle. Random number generation and Monte Carlo generalized normal distribution ([Sub23]) whose density is given p methods, volume 381. Springer, 2003. doi:10.1007/ pe−|x| by x ∈ (−∞, ∞) 7→ 2Γ(1/p) against the method proposed in [NP09] b97336. and against the implementation in SciPy’s gennorm distribu- [GW92] Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: tion. The approach in [NP09] relies on transforming Gamma Series C (Applied Statistics), 41(2):337–348, 1992. doi:10. variates to the generalized normal distribution whereas SciPy 2307/2347565. relies on computing the inverse of CDF of the Gamma distri- [H9̈5] Wolfgang Hörmann. A rejection technique for sampling from bution (https://docs.scipy.org/doc/scipy/reference/generated/scipy. T-concave distributions. ACM Trans. Math. Softw., 21(2):182– 193, 1995. doi:10.1145/203082.203089. special.gammainccinv.html). The results for different values of p [HL03] Wolfgang Hörmann and Josef Leydold. Continuous random are shown in Table 2. variate generation by fast numerical inversion. ACM Trans- PINV is usually about twice as fast than the special- actions on Modeling and Computer Simulation (TOMACS), 13(4):347–362, 2003. doi:10.1145/945511.945517. 
ized method and about 15-150 times faster than SciPy’s implementation4 . We also found an R package pgnorm (https: 4. In SciPy 1.9.0, the speed will be improved by implementing the method //cran.r-project.org/web/packages/pgnorm/) that implements vari- from [NP09] ous approaches from [KR13]. In that case, PINV is usually about 5. https://summerofcode.withgoogle.com/projects/#5912428874825728 50 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Python C Distribution Method Setup Sampling (PCG64) Sampling (MT19937) Setup Sampling (MT19937) PINV 4.6 29.6 36.5 0.27 32.4 HINV 2.5 33.7 40.9 0.38 36.8 Standard normal TDR 0.2 37.3 47.8 0.02 41.4 SROU 8.7 µs 2510 2160 0.5 µs 232 NumPy - 17.6 22.4 - - PINV 196.0 29.8 37.2 37.9 32.5 Gamma(0.05) HINV 24.5 36.1 43.8 1.9 40.7 NumPy - 55.0 68.1 - - PINV 16.5 31.2 38.6 2.0 34.5 Gamma(0.5) HINV 4.9 34.2 41.7 0.6 37.9 NumPy - 86.4 99.2 - - PINV 5.3 30.8 38.7 0.5 34.6 HINV 5.3 33 40.6 0.4 36.8 Gamma(3.0) TDR 0.2 38.8 49.6 0.03 44 NumPy - 36.5 47.1 - - PINV 21.4 33.1 39.9 2.4 37.3 Beta(0.5, 0.5) HINV 2.1 38.4 45.3 0.2 42 NumPy - 101 112 - - HINV 0.2 37 44.3 0.01 41.1 Beta(0.5, 1.0) NumPy - 125 138 - - PINV 15.7 30.5 37.2 1.7 34.3 HINV 4.1 33.4 40.8 0.4 37.1 Beta(1.3, 1.2) TDR 0.2 46.8 57.8 0.03 45 NumPy - 74.3 97 - - PINV 9.7 30.2 38.2 0.9 33.8 HINV 5.8 33.7 41.2 0.4 37.4 Beta(3.0, 2.0) TDR 0.2 42.8 52.8 0.02 44 NumPy - 72.6 92.8 - - TABLE 1 Average time taken (reported in milliseconds, unless mentioned otherwise) to sample 1 million random variates from the standard normal distribution. The mean is computed over 7 iterations. Standard deviations are not reported as they were very small (less than 1% of the mean in the large majority of cases). Note that not all methods can always be applied, e.g., TDR cannot be applied to the Gamma distribution if a < 1 since the PDF is not log-concave in that case. As NumPy uses rejection algorithms with precomputed constants, no setup time is reported. p 0.25 0.45 0.75 1 1.5 2 5 8 Nardon and Pianca (2009) 100 101 101 45 148 120 128 122 SciPy’s gennorm distribution 832 1000 1110 559 5240 6720 6230 5950 Python (PINV Method, PCG64 urng) 50 47 45 41 40 37 38 38 TABLE 2 Comparing SciPy’s implementation and a specialized method against PINV to sample 1 million variates from the generalized normal distribution for different values of the parameter p. Time reported in milliseconds. The mean is computer over 7 iterations. [HL07] Wolfgang Hörmann and Josef Leydold. UNU.RAN - Univer- ates. ACM Transactions on Mathematical Software (TOMS), sal Non-Uniform RANdom number generators, 2007. https: 3(3):257–260, 1977. doi:10.1145/355744.355750. //statmath.wu.ac.at/unuran/doc.html. [Knu14] Donald E Knuth. The Art of Computer Programming, Volume [HLD04] Wolfgang Hörmann, Josef Leydold, and Gerhard Derflinger. 2: Seminumerical algorithms. Addison-Wesley Professional, Automatic nonuniform random variate generation. Springer, 2014. doi:10.2307/2317055. 2004. doi:10.1007/978-3-662-05946-3. [KR13] Steve Kalke and W-D Richter. Simulation of the p-generalized [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Gaussian distribution. Journal of Statistical Computation Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric and Simulation, 83(4):641–667, 2013. doi:10.1080/ Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, 00949655.2011.631187. Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van [Ley00] Josef Leydold. 
Automatic sampling with the ratio-of-uniforms Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del method. ACM Transactions on Mathematical Software Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, (TOMS), 26(1):78–98, 2000. doi:10.1145/347837. Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer 347863. Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro- [Ley01] Josef Leydold. A simple universal generator for continuous gramming with NumPy. Nature, 585(7825):357–362, 2020. and discrete univariate T-concave distributions. ACM Transac- doi:10.1038/s41586-020-2649-2. tions on Mathematical Software (TOMS), 27(1):66–82, 2001. [Jöh64] MD Jöhnk. Erzeugung von betaverteilten und gammaverteilten doi:10.1145/382043.382322. Zufallszahlen. Metrika, 8(1):5–15, 1964. doi:10.1007/ [Ley03] Josef Leydold. Short universal generators via generalized bf02613706. ratio-of-uniforms method. Mathematics of Computation, [KM77] Albert J Kinderman and John F Monahan. Computer gen- 72(243):1453–1471, 2003. doi:10.1090/s0025-5718- eration of random variables using the ratio of uniform devi- 03-01511-4. AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 51 [LH00] Josef Leydold and Wolfgang Hörmann. Universal algorithms as an alternative for generating non-uniform continuous ran- dom variates. In Proceedings of the International Conference on Monte Carlo Simulation 2000., pages 177–183, 2000. [MT00a] George Marsaglia and Wai Wan Tsang. A simple method for generating gamma variables. ACM Transactions on Math- ematical Software (TOMS), 26(3):363–372, 2000. doi: 10.1145/358407.358414. [MT00b] George Marsaglia and Wai Wan Tsang. The ziggurat method for generating random variables. Journal of statistical soft- ware, 5(1):1–7, 2000. doi:10.18637/jss.v005.i08. [NP09] Martina Nardon and Paolo Pianca. Simulation techniques for generalized Gaussian densities. Journal of Statistical Computation and Simulation, 79(11):1317–1329, 2009. doi: 10.1080/00949650802290912. [O’N14] Melissa E. O’Neill. PCG: A family of simple fast space- efficient statistically good algorithms for random number gen- eration. Technical Report HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, September 2014. [PHF10] Anand Patil, David Huard, and Christopher J Fonnesbeck. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35(4):1, 2010. doi:10.18637/jss. v035.i04. [R C21] R Core Team. R: A language and environment for statistical computing, 2021. https://www.R-project.org/. [Sub23] M.T. Subbotin. On the law of frequency of error. Mat. Sbornik, 31(2):296–301, 1923. [Tea21] Stan Development Team. Stan modeling language users guide and reference manual, version 2.28., 2021. https://mc-stan.org. [TL03] Günter Tirler and Josef Leydold. Automatic non-uniform random variate generation in r. In Proceedings of DSC, page 2, 2003. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, pages 1–12, 2020. doi:10.1038/ s41592-019-0686-2. [VN51] John Von Neumann. Various techniques used in connection with random digits. Appl. Math Ser, 12(36-38):3, 1951. [Wal77] Alastair J Walker. An efficient method for generating discrete random variables with general distributions. ACM Transac- tions on Mathematical Software (TOMS), 3(3):253–256, 1977. doi:10.1145/355744.355749. 52 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
(SCIPY 2022) Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger Materials Suite Alexandr Fonari‡∗ , Farshad Fallah‡ , Michael Rauch‡ F Abstract—The use of several open source scientific packages in the open-source and many of which blend the two to optimize capa- Schrödinger Materials Science Suite will be discussed. A typical workflow for bilities and efficiency. For example, the main simulation engine materials discovery will be described, discussing how open source packages for molecular quantum mechanics is the Jaguar [BHH+ 13] pro- have been incorporated at every stage. Some recent implementations of ma- prietary code. The proprietary classical molecular dynamics code chine learning for materials discovery will be discussed, as well as how open Desmond (distributed by Schrödinger, Inc.) [SGB+ 14] is used to source packages were leveraged to achieve results faster and more efficiently. obtain physical properties of soft materials, surfaces and polymers. Index Terms—materials, active learning, OLED, deposition, evaporation For periodic quantum mechanics, the main simulation engine is the open source code Quantum ESPRESSO (QE) [GAB+ 17]. One of the co-authors of this proceedings (A. Fonari) contributes to Introduction the QE code in order to make integration with the Materials Suite more seamless and less error-prone. As part of this integration, A common materials discovery practice or workflow is to start support for using the portable XML format for input and output with reading an experimental structure of a material or generating in QE has been implemented in the open source Python package a structure in silico, computing its properties of interest (e.g. qeschema [BDBF]. elastic constants, electrical conductivity), tuning the material by Figure 2 gives an overview of some of the various products that modifying its structure (e.g. doping) or adding and removing compose the Schrödinger Materials Science Suite. The various atoms (deposition, evaporation), and then recomputing the proper- workflows are implemented mainly in Python (some of them ties of the modified material (Figure 1). Computational materials described below), calling on proprietary or open-source code discovery leverages such workflows to empower researchers to where appropriate, to improve the performance of the software explore vast design spaces and uncover root causes without (or in and reduce overall maintenance. conjunction with) laboratory experimentation. The materials discovery cycle can be run in a high-throughput Software tools for computational materials discovery can be manner, enumerating different structure modifications in a system- facilitated by utilizing existing libraries that cover the fundamental atic fashion, such as doping ratio in a semiconductor or depositing mathematics used in the calculations in an optimized fashion. This different adsorbates. As we will detail herein, there are several use of existing libraries allows developers to devote more time open source packages that allow the user to generate a large to developing new features instead of re-inventing established number of structures, run calculations in high throughput manner methods. As a result, such a complementary approach improves and analyze the results. For example, the open source package the performance of computational materials software and reduces pymatgen [ORJ+ 13] facilitates generation and analysis of periodic overall maintenance. structures. 
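To make the shape of such a discovery workflow concrete, a deliberately generic sketch follows; every name in it is a placeholder for the corresponding builder, simulation engine, or analysis step, not an actual Schrödinger or open-source API:

    # Hypothetical skeleton of a materials-discovery loop: enumerate structure
    # modifications, compute properties for each candidate, keep promising ones.
    def discovery_loop(seed_structure, modifications, compute_properties, is_promising):
        results = {}
        for label, modify in modifications.items():
            candidate = modify(seed_structure)              # e.g. doping, deposition
            results[label] = compute_properties(candidate)  # e.g. a QE or Desmond job
        return {label: props for label, props in results.items() if is_promising(props)}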
It can generate inputs for and read outputs of QE, the The Schrödinger Materials Science Suite [LLC22] is a propri- commercial codes VASP and Gaussian, and several other formats. etary computational chemistry/physics platform that streamlines To run and manage workflow jobs in a high-throughput manner, materials discovery workflows into a single graphical user inter- open source packages such as Custodian [ORJ+ 13] and AiiDA face (Materials Science Maestro). The interface is a single portal [HZU+ 20] can be used. for structure building and enumeration, physics-based modeling and machine learning, visualization and analysis. Tying together the various modules are a wide variety of scientific packages, some Materials import and generation of which are proprietary to Schrödinger, Inc., some of which are For reading and writing of material structures, several open source packages (e.g. OpenBabel [OBJ+ 11], RDKit [LTK+ 22]) have * Corresponding author: sasha.fonari@schrodinger.com ‡ Schrödinger Inc., 1540 Broadway, 24th Floor. New York, NY 10036 implemented functionality for working with several commonly used formats (e.g. CIF, PDB, mol, xyz). Periodic structures Copyright © 2022 Alexandr Fonari et al. This is an open-access article of materials, mainly coming from single crystal X-ray/neutron distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, diffraction experiments, are distributed in CIF (Crystallographic provided the original author and source are credited. Information File), PDB (Protein Data Bank) and lately mmCIF UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 53 Fig. 1: Example of a workflow for computational materials discovery. Fig. 2: Some example products that compose the Schrödinger Materials Science Suite. formats [WF05]. Correctly reading experimental structures is of work went into this project) and others to correctly read and significant importance, since the rest of the materials discovery convert periodic structures in OpenBabel. By version 3.1.1 (the workflow depends on it. In addition to atom coordinates and most recent at writing time), the authors are not aware of any periodic cell information, structural data also contains symme- structures read incorrectly by OpenBabel. In general, non-periodic try operations (listed explicitly or by the means of providing molecular formats are simpler to handle because they only contain a space group) that can be used to decrease the number of atom coordinates but no cell or symmetry information. OpenBabel computations required for a particular system by accounting for has Python bindings but due to the GPL license limitation, it is symmetry. This can be important, especially when scaling high- called as a subprocess from the Schrödinger Materials Suite. throughput calculations. From file, structure is read in a structure Another important consideration in structure generation is object through which atomic coordinates (as a NumPy array) and modeling of substitutional disorder in solid alloys and materials chemical information of the material can be accessed and updated. with point defects (intermetallics, semiconductors, oxides and Structure object is similar to the one implemented in open source their crystalline surfaces). In such cases, the unit cell and atomic packages such as pymatgen [ORJ+ 13] and ASE [LMB+ 17]. 
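As a small illustration of that kind of structure object, using the open-source pymatgen API rather than the proprietary one (the file name is hypothetical):

    from pymatgen.core import Structure
    from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

    # "material.cif" is a placeholder; from_file also reads POSCAR, CSSR and
    # other common formats
    st = Structure.from_file("material.cif")

    print(st.lattice)          # periodic cell information
    print(st.cart_coords[:3])  # Cartesian coordinates as a NumPy array
    print(st.composition.reduced_formula)

    # symmetry information, which can be used to reduce the number of calculations
    print(SpacegroupAnalyzer(st).get_space_group_symbol())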
All sites of the crystal or surface slab are well defined while the chem- the structure manipulations during the workflows are done by ical species occupying the site may vary. In order to simulate sub- using structure object interface (see structure deformation example stitutional disorder, one must generate the ensemble of structures below). Example of Structure object definition in pymatgen: that includes all statistically significant atomic distributions in a class Structure: given unit cell. This can be achieved by a brute force enumeration of all symmetrically unique atomic structures with a given number def __init__(self, lattice, species, coords, ...): of vacancies, impurities or solute atoms. The open source library """Create a periodic structure.""" enumlib [HF08] implements algorithms for such a systematic One consideration of note is that PDB, CIF and mmCIF structure enumeration of periodic structures. The enumlib package consists formats allow description of the positional disorder (for example, of several Fortran binaries and Python scripts that can be run as a a solvent molecule without a stable position within the cell subprocess (no Python bindings). This allows the user to generate which can be described by multiple sets of coordinates). Another a large set of symmetrically nonequivalent materials with different complication is that experimental data spans an interval of almost compositions (e.g. doping or defect concentration). a century: one of the oldest crystal structures deposited in the Recently, we applied this approach in simultaneous study of Cambridge Structural Database (CSD) [GBLW16] dates to 1924 the activity and stability of Pt based core-shell type catalysts for [HM24]. These nuances and others present nontrivial technical the oxygen reduction reaction [MGF+ 19]. We generated a set of challenges for developers. Thus, it has been a continuous effort stable doped Pt/transition metal/nitrogen surfaces using periodic by Schrödinger, Inc. (at least 39 commits and several weeks of enumeration. Using QE to perform periodic density functional 54 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Jaguar that took 457,265 CPU hours (~52 years) [MAS+ 20]. An- other similar case study is the high-throughput molecular dynam- ics simulations (MD) of thermophysical properties of polymers for various applications [ABG+ 21]. There, using Desmond we com- puted the glass transition temperature (Tg ) of 315 polymers and compared the results with experimental measurements [Bic02]. This study took advantage of GPU (graphics processing unit) support as implemented in Desmond, as well as the job scheduler API described above. Other workflows implemented in the Schrödinger Materials Science Suite utilize open source packages as well. For soft mate- rials (polymers, organic small molecules and substrates composed of soft molecules), convex hull and related mathematical methods Fig. 3: Example of the job submission process. are important for finding possible accessible solvent voids (during submerging or sorption) and adsorbate sites (during molecular deposition). These methods are conveniently implemented in the theory (DFT) calculations, we assessed surface phase diagrams open source SciPy [VGO+ 20] and NumPy [HMvdW+ 20] pack- for Pt alloys and identified the avenues for stabilizing the cost ages. 
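A minimal sketch of the kind of geometric test these SciPy tools enable (illustrative random coordinates only; the actual deposition and evaporation workflows described next embed this in the Desmond-based logic) is:

    import numpy as np
    from scipy.spatial import ConvexHull, Delaunay

    # illustrative Cartesian coordinates (Angstrom) of surface/substrate atoms
    points = np.random.default_rng(7).uniform(0.0, 10.0, size=(60, 3))

    hull = ConvexHull(points)
    print("hull volume (A^3):", hull.volume)

    # test whether a candidate adsorbate position lies inside the hull
    candidate = np.array([5.0, 5.0, 12.0])
    inside = Delaunay(points[hull.vertices]).find_simplex(candidate) >= 0
    print("candidate inside hull:", bool(inside))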
Thus, we implemented molecular deposition and evaporation effective core-shell systems by a judicious choice of the catalyst workflows by using the Desmond MD engine as the backend core material. Such catalysts may prove critical in electrocatalysis in tandem with the convex hull functionality. This workflow for fuel cell applications. enables simulation of the deposition and evaporation of the small molecules on a substrate. We utilized the aforementioned deposition workflow in the study of organic light-emitting diodes Workflow capabilities (OLEDs), which are fabricated using a stepwise process, where In the last section, we briefly described a complete workflow from new layers are deposited on top of previous layers. Both vacuum structure generation and enumeration to periodic DFT calculations and solution deposition processes have been used to prepare these to analysis. In order to be able to run a massively parallel films, primarily as amorphous thin film active layers lacking screening of materials, a highly scalable and stable queuing system long-range order. Each of these deposition techniques introduces (job scheduler) is required. We have implemented a job queuing changes to the film structure and consequently, different charge- system on top of the most used queuing systems (LSF, PBS, transfer and luminescent properties [WKB+ 22]. SGE, SLURM, TORQUE, UGE) and exposed a Python API to As can be seen from above, a workflow is usually some submit and monitor jobs. In line with technological advancements, sort of structure modification through the structure object with cloud is also supported by means of a virtual cluster configured a subsequent call to a backend code and analysis of its output if with SLURM. This allows the user to submit a large number it succeeds. Input for the next iteration depends on the output of jobs, limited only by SLURM scheduling capabilities and of the previous iteration in some workflows. Due to the large cloud resources. In order to accommodate job dependencies in chemical and manipulation space of the materials, sometimes it workflows, for each job, a parent job (or multiple parent jobs) can very tricky to keep code for all workflows follow the same code be defined forming a directed graph of jobs (Figure 3). logic. For every workflow and/or functionality in the Materials There could be several reasons for a job to fail. Depending Science Suite, some sort of peer reviewed material (publication, on the reason of failure, there are several restart and recovery conference presentation) is created where implemented algorithms mechanisms in place. The lowest level is the restart mechanism are described to facilitate reproducibility. (in SLURM it is called requeue) which is performed by the queuing system itself. This is triggered when a node goes down. Data fitting algorithms and use cases On the cloud, preemptible instances (nodes) can go offline at any moment. In addition, workflows implemented in the proprietary Materials simulation engines for QM, periodic DFT, and classical Schrödinger Materials Science Suite have built-in methods for MD (referred to herein as backends) are frequently written in handling various types of failure. For example, if the simulation compiled languages with enabled parallelization for CPU or GPU is not converging to a requested energy accuracy, it is wasteful hardware. These backends are called from Python workflows to blindly restart the calculation without changing some input using the job queuing systems described above. 
Meanwhile, pack- parameters. However, in the case of a failure due to full disk ages such as SciPy and NumPy provide sophisticated numerical space, it is reasonable to try restart with hopes to get a node with function optimization and fitting capabilities. Here, we describe more empty disk space. If a job fails (and cannot be restarted), examples of how the Schrödinger suite can be used to combine all its children (if any) will not start, thus saving queuing and materials simulations with popular optimization routines in the computational time. SciPy ecosystem. Having developed robust systems for running calculations, job Recently we implemented convex analysis of queuing and troubleshooting (autonomously, when applicable), the stress strain curve (as described here [PKD18]). the developed workflows have allowed us and our customers to scipy.optimize.minimize is used for a constrained perform massive screenings of materials and their properties. For minimization with boundary conditions of a function related to example, we reported a massive screening of 250,000 charge- the stress strain curve. The stress strain curve is obtained from a conducting organic materials, totaling approximately 3,619,000 series of MD simulations on deformed cells (cell deformations DFT SCF (self-consistent field) single-molecule calculations using are defined by strain type and deformation step). The pressure UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 55 tensor of a deformed cell is related to stress. This analysis allowed and AutoQSAR [DDS+ 16] from the Schrödinger suite. Depending prediction of elongation at yield for high density polyethylene on the type of materials, benchmark data can be obtained using polymer. Figure 4 shows obtained calculated yield of 10% vs. different codes available in the Schrödinger suite: experimental value within 9-18% range [BAS+ 20]. • small molecules and finite systems - Jaguar The scipy.optimize package is used for a least-squares • periodic systems - Quantum ESPRESSO fit of the bulk energies at different cell volumes (compressed • larger polymeric and similar systems - Desmond and expanded) in order to obtain the bulk modulus and equation of state (EOS) of a material. In the Schrödinger suite this was Different materials systems require different descriptors for implemented as a part of an EOS workflow, in which fitting is featurization. For example, for crystalline periodic systems, we performed on the results obtained from a series of QE calculations have implemented several sets of tailored descriptors. Genera- performed on the original as well as compressed and expanded tion of these descriptors again uses a mix of open source and (deformed) cells. An example of deformation applied to a structure Schrödinger proprietary tools. 
Specifically: in pymatgen: • elemental features such as atomic weight, number of from pymatgen.analysis.elasticity import strain valence electrons in s, p and d-shells, and electronegativity from pymatgen.core import lattice from pymatgen.core import structure • structural features such as density, volume per atom, and packing fraction descriptors implemented in the open deform = strain.Deformation([ source matminer package [WDF+ 18] [1.0, 0.02, 0.02], • intercalation descriptors such as cation and anion counts, [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]) crystal packing fraction, and average neighbor ionicity [SYC+ 17] implemented in the Schrödinger suite latt = lattice.Lattice([ • three-dimensional smooth overlap of atomic positions [3.84, 0.00, 0.00], [1.92, 3.326, 0.00], (SOAP) descriptors implemented in the open source [0.00, -2.22, 3.14], DScribe package [HJM+ 20]. ]) We are currently training models that use these descriptors st = structure.Structure( to predict properties, such as bulk modulus, of a set of Li- latt, containing battery related compounds [Cha]. Several models will ["Si", "Si"], [[0, 0, 0], [0.75, 0.5, 0.75]]) be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+ 11]) and AutoQSAR. strained_st = deform.apply_to_structure(st) For isolated small molecules and extended non-periodic sys- This is also an example of loosely coupled (embarrassingly tems, RDKit can be used to generate a large number of atomic and parallel) jobs. In particular, calculations of the deformed cells molecular descriptors. A lot of effort has been devoted to ensure only depend on the bulk calculation and do not depend on each that RDKit can be used on a wide variety of materials that are other. Thus, all the deformation jobs can be submitted in parallel, supported by the Schrödinger suite. At the time of writing, the 4th facilitating high-throughput runs. most active contributor to RDKit is Ricardo Rodriguez-Schmidt Structure refinement from powder diffraction experiment is an- from Schrödinger [RDK]. other example where more complex optimization is used. Powder Recently, active learning (AL) combined with DFT has re- diffraction is a widely used method in drug discovery to assess ceived much attention to address the challenge of leveraging purity of the material and discover known or unknown crystal exhaustive libraries in materials informatics [VPB21], [SPA+ 19]. polymorphs [KBD+ 21]. In particular, there is interest in fitting of On our side, we have implemented a workflow that employs active the experimental powder diffraction intensity peaks to the indexed learning (AL) for intelligent and iterative identification of promis- peaks (Pawley refinement) [JPS92]. Here we employed the open ing materials candidates within a large dataset. In the framework of source lmfit package [NSA+ 16] to perform a minimization of AL, the predicted value with associated uncertainty is considered the multivariable Voigt-like function that represents the entire to decide what materials to be added in each iteration, aiming to diffraction spectrum. This allows the user to refine (optimize) unit improve the model performance in the next iteration (Figure 5). cell parameters coming from the indexing data and as the result, Since it could be important to consider multiple properties goodness of fit (R-factor) between experimental and simulated simultaneously in material discovery, multiple property optimiza- spectrum is minimized. 
Machine learning techniques

Of late, there is great interest in machine learning assisted materials discovery. There are several components required to perform machine learning assisted materials discovery. In order to train a model, benchmark data from simulation and/or experimental data is required. Besides benchmark data, computation of the relevant descriptors is required (see below). Finally, a model based on benchmark data and descriptors is generated that allows prediction of properties for novel materials. There are several techniques to generate the model, ranging from linear and non-linear fitting to neural networks. Tools include the open source DeepChem [REW+ 19] and AutoQSAR [DDS+ 16] from the Schrödinger suite. Depending on the type of materials, benchmark data can be obtained using different codes available in the Schrödinger suite:

• small molecules and finite systems - Jaguar
• periodic systems - Quantum ESPRESSO
• larger polymeric and similar systems - Desmond

Different materials systems require different descriptors for featurization. For example, for crystalline periodic systems, we have implemented several sets of tailored descriptors. Generation of these descriptors again uses a mix of open source and Schrödinger proprietary tools. Specifically:

• elemental features such as atomic weight, number of valence electrons in s, p and d-shells, and electronegativity
• structural features such as density, volume per atom, and packing fraction descriptors implemented in the open source matminer package [WDF+ 18]
• intercalation descriptors such as cation and anion counts, crystal packing fraction, and average neighbor ionicity [SYC+ 17] implemented in the Schrödinger suite
• three-dimensional smooth overlap of atomic positions (SOAP) descriptors implemented in the open source DScribe package [HJM+ 20].

We are currently training models that use these descriptors to predict properties, such as bulk modulus, of a set of Li-containing battery related compounds [Cha]. Several models will be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+ 11]) and AutoQSAR.

For isolated small molecules and extended non-periodic systems, RDKit can be used to generate a large number of atomic and molecular descriptors. A lot of effort has been devoted to ensure that RDKit can be used on a wide variety of materials that are supported by the Schrödinger suite. At the time of writing, the 4th most active contributor to RDKit is Ricardo Rodriguez-Schmidt from Schrödinger [RDK].

Recently, active learning (AL) combined with DFT has received much attention to address the challenge of leveraging exhaustive libraries in materials informatics [VPB21], [SPA+ 19]. On our side, we have implemented a workflow that employs AL for intelligent and iterative identification of promising materials candidates within a large dataset. In the framework of AL, the predicted value with associated uncertainty is considered to decide which materials to add in each iteration, aiming to improve the model performance in the next iteration (Figure 5).

Fig. 5: Active learning workflow for the design and discovery of novel optoelectronics molecules.

Since it could be important to consider multiple properties simultaneously in material discovery, multiple property optimization (MPO) has also been implemented as a part of the AL workflow [KAG+ 22]. MPO allows scaling and combining multiple properties into a single score. We employed the AL workflow to determine the top candidates for the hole (positively charged carrier) transport layer (HTL) by evaluating 550 molecules in 10 iterations using DFT calculations for a dataset of ~9,000 molecules [AKA+ 22]. The resulting model was validated by randomly picking a molecule from the dataset, computing properties with DFT and comparing those to the predicted values. According to the semi-classical Marcus equation [Mar93], high rates of hole transfer are inversely proportional to hole reorganization energies. Thus, MPO scores were computed based on minimizing the hole reorganization energy and targeting the oxidation potential to an appropriate level to ensure a low energy barrier for hole injection from the anode into the emissive layer. In this workflow, we used RDKit to compute descriptors for the chemical structures. These descriptors generated on the initial subset of structures are given as vectors to an algorithm based on the Random Forest Regressor as implemented in scikit-learn. Bayesian optimization is employed to tune the hyperparameters of the model. In each iteration, a trained model is applied for making predictions on the remaining materials in the dataset.
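A minimal sketch of this kind of forest-based prediction-with-uncertainty step is given below. The descriptor dimensions, data, and acquisition rule are placeholders, and the hyperparameter tuning (e.g., Bayesian optimization) used in the published workflow is omitted.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(2)
    X_labeled = rng.normal(size=(50, 16))   # descriptor vectors (e.g., from RDKit)
    y_labeled = rng.normal(size=50)         # e.g., DFT hole reorganization energies
    X_pool = rng.normal(size=(500, 16))     # remaining, unlabeled candidates

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_labeled, y_labeled)

    # The spread of the per-tree predictions gives a simple uncertainty estimate.
    per_tree = np.stack([tree.predict(X_pool) for tree in model.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

    # Select the next batch by balancing a low predicted value against a high
    # uncertainty (one of many possible acquisition rules).
    next_batch = np.argsort(mean - std)[:10]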
Figure 6 (A) displays MPO scores for the HTL dataset estimated by AL as a function of hole reorganization energies that are separately calculated for all the materials. This figure indicates that there are many materials in the dataset with the desired low hole reorganization energies that are nevertheless not suitable for the HTL due to their improper oxidation potentials, suggesting that MPO is important to evaluate the optoelectronic performance of the materials. Figure 6 (B) presents MPO scores of the materials used in the training dataset of AL, demonstrating that the feedback loop in the AL workflow efficiently guides the data collection as the size of the training set increases.

Fig. 6: A: MPO score of all materials in the HTL dataset. B: Those used in the training set as a function of the hole reorganization energy (λh).

To appreciate the computational efficiency of such an approach, it is worth noting that performing DFT calculations for all of the 9,000 molecules in the dataset would increase the computational cost by a factor of 15 versus the AL workflow. The AL approach can be useful in cases where the problem space is broad (like chemical space) but contains many clusters of similar items (similar molecules). In this case, benchmark data is only needed for a few representatives of each cluster. We are currently working on applying this approach to train models for predicting physical properties of soft materials (polymers).

Conclusions

We present several examples of how the Schrödinger Materials Suite integrates open source software packages. There is a wide range of applications in materials science that can benefit from already existing open source code. Where possible, we report issues to the package authors and submit improvements and bug fixes in the form of pull requests. We are thankful to all who have contributed to open source libraries, and have made it possible for us to develop a platform for accelerating innovation in materials and drug discovery. We will continue contributing to these projects and we hope to further give back to the scientific community by facilitating research in both academia and industry. We hope that this report will inspire other scientific companies to give back to the open source community in order to improve the computational materials field and make science more reproducible.

Acknowledgments

The authors acknowledge Bradley Dice and Wenduo Zhou for their valuable comments during the review of the manuscript.

References
[ABG+ 21] M. A. F. Afzal, A. R. Browning, A. Goldberg, M. D. Halls, J. L. Gavartin, T. Morisato, T. F. Hughes, D. J. Giesen, and J. E. Goose. High-throughput molecular dynamics simulations and validation of thermophysical properties of polymers for various applications. ACS Applied Polymer Materials, 3, 2021. doi:10.1021/acsapm.0c00524.
[AKA+ 22] H. Abroshan, H. S. Kwak, Y. An, C. Brown, A. Chandrasekaran, P. Winget, and M. D. Halls. Active learning accelerates design and optimization of hole-transporting materials for organic electronics. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800371.
[BAS+ 20] A. R. Browning, M. A. F. Afzal, J. Sanders, A. Goldberg, A. Chandrasekaran, and H. S. Kwak. Polyolefin molecular simulation for critical physical characteristics. International Polyolefins Conference, 2020.
[BDBF] D. Brunato, P. Delugas, G. Borghi, and A. Fonari. qeschema. URL: https://github.com/QEF/qeschema.
[BHH+ 13] A. D. Bochevarov, E. Harder, T. F. Hughes, J. R. Greenwood, D. A. Braden, D. M. Philipp, D. Rinaldo, M. D. Halls, J. Zhang, and R. A. Friesner. Jaguar: A high-performance quantum chemistry software program with strengths in life and materials sciences. International Journal of Quantum Chemistry, 113, 2013. doi:10.1002/qua.24481.
[Bic02] J. Bicerano. Prediction of Polymer Properties. CRC Press, 2002.
[Cha] A. Chandrasekaran. Active learning accelerated design of ionic materials. In progress.
[DDS+ 16] S. L. Dixon, J. Duan, E. Smith, C. D. Von Bargen, W. Sherman, and M. P. Repasky. AutoQSAR: An automated machine learning tool for best-practice quantitative structure-activity relationship modeling. Future Medicinal Chemistry, 8, 2016. doi:10.4155/fmc-2016-0093.
[GAB+ 17] P. Giannozzi et al. Advanced capabilities for materials modelling with Quantum ESPRESSO. Journal of Physics: Condensed Matter, 29, 2017. URL: https://www.quantum-espresso.org/, doi:10.1088/1361-648X/aa8f79.
[GBLW16] C. R. Groom, I. J. Bruno, M. P. Lightfoot, and S. C. Ward. The Cambridge Structural Database. Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials, 72, 2016. doi:10.1107/S2052520616003954.
[HF08] G. L. W. Hart and R. W. Forcade. Algorithm for generating derivative structures. Physical Review B, 77, 2008. URL: https://github.com/msg-byu/enumlib/, doi:10.1103/PhysRevB.77.224115.
[HJM+ 20] L. Himanen, M. O. J. Jager, E. V. Morooka, F. Federici Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, and A. S. Foster. DScribe: Library of descriptors for machine learning in materials science. Computer Physics Communications, 247, 2020. URL: https://singroup.github.io/dscribe/latest/, doi:10.1016/j.cpc.2019.106949.
[HM24] O. Hassel and H. Mark. The crystal structure of graphite. Physik. Z., 25:317–337, 1924.
[HMvdW+ 20] C. R. Harris, K. J. Millman, S. J. van der Walt, et al. Array programming with NumPy, 2020. URL: https://numpy.org/, doi:10.1038/s41586-020-2649-2.
[HZU+ 20] S. P. Huber, S. Zoupanos, M. Uhrin, et al. AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance. Scientific Data, 7, 2020. URL: https://www.aiida.net/, doi:10.1038/s41597-020-00638-4.
[JPS92] J. Jansen, R. Peschar, and H. Schenk. Determination of accurate intensities from powder diffraction data. I. Whole-pattern fitting with a least-squares procedure. Journal of Applied Crystallography, 25, 1992. doi:10.1107/S0021889891012104.
[KAG+ 22] H. S. Kwak, Y. An, D. J. Giesen, T. F. Hughes, C. T. Brown, K. Leswing, H. Abroshan, and M. D. Halls. Design of organic electronic materials with a goal-directed generative model powered by deep neural networks and high-throughput molecular simulations. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800370.
[KBD+ 21] J. A. Kaduk, S. J. L. Billinge, R. E. Dinnebier, N. Henderson, I. Madsen, R. Černý, M. Leoni, L. Lutterotti, S. Thakral, and D. Chateigner. Powder diffraction. Nature Reviews Methods Primers, 1:77, 2021. doi:10.1038/s43586-021-00074-7.
[LLC22] Schrödinger LLC. Schrödinger Release 2022-2: Materials Science Suite, 2022. URL: https://www.schrodinger.com/platform/materials-science.
[LMB+ 17] A. Hjorth Larsen, J. J. Mortensen, J. Blomqvist, et al. The atomic simulation environment - a Python library for working with atoms, 2017. URL: https://wiki.fysik.dtu.dk/ase/, doi:10.1088/1361-648X/aa680e.
[LTK+ 22] G. Landrum, P. Tosco, B. Kelley, et al. rdkit, June 2022. URL: https://rdkit.org/, doi:10.5281/ZENODO.6605135.
[Mar93] R. A. Marcus. Electron transfer reactions in chemistry. Theory and experiment. Reviews of Modern Physics, 65, 1993. doi:10.1103/RevModPhys.65.599.
[MAS+ 20] N. N. Matsuzawa, H. Arai, M. Sasago, E. Fujii, A. Goldberg, T. J. Mustard, H. S. Kwak, D. J. Giesen, F. Ranalli, and M. D. Halls. Massive theoretical screen of hole conducting organic materials in the heteroacene family by using a cloud-computing environment. Journal of Physical Chemistry A, 124, 2020. doi:10.1021/acs.jpca.9b10998.
[MGF+ 19] T. Mustard, J. Gavartin, A. Fonari, C. Krauter, A. Goldberg, H. Kwak, T. Morisato, S. Pandiyan, and M. Halls. Surface reactivity and stability of core-shell solid catalysts from ab initio combinatorial calculations. Volume 258, 2019.
[NSA+ 16] M. Newville, T. Stensitzki, D. B. Allen, M. Rawlik, A. Ingargiola, and A. Nelson. Lmfit: Non-linear least-square minimization and curve-fitting for Python. Astrophysics Source Code Library, ascl–1606, 2016. URL: https://lmfit.github.io/lmfit-py/.
[OBJ+ 11] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3, 2011. URL: https://openbabel.org/, doi:10.1186/1758-2946-3-33.
[ORJ+ 13] S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier, K. A. Persson, and G. Ceder. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68, 2013. URL: https://pymatgen.org/, doi:10.1016/j.commatsci.2012.10.028.
[PKD18] P. N. Patrone, A. J. Kearsley, and A. M. Dienstfrey. The role of data analysis in uncertainty quantification: Case studies for materials modeling. 2018. doi:10.2514/6.2018-0927.
[PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011. URL: https://scikit-learn.org/.
[RDK] RDKit contributors. URL: https://github.com/rdkit/rdkit/graphs/contributors.
[REW+ 19] B. Ramsundar, P. Eastman, P. Walters, V. Pande, K. Leswing, and Z. Wu. Deep Learning for the Life Sciences. O'Reilly Media, 2019.
[SGB+ 14] D. E. Shaw et al. Anton 2: Raising the bar for performance and programmability in a special-purpose molecular dynamics supercomputer. Volume 2015-January, 2014. doi:10.1109/SC.2014.9.
[SPA+ 19] G. R. Schleder, A. C. M. Padilha, C. Mera Acosta, M. Costa, and A. Fazzio. From DFT to machine learning: Recent approaches to materials science - a review. JPhys Materials, 2, 2019. doi:10.1088/2515-7639/ab084b.
[SYC+ 17] A. D. Sendek, Q. Yang, E. D. Cubuk, K.-A. N. Duerloo, Y. Cui, and E. J. Reed. Holistic computational structure screening of more than 12000 candidates for solid lithium-ion conductor materials. Energy and Environmental Science, 10:306–320, 2017. doi:10.1039/c6ee02697d.
[VGO+ 20] P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17, 2020. doi:10.1038/s41592-019-0686-2.
[VPB21] R. Vasudevan, G. Pilania, and P. V. Balachandran. Machine learning for materials design and discovery. Journal of Applied Physics, 129, 2021. doi:10.1063/5.0043300.
[WDF+ 18] L. Ward, A. Dunn, A. Faghaninia, et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 2018. URL: https://hackingmaterials.lbl.gov/matminer/, doi:10.1016/j.commatsci.2018.05.018.
[WF05] J. D. Westbrook and P. M. D. Fitzgerald. The PDB format, mmCIF formats, and other data formats, 2005. doi:10.1002/0471721204.ch8.
[WKB+ 22] P. Winget, H. S. Kwak, C. T. Brown, A. Fonari, K. Tran, A. Goldberg, A. R. Browning, and M. D. Halls. Organic thin films for OLED applications: Influence of molecular structure, deposition method, and deposition conditions. International Conference on the Science and Technology of Synthetic Metals, 2022.
A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space

Seyed Alireza Vaezi‡∗, Gianni Orlando‡, Mojtaba Fazli§, Gary Ward¶, Silvia Moreno‡, Shannon Quinn‡

∗ Corresponding author: sv22900@uga.edu
‡ University of Georgia
§ Harvard University
¶ University of Vermont

Copyright © 2022 Seyed Alireza Vaezi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Toxoplasma gondii is the parasitic protozoan that causes disseminated toxoplasmosis, a disease that is estimated to infect around one-third of the world’s population. While the disease is commonly asymptomatic, the success of the parasite is in large part due to its ability to easily spread through nucleated cells. The virulence of T. gondii is predicated on the parasite’s motility. Thus the inspection of motility patterns during its lytic cycle has become a topic of keen interest. Current cell tracking projects usually focus on cell images captured in 2D, which are not a true representation of the actual motion of a cell. Current 3D tracking projects lack a comprehensive pipeline covering all phases of preprocessing, cell detection, cell instance segmentation, tracking, and motion classification, and merely implement a subset of the phases. Moreover, current 3D segmentation and tracking pipelines are not targeted for users with less experience in deep learning packages. Our pipeline, TSeg, on the other hand, is developed for segmenting, tracking, and classifying the motility phenotypes of T. gondii in 3D microscopic images. Although TSeg is built initially focusing on T. gondii, it provides generic functions to allow users with similar but distinct applications to use it off-the-shelf. Interacting with all of TSeg’s modules is possible through our Napari plugin, which is developed mainly off the familiar SciPy scientific stack. Additionally, our plugin is designed with a user-friendly GUI in Napari, which adds several benefits to each step of the pipeline such as visualization and representation in 3D. TSeg proves to fulfill a better generalization, making it capable of delivering accurate results with images of other cell types.
Introduction

Quantitative cell research often requires the measurement of different cell properties including size, shape, and motility. This step is facilitated using segmentation of imaged cells. With fluorescent markers, computational tools can be used to complete segmentation and identify cell features and positions over time. 2D measurements of cells can be useful, but the more difficult task of deriving 3D information from cell images is vital for metrics such as motility and volumetric qualities.

Toxoplasmosis is an infection caused by the intracellular parasite Toxoplasma gondii. T. gondii is one of the most successful parasites, infecting at least one-third of the world’s population. Although Toxoplasmosis is generally benign in healthy individuals, the infection has fatal implications in fetuses and immunocompromised individuals [SG12]. T. gondii’s virulence is directly linked to its lytic cycle, which is comprised of invasion, replication, egress, and motility. Studying the motility of T. gondii is crucial in understanding its lytic cycle in order to develop potential treatments.

For this reason, we present a novel pipeline to detect, segment, track, and classify the motility pattern of T. gondii in 3D space. One of the main goals is to make our pipeline intuitively easy to use so that users who are not experienced in the fields of machine learning (ML), deep learning (DL), or computer vision (CV) can still benefit from it. The other objective is to equip it with the most robust and accurate set of segmentation and detection tools so that the end product has a broad generalization, allowing it to perform well and accurately for various cell types right off the shelf.

PlantSeg uses a variant of 3D U-Net, called Residual 3D U-Net, for preprocessing and segmentation of multiple cell types [WCV+ 20]. PlantSeg performs best among deep learning algorithms for 3D instance segmentation and is very robust against image noise [KPR+ 21]. The segmentation module also includes the optional use of CellPose [SWMP21]. CellPose is a generalized segmentation algorithm trained on a wide range of cell types and is the first step toward increased optionality in TSeg. The Cell Tracking module consolidates the cell particles across the z-axis to materialize cells in 3D space and estimates centroids for each cell. The tracking module is also responsible for extracting the trajectories of cells based on the movements of centroids throughout consecutive video frames, which is eventually the input of the motion classifier module.
Most of the state-of-the-art pipelines are restricted to 2D space, which is not a true representative of the actual motion of the organism. Many of them require knowledge and expertise in programming, or in machine learning and deep learning models and frameworks, thus limiting the demographic of users that can use them. All of them solely include a subset of the aforementioned modules (i.e. detection, segmentation, tracking, and classification) [SWMP21]. Many pipelines rely on the user to train their own model, hand-tailored for their specific application. This demands high levels of experience and skill in ML/DL and consequently undermines the possibility and feasibility of quickly utilizing an off-the-shelf pipeline and still getting good results.

To address these we present TSeg. It segments T. gondii cells in 3D microscopic images, tracks their trajectories, and classifies the motion patterns observed throughout the 3D frames. TSeg is comprised of four modules: pre-processing, segmentation, tracking, and classification. We developed TSeg as a plugin for Napari [SLE+ 22] - an open-source, fast, and interactive image viewer for Python designed for browsing, annotating, and analyzing large multi-dimensional images. Having TSeg implemented as a part of Napari not only provides a user-friendly design but also gives more advanced users the possibility to attach and execute their custom code and even interact with the steps of the pipeline if needed. The preprocessing module is equipped with basic and extra filters and functionalities to aid in the preparation of the input data. TSeg gives its users the advantage of utilizing the functionalities that PlantSeg and CellPose provide. These functionalities can be chosen in the pre-processing, detection, and segmentation steps. This brings forth a huge variety of algorithms and pre-built models to select from, making TSeg not only a great fit for T. gondii, but also for a variety of different cell types.

The rest of this paper is structured as follows: After briefly reviewing the literature in Related Work, we move on to thoroughly describe the details of our work in the Method section. Following that, the Results section depicts the results of comprehensive tests of our plugin on T. gondii cells.

Related Work

The recent solutions in generalized and automated segmentation tools are focused on 2D cell images. Segmentation of cellular structures in 2D is important but not representative of realistic environments. Microbiological organisms are free to move on the z-axis, and tracking without taking this factor into account cannot guarantee a full representation of the actual motility patterns. As an example, Fazli et al. [FVMQ18] identified three distinct motility types for T. gondii with two-dimensional data; however, they also acknowledge, based on established heuristics from previous works, that there are more than three motility phenotypes for T. gondii.
The focus on 2D research is understandable due to several factors. 3D data is difficult to capture, as tools for capturing 3D slices and the computational requirements for analyzing this data are not available in most research labs. Most segmentation tools are unable to track objects in 3D space, as the assignment of related centroids is more difficult. The additional noise from capture and focus increases the probability of incorrect assignment. 3D data also has issues with overlapping features and increased computation required per frame of time.

Fazli et al. [FVMQ18] studies the motility patterns of T. gondii and provides a computational pipeline for identifying motility phenotypes of T. gondii in an unsupervised, data-driven way. In that work Ca2+ is added to T. gondii cells inside a Fetal Bovine Serum. T. gondii cells react to Ca2+ and become motile and fluorescent. The images of motile T. gondii cells were captured using an LSM 710 confocal microscope. They use Python 3 and associated scientific computing libraries (NumPy, SciPy, scikit-learn, matplotlib) in their pipeline to track and cluster the trajectories of T. gondii. Based on this work, Fazli et al. [FVM+ 18] work on another pipeline consisting of preprocessing, sparsification, cell detection, and cell tracking modules to track T. gondii in 3D video microscopy, where each frame of the video consists of image slices taken 1 micro-meter of focal depth apart along the z-axis direction. In their latest work, Fazli et al. [FSA+ 19] developed a lightweight and scalable pipeline using task distribution and parallelism. Their pipeline consists of multiple modules: preprocessing, sparsification, cell detection, cell tracking, trajectory extraction, parametrization of the trajectories, and clustering. They could classify three distinct motion patterns in T. gondii using the same data from their previous work.

Fig. 1: The overview of TSeg's architecture.

While combining open source tools is not a novel architecture, little has been done to integrate 3D cell tracking tools. Fazeli et al. [FRF+ 20], motivated by the same interest in providing better tools to non-software professionals, created a 2D cell tracking pipeline. This pipeline combines Stardist [WSH+ 20] and TrackMate [TPS+ 17] for automated cell tracking. This pipeline begins with the user loading cell images and centroid approximations to the ZeroCostDL4Mic [vCLJ+ 21] platform. ZeroCostDL4Mic is a deep learning training tool for those with no coding expertise. Once the platform is trained and masks for the training set are made for hand-drawn annotations, the training set can be input to Stardist. Stardist performs automated object detection using Euclidean distance to probabilistically determine cell pixels versus background pixels. Lastly, TrackMate uses segmentation images to track labels between timeframes and display analytics.

This Stardist pipeline is similar in concept to TSeg. Both create an automated segmentation and tracking pipeline, but TSeg is oriented to 3D data. Cells move in 3-dimensional space that is not represented in a flat plane. TSeg also does not require the manual training necessary for the other pipeline. Individuals with low technical expertise should not be expected to create masks for training or even understand the training of deep neural networks. Lastly, this pipeline does not account for imperfect datasets without the need for preprocessing. All implemented algorithms in TSeg account for microscopy images with some amount of noise.

Wen et al. [WMV+ 21] combines multiple existing new technologies including deep learning and presents 3DeeCellTracker. 3DeeCellTracker segments and tracks cells on 3D time-lapse images. Using a small subset of their dataset they train the deep learning architecture 3D U-Net for segmentation. For tracking, a combination of two strategies was used to increase accuracy: local cell region strategies, and spatial pattern strategy. Kapoor et al. [KC21] presents VollSeg, which uses deep learning methods to segment, track, and analyze cells in 3D with irregular shape and intensity distribution. It is a Jupyter Notebook-based Python package and also has a UI in Napari. For tracking, a custom tracking code is developed based on TrackMate.

Many segmentation tools require some amount of knowledge in Machine or Deep Learning concepts. Training the neural network in creating masks is a common step for open-source segmentation tools. Automating this process makes the pipeline more accessible to microbiology researchers.

Method

Data

Our dataset consists of 11 videos of T. gondii cells under a microscope, obtained from different experiments with different numbers of cells. The videos are on average around 63 frames in length. Each frame has a stack of 41 image slices of size 500×502 pixels along the z-axis (z-slices). The z-slices are captured 1µm apart in optical focal length, making them 402µm×401µm×40µm in volume. The slices were recorded in raw format as RGB TIF images but are converted to grayscale for our purpose. This data is captured using a PlanApo 20x objective (NA = 0.75) on a preheated Nikon Eclipse TE300 epifluorescence microscope. The image stacks were captured using an iXon 885 EMCCD camera (Andor Technology, Belfast, Ireland) cooled to -70°C and driven by NIS Elements software (Nikon Instruments, Melville, NY) as part of related research by Ward et al. [LRK+ 14]. The camera was set to frame transfer sensor mode, with a vertical pixel shift speed of 1.0 µs, vertical clock voltage amplitude of +1, readout speed of 35 MHz, conversion gain of 3.8×, EM gain setting of 3, and 2×2 binning, and the z-slices were imaged with an exposure time of 16 ms.

Software

Napari Plugin: TSeg is developed as a plugin for Napari - a fast and interactive multi-dimensional image viewer for Python that allows volumetric viewing of 3D images [SLE+ 22]. Plugins enable developers to customize and extend the functionality of Napari. For every module of TSeg, we developed its corresponding widget in the GUI, plus a widget for file management. The widgets have self-explanatory interface elements with tooltips to guide the inexperienced user to traverse through the pipeline with ease. Layers in Napari are the basic viewable objects that can be shown in the Napari viewer. Seven different layer types are supported in Napari: Image, Labels, Points, Shapes, Surface, Tracks, and Vectors, each of which corresponds to a different data type, visualization, and interactivity [SLE+ 22]. After its execution, the viewable output of each widget gets added to the layers. This allows the user to evaluate and modify the parameters of the widget to get the best results before continuing to the next widget. Napari supports bidirectional communication between the viewer and the Python kernel and has a built-in console that allows users to control all the features of the viewer programmatically. This adds more flexibility and customizability to TSeg for the advanced user. The full code of TSeg is available on GitHub under the MIT open source license at https://github.com/salirezav/tseg. TSeg can be installed through Napari's plugins menu.
Computational Pipeline

Pre-Processing: Due to the fast imaging speed in data acquisition, the image slices will inherently have a vignetting artifact, meaning that the corners of the images will be slightly darker than the center of the image. To eliminate this artifact we added adaptive thresholding and logarithmic correction to the pre-processing module. Furthermore, another prevalent artifact in our dataset images was film-grain noise (also known as salt-and-pepper noise). To remove or reduce such noise, a simple gaussian blur filter and a sharpening filter are included.
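The snippet below is a rough sketch of these filters using scikit-image and NumPy; it is not TSeg's actual implementation, and the specific functions and parameter values are assumptions chosen only to illustrate the steps named above.

    import numpy as np
    from skimage import exposure, filters

    # One grayscale z-slice (placeholder array; TSeg reads TIF stacks).
    slice_2d = np.random.default_rng(3).random((500, 502))

    log_corrected = exposure.adjust_log(slice_2d, gain=1.0)       # lift dark corners
    local_thresh = filters.threshold_local(log_corrected, block_size=51)
    foreground = log_corrected > local_thresh                     # adaptive threshold

    denoised = filters.gaussian(log_corrected, sigma=1.0)         # soften film-grain noise
    sharpened = filters.unsharp_mask(denoised, radius=2, amount=1.0)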
Cell Detection and Segmentation: TSeg's Detection and Segmentation modules are in fact backed by PlantSeg and CellPose. The Detection Module is built only based on PlantSeg's CNN Detection Module [WCV+ 20], and for the Segmentation Module, only one of the three tools can be selected to be executed as the segmentation tool in the pipeline. Naturally, each of the tools demands specific interface elements different from the others, since each accepts different input values and various parameters. TSeg orchestrates this and makes sure the arguments and parameters are passed to the corresponding selected segmentation tool properly and the execution will be handled accordingly. The parameters include but are not limited to the input data location, output directory, and desired segmentation algorithm. This allows the end-user complete control over the process and feedback from each step of the process. The preprocessed images and relevant parameters are sent to a modular segmentation controller script. As an effort to allow future development on TSeg, the segmentation controller script shows how the pipeline integrates two completely different segmentation packages. While both PlantSeg and CellPose use conda environments, PlantSeg requires modification of a YAML file for initialization, while CellPose initializes directly from command line parameters. In order to implement PlantSeg, TSeg generates a YAML file based on GUI input elements. After parameters are aligned, the conda environment for the chosen segmentation algorithm is opened in a subprocess. The $CONDA_PREFIX environment variable allows the bash command to start conda and context switch to the correct segmentation environment.
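A minimal sketch of this hand-off is shown below. The config keys, environment name, and command line are hypothetical placeholders (TSeg's controller script and the segmentation tool's own config schema are authoritative); the sketch only illustrates writing a YAML file from GUI-collected parameters and launching the tool in its own conda environment from a subprocess.

    import os
    import subprocess
    import yaml

    # Hypothetical parameters gathered from the GUI widgets.
    config = {
        "path": "/data/preprocessed",                       # input location (placeholder)
        "segmentation": {"output_dir": "/data/segmented"},  # placeholder key names
    }

    with open("plantseg_config.yaml", "w") as fh:
        yaml.safe_dump(config, fh)

    # TSeg consults $CONDA_PREFIX to locate conda; here we simply rely on
    # "conda run" to execute the selected tool inside its environment.
    print(os.environ.get("CONDA_PREFIX", ""))
    subprocess.run(
        ["conda", "run", "-n", "plant-seg",
         "plantseg", "--config", "plantseg_config.yaml"],
        check=True,
    )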
Tracking: Features in each segmented image are found using the scipy label function. In order to reduce any leftover noise, any features under a minimum size are filtered out. After feature extraction, centroids are calculated using the center of mass function in scipy. The centroid of the 3D cell can be used as a representation of the entire body during tracking. The tracking algorithm goes through each captured time instance and connects centroids to the likely next movement of the cell. Tracking involves a series of measures in order to avoid incorrect assignments. An incorrect assignment could lead to inaccurate result sets and unrealistic motility patterns. If the same number of features in each frame of time could be guaranteed from segmentation, minimum distance could assign features rather accurately. Since this is not a guarantee, the Hungarian algorithm must be used to associate a cost with the assignment of feature tracking. The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. The cost for the tracking algorithm determines which feature is the next iteration of the cell's tracking through the complete time series; it combines the distance between centroids for all previous points and the distance to the potential new centroid. If an optimal next centroid can't be found within an acceptable distance of the current point, the tracking for the cell is considered complete. Likewise, if a feature is not assigned to a current centroid, this feature is considered a new object and is tracked as the algorithm progresses. The complete path for each feature is then stored for motility analysis.
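The following is a condensed sketch of those steps with SciPy (labeling, size filtering, center-of-mass centroids, and Hungarian matching via linear_sum_assignment); it is not TSeg's exact code, and the min_size and max_dist values are illustrative assumptions.

    import numpy as np
    from scipy import ndimage
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def centroids(volume, min_size=20):
        # Label connected features in a binary 3D segmentation and keep the
        # ones above a minimum size, returning their centers of mass.
        labels, n = ndimage.label(volume)
        sizes = ndimage.sum(volume, labels, index=range(1, n + 1))
        keep = [i + 1 for i, s in enumerate(sizes) if s >= min_size]
        return np.array(ndimage.center_of_mass(volume, labels, keep))

    def match(prev_pts, next_pts, max_dist=15.0):
        # Hungarian assignment of centroids between consecutive frames,
        # using Euclidean distance as the cost.
        cost = cdist(prev_pts, next_pts)
        rows, cols = linear_sum_assignment(cost)
        # Pairs farther apart than max_dist end a track / start a new object.
        return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]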
Motion Classification: To classify the motility pattern of T. gondii in 3D space in an unsupervised fashion, we implement and use the method that Fazli et al. introduced [FSA+ 19]. In that work, they used an autoregressive model (AR); a linear dynamical system that encodes a Markov-based transition prediction method. The reason is that although K-means is a favorable clustering algorithm, there are a few drawbacks to it and to the conventional methods that render them impractical. Firstly, K-means assumes Euclidean distance, but AR motion parameters are geodesics that do not reside in a Euclidean space. Secondly, K-means assumes isotropic clusters; however, although AR motion parameters may exhibit isotropy in their space, without a proper distance metric this issue cannot be clearly examined [FSA+ 19].

Conclusion and Discussion

TSeg is an easy to use pipeline designed to study the motility patterns of T. gondii in 3D space. It is developed as a plugin for Napari and is equipped with a variety of deep learning based segmentation tools borrowed from PlantSeg and CellPose, making it a suitable off-the-shelf tool for applications incorporating images of cell types not limited to T. gondii. Future work on TSeg includes the expansion of implemented algorithms and tools in its preprocessing, segmentation, tracking, and clustering modules.

References

[FRF+ 20] E. Fazeli, N. H. Roy, G. Follain, R. F. Laine, L. von Chamier, P. E. Hänninen, J. E. Eriksson, J.-Y. Tinevez, and G. Jacquemet. Automated cell tracking using StarDist and TrackMate. F1000Research, 9, 2020. doi:10.12688/f1000research.27019.1.
[FSA+ 19] M. S. Fazli, R. V. Stadler, B. Alaila, S. A. Vella, S. N. J. Moreno, G. E. Ward, and S. Quinn. Lightweight and scalable particle tracking and motion clustering of 3D cell trajectories. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 412–421. IEEE, 2019. doi:10.1109/dsaa.2019.00056.
[FVM+ 18] M. S. Fazli, S. A. Vella, S. N. J. Moreno, G. E. Ward, and S. P. Quinn. Toward simple & scalable 3D cell tracking. In 2018 IEEE International Conference on Big Data (Big Data), pages 3217–3225. IEEE, 2018. doi:10.1109/BigData.2018.8622403.
[FVMQ18] M. S. Fazli, S. A. Vella, S. N. J. Moreno, and S. Quinn. Unsupervised discovery of Toxoplasma gondii motility phenotypes. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 981–984. IEEE, 2018. doi:10.1109/isbi.2018.8363735.
[KC21] V. Kapoor and C. Carabaña. Cell tracking in 3D using deep learning segmentations. In Python in Science Conference, pages 154–161, 2021. doi:10.25080/majora-1b6fd038-014.
[KPR+ 21] A. Kar, M. Petit, Y. Refahi, G. Cerutti, C. Godin, and J. Traas. Assessment of deep learning algorithms for 3D instance segmentation of confocal image datasets. bioRxiv, 2021. doi:10.1101/2021.06.09.447748.
[LRK+ 14] J. Leung, M. Rould, C. Konradt, C. Hunter, and G. Ward. Disruption of TgPhIL1 alters specific parameters of Toxoplasma gondii motility measured in a quantitative, three-dimensional live motility assay. PLoS ONE, 9:e85763, 2014. doi:10.1371/journal.pone.0085763.
[SG12] G. Saadatnia and M. Golkar. A review on human toxoplasmosis. Scandinavian Journal of Infectious Diseases, 44(11):805–814, 2012. doi:10.3109/00365548.2012.693197.
[SLE+ 22] N. Sofroniew, T. Lambert, K. Evans, J. Nunez-Iglesias, et al. napari: a multi-dimensional image viewer for Python, May 2022. doi:10.5281/zenodo.6598542.
[SWMP21] C. Stringer, T. Wang, M. Michaelos, and M. Pachitariu. Cellpose: a generalist algorithm for cellular segmentation. Nature Methods, 18(1):100–106, 2021. doi:10.1101/2020.02.02.931238.
[TPS+ 17] J.-Y. Tinevez, N. Perry, J. Schindelin, G. M. Hoopes, G. D. Reynolds, E. Laplantine, S. Y. Bednarek, S. L. Shorte, and K. W. Eliceiri. TrackMate: An open and extensible platform for single-particle tracking. Methods, 115:80–90, 2017. doi:10.1016/j.ymeth.2016.09.016.
[vCLJ+ 21] L. von Chamier, R. F. Laine, J. Jukkala, C. Spahn, D. Krentzel, E. Nehme, M. Lerche, S. Hernández-Pérez, P. K. Mattila, E. Karinou, et al. Democratising deep learning for microscopy with ZeroCostDL4Mic. Nature Communications, 12(1):1–18, 2021. doi:10.1038/s41467-021-22518-0.
[WCV+ 20] A. Wolny, L. Cerrone, A. Vijayan, R. Tofanelli, et al. Accurate and versatile 3D segmentation of plant tissues at cellular resolution. eLife, 9:e57613, 2020. doi:10.7554/eLife.57613.
[WMV+ 21] C. Wen, T. Miura, V. Voleti, K. Yamaguchi, et al. 3DeeCellTracker, a deep learning-based pipeline for segmenting and tracking cells in 3D time lapse images. eLife, 10, 2021. doi:10.7554/eLife.59187.
[WSH+ 20] M. Weigert, U. Schmidt, R. Haase, K. Sugawara, and G. Myers. Star-convex polyhedra for 3D object detection and segmentation in microscopy. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2020. doi:10.1109/wacv45572.2020.9093435.
The myth of the normal curve and what to do about it

Allan Campopiano∗

∗ Corresponding author: allan@deepnote.com

Copyright © 2022 Allan Campopiano. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Index Terms—Python, R, robust statistics, bootstrapping, trimmed mean, data science, hypothesis testing

Reliance on the normal curve as a tool for measurement is almost a given. It shapes our grading systems, our measures of intelligence, and importantly, it forms the mathematical backbone of many of our inferential statistical tests and algorithms. Some even call it "God's curve" for its supposed presence in nature [Mic89].

Scientific fields that deal in explanatory and predictive statistics make particular use of the normal curve, often using it to conveniently define thresholds beyond which a result is considered statistically significant (e.g., t-test, F-test). Even familiar machine learning models have, buried in their guts, an assumption of the normal curve (e.g., LDA, gaussian naive Bayes, logistic & linear regression).

The normal curve has had a grip on us for some time; the aphorism by Cramer [Cra46] still rings true for many today:

"Everyone believes in the [normal] law of errors, the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact."

Many students of statistics learn that N=40 is enough to ignore the violation of the assumption of normality. This belief stems from early research showing that the sampling distribution of the mean quickly approaches normal, even when drawing from non-normal distributions—as long as samples are sufficiently large. It is common to demonstrate this result by sampling from uniform and exponential distributions. Since these look nothing like the normal curve, it was assumed that N=40 must be enough to avoid practical issues when sampling from other types of non-normal distributions [Wil13]. (Others reached similar conclusions with different methodology [Gle93].)

Two practical issues have since been identified based on this early research: (1) the distributions under study were light tailed (they did not produce outliers), and (2) statistics other than the sample mean were not tested and may behave differently. In the half century following these early findings, many important discoveries have been made—calling into question the usefulness of the normal curve [Wil13].

The following sections uncover various pitfalls one might encounter when assuming normality—especially as they relate to hypothesis testing. To help researchers overcome these problems, a new Python library for robust hypothesis testing will be introduced along with an interactive tool for robust statistics education.

Fig. 1: Standard normal (orange) and contaminated normal (blue). The variance of the contaminated curve is more than 10 times that of the standard normal curve. This can cause serious issues with statistical power when using traditional hypothesis testing methods.

The contaminated normal

One of the most striking counterexamples of "N=40 is enough" is shown when sampling from the so-called contaminated normal [Tuk60][Tan82]. This distribution is also bell shaped and symmetrical, but it has slightly heavier tails when compared to the standard normal curve. That is, it contains outliers and is difficult to distinguish from a normal distribution with the naked eye. Consider the distributions in Figure 1. The variance of the normal distribution is 1 but the variance of the contaminated normal is 10.9!
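For readers who want to generate this kind of distribution themselves, a mixture of N(0, 1) and N(0, 10²) with 10% contamination is consistent with the 10.9 variance quoted above (the exact parameters behind the figures are an assumption here):

    import numpy as np

    rng = np.random.default_rng(0)

    def contaminated_normal(n, p=0.1, scale=10.0):
        # Mixture of N(0, 1) and N(0, scale**2): still bell shaped, but heavy tailed.
        outliers = rng.random(n) < p
        return np.where(outliers, rng.normal(0, scale, n), rng.normal(0, 1, n))

    x = contaminated_normal(1_000_000)
    print(x.var())  # close to 0.9 * 1 + 0.1 * 100 = 10.9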
The consequence of this inflated variance is apparent when examining statistical power. To demonstrate, Figure 2 shows two pairs of distributions: on the left, there are two normal distributions (variance 1) and on the right there are two contaminated distributions (variance 10.9). Both pairs of distributions have a mean difference of 0.8. Wilcox [Wil13] showed that by taking random samples of N=40 from each normal curve, and comparing them with Student's t-test, statistical power was approximately 0.94. However, when following this same procedure for the contaminated groups, statistical power was only 0.25.

Fig. 2: Two normal curves (left) and two contaminated normal curves (right). Despite the obvious effect sizes (∆ = 0.8 for both pairs) as well as the visual similarities of the distributions, power is only ~0.25 under contamination; however, power is ~0.94 under normality (using Student's t-test).

The point here is that even small apparent departures from normality, especially in the tails, can have a large impact on commonly used statistics. The problems continue to get worse when examining effect sizes, but these findings are not discussed in this article. Interested readers should see Wilcox's 1992 paper [Wil92].
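A small simulation along these lines can be written directly with SciPy; this sketch assumes the same 10% contamination model as above, and the exact power estimates will vary slightly with the number of replications:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def contaminated(n, p=0.1, scale=10.0):
        out = rng.random(n) < p
        return np.where(out, rng.normal(0, scale, n), rng.normal(0, 1, n))

    def power(sampler, delta=0.8, n=40, reps=5000, alpha=0.05):
        # Proportion of Student t-tests that reject when the true shift is delta.
        hits = sum(stats.ttest_ind(sampler(n), sampler(n) + delta).pvalue < alpha
                   for _ in range(reps))
        return hits / reps

    print(power(lambda n: rng.normal(0, 1, n)))  # roughly 0.94 under normality
    print(power(contaminated))                   # much lower under contamination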
Perhaps one could argue that the contaminated normal distribution actually represents an extreme departure from normality and therefore should not be taken seriously; however, distributions that generate outliers are likely common in practice [HD82][Mic89][Wil09]. A reasonable goal would then be to choose methods that perform well under such situations and continue to perform well under normality. In addition, serious issues still exist even when examining light-tailed and skewed distributions (e.g., lognormal), and statistics other than the sample mean (e.g., T). These findings will be discussed in the following section.

Student's t-distribution

Another common statistic is the T value obtained from Student's t-test. As will be demonstrated, T is more sensitive to violations of normality than the sample mean (which has already been shown to not be robust). This is despite the fact that the t-distribution is also bell shaped, light tailed, and symmetrical—a close relative of the normal curve.

The assumption is that T follows a t-distribution (and with large samples it approaches normality). We can test this assumption by generating random samples from a lognormal distribution. Specifically, 5000 datasets of sample size 20 were randomly drawn from a lognormal distribution using SciPy's lognorm.rvs function. For each dataset, T was calculated and the resulting t-distribution was plotted. Figure 3 shows that the assumption that T follows a t-distribution does not hold.

Fig. 3: Actual t-distribution (orange) and assumed t-distribution (blue). When simulating a t-distribution based on a lognormal curve, T does not follow the assumed shape. This can cause poor probability coverage and increased Type I Error when using traditional hypothesis testing approaches.

With N=20, the assumption is that with a probability of 0.95, T will be between -2.09 and 2.09. However, when sampling from a lognormal distribution in the manner just described, there is actually a 0.95 probability that T will be between approximately -4.2 and 1.4 (i.e., the middle 95% of the actual t-distribution is much wider than the assumed t-distribution). Based on this result we can conclude that sampling from skewed distributions (e.g., lognormal) leads to increased Type I Error when using Student's t-test [Wil98].

"Surely the hallowed bell-shaped curve has cracked from top to bottom. Perhaps, like the Liberty Bell, it should be enshrined somewhere as a memorial to more heroic days — Earnest Ernest, Philadelphia Inquirer. 10 November 1974. [FG81]"
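This simulation is easy to reproduce. The sketch below assumes a one-sample T computed against the known lognormal mean, which is one way to obtain quantiles near the values quoted above; the article's exact simulation code may differ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps = 20, 5000

    pop_mean = np.exp(0.5)          # mean of a lognormal with s=1, scale=1
    t_vals = np.empty(reps)
    for i in range(reps):
        x = stats.lognorm.rvs(s=1, size=n, random_state=rng)
        t_vals[i] = (x.mean() - pop_mean) / (x.std(ddof=1) / np.sqrt(n))

    # Middle 95% of the simulated T values versus the assumed +/-2.09.
    print(np.percentile(t_vals, [2.5, 97.5]))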
Modern robust methods

When it comes to hypothesis testing, one intuitive way of dealing with the issues described above would be to (1) replace the sample mean (and standard deviation) with a robust alternative and (2) use a non-parametric resampling technique to estimate the sampling distribution (rather than assuming a theoretical shape)¹. Two such candidates are the 20% trimmed mean and the percentile bootstrap test, both of which have been shown to have practical value when dealing with issues of outliers and non-normality [CvNS18][Wil13].

1. Another option is to use a parametric test that assumes a different underlying model.

The trimmed mean

The trimmed mean is nothing more than sorting values, removing a proportion from each tail, and computing the mean on the remaining values. Formally,

• Let X_1, ..., X_n be a random sample and X_(1) ≤ X_(2) ≤ ... ≤ X_(n) be the observations in ascending order
• The proportion to trim is γ (0 ≤ γ ≤ .5)
• Let g = ⌊γn⌋. That is, the proportion to trim multiplied by n, rounded down to the nearest integer

Then, in symbols, the trimmed mean can be expressed as follows:

$\bar{X}_t = \frac{X_{(g+1)} + \dots + X_{(n-g)}}{n - 2g}$

If the proportion to trim is 0.2, more than twenty percent of the values would have to be altered to make the trimmed mean arbitrarily large or small. The sample mean, on the other hand, can be made to go to ±∞ (arbitrarily large or small) by changing a single value. The trimmed mean is more robust than the sample mean in all measures of robustness that have been studied [Wil13]. In particular, the 20% trimmed mean has been shown to have practical value as it avoids issues associated with the median (not discussed here) and still protects against outliers.
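SciPy ships a matching implementation, scipy.stats.trim_mean, so the breakdown behavior described above can be checked in a couple of lines (the toy data are made up for illustration):

    import numpy as np
    from scipy import stats

    x = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.6, 95.0])

    print(x.mean())                 # 12.05: dragged upward by the single outlier
    print(stats.trim_mean(x, 0.2))  # 2.9: drops g = floor(0.2 * n) values per tail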
Implementing and teaching modern robust methods

Despite over half a century of convincing findings, and thousands of papers, robust statistical methods are still not widely adopted in applied research [EHM08][Wil98]. This may be due to various false beliefs. For example,

• Classical methods are robust to violations of assumptions
• Correcting non-normal distributions by transforming the data will solve all issues
• Traditional non-parametric tests are suitable replacements for parametric tests that violate assumptions

Perhaps the most obvious reason for the lack of adoption of modern methods is a lack of easy-to-use software and training resources. In the following sections, two resources will be presented: one for implementing robust methods and one for teaching them.

Robust statistics for Python

Hypothesize is a robust null hypothesis significance testing (NHST) library for Python [CW20]. It is based on Wilcox's WRS package for R, which contains hundreds of functions for computing robust measures of central tendency and hypothesis testing. At the time of this writing, the WRS library in R contains many more functions than Hypothesize, and its value to researchers who use inferential statistics cannot be overstated. WRS is best experienced in tandem with Wilcox's book "Introduction to Robust Estimation and Hypothesis Testing".

Hypothesize brings many of these functions into the open-source Python library ecosystem with the goal of lowering the barrier to modern robust methods—even for those who have not had extensive training in statistics or coding. With modern browser-based notebook environments (e.g., Deepnote), learning to use Hypothesize can be relatively straightforward. In fact, every statistical test listed in the docs is associated with a hosted notebook, pre-filled with sample data and code. But certainly, one can simply pip install Hypothesize to use it in any environment that supports Python. See van Noordt and Willoughby [vNW21] and van Noordt et al. [vNDTE22] for examples of Hypothesize being used in applied research.

The API for Hypothesize is organized by single- and two-factor tests, as well as measures of association. Input data for the groups, conditions, and measures are given in the form of a Pandas DataFrame [pdt20][WM10]. By way of example, one can compare two independent groups (e.g., placebo versus treatment) using the 20% trimmed mean and the percentile bootstrap test, as follows (note that Hypothesize uses the naming conventions found in WRS):

from hypothesize.utilities import trim_mean
from hypothesize.compare_groups_with_single_factor \
    import pb2gen

results = pb2gen(df.placebo, df.treatment, trim_mean)

As shown below, the results are returned as a Python dictionary containing the p-value, confidence intervals, and other important details.

{
    'ci': [-0.22625614592148624, 0.06961754796950131],
    'est_1': 0.43968438076483285,
    'est_2': 0.5290985245430996,
    'est_dif': -0.08941414377826673,
    'n1': 50,
    'n2': 50,
    'p_value': 0.27,
    'variance': 0.005787027326924963
}

For measuring associations, several options exist in Hypothesize. One example is the Winsorized correlation, which is a robust alternative to Pearson's R. For example,

from hypothesize.measuring_associations import wincor

results = wincor(df.height, df.weight, tr=.2)

returns the Winsorized correlation coefficient and other relevant statistics:

{
    'cor': 0.08515087411576182,
    'nval': 50,
    'sig': 0.558539575073185,
    'wcov': 0.004207827245660796
}
A case study using real-world data

It is helpful to demonstrate that robust methods in Hypothesize (and in other libraries) can make a practical difference when dealing with real-world data. In a study by Miller on sexual attitudes, 1327 men and 2282 women were asked how many sexual partners they desired over the next 30 years (the data are available from Rand R. Wilcox's site). When comparing these groups using Student's t-test, we get the following results:

{
    'ci': [-1491.09, 4823.24],
    't_value': 1.035308,
    'p_value': 0.300727
}

That is, we fail to reject the null hypothesis at the α = 0.05 level using Student's test for independent groups. However, if we switch to a robust analogue of the t-test, one that utilizes bootstrapping and trimmed means, we can indeed reject the null hypothesis. Here are the corresponding results from Hypothesize's yuenbt test (based on [Yue74]):

from hypothesize.compare_groups_with_single_factor \
    import yuenbt

results = yuenbt(df.males, df.females,
                 tr=.2, alpha=.05)

{
    'ci': [1.41, 2.11],
    'test_stat': 9.85,
    'p_value': 0.0
}

The point here is that robust statistics can make a practical difference with real-world data (even when N is considered large). Many other examples of robust statistics making a practical difference with real-world data have been documented [HD82][Wil09][Wil01].

It is important to note that robust methods may also fail to reject when a traditional test rejects (remember that traditional tests can suffer from increased Type I Error). It is also possible that both approaches yield the same or similar conclusions. The exact pattern of results depends largely on the characteristics of the underlying population distribution. To be able to reason about how robust statistics behave when compared to traditional methods, the robust statistics simulator has been created; it is described in the next section.

Robust statistics simulator

Having a library of robust statistical functions is not enough to make modern methods commonplace in applied research. Educators and practitioners still need intuitive training tools that demonstrate the core issues surrounding classical methods and how robust analogues compare.

As mentioned, computational notebooks that run in the cloud offer a unique solution to learning beyond that of static textbooks and documentation. Learning can be interactive and exploratory since narration, visualization, widgets (e.g., buttons, slider bars), and code can all be experienced in a ready-to-go compute environment—with no overhead related to local environment setup.

As a compendium to Hypothesize, and a resource for understanding and teaching robust statistics in general, the robust statistics simulator repository has been developed. It is a notebook-based collection of interactive demonstrations aimed at clearly and visually explaining the conditions under which classic methods fail relative to robust methods. A hosted notebook with the rendered visualizations of the simulations can be accessed here and seen in Figure 4. Since the simulations run in the browser and require very little understanding of code, students and teachers can easily onboard to the study of robust statistics.

Fig. 4: An example of the robust stats simulator in Deepnote's hosted notebook environment. A minimalist UI can lower the barrier-to-entry to robust statistics concepts.

The robust statistics simulator allows users to interact with the following parameters:

• Distribution shape
• Level of contamination
• Sample size
• Skew and heaviness of tails

Each of these characteristics can be adjusted independently in order to compare classic approaches to their robust alternatives. The two measures that are used to evaluate the performance of classic and robust methods are the standard error and Type I Error.

Standard error is a measure of how much an estimator varies across random samples from our population. We want to choose estimators that have a low standard error. Type I Error is also known as the False Positive Rate. We want to choose methods that keep Type I Error close to the nominal rate (usually 0.05). The robust statistics simulator can guide these decisions by providing empirical evidence as to why particular estimators and statistical tests have been chosen.

Conclusion

This paper gives an overview of the issues associated with the normal curve. The concerns with traditional methods, in terms of robustness to violations of normality, have been known for over half a century, and modern alternatives have been recommended; however, for various reasons that have been discussed, modern robust methods have not yet become commonplace in applied research settings.

One reason is the lack of easy-to-use software and teaching resources for robust statistics. To help fill this gap, Hypothesize, a peer-reviewed and open-source Python library, was developed. In addition, to help clearly demonstrate and visualize the advantages of robust methods, the robust statistics simulator was created. Using these tools, practitioners can begin to integrate robust statistical methods into their inferential testing repertoire.

Acknowledgements

The author would like to thank Karlynn Chan and Rand R. Wilcox as well as Elizabeth Dlha and the entire Deepnote team for their support of this project. In addition, the author would like to thank Kelvin Lee for his insightful review of this manuscript.

REFERENCES
[Cra46] Harold Cramer. Mathematical Methods of Statistics. Princeton Univ. Press, Princeton, NJ, 1946. URL: https://books.google.ca/books?id=CRTKKaJO0DYC.
[CvNS18] Allan Campopiano, Stefon JR van Noordt, and Sidney J Segalowitz. Statslab: An open-source EEG toolbox for computing single-subject effects using robust statistics. Behavioural Brain Research, 347:425–435, 2018. doi:10.1016/j.bbr.2018.03.025.
[CW20] Allan Campopiano and Rand R. Wilcox. Hypothesize: Robust statistics for Python. Journal of Open Source Software, 5(50):2241, 2020. doi:10.21105/joss.02241.
[Efr92] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pages 569–593. Springer, 1992. doi:10.1007/978-1-4612-4380-9_41.
[EHM08] David M Erceg-Hurn and Vikki M Mirosevich. Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. American Psychologist, 63(7):591, 2008. doi:10.1037/0003-066X.63.7.591.
[FG81] Joseph Fashing and Ted Goertzel. The myth of the normal curve: a theoretical critique and examination of its role in teaching and research. Humanity & Society, 5(1):14–31, 1981. doi:10.1177/016059768100500103.
[Gle93] John R Gleason. Understanding elongation: The scale contaminated normal family. Journal of the American Statistical Association, 88(421):327–337, 1993. doi:10.1080/01621459.1993.10594325.
[HD82] MaryAnn Hill and WJ Dixon. Robustness in real life: A study of clinical laboratory data. Biometrics, pages 377–396, 1982. doi:10.2307/2530452.
[Mic89] Theodore Micceri. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1):156, 1989. doi:10.1037/0033-2909.105.1.156.
[pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. doi:10.5281/zenodo.3509134.
[Tan82] WY Tan. Sampling distributions and robustness of t, F and variance-ratio in two samples and ANOVA models with respect to departure from normality. Comm. Statist.-Theor. Meth., 11:2485–2511, 1982. URL: https://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=PASCAL83X0380619.
[TE93] Robert J Tibshirani and Bradley Efron. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, 57:1–436, 1993. URL: https://books.google.ca/books?id=gLlpIUxRntoC.
[Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pages 448–485, 1960. URL: https://ci.nii.ac.jp/naid/20000755025/en/.
[vNDTE22] Stefon van Noordt, James A Desjardins, BASIS Team, and Mayada Elsabbagh. Inter-trial theta phase consistency during face processing in infants is associated with later emerging autism. Autism Research, 15(5):834–846, 2022. doi:10.1002/aur.2701.
[vNW21] Stefon van Noordt and Teena Willoughby. Cortical maturation from childhood to adolescence is reflected in resting state EEG signal complexity. Developmental Cognitive Neuroscience, 48:100945, 2021. doi:10.1016/j.dcn.2021.100945.
[Wil92] Rand R Wilcox. Why can methods for comparing means have relatively low power, and what can you do to correct the problem? Current Directions in Psychological Science, 1(3):101–105, 1992. doi:10.1111/1467-8721.ep10768801.
[Wil98] Rand R Wilcox. How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53(3):300, 1998. doi:10.1037/0003-066X.53.3.300.
[Wil01] Rand R Wilcox. Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy, volume 249. Springer, 2001. URL: https://link.springer.com/book/10.1007/978-1-4757-3522-2.
[Wil09] Rand R Wilcox. Robust ANCOVA using a smoother with bootstrap bagging. British Journal of Mathematical and Statistical Psychology, 62(2):427–437, 2009. doi:10.1348/000711008X325300.
[Wil13] Rand R Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 2013. doi:10.1016/c2010-0-67044-1.
[WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi:10.25080/Majora-92bf1922-00a.
[Yue74] Karen K Yuen. The two-sample trimmed t for unequal population variances. Biometrika, 61(1):165–170, 1974. doi:10.2307/2334299.
Python for Global Applications: teaching scientific Python in context to law and diplomacy students

Anna Haensch, Karin Knudson

Abstract—For students across domains and disciplines, the message has been communicated loud and clear: data skills are an essential qualification for today's job market. This includes not only the traditional introductory stats coursework but also machine learning, artificial intelligence, and programming in Python or R. Consequently, there has been significant student-initiated demand for data analytic and computational skills, sometimes with very clear objectives in mind, and other times guided by a vague sense of "the work I want to do will require this." Now we have options. If we train students using "black box" algorithms without attending to the technical choices involved, then we run the risk of unleashing practitioners who might do more harm than good. On the other hand, courses that completely unpack the "black box" can be so steeped in theory that the barrier to entry becomes too high for students from social science and policy backgrounds, thereby excluding critical voices. In sum, both of these options lead to a pitfall that has gained significant media attention over recent years: the harms caused by algorithms that are implemented without sufficient attention to human context. In this paper, we - two mathematicians turned data scientists - present a framework for teaching introductory data science skills in a highly contextualized and domain flexible environment. We will present example course outlines at the semester, weekly, and daily level, and share materials that we think hold promise.

The students and faculty at the Fletcher School are eager to seize upon our current data moment to expand their quantitative offerings. With this in mind, The Fletcher School reached out to the co-authors to develop a course in data science, situated in the context of international diplomacy. In response, we developed the (Python-based) course, Data Science for Global Applications, which had its inaugural offering in the Spring semester of 2022. The course had 30 enrolled Fletcher School students, primarily from the MALD program. When the course was announced we had a flood of interest from Fletcher students who were eager to broaden their studies with this course. With a goal of keeping a close interactive atmosphere we capped enrollment at 30. To inform the direction of our course, we surveyed students on their background in programming (see Fig. 1) and on their motivations for learning data science (see Fig. 2). Students reported only very limited experience with programming - if any at all - with that experience primarily in Excel and Tableau.
Student motivations varied, but the goal to get a job where they were able to make a meaningful Index Terms—computational social science, public policy, data science, teach- social impact was the primary motivation. ing with Python Introduction As data science continues to gain prominence in the public eye, and as we become more aware of the many facets of our lives that intersect with data-driven technologies and policies every day, universities are broadening their academic offerings to keep up with what students and their future employers demand. Not only are students hoping to obtain more hard skills in data science (e.g. Python programming experience), but they are interested in applying tools of data science across domains that haven’t Fig. 1: The majority of the 30 students enrolled in the course had little historically been part of the quantitative curriculum. The Master to no programming experience, and none reported having "a lot" of of Arts in Law and Diplomacy (MALD) is the flagship program of experience. Those who did have some experience were most likely to the Fletcher School of Law and International Diplomacy at Tufts have worked in Excel or Tableau. University. Historically, the program has contained core elements of quantitative reasoning with a focus on business, finance, and The MALD program, which is interdisciplinary by design, pro- international development, as is typical in graduate programs in vides ample footholds for domain specific data science. Keeping international relations. Like academic institutions more broadly, this in mind, as a throughline for the course, each student worked to develop their own quantitative policy project. Coursework and * Corresponding author: anna.haensch@tufts.edu discussions were designed to move this project forward from ‡ Tufts University § Data Intensive Studies Center initial policy question, to data sourcing and visualizing, and eventually to modeling and analysis. Copyright © 2022 Anna Haensch et al. This is an open-access article dis- In what follows we will describe how we structured our tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- course with the goal of empowering beginner programmers to use vided the original author and source are credited. Python for data science in the context of international relations 70 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) might understand in the abstract that the way the handling of missing data can substantially affect the outcome of an analysis, but will likely have a stronger understanding if they have had to consider how to deal with missing data in their own project. We used several course structures to support connecting data science and Python "skills" with their context. Students had readings and journaling assignments throughout the semester on topics that connected data science with society. In their journal responses, students were asked to connect the ideas in the reading to their other academic/professional interests, or ideas from other classes with the following prompt: Your reflection should be a 250-300 word narrative. Be sure to tie the reading back into your own studies, experiences, and areas of interest. For each reading, Fig. 2: The 30 enrolled students were asked to indicate which were come up with 1-2 discussion questions based on the con- relevant motivations for taking the course. Curiosity and a desire to cepts discussed in the readings. 
This can be a curiosity make a meaningful social impact were among the top motivations our question, where you’re interested in finding out more, students expressed. a critical question, where you challenge the author’s assumptions or decisions, or an application question, where you think about how concepts from the reading and diplomacy. We will also share details about course content would apply to a particular context you are interested in and structure, methods of assessment, and Python programming exploring.1 resources that we deployed through Google Colab. All of the materials described here can be found on the public course page These readings (highlighted in gray in Fig 3), assignments, and https://karink520.github.io/data-science-for-global-applications/. the related in-class discussions were interleaved among Python exercises meant to give students practice with skills including Course Philosophy and Goals manipulating DataFrames in pandas [The22], [Mck10], plotting in Matplotlib [Hun07] and seaborn [Was21], mapping with GeoPan- Our high level goals for the course were i) to empower students das [Jor21], and modeling with scikit-learn [Ped11]. Student with the skills to gain insight from data using Python and ii) to projects included a thorough data audit component requiring deepen students’ understanding of how the use of data science students to explore data sources and their human context in detail. affects society. As we sought to achieve these high level goals Precise details and language around the data audit can be found within the limited time scope of a single semester, the following on the course website. core principles were essential in shaping our course design. Below, we briefly describe each of these principles and share some Managing Fears & Concerns Through Supported Programming examples of how they were reflected in the course structure. In a We surmised that students who are new to programming and subsequent section we will more precisely describe the content of possibly intimidated by learning the unfamiliar skill would do the course, whereupon we will further elaborate on these principles well in an environment that included plenty of what we call and share instructional materials. But first, our core principles: supported programming - that is, practicing programming in class Connecting the Technical and Social with immediate access to instructor and peer support. In the pre-course survey we created, many students identified To understand the impact of data science on the world (and the concerns about their quantitative preparation, whether they would potential policy implications of such impact), it helps to have be able to keep up with the course, and how hard programming hands-on practice with data science. Conversely, to effectively might be. We sought to acknowledge these concerns head-on, and ethically practice data science, it is important to understand assure students of our full confidence in their ability to master how data science lives in the world. Thus, the "hard" skills of the material, and provide them with all the resources they needed coding, wrangling data, visualizing, and modeling are best taught to succeed. intertwined with a robust study of ways in which data science is A key resource to which we thought all students needed used and misused. access was instructor attention. 
In addition to keeping the class There is an increasing need to educate future policy-makers size capped at 30 people, with both co-instructors attending all with knowledge of how data science algorithms can be used course meetings, we structured class time to maximize the time and misused. One way to approach meeting this need, especially students spent actually doing data science in class. We sought for students within a less technically-focused program, would to keep demonstrations short, and intersperse them with coding be to teach students about how algorithms can be used without exercises so that students could practice with new ideas right actually teaching them to use algorithms. However, we argue that away. Our Colab notebooks included in the course materials show students will gain a deeper understanding of the societal and one way that we wove student practice time throughout. Drawing ethical implications of data science if they also have practical insight from social practice theory of learning (e.g. [Eng01], data science skills. For example, a student could gain a broad [Pen16]), we sought to keep in mind how individual practice and understanding of how biased training data might lead to biased learning pathways develop in relation to their particular social and algorithmic predictions, but such understanding is likely to be deeper and more memorable when a student has actually practiced 1. This journaling prompt was developed by our colleague Desen Ozkan at training a model using different training data. Similarly, someone Tufts University. PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS 71 institutional context. Crucially, we devoted a great deal of in-class and preparing data for exploratory data analysis, visualizing and time to students doing data science, and a great deal of energy annotating data, and finally modeling and analyzing data. All into making this practice time a positive and empowering social of this was done with the goal of answering a policy question experience. During student practice time, we were circulating developed by the student, allowing the student to flex some throughout the room, answering student questions and helping domain expertise to supplement the (sometimes overwhelming!) students to problem solve and debug, and encouraging students programmatic components. to work together and help each other. A small organizational Our project explicitly required that students find two datasets change we made in the first weeks of the semester that proved of interest and merge them for the final analysis. This presented to have outsized impact was moving our office hours to hold them both logistical and technical challenges. As one student pointed directly after class in an almost-adjacent room, to make it as easy out after finally finding open data: hearing people talk about the as possible for students to attend office hours. Students were vocal need for open data is one thing, but you really realize what that in their appreciation of office hours. means when you’ve spent weeks trying to get access to data that We contend that the value of supported programming time you know exists. Understanding the provenance of the data they is two-fold. First, it helps beginning programmers learn more were working with helped students assess the biases and limita- quickly. 
While learning to code necessarily involves challenges, tions, and also gave students a strong sense of ownership over students new to a language can sometimes struggle for an un- their final projects. An unplanned consequence of the broad scope productively long time on things like simple syntax issues. When of the policy project was that we, the instructors, learned nearly students have help available, they can move forward from minor as much about international diplomacy as the students learned issues faster and move more efficiently into building a meaningful about programming and data science, a bidirectional exchange of understanding. Secondly, supported programming time helps stu- knowledge that we surmised to have contributed to student feeling dents to understand that they are not alone in the challenges they of empowerment and a positive class environment. are facing in learning to program. They can see other students learning and facing similar challenges, can have the empowering Course Structure experience of helping each other out, and when asking for help can notice that even their instructors sometimes rely on resources We broke the course into three modules, each with focused like StackOverflow. An unforeseen benefit we believe co-teaching reading/journaling topics, Python exercises, and policy project had was to give us as instructors the opportunity to consult benchmarks: (i) getting and cleaning data, (ii) visualizing data, with each other during class time and share different approaches. and (iii) modeling data. In what follows we will describe the key These instructor interactions modeled for students how even as goals of each module and highlight the readings and exercises that experienced practitioners of data science, we too were constantly we compiled to work towards these goals. learning. Getting and Cleaning Data Lastly, a small but (we thought) important aspect of our setup was teaching students to set up a computing environment on Getting, cleaning, and wrangling data typically make up a signif- their own laptops, with Python, conda [Ana16], and JupyterLab icant proportion of the time involved in a data science project. [Pro22]. Using the command line and moving from an environ- Therefore, we devoted significant time in our course to learning ment like Google Colab to one’s own computer can both present these skills, focusing on loading and manipulating data using significant barriers, but doing so successfully can be an important pandas. Key skills included loading data into a pandas DataFrame, part of helping students feel like ‘real’ programmers. We devoted working with missing data, and slicing, grouping, and merging an entire class period to helping students with installation and DataFrames in various ways. After initial exposure and practice setup on their own computers. with example datasets, students applied their skills to wrangling We considered it an important measure of success how many the diverse and sometimes messy and large datasets that they found students told us at the end of the course that the class had helped for their individual projects. Since one requirement of the project them overcome sometimes longstanding feelings that technical was to integrate more than one dataset, merging was of particular skills like coding and modeling were not for them. importance. 
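As an illustration of the kind of merge at the heart of these projects, the pandas sketch below combines two hypothetical sources on a shared key; the file names and the country_code column are invented for the example and are not from any student project.

import pandas as pd

gdp = pd.read_csv("gdp_by_country.csv")       # hypothetical source 1
internet = pd.read_csv("internet_usage.csv")  # hypothetical source 2

# keep only the countries present in both datasets
merged = gdp.merge(internet, on="country_code", how="inner")
print(merged.isna().sum())  # inspect missing values in the combined data

An inner join keeps only rows shared by both sources, which is often where questions about coverage and missing data first surface.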
During this portion of the course, students read and discussed Leveraging Existing Strengths To Enhance Student Ownership Boyd and Crawford’s Critical Questions for Big Data [Boy12] Even as beginning programmers, students are capable of creating a which situates big data in the context of knowledge itself and meaningful policy-related data science project within the semester, raises important questions about access to data and privacy. Ad- starting from formulating a question and finding relevant datasets. ditional readings included selected chapters from D’Ignazio and Working on the project throughout the semester (not just at the Klein’s Data Feminism [Dig20] which highlights the importance end) gave essential context to data science skills as students could of what we choose to count and what it means when data is translate into what an idea might mean for "their" data. Giving missing. students wide leeway in their project topic allowed the project to be a point of connection between new data science skills and their Visualizing Data existing domain knowledge. Students chose projects within their A fundamental component to communicating findings from data particular areas of interest or expertise, and a number chose to is well-executed data visualization. We chose to place this module additionally connect their project for this course to their degree in the middle of the course, since it was important that students capstone project. have a common language for interpreting and communicating their Project benchmarks were placed throughout the semester analysis before moving to the more complicated aspects of data (highlighted in green in Fig 3) allowing students a concrete modeling. In developing this common language, we used Wilke’s way to develop their new skills in identifying datasets, loading Fundamentals of Data Visualization [Wil19] and Cairo’s How 72 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Course outline for a 13-week semester with two 70 minute instructional blocks each week. Course readings are highlighted in gray and policy project benchmarks are highlighted in green. Chart’s Lie [Cai19] as a backbone for this section of the course. using Python. Having the concrete target of how a student wanted In addition to reading the text materials, students were tasked with their visualization to look seemed to be a motivating starting finding visualizations “in the wild,” both good and bad. Course point from which to practice coding and debugging. We spent discussions centered on the found visualizations, with Wilke and several class periods on supported programming time for students Cairo’s writings as a common foundation. From the readings and to develop their visualizations. discussions, students became comfortable with the language and Working on building the narratives of their project and devel- taxonomy around visualizations and began to develop a better ap- oping their own visualizations in the context of the course readings preciation of what makes a visualization compelling and readable. gave students a heightened sense of attention to detail. During Students were able to formulate a plan about how they could best one day of class when students shared visualizations and gave visualize their data. The next task was to translate these plans into feedback to one another, students commented and inquired about Python. 
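For example, a plan for a simple comparison chart might translate into a few lines of Matplotlib such as the sketch below; the data and labels are hypothetical and purely illustrative.

import matplotlib.pyplot as plt

countries = ["Kenya", "Brazil", "Vietnam", "Norway"]  # hypothetical data
values = [42, 57, 63, 91]

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(countries, values, color="steelblue")
ax.set_xlabel("Internet users (% of population)")
ax.set_yticks(range(len(countries)))
ax.set_yticklabels(countries, va="center")  # the kind of small styling detail discussed in class
fig.tight_layout()
plt.show()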
incredibly small details of each others’ presentations, for example, To help students gain a level of comfort with data visualization how to adjust y-tick alignment on a horizontal bar chart. This sort in Python, we provided instruction and examples of working of tiny detail is hard to convey in a lecture, but gains outsized with a variety of charts using Matplotlib and seaborn, as well importance when a student has personally wrestled with it. as maps and choropleths using GeoPandas, and assigned students programming assignments that involved writing code to create Modeling Data a visualization matching one in an image. With that practical In this section we sought to expose students to introductory grounding, students were ready to visualize their own project data approaches in each of regression, classification, and clustering PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS 73 in Python. Specifically, we practiced using scikit-learn to work And finally, to supplement the technical components of the with linear regression, logistic regression, decision trees, random course we also had readings with associated journal entries sub- forests, and gaussian mixture models. Our focus was not on the mitted at a cadence of roughly two per module. Journal prompts theoretical underpinnings of any particular model, but rather on are described above and available on the course website. the kinds of problems that regression, classification, or clustering models respectively, are able to solve, as well as some basic ideas about model assessment. The uniform and approachable scikit- Conclusion learn API [Bui13] was crucial in supporting this focus, since it Various listings of key competencies in data science have been allowed us to focus less on syntax around any one model, and more proposed [NAS18]. For example, [Dev17] suggests the following on the larger contours of modeling, with all its associated promise pillars for an undergraduate data science curriculum: computa- and perils. We spent a good deal of time building an understanding tional and statistical thinking, mathematical foundations, model of train-test splits and their role in model assessment. building and assessment, algorithms and software foundation, Student projects were required to include a modeling com- data curation, and knowledge transference—communication and ponent. Just the process of deciding which of regression, clas- responsibility. As we sought to contribute to the training of sification, or clustering were appropriate for a given dataset and data-science informed practitioners of international relations, we policy question is highly non-trivial for beginners. The diversity of focused on helping students build an initial competency especially student projects and datasets meant students had to grapple with in the last four of these. this decision process in its full complexity. We were delighted by We can point to several key aspects of the course that made the variety of modeling approaches students used in their projects, it successful. Primary among them was the fact that the majority as well as by students’ thoughtful discussions of the limitations of of class time was spent in supported programming. This means their analysis. that students were able to ask their instructors or peers as soon To accompany this section of the course, students were as- as questions arose. 
Novice programmers who aren’t part of a signed readings focusing on some of the societal impacts of data formal computer science program often don’t have immediate modeling and algorithms more broadly. These readings included access to the resources necessary to get "unstuck." for the novice a chapter from O’Neil’s Weapons of Math Destruction [One16] as programmer, even learning how to google technical terms can be a well as Buolamwini and Gebru’s Gender Shades [Buo18]. Both of challenge. This sort of immediate debugging and feedback helped these readings emphasize the capacity of algorithms to exacerbate students remain confident and optimistic about their projects. This inequalities and highlight the importance of transparency and was made all the more effective since we were co-teaching the ethical data practices. These readings resonated especially strongly course and had double the resources to troubleshoot. Co-teaching with our students, many of whom had recently taken courses in also had the unforeseen benefit of making our classroom a place cyber policy and ethics in artificial intelligence. where the growth mindset was actively modeled and nurtured: where one instructor wasn’t able to answer a question, the other Assessments instructor often could. Finally, it was precisely the motivation of Formal assessment was based on four components, already alluded learning data science in context that allowed students to maintain a to throughout this note. The largest was the ongoing policy sense of ownership over their work and build connections between project which had benchmarks with rolling due dates throughout their other courses. the semester. Moreover, time spent practicing coding skills in Learning programming from the ground up is difficult. Stu- class was often done in service of the project. For example, in dents arrive excited to learn, but also nervous and occasionally week 4, when students learned to set up their local computing heavy with the baggage they carry from prior experience in environments, they also had time to practice loading, reading, and quantitative courses. However, with a sufficient supported learning saving data files associated with their chosen project datasets. This environment it’s possible to impart relevant skills. It was a measure brought challenges, since often students sitting side-by-side were of the success of the course how many students told us that the dealing with different operating systems and data formats. But course had helped them overcome negative prior beliefs about from this challenge emerged many organic conversations about their ability to code. Teaching data science skills in context and file types and the importance of naming conventions. The rubric with relevant projects that leverage students’ existing expertise and for the final project is shown in Fig 4. outside reading situates the new knowledge in a place that feels The policy project culminated with in-class “micro presenta- familiar and accessible to students. This contextualization allows tions” and a policy paper. We dedicated two days of class in week students to gain some mastery while simultaneously playing to 13 for in-class presentations, for which each student presented their strengths and interests. one slide consisting of a descriptive title, one visualization, and several “key takeaways” from the project. 
This extremely restric- tive format helped students to think critically about the narrative R EFERENCES information conveyed in a visualization, and was designed to create time for robust conversation around each presentation. [Ana16] Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. In addition to the policy project, each of the three course Anaconda, Nov. 2016. Web. https://anaconda.com. [Boy12] Boyd, Danah, and Kate Crawford. Critical questions for big data: modules also had an associated set of Python exercises (available Provocations for a cultural, technological, and scholarly phe- on the course website). Students were given ample time both in nomenon. Information, communication & society 15.5 (2012):662- and out of class to ask questions about the exercises. Overall, these 679. https://doi.org/10.1080/1369118X.2012.678878 exercises proved to be the most technically challenging component [Bui13] Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae et al. API design for of the course, but we invited students to resubmit after an initial machine learning software: experiences from the scikit-learn project. round of grading. arXiv preprint arXiv:1309.0238 (2013). 74 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Rubric for the policy project that formed a core component of the formal assessment of students throughout the course. [Buo18] Buolamwini, Joy, and Timnit Gebru. Gender shades: Intersectional [Ped11] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent accuracy disparities in commercial gender classification. Conference Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al. on fairness, accountability and transparency. PMLR, 2018. http:// Scikit-learn: Machine learning in Python. the Journal of machine proceedings.mlr.press/v81/buolamwini18a.html Learning research 12 (2011): 2825-2830. https://dl.acm.org/doi/10. [Cai19] Cairo, Alberto. How charts lie: Getting smarter about visual infor- 5555/1953048.2078195 mation. WW Norton & Company, 2019. [Pen16] Penuel, William R., Daniela K. DiGiacomo, Katie Van Horne, and [Dev17] De Veaux, Richard D., Mahesh Agarwal, Maia Averett, Benjamin Ben Kirshner. A Social Practice Theory of Learning and Becoming S. Baumer, Andrew Bray, Thomas C. Bressoud, Lance Bryant et al. across Contexts and Time. Frontline Learning Research 4, no. 4 Curriculum guidelines for undergraduate programs in data science. (2016): 30-38. http://dx.doi.org/10.14786/flr.v4i4.205 Annual Review of Statistics and Its Application 4 (2017): 15-30. [Pro22] Project Jupyter, 2022. jupyterlab/jupyterlab: JupyterLab 3.4.3 https: https://doi.org/10.1146/annurev-statistics-060116-053930 //github.com/jupyterlab/jupyterlab [Dig20] D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. MIT [The22] The Pandas Development Team, 2022. pandas-dev/pandas: Pandas press, 2020. 1.4.2. Zenodo. https://doi.org/10.5281/zenodo.6408044 [Eng01] Engeström, Yrjö. Expansive learning at work: Toward an activity [Was21] Waskom, Michael L. Seaborn: statistical data visualization. Journal theoretical reconceptualization. Journal of education and work 14, of Open Source Software 6, no. 60 (2021): 3021. https://doi.org/10. no. 1 (2001): 133-156. https://doi.org/10.1080/13639080020028747 21105/joss.03021 [Hun07] Hunter, J.D., Matplotlib: A 2D Graphics Environment. Computing in [Wil19] Wilke, Claus O. Fundamentals of data visualization: a primer on Science & Engineering, vol. 9, no. 3 (2007): 90-95. 
https://doi.org/ making informative and compelling figures. O’Reilly Media, 2019. 10.1109/MCSE.2007.55 [Jor21] Jordahl, Kelsey et al. 2021. Geopandas/geopandas: V0.10.2. Zenodo. https://doi.org/10.5281/zenodo.5573592. [Mck10] McKinney, Wes. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, vol. 445, no. 1, pp. 51-56. 2010. https://doi.org/10.25080/Majora-92bf1922-00a [NAS18] National Academies of Sciences, Engineering, and Medicine. Data science for undergraduates: Opportunities and options. National Academies Press, 2018. [One16] O’Neil, Cathy. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 75 Papyri: better documentation for the scientific ecosystem in Jupyter Matthias Bussonnier‡§∗ , Camille Carvalho¶k F Abstract—We present here the idea behind Papyri, a framework we are devel- documentation is often displayed as raw source where no naviga- oping to provide a better documentation experience for the scientific ecosystem. tion is possible. On the maintainers’ side, the final documentation In particular, we wish to provide a documentation browser (from within Jupyter rendering is less of a priority. Rather, maintainers should aim at or other IDEs and Python editors) that gives a unified experience, cross library making users gain from improvement in the rendering without navigation search and indexing. By decoupling documentation generation from having to rebuild all the docs. rendering we hope this can help address some of the documentation accessi- bility concerns, and allow customisation based on users’ preferences. Conda-Forge [CFRG] has shown that concerted efforts can give a much better experience to end-users, and in today’s world Index Terms—Documentation, Jupyter, ecosystem, accessibility where it is ubiquitous to share libraries source on code platforms, perform continuous integration and many other tools, we believe a better documentation framework for many of the libraries of the Introduction scientific Python should be available. Over the past decades, the Python ecosystem has grown rapidly, Thus, against all advice we received and based on our own and one of the last bastion where some of the proprietary competi- experience, we have decided to rebuild an opinionated documen- tion tools shine is integrated documentation. Indeed, open-source tation framework, from scratch, and with minimal dependencies: libraries are usually developed in distributed settings that can make Papyri. Papyri focuses on building an intermediate documentation it hard to develop coherent and integrated systems. representation format, that lets us decouple building, and rendering While a number of tools and documentations exists (and the docs. This highly simplifies many operations and gives us improvements are made everyday), most efforts attempt to build access to many desired features that were not available up to now. documentation in an isolated way, inherently creating a heteroge- In what follows, we provide the framework in which Papyri neous framework. The consequences are twofolds: (i) it becomes has been created and present its objectives (context and goals), difficult for newcomers to grasp the tools properly, (ii) there is a we describe the Papyri features (format, installation, and usage), lack of cohesion and of unified framework due to library authors then present its current implementation. 
We end this paper with making their proper choices as well as having to maintain build comments on current challenges and future work. scripts or services. Many users, colleagues, and members of the community have Context and objectives been frustrated with the documentation experience in the Python Through out the paper, we will draw several comparisons between ecosystem. Given a library, who hasn’t struggled to find the documentation building and compiled languages. Also, we will "official" website for the documentation ? Often, users stumble borrow and adapt commonly used terminology. In particular, sim- across an old documentation version that is better ranked in their ilarities with "ahead-of-time" (AOT) [AOT], "just-in-time"" (JIT) favorite search engine, and this impacts significantly the learning [JIT], intermediate representation (IR) [IR], link-time optimization process of less experienced users. (LTO) [LTO], static vs dynamic linking will be highlighted. This On users’ local machine, this process is affected by lim- allows us to clarify the presentation of the underlying architecture. ited documentation rendering. Indeed, while in many Integrated However, there is no requirement to be familiar with the above Development Environments (IDEs) the inspector provides some to understand the concepts underneath Papyri. In that context, we documentation, users do not get access to the narrative, or the full wish to discuss documentation building as a process from a source- documentation gallery. For Command Line Interface (CLI) users, code meant for a machine to a final output targeting the flesh and blood machine between the keyboard and the chair. * Corresponding author: bussonniermatthias@gmail.com ‡ QuanSight, Inc § Digital Ours Lab, SARL. Current tools and limitations ¶ University of California Merced, Merced, CA, USA || Univ Lyon, INSA Lyon, UJM, UCBL, ECL, CNRS UMR 5208, ICJ, F-69621, In the scientific Python ecosystem, it is well known that Docutils France [docutils] and Sphinx [sphinx] are major cornerstones for pub- lishing HTML documentation for Python. In fact, they are used Copyright © 2022 Matthias Bussonnier et al. This is an open-access article by all the libraries in this ecosystem. While a few alternatives distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, exist, most tools and services have some internal knowledge of provided the original author and source are credited. Sphinx. For instance, Read the Docs [RTD] provides a specific 76 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Sphinx theme [RTD-theme] users can opt-in to, Jupyter-book [JPYBOOK] is built on top of Sphinx, and MyST parser [MYST] (which is made to allow markdown in documentation) targets Sphinx as a backend, to name a few. All of the above provide an "ahead-of-time" documentation compilation and rendering, which is slow and computationally intensive. When a project needs its specific plugins, extensions and configurations to properly build (which is almost always the case), it is relatively difficult to build documentation for a single object (like a single function, module or class). This makes AOT tools difficult to use for interactive exploration. One can then consider a JIT approach, as done for Docrepr [DOCREPR] (integrated both in Jupyter and Spyder [Spyder]). However in that case, interactive documentation lacks inline plots, crosslinks, indexing, search and many custom Fig. 
1: The following screenshot shows the help for directives. scipy.signal.dpss, as currently accessible (left), as shown by Papyri for Jupyterlab extension (right). An extended version of the Some of the above limitations are inherent to the design right pannel is displayed in Figure 4. of documentation build tools that were intended for a separate documentation construction. While Sphinx does provide features like intersphinx, link resolutions are done at the documentation raw docstrings (see for example the SymPy discussion2 on how building phase. Thus, this is inherently unidirectional, and can equations should be displayed in docstrings, and left panel of break easily. To illustrate this, we consider NumPy [NP] and SciPy Figure 1). In terms of format, markdown is appealing, however [SP], two extremely close libraries. In order to obtain proper cross- inconsistencies in the rendering will be created between libraries. linked documentation, one is required to perform at least five steps: Finally, some libraries can dynamically modify their docstring at • build NumPy documentation runtime. While this sometime avoids using directives, it ends up • publish NumPy object.inv file. being more expensive (runtime costs, complex maintenance, and • (re)build SciPy documentation using NumPy obj.inv contribution costs). file. Objectives of the project • publish SciPy object.inv file • (re)build NumPy docs to make use of SciPy’s obj.inv We now layout the objectives of the Papyri documentation frame- work. Let us emphasize that the project is in no way intended to Only then can both SciPy’s and NumPy’s documentation refer replace or cover many features included in well-established docu- to each other. As one can expect, cross links break every time mentation tools such as Sphinx or Jupyter-book. Those projects are a new version of a library is published1 . Pre-produced HTML extremely flexible and meet the needs of their users for publishing in IDEs and other tools are then prone to error and difficult to a standalone documentation website of PDFs. The Papyri project maintain. This also raises security issues: some institutions be- addresses specific documentation challenges (mentioned above), come reluctant to use tools like Docrepr or viewing pre-produced we present below what is (and what is not) the scope of work. HTML. Goal (a): design a non-generic (non fully customisable) website builder. When authors want or need complete control Docstrings format of the output and wide personalisation options, or branding, then The Numpydoc format is ubiquitous among the scientific ecosys- Papyri is not likely the project to look at. That is to say single- tem [NPDOC]. It is loosely based on reStructuredText (RST) project websites where appearance, layout, domain need to be syntax, and despite supporting full RST syntax, docstrings rarely controlled by the author is not part of the objectives. contain full-featured directive. Maintainers are confronted to the Goal (b): create a uniform documentation structure and following dilemma: syntax. The Papyri project prescribes stricter requirements in • keep the docstrings simple. This means mostly text-based terms of format, structure, and syntax compared to other tools docstrings with few directive for efficient readability. The such as Docutils and Sphinx. When possible, the documentation end-user may be exposed to raw docstring, there is no on- follows the Diátaxis Framework [DT]. This provides a uniform the-fly directive interpretation. 
This is the case for tools documentation setup and syntax, simplifying contributions to the such as IPython and Jupyter. project and easing error catching at compile time. Such strict envi- • write an extensive docstring. This includes references, and ronment is qualitatively supported by a number of documentation directive that potentially creates graphics, tables and more, fixes done upstream during the development stage of the project3 . allowing an enriched end-user experience. However this Since Papyri is not fully customisable, users who are already using may be computationally intensive, and executing code to documentation tools such as Sphinx, mkdocs [mkdocs] and others view docs could be a security risk. should expect their project to require minor modifications to work with Papyri. Other factors impact this choice: (i) users, (ii) format, (iii) Goal (c): provide accessibility and user proficiency. Ac- runtime. IDE users or non-Terminal users motivate to push for cessibility is a top priority of the project. To that aim, items extensive docstrings. Tools like Docrepr can mitigate this problem are associated to semantic meaning as much as possible, and by allowing partial rendering. However, users are often exposed to 2. sympy/sympy#14963 1. ipython/ipython#12210, numpy/numpy#21016, & #29073 3. Tests have been performed on NumPy, SciPy. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 77 documentation rendering is separated from documentation build- Intermediate Representation for Documentation (IRD) ing phase. That way, accessibility features such as high contract IRD format: Papyri relies on standard interchangeable themes (for better text-to-speech (TTS) raw data), early example "Intermediate Representation for Documentation" (IRD) format. highlights (for newcomers) and type annotation (for advanced This allows to reduce operation complexity of the documentation users) can be quickly available. With the uniform documentation build. For example, given M documentation producers and N structure, this provides a coherent experience where users become renderers, a full documentation build would be O(MN) (each more comfortable finding information in a single location (see renderer needs to understand each producer). If each producer only Figure 1). cares about producing IRD, and if each renderer only consumes it, Goal (d): make documentation building simple, fast, and then one can reduce to O(M+N). Additionally, one can take IRD independent. One objective of the project is to make documenta- from multiple producers at once, and render them all to a single tion installation and rendering relatively straightforward and fast. target, breaking the silos between libraries. To that aim, the project includes relative independence of doc- At the moment, IRD files are currently separated into four umentation building across libraries, allowing bidirectional cross main categories roughly following the Diátaxis framework [DT] links (i.e. both forward and backward links between pages) to and some technical needs: be maintained more easily. In other words, a single library can be built without the need to access documentation from another. Also, • API files describe the documentation for a single ob- the project should include straightforward lookup documentation ject, expressed as a JSON object. When possible, the for an object from the interactive read–eval–print loop (REPL). information is encoded semantically (Objective (c)). 
Files Finally, efforts are put to limit the installation speed (to avoid are organized based on the fully-qualified name of the polynomial growth when installing packages on large distributed Python object they reference, and contain either absolute systems). reference to another object (library, version and identi- fier), or delayed references to objects that may exist in another library. Some extra per-object meta information The Papyri solution like file/line number of definitions can be stored as well. In this section we describe in more detail how Papyri has been • Narrative files are similar to API files, except that they do implemented to address the objectives mentioned above. not represent a given object, but possess a previous/next page. They are organised in an ordered tree related to the table of content. Making documentation a multi-step process • Example files are a non-ordered collection of files. When using current documentation tools, customisation made by • Assets files are untouched binary resource archive files that maintainers usually falls into the following two categories: can be referenced by any of the above three ones. They are the only ones that contain backward references, and no • simpler input convenience, forward references. • modification of final rendering. In addition to the four categories above, metadata about the This first category often requires arbitrary code execution and current package is stored: this includes library name, current must import the library currently being built. This is the case version, PyPi name, GitHub repository slug4 , maintainers’ names, for example for the use of .. code-block:::, or custom logo, issue tracker and others. In particular, metadata allows :rc: directive. The second one offers a more user friendly en- us to auto-generate links to issue trackers, and to source files vironment. For example, sphinx-copybutton [sphinx-copybutton] when rendering. In order to properly resolve some references and adds a button to easily copy code snippets in a single click, normalize links convention, we also store a mapping from fully and pydata-sphinx-theme [pydata-sphinx-theme] or sphinx-rtd- qualified names to canonical ones. dark-mode provide a different appearance. As a consequence, Let us make some remarks about the current stage of IRD for- developers must make choices on behalf of their end-users: this mat. The exact structure of package metadata has not been defined may concern syntax highlights, type annotations display, light/dark yet. At the moment it is reduced to the minimum functionality. theme. While formats such as codemeta [CODEMETA] could be adopted, Being able to modify extensions and re-render the documenta- in order to avoid information duplication we rely on metadata tion without the rebuilding and executing stage is quite appealing. either present in the published packages already or extracted from Thus, the building phase in Papyri (collecting documentation Github repository sources. Also, IRD files must be standardized information) is separated from the rendering phase (Objective (c)): in order to achieve a uniform syntax structure (Objective (b)). at this step, Papyri has no knowledge and no configuration options In this paper, we do not discuss IRD files distribution. Last, the that permit to modify the appearance of the final documentation. final specification of IRD files is still in progress and regularly Additionally, the optional rendering process has no knowledge of undergoes major changes (even now). 
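As a purely illustrative sketch, and not the actual IRD schema (which the authors note is still in progress), an entry in an API file could carry the kind of semantic fields described above; everything here, including the field names, is hypothetical.

ird_api_entry = {
    "qualified_name": "example_pkg.submodule.func",  # hypothetical object
    "signature": "func(x, axis=0)",
    "sections": {
        "Summary": ["Compute something useful."],
        "Parameters": [{"name": "x", "type": "ndarray", "desc": ["One-dimensional input array."]}],
    },
    # absolute reference: library, version and identifier are all known
    "references": [{"library": "numpy", "version": "1.22", "ref": "numpy.mean"}],
    # delayed reference: resolved later, if/when the target library's bundle is installed
    "delayed_references": ["scipy.signal.dpss"],
    "source": {"file": "submodule.py", "line": 42},  # per-object metadata
}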
Thus, we invite contributors the building step, and can be run without accessing the libraries to consult the current state of implementation on the GitHub involved. repository [Papyri]. Once the IRD format is more stable, this will This kind of technique is commonly used in the field of be published as a JSON schema, with full specification and more compilers with the usage of Single Compilation Unit [SCU] and in-depth description. Intermediate Representation [IR], but to our knowledge, it has not been implemented for documentation in the Python ecosystem. 4. "slug" is the common term that refers to the various combinations As mentioned before, this separation is key to achieving many of organization name/user name/repository name, that uniquely identifies a features proposed in Objectives (c), (d) (see Figure 2). repository on a platform like GitHub. 78 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Sketch representing how to build documentation with Papyri. Step 1: Each project builds an IRD bundle that contains semantic information about the project documentation. Step 2: the IRD bundles are publihsed online. Step 3: users install IRD bundles locally on their machine, pages get corsslinked, indexed, etc. Step 4: IDEs render documentation on-the-fly, taking into consideration users’ preferences. IRD bundles: Once a library has collected IRD repre- package managers or IDEs, one could imagine this process being sentation for all documentation items (functions, class, narrative automatic, or on demand. This step should be fairly efficient as it sections, tutorials, examples), Papyri consolidates them into what mostly requires downloading and unpacking IRD files. we will refer to as IRD bundles. A Bundle gathers all IRD files Finally, IDEs developers want to make sure IRD files can be and metadata for a single version of a library5 . Bundles are a properly rendered and browsed by their users when requested. convenient unit to speak about publication, installation, or update This may potentially take into account users’ preferences, and may of a given library documentation files. provide added values such as indexing, searching, bookmarks and Unlike package installation, IRD bundles do not have the others, as seen in rustsdocs, devdocs.io. notion of dependencies. Thus, a fully fledged package manager is not necessary, and one can simply download corresponding files Current implementation and unpack them at the installation phase. We present here some of the technological choices made in the Additionally, IRD bundles for multiple versions of the same current Papyri implementation. At the moment, it is only targeting library (or conflicting libraries) are not inherently problematic as a subset of projects and users that could make use of IRD files and they can be shared across multiple environments. bundles. As a consequence, it is constrained in order to minimize From a security standpoint, installing IRD bundles does not the current scope and efforts development. Understanding the require the execution of arbitrary code. This is a critical element implementation is not necessary to use Papyri neither as a project for adoption in deployments. There exists as well an opportunity to maintainer nor as a user, but it can help understanding some of the provide localized variants at the IRD installation time (IRD bundle current limitations. translations haven’t been explored exhaustively at the moment). 
Additionally, nothing prevents alternatives and complementary implementations with different choices: as long as other imple- IRD and high level usage mentations can produce (or consume) IRD bundles, they should Papyri-based documentation involves three broad categories of be perfectly compatible and work together. stakeholders (library maintainers, end-users, IDE developers), and The following sections are thus mostly informative to under- processes. This leads to certain requirements for IRD files and stand the state of the current code base. In particular we restricted bundles. ourselves to: On the maintainers’ side, the goal is to ensure that Papyri can build IRD files, and publish IRD bundles. Creation of IRD files • Producing IRD bundles for the core scientific Python and bundles is the most computationally intensive step. It may projects (Numpy, SciPy, Matplotlib...) require complex dependencies, or specific plugins. Thus, this can • Rendering IRD documentation for a single user on their be a multi-step process, or one can use external tooling (not related local machine. to Papyri nor using Python) to create them. Visual appearance Finally, some of the technological choices have no other and rendering of documentation is not taken into account in this justification than the main developer having interests in them, or process. Overall, building IRD files and bundles takes about the making iterations on IRD format and main code base faster. same amount of time as running a full Sphinx build. The limiting factor is often associated to executing library examples and code IRD files generation snippets. For example, building SciPy & NumPy documentation The current implementation of Papyri only targets some compat- IRD files on a 2021 Macbook Pro M1 (base model), including ibility with Sphinx (a website and PDF documentation builder), executing examples in most docstrings and type inferring most reStructuredText (RST) as narrative documentation syntax and examples (with most variables semantically inferred) can take Numpydoc (both a project and standard for docstring formatting). several minutes. These are widely used by a majority of the core scientific End-users are responsible for installing desired IRD bundles. Python ecosystem, and thus having Papyri and IRD bundles In most cases, it will consist of IRD bundles from already compatible with existing projects is critical. We estimate that installed libraries. While Papyri is not currently integrated with about 85%-90% of current documentation pages being built with Sphinx, RST and Numpydoc can be built with Papyri. Future work 5. One could have IRD bundles not attached to a particular library. For example, this can be done if an author wishes to provide only a set of examples includes extensions to be compatible with MyST (a project to or tutorials. We will not discuss this case further here. bring markdown syntax to Sphinx), but this is not a priority. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 79 To understand RST Syntax in narrative documentation, RST documents need to be parsed. To do so, Papyri uses tree-sitter [TS] and tree-sitter-rst [TSRST] projects, allowing us to extract an "Abstract Syntax Tree" (AST) from the text files. When using tree- sitter, AST nodes contain bytes-offsets into the original text buffer. Then one can easily "unparse" an AST node when necessary. 
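To make the byte-offset "unparse" step concrete, here is a minimal sketch of ours (not taken from the Papyri code base), assuming the py-tree-sitter bindings circa 2022 (~0.20) and a local checkout of tree-sitter-rst compiled into a shared library; paths and the build step are assumptions, and the actual Papyri internals may differ.

from tree_sitter import Language, Parser

# Assumption: tree-sitter-rst has been cloned into vendor/ and is
# compiled here into a loadable grammar (older py-tree-sitter API).
Language.build_library("build/languages.so", ["vendor/tree-sitter-rst"])
RST = Language("build/languages.so", "rst")

parser = Parser()
parser.set_language(RST)

source = b".. note::\n\n   Parsed with tree-sitter-rst.\n"
tree = parser.parse(source)

def unparse(node):
    # Nodes carry byte offsets into the original buffer, so the exact
    # original text of any subtree can always be recovered.
    return source[node.start_byte:node.end_byte].decode()

for child in tree.root_node.children:
    print(child.type, repr(unparse(child)))

Because the original bytes are always recoverable from a node, heuristics can rewrite a problematic span and re-parse only that span, which is the strategy described next.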
Working at the AST level in this way is relatively convenient for handling custom directives and edge cases (for instance, when projects rely on a loose definition of the RST syntax). Let us provide an example. RST directives are usually of the form:

.. directive:: arguments
    body

While technically there should be no space before the ::, Docutils and Sphinx will not raise errors when building the documentation if one is present. Due to our choice of a rigid (but unified) structure, tree-sitter emits an error node if there is an extra space. This allows us to check for error nodes, unparse, apply heuristics to restore proper syntax, and then parse again to obtain the new node.

Alternatively, a number of directives, such as warnings and note admonitions, still contain valid RST. Instead of storing the directive with its raw text, we parse the full document (potentially finding invalid syntax) and unparse back to the raw text only if the directive requires it.

Serialisation of the data structures into IRD files currently uses a custom serialiser; future work may include swapping to msgspec [msgspec]. The AST objects are completely typed; however, they contain a number of unions and sequences of unions. It turns out that many frameworks, like pydantic [pydantic], do not support sequences of unions where each item in the union may be of a different type. To our knowledge, only a few other documentation-related projects treat the AST as an intermediate object with a stable format that can be manipulated by external tools; the most popular one is Pandoc [pandoc], a project meant to convert between many document formats.

The current Papyri strategy is to type-infer all code examples with Jedi [JEDI], and to pre-syntax-highlight them using Pygments when possible.

IRD File Installation

Download and installation of IRD files is done concurrently using httpx [httpx], with Trio [Trio] as an async framework, allowing us to fetch many files at once. The current implementation of Papyri targets Python documentation and is written in Python. We can therefore query the versions of the Python libraries that are installed, and infer the appropriate version of the requested documentation. At the moment, the implementation tentatively guesses the relevant library versions when the exact version number is missing from the install command.

For convenience and performance, IRD bundles are post-processed and stored in a different format at installation time. For local rendering, we mostly need to perform the following operations:

1) query graph information about cross-links across documents;
2) render a single page;
3) access raw data (e.g. images).

We also assume that IRD files may be infrequently updated, that disk space is limited, and that installing or running services (like a database server) is not necessarily possible. This provides an adapted framework to test Papyri on an end-user machine.

With those requirements, we decided to use a combination of SQLite (an in-process database engine), Concise Binary Object Representation (CBOR), and raw storage, to better reflect the access patterns (see Figure 3).

Fig. 3: Sketch representing how Papyri stores information in three different formats depending on access patterns: a SQLite database for relationship information, on-disk CBOR files for more compact storage of IRD, and raw files (e.g. images). A GraphStore API abstracts all access and takes care of maintaining consistency.

SQLite allows us to easily query for object existence and for graph information (relationships between objects) at runtime; it is optimized for infrequent read access. Currently, many queries are done at runtime, when rendering documentation. The goal is to move most of the SQLite information-resolving step (such as looking for inter-library links) to installation time, once the code base and the IRD format have stabilized. SQLite is less strongly typed than other relational or graph databases and needs custom logic, but it is ubiquitous on all systems and does not need a separate server process, making it an easy choice of database.

CBOR is a more space-efficient alternative to JSON. In particular, keys in IRD are often highly redundant and encode compactly in CBOR. Storing IRD in CBOR thus reduces disk usage, and can also allow faster deserialization without requiring potentially CPU-intensive compression and decompression. This is a good compromise for potentially low-performance users' machines.

Raw storage is used for binary blobs which need to be accessed without further processing. This typically refers to images, and raw storage can be accessed with standard tools like image viewers.

Finally, access to all of these resources is provided via an internal GraphStore API which is agnostic of the backend, but ensures the consistency of operations like adding, removing, or replacing documents. Figure 3 summarizes this process.

Of course, the above choices depend on the context in which documentation is rendered and viewed. For example, an online archive intended for browsing the documentation of multiple projects and versions may decide to use an actual graph database for object relationships, and store other files on a Content Delivery Network or blob storage for random access.
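As an illustration of this storage split, the sketch below (our own toy example, not Papyri's actual schema or file layout) stores a small IRD-like document as CBOR on disk, records its outgoing links in SQLite, and keeps a binary asset as a raw file; it uses the third-party cbor2 package and the standard-library sqlite3 module.

import json
import pathlib
import sqlite3
import cbor2

store = pathlib.Path("docstore")
store.mkdir(exist_ok=True)

# Hypothetical IRD-like payload for one API page.
page = {
    "qualname": "mylib.solve",
    "sections": {"Summary": "Solve the thing.", "Examples": ">>> solve(1)"},
    "links": ["mylib.Solver", "numpy.ndarray"],
}

# Compact on-disk representation: CBOR is usually smaller than the
# equivalent JSON text for key-heavy payloads like this one.
encoded = cbor2.dumps(page)
(store / "mylib.solve.cbor").write_bytes(encoded)
print("cbor:", len(encoded), "json:", len(json.dumps(page).encode()))

# Relationship information goes to SQLite, so existence and
# cross-link queries stay cheap at render time.
db = sqlite3.connect(str(store / "links.db"))
db.execute("CREATE TABLE IF NOT EXISTS links (src TEXT, dest TEXT)")
db.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [(page["qualname"], dest) for dest in page["links"]],
)
db.commit()

# Binary assets stay as raw files, readable by standard tools.
(store / "figure.png").write_bytes(b"\x89PNG placeholder")

# Reading a page back only requires decoding one CBOR blob.
print(cbor2.loads((store / "mylib.solve.cbor").read_bytes())["qualname"])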
Documentation Rendering

The current Papyri implementation includes a number of rendering engines (presented below). Each of them essentially consists of fetching a single page with its metadata, walking through the IRD AST tree, and rendering each node according to the user's preferences.

• An ASCII terminal renderer uses Jinja2 [Jinja2]. This can be useful for piping documentation to other tools like grep, less or cat, and for working in a highly restricted environment while making sure that reading the documentation remains coherent. It can also serve as a proxy for screen reading.

• A textual user interface browser renders using urwid. Navigation within the terminal is possible, one can reflow long lines when windows are resized, and even open image files in external editors. Nonetheless, several bugs have been encountered in urwid. The project aims at replacing the CLI IPython question mark operator (obj?) interface (which currently only shows raw docstrings), and at rewriting this urwid interface with Rich/Textual. For this interface, having images stored raw on disk is useful, as it allows us to call directly into a system image viewer to display them.

• A just-in-time (JIT) rendering engine uses Jinja2, Quart [quart] and Trio; Quart is an async version of Flask [flask]. This option contains the most features and is therefore the main one used for development, as it lets us iterate on the rendering engine rapidly. When exploring the user interface design and navigation, we found that a plain list of back references has limited use: it can be challenging to judge the relevance of back references, as well as their relationship to each other. By playing with a network graph visualisation (see Figure 5), we can identify clusters of similar information within the back references. Of course, this identification has limits, especially when pages have a large number of back references (where the graph becomes too busy). It also illustrates a strength of the Papyri architecture: creating this network visualization did not require any regeneration of the documentation; one simply updates the template and re-renders the current page as needed.

• A static ahead-of-time (AOT) renderer for all the pages that can be rendered ahead of time uses the same classes as the JIT rendering. Basically, it loops through all entries in the SQLite database and renders each item independently. This renderer is mostly used for exhaustive testing and performance measurements of Papyri. It can render most of the API documentation of IPython, Astropy [astropy], Dask and Distributed [Dask], Matplotlib [MPL], [MPL-DOI], NetworkX [NX], NumPy [NP], pandas, Papyri, SciPy, scikit-image and others; this represents ~28000 pages in ~60 seconds (that is, ~450 pages/s on a recent MacBook Pro M1).

Fig. 5: Local graph (made with D3.js [D3js]) representing the connections among the most important nodes around the current page across many libraries, when viewing numpy.ndarray. Nodes are sized with respect to the number of incoming links, and colored with respect to their library. This graph is generated at rendering time and is updated depending on the libraries currently installed. It helps identify related functions and documentation, though it can become challenging to read for highly connected items, as seen here for numpy.ndarray.

For all of the above renderers, profiling shows that documentation rendering is mostly limited by object deserialisation from disk and by the Jinja2 templating engine. In the early phase of the project, we attempted to write a static HTML renderer in a compiled language (Rust, using compiled and type-checked templates). This provided a speedup of roughly a factor of 10; however, its implementation is now out of sync with the main Papyri code base.

Finally, a JupyterLab extension is currently in progress. The documentation presents itself as a side panel and is capable of basic browsing and rendering (see Figure 1 and Figure 4). The model uses TypeScript, React and native JupyterLab components. Future goals include improving or replacing JupyterLab's question mark operator (obj?) and the JupyterLab Inspector (when possible). A screenshot of the current development version of the JupyterLab extension can be seen in Figure 4.

Fig. 4: Example of the extended view of the Papyri documentation JupyterLab extension (here for SciPy). Code examples can now include plots, most tokens in each example are linked to the corresponding page, and an early navigation bar is visible at the top.

Challenges

We mentioned above some of the limitations we encountered (in rendering usage, for instance) and what will be done in the future to address them. We describe below some limitations related to syntax choices, and broader opportunities that arise from the Papyri project.

Limitations

The decoupling of the building and rendering phases is key in Papyri. However, it requires us to come up with a method that uniquely identifies each object. In particular, this is essential in order to link to any object's documentation without accessing the IRD bundles built from all the libraries. To that aim, we use the fully qualified name of an object: each object is identified by the concatenation of the module in which it is defined with its local name. Nonetheless, several particular cases need specific treatment (a short sketch of the naming problem follows this list).

• To mirror the Python syntax, it is easy to use . to concatenate both parts. Unfortunately, this leads to ambiguity when a module re-exports a function that has the same name as a submodule. For example, if one types

# module mylib/__init__.py
from .mything import mything

then mylib.mything is ambiguous between the mything submodule and the re-exported object. In future versions, the chosen convention will use : as the module/name separator.

• Decorated functions and other dynamic approaches to exposing functions to users end up having <locals> in their fully qualified names, which is invalid.

• Many built-in functions (np.sin, np.cos, etc.) do not have a fully qualified name that can be extracted by object introspection. We believe it should be possible to identify those via other means, like a docstring hash (to be explored).

• Fully qualified names are often not the canonical names (i.e. the name typically used for imports). While we made efforts to create a mapping from one to the other, finding the canonical name automatically is not always straightforward.

• There are also challenges with case sensitivity. For example, on macOS file systems, two distinct objects may unfortunately map to the same IRD file on disk. To address this, a case-sensitive hash is appended at the end of the filename.

• Many libraries have a syntax that looks right once rendered to HTML while not following proper RST syntax, or a syntax that relies on specifics of Docutils' and Sphinx's rendering and parsing.

• Many custom directive plugins cannot be reused from Sphinx; these will need to be reimplemented.
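To make the naming problem concrete, here is a small self-contained sketch of ours (mylib is a made-up package, built in memory for the example) showing how introspection yields a fully qualified name that differs from the canonical import path, and why a mapping between the two is needed. The : separator mirrors the convention mentioned above.

import types

# Hypothetical layout: mylib/_impl.py defines `solve`,
# and mylib/__init__.py re-exports it.
_impl = types.ModuleType("mylib._impl")
exec("def solve(x):\n    return x", _impl.__dict__)
mylib = types.ModuleType("mylib")
mylib.solve = _impl.solve

obj = mylib.solve
fully_qualified = f"{obj.__module__}:{obj.__qualname__}"
canonical = "mylib:solve"  # what users actually import and type

print(fully_qualified)  # "mylib._impl:solve", not what users see
print(canonical)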
Future possibilities

Beyond what has been presented in this paper, there are several opportunities to improve and extend what Papyri can offer the scientific Python ecosystem.

The first area is the ability to build IRD bundles on continuous integration platforms. Services like GitHub Actions, Azure Pipelines and many others are already set up to test packages; we hope to leverage this infrastructure to build IRD files and make them available to users.

A second area is the hosting of intermediate IRD files. While the current prototype is hosted as an HTTP index using GitHub Pages, this is likely not a sustainable hosting platform, as disk space is limited. To our knowledge, IRD files are smaller than HTML documentation, and we hope that other platforms like Read the Docs can be leveraged. This could provide a single domain that renders the documentation for multiple libraries, avoiding many per-library subdomains and giving users a more unified experience.

It should also become possible for projects to avoid the many dynamic docstring interpolations that are used to document *args and **kwargs. This would make sources easier to read, and potentially speed up library import time. Once a library (and its users) rely on an IDE that supports Papyri for documentation, the docstring syntax could even be exchanged for Markdown.

As IRD files are structured, it should be feasible to provide cross-version information in the documentation. For example, if one installs multiple versions of the IRD bundles for a library and is not using the latest version, the renderer could inspect IRD files from previous and future versions to indicate the range of versions for which the documentation has not changed. With additional effort, it should be possible to infer when a parameter was removed or will be removed, or to simply display the difference between two versions.

Conclusion

To address some of the current limitations in documentation accessibility, building and maintenance, we have provided a new documentation framework called Papyri. We presented its features and underlying implementation choices (such as crosslink maintenance, decoupling the building and rendering phases, enriching the rendering features, and using the IRD format to create a unified syntax structure). While the project is still at an early stage, clear impacts can already be seen on the availability of high-quality documentation for end-users, and on the workload reduction for maintainers. Building the IRD format has opened a wide range of technical possibilities and contributes to improving users' experience (and therefore the success of the scientific Python ecosystem). This may become necessary for users to navigate an exponentially growing ecosystem.

Acknowledgments

The authors want to thank S. Gallegos (author of tree-sitter-rst), J. L. Cano Rodríguez and E. Holscher (Read the Docs), C. Holdgraf (2i2c), B. Granger and F. Pérez (Jupyter Project), and T. Allard and I. Presedo-Floyd (QuanSight) for their useful feedback and help on this project.

Funding

M. B. received a 2-year grant from the Chan Zuckerberg Initiative (CZI) Essential Open Source Software for Science (EOSS), EOSS4-0000000017, via the NumFOCUS 501(c)(3) non-profit, to develop the Papyri project.

REFERENCES
[AOT] https://en.wikipedia.org/wiki/Ahead-of-time_compilation
[CFRG] conda-forge community. The conda-forge Project: Community-based Software Distribution Built on the conda Package Format and Ecosystem. Zenodo, 2015. http://doi.org/10.5281/zenodo.4774216
[CODEMETA] https://codemeta.github.io/
[D3js] https://d3js.org/
[DOCREPR] https://github.com/spyder-ide/docrepr
[DT] https://diataxis.fr/
[Dask] Dask Development Team (2016). Dask: Library for dynamic task scheduling. https://dask.org
[IR] https://en.wikipedia.org/wiki/Intermediate_representation
[JEDI] https://github.com/davidhalter/jedi
[JIT] https://en.wikipedia.org/wiki/Just-in-time_compilation
[JPYBOOK] https://jupyterbook.org/
[Jinja2] https://jinja.palletsprojects.com/
[LTO] https://en.wikipedia.org/wiki/Interprocedural_optimization
[MPL-DOI] https://doi.org/10.5281/zenodo.6513224
[MPL] J. D. Hunter. "Matplotlib: A 2D Graphics Environment". Computing in Science & Engineering, 9(3):90-95, 2007.
[MYST] https://myst-parser.readthedocs.io/en/latest/
[NPDOC] https://numpydoc.readthedocs.io/en/latest/format.html
[NP] C. R. Harris, K. J. Millman, S. J. van der Walt, et al. Array programming with NumPy. Nature 585, 357-362, 2020. doi:10.1038/s41586-020-2649-2
[NX] A. A. Hagberg, D. A. Schult, and P. J. Swart. "Exploring network structure, dynamics, and function using NetworkX". Proceedings of the 7th Python in Science Conference (SciPy 2008), pp. 11-15, Pasadena, CA, USA, 2008.
[Papyri] https://github.com/jupyter/papyri
[RTD-theme] https://sphinx-rtd-theme.readthedocs.io/en/stable/
[RTD] https://readthedocs.org/
[SCU] https://en.wikipedia.org/wiki/Single_Compilation_Unit
[SP] P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3):261-272, 2020. doi:10.1038/s41592-019-0686-2
[Spyder] https://www.spyder-ide.org/
[TSRST] https://github.com/stsewd/tree-sitter-rst
[TS] https://tree-sitter.github.io/tree-sitter/
[astropy] The Astropy Project: Building an inclusive, open-science project and status of the v2.0 core package. https://doi.org/10.48550/arXiv.1801.02634
[docutils] https://docutils.sourceforge.io/
[flask] https://flask.palletsprojects.com/en/2.1.x/
[httpx] https://www.python-httpx.org/
[mkdocs] https://www.mkdocs.org/
[msgspec] https://pypi.org/project/msgspec
[pandoc] https://pandoc.org/
[pydantic] https://pydantic-docs.helpmanual.io/
[pydata-sphinx-theme] https://pydata-sphinx-theme.readthedocs.io/en/stable/
[quart] https://pgjones.gitlab.io/quart/
[sphinx-copybutton] https://sphinx-copybutton.readthedocs.io/en/latest/
[sphinx] https://www.sphinx-doc.org/en/master/
[Trio] https://trio.readthedocs.io/

Bayesian Estimation and Forecasting of Time Series in statsmodels

Chad Fulton (Federal Reserve Board of Governors). Corresponding author: chad.t.fulton@frb.gov. Copyright © 2022 Chad Fulton. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Statsmodels, a Python library for statistical and econometric analysis, has traditionally focused on frequentist inference, including in its models for time series data. This paper introduces the powerful features for Bayesian inference of time series models that exist in statsmodels, with applications to model fitting, forecasting, time series decomposition, data simulation, and impulse response functions.

Index Terms—time series, forecasting, Bayesian inference, Markov chain Monte Carlo, statsmodels

Introduction

Statsmodels [SP10] is a well-established Python library for statistical and econometric analysis, with support for a wide range of important model classes, including linear regression, ANOVA, generalized linear models (GLM), generalized additive models (GAM), mixed effects models, and time series models, among many others. In most cases, model fitting proceeds by using frequentist inference, such as maximum likelihood estimation (MLE). In this paper, we focus on the class of time series models [MPS11], support for which has grown substantially in statsmodels over the last decade. After introducing several of the most important new model classes (which are by default fitted using MLE) and their features (which include forecasting, time series decomposition and seasonal adjustment, data simulation, and impulse response analysis), we describe the powerful functions that enable users to apply Bayesian methods to a wide range of time series models.

Support for Bayesian inference in Python outside of statsmodels has also grown tremendously, particularly in the realm of probabilistic programming, and includes powerful libraries such as PyMC3 [SWF16], PyStan [CGH+17], and TensorFlow Probability [DLT+17]. Meanwhile, ArviZ [KCHM19] provides many excellent tools for the associated diagnostics and visualisations. The aim of these libraries is to provide support for Bayesian analysis of a large class of models, and they make available both advanced techniques, including auto-tuning algorithms, and flexible model specification. By contrast, here we focus on simpler techniques. However, while the libraries above do include some support for time series models, this has not been their primary focus. As a result, introducing Bayesian inference for the well-developed stable of time series models in statsmodels, and providing access to the rich associated feature set already mentioned, presents a complementary option to these more general-purpose libraries.1

1. In addition, it is possible to combine the sampling algorithms of PyMC3 with the time series models of statsmodels, although we will not discuss this approach in detail here. See, for example, https://www.statsmodels.org/v0.13.0/examples/notebooks/generated/statespace_sarimax_pymc3.html.

Time series analysis in statsmodels

A time series is a sequence of observations ordered in time, and time series data appear commonly in statistics, economics, finance, climate science, control systems, and signal processing, among many other fields. One distinguishing characteristic of many time series is that observations that are close in time tend to be more correlated, a feature known as autocorrelation. While successful analyses of time series data must account for this, statistical models can also harness it to decompose a time series into trend, seasonal, and cyclical components, produce forecasts of future data, and study the propagation of shocks over time.

We now briefly review the models for time series data that are available in statsmodels and describe their features.2

2. In addition to statistical models, statsmodels also provides a number of tools for exploratory data analysis, diagnostics, and hypothesis testing related to time series data; see https://www.statsmodels.org/stable/tsa.html.

Exponential smoothing models

Exponential smoothing models are constructed by combining one or more simple equations that each describe some aspect of the evolution of univariate time series data. While originally somewhat ad hoc, these models can be defined in terms of a proper statistical model (for example, see [HKOS08]). They have enjoyed considerable popularity in forecasting (for example, see the implementation in R described by [HA18]). A prototypical example that allows for trending data and a seasonal component, often known as the additive "Holt-Winters' method", can be written as

l_t = α (y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1})
b_t = β (l_t − l_{t−1}) + (1 − β) b_{t−1}
s_t = γ (y_t − l_{t−1} − b_{t−1}) + (1 − γ) s_{t−m}

where l_t is the level of the series, b_t is the trend, s_t is the seasonal component of period m, and α, β, γ are parameters of the model. When augmented with an error term following some given probability distribution (usually Gaussian), likelihood-based inference can be used to estimate the parameters.
In statsmodels, additive exponential smoothing models can be constructed using the statespace.ExponentialSmoothing class.3 The following code shows how to apply the additive Holt-Winters model above to quarterly data on consumer prices:

import numpy as np
import statsmodels.api as sm

# Load data
mdata = sm.datasets.macrodata.load().data
# Compute annualized consumer price inflation
y = np.log(mdata['cpi']).diff().iloc[1:] * 400

# Construct the Holt-Winters model
model_hw = sm.tsa.statespace.ExponentialSmoothing(
    y, trend=True, seasonal=12)

3. A second class, ETSModel, can also be used for both additive and multiplicative models, and can exhibit superior performance with maximum likelihood estimation. However, it lacks some of the features relevant for Bayesian inference discussed in this paper.

Structural time series models

Structural time series models, introduced by [Har90] and sometimes known as unobserved components models, similarly decompose a univariate time series into trend, seasonal, cyclical, and irregular components:

y_t = µ_t + γ_t + c_t + ε_t

where µ_t is the trend, γ_t is the seasonal component, c_t is the cyclical component, and ε_t ∼ N(0, σ²) is the error term. However, this equation can be augmented in many ways, for example to include explanatory variables or an autoregressive component. In addition, there are many possible specifications for the trend, seasonal, and cyclical components, so that a wide variety of time series characteristics can be accommodated. In statsmodels, these models can be constructed from the UnobservedComponents class; a few examples are given in the following code:

# "Local level" model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# "Local linear trend", with seasonal component
model_llts = sm.tsa.UnobservedComponents(
    y, 'lltrend', seasonal=4)

These models have become popular for time series analysis and forecasting, as they are flexible and the estimated components are intuitive. Indeed, Google's CausalImpact library [BGK+15] uses a Bayesian structural time series approach directly, and Facebook's Prophet library [TL17] uses a conceptually similar framework that is estimated using PyStan.

Autoregressive moving-average models

Autoregressive moving-average (ARMA) models, ubiquitous in time series applications, are well supported in statsmodels, including their generalizations, abbreviated as "SARIMAX", that allow for integrated time series data, explanatory variables, and seasonal effects.4 A general version of this model, excluding integration, can be written as

y_t = x_t β + ξ_t
ξ_t = φ_1 ξ_{t−1} + · · · + φ_p ξ_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}

where ε_t ∼ N(0, σ²). These are constructed in statsmodels with the ARIMA class; the following code shows how to construct a variety of autoregressive moving-average models for the consumer price data:

# AR(2) model
model_ar2 = sm.tsa.ARIMA(y, order=(2, 0, 0))

# ARMA(1, 1) model with explanatory variable
X = mdata['realint']
model_arma11 = sm.tsa.ARIMA(
    y, order=(1, 0, 1), exog=X)

# SARIMAX(p, d, q)x(P, D, Q, s) model
model_sarimax = sm.tsa.ARIMA(
    y, order=(p, d, q), seasonal_order=(P, D, Q, s))

While this class of models often produces highly competitive forecasts, it does not produce a decomposition of a time series into, for example, trend and seasonal components.

4. Note that in statsmodels, models with explanatory variables take the form of "regression with SARIMA errors".

Vector autoregressive models

While the SARIMAX models above handle univariate series, statsmodels also has support for the multivariate generalization to vector autoregressive (VAR) models.5 These models are written

y_t = ν + Φ_1 y_{t−1} + · · · + Φ_p y_{t−p} + ε_t

where y_t is now considered an m × 1 vector. As a result, the intercept ν is also an m × 1 vector, the coefficients Φ_i are each m × m matrices, and the error term is ε_t ∼ N(0_m, Ω), with Ω an m × m matrix. These models can be constructed in statsmodels using the VARMAX class, as follows:6

# Multivariate dataset
z = (np.log(mdata[['realgdp', 'realcons', 'cpi']])
     .diff().iloc[1:])

# VAR(1) model
model_var = sm.tsa.VARMAX(z, order=(1, 0))

5. statsmodels also supports vector moving-average (VMA) models using the same model class as described here for the VAR case but, for brevity, we do not explicitly discuss them here.
6. A second class, VAR, can also be used to fit VAR models, using least squares. However, it lacks some of the features relevant for Bayesian inference discussed in this paper.

Dynamic factor models

statsmodels also supports a second class of multivariate time series models: the dynamic factor model (DFM). These models, often used for dimension reduction, posit a few unobserved factors, with autoregressive dynamics, that are used to explain the variation in the observed dataset. In statsmodels, there are two model classes, DynamicFactor and DynamicFactorMQ, that can fit versions of the DFM. Here we focus on the DynamicFactor class, for which the model can be written

y_t = Λ f_t + ε_t
f_t = Φ_1 f_{t−1} + · · · + Φ_p f_{t−p} + η_t

Here again, the observation y_t is m × 1, but the factors f_t are k × 1, where it is possible that k << m. As before, we assume conformable coefficient matrices and Gaussian errors. The following code shows how to construct a DFM in statsmodels:

# DFM with 2 factors that evolve as a VAR(3)
model_dfm = sm.tsa.DynamicFactor(
    z, k_factors=2, factor_order=3)

Linear Gaussian state space models

In statsmodels, each of the model classes introduced above (statespace.ExponentialSmoothing, UnobservedComponents, ARIMA, VARMAX, DynamicFactor, and DynamicFactorMQ) is implemented as part of a broader class of models, referred to as linear Gaussian state space models (hereafter, for brevity, simply "state space models" or SSM). This class of models can be written as

y_t = d_t + Z_t α_t + ε_t,      ε_t ∼ N(0, H_t)
α_{t+1} = c_t + T_t α_t + R_t η_t,  η_t ∼ N(0, Q_t)

where α_t represents an unobserved vector containing the "state" of the dynamic system. In general, the model is multivariate, with y_t and ε_t m × 1 vectors, α_t k × 1, and η_t r × 1.

Fig. 1: Selected functionality of state space models in statsmodels.

Powerful tools exist for state space models to estimate the values of the unobserved state vector, compute the value of the likelihood function for frequentist inference, and perform posterior sampling for Bayesian inference. These tools include the celebrated Kalman filter and smoother and a simulation smoother, all of which are important for conducting Bayesian inference for these models.7 The implementation in statsmodels largely follows the treatment in [DK12], and is described in more detail in [Ful15].

7. Statsmodels currently contains two implementations of simulation smoothers for the linear Gaussian state space model. The default is the "mean correction" simulation smoother of [DK02]. The precision-based simulation smoother of [CJ09] can alternatively be used by specifying method='cfa' when creating the simulation smoother object.

In addition to these key tools, state space models also admit general implementations of useful features such as forecasting, data simulation, time series decomposition, and impulse response analysis. As a consequence, each of these features extends to each of the time series models described above. Figure 1 presents a diagram showing how to produce these features, and the code below briefly introduces a subset of them.

# Construct the model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# Construct a simulation smoother
sim_ll = model_ll.simulation_smoother()

# Parameter values (variance of error and
# variance of level innovation, respectively)
params = [4, 0.75]

# Compute the log-likelihood of these parameters
llf = model_ll.loglike(params)

# `smooth` applies the Kalman filter and smoother
# with a given set of parameters and returns a
# Results object
results_ll = model_ll.smooth(params)

# Produce forecasts for the next 4 periods
fcast = results_ll.forecast(4)

# Produce a draw from the posterior distribution
# of the state vector
sim_ll.simulate()
draw = sim_ll.simulated_state

Nearly identical code could be used for any of the model classes introduced above, since they are all implemented as part of the same state space model framework. In the next section, we show how these features can be used to perform Bayesian inference with these models.

Bayesian inference via Markov chain Monte Carlo

We begin by giving a cursory overview of the key elements of Bayesian inference required for our purposes here.8 In brief, the Bayesian approach stems from Bayes' theorem, in which the posterior distribution for an object of interest is derived as proportional to the combination of a prior distribution and the likelihood function:

p(A | B) ∝ p(B | A) × p(A)
(posterior ∝ likelihood × prior)

Here, we will be interested in the posterior distribution of the parameters of our model and of the unobserved states, conditional on the chosen model specification and the observed time series data. While in most cases the form of the posterior cannot be derived analytically, simulation-based methods such as Markov chain Monte Carlo (MCMC) can be used to draw samples that approximate the posterior distribution nonetheless. While PyMC3, PyStan, and TensorFlow Probability emphasize Hamiltonian Monte Carlo (HMC) and no-U-turn sampling (NUTS) MCMC methods, we focus on the simpler random walk Metropolis-Hastings (MH) and Gibbs sampling (GS) methods. These are standard MCMC methods that have enjoyed great success in time series applications and which are simple to implement, given the state space framework already available in statsmodels. In addition, the ArviZ library is designed to work with MCMC output from any source, and we can easily adapt it to our use.

8. While a detailed description of these issues is outside the scope of this paper, there are many superb references on this topic. We refer the interested reader to [WH99], which provides a book-length treatment of Bayesian inference for state space models, and [KN99], which provides many examples and applications.

With either Metropolis-Hastings or Gibbs sampling, our procedure will produce a chain of sample values (of the parameters and/or the unobserved state vector) that approximate draws from the posterior distribution arbitrarily well as the length of the chain of samples becomes very large.
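To make the prior-likelihood-posterior mechanics concrete before introducing the samplers, here is a small sketch of ours (not from the paper): for Gaussian data with known mean and an inverse-Gamma prior on the variance, the posterior is available in closed form, and this is exactly the conditional update that the Gibbs sampler shown below exploits.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=200)  # true variance = 4

# Inverse-Gamma(a0, b0) prior on the variance (weakly informative choice)
a0, b0 = 0.01, 0.01

# Conjugate update: posterior is Inverse-Gamma(a0 + n/2, b0 + sum(x^2)/2)
a_post = a0 + data.size / 2
b_post = b0 + np.sum(data**2) / 2
posterior = stats.invgamma(a_post, scale=b_post)

print(posterior.mean())         # close to the true variance of 4
print(posterior.interval(0.9))  # 90% credible interval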
Random walk Metropolis-Hastings

In random walk Metropolis-Hastings (MH), we begin with an arbitrary point as the initial sample, and then iteratively construct new samples in the chain as follows. At each iteration, (a) construct a proposal by perturbing the previous sample by a Gaussian random variable, and then (b) accept the proposal with some probability. If a proposal is accepted, it becomes the next sample in the chain, while if it is rejected the previous sample value is carried over. Here, we show how to implement Metropolis-Hastings estimation of the variance parameter in a simple model, which only requires the log-likelihood computation introduced above.

import arviz as az
from scipy import stats

# Construct the model
model_rw = sm.tsa.UnobservedComponents(y, 'rwalk')

# Specify the prior distribution. With MH, this
# can be freely chosen by the user
prior = stats.uniform(0.0001, 100)

# Specify the Gaussian perturbation distribution
perturb = stats.norm(scale=0.1)

# Storage
niter = 100000
samples_rw = np.zeros(niter + 1)

# Initialization
samples_rw[0] = y.diff().var()
llf = model_rw.loglike(samples_rw[0])
prior_llf = prior.logpdf(samples_rw[0])

# Iterations
for i in range(1, niter + 1):
    # Compute the proposal value
    proposal = samples_rw[i - 1] + perturb.rvs()

    # Compute the acceptance probability
    proposal_llf = model_rw.loglike(proposal)
    proposal_prior_llf = prior.logpdf(proposal)
    accept_prob = np.exp(
        proposal_llf - llf
        + proposal_prior_llf - prior_llf)

    # Accept or reject the value
    if accept_prob > stats.uniform.rvs():
        samples_rw[i] = proposal
        llf = proposal_llf
        prior_llf = proposal_prior_llf
    else:
        samples_rw[i] = samples_rw[i - 1]

# Convert for use with ArviZ and plot posterior
samples_rw = az.convert_to_inference_data(
    samples_rw)
# Eliminate the first 10000 samples as burn-in;
# thin by a factor of 10 to reduce autocorrelation
az.plot_posterior(samples_rw.posterior.sel(
    {'draw': np.s_[10000::10]}), kind='bin',
    point_estimate='median')

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 2.

Fig. 2: Approximate posterior distribution of the variance parameter, random walk model, Metropolis-Hastings; U.S. Industrial Production.

Gibbs sampling

Gibbs sampling (GS) is a special case of Metropolis-Hastings (MH) that is applicable when it is possible to produce draws directly from the conditional distributions of every variable, even though it is still not possible to derive the general form of the joint posterior. While this approach can be superior to random walk MH when it is applicable, the ability to derive the conditional distributions typically requires the use of a "conjugate" prior, i.e., a prior from some specific family of distributions. For example, above we specified a uniform distribution as the prior when sampling via MH, but that is not possible with Gibbs sampling. Here, we show how to implement Gibbs sampling estimation of the variance parameters, now making use of an inverse-Gamma prior and the simulation smoother introduced above.

# Construct the model and simulation smoother
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
sim_ll = model_ll.simulation_smoother()

# Specify the prior distributions. With GS, we must
# choose an inverse Gamma prior for each variance
priors = [stats.invgamma(0.01, scale=0.01)] * 2

# Storage
niter = 100000
samples_ll = np.zeros((niter + 1, 2))

# Initialization
samples_ll[0] = [y.diff().var(), 1e-5]

# Iterations
for i in range(1, niter + 1):
    # (a) Update the model parameters
    model_ll.update(samples_ll[i - 1])

    # (b) Draw from the conditional posterior of
    # the state vector
    sim_ll.simulate()
    sample_state = sim_ll.simulated_state.T

    # (c) Compute / draw from conditional posterior
    # of the parameters:
    # ...observation error variance
    resid = y - sample_state[:, 0]
    post_shape = len(resid) / 2 + 0.01
    post_scale = np.sum(resid**2) / 2 + 0.01
    samples_ll[i, 0] = stats.invgamma(
        post_shape, scale=post_scale).rvs()

    # ...level error variance
    resid = sample_state[1:] - sample_state[:-1]
    post_shape = len(resid) / 2 + 0.01
    post_scale = np.sum(resid**2) / 2 + 0.01
    samples_ll[i, 1] = stats.invgamma(
        post_shape, scale=post_scale).rvs()

# Convert for use with ArviZ and plot posterior
samples_ll = az.convert_to_inference_data(
    {'parameters': samples_ll[None, ...]},
    coords={'parameter': model_ll.param_names},
    dims={'parameters': ['parameter']})
az.plot_pair(samples_ll.posterior.sel(
    {'draw': np.s_[10000::10]}), kind='hexbin')

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 3.

Fig. 3: Approximate joint posterior distribution of the variance parameters, local level model, Gibbs sampling; CPI inflation.

Illustrative examples

For clarity and brevity, the examples in the previous section gave results for simple cases. However, these basic methods carry through to each of the models introduced earlier, including in cases with multivariate data and hundreds of parameters. Moreover, the Metropolis-Hastings approach can be combined with the Gibbs sampling approach, so that an end user who wishes to use Gibbs sampling for some parameters is not restricted to choosing conjugate priors for all parameters.

In addition to sampling the posterior distributions of the parameters, this method allows sampling other objects of interest, including forecasts of observed variables, impulse response functions, and the unobserved state vector. This last possibility is especially useful in cases such as the structural time series model, in which the unobserved states correspond to interpretable elements such as the trend and seasonal components. We provide several illustrative examples of the various types of analysis that are possible.

Forecasting and Time Series Decomposition

In our first example, we apply the Gibbs sampling approach to a structural time series model in order to forecast U.S. Industrial Production and to produce a decomposition of the series into level, trend, and seasonal components. The model is

y_t = µ_t + γ_t + ε_t        (observation equation)
µ_t = β_t + µ_{t−1} + ζ_t    (level)
β_t = β_{t−1} + ξ_t          (trend)
γ_t = γ_{t−s} + η_t          (seasonal)

Here, we set the seasonal periodicity to s = 12, since Industrial Production is a monthly variable. We can construct this model in statsmodels as9

model = sm.tsa.UnobservedComponents(
    y, 'lltrend', seasonal=12)

9. This model is often referred to as a "local linear trend" model (with, additionally, a seasonal component); lltrend is an abbreviation of this name.

To produce the time series decomposition into level, trend, and seasonal components, we will use samples from the posterior of the state vector (µ_t, β_t, γ_t) for each time period t. These are immediately available when using the Gibbs sampling approach; in the earlier example, the draw at each iteration was assigned to the variable sample_state. To produce forecasts, we need to draw from the posterior predictive distribution for horizons h = 1, 2, ..., H. This can easily be accomplished by using the simulate method introduced earlier. To be concrete, we can accomplish these tasks by modifying section (b) of our Gibbs sampler iterations as follows:

# (b') Draw from the conditional posterior of
# the state vector
model.update(params[i - 1])
sim.simulate()
# save the draw for use later in time series
# decomposition
states[i] = sim.simulated_state.T

# Draw from the posterior predictive distribution
# using the `simulate` method
n_fcast = 48
fcast[i] = model.simulate(
    params[i - 1], n_fcast,
    initial_state=states[i, -1]).to_frame()

These forecasts and the decomposition into level, trend, and seasonal components are summarized in Figures 4 and 5, which show the median values along with 80% credible intervals. Notably, the intervals shown incorporate both the uncertainty arising from the stochastic terms in the model and the need to estimate the models' parameters.10

Fig. 4: Data and forecast with 80% credible interval; U.S. Industrial Production.

Fig. 5: Estimated level, trend, and seasonal components, with 80% credible interval; U.S. Industrial Production.

10. The popular Prophet library [TL17] similarly uses an additive model combined with Bayesian sampling methods to produce forecasts and decompositions, although its underlying model is a GAM rather than a state space model.

Causal impacts

A closely related procedure described in [BGK+15] uses a Bayesian structural time series model to estimate the "causal impact" of some event on an observed variable. This approach stops estimation of the model just before the date of an event and produces a forecast by drawing from the posterior predictive density, using the procedure described just above. It then uses the difference between the actual path of the data and the forecast to estimate the impact of the event.

An example of this approach is shown in Figure 6, in which we use this method to illustrate the effect of the COVID-19 pandemic on U.S. Sales in Manufacturing and Trade Industries.11

Fig. 6: "Causal impact" of COVID-19 on U.S. Sales in Manufacturing and Trade Industries.

11. In this example, we used a local linear trend model with no seasonal component.

Extensions

There are many extensions to the time series models presented here that are made possible when using Bayesian inference. First, it is easy to create custom state space models within the statsmodels framework. As one example, the statsmodels documentation describes how to create a model that extends the typical VAR described above with time-varying parameters.12 These custom state space models automatically inherit all the functionality described above, so that Bayesian inference can be conducted in exactly the same way.

Second, because the general state space model available in statsmodels and introduced above allows for time-varying system matrices, it is possible using Gibbs sampling methods to introduce support for automatic outlier handling, stochastic volatility, and regime switching models, even though these are largely infeasible in statsmodels when using frequentist methods such as maximum likelihood estimation.13

12. For details, see https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html.
13. See, for example, [SW16] for an application of these techniques that handles outliers, [KSC98] for stochastic volatility, and [KN98] for an application to dynamic factor models with regime switching.

Conclusion

This paper introduces the suite of time series models available in statsmodels and shows how Bayesian inference using Markov chain Monte Carlo methods can be applied to estimate their parameters and produce analyses of interest, including time series decompositions and forecasts.

REFERENCES
[BGK+15] Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9:247-274, 2015. doi:10.1214/14-aoas788.
[CGH+17] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1), 2017. doi:10.18637/jss.v076.i01.
[CJ09] Joshua C. C. Chan and Ivan Jeliazkov. Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1-2):101-120, 2009. https://www.inderscienceonline.com/doi/abs/10.1504/IJMMNO.2009.03009.
[DK02] J. Durbin and S. J. Koopman. A simple and efficient simulation smoother for state space time series analysis. Biometrika, 89(3):603-616, 2002. doi:10.1093/biomet/89.3.603.
[DK12] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford University Press, 2012.
[DLT+17] Joshua V. Dillon, Ian Langmore, Dustin Tran, et al. TensorFlow Distributions. arXiv:1711.10604, 2017. doi:10.48550/arXiv.1711.10604.
[Ful15] Chad Fulton. Estimating time series models by state space methods in Python: statsmodels. 2015.
[HA18] Rob J. Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
[Har90] Andrew C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990.
[HKOS08] Rob Hyndman, Anne B. Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, 2008.
[KCHM19] Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo Martin. ArviZ: a unified library for exploratory analysis of Bayesian models in Python. Journal of Open Source Software, 4(33):1143, 2019. doi:10.21105/joss.01143.
[KN98] Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning Points, A New Coincident Index, and Tests of Duration Dependence Based on a Dynamic Factor Model With Regime Switching. The Review of Economics and Statistics, 80(2):188-201, 1998. doi:10.1162/003465398557447.
[KN99] Chang-Jin Kim and Charles R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. MIT Press, 1999.
[KSC98] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. The Review of Economic Studies, 65(3):361-393, 1998. doi:10.1111/1467-937X.00050.
[MPS11] Wes McKinney, Josef Perktold, and Skipper Seabold. Time Series Analysis in Python with statsmodels. Proceedings of the 10th Python in Science Conference, pp. 107-113, 2011. doi:10.25080/Majora-ebaa42b7-012.
[SP10] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference, pp. 92-96, 2010. doi:10.25080/Majora-92bf1922-011.
[SW16] James H. Stock and Mark W. Watson. Core Inflation and Trend Inflation. Review of Economics and Statistics, 98(4):770-784, 2016. doi:10.1162/REST_a_00608.
[SWF16] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016. doi:10.7717/peerj-cs.55.
[TL17] Sean J. Taylor and Benjamin Letham. Forecasting at scale. PeerJ Preprints e3190v2, 2017. doi:10.7287/peerj.preprints.3190v2.
[WH99] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer, New York, 2nd edition, 1999.
Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, June 2008. Google-Books-ID: GSyzox8Lu9YC. [KCHM19] Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo Mar- tin. ArviZ a unified library for exploratory analysis of Bayesian models in Python. Journal of Open Source Software, 4(33):1143, 2019. Publisher: The Open Journal. URL: https://doi.org/10. 21105/joss.01143, doi:10.21105/joss.01143. [KN98] Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning Points, A New Coincident Index, and Tests of Duration Depen- dence Based on a Dynamic Factor Model With Regime Switch- ing. The Review of Economics and Statistics, 80(2):188–201, May 1998. Publisher: MIT Press. URL: https://doi.org/10.1162/ 003465398557447, doi:10.1162/003465398557447. [KN99] Chang-Jin Kim and Charles R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. MIT Press Books, The MIT Press, 1999. URL: http://ideas.repec.org/b/mtp/titles/0262112388.html. [KSC98] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. The Review of Economic Studies, 65(3):361–393, July 1998. 01855. URL: http://restud.oxfordjournals.org/content/65/ 3/361, doi:10.1111/1467-937X.00050. [MPS11] Wes McKinney, Josef Perktold, and Skipper Seabold. Time Series Analysis in Python with statsmodels. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 10th Python in Science Conference, pages 107 – 113, 2011. doi:10.25080/ Majora-ebaa42b7-012. [SP10] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and Statistical Modeling with Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 92 – 96, 2010. doi:10.25080/Majora- 92bf1922-011. [SW16] James H. Stock and Mark W. Watson. Core Inflation and Trend Inflation. Review of Economics and Statistics, 98(4):770–784, March 2016. 00000. URL: http://dx.doi.org/10.1162/REST_a_ 00608, doi:10.1162/REST_a_00608. 90 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Python vs. the pandemic: a case study in high-stakes software development Cliff C. Kerr‡§∗ , Robyn M. Stuart¶k , Dina Mistry∗∗ , Romesh G. Abeysuriyak , Jamie A. Cohen‡ , Lauren George†† , Michał Jastrzebski‡‡ , Michael Famulare‡ , Edward Wenger‡ , Daniel J. Klein‡ F Abstract—When it became clear in early 2020 that COVID-19 was going to modeling, and drug discovery made it well placed to contribute to be a major public health threat, politicians and public health officials turned to a global pandemic response plan. Founded in 2008, the Institute academic disease modelers like us for urgent guidance. Academic software for Disease Modeling (IDM) has provided analytical support for development is typically a slow and haphazard process, and we realized that BMGF (which it has been a part of since 2020) and other global business-as-usual would not suffice for dealing with this crisis. Here we describe health partners, with a focus on eradicating malaria and polio. the case study of how we built Covasim (covasim.org), an agent-based model of COVID-19 epidemiology and public health interventions, by using standard Since its creation, IDM has built up a portfolio of computational Python libraries like NumPy and Numba, along with less common ones like tools to understand, analyze, and predict the dynamics of different Sciris (sciris.org). 
Covasim was created in a few weeks, an order of magnitude diseases. faster than the typical model development process, and achieves performance When "coronavirus disease 2019" (COVID-19) and the virus comparable to C++ despite being written in pure Python. It has become one that causes it (SARS-CoV-2) were first identified in late 2019, of the most widely adopted COVID models, and is used by researchers and our team began summarizing what was known about the virus policymakers in dozens of countries. Covasim’s rapid development was enabled [Fam19]. By early February 2020, even though it was more than not only by leveraging the Python scientific computing ecosystem, but also by a month before the World Health Organization (WHO) declared adopting coding practices and workflows that lowered the barriers to entry for a pandemic [Med20], it had become clear that COVID-19 would scientific contributors without sacrificing either performance or rigor. become a major public health threat. The outbreak on the Diamond Index Terms—COVID-19, SARS-CoV-2, Epidemiology, Mathematical modeling, Princess cruise ship [RSWS20] was the impetus for us to start NumPy, Numba, Sciris modeling COVID in detail. Specifically, we needed a tool to (a) incorporate new data as soon as it became available, (b) explore policy scenarios, and (c) predict likely future epidemic trajectories. Background The first step was to identify which software tool would form For decades, scientists have been concerned about the possibility the best starting point for our new COVID model. Infectious of another global pandemic on the scale of the 1918 flu [Gar05]. disease models come in two major types: agent-based models track Despite a number of "close calls" – including SARS in 2002 the behavior of individual "people" (agents) in the simulation, [AFG+ 04]; Ebola in 2014-2016 [Tea14]; and flu outbreaks in- with each agent’s behavior represented by a random (probabilis- cluding 1957, 1968, and H1N1 in 2009 [SHK16], some of which tic) process. Compartmental models track populations of people led to 1 million or more deaths – the last time we experienced over time, typically using deterministic difference equations. The the emergence of a planetary-scale new pathogen was when HIV richest modeling framework used by IDM at the time was EMOD, spread globally in the 1980s [CHL+ 08]. which is a multi-disease agent-based model written in C++ and In 2015, Bill Gates gave a TED talk stating that the world was based on JSON configuration files [BGB+ 18]. We also considered not ready to deal with another pandemic [Hof20]. While the Bill Atomica, a multi-disease compartmental model written in Python & Melinda Gates Foundation (BMGF) has not historically focused and based on Excel input files [KAK+ 19]. However, both of on pandemic preparedness, its expertise in disease surveillance, these options posed significant challenges: as a compartmental model, Atomica would have been unable to capture the individual- * Corresponding author: cliff@covasim.org level detail necessary for modeling the Diamond Princess out- ‡ Institute for Disease Modeling, Bill & Melinda Gates Foundation, Seattle, break (such as passenger-crew interactions); EMOD had sufficient USA flexibility, but developing new disease modules had historically § School of Physics, University of Sydney, Sydney, Australia ¶ Department of Mathematical Sciences, University of Copenhagen, Copen- required months rather than days. 
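To make the distinction between the two model types concrete, the following toy sketch (not code from Covasim, EMOD, or Atomica; all names and parameter values are invented) contrasts a deterministic compartmental update with a stochastic per-agent update:

import numpy as np

def compartmental_step(S, I, R, beta=0.3, gamma=0.1):
    # Deterministic difference equations: only population totals are tracked
    N = S + I + R
    new_infections = beta * S * I / N
    new_recoveries = gamma * I
    return S - new_infections, I + new_infections - new_recoveries, R + new_recoveries

def agent_step(infected, rng, beta=0.3, gamma=0.1):
    # Stochastic per-agent updates: each agent is simulated explicitly with its own draws
    # (recovered agents can be reinfected in this toy version)
    prob_infection = beta * infected.mean()
    new_infections = ~infected & (rng.random(infected.size) < prob_infection)
    recoveries = infected & (rng.random(infected.size) < gamma)
    return (infected | new_infections) & ~recoveries

rng = np.random.default_rng(1)
agents = rng.random(10_000) < 0.01   # 1% of agents start out infected
for day in range(60):
    agents = agent_step(agents, rng)

The agent-based form is what makes individual-level detail, such as the passenger-crew structure on the Diamond Princess, possible, at the cost of tracking every agent explicitly.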
hagen, Denmark As a result, we instead started developing Covasim ("COVID- || Burnet Institute, Melbourne, Australia 19 Agent-based Simulator") [KSM+ 21] from a nascent agent- ** Twitter, Seattle, USA based model written in Python, LEMOD-FP ("Light-EMOD for †† Microsoft, Seattle, USA ‡‡ GitHub, San Francisco, USA Family Planning"). LEMOD-FP was used to model reproductive health choices of women in Senegal; this model had in turn Copyright © 2022 Cliff C. Kerr et al. This is an open-access article distributed been based on an even simpler agent-based model of measles under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the vaccination programs in Nigeria ("Value-of-Information Simula- original author and source are credited. tor" or VoISim). We subsequently applied the lessons we learned PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 91 scientific computing libraries. Software architecture and implementation Covasim conceptual design and usage Covasim is a standard susceptible-exposed-infectious-recovered (SEIR) model (Fig. 3). As noted above, it is an agent-based model, meaning that individual people and their interactions with one another are simulated explicitly (rather than implicitly, as in a compartmental model). The fundamental calculation that Covasim performs is to determine the probability that a given person, on a given time step, will change from one state to another, such as from susceptible to exposed (i.e., that person was infected), from undiagnosed to diagnosed, or from critically ill to dead. Covasim is fully open- source and available on GitHub (http://covasim.org) and PyPI (pip install covasim), and comes with comprehensive documentation, including tutorials (http://docs.covasim.org). The first principle of Covasim’s design philosophy is that "Common tasks should be simple" – for example, defining pa- rameters, running a simulation, and plotting results. The following example illustrates this principle; it creates a simulation with a custom parameter value, runs it, and plots the results: Fig. 1: Daily reported global COVID-19-related deaths (top; import covasim as cv smoothed with a one-week rolling window), relative to the timing of cv.Sim(pop_size=100e3).run().plot() known variants of concern (VOCs) and variants of interest (VOIs), as The second principle of Covasim’s design philosophy is "Un- well as Covasim releases (bottom). common tasks can’t always be simple, but they still should be possible." Examples include writing a custom goodness-of-fit from developing Covasim to turn LEMOD-FP into a new family function or defining a new population structure. To some extent, planning model, "FPsim", which will be launched later this year the second principle is at odds with the first, since the more [OVCC+ 22]. flexibility an interface has, typically the more complex it is as Parallel to the development of Covasim, other research teams well. at IDM developed their own COVID models, including one based To illustrate the tension between these two principles, the on the EMOD framework [SWC+ 22], and one based on an earlier following code shows how to run two simulations to determine the influenza model [COSF20]. However, while both of these models impact of a custom intervention aimed at protecting the elderly in saw use in academic contexts [KCP+ 20], neither were able to Japan, with results shown in Fig. 
4: incorporate new features quickly enough, or were easy enough to import covasim as cv use, for widespread external adoption in a policy context. # Define a custom intervention Covasim, by contrast, had immediate real-world impact. The def elderly(sim, old=70): first version was released on 10 March 2020, and on 12 March if sim.t == sim.day('2020-04-01'): elderly = sim.people.age > old 2020, its output was presented by Washington State Governor Jay sim.people.rel_sus[elderly] = 0.0 Inslee during a press conference as justification for school closures and social distancing measures [KMS+ 21]. # Set custom parameters Since the early days of the pandemic, Covasim releases have pars = dict( pop_type = 'hybrid', # More realistic population coincided with major events in the pandemic, especially the iden- location = 'japan', # Japan's population pyramid tification of new variants of concern (Fig. 1). Covasim was quickly pop_size = 50e3, # Have 50,000 people total adopted globally, including applications in the UK regarding pop_infected = 100, # 100 infected people n_days = 90, # Run for 90 days school closures [PGKS+ 20], Australia regarding outbreak control ) [SAK+ 21], and Vietnam regarding lockdown measures [PSN+ 21]. To date, Covasim has been downloaded from PyPI over # Run multiple sims in parallel and plot key results 100,000 times [PeP22], has been used in dozens of academic label = 'Protect the elderly' s1 = cv.Sim(pars, label='Default') studies [KMS+ 21], and informed decision-making on every con- s2 = cv.Sim(pars, interventions=elderly, label=label) tinent (Fig. 2), making it one of the most widely used COVID msim = cv.parallel(s1, s2) models [KSM+ 21]. We believe key elements of its success include msim.plot(['cum_deaths', 'cum_infections']) (a) the simplicity of its architecture; (b) its high performance, Similar design philosophies have been articulated by previously, enabled by the use of NumPy arrays and Numba decorators; such as for Grails [AJ09] among others1 . and (c) our emphasis on prioritizing usability, including flexible type handling and careful choices of default settings. In the 1. Other similar philosophical statements include "The manifesto of Mat- remainder of this paper, we outline these principles in more detail, plotlib is: simple and common tasks should be simple to perform; provide options for more complex tasks" (Data Processing Using Python) and "Simple, in the hope that these will provide a useful roadmap for other common tasks should be simple to perform; Options should be provided to groups wanting to quickly develop high-performance, easy-to-use enable more complex tasks" (Instrumental). 92 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Locations where Covasim has been used to help produce a paper, report, or policy recommendation. Fig. 3: Basic Covasim disease model. The blue arrow shows the process of reinfection. Fig. 4: Illustrative result of a simulation in Covasim focused on Simplifications using Sciris exploring an intervention for protecting the elderly. A key component of Covasim’s architecture is heavy reliance on Sciris (http://sciris.org) [KAH+ ng], a library of functions for running simulations in parallel. scientific computing that provide additional flexibility and ease- of-use on top of NumPy, SciPy, and Matplotlib, including paral- Array-based architecture lel computing, array operations, and high-performance container In a typical agent-based simulation, the outermost loop is over datatypes. 
time, while the inner loops iterate over different agents and agent As shown in Fig. 5, Sciris significantly reduces the number states. For a simulation like Covasim, with roughly 700 (daily) of lines of code required to perform common scientific tasks, timesteps to represent the first two years of the pandemic, tens allowing the user to focus on the code’s scientific logic rather than or hundreds of thousands of agents, and several dozen states, this the low-level implementation. Key Covasim features that rely on requires on the order of one billion update steps. Sciris include: ensuring consistent dictionary, list, and array types However, we can take advantage of the fact that each state (e.g., allowing the user to provide inputs as either lists or arrays); (such as agent age or their infection status) has the same data referencing ordered dictionary elements by index; handling and type, and thus we can avoid an explicit loop over agents by instead interconverting dates (e.g., allowing the user to provide either a representing agents as entries in NumPy vectors, and performing date string or a datetime object); saving and loading files; and operations on these vectors. These two architectures are shown in PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 93 Fig. 5: Comparison of functionally identical code implemented without Sciris (left) and with (right). In this example, tasks that together take 30 lines of code without Sciris can be accomplished in 7 lines with it. 94 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) for t in self.time_vec: for person in self.people: if person.alive: person.age_person() person.check_died() # Array-based agent simulation class People: def age_people(self, inds): self.age[inds] += 1 return def check_died(self, inds): rands = np.random.rand(len(inds)) died = rands < self.death_probs[inds]: self.alive[inds[died]] = False return Fig. 6: The standard object-oriented approach for implementing agent-based models (top), compared to the array-based approach class Sim: used in Covasim (bottom). def run(self): for t in self.time_vec: alive = sc.findinds(self.people.alive) self.people.age_people(inds=alive) self.people.check_died(inds=alive) Numba optimization Numba is a compiler that translates subsets of Python and NumPy into machine code [LPS15]. Each low-level numerical function was tested with and without Numba decoration; in some cases speed improvements were negligible, while in other cases they were considerable. For example, the following function is roughly 10 times faster with the Numba decorator than without: import numpy as np import numba as nb @nb.njit((nb.int32, nb.int32), cache=True) def choose_r(max_n, n): Fig. 7: Performance comparison for FPsim from an explicit loop- return np.random.choice(max_n, n, replace=True) based approach compared to an array-based approach, showing a factor of ~70 speed improvement for large population sizes. Since Covasim is stochastic, calculations rarely need to be exact; as a result, most numerical operations are performed as 32-bit operations. Fig. 6. Compared to the explicitly object-oriented implementation Together, these speed optimizations allow Covasim to run at of an agent-based model, the array-based version is 1-2 orders of roughly 5-10 million simulated person-days per second of CPU magnitude faster for population sizes larger than 10,000 agents. 
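The following self-contained sketch restates the array-based pattern of Fig. 6 (illustrative only, not Covasim's implementation; np.flatnonzero stands in for Sciris's sc.findinds, and the death probability is an arbitrary placeholder):

import numpy as np

class People:
    def __init__(self, n, death_prob=1e-4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.age = self.rng.uniform(0, 90, n)        # one NumPy array per agent state
        self.alive = np.ones(n, dtype=bool)
        self.death_probs = np.full(n, death_prob)

    def age_people(self, inds):
        self.age[inds] += 1                          # vectorized update, no per-agent loop
                                                     # (ages in timestep units, as in the simplified Fig. 6 example)

    def check_died(self, inds):
        died = self.rng.random(len(inds)) < self.death_probs[inds]
        self.alive[inds[died]] = False

people = People(100_000)
for t in range(365):                                 # daily timesteps
    alive = np.flatnonzero(people.alive)             # indices of agents still alive
    people.age_people(alive)
    people.check_died(alive)

Because each state lives in a contiguous array, the inner work is handed to NumPy (and, for the hot low-level kernels, to Numba) rather than to the Python interpreter.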
time – a speed comparable to agent-based models implemented The relative performance of these two approaches is shown in purely in C or C++ [HPN+ 21]. Practically, this means that most Fig. 7 for FPsim (which, like Covasim, was initially implemented users can run Covasim analyses on their laptops without needing using an object-oriented approach before being converted to an to use cloud-based or HPC computing resources. array-based approach). To illustrate the difference between object- based and array-based implementations, the following example Lessons for scientific software development shows how aging and death would be implemented in each: Accessible coding and design # Object-based agent simulation Since Covasim was designed to be used by scientists and health class Person: officials, not developers, we made a number of design decisions that preferenced accessibility to our audience over other principles def age_person(self): of good software design. self.age += 1 return First, Covasim is designed to have as flexible of user inputs as possible. For example, a date can be specified as an integer def check_died(self): number of days from the start of the simulation, as a string (e.g. rand = np.random.random() if rand < self.death_prob: '2020-04-04'), or as a datetime object. Similarly, numeric self.alive = False inputs that can have either one or multiple values (such as the return change in transmission rate following one or multiple lockdowns) can be provided as a scalar, list, or NumPy array. As long as the class Sim: input is unambiguous, we prioritized ease-of-use and simplicity def run(self): of the interface over rigorous type checking. Since Covasim is a PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 95 top-level library (i.e., it does not perform low-level functions as health background, through to public health experts with virtually part of other libraries), this prioritization has been welcomed by no prior experience in Python. Roughly 45% of Covasim con- its users. tributors had significant Python expertise, while 60% had public Second, "advanced" Python programming paradigms – such health experience; only about half a dozen contributors (<10%) as method and function decorators, lambda functions, multiple had significant experience in both areas. inheritance, and "dunder" methods – have been avoided where These half-dozen contributors formed a core group (including possible, even when they would otherwise be good coding prac- the authors of this paper) that oversaw overall Covasim develop- tice. This is because a relatively large fraction of Covasim users, ment. Using GitHub for both software and project management, including those with relatively limited Python backgrounds, need we created issues and assigned them to other contributors based to inspect and modify the source code. A Covasim user coming on urgency and skillset match. All pull requests were reviewed by from an R programming background, for example, may not have at least one person from this group, and often two, prior to merge. encountered the NumPy function intersect1d() before, but While the danger of accepting changes from contributors with they can quickly look it up and understand it as being equivalent limited Python experience is self-evident, considerable risks were to R’s intersect() function. In contrast, an R user who has also posed by contributors who lacked epidemiological insight. 
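Returning briefly to the flexible input handling described earlier in this section, the following hypothetical helper (invented for illustration; Covasim's own date handling is provided via Sciris and differs in detail) shows the kind of normalization involved in accepting a day as an integer offset, an ISO date string, or a datetime object:

import datetime as dt

def to_day(value, start_date=dt.date(2020, 3, 1)):
    # Normalize an int offset, a 'YYYY-MM-DD' string, or a date/datetime to a day index
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = dt.date.fromisoformat(value)
    if isinstance(value, dt.datetime):
        value = value.date()
    return (value - start_date).days

assert to_day(31) == 31
assert to_day('2020-04-01') == 31
assert to_day(dt.datetime(2020, 4, 1)) == 31

Accepting all three forms costs a few lines of normalization while keeping the public interface unambiguous.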
not encountered method decorators before is unlikely to be able to For example, some of the proposed tests were written based on look them up and understand their meaning (indeed, they may not assumptions that were true for a given time and place, but which even know what terms to search for). While Covasim indeed does were not valid for other geographical contexts. use each of the "advanced" methods listed above (e.g., the Numba One surprising outcome was that even though Covasim is decorators described above), they have been kept to a minimum largely a software project, after the initial phase of development and sequestered in particular files the user is less likely to interact (i.e., the first 4-8 weeks), we found that relatively few tasks could with. be assigned to the developers as opposed to the epidemiologists Third, testing for Covasim presented a major challenge. Given and infectious disease modelers on the project. We believe there that Covasim was being used to make decisions that affected tens are several reasons for this. First, epidemiologists tended to be of millions of people, even the smallest errors could have poten- much more aware of knowledge they were missing (e.g., what tially catastrophic consequences. Furthermore, errors could arise a particular NumPy function did), and were more readily able not only in the software logic, but also in an incorrectly entered to fill that gap (e.g., look it up in the documentation or on parameter value or a misinterpreted scientific study. Compounding Stack Overflow). By contrast, developers without expertise in these challenges, features often had to be developed and used epidemiology were less able to identify gaps in their knowledge on a timescale of hours or days to be of use to policymakers, and address them (e.g., by finding a study on Google Scholar). a speed which was incompatible with traditional software testing As a consequence, many of the epidemiologists’ software skills approaches. In addition, the rapidly evolving codebase made it improved markedly over the first few months, while the develop- difficult to write even simple regression tests. Our solution was to ers’ epidemiology knowledge increased more slowly. Second, and use a hierarchical testing approach: low-level functions were tested more importantly, we found that once transparent and performant through a standard software unit test approach, while new features coding practices had been implemented, epidemiologists were able and higher-level outputs were tested extensively by infectious to successfully adapt them to new contexts even without complete disease modelers who varied inputs corresponding to realistic understanding of the code. Thus, for developing a scientific scenarios, and checked the outputs (predominantly in the form software tool, we propose that a successful staffing plan would of graphs) against their intuition. We found that these high-level consist of a roughly equal ratio of developers and domain experts "sanity checks" were far more effective in catching bugs than during the early development phase, followed by a rapid (on a formal software tests, and as a result shifted the emphasis of timescale of weeks) ramp-down of developers and ramp-up of our test suite to prioritize the former. Public releases of Covasim domain experts. 
have held up well to extensive scrutiny, both by our external Acknowledging that Covasim’s potential user base includes collaborators and by "COVID skeptics" who were highly critical many people who have limited coding skills, we developed a three- of other COVID models [Den20]. tiered support model to maximize Covasim’s real-world policy Finally, since much of our intended audience has little to impact (Fig. 8). For "mode 1" engagements, we perform the anal- no Python experience, we provided as many alternative ways of yses using Covasim ourselves. While this mode typically ensures accessing Covasim as possible. For R users, we provide exam- high quality and efficiency, it is highly resource-constrained and ples of how to run Covasim using the reticulate package thus used only for our highest-profile engagements, such as with [AUTE17], which allows Python to be called from within R. the Vietnam Ministry of Health [PSN+ 21] and Washington State For specific applications, such as our test-trace-quarantine work Department of Health [KMS+ 21]. For "mode 2" engagements, we (http://ttq-app.covasim.org), we developed bespoke webapps via offer our partners training on how to use Covasim, and let them Jupyter notebooks [GP21] and Voilà [Qua19]. To help non-experts lead analyses with our feedback. This is our preferred mode of gain intuition about COVID epidemic dynamics, we also devel- engagement, since it balances efficiency and sustainability, and has oped a generic JavaScript-based webapp interface for Covasim been used for contexts including the United Kingdom [PGKS+ 20] (http://app.covasim.org), but it does not have sufficient flexibility and Australia [SLSS+ 22]. Finally, "mode 3" partnerships, in to answer real-world policy questions. which Covasim is downloaded and used without our direct input, are of course the default approach in the open-source software Workflow and team management ecosystem, including for Python. While this mode is by far the Covasim was developed by a team of roughly 75 people with most scalable, in practice, relatively few health departments or widely disparate backgrounds: from those with 20+ years of ministries of health have the time and internal technical capacity to enterprise-level software development experience and no public use this mode; instead, most of the mode 3 uptake of Covasim has 96 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) been by academic groups [LG+ 21]. Thus, we provide mode 1 and [AUTE17] JJ Allaire, Kevin Ushey, Yuan Tang, and Dirk Eddelbuettel. mode 2 partnerships to make Covasim’s impact more immediate reticulate: R Interface to Python, 2017. URL: https://github. com/rstudio/reticulate. and direct than would be possible via mode 3 alone. [BGB+ 18] Anna Bershteyn, Jaline Gerardin, Daniel Bridenbecker, Christo- pher W Lorton, Jonathan Bloedow, Robert S Baker, Guil- Future directions laume Chabot-Couture, Ye Chen, Thomas Fischle, Kurt Frey, et al. Implementation and applications of EMOD, an individual- While the need for COVID modeling is hopefully starting to based multi-disease modeling platform. Pathogens and disease, decrease, we and our collaborators are continuing development 76(5):fty059, 2018. doi:10.1093/femspd/fty059. of Covasim by updating parameters with the latest scientific [CHL+ 08] Myron S Cohen, Nick Hellmann, Jay A Levy, Kevin DeCock, Joep Lange, et al. The spread, treatment, and prevention of evidence, implementing new immune dynamics [CSN+ 21], and HIV-1: evolution of a global pandemic. 
The Journal of Clin- providing other usability and bug-fix updates. We also continue ical Investigation, 118(4):1244–1254, 2008. doi:10.1172/ to provide support and training workshops (including in-person JCI34706. workshops, which were not possible earlier in the pandemic). [COSF20] Dennis L Chao, Assaf P Oron, Devabhaktuni Srikrishna, and Michael Famulare. Modeling layered non-pharmaceutical inter- We are using what we learned during the development of ventions against SARS-CoV-2 in the United States with Corvid. Covasim to build a broader suite of Python-based disease mod- MedRxiv, 2020. doi:10.1101/2020.04.08.20058487. eling tools (tentatively named "*-sim" or "Starsim"). The suite [CSN+ 21] Jamie A Cohen, Robyn Margaret Stuart, Rafael C Nùñez, of Starsim tools under development includes models for family Katherine Rosenfeld, Bradley Wagner, Stewart Chang, Cliff Kerr, Michael Famulare, and Daniel J Klein. Mechanistic mod- planning [OVCC+ 22], polio, respiratory syncytial virus (RSV), eling of SARS-CoV-2 immune memory, variants, and vaccines. and human papillomavirus (HPV). To date, each tool in this medRxiv, 2021. doi:10.1101/2021.05.31.21258018. suite uses an independent codebase, and is related to Covasim [Den20] Denim, Sue. Another Computer Simulation, Another Alarmist only through the shared design principles described above, and Prediction, 2020. URL: https://dailysceptic.org/schools-paper. [Fam19] Mike Famulare. nCoV: preliminary estimates of the confirmed- by having used the Covasim codebase as the starting point for case-fatality-ratio and infection-fatality-ratio, and initial pan- development. demic risk assessment. Institute for Disease Modeling, 2019. A major open question is whether the disease dynamics im- [Gar05] Laurie Garrett. The next pandemic. Foreign Aff., 84:3, 2005. plemented in Covasim and these related models have sufficient doi:10.2307/20034417. [GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and overlap to be refactored into a single disease-agnostic modeling storytelling with code and data. Computing in Science & En- library, which the disease-specific modeling libraries would then gineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021. import. This "core and specialization" approach was adopted by 3059263. EMOD and Atomica, and while both frameworks continue to be [Hof20] Bert Hofman. The global pandemic. Horizons: Journal of International Relations and Sustainable Development, (16):60– used, no multi-disease modeling library has yet seen widespread 69, 2020. adoption within the disease modeling community. The alternative [HPN+ 21] Robert Hinch, William JM Probert, Anel Nurtay, Michelle approach, currently used by the Starsim suite, is for each disease Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana model to be a self-contained library. A shared library would Bulas Cruz, Lele Zhao, Andrea Stewart, et al. OpenABM- Covid19—An agent-based model for non-pharmaceutical inter- reduce code duplication, and allow new features and bug fixes ventions against COVID-19 including contact tracing. PLoS to be immediately rolled out to multiple models simultaneously. computational biology, 17(7):e1009146, 2021. doi:10. However, it would also increase interdependencies that would have 1371/journal.pcbi.1009146. the effect of increasing code complexity, increasing the risk of [KAH+ ng] Cliff C Kerr, Romesh G Abeysuriya, Vlad-S, tefan Harbuz, George L Chadderdon, Parham Saidi, Paula Sanz-Leon, James introducing subtle bugs. 
Which of these two options is preferable Jansson, Maria del Mar Quiroga, Sherrie Hughes, Rowan likely depends on the speed with which new disease models need Martin-and Kelly, Jamie Cohen, Robyn M Stuart, and Anna to be implemented. We hope that for the foreseeable future, none Nachesa. Sciris: a Python library to simplify scientific com- will need to be implemented as quickly as Covasim. puting. Available at http://paper.sciris.org, 2022 (forthcoming). [KAK+ 19] David J Kedziora, Romesh Abeysuriya, Cliff C Kerr, George L Chadderdon, Vlad-S, tefan Harbuz, Sarah Metzger, David P Wil- Acknowledgements son, and Robyn M Stuart. The Cascade Analysis Tool: software to analyze and optimize care cascades. Gates Open Research, 3, We thank additional contributors to Covasim, including Katherine 2019. doi:10.12688/gatesopenres.13031.2. Rosenfeld, Gregory R. Hart, Rafael C. Núñez, Prashanth Selvaraj, [KCP+ 20] Joel R Koo, Alex R Cook, Minah Park, Yinxiaohe Sun, Haoyang Brittany Hagedorn, Amanda S. Izzo, Greer Fowler, Anna Palmer, Sun, Jue Tao Lim, Clarence Tam, and Borame L Dickens. Interventions to mitigate early spread of sars-cov-2 in singapore: Dominic Delport, Nick Scott, Sherrie L. Kelly, Caroline S. Ben- a modelling study. The Lancet Infectious Diseases, 20(6):678– nette, Bradley G. Wagner, Stewart T. Chang, Assaf P. Oron, Paula 688, 2020. doi:10.1016/S1473-3099(20)30162-6. Sanz-Leon, and Jasmina Panovska-Griffiths. We also wish to thank [KMS+ 21] Cliff C Kerr, Dina Mistry, Robyn M Stuart, Katherine Rosenfeld, Maleknaz Nayebi and Natalie Dean for helpful discussions on Gregory R Hart, Rafael C Núñez, Jamie A Cohen, Prashanth Selvaraj, Romesh G Abeysuriya, Michał Jastrz˛ebski, et al. Con- code architecture and workflow practices, respectively. trolling COVID-19 via test-trace-quarantine. Nature Commu- nications, 12(1):1–12, 2021. doi:10.1038/s41467-021- 23276-9. R EFERENCES [KSM+ 21] Cliff C Kerr, Robyn M Stuart, Dina Mistry, Romesh G Abey- [AFG+ 04] Roy M Anderson, Christophe Fraser, Azra C Ghani, Christl A suriya, Katherine Rosenfeld, Gregory R Hart, Rafael C Núñez, Donnelly, Steven Riley, Neil M Ferguson, Gabriel M Leung, Jamie A Cohen, Prashanth Selvaraj, Brittany Hagedorn, et al. Tai H Lam, and Anthony J Hedley. Epidemiology, transmis- Covasim: an agent-based model of COVID-19 dynamics and sion dynamics and control of sars: the 2002–2003 epidemic. interventions. PLOS Computational Biology, 17(7):e1009149, Philosophical Transactions of the Royal Society of London. 2021. doi:10.1371/journal.pcbi.1009149. Series B: Biological Sciences, 359(1447):1091–1105, 2004. [LG+ 21] Junjiang Li, Philippe Giabbanelli, et al. Returning to a normal doi:10.1098/rstb.2004.1490. life via COVID-19 vaccines in the United States: a large- [AJ09] Bashar Abdul-Jawad. Groovy and Grails Recipes. Springer, scale Agent-Based simulation study. JMIR medical informatics, 2009. 9(4):e27419, 2021. doi:10.2196/27419. PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 97 Fig. 8: The three pathways to impact with Covasim, from high bandwidth/small scale to low bandwidth/large scale. IDM: Institute for Disease Modeling; OSS: open-source software; GPG: global public good; PyPI: Python Package Index. [LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A the impact of COVID-19 vaccines in a representative COVAX llvm-based python jit compiler. 
In Proceedings of the Second AMC country setting due to ongoing internal migration: A Workshop on the LLVM Compiler Infrastructure in HPC, pages modeling study. PLOS Global Public Health, 2(1):e0000053, 1–6, 2015. doi:10.1145/2833157.2833162. 2022. doi:10.1371/journal.pgph.0000053. [Med20] The Lancet Respiratory Medicine. COVID-19: delay, mitigate, [Tea14] WHO Ebola Response Team. Ebola virus disease in west and communicate. The Lancet Respiratory Medicine, 8(4):321, africa—the first 9 months of the epidemic and forward projec- 2020. doi:10.1016/S2213-2600(20)30128-4. tions. New England Journal of Medicine, 371(16):1481–1495, [OVCC 22] Michelle L O’Brien, Annie Valente, Guillaume Chabot-Couture, + 2014. doi:10.1056/NEJMoa1411100. Joshua Proctor, Daniel Klein, Cliff Kerr, and Marita Zimmer- mann. FPSim: An agent-based model of family planning for informed policy decision-making. In PAA 2022 Annual Meeting. PAA, 2022. [PeP22] PePy. PePy download statistics, 2022. URL: https://pepy.tech/ project/covasim. [PGKS+ 20] Jasmina Panovska-Griffiths, Cliff C Kerr, Robyn M Stuart, Dina Mistry, Daniel J Klein, Russell M Viner, and Chris Bonell. Determining the optimal strategy for reopening schools, the impact of test and trace interventions, and the risk of occurrence of a second COVID-19 epidemic wave in the UK: a modelling study. The Lancet Child & Adolescent Health, 4(11):817–827, 2020. doi:10.1016/S2352-4642(20)30250-9. [PSN+ 21] Quang D Pham, Robyn M Stuart, Thuong V Nguyen, Quang C Luong, Quang D Tran, Thai Q Pham, Lan T Phan, Tan Q Dang, Duong N Tran, Hung T Do, et al. Estimating and mitigating the risk of COVID-19 epidemic rebound associated with reopening of international borders in Vietnam: a modelling study. The Lancet Global Health, 9(7):e916–e924, 2021. doi:10.1016/ S2214-109X(21)00103-0. [Qua19] QuantStack. And voilá! Jupyter Blog, 2019. URL: https://blog. jupyter.org/and-voil%C3%A0-f6a2c08a4a93. [RSWS20] Joacim Rocklöv, Henrik Sjödin, and Annelies Wilder-Smith. COVID-19 outbreak on the Diamond Princess cruise ship: esti- mating the epidemic potential and effectiveness of public health countermeasures. Journal of Travel Medicine, 27(3):taaa030, 2020. doi:10.1093/jtm/taaa030. [SAK+ 21] Robyn M Stuart, Romesh G Abeysuriya, Cliff C Kerr, Dina Mistry, Dan J Klein, Richard T Gray, Margaret Hellard, and Nick Scott. Role of masks, testing and contact tracing in preventing COVID-19 resurgences: a case study from New South Wales, Australia. BMJ open, 11(4):e045941, 2021. doi:10.1136/bmjopen-2020-045941. [SHK16] Patrick R Saunders-Hastings and Daniel Krewski. Review- ing the history of pandemic influenza: understanding patterns of emergence and transmission. Pathogens, 5(4):66, 2016. doi:10.3390/pathogens5040066. [SLSS+ 22] Paula Sanz-Leon, Nathan J Stevenson, Robyn M Stuart, Romesh G Abeysuriya, James C Pang, Stephen B Lambert, Cliff C Kerr, and James A Roberts. Risk of sustained SARS- CoV-2 transmission in Queensland, Australia. Scientific reports, 12(1):1–9, 2022. doi:10.1101/2021.06.08.21258599. [SWC 22] Prashanth Selvaraj, Bradley G Wagner, Dennis L Chao, + Maïna L’Azou Jackson, J Gabrielle Breugelmans, Nicholas Jack- son, and Stewart T Chang. Rural prioritization may increase 98 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
(SCIPY 2022) Pylira: deconvolution of images in the presence of Poisson noise Axel Donath‡∗ , Aneta Siemiginowska‡ , Vinay Kashyap‡ , Douglas Burke‡ , Karthik Reddy Solipuram§ , David van Dyk¶ F Abstract—All physical and astronomical imaging observations are degraded by of the signal intensity to the signal variance. Any statistically the finite angular resolution of the camera and telescope systems. The recovery correct post-processing or reconstruction method thus requires a of the true image is limited by both how well the instrument characteristics careful treatment of the Poisson nature of the measured image. are known and by the magnitude of measurement noise. In the case of a To maximise the scientific use of the data, it is often desired to high signal to noise ratio data, the image can be sharpened or “deconvolved” correct the degradation introduced by the imaging process. Besides robustly by using established standard methods such as the Richardson-Lucy method. However, the situation changes for sparse data and the low signal to correction for non-uniform exposure and background noise this noise regime, such as those frequently encountered in X-ray and gamma-ray also includes the correction for the "blurring" introduced by the astronomy, where deconvolution leads inevitably to an amplification of noise point spread function (PSF) of the instrument. Where the latter and poorly reconstructed images. However, the results in this regime can process is often called "deconvolution". Depending on whether be improved by making use of physically meaningful prior assumptions and the PSF of the instrument is known or not, one distinguishes statistically principled modeling techniques. One proposed method is the LIRA between the "blind deconvolution" and "non blind deconvolution" algorithm, which requires smoothness of the reconstructed image at multiple process. For astronomical observations, the PSF can often either scales. In this contribution, we introduce a new python package called Pylira, be simulated, given a model of the telescope and detector, or which exposes the original C implementation of the LIRA algorithm to Python inferred directly from the data by observing far distant objects, users. We briefly describe the package structure, development setup and show a Chandra as well as Fermi-LAT analysis example. which appear as a point source to the instrument. While in other branches of astronomy deconvolution methods Index Terms—deconvolution, point spread function, poisson, low counts, X-ray, are already part of the standard analysis, such as the CLEAN gamma-ray algorithm for radio data, developed by [Hog74], this is not the case for X-ray and gamma-ray astronomy. As any deconvolution method aims to enhance small-scale structures in an image, it Introduction becomes increasingly hard to solve for the regime of low signal- Any physical and astronomical imaging process is affected by to-noise ratio, where small-scale structures are more affected by the limited angular resolution of the instrument or telescope. In noise. addition, the quality of the resulting image is also degraded by background or instrumental measurement noise and non-uniform The Deconvolution Problem exposure. 
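As context for the sections that follow, this minimal sketch simulates the imaging process just described (a true flux image blurred by the instrument PSF and recorded as Poisson-distributed counts) and then applies the standard Richardson-Lucy deconvolution mentioned in the abstract, using the implementation available in scikit-image; the image size, PSF width, and source brightness are invented for the example:

import numpy as np
from scipy.signal import fftconvolve
from skimage.restoration import richardson_lucy

rng = np.random.default_rng(42)

# True flux: a single point source on a faint flat background
flux = np.full((64, 64), 0.5)
flux[32, 32] += 300.0

# Gaussian PSF kernel, normalized to unit sum
y, x = np.mgrid[-12:13, -12:13]
psf = np.exp(-(x**2 + y**2) / (2 * 3.0**2))
psf /= psf.sum()

# Forward model: blur with the PSF, then draw Poisson-distributed counts
expected = fftconvolve(flux, psf, mode="same").clip(min=0)   # clip tiny FFT round-off before sampling
counts = rng.poisson(expected).astype(float)

# Classical Richardson-Lucy deconvolution from scikit-image
deconvolved = richardson_lucy(counts, psf, 30, clip=False)

For bright, high signal-to-noise data this works well; the low-count regime discussed below is where the noise amplification problems appear.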
For short wavelengths and associated low intensities of Basic Statistical Model the signal, the imaging process consists of recording individual Assuming the data in each pixel di in the recorded counts image photons (often called "events") originating from a source of follows a Poisson distribution, the total likelihood of obtaining the interest. This imaging process is typical for X-ray and gamma- measured image from a model image of the expected counts λi ray telescopes, but images taken by magnetic resonance imaging with N pixels is given by: or fluorescence microscopy show Poisson noise too. For each individual photon, the incident direction, energy and arrival time N exp −di λidi L (d|λ ) = ∏ (1) is measured. Based on this information, the event can be binned i di ! into two dimensional data structures to form an actual image. By taking the logarithm, dropping the constant terms and inverting As a consequence of the low intensities associated to the the sign one can transform the product into a sum over pixels, recording of individual events, the measured signal follows Pois- which is also often called the Cash [Cas79] fit statistics: son statistics. This imposes a non-linear relationship between the N measured signal and true underlying intensity as well as a coupling C (λ |d) = ∑(λi − di log λi ) (2) i * Corresponding author: axel.donath@cfa.harvard.edu ‡ Center for Astrophysics | Harvard & Smithsonian Where the expected counts λi are given by the convolution of the § University of Maryland Baltimore County true underlying flux distribution xi with the PSF pk : ¶ Imperial College London λi = ∑ xi pi−k (3) Copyright © 2022 Axel Donath et al. This is an open-access article distributed k under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the This operation is often called "forward modelling" or "forward original author and source are credited. folding" with the instrument response. PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 99 Richardson Lucy (RL) To obtain the most likely value of xn given the data, one searches a maximum of the total likelihood function, or equivalently a of minimum C . This high dimensional optimization problem can e.g., be solved by a classic gradient descent approach. Assuming the pixels values xi of the true image as independent parameters, one can take the derivative of Eq. 2 with respect to the individual xi . This way one obtains a rule for how to update the current set of pixels xn in each iteration of the optimization: ∂ C (d|x) xn+1 = xn − α · (4) ∂ xi Where α is a factor to define the step size. This method is in general equivalent to the gradient descent and backpropagation methods used in modern machine learning techniques. This ba- sic principle of solving the deconvolution problem for images with Poisson noise was proposed by [Ric72] and [Luc74]. Their method, named after the original authors, is often known as the Fig. 1: The images show the result of the RL algorithm applied Richardson & Lucy (RL) method. It was shown by [Ric72] that to a simulated example dataset with varying numbers of iterations. this converges to a maximum likelihood solution of Eq. 2. A The image in the upper left shows the simulated counts. Those have Python implementation of the standard RL method is available been derived from the ground truth (upper mid) by convolving with a e.g. in the Scikit-Image package [vdWSN+ 14]. 
Gaussian PSF of width σ = 3 pix and applying Poisson noise to it. Instead of the iterative, gradient descent based optimization it The illustration uses the implementation of the RL algorithm from the is also possible to sample from the posterior distribution using a Scikit-Image package [vdWSN+ 14]. simple Metropolis-Hastings [Has70] approach and uniform prior. This is demonstrated in one of the Pylira online tutorials (Intro- the smoothness of the reconstructed image on multiple spatial duction to Deconvolution using MCMC Methods). scales. Starting from the full resolution, the image pixels xi are collected into 2 by 2 groups Qk . The four pixel values associated RL Reconstruction Quality with each group are divided by their sum to obtain a grid of “split While technically the RL method converges to a maximum like- proportions” with respect to the image down-sized by a factor of lihood solution, it mostly still results in poorly restored images, two along both axes. This process is repeated using the down sized especially if extended emission regions are present in the image. image with pixel values equal to the sums over the 2 by 2 groups The problem is illustrated in Fig. 1 using a simulated example from the full-resolution image, and the process continues until the image. While for a low number of iterations, the RL method still resolution of the image is only a single pixel, containing the total results in a smooth intensity distribution, the structure of the image sum of the full-resolution image. This multi-scale representation decomposes more and more into a set of point-like sources with is illustrated in Fig. 2. growing number of iterations. For each of the 2x2 groups of the re-normalized images a Because of the PSF convolution, an extended emission region Dirichlet distribution is introduced as a prior: can decompose into multiple nearby point sources and still lead to good model prediction, when compared with the data. Those φk ∝ Dirichlet(αk , αk , αk , αk ) (6) almost equally good solutions correspond to many narrow local and multiplied across all 2x2 groups and resolution levels k. For minima or "spikes" in the global likelihood surface. Depending on each resolution level a smoothing parameter αk is introduced. the start estimate for the reconstructed image x the RL method These hyper-parameters can be interpreted as having an infor- will follow the steepest gradient and converge towards the nearest mation content equivalent of adding αk "hallucinated" counts in narrow local minimum. This problem has been described by each grouping. This effectively results in a smoothing of the multiple authors, such as [PR94] and [FBPW95]. image at the given resolution level. The distribution of α values at each resolution level is the further described by a hyper-prior Multi-Scale Prior & LIRA distribution: One solution to this problem was described in [ECKvD04] and p(αk ) = exp (−δ α 3 /3) (7) [CSv+ 11]. First, the simple forward folded model described in Eq. 3 can be extended by taking into account the non-uniform Resulting in a fully hierarchical Bayesian model. A more com- exposure ei and an additional known background component bi : plete and detailed description of the prior definition is given in [ECKvD04]. λi = ∑ (ei · (xi + bi )) pi−k (5) The problem is then solved by using a Gibbs MCMC sampling k approach. 
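The 2 by 2 grouping that underlies the LIRA prior can be sketched in a few lines of NumPy (illustrative only; it assumes a square image whose side length is a power of two, and Pylira's actual implementation lives in the wrapped C code):

import numpy as np

def multiscale_split_proportions(image):
    # Returns the per-level "split proportions" plus the final 1x1 image (the total sum).
    # Blocks summing to zero would need special handling and are ignored here.
    levels = []
    current = image.astype(float)
    while current.shape[0] > 1:
        n = current.shape[0] // 2
        groups = current.reshape(n, 2, n, 2)             # group pixels into 2x2 blocks
        sums = groups.sum(axis=(1, 3))                   # next-coarser image: sum over each block
        levels.append(groups / sums[:, None, :, None])   # proportions within each block
        current = sums
    return levels, current

image = np.arange(16, dtype=float).reshape(4, 4) + 1
levels, total = multiscale_split_proportions(image)
assert np.isclose(total[0, 0], image.sum())              # the coarsest level holds the total counts

Each level of split proportions is what the Dirichlet prior with smoothing parameter alpha_k acts on.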
After a "burn-in" phase the sampling process typically The background bi can be more generally understood as a "base- reaches convergence and starts sampling from the posterior distri- line" image and thus include known structures, which are not of bution. The reconstructed image is then computed as the mean of interest for the deconvolution process. E.g., a bright point source the posterior samples. As for each pixel a full distribution of its to model the core of an AGN while studying its jets. values is available, the information can also be used to compute Second, the authors proposed to extend the Poisson log- the associated error of the reconstructed value. This is another likelihood function (Equation 2) by a log-prior term that controls main advantage over RL or Maxium A-Postori (MAP) algorithms. 100 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1 $ sudo apt-get install r-base-dev r-base r-mathlib 2 $ pip install pylira For more detailed instructions see Pylira installation instructions. API & Subpackages Pylira is structured in multiple sub-packages. The pylira.src module contains the original C implementation and the Pybind11 wrapper code. The pylira.core sub-package contains the main Python API, pylira.utils includes utility functions for plotting and serialisation. And pylira.data implements multiple pre-defined datasets for testing and tutorials. Analysis Examples Simple Point Source Pylira was designed to offer a simple Python class based user interface, which allows for a short learning curve of using the package for users who are familiar with Python in general and more specifically with Numpy. A typical complete usage example of the Pylira package is shown in the following: Fig. 2: The image illustrates the multi-scale decomposition used in the LIRA prior for a 4x4 pixels example image. Each quadrant of 2x2 1 import numpy as np sub-images is labelled with QN . The sub-pixels in each quadrant are 2 from pylira import LIRADeconvolver labelled Λi j . . 3 from pylira.data import point_source_gauss_psf 4 5 # create example dataset 6 data = point_source_gauss_psf() The Pylira Package 7 8 # define initial flux image Dependencies & Development 9 data["flux_init"] = data["flux"] The Pylira package is a thin Python wrapper around the original 10 LIRA implementation provided by the authors of [CSv+ 11]. The 11 deconvolve = LIRADeconvolver( 12 n_iter_max=3_000, original algorithm was implemented in C and made available as a 13 n_burn_in=500, package for the R Language [R C20]. Thus the implementation de- 14 alpha_init=np.ones(5) pends on the RMath library, which is still a required dependency of 15 ) 16 Pylira. The Python wrapper was built using the Pybind11 [JRM17] 17 result = deconvolve.run(data=data) package, which allows to reduce the code overhead introduced by 18 the wrapper to a minimum. For the data handling, Pylira relies on 19 # plot pixel traces, result shown in Figure 3 Numpy [HMvdW+ 20] arrays for the serialisation to the FITS data 20 result.plot_pixel_traces_region( 21 center_pix=(16, 16), radius_pix=3 format on Astropy [Col18]. The (interactive) plotting functionality 22 ) is achieved via Matplotlib [Hun07] and Ipywidgets [wc15], which 23 are both optional dependencies. Pylira is openly developed on 24 # plot pixel traces, result shown in Figure 4 25 result.plot_parameter_traces() Github at https://github.com/astrostat/pylira. 
It relies on GitHub 26 Actions as a continuous integration service and uses the Read 27 # finally serialise the result the Docs service to build and deploy the documentation. The on- 28 result.write("result.fits") line documentation can be found on https://pylira.readthedocs.io. The main interface is exposed via the LIRADeconvolver Pylira implements a set of unit tests to assure compatibility class, which takes the configuration of the algorithm on initial- and reproducibility of the results with different versions of the isation. Typical configuration parameters include the total num- dependencies and across different platforms. As Pylira relies on ber of iterations n_iter_max and the number of "burn-in" random sampling for the MCMC process an exact reproducibility iterations, to be excluded from the posterior mean computation. of results is hard to achieve on different platforms; however the The data, represented by a simple Python dict data structure, agreement of results is at least guaranteed in the statistical limit of contains a "counts", "psf" and optionally "exposure" drawing many samples. and "background" array. The dataset is then passed to the LIRADeconvolver.run() method to execute the deconvolu- Installation tion. The result is a LIRADeconvolverResult object, which Pylira is available via the Python package index (pypi.org), features the possibility to write the result as a FITS file, as well currently at version 0.1. As Pylira still depends on the RMath as to inspect the result with diagnostic plots. The result of the library, it is required to install this first. So the recommended way computation is shown in the left panel of Fig. 3. to install Pylira is on MacOS is: 1 $ brew install r Diagnostic Plots 2 $ pip install pylira To validate the quality of the results Pylira provides many built- On Linux the RMath dependency can be installed using standard in diagnostic plots. One of these diagnostic plot is shown in the package managers. For example on Ubuntu, one would do right panel of Fig. 3. The plot shows the image sampling trace PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 101 Pixel trace for (16, 16) 30 800 1000 700 25 800 600 20 500 600 Posterior Mean Burn in Valid 15 400 Mean 400 1 Std. Deviation 300 10 200 200 5 100 0 0 0 5 10 15 20 25 30 0 500 1000 1500 2000 2500 3000 Number of Iterations Fig. 3: The curves show the traces of value the pixel of interest for a simulated point source and its neighboring pixels (see code example). The image on the left shows the posterior mean. The white circle in the image shows the circular region defining the neighboring pixels. The blue line on the right plot shows the trace of the pixel of interest. The solid horizontal orange line shows the mean value (excluding burn-in) of the pixel across all iterations and the shaded orange area the 1 σ error region. The burn in phase is shown in transparent blue and ignored while computing the mean. The shaded gray lines show the traces of the neighboring pixels. for a single pixel of interest and its surrounding circular region of Chandra is a space-based X-ray observatory, which has been interest. This visualisation allows the user to assess the stability in operation since 1999. It consists of nested cylindrical paraboloid of a small region in the image e.g. an astronomical point source and hyperboloid surfaces, which form an imaging optical system during the MCMC sampling process. Due to the correlation with for X-rays. 
In the focal plane, it has multiple instruments for dif- neighbouring pixels, the actual value of a pixel might vary in the ferent scientific purposes. This includes a high-resolution camera sampling process, which appears as "dips" in the trace of the pixel (HRC) and an Advanced CCD Imaging Spectrometer (ACIS). The of interest and anti-correlated "peaks" in the one or mutiple of typical angular resolution is 0.5 arcsecond and the covered energy the surrounding pixels. In the example a stable state of the pixels ranges from 0.1 - 10 keV. of interest is reached after approximately 1000 iterations. This Figure 5 shows the result of the Pylira algorithm applied to suggests that the number of burn-in iterations, which was defined Chandra data of the Galactic Center region between 0.5 and 7 keV. beforehand, should be increased. The PSF was obtained from simulations using the simulate_psf Pylira relies on an MCMC sampling approach to sample tool from the official Chandra science tools ciao 4.14 [FMA+ 06]. a series of reconstructed images from the posterior likelihood The algorithm achieves both an improved spatial resolution as well defined by Eq. 2. Along with the sampling, it marginalises over as a reduced noise level and higher contrast of the image in the the smoothing hyper-parameters and optimizes them in the same right panel compared to the unprocessed counts data shown in the process. To diagnose the validity of the results it is important to left panel. visualise the sampling traces of both the sampled images as well As a second example, we use data from the Fermi Large Area as hyper-parameters. Telescope (LAT). The Fermi-LAT is a satellite-based imaging Figure 4 shows another typical diagnostic plot created by the gamma-ray detector, which covers an energy range of 20 MeV code example above. In a multi-panel figure, the user can inspect to >300 GeV. The angular resolution varies strongly with energy the traces of the total log-posterior as well as the traces of the and ranges from 0.1 to >10 degree1 . smoothing parameters. Each panel corresponds to the smoothing Figure 6 shows the result of the Pylira algorithm applied to hyper parameter introduced for each level of the multi-scale Fermi-LAT data above 1 GeV to the region around the Galactic representation of the reconstructed image. The figure also shows Center. The PSF was obtained from simulations using the gtpsf the mean value along with the 1 σ error region. In this case, tool from the official Fermitools v2.0.19 [Fer19]. First, one can the algorithm shows stable convergence after a burn-in phase of see that the algorithm achieves again a considerable improvement approximately 200 iterations for the log-posterior as well as all of in the spatial resolution compared to the raw counts. It clearly the multi-scale smoothing parameters. resolves multiple point sources left to the bright Galactic Center source. Astronomical Analysis Examples Summary & Outlook Both in the X-ray as well as in the gamma-ray regime, the Galactic The Pylira package provides Python wrappers for the LIRA al- Center is a complex emission region. It shows point sources, gorithm. It allows the deconvolution of low-counts data following extended sources, as well as underlying diffuse emission and thus 1. https://www.slac.stanford.edu/exp/glast/groups/canda/lat_Performance. represents a challenge for any astronomical data analysis. htm 102 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
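For real observations such as the Chandra example above, an analysis might be set up along the following lines; this is a hedged sketch using only the API elements shown earlier, with hypothetical file names and with the choice of initial flux image (the counts themselves) being an assumption rather than a Pylira recommendation:

import numpy as np
from astropy.io import fits
from pylira import LIRADeconvolver

counts = fits.getdata("gc-counts.fits").astype(float)   # hypothetical counts image
psf = fits.getdata("gc-psf.fits").astype(float)         # hypothetical PSF, e.g. produced with simulate_psf
psf /= psf.sum()                                        # normalize the PSF to unit sum

data = {
    "counts": counts,
    "psf": psf,
    "flux_init": counts.copy(),   # assumed starting point for the chain
    # optional "exposure" and "background" arrays are omitted here
}

deconvolve = LIRADeconvolver(
    n_iter_max=3_000,
    n_burn_in=500,
    alpha_init=np.ones(5),
)
result = deconvolve.run(data=data)

result.plot_parameter_traces()    # check convergence of the smoothing parameters
result.write("gc-result.fits")

As with the simulated example, the burn-in length should be revisited after inspecting the traces.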
(SCIPY 2022) Logpost Smoothingparam0 Smoothingparam1 Burn in 0.35 0.35 1500 Valid 0.30 Mean 0.30 1 Std. Deviation 0.25 0.25 1000 0.20 0.20 500 0.15 0.15 0 0.10 0.10 0.05 0.05 500 0.00 0.00 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of Iterations Number of Iterations Number of Iterations Smoothingparam2 Smoothingparam3 Smoothingparam4 0.200 0.175 0.20 0.175 0.150 0.150 0.15 0.125 0.125 0.100 0.100 0.10 0.075 0.075 0.05 0.050 0.050 0.025 0.025 0.00 0.000 0.000 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of Iterations Number of Iterations Number of Iterations Fig. 4: The curves show the traces of the log posterior value as well as traces of the values of the prior parameter values. The SmoothingparamN parameters correspond to the smoothing parameters αN per multi-scale level. The solid horizontal orange lines show the mean value, the shaded orange area the 1 σ error region. The burn in phase is shown transparent and ignored while estimating the mean. Counts Deconvolved 500 PSF 257 132 -29°00'25" 68 Declination 35 Counts 18 30" 9 5 2 35" 17h45m40.6s40.4s 40.2s 40.0s 39.8s 39.6s 17h45m40.6s40.4s 40.2s 40.0s 39.8s 39.6s Right Ascension Right Ascension Fig. 5: Pylira applied to Chandra ACIS data of the Galactic Center region, using the observation IDs 4684 and 4684. The image on the left shows the raw observed counts between 0.5 and 7 keV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included. PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 103 Counts Deconvolved 200 0°40' PSF 120 72 20' 43 Galactic Latitude 00' 26 Counts 16 -0°20' 9 5 40' 2 0°40' 20' 00' 359°40' 20' 0°40' 20' 00' 359°40' 20' Galactic Longitude Galactic Longitude Fig. 6: Pylira applied to Fermi-LAT data from the Galactic Center region. The image on the left shows the raw measured counts between 5 and 1000 GeV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included. Poisson statistics using a Bayesian sampling approach and a multi- [CSv+ 11] A. Connors, N. M. Stein, D. van Dyk, V. Kashyap, and scale smoothing prior assumption. The results can be easily written A. Siemiginowska. LIRA — The Low-Counts Image Restora- tion and Analysis Package: A Teaching Version via R. In I. N. to FITS files and inspected by plotting the trace of the sampling Evans, A. Accomazzi, D. J. Mink, and A. H. Rots, editors, process. This allows users to check for general convergence as Astronomical Data Analysis Software and Systems XX, volume well as pixel to pixel correlations for selected regions of interest. 442 of Astronomical Society of the Pacific Conference Series, The package is openly developed on GitHub and includes tests page 463, July 2011. [ECKvD04] David N. Esch, Alanna Connors, Margarita Karovska, and and documentation, such that it can be maintained and improved David A. van Dyk. An image restoration technique with in the future, while ensuring consistency of the results. It comes error estimates. The Astrophysical Journal, 610(2):1213– with multiple built-in test datasets and explanatory tutorials in 1227, aug 2004. URL: https://doi.org/10.1086/421761, doi: 10.1086/421761. the form of Jupyter notebooks. Future plans include the support [FBPW95] D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G. 
Acknowledgements

This work was conducted under the auspices of the CHASC International Astrostatistics Center. CHASC is supported by NSF grants DMS-21-13615, DMS-21-13397, and DMS-21-13605; by the UK Engineering and Physical Sciences Research Council [EP/W015080/1]; and by NASA 18-APRA18-0019. We thank CHASC members for many helpful discussions, especially Xiao-Li Meng and Katy McKeough. DvD was also supported in part by a Marie Skłodowska-Curie RISE Grant (H2020-MSCA-RISE-2019-873089) provided by the European Commission. Aneta Siemiginowska, Vinay Kashyap, and Doug Burke further acknowledge support from NASA contract to the Chandra X-ray Center NAS8-03060.

REFERENCES

[Cas79] W. Cash. Parameter estimation in astronomy through application of the likelihood ratio. The Astrophysical Journal, 228:939–947, March 1979. doi:10.1086/156922.
[Col18] Astropy Collaboration. The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package. The Astronomical Journal, 156(3):123, September 2018. arXiv:1801.02634, doi:10.3847/1538-3881/aabc4f.
[CSv+11] A. Connors, N. M. Stein, D. van Dyk, V. Kashyap, and A. Siemiginowska. LIRA — The Low-Counts Image Restoration and Analysis Package: A Teaching Version via R. In I. N. Evans, A. Accomazzi, D. J. Mink, and A. H. Rots, editors, Astronomical Data Analysis Software and Systems XX, volume 442 of Astronomical Society of the Pacific Conference Series, page 463, July 2011.
[ECKvD04] David N. Esch, Alanna Connors, Margarita Karovska, and David A. van Dyk. An image restoration technique with error estimates. The Astrophysical Journal, 610(2):1213–1227, August 2004. doi:10.1086/421761.
[FBPW95] D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G. Walker. Blind deconvolution by means of the Richardson–Lucy algorithm. J. Opt. Soc. Am. A, 12(1):58–65, January 1995. doi:10.1364/JOSAA.12.000058.
[Fer19] Fermi Science Support Development Team. Fermitools: Fermi Science Tools. Astrophysics Source Code Library, record ascl:1905.011, May 2019.
[FMA+06] Antonella Fruscione, Jonathan C. McDowell, Glenn E. Allen, Nancy S. Brickhouse, Douglas J. Burke, John E. Davis, Nick Durham, Martin Elvis, Elizabeth C. Galle, Daniel E. Harris, David P. Huenemoerder, John C. Houck, Bish Ishibashi, Margarita Karovska, Fabrizio Nicastro, Michael S. Noble, Michael A. Nowak, Frank A. Primini, Aneta Siemiginowska, Randall K. Smith, and Michael Wise. CIAO: Chandra's data analysis system. In David R. Silva and Rodger E. Doxsey, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 6270, page 62701V, June 2006. doi:10.1117/12.671760.
[Has70] W. K. Hastings. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika, 57(1):97–109, April 1970. doi:10.1093/biomet/57.1.97.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[Hog74] J. A. Högbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astronomy and Astrophysics Supplement, 15:417, June 1974.
[Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[JRM17] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 – seamless operability between C++11 and Python, 2017. https://github.com/pybind/pybind11.
[Luc74] L. B. Lucy. An iterative technique for the rectification of observed distributions. Astronomical Journal, 79:745, June 1974. doi:10.1086/111605.
[PR94] K. M. Perry and S. J. Reeves. Generalized Cross-Validation as a Stopping Rule for the Richardson-Lucy Algorithm. In Robert J. Hanisch and Richard L. White, editors, The Restoration of HST Images and Spectra – II, page 97, January 1994. doi:10.1002/ima.1850060412.
[R C20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL: https://www.R-project.org/.
[Ric72] William Hadley Richardson. Bayesian-Based Iterative Method of Image Restoration. Journal of the Optical Society of America (1917–1983), 62(1):55, January 1972. doi:10.1364/josa.62.000055.
[vdWSN+14] Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu, and the scikit-image contributors. scikit-image: image processing in Python. PeerJ, 2:e453, June 2014. doi:10.7717/peerj.453.
[wc15] Jupyter widgets community. ipywidgets, a GitHub repository. Retrieved from https://github.com/jupyter-widgets/ipywidgets, 2015.
Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels

Geoffrey M. Poore

Abstract—Codebraid Preview is a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Unlike typical Markdown previews, all Pandoc features are fully supported because Pandoc itself generates the preview. The Markdown source and the preview are fully integrated with features like bidirectional scroll sync. The preview supports LaTeX math via KaTeX. Code blocks and inline code can be executed with Codebraid, using either its built-in execution system or Jupyter kernels. For executed code, any combination of the code and its output can be displayed in the preview as well as the final document. Code execution is non-blocking, so the preview always remains live and up-to-date even while code is still running.

Index Terms—reproducibility, dynamic report generation, literate programming, Python, Pandoc, Markdown, Project Jupyter

Introduction

Pandoc [JM22] is increasingly a foundational tool for creating scientific and technical documents. It provides Pandoc's Markdown and other Markdown variants that add critical features absent in basic Markdown, such as citations, footnotes, mathematics, and tables. At the same time, Pandoc simplifies document creation by providing conversion from Markdown (and other formats) to formats like LaTeX, HTML, Microsoft Word, and PowerPoint. Pandoc is especially useful for documents with embedded code that is executed during the build process. RStudio's RMarkdown [RSt20] and more recently Quarto [RSt22] leverage Pandoc to convert Markdown documents to other formats, with code execution provided by knitr [YX15]. JupyterLab [GP21] centers the writing experience around an interactive, browser-based notebook instead of a Markdown document, but still relies on Pandoc for export to formats other than HTML [Jup22]. There are also ways to interact with a Jupyter Notebook as a Markdown document, such as Jupytext [MWtJT20] and Pandoc's own native Jupyter support.

Writing with Pandoc's Markdown or a similar Markdown variant has advantages when multiple output formats are required, since Pandoc provides the conversion capabilities. Pandoc Markdown variants can also serve as a simpler syntax when creating HTML, LaTeX, or similar documents. They allow HTML and LaTeX to be intermixed with Markdown syntax. They also support including raw chunks of text in other formats such as reStructuredText. When executable code is involved, the RMarkdown-style approach of Markdown with embedded code can sometimes be more convenient than a browser-based Jupyter notebook since the writing process involves more direct interaction with the complete document source.
While using a Pandoc Markdown variant as a source format brings many advantages, the actual writing process itself can be less than ideal, especially when executable code is involved. Pandoc Markdown variants are so powerful precisely because they provide so many extensions to Markdown, but this also means that they can only be fully rendered by Pandoc itself. When text editors such as VS Code provide a built-in Markdown preview, typically only a small subset of Pandoc features is supported, so the representation of the document output will be inaccurate. Some editors provide a visual Markdown editing mode, in which a partially rendered version of the document is displayed in the editor and menus or keyboard shortcuts may replace the direct entry of Markdown syntax. These generally suffer from the same issue. This is only exacerbated when the document embeds code that is executed during the build process, since that goes even further beyond basic Markdown.

An alternative is to use Pandoc itself to generate HTML or PDF output, and then display this as a preview. Depending on the text editor used, the HTML or PDF might be displayed within the text editor in a panel beside the document source, or in a separate browser window or PDF viewer. For example, Quarto offers both possibilities, depending on whether RStudio, VS Code, or another editor is used.¹ While this approach resolves the inaccuracy issues of a basic Markdown preview, it also gives up features such as scroll sync that tightly integrate the Markdown source with the preview. In the case of executable code, there is the additional issue of a time delay in rendering the preview. Pandoc itself can typically convert even a relatively long document in under one second. However, when code is executed as part of the document build process, preview update is blocked until code execution completes.

This paper introduces Codebraid Preview, a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Codebraid Preview provides a Pandoc-based preview while avoiding most of the traditional drawbacks of this approach. The next section provides an overview of features. This is followed by sections focusing on scroll sync, LaTeX support, and code execution as examples of solutions and remaining challenges in creating a better Pandoc writing experience.

* Corresponding author: gpoore@uu.edu
‡ Union University

Copyright © 2022 Geoffrey M. Poore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

¹ The RStudio editor is unique in also offering a Pandoc-based visual editing mode, starting with version 1.4 from January 2021 (https://www.rstudio.com/blog/announcing-rstudio-1-4/).
Overview of Codebraid Preview

Codebraid Preview can be installed through the VS Code extension manager. Development is at https://github.com/gpoore/codebraid-preview-vscode. Pandoc must be installed separately (https://pandoc.org/). For code execution capabilities, Codebraid must also be installed (https://github.com/gpoore/codebraid).

The preview panel can be opened using the VS Code command palette, or by clicking the Codebraid Preview button that is visible when a Markdown document is open. The preview panel takes the document in its current state, converts it into HTML using Pandoc, and displays the result using a webview. An example is shown in Figure 1. Since the preview is generated by Pandoc, all Pandoc features are fully supported.

By default, the preview updates automatically whenever the Markdown source is changed. There is a short user-configurable minimum update interval. For shorter documents, sub-second updates are typical.

The preview uses the same styling CSS as VS Code's built-in Markdown preview, so it automatically adjusts to the VS Code color theme. For example, changing between light and dark themes changes the background and text colors in the preview.

Codebraid Preview leverages recent Pandoc advances to provide bidirectional scroll sync between the Markdown source and the preview for all CommonMark-based Markdown variants that Pandoc supports (commonmark, gfm, commonmark_x). By default, Codebraid Preview treats Markdown documents as commonmark_x, which is CommonMark with Pandoc extensions for features like math, footnotes, and special list types. The preview still works for other Markdown variants, but scroll sync is disabled. By default, scroll sync is fully bidirectional, so scrolling either the source or the preview will cause the other to scroll to the corresponding location. Scroll sync can instead be configured to be only from source to preview or only from preview to source. As far as I am aware, this is the first time that scroll sync has been implemented in a Pandoc-based preview.

The same underlying features that make scroll sync possible are also used to provide other preview capabilities. Double-clicking in the preview moves the cursor in the editor to the corresponding line of the Markdown source.

Since many Markdown variants support LaTeX math, the preview includes math support via KaTeX [EA22].

Codebraid Preview can simply be used for writing plain Pandoc documents. Optional execution of embedded code is possible with Codebraid [GMP19], using its built-in code execution system or Jupyter kernels. When Jupyter kernels are used, it is possible to obtain the same output that would be present in a Jupyter notebook, including rich output such as plots and mathematics. It is also possible to specify a custom display so that only a selected combination of code, stdout, stderr, and rich output is shown while the rest are hidden. Code execution is decoupled from the preview process, so the Markdown source can be edited and the preview can update even while code is running in the background. As far as I am aware, no previous software for executing code in Markdown has supported building a document with partial code output before execution has completed.

There is also support for document export with Pandoc, using the VS Code command palette or the export-with-Pandoc button.

Scroll sync

Tight source-preview integration requires a source map, or a mapping from characters in the source to characters in the output. Due to Pandoc's parsing algorithms, tracking source location during parsing is not possible in the general case.²

Pandoc 2.11.3 was released in December 2020. It added a sourcepos extension for CommonMark and formats based on it, including GitHub-Flavored Markdown (GFM) and commonmark_x (CommonMark plus extensions similar to Pandoc's Markdown). The CommonMark parser uses a different parsing algorithm from the Pandoc's Markdown parser, and this algorithm permits tracking source location. For the first time, it was possible to construct a source map for a Pandoc input format.
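To make the source map concrete, the following minimal sketch shells out to an installed Pandoc (2.11.3 or later is assumed) and requests sourcepos data for a tiny made-up document; the exact attribute names in the generated HTML (data-pos style source ranges) may vary between Pandoc versions and are shown here only as an assumption:

```python
# Minimal sketch: inspect the source-position data that the sourcepos
# extension attaches to CommonMark-based output.
import subprocess

doc = "# Title\n\nSome *emphasised* text.\n"
html = subprocess.run(
    ["pandoc", "--from=commonmark_x+sourcepos", "--to=html"],
    input=doc, capture_output=True, text=True, check=True,
).stdout
print(html)  # elements carry source-range attributes, e.g. data-pos="...1:1-1:8"
```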
Codebraid Preview defaults to commonmark_x as an input format, since it provides the most features of all CommonMark-based formats. Features continue to be added to commonmark_x and it is gradually nearing feature parity with Pandoc's Markdown. Citations are perhaps the most important feature currently missing.³

Codebraid Preview provides full bidirectional scroll sync between source and preview for all CommonMark-based formats, using data provided by sourcepos. In the output HTML, the first image or inline text element created by each Markdown source line is given an id attribute corresponding to the source line number. When the source is scrolled to a given line range, the preview scrolls to the corresponding HTML elements using these id attributes. When the preview is scrolled, the visible HTML elements are detected via the Intersection Observer API.⁴ Then their id attributes are used to determine the corresponding Markdown line range, and the source scrolls to those lines.

Scroll sync is slightly more complicated when working with output that is generated by executed code. For example, if a code block is executed and creates several plots in the preview, there isn't necessarily a way to trace each individual plot back to a particular line of code in the Markdown source. In such cases, the line range of the executed code is mapped proportionally to the vertical space occupied by its output.

Pandoc supports multi-file documents. It can be given a list of files to combine into a single output document. Codebraid Preview provides scroll sync for multi-file documents. For example, suppose a document is divided into two files in the same directory, chapter_1.md and chapter_2.md. Treating these as a single document involves creating a YAML configuration file _codebraid_preview.yaml that lists the files:

input-files:
- chapter_1.md
- chapter_2.md

Now launching a preview from either chapter_1.md or chapter_2.md will display a preview that combines both files. When the preview is scrolled, the editor scrolls to the corresponding source location, automatically switching between chapter_1.md and chapter_2.md depending on the part of the preview that is visible.

² See for example https://github.com/jgm/pandoc/issues/4565.
³ The Pandoc Roadmap at https://github.com/jgm/pandoc/wiki/Roadmap summarizes current commonmark_x capabilities.
⁴ For technical details, https://www.w3.org/TR/intersection-observer/. For an overview, https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API.
Fig. 1: Screenshot of a Markdown document with Codebraid Preview in VS Code. This document uses Codebraid to execute code with Jupyter kernels, so all plots and math visible in the preview are generated during document build.

The preview still works when the input format is set to a non-CommonMark format, but in that case scroll sync is disabled. If Pandoc adds sourcepos support for additional input formats in the future, scroll sync will work automatically once Codebraid Preview adds those formats to the supported list. It is possible to attempt to reconstruct a source map by performing a parallel string search on Pandoc output and the original source. This can be error-prone due to text manipulation during format conversion, but in the future it may be possible to construct a good enough source map to extend basic scroll sync support to additional input formats.

LaTeX support

Support for mathematics is one of the key features provided by many Markdown variants in Pandoc, including commonmark_x. Math support in the preview panel is supplied by KaTeX [EA22], which is a JavaScript library for rendering LaTeX math in the browser.

One of the disadvantages of using Pandoc to create the preview is that every update of the preview is a complete update. This makes the preview more sensitive to HTML rendering time. In contrast, in a Jupyter notebook, it is common to write Markdown in multiple cells which are rendered separately and independently. MathJax [Mat22] provides a broader range of LaTeX support than KaTeX, and is used in software such as JupyterLab and Quarto. While MathJax performance has improved significantly since the release of version 3.0 in 2019, KaTeX can still have a speed advantage, so it is currently the default due to the importance of HTML rendering. In the future, optional MathJax support may be needed to provide broader math support. For some applications, it may also be worth considering caching pre-rendered or image versions of equations to improve performance.

Code execution

Optional support for executing code embedded in Markdown documents is provided by Codebraid [GMP19]. Codebraid uses Pandoc to convert a document into an abstract syntax tree (AST), then extracts any inline or block code marked with Codebraid attributes from the AST, executes the code, and finally formats the code output so that Pandoc can use it to create the final output document. Code execution is performed with Codebraid's own built-in system or with Jupyter kernels. For example, the code block

```{.python .cb-run}
print("Hello *world!*")
```

would result in

Hello world!

after processing by Codebraid and finally Pandoc. The .cb-run is a Codebraid attribute that marks the code block for execution and specifies the default display of code output. Further examples of Codebraid usage are visible in Figure 1.

Mixing a live preview with executable code provides potential usability and security challenges. By default, code only runs when the user selects execution in the VS Code command palette or clicks the Codebraid execute button. When the preview automatically updates as a result of Markdown source changes, it only uses cached code output.
Stale cached output is detected by hashing executed code, and then marked in the preview to alert the user.

The standard approach to executing code within Markdown documents blocks the document build process until all code has finished running. Code is extracted from the Markdown source and executed. Then the output is combined with the original source and passed on to Pandoc or another Markdown application for final conversion. This is the approach taken by RMarkdown, Quarto, and similar software, as well as by Codebraid until recently. This design works well for building a document a single time, but blocking until all code has executed is not ideal in the context of a document preview.

Codebraid now offers a new mode of code execution that allows a document to be rebuilt continuously during code execution, with each build including all code output available at that time. This process involves the following steps:

1) The user selects code execution. Codebraid Preview passes the document to Codebraid. Codebraid begins code execution.
2) As soon as any code output is available, Codebraid immediately streams this back to Codebraid Preview. The output is in a format compatible with the YAML metadata block at the start of Pandoc Markdown documents. The output includes a hash of the code that was executed, so that code changes can be detected later.
3) If the document is modified while code is running or if code output is received, Codebraid Preview rebuilds the preview. It creates a copy of the document with all current Codebraid output inserted into the YAML metadata block at the start of the document. This modified document is then passed to Pandoc. Pandoc runs with a Lua filter⁵ that modifies the document AST before final conversion. The filter removes all code marked with Codebraid attributes from the AST, and replaces it with the corresponding code output stored in the AST metadata. If code has been modified since execution began, this is detected with the hash of the code, and an HTML class is added to the output that will mark it visually as stale output. Code that does not yet have output is replaced by a visible placeholder to indicate that code is still running. When the Lua filter finishes AST modifications, Pandoc completes the document build, and the preview updates.
4) As long as code is executing, the previous process repeats whenever the preview needs to be rebuilt.
5) Once code execution completes, the most recent output is reused for all subsequent preview updates until the next time the user chooses to execute code. Any code changes continue to be detected by hashing the code during the build process, so that the output can be marked visually as stale in the preview.
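The hash-based staleness check in steps 2 and 5 can be illustrated with a small sketch. This is not Codebraid's actual implementation, only the general idea of keying cached output by a hash of the code that produced it:

```python
# Minimal sketch: cached output is looked up by a hash of the code chunk,
# so any edit to the chunk makes the cached result count as stale.
import hashlib

def code_hash(code: str) -> str:
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

cached_output = {code_hash('print("Hello *world!*")'): "Hello *world!*"}

def output_for(chunk: str):
    key = code_hash(chunk)
    if key in cached_output:
        return cached_output[key], False       # up to date
    return "<calculating output...>", True     # stale or still running

print(output_for('print("Hello *world!*")'))
print(output_for('print("Hello, edited world!")'))
```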
The overall result of this process is twofold. First, building a document involving executed code is nearly as fast as building a plain Pandoc document. The additional output metadata plus the filter are the only extra elements involved in the document build, and Pandoc Lua filters have excellent performance. Second, the output for each code chunk appears in the preview almost immediately after the chunk finishes execution.

While this build process is significantly more interactive than what has been possible previously, it also suggests additional avenues for future exploration. Codebraid's built-in code execution system is designed to execute a predefined sequence of code chunks and then exit. Jupyter kernels are currently used in the same manner to avoid any potential issues with out-of-order execution. However, Jupyter kernels can receive and execute code indefinitely, which is how they commonly function in Jupyter notebooks. Instead of starting a new Jupyter kernel at the beginning of each code execution cycle, it would be possible to keep the kernel from the previous execution cycle and only pass modified code chunks to it. This would allow the same out-of-order execution issues that are possible in a Jupyter notebook. Yet that would make possible much more rapid code output, particularly in cases where large datasets must be loaded or significant preprocessing is required.

Conclusion

Codebraid Preview represents a significant advance in tools for writing with Pandoc. For the first time, it is possible to preview a Pandoc Markdown document using Pandoc itself while having features like scroll sync between the Markdown source and the preview. When embedded code needs to be executed, it is possible to see code output in the preview and to continue editing the document during code execution, instead of having to wait until code finishes running.

Codebraid Preview or future previewers that follow this approach may be perfectly adequate for shorter and even some longer documents, but at some point a combination of document length, document complexity, and mathematical content will strain what is possible and ultimately decrease preview update frequency. Every update of the preview involves converting the entire document with Pandoc and then rendering the resulting HTML.

On the parsing side, Pandoc's move toward CommonMark-based Markdown variants may eventually lead to enough standardization that other implementations with the same syntax and features are possible. This in turn might enable entirely new approaches. An ideal scenario would be a Pandoc-compatible JavaScript-based parser that can parse multiple Markdown strings while treating them as having a shared document state for things like labels, references, and numbering. For example, this could allow Pandoc Markdown within a Jupyter notebook, with all Markdown content sharing a single document state, maybe with each Markdown cell being automatically updated based on Markdown changes elsewhere.

Perhaps more practically, on the preview display side, there may be ways to optimize how the HTML generated by Pandoc is loaded in the preview. A related consideration might be alternative preview formats. There is a significant tradition of tight source-preview integration in LaTeX (for example, [Lau08]). In principle, Pandoc's sourcepos extension should make possible Markdown to PDF synchronization, using LaTeX as an intermediary.

⁵ For an overview of Lua filters, see https://pandoc.org/lua-filters.html.

REFERENCES

[EA22] Emily Eisenberg and Sophie Alpert. KaTeX: The fastest math typesetting library for the web, 2022. URL: https://katex.org/.
[GMP19] Geoffrey M. Poore. Codebraid: Live Code in Pandoc Markdown. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 54–61, 2019. doi:10.25080/Majora-7ddc1dd1-008.
[GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021.3059263.
[JM22] John MacFarlane. Pandoc: a universal document converter, 2006–2022. URL: https://pandoc.org/.
[Jup22] Jupyter Development Team. nbconvert: Convert Notebooks to other formats, 2015–2022. URL: https://nbconvert.readthedocs.io.
[Lau08] Jérôme Laurens. Direct and reverse synchronization with SyncTeX. TUGboat, 29(3):365–371, 2008.
[Mat22] MathJax. MathJax: Beautiful and accessible math in all browsers, 2009–2022. URL: https://www.mathjax.org/.
[MWtJT20] Marc Wouts and the Jupytext Team. Jupyter notebooks as Markdown documents, Julia, Python or R scripts, 2018–2020. URL: https://jupytext.readthedocs.io/.
[RSt20] RStudio Inc. R Markdown, 2016–2020. URL: https://rmarkdown.rstudio.com/.
[RSt22] RStudio Inc. Welcome to Quarto, 2022. URL: https://quarto.org/.
[YX15] Yihui Xie. Dynamic Documents with R and knitr. Chapman & Hall/CRC Press, 2015.

Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder

Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn

Abstract—It is often much easier and less expensive to collect data than to label it. Active learning (AL) ([Set09]) responds to this issue by selecting which unlabeled data are best to label next. Standard approaches utilize task-aware AL, which identifies informative samples based on a trained supervised model. Task-agnostic AL ignores the task model and instead makes selections based on learned properties of the dataset. We seek to combine these approaches and measure the contribution of incorporating task-agnostic information into standard AL, with the suspicion that the extra information in the task-agnostic features may improve the selection process. We test this on various AL methods using a ResNet classifier with and without added unsupervised information from a variational autoencoder (VAE). Although the results do not show a significant improvement, we investigate the effects on the acquisition function and suggest potential approaches for extending the work.

Index Terms—active learning, variational autoencoder, deep learning, pytorch, semi-supervised learning, unsupervised learning

Introduction

In deep learning, the capacity for data gathering often significantly outpaces the labeling. This is easily observed in the field of bioimaging, where ground-truth labeling usually requires the expertise of a clinician. For example, producing a large quantity of CT scans is relatively simple, but having them labeled for COVID-19 by cardiologists takes much more time and money. These constraints ultimately limit the contribution of deep learning to many crucial research problems.

This labeling issue has compelled advancements in the field of active learning (AL) ([Set09]). In a typical AL setting, there is a set of labeled data and a (usually larger) set of unlabeled data. A model is trained on the labeled data, then the model is analyzed to evaluate which unlabeled points should be labeled to best improve the loss objective after further training. AL acknowledges labeling constraints by specifying a budget of points that can be labeled at a time and evaluating against this budget.
In AL, the model for which we select new labels is referred to as the task model. If this model is a classifier neural network, the space in which it maps inputs before classifying them is known as the latent space or representation space. A recent branch of AL ([SS18], [SCN+18], [YK19]), prominent for its applications to deep models, focuses on mapping unlabeled points into the task model's latent space before comparing them.

These methods are limited in their analysis by the labeled data they must train on, failing to make use of potentially useful information embedded in the unlabeled data. We therefore suggest that this family of methods may be improved by extending their representation spaces to include unsupervised features learned over the entire dataset. For this purpose, we opt to use a variational autoencoder (VAE) ([KW13]), which is a prominent method for unsupervised representation learning. Our main contributions are (a) a new methodology for extending AL methods using VAE features and (b) an experiment comparing AL performance across two recent feature-based AL methods using the new method.

† These authors contributed equally.
* Corresponding author: cmgodwin263@gmail.com, meekail.zain@uga.edu
‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA
§ Department of Computer Science, University of Georgia, Athens, GA 30602 USA
¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 USA

Copyright © 2022 Curtis Godwin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Related Literature

Active learning

Much of the early active learning (AL) literature is based on shallower, less computationally demanding networks since deeper architectures were not well-developed at the time. Settles ([Set09]) provides a review of these early methods. The modern approach uses an acquisition function, which involves ranking all available unlabeled points by some chosen heuristic H and choosing to label the points of highest ranking.

The popularity of the acquisition approach has led to a widely-used evaluation procedure, which we describe in Algorithm 1. This procedure trains a task model T on the initial labeled data, records its test accuracy, then uses H to label a set of unlabeled points. We then once again train T on the labeled data and record its accuracy. This is repeated until a desired number of labels is reached, and then the accuracies can be graphed against the number of available labels to demonstrate performance over the course of labeling. We can use this evaluation algorithm to separately evaluate multiple acquisition functions on their resulting accuracy graphs. This is utilized in many AL papers to show the efficacy of their suggested heuristics in comparison to others ([WZL+16], [SS18], [SCN+18], [YK19]).
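The evaluation loop can be sketched as follows; this is an illustration of the procedure described above rather than the authors' code, with the task-model training, test evaluation, and heuristic H passed in as placeholder callables:

```python
# Minimal sketch of the accuracy-vs-labels evaluation procedure.
def active_learning_curve(train_fn, eval_fn, heuristic,
                          labeled, unlabeled, budget=3000, steps=8):
    accuracies = []
    for _ in range(steps):
        train_fn(labeled)                           # (re)train task model T on current labels
        accuracies.append(eval_fn())                # record its test accuracy
        ranked = sorted(unlabeled, key=heuristic, reverse=True)
        chosen, unlabeled = ranked[:budget], ranked[budget:]
        labeled = labeled + chosen                  # "label" the top-ranked points
    return accuracies
```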
More specifically, for each batch of labeled data Lbatch ⊂ L The prevailing approach to point selection has been to choose that is propagated through T during training, the batch of true unlabeled points for which the model is most uncertain, the as- losses is computed and split randomly into a batch of pairs Pbatch . sumption being that uncertain points will be the most informative The loss prediction network produces a corresponding batch of ([BRK21]). A popular early method was to label the unlabeled predicted loss pairs, denoted Pebatch . The following pair loss is then points of highest Shannon entropy ([Sha48]) under the task model, computed given each p ∈ Pbatch and its corresponding p̃ ∈ Pebatch : which is a measure of uncertainty between the classes of the data. This method is now more commonly used in combination L pair (p, p̃) = max(0, −I (p) · ( p̃(1) − p̃(2) ) + ξ ), (3) with a representativeness measure ([WZL+ 16]) to avoid selecting where I is the following indicator function for pair inequality: condensed clusters of very similar points. ( 1, p(1) > p(2) I (p) = . (4) Recent heuristics using deep features −1, p(1) ≤ p(2) For convolutional neural networks (CNNs) in image classification settings, the task model T can be decomposed into a feature- Variational Autoencoders generating module Variational autoencoders (VAEs) ([KW13]) are an unsupervised T f : Rn → R f , method for modeling data using Bayesian posterior inference. We begin with the Bayesian assumption that the data is well- which maps the input data vectors to the output of the final fully modeled by some distribution, often a multivariate Gaussian. We connected layer before classification, and a classification module also assume that this data distribution can be inferred reasonably well by a lower dimensional random variable, also often modeled Tc : R f → {0, 1, ..., c}, by a multivariate Gaussian. where c is the number of classes. The inference process then consists of an encoding into the Recent deep learning-based AL methods have approached the lower dimensional latent variable, followed by a decoding back notion of model uncertainty in terms of the rich features generated into the data dimension. We parametrize both the encoder and the by the learned model. Core-set ([SS18]) and MedAL ([SCN+ 18]) decoder as neural networks, jointly optimizing their parameters select unlabeled points that are the furthest from the labeled set with the following loss function ([KW19]): in terms of L2 distance between the learned features. For core-set, Lθ ,φ (x) = log pθ (x|z) + [log pθ (z) − log qφ (z|x)], (5) each point constructing the set S in step 6 of Algorithm 1 is chosen by where θ and φ are the parameters of the encoder and the decoder, u∗ = argmax min ||(T f (u) − T f (``))||2 , (1) respectively. The first term is the reconstruction error, penalizing u∈U ` ∈L the parameters for producing poor reconstructions of the input where U is the unlabeled set and L is the labeled set. The data. The second term is the regularization error, encouraging the analogous operation for MedAL is encoding to resemble a pre-selected prior distribution, commonly a unit Gaussian prior. 1 |L| The encoder of a well-optimized VAE can be used to gen- u∗ = argmax u∈U ∑ ||T f (u) − T f (Li )||2 . |L| i=1 (2) erate latent encodings with rich features which are sufficient to approximately reconstruct the data. 
The features also have some Note that after a point u∗ is chosen, the selection of the next point geometric consistency, in the sense that the encoder is encouraged assumes the previous u∗ to be in the labeled set. This way we to generate encodings in the pattern of a Gaussian distribution. discourage choosing sets that are closely packed together, leading to sets that are more diverse in terms of their features. This effect is more pronounced in the core-set method since it takes the Methods minimum distance whereas MedAL uses the average distance. We observe that the notions of uncertainty developed in the core- Another recent method ([YK19]) trains a regression network set and MedAL methods rely on distances between feature vectors to predict the loss of the task model, then takes the heuristic H modeled by the task model T . Additionally, loss prediction relies in Algorithm 1 to select the unlabeled points of highest predicted on a fully connected layer mapping from a feature space to a single loss. To implement this, the loss prediction network P is attached value, producing different predictions depending on the values of to a ResNet task model T and is trained jointly with T . The the relevant feature vector. Thus all of these methods utilize spatial inputs to P are the features output by the ResNet’s four residual reasoning in a vector space. blocks. These features are mapped into the same dimensionality Furthermore, in each of these methods, the heuristic H only via a fully connected layer and then concatenated to form a has access to information learned by the task model, which is 112 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) trained only on the labeled points at a given timestep in the la- ensure that the task models being compared were supplied with beling procedure. Since variational autoencoder (VAE) encodings the same initial set of labels. are not limited by the contents of the labeled set, we suggest that With four NVIDIA 2080 GPUs, the total runtime for the the aforementioned methods may benefit by expanding the vector MNIST experiments was 5113s for core-set and 4955s for loss spaces they investigate to include VAE features learned across prediction; for ChestMNIST, the total runtime was 7085s for core- the entire dataset, including the unlabeled data. These additional set and 7209s for loss prediction. features will constitute representative and previously inaccessible information regarding the data, which may improve the active learning process. We implement this by first training a VAE model V on the given dataset. V can then be used as a function returning the VAE features for any given datapoint. We append these additional features to the relevant vector spaces using vector concatenation, an operation we denote with the symbol _. The modified point selection operation in core-set then becomes u∗ = argmax min ||([T f (u) _ αV (u)] − [T f (``) _ αV (``)]||2 , u∈U ` ∈L (6) where α is a hyperparameter that scales the influence of the VAE features in computing the vector distance. To similarly modify the loss prediction method, we concatenate the VAE features to the Fig. 1: The average MNIST results using the core-set heuristic versus final ResNet feature concatenation c before the loss prediction, the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs. so that the extra information is factored into the training of the prediction network P. 
Experiments

In order to measure the efficacy of the newly proposed methods, we generate accuracy graphs using Algorithm 1, freezing all settings except the selection heuristic H. We then compare the performance of the core-set and loss prediction heuristics with their VAE-augmented counterparts.

We use ResNet-18 pretrained on ImageNet as the task model, using the SGD optimizer with learning rate 0.001 and momentum 0.9. We train on the MNIST ([Den12]) and ChestMNIST ([YSN21]) datasets. ChestMNIST consists of 112,120 chest X-ray images resized to 28x28 and is one of several benchmark medical image datasets introduced in ([YSN21]).

For both datasets we experiment on randomly selected subsets, using 25000 points for MNIST and 30000 points for ChestMNIST. In both cases we begin with 3000 initial labels and label 3000 points per active learning step. We opt to retrain the task model after each labeling step instead of fine-tuning.

We use a similar training strategy as in ([SCN+18]), training the task model until >99% train accuracy before selecting new points to label. This ensures that the ResNet is similarly well fit to the labeled data at each labeling iteration. This is implemented by training for 10 epochs on the initial training set and increasing the training epochs by 5 after each labeling iteration.

The VAEs used for the experiments are trained for 20 epochs using an Adam optimizer with learning rate 0.001 and weight decay 0.005. The VAE encoder architecture consists of four convolutional downsampling filters and two linear layers to learn the low dimensional mean and log variance. The decoder consists of an upsampling convolution and four size-preserving convolutions to learn the reconstruction.

Experiments were run five times, each with a separate set of randomly chosen initial labels, with the displayed results showing the average validation accuracies across all runs. Figures 1 and 3 show the core-set results, while Figures 2 and 4 show the loss prediction results. In all cases, shared random seeds were used to ensure that the task models being compared were supplied with the same initial set of labels.

With four NVIDIA 2080 GPUs, the total runtime for the MNIST experiments was 5113s for core-set and 4955s for loss prediction; for ChestMNIST, the total runtime was 7085s for core-set and 7209s for loss prediction.

Fig. 1: The average MNIST results using the core-set heuristic versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.

Fig. 2: The average MNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs.

Fig. 3: The average ChestMNIST results using the core-set heuristic versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.

Fig. 4: The average ChestMNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs.

To investigate the qualitative difference between the VAE and non-VAE approaches, we performed an additional experiment to visualize an example of core-set selection. We first train the ResNet-18 with the same hyperparameter settings on 1000 initial labels from the ChestMNIST dataset, then randomly choose 1556 (5%) of the unlabeled points from which to select 100 points to label. These smaller sizes were chosen to promote visual clarity in the output graphs. We use t-SNE ([VdMH08]) dimensionality reduction to show the ResNet features of the labeled set, the unlabeled set, and the points chosen to be labeled by core-set.
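A sketch of this qualitative check with scikit-learn's t-SNE follows; the feature array and selection indices below are random stand-ins for the actual ResNet features and core-set selections:

```python
# Minimal sketch: project features with t-SNE and highlight the selected points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(1656, 512)                           # stand-in for ResNet features
chosen = np.random.choice(len(features), 100, replace=False)   # stand-in for core-set picks

embedded = TSNE(n_components=2, init="pca").fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], s=5, alpha=0.3, label="all points")
plt.scatter(embedded[chosen, 0], embedded[chosen, 1], s=15, label="selected by core-set")
plt.legend()
plt.show()
```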
Fig. 5: A t-SNE visualization of the ChestMNIST points chosen by core-set.

Fig. 6: A t-SNE visualization of the ChestMNIST points chosen by core-set when the ResNet features are augmented with VAE features.

Discussion

Overall, the VAE-augmented active learning heuristics did not exhibit a significant performance difference when compared with their counterparts. The only case of a significant p-value (<0.05) occurred during loss prediction on the MNIST dataset at 21000 labels.

The t-SNE visualizations in Figures 5 and 6 show some of the influence that the VAE features have on the core-set selection process. In Figure 5, the selected points tend to be more spread out, while in Figure 6 they cluster at one edge. This appears to mirror the transformation of the rest of the data, which is more spread out without the VAE features, but becomes condensed in the center when they are introduced, approaching the shape of a Gaussian distribution. It seems that with the added VAE features, the selected points are further out of distribution in the latent space. This makes sense because points tend to be more sparse at the tails of a Gaussian distribution and core-set prioritizes points that are well-isolated from other points.

One reason for the lack of performance improvement may be the homogeneous nature of the VAE, where the optimization goal is reconstruction rather than classification. This could be improved by using a multimodal prior in the VAE, which may do a better job of modeling relevant differences between points.

Conclusion

Our original intuition was that additional unsupervised information may improve established active learning methods, especially when using a modern unsupervised representation method such as a VAE. The experimental results did not support this hypothesis, but additional investigation of the VAE features showed a notable change in the task model latent space. Though this did not result in superior point selections in our case, it is of interest whether different approaches to latent space augmentation in active learning may fare better.

Future work may explore the use of class-conditional VAEs in a similar application, since a VAE that can utilize the available class labels may produce more effective representations, and it could be retrained along with the task model after each labeling iteration.

REFERENCES

[BRK21] Samuel Budd, Emma C Robinson, and Bernhard Kainz. A survey on active learning and human-in-the-loop deep learning for medical image analysis. Medical Image Analysis, 71:102062, 2021. doi:10.1016/j.media.2021.102062.
[Den12] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. doi:10.1109/MSP.2012.2211477.
[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[KW19] Diederik P. Kingma and Max Welling. An Introduction to Variational Autoencoders. Now Publishers, 2019. doi:10.1561/9781680836233.
[SCN+18] Asim Smailagic, Pedro Costa, Hae Young Noh, Devesh Walawalkar, Kartik Khandelwal, Adrian Galdran, Mostafa Mirshekari, Jonathon Fagert, Susu Xu, Pei Zhang, et al. MedAL: Accurate and robust deep active learning for medical image analysis. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 481–488. IEEE, 2018. doi:10.1109/icmla.2018.00078.
[Set09] Burr Settles. Active learning literature survey. 2009.
[Sha48] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
[SS18] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=H1aIuk-RW.
[VdMH08] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[WZL+16] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016. doi:10.1109/tcsvt.2016.2589879.
[YK19] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019. doi:10.1109/CVPR.2019.00018.
[YSN21] Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195, 2021. doi:10.1109/ISBI48211.2021.9434062.

Awkward Packaging: building Scikit-HEP

Henry Schreiner, Jim Pivarski, Eduardo Rodrigues

Abstract—Scikit-HEP has grown rapidly over the last few years, not just to serve the needs of the High Energy Physics (HEP) community, but in many ways, the Python ecosystem at large. AwkwardArray, boost-histogram/hist, and iminuit are examples of libraries that are used beyond the original HEP focus. In this paper we will look at key packages in the ecosystem, and how the collection of 30+ packages was developed and maintained. Also we will look at some of the software ecosystem contributions made to packages like cibuildwheel, pybind11, nox, scikit-build, build, and pipx that support this effort. We will also discuss the Scikit-HEP developer pages and initial WebAssembly support.

Index Terms—packaging, ecosystem, high energy physics, community project

Introduction

High Energy Physics (HEP) has always had intense computing needs due to the size and scale of the data collected. The World Wide Web was invented at the CERN Physics laboratory in Switzerland in 1989 when scientists in the EU were trying to communicate results and datasets with scientists in the US, and vice-versa [LCC+09]. Today, HEP has the largest scientific machine in the world, at CERN: the Large Hadron Collider (LHC), 27 km in circumference [EB08], with multiple experiments with thousands of collaborators processing over a petabyte of raw data every day, with 100 petabytes being stored per year at CERN. This is one of the largest scientific datasets in the world, of exabyte scale [PJ11], which is roughly comparable in order of magnitude to all of astronomy or YouTube [SLF+15].
Some of these were HEP machine in the world, at CERN: the Large Hadron Collider (LHC), specific: ROOT is also a data format, so users needed to be able 27 km in circumference [EB08], with multiple experiments with to read data from ROOT files. Others were less specific: HEP thousands of collaborators processing over a petabyte of raw data users have intense histogram requirements due to the data sizes, every day, with 100 petabytes being stored per year at CERN. This large portions of HEP data are "jagged" rather than rectangular; is one of the largest scientific datasets in the world of exabyte scale vector manipulation was important (especially Lorenz Vectors, a [PJ11], which is roughly comparable in order of magnitude to all four dimensional relativistic vector with a non-Euclidean metric); of astronomy or YouTube [SLF+ 15]. and data fitting was important, especially with complex models In the mid nineties, HEP users were beginning to look for and accurate error estimation. a new language to replace Fortran. A few HEP scientists started investigating the use of Python around the release of 1.0.0 in 1994 Beginnings of a scikit [Tem22]. A year later, the ROOT project for an analysis toolkit (and framework) was released, quickly making C++ the main In 2016, the ecosystem for Python in HEP was rather fragmented. language for HEP. The ROOT project also needed an interpreted Physicists were developing tools in isolation, without knowing language to driving analysis code. Python was rejected for this role out the overlaps with other tools, and without making them due to being "exotic" at the time, and because it was considered too interoperable. There were a handful of popular packages that much to ask physicists to code in two languages. Instead, ROOT were useful in HEP spread around among different authors. The provided a C++ interpreter, called CINT, which later was replaced ROOTPy project had several packages that made the ROOT- with Cling, which is the basis for the clang-repl project in LLVM Python bridge a little easier than the built-in PyROOT, such as the today [IVL22]. root-numpy and related root-pandas packages. The C++ MINUIT Python would start showing up in the late 90’s in experiment fitting library was integrated into ROOT, but the iminuit package frameworks as a configuration language. These frameworks were [Dea20] provided an easy to install standalone Python package primarily written in C++, but were made of many configurable with an extracted copy of MINUIT. Several other specialized standalone C++ packages had bindings as well. Many of the initial * Corresponding author: henryfs@princeton.edu authors were transitioning to a less-code centric role or leaving ‡ Princeton University § University of Liverpool for industry, leaving projects like ROOTPy and iminuit without maintainers. Copyright © 2022 Henry Schreiner et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, 1. Almost 20 years later ROOT’s Python bindings have been rewritten for which permits unrestricted use, distribution, and reproduction in any medium, easier Pythonizations, and installing ROOT in Conda is now much easier, provided the original author and source are credited. thanks in large part to efforts from Scikit-HEP developers. 116 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) later writer) that could remove the initial conversion environment by simply pip installing a package. 
It also had a simple, Pythonic numpythia interface and produced outputs Python users could immediately use, like NumPy arrays, instead of PyROOT’s wrapped C++ pyhepmc nndrone pointers. Uproot needed to do more than just be file format reader/writer; it needed to provide a way to represent the special pylhe structure and common objects that ROOT files could contain. This lead to the development of two related packages that would hepunits support uproot. One, uproot-methods, included Pythonic access to functionality provided by ROOT for its core classes, like spatial and Lorentz vectors. The other was AwkwardArray, which would uhi grow to become one of the most important and most general histoprint packages in Scikit-HEP. This package allows NumPy-like idioms for array-at-a-time manipulation on jagged data structures. A jagged array is a (possibly structured) array with a variable length dimension. These are very common and relevant in HEP; events have a variable number of tracks, tracks have a variable number Fig. 1: The Scikit-HEP ecosystem and affiliated packages. of hits in the detector, etc. Many other fields also have jagged data structures. While there are formats to store such structures, computations on jagged structures have usually been closer to SQL Eduardo Rodrigues, a scientist working on the LHCb ex- queries on multiple tables than direct object manipulation. Pandas periment for the University of Cincinnati, started working on a handles this through multiple indexing and a lot of duplication. package called scikit-hep that would provide a set to tools useful Uproot was a huge hit with incoming HEP students (see Fig 2); for physicists working on HEP analysis. The initial version of the suddenly they could access HEP data using a library installed with scikit-hep package had a simple vector library, HEP related units pip or conda and no external compiler or library requirements, and and conversions, several useful statistical tools, and provenance could easily use tools they already knew that were compatible with recording functionality, the Python buffer protocol, like NumPy, Pandas and the rapidly He also placed the scikit-hep GitHub repository in a Scikit- growing machine learning frameworks. There were still some gaps HEP GitHub organization, and asked several of the other HEP and pain points in the ecosystem, but an analysis without writing related packages to join. The ROOTPy project was ending, with C++ (interpreted or compiled) and compiling ROOT manually was the primary author moving on, and so several of the then-popular finally possible. Scikit-HEP did not and does not intend to replace packages2 that were included in the ROOTPy organization were ROOT, but it provides alternative solutions that work natively in happily transferred to Scikit-HEP. Several other existing HEP the Python "Big Data" ecosystem. libraries, primarily interfacing to existing C++ simulation and Several other useful HEP libraries were also written. Particle tracking frameworks, also joined, like PyJet and NumPythia. Some was written for accessing the Particle Data Group (PDG) particle of these libraries have been retired or replaced today, but were an data in a simple and Pythonic way. DecayLanguage originally important part of Scikit-HEP’s initial growth. provided tooling for decay definitions, but was quickly expanded to include tools to read and validate "DEC" decay files, an existing First initial success text format used to configure simulations in HEP. 
In 2016, the largest barrier to using Python in HEP in a Pythonic way was ROOT. It was challenging to compile, had many non- Building compiled packages Python dependencies, was huge compared to most Python li- braries, and didn’t play well with Python packaging. It was not In 2018, HEP physicist and programmer Hans Dembinski pro- Pythonic, meaning it had very little support for Python protocols posed a histogram library to the Boost libraries, the most influen- like iteration, buffers, keyword arguments, tab completion and tial C++ library collection; many additions to the standard library inspect in, dunder methods, didn’t follow conventions for useful are based on Boost. Boost.Histogram provided a histogram-as- reprs, and Python naming conventions; it was simply a direct on- an-object concept from HEP, but was designed around C++14 demand C++ binding, including pointers. Many Python analyses templating, using composable axes and storage types. It originally started with a "convert data" step using PyROOT to read ROOT had an initial Python binding, written in Boost::Python. Henry files and convert them to a Python friendly format like HDF5. Schreiner proposed the creation of a standalone binding to be Then the bulk of the analysis would use reproducible Python written with pybind11 in Scikit-HEP. The original bindings were virtual environments or Conda environments. removed, Boost::Histogram was accepted into the Boost libraries, This changed when Jim Pivarski introduced the Uproot pack- and work began on boost-histogram. IRIS-HEP, a multi-institution age, a pure-Python implementation of a ROOT file reader (and project for sustainable HEP software, had just started, which was providing funding for several developers to work on Scikit-HEP 2. The primary package of the ROOTPy project, also called ROOTPy, was project packages such as this one. This project would pioneer not transferred, but instead had a final release and then died. It was an standalone C++ library development and deployment for Scikit- inspiration for the new PyROOT bindings, and influenced later Scikit-HEP HEP. packages like mplhep. The transferred libraries have since been replaced by integrated ROOT functionality. All these packages required ROOT, which is There were already a variety of attempts at histogram libraries, not on PyPI, so were not suited for a Python-centric ecosystem. but none of them filled the requirements of HEP physicists: AWKWARD PACKAGING: BUILDING SCIKIT-HEP 117 ROOT (C++ and PyROOT) (as a baseline for scale) Scientific Python P HE Scikit-HEP in on CMSSW config th (Python but not data analysis) Py c ntifi ie Sc PyROOT of ag es e ack Us EPp kit -H ci of S Use Fig. 2: Adoption of scientific Python libraries and Scikit-HEP among members of the CMS experiment (one of the four major LHC experiments). CMS requires users to fork github:cms-sw/cmssw, which can be used to identify 3484 physicist users, who created 16656 non-fork repos. This plot quantifies adoption by counting "#include X", "import X", and "from X import" strings in the users’ code to measure adoption of various libraries (most popular by category are shown). bo lhep gram, com mainstream Python adoption to in HEP: when many histogram hist st::His libraries lived and died , mp Boo ROOT histogram part of ROOT (395 C++ files) YODA histograms histograms YODA in rootpy in Coffea Fig. 3: Developer activity on histogram libraries in HEP: number of unique committers to each library per month, smoothed (derived from git logs). 
Illustrates the convergence of a fractured community (around 2017) into a unified one (now). fills on pre-existing histograms, simple manipulation of multi- pybind11. dimensional histograms, competitive performance, and easy to The first stand-alone development was azure-wheel-helpers, a install in clusters or for students. Any new attempt here would set of files that helped produce wheels on the new Azure Pipelines have to be clearly better than the existing collection of diverse platform. Building redistributable wheels requires a variety of attempts (see Fig 3). The development of a library with compiled techniques, even without shared libraries, that vary dramatically components intended to be usable everywhere required good between platforms and were/are poorly documented. On Linux, support for building libraries that was lacking both in Scikit- everything needs to be built inside a controlled manylinux image, HEP and to an extent the broader Python ecosystem. Previous and post-processed by the auditwheel tool. On macOS, this in- advancements in the packaging ecosystem, such as the wheel cludes downloading an official CPython binary for Python to allow format for distributing binary platform dependent Python packages older versions of macOS to be targeted (10.9+), several special and the manylinux specification and docker image that allowed a environment variables, especially when cross compiling to Apple single compiled wheel to target many distributions of Linux, but Silicon, and post processing with the develwheel tool. Windows is there still were many challenges to making a library redistributable the simplest, as most versions of CPython work identically there. on all platforms. azure-wheel-helpers worked well, and was quickly adapted for The boost-histogram library only depended on header-only the other packages in Scikit-HEP that included non-ROOT binary components of the Boost libraries, and the header-only pybind11 components. Work here would eventually be merged into the package, so it was able to avoid a separate compile step or existing and general cibuildwheel package, which would become linking to external dependencies, which simplified the initial build the build tool for all non-ROOT binary packages in Scikit-HEP, as process. All needed files were collected from git submodules and well as over 600 other packages like matplotlib and numpy, and packed into a source distribution (SDist), and everything was built was accepted into the PyPA (Python Packaging Authority). using only setuptools, making build-from-source simple on any The second major development was the upstreaming of CI system supporting C++14. This did not include RHEL 7, a popular and build system developments to pybind11. Pybind11 is a C++ platform in HEP at the time, and on any platform building could API for Python designed for writing a binding to C++, and take several minutes and required several gigabytes of memory provided significant benefits to our packages over (mis)-using to resolve the heavy C++ templating in the Boost libraries and Cython for bindings; Cython was designed to transpile a Python- 118 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) like language to C (or C++), and just happened to support bindings since you can call C and C++ from it, but it was not what it Boost::Histogram was designed for. Benefits of pybind11 included reduced code thin wrapper complexity and duplication, no pre-process step (cythonize), no need to pin NumPy when building, and a cross-package API. 
The boost-histogram iMinuit package was later moved from Cython to pybind11 as fully featured well, and pybind11 became the Scikit-HEP recommended binding tool. We contributed a variety of fixes and features to pybind11, hist including positional-only and keyword-only arguments, the option plotting in to prepend to the overload chain, and an API for type access Matplotlib and manipulation. We also completely redesigned CMake inte- gration, added a new pure-Setuptools helpers file, and completely mplhep plotting in terminal redesigned the CI using GitHub Actions, running over 70 jobs on a variety of systems and compilers. We also helped modernize and histoprint improve all the example projects with simpler builds, new CI, and cibuildwheel support. This example of a project with binary components being Fig. 4: The collection of histogram packages and related packages in usable everywhere then encouraged the development of Awkward Scikit-HEP. 1.0, a rewrite of AwkwardArray replacing the Python-only code with compiled code using pybind11, fixing some long-standing limitations, like an inability to slice past two dimensions or select broader HEP ecosystem. The affiliated classification is also used "n choose k" for k > 5; these simply could not be expressed on broader ecosystem packages like pybind11 and cibuildwheel using Awkward 0’s NumPy expressions, but can be solved with that we recommend and share maintainers with. custom compiled kernels. This also enabled further developments in backends [PEL20]. Histogramming was designed to be a collection of specialized packages (see Fig. 4) with carefully defined interoperability; boost-histogram for manipulation and filling, Hist for a user- Broader ecosystem friendly interface and simple plotting tools, histoprint for display- Scikit-HEP had become a "toolset" for HEP analysis in Python, a ing histograms, and the existing mplhep and uproot packages also collection of packages that worked together, instead of a "toolkit" needed to be able to work with histograms. This ecosystem was like ROOT, which is one monopackage that tries to provide every- built and is held together with UHI, which is a formal specification thing [R+ 20]. A toolset is more natural in the Python ecosystem, agreed upon by several developers of different libraries, backed by where we have good packaging tools and many existing libraries. a statically typed Protocol, for a PlottableHistogram object. Pro- Scikit-HEP only needed to fill existing gaps, instead of covering ducers of histograms, like boost-histogram/hist and uproot provide every possible aspect of an analysis like ROOT did. The original objects that follow this specification, and users of histograms, scikit-hep package had its functionality pulled out into existing or such as mplhep and histoprint take any object that follows this new separate packages such as HEPUnits and Vector, and the core specification. The UHI library is not required at runtime, though it scikit-hep package instead became a metapackage with no unique does also provide a few simple utilities to help a library also accept functionality on its own. Instead, it installs a useful subset of our ROOT histograms, which do not (currently) follow the Protocol, so libraries for a physicist wanting to quickly get started on a new several libraries have decided to include it at runtime too. By using analysis. 
a static type checker like MyPy to statically enforce a Protocol, Scikit-HEP was quickly becoming the center of HEP specific libraries that can communicate without depending on each other Python software (see Fig. 1). Several other projects or packages or on a shared runtime dependency and class inheritance. This has joined Scikit-HEP iMinuit, a popular HEP and astrophysics fitting been a great success story for Scikit-HEP, and We expect Protocols library, was probably the most widely used single package to to continue to be used in more places in the ecosystem. have joined. PyHF and cabinetry also joined; these were larger The design for Scikit-HEP as a toolset is of many parts that frameworks that could drive a significant part of an analysis all work well together. One example of a package pulling together internally using other Scikit-HEP tools. many components is uproot-browser, a tool that combines uproot, Other packages, like GooFit, Coffea, and zFit, were not added, Hist, and Python libraries like textual and plotext to provide a but were built on Scikit-HEP packages and had developers work- terminal browser for ROOT files. ing closely with Scikit-HEP maintainers. Scikit-HEP introduced Scikit-HEP’s external contributions continued to grow. One of an "affiliated" classification for these packages, which allowed the most notable ones was our work on cibuildwheel. This was an external package to be listed on the Scikit-HEP website a Python package that supported building redistributable wheels and encouraged collaboration. Coffea had a strong influence on multiple CI systems. Unlike our own azure-wheel-helpers or on histogram design, and zFit has contributed code to Scikit- the competing multibuild package, it was written in Python, so HEP. Currently all affiliated packages have at least one Scikit- good practices in Python package design could apply, like unit HEP developer as a maintainer, though that is currently not a and integration tests, static checks, and it was easy to remain requirement. An affiliated package fills a particular need for the independent of the underlying CI system. Building wheels on community. Scikit-HEP doesn’t have to, or need to, attempt to Linux requires a docker image, macOS requires the python.org develop a package that others are providing, but rather tries to Python, and Windows can use any copy of Python - cibuildwheel ensure that the externally provided package works well with the uses this to supply Python in all cases, which keeps it from AWKWARD PACKAGING: BUILDING SCIKIT-HEP 119 depending on the CI’s support for a particular Python version. We helpful for monitoring adoption of the developer pages, especially merged our improvements to cibuildwheel, like better Windows newer additions, across the Scikit-HEP packages. This package support, VCS versioning support, and better PEP 518 support. was then implemented directly into the Scikit-HEP pages, using We dropped azure-wheel-helpers, and eventually a scikit-build Pyodide to run Python in WebAssembly directly inside a user’s maintainer joined the cibuildwheel project. cibuildwheel would browser. Now anyone visiting the page can enter their repository go on to join the PyPA, and is now in use in over 600 packages, and branch, and see the adoption report in a couple of seconds. including numpy, matplotlib, mypy, and scikit-learn. 
Our continued contributions to cibuildwheel included a Working toward the future TOML-based configuration system for cibuildwheel 2.0, an over- Scikit-HEP is looking toward the future in several different areas. ride system to make supporting multiple manylinux and musllinux We have been working with the Pyodide developers to support targets easier, a way to build directly from SDists, an option to use WebAssembly; boost-histogram is compiled into Pyodide 0.20, build instead of pip, the automatic detection of Python version and Pyodide’s support for pybind11 packages is significantly bet- requirements, and better globbing support for build specifiers. We ter due to that work, including adding support for C++ exception also helped improve the code quality in various ways, including handling. PyHF’s documentation includes a live Pyodide kernel, fully statically typing the codebase, applying various checks and and a try-pyhf site (based on the repo-review tool) lets users run style controls, automating CI processes, and improving support for a model without installing anything - it can even be saved as a special platforms like CPython 3.8 on macOS Apple Silicon. webapp on mobile devices. We also have helped with build, nox, pyodide, and many other We have also been working with Scikit-Build to try to provide packages, improving the tooling we depend on to develop scikit- a modern build experience in Python using CMake. This project build and giving back to the community. is just starting, but we expect over the next year or two that the usage of CMake as a first class build tool for binaries in The Scikit-HEP Developer Pages Python will be possible using modern developments and avoiding A variety of packaging best practices were coming out of the distutils/setuptools hacks. boost-histogram work, supporting both ease of installation for users as well as various static checks and styling to keep the Summary package easy to maintain and reduce bugs. These techniques The Scikit-HEP project started in Autumn 2016 and has grown would also be useful apply to Scikit-HEP’s nearly thirty other to be a core component in many HEP analyses. It has also packages, but applying them one-by-one was not scalable. The provided packages that are growing in usage outside of HEP, like development and adoption of azure-wheel-helpers included a se- AwkwardArray, boost-histogram/Hist, and iMinuit. The tooling ries of blog posts that covered the Azure Pipelines platform and developed and improved by Scikit-HEP has helped Scikit-HEP wheel building details. This ended up serving as the inspiration developers as well as the broader Python community. for a new set of pages on the Scikit-HEP website for developers interested in making Python packages. Unlike blog posts, these would be continuously maintained and extended over the years, R EFERENCES serving as a template and guide for updating and adding packages [Dea20] Hans Dembinski and Piti Ongmongkolkul et al. scikit- to Scikit-HEP, and educating new developers. hep/iminuit. Dec 2020. URL: https://doi.org/10.5281/zenodo. 3949207, doi:10.5281/zenodo.3949207. These pages grew to describe the best practices for developing [EB08] Lyndon Evans and Philip Bryant. Lhc machine. Journal of and maintaining a package, covering recommended configuration, instrumentation, 3(08):S08001, 2008. style checking, testing, continuous integration setup, task runners, [GTW20] Galli, Massimiliano, Tejedor, Enric, and Wunsch, Stefan. "a new and more. 
Shortly after the introduction of the developer pages, pyroot: Modern, interoperable and more pythonic". EPJ Web Conf., 245:06004, 2020. URL: https://doi.org/10.1051/epjconf/ Scikit-HEP developers started asking for a template to quickly 202024506004, doi:10.1051/epjconf/202024506004. produce new packages following the guidelines. This was eventu- [IVL22] Ioana Ifrim, Vassil Vassilev, and David J Lange. GPU Ac- ally produced; the "cookiecutter" based template is kept in sync celerated Automatic Differentiation With Clad. arXiv preprint with the developer pages; any new addition to one is also added arXiv:2203.06139, 2022. [Lam98] Stephan Lammel. Computing models of cdf and dØ to the other. The developer pages are also kept up to date using a in run ii. Computer Physics Communications, 110(1):32– CI job that bumps any GitHub Actions or pre-commit versions to 37, 1998. URL: https://www.sciencedirect.com/science/article/ the most recent versions weekly. Some portions of the developer pii/S0010465597001501, doi:10.1016/s0010-4655(97) 00150-1. pages have been contributed to packaging.python.org, as well. [LCC+ 09] Barry M Leiner, Vinton G Cerf, David D Clark, Robert E The cookie cutter was developed to be able to support multiple Kahn, Leonard Kleinrock, Daniel C Lynch, Jon Postel, Larry G build backends; the original design was to target both pure Python Roberts, and Stephen Wolff. A brief history of the internet. and Pybind11 based binary builds. This has expanded to include ACM SIGCOMM Computer Communication Review, 39(5):22– 31, 2009. 11 different backends by mid 2022, including Rust extensions, [LGMM05] W Lavrijsen, J Generowicz, M Marino, and P Mato. Reflection- many PEP 621 based backends, and a Scikit-Build based backend Based Python-C++ Bindings. 2005. URL: https://cds.cern.ch/ for pybind11 in addition to the classic Setuptools one. This has record/865620, doi:10.5170/CERN-2005-002.441. [PEL20] Jim Pivarski, Peter Elmer, and David Lange. Awkward arrays helped work out bugs and influence the design of several PEP in python, c++, and numba. In EPJ Web of Conferences, 621 packages, including helping with the addition of PEP 621 to volume 245, page 05023. EDP Sciences, 2020. doi:10.1051/ Setuptools. epjconf/202024505023. The most recent addition to the pages was based on a new [PJ11] Andreas J Peters and Lukasz Janyst. Exabyte scale storage at CERN. In Journal of Physics: Conference Series, volume 331, repo-review package which evaluates and existing repository to page 052015. IOP Publishing, 2011. doi:10.1088/1742- see what parts of the guidelines are being followed. This was 6596/331/5/052015. 120 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [R+ 20] Eduardo Rodrigues et al. The Scikit HEP Project – overview and prospects. EPJ Web of Conferences, 245:06028, 2020. arXiv: 2007.03577, doi:10.1051/epjconf/202024506028. [RS21] Olivier Rousselle and Tom Sykora. Fast simulation of Time- of-Flight detectors at the LHC. In EPJ Web of Conferences, volume 251, page 03027. EDP Sciences, 2021. doi:10.1051/ epjconf/202125103027. [RTA+ 17] D Remenska, C Tunnell, J Aalbers, S Verhoeven, J Maassen, and J Templon. Giving pandas ROOT to chew on: experiences with the XENON1T Dark Matter experiment. In Journal of Physics: Conference Series, volume 898, page 042003. IOP Publishing, 2017. [SLF+ 15] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: astronomical or genomical? 
PLoS biology, 13(7):e1002195, 2015. [Tem22] Jeffrey Templon. Reflections on the uptake of the Python pro- gramming language in Nuclear and High-Energy Physics, March 2022. None. URL: https://doi.org/10.5281/zenodo.6353621, doi:10.5281/zenodo.6353621. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 121 Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber Ido Michael‡∗ F This paper walks through this interactive tutorial. It is highly recommended running this interactively so it’s easier to follow and see the results in real-time. There’s a binder link in there as well, so you can launch it instantly. Fig. 1: In this pipeline none of the tasks were executed - it’s all red. 1. Introduction Notebooks are an excellent environment for data exploration: In addition, it can transform a notebook to a single-task pipeline they allow us to write code interactively and get visual feedback, and then the user can split it into smaller tasks as they see fit. providing an unbeatable experience for understanding our data. To refactor the notebook, we use the soorgeon refactor However, this convenience comes at a cost; if we are not command: careful about adding and removing code cells, we may have an soorgeon refactor nb.ipynb irreproducible notebook. Arbitrary execution order is a prevalent After running the refactor command, we can take a look at the problem: a recent analysis found that about 36% of notebooks on local directory and see that we now have multiple python tasks GitHub did not execute in linear order. To ensure our notebooks which that are ready for production: run, we must continuously test them to catch these problems. ls playground A second notable problem is the size of notebooks: the more cells we have, the more difficult it is to debug since there are more We can see that we have a few new files. pipeline.yaml variables and code involved. contains the pipeline declaration, and tasks/ contains the stages Software engineers typically break down projects into multiple that Soorgeon identified based on our H2 Markdown headings: steps and test continuously to prevent broken and unmaintainable ls playground/tasks code. However, applying these ideas for data analysis requires extra work; multiple notebooks imply we have to ensure the output One of the best ways to onboard new people and explain what from one stage becomes the input for the next one. Furthermore, each workflow is doing is by plotting the pipeline (note that we’re we can no longer press “Run all cells” in Jupyter to test our now using ploomber, which is the framework for developing analysis from start to finish. pipelines): Ploomber provides all the necessary tools to build multi- ploomber plot stage, reproducible pipelines in Jupyter that feel like a single This command will generate the plot below for us, which will notebook. Users can easily break down their analysis into multiple allow us to stay up to date with changes that are happening in our notebooks and execute them all with a single command. pipeline and get the current status of tasks that were executed or failed to execute. 2. Refactoring a legacy notebook Soorgeon correctly identified the stages in our If you already have a python project in a single notebook, you original nb.ipynb notebook. It even detected that can use our tool Soorgeon to automatically refactor it into a the last two tasks (linear-regression, and Ploomber pipeline. 
Soorgeon statically analyzes your code, cleans random-forest-regressor) are independent of each up unnecessary imports, and makes sure your monolithic notebook other! is broken down into smaller components. It does that by scanning We can also get a summary of the pipeline with ploomber the markdown in the notebook and analyzing the headers; each status: H2 header in our example is marking a new self-contained task. cd playground ploomber status * Corresponding author: ido@ploomber.io ‡ Ploomber 3. The pipeline.yaml file Copyright © 2022 Ido Michael. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits To develop a pipeline, users create a pipeline.yaml file and unrestricted use, distribution, and reproduction in any medium, provided the declare the tasks and their outputs as follows: original author and source are credited. 122 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Here we can see the build outputs Fig. 2: In here we can see the status of each of our pipeline’s tasks, runtime and location. tasks: - source: script.py product: nb: output/executed.ipynb data: output/data.csv # more tasks here... The previous pipeline has a single task (script.py) and generates two outputs: output/executed.ipynb and output/data.csv. You may be wondering why we have a notebook as an output: Ploomber converts scripts to notebooks before execution; hence, our script is considered the source and the notebook a byproduct of the execution. Using scripts as sources (instead of notebooks) makes it simpler to use git. However, this does not mean you have to give up interactive development since Ploomber integrates with Jupyter, allowing you to edit scripts as notebooks. Fig. 4: These are the post build artifacts In this case, since we used soorgeon to refactor an existing notebook, we did not have to write the pipeline.yaml file. # Sample data quality checks after loading the raw data # Check nulls 4. Building the pipeline assert not df['HouseAge'].isnull().values.any() Let’s build the pipeline (this will take ~30 seconds): # Check a specific range - no outliers cd playground assert df['HouseAge'].between(0,100).any() ploomber build # Exact expected row count We can see which are the tasks that ran during this command, how assert len(df) == 11085 long they took to execute, and the contributions of each task to the overall pipeline execution runtime. ** We’ll do the same for tasks/linear-regression.py, open the file Navigate to playground/output/ and you’ll see all the and add the tests: outputs: the executed notebooks, data files and trained model. # Sample tests after the notebook ran # Check task test input exists ls playground/output assert Path(upstream['train-test-split']['X_test']).exists() In this figure, we can see all of the data that was collected during # Check task train input exists the pipeline, any artifacts that might be useful to the user, and some assert Path(upstream['train-test-split']['y_train']).exists() of the execution history that is saved on the notebook’s context. # Validating output type assert 'pkl' in upstream['train-test-split']['X_test'] 5. Testing and quality checks Adding these snippets will allow us to validate that the data we’re ** Open tasks/train-test-split.py as a notebook by right-clicking looking for exists and has the quality we expect. 
For instance, in on it and then Open With -> Notebook and add the following the first test we’re checking there are no missing rows, and that code after the cell with # noqa: the data sample we have are for houses up to 100 years old. KEEPING YOUR JUPYTER NOTEBOOK CODE QUALITY BAR HIGH (AND PRODUCTION READY) WITH PLOOMBER 123 Fig. 6: lab-open-with-notebook Fig. 5: Now we see an independent new task In the second snippet, we’re checking that there are train and test inputs which are crucial for training the model. 6. Maintaining the pipeline Let’s look again at our pipeline plot: Fig. 7: The new task is attached to the pipeline Image('playground/pipeline.png') The arrows in the diagram represent input/output dependencies At the top of the notebook, you’ll see the following: and depict the execution order. For example, the first task (load) upstream = None loads some data, then clean uses such data as input and processes it, then train-test-split splits our dataset into This special variable indicates which tasks should execute before training and test sets. Finally, we use those datasets to train a the notebook we’re currently working on. In this case, we want to linear regression and a random forest regressor. get training data so we can train our new model so we change the Soorgeon extracted and declared this dependencies for us, but upstream variable: if we want to modify the existing pipeline, we need to declare upstream = ['train-test-split'] such dependencies. Let’s see how. We can also see that the pipeline is green, meaning all of the Let’s generate the plot again: tasks in it have been executed recently. cd playground ploomber plot 7. Adding a new task Ploomber now recognizes our dependency declaration! Let’s say we want to train another model and decide to try Gradient Open Boosting Regressor. First, we modify the pipeline.yaml file playground/tasks/gradient-boosting-regressor.py and add a new task: as a notebook by right-clicking on it and then Open With -> Open playground/pipeline.yaml and add the follow- Notebook and add the following code: ing lines at the end from pathlib import Path - source: tasks/gradient-boosting-regressor.py import pickle product: nb: output/gradient-boosting-regressor.ipynb import seaborn as sns Now, let’s create a base file by executing ploomber from sklearn.ensemble import GradientBoostingRegressor scaffold: y_train = pickle.loads(Path( cd playground upstream['train-test-split']['y_train']).read_bytes()) ploomber scaffold y_test = pickle.loads(Path( upstream['train-test-split']['y_test']).read_bytes()) This is the output of the command: ` X_test = pickle.loads(Path( Found spec at 'pipeline.yaml' Adding upstream['train-test-split']['X_test']).read_bytes()) /Users/ido/ploomber-workshop/playground/ X_train = pickle.loads(Path( upstream['train-test-split']['X_train']).read_bytes()) tasks/ gradient-boosting-regressor.py... Created 1 new task sources. ` gbr = GradientBoostingRegressor() We can see it created the task sources for our new task, we just gbr.fit(X_train, y_train) have to fill those in right now. y_pred = gbr.predict(X_test) Let’s see how the plot looks now: sns.scatterplot(x=y_test, y=y_pred) cd playground ploomber plot You can see that Ploomber recognizes the new file, but it does not 8. Incremental builds have any dependency, so let’s tell Ploomber that it should execute Data workflows require a lot of iteration. For example, you may after train-test-split: want to generate a new feature or model. 
However, it’s wasteful Open to re-execute every task with every minor change. Therefore, playground/tasks/gradient-boosting-regressor.py one of Ploomber’s core features is incremental builds, which automatically skip tasks whose source code hasn’t changed. as a notebook by right-clicking on it and then Open With -> Run the pipeline again: Notebook: 124 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 11. Resources Thanks for taking the time to go through this tutorial! We hope you consider using Ploomber for your next project. If you have any questions or need help, please reach out to us! (contact info below). Here are a few resources to dig deeper: • GitHub • Documentation • Code examples Fig. 8: We can see this pipeline has multiple new tasks. • JupyterCon 2020 talk • Argo Community Meeting talk • Pangeo Showcase talk (AWS Batch demo) cd playground • Jupyter project ploomber build You can see that only the gradient-boosting-regressor 10. Contact task ran! Incremental builds allow us to iterate faster without keeping • Twitter track of task changes. • Join us on Slack Check out playground/output/ • E-mail us gradient-boosting-regressor.ipynb, which contains the output notebooks with the model evaluation plot. 9. Parallel execution and Ploomber cloud execution This section can run locally or on the cloud. To setup the cloud we’ll need to register for an api key Ploomber cloud allows you to scale your experiments into the cloud without provisioning machines and without dealing with infrastrucutres. Open playground/pipeline.yaml and add the following code instead of the source task: - source: tasks/random-forest-regressor.py This is how your task should look like in the end - source: tasks/random-forest-regressor.py name: random-forest- product: nb: output/random-forest-regressor.ipynb grid: # creates 4 tasks (2 * 2) n_estimators: [5, 10] criterion: [gini, entropy] In addition, we’ll need to add a flag to tell the pipeline to execute in parallel. Open playground/pipeline.yaml and add the following code above the -tasks section (line 1): yaml # Execute independent tasks in parallel executor: parallel ploomber plot ploomber build 10. Execution in the cloud When working with datasets that fit in memory, running your pipeline is simple enough, but sometimes you may need more computing power for your analysis. Ploomber makes it simple to execute your code in a distributed environment without code changes. Check out Soopervisor, the package that implements exporting Ploomber projects in the cloud with support for: • Kubernetes (Argo Workflows) • AWS Batch • Airflow PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 125 Likeness: a toolkit for connecting the social fabric of place to human dynamics Joseph V. Tuccillo‡∗ , James D. Gaboardi‡ F Abstract—The ability to produce richly-attributed synthetic populations is Modeling these processes at scale and with respect to indi- key for understanding human dynamics, responding to emergencies, and vidual privacy is most commonly achieved through agent-based preparing for future events, all while protecting individual privacy. The Like- simulations on synthetic populations [SEM14]. Synthetic popula- ness toolkit accomplishes these goals with a suite of Python packages: tions consist of individual agents that, when viewed in aggregate, pymedm/pymedm_legacy, livelike, and actlike. 
This production closely recreate the makeup of an area’s observed population process is initialized in pymedm (or pymedm_legacy) that utilizes census microdata records as the foundation on which disaggregated spatial allocation [HHSB12], [TMKD17]. Modeling human dynamics with syn- matrices are built. The next step, performed by livelike, is the generation of thetic populations is common across research areas including spa- a fully autonomous agent population attributed with hundreds of demographic tial epidemiology [DKA+ 08], [BBE+ 08], [HNB+ 11], [NCA13], census variables. The agent population synthesized in livelike is then [RSF+ 21], [SNGJ+ 09], public health [BCD+ 06], [BFH+ 17], attributed with residential coordinates in actlike based on block assignment [SPH11], [TCR08], [MCB+ 08], and transportation [BBM96], and, finally, allocated to an optimal daytime activity location via the street [ZFJ14]. However, a persistent limitation across these applications network. We present a case study in Knox County, Tennessee, synthesizing 30 is that synthetic populations often do not capture a wide enough populations of public K–12 school students & teachers and allocating them to range of individual characteristics to assess how human dynamics schools. Validation of our results shows they are highly promising by replicating are linked to human security problems (e.g., how a person’s age, reported school enrollment and teacher capacity with a high degree of fidelity. limited transportation access, and linguistic isolation may interact Index Terms—activity spaces, agent-based modeling, human dynamics, popu- with their housing situation in a flood evacuation emergency). lation synthesis In this paper, we introduce Likeness [TG22], a Python toolkit for connecting the social fabric of place to human dynamics via Introduction models that support increased spatial, temporal, and demographic Human security fundamentally involves the functional capacity fidelity. Likeness is an extension of the UrbanPop framework de- that individuals possess to withstand adverse circumstances, me- veloped at Oak Ridge National Laboratory (ORNL) that embraces diated by the social and physical environments in which they live a new paradigm of "vivid" synthetic populations [TM21], [Tuc21], [Hew97]. Attention to human dynamics is a key piece of the in which individual agents may be attributed in potentially hun- human security puzzle, as it reveals spatial policy interventions dreds of ways, across subjects spanning demographics, socioe- most appropriate to the ways in which people within a community conomic status, housing, and health. Vivid synthetic populations behave and interact in daily life. For example, "one size fits all" benefit human dynamics research both by enabling more precise solutions do not exist for mitigating disease spread, promoting geolocation of population segments, as well as providing a deeper physical activity, or enabling access to healthy food sources. understanding of how individual and neighborhood characteris- Rather, understanding these outcomes requires examination of tics are coupled. UrbanPop’s early development was motivated processes like residential sorting, mobility, and social transmis- by linking models of residential sorting and worker commute sion. behaviors [MNP+ 17], [MPN+ 17], [ANM+ 18]. 
Likeness expands upon the UrbanPop approach by providing a novel integrated * Corresponding author: tuccillojv@ornl.gov ‡ Oak Ridge National Laboratory model that pairs vivid residential synthetic populations with an activity simulation model on real-world transportation networks, Copyright © 2022 Oak Ridge National Laboratory. This is an open-access with travel destinations based on points of interest (POIs) curated article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any from location services and federal critical facilities data. medium, provided the original author and source are credited. Notice: This manuscript has been authored by UT-Battelle, LLC under We first provide an overview of Likeness’ capabilities, then Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. provide a more detailed walkthrough of its central workflow with The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government respect to livelike, a package for population synthesis and retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or residential characterization, and actlike a package for activity reproduce the published form of this manuscript, or allow others to do so, for allocation. We provide preliminary usage examples for Likeness United States Government purposes. The Department of Energy will provide based on 1) social contact networks in POIs 2) 24-hour POI public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public- occupancy characteristics. Finally, we discuss existing limitations access-plan). and the outlook for future development. 126 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Overview of Core Capabilities and Workflow the ACS Public-Use Microdata Sample (PUMS) at the scale UrbanPop initially combined the vivid synthetic populations pro- of census block groups (typically 300–6000 people) or tracts duced from the American Community Survey (ACS) using the (1200–8000 people), depending upon the use-case. Penalized-Maximum Entropy Dasymetric Modeling (P-MEDM) Downscaling the PUMS from the Public-Use Microdata Area method, which is detailed later, with a commute model based on (PUMA) level at which it is offered (100,000 or more people) to origin-destination flows, to generate a detailed dataset of daytime these neighborhood scales then enables us to produce synthetic and nighttime synthetic populations across the United States populations (the livelike package) and simulate their travel [MPN+ 17]. Our development of Likeness is motivated by extend- to POIs (the actlike package) in an integrated model. This ap- ing the existing capabilities of UrbanPop to routing libraries avail- proach provides a new means of modeling population mobility and able in Python like osmnx1 and pandana2 [Boe17], [FW12]. activity spaces with respect to real-world transportation networks In doing so, we are able to simulate travel to regular daytime and POIs, in turn enabling investigation of social processes from activities (work and school) based on real-world transportation the atomic (e.g., person) level in human systems. networks. Likeness continues to use the P-MEDM approach, but Likeness offers two implementations of P-MEDM. The first, is fully integrated with the U.S. 
Census Bureau’s ACS Summary the pymedm package, is written natively in Python based on File (SF) and Census Microdata APIs, enabling the production of scipy.optimize.minimize, and while fully operational re- activity models on-the-fly. mains in development and is currently suitable for one-off simu- Likeness features three core capabilities supporting activ- lations. The second, the pmedm_legacy package, uses rpy2 as ity simulation with vivid synthetic populations (Figure 1). a bridge to [NBLS14]’s original implementation of P-MEDM3 in The first, spatial allocation, is provided by the pymedm and R/C++ and is currently more stable and scalable. We offer conda pmedm_legacy packages and uses Iterative Proportional Fitting environments specific to each package, based on user preferences. (IPF) to downscale census microdata records to small neighbor- Each package’s functionality centers around a PMEDM class, hood areas, providing a basis for population synthesis. Baseline which contains information required to solve the P-MEDM prob- residential synthetic populations are then created and stratified into lem: agent segments (e.g., grade 10 students, hospitality workers) using • The individual (household) level constraints based on ACS the livelike package. Finally, the actlike package models PUMS. To preserve households from the PUMS in the syn- travel across agent segments of interest to POIs outside places of thetic population, the person-level constraints describing residence at varying times of day. household members are aggregated to the household level and merged with household-level constraints. Spatial Allocation: the pymedm & pmedm_legacy packages • PUMS household sample weights. Synthetic populations are typically generated from census micro- • The target (e.g., block group) and aggregate (e.g., tract) data, which consists of a sample of publicly available longform zone constraints based on population-level estimates avail- responses to official statistical surveys. To preserve respondent able in the ACS SF. confidentiality, census microdata is often published at spatial • The target/aggregate zone 90% margins of error and asso- scales the size of a city or larger. Spatial allocation with IPF ciated standard errors (SE = 1.645 × MOE). provides a maximum-likelihood estimator for microdata responses The PMEDM classes feature a solve() method that returns in small (e.g., neighborhood) areas based on aggregate data an optimized P-MEDM solution and allocation matrix. Through published about those areas (known as "constraints"), resulting a diagnostics module, users may then evaluate a P-MEDM in a baseline for population synthesis [WCC+ 09], [BBM96], solution based on the proportion of published 90% MOEs from [TMKD17]. UrbanPop is built upon a regularized implementation the summary-level ACS data preserved at the target (allocation) of IPF, the P-MEDM method, that permits many more input census scale. variables than traditional approaches [LNB13], [NBLS14]. The P- MEDM objective function (Eq. 1) is written as: Population Synthesis: the livelike package n wit wit e2 The livelike package generates baseline residential synthetic max − ∑ log − ∑ k2 (1) it N dit dit k 2σk populations and performs agent segmentation for activity simula- tion. where wit is the estimate of variable i in zone t, dit is the synthetic estimate of variable i in location t, n is the number of microdata Specifying and Solving Spatial Allocation Problems responses, and N is the total population size. 
Uncertainty in The livelike workflow is oriented around a user-specified variable estimates is handled by adding an error term to the e2 constraints file containing all of the information necessary to allocation ∑k 2σk2 , where ek is the error between the synthetic specify a P-MEDM problem for a PUMA of interest. "Constraints" k and published estimate of ACS variable k and σk is the ACS are variables from the ACS common among people/households standard error for the estimate of variable k. This is accomplished (PUMS) and populations (SF) that are used as both model inputs by leveraging the uncertainty in the input variables: the "tighter" and descriptors. The constraints file includes information for the margins of error on the estimate of variable k in place t, the bridging PUMS variable definitions with those from the SF using more leverage it holds upon the solution [NBLS14]. helper functions provided by the livelike.pums module, The P-MEDM procedure outputs an allocation matrix that including table IDs, sampling universe (person/household), and estimates the probability of individuals matching responses from tags for the range of ACS vintages (years) for which the variables are relevant. 1. https://github.com/gboeing/osmnx 2. https://github.com/UDST/pandana 3. https://bitbucket.org/nnnagle/pmedmrcpp LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 127 Fig. 1: Core capabilities and workflow of Likeness. The primary livelike class is the acs.puma, which stores implementation of [LB13]’s "Truncate, Replicate, Sample" (TRS) information about a single PUMA necessary for spatial allocation method. TRS works by separating each cell of the allocation of the PUMS data to block groups/tracts with P-MEDM. The matrix into whole-number (integer) and fractional components, process of creating an acs.puma is integrated with the U.S. then incrementing the whole-number estimates by a random Census Bureau’s ACS SF and Census Microdata 5-Year Estimates sample of unit weights performed with sampling probabilities (5YE) APIs4 . This enables generation of an acs.puma class based on the fractional component. Because TRS is stochastic, with a high-level call involving just a few parameters: 1) the the homesim.hsim() function generates multiple (default 30) PUMA’s Federal Information Processing Standard (FIPS) code 2) realizations of the residential population. The results are provided the constraints file, loaded as a pandas.DataFrame and 3) the as a pandas.DataFrame in long format, attributed by: target ACS vintage (year). An example call to build an acs.puma • PUMS Household ID (h_id) for the Knoxville City, TN PUMA (FIPS 4701603) using the ACS • Simulation number (sim) 2015–2019 5-Year Estimates is: • Target zone FIPS code (geoid) acs.puma( fips="4701603", • Household count (count) constraints=constraints, year=2019 Since household and person-level attributes are combined ) when creating the acs.puma class, person-level records from the PUMS are assumed to be joined to the synthesized household The censusdata package5 is used internally to IDs many-to-one. For example, if two people, A01 and A03, in fetch population-level (SF) constraints, standard errors, household A have some attribute of interest, and there are 3 and MOEs from the ACS 5YE API, while the households of type A in zone G, then we estimate that a total acs.extract_pums_constraints function is used to of 6 people with that attribute from household A reside in zone G. 
fetch individual-level constraints and weights from the Census Microdata 5YE API. Agent Generation Spatial allocation is then carried out by passing the acs.puma attributes to a pymedm.PMEDM or The synthetic populations can then be segmented into different pmedm_legacy.PMEDM (depending on user preference). groups of agents (e.g., workers by industry, students by grade) for activity modeling with the actlike package. Agent segments Population Synthesis may be identified in several ways: The homesim module provides support for population synthe- • Using acs.extract_pums_segment_ids() to sis on the spatial allocation matrix within a solved P-MEDM fetch the person IDs (household serial number + person object. The population synthesis procedure involves converting line number) from the Census Microdata API matching the fractional estimates from the allocation matrix (n household some criteria of interest (e.g., public school students in IDs by m zones) to integer representation such that whole peo- 10th grade). ple/households are preserved. This homesim module features an • Using acs.extract_pums_descriptors() to 4. https://www.census.gov/data/developers/data-sets.html fetch criteria that may be queried from the Census 5. https://pypi.org/project/CensusData Microdata API. This is useful when dealing with criteria 128 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) more specific than can be directly controlled for in the in time and are placed with a greater frequency proportional P-MEDM problem (e.g., detailed NAICS code of worker, to reported household density [LB13]. We employ population exact number of hours worked). and housing counts within 2010 Decennial Census blocks to formulate a modified Variable Size Bin Packing Problem [FL86], The function est.tabulate_by_serial() is then used [CGSdG08] for each populated block group, which allows for to tabulate agents by target zone and simulation by appending an optimal placement of household points and is accomplished them to the synthetic population based on household ID, then by the actlike.block_denisty_allocation() aggregating the person-level counts. This routine is flexible in that function that creates and solves an a user can use any set of criteria available from the PUMS to actlike.block_allocation.BinPack instance. define customized agents for mobility modeling purposes. Other Capabilities Activity Allocation Population Statistics: In addition to agent creation, the Once household location attribution is complete, individual agents livelike.est module also supports the creation of popula- must be allocated from households (nighttime locations) to prob- tion statistics. This can be used to estimate the compositional able activity spaces (daytime locations). This is achieved through characteristics of small neighborhood areas and POIs, for ex- spatial network modeling over the streets within a study area via ample to simulate social contact networks (see Students). To OpenStreetMap6 utilizing osmnx for network extraction & pre- accomplish this, the results of est.tabulate_by_serial processing and pandana for shortest path and route calculations. (see Agent Generation) are converted to proportional esti- The underlying impedance metric for shortest path calculation, mates to facilitate POIs (est.to_prop()), then averaged handled in actlike.calc_cost_mtx() and associated in- across simulations to produce Monte Carlo estimates and errors ternal functions, can either take the form of distance or travel time. 
est.monte_carlo_estimate()). Moreover, household and activity locations must be connected to Multiple ACS Vintages and PUMAs: The multi nearby network edges for realistic representations within network module extends the capabilities of livelike to space [GFH20]. multiple ACS 5YE vintages (dating back to 2016), as With a cost matrix from all residences to daytime loca- well as multiple PUMAs (e.g., a metropolitan area) via tions calculated, the simulated population can then be "sent" the multi module. Using multi.make_pumas() to the likely activity spaces by utilizing an instance of or multi.make_multiyear_pumas(), multiple actlike.ActivityAllocation to generate an adapted PUMAs/multiple years may be stored in a dict Transportation Problem. This mixed integer program, solved using that enables iterative runs for spatial allocation the solve() method, optimally associates all population within (multi.make_pmedm_problems()), population an activity space with the objective of minimizing the total cost of synthesis (multi.homesim()), and agent cre- impedance (Eq. 2), being subject to potentially relaxed minimum ation (multi.extract_pums_segment_ids(), and maximum capacity constraints (Eq. 4 & 5). Each decision multi.extract_pums_segment_ids_multiyear(), variable (xi j ) represents a potential allocation from origin i to multi.extract_pums_descriptors(), and destination j that must be an integer greater than or equal to zero multi.extract_pums_descriptors_multiyear()). (Eq. 6 & 7). The problem is formulated as follows: This functionality is currently available for pmedm_legacy only. min ∑ ∑ ci j xi j (2) i∈I j∈J Activity Allocation: the actlike package s.t. ∑ xi j = Oi ∀i ∈ I; (3) The actlike package [GT22] allocates agents from synthetic j∈J populations generated by livelike POI, like schools and work- places, based on optimal allocation about transportation networks s.t. ∑ xi j ≥ minD j ∀ j ∈ J; (4) i∈I derived from osmnx and pandana [Boe17], [FW12]. Solutions are the product of a modified integer program (Transportation s.t. ∑ xi j ≤ maxD j ∀ j ∈ J; (5) Problem [Hit41], [Koo49], [MS01], [MS15]) modeled in pulp i∈I or mip [MOD11], [ST20], whereby supply (students/workers) s.t. xi j ≥ 0 ∀i ∈ I ∀ j ∈ J; (6) are "shipped" to demand locations (schools/workplaces), with potentially relaxed minimum and maximum capacity constraints at s.t. xi j ∈ Z ∀i ∈ I ∀ j ∈ J. (7) demand locations. Impedance from nighttime to daytime locations (Origin-Destination [OD] pairs) can be modeled by either network where distance or network travel time. i ∈ I = each household in the set of origins j ∈ J = each school in the set of destinations Location Synthesis xi j = allocation decision from i ∈ I to j ∈ J Following the generation of synthetic households for the study ci j = cost between all i, j pairs universe, locations for all households across the 30 default simulations must be created. In order to intelligently site pseudo- Oi = population in origin i for i ∈ I neighborhood clusters of random points, we adopt a dasymetric minD j = minimum capacity j for j ∈ J [QC13] approach, which we term intelligent block-based (IBB) maxD j = maximum capacity j for j ∈ J allocation, whereby household locations are only placed within blocks known to have been populated at a particular period 6. 
https://www.openstreetmap.org/about LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 129 The key to this adapted formulation of the classic Trans- Because school attendance in Knox County is restricted by portation Problem is the utilization of minimum and maxi- district boundaries, we only placed student households in mum capacity thresholds that are generated endogenously within the PUMAs intersecting with the district (FIPS 4701601, actlike.ActivityAllocation and are tuned to reflect 4701602, 4701603, 4701604). However, because educators the uncertainty of both the population estimates generated by may live outside school district boundaries, we simulated livelike and the reported (or predicted) capacities at activity their household locations throughout the Knoxville CBSA. locations. Moreover, network impedance from origins to destina- • Used actlike to perform optimal allocation of tions (ci j ) can be randomly reduced through an internal process workers and students about road networks in Knox by passing in an integer value to the reduce_seed keyword ar- County/Knoxville CBSA. Across the 30 simulations and gument. By triggering this functionality, the count and magnitude 14 segments identified, we produced a total of 420 travel of reduction is determined algorithmically. A random reduction simulations. Network impedance was measured in geo- of this nature is beneficial in generating dispersed solutions that graphic distance for all student simulations and travel time do not resemble compact clusters, with an example being the for all educator simulations. replication of a private school’s student body that does not adhere Figure 2 demonstrates the optimal allocations, routing, and to public school attendance zones. network space for a single simulation of 10th grade public school After the optimal solution is found for an students in Knox County, TN. Students, shown in households actlike.ActivityAllocation instance, selected as small black dots, are associated with schools, represented by decisions are isolated from non-zero decision variables transparent colored circles sized according to reported enrollment. with the realized_allocations() method. These The network space connecting student residential locations to allocations are then used to generate solution routes with the assigned schools is displayed in a matching color. Further, the network_routes() function that represent the shortest path inset in Figure 2 provides the pseudo-school attendance zone for along the network traversed from residential locations to assigned 10th graders at one school in central Knoxville and demonstrates activity spaces. Solutions can be further validated with Canonical the adherence to network space. Correlation Analysis, in instances where the agent segments are stratified, and simple linear regression for those where a single Students segment of agents is used. Validation is discussed further in Validation & Diagnostics. Our study of K–12 students examines social contact networks with respect to potentially underserved student populations via the compositional characteristics of POIs (schools). 
Case Study: K–12 Public Schools in Knox County, TN

To illustrate Likeness' capability to simulate POI travel among specific population segments, we provide a case study of travel to POIs, in this case K–12 schools, in Knox County, TN. Our choice of K–12 schools was motivated by several factors. First, they serve as common destinations for the two major groups (workers and students) expected to consistently travel on a typical business day [RWM+ 17]. Second, a complete inventory of public school locations, as well as faculty and enrollment sizes, is available publicly through federal open data sources. In this case, we obtained school locations and faculty sizes from the Homeland Infrastructure Foundation-Level Database (HIFLD)⁷ and student enrollment sizes by grade from the National Center for Education Statistics (NCES) Common Core of Data⁸.

7. https://hifld-geoplatform.opendata.arcgis.com
8. https://nces.ed.gov/ccd/files.asp

We chose the Knox County School District, which coincides with Knox County's boundaries, as our study area. We used the livelike package to create 30 synthetic populations for the Knoxville Core-Based Statistical Area (CBSA), then for each simulation we:

• Isolated agent segments from the synthetic population. K–12 educators consist of full-time workers employed as primary and secondary education teachers (2018 Standard Occupation Classification System codes 2300–2320) in elementary and secondary schools (NAICS 6111). We separated out student agents by public schools and by grade level (Kindergarten through Grade 12).
• Performed IBB allocation to simulate the household locations of workers and students. Our selection of household locations for workers and students varied geographically. Because school attendance in Knox County is restricted by district boundaries, we only placed student households in the PUMAs intersecting with the district (FIPS 4701601, 4701602, 4701603, 4701604). However, because educators may live outside school district boundaries, we simulated their household locations throughout the Knoxville CBSA.
• Used actlike to perform optimal allocation of workers and students about road networks in Knox County/Knoxville CBSA. Across the 30 simulations and 14 segments identified, we produced a total of 420 travel simulations. Network impedance was measured in geographic distance for all student simulations and travel time for all educator simulations.

Figure 2 demonstrates the optimal allocations, routing, and network space for a single simulation of 10th grade public school students in Knox County, TN. Students, shown in households as small black dots, are associated with schools, represented by transparent colored circles sized according to reported enrollment. The network space connecting student residential locations to assigned schools is displayed in a matching color. Further, the inset in Figure 2 provides the pseudo-school attendance zone for 10th graders at one school in central Knoxville and demonstrates the adherence to network space.

Fig. 2: Optimal allocations for one simulation of 10th grade public school students in Knox County, TN.

Students

Our study of K–12 students examines social contact networks with respect to potentially underserved student populations via the compositional characteristics of POIs (schools). We characterized each school's student body by identifying student profiles based on several criteria: minority race/ethnicity, poverty status, single caregiver households, and unemployed caregiver households (householder and/or spouse/partner). We defined 6 student profiles using an implementation of the density-based K-Modes clustering algorithm [CLB09] with a distance heuristic designed to optimize cluster separation [NLHH07], available through the kmodes package⁹ [dV21]. Student profile labels were appended to the student travel simulation results, then used to produce Monte Carlo proportional estimates of profiles by school.

9. https://pypi.org/project/kmodes

The results in Figure 3 reveal strong dissimilarities in student makeup between schools on the periphery of Knox County and those nearer to Knoxville's downtown core in the center of the county. We estimate that the former are largely composed of students in married families, above poverty, and with employed caregivers, whereas the latter are characterized more strongly by single caregiver living arrangements and, particularly in areas north of the downtown core, economic distress (pop-out map).

Fig. 3: Compositional characteristics of K–12 public schools in Knox County, TN based on 6 student profiles. Glyph plot methodology adapted from [GLC+ 15].
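For readers unfamiliar with k-modes, the fragment below sketches the clustering step on toy categorical data using the kmodes package cited above; the attribute coding, sample size, and random values are invented, and the authors' actual feature construction is not reproduced here.

import numpy as np
from kmodes.kmodes import KModes

# Toy categorical student attributes: [minority race/ethnicity, poverty,
# single caregiver, unemployed caregiver], each coded as a 0/1 string.
rng = np.random.default_rng(0)
students = rng.integers(0, 2, size=(500, 4)).astype(str)

# The Cao et al. initialization is the density-based variant cited above.
km = KModes(n_clusters=6, init="Cao", n_init=1, verbose=0)
profiles = km.fit_predict(students)

# Profile labels can then be joined back onto the travel simulation results
# to build Monte Carlo proportional estimates of profiles by school.
print(np.bincount(profiles))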
Workers (Educators)

We evaluated the results of our K–12 educator simulations with respect to POI occupancy characteristics, as informed by commute and work statistics obtained from the PUMS. Specifically, we used work arrival times associated with each synthetic worker (PUMS JWAP) to timestamp the start of each work day, and incremented this by daily hours worked (derived from PUMS WKHP) to create a second timestamp for work departure. The estimated departure time assumes that each educator travels to the school for a typical 5-day workweek, and is estimated as JWAP + WKHP/5.

Roughly 50 educator agents per simulation were not attributed with work arrival times, possibly due to the source PUMS respondents being away from their typical workplaces (e.g., on summer or winter break) but still working virtually when they were surveyed. We filled in these unknown arrival times with the modal arrival time observed across all simulations (7:25 AM).

Figure 4 displays the hourly proportion of educators present at each school in Knox County between 7:00 AM (t700) and 6:00 PM (t1800). Morning worker arrivals occur more rapidly than afternoon departures. Between the hours of 7:00 AM and 9:00 AM (t700–t900), schools transition from nearly empty of workers to being close to capacity. In the afternoon, workers begin to gradually depart at 3:00 PM (t1500), with somewhere between 50%–70% of workers still present by 4:00 PM (t1600); workers then begin to depart in earnest at 5:00 PM into 6:00 PM (t1700–t1800), by which point most have returned home.

Geographic differences are also visible and may be a function of (1) a higher concentration of a particular school type (e.g., elementary, middle, high) in a given area and (2) staggered starts between these types (to accommodate bus schedules, etc.). This could be due in part to concentrations of different school schedules by grade level, especially elementary schools starting much earlier than middle and high schools¹⁰. For example, schools near the center of Knox County reach worker capacity more quickly in the morning, starting around 8:00 AM (t800), but also empty out more rapidly than schools in surrounding areas beginning around 4:00 PM (t1600).

10. https://www.knoxschools.org/Page/5553

Fig. 4: Hourly worker occupancy estimates for K–12 schools in Knox County, TN.
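A compact way to build this kind of hourly occupancy table is sketched below with pandas; the worker records are toy values, and real PUMS JWAP codes (which are interval codes rather than clock strings) would need to be decoded first.

import pandas as pd

# Toy synthetic educators: arrival time (stand-in for JWAP) and usual
# weekly hours (WKHP).
workers = pd.DataFrame({
    "school": ["A", "A", "B", "B"],
    "jwap":   ["07:25", "08:00", "07:25", "09:00"],
    "wkhp":   [40, 35, 45, 20],
})

arrive = pd.to_timedelta(workers["jwap"] + ":00")
depart = arrive + pd.to_timedelta(workers["wkhp"] / 5, unit="h")  # JWAP + WKHP/5

# Hourly proportion of workers present at each school, 7:00 AM-6:00 PM.
hours = pd.timedelta_range("07:00:00", "18:00:00", freq="h")
occupancy = pd.DataFrame({
    f"t{h.components.hours:d}00": ((arrive <= h) & (depart > h))
    for h in hours
}).groupby(workers["school"]).mean()
print(occupancy)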
Validation & Diagnostics

A determination of modeling output robustness was needed to validate our results. Specifically, we aimed to ensure the preservation of relative facility size and composition. To perform this validation, we tested the optimal allocations generated by Likeness against the maximally adjusted reported enrollment & faculty employment counts. We used the maximum adjusted value to account for scenarios where the population synthesis phase resulted in a total demographic segment greater than the reported total facility capacity. We employed Canonical Correlation Analysis (CCA) [Kna78] for the K–12 public school student allocations due to their stratified nature, and an ordinary least squares (OLS) simple linear regression for the educator allocations [PVG+ 11]. Because CCA is a multivariate measure, it is only a suitable diagnostic for activity allocation when multiple segments (e.g., students by grade) are of interest. For educators, which we treated as a single agent segment without stratification, we used OLS regression instead. The CCA for students was performed in two components: Between-Destination, which measures capacity across facilities, and Within-Destination, which measures capacity across strata.

Descriptive Monte Carlo statistics from the 30 simulations were run on the resultant coefficients of determination (R²), which show goodness of fit (approaching 1). As seen in Table 1, all models performed exceedingly well, though the Within-Destination CCA performed slightly less well than both the Between-Destination CCA and the OLS linear regression. In fact, the global minimum of all R² scores approaches 0.99 (students – Within-Destination), which demonstrates robust preservation of true capacities in our synthetic activity modeling. Furthermore, a global maximum of greater than 0.999 is seen for educators, which indicates a near perfect replication of relative faculty sizes by school.

TABLE 1: Validating optimal allocations considering reported enrollment at public schools & faculty employment at all schools.

K–12 segment                          | R² Type                 | Min    | Median | Mean   | Max
Students (public schools)             | Between-Destination CCA | 0.9967 | 0.9974 | 0.9973 | 0.9976
Students (public schools)             | Within-Destination CCA  | 0.9883 | 0.9894 | 0.9896 | 0.9910
Educators (public & private schools)  | OLS Linear Regression   | 0.9977 | 0.9983 | 0.9983 | 0.9991
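As a rough illustration of these diagnostics, the snippet below computes an OLS R² with scipy and a first-pair canonical correlation with scikit-learn on synthetic data; it only sketches the general technique, assuming allocated-versus-reported count arrays, and does not reproduce the authors' exact Between-/Within-Destination decomposition.

import numpy as np
from scipy import stats
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)

# Educators (single segment): allocated faculty vs. adjusted reported faculty.
reported = rng.integers(20, 120, size=30).astype(float)
allocated = reported + rng.normal(0, 2, size=30)
ols = stats.linregress(reported, allocated)
print("OLS R^2:", ols.rvalue ** 2)

# Students (stratified by grade): one row per school, one column per grade.
reported_by_grade = rng.integers(50, 300, size=(30, 13)).astype(float)
allocated_by_grade = reported_by_grade + rng.normal(0, 5, size=(30, 13))
cca = CCA(n_components=1).fit(reported_by_grade, allocated_by_grade)
u, v = cca.transform(reported_by_grade, allocated_by_grade)
print("CCA R^2 (first canonical pair):",
      np.corrcoef(u[:, 0], v[:, 0])[0, 1] ** 2)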
Discussion

Our Case Study demonstrates the twofold benefits of modeling human dynamics with vivid synthetic populations. Using Likeness, we are able both to produce a more reasoned estimate of the neighborhoods in which people reside and interact than existing synthetic population frameworks, and to support more nuanced characterization of human activities at specific POIs (e.g., social contact networks, occupancy).

The examples provided in the Case Study show how this refined understanding of human dynamics can benefit planning applications. For example, in the event of a localized emergency, the results of Students could be used to examine schools for which rendezvous with caregivers might pose an added challenge to students (e.g., more students from single caregiver vs. married family households). Additionally, the POI occupancy dynamics demonstrated in Workers (Educators) could be used to assess the times at which worker commutes to/from places of employment might be most sensitive to a nearby disruption. Another application in the public health sphere might be to use occupancy estimates to anticipate the best time of day to reach workers, for example during a vaccination campaign.

Our case study had several limitations that we plan to overcome in future work. First, we assumed that all travel within our study area occurs along road networks. While road-based travel is the dominant means of travel in the Knoxville CBSA, this assumption is not transferable to other urban areas within the United States. Our eventual goal is to build in additional modes of travel like public transit, walk/bike, and ferries by expanding our ingest of OpenStreetMap features. Second, we do not yet offer direct support for non-traditional schools (e.g., populations with special needs, families on military bases). For example, the Tennessee School for the Deaf falls within our study area, and its compositional estimate could be refined if we reapportioned the students most likely in attendance to that location. Third, we did not account for teachers in virtual schools, which may form a portion of the missing work arrival times discussed in Workers (Educators). Work-from-home populations can be better incorporated into our travel simulations by applying work schedules from time-use surveys to probabilistically assign in-person or remote status based on occupation. We are particularly interested in using this technique with Likeness to better understand changing patterns of life during the COVID-19 pandemic in 2020.

Conclusion

The Likeness toolkit enhances agent creation for modeling human dynamics through its dual capabilities of high-fidelity ("vivid") agent characterization and travel along real-world transportation networks to POIs. These capabilities benefit planners and urban researchers by providing a richer understanding of how spatial policy interventions can be designed with respect to how people live, move, and interact. Likeness strives to be flexible toward a variety of research applications linked to human security, among them spatial epidemiology, transportation equity, and environmental hazards.

Several ongoing developments will further Likeness' capabilities. First, we plan to expand our support for POIs curated by location services (e.g., Google, Facebook, Here, TomTom, FourSquare) via the ORNL PlanetSense project [TBP+ 15], incorporating factors like facility size, hours of operation, and popularity curves to refine the destination capacity estimates required to perform actlike simulations. Second, along with multi-modal travel, we plan to incorporate multiple trip models based on large-scale human activity datasets like the American Time Use Survey¹¹ and National Household Travel Survey¹². Together, these improvements will extend our travel simulations to "non-obligate" population segments traveling to civic, social, and recreational activities [BMWR22]. Third, the current procedure for spatial allocation uses block groups as the target scale for population synthesis. However, there are a limited number of constraining variables available at the block group level. To include a larger volume of constraints (e.g., vehicle access, language), we are exploring an additional tract-level approach. P-MEDM in this case is run on cross-covariances between tracts and "supertract" aggregations created with the Max-p-regions problem [DAR12], [WRK21] implemented in PySAL's spopt [RA07], [FGK+ 21], [RAA+ 21], [FBG+ 22].

As a final note, the Likeness toolkit is being developed on top of key open source dependencies in the Scientific Python ecosystem, the core of which are, of course, numpy [HMvdW+ 20] and scipy [VGO+ 20]. Although an exhaustive list would be prohibitive, major packages not previously mentioned include geopandas [JdBF+ 21], matplotlib [Hun07], networkx [HSS08], pandas [pdt20], [WM10], and shapely [G+]. Our goal is to contribute to the community with releases of the packages comprising Likeness, but since this is an emerging project its development to date has been limited to researchers at ORNL. However, we plan to provide a fully open-sourced code base within the coming year through GitHub¹³.

11. https://www.bls.gov/tus
12. https://nhts.ornl.gov
13. https://github.com/ORNL

Acknowledgements

This material is based upon work supported by the U.S. Department of Energy under contract no. DE-AC05-00OR22725.

References

[ANM+ 18] H.M. Abdul Aziz, Nicholas N. Nagle, April M. Morton, Michael R. Hilliard, Devin A. White, and Robert N. Stewart. Exploring the impact of walk–bike infrastructure, safety [GFH20] James D. Gaboardi, David C. Folch, and Mark W. Horner. perception, and built-environment on active transportation Connecting Points to Spatial Networks: Effects on Discrete mode choice: a random parameter model using New York Optimization Models.
Geographical Analysis, 52(2):299–322, City commuter data. Transportation, 45(5):1207–1229, 2018. 2020. doi:10.1111/gean.12211. doi:10.1007/s11116-017-9760-8. [GLC+ 15] Isabella Gollini, Binbin Lu, Martin Charlton, Christopher [BBE+ 08] Christopher L. Barrett, Keith R. Bisset, Stephen G. Eubank, Brunsdon, and Paul Harris. GWmodel: An R package for Xizhou Feng, and Madhav V. Marathe. EpiSimdemics: an ef- exploring spatial heterogeneity using geographically weighted ficient algorithm for simulating the spread of infectious disease models. Journal of Statistical Software, 63(17):1–50, 2015. over large realistic social networks. In SC’08: Proceedings of doi:10.18637/jss.v063.i17. the 2008 ACM/IEEE Conference on Supercomputing, pages [GT22] James D. Gaboardi and Joseph V. Tuccillo. Simulating Travel 1–12. IEEE, 2008. doi:10.1109/SC.2008.5214892. to Points of Interest for Demographically-rich Synthetic Popu- [BBM96] Richard J. Beckman, Keith A. Baggerly, and Michael D. lations, February 2022. American Association of Geographers McKay. Creating synthetic baseline populations. Transporta- Annual Meeting. doi:10.5281/zenodo.6335783. tion Research Part A: Policy and Practice, 30(6):415–429, [Hew97] Kenneth Hewitt. Vulnerability Perspectives: the Human Ecol- 1996. doi:10.1016/0965-8564(96)00004-3. ogy of Endangerment. In Regions of Risk: A Geographical [BCD+ 06] Dimitris Ballas, Graham Clarke, Danny Dorling, Jan Rigby, Introduction to Disasters, chapter 6, pages 141–164. Addison and Ben Wheeler. Using geographical information systems and Wesley Longman, 1997. spatial microsimulation for the analysis of health inequalities. [HHSB12] Kirk Harland, Alison Heppenstall, Dianna Smith, and Mark H. Health Informatics Journal, 12(1):65–79, 2006. doi:10. Birkin. Creating realistic synthetic populations at varying 1177/1460458206061217. spatial scales: A comparative critique of population synthesis [BFH+ 17] Komal Basra, M. Patricia Fabian, Raymond R. Holberger, techniques. Journal of Artificial Societies and Social Simula- Robert French, and Jonathan I. Levy. Community-engaged tion, 15(1):1, 2012. doi:10.18564/jasss.1909. modeling of geographic and demographic patterns of mul- [Hit41] Frank L. Hitchcock. The Distribution of a Product from tiple public health risk factors. International Journal of Several Sources to Numerous Localities. Journal of Mathe- Environmental Research and Public Health, 14(7):730, 2017. matics and Physics, 20(1-4):224–230, 1941. doi:10.1002/ doi:10.3390/ijerph14070730. sapm1941201224. [BMWR22] Christa Brelsford, Jessica J. Moehl, Eric M. Weber, and [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Amy N. Rose. Segmented Population Models: Improving the Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric LandScan USA Non-Obligate Population Estimate (NOPE). Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, American Association of Geographers 2022 Annual Meeting, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk- 2022. wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, [Boe17] Geoff Boeing. OSMnx: New methods for acquiring, con- Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin structing, analyzing, and visualizing complex street networks. Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Computers, Environment and Urban Systems, 65:126–139, Christoph Gohlke, and Travis E. Oliphant. Array programming September 2017. doi:10.1016/j.compenvurbsys. with NumPy. Nature, 585(7825):357–362, September 2020. 2017.05.004. 
doi:10.1038/s41586-020-2649-2. [CGSdG08] Isabel Correia, Luís Gouveia, and Francisco Saldanha-da [HNB+ 11] Jan A.C. Hontelez, Nico Nagelkerke, Till Bärnighausen, Roel Gama. Solving the variable size bin packing problem Bakker, Frank Tanser, Marie-Louise Newell, Mark N. Lurie, with discretized formulations. Computers & Operations Re- Rob Baltussen, and Sake J. de Vlas. The potential impact of search, 35(6):2103–2113, June 2008. doi:10.1016/j. RV144-like vaccines in rural South Africa: a study using the cor.2006.10.014. STDSIM microsimulation model. Vaccine, 29(36):6100–6106, 2011. doi:10.1016/j.vaccine.2011.06.059. [CLB09] Fuyuan Cao, Jiye Liang, and Liang Bai. A new initialization method for categorical data clustering. Expert Systems with [HSS08] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Applications, 36(7):10223–10228, 2009. doi:10.1016/j. Exploring Network Structure, Dynamics, and Function using eswa.2009.01.060. NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science [DAR12] Juan C. Duque, Luc Anselin, and Sergio J. Rey. THE MAX- Conference, pages 11 – 15, Pasadena, CA USA, 2008. URL: P-REGIONS PROBLEM*. Journal of Regional Science, https://www.osti.gov/biblio/960616. 52(3):397–419, 2012. doi:10.1111/j.1467-9787. [Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Com- 2011.00743.x. puting in Science & Engineering, 9(3):90–95, 2007. doi: [DKA+ 08] M. Diaz, J.J. Kim, G. Albero, S. De Sanjose, G. Clifford, F.X. 10.1109/MCSE.2007.55. Bosch, and S.J. Goldie. Health and economic impact of HPV [JdBF+ 21] Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann, 16 and 18 vaccination and cervical cancer screening in India. James McBride, Jacob Wasserman, Adrian Garcia Badaracco, British Journal of Cancer, 99(2):230–238, 2008. doi:10. Jeffrey Gerard, Alan D. Snow, Jeff Tratner, Matthew Perry, 1038/sj.bjc.6604462. Carson Farmer, Geir Arne Hjelle, Micah Cochran, Sean [dV21] Nelis J. de Vos. kmodes categorical clustering library. https: Gillies, Lucas Culbertson, Matt Bartos, Brendan Ward, Gia- //github.com/nicodv/kmodes, 2015–2021. como Caria, Mike Taves, Nick Eubank, sangarshanan, John [FBG+ 22] Xin Feng, Germano Barcelos, James D. Gaboardi, Elijah Flavin, Matt Richards, Sergio Rey, maxalbert, Aleksey Bi- Knaap, Ran Wei, Levi J. Wolf, Qunshan Zhao, and Sergio J. logur, Christopher Ren, Dani Arribas-Bel, Daniel Mesejo- Rey. spopt: a python package for solving spatial optimization León, and Leah Wasser. geopandas/geopandas: v0.10.2, Octo- problems in PySAL. Journal of Open Source Software, ber 2021. doi:10.5281/zenodo.5573592. 7(74):3330, 2022. doi:10.21105/joss.03330. [Kna78] Thomas R. Knapp. Canonical Correlation Analysis: A general [FGK+ 21] Xin Feng, James D. Gaboardi, Elijah Knaap, Sergio J. Rey, parametric significance-testing system. Psychological Bulletin, and Ran Wei. pysal/spopt, jan 2021. URL: https://github.com/ 85(2):410–416, 1978. doi:10.1037/0033-2909.85. pysal/spopt, doi:10.5281/zenodo.4444156. 2.410. [FL86] D.K. Friesen and M.A. Langston. Variable Sized Bin Packing. [Koo49] Tjalling C. Koopmans. Optimum Utilization of the Transporta- SIAM Journal on Computing, 15(1):222–230, February 1986. tion System. Econometrica, 17:136–146, 1949. Publisher: doi:10.1137/0215016. [Wiley, Econometric Society]. doi:10.2307/1907301. [FW12] Fletcher Foti and Paul Waddell. A Generalized Com- [LB13] Robin Lovelace and Dimitris Ballas. 
‘Truncate, replicate, putational Framework for Accessibility: From the Pedes- sample’: A method for creating integer weights for spa- trian to the Metropolitan Scale. In Transportation Re- tial microsimulation. Computers, Environment and Urban search Board Annual Conference, pages 1–14, 2012. Systems, 41:1–11, September 2013. doi:10.1016/j. URL: https://onlinepubs.trb.org/onlinepubs/conferences/2012/ compenvurbsys.2013.03.004. 4thITM/Papers-A/0117-000062.pdf. [LNB13] Stefan Leyk, Nicholas N. Nagle, and Barbara P. Buttenfield. [G+ ] Sean Gillies et al. Shapely: manipulation and analysis of Maximum Entropy Dasymetric Modeling for Demographic geometric objects, 2007–. URL: https://github.com/shapely/ Small Area Estimation. Geographical Analysis, 45(3):285– shapely. 306, July 2013. doi:10.1111/gean.12011. 134 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [MCB+ 08] Karyn Morrissey, Graham Clarke, Dimitris Ballas, Stephen Scan USA 2016 [Data set]. Technical report, Oak Ridge Hynes, and Cathal O’Donoghue. Examining access to GP National Laboratory, 2017. doi:10.48690/1523377. services in rural Ireland using microsimulation analysis. Area, [SEM14] Samarth Swarup, Stephen G. Eubank, and Madhav V. Marathe. 40(3):354–364, 2008. doi:10.1111/j.1475-4762. Computational epidemiology as a challenge domain for multi- 2008.00844.x. agent systems. In Proceedings of the 2014 international con- [MNP+ 17] April M. Morton, Nicholas N. Nagle, Jesse O. Piburn, ference on Autonomous agents and multi-agent systems, pages Robert N. Stewart, and Ryan McManamay. A hybrid dasy- 1173–1176, 2014. URL: https://www.ifaamas.org/AAMAS/ metric and machine learning approach to high-resolution aamas2014/proceedings/aamas/p1173.pdf. residential electricity consumption modeling. In Advances [SNGJ+ 09] Beate Sander, Azhar Nizam, Louis P. Garrison Jr., Maarten J. in Geocomputation, pages 47–58. Springer, 2017. doi: Postma, M. Elizabeth Halloran, and Ira M. Longini Jr. Eco- 10.1007/978-3-319-22786-3_5. nomic evaluation of influenza pandemic mitigation strate- [MOD11] Stuart Mitchell, Michael O’Sullivan, and Iain gies in the United States using a stochastic microsimulation Dunning. PuLP: A Linear Programming Toolkit transmission model. Value in Health, 12(2):226–233, 2009. for Python. Technical report, 2011. URL: doi:10.1111/j.1524-4733.2008.00437.x. https://www.dit.uoi.gr/e-class/modules/document/file.php/ [SPH11] Dianna M. Smith, Jamie R. Pearce, and Kirk Harland. Can 216/PAPERS/2011.%20PuLP%20-%20A%20Linear% a deterministic spatial microsimulation model provide reli- 20Programming%20Toolkit%20for%20Python.pdf. able small-area estimates of health behaviours? An example [MPN+ 17] April M. Morton, Jesse O. Piburn, Nicholas N. Nagle, H.M. of smoking prevalence in New Zealand. Health & Place, Aziz, Samantha E. Duchscherer, and Robert N. Stewart. A 17(2):618–624, 2011. doi:10.1016/j.healthplace. simulation approach for modeling high-resolution daytime 2011.01.001. commuter travel flows and distributions of worker subpopula- [ST20] Haroldo G. Santos and Túlio A.M. Toffolo. Mixed Integer Lin- tions. In GeoComputation 2017, Leeds, UK, pages 1–5, 2017. ear Programming with Python. Technical report, 2020. URL: URL: http://www.geocomputation.org/2017/papers/44.pdf. https://python-mip.readthedocs.io/_/downloads/en/latest/pdf/. [MS01] Harvey J. Miller and Shih-Lung Shaw. Geographic Informa- [TBP+ 15] Gautam S. Thakur, Budhendra L. Bhaduri, Jesse O. Piburn, tion Systems for Transportation: Principles and Applications. Kelly M. 
Sims, Robert N. Stewart, and Marie L. Urban. Oxford University Press, New York, 2001. PlanetSense: a real-time streaming and spatio-temporal an- [MS15] Harvey J. Miller and Shih-Lung Shaw. Geographic Informa- alytics platform for gathering geo-spatial intelligence from tion Systems for Transportation in the 21st Century. Geogra- open source data. In Proceedings of the 23rd SIGSPATIAL phy Compass, 9(4):180–189, 2015. doi:10.1111/gec3. International Conference on Advances in Geographic Informa- 12204. tion Systems, pages 1–4, 2015. doi:10.1145/2820783. [NBLS14] Nicholas N. Nagle, Barbara P. Buttenfield, Stefan Leyk, and 2820882. Seth Spielman. Dasymetric modeling and uncertainty. Annals [TCR08] Melanie N. Tomintz, Graham P. Clarke, and Janette E. Rigby. of the Association of American Geographers, 104(1):80–95, The geography of smoking in Leeds: estimating individual 2014. doi:10.1080/00045608.2013.843439. smoking rates and the implications for the location of stop [NCA13] Markku Nurhonen, Allen C. Cheng, and Kari Auranen. Pneu- smoking services. Area, 40(3):341–353, 2008. doi:10. mococcal transmission and disease in silico: a microsimu- 1111/j.1475-4762.2008.00837.x. lation model of the indirect effects of vaccination. PloS [TG22] Joseph V. Tuccillo and James D. Gaboardi. Connecting Vivid one, 8(2):e56079, 2013. doi:10.1371/journal.pone. Population Data to Human Dynamics, June 2022. Distilling 0056079. Diversity by Tapping High-Resolution Population and Survey [NLHH07] Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Data. doi:10.5281/zenodo.6607533. Zengyou He. On the impact of dissimilarity measure in [TM21] Joseph V. Tuccillo and Jessica Moehl. An Individual- k-modes clustering algorithm. IEEE Transactions on Pat- Oriented Typology of Social Areas in the United States, May tern Analysis and Machine Intelligence, 29(3):503–507, 2007. 2021. 2021 ACS Data Users Conference. doi:10.5281/ doi:10.1109/TPAMI.2007.53. zenodo.6672291. [pdt20] The pandas development team. pandas-dev/pandas: Pandas, [TMKD17] Matthias Templ, Bernhard Meindl, Alexander Kowarik, and February 2020. doi:10.5281/zenodo.3509134. Olivier Dupriez. Simulation of synthetic complex data: The [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, R package simPop. Journal of Statistical Software, 79:1–38, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, 2017. doi:10.18637/jss.v079.i10. V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, [Tuc21] Joseph V. Tuccillo. An Individual-Centered Approach for M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Geodemographic Classification. In 11th International Con- Machine Learning in Python. Journal of Machine Learning ference on Geographic Information Science 2021 Short Paper Research, 12:2825–2830, 2011. URL: https://www.jmlr.org/ Proceedings, pages 1–6, 2021. doi:10.25436/E2H59M. papers/v12/pedregosa11a.html. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt [QC13] Fang Qiu and Robert Cromley. Areal Interpolation and Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Dasymetric Modeling: Areal Interpolation and Dasymetric Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- Modeling. Geographical Analysis, 45(3):213–215, July 2013. fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- doi:10.1111/gean.12016. rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric [RA07] Sergio J. Rey and Luc Anselin. PySAL: A Python Library of Jones, Robert Kern, Eric Larson, C.J. Carey, İlhan Polat, Spatial Analytical Methods. 
The Review of Regional Studies, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, 37(1):5–27, 2007. URL: https://rrs.scholasticahq.com/article/ Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quin- 8285.pdf, doi:10.52324/001c.8285. tero, Charles R. Harris, Anne M. Archibald, Antônio H. [RAA+ 21] Sergio J. Rey, Luc Anselin, Pedro Amaral, Dani Arribas- Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy Bel, Renan Xavier Cortes, James David Gaboardi, Wei Kang, 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Elijah Knaap, Ziqi Li, Stefanie Lumnitz, Taylor M. Oshan, Scientific Computing in Python. Nature Methods, 17:261–272, Hu Shao, and Levi John Wolf. The PySAL Ecosystem: 2020. doi:10.1038/s41592-019-0686-2. Philosophy and Implementation. Geographical Analysis, 2021. [WCC+ 09] William D. Wheaton, James C. Cajka, Bernadette M. Chas- doi:10.1111/gean.12276. teen, Diane K. Wagener, Philip C. Cooley, Laxminarayana [RSF+ 21] Krishna P. Reddy, Fatma M. Shebl, Julia H.A. Foote, Guy Ganapathi, Douglas J. Roberts, and Justine L. Allpress. Harling, Justine A. Scott, Christopher Panella, Kieran P. Fitz- Synthesized population databases: A US geospatial database maurice, Clare Flanagan, Emily P. Hyle, Anne M. Neilan, et al. for agent-based models. Methods report (RTI Press), Cost-effectiveness of public health strategies for COVID-19 2009(10):905, 2009. doi:10.3768/rtipress.2009. epidemic control in South Africa: a microsimulation modelling mr.0010.0905. study. The Lancet Global Health, 9(2):e120–e129, 2021. [WM10] Wes McKinney. Data Structures for Statistical Computing in doi:10.1016/S2214-109X(20)30452-6. Python. In Stéfan van der Walt and Jarrod Millman, editors, [RWM+ 17] Amy N. Rose, Eric M. Weber, Jessica J. Moehl, Melanie L. Proceedings of the 9th Python in Science Conference, pages 56 Laverdiere, Hsiu-Han Yang, Matthew C. Whitehead, Kelly M. – 61, 2010. doi:10.25080/Majora-92bf1922-00a. Sims, Nathan E. Trombley, and Budhendra L. Bhaduri. Land- [WRK21] Ran Wei, Sergio J. Rey, and Elijah Knaap. Efficient re- LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 135 gionalization for spatially explicit neighborhood delineation. International Journal of Geographical Information Science, 35(1):135–151, 2021. doi:10.1080/13658816.2020. 1759806. [ZFJ14] Yi Zhu and Joseph Ferreira Jr. Synthetic population gener- ation at disaggregated spatial scales for land use and trans- portation microsimulation. Transportation Research Record, 2429(1):168–177, 2014. doi:10.3141/2429-18. 136 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) poliastro: a Python library for interactive astrodynamics Juan Luis Cano Rodríguez‡∗ , Jorge Martínez Garrido‡ https://www.youtube.com/watch?v=VCpTgU1pb5k F Abstract—Space is more popular than ever, with the growing public awareness problem. This work was generalized by Newton to give birth to of interplanetary scientific missions, as well as the increasingly large number the n-body problem, and many other mathematicians worked on of satellite companies planning to deploy satellite constellations. Python has it throughout the centuries (Daniel and Johann Bernoulli, Euler, become a fundamental technology in the astronomical sciences, and it has also Gauss). Poincaré established in the 1890s that no general closed- caught the attention of the Space Engineering community. 
form solution exists for the n-body problem, since the resulting One of the requirements for designing a space mission is studying the trajectories of satellites, probes, and other artificial objects, usually ignoring dynamical system is chaotic [Bat99]. Sundman proved in the non-gravitational forces or treating them as perturbations: the so-called n-body 1900s the existence of convergent solutions for a few restricted problem. However, for preliminary design studies and most practical purposes, it with n = 3. is sufficient to consider only two bodies: the object under study and its attractor. M = E − e sin E (1) Even though the two-body problem has many analytical solutions, or- In 1903 Tsiokovsky evaluated the conditions required for artificial bit propagation (the initial value problem) and targeting (the boundary value problem) remain computationally intensive because of long propagation times, objects to leave the orbit of the earth; this is considered as a foun- tight tolerances, and vast solution spaces. On the other hand, astrodynamics dational contribution to the field of astrodynamics. Tsiokovsky researchers often do not share the source code they used to run analyses and devised equation 2 which relates the increase in velocity with the simulations, which makes it challenging to try out new solutions. effective exhaust velocity of thrusted gases and the fraction of used This paper presents poliastro, an open-source Python library for interactive propellant. m0 astrodynamics that features an easy-to-use API and tools for quick visualization. ∆v = ve ln (2) poliastro implements core astrodynamics algorithms (such as the resolution mf of the Kepler and Lambert problems) and leverages numba, a Just-in-Time Further developments by Kondratyuk, Hohmann, and Oberth in compiler for scientific Python, to optimize the running time. Thanks to Astropy, the early 20th century all added to the growing field of orbital poliastro can perform seamless coordinate frame conversions and use proper mechanics, which in turn enabled the development of space flight physical units and timescales. At the moment, poliastro is the longest-lived Python library for astrodynamics, has contributors from all around the world, in the USSR and the United States in the 1950s and 1960s. and several New Space companies and people in academia use it. The two-body problem In a system of i ∈ 1, ..., n bodies subject to their mutual attraction, Index Terms—astrodynamics, orbital mechanics, orbit propagation, orbit visu- alization, two-body problem by application of Newton’s law of universal gravitation, the total force fi affecting mi due to the presence of the other n − 1 masses is given by [Bat99]: Introduction n mi m j fi = −G ∑ r 3 ij (3) History j6=i |ri j | The term "astrodynamics" was coined by the American as- where G = 6.67430 · 10−11 N m2 kg−2 is the universal gravita- tronomer Samuel Herrick, who received encouragement from tional constant, and ri j denotes the position vector from mi to m j . the space pioneer Robert H. Goddard, and refers to the branch Applying Newton’s second law of motion results in a system of n of space science dealing with the motion of artificial celestial differential equations: bodies ([Dub73], [Her71]). However, the roots of its mathematical foundations go back several centuries. 
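To make the transcendental equation (1) concrete, here is a minimal Newton iteration for recovering the eccentric anomaly E from the mean anomaly M and the eccentricity e; this is only an illustrative sketch, not poliastro's own (optimized, Numba-compiled) routine.

import math

def eccentric_anomaly(M, ecc, tol=1e-12, maxiter=50):
    """Solve Kepler's equation M = E - e*sin(E) for E with Newton's method."""
    E = M if ecc < 0.8 else math.pi  # standard starting guess
    for _ in range(maxiter):
        delta = (E - ecc * math.sin(E) - M) / (1 - ecc * math.cos(E))
        E -= delta
        if abs(delta) < tol:
            return E
    raise RuntimeError("Newton iteration did not converge")

E = eccentric_anomaly(M=1.0, ecc=0.3)
print(E, E - 0.3 * math.sin(E))  # the second value recovers M = 1.0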
d2 ri n mj 2 = −G ∑ r 3 ij (4) Kepler first introduced his laws of planetary motion in 1609 dt j6=i i j | |r and 1619 and derived his famous transcendental equation (1), By setting n = 2 in 4 and subtracting the two resulting equali- which we now see as capturing a restricted form of the two-body ties, one arrives to the fundamental equation of the two-body problem: * Corresponding author: hello@juanlu.space ‡ Unaffiliated d2 r µ =− 3r (5) dt 2 r Copyright © 2022 Juan Luis Cano Rodríguez et al. This is an open-access where µ = G(m1 + m2 ) = G(M + m). When m M (for example, article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any an artificial satellite orbiting a planet), one can consider µ = GM medium, provided the original author and source are credited. a property of the attractor. POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 137 Keplerian vs non-keplerian motion State of the art Conveniently manipulating equation 5 leads to several properties In our view, at the time of creating poliastro there were a number [Bat99] that were already published by Johannes Kepler in the of issues with existing open source astrodynamics software that 1610s, namely: posed a barrier of entry for novices and amateur practitioners. 1) The orbit always describes a conic section (an ellipse, a Most of these barriers still exist today and are described in the parabola, or an hyperbola), with the attractor at one of following paragraphs. The goals of the project can be condensed the two foci and can be written in polar coordinates like as follows: r = 1+epcos ν (Kepler’s first law). 1) Set an example on reproducibility and good coding prac- 2) The magnitude of the specific angular momentum h = tices in astrodynamics. r2 ddtθ is constant an equal to two times the areal velocity 2) Become an approachable software even for novices. (Kepler’s second law). 3) Offer a performant software that can be also used in 3) For closed (circular and elliptical) orbits, the periodq is scripting and interactive workflows. 3 related to the size of the orbit through P = 2π aµ (Kepler’s third law). The most mature software libraries for astrodynamics are arguably Orekit [noa22c], a "low level space dynamics library For many practical purposes it is usually sufficient to limit written in Java" with an open governance model, and SPICE the study to one object orbiting an attractor and ignore all other [noa22d], a toolkit developed by NASA’s Navigation and An- external forces of the system, hence restricting the study to cillary Information Facility at the Jet Propulsion Laboratory. trajectories governed by equation 5. Such trajectories are called Other similar, smaller projects that appeared later on and that "Keplerian", and several problems can be formulated for them: are still maintained to this day include PyKEP [IBD+ 20], be- • The initial-value problem, which is usually called prop- yond [noa22a], tudatpy [noa22e], sbpy [MKDVB+ 19], Skyfield agation, involves determining the position and velocity of [Rho20] (Python), CelestLab (Scilab) [noa22b], astrodynamics.jl an object after an elapse period of time given some initial (Julia) [noa] and Nyx (Rust) [noa21a]. In addition, there are conditions. 
some Graphical User Interface (GUI) based open source programs • Preliminary orbit determination, which involves using used for Mission Analysis and orbit visualization, such as GMAT exact or approximate methods to derive a Keplerian orbit [noa20] and gpredict [noa18], and complete web applications for from a set of observations. tracking constellations of satellites like the SatNOGS project by • The boundary-value problem, often named the Lambert the Libre Space Foundation [noa21b]. problem, which involves determining a Keplerian orbit The level of quality and maintenance of these packages is from boundary conditions, usually departure and arrival somewhat heterogeneous. Community-led projects with a strong position vectors and a time of flight. corporate backing like Orekit are in excellent health, while on the other hand smaller projects developed by volunteers (beyond, Fortunately, most of these problems boil down to finding astrodynamics.jl) or with limited institutional support (PyKEP, numerical solutions to relatively simple algebraic relations be- GMAT) suffer from lack of maintenance. Part of the problem tween time and angular variables: for elliptic motion (0 ≤ e < 1) might stem from the fact that most scientists are never taught how it is the Kepler equation, and equivalent relations exist for the to build software efficiently, let alone the skills to collaboratively other eccentricity regimes [Bat99]. Numerical solutions for these develop software in the open [WAB+ 14], and astrodynamicists are equations can be found in a number of different ways, each one no exception. with different complexity and precision tradeoffs. In the Methods On the other hand, it is often difficult to translate the advances section we list the ones implemented by poliastro. in astrodynamics research to software. Classical algorithms devel- On the other hand, there are many situations in which natural oped throughout the 20th century are described in papers that are and artificial orbital perturbations must be taken into account so sometimes difficult to find, and source code or validation data that the actual non-Keplerian motion can be properly analyzed: is almost never available. When it comes to modern research • Interplanetary travel in the proximity of other planets. On carried in the digital era, source code and validation data is a first approximation it is usually enough to study the still difficult, even though they are supposedly provided "upon trajectory in segments and focus the analysis on the closest reasonable request" [SSM18] [GBP22]. attractor, hence patching several Keplerian orbits along It is no surprise that astrodynamics software often requires the way (the so-called "patched-conic approximation") deep expertise. However, there are often implicit assumptions that [Bat99]. The boundary surface that separates one segment are not documented with an adequate level of detail which orig- from the other is called the sphere of influence. inate widespread misconceptions and lead even seasoned profes- • Use of solar sails, electric propulsion, or other means sionals to make conceptual mistakes. Some of the most notorious of continuous thrust. Devising the optimal guidance laws misconceptions arise around the use of general perturbations data that minimize travel time or fuel consumption under these (OMMs and TLEs) [Fin07], the geometric interpretation of the conditions is usually treated as an optimization problem mean anomaly [Bat99], or coordinate transformations [VCHK06]. 
of a dynamical system, and as such it is particularly Finally, few of the open source software libraries mentioned challenging [Con14]. above are amenable to scripting or interactive use, as promoted by • Artificial satellites in the vicinity of a planet. This is computational notebooks like Jupyter [KRKP+ 16]. the regime in which all the commercial space industry The following sections will now discuss the various areas of operates, especially for those satellites in Low-Earth Orbit current research that an astrodynamicist will engage in, and how (LEO). poliastro improves their workflow. 138 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Methods Nice, high level API Software Architecture The architecture of poliastro emerges from the following set of conflicting requirements: Dangerous™ algorithms 1) There should be a high-level API that enables users to perform orbital calculations in a straightforward way and Fig. 1: poliastro two-layer architecture prevent typical mistakes. 2) The running time of the algorithms should be within the Most of the methods of the High level API consist only same order of magnitude of existing compiled implemen- of the necessary unit compatibility checks, plus a wrapper over tations. the corresponding Core API function that performs the actual 3) The library should be written in a popular open-source computation. language to maximize adoption and lower the barrier to @u.quantity_input(E=u.rad, ecc=u.one) external contributors. def E_to_nu(E, ecc): """True anomaly from eccentric anomaly.""" One of the most typical mistakes we set ourselves to prevent return ( with the high-level API is dimensional errors. Addition and E_to_nu_fast( E.to_value(u.rad), substraction operations of physical quantities are defined only for ecc.value quantities with the same units [Dro53]: for example, the operation ) << u.rad 1 km + 100 m requires a scale transformation of at least one ).to(E.unit) of the operands, since they have different units (kilometers and As a result, poliastro offers a unit-safe API that performs the least meters) but the same dimension (length), whereas the operation amount of computation possible to minimize the performance 1 km + 1 kg is directly not allowed because dimensions are penalty of unit checks, and also a unit-unsafe API that offers incompatible (length and mass). As such, software systems oper- maximum performance at the cost of not performing any unit ating with physical quantities should raise exceptions when adding validation checks. different dimensions, and transparently perform the required scale Finally, there are several options to write performant code that transformations when adding different units of the same dimen- can be used from Python, and one of them is using a fast, compiled sion. language for the CPU intensive parts. Successful examples of this With this in mind, we evaluated several Python packages for include NumPy, written in C [HMvdW+ 20], SciPy, featuring a unit handling (see [JGAZJT+ 18] for a recent survey) and chose mix of FORTRAN, C, and C++ code [VGO+ 20], and pandas, astropy.units [TPWS+ 18]. making heavy use of Cython [BBC+ 11]. However, having to radius = 6000 # km write code in two different languages hinders the development altitude = 500 # m speed, makes debugging more difficult, and narrows the potential # Wrong! contributor base (what Julia creators called "The Two Language distance = radius + altitude Problem" [BEKS17]). 
As authors of poliastro we wanted to use Python as the from astropy import units as u sole programming language of the implementation, and the best # Correct solution we found to improve its performance was to use Numba, distance = (radius << u.km) + (altitude << u.m) a LLVM-based Python JIT compiler [LPS15]. This notion of providing a "safe" API extends to other parts Usage of the library by leveraging other capabilities of the Astropy Basic Orbit and Ephem creation project. For example, timestamps use astropy.time objects, which take care of the appropriate handling of time scales The two central objects of the poliastro high level API are Orbit (such as TDB or UTC), reference frame conversions leverage and Ephem: astropy.coordinates, and so forth. • Orbit objects represent an osculating (hence Keplerian) One of the drawbacks of existing unit packages is that orbit of a dimensionless object around an attractor at a they impose a significant performance penalty. Even though given point in time and a certain reference frame. astropy.units is integrated with NumPy, hence allowing • Ephem objects represent an ephemerides, a sequence of the creation of array quantities, all the unit compatibility checks spatial coordinates over a period of time in a certain are implemented in Python and require lots of introspection, and reference frame. this can slow down mathematical operations by several orders of There are six parameters that uniquely determine a Keplerian magnitude. As such, to fulfill our desired performance requirement orbit, plus the gravitational parameter of the corresponding attrac- for poliastro, we envisioned a two-layer architecture: tor (k or µ). Optionally, an epoch that contextualizes the orbit • The Core API follows a procedural style, and all the can be included as well. This set of six parameters is not unique, functions receive Python numerical types and NumPy and several of them have been developed over the years to serve arrays for maximum performance. different purposes. The most widely used ones are: • The High level API is object-oriented, all the methods • Cartesian elements: Three components for the position receive Astropy Quantity objects with physical units, (x, y, z) and three components for the velocity (vx , vy , vz ). and computations are deferred to the Core API. This set has no singularities. POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 139 • Classical Keplerian elements: Two components for the shape of the conic (usually the semimajor axis a or from poliastro.ephem import Ephem semiparameter p and the eccentricity e), three Euler angles # Configure high fidelity ephemerides globally for the orientation of the orbital plane in space (inclination # (requires network access) i, right ascension of the ascending node Ω, and argument solar_system_ephemeris.set("jpl") of periapsis ω), and one polar angle for the position of the # For predefined poliastro attractors body along the conic (usually true anomaly f or ν). This earth = Ephem.from_body(Earth, Time.now().tdb) set of elements has an easy geometrical interpretation and the advantage that, in pure two-body motion, five of them # For the rest of the Solar System bodies ceres = Ephem.from_horizons("Ceres", Time.now().tdb) are fixed (a, e, i, Ω, ω) and only one is time-dependent (ν), which greatly simplifies the analytical treatment of There are some crucial differences between Orbit and Ephem orbital perturbations. 
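For reference, the creation and propagation steps discussed above can be combined into one short, runnable snippet; the state vectors are the Curtis example values quoted earlier, and the epoch here is an arbitrary assumption.

from astropy import units as u
from astropy.time import Time
from poliastro.bodies import Earth
from poliastro.twobody import Orbit

# State vectors from Curtis, example 4.3 (as quoted above).
r = [-6045, -3490, 2500] << u.km
v = [-3.457, 6.618, 2.533] << u.km / u.s

orb = Orbit.from_vectors(Earth, r, v, epoch=Time("2022-06-05", scale="tdb"))
orb_later = orb.propagate(30 << u.min)   # returns a new Orbit with updated epoch
print(orb)
print(orb_later.nu.to(u.deg))            # true anomaly after 30 minutes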
However, they suffer from singular- objects: ities steming from the Euler angles ("gimbal lock") and • Orbit objects have an attractor, whereas Ephem objects equations expressed in them are ill-conditioned near such do not. Ephemerides can originate from complex trajecto- singularities. ries that don’t necessarily conform to the ideal two-body • Walker modified equinoctial elements: Six parameters problem. (p, f , g, h, k, L). Only L is time-dependent and this set has • Orbit objects capture a precise instant in a two-body mo- no singularities, however the geometrical interpretation of tion plus the necessary information to propagate it forward the rest of the elements is lost [WIO85]. in time indefinitely, whereas Ephem objects represent a Here is how to create an Orbit from cartesian and from clas- bounded time history of a trajectory. This is because the sical Keplerian elements. Walker modified equinoctial elements equations for the two-body motion are known, whereas are supported as well. an ephemeris is either an observation or a prediction from astropy import units as u that cannot be extrapolated in any case without external knowledge. As such, Orbit objects have a .propagate from poliastro.bodies import Earth, Sun method, but Ephem ones do not. This prevents users from from poliastro.twobody import Orbit from poliastro.constants import J2000 attempting to propagate the position of the planets, which will always yield poor results compared to the excellent # Data from Curtis, example 4.3 ephemerides calculated by external entities. r = [-6045, -3490, 2500] << u.km v = [-3.457, 6.618, 2.533] << u.km / u.s Finally, both types have methods to convert between them: • Ephem.from_orbit is the equivalent of sampling a orb_curtis = Orbit.from_vectors( Earth, # Attractor two-body motion over a given time interval. As explained r, v # Elements above, the resulting Ephem loses the information about ) the original attractor. # Data for Mars at J2000 from JPL HORIZONS • Orbit.from_ephem is the equivalent of calculating a = 1.523679 << u.au the osculating orbit at a certain point of a trajectory, ecc = 0.093315 << u.one assuming a given attractor. The resulting Orbit loses inc = 1.85 << u.deg the information about the original, potentially complex raan = 49.562 << u.deg argp = 286.537 << u.deg trajectory. nu = 23.33 << u.deg Orbit propagation orb_mars = Orbit.from_classical( Orbit objects have a .propagate method that takes an elapsed Sun, a, ecc, inc, raan, argp, nu, time and returns another Orbit with new orbital elements and an J2000 # Epoch updated epoch: ) >>> from poliastro.examples import iss When displayed on an interactive REPL, Orbit objects provide >>> iss basic information about the geometry, the attractor, and the epoch: >>> 6772 x 6790 km x 51.6 deg (GCRS) ... >>> orb_curtis 7283 x 10293 km x 153.2 deg (GCRS) orbit >>> iss.nu.to(u.deg) around Earth (X) at epoch J2000.000 (TT) <Quantity 46.59580468 deg> >>> orb_mars >>> iss_30m = iss.propagate(30 << u.min) 1 x 2 AU x 1.9 deg (HCRS) orbit around Sun (X) at epoch J2000.000 (TT) >>> (iss_30m.epoch - iss.epoch).datetime datetime.timedelta(seconds=1800) Similarly, Ephem objects can be created using a variety of class- methods as well. 
Thanks to astropy.coordinates built-in >>> (iss_30m.nu - iss.nu).to(u.deg) <Quantity 116.54513153 deg> low-fidelity ephemerides, as well as its capability to remotely The default propagation algorithm is an analytical procedure access the JPL HORIZONS system, the user can seamlessly build described in [FCM13] that works seamlessly in the near parabolic an object that contains the time history of the position of any Solar System body: region. In addition, poliastro implements analytical propagation from astropy.time import Time algorithms as described in [DB83], [OG86], [Mar95], [Mik87], from astropy.coordinates import solar_system_ephemeris [PP13], [Cha22], and [VM07]. 140 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) rr = propagate( orbit, tofs, method=cowell, f=f, ) Continuous thrust control laws Beyond natural perturbations, spacecraft can modify their trajec- tory on purpose by using impulsive maneuvers (as explained in the next section) as well as continuous thrust guidance laws. The user can define custom guidance laws by providing a perturbation Fig. 2: Osculating (Keplerian) vs perturbed (true) orbit (source: acceleration in the same way natural perturbations are used. In Wikipedia, CC BY-SA 3.0) addition, poliastro includes several analytical solutions for con- tinuous thrust guidance laws with specific purposes, as studied in [CR17]: optimal transfer between circular coplanar orbits [Ede61] Natural perturbations [Bur67], optimal transfer between circular inclined orbits [Ede61] As showcased in Figure 2, at any point in a trajectory we [Kec97], quasi-optimal eccentricity-only change [Pol97], simulta- can define an ideal Keplerian orbit with the same position and neous eccentricity and inclination change [Pol00], and agument of velocity under the attraction of a point mass: this is called the periapsis adjustment [Pol98]. A much more rigorous analysis of a osculating orbit. Some numerical propagation methods exist that similar set of laws can be found in [DCV21]. model the true, perturbed orbit as a deviation from an evolving, from poliastro.twobody.thrust import change_ecc_inc osculating orbit. poliastro implements Cowell’s method [CC10], which consists in adding all the perturbation accelerations and then ecc_f = 0.0 << u.one inc_f = 20.0 << u.deg integrating the resulting differential equation with any numerical f = 2.4e-6 << (u.km / u.s**2) method of choice: d2 r µ a_d, _, t_f = change_ecc_inc(orbit, ecc_f, inc_f, f) 2 = − 3 r + ad (6) dt r The resulting equation is usually integrated using high order Impulsive maneuvers numerical methods, since the integration times are quite large and the tolerances comparatively tight. An in-depth discussion of Impulsive maneuvers are modeled considering a change in the such methods can be found in [HNW09]. poliastro uses Dormand- velocity of a spacecraft while its position remains fixed. The Prince 8(5,3) (DOP853), a commonly used method available in poliastro.maneuver.Maneuver class provides various SciPy [HMvdW+ 20]. constructors to instantiate popular impulsive maneuvers in the There are several natural perturbations included: J2 and J3 framework of the non-perturbed two-body problem: gravitational terms, several atmospheric drag models (exponential, • Maneuver.impulse [Jac77], [AAAA62], [AAA+ 76]), and helpers for third body • Maneuver.hohmann gravitational attraction and radiation pressure as described in [?]. 
• Maneuver.bielliptic @njit • Maneuver.lambert def combined_a_d( t0, state, k, j2, r_eq, c_d, a_over_m, h0, rho0 ): from poliastro.maneuver import Maneuver return ( J2_perturbation( orb_i = Orbit.circular(Earth, alt=700 << u.km) t0, state, k, j2, r_eq hoh = Maneuver.hohmann(orb_i, r_f=36000 << u.km) ) + atmospheric_drag_exponential( t0, state, k, r_eq, c_d, a_over_m, h0, rho0Once instantiated, Maneuver objects provide information regard- ) ing total ∆v and ∆t: ) >>> hoh.get_total_cost() <Quantity 3.6173981270031357 km / s> def f(t0, state, k): du_kep = func_twobody(t0, state, k) >>> hoh.get_total_time() ax, ay, az = combined_a_d( <Quantity 15729.741535747102 s> t0, state, Maneuver objects can be applied to Orbit instances using the k, R=R, apply_maneuver method. C_D=C_D, >>> orb_i A_over_m=A_over_m, 7078 x 7078 km x 0.0 deg (GCRS) orbit H0=H0, around Earth (X) rho0=rho0, J2=Earth.J2.value, >>> orb_f = orb_i.apply_maneuver(hoh) ) >>> orb_f du_ad = np.array([0, 0, 0, ax, ay, az]) 36000 x 36000 km x 0.0 deg (GCRS) orbit around Earth (X) return du_kep + du_ad POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 141 Targeting Earth - Mars for year 2020-2021, C3 launch 2021-05 34.1 Targeting is the problem of finding the orbit connecting two Days of flight .0 .8 .0 Arrival velocity km/s 41.90 434553750..04273 400 24 0 31.0 200 35.7 positions over a finite amount of time. Within the context of 5. 43.4 313.80.8 2021-04 .0 the non-perturbed two-body problem, targeting is just a matter 26 .4 37.24 32.6 41.9 of solving the BVP, also known as Lambert’s problem. Because 24.8 targeting tries to find for an orbit, the problem is included in the 2021-03 32.59 29.5 34.1 37.2 410.9.0 18..86 .9 Initial Orbit Determination field. 30 20.2 3 27 17.1 27.93 The poliastro.iod package contains izzo and 2021-02 Arrival date 23.3 38.8 vallado modules. These provide a lambert function for solv- km2 / s2 15.5 40.3 45.0 23.28 3.8 5 29. 26.4 ing the targeting problem. Nevertheless, a Maneuver.lambert 21.7 2021-01 27.9 constructor is also provided so users can keep taking advantage of 18.62 32.6 Orbit objects. 13.97 2020-12 # Declare departure and arrival datetimes date_launch = time.Time( 9.31 '2011-11-26 15:02', scale='tdb' 5.0 2020-11 ) Perseverance 4.66 .0 Tianwen-1 100 date_arrival = time.Time( '2012-08-06 05:17', scale='tdb' Hope Mars 2020-10 0.00 ) 3 4 5 6 7 8 9 0 0-0 0-0 0-0 0-0 0-0 0-0 0-0 0-1 202 202 202 202 202 202 202 202 # Define initial and final orbits Launch date orb_earth = Orbit.from_ephem( Sun, Ephem.from_body(Earth, date_launch), Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy showing date_launch latest missions to the Martian planet. ) orb_mars = Orbit.from_ephem( Sun, Ephem.from_body(Mars, date_arrival), date_arrival Generated graphics can be static or interactive. The main ) difference between these two is the ability to modify the camera view in a dynamic way when using interactive plotters. # Compute targetting maneuver and apply it man_lambert = Maneuver.lambert(orb_earth, orb_mars) The most important classes in the poliastro.plotting orb_trans, orb_target = ss0.apply_maneuver( package are StaticOrbitPlotter and OrbitPlotter3D. man_lambert, intermediate=true In addition, the poliastro.plotting.misc module con- ) tains the plot_solar_system function, which allows the user Targeting is closely related to quick mission design by means of to visualize inner and outter both in 2D and 3D, as requested by porkchop diagrams. These are contour plots showing all combi- users. 
Targeting

Targeting is the problem of finding the orbit connecting two positions over a finite amount of time. Within the context of the non-perturbed two-body problem, targeting is just a matter of solving the boundary value problem (BVP), also known as Lambert's problem. Because targeting tries to find an orbit, the problem is included in the Initial Orbit Determination field.

The poliastro.iod package contains the izzo and vallado modules. These provide a lambert function for solving the targeting problem. Nevertheless, a Maneuver.lambert constructor is also provided so users can keep taking advantage of Orbit objects.

# Declare departure and arrival datetimes
date_launch = time.Time('2011-11-26 15:02', scale='tdb')
date_arrival = time.Time('2012-08-06 05:17', scale='tdb')

# Define initial and final orbits
orb_earth = Orbit.from_ephem(
    Sun, Ephem.from_body(Earth, date_launch), date_launch
)
orb_mars = Orbit.from_ephem(
    Sun, Ephem.from_body(Mars, date_arrival), date_arrival
)

# Compute the targeting maneuver and apply it
man_lambert = Maneuver.lambert(orb_earth, orb_mars)
orb_trans, orb_target = orb_earth.apply_maneuver(
    man_lambert, intermediate=True
)

Targeting is closely related to quick mission design by means of porkchop diagrams. These are contour plots showing all combinations of departure and arrival dates with the specific energy of each transfer orbit. They allow for quick identification of the optimal transfer dates between two bodies.

The poliastro.plotting.porkchop module provides the PorkchopPlotter class, which allows the user to generate these diagrams.

from poliastro.plotting.porkchop import PorkchopPlotter
from poliastro.util import time_range

# Generate all launch and arrival dates
launch_span = time_range(
    "2020-03-01", end="2020-10-01", periods=150
)
arrival_span = time_range(
    "2020-10-01", end="2021-05-01", periods=150
)

# Create an instance of the porkchop and plot it
porkchop = PorkchopPlotter(
    Earth, Mars, launch_span, arrival_span,
)

The previous code, with some additional customization, generates figure 3.

[Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy, showing the latest missions to Mars (Perseverance, Tianwen-1, and Hope Mars).]

Plotting

For visualization purposes, poliastro provides the poliastro.plotting package, which contains various utilities for generating 2D and 3D graphics using different backends such as matplotlib [Hun07] and Plotly [Inc15]. Generated graphics can be static or interactive. The main difference between the two is the ability to modify the camera view in a dynamic way when using interactive plotters.

The most important classes in the poliastro.plotting package are StaticOrbitPlotter and OrbitPlotter3D. In addition, the poliastro.plotting.misc module contains the plot_solar_system function, which allows the user to visualize the inner and outer planets both in 2D and 3D, as requested by users.

The following example illustrates the plotting capabilities of poliastro. First, the orbits to be plotted are computed and their plotting style is declared:

from poliastro.plotting.misc import plot_solar_system

# Current datetime
now = Time.now().tdb

# Obtain Florence and Halley orbits
florence = Orbit.from_sbdb("Florence")
halley_1835_ephem = Ephem.from_horizons("90000031", now)
halley_1835 = Orbit.from_ephem(
    Sun, halley_1835_ephem, halley_1835_ephem.epochs[0]
)

# Define orbit labels and color styles
florence_style = {"label": "Florence", "color": "#000000"}
halley_style = {"label": "Halley", "color": "#84B0B8"}

The static two-dimensional plot can be created using the following code:

# Generate a static 2D figure
frame2D = plot_solar_system(epoch=now, outer=False)
frame2D.plot(florence, **florence_style)
frame2D.plot(halley_1835, **halley_style)

As a result, figure 4 is obtained.

[Fig. 4: Two-dimensional view of the inner Solar System, Florence, and Halley.]

The interactive three-dimensional plot can be created using the following code:

# Generate an interactive 3D figure
frame3D = plot_solar_system(
    epoch=now, outer=False,
    use_3d=True, interactive=True
)
frame3D.plot(florence, **florence_style)
frame3D.plot(halley_1835, **halley_style)

As a result, figure 5 is obtained.

[Fig. 5: Three-dimensional view of the inner Solar System, Florence, and Halley.]
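The plotter classes named above can also be used directly when only a few orbits need to be drawn. A minimal sketch, assuming the same label and color keywords used in the styles above:

from poliastro.plotting import StaticOrbitPlotter

# Draw a single orbit with the default matplotlib backend
plotter = StaticOrbitPlotter()
plotter.plot(florence, label="Florence", color="#000000")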
Commercial Earth satellites

[Fig. 6: Natural perturbations affecting Low-Earth Orbit (LEO) motion (source: [VM07])]

Figure 6 gives a clear picture of the most important natural perturbations affecting satellites in LEO, namely: the first harmonic of the geopotential field J2 (representing the attractor oblateness), the atmospheric drag, and the higher order harmonics of the geopotential field.

At least the most significant of these perturbations need to be taken into account when propagating LEO orbits, and therefore the methods for purely Keplerian motion are not enough. As seen above, poliastro already implements a number of these perturbations; however, numerical methods are much slower than analytical ones, and this can render them unsuitable for large scale simulations, satellite conjunction assessment, propagation on constrained hardware, and so forth.

To address this issue, semianalytical propagation methods were devised that attempt to strike a balance between the fast running times of analytical methods and the necessary inclusion of perturbation forces. One such family of semianalytical methods is the Simplified General Perturbation (SGP) models, first developed in [HK66] and then refined in [LC69] into what we know these days as the SGP4 propagator [HR80] [VCHK06]. Even though certain elements of the reference frame used by SGP4 are not properly specified [VCHK06], and its accuracy might still be too limited for certain applications [Ko09] [Lar16], it is nowadays the most widely used propagation method, thanks in large part to the dissemination of General Perturbations orbital data by the US 501(c)(3) CelesTrak (which itself obtains it from the 18th Space Defense Squadron of the US Space Force).

The starting point of SGP4 is a special element set that uses Brouwer mean orbital elements [Bro59] plus a ballistic coefficient based on an approximation of the atmospheric drag [LC69], and its results are expressed in a special coordinate system called True Equator Mean Equinox (TEME). Special care needs to be taken to avoid mixing mean elements with osculating elements, and to convert the output of the propagation to the appropriate reference frame.

These element sets have been traditionally distributed in a compact text representation called Two-Line Element sets (TLEs); see figure 7 for an example. However, this format is quite cryptic and suffers from a number of shortcomings, so recently there has been a push to use the Orbit Data Messages international standard developed by the Consultative Committee for Space Data Systems (CCSDS 502.0-B-2).

1 25544U 98067A   22156.15037205  .00008547  00000+0  15823-3 0  9994
2 25544  51.6449  36.2070 0004577 196.3587 298.4146 15.49876730343319

[Fig. 7: Two-Line Element set (TLE) for the ISS (retrieved on 2022-06-05)]

At the moment, general perturbations data in both OMM and TLE format can be integrated with poliastro thanks to the sgp4 Python library and the Ephem class as follows:

from warnings import warn

from astropy import units as u
from astropy.coordinates import (
    TEME,
    GCRS,
    CartesianDifferential,
    CartesianRepresentation,
)

from poliastro.ephem import Ephem
from poliastro.frames import Planes


def ephem_from_gp(sat, times):
    errors, rs, vs = sat.sgp4_array(times.jd1, times.jd2)
    if not (errors == 0).all():
        warn(
            "Some objects could not be propagated, "
            "proceeding with the rest",
            stacklevel=2,
        )
        rs = rs[errors == 0]
        vs = vs[errors == 0]
        times = times[errors == 0]

    cart_teme = CartesianRepresentation(
        rs << u.km,
        xyz_axis=-1,
        differentials=CartesianDifferential(
            vs << (u.km / u.s),
            xyz_axis=-1,
        ),
    )
    cart_gcrs = (
        TEME(cart_teme, obstime=times)
        .transform_to(GCRS(obstime=times))
        .cartesian
    )

    return Ephem(
        cart_gcrs,
        times,
        plane=Planes.EARTH_EQUATOR,
    )
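A possible way to exercise this helper is shown below, reusing the ISS TLE from figure 7 together with the sgp4 library's Satrec class; the epoch span is an illustrative choice and not part of the paper's example.

from sgp4.api import Satrec

from poliastro.util import time_range

# The two TLE lines for the ISS shown in figure 7
line1 = "1 25544U 98067A   22156.15037205  .00008547  00000+0  15823-3 0  9994"
line2 = "2 25544  51.6449  36.2070 0004577 196.3587 298.4146 15.49876730343319"

sat = Satrec.twoline2rv(line1, line2)

# Epochs at which to sample the resulting ephemerides (illustrative span)
times = time_range("2022-06-05", end="2022-06-06", periods=100)

iss_ephem = ephem_from_gp(sat, times)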
However, no native integration with SGP4 has been implemented yet in poliastro, for technical and non-technical reasons. On one hand, this propagator is too different from the other methods, and we have not yet devised how to add it to the library in a way that does not create confusion. On the other hand, adding such a propagator to poliastro would probably open the flood gates of corporate users of the library, and we would like to first devise a sustainability strategy for the project, which is addressed in the next section.

Future work

Despite the fact that poliastro has existed for almost a decade, for most of its history it has been developed by volunteers in their free time, and only in the past five years has it received funding through various Summer of Code programs (SOCIS 2017, GSOC 2018-2021) and institutional grants (NumFOCUS 2020, 2021). The funded work has had an overwhelmingly positive impact on the project; however, the lack of a dedicated maintainer has caused some technical debt to accrue over the years, and some parts of the project are in need of refactoring or better documentation.

Historically, poliastro has tried to implement algorithms that are applicable to all the planets in the Solar System; however, some of them have proved very difficult to generalize for bodies other than the Earth. For cases like these, poliastro ships a poliastro.earth package, but going forward we would like to continue embracing a generic approach that can serve other bodies as well.

Several open source projects have successfully used poliastro or were created taking inspiration from it, like spacetech-ssa by IBM¹ or mubody [BBVPFSC22]. AGI (previously Analytical Graphics, Inc., now Ansys Government Initiatives) published a series of scripts to automate the commercial tool STK from Python leveraging poliastro². However, we have observed that there is still lots of repeated code across similar open source libraries written in Python, which means that there is an opportunity to provide a "kernel" of algorithms that can be easily reused. Although poliastro.core started as a separate layer to isolate fast, non-safe functions as described above, we think we could move it to an external package so it can be depended upon by projects that do not want to use some of the higher level poliastro abstractions or drag in its large number of heavy dependencies.

Finally, the sustainability of the project cannot yet be taken for granted: the project has reached a level of complexity that already warrants dedicated development effort that cannot be covered with short-lived grants. Such funding could potentially come from the private sector, but although there is evidence that several for-profit companies are using poliastro, we have very little information on how it is being used and what problems those users are having, let alone what avenues for funded work could potentially work. Organizations like the Libre Space Foundation advocate for a strong copyleft licensing model to convince commercial actors to contribute to the commons, but in principle that goes against the permissive licensing that the wider Scientific Python ecosystem, including poliastro, has adopted. With the advent of new business models and the ever increasing reliance on open source by the private sector, a variety of ways to engage commercial users and include them in the conversation exist. However, these have not been explored yet.

Acknowledgements

The authors would like to thank Prof. Michèle Lavagna for her original guidance and inspiration, David A. Vallado for his encouragement and for publishing the source code for the algorithms from his book for free, Dr. T.S. Kelso for his tireless efforts in maintaining CelesTrak, Alejandro Sáez for sharing the dream of a better way, Prof. Dr. Manuel Sanjurjo Rivo for believing in my work, Helge Eichhorn for his enthusiasm and decisive influence on poliastro, the whole OpenAstronomy collaboration for opening the door for us, the NumFOCUS organization for their immense support, and Alexandra Elbakyan for enabling scientific progress worldwide.

1. https://github.com/IBM/spacetech-ssa
2. https://github.com/AnalyticalGraphicsInc/STKCodeExamples/

REFERENCES

[AAA+76] United States Committee on Extension to the Standard Atmosphere, United States National Aeronautics and Space Administration, United States National Oceanic and Atmospheric Administration, and United States Air Force. U.S. Standard Atmosphere, 1976. NOAA - SIT 76-1562. National Oceanic and Atmospheric Administration, 1976.
[AAAA62] United States Committee on Extension to the Standard Atmosphere, United States National Aeronautics and Space Administration, and United States Environmental Science Services Administration. U.S. Standard Atmosphere, 1962. U.S. Government Printing Office, 1962.
[Bat99] Richard H. Battin. An Introduction to the Mathematics and Methods of Astrodynamics, Revised Edition. AIAA, Reston, VA, 1999. doi:10.2514/4.861543.
[BBC+11] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(2):31-39, 2011. doi:10.1109/MCSE.2010.118.
[BBVPFSC22] Juan Bermejo Ballesteros, José María Vergara Pérez, Alejandro Fernández Soler, and Javier Cubas. Mubody, an astrodynamics open-source Python library focused on libration points. Barcelona, Spain, 2022. https://sseasymposium.org/wp-content/uploads/2022/04/4thSSEA_AllAbstracts.pdf.
[BEKS17] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A Fresh Approach to Numerical Computing. SIAM Review, 59(1):65-98, 2017. doi:10.1137/141000671.
[Bro59] Dirk Brouwer. Solution of the problem of artificial satellite theory without drag. The Astronomical Journal, 64:378, 1959. doi:10.1086/107958.
[Bur67] E. G. C. Burt. On space manoeuvres with continuous thrust. Planetary and Space Science, 15(1):103-122, 1967. doi:10.1016/0032-0633(67)90070-0.
[CC10] Philip Herbert Cowell and Andrew Claude Crommelin. Investigation of the Motion of Halley's Comet from 1759 to 1910. Neill & Company, 1910.
[Cha22] Kevin Charls. Recursive solution to Kepler's problem for elliptical orbits - application in robust Newton-Raphson and co-planar closest approach estimation. Unpublished, version 1, 2022. doi:10.13140/RG.2.2.18578.58563/1.
[Con14] Bruce A. Conway. Spacecraft trajectory optimization. Cambridge Aerospace Series 29. Cambridge University Press, 2014.
[CR17] Juan Luis Cano Rodríguez. Study of analytical solutions for low-thrust trajectories. Master's thesis, Universidad Politécnica de Madrid, 2017.
[DB83] J. M. A. Danby and T. M. Burkardt. The solution of Kepler's equation, I. Celestial Mechanics, 31(2):95-107, 1983. doi:10.1007/BF01686811.
[DCV21] Marilena Di Carlo and Massimiliano Vasile. Analytical solutions for low-thrust orbit transfers. Celestial Mechanics and Dynamical Astronomy, 133(7):33, 2021. doi:10.1007/s10569-021-10033-9.
[Dro53] S. Drobot. On the foundations of Dimensional Analysis. Studia Mathematica, 14(1):84-99, 1953. doi:10.4064/sm-14-1-84-99.
[Dub73] G. N. Duboshin. Book Review: Samuel Herrick. Astrodynamics. Soviet Astronomy, 16:1064, 1973.
[Ede61] Theodore N. Edelbaum. Propulsion Requirements for Controllable Satellites. ARS Journal, 31(8):1079-1089, 1961. doi:10.2514/8.5723.
[FCM13] Davide Farnocchia, Davide Bracali Cioci, and Andrea Milani. Robust resolution of Kepler's equation in all eccentricity regimes. Celestial Mechanics and Dynamical Astronomy, 116(1):21-34, 2013. doi:10.1007/s10569-013-9476-9.
[Fin07] D. Finkleman. "TLE or Not TLE?" That is the Question (AAS 07-126). Advances in the Astronautical Sciences, 127(1):401, 2007.
[GBP22] Mirko Gabelica, Ružica Bojčić, and Livia Puljak. Many researchers were not compliant with their published data sharing statement: mixed-methods study. Journal of Clinical Epidemiology, 2022. doi:10.1016/j.jclinepi.2022.05.019.
[Her71] Samuel Herrick. Astrodynamics. Van Nostrand Reinhold Co, London, New York, 1971.
[HK66] C. G. Hilton and J. R. Kuhlman. Mathematical models for the space defense center. Philco-Ford Publication No. U-3871, 17:28, 1966.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, et al. Array programming with NumPy. Nature, 585(7825):357-362, 2020. doi:10.1038/s41586-020-2649-2.
[HNW09] E. Hairer, S. P. Nørsett, and Gerhard Wanner. Solving ordinary differential equations I: nonstiff problems. Springer Series in Computational Mathematics 8. Springer, Heidelberg, 2nd rev. ed., 2009.
[HR80] Felix R. Hoots and Ronald L. Roehrich. Models for propagation of NORAD element sets. Technical report, Defense Technical Information Center, Fort Belvoir, VA, 1980.
[Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90-95, 2007. doi:10.1109/MCSE.2007.55.
[IBD+20] Dario Izzo, Will Binns, et al. esa/pykep: Optimize, 2020. doi:10.5281/ZENODO.4091753.
[Inc15] Plotly Technologies Inc. Collaborative data science, 2015. https://plot.ly.
[Jac77] L. G. Jacchia. Thermospheric Temperature, Density, and Composition: New Models. SAO Special Report, 375, 1977.
[JGAZJT+18] Nathan J. Goldbaum, John A. ZuHone, Matthew J. Turk, Kacper Kowalik, and Anna L. Rosen. unyt: Handle, manipulate, and convert data with units in Python. Journal of Open Source Software, 3(28):809, 2018. doi:10.21105/joss.00809.
[Kec97] Jean Albert Kechichian. Reformulation of Edelbaum's Low-Thrust Transfer Problem Using Optimal Control Theory. Journal of Guidance, Control, and Dynamics, 20(5):988-994, 1997. doi:10.2514/2.4145.
[Ko09] T. S. Kelso et al. Analysis of the Iridium 33-Cosmos 2251 collision. Advances in the Astronautical Sciences, 135(2):1099-1112, 2009.
[KRKP+16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, et al. Jupyter Notebooks: a publishing format for reproducible computational workflows. 2016.
[Lar16] Martin Lara. Analytical and Semianalytical Propagation of Space Orbits: The Role of Polar-Nodal Variables. In Astrodynamics Network AstroNet-II, Astrophysics and Space Science Proceedings 44, pages 151-166. Springer, 2016. doi:10.1007/978-3-319-23986-6_11.
[LC69] M. H. Lane and K. Cranford. An improved analytical drag theory for the artificial satellite problem. In Astrodynamics Conference, Princeton, NJ, 1969. doi:10.2514/6.1969-925.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (LLVM '15), pages 1-6, Austin, Texas, 2015. doi:10.1145/2833157.2833162.
[Mar95] F. Landis Markley. Kepler Equation solver. Celestial Mechanics & Dynamical Astronomy, 63(1):101-111, 1995. doi:10.1007/BF00691917.
[Mik87] Seppo Mikkola. A cubic approximation for Kepler's equation. Celestial Mechanics, 40(3-4):329-334, 1987. doi:10.1007/BF01235850.
[MKDVB+19] Michael Mommert, Michael Kelley, Miguel De Val-Borro, et al. sbpy: A Python module for small-body planetary astronomy. Journal of Open Source Software, 4(38):1426, 2019. doi:10.21105/joss.01426.
[noa] Astrodynamics.jl. https://github.com/JuliaSpace/Astrodynamics.jl.
[noa18] gpredict, 2018. https://github.com/csete/gpredict/releases/tag/v2.2.1.
[noa20] GMAT, 2020. https://sourceforge.net/projects/gmat/files/GMAT/GMAT-R2020a/.
[noa21a] nyx, 2021. https://gitlab.com/nyx-space/nyx/-/tags/1.0.0.
[noa21b] SatNOGS, 2021. https://gitlab.com/librespacefoundation/satnogs/satnogs-client/-/tags/1.7.
[noa22a] beyond, 2022. https://pypi.org/project/beyond/0.7.4/.
[noa22b] celestlab, 2022. https://atoms.scilab.org/toolboxes/celestlab/3.4.1.
[noa22c] Orekit, 2022. https://gitlab.orekit.org/orekit/orekit/-/releases/11.2.
[noa22d] SPICE, 2022. https://naif.jpl.nasa.gov/naif/toolkit.html.
[noa22e] tudatpy, 2022. https://github.com/tudat-team/tudatpy/releases/tag/0.6.0.
[OG86] A. W. Odell and R. H. Gooding. Procedures for solving Kepler's equation. Celestial Mechanics, 38(4):307-334, 1986. doi:10.1007/BF01238923.
[Pol97] James E. Pollard. Simplified approach for assessment of low-thrust elliptical orbit transfers. In 25th International Electric Propulsion Conference, Cleveland, OH, pages 97-160, 1997.
[Pol98] James Pollard. Evaluation of low-thrust orbital maneuvers. In 34th AIAA/ASME/SAE/ASEE Joint Propulsion Conference and Exhibit, Cleveland, OH, 1998. doi:10.2514/6.1998-3486.
[Pol00] J. E. Pollard. Simplified analysis of low-thrust orbital maneuvers. Technical report, Defense Technical Information Center, Fort Belvoir, VA, 2000.
[PP13] Adonis Reinier Pimienta-Penalver. Accurate Kepler equation solver without transcendental function evaluations. State University of New York at Buffalo, 2013.
[Rho20] Brandon Rhodes. Skyfield: Generate high precision research-grade positions for stars, planets, moons, and Earth satellites, 2020.
[SSM18] Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11):2584-2589, 2018. doi:10.1073/pnas.1708290115.
[TPWS+18] The Astropy Collaboration, A. M. Price-Whelan, B. M. Sipőcz, et al. The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package. The Astronomical Journal, 156(3):123, 2018. doi:10.3847/1538-3881/aabc4f.
[VCHK06] David Vallado, Paul Crawford, Richard Hujsak, and T. S. Kelso. Revisiting Spacetrack Report #3. In AIAA/AAS Astrodynamics Specialist Conference and Exhibit, Keystone, Colorado, 2006. doi:10.2514/6.2006-6753.
[VGO+20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261-272, 2020. doi:10.1038/s41592-019-0686-2.
[VM07] David A. Vallado and Wayne D. McClain. Fundamentals of astrodynamics and applications. Space Technology Library 21. Microcosm Press, Hawthorne, CA, 3rd ed., 2007.
[WAB+14] Greg Wilson, D. A. Aruliah, C. Titus Brown, et al. Best Practices for Scientific Computing. PLoS Biology, 12(1):e1001745, 2014. doi:10.1371/journal.pbio.1001745.
[WIO85] M. J. H. Walker, B. Ireland, and Joyce Owens. A set of modified equinoctial orbit elements. Celestial Mechanics, 36(4):409-419, 1985. doi:10.1007/BF01227493.

A New Python API for Webots Robotics Simulations

Justin C. Fisher

* Corresponding author: fisher@smu.edu
‡ Southern Methodist University, Department of Philosophy

Copyright © 2022 Justin C. Fisher. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Webots is a popular open-source package for 3D robotics simulations. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching or games. Webots has provided a simple API allowing Python programs to control robots and/or the simulated world, but this API is inefficient and does not provide many "pythonic" conveniences. A new Python API for Webots is presented that is more efficient and provides a more intuitive, easily usable, and "pythonic" interface.

Index Terms—Webots, Python, Robotics, Robot Operating System (ROS), Open Dynamics Engine (ODE), 3D Physics Simulation

1. Introduction

Webots is a popular open-source package for 3D robotics simulations [Mic01], [Webots]. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching or games. Webots uses the Open Dynamics Engine [ODE], which allows physical simulations of Newtonian bodies, collisions, joints, springs, friction, and fluid dynamics. Webots provides the means to simulate a wide variety of robot components, including motors, actuators, wheels, treads, grippers, light sensors, ultrasound sensors, pressure sensors, range finders, radar, lidar, and cameras (with many of these sensors drawing their inputs from GPU processing of the simulation). A typical simulation will involve one or more robots, each with somewhere between 3 and 30 moving parts (though more would be possible), each running its own controller program to process information taken in by its sensors to determine what control signals to send to its devices. A simulated world typically involves a ground surface (which may be a sloping polygon mesh) and dozens of walls, obstacles, and/or other objects, which may be stationary or moving in the physics simulation.

Webots has historically provided a simple Python API, allowing Python programs to control individual robots or the simulated world. This Python API is a thin wrapper over a C++ API, which itself is a wrapper over Webots' core C API. These nested layers of API-wrapping are inefficient. Furthermore, this API is not very "pythonic" and did not provide many of the conveniences that help to make development in Python fast, intuitive, and easy to learn. This paper presents a new Python API [NewAPI01] that more efficiently interfaces directly with the Webots C API and provides a more intuitive, easily usable, and "pythonic" interface for controlling Webots robots and simulations.

In qualitative terms, the old API feels like one is awkwardly using Python to call C and C++ functions, whereas the new API feels much simpler, much easier, and like it is fully intended for Python. Here is a representative (but far from comprehensive) list of examples:

• Unlike the old API, the new API contains helpful Python type annotations and docstrings.
• Webots employs many vectors, e.g., for 3D positions, 4D rotations, and RGB colors. The old API typically treats these as lists or integers (24-bit colors). In the new API these are Vector objects, with conveniently addressable components (e.g. vector.x or color.red), convenient helper methods like vector.magnitude and vector.unit_vector, and overloaded vector arithmetic operations, akin to (and interoperable with) NumPy arrays.
• The new API also provides easy interfacing between high-resolution Webots sensors (like cameras and Lidar) and NumPy arrays, to make it much more convenient to use Webots with popular Python packages like NumPy [NumPy], [Har01], Scipy [Scipy], [Vir01], PIL/PILLOW [PIL] or OpenCV [OpenCV], [Brad01]. For example, converting a Webots camera image to a NumPy array is now as simple as camera.array, and this now allows the array to share memory with the camera, making this extremely fast regardless of image size (a short sketch of this contrast follows this list).
• The old API often requires that all function parameters be given explicitly in every call, whereas the new API gives many parameters commonly used default values, allowing them often to be omitted, and keyword arguments to be used where needed.
• Most attributes are now accessible (and alterable, when applicable) by pythonic properties like motor.velocity.
• Many devices now have Python methods like __bool__ overloaded in intuitive ways. E.g., you can now use if bumper to detect if a bumper has been pressed, rather than the old if bumper.getValue().
• Pythonic container-like interfaces are now provided. You may now use for target in radar to iterate through the various targets a radar device has detected, or for packet in receiver to iterate through communication packets that a receiver device has received (and it now automatically handles a wide variety of Python objects, not just strings).
• The old API requires supervisor controllers to use a wide variety of separate functions to traverse and interact with the simulation's scene tree, including different functions for different VRML datatypes (like SFVec3f or MFInt32). The new API automatically handles these datatypes and translates intuitive Python syntax (like dot-notation and square-bracket indexing) to the Webots equivalents. E.g., you can now move a particular crate 1 meter in the x direction using a command like world.CRATES[3].translation += [1,0,0]. Under the old API, this would require numerous function calls (calling getNodeFromDef to find the CRATES node, getMFNode to find the child with index 3, getSFField to find its translation field, and getSFVec3f to retrieve that field's value, then some list manipulation to alter the x-component of that value, and finally a call to setSFVec3f to set the new value).
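To illustrate the camera interfacing point above, here is a sketch of the old-API conversion that camera.array replaces. The device name, the sampling period, and the 4-channel (BGRA) reshape reflect typical Webots usage rather than code taken from this paper.

from controller import Robot
import numpy as np

# Old API: obtain the device, enable it, then copy and reshape the raw bytes.
robot = Robot()
camera = robot.getDevice("camera")   # "camera" is an assumed device name
camera.enable(32)                    # sampling period in milliseconds

image = np.frombuffer(camera.getImage(), dtype=np.uint8).reshape(
    (camera.getHeight(), camera.getWidth(), 4)
)

# Under the new API the same array is simply camera.array, and it shares
# memory with the camera rather than copying it.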
As another example illustrating how much easier the new API is to use, here are two lines from Webots' sample supervisor_draw_trail, as they would appear in the old Python API:

f = supervisor.getField(supervisor.getRoot(), "children")
f.importMFNodeFromString(-1, trail_plan)

And here is how that looks written using the new API:

world.children.append(trail_plan)

The new API is mostly backwards-compatible with the old Python Webots API, and provides an option to display deprecation warnings with helpful advice for changing to the new API.

The new Python API is planned for inclusion in an upcoming Webots release, to replace the old one. In the meantime, an early-access version is available, distributed under the Apache 2.0 licence, the same permissive open-source license that Webots is distributed under.

In what follows, the history and motivation for this new API are discussed, including its use in teaching an interdisciplinary undergraduate Cognitive Science course called Minds, Brains and Robotics. Some of the design decisions for the new API are discussed, which will not only aid in understanding it, but also have broader relevance to parallel dilemmas that face many other software developers. And some metrics are given to quantify how the new API has improved over the old.

2. History and Motivation.

Much of this new API was developed by the author in the course of teaching an interdisciplinary Southern Methodist University undergraduate Cognitive Science course entitled Minds, Brains and Robotics (PHIL 3316). Before the Covid pandemic, this course had involved lab activities where students build and program physical robots. The pandemic forced these activities to become virtual. Fortunately, Webots simulations actually have many advantages over physical robots, including not requiring any specialized hardware (beyond a decent personal computer), making much more interesting uses of altitude rather than having the robots confined to a safely flat surface, allowing robots to engage in dangerous or destructive activities that would be risky or expensive with physical hardware, allowing a much broader array of sensors including high-resolution cameras, and enabling full-fledged neural network and computational vision simulations. For example, an early activity in this class involves building Braitenberg-style vehicles [Bra01] that use light sensors and cameras to detect a lamp carried by a hovering drone, as well as ultrasound and touch sensors to detect obstacles. Using these sensors, the robots navigate towards the lamp in a cluttered playground sandbox that includes sloping sand, an exterior wall, and various obstacles including a puddle of water and platforms from which robots may fall.

This interdisciplinary class draws students with diverse backgrounds and programming skills. Accommodating those with fewer skills required simplifying many of the complexities of the old Webots API. It also required setting up tools to use Webots "supervisor" powers to help manipulate the simulated world, e.g. to provide students easier customization options for their robots. The old Webots API makes the use of such supervisor powers tedious and difficult, even for experienced coders, so this practically required developing new tools to streamline the process. These factors led to the development of an interface that would be much easier for novice students to adapt to, and that would make it much easier for an experienced programmer to make extensive use of supervisor powers to manipulate the simulated world. Discussion of this with the core Webots development team then led to the decision to incorporate these improvements into Webots, where they can be of benefit to a much broader community.
3. Design Decisions.

This section discusses some design decisions that arose in developing this API, and the factors that drove these decisions. This may help give the reader a better understanding of this API, and also of relevant considerations that would arise in many other development scenarios.

3.1. Shifting from functions to properties.

The old Python API for Webots consists largely of methods like motor.getVelocity() and motor.setVelocity(new_velocity). In the new API these have quite uniformly been changed to Python properties, so these purposes are now accomplished with motor.velocity and motor.velocity = new_velocity.

Reduction of wordiness and punctuation helps to make programs easier to read and to understand, and it reduces the cognitive load on coders. However, there are also drawbacks.

One drawback is that properties can give the mistaken impression that some attributes are computationally cheap to get or set. In cases where this impression would be misleading, more traditional method calls were retained and/or the comparative expense of the operation was clearly documented.

Two other drawbacks are related. One is that inviting ordinary users to assign properties to API objects might lead them to assign other attributes that could cause problems. Since Python lacks true privacy protections, it has always faced this sort of worry, but this worry becomes even worse when users start to feel familiar moving beyond just using defined methods to interact with an object.

Relatedly, Python debugging provides direct feedback in cases where a user misspells motor.setFoo(v), but not when someone misspells motor.foo = v. If a user inadvertently types motor.setFool(v) they will get an AttributeError noting that motor lacks a setFool attribute. But if a user inadvertently types motor.fool = v, then Python will silently create a new .fool attribute for motor, and the user will often have no idea what has gone wrong.
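A minimal plain-Python illustration of this asymmetry; the Motor class here is a stand-in used only for the demonstration, not Webots code.

class Motor:
    """Stand-in for a Webots motor device (not the real class)."""

    def setVelocity(self, v):
        self.velocity = v

motor = Motor()

try:
    motor.setFool(3.0)   # misspelled method name: Python complains immediately
except AttributeError as error:
    print(error)         # 'Motor' object has no attribute 'setFool'

motor.fool = 3.0         # misspelled attribute name: silently creates .fool, no error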
However, there are also drawbacks. the new API has improved over the old. One drawback is that properties can give the mistaken impres- sion that some attributes are computationally cheap to get or set. In cases where this impression would be misleading, more traditional 2. History and Motivation. method calls were retained and/or the comparative expense of the Much of this new API was developed by the author in the operation was clearly documented. course of teaching an interdisciplinary Southern Methodist Uni- Two other drawbacks are related. One is that inviting ordinary versity undergraduate Cognitive Science course entitled Minds, users to assign properties to API objects might lead them to assign Brains and Robotics (PHIL 3316). Before the Covid pandemic, other attributes that could cause problems. Since Python lacks this course had involved lab activities where students build and true privacy protections, it has always faced this sort of worry, but program physical robots. The pandemic forced these activities this worry becomes even worse when users start to feel familiar to become virtual. Fortunately, Webots simulations actually have moving beyond just using defined methods to interact with an many advantages over physical robots, including not requiring object. any specialized hardware (beyond a decent personal computer), Relatedly, Python debugging provides direct feedback in making much more interesting uses of altitude rather than having cases where a user misspells motor.setFoo(v) but not when the robots confined to a safely flat surface, allowing robots someone mispells ’motor.foo = v‘. If a user inadvertently types to engage in dangerous or destructive activities that would be motor.setFool(v) they will get an AttributeError risky or expensive with physical hardware, allowing a much noting that motor lacks a setFool attribute. But if a user broader array of sensors including high-resolution cameras, and inadvertently types motor.fool = v, then Python will silently enabling full-fledged neural network and computational vision create a new .fool attribute for motor and the user will often simulations. For example, an early activity in this class involves have no idea what has gone wrong. building Braitenburg-style vehicles [Bra01] that use light sensors These two drawbacks both involve users setting an attribute and cameras to detect a lamp carried by a hovering drone, as they shouldn’t: either an attribute that has another purpose, or one A NEW PYTHON API FOR WEBOTS ROBOTICS SIMULATIONS 149 that doesn’t. Defenses against the first include "hiding" important in the simulated world (presuming that the controller has such attributes behind a leading "_", or protecting them with a Python permissions, of course). In many use cases, supervisor robots don’t property, which can also help provide useful doc-strings. Unfor- actually have bodies and devices of their own, and just use their tunately it’s much harder to protect against misspellings in this supervisor powers incorporeally, so all they will need is world. piece-meal fashion. In the case where a robot’s controller wants to exert both forms This led to the decision to have robot devices like motors of control, it can import both robot to control its own body, and and cameras employ a blanket __setattr__ that will generate world to control the rest of the world. warnings if non-property attributes of devices are set from outside This distinction helps to make things more intuitively clear. the module. 
An alternative approach, suggested by Matthew Feickert, would have been to use __slots__ rather than an ordinary __dict__ to store device attributes, which would also have the effect of raising an error if users attempt to modify unexpected attributes. Not having a __dict__ can make it harder to do some things like cached properties and multiple inheritance. But in cases where such issues don't arise or can be worked around, readers facing similar challenges may find __slots__ to be a preferable solution.

3.2 Backwards Compatibility.

The new API offers many new ways of doing things, many of which would seem "better" by most metrics, with the main drawback being just that they differ from the old ways. The possibility of making a clean break from the old API was considered, but that would stop old code from working, alienate veteran users, and risk causing a schism akin to the deep one that arose between the Python 2 and Python 3 communities when Python 3 opted against backwards compatibility.

Another option would have been to refrain from adding a "new-and-better" feature to avoid introducing redundancies or backward incompatibilities. But that has obvious drawbacks too.

Instead, a compromise was typically adopted: to provide both the "new-and-better" way and the "worse-old" way. This redundancy was eased by shifting from getFoo / setFoo methods to properties, and from CamelCase to pythonic snake_case, which reduced the number of name collisions between old and new. Employing the "worse-old" way leads to a deprecation warning that includes helpful advice regarding shifting to the "new-and-better" way of doing things. This may help users to transition more gradually to the new ways, or they can shut these warnings off to help preserve good will, and hopefully avoid a schism.

3.3 Separating robot and world.

In Webots there is a distinction between "ordinary robots", whose capabilities are generally limited to using the robot's own devices, and "supervisor robots", who share those capabilities but also have virtual omniscience and omnipotence over most aspects of the simulated world. In the old API, supervisor controller programs import a Supervisor subclass of Robot, but typically still call this unusually powerful robot robot, which has led to many confusions.

In the new API these two sorts of powers are strictly separated. Importing robot provides an object that can be used to control the devices in the robot itself. Importing world provides an object that can be used to observe and enact changes anywhere in the simulated world (presuming that the controller has such permissions, of course). In many use cases, supervisor robots don't actually have bodies and devices of their own, and just use their supervisor powers incorporeally, so all they will need is world. In the case where a robot's controller wants to exert both forms of control, it can import both robot to control its own body, and world to control the rest of the world.

This distinction helps to make things more intuitively clear. It also frees world from having all the properties and methods that robot has, which in turn reduces the risk of name-collisions as world takes on the role of serving as the root of the proxy scene tree. In the new API, world.children refers to the children field of the root of the scene tree, which contains (almost) all of the simulated world, world.WorldInfo refers to one of these children, a WorldInfo node, and world.ROBOT2 dynamically returns a node within the world whose Webots DEF-name is "ROBOT2". These uses of world would have been much less intuitive if users thought of world as being a special sort of robot, rather than as being their handle on controlling the simulated world. Other sorts of supervisor functionality are also very intuitively associated with world, like world.save(filename) to save the state of the simulated world, or world.mode = 'PAUSE'.

Having world attributes dynamically fetch nodes and fields from the scene tree did come with some drawbacks. There is a risk of name-collisions, though these are rare since Webots field-names are known in advance, and nodes are typically sought by ALL-CAPS DEF-names, which won't collide with world's lower-case and MixedCase attributes. Linters like MyPy and PyCharm also cannot anticipate such dynamic references, which is unfortunate, but does not stop such dynamic references from being extremely useful.
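Gathering the world interactions mentioned in this section into one place, a short sketch follows. The bare import world form, the CRATES DEF-name, and the save/mode calls are modeled on the examples in the text, but the snippet is illustrative rather than a verbatim controller, and the file name is a placeholder.

import world   # new-API supervisor handle on the simulated world

# Dynamic scene-tree access by DEF-name and field name
world.CRATES[3].translation += [1, 0, 0]   # move a crate 1 meter in the x direction

# Other supervisor functionality associated with world
world.save("backup.wbt")                   # save the state of the simulated world
world.mode = 'PAUSE'                       # pause the simulation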
4. Readability Metrics

A main advantage of the new API is that it allows Webots controllers to be written in a manner that is easier for coders to read, write, and understand. Qualitatively, this difference becomes quite apparent upon a cursory inspection of examples like the one given in section 1. As another representative example, here are three lines from Webots' included supervisor_draw_trail sample as they would appear in the old Python API:

trail_node = world.getFromDef("TRAIL")
point_field = trail_node.getField("coord")\
    .getSFNode()\
    .getField("point")
index_field = trail_node.getField("coordIndex")

And here is their equivalent in the new API:

point_field = world.TRAIL.coord.point
index_field = world.TRAIL.coordIndex

Brief inspection should reveal that the latter code is much easier to read, write and understand, not just because it is shorter, but also because its punctuation is limited to standard Python syntax for traversing attributes of objects, because it reduces the need to introduce new variables like trail_node for things that it already makes easy to reference (via world.TRAIL, which the new API automatically caches for fast repeat reference), and because it invisibly handles selecting appropriate C-API functions like getField and getSFNode, saving the user from needing to learn and remember all these functions (of which there are many).

This intuitive impression is confirmed by automated metrics for code readability. The measures in what follows consider the full supervisor_draw_trail sample controller (from which the above snippet was drawn), since this is the Webots sample controller that makes the most sustained use of supervisor functionality to perform a fairly plausible supervisor task (maintaining the position of a streamer that trails behind the robot). Webots provides this sample controller in C [SDTC], but it was re-implemented using both the old Python API and the new Python API [Metrics], maintaining straightforward correspondence between the two, with the only differences being directly due to the differences in the APIs.

Some raw measures for the two controllers are shown in Table 1. These were gathered using the Radon code-analysis tools [Radon]. (These metrics, as well as those below, may be reproduced by (1) installing Radon [Radon], (2) downloading the source files to compare and the script for computing metrics [Metrics], (3) ensuring that the path at the top of the script refers to the local location of the source files to be compared, and (4) running this script.) Multiple metrics are reported because theorists disagree about which are most relevant in assessing code readability, because some of these play a role in computing other metrics discussed below, and because this may help to allay potential worries that a few favorable metrics might have been cherry-picked. This paper provides some explanation of these metrics and of their potential significance, while remaining neutral regarding which, if any, of these metrics is best.
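For readers who would rather query Radon directly from Python than run the paper's script, a sketch using Radon's own API follows; the file name is a placeholder, and the exact shape of Radon's return objects varies somewhat between versions.

from pathlib import Path

from radon.complexity import cc_visit
from radon.metrics import h_visit, mi_visit
from radon.raw import analyze

# Path to one of the two re-implemented controllers (placeholder file name)
source = Path("supervisor_draw_trail_new_api.py").read_text()

print(analyze(source))                 # raw measures: LOC, SLOC, LLOC, comments, ...
print(sum(block.complexity             # total cyclomatic complexity over functions/classes
          for block in cc_visit(source)))
print(h_visit(source))                 # Halstead vocabulary, length, volume, ...
print(mi_visit(source, multi=True))    # Maintainability Index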
The "lines of code" measures reflect that the new API makes it easier to do more things with less code. The measures differ in how they count blank lines, comments, multi-line statements, and multi-statement lines like if p: q(). Line counts can be misleading, especially when the code with fewer lines has longer lines, though upcoming measures will show that that is not the case here.

Cyclomatic Complexity counts the number of potential branching points that appear within the code, like if, while and for [McC01]. Cyclomatic Complexity is strongly correlated with other plausible measures of code readability involving indentation structure [Hin01]. The new API's score is lower/"better" due to its automatically converting vector-like values to the format needed for importing new nodes into the Webots simulation, and due to its automatic caching allowing a simpler loop to remove unwanted nodes. By Radon's reckoning this difference in complexity already gives the old API a "B" grade, as compared to the new API's "A". These complexity measures would surely rise in more complex controllers employed in larger simulations, but they would rise less quickly under the new API, since it provides many simpler ways of doing things, and they need never do any worse, since the new API provides backwards-compatible options.

TABLE 1: Length and Complexity Metrics. Raw measures for supervisor_draw_trail as it would be written with the new Python API for Webots or the old Python API for Webots. The "lines of code" measures differ with respect to how they count blank lines, comments, and lines that combine multiple commands. Cyclomatic complexity measures the number of potential branching points in the code.

Metric                                     New API   Old API
Lines of Code (with blanks, comments)      43        49
Source Lines of Code (without those)       29        35
Logical Lines of Code (single commands)    27        38
Cyclomatic Complexity                      5 (A)     8 (B)

Another collection of classic measures of code readability was developed by Halstead [Hal01]. These measures (especially volume) have been shown to correlate with human assessments of code readability [Bus01], [Pos01]. They generally penalize a program for using a "vocabulary" involving more operators and operands. Table 2 shows these metrics, as computed by Radon. (Again all measures are reported, while remaining neutral about which are most significant.) The new API scores significantly lower/"better" on these metrics, due in large part to its automatically selecting among many different C-API calls without these needing to appear in the user's code. E.g., having motor.velocity as a unified property involves fewer unique names than having users write both setVelocity() and getVelocity(), and often forming a third local velocity variable. And having world.children[-1] access the last child of that field in the simulation saves having to count getField and getMFNode in the vocabulary, and often also saves forming additional local variables for nodes or fields gotten in this way. Both of these factors also help the new API to greatly reduce parentheses counts.

TABLE 2: Halstead Metrics. Halstead metrics for supervisor_draw_trail as it would be written with the new and old Python APIs for Webots. Lower numbers are commonly construed as being better.

Halstead Metric                                     New API   Old API
Vocabulary = (n1) operators + (n2) operands         18        54
Length = (N1) operator + (N2) operand instances     38        99
Volume = Length * log2(Vocabulary)                  158       570
Difficulty = (n1 * N2) / (2 * n2)                   4.62      4.77
Effort = Difficulty * Volume                        731       2715
Time = Effort / 18                                  41        151
Bugs = Volume / 3000                                0.05      0.19
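Using the new-API counts reported in Table 2, the derived Halstead quantities can be reproduced directly from the formulas in the table's first column; small differences from the tabulated values arise from rounding of the inputs.

from math import log2

vocabulary, length, difficulty = 18, 38, 4.62   # new-API counts from Table 2

volume = length * log2(vocabulary)   # ~158
effort = difficulty * volume         # ~731 (within rounding)
time_required = effort / 18          # ~41 seconds
bugs = volume / 3000                 # ~0.05

print(volume, effort, time_required, bugs)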
Line counts can be (SEI and Radon) also provide credit for percentage of comment misleading, especially when the code with fewer lines has longer lines. (Both samples compared here include 5 comment lines, but lines, though upcoming measures will show that that is not the these compose a higher percentage of the new API’s shorter code). case here. Different versions of this measure weight and curve these factors Cyclomatic Complexity counts the number of potential somewhat differently, but since the new API outperforms the old branching points that appear within the code, like if, while and on each factor, all versions agree that it gets the higher/"better" for. [McC01] Cyclomatic Complexity is strongly correlated with score, as shown in Table 3. (These measures were computed based other plausible measures of code readability involving indentation on the input components as counted by Radon.) structure [Hin01]. The new API’s score is lower/"better" due to its There are potential concerns about each of these measures automatically converting vector-like values to the format needed of code readability, and one can easily imagine playing a form for importing new nodes into the Webots simulation, and due to of "code golf" to optimize some of these scores without actually its automatic caching allowing a simpler loop to remove unwanted improving readability (though it would be difficult to do this for all nodes. By Radon’s reckoning this difference in complexity already scores at once). Fortunately, most plausible measures of readabil- gives the old API a "B" grade, as compared to the new API’s "A". ity have been observed to be strongly correllated across ordinary These complexity measures would surely rise in more complex cases, [Pos01] so the clear and unanimous agreement between controllers employed in larger simulations, but they would rise less these measures is a strong confirmation that the new API is indeed A NEW PYTHON API FOR WEBOTS ROBOTICS SIMULATIONS 151 Maintainability Index version New API Old API [NewAPI01] https://github.com/Justin-Fisher/new_python_api_for_webots [NumPy] Numerical Python (NumPy). https://www.numpy.org Original [Oman01] 89 79 [ODE] Open Dynamics Engine. https://www.ode.org/ Software Engineering Institute 78 62 [Oman01] Oman, P and J Hagemeister. "Metrics for assessing a software Microsoft Visual Studio 52 46 system’s maintainability," Proceedings Conference on Software Maintenance, 337-44. 1992. doi: 10.1109/ICSM.1992.242525. Radon 82 75 [OpenCV] Open Source Computer Vision Library for Python. https:// github.com/opencv/opencv-python TABLE 3 [PIL] Python Imaging Library. https://python-pillow.org/ Maintainability Index Metrics. Maintainability Index metrics for [Pos01] Posnet, D, A Hindle and P Devanbu. "A simpler model of supervisor_draw_trail as it would be written with the new and old software readability." Proceedings of the 8th working conference versions of the Python API for Webots, according to different versions of the on mining software repositories, 73-82. 2011. Maintainability Index. Higher numbers are commonly construed as being better. [Radon] Radon. https://radon.readthedocs.io/en/latest/index.html [Sca01] Scalabrino, S, M Linares-Vasquez, R Oliveto and D Poshy- vanyk. "A Comprehensive Model for Code Readability." Jounal of Software: Evolution and Process, 1-29. 2017. doi: 10.1002/smr.1958. [Scipy] https://www.scipy.org more readable. 
Other plausible measures of readability would take [SDTC] https://cyberbotics.com/doc/guide/samples-howto#supervisor_ into account factors like whether the operands are ordinary English draw_trail-wbt words, [Sca01] or how deeply nested (or indented) the code ends [SDTNew] https://github.com/Justin-Fisher/new_python_api_for_webots/ blob/d180bcc7f505f8168246bee379f8067dfaf373ea/webots_ up being, [Hin01] both of which would also favor the new API. new_python_api_samples/controllers/supervisor_draw_trail_ So the mathematics confirm what was likely obvious from visual python/supervisor_draw_trail_new_api_bare_bones.py comparison of code samples above, that the new API is indeed [SDTOld] https://github.com/Justin-Fisher/new_python_api_for_webots/ more "readable" than the old. blob/d180bcc7f505f8168246bee379f8067dfaf373ea/webots_ new_python_api_samples/controllers/supervisor_draw_trail_ python/supervisor_draw_trail_old_api_bare_bones.py 5. Conclusions [Vir01] Virtanen, P, R. Gommers, T. Oliphant, et al. SciPy 1.0: Funda- A new Python API for Webots robotic simulations was presented. mental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-72. 2020. doi: 10.1038/s41592-019-0686-2. It more efficiently interfaces directly with the Webots C API and [Webots] Webots Open Source Robotic Simulator. https://cyberbotics. provides a more intuitive, easily usable, and "pythonic" interface com/ for controlling Webots robots and simulations. Motivations for the API and some of its design decisions were discussed, including decisions use python properties, to add new functionality along- side deprecated backwards compatibility, and to separate robot and supervisor/world functionality. Advantages of the new API were discussed and quantified using automated code readability metrics. More Information An early-access version of the new API and a variety of sam- ple programs and metric computations: https://github.com/Justin- Fisher/new_python_api_for_webots Lengthy discussion of the new API and its planned inclusion in Webots: https://github.com/cyberbotics/webots/pull/3801 Webots home page, including free download of Webots: https: //cyberbotics.com/ R EFERENCES [Brad01] Bradski, G. The OpenCV Library. Dr Dobb’s Journal of Soft- ware Tools. 2000. [Bra01] Braitenberg, V. Vehicles: Experiments in synthetic psychology. Cambridge, MA: MIT Press. 1984. [Bus01] Buse, R and W Weimer. Learning a metric for code readability. IEEE Transactions on Software Engineering, 36(4): 546-58. 2010. doi: 10.1109/TSE.2009.70. [Metrics] Fisher, J. Readability Metrics for a New Python API for Webots Robotics Simulations. 2022. doi: 10.5281/zenodo.6813819. [Hal01] Halstead, M. Elements of software science. Elsevier New York. 1977. [Har01] Harris, C., K. Millman, S. van der Walt, et al. Array pro- gramming with NumPy. Nature 585, 357–62. 2020. doi: 10.1038/s41586-020-2649-2. [Hin01] Hindle, A, MW Godfrey and RC Holt. "Reading beside the lines: Indentation as a proxy for complexity metric." Program Comprehension. The 16th IEEE International Conference, 133- 42. 2008. doi: 10.1109/icpc.2008.13. [McC01] McCabe, TJ. "A Complexity Measure" , 2(4): 308-320. 1976. [Mic01] Michel, O. "Webots: Professional Mobile Robot Simulation. Journal of Advanced Robotics Systems. 1(1): 39-42. 2004. doi: 10.5772/5618. 152 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling

Jyotika Singh (Placemakr). Corresponding author: singhjyotika811@gmail.com. Copyright © 2022 Jyotika Singh. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract: pyAudioProcessing is a Python based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models. MATLAB is a popular language of choice for a vast amount of research in the audio and speech processing domain. On the contrary, Python remains the language of choice for a vast majority of machine learning research and functionality. This library contains features built in Python that were originally published in MATLAB. pyAudioProcessing allows the user to compute various features from audio files including Gammatone Frequency Cepstral Coefficients (GFCC), Mel Frequency Cepstral Coefficients (MFCC), spectral features, chroma features, and others such as beat-based and cepstrum-based features from audio. One can use these features along with one's own classification backend or any of the popular scikit-learn classifiers that have been integrated into pyAudioProcessing. Cleaning functions to strip unwanted portions from the audio are another offering of the library. It further contains integrations with other audio functionalities such as frequency and time-series visualizations and audio format conversions. This software aims to provide machine learning engineers, data scientists, researchers, and students with a set of baseline models to classify audio. The library is available at https://github.com/jsingh811/pyAudioProcessing and is under the GPL-3.0 license.

Index Terms: pyAudioProcessing, audio processing, audio data, audio classification, audio feature extraction, gfcc, mfcc, spectral features, spectrogram, chroma
Introduction

The motivation behind this software is to make available complex audio features in Python for a variety of audio processing tasks. Python is a popular choice for machine learning tasks. Having solutions for computing complex audio features using Python enables easier and unified usage of Python for building machine learning algorithms on audio. This not only implies the need for resources to guide solutions for audio processing, but also signifies the need for Python guides and implementations to solve audio and speech cleaning, transformation, and classification tasks.

Different data processing techniques work well for different types of data. For example, in natural language processing, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued numerical vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning [Wik22b]. Word embeddings work great for many applications surrounding textual data [JS21]. However, passing numbers, an audio signal, or an image through a word embeddings generation method is not likely to return any meaningful numerical representation that can be used to train machine learning models. Different data types call for feature formation techniques specific to their domain rather than a one-size-fits-all. Such methods for audio signals are very specific to audio and speech signal processing, which is a domain of digital signal processing. Digital signal processing is a field of its own and is not feasible to master in an ad-hoc fashion. This calls for the need to have sought-after and useful processes for audio signals in a ready-to-use state for users.

There are two popular approaches for feature building in audio classification tasks:
1. Computing spectrograms from audio signals as images and using an image classification pipeline for the remainder.
2. Computing features from audio files directly as numerical vectors and applying them to a classification backend.
pyAudioProcessing includes the capability of computing spectrograms, but focuses most functionalities around the latter for building audio models. This tool contains implementations of various widely used audio feature extraction techniques, and integrates with popular scikit-learn classifiers including support vector machine (SVM), SVM with radial basis function kernel (RBF), random forest, logistic regression, k-nearest neighbors (k-NN), gradient boosting, and extra trees. Audio data can be cleaned, trained, tested, and classified using pyAudioProcessing [Sin21].

Some other useful libraries for the domain of audio processing include librosa [MRL+15], spafe [Mal20], essentia [BWG+13], pyAudioAnalysis [Gia15], and paid services from service providers such as Google (https://developers.google.com/learn/pathways/get-started-audio-classification).

The use of pyAudioProcessing in the community inspires the need and growth of this software. It is referenced in the textbook Artificial Intelligence with Python Cookbook, published by Packt Publishing in October 2020 [Auf20]. Additionally, pyAudioProcessing is part of a specific admissions requirement for a funded PhD project at the University of Portsmouth (https://www.port.ac.uk/study/postgraduate-research/research-degrees/phd/explore-our-projects/detection-of-emotional-states-from-speech-and-text). It is further referenced in the thesis "Master Thesis AI Methodologies for Processing Acoustic Signals / AI Usage for Processing Acoustic Signals" [Din21], in recent research on audio processing for assessing attention levels in Attention Deficit Hyperactivity Disorder (ADHD) students [BGSR21], and more. There are thus far 16000+ downloads via pip for pyAudioProcessing, with 1000+ downloads in the last month [PeP22]. As several different audio features need development, new issues are created on GitHub, and contributions to the code by the open-source community are welcome to grow the tool faster.
Core Functionalities

pyAudioProcessing aims to provide an end-to-end processing solution for converting between audio file formats, visualizing time and frequency domain representations, cleaning audio by removing silence and low-activity segments, building features from raw audio samples, and training a machine learning model that can then be used to classify unseen raw audio samples (e.g., into categories such as music, speech, etc.). This library allows the user to extract features such as Mel Frequency Cepstral Coefficients (MFCC) [CD14], Gammatone Frequency Cepstral Coefficients (GFCC) [JDHP17], spectral features, chroma features and other beat-based and cepstrum-based features from audio, to use with one's own classification backend or with the scikit-learn classifiers that have been built into pyAudioProcessing. The classifier implementation examples that are a part of this software aim to give users a sample solution to audio classification problems and help build the foundation to tackle new and unseen problems.

pyAudioProcessing provides seven core functionalities comprising different stages of audio signal processing.

1. Converting audio files to .wav format, giving users the ability to work with different types of audio and increasing compatibility with code and processes that work best with the .wav audio type.
2. Audio visualization in time-series and frequency representation, including spectrograms.
3. Segmenting and removing low-activity segments from audio files, i.e., removing unwanted audio segments that are less likely to represent meaningful information.
4. Building numerical features from audio that can be used to train machine learning models. The set of features supported evolves with time as research informs new and improved algorithms.
5. Ability to export the features built with this library for use with any custom machine learning backend of the user's choosing.
6. Capability that allows users to train scikit-learn classifiers using features of their choosing directly from raw data. pyAudioProcessing a) runs automatic hyper-parameter tuning, b) returns to the user the training model metrics along with a cross-validation confusion matrix (an evaluation matrix from which we can estimate the performance of the model broken down by each class/category) for model evaluation, and c) allows the user to test the created classifier with the same features used for training (a minimal scikit-learn sketch of this flow is shown after this list).
7. Includes pre-trained models to provide users with baseline audio classifiers.
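The training capability in functionality 6 maps onto standard scikit-learn building blocks. The following is a minimal, illustrative sketch (not pyAudioProcessing's internal code) of hyper-parameter tuning and a cross-validation confusion matrix for an SVM; the feature matrix X and labels y are placeholders standing in for features exported from the library's feature-building step.

    # Hypothetical feature matrix X (n_samples x n_features) and labels y,
    # standing in for features exported by the library.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, cross_val_predict
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 40))                  # placeholder features
    y = np.repeat(["music", "speech"], 30)         # placeholder labels

    # Automatic hyper-parameter tuning over the SVM regularization parameter C
    grid = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=5)
    grid.fit(X, y)

    # Cross-validation confusion matrix for the best estimator
    y_cv = cross_val_predict(grid.best_estimator_, X, y, cv=5)
    print(grid.best_params_)
    print(confusion_matrix(y, y_cv, labels=["music", "speech"]))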
Methods and Results

Pre-trained models

pyAudioProcessing offers pre-trained audio classification models for the Python community to aid in quick baseline establishment. This is an evolving feature as new datasets and classification problems gain prominence in the field. Some of the pre-trained models include the following.

1. Audio type classifier to determine speech versus music: a Support Vector Machine (SVM) classifier trained to classify audio into two possible classes, music and speech. This classifier was trained using Mel Frequency Cepstral Coefficients (MFCC), spectral features, and chroma features, on manually created and curated samples for speech and music. The per-class evaluation metrics are shown in Table 1.

2. Audio type classifier to determine speech versus music versus bird sounds: an SVM classifier trained to classify audio into three possible classes, music, speech, and birds. This classifier was trained using MFCC, spectral features, and chroma features. The per-class evaluation metrics are shown in Table 2.

3. Music genre classifier using the GTZAN dataset [TEC01]: an SVM classifier trained using Gammatone Frequency Cepstral Coefficients (GFCC), MFCC, spectral features, and chroma features to classify music into 10 genre classes: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock. The per-class evaluation metrics are shown in Table 3.

TABLE 1: Per-class evaluation metrics for the audio type (speech vs music) classification pre-trained model.

    Class     Accuracy    Precision    F1
    music     97.60%      98.79%       98.19%
    speech    98.80%      97.63%       98.21%

TABLE 2: Per-class evaluation metrics for the audio type (speech vs music vs bird sound) classification pre-trained model.

    Class     Accuracy    Precision    F1
    music     94.60%      96.93%       95.75%
    speech    97.00%      97.79%       97.39%
    birds     100.00%     96.89%       98.42%

TABLE 3: Per-class evaluation metrics for the music genre classification pre-trained model.

    Class    Accuracy    Precision    F1
    pop      72.36%      78.63%       75.36%
    met      87.31%      85.52%       86.41%
    dis      62.84%      59.45%       61.10%
    blu      83.02%      72.96%       77.66%
    reg      79.82%      69.72%       74.43%
    cla      90.61%      86.38%       88.44%
    rock     53.10%      51.50%       52.29%
    hip      60.94%      77.22%       68.12%
    cou      58.34%      62.53%       60.36%
    jazz     78.10%      85.17%       81.48%

These models aim to present the capability of audio feature generation algorithms in extracting meaningful numeric patterns from audio data. One can train their own classifiers using similar features and a different machine learning backend for researching and exploring improvements.
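Per-class summaries of the kind reported in Tables 1-3 are commonly produced with scikit-learn's reporting utilities. The snippet below is only an illustration of that reporting step; the true and predicted labels are toy placeholders, not the models' actual outputs.

    # Illustrative per-class precision/recall/F1 report; labels are placeholders.
    from sklearn.metrics import classification_report

    y_true = ["music", "music", "speech", "speech", "birds", "birds"]
    y_pred = ["music", "speech", "speech", "speech", "birds", "birds"]

    print(classification_report(y_true, y_pred, digits=4))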
Audio features

There are multiple types of features one can extract from audio. Information about getting started with audio processing is well described in [Sin19]. pyAudioProcessing allows users to compute GFCC, MFCC, other cepstral features, spectral features, temporal features, chroma features, and more. Details on how to extract these features are present in the project documentation on GitHub. Generally, features useful in different audio prediction tasks (especially speech) include Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Bark Frequency Cepstral Coefficients (BFCC), Power Normalized Cepstral Coefficients (PNCC), and spectral features like spectral flux, entropy, roll-off, centroid, spread, and energy entropy. While MFCC features find use in most commonly encountered audio processing tasks such as audio type and speech classification, GFCC features have been found to have application in speaker identification and speaker diarization (the process of partitioning an input audio stream into homogeneous segments according to the human speaker identity [Wik22a]). Applications, comparisons and uses can be found in [ZW13], [pat21], and [pat22]. The pyAudioProcessing library computes these features for segments of a single audio signal, followed by the mean and standard deviation over all the signal segments.

Mel Frequency Cepstral Coefficients (MFCC):

The mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes our features match more closely what humans hear. The mel-frequency scale is approximately linear for frequencies below 1 kHz and logarithmic for frequencies above 1 kHz, as shown in Figure 1. This is motivated by the fact that the human auditory system becomes less frequency-selective as frequency increases above 1 kHz. The signal is divided into segments and a spectrum is computed. Passing a spectrum through the mel filter bank, followed by taking the log magnitude and a discrete cosine transform (DCT), produces the mel cepstrum. The DCT extracts the signal's main information and peaks; for this very property, the DCT is also widely used in applications such as JPEG and MPEG compression. The peaks after the DCT contain the gist of the audio information. Typically, the first 13-20 coefficients extracted from the mel cepstrum are called the MFCCs. These hold very useful information about audio and are often used to train machine learning models. The process of developing these coefficients is illustrated in Figure 1. MFCC for a sample speech audio can be seen in Figure 2.

[Fig. 1: MFCC from audio spectrum.]
[Fig. 2: MFCC from a sample speech audio.]

Gammatone Frequency Cepstral Coefficients (GFCC):

Another filter inspired by human hearing is the gammatone filter bank. The gammatone filter bank shape looks similar to the mel filter bank, except that the peaks are smoother than the triangular shape of the mel filters. Gammatone filters are conceived to be a good approximation to the human auditory filters and are used as a front-end simulation of the cochlea. Since the human ear is the ideal receiver and distinguisher of speakers, in the presence or absence of noise, the construction of gammatone filters that mimic auditory filters became desirable. Thus the gammatone filter bank has many applications in speech processing, because it aims to replicate how we hear. GFCCs are formed by passing the spectrum through a gammatone filter bank, followed by loudness compression and a DCT, as seen in Figure 3. The first (approximately) 22 features are called GFCCs. GFCCs have a number of applications in speech processing, such as speaker identification. GFCC for a sample speech audio can be seen in Figure 4.

[Fig. 3: GFCC from audio spectrum.]
[Fig. 4: GFCC from a sample speech audio.]
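The MFCC pipeline described above (mel filter bank, log magnitude, DCT) can be sketched in a few lines using librosa and SciPy. This is an illustration of the process under stated assumptions, not pyAudioProcessing's exact implementation; "speech.wav" is a placeholder file name.

    # Illustration of the described MFCC pipeline (mel filter bank -> log -> DCT).
    # Not pyAudioProcessing's implementation; "speech.wav" is a placeholder.
    import librosa
    import numpy as np
    from scipy.fftpack import dct

    y, sr = librosa.load("speech.wav", sr=None)

    # Mel-scaled power spectrum of short overlapping frames
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                              hop_length=512, n_mels=40)

    # Log magnitude followed by a DCT over the mel bands; keep the first 13
    # coefficients per frame as the MFCCs
    log_mel = np.log(mel_spec + 1e-10)
    mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13, :]

    print(mfcc.shape)   # (13, number_of_frames)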
Temporal features:

Temporal features from audio are extracted from the signal information in its time domain representation. Examples include signal energy, entropy, and zero crossing rate. Some sample mean temporal features can be seen in Figure 5.

[Fig. 5: Temporal extractions from a sample speech audio.]

Spectral features:

Spectral features, on the other hand, derive information contained in the frequency domain representation of an audio signal. The signal can be converted from the time domain to the frequency domain using the Fourier transform. Useful features from the signal spectrum include fundamental frequency, spectral entropy, spectral spread, spectral flux, spectral centroid, and spectral roll-off. Some sample mean spectral features can be seen in Figure 6.

[Fig. 6: Spectral features from a sample speech audio.]

Chroma features:

Chroma features are highly popular for music audio data. In Western music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into the twelve categories A, A#, B, C, C#, D, D#, E, F, F#, G, G#) and whose tuning approximates the equal-tempered scale [con22]. A prime characteristic of chroma features is that they capture the harmonic and melodic attributes of audio, while being robust to changes in timbre and instrumentation. Some sample mean chroma features can be seen in Figure 7.

[Fig. 7: Chroma features from a sample speech audio.]
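A few of the temporal, spectral, and chroma descriptors named above can be computed with librosa, one of the related libraries cited earlier. The sketch below is illustrative only (pyAudioProcessing ships its own extractors); the file path is a placeholder, and the mean/standard-deviation summary mirrors the per-segment aggregation described above.

    # Illustrative temporal, spectral, and chroma descriptors via librosa.
    # "speech.wav" is a placeholder path.
    import librosa
    import numpy as np

    y, sr = librosa.load("speech.wav", sr=None)

    zcr = librosa.feature.zero_crossing_rate(y)               # temporal
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral centroid
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)    # spectral roll-off
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # 12 pitch-class energies

    # Summarize frame-level features as mean and standard deviation
    summary = {name: (float(np.mean(f)), float(np.std(f)))
               for name, f in [("zcr", zcr), ("centroid", centroid),
                               ("rolloff", rolloff), ("chroma", chroma)]}
    print(summary)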
Audio data cleaning/de-noising

Oftentimes an audio sample has multiple segments within the same signal that contain nothing but silence or a slight degree of background noise compared to the rest of the audio. For most applications, those low-activity segments make up the irrelevant information of the signal.

The audio clip shown in Figure 8 is a human saying the word "london", plotted in the time domain with signal amplitude on the y-axis and sample number on the x-axis. The areas where the signal is close to zero or low in amplitude are areas where speech is absent and represent the pauses the speaker took while saying the word "london". Figure 9 shows the spectrogram of the same audio signal. A spectrogram contains time on the x-axis and frequency on the y-axis. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams; when the data are represented in a 3D plot they may be called waterfalls. As [Wik21] mentions, spectrograms are used extensively in the fields of music, linguistics, sonar, radar, speech processing, seismology, and others. Spectrograms of audio can be used to identify spoken words phonetically and to analyze the various calls of animals. A spectrogram can be generated by an optical spectrometer, a bank of band-pass filters, by Fourier transform or by a wavelet transform. A spectrogram is usually depicted as a heat map, i.e., as an image with the intensity shown by varying the color or brightness.

After applying the algorithm for signal alteration to remove irrelevant and low-activity audio segments, the resultant audio's time-series plot looks like Figure 10 and its spectrogram looks like Figure 11. It can be seen that the low-activity areas are now missing from the audio and the resultant audio contains more activity-filled regions. This algorithm removes silences as well as low-activity regions from the audio. These visualizations were produced using pyAudioProcessing and can be produced for any audio signal using the library.

[Fig. 8: Time-series representation of speech for "london".]
[Fig. 9: Spectrogram of speech for "london".]
[Fig. 10: Time-series representation of cleaned speech for "london".]
[Fig. 11: Spectrogram of cleaned speech for "london".]

Impact of cleaning on feature formation for a classification task:

A spoken location name classification problem was considered for this evaluation. The dataset consisted of 23 training samples per class and 17 test samples per class, with 2 classes in total: london and boston. This dataset was manually created and can be found linked in the project readme of pyAudioProcessing. For comparative purposes, the classifier is kept constant at SVM, and the parameter C is chosen by grid search for each experiment based on the best precision, recall and F1 score. Results in Table 4 show the impact of applying the low-activity region removal using pyAudioProcessing prior to training the model on MFCC features. It can be seen that the accuracies increased when audio samples were cleaned prior to training the model. This is especially useful in cases where silence or low-activity regions in the audio do not contribute to the predictions and act as noise in the signal.

TABLE 4: Performance comparison on test data between MFCC-feature-trained models with and without cleaning.

    Features      boston acc    london acc
    mfcc          0.765         0.412
    clean+mfcc    0.823         0.471
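The idea behind low-activity removal can be illustrated with a simple energy threshold on short, non-overlapping frames. This is only a minimal sketch of the concept, not pyAudioProcessing's exact algorithm; the threshold choice and the file names are assumptions.

    # Minimal sketch of energy-based low-activity removal (illustrative only).
    # "london.wav" and the 0.5 * median threshold are assumptions.
    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("london.wav", sr=None)

    frame = hop = 2048   # non-overlapping frames
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

    # Keep frames whose energy exceeds a fraction of the clip's median energy
    keep = rms > 0.5 * np.median(rms)

    # Stitch the retained frames back together
    segments = [y[i * hop: i * hop + frame] for i, k in enumerate(keep) if k]
    cleaned = np.concatenate(segments) if segments else y

    sf.write("london_cleaned.wav", cleaned, sr)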
Integrations

pyAudioProcessing integrates with third-party tools such as scikit-learn, matplotlib, and pydub to offer additional functionalities.

Training, classification, and evaluation: The library contains integrations with scikit-learn classifiers for passing audio through feature extraction followed by classification, directly using the raw audio samples as input. Training results include computation of cross-validation results along with hyperparameter tuning details.

Audio format conversion: Some applications and integrations work best with the .wav data format. pyAudioProcessing integrates with tools that perform format conversion and presents them as a functionality of the library.

Audio visualization: Spectrograms are 2-D images representing sequences of spectra with time along one axis, frequency along the other, and brightness or color representing the strength of a frequency component at each time frame [Wys17]. Not only can one see whether there is more or less energy at, for example, 2 Hz vs 10 Hz, but one can also see how energy levels vary over time [PNS]. Some of the convolutional neural network architectures for images can be applied to audio signals on top of spectrograms. This is a different route of building audio models: developing spectrograms followed by image processing. Time-series, frequency-domain, and spectrogram (both time and frequency domain) visualizations can be retrieved using pyAudioProcessing and its integrations. See Figures 9 and 10 as examples.
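Format conversion of the kind described above is commonly handled with pydub, one of the third-party tools named at the start of this section. The snippet below is an illustration of that step rather than pyAudioProcessing's own wrapper; the file names are placeholders and pydub requires ffmpeg to be installed for non-wav inputs.

    # Converting an audio file to .wav with pydub (illustration only;
    # file names are placeholders, and ffmpeg must be available).
    from pydub import AudioSegment

    audio = AudioSegment.from_file("recording.mp3")
    audio.export("recording.wav", format="wav")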
Conclusion

In this paper pyAudioProcessing, an open-source Python library, is presented. The tool implements and integrates a wide range of audio processing functionalities. Using pyAudioProcessing, one can read and visualize audio signals, clean audio signals by removal of irrelevant content, build and extract complex features such as GFCC, MFCC, and other spectrum- and cepstrum-based features, build classification models, and use pre-built trained baseline models to classify different types of audio. Wrappers along with command-line usage examples are provided in the software's readme and wiki to give the user guidance and flexibility of usage. pyAudioProcessing has been used in active research around audio processing and can be used as the basis for further Python-based research efforts. pyAudioProcessing is updated frequently in order to apply enhancements and new functionalities reflecting recent research efforts of the digital signal processing and machine learning community. Some of the ongoing implementations include additions of cepstral features such as LPCC, integration with deep learning backends, and a variety of spectrogram formations that can be used for image-classification-based audio classification tasks.

REFERENCES

[Auf20] Ben Auffarth. Artificial Intelligence with Python Cookbook. Packt Publishing, 10 2020.
[BGSR21] Srivi Balaji, Meghana Gopannagari, Svanik Sharma, and Preethi Rajgopal. Developing a machine learning algorithm to assess attention levels in ADHD students in a virtual learning setting using audio and video processing. International Journal of Recent Technology and Engineering (IJRTE), 10, 5 2021. doi:10.35940/ijrte.A5965.0510121.
[BWG+13] Dmitry Bogdanov, N Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, G Roma, Justin Salamon, Jose Zapata, and Xavier Serra. Essentia: an audio analysis library for music information retrieval. 11 2013.
[CD14] Paresh M. Chauhan and Nikita P. Desai. Mel frequency cepstral coefficients (MFCC) based speaker identification in noisy environment using Wiener filter. In 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), pages 1-5, 2014. doi:10.1109/ICGCCEE.2014.6921394.
[con22] Wikipedia contributors. Chroma feature - Wikipedia, the free encyclopedia, 2022. Online; accessed 18-May-2022. URL: https://en.wikipedia.org/w/index.php?title=Chroma_feature&oldid=1066722932.
[Din21] Vincent Dinger. Master Thesis KI Methodiken für die Verarbeitung akustischer Signale / AI Usage for Processing Acoustic Signals. PhD thesis, Kaiserslautern University of Applied Sciences, 03 2021. doi:10.13140/RG.2.2.15872.97287.
[Gia15] Theodoros Giannakopoulos. pyAudioAnalysis: An open-source python library for audio signal analysis. PloS one, 10(12), 2015. doi:10.1371/journal.pone.0144610.
[JDHP17] Medikonda Jeevan, Atul Dhingra, M. Hanmandlu, and Bijaya Panigrahi. Robust Speaker Verification Using GFCC Based i-Vectors, volume 395, pages 85-91. Springer, 10 2017. doi:10.1007/978-81-322-3592-7_9.
[JS21] Jyotika Singh. Social Media Analysis using Natural Language Processing Techniques. In Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe, editors, Proceedings of the 20th Python in Science Conference, pages 52-58, 2021. URL: http://conference.scipy.org/proceedings/scipy2021/pdfs/jyotika_singh.pdf, doi:10.25080/majora-1b6fd038-009.
[Mal20] Ayoub Malek. spafe/spafe: 0.1.2, April 2020. URL: https://github.com/SuperKogito/spafe.
[MRL+15] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, volume 8, 2015. doi:10.5281/zenodo.4792298.
[pat21] Method for optimizing media and marketing content using cross-platform video intelligence, 2021. URL: https://patents.google.com/patent/US10949880B2/en.
[pat22] Media and marketing optimization with cross platform consumer and content intelligence, 2022. URL: https://patents.google.com/patent/US20210201349A1/en.
[PeP22] PePy. PePy download statistics, 2022. URL: https://pepy.tech/project/pyAudioProcessing.
[PNS] PNSN. What is a spectrogram? URL: https://pnsn.org/spectrograms/what-is-a-spectrogram#.
[Sin19] Jyotika Singh. An introduction to audio processing and machine learning using python, 2019. URL: https://opensource.com/article/19/9/audio-processing-machine-learning-python.
[Sin21] Jyotika Singh. jsingh811/pyAudioProcessing: Audio processing, feature extraction and classification, July 2021. URL: https://github.com/jsingh811/pyAudioProcessing, doi:10.5281/zenodo.5121041.
[TEC01] George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals, 2001. URL: http://ismir2001.ismir.net/pdf/tzanetakis.pdf.
[Wik21] Wikipedia contributors. Spectrogram - Wikipedia, the free encyclopedia, 2021. [Online; accessed 19-July-2021]. URL: https://en.wikipedia.org/w/index.php?title=Spectrogram&oldid=1031156666.
[Wik22a] Wikipedia contributors. Speaker diarisation - Wikipedia, the free encyclopedia, 2022. [Online; accessed 23-June-2022]. URL: https://en.wikipedia.org/w/index.php?title=Speaker_diarisation&oldid=1090834931.
[Wik22b] Wikipedia contributors. Word embedding - Wikipedia, the free encyclopedia, 2022. [Online; accessed 23-June-2022]. URL: https://en.wikipedia.org/w/index.php?title=Word_embedding&oldid=1091348337.
[Wys17] Lonce Wyse. Audio spectrogram representations for processing with convolutional neural networks. 06 2017.
[ZW13] Xiaojia Zhao and DeLiang Wang. Analyzing noise robustness of MFCC and GFCC features in speaker identification. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7204-7208, 2013. doi:10.1109/ICASSP.2013.6639061.


Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2

Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri. Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada (Koshkarov, Li, Tahiri); Center of Artificial Intelligence, Astrakhan State University, Astrakhan, 414056, Russia (Koshkarov); Department of Computer Science, University of Quebec at Montreal, Montreal, QC, Canada (Luu). Aleksandr Koshkarov and Wanlin Li contributed equally. Corresponding author: Nadia.Tahiri@USherbrooke.ca. Copyright © 2022 Aleksandr Koshkarov et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract: As the SARS-CoV-2 pandemic reaches its peak, researchers around the globe are combining efforts to investigate the genetics of different variants to better deal with its distribution. This paper discusses phylogeographic approaches to examine how patterns of divergence within SARS-CoV-2 coincide with geographic features, such as climatic features. First, we propose a Python-based bioinformatic pipeline called aPhylogeo for phylogeographic analysis, written in Python 3, that helps researchers better understand the distribution of the virus in specific regions via a configuration file, and then runs all the analysis operations in a single run. In particular, the aPhylogeo tool determines which parts of the genetic sequence undergo a high mutation rate depending on geographic conditions, using a sliding window that moves along the genetic sequence alignment in user-defined steps and a window size. As a Python-based cross-platform program, aPhylogeo works on Windows®, MacOS X® and GNU/Linux. The implementation of this pipeline is publicly available on GitHub (https://github.com/tahiri-lab/aPhylogeo). Second, we present an example analysis with our new aPhylogeo tool on real data (SARS-CoV-2) to understand the occurrence of different variants.

Index Terms: Phylogeography, SARS-CoV-2, Bioinformatics, Genetic, Climatic Condition

Introduction

The global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is at its peak, and more and more variants of SARS-CoV-2 have been described over time. Among these, some are considered variants of concern (VOC) by the World Health Organization (WHO) due to their impact on global public health, such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron (B.1.1.529) [CRA+22]. Although significant progress has been made in vaccine development and mass vaccination is being implemented in many countries, the continued emergence of new variants of SARS-CoV-2 threatens to reverse the progress made to date. Researchers around the world collaborate to better understand the genetics of the different variants, along with the factors that influence the epidemiology of this infectious disease. Genetic studies of the different variants have contributed to the development of vaccines to better combat the spread of the virus. Studying the factors (e.g., environment, host, agent of transmission) that influence epidemiology helps us to limit the continued spread of infection and prepare for the future re-emergence of diseases caused by subtypes of coronavirus [LFZK06]. However, few studies report associations between environmental factors and the genetics of different variants. Different variants of SARS-CoV-2 are expected to spread differently depending on geographical conditions, such as meteorological parameters. The main objective of this study is to find clear correlations between the genetics and the geographic distribution of different variants of SARS-CoV-2.
Several studies showed that COVID-19 cases and related climatic factors correlate significantly with each other ([OCFC20], [SDdPS+20], and [SMVS+22]). Oliveiros et al. [OCFC20] reported a decrease in the rate of SARS-CoV-2 progression with the onset of spring and summer in the northern hemisphere. Sobral et al. [SDdPS+20] suggested a negative correlation between mean temperature by country and the number of SARS-CoV-2 infections, along with a positive correlation between rainfall and SARS-CoV-2 transmission. This contrasts with the results of the study by Sabarathinam et al. [SMVS+22], which showed that an increase in temperature led to an increase in the spread of SARS-CoV-2. The results of Chen et al. [CPK+21] imply that a country located 1000 km closer to the equator can expect 33% fewer cases of SARS-CoV-2 per million population. Some virus variants may be more stable in environments with specific climatic factors. Sabarathinam et al. [SMVS+22] compared mutation patterns of SARS-CoV-2 with time series of changes in precipitation, humidity, and temperature. They suggested that temperatures between 43°F and 54°F, humidity of 67-75%, and precipitation of 2-4 mm may be the optimal environment for the transition of the mutant form from D614 to G614.

In this study, we examine the geospatial lineage of SARS-CoV-2 by combining genetic data and metadata from the associated sampling locations. Thus, an association between genetics and the geographic distribution of SARS-CoV-2 variants can be found. We focus on developing a new algorithm to find relationships between a reference tree (i.e., a tree of geographic species distributions, a temperature tree, a habitat precipitation tree, or others) and the genetic composition of the species. This new algorithm can help find which genes, or which subparts of a gene, are sensitive or favorable to a given environment.
Problem statement and proposal

Phylogeography is the study of the principles and processes that govern the distribution of genealogical lineages, particularly at the intraspecific level. The geographic distribution of species is often correlated with the patterns associated with the species' genes ([A+00] and [KM02]). In a phylogeographic study, three major processes should be considered (see [Nag92] for more details):

1) Genetic drift is the result of allele sampling errors. These errors are due to generational transmission of alleles and geographical barriers. Genetic drift is a function of the size of the population: the larger the population, the lower the genetic drift. This is explained by the ability to maintain genetic diversity in the original population. By convention, we say that an allele is fixed if it reaches a frequency of 100%, and that it is lost if it reaches a frequency of 0%.
2) Gene flow, or migration, is an important process for conducting a phylogeographic study. It is the transfer of alleles from one population to another, increasing intrapopulation diversity and decreasing interpopulation diversity.
3) There are many forms of selection in all species. Here we indicate the two most important of them, as they are essential for a phylogeographic study. (a) Sexual selection is a phenomenon resulting from an attractive characteristic between two individuals; therefore, this selection is a function of the size of the population. (b) Natural selection is a function of fertility, mortality, and adaptation of a species to a habitat.

Populations living in different environments with varying climatic conditions are subject to pressures that can lead to evolutionary divergence and reproductive isolation ([OS98] and [Sch01]). Phylogeny and geography are then correlated. This study, therefore, aims to present an algorithm that shows the possible correlation between certain genes or gene fragments and the geographical distribution of species.

Most studies in phylogeography consider only genetic data, without directly considering climatic data; they take this information only indirectly, as a basis for locating the habitat of the species. We have developed the first version of a phylogeographic pipeline that integrates climate data. The sliding window strategy provides more robust results, as it particularly highlights the areas sensitive to climate adaptation.
Methods and Python scripts

In order to achieve our goal, we designed a workflow and then developed a script in Python version 3.9 called aPhylogeo for phylogeographic analysis (see [LLKT22] for more details). It interacts with multiple bioinformatic programs, taking climatic data and nucleotide data as input, and performs multiple phylogenetic analyses on nucleotide sequencing data using a sliding window approach. The process is divided into three main steps (see Figure 1).

The first step involves collecting data and searching for quality viral sequences, which are essential for the validity of our results. All sequences were retrieved from the NCBI Virus website (National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). In total, 20 regions were selected to represent 38 gene sequences of SARS-CoV-2. After collecting the genetic data, we extracted 5 climatic factors for the 20 regions, i.e., temperature, humidity, precipitation, wind speed, and sky surface shortwave downward irradiance. These data were obtained from the NASA website (https://power.larc.nasa.gov/).

In the second step, trees are created from the climatic data and the genetic data, respectively. For the climatic data, we calculated the dissimilarity between each pair of variants (i.e., from different climatic conditions), resulting in a symmetric square matrix. From this matrix, the neighbor joining algorithm was used to construct the climate tree. The same approach was implemented for the genetic data. Using nucleotide sequences from the 38 SARS-CoV-2 lineages, phylogenetic reconstruction is repeated to construct genetic trees, considering only the data within a window that moves along the alignment in user-defined steps and window size (their length is denoted by the number of base pairs (bp)).

In the third step, the phylogenetic trees constructed in each sliding window are compared to the climatic trees using the Robinson and Foulds (RF) topological distance [RF81]. The distance was normalized by 2n-6, where n is the number of leaves (i.e., taxa). The proposed approach considers bootstrapping. The implementation of the sliding window technique provides a more accurate identification of regions with high gene mutation rates. As a result, we highlighted a correlation between parts of the genes with a high rate of mutations and the geographic distribution of viruses, which emphasizes the emergence of new variants (i.e., Alpha, Beta, Delta, Gamma, and Omicron).

The sliding window strategy can detect genetic fragments that depend on environmental parameters, but this work requires time-consuming data preprocessing and the use of several bioinformatics programs. For example, we need to verify that each sequence identifier in the sequencing data always matches the corresponding metadata. If samples are added or removed, we need to check whether the sequencing dataset matches the metadata and make changes accordingly. In the next stage, we need to align the sequences (multiple sequence alignment, MSA) and integrate everything step by step into specific software such as MUSCLE [Edg04], the Phylip package (i.e., Seqboot, DNADist, Neighbor, and Consense) [Fel05], RF [RF81], and raxmlHPC [Sta14]. The use of each piece of software requires expertise in bioinformatics. In addition, the intermediate analysis steps inevitably generate many files, the management of which not only consumes the time of the biologist but is also subject to errors, which reduces the reproducibility of the study. At present, there are only a few systems designed to automate phylogeographic analysis. In this context, the development of a computer program for a better understanding of the nature and evolution of coronavirus is essential for the advancement of clinical research.

[Fig. 1: The workflow of the algorithm. The operations within this workflow include several blocks, highlighted by three different colors. The first block (grey) is responsible for creating the trees based on the climate data. The second block (green) performs input parameter validation. The third block (blue) allows the creation of the phylogenetic trees; it is the most important block and the basis of this study, through which the user receives the output data with the necessary calculations.]
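The climate-tree part of the second step (pairwise dissimilarities followed by neighbor joining) can be sketched as follows. This is a minimal illustration under stated assumptions: the lineage names and climate values are placeholders, and Biopython's neighbor-joining implementation is used here for compactness, whereas the pipeline itself calls the Phylip Neighbor program.

    # Minimal sketch: dissimilarity matrix over standardized climate factors,
    # then neighbor joining. Names/values are placeholders; Biopython stands in
    # for the Phylip "neighbor" program used by the pipeline.
    import numpy as np
    from Bio import Phylo
    from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

    lineages = ["A.2.3", "D.2", "C.37", "Q.2"]
    climate = np.array([      # temperature, humidity, precipitation, wind, irradiance
        [7.1, 5.9, 1.2, 4.4, 2.1],
        [21.5, 9.8, 0.4, 5.1, 6.0],
        [24.0, 12.3, 3.1, 2.9, 5.5],
        [3.2, 4.1, 2.0, 3.7, 1.4],
    ])

    z = (climate - climate.mean(axis=0)) / climate.std(axis=0)   # standardize factors

    # Lower-triangular Euclidean dissimilarity matrix (diagonal included),
    # in the layout Biopython's DistanceMatrix expects
    matrix = [[float(np.linalg.norm(z[i] - z[j])) for j in range(i + 1)]
              for i in range(len(lineages))]
    dm = DistanceMatrix(names=lineages, matrix=matrix)

    climate_tree = DistanceTreeConstructor().nj(dm)
    Phylo.draw_ascii(climate_tree)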
The creation of phylogenetic trees, as mentioned above, is an important part of the solution and includes the main steps of the developed pipeline. The following function is intended for genetic data; its main parameters are as follows:

    def create_phylo_tree(gene,
                          window_size,
                          step_size,
                          bootstrap_threshold,
                          rf_threshold,
                          data_names):
        number_seq = align_sequence(gene)
        sliding_window(window_size, step_size)
        ...
        for file in files:
            try:
                ...
                create_bootstrap()
                run_dnadist()
                run_neighbor()
                run_consense()
                filter_results(gene,
                               bootstrap_threshold,
                               rf_threshold,
                               data_names,
                               number_seq,
                               file)
                ...
            except Exception as error:
                raise

This function takes the gene data, window size, step size, bootstrap threshold, threshold for the Robinson and Foulds distance, and data names as input parameters. The function then sequentially connects the main steps of the pipeline: align_sequence(gene), sliding_window(window_size, step_size), create_bootstrap(), run_dnadist(), run_neighbor(), run_consense(), and filter_results with its parameters. As a result, we obtain a phylogenetic tree (or several trees), which is written to a file.
We have created a function (create_tree) to create the climate trees. The function is described as follows:

    def create_tree(file_name, names):
        for i in range(1, len(names)):
            create_matrix(file_name,
                          names[0],
                          names[i],
                          "infile")
            os.system("./exec/neighbor " +
                      "< input/input.txt")
            subprocess.call(["mv", "outtree", "intree"])
            subprocess.call(["rm", "infile", "outfile"])
            os.system("./exec/consense " +
                      "< input/input.txt")
            newick_file = names[i].replace(" ", "_") + "_newick"
            subprocess.call(["rm", "outfile"])
            subprocess.call(["mv", "outtree", newick_file])

The following sliding window function illustrates moving the sliding window along an alignment, with the window size and step size as parameters. The first 11 characters of each output line are allocated to the species name, plus a space.

    def sliding_window(window_size=0, step=0):
        try:
            f = open("infile", "r")
            ...
            # slide the window along the sequence
            start = 0
            fin = start + window_size
            while fin <= longueur:
                index = 0
                with open("out", "r") as f, ... as out:
                    ...
                    for line in f:
                        if line != "\n":
                            espece = list_names[index]
                            nb_espace = 11 - len(espece)
                            out.write(espece)
                            for i in range(nb_espace):
                                out.write(" ")
                            out.write(line[debut:fin])
                            index = index + 1
                    out.close()
                f.close()
                start = start + step
                fin = fin + step
        except:
            print("An error occurred.")
Algorithmic complexity

The complexity of the algorithm described in the previous section depends on the complexity of the various external programs used and on the number of windows that the alignment can contain, plus one for the total alignment that the program will process. Recall the complexities of the different external programs used in the algorithm:

• SeqBoot program: O(r × n × SA)
• DNADist program: O(n²)
• Neighbor program: O(n³)
• Consense program: O(r × n²)
• RaxML program: O(e × n × SA)
• RF program: O(n²),

where n is the number of species (or taxa), r is the number of replicates, SA is the size of the multiple sequence alignment (MSA), and e is the number of refinement steps performed by the RaxML algorithm. For all SA ∈ N* and for all WS, S ∈ N, the number of windows can be evaluated as follows (Eq. 1):

    nb = (SA − WS) / S + 1,    (1)

where WS is the window size and S is the step.
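Equation (1) can be expressed as a small helper function. Integer (floor) division is assumed below, since the number of windows must be a whole number; the alignment length used in the example is illustrative only.

    # Number of sliding windows per Eq. (1); floor division is assumed since
    # the count of windows must be an integer.
    def number_of_windows(alignment_length, window_size, step):
        return (alignment_length - window_size) // step + 1

    # Illustrative values: a 29,000 bp alignment with the (200 bp, 50 bp)
    # combination retained later in the Results section.
    print(number_of_windows(29000, 200, 50))   # 577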
Dataset

The following two principles were applied to select the samples for analysis.

1) Selection of SARS-CoV-2 Pango lineages that are dispersed in different phylogenetic clusters whenever possible.

The Pango lineage nomenclature system is hierarchical and fine-scaled and is designed to capture the leading edge of pandemic transmission. Each Pango lineage aims to define an epidemiologically relevant phylogenetic cluster, for instance, an introduction into a distinct geographic area with evidence of onward transmission [RHO+20]. On one side, Pango lineages signify groups or clusters of infections with shared ancestry: if the entire pandemic can be thought of as a vast branching tree of transmission, then the Pango lineages represent individual branches within that tree. On the other side, Pango lineages are intended to highlight epidemiologically relevant events, such as the appearance of the virus in a new location, a rapid increase in the number of cases, or the evolution of viruses with new phenotypes [OSU+21]. Therefore, to have some sequence diversity in the selected samples, we avoided selecting lineages belonging to the same or similar phylogenetic clusters. For example, among C.36, C.36.1, C.36.2, C.36.3 and C.36.3.1, only C.36 was used as a sample for analysis.

2) Selection of lineages that are clearly dominant in a particular region compared to other regions.

Through significant advances in the generation and exchange of SARS-CoV-2 genomic data in real time, the international spread of lineages is tracked and recorded on the website cov-lineages.org/global_report.html [OHP+21]. Based on the statistical information provided by the website, our study focuses on SARS-CoV-2 lineages that were first identified (Earliest date) and widely disseminated in a particular country (Most common country) during a certain period (Table 1). We list four examples of the distribution of a set of lineages:

• Lineages A.2.3 and B.1.1.107 both have 100% distribution in the United Kingdom. Lineages D.2 and D.3 both have 100% distribution in Australia. B.1.1.172, L.4 and P.1.13 have 100% distribution in the United States. Finally, AH.1, AK.2 and C.7 have 100% distribution in Switzerland, Germany, and Denmark, respectively.
• The country with the widest distribution of L.2 is the Netherlands (77.0%), followed by Germany (19.0%). Due to the 58% difference in the distribution of L.2 between the two locations, we consider the Netherlands the main distribution country of L.2 and, therefore, it was selected as a sample.
• Similarly, the most predominant country of distribution of C.37 is Peru (44%), followed by Chile (19.0%), a difference of 25%. Among all samples of this study, C.37 was the lineage with the smallest difference in distribution percentage between the top two countries. Considering the need to increase the diversity of the geographical distribution of the samples, C.37 was also selected.
• In contrast, the distribution of C.6 is 17.0% in France, 14.0% in Angola, 13.0% in Portugal, and 8.0% in Switzerland; we concluded that C.6 does not show a tendency in terms of geographic distribution and it was therefore not included as a sample for analysis.

In accordance with the above principles, we selected 38 lineages with regional characteristics for further study. Based on the location information, complete nucleotide sequencing data for these 38 lineages were collected from the NCBI Virus website (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). When multiple sequencing results were available for the same lineage in the same country, we selected the sequence whose collection date was closest to the earliest date presented. If there were several sequencing results for the same country on the same date, the sequence with the least number of ambiguous characters (N per nucleotide) was selected (Table 1).

Based on the sampling locations of each lineage sequence in Table 1 (consistent with the most common country, but accurate to specific cities), combined with the time when the lineage was first discovered, we obtained data on the climatic conditions at the time each lineage was first discovered. The meteorological parameters include Temperature at 2 meters, Specific humidity at 2 meters, Precipitation corrected, Wind speed at 10 meters, and All sky surface shortwave downward irradiance. The daily data for the above parameters were collected from the NASA website (https://power.larc.nasa.gov/). Considering that the spread of the virus within a country and the reporting of data statistics take time, we collected climatological data for the three days before the earliest reporting date corresponding to each lineage and averaged them for analysis (Fig. 2).

[Fig. 2: Climatic conditions of each lineage in the most common country at the time of first detection. The climate factors involved include Temperature at 2 meters (C), Specific humidity at 2 meters (g/kg), Precipitation corrected (mm/day), Wind speed at 10 meters (m/s), and All sky surface shortwave downward irradiance (kW-hr/m2/day).]

Although the selection of samples was based on the phylogenetic cluster of each lineage and its transmission, most of the sites involved represent different meteorological conditions. As shown in Figure 2, the 38 samples involved temperatures ranging from -4 C to 32.6 C, with an average temperature of 15.3 C. The specific humidity ranged from 2.9 g/kg to 19.2 g/kg with an average of 8.3 g/kg. The variability of wind speed and all sky surface shortwave downward irradiance was relatively small across samples compared to the other parameters. The wind speed ranged from 0.7 m/s to 9.3 m/s with an average of 4.0 m/s, and the all sky surface shortwave downward irradiance ranged from 0.8 kW-hr/m2/day to 8.6 kW-hr/m2/day with an average of 4.5 kW-hr/m2/day. In contrast to the other parameters, 75% of the cities involved receive less than 2.2 mm of precipitation per day, and only 5 cities have more than 5 mm of precipitation per day. The minimum precipitation is 0 mm/day, the maximum is 12 mm/day, and the average value is 2.1 mm/day.

TABLE 1: SARS-CoV-2 lineages analyzed. The lineage assignments covered in the table were last updated on March 1, 2022. Among all Pango lineages of SARS-CoV-2, 38 lineages were analyzed. Corresponding sequencing data were found in the NCBI database based on the date of earliest detection and the most common country. The table also gives the percentage of the virus in the most common country compared to all countries where the virus is present.

    Lineage      Most common country (share)    Earliest date    Sequence accession
    A.2.3        United Kingdom 100.0%          2020-03-12       OW470304.1
    AE.2         Bahrain 100.0%                 2020-06-23       MW341474
    AH.1         Switzerland 100.0%             2021-01-05       OD999779
    AK.2         Germany 100.0%                 2020-09-19       OU077014
    B.1.1.107    United Kingdom 100.0%          2020-06-06       OA976647
    B.1.1.172    USA 100.0%                     2020-04-06       MW035925
    BA.2.24      Japan 99.0%                    2022-01-27       BS004276
    C.1          South Africa 93.0%             2020-04-16       OM739053.1
    C.7          Denmark 100.0%                 2020-05-11       OU282540
    C.17         Egypt 69.0%                    2020-04-04       MZ380247
    C.20         Switzerland 85.0%              2020-10-26       OU007060
    C.23         USA 90.0%                      2020-05-11       ON134852
    C.31         USA 87.0%                      2020-08-11       OM052492
    C.36         Egypt 34.0%                    2020-03-13       MW828621
    C.37         Peru 43.0%                     2021-02-02       OL622102
    D.2          Australia 100.0%               2020-03-19       MW320730
    D.3          Australia 100.0%               2020-06-14       MW320869
    D.4          United Kingdom 80.0%           2020-08-13       OA967683
    D.5          Sweden 65.0%                   2020-10-12       OU370897
    Q.2          Italy 99.0%                    2020-12-15       OU471040
    Q.3          USA 99.0%                      2020-07-08       ON129429
    Q.6          France 92.0%                   2021-03-02       ON300460
    Q.7          France 86.0%                   2021-01-29       ON442016
    L.2          Netherlands 73.0%              2020-03-23       LR883305
    L.4          USA 100.0%                     2020-06-29       OK546730
    N.1          USA 91.0%                      2020-03-25       MT520277
    N.3          Argentina 96.0%                2020-04-17       MW633892
    N.4          Chile 92.0%                    2020-03-25       MW365278
    N.6          Chile 98.0%                    2020-02-16       MW365092
    N.7          Uruguay 100.0%                 2020-06-18       MW298637
    N.8          Kenya 94.0%                    2020-06-23       OK510491
    N.9          Brazil 96.0%                   2020-09-25       MZ191508
    M.2          Switzerland 90.0%              2020-10-26       OU009929
    P.1.7.1      Peru 94.0%                     2021-02-07       OK594577
    P.1.13       USA 100.0%                     2021-02-24       OL522465
    P.2          Brazil 58.0%                   2020-04-13       ON148325
    P.3          Philippines 83.0%              2021-01-08       OL989074
    P.7          Brazil 71.0%                   2020-07-01       ON148327
Results

In this section, we describe the results obtained on our dataset (see the Dataset section) using our new algorithm (see the Methods section).

The size of the sliding window and the advance step of the sliding window play an important role in the analysis. We restricted our conditions to certain values. For comparison, we applied five combinations of parameters (window size and step size) to the same dataset. These include different window sizes (20 bp, 50 bp, 200 bp) and step sizes (10 bp, 50 bp, 200 bp). These combinations of window sizes and steps provide three different movement strategies (overlapping, non-overlapping, with gaps). Here we fixed the pair (window size, step size) at the values (20, 10), (20, 50), (50, 50), (200, 50) and (200, 200).

1) Robinson and Foulds baseline and bootstrap threshold: the phylogenetic trees constructed in each sliding window are compared to the climatic trees using the Robinson and Foulds topological distance (the RF distance). We defined the value of the RF distance obtained for regions without any mutations as the baseline. Although different sample sizes and sample sequence characteristics can cause differences in the baseline, regions without any mutation are often accompanied by very low bootstrap values. Using the distribution of bootstrap values and combining it with validation of the alignment visualization, we confirmed that the RF baseline value in this study was 50, and the bootstrap values corresponding to this baseline were smaller than 10.

2) Sliding window: the implementation of the sliding window technique with a bootstrap threshold provides a more accurate identification of regions with high gene mutation rates. Figure 3 shows the general pattern of the RF distance changes over alignment windows under different climatic conditions, for bootstrap values greater than 10. The trend of RF value variation under different climatic conditions does not vary much throughout this whole sliding-window scan of the sequence, which may be related to the correlation between the climatic factors (wind speed, downward irradiance, precipitation, humidity, temperature). Windows starting from or containing position 28550 bp were screened in all five scans for the different combinations of window size and step size. The window formed from position 29200 bp to position 29470 bp is screened out in all scans except for the combination of 50 bp window size with 50 bp step size. As Figure 3 shows, if there are gaps in the scan (window size: 20 bp, step size: 50 bp), some potential mutation windows are not screened compared to the other movement strategies, because the sequences in the gap part are not computed by the algorithm. In addition, when the window size is small, the capture of the window mutation signal becomes more sensitive, especially when the number of samples is small. In that case, a single base change in a single sequence can cause a change in the value of the RF distance; therefore, high quality sequencing data is required to prevent errors in the RF distance values caused by ambiguous characters (N in nucleotide). In cases where a larger window size (200 bp) is selected, the overlapping movement strategy (window size: 200 bp, step size: 50 bp) allows the signal of base mutations to be repeatedly verified and enhanced in adjacent window scans, compared to the non-overlapping strategy (window size: 200 bp, step size: 200 bp). In this situation, the range of RF distance values is relatively large, and the number of windows eventually screened is relatively greater. Due to the small number of SARS-CoV-2 lineage sequences that we analyzed in this study, we chose to scan the alignment with a larger window and an overlapping movement strategy for further analysis (window size: 200 bp, step size: 50 bp).

[Fig. 3: Heatmap of Robinson and Foulds topological distance over alignment windows. Five different combinations of parameters were applied: (a) window size = 20 bp and step size = 10 bp; (b) window size = 20 bp and step size = 50 bp; (c) window size = 50 bp and step size = 50 bp; (d) window size = 200 bp and step size = 50 bp; and (e) window size = 200 bp and step size = 200 bp. The Robinson and Foulds topological distance was used to quantify the distance between a phylogenetic tree constructed in a given sliding window and a climatic tree constructed from the corresponding climatic data (wind speed, downward irradiance, precipitation, humidity, temperature).]
the RF distance quantified the difference between a phy- 3) We can envisage a study that would consist in selecting logenetic tree constructed in specific sliding windows and only different phenotypes of a single species, for exam- a climatic tree constructed in corresponding climatic data. ple, Homo Sapiens, in different geographical locations. In Relatively low RF distance values represent relatively this case, we would have to consider a larger geographical more similarity between the phylogenetic tree and the area in order to significantly increase the variation of climatic tree. With our algorithm based on the sliding the selected climatic parameters. This type of research window technique, regions with high mutation rates can would consist in observing the evolution of the genes be identified (Fig 4). Subsequently, we compare the of the selected species according to different climatic RF values of these regions. In cases where there is a parameters. correlation between the occurrence of mutations and the 4) We intend to develop a website that can help biologists, climate factors studied, the regions with relatively low ecologists and other interested professionals to perform RF distance values (the alignment position of 15550bp calculations in their phylogeography projects faster and – 15600bp and 24650bp-24750bp) are more likely to easier. We plan to create a user-friendly interface with be correlated with climate factors than the other loci the input of the necessary initial parameters and the screened for mutations. possibility to save the results (for example, by sending them to an email). These results will include calculated In addition, we can state that we have made an effort to parameters and visualizations. make our tool as independent as possible of the input data and parameters. Our pipeline can also be applied to phylogeographic studies of other species. In cases where it is determined (or Acknowledgements assumed) that the occurrence of a mutation is associated with The authors thank SciPy conference and reviewers for their valu- certain geographic factors, our pipeline can help to highlight able comments on this paper. This work was supported by Natural mutant regions and specific mutant regions within them that are Sciences and Engineering Research Council of Canada and the more likely to be associated with that geographic parameter. Our University of Sherbrooke grant. algorithm can provide a reference for further biological studies. R EFERENCES Conclusions and future work [A+ 00] John C Avise et al. Phylogeography: the history and formation In this paper, a bioinformatics pipeline for phylogeographic of species. Harvard University Press, 2000. doi:10.1093/ analysis is designed to help researchers better understand the icb/41.1.134. distribution of viruses in specific regions using genetic and climate [CPK+ 21] Simiao Chen, Klaus Prettner, Michael Kuhn, Pascal Geldsetzer, data. We propose a new algorithm called aPhylogeo [LLKT22] Chen Wang, Till Bärnighausen, and David E Bloom. Climate and the spread of covid-19. Scientific Reports, 11(1):1–6, 2021. that allows the user to quickly and intuitively create trees from doi:10.1038/s41598-021-87692-z. genetic and climate data. Using a sliding window, the algorithm [CRA+ 22] Marco Cascella, Michael Rajnik, Abdul Aleem, Scott C Dule- finds specific regions on the viral genetic sequences that can bohn, and Raffaela Di Napoli. Features, evaluation, and treat- ment of coronavirus (covid-19). Statpearls [internet], 2022. 
be correlated to the climatic conditions of the region. To our [Edg04] Robert C Edgar. Muscle: a multiple sequence alignment method knowledge, this is the first study of its kind that incorporates with reduced time and space complexity. BMC bioinformatics, climate data into this type of study. It aims to help the scientific 5(1):1–19, 2004. doi:10.1186/1471-2105-5-113. community by facilitating research in the field of phylogeography. [Fel05] Joseph Felsenstein. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Our solution runs on Windows®, MacOS X® and GNU/Linux Sciences, University of Washington, Seattle, 2005. and the code is freely available to researchers and collaborators on [KM02] L Lacey Knowles and Wayne P Maddison. Statistical phylo- GitHub (https://github.com/tahiri-lab/aPhylogeo). geography. Molecular Ecology, 11(12):2623–2635, 2002. doi: As a future work on the project, we plan to incorporate the 10.1146/annurev.ecolsys.38.091206.095702. [LFZK06] Kun Lin, Daniel Yee-Tak Fong, Biliu Zhu, and Johan Karl- following additional features: berg. Environmental factors on the sars epidemic: air tem- perature, passage of time and multiplicative effect of hospital 1) We can handle large amounts of data, especially when infection. Epidemiology & Infection, 134(2):223–230, 2006. considering many countries and longer time periods doi:10.1017/S0950268805005054. 166 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Heatmap of Robinson and Foulds topological distance over alignment windows. Five different combinations of parameters were applied (a) window size = 20bp and step size = 10bp; (b) window size = 20bp and step size = 50bp; (c) window size = 50bp and step size = 50bp; (d) window size = 200bp and step size = 50bp; and (e) window size = 200bp and step size = 200bp. Robinson and Foulds topological distance was used to quantify the distance between a phylogenetic tree constructed in certain sliding windows and a climatic tree constructed in corresponding climatic data (wind speed, downward irradiance, precipitation, humidity, temperature). grinch. Wellcome open research, 6, 2021. doi:10.12688/ wellcomeopenres.16661.2. [OS98] Matthew R Orr and Thomas B Smith. Ecology and speciation. Trends in Ecology & Evolution, 13(12):502–506, 1998. doi: 10.1016/s0169-5347(98)01511-0. [OSU+ 21] Áine O’Toole, Emily Scher, Anthony Underwood, Ben Jack- son, Verity Hill, John T McCrone, Rachel Colquhoun, Chris Ruis, Khalil Abu-Dahab, Ben Taylor, et al. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evolution, 7(2):veab064, 2021. doi: 10.1093/ve/veab064. [RF81] David F Robinson and Leslie R Foulds. Comparison of phyloge- netic trees. Mathematical biosciences, 53(1-2):131–147, 1981. doi:10.1016/0025-5564(81)90043-2. [RHO+ 20] Andrew Rambaut, Edward C Holmes, Áine O’Toole, Verity Hill, John T McCrone, Christopher Ruis, Louis du Plessis, and Oliver G Pybus. A dynamic nomenclature proposal for sars- cov-2 lineages to assist genomic epidemiology. Nature micro- biology, 5(11):1403–1407, 2020. doi:10.1038/s41564- 020-0770-5. [Sch01] Dolph Schluter. Ecology and the origin of species. Trends in ecology & evolution, 16(7):372–380, 2001. doi:10.1016/ s0169-5347(01)02198-x. Fig. 4: Robinson and Foulds topological distance normalized changes [SDdPS 20] Marcos Felipe Falcão Sobral, Gisleia Benini Duarte, Ana + over the alignment windows. 
Multiple phylogenetic analyses were Iza Gomes da Penha Sobral, Marcelo Luiz Monteiro Marinho, performed using a sliding window (window size = 200 bp and step size and André de Souza Melo. Association between climate vari- = 50 bp). Phylogenetic reconstruction was repeated considering only ables and global transmission of sars-cov-2. Science of The data within a window that moved along the alignment in steps. The Total Environment, 729:138997, 2020. doi:10.1016/j. RF normalized topological distance was used to quantify the distance scitotenv.2020.138997. between the phylogenetic tree constructed in each sliding window and [SMVS+ 22] Chidambaram Sabarathinam, Prasanna Mohan Viswanathan, Venkatramanan Senapathi, Shankar Karuppannan, Dhanu Radha the climate tree constructed in the corresponding climate data (Wind Samayamanthula, Gnanachandrasamy Gopalakrishnan, Ra- speed, Downward irradiance, Precipitation, Humidity, Temperature). manathan Alagappan, and Prosun Bhattacharya. Sars-cov-2 Only regions with high genetic mutation rates were marked in the phase i transmission and mutability linked to the interplay of figure. climatic variables: a global observation on the pandemic spread. Environmental Science and Pollution Research, pages 1–18, 2022. doi:10.1007/s11356-021-17481-8. [Sta14] Alexandros Stamatakis. Raxml version 8: a tool for phy- [LLKT22] Wanlin Li, My-Lin Luu, Aleksandr Koshkarov, and Nadia logenetic analysis and post-analysis of large phylogenies. Tahiri. aPhylogeo (version 1.0), July 2022. URL: https:// Bioinformatics, 30(9):1312–1313, 2014. doi:10.1093/ github.com/tahiri-lab/aPhylogeo, doi:doi.org/10.5281/ bioinformatics/btu033. zenodo.6773603. [Nag92] Thomas Nagylaki. Rate of evolution of a quantitative character. Proceedings of the National Academy of Sciences, 89(17):8121– 8124, 1992. doi:10.1073/pnas.89.17.8121. [OCFC20] Barbara Oliveiros, Liliana Caramelo, Nuno C Ferreira, and Francisco Caramelo. Role of temperature and humidity in the modulation of the doubling time of covid-19 cases. MedRxiv, 2020. doi:10.1101/2020.03.05.20031872. [OHP+ 21] Áine O’Toole, Verity Hill, Oliver G Pybus, Alexander Watts, Issac I Bogoch, Kamran Khan, Jane P Messina, The COVID, Genomics UK, et al. Tracking the international spread of sars-cov-2 lineages b. 1.1. 7 and b. 1.351/501y-v2 with PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 167 Global optimization software library for research and education Nadia Udler‡∗ F Abstract—Machine learning models are often represented by functions given the distance to optimal point. In this paper the basic SA algorithm by computer programs. Optimization of such functions is a challenging task is used as a starting point. We can offer more basic module as a because traditional derivative based optimization methods with guaranteed starting point ( and by specifying distribution as ’exponential’ get convergence properties cannot be used.. This software allows to create new the variant of SA) thus achieving more flexible design opportuni- optimization methods with desired properties, based on basic modules. These ties for custom optimization algorithm. Note that convergence of basic modules are designed in accordance with approach for constructing global optimization methods based on potential theory [KAP]. 
These methods do not the newly created hybrid algorithm does not need to be verified use derivatives of objective function and as a result work with nondifferentiable when using minpy basic modules, whereas previously mentioned functions (or functions given by computer programs, or black box functions), but SA-based hybrid has to be verified separately ( see [GLUQ]) have guaranteed convergence. The software helps to understand principles of Testing functions are included in the library. They represent learning algorithms. This software may be used by researchers to design their broad range of use cases covering above mentioned difficult own variations or hybrids of known heuristic optimization methods. It may be functions. In this paper we describe the approach underlying these used by students to understand how known heuristic optimization methods work optimization methods. The distinctive feature of these methods and how certain parameters affect the behavior of the method. is that they are not heuristic in nature. The algorithms are de- Index Terms—global optimization, black-box functions, algorithmically defined rived based on potential theory [KAP], and their convergence is functions, potential functions guaranteed by their derivation method [KPP]. Recently potential theory was applied to prove convergence of well known heuristic methods, for example see [BIS] for convergence of PSO, and to Introduction re prove convergence of well known gradient based methods, in Optimization lies at the heart of machine learning and data particular, first order methods - see [NBAG] for convergence of science. One of the most relevant problems in machine learning is gradient descent and [ZALO] for mirror descent. For potential automatic selection of the algorithm depending on the objective. functions approach for stochastic first order optimization methods This is necessary in many applications such as robotics, simulating see [ATFB]. biological or chemical processes, trading strategies optimization, to name a few [KHNT]. We developed a library of optimization methods as a first step for self-adapting algorithms. Optimization Outline of the approach methods in this library work with all objectives including very The approach works for non-smooth or algorithmically defined onerous ones, such as black box functions and functions given by functions. For detailed description of the approach see [KAP], computer code, and the convergences of methods is guaranteed. [KP]. In this approach the original optimization problem is re- This library allows to create customized derivative free learning placed with a randomized problem, allowing the use of Monte- algorithms with desired properties by combining building blocks Carlo methods for calculating integrals. This is especially impor- from this library or other Python libraries. tant if the objective function is given by its values (no analytical The library is intended primarily for educational purposes formula) and derivatives are not known. The original problem and its focus is on transparency of the methods rather than on is restated in the framework of gradient (sub gradient) methods, efficiency of implementation. employing the standard theory (convergence theorems for gradient The library can be used by researches to design optimization (sub gradient) methods), whereas no derivatives of the objective methods with desired properties by varying parameters of the function are needed. At the same time, the method obtained is general algorithm. 
a method of nonlocal search unlike other gradient methods. It As an example, consider variant of simulated annealing (SA) will be shown, that instead of measuring the gradient of the proposed in [FGSB] where different values of parameters ( Boltz- objective function we can measure the gradient of the potential man distribution parameters, step size, etc.) are used depending of function at each iteration step , and the value of the gradient can be obtained using values of objective function only, in the * Corresponding author: nadiakap@optonline.net ‡ University of Connecticut (Stamford) framework of Monte Carlo methods for calculating integrals. Furthermore, this value does not have to be precise, because Copyright © 2022 Nadia Udler. This is an open-access article distributed it is recalculated at each iteration step. It will also be shown under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the that well-known zero-order optimization methods ( methods that original author and source are credited. do not use derivatives of objective function but its values only) 168 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) are generalized into their adaptive extensions. The generalization where ( , ) defines dot product. of zero-order methods (that are heuristic in nature) is obtained Assuming differentiability of the integrals (for example, by using standardized methodology, namely, gradient (sub gradient) selecting the appropriate pxε (x, y) and using 3, 4 we get framework. We consider the unconstrained optimization problem Z Z d f (x1 , x2 , ..xn ) → min (1) δY F(X 0 ) = [ f (x)pxε (x − εy, y)dxdy]ε=0 = x∈Rn dε Rn Rn By randomizing we get dR R d R R = [ dε Rn f (x) Rn pxε (x−εy, y)dxdy]ε=0 = [ Rn f (x)( dε Rn pxε (x− F(X) = E[ f (X)] → min (2) εy, y)dy)dx]ε=0 = x∈Rn R R d = R Rn f (x)( Rn [ dε pxε (x − εy, y)]ε=0 dy)dx = where X is a random vector from Rn , {X} is a set of such random R − Rn f (x)( Rn [divx (pxε (x, y)y)]dy)dx = vectors, and E[·] is the expectation operator. Problem 2 is equivalent to problem 1 in the sense that any Z Z realization of the random vector X ∗ , where X ∗ is a solution to 2, − f (x)divx [ (pxε (x, y)y)dy]dx Rn Rn that has a nonzero probability, will be a solution to problem 1 (see [KAP] for proof). p ε (x,y) Using formula for conditional distribution pY /X 0 =x (y) = px εy (x)) , Note that 2 is the stochastic optimization problem of the R x functional F(X) . where pxε (x) = Rn pxε y (x, u)du R R To study the gradient nature of the solution algorithms for we get δY F(X 0 ) = − Rn f (x)divx [pxε (x) Rn pY /X 0 =x (y)ydy]dx R problem 2, a variation of objective functional F(X) will be consid- Denote y(x) = Rn ypY /X 0 =x (y)dy = E[Y /X 0 = x] ered. Taking into account normalization condition for density we The suggested approach makes it possible to obtain opti- arrive at the following expression for directional derivative: mization methods in systematic way, similar to the methodology Z adopted in smooth optimization. Derivation includes random- δY F(X 0 ) = − ( f (x) −C)divx [px0 (x)y(x)]dx ization of the original optimization problem, finding directional Rn derivative for the randomized problem and choosing moving direction Y based on the condition that directional derivative in where C is arbitrary chosen constant the direction of Y is being less or equal to 0. 
Considering solution to δY F(X 0 ) → minY allows to obtain Because of randomization, the expression for directional gradient-like algorithms for optimization that use only objective derivative doesn’t contain the differential characteristics of the function values ( do not use derivatives of objective function) original function. We obtain the condition for selecting the di- rection of search Y in terms of its characteristics - conditional expectation. Conditional expectation is a vector function (or Potential function as a solution to Poisson’s equation vector field) and can be decomposed (following the theorem of Decomposing vector field px0 (x)y(x) into potential field ∇ϕ0 (x) decomposition of the vector field) into the sum of the gradient and divergence-free component W0 (x): of scalar function P and a function with zero divergence. P is called a potential function. As a result the original problem is px0 (x)y(x) = ∇φ0 (x) +W0 (x) reduced to optimization of the potential function, furthermore, the potential function is specific for each iteration step. Next, we arrive at partial differential equation that connects P and the original we arrive at Poisson’s equation for potential function: function. To define computational algorithms it is necessary to specify the dynamics of the random vectors. For example, the ∆ϕ0 (x) = −L[ f (x) −C]pu (x) dynamics can be expressed in a form of densities. For certain class of distributions, for example normal distribution, the dynamics can where L is a constant be written in terms of expectation and covariance matrix. It is also Solution to Poisson’s equation approaching 0 at infinity may possible to express the dynamics in mixed characteristics. be written in the following form Z Expression for directional derivative ϕ0 (x) = E(x, ξ )[ f (ξ ) −C]pu (ξ )dξ Rn Derivative of objective functional F(X) in the direction of the random vector Y at the point X 0 (Gateaux derivative) is: where E(x, ξ ) is a fundamental solution to Laplace’s equation. d δY F(X 0 ) = dε d F(X 0 + εY )ε=0 = dε F(X ε )dxε=0 = Then for potential component ∆ϕ0 (x) we have d R dε f (X)pxε (x)ε=0 where density function of the random vector X ε = X 0 + εY ∆ϕ0 (x) = −LE[∆x E(x, u)( f (x) −C)] may be expressed in terms of joint density function pX 0 ,Y (x, y) of X 0 and Y as follows: To conclude, the representation for gradient-like direction is Z obtained. This direction maximizes directional derivative of the pxε (x) = pxε (x − εy, y)dy (3) Rn objective functional F(X). Therefore, this representation can be The following relation (property of divergence) will be needed used for computing the gradient of the objective function f(x) later using only its values. Gradient direction of the objective function d f(x) is determined by the gradient of the potential function ϕ0 (x), pxε (x − εy, y) = (−∇x pxε (x, y), y) = −divx (pxε (x, y)y) (4) which, in turn, is determined by Poisson’s equation. dε GLOBAL OPTIMIZATION SOFTWARE LIBRARY FOR RESEARCH AND EDUCATION 169 Practical considerations The code is organized in such a way that it allows to pair the The dynamics of the expectation of objective function may be algorithm with objective function. The new algorithm may be im- written in the space of random vectors as follows: plmented as method of class Minimize. Newly created algorithm can be paired with test objectivve function supplied with a library XN+1 = XN + αN+1YN+1 or with externally supplied objective function (implemented in separate python module). 
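As an illustration of this pairing of an algorithm with an externally supplied objective, the organization could look roughly as follows. This is a design sketch only: the class and method names are hypothetical and it is not the actual minpy interface.

```python
# Illustrative sketch (not the minpy API): algorithms are methods of a
# Minimize-like class, and the objective is supplied externally, e.g. from a
# separate module of test functions.
import numpy as np

def sphere(x):                      # stand-in for a library or user-supplied objective
    return float(np.sum(np.asarray(x) ** 2))

class Minimize:
    def __init__(self, objective, x0, step=0.5, max_iter=500, seed=0):
        self.f = objective
        self.x = np.asarray(x0, dtype=float)
        self.step, self.max_iter = step, max_iter
        self.rng = np.random.default_rng(seed)

    def random_search(self):
        """A basic zero-order module: accept random moves that decrease f."""
        fx = self.f(self.x)
        for _ in range(self.max_iter):
            candidate = self.x + self.step * self.rng.normal(size=self.x.shape)
            fc = self.f(candidate)
            if fc < fx:
                self.x, fx = candidate, fc
        return self.x, fx

print(Minimize(sphere, x0=[3.0, -2.0]).random_search())
```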
New algorithms can be made more or where N - iteration number, Y N+1 - random vector that defines less universal, that is, may have different number of parameters direction of move at ( N+1)th iteration, αN+1 -step size on (N+1)th that user can specify. For example, it is possible to create Nelder iteration. Y N+1 must be feasible at each iteration, i.e. the objective and Mead algorithm (NM) using basic modules, and this would functional should decrease: F(X N+1 ) < (X N ). Applying expection be an example of the most specific algorithm. It is also possible to (12) and presenting E[YN+1 asconditional expectation Ex E[Y /X] to create Stochastic Extention of NM (more generic than classic we get: NM, similar to Simplicial Homology Global Optimisation [ESF] XN+1 = E[XN ] + αN+1 EX N E[Y N+1 /X N ] method) and with certain settings of adjustable parameters it may work identical to classic NM. Library repository may be found Replacing mathematical expectations E[XN ] and YN+1 ] with their N+1 here: https://github.com/nadiakap/MinPy_edu estimates E and y(X N ) we get: The following algorithms demonstrate steps similar to steps of E N+1 N = E + αN+1 E X N [y(X N )] Nelder and Mead algorithm (NM) but select only those points with objective function values smaller or equal to mean level of objec- Note that expression for y(X N ) was obtained in the previos section tive funtion. Such an improvement to NM assures its convergence up to certain parameters. By setting parameters to certain values [KPP]. Unlike NM, they are derived from the generic approach. we can obtain stochastic extensions of well known heuristics such First variant (NM-stochastic) resembles NM but corrects some as Nelder and Mead algorithm or Covariance Matrix Adaptation of its drawbacks, and second variant (NM-nonlocal) has some Evolution Strategy. In minpy library we use several common build- similarity to random search as well as to NM and helps to resolve ing blocks to create different algorithms. Customized algorithms some other issues of classical NM algorithm. may be defined by combining these common blocks and varying Steps of NM-stochastic: their parameters. 1) Initialize the search by generating K ≥ n separate real- Main building blocks include computing center of mass of the izations of ui0 , i=1,..K of the random vector U0 , and set sample points and finding newtonian potential. m0 = K1 ∑Ki=0 ui0 2) On step j = 1, 2, ... Key takeaways, example algorithm, and code organization 1 a.Compute the mean level c j−1 = K ∑Ki=1 f (uij−1 ) Many industry professionals and researchers utilize mathematical b.Calculate new set of vertices: optimization packages to search for better solutions of their m j−1 − uij−1 problems. Examples of such problem include minimization of uij = m j−1 + ε j−1 ( f (uij−1 ) − c j−1 ) free energy in physical system [FW], robot gait optimization ||m j−1 − uij−1 ||n from robotics [PHS], designing materials for 3D printing [ZM], [TMAACBA], wine production [CTC], [CWC], optimizing chem- c.Set m j = K1 ∑Ki=0 uij ical reactions [VNJT]. These problems may involve "black box d.Adjust the step size ε j−1 so that f (m j ) < f (m j−1 ). If optimization", where the structure of the objective function is approximate ε j−1 cannot be obtained within the specified number unknown and is revealed through a small sequence of expen- of trails, then set mk = m j−1 sive trials. Software implementations for these methods become e.Use sample standard deviation as termination criterion: more user friendly. 
As a rule, however, certain modeling skills 1 K are needed to formulate real world problem in a way suitable Dj = ( ∑ ( f (uij ) − c j )2 )1/2 K − 1 i=1 for applying software package. Moreover, selecting optimization method appropriate for the model is a challenging task. Our Note that classic simplex search methods do not use values of educational software helps users of such optimization packages objective function to calculate reflection/expantion/contraction co- and may be considered as a companion to them. The focus efficients. Those coefficients are the same for all vertices, whereas of our software is on transparency of the methods rather than in NM-stochastic the distance each vertex will travel depends on efficiency. A principal benefit of our software is the unified on the difference between objective function value and average approach for constructing algorithms whereby any other algorithm value across all vertices ( f (uij ) − c j ). NM-stochastic shares the is obtained from the generalized algorithm by changing certain following drawbacks with classic simplex methods: a. simlex may parameters. Well known heuristic algorithms such as Nelder and collapse into a nearly degenerate figure, and usually proposed Mead (NM) algorithm may be obtained using this generalized remedy is to restart the simlex every once in a while, b. only initial approach, as well as new algorithms. Although some derivative- vertices are randomly generated, and the path of all subsequent free optimization packages (matlab global optimization toolbox, vertices is deterministic. Next variant of the algorithm (NM- Tensorflow Probability optimizers, Excel Evolutionary Solver, nonlocal) maintains the randomness of vertices on each step, while scikit-learn Stochastic Gradient Descent class, scipy.optimize.shgo adjusting the distribution of U0 to mimic the pattern of the modi- method) put a lot of effort in transparency and educational value, fied vertices. The corrected algorithm has much higher exploration they don’t have the same level of flexibility and generality as our power than the first algorithm (similar to the exploration power of system. An example of educational-only optimization software is random search algorithms), and has exploitation power of direct - [SAS]. It is limited to teach Particle Swarm Optimization. search algorithms. 170 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Steps of NM - nonlocal [VNJT] Fath, Verena, Kockmann, Norbert, Otto, Jürgen, Röder, Thorsten, Self-optimising processes and real-time-optimisation 1) Choose a starting point x0 and set m0 = x0 . of organic syntheses in a microreactor system using Nelder–Mead and design of experiments, React. Chem. Eng., 2. On step j = 1, 2, ... Obtain K separate realizations of uii , 2020,5, 1281-1299, https://doi.org/10.1039/D0RE00081G i=1,..K of the random vector U j [ZM] Plüss, T.; Zimmer, F.; Hehn, T.; Murk, A. Characterisation and Comparison of Material Parameters of 3D-Printable Absorbing a.Compute f (uij−1 ), j = 1, 2, ..K, and the sample mean level Materials. Materials 2022, 15, 1503. https://doi.org/10.3390/ ma15041503 1 K [TMAACBA] Thoufeili Taufek, Yupiter H.P. Manurung, Mohd Shahriman c j−1 = ∑ f (uij−1 ) K i=1 Adenan, Syidatul Akma, Hui Leng Choo, Borhen Louhichi, Martin Bednardz, and Izhar Aziz.3D Printing and Additive Manufacturing, 2022, http://doi.org/10.1089/3dp.2021.0197 b.Generate the new estimate of the mean: [CTC] Vismara, P., Coletta, R. & Trombettoni, G. 
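The vertex update, mean level, and termination criterion of NM-stochastic listed above can be prototyped in a few lines of NumPy. The sketch below is a simplified illustration (a fixed number of step-halving trials and no restart logic), not the library implementation.

```python
# Simplified sketch of NM-stochastic: each vertex moves relative to the centre
# of mass by an amount proportional to how its objective value differs from the
# mean level; the step size is halved until the new centre of mass improves.
import numpy as np

def nm_stochastic(f, n, K=20, eps=1.0, max_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(K, n))                            # step 1: K random vertices
    for _ in range(max_iter):
        fvals = np.array([f(ui) for ui in u])
        if np.std(fvals, ddof=1) < tol:                    # step 2e: termination criterion D_j
            break
        c = fvals.mean()                                   # step 2a: mean level c_{j-1}
        m = u.mean(axis=0)                                 # centre of mass m_{j-1}
        diff = m - u
        norms = np.linalg.norm(diff, axis=1) ** n + 1e-12  # guard against division by zero
        step = eps
        for _ in range(30):                                # step 2d: shrink eps until f(m_j) < f(m_{j-1})
            u_new = m + step * (fvals - c)[:, None] * diff / norms[:, None]   # step 2b
            if f(u_new.mean(axis=0)) < f(m):
                u = u_new
                break
            step *= 0.5
    return u.mean(axis=0)

# Example usage on a shifted sphere function in five dimensions.
print(nm_stochastic(lambda x: np.sum((x - 1.0) ** 2), n=5))
```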
Constrained global m j−1 − uij optimization for wine blending. Constraints 21, 597–615 1 K m j = m j−1 + ε j ∑ K i=1 [( f (uij ) − c j ) ||m j−1 − uij ||n ] [CWC] (2016), https://doi.org/10.1007/s10601-015-9235-5 Terry Hui-Ye Chiu, Chienwen Wu, Chun-Hao Chen, A Gen- eralized Wine Quality Prediction Framework by Evolutionary Adjust the step size ε j−1 so that f (m j ) < f (m j−1 ). If approximate Algorithms, International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 6, Nº7,2021, https://doi.org/10. ε j−1 cannot be obtained within the specified number of trails, then 9781/ijimai.2021.04.006 set mk = m j−1 [KHNT] Pascal Kerschke, Holger H. Hoos, Frank Neumann, Heike c.Use sample standard deviation as termination criterion Trautmann; Automated Algorithm Selection: Survey and Per- spectives. Evol Comput 2019; 27 (1): 3–45, https://doi.org/10. 1 K 1162/evco_a_00242 Dj = ( ∑ ( f (uij ) − c j )2 )1/2 K − 1 i=1 [SAS] Leandro dos Santos Coelho, Cezar Augusto Sierakowski, A software tool for teaching of particle swarm optimization fundamentals, Advances in Engineering Software, Volume 39, Issue 11, 2008, Pages 877-887, ISSN 0965-9978, https://doi. R EFERENCES org/10.1016/j.advengsoft.2008.01.005. [ESF] Endres, S.C., Sandrock, C. & Focke, W.W. A simplicial ho- [KAP] Kaplinskii, A.I.,Pesin, A.M.,Propoi, A.I.(1994). Analysis of mology algorithm for Lipschitz optimisation. J Glob Optim 72, search methods of optimization based on potential theory. I: 181–217 (2018), https://doi.org/10.1007/s10898-018-0645-y Nonlocal properties. Automation and Remote Control. Volume 55, N.9, Part 2, September, pp.1316-1323 (rus. pp.97-105), 1994 [KP] Kaplinskii, A.I. and Propoi, A.I., Nonlocal Optimization Meth- ods ofthe First Order Based on Potential Theory, Automation and Remote Control. Volume 55, N.7, Part 2, July, pp.1004- 1011 (rus. pp.97-102), 1994 [KPP] Kaplinskii, A.I., Pesin, A.M.,Propoi, A.I. Analysis of search methods of optimization based on potential theory. III: Conver- gence of methods. Automation and remote Control, Volume 55, N.11, Part 1, November, pp.1604-1610 (rus. pp.66-72 ), 1994. [NBAG] Nikhil Bansal, Anupam Gupta, Potential-function proofs for gradient methods, Theory of Computing, Volume 15, (2019) Article 4 pp. 1-32, https://doi.org/10.4086/toc.2019.v015a004 [ATFB] Adrien Taylor, Francis Bach, Stochastic first-order meth- ods: non-asymptotic and computer-aided analyses via potential functions, arXiv:1902.00947 [math.OC], 2019, https://doi.org/10.48550/arXiv.1902.00947 [ZALO] Zeyuan Allen-Zhu and Lorenzo Orecchia, Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent, Inno- vations in Theoretical Computer Science Conference (ITCS), 2017, pp. 3:1-3:22, https://doi.org/10.4230/LIPIcs.ITCS.2017.3 [BIS] Berthold Immanuel Schmitt, Convergence Analysis for Particle Swarm Optimization, FAU University Press, 2015 [FGSB] FJuan Frausto-Solis, Ernesto Liñán-García, Juan Paulo Sánchez-Hernández, J. Javier González-Barbosa, Carlos González-Flores, Guadalupe Castilla-Valdez, Multiphase Sim- ulated Annealing Based on Boltzmann and Bose-Einstein Distribution Applied to Protein Folding Problem, Advances in Bioinformatics, Volume 2016, Article ID 7357123, https: //doi.org/10.1155/2016/7357123 [GLUQ] Gong G., Liu, Y., Qian M, Simulated annealing with a potential function with discontinuous gradient on Rd , Ici. China Ser. A- Math. 44, 571-578, 2001, https://doi.org/10.1007/BF02876705 [PHS] Valdez, S.I., Hernandez, E., Keshtkar, S. (2020). 
A Hybrid EDA/Nelder-Mead for Concurrent Robot Optimization. In: Madureira, A., Abraham, A., Gandhi, N., Varela, M. (eds) Hybrid Intelligent Systems. HIS 2018. Advances in Intel- ligent Systems and Computing, vol 923. Springer, Cham. https://doi.org/10.1007/978-3-030-14347-3_20 [FW] Fan, Yi & Wang, Pengjun & Heidari, Ali Asghar & Chen, Huiling & HamzaTurabieh, & Mafarja, Majdi, 2022. "Random reselection particle swarm optimization for optimal design of solar photovoltaic modules," Energy, Elsevier, vol. 239(PA), https://doi.org/10.1016/j.energy.2021.121865 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 171 Temporal Word Embeddings Analysis for Disease Prevention Nathan Jacobi‡∗ , Ivan Mo‡§ , Albert You‡ , Krishi Kishore‡ , Zane Page‡ , Shannon P. Quinn‡¶ , Tim Heckmank F Abstract—Human languages’ semantics and structure constantly change over then be studied to track contextual drift over time. However, a time through mediums such as culturally significant events. By viewing the common issue in these so-called “temporal word embeddings” semantic changes of words during notable events, contexts of existing and is that they are often unaligned — i.e. the embeddings do not novel words can be predicted for similar, current events. By studying the initial lie within the same embedding space. Past proposed solutions outbreak of a disease and the associated semantic shifts of select words, we to aligning temporal word embeddings require multiple separate hope to be able to spot social media trends to prevent future outbreaks faster than traditional methods. To explore this idea, we generate a temporal word alignment problems to be solved, or for “anchor words” – words embedding model that allows us to study word semantics evolving over time. that have no contextual shifts between times – to be used for Using these temporal word embeddings, we use machine learning models to mapping one time period to the next [HLJ16]. Yao et al. propose a predict words associated with the disease outbreak. solution to this alignment issue, shown to produce accurate and aligned temporal word embeddings, through solving one joint Index Terms—Natural Language Processing, Word Embeddings, Bioinformat- alignment problem across all time slices, which we utilize here ics, Social Media, Disease Prediction [YSD+ 18]. Introduction & Background Methodology Human languages experience continual changes to their semantic Data Collection & Pre-Processing structures. Natural language processing techniques allow us to Our data set is a corpus D of over 7 million tweets collected examine these semantic alterations through methods such as word from Scott County, Indiana from the dates January 1st, 2014 until embeddings. Word embeddings provide low dimension numerical January 17th, 2017. The data was lent to us from Twitter after representations of words, mapping lexical meanings into a vector a data request, and has not yet been made publicly available. space. Words that lie close together in this vector space represent During this time period, an HIV outbreak was taking place in close semantic similarities [MCCD13]. This numerical vector Scott County, with an eventual 215 confirmed cases being linked space allows for quantitative analysis of semantics and contextual to the outbreak [PPH+ 16]. Gonsalves et al. predicts an additional meanings, allowing for more use in machine learning models that 126 undiagnosed HIV cases were linked to this same outbreak utilize human language. 
We hypothesize that disease outbreaks can be predicted faster [GC18]. The state’s response led to questioning if the outbreak than traditional methods by studying word embeddings and their could have been stemmed or further prevented with an earlier semantic shifts during past outbreaks. By surveying the context response [Gol17]. Our corpus was selected with a focus on tweets of select medical terms and other words associated with a disease related to the outbreak. By closely studying the semantic shifts during the initial outbreak, we create a generalized model that can during this outbreak, we hope to accurately predict similar future be used to catch future similar outbreaks quickly. By leveraging outbreaks before they reach large case numbers, allowing for a social media activity, we predict similar semantic trends can be critical earlier response. found in real time. Additionally, this allows novel terms to be To study semantic shifts through time, the corpus was split evaluated in context without requiring a priori knowledge of them, into 18 temporal buckets, each spanning a 2 month period. All data allowing potential outbreaks to be detected early in their lifespans, utilized in scripts was handled via the pandas Python package. The thus minimizing the resultant damage to public health. corpus within each bucket is represented by Dt , with t representing Given a corpus spanning a fixed time period, multiple word the temporal slice. Within each 2 month period, tweets were split embeddings can be created at set temporal intervals, which can into 12 pre-processed output csv files. Pre-processing steps first removed retweets, links, images, emojis, and punctuation. Com- * Corresponding author: Nathan.Jacobi@uga.edu mon stop words were removed from the tweets using the NLTK ‡ Computer Science Department, University of Georgia Python package, and each tweet was tokenized. A vocabulary § Linguistics Department, University of Georgia ¶ Cellular Biology Department, University of Georgia dictionary was then generated for each of the 18 temporal buckets, || Public Health Department, University of Georgia containing each unique word and a count of its occurrences within its respective bucket. The vocabulary dictionaries for each Copyright © 2022 Nathan Jacobi et al. This is an open-access article dis- bucket were then combined into a global vocabulary dictionary, tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- containing the total counts for each unique word across all 18 vided the original author and source are credited. buckets. Our experiments utilized two vocabulary dictionaries: the 172 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) first being the 10,000 most frequently occurring words from the B = Y (t)U(t) + γU(t) + τ(U(t − 1) +U(t + 1)) global vocabulary for ensuring proper generation of embedding vectors, the second being a combined vocabulary of 15,000 terms, To decompose PPMI(t) in our model, SciPy’s linear algebra including our target HIV/AIDS related terms. This combined package was utilized to solve for eigendecomposition of each vocabulary consisted of the top 10,000 words across D as well PPMI(t), and the top 100 terms were kept to generate an em- as an additional 473 HIV/AIDS related terms that occurred at bedding of d = 100. The alignment was then applied, yielding least 8 times within the corpus. 
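A sketch of the pre-processing and per-bucket vocabulary construction described above, assuming the tweets for one temporal bucket arrive as a pandas DataFrame with a "text" column; the column name and regular expressions are illustrative assumptions rather than the authors' exact code.

```python
# Sketch of tweet cleaning and per-bucket vocabulary counts.
# Requires nltk.download("punkt") and nltk.download("stopwords").
import re
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))

def clean_tweet(text):
    text = re.sub(r"http\S+", " ", text)            # remove links
    text = re.sub(r"[^a-z#\s]", " ", text.lower())  # drop punctuation/emojis, keep hashtags
    return [t for t in word_tokenize(text) if t not in STOP]

def bucket_vocabulary(df):
    """Token counts for one 2-month temporal bucket of tweets."""
    df = df[~df["text"].str.startswith("RT")]       # drop retweets
    counts = Counter()
    for text in df["text"]:
        counts.update(clean_tweet(text))
    return counts

# Global vocabulary = sum of the 18 per-bucket vocabularies, e.g.:
# global_vocab = sum((bucket_vocabulary(d) for d in bucket_frames), Counter())
```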
The 10,000th most frequent term 18 temporally aligned word embedding sets of our vocabulary, in D occurred 39 times, so to ensure results were not influenced with dimensions |V | × d, or 15,000 x 100. These word embedding by sparsity in the less frequent HIV/AIDS terms, 4,527 randomly sets are aligned spatially and in terms of rotations, however there selected terms with occurrences between 10 and 25 times were appears to be some spatial drift that we hope to remove by tuning added to the vocabulary, bringing it to a total of 15,000 terms. hyperparameters. Following alignment, these vectors are usable The HIV/AIDS related terms came from a list of 1,031 terms we for experimentation and analysis. compiled, primarily coming from the U.S. Department of Veteran Predictions for Detecting Modern Shifts Affairs published list of HIV/AIDS related terms, and other terms we thought were pertinent to include, such as HIV medications Following the generation of temporally aligned word embedding, and terms relating to sexual health [Aff05]. they can be used for semantic shift analysis. Using the word embedding vectors generated for each temporal bucket, 2 new Temporally Aligned Vector Generation data sets were created to use for determining patterns in the Generating word2vec embeddings is typically done through 2 semantic shifts surrounding HIV outbreaks. Both of these data primary methods: continuous bag-of-words (CBOW) and skip- sets were constructed using our second vocabulary of 15,000 gram, however many other various models exist [MCCD13]. Our terms, including the 473 HIV/AIDS related terms, and each term’s methods use a CBOW approach at generating embeddings, which embedding of d = 100 that were generated by the dynamic generates a word’s vector embedding based on the context the embedding model. The first experimental data set was the shift word appears in, i.e. the words in a window range surrounding in the d = 100 embedding vector between each time bucket and the target word. Following pre-processing of our corpus, steps the one that immediately followed it. These shifts were calculated for generating word embeddings were applied to each temporal by simply subtracting the next temporal and initial vectors from bucket. For each time bucket, co-occurrence matrices were first each other. In addition to the change in the 100 dimensional vector created, with a window size w = 5. These matrices contained between each time bucket and its next, the initial and next 10 the total occurrences of each word against every other within a dimensional embeddings were included from each, which were window range L of 5 words within the corpus at time t. Each generated using the same dynamic embedding model. This yielded co-occurrence matrix was of dimensions |V | × |V |. Following the each word having 17 observations and 121 features: {d_vec0 . . . generation of each of these co-occurrence matrices, a |V | × |V | d_vec99, v_init_0 . . . v_init_9, v_fin_0 . . . v_fin_9, label}. This dimensioned Positive Pointwise Mutual Information matrix was data set will be referred to as "data_121". The reasoning to include calculated. The value in each cell was calculated as follows: these lower dimensional embeddings was so that both the shift and initial and next positions in the embedding space would be PPMI(t, L)w,c = max{PMI(Dt , L)w,c , 0}, used in our machine learning algorithms. The other experimental where w and c are two words in V. 
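For a single temporal bucket, the co-occurrence and PPMI matrices defined above can be computed as in the following dense-NumPy sketch; a vocabulary of |V| = 15,000 would in practice call for scipy.sparse matrices instead of a dense array.

```python
# Sketch of the per-bucket co-occurrence matrix (window L = 5) and its PPMI.
import numpy as np

def cooccurrence(tweets, word2id, L=5):
    V = len(word2id)
    C = np.zeros((V, V))
    for tokens in tweets:                        # each tweet is a list of tokens
        ids = [word2id[t] for t in tokens if t in word2id]
        for i, w in enumerate(ids):
            for c in ids[max(0, i - L): i + L + 1]:
                if c != w:
                    C[w, c] += 1
    return C

def ppmi(C):
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total    # row marginals
    pc = C.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    return np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)   # clip PMI at zero
```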
Embeddings generated by data set was constructed similarly, but rather than subtracting the word2vec can be approximated by PMI matrices, where given two vectors and including lower dimensions vectors, the initial embedding vectors utilize the following equation [YSD+ 18]: and next 100 dimensional vectors were listed as features. This allowed machine learning algorithms to have access to the full uTw uc ≈ PMI(D, L)w,c positional information of each vector alongside the shift between Each embedding u has a reduced dimensionality d, typically the two. This yielded each word having 17 observations and 201 around 25 - 200. Each PPMI from our data set is created inde- features: {vec_init0 . . . vec_init99, vec_fin0 . . . vec_fin99, label}. pendently from each other temporal bucket. After these PPMI This data set will be referred to as "data_201". With the 15,000 matrices are made, temporal word embeddings can be created terms each having 17 observations, it led to a total of 255,000 using the method proposed by Yao et al. [YSD+ 18]. The proposed observations. It should be noted that in addition to the vector solution focuses on the equation: information, the data sets also listed the number of days since the outbreak began, the predicted number of cases at that point U(t)U(t)T ≈ PPMI(t, L) in time, from [GC18], and the total magnitude of the shift in the where U is a set of embeddings from time period t. Decomposing vector between the corresponding time buckets. All these features each PPMI(t) will yield embedding U(t), however each U(t) is not were dropped prior to use within the models, as the magnitude guaranteed to be in the same embedding space. Yao et al. derives feature was colinear with the other positional features, and the case U(t)A = B with the following equation234 [YSD+ 18]: and day data will not be available in predicting modern outbreaks. A = U(t)T U(t) + (γ + λ + 2τ)I, Using these data, two machine learning algorithms were applied: unsupervised k-means clustering and a supervised neural network. 1. All code used can be found here https://github.com/quinngroup/Twitter- Embedding-Analysis/ K-means Clustering 2. γ represents the forcing regularizer. λ represents the Frobenius norm regularizer. τ represents the smoothing regularizer. To examine any similarities within shifts, k-means clustering was 3. Y(t) represents PPMI(t). performed on the data sets at first. Initial attempts at k-means with 4. The original equation uses W(t), but this acts as identical to U(t) in the the 100 dimensional embeddings yielded extremely large inertial code. We replaced it here to improve readability. values and poor results. In an attempt to reduce inertia, features TEMPORAL WORD EMBEDDINGS ANALYSIS FOR DISEASE PREVENTION 173 for data that k-means would be performed onto were assessed. 100, 150, and 200. Additionally, several certainty thresholds for a K-means was performed on a reduced dimensionality data set, positive classification were tested on each of the models. The best with embedding vectors of dimensionality d = 10, however this results from each will be listed in the results section. As we begin led to strict convergence and poor results again. The data set implementation of these models on other HIV outbreak related with the change in an embeddings vector, data_121, continued data sets, the proper certainty thresholds can be better determined. to contain the changes of vectors between each time bucket and its next. 
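The per-bucket factorization step, reducing each PPMI(t) to a d = 100 embedding from its largest eigenpairs with SciPy, might look as follows; the joint temporal alignment described above is then applied across all 18 buckets and is not shown here.

```python
# Sketch: factorize one PPMI(t) into a |V| x d embedding U with U U^T ~ PPMI(t),
# keeping the top d = 100 eigenpairs (unaligned; alignment is applied afterward).
import numpy as np
from scipy.linalg import eigh

def ppmi_to_embedding(ppmi_matrix, d=100):
    n = ppmi_matrix.shape[0]
    vals, vecs = eigh(ppmi_matrix, subset_by_index=[n - d, n - 1])  # largest d eigenpairs
    vals = np.clip(vals, 0.0, None)               # guard against small negative eigenvalues
    return vecs * np.sqrt(vals)                   # |V| x d embedding for one temporal bucket
```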
However, rather than the 10 dimensional position vectors Results for both time buckets, 2 dimensional positions were used instead, generated by UMAP from the 10 dimensioned vectors. The second Analysis of Embeddings data set, data_201, always led to strict convergence on clustering, To ensure accuracy in word embeddings generated in this model, even when reduced to just the 10 dimensional representations. we utilized word2vec (w2v), a proven neural network method of Therefore, k-means was performed explicitly on the data_121 embeddings [MCCD13]. For each temporal bucket, a static w2v set, with the 2 dimensional representations alongside the 100 embedding of d = 100 was generated to compare to the temporal dimensional change in the vectors. Separate two dimensional embedding generated from the same bucket. These vectors were UMAP representations were generated for use as a feature and generated from the same corpus as the ones generated by the for visual examination. The data set also did not have the term’s dynamic model. As the vectors do not lie within the same label listed as a feature for clustering. embedding space, the vectors cannot be directly compared. As Inertia at convergence on clustering for k-means was reduced the temporal embeddings generated by the alignment model are significantly, as much as 86% after features were reassessed, yield- influenced by other temporal buckets, we hypothesize notably ing significantly better results. Following the clustering, the results different vectors. Methods for testing quality in [YSD+ 18] rely were analyzed to determine which clusters contained the higher on a semi-supervised approach: the corpus used is an annotated than average incidence rates of medical terms and HIV/AIDS set of New York Times articles, and the section (Sports, Business, related terms. These clusters can then be considered target clusters, Politics, etc.) are given alongside the text, and can be used to and large incidences of words being clustered within these can be assess strength of an embedding. Additionally, the corpus used flagged as indicative as a possible outbreak. spans over 20 years, allowing for metrics such as checking the closest word to leaders or titles, such as "president" or "NYC Neural Network Predictions mayor" throughout time. These methods show that this dynamic In addition to the k-means model, we created a neural network word embedding alignment model yields accurate results. model for binary classification of our terms. Our target class was Major differences can be attributed to the word2vec model terms that we hypothesized were closely related to the HIV epi- only being given a section of the corpus at a time, while our model demic in Scott County, i.e. any word in our HIV terms list. Several had access to the entire corpus across all temporal buckets. Terms iterations with varying number of layers, activation functions, and that might not have appeared in the given time bucket might still nodes within each layer were attempted to maximize performance. appear in the embeddings generated by our model, but not at all Each model used an 80% training, 20% testing split on these data, within the word2vec embeddings. For example, most embeddings with two variations performed of this split on training and testing generated by the word2vec model did not often have hashtagged data. 
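A sketch of the clustering step: the 10-dimensional embeddings are reduced to two dimensions with UMAP, and k-means is run on the concatenation of the 100-dimensional shift and the two 2-dimensional positions. Eight clusters are used to match the cluster indices reported in Table 2; the UMAP metric and the remaining hyperparameters are assumptions.

```python
# Sketch of k-means on the data_121-style features with 2-d UMAP positions.
import numpy as np
import umap
from sklearn.cluster import KMeans

def cluster_shifts(d_vec, v10_init, v10_fin, n_clusters=8, seed=0):
    """d_vec: (N, 100) shifts; v10_init / v10_fin: (N, 10) bucket embeddings."""
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=seed)
    xy_init = reducer.fit_transform(v10_init)
    xy_fin = reducer.fit_transform(v10_fin)
    features = np.hstack([d_vec, xy_init, xy_fin])     # term labels excluded
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(features)
    print("inertia at convergence:", km.inertia_)
    return km.labels_
```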
The first was randomly splitting all 255,000 observations, terms in their top 10 closest terms, while embeddings generated without care of some observations for a term being in both training by our model often did. As hashtagged terms are very related to set and some being in the testing set. This split of data will ongoing events, keeping these terms can give useful information be referred to as "mixed" data, as the terms are mixed between to this outbreak. Modern hashtagged terms will likely be the most the splits. The second split of data split the 15,000 words into common novel terms that we have no prior knowledge on, and we 80% training and 20% testing. After the vocabulary was split, hypothesize that these terms will be relevant to ongoing outbreaks. the corresponding observations in the data were split accordingly, Given that our corpus spans a significantly shorter time period leaving all observations for each term within the same split. than the New York Times set, and does not have annotations, we Additionally, we tested a neural network that would accept the use existing baseline data sets of word similarities. We evaluated same data as the input, either data_201 or data_121, with the the accuracy of both models’ vectors using a baseline sources addition of the label assigned to that observation by the k-means for the semantic similarity of terms. The first source used was model as a feature. The goal of these models, in addition was to SimLex-999, which contains 999 word pairings, with correspond- correctly identifying terms we classified as related to the outbreak, ing human generated similarity scores on a scale of 0-10, where was to discover new terms that shift in similar ways to the HIV 10 is the highest similarity [HRK15]. Cosine similarities for each terms we labeled. pair of terms in SimLex-999 were calculated for both the w2v The neural network model used was four layers, with three model vectors as well as vectors generated by the dynamic model ReLu layers with 128, 256, and 256 neurons, followed by a single for each temporal bucket. Pairs containing terms that were not neuron sigmoid output layer. This neural network was constructed present in the model generated vectors were omitted for that using the Keras module of the TensorFlow library. The main models similarity measurements. The cosine similarities were then difference between them was the input data itself. The input data compared to the assigned SimLex scores using the Spearman’s were data_201 with and without k-means labels, data_121 with rank correlation coefficient. The results of this baseline can be seen and without k-means labels. On each of these, there were two splits in Table 1. The Spearman’s coefficient of both sets of embeddings, of the training and testing data, as in the previously mentioned averaged across all 18 temporal buckets, was .151334 for the "mixed" terms. Parameters of the neural network layers were w2v vectors and .15506 for the dynamic word embedding (dwe) adjusted, but results did not improve significantly across the data vectors. The dwe vectors slightly outperformed the w2v baseline sets. All models were trained with a varying number of epochs: 50, in this test of word similarities. However, it should be noted that 174 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
(SCIPY 2022) Time w2v Score dwe Score Difference w2v dwe Difference Bucket (MEN) (MEN) (MEN) Score Score (SL) (SL) (SL) 0 0.437816 0.567757 0.129941 0.136146 0.169702 0.033556 1 0.421271 0.561996 0.140724 0.131751 0.167809 0.036058 2 0.481644 0.554162 0.072518 0.113067 0.165794 0.052727 3 0.449981 0.543395 0.093413 0.137704 0.163349 0.025645 4 0.360462 0.532634 0.172172 0.169419 0.158774 -0.010645 5 0.353343 0.521376 0.168032 0.133773 0.157173 0.023400 6 0.365653 0.511323 0.145669 0.173503 0.154299 -0.019204 7 0.358100 0.502065 0.143965 0.196332 0.152701 -0.043631 8 0.380266 0.497222 0.116955 0.152287 0.154338 .002051 9 0.405048 0.496563 0.091514 0.149980 0.148919 -0.001061 10 0.403719 0.499463 0.095744 0.145412 0.142114 -0.003298 11 0.381033 0.504986 0.123952 0.181667 0.141901 -0.039766 12 0.378455 0.511041 0.132586 0.159254 0.144187 -0.015067 13 0.391209 0.514521 0.123312 0.145519 0.147816 0.002297 14 0.405100 0.519095 0.113995 0.151422 0.152477 0.001055 15 0.419895 0.522854 0.102959 0.117026 0.154963 0.037937 16 0.400947 0.524462 0.123515 0.158833 0.157687 -0.001146 17 0.321936 0.525109 0.203172 0.170925 0.157068 -0.013857 Average 0.437816 0.567757 0.129941 0.151334 0.155059 0.003725 TABLE 1: Spearman’s correlation coefficients for w2v vectors and dynamic word embedding (dwe) vectors for all 18 temporal clusters against the SimLex word pair data set. Fig. 1: 2 Dimensional Representation of Embeddings from Time Bucket 0. TEMPORAL WORD EMBEDDINGS ANALYSIS FOR DISEASE PREVENTION 175 Fig. 2: 2 Dimensional Representation of Embeddings from Time Bucket 17. these Spearman’s coefficients are very low compared to baselines UMAP, can be seen in Figure 1 and Figure 2. Figure 1 represents such as in [WWC+ 19], where the average Spearman’s coefficient the embedding generated for the first time bucket, while Figure amongst common models was .38133 on this data set of words. 2 represents the embedding generated for the final time bucket. These models, however, were trained on corpus generated from These UMAP representations use cosine distance as their metric Wikipedia pages — wiki2010. The lower Spearman’s coefficients over Euclidian distance, leading to more dense clusters and more can likely be accounted to our corpus. In 2014-2017, when accurate representations of nearby terms within the embedding this corpus was generated, Twitter had a 140 character limit on space. The section of terms outlying from the main grouping tweets. The limited characters have been shown to affect user’s appears to be terms that do not appear often within that temporal language within their tweets [BTKSDZ19], possibly affecting our cluster itself, but may appear several times later in a temporal embeddings. Boot et al. show that Twitter increasing the character bucket. Figure 1 contains a zoomed in view of this outlying group, limit to 280 characters in 2017 impacted the language within the as well as a subgrouping on the outskirts of the main group, tweets. As we test this pipeline on more Twitter data from various containing food related terms. The majority of these terms are time intervals, the character increase in 2017 is something to keep ones that would likely be hashtagged frequently during a brief time in mind. period within one temporal bucket. 
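The word-pair baselines were scored by comparing cosine similarities of the model vectors to the human judgments with Spearman's rank correlation; a sketch is given below, with the parsing of the SimLex-999 or MEN files left out.

```python
# Sketch: Spearman's rank correlation between model cosine similarities and
# human similarity scores for a word-pair benchmark (SimLex-999 or MEN).
import numpy as np
from scipy.stats import spearmanr

def evaluate_pairs(embeddings, word2id, pairs):
    """pairs: iterable of (word1, word2, human_score) tuples."""
    model_sims, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 not in word2id or w2 not in word2id:
            continue                                   # omit pairs missing from the vocabulary
        u, v = embeddings[word2id[w1]], embeddings[word2id[w2]]
        model_sims.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        human_scores.append(score)
    rho, _ = spearmanr(model_sims, human_scores)
    return rho
```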
These terms are still relevant The second source of baseline was the MEN Test Collection, to study, as hashtagged terms that appear frequently for a brief containing 3,000 pairs with similarity scores of 0-50, with 50 period of time are most likely extremely attached to an ongoing being the most similar [BTB14]. Following the same methodology event. In future iterations, the length of each temporal bucket will for assessing the strength of embeddings as we did for the be decreased, hopefully giving more temporal buckets access to SimLex-999 set, the Spearman’s coefficients from this set yielded terms that only appear within one currently. much better results than from the SimLex-999 set. The average of the Spearman’s coefficients, across all 18 temporal buckets, K-Means Clustering Results was .39532 for the w2v embeddings and .52278 for the dwe The results of the k-means clustering can be seen below in embeddings. The dwe significantly outperformed the w2v baseline Figures 4 and 5. Figure 4 shows the results of k-means clustering on this set, but still did not reach the average correlation of with the corresponding 2 dimensional UMAP positions generated .7306 that other common models achieved in the baseline tests from the 10 dimensional vector that were used as features in in [WWC+ 19]. the clustering. Figure 5 shows the results of k-means clustering Two dimensional representations of embeddings, generated by with the corresponding 2 dimensional UMAP representation of the 176 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Cluster All Words HIV Terms Difference 0 0.173498 0.287048 0.113549 1 0.231063 0.238876 0.007814 2 0.220039 0.205600 -0.014440 3 0.023933 0.000283 -0.023651 4 0.108078 0.105581 -0.002498 5 0.096149 0.084276 -0.011873 6 0.023525 0.031391 0.007866 7 0.123714 0.046946 -0.076768 TABLE 2: Distribution of HIV terms and all terms within k-means clusters Fig. 4: Results of k-means clustering shown over the 2 dimensional UMAP representation of the 10 dimensional embeddings. Fig. 3: Bar graph showing k-means clustering distribution of HIV terms against all terms. entire data set used in clustering. The k-means clustering revealed semantic shifts of HIV related terms being clustered with higher incidence than other terms in one cluster. Incidence rates for all terms and HIV terms in each cluster can be seen in Table 2 and Fig. 5: Results of k-means clustering shown over the 2 dimensional Figure 3. This increased incidence rate of HIV related terms in UMAP representation of the full data set. certain clusters leads us to hypothesize that semantic shifts of terms in future datasets can be clustered using the same k-means model, and analyzed to search for outbreaks. Clustering of terms and .1 for the mixed split in both sets. The difference in certainty in future data sets can be compared to these clustering results, and thresholds was due to any mixed term data set having an extremely similarities between the data can be recognized. large number of false positives on .01, but more reasonable results on .1. Neural Network Results These results show that classification of terms surrounding Neural network models we generated showed promising results the Scott County HIV outbreak is achievable, but the model will on classification of HIV related terms. The goal of the models need to be refined on more data. It can be seen that the mixed was to identify and discover terms surrounding the HIV outbreak. 
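A sketch of the classifier described above, i.e. three ReLU layers of 128, 256, and 256 neurons followed by a single sigmoid output, built with Keras; the optimizer, loss, and the commented usage lines are assumptions beyond the stated architecture.

```python
# Sketch of the binary classifier for HIV-related terms built with Keras.
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_model(n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical usage: X is the (255,000 x 201) or (255,000 x 121) feature matrix,
# y the HIV-term indicator; a random row split corresponds to the "mixed" setting.
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
# model = build_model(X.shape[1])
# model.fit(X_tr, y_tr, epochs=50, validation_data=(X_te, y_te))
# flagged = model.predict(X_te) > 0.1   # certainty threshold used for the mixed split
```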
Neural Network Results

Neural network models we generated showed promising results on classification of HIV related terms. The goal of the models was to identify and discover terms surrounding the HIV outbreak. Therefore we were not concerned about the rate of false positive terms. False positive terms likely had semantic shifts very similar to the HIV related terms, and therefore can be related to the outbreak. These terms can be labeled as potentially HIV related while studying future data sets, which can aid in identifying whether an outbreak is ongoing during the time the tweets in the corpus were tweeted. We looked for a balance of finding false positive terms without lowering our certainty threshold to include too many terms. Results on the testing data for the data_201 set can be seen in Table 3, and results on the testing data for the data_121 set can be seen in Table 4. The certainty threshold for the unmixed split in both sets was .01, and .1 for the mixed split in both sets. The difference in certainty thresholds was due to any mixed term data set having an extremely large number of false positives at .01, but more reasonable results at .1.

These results show that classification of terms surrounding the Scott County HIV outbreak is achievable, but the model will need to be refined on more data. It can be seen that the mixed term split of data led to a high rate of true positives; however, it quickly became much more specific to terms outside of our target class at higher epochs, with false positives dropping to lower rates. Additionally, accuracy on data_201 begins to increase between the 150 and 200 epoch models for the unmixed split, so even higher epoch models might improve results further for the unmixed split. Outliers, such as the true positives in data_121 with 100 epochs without k-means labels, can be explained by the certainty threshold. If the certainty threshold were .05 for that model, there would have been 86 true positives and 1,129 false positives. A precise certainty threshold can be found as we test this model on other HIV related data sets and control data sets. With enough experimentation and data, a set can be run through our pipeline and a certainty of there being a potential HIV outbreak in the region the data originated from can be generated by a future model.

                With K-Means Label                                     Without K-Means Label
Epochs     Accuracy  Precision  Recall   TP    FP     TN     FN      Accuracy  Precision  Recall   TP    FP     TN     FN
50         0.9589    0.0513     0.0041   8     148    48897  1947    0.9571    0.1538     0.0266   52    286    48759  1903
100        0.9589    0.0824     0.0072   14    156    48889  1941    0.9608    0.0893     0.0026   5     51     48994  1950
150        0.6915    0.0535     0.4220   825   14602  34443  1130    0.7187    0.0451     0.3141   614   13006  36039  1341
200        0.7397    0.0388     0.2435   476   11797  37248  1479    0.7566    0.0399     0.2317   453   10912  38133  1502
50Mix      0.9881    0.9107     0.7967   1724  169    48667  440     0.9811    0.9417     0.5901   1277  79     48757  887
100Mix     0.9814    0.9418     0.5980   1294  80     48756  870     0.9823    0.9090     0.6465   1399  140    48696  765
150Mix     0.9798    0.9595     0.5471   1184  50     48786  980     0.9752    0.9934     0.4191   907   6      48830  1257
200Mix     0.9736    0.9846     0.3835   830   13     48823  1334    0.9770    0.9834     0.4658   1008  17     48819  1156

TABLE 3: Results of the neural network run on the data_201 set. The epochs column shows the number of training epochs for the models, as well as whether the words were mixed between the training and testing data, denoted by "Mix".

                With K-Means Label                                     Without K-Means Label
Epochs     Accuracy  Precision  Recall   TP    FP     TN     FN      Accuracy  Precision  Recall   TP    FP     TN     FN
50         0.9049    0.0461     0.0752   147   3041   46004  1808    0.9350    0.0652     0.0522   102   1463   47582  1853
100        0.9555    0.1133     0.0235   46    360    48685  1909    0.8251    0.0834     0.3565   697   7663   41382  1258
150        0.9554    0.0897     0.0179   35    355    48690  1920    0.9572    0.0957     0.0138   27    255    48790  1928
200        0.9496    0.0335     0.0113   22    635    48410  1933    0.9525    0.0906     0.0266   52    522    48523  1903
50Mix      0.9285    0.2973     0.5018   1086  2567   46269  1078    0.9487    0.4062     0.4501   974   1424   47412  1190
100Mix     0.9475    0.3949     0.4464   966   1480   47356  1198    0.9492    0.4192     0.5134   1111  1539   47297  1053
150Mix     0.9344    0.3112     0.4496   973   2154   46682  1191    0.9514    0.4291     0.4390   950   1264   47572  1214
200Mix     0.9449    0.3779     0.4635   1003  1651   47185  1161    0.9500    0.4156     0.4395   951   1337   47499  1213

TABLE 4: Results of the neural network run on the data_121 set. The epochs column shows the number of training epochs for the models, as well as whether the words were mixed between the training and testing data, denoted by "Mix".
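The aggregate metrics in Tables 3 and 4 can be derived from the raw confusion counts; the helper below shows one way to apply a certainty threshold to predicted probabilities and compute accuracy, precision, and recall. The names and the exact thresholding scheme are illustrative assumptions based on the description above, not the project's actual evaluation code.

```python
import numpy as np

def evaluate_with_threshold(probabilities, labels, threshold=0.01):
    """Score a classifier whose positive calls are gated by a certainty threshold.

    probabilities: (n_terms,) numpy array of predicted probabilities that a
        term is HIV related.
    labels: (n_terms,) numpy array of ground-truth 0/1 labels.
    threshold: minimum certainty required to flag a term as HIV related.
    """
    predictions = probabilities >= threshold

    tp = int(np.sum(predictions & (labels == 1)))
    fp = int(np.sum(predictions & (labels == 0)))
    tn = int(np.sum(~predictions & (labels == 0)))
    fn = int(np.sum(~predictions & (labels == 1)))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "TP": tp, "FP": fp, "TN": tn, "FN": fn}
```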
Conclusion

Our results are promising, with high accuracy and decent recall on classification of HIV/AIDS related terms, as well as potentially discovering new terms related to the outbreak. Given more HIV related data sets and control data sets, we could begin examining and generating thresholds of what might be indicative of an outbreak. To improve results, metrics for our word2vec baseline model and statistical analysis could be further explored, as could the previously mentioned noise and biases in our data. Additionally, sparsity of data in earlier temporal buckets may lead to some loss of accuracy. Fine tuning hyperparameters of the alignment model through grid searching would likely further improve these results. We predict that given more data sets containing tweets from areas and times that had HIV/AIDS outbreaks similar to Scott County, as well as control data sets that are not directly related to an HIV outbreak, we could determine a threshold of words that would define a county as potentially undergoing an HIV outbreak. With a refined pipeline and model such as this, we hope to be able to begin biosurveillance to try to prevent future outbreaks.

Future Work

Case studies of previous datasets related to other diseases and collection of more modern tweets could not only provide critical insight into relevant medical activity, but also further strengthen and expand our model and its credibility. There is a large source of data potentially related to HIV/AIDS on Twitter, so finding and collecting this data would be a crucial first step. One potent example could be data from the 220 United States counties determined by the CDC to be vulnerable to HIV and/or viral hepatitis outbreaks due to injection drug use, similar to the outbreak that occurred in Scott County [VHRH+16]. Our next data set to be studied is tweets from Cabell County, West Virginia, from January of 2018 through 2020. During this time an HIV outbreak similar to the one that took place in Scott County in 2014 occurred [AMK20]. The end goal is to create a pipeline that can perform live semantic shift analysis at set intervals of time within these counties and classify these shifts as they happen. A future model can predict whether or not the number of terms classified as HIV related is indicative of an outbreak. If enough terms classified by our model as potentially indicative of an outbreak are detected, or if this future model predicts a possible outbreak, public health officials can be notified and the severity of a possible outbreak can be mitigated if properly handled.

Expansion into other social media platforms would increase the variety of data our model has access to, and therefore what our model is able to respond to. With the foundational model established, we will be able to focus on converting the data and addressing the differences between social networks (e.g. audience and online etiquette). Reddit and Instagram are two points of interest due to their increasing prevalence, as well as the vastness of available data.
An idea for future implementation following the generation of a generalized model would be creating a web application. The ideal audience would be medical officials and organizations, but even public or research use for trend prediction could be potent. The application would give users the ability to pick from a given glossary of medical terms, defining their own set of significant words to run our model on. Our model would then expose any potential trends or insights for the given terms in contemporary data, allowing for quicker responses to activity. Customization of the data pool could also be a feature, where tweets and other social media posts are filtered to specified geographic regions or time windows, yielding more specific results.

Additionally, we would like to reassess our embedding model to try to improve the embeddings generated and our understanding of the semantic shifts. This project has been ongoing for several years, and new models, such as the use of bidirectional encoders, as in BERT [DCLT18], have proven to have high performance. BERT based models have also been used for temporal embedding studies, such as in [LMD+19], a study focused on clinical corpora. We predict that updating our pipeline to match more modern methodology can lead to more effective disease detection.

REFERENCES

[Aff05] Veteran Affairs. Glossary of HIV/AIDS terms: Veterans Affairs, Dec 2005. URL: https://www.hiv.va.gov/provider/glossary/index.asp.

[AMK20] A. Atkins, R. P. McClung, and M. Kilkenny. Notes from the field: Outbreak of Human Immunodeficiency Virus infection among persons who inject drugs — Cabell County, West Virginia, 2018–2019. Morbidity and Mortality Weekly Report, 69(16):499–500, 2020. doi:10.15585/mmwr.mm6916a2.

[BTB14] Elia Bruni, Nam Khanh Tran, and Marco Baroni. Multimodal distributional semantics. J. Artif. Int. Res., 49(1):1–47, 2014. doi:10.1613/jair.4135.

[BTKSDZ19] Arnout Boot, Erik Tjon Kim Sang, Katinka Dijkstra, and Rolf Zwaan. How character limit affects language usage in tweets. Palgrave Communications, 5(76), 2019. doi:10.1057/s41599-019-0280-3.

[PPH+16] Philip J. Peters, Pamela Pontones, Karen W. Hoover, Monita R. Patel, Romeo R. Galang, Jessica Shields, Sara J. Blosser, Michael W. Spiller, Brittany Combs, William M. Switzer, et al. HIV infection linked to injection use of Oxymorphone in Indiana, 2014–2015. New England Journal of Medicine, 375(3):229–239, 2016. doi:10.1056/NEJMoa1515195.

[VHRH+16] Michelle M. Van Handel, Charles E. Rose, Elaine J. Hallisey, Jessica L. Kolling, Jon E. Zibbell, Brian Lewis, Michele K. Bohm, Christopher M. Jones, Barry E. Flanagan, Azfar-E-Alam Siddiqi, et al. County-level vulnerability assessment for rapid dissemination of HIV or HCV infections among persons who inject drugs, United States. JAIDS Journal of Acquired Immune Deficiency Syndromes, 73(3):323–331, 2016. doi:10.1097/qai.0000000000001098.

[WWC+19] Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, and C.-C. Jay Kuo. Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8(1), 2019. doi:10.1017/atsip.2019.12.

[YSD+18] Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pages 673–681, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3159652.3159703.
[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transform- ers for language understanding, 2018. doi:10.18653/v1/ N19-1423. [GC18] Gregg S Gonsalves and Forrest W Crawford. Dynamics of the HIV outbreak and response in Scott County, IN, USA, 2011–15: A modelling study. The Lancet HIV, 5(10), 2018. URL: https://pubmed.ncbi.nlm.nih.gov/30220531/. [Gol17] Nicholas J. Golding. The needle and the damage done: In- diana’s response to the 2015 HIV epidemic and the need to change state and federal policies regarding needle exchanges and intravenous drug users. Indiana Health Law Review, 14(2):173, 2017. doi:10.18060/3911.0038. [HLJ16] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Di- achronic word embeddings reveal statistical laws of seman- tic change. CoRR, abs/1605.09096, 2016. arXiv:1605. 09096, doi:10.48550/arXiv.1605.09096. [HRK15] Felix Hill, Roi Reichart, and Anna Korhonen. SimLex- 999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015. doi:10.1162/COLI_a_00237. [LMD+ 19] Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Savova Guergana. A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction. In Proceedings of the 2nd Clinical Natural Language Process- ing Workshop, pages 65–71. Association for Computational Linguistics, 2019. doi:10.18653/v1/W19-1908. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013. doi:10.48550/ARXIV.1301.3781. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 179 Design of a Scientific Data Analysis Support Platform Nathan Martindale‡∗ , Jason Hite‡ , Scott Stewart‡ , Mark Adams‡ F Abstract—Software data analytic workflows are a critical aspect of modern Fundamentally, science revolves around the ability for others scientific research and play a crucial role in testing scientific hypotheses. A to repeat and reproduce prior published works, and this has typical scientific data analysis life cycle in a research project must include become a difficult task with many computation-based studies. several steps that may not be fundamental to testing the hypothesis, but are Often, scientists outside of a computer science field may not have essential for reproducibility. This includes tasks that have analogs to software training in software engineering best practices, or they may simply engineering practices such as versioning code, sharing code among research team members, maintaining a structured codebase, and tracking associated disregard them because the focus of a researcher is on scientific resources such as software environments. Tasks unique to scientific research publications rather than the analysis software itself. Lack of docu- include designing, implementing, and modifying code that tests a hypothesis. mentation and provenance of research artifacts and frequent failure This work refers to this code as an experiment, which is defined as a software to publish repositories for data and source code has led to a crisis analog to physical experiments. in reproducibility in artificial intelligence (AI) and other fields that A software experiment manager should support tracking and reproducing rely heavily on computation [SBB13], [DMR+ 09], [Hut18]. 
One individual experiment runs, organizing and presenting results, and storing and study showed that quantifiably few machine learning (ML) papers reloading intermediate data on long-running computations. A software experi- document specifics in how they ran their experiments [GGA18]. ment manager with these features would reduce the time a researcher spends This gap between established practices from the software engi- on tedious busywork and would enable more effective collaboration. This work discusses the necessary design features in more depth, some of the existing neering field and how computational research is conducted has software packages that support this workflow, and a custom developed open- been studied for some time, and the problems that can stem from source solution to address these needs. it are discussed at length in [Sto18]. To mitigate these issues, computation-based research requires Index Terms—reproducible research, experiment life cycle, data analysis sup- better infrastructure and tooling [Pen11] as well as applying port relevant software engineering principles [Sto18], [Dub05] to allow data scientists to ensure their work is effective, correct, and Introduction reproducible. In this paper we focus on the ability to manage re- producible workflows for scientific experiments and data analyses. Modern science increasingly uses software as a tool for conducting We discuss the features that software to support this might require, research and scientific data analyses. The growing number of compare some of the existing tools that address them, and finally libraries and frameworks facilitating this work has greatly low- present the open-source tool Curifactory which incorporates the ered the barrier to usage, allowing more researchers to benefit proposed design elements. from this paradigm. However, as a result of the dependence on software, there is a need for more thorough integration of sound software engineering practices with the scientific process. The Related Work fragility of complex environments containing heavily intercon- Reproducibility of AI experiments has been separated into three nected packages coupled with a lack of provenance of the artifacts different degrees [GK18]: Experiment reproduciblity, or repeata- generated throughout the development of an experiment increases bility, refers to using the same code implementation with the the potential for long-term problems, undetected bugs, and failure same data to obtain the same results. Data reproducibility, or to reproduce previous analyses. replicability, is when a different implementation with the same * Corresponding author: martindalena@ornl.gov data outputs the same results. Finally, method reproducibility ‡ Oak Ridge National Laboratory describes when a different implementation with different data is able to achieve consistent results. These degrees are discussed Copyright © 2022 Oak Ridge National Laboratory. This is an open-access article distributed under the terms of the Creative Commons Attribution in [GGA18], comparing the implications and trade-offs on the License, which permits unrestricted use, distribution, and reproduction in any amount of work for the original researcher versus an external medium, provided the original author and source are credited. researcher, and the degree of generality afforded by a reproduced Notice: This manuscript has been authored by UT-Battelle, LLC, under implementation. 
A repeatable experiment places the greatest bur- contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for pub- den on the original researcher, requiring the full codebase and lication, acknowledges that the US government retains a nonexclusive, paid-up, experiment to be sufficiently documented and published so that irrevocable, worldwide license to publish or reproduce the published form of a peer is able to correctly repeat it. At the other end of the this manuscript, or allow others to do so, for US government purposes. DOE spectrum, method reproducibility demands the greatest burden will provide public access to these results of federally sponsored research in ac- cordance with the DOE Public Access Plan (http://energy.gov/downloads/doe- on the external researcher, as they must implement and run the public-access-plan). experiment from scratch. For the remainder of this paper, we refer 180 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) to "reproducibility" as experiment reproducibility (repeatability). ifications to take full advantage of all features. This can entail Tooling that is able to assist with documentation and organization a significant learning curve and places additional burden on of a published experiment reduces the amount of work for the the researcher. To address this, some sources propose automatic original researcher and still allows for the lowest level of burden documentation of experiments and code through static source code to external researchers to verify and extend previous work. analysis [NFP+ 20], [Red19]. In an effort to encourage better reproducibility based on Beyond the preexisting body of knowledge about software datasets, the Findable, Accessible, Interoperable, and Reusable engineering principles, other works [SNTH13], [KHS09] de- (FAIR) data principles [WDA+ 16] were established. These prin- scribe recommended rules and practices to follow when conduct- ciples recommend that data should have unique and persistent ing computation-based research. These include avoiding manual identifiers, use common standards, and provide rich metadata data manipulation in favor of scripted changes, keeping detailed description and provenance, allowing both humans and machines records of how results are produced (manual provenance), tracking to effectively parse them. These principles have been extended the versions of libraries and programs used, and tracking random more broadly to software [LGK+ 20], computational workflows seeds. Many of these ideas can be assisted or encapsulated through [GCS+ 20], and to entire data pipelines [MLC+ 21]. appropriate infrastructure decisions, which is the premise on Various works have surveyed software engineering practices which this work bases its software reviews. and identified practices that provide value in scientific computing Although this paper focuses on the scientific workflow, a contexts, including various forms of unit and regression testing, growing related field tackles many of the same issues from proper source control usage, formal verification, bug tracking, an industry standpoint: machine learning operations (MLOps) and agile development methods [Sto18], [Dub05]. In particular, [Goy20]. 
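Several of the record-keeping practices recommended above (tracking library versions, keeping provenance of how results were produced, recording random seeds) can be automated with very little code. The sketch below dumps a handful of such provenance details to a JSON file alongside a run; it is a generic illustration of the practice, not the mechanism of any particular tool reviewed later in this paper.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def capture_run_metadata(random_seed, path="run_metadata.json"):
    """Record basic provenance information for the current run."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(sys.argv),          # how the run was invoked
        "python_version": sys.version,
        "platform": platform.platform(),
        "random_seed": random_seed,
        # Current commit of the codebase, if running inside a Git repository.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True).stdout.strip(),
        # Exact versions of every installed package in the environment.
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True).stdout.splitlines(),
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```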
MLOps, an ML-oriented version of DevOps, is con- [Sto18] described many concepts from agile development as being cerned with supporting an entire data science life cycle, from data well suited to an experimental context, where the current knowl- acquisition to deployment of a production model. Many of the edge and goals may be fairly dynamic throughout the project. They same challenges are present, reproducibility and provenance are noted that although many of these techniques could be directly crucial in both production and research workflows [RMRO21]. applied, some required adaptation to make sense in the scientific Infrastructure, tools, and practices developed for MLOps may also software domain. hold value in the scientific community. Similar to this paper, two other works [DGST09], [WWG21] A taxonomy for ML tools that we reference throughout this discuss sets of design aspects and features that a workflow work is from [QCL21], which describes a characterization of tools manager would need. Deelman et al. describe the life cycle of consisting of three primary categories: general, analysis support, a workflow as composition, mapping, execution, and provenance and reproducibility support, each of which is further subdivided capture [DGST09]. A workflow manager must then support each into aspects to describe a tool. For example, these subaspects of these aspects. Composition is how the workflow is constructed, include data visualization, web dashboard capabilities, experiment such as through a graphical interface or with a text configuration logging, and the interaction modes the tool supports, such as a file. Mapping and execution are determining the resources to be command line interface (CLI) or application programming inter- used for a workflow and then utilizing those resources to run it, face (API). including distributing to cloud compute and external representa- tional state transfer (REST) services. This also refers to scheduling Design Features subworkflows/tasks to reuse intermediate artifacts as available. We combine the two sets of capabilities from [DGST09] and Provenance, which is crucial for enabling repeatability, is how all [WWG21] with the taxonomy from [QCL21] to propose a set artifacts, library versions, and other relevant metadata are tracked of six design features that are important for an experiment during the execution of a workflow. manager. These include orchestration, parameterization, caching, Wratten, Wilm, and Göke surveyed many bioinformatics pi- reproducibility, reporting, and scalability. The crossover between pline and workflow management tools, listing the challenges that these proposed feature sets are shown in Table 1. We expand on tooling should address: data provenance, portability, scalability, each of these in more depth in the subsections below. and re-entrancy [WWG21]. Provenance is defined the same way as in [DGST09], and further states the need for generating Orchestration reports that include the tracking information and metadata for Orchestration of an experiment refers to the mechanisms used the associated experiment run. Portability—allowing set up and to chain and compose a sequence of smaller logical steps into execution of an experiment in a different environment—can be an overarching pipeline. This provides a higher-level view of an a challenge because of the dependency requirements of a given experiment and helps abstract away some of the implementation system and the ease with which the environment can be specified details. 
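As a minimal illustration of this orchestration idea, the sketch below chains a few small, reusable step functions into a simple linear pipeline over a shared state; real workflow managers express the same composition as a DAG with explicit inputs and outputs, and the step names here are placeholders rather than any tool's API.

```python
def load_data(state):
    # Each step reads what it needs from the shared state and adds its output.
    state["data"] = [4, 1, None, 3, 2]
    return state

def clean_data(state):
    state["clean"] = [x for x in state["data"] if x is not None]
    return state

def train_model(state):
    # Stand-in "model": just an aggregate of the cleaned data.
    state["model"] = sum(state["clean"]) / len(state["clean"])
    return state

def run_pipeline(steps):
    """Compose steps into a pipeline, passing the accumulated state along."""
    state = {}
    for step in steps:
        state = step(state)
    return state

results = run_pipeline([load_data, clean_data, train_model])
print(results["model"])
```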
Operation of most workflow managers is based on a and reinitialized on a different machine or operating system. directed acyclic graph (DAG), which specifies the stages/steps as Scalability is important especially when large scale data, many nodes and the edges connecting them as their respective inputs and compute-heavy steps, or both are involved throughout the work- outputs. The intent with orchestration is to encourage designing flow. Scalability in a manager involves allowing execution on a distinct, reusable steps that can easily be composed in different high-performance computing (HPC) system or with some form of ways to support testing different hypotheses or overarching ex- parallel compute. Finally they mention re-entrancy, or the ability periment runs. This allows greater focus on the design of the to resume execution of a compute step from where it last stopped, experiments than the implementation of the underlying functions preventing unnecessary recomputation of prior steps. that the experiments consist of. As discussed in the taxonomy One area of the literature that needs further discussion is [QCL21], pipeline creation can consist of a combination of scripts, the design of automated provenance tracking systems. Existing configuration files, or a visual tool. This aspect falls within the workflow management tools generally require source code mod- composition capability discussed in [DGST09]. DESIGN OF A SCIENTIFIC DATA ANALYSIS SUPPORT PLATFORM 181 This work [DGST09] [WWG21] Taxonomy [QCL21] Orchestration Composition — Reproducibility/pipeline creation Parameterization — — — Caching — Re-entrancy — Reproducibility Provenance Provenance, portability Reproducibility Reporting — — Analysis/visualization, web dashboard Scalability Mapping, execution Scalability Analysis/computational resources TABLE 1: Comparing design features listed in various works. Parameterization Reproducibility Parameterization specifies how a compute pipeline is customized Mechanisms for reproducibility are one of the most important fea- for a particular run by passing in configuration values to change tures for a successful data analysis support platform. Reproducibil- aspects of the experiment. The ability to customize analysis code ity is challenging because of the complexity of constantly evolving is crucial to conducting a compute-based experiment, providing a codebases, complicated and changing dependency graphs, and mechanism to manipulate a variable under test to verify or reject inconsistent hardware and environments. Reproducibility entails a hypothesis. two subcomponents: provenance and portability. This falls under Conventionally, parameterization is done either through spec- the provenance aspect from [DGST09], both data provenance and ifying parameters in a CLI call or by passing configuration files portability from [WWG21], and the entire reproducibility support in a format like JSON or YAML. As discussed in [DGST09], section of the taxonomy [QCL21]. parameterization sometimes consists of more complicated needs, such as conducting parameter sweeps or grid searches. There are Data provenance is about tracking the history, configuration, libraries dedicated to managing parameter searches like this, such and steps taken to produce an intermediate or final data artifact. as hyperopt [BYC13] used in [RMRO21]. 
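As a concrete illustration of this kind of parameterization, the sketch below expands a small base configuration plus a sweep specification into a full grid of parameter sets, one per experiment variant. It is a generic example of the pattern (the kind of thing a dedicated library automates), and the configuration keys are assumptions made for the example.

```python
import itertools
import json

# A base configuration plus the values to sweep over; in practice these
# might be loaded from a JSON or YAML file passed on the command line.
base_config = {"dataset": "corpus_2017", "train_test_ratio": 0.8}
sweep = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_layers": [(50,), (100,), (100, 50)],
}

def expand_grid(base, sweep):
    """Yield one complete parameter set per point in the sweep grid."""
    keys = list(sweep)
    for values in itertools.product(*(sweep[k] for k in keys)):
        params = dict(base)
        params.update(dict(zip(keys, values)))
        yield params

for i, params in enumerate(expand_grid(base_config, sweep)):
    # Recording each rendered parameter set alongside its run is a cheap way
    # to keep provenance of exactly which variant was executed.
    print(f"variant_{i}:", json.dumps(params))
```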
In ML this would include the cleaning/munging steps used and the intermediate tables created in the process, but provenance can Although not provided as a design capability in the other apply more broadly to any type of artifact an experiment may works, we claim the mechanisms provided for parameterization produce, such as ML models themselves, or "model provenance" are important, as these mechanisms are the primary way to con- [SH18]. Applying provenance beyond just data is critical, as figure, modify, and vary experiment execution without explicitly models may be sensitive to the specific sets of training data and changing the code itself or modifying hard-coded values. This conditions used to produce them [Hut18]. This means that every- means that a recorded parameter set can better "describe" an thing required to directly and exactly reproduce a given artifact experiment run, increasing provenance and making it easier for is recorded, such as the manipulations applied to its predecessors another researcher to understand what pieces of an experiment and all hyperparameters used within those manipulations. can be readily changed and explored. Some support is provided for this in [DGST09], stating that Portability refers to the ability to take an experiment and the necessity of running many slight variations on workflows execute it outside of the initial computing environment it was sometimes leads to the creation of ad hoc scripts to generate the created in [WWG21]. This can be a challenge if all software variants, which leads to increased complexity in the organization dependency versions are not strictly defined, or when some de- of the codebase. Improved mechanisms to parameterize the same pendencies may not be available in all environments. Minimally, workflow for many variants helps to manage this complexity. allowing portability requires keeping explicit track of all packages and the versions used. A 2017 study [OBA17] found that even this minimal step is rarely taken. Another mechanism to support Caching portability is the use of containerization, such as with Docker or Refining experiment code and finding bugs is often a lengthy Podman [SH18]. iterative process, and removing the friction of constantly rerunning all intermediate steps every time an experiment is wrong can improve efficiency. Caching values between each step of an Reporting experiment allows execution to resume at a certain spot in the pipeline, rather than starting from scratch every time. This is Reporting is an important step for analyzing the results of an defined as re-entrancy in [WWG21]. experiment, through visualizations, summaries, comparisons of In addition to increasing the speed of rerunning experiments results, or combinations thereof. As a design capability, reporting and running new experiments that combine old results for analysis, refers to the mechanisms available for the system to export or caching is useful to help find and debug mistakes throughout retrieve these results for human analysis. Although data visu- an experiment. Cached outputs from each step allow manual alization and analysis can be done manually by the scientist, interrogation outside of the experiment. For example, if a cleaning tools to assist with making these steps easier and to keep results step was implemented incorrectly and a user noticed an invalid organized are valuable from a project management standpoint. 
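To make the caching idea concrete, here is a minimal sketch of stage-level memoization to disk: if a cached artifact already exists for a given stage name and parameter hash, it is reloaded instead of recomputed. This is a generic illustration of re-entrancy, not the caching implementation of any of the tools reviewed below, and the directory layout is an assumption.

```python
import functools
import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")

def cached_stage(stage_name):
    """Decorator that short-circuits a stage if a cached output exists."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(params, *args, **kwargs):
            CACHE_DIR.mkdir(exist_ok=True)
            # Key the cache on the stage name and a hash of the parameters.
            key = hashlib.sha1(
                json.dumps(params, sort_keys=True, default=str).encode()
            ).hexdigest()[:12]
            path = CACHE_DIR / f"{stage_name}_{key}.pkl"
            if path.exists():
                with open(path, "rb") as f:
                    return pickle.load(f)   # re-entrant: reuse the prior result
            result = func(params, *args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@cached_stage("clean_data")
def clean_data(params, raw_rows):
    # ... expensive cleaning logic would go here ...
    return [row for row in raw_rows if row is not None]
```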
value in an output data table, they could use a notebook to load Mechanisms for this might include a web interface for exploring and manipulate the intermediate artifact tables for that data to individual or multiple runs. Under the taxonomy [QCL21], this determine what stage introduced the error and what code should falls primarily within analysis support, such as data visualization be used to correctly fix it. or a web dashboard. 182 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Scalability container. MLFlow then ensures that the environment is set up and Many data analytic problems require large amounts of space active before running. The CLI even allows directly specifying a and compute resources, often beyond what can be handled on GitHub link to an mlflow-enabled project to download, set up, and an individual machine. To efficiently support running a large then run the associated experiment. For reporting, the MLFlow experiment, mechanisms for scaling execution are important and tracking UI lets the user view and compare various runs and their could include anything from supporting parallel computation on associated artifacts through a web dashboard. For scalability, both an experiment or stage level, to allowing the execution of jobs on distributed storage for saving/loading artifacts as well as execution remote machines or within an HPC context. This falls within both of runs on distributed clusters is supported. mapping and execution from [DGST09], the scalability aspect Sacred from [WWG21], and the computational resources category within the taxonomy [QCL21]. Sacred [GKC+ 17] is a Python library and CLI tool to help organize and reproduce experiments. Orchestration is managed through the use of Python decorators, a "main" for experiment Existing Tools entry point functions and "capture" for parameterizable functions, A wide range of pipeline and workflow tools have been devel- where function arguments are automatically populated from the oped to support many of these design features, and some of the active configuration when called. Parameterization is done directly more common examples include DVC [KPP+ 22] and MLFlow in Python through applying a config decorator to a function that [MLf22]. We briefly survey and analyze a small sample of these assigns variables. Configurations can also be written to or read tools to demonstrate the diversity of ideas and their applicability in from JSON and YAML files, so parameters must be simple types. different situations. Table 2 compares the support of each design Different observers can be specified to automatically track much feature by each tool. of the metadata, environment information, and current parameters, and within the code the user can specify additional artifacts and DVC resources to track during the run. Each run will store the requested DVC [KPP+ 22] is a Git-like version control tool for datasets. outputs, although there is no re-entrant use of these cached values. Orchestration is done by specifying stages, or runnable script Portability is supported through the ability to print the versions of commands, either in YAML or directly on the CLI. A stage is libraries needed to run a particular experiment. Reporting can be specified with output file paths and input file paths as dependen- done through a specific type of observer, and the user can provide cies, allowing an implicit pipeline or DAG to form, representing all custom templated reports that are generated at the end of each run. the processing steps. 
Parameterization is done by defining within a YAML file what the possible parameters are, along with the default Kedro values. When running the DAG, parameters can be customized on Kedro [ABC+ 22] is another Python library/CLI tool for managing the CLI. Since inputs and outputs are file paths, caching and re- reproducible and modular experiments. Orchestration is particu- entrancy come for free, and DVC will intelligently determine if larly well done with "node" and "pipeline" abstractions, a node certain stages do not need to be re-computed. referring to a single compute step with defined inputs and outputs, A saved experiment or state is frozen into each commit, so and a pipeline implemented as an ordered list of nodes. Pipelines all parameters and artifacts are available at any point. No explicit can be composed and joined to create an overarching workflow. tracking of the environment (e.g., software versions and hardware Possible parameters are defined in a YAML file and either set info) is present, but this could be manually included by tracking it in other parameter files or configured on the CLI. Similar to in a separate file. Reporting can be done by specifying per-stage MLFlow, while tracking outputs are cached, there’s no automatic metrics to track in the YAML configuration. The CLI includes a mechanism for re-entrancy. Provenance is achieved by storing way to generate HTML files on the fly to render requested plots. user-specified metrics and tracked datasets for each run, and it There is also an external "Iterative Studio" project, which provides has a few different mechanisms for portability. This includes the a live web dashboard to view continually updating HTML reports ability to export an entire project into a Docker container. A from DVC. For scalability, parallel runs can be achieved by separate Kedro-Viz tool provides a web dashboard to show a map queuing an experiment multiple times in the CLI. of experiments, as well as showing each tracked experiment run and allowing comparison of metrics and outputs between them. MLFlow Projects can be deployed into several different cloud providers, MLFlow [MLf22] is a framework for managing the entire life such as Databricks and Dask clusters, allowing for several options cycle of an ML project, with an emphasis on scalability and de- for scalability. ployment. It has no specific mechanisms for orchestration, instead allowing the user to intersperse MLFlow API calls in an existing Curifactory codebase. Runnable scripts can be provided as entry points into Curifactory [MHSA22] is a Python API and CLI tool for organiz- a configuration YAML, along with the parameters that can be ing, tracking, reproducing, and exporting computational research provided to it. Parameters are changed through the CLI. Although experiments and data analysis workflows. It is intended primarily MLFlow has extensive capabilities for tracking artifacts, there are for smaller teams conducting research, rather than production- no automatic re-entrancy methods. Reproducibility is a strong fea- level or large-scale ML projects. Curifactory is available on ture, and provenance and portability are well supported. The track- GitHub1 with an open-source BSD-3-Clause license. Below, we ing module provides provenance by recording metadata such as the describe the mechanisms within Curifactory to support each of the Git commit, parameters, metrics, and any user-specified artifacts six capabilities, and compare it with the tools discussed above. in the code. 
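For readers unfamiliar with MLFlow's tracking module, a typical simplified usage pattern looks roughly like the following; the experiment, parameter, and metric names are placeholders meant only to indicate the flavor of the API described above, not a complete or prescriptive setup.

```python
import mlflow

# Group this execution under a named experiment; runs then appear in the
# tracking UI where they can be compared against one another.
mlflow.set_experiment("model-comparison")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 1e-3)      # configuration for this run
    mlflow.log_param("hidden_layers", "100,50")

    # ... training would happen here ...
    accuracy = 0.96

    mlflow.log_metric("accuracy", accuracy)      # results tracked across runs
    # Any existing file can be attached to the run as an artifact.
    mlflow.log_artifact("loss.png")
```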
Portability is done by allowing the environment for an entry point to be specified as a Conda environment or Docker 1. https://github.com/ORNL/curifactory DESIGN OF A SCIENTIFIC DATA ANALYSIS SUPPORT PLATFORM 183 Orchestration Parameterization Caching Provenance Portability Reporting Scalability DVC + + ++ + + + + MLFlow + * ++ ++ ++ ++ Sacred + ++ * ++ + + Kedro + + * + ++ ++ ++ Curifactory + ++ ++ ++ ++ + + TABLE 2: Supported design features in each tool. Note, + indicates that a feature is supported, ++ indicates very strong support, and * indicates tooling that supports caching artifacts as a provenance tool but does not provide a mechanism for automatically reloading cached values as a form of re-entrancy. @stage(inputs=["model"], outputs=["results"]) def test_model(record, model): # ... def run(argsets, manager): """An example experiment definition. The primary intent of an experiment is to run each set of arguments through the desired stages, in order to compare results at the end. """ for argset in argsets: # A record is the "pipeline state" # associated with each set of arguments. # Stages take and return a record, # automatically handling pushing and # pulling inputs and outputs from the # record state. record = Record(manager, argsets) test_model(train_model(load_data(record))) Parameterization Parameterization in Curifactory is done directly in Python scripts. The user defines a dataclass with the parameters they need throughout their various stages in order to customize the exper- iment, and they can then define parameter files that each return Fig. 1: Stages are composed into an experiment. one or more instances of this arguments class. All stages in an experiment are automatically given access to the current argument set in use while an experiment is running. Orchestration While configuration can also be done directly in Python in Curifactory provides several abstractions, the lowest level of which Sacred, Curifactory makes a different trade-off: A parameter file is a stage. A stage is a function that takes a defined set of input or get_params() function in Curifactory returns an array of variable names, a defined set of output variable names, and an one or more argument sets, and arguments can directly include optional set of caching strategies for the outputs. Stages are similar complex Python objects. Unlike Sacred, this means Curifactory to Kedro’s nodes but implemented with @stage() decorators on cannot directly translate back and forth from static configuration the target function rather than passing the target function to a files, but in exchange allows for grid searches to be defined directly node() call. One level up from a stage is an experiment: an and easily in a single parameter file, as well as allowing argument experiment describes the orchestration of these stages as shown in sets to be composed or even inherit from other argument set Figure 1, functionally chaining them together without needing to instances. Importantly, Curifactory can still encode representations explicitly manage what variables are passed between the stages. of arguments into JSON for provenance, but this is a one direc- tional transformation. @stage(inputs=None, outputs=["data"]) def load_data(record): This approach allows a great deal of flexibility, and is valuable # every stage has the currently active record in experiments where a large range of parameters need to be # passed to it, which contains the "state", or tested or there is significant repetition among parameter sets. 
# all previous output values associated with # the current argset, as defined in the For example, in an experiment testing different effects of model # Parameterization section training hyperparameters, there may be several parameter files # ... meant to vary only the arguments needed for model training while using the same base set of data cleaning arguments. Composing @stage(inputs=["data"], outputs=["model", "stats"]) def train_model(record, data): these parameter sets from a common imported set means that any # ... subsequent changes to the data cleaning arguments only need to 184 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) be modified in one place, rather than each individual parameter file. @dataclass class MyArgs(curifactory.ExperimentArgs): """Define the possible arguments needed in the stages.""" random_seed: int = 42 train_test_ratio: float = 0.8 layers: tuple = (100,) activation: str = "relu" def get_params(): """Define a simple grid search: return many arguments instances for testing.""" args = [] layer_sizes = [10, 20, 50, 100] for size in layer_sizes: args.append(MyArgs(name=f"network_{size}", layers=(size,))) return args Caching Curifactory supports per-stage caching, similar to memoization, through a set of easy-to-use caching strategies. When a stage executes, it uses the specified cache mechanism to store the stage outputs to disk, with a filename based on the experiment, stage, and a hash of the arguments. When the experiment is re-executed, if it finds an existing output on disk based on this name, it short- circuits the stage computation and simply reloads the previously cached files, allowing a form of re-entrancy. Adding this caching ability to a stage is done through simply providing the list of Fig. 2: Metadata block at the top of a report. caching strategies to the stage decorator, one for each output: @stage( inputs=["data"], default Dockerfile for this purpose, and running the experiment outputs=["training_set", "testing_set"], with the Docker flag creates an image that exposes a Jupyter cachers=[PandasCSVCacher]*2 notebook to repeat the run and keep the artifacts in memory, as ): def split_data(record, data): well as a file server pointing to the appropriate cache for manual # stage definition exploration and inspection. Directly reproducing the experiment can be done either through the exposed notebook or by running Reproducibility the Curifactory experiment command inside of the image. As mentioned before, reproducibility consists of tracking prove- nance and metadata of artifacts as well as providing a means to set Reporting up and repeat an experiment in a different compute environment. While Curifactory does not run a live web dashboard like MLFlow, To handle provenance, Curifactory automatically records metadata DVC’s Iterative Studio, and Kedro-viz, every experiment run for every experiment run executed, including a logfile of the outputs an HTML experiment report and updates a top-level index console output, current Git commit hash, argument sets used and HTML page linking to the new report, which can be browsed the rendered versions of those arguments, and the CLI command from a file manager or statically served if running from an used to start the run. The final reports from each run also include a external compute resource. Although simplistic, this reduces the graphical representation of the stage DAG, and shows each output dependencies and infrastructure needed to achieve a basic level artifact and what its cache file location is. 
of reporting, and produces stand-alone folders for consumption Curifactory has two mechanisms to fully track and export an outside of the original environment if needed. experiment run. The first is to execute a "full store" run, which Every report from Curifactory includes all relevant metadata creates a single exported folder containing all metadata mentioned mentioned above, including the machine host name, experiment above, along with a copy of every cache file created, the output sequential run number, Git commit hash, parameters, and com- run report (mentioned below), as well as a Python requirements.txt mand line string. Stage code can add user-defined objects to output and Conda environment dump, containing a list of all packages in in each report, such as tables, figures, and so on. Curifactory comes the environment and their respective versions. This run folder can with a default set of helpers for several basic types of output then be distributed. Reproducing from the folder consists of setting visualizations, including basic line plots, entire Matplotlib figures, up an environment based on the Conda/Python dependencies as and dataframes. needed, and running the experiment command using the exported The output report also contains a graphical representation of folder as the cache directory. the DAG for the experiment, rendered using Graphviz, and shows The second mechanism is a command to create a Docker con- the artifacts produced by each stage and the file path where they tainer that includes the environment, entire codebase, and artifact are cached. An example of some of the components of this report cache for a specific experiment run. Curifactory comes with a are rendered in figures 2, 3, 4, and 5. DESIGN OF A SCIENTIFIC DATA ANALYSIS SUPPORT PLATFORM 185 Conclusion The complexity in modern software, environments, and data ana- lytic approaches threaten the reproducibility and effectiveness of computation-based studies. This has been compounded by the lack of standardization in infrastructure tools and software engineering principles applied within scientific research domains. While many novel tools and systems are in development to address these shortcomings, several design critieria must be met, including the ability to easily compose and orchestrate experiments, parameter- ize them to manipulate variables under test, cache intermediate artifacts, record provenance of all artifacts and allow the software to port to other systems, produce output visualizations and reports for analysis, and scale execution to the resource requirements of the experiment. We developed Curifactory to address these criteria specifically for small research teams running Python based experiments. Fig. 3: User-defined objects to report ("reportables"). Acknowledgements The authors would like to acknowledge the US Department of Energy, National Nuclear Security Administration’s Office of De- fense Nuclear Nonproliferation Research and Development (NA- 22) for supporting this work. R EFERENCES [ABC+ 22] Sajid Alam, Lorena Bălan, Gabriel Comym, Yetunde Dada, Ivan Danov, Lim Hoang, Rashida Kanchwala, Jiri Klein, Antony Milne, Joel Schwarzmann, Merel Theisen, and Susanna Wong. Kedro. https://kedro.org/, March 2022. [BYC13] James Bergstra, Daniel Yamins, and David Cox. Making a Sci- Fig. 4: Graphviz rendering of experiment DAG. Each large colored ence of Model Search: Hyperparameter Optimization in Hundreds area represents a single record associated with a specific argset. 
White of Dimensions for Vision Architectures. In Proceedings of the ellipses are stages, and the blocks in between them are the input and 30th International Conference on Machine Learning, pages 115– output artifacts. 123. PMLR, February 2013. [DGST09] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An overview of workflow system features and capabilities. Future Generation Computer Systems, Scalability 25:524–540, May 2009. doi:10.1016/j.future.2008. 06.012. Curifactory has no integrated method of executing portions of jobs [DMR+ 09] David L. Donoho, Arian Maleki, Inam Ur Rahman, Morteza on external compute resources like Kedro and MLFlow, but it does Shahram, and Victoria Stodden. Reproducible Research in Com- allow local multi-process parallelization of parameter sets. When putational Harmonic Analysis. Computing in Science Engineer- an experiment run would entail executing a series of stages for ing, 11(1):8–18, January 2009. doi:10.1109/MCSE.2009. 15. each argument set in series, Curifactory can divide the collection [Dub05] P.F. Dubois. Maintaining correctness in scientific programs. of argument sets into one subcollection per process, and runs the Computing in Science Engineering, 7(3):80–85, May 2005. doi: experiment in parallel on each subcollection. By taking advantage 10.1109/MCSE.2005.54. [GCS 20] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, + of the caching mechanism, when all parallel runs complete, the Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, experiment reruns in a single process to aggregate all of the and Daniel Schober. FAIR Computational Workflows. Data precached values into a single report. Intelligence, 2(1-2):108–121, January 2020. doi:10.1162/ dint_a_00033. [GGA18] Odd Erik Gundersen, Yolanda Gil, and David W. Aha. On Repro- ducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications. AI Magazine, 39(3):56– 68, September 2018. doi:10.1609/aimag.v39i3.2816. [GK18] Odd Erik Gundersen and Sigbjørn Kjensmo. State of the Art: Reproducibility in Artificial Intelligence. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), April 2018. doi:10.1609/aaai.v32i1.11503. [GKC+ 17] Klaus Greff, Aaron Klein, Martin Chovanec, Frank Hutter, and Jürgen Schmidhuber. The Sacred Infrastructure for Computa- tional Research. In Proceedings of the 16th Python in Sci- ence Conference, pages 49–56, Austin, Texas, 2017. SciPy. doi:10.25080/shinma-7f4c6e7-008. [Goy20] A. Goyal. Machine learning operations, 2020. [Hut18] Matthew Hutson. Artificial intelligence faces reproducibility Fig. 5: Graphviz rendering of each record in more depth, showing crisis. Science, 359(6377):725–726, February 2018. doi: cache file paths and artifact data types. 10.1126/science.359.6377.725. 186 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [KHS09] Diane Kelly, Daniel Hook, and Rebecca Sanders. Five Rec- Hovig. Ten Simple Rules for Reproducible Computational Re- ommended Practices for Computational Scientists Who Write search. PLOS Computational Biology, 9(10):e1003285, October Software. Computing in Science Engineering, 11(5):48–53, 2013. doi:10.1371/journal.pcbi.1003285. September 2009. doi:10.1109/MCSE.2009.139. [Sto18] Tim Storer. Bridging the Chasm: A Survey of Software Engineer- [KPP+ 22] Ruslan Kuprieiev, Saugat Pachhai, Dmitry Petrov, Paweł ing Practice in Scientific Programming. 
ACM Computing Surveys, Redzyński, Casper da Costa-Luis, Peter Rowlands, Alexander 50(4):1–32, July 2018. doi:10.1145/3084225. Schepanovski, Ivan Shcheklein, Batuhan Taskaya, Jorge Orpinel, [WDA+ 16] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gao, Fábio Santos, David de la Iglesia Castro, Aman Sharma, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Zhanibek, Dani Hodovic, Nikita Kodenko, Andrew Grigorev, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Earl, Nabanita Dash, George Vyshnya, maykulkarni, Max Hora, Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Vera, Sanidhya Mangal, Wojciech Baranowski, Clemens Wolff, Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. and Kurian Benoy. DVC: Data Version Control - Git for Data Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair & Models. Zenodo, April 2022. doi:10.5281/zenodo. J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap 6417224. Heringa, Peter A. C. ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben [LGK+ 20] Anna-Lena Lamprecht, Leyla Garcia, Mateusz Kuzak, Car- Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Al- los Martinez, Ricardo Arcila, Eva Martin Del Pico, Victoria bert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca- Dominguez Del Angel, Stephanie van de Sandt, Jon Ison, Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Paula Andrea Martinez, Peter McQuilton, Alfonso Valencia, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Jennifer Harrow, Fotis Psomopoulos, Josep Ll Gelpi, Neil Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik Chue Hong, Carole Goble, and Salvador Capella-Gutierrez. To- van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wit- wards FAIR principles for research software. Data Science, tenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 3(1):37–59, January 2020. doi:10.3233/DS-190026. The FAIR Guiding Principles for scientific data management [MHSA22] Nathan Martindale, Jason Hite, Scott L. Stewart, and Mark and stewardship. Scientific Data, 3(1):160018, March 2016. Adams. Curifactory. https://github.com/ORNL/curifactory, doi:10.1038/sdata.2016.18. March 2022. [WWG21] Laura Wratten, Andreas Wilm, and Jonathan Göke. Reproducible, [MLC+ 21] Sonia Natalie Mitchell, Andrew Lahiff, Nathan Cummings, scalable, and shareable analysis pipelines with bioinformatics Jonathan Hollocombe, Bram Boskamp, Dennis Reddyhoff, Ryan workflow managers. Nature Methods, 18(10):1161–1168, Oc- Field, Kristian Zarebski, Antony Wilson, Martin Burke, Blair tober 2021. doi:10.1038/s41592-021-01254-9. Archibald, Paul Bessell, Richard Blackwell, Lisa A. Boden, Alys Brett, Sam Brett, Ruth Dundas, Jessica Enright, Alejandra N. Gonzalez-Beltran, Claire Harris, Ian Hinder, Christopher David Hughes, Martin Knight, Vino Mano, Ciaran McMonagle, Do- minic Mellor, Sibylle Mohr, Glenn Marion, Louise Matthews, Iain J. McKendrick, Christopher Mark Pooley, Thibaud Por- phyre, Aaron Reeves, Edward Townsend, Robert Turner, Jeremy Walton, and Richard Reeve. FAIR Data Pipeline: Provenance- driven data management for traceable scientific workflows. arXiv:2110.07117 [cs, q-bio], October 2021. arXiv:2110. 07117. [MLf22] MLflow: A Machine Learning Lifecycle Platform. https://mlflow. org/, April 2022. [NFP+ 20] Mohammad Hossein Namaki, Avrilia Floratou, Fotis Psallidas, Subru Krishnan, Ashvin Agrawal, Yinghui Wu, Yiwen Zhu, and Markus Weimer. Vamsa: Automated Provenance Tracking in Data Science Scripts. 
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pages 1542–1551, New York, NY, USA, August 2020. Association for Computing Machinery. doi: 10.1145/3394486.3403205. [OBA17] Babatunde K. Olorisade, Pearl Brereton, and Peter Andras. Re- producibility in Machine Learning-Based Studies: An Example of Text Mining. In Reproducibility in ML Workshop, 34th In- ternational Conference on Machine Learning, ICML 2017, June 2017. [Pen11] Roger D. Peng. Reproducible Research in Computational Sci- ence. Science, 334(6060):1226–1227, December 2011. doi: 10.1126/science.1213847. [QCL21] Luigi Quaranta, Fabio Calefato, and Filippo Lanubile. A Taxon- omy of Tools for Reproducible Machine Learning Experiments. In AIxIA 2021 Discussion Papers, 20th International Conference of the Italian Association for Artificial Intelligence, pages 65–76, 2021. [Red19] Sergey Redyuk. Automated Documentation of End-to-End Ex- periments in Data Science. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 2076–2080, April 2019. doi:10.1109/ICDE.2019.00243. [RMRO21] Philipp Ruf, Manav Madan, Christoph Reich, and Djaffar Ould- Abdeslam. Demystifying MLOps and Presenting a Recipe for the Selection of Open-Source Tools. Applied Sciences, 11(19):8861, January 2021. doi:10.3390/app11198861. [SBB13] Victoria Stodden, Jonathan Borwein, and David H. Bailey. Pub- lishing Standards for Computational Science: “Setting the Default to Reproducible”. Pennsylvania State University, 2013. [SH18] Peter Sugimura and Florian Hartl. Building a Reproducible Machine Learning Pipeline. arXiv:1810.04570 [cs, stat], October 2018. arXiv:1810.04570. [SNTH13] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 187 The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python Ecosystem Orhan Eroglu‡∗ , Anissa Zacharias‡ , Michaela Sizemore‡ , Alea Kootz‡ , Heather Craker‡ , John Clyne‡ https://www.youtube.com/watch?v=34zFGkDwJPc F Abstract—The Geoscience Community Analysis Toolkit (GeoCAT) team de- GeoCAT has seven Python tools for geoscientific computation velops and maintains data analysis and visualization tools on structured and and visualization. These tools are built upon the Pangeo [HRA18] unstructured grids for the geosciences community in the Scientific Python ecosystem. In particular, they rely on Xarray [HH17], and Dask Ecosystem (SPE). In response to dealing with increasing geoscientific data [MR15], as well as they are compatible with Numpy and use sizes, GeoCAT prioritizes scalability, ensuring its implementations are scalable Jupyter Notebooks for demonstration purposes. Dask compatibil- from personal laptops to HPC clusters. Another major goal of the GeoCAT team is to ensure community involvement throughout the whole project lifecycle, ity allows the GeoCAT functions to scale from personal laptops which is realized through an open development mindset by encouraging users to high performance computing (HPC) systems such as NCAR’s and contributors to get involved in decision-making. With this model, we not Casper, Cheyenne, and upcoming Derecho clusters [CKZ+ 22]. 
Introduction

The Geoscience Community Analysis Toolkit (GeoCAT) team, established in 2019, leads the software engineering efforts of the National Center for Atmospheric Research (NCAR) "Pivot to Python" initiative [Geo19]. Before then, NCAR Command Language (NCL) [BBHH12] was developed by NCAR as an interpreted, domain-specific language that was aimed to support the analysis and visualization needs of the global geosciences community. NCL had been serving several tens of thousands of users for decades. It is still available for use but is no longer actively developed, as it has been placed in maintenance mode.

The initiative had an initial two-year roadmap with major milestones being: (1) replicating NCL's computational routines in Python, (2) training and support for transitioning NCL users into Python, and (3) moving tools into an open development model. GeoCAT aims to create scalable data analysis and visualization tools on structured and unstructured grids for the geosciences community in the SPE. The GeoCAT team is committed to open development, which helps the team prioritize community involvement at any level of the project lifecycle alongside having the whole software stack open-sourced.

GeoCAT has seven Python tools for geoscientific computation and visualization. These tools are built upon the Pangeo [HRA18] ecosystem. In particular, they rely on Xarray [HH17] and Dask [MR15], are compatible with NumPy, and use Jupyter Notebooks for demonstration purposes. Dask compatibility allows the GeoCAT functions to scale from personal laptops to high performance computing (HPC) systems such as NCAR's Casper, Cheyenne, and upcoming Derecho clusters [CKZ+22]. Additionally, GeoCAT utilizes Numba, an open source just-in-time (JIT) compiler [LPS15], to translate Python and NumPy code into machine code in order to get faster execution wherever possible. GeoCAT's visualization components rely on Matplotlib [Hun07] for most of the plotting functionalities, Cartopy [Met15] for projections, as well as the Datashader and Holoviews stack [Anaa] for big data rendering. Figure 1 shows these technologies with their essential roles around GeoCAT.

Briefly, GeoCAT-comp houses computational operators for applications ranging from regridding and interpolation to climatology and meteorology. GeoCAT-examples provides over 140 publication-quality plotting scripts in Python for Earth sciences. It also houses Jupyter notebooks with high-performance, interactive plots that enable features such as pan and zoom on fine-resolution, unstructured geoscience data (e.g. ~3 km data rendered within a few tens of seconds to a few minutes on personal laptops). This is achieved by making use of the connectivity information in the unstructured grid and rendering data via the Datashader and Holoviews ecosystem [Anaa]. GeoCAT-viz enables higher-level implementation of Matplotlib and Cartopy plotting capabilities through its variety of easy-to-use visualization convenience functions for GeoCAT-examples. GeoCAT also maintains WRF-Python (Weather Research and Forecasting), which works with WRF-ARW model output and provides diagnostic and interpolation routines.

GeoCAT was recently awarded Project Raijin, an NSF EarthCube-funded effort [NSF21] [CEMZ21]. Its goal is to enhance the open-source analysis and visualization tool landscape by developing community-owned, sustainable, scalable tools that facilitate operating on unstructured climate and global weather data in the SPE. Throughout this three-year project, GeoCAT will work on the development of data analysis and visualization functions that operate directly on the native grid, as well as establish an active community of user-contributors.

This paper will provide insights about GeoCAT's software stack and current status, team scope and near-term plans, and open development methodology, as well as current pathways of community involvement.
Fig. 1: The core Python technologies on which GeoCAT relies

GeoCAT Software

The GeoCAT team develops and maintains several open-source software tools. Before describing those tools, it is vital to explain how the team implements continuous integration and continuous delivery/deployment (CI/CD) consistently across all of them.

Continuous Integration and Continuous Delivery/Deployment (CI/CD)

GeoCAT employs a continuous delivery model, with a monthly package release cycle on package management systems and package indexes such as Conda [Anab] and PyPI [Pyt]. This model helps the team make new functions available as soon as they are implemented and address potential errors quickly. To assist this process, the team utilizes multiple tools throughout its GitHub assets to ensure automation, unit testing and code coverage, as well as licensing and reproducibility. Figure 2, for example, shows the set of badges displaying the near real-time status of each CI/CD implementation on the GitHub repository homepage of one of our software tools.

CI build tests of our repositories are implemented and automated (for pushed commits, pull requests, and daily scheduled execution) via GitHub Actions workflows [Git], with the CI badge shown in Figure 2 displaying the status (i.e. pass or fail) of those workflows. Similarly, the CONDA-BUILDS badge shows whether the conda recipe builds successfully for the repository. The Python package "codecov" [cod] analyzes the percentage of code coverage from unit tests in the repository. Additionally, the overall results as well as details for each code script can be seen via the COVERAGE badge. Each of our software repositories has a corresponding documentation page that is populated mostly automatically through the Sphinx Python documentation generator [Bra21] and published through ReadTheDocs [rea] via an automated building and versioning schema. The DOCS badge provides a link to the documentation page along with showing failures, if any, in the documentation rendering process. Figure 3 shows the documentation homepage of GeoCAT-comp. The NCAR and PYPI badges in the Package row show and link to the latest versions of the software tool distributed through NCAR's Conda channel and PyPI, respectively. The LICENSE badge provides a link to our software license, Apache License version 2.0 [Apa04], for all of the GeoCAT stack, enabling the redistribution of the open-source software products on an "as is" basis. Finally, to provide reproducibility of our software products (either for the latest or any older version), we publish version-specific Digital Object Identifiers (DOIs), which can be accessed through the DOI badge. This allows the end-user to accurately cite the specific version of the GeoCAT tools they used for science or research purposes.

Fig. 2: GeoCAT-comp's badges at the beginning of its README file (i.e. the home page of the GitHub repository) [geob]
GeoCAT-comp (and GeoCAT-f2py)

GeoCAT-comp is the computational component of the GeoCAT project, as can be seen in Figure 4. GeoCAT-comp houses implementations of geoscience data analysis functions. Novel research and development is conducted for analyzing both structured and unstructured grid data from various research fields such as climate, weather, atmosphere, and ocean, among others. In addition, some of the functionalities of GeoCAT-comp are inspired by or reimplemented from NCL in order to address the first goal of the "Pivot to Python" effort. For that purpose, 114 NCL routines were selected, excluding some functionalities such as date routines, which can be handled by other packages in the Python ecosystem today. These functions were ranked by order of website documentation access from most to least, and prioritization was made based on those ranks. Today, GeoCAT-comp provides the same or similar capabilities for about 39% (44 out of 114) of those functions.

Fig. 3: GeoCAT-comp documentation homepage built with Sphinx using a theme provided by ReadTheDocs [geoa]

Some of the functions that are made available through GeoCAT-comp are listed below, for which the GeoCAT-comp documentation [geoa] provides signatures and descriptions as well as links to usage examples:

• Spherical harmonics (both decomposition and recomposition, as well as area weighting)
• Fourier transforms such as band-block, band-pass, low-pass, and high-pass
• Meteorological variable computations such as relative humidity, dew-point temperature, heat index, saturation vapor pressure, and more
• Climatology functions such as climate average over multiple years, daily/monthly/seasonal averages, as well as anomalies
• Regridding of curvilinear grid to rectilinear grid, unstructured grid to rectilinear grid, curvilinear grid to unstructured grid, and vice versa
• Interpolation methods such as bilinear interpolation of a rectilinear to another rectilinear grid, hybrid-sigma levels to isobaric levels, and sigma to hybrid coordinates
• Empirical orthogonal function (EOF) analysis

Many of the computational functions in GeoCAT are implemented in pure Python. However, there are others that were originally implemented in Fortran but are now wrapped up in Python with the help of NumPy's F2PY, the Fortran-to-Python interface generator. This is mostly because re-implementing some functions would require understanding complicated algorithm flows and implementing extensive unit tests, which would end up taking too much time compared to wrapping their already-implemented Fortran routines in Python. Furthermore, outside contributors from a science background are likely to keep contributing new functions to GeoCAT from their older Fortran routines in the future. To facilitate contribution, the whole GeoCAT-comp structure is split into two repositories with respect to being either pure-Python or Python with compiled code (i.e. Fortran) implementations. These implementation layers are handled with the GeoCAT-comp and GeoCAT-f2py repositories, respectively.

The GeoCAT-comp code base does not explicitly contain or require any compiled code, making it more accessible to the general Python community at large. In addition, GeoCAT-f2py is automatically installed through the GeoCAT-comp installation, and all functions contained in the "geocat.f2py" package are imported transparently into the "geocat.comp" namespace. Thus, GeoCAT-comp serves as a user API to access the entire computational toolkit, even though its GitHub repository itself only contains pure Python code from the developer's perspective. Whenever prospective contributors want to contribute computational functionality in pure Python, GeoCAT-comp is the only GitHub repository they need to deal with. Therefore, there is no onus on contributors of pure Python code to build, compile, or test any compiled code (e.g. Fortran) at the GeoCAT-comp level.
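To give a concrete feel for this user API, the following is a minimal sketch of calling a GeoCAT-comp routine on in-memory data. The function name relhum and its argument order are assumptions to be checked against the GeoCAT-comp documentation [geoa], not a definitive usage prescription.

    import numpy as np
    import xarray as xr
    import geocat.comp as gc  # also re-exports the wrapped geocat.f2py routines

    # Toy inputs: temperature [K], mixing ratio [kg/kg], pressure [Pa]
    temperature = xr.DataArray(np.array([290.0, 295.0, 300.0]))
    mixing_ratio = xr.DataArray(np.array([0.005, 0.008, 0.012]))
    pressure = xr.DataArray(np.array([100000.0, 95000.0, 90000.0]))

    # Meteorological variable computation (relative humidity); NumPy arrays,
    # Xarray DataArrays, or Dask-backed DataArrays are accepted alike.
    rh = gc.relhum(temperature, mixing_ratio, pressure)
    print(rh)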
GeoCAT-examples (and GeoCAT-viz)

GeoCAT-examples [geoe] was created to address a few of the original milestones of NCAR's "Pivot to Python" initiative: (1) to provide the geoscience community with well-documented visualization examples for several plotting classes in the SPE, and (2) to help transition NCL users into the Python ecosystem through such resources. It was born in early 2020 as the result of a multi-day hackathon event among the GeoCAT team and several other scientists and developers from various NCAR labs/groups. It has since grown to house novel visualization examples and showcase the capabilities of other GeoCAT components, like GeoCAT-comp, along with newer technologies like interactive plotting notebooks. Figure 5 illustrates one of the unique GeoCAT-examples cases that was aimed at exploring best practices for data visualization, like choosing color blind friendly colormaps.

The GeoCAT-examples [geod] gallery contains over 140 example Python plotting scripts, demonstrating functionalities from Python packages like Matplotlib, Cartopy, NumPy, and Xarray. The gallery includes plots from a range of visualization categories such as box plots, contours, meteograms, overlays, projections, shapefiles, streamlines, and trajectories, among others. The plotting categories and scripts under GeoCAT-examples cover almost all of the NCL plot types and techniques. In addition, GeoCAT-examples houses plotting examples for individual GeoCAT-comp analysis functions.

Fig. 4: GeoCAT project structure with all of the software tools [geoc]

Fig. 5: Comparison between NCL (left) and Python (right) when choosing a colormap; GeoCAT-examples aims at choosing color blind friendly colormaps [SEKZ22]

Despite Matplotlib and Cartopy's capabilities to reproduce almost all NCL plots, there was one significant caveat with using their low-level implementations compared to NCL: NCL's high-level plotting functions allowed scientists to plot most cases in only tens of lines of code (LOC), while the Matplotlib and Cartopy stack required writing a few hundred LOC.
In order to build a higher-level implementation on top of Matplotlib and Cartopy while recreating NCL-like plots (from vital plotting capabilities that were not readily available in the Python ecosystem at the time, such as Taylor diagrams and curly vectors, to more stylistic changes such as font sizes, color schemes, etc. that resemble NCL plots), the GeoCAT-viz library [geof] was implemented. Use of functions from this library in GeoCAT-examples significantly reduces the LOC requirements for most of the visualization examples to numbers comparable to NCL's. Figure 6 shows Taylor diagram and curly vector examples that have been created with the help of GeoCAT-viz. To exemplify how GeoCAT-viz helps keep the LOC comparable to NCL, one of the Taylor diagrams (i.e. Taylor_6) took 80 LOC in NCL, and its Python implementation in GeoCAT-examples takes 72 LOC. If many of the Matplotlib functions (e.g. figure and axes initialization, adjustment of several axes parameters, calls to plotting functions for the Taylor diagram, management of grids, addition of titles, contours, etc.) used in this example weren't wrapped up in GeoCAT-viz [geof], the same visualization would easily end up at around two hundred LOC.

Fig. 6: Taylor diagram and curly vector examples that were created with the help of GeoCAT-viz

Recently, the GeoCAT team has been focused on interactive plotting technologies, especially for larger data sets that contain millions of data points. This effort was centered on unstructured grid visualization as part of Project Raijin, which is detailed in a later section of this manuscript. That is because unstructured meshes are a great research and application field for big data and interactivity, such as zooming in/out for regions of interest. As a result of this effort, we created a new notebooks gallery under GeoCAT-examples to house such interactive data visualizations. The first notebook in this gallery, a screenshot of which is shown in Figure 7, is implemented via the Datashader and Holoviews ecosystem [Anaa], and it provides a high-performance, interactive visualization of a Model for Prediction Across Scales (MPAS) Global Storm-Resolving Model weather simulation dataset. The interactivity features are pan and zoom to reveal greater data fidelity globally and regionally. The data used in this work is courtesy of the DYAMOND effort [SSA+19] and has varying resolutions from 30 km to 3.75 km. Our notebook in the gallery uses the 30 km resolution data so that users can download and work on it in their local configuration. However, our work with the 3.75 km resolution data (i.e. about 42 million hexagonal cells globally) showed that rendering the data took only a few minutes on a decent laptop, even without any parallelization. The main reason behind such high performance was that we used the cell-to-node connectivity information in the MPAS data to render the native grid directly (i.e. without remapping to a structured grid), along with utilizing the Datashader stack. Without using the connectivity information, a much more costly Delaunay triangulation would be required. The notebook provides a comparison between these two approaches as well.

Fig. 7: The interactive plot interface from the MPAS visualization notebook in GeoCAT-examples

GeoCAT-datafiles

GeoCAT-datafiles is GeoCAT's small data storage component, hosted as a GitHub repository. This tool houses many datasets in different file formats, such as NetCDF, which can be used along with other GeoCAT tools or for ad-hoc data needs in any other Python script. The datasets can be accessed by the end-user through a lightweight convenience function:

    geocat.datafiles.get("folder_name/filename")

GeoCAT-datafiles fetches the file by simply reading from local storage, if present, or downloading it from the GeoCAT-datafiles repository, if it is not in local storage, with the help of the Pooch framework [USR+20].
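A minimal sketch of this workflow, combining the fetch with Xarray, might look as follows; the file path below is illustrative only and is not necessarily a file that exists in the repository.

    import geocat.datafiles as gdf
    import xarray as xr

    # Downloaded on first use, read from the local cache afterwards (via Pooch)
    path = gdf.get("netcdf_files/example_dataset.nc")  # hypothetical file name
    ds = xr.open_dataset(path)
    print(ds)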
WRF-Python

WRF-Python was created in early 2017 in order to replicate NCL's Weather Research and Forecasting (WRF) package in the SPE, and it covers 100% of the routines in that package. About two years later, NCAR's "Pivot to Python" initiative was announced, and the GeoCAT team took over development and maintenance of WRF-Python.

The package focuses on eliminating the need to work across multiple software platforms when using WRF datasets. It contains more than 30 computational routines (e.g. diagnostic calculations, several interpolation routines) and visualization routines that aim at reducing the amount of post-processing tooling necessary to visualize WRF output files.

Even though there is no continuous development in WRF-Python, as is seen in the rest of the GeoCAT stack, the package is still maintained with timely responses and bug-fix releases for the issues reported by the user community.

Project Raijin

"Collaborative Research: EarthCube Capabilities: Raijin: Community Geoscience Analysis Tools for Unstructured Mesh Data", i.e. Project Raijin, of the consortium between NCAR and Pennsylvania State University, has been awarded by NSF 21-515 EarthCube for an award period of 1 September, 2021 - 31 August, 2024 [NSF21]. Project Raijin aims at developing community-owned, sustainable, scalable tools that facilitate operating on unstructured climate and global weather data [rai]. The GeoCAT team is in charge of the software development of Project Raijin, which mainly consists of implementing visualization and analysis functions in the SPE to be executed on native grids. While doing so, GeoCAT is also responsible for establishing an open development environment, clearly documenting the implementation work, and aligning deployments with the project milestones as well as SPE requirements and specifications.

GeoCAT has created the Xarray-based UXarray package [uxa] to recognize unstructured grid models through partnership with geoscience community groups. UXarray is built on top of the built-in Xarray Dataset functionalities while recognizing several unstructured grid formats (UGRID, SCRIP, and Exodus for now). Since there are more unstructured mesh models in the community than UXarray natively supports, its architecture will also support the addition of new models. Figure 8 shows the regularly structured "latitude-longitude" grids versus a few unstructured grid models.

The UXarray project has implemented data input/output functions for UGRID, SCRIP, and Exodus, as well as methods for surface area and integration calculations so far. The team is currently conducting open discussions (through GitHub Discussions) with community members who are interested in unstructured grids research and development, in order to prioritize the data analysis operators to be implemented throughout the project lifecycle.

Fig. 8: Regular grid (left) vs MPAS-A & CAM-SE grids

Scalability

GeoCAT is aware of the fact that today's geoscientific models are capable of generating huge volumes of data. Furthermore, these datasets, such as those produced by global convective-permitting models, are going to grow even larger in the future. Therefore, the computational and visualization functions being developed in geoscientific research and development workflows need to be scalable from personal devices (e.g. laptops) to HPC (e.g. NCAR's Casper, Cheyenne, and upcoming Derecho clusters) and cloud platforms (e.g. AWS).

In order to keep up with the scalability objectives, GeoCAT functions are implemented to operate on Dask arrays in addition to natively supporting NumPy arrays and Xarray DataArrays. Therefore, the GeoCAT functions can trivially and transparently be parallelized to run on shared-memory and distributed-memory platforms, once a Dask cluster/client is properly configured and the functions are fed with Dask arrays or Dask-backed Xarray DataArrays (i.e. chunked Xarray DataArrays that wrap Dask arrays).
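The intended usage pattern can be sketched as follows. The local Dask client, the file name, and the variable names are illustrative assumptions, and relhum again stands in for any GeoCAT-comp function.

    import xarray as xr
    from dask.distributed import Client
    import geocat.comp as gc

    # A local cluster for a laptop; on an HPC system this would typically
    # connect to a scheduler started via dask-jobqueue or similar tooling.
    client = Client()

    # Lazily open the data with Dask-backed (chunked) DataArrays
    ds = xr.open_dataset("model_output.nc", chunks={"time": 100})

    # Feeding chunked DataArrays parallelizes the computation transparently;
    # nothing is evaluated until .compute() (or .plot(), .to_netcdf(), ...)
    rh = gc.relhum(ds["temperature"], ds["mixing_ratio"], ds["pressure"])
    result = rh.compute()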
Open Development

To ensure community involvement at every level in the development lifecycle, GeoCAT is committed to an open development model. In order to implement this model, GeoCAT provides all of its software tools as GitHub repositories with public GitHub project boards and roadmaps, issue tracking and development reviewing, and comprehensive documentation for users and contributors, such as the Contributor's Guide [geoc] and toolkit-specific documentation, along with community announcements on the GeoCAT blog. Furthermore, GeoCAT encourages community feedback and contribution at any level with inclusive and welcoming language. As a result of this, community requests and feedback have played a significant role in forming and revising the GeoCAT roadmap and the projects' scope.

Community engagement

To further promote engagement with the geoscience community, GeoCAT organizes and attends various community events. First of all, scientific conferences and meetings are great venues for such a scientific software engineering project to share updates and progress with the community. For instance, the American Meteorological Society (AMS) Annual Meeting and the American Geophysical Union (AGU) Fall Meeting are two significant scientific events at which the GeoCAT team has presented one or more publications every year since its birth to inform the community. The annual Scientific Computing with Python (SciPy) conference is another great fit to showcase what GeoCAT has been conducting in geoscience. The team has also attended The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) a few times to keep up to date with the industry state of the art in these technologies.

Creating internship projects is another way of improving community interactions, as it triggers collaboration between GeoCAT, institutions, students, and universities in general. The GeoCAT team thus encourages undergraduate and graduate student engagement in the Python ecosystem through participation in NCAR's Summer Internships in Parallel Computational Science (SIParCS). Such programs are quite beneficial for both students and scientific software development teams. To exemplify, GeoCAT-examples and GeoCAT-viz in particular have received significant contributions through SIParCS in the 2020 and 2021 summers (i.e. tens of visualization examples as well as important infrastructural changes were made available by our interns) [CKZ+22] [LLZ+21] [CFS21]. Furthermore, the team has created three essential projects and one collaboration project through SIParCS for the 2022 summer, through which advanced geoscientific visualization, unstructured grid visualization and data analysis, Fortran to Python algorithm and code development, as well as GPU optimization for GeoCAT-comp routines will be investigated.

Project Pythia

The GeoCAT effort is also a part of the NSF-funded Project Pythia. Project Pythia aims to provide a public, web-accessible training resource that can help educate earth scientists to more effectively use the SPE and cloud computing for dealing with big data in geosciences. GeoCAT helps with Pythia development through content creation and infrastructure contributions. GeoCAT has also contributed several Python tutorials (such as NumPy, Matplotlib, Cartopy, etc.) to the educational resources created through Project Pythia. These materials consist of live tutorial sessions, interactive Jupyter notebook demonstrations, and Q&A sessions, as well as published video recordings of the events on Pythia's YouTube channel. As a result, it helps us engage with the community through multiple channels.
Future directions

GeoCAT aims to keep increasing the number of data analysis and visualization functionalities on both structured and unstructured meshes at the same pace as has been done so far. The team will continue prioritizing scalability and open development in the future development and maintenance of its software tools landscape. To achieve the scalability goals of our tools, we will ensure our implementations are compatible with the state of the art and up to date with the best practices of the technology we are using, e.g. Dask. To enhance the community involvement in our open development model, we will continue interacting with community members through significant events such as Pangeo community meetings, scientific conferences, and tutorials and workshops of GeoCAT's own as well as of other community members; we will keep up our timely communication with stakeholders through GitHub assets and other communication channels.

REFERENCES

[Anaa] Anaconda. Datashader. https://datashader.org/. Online; accessed 29 June 2022.
[Anab] Anaconda, Inc. Conda package manager. https://docs.conda.io/en/latest/. Online; accessed 18 May 2022.
[Apa04] Apache Software Foundation. Apache License, version 2.0. https://www.apache.org/licenses/LICENSE-2.0, 2004. Online; accessed 18 May 2022.
[BBHH12] David Brown, Rick Brownrigg, Mary Haley, and Wei Huang. NCAR Command Language (NCL), 2012. doi:10.5065/D6WD3XH5.
[Bra21] Georg Brandl. Sphinx documentation. URL: http://sphinx-doc.org/sphinx.pdf, 2021.
[CEMZ21] John Clyne, Orhan Eroglu, Brian Medeiros, and Colin M. Zarzycki. Project Raijin: Community geoscience analysis tools for unstructured grids. In AGU Fall Meeting 2021. AGU, 2021.
[CFS21] Heather Rose Craker, Claire Anne Fiorino, and Michaela Victoria Sizemore. Rebuilding the NCL visualization gallery in Python. In 101st American Meteorological Society Annual Meeting. AMS, 2021.
[CKZ+22] Heather Craker, Alea Kootz, Anissa Zacharias, Michaela Sizemore, and Orhan Eroglu. NCAR's GeoCAT Announcement of Computational Tools. In 102nd American Meteorological Society Annual Meeting. AMS, 2022.
[cod] Codecov. https://about.codecov.io/. Online; accessed 18 May 2022.
[geoa] GeoCAT-comp documentation page. https://geocat-comp.readthedocs.io/en/latest/index.html. Online; accessed 20 May 2022. doi:10.5281/zenodo.6607205.
[geob] GeoCAT-comp GitHub repository. https://github.com/NCAR/geocat-comp. Online; accessed 20 May 2022. doi:10.5281/zenodo.6607205.
[geoc] GeoCAT Contributor's Guide. https://geocat.ucar.edu/pages/contributing.html. Online; accessed 20 May 2022. doi:10.5065/a8pp-4358.
[geod] GeoCAT-examples documentation page. https://geocat-examples.readthedocs.io/en/latest/index.html. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678258.
[geoe] GeoCAT-examples GitHub repository. https://github.com/NCAR/geocat-examples. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678258.
[geof] GeoCAT-viz GitHub repository. https://github.com/NCAR/geocat-viz. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678345.
[Geo19] GeoCAT. The future of NCL and the Pivot to Python. https://www.ncl.ucar.edu/Document/Pivot_to_Python, 2019. Online; accessed 17 May 2022. doi:10.5065/D6WD3XH5.
[Git] GitHub. GitHub Actions. https://docs.github.com/en/actions. Online; accessed 18 May 2022.
[HH17] Stephan Hoyer and Joseph Hamman. xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1):10, 2017. doi:10.5334/jors.148.
[HRA18] Joseph Hamman, Matthew Rocklin, and Ryan Abernathey. Pangeo: A big-data ecosystem for scalable earth system science. EGU General Assembly Conference Abstracts, 2018.
[Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[LLZ+21] Erin Lincoln, Jiaqi Li, Anissa Zacharias, Michaela Sizemore, Orhan Eroglu, and Julia Kent. Expanding and strengthening the transition from NCL to Python visualizations. In AGU Fall Meeting 2021. AGU, 2021.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6, 2015. doi:10.1145/2833157.2833162.
[Met15] Met Office. Cartopy: a cartographic Python library with a Matplotlib interface. Exeter, Devon, 2010–2015. URL: http://scitools.org.uk/cartopy.
[MR15] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Kathryn Huff and James Bergstra, editors, Proceedings of the 14th Python in Science Conference, pages 126–132, 2015. doi:10.25080/Majora-7b98e3ed-013.
[NSF21] NSF. Collaborative research: EarthCube capabilities: Raijin: Community geoscience analysis tools for unstructured mesh data. https://nsf.gov/awardsearch/showAward?AWD_ID=2126458&HistoricalAwards=false, 2021. Online; accessed 17 May 2022.
[Pyt] Python Software Foundation. The Python Package Index - PyPI. https://pypi.org/. Online; accessed 18 May 2022.
[rai] Raijin homepage. https://raijin.ucar.edu/. Online; accessed 21 May 2022.
[rea] ReadTheDocs. https://readthedocs.org/. Online; accessed 18 May 2022.
[SEKZ22] Michaela Sizemore, Orhan Eroglu, Alea Kootz, and Anissa Zacharias. Pivoting to Python: Lessons learned in recreating the NCAR Command Language in Python. 102nd American Meteorological Society Annual Meeting, 2022.
[SSA+19] Bjorn Stevens, Masaki Satoh, Ludovic Auger, Joachim Biercamp, Christopher S. Bretherton, Xi Chen, Peter Düben, Falko Judt, Marat Khairoutdinov, Daniel Klocke, et al. DYAMOND: the DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains. Progress in Earth and Planetary Science, 6(1):1–17, 2019. doi:10.1186/s40645-019-0304-z.
[USR+20] Leonardo Uieda, Santiago Rubén Soler, Rémi Rampin, Hugo van Kemenade, Matthew Turk, Daniel Shapero, Anderson Banihirwe, and John Leeman. Pooch: A friend to fetch your data files. Journal of Open Source Software, 5(45):1943, 2020. doi:10.21105/joss.01943.
[uxa] UXarray GitHub repository. https://github.com/UXARRAY/uxarray. Online; accessed 20 May 2022. doi:10.5281/zenodo.5655065.
popmon: Analysis Package for Dataset Shift Detection

Simon Brugman‡∗, Tomas Sostak§, Pradyot Patil‡, Max Baak‡

Abstract—popmon is an open-source Python package to check the stability of a tabular dataset. popmon creates histograms of features binned in time-slices, and compares the stability of its profiles and distributions using statistical tests, both over time and with respect to a reference dataset. It works with numerical, ordinal and categorical features, on both pandas and Spark dataframes, and the histograms can be higher-dimensional, e.g. it can also track correlations between sets of features. popmon can automatically detect and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using monitoring business rules that are either static or dynamic. popmon results are presented in a self-contained report.

Index Terms—dataset shift detection, population shift, covariate shift, histogramming, profiling

∗ Corresponding author: simon.brugman@ing.com
‡ ING Analytics Wholesale Banking
§ Vinted

Copyright © 2022 Simon Brugman et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Fig. 1: The popmon package logo

Introduction

Tracking model performance is crucial to guarantee that a model behaves as designed and trained initially, and for determining whether to promote a model with the same initial design but trained on different data to production. Model performance depends directly on the data used for training and the data predicted on. Changes in the latter (e.g. certain word frequencies, user demographics, etc.) can affect the performance and make predictions unreliable.

Given that input data often change over time, it is important to track changes in both input distributions and delivered predictions periodically, and to act on them when they are significantly different from past instances – e.g. to diagnose and retrain an incorrect model in production. Predictions may be far ahead in time, so the performance can only be verified later, for example in one year. Taking action at that point might already be too late.

To make monitoring both more consistent and semi-automatic, ING Bank has created a generic Python package called popmon. popmon monitors the stability of data populations over time and detects dataset shifts, based on techniques from statistical process control and the dataset shift literature.

popmon employs so-called dynamic monitoring rules to flag and alert on changes observed over time. Using a specified reference dataset, from which observed levels of variation are extracted automatically, popmon sets allowed boundaries on the input data. If the reference dataset changes over time, the effective ranges on the input data can change accordingly. Dynamic monitoring rules make it easy to detect which (combinations of) features are most affected by changing distributions.

popmon is light-weight. For example, only one line is required to generate a stability report.

    report = popmon.df_stability_report(
        df,
        time_axis="date",
        time_width="1w",
        time_offset="2022-1-1"
    )
    report.to_file("report.html")

The package is built on top of Python's scientific computing ecosystem (numpy, scipy [HMvdW+20], [VGO+20]) and supports pandas and Apache Spark dataframes [pdt20], [WM10], [ZXW+16]. This paper discusses how popmon monitors for dataset changes. The popmon code is modular in design and user configurable. The project is available as open-source software.¹

1. See https://github.com/ing-bank/popmon for code, documentation, tutorials and example stability reports.
Related work

Many algorithms detecting dataset shift exist that follow a similar structure [LLD+18], using various data structures and algorithms at each step [DKVY06], [QAWZ15]. However, few are readily available to use in production. popmon offers both a framework that generalizes the pipelines needed to implement those algorithms, and default data drift pipelines, built on histograms with statistical comparisons and profiles (see Sec. data representation).

Other families of tools have been developed that work on individual data points, for model explanations (e.g. SHAP [LL17], feature attributions [SLL20]), rule-based data monitoring (e.g. Great Expectations, Deequ [GCSG22], [SLS+18]) and outlier detection (e.g. [RGL19], [LPO17]).

alibi-detect [KVLC+20], [VLKV+22] is somewhat similar to popmon. This is an open-source Python library that focuses on outlier, adversarial and drift detection. It allows for monitoring of tabular, text, image and time series data, using both online and offline detectors. The backend is implemented in TensorFlow and PyTorch. Much of the reporting functionality, such as feature distributions, is restricted to the (commercial) enterprise version called seldon-deploy. Integrations for model deployment are available based on Kubernetes. The infrastructure setup thus is more complex and restrictive than for popmon, which can run on any developer's machine.

Contributions

The advantage of popmon's dynamic monitoring rules over conventional static ones is that little prior knowledge is required of the input data to set sensible limits on the desired level of stability. This makes popmon a scalable solution over multiple datasets. To the best of our knowledge, no other monitoring tool exists that suits our criteria to monitor models in production for dataset shift. In particular, no other light-weight, open-source package is available that performs such extensive stability tests of a pandas or Spark dataset. We believe the combination of wide applicability, out-of-the-box performance, available statistical tests, and configurability makes popmon an ideal addition to the toolbox of any data scientist or machine learning engineer.

Approach

popmon tests the dataset stability and reports the results through a sequence of steps (Fig. 2):

1) The data are represented by histograms of features, binned in time-slices (Sec. data representation).
2) The data is arranged according to the selected reference type (Sec. comparisons).
3) The stability of the profiles and distributions of those histograms are compared using statistical tests, both with respect to a reference and over time. It works with numerical, ordinal, categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features (Sec. comparisons).
4) popmon can automatically flag and alert on changes observed over time, such as trends, anomalies, changing correlations, etc., using monitoring rules (Sec. alerting).
5) Results are reported to the user via a dedicated, self-contained report (Sec. reporting).

Fig. 2: Step-by-step overview of popmon's pipeline as described in section approach onward.

Dataset shift

In the context of supervised learning, one can distinguish dataset shift as a shift in various distributions:

1) Covariate shift: shift in the independent variables (p(x)).
2) Prior probability shift: shift in the target variable (the class, p(y)).
3) Concept shift: shift in the relationship between the independent and target variables (i.e. p(x|y)).

Note that there is a lot of variation in the terminology used; referring to probabilities prevents this ambiguity. For more information on dataset shift see Quinonero-Candela et al. [QCSSL08].

popmon is primarily interested in monitoring the distributions of features p(x) and labels p(y) for monitoring trained classifiers. These data in deployment should ideally resemble the training data. However, the package can be used more widely, for instance by monitoring interactions between features and the label, or the distribution of model predictions.
Temporal representation

popmon requires features to be distributed as a function of time (bins), which can be provided in two ways:

1) Time axis. Two-dimensional (or higher) distributions are provided, where the first dimension is time and the second is the feature to monitor. To get time slices, the time data column needs to be specified, e.g. "date", including the bin width, e.g. one week ("1w"), and the offset, which is the lower edge of one time-bin, e.g. a certain start date ("2022-1-1").
2) Ordered data batches. A set of distributions of features is provided, corresponding to a new batch of data. This batch is considered a new time-slice, and is stitched to an existing set of batches, in order of incoming batches, where each batch is assigned a unique, increasing index. Together the indices form an artificial, binned time-axis.
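The time-axis binning of the first option can be pictured with a few lines of plain pandas; this is a sketch of the idea only, not popmon's internal implementation.

    import pandas as pd

    df = pd.DataFrame({
        "date": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-12"]),
        "amount": [10.0, 12.5, 9.9],
    })

    offset = pd.Timestamp("2022-1-1")  # lower edge of one time bin (time_offset)
    width = pd.Timedelta("7D")         # bin width of one week (time_width="1w")

    # Index of the weekly time slice each record falls into
    df["time_slice"] = ((df["date"] - offset) // width).astype(int)
    print(df)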
Data representation

popmon uses histogram-based monitoring to track potential dataset shift and outliers over time, as detailed in the next subsection. In the literature, alternative data representations are also employed, such as kdq-trees [DKVY06]. Different data representations are in principle compatible with the popmon pipeline, as it is structured similarly to alternative methods (see [LLD+18], c.f. Fig 5).

Dimensionality reduction techniques may be used to transform the input dataset into a space where the distances between instances are more meaningful for comparison, before using popmon, or in between steps. For example, a linear projection may be used as a preprocessing step, by taking the principal components of PCA as in [QAWZ15]. Machine learning classifiers or autoencoders have also been used for this purpose [LWS18], [RGL19] and can be particularly helpful for high-dimensional data such as images or text.

Histogram-based monitoring

There are multiple reasons behind the histogram-based monitoring approach taken in popmon.

Histograms are small in size, and thus are efficiently stored and transferred, regardless of the input dataset size. Once data records have been aggregated feature-wise, with a minimum number of entries per bin, they are typically no longer privacy sensitive (e.g. knowing the number of records with age 30-35 in a dataset).

popmon is primarily looking for changes in data distributions. Solely monitoring the (main) profiles of a distribution, such as the mean, standard deviation and min and max values, does not necessarily capture the changes in a feature's distribution. Well-known examples of this are Anscombe's Quartet [Ans73] and the dinosaur datasets [MF17], where – between different datasets – the means and the correlation between two features are identical, but the distributions are different. Histograms of the corresponding features (or feature pairs), however, do capture the corresponding changes.

Implementation

For the creation of histograms from data records the open-source histogrammar package has been adopted. histogrammar has been implemented in both Scala and Python [PS21], [PSSE16], and works on Spark and pandas dataframes respectively. The two implementations have been tested extensively to guarantee compatibility. The histograms coming out of histogrammar form the basis of the monitoring code in popmon, which otherwise does not require input dataframes. In other words, the monitoring code itself has no Spark or pandas dependencies, keeping the code base relatively simple.

Histogram types

Three types of histograms are typically used:

• Normal histograms, meant for numerical features with known, fixed ranges. The bin specifications are the lowest and highest expected values and the number of (equidistant) bins.
• Categorical histograms, for categorical and ordinal features, typically boolean or string-based. A categorical histogram accepts any value: when a value has not yet been encountered, it creates a new bin. No bin specifications are required.
• Sparse histograms are open-ended histograms, for numerical features with no known range. The bin specifications only need the bin width, and optionally the origin (the lower edge of bin zero, with a default value of zero). Sparse histograms accept any value. When the value is not yet encountered, a new bin gets created.

For normal and sparse histograms reasonable bin specifications can be derived automatically. Both categorical and sparse histograms are dictionaries with histogram properties. New (index, bin) pairs get created whenever needed. Although this could result in out-of-memory problems, e.g. when histogramming billions of unique strings, in practice this is typically not an issue, as it can be easily mitigated. Features may be transformed into a representation with a lower number of distinct values, e.g. via embeddings or substrings, or one selects the top-n most frequently occurring values.

Open-ended histograms are ideal for monitoring dataset shift and outliers: they capture any kind of (large) data change. When there is a drift, there is no need to change the low- and high-range values. The same holds for outlier detection: if a new maximum or minimum value is found, it is still captured.
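The three binning behaviours can be sketched in a few lines of plain Python; this is for illustration only, since popmon itself delegates histogram creation to histogrammar.

    import numpy as np
    from collections import Counter

    def normal_hist(values, low, high, num):
        # fixed range, equidistant bins
        counts, edges = np.histogram(values, bins=num, range=(low, high))
        return counts, edges

    def categorical_hist(values):
        # one bin per distinct value, created on demand
        return Counter(values)

    def sparse_hist(values, bin_width, origin=0.0):
        # open-ended: bin index -> count, new (index, bin) pairs created as needed
        indices = np.floor((np.asarray(values) - origin) / bin_width).astype(int)
        return Counter(indices.tolist())

    print(normal_hist([1.0, 2.5, 7.0], low=0, high=10, num=5)[0])
    print(categorical_hist(["cat", "dog", "cat"]))
    print(sparse_hist([0.3, 12.7, -4.1], bin_width=5.0))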
Dimensionality

A histogram can be multi-dimensional, and any combination of types is possible. The first dimension is always the time axis, which is always represented by a sparse histogram. The second dimension is the feature to monitor over time. When adding a third axis for another feature, the heatmap between those two features is created over time. For example, when monitoring financial transactions: the first axis could be time, the second axis client type, and the third axis transaction amount. Usually one feature is followed over time, or at most two. The synthetic datasets in section synthetic datasets contain examples of higher-dimensional histograms for known interactions.

Additivity

Histograms are additive. As an example, a batch of data records arrives each week. A new batch arrives, containing timestamps that were missing in a previous batch. When histograms are made of the new batch, these can be readily summed with the histograms of the previous batches. The missing records are immediately put into the right time-slices. It is important that the bin specifications are the same between different batches of data, otherwise their histograms cannot be summed and comparisons are impossible.

Limitations

There is one downside to using histograms: since the data get aggregated into bins, and profiles and statistical tests are obtained from the histograms, a slightly lower resolution is achieved than on the full dataset. In practice, however, this is a non-issue; histograms work great for data monitoring. The reference type and time-axis binning configuration allow the user to select an effective resolution.

Comparisons

In popmon the monitoring of data stability is based on statistical process control (SPC) techniques. SPC is a standard method to manage the data quality of high-volume data processing operations, for example in a large data warehouse [Eng99]. The idea is as follows. Most features have multiple sources of variation from underlying processes. When these processes are stable, the variation of a feature over time should remain within a known set of limits. The level of variation is obtained from a reference dataset, one that is deemed stable and trustworthy.

For each feature in the input data (except the time column), the stability is determined by taking the reference dataset – for example the data on which a classification model was trained – and contrasting each time slot in the input data against it.

The comparison can be done in two ways:

1) Comparisons: statistically comparing each time slot to the reference data (for example using Kolmogorov-Smirnov testing, χ2 testing, or the Pearson correlation).
2) Profiles: for example, tracking the mean of a distribution over time and contrasting it with the reference data. Similar analyses can be done for other summary statistics, such as the median, min, max or quantiles. This is related to the CUSUM technique [Pag54], a well-known method in SPC.

Reference types

Consider X to be an N-dimensional dataset representing our reference data, and X′ to be our incoming data. A covariate shift occurs when p(X) ≠ p(X′) is detected. Different choices for X and X′ may detect different types of drift (e.g. sudden, gradual, incremental). p(X) is referred to as the reference dataset.

Many change-detection algorithms use a window-based solution that compares a static reference to a test window [DKVY06], or a sliding window for both, where the reference is dynamically updated [QAWZ15]. A static reference is a wise choice for monitoring a trained classifier: the performance of such a classifier depends on the similarity of the test data to the training data. Moreover, it may pick up an incremental departure (trend) from the initial distribution that will not be significant in comparison to the adjacent time slots. A sliding reference, on the other hand, is updated with more recent data, and so incorporates this trend. Consider the case where the data contain a price field that is yearly indexed to inflation; then using a static reference may alert purely on the trend.

Reference implementations are provided for common scenarios, such as working with a fixed dataset, a batched dataset or streaming data. For instance, a fixed dataset is common for exploratory data analysis and one-off monitoring, whereas batched or streaming data is more common in a production setting. The reference may be static or dynamic. Four different reference types are possible:

1) Self-reference. Using the full dataset on which the stability report is built as a reference. This method is static: each time slot is compared to all the slots in the dataset. This is the default reference setting.
2) External reference. Using an external reference set, for example the training data of your classifier, to identify which time slots are deviating. This is also a static method: each time slot is compared to the full reference set.
3) Rolling reference. Using a rolling window on the input dataset, allowing one to compare each time slot to a window of preceding time slots. This method is dynamic: one can set the size of the window and the shift from the current time slot. By default the 10 preceding time slots are used.
4) Expanding reference. Using an expanding reference, allowing one to compare each time slot to all preceding time slots. This is also a dynamic method, with variable window size, since all available previous time slots are used. For example, with ten available time slots the window size is 9.
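Selecting one of these reference types is done when building the report. The sketch below assumes the keyword names (in particular reference_type) match the popmon release in use; consult the package documentation to confirm them.

    import pandas as pd
    import popmon

    df = pd.DataFrame({
        "date": pd.date_range("2022-1-1", periods=60, freq="D"),
        "amount": range(60),
    })

    # Rolling reference: compare each time slot to a window of preceding slots.
    report = popmon.df_stability_report(
        df,
        time_axis="date",
        time_width="1w",
        time_offset="2022-1-1",
        reference_type="rolling",   # or "self", "external", "expanding" (assumed values)
    )
    report.to_file("report_rolling.html")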
window size, since all available previous time slots are For each feature in the input data (except the time column), used. For example, with ten available time slots the the stability is determined by taking the reference dataset – for window size is 9. example the data on which a classification model was trained – and contrasting each time slot in the input data. Statistical comparisons The comparison can be done in two ways: Users may have various reasons to prefer a two-sample test over 1) Comparisons: statistically comparing each time slot another. The appropriate comparison depends on our confidence in to the reference data (for example using Kolmogorov- the reference dataset [Ric22], and certain tests may be more com- Smirnov testing, χ 2 testing, or the Pearson correlation). mon in some fields. Many common tests are related [DKVY06], 2) Profiles: for example, tracking the mean of a distribution e.g. the χ 2 function is the first-order expansion of the KL distance over time and contrasting this to the reference data. function. Similar analyses can be done for other summary statistics, Therefore, popmon provides an extensible framework that such as the median, min, max or quantiles. This is related allows users to provide custom two-sample tests using a simple to the CUsUM technique [Pag54], a well-known method syntax, via the registry pattern: in SPC. @Comparisons.register(key="jsd", description="JSD") def jensen_shannon_divergence(p, q): m = 0.5 * (p + q) Reference types return ( Consider X to be an N-dimensional dataset representing our 0.5 * reference data, and X 0 to be our incoming data. A covariate shift (kl_divergence(p, m) + kl_divergence(q, m)) ) occurs when p(X) 6= p(X 0 ) is detected. Different choices for X and X 0 may detect different types of drift (e.g. sudden, gradual, Most commonly used test statistics are implemented, such as the incremental). p(X) is referred to as the reference dataset. Population-Stability-Index and the Jensen-Shannon divergence. Many change-detection algorithms use a window-based solu- The implementations of the χ 2 and Kolmogorov-Smirnov tests tion that compares a static reference to a test window [DKVY06], account for statistical fluctuations in both the input and reference or a sliding window for both, where the reference is dynamically distributions. For example, this is relevant when comparing adja- updated [QAWZ15]. A static reference is a wise choice for mon- cent, low-statistics time slices. itoring of a trained classifier: the performance of such a classifier depends on the similarity of the test data to the training data. Profiles Moreover, it may pick up an incremental departure (trend) from Tracking the distribution of values of interest over time is achieved the initial distribution, that will not be significant in comparison to via profiles. These are functions of the input histogram. Metrics 198 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) may be defined for all dimensions (e.g. count, correlations), or for specific dimensions as in the case of 1D numerical histograms (e.g. quantiles). Extending the existing set of profiles is possible via a syntax similar as above: @Profiles.register( key=["q5", "q50", "q95"], description=[ "5% percentile", "50% percentile (median)", "95% percentile" ], dim=1, type="num" ) def profile_quantiles(values, counts): return logic_goes_here(values, counts) Denote xi (t) as the profile i of feature x at time t, for example the Fig. 3: A snapshot of part of the HTML stability report. 
Denote x_i(t) as profile i of feature x at time t, for example the 5% quantile of the histogram of incoming transaction amounts in a given week. Identical bin specifications are assumed between the reference and incoming data. x̄_i is defined as the average of that metric on the reference data, and σ_{x_i} as the corresponding standard deviation.

The normalized residual between the incoming and reference data, also known as the "pull" or "Z-score", is given by:

    pull_i(t) = (x_i(t) − x̄_i) / σ_{x_i}

When the underlying sources of variation are stable, and assuming the reference dataset is asymptotically large and independent of the incoming data, pull_i(t) follows a normal distribution centered around zero and with unit width, N(0, 1), as dictated by the central limit theorem [Fis11].

In practice, the criteria for normality are hardly ever met. Typically the distribution is wider with larger tails. Yet, approximately normal behaviour is exhibited. Chebyshev's inequality [Che67] guarantees that, for a wide class of distributions, no more than 1/k² of the distribution's values can be k or more standard deviations away from the mean. For example, a minimum of 75% (88.9%) of values must lie within two (three) standard deviations of the mean. These boundaries reoccur in Sec. dynamic monitoring rules.

Alerting

For alerting, popmon uses traffic-light-based monitoring rules, raising green, yellow or red alerts to the user. Green alerts signal the data are fine, yellow alerts serve as warnings of meaningful deviations, and red alerts need critical attention. These monitoring rules can be static or dynamic, as explained in this section.

Static monitoring rules

Static monitoring rules are traditional data quality rules (e.g. [RD00]). Denote x_i(t) as metric i of feature x at time t, for example the number of NaNs encountered in feature x on a given day. As an example, the following traffic lights might be set on x_i(t):

    TL(x_i, t) = Green   if x_i(t) ≤ 1
                 Yellow  if 1 < x_i(t) ≤ 10
                 Red     if x_i(t) > 10

The thresholds of this monitoring rule are fixed, and considered static over time. They need to be set by hand, to sensible values. This requires domain knowledge of the data and the processes that produce them. Setting these traffic light ranges is a time-costly process when covering many features and corresponding metrics.

Dynamic monitoring rules

Dynamic monitoring rules are complementary to static rules. The levels of variation in feature metrics are assumed to have been measured on the reference data. Per feature metric, incoming data are compared against the reference levels. When (significantly) outside of the known bounds, instability of the underlying sources is assumed, and a warning gets raised to the user.

popmon's dynamic monitoring rules raise traffic lights to the user whenever the normalized residual pull_i(t) falls outside certain, configurable ranges. By default:

    TL(pull_i, t) = Green   if |pull_i(t)| ≤ 4
                    Yellow  if 4 < |pull_i(t)| ≤ 7
                    Red     if |pull_i(t)| > 7

If the reference dataset is changing over time, the effective ranges on x_i(t) can change as well. The advantage of this approach over static rules is that significant deviations in the incoming data can be flagged and alerted to the user for a large set of features and corresponding metrics, requiring little (or no) prior knowledge of the data at hand. The relevant knowledge is all extracted from the reference dataset.

With multiple feature metrics, many dynamic monitoring tests can be performed on the same dataset. This raises the multiple comparisons problem: the more inferences are made, the more likely erroneous red flags are raised. To compensate for a large number of tests being made, typically one can set wider traffic light boundaries, reducing the false positive rate.² The boundaries control the size of the deviations - or number of red and yellow alerts - that the user would like to be informed of.

2. Alternatively one may apply the Bonferroni correction to counteract this problem [Bon36].
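A compact, illustrative sketch of the pull computation and the default dynamic bounds described above (not popmon's internal code) is:

    import numpy as np

    def pull(x_t, x_ref_mean, x_ref_std):
        # normalized residual ("Z-score") of a profile value against the reference
        return (x_t - x_ref_mean) / x_ref_std

    def traffic_light(pull_value, yellow=4.0, red=7.0):
        # default dynamic bounds: green <= 4 < yellow <= 7 < red
        p = abs(pull_value)
        if p > red:
            return "red"
        if p > yellow:
            return "yellow"
        return "green"

    reference = np.array([10.1, 9.8, 10.3, 10.0, 9.9])  # profile values on the reference
    incoming = 12.5                                      # profile value in a new time slot
    z = pull(incoming, reference.mean(), reference.std())
    print(z, traffic_light(z))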
Reporting

popmon outputs monitoring results as HTML stability reports. The reports offer multiple views of the data (histograms and heatmaps), the profiles and comparisons, and traffic light alerts. There are several reasons for providing self-contained reports: they can be opened in the browser, easily shared, stored as artifacts, and tracked using tools such as MLflow. The reports also have no need for an advanced infrastructure setup, and can be created and viewed in many environments: from a local machine, to a (restricted) environment, to a public cloud. If, however, a certain dashboarding tool is available, then the metrics computed by popmon are exposed and can be exported into that tool, for example Kibana [Ela22].

Fig. 3: A snapshot of part of the HTML stability report. It shows the aggregated traffic light overview. This view can be used to prioritize features for inspection.

One downside of producing self-contained reports is that they can get large when the plots are pre-rendered and embedded. This is mitigated by embedding plots as JSON that are (lazily) rendered on the client side. Plotly express [Plo22] powers the interactive embedded plots in popmon as of v1.0.0.

Note that multiple reference types can be used in the same stability report. For instance, popmon's default reference pipelines always include a rolling comparison with window size 1, i.e. comparing to the preceding time slot.

Synthetic datasets

In the literature, synthetic datasets are commonly used to test the effectiveness of dataset shift monitoring approaches [LLD+18]. One can test the detection of all kinds of shifts, as the generation process controls when and how the shift happens. popmon has been tested on multiple such artificial datasets: Sine1, Sine2, Mixed, Stagger, Circles, LED, SEA and Hyperplane [PVP18], [SK], [Fan04]. These datasets cover myriad dataset shift characteristics: sudden and gradual drifts, dependency of the label on just one or multiple features, binary and multiclass labels, and unrelated features. The dataset descriptions and sample popmon configurations are available in the code repository.

The reports generated by popmon capture the features and time bins where dataset shift is occurring for all tested datasets. Interactions between features and the label can be used for feature selection, in addition to monitoring the individual feature distributions. The sudden and gradual drifts are clearly visible using a rolling reference, see Fig. 4 for examples. The drift in the Hyperplane dataset, incremental and gradual, is not expected to be detected using a rolling reference or self-reference. Moreover, the dataset is synthesized so that the distribution of the features and the class balance do not change [Fan04].

Fig. 4: LED: Pearson correlation compared with the previous histogram. The shifting points are correctly identified at every 5th of the LED dataset. Similar patterns are visible for other comparisons, e.g. χ2.

Fig. 5: Sine1: The dataset shifts around data points 20,000, 40,000, 60,000 and 80,000 of the Sine1 dataset are clearly visible.

The process to monitor this dataset could be set up in multiple ways, one of which is described here. A logistic regression model is trained on the first 10% of the data, which is also used as a static reference. The predictions of this model are added to the dataset, simulating a machine learning model in production. popmon is able to pick up the divergence between the predictions and the class label, as depicted in Figure 6.

Fig. 6: Hyperplane: The incremental drift compared to the reference dataset is observed for the PhiK correlation between the predictions and the label.
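A rough sketch of such a setup, on a toy stand-in for the Hyperplane data, could look as follows. The column names, the "batch" time axis, and the reference/reference_type keywords are assumptions to be verified against the popmon documentation, not the configuration used for the figures above.

    import numpy as np
    import pandas as pd
    import popmon
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in for a streaming dataset with two features and a binary label
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
    df["label"] = (df["f1"] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
    df["batch"] = np.arange(len(df)) // 100  # artificial, ordered time axis

    # Train on the first 10% and add predictions, simulating a model in production
    reference = df.iloc[: len(df) // 10].copy()
    model = LogisticRegression().fit(reference[["f1", "f2"]], reference["label"])
    df["prediction"] = model.predict(df[["f1", "f2"]])
    reference["prediction"] = model.predict(reference[["f1", "f2"]])

    # Static external reference: the training slice of the data
    report = popmon.df_stability_report(
        df,
        time_axis="batch",
        reference_type="external",
        reference=reference,
    )
    report.to_file("hyperplane_style_report.html")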
Reporting

popmon outputs monitoring results as HTML stability reports. The reports offer multiple views of the data (histograms and heatmaps), the profiles and comparisons, and traffic light alerts. There are several reasons for providing self-contained reports: they can be opened in the browser, easily shared, stored as artifacts, and tracked using tools such as MLFlow. The reports also have no need for an advanced infrastructure setup, and are possible to create and view in many environments: from a local machine, a (restricted) environment, to a public cloud. If, however, a certain dashboarding tool is available, then the metrics computed by popmon are exposed and can be exported into that tool, for example Kibana [Ela22].
One downside of producing self-contained reports is that they can get large when the plots are pre-rendered and embedded. This is mitigated by embedding plots as JSON that are (lazily) rendered on the client-side. Plotly express [Plo22] powers the interactive embedded plots in popmon as of v1.0.0.
Note that multiple reference types can be used in the same stability report. For instance, popmon's default reference pipelines always include a rolling comparison with window size 1, i.e. comparing to the preceding time slot.

Synthetic datasets

In the literature synthetic datasets are commonly used to test the effectiveness of dataset shift monitoring approaches [LLD+18]. One can test the detection for all kinds of shifts, as the generation process controls when and how the shift happens. popmon has been tested on multiple of such artificial datasets: Sine1, Sine2, Mixed, Stagger, Circles, LED, SEA and Hyperplane [PVP18], [SK], [Fan04]. These datasets cover myriad dataset shift characteristics: sudden and gradual drifts, dependency of the label on just one or multiple features, binary and multiclass labels, and containing unrelated features. The dataset descriptions and sample popmon configurations are available in the code repository.

Fig. 4: LED: Pearson correlation compared with previous histogram. The shifting points are correctly identified at every 5th of the LED dataset. Similar patterns are visible for other comparisons, e.g. χ².
Fig. 5: Sine1: The dataset shifts around data points 20.000, 40.000, 60.000 and 80.000 of the Sine1 dataset are clearly visible.
Fig. 6: Hyperplane: The incremental drift compared to the reference dataset is observed for the PhiK correlation between the predictions and the label.

The reports generated by popmon capture features and time bins where the dataset shift is occurring for all tested datasets. Interactions between features and the label can be used for feature selection, in addition to monitoring the individual feature distributions. The sudden and gradual drifts are clearly visible using a rolling reference, see Fig. 4 for examples. The drift in the Hyperplane dataset, incremental and gradual, is not expected to be detected using a rolling reference or self-reference. Moreover, the dataset is synthesized so that the distribution of the features and the class balance does not change [Fan04].
The process to monitor this dataset could be set up in multiple ways, one of which is described here. A logistic regression model is trained on the first 10% of the data, which is also used as static reference. The predictions of this model are added to the dataset, simulating a machine learning model in production. popmon is able to pick up the divergence between the predictions and the class label, as depicted in Figure 6.

Conclusion

This paper has presented popmon, an open-source Python package to check the stability of a tabular dataset. Built around histogram-based monitoring, it runs on a dataset of arbitrary size, supporting both pandas and Spark dataframes. Using the variations observed in a reference dataset, popmon can automatically detect and flag deviations in incoming data, requiring little prior domain knowledge. As such, popmon is a scalable solution that can be applied to many datasets. By default its findings get presented in a single HTML report. This makes popmon ideal for both exploratory data analysis and as a monitoring tool for machine learning models running in production. We believe the combination of out-of-the-box performance and presented features makes popmon an excellent addition to the data practitioner's toolbox.

Acknowledgements

We thank our colleagues from the ING Analytics Wholesale Banking team for fruitful discussions, all past contributors to popmon, and in particular Fabian Jansen and Ilan Fridman Rojas for carefully reading the manuscript. This work is supported by ING Bank.
References

[Ans73] F.J. Anscome. Graphs in statistical analysis. American Statistician, 27(1), pages 17–21, 1973. URL: https://doi.org/10.2307/2682899, doi:10.2307/2682899.
[Bon36] Carlo Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8:3–62, 1936.
[Che67] Pafnutii Lvovich Chebyshev. Des valeurs moyennes, liouville's. J. Math. Pures Appl., 12:177–184, 1867.
[DKVY06] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2006.
[Ela22] Elastic. Kibana, 2022. URL: https://github.com/elastic/kibana.
[Eng99] Larry English. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. Wiley, 1999.
[Fan04] Wei Fan. Systematic data selection to mine concept-drifting data streams. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 128–137, New York, NY, USA, 2004. Association for Computing Machinery. URL: https://doi.org/10.1145/1014052.1014069, doi:10.1145/1014052.1014069.
[Fis11] Hans Fischer. The Central Limit Theorem from Laplace to Cauchy: Changes in Stochastic Objectives and in Analytical Methods, pages 17–74. Springer New York, New York, NY, 2011. URL: https://doi.org/10.1007/978-0-387-87857-7_2, doi:10.1007/978-0-387-87857-7_2.
[GCSG22] Abe Gong, James Campbell, Superconductive, and Great Expectations. Great Expectations, 2022. URL: https://github.com/great-expectations/great_expectations, doi:10.5281/zenodo.5683574.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.
[KVLC+20] Janis Klaise, Arnaud Van Looveren, Clive Cox, Giovanni Vacanti, and Alexandru Coca. Monitoring and explainability of models in production. arXiv preprint arXiv:2007.06299, 2020. URL: https://doi.org/10.48550/arXiv.2007.06299, doi:10.48550/arXiv.2007.06299.
[LL17] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
[LLD+18] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018. doi:10.1109/TKDE.2018.2876857.
[LPO17] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.
[LWS18] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122–3130. PMLR, 2018.
[MF17] Justin Matejka and George Fitzmaurice. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 1290–1294, 2017. URL: https://doi.org/10.1145/3025453.3025912, doi:10.1145/3025453.3025912.
[Pag54] Ewas S Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954. URL: https://doi.org/10.2307/2333009, doi:10.2307/2333009.
[pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134.
[Plo22] Plotly Development Team. Plotly.py: The interactive graphing library for Python (includes Plotly Express), 6 2022. URL: https://github.com/plotly/plotly.py.
[PS21] Jim Pivarski and Alexey Svyatkovskiy. histogrammar/histogrammar-scala: v1.0.20, April 2021. URL: https://doi.org/10.5281/zenodo.4660177, doi:10.5281/zenodo.4660177.
[PSSE16] Jim Pivarski, Alexey Svyatkovskiy, Ferdinand Schenck, and Bill Engels. histogrammar-python: 1.0.0, September 2016. URL: https://doi.org/10.5281/zenodo.61418, doi:10.5281/zenodo.61418.
[PVP18] Ali Pesaranghader, Herna Viktor, and Eric Paquet. Reservoir of diverse adaptive learners and stacking fast hoeffding drift detection methods for evolving data streams. Machine Learning, 107(11):1711–1743, 2018. URL: https://doi.org/10.1007/s10994-018-5719-z, doi:10.1007/s10994-018-5719-z.
[QAWZ15] Abdulhakim A Qahtan, Basma Alharbi, Suojin Wang, and Xiangliang Zhang. A pca-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944, 2015. doi:10.1145/2783258.2783359.
[QCSSL08] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.
[RD00] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[RGL19] Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/846c260d715e5b854ffad5f70a516c88-Abstract.html.
[Ric22] Oliver E Richardson. Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function. In International Conference on Artificial Intelligence and Statistics, pages 2706–2735. PMLR, 2022.
[SK] W Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 377–382, New York, NY, USA. Association for Computing Machinery. URL: https://doi.org/10.1145/502512.502568, doi:10.1145/502512.502568.
[SLL20] Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Visualizing the impact of feature attribution baselines. Distill, 2020. https://distill.pub/2020/attribution-baselines, doi:10.23915/distill.00022.
[SLS+18] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. Automating large-scale data quality verification. Proc. VLDB Endow., 11(12):1781–1794, August 2018. URL: https://doi.org/10.14778/3229863.3229867, doi:10.14778/3229863.3229867.
[VGO+20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
[VLKV+22] Arnaud Van Looveren, Janis Klaise, Giovanni Vacanti, Oliver Cobb, Ashley Scillitoe, and Robert Samoilescu. Alibi Detect: Algorithms for outlier, adversarial and drift detection, 4 2022. URL: https://github.com/SeldonIO/alibi-detect.
[WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi:10.25080/Majora-92bf1922-00a.
[ZXW+16] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache spark: A unified engine for big data processing. Commun. ACM, 59(11):56–65, October 2016. URL: https://doi.org/10.1145/2934664, doi:10.1145/2934664.
pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe

Willy Menacho‡§, Gonzalo Marcelo Ramírez-Ávila‡§, Horacio V. Guzman¶‖‡§∗

Abstract—pyDAMPF is a tool oriented to the Atomic Force Microscopy (AFM) community, which allows the simulation of the physical properties of materials under variable relative humidity (RH). In particular, pyDAMPF is mainly focused on the mechanical properties of polymeric hygroscopic nanofibers that play an essential role in designing tissue scaffolds for implants and filtering devices. Those mechanical properties have been mostly studied from a very coarse perspective reaching a micrometer scale. However, at the nanoscale, the mechanical response of polymeric fibers becomes cumbersome due to both experimental and theoretical limitations. For example, the response of polymeric fibers to RH demands advanced models that consider sub-nanometric changes in the local structure of each single polymer chain. From an experimental viewpoint, choosing the optimal cantilevers to scan the fibers under variable RH is not trivial.
In this article, we show how to use pyDAMPF to choose one optimal nanoprobe for planned experiments with a hygroscopic polymer. Along these lines, we show how to evaluate common and non-trivial operational parameters from an AFM cantilever of different manufacturers. Our results show in a stepwise approach the most relevant parameters to compare the cantilevers based on a non-invasive criterion of measurements. The computing engine is written in Fortran, and wrapped into Python. This aims to reuse physics code without losing interoperability with high-level packages. We have also introduced an in-house and transparent method for allowing multi-thread computations to the users of the pyDAMPF code, which we benchmarked for various computing architectures (PC, Google Colab and an HPC facility) and results in very favorable speed-up compared to former AFM simulators.

Index Terms—Materials science, Nanomechanical properties, AFM, f2py, multi-threading CPUs, numerical simulations, polymers

‡ Instituto de Investigaciones Físicas.
§ Carrera de Física, Universidad Mayor de San Andrés. Campus Universitario Cota Cota. La Paz, Bolivia
¶ Department of Theoretical Physics
‖ Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
* Corresponding author: horacio.guzman@ijs.si

Copyright © 2022 Willy Menacho et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction and Motivation

This article provides an overview of pyDAMPF, which is a BSD licensed, Python and Fortran modeling tool that enables AFM users to simulate the interaction between a probe (cantilever) and materials at the nanoscale under diverse environments. The code is packaged in a bundle and hosted on GitHub at (https://github.com/govarguz/pyDAMPF).
Despite the recent open-source availability of dynamic AFM simulation packages [GGG15], [MHR08], a broad usage for the assessment and planning of experiments has yet to come. One of the problems is that it is often hard to simulate several operational parameters at once. For example, most scientists evaluate different AFM cantilevers before starting new experiments. A typical evaluation criterion is the maximum exerted force that prevents invasivity of the nanoprobe into the sample. The variety of AFM cantilevers depends on the geometrical and material characteristics used for its fabrication. Moreover, manufacturers' nanofabrication techniques may change from time to time, according to the necessities of the experiments, like sharper tips and/or higher oscillation frequencies. From a simulation perspective, evaluating observables for reaching optimal results on upcoming experiments is nowadays possible for tens or hundreds of cantilevers, on top of other operational parameters in the case of dynamic AFM like the oscillation amplitude A0 and set-point Asp, among other expected material properties that may feed simulations and create simulation batches of easily thousands of cases. Given this context, we focus this article on choosing a cantilever out of an initial pyDAMPF database of 30. In fact, many of them are similar in terms of spring constant kc, cantilever volume Vc and also tip radius RT. Then we focus on seven archetypical and distinct cases/cantilevers to understand the characteristics of each of the parameters specified in the manufacturers' datasheets, by evaluating the maximum (peak) forces.
We present four scenarios comparing a total of seven cantilevers and the same sample, where we use as a test-case a Poly-Vinyl Acetate (PVA) fiber. The first scenario (Figure 1) illustrates the difference between air and a moist environment. In the second one, only very soft and stiff cantilever spring constants are compared (see Figure 2). At the same time, the different volumes along the 30 cantilevers are depicted in Figure 3. A final and very common comparison is scenario 4, which compares one of the parameters most sensitive to the force, the tip's radius (see Figure 4).
The quantitative analysis for these four scenarios is presented, together with the advantages of computing several simulation cases at once with our in-house development. Such a comparison is performed under the most common computers used in science, namely, personal computers (PC), cloud (Colab) and supercomputing (small Xeon based cluster). We reach a speed-up of 20 over the former implementation [GGG15].
Another novelty of pyDAMPF is the detailed calculation [GS05] of the environmental-related parameters, like the quality factor Q.
Here, we summarize the main features of pyDAMPF:

• Highly efficient structure in terms of time-to-result, at least one order of magnitude faster than existing approaches.
• Easy to use for scientists without a computing background, in particular in the use of multi-threads.
• It supports the addition of further AFM cantilevers and parameters into the code database.
• Allows an interactive analysis, including a graphical and table-based comparison of results through Jupyter Notebooks.

The results presented in this article are available as a Google Colaboratory notebook, which facilitates exploring pyDAMPF and these examples.

Methods

Processing inputs:
pyDAMPF comes with an initial database of 30 cantilevers, which can be extended at any time by editing the file cantilevers_data.txt. Then, the program inputs_processor.py reads the cantilever database and asks for further physical and operational variables required to start the simulations. This will generate tempall.txt, which contains all cases (e.g. 30) to be simulated with pyDAMPF:

    import os
    import shutil
    import numpy as np

    def inputs_processor(variables, data):
        a, b = np.shape(data)
        # gran_permutador (defined elsewhere in pyDAMPF) builds the table of simulation cases.
        final = gran_permutador(variables, data)
        f_name = 'tempall.txt'
        np.savetxt(f_name, final)
        directory = os.getcwd()
        shutil.copy(directory + '/tempall.txt',
                    directory + '/EXECUTE_pyDAMPF/')
        shutil.copy(directory + '/tempall.txt',
                    directory + '/EXECUTE_pyDAMPF/pyDAMPF_BASE/nrun/')

The variables inside the argument of the function inputs_processor are interactively requested from a shell command line. Then the file tempall.txt is generated and copied to the folders that will contain the simulations.
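To give a feel for what the generated tempall.txt holds, the sketch below builds a small table of simulation cases by permuting cantilever rows with a few operational parameters. It only illustrates the idea behind the gran_permutador helper referenced above; the column layout, parameter choices and numbers are our assumptions, not pyDAMPF's actual file format.

    import itertools
    import numpy as np

    def build_cases(cantilevers, humidities, amplitudes):
        """Return one row per (cantilever, RH, A0) combination."""
        cases = []
        for row, rh, a0 in itertools.product(cantilevers, humidities, amplitudes):
            cases.append(list(row) + [rh, a0])
        return np.array(cases)

    # Three hypothetical cantilevers: spring constant k_c [N/m], tip radius R_T [nm].
    cantilevers = [(0.8, 8.0), (2.7, 8.0), (2.7, 20.0)]
    cases = build_cases(cantilevers, humidities=[0.0, 29.5, 60.1], amplitudes=[10.0])
    np.savetxt('tempall_example.txt', cases)
    print(cases.shape)   # (9, 4): 3 cantilevers x 3 RH values x 1 amplitude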
Execute pyDAMPF

For execution in a single or multi-thread way, we first require wrapping our numeric core from Fortran to Python by using f2py [Vea20], namely the file pyDAMPF.f90 within the folder EXECUTE_pyDAMPF.
Compilation with f2py: This step is only required once and depends on the computer architecture; the command for this reads:

    f2py -c --fcompiler=gnu95 pyDAMPF.f90 -m mypyDAMPF

This command line generates mypyDAMPF.so, which will be automatically located in the simulation folders.
Once we have obtained the numerical code as Python modules, we need to choose the execution mode, which can be serial or parallel. Here, parallel refers to multi-threading capabilities only, within this first version of the code.

Serial method: This method is completely transparent to the user and will execute all the simulation cases found in the file tempall.txt by running the script inputs_processor.py. Our in-house development creates an individual folder for each simulation case, which can be executed in one thread.

    # gen_limites and change_dir are pyDAMPF helper functions defined elsewhere in the package.
    def serial_method(tcases, factor, tempall):
        lst = gen_limites(tcases, factor)
        change_dir()
        for i in range(1, factor + 1):
            direc = os.getcwd()
            direc2 = direc + '/pyDAMPF_BASE/'
            direc3 = direc + '/SERIALBASIC_0/' + str(i) + '/'
            shutil.copytree(direc2, direc3)
        os.chdir(direc + '/SERIALBASIC_0/1/nrun/')
        exec(open('generate_cases.py').read())

As arguments, the serial method requires the total number of simulation cases obtained from tempall.txt. In contrast, the factor parameter has, in this case, a default value of 1.

Parallel method: The parallel method uses more than one computational thread. It is similar to the serial method; however, this method distributes the total load along the available threads and executes in a parallel fashion. This method comprises two parts: first, a function that takes care of the bookkeeping of cases and folders:

    def Parallel_method(tcases, factor, tempall):
        lst = gen_limites(tcases, factor)
        change_dir()
        for i in range(1, factor + 1):
            lim_inferior = lst[i - 1][0]
            lim_superior = lst[i - 1][1]
            direc = os.getcwd()
            direc2 = direc + '/pyDAMPF_BASE/'
            direc3 = direc + '/SERIALBASIC_0/' + str(i) + '/'
            shutil.copytree(direc2, direc3)
            factorantiguo = 'factor=1'
            factornuevo = 'factor=' + str(factor)
            rangoantiguo = '(0,paraleliz)'
            rangonuevo = '(' + str(lim_inferior) + ',' + str(lim_superior) + ')'
            os.chdir(direc + '/PARALLELBASIC_0/' + str(i))
            pyname = 'nrun/generate_cases.py'
            newpath = direc + '/PARALLELBASIC_0/' + str(i) + '/' + pyname
            # reemplazo (defined elsewhere) rewrites the given strings inside the copied generate_cases.py.
            reemplazo(newpath, factorantiguo, factornuevo)
            reemplazo(newpath, rangoantiguo, rangonuevo)
            os.chdir(direc)

This part generates serial-like folders for each thread's number of cases to be executed.
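The listings above rely on a helper, gen_limites, that splits the total number of cases into one (lower, upper) range per thread. Its implementation is not shown in the paper; the sketch below is only our guess at the kind of bookkeeping it performs, meant to clarify how the per-thread ranges consumed by Parallel_method could be produced.

    def gen_limites_sketch(tcases, factor):
        """Split tcases simulation cases into `factor` contiguous (lower, upper) ranges."""
        per_thread = tcases // factor
        limits = []
        for i in range(factor):
            lower = i * per_thread
            # The last thread takes whatever remains after the integer division.
            upper = tcases if i == factor - 1 else (i + 1) * per_thread
            limits.append((lower, upper))
        return limits

    print(gen_limites_sketch(90, 4))   # [(0, 22), (22, 44), (44, 66), (66, 90)]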
The second part of the parallel method will execute pyDAMPF, and contains two scripts: one for executing pyDAMPF on a common UNIX based desktop or laptop, while the second is a Python script that generates SLURM code to launch jobs in HPC facilities.

• Execution with SLURM
It runs pyDAMPF in different threads under the SLURM queuing system.

    def cluster(factor):
        for i in range(1, factor + 1):
            with open('jobpyDAMPF' + str(i) + '.x', 'w') as ssf:
                ssf.write('#!/bin/bash\n')
                ssf.write('#SBATCH --time=23:00:00\n')
                ssf.write('#SBATCH --constraint=epyc3\n')
                ssf.write('\n')
                ssf.write('ml Anaconda3/2019.10\n')
                ssf.write('\n')
                ssf.write('ml foss/2018a\n')
                ssf.write('\n')
                ssf.write('cd /home/$USER/pyDAMPF/EXECUTE_pyDAMPF/PARALLELBASIC_0/' + str(i) + '/nrun\n')
                ssf.write('\n')
                ssf.write('echo $PWD\n')
                ssf.write('\n')
                ssf.write('python3 generate_cases.py\n')
            os.system('sbatch jobpyDAMPF' + str(i) + '.x')
            os.system('rm jobpyDAMPF' + str(i) + '.x')

The above script generates SLURM jobs for a chosen set of threads; after being launched, those job files are erased in order to improve bookkeeping.

• Parallel execution with UNIX based laptops or desktops
Usually, microscope (AFM) computers have no SLURM pre-installed; for such a configuration, we run the following script:

    def compute(factor):
        direc = os.getcwd()
        for i in range(1, factor + 1):
            os.chdir(direc + '/PARALLELBASIC_0/' + str(i) + '/nrun')
            os.system('python3 generate_cases.py &')
            os.chdir(direc)

This function allows the proper execution of the parallel case without a queuing system, where a slight delay might appear from thread to thread execution.

Analysis

Graphically:
• With static graphics, as shown in Figures 5, 9, 13 and 17:

    python3 Graphical_analysis.py

• With interactive graphics, as shown in Figure 18:

    pip install plotly
    jupyter notebook Graphical_analysis.ipynb

Quantitatively:
• With static data tables:

    python3 Quantitative_analysis.py

• With interactive tables: Quantitative_analysis.ipynb uses a minimalistic dashboard application for tabular data visualization, tabloo, with easy installation:

    pip install tabloo
    jupyter notebook Quantitative_analysis.ipynb

Results and discussions

In Figure 1, we show four scenarios to be tackled in this test-case for pyDAMPF. As described in the introduction, the first scenario (Figure 1) compares between air and a moist environment, the second tackles soft and stiff cantilevers (see Figure 2), next is Figure 3, with the cantilever volume comparison, and

Fig. 1: Schematic of the tip-sample interface comparing air at a given Relative Humidity with air.
Fig. 2: Schematic of the tip-sample interface comparing a hard (stiff) cantilever with a soft cantilever.
Fig. 3: Schematic of the tip-sample interface comparing a cantilever with a high volume compared with a cantilever with a small volume.
Fig. 4: Schematic of the tip-sample interface comparing a cantilever with a wide tip with a cantilever with a sharp tip.
Fig. 6: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete oscillations are shown corresponding to a hard (stiff) cantilever with a soft cantilever. The simulations were performed for Asp/A0 = 0.8.
Fig.
5: Time-varying force for PVA at RH = 60.1% for different Fig. 7: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete curve, the maximum force value is the peak force. Two complete os- oscillations are shown corresponding to air at a given Relative cillations are shown corresponding to a cantilever with a high volume Humidity with air. The simulations were performed for Asp /A0 = 0.8 compared with a cantilever with a small volume. The simulations were . performed for Asp /A0 = 0.8 . the force the tip’s radio (see Figure 4). Further details of the cantilevers depicted here are included in Table 22. The AFM is widely used for mechanical properties mapping of matter [Gar20]. Hence, the first comparison of the four scenarios points out to the force response versus time according to a Hertzian interaction [Guz17]. In Figure 5, we see the humid air (RH = 60.1%) changes the measurement conditions by almost 10%. Using a stiffer cantilever (kc = 2.7[N/m]) will also increase the force by almost 50% from the softer one (kc = 0.8[N/m]), see Figure 6. Interestingly, the cantilever’s volume, a smaller cantilever, results in the highest force by almost doubling the force by almost five folds of the smallest volume (Figure 7). Finally, the Tip radius difference between 8 and 20 nm will impact the force in roughly 40 pN (Figure 8). Fig. 8: Time-varying force for PVA at RH = 60.1% for different Now, if we consider literature values for different cantilevers. The simulations show elastic (Hertz) responses. For each RH [FCK+ 12], [HLLB09], we can evaluate the Peak or Maximum curve, the maximum force value is the peak force. Two complete Forces. This force in all cases depicted in Figure 9 shows a oscillations are shown corresponding to a cantilever with a wide tip monotonically increasing behavior with the higher Young mod- with a cantilever with a sharp tip. The simulations were performed ulus. Remarkably, the force varies in a range of 25% from dried for Asp /A0 = 0.8 . 206 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 9: Peak force reached for a PVA sample subjected to different Fig. 11: Peak force reached for a PVA sample subjected to different relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding to to air at a given Relative Humidity with air. The simulations were a cantilever with a high volume compared with a cantilever with a performed for Asp /A0 = 0.8 . small volume. The simulations were performed for Asp /A0 = 0.8 . Fig. 10: Peak force reached for a PVA sample subjected to different relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding to a hard (stiff) cantilever with a soft cantilever. The simulations were Fig. 12: Peak force reached for a PVA sample subjected to different performed for Asp /A0 = 0.8 . relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding to a cantilever with a wide tip with a cantilever with a sharp tip. The simulations were performed for Asp /A0 = 0.8 . PVA to one at RH = 60.1% (see Figure 9). In order to properly describe operational parameters in dy- namic AFM we analyze the peak force dependence with the set- point amplitude Asp . In Figure 13, we have the comparison of peak forces for the different cantilevers as a function of Asp . 
The sensitivity of the peak force is higher for the type of cantilevers with varying kc and Vc . Nonetheless, the peak force dependence given by the Hertzian mechanics has a dependence with the square root of the tip radius, and for those Radii on Table 22 are not influencing the force much. However, they could strongly influence resolution [GG13]. Figure 17 shows the dependence of the peak force as a function of kc , Vc , and RT , respectively, for all the cantilevers listed in Table 22; constituting a graphical summary of the seven analyzed cantilevers for completeness of the analysis. Another way to summarize the results in AFM simulations if to show the Force vs. Distance curves (see Fig. 18), which in these case show exactly how for example a stiffer cantilever may Fig. 13: Dependence of the maximum force on the set-point amplitude penetrate more into the sample by simple checking the distance corresponding to air at a given Relative Humidity with air. cantilever e reaches. On the other hand, it also jumps into the PYDAMPF: A PYTHON PACKAGE FOR MODELING MECHANICAL PROPERTIES OF HYGROSCOPIC MATERIALS UNDER INTERACTION WITH A NANOPROBE 207 Fig. 14: Dependence of the maximum force on the set-point amplitude Fig. 17: Dependence of the maximum force with the most important corresponding to a hard (stiff) cantilever with a soft cantilever. characteristics of each cantilever, filtering the cantilevers used for the scenarios , the figure shows maximum force dependent on the: (a) force constant k, (b) cantilever tip radius, and (c) cantilever volume, respectively. The simulations were performed for $A_{sp}/A_{0}$ = 0.8. Fig. 15: Dependence of the maximum force on the set-point amplitude corresponding to a cantilever with a high volume compared with a cantilever with a small volume. Fig. 18: Three-dimensional plots of the various cantilevers provided by the manufacturer and those in the pyDAMPF database that establish a given maximum force at a given distance between the tip and the sample for a PVA polymer subjected to RH= 0% with E = 930 [MPa]. eyes that a cantilever with small volume f has less damping from the environment and thus it also indents more than the ones with higher volume. Although these type of plots are the easiest to make, they carry lots of experimental information. In addition, pyDAMPF can plot such 3D figures interactively that enables a detailed comparison of those curves. Fig. 16: Dependence of the maximum force on the set-point amplitude As we aim a massive use of pyDAMPF, we also perform the corresponding to a cantilever with a wide tip with a cantilever with a corresponding benchmarks on four different computing platforms, sharp tip. where two of them resembles the standard PC or Laptop found at the labs, and the other two aim to cloud and HPC facilities, 208 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 19: Three-dimensional plots of the various cantilevers provided Fig. 21: Speed up parallel method. by the manufacturer and those in the pyDAMPF database that establish a given maximum force at a given distance between the tip and the sample for a PVA polymer subjected to RH = 60.1% with E = 248.8 [MPa]. Fig. 22: Data used for Figs. 5, 9 and 13 with an A0 = 10[nm] . Observe that the quality factor and Young’s modulus have three different values respectively for RH1 = 29.5%, RH2 = 39.9% y RH3 = 60.1%. ∗∗ The values presented for Quality Factor Q were calculated at Google Colaboratory notebook Q calculation, using the method proposed by [GS05], [Sad98]. 
Fig. 23: Computers used to run pyDAMPF and Former work [GGG15], ∗ the free version of Colab provides this capability, there are two paid versions which provide much greater capacity, these versions known as Colab Pro and Colab Pro+ are only available in Fig. 20: Comparison of times taken by both the parallel method and some countries. the serial method. respectively (see Table 23 for details). Figure 20 shows the average run time for the serial and parallel implementation. Despite a slightly higher performance for the case of the HPC cluster nodes, a high-end computer (PC 2) may also reach similar values, which is our current goal. Another striking aspect observed by looking at the speed-up, is the maximum and minimum run times, which notoriously show the on-demand character of cloud services. As their maxima and minima show the highest variations. To calculate the speed up we use the following equation: ttotal S= tthread Where S is the speed up , tT hread is the execution time of a Fig. 24: Execution times per computational thread, for each computer. computational thread, and tTotal is the sum of times, shown in Note that each Thread consists of 9 simulation cases, with a sum time showing the total of 90 cases for evaluating 3 different Young moduli the table 24. For our calculations we used the highest, the average and 30 cantilevers at the same time. and the lowest execution time per thread. PYDAMPF: A PYTHON PACKAGE FOR MODELING MECHANICAL PROPERTIES OF HYGROSCOPIC MATERIALS UNDER INTERACTION WITH A NANOPROBE 209 Limitations R EFERENCES The main limitation of dynamic AFM simulators based in con- [FCK+ 12] Kathrin Friedemann, Tomas Corrales, Michael Kappl, Katharina Landfester, and Daniel Crespy. Facile and large-scale fabrication tinuum modeling is that sometimes a molecular behavior is over- of anisometric particles from fibers synthesized by colloid elec- looked. Such a limitation comes from the multiple time and length trospinning. Small, 8:144–153, 2012. doi:10.1002/smll. scales behind the physics of complex systems, as it is the case 201101247. of polymers and biopolymers. In this regard, several efforts on [Gar20] Ricardo Garcia. Nanomechanical mapping of soft materials with the atomic force microscope: methods, theory and applications. the multiscale modeling of materials have been proposed, joining The Royal Society of Chemistry, 49:5850–5884, 2020. doi:10. mainly efforts to stretch the multiscale gap [GTK+ 19]. We also 1039/d0cs00318b. plan to do so, within a current project, for modeling the polymeric [GG13] Horacio V. Guzman and Ricardo Garcia. Peak forces and lateral resolution in amplitude modulation force microscopy in liquid. fibers as molecular chains and providing "feedback" between mod- Beilstein Journal of Nanotechnology, 4:852–859, 2013. doi: els from a top-down strategy. Code-wise, the implementation will 10.3762/bjnano.4.96. be also gradually improved. Nonetheless, to maintain scientific [GGG15] Horacio V. Guzman, Pablo D. Garcia, and Ricardo Garcia. Dy- code is a challenging task. In particular without the support for namic force microscopy simulator (dforce): A tool for planning and understanding tapping and bimodal afm experiments. Beilstein our students once they finish their thesis. In this respect, we will Journal of Nanotechnology, 6:369–379, 2015. doi:10.3762/ seek software funding and more community contributions. bjnano.6.36. [GPG13] Horacio V. Guzman, Alma P. Perrino, and Ricardo Garcia. Peak forces in high-resolution imaging of soft matter in liquid. 
ACS Nano, 7:3198–3204, 2013. doi:10.1021/nn4012835. Future work [GS05] Christopher P. Green and John E. Sader. Frequency response of There are several improvements that are planned for pyDAMPF. cantilever beams immersed in viscous fluids near a solid surface with applications to the atomic force microscope. Journal of Ap- plied Physics, 98:114913, 2005. doi:10.1063/1.2136418. • We plan to include a link to molecular dynamics simula- [GTK+ 19] Horacio V. Guzman, Nikita Tretyakov, Hideki Kobayashi, Aoife C. tions of polymer chains in a multiscale like approach. Fogarty, Karsten Kreis, Jakub Krajniak, Christoph Junghans, Kurt • We plan to use experimental values with less uncertainty Kremer, and Torsten Stuehn. Espresso++ 2.0: Advanced methods for multiscale molecular simulation. Computer Physics Communi- to boost semi-empirical models based on pyDAMPF. cations, 238:66–76, 2019. doi:10.1016/j.cpc.2018.12. • The code is still not very clean and some internal cleanup 017. is necessary. This is especially true for the Python backend [Guz17] Horacio V. Guzman. Scaling law to determine peak forces which may require a refactoring. in tapping-mode afm experiments on finite elastic soft matter systems. Beilstein Journal of Nanotechnology, 8:968–974, 2017. • Some AI optimization was also envisioned, particularly for doi:10.3762/bjnano.8.98. optimizing criteria and comparing operational parameters. [HLLB09] Fei Hang, Dun Lu, Shuang Wu Li, and Asa H. Barber. Stress-strain behavior of individual electrospun polymer fibers using combina- tion afm and sem. Materials Research Society, 1185:1185–II07– Conclusions 10, 2009. doi:10.1557/PROC-1185-II07-10. [MHR08] John Melcher, Shuiqing Hu, and Arvind Raman. Veda: A In summary, pyDAMPF is a highly efficient and adaptable simu- web-based virtual environment for dynamic atomic force mi- croscopy. Review of Scientific Instruments, 79:061301, 2008. lation tool aimed at analyzing, planning and interpreting dynamic doi:10.1063/1.2938864. AFM experiments. [Ram20] Prabhu Ramachandran. Compyle: a Python package for paral- It is important to keep in mind that pyDAMPF uses cantilever lel computing. In Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe, editors, Proceedings of the 19th manufacturers information to analyze, evaluate and choose a Python in Science Conference, pages 32 – 39, 2020. doi: certain nanoprobe that fulfills experimental criteria. If this will 10.25080/majora-342d178e-005. not be the case, it will advise the experimentalists on what to [Sad98] John E. Sader. Frequency response of cantilever beams immersed expect from their measurements and the response a material may in viscous fluids with applications to the atomic force microscope. Journal of Applied Physics, 84:64–76, 1998. doi:10.1063/1. have. We currently support multi-thread execution using in-house 368002. development. However, in our outlook, we plan to extend the [Vea20] Pauli Virtanen and et al. Scipy 1.0: fundamental algorithms for code to GPU by using transpiling tools, like compyle [Ram20], scientific computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2. as the availability of GPUs also increases in standard worksta- tions. In addition, we have shown how to reuse a widely tested Fortran code [GPG13] and wrap it as a python module to profit from pythonic libraries and interactivity via Jupyter notebooks. Implementing new interaction forces for the simulator is straight- forward. 
However, this code includes the state-of-the-art contact, viscous, van der Waals, capillarity and electrostatic forces used for physics at the interfaces. Moreover, we plan to implement soon semi-empirical analysis and multiscale modeling with molecular dynamics simulations. Acknowledgments H.V.G thanks the financial support by the Slovenian Research Agency (Funding No. P1-0055). We gratefully acknowledge the fruitful discussions with Tomas Corrales and our joint Fondecyt Regular project 1211901. 210 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods Robert Jackson‡∗ , Rebecca Gjini§ , Sri Hari Krishna Narayanan‡ , Matt Menickelly, Paul Hovland‡ , Jan Hückelheim‡ , Scott Collis‡ F Introduction [LSKJ17] as detailed in the 2019 SciPy Conference proceedings Meteorologists require information about the spatiotemporal dis- (see [JCL+ 20], [RJSCTL+ 19]). It provided a much easier to tribution of winds in thunderstorms in order to analyze how use and more portable interface for wind retrievals than was physical and dynamical processes govern thunderstorm evolution. provided by these packages. In PyDDA versions 0.5 and prior, Knowledge of such processes is vital for predicting severe and the implementation of Equation (1) uses NumPy [HMvdW+ 20] hazardous weather events. However, acquiring wind observations to calculate J and its gradient. In order to find the wind field in thunderstorms is a non-trivial task. There are a variety of in- V that minimizes J, PyDDA used the limited memory Broy- struments that can measure winds including radars, anemometers, den–Fletcher–Goldfarb–Shanno bounded (L-BFGS-B) from SciPy and vertically pointing wind profilers. The difficulty in acquiring [VGO+ 20]. L-BFGS-B requires gradients of J in order to mini- a three dimensional volume of the 3D wind field from these mize J. Considering the antiquity of the CEDRIC and Multidop sensors is that these sensors typically only measure either point packages, these first steps provided the transition to Python that observations or only the component of the wind field parallel was needed in order to enhance accessibility of wind retrieval to the direction of the antenna. Therefore, in order to obtain 3D software by the scientific community. For more information wind fields, the weather radar community uses a weak variational about PyDDA versions 0.5 and prior, consult [RJSCTL+ 19] and technique that finds a 3D wind field that minimizes a cost function [JCL+ 20]. J. However, there are further improvements that still needed J(V) = µm Jm + µo Jo + µv Jv + µb Jb + µs Js (1) to be made in order to optimize both the accuracy and speed of the PyDDA retrievals. For example, the cost functions and Here, Jm is how much the wind field V violates the anelastic mass gradients in PyDDA 0.5 are implemented in NumPy which does continuity equation. Jo is how much the wind field is different not take advantage of GPU architectures for potential speedups from the radar observations. Jv is how much the wind field violates [HMvdW+ 20]. In addition, the gradients of the cost function that the vertical vorticity equation. Jb is how much the wind field are required for the weak variational technique are hand-coded differs from a prescribed background. 
Finally, Js is related to the smoothness of the wind field, quantified as the Laplacian of the wind field. The scalars µx are weights determining the relative contribution of each cost function to the total J. The flexibility in this formulation potentially allows for factoring in the uncertainties that are inherent in the measurements. This formulation is expandable to include cost functions related to data from other sources such as weather forecast models and soundings. For more specific information on these cost functions, see [SPG09] and [PSX12].
PyDDA is an open source Python package that implements the weak variational technique for retrieving winds. It was originally developed in order to modernize existing codes for the weak variational retrievals such as CEDRIC [MF98] and Multidop [LSKJ17].
Packages such as Jax [BFH+18] and TensorFlow [AAB+15] can automatically calculate these gradients. These needs motivated new features for the release of PyDDA 1.0. In PyDDA 1.0, we utilize Jax and TensorFlow's automatic differentiation capabilities for differentiating J, making these calculations less prone to human error and more efficient. Finally, upgrading PyDDA to use Jax and TensorFlow allows it to take advantage of GPUs, increasing the speed of retrievals. This paper shows how Jax and TensorFlow are used to automatically calculate the gradient of J and improve the performance of PyDDA's wind retrievals using GPUs.
In addition, a drawback to the weak variational technique is that the technique requires user specified constants µ. This therefore creates the possibility that winds retrieved from different datasets may not be physically consistent with each other, affecting reproducibility. Therefore, for the PyDDA 1.1 release, this paper also details a new approach that uses Augmented Lagrangian solvers in order to place strong constraints on the wind field such that it satisfies a mass continuity constraint to within a specified tolerance while minimizing the rest of the cost function. This new approach also takes advantage of the automatically calculated gradients that are implemented in PyDDA 1.0. This paper will show that this new approach eliminates the need for user specified constants, ensuring the reproducibility of the results produced by PyDDA.

* Corresponding author: rjackson@anl.gov
‡ Argonne National Laboratory, 9700 Cass Ave., Argonne, IL, 60439
§ University of California at San Diego

Copyright © 2022 Robert Jackson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Weak variational technique

This section summarizes the weak variational technique that was implemented in PyDDA previous to version 1.0 and is currently the default option for PyDDA 1.1. PyDDA currently uses the weak variational formulation given by Equation (1). For this proceedings, we will focus our attention on the mass continuity Jm and observational cost function Jo. In PyDDA, Jm is given as the discrete volume integral of the square of the anelastic mass continuity equation

    Jm(u, v, w) = Σ_volume ( δ(ρs u)/δx + δ(ρs v)/δy + δ(ρs w)/δz )²,   (2)

where u is the zonal component of the wind field and v is the meridional component of the wind field. ρs is the density of air, which is approximated in PyDDA as ρs(z) = e^(−z/10000), where z is the height in meters. The physical interpretation of this equation is that a column of air in the atmosphere is only allowed to compress in order to generate changes in air density in the vertical direction. Therefore, wind convergence at the surface will generate vertical air motion. A corollary of this is that divergent winds must occur in the presence of a downdraft. At the scales of winds observed by PyDDA, this is a reasonable approximation of the winds in the atmosphere.
The cost function Jo metricizes how much the wind field is different from the winds measured by each radar. Since a scanning radar will scan a storm while pointing at an elevation angle θ and an azimuth angle φ, the wind field must first be projected to the radar's coordinates. After that, PyDDA finds the total square error between the analysis wind field and the radar observed winds as done in Equation (3).

    Jo(u, v, w) = Σ_volume ( u cos θ sin φ + v cos θ cos φ + (w − wt) sin θ )²   (3)

Here, wt is the terminal velocity of the particles scanned by the radar volume. This is approximated using empirical relationships between wt and the radar reflectivity Z. PyDDA then uses the limited memory Broyden–Fletcher–Goldfarb–Shanno bounded (L-BFGS-B) algorithm (see, e.g., [LN89]) to find the u, v, and w that solve the optimization problem

    min_{u,v,w} J(u, v, w) := µm Jm(u, v, w) + µv Jv(u, v, w).   (4)

For experiments using the weak variational technique, we run the optimization until either the Linf norm of the gradient of J is less than 10⁻⁸ or when the maximum change in u, v, and w between iterations is less than 0.01 m/s, as done by [PSX12]. Typically, the second criterion is reached first. Before PyDDA 1.0, PyDDA utilized SciPy's L-BFGS-B implementation. However, as of PyDDA 1.0 one can also use TensorFlow's L-BFGS-B implementation, which is used here for the experiments with the weak variational technique [AAB+15].
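To make the projection inside Equation (3) concrete, the short function below maps a Cartesian wind vector onto the radar beam direction for a single gate. The function name, the degree-based angle convention and the example numbers are ours for illustration; this is not PyDDA's API.

    import numpy as np

    def radial_velocity(u, v, w, w_t, elevation, azimuth):
        """Project a Cartesian wind vector onto the radar beam, as in the
        observational term of Equation (3)."""
        theta = np.deg2rad(elevation)
        phi = np.deg2rad(azimuth)
        return (u * np.cos(theta) * np.sin(phi)
                + v * np.cos(theta) * np.cos(phi)
                + (w - w_t) * np.sin(theta))

    # A 10 m/s westerly wind seen by a beam pointing east at 1 degree elevation
    # appears almost entirely as radial motion (made-up numbers).
    print(radial_velocity(u=10.0, v=0.0, w=0.0, w_t=0.0, elevation=1.0, azimuth=90.0))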
Using automatic differentiation

The optimization problem in Equation (4) requires the gradients of J. In PyDDA 0.5 and prior, the gradients of the cost function J were calculated by finding the closed form of the gradient by hand and then coding the closed form in Python. The code snippet below provides an example of how the cost function Jm is implemented in PyDDA using NumPy.

    def calculate_mass_continuity(u, v, w, z, dx, dy, dz, coeff):
        dudx = np.gradient(u, dx, axis=2)
        dvdy = np.gradient(v, dy, axis=1)
        dwdz = np.gradient(w, dz, axis=0)

        div = dudx + dvdy + dwdz

        return coeff * np.sum(np.square(div)) / 2.0

In order to hand code the gradient of the cost function above, one has to write the closed form of the derivative into another function like below.

    def calculate_mass_continuity_gradient(u, v, w, z, dx, dy, dz, coeff):
        dudx = np.gradient(u, dx, axis=2)
        dvdy = np.gradient(v, dy, axis=1)
        dwdz = np.gradient(w, dz, axis=0)
        div = dudx + dvdy + dwdz

        grad_u = -np.gradient(div, dx, axis=2) * coeff
        grad_v = -np.gradient(div, dy, axis=1) * coeff
        grad_w = -np.gradient(div, dz, axis=0) * coeff

        y = np.stack([grad_u, grad_v, grad_w], axis=0)
        return y.flatten()

Hand coding these functions can be labor intensive for complicated cost functions. In addition, there is no guarantee that there is a closed form solution for the gradient. Therefore, we tested using both Jax and TensorFlow to automatically compute the gradients of J. Computing the gradients of J using Jax can be done in two lines of code using jax.vjp:

    primals, fun_vjp = jax.vjp(
        calculate_radial_vel_cost_function,
        vrs, azs, els, u, v, w, wts, rmsVr, weights, coeff)
    _, _, _, p_x1, p_y1, p_z1, _, _, _, _ = fun_vjp(1.0)

Calculating the gradients using automatic differentiation with TensorFlow is also a simple code snippet using tf.GradientTape:

    with tf.GradientTape() as tape:
        tape.watch(u)
        tape.watch(v)
        tape.watch(w)
        loss = calculate_radial_vel_cost_function(
            vrs, azs, els, u, v, w, wts, rmsVr, weights, coeff)

    grad = tape.gradient(loss, [u, v, w])

As one can see, there is no more need to derive the closed form of the gradient of the cost function. Rather, the cost function itself is now the input to a snippet of code that automatically provides the derivative. In PyDDA 1.0, there are now three different engines that the user can specify. The classic "scipy" mode uses the NumPy-based cost function and hand coded gradients used by versions of PyDDA previous to 1.0. In addition, there are now TensorFlow and Jax modes that use both cost functions and automatically generated gradients generated using TensorFlow or Jax.
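To show the overall pattern end to end, the sketch below minimizes a toy mass-continuity penalty with SciPy's L-BFGS-B while letting JAX supply the gradient. The cost function, grid and numbers are invented and far simpler than PyDDA's; only the wiring (an automatically differentiated cost fed to L-BFGS-B) mirrors the approach described above.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from scipy.optimize import minimize

    shape = (8, 16, 16)          # (z, y, x) grid, made-up size
    dz = dy = dx = 500.0         # grid spacing in meters

    def divergence_cost(winds):
        """Toy J_m: squared divergence of the wind field (periodic forward differences)."""
        u, v, w = winds.reshape((3,) + shape)
        div = ((jnp.roll(u, -1, axis=2) - u) / dx
               + (jnp.roll(v, -1, axis=1) - v) / dy
               + (jnp.roll(w, -1, axis=0) - w) / dz)
        return jnp.sum(div ** 2) / 2.0

    # value_and_grad returns both the cost and its gradient in one call.
    cost_and_grad = jax.value_and_grad(divergence_cost)

    def fun(x):
        value, grad = cost_and_grad(jnp.asarray(x))
        return float(value), np.asarray(grad, dtype=np.float64)

    x0 = np.random.default_rng(0).normal(size=3 * np.prod(shape))
    result = minimize(fun, x0, jac=True, method="L-BFGS-B",
                      options={"maxiter": 50})
    print(result.fun)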
Improving performance with GPU capabilities

The implementation of a TensorFlow-based engine provides PyDDA the capability to take advantage of CUDA-compatible GPUs.
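Before running a large retrieval it can be worth confirming that TensorFlow actually sees a CUDA device; the two lines below are a generic TensorFlow check, not a PyDDA feature.

    import tensorflow as tf

    # An empty list means TensorFlow will silently fall back to the CPU.
    print(tf.config.list_physical_devices('GPU'))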
For PyDDA 1.1, the PyDDA development SciPy Engine ~50 days 5771.2 s 871.5 s 226.9 s team focused on implementing a technique that enables the user to TensorFlow 7372.5 s 341.5 s 28.1 s 7.0 s automatically determine the weight coefficients µ. This technique Engine builds upon the automatic differentiation work done for PyDDA NVIDIA 89.4 s 12.0 s 3.5 s 2.6 s Tesla A100 1.0 by using the automatically generated gradients. In this work, GPU we consider a constrained reformulation of Equation (4) that requires wind fields returned by PyDDA to (approximately) satisfy mass continuity constraints. That is, we focus on the constrained TABLE 1: Run times for each of the benchmarks in Figure 1. optimization problem min Jv (u, v, w) u,v,w (5) Given that weather radar datasets can span decades and processing s. to Jm (u, v, w) = 0, each 10 minute time period of data given by the radar can take on the order of 1-2 minutes with PyDDA using regular CPU where we now interpret Jm as a vector mapping that outputs, at operations, if this time were reduced to seconds, then processing each grid point in the discretized volume δ (ρδ xs u) + δ (ρs v) δ (ρs w) δy + δz . winds from years of radar data would become tenable. Therefore, Notice that the formulation in Equation (5) has no dependencies we used the TensorFlow-based PyDDA using the weak variational on scalars µ. technique on the Hurricane Florence example in the PyDDA To solve the optimization problem in Equation (5), we im- Documentation. On 14 September 2018, Hurricane Florence was plemented an augmented Lagrangian method with a filter mech- within range of 2 radars from the NEXRAD network: KMHX anism inspired by [LV20]. An augmented Lagrangian method stationed in Newport, NC and KLTX stationed in Wilmington, considers the Lagrangian associated with an equality-constrained NC. In addition, the High Resolution Rapid Refresh model runs optimization problem, in this case L0 (u, v, w, λ ) = Jv (u, v, w) − provided an additional constraint for the wind retrieval. For more λ > Jm (u, v, w), where λ is a vector of Lagrange multipliers of information on this example, see [RJSCTL+ 19]. The analysis the same length as the number of grid points in the discretized domain spans 400 km by 400 km horizontally, and the horizontal volume. The Lagrangian is then augmented with an additional resolution was allowed to vary for different runs in order to com- squared-penalty term on the constraints to yield Lµ (u, v, w, λ ) = pare how both the CPU and GPU-based retrievals’ performance L0 (u, v, w, λ ) + µ2 kJm (u, v, w)k2 , where we have intentionally used would be affected by grid resolution. The time of completion of µ > 0 as the scalar in the penalty term to make comparisons each of these retrievals is shown in Figure 1. with Equation (4) transparent. It is well known (see, for instance, Figure 1 and Table 1 show that, in general, the retrievals took Theorem 17.5 of [NW06]) that under some not overly restrictive anywhere from 10 to 100 fold less time on the GPU compared to conditions there exists a finite µ̄ such that if µ ≥ µ̄, then each local the CPU. The discrepancy in performance between the GPU and solution of Equation (5) corresponds to a strict local minimizer IMPROVING PYDDA’S ATMOSPHERIC WIND RETRIEVALS USING AUTOMATIC DIFFERENTIATION AND AUGMENTED LAGRANGIAN METHODS 213 of Lµ (u, v, w, λ ∗ ) for a suitable choice of multipliers λ ∗ . 
Essen- tially, augmented Lagrangian methods solve a short sequence of unconstrained problems Lµ (u, v, w, λ ), with different values of µ until a solution is returned that is a local, feasible solution to Equation (5). In our implementation of an augmented Lagrangian method, the coarse minimization of Lµ (u, v, w, λ ) is performed by the Scipy implementation of LBFGS-B with the TensorFlow implementation of the cost function and gradients. Additionally, in our implementation, we employ a filter mechanism (see a survey in [FLT06]) recently proposed for augmented Lagrangian methods in [LV20] in order to guarantee convergence. We defer details to that paper, but note that the feasibility restoration phase (the minimization of a squared constraint violation) required by such a filter method is also performed by the SciPy implementation of LBFGS-B. The PyDDA documentation contains an example of a mesoscale convective system (MCS) that was sampled by a C- band Polarization Radar (CPOL) and a Bureau of Meteorology Australia radar on 20 Jan 2006 in Darwin, Australia. For more details on this storm and the radar network configuration, see [CPMW13]. For more information about the CPOL radar dataset, see [JCL+ 18]. This example with its data is included in the PyDDA Documentation as the "Example of retrieving and plotting winds." Figure 2 shows the winds retrieved by the Augmented La- grangian technique with µ = 1 and from the weak variational technique with µ = 1 on the right. Figure 2 shows that both tech- niques are capturing similar horizontal wind fields in this storm. However, the Augmented Lagrangian technique is resolving an updraft that is not present in the wind field generated by the weak variational technique. Since there is horizontal wind convergence in this region, we expect there to be an updraft present in this box in order for the solution to be physically realistic. Therefore, for µ = 1, the Augmented Lagrangian technique is doing a better job at resolving the updrafts present in the storm than the weak variational technique is. This shows that adjusting µ is required in order for the weak variational technique to resolve the updraft. We solve the unconstrained formulation (4) using the imple- mentation of L-BFGS-B currently employed in PyDDA; we fix the value µv = 1 and vary µm = 2 j : j = 0, 1, 2, . . . , 16. We also solve the constrained formulation (5) using our implementation of a filter Augmented Lagrangian method, and instead vary the initial guess of penalty parameter µ = 2 j : j = 0, 1, 2, . . . , 16. For the initial state, we use the wind profile from the weather balloon launch at 00 UTC 20 Jan 2006 from Darwin and apply it to the whole analysis domain. A summary of results is shown in Figures 3 and 4. We applied a maximum constraint violation tolerance of 10−3 to the filter Augmented Lagrangian method. This is a tolerance that assumes that the winds do not violate the mass continuity constraint by more than 0.001 m2 s−2 . Notice Fig. 2: The PyDDA retrieved winds overlaid over reflectivity from the that such a tolerance is impossible to supply to the weak vari- C-band Polarization Radar for the MCS that passed over Darwin, ational method, highlighting the key advantage of employing a Australia on 20 Jan 2006. The winds were retrieved using the weak constrained method. 
Fig. 3: The x-axis shows, on a logarithmic scale, the maximum constraint violation in the units of divergence of the wind field and the y-axis shows the value of the data-fitting term Jv at the optimal solution. The legend lists the number of function/gradient calls made by the filter Augmented Lagrangian method, which is the dominant cost of both approaches. The dashed line at 10⁻³ denotes the tolerance on the maximum constraint violation that was supplied to the filter Augmented Lagrangian method.

Fig. 4: As Fig. 3, but for the weak variational technique that uses L-BFGS-B.

Finally, a variable of interest to atmospheric scientists for winds inside MCSes is the vertical wind velocity. It provides a measure of the intensity of the storm by demonstrating the amount of upscale growth contributing to intensification. Figure 5 shows the mean updraft velocities inside the box in Figure 2 as a function of height for each of the runs of the TensorFlow L-BFGS-B and Augmented Lagrangian techniques. Table 2 summarizes the mean and spread of the solutions in Figure 5. For the updraft velocities produced by the Augmented Lagrangian technique, there is a 1 m/s spread of velocities produced for given values of µ at altitudes < 7.5 km in Table 2. At an altitude of 10 km, this spread is 1.9 m/s. This is likely due to the reduced spatial coverage of the radars at higher altitudes. However, for the weak variational technique, the sensitivity of the retrieval to µ is much more pronounced, with up to 2.8 m/s differences between retrievals. Therefore, using the Augmented Lagrangian technique makes the vertical velocities less sensitive to µ. This shows that using the Augmented Lagrangian technique will result in more reproducible wind fields from radar wind networks, since it is less sensitive to user-defined parameters than the weak variational technique. However, a limitation of this technique is that, for now, it is limited to two radars and to the mass continuity and vertical vorticity constraints.

Fig. 5: The mean updraft velocity obtained by (left) the weak variational and (right) the Augmented Lagrangian technique inside the updrafts in the boxed region of Figure 2. Each line represents a different value of µ for the given technique.

TABLE 2: Minimum, mean, maximum, and standard deviation of w (m/s) for select levels in Figure 5.
                      Min    Mean   Max    Std. Dev.
Weak variational
  2.5 km              1.2    1.8    2.7    0.6
  5 km                2.2    2.9    4.0    0.7
  7.5 km              3.2    3.9    5.0    0.4
  10 km               2.3    3.3    4.9    1.0
Aug. Lagrangian
  2.5 km              1.8    2.8    3.3    0.5
  5 km                3.1    3.3    3.5    0.1
  7.5 km              3.2    3.5    3.9    0.1
  10 km               3.0    4.3    4.9    0.5

Concluding remarks

Atmospheric wind retrievals are vital for forecasting severe weather events. This motivated us to develop an open source package for atmospheric wind retrievals called PyDDA. In the original releases of PyDDA (versions 0.5 and prior), the original goal of PyDDA was to convert legacy wind retrieval packages such as CEDRIC and Multidop to be fully Pythonic, open source, and accessible to the scientific community. However, there remained many improvements to be made to PyDDA to optimize the speed of the retrievals and to make it easier to add constraints to PyDDA.

This motivated two major changes to PyDDA's wind retrieval routine for PyDDA 1.0. The first major change in PyDDA 1.0 was to simplify the wind retrieval process by automating the calculation of the gradient of the cost function used for the weak variational technique. To do this, we utilized Jax and TensorFlow's capabilities to do automatic differentiation of functions. This also allows PyDDA to take advantage of GPU resources, significantly speeding up retrieval times for mesoscale retrievals at kilometer-scale resolution. In addition, running the TensorFlow-based version of PyDDA provided significant performance improvements even when using a CPU.

These automatically generated gradients were then used to implement an Augmented Lagrangian technique in PyDDA 1.1 that allows for automatically determining the weights for each cost function in the retrieval. The Augmented Lagrangian technique guarantees convergence to a physically realistic solution, something that is not always the case for a given set of weights for the weak variational technique. Therefore, this both creates more reproducible wind retrievals and simplifies the process of retrieving winds for the non-specialist user. However, since the Augmented Lagrangian technique currently only supports the ingesting of radar data into the retrieval, plans for PyDDA 1.2 and beyond include expanding the Augmented Lagrangian technique to support multiple data sources such as models and rawinsondes.

Code Availability

PyDDA is available for public use with documentation and examples available at https://openradarscience.org/PyDDA. The GitHub repository that hosts PyDDA's source code is available at https://github.com/openradar/PyDDA.

Acknowledgments

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ('Argonne'). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This material is also based upon work funded by program development funds from the Mathematics and Computer Science and Environmental Science departments at Argonne National Laboratory.

REFERENCES

[AAB+15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. URL: https://www.tensorflow.org/.
[BFH+18] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL: http://github.com/google/jax.
[CPMW13] Scott Collis, Alain Protat, Peter T. May, and Christopher Williams. Statistics of storm updraft velocities from TWP-ICE including verification with profiling measurements. Journal of Applied Meteorology and Climatology, 52(8):1909-1922, 2013. doi:10.1175/JAMC-D-12-0230.1.
[FLT06] Roger Fletcher, Sven Leyffer, and Philippe Toint. A brief history of filter methods. Technical report, Argonne National Laboratory, 2006. URL: http://www.optimization-online.org/DB_FILE/2006/10/1489.pdf.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, September 2020. doi:10.1038/s41586-020-2649-2.
[JCL+18] R. C. Jackson, S. M. Collis, V. Louf, A. Protat, and L. Majewski. A 17 year climatology of the macrophysical properties of convection in Darwin. Atmospheric Chemistry and Physics, 18(23):17687-17704, 2018. doi:10.5194/acp-18-17687-2018.
[JCL+20] Robert Jackson, Scott Collis, Timothy Lang, Corey Potvin, and Todd Munson. PyDDA: A Pythonic direct data assimilation framework for wind retrievals. Journal of Open Research Software, 8(1):20, 2020. doi:10.5334/jors.264.
[LN89] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503-528, 1989. doi:10.1007/bf01589116.
[LSKJ17] Timothy Lang, Mario Souto, Shahin Khobahi, and Bobby Jackson. nasa/multidop: Multidop v0.3, October 2017. doi:10.5281/zenodo.1035904.
[LV20] Sven Leyffer and Charlie Vanaret. An augmented Lagrangian filter method. Mathematical Methods of Operations Research, 92(2):343-376, 2020. doi:10.1007/s00186-020-00713-x.
[MF98] L. Jay Miller and Sherri M. Fredrick. Custom editing and display of reduced information in Cartesian space. Technical report, National Center for Atmospheric Research, 1998.
[NW06] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
[PSX12] Corey K. Potvin, Alan Shapiro, and Ming Xue. Impact of a vertical vorticity constraint in variational dual-Doppler wind analysis: Tests with real and simulated supercell data. Journal of Atmospheric and Oceanic Technology, 29(1):32-49, 2012. doi:10.1175/JTECH-D-11-00019.1.
[RJSCTL+ 19] Robert Jackson, Scott Collis, Timothy Lang, Corey Potvin, and Todd Munson. PyDDA: A new Pythonic Wind Re- trieval Package. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 111 – 117, 2019. doi:10.25080/Majora-7ddc1dd1-010. [SPG09] Alan Shapiro, Corey K. Potvin, and Jidong Gao. Use of a verti- cal vorticity equation in variational dual-doppler wind analysis. Journal of Atmospheric and Oceanic Technology, 26(10):2089 – 2106, 2009. doi:10.1175/2009JTECHA1256.1. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- tero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 217 RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible João Lemes Gribel Soares‡∗ , Mateus Stano Junqueira‡ , Oscar Mauricio Prada Ramirez‡ , Patrick Sampaio dos Santos Brandão‡§ , Adriano Augusto Antongiovanni‡ , Guilherme Fernandes Alves‡ , Giovani Hidalgo Ceotto‡ F Abstract—In recent years we are seeing exponential growth in the space sector, important issue. Moreover, performance is always a requirement with new companies emerging in it. On top of that more people are becoming both for saving financial and time resources while efficiently fascinated to participate in the aerospace revolution, which motivates students launch performance goals. and hobbyists to build more High Powered and Sounding Rockets. However, In this scenario, crucial parameters should be determined be- rocketry is still a very inaccessible field, with high knowledge of entry-level and fore a safe launch can be performed. Examples include calculating concrete terms. To make it more accessible, people need an active community with flexible, easy-to-use, and well-documented tools. RocketPy is a software with high accuracy and certainty the most likely impact or landing solution created to address all those issues, solving the trajectory simulation region. This information greatly increases range safety and the for High-Power rockets being built on top of SciPy and the Python Scien- possibility of recovering the rocket [Wil18]. As another example, tific Environment. The code allows for a sophisticated 6 degrees of freedom it is important to determine the altitude of the rocket’s apogee in simulation of a rocket’s flight trajectory, including high fidelity variable mass order to avoid collision with other aircraft and prevent airspace effects as well as descent under parachutes. All of this is packaged into an violations. architecture that facilitates complex simulations, such as multi-stage rockets, To better attend to those issues, RocketPy was created as a design and trajectory optimization, and dispersion analysis. 
In this work, the computational tool that can accurately predict all dynamic param- flexibility and usability of RocketPy are indicated in three example simulations: eters involved in the flight of sounding, model, and High-Powered a basic trajectory simulation, a dynamic stability analysis, and a Monte Carlo dispersion simulation. The code structure and the main implemented methods Rockets, given parameters such as the rocket geometry, motor are also presented. characteristics, and environmental conditions. It is an open source project, well structured, and documented, allowing collaborators Index Terms—rocketry, flight, rocket trajectory, flexibility, Monte Carlo analysis to contribute with new features with minimum effort regarding legacy code modification [CSA+ 21]. Introduction Background When it comes to rockets, there is a wide field ranging from Rocketry terminology orbital rockets to model rockets. Between them, two types of rockets are relevant to this work: sounding rockets and High- To better understand the current work, some specific terms regard- Powered Rockets (HPRs). Sounding rockets are mainly used ing the rocketry field are stated below: by government agencies for scientific experiments in suborbital • Apogee: The point at which a body is furthest from earth flights while HPRs are generally used for educational purposes, • Degrees of freedom: Maximum number of independent with increasing popularity in university competitions, such as the values in an equation annual Spaceport America Cup, which hosts more than 100 rocket • Flight Trajectory: 3-dimensional path, over time, of the design teams from all over the world. After the university-built rocket during its flight rocket TRAVELER IV [AEH+ 19] successfully reached space by • Launch Rail: Guidance for the rocket to accelerate to a crossing the Kármán line in 2019, both Sounding Rockets and stable flight speed HPRs can now be seen as two converging categories in terms of • Powered Flight: Phase of the flight where the motor is overall flight trajectory. active HPRs are becoming bigger and more robust, increasing their • Free Flight: Phase of the flight where the motor is inactive potential hazard, along with their capacity, making safety an and no other component but its inertia is influencing the rocket’s trajectory * Corresponding author: jgribel@usp.br ‡ Escola Politécnica of the University of São Paulo • Standard Atmosphere: Average pressure, temperature, and § École Centrale de Nantes. air density for various altitudes • Nozzle: Part of the rocket’s engine that accelerates the Copyright © 2022 João Lemes Gribel Soares et al. This is an open-access exhaust gases article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any • Static hot-fire test: Test to measure the integrity of the medium, provided the original author and source are credited. motor and determine its thrust curve 218 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) • Thrust Curve: Evolution of thrust force generated by a Function motor Variable interpolation meshes/grids from different sources can • Static Margin: Is a non-dimensional distance to analyze lead to problems regarding coupling different data types. 
To the stability solve this, RocketPy employs a dedicated Function class which • Nosecone: The forward-most section of a rocket, shaped allows for more natural and dynamic handling of these objects, for aerodynamics structuring them as Rn → R mathematical functions. • Fin: Flattened append of the rocket providing stability Through the use of those methods, this approach allows for during flight, keeping it in the flight trajectory quick and easy arithmetic operations between lambda expressions and list-defined interpolated functions, as well as scalars. Different Flight Model interpolation methods are available to be chosen from, among them simple polynomial, spline, and Akima ([Aki70]). Extrapo- The flight model of a high-powered rocket takes into account at lation of Function objects outside the domain constrained by a least three different phases: given dataset is also allowed. 1. The first phase consists of a linear movement along the Furthermore, evaluation of definite integrals of these Function launch rail: The motion of the rocket is restricted to one dimen- objects is among their feature set. By cleverly exploiting the sion, which means that only the translation along with the rail chosen interpolation option, RocketPy calculates the values fast needs to be modeled. During this phase, four forces can act on and precisely through the use of different analytical methods. If the rocket: weight, engine thrust, rail reactions, and aerodynamic numerical integration is required, the class makes use of SciPy’s forces. implementation of the QUADPACK Fortran library [PdDKÜK83]. 2. After completely leaving the rail, a phase of 6 degrees of For 1-dimensional Functions, evaluation of derivatives at a point freedom (DOF) is established, which includes powered flight and is made possible through the employment of a simple finite free flight: The rocket is free to move in three-dimensional space difference method. and weight, engine thrust, normal and axial aerodynamic forces Finally, to increase usability and readability, all Function are still important. object instances are callable and can be presented in multiple 3. Once apogee is reached, a parachute is usually deployed, ways depending on the given arguments. If no argument is given, characterizing the third phase of flight: the parachute descent. In a Matplotlib figure opens and the plot of the function is shown in- the last phase, the parachute is launched from the rocket, which is side its domain. Only 2-dimensional and 3-dimensional functions usually divided into two or more parts joined by ropes. This phase can be plotted. This is especially useful for the post-processing ends at the point of impact. methods where various information on the classes responsible for the definition of the rocket and its flight is presented, providing for more concise code. If an n-sized array is passed instead, RocketPy Design: RocketPy Architecture will try and evaluate the value of the Function at this given point Four main classes organize the dataflow during the simulations: using different methods, returning its value. An example of the motor, rocket, environment, and flight [CSA+ 21]. Furthermore, usage of the Function class can be found in the Examples section. there is also a helper class named function, which will be described Additionally, if another Function object is passed, the class further. 
In the Motor class, the main physical and geometric will try to match their respective domain and co-domain in order parameters of the motor are configured, such as nozzle geometry, to return a third instance, representing a composition of functions, grain parameters, mass, inertia, and thrust curve. This first-class in the likes of: h(x) = (g◦ f )(x) = g( f (x)). With different Function acts as an input to the Rocket class where the user is also asked objects defined, the comparePlots method can be used to plot, in to define certain parameters of the rocket such as the inertial mass a single graph, different functions. tensor, geometry, drag coefficients, and parachute description. By imitating, in syntax, commonly used mathematical no- Finally, the Flight class joins the rocket and motor parameters with tation, RocketPy allows for more understandable and human- information from another class called Environment, such as wind, readable code, especially in the implementation of the more atmospheric, and earth models, to generate a simulation of the extensive and cluttered rocket equations of motion. rocket’s trajectory. This modular architecture, along with its well- structured and documented code, facilitates complex simulations, Environment starting with the use of Jupyter Notebooks that people can adapt The Environment class reads, processes and stores all the infor- for their specific use case. Fig. 1 illustrates RocketPy architecture. mation regarding wind and atmospheric model data. It receives as inputs launch point coordinates, as well as the length of the launch rail, and then provides the flight class with six profiles as a function of altitude: wind speed in east and north directions, atmospheric pressure, air density, dynamic viscosity, and speed of sound. For instance, an Environment object can be set as representing New Mexico, United States: 1 from rocketpy import Environment 2 3 ex_env = Environment( 4 railLength=5.2, 5 latitude=32.990254, Fig. 1: RocketPy classes interaction [CSA+ 21] 6 longitude=-106.974998, 7 elevation=1400 8 ) ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 219 RocketPy requires datetime library information specifying the of rocket motors: solid motors, liquid motors, and hybrid motors. year, month, day and hour to compute the weather conditions on Currently, a robust Solid Motor class has been fully implemented the specified day of launch. An optional argument, the timezone, and tested. For example, a typical solid motor can be created as an may also be specified. If the user prefers to omit it, RocketPy will object in the following way: assume the datetime object is given in standard UTC time, just as 1 from rocketpy import SolidMotor follows: 2 3 ex_motor = SolidMotor( 1 import datetime 4 thrustSource='Motor_file.eng', 2 tomorrow = ( 5 burnOut=2, 3 datetime.date.today() + 6 reshapeThrustCurve= False, 4 datetime.timedelta(days=1) 7 grainNumber=5, 5 ) 8 grainSeparation=3/1000, 6 9 grainOuterRadius=33/1000, 7 date_info = ( 10 grainInitialInnerRadius=15/1000, 8 tomorrow.year, 11 grainInitialHeight=120/1000, 9 tomorrow.month, 12 grainDensity= 1782.51, 10 tomorrow.day, 13 nozzleRadius=49.5/2000, 11 12 14 throatRadius=21.5/2000, 12 ) # Hour given in UTC time 15 interpolationMethod='linear') By default, the International Standard Atmosphere [ISO75] static atmospheric model is loaded. 
However, it is easy to set other Rocket models by importing data from different meteorological agencys’ The Rocket Class is responsible for creating and defining the public datasets, such as Wyoming Upper-Air Soundings and Eu- rocket’s core characteristics. Mostly composed of physical at- ropean Centre for Medium-Range Weather Forecasts (ECMWF); tributes, such as mass and moments of inertia, the rocket object or to set a customized atmospheric model based on user-defined will be responsible for storage and calculate mechanical parame- functions. As RocketPy supports integration with different meteo- ters. rological agencies’ datasets, it allows for a sophisticated definition A rocket object can be defined with the following code: of weather conditions including forecasts and historical reanalysis 1 from rocketpy import Rocket scenarios. 2 In this case, NOAA’s RUC Soundings data model is used, a 3 ex_rocket = Rocket( worldwide and open-source meteorological model made available 4 motor=ex_motor, 5 radius=127 / 2000, online. The file name is set as GFS, indicating the use of the Global 6 mass=19.197 - 2.956, Forecast System provided by NOAA, which features a forecast 7 inertiaI=6.60, with a quarter degree equally spaced longitude/latitude grid with 8 inertiaZ=0.0351, a temporal resolution of three hours. 9 distanceRocketNozzle=-1.255, 10 distanceRocketPropellant=-0.85704, 1 ex_env.setAtmosphericModel( 11 powerOffDrag="data/rocket/powerOffDragCurve.csv", 2 type='Forecast', 12 powerOnDrag="data/rocket/powerOnDragCurve.csv", 3 file='GFS') 13 ) 4 ex_env.info() As stated in [RocketPy architecture], a fundamental input of the What is happening on the back-end of this code’s snippet is Rock- rocket is its motor, an object of the Motor class that must be etPy utilizing the OPeNDAP protocol to retrieve data arrays from previously defined. Some inputs are fairly simple and can be easily NOAA’s server. It parses by using the netCDF4 data management obtained with a CAD model of the rocket such as radius, mass, system, allowing for the retrieval of pressure, temperature, wind and moment of inertia on two different axes. The distance inputs velocity, and surface elevation data as a function of altitude. The are relative to the center of mass and define the position of the Environment class then computes the following parameters: wind motor nozzle and the center of mass of the motor propellant. The speed, wind heading, speed of sound, air density, and dynamic powerOffDrag and powerOnDrag receive .csv data that represents viscosity. Finally, plots of the evaluated parameters concerning the drag coefficient as a function of rocket speed for the case where the altitude are all passed on to the mission analyst by calling the the motor is off and other for the motor still burning, respectively. Env.info() method. At this point, the simulation would run a rocket with a tube of a certain diameter, with its center of mass specified and a motor at its Motor end. For a better simulation, a few more important aspects should RocketPy is flexible enough to work with most types of motors then be defined, called Aerodynamic surfaces. Three of them are used in sound rockets. The main function of the Motor class accepted in the code, these being the nosecone, fins, and tail. They is to provide the thrust curve, the propulsive mass, the inertia can be simply added to the code via the following methods: tensor, and the position of its center of mass as a function of time. 
1 nose_cone = ex_rocket.addNose( Geometric parameters regarding propellant grains and the motor’s 2 length=0.55829, kind="vonKarman", nozzle must be provided, as well as a thrust curve as a function 3 distanceToCM=0.71971 of time. The latter is preferably obtained empirically from a static 4 ) 5 fin_set = ex_rocket.addFins( hot-fire test, however, many of the curves for commercial motors 6 4, span=0.100, rootChord=0.120, tipChord=0.040, are freely available online [Cok98]. 7 distanceToCM=-1.04956 Alternatively, for homemade motors, there is a wide range 8 ) of open-source internal ballistics simulators, such as OpenMotor 9 tail = ex_rocket.addTail( 10 topRadius=0.0635, bottomRadius=0.0435, [Rei22], can predict the produced thrust with high accuracy for a 11 length=0.06, distanceToCM=-1.194656 given sizing and propellant combination. There are different types 12 ) 220 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) All these methods receive defining geometrical parameters and the Rocket class and the Environment class are used as input to their distance to the rocket’s center of mass (distanceToCM) initialize it, along with parameters such as launch heading and as inputs. Each of these surfaces generates, during the flight, inclination relative to the Earth’s surface: a lift force that can be calculated via a lift coefficient, which 1 from rocketpy import Flight is calculated with geometrical properties, as shown in [Bar67]. 2 Further on, these coefficients are used to calculate the center of 3 ex_flight = Flight( 4 rocket=rocket, pressure and subsequently the static margin. In each of these 5 environment=env, methods, the static margin is reevaluated. 6 inclination=85, Finally, the parachutes can be added in a similar manner to 7 heading=0 ) the aerodynamic surfaces. However, a few inputs regarding the 8 electronics involved in the activation of the parachute are required. Once the simulation is initialized, run, and completed, the The most interesting of them is the trigger and samplingRate instance of the Flight class stores relevant raw data. The inputs, which are used to define the parachute’s activation. The Flight.postProcess() method can then be used to com- trigger is a function that returns a boolean value that signifies pute secondary parameters such as the rocket’s Mach number when the parachute should be activated. The samplingRate is the during flight and its angle of attack. time interval that the trigger will be evaluated in the simulation To perform the numerical integration of the equations of mo- time steps. tion, the Flight class uses the LSODA solver [Pet83] implemented 1 def parachute_trigger(p, y): by Scipy’s scipy.integrate module [VGO+ 20]. Usually, 2 if vel_z < 0 and height < 800: well-designed rockets result in non-stiff equations of motion. 3 boole = True However, during flight, rockets may become unstable due to 4 else: 5 boole = False variations in their inertial and aerodynamic properties, which can 6 return boole result in a stiff system. LSODA switches automatically between 7 the nonstiff Adams method and the stiff BDF method, depending 8 ex_parachute = ex_rocket.addParachute( 9 'ParachuteName', on the detected stiffness, perfectly handle both cases. 10 CdS=10.0, Since a rocket’s flight trajectory is composed of multiple 11 trigger=parachute_trigger, phases, each with its own set of governing equations, RocketPy 12 samplingRate=105, employs a couple of clever methods to run the numerical inte- 13 lag=1.5, 14 noise=(0, 8.3, 0.5) gration. 
The Flight class uses a FlightPhases container to 15 ) hold each FlightPhase. The FlightPhases container will orchestrate the different FlightPhase instances, and compose With the rocket fully defined, the Rocket.info() and them during the flight. Rocket.allInfo() methods can be called giving us informa- This is crucial because there are events that may or may not tion and plots of the calculations performed in the class. One of the happen during the simulation, such as the triggering of a parachute most relevant outputs of the Rocket class is the static margin, as ejection system (which may or may not fail) or the activation of a it is important for the rocket stability and makes possible several premature flight termination event. There are also events such as analyses. It is visualized through the time plot in Fig. 2, which the departure from the launch rail or the apogee that is known to shows the variation of the static margin as the motor burns its occur, but their timestamp is unknown until the simulation is run. propellant. All of these events can trigger new flight phases, characterized by a change in the rocket’s equations of motion. Furthermore, such events can happen close to each other and provoke delayed phases. To handle this, the Flight class has a mechanism for creating new phases and adding them dynamically in the appropriate order to the FlightPhases container. The constructor of the FlightPhase class takes the follow- ing arguments: • t: a timestamp that symbolizes at which instant such flight phase should begin; • derivative: a function that returns the time derivatives of the rocket’s state vector (i.e., calculates the equations of motion for this flight phase); • callbacks: a list of callback functions to be run when the flight phase begins (which can be useful if some parameters of the rocket need to be modified before the flight phase begins). Fig. 2: Static Margin The constructor of the Flight class initializes the FlightPhases container with a rail phase and also a dummy max time phase which marks the maximum flight duration. Then, Flight it loops through the elements of the container. The Flight class is responsible for the integration of the rocket’s Inside the loop, an important attribute of the current equations of motion overtime [CSA+ 21]. Data from instances of flight phase is set: FlightPhase.timeBound, the maxi- ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 221 mum timestamp of the flight phase, which is always equal to the initial timestamp of the next flight phase. Ordinar- ily, it would be possible to run the LSODA solver from FlightPhase.t to FlightPhase.timeBound. However, this is not an option because the events which can trigger new flight phases need to be checked throughout the simulation. While scipy.integrate.solve_ivp does offer the events ar- gument to aid in this, it is not possible to use it with most of the events that need to be tracked, since they cannot be expressed in the necessary form. As an example, consider the very common event of a parachute ejection system. To simulate real-time algorithms, the necessary inputs to the ejection algorithm need to be supplied at regular intervals to simulate the desired sampling rate. Furthermore, the ejection algorithm cannot be called multiple times without real data since it generally stores all the inputs it gets to calculate if the rocket has reached the apogee to trigger the parachute release mechanism. 
Discrete controllers can present the same peculiar properties. To handle this, the instance of the FlightPhase class holds Fig. 3: 3D flight trajectory, an output of the Flight.allInfo method a TimeNodes container, which stores all the required timesteps, or TimeNode, that the integration algorithm should stop at so that the events can be checked, usually by feeding the necessary Monte Carlo simulations, which require a large number of data to parachutes and discrete control trigger functions. When it simulations to be performed (10,000 ~ 100,000). comes to discrete controllers, they may change some parameters • The code structure should be flexible. This is important in the rocket once they are called. On the other hand, a parachute due to the diversity of possible scenarios that exist in a triggers rarely actually trigger, and thus, rarely invoke the creation rocket design context. Each user will have their simulation of a new flight phase characterized by descent under parachute requirements and should be able to modify and adapt new governing equations of motion. features to meet their needs. For this reason, the code was The Flight class can take advantage of this fact by employing designed in a fashion such that each major component is overshootable time nodes: time nodes that the integrator does separated into self-encapsulated classes, responsible for a not need to stop. This allows the integration algorithm to use single functionality. This tenet follows the concepts of the more optimized timesteps and significantly reduce the number of so-called Single Responsibility Principle (SRP) [MNK03]. iterations needed to perform a simulation. Once a new timestep • Finally, the software should aim to be accessible. The is taken, the Flight class checks all overshootable time nodes that source code was openly published on GitHub (https: have passed and feeds their event triggers with interpolated data. //github.com/Projeto-Jupiter/RocketPy), where the com- In case when an event is triggered, the simulation is rolled back to munity started to be built and a group of developers, known that state. as the RocketPy Team, are currently assigned as dedicated In summary, throughout a simulation, the Flight class loops maintainers. The job involves not only helping to improve through each non-overshootable TimeNode of each element of the code, but also working towards building a healthy the FlightPhases container. At each TimeNode, the event ecosystem of Python, rocketry, and scientific computing triggers are fed with the necessary input data. Once an event is enthusiasts alike; thus facilitating access to the high- triggered, a new FlightPhase is created and added to the main quality simulation without a great level of specialization. container. These loops continue until the simulation is completed, either by reaching the maximum flight duration or by reaching a The following examples demonstrate how RocketPy can be a terminal event, such as ground impact. useful tool during the design and operation of a rocket model, Once the simulation is completed, raw data can al- enabling functionalities not available by other simulation software ready be accessed. To compute secondary parameters, the before. Flight.postProcess() is used. It takes advantage of the fact that the FlightPhases container keeps all relevant flight Examples information to essentially retrace the trajectory and capture more Using RocketPy for Rocket Design information about the flight. 
Once secondary parameters are computed, the 1) Apogee by Mass using a Function helper class Flight.allInfo method can be used to show and plot Because of performance and safety reasons, apogee is one of all the relevant information, as illustrated in Fig. 3. the most important results in rocketry competitions, and it’s highly valuable for teams to understand how different Rocket parameters The adaptability of the Code and Accessibility can change it. Since a direct relation is not available for this kind RocketPy’s development started in 2017, and since the beginning, of computation, the characteristic of running simulation quickly is certain requirements were kept in mind: utilized for evaluation of how the Apogee is affected by the mass • Execution times should be fast. There is a high interest in of the Rocket. This function is highly used during the early phases performing sensitivity analysis, optimization studies and of the design of a Rocket. 222 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) An example of code of how this could be achieved: 16 terminateOnApogee=True, 17 verbose=True, 1 from rocketpy import Function 18 ) 2 19 ex_flight.postProcess() 3 def apogee(mass): 20 simulation_results += [( 4 # Prepare Environment 21 ex_flight.attitudeAngle, 5 ex_env = Environment(...) 22 ex_rocket.staticMargin(0), 6 23 ex_rocket.staticMargin(ex_flight.outOfRailTime), 7 ex_env.setAtmosphericModel( 24 ex_rocket.staticMargin(ex_flight.tFinal) 8 type="CustomAtmosphere", 25 )] 9 wind_v=-5 26 Function.comparePlots( 10 ) 27 simulation_results, 11 28 xlabel="Time (s)", 12 # Prepare Motor 29 ylabel="Attitude Angle (deg)", 13 ex_motor = SolidMotor(...) 30 ) 14 15 # Prepare Rocket The next step is to start the simulations themselves, which can 16 ex_rocket = Rocket( 17 ..., be done through a loop where the Flight class is called, perform 18 mass=mass, the simulation, save the desired parameters into a list and then 19 ... follow through with the next iteration. The post-process flight data 20 ) 21 method is being used to make RocketPy evaluate additional result 22 ex_rocket.setRailButtons([0.2, -0.5]) parameters after the simulation. 23 nose_cone = ex_rocket.addNose(.....) Finally, the Function.comparePlots() method is used to plot 24 fin_set = ex_rocket.addFins(....) the final result, as reported at Fig. 4. 25 tail = ex_rocket.addTail(....) 26 27 # Simulate Flight until Apogee 28 ex_flight = Flight(.....) 29 return ex_flight.apogee 30 31 apogee_by_mass = Function( 32 apogee, inputs="Mass (kg)", 33 outputs="Estimated Apogee (m)" 34 ) 35 apogee_by_mass.plot(8, 20, 20) The possibility of generating this relation between mass and apogee in a graph shows the flexibility of Rocketpy and also the importance of the simulation being designed to run fast. 1) Dynamic Stability Analysis In this analysis the integration of three different RocketPy classes will be explored: Function, Rocket, and Flight. The moti- vation is to investigate how static stability translates into dynamic stability, i.e. different static margins result relies on different Fig. 4: Dynamic Stability example, unstable rocket presented on blue dynamic behavior, which also depends on the rocket’s rotational line inertia. We can assume the objects stated in [motor] and [rocket] sections and just add a couple of variations on some input data Monte Carlo Simulation to visualize the output effects. 
More specifically, the idea will be When simulating a rocket’s trajectory, many input parameters to explore how the dynamic stability of the studied rocket varies may not be completely reliable due to several uncertainties in by changing the position of the set of fins by a certain factor. measurements raised during the design or construction phase of To do that, we have to simulate multiple flights with different the rocket. These uncertainties can be considered together in a static margins, which is achieved by varying the rocket’s fin group of Monte Carlo simulations [RK16] which can be built on positions. This can be done through a simple python loop, as top of RocketPy. described below: The Monte Carlo method here is applied by running a signifi- 1 simulation_results = [] cant number of simulations where each iteration has a different 2 for factor in [0.5, 0.7, 0.9, 1.1, 1.3]: set of inputs that are randomly sampled given a previously 3 # remove previous fin set known probability distribution, for instance the mean and standard 4 ex_rocket.aerodynamicSurfaces.remove(fin_set) 5 fin_set = ex_rocket.addFins( deviation of a Gaussian distribution. Almost every input data 6 4, span=0.1, rootChord=0.120, tipChord=0.040, presents some kind of uncertainty, except for the number of fins or 7 distanceToCM=-1.04956 * factor propellant grains that a rocket presents. Moreover, some inputs, 8 ) 9 ex_flight = Flight( such as wind conditions, system failures, or the aerodynamic 10 rocket=ex_rocket, coefficient curves, may behave differently and must receive special 11 environment=env, treatment. 12 inclination=90, Statistical analysis can then be made on all the simulations, 13 heading=0, 14 maxTimeStep=0.01, with the main result being the 1σ , 2σ , and 3σ ellipses representing 15 maxTime=5, the possible area of impact and the area where the apogee is ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 223 reached (Fig. 5). All ellipses can be evaluated based on the method 22 export_flight_data(s, ex_flight) presented by [Che66]. 23 except Exception as E: 24 # if an error occurs, export the error 25 # message to a text file 26 print(E) 27 export_flight_error(s) Finally, the set of inputs for each simulation along with its set of outputs, are stored in a .txt file. This allows for long-term data storage and the possibility to append simulations to previously finished ones. The stored output data can be used to study the final probability distribution of key parameters, as illustrated on Fig. 6. Fig. 5: 1 1σ , 2 2σ , and 3 3σ dispersion ellipses for both apogee and landing point When performing the Monte Carlo simulations on RocketPy, all the inputs - i.e. the parameters along with their respective standard deviations - are stored in a dictionary. The randomized set of inputs is then generated using a yield function: 1 def sim_settings(analysis_params, iter_number): Fig. 6: Distribution of apogee altitude 2 i = 0 3 while i < iter_number: 4 # Generate a simulation setting Finally, it is also worth mentioning that all the information 5 sim_setting = {} generated in the Monte Carlo simulation is based on RocketPy 6 for p_key, p_value in analysis_params.items(): may be of utmost importance to safety and operational manage- 7 if type(p_value) is tuple: 8 sim_setting[p_key] = normal(*p_value) ment during rocket launches, once it allows for a more reliable 9 else: prediction of the landing site and apogee coordinates. 
10 sim_setting[p_key] = choice(p_value) 11 # Update counter 12 i += 1 Validation of the results: Unit, Dimensionality and Acceptance 13 # Yield a simulation setting 14 yield sim_setting Tests Validation is a big problem for libraries like RocketPy, where Where analysis_params is the dictionary with the inputs and true values for some results like apogee and maximum velocity iter_number is the total number of simulations to be performed. At is very hard to obtain or simply not available. Therefore, in that time the function yields one dictionary with one set of inputs, order to make RocketPy more robust and easier to modify, while which will be used to run a simulation. Later the sim_settings maintaining precise results, some innovative testing strategies have function is called again and another simulation is run until the loop iterations reach the number of simulations: been implemented. First of all, unit tests were implemented for all classes and 1 for s in sim_settings(analysis_params, iter_number): 2 # Define all classes to simulate with the current their methods ensuring that each function is working properly. 3 # set of inputs generated by sim_settings Given a set of different inputs that each function can receive, the 4 respective outputs are tested against expected results, which can be 5 # Prepare Environment 6 ex_env = Environment(.....) based on real data or augmented examples cases. The test fails if 7 # Prepare Motor the output deviates considerably from the established conditions, 8 ex_motor = SolidMotor(.....) or an unexpected error occurs along the way. 9 # Prepare Rocket Since RocketPy relies heavily on mathematical functions to 10 ex_rocket = Rocket(.....) 11 nose_cone = ex_rocket.addNose(.....) express the governing equations, implementation errors can occur 12 fin_set = ex_rocket.addFins(....) due to the convoluted nature of such expressions. Hence, to reduce 13 tail = ex_rocket.addTail(.....) the probability of such errors, there is a second layer of testing 14 15 # Considers any possible errors in the simulation which will evaluate if such equations are dimensionally correct. 16 try: To accomplish this, RocketPy makes use of the numericalunits 17 # Simulate Flight until Apogee library, which defines a set of independent base units as randomly- 18 ex_flight = Flight(.....) chosen positive floating point numbers. In a dimensionally-correct 19 20 # Function to export all output and input function, the units all cancel out when the final answer is divided 21 # data to a text file (.txt) by its resulting unit. And thus, the result is deterministic, not 224 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) random. On the other hand, if the function contains dimensionally- 1 def test_static_margin_dimension( incorrect equations, there will be random factors causing a 2 unitless_rocket, 3 unitful_rocket randomly-varying final answer. In practice, RocketPy runs two 4 ): calculations: one without numericalunits, and another with the 5 ... dimensionality variables. The results are then compared to assess 6 s1 = unitless_rocket.staticMargin(0) if the dimensionality is correct. 7 s2 = unitful_rocket.staticMargin(0) 8 assert abs(s1 - s2) < 1e-6 Here is an example. 
First, a SolidMotor object and a Rocket object are initialized without numericalunits: In case the value of interest has units, such as the position of the 1 @pytest.fixture center of pressure of the rocket, which has units of length, then 2 def unitless_solid_motor(): such value must be divided by the relevant unit for comparison: 3 return SolidMotor( 1 def test_cp_position_dimension( 4 thrustSource="Cesaroni_M1670.eng", 2 unitless_rocket, 5 burnOut=3.9, 3 unitful_rocket 6 grainNumber=5, 4 ): 7 grainSeparation=0.005, 5 ... 8 grainDensity=1815, 6 cp1 = unitless_rocket.cpPosition(0) 9 ... 7 cp2 = unitful_rocket.cpPosition(0) / m 10 ) 8 assert abs(cp1 - cp2) < 1e-6 11 12 @pytest.fixture If the assertion fails, we can assume that the formula responsible 13 def unitless_rocket(solid_motor): 14 return Rocket( for calculating the center of pressure position was implemented 15 motor=unitless_solid_motor, incorrectly, probably with a dimensional error. 16 radius=0.0635, Finally, some tests at a larger scale, known as acceptance 17 mass=16.241, tests, were implemented to validate outcomes such as apogee, 18 inertiaI=6.60, 19 inertiaZ=0.0351, apogee time, maximum velocity, and maximum acceleration when 20 distanceRocketNozzle=-1.255, compared to real flight data. A required accuracy for such values 21 distanceRocketPropellant=-0.85704, were established after the publication of the experimental data by 22 ... [CSA+ 21]. Such tests are crucial for ensuring that the code doesn’t 23 ) lose precision as a result of new updates. Then, a SolidMotor object and a Rocket object are initialized with These three layers of testing ensure that the code is trustwor- numericalunits: thy, and that new features can be implemented without degrading 1 import numericalunits the results. 2 3 @pytest.fixture 4 def m(): Conclusions 5 return numericalunits.m 6 RocketPy is an easy-to-use tool for simulating high-powered 7 rocket trajectories built with SciPy and the Python Scientific 8 @pytest.fixture Environment. The software’s modular architecture is based on 9 def kg(): four main classes and helper classes with well-documented code 10 return numericalunits.kg 11 that allows to easily adapt complex simulations to various needs 12 @pytest.fixture using the supplied Jupyter Notebooks. The code can be a useful 13 def unitful_motor(kg, m): tool during Rocket design and operation, allowing to calculate return SolidMotor( 14 of key parameters such as apogee and dynamic stability as well 15 thrustSource="Cesaroni_M1670.eng", 16 burnOut=3.9, as high-fidelity 6-DOF vehicle trajectory with a wide variety of 17 grainNumber=5, customizable parameters, from its launch to its point of impact. 18 grainSeparation=0.005 * m, RocketPy is an ever-evolving framework and is also accessible to 19 grainDensity=1815 * (kg / m**3), 20 ... anyone interested, with an active community maintaining it and 21 ) working on future features such as the implementation of other 22 engine types, such as hybrids and liquids motors, and even orbital 23 @pytest.fixture flights. 24 def unitful_rocket(kg, m, dimensionless_motor): 25 return Rocket( 26 motor=unitful_motor, Installing RocketPy 27 radius=0.0635 * m, 28 mass=16.241 * kg, RocketPy was made to run on Python 3.6+ and requires the 29 inertiaI=6.60 * (kg * m**2), packages: Numpy >=1.0, Scipy >=1.0 and Matplotlib >= 3.0. For 30 inertiaZ=0.0351 * (kg * m**2), a complete experience we also recommend netCDF4 >= 1.4. 
All 31 distanceRocketNozzle=-1.255 * m, 32 distanceRocketPropellant=-0.85704 * m, these packages, except netCDF4, will be installed automatically if 33 ... the user does not have them. To install, execute: 34 ) pip install rocketpy Then, to ensure that the equations implemented in both classes or (Rocket and SolidMotor) are dimensionally correct, the val- conda install -c conda-forge rocketpy ues computed can be compared. For example, the Rocket class computes the rocket’s static margin, which is a non-dimensional The source code, documentation and more examples are available value and the result from both calculations should be the same: at https://github.com/Projeto-Jupiter/RocketPy ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 225 Acknowledgments The authors would like to thank the University of São Paulo, for the support during the development of the current publication, and also all members of Projeto Jupiter and the RocketPy Team who contributed to the making of the RocketPy library. R EFERENCES [AEH+ 19] Adam Aitoumeziane, Peter Eusebio, Conor Hayes, Vivek Ra- machandran, Jamie Smith, Jayasurya Sridharan, Luke St Regis, Mark Stephenson, Neil Tewksbury, Madeleine Tran, and Hao- nan Yang. Traveler IV Apogee Analysis. Technical report, USC Rocket Propulsion Laboratory, Los Angeles, 2019. URL: http://www.uscrpl.com/s/Traveler-IV-Whitepaper. [Aki70] Hiroshi Akima. A new method of interpolation and smooth curve fitting based on local procedures. Journal of the ACM (JACM), 17(4):589–602, 1970. doi:10.1145/321607. 321609. [Bar67] James S Barrowman. The Practical Calculation of the Aero- dynamic Characteristics of Slender Finned Vehicles. PhD thesis, Catholic University of America, Washington, DC United States, 1967. [Che66] Victor Chew. Confidence, Prediction, and Tolerance Re- gions for the Multivariate Normal Distribution. Journal of the American Statistical Association, 61(315), 1966. doi: 10.1080/01621459.1966.10480892. [Cok98] J Coker. Thrustcurve.org — rocket motor performance data online, 1998. URL: https://www.thrustcurve.org/. [CSA+ 21] Giovani H Ceotto, Rodrigo N Schmitt, Guilherme F Alves, Lu- cas A Pezente, and Bruno S Carmo. Rocketpy: Six degree-of- freedom rocket trajectory simulator. Journal of Aerospace En- gineering, 34(6), 2021. doi:10.1061/(ASCE)AS.1943- 5525.0001331. [ISO75] ISO Central Secretary. Standard Atmosphere. Technical Report ISO 2533:1975, International Organization for Standardization, Geneva, CH, 5 1975. [MNK03] Robert C Martin, James Newkirk, and Robert S Koss. Agile software development: principles, patterns, and practices, vol- ume 2. Prentice Hall Upper Saddle River, NJ, 2003. [PdDKÜK83] Robert Piessens, Elise de Doncker-Kapenga, Christoph W Überhuber, and David K Kahaner. Quadpack: a subroutine package for automatic integration, volume 1. Springer Science & Business Media, 1983. doi:10.1007/978-3-642- 61786-7. [Pet83] Linda Petzold. Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equa- tions. SIAM Journal on Scientific and Statistical Computing, 4(1):136–148, 3 1983. doi:10.1137/0904010. [Rei22] A Reilley. openmotor: An open-source internal ballistics simulator for rocket motor experimenters, 2022. URL: https: //github.com/reilleya/openMotor. [RK16] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method. John Wiley & Sons, 2016. doi:10. 1002/9781118631980. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. 
Wailord: Parsers and Reproducibility for Quantum Chemistry

Rohit Goswami‡§∗
∗ Corresponding author: rog32@hi.is
‡ Science Institute, University of Iceland
§ Quansight Austin, TX, USA

Copyright © 2022 Rohit Goswami. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Data driven advances dominate the applied sciences landscape, with quantum chemistry being no exception to the rule. Dataset biases and human error are key bottlenecks in the development of reproducible and generalized insights. At a computational level, we demonstrate how changing the granularity of the abstractions employed in data generation from simulations can aid in reproducible work. In particular, we introduce wailord (https://wailord.xyz), a free-and-open-source python library to shorten the gap between data-analysis and computational chemistry, with a focus on the ORCA suite binaries. A two level hierarchy and exhaustive unit-testing ensure the ability to reproducibly describe and analyze "computational experiments". wailord offers both input generation, with enhanced analysis, and raw output analysis, for traditionally executed ORCA runs. The design focuses on treating output and input generation in terms of a mini domain specific language instead of more imperative approaches, and we demonstrate how this abstraction facilitates chemical insights.

Index Terms—quantum chemistry, parsers, reproducible reports, computational inference

Introduction

The use of computational methods for chemistry is ubiquitous, and few modern chemists retain the initial skepticism of the field [Koh99], [Sch86]. Machine learning has been further earmarked [MSH19], [Dra20], [SGT+ 19] as an effective accelerator for computational chemistry at every level, from DFT [GLL+ 16] to alchemical searches [DBCC16] and saddle point searches [ÁJ18]. However, these methods trade technical rigor for vast amounts of data, and so the ability to reproduce results becomes increasingly more important. Independently, the ability to reproduce results [Pen11], [SNTH13] has become a central concern in all fields of computational research, and has spawned a veritable flock of methodological and programmatic advances [CAB+ 19], including the sophisticated provenance tracking of AiiDA [PCS+ 16], [HZU+ 20].

Dataset bias

Dataset bias [EIS+ 20], [BS19], [RBA+ 19] has gained prominence in the machine learning literature, but has not yet percolated through to the chemical sciences community. At its core, the argument for dataset biases in generic machine learning problems of image and text classification can be linked to the difficulty in obtaining labeled results for training purposes. This is not an issue in the computational physical sciences at all, as the training data can often be labeled without human intervention. This is especially true when simulations are carried out at varying levels of accuracy. However, this also leads to a heavy reliance on high accuracy calculations on "benchmark" datasets and results [HMSE+ 21], [SEJ+ 19].

Compute is expensive, and the reproduction of data which is openly available is often hard to justify as a valid scientific endeavor. Rather than focus on the observable outputs of calculations, we instead assert that it is best to be able to have reproducible confidence in the elements of the workflow. In the following sections, we will outline wailord, a library which implements a two level structure for interacting with ORCA [Nee12] to implement an end-to-end workflow to analyze and prepare datasets. Our focus on ORCA is due to its rapid and responsive development cycles, the fact that it is free to use (but not open source), and its large repertoire of computational chemistry calculations. Notably, the black-box nature of ORCA (in that the source is not available) mirrors that of many other packages (which are not free) like VASP [Haf08]. Using ORCA, then, allows us to design a workflow which is best suited for working with many software suites in the community.

We shall understand wailord through the lens of what is often known as a design pattern in the practice of computational science and engineering, that is, a template or description for solving commonly occurring problems in the design of programs.

Structure and Implementation

Python has grown to become the lingua franca for much of the scientific community [Oli07], [MA11], in no small part because of its interactive nature. In particular, the REPL (read-evaluate-print-loop) structure which has been prioritized (from IPython to Jupyter) is one of the prime motivations for the use of Python as an exploratory tool. Additionally, PyPI, the Python package index, accelerates the widespread disambiguation of software packages. Thus wailord is implemented as a free and open source Python library.

Structure

Data generation involves a set of known configurations (say, xyz inputs) and a series of common calculations whose outputs are required. Computational chemistry packages tend to be focused on acceleration and setup details on a per-job scale. wailord, in contrast, considers the outputs of simulations to form a tree, where the actual run and its inputs are the leaves, and each layer of the tree structure holds information which is collated into a single dataframe which is presented to the user.
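As a rough illustration of this tree-to-dataframe idea, and not of wailord's actual API, the sketch below walks a hypothetical directory layout of completed runs, pulls one number out of each output file with a regular expression, and collates the results into a single pandas dataframe; the layout, file names, and the "FINAL SINGLE POINT ENERGY" pattern are assumptions made for the example.

import re
from pathlib import Path
import pandas as pd

# Hypothetical pattern: ORCA-style outputs report a line such as
# "FINAL SINGLE POINT ENERGY     -1.12345678"; adjust for the code being parsed.
ENERGY_RE = re.compile(r"FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)")

def harvest(root):
    """Walk a tree of runs (root/<basis>/<molecule>/run.out) into one dataframe."""
    rows = []
    for out in Path(root).glob("*/*/run.out"):
        matches = ENERGY_RE.findall(out.read_text())
        if not matches:
            continue  # skip runs that did not finish
        rows.append(
            {
                "basis": out.parts[-3],      # first layer of the tree
                "molecule": out.parts[-2],   # second layer of the tree
                "energy_hartree": float(matches[-1]),  # last match = final energy
            }
        )
    return pd.DataFrame(rows)

# df = harvest("buildOuts")
# print(df.groupby("basis")["energy_hartree"].min())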
Downstream tasks for simulations of chemical systems involve questions phrased as queries or comparative measures. With that in mind, wailord generates pandas dataframes which are indistinguishable from standard machine learning information sources, to trivialize the data-munging and preparation process. The outputs of wailord represent concrete information; it is not meant to store runs like the ASE database [LMB+ 17], nor to run a process to manage discrete workflows like AiiDA [HZU+ 20]. By construction, it also differs from existing "interchange" formats such as those favored by materials data repositories like the QCArchive project [SAB+ 21], and is partially close in spirit to the cclib endeavor [OTL08].

Implementation

Two classes form the backbone of the data-harvesting process. The intended point of interface with a user is the orcaExp class, which collects information from multiple ORCA outputs and produces dataframes which include relevant metadata (theory, basis, system, etc.) along with the requested results (energy surfaces, energies, angles, geometries, frequencies, etc.). A lower level "orca visitor" class is meant to parse each individual ORCA output. Until the release of ORCA 5, which promises structured property files, the outputs are necessarily parsed with regular expressions, but validated extensively. The focus on ORCA has allowed for more exotic helper functions, like the calculation of rate constants from orcaVis files. However, beyond this functionality offered by the quantum chemistry software (ORCA), a computational chemistry workflow requires data to be more malleable. To this end, the plain-text or binary outputs of quantum chemistry software must be further worked on (post-processed) to gain insights. This means, for example, that the outputs may be entered into a spreadsheet, or into a plain text note, or a lab notebook; in practice, programming languages are a good level of abstraction. Of the programming languages, Python, as a general purpose programming language with a high rate of community adoption, is a good starting place.

Python has a rich set of structures implemented in the standard library, which have been liberally used for structuring outputs. Furthermore, there have been efforts to convert the grammar of graphics [WW05] and tidy-data [WAB+ 19] approaches to the pandas package, which have also been adapted internally, including strict unit adherence using the pint library. The user is not burdened by these implementation details and is instead ensured a pandas data-frame for all operations, both at the orcaVis level and the orcaExp level.

Fig. 1: Some implemented workflows, including the two input YML files. VPT2 stands for second-order vibrational perturbation theory, Orca_vis objects are part of wailord's class structure, and PES stands for potential energy surface.

User Interface

The core user interface is depicted in Fig. 1. The test suites cover standard usage and serve as ad-hoc tutorials. Additionally, Jupyter notebooks are also able to effectively run wailord, which facilitates its use over SSH connections to high-performance-computing (HPC) clusters. The user is able to describe the nature of the calculations required in a simple YAML file format. A command line interface can then be used to generate inputs, or another YAML file may be passed to describe the paths needed. A very basic harness script for submissions is also generated, which can be rate limited to ensure optimal runs on an HPC cluster.

Design and Usage

A simulation study can be broken into:
• Inputs + Configuration for runs + Data for structures
• Outputs per run
• Post-processing and aggregation

From a software design perspective, it is important to recognize the right level of abstraction for the given problem. An object-oriented pattern is seen to be the correct design paradigm. However, though combining test driven development and object
oriented design is robust and extensible, the design of wailord Software industry practices have been followed throughout the is meant to tackle the problem at the level of a domain specific development process. In particular, the entire package is written in language. Recall from formal language theory [AA07] the fact a test-driven-development (TDD) fashion which has been proven that a grammar is essentially meant to specify the entire possible many times over for academia [DJS08] and industry [BN06]. set of inputs and outputs for a given language. A grammar can In essence, each feature is accompanied by a test-case. This is be expressed as a series of tokens (terminal symbols) and non- meant to ensure that once the end-user is able to run the test- terminal (syntactic variables) symbols along with rules defining suite, they are guaranteed the features promised by the software. valid combinations of these. Additionally, this means that potential bugs can be submitted It may appear that there is little but splitting hairs between as a test case which helps isolate errors for fixes. Furthermore, parsing data line by line as is traditionally done in libraries, com- software testing allows for coverage metrics, thereby enhancing pared to defining the exact structural relations between allowed user and development confidence in different components of any symbols. However, this design, apart from disallowing invalid large code-base. inputs, also makes sense from a pedagogical perspective. 228 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) For example, of the inputs, structured data like configurations Usage is then facilitated by a high-level call. (XYZ formats) are best handled by concrete grammars, where waex.cookies.gen_base( each rule is followed in order: template="basicExperiment", absolute=False, grammar_xyz = Grammar( filen="./lab6/expCookieST_meth.yml", r""" ) meta = natoms ws coord_block ws? natoms = number The resulting directory tree can be sent to a High Performance coord_block = (aline ws)+ aline = (atype ws cline) Computing Cluster (HPC), and once executed via the generated atype = ~"[a-zA-Z]" / ~"[0-9]" run-script helper; locally analysis can proceed. cline = (float ws float ws float) mdat = waio.orca.genEBASet(Path("buildOuts") / \ float = pm number "." number "methylene", pm = ~"[+-]?" deci=4) number = ~"\\d+" print(mdat.to_latex(index=False, ws = ~"\\s*" caption="CH2 energies and angles \ """ at various levels of theory, with NUMGRAD")) ) In certain situations, ordering may be relevant as well (e.g. for gen- This definition maps neatly into the exact specification of an xyz erating curves of varying density functional theoretic complexity). file: This can be handled as well. 2 For the outputs, similar to the key ideas across signac, nix, H -2.8 2.8 0.1 spack and other tools, control is largely taken away from the user H -3.2 3.4 0.2 in terms of the auto-generated directory structure. The outputs of each run is largely collected through regular expressions, due to Where we recognize that the overarching structure is of the the ever changing nature of the outputs of closed source software. number of atoms, followed by multiple coordinate blocks followed Importantly, for a code which is meant to confer insights, by optional whitespace. We move on to define each coordinate the concept of units is key. wailord with ORCA has first class block as a line of one or many aline constructs, each of which support for units using pint. 
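Since first class unit support via pint is mentioned above, a small generic sketch may help; the numbers are invented, it assumes pint's bundled registry includes the atomic units, and it is not a transcript of wailord's own unit handling.

import pint

ureg = pint.UnitRegistry()

# Assumes pint's default definitions include atomic units (hartree, bohr);
# if not, they can be added with ureg.define().
scf_energy = -1.1175 * ureg.hartree      # e.g. a parsed SCF energy
bond_length = 0.74 * ureg.angstrom       # e.g. a parsed bond length

print(scf_energy.to("eV"))               # roughly -30.4 eV
print(bond_length.to("bohr"))            # roughly 1.40 bohr

# Dimensional mistakes fail loudly instead of propagating silently:
try:
    scf_energy + bond_length
except pint.DimensionalityError as err:
    print("caught:", err)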
is an atype with whitespace and three float values representing coordinates. Finally we define the positive, negative, numeric and Dissociation of H2 whitespace symbols to round out the grammar. This is the exact form of every valid xyz file. The parsimonious library allows As a concrete example, we demonstrate a popular pedagogical handling grammatical constructs in a Pythonic manner. exercise, namely to obtain the binding energy curves of the H2 However, the generation of inputs is facilitated through the molecule at varying basis sets and for the Hartree Fock, along with use of generalized templates for "experiments" controlled by the results of Kolos and Wolniewicz [KW68]. We first recognize, cookiecutter. This allows for validations on the workflow that even for a moderate 9 basis sets with 33 points, we expect during setup itself. around 1814 data points. Where each basis set requires a separate For the purposes of the simulation study, one "experiment" run, this is easily expected to be tedious. consists of multiple single-shot runs; each of which can take a Naively, this would require modifying and generating ORCA long time. input files. Concretely, the top-level "experiment" is controlled by a !UHF 3-21G ENERGY YAML file: %paras project_slug: methylene R = 0.4, 2.0, 33 # x-axis of H1 project_name: singlet_triplet_methylene end outdir: "./lab6" desc: An experiment to calculate singlet and triplet *xyz 0 1 states differences at a QCISD(T) level H 0.00 0.0000000 0.0000000 author: Rohit H {R} 0.0000000 0.0000000 year: "2020" * license: MIT orca_root: "/home/orca/" We can formulate the requirement imperatively as: orca_yml: "orcaST_meth.yml" qc: inp_xyz: "ch2_631ppg88_trip.xyz" active: True style: ["UHF", "QCISD", "QCISD(T)"] Where each run is then controlled individually. calculations: ["ENERGY"] # Same as single point or SP qc: basis_sets: active: True - 3-21G style: ["UHF", "QCISD", "QCISD(T)"] - 6-31G calculations: ["OPT"] - 6-311G basis_sets: - 6-311G* - 6-311++G** - 6-311G** xyz: "inp.xyz" - 6-311++G** spin: - 6-311++G(2d,2p) - "0 1" # Singlet - 6-311++G(2df,2pd) - "0 3" # Triplet - 6-311++G(3df,3pd) extra: "!NUMGRAD" xyz: "inp.xyz" viz: spin: molden: True - "0 1" chemcraft: True params: jobscript: "basejob.sh" - name: R WAILORD: PARSERS AND REPRODUCIBILITY FOR QUANTUM CHEMISTRY 229 range: [0.4, 2.00] points: 33 slot: xyz: True atype: "H" anum: 1 # Start from 0 axis: "x" extra: Null jobscript: "basejob.sh" This run configuration is coupled with an experiment setup file, similar to the one in the previous section. With this in place, generating a data-set of all the required data is fairly trivial. kolos = pd.read_csv( "../kolos_H2.ene", skiprows=4, header=None, names=["bond_length", "Actual Energy"], sep=" ", ) kolos['theory']="Kolos" expt = waio.orca.orcaExp(expfolder=Path("buildOuts") / "h2") h2dat = expt.get_energy_surface() Fig. 2: Plots generated from tidy principles for post-processing Finally, the resulting data can be plotted using tidy principles. wailord parsed outputs. imgname = "images/plotH2A.png" p1a = ( p9.ggplot( data=h2dat, mapping=p9.aes(x="bond_length", here has been applied to ORCA, however, the two level structure y="Actual Energy", has generalizations to most quantum chemistry codes as well. color="theory") Importantly, we note that the ideas expressed form a design ) pattern for interacting with a plethora of computational tools + p9.geom_point() + p9.geom_point(mapping=p9.aes(x="bond_length", in a reproducible manner. 
By defining appropriate scopes for y="SCF Energy"), our structured parsers, generating deterministic directory trees, color="black", alpha=0.1, along with a judicious use of regular expressions for output data shape='*', show_legend=True) harvesting, we are able to leverage tidy-data principles to analyze + p9.geom_point(mapping=p9.aes(x="bond_length", y="Actual Energy", the results of a large number of single-shot runs. color="theory"), Taken together, this tool-set and methodology can be used to data=kolos, generate elegant reports combining code and concepts together show_legend=True) + p9.scales.scale_y_continuous(breaks in a seamless whole. Beyond this, the interpretation of each = np.arange( h2dat["Actual Energy"].min(), computational experiment in terms of a concrete domain specific h2dat["Actual Energy"].max(), 0.05) ) language is expected to reduce the requirement of having to re-run + p9.ggtitle("Scan of an H2 \ benchmark calculations. bond length (dark stars are SCF energies)") + p9.labels.xlab("Bond length in Angstrom") + p9.labels.ylab("Actual Energy (Hatree)") Acknowledgments + p9.facet_wrap("basis") ) R Goswami thanks H. Jónsson and V. Ásgeirsson for discussions p1a.save(imgname, width=10, height=10, dpi=300) on the design of computational experiments for inference in Which gives rise to the concise representation Fig. 2 from which computation chemistry. This work was partially supported by the all required inference can be drawn. Icelandic Research Fund, grant number 217436052. In this particular case, it is possible to see the deviations from the experimental results at varying levels of theory for different R EFERENCES basis sets. [AA07] Alfred V. Aho and Alfred V. Aho, editors. Compilers: Principles, Techniques, & Tools. Pearson/Addison Wesley, Boston, 2nd ed edition, 2007. Conclusions [ÁJ18] Vilhjálmur Ásgeirsson and Hannes Jónsson. Exploring Potential Energy Surfaces with Saddle Point Searches. In Wanda Andreoni We have discussed wailord in the context of generating, in and Sidney Yip, editors, Handbook of Materials Modeling, pages a reproducible manner the structured inputs and output datasets 1–26. Springer International Publishing, Cham, 2018. doi: which facilitate chemical insight. The formulation of bespoke 10.1007/978-3-319-42913-7_28-1. datasets tailored to the study of specific properties across a wide [BN06] Thirumalesh Bhat and Nachiappan Nagappan. Evaluating the efficacy of test-driven development: Industrial case studies. In range of materials at varying levels of theory has been shown. Proceedings of the 2006 ACM/IEEE International Symposium The test-driven-development approach is a robust methodology on Empirical Software Engineering, ISESE ’06, pages 356–363, for interacting with closed source software. The design patterns New York, NY, USA, September 2006. Association for Comput- expressed, of which the wailord library is a concrete imple- ing Machinery. doi:10.1145/1159733.1159787. [BS19] Avrim Blum and Kevin Stangl. Recovering from Biased Data: mentation, is expected to be augmented with more workflows, in Can Fairness Constraints Improve Accuracy? arXiv:1912.01094 particular, with a focus on nudged elastic band. The methodology [cs, stat], December 2019. arXiv:1912.01094. 230 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [CAB+ 19] The Turing Way Community, Becky Arnold, Louise Bowler, [Nee12] Frank Neese. The ORCA program system. 
WIREs Computa- Sarah Gibson, Patricia Herterich, Rosie Higman, Anna Krys- tional Molecular Science, 2(1):73–78, 2012. doi:10.1002/ talli, Alexander Morley, Martin O’Reilly, and Kirstie Whitaker. wcms.81. The Turing Way: A Handbook for Reproducible Data Science. [Oli07] T. E. Oliphant. Python for Scientific Computing. Comput- Zenodo, March 2019. ing in Science Engineering, 9(3):10–20, May 2007. doi: [DBCC16] Sandip De, Albert P. Bartók, Gábor Csányi, and Michele 10/fjzzc8. Ceriotti. Comparing molecules and solids across struc- [OTL08] Noel M. O’boyle, Adam L. Tenderholt, and Karol M. tural and alchemical space. Physical Chemistry Chemical Langner. Cclib: A library for package-independent computa- Physics, 18(20):13754–13769, May 2016. doi:10.1039/ tional chemistry algorithms. Journal of Computational Chem- C6CP00415F. istry, 29(5):839–845, 2008. doi:10.1002/jcc.20823. [DJS08] Chetan Desai, David Janzen, and Kyle Savage. A survey [PCS+ 16] Giovanni Pizzi, Andrea Cepellotti, Riccardo Sabatini, Nicola of evidence for test-driven development in academia. ACM Marzari, and Boris Kozinsky. AiiDA: Automated interactive SIGCSE Bulletin, 40(2):97–101, June 2008. doi:10.1145/ infrastructure and database for computational science. Compu- 1383602.1383644. tational Materials Science, 111:218–230, January 2016. doi: [Dra20] Pavlo O. Dral. Quantum Chemistry in the Age of Ma- 10.1016/j.commatsci.2015.09.013. chine Learning. The Journal of Physical Chemistry Let- [Pen11] Roger D. Peng. Reproducible Research in Computational Sci- ters, 11(6):2336–2347, March 2020. doi:10.1021/acs. ence. Science, 334(6060):1226–1227, December 2011. doi: jpclett.9b03664. 10/fdv356. [EIS+ 20] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris [RBA+ 19] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. Statistical Bias in Dataset Replication. arXiv:2005.09619 [cs, On the Spectral Bias of Neural Networks. In Proceedings of stat], May 2020. arXiv:2005.09619. the 36th International Conference on Machine Learning, pages [GLL+ 16] Ting Gao, Hongzhi Li, Wenze Li, Lin Li, Chao Fang, Hui Li, Li- 5301–5310. PMLR, May 2019. Hong Hu, Yinghua Lu, and Zhong-Min Su. A machine learning [SAB+ 21] Daniel G. A. Smith, Doaa Altarawy, Lori A. Burns, Matthew correction for DFT non-covalent interactions based on the S22, Welborn, Levi N. Naden, Logan Ward, Sam Ellis, Benjamin P. S66 and X40 benchmark databases. Journal of Cheminformatics, Pritchard, and T. Daniel Crawford. The MolSSI QCArchive 8(1):24, May 2016. doi:10.1186/s13321-016-0133-7. project: An open-source platform to compute, organize, and [Haf08] Jürgen Hafner. Ab-initio simulations of materials using VASP: share quantum chemistry data. WIREs Computational Molecular Density-functional theory and beyond. Journal of Computa- Science, 11(2):e1491, 2021. doi:10.1002/wcms.1491. tional Chemistry, 29(13):2044–2078, 2008. doi:10.1002/ [Sch86] Henry F. Schaefer. Methylene: A Paradigm for Computational jcc.21057. Quantum Chemistry. Science, 231(4742):1100–1107, March 1986. doi:10.1126/science.231.4742.1100. [HMSE+ 21] Johannes Hoja, Leonardo Medrano Sandonas, Brian G. Ernst, [SEJ+ 19] Andrew W. Senior, Richard Evans, John Jumper, James Kirk- Alvaro Vazquez-Mayagoitia, Robert A. DiStasio Jr., and Alexan- patrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, dre Tkatchenko. QM7-X, a comprehensive dataset of quantum- Alexander W. R. 
Nelson, Alex Bridgland, Hugo Penedones, mechanical properties spanning the chemical space of small Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, organic molecules. Scientific Data, 8(1):43, February 2021. David T. Jones, David Silver, Koray Kavukcuoglu, and Demis doi:10.1038/s41597-021-00812-2. Hassabis. Protein structure prediction using multiple deep neural [HZU+ 20] Sebastiaan P. Huber, Spyros Zoupanos, Martin Uhrin, Leopold networks in the 13th Critical Assessment of Protein Structure Talirz, Leonid Kahle, Rico Häuselmann, Dominik Gresch, Prediction (CASP13). Proteins: Structure, Function, and Bioin- Tiziano Müller, Aliaksandr V. Yakutovich, Casper W. Andersen, formatics, 87(12):1141–1148, 2019. doi:10.1002/prot. Francisco F. Ramirez, Carl S. Adorf, Fernando Gargiulo, Snehal 25834. Kumbhar, Elsa Passaro, Conrad Johnston, Andrius Merkys, An- [SGT+ 19] K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller, drea Cepellotti, Nicolas Mounet, Nicola Marzari, Boris Kozin- and R. J. Maurer. Unifying machine learning and quantum sky, and Giovanni Pizzi. AiiDA 1.0, a scalable computa- chemistry with a deep neural network for molecular wavefunc- tional infrastructure for automated reproducible workflows and tions. Nature Communications, 10(1):5024, November 2019. data provenance. Scientific Data, 7(1):300, September 2020. doi:10.1038/s41467-019-12875-2. doi:10.1038/s41597-020-00638-4. [SNTH13] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind [Koh99] W. Kohn. Nobel Lecture: Electronic structure of matter— Hovig. Ten Simple Rules for Reproducible Computational Re- wave functions and density functionals. Reviews of Modern search. PLOS Computational Biology, 9(10):e1003285, October Physics, 71(5):1253–1266, October 1999. doi:10.1103/ 2013. doi:10/pjb. RevModPhys.71.1253. [WAB+ 19] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston [KW68] W. Kolos and L. Wolniewicz. Improved Theoretical Ground- Chang, Lucy D’Agostino McGowan, Romain François, Garrett State Energy of the Hydrogen Molecule. The Journal of Chem- Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, ical Physics, 49(1):404–410, July 1968. doi:10.1063/1. Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, 1669836. Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, [LMB+ 17] Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Ivano E. Castelli, Rune Christensen, Marcin Du\lak, Jesper Kara Woo, and Hiroaki Yutani. Welcome to the Tidyverse. Friis, Michael N. Groves, Bjørk Hammer, Cory Hargus, Eric D. Journal of Open Source Software, 4(43):1686, November 2019. Hermes, Paul C. Jennings, Peter Bjerre Jensen, James Kermode, doi:10.21105/joss.01686. John R. Kitchin, Esben Leonhard Kolsbjerg, Joseph Kubal, Kris- [WW05] Leland Wilkinson and Graham Wills. The Grammar of Graph- ten Kaasbjerg, Steen Lysgaard, Jón Bergmann Maronsson, Tris- ics. Statistics and Computing. Springer, New York, 2nd ed tan Maxson, Thomas Olsen, Lars Pastewka, Andrew Peterson, edition, 2005. Carsten Rostgaard, Jakob Schiøtz, Ole Schütt, Mikkel Strange, Kristian S. Thygesen, Tejs Vegge, Lasse Vilhelmsen, Michael Walter, Zhenhua Zeng, and Karsten W. Jacobsen. The atomic simulation environment—a Python library for working with atoms. Journal of Physics: Condensed Matter, 29(27):273002, June 2017. doi:10.1088/1361-648X/aa680e. [MA11] K. J. Millman and M. Aivazis. Python for Scientists and Engineers. Computing in Science Engineering, 13(2):9–12, March 2011. 
doi:10/dc343g. [MSH19] Ralf Meyer, Klemens S. Schmuck, and Andreas W. Hauser. Machine Learning in Computational Chemistry: An Evalua- tion of Method Performance for Nudged Elastic Band Cal- culations. Journal of Chemical Theory and Computation, 15(11):6513–6523, November 2019. doi:10.1021/acs. jctc.9b00708. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 231 Variational Autoencoders For Semi-Supervised Deep Metric Learning Nathan Safir‡∗ , Meekail Zain§ , Curtis Godwin‡ , Eric Miller‡ , Bella Humphrey§ , Shannon P Quinn§¶ F Abstract—Deep metric learning (DML) methods generally do not incorporate loss may help incorporate semantic information from unlabelled unlabelled data. We propose borrowing components of the variational autoen- sources. Second, we propose that the structure of the VAE latent coder (VAE) methodology to extend DML methods to train on semi-supervised space, as it is confined by a prior distribution, can be used to datasets. We experimentally evaluate the atomic benefits to the perform- ing induce bias in the latent space of a DML system. For instance, DML on the VAE latent space such as the enhanced ability to train using if we know a dataset contains N -many classes, creating a prior unlabelled data and to induce bias given prior knowledge. We find that jointly training DML with an autoencoder and VAE may be potentially helpful for some distribution that is a learnable mixture of N gaussians may help semi-suprevised datasets, but that a training routine of alternating between produce better representations. Third, we propose that performing the DML loss and an additional unsupervised loss across epochs is generally DML on the latent space of the VAE so that the DML task can unviable. be jointly optimized with the VAE to incorporate unlabelled data may help produce better representations. Index Terms—Variational Autoencoders, Metric Learning, Deep Learning, Rep- Each of the three improvement proposals will be evaluated resentation Learning, Generative Models experimentally. The improvement proposals will be evaluated by comparing a standard DML implementation to the same DML Introduction implementation: Within the broader field of representation learning, metric learning is an area which looks to define a distance metric which is smaller • jointly optimized with an autoencoder between similar objects (such as objects of the same class) and • while structuring the latent space around a prior distribu- larger between dissimilar objects. Oftentimes, a map is learned tion using the VAE’s KL-divergence loss term between the from inputs into a low-dimensional latent space where euclidean approximated posterior and prior distance exhibits this relationship, encouraged by training said • jointly optimized with a VAE map against a loss (cost) function based on the euclidean distance Our primary contribution is evaluating these three improve- between sets of similar and dissimilar objects in the latent space. ment proposals. Our secondary contribution is presenting the Existing metric learning methods are generally unable to learn results of the joint approaches for VAEs and DML for more recent from unlabelled data, which is problematic because unlabelled metric losses that have not been jointly optimized with a VAE in data is often easier to obtain and is potentially informative. previous literature. 
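As a concrete anchor for the distance-based loss described in the introduction above, the following is a minimal NumPy sketch of a triplet loss over Euclidean distances in the latent space; the embeddings and margin are invented for illustration, and this is a generic formulation rather than the authors' implementation.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy latent vectors: the anchor sits near the positive (same class)
# and far from the negative (different class), so the loss is small.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, -0.1])
negative = np.array([2.0, 2.0])
print(triplet_loss(anchor, positive, negative))  # ~0.0, well separated
print(triplet_loss(anchor, negative, positive))  # large, badly embedded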
We take inspiration from variational autoencoders (VAEs), a generative representation learning architecture, for using un- Related Literature labelled data to create accurate representations. Specifically, we look to evaluate three atomic improvement proposals that detail The goal of this research is to investigate how components of the how pieces of the VAE architecture can create a better deep metric variational autoencoder can help the performance of deep metric learning (DML) model on a semi-supervised dataset. From here, learning in semi supervised tasks. We draw on previous literature we can ascertain which specific qualities of how VAEs process to find not only prior attempts at this specific research goal but unlabelled data are most helpful in modifying DML methods to also work in adjacent research questions that proves insightful. train with semi-supervised datasets. In this review of the literature, we discuss previous related work First, we propose that the autoencoder structure of the VAE in the areas of Semi-Supervised Metric Learning and VAEs with helps the clustering of unlabelled points, as the reconstruction Metric Losses. * Corresponding author: nssafir@gmail.com Semi-Supervised Metric Learning ‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA There have been previous approaches to designing metric learning § Department of Computer Science, University of Georgia, Athens, GA 30602 architectures which incorporate unlabelled data into the metric USA ¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 learning training regimen for semi-supervised datasets. One of the USA original approaches is the MPCK-MEANS algorithm proposed by Bilenko et al. ([BBM04]), which adds a penalty for placing Copyright © 2022 Nathan Safir et al. This is an open-access article distributed labelled inputs in the same cluster which are of a different class under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the or in different clusters if they are of the same class. This penalty original author and source are credited. is proportional to the metric distance between the pair of inputs. 232 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Baghshah and Shouraki ([BS09]) also looks to impose similar also experiment with adding a (different) metric loss to the overall constraints by introducing a loss term to preserve locally linear VAE loss function. relationships between labelled and unlabelled data in the input Most recently, Grosnit et al. ([GTM+ 21]) leverage a new space. Wang et al. ([WYF13]) also use a regularizer term to training algorithm for combining VAEs and DML for Bayesian preserve the topology of the input space. Using VAEs, in a sense, Optimization and said algorithm using simple, contrastive, and draws on this theme: though there is not explicit term to enforce triplet metric losses. We look to build on this literature by also that the topology of the input space is preserved, a topology of testing a combined VAE DML architecture on more recent metric the inputs is intended to be learned through a low-dimensional losses, albeit using a simpler training regimen. manifold in the latent space. 
One more recent common general approach to this problem is to use the unlabelled data’s proximity to the labelled data Deep Metric Learning (DML) to estimate labels for unlabelled data, effectively transforming unlabelled data into labelled data. Dutta et al. ([DHS21]) and Li et Metric learning attempts to create representations for data by al. ([LYZ+ 19]) propose a model which uses affinity propagation training against the similarity or dissimilarity of samples. In a on a k-Nearest-Neighbors graph to label partitions of unlabelled more technical sense, there are two notable functions in DML data based on their closest neighbors in the latent space. Wu et al. systems. Function fθ is a neural network which maps the input ([WFZ20]) also look to assign pseudo-labels to unlabelled data, data X to the latent points Z (i.e. fθ : X 7→ Z, where θ is the but not through a graph-based approach. Instead, the proposed network parameters). Generally, Z exists in a space of much lower model looks to approximate "soft" pseudo-labels for unlabelled dimensionality than X (eg. X is a set of 28 × 28 pixel pictures such data from the metric learning similarity measure between the that X ⊂ R28×28 and Z ⊂ R10 ). embedding of unlabelled data and the center of each input of each The function D fθ (x, y) = D( fθ (x), fθ (y)) represents the dis- class of the labelled data. tance between two inputs x, y ∈ X. To create a useful embedding Several of the recent graph based approaches can be consid- model fθ , we would like for fθ to produce large values of D fθ (x, y) ered state-of-the-art for semi supervised metric learning. Li et. when x and y are dissimilar and for fθ to produce small values of al.’s paper states their methods achieve 98.9 percent clustering D fθ (x, y) when x and y are similar. In some cases, dissimilarity accuracy on the MNIST dataset with 10% labelled data, outper- and similarity can refer to when inputs are of different and the forming two similar state-of-the-art methods, DFCM ([ARJM18]) same classes, respectively. and SDEC ([RHD+ 19]), by roughly 8 points. Dutta et. al.’s method It is common for the Euclidean metric (i.e. the L2 metric) to also outperforms 5 other state for the R@1 metric (the "percentage be used as a distance function in metric learning. The generalized of test examples" that have at least one 1 "nearest neighbor from L p metric can be defined as follows, where z0 , z1 ∈ Rd . the same class.") by at leat 1.2 on the MNIST dataset, as well d as the Fashion-MNIST and CIFAR-10 datasets. It is difficult to D p (z0 , z1 ) = ||z0 − z1 || p = ( ∑ |z0i − z1i | p )1/p compare the two approaches as the evaluation metrics used in i=1 each paper differ. Li et al.’s paper has been cited rather heavily relative to other papers in the field and can be considered state If we have chosen fθ (a neural network) and the distance function of the art for semi-supervised DML on MNIST. The paper also D (the L2 metric), the remaining component to be defined in provides a helpful metric (98.9 percent clustering accuracy on the a metric learning system is the loss function for training f . In MNIST dataset with 10% labelled data) to use as a reference point practice, we will be using triplet loss ([SKP15]), one of the most for the results in this paper. common metric learning loss functions. VAEs with Metric Loss Methodology Some approaches to incorporating labelled data into VAEs use a metric loss to govern the latent space more explicitly. 
Lin et We look to discover the potential of applying components of the al. ([LDD+ 18]) model the intra-class invariance (i.e. the class- VAE methodology to DML systems. We test this through present- related information of a data point) and intra-class variance (i.e. ing incremental modifications to the basic DML architecture. Each the distinct features of a data point not unique to it’s class) modified architecture corresponds to an improvement proposal seperately. Like several other models in this section, this paper’s about how a specific part of the VAE training regime and loss proposed model incorporates a metric loss term for the latent function may be adapted to assist the performance of a DML vectors representing intra-class invariance and the latent vectors method for a semi-supervised dataset. representing both intra-class invariance and intra-class variance. The general method we will take for creating modified DML Kulkarni et al. ([KCJ20]) incorporate labelled information into models involves extending the training regimen to two phases, the VAE methodology in two ways. First, a modified architecture a supervised and unsupervised phase. In the supervised phase the called the CVAE is used in which the encoder and generator of the modified DML model behaves identically to the base DML model, VAE is not only conditioned on the input X and latent vector z, training on the same metric loss function. In the unsupervised respectively, but also on the label Y . The CVAE was introduced in phase, the DML model will train against an unsupervised loss previous papers ([SLY15]) ([DCGO19]). Second, the authors add inspired by the VAE. This may require extra steps to be added a metric loss, specifically a multi-class N-pair loss ([Soh16]), in to the DML architecture. In the pseudocode, s refers to boolean the overall loss function of the model. While it is unclear how the variable representing if the current phase is supervised. α is a CVAE technique would be adapted in a semi-supervised setting, hyperparameter which modulates the impact of the unsupervised as there is not a label Y associated with each datapoint X, we on total loss for the DML autoencoder. VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 233 Improvement Proposal 1 distribution instead of a point will allow us to calculate the KL divergence. We first look to evaluate the improvement proposal that adding In practice, we will be evaluating a DML model with a unit a reconstruction loss to a DML system can improve the quality prior and a DML model with a mixture of gaussians (GMM) prior. of clustering in the latent representations on a semi-supervised The latter model constructs the prior as a mixture of n gaussians – dataset. Reconstruction loss in and of itself enforces a similar each the vertice of the unit (i.e. each side is 2 units long) hypercube semantic mapping onto the latent space as a metric loss, but can in the latent space. The logvar of each component is set equal to be computed without labelled data. In theory, we believe that the one. Constructing the prior in this way is beneficial in that it is added constraint that the latent vector must be reconstructed to ensured that each component is evenly spaced within the latent approximate the original output will train the spatial positioning space, but is limiting in that there must be exactly 2d components to reflect semantic information. Following this reasoning, obser- in the GMM prior. 
Thus, to test, we will test a dataset with 10 vations which share similar semantic information, specifically classes on the latent space dimensionality of 4, such that there observations of the same class (even if not labelled as such), are 24 = 16 gaussian components in the GMM prior. Though the should intuitively be positioned nearby within the latent space. To number of prior components is greater than the number of classes, test if this intuition occurs in practice, we evaluate if a DML model the latent mapping may still exhibit the pattern of classes forming with an autoencoder structure and reconstruction loss (described in clusters around the prior components as the extra components may further detail below) will perform better than a plain DML model be made redundant. in terms of clustering quality. This will be especially evident for The drawback of the decision to set the GMM components’ semi-supervised datasets in which the amount of labelled data is means to the coordinates of the unit hypercube’s vertices is that not feasible for solely supervised DML. the manifold of the chosen dataset may not necessarily exist in 4 Given a semi-supervised dataset, we assume a standard DML dimensions. Choosing gaussian components from a d-dimensional system will use only the labelled data and train given a metric loss hypersphere in the latent space R d would solve this issue, but Lmetric (see Algorithm 1). Our modified model DML Autoencoder there does not appear to be a solution for choosing n evenly spaced will extend the DML model’s training regime by adding a decoder points spanning d dimensions on a d-dimensional hypersphere. network which takes the latent point z as input and produces an KL Divergence is calculated with a monte carlo approximation output x̂. The unsupervised loss LU is equal to the reconstruction for the GMM and analytically with the unit prior. loss. Improvement Proposal 3 Improvement Proposal 2 The third improvement proposal we look to evaluate is that given a semi-supervised dataset, optimizing a DML model jointly Say we are aware that a dataset has n classes. It may be useful with a VAE on the VAE’s latent space will produce superior to encourage that there are n clusters in the latent space of a clustering than the DML model individually. The intuition behind DML model. This can be enforced by using a prior distribution this approach is that DML methods can learn from only supervised containing n many Gaussians. As we wish to measure only data and VAE methods can learn from only unsupervised data; the the affect of inducing bias on the representation without adding proposed methodology will optimize both tasks simultaneously to any complexity to the model, the prior distribution will not be learn from both supervised and unsupervised data. learnable (unlike VAE with VampPrior). By testing whether the The MetricVAE implementation we create jointly optimizes classes of points in the latent space are organized along the prior the VAE task and DML task on the VAE latent space. The components we can test whether bias can be induced using a unsupervised loss is set to the VAE loss. The implementation uses prior to constrain the latent space of a DML. By testing whether the VAE with VampPrior model instead of the vanilla VAE. clustering improves performance, we can evaluate whether this inductive bias is helpful. 
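The hypercube-vertex prior and the Monte Carlo KL estimate discussed above can be sketched as follows; this is an illustrative reconstruction using torch.distributions under stated assumptions (equal mixture weights and unit-variance components), not the authors' code.

import itertools
import torch
from torch import distributions as D

def hypercube_gmm_prior(d):
    """Equal-weight GMM with 2**d components at the vertices of the
    hypercube [-1, 1]**d; unit variance is assumed for this sketch."""
    vertices = torch.tensor(list(itertools.product([-1.0, 1.0], repeat=d)))
    mix = D.Categorical(torch.ones(len(vertices)))
    comp = D.Independent(D.Normal(vertices, torch.ones_like(vertices)), 1)
    return D.MixtureSameFamily(mix, comp)

def mc_kl(q_mu, q_logvar, prior, n_samples=64):
    """Monte Carlo estimate of KL(q || prior) for a diagonal-Gaussian posterior."""
    q = D.Independent(D.Normal(q_mu, (0.5 * q_logvar).exp()), 1)
    z = q.rsample((n_samples,))
    return (q.log_prob(z) - prior.log_prob(z)).mean()

prior = hypercube_gmm_prior(d=4)   # 2**4 = 16 components for a 10-class dataset
q_mu = torch.zeros(8, 4)           # a batch of 8 approximate posteriors
q_logvar = torch.zeros(8, 4)
print(mc_kl(q_mu, q_logvar, prior))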
Results Given a fully supervised dataset, we assume a standard DML system will use only the labelled data and train given a metric loss Experimental Configuration Lmetric . Our modified model will extend the DML system’s training Each set of experiments shares a similar hyperparameter search regime by setting the unsupervised loss to a KL divergence term space. Below we describe the hyperparameters that are included that measures the difference between posterior distributions and in the search space of each experiment and the evaluation method. a prior distribution. It should also be noted that, like the VAE Learning Rate (lr): Through informal experimentation, we encoder, we will map the input not to a latent point but to a have found that the learning rate of 0.001 causes the models to latent distribution. The latent point is stochastically sampled from converge consistently (relative to 0.005 and 0.0005). The learning the latent distribution during training. Mapping the input to a rate is thus set to 0.001 in each experiment. 234 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 235 Latent Space Dimensionality (lsdim): Latent space dimen- ([YSW+ 21]). The MNIST and OrganAMNIST datasets are similar sionality refers to the dimensionality of the vector output of the in dimensionality (1 x 28 x 28), number of samples (60,000 and encoder of a DML network or the dimensionality of the posterior 58,850, respectively) and in that they are both greyscale. distribution of a VAE (also the dimensionality of the latent space). Evaluation: We evaluate the results by running each model When the latent space dimensionality is 2, we see the added benefit on a test partition of data. We then take the latent points Z of creating plots of the latent representations (though we can generated by the model and the corresponding labels Y . Three accomplish this through using dimensionality reduction methods classifiers (sklearn’s implementation of RandomForest, MLP, and like tSNE for higher dimensionalities as well). Example values for kNN) each output predicted labels Ŷ for the latent points. In this hyperparameter used in experiments are 2, 4, and 10. most of the charts shown, however, we only include the kNN Alpha: Alpha (α) is a hyperapameter which refers to the classification output due to space constraints and the lack of balance between the unsupervised and supervised losses of some meaningful difference between the output for each classifier. We of the modified DML models. More details about the role of α finally measure the quality of the predicted labels Ŷ using the in the model implementations are discussed in the methodology Adjusted Mutual Information Score (AMI) ([?]) and accuracy section of the model. Potential values for alpha are each between (which is still helpful but is also easier to interpret in some cases). 0 (exclusive) and 1 (inclusive). We do not include 0 in this set as if This scoring metric is common in research that looks to evaluate α is set to 0, the model is equivalent to the fully supervised plain clustering performance ([ZG21]) ([EKGB16]). We will be using DML model because the supervised loss would not be included. If sklearn’s implementation of AMI ([PVG+ 11]). The performance α is set to 1, then the model would train on only the unsupervised of a classifier on the latent points intuitively can be used as a loss; for instance if the DML Autoencoder had α set to 1, then the measure of quality of clustering. 
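The evaluation loop described above can be summarized in a short sketch; the latent points here are synthetic stand-ins, and while the kNN classifier, accuracy, and adjusted mutual information calls use sklearn as cited, the exact train/test handling in the paper may differ.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, adjusted_mutual_info_score

# Synthetic stand-ins for the latent points Z and labels Y produced by a model.
rng = np.random.default_rng(0)
Y = rng.integers(0, 10, size=2000)
Z = rng.normal(size=(2000, 4)) + Y[:, None]   # class-dependent offsets

Z_train, Z_test, Y_train, Y_test = train_test_split(Z, Y, random_state=0)
Y_hat = KNeighborsClassifier().fit(Z_train, Y_train).predict(Z_test)

print("accuracy:", accuracy_score(Y_test, Y_hat))
print("AMI:", adjusted_mutual_info_score(Y_test, Y_hat))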
model would be equivalent to an autoencoder. Partial Labels Percentage (pl%): The partial labels per- Improvement Proposal 1 Results: Benefits of Reconstruction Loss centage hyperparameter refers to the percentage of the dataset that In evaluating the first improvement proposal, we compare the is labelled and thus the size of the partion of the dataset that can performance of the plain DML model to the DML Autoencoder be used for labelled training. Of course, each of the datasets we model. We do so by comparing the performance of the plain use is fully labelled, so a partially labelled datset can be trivially DML system and the DML Autoencoder across a search space constructed by ignoring some of the labels. As the sizes of the containing the lsdim, alpha, and pl% hyperparameters and both dataset vary, each percentage can refer to a different number of datasets. labelled samples. Values for the partial label percentage we use In Table 1 and Table 2, we observe that for relatively small across experiments include 0.01, 0.1, and 10 (with each value amounts of labelled samples (the partial labels percentages of 0.01 referring to the percentage). and 0.1 correspond to 6 and 60 labelled samples respectively), Datasets: Two datasets are used for evaluating the models. the DML Autoencoder severely outperforms the DML model. The first dataset is MNIST ([LC10]), a very popular dataset However, when the number of labelled samples increases (the in machine learning containing greyscale images of handwritten partial labels percentage of 10 correspond to 6000 labelled sam- digits. The second dataset we use is the organ OrganAMNIST ples respectively), the DML model significantly outperforms the dataset from MedMNIST v2 ([YSW+ 21]). This dataset contains DML Autoencoder. This trend is not too surprising, as when there 2D slices from computed tomography images from the Liver is sufficient data to train unsupervised methods and insufficient Tumor Segmentation Benchmark – the labels correspond to the data to train supervised method, as is the case for the 0.01 and classification of 11 different body organs. The decision to use 0.1 partial label percentages, the unsupervised method will likely a second dataset was motivated because as the improvement perform better. proposals are tested over more datasets, the results supporting the The data looks to show that adding a reconstruction loss to a improvement proposals become more generalizable. The decision DML system can improve the quality of clustering in the latent to use the OrganAMNIST dataset specifically is motivated in representations on a semi-supervised dataset when there are small part due to the Quinn Research Group working on similar tasks amounts (roughly less than 100 samples) of labelled data and a for biomedical imaging ([ZRS+ 20]). It is also motivated in part sufficient quantity of unlabelled data. But an important caveat is because OrganAMNIST is a more difficult dataset, at least for that it is not convincing that the DML Autoencoder effectively the classfication task, as the leading accuracy for MNIST is .9991 combined the unsupervised and supervised losses to create a ([ALP+ 20]) while the leading accuracy for OrganAMNIST is .951 superior model, as a plain autoencoder (i.e. the DML Autoencoder 236 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 
1: Sample images from the MNIST (left) and OrganAMNIST of MedMNIST (right) datasets with α = 1) outperforms the DML for the partial labels percentage routine of alternating between supervised loss (in this case, metric of or less than 0.1% and underperforms the DML for the partial loss) and unsupervised (in this case, VAE loss) is not optimal for labels percentage of 10%. training the model. We have trained a seperate combined VAE and DML model Improvement Proposal 2 Results: Incorporating Inductive Bias with which trains on both the unsupervised and supervised loss each a Prior epoch instead of alternating between the two each epoch. In the In evaluating the second improvement proposal, we compare the results for this model, we see that an alpha value of over zero performance of the plain DML model to the DML with a unit prior (i.e. incorporating both the supervised metric loss into the overall and a DML with a GMM prior. The DML prior with the GMM MVAE loss function) can help improve performance especially prior will have 2^2 = 4 gaussian components when lsdim = 2 and among lower dimensionalities. Given our analysis of the data, we 2^4 = 16 components when lsdim = 4. Our broad intention is to see that incorporating the DML loss to the VAE is potentially see if changing the shape (specifically the number of components) helpful, but only when training the unsupervised and supervised of the prior can induce bias by affecting the pattern of embeddings. losses jointly. Even in that case, it is unclear whether the MVAE We hypothesize that when the GMM prior contains n components performs better than the corresponding DML model even if it does and n is slightly greater than or equal to the number of classes, perform better than the corresponding VAE model. each class will cluster around one of the prior components. We will test this for the GMM prior with 16 components (lsdim = 4) as Conclusion both the MNIST and MedMNIST datasets have 10 classes. We are unable to set the number of GMM components to 10 as our GMM Conclusion sampling method only allows for the number of components to In this work, we have set out to determine how DML can be equal a power of 2. Bseline models include a plain DML and a extended for semi-supervised datasets by borrowing components DML with a unit prior (the distribution N(0, 1)). of the variational autoencoder. We have formalized this approach In Table 3, it is very evident that across both datasets, the DML through defining three specific improvement proposals. To evalu- models with any prior distribution all devolve to the null model ate each improvement proposal, we have created several variations (i.e. the classifier is no better than random selection). From the of the DML model, such as the DML Autoencoder, DML with visualizations of the latent embeddings, we see that the embedded Unit/GMM Prior, and MVAE. We then tested the performance data for the DML models with priors appears completely random. of the models across several semi-supervised partitions of two In the case of the GMM prior, it also does not appear to take on the datasets, along with other configurations of hyperparameters. shape of the prior or reflect the number of components in the prior. We have determined from the analysis of our results, there This may be due to the training routine of the DML models. As is too much dissenting data to clearly accept any three of the the KL divergence loss, which can be said to "fit" the embeddings improvement proposals. 
For improvement proposal 1, while the to the prior, trains on alternating epochs with the supervised DML DML Autoencoder outperforms the DML for semisupervised loss, it is possible that the two losses are not balanced correctly datasets with small amounts of labelled data, it’s peformance is not during the training process. From the discussed results, it is fair consistently much better than that of a plain autoencoder which to state that adding a prior distribution to a DML model through uses no labelled data. For improvement proposal 2, each of the training the model on the KL divergence between the prior and DML models with an added prior performed extremely poorly, approximated posterior distributions on alternating epochs does is near or at the level of the null model. For improvement proposal not an effective way to induce bias in the latent space. 3, we see the same extremely poor performance from the MVAE models. Improvement Proposal 3 Results: Jointly Optimizing DML with VAE From the results in improvement proposals 1 and 3, we find To evaluate the third improvement proposal, we compare the that there may be potential in incorporating the autoencoder and performance of DMLs to MetricVAEs (defined in the previous VAE loss terms into DML systems. However, we were unable to chapter) across several metric losses. We run experiments for show that any of these improvement proposals would consistently triplet loss, supervised loss, and center loss DML and MetricVAE outperform the both the DML and fully unsupervised architectures models. To evaluate the improvement proposal, we will assess in semisupervised settings. We also found that the training routine whether the model performance improves for the MetricVAE over used for the improvement proposals, in which the loss function the DML for the same metric loss and other hyper parameters. would alternate between supervised and unsupervised losses each Like the previous improvement proposal, the proposed Metric- epoch, was not effective. This is especially evident in comparing VAE model does not perform better than the null model. As with the two combined VAE DML models for improvement proposal improvement proposal 2, it is possible this is because the training 3. VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 237 Fig. 2: Table 1: Comparison of the DML (left) and DML Autoencoder (right) models for the MNIST dataset. Bolded values indicate best performance for each partial labels percentage partition (pl%). Fig. 3: Table 2: Comparison of the DML (left) and DML Autoencoder (right) models for the MEDMNIST dataset.. Future Work R EFERENCES In the future, it would be worthwhile to evaluate these improve- [AHS20] Georgios Arvanitidis, Søren Hauberg, and Bernhard Schölkopf. Geometrically enriched latent spaces. arXiv preprint ment proposals using a different training routine. We have stated arXiv:2008.00565, 2020. doi:10.48550/arXiv.2008. previously that perhaps the extremely poor performance of the 00565. DML with a prior and MVAE models may be due to alternating [ALP+ 20] Sanghyeon An, Min Jun Lee, Sanglee Park, Heerin Yang, and on training against a supervised and unsupervised loss. Further Jungmin So. An ensemble of simple convolutional neural network models for MNIST digit recognition. CoRR, abs/2008.10400, research could look to develop or compare several different 2020. URL: https://arxiv.org/abs/2008.10400, arXiv:2008. training routines. One alternative would be alternating between 10400, doi:10.48550/arXiv.2008.10400. 
losses at each batch instead of each epoch. Another alternative, [ARJM18] Ali Arshad, Saman Riaz, Licheng Jiao, and Aparna Murthy. specifically for the MVAE, may be first training DML on labelled Semi-supervised deep fuzzy c-mean clustering for software fault prediction. IEEE Access, 6:25675–25685, 2018. doi:10. data, training a GMM on it’s outputs, and then using the GMM as 1109/ACCESS.2018.2835304. the prior distribution for the VAE. [BBM04] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrat- ing constraints and metric learning in semi-supervised clustering. Another potentially interesting avenue for future study is in In Proceedings of the twenty-first international conference on investigating a fourth improvement proposal: the ability to define Machine learning, page 11, 2004. doi:10.1145/1015330. a Riemannian metric on the latent space. Previous research has 1015360. shown a Riemannian metric can be computed on the latent space [BS09] Mahdieh Soleymani Baghshah and Saeed Bagheri Shouraki. of the VAE by computing the pull-back metric of the VAE’s Semi-supervised metric learning using pairwise constraints. In Twenty-First International Joint Conference on Artificial Intelli- decoder function ([AHS20]). Through the Riemannian metric we gence, 2009. could calculate metric losses such as triplet loss with a geodesic [DCGO19] Sara Dahmani, Vincent Colotte, Valérian Girard, and Slim Ouni. instead of euclidean distance. The geodesic distance may be a Conditional variational auto-encoder for text-driven expressive more accurate representation of similarity in the latent space than audiovisual speech synthesis. In INTERSPEECH 2019-20th Annual Conference of the International Speech Communication euclidean distance as it accounts for the structure of the input Association, 2019. doi:10.21437/interspeech.2019- data. 2848. 238 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Table 3: Comparison of the DML model (left) and the DML with prior models with a unit gaussian prior (center) and GMM prior (right) models for the MNIST dataset. Fig. 5: Comparison of latent spaces for DML with unit prior (left) and DML with GMM prior containing 4 components (right) for lsdim = 2 on OrganAMNIST dataset. The gaussian components are shown as black with the raidus equal to variance (1). There appears to be no evidence of the distinct gaussian components in the latent space on the right. It does appear that the unit prior may regularize the magnitude of the latent vectors Fig. 6: Graph of reconstruction loss (componenet of unsupervised loss) of MVAE across epochs. The unsupervised loss does not converge despite being trained on each epoch. Fig. 7: Table 4: Experiments performed on MVAE architecture across fully labelled MNIST dataset that trains on objective function L = LU +γ ∗LS on fully supervised dataset. The best results for the classification accuracy on the MVAE embeddings in a given latent-dimensionality are bolded. VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 239 [DHS21] Ujjal Kr Dutta, Mehrtash Harandi, and Chellu Chandra Sekhar. Semi-supervised metric learning: A deep resurrection. 2021. doi:10.48550/arXiv.2105.05061. [EKGB16] Scott Emmons, Stephen Kobourov, Mike Gallant, and Katy Börner. Analysis of network clustering algorithms and clus- ter quality metrics at scale. PloS one, 11(7):e0159161, 2016. doi:10.1371/journal.pone.0159161. 
[GTM+ 21] Antoine Grosnit, Rasul Tutunov, Alexandre Max Maraval, Ryan- Rhys Griffiths, Alexander I Cowen-Rivers, Lin Yang, Lin Zhu, Wenlong Lyu, Zhitang Chen, Jun Wang, et al. High-dimensional bayesian optimisation with variational autoencoders and deep metric learning. arXiv preprint arXiv:2106.03609, 2021. doi: 10.48550/arXiv.2106.03609. [KCJ20] Ajinkya Kulkarni, Vincent Colotte, and Denis Jouvet. Deep variational metric learning for transfer of expressivity in multi- speaker text to speech. In International Conference on Statistical Language and Speech Processing, pages 157–168. Springer, 2020. doi:10.1007/978-3-030-59430-5_13. [LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL: http://yann.lecun.com/exdb/mnist/ [cited 2016-01-14 14:24:11]. [LDD+ 18] Xudong Lin, Yueqi Duan, Qiyuan Dong, Jiwen Lu, and Jie Zhou. Deep variational metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 689–704, 2018. doi:10.1007/978-3-030-01267-0_42. [LYZ+ 19] Xiaocui Li, Hongzhi Yin, Ke Zhou, Hongxu Chen, Shazia Sadiq, and Xiaofang Zhou. Semi-supervised clustering with deep metric learning. In International Conference on Database Systems for Advanced Applications, pages 383–386. Springer, 2019. doi: 10.1007/978-3-030-18590-9_50. [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [RHD+ 19] Yazhou Ren, Kangrong Hu, Xinyi Dai, Lili Pan, Steven CH Hoi, and Zenglin Xu. Semi-supervised deep embedded clustering. Neu- rocomputing, 325:121–130, 2019. doi:10.1016/j.neucom. 2018.10.016. [SKP15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clus- tering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. doi: 10.1109/cvpr.2015.7298682. [SLY15] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning struc- tured output representation using deep conditional generative models. Advances in neural information processing systems, 28:3483–3491, 2015. [Soh16] Kihyuk Sohn. Improved deep metric learning with multi-class n- pair loss objective. In Advances in neural information processing systems, pages 1857–1865, 2016. [WFZ20] Sanyou Wu, Xingdong Feng, and Fan Zhou. Metric learning by similarity network for deep semi-supervised learning. In Developments of Artificial Intelligence Technologies in Compu- tation and Robotics: Proceedings of the 14th International FLINS Conference (FLINS 2020), pages 995–1002. World Scientific, 2020. doi:10.1142/9789811223334_0120. [WYF13] Qianying Wang, Pong C Yuen, and Guocan Feng. Semi- supervised metric learning via topology preserving multiple semi- supervised assumptions. Pattern Recognition, 46(9):2576–2587, 2013. doi:10.1016/j.patcog.2013.02.015. [YSW+ 21] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. arXiv preprint arXiv:2110.14795, 2021. doi:10.48550/arXiv.2110.14795. [ZG21] Zhen Zhu and Yuan Gao. Finding cross-border collaborative centres in biopharma patent networks: A clustering comparison approach based on adjusted mutual information. 
In International Conference on Complex Networks and Their Applications, pages 62–72. Springer, 2021. doi:10.1007/978-3-030-93409- 5_6. [ZRS+ 20] Meekail Zain, Sonia Rao, Nathan Safir, Quinn Wyner, Isabella Humphrey, Alexa Eldridge, Chenxiao Li, BahaaEddin AlAila, and Shannon P. Quinn. Towards an unsupervised spatiotemporal representation of cilia video using a modular generative pipeline. 2020. doi:10.25080/majora-342d178e-017. 240 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) A Python Pipeline for Rapid Application Development (RAD) Scott D. Christensen‡∗ , Marvin S. Brown‡ , Robert B. Haehnel‡ , Joshua Q. Church‡ , Amanda Catlett‡ , Dallon C. Schofield‡ , Quyen T. Brannon‡ , Stacy T. Smith‡ F Abstract—Rapid Application Development (RAD) is the ability to rapidly pro- Python ecosystem provides a rich set of tools that can be applied to totype an interactive interface through frequent feedback, so that it can be various data sources to provide valuable insights. These insitghts quickly deployed and delivered to stakeholders and customers. RAD is a critical can be integrated into decision support systems that can enhance capability needed to meet the ever-evolving demands in scientific research and the information available when making mission critical decisions. data science. To further this capability in the Python ecosystem, we have curated Yet, while the opportunities are vast, the ability to get the resources and developed a set of open-source tools, including Panel, Bokeh, and Tethys Platform. These tools enable prototyping interfaces in a Jupyter Notebook and necessary to pursue those opportunities requires effective and facilitate the progression of the interface into a fully-featured, deployable web- timely communication of the value and feasibility of a proposed application. project. We have found that rapid prototyping is a very impactful way Index Terms—web app, Panel, Tethys, Tethys Platform, Bokeh, Jupyter to concretely show the value that can be obtained from a proposal. Moreover, it also illustrates with clarity that the project is feasible and likely to succeed. Many scientific workflows are developed in Introduction Python, and often the prototyping phase is done in a Jupyter Note- With the tools for data science continually improving and an al- book. The Jupyter environment provides an easy way to quickly most innumerable supply of new data sources, there are seemingly modify code and visualize output. However, the visualizations are endless opportunities to create new insights and decision support interlaced with the code and thus it does not serve as an ideal way systems. Yet, an investment of resources are needed to extract demonstrate the prototype to stakeholders, that may not be familiar the value from data using new and improved tools. Well-timed with Jupyter Notebooks or code. The Jupyter Dashboard project and impactful proposals are necessary to gain the support and was addressing this issue before support for it was dropped in resources needed from stakeholders and decision makers to pursue 2017. To address this technical gap, we worked with the Holoviz these opportunities. The ability to rapidly prototype capabilities team to develop the Panel library. [Panel] Panel is a high-level and new ideas provides a powerful visual tool to communicate Python library for developing apps and dashboards. It enables the impact of a proposal. 
Interactive applications are even more building layouts with interactive widgets in a Jupyter Notebook impactful by engaging the user in the data analysis process. environment, but can then easily transition to serving the same After a prototype is implemented to communicate ideas and code on a standalone secure webserver. This capability enabled feasibility of a project, additional success is determined by the us to rapidly prototype workflows and dashboards that could be ability to produce the end product on time and within budget. directly accessed by potential sponsors. If the deployable product needs to be completely re-written using Panel makes prototyping and deploying simple. It can also different tools, programing languages, and/or frameworks from the be iterative. As new features are developed we can continue to prototype, then significantly more time and resources are required. work in the Jupyter Notebook environment and then seamlessly The ability to quickly mature a prototype to production-ready transition the new code to a deployed application. Since appli- application using the same tool stack can make the difference in cations continue to mature they often require production-level the success of a project. features. Panel apps are deployed via Bokeh, and the Bokeh framework lacks some aspects that are needed in some production Background applications (e.g. a user management system for authentication and permissions, and a database to persist data beyond a session). At the US Army Engineer Research and Development Center Bokeh doesn’t provide either of these aspects natively. (ERDC) there are evolving needs to support the missions of the Tethys Platform is a Django-based web framework that is US Army Corps of Engineers and our partners. The scientific geared toward making scientific web applications easier to de- velop by scientists and engineers. [Swain] It provides a Python * Corresponding author: Scott.D.Christensen@usace.army.mil ‡ US Army Engineer Research and Development Center Software Development Kit (SDK) that enables web apps to be created almost purely in Python, while still leaving the flexibility Copyright © 2022 Scott D. Christensen et al. This is an open-access article to add custom HTML, JavaScript, and CSS. Tethys provides distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, user management and role-based permissions control. It also provided the original author and source are credited. enables database persistence and computational job management A PYTHON PIPELINE FOR RAPID APPLICATION DEVELOPMENT (RAD) 241 [Christensen], in addition to many visualization tools. Tethys of- fers the power of a fully-featured web framework without the need to be an expert in full-stack web development. However, Tethys lacks the ease of prototyping in a Jupyter Notebook environment that is provided by Panel. To support both the rapid prototyping capability provided by Panel and the production-level features of Tethys Platform, we needed a pipeline that could take our Panel-based code and integrate it into the Tethys Platform framework. Through collaborations with the Bokeh development team and developers at Aquaveo, LLC, we were able to create that integration of Panel (Bokeh) and Tethys. This paper demonstrates the seamless pipeline that facilitates Rapid Application Development (RAD). 
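To make the prototype-then-deploy pattern described above concrete, here is a minimal sketch using the public Panel API. It is not code from the ERDC applications; the widget name, function name, and file name are placeholders.

    import panel as pn

    pn.extension()

    speed = pn.widgets.FloatSlider(name="Speed", start=0.0, end=100.0, value=50.0)

    def summarize(value):
        # Placeholder standing in for a call to a real model
        return f"Model summary at speed = {value:.1f}"

    app = pn.Column("# Prototype dashboard", speed, pn.bind(summarize, speed))

    app             # renders inline in a Jupyter Notebook during prototyping
    app.servable()  # the same object can be deployed with `panel serve app.py`

The same layout object works in both settings, which is the property that the pipeline described here relies on.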
In the next section we describe how the RAD pipeline is used at the ERDC for a particular use case, but first we will provide some background on the use case itself. Fig. 1: Collective Sweep Inputs Stage rendered in a Jupyter Notebook. Use Case Helios is a computational fluid dynamics (CFD) code for simulat- ing rotorcraft. It is very computationally demanding and requires High Performance Computing (HPC) resources to execute any- thing but the most basic of models. At the ERDC we often face a need to run parameter sweeps to determine the affects of varying a particular parameter (or set of parameters). Setting up a Helios model to run on the HPC is a somewhat involved process that requires file management and creating a script to submit the job to the queueing system. When executing a parameter sweep the process becomes even more cumbersome, and is often avoided. While tedeous to perform manually, the process of modifying input files, transferring to the HPC, and generating and submitting job scripts to the the HPC queueing system can be automated with Python. Furthermore, it can be made much more accessible, even to those without extensive knowledge of how Helios works, through a web-based interface. Methods To automate the process of submitting Helios model parameter sweeps to the HPC via a simple interactive web application Fig. 2: Collective Sweep Inputs Stage rendered as a stand-alone we developed and used the RAD pipeline. Initially three Helios Bokeh app. parameter sweep workflows were identified: 1) Collective Sweep API to execute commands on the login nodes of the DoD HPC 2) Speed Sweep systems. The PyUIT library provides a Python wrapper for the 3) Ensemble Analysis UIT+ REST API. Additionally, it provides Panel-based interfaces The process of submitting each of these workflows to the HPC for each of the workflow steps listed above. Panel refers to a was similar. They each involved the same basic steps: workflow comprised of a sequence of steps as a pipeline, and each step in the pipeline is called a stage. Thus, PyUIT provides a 1) Authentication to the HPC template stage class for each step in the basisc HPC workflow. 2) Connecting to a specific HPC system The PyUIT pipeline stages were customized to create inter- 3) Specifying the parameter sweep inputs faces for each of the three Helios workflows. Other than the 4) Submtting the job to the queuing system inputs stage, the rest of the stages are the same for each of the 5) Monitoring the job as it runs workflows (See figures 1, 2, and 3). The inputs stage allows the 6) Visualizing the results user to select a Helios input file and then provides inputs to allow In fact, these steps are essentially the same for any job being the user to specify the values for the parameter(s) that will be submitted to the HPC. To ensure that we were able to resuse varied in the sweep. Each of these stages was first created in a as much code as possible we created PyUIT, a generic, open- Jupyter Notebook. We were then able to deploy each workflow as source Python library that enables this workflow. The ability to a standalone Bokeh application. Finally we integrated the Panel- authenticate and connect to the DoD HPC systems is enabled based app into Tethys to leverage the compute job management by a service called User Interface Toolkit Plus (UIT+). [PyUIT] system and single-sign-on authentication. UIT+ provides an OAuth2 authentication service and a RESTful As additional features are required, we are able to leverage 242 PROC. 
OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 5: The Helios Tethys App is the framework for launching each of the three Panel-based Helios parameter sweep workflows. Fig. 3: Collective Sweep Inputs Stage rendered in the Helios Tethys App. the same pipeline: first developing the capability in a Jupyter Notebook, then testing with a Bokeh-served app, and finally, a full integration into Tethys. Results By integrating the Panel workflows into the Helios Tethys app we can take advantage of Tethys Platform features, such as the jobs table, which persists metadata about computational jobs in a database. Fig. 6: Actions associated with a job. The available actions depend on the job’s status. results to view. The pages that display the results are built with Panel, but Tethys enables them to be populated with information about the job from the database. Figure 7 shows the Tracking Data tab of the results viewer page. The plot is a dynamic Bokeh plot that enables the user to select the data to plot on each axis. This particular plot is showing the variation of the coeffient of drag of the fuselage body over the simulation time. Figure 8 shows what is called CoViz data, or data that is extracted from the solution as the model is running. This image is showing an isosurface colored by density. Fig. 4: Helios Tethys App home page showing a table of previously submitted Helios simulations. Conclusion The Helios Tethys App has demonstrated the value of the RAD pi- Each of the three workflows can be launched from the home pline, which enables both rapid prototyping and rapid progression page of the Helios Tethys app as shown in Figure 5. Although to production. This enables researchers to quickly communicate the home page was created in the Tethys framework, once the and prove ideas and deliver successful products on time. In workflows are launched the same Panel code that was previously addition to the Helios Tethys App, RAD has been instrumental developed is called to display the workflow (refer to figures 1, 2, for the mission success of various projects at the ERDC. and 3). From the Tethys Jobs Table different actions are available for each job including viewing results once the job has completed (see R EFERENCES 6). [Christensen] Christensen, S. D., Swain, N. R., Jones, N. L., Nelson, E. View job results is much more natural in the Tethys app. Helios J., Snow, A. D., & Dolder, H. G. (2017). A Comprehensive jobs often take multiple days to complete. By embedding the Python Toolkit for Accessing High-Throughput Computing to Support Large Hydrologic Modeling Tasks. JAWRA Journal Helios Panel workflows in Tethys users can leave the web app of the American Water Resources Association, 53(2), 333-343. (ending their session), and then come back later and pull up the https://doi.org/10.1111/1752-1688.12455 A PYTHON PIPELINE FOR RAPID APPLICATION DEVELOPMENT (RAD) 243 Fig. 7: Timeseries output associated with a Helios Speed Sweep run. Fig. 8: Isosurface visualization from a Helios Speed Sweep run. [Panel] https://www.panel.org [PyUIT] https://github.com/erdc/pyuit [Swain] Swain, N. R., Christensen, S. D., Snow, A. D., Dolder, H., Espinoza-Dávalos, G., Goharian, E., Jones, N. L., Ames, D.P., & Burian, S. J. (2016). A new open source platform for lowering the barrier for environmental web app development. Environmental Modelling & Software, 85, 11-26. https://doi. org/10.1016/j.envsoft.2016.08.003 244 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses

W. Scott Shambaugh (Corresponding author: wsshambaugh@gmail.com)

Abstract—This paper introduces monaco, a Python library for conducting Monte Carlo simulations of computational models, and performing uncertainty analysis (UA) and sensitivity analysis (SA) on the results. UA and SA are critical to effective and responsible use of models in science, engineering, and public policy, however their use is uncommon. By providing a simple, general, and rigorous-by-default library that wraps around existing models, monaco makes UA and SA easy and accessible to practitioners with a basic knowledge of statistics.

Index Terms—Monte Carlo, Modeling, Uncertainty Quantification, Uncertainty Analysis, Sensitivity Analysis, Decision-Making, Ensemble Prediction, VARS, D-VARS

Copyright © 2022 W. Scott Shambaugh. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Computational models form the backbone of decision-making processes in science, engineering, and public policy. However, our increased reliance on these models stands in contrast to the difficulty in understanding them as we add increasing complexity to try and capture ever more of the fine details of real-world interactions. Practitioners will often take the results of their large, complex model as a point estimate, with no knowledge of how uncertain those results are [FST16]. Multiple-scenario modeling (e.g. looking at a worst-case, most-likely, and best-case scenario) is an improvement, but a complete global exploration of the input space is needed. That gives insight into the overall distribution of results (UA) as well as the relative influence of the different input factors on the output variance (SA). This complete understanding is critical for effective and responsible use of models in any decision-making process, and policy papers have identified UA and SA as key modeling practices [ALMR20] [EPA09].

Despite the importance of UA and SA, recent literature reviews show that they are uncommon – in 2014 only 1.3% of all published papers [FST16] using modeling performed any SA. And even when performed, best practices are usually lacking – amongst papers which specifically claimed to perform sensitivity analysis, a 2019 review found only 21% performed global (as opposed to local or zero) UA, and 41% performed global SA [SAB+19].

Typically, UA and SA are done using Monte Carlo simulations, for reasons explored in the following section. There are Monte Carlo frameworks available, however existing options are largely domain-specific, focused on narrow sub-problems (i.e. integration), tailored towards training neural nets, or require a deep statistical background to use. See [OGA+20], [RJS+21], and [DSICJ20] for an overview of the currently available Python tools for performing UA and SA. For the domain expert who wants to perform UA and SA on their existing models, there is not an easy tool to do both in a single shot. monaco was written to address this gap.

Fig. 1: The monaco project logo.

Motivation for Monte Carlo Approach

Mathematical Grounding

Randomized Monte Carlo sampling offers a cure to the curse of dimensionality: consider an investigation of the output from k input factors y = f(x_1, x_2, ..., x_k), where each factor is uniformly sampled between 0 and 1, x_i ∈ U[0, 1]. The input space is then a k-dimensional hypercube with volume 1. If each input is varied one at a time (OAT), then the convex hull of the sampled points forms a hyperoctahedron with volume V = 1/k! (or, optimistically, a hypersphere with V = π^(k/2) / (2^k Γ(k/2 + 1))), both of which decrease super-exponentially as k increases. Unless the model is known to be linear, this leaves the input space wholly unexplored. In contrast, the volume of the convex hull of n → ∞ random samples, as is obtained with a Monte Carlo approach, will converge to V = 1, with much better coverage within that volume as well [DFM92]. See Fig. 2.

Fig. 2: Volume fraction V of a k-dimensional hypercube enclosed by the convex hull of n → ∞ random samples versus OAT samples along the principal axes of the input space.

Benefits and Drawbacks of Basic Monte Carlo Sampling

monaco focuses on forward uncertainty propagation with basic Monte Carlo sampling. This has several benefits:

• The method is conceptually simple, lowering the barrier of entry and increasing the ease of communicating results to a broader audience.
• The same sample points can be used for UA and SA. Generally, Bayesian methods such as Markov Chain Monte Carlo provide much faster convergence on UA quantities of interest, but their undersampling of regions that do not contribute to the desired quantities is inadequate for SA and complete exploration of the input space. The author's experience aligns with [SAB+19] in that there is great practical benefit in broad sampling without pigeonholing one's purview to particular posteriors, through uncovering bugs and edge cases in regions of input space that were not being previously considered.
• It can be applied to domains that are not data-rich. See for example NASA's use of Monte Carlo simulations during rocket design prior to collecting test flight data [HB10].

However, basic Monte Carlo sampling is subject to the classical drawbacks of the method, such as poor sampling of rare events and the slow σ/√n convergence on quantities of interest. If the outputs and regions of interest are firmly known at the outset, then other sampling methods will be more efficient [KTB13]. Additionally, given that any conclusions are conditional on the correctness of the underlying model and input parameters, the task of validation is critical to confidence in the UA and SA results. However, this is currently out of scope for the library and must be performed with other tools. In a data-poor domain, hypothesis testing or probabilistic prediction measures like loss scores can be used to anchor the outputs against a small number of real-life test data.

Fig. 3: Monte Carlo workflow for understanding the full behavior of a computational model, inspired by [SAB+19].

monaco Structure

Overall Structure

Broadly, each input factor and model output is a variable that can be thought of as lists (rows) containing the full range of randomized values. Cases are slices (columns) that take the i'th input and output value for each variable, and represent a single run of the model. Each case is run on its own, and the output values are collected into output variables. Fig. 4 shows a visual representation of this.
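As a quick numerical illustration of the OAT volume fractions from the Mathematical Grounding subsection above (a sketch, not part of the library), the following tabulates 1/k! and π^(k/2)/(2^k Γ(k/2 + 1)) for a few dimensions k.

    import math

    def oat_octahedron_fraction(k):
        # Hyperoctahedron spanned by OAT samples, as a fraction of the unit hypercube
        return 1.0 / math.factorial(k)

    def oat_sphere_fraction(k):
        # Optimistic hypersphere fraction: pi**(k/2) / (2**k * Gamma(k/2 + 1))
        return math.pi ** (k / 2) / (2 ** k * math.gamma(k / 2 + 1))

    for k in (2, 3, 5, 10, 20):
        print(f"k={k:2d}  octahedron={oat_octahedron_fraction(k):.3e}  "
              f"sphere={oat_sphere_fraction(k):.3e}")

Both fractions collapse toward zero rapidly with k, which is the argument made above for random sampling over OAT designs.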
More generally, the "inverse problem" of model and parameter validation is a deep field unto itself and [C+ 12] and [SLKW08] are recommended as overviews of some methods. If monaco’s scope is too limited for the reader’s needs, the author recommends UQpy [OGA+ 20] for UA and SA, and PyMC [SWF16] or Stan [CGH+ 17] as good general-purpose Fig. 4: Structure of a monaco simulation, showing the relationship probabilistic programming Python libraries. between the major objects and functions. This maps onto the central block in Fig. 3. Workflow UA and SA of any model follows a common workflow. Probability distributions for the model inputs are defined, and randomly Simulation Setup sampled values for a large number of cases are fed to the model. The base of a monaco simulation is the Sim object. This object The outputs from each case are collected and the full set of is formed by passing it a name, the number of random cases inputs and outputs can be analyzed. Typically, UA is performed ncases, and a dict fcns of the handles for three user-defined by generating histograms, scatter plots, and summary statistics for functions detailed in the next section. A random seed that then the output variables, and SA is performed by looking at the effect seeds the entire simulation can also be passed in here, and is of input on output variables through scatter plots, performing highly recommended for repeatability of results. regressions, and calculating sensitivity indices. These results can Input variables then need to be defined. monaco takes in the then be compared to real-world test data to validate the model or handle to any of scipy.stat’s continuous or discrete probability inform revisions to the model and input variables. See Fig. 3. distributions, as well as the required arguments for that probability Note that with model and input parameter validation currently distribution [VGO+ 20]. If nonnumeric inputs are desired, the outside monaco’s scope, closing that part of the workflow loop is method can also take in a nummap dictionary which maps the left up to the user. randomly drawn integers to values of other types. 246 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) At this point the sim can be run. The randomized drawing nonnumeric, a valmap dict assigning numbers to of input values, creation of cases, running of those cases, and each unique value is automatically generated. extraction of output values are automatically executed. 4) Calculate statistics & sensitivities for input & output User-Defined Functions variables. 5) Plot variables, their statistics, and sensitivities. The user needs to define three functions to wrap monaco’s Monte Carlo structure around their existing computational model. First Incorporating into Existing Workflows is a run function which either calls or directly implements their model. Second is a preprocess function which takes in a Case If the user wants to use existing workflows for generating, run- object, extracts the randomized inputs, and structures them with ning, post-processing, or examining results, any combination of any other invariant data to pass to the run function. Third is a monaco’s major steps can be replaced with external tooling by postprocess function which takes in a Case object as well as the saving and loading input and output variables to file. For example, results from the model, and extracts the desired output values. 
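To make the nonnumeric-input mechanism concrete, here is a hedged sketch. It assumes that addInVar accepts the nummap dictionary exactly as described above; apart from that assumption it reuses only calls shown in the paper (mc.Sim, addInVar, runSim, case.invals, case.addOutVal), and the function and variable names are hypothetical.

    import monaco as mc
    from scipy.stats import randint

    def mat_run(material):
        # Stand-in model: report the length of the material name
        return (len(material), )

    def mat_preprocess(case):
        return (case.invals['material'].val, )

    def mat_postprocess(case, name_length):
        case.addOutVal(name='Name Length', val=name_length)
        return None

    fcns = {'run': mat_run,
            'preprocess': mat_preprocess,
            'postprocess': mat_postprocess}
    sim = mc.Sim(name='Materials', ndraws=8, fcns=fcns, seed=12362398)

    # Integer draws 0, 1, 2 are mapped to nonnumeric values via the nummap dict
    sim.addInVar(name='material', dist=randint,
                 distkwargs={'low': 0, 'high': 3},
                 nummap={0: 'steel', 1: 'aluminum', 2: 'titanium'})
    sim.runSim()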
The Python call chain is:

    postprocess(case, *run(*preprocess(case)))

Or equivalently, expanding the Python star notation into pseudocode:

    siminput = (siminput1, siminput2, ...)
             = preprocess(case)
    simoutput = (simoutput1, simoutput2, ...)
              = run(*siminput)
              = run(siminput1, siminput2, ...)
    _ = postprocess(case, *simoutput)
      = postprocess(case, simoutput1, simoutput2, ...)

These three functions must be passed to the simulation in a dict with keys 'run', 'preprocess', and 'postprocess'. See the example code at the end of the paper for a simple worked example.

Examining Results

After running, users should generally do all of the following UA and SA tasks to get a full picture of the behavior of their computational model.

• Plot the results (UA & SA).
• Calculate statistics for input or output variables (UA).
• Calculate sensitivity indices to rank the importance of the input variables on the variance of the output variables (SA).
• Investigate specific cases with outlier or puzzling results.
• Save the results to file or pass them to other programs.

Data Flow

A summary of the process and data flow:

1) Instantiate a Sim object.
2) Add input variables to the sim with specified probability distributions.
3) Run the simulation. This executes the following:
   a) Random percentiles p_i ∈ U[0, 1] are drawn ndraws times for each of the input variables.
   b) These percentiles are transformed into random values via the inverse cumulative density function of the target probability distribution, x_i = F^-1(p_i).

For example, monaco can be used only for its parallel processing backend by importing existing randomly drawn input variables, running the simulation, and exporting the output variables for outside analysis. Or, it can be used only for its plotting and analysis capabilities by feeding it inputs and outputs generated elsewhere.

Resource Usage

Note that monaco's computational and storage overhead in creating easily-interrogatable objects for each variable, value, and case makes it an inefficient choice for computationally simple applications with high n, such as Monte Carlo integration. If the preprocessed sim input and raw output for each case (which for some models may dominate storage) is not retained, then the storage bottleneck will be the creation of a Val object for each case's input and output values, with minimum size 0.5 kB. The maximum n will be driven by the RAM of the host machine being capable of holding at least 0.5·n·(k_in + k_out) kB. On the computational bottleneck side, monaco is best suited for models where the model runtime dominates the random variate generation and the few hundred microseconds of dask.delayed task switching time.

Technical Features

Sampling Methods

Random sampling of the percentiles for each variable can be done using scipy's pseudo-random number generator (PRNG), or with any of the low-discrepancy methods from the scipy.stats.qmc quasi-Monte Carlo (QMC) module. QMC in general provides faster O(log(n)^k n^-1) convergence compared to the O(n^-1/2) convergence of random sampling [Caf98]. Available low-discrepancy options are regular or scrambled Sobol sequences, regular or scrambled Halton sequences, or Latin Hypercube Sampling. In general, the 'sobol_random' method that generates scrambled Sobol sequences [Sob67] [Owe20] is recommended in nearly all cases as the sequence with the fastest QMC convergence [CKK18], balanced integration properties as long as the number of cases is a power of 2, and a fairly flat frequency spectrum (though
sampling spectra are rarely a concern) [PCX+ 18]. See Fig. 5 for a c) If nonnumeric inputs are desired, the numbers are visual comparison of some of the options. converted to objects via a nummap dict. d) Case objects are created and populated with the Order Statistics, or, How Many Cases to Run? input values for each case. How many Monte Carlo cases should one run? One answer would e) Each case is run by structuring the inputs values be to choose n ≥ 2k with a sampling method that implements a with the preprocess function, passing them to (t,m,s) digital net (such as a Sobol or Halton sequence), which the run function, and collecting the output values guarantees that there will be at least one sample point in every with the postprocess function. hyperoctant of the input space [JK08]. This should be considered f) The output values are collected into output vari- a lower bound for SA, with the number of cases run being some ables and saved back to the sim. If the values are integer multiple of 2k . MONACO: A MONTE CARLO LIBRARY FOR PERFORMING UNCERTAINTY AND SENSITIVITY ANALYSES 247 Sensitivity Indices Sensitivity indices give a measure of the relationship between the variance of a scalar output variable to the variance of each of the input variables. In other words, they measure which of the input ranges have the largest effect on an output range. It is crucial that sensitivity indices are global rather than local measures – global sensitivity has the stronger theoretical grounding and there is no reason to rely on local measures in scenarios such as automated computer experiments where data can be easily and arbitrarily sampled [SRA+ 08] [PBPS22]. With computer-designed experiments, it is possible to con- struct a specially constructed sample set to directly calculate global sensitivity indices such as the Total-Order Sobol index [Sob01], or the IVARS100 index [RG16]. However, this special construction requires either sacrificing the desirable UA properties of low-discrepancy sampling, or conducting an additional Monte Carlo analysis of the model with a different sample set. For this reason, monaco uses the D-VARS approach to calculating global sensitivity indices, which allows for using a set of given data [SR20]. This is the first publically available implementation of the D-VARS algorithm. Fig. 5: 256 uniform and normal samples along with the 2D frequency Plotting spectra for PRNG random sampling (top), Sobol sampling (middle), monaco includes a plotting module that takes in input and output and scrambled Sobol sampling (bottom, default). variables and quickly creates histograms, empirical CDFs, scatter plots, or 2D or 3D "spaghetti plots" depending on what is most ap- propriate for each variable. Variable statistics and their confidence Along a similar vein, [DFM92] suggests that with random intervals are automatically shown on plots when applicable. sampling n ≥ 2.136k is sufficient to ensure that the volume fraction V approaches 1. The author hypothesizes that for a digital net, the Vector Data n ≥ λ k condition will be satisfied with some λ ≤ 2, and so n ≥ 2k will suffice for this condition to hold. However, these methods of If the values for an output variable are length s lists, NumPy choosing the number of cases may undersample for low k and be arrays, or Pandas dataframes, they are treated as timeseries with s infeasible for high k. steps. 
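The order-statistics reasoning about how many cases to run can be illustrated with a minimal, distribution-free calculation (a sketch, not monaco's built-in routine): the probability that the largest of n i.i.d. samples falls below the p-th population quantile is p^n, so requiring confidence c that it lies above that quantile gives n ≥ ln(1 − c)/ln(p).

    import math

    def min_cases_for_percentile_bound(p=0.99, confidence=0.95):
        # Smallest n such that max(n i.i.d. samples) exceeds the p-quantile
        # with the requested confidence: 1 - p**n >= confidence
        return math.ceil(math.log(1.0 - confidence) / math.log(p))

    for p, c in [(0.90, 0.90), (0.99, 0.90), (0.99, 0.95)]:
        print(f"p={p:.2f}, confidence={c:.2f}: n >= {min_cases_for_percentile_bound(p, c)}")

A bound of this kind is independent of the number of input dimensions k, which is the tractability argument made in this section.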
Variable statistics for these variables are calculated on the A rigorous way of choosing the number of cases is to first ensemble of values at each step, giving time-varying statistics. choose a statistical interval (e.g. a confidence interval for a The plotting module will automatically plot size (1, s) arrays percentile, or a tolerance interval to contain a percent of the against the step number as 2-D lines, size (2, s) arrays as 2-D population), and then use order statistics to calculate the minimum parametric lines, and size (3, s) arrays as 3-D parametric lines. n required to obtain that result at a desired confidence level. This Parallel Processing approach is independent of k, making UA of high-dimensional models tractable. monaco implements order statistics routines monaco uses dask.distributed [Roc15] as a parallel processing for calculating these statistical intervals with a distribution-free backend, and supports preprocessing, running, and postprocessing approach that makes no assumptions about the normality or other cases in a parallel arrangement. Users familiar with dask can shape characteristics of the output distribution. See Chapter 5 of extend the parallelization of their simulation from their single [HM91] for background. machine to a distributed cluster. A more qualitative UA method would simply be to choose a For simple simulations such as the example code at the end of reasonably high n (say, n = 210 ), manually examine the results to the paper, the overhead of setting up a dask server may outweigh ensure high-interest areas are not being undersampled, and rely the speedup from parallel computation, and in those cases monaco on bootstrapping of the desired variable statistics to obtain the also supports running single-threaded in a single for-loop. required confidence levels. The Median Case Variable Statistics A "nominal" run is often useful as a baseline to compare other For any input or output variable, a statistic can be calculated cases against. If desired, the user can set a flag to force the for the ensemble of values. monaco builds in some common first case to be the median 50th percentile draw of all the input statistics (mean, percentile, etc), or alternatively the user can variables prior to random sampling. pass in a custom one. To obtain a confidence interval for this statistic, the results are resampled with replacement using the Debugging Cases scipy.stats.bootstrap module. The number of bootstrap samples By default, all the raw results from each case’s simulation run is determined using an order statistic approach as outlined in the prior to postprocessing are saved to the corresponding Case object. previous section, and multiplying that number by a scaling factor Individual cases can be interrogated by looking at these raw (default 10x) for smoothness of results. results, or by indicating that their results should be highlighted 248 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) in plots. If some cases fail to run, monaco will mark them as fcns=fcns, seed=seed) incomplete and those specific cases can be rerun without requiring # Generate the input variables the full set of cases to be recomputed. A debug flag can be set to sim.addInVar(name='die1', dist=randint, not skip over failed cases and instead stop at a breakpoint or dump distkwargs={'low': 1, 'high': 6+1}) the stack trace on encountering an exception. 
sim.addInVar(name='die2', dist=randint, distkwargs={'low': 1, 'high': 6+1}) Saving and Loading to File # Run the Simulation The base Sim object and the Case objects can be serialized and sim.runSim() saved to or loaded from .mcsim and .mccase files respectively, The results of the simulation can then be analyzed and examined. which are stored in a results directory. The Case objects are saved Fig. 6 shows the plots this code generates. separately since the raw results from a run of the simulation # Calculate the mean and 5-95th percentile may be arbitrarily large, and the Sim object can be comparatively # statistics for the dice sum lightweight. Loading the Sim object from file will automatically sim.outvars['Sum'].addVarStat('mean') attempt to load the cases in the same directory, but can also stand sim.outvars['Sum'].addVarStat('percentile', {'p':[0.05, 0.95]}) alone if the raw results are not needed. Alternatively, the numerical representations for input and out- # Plots a histogram of the dice sum put variables can be saved to and loaded from .json or .csv files. mc.plot(sim.outvars['Sum']) This is useful for interfacing with external tooling, but discards # Creates a scatter plot of the sum vs the roll the metadata that would be present by saving to monaco’s native # number, showing randomness objects. mc.plot(sim.outvars['Sum'], sim.outvars['Roll Number']) Example # Calculate the sensitivity of the dice sum to # each of the input variables Presented here is a simple example showing a Monte Carlo sim.calcSensitivities('Sum') simulation of rolling two 6-sided dice and looking at their sum. sim.outvars['Sum'].plotSensitivities() The user starts with their run function which here directly implements their computational model. They must then create preprocess and postprocess functions to feed in the randomized input values and collect the outputs from that model. # The 'run' function, which implements the # existing computational model (or wraps it) def example_run(die1, die2): dicesum = die1 + die2 return (dicesum, ) # The 'preprocess' function grabs the random # input values for each case and structures it # with any other data in the format the 'run' # function expects def example_preprocess(case): die1 = case.invals['die1'].val die2 = case.invals['die2'].val return (die1, die2) # The 'postprocess' function takes the output # from the 'run' function and saves off the # outputs for each case def example_postprocess(case, dicesum): case.addOutVal(name='Sum', val=dicesum) case.addOutVal(name='Roll Number', val=case.ncase) return None The monaco simulation is initialized, given input variables with Fig. 6: Output from the example code which calculates the sum of two specified probability distributions (here a random integer between random dice rolls. The top plot shows a histogram of the 2-dice sum 1 and 6), and run. with the mean and 5–95th percentiles marked, the middle plot shows the randomness over the set of rolls, and the bottom plot shows that import monaco as mc from scipy.stats import randint each of the dice contributes 50% to the variance of the sum. # dict structure for the three input functions fcns = {'run' : example_run, Case Studies 'preprocess' : example_preprocess, 'postprocess': example_postprocess} These two case studies are toy models meant as illustrative of potential uses, and not of expertise or rigor in their respective # Initialize the simulation domains. 
Please see https://github.com/scottshambaugh/monaco/ ndraws = 1024 # Arbitrary for this example tree/main/examples for their source code as well as several more seed = 123456 # Recommended for repeatability Monte Carlo implementation examples across a range of domains sim = mc.Sim(name='Dice Roll', ndraws=ndraws, including financial modeling, pandemic spread, and integration. MONACO: A MONTE CARLO LIBRARY FOR PERFORMING UNCERTAINTY AND SENSITIVITY ANALYSES 249 Baseball The calculated win probabilities from this simulation are This case study models the trajectory of a baseball in flight 93.4% Democratic, 6.2% Republican, and 0.4% Tie. The 25–75th after being hit for varying speeds, angles, topspins, aerodynamic percentile range for the number of electoral votes for the Demo- conditions, and mass properties. From assumed initial conditions cratic candidate is 281–412, and the actual election result was 306 immediately after being hit, the physics of the ball’s ballistic flight electoral votes. See Fig. 8. are calculated over time until it hits the ground. Fig. 7 shows some plots of the results. A baseball team might use analyses like this to determine where outfielders should be placed to catch a ball for a hitter with known characteristics, or determine what aspect of a hit a batter should focus on to improve their home run potential. Fig. 8: Predicted electoral votes for the Democratic 2020 US Pres- idential candidate with the median and 25-75th percentile interval marked (top), and a map of the predicted Democratic win probability per state (bottom). Conclusion This paper has introduced the ideas underlying Monte Carlo analysis and discussed when it is appropriate to use for conducting UA and SA. It has shown how monaco implements a rigorous, parallel Monte Carlo process, and how to use it through a simple example and two case studies. This library is geared towards scientists, engineers, and policy analysts that have a computational model in their domain of expertise, enough statistical knowledge to define a probability distribution, and a desire to ensure their model will make accurate predictions of reality. The author hopes this tool will help contribute to easier and more widespread use of Fig. 7: 100 simulated baseball trajectories (top), and the relationship UA and SA in improved decision-making. between launch angle and landing distance (bottom). Home runs are highlighted in orange. Further Information monaco is available on PyPI as the package monaco, has API Election documentation at https://monaco.rtfd.io/, and is hosted on github This case study attempts to predict the result of the 2020 US at https://github.com/scottshambaugh/monaco/. presidential election, based on polling data from FiveThirtyEight 3 weeks prior to the election [Fiv20]. Each state independently casts a normally distributed percent- R EFERENCES age of votes for the Democratic, Republican, and Other candidates, [ALMR20] I Azzini, G Listorti, TA Mara, and R Rosati. Uncertainty and based on polling. Also assumed is a uniform ±3% national sensitivity analysis for policy decision making. An Introductory swing due to polling error which is applied to all states equally. Guide. Joint Research Centre, European Commission, Luxem- That summed percentage is then normalized so the total for all bourg, 2020. doi:10.2760/922129. candidates is 100%. The winner of each state’s election assigns [C+ 12] National Research Council et al. 
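For convenience, the dice-roll example of the previous sections reads as one listing when its scattered pieces are gathered together; nothing beyond the calls shown in the paper is used.

    import monaco as mc
    from scipy.stats import randint

    # The 'run' function implements (or wraps) the computational model
    def example_run(die1, die2):
        dicesum = die1 + die2
        return (dicesum, )

    # The 'preprocess' function grabs the random input values for each case
    def example_preprocess(case):
        die1 = case.invals['die1'].val
        die2 = case.invals['die2'].val
        return (die1, die2)

    # The 'postprocess' function saves off the outputs for each case
    def example_postprocess(case, dicesum):
        case.addOutVal(name='Sum', val=dicesum)
        case.addOutVal(name='Roll Number', val=case.ncase)
        return None

    # dict structure for the three input functions
    fcns = {'run'        : example_run,
            'preprocess' : example_preprocess,
            'postprocess': example_postprocess}

    # Initialize the simulation
    ndraws = 1024   # Arbitrary for this example
    seed = 123456   # Recommended for repeatability
    sim = mc.Sim(name='Dice Roll', ndraws=ndraws, fcns=fcns, seed=seed)

    # Generate the input variables
    sim.addInVar(name='die1', dist=randint, distkwargs={'low': 1, 'high': 6+1})
    sim.addInVar(name='die2', dist=randint, distkwargs={'low': 1, 'high': 6+1})

    # Run the simulation
    sim.runSim()

    # Calculate the mean and 5-95th percentile statistics for the dice sum
    sim.outvars['Sum'].addVarStat('mean')
    sim.outvars['Sum'].addVarStat('percentile', {'p': [0.05, 0.95]})

    # Plot a histogram of the dice sum, and the sum vs the roll number
    mc.plot(sim.outvars['Sum'])
    mc.plot(sim.outvars['Sum'], sim.outvars['Roll Number'])

    # Calculate the sensitivity of the dice sum to each input variable
    sim.calcSensitivities('Sum')
    sim.outvars['Sum'].plotSensitivities()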
Assessing the reliability of complex models: mathematical and statistical foundations of their electoral votes to that candidate, and the candidate that wins verification, validation, and uncertainty quantification. National at least 270 of the 538 electoral votes is the winner. Academies Press, 2012. doi:10.17226/13395. 250 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [Caf98] Russel E Caflisch. Monte carlo and quasi-monte carlo in science conference, volume 130, page 136. Citeseer, 2015. methods. Acta numerica, 7:1–49, 1998. doi:10.1017/ doi:10.25080/majora-7b98e3ed-013. S0962492900002804. [SAB+ 19] Andrea Saltelli, Ksenia Aleksankina, William Becker, Pamela [CGH+ 17] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Fennell, Federico Ferretti, Niels Holst, Sushan Li, and Qiongli Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Wu. Why so many published sensitivity analyses are false: A Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic systematic review of sensitivity analysis practices. Environmental programming language. Journal of statistical software, 76(1), modelling & software, 114:29–39, 2019. doi:10.1016/j. 2017. doi:10.18637/jss.v076.i01. envsoft.2019.01.012. [CKK18] Per Christensen, Andrew Kensler, and Charlie Kilpatrick. Pro- [SLKW08] Richard M Shiffrin, Michael D Lee, Woojae Kim, and Eric- gressive multi-jittered sample sequences. In Computer Graphics Jan Wagenmakers. A survey of model evaluation approaches Forum, volume 37, pages 21–33. Wiley Online Library, 2018. with a tutorial on hierarchical bayesian methods. Cog- doi:10.1111/cgf.13472. nitive Science, 32(8):1248–1284, 2008. doi:10.1080/ [DFM92] Martin E. Dyer, Zoltan Füredi, and Colin McDiarmid. Volumes 03640210802414826. spanned by random points in the hypercube. Random Struc- [Sob67] Ilya M Sobol. On the distribution of points in a cube and tures & Algorithms, 3(1):91–106, 1992. doi:10.1002/rsa. the approximate evaluation of integrals. Zhurnal Vychislitel’noi 3240030107. Matematiki i Matematicheskoi Fiziki, 7(4):784–802, 1967. doi: [DSICJ20] Dominique Douglas-Smith, Takuya Iwanaga, Barry F.W. Croke, 10.1016/0041-5553(67)90144-9. and Anthony J. Jakeman. Certain trends in uncertainty and [Sob01] Ilya M Sobol. Global sensitivity indices for nonlinear mathe- sensitivity analysis: An overview of software tools and tech- matical models and their monte carlo estimates. Mathematics niques. Environmental Modelling & Software, 124, 2020. doi: and computers in simulation, 55(1-3):271–280, 2001. doi: 10.1016/j.envsoft.2019.104588. 10.1016/s0378-4754(00)00270-6. [EPA09] US EPA. Guidance on the development, evaluation, and appli- [SR20] Razi Sheikholeslami and Saman Razavi. A fresh look at vari- cation of environmental models (epa/100/k-09/003), 2009. URL: ography: measuring dependence and possible sensitivities across https://nepis.epa.gov/Exe/ZyPDF.cgi?Dockey=P1003E4R.PDF. geophysical systems from any given data. Geophysical Re- [Fiv20] FiveThirtyEight. 2020 general election forecast - state topline search Letters, 47(20):e2020GL089829, 2020. doi:10.1029/ polls-plus data, October 2020. URL: https://github.com/ 2020gl089829. fivethirtyeight/data/tree/master/election-forecasts-2020. [SRA+ 08] Andrea Saltelli, Marco Ratto, Terry Andres, Francesca Campo- [FST16] Federico Ferretti, Andrea Saltelli, and Stefano Tarantola. Trends longo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, and in sensitivity analysis practice in the last decade. Science of Stefano Tarantola. 
Global sensitivity analysis: the primer. John the total environment, 568:666–670, 2016. doi:10.1016/j. Wiley & Sons, 2008. doi:10.1002/9780470725184. scitotenv.2016.02.133. [SWF16] John Salvatier, Thomas V Wiecki, and Christopher Fonnesbeck. [HB10] John Hanson and Bernard Beard. Applying monte carlo simu- Probabilistic programming in python using pymc3. PeerJ Com- lation to launch vehicle design and requirements verification. In puter Science, 2:e55, 2016. doi:10.7717/peerj-cs.55. AIAA Guidance, Navigation, and Control Conference. American [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haber- Institute of Aeronautics and Astronautics, 2010. doi:10.2514/ land, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu 6.2010-8433. Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Na- [HM91] Gerald J Hahn and William Q Meeker. Statistical intervals: a ture methods, 17(3):261–272, 2020. doi:10.14293/s2199- guide for practitioners. John Wiley & Sons, 1991. doi:10. 1006.1.sor-life.a7056644.v1.rysreg. 1002/9780470316771.ch5. [JK08] Stephen Joe and Frances Y Kuo. Constructing sobol sequences with better two-dimensional projections. SIAM Journal on Sci- entific Computing, 30(5):2635–2654, 2008. doi:10.1137/ 070709359. [KTB13] Dirk P Kroese, Thomas Taimre, and Zdravko I Botev. Handbook of monte carlo methods. John Wiley & Sons, 2013. doi:10. 1002/9781118014967. [OGA+ 20] Audrey Olivier, Dimitris G. Giovanis, B.S. Aakash, Mohit Chauhan, Lohit Vandanapu, and Michael D. Shields. Uqpy: A general purpose python package and development environment for uncertainty quantification. Journal of Computational Science, 47:101204, 2020. doi:10.1016/j.jocs.2020.101204. [Owe20] Art B Owen. On dropping the first sobol’point. arXiv preprint arXiv:2008.08051, 2020. doi:10.48550/arXiv. 2008.08051. [PBPS22] Arnald Puy, William Becker, Samuele Lo Piano, and An- drea Saltelli. A comprehensive comparison of total-order es- timators for global sensitivity analysis. International Journal for Uncertainty Quantification, 12(2), 2022. doi:int.j. uncertaintyquantification.2021038133. [PCX+ 18] Hélène Perrier, David Coeurjolly, Feng Xie, Matt Pharr, Pat Hanrahan, and Victor Ostromoukhov. Sequences with low- discrepancy blue-noise 2-d projections. In Computer Graphics Forum, volume 37, pages 339–353. Wiley Online Library, 2018. doi:10.1111/cgf.13366. [RG16] Saman Razavi and Hoshin V Gupta. A new framework for comprehensive, robust, and efficient global sensitivity analysis: 1. theory. Water Resources Research, 52(1):423–439, 2016. doi:10.1002/2015wr017558. [RJS+ 21] Saman Razavi, Anthony Jakeman, Andrea Saltelli, Clémentine Prieur, Bertrand Iooss, Emanuele Borgonovo, Elmar Plischke, Samuele Lo Piano, Takuya Iwanaga, William Becker, et al. The future of sensitivity analysis: An essential discipline for systems modeling and policy support. Environmental Modelling & Soft- ware, 137:104954, 2021. doi:10.1016/j.envsoft.2020. 104954. [Roc15] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th python PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 251 Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis Zachary del Rosario‡∗ F Abstract—Modern engineering models are complex, with dozens of inputs, The fundamental issue underlying these criteria is a flawed uncertainties arising from simplifying assumptions, and dense output data. 
heuristic for uncertainty propagation; initial human subjects work While major strides have been made in the computational scalability of complex suggests that engineers’ tendency to misdiagnose sources of vari- models, relatively less attention has been paid to user-friendly, reusable tools to ability as inconsequential noise may contribute to the persistent explore and make sense of these models. Grama is a python package aimed at application of flawed design criteria [AFD+ 21]. These flawed supporting these activities. Grama is a grammar of model analysis: an ontology that specifies data (in tidy form), models (with quantified uncertainties), and treatments of uncertainty are not limited to engineering design; the verbs that connect these objects. This definition enables a reusable set recent work by Kahneman et al. [KSS21] highlights widespread of evaluation "verbs" that provide a consistent analysis toolkit across different failures to recognize or address variability in human judgment, grama models. This paper presents three case studies that illustrate pedagogy leading to bias in hiring, economic loss, and an unacceptably and engineering work with grama: 1. Providing teachable moments through capricious application of justice. errors for learners, 2. Providing reusable tools to help users self-initiate pro- Grama was originally developed to support model analysis un- ductive modeling behaviors, and 3. Enabling exploratory model analysis (EMA) der uncertainty; in particular, to enable active learning [FEM+ 14] – exploratory data analysis augmented with data generation. – a form of teaching characterized by active student engagement Index Terms—engineering, engineering education, exploratory model analysis, shown to be superior to lecture alone. This toolkit aims to integrate software design, uncertainty quantification the disciplinary perspectives of computational engineering and statistical analysis within a unified environment to support a coding to learn pedagogy [Bar16] – a teaching philosophy that Introduction uses code to teach a discipline, rather than as a means to teach Modern engineering relies on scientific computing. Computational computer science or coding itself. The design of grama is heavily advances enable faster analysis and design cycles by reducing inspired by the Tidyverse [WAB+ 19], an integrated set of R the need for physical experiments. For instance, finite-element packages organized around the ’tidy data’ concept [Wic14]. Grama analysis enables computational study of aerodynamic flutter, and uses the tidy data concept and introduces an analogous concepts Reynolds-averaged Navier-Stokes simulation supports the simu- for models. lation of jet engines. Both of these are enabling technologies that support the design of modern aircraft [KN05]. Modern ar- Grama: A Grammar of Model Analysis eas of computational research include heterogeneous computing environments [MV15], task-based parallelism [BTSA12], and big Grama [dR20] is an integrated set of tools for working with data data [SS13]. Another line of work considers the development of and models. Pandas [pdt20], [WM10] is used as the underlying integrated tools to unite diverse disciplinary perspectives in a sin- data class, while grama implements a Model class. 
A grama gle, unified environment (e.g., the integration of multiple physical model includes a number of functions – mathematical expressions phenomena in a single code [EVB+ 20] or the integration of a or simulations – and domain/distribution information for the de- computational solver and data analysis tools [MTW+ 22]). Such terministic/random inputs. The following code illustrates a simple integrated computational frameworks are highlighted as essential grama model with both deterministic and random inputs1 . for applications such as computational analysis and design of # Each cp_* function adds information to the model aircraft [SKA+ 14]. While engineering computation has advanced md_example = ( along the aforementioned axes, the conceptual understanding of gr.Model("An example model") # Overloaded `>>` provides pipe syntax practicing engineers has lagged in key areas. >> gr.cp_vec_function( Every aircraft you have ever flown on has been designed using fun=lambda df: gr.df_make(f=df.x+df.y+df.z), probabilistically-flawed, potentially dangerous criteria [dRFI21]. var=["x", "y", "z"], out=["f"], ) * Corresponding author: zdelrosario@olin.edu ‡ Assistant Professor of Engineering and Applied Statistics, Olin College of >> gr.cp_bounds(x=(-1, +1)) Engineering >> gr.cp_marginals( y=gr.marg_mom("norm", mean=0, sd=1), Copyright © 2022 Zachary del Rosario. This is an open-access article dis- z=gr.marg_mom("uniform", mean=0, sd=1), tributed under the terms of the Creative Commons Attribution License, which ) permits unrestricted use, distribution, and reproduction in any medium, pro- vided the original author and source are credited. 1. Throughout, import grama as gr is assumed. 252 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) >> gr.cp_copula_gaussian( df_corr=gr.df_make( var1="y", var2="z", corr=0.5, ) ) ) While an engineer’s interpretation of the term "model" focuses on the input-to-output mapping (the simulation), and a statistician’s interpretation of the term "model" focuses on a distribution, the grama model integrates both perspectives in a single model. Grama models are intended to be evaluated to generate data. The data can then be analyzed using visual and statistical means. Models can be composed to add more information, or fit to a dataset. Figure 1 illustrates this interplay between data and models in terms of the four categories of function "verbs" provided in Fig. 2: Input sweep generated from the code above. Each panel grama. visualizes the effect of changing a single input, with all other inputs held constant. >> gr.tf_filter(DF.sweep_var == "x") >> gr.ggplot(gr.aes("x", "f", group="sweep_ind")) + gr.geom_line() ) This system of defaults is important for pedagogical design: Introductory grama code can be made extremely simple when first Fig. 1: Verb categories in grama. These grama functions start with an introducing a concept. However, the defaults can be overridden identifying prefix, e.g. ev_* for evaluation verbs. to carry out sophisticated and targeted analyses. We will see in the Case Studies below how this concise syntax encourages sound analysis among students. Defaults for Concise Code Grama verbs are designed with sensible default arguments to Pedagogy Case Studies enable concise code. For instance, the following code visualizes input sweeps across its three inputs, similar to a ceteris paribus The following two case studies illustrate how grama is designed profile [KBB19], [Bie20]. to support pedagogy: the formal method and practice of teaching. 
In particular, grama is designed for an active learning pedagogy ( ## Concise default analysis [FEM+ 14], a style of teaching characterized by active student md_example engagement. >> gr.ev_sinews(df_det="swp") >> gr.pt_auto() Teachable Moments through Errors for Learners ) An advantage of a unified modeling environment like grama is This code uses the default number of sweeps and sweep density, the opportunity to introduce design errors for learners in order to and constructs a visualization of the results. The resulting plot is provide teachable moments. shown in Figure 2. It is common in probabilistic modeling to make problematic Grama imports the plotnine package for data visualization assumptions. For instance, Cullen and Frey [CF99] note that [HK21], both to provide an expressive grammar of graphics, but modelers frequently and erroneously treat the normal distribution also to implement a variety of "autoplot" routines. These are as a default choice for all unknown quantities. Another common called via a dispatcher gr.pt_auto() which uses metadata issue is to assume, by default, the independence of all random from evaluation verbs to construct a default visual. Combined inputs to a model. This is often done tacitly – with the indepen- with sensible defaults for keyword arguments, these tools provide dence assumption unstated. These assumptions are problematic, as a concise syntax even for sophisticated analyses. The same code they can adversely impact the validity of a probabilistic analysis can be slightly modified to change a default argument value, or to [dRFI21]. use plotnine to create a more tailored visual. To highlight the dependency issue for novice modelers, grama ( uses error messages to provide just-in-time feedback to a user md_example who does not articulate their modeling choices. For example, ## Override default parameters >> gr.ev_sinews(df_det="swp", n_sweeps=10) the following code builds a model with no dependency structure >> gr.pt_auto() specified. The result is an error message that summarizes the ) conceptual issue and points the user to a primer on random ( variable modeling. md_example md_flawed = ( >> gr.ev_sinews(df_det="swp") gr.Model("An example model") ## Construct a targeted plot >> gr.cp_vec_function( ENABLING ACTIVE LEARNING PEDAGOGY AND INSIGHT MINING WITH A GRAMMAR OF MODEL ANALYSIS 253 fun=lambda df: gr.df_make(f=df.x+df.y+df.z), data=data, var=["x", "y", "z"], columns=["f", "x", "y"], out=["f"], ) ) >> gr.cp_bounds(x=(-1, +1)) The ability to write low-level programming constructs – such >> gr.cp_marginals( as the loops above – is an obviously worthy learning outcome y=gr.marg_mom("norm", mean=0, sd=1), in a course on scientific computing. However, not all courses z=gr.marg_mom("uniform", mean=0, sd=1), ) should focus on low-level programming constructs. Grama is not ## NOTE: No dependency specified designed to support low-level learning outcomes; instead, the ) package is designed to support a "coding to learn" philosophy ( md_flawed [Bar16] focused on higher-order learning outcomes to support ## This code will throw an Error sound modeling practices. >> gr.ev_sample(n=1000, df_det="nom") Parameter sweep functionality can be achieved in grama ) without explicit loop management and with sensible defaults for the analysis parameters. This provides a "quick and dirty" tool Error ValueError: Present model copula must be de- to inspect a model’s behavior. A grama approach to parameter fined for sampling. 
Use CopulaIndependence only sweeps is shown below. when inputs can be guaranteed independent. See the ## Parameter sweep: Grama approach Documentation chapter on Random Variable Modeling # Gather model info for more information. https://py-grama.readthedocs.io/en/ md_gr = ( gr.Model() latest/source/rv_modeling.html >> gr.cp_vec_function( fun=lambda df: gr.df_make(f=df.x**2 * df.y), Grama is designed both as a teaching tool and a scientific var=["x", "y"], modeling toolkit. For the student, grama offers teachable moments out=["f"], to help the novice grow as a modeler. For the scientist, grama ) >> gr.cp_bounds( enforces practices that promote scientific reproducibility. x=(-1, +1), y=(-1, +1), Encouraging Sound Analysis ) ) As mentioned above, concise grama syntax is desirable to encour- # Generate data age sound analysis practices. Grama is designed to support higher- df_gr = gr.eval_sinews( level learning outcomes [Blo56]. For instance, rather than focusing md_gr, df_det="swp", on applying programming constructs to generate model results, n_sweeps=3, grama is intended to help users study model results ("evaluate," ) according to Bloom’s Taxonomy). Sound computational analysis Once a model is implemented in grama, generating and visualizing demands study of simulation results (e.g., to check for numerical a parameter sweep is trivial, requiring just two lines of code and instabilities). This case study makes this learning outcome distinc- zero initial choices for analysis parameters. The practical outcome tion concrete by considering parameter sweeps. of this software design is that users will tend to self-initiate Generating a parameter sweep similar to Figure 2 with stan- parameter sweeps: While students will rarely choose to write the dard Python libraries requires a considerable amount of boilerplate extensive boilerplate code necessary for a parameter sweep (unless code, manual coordination of model information, and explicit loop required to do so), students writing code in grama will tend to self- construction. The following code generates parameter sweep data initiate sound analysis practices. using standard libraries. Note that this code sweeps through values For example, the following code is unmodified from a student of x holding values of y fixed; additional code would be necessary report3 . The original author implemented an ordinary differential to construct a sweep through y2 . equation model to simulate the track time "finish_time" of ## Parameter sweep: Manual approach an electric formula car, and sought to study the impact of variables # Gather model info x_lo = -1; x_up = +1; such as the gear ratio "GR" on "finish_time". While the y_lo = -1; y_up = +1; assignment did not require a parameter sweep, the student chose f_model = lambda x, y: x**2 * y to carry out their own study. The code below is a self-initiated # Analysis parameters parameter sweep of the track time model. nx = 10 # Grid resolution for x y_const = [-1, 0, +1] # Constant values for y ## Unedited student code # Generate data md_car = ( data = np.zeros((nx * len(y_const), 3)) gr.Model("Accel Model") for i, x in enumerate( >> gr.cp_function( np.linspace(x_lo, x_up, num=nx) fun = calculate_finish_time, ): var = ["GR", "dt_mass", "I_net" ], for j, y in enumerate(y_const): out = ["finish_time"], data[i + j*nx, 0] = f_model(x, y) ) data[i + j*nx, 1] = x data[i + j*nx, 2] = y >> gr.cp_bounds( # Package data for visual GR=(+1,+4), df_manual = pd.DataFrame( dt_mass=(+5,+15), I_net=(+.2,+.3), 2. 
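For reference, a self-contained version of this manual approach, including the additional sweep through y that the text notes would require extra code, might look like the following sketch (the toy model and grid values mirror the fragment above; np and pd refer to numpy and pandas as in footnote 2):

## Parameter sweep: manual approach, both inputs (sketch)
import numpy as np
import pandas as pd

f_model = lambda x, y: x**2 * y              # same toy model as in the fragment above
x_lo, x_up = -1, +1
y_lo, y_up = -1, +1
nx, ny = 10, 10                              # grid resolution per swept input
x_const = [-1, 0, +1]                        # values held constant while sweeping y
y_const = [-1, 0, +1]                        # values held constant while sweeping x

rows = []
for y in y_const:                            # sweep x, holding y constant
    for x in np.linspace(x_lo, x_up, num=nx):
        rows.append({"sweep_var": "x", "x": x, "y": y, "f": f_model(x, y)})
for x in x_const:                            # sweep y, holding x constant (the extra code noted in the text)
    for y in np.linspace(y_lo, y_up, num=ny):
        rows.append({"sweep_var": "y", "x": x, "y": y, "f": f_model(x, y)})

df_manual = pd.DataFrame(rows)

Even in this compact form, the bookkeeping (grid choices, held-constant values, assembling the data frame) falls entirely on the user, which is the contrast the grama version is meant to highlight.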
Code assumes import numpy as np; import pandas as pd. 3. Included with permission of the author, on condition of anonymity. 254 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) ) ) gr.plot_auto( gr.eval_sinews( md_car, df_det="swp", #skip=True, n_density=20, n_sweeps=5, seed=101, ) ) Fig. 4: Schematic boat hull rotated to 22.5◦ . The forces due to gravity and buoyancy act at the center of mass (COM) and center of buoyancy (COB), respectively. Note that this hull is upright stable, as the couple will rotate the boat to upright. that a restoring torque is generated (Fig. 4). However, this upright stability is not guaranteed; Figure 5 illustrates a boat design that does not provide a restoring torque near its upright angle. An upright-unstable boat will tend to capsize spontaneously. Fig. 3: Input sweep generated from the student code above. The image has been cropped for space, and the results are generated with an older version of grama. The jagged response at higher values of the input are evidence of solver instabilities. The parameter sweep shown in Figure 2 gives an overall impres- sion of the effect of input "GR" on the output "finish_time". This particular input tends to dominate the results. However, variable results at higher values of "GR" provide evidence of numerical instability in the ODE solver underlying the model. Without this sort of model evaluation, the student author would not have discovered the limitations of the model. Exploratory Model Analysis Case Study This final case study illustrates how grama supports exploratory Fig. 5: Schematic boat hull rotated to 22.5◦ . Gravity and buoyancy model analysis. This iterative process is a computational approach are annotated as in Figure 4. Note that this hull is upright unstable, to mining insights into physical systems. The following use case as the couple will rotate the boat away from upright. illustrates the approach by considering the design of boat hull cross-sections. Naval engineers analyze the stability of a boat design by constructing a moment curve, such as the one pictured in Figure Static Stability of Boat Hulls 6. This curve depicts the net moment due to buoyancy at various Stability is a key consideration in boat hull design. One of the most angles, assuming the vessel is in vertical equilibrium. From this fundamental aspects of stability is static stability; the behavior of a figure we can see that the design is upright-stable, as it possesses boat when perturbed away from static equilibrium [LE00]. Figure a negative slope at upright θ = 0◦ . Note that a boat may not have 4 illustrates the physical mechanism governing stability at small an unlimited range of stability as Figure 6 exhibits an angle of perturbations from an upright orientation. vanishing stability (AVS) beyond which the boat does not recover As a boat is rotated away from its upright orientation, its center to upright. of buoyancy (COB) will tend to migrate. If the boat is in vertical The classical way to build intuition about boat stability is equilibrium, its buoyant force will be equal in magnitude to its via mathematical derivations [LE00]. In the following section we weight. A stable boat is a hull whose COB migrates in such a way present an alternative way to build intuition through exploratory ENABLING ACTIVE LEARNING PEDAGOGY AND INSIGHT MINING WITH A GRAMMAR OF MODEL ANALYSIS 255 gr.tf_iocorr() computes correlations between every pair of input variables var and outputs out. 
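For intuition, the correlation values that gr.tf_iocorr() tabulates can also be cross-checked with plain pandas; a minimal sketch, using the df_boats sample of designs from this case study (variable names taken from the surrounding code):

# Plain-pandas cross-check of the input/output correlations (sketch)
var = ["H", "W", "n", "d", "f_com"]
out = ["mass", "angle", "stability"]
df_corr = (
    df_boats[var + out]
    .corr()           # full pairwise correlation matrix
    .loc[var, out]    # keep only the input-vs-output block
)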
The routine also attaches metadata, enabling an autoplot as a tileplot of the correlation values. ( df_boats >> gr.tf_iocorr( var=["H", "W", "n", "d", "f_com"], out=["mass", "angle", "stability"], ) >> gr.pt_auto() ) Fig. 6: Total moment on a boat hull as it is rotated through 180◦ . A negative slope at upright θ = 0◦ is required for upright stability. Stability is lost at the angle of vanishing stability (AVS). model analysis. EMA for Insight Mining Generation and post-processing of the moment curve are imple- mented in the grama model md_performance4 . This model parameterizes a 2d boat hull via its height H, width W, shape of corner n, the vertical height of the center of mass f_com (as a fraction of the height), and the displacement ratio d (the Fig. 7: Tile plot of input/output correlations; autoplot gr.pt_auto() ratio of the boat’s mass to maximum water mass displaced). visualization of gr.tf_iocorr() output. Note that a boat with d > 1 is incapable of flotation. A smaller value of d corresponds to a boat that floats higher in The correlations in Figure 7 suggest that stability is posi- the water. The model md_performance returns stability tively impacted by increasing the width W and displacement ratio = -dMdtheta_0 (the negative of the moment curve slope at d of a boat, and by decreasing the height H, shape factor n, and upright) as well as the mass and AVS angle. A positive value vertical location of the center of mass f_com. The correlations of stability indicates upright stability, while a larger value of also suggest a similar impact of each variable on the AVS angle, angle indicates a wider range of stability. but with a weaker dependence on H. These results also suggest that The EMA process begins by generating data from the model. f_com has the strongest effect on both stability and angle. However, the generation of a moment curve is a nontrivial cal- Correlations are a reasonable first-check of input/output be- culation. One should exercise care in choosing an initial sample havior, but linear correlation quantifies only an average, linear of designs to analyze. The statistical problem of selecting efficient association. A second-pass at the data would be to fit an accurate input values for a computer model is called the design of computer surrogate model and inspect parameter sweeps. The following experiments [SSW89]. The grama verb gr.tf_sp() implements the code defines a gaussian process fit [RW05] for both stability support points algorithm [MJ18] to reduce a large dataset of target and angle, and estimates model error using k-folds cross valida- points to a smaller (but representative) sample. The following code tion [JWHT13]. Note that a non-default kernel is necessary for a generates a sample of input design values via gr.ev_sample() reasonable fit of the latter output5 . with the skip=True argument, uses gr.tf_sp() to "com- ## Define fitting procedure pact" this large sample, then evaluates the performance model at ft_common = gr.ft_gp( var=["H", "W", "n", "d", "f_com"], the smaller sample. out=["angle", "stability"], df_boats = ( kernels=dict( md_performance stability=None, # Use default >> gr.ev_sample( angle=RBF(length_scale=0.1), n=5e3, ) df_det="nom", ) seed=101, ## Estimate model accuracy via k-folds CV skip=True, ( ) df_boats >> gr.tf_sp(n=1000, seed=101) >> gr.tf_kfolds( >> gr.tf_md(md=md_performance) ft=ft_common, ) out=["angle", "stability"], ) With an initial sample generated, we can perform an ex- ) ploratory analysis relating the inputs and outputs. 
The verb 5. RBF is imported as from sklearn.gaussian_process.kernels 4. The analysis reported here is available as a jupyter notebook. import RBF. 256 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) angle stability k Direction H W n d f_com 0.771 0.979 0 1 -0.0277 0.0394 -0.1187 0.4009 -0.9071 0.815 0.976 1 2 -0.6535 0.3798 -0.0157 -0.6120 -0.2320 0.835 0.95 2 0.795 0.962 3 0.735 0.968 4 TABLE 2: Subspace weights in df_weights. TABLE 1: Accuracy (R2 ) estimated via k-fold cross validation of all the sweeps across f_com for stability_mean tend to be gaussian process model. monotone with a fairly steep slope. This is in agreement with the correlation results of Figure 7; the f_com sweeps tend to The k-folds CV results (Tab. 1) suggest a highly accurate have the steepest slopes. Given the high accuracy of the model model for stability, and a moderately accurate model for for stability (as measured by k-folds CV), this trend is angle. The following code defines the surrogate model over a reasonably trustworthy. domain that includes the original dataset, and performs parameter However, the same figure shows an inconsistent (non- sweeps across all inputs. monotone) effect of most inputs on the AVS angle_mean. md_fit = ( These results are in agreement with the k-fold CV results shown df_boats above. Clearly, the surrogate model is untrustworthy, and we >> ft_common() should resist trusting conclusions from the parameter sweeps for >> gr.cp_marginals( H=gr.marg_mom("uniform", mean=2.0, cov=0.30), angle_mean. This undermines the conclusion we drew from W=gr.marg_mom("uniform", mean=2.5, cov=0.35), the input/output correlations pictured in Figure 7. Clearly, angle n=gr.marg_mom("uniform", mean=1.0, cov=0.30), exhibits more complex behavior than a simple linear correlation d=gr.marg_mom("uniform", mean=0.5, cov=0.30), f_com=gr.marg_mom( with each of the boat design variables. "uniform", A different analysis of the boat hull angle data helps mean=0.55, develop useful insights. We pursue an active subspace analysis cov=0.47, of the data to reduce the dimensionality of the input space by ), ) identifying directions that best explain variation in the output >> gr.cp_copula_independence() [dCI17], [Con15]. The verb gr.tf_polyridge() implements ) the variable projection algorithm of Hokanson and Constantine ( [HC18]. The following code pursues a two-dimensional reduction md_fit of the input space. Note that the hyperparameter n_degree=6 is >> gr.ev_sinews(df_det="swp", n_sweeps=5) set via a cross-validation study. >> gr.pt_auto() ) ## Find two important directions df_weights = ( df_boats >> gr.tf_polyridge( var=["H", "W", "n", "d", "f_com"], out="angle", n_degree=6, # Set via CV study n_dim=2, # Seek 2d subspace ) ) The subspace weights are reported in Table 2. Note that the leading direction 1 is dominated by the displacement ratio d and COM location f_com. Essentially, this describes the "loading" of the vessel. The second direction corresponds to "widening and shortening" of the hull cross-section (in addition to lowering d and f_com). Using the subspace weights in Table 2 to produce a 2d projec- tion of the feature space enables visualizing all boat geometries in a single plot. Figure 9 reveals that this 2d projection is very suc- Fig. 8: Parameter sweeps for fitted GP model. Model *_mean and cessful at separating universally-stable (angle==180), upright- predictive uncertainty *_sd values are reported for each output unstable (angle==0), and intermediate cases (0 < angle < angle, stability. 
180). Intermediate cases are concentrated at higher values of the second active variable. There is a phase transition between Figure 8 displays parameter sweeps for the surrogate model of universally-stable and upright-unstable vessels at lower values of stability and angle. Note that the surrogate model reports the second active variable. both a mean trend *_mean and a predictive uncertainty *_sd. Interpreting Figure 9 in light of Table 2 provides us with deep The former is the model’s prediction for future values, while the insight about boat stability: Since active variable 1 corresponds to latter quantifies the model’s confidence in each prediction. loading (high displacement ratio d with a low COM f_com), we The parameter sweeps of Figure 8 show a consistent and strong can see that the boat’s loading conditions are key to determining effect of f_com on the stability_mean of the boat; note that its stability. Since active variable 2 depends on the aspect ratio ENABLING ACTIVE LEARNING PEDAGOGY AND INSIGHT MINING WITH A GRAMMAR OF MODEL ANALYSIS 257 native to derivation for the activities in an active learning approach. Rather than structuring courses around deriving and implementing scientific models, course exercises could have students explore the behavior of a pre-implemented model to better understand physical phenomena. Lorena Barba [Bar16] describes some of the benefits in this style of lesson design. EMA is also an important part of the modeling practitioner’s toolkit as a means to verify a model’s implementation and to develop new insights. Grama sup- ports both novices and practitioners in performing EMA through a concise syntax. R EFERENCES [AFD+ 21] Riya Aggarwal, Mira Flynn, Sam Daitzman, Diane Lam, and Zachary Riggins del Rosario. A qualitative study of engineer- ing students’ reasoning about statistical variability. In 2021 Fall ASEE Middle Atlantic Section Meeting, 2021. URL: https://peer.asee.org/38421. Fig. 9: Boat design feature vectors projected to 2d active subspace. [Bar16] Lorena Barba. Computational thinking: I do not think it means what you think it means. Technical re- The origin corresponds to the mean feature vector. port, 2016. URL: https://lorenabarba.com/blog/computational- thinking-i-do-not-think-it-means-what-you-think-it-means/. [Bie20] Przemyslaw Biecek. ceterisParibus: Ceteris Paribus Profiles, (higher width, shorter height), Figure 9 suggests that only wider 2020. R package version 0.4.2. URL: https://cran.r-project.org/ boats will tend to exhibit intermediate stability. package=ceterisParibus. [Blo56] Benjamin Samuel Bloom. Taxonomy of educational objectives: The classification of educational goals. Addison-Wesley Long- Conclusions man Ltd., 1956. Grama is a Python implementation of a grammar of model anal- [Bry20] Jennifer Bryan. object of type closure is not subsettable. 2020. ysis. The grammar’s design supports an active learning approach rstudio::conf 2020. URL: https://rstd.io/debugging. [BTSA12] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. to teaching sound scientific modeling practices. Two case studies Legion: Expressing locality and independence with logical re- demonstrated the teaching benefits of grama: errors for learners gions. In SC’12: Proceedings of the International Conference on help guide novices toward a more sound analysis, while concise High Performance Computing, Networking, Storage and Analy- syntax encourages novices to carry out sound analysis practices. sis, pages 1–11. IEEE, 2012. 
URL: https://ieeexplore.ieee.org/ document/6468504, doi:10.1109/SC.2012.71. Grama can also be used for exploratory model analysis (EMA) [CF99] Alison C Cullen and H Christopher Frey. Probabilistic Tech- – an exploratory procedure to mine a scientific model for useful niques In Exposure Assessment: A Handbook For Dealing With insights. A case study of boat hull design demonstrated EMA. Variability And Uncertainty In Models And Inputs. Springer Science & Business Media, 1999. In particular, the example explored and explained the relationship [Con15] Paul G. Constantine. Active Subspaces: Emerging Ideas for between boat design parameters and two metrics of boat stability. Dimension Reduction in Parameter Studies. SIAM Philadelphia, Several ideas from the grama project are of interest to other 2015. doi:10.1137/1.9781611973860. practitioners and developers in scientific computing. Grama was [dCI17] Zachary del Rosario, Paul G. Constantine, and Gianluca Iac- designed to support model analysis under uncertainty. However, carino. Developing design insight through active subspaces. In 19th AIAA Non-Deterministic Approaches Conference, page the data/model and four-verb ontology (Fig. 1) underpinning 1090, 2017. URL: https://arc.aiaa.org/doi/10.2514/6.2017-1090, grama is a much more general idea. This design enables very doi:10.2514/6.2017-1090. concise model analysis syntax, which provides much of the benefit [dR20] Zachary del Rosario. Grama: A grammar of model analysis. Jour- nal of Open Source Software, 5(51):2462, 2020. URL: https://doi. behind grama. org/10.21105/joss.02462, doi:10.21105/joss.02462. The design idiom of errors for learners is not simply focused [dRFI21] Zachary del Rosario, Richard W Fenrich, and Gianluca Iaccarino. on writing "useful" error messages, but is rather a design orien- When are allowables conservative? AIAA Journal, 59(5):1760– tation to use errors to introduce teachable moments. In addition 1772, 2021. URL: https://doi.org/10.2514/1.J059578, doi:10. 2514/1.J059578. to writing error messages "for humans" [Bry20], an errors for [EVB+ 20] M Esmaily, L Villafane, AJ Banko, G Iaccarino, JK Eaton, learners philosophy designs errors not simply to avoid fatal and A Mani. A benchmark for particle-laden turbu- program behavior, but rather introduces exceptions to prevent lent duct flow: A joint computational and experimen- conceptually invalid analyses. For instance, in the case study tal study. International Journal of Multiphase Flow, 132:103410, 2020. URL: https://www.sciencedirect.com/ presented above, designing gr.tf_sample() to assume independent science/article/abs/pii/S030193222030519X, doi:10.1016/ random inputs when a copula is unspecified would lead to code j.ijmultiphaseflow.2020.103410. that throws errors less frequently. However, this would silently [FEM+ 14] Scott Freeman, Sarah L Eddy, Miles McDonough, Michelle K endorse the conceptually problematic mentality of "independence Smith, Nnadozie Okoroafor, Hannah Jordt, and Mary Pat Wen- deroth. Active learning increases student performance in sci- is the default." While throwing an error message for an unspecified ence, engineering, and mathematics. Proceedings of the Na- dependence structure leads to more frequent errors, it serves as a tional Academy of Sciences, 111(23):8410–8415, 2014. doi: frequent reminder that dependency is an important part of a model 10.1073/pnas.1319030111. [HC18] Jeffrey M Hokanson and Paul G Constantine. Data-driven involving random inputs. 
polynomial ridge approximation using variable projection. SIAM Finally, exploratory model analysis holds benefits for both Journal on Scientific Computing, 40(3):A1566–A1589, 2018. learners and practitioners of scientific modeling. EMA is an alter- doi:10.1137/17M1117690. 258 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [HK21] Jan Katins gdowding austin matthias-k Tyler Funnell Florian Finkernagel Jonas Arnfred Dan Blanchard et al. Hassan Kibirige, Greg Lamp. has2k1/plotnine: v0.8.0. Mar 2021. doi:10. 5281/zenodo.4636791. [JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshi- rani. An Introduction to Statistical Learning: with Applications in R, volume 112. Springer, 2013. URL: https://www.statlearning. com/. [KBB19] Michał Kuźba, Ewa Baranowska, and Przemysław Biecek. pyce- terisparibus: explaining machine learning models with ceteris paribus profiles in python. Journal of Open Source Software, 4(37):1389, 2019. URL: https://joss.theoj.org/papers/10.21105/ joss.01389, doi:10.21105/joss.01389. [KN05] Andy Keane and Prasanth Nair. Computational Approaches For Aerospace Design: The Pursuit Of Excellence. John Wiley & Sons, 2005. [KSS21] Daniel Kahneman, Olivier Sibony, and Cass R Sunstein. Noise: A flaw in human judgment. Little, Brown, 2021. [LE00] Lars Larsson and Rolf Eliasson. Principles of Yacht Design. McGraw Hill Companies, 2000. [MJ18] Simon Mak and V Roshan Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018. doi:10.1214/17- AOS1629. [MTW 22] Kazuki Maeda, Thiago Teixeira, Jonathan M Wang, Jeffrey M + Hokanson, Caetano Melone, Mario Di Renzo, Steve Jones, Javier Urzay, and Gianluca Iaccarino. An integrated heterogeneous computing framework for ensemble simulations of laser-induced ignition. arXiv preprint arXiv:2202.02319, 2022. URL: https: //arxiv.org/abs/2202.02319, doi:10.48550/arXiv.2202. 02319. [MV15] Sparsh Mittal and Jeffrey S Vetter. A survey of cpu-gpu heteroge- neous computing techniques. ACM Computing Surveys (CSUR), 47(4):1–35, 2015. URL: https://dl.acm.org/doi/10.1145/2788396, doi:10.1145/2788396. [pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134. [RW05] Carl Edward Rasmussen and Christopher K. I. Williams. Gaus- sian Processes for Machine Learning. The MIT Press, 11 2005. URL: https://doi.org/10.7551/mitpress/3206.001.0001, doi:10.7551/mitpress/3206.001.0001. [SKA+ 14] Jeffrey P Slotnick, Abdollah Khodadoust, Juan Alonso, David Darmofal, William Gropp, Elizabeth Lurie, and Dimitri J Mavriplis. Cfd vision 2030 study: A path to revolutionary computational aerosciences. Technical report, 2014. URL: https://ntrs.nasa.gov/citations/20140003093. [SS13] Seref Sagiroglu and Duygu Sinanc. Big data: A review. In 2013 International Conference on Collaboration Technolo- gies and Systems (CTS), pages 42–47. IEEE, 2013. URL: https://ieeexplore.ieee.org/document/6567202, doi:10.1109/ CTS.2013.6567202. [SSW89] Jerome Sacks, Susannah B. Schiller, and William J. Welch. Designs for computer experiments. Technometrics, 31(1):41– 47, 1989. URL: http://www.jstor.org/stable/1270363, doi:10. 2307/1270363. [WAB 19] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, + Lucy D’Agostino McGowan, Romain François, Garrett Grole- mund, Alex Hayes, Lionel Henry, Jim Hester, et al. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, 2019. doi:10.21105/joss.01686. [Wic14] Hadley Wickham. Tidy data. 
Journal of Statistical Software, 59(10):1–23, 2014. doi:10.18637/jss.v059.i10.
[WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi:10.25080/Majora-92bf1922-00a.

Low Level Feature Extraction for Cilia Segmentation

Meekail Zain‡†∗, Eric Miller§†, Shannon P Quinn‡¶, Cecilia Lo||

Abstract—Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances. Dysfunction of ciliary motion is often indicative of diseases known as ciliopathies, which disrupt the functionality of macroscopic structures within the lungs, kidneys and other organs [LWL+ 18]. Phenotyping ciliary motion is an essential step towards understanding ciliopathies; however, this is generally an expert-intensive process [QZD+ 15]. A means of automatically parsing recordings of cilia to determine useful information would greatly reduce the amount of expert intervention required. This would not only improve overall throughput, but also mitigate human error, and greatly improve the accessibility of cilia-based insights. Such automation is difficult to achieve due to the noisy, partially occluded and potentially out-of-phase imagery used to represent cilia, as well as the fact that cilia occupy a minority of any given image. Segmentation of cilia mitigates these issues, and is thus a critical step in enabling a powerful pipeline. However, cilia are notoriously difficult to properly segment in most imagery, imposing a bottleneck on the pipeline. Experimentation on and evaluation of alternative methods for feature extraction of cilia imagery hence provide the building blocks of a more potent segmentation model. Current experiments show up to a 10% improvement over base segmentation models using a novel combination of feature extractors.

Index Terms—cilia, segmentation, u-net, deep learning

Fig. 1: A sample frame from the cilia dataset

Introduction

Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances [Ish17]. Dysfunction of ciliary motion often indicates diseases known as ciliopathies, which on a larger scale disrupt the functionality of structures within the lungs, kidneys and other organs. Phenotyping ciliary motion is an essential step towards understanding ciliopathies. However, this is generally an expert-intensive process [LWL+ 18], [QZD+ 15]. A means of automatically parsing recordings of cilia to determine useful information would greatly reduce the amount of expert intervention required, thus increasing throughput while alleviating the potential for human error.

gation in the Quinn Research Group at the University of Georgia [ZRS+ 20]. The current pipeline consists of three major stages: preprocessing, where segmentation masks and optical flow representations are created to supplement raw cilia video data; appearance, where a model learns a condensed spatial representation of the cilia; and dynamics, which learns a representation from the video, encoded as a series of latent points from the appearance module. In the primary module, the segmentation mask is essential in scoping downstream analysis to the cilia themselves, so inaccuracies at this stage directly affect the overall performance of the pipeline.
Hence, However, due to the high variance of ciliary structure, as well Zain et al. (2020) discuss the construction of a generative pipeline as the noisy and out-of-phase imagery available, segmentation to model and analyze ciliary motion, a prevalent field of investi- attempts have been prone to error. † These authors contributed equally. While segmentation masks for such a pipeline could be * Corresponding author: meekail.zain@uga.edu manually generated, the process requires intensive expert labor ‡ Department of Computer Science, University of Georgia, Athens, GA 30602 [DvBB+ 21]. Requiring manual segmentation before analysis thus USA § Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 greatly increases the barrier to entry for this tool. Not only would USA it increase the financial strain of adopting ciliary analysis as a ¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 clinical tool, but it would also serve as an insurmountable barrier to USA || Department of Developmental Biology, University of Pittsburgh, Pittsburgh, entry for communities that do not have reliable access to such clin- PA 15261 USA icians in the first place, such as many developing nations and rural populations. Not only can automated segmentation mitigate these Copyright © 2022 Meekail Zain et al. This is an open-access article distributed barriers to entry, but it can also simplify existing treatment and under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the analysis infrastructure. In particular, it has the potential to reduce original author and source are credited. the magnitude of work required by an expert clinician, thereby 260 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) expansion. The contraction path follows the standard strategy of most convolutional neural networks (CNNs), where convolutions are followed by Rectified Linear Unit (ReLU) activation func- tions and max pooling layers. While max pooling downsamples the images, the convolutions double the number of channels. Upon expansion, up-convolutions are applied to up-sample the image while reducing the number of channels. At each stage, the network concatenates the up-sampled image with the image of corresponding size (cropped to account for border pixels) from a layer in the contracting path. A final layer uses pixel- wise (1 × 1) convolutions to map each pixel to a corresponding class, building a segmentation. Before training, data is generally augmented to provide both invariance in rotation and scale as well as a larger amount of training data. In general, U-Nets have shown high performance on biomedical data sets with low quantities Fig. 2: The classical U-Net architecture, which serves as both a of labelled images, as well as reasonably fast training times on baseline and backbone model for this research graphics processing units (GPUs) [RFB15]. However, in a few past experiments with cilia data, the U-Net architecture has had low segmentation accuracy [LMZ+ 18]. Difficulties modeling cilia decreasing costs and increasing clinician throughput [QZD+ 15], with CNN-based architectures include their fine high-variance [ZRS+ 20]. 
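As a concrete illustration of the contraction and expansion paths described above, the following is a heavily simplified PyTorch sketch of a U-Net-style network (one downsampling and one upsampling stage only, with padded convolutions in place of the cropping used in the original formulation; the backbone used in this work is deeper and wider):

import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions, each followed by a ReLU activation
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, n_classes=1):
        super().__init__()
        self.enc1 = double_conv(in_channels, 16)
        self.pool = nn.MaxPool2d(2)                                # downsample
        self.enc2 = double_conv(16, 32)                            # channels double as resolution halves
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)  # up-convolution
        self.dec1 = double_conv(32, 16)                            # 32 = 16 (skip) + 16 (upsampled)
        self.head = nn.Conv2d(16, n_classes, kernel_size=1)        # pixel-wise 1x1 convolution

    def forward(self, x):
        s1 = self.enc1(x)                    # contracting path
        b = self.enc2(self.pool(s1))         # bottleneck
        u = self.up(b)                       # expanding path
        u = torch.cat([s1, u], dim=1)        # skip connection (same spatial size here)
        return self.head(self.dec1(u))       # per-pixel class logits

# Example: a 2-channel input (image plus one feature map), 128x128 patch
net = TinyUNet(in_channels=2, n_classes=1)
mask_logits = net(torch.randn(1, 2, 128, 128))   # -> shape (1, 1, 128, 128)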
Furthermore, manual segmentation imparts clinician- structure, spatial sparsity, color homogeneity (with respect to the specific bias which reduces the reproducability of results, making background and ambient cells), as well as inconsistent shape and it difficult to verify novel techniques and claims [DvBB+ 21]. distribution across samples. Hence, various enhancements to the A thorough review of previous segmentation models, specif- pure U-Net model are necessary for reliable cilia segmentation. ically those using the same dataset, shows that current results are poor, impeding tasks further along the pipeline. For this Methodology study, model architectures utilize various methods of feature extraction that are hypothesized to improve the accuracy of a base The U-Net architecture is the backbone of the model due to its segmentation model, such as using zero-phased PCA maps and well-established performance in the biomedical image analysis Sparse Autoencoder reconstructions with various parameters as a domain. This paper focuses on extracting and highlighting the data augmentation tool. Various experiments with these methods underlying features in the image through various means. There- provide a summary of both qualitative and quantitative results fore, optimization of the U-Net backbone itself is not a major necessary in ascertaining the viability for such feature extractors consideration of this project. Indeed, the relative performance of to aid in segmentation. the various modified U-Nets sufficiently communicates the effi- cacy of the underlying methods. Each feature extraction method will map the underlying raw image to a corresponding feature Related Works map. To evaluate the usefulness of these feature maps, the model Lu et. al. (2018) utilized a Dense Net segmentation model as an concatenates these augmentations to the original image and use upstream to a CNN-based Long Short-Term Memory (LSTM) the aggregate data as input to a U-Net that is slightly modified to time-series model for classifying cilia based on spatiotemporal accept multiple input channels. patterns [LMZ+ 18]. While the model reports good classification The feature extractors of interest are Zero-phase PCA sphering accuracy and a high F-1 score, the underlying dataset only (ZCA) and a Sparse Autoencoder (SAE), on both of which the contains 75 distinct samples and the results must therefore be following subsections provide more detail. Roughly speaking, taken with great care. Furthermore, Lu et. al. did not report the these are both lossy, non-bijective transformations which map separate performance of the upstream segmentation network. Their a single image to a single feature map. In the case of ZCA, approach did, however, inspire the follow-up methodology of Zain empirically the feature maps tend to preserve edges and reduce et. al. (2020) for segmentation. In particular, they employ a Dense the rest of the image to arbitrary noise, thereby emphasizing local Net segmentation model as well, however they first augment the structure (since cell structure tends not to be well-preserved). The underlying images with the calculated optical flow. In this way, SAE instead acts as a harsh compression and filters out both linear their segmentation strategy employs both spatial and temporal and non-linear features, preserving global structure. Each extractor information. 
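Concretely, the multi-channel inputs described above amount to stacking each feature map with the raw crop along the channel axis; a small sketch with illustrative array names (assuming 128 × 128 grayscale crops held as NumPy arrays, with the ZCA and SAE outputs at the same resolution):

import numpy as np

def make_composite_input(img, zca_map=None, sae_map=None):
    # Stack the original crop with whichever feature maps are available
    channels = [img] + [m for m in (zca_map, sae_map) if m is not None]
    return np.stack(channels, axis=0)        # -> (C, 128, 128), with C in {1, 2, 3}

x = make_composite_input(np.random.rand(128, 128),
                         zca_map=np.random.rand(128, 128),
                         sae_map=np.random.rand(128, 128))
# x then feeds a U-Net whose first convolution expects C input channels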
To compare against [LMZ+ 18], the authors evaluated is evaluated by considering the performance of a U-Net model their segmentation model in the same way—as an upstream to trained on multi-channel inputs, where the first channel is the an CNN/LSTM classification network. Their model improved original image, and the second and/or third channels are the feature the classification accuracy two points above that of Charles et. maps extracted by these methods. In particular, the objective is for al. (2018). Their reported intersection-over-union (IoU) score is the doubly-augmented data, or the “composite” model, to achieve 33.06% and marks the highest performance achieved on this state-of-the-art performance on this challenging dataset. dataset. The ZCA implementation utilizes SciPy linear algebra solvers, One alternative segmentation model, often used in biomedical and both U-Net and SAE architectures use the PyTorch deep image processing and analysis, where labelled data sets are rela- learning library. Next, the evaluation stage employs canonical tively small, is the U-Net architecture (2) [RFB15]. Developed by segmentation quality metrics, such as the Jaccard score and Dice Ronneberger et. al., U-Nets consist of two parts: contraction and coefficient, on various models. When applied to the composite LOW LEVEL FEATURE EXTRACTION FOR CILIA SEGMENTATION 261 model, these metrics determine any potential improvements to the feature since often times in image analysis low eigenvalues (and state-of-the-art for cilia segmentation. the span of their corresponding eigenvectors) tend to capture high- frequency data. Such data is essential for tasks such as texture Cilia Data analysis, and thus tuning the value of ε helps to preserve this data. As in the Zain paper, the input data is a limited set of grayscale ZCA maps for various values of ε on a sample image are shown cilia imagery, from both healthy patients and those diagnosed with in figure 3. ciliopathies, with corresponding ground truth masks provided by experts. The images are cropped to 128 × 128 patches. The images are cropped at random coordinates in order to increase the size and variance of the sample space, and each image is cropped a number of times proportional its resolution. Additionally, crops that contain less than fifteen percent cilia are excluded from the Fig. 3: Comparison of ZCA maps on a cilia sample image with various training/test sets. This method increases the size of the training levels of ε. The original image is followed by maps with ε = 1e − 4, set from 253 images to 1409 images. Finally, standard minmax ε = 1e − 5, ε = 1e − 6, and ε = 1e − 7, from left to right. contrast normalization maps the luminosity to the interval [0, 1]. Zero-phase PCA sphering (ZCA) Sparse Autoencoder (SAE) The first augmentation of the underlying data concatenates the Similar in aim to ZCA, an SAE can augment the underlying input to the backbone U-Net model with the ZCA-transformed images to further filter and reduce noise while allowing the data. ZCA maps the underlying data to a version of the data that is construction and retention of potentially nonlinear spatial features. “rotated” through the dataspace to ensure certain spectral proper- Autoencoders are deep learning models that first compress data ties. ZCA in effect can implicitly normalize the data using the most into a low-level latent space and then attempt to reconstruct images significant (by empirical variance) spatial features present across from the low-level representation. 
SAEs in particular add an additional constraint, usually via the loss function, that encourages sparsity (i.e., less activation) in hidden layers of the network. Xu et al. use the SAE architecture for breast cancer nuclear detection and show that the architecture preserves essential, high-level, and often nonlinear aspects of the initial imagery—even when unlabelled—such as shape and color [XXL+ 16]. An adaptation of the first two terms of their loss function enforces sparsity:

L_SAE(θ) = (1/N) Σ_{k=1}^{N} L(x^(k), d_θ̂(e_θ̌(x^(k)))) + α (1/n) Σ_{j=1}^{n} KL(ρ || ρ̂_j).

The first term is a standard reconstruction loss (mean squared error), whereas the latter is the mean Kullback-Leibler (KL) divergence between ρ̂, the activation of a neuron in the encoder, and ρ, the enforced activation. For the case of experiments performed here, ρ = 0.05 remains constant but values of α vary, specifically 1e−2, 1e−3, and 1e−4, for each of which a static dataset is created for feeding into the segmentation model. Larger α prioritizes sparsity over reconstruction accuracy, which, to an extent, is hypothesized to retain significant low-level features of the cilia. Reconstructions with various values of α are shown in figure 4.

Fig. 4: Comparison of SAE reconstructions from different training instances with various levels of α (the activation loss weight). From left to right: original image, α = 1e−2 reconstruction, α = 1e−3 reconstruction, α = 1e−4 reconstruction.

A significant amount of freedom can be found in potential

the dataset. Given a matrix X with rows representing samples and columns for each feature, a sphering (or whitening) transformation W is one which decorrelates X. That is, the covariance of WX must be equal to the identity matrix. By the spectral theorem, the symmetric matrix XX^T—the covariance matrix corresponding to the data, assuming the data is centered—can be decomposed into PDP^T, where P is an orthogonal matrix of eigenvectors and D a diagonal matrix of corresponding eigenvalues of the covariance matrix. ZCA uses the sphering matrix W = PD^(−1/2)P^T and can be thought of as a transformation into the eigenspace of its covariance matrix—projection onto the data's principal axes, as the minimal projection residual is onto the axes with maximal variance—followed by normalization of variance along every axis and rotation back into the original image space. In order to reduce the amount of two-way correlation in images, Krizhevsky applies ZCA whitening to preprocess CIFAR-10 data before classification and shows that this process nicely preserves features, such as edges [LjWD19].

This ZCA implementation uses the Python SciPy library (SciPy), which builds on top of low-level hardware-optimized routines such as BLAS and LAPACK to efficiently calculate many linear algebra operations. In particular, these experiments implement ZCA as a generalized whitening technique. While the normal ZCA calculation selects a whitening matrix W = PD^(−1/2)P^T, a more applicable alternative is W = P(D + εI)^(−1/2)P^T, where ε is a hyperparameter which attenuates eigenvalue sensitivity. This new "whitening" is actually not a proper whitening since it does not guarantee an identity covariance matrix. It does however serve a similar purpose and actually lends some benefits.

Most importantly, it is indeed a generalization of canonical ZCA. That is to say, ε = 0 recovers canonical ZCA, and λ → √(1/λ) provides the spectrum of W on the eigenvalues. Otherwise, ε > 0 results in the map λ → √(1/(λ + ε)).
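A compact sketch of this generalized whitening, assuming X holds one centered, flattened patch per row and using SciPy's symmetric eigendecomposition (array sizes below are illustrative only):

import numpy as np
from scipy.linalg import eigh

def zca_whiten(X, eps=1e-4):
    # X: (n_samples, n_features), already centered column-wise
    cov = (X.T @ X) / X.shape[0]                     # empirical covariance
    d, P = eigh(cov)                                 # cov = P diag(d) P^T
    W = P @ np.diag(1.0 / np.sqrt(d + eps)) @ P.T    # W = P (D + eps I)^(-1/2) P^T
    return X @ W                                     # eps = 0 recovers canonical ZCA

# Toy sizes for illustration; in practice each row would be a flattened 128x128 crop
X = np.random.rand(500, 32 * 32)
X = X - X.mean(axis=0)
X_zca = zca_whiten(X, eps=1e-4)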
In this case, while all eigenvalues architectural choices for SAE. A focus on low-medium complexity map to smaller values compared to the original map, the smallest models both provides efficiency and minimizes overfitting and ar- eigenvalues map to significantly smaller values compared to the tifacts as consequence of degenerate autoencoding. One important original map. This means that ε serves to “dampen” the effects danger to be aware of is that SAEs—and indeed, all AEs—are of whitening for particularly small eigenvalues. This is a valuable at risk of a degenerate solution wherein a sufficiently complex 262 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 6: Artifacts generated during the training of U-Net. From left to right: original image, generated segmentation mask (pre-threshold), ground-truth segmentation mask Fig. 5: Illustration and pseudocode for Spatial Broadcast Decoding [WMBL19] Fig. 7: Artifacts generated during the training of ZCA+U-Net. From decoder essentially learns to become a hashmap of arbitrary (and left to right: original image, ZCA-mapped image, generated segmen- potentially random) encodings. tation mask (pre-threshold), ground-truth segmentation mask The SAE will therefore utilize a CNN architecture, as op- posed to more modern transformer-style architectures, since the figure 9 was taken only 10 epochs into the training process. simplicity and induced spatial bias provide potent defenses against Notably, this model, the composite pipeline, produced usable overfitting and mode collapse. Furthermore the encoder will use artifacts in mere minutes of training, whereas other models did Spatial Broadcast Decoding (SBD) which provides a method for not produce similar results until after about 10-40 epochs. decoding from a latent vector using size-preserving convolutions, Figure 10 provides a summary of experiments performed with thereby preserving the spatial bias even in decoding, and eliminat- SAE and ZCA augmented data, along with a few composite models ing the artifacts generated by alternate decoding strategies such as and a base U-Net for comparison. These models were produced “transposed” convolutions [WMBL19]. with data augmentation at various values of α (for the Sparse Autoencoder loss function) and ε (for ZCA) discussed above. Spatial Broadcast Decoding (SBD) While the table provides five metrics, those of primary importance Spatial Broadcast Decoding provides an alternative method from are the Intersection over Union (IoU), or Jaccard Score, as well ”transposed” (or ”skip”) convolutions to upsample images in the as the Dice (or F1) score, which are the most commonly used decoder portion of CNN-based autoencoders. Rather than main- metrics for evaluating the performance of segmentation models. taining the square shape, and hence associated spatial properties, Most feature extraction models at least marginally improve the of the latent representation, the output of the encoder is reshaped performance in of the U-Net in terms of IoU and Dice scores, into a single one-dimensional tensor per input image, which is then and the best-performing composite model (with ε of 1e − 4 tiled to the shape of the desired image (in this case, 128 × 128). for ZCA and α of 1e − 3 for SAE) provide an improvement In this way, the initial dimension of the latent vector becomes of approximately 10% from the base U-Net in these metrics. 
the number of input channels when fed into the decoder, and two There does not seem to be an obvious correlation between which additional channels are added to represent 2-dimensional spatial feature extraction hyperparameters provided the best performance coordinates. In its initial publication, SBD has been shown to pro- for individual ZCA+U-Net and SAE+U-Net models versus those vide effective results in disentangling latent space representations for the composite pipeline, but further experiments may assist in in various autoencoder models. analyzing this possibility. The base U-Net does outperform the others in precision, U-Net All models use a standard U-Net and undergo the same training process to provide a solid basis for analysis. Besides the number of input channels to the initial model (1 plus the number of augmentation channels from SAE and ZCA, up to 3 total chan- nels), the model architecture is identical for all runs. A single- channel (original image) U-Net first trains as a basis point for analysis. The model trains on two-channel inputs provided by Fig. 8: Artifacts generated during the training of SAE+U-Net. From ZCA (original image concatenated with the ZCA-mapped one) left to right: original image, SAE-reconstructed image, generated with various ε values for the dataset, and similarly SAE with segmentation mask (pre-threshold), ground-truth segmentation mask various α values, train the model. Finally, composite models train with a few combinations of ZCA and SAE hyperparameters. Each training process uses binary cross entropy loss with a learning rate of 1e − 3 for 225 epochs. Results Fig. 9: Artifacts generated 10 epochs into the training of the compos- Figures 6, 7, 8, and 9 show masks produced on validation data ite U-Net. From left to right: original image, ZCA-mapped image, from instances of the four model types. While the former three SAE-mapped image, generated segmentation mask (pre-threshold), show results near the end of training (about 200-250 epochs), ground-truth segmentation mask LOW LEVEL FEATURE EXTRACTION FOR CILIA SEGMENTATION 263 Extractor Parameters Scores Implications internal to other projects within the research group Model ε (ZCA) α (SAE) IoU Accuracy Recall Dice Precision sponsoring this research are clear. As discussed earlier, later U-Net (base) — — 0.399 0.759 0.501 0.529 0.692 pipelines of ciliary representation and modeling are currently 1e − 4 — 0.395 0.754 0.509 0.513 0.625 being bottlenecked by the poor segmentation masks produced by 1e − 5 — 0.401 0.732 0.563 0.539 0.607 base U-Nets, and the under-segmented predictions provided by ZCA + U-Net 1e − 6 — 0.408 0.756 0.543 0.546 0.644 1e − 7 — 0.419 0.758 0.563 0.557 0.639 the original model limits the scope of what these later stages — 1e − 2 0.380 0.719 0.568 0.520 0.558 may achieve. Better predictions hence tend to transfer to better SAE + U-Net — 1e − 3 0.398 0.751 0.512 0.526 0.656 downstream results. — 1e − 4 0.416 0.735 0.607 0.555 0.603 These results also have significant implications outside of the 1e − 4 1e − 2 0.401 0.761 0.506 0.521 0.649 1e − 4 1e − 3 0.441 0.767 0.580 0.585 0.661 specific task of cilia segmentation and modeling. The inherent 1e − 4 1e − 4 0.305 0.722 0.398 0.424 0.588 problem that motivated an introduction of feature extraction into 1e − 5 1e − 2 0.392 0.707 0.624 0.530 0.534 1e − 5 1e − 3 0.413 0.770 0.514 0.546 0.678 the segmentation process was the poor quality of the given dataset. 
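Both headline metrics reduce to simple overlap measures between the thresholded prediction and the ground-truth mask; a minimal sketch of how they can be computed:

import numpy as np

def iou_and_dice(pred, truth, threshold=0.5, eps=1e-8):
    # pred: raw U-Net output in [0, 1]; truth: binary ground-truth mask
    p = (pred >= threshold).astype(bool)
    t = truth.astype(bool)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = inter / (union + eps)                      # Intersection over Union (Jaccard score)
    dice = 2 * inter / (p.sum() + t.sum() + eps)     # Dice coefficient (F1 score)
    return iou, dice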
1e − 5 1e − 4 0.413 0.751 0.565 0.550 0.619 From occlusion to poor lighting to blurred images, these are Composite 1e − 6 1e − 2 0.392 0.719 0.602 0.527 0.571 problems that typically plague segmentation models in the real 1e − 6 1e − 3 0.395 0.759 0.480 0.521 0.711 1e − 6 1e − 4 0.405 0.729 0.587 0.545 0.591 world, where data sets are not of ideal quality. For many modern 1e − 7 1e − 2 0.383 0.753 0.487 0.503 0.655 computer vision tasks, segmentation is a necessary technique to 1e − 7 1e − 3 0.380 0.736 0.526 0.519 0.605 1e − 7 1e − 4 0.293 0.674 0.445 0.418 0.487 begin analysis of certain objects in an image, including any forms of objects from people to vehicles to landscapes. Many images Fig. 10: A summary of segmentation scores on test data for a base for these tasks are likely to come from low-resolution imagery, U-Net model, ZCA+U-Net, SAE+U-Net, and a composite model, with whether that be satellite data or security cameras, and are likely various feature extraction hyperparameters. The best result for each to face similar problems as the given cilia dataset in terms of scoring metric is in bold. image quality. Even if this is not the case, manual labelling, like that of this dataset and convenient in many other instances, is Input Images Predicted Masks Original ZCA SAE Ground Truth Base U-Net ZCA + U-Net SAE + U-Net Composite prone to error and is likely to bottleneck results. As experiments have shown, feature extraction through SAE and ZCA maps are a potential avenue for improvement of such models and would be an interesting topic to explore on other problematic datsets. Especially compelling, aside from the raw numeric results, is how soon composite pipelines began to produce usable masks on training data. As discussed earlier, most original U-Net models would take at least 40-50 epochs before showing any accurate predictions on training data. However, when feeding in composite Fig. 11: Comparison of predicted masks and ground truth for three SAE and ZCA data along with the original image, unusually test images. ZCA mapped images with ε = 1e − 4 and SAE reconstruc- accurate masks were produced within just a couple minutes, with tions with α = 1e − 3 are used where applicable. usable results at 10 epochs. This has potential implications in scenarios such as one-shot and/or unsupervised learning, where models cannot train over a large datset. however. Analysis of predicted masks from various models, some of which are shown in figure 11, shows that the base U-Net Future Research model tends to under-predict cilia, explaining the relatively high While this work establishes a primary direction and a novel precision. Previous endeavors in cilia segmentation also revealed perspective for segmenting cilia, there are many interesting and this pattern. valuable directions for future planned research. In particular, a novel and still-developing alternative to the convolution layer known as a Sharpened Cosine Similarity (SCS) layer has begun to attract some attention. While regular CNNs are proficient at Conclusions filtering, developing invariance to certain forms of noise and This paper highlights the current shortcomings of automated, perturbation, they are notoriously poor at serving as a spatial deep-learning based segmentation models for cilia, specifically indicator for features. 
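As a rough illustration of this idea, one commonly used formulation of sharpened cosine similarity (following write-ups such as [Pis22]; a sketch only, not necessarily the exact layer the authors intend to adopt) compares a patch against a kernel as follows:

import torch

def sharpened_cosine_similarity(patch, kernel, p=2.0, q=1e-3):
    # patch, kernel: 1-d tensors of equal length (a flattened window and a flattened kernel)
    dot = torch.dot(patch, kernel)
    scale = (patch.norm() + q) * (kernel.norm() + q)   # q guards against near-zero norms
    cos = dot / scale                                  # magnitude-invariant similarity
    return torch.sign(cos) * cos.abs() ** p            # exponent p sharpens the response

Because the response depends on the relative, not absolute, magnitudes of patch and kernel, activations track the spatial distribution of features rather than raw luminosity.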
Convolution activations can be high due to on the data provided to the Quinn Research Group, and provides changes in luminosity and do not necessarily imply the distribu- two additional methods, Zero-Phase PCA Sphering (ZCA) and tion of the underlying luminosity, therefore losing precise spatial Sparse Autoencoders (SAE), for performing feature extracting information. By design, SCS avoids these faults by considering augmentations with the purpose of aiding a U-Net model in the mathematical case of a “normalized” convolution, wherein segmentation. An analysis of U-Nets with various combinations neither the magnitude of the input, nor of the kernel, affect the final of these feature extraction and parameters help determine the output. Instead, SCS activations are dictated purely by the relative feasibility for low-level feature extraction in improving cilia seg- magnitudes of weights in the kernel, which is to say by the spatial mentation, and results from initial experiments show up to 10% distribution of features in the input [Pis22]. Domain knowledge increases in relevant metrics. suggests that cilia, while able to vary greatly, all share relatively While these improvements, in general, have been marginal, unique spatial distributions when compared to non-cilia such as these results show that pre-segmentation based feature extraction cells, out-of-phase structures, microscopy artifacts, etc. Therefore, methods, particularly the avenues explored, provide a worthwhile SCS may provide a strong augmentation to the backbone U- path of exploration and research for improving cilia segmentation. Net model by acting as an additional layer in tandem with the 264 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) already existing convolution layers. This way, the model is a true [LWL+ 18] Fangzhao Li, Changjian Wang, Xiaohui Liu, Yuxing Peng, and generalization of the canonical U-Net and is less likely to suffer Shiyao Jin. A composite model of wound segmentation based on traditional methods and deep neural networks. Computational poor performance due to the introduction of SCS. intelligence and neuroscience, 2018, 2018. doi:10.1155/ Another avenue of exploration would be a more robust ablation 2018/4149103. study on some of the hyperparameters of the feature extractors [Pis22] Raphael Pisonir. Sharpened cosine distance as an alternative for convolutions, Jan 2022. URL: https://www.rpisoni.dev. used. While most of the hyperparameters were chosen based on [QZD 15] Shannon P Quinn, Maliha J Zahid, John R Durkin, Richard J + either canonical choices [XXL+ 16] or through empirical study Francis, Cecilia W Lo, and S Chakra Chennubhotla. Auto- (e.g. ε for ZCA whitening), a more comprehensive hyperparameter mated identification of abnormal respiratory ciliary motion in search would be worth consideration. This would be especially nasal biopsies. Science translational medicine, 7(299):299ra124 |–| 299ra124, 2015. doi:10.1126/scitranslmed. valuable for the composite model since the choice of most opti- aaa1233. mal hyperparameters is dependent on the downstream tasks and [RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- therefore may be different for the composite model than what was net: Convolutional networks for biomedical image segmentation. CoRR, 2015. doi:10.48550/arXiv.1505.04597. found for the individual models. [WMBL19] Nicholas Watters, Loïc Matthey, Christopher P. Burgess, and More robust data augmentation could additionally improve Alexander Lerchner. 
Spatial broadcast decoder: A simple archi- results. Image cropping and basic augmentation methods alone tecture for learning disentangled representations in vaes. CoRR, provided minor improvements of just the base U-Net from the 2019. doi:10.48550/arXiv.1901.07017. [XXL+ 16] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong state of the art. Regarding the cropping method, an upper threshold Wu, Jinghai Tang, and Anant Madabhushi. Stacked sparse au- for the percent of cilia per image may be worth implementing, toencoder (ssae) for nuclei detection on breast cancer histopathol- as cropped images containing over approximately 90% cilia pro- ogy images. IEEE Transactions on Medical Imaging, 35(1):119– 130, 2016. doi:10.1109/TMI.2015.2458702. duced poor results, likely due to a lack of surrounding context. [ZRS+ 20] Meekail Zain, Sonia Rao, Nathan Safir, Quinn Wyner, Isabella Additionally, rotations and lighting/contrast adjustments could Humphrey, Alex Eldridge, Chenxiao Li, BahaaEddin AlAila, further augment the data set during the training process. and Shannon Quinn. Towards an unsupervised spatiotemporal representation of cilia video using a modular generative pipeline. Re-segmenting the cilia images by hand, a planned endeavor, In Proceedings of the Python in Science Conference, 2020. will likely provide more accurate masks for the training process. doi:10.25080/majora-342d178e-017. This is an especially difficult task for the cilia dataset, as the poor lighting and focus even causes medical professionals to disagree on the exact location of cilia in certain instances. However, the re- search group associated with this paper is currently in the process of setting up a web interface for such professionals to ”vote” on segmentation masks. Additionally, it is likely worth experimenting with various thresholds for converting U-Net outputs into masks, and potentially some form of region growing to dynamically aid the process. Finally, it is possible to train the SAE and U-Net jointly as an end-to-end system. Current experimentation has foregone this path due to the additional computational and memory complexity and has instead opted for separate training to at least justify this direction of exploration. Training in an end-to-end fashion could lead to a more optimal result and potentially even an interesting latent representation of ciliary features in the image. It is worth noting that larger end-to-end systems like this tend to be more difficult to train and balance, and such architectures can fall into degenerate solutions more readily. R EFERENCES [DvBB+ 21] Cenna Doornbos, Ronald van Beek, Ernie MHF Bongers, Dorien Lugtenberg, Peter Klaren, Lisenka ELM Vissers, Ronald Roep- man, Machteld M Oud, et al. Cell-based assay for ciliopathy patients to improve accurate diagnosis using alpaca. Euro- pean Journal of Human Genetics, 29(11):1677 |–| 1689, 2021. doi:10.1038/s41431-021-00907-9. [Ish17] Takashi Ishikawa. Axoneme structure from motile cilia. Cold Spring Harbor perspectives in biology, 9(1):a028076, 2017. doi:10.1101/cshperspect.a028076. [LjWD19] Hui Li, Xiao jun Wu, and Tariq S. Durrani. Infrared and visible image fusion with resnet and zero-phase component analysis. In- frared Physics & Technology, 102:103039, 2019. doi:https: //doi.org/10.1016/j.infrared.2019.103039. [LMZ+ 18] Charles Lu, M. Marx, M. Zahid, C. W. Lo, Chakra Chennubhotla, and Shannon P. Quinn. Stacked neural networks for end-to- end ciliary motion analysis. CoRR, 2018. doi:10.48550/ arXiv.1803.07534.