Proceedings of the 21st Python in Science Conference

Edited by Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe.


SciPy 2022
Austin, Texas
July 11 - July 17, 2022



Copyright © 2022. The articles in the Proceedings of the Python in Science Conference are copyrighted and owned by their
original authors.
This is an open-access publication and is distributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are
credited.
For more information, please see: http://creativecommons.org/licenses/by/3.0/



ISSN: 2575-9752
https://doi.org/10.25080/majora-212e5952-046
Organization

Conference Chairs
    Jonathan Guyer, NIST
    Alexandre Chabot-Leclerc, Enthought, Inc.

Program Chairs
    Matt Haberland, Cal Poly
    Julie Hollek, Mozilla
    Madicken Munk, University of Illinois
    Guen Prawiroatmodjo, Microsoft Corp

Communications
    Arliss Collins, NumFOCUS
    Matt Davis, Populus
    David Nicholson, Embedded Intelligence

Birds of a Feather
    Andrew Reid, NIST
    Anastasiia Sarmakeeva, George Washington University

Proceedings
    Meghann Agarwal, Overhaul
    Chris Calloway, University of North Carolina
    Dillon Niederhut, Novi Labs
    David Shupe, Caltech's IPAC Astronomy Data Center

Financial Aid
    Scott Collis, Argonne National Laboratory
    Nadia Tahiri, Université de Montréal

Tutorials
    Mike Hearne, USGS
    Logan Thomas, Enthought, Inc.

Sprints
    Tania Allard, Quansight Labs
    Brigitta Sipőcz, Caltech/IPAC

Diversity
    Celia Cintas, IBM Research Africa
    Bonny P McClain, O'Reilly Media
    Fatma Tarlaci, OpenTeams

Activities
    Paul Anzel, Codecov
    Inessa Pawson, Albus Code

Sponsors
    Kristen Leiser, Enthought, Inc.

Financial
    Chris Chan, Enthought, Inc.
    Bill Cowan, Enthought, Inc.
    Jodi Havranek, Enthought, Inc.

Logistics
    Kristen Leiser, Enthought, Inc.
Proceedings Reviewers
    Aileen Nielsen
    Ajit Dhobale
    Alejandro Coca-Castro
    Alexander Yang
    Bhupendra A Raut
    Bradley Dice
    Brian Gue
    Cadiou Corentin
    Carl Simon Adorf
    Chen Zhang
    Chiara Marmo
    Chitaranjan Mahapatra
    Chris Calloway
    Daniel Wheeler
    David Nicholson
    David Shupe
    Dillon Niederhut
    Diptorup Deb
    Jelena Milosevic
    Michal Maciejewski
    Ed Rogers
    Himaghna Bhattacharjee
    Hongsup Shin
    Indraneil Paul
    Ivan Marroquin
    James Lamb
    Jyh-Miin Lin
    Jyotika Singh
    Karthik Murugadoss
    Kehinde Ajayi
    Kelly L. Rowland
    Kelvin Lee
    Kevin Maik Jablonka
    Kevin W. Beam
    Kuntao Zhao
    Maruthi NH
    Matt Craig
    Matthew Feickert
    Meghann Agarwal
    Melissa Weber Mendonça
    Onuralp Soylemez
    Rohit Goswami
    Ryan Bunney
    Shubham Sharma
    Siddhartha Srivastava
    Sushant More
    Tetsuo Koyama
    Thomas Nicholas
    Victoria Adesoba
    Vidhi Chugh
    Vivek Sinha
    Wenduo Zhou
    Zuhal Cakir
Accepted Talk Slides

    Building Binary Extensions with pybind11, scikit-build, and cibuildwheel, Henry Schreiner, Joe Rickerby, Ralf Grosse-Kunstleve, Wenzel Jakob, Matthieu Darbois, Aaron Gokaslan, Jean-Christophe Fillion-Robin, Matt McCormick
    doi.org/10.25080/majora-212e5952-033

    Python Development Schemes for Monte Carlo Neutronics on High Performance Computing, Jackson P. Morgan, Kyle E. Niemeyer
    doi.org/10.25080/majora-212e5952-034

    Awkward Packaging: Building Scikit-HEP, Henry Schreiner, Jim Pivarski, Eduardo Rodrigues
    doi.org/10.25080/majora-212e5952-035

    Development of Accessible, Aesthetically-Pleasing Color Sequences, Matthew A. Petroff
    doi.org/10.25080/majora-212e5952-036

    Cutting Edge Climate Science in the Cloud with Pangeo, Julius Busecke
    doi.org/10.25080/majora-212e5952-037

    Pylira: Deconvolution of Images in the Presence of Poisson Noise, Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk
    doi.org/10.25080/majora-212e5952-038

    Accelerating Science with the Generative Toolkit for Scientific Discovery (GT4SD), GT4SD team
    doi.org/10.25080/majora-212e5952-039

    mmodel: A Modular Modeling Framework for Scientific Prototyping, Peter Sun, John A. Marohn
    doi.org/10.25080/majora-212e5952-03a

    Monaco: Quantify Uncertainty and Sensitivities in Your Computational Models with a Monte Carlo Library, W. Scott Shambaugh
    doi.org/10.25080/majora-212e5952-03b

    UFuncs and DTypes: New Possibilities in NumPy, Sebastian Berg, Stéfan van der Walt
    doi.org/10.25080/majora-212e5952-03c

    Per Python ad Astra: Interactive Astrodynamics with poliastro, Juan Luis Cano Rodríguez
    doi.org/10.25080/majora-212e5952-03d

    pyampute: A Python Library for Data Amputation, Rianne M Schouten, Davina Zamanzadeh, Prabhant Singh
    doi.org/10.25080/majora-212e5952-03e

    Scientific Python: From GitHub to TikTok, Juanita Gomez Romero, Stéfan van der Walt, K. Jarrod Millman, Melissa Weber Mendonça, Inessa Pawson
    doi.org/10.25080/majora-212e5952-03f

    Scientific Python: By Maintainers, For Maintainers, Pamphile T. Roy, Stéfan van der Walt, K. Jarrod Millman, Melissa Weber Mendonça
    doi.org/10.25080/majora-212e5952-040

    Improving Random Sampling in Python: scipy.stats.sampling and scipy.stats.qmc, Pamphile T. Roy, Matt Haberland, Christoph Baumgarten, Tirth Patel
    doi.org/10.25080/majora-212e5952-041

    Petabyte-scale Ocean Data Analytics on Staggered Grids via the Grid Ufunc Protocol in xGCM, Thomas Nicholas, Julius Busecke, Ryan Abernathey
    doi.org/10.25080/majora-212e5952-042

Accepted Posters

    Optimal Review Assignments for the SciPy Conference Using Binary Integer Linear Programming in SciPy 1.9, Matt Haberland, Nicholas McKibben
    doi.org/10.25080/majora-212e5952-029

    Contributing to Open Source Software: From Not Knowing Python to Becoming a Spyder Core Developer, Daniel Althviz Moré
    doi.org/10.25080/majora-212e5952-02a

    Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Image Labeling, Nathan Jessurun, Olivia P. Dizon-Paradis, Dan E. Capecci, Damon L. Woodard, Navid Asadizanjani
    doi.org/10.25080/majora-212e5952-02b

    Bioframe: Operating on Genomic Interval Dataframes, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra Galitsyna, Anton Goloborodko, Maxim Imakaev, Trevor Manz, Sergey V. Venev
    doi.org/10.25080/majora-212e5952-02c

    Likeness: A Toolkit for Connecting the Social Fabric of Place to Human Dynamics, Joseph V. Tuccillo, James D. Gaboardi
    doi.org/10.25080/majora-212e5952-02d

    pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling, Jyotika Singh
    doi.org/10.25080/majora-212e5952-02e

    Kiwi: Python Tool for Tex Processing and Classification, Neelima Pulagam, Sai Marasani, Brian Sass
    doi.org/10.25080/majora-212e5952-02f

    Phylogeography: Analysis of Genetic and Climatic Data of SARS-CoV-2, Wanlin Li, Aleksandr Koshkarov, My-Linh Luu, Nadia Tahiri
    doi.org/10.25080/majora-212e5952-030

    Design of a Scientific Data Analysis Support Platform, Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams
    doi.org/10.25080/majora-212e5952-031

    Opening ARM: A Pivot to Community Software to Meet the Needs of Users and Stakeholders of the Planet's Largest Cloud Observatory, Zachary Sherman, Scott Collis, Max Grover, Robert Jackson, Adam Theisen
    doi.org/10.25080/majora-212e5952-032


SciPy Tools Plenaries

    SciPy Tools Plenary - CEL Team, Inessa Pawson
    doi.org/10.25080/majora-212e5952-043

    SciPy Tools Plenary on Matplotlib, Elliott Sales de Andrade
    doi.org/10.25080/majora-212e5952-044

    SciPy Tools Plenary - NumPy, Inessa Pawson
    doi.org/10.25080/majora-212e5952-045


Lightning Talks

    Downsampling Time Series Data for Visualizations, Delaina Moore
    doi.org/10.25080/majora-212e5952-027

    Analysis as Applications: Quick Introduction to Lockfiles, Matthew Feickert
    doi.org/10.25080/majora-212e5952-028
Scholarship Recipients

    Aman Goel, University of Delhi
    Anurag Saha Roy, Saarland University
    Isuru Fernando, University of Illinois at Urbana-Champaign
    Kelly Meehan, US Forest Service
    Kadambari Devarajan, University of Rhode Island
    Krishna Katyal, Thapar Institute of Engineering and Technology
    Matthew Murray, Dask
    Naman Gera, SymPy, LPython
    Rohit Goswami, University of Iceland
    Simon Cross, QuTiP
    Tanya Akumu, IBM Research
    Zuhal Cakir, Purdue University
Contents

The Advanced Scientific Data Format (ASDF): An Update                                                                  1
Perry Greenfield, Edward Slavich, William Jamieson, Nadia Dencheva

Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling                                           7
Nathan Jessurun, Daniel E. Capecci, Olivia P. Dizon-Paradis, Damon L. Woodard, Navid Asadizanjani

Galyleo: A General-Purpose Extensible Visualization Solution                                                          13
Rick McGeer, Andreas Bergen, Mahdiyar Biazi, Matt Hemmings, Robin Schreiber

USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application                                    22
Amanda Catlett, Theresa R. Coumbe, Scott D. Christensen, Mary A. Byrant

Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI                                                   26
Luigi Cruz, Wael Farah, Richard Elkins

Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration                  28
Pi-Yueh Chuang, Lorena A. Barba

atoMEC: An open-source average-atom Python code                                                                       37
Timothy J. Callow, Daniel Kotik, Eli Kraisler, Attila Cangi

Automatic random variate generation in Python                                                                         46
Christoph Baumgarten, Tirth Patel

Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger
Materials Suite                                                                                                      52
Alexandr Fonari, Farshad Fallah, Michael Rauch

A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space 60
Seyed Alireza Vaezi, Gianni Orlando, Mojtaba Fazli, Gary Ward, Silvia Moreno, Shannon Quinn

The myth of the normal curve and what to do about it                                                                  64
Allan Campopiano

Python for Global Applications: teaching scientific Python in context to law and diplomacy students                   69
Anna Haensch, Karin Knudson

Papyri: better documentation for the scientific ecosystem in Jupyter                                                  75
Matthias Bussonnier, Camille Carvalho

Bayesian Estimation and Forecasting of Time Series in statsmodels                                                     83
Chad Fulton

Python vs. the pandemic: a case study in high-stakes software development                                             90
Cliff C. Kerr, Robyn M. Stuart, Dina Mistry, Romesh G. Abeysuriya, Jamie A. Cohen, Lauren George, Michał
Jastrzebski, Michael Famulare, Edward Wenger, Daniel J. Klein

Pylira: deconvolution of images in the presence of Poisson noise                                                      98
Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk

Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels                                          105
Geoffrey M. Poore

Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder                110
Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn

Awkward Packaging: building Scikit-HEP                                                                               115
Henry Schreiner, Jim Pivarski, Eduardo Rodrigues
Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber                           121
Ido Michael

Likeness: a toolkit for connecting the social fabric of place to human dynamics                                    125
Joseph V. Tuccillo, James D. Gaboardi

poliastro: a Python library for interactive astrodynamics                                                          136
Juan Luis Cano Rodríguez, Jorge Martínez Garrido

A New Python API for Webots Robotics Simulations                                                                   147
Justin C. Fisher

pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling                             152
Jyotika Singh

Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2                                                159
Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri

Global optimization software library for research and education                                                    167
Nadia Udler

Temporal Word Embeddings Analysis for Disease Prevention                                                           171
Nathan Jacobi, Ivan Mo, Albert You, Krishi Kishore, Zane Page, Shannon P. Quinn, Tim Heckman

Design of a Scientific Data Analysis Support Platform                                                              179
Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams

The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python
Ecosystem                                                                                                     187
Orhan Eroglu, Anissa Zacharias, Michaela Sizemore, Alea Kootz, Heather Craker, John Clyne

popmon: Analysis Package for Dataset Shift Detection                                                               194
Simon Brugman, Tomas Sostak, Pradyot Patil, Max Baak

pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe   202
Willy Menacho, Gonzalo Marcelo Ramírez-Ávila, Horacio V. Guzman

Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods     210
Robert Jackson, Rebecca Gjini, Sri Hari Krishna Narayanan, Matt Menickelly, Paul Hovland, Jan Hückelheim, Scott
Collis

RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible       217
João Lemes Gribel Soares, Mateus Stano Junqueira, Oscar Mauricio Prada Ramirez, Patrick Sampaio dos Santos
Brandão, Adriano Augusto Antongiovanni, Guilherme Fernandes Alves, Giovani Hidalgo Ceotto

Wailord: Parsers and Reproducibility for Quantum Chemistry                                                         226
Rohit Goswami

Variational Autoencoders For Semi-Supervised Deep Metric Learning                                                  231
Nathan Safir, Meekail Zain, Curtis Godwin, Eric Miller, Bella Humphrey, Shannon P Quinn

A Python Pipeline for Rapid Application Development (RAD)                                                          240
Scott D. Christensen, Marvin S. Brown, Robert B. Haehnel, Joshua Q. Church, Amanda Catlett, Dallon C. Schofield,
Quyen T. Brannon, Stacy T. Smith

Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses                                  244
W. Scott Shambaugh

Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis                              251
Zachary del Rosario
Low Level Feature Extraction for Cilia Segmentation      259
Meekail Zain, Eric Miller, Shannon P Quinn, Cecilia Lo
The Advanced Scientific Data Format (ASDF): An Update

Perry Greenfield‡∗, Edward Slavich‡†, William Jamieson‡†, Nadia Dencheva‡†

Abstract—We report on progress in developing and extending the new (ASDF) format we have developed for the data from the James Webb and Nancy Grace Roman Space Telescopes since we reported on it at a previous SciPy. While the format was developed as a replacement for the long-standard FITS format used in astronomy, it is quite generic and not restricted to use with astronomical data. We will briefly review the format, and extensions and changes made to the standard itself, as well as to the reference Python implementation we have developed to support it. The standard itself has been clarified in a number of respects. Recent improvements to the Python implementation include an improved framework for conversion between complex Python objects and ASDF, better control of the configuration of extensions supported and versioning of extensions, tools for display and searching of the structured metadata, better developer documentation, tutorials, and a more maintainable and flexible schema system. This has included a reorganization of the components to make the standard free from astronomical assumptions. An important motivator for the format was the ability to support serializing functional transforms in multiple dimensions as well as expressions built out of such transforms, which has now been implemented. More generalized compression schemes are now enabled. We are currently working on adding chunking support and will discuss our plan for further enhancements.

Index Terms—data formats, standards, world coordinate systems, yaml

* Corresponding author: perry@stsci.edu
‡ Space Telescope Science Institute
† These authors contributed equally.

Copyright © 2022 Perry Greenfield et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The Advanced Scientific Data Format (ASDF) was originally developed in 2015. That original version was described in a paper [Gre15]. That paper described the shortcomings of the widely used astronomical standard format FITS [FIT16] as well as those of existing potential alternatives. It is not the goal of this paper to rehash those points in detail, though it is useful to summarize the basic points here. The remainder of this paper will describe where we are using ASDF, what lessons we have learned from using ASDF for the James Webb Space Telescope, and summarize the most important changes we have made to the standard, the Python library that we use to read and write ASDF files, and best practices for using the format.
    We will give an example of a more advanced use case that illustrates some of the powerful advantages of ASDF, and that its application is not limited to astronomy, but suitable for much of scientific and engineering data, as well as models. We finish by outlining our near term plans for further improvements and extensions.

Summary of Motivations

•  Suitable as an archival format:
   –  Old versions continue to be supported by libraries.
   –  Format is sufficiently transparent (e.g., not requiring extensive documentation to decode) for the fundamental set of capabilities.
   –  Metadata is easily viewed with any text editor.
•  Intrinsically hierarchical
•  Avoids duplication of shared items
•  Based on existing standard(s) for metadata and structure
•  No tight constraints on attribute lengths or their values
•  Clearly versioned
•  Supports schemas for validating files for basic structure and value requirements
•  Easily extensible, both for the standard, and for local or domain-specific conventions

Basics of ASDF Format

•  Format consists of a YAML header optionally followed by one or more binary blocks for containing binary data.
•  The YAML [http://yaml.org] header contains all the metadata and defines the structural relationship of all the data elements.
•  YAML tags are used to indicate to libraries the semantics of subsections of the YAML header that libraries can use to construct special software objects. For example, a tag for a data array would indicate to a Python library to convert it into a numpy array.
•  YAML anchors and aliases are used to share common elements to avoid duplication.
•  JSON Schema [http://json-schema.org/specification.html], [http://json-schema.org/understanding-json-schema/] is used for schemas to define expectations for tag content and whole headers, combined with tools to validate actual ASDF files against these schemas.
•  Binary blocks are referenced in the YAML to link binary data to YAML attributes.
•  Support for arrays embedded in YAML or in a binary block.
•  Streaming support for a single binary block.
•  Permit local definitions of tags and schemas outside of the standard.
•  While developed for astronomy, useful for general scientific or engineering use.
•  Aims to be language neutral.
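As a minimal sketch of how this layout is used in practice (the file name and tree contents below are invented for illustration), the reference Python library writes the metadata into the YAML header and places the array in a binary block:

import numpy as np
import asdf

# Plain values end up in the YAML header; the array is stored in a binary block.
tree = {
    "description": "minimal example",
    "data": np.arange(100, dtype="f4"),
}
asdf.AsdfFile(tree).write_to("example.asdf")

with asdf.open("example.asdf") as af:
    print(af["description"], af["data"].shape)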

Current and planned uses

James Webb Space Telescope (JWST)
NASA requires JWST data products be made available in the FITS format. Nevertheless, all the calibration pipelines operate on the data using internal objects very close to the ASDF representation. The JWST calibration pipeline uses ASDF to serialize data that cannot be easily represented in FITS, such as World Coordinate System information. The calibration software is also capable of reading and producing data products as pure ASDF files.

Nancy Grace Roman Space Telescope
This telescope, with the same mirror size as the Hubble Space Telescope (HST) but a much larger field of view than HST, will be launched in 2026 or thereabouts. It is to be used mostly in survey mode and is capable of producing very large mosaicked images. It will use ASDF as its primary data format.

Daniel K. Inouye Solar Telescope
This telescope is using ASDF for much of the early data products to hold the metadata for a combined set of data which can involve many thousands of files. Furthermore, the World Coordinate System information is stored using ASDF for all the referenced data.

Vera Rubin Telescope (for World Coordinate System interchange)

There have been users outside of astronomy using ASDF, as well as contributors to the source code.

Changes to the standard (completed and proposed)

These are based on lessons learned from usage. The current version of the standard is 1.5.0 (1.6.0 is being developed). The following items reflect areas where we felt improvements were needed.

Changes for 1.5

Moving the URI authority from stsci.edu to asdf-format.org
This is to remove the standard from close association with STScI and make it clear that the format is not intended to be controlled by one institution.

Moving astronomy-specific schemas out of the standard
These primarily affect the previous inclusion of World Coordinate Tags, which are strongly associated with astronomy. Remaining are those related to time and unit standards, both of obvious generality, but the implementation must be based on some standards, and currently the astropy-based ones are as good or better than any.

Changes for 1.6

Addition of the manifest mechanism
The manifest is a YAML document that explicitly lists the tags and other features introduced by an extension to the ASDF standard. It provides a more straightforward way of associating tags with schemas, allowing multiple tags to share the same schema, and generally making it simpler to visualize how tags and schemas are associated (previously these associations were implied by the Python implementation but were not documented elsewhere).
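The manifest itself is plain YAML. A rough, hypothetical sketch is shown below; the URIs are placeholders and only the commonly used fields appear (consult the asdf-standard documentation for the authoritative layout):

id: asdf://example.com/example-project/manifests/shapes-1.0.0
extension_uri: asdf://example.com/example-project/extensions/shapes-1.0.0
title: Example shape extension 1.0.0
description: Tags for example geometric objects.
tags:
- tag_uri: asdf://example.com/example-project/tags/rectangle-1.0.0
  schema_uri: asdf://example.com/example-project/schemas/rectangle-1.0.0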

Handling of null values and their interpretation
The standard didn't previously specify the behavior regarding null values. The Python library previously removed attributes from the YAML tree when the corresponding Python attribute had a None value upon writing to an ASDF file. On reading files where the attribute was missing but the schema indicated a default value, the library would create the Python attribute with the default. As mentioned in the next item, we no longer use this mechanism, and now when written, the attribute appears in the YAML tree with a null value if the Python value is None and the schema permits null values.

Interpretation of default values in schemas
The use of default values in schemas is discouraged since the interpretation by libraries is prone to confusion if the assemblage of schemas conflicts with regard to the default. We have stopped using defaults in the Python library and recommend that the ASDF file always be explicit about the value rather than imply it through the schema. If there are practical cases that preclude always writing out all values (e.g., they are only relevant to one mode and usually are irrelevant), it should be the library that manages whether such attributes are written conditionally, rather than using the schema default mechanism.

Add an alternative tag URI scheme
We now recommend that tag URIs begin with asdf://

Be explicit about what kinds of complex YAML keys are supported
Not all legal YAML keys are supported: YAML arrays are not (they are not hashable in Python), and neither are general YAML objects. The standard now limits keys to string, integer, or boolean types. If more complex keys are required, they should be encoded in strings, as illustrated below.
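For illustration, the following header fragment uses only key types the standard allows (the attribute names are made up for this example):

exposure_time: 21.5        # string key
7: "detector seven"        # integer key
true: "calibrated"         # boolean key
# A sequence or mapping key (e.g. `? [1, 2]`) is not permitted;
# encode it as a string instead:
"[1, 2]": "value"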

Still to be done

Upgrade to JSON Schema draft-07
There is interest in some of the new features of this version; however, this is problematic since there are aspects of this version that are incompatible with draft-04, thus requiring all previous schemas to be updated.

Replace the extensions section of the file history
This section is considered too specific to the concept of Python extensions, and is probably best replaced with a more flexible system for listing the extensions used.
Changes to the Python ASDF package

Easier and more flexible mechanism to create new extensions (2.8.0)
The previous system for defining extensions to ASDF, now deprecated, has been replaced by a new system that makes the association between tags, schemas, and conversion code more straightforward, provides more intuitive names for the methods and attributes, and makes it easier to handle reference cycles if they are present in the code (also added to the original tag handling classes).

Introduced global configuration mechanism (2.8.0)
This reworks how ASDF resources are located and makes it easier to update the current configuration, as well as to track down the location of the needed resources (e.g., schemas and converters). It also removes performance issues that previously required extracting information from all the resource files, slowing the first asdf.open call.

Added info/search methods and command line tools (2.6.0)
These allow displaying the hierarchical structure of the header and the values and types of the attributes. Initially, such introspection stopped at any tagged item. A subsequent change provides mechanisms to see into tagged items (next item). An example of these tools is shown in a later section.

Added mechanism for info to display tagged item contents (2.9.0)
This allows the library that converts the YAML to Python objects to expose a summary of the contents of the object by supplying an optional "dunder" method that the info mechanism can take advantage of.

Added documentation on how ASDF library internals work
These appear in the readthedocs under the heading "Developer Overview".

Plugin API for block compressors (2.8.0)
This enables a localized extension to support further compression options.

Support for asdf:// URI scheme (2.8.0)

Support for ASDF Standard 1.6.0 (2.8.0)
This is still subject to modifications to the 1.6.0 standard.

Modified handling of defaults in schemas and None values (2.8.0)
As described previously.

Using ASDF to store models

This section highlights one aspect of ASDF that few other formats support in an archival way, i.e., not using a language-specific mechanism such as Python's pickle. The astropy package contains a modeling subpackage that defines a number of analytical, as well as a few table-based, models that can be combined in many ways, such as arithmetically, in composition, or multi-dimensionally. Thus it is possible to define fairly complex multi-dimensional models, many of which can use the built-in fitting machinery.
    These models, and their compound constructs, can be saved in ASDF files and later read in to recreate the corresponding astropy objects that were used to create the entries in the ASDF file. This is made possible by the fact that expressions of models are straightforward to represent in YAML structure.
    Despite the fact that the models are in some sense executable, they are perfectly safe so long as the library they are implemented in is safe (e.g., it doesn't implement an "execute any OS command" model). Furthermore, the representation in ASDF does not explicitly use Python code. In principle it could be written or read in any computer language.
    The following illustrates a relatively simple but not trivial example. First we define a 1D model and plot it.

import numpy as np
import astropy.modeling.models as amm
import astropy.units as u
import asdf
from matplotlib import pyplot as plt

# Define 3 model components with units
g1 = amm.Gaussian1D(amplitude=100*u.Jy,
                    mean=120*u.MHz,
                    stddev=5.*u.MHz)
g2 = amm.Gaussian1D(65*u.Jy, 140*u.MHz, 3*u.MHz)
powerlaw = amm.PowerLaw1D(amplitude=10*u.Jy,
                          x_0=100*u.MHz,
                          alpha=3)
# Define a compound model
model = g1 + g2 + powerlaw
x = np.arange(50, 200) * u.MHz
plt.plot(x, model(x))

Fig. 1: A plot of the compound model defined in the first segment of code.

The following code will save the model to an ASDF file, and read it back in:

af = asdf.AsdfFile()
af.tree = {'model': model}
af.write_to('model.asdf')
af2 = asdf.open('model.asdf')
model2 = af2['model']
model2 is model
    False
model2(103.5 * u.MHz) == model(103.5 * u.MHz)
    True

Listing the relevant part of the ASDF file illustrates how the model has been saved in the YAML header (reformatted to fit in this paper column):

model: !transform/add-1.2.0
  forward:
  - !transform/add-1.2.0
    forward:
    - !transform/gaussian1d-1.0.0
      amplitude: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 Jy, value: 100.0}
      bounding_box:
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 92.5}
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 147.5}
      bounds:
        stddev: [1.1754943508222875e-38, null]
      inputs: [x]
      mean: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 120.0}
      outputs: [y]
      stddev: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 5.0}
    - !transform/gaussian1d-1.0.0
      amplitude: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 Jy, value: 65.0}
      bounding_box:
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 123.5}
      - !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 156.5}
      bounds:
        stddev: [1.1754943508222875e-38, null]
      inputs: [x]
      mean: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 140.0}
      outputs: [y]
      stddev: !unit/quantity-1.1.0
        {unit: !unit/unit-1.0.0 MHz, value: 3.0}
    inputs: [x]
    outputs: [y]
  - !transform/power_law1d-1.0.0
    alpha: 3.0
    amplitude: !unit/quantity-1.1.0
      {unit: !unit/unit-1.0.0 Jy, value: 10.0}
    inputs: [x]
    outputs: [y]
    x_0: !unit/quantity-1.1.0
      {unit: !unit/unit-1.0.0 MHz, value: 100.0}
  inputs: [x]
  outputs: [y]
...
Note that there are extra pieces of information that define the model more precisely. These include:

•  many tags indicating special items. These include different kinds of transforms (i.e., functions), quantities (i.e., numbers with units), units, etc.
•  definitions of the units used.
•  indications of the valid range of the inputs or parameters (bounds).
•  each function shows the mapping of the inputs and the naming of the outputs of each function.
•  the addition operator is itself a transform.

Without the use of units, the YAML would be simpler. But the point is that the YAML easily accommodates expression trees. The tags are used by the library to construct the astropy models, units, and quantities as Python objects. However, nothing in the above requires the library to be written in Python.
    This machinery can handle multidimensional models and supports both the combining of models with arithmetic operators as well as pipelining the output of one model into another. This system has been used to define complex coordinate transforms from telescope detectors to sky coordinates for imaging, and wavelengths for spectrographs, using over 100 model components, something that the FITS format had no hope of managing, nor any other scientific format that we are aware of.
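For example, astropy models can be chained as well as added, and such expressions serialize in the same way as the compound model shown above. A small sketch follows (the components and parameter values are arbitrary, and the same astropy ASDF support used in the earlier example is assumed):

import astropy.modeling.models as amm
import asdf

# '|' pipes the output of one model into the next; '+' adds model outputs.
pipeline = amm.Shift(-1000.0) | amm.Scale(0.05)
compound = pipeline + amm.Gaussian1D(amplitude=1.0, mean=0.0, stddev=1.0)

asdf.AsdfFile({"model": compound}).write_to("compound.asdf")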

Displaying the contents of ASDF files

Functionality has been added to display the structure and content of the header (including data item properties), with a number of options for what depth to display, how many lines to display, etc. An example of the info use is shown in Figure 2.
    There is also functionality to search for items in the file by attribute name and/or values, also using pattern matching for either. The search results are shown as attribute paths to the items that were found.
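With the Python library, these tools can be exercised roughly as follows (method names and keyword arguments are those of recent asdf releases and should be treated as a sketch; the attribute searched for is just an example):

import asdf

with asdf.open("model.asdf") as af:
    # Render the tree structure, limiting how many rows are shown.
    af.info(max_rows=30)

    # Search attribute names (and optionally values); patterns are accepted.
    print(af.search("stddev"))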
Fig. 2: Part of the output of the info command showing the structure of a Roman Space Telescope test file (provided by the Roman Telescopes Branch at STScI). Displayed is the relative depth of each item, its type, value, and a title extracted from the associated schema to be used as explanatory information.

ASDF Extension/Converter System

There are a number of components involved. Converters encapsulate the code that handles converting Python objects to and from their ASDF representation. These are classes that inherit from the basic Converter class and define two class attributes: tags and types, each of which is a list of the associated tag(s) and class(es) that the specific converter class will handle (each converter can handle more than one tag and more than one class). The ASDF machinery uses this information to map tags to converters when reading ASDF content, and to map types to converters when saving these objects to an ASDF file.
    Each converter class is expected to supply two methods, to_yaml_tree and from_yaml_tree, which construct the YAML content and convert the YAML content into Python class instances, respectively.
    A manifest file is used to associate tags and schema IDs so that if a schema has been defined, the ASDF content can be validated against it (as well as providing extra information for the ASDF content in the info command). Normally the converters and manifest are registered with the ASDF library using standard functions, and this registration is normally (but is not required to be) triggered by use of Python entry points defined in the setup.cfg file so that the extension is automatically recognized when the extension package is installed.
    One can of course write custom code to convert the contents of ASDF files however they want. The advantage of the tag/converter system is that the objects can be anywhere in the tree structure and be properly saved and recovered without any implied knowledge of what attribute or location the object is at. Furthermore, it brings with it the ability to validate the contents by use of schema files.
    Jupyter tutorials that show how to use converters can be found at:

•  https://github.com/asdf-format/tutorials/blob/master/Your_first_ASDF_converter.ipynb
•  https://github.com/asdf-format/tutorials/blob/master/Your_second_ASDF_converter.ipynb
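A minimal sketch of such a converter is shown here; the Rectangle class, tag URI, and registration details are hypothetical and stand in for a real extension package:

from asdf.extension import Converter


class Rectangle:
    """Toy user-defined type to be serialized."""
    def __init__(self, width, height):
        self.width = width
        self.height = height


class RectangleConverter(Converter):
    # Tag(s) handled by this converter and the Python type(s) it serializes.
    tags = ["asdf://example.com/shapes/tags/rectangle-1.0.0"]
    types = [Rectangle]

    def to_yaml_tree(self, obj, tag, ctx):
        # Build the YAML-serializable representation of the object.
        return {"width": obj.width, "height": obj.height}

    def from_yaml_tree(self, node, tag, ctx):
        # Reconstruct the Python object from the YAML node.
        return Rectangle(node["width"], node["height"])

# In a real package, this converter would be bundled into an extension and
# registered via an entry point (see the tutorials linked above).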


ASDF Roadmap for STScI Work

The planned enhancements to ASDF are understandably focused on the needs of STScI missions. Nevertheless, we are particularly interested in areas that have wider benefit to the general scientific and engineering community, and such considerations increase the priority of items necessary to STScI. Furthermore, we are eager to aid others working on ASDF by providing advice, reviews, and possibly collaborative coding effort. STScI is committed to the long-term support of ASDF.
    The following is a list of planned work, in order of decreasing priority.

Chunking Support

Since the Roman mission is expected to deal with large data sets and mosaicked images, support for chunking is considered essential. We expect to layer the support in our Python library on zarr [https://zarr.dev/], with two different representations: one where all data is contained within the ASDF file in separate blocks, and one where the blocks are saved in individual files. Both representations have important advantages and use cases.
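As a rough illustration of the kind of chunked layout zarr provides (this is plain zarr usage with arbitrary names and chunk sizes, not the planned ASDF integration):

import numpy as np
import zarr

# Each 1000x1000 chunk is stored as a separate object, which maps naturally
# onto either separate ASDF blocks or separate files.
z = zarr.open("mosaic.zarr", mode="w", shape=(20000, 20000),
              chunks=(1000, 1000), dtype="f4")
z[:1000, :1000] = np.random.random((1000, 1000))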

Improvements to binary block management

These enhancements are needed to enable better chunking support and other capabilities.

Redefining versioning semantics

Previously the meaning of the different levels of versioning was unclear. The normal inclination is to treat schema versions using the typical semantic versioning system defined for software. But schemas are not software, and we are inclined to use the SchemaVer system proposed for schemas [https://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/]. To summarize, in this case the three levels of versioning correspond to Model.Revision.Addition, where a schema change at each level:

•  [Model] prevents working with historical data
•  [Revision] may prevent working with historical data
•  [Addition] is compatible with all historical data

Integration into astronomy display tools

It is essential that astronomers be able to visualize the data contained within ASDF files conveniently using the commonly available tools, such as SAOImage DS9 [Joy03] and Ginga [Jes13].
Cloud optimized storage

Much of the future data processing for STScI is expected to be performed on the cloud, so having ASDF efficiently support such uses is important. An important element of this is making the format work efficiently with object storage services such as AWS S3 and Google Cloud Storage.

IDL support

While Python is rapidly surpassing the use of IDL in astronomy, there is still much IDL code being used, and many of those still using IDL are in more senior and thus influential positions (they aren't quite dead yet). So making ASDF data at least readable to IDL is a useful goal.

Support Rice compression

Rice compression [Pen09], [Pen10] has proven a useful lossy compression algorithm for astronomical imaging data. Supporting it will be useful to astronomers, particularly for downloading large imaging data sets.

Pandas DataFrame support

Pandas [McK10] has proven to be a useful tool to many astronomers, as well as many in the sciences and engineering, so support will enhance the uptake of ASDF.

Compact, easy-to-read schema summaries

Most scientists and even scientific software developers tend to find JSON Schema files tedious to interpret. A more compact and intuitive rendering of the contents would be very useful.

Independent implementation
Having ASDF accepted as a standard data format requires a library
that is divorced from a Python API. Initially this can be done most
easily by layering it on the Python library, but ultimately there
should be an independent implementation which includes support
for C/C++ wrappers. This is by far the item that will require the
most effort, and would benefit from outside involvement.

Provide interfaces to other popular packages
This is a catch all for identifying where there would be significant
advantages to providing the ability to save and recover information
in the ASDF format as an interchange option.

Sources of Information

•  ASDF Standard: https://asdf-standard.readthedocs.io/en/latest/
•  Python ASDF package documentation: https://asdf.readthedocs.io/en/stable/
•  Repository: https://github.com/asdf-format/asdf
•  Tutorials: https://github.com/asdf-format/tutorials

References

[Gre15] P. Greenfield, M. Droettboom, E. Bray. ASDF: A new data format for astronomy, Astronomy and Computing, 12:240-251, September 2015. https://doi.org/10.1016/j.ascom.2015.06.004
[FIT16] FITS Working Group. Definition of the Flexible Image Transport System, International Astronomical Union, http://fits.gsfc.nasa.gov/fits_standard.html, July 2016.
[Jes13] E. Jeschke. Ginga: an open-source astronomical image viewer and toolkit, Proc. of the 12th Python in Science Conference, p58-64, January 2013. https://doi.org/10.25080/Majora-8b375195-00a
[McK10] W. McKinney. Data structures for statistical computing in Python, Proceedings of the 9th Python in Science Conference, p56-61, 2010. https://doi.org/10.25080/Majora-92bf1922-00a
[Pen09] W. Pence, R. Seaman, R. L. White. Lossless Astronomical Image Compression and the Effects of Noise, Publications of the Astronomical Society of the Pacific, 121:414-427, April 2009. https://doi.org/10.48550/arXiv.0903.2140
[Pen10] W. Pence, R. L. White, R. Seaman. Optimal Compression of Floating-Point Astronomical Images Without Significant Loss of Information, Publications of the Astronomical Society of the Pacific, 122:1065-1076, September 2010. https://doi.org/10.1086/656249
[Joy03] W. A. Joye, E. Mandel. New Features of SAOImage DS9, Astronomical Data Analysis Software and Systems XII ASP Conference Series, 295:489, 2003.
Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling

Nathan Jessurun‡∗, Daniel E. Capecci‡, Olivia P. Dizon-Paradis‡, Damon L. Woodard‡, Navid Asadizanjani‡

Abstract—Most semantic image annotation platforms suffer severe bottlenecks when handling large images, complex regions of interest, or numerous distinct foreground regions in a single image. We have developed the Semi-Supervised Semantic Annotator (S3A) to address each of these issues and facilitate rapid collection of ground truth pixel-level labeled data. Such a feat is accomplished through a robust and easy-to-extend integration of arbitrary Python image processing functions into the semantic labeling process. Importantly, the framework devised for this application allows easy visualization and machine learning prediction of arbitrary formats and amounts of per-component metadata. To our knowledge, the ease and flexibility offered are unique to S3A among all open-source alternatives.

Index Terms—Semantic annotation, Image labeling, Semi-supervised, Region of interest

* Corresponding author: njessurun@ufl.edu
‡ University of Florida

Copyright © 2022 Nathan Jessurun et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Labeled image data is essential for training, tuning, and evaluating the performance of many machine learning applications. Such labels are typically defined with simple polygons, ellipses, and bounding boxes (i.e., "this rectangle contains a cat"). However, this approach can misrepresent more complex shapes with holes or multiple regions, as shown later in Figure 9. When high accuracy is required, labels must be specified at or close to the pixel level - a process known as semantic labeling or semantic segmentation. A detailed description of this process is given in [CZF+18]. Examples can readily be found in several popular datasets such as COCO, depicted in Figure 1.

Fig. 1. Common use cases for semantic segmentation involve relatively few foreground objects, low-resolution data, and limited complexity per object. Images retrieved from https://cocodataset.org/#explore.

    Semantic segmentation is important in numerous domains including printed circuit board assembly (PCBA) inspection (discussed later in the case study) [PJTA20], [AML+19], quality control during manufacturing [FRLL18], [AVK+01], [AAV+02], manuscript restoration / digitization [GNP+04], [KBO16], [JB92], [TFJ89], [FNK92], and effective patient diagnosis [SKM+10], [RLO+17], [YPH+06], [IGSM14]. In all these cases, imprecise annotations severely limit the development of automated solutions and can decrease the accuracy of standard trained segmentation models.
    Quality semantic segmentation is difficult due to a reliance on large, high-quality datasets, which are often created by manually labeling each image. Manual annotation is error-prone, costly, and greatly hinders scalability. As such, several tools have been proposed to alleviate the burden of collecting these ground-truth labels [itL18]. Unfortunately, existing tools are heavily biased toward lower-resolution images with few regions of interest (ROI), similar to Figure 1. While this may not be an issue for some datasets, such assumptions are crippling for high-fidelity images with hundreds of annotated ROIs [LSA+10], [WYZZ09].
    With improving hardware capabilities and increasing need for high-resolution ground truth segmentation, there is a continually growing number of applications that require high-resolution imaging with the previously described characteristics [MKS18], [DS20]. In these cases, the existing annotation tooling greatly impacts productivity due to the previously referenced assumptions and lack of support [Spa20].
    In response to these bottlenecks, we present the Semi-Supervised Semantic Annotator (S3A) annotation and prototyping platform -- an application which eases the process of pixel-level labeling in large, complex scenes. Its graphical user interface is shown in Figure 2. The software includes live app-level property customization, real-time algorithm modification and feedback, region prediction assistance, constrained component table editing based on allowed data types, various data export formats, and a highly adaptable set of plugin interfaces for domain-specific extensions to S3A. Beyond software improvements, these features play significant roles in bridging the gap between human annotation efforts and scalable, automated segmentation methods [BWS+10].



Fig. 3. S3A can iteratively annotate, evaluate, and update its internals in real time (cycle: semi-supervised labeling, generate training data, update models, improve segmentation techniques).



Fig. 2. S3A's interface. The main view consists of an image to annotate, a component table of prior annotations, and a toolbar which changes functionality depending on context.

Application Overview

Design decisions throughout S3A's architecture have been driven by the following objectives:

    •   Metadata should have significance rather than be treated as an afterthought,
    •   High-resolution images should have minimal impact on the annotation workflow,
    •   ROI density and complexity should not limit annotation workflow, and
    •   Prototyping should not be hindered by application complexity.

These motives were selected upon noticing the general lack of solutions for related problems in previous literature and tooling. Moreover, applications that do address multiple aspects of complex region annotation often require an enterprise service and cannot be accessed under open-source policies.

While the first three points are highlighted in the case study, the subsections below outline pieces of S3A's architecture that prove useful for iterative algorithm prototyping and dataset generation as depicted in Figure 3. Note that beyond the facets illustrated here, S3A possesses multiple additional characteristics as outlined in its documentation (https://gitlab.com/s3a/s3a/-/wikis/docs/User's-Guide).

Processing Framework

At the root of S3A's functionality and configurability lies its adaptive processing framework. Functions exposed within S3A are thinly wrapped using a Process structure responsible for parsing signature information to provide documentation, parameter information, and more to the UI. Hence, all graphical depictions are abstracted beyond the concern of the user while remaining trivial to specify (but can be modified or customized if desired). As a result, incorporating additional/customized application functionality can require as little as one line of code. Processes interface with PyQtGraph parameters to gain access to data-customized widget types and more (https://github.com/pyqtgraph/pyqtgraph).

These processes can also be arbitrarily nested and chained, which is critical for developing hierarchical image processing models, an example of which is shown in Figure 4. This framework is used for all image and region processing within S3A. Note that for image processes, each portion of the hierarchy yields intermediate outputs to determine which stage of the process flow is responsible for various changes. This, in turn, reduces the effort required to determine which parameters must be adjusted to achieve optimal performance.
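As an illustration of the signature-parsing idea only (not S3A's actual Process implementation; the helper and example function below are hypothetical), Python's inspect module is enough to recover tunable parameters and their defaults from a plain function, which is the information a UI needs to build a settings widget automatically:

import inspect

def make_parameter_spec(func):
    # Collect {name: default} for every parameter with a default value;
    # arguments without defaults are treated as runtime inputs (e.g. the image).
    spec = {}
    for name, param in inspect.signature(func).parameters.items():
        if param.default is not inspect.Parameter.empty:
            spec[name] = param.default
    return spec

def region_grow(image, seed, threshold=10, connectivity=4):
    ...  # image processing body omitted

print(make_parameter_spec(region_grow))  # {'threshold': 10, 'connectivity': 4}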
Plugins for User Extensions

The previous section briefly described how custom user functions are easily wrapped within a process, exposing their parameters within S3A in a GUI format. A rich plugin interface is built on top of this capability in which custom functions, table field predictors, default action hooks, and more can be directly integrated into S3A. In all cases, only a few lines of code are required to achieve most integrations between user code and plugin interface specifications. The core plugin infrastructure consists of a function/property registration mechanism and an interaction window that shows them in the UI. As such, arbitrary user functions can be "registered" in one line of code to a plugin, where they will be effectively exposed to the user within S3A. A trivial example is depicted in Figure 5, but more complex behavior such as OCR integration is possible with similar ease (see this snippet for an implementation leveraging easyocr).

Plugin features are heavily oriented toward easing the process of automation both for general annotation needs and niche datasets. In either case, incorporating existing library functions becomes a trivial task, directly resulting in lower annotation time and higher labeling accuracy.

Adaptable I/O

An extendable I/O framework allows annotations to be used in a myriad of ways. Out of the box, S3A easily supports instance-level segmentation outputs, facilitating deep learning model training. As an example, Figure 6 illustrates how each instance in the image becomes its own pair of image and mask data. When several instances overlap, each is uniquely distinguishable depending on the characteristic of their label field.




Fig. 4. Outputs of each processing stage can be quickly viewed in context after an iteration of annotating. Upon inspecting the results, it is clear the failure point is
a low k value during K-means clustering and segmentation. The woman’s shirt is not sufficiently distinguishable from the background palette to denote a separate
entity. The red dot is an indicator of where the operator clicked during annotation.



from qtpy import QtWidgets
from s3a import (
    S3A,
    __main__,
    RandomToolsPlugin,
)

# The function to expose; S3A passes its main window as the argument.
def hello_world(win: S3A):
    QtWidgets.QMessageBox.information(
        win, "Hello World", "Hello World!"
    )

# One line of registration exposes the function through the random tools plugin.
RandomToolsPlugin.deferredRegisterFunc(
    hello_world
)

# Start S3A as if it were launched from the command line.
__main__.mainCli()
Fig. 5. Simple standalone functions can be easily exposed to the user through the random tools plugin. Note that if tunable parameters were included in the function signature, pressing "Open Tools" (the top menu option) allows them to be altered.

Fig. 6. Multiple export formats exist, among which is a utility that crops components out of the image, optionally padding with scene pixels and resizing to ensure all shapes are equal. Each sub-image and mask is saved accordingly, which is useful for training on multiple forms of machine learning models.

Particularly helpful for models with fixed input sizes, these exports can optionally be forced to have a uniform shape (e.g., 512x512 pixels) while maintaining their aspect ratio. This is accomplished by incorporating additional scene pixels around each object until the appropriate size is obtained. Models trained on these exports can be directly plugged back into S3A's processing framework, allowing them to generate new annotations or refine preliminary user efforts. The described I/O framework is also heavily modularized such that custom dataset specifications can easily be incorporated. In this manner, future versions of S3A will facilitate interoperability with popular formats such as COCO and Pascal VOC [LMB+14], [EGW+10].
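A minimal sketch of the per-instance export idea (not S3A's actual exporter; the function below is hypothetical and assumes a NumPy image plus one boolean mask per instance):

import numpy as np

def export_instance(image, instance_mask, pad=16):
    # Crop one annotated instance, plus some surrounding scene pixels,
    # out of the full image and return an (image patch, mask patch) pair.
    ys, xs = np.nonzero(instance_mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, image.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, image.shape[1])
    return image[y0:y1, x0:x1], instance_mask[y0:y1, x0:x1]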
Deep, Portable Customizability

Beyond the features previously outlined, S3A provides numerous avenues to configure shortcuts, color schemes, and algorithm workflows. Several examples of each can be seen in the user guide. Most customizable components prototyped within S3A can also be easily ported to external workflows after development. Hierarchical processes have states saved in YAML files describing all parameters, which can be reloaded to create user profiles. Alternatively, these same files can describe ideal parameter combinations for functions outside S3A in the event they are utilized in a different framework.
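For illustration only, a parameter profile of this kind could be read back with PyYAML and splatted into the corresponding functions; the stage names and schema below are invented and do not reflect S3A's actual profile format:

import yaml  # PyYAML

profile_text = """
k_means:
  n_clusters: 4
morphology:
  operation: open
  kernel_size: 3
"""

profile = yaml.safe_load(profile_text)
# Each top-level key names a processing stage; its mapping becomes keyword
# arguments when that stage runs, e.g. run_kmeans(image, **profile["k_means"]).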
Case Study

Both the inspiration and developing efforts for S3A were initially driven by optical printed circuit board (PCB) assurance needs. In this domain, high-resolution images can contain thousands of complex objects in a scene, as seen in Figure 7. Moreover, numerous components are not representable by cardinal shapes such as rectangles, circles, etc. Hence, high-count polygonal regions dominated a significant portion of the annotated regions. The computational overhead from displaying large images and substantial numbers of complex regions either crashed most annotation platforms or prevented real-time interaction. In response, S3A was designed to fill the gap in open-source annotation platforms that addressed each issue while requiring minimal setup and allowing easy prototyping of arbitrary image processing tasks. The subsections below describe how the S3A labeling platform was utilized to collect a large database of PCB annotations along with their associated metadata.2

2. For those curious, the dataset and associated paper are accessible at https://www.trust-hub.org/#/data/pcb-images.

Large Images with Many Annotations

In optical PCB assurance, one method of identifying component defects is to localize and characterize all objects in the image.




Fig. 7. Example PCB segmentation. In contrast to typical segmentation tasks, the scene contains over 4,000 objects with numerous complex shapes.

Each component can then be cross-referenced against genuine properties such as length/width, associated text, allowed orientations, etc. However, PCB surfaces can contain hundreds to thousands of components at several magnitudes of size, necessitating high-resolution images for in-line scanning. To handle this problem more generally, S3A separates the editing and viewing experiences. In other words, annotation time is orders of magnitude faster since only edits in one region at a time and on a small subset of the full image are considered during assisted segmentation. All other annotations are read-only until selected for alteration. For instance, Figure 8 depicts user inputs on a small ROI out of a much larger image. The resulting component shape is proposed within seconds and can either be accepted or modified further by the user. While PCB annotations initially inspired this approach, it is worth noting that the architectural approach applies to arbitrary domains of image segmentation.

Fig. 8. Regardless of total image size and number of annotations, Python processing is limited to the ROI or viewbox size for just the selected object, based on user preferences. The depiction shows Grab Cut operating on a user-defined initial region within a much larger (8000x6000) image. The resulting region was available in 1.94 seconds on low-grade hardware.

Another key performance improvement comes from resizing the processed region to a user-defined maximum size. For instance, if an ROI is specified across a large portion of the image but the maximum processing size is 500x500 pixels, the processed area will be downsampled to a maximum dimension length of 500 before intensive algorithms are run. The final output will be upsampled back to the initial region size. In this manner, optionally sacrificing a small amount of output accuracy can drastically accelerate runtime performance for larger annotated objects.
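A minimal sketch of this downsample-process-upsample pattern (illustrative only, not S3A's internal code; it assumes an OpenCV-style image array and a segmentation callable):

import cv2

def process_scaled(roi, algorithm, max_dim=500):
    # Downsample the ROI so its longest side is at most max_dim, run the
    # expensive algorithm there, then upsample the mask to the original size.
    h, w = roi.shape[:2]
    scale = min(1.0, max_dim / max(h, w))
    small = cv2.resize(roi, (int(w * scale), int(h * scale)))
    mask = algorithm(small)
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)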
Complex Vertices/Semantic Segmentation

Multiple types of PCB components possess complex shapes which might contain holes or noncontiguous regions. Hence, it is beneficial for software like S3A to represent these features inherently with a ComplexXYVertices object: that is, a collection of polygons which either describe foreground regions or holes. This is enabled by thinly wrapping opencv's contour and hierarchy logic. Example components difficult to accommodate with single-polygon annotation formats are illustrated in Figure 9.

Fig. 9. Annotated objects in S3A can incorporate both holes and distinct regions through a multi-polygon container. Holes are represented as polygons drawn on top of existing foreground, and can be arbitrarily nested (i.e. island foreground is also possible).

At the same time, S3A also supports high-count polygons with no performance losses. Since region edits are performed by image processing algorithms, there is no need for each vertex to be manually placed or altered by human input. Thus, such non-interactive shapes can simply be rendered as a filled path without a large number of event listeners present. This is the key performance improvement when thousands of regions (each with thousands of points) are in the same field of view. When low polygon counts are required, S3A also supports RDP polygon simplification down to a user-specified epsilon parameter [Ram].
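The foreground/hole decomposition can be sketched with OpenCV's two-level contour hierarchy, and the RDP step with approxPolyDP; this illustrates the underlying OpenCV calls, not the ComplexXYVertices implementation itself:

import cv2

def mask_to_polygons(mask, epsilon=0.0):
    # RETR_CCOMP yields a two-level hierarchy (OpenCV 4.x return signature):
    # outer contours are foreground, contours with a parent describe holes.
    contours, hierarchy = cv2.findContours(
        mask.astype("uint8"), cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE
    )
    if hierarchy is None:
        return []
    polygons = []
    for contour, info in zip(contours, hierarchy[0]):
        if epsilon > 0:  # optional Ramer-Douglas-Peucker simplification
            contour = cv2.approxPolyDP(contour, epsilon, True)
        polygons.append((contour.reshape(-1, 2), info[3] >= 0))  # (points, is_hole)
    return polygons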
Complex Metadata

Most annotation software supports robust implementation of image region, class, and various text tags ("metadata"). However, this paradigm makes collecting type-checked or input-sanitized metadata more difficult. This includes label categories such as object rotation, multiclass specifications, dropdown selections, and more. In contrast, S3A treats each metadata field the same way as object vertices, where they can be algorithm-assisted, directly input by the user, or part of a machine learning prediction framework. Note that simple properties such as text strings or numbers can be directly input in the table cells with minimal need for annotation assistance.3 Custom fields, in contrast, can provide plugin specifications which allow more advanced user interaction. Finally, auto-populated fields like annotation timestamp or author can easily be constructed by providing a factory function instead of a default value in the parameter specification.

3. For a list of input validators and supported primitive types, refer to PyQtGraph's Parameter documentation.
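As a plain-Python illustration of the default-versus-factory distinction (the field names and specification format here are invented, not S3A's or PyQtGraph's actual schema):

from datetime import datetime, timezone

# A stored value is used verbatim as the default; a callable is invoked
# every time a new annotation row is created.
field_spec = {
    "designator": "",
    "rotation": 0.0,
    "annotation_time": lambda: datetime.now(timezone.utc).isoformat(),
}

def new_row(spec):
    return {name: (value() if callable(value) else value) for name, value in spec.items()}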

This capability is particularly relevant in the field of optical PCB assurance. White markings on the PCB surface, known as silkscreen, indicate important aspects of nearby components. Thus, understanding the silkscreen's orientation, alphanumeric characters, associated component, logos present, and more provides several methods by which to characterize / identify features of their respective devices. Both default and customized input validators were applied to each field using parameter specifications, custom plugins, or simple factories as described above. A summary of the metadata collected for one component is shown in Figure 10.

Fig. 10. Metadata can be collected, validated, and customized with ease. A mix of default properties (strings, numbers, booleans), factories (timestamp, author), and custom plugins (yellow circle representing associated device) are present.

Conclusion and Future Work

The Semi-Supervised Semantic Annotator (S3A) is proposed to address the difficult task of pixel-level annotations of image data. For high-resolution images with numerous complex regions of interest, existing labeling software faces performance bottlenecks attempting to extract ground-truth information. Moreover, there is a lack of capabilities to convert such a labeling workflow into an automated procedure with feedback at every step. Each of these challenges is overcome by various features within S3A specifically designed for such tasks. As a result, S3A provides not only tremendous time savings during ground truth annotation, but also allows an annotation pipeline to be directly converted into a prediction scheme. Furthermore, the rapid feedback accessible at every stage of annotation expedites prototyping of novel solutions to imaging domains in which few examples of prior work exist. Nonetheless, multiple avenues exist for improving S3A's capabilities in each of these areas. Several prominent future goals are highlighted in the following sections.

Dynamic Algorithm Builder

Presently, processing workflows can be specified in a sequential YAML file which describes each algorithm and its respective parameters. However, this is not easy to adapt within S3A, especially by inexperienced annotators. Future iterations of S3A will incorporate graphical flowcharts which make this process drastically more intuitive and provide faster feedback. Frameworks like Orange [DCE+] perform this task well, and S3A would strongly benefit from adding the relevant capabilities.

Image Navigation Assistance

Several aspects of image navigation can be incorporated to simplify the handling of large images. For instance, a "minimap" tool would allow users to maintain a global image perspective while making local edits. Furthermore, this sense of scale aids intuition of how many regions of similar component density, color, etc. exist within the entire image.

Second, multiple strategies for annotating large images leverage a windowing approach, where they will divide the total image into several smaller pieces in a gridlike fashion. While this has its disadvantages, it is fast, easy to automate, and produces reasonable results depending on the initial image complexity [VGSG+19]. Hence, these methods would be significantly easier to incorporate into S3A if a generalized windowing framework were available which allows users to specify all necessary parameters such as window overlap, size, sampling frequency, etc. A preliminary version of this is implemented for categorical-based model prediction, but a more robust feature set for interactive segmentation is strongly preferable.
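A hypothetical helper of the sort such a framework could build on is a simple overlapping-window generator (the parameter names and behavior below are assumptions, not a committed S3A interface):

def tile_windows(height, width, size=512, overlap=64):
    # Yield (top, left) offsets of overlapping windows that cover an image
    # whose sides are each at least `size` pixels long.
    step = size - overlap
    rows = list(range(0, height - size, step)) + [height - size]
    cols = list(range(0, width - size, step)) + [width - size]
    for top in rows:
        for left in cols:
            yield top, left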
Aggregation of Human Annotation Habits

Several times, it has been noted that manual segmentation of image data is not a feasible or scalable approach for even remotely large datasets. However, there are multiple cases in which human intuition can greatly outperform even complex neural networks, depending on the specific segmentation challenge [RLFF15]. For this reason, it would be ideal to capture data points possessing information about the human decision-making process and apply them to images at scale. This may include taking into account human labeling time per class, hesitation between clicks, relationship between shape boundary complexity and instance quantity, and more. By aggregating such statistics, a pattern may arise which can be leveraged as an additional automated annotation technique.

REFERENCES

[AAV+02]  C Anagnostopoulos, I Anagnostopoulos, D Vergados, G Kouzas, E Kayafas, V Loumos, and G Stassinopoulos. High performance computing algorithms for textile quality control. Mathematics and Computers in Simulation, 60(3):389–400, September 2002. doi:10.1016/S0378-4754(02)00031-9.
[AML+19]  Mukhil Azhagan, Dhwani Mehta, Hangwei Lu, Sudarshan Agrawal, Mark Tehranipoor, Damon L Woodard, Navid Asadizanjani, and Praveen Chawla. A review on automatic bill of material generation and visual inspection on PCBs. In ISTFA 2019: Proceedings of the 45th International Symposium for Testing and Failure Analysis, page 256. ASM International, 2019.
[AVK+01]  C. Anagnostopoulos, D. Vergados, E. Kayafas, V. Loumos, and G. Stassinopoulos. A computer vision approach for textile quality control. The Journal of Visualization and Computer Animation, 12(1):31–44, 2001. doi:10.1002/vis.245.
[BWS+10]  Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 438–451, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[CZF+18]  Qimin Cheng, Qian Zhang, Peng Fu, Conghuan Tu, and Sen Li. A survey and analysis on automatic image annotation. Pattern Recognition, 79:242–259, 2018. doi:10.1016/j.patcog.2018.02.017.
[DCE+]    Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, and Anže Starič. Orange: Data mining toolbox in Python. 14(1):2349–2353.
[DS20]    Polina Demochkina and Andrey V. Savchenko. Improving the accuracy of one-shot detectors for small objects in x-ray images. In 2020 International Russian Automation Conference (RusAutoCon), pages 610–614. IEEE, September 2020. URL: https://ieeexplore.ieee.org/document/9208097/, doi:10.1109/RusAutoCon49822.2020.9208097.
[EGW+10]  Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010. URL: https://doi.org/10.1007/s11263-009-0275-4, doi:10.1007/s11263-009-0275-4.
[FNK92]   H. Fujisawa, Y. Nakano, and K. Kurino. Segmentation methods for character recognition: From segmentation to document structure analysis. Proceedings of the IEEE, 80(7):1079–1092, July 1992. doi:10.1109/5.156471.

[FRLL18]  Max K. Ferguson, Ak Ronay, Yung-Tsun Tina Lee, and Kincho H. Law. Detection and segmentation of manufacturing defects with convolutional neural networks and transfer learning. Smart and Sustainable Manufacturing Systems, 2, 2018. doi:10.1520/SSMS20180033.
[GNP+04]  Basilios Gatos, Kostas Ntzios, Ioannis Pratikakis, Sergios Petridis, T. Konidaris, and Stavros J. Perantonis. A segmentation-free recognition technique to assist old Greek handwritten manuscript OCR. In Simone Marinai and Andreas R. Dengel, editors, Document Analysis Systems VI, Lecture Notes in Computer Science, pages 63–74, Berlin, Heidelberg, 2004. Springer. doi:10.1007/978-3-540-28640-0_7.
[IGSM14]  D. K. Iakovidis, T. Goudas, C. Smailis, and I. Maglogiannis. Ratsnake: A versatile image annotation tool with application to computer-aided diagnosis, 2014. doi:10.1155/2014/286856.
[itL18]   Humans in the Loop. The best image annotation platforms for computer vision (+ an honest review of each), October 2018. URL: https://hackernoon.com/the-best-image-annotation-platforms-for-computer-vision-an-honest-review-of-each-dac7f565fea.
[JB92]    Anil K. Jain and Sushil Bhattacharjee. Text segmentation using gabor filters for automatic document processing. Machine Vision and Applications, 5(3):169–184, June 1992. doi:10.1007/BF02626996.
[JPRA20]  Nathan Jessurun, Olivia Paradis, Alexandra Roberts, and Navid Asadizanjani. Component Detection and Evaluation Framework (CDEF): A Semantic Annotation Tool. Microscopy and Microanalysis, 26(S2):1470–1474, August 2020. doi:10.1017/S1431927620018243.
[KBO16]   Made Windu Antara Kesiman, Jean-Christophe Burie, and Jean-Marc Ogier. A new scheme for text line and character segmentation from gray scale images of palm leaf manuscript. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 325–330, October 2016. doi:10.1109/ICFHR.2016.0068.
[LMB+14]  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[LSA+10]  Ľubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H. S. Torr. What, where and how many? Combining object detectors and CRFs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 424–437, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[MKS18]   S. Mohajerani, T. A. Krammer, and P. Saeedi. A cloud detection algorithm for remote sensing images using fully convolutional neural networks. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pages 1–5, August 2018. doi:10.1109/MMSP.2018.8547095.
[PJTA20]  Olivia P Paradis, Nathan T Jessurun, Mark Tehranipoor, and Navid Asadizanjani. Color normalization for robust automatic bill of materials generation and visual inspection of PCBs. In ISTFA 2020: Papers Accepted for the Planned 46th International Symposium for Testing and Failure Analysis, pages 172–179, 2020. URL: https://doi.org/10.31399/asm.cp.istfa2020p0172, https://dl.asminternational.org/istfa/proceedings-pdf/ISTFA2020/83348/172/425605/istfa2020p0172.pdf, doi:10.31399/asm.cp.istfa2020p0172.
[Ram]     Urs Ramer. An iterative procedure for the polygonal approximation of plane curves. 1(3):244–256. URL: https://www.sciencedirect.com/science/article/pii/S0146664X72800170, doi:10.1016/S0146-664X(72)80017-0.
[RLFF15]  Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2121–2131. IEEE, June 2015. URL: http://ieeexplore.ieee.org/document/7298824/, doi:10.1109/CVPR.2015.7298824.
[RLO+17]  Martin Rajchl, Matthew C. H. Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A. Rutherford, Joseph V. Hajnal, Bernhard Kainz, and Daniel Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2):674–683, February 2017. doi:10.1109/TMI.2016.2621185.
[SKM+10]  Sascha Seifert, Michael Kelm, Manuel Moeller, Saikat Mukherjee, Alexander Cavallaro, Martin Huber, and Dorin Comaniciu. Semantic annotation of medical images. In Brent J. Liu and William W. Boonn, editors, Medical Imaging 2010: Advanced PACS-based Imaging Informatics and Therapeutic Applications, volume 7628, pages 43–50. International Society for Optics and Photonics, SPIE, 2010. URL: https://doi.org/10.1117/12.844207, doi:10.1117/12.844207.
[Spa20]   SpaceNet. Multi-Temporal Urban Development Challenge. https://spacenet.ai/sn7-challenge/, June 2020.
[TFJ89]   T. Taxt, P.J. Flynn, and A.K. Jain. Segmentation of document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1322–1329, December 1989. doi:10.1109/34.41371.
[VGSG+19] Juan P. Vigueras-Guillén, Busra Sari, Stanley F. Goes, Hans G. Lemij, Jeroen van Rooij, Koenraad A. Vermeer, and Lucas J. van Vliet. Fully convolutional architecture vs sliding-window CNN for corneal endothelium cell segmentation. BMC Biomedical Engineering, 1(1):4, January 2019. doi:10.1186/s42490-019-0003-2.
[WYZZ09]  C. Wang, Shuicheng Yan, Lei Zhang, and H. Zhang. Multi-label sparse coding for automatic image annotation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1643–1650, June 2009. doi:10.1109/CVPR.2009.5206866.
[YPH+06]  Paul A. Yushkevich, Joseph Piven, Heather Cody Hazlett, Rachel Gimpel Smith, Sean Ho, James C. Gee, and Guido Gerig. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage, 31(3):1116–1128, July 2006. doi:10.1016/j.neuroimage.2006.01.015.




  Galyleo: A General-Purpose Extensible Visualization
                       Solution
                     Rick McGeer‡∗ , Andreas Bergen‡ , Mahdiyar Biazi‡ , Matt Hemmings‡ , Robin Schreiber‡






Abstract—Galyleo is an open-source, extensible dashboarding solution integrated with JupyterLab [jup]. Galyleo is a standalone web application integrated as an iframe [LS10] into a JupyterLab tab. Users generate data for the dashboard inside a Jupyter Notebook [KRKP+16], which transmits the data through message passing [mdn] to the dashboard; users use drag-and-drop operations to add widgets to filter, and charts to display the data, shapes, text, and images. The dashboard is saved as a JSON [Cro06] file in the user's filesystem in the same directory as the Notebook.

Index Terms—JupyterLab, JupyterLab extension, Data visualization

Introduction

Current dashboarding solutions [hol22a] [hol22b] [plo] [pan22] for Jupyter either involve external, heavyweight tools, ingrained HTML/CSS coding, complex publication, or limited control over layout, and have restricted widget sets and visualization libraries. Graphics objects require a great deal of configuration: size, position, colors, fonts must be specified for each object. Thus library solutions involve a significant amount of fairly simple code. Conversely, visualization involves analytics, an inherently complex set of operations. Visualization tools such as Tableau [DGHP13] or Looker [loo] combine visualization and analytics in a single application presented through a point-and-click interface. Point-and-click interfaces are limited in the number and complexity of operations supported. The complexity of an operation isn't reduced by having a simple point-and-click interface; instead, the user is confronted with the challenge of trying to do something complicated by pointing. The result is that tools encapsulate complex operations in a few buttons, and that leads to a limited number of operations with reduced options and/or tools with steep learning curves.

In contrast, Jupyter is simply a superior analytics environment in every respect over a standalone visualization tool: its various kernels and their libraries provide a much broader range of analytics capabilities; its programming interface is a much cleaner and simpler way to perform complex operations; hardware resources can scale far more easily than they can for a visualization tool; and connectors to data sources are both plentiful and extensible.

Both standalone visualization tools and Jupyter libraries have a limited set of visualizations. Jupyter is a server-side platform.

* Corresponding author: rick.mcgeer@engageLively.com
‡ engageLively

Copyright © 2022 Rick McGeer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Jupyter's web interface is primarily to offer textboxes for code entry. Entered code is sent to the server for evaluation and text/HTML results returned. Visualization in a Jupyter Notebook is either given by images rendered server-side and returned as inline image tags, or by JavaScript/HTML5 libraries which have a corresponding server-side Python library. The Python library generates HTML5/JavaScript code for rendering.

The limiting factor is that the visualization library must be integrated with the Python backend by a developer, and only a subset of the rich array of visualization, charting, and mapping libraries available on the HTML5/JavaScript platform is integrated. The HTML5/JavaScript platform is as rich a client-side visualization platform as Python is a server-side platform.

Galyleo set out to offer the best of both worlds: Python, R, and Julia as a scalable analytics platform coupled with an extensible JavaScript/HTML5 visualization and interaction platform. It offers a no-code client-side environment, for several reasons.

    1)  The Jupyter analytics community is comfortable with server-side analytics environments (the 100+ kernels available in Jupyter, including Python, R and Julia) but less so with the JavaScript visualization platform.
    2)  Configuration of graphical objects takes a lot of low-value configuration code; conversely, it is relatively easy to do by hand.

These insights lead to a mixed interface, combining a drag-and-drop interface for the design and configuration of visual objects, and a coding, server-side interface for analytics programs.

Extension of the widget set was an important consideration. A widget is a client-side object with a physical component. Galyleo is designed to be extensible both by adding new visualization libraries and components and by adding new widgets.

Publication of interactive dashboards has been a further challenge. A design goal of Galyleo was to offer a simple scheme, where a dashboard could be published to the web with a single click.

These then, are the goals of Galyleo:

    1)  Simple, drag-and-drop design of interactive dashboards in a visual editor. The visual design of a Galyleo dashboard should be no more complex than design of a PowerPoint or Google slide;
    2)  Radically simplify the dashboard-design interface by coupling it to a powerful, Jupyter back end to do the analytics work, separating visualization and analytics concerns;




    3)  Maximize extensibility for visualization and widgets on the client side and analytics libraries, data sources and hardware resources on the server side;
    4)  Easy, simple publication.

Using Galyleo

The general usage model of Galyleo is that a Notebook is being edited and executed in one tab of JupyterLab, and a corresponding dashboard file is being edited and executed in another; as the Notebook executes, it uses the Galyleo Client library to send data to the dashboard file. To JupyterLab, the Galyleo Dashboard Studio is just another editor; it reads and writes .gd.json files in the current directory.

Fig. 1: Figure 1. A New Galyleo Dashboard

The Dashboard Studio

A new Galyleo Dashboard can be launched from the JupyterLab launcher or from the File>New menu, as shown in Figure 1.

An existing dashboard is saved as a .gd.json file, and is denoted with the Galyleo star logo. It can be opened in the usual way, with a double-click.

Once a file is opened, or a new file created, a new Galyleo tab opens onto it. It resembles a simplified form of a Tableau, Looker, or PowerBI editor. The collapsible right-hand sidebar offers the ability to view Tables, and view, edit, or create Views, Filters, and Charts. The bottom half of the right sidebar gives controls for styling of text and shapes.

The top bar handles the introduction of decorative and styling elements to the dashboard: labels and text, simple shapes such as ellipses, rectangles, polygons, lines, and images. All images are referenced by URL.

As the user creates and manipulates the visual elements, the editor continuously saves the table as a JSON file, which can also be edited with Jupyter's built-in text editor.

Fig. 2: Figure 2. The Galyleo Dashboard Studio

Workflow

The goal of Galyleo is simplicity and transparency. Data preparation is handled in Jupyter, and the basic abstract item, the GalyleoTable, is generally created and manipulated there, using an open-source Python library. When a table is ready, the GalyleoClient library is invoked to send it to the dashboard, where it appears in the table tab of the sidebar. The dashboard author then creates visual elements such as sliders, lists, dropdowns etc., which select rows of the table, and uses these filtered lists as inputs to charts. The general idea is that the author should be able to seamlessly move between manipulating and creating data tables in the Notebook, and filtering and visualizing them in the dashboard.

Fig. 3: Figure 3. Dataflow in Galyleo

Data Flow and Conceptual Picture

The Galyleo Data Model and Architecture is discussed in detail below. The central idea is to have a few, orthogonal, easily-grasped concepts which make data manipulation easy and intuitive. The basic concepts are as follows:

    1)  Table: A Table is a list of records, equivalent to a Pandas DataFrame [pdt20] [WM10] or a SQL Table. In general, in Galyleo, a Table is expected to be produced by an external source, generally a Jupyter Notebook.
    2)  Filter: A Filter is a logical function which applies to a single column of a Table, and selects rows from the Table. Each Filter corresponds to a widget; widgets set the values the Filter uses to select Table rows.
    3)  View: A View is a subset of a Table selected by one or more Filters. To create a view, the user chooses a Table, and then chooses one or more Filters to apply to the Table to select the rows for the View. The user can also statically select a subset of the columns to include in the View.
    4)  Chart: A Chart is a generic term for an object that displays data graphically. Its input is a View or a Table. Each Chart has a single data source.

The data flow is straightforward. A Table is updated from an external source, or the user manipulates a widget. When this happens, the affected item signals the dashboard controller that it has been updated. The controller then signals all charts to redraw themselves. Each Chart will then request updated data from its

source Table or View. A View then requests its configured filters for their current logic functions, and passes these to the source Table with a request to apply the filters and return the rows which are selected by all the filters (in the future, a more general Boolean will be applied; the UI elements to construct this function are under design). The Table then returns the rows which pass the filters; the View selects the static subset of columns it supports, and passes this to its Charts, which then redraw themselves.

Each item in this flow conceptually has a single data source, but multiple data targets. There can be multiple Views over a Table, but each View has a single Table as a source. There can be multiple charts fed by a View, but each Chart has a single Table or View as a source.

It's important to note that there are no special cases. There is no distinction, as there is in most visualization systems, between a "Dimension" and a "Measure"; there are simply columns of data, which can be either a value or category axis for any Chart. From this simplicity, significant generality is achieved. For example, a filter selects values from any column, whether that column is providing value or category. Applying a range filter to a category column gives natural telescoping and zooming on the x-axis of a chart, without change to the architecture.
chart, without change to the architecture.                               fourth, which is correct, is that configuration code is more verbose,
                                                                         error-prone, and time-consuming than manual configuration.
Drilldowns
                                                                             What is less often appreciated is that when operations become
An important operation for any interactive dashboard is drill-           sufficiently complex, coding is a much simpler interface than
downs: expanding detail for a datapoint on a chart. The user             manual configuration. For example, building a pivot table in a
should be able to click on a chart and see a detailed view of            spreadsheet using point-and-click operations have "always had a
the data underlying the datapoint. This was naturally implemented        reputation for being complicated" [Dev]. It’s three lines of code in
in our system by associating a filter with every chart: every chart      Python, even without using the Pandas pivot_table method. Most
in Galyleo is also a Select Filter, and it can be used as a Filter in    analytics procedures are far more easily done in code.
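As an illustration of that claim, the snippet below builds a simple pivot in pandas [pdt20] [WM10] using groupby and unstack rather than the pivot_table convenience method; the file name and column names are hypothetical.

import pandas as pd

# hypothetical data: one row per (region, month) sale record
df = pd.read_csv("sales.csv")
pivot = df.groupby(["region", "month"])["amount"].sum().unstack("month")
print(pivot)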
    As a result, Galyleo is an appropriate-code environment: an environment which combines a coding interface for complex, large-scale, or abstract operations with a point-and-click interface for simple, concrete, small-scale operations. Galyleo pairs broadly powerful Jupyter-based code and low-code libraries for analytics with fast GUI-based design and configuration for graphical elements and layout.

Galyleo Data Model And Architecture
The Galyleo data model and architecture closely mirror the dashboard structure discussed in the previous section. They are based on the idea of a few simple, generalizable structures, which are largely independent of each other and communicate through simple interfaces.

The GalyleoTable
A GalyleoTable is the fundamental data structure in Galyleo. It is a logical, not a physical, abstraction; it simply responds to the GalyleoTable API. A GalyleoTable is a pair (columns, rows), where columns is a list of pairs (name, type), where type is one of {string, boolean, number, date}, and rows is a list of lists of primitive values, where the length of each component list is the length of the list of columns and the type of the kth entry in each list is the type specified by the kth column.
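Written out, such a table is just that pair. The dictionary below shows the shape with illustrative key names and values; it is not the dashboard file schema.

# The (columns, rows) pair described above, written as a Python dictionary.
table = {
    "columns": [("Month", "string"), ("Disease", "number"), ("Wounds", "number")],
    "rows": [
        ["Apr 1854", 1, 0],          # illustrative values
        ["May 1854", 12, 0],
    ],
}
# every row has one entry per column, of the declared type
assert all(len(row) == len(table["columns"]) for row in table["rows"])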
    Small, public tables may be contained in the dashboard file; these are called explicit tables. However, explicitly representing the table in the dashboard file has a number of disadvantages:
     1)   An explicit table is in the memory of the client viewing the dashboard; if it is too large, it may cause significant
          performance problems on the dashboard author's or viewer's device
     2)   Since the dashboard file is accessible on the web, any data within it is public
     3)   The data may be continuously updated from a source, and it's inconvenient to re-run the Notebook to update the data.

    Therefore, the GalyleoTable can be of one of three types:

     1)   A data server that implements the Table REST API
     2)   A JavaScript object within the dashboard page itself
     3)   A JavaScript messenger in the page that implements a messaging version of the API

    An explicit table is simply a special case of (2) -- in this case, the JavaScript object is simply a linear list of rows.
    These are not exclusive. The JavaScript messenger case is designed to support the ability of a containing application within the browser to handle viewer authentication, shrinking the security vulnerability footprint and ensuring that the client application controls the data going to the dashboard. In general, aside from performing tasks like authentication, the messenger will call an external data server for the values themselves.
    Whether in a Data Server, a containing application, or a JavaScript object, Tables support three operations:

     1)   Get all the values for a specific column
     2)   Get the max/min/increment for a specific numeric column
     3)   Get the rows which match a boolean function, passed in as a parameter to the operation

    Of course, (3) is the operation that we have seen above, to populate a View and a Chart. (1) and (2) populate widgets on the dashboard: (1) is designed for a select filter, which is a widget that lets a user pick a specific set of values for a column; (2) is an optimization for numeric filters, so that the entire list of values for the column need not be sent -- rather, only the start and end values, and the increment between them.
    Each type of table specifies a source, additional information (in the case of a data server, for example, any header variables that must be specified in order to fetch the data), and, optionally, a polling interval. The latter is designed to handle live data; the dashboard will query the data source at each polling interval to see if the data has changed.
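A remote-table specification of the kind just described might look like the following; every key name here is an assumption made for illustration, not the dashboard file format.

# Hypothetical remote-table specification: a source, the header variables
# needed to fetch the data, and an optional polling interval (in seconds).
remote_table = {
    "connector": "rest",
    "url": "https://data.example.com/tables/deaths",
    "headers": {"Authorization": "Bearer <token>"},
    "polling_interval": 300,
}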
    The rationale for these three table instantiations (REST, JavaScript object, messenger) is that they provide the key foundational building blocks for future extensions; it's easy to add a SQL connection on top of a REST interface, or a Python simulator.

Filters
Tables must be filtered in situ. One of the key motivators behind remote tables is keeping large amounts of data from hitting the browser, and this is largely defeated if the entire table is sent to the dashboard and then filtered there. As a result, there is a Filter API together with the Table API wherever there are tables.
    The data flow of the previous section remains unchanged; it is simply that the filter functions are transmitted to wherever the tables happen to be. The dataflow in the case of remote tables (whether messenger-based or REST-based) is shown in Figure 5, with the operations that are resident where the table is situated and the operations resident on the dashboard clearly distinguished.

Fig. 5: Galyleo Dataflow with Remote Tables

Comments
Again, simplicity and orthogonality have shown tremendous benefits here. Though filters conceptually act as selectors on rows, they may perform a variety of roles in implementations. For example, a table produced by a simulator may be controlled by a parameter value given by a Filter function.

Extending Galyleo
Every element of the Galyleo system, whether it is a widget, Chart, Table Server, or Filter, is defined exclusively through a small set of public APIs. This is done to permit easy extension by the Galyleo team, users, or third parties. A Chart is defined as an object which has a physical HTML representation and supports four JavaScript methods: redraw (draw the chart), set data (set the chart's data), set options (set the chart's options), and supports table (a boolean which returns true if and only if the chart can draw the passed-in data set). In addition, it exports a defined JSON structure which indicates what options it supports and the types of their values; this is used by the Chart Editor to display a configurator for the chart.
    Similarly, the underlying lively.next system supports user design of new filters. Again, a filter is simply an object with a physical presence that the user can design in Lively, and it supports a specific API -- broadly, set the choices and hand back the Boolean function as a JSON object which will be used to filter the data.

lively.next
Any system can be used to extend Galyleo; at the end of the day, all that need be done is to encapsulate a widget or chart in a snippet of HTML with a JavaScript interface that matches the Galyleo protocol. This is done most easily and quickly by using lively.next [SKH21]. lively.next is the latest in a line of Smalltalk- and Squeak-inspired [IKM+97] JavaScript/HTML integrated development environments that began with the Lively Kernel [IPU+08] [KIH+09] and continued through the Lively Web [LKI+12] [IFH+16] [TM17]. Galyleo is an application built in Lively, following the work done in [HIK+16].
    Lively shares with Jupyter an emphasis on live programming [KRB18], or a Read-Evaluate-Act Loop (REAL) programming style. It adds to that a combination of visual and text programming [ABF20], where physical objects are positioned and configured largely by hand, as done with any drawing or design program (e.g., PowerPoint, Illustrator, DrawPad, Google Draw), and programmed with a built-in editor and workspace, similar in concept if not form to a Jupyter Notebook.
Fig. 6: The lively.next environment

    Lively abstracts away HTML and CSS tags in graphical objects called "Morphs". Morphs [MS95] were invented as the user interface layer for Self [US87], and have been used as the foundation of the graphics system in Squeak and Scratch [MRR+10]. In this UI, every physical object is a Morph; these can range from a simple polygon or text string to a full application. Morphs are combined via composition, similar to the way that objects are grouped in a presentation or drawing program. The composition is simply another Morph, which in turn can be composed with other Morphs. In this manner, complex Morphs can be built up from collections of simpler ones. For example, a slider is simply the composition of a circle (the knob) with a thin, long rectangle (the bar). Each Morph can be individually programmed as a JavaScript object, or can inherit base-level behavior and extend it.
    In lively.next, each Morph turns into a snippet of HTML, CSS, and JavaScript code, and the entire application turns into a web page. The programmer doesn't see the HTML and CSS code directly; these are auto-generated. Instead, the programmer writes JavaScript code for both logic and configuration (to the extent that the configuration isn't done by hand). The code is bundled with the object and integrated in the web page.
    Morphs can be set as reusable components by a simple declaration. They can then be reused in any Lively design.

Incorporating New Libraries
Libraries are typically incorporated into lively.next by attaching them to a convenient physical object, importing the library from a package manager such as npm, and then writing a small amount of code to expose the object's API. The simplest form of this is to assign the module to an instance variable so it has an addressable name, but typically a few convenience methods are written as well. In this way, a large number of libraries have been incorporated as reusable components in lively.next, including Google Maps, Google Charts [goo], Chart.js [cha], D3 [BOH11], Leaflet.js [lea], OpenLayers [ope], Cytoscape, and many more.

Extending Galyleo's Charting and Visualization Capabilities
A Galyleo Chart is anything that changes its display based on tabular data from a Galyleo Table or Galyleo View. It responds to a specific API, which includes two principal methods:
     1)   drawChart: redraw the chart using the current tabular data from the input Table or View
     2)   acceptsDataset(<Table or View>): returns a boolean depending on whether this chart can draw the data in this view. For example, a Table Chart can draw any tabular data; a Geo Chart typically requires that the first column be a place specifier.
    In addition, it has a read-only property:
     1)   optionSpec: A JSON structure describing the options for the chart. This is a dictionary which specifies the name of each option and its type (color, number, string, boolean, or enum with values given). Each type corresponds to a specific UI widget that the chart editor uses.
    And two read-write properties:
     1)   options: The current options, as a JSON dictionary. This matches exactly the JSON dictionary in optionSpec, with values in place of the types.
     2)   dataSource: a string, the name of the current Galyleo Table or Galyleo View.
    Typically, an extension to Galyleo's charting capabilities is done by incorporating the library as described in the previous section, implementing the API given in this section, and then publishing the result as a component.
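The optionSpec/options pair might look like the following, shown here as Python dictionaries; the option names and the exact encoding of the types are illustrative assumptions, not the published chart protocol.

# Illustrative optionSpec (name -> type) and matching options (name -> value).
option_spec = {
    "title": "string",
    "backgroundColor": "color",
    "showLegend": "boolean",
    "legendPosition": {"enum": ["top", "bottom", "left", "right"]},
}
options = {
    "title": "Deaths by month",
    "backgroundColor": "#ffffff",
    "showLegend": True,
    "legendPosition": "bottom",
}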
Extending Galyleo's Widget Set
A widget is a graphical item used to filter data. It operates on a single column of any table in the current data set. It is either a range filter (which selects a range of numeric values) or a select filter (which selects a specific value, or a set of specific values). The API that is implemented consists only of properties:
     1)   valueChanged: a signal, which is fired whenever the value of the widget is changed
     2)   value: read-write. The current value of the widget
     3)   filter: read-only. The current filter function, as a JSON structure
     4)   allValues: read-write, select filters only.
     5)   column: read-only. The name of the column of this widget. Set when the widget is created
     6)   numericSpec: read-write. A dictionary containing the numeric specification for a numeric or date filter
    Widgets are typically designed as a standard Lively graphical component, much as the slider described above.
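For example, the filter property of a range widget and of a select widget might serialize to structures like the ones below; the operator names are illustrative assumptions, not the published filter format.

# Illustrative JSON filter functions, written as Python dictionaries.
range_filter  = {"operator": "IN_RANGE", "column": "Month", "min": 3, "max": 9}
select_filter = {"operator": "IN_LIST", "column": "Cause",
                 "values": ["Disease", "Wounds", "Other"]}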
Integration into JupyterLab: The Galyleo Extension
Galyleo is a standalone web application that is integrated into JupyterLab using an iframe inside a JupyterLab tab for physical design. A small JupyterLab extension was built that implements the JupyterLab editor API. The JupyterLab extension has two major functions: to handle read/write/undo requests from the JupyterLab menus and file browser, and to receive and transmit messages from the running Jupyter kernels to update tables on the Dashboard Studio and to handle the reverse messages where the studio requests data from the kernel.
    Standard Jupyter and browser mechanisms are used. File system requests come to the extension from the standard Jupyter API, exactly the same requests and mechanisms that are sent to a Markdown or Notebook editor. The extension receives them, and then uses standard browser-based messaging (window.postMessage) to signal the standalone web app. Similarly, when the extension
makes a request of JupyterLab, it does so through this mechanism, and a receiver in the extension gets it and makes the appropriate method calls within JupyterLab to achieve the objective.

Fig. 7: Galyleo Extension Architecture

    When a kernel makes a request through the Galyleo Client, this is handled exactly the same way. A Jupyter messaging server within the extension receives the message from the kernel, and then uses browser messaging to contact the application with the request, and does the reverse on a Galyleo message to the kernel.
    This is a highly efficient method of interaction, since browser-based messaging consists of in-memory transactions on the client machine.
based messaging is in-memory transactions on the client machine.            This Jupyter Computer has a large number of advantages over
    It’s important to note that there is nothing Galyleo-specific      a standard desktop or laptop computer. It can be accessed from any
about the extension: the Galyleo Extension is a general method         device, anywhere on Earth with an Internet connection. Software
for any standalone web editor (e.g., a slide or drawing editor) to     installation and maintenance issues are nonexistent. Data loss due
be integrated into JupyterLab. The JupyterLab connection is a few      to hardware failure is extremely unlikely; backups are still required
tens of lines of code in the Galyleo Dashboard. The extension is       to prevent accidental data loss (e.g., erroneous file deletion), but
slightly more complex, but it can be configured for a different        they are far easier to do in a Cloud environment. Hardware
application with a simple data structure which specifies the URL       resources such as disk, RAM, and CPU can be added rapidly,
of the application, file type and extension to be manipulated, and     on a permanent or temporary basis. Relatively exotic resources
message list.                                                          (e.g., GPUs) can also be added, again on an on-demand, temporary
                                                                       basis.
                                                                            The advantages go still further than that. Any resource that
The Jupyter Computer                                                   can be accessed over a network connection can be added to
The implications of the Galyleo Extension go well beyond vi-           the Jupyter Computer simply by adding the appropriate accessor
sualization and dashboards and easy publication in JupyterLab.         library to an environment’s Dockerfile. For example, a database
JupyterLab is billed as the next-generation integrated Develop-        solution such as Snowflake, BigQuery, or Amazon Aurora (or
ment Environment for Jupyter, but in fact it is substantially more     one of many others) can be "installed" by adding the relevant
than that. It is the user interface and windowing system for Cloud-    library module to the environment. Of course, the user will need
based personal computing. Inspired by previous extensions such         to order the database service from the relevant provider, and obtain
as the Vega Extension, the Galyleo Extensions seeks to provide         authentication tokens, and so on -- but this is far less troublesome
the final piece of the puzzle.                                         than even maintaining the library on the desktop.
    Consider a Jupyter server in the Cloud, served from a Jupyter-          However, to date the Jupyter Computer only supports a few
Hub such as the Berkeley Data Hub. It’s built from a base              window-based applications, and adding a new application is a
Ubuntu image, with the standard Jupyter libraries installed and,       time-consuming development task. The applications supported are
importantly, a UI that includes a Linux terminal interface. Any        familiar and easy to enumerate: a Notebook editor, of course; a
Linux executable can be installed in the Jupyter server image, as      Markdown Viewer; a CSV Viewer; a JSON Viewer (not inline
can any Jupyter kernel, and any collection of libraries. The Jupyter   editor), and a text editor that is generally used for everything from
server has per-user persistent storage, which is organized in a        Python files to Markdown to CSV.
standard Linux filesystem. This makes the Jupyter server a curated          This is a small subset of the rich range of JavaScript/HTML5
execution environment with a Linux command-line interface and          applications which have significant value for Jupyter Computer
a Notebook interface for Jupyter execution.                            users. For example, the Ace Code Editor supports over 110
    A JupyterHub similar to Berkeley Data Hub (essentially,            languages and has the functionality of popular desktop editors
anything built from Zero 2 Jupyter Hub or Q-Hub) comes with a          such as Vim and Sublime Text. There are over 1100 open-source
number of "environments". The user chooses the environment on          drawing applications on the JavaScript/HTML5 platform; multiple
startup. Each environment comes with a built-in set of libraries and   spreadsheet applications, the most notable being jExcel, and many
executables designed for a specific task or set of tasks. The number   more.
Fig. 8: Galyleo Extension Application-Side Messaging

    Up until now, adding a new application to JupyterLab involved writing a hand-coded extension in TypeScript and compiling it into JupyterLab. However, the Galyleo Extension has been designed so that any HTML5/JavaScript application can be added easily, simply by configuring the Galyleo Extension with a small JSON file.
    The promise of the Galyleo Extension is that it can be adapted to any open-source JavaScript/HTML5 application very easily. The Galyleo Extension merely needs the:
   •   URL of the application
   •   File extension that the application reads/writes
   •   URL of an image for the launcher
   •   Name of the application for the file menu
    The application must implement a small messaging client, using the standard JavaScript messaging interface, and implement the calls the Galyleo Extension makes. The conceptual picture is shown in Figure 8. And it must support (at a minimum) messages to read and write the file being edited. A sketch of a configuration covering the four items above follows.
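# Hypothetical configuration adapting the Galyleo Extension to a new
# HTML5/JavaScript application; the key names are illustrative assumptions,
# not the extension's actual schema.
app_config = {
    "url": "https://editor.example.com/index.html",
    "extension": ".sketch",
    "launcher_icon": "https://editor.example.com/icon.svg",
    "name": "Sketch Editor",
}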
The Third Generation of Network Computing
The World-Wide Web and email comprised the first generation of Internet computing (the Internet had been around for a decade before the Web, and earlier networks dated from the sixties, but the Web and email were the first mass-market applications on the network), and they were very simple -- both were document-exchange applications, using slightly different protocols. The second generation of network applications was the siloed productivity applications, where standard desktop applications moved to the Cloud. The most famous examples are of course GSuite and Office 365, but there were and are many others -- Canva, Loom, Picasa, as well as a large number of social/chat/social media applications. What they all had in common was that they were siloed applications which, with the exception of the office suites, didn't even share a common store. In many ways, this second generation of network applications recapitulates the era immediately prior to the introduction of the personal computer. That era was dominated by single-application computers such as word processors, which were simply computers with a hardcoded program loaded into ROM.
    The Word Processor era was due to technological limitations -- the processing power and memory to run multiple programs simply wasn't available on low-end hardware, and PC operating systems didn't yet exist. In some sense, the current second generation of Internet Computing suffers from similar technological constraints. The "Operating System" for Internet Computing doesn't yet exist. The Jupyter Computer can provide it.
    To see the difference that this can make, consider LaTeX (perhaps preceded by Docutils, as is the case for SciPy) preparation of a document. On a personal computer, it's fairly straightforward: the user uses any of a wide variety of text editors to prepare the document, any of a wide variety of productivity and illustration programs to prepare the images, and runs these through a local sequence of commands (e.g., pdflatex paper; bibtex paper; pdflatex paper). Usually GitHub or another repository is used for storage and collaboration.
    In a Cloud service, this is another matter. There is at most one editor, selected by the service, on the site. There is no image editing or illustration program that reads and writes files on the site. Auxiliary tools, such as a bib searcher, aren't present or aren't customizable. The service has its own siloed storage, its own text editor, and its own document-preparation pipeline. The tools (aside from the core document-preparation program) are primitive. The online service has two advantages over the personal-device approach: collaboration is generally built-in, with multiple people having access to the project, and the software need not be maintained. Aside from that, the personal-device experience is generally superior. In particular, the user is free to pick their own editor, and doesn't have to orchestrate multiple downloads and uploads from various websites. The usual collection of command-line utilities is available for small touchups.
    The third generation of Internet Computing is represented by the Jupyter Computer. This offers a Cloud experience similar to the personal computer, but with the scalability, reliability, and ease of collaboration of the Cloud.

Fig. 9: Generations of Internet Computing

Conclusion and Further Work
The vision of the Jupyter Computer, bringing the power of the Cloud to the personal computing experience, has been started with Galyleo. It will not end there. At the heart of it is a composition of two broadly popular platforms: HTML5/JavaScript for presentation and interaction, and the various Jupyter kernels for server-side analytics. Galyleo is a start at seamless interaction of these two platforms. Continuing and extending this is further development of narrow-waist protocols to permit maximal independent development and extension.

Acknowledgements
The authors wish to thank Alex Yang and Diptorup Deb for their insightful comments, and Meghann Agarwal for stewardship. We have received invaluable help from Robert Krahn, Marko Röder, Jens Lincke and Linus Hagemann. We thank the engageLively team for all of their support and help: Tim Braman, Patrick Scaglia, Leighton Smith, Sharon Zehavi, Igor Zhukovsky, Deepak Gupta, Steve King, Rick Rasmussen, Patrick McCue, Jeff Wade, Tim Gibson. The JupyterLab development community has been helpful and supportive; we want to thank Tony Fast, Jason Grout, Mehmet Bektas, Isabela Presedo-Floyd, Brian
Granger, and Michal Krassowski. The engageLively Technology Advisory Board has helped shape these ideas: Ani Mardurkar, Priya Joseph, David Peterson, Sunil Joshi, Michael Czahor, Isha Oke, Petrus Zwart, Larry Rowe, Glenn Ricart, Antony Ng. We want to thank the people from the AWS team that have helped us tremendously: Matt Vail, Omar Valle, Pat Santora. Galyleo has been dramatically improved with the assistance of our Japanese colleagues at KCT and Pacific Rim Technologies: Yoshio Nakamura, Ted Okasaki, Ryder Saint, Yoshikazu Tokushige, and Naoyuki Shimazaki. Our understanding of Jupyter in an academic context came from our colleagues and friends at Berkeley, the University of Victoria, and UBC: Shawna Dark, Hausi Müller, Ulrike Stege, James Colliander, Chris Holdgraf, Nitesh Mor. Use of Jupyter in a research context was emphasized by Andrew Weidlea, Eli Dart, Jeff D'Ambrogia. We benefitted enormously from the CITRIS Foundry: Alic Chen, Jing Ge, Peter Minor, Kyle Clark, Julie Sammons, Kira Gardner. The Alchemist Accelerator was central to making this product: Ravi Belani, Arianna Haider, Jasmine Sunga, Mia Scott, Kenn So, Aaron Kalb, Adam Frankl. Kris Singh was a constant source of inspiration and help. Larry Singer gave us tremendous help early on. Vibhu Mittal more than anyone inspired us to pursue this road. Ken Lutz has been a constant sounding board and inspiration, and worked hand-in-hand with us to develop this product. Our early customers and partners have been and continue to be a source of inspiration, support, and experience that is absolutely invaluable: Jonathan Tan, Roger Basu, Jason Koeller, Steve Schwab, Michael Collins, Alefiya Hussain, Geoff Lawler, Jim Chimiak, Fraukë Tillman, Andy Bavier, Andy Milburn, Augustine Bui. All of our customers are really partners, none more so than the fantastic teams at Tanjo AI and Ultisim: Bjorn Nordwall, Ken Lane, Jay Sanders, Eric Smith, Miguel Matos, Linda Bernard, Kevin Clark, and Richard Boyd. We want to especially thank our investors, who bet on this technology and company.

REFERENCES

[ABF20]    Leif Andersen, Michael Ballantyne, and Matthias Felleisen. Adding interactive visual syntax to textual code. Proc. ACM Program. Lang., 4(OOPSLA), Nov 2020. URL: https://doi.org/10.1145/3428290, doi:10.1145/3428290.
[BOH11]    Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, Dec 2011. URL: https://doi.org/10.1109/TVCG.2011.185, doi:10.1109/TVCG.2011.185.
[cha]      Chart.js. URL: https://www.chartjs.org/.
[Cro06]    D. Crockford. The application/json media type for JavaScript Object Notation (JSON). RFC 4627, RFC Editor, July 2006. URL: http://www.rfc-editor.org/rfc/rfc4627.txt, doi:10.17487/rfc4627.
[Dev]      Erik Devaney. How to create a pivot table in Excel: A step-by-step tutorial. URL: https://blog.hubspot.com/marketing/how-to-create-pivot-table-tutorial-ht.
[DGHP13]   Marcello D'Agostino, Dov M. Gabbay, Reiner Hähnle, and Joachim Posegga. Handbook of Tableau Methods. Springer Science & Business Media, 2013.
[goo]      Charts: Google Developers. URL: https://developers.google.com/chart/.
[HIK+16]   Matthew Hemmings, Daniel Ingalls, Robert Krahn, Rick McGeer, Glenn Ricart, Marko Röder, and Ulrike Stege. Livetalk: A framework for collaborative browser-based replicated-computation applications. In 2016 28th International Teletraffic Congress (ITC 28), volume 01, pages 270–277, 2016. doi:10.1109/ITC-28.2016.144.
[hol22a]   High-level tools to simplify visualization in Python, Apr 2022. URL: https://holoviz.org/.
[hol22b]   Installation - HoloViews v1.14.9, May 2022. URL: https://holoviews.org/.
[IFH+16]   Daniel Ingalls, Tim Felgentreff, Robert Hirschfeld, Robert Krahn, Jens Lincke, Marko Röder, Antero Taivalsaari, and Tommi Mikkonen. A world of active objects for work and play: The first ten years of Lively. In Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2016, pages 238–249, New York, NY, USA, 2016. Association for Computing Machinery. URL: https://doi.org/10.1145/2986012.2986029, doi:10.1145/2986012.2986029.
[IKM+97]   Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. Back to the future: The story of Squeak, a practical Smalltalk written in itself. In Proceedings of the 12th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '97, pages 318–326, New York, NY, USA, 1997. Association for Computing Machinery. URL: https://doi.org/10.1145/263698.263754, doi:10.1145/263698.263754.
[IPU+08]   Daniel Ingalls, Krzysztof Palacz, Stephen Uhler, Antero Taivalsaari, and Tommi Mikkonen. The Lively Kernel: A self-supporting system on a web page. In Workshop on Self-sustaining Systems, pages 31–50. Springer, 2008. doi:10.1007/978-3-540-89275-5_2.
[jup]      JupyterLab documentation. URL: https://jupyterlab.readthedocs.io/en/stable/.
[KIH+09]   Robert Krahn, Dan Ingalls, Robert Hirschfeld, Jens Lincke, and Krzysztof Palacz. Lively Wiki: A development environment for creating and sharing active web content. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration, WikiSym '09, New York, NY, USA, 2009. Association for Computing Machinery. URL: https://doi.org/10.1145/1641309.1641324, doi:10.1145/1641309.1641324.
[KRB18]    Juraj Kubelka, Romain Robbes, and Alexandre Bergel. The road to live programming: Insights from the practice. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 1090–1101, New York, NY, USA, 2018. Association for Computing Machinery. URL: https://doi.org/10.1145/3180155.3180200, doi:10.1145/3180155.3180200.
[KRKP+16]  Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and the Jupyter development team. Jupyter Notebooks - a publishing format for reproducible computational workflows. IOS Press, 2016. URL: https://eprints.soton.ac.uk/403913/.
[lea]      An open-source JavaScript library for interactive maps. URL: https://leafletjs.com/.
[LKI+12]   Jens Lincke, Robert Krahn, Dan Ingalls, Marko Röder, and Robert Hirschfeld. The Lively PartsBin: A cloud-based repository for collaborative development of active web content. In 2012 45th Hawaii International Conference on System Sciences, pages 693–701, 2012. doi:10.1109/HICSS.2012.42.
[loo]      Looker. URL: https://looker.com/.
[LS10]     Bruce Lawson and Remy Sharp. Introducing HTML5. New Riders Publishing, USA, 1st edition, 2010.
[mdn]      Window.postMessage() - Web APIs: MDN. URL: https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage.
[MRR+10]   John Maloney, Mitchel Resnick, Natalie Rusk, Brian Silverman, and Evelyn Eastmond. The Scratch programming language and environment. ACM Transactions on Computing Education (TOCE), 10(4):1–15, 2010. URL: https://doi.org/10.1145/1868358.1868363, doi:10.1145/1868358.1868363.
[MS95]     John H. Maloney and Randall B. Smith. Directness and liveness in the Morphic user interface construction environment. In Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, pages 21–28, 1995. URL: https://doi.org/10.1145/215585.215636, doi:10.1145/215585.215636.
[ope]      OpenLayers. URL: https://openlayers.org/.
[pan22]    Panel, May 2022. URL: https://panel.holoviz.org/.
[pdt20]    The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134.
[plo]      Dash overview. URL: https://plotly.com/dash/.
[SKH21]    Robin Schrieber, Robert Krahn, and Linus Hagemann. lively.next, 2021.
[TM17]     Antero Taivalsaari and Tommi Mikkonen. The web as a software
           platform: Ten years later. In International Conference on Web
           Information Systems and Technologies, volume 2, pages 41–50.
           SCITEPRESS, 2017. doi:10.5220/0006234800410050.
[US87]     David Ungar and Randall B. Smith. Self: The power of simplic-
           ity. volume 22, page 227–242, New York, NY, USA, dec 1987.
           Association for Computing Machinery. URL: https://doi.org/10.
           1145/38807.38828, doi:10.1145/38807.38828.
[WM10]     Wes McKinney. Data Structures for Statistical Computing in
           Python. In Stéfan van der Walt and Jarrod Millman, editors,
           Proceedings of the 9th Python in Science Conference, pages 56
           – 61, 2010. doi:10.25080/Majora-92bf1922-00a.



  USACE Coastal Engineering Toolkit and a Method of
        Creating a Web-Based Application
                           Amanda Catlett‡∗, Theresa R. Coumbe‡, Scott D. Christensen‡, Mary A. Byrant‡






∗ Corresponding author: amanda.r.catlett@erdc.dren.mil
‡ ERDC

Copyright © 2022 Amanda Catlett et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—In the early 1990s the Automated Coastal Engineering System (ACES) was created with the goal of providing state-of-the-art computer-based tools to increase the accuracy, reliability, and cost-effectiveness of Corps coastal engineering endeavors. Over the past 30 years, ACES has become less and less accessible to engineers. An updated version of ACES was necessary for use in coastal engineering. Our goal was to bring the tools in ACES to a user-friendly web-based dashboard that would allow a wide range of users to easily and quickly visualize results. We will discuss how we restructured the code using class inheritance and the three libraries Param, Panel, and HoloViews to create an extensible, interactive, graphical user interface. We have created the USACE Coastal Engineering Toolkit (UCET), which is a web-based application that contains 20 of the tools in ACES. UCET serves as an outline for the process of taking a model or set of tools and developing a web-based application that can produce visualizations of the results.

Index Terms—GUI, Param, Panel, HoloViews

Introduction
The Automated Coastal Engineering System (ACES) was developed in response to the charge by LTG E. R. Heiberg III, who was the Chief of Engineers at the time, to provide improved design capabilities to the Corps coastal specialists [Leenknecht]. In 1992, ACES was presented as an interactive computer-based design and analysis system in the field of coastal engineering. The tools consist of seven functional areas, which are: Wave Prediction, Wave Theory, Structural Design, Wave Runup Transmission and Overtopping, Littoral Process, and Inlet Processes. These functional areas range from classical theory describing wave motion, to expressions resulting from tests of structures in wave flumes, to numerical models describing the exchange of energy from the atmosphere to the sea surface. The math behind these uses anything from simple algebraic expressions, both theoretical and empirical, to numerically intense algorithms [Leenknecht][UG][shankar].
    Originally, ACES was written in FORTRAN 77, resulting in a decreased ability to use the tool as technology has evolved. In 2017, the codebase was converted from FORTRAN 77 to MATLAB and Python. This conversion ensured that coastal engineers using this tool base would not need training in yet another coding language. In 2020, the Engineered Resilient Systems (ERS) Rapid Application Development (RAD) team undertook the project with the goal of deploying the ACES tools as a web-based application, and ultimately renamed it the USACE Coastal Engineering Toolkit (UCET).
    The RAD team focused on updating the Python codebase utilizing Python's object-oriented programming and the newly developed HoloViz ecosystem. The team refactored the code to implement inheritance so the code is clean, readable, and scalable. The tools were given a Graphical User Interface (GUI) so that the web-app implementation would provide a user-friendly experience. This was done by using the HoloViz-maintained libraries: Param, Panel, and HoloViews.
    This paper will discuss some of the steps that were taken by the RAD team to update the Python codebase to create a Panel application of the coastal engineering tools. In particular, we discuss refactoring the input and output variables with the Param library, the class hierarchy used, and the use of Panel and HoloViews for a user-friendly experience.

Refactoring Using Param
Each coastal tool in UCET has two classes: the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model, whereas the GUI class holds information for GUI visualization. To make implementation of the GUI more seamless, we refactored model variables to utilize the Param library. Param is a library that has the goal of simplifying the codebase by letting the programmer explicitly declare the types and values of parameters accepted by the code. Param can also be seamlessly used when implementing the GUI through Panel and HoloViews.
    Each UCET tool's model class declares the input and output values used in the model as class parameters. Each input and output variable is declared and given the following metadata features:
   •   default: each input variable is defined as a Param with a default value defined from the 1992 ACES user manual
   •   bounds: each input variable is defined with range values defined in the 1992 ACES user manual
   •   doc or docstrings: input and output variables have the expected variable and description of the variable defined as a doc. This is used as a label over the input and output widgets. Most docstrings follow the pattern of <variable>: <description of variable [units, if any]>
   •   constant: the output variables all set constant=True, thereby restricting the user's ability to manipulate the
   •   constant: the output variables all set constant equal to True, thereby restricting the user's ability to manipulate the value. Note that when calculations are performed, the assignments need to happen inside a with param.edit_constant(self) block
   •   precedence: input and output variables use precedence when there are instances where the variable does not need to be seen

    The following is an example of an input parameter:

H = param.Number(
    doc='H: wave height [{distance_unit}]',
    default=6.3,
    bounds=(0.1, 200)
)

An example of an output variable is:

L = param.Number(
    doc='L: Wavelength [{distance_unit}]',
    constant=True
)

The model's main calculation functions mostly remained unchanged. However, the use of Param eliminated the need for code that handled type checking and bounds checking.
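The pattern for updating constant outputs looks roughly like the following sketch. It is an illustration only, not the UCET source; the class name, the run_model method, and the 1.56 * T**2 deep-water estimate are stand-ins:

import param

class LinearWaveTheoryModel(param.Parameterized):
    # Hypothetical, trimmed-down model class following the H and L examples above.
    H = param.Number(default=6.3, bounds=(0.1, 200), doc='H: wave height [m]')
    T = param.Number(default=8.0, bounds=(1.0, 1000.0), doc='T: wave period [s]')
    L = param.Number(default=0.0, constant=True, doc='L: Wavelength [m]')

    def run_model(self):
        # Outputs are declared constant, so assignments must happen
        # inside an edit_constant context.
        with param.edit_constant(self):
            self.L = 1.56 * self.T ** 2  # placeholder deep-water wavelength estimate

Instantiating the class and calling run_model() populates L, while Param still rejects out-of-bounds inputs such as a negative wave height.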
Class Hierarchy

UCET has twenty tools from six of the original seven functional areas of ACES. When we designed our class hierarchy, we focused on the visualization of the web application rather than on the functional areas. Thus, each tool's class can be categorized as Base-Tool, Graph-Tool, Water-Tool, or Graph-Water-Tool. The Base-Tool has the coastal engineering models that do not have any water property inputs (such as water density) in the calculations and no graphical output. The Graph-Tool has the coastal engineering models that do not have any water property inputs in the calculations but do have a graphical output. The Water-Tool has the coastal engineering models that have water property inputs in the calculations and no graphical output. The Graph-Water-Tool has the coastal engineering models that have water property inputs in the calculations and a graphical output. Figure 1 shows the flow of inheritance for each of those classes.
    There are two general categories for the classes in the UCET codebase: utility and tool-specific. Utility classes have methods and functions that are utilized across more than one tool. The utility classes are:

   •   BaseDriver: holds the methods and functions that each tool needs to collect data, run coastal engineering models, and print data.
   •   WaterDriver: has the methods that make water density and water weight available to the models that need those inputs for the calculations.
   •   BaseGui: has the functions and methods for the visualization and utilization of all inputs and outputs within each tool's GUI.
   •   WaterTypeGui: has the widget for water selection.
   •   TabulatorDataGui: holds the functions and methods used for visualizing plots and the ability to download the data that is used for plotting.

    Each coastal tool in UCET has two classes, the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model. The model class directly inherits from either the BaseDriver or the WaterTypeDriver. The tool's GUI class holds information for GUI visualization that is different from the BaseGui, WaterTypeGui, and TabulatorDataGui classes. In Figure 1 the model classes are labeled as Base-Tool Class, Graph-Tool Class, Water-Tool Class, and Graph-Water-Tool Class, and each has a corresponding GUI class.
    Due to the inheritance in UCET, the first two questions to ask when adding a tool are: 'Does this tool need water variables for the calculation?' and 'Does this tool have a graph?'. The developer can then add a model class and a GUI class and inherit based on Figure 1. For instance, Linear Wave Theory is an application that yields first-order approximations for various parameters of wave motion as predicted by linear wave theory. It provides common items of interest such as water surface elevation, general wave properties, particle kinematics, and pressure as a function of wave height and period, water depth, and position in the wave form. This tool uses water density and has multiple graphs in its output. Therefore, Linear Wave Theory is considered a Graph-Water-Tool: the model class inherits from WaterTypeDriver, and the GUI class inherits from the Linear Wave Theory model class, WaterTypeGui, and TabularDataGui, as sketched below.
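Concretely, adding a new Graph-Water-Tool amounts to two small class definitions. The skeleton below is a hedged sketch; the stand-in base classes are placeholders for the UCET utility classes named above, and NewTool is hypothetical:

# Placeholder bases so the sketch is self-contained; in UCET these come from the toolkit.
class WaterTypeDriver: ...
class WaterTypeGui: ...
class TabularDataGui: ...

class NewToolModel(WaterTypeDriver):
    """Model class: declares Param inputs/outputs and runs the coastal engineering model."""
    def run_model(self):
        pass

class NewToolGui(NewToolModel, WaterTypeGui, TabularDataGui):
    """GUI class: lays out the Panel widgets and graphs for NewToolModel."""
    pass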
GUI Implementation Using Panel and HoloViews

Each UCET tool has a GUI class where the Panel and HoloViews libraries are implemented. Panel is a hierarchical container that can lay out panes, widgets, or other Panels in an arrangement that forms an app or dashboard. The Pane is used to render any widget-like object such as Spinner, Tabulator, Buttons, CheckBox, Indicators, etc. Those widgets are used to gather user input and run the specific tool's model.
    UCET utilizes the following widgets to gather user input:

   •   Spinner: single numeric input values
   •   Tabulator: table input data
   •   CheckBox: true or false values
   •   Drop down: items that have a list of pre-selected values, such as which units to use

    UCET utilizes indicators.Number, Tabulator, and graphs to visualize the outputs of the coastal engineering models. A single number is shown using indicators.Number, and graph data is displayed using the Tabulator widget. The graphs are created using HoloViews and have tool options such as panning, zooming, and saving. Buttons are used to calculate, save the current run, and save the graph data.
    All of these widgets are organized into five panels: title, options, inputs, outputs, and graph. BaseGui, WaterTypeGui, and TabularDataGui have methods that organize the widgets within the five panels that most tools follow. The "options" panel has a row that holds the dropdown selections for units and water type (if the tool is a Water-Tool). Some tools have a second row in the "options" panel with other drop-down options. The input panel has two columns of Spinner widgets with a calculation button at the bottom left. The output panel has two columns of indicators.Number for the single numeric output values. At the bottom of the output panel there is a button to "save the current profile". The graph panel is tabbed: the first tab shows the graph and the second tab shows the data provided within the graph. A visual outline of this can be seen in the figure below. Some of the UCET tools have more complicated input or output visualizations, and those tools' GUI classes add or modify methods to meet their needs.
    The general outline of a UCET tool for the GUI.
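To give a feel for how such a layout is composed, here is a minimal, hypothetical Panel arrangement of the five panels; the widgets and labels are illustrative and not taken from the UCET code:

import panel as pn

pn.extension('tabulator')

units = pn.widgets.Select(name='Units', options=['metric', 'english'])
wave_height = pn.widgets.Spinner(name='H: wave height [m]', value=6.3, step=0.1)
wave_period = pn.widgets.Spinner(name='T: wave period [s]', value=8.0, step=0.1)
calculate = pn.widgets.Button(name='Calculate', button_type='primary')
wavelength = pn.indicators.Number(name='L: Wavelength [m]', value=0.0, format='{value:.2f}')

app = pn.Column(
    pn.pane.Markdown('## Example tool'),                         # title panel
    pn.Row(units),                                               # options panel
    pn.Row(pn.Column(wave_height, wave_period, calculate)),      # inputs panel
    pn.Row(wavelength),                                          # outputs panel
    pn.Tabs(('Graph', pn.pane.Markdown('plot goes here')),       # graph panel: plot tab
            ('Data', pn.pane.Markdown('graph data goes here'))), # and its data tab
)
app.servable()  # exposed with `panel serve`

Wiring the Calculate button to a model's run_model method and pushing the result into the indicator would be done with a Button callback such as calculate.on_click(...).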
Current State

UCET approaches software development from the perspective of someone within the field of Research and Development. Each tool within UCET is not inherently complex from the traditional software perspective. However, this codebase enables researchers to execute complex coastal engineering models in a user-friendly environment by leveraging open-source libraries in the scientific Python ecosystem such as Param, Panel, and HoloViews.
    Currently, UCET is only deployed from the command line using the panel serve command. UCET is awaiting the Security Technical Implementation Guide process before it can be launched as a website. As part of this security vetting process we plan to leverage continuous integration/continuous deployment (CI/CD) tools to automate the deployment process. While this process is happening, we have started to gather feedback from coastal engineers to improve the tools' usability and accuracy and to add suggested features. To minimize the amount of computer science knowledge the coastal engineers need, our team created a batch script. The script creates a conda environment, activates it, and runs the panel serve command to launch the app on a local host. The user only needs to click on the batch script for this to take place.
    Tests are also being created to ensure the accuracy of the tools, using a testing framework to compare output from UCET with that of the original FORTRAN code. The biggest barrier to this testing strategy is getting data out of the FORTRAN code to compare with Python. Currently, there are tests for most of the tools that read a CSV file of input and output results from FORTRAN and compare them with what the Python code calculates.
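A hedged sketch of what such a comparison test can look like, assuming a model class like the earlier LinearWaveTheoryModel sketch and a hypothetical CSV of FORTRAN inputs and outputs:

import csv
import math

def test_against_fortran_results(model_cls, csv_path, rel_tol=1e-4):
    """Run the Python model on each FORTRAN input row and compare the outputs."""
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            model = model_cls(H=float(row['H']), T=float(row['T']))
            model.run_model()
            assert math.isclose(model.L, float(row['L']), rel_tol=rel_tol)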
    Our team has also compiled an updated user guide on how to use the tool, what to expect from the tool, and a deeper description of any warning messages that might appear as the user adds input values. For example, if a user chooses input values such that the application does not make physical sense, a warning message will appear under the output header and replace all output values. For a more concrete example: Linear Wave Theory has the vertical coordinate (z) and the water depth (d) as input values, and when the sum of those values is less than zero the point is outside the waveform. Therefore, if a user enters a combination where the sum is less than zero, UCET will post a warning telling the user that the point is outside the waveform; see the figure below for an example. The developers have been documenting this project using GitHub and JIRA.
    An example of a warning message based on chosen inputs.

Results

Linear Wave Theory was described in the class hierarchy example. This Graph-Water-Tool utilizes most of the BaseGui methods. The biggest difference is that instead of having three graphs in the graph panel there is a plot selector drop-down where the user can select which graph they want to see.
    Windspeed Adjustment and Wave Growth provides a quick and simple estimate for wave growth over open-water and restricted fetches in deep and shallow water. This is a Base-Tool, as there are no graphs and no water variables in the calculations. This tool has four additional options in the options panel where the user can select the wind observation type, fetch type, wave equation type, and whether knots are being used. Based on the selection of these options, the input and output variables change so that only what is used or calculated for those selections is seen.

Conclusion and Future Work

Thirty years ago, ACES was developed to provide improved design capabilities to Corps coastal specialists, and while these tools are still used today, it became more and more difficult for users to access them. Five years ago, there was a push to update the code base to one that coastal specialists would be more familiar with: MATLAB and Python. Within the last two years the RAD team was able to finalize the update so that users can access these tools without having years of programming experience. We were able to do this by utilizing classes, inheritance, and the Param, Panel, and HoloViews libraries. The use of inheritance has allowed for a shorter codebase and has made it so new tools can be added to the toolkit. Param, Panel, and HoloViews work cohesively together to not only run the models but also provide a simple interface.
    Future work will involve expanding UCET to include current coastal engineering models, and completing the security vetting
process to deploy to a publicly accessible website. We plan to incorporate automated CI/CD to ensure smooth deployment of future versions. We will also continue to incorporate feedback from users and refine the code to ensure the application provides a quality user experience.

Fig. 1: Screen shot of Linear Wave Theory.

Fig. 2: Screen shot of Windspeed Adjustment and Wave Growth.

REFERENCES

[Leenknecht] David A. Leenknecht, Andre Szuwalski, and Ann R. Sherlock. 1992. Automated Coastal Engineering System - Technical Reference. Technical report. https://usace.contentdm.oclc.org/digital/collection/p266001coll1/id/2321/
[panel]      "Panel: A High-Level App and Dashboarding Solution for Python." Panel 0.12.6 Documentation, Panel Contributors, 2019, https://panel.holoviz.org/.
[holoviz]    "High-Level Tools to Simplify Visualization in Python." HoloViz 0.13.0 Documentation, HoloViz Authors, 2017, https://holoviz.org.
[UG]         David A. Leenknecht, et al. "Automated Tools for Coastal Engineering." Journal of Coastal Research, vol. 11, no. 4, Coastal Education & Research Foundation, Inc., 1995, pp. 1108-1124. https://usace.contentdm.oclc.org/digital/collection/p266001coll1/id/2321/
[shankar]    N. J. Shankar and M. P. R. Jayaratne. "Wave run-up and overtopping on smooth and rough slopes of coastal structures." Ocean Engineering, vol. 30, no. 2, 2003, pp. 221-238, ISSN 0029-8018. https://doi.org/10.1016/S0029-8018(02)00016-1
Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI

Luigi Cruz‡∗, Wael Farah‡, Richard Elkins‡
Abstract—A common technique adopted by the Search for Extraterrestrial Intelligence (SETI) community is monitoring electromagnetic radiation for signs of extraterrestrial technosignatures using ground-based radio observatories. The analysis is made using a Python-based software package called TurboSETI to detect narrowband drifting signals inside the recordings that could indicate a technosignature. The data stream generated by a telescope can easily reach the rate of terabits per second. Our goal was to improve the processing speed by writing a GPU-accelerated backend in addition to the original CPU-based implementation of the de-doppler algorithm used to integrate the power of drifting signals. We discuss how we ported a CPU-only program to leverage the parallel capabilities of a GPU using CuPy, Numba, and custom CUDA kernels. The accelerated backend reached a speed-up of an order of magnitude over the CPU implementation.

Index Terms—gpu, numba, cupy, seti, turboseti

* Corresponding author: lfcruz@seti.org
‡ SETI Institute

Copyright © 2022 Luigi Cruz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. Introduction

The Search for Extraterrestrial Intelligence (SETI) is a broad term utilized to describe the effort of locating any scientific proof of past or present technology that originated beyond the bounds of Earth. SETI can be performed in a plethora of ways: either actively, by deploying orbiters and rovers around planets and moons within the solar system, or passively, by searching for biosignatures in exoplanet atmospheres or "listening" for technologically-capable extraterrestrial civilizations. One of the most common techniques adopted by the SETI community is monitoring electromagnetic radiation for narrowband signs of technosignatures using ground-based radio observatories. This search can be performed in multiple ways: with equipment primarily built for this task, like the Allen Telescope Array (California, USA), by renting observation time, or in the background while the primary user is conducting other observations. Other radio observatories useful for this search include the MeerKAT Telescope (Northern Cape, South Africa), the Green Bank Telescope (West Virginia, USA), and the Parkes Telescope (New South Wales, Australia). The operation of a radio telescope is similar to that of an optical telescope. Instead of using optics to concentrate light onto an optical sensor, a radio telescope operates by concentrating electromagnetic waves onto an antenna using a large reflective structure called a "dish" ([Reb82]). The interaction between the metallic antenna and the electromagnetic wave generates a faint electrical current. This effect is then quantized by an analog-to-digital converter as voltages and transmitted to a processing logic to extract useful information from it. The data stream generated by a radio telescope can easily reach the rate of terabits per second because of the ultra-wide bandwidth of the radio spectrum. The current workflow utilized by Breakthrough Listen, the largest scientific research program aimed at finding evidence of extraterrestrial intelligence, consists of pre-processing and storing the incoming data as frequency-time binary files ([LCS+19]) in persistent storage for later analysis. This post-analysis is made possible using a Python-based software package called TurboSETI ([ESF+17]) to detect narrowband signals that could be drifting in frequency owing to the relative radial velocity between the observer on Earth and the transmitter. The offline processing speed of TurboSETI is directly related to the scientific output of an observation. Each voltage file ingested by TurboSETI is often on the order of a few hundred gigabytes. To process data efficiently without Python overhead, the program uses NumPy for near machine-level performance. To measure a potential signal's drift rate, TurboSETI uses a de-doppler algorithm to align the frequency axis according to a pre-set drift rate. Another algorithm called "hitsearch" ([ESF+17]) is then utilized to identify any signal present in the recorded spectrum. These two algorithms are the most resource-hungry elements of the pipeline, consuming almost 90% of the running time.

2. Approach

Multiple methods were utilized in this effort to write a GPU-accelerated backend and optimize the CPU implementation of TurboSETI. In this section, we describe the three main methods.

2.1. CuPy

The original implementation of TurboSETI heavily depends on NumPy ([HMvdW+20]) for data processing. To keep the number of modifications as low as possible, we implemented the GPU-accelerated backend using CuPy ([OUN+17]). This open-source library offers GPU acceleration backed by NVIDIA CUDA and AMD ROCm while using a NumPy-style API. This enabled us to reuse most of the code between the CPU- and GPU-based implementations.
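The NumPy-compatible API is what keeps the two backends from diverging: most routines can be written once against an array module chosen at run time. A simplified sketch of that pattern follows (not TurboSETI's actual code; the function names are illustrative):

import numpy as np

try:
    import cupy as cp
except ImportError:        # the GPU backend is optional
    cp = None

def get_array_module(use_gpu):
    """Return CuPy when a GPU backend is requested and available, otherwise NumPy."""
    return cp if (use_gpu and cp is not None) else np

def integrate_spectrum(spectrogram, use_gpu=False):
    xp = get_array_module(use_gpu)
    data = xp.asarray(spectrogram, dtype=xp.float32)
    return xp.sum(data, axis=0)   # identical array code runs on either backend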

2.2. Numba

Some computationally heavy methods of the original CPU-based implementation of TurboSETI were written in Cython. This approach has disadvantages: the developer has to be familiar with Cython syntax to alter the code, and the code requires additional logic
to be compiled at installation time. Consequently, it was decided to replace Cython with pure Python methods decorated with the Numba ([LPS15]) accelerator. By leveraging the Just-In-Time (JIT) compiler from the Low Level Virtual Machine (LLVM) project, Numba can compile Python code into native code and apply Single Instruction/Multiple Data (SIMD) instructions to achieve near machine-level speeds.
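The resulting code keeps plain Python/NumPy syntax and gains compiled-loop speed from a decorator. The toy shift-and-sum below only sketches the flavor of such a kernel; it is not the real de-doppler implementation:

import numpy as np
from numba import njit

@njit
def shift_and_sum(spectra, drift_step):
    """Accumulate power along a linear drift: shift each time sample, then sum."""
    n_time, n_freq = spectra.shape
    out = np.zeros(n_freq, dtype=np.float32)
    for t in range(n_time):
        shift = t * drift_step
        for f in range(n_freq - shift):
            out[f] += spectra[t, f + shift]
    return out

spectra = np.random.rand(16, 1024).astype(np.float32)
power = shift_and_sum(spectra, 1)   # first call triggers JIT compilation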
2.3. Single-Precision Floating-Point

The original implementation of the software handled the input data as double-precision floating-point numbers. This caused all mathematical operations to take significantly longer because of the extended precision. The ultimate precision of the output product is inherently limited by the precision of the original input data, which in most cases is represented by an 8-bit signed integer. Therefore, the addition of a single-precision floating-point mode decreased the processing time without compromising the useful precision of the output data.
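In practice this amounts to a single cast when the data are loaded; a minimal illustration of the idea, with synthetic data standing in for the 8-bit voltages:

import numpy as np

raw = np.random.randint(-128, 128, size=(1024, 4096), dtype=np.int8)
as_double = raw.astype(np.float64)   # original double-precision path
as_single = raw.astype(np.float32)   # single-precision path
print(as_double.nbytes // as_single.nbytes)   # 2: half the memory to move and operate on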
3. Results

To test the speed improvements between implementations we used files from previous observations coming from different observatories. Table 1 lists the processing times for three different files in double-precision mode. The CPU implementation based on Numba is measurably faster than the original CPU implementation based on Cython. At the same time, the GPU-accelerated backend processed the data 6.8 to 9.3 times faster than the original CPU-based implementation.

Double-Precision (float64)
Impl.     Device    File A      File B       File C
Cython    CPU       0.44 min    25.26 min    23.06 min
Numba     CPU       0.36 min    20.67 min    22.44 min
CuPy      GPU       0.05 min    2.73 min     3.40 min

TABLE 1: Double-precision processing time benchmark with the Cython, Numba, and CuPy implementations.

    Table 2 shows the same comparison as Table 1 but with single-precision floating-point numbers. The original Cython implementation was left out because it does not support single-precision mode. Here, the same data was processed 7.5 to 10.6 times faster than by the Numba CPU-based implementation.

Single-Precision (float32)
Impl.     Device    File A      File B       File C
Numba     CPU       0.26 min    16.13 min    16.15 min
CuPy      GPU       0.03 min    1.52 min     2.14 min

TABLE 2: Single-precision processing time benchmark with the Numba and CuPy implementations.

    To illustrate the processing time improvement, a single observation containing 105 GB of data was processed in 12 hours by the original CPU-based TurboSETI implementation on an Intel i7-7700K CPU, and in just 1 hour and 45 minutes by the GPU-accelerated backend on an NVIDIA GTX 1070 Ti GPU.

4. Conclusion

The original implementation of TurboSETI worked exclusively on the CPU to process data. We implemented a GPU-accelerated backend to leverage the massive parallelization capabilities of a graphical device. The benchmarks show that the new CPU and GPU implementations take significantly less time to process observation data, resulting in more science being produced. Based on the results, the recommended configuration is to run the program with single-precision calculations on a GPU device.

REFERENCES

[ESF+17]    J. Emilio Enriquez, Andrew Siemion, Griffin Foster, Vishal Gajjar, Greg Hellbourg, Jack Hickish, Howard Isaacson, Danny C. Price, Steve Croft, David DeBoer, Matt Lebofsky, David H. E. MacMahon, and Dan Werthimer. The Breakthrough Listen search for intelligent life: 1.1-1.9 GHz observations of 692 nearby stars. The Astrophysical Journal, 849(2):104, Nov 2017. URL: https://ui.adsabs.harvard.edu/abs/2017ApJ...849..104E/abstract, doi:10.3847/1538-4357/aa8d1b.
[HMvdW+20]  Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, September 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.
[LCS+19]    Matthew Lebofsky, Steve Croft, Andrew P. V. Siemion, Danny C. Price, J. Emilio Enriquez, Howard Isaacson, David H. E. MacMahon, David Anderson, Bryan Brzycki, Jeff Cobb, Daniel Czech, David DeBoer, Julia DeMarines, Jamie Drew, Griffin Foster, Vishal Gajjar, Nectaria Gizani, Greg Hellbourg, Eric J. Korpela, and Brian Lacki. The Breakthrough Listen search for intelligent life: public data, formats, reduction, and archiving. Publications of the Astronomical Society of the Pacific, 131(1006):124505, Nov 2019. URL: https://arxiv.org/abs/1906.07391, doi:10.1088/1538-3873/ab3e82.
[LPS15]     Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM '15, New York, NY, USA, 2015. Association for Computing Machinery. URL: https://doi.org/10.1145/2833157.2833162, doi:10.1145/2833157.2833162.
[OUN+17]    Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido, and Crissman Loomis. CuPy: a NumPy-compatible library for NVIDIA GPU calculations. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017. URL: http://learningsys.org/nips17/assets/papers/paper_16.pdf.
[Reb82]     Grote Reber. Cosmic Static, pages 61-69. Springer Netherlands, Dordrecht, 1982. URL: https://doi.org/10.1007/978-94-009-7752-5_6, doi:10.1007/978-94-009-7752-5_6.
Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration

Pi-Yueh Chuang‡∗, Lorena A. Barba‡
Abstract—Though PINNs (physics-informed neural networks) are now deemed a complement to traditional CFD (computational fluid dynamics) solvers rather than a replacement, their ability to solve the Navier-Stokes equations without given data is still of great interest. This report presents our not-so-successful experiments of solving the Navier-Stokes equations with PINN as a replacement for traditional solvers. We aim, with our experiments, to prepare readers for the challenges they may face if they are interested in data-free PINN. In this work, we used two standard flow problems: the 2D Taylor-Green vortex at Re = 100 and 2D cylinder flow at Re = 200. The PINN method solved the 2D Taylor-Green vortex problem with acceptable results, and we used this flow as an accuracy and performance benchmark. About 32 hours of training were required for the PINN method's accuracy to match the accuracy of a 16 × 16 finite-difference simulation, which took less than 20 seconds. The 2D cylinder flow, on the other hand, did not produce a physical solution. The PINN method behaved like a steady-flow solver and did not capture the vortex shedding phenomenon. By sharing our experience, we would like to emphasize that the PINN method is still a work-in-progress, especially in terms of solving flow problems without any given data. More work is needed to make PINN feasible for real-world problems in such applications. (Reproducibility package: [Chu22].)

Index Terms—computational fluid dynamics, deep learning, physics-informed neural network

* Corresponding author: pychuang@gwu.edu
‡ Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, DC 20052, USA

Copyright © 2022 Pi-Yueh Chuang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. Introduction

Recent advances in computing and programming techniques have motivated practitioners to revisit deep learning applications in computational fluid dynamics (CFD). We use the verb "revisit" because deep learning applications in CFD already existed going back to at least the 1990s, for example, using neural networks as surrogate models ([LS], [FS]). Another example is the work of Lagaris and colleagues ([LLF]) on solving partial differential equations with fully-connected neural networks back in 1998. Similar work with radial basis function networks can be found in reference [LLQH]. Nevertheless, deep learning applications in CFD did not get much attention until this decade, thanks to modern computing technology, including GPUs, cloud computing, high-level libraries like PyTorch and TensorFlow, and their Python APIs.
    Solving partial differential equations with deep learning is particularly interesting to CFD researchers and practitioners. The PINN (physics-informed neural network) method denotes an approach to incorporate deep learning in CFD applications, where solving partial differential equations plays the key role. These partial differential equations include the well-known Navier-Stokes equations, one of the Millennium Prize Problems. The universal approximation theorem ([Hor]) implies that neural networks can model the solution to the Navier-Stokes equations with high fidelity and capture complicated flow details as long as the networks are big enough. The idea of PINN methods can be traced back to [DPT], while the name PINN was coined in [RPK]. Human-provided data are not necessary when applying PINN [LMMK], making it a potential alternative to traditional CFD solvers. Sometimes it is branded as unsupervised learning: it does not rely on human-provided data, making it sound very "AI." It is now common to see headlines like "AI has cracked the Navier-Stokes equations" in recent popular science articles ([Hao]).
    Though data-free PINN as an alternative to traditional CFD solvers may sound attractive, PINN can also be used under data-driven configurations, for which it is better suited. Cai et al. [CMW+] state that PINN is not meant to be a replacement for existing CFD solvers due to its inferior accuracy and efficiency. The most useful applications of PINN should be those with some given data, so that the models are trained against the data. For example, when we have experimental measurements or partial simulation results (coarse-grid data, limited numbers of snapshots, etc.) from traditional CFD solvers, PINN may be useful to reconstruct the flow or to serve as a surrogate model.
    Nevertheless, data-free PINN may offer some advantages over traditional solvers, and using data-free PINN to replace traditional solvers is still of great interest to researchers (e.g., [KDYI]). First, it is a mesh-free scheme, which benefits engineering problems where fluid flows interact with objects of complicated geometries. Simulating these fluid flows with traditional numerical methods usually requires high-quality unstructured meshes with time-consuming human intervention in the pre-processing stage before actual simulations. The second benefit of PINN is that the trained models approximate the governing equations' general solutions, meaning there is no need to solve the equations repeatedly for different flow parameters. For example, a flow model taking boundary velocity profiles as its input arguments can predict flows under different boundary velocity profiles after training. Conventional numerical methods, on the contrary, require repeated simulations, each one covering one boundary velocity profile. This feature could help in situations like engineering design optimization: the process of running sets of experiments to conduct parameter sweeps and find the optimal values or geometries for
products. Given these benefits, researchers continue studying and improving the usability of data-free PINN (e.g., [WYP], [DZ], [WTP], [SS]).
    Data-free PINN, however, is not ready nor meant to replace traditional CFD solvers. This claim may be obvious to researchers experienced in PINN, but it may not be clear to others, especially to CFD end-users without ample expertise in numerical methods. Even in literature that aims to improve PINN, it is common to see only success stories with simple CFD problems. Important information concerning the feasibility of PINN in practical and real-world applications is often missing from these success stories. For example, few reports discuss the required computing resources, the computational cost of training, the convergence properties, or the error analysis of PINN. PINN suffers from performance and solvability issues due to the need for high-order automatic differentiation and multi-objective nonlinear optimization. Evaluating high-order derivatives using automatic differentiation enlarges the computational graphs of neural networks. And multi-objective optimization, which reduces all the residuals of the differential equations, initial conditions, and boundary conditions, makes the training difficult to converge to small-enough loss values. Fluid flows are sensitive nonlinear dynamical systems in which a small change or error in inputs may produce a very different flow field. So to get correct solutions, the optimization in PINN needs to minimize the loss to values very close to zero, further compromising the method's solvability and performance.
    This paper reports on our not-so-successful PINN story as a lesson learned for readers, so they can be aware of the challenges they may face if they consider using data-free PINN in real-world applications. Our story includes two computational experiments as case studies to benchmark the PINN method's accuracy and computational performance. The first case study is a Taylor-Green vortex, solved successfully though not to our complete satisfaction. We will discuss the performance of PINN using this case study. The second case study, flow over a cylinder, did not even result in a physical solution. We will discuss the frustration we encountered with PINN in this case study.
    We built our PINN solver with the help of NVIDIA's Modulus library ([noa]). Modulus is a high-level Python package built on top of PyTorch that helps users develop PINN-based differential equation solvers. In each case study, we also carried out simulations with our CFD solver, PetIBM ([CMKAB18]). PetIBM is a traditional solver using staggered-grid finite difference methods with MPI parallelization and GPU computing. The PetIBM simulations in each case study served as baseline data. For all cases, the configurations, post-processing scripts, and required Singularity image definitions can be found at reference [Chu22].
    This paper is structured as follows: the second section briefly describes the PINN method and an analogy to traditional CFD methods. The third and fourth sections provide our computational experiments of the Taylor-Green vortex in 2D and a 2D laminar cylinder flow with vortex shedding. Most discussions happen in the corresponding case studies. The last section presents the conclusion and discussions that did not fit into either one of the cases.

2. Solving Navier-Stokes equations with PINN

The incompressible Navier-Stokes equations in vector form are composed of the continuity equation:

    ∇ · U = 0                                                         (1)

and the momentum equations:

    ∂U/∂t + (U · ∇)U = -(1/ρ)∇p + ν∇²U + g                            (2)

where ρ = ρ(x, t), ν = ν(x, t), and p = p(x, t) are scalar fields denoting density, kinematic viscosity, and pressure, respectively. x denotes the spatial coordinate, and x = [x, y]ᵀ in two dimensions. The density and viscosity fields are usually known and given, while the pressure field is unknown. U = U(x, t) = [u(x, y, t), v(x, y, t)]ᵀ is a vector field for the flow velocity. All of them are functions of the spatial coordinate in the computational domain Ω and of time before a given limit T. The gravitational field g may also be a function of space and time, though it is usually a constant. A solution to the Navier-Stokes equations is subject to an initial condition and boundary conditions:

    U(x, t) = U₀(x),      ∀ x ∈ Ω, t = 0
    U(x, t) = U_Γ(x, t),  ∀ x ∈ Γ, t ∈ [0, T]                         (3)
    p(x, t) = p_Γ(x, t),  ∀ x ∈ Γ, t ∈ [0, T]

where Γ represents the boundary of the computational domain.

2.1. The PINN method

The basic form of the PINN method ([RPK], [CMW+]) starts from approximating U and p with a neural network:

    [U, p]ᵀ(x, t) ≈ G(x, t; Θ)                                        (4)

Here we use a single network that predicts both the pressure and velocity fields. It is also possible to use different networks for them separately. Later in this work, we will use G_U and G_p to denote the predicted velocity and pressure from the neural network. Θ at this point represents the free parameters of the network.
    To determine the free parameters Θ, ideally, we hope the approximate solution gives zero residuals for equations (1), (2), and (3). That is,

    r₁(x, t; Θ) ≡ ∇ · G_U = 0
    r₂(x, t; Θ) ≡ ∂G_U/∂t + (G_U · ∇)G_U + (1/ρ)∇G_p - ν∇²G_U - g = 0
    r₃(x; Θ)    ≡ G_U|_{t=0} - U₀ = 0                                 (5)
    r₄(x, t; Θ) ≡ G_U - U_Γ = 0,  ∀ x ∈ Γ
    r₅(x, t; Θ) ≡ G_p - p_Γ = 0,  ∀ x ∈ Γ

and the desired set of parameters, Θ = θ, is the common zero root of all the residuals.
    The derivatives of G with respect to x and t are usually obtained using automatic differentiation. Nevertheless, it is possible to use analytical derivatives when the chosen network architecture is simple enough, as reported by early-day literature ([LLF], [LLQH]).
    If the residuals in (5) are not complicated, and if the number of parameters, N_Θ, is small enough, we may numerically find the zero root by solving a system of N_Θ nonlinear equations generated from a suitable set of N_Θ spatial-temporal points. However, this scenario rarely happens, as G is usually highly complicated and N_Θ is large. Moreover, we do not even know if such a zero root exists for the equations in (5).
    Instead, in PINN, the condition is relaxed. We do not seek the zero root of (5) but just hope to find a set of parameters that make
the residuals sufficiently close to zero. Consider the sum of the l2 norms of the residuals:

    r(x, t; Θ = θ) ≡ Σ_{i=1}^{5} ‖r_i(x, t; Θ = θ)‖²,  ∀ x ∈ Ω, t ∈ [0, T]     (6)

The θ that makes the residuals closest to zero (or even equal to zero, if such a θ exists) also makes (6) minimal, because r(x, t; Θ) ≥ 0. In other words,

    θ = arg min_Θ r(x, t; Θ),  ∀ x ∈ Ω, t ∈ [0, T]                             (7)

This poses a fundamental difference between the PINN method and traditional CFD schemes, making it potentially more difficult for the PINN method to achieve the same accuracy as the traditional schemes. We will discuss this more in section 3. Note that in practice, each loss term on the right-hand side of equation (6) is weighted. We ignore the weights here for demonstration purposes.
    To solve (7), theoretically, we can use any number of spatial-temporal points, which eases the need for computational resources compared to finding the zero root directly. Gradient-descent-based optimizers further reduce the computational cost, especially in terms of memory usage and the difficulty of parallelization. Alternatively, quasi-Newton methods may work, but only when N_Θ is small enough.
    However, even though equation (7) may be solvable, it is still a significantly expensive task. While typical data-driven learning requires one back-propagation pass on the derivatives of the loss function, here automatic differentiation is needed to evaluate the derivatives of G with respect to x and t. The first-order derivatives require one back-propagation on the network, while the second-order derivatives present in the diffusion term ∇²G_U require an additional back-propagation on the first-order derivatives' computational graph. Finally, to update the parameters in an optimizer, the gradients of G with respect to the parameters Θ require another back-propagation on the graph of the second-order derivatives. This all leads to a very large computational graph. We will see the performance of the PINN method in the case studies.
     In summary, when viewing the PINN method as supervised                          φ1 (xN ) · · · φN (xN ) cN
                                                                                       00               00                 s(xN )
machine learning, the inputs of a network are spatial-temporal
coordinates, and the outputs are the physical quantities of our          Finally, we determine the parameters by solving this linear system.
interest. The loss or objective functions in PINN are governing          Though this example uses a spectral method, the workflow also
equations that regulate how the target physical quantities should        applies to many other numerical methods, such as finite difference
behave. The use of governing equations eliminates the need for           methods, which can be reformatted as a form of spectral method.
true answers. A trivial example is using Bernoulli’s equation as             With this workflow in mind, it should be easy to see the anal-
the loss function, i.e., loss = 2gu2    p
                                     + ρg − H0 + z(x), and a neural      ogy between PINN and conventional numerical methods. Aside
network predicts the flow speed u and pressure p at a given              from using much more complicated approximate solutions, the
location x along a streamline. (The gravitational acceleration           major difference lies in how to determine the unknown parameters
g, density ρ, energy head H0 , and elevation z(x) are usually            in the approximate solutions. While traditional methods solve the
known and given.) Such a loss function regulates the relationship        zero-residual conditions, PINN relies on searching the minimal
between predicted u and p and does not need true answers for             residuals. A secondary difference is how to approximate deriva-
the two quantities. Unlike Bernoulli’s equation, most governing          tives. Conventional numerical methods use analytical or numerical
equations in physics are usually differential equations (e.g., heat      differentiation of the approximate solutions, and the PINN meth-
equations). The main difference is that now the PINN method              ods usually depends on automatic differentiation. This difference
needs automatic differentiation to evaluate the loss. Regardless         may be minor as we are still able to use analytical differentiation
of the forms of governing equations, spatial-temporal coordinates        for simple network architectures with PINN. However, automatic
are the only data required during training. Hence, throughout this       differentiation is a major factor affecting PINN’s performance.
paper, training data means spatial-temporal points and does not
                                                                         3. Case 1: Taylor-Green vortex: accuracy and performance
involve any true answers to predicted quantities. (Note in some
literature, the PINN method is applied to applications that do need      3.1. 2D Taylor-Green vortex
true answers, see [CMW+ ]. These applications are out of scope           The Taylor-Green vortex represents a family of flows with a
here.)                                                                   specific form of analytical initial flow conditions in both 2D
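As a concrete and purely illustrative example of the conventional workflow above, the following sketch solves a one-dimensional Poisson problem U″(x) = s(x) with a sine basis and collocation points. The basis choice, the manufactured right-hand side, and all variable names are ours, chosen only for demonstration; this is not code from our solver.

    import numpy as np

    # Illustration only: solve U''(x) = s(x) on (0, pi) with U(0) = U(pi) = 0,
    # using the sine basis phi_i(x) = sin(i*x), so phi_i''(x) = -i**2 * sin(i*x).
    N = 16
    x = np.pi * np.arange(1, N + 1) / (N + 1)      # N collocation points
    i = np.arange(1, N + 1)                        # basis indices

    # Manufactured problem: the exact solution is U(x) = sin(3x), so s(x) = -9 sin(3x).
    s = -9.0 * np.sin(3.0 * x)

    # A[k, j] = phi_j''(x_k); solving A c = s is the zero-residual condition (8).
    A = -(i**2) * np.sin(np.outer(x, i))
    c = np.linalg.solve(A, s)

    # Evaluate the approximate solution G(x) = sum_j c_j phi_j(x) and check the error.
    G = np.sin(np.outer(x, i)) @ c
    print(abs(G - np.sin(3.0 * x)).max())          # round-off level, ~1e-14

Replacing the explicit basis expansion with a neural network, and the linear solve with gradient-based minimization of the squared residuals at (many more) collocation points, is essentially the step from this workflow to PINN.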
3. Case 1: Taylor-Green vortex: accuracy and performance

3.1. 2D Taylor-Green vortex

The Taylor-Green vortex represents a family of flows with a specific form of analytical initial flow conditions in both 2D and 3D. The 2D Taylor-Green vortex has closed-form analytical solutions with periodic boundary conditions, and hence it is a standard benchmark case for verifying CFD solvers. In this work, we used the following 2D Taylor-Green vortex:

    u(x, y, t) =  V0 cos(x/L) sin(y/L) exp(−2νt/L²)
    v(x, y, t) = −V0 sin(x/L) cos(y/L) exp(−2νt/L²)                                   (9)
    p(x, y, t) = −(ρ/4) V0² [cos(2x/L) + cos(2y/L)] exp(−4νt/L²)

where V0 represents the peak (and also the lowest) velocity at t = 0. Other symbols carry the same meaning as those in section 2.

The periodic boundary conditions were applied at x = −Lπ, x = Lπ, y = −Lπ, and y = Lπ. We used the following parameters in this work: V0 = L = ρ = 1.0 and ν = 0.01. These parameters correspond to Reynolds number Re = 100. Figure 1 shows a snapshot of the velocity at t = 32.

Fig. 1: Contours of u and v at t = 32 to demonstrate the solution of the 2D Taylor-Green vortex.

3.2. Solver and runtime configurations

The neural network used in the PINN solver is a fully-connected neural network with 6 hidden layers and 256 neurons per layer. The activation functions are SiLU ([HG]). We used Adam for optimization, and its initial parameters are the defaults from PyTorch. The learning rate decayed exponentially through PyTorch's ExponentialLR with gamma equal to 0.95^(1/10000). Note that we did not conduct hyperparameter optimization, given the computational cost. The hyperparameters are mostly the defaults used by the 3D Taylor-Green example in Modulus ([noa]).

The training data were simply spatial-temporal coordinates. Before the training, the PINN solver pre-generated 18,432,000 spatial-temporal points to evaluate the residuals of the Navier-Stokes equations (the r1 and r2 in equation (5)). These training points were randomly chosen from the spatial domain [−π, π] × [−π, π] and temporal domain (0, 100]. The solver used only 18,432 points in each training iteration, making it batch training. For the residual of the initial condition (the r3), the solver also pre-generated 18,432,000 random spatial points and used only 18,432 per iteration. Note that for r3 the points were distributed in space only, because t = 0 is a fixed condition. Because of the periodic boundary conditions, the solver did not require any training points for r4 and r5.

The hardware used for the PINN solver was a single node of NVIDIA's DGX-A100. It was equipped with 8 A100 GPUs (80GB variants). We carried out the training using different numbers of GPUs to investigate the performance of the PINN solver. All cases were trained up to 1 million iterations. Note that the parallelization was done with weak scaling, meaning increasing the number of GPUs would not reduce the workload of each GPU. Instead, increasing the number of GPUs would increase the total and per-iteration numbers of training points. Therefore, our expected outcome was that all cases would require about the same wall time to finish, while the residual from using 8 GPUs would converge the fastest.

After training, the PINN solver's prediction errors (i.e., accuracy) were evaluated on cell centers of a 512 × 512 Cartesian mesh against the analytical solution. With these spatially distributed errors, we calculated the L2 error norm for a given t:

    L2 = sqrt( ∫_Ω error(x, y)² dΩ ) ≈ sqrt( Σ_i Σ_j error_{i,j}² ΔΩ_{i,j} )        (10)

where i and j are the indices of a cell center in the Cartesian mesh, and ΔΩ_{i,j} is the corresponding cell area, 4π²/512² in this case.

We compared accuracy and performance against results using PetIBM. All PetIBM simulations in this section were done with 1 K40 GPU and 6 CPU cores (Intel i7-5930K) on our old lab workstation. We carried out 7 PetIBM simulations with different spatial resolutions: 2^k × 2^k for k = 4, 5, …, 10. The time step size for each spatial resolution was Δt = 0.1/2^(k−4).

A special note should be made here: the PINN solver used single-precision floats, while PetIBM used double-precision floats. It might sound unfair. However, this discrepancy does not change the qualitative findings and conclusions, as we will see later.

3.3. Results

Figure 2 shows the convergence history of the total residuals (equation (6)). Using more GPUs in weak scaling (i.e., more training points) did not accelerate the convergence, contrary to what we expected. All cases converged at a similar rate. Though without a quantitative criterion or justification, we considered that further training would not improve the accuracy. Figure 3 gives a visual taste of what the predictions from the neural network look like.

Fig. 2: Total residuals (loss) with respect to training iterations.

Fig. 3: Contours of u and v at t = 32 from the PINN solver.

The result visually agrees with that in figure 1. However, as shown in figure 4, the error magnitudes from the PINN solver are much higher than those from PetIBM. Figure 4 shows the prediction errors with respect to t. We only present the error on the u velocity, as those for v and p are similar. The accuracy of the PINN solver is similar to that of the 16 × 16 simulation with PetIBM. Using more GPUs, which implies more training points, does not improve the accuracy.

Fig. 4: L2 error norm versus simulation time.
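For readers who want to reproduce this error metric, the short sketch below (our illustration, not the actual post-processing script) evaluates the analytical solution (9) on the cell centers of the 512 × 512 mesh and computes the L2 error norm (10) for a predicted u field; the function and variable names here are chosen only for demonstration.

    import numpy as np

    L, V0, nu = 1.0, 1.0, 0.01

    def u_exact(x, y, t):
        """Analytical u velocity of the 2D Taylor-Green vortex, equation (9)."""
        return V0 * np.cos(x / L) * np.sin(y / L) * np.exp(-2.0 * nu * t / L**2)

    # Cell centers of a 512 x 512 Cartesian mesh on [-pi, pi] x [-pi, pi].
    n = 512
    dx = 2.0 * np.pi / n
    centers = -np.pi + dx * (np.arange(n) + 0.5)
    X, Y = np.meshgrid(centers, centers)
    dOmega = dx * dx                     # cell area, 4*pi**2 / 512**2

    def l2_error(u_predicted, t):
        """Discrete L2 error norm of equation (10) for the u velocity at time t."""
        err = u_predicted - u_exact(X, Y, t)
        return np.sqrt(np.sum(err**2) * dOmega)

    # Example usage with a fake "prediction": the exact field plus small noise.
    rng = np.random.default_rng(0)
    u_pred = u_exact(X, Y, 32.0) + 1.0e-3 * rng.standard_normal((n, n))
    print(l2_error(u_pred, 32.0))

The same function applies to v and p after swapping in the corresponding analytical expressions.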
Regardless of the magnitudes, the trends of the errors with respect to t are similar for both PINN and PetIBM. For PetIBM, the trend shown in figure 4 indicates that the temporal error is bounded and the scheme is stable. However, this concept does not apply to PINN, as it does not use any time-marching schemes. What this means for PINN is still unclear to us. Nevertheless, it shows that PINN is able to propagate the influence of the initial conditions to later times, which is a crucial factor for solving hyperbolic partial differential equations.

Figure 5 shows the computational cost of PINN and PetIBM in terms of the desired accuracy versus the required wall time. We only show the PINN results of 8 A100 GPUs in this figure. We believe this type of plot may help evaluate the computational cost in engineering applications. According to the figure, for example, achieving an accuracy of 10⁻³ at t = 2 requires less than 1 second for PetIBM with 1 K40 GPU and 6 CPU cores, but it requires more than 8 hours for PINN with at least 1 A100 GPU.

Fig. 5: L2 error norm versus wall time.

Table 1 lists the wall time per 1 thousand iterations and the scaling efficiency. As indicated previously, weak scaling was used in PINN, which follows most machine learning applications.

                          1 GPU    2 GPUs    4 GPUs    8 GPUs
    Time (sec/1k iters)   85.0     87.7      89.1      90.1
    Efficiency (%)        100      97        95        94

TABLE 1: Weak scaling performance of the PINN solver using NVIDIA A100-80GB GPUs.

3.4. Discussion

A note should be made regarding the results: we do not claim that these results represent the most optimized configuration of the PINN method, nor do we claim that the qualitative conclusions apply to all other hyperparameter configurations. These results merely reflect the outcomes of our computational experiments with the specific configuration described above. They should be deemed experimental data rather than a thorough analysis of the method's characteristics.

The Taylor-Green vortex serves as a good benchmark case because it reduces the number of required residual constraints: residuals r4 and r5 are excluded from r in equation (6). This means the optimizer can concentrate only on the residuals of the initial conditions and the Navier-Stokes equations.

Using more GPUs (and thus more training points, i.e., spatio-temporal points) did not speed up the convergence, which may indicate that the per-iteration number of points on a single GPU is already big enough. The number of training points mainly affects the mean gradients of the residual with respect to the model parameters, which are then used to update the parameters by gradient-descent-based optimizers. If the number of points is already big enough on a single GPU, then using more points or more GPUs is unlikely to change the mean gradients significantly, causing the convergence to rely solely on the learning rate.

The accuracy of the PINN solver was acceptable but not satisfactory, especially when considering how much time it took to achieve such accuracy. The low accuracy was, to some degree, not surprising. Recall the theory in section 2. The PINN method only seeks the minimal residual on the total residual's hyperplane. It does not try to find the zero root of the hyperplane and does not even care whether such a zero root exists. Furthermore, by using a gradient-descent-based optimizer, the resulting minimum is likely just a local minimum. It makes sense that it is hard for the residual to be close to zero, meaning it is hard to make the errors small.

Regarding the performance result in figure 5, we would like to avoid interpreting the result as one solver being better than the other. The proper conclusion drawn from the figure should be as follows: when using the PINN solver as a CFD simulator for a specific flow condition, PetIBM outperforms the PINN solver. As stated in section 1, the PINN method can solve flows under different flow parameters in one run—a capability that PetIBM does not have. The performance result in figure 5 only considers a limited application of the PINN solver.

One issue for this case study was how to fairly compare the PINN solver and PetIBM, especially when investigating the accuracy versus the workload/problem size, or the time-to-solution versus problem size.
Defining the problem size in PINN is not as straightforward as we thought. Let us start with degrees of freedom: in PINN, they are called the number of model parameters, and in traditional CFD solvers, the number of unknowns. The PINN solver and traditional CFD solvers are all trying to determine the free parameters in models (that is, approximate solutions). Hence, the number of degrees of freedom determines the problem sizes and workloads. However, in PINN, problem sizes and workloads do not depend solely on the degrees of freedom. The number of training points also plays a critical role in the workload. We were not sure if it made sense to define the problem size as the sum of the per-iteration number of training points and the number of model parameters. For example, 100 model parameters plus 100 training points is not equivalent to 150 model parameters plus 50 training points in terms of workload. So, without a proper definition of problem size and workload, it was not clear how to fairly compare PINN and traditional CFD methods.

Nevertheless, the gap between the performance of PINN and PetIBM is so large that it is hard to argue that using other metrics would change the conclusion, not to mention that the PINN solver ran on A100 GPUs, while PetIBM ran on a single K40 GPU in our lab, a product from 2013. This is also not a surprising conclusion because, as indicated in section 2, the use of automatic differentiation for temporal and spatial derivatives results in a huge computational graph. In addition, the PINN solver uses a gradient-descent-based method, which is a first-order method and limits the performance.

Weak scaling is a natural choice for the PINN solver when it comes to distributed computing. As we do not know a proper way to define the workload, simply copying all model parameters to all processes and using the same number of training points on all processes works well.

4. Case 2: 2D cylinder flows: harder than we thought

This case study shows what really made us frustrated: a 2D cylinder flow at Reynolds number Re = 200. We failed to even produce a solution that qualitatively captures the key physical phenomenon of this flow: vortex shedding.

4.1. Problem description

The computational domain is [−8, 25] × [−8, 8], and a cylinder with a radius of 0.5 sits at coordinate (0, 0). The velocity boundary conditions are (u, v) = (1, 0) along x = −8, y = −8, and y = 8. On the cylinder surface is the no-slip condition, i.e., (u, v) = (0, 0). At the outlet (x = 25), we enforced a pressure boundary condition p = 0. The initial condition is (u, v) = (0, 0). Note that this initial condition is different from most traditional CFD simulations. Conventionally, CFD simulations use (u, v) = (1, 0) for cylinder flows. A uniform initial condition of u = 1 does not satisfy the Navier-Stokes equations due to the no-slip boundary on the cylinder surface. Conventional CFD solvers are usually able to correct the solution during time-marching by propagating boundary effects into the domain through the numerical schemes' stencils. In our experience, using u = 1 or u = 0 did not matter for PINN because neither gave reasonable results. Nevertheless, the PINN solver's results shown in this section were obtained using a uniform u = 0 for the initial condition.

The density, ρ, is one, and the kinematic viscosity is ν = 0.005. These parameters correspond to Reynolds number Re = 200. Figure 6 shows the velocity and vorticity snapshots at t = 200. As shown in the figure, this type of flow displays a phenomenon called vortex shedding. Though vortex shedding makes the flow always unsteady, after a certain time the flow reaches a periodic stage, and the flow pattern repeats after a certain period.

The Navier-Stokes equations can be deemed a dynamical system. Under some flow conditions, instability appears in the flow and responds to small perturbations, causing the vortex shedding. In nature, vortex shedding comes from the uncertainty and perturbations existing everywhere. In CFD simulations, vortex shedding is caused by the small numerical and rounding errors in the calculations. Interested readers should consult reference [Wil].

Fig. 6: Demonstration of velocity and vorticity fields at t = 200 from a PetIBM simulation.

4.2. Solver and runtime configurations

For the PINN solver, we tested two networks. Both were fully-connected neural networks: one with 256 neurons per layer, the other with 512 neurons per layer. All other network configurations were the same as those in section 3, except that we allowed human intervention to manually adjust the learning rates during training. Our intention for this case study was to successfully obtain physical solutions from the PINN solver, rather than to conduct a performance and accuracy benchmark. Therefore, we would adjust the learning rate to accelerate the convergence or to escape from local minima. This decision was in line with common machine learning practice. We did not carry out hyperparameter optimization. These parameters were chosen because they work in Modulus' examples and in the Taylor-Green vortex experiment.

The PINN solver pre-generated 40,960,000 spatial-temporal points from the spatial domain [−8, 25] × [−8, 8] and temporal domain (0, 200] to evaluate the residuals of the Navier-Stokes equations, and used 40,960 points per iteration. The number of pre-generated points for the initial condition was 2,048,000, and the per-iteration number was 2,048. On each boundary, the numbers of pre-generated and per-iteration points were 8,192,000 and 8,192. Both cases used 8 A100 GPUs, which scaled these numbers up by a factor of 8. For example, during each iteration, a total of 327,680 points were actually used to evaluate the Navier-Stokes equations' residuals. Both cases ran up to 64 hours in wall time.
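Point sets of this kind can be pre-generated with a few lines of NumPy. The sketch below only illustrates the idea: the variable names and the sequential batch slicing are ours rather than the solver's actual implementation, and the treatment of points that fall inside the cylinder is not shown.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Pre-generate random spatio-temporal points for the Navier-Stokes residuals:
    # spatial domain [-8, 25] x [-8, 8], temporal domain (0, 200].
    n_total, n_per_iter = 40_960_000, 40_960       # reduce n_total for a quick test
    x = rng.uniform(-8.0, 25.0, n_total)
    y = rng.uniform(-8.0, 8.0, n_total)
    t = rng.uniform(0.0, 200.0, n_total)
    points = np.stack([x, y, t], axis=1).astype(np.float32)   # single precision

    def batches(points, n_per_iter):
        """Yield one pre-generated batch of points per training iteration."""
        for start in range(0, len(points), n_per_iter):
            yield points[start:start + n_per_iter]

    # Each batch would be fed to the network to evaluate the PDE residuals.
    first_batch = next(batches(points, n_per_iter))
    print(first_batch.shape)                                  # (40960, 3)

Initial-condition and boundary points can be generated the same way on their respective subdomains.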
One PetIBM simulation was carried out as a baseline. This simulation had a spatial resolution of 1485 × 720, and the time step size was 0.005. Figure 6 was rendered using this simulation. The hardware used was 1 K40 GPU plus 6 cores of an i7-5930K CPU. It took about 1.7 hours to finish.

The quantity of interest is the drag coefficient. We consider both the friction drag and the pressure drag in the coefficient calculation, as follows:

    C_D = 2 / (ρ U0² D) ∫_S ( ρν ∂(U·t)/∂n n_y − p n_x ) dS        (11)

Here, U0 = 1 is the inlet velocity, D is the cylinder diameter, n = [n_x, n_y]^T and t = [n_y, −n_x]^T are the normal and tangent vectors, respectively, and S represents the cylinder surface. The theoretical lift coefficient (C_L) for this flow is zero due to the symmetrical geometry.

4.3. Results

Note that, as stated in section 3.4, we deem the results experimental data under a specific experiment configuration. Hence, we do not claim that the results and qualitative conclusions will apply to other hyperparameter configurations.

Figure 7 shows the convergence history. The bumps in the history correspond to our manual adjustments of the learning rates. After 64 hours of training, the total loss had not converged to an obvious steady value. However, we decided not to continue the training because, as later results will show, it is our judgment call that the results would not be correct even if the training converged.

Fig. 7: Training history of the 2D cylinder flow at Re = 200.

Figure 8 provides a visualization of the predicted velocity and vorticity at t = 200, and figure 9 shows the drag and lift coefficients versus simulation time. From both figures, we could not see any sign of vortex shedding with the PINN solver.

Fig. 8: Velocity and vorticity at t = 200 from PINN.

Fig. 9: Drag and lift coefficients with respect to t.

We provide a comparison against the values reported by others in table 2. References [GS74] and [For80] calculate the drag coefficients using steady flow simulations, which were popular decades ago because of their inexpensive computational cost. The actual flow is not a steady flow, and these steady-flow coefficient values are lower than the unsteady-flow predictions. The drag coefficient from the PINN solver is closer to the steady-flow predictions.

                              Unsteady simulations      Steady simulations
    PetIBM      PINN       [DSY07]      [RKM09]       [GS74]      [For80]
    1.38        0.95       1.25         1.34          0.97        0.83

TABLE 2: Comparison of drag coefficients, C_D.

4.4. Discussion

While researchers may be interested in why the PINN solver behaves like a steady-flow solver, in this section we would like to focus more on the user experience and the usability of PINN in practice. Our viewpoints may be subjective, and hence we leave them here in the discussion.

Allow us to start this discussion with a hypothetical situation. If one asks why we chose such a spatial and temporal resolution for a conventional CFD simulation, we have mathematical or physical reasons to back our decision. However, if the person asks why we chose 6 hidden layers and 256 neurons per layer, we will not be able to justify it. "It worked in another case!" is probably the best answer we can offer. The situation also indicates that we have systematic approaches to improve a conventional simulation but can only improve PINN's results through computer experiments.

Most traditional numerical methods have rigorous analytical derivations and analyses. Each parameter used in a scheme has a meaning or a purpose in physical or numerical aspects. The simplest example is the spatial resolution in the finite difference method, which controls the truncation errors of the derivatives. Another is the choice of the limiters in finite volume methods, used to inhibit oscillations in the solutions.
So when a conventional CFD solver produces unsatisfying or even non-physical results, practitioners usually have systematic approaches to identify the cause or improve the outcome. Moreover, when necessary, practitioners know how to balance the computational cost and the accuracy, which is a critical point for using computer-aided engineering. Engineering is always concerned with costs and outcomes.

On the other hand, the PINN method lacks well-defined procedures to control the outcome. For example, we know the numbers of neurons and layers control the degrees of freedom in a model. With more degrees of freedom, a neural network model can approximate a more complicated phenomenon. However, when we feel that a neural network is not complicated enough to capture a physical phenomenon, what strategy should we use to adjust the neurons and layers? Should we increase neurons or layers first? By how much?

Moreover, when it comes to something non-numeric, it is even more challenging to know what to use and why. For instance, what activation function should we use, and why? Should we use the same activation everywhere? Not to mention that we are not even considering a different network architecture here.

Ultimately, are we even sure that increasing the network's complexity is the right path? Our assumption that the network is not complicated enough may just be wrong.

The following situation happened in this case study. Before we realized the PINN solver behaved like a steady-flow solver, we attributed the cause to model complexity. We faced the problem of how to increase the model complexity systematically. Theoretically, we could follow the practice of design of experiments (e.g., through grid search or Taguchi methods). However, given the computational cost and the number of hyperparameters/options of PINN, a proper design of experiments was not affordable for us. Furthermore, design of experiments requires the outcome to change with changes in the inputs. In our case, the vortex shedding remained absent regardless of how we changed the hyperparameters.

Let us move back to the flow problem to conclude this case study. The model complexity may not be the culprit here. Vortex shedding is the product of the dynamical system of the Navier-Stokes equations and the perturbations from numerical calculations (which implicitly mimic the perturbations in nature). Suppose the PINN solver's prediction was the steady-state solution to the flow. We may then need to introduce uncertainties and perturbations in the neural network or the training data, such as the perturbed initial condition described in [LD15]. As for why PINN predicts the steady-state solution, we cannot answer that currently.

5. Further discussion and conclusion

Because of the widely available deep learning libraries, such as PyTorch, and the ease of Python, implementing a PINN solver is relatively straightforward nowadays. This may be one reason why the PINN method suddenly became so popular in recent years. This paper does not intend to discourage people from trying the PINN method. Instead, we share our failures and frustration using PINN so that interested readers may know what immediate challenges should be resolved for PINN.

Our paper is limited to using the PINN solver as a replacement for traditional CFD solvers. However, as the first section indicates, PINN can do more than solve one specific flow under specific flow parameters. Moreover, PINN can also work with traditional CFD solvers. The literature shows researchers have shifted their attention to hybrid-mode applications. For example, in [JEA+20], the authors combined the concept of PINN and a traditional CFD solver to train a model that takes in low-resolution CFD simulation results and outputs high-resolution flow fields.

For people with a strong background in numerical methods or CFD, we would suggest trying to think outside the box. During our work, we realized our mindset and ideas were limited by what we were used to in CFD. An example is the initial conditions. We were used to having only one set of initial conditions when the temporal derivative in the differential equations is only first-order. However, in PINN, nothing limits us from using more than one initial condition. We can generate results at t = 0, 1, …, t_n using a traditional CFD solver and add the residuals corresponding to these time snapshots to the total residual, so the PINN method may perform better in predicting t > t_n. In other words, the PINN solver becomes the traditional CFD solvers' replacement only for t > t_n ([noa]).

As discussed in [THM+], solving partial differential equations with deep learning is still a work in progress. It may not work in many situations. Nevertheless, that does not mean we should stay away from PINN and discard the idea. Stepping away from a new thing gives it zero chance to evolve, and we will never know if PINN can be improved to a mature state that works well. Of course, overly promoting its bright side with only success stories does not help, either. Rather, we should honestly face all the troubles, difficulties, and challenges. Knowing the problem is the first step to solving it.

Acknowledgements

We appreciate the support from NVIDIA, through sponsored access to its high-performance computing cluster.

REFERENCES

[Chu22]     Pi-Yueh Chuang. barbagroup/scipy-2022-repro-pack: 20220530, May 2022. URL: https://doi.org/10.5281/zenodo.6592457, doi:10.5281/zenodo.6592457.
[CMKAB18]   Pi-Yueh Chuang, Olivier Mesnard, Anush Krishnan, and Lorena A. Barba. PetIBM: toolbox and applications of the immersed-boundary method on distributed-memory architectures. Journal of Open Source Software, 3(25):558, May 2018. URL: http://joss.theoj.org/papers/10.21105/joss.00558, doi:10.21105/joss.00558.
[CMW+]      Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed neural networks (PINNs) for fluid mechanics: a review. 37(12):1727–1738. URL: https://link.springer.com/10.1007/s10409-021-01148-1, doi:10.1007/s10409-021-01148-1.
[DPT]       M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. 10(3):195–201. URL: https://onlinelibrary.wiley.com/doi/10.1002/cnm.1640100303, doi:10.1002/cnm.1640100303.
[DSY07]     Jian Deng, Xue-Ming Shao, and Zhao-Sheng Yu. Hydrodynamic studies on two traveling wavy foils in tandem arrangement. Physics of Fluids, 19(11):113104, November 2007. URL: http://aip.scitation.org/doi/10.1063/1.2814259, doi:10.1063/1.2814259.
[DZ]        Yifan Du and Tamer A. Zaki. Evolutional deep neural network. 104(4):045303. URL: https://link.aps.org/doi/10.1103/PhysRevE.104.045303, doi:10.1103/PhysRevE.104.045303.
[For80]     Bengt Fornberg. A numerical study of steady viscous flow past a circular cylinder. Journal of Fluid Mechanics, 98(04):819, June 1980. URL: http://www.journals.cambridge.org/abstract_S0022112080000419, doi:10.1017/S0022112080000419.
[FS]        William E. Faller and Scott J. Schreck. Unsteady fluid mechanics applications of neural networks. 34(1):48–55. URL: http://arc.aiaa.org/doi/10.2514/2.2134, doi:10.2514/2.2134.
[GS74]      V.A. Gushchin and V.V. Shchennikov. A numerical method of solving the Navier-Stokes equations. USSR Computational Mathematics and Mathematical Physics, 14(2):242–250, January 1974. URL: https://linkinghub.elsevier.com/retrieve/pii/0041555374900615, doi:10.1016/0041-5553(74)90061-5.
[Hao]       Karen Hao. AI has cracked a key mathematical puzzle for understanding our world. URL: https://www.technologyreview.com/2020/10/30/1011435/ai-fourier-neural-network-cracks-navier-stokes-and-partial-differential-equations/.
[HG]        Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415. URL: https://arxiv.org/abs/1606.08415, doi:10.48550/ARXIV.1606.08415.
[Hor]       Kurt Hornik. Approximation capabilities of multilayer feedforward networks. 4(2):251–257. URL: https://linkinghub.elsevier.com/retrieve/pii/089360809190009T, doi:10.1016/0893-6080(91)90009-T.
[JEA+20]    Chiyu "Max" Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. MeshfreeFlowNet: A physics-constrained deep continuous space-time super-resolution framework. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2020. doi:10.1109/SC41405.2020.00013.
[KDYI]      Hasan Karali, Umut M. Demirezen, Mahmut A. Yukselen, and Gokhan Inalhan. A novel physics informed deep learning method for simulation-based modelling. In AIAA Scitech 2021 Forum. American Institute of Aeronautics and Astronautics. URL: https://arc.aiaa.org/doi/10.2514/6.2021-0177, doi:10.2514/6.2021-0177.
[LD15]      Mouna Laroussi and Mohamed Djebbi. Vortex Shedding for Flow Past Circular Cylinder: Effects of Initial Conditions. Universal Journal of Fluid Mechanics, 3:19–32, 2015.
[LLF]       I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. 9(5):987–1000. URL: http://ieeexplore.ieee.org/document/712178/, arXiv:physics/9705023, doi:10.1109/72.712178.
[LLQH]      Jianyu Li, Siwei Luo, Yingjian Qi, and Yaping Huang. Numerical solution of elliptic partial differential equation using radial basis function neural networks. 16(5):729–734. URL: https://linkinghub.elsevier.com/retrieve/pii/S0893608003000832, doi:10.1016/S0893-6080(03)00083-2.
[LMMK]      Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. DeepXDE: A deep learning library for solving differential equations. 63(1):208–228. URL: https://epubs.siam.org/doi/10.1137/19M1274067, doi:10.1137/19M1274067.
[LS]        Dennis J. Linse and Robert F. Stengel. Identification of aerodynamic coefficients using computational neural networks. 16(6):1018–1025. Publisher: Springer US, Boston, MA. URL: http://link.springer.com/10.1007/0-306-48610-5_9, doi:10.2514/3.21122.
[noa]       Modulus. URL: https://docs.nvidia.com/deeplearning/modulus/index.html.
[RKM09]     B.N. Rajani, A. Kandasamy, and Sekhar Majumdar. Numerical simulation of laminar flow past a circular cylinder. Applied Mathematical Modelling, 33(3):1228–1247, March 2009. URL: http://dx.doi.org/10.1016/j.apm.2008.01.017, doi:10.1016/j.apm.2008.01.017.
[RPK]       M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. 378:686–707. URL: https://linkinghub.elsevier.com/retrieve/pii/S0021999118307125, doi:10.1016/j.jcp.2018.10.045.
[SS]        Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. 375:1339–1364. URL: https://linkinghub.elsevier.com/retrieve/pii/S0021999118305527, doi:10.1016/j.jcp.2018.08.029.
[THM+]      Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, and Kiwon Um. Physics-based deep learning. arXiv:2109.05237. URL: http://arxiv.org/abs/2109.05237.
[Tre]       Lloyd N. Trefethen. Spectral Methods in MATLAB. Software, Environments, Tools. Society for Industrial and Applied Mathematics. URL: http://epubs.siam.org/doi/book/10.1137/1.9780898719598, doi:10.1137/1.9780898719598.
[Wil]       C. H. K. Williamson. Vortex dynamics in the cylinder wake. 28(1):477–539. URL: http://www.annualreviews.org/doi/10.1146/annurev.fl.28.010196.002401, doi:10.1146/annurev.fl.28.010196.002401.
[WTP]       Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. 43(5):A3055–A3081. URL: https://epubs.siam.org/doi/10.1137/20M1318043, doi:10.1137/20M1318043.
[WYP]       Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. 449:110768. URL: https://linkinghub.elsevier.com/retrieve/pii/S002199912100663X, doi:10.1016/j.jcp.2021.110768.

atoMEC: An open-source average-atom Python code

Timothy J. Callow‡§∗, Daniel Kotik‡§, Eli Kraisler¶, Attila Cangi‡§

∗ Corresponding author: t.callow@hzdr.de
‡ Center for Advanced Systems Understanding (CASUS), D-02826 Görlitz, Germany
§ Helmholtz-Zentrum Dresden-Rossendorf, D-01328 Dresden, Germany
¶ Fritz Haber Center for Molecular Dynamics and Institute of Chemistry, The Hebrew University of Jerusalem, 9091401 Jerusalem, Israel

Copyright © 2022 Timothy J. Callow et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Average-atom models are an important tool in studying matter under extreme conditions, such as those conditions experienced in planetary cores, brown and white dwarfs, and during inertial confinement fusion. In the right context, average-atom models can yield results with similar accuracy to simulations which require orders of magnitude more computing time, and thus can greatly reduce financial and environmental costs. Unfortunately, due to the wide range of possible models and approximations, and the lack of open-source codes, average-atom models can at times appear inaccessible. In this paper, we present our open-source average-atom code, atoMEC. We explain the aims and structure of atoMEC to illuminate the different stages and options in an average-atom calculation, and to facilitate community contributions. We also discuss the use of various open-source Python packages in atoMEC, which have expedited its development.

Index Terms—computational physics, plasma physics, atomic physics, materials science

Introduction

The study of matter under extreme conditions — materials exposed to high temperatures, high pressures, or strong electromagnetic fields — is critical to our understanding of many important scientific and technological processes, such as nuclear fusion and various astrophysical and planetary physics phenomena [GFG+16]. Of particular interest within this broad field is the warm dense matter (WDM) regime, which is typically characterized by temperatures in the range of 10³ − 10⁶ degrees (Kelvin), and densities ranging from dense gases to highly compressed solids (∼0.01 − 1000 g cm⁻³) [BDM+20]. In this regime, it is important to account for the quantum mechanical nature of the electrons (and in some cases, also the nuclei). Therefore conventional methods from plasma physics, which either neglect quantum effects or treat them coarsely, are usually not sufficiently accurate. On the other hand, methods from condensed-matter physics and quantum chemistry, which account fully for quantum interactions, typically target the ground state only, and become computationally intractable for systems at high temperatures.

Nevertheless, there are methods which can, in principle, be applied to study materials at any given temperature and density whilst formally accounting for quantum interactions. These methods are often denoted as "first-principles" because, formally speaking, they yield the exact properties of the system, under certain well-founded theoretical approximations. Density-functional theory (DFT), initially developed as a ground-state theory [HK64], [KS65] but later extended to non-zero temperatures [Mer65], [PPF+11], is one such theory and has been used extensively to study materials under WDM conditions [GDRT14]. Even though DFT reformulates the Schrödinger equation in a computationally efficient manner [Koh99], the cost of running calculations becomes prohibitively expensive at higher temperatures. Formally, it scales as O(N³τ³), with N the particle number (which usually also increases with temperature) and τ the temperature [CRNB18]. This poses a serious computational challenge in the WDM regime. Furthermore, although DFT is a formally exact theory, in practice it relies on approximations for the so-called "exchange-correlation" energy, which is, roughly speaking, responsible for simulating all the quantum interactions between electrons. Existing exchange-correlation approximations have not been rigorously tested under WDM conditions. An alternative method used in the WDM community is path-integral Monte Carlo [DGB18], which yields essentially exact properties; however, it is even more limited by computational cost than DFT, and becomes unfeasibly expensive at lower temperatures due to the fermion sign problem.

It is therefore of great interest to reduce the computational complexity of the aforementioned methods. The use of graphics processing units in DFT calculations is becoming increasingly common, and has been shown to offer significant speed-ups relative to conventional calculations using central processing units [MED11], [JFC+13]. Some other examples of promising developments to reduce the cost of DFT calculations include machine-learning-based solutions [SRH+12], [BVL+17], [EFP+21] and stochastic DFT [CRNB18], [BNR13]. However, in this paper, we focus on an alternative class of models known as "average-atom" models. Average-atom models have a long history in plasma physics [CHKC22]: they account for quantum effects, typically using DFT, but reduce the complex system of interacting electrons and nuclei to a single atom immersed in a plasma (the "average" atom). An illustration of this principle (reduced to two dimensions for visual purposes) is shown in Fig. 1. This significantly reduces the cost relative to a full DFT simulation, because the particle number is restricted to the number of electrons per nucleus, and spherical symmetry is exploited to reduce the three-dimensional problem to one dimension.

Naturally, to reduce the complexity of the problem as described, various approximations must be introduced. It is important to understand these approximations and their limitations for average-atom models to have genuine predictive capabilities. Unfortunately, this is not always the case: although average-atom
models share common concepts, there is no unique formal theory underpinning them. Therefore a variety of models and codes exist, and it is not typically clear which models can be expected to perform most accurately under which conditions. In a previous paper [CHKC22], we addressed this issue by deriving an average-atom model from first principles, and comparing the impact of different approximations within this model on some common properties.

Fig. 1: Illustration of the average-atom concept. The many-body and fully-interacting system of electron density (shaded blue) and nuclei (red points) on the left is mapped into the much simpler system of independent atoms on the right. Any of these identical atoms represents the "average-atom". The effects of interaction from neighboring atoms are implicitly accounted for in an approximate manner through the choice of boundary conditions.

In this paper, we focus on computational aspects of average-atom models for WDM. We introduce atoMEC [CKTS+21]: an open-source average-atom code for studying Matter under Extreme Conditions. One of the main aims of atoMEC is to improve the accessibility and understanding of average-atom models. To the best of our knowledge, open-source average-atom codes are in scarce supply: with atoMEC, we aim to provide a tool that people can use to run average-atom simulations and also to add

Theoretical background

Properties of interest in the warm dense matter regime include the equation-of-state data, which is the relation between the density, energy, temperature and pressure of a material [HRD08]; the mean ionization state and the electron ionization energies, which tell us about how tightly bound the electrons are to the nuclei; and the electrical and thermal conductivities. These properties yield information pertinent to our understanding of stellar and planetary physics, the Earth's core, inertial confinement fusion, and more besides. To exactly obtain these properties, one needs (in theory) to determine the thermodynamic ensemble of the quantum states (the so-called wave-functions) representing the electrons and nuclei. Fortunately, they can be obtained with reasonable accuracy using models such as average-atom models; in this section, we elaborate on how this is done.

We shall briefly review the key theory underpinning the type of average-atom model implemented in atoMEC. This is intended for readers without a background in quantum mechanics, to give some context to the purposes and mechanisms of the code. For a comprehensive derivation of this average-atom model, we direct readers to Ref. [CHKC22]. The average-atom model we shall describe falls into a class of models known as ion-sphere models, which are the simplest (and still most widely used) class of average-atom model. There are alternative (more advanced) classes of model such as ion-correlation [Roz91] and neutral pseudo-atom models [SS14] which we have not yet implemented in atoMEC, and thus we do not elaborate on them here.

As demonstrated in Fig. 1, the idea of the ion-sphere model is to map a fully-interacting system of many electrons and nuclei into a set of independent atoms which do not interact explicitly with any of the other spheres. Naturally, this depends on several assumptions and approximations, but there is formal justification for such a mapping [CHKC22]. Furthermore, there are many examples in which average-atom models have shown good agreement with more accurate simulations and experimental data [FB19], which further justifies this mapping.

Although the average-atom picture is significantly simplified
their own models, which should facilitate comparisons of different     relative to the full many-body problem, even determining the
approximations. The relative simplicity of average-atom codes          wave-functions and their ensemble weights for an atom at finite
means that they are not only efficient to run, but also efficient      temperature is a complex problem. Fortunately, DFT reduces this
to develop: this means, for example, that they can be used as a        complexity further, by establishing that the electron density — a
test-bed for new ideas that could be later implemented in full DFT     far less complex entity than the wave-functions — is sufficient to
codes, and are also accessible to those without extensive prior        determine all physical observables. The most popular formulation
expertise, such as students. atoMEC aims to facilitate development     of DFT, known as Kohn–Sham DFT (KS-DFT) [KS65], allows us
by following good practice in software engineering (for example        to construct the fully-interacting density from a non-interacting
extensive documentation), a careful design structure, and of course    system of electrons, simplifying the problem further still. Due to
through the choice of Python and its widely used scientific stack,     the spherical symmetry of the atom, the non-interacting electrons
in particular the NumPy [HMvdW+ 20] and SciPy [VGO+ 20]                — known as KS electrons (or KS orbitals) — can be represented
libraries.                                                             as a wave-function that is a product of radial and angular compo-
                                                                       nents,
    This paper is structured as follows: in the next section, we
briefly review the key theoretical points which are important                               φnlm (r) = Xnl (r)Ylm (θ , φ ) ,            (1)
to understand the functionality of atoMEC, assuming no prior
                                                                       where n, l, and m are the quantum numbers of the orbitals, which
physical knowledge of the reader. Following that, we present
                                                                       come from the fact that the wave-function is an eigenfunction of
the key functionality of atoMEC, discuss the code structure
                                                                       the Hamiltonian operator, and Ylm (θ , φ ) are the spherical harmonic
and algorithms, and explain how these relate to the theoretical
aspects introduced. Finally, we present an example case study:         functions.1 The radial coordinate r represents the absolute distance
we consider helium under the conditions often experienced in           from the nucleus.
the outer layers of a white dwarf star, and probe the behavior
                                                                         1. Please note that the notation in Eq. (1) does not imply Einstein sum-
of a few important properties, namely the band-gap, pressure, and      mation notation. All summations in this paper are written explicitly; Einstein
ionization degree.                                                     summation notation is not used.
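To make the product form of Eq. (1) concrete for readers who have not worked with spherical harmonics, the short sketch below evaluates such a wave-function at a single point. It is purely illustrative: the radial part X_10 is an invented hydrogen-like placeholder (not an atoMEC orbital), the numerical values are arbitrary, and only NumPy and scipy.special.sph_harm are assumed.

import numpy as np
from scipy.special import sph_harm

# Illustration of Eq. (1): phi_nlm(r) = X_nl(r) * Y_lm(theta, phi).
# The radial part is a hydrogen-like 1s function used only as a stand-in.
def X_10(r):
    return 2.0 * np.exp(-r)

l, m = 0, 0
r = 1.5        # distance from the nucleus
theta = 0.3    # polar angle
phi = 1.2      # azimuthal angle

# note: scipy's sph_harm takes its arguments in the order (m, l, azimuthal, polar)
Y_lm = sph_harm(m, l, phi, theta)
phi_nlm = X_10(r) * Y_lm
print(phi_nlm)  # complex value of the wave-function at this point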
We therefore only need to determine the radial KS orbitals X_nl(r). These are determined by solving the radial KS equation, which is similar to the Schrödinger equation for a non-interacting system, with an additional term in the potential to mimic the effects of electron-electron interaction (within the single atom). The radial KS equation is given by:

    \left[ -\frac{1}{2} \left( \frac{d^2}{dr^2} + \frac{2}{r}\frac{d}{dr} - \frac{l(l+1)}{r^2} \right) + v_s[n](r) \right] X_{nl}(r) = \epsilon_{nl} X_{nl}(r) .    (2)

We have written the above equation in a way that emphasizes that it is an eigenvalue equation, with the eigenvalues ε_nl being the energies of the KS orbitals.
    On the left-hand side, the terms in the round brackets come from the kinetic energy operator acting on the orbitals. The v_s[n](r) term is the KS potential, which itself is composed of three different terms,

    v_s[n](r) = -\frac{Z}{r} + 4\pi \int_0^{R_{WS}} dx \, \frac{n(x) x^2}{\max(r,x)} + \frac{\delta F_{xc}[n]}{\delta n(r)} ,    (3)

where R_WS is the radius of the atomic sphere, n(r) is the electron density, Z the nuclear charge, and F_xc[n] the exchange-correlation free energy functional. Thus the three terms in the potential are respectively the electron-nuclear attraction, the classical Hartree repulsion, and the exchange-correlation (xc) potential.
    We note that the KS potential and its constituents are functionals of the electron density n(r). Were it not for this dependence on the density, solving Eq. (2) would just amount to solving an ordinary linear differential equation (ODE). However, the electron density is in fact constructed from the orbitals in the following way,

    n(r) = 2 \sum_{nl} (2l+1) f_{nl}(\epsilon_{nl}, \mu, \tau) \, |X_{nl}(r)|^2 ,    (4)

where f_nl(ε_nl, µ, τ) is the Fermi–Dirac distribution, given by

    f_{nl}(\epsilon_{nl}, \mu, \tau) = \frac{1}{1 + e^{(\epsilon_{nl} - \mu)/\tau}} ,    (5)

where τ is the temperature, and µ is the chemical potential, which is determined by fixing the number of electrons to be equal to a pre-determined value N_e (typically equal to the nuclear charge Z). The Fermi–Dirac distribution therefore assigns weights to the KS orbitals in the construction of the density, with the weight depending on their energy.
    Therefore, the KS potential that determines the KS orbitals via the ODE (2) is itself dependent on the KS orbitals. Consequently, the KS orbitals and their dependent quantities (the density and KS potential) must be determined via a so-called self-consistent field (SCF) procedure. An initial guess for the orbitals, X^0_nl(r), is used to construct the initial density n^0(r) and potential v_s^0(r). The ODE (2) is then solved to update the orbitals. This process is iterated until some appropriately chosen quantities — in atoMEC the total free energy, density and KS potential — are converged, i.e. n^{i+1}(r) = n^i(r), v_s^{i+1}(r) = v_s^i(r), F^{i+1} = F^i, within some reasonable numerical tolerance. In Fig. 2, we illustrate the life-cycle of the average-atom model described so far, including the SCF procedure. On the left-hand side of this figure, we show the physical choices and mathematical operations, and on the right-hand side, the representative classes and functions in atoMEC. In the following section, we shall discuss some aspects of this figure in more detail.
    Some quantities obtained from the completion of the SCF procedure are directly of interest. For example, the energy eigenvalues ε_nl are related to the electron ionization energies, i.e. the amount of energy required to excite an electron bound to the nucleus to being a free (conducting) electron. These predicted ionization energies can be used, for example, to help understand ionization potential depression, an important but somewhat controversial effect in WDM [STJ+14]. Another property that can be straightforwardly obtained from the energy levels and their occupation numbers is the mean ionization state Z̄,²

    \bar{Z} = \sum_{n,l} (2l+1) f_{nl}(\epsilon_{nl}, \mu, \tau) ,    (6)

which is an important input parameter for various models, such as adiabats which are used to model inertial confinement fusion [KDF+11].

    2. The summation in Eq. (6) is often shown as an integral because the energies above a certain threshold form a continuous distribution (in most models).

    Various other interesting properties can also be calculated following some post-processing of the output of an SCF calculation, for example the pressure exerted by the electrons and ions. Furthermore, response properties, i.e. those resulting from an external perturbation like a laser pulse, can also be obtained from the output of an SCF cycle. These properties include, for example, electrical conductivities [Sta16] and dynamical structure factors [SPS+14].

Code structure and details

In the following sections, we describe the structure of the code in relation to the physical problem being modeled. Average-atom models typically rely on various parameters and approximations. In atoMEC, we have tried to structure the code in a way that makes clear which parameters come from the physical problem studied compared to choices of the model and numerical or algorithmic choices.

atoMEC.Atom: Physical parameters

The first step of any simulation in WDM (which also applies to simulations in science more generally) is to define the physical parameters of the problem. These parameters are unique in the sense that, if we had an exact method to simulate the real system, then for each combination of these parameters there would be a unique solution. In other words, regardless of the model — be it average atom or a different technique — these parameters are always required and are independent of the model.
    In average-atom models, there are typically three parameters defining the physical problem, which are:

    • the atomic species;
    • the temperature of the material, τ;
    • the mass density of the material, ρ_m.

    The mass density also directly corresponds to the mean distance between two nuclei (atomic centers), which in the average-atom model is equal to twice the radius of the atomic sphere, R_WS. An additional physical parameter not mentioned above is the net charge of the material being considered, i.e. the difference between the nuclear charge Z and the electron number N_e. However, we usually assume zero net charge in average-atom simulations (i.e. the number of electrons is equal to the atomic charge).
    In atoMEC, these physical parameters are controlled by the Atom object. As an example, we consider aluminum under ambient conditions, i.e. at room temperature, τ = 300 K, and normal metallic density, ρ_m = 2.7 g cm⁻³. We set this up as:
Fig. 2: Schematic of the average-atom model set-up and the self-consistent field (SCF) cycle. On the left-hand side, the physical choices and
mathematical operations that define the model and SCF cycle are shown. On the right-hand side, the (higher-order) functions and classes in
atoMEC corresponding to the items on the left-hand side are shown. Some liberties are taken with the code snippets in the right-hand column
of the figure to improve readability; more precisely, some non-crucial intermediate steps are not shown, and some parameters are also not
shown or simplified. The dotted lines represent operations that are taken care of within the models.CalcEnergy function, but are shown
nevertheless to improve understanding.
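To make the cycle sketched in Fig. 2 more concrete, the following toy example runs a schematic SCF loop on a small one-dimensional grid: it repeatedly diagonalizes a density-dependent Hamiltonian, fills the resulting orbitals with Fermi–Dirac weights as in Eq. (5), and linearly mixes successive iterations (cf. the mixing scheme of Eq. (17) below, applied here to the density rather than the potential). Everything in this sketch is invented for illustration; the names, the model "interaction" term and the fixed chemical potential are not part of atoMEC, where µ is instead adjusted to fix the electron number.

import numpy as np

# Schematic SCF cycle in the spirit of Fig. 2 (toy model, not atoMEC code).
npts = 200
x = np.linspace(0.05, 10.0, npts)
dx = x[1] - x[0]
tau = 0.1      # temperature
mu = -0.2      # chemical potential (held fixed in this toy example)
alpha = 0.5    # linear mixing fraction

def fermi_dirac(eps):
    # occupation factors, cf. Eq. (5); argument clipped to avoid overflow warnings
    arg = np.clip((eps - mu) / tau, -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(arg))

def hamiltonian(density):
    # kinetic term -1/2 d^2/dx^2 by finite differences, plus a local potential
    lap = (np.diag(np.ones(npts - 1), -1) - 2.0 * np.eye(npts)
           + np.diag(np.ones(npts - 1), 1)) / dx**2
    v_ext = -1.0 / x           # attractive "nuclear" potential
    v_int = 0.2 * density      # crude stand-in for the density-dependent terms
    return -0.5 * lap + np.diag(v_ext + v_int)

density = np.exp(-x)           # initial guess for the density
for it in range(100):
    eps, orbs = np.linalg.eigh(hamiltonian(density))
    occ = fermi_dirac(eps)
    density_new = (occ * orbs**2).sum(axis=1) / dx
    # relative change of the density, cf. Eq. (15)
    delta_n = np.sum(np.abs(density_new - density)) / np.sum(density)
    # linear mixing of old and new iterations
    density = alpha * density_new + (1.0 - alpha) * density
    if delta_n < 1e-6:
        print(f"SCF converged after {it + 1} iterations")
        break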
Fig. 3: Auto-generated print statement from calling the atoMEC.Atom object.

from atoMEC import Atom
Al = Atom("Al", 300, density=2.7, units_temp="K")

By default, the above code automatically prints the output seen in Fig. 3. We see that the first two arguments of the Atom object are the chemical symbol of the element being studied, and the temperature. In addition, at least one of "density" or "radius" must be specified. In atoMEC, the default (and only permitted) units for the mass density are g cm⁻³; all other input and output units in atoMEC are by default Hartree atomic units, and hence we specify "K" for Kelvin.
    The information in Fig. 3 displays the chosen parameters in units commonly used in the plasma and condensed-matter physics communities, as well as some other information directly obtained from these parameters. The chemical symbol ("Al" in this case) is passed to the mendeleev library [men14] to generate this data, which is used later in the calculation.
    This initial stage of the average-atom calculation, i.e. the specification of physical parameters and initialization of the Atom object, is shown in the top row of Fig. 2.

atoMEC.models: Model parameters

After the physical parameters are set, the next stage of the average-atom calculation is to choose the model and approximations within that class of model. As discussed, so far the only class of model implemented in atoMEC is the ion-sphere model. Within this model, there are still various choices to be made by the user. In some cases, these choices make little difference to the results, but in other cases they have significant impact. The user might have some physical intuition as to which is most important, or alternatively may want to run the same physical parameters with several different model parameters to examine the effects. Some choices available in atoMEC, listed approximately in decreasing order of impact (but this can depend strongly on the system under consideration), are:

    • the boundary conditions used to solve the KS equations;
    • the treatment of the unbound electrons, which means those electrons not tightly bound to the nucleus, but rather delocalized over the whole atomic sphere;
    • the choice of exchange and correlation functionals, the central approximations of DFT [CMSY12];
    • the spin polarization and magnetization.

    We do not discuss the theory and impact of these different choices in this paper. Rather, we direct readers to Refs. [CHKC22] and [CKC22] in which all of these choices are discussed.
    In atoMEC, the ion-sphere model is controlled by the models.ISModel object. Continuing with our aluminum example, we choose the so-called "neumann" boundary condition, with a "quantum" treatment of the unbound electrons, and choose the LDA exchange functional (which is also the default). This model is set up as:

from atoMEC import models
model = models.ISModel(Al, bc="neumann",
    xfunc_id="lda_x", unbound="quantum")

Fig. 4: Auto-generated print statement from calling the models.ISModel object.

By default, the above code prints the output shown in Fig. 4. The first (and only mandatory) input parameter to the models.ISModel object is the Atom object that we generated earlier. Together with the optional spinpol and spinmag parameters in the models.ISModel object, this sets either the total number of electrons (spinpol=False) or the number of electrons in each spin channel (spinpol=True).
    The remaining information displayed in Fig. 4 shows directly the chosen model parameters, or the default values where these parameters are not specified. The exchange and correlation functionals - set by the parameters xfunc_id and cfunc_id - are passed to the LIBXC library [LSOM18] for processing. So far, only the "local density" family of approximations is available in atoMEC, and thus the default values are usually a sensible choice. For more information on exchange and correlation functionals, there are many reviews in the literature, for example Ref. [CMSY12].
    This stage of the average-atom calculation, i.e. the specification of the model and the choices of approximation within that, is shown in the second row of Fig. 2.

ISModel.CalcEnergy: SCF calculation and numerical parameters

Once the physical parameters and model have been defined, the next stage in the average-atom calculation (or indeed any DFT calculation) is the SCF procedure. In atoMEC, this is invoked by the ISModel.CalcEnergy function. This function is called CalcEnergy because it finds the KS orbitals (and associated KS density) which minimize the total free energy.
    Clearly, there are various mathematical and algorithmic choices in this calculation. These include, for example: the basis in which the KS orbitals and potential are represented, the algorithm used to solve the KS equations (2), and how to ensure smooth convergence of the SCF cycle. In atoMEC, the SCF procedure currently follows a single pre-determined algorithm, which we briefly review below.
    In atoMEC, we represent the radial KS quantities (orbitals, density and potential) on a logarithmic grid, i.e. x = log(r). Furthermore, we make a transformation of the orbitals P_nl(x) = X_nl(x) e^{x/2}. Then the equations to be solved become:

    \frac{d^2 P_{nl}(x)}{dx^2} - 2 e^{2x} \left( W(x) - \epsilon_{nl} \right) P_{nl}(x) = 0 ,    (7)

    W(x) = v_s[n](x) + \frac{1}{2} \left( l + \frac{1}{2} \right)^2 e^{-2x} .    (8)
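As a quick, reproducible sanity check of the transformation leading to Eqs. (7) and (8), the snippet below inserts the analytic hydrogen-like 1s orbital for a bare Coulomb potential (X(r) = 2e^{-r}, ε = -1/2, l = 0) into Eq. (7) on a logarithmic grid and verifies that the residual is at the level of the finite-difference error. The grid bounds and all names are arbitrary choices for this illustration, and only NumPy is assumed.

import numpy as np

# Check Eq. (7) for the hydrogen-like 1s orbital of a bare Coulomb potential.
x = np.linspace(-6.0, 3.0, 4001)    # logarithmic grid, x = log(r)
dx = x[1] - x[0]
r = np.exp(x)
eps = -0.5                          # exact 1s eigenvalue (Hartree units)
l = 0

X = 2.0 * np.exp(-r)                # radial orbital X_10(r)
P = X * np.exp(x / 2.0)             # transformed orbital, P = X e^{x/2}
W = -1.0 / r + 0.5 * (l + 0.5) ** 2 * np.exp(-2.0 * x)   # Eq. (8) with v_s = -1/r

# second derivative by central differences on the interior points
d2P = (P[2:] - 2.0 * P[1:-1] + P[:-2]) / dx**2
residual = d2P - 2.0 * np.exp(2.0 * x[1:-1]) * (W[1:-1] - eps) * P[1:-1]
print(np.max(np.abs(residual)))     # small; limited only by the finite-difference error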
In atoMEC, we solve the KS equations using a matrix implementation of Numerov's algorithm [PGW12]. This means we diagonalize the following equation:

    \hat{H} \vec{P} = \vec{\epsilon} \, \hat{B} \vec{P} , where    (9)
    \hat{H} = \hat{T} + \hat{B} \hat{W}_s(\vec{x}) ,    (10)
    \hat{T} = -\frac{1}{2} e^{-2\vec{x}} \hat{A} ,    (11)
    \hat{A} = \frac{\hat{I}_{-1} - 2\hat{I}_0 + \hat{I}_1}{dx^2} , and    (12)
    \hat{B} = \frac{\hat{I}_{-1} + 10\hat{I}_0 + \hat{I}_1}{12} .    (13)

In the above, \hat{I}_{-1/0/1} are lower shift, identity, and upper shift matrices.
    The Hamiltonian matrix \hat{H} is sparse and we only seek a subset of eigenstates with lower energies: therefore there is no need to perform a full diagonalization, which scales as O(N³), with N being the size of the radial grid. Instead, we use SciPy's sparse matrix diagonalization function scipy.sparse.linalg.eigs, which scales more efficiently and allows us to go to larger grid sizes.
    After each step in the SCF cycle, the relative changes in the free energy F, density n(r) and potential v_s(r) are computed. Specifically, the quantities computed are

    \Delta F = \frac{F^i - F^{i-1}}{F^i} ,    (14)
    \Delta n = \frac{\int dr \, |n^i(r) - n^{i-1}(r)|}{\int dr \, n^i(r)} ,    (15)
    \Delta v = \frac{\int dr \, |v_s^i(r) - v_s^{i-1}(r)|}{\int dr \, v_s^i(r)} .    (16)

Once all three of these metrics fall below a certain threshold, the SCF cycle is considered converged and the calculation finishes.
    The SCF cycle is an example of a non-linear system and thus is prone to chaotic (non-convergent) behavior. Consequently a range of techniques have been developed to ensure convergence [SM91]. Fortunately, the tendency for calculations not to converge becomes less likely for temperatures above zero (and especially as temperatures increase). Therefore we have implemented only a simple linear mixing scheme in atoMEC. The potential used in each diagonalization step of the SCF cycle is not simply the one generated from the most recent density, but a mix of that potential and the previous one,

    v_s^{(i)}(r) = \alpha v_s^i(r) + (1 - \alpha) v_s^{i-1}(r) .    (17)

In general, a lower value of the mixing fraction α makes the SCF cycle more stable, but requires more iterations to converge. Typically a choice of α ≈ 0.5 gives a reasonable balance between speed and stability.
    We can thus summarize the key parameters in an SCF calculation as follows:

    • the maximum number of eigenstates to compute, in terms of both the principal and angular quantum numbers;
    • the numerical grid parameters, in particular the grid size;
    • the convergence tolerances, Eqs. (14) to (16);
    • the SCF parameters, i.e. the mixing fraction and the maximum number of iterations.

    The first three items in this list essentially control the accuracy of the calculation. In principle, for each SCF calculation — i.e. a unique set of physical and model inputs — these parameters should be independently varied until some property (such as the total free energy) is considered suitably converged with respect to that parameter. Changing the SCF parameters should not affect the final results (within the convergence tolerances), only the number of iterations in the SCF cycle.
    Let us now consider an example SCF calculation, using the Atom and model objects we have already defined:

from atoMEC import config
config.numcores = -1 # parallelize

nmax = 3 # max value of principal quantum number
lmax = 3 # max value of angular quantum number

# run SCF calculation
scf_out = model.CalcEnergy(
    nmax,
    lmax,
    grid_params={"ngrid": 1500},
    scf_params={"mixfrac": 0.7},
)

We see that the first two parameters passed to the CalcEnergy function are the nmax and lmax quantum numbers, which specify the number of eigenstates to compute. Precisely speaking, there is a unique Hamiltonian for each value of the angular quantum number l (and in a spin-polarized calculation, also for each spin quantum number). The sparse diagonalization routine then computes the first nmax eigenvalues for each Hamiltonian. In atoMEC, these diagonalizations can be run in parallel since they are independent for each value of l. This is done by setting the config.numcores variable to the number of cores desired (config.numcores=-1 uses all the available cores) and handled via the joblib library [Job20].
    The remaining parameters passed to the CalcEnergy function are optional; in the above, we have specified a grid size of 1500 points and a mixing fraction α = 0.7. The above code automatically prints the output seen in Fig. 5. This output shows the SCF cycle and, upon completion, the breakdown of the total free energy into its various components, as well as other useful information such as the KS energy levels and their occupations.
    Additionally, the output of the SCF function is a dictionary containing the staticKS.Orbitals, staticKS.Density, staticKS.Potential and staticKS.Density objects. For example, one could extract the eigenfunctions as follows:

orbs = scf_out["orbitals"] # orbs object
ks_eigfuncs = orbs.eigfuncs # eigenfunctions

The initialization of the SCF procedure is shown in the third and fourth rows of Fig. 2, with the SCF procedure itself shown in the remaining rows.
    This completes the section on the code structure and algorithmic details. As discussed, with the output of an SCF calculation, there are various kinds of post-processing one can perform to obtain other properties of interest. So far in atoMEC, these are limited to the computation of the pressure (ISModel.CalcPressure), the electron localization function (atoMEC.postprocess.ELFTools) and the Kubo–Greenwood conductivity (atoMEC.postprocess.conductivity). We refer readers to our pre-print [CKC22] for details on how the electron localization function and the Kubo–Greenwood conductivity can be used to improve predictions of the mean ionization state.
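To illustrate how a generalized eigenvalue problem of the form of Eqs. (9)-(13) can be assembled and solved with SciPy's sparse tools, the sketch below builds the shift matrices, forms \hat{H} and \hat{B}, and asks scipy.sparse.linalg.eigs for a few low-lying states. It is not atoMEC's implementation: the potential is a placeholder harmonic well rather than the self-consistent KS potential, and the grid bounds and variable names are chosen only for illustration.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigs

N = 1000
x = np.linspace(-8.0, 3.0, N)   # logarithmic grid, x = log(r)
dx = x[1] - x[0]
r = np.exp(x)
l = 0                            # angular quantum number

I0 = sparse.eye(N)               # identity
Im = sparse.eye(N, k=-1)         # lower shift matrix
Ip = sparse.eye(N, k=1)          # upper shift matrix

A = (Im - 2.0 * I0 + Ip) / dx**2         # Eq. (12)
B = (Im + 10.0 * I0 + Ip) / 12.0         # Eq. (13)

v_s = 0.5 * r**2                                       # placeholder potential
W = v_s + 0.5 * (l + 0.5) ** 2 * np.exp(-2.0 * x)      # Eq. (8)
T = sparse.diags(-0.5 * np.exp(-2.0 * x), 0) @ A       # Eq. (11)
H = T + B @ sparse.diags(W, 0)                         # Eq. (10)

# generalized eigenvalue problem H P = eps B P, Eq. (9); shift-invert around
# zero returns the eigenvalues closest to zero, i.e. the lowest-lying states
eps_vals, P = eigs(H, k=5, M=B, sigma=0.0)
print(np.sort(eps_vals.real))    # close to 1.5, 3.5, 5.5, ... for this well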
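The per-l parallelism described above can be mimicked with joblib as in the following sketch. Here solve_channel is an invented placeholder standing in for "build the Hamiltonian for angular momentum l and diagonalize it"; only joblib and NumPy are assumed.

import numpy as np
from joblib import Parallel, delayed

def solve_channel(l, nmax=3):
    # placeholder: return fake "eigenvalues" for channel l
    rng = np.random.default_rng(l)
    return np.sort(rng.normal(size=nmax))

lmax = 3
# n_jobs=-1 uses all available cores, mirroring config.numcores = -1
eigenvalues = Parallel(n_jobs=-1)(
    delayed(solve_channel)(l) for l in range(lmax + 1)
)
print(eigenvalues)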
Fig. 5: Auto-generated print statement from calling the ISModel.CalcEnergy function.

Case-study: Helium

In this section, we consider an application of atoMEC in the WDM regime. Helium is the second most abundant element in the universe (after hydrogen) and therefore understanding its behavior under a wide range of conditions is important for our understanding of many astrophysical processes. Of particular interest are the conditions under which helium is expected to undergo a transition from insulating to metallic behavior in the outer layers of white dwarfs, which are characterized by densities of around 1 − 20 g cm⁻³ and temperatures of 10 − 50 kK [PR20]. These conditions are a typical example of the WDM regime. Besides predicting the point at which the insulator-to-metallic transition occurs in the density-temperature spectrum, other properties of interest include equation-of-state data (relating pressure, density, and temperature) and electrical conductivity.
    To calculate the insulator-to-metallic transition point, the key quantity is the electronic band-gap. The concept of band-structures is a complicated topic, which we try to briefly describe in layman's terms. In solids, electrons can occupy certain energy ranges — we call these the energy bands. In insulating materials, there is a gap between these energy ranges that electrons are forbidden from occupying — this is the so-called band-gap. In conducting materials, there is no such gap, and therefore electrons can conduct electricity because they can be excited into any part of the energy spectrum. Therefore, a simple method to determine the insulator-to-metallic transition is to determine the density at which the band-gap becomes zero.

Fig. 6: Helium density-of-states (DOS) as a function of energy, for different mass densities ρ_m, and at temperature τ = 50 kK. Black dots indicate the occupations of the electrons in the permitted energy ranges. Dashed black lines indicate the band-gap (the energy gap between the insulating and conducting bands). Between 5 and 6 g cm⁻³, the band-gap disappears.

    In Fig. 6, we plot the density-of-states (DOS) as a function of energy, for different densities and at fixed temperature τ = 50 kK. The DOS shows the energy ranges that the electrons are allowed to occupy; we also show the actual energies occupied by the electrons (according to Fermi–Dirac statistics) with the black dots. We can clearly see in this figure that the band-gap (the region where the DOS is zero) becomes smaller as a function of density. From this figure, it seems the transition from insulating to metallic state happens somewhere between 5 and 6 g cm⁻³.
    In Fig. 7, we plot the band-gap as a function of density, for a fixed temperature τ = 50 kK. Visually, it appears that the relationship between band-gap and density is linear at this temperature. This is confirmed using a linear fit, which has a coefficient of determination value of almost exactly one, R² = 0.9997. Using this fit, the band-gap is predicted to close at 5.5 g cm⁻³. Also in this figure, we show the fraction of ionized electrons, which is given by Z̄/N_e, using Eq. (6) to calculate Z̄, and N_e being the total electron number. The ionization fraction also relates to the conductivity of the material, because ionized electrons are not bound to any nuclei and therefore free to conduct electricity. We see that the ionization fraction mostly increases with density (excepting some strange behavior around ρ_m = 1 g cm⁻³), which is further evidence of the transition from insulating to conducting behaviour with increasing density.
    As a final analysis, we plot the pressure as a function of mass
density and temperature in Fig. 8. The pressure is given by the sum of two terms: (i) the electronic pressure, calculated using the method described in Ref. [FB19], and (ii) the ionic pressure, calculated using the ideal gas law. We observe that the pressure increases with both density and temperature, which is the expected behavior. Under these conditions, the density dependence is much stronger, especially for higher densities.
    The code required to generate the above results and plots can be found in this repository.

Fig. 7: Band-gap (red circles) and ionization fraction (blue squares) for helium as a function of mass density, at temperature τ = 50 kK. The relationship between the band-gap and the density appears to be linear.

Fig. 8: Helium pressure (logarithmic scale) as a function of mass density and temperature. The pressure increases with density and temperature (as expected), with a stronger dependence on density.

Conclusions and future work

In this paper, we have presented atoMEC: an average-atom Python code for studying materials under extreme conditions. The open-source nature of atoMEC, and the choice to use (pure) Python as the programming language, is designed to improve the accessibility of average-atom models.
    We gave significant attention to the code structure in this paper, and tried as much as possible to connect the functions and objects in the code with the underlying theory. We hope that this not only improves atoMEC from a user perspective, but also facilitates new contributions from the wider average-atom, WDM and scientific Python communities. Another aim of the paper was to communicate how atoMEC benefits from a strong ecosystem of open-source scientific libraries — especially the Python libraries NumPy, SciPy, joblib and mendeleev, as well as LIBXC.
    We finish this paper by emphasizing that atoMEC is still in the early stages of development, and there are many opportunities to improve and extend the code. These include, for example:

    • adding new average-atom models, and different approximations to the existing models.ISModel model;
    • optimizing the code, in particular the routines in the numerov module;
    • adding new postprocessing functionality, for example to compute structure factors;
    • improving the structure and design choices of the code.

    Of course, these are just a snapshot of the avenues for future development in atoMEC. We are open to contributions in these areas and many more besides.

Acknowledgements

This work was partly funded by the Center for Advanced Systems Understanding (CASUS) which is financed by Germany's Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament.

References

[BDM+20] M. Bonitz, T. Dornheim, Zh. A. Moldabekov, S. Zhang, P. Hamann, H. Kählert, A. Filinov, K. Ramakrishna, and J. Vorberger. Ab initio simulation of warm dense matter. Phys. Plasmas, 27(4):042710, 2020. doi:10.1063/1.5143225.
[BNR13] Roi Baer, Daniel Neuhauser, and Eran Rabani. Self-averaging stochastic Kohn-Sham density-functional theory. Phys. Rev. Lett., 111:106402, Sep 2013. doi:10.1103/PhysRevLett.111.106402.
[BVL+17] Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn-Sham equations with machine learning. Nature Communications, 8(1):872, Oct 2017. doi:10.1038/s41467-017-00839-3.
[CHKC22] T. J. Callow, S. B. Hansen, E. Kraisler, and A. Cangi. First-principles derivation and properties of density-functional average-atom models. Phys. Rev. Research, 4:023055, Apr 2022. doi:10.1103/PhysRevResearch.4.023055.
[CKC22] Timothy J. Callow, Eli Kraisler, and Attila Cangi. Accurate and efficient computation of mean ionization states with an average-atom Kubo-Greenwood approach, 2022. doi:10.48550/ARXIV.2203.05863.
[CKTS+21] Timothy Callow, Daniel Kotik, Ekaterina Tsvetoslavova Stankulova, Eli Kraisler, and Attila Cangi. atomec, August 2021. If you use this software, please cite it using these metadata. doi:10.5281/zenodo.5205719.
[CMSY12] Aron J. Cohen, Paula Mori-Sánchez, and Weitao Yang. Challenges for density functional theory. Chemical Reviews, 112(1):289–320, 2012. doi:10.1021/cr200107z.
[CRNB18] Yael Cytter, Eran Rabani, Daniel Neuhauser, and Roi Baer. Stochastic density functional theory at finite temperatures. Phys. Rev. B, 97:115207, Mar 2018. doi:10.1103/PhysRevB.97.115207.
[DGB18] Tobias Dornheim, Simon Groth, and Michael Bonitz. The uniform electron gas at warm dense matter conditions. Phys. Rep., 744:1–86, 2018. doi:10.1016/j.physrep.2018.04.001.
[EFP+21] J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. Stephens, A. P. Thompson, A. Cangi, and S. Rajamanickam. Accelerating finite-temperature Kohn-Sham density functional theory with deep neural networks. Phys. Rev. B, 104:035120, Jul 2021. doi:10.1103/PhysRevB.104.035120.
[FB19] Gérald Faussurier and Christophe Blancard. Pressure in warm and hot dense matter using the average-atom model. Phys. Rev. E, 99:053201, May 2019. doi:10.1103/PhysRevE.99.053201.
[GDRT14] Frank Graziani, Michael P Desjarlais, Ronald Redmer, and Samuel B Trickey. Frontiers and challenges in warm dense matter, volume 96. Springer Science & Business, 2014. doi:10.1007/978-3-319-04912-0.
[GFG+16] S H Glenzer, L B Fletcher, E Galtier, B Nagler, R Alonso-Mori, B Barbrel, S B Brown, D A Chapman, Z Chen, C B Curry, F Fiuza, E Gamboa, M Gauthier, D O Gericke, A Gleason, S Goede, E Granados, P Heimann, J Kim, D Kraus, M J MacDonald, A J Mackinnon, R Mishra, A Ravasio, C Roedel, P Sperling, W Schumaker, Y Y Tsui, J Vorberger, U Zastrau, A Fry, W E White, J B Hasting, and H J Lee. Matter under extreme conditions experiments at the Linac Coherent Light Source. J. Phys. B, 49(9):092001, Apr 2016. doi:10.1088/0953-4075/49/9/092001.
[HK64] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. Phys. Rev., 136(3B):B864–B871, Nov 1964. doi:10.1103/PhysRev.136.B864.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[HRD08] Bastian Holst, Ronald Redmer, and Michael P. Desjarlais. Thermophysical properties of warm dense hydrogen using quantum molecular dynamics simulations. Phys. Rev. B, 77:184201, May 2008. doi:10.1103/PhysRevB.77.184201.
[JFC+13] Weile Jia, Jiyun Fu, Zongyan Cao, Long Wang, Xuebin Chi, Weiguo Gao, and Lin-Wang Wang. Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines. Journal of Computational Physics, 251:102–115, 2013. doi:10.1016/j.jcp.2013.05.005.
[Job20] Joblib Development Team. Joblib: running Python functions as pipeline jobs. https://joblib.readthedocs.io/, 2020.
[KDF+11] A. L. Kritcher, T. Döppner, C. Fortmann, T. Ma, O. L. Landen, R. Wallace, and S. H. Glenzer. In-Flight Measurements of Capsule Shell Adiabats in Laser-Driven Implosions. Phys. Rev. Lett., 107:015002, Jul 2011. doi:10.1103/PhysRevLett.107.015002.
[Koh99] W. Kohn. Nobel lecture: Electronic structure of matter—wave functions and density functionals. Rev. Mod. Phys., 71:1253–1266, Oct 1999. doi:10.1103/RevModPhys.71.1253.
[KS65] W. Kohn and L. J. Sham. Self-consistent equations including exchange and correlation effects. Phys. Rev., 140(4A):A1133–A1138, Nov 1965. doi:10.1103/PhysRev.140.A1133.
[LSOM18] Susi Lehtola, Conrad Steigemann, Micael J.T. Oliveira, and Miguel A.L. Marques. Recent developments in LIBXC — A comprehensive library of functionals for density functional theory. SoftwareX, 7:1–5, 2018. doi:10.1016/j.softx.2017.11.002.
[MED11] Stefan Maintz, Bernhard Eck, and Richard Dronskowski. Speeding up plane-wave electronic-structure calculations using graphics-processing units. Computer Physics Communications, 182(7):1421–1427, 2011. doi:10.1016/j.cpc.2011.03.010.
[men14] mendeleev – A Python resource for properties of chemical elements, ions and isotopes, ver. 0.9.0. https://github.com/lmmentel/mendeleev, 2014.
[Mer65] N. David Mermin. Thermal properties of the inhomogeneous electron gas. Phys. Rev., 137:A1441–A1443, Mar 1965. doi:10.1103/PhysRev.137.A1441.
[PGW12] Mohandas Pillai, Joshua Goglio, and Thad G. Walker. Matrix Numerov method for solving Schrödinger's equation. American Journal of Physics, 80(11):1017–1019, 2012. doi:10.1119/1.4748813.
[PPF+11] S. Pittalis, C. R. Proetto, A. Floris, A. Sanna, C. Bersier, K. Burke, and E. K. U. Gross. Exact conditions in finite-temperature density-functional theory. Phys. Rev. Lett., 107:163001, Oct 2011. doi:10.1103/PhysRevLett.107.163001.
[PR20] Martin Preising and Ronald Redmer. Metallization of dense fluid helium from ab initio simulations. Phys. Rev. B, 102:224107, Dec 2020. doi:10.1103/PhysRevB.102.224107.
[Roz91] Balazs F. Rozsnyai. Photoabsorption in hot plasmas based on the ion-sphere and ion-correlation models. Phys. Rev. A, 43:3035–3042, Mar 1991. doi:10.1103/PhysRevA.43.3035.
[SM91] H. B. Schlegel and J. J. W. McDouall. Do You Have SCF Stability and Convergence Problems?, pages 167–185. Springer Netherlands, Dordrecht, 1991. doi:10.1007/978-94-011-3262-6_2.
[SPS+14] A. N. Souza, D. J. Perkins, C. E. Starrett, D. Saumon, and S. B. Hansen. Predictions of x-ray scattering spectra for warm dense matter. Phys. Rev. E, 89:023108, Feb 2014. doi:10.1103/PhysRevE.89.023108.
[SRH+12] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke. Finding density functionals with machine learning. Phys. Rev. Lett., 108:253002, Jun 2012. doi:10.1103/PhysRevLett.108.253002.
[SS14] C.E. Starrett and D. Saumon. A simple method for determining the ionic structure of warm dense matter. High Energy Density Physics, 10:35–42, 2014. doi:10.1016/j.hedp.2013.12.001.
[Sta16] C.E. Starrett. Kubo–Greenwood approach to conductivity in dense plasmas with average atom models. High Energy Density Physics, 19:58–64, 2016. doi:10.1016/j.hedp.2016.04.001.
[STJ+14] Sang-Kil Son, Robert Thiele, Zoltan Jurek, Beata Ziaja, and Robin Santra. Quantum-mechanical calculation of ionization-potential lowering in dense plasmas. Phys. Rev. X, 4:031004, Jul 2014. doi:10.1103/PhysRevX.4.031004.
[VGO+20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
           Automatic random variate generation in Python
                                                          Christoph Baumgarten‡∗ , Tirth Patel



                                                                                      F



Abstract—The generation of random variates is an important tool that is required in many applications. Various software programs or packages contain generators for standard distributions like the normal, exponential or Gamma, e.g., the programming language R and the packages SciPy and NumPy in Python. However, it is not uncommon that sampling from new/non-standard distributions is required. Instead of deriving specific generators in such situations, so-called automatic or black-box methods have been developed. These allow the user to generate random variates from fairly large classes of distributions by only specifying some properties of the distributions (e.g. the density and/or the cumulative distribution function). In this note, we describe the implementation of such methods from the C library UNU.RAN in the Python package SciPy and provide a brief overview of the functionality.

Index Terms—numerical inversion, generation of random variates

* Corresponding author: christoph.baumgarten@gmail.com
‡ Unaffiliated

Copyright © 2022 Christoph Baumgarten et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction

The generation of random variates is an important tool that is required in many applications. Various software programs or packages contain generators for standard distributions, e.g., R ([R C21]), SciPy ([VGO+ 20]) and NumPy ([HMvdW+ 20]) in Python. Standard references for these algorithms are the books [Dev86], [Dag88], [Gen03], and [Knu14]. An interested reader will find many references to the vast existing literature in these works. While relying on general methods such as the rejection principle, the algorithms for well-known distributions are often specifically designed for a particular distribution. This is also the case in the module stats in SciPy, which contains more than 100 distributions, and the module random in NumPy, with more than 30 distributions. However, there are also so-called automatic or black-box methods for sampling from large classes of distributions with a single piece of code. For such algorithms, information about the distribution such as the density, potentially together with its derivative, the cumulative distribution function (CDF), and/or the mode must be provided. See [HLD04] for a comprehensive overview of these methods. Although the development of such methods was originally motivated by the need to generate variates from non-standard distributions, these universal methods have advantages that make their usage attractive even for sampling from standard distributions. We mention some of the important properties (see [LH00], [HLD04], [DHL10]):

   •  The algorithms can be used to sample from truncated distributions.
   •  For inversion methods, the structural properties of the underlying uniform random number generator are preserved and the numerical accuracy of the methods can be controlled by a parameter. Therefore, inversion is usually the only method applied for simulations using quasi-Monte Carlo (QMC) methods.
   •  Depending on the use case, one can choose between a fast setup with slow marginal generation time and vice versa.

The latter point is important depending on the use case: if a large number of samples is required for a given distribution with fixed shape parameters, a slower setup that only has to be run once can be accepted if the marginal generation times are low. If small to moderate sample sizes are required for many different shape parameters, then it is important to have a fast setup. The former situation is referred to as the fixed-parameter case and the latter as the varying parameter case.
    Implementations of various methods are available in the C library UNU.RAN ([HL07]) and in the associated R package Runuran (https://cran.r-project.org/web/packages/Runuran/index.html, [TL03]). The aim of this note is to introduce the Python implementation in the SciPy package that makes some of the key methods in UNU.RAN available to Python users in SciPy 1.8.0. These general tools can be seen as a complement to the existing specific sampling methods: they might lead to better performance in specific situations compared to the existing generators, e.g., if a very large number of samples is required for a fixed parameter of a distribution or if the implemented sampling method relies on a slow default that is based on numerical inversion of the CDF. For advanced users, they also offer various options that allow the generators to be fine-tuned (e.g., to control the time needed for the setup step).


Automatic algorithms in SciPy

Many of the automatic algorithms described in [HLD04] and [DHL10] are implemented in the ANSI C library UNU.RAN (Universal Non-Uniform RANdom variate generators). Our goal was to provide a Python interface to the most important methods from UNU.RAN to generate univariate discrete and continuous non-uniform random variates. The following generators have been implemented in SciPy 1.8.0 (an import sketch follows the list):

   •  TransformedDensityRejection: Transformed Density Rejection (TDR) ([Hör95], [GW92])
   •  NumericalInverseHermite: Hermite interpolation based INVersion of CDF (HINV) ([HL03])
   •  NumericalInversePolynomial: Polynomial interpolation based INVersion of CDF (PINV) ([DHL10])
   •  SimpleRatioUniforms: Simple Ratio-Of-Uniforms (SROU) ([Ley01], [Ley03])
   •  DiscreteGuideTable: (Discrete) Guide Table method (DGT) ([CA74])
   •  DiscreteAliasUrn: (Discrete) Alias-Urn method (DAU) ([Wal77])
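All of these classes live in the scipy.stats.sampling submodule; the examples below access them through the sampling namespace. A minimal import sketch (assuming SciPy >= 1.8.0 is installed):

from scipy.stats.sampling import (
    TransformedDensityRejection,
    NumericalInverseHermite,
    NumericalInversePolynomial,
    SimpleRatioUniforms,
    DiscreteGuideTable,
    DiscreteAliasUrn,
)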

Before describing the implementation in SciPy in the section "Description of the SciPy interface", we give a short introduction to random variate generation.


A very brief introduction to random variate generation

It is well-known that random variates can be generated by inversion of the CDF F of a distribution: if U is a uniform random number on (0, 1), X := F⁻¹(U) is distributed according to F. Unfortunately, the inverse CDF can only be expressed in closed form for very few distributions, e.g., the exponential or Cauchy distribution. If this is not the case, one needs to rely on implementations of special functions to compute the inverse CDF for standard distributions like the normal, Gamma or beta distributions, or numerical methods for inverting the CDF are required. Such procedures, however, have the disadvantage that they may be slow or inaccurate, and developing fast and robust inversion algorithms such as HINV and PINV is a non-trivial task. HINV relies on Hermite interpolation of the inverse CDF and requires the CDF and PDF as an input. PINV only requires the PDF. The algorithm then computes the CDF via adaptive Gauss-Lobatto integration and an approximation of the inverse CDF using Newton's polynomial interpolation. Note that an approximation of the inverse CDF can be achieved by interpolating the points (F(xᵢ), xᵢ) for points xᵢ in the domain of F, i.e., no evaluation of the inverse CDF is required.
    For discrete distributions, F is a step function. To compute the inverse CDF F⁻¹(U), the simplest idea would be to apply sequential search: if X takes values 0, 1, 2, ... with probabilities p₀, p₁, p₂, ..., start with j = 0 and keep incrementing j until F(j) = p₀ + ... + pⱼ ≥ U. When the search terminates, X = j = F⁻¹(U). Clearly, this approach is generally very slow and more efficient methods have been developed: if X takes L distinct values, DGT realizes very fast inversion using so-called guide tables / hash tables to find the index j. In contrast, DAU is not an inversion method but uses the alias method, i.e., tables are precomputed to write X as an equi-probable mixture of L two-point distributions (the alias values).
    The rejection method has been suggested in [VN51]. In its simplest form, assume that f is a bounded density on [a, b], i.e., f(x) ≤ M for all x ∈ [a, b]. Sample two independent uniform random variates U on [0, 1] and V on [a, b] until M · U ≤ f(V). Note that the accepted points (U, V) are uniformly distributed in the region between the x-axis and the graph of the PDF. Hence, X := V has the desired distribution f. This is a special case of the general version: if f, g are two densities on an interval J such that f(x) ≤ c · g(x) for all x ∈ J and a constant c ≥ 1, sample U uniformly distributed on [0, 1] and X distributed according to g until c · U · g(X) ≤ f(X). Then X has the desired distribution f. It can be shown that the expected number of iterations before the acceptance condition is met is equal to c. Hence, the main challenge is to find hat functions g for which c is small and from which random variates can be generated efficiently. TDR solves this problem by applying a transformation T to the density such that x ↦ T(f(x)) is concave. A hat function can then be found by computing tangents at suitable design points. Note that, by its nature, a rejection method does not always require the same number of uniform variates to generate one non-uniform variate; this makes the use of QMC and of some variance reduction methods more difficult or impossible. On the other hand, rejection is often the fastest choice for the varying parameter case.
    The Ratio-Of-Uniforms method (ROU, [KM77]) is another general method that relies on rejection. The underlying principle is that if (U, V) is uniformly distributed on the set A_f := {(u, v) : 0 < v ≤ √(f(u/v)), a < u/v < b}, where f is a PDF with support (a, b), then X := U/V follows a distribution according to f. In general, it is not possible to sample uniform values on A_f directly. However, if A_f ⊂ R := [u₋, u₊] × [0, v₊] for finite constants u₋, u₊, v₊, one can apply the rejection method: generate uniform values (U, V) on the bounding rectangle R until (U, V) ∈ A_f and return X = U/V. Automatic methods relying on the ROU method such as SROU and automatic ROU ([Ley00]) need a setup step to find a suitable region S ⊂ ℝ² such that A_f ⊂ S and such that one can generate (U, V) uniformly on S efficiently.

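The simple rejection principle for a bounded density can be sketched just as briefly; the density below (f(x) = 2x on [0, 1], bounded by M = 2) is only a toy example and the loop is intentionally unoptimized:

import numpy as np

rng = np.random.default_rng(456)

def pdf(x):
    # toy bounded density on [0, 1] with bound M = 2
    return 2.0 * x

def rejection_sample(pdf, a, b, M, size):
    samples = []
    while len(samples) < size:
        u = rng.uniform()        # uniform variate on [0, 1]
        v = rng.uniform(a, b)    # uniform proposal on [a, b]
        if M * u <= pdf(v):      # accept if the point lies below the graph of f
            samples.append(v)
    return np.array(samples)

rvs = rejection_sample(pdf, 0.0, 1.0, 2.0, size=1000)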
Description of the SciPy interface

SciPy provides an object-oriented API to UNU.RAN's methods. To initialize a generator, two steps are required:

   1)  creating a distribution class and object,
   2)  initializing the generator itself.

In step 1, a distribution object must be created that implements the required methods (e.g., pdf, cdf). This can either be a custom object or a distribution object from the classes rv_continuous or rv_discrete in SciPy. Once the generator is initialized from the distribution object, it provides a rvs method to sample random variates from the given distribution. It also provides a ppf method that approximates the inverse CDF if the initialized generator uses an inversion method. The following example illustrates how to initialize the NumericalInversePolynomial (PINV) generator for the standard normal distribution:

import numpy as np
from scipy.stats import sampling
from math import exp

# create a distribution class with an implementation
# of the PDF. Note that the normalization constant
# is not required
class StandardNormal:
    def pdf(self, x):
        return exp(-0.5 * x**2)

# create a distribution object and initialize the
# generator
dist = StandardNormal()
rng = sampling.NumericalInversePolynomial(dist)

# sample 100,000 random variates from the given
# distribution
rvs = rng.rvs(100000)

As the NumericalInversePolynomial generator uses an inversion method, it also provides a ppf method that approximates the inverse CDF:

# evaluate the approximate PPF at a few points
ppf = rng.ppf([0.1, 0.5, 0.9])

It is also easy to sample from a truncated distribution by passing a domain argument to the constructor of the generator. For example, to sample from a truncated normal distribution:

# truncate the distribution by passing a
# `domain` argument
rng = sampling.NumericalInversePolynomial(
    dist, domain=(-1, 1)
)

While the default options of the generators should work well in many situations, we point out that there are various parameters that the user can modify, e.g., to provide further information about the distribution (such as mode or center) or to control the numerical accuracy of the approximated PPF (u_resolution). Details can be found in the SciPy documentation https://docs.scipy.org/doc/scipy/reference/. The above code can easily be generalized to sample from parametrized distributions using instance attributes in the distribution class. For example, to sample from the gamma distribution with shape parameter alpha, we can create the distribution class with parameters as instance attributes:

class Gamma:
    def __init__(self, alpha):
        self.alpha = alpha

    def pdf(self, x):
        return x**(self.alpha-1) * exp(-x)

    def support(self):
        return 0, np.inf

# initialize a distribution object with varying
# parameters
dist1 = Gamma(2)
dist2 = Gamma(3)

# initialize a generator for each distribution
rng1 = sampling.NumericalInversePolynomial(dist1)
rng2 = sampling.NumericalInversePolynomial(dist2)

In the above example, the support method is used to set the domain of the distribution. This can alternatively be done by passing a domain parameter to the constructor.
    In addition to continuous distributions, two UNU.RAN methods have been added in SciPy to sample from discrete distributions. In this case, the distribution can either be represented using a probability vector (which is passed to the constructor as a Python list or NumPy array) or a Python object with an implementation of the probability mass function. In the latter case, a finite domain must be passed to the constructor or the object should implement the support method¹.

    1. Support for discrete distributions with infinite domain hasn't been added yet.

# Probability vector to represent a discrete
# distribution. Note that the probability vector
# need not be normalized
pv = [0.1, 9.0, 2.9, 3.4, 0.3]

# PCG64 uniform RNG with seed 123
urng = np.random.default_rng(123)
rng = sampling.DiscreteAliasUrn(
    pv, random_state=urng
)

# sample from the given discrete distribution
rvs = rng.rvs(100000)
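The second variant described above, a Python object that implements the probability mass function together with a finite domain, works analogously. The following sketch is our own illustration (the binomial parameters are arbitrary) and assumes, as stated above, that the object exposes pmf and support methods:

from math import comb

import numpy as np
from scipy.stats import sampling

class Binomial:
    # binomial distribution with fixed parameters n and p
    def __init__(self, n, p):
        self.n = n
        self.p = p

    def pmf(self, k):
        return comb(self.n, k) * self.p**k * (1 - self.p)**(self.n - k)

    def support(self):
        # finite domain {0, 1, ..., n}
        return 0, self.n

urng = np.random.default_rng(123)
rng_dgt = sampling.DiscreteGuideTable(Binomial(10, 0.3), random_state=urng)
rvs_dgt = rng_dgt.rvs(1000)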

Underlying uniform pseudo-random number generators

NumPy provides several generators for uniform pseudo-random numbers². It is highly recommended to use NumPy's default random number generator np.random.PCG64 for better speed and performance, see [O'N14] and https://numpy.org/doc/stable/reference/random/bit_generators/index.html. To change the uniform random number generator, a random_state parameter can be passed as shown in the example below:

    2. By default, NumPy's legacy random number generator, MT19937 (np.random.RandomState()), is used as the uniform random number generator for consistency with the stats module in SciPy.

# 64-bit PCG random number generator in NumPy
urng = np.random.Generator(np.random.PCG64())
# The above line can also be replaced by:
# ``urng = np.random.default_rng()``
# as PCG64 is the default generator starting
# from NumPy 1.19.0

# change the uniform random number generator by
# passing the `random_state` argument
rng = sampling.NumericalInversePolynomial(
    dist, random_state=urng
)

We also point out that the PPF of inversion methods can be applied to sequences of quasi-random numbers. SciPy provides different sequences in its QMC module (scipy.stats.qmc).
    NumericalInverseHermite provides a qrvs method which generates random variates using QMC methods present in SciPy (scipy.stats.qmc) as uniform random number generators³. The next example illustrates how to use qrvs with a generator created directly from a SciPy distribution object.

    3. In SciPy 1.9.0, qrvs will be added to NumericalInversePolynomial.

from scipy import stats
from scipy.stats import qmc

# 1D Halton sequence generator.
qrng = qmc.Halton(d=1)

rng = sampling.NumericalInverseHermite(stats.norm())

# generate quasi-random numbers using the Halton
# sequence as uniform variates
qrvs = rng.qrvs(size=100, qmc_engine=qrng)


Benchmarking

To analyze the performance of the implementation, we tested the methods applied to several standard distributions against the generators in NumPy and the original UNU.RAN C library. In addition, we selected one non-standard distribution to demonstrate that substantial reductions in the runtime can be achieved compared to other implementations. All the benchmarks were carried out using NumPy 1.22.4 and SciPy 1.8.1 running on a single core on Ubuntu 20.04.3 LTS with an Intel(R) Core(TM) i7-8750H CPU (2.20 GHz clock speed, 16 GB RAM). We ran the benchmarks with NumPy's MT19937 (Mersenne Twister) and PCG64 random number generators (np.random.MT19937 and np.random.PCG64) in Python and used NumPy's C implementation of MT19937 in the UNU.RAN C benchmarks. As explained above, the use of PCG64 is recommended, and MT19937 is only included to compare the speed of the Python implementation and the C library by relying on the same uniform number generator (i.e., differences in the performance of the uniform number generation are not taken into account). The code for all the benchmarks can be found at https://github.com/tirthasheshpatel/unuran_benchmarks.
    The methods used in NumPy to generate normal, gamma, and beta random variates are:

   •  the ziggurat algorithm ([MT00b]) to sample from the standard normal distribution,
   •  the rejection algorithms in Chapter XII.2.6 in [Dev86] if α < 1 and in [MT00a] if α > 1 for the Gamma distribution,
   •  Jöhnk's algorithm ([Jöh64], Section IX.3.5 in [Dev86]) if max{α, β} ≤ 1, otherwise a ratio of two Gamma variates with shape parameters α and β (see Section IX.4.1 in [Dev86]) for the beta distribution.


Benchmarking against the normal, gamma, and beta distributions

Table 1 compares the performance for the standard normal, Gamma and beta distributions. We recall that the density of the Gamma distribution with shape parameter a > 0 is given by x ∈ (0, ∞) ↦ x^(a−1) e^(−x) and the density of the beta distribution with shape parameters α, β > 0 is given by x ∈ (0, 1) ↦ x^(α−1) (1−x)^(β−1) / B(α, β), where Γ(·) and B(·, ·) are the Gamma and beta functions. The results are reported in Table 1.
    We summarize our main observations:

   1)  The setup step in Python is substantially slower than in C due to expensive Python callbacks, especially for PINV and HINV. However, the time taken for the setup is low compared to the sampling time if large samples are drawn. Note that, as expected, SROU has a very fast setup such that this method is suitable for the varying parameter case.
   2)  The sampling time in Python is slightly higher than in C for the MT19937 random number generator. If the recommended PCG64 generator is used, the sampling time in Python is slightly lower. The only exception is SROU: due to Python callbacks, the performance is substantially slower than in C. However, as the main advantage of SROU is the fast setup time, the main use case is the varying parameter case (i.e., the method is not supposed to be used to generate large samples).
   3)  PINV, HINV, and TDR are at most about 2x slower than the specialized NumPy implementation for the normal distribution. For the Gamma and beta distributions, they even perform better for some of the chosen shape parameters. These results underline the strong performance of these black-box approaches even for standard distributions.
   4)  While the application of PINV requires bounded densities, no issues are encountered for α = 0.05 since the unbounded part is cut off by the algorithm. However, the setup can fail for very small values of α.


Benchmarking against a non-standard distribution

We benchmark the performance of PINV to sample from the generalized normal distribution ([Sub23]), whose density is given by x ∈ (−∞, ∞) ↦ p e^(−|x|^p) / (2Γ(1/p)), against the method proposed in [NP09] and against the implementation in SciPy's gennorm distribution. The approach in [NP09] relies on transforming Gamma variates to the generalized normal distribution whereas SciPy relies on computing the inverse of the CDF of the Gamma distribution (https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammainccinv.html). The results for different values of p are shown in Table 2.
    PINV is usually about twice as fast as the specialized method and about 15-150 times faster than SciPy's implementation⁴. We also found an R package pgnorm (https://cran.r-project.org/web/packages/pgnorm/) that implements various approaches from [KR13]. In that case, PINV is usually about 70-200 times faster. This clearly shows the benefit of using a black-box algorithm.

    4. In SciPy 1.9.0, the speed will be improved by implementing the method from [NP09].


Conclusion

The interface to UNU.RAN in SciPy provides easy access to different algorithms for non-uniform variate generation for large classes of univariate continuous and discrete distributions. We have shown that the methods are easy to use and that the algorithms perform very well both for standard and non-standard distributions. A comprehensive documentation suite, a tutorial and many examples are available at https://docs.scipy.org/doc/scipy/reference/stats.sampling.html and https://docs.scipy.org/doc/scipy/tutorial/stats/sampling.html. Various methods have been implemented in SciPy, and if specific use cases require additional functionality from UNU.RAN, the methods can easily be added to SciPy given the flexible framework that has been developed. Another area of further development is to better integrate SciPy's QMC generators for the inversion methods.
    Finally, we point out that other sampling methods like Markov Chain Monte Carlo and copula methods are not part of SciPy. Relevant Python packages in that context are PyMC ([PHF10]), PyStan relying on Stan ([Tea21]), Copulas (https://sdv.dev/Copulas/) and PyCopula (https://blent-ai.github.io/pycopula/).


Acknowledgments

The authors wish to thank Wolfgang Hörmann and Josef Leydold for agreeing to publish the library under a BSD license and for helpful feedback on the implementation and this note. In addition, we thank Ralf Gommers, Matt Haberland, Nicholas McKibben, Pamphile Roy, and Kai Striega for their code contributions, reviews, and helpful suggestions. The second author was supported by the Google Summer of Code 2021 program⁵.

    5. https://summerofcode.withgoogle.com/projects/#5912428874825728

Distribution      Method   Python setup   Python sampling (PCG64)   Python sampling (MT19937)   C setup   C sampling (MT19937)
Standard normal   PINV     4.6            29.6                      36.5                        0.27      32.4
                  HINV     2.5            33.7                      40.9                        0.38      36.8
                  TDR      0.2            37.3                      47.8                        0.02      41.4
                  SROU     8.7 µs         2510                      2160                        0.5 µs    232
                  NumPy    -              17.6                      22.4                        -         -
Gamma(0.05)       PINV     196.0          29.8                      37.2                        37.9      32.5
                  HINV     24.5           36.1                      43.8                        1.9       40.7
                  NumPy    -              55.0                      68.1                        -         -
Gamma(0.5)        PINV     16.5           31.2                      38.6                        2.0       34.5
                  HINV     4.9            34.2                      41.7                        0.6       37.9
                  NumPy    -              86.4                      99.2                        -         -
Gamma(3.0)        PINV     5.3            30.8                      38.7                        0.5       34.6
                  HINV     5.3            33                        40.6                        0.4       36.8
                  TDR      0.2            38.8                      49.6                        0.03      44
                  NumPy    -              36.5                      47.1                        -         -
Beta(0.5, 0.5)    PINV     21.4           33.1                      39.9                        2.4       37.3
                  HINV     2.1            38.4                      45.3                        0.2       42
                  NumPy    -              101                       112                         -         -
Beta(0.5, 1.0)    HINV     0.2            37                        44.3                        0.01      41.1
                  NumPy    -              125                       138                         -         -
Beta(1.3, 1.2)    PINV     15.7           30.5                      37.2                        1.7       34.3
                  HINV     4.1            33.4                      40.8                        0.4       37.1
                  TDR      0.2            46.8                      57.8                        0.03      45
                  NumPy    -              74.3                      97                          -         -
Beta(3.0, 2.0)    PINV     9.7            30.2                      38.2                        0.9       33.8
                  HINV     5.8            33.7                      41.2                        0.4       37.4
                  TDR      0.2            42.8                      52.8                        0.02      44
                  NumPy    -              72.6                      92.8                        -         -

                                                                            TABLE 1
Average time taken (reported in milliseconds, unless mentioned otherwise) to sample 1 million random variates from the listed distributions. The mean is computed over 7 iterations. Standard deviations are not reported as they were very small (less than 1% of the mean in the large majority of cases). Note that not all methods can always be applied, e.g., TDR cannot be applied to the Gamma distribution if a < 1 since the PDF is not log-concave in that case. As NumPy uses rejection algorithms with precomputed constants, no setup time is reported.



p                                   0.25   0.45   0.75      1    1.5      2      5      8
Nardon and Pianca (2009)             100    101    101     45    148    120    128    122
SciPy's gennorm distribution         832   1000   1110    559   5240   6720   6230   5950
Python (PINV method, PCG64 urng)      50     47     45     41     40     37     38     38

                                                                            TABLE 2
Comparing SciPy's implementation and a specialized method against PINV to sample 1 million variates from the generalized normal distribution for different values of the parameter p. Time reported in milliseconds. The mean is computed over 7 iterations.


REFERENCES

[CA74]       Hui-Chuan Chen and Yoshinori Asau. On generating random variates from an empirical distribution. AIIE Transactions, 6(2):163–166, 1974. doi:10.1080/05695557408974949.
[Dag88]      John Dagpunar. Principles of random variate generation. Oxford University Press, USA, 1988.
[Dev86]      Luc Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986. doi:10.1007/978-1-4613-8643-8.
[DHL10]      Gerhard Derflinger, Wolfgang Hörmann, and Josef Leydold. Random variate generation by numerical inversion when only the density is known. ACM Transactions on Modeling and Computer Simulation (TOMACS), 20(4):1–25, 2010. doi:10.1145/1842722.1842723.
[Gen03]      James E Gentle. Random number generation and Monte Carlo methods, volume 381. Springer, 2003. doi:10.1007/b97336.
[GW92]       Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348, 1992. doi:10.2307/2347565.
[Hör95]      Wolfgang Hörmann. A rejection technique for sampling from T-concave distributions. ACM Trans. Math. Softw., 21(2):182–193, 1995. doi:10.1145/203082.203089.
[HL03]       Wolfgang Hörmann and Josef Leydold. Continuous random variate generation by fast numerical inversion. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(4):347–362, 2003. doi:10.1145/945511.945517.




[HL07]       Wolfgang Hörmann and Josef Leydold. UNU.RAN - Universal Non-Uniform RANdom number generators, 2007. https://statmath.wu.ac.at/unuran/doc.html.
[HLD04]      Wolfgang Hörmann, Josef Leydold, and Gerhard Derflinger. Automatic nonuniform random variate generation. Springer, 2004. doi:10.1007/978-3-662-05946-3.
[HMvdW+ 20]  Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020. doi:10.1038/s41586-020-2649-2.
[Jöh64]      MD Jöhnk. Erzeugung von betaverteilten und gammaverteilten Zufallszahlen. Metrika, 8(1):5–15, 1964. doi:10.1007/bf02613706.
[KM77]       Albert J Kinderman and John F Monahan. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS), 3(3):257–260, 1977. doi:10.1145/355744.355750.
[Knu14]      Donald E Knuth. The Art of Computer Programming, Volume 2: Seminumerical algorithms. Addison-Wesley Professional, 2014. doi:10.2307/2317055.
[KR13]       Steve Kalke and W-D Richter. Simulation of the p-generalized Gaussian distribution. Journal of Statistical Computation and Simulation, 83(4):641–667, 2013. doi:10.1080/00949655.2011.631187.
[Ley00]      Josef Leydold. Automatic sampling with the ratio-of-uniforms method. ACM Transactions on Mathematical Software (TOMS), 26(1):78–98, 2000. doi:10.1145/347837.347863.
[Ley01]      Josef Leydold. A simple universal generator for continuous and discrete univariate T-concave distributions. ACM Transactions on Mathematical Software (TOMS), 27(1):66–82, 2001. doi:10.1145/382043.382322.
[Ley03]      Josef Leydold. Short universal generators via generalized ratio-of-uniforms method. Mathematics of Computation, 72(243):1453–1471, 2003. doi:10.1090/s0025-5718-03-01511-4.

[LH00]       Josef Leydold and Wolfgang Hörmann. Universal algorithms
             as an alternative for generating non-uniform continuous ran-
             dom variates. In Proceedings of the International Conference
             on Monte Carlo Simulation 2000., pages 177–183, 2000.
[MT00a]      George Marsaglia and Wai Wan Tsang. A simple method for
             generating gamma variables. ACM Transactions on Math-
             ematical Software (TOMS), 26(3):363–372, 2000. doi:
             10.1145/358407.358414.
[MT00b]      George Marsaglia and Wai Wan Tsang. The ziggurat method
             for generating random variables. Journal of statistical soft-
             ware, 5(1):1–7, 2000. doi:10.18637/jss.v005.i08.
[NP09]       Martina Nardon and Paolo Pianca. Simulation techniques
             for generalized Gaussian densities. Journal of Statistical
             Computation and Simulation, 79(11):1317–1329, 2009. doi:
             10.1080/00949650802290912.
[O’N14]      Melissa E. O’Neill. PCG: A family of simple fast space-
             efficient statistically good algorithms for random number gen-
             eration. Technical Report HMC-CS-2014-0905, Harvey Mudd
             College, Claremont, CA, September 2014.
[PHF10]      Anand Patil, David Huard, and Christopher J Fonnesbeck.
             PyMC: Bayesian stochastic modelling in Python. Journal of
             Statistical Software, 35(4):1, 2010. doi:10.18637/jss.
             v035.i04.
[R C21]      R Core Team. R: A language and environment for statistical
             computing, 2021. https://www.R-project.org/.
[Sub23]      M.T. Subbotin. On the law of frequency of error. Mat. Sbornik,
             31(2):296–301, 1923.
[Tea21]      Stan Development Team. Stan modeling language users guide
             and reference manual, version 2.28., 2021. https://mc-stan.org.
[TL03]       Günter Tirler and Josef Leydold. Automatic non-uniform
             random variate generation in R. In Proceedings of DSC, page 2,
             2003.
[VGO+ 20]    Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt
             Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski,
             Pearu Peterson, Warren Weckesser, Jonathan Bright, et al.
             SciPy 1.0: Fundamental algorithms for scientific computing in
             Python. Nature Methods, 17:261–272, 2020. doi:10.1038/
             s41592-019-0686-2.
[VN51]       John Von Neumann. Various techniques used in connection
             with random digits. Appl. Math Ser, 12(36-38):3, 1951.
[Wal77]      Alastair J Walker. An efficient method for generating discrete
             random variables with general distributions. ACM Transac-
             tions on Mathematical Software (TOMS), 3(3):253–256, 1977.
             doi:10.1145/355744.355749.




     Utilizing SciPy and other open source packages to
     provide a powerful API for materials manipulation in
                the Schrödinger Materials Suite
                                            Alexandr Fonari‡∗ , Farshad Fallah‡ , Michael Rauch‡






Abstract—The use of several open source scientific packages in the Schrödinger Materials Science Suite will be discussed. A typical workflow for materials discovery will be described, discussing how open source packages have been incorporated at every stage. Some recent implementations of machine learning for materials discovery will be discussed, as well as how open source packages were leveraged to achieve results faster and more efficiently.

Index Terms—materials, active learning, OLED, deposition, evaporation

* Corresponding author: sasha.fonari@schrodinger.com
‡ Schrödinger Inc., 1540 Broadway, 24th Floor, New York, NY 10036

Copyright © 2022 Alexandr Fonari et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction

A common materials discovery practice or workflow is to start with reading an experimental structure of a material or generating a structure in silico, computing its properties of interest (e.g. elastic constants, electrical conductivity), tuning the material by modifying its structure (e.g. doping) or adding and removing atoms (deposition, evaporation), and then recomputing the properties of the modified material (Figure 1). Computational materials discovery leverages such workflows to empower researchers to explore vast design spaces and uncover root causes without (or in conjunction with) laboratory experimentation.
    Software tools for computational materials discovery can be facilitated by utilizing existing libraries that cover the fundamental mathematics used in the calculations in an optimized fashion. This use of existing libraries allows developers to devote more time to developing new features instead of re-inventing established methods. As a result, such a complementary approach improves the performance of computational materials software and reduces overall maintenance.
    The Schrödinger Materials Science Suite [LLC22] is a proprietary computational chemistry/physics platform that streamlines materials discovery workflows into a single graphical user interface (Materials Science Maestro). The interface is a single portal for structure building and enumeration, physics-based modeling and machine learning, visualization and analysis. Tying together the various modules are a wide variety of scientific packages, some of which are proprietary to Schrödinger, Inc., some of which are open-source and many of which blend the two to optimize capabilities and efficiency. For example, the main simulation engine for molecular quantum mechanics is the proprietary Jaguar [BHH+ 13] code. The proprietary classical molecular dynamics code Desmond (distributed by Schrödinger, Inc.) [SGB+ 14] is used to obtain physical properties of soft materials, surfaces and polymers. For periodic quantum mechanics, the main simulation engine is the open source code Quantum ESPRESSO (QE) [GAB+ 17]. One of the co-authors of this proceedings (A. Fonari) contributes to the QE code in order to make integration with the Materials Suite more seamless and less error-prone. As part of this integration, support for using the portable XML format for input and output in QE has been implemented in the open source Python package qeschema [BDBF].
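As a rough sketch of how such XML output can be consumed from Python, the snippet below parses a pw.x result file with qeschema; the file name is hypothetical and the exact call names should be checked against the qeschema documentation:

import qeschema

# parse the XML file produced by a pw.x (Quantum ESPRESSO) calculation
pw_document = qeschema.PwDocument()
pw_document.read("pwscf.xml")  # hypothetical output file name

# convert the validated document into plain Python data structures
data = pw_document.to_dict()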




                                  Fig. 1: Example of a workflow for computational materials discovery.




                          Fig. 2: Some example products that compose the Schrödinger Materials Science Suite.


formats [WF05]. Correctly reading experimental structures is of        work went into this project) and others to correctly read and
significant importance, since the rest of the materials discovery      convert periodic structures in OpenBabel. By version 3.1.1 (the
workflow depends on it. In addition to atom coordinates and            most recent at writing time), the authors are not aware of any
periodic cell information, structural data also contains symme-        structures read incorrectly by OpenBabel. In general, non-periodic
try operations (listed explicitly or by the means of providing         molecular formats are simpler to handle because they only contain
a space group) that can be used to decrease the number of              atom coordinates but no cell or symmetry information. OpenBabel
computations required for a particular system by accounting for        has Python bindings but due to the GPL license limitation, it is
symmetry. This can be important, especially when scaling high-         called as a subprocess from the Schrödinger Materials Suite.
throughput calculations. From file, structure is read in a structure        Another important consideration in structure generation is
object through which atomic coordinates (as a NumPy array) and         modeling of substitutional disorder in solid alloys and materials
chemical information of the material can be accessed and updated.      with point defects (intermetallics, semiconductors, oxides and
Structure object is similar to the one implemented in open source      their crystalline surfaces). In such cases, the unit cell and atomic
packages such as pymatgen [ORJ+ 13] and ASE [LMB+ 17]. All             sites of the crystal or surface slab are well defined while the chem-
the structure manipulations during the workflows are done by           ical species occupying the site may vary. In order to simulate sub-
using structure object interface (see structure deformation example    stitutional disorder, one must generate the ensemble of structures
below). Example of Structure object definition in pymatgen:            that includes all statistically significant atomic distributions in a
class Structure:                                                       given unit cell. This can be achieved by a brute force enumeration
                                                                       of all symmetrically unique atomic structures with a given number
   def __init__(self, lattice, species, coords, ...):                  of vacancies, impurities or solute atoms. The open source library
       """Create a periodic structure."""
                                                                       enumlib [HF08] implements algorithms for such a systematic
One consideration of note is that PDB, CIF and mmCIF structure         enumeration of periodic structures. The enumlib package consists
formats allow description of the positional disorder (for example,     of several Fortran binaries and Python scripts that can be run as a
a solvent molecule without a stable position within the cell           subprocess (no Python bindings). This allows the user to generate
which can be described by multiple sets of coordinates). Another       a large set of symmetrically nonequivalent materials with different
complication is that experimental data spans an interval of almost     compositions (e.g. doping or defect concentration).
a century: one of the oldest crystal structures deposited in the           Recently, we applied this approach in simultaneous study of
Another important consideration in structure generation is the modeling of substitutional disorder in solid alloys and materials with point defects (intermetallics, semiconductors, oxides and their crystalline surfaces). In such cases, the unit cell and atomic sites of the crystal or surface slab are well defined while the chemical species occupying a site may vary. In order to simulate substitutional disorder, one must generate the ensemble of structures that includes all statistically significant atomic distributions in a given unit cell. This can be achieved by a brute-force enumeration of all symmetrically unique atomic structures with a given number of vacancies, impurities or solute atoms. The open source library enumlib [HF08] implements algorithms for such a systematic enumeration of periodic structures. The enumlib package consists of several Fortran binaries and Python scripts that can be run as a subprocess (there are no Python bindings). This allows the user to generate a large set of symmetrically nonequivalent materials with different compositions (e.g. doping or defect concentrations).
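The same subprocess pattern applies to enumlib. The sketch below assumes the enum.x binary built from enumlib is on the PATH and that a struct_enum.in input file (lattice, sites and concentration ranges) has already been written to the working directory:

import subprocess
from pathlib import Path

workdir = Path("enumeration")  # contains a prepared struct_enum.in

# enum.x reads struct_enum.in from the working directory and writes
# struct_enum.out listing the symmetrically distinct configurations.
subprocess.run(["enum.x"], cwd=workdir, check=True)

out = workdir / "struct_enum.out"
print("enumeration finished,", len(out.read_text().splitlines()), "lines written")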
Recently, we applied this approach in a simultaneous study of the activity and stability of Pt-based core-shell type catalysts for the oxygen reduction reaction [MGF+19]. We generated a set of stable doped Pt/transition metal/nitrogen surfaces using periodic enumeration. Using QE to perform periodic density functional theory (DFT) calculations, we assessed surface phase diagrams for Pt alloys and identified avenues for stabilizing cost-effective core-shell systems by a judicious choice of the catalyst core material. Such catalysts may prove critical in electrocatalysis for fuel cell applications.
Workflow capabilities

In the last section, we briefly described a complete workflow from structure generation and enumeration to periodic DFT calculations to analysis. In order to run a massively parallel screening of materials, a highly scalable and stable queuing system (job scheduler) is required. We have implemented a job queuing system on top of the most widely used queuing systems (LSF, PBS, SGE, SLURM, TORQUE, UGE) and exposed a Python API to submit and monitor jobs. In line with technological advancements, the cloud is also supported by means of a virtual cluster configured with SLURM. This allows the user to submit a large number of jobs, limited only by SLURM scheduling capabilities and cloud resources. In order to accommodate job dependencies in workflows, one or more parent jobs can be defined for each job, forming a directed graph of jobs (Figure 3).

Fig. 3: Example of the job submission process.
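The dependency handling can be pictured with the following toy model (hypothetical code for illustration only, not the Schrödinger job API): each job records its parent jobs, and a job is released to the queue only after all of its parents have finished successfully.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    parents: list = field(default_factory=list)
    status: str = "waiting"  # waiting -> running -> done / failed

    def ready(self):
        # A job may start only when every parent finished successfully.
        return all(p.status == "done" for p in self.parents)

bulk = Job("bulk_scf")
deformed = [Job(f"deform_{i}", parents=[bulk]) for i in range(6)]
fit = Job("fit_eos", parents=deformed)  # directed graph of jobs, as in Figure 3

bulk.status = "done"
print([j.name for j in deformed + [fit] if j.ready()])  # deformations ready, fit is not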
There could be several reasons for a job to fail. Depending on the reason for the failure, there are several restart and recovery mechanisms in place. The lowest level is the restart mechanism (called requeue in SLURM), which is performed by the queuing system itself. This is triggered when a node goes down; on the cloud, preemptible instances (nodes) can go offline at any moment. In addition, workflows implemented in the proprietary Schrödinger Materials Science Suite have built-in methods for handling various types of failure. For example, if a simulation is not converging to the requested energy accuracy, it is wasteful to blindly restart the calculation without changing some input parameters. However, in the case of a failure due to a full disk, it is reasonable to restart in the hope of getting a node with more free disk space. If a job fails (and cannot be restarted), none of its children (if any) will start, thus saving queuing and computational time.

Having developed robust systems for running calculations, job queuing and troubleshooting (autonomously, when applicable), the developed workflows have allowed us and our customers to perform massive screenings of materials and their properties. For example, we reported a massive screening of 250,000 charge-conducting organic materials, totaling approximately 3,619,000 DFT SCF (self-consistent field) single-molecule calculations using Jaguar that took 457,265 CPU hours (~52 years) [MAS+20]. Another similar case study is the high-throughput molecular dynamics (MD) simulation of thermophysical properties of polymers for various applications [ABG+21]. There, using Desmond, we computed the glass transition temperature (Tg) of 315 polymers and compared the results with experimental measurements [Bic02]. This study took advantage of GPU (graphics processing unit) support as implemented in Desmond, as well as the job scheduler API described above.
Other workflows implemented in the Schrödinger Materials Science Suite utilize open source packages as well. For soft materials (polymers, organic small molecules and substrates composed of soft molecules), convex hull and related mathematical methods are important for finding accessible solvent voids (during submersion or sorption) and adsorbate sites (during molecular deposition). These methods are conveniently implemented in the open source SciPy [VGO+20] and NumPy [HMvdW+20] packages. Thus, we implemented molecular deposition and evaporation workflows by using the Desmond MD engine as the backend in tandem with the convex hull functionality. This workflow enables simulation of the deposition and evaporation of small molecules on a substrate. We utilized the aforementioned deposition workflow in a study of organic light-emitting diodes (OLEDs), which are fabricated using a stepwise process in which new layers are deposited on top of previous layers. Both vacuum and solution deposition processes have been used to prepare these films, primarily as amorphous thin-film active layers lacking long-range order. Each of these deposition techniques introduces changes to the film structure and, consequently, different charge-transfer and luminescent properties [WKB+22].
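As a simplified illustration of the geometric part of such a workflow (toy coordinates, not the production deposition code), SciPy's convex hull and Delaunay triangulation can be used to decide whether a trial point lies inside the hull spanned by substrate atoms:

import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(0)
substrate = rng.uniform(0.0, 10.0, size=(50, 3))  # toy atomic coordinates (Angstrom)

hull = ConvexHull(substrate)
print(f"substrate hull volume: {hull.volume:.1f} A^3")

# Delaunay.find_simplex returns -1 for points outside the triangulated hull,
# which is a simple test for candidate adsorption/insertion points.
tri = Delaunay(substrate[hull.vertices])
candidate = np.array([[5.0, 5.0, 12.0]])
print("candidate outside substrate hull:", bool(tri.find_simplex(candidate)[0] < 0))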
As can be seen from the above, a workflow is usually some form of structure modification through the structure object, with a subsequent call to a backend code and analysis of its output if it succeeds. In some workflows, the input for the next iteration depends on the output of the previous one. Due to the large chemical and manipulation space of materials, it is sometimes tricky to keep the code for all workflows following the same logic. For every workflow and/or functionality in the Materials Science Suite, some form of peer-reviewed material (a publication or conference presentation) is produced in which the implemented algorithms are described, to facilitate reproducibility.

Data fitting algorithms and use cases

Materials simulation engines for QM, periodic DFT, and classical MD (referred to herein as backends) are frequently written in compiled languages with parallelization enabled for CPU or GPU hardware. These backends are called from Python workflows using the job queuing systems described above. Meanwhile, packages such as SciPy and NumPy provide sophisticated numerical function optimization and fitting capabilities. Here, we describe examples of how the Schrödinger suite can be used to combine materials simulations with popular optimization routines in the SciPy ecosystem.
Recently we implemented convex analysis of the stress-strain curve (as described in [PKD18]). scipy.optimize.minimize is used for a constrained minimization, with boundary conditions, of a function related to the stress-strain curve. The stress-strain curve is obtained from a series of MD simulations on deformed cells (the cell deformations are defined by a strain type and deformation step), and the pressure tensor of a deformed cell is related to the stress. This analysis allowed prediction of the elongation at yield for a high-density polyethylene polymer. Figure 4 shows a calculated yield of 10% vs. an experimental value within the 9-18% range [BAS+20].
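The optimization step can be sketched as follows with synthetic data (a stand-in for the MD-derived stress-strain points; the objective function in [PKD18] is more involved than this simple least-squares form):

import numpy as np
from scipy.optimize import minimize

strain = np.linspace(0.0, 0.3, 31)
rng = np.random.default_rng(1)
stress = 40.0 * (1.0 - np.exp(-strain / 0.05)) + rng.normal(0.0, 0.5, strain.size)

def objective(p):
    # p = (plateau stress, characteristic strain); sum of squared residuals.
    s_max, s0 = p
    model = s_max * (1.0 - np.exp(-strain / s0))
    return np.sum((model - stress) ** 2)

# Constrained minimization with simple bounds on the parameters.
fit = minimize(objective, x0=[30.0, 0.1],
               bounds=[(0.0, 200.0), (1e-3, 1.0)], method="L-BFGS-B")
print(fit.x)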
The scipy.optimize package is also used for a least-squares fit of the bulk energies at different cell volumes (compressed and expanded) in order to obtain the bulk modulus and equation of state (EOS) of a material. In the Schrödinger suite this was implemented as part of an EOS workflow, in which the fitting is performed on the results of a series of QE calculations on the original as well as compressed and expanded (deformed) cells. An example of a deformation applied to a structure in pymatgen:
from pymatgen.analysis.elasticity import strain
from pymatgen.core import lattice
from pymatgen.core import structure

deform = strain.Deformation([
    [1.0, 0.02, 0.02],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0]])

latt = lattice.Lattice([
    [3.84, 0.00, 0.00],
    [1.92, 3.326, 0.00],
    [0.00, -2.22, 3.14],
])

st = structure.Structure(
    latt,
    ["Si", "Si"],
    [[0, 0, 0], [0.75, 0.5, 0.75]])

strained_st = deform.apply_to_structure(st)
This is also an example of loosely coupled (embarrassingly parallel) jobs. In particular, the calculations on the deformed cells depend only on the bulk calculation and not on each other. Thus, all the deformation jobs can be submitted in parallel, facilitating high-throughput runs.
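The fitting step of the EOS workflow can be sketched with scipy.optimize.curve_fit and a third-order Birch-Murnaghan form (the volumes and energies below are invented; in the workflow they come from the QE runs on the deformed cells):

import numpy as np
from scipy.optimize import curve_fit

def birch_murnaghan(v, e0, v0, b0, b0p):
    # Third-order Birch-Murnaghan energy-volume equation of state.
    eta = (v0 / v) ** (2.0 / 3.0)
    return e0 + 9.0 * v0 * b0 / 16.0 * (
        (eta - 1.0) ** 3 * b0p + (eta - 1.0) ** 2 * (6.0 - 4.0 * eta))

volumes = np.array([36.0, 38.0, 40.0, 42.0, 44.0, 46.0])               # A^3
energies = np.array([-10.20, -10.35, -10.42, -10.41, -10.33, -10.20])  # eV

popt, _ = curve_fit(birch_murnaghan, volumes, energies,
                    p0=[energies.min(), volumes[np.argmin(energies)], 0.6, 4.0])
e0, v0, b0, b0p = popt
print(f"V0 = {v0:.2f} A^3, B0 = {b0 * 160.2177:.1f} GPa")  # eV/A^3 -> GPa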
Structure refinement from a powder diffraction experiment is another example where more complex optimization is used. Powder diffraction is a widely used method in drug discovery to assess the purity of a material and to discover known or unknown crystal polymorphs [KBD+21]. In particular, there is interest in fitting the experimental powder diffraction intensity peaks to the indexed peaks (Pawley refinement) [JPS92]. Here we employed the open source lmfit package [NSA+16] to perform a minimization of the multivariable Voigt-like function that represents the entire diffraction spectrum. This allows the user to refine (optimize) the unit cell parameters coming from the indexing data and, as a result, the goodness of fit (R-factor) between the experimental and simulated spectra is minimized.
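A toy version of such a fit with lmfit's built-in line shapes is shown below; a real Pawley refinement additionally constrains the peak positions through the unit cell parameters, which is omitted here:

import numpy as np
from lmfit.models import LinearModel, PseudoVoigtModel

# Synthetic two-peak "powder pattern" standing in for measured intensities.
x = np.linspace(10.0, 40.0, 600)
rng = np.random.default_rng(2)
y = (100 * np.exp(-((x - 21.5) / 0.15) ** 2)
     + 60 * np.exp(-((x - 33.2) / 0.18) ** 2)
     + 5.0 + rng.normal(0.0, 1.0, x.size))

model = LinearModel(prefix="bkg_")
params = model.make_params()
params["bkg_slope"].set(value=0.0)
params["bkg_intercept"].set(value=5.0)

for i, center in enumerate((21.5, 33.2)):
    peak = PseudoVoigtModel(prefix=f"p{i}_")
    params.update(peak.make_params())
    params[f"p{i}_center"].set(value=center)
    params[f"p{i}_sigma"].set(value=0.2, min=0.01)
    params[f"p{i}_amplitude"].set(value=50.0, min=0.0)
    model = model + peak

result = model.fit(y, params, x=x)
print(result.fit_report(min_correl=0.5))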
Machine learning techniques

Of late, there is great interest in machine-learning-assisted materials discovery, which requires several components. In order to train a model, benchmark data from simulations and/or experiments is required. Besides benchmark data, computation of the relevant descriptors is required (see below). Finally, a model based on the benchmark data and descriptors is generated that allows prediction of properties for novel materials. There are several techniques to generate the model, ranging from linear or non-linear fitting to neural networks. Tools include the open source DeepChem [REW+19] and AutoQSAR [DDS+16] from the Schrödinger suite. Depending on the type of material, benchmark data can be obtained using different codes available in the Schrödinger suite:

   •   small molecules and finite systems - Jaguar
   •   periodic systems - Quantum ESPRESSO
   •   larger polymeric and similar systems - Desmond

Different materials systems require different descriptors for featurization. For example, for crystalline periodic systems, we have implemented several sets of tailored descriptors. Generation of these descriptors again uses a mix of open source and Schrödinger proprietary tools. Specifically:

   •   elemental features such as atomic weight, number of valence electrons in the s, p and d shells, and electronegativity
   •   structural features such as density, volume per atom, and packing fraction descriptors implemented in the open source matminer package [WDF+18]
   •   intercalation descriptors such as cation and anion counts, crystal packing fraction, and average neighbor ionicity [SYC+17] implemented in the Schrödinger suite
   •   three-dimensional smooth overlap of atomic positions (SOAP) descriptors implemented in the open source DScribe package [HJM+20] (see the sketch after this list)
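For example, SOAP vectors for a periodic structure can be generated with DScribe roughly as follows (keyword names follow DScribe 2.x; older releases use rcut/nmax/lmax, and the structure here is a toy ASE crystal rather than one from the workflow):

from ase.build import bulk
from dscribe.descriptors import SOAP

si = bulk("Si", "diamond", a=5.43)  # toy periodic structure

soap = SOAP(species=["Si"], r_cut=5.0, n_max=8, l_max=6,
            periodic=True, average="inner")
features = soap.create(si)          # one fixed-length vector per structure
print(features.shape)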
We are currently training models that use these descriptors to predict properties, such as the bulk modulus, for a set of Li-containing battery-related compounds [Cha]. Several models will be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+11]) and AutoQSAR.

For isolated small molecules and extended non-periodic systems, RDKit can be used to generate a large number of atomic and molecular descriptors. A lot of effort has been devoted to ensuring that RDKit can be used on the wide variety of materials supported by the Schrödinger suite. At the time of writing, the 4th most active contributor to RDKit is Ricardo Rodriguez-Schmidt from Schrödinger [RDK].
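As a small illustration of the RDKit side (the SMILES string is just an example molecule, not one from the screened dataset):

from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("c1ccc(N(c2ccccc2)c2ccccc2)cc1")  # triphenylamine

# A few individual descriptors ...
print(Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol))

# ... or the full built-in descriptor list as a feature vector.
features = {name: fn(mol) for name, fn in Descriptors.descList}
print(len(features), "molecular descriptors")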




Fig. 4: Left: the uniaxial stress-strain curve of a polymer calculated using Desmond through the stress-strain workflow. The dark grey band indicates an inflection that marks the yield point. Right: a constant strain simulation with convex analysis indicates elongation at yield. The red curve shows simulated stress versus strain; the blue curve shows the convex analysis.




Fig. 5: Active learning workflow for the design and discovery of novel optoelectronic molecules.


Recently, active learning (AL) combined with DFT has received much attention as a way to address the challenge of leveraging exhaustive libraries in materials informatics [VPB21], [SPA+19]. On our side, we have implemented a workflow that employs active learning for the intelligent and iterative identification of promising materials candidates within a large dataset. In the AL framework, the predicted value together with its associated uncertainty is used to decide which materials to add in each iteration, aiming to improve the model performance in the next iteration (Figure 5).

Since it can be important to consider multiple properties simultaneously in materials discovery, multiple property optimization (MPO) has also been implemented as a part of the AL workflow [KAG+22]. MPO allows scaling and combining multiple properties into a single score. We employed the AL workflow to determine the top candidates for the hole (positively charged carrier) transport layer (HTL) by evaluating 550 molecules in 10 iterations using DFT calculations, for a dataset of ~9,000 molecules [AKA+22]. The resulting model was validated by randomly picking a molecule from the dataset, computing its properties with DFT and comparing those to the predicted values. According to the semi-classical Marcus equation [Mar93], high rates of hole transfer are inversely proportional to the hole reorganization energy. Thus, MPO scores were computed based on minimizing the hole reorganization energy and targeting the oxidation potential to an appropriate level to ensure a low energy barrier for hole injection from the anode into the emissive layer. In this workflow, we used RDKit to compute descriptors for the chemical structures. These descriptors, generated on the initial subset of structures, are given as vectors to an algorithm based on the Random Forest Regressor as implemented in scikit-learn. Bayesian optimization is employed to tune the hyperparameters of the model. In each iteration, the trained model is applied to make predictions on the remaining materials in the dataset. Figure 6 (A) displays MPO scores for the HTL dataset estimated by AL as a function of the hole reorganization energies, which were calculated separately for all the materials. This figure indicates that there are many materials in the dataset with the desired low hole reorganization energies that are nevertheless not suitable for the HTL due to improper oxidation potentials, suggesting that MPO is important for evaluating the optoelectronic performance of the materials. Figure 6 (B) presents the MPO scores of the materials used in the training dataset of AL, demonstrating that the feedback loop in the AL workflow efficiently guides the data collection as the size of the training set increases.
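A stripped-down sketch of that selection loop is shown below; the data and scoring are synthetic stand-ins (the production workflow uses DFT-computed MPO scores and Bayesian hyperparameter tuning, both omitted here), with the spread over the forest's trees serving as a simple uncertainty estimate:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 16))            # descriptor vectors (e.g. from RDKit)
mpo_true = X[:, 0] - 0.5 * X[:, 1] ** 2    # stand-in for the DFT-derived MPO score

labeled = list(rng.choice(len(X), size=50, replace=False))  # initial random batch
for iteration in range(5):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[labeled], mpo_true[labeled])               # "DFT" results so far

    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    score = per_tree.mean(axis=0) + per_tree.std(axis=0)   # exploit + explore
    score[labeled] = -np.inf                               # do not re-select

    batch = np.argsort(score)[-20:]        # next candidates to "send to DFT"
    labeled.extend(batch.tolist())

print(f"labeled {len(labeled)} of {len(X)} candidates after 5 iterations")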
To appreciate the computational efficiency of such an approach, it is worth noting that performing DFT calculations for all of the ~9,000 molecules in the dataset would increase the computational cost by a factor of 15 compared with the AL workflow. The AL approach appears to be most useful when the problem space is broad (like chemical space) but contains many clusters of similar items (similar molecules); in this case, benchmark data is only needed for a few representatives of each cluster. We are currently working on applying this approach to train models for predicting physical properties of soft materials (polymers).

Conclusions

We present several examples of how the Schrödinger Materials Suite integrates open source software packages. There is a wide range of applications in materials science that can benefit from already existing open source code. Where possible, we report issues to the package authors and submit improvements and bug fixes in the form of pull requests. We are thankful to all who have contributed to open source libraries and have made it possible for us to develop a platform for accelerating innovation in materials and drug discovery. We will continue contributing to these projects, and we hope to further give back to the scientific community by facilitating research in both academia and industry. We hope that this report will inspire other scientific companies to give back to the open source community in order to improve the computational materials field and make science more reproducible.

Acknowledgments

The authors acknowledge Bradley Dice and Wenduo Zhou for their valuable comments during the review of the manuscript.



Fig. 6: A: MPO scores of all materials in the HTL dataset. B: MPO scores of the materials used in the training set, as a function of the hole reorganization energy (λh).


References

[ABG+21] Mohammad Atif Faiz Afzal, Andrea R. Browning, Alexander Goldberg, Mathew D. Halls, Jacob L. Gavartin, Tsuguo Morisato, Thomas F. Hughes, David J. Giesen, and Joseph E. Goose. High-throughput molecular dynamics simulations and validation of thermophysical properties of polymers for various applications. ACS Applied Polymer Materials, 3, 2021. doi:10.1021/acsapm.0c00524.
[AKA+22] Hadi Abroshan, H. Shaun Kwak, Yuling An, Christopher Brown, Anand Chandrasekaran, Paul Winget, and Mathew D. Halls. Active learning accelerates design and optimization of hole-transporting materials for organic electronics. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800371.
[BAS+20] A. R. Browning, M. A. F. Afzal, J. Sanders, A. Goldberg, A. Chandrasekaran, and H. S. Kwak. Polyolefin molecular simulation for critical physical characteristics. International Polyolefins Conference, 2020.
[BDBF] Davide Brunato, Pietro Delugas, Giovanni Borghi, and Alexandr Fonari. qeschema. URL: https://github.com/QEF/qeschema.
[BHH+13] Art D. Bochevarov, Edward Harder, Thomas F. Hughes, Jeremy R. Greenwood, Dale A. Braden, Dean M. Philipp, David Rinaldo, Mathew D. Halls, Jing Zhang, and Richard A. Friesner. Jaguar: A high-performance quantum chemistry software program with strengths in life and materials sciences. International Journal of Quantum Chemistry, 113, 2013. doi:10.1002/qua.24481.
[Bic02] Jozef Bicerano. Prediction of Polymer Properties. CRC Press, 2002.
[Cha] A. Chandrasekaran. Active learning accelerated design of ionic materials. In progress.
[DDS+16] Steven L. Dixon, Jianxin Duan, Ethan Smith, Christopher D. Von Bargen, Woody Sherman, and Matthew P. Repasky. AutoQSAR: An automated machine learning tool for best-practice quantitative structure-activity relationship modeling. Future Medicinal Chemistry, 8, 2016. doi:10.4155/fmc-2016-0093.
[GAB+17] P. Giannozzi et al. Advanced capabilities for materials modelling with Quantum ESPRESSO. Journal of Physics: Condensed Matter, 29, 2017. URL: https://www.quantum-espresso.org/, doi:10.1088/1361-648X/aa8f79.
[GBLW16] Colin R. Groom, Ian J. Bruno, Matthew P. Lightfoot, and Suzanna C. Ward. The Cambridge Structural Database. Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials, 72, 2016. doi:10.1107/S2052520616003954.
[HF08] Gus L. W. Hart and Rodney W. Forcade. Algorithm for generating derivative structures. Physical Review B, 77, 2008. URL: https://github.com/msg-byu/enumlib/, doi:10.1103/PhysRevB.77.224115.
[HJM+20] Lauri Himanen, Marc O. J. Jager, Eiaki V. Morooka, Filippo Federici Canova, Yashasvi S. Ranawat, David Z. Gao, Patrick Rinke, and Adam S. Foster. DScribe: Library of descriptors for machine learning in materials science. Computer Physics Communications, 247, 2020. URL: https://singroup.github.io/dscribe/latest/, doi:10.1016/j.cpc.2019.106949.
[HM24] O. Hassel and H. Mark. The crystal structure of graphite. Physik. Z., 25:317–337, 1924.
[HMvdW+20] Charles R. Harris et al. Array programming with NumPy, 2020. URL: https://numpy.org/, doi:10.1038/s41586-020-2649-2.
[HZU+20] Sebastiaan P. Huber et al. AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance. Scientific Data, 7, 2020. URL: https://www.aiida.net/, doi:10.1038/s41597-020-00638-4.
[JPS92] J. Jansen, R. Peschar, and H. Schenk. Determination of accurate intensities from powder diffraction data. I. Whole-pattern fitting with a least-squares procedure. Journal of Applied Crystallography, 25, 1992. doi:10.1107/S0021889891012104.
[KAG+22] H. Shaun Kwak, Yuling An, David J. Giesen, Thomas F. Hughes, Christopher T. Brown, Karl Leswing, Hadi Abroshan, and Mathew D. Halls. Design of organic electronic materials with a goal-directed generative model powered by deep neural networks and high-throughput molecular simulations. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800370.
[KBD+21] James A. Kaduk, Simon J. L. Billinge, Robert E. Dinnebier, Nathan Henderson, Ian Madsen, Radovan Černý, Matteo Leoni, Luca Lutterotti, Seema Thakral, and Daniel Chateigner. Powder diffraction. Nature Reviews Methods Primers, 1:77, 2021. URL: https://doi.org/10.1038/s43586-021-00074-7, doi:10.1038/s43586-021-00074-7.
[LLC22] Schrödinger, LLC. Schrödinger Release 2022-2: Materials Science Suite, 2022. URL: https://www.schrodinger.com/platform/materials-science.
[LMB+17] Ask Hjorth Larsen et al. The atomic simulation environment - a Python library for working with atoms, 2017. URL: https://wiki.fysik.dtu.dk/ase/, doi:10.1088/1361-648X/aa680e.
[LTK+22] Greg Landrum et al. RDKit, June 2022. URL: https://rdkit.org/, doi:10.5281/ZENODO.6605135.
[Mar93] Rudolph A. Marcus. Electron transfer reactions in chemistry. Theory and experiment. Reviews of Modern Physics, 65, 1993. doi:10.1103/RevModPhys.65.599.
[MAS+20] Nobuyuki N. Matsuzawa, Hideyuki Arai, Masaru Sasago, Eiji Fujii, Alexander Goldberg, Thomas J. Mustard, H. Shaun Kwak, David J. Giesen, Fabio Ranalli, and Mathew D. Halls. Massive theoretical screen of hole conducting organic materials in the heteroacene family by using a cloud-computing environment. Journal of Physical Chemistry A, 124, 2020. doi:10.1021/acs.jpca.9b10998.
[MGF+19] Thomas Mustard, Jacob Gavartin, Alexandr Fonari, Caroline Krauter, Alexander Goldberg, H. Kwak, Tsuguo Morisato, Sudharsan Pandiyan, and Mathew Halls. Surface reactivity and stability of core-shell solid catalysts from ab initio combinatorial calculations. Volume 258, 2019.
[NSA+16] Matthew Newville, Till Stensitzki, Daniel B. Allen, Michal Rawlik, Antonino Ingargiola, and Andrew Nelson. Lmfit: Non-linear least-square minimization and curve-fitting for Python. Astrophysics Source Code Library, page ascl-1606, 2016. URL: https://lmfit.github.io/lmfit-py/.
[OBJ+11] Noel M. O'Boyle, Michael Banck, Craig A. James, Chris Morley, Tim Vandermeersch, and Geoffrey R. Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3, 2011. URL: https://openbabel.org/, doi:10.1186/1758-2946-3-33.
[ORJ+13] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68, 2013. URL: https://pymatgen.org/, doi:10.1016/j.commatsci.2012.10.028.
[PKD18] Paul N. Patrone, Anthony J. Kearsley, and Andrew M. Dienstfrey. The role of data analysis in uncertainty quantification: Case studies for materials modeling. Volume 0, 2018. doi:10.2514/6.2018-0927.
[PVG+11] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011. URL: https://scikit-learn.org/.
[RDK] RDKit contributors. URL: https://github.com/rdkit/rdkit/graphs/contributors.
[REW+19] Bharath Ramsundar, Peter Eastman, Patrick Walters, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep Learning for the Life Sciences. O'Reilly Media, 2019. URL: https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837.
[SGB+14] David E. Shaw et al. Anton 2: Raising the bar for performance and programmability in a special-purpose molecular dynamics supercomputer. Volume 2015-January, 2014. doi:10.1109/SC.2014.9.
[SPA+19] Gabriel R. Schleder, Antonio C. M. Padilha, Carlos Mera Acosta, Marcio Costa, and Adalberto Fazzio. From DFT to machine learning: Recent approaches to materials science - a review. JPhys Materials, 2, 2019. doi:10.1088/2515-7639/ab084b.
[SYC+17] Austin D. Sendek, Qian Yang, Ekin D. Cubuk, Karel-Alexander N. Duerloo, Yi Cui, and Evan J. Reed. Holistic computational structure screening of more than 12000 candidates for solid lithium-ion conductor materials. Energy and Environmental Science, 10:306–320, 2017. doi:10.1039/c6ee02697d.
[VGO+20] Pauli Virtanen et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 2020. doi:10.1038/s41592-019-0686-2.
[VPB21] Rama Vasudevan, Ghanshyam Pilania, and Prasanna V. Balachandran. Machine learning for materials design and discovery. Journal of Applied Physics, 129, 2021. doi:10.1063/5.0043300.
[WDF+18] Logan Ward, Alexander Dunn, Alireza Faghaninia, Nils E. R. Zimmermann, Saurabh Bajaj, Qi Wang, Joseph Montoya, Jiming Chen, Kyle Bystrom, Maxwell Dylla, Kyle Chard, Mark Asta, Kristin A. Persson, G. Jeffrey Snyder, Ian Foster, and Anubhav Jain. Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 2018. URL: https://hackingmaterials.lbl.gov/matminer/, doi:10.1016/j.commatsci.2018.05.018.
[WF05] John D. Westbrook and Paula M. D. Fitzgerald. The PDB format, mmCIF formats, and other data formats, 2005. doi:10.1002/0471721204.ch8.
[WKB+22] Paul Winget, H. Shaun Kwak, Christopher T. Brown, Alexandr Fonari, Kevin Tran, Alexander Goldberg, Andrea R. Browning, and Mathew D. Halls. Organic thin films for OLED applications: Influence of molecular structure, deposition method, and deposition conditions. International Conference on the Science and Technology of Synthetic Metals, 2022.




A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space

Seyed Alireza Vaezi‡∗, Gianni Orlando‡, Mojtaba Fazli§, Gary Ward¶, Silvia Moreno‡, Shannon Quinn‡






Abstract—Toxoplasma gondii is the parasitic protozoan that causes disseminated toxoplasmosis, a disease that is estimated to infect around one-third of the world's population. While the disease is commonly asymptomatic, the success of the parasite is in large part due to its ability to spread easily through nucleated cells. The virulence of T. gondii is predicated on the parasite's motility, so the inspection of motility patterns during its lytic cycle has become a topic of keen interest. Current cell tracking projects usually focus on cell images captured in 2D, which are not a true representation of the actual motion of a cell. Current 3D tracking projects lack a comprehensive pipeline covering all phases of preprocessing, cell detection, cell instance segmentation, tracking, and motion classification, and merely implement a subset of these phases. Moreover, current 3D segmentation and tracking pipelines are not targeted at users with little experience in deep learning packages. Our pipeline, TSeg, on the other hand, is developed for segmenting, tracking, and classifying the motility phenotypes of T. gondii in 3D microscopic images. Although TSeg was built initially with a focus on T. gondii, it provides generic functions that allow users with similar but distinct applications to use it off the shelf. Interacting with all of TSeg's modules is possible through our Napari plugin, which is developed mainly on top of the familiar SciPy scientific stack. Additionally, our plugin is designed with a user-friendly GUI in Napari, which adds several benefits to each step of the pipeline, such as visualization and representation in 3D. TSeg achieves better generalization, making it capable of delivering accurate results with images of other cell types.
other cell types.                                                                           image noise [KPR+ 21]. The segmentation module also includes
                                                                                            the optional use of CellPose [SWMP21]. CellPose is a generalized
Introduction                                                                                segmentation algorithm trained on a wide range of cell types
Quantitative cell research often requires the measurement of                                and is the first step toward increased optionality in TSeg. The
different cell properties including size, shape, and motility. This                         Cell Tracking module consolidates the cell particles across the z-
step is facilitated using segmentation of imaged cells. With flu-                           axis to materialize cells in 3D space and estimates centroids for
orescent markers, computational tools can be used to complete                               each cell. The tracking module is also responsible for extracting
segmentation and identify cell features and positions over time.                            the trajectories of cells based on the movements of centroids
2D measurements of cells can be useful, but the more difficult task                         throughout consecutive video frames, which is eventually the input
of deriving 3D information from cell images is vital for metrics                            of the motion classifier module.
such as motility and volumetric qualities.                                                       Most of the state-of-the-art pipelines are restricted to 2D space
    Toxoplasmosis is an infection caused by the intracellular                               which is not a true representative of the actual motion of the
parasite Toxoplasma gondii. T. gondii is one of the most suc-                               organism. Many of them require knowledge and expertise in pro-
cessful parasites, infecting at least one-third of the world’s pop-                         gramming, or in machine learning and deep learning models and
ulation. Although Toxoplasmosis is generally benign in healthy                              frameworks, thus limiting the demographic of users that can use
                                                                                            them. All of them solely include a subset of the aforementioned
* Corresponding author: sv22900@uga.edu                                                     modules (i.e. detection, segmentation, tracking, and classification)
‡ University of Georgia
§ harvard University                                                                        [SWMP21]. Many pipelines rely on the user to train their own
¶ University of Vermont                                                                     model, hand-tailored for their specific application. This demands
                                                                                            high levels of experience and skill in ML/DL and consequently
Copyright © 2022 Seyed Alireza Vaezi et al. This is an open-access article                  undermines the possibility and feasibility of quickly utilizing an
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,               off-the-shelf pipeline and still getting good results.
provided the original author and source are credited.                                            To address these we present TSeg. It segments T. gondii cells
To address these issues we present TSeg. It segments T. gondii cells in 3D microscopic images, tracks their trajectories, and classifies the motion patterns observed throughout the 3D frames. TSeg is comprised of four modules: pre-processing, segmentation, tracking, and classification. We developed TSeg as a plugin for Napari [SLE+22] - an open-source, fast and interactive image viewer for Python designed for browsing, annotating, and analyzing large multi-dimensional images. Having TSeg implemented as a part of Napari not only provides a user-friendly design but also gives more advanced users the possibility to attach and execute their custom code and even interact with the steps of the pipeline if needed. The preprocessing module is equipped with basic and extra filters and functionalities to aid in the preparation of the input data. TSeg gives its users the advantage of utilizing the functionalities that PlantSeg and CellPose provide. These functionalities can be chosen in the pre-processing, detection, and segmentation steps. This brings forth a huge variety of algorithms and pre-built models to select from, making TSeg not only a great fit for T. gondii, but also for a variety of different cell types.

Fig. 1: The overview of TSeg's architecture.

The rest of this paper is structured as follows: after briefly reviewing the literature in Related Work, we move on to thoroughly describe the details of our work in the Method section. Following that, the Results section depicts the results of comprehensive tests of our plugin on T. gondii cells.

Related Work

The recent solutions in generalized and automated segmentation tools are focused on 2D cell images. Segmentation of cellular structures in 2D is important but not representative of realistic environments. Microbiological organisms are free to move along the z-axis, and tracking without taking this factor into account cannot guarantee a full representation of the actual motility patterns.
As an example, Fazli et al. [FVMQ18] identified three distinct motility types for T. gondii with two-dimensional data; however, they also acknowledge that, based on established heuristics from previous works, there are more than three motility phenotypes for T. gondii. The focus on 2D research is understandable due to several factors. 3D data is difficult to capture, as tools for capturing 3D slices and the computational requirements for analyzing this data are not available in most research labs. Most segmentation tools are unable to track objects in 3D space, as the assignment of related centroids is more difficult. The additional noise from capture and focus increases the probability of incorrect assignment. 3D data also has issues with overlapping features and increased computation required per frame of time.

Fazli et al. [FVMQ18] study the motility patterns of T. gondii and provide a computational pipeline for identifying motility phenotypes of T. gondii in an unsupervised, data-driven way. In that work, Ca2+ is added to T. gondii cells in Fetal Bovine Serum. T. gondii cells react to the Ca2+ and become motile and fluorescent. The images of motile T. gondii cells were captured using an LSM 710 confocal microscope. They use Python 3 and the associated scientific computing libraries (NumPy, SciPy, scikit-learn, matplotlib) in their pipeline to track and cluster the trajectories of T. gondii. Based on this work, Fazli et al. [FVM+18] worked on another pipeline consisting of preprocessing, sparsification, cell detection, and cell tracking modules to track T. gondii in 3D video microscopy, where each frame of the video consists of image slices taken 1 micrometer of focal depth apart along the z-axis direction. In their latest work, Fazli et al. [FSA+19] developed a lightweight and scalable pipeline using task distribution and parallelism. Their pipeline consists of multiple modules: preprocessing, sparsification, cell detection, cell tracking, trajectory extraction, parametrization of the trajectories, and clustering. They could classify three distinct motion patterns in T. gondii using the same data from their previous work.

While combining open source tools is not a novel architecture, little has been done to integrate 3D cell tracking tools. Fazeli et al. [FRF+20], motivated by the same interest in providing better tools to non-software professionals, created a 2D cell tracking pipeline. This pipeline combines StarDist [WSH+20] and TrackMate [TPS+17] for automated cell tracking. The pipeline begins with the user loading cell images and centroid approximations to the ZeroCostDL4Mic [vCLJ+21] platform. ZeroCostDL4Mic is a deep learning training tool for those with no coding expertise. Once the platform is trained and masks for the training set are made from hand-drawn annotations, the training set can be input to StarDist. StarDist performs automated object detection using Euclidean distance to probabilistically determine cell pixels versus background pixels. Lastly, TrackMate uses the segmentation images to track labels between timeframes and display analytics.

This StarDist pipeline is similar in concept to TSeg. Both create an automated segmentation and tracking pipeline, but TSeg is oriented toward 3D data. Cells move in 3-dimensional space that is not represented in a flat plane. TSeg also does not require the manual training necessary for the other pipeline. Individuals with low technical expertise should not be expected to create masks for training or even understand the training of deep neural networks. Lastly, that pipeline does not account for imperfect datasets without the need for preprocessing. All implemented algorithms in TSeg account for microscopy images with some amount of noise.
technologies, including deep learning, and present 3DeeCellTracker,
which segments and tracks cells in 3D time-lapse images. Using a
small subset of their dataset, they train the deep learning
architecture 3D U-Net for segmentation. For tracking, a combination
of two strategies is used to increase accuracy: local cell region
strategies and a spatial pattern strategy. Kapoor et al. [KC21]
present VollSeg, which uses deep learning methods to segment,
track, and analyze cells in 3D with irregular shape and intensity
distribution. It is a Jupyter Notebook-based Python package and
also has a UI in Napari. For tracking, custom tracking code is
developed based on TrackMate.
    Many segmentation tools require some amount of knowledge of
machine or deep learning concepts. Training the neural network
that creates masks is a common step for open-source segmentation
tools. Automating this process makes the pipeline more accessible
to microbiology researchers.

Method

Data

Our dataset consists of 11 videos of T. gondii cells under a
microscope, obtained from different experiments with different
numbers of cells. The videos are on average around 63 frames in
length. Each frame has a stack of 41 image slices of size 500×502
pixels along the z-axis (z-slices). The z-slices are captured 1µm
apart in optical focal length, making each stack 402µm×401µm×40µm
in volume. The slices were recorded in raw format as RGB TIF
images but are converted to grayscale for our purpose. This data
was captured using a PlanApo 20x objective (NA = 0.75) on a
preheated Nikon Eclipse TE300 epifluorescence microscope. The
image stacks were captured using an iXon 885 EMCCD camera
(Andor Technology, Belfast, Ireland) cooled to -70°C and driven
by NIS Elements software (Nikon Instruments, Melville, NY) as
part of related research by Ward et al. [LRK+14]. The camera was
set to frame transfer sensor mode, with a vertical pixel shift speed
of 1.0 µs, a vertical clock voltage amplitude of +1, a readout speed
of 35MHz, a conversion gain of 3.8×, an EM gain setting of 3, and
2×2 binning, and the z-slices were imaged with an exposure time of
16ms.

Software

    Napari Plugin: TSeg is developed as a plugin for Napari, a
fast and interactive multi-dimensional image viewer for Python
that allows volumetric viewing of 3D images [SLE+22]. Plugins
enable developers to customize and extend the functionality of
Napari. For every module of TSeg we developed a corresponding
widget in the GUI, plus a widget for file management. The widgets
have self-explanatory interface elements with tooltips that guide
the inexperienced user through the pipeline with ease. Layers in
Napari are the basic viewable objects that can be shown in the
Napari viewer. Seven different layer types are supported in Napari:
Image, Labels, Points, Shapes, Surface, Tracks, and Vectors, each
of which corresponds to a different data type, visualization, and
interactivity [SLE+22]. After its execution, the viewable output of
each widget gets added to the layers. This allows the user to
evaluate and modify the parameters of a widget to get the best
results before continuing to the next widget. Napari supports
bidirectional communication between the viewer and the Python
kernel and has a built-in console that allows users to control all
the features of the viewer programmatically. This adds more
flexibility and customizability to TSeg for the advanced user. The
full code of TSeg is available on GitHub under the MIT open source
license at https://github.com/salirezav/tseg. TSeg can be installed
through Napari's plugins menu.

Computational Pipeline

    Pre-Processing: Due to the fast imaging speed in data
acquisition, the image slices inherently have a vignetting artifact,
meaning that the corners of the images are slightly darker than the
center of the image. To eliminate this artifact we added adaptive
thresholding and logarithmic correction to the pre-processing
module. Furthermore, another prevalent artifact in our dataset
images was film-grain noise (also known as salt-and-pepper noise).
To remove or reduce such noise, a simple Gaussian blur filter and a
sharpening filter are included.
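    The exact filter implementations are not listed here; as a rough
sketch of this correction-and-denoising chain, a scikit-image version
might look like the following (the function choices and parameter
values are illustrative assumptions, not TSeg's actual code):

import numpy as np
from skimage import exposure, filters

def preprocess_slice(img: np.ndarray) -> np.ndarray:
    """Illustrative correction/denoise/sharpen chain for one z-slice."""
    corrected = exposure.adjust_log(img, gain=1.0)      # lift darker corners
    denoised = filters.gaussian(corrected, sigma=1.0)   # suppress film-grain noise
    return filters.unsharp_mask(denoised, radius=2, amount=1.0)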
                                                                       environment variable allows the bash command to start conda and
Software                                                               context switch to the correct segmentation environment.
        Napari Plugin: TSeg is developed as a plugin for Napari -               Tracking: Features in each segmented image are found
a fast and interactive multi-dimensional image viewer for python       using the scipy label function. In order to reduce any leftover
that allows volumetric viewing of 3D images [SLE+ 22]. Plugins         noise, any features under a minimum size are filtered out and
enable developers to customize and extend the functionality of         considered leftover noise. After feature extraction, centroids are
Napari. For every module of TSeg, we developed its corresponding       calculated using the center of mass function in scipy. The centroid
widget in the GUI, plus a widget for file management. The widgets      of the 3D cell can be used as a representation of the entire
have self-explanatory interface elements with tooltips to guide        body during tracking. The tracking algorithm goes through each
the inexperienced user to traverse through the pipeline with ease.     captured time instance and connects centroids to the likely next
Layers in Napari are the basic viewable objects that can be shown      movement of the cell. Tracking involves a series of measures in or-
in the Napari viewer. Seven different layer types are supported        der to avoid incorrect assignments. An incorrect assignment could
in Napari: Image, Labels, Points, Shapes, Surface, Tracks, and         lead to inaccurate result sets and unrealistic motility patterns. If the
Vectors, each of which corresponds to a different data type,           same number of features in each frame of time could be guaranteed
visualization, and interactivity [SLE+ 22]. After its execution, the   from segmentation, minimum distance could assign features rather
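    As a minimal sketch of this controller pattern (the environment
name, command-line invocation, and helper function are assumptions
rather than TSeg's exact code), the idea is to write a configuration
file from the GUI values and launch the chosen tool inside its own
conda environment via a subprocess:

import subprocess
import yaml

def run_plantseg(params: dict, config_path: str = "plantseg_config.yaml") -> None:
    """Write a YAML config from GUI values and run PlantSeg in its conda env."""
    with open(config_path, "w") as f:
        yaml.safe_dump(params, f)
    # "conda run -n <env>" executes a command inside the named environment.
    subprocess.run(
        ["conda", "run", "-n", "plant-seg", "plantseg", "--config", config_path],
        check=True,
    )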
    Tracking: Features in each segmented image are found using
the scipy label function. In order to reduce any leftover noise,
features under a minimum size are filtered out. After feature
extraction, centroids are calculated using the center-of-mass
function in scipy. The centroid of a 3D cell can be used as a
representation of the entire body during tracking. The tracking
algorithm goes through each captured time instance and connects
centroids to the likely next position of the cell. Tracking involves
a series of measures to avoid incorrect assignments, since an
incorrect assignment could lead to inaccurate result sets and
unrealistic motility patterns. If the same number of features in
every time frame could be guaranteed from segmentation, a
minimum-distance rule could assign features rather accurately.
Since this is not guaranteed, the Hungarian algorithm is used to
associate a cost with each candidate assignment. The Hungarian
method is a combinatorial optimization algorithm that solves the
assignment problem in polynomial time. The cost determines which
feature is the next iteration of a cell's track through the complete
time series; it combines the distance between the centroids of all
previous points and the distance to the potential new centroid.
If an optimal next centroid cannot be found within an acceptable
distance of the current point, the tracking for the cell is
considered complete. Likewise, if a feature is not assigned to a
current centroid, this feature is considered a new object and is
tracked as the algorithm progresses. The complete path for each
feature is then stored for motility analysis.
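    The feature-extraction and frame-linking steps described above
can be sketched with SciPy as follows; this is a minimal illustration
assuming 3D binary masks per time point (the minimum size and
maximum linking distance are made-up values), not TSeg's exact
implementation:

import numpy as np
from scipy import ndimage
from scipy.optimize import linear_sum_assignment

def extract_centroids(mask: np.ndarray, min_size: int = 50) -> np.ndarray:
    """Label a 3D binary mask, drop small objects, return centroid coordinates."""
    labeled, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_size]
    return np.array(ndimage.center_of_mass(mask, labeled, keep))

def link_frames(prev: np.ndarray, curr: np.ndarray, max_dist: float = 20.0):
    """Match centroids between consecutive frames with the Hungarian method."""
    cost = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # Pairs farther apart than max_dist end a track; unmatched current
    # centroids start new tracks.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]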
    Motion Classification: To classify the motility patterns of
T. gondii in 3D space in an unsupervised fashion, we implement and
use the method that Fazli et al. introduced [FSA+19]. In that work,
they used an autoregressive (AR) model: a linear dynamical system
that encodes a Markov-based transition prediction method. The
reason is that, although K-means is a popular clustering algorithm,
it and other conventional methods have drawbacks that render them
impractical here. Firstly, K-means assumes Euclidean distance, but
AR motion parameters are geodesics that do not reside in a
Euclidean space; secondly, K-means assumes isotropic clusters, and
although AR motion parameters may exhibit isotropy in their own
space, without a proper distance metric this cannot be clearly
examined [FSA+19].

Conclusion and Discussion

TSeg is an easy-to-use pipeline designed to study the motility
patterns of T. gondii in 3D space. It is developed as a plugin for
Napari and is equipped with a variety of deep learning based
segmentation tools borrowed from PlantSeg and CellPose, making it
a suitable off-the-shelf tool for applications involving images of
cell types not limited to T. gondii. Future work on TSeg includes
the expansion of the implemented algorithms and tools in its
preprocessing, segmentation, tracking, and clustering modules.

References

[FRF+20]  Elnaz Fazeli, Nathan H Roy, Gautier Follain, Romain F Laine, Lucas
          von Chamier, Pekka E Hänninen, John E Eriksson, Jean-Yves Tinevez,
          and Guillaume Jacquemet. Automated cell tracking using stardist and
          trackmate. F1000Research, 9, 2020. doi:10.12688/f1000research.27019.1.
[FSA+19]  Mojtaba Sedigh Fazli, Rachel V Stadler, BahaaEddin Alaila, Stephen A
          Vella, Silvia NJ Moreno, Gary E Ward, and Shannon Quinn. Lightweight
          and scalable particle tracking and motion clustering of 3d cell
          trajectories. In 2019 IEEE International Conference on Data Science
          and Advanced Analytics (DSAA), pages 412-421. IEEE, 2019.
          doi:10.1109/dsaa.2019.00056.
[FVM+18]  Mojtaba S Fazli, Stephen A Vella, Silvia NJ Moreno, Gary E Ward, and
          Shannon P Quinn. Toward simple & scalable 3d cell tracking. In 2018
          IEEE International Conference on Big Data (Big Data), pages
          3217-3225. IEEE, 2018. doi:10.1109/BigData.2018.8622403.
[FVMQ18]  Mojtaba S Fazli, Stephen A Vella, Silvia NJ Moreno, and Shannon
          Quinn. Unsupervised discovery of toxoplasma gondii motility
          phenotypes. In 2018 IEEE 15th International Symposium on Biomedical
          Imaging (ISBI 2018), pages 981-984. IEEE, 2018.
          doi:10.1109/isbi.2018.8363735.
[KC21]    Varun Kapoor and Claudia Carabaña. Cell tracking in 3d using deep
          learning segmentations. In Python in Science Conference, pages
          154-161, 2021. doi:10.25080/majora-1b6fd038-014.
[KPR+21]  Anuradha Kar, Manuel Petit, Yassin Refahi, Guillaume Cerutti,
          Christophe Godin, and Jan Traas. Assessment of deep learning
          algorithms for 3d instance segmentation of confocal image datasets.
          bioRxiv, 2021. URL:
          https://www.biorxiv.org/content/early/2021/06/10/2021.06.09.447748,
          doi:10.1101/2021.06.09.447748.
[LRK+14]  Jacqueline Leung, Mark Rould, Christoph Konradt, Christopher Hunter,
          and Gary Ward. Disruption of tgphil1 alters specific parameters of
          toxoplasma gondii motility measured in a quantitative,
          three-dimensional live motility assay. PloS one, 9:e85763, 01 2014.
          doi:10.1371/journal.pone.0085763.
[SG12]    Geita Saadatnia and Majid Golkar. A review on human toxoplasmosis.
          Scandinavian journal of infectious diseases, 44(11):805-814, 2012.
          doi:10.3109/00365548.2012.693197.
[SLE+22]  Nicholas Sofroniew, Talley Lambert, Kira Evans, Juan Nunez-Iglesias,
          Grzegorz Bokota, Philip Winston, Gonzalo Peña-Castellanos, Kevin
          Yamauchi, Matthias Bussonnier, Draga Doncila Pop, Ahmet Can Solak,
          Ziyang Liu, Pam Wadhwa, Alister Burt, Genevieve Buckley, Andrew
          Sweet, Lukasz Migas, Volker Hilsenstein, Lorenzo Gaifas, Jordão
          Bragantini, Jaime Rodríguez-Guerra, Hector Muñoz, Jeremy Freeman,
          Peter Boone, Alan Lowe, Christoph Gohlke, Loic Royer, Andrea PIERRÉ,
          Hagai Har-Gil, and Abigail McGovern. napari: a multi-dimensional
          image viewer for Python, May 2022. URL:
          https://doi.org/10.5281/zenodo.6598542, doi:10.5281/zenodo.6598542.
[SWMP21]  Carsen Stringer, Tim Wang, Michalis Michaelos, and Marius Pachitariu.
          Cellpose: a generalist algorithm for cellular segmentation. Nature
          methods, 18(1):100-106, 2021. doi:10.1101/2020.02.02.931238.
[TPS+17]  Jean-Yves Tinevez, Nick Perry, Johannes Schindelin, Genevieve M.
          Hoopes, Gregory D. Reynolds, Emmanuel Laplantine, Sebastian Y.
          Bednarek, Spencer L. Shorte, and Kevin W. Eliceiri. Trackmate: An
          open and extensible platform for single-particle tracking. Methods,
          115:80-90, 2017. Image Processing for Biologists. URL:
          https://www.sciencedirect.com/science/article/pii/S1046202316303346,
          doi:10.1016/j.ymeth.2016.09.016.
[vCLJ+21] Lucas von Chamier, Romain F Laine, Johanna Jukkala, Christoph Spahn,
          Daniel Krentzel, Elias Nehme, Martina Lerche, Sara Hernández-Pérez,
          Pieta K Mattila, Eleni Karinou, et al. Democratising deep learning
          for microscopy with zerocostdl4mic. Nature communications,
          12(1):1-18, 2021. doi:10.1038/s41467-021-22518-0.
[WCV+20]  Adrian Wolny, Lorenzo Cerrone, Athul Vijayan, Rachele Tofanelli,
          Amaya Vilches Barro, Marion Louveaux, Christian Wenzl, Sören Strauss,
          David Wilson-Sánchez, Rena Lymbouridou, Susanne S Steigleder,
          Constantin Pape, Alberto Bailoni, Salva Duran-Nebreda, George W
          Bassel, Jan U Lohmann, Miltos Tsiantis, Fred A Hamprecht, Kay
          Schneitz, Alexis Maizel, and Anna Kreshuk. Accurate and versatile 3d
          segmentation of plant tissues at cellular resolution. eLife,
          9:e57613, jul 2020. URL: https://doi.org/10.7554/eLife.57613,
          doi:10.7554/eLife.57613.
[WMV+21]  Chentao Wen, Takuya Miura, Venkatakaushik Voleti, Kazushi Yamaguchi,
          Motosuke Tsutsumi, Kei Yamamoto, Kohei Otomo, Yukako Fujie, Takayuki
          Teramoto, Takeshi Ishihara, Kazuhiro Aoki, Tomomi Nemoto, Elizabeth
          Mc Hillman, and Koutarou D Kimura. 3DeeCellTracker, a deep
          learning-based pipeline for segmenting and tracking cells in 3D time
          lapse images. Elife, 10, March 2021. URL:
          https://doi.org/10.7554/eLife.59187, doi:10.7554/eLife.59187.
[WSH+20]  Martin Weigert, Uwe Schmidt, Robert Haase, Ko Sugawara, and Gene
          Myers. Star-convex polyhedra for 3d object detection and segmentation
          in microscopy. In 2020 IEEE Winter Conference on Applications of
          Computer Vision (WACV). IEEE, mar 2020. URL:
          https://doi.org/10.1109/wacv45572.2020.9093435,
          doi:10.1109/wacv45572.2020.9093435.




 The myth of the normal curve and what to do about it
                                                                Allan Campopiano∗





Index Terms—Python, R, robust statistics, bootstrapping, trimmed mean, data
science, hypothesis testing

    Reliance on the normal curve as a tool for measurement is
almost a given. It shapes our grading systems, our measures of
intelligence, and importantly, it forms the mathematical backbone
of many of our inferential statistical tests and algorithms. Some
even call it “God’s curve” for its supposed presence in nature
[Mic89].
    Scientific fields that deal in explanatory and predictive statis-
tics make particular use of the normal curve, often using it to
conveniently define thresholds beyond which a result is considered
statistically significant (e.g., t-test, F-test). Even familiar machine
learning models have, buried in their guts, an assumption of the
normal curve (e.g., LDA, gaussian naive Bayes, logistic & linear
regression).
    The normal curve has had a grip on us for some time; the
aphorism by Cramer [Cra46] still rings true for many today:

        "Everyone believes in the [normal] law of errors, the
    experimenters because they think it is a mathematical
    theorem, the mathematicians because they think it is an
    experimental fact."

    Many students of statistics learn that N=40 is enough to ignore
the violation of the assumption of normality. This belief stems
from early research showing that the sampling distribution of the
mean quickly approaches normal, even when drawing from non-normal
distributions—as long as samples are sufficiently large. It is
common to demonstrate this result by sampling from uniform and
exponential distributions. Since these look nothing like the normal
curve, it was assumed that N=40 must be enough to avoid practical
issues when sampling from other types of non-normal distributions
[Wil13]. (Others reached similar conclusions with different
methodology [Gle93].)
    Two practical issues have since been identified based on this
early research: (1) the distributions under study were light tailed
(they did not produce outliers), and (2) statistics other than the
sample mean were not tested and may behave differently. In the half
century following these early findings, many important discoveries
have been made—calling into question the usefulness of the normal
curve [Wil13].
    The following sections uncover various pitfalls one might
encounter when assuming normality—especially as they relate to
hypothesis testing. To help researchers overcome these problems, a
new Python library for robust hypothesis testing will be introduced
along with an interactive tool for robust statistics education.

* Corresponding author: allan@deepnote.com

Copyright © 2022 Allan Campopiano. This is an open-access article
distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source
are credited.

Fig. 1: Standard normal (orange) and contaminated normal (blue).
The variance of the contaminated curve is more than 10 times that
of the standard normal curve. This can cause serious issues with
statistical power when using traditional hypothesis testing methods.

The contaminated normal

One of the most striking counterexamples of "N=40 is enough" is
shown when sampling from the so-called contaminated normal
[Tuk60][Tan82]. This distribution is also bell shaped and
symmetrical but it has slightly heavier tails when compared to the
standard normal curve. That is, it contains outliers and is difficult
to distinguish from a normal distribution with the naked eye.
Consider the distributions in Figure 1. The variance of the normal
distribution is 1 but the variance of the contaminated normal is
10.9!
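    The contaminated normal is a mixture of two normal distributions,
and a small simulation along the lines below reproduces the inflated
variance; the 10% contamination rate and contaminant standard
deviation of 10 are the values implied by the variance of 10.9
reported above, and the function name is mine:

import numpy as np

rng = np.random.default_rng(42)

def contaminated_normal(size, p=0.1, sd_contaminant=10.0):
    """N(0, 1) with probability 1-p, N(0, sd_contaminant^2) with probability p."""
    contaminated = rng.random(size) < p
    sd = np.where(contaminated, sd_contaminant, 1.0)
    return rng.normal(0.0, 1.0, size) * sd

x = contaminated_normal(1_000_000)
print(x.var())  # ~10.9, versus 1.0 for the standard normal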
    The consequence of this inflated variance is apparent when
examining statistical power. To demonstrate, Figure 2 shows two
pairs of distributions: on the left, there are two normal
distributions (variance 1) and on the right there are two
contaminated distributions (variance 10.9). Both pairs of
distributions have a mean difference of 0.8. Wilcox [Wil13] showed
that by taking random samples of N=40 from each normal curve, and
comparing them with Student's t-test, statistical power was
approximately 0.94. However, when following this same procedure for
the contaminated groups, statistical power was only 0.25.
    The point here is that even small apparent departures from
normality, especially in the tails, can have a large impact on
commonly used statistics. The problems continue to get worse when
examining effect sizes, but these findings are not discussed
in this article. Interested readers should see Wilcox's 1992 paper
[Wil92].

Fig. 2: Two normal curves (left) and two contaminated normal curves
(right). Despite the obvious effect sizes (∆ = 0.8 for both pairs) as
well as the visual similarities of the distributions, power is only
~0.25 under contamination; however, power is ~0.94 under normality
(using Student's t-test).

    Perhaps one could argue that the contaminated normal
distribution actually represents an extreme departure from normality
and therefore should not be taken seriously; however, distributions
that generate outliers are likely common in practice
[HD82][Mic89][Wil09]. A reasonable goal would then be to choose
methods that perform well under such situations and continue to
perform well under normality. In addition, serious issues still
exist even when examining light-tailed and skewed distributions
(e.g., lognormal), and statistics other than the sample mean (e.g.,
T). These findings will be discussed in the following section.

Student's t-distribution

Another common statistic is the T value obtained from Student's
t-test. As will be demonstrated, T is more sensitive to violations
of normality than the sample mean (which has already been shown to
not be robust). This is despite the fact that the t-distribution is
also bell shaped, light tailed, and symmetrical—a close relative of
the normal curve.
    The assumption is that T follows a t-distribution (and with
large samples it approaches normality). We can test this assumption
by generating random samples from a lognormal distribution.
Specifically, 5000 datasets of sample size 20 were randomly drawn
from a lognormal distribution using SciPy's lognorm.rvs function.
For each dataset, T was calculated and the resulting t-distribution
was plotted. Figure 3 shows that the assumption that T follows a
t-distribution does not hold.
    With N=20, the assumption is that with a probability of 0.95,
T will be between -2.09 and 2.09. However, when sampling from a
lognormal distribution in the manner just described, there is
actually a 0.95 probability that T will be between approximately
-4.2 and 1.4 (i.e., the middle 95% of the actual t-distribution is
much wider than the assumed t-distribution). Based on this result
we can conclude that sampling from skewed distributions (e.g.,
lognormal) leads to increased Type I Error when using Student's
t-test [Wil98].
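    This simulation is easy to reproduce; the sketch below follows
the description above, with the lognormal shape parameter and the
seed chosen arbitrarily:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 20, 5000

# T statistics computed on lognormal samples (skewed, non-normal).
t_vals = []
for _ in range(reps):
    x = stats.lognorm.rvs(s=1.0, size=n, random_state=rng)
    t_vals.append((x.mean() - stats.lognorm.mean(s=1.0)) / (x.std(ddof=1) / np.sqrt(n)))

# Middle 95% of the simulated T values versus the assumed t-distribution.
print(np.percentile(t_vals, [2.5, 97.5]))     # asymmetric, much wider on one side
print(stats.t.ppf([0.025, 0.975], df=n - 1))  # approximately [-2.09, 2.09]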
        "Surely the hallowed bell-shaped curve has cracked
    from top to bottom. Perhaps, like the Liberty Bell, it
    should be enshrined somewhere as a memorial to more
    heroic days." — Earnest Ernest, Philadelphia Inquirer,
    10 November 1974 [FG81]

Fig. 3: Actual t-distribution (orange) and assumed t-distribution
(blue). When simulating a t-distribution based on a lognormal curve,
T does not follow the assumed shape. This can cause poor probability
coverage and increased Type I Error when using traditional hypothesis
testing approaches.

Modern robust methods

When it comes to hypothesis testing, one intuitive way of dealing
with the issues described above would be to (1) replace the sample
mean (and standard deviation) with a robust alternative and (2) use
a non-parametric resampling technique to estimate the sampling
distribution (rather than assuming a theoretical shape)1. Two such
candidates are the 20% trimmed mean and the percentile bootstrap
test, both of which have been shown to have practical value when
dealing with issues of outliers and non-normality [CvNS18][Wil13].

The trimmed mean

The trimmed mean is nothing more than sorting values, removing a
proportion from each tail, and computing the mean on the remaining
values. Formally,

    •   Let X1, ..., Xn be a random sample and X(1) ≤ X(2) ≤ ... ≤ X(n)
        be the observations in ascending order
    •   The proportion to trim is γ (0 ≤ γ ≤ .5)
    •   Let g = ⌊γn⌋. That is, the proportion to trim multiplied by
        n, rounded down to the nearest integer

    Then, in symbols, the trimmed mean can be expressed as follows:

    X̄t = (X(g+1) + ... + X(n−g)) / (n − 2g)

If the proportion to trim is 0.2, more than twenty percent of the
values would have to be altered to make the trimmed mean arbitrarily
large or small. The sample mean, on the other hand, can be made to
go to ±∞ (arbitrarily large or small) by changing a single value.
The trimmed mean is more robust than the sample mean in all measures
of robustness that have been studied [Wil13]. In particular the 20%
trimmed mean has been shown to have practical value as it avoids
issues associated with the median (not discussed here) and still
protects against outliers.
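    In code, the 20% trimmed mean is a one-liner with SciPy
(scipy.stats.trim_mean), or a few lines by hand following the
definition above; the small check below (with made-up data) shows
the two agree and that a single outlier barely moves the estimate:

import numpy as np
from scipy import stats

def trimmed_mean(x, gamma=0.2):
    """Trimmed mean computed directly from the definition."""
    x = np.sort(np.asarray(x))
    g = int(np.floor(gamma * len(x)))
    return x[g:len(x) - g].mean()

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 1000.0])
print(trimmed_mean(x, 0.2))     # 6.5, outlier-resistant
print(stats.trim_mean(x, 0.2))  # same result with SciPy
print(x.mean())                 # 105.4, dragged upward by the outlier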
    1. Another option is to use a parametric test that assumes a
different underlying model.

The percentile bootstrap test

In most traditional parametric tests, there is an assumption that
the sampling distribution has a particular shape (normal,
F-distribution, t-distribution, etc.). We can use these
distributions to test the null hypothesis; however, as discussed,
the theoretical distributions are not always approximated well when
violations of assumptions occur. Non-parametric resampling
techniques such as bootstrapping and permutation tests build
empirical sampling distributions, and from these, one can robustly
derive p-values and CIs. One example is the percentile bootstrap
test [Efr92][TE93].
    The percentile bootstrap test can be thought of as an algorithm
that uses the data at hand to estimate the underlying sampling
distribution of a statistic (pulling yourself up by your own
bootstraps, as the saying goes). This approach is in contrast to
traditional methods that assume the sampling distribution takes a
particular shape. The percentile bootstrap test works well with
small sample sizes, under normality, under non-normality, and it
easily extends to multi-group tests (ANOVA) and measures of
association (correlation, regression). For a two-sample case, the
steps to compute the percentile bootstrap test can be described as
follows:

    1)  Randomly resample with replacement n values from group one
    2)  Randomly resample with replacement n values from group two
    3)  Compute X̄1 − X̄2 based on your new samples (the mean
        difference)
    4)  Store the difference & repeat steps 1-3 many times (say,
        1000)
    5)  Consider the middle 95% of all differences (the confidence
        interval)
    6)  If the confidence interval contains zero, there is no
        statistical difference; otherwise, you can reject the null
        hypothesis (there is a statistical difference)
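These steps translate almost line-for-line into NumPy; the following
sketch mirrors the procedure above (the function name and the number
of resamples are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)

def percentile_bootstrap(group_one, group_two, n_boot=1000, alpha=0.05):
    """Two-sample percentile bootstrap test on the mean difference."""
    diffs = []
    for _ in range(n_boot):
        b1 = rng.choice(group_one, size=len(group_one), replace=True)
        b2 = rng.choice(group_two, size=len(group_two), replace=True)
        diffs.append(b1.mean() - b2.mean())
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi  # reject the null hypothesis if 0 lies outside [lo, hi]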
Implementing and teaching modern robust methods

Despite over half a century of convincing findings, and thousands of
papers, robust statistical methods are still not widely adopted in
applied research [EHM08][Wil98]. This may be due to various false
beliefs. For example,

    •   Classical methods are robust to violations of assumptions
    •   Correcting non-normal distributions by transforming the data
        will solve all issues
    •   Traditional non-parametric tests are suitable replacements
        for parametric tests that violate assumptions

    Perhaps the most obvious reason for the lack of adoption of
modern methods is a lack of easy-to-use software and training
resources. In the following sections, two resources will be
presented: one for implementing robust methods and one for teaching
them.

Robust statistics for Python

Hypothesize is a robust null hypothesis significance testing (NHST)
library for Python [CW20]. It is based on Wilcox's WRS package for
R, which contains hundreds of functions for computing robust
measures of central tendency and hypothesis testing. At the time of
this writing, the WRS library in R contains many more functions than
Hypothesize, and its value to researchers who use inferential
statistics cannot be overstated. WRS is best experienced in tandem
with Wilcox's book "Introduction to Robust Estimation and Hypothesis
Testing".
    Hypothesize brings many of these functions into the open-source
Python library ecosystem with the goal of lowering the barrier to
modern robust methods—even for those who have not had extensive
training in statistics or coding. With modern browser-based notebook
environments (e.g., Deepnote), learning to use Hypothesize can be
relatively straightforward. In fact, every statistical test listed
in the docs is associated with a hosted notebook, pre-filled with
sample data and code. But certainly, one can simply
pip install Hypothesize to use Hypothesize in any environment that
supports Python. See van Noordt and Willoughby [vNW21] and van
Noordt et al. [vNDTE22] for examples of Hypothesize being used in
applied research.
    The API for Hypothesize is organized by single- and two-factor
tests, as well as measures of association. Input data for the
groups, conditions, and measures are given in the form of a Pandas
DataFrame [pdt20][WM10]. By way of example, one can compare two
independent groups (e.g., placebo versus treatment) using the 20%
trimmed mean and the percentile bootstrap test, as follows (note
that Hypothesize uses the naming conventions found in WRS):

from hypothesize.utilities import trim_mean
from hypothesize.compare_groups_with_single_factor \
    import pb2gen

results = pb2gen(df.placebo, df.treatment, trim_mean)

As shown below, the results are returned as a Python dictionary
containing the p-value, confidence intervals, and other important
details.

{
'ci': [-0.22625614592148624, 0.06961754796950131],
'est_1': 0.43968438076483285,
'est_2': 0.5290985245430996,
'est_dif': -0.08941414377826673,
'n1': 50,
'n2': 50,
'p_value': 0.27,
'variance': 0.005787027326924963
}

For measuring associations, several options exist in Hypothesize.
One example is the Winsorized correlation, which is a robust
alternative to Pearson's R. For example,

from hypothesize.measuring_associations import wincor

results = wincor(df.height, df.weight, tr=.2)

returns the Winsorized correlation coefficient and other relevant
statistics:

{
'cor': 0.08515087411576182,
'nval': 50,
'sig': 0.558539575073185,
'wcov': 0.004207827245660796
}

A case study using real-world data

It is helpful to demonstrate that robust methods in Hypothesize (and
in other libraries) can make a practical difference when dealing
with real-world data. In a study by Miller on sexual attitudes, 1327
men and 2282 women were asked how many sexual
partners they desired over the next 30 years (the data are available
from Rand R. Wilcox's site). When comparing these groups using
Student's t-test, we get the following results:

{
'ci': [-1491.09, 4823.24],
't_value': 1.035308,
'p_value': 0.300727
}

That is, we fail to reject the null hypothesis at the α = 0.05 level
using Student's test for independent groups. However, if we switch
to a robust analogue of the t-test, one that utilizes bootstrapping
and trimmed means, we can indeed reject the null hypothesis. Here
are the corresponding results from Hypothesize's yuenbt test (based
on [Yue74]):

from hypothesize.compare_groups_with_single_factor \
    import yuenbt

results = yuenbt(df.males, df.females,
    tr=.2, alpha=.05)

{
'ci': [1.41, 2.11],
'test_stat': 9.85,
'p_value': 0.0
}

The point here is that robust statistics can make a practical
difference with real-world data (even when N is considered large).
Many other examples of robust statistics making a practical
difference with real-world data have been documented
[HD82][Wil09][Wil01].
    It is important to note that robust methods may also fail to
reject when a traditional test rejects (remember that traditional
tests can suffer from increased Type I Error). It is also possible
that both approaches yield the same or similar conclusions. The
exact pattern of results depends largely on the characteristics of
the underlying population distribution. To help reason about how
robust statistics behave when compared to traditional methods, the
robust statistics simulator has been created and is described in the
next section.

Robust statistics simulator

Having a library of robust statistical functions is not enough to
make modern methods commonplace in applied research. Educators and
practitioners still need intuitive training tools that demonstrate
the core issues surrounding classical methods and how robust
analogues compare.
    As mentioned, computational notebooks that run in the cloud
offer a unique solution to learning beyond that of static textbooks
and documentation. Learning can be interactive and exploratory since
narration, visualization, widgets (e.g., buttons, slider bars), and
code can all be experienced in a ready-to-go compute
environment—with no overhead related to local environment setup.
    As a compendium to Hypothesize, and a resource for understanding
and teaching robust statistics in general, the robust statistics
simulator repository has been developed. It is a notebook-based
collection of interactive demonstrations aimed at clearly and
visually explaining the conditions under which classic methods fail
relative to robust methods. A hosted notebook with the rendered
visualizations of the simulations can be accessed here and seen in
Figure 4. Since the simulations run in the browser and require very
little understanding of code, students and teachers can easily
onboard to the study of robust statistics.

Fig. 4: An example of the robust stats simulator in Deepnote's hosted
notebook environment. A minimalist UI can lower the barrier-to-entry
to robust statistics concepts.

    The robust statistics simulator allows users to interact with
the following parameters:

    •   Distribution shape
    •   Level of contamination
    •   Sample size
    •   Skew and heaviness of tails

    Each of these characteristics can be adjusted independently in
order to compare classic approaches to their robust alternatives.
The two measures that are used to evaluate the performance of
classic and robust methods are the standard error and Type I Error.
    Standard error is a measure of how much an estimator varies
across random samples from our population. We want to choose
estimators that have a low standard error. Type I Error is also
known as the False Positive Rate. We want to choose methods that
keep Type I Error close to the nominal rate (usually 0.05). The
robust statistics simulator can guide these decisions by providing
empirical evidence as to why particular estimators and statistical
tests have been chosen.
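    The kind of comparison the simulator automates can also be run
directly; the sketch below (sample size, contamination level, and
number of repetitions are arbitrary choices, not the simulator's
code) estimates the standard error of the sample mean and of the 20%
trimmed mean under contamination:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def contaminated_sample(n, p=0.1, sd=10.0):
    scale = np.where(rng.random(n) < p, sd, 1.0)
    return rng.normal(0.0, 1.0, n) * scale

means, tmeans = [], []
for _ in range(5000):
    x = contaminated_sample(40)
    means.append(x.mean())
    tmeans.append(stats.trim_mean(x, 0.2))

print(np.std(means))   # standard error of the sample mean
print(np.std(tmeans))  # noticeably smaller for the 20% trimmed mean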
Conclusion

This paper gives an overview of the issues associated with the
normal curve. The concerns with traditional methods, in terms of
robustness to violations of normality, have been known for over half
a century, and modern alternatives have been recommended; however,
for various reasons that have been discussed, modern robust methods
have not yet become commonplace in applied research settings.
    One reason is the lack of easy-to-use software and teaching
resources for robust statistics. To help fill this gap, Hypothesize,
a peer-reviewed and open-source Python library, was developed. In
addition, to help clearly demonstrate and visualize the advantages
of robust methods, the robust statistics simulator was created.
Using these tools, practitioners can begin to integrate robust
statistical methods into their inferential testing repertoire.

Acknowledgements

The author would like to thank Karlynn Chan and Rand R. Wilcox as
well as Elizabeth Dlha and the entire Deepnote team for their
support of this project. In addition, the author would like to thank
Kelvin Lee for his insightful review of this manuscript.

References

[Cra46]   Harold Cramer. Mathematical Methods of Statistics. Princeton Univ.
          Press, Princeton, NJ, 1946. URL:
          https://books.google.ca/books?id=CRTKKaJO0DYC.
[CvNS18]  Allan Campopiano, Stefon JR van Noordt, and Sidney J Segalowitz.
          Statslab: An open-source eeg toolbox for computing single-subject
          effects using robust statistics. Behavioural Brain Research,
          347:425-435, 2018. doi:10.1016/j.bbr.2018.03.025.
[CW20]    Allan Campopiano and Rand R. Wilcox. Hypothesize: Robust statistics
          for python. Journal of Open Source Software, 5(50):2241, 2020.
          doi:10.21105/joss.02241.
[Efr92]   Bradley Efron. Bootstrap methods: another look at the jackknife. In
          Breakthroughs in statistics, pages 569-593. Springer, 1992.
          doi:10.1007/978-1-4612-4380-9_41.
[EHM08]   David M Erceg-Hurn and Vikki M Mirosevich. Modern robust statistical
          methods: an easy way to maximize the accuracy and power of your
          research. American Psychologist, 63(7):591, 2008.
          doi:10.1037/0003-066X.63.7.591.
[FG81]    Joseph Fashing and Ted Goertzel. The myth of the normal curve: a
          theoretical critique and examination of its role in teaching and
          research. Humanity & Society, 5(1):14-31, 1981.
          doi:10.1177/016059768100500103.
[Gle93]   John R Gleason. Understanding elongation: The scale contaminated
          normal family. Journal of the American Statistical Association,
          88(421):327-337, 1993. doi:10.1080/01621459.1993.10594325.
[HD82]    MaryAnn Hill and WJ Dixon. Robustness in real life: A study of
          clinical laboratory data. Biometrics, pages 377-396, 1982.
          doi:10.2307/2530452.
[Mic89]   Theodore Micceri. The unicorn, the normal curve, and other improbable
          creatures. Psychological bulletin, 105(1):156, 1989.
          doi:10.1037/0033-2909.105.1.156.
[pdt20]   The pandas development team. pandas-dev/pandas: Pandas, February
          2020. URL: https://doi.org/10.5281/zenodo.3509134,
          doi:10.5281/zenodo.3509134.
[Tan82]   WY Tan. Sampling distributions and robustness of t, f and
          variance-ratio in two samples and anova models with respect to
          departure from normality. Comm. Statist.-Theor. Meth., 11:2485-2511,
          1982. URL: https://pascal-francis.inist.fr/vibad/index.php?action=
          getRecordDetail&idt=PASCAL83X0380619.
[TE93]    Robert J Tibshirani and Bradley Efron. An introduction to the
          bootstrap. Monographs on statistics and applied probability,
          57:1-436, 1993. URL: https://books.google.ca/books?id=gLlpIUxRntoC.
[Tuk60]   J. W. Tukey. A survey of sampling from contaminated distributions.
          Contributions to Probability and Statistics, pages 448-485, 1960.
          URL: https://ci.nii.ac.jp/naid/20000755025/en/.
[vNDTE22] Stefon van Noordt, James A Desjardins, BASIS Team, and Mayada
          Elsabbagh. Inter-trial theta phase consistency during face processing
          in infants is associated with later emerging autism. Autism Research,
          15(5):834-846, 2022. doi:10.1002/aur.2701.
[vNW21]   Stefon van Noordt and Teena Willoughby. Cortical maturation from
          childhood to adolescence is reflected in resting state eeg signal
          complexity. Developmental cognitive neuroscience, 48:100945, 2021.
          doi:10.1016/j.dcn.2021.100945.
[Wil92]   Rand R Wilcox. Why can methods for comparing means have relatively
          low power, and what can you do to correct the problem? Current
          Directions in Psychological Science, 1(3):101-105, 1992.
          doi:10.1111/1467-8721.ep10768801.
[Wil98]   Rand R Wilcox. How many discoveries have been lost by ignoring modern
          statistical methods? American Psychologist, 53(3):300, 1998.
          doi:10.1037/0003-066X.53.3.300.
[Wil01]   Rand R Wilcox. Fundamentals of modern statistical methods:
          Substantially improving power and accuracy, volume 249. Springer,
          2001. URL: https://link.springer.com/book/10.1007/978-1-4757-3522-2.
[Wil09]   Rand R Wilcox. Robust ancova using a smoother with bootstrap bagging.
          British Journal of Mathematical and Statistical Psychology,
          62(2):427-437, 2009. doi:10.1348/000711008X325300.
[Wil13]   Rand R Wilcox. Introduction to robust estimation and hypothesis
          testing. Academic press, 2013. doi:10.1016/c2010-0-67044-1.
[WM10]    Wes McKinney. Data Structures for Statistical Computing in Python. In
          Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the
          9th Python in Science Conference, pages 56-61, 2010.
          doi:10.25080/Majora-92bf1922-00a.
[Yue74]   Karen K Yuen. The two-sample trimmed t for unequal population
          variances. Biometrika, 61(1):165-170, 1974. doi:10.2307/2334299.




       Python for Global Applications: teaching scientific
       Python in context to law and diplomacy students
                                                           Anna Haensch‡§∗ , Karin Knudson‡§






Abstract—For students across domains and disciplines, the message
has been communicated loud and clear: data skills are an essential
qualification for today's job market. This includes not only the
traditional introductory stats coursework but also machine learning,
artificial intelligence, and programming in Python or R.
Consequently, there has been significant student-initiated demand
for data analytic and computational skills, sometimes with very
clear objectives in mind, and other times guided by a vague sense of
"the work I want to do will require this." Now we have options. If
we train students using "black box" algorithms without attending to
the technical choices involved, then we run the risk of unleashing
practitioners who might do more harm than good. On the other hand,
courses that completely unpack the "black box" can be so steeped in
theory that the barrier to entry becomes too high for students from
social science and policy backgrounds, thereby excluding critical
voices. In sum, both of these options lead to a pitfall that has
gained significant media attention over recent years: the harms
caused by algorithms that are implemented without sufficient
attention to human context. In this paper, we - two mathematicians
turned data scientists - present a framework for teaching
introductory data science skills in a highly contextualized and
domain flexible environment. We will present example course outlines
at the semester, weekly, and daily level, and share materials that
we think hold promise.

Index Terms—computational social science, public policy, data
science, teaching with Python

* Corresponding author: anna.haensch@tufts.edu
‡ Tufts University
§ Data Intensive Studies Center

Copyright © 2022 Anna Haensch et al. This is an open-access article
distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source
are credited.

Introduction

As data science continues to gain prominence in the public eye, and
as we become more aware of the many facets of our lives that
intersect with data-driven technologies and policies every day,
universities are broadening their academic offerings to keep up with
what students and their future employers demand. Not only are
students hoping to obtain more hard skills in data science (e.g.,
Python programming experience), but they are interested in applying
tools of data science across domains that haven't historically been
part of the quantitative curriculum. The Master of Arts in Law and
Diplomacy (MALD) is the flagship program of the Fletcher School of
Law and International Diplomacy at Tufts University. Historically,
the program has contained core elements of quantitative reasoning
with a focus on business, finance, and international development, as
is typical in graduate programs in international relations. Like
academic institutions more broadly, the students and faculty at the
Fletcher School are eager to seize upon our current data moment to
expand their quantitative offerings. With this in mind, the Fletcher
School reached out to the co-authors to develop a course in data
science, situated in the context of international diplomacy.
    In response, we developed the (Python-based) course Data Science
for Global Applications, which had its inaugural offering in the
Spring semester of 2022. The course had 30 enrolled Fletcher School
students, primarily from the MALD program. When the course was
announced we had a flood of interest from Fletcher students who were
extremely interested in broadening their studies with this course.
With the goal of keeping a close, interactive atmosphere, we capped
enrollment at 30. To inform the direction of our course, we surveyed
students on their background in programming (see Fig. 1) and on
their motivations for learning data science (see Fig. 2). Students
reported only very limited experience with programming, if any at
all, with that experience primarily in Excel and Tableau. Student
motivations varied, but the goal of getting a job where they were
able to make a meaningful social impact was the primary motivation.

Fig. 1: The majority of the 30 students enrolled in the course had
little to no programming experience, and none reported having "a lot"
of experience. Those who did have some experience were most likely to
have worked in Excel or Tableau.

    The MALD program, which is interdisciplinary by design, provides
ample footholds for domain specific data science. Keeping this in
mind, as a throughline for the course, each student worked to
develop their own quantitative policy project. Coursework and
discussions were designed to move this project forward from initial
policy question, to data sourcing and visualizing, and eventually to
modeling and analysis.
    In what follows we will describe how we structured our course
with the goal of empowering beginner programmers to use Python for
data science in the context of international relations
and diplomacy. We will also share details about course content and
structure, methods of assessment, and Python programming resources
that we deployed through Google Colab. All of the materials
described here can be found on the public course page
https://karink520.github.io/data-science-for-global-applications/.

Fig. 2: The 30 enrolled students were asked to indicate which were
relevant motivations for taking the course. Curiosity and a desire to
make a meaningful social impact were among the top motivations our
students expressed.

might understand in the abstract that the way missing data is
handled can substantially affect the outcome of an analysis, but
will likely have a stronger understanding if they have had to
consider how to deal with missing data in their own project.
    We used several course structures to support connecting data
science and Python "skills" with their context. Students had
readings and journaling assignments throughout the semester on
topics that connected data science with society. In their journal
responses, students were asked to connect the ideas in the reading
to their other academic/professional interests, or ideas from other
classes, with the following prompt:

        Your reflection should be a 250-300 word narrative.
    Be sure to tie the reading back into your own studies,
    experiences, and areas of interest. For each reading,
    come up with 1-2 discussion questions based on the
    concepts discussed in the readings. This can be a
    curiosity question, where you're interested in finding
    out more, a critical question, where you challenge the
    author's assumptions or decisions, or an application
    question, where you think about how concepts from the
    reading would apply to a particular context you are
    interested in exploring.1

    These readings (highlighted in gray in Fig. 3), assignments, and
the related in-class discussions were interleaved among Python
exercises meant to give students practice with skills including
manipulating DataFrames in pandas [The22], [Mck10], plotting in
Matplotlib [Hun07] and seaborn [Was21], mapping with GeoPandas
[Jor21], and modeling with scikit-learn [Ped11]. Student

Course Philosophy and Goals

Our high level goals for the course were i) to empower students
with the skills to gain insight from data using Python and ii) to       projects included a thorough data audit component requiring
deepen students’ understanding of how the use of data science           students to explore data sources and their human context in detail.
affects society. As we sought to achieve these high level goals         Precise details and language around the data audit can be found
within the limited time scope of a single semester, the following       on the course website.
core principles were essential in shaping our course design. Below,
we briefly describe each of these principles and share some             Managing Fears & Concerns Through Supported Programming
examples of how they were reflected in the course structure. In a
                                                                        We surmised that students who are new to programming and
subsequent section we will more precisely describe the content of
                                                                        possibly intimidated by learning the unfamiliar skill would do
the course, whereupon we will further elaborate on these principles
                                                                        well in an environment that included plenty of what we call
and share instructional materials. But first, our core principles:
                                                                        supported programming - that is, practicing programming in class
Connecting the Technical and Social
                                                                        with immediate access to instructor and peer support.
                                                                            In the pre-course survey we created, many students identified
To understand the impact of data science on the world (and the
                                                                        concerns about their quantitative preparation, whether they would
potential policy implications of such impact), it helps to have
                                                                        be able to keep up with the course, and how hard programming
hands-on practice with data science. Conversely, to effectively
                                                                        might be. We sought to acknowledge these concerns head-on,
and ethically practice data science, it is important to understand
                                                                        assure students of our full confidence in their ability to master
how data science lives in the world. Thus, the "hard" skills of
                                                                        the material, and provide them with all the resources they needed
coding, wrangling data, visualizing, and modeling are best taught
                                                                        to succeed.
intertwined with a robust study of ways in which data science is
                                                                            A key resource to which we thought all students needed
used and misused.
                                                                        access was instructor attention. In addition to keeping the class
    There is an increasing need to educate future policy-makers
                                                                        size capped at 30 people, with both co-instructors attending all
with knowledge of how data science algorithms can be used
                                                                        course meetings, we structured class time to maximize the time
and misused. One way to approach meeting this need, especially
                                                                        students spent actually doing data science in class. We sought
for students within a less technically-focused program, would
                                                                        to keep demonstrations short, and intersperse them with coding
be to teach students about how algorithms can be used without
                                                                        exercises so that students could practice with new ideas right
actually teaching them to use algorithms. However, we argue that
                                                                        away. Our Colab notebooks included in the course materials show
students will gain a deeper understanding of the societal and
                                                                        one way that we wove student practice time throughout. Drawing
ethical implications of data science if they also have practical
                                                                        insight from social practice theory of learning (e.g. [Eng01],
data science skills. For example, a student could gain a broad
                                                                        [Pen16]), we sought to keep in mind how individual practice and
understanding of how biased training data might lead to biased
                                                                        learning pathways develop in relation to their particular social and
algorithmic predictions, but such understanding is likely to be
deeper and more memorable when a student has actually practiced           1. This journaling prompt was developed by our colleague Desen Ozkan at
training a model using different training data. Similarly, someone      Tufts University.
PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS                                      71

institutional context. Crucially, we devoted a great deal of in-class   and preparing data for exploratory data analysis, visualizing and
time to students doing data science, and a great deal of energy         annotating data, and finally modeling and analyzing data. All
into making this practice time a positive and empowering social         of this was done with the goal of answering a policy question
experience. During student practice time, we were circulating           developed by the student, allowing the student to flex some
throughout the room, answering student questions and helping            domain expertise to supplement the (sometimes overwhelming!)
students to problem solve and debug, and encouraging students           programmatic components.
to work together and help each other. A small organizational                Our project explicitly required that students find two datasets
change we made in the first weeks of the semester that proved           of interest and merge them for the final analysis. This presented
to have outsized impact was holding our office hours        both logistical and technical challenges. As one student pointed
directly after class in an almost-adjacent room, to make it as easy     out after finally finding open data: hearing people talk about the
as possible for students to attend office hours. Students were vocal    need for open data is one thing, but you really realize what that
in their appreciation of office hours.                                  means when you’ve spent weeks trying to get access to data that
    We contend that the value of supported programming time             you know exists. Understanding the provenance of the data they
is two-fold. First, it helps beginning programmers learn more           were working with helped students assess the biases and limita-
quickly. While learning to code necessarily involves challenges,        tions, and also gave students a strong sense of ownership over
students new to a language can sometimes struggle for an un-            their final projects. An unplanned consequence of the broad scope
productively long time on things like simple syntax issues. When        of the policy project was that we, the instructors, learned nearly
students have help available, they can move forward from minor          as much about international diplomacy as the students learned
issues faster and move more efficiently into building a meaningful      about programming and data science, a bidirectional exchange of
understanding. Secondly, supported programming time helps stu-          knowledge that we surmise contributed to students’ feelings
dents to understand that they are not alone in the challenges they      of empowerment and a positive class environment.
are facing in learning to program. They can see other students
learning and facing similar challenges, can have the empowering         Course Structure
experience of helping each other out, and when asking for help
can notice that even their instructors sometimes rely on resources      We broke the course into three modules, each with focused
like StackOverflow. An unforeseen benefit of co-teaching, we believe,  reading/journaling topics, Python exercises, and policy project
was the opportunity it gave us as instructors to consult               benchmarks: (i) getting and cleaning data, (ii) visualizing data,
with each other during class time and share different approaches.       and (iii) modeling data. In what follows we will describe the key
These instructor interactions modeled for students how even as          goals of each module and highlight the readings and exercises that
experienced practitioners of data science, we too were constantly       we compiled to work towards these goals.
learning.
                                                                        Getting and Cleaning Data
    Lastly, a small but (we thought) important aspect of our setup
was teaching students to set up a computing environment on              Getting, cleaning, and wrangling data typically make up a signif-
their own laptops, with Python, conda [Ana16], and JupyterLab           icant proportion of the time involved in a data science project.
[Pro22]. Using the command line and moving from an environ-             Therefore, we devoted significant time in our course to learning
ment like Google Colab to one’s own computer can both present           these skills, focusing on loading and manipulating data using
significant barriers, but doing so successfully can be an important     pandas. Key skills included loading data into a pandas DataFrame,
part of helping students feel like ‘real’ programmers. We devoted       working with missing data, and slicing, grouping, and merging
an entire class period to helping students with installation and        DataFrames in various ways. After initial exposure and practice
setup on their own computers.                                           with example datasets, students applied their skills to wrangling
    We considered it an important measure of success that so many     the diverse and sometimes messy and large datasets that they found
students told us at the end of the course that the class had helped     for their individual projects. Since one requirement of the project
them overcome sometimes longstanding feelings that technical            was to integrate more than one dataset, merging was of particular
skills like coding and modeling were not for them.                      importance.
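
To make the kind of wrangling practiced in this module concrete, the following minimal sketch combines loading, handling missing values, grouping, and merging in pandas; the file names and columns are invented for illustration and are not taken from any student project.

    import pandas as pd

    # Hypothetical inputs standing in for two datasets a student might combine;
    # the file names and columns are illustrative only.
    aid = pd.read_csv("foreign_aid.csv")            # country, year, aid_usd
    stability = pd.read_csv("stability_index.csv")  # country, year, score

    # Inspect and handle missing values before merging.
    print(aid.isna().sum())
    aid = aid.dropna(subset=["aid_usd"])

    # Merge the two sources on shared keys, then group and summarize.
    merged = aid.merge(stability, on=["country", "year"], how="inner")
    summary = merged.groupby("country")[["aid_usd", "score"]].mean()
    print(summary.head())
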
                                                                            During this portion of the course, students read and discussed
Leveraging Existing Strengths To Enhance Student Ownership              Boyd and Crawford’s Critical Questions for Big Data [Boy12]
Even as beginning programmers, students are capable of creating a       which situates big data in the context of knowledge itself and
meaningful policy-related data science project within the semester,     raises important questions about access to data and privacy. Ad-
starting from formulating a question and finding relevant datasets.     ditional readings included selected chapters from D’Ignazio and
Working on the project throughout the semester (not just at the         Klein’s Data Feminism [Dig20] which highlights the importance
end) gave essential context to data science skills, as students could  of what we choose to count and what it means when data is
translate an idea into what it might mean for "their" data. Giving     missing.
students wide leeway in their project topic allowed the project to
be a point of connection between new data science skills and their      Visualizing Data
existing domain knowledge. Students chose projects within their         A fundamental component to communicating findings from data
particular areas of interest or expertise, and a number chose to        is well-executed data visualization. We chose to place this module
additionally connect their project for this course to their degree      in the middle of the course, since it was important that students
capstone project.                                                       have a common language for interpreting and communicating their
    Project benchmarks were placed throughout the semester              analysis before moving to the more complicated aspects of data
(highlighted in green in Fig 3) allowing students a concrete            modeling. In developing this common language, we used Wilke’s
way to develop their new skills in identifying datasets, loading        Fundamentals of Data Visualization [Wil19] and Cairo’s How
72                                                                                        PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)




Fig. 3: Course outline for a 13-week semester with two 70 minute instructional blocks each week. Course readings are highlighted in gray and
policy project benchmarks are highlighted in green.


Chart’s Lie [Cai19] as a backbone for this section of the course.       using Python. Having the concrete target of how a student wanted
In addition to reading the text materials, students were tasked with    their visualization to look seemed to be a motivating starting
finding visualizations “in the wild,” both good and bad. Course         point from which to practice coding and debugging. We spent
discussions centered on the found visualizations, with Wilke and        several class periods on supported programming time for students
Cairo’s writings as a common foundation. From the readings and          to develop their visualizations.
discussions, students became comfortable with the language and              Working on building the narratives of their project and devel-
taxonomy around visualizations and began to develop a better ap-        oping their own visualizations in the context of the course readings
preciation of what makes a visualization compelling and readable.       gave students a heightened sense of attention to detail. During
Students were able to formulate a plan about how they could best        one day of class when students shared visualizations and gave
visualize their data. The next task was to translate these plans into   feedback to one another, students commented and inquired about
Python.                                                                 incredibly small details of each others’ presentations, for example,
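
As a minimal illustration of that translation step (the data and labels below are invented, not drawn from a student project), a horizontal bar chart of the kind discussed in class can be sketched with seaborn and Matplotlib as follows.

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # Invented summary data for illustration only.
    data = pd.DataFrame({
        "region": ["East Asia", "Sub-Saharan Africa", "Latin America", "Europe"],
        "aid_millions_usd": [120, 310, 95, 40],
    })

    fig, ax = plt.subplots(figsize=(6, 3))
    sns.barplot(data=data, x="aid_millions_usd", y="region",
                color="steelblue", ax=ax)
    ax.set_xlabel("Aid received (millions of USD)")
    ax.set_ylabel("")
    ax.set_title("Hypothetical aid flows by region")
    fig.tight_layout()
    plt.show()
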
   To help students gain a level of comfort with data visualization     how to adjust y-tick alignment on a horizontal bar chart. This sort
in Python, we provided instruction and examples of working              of tiny detail is hard to convey in a lecture, but gains outsized
with a variety of charts using Matplotlib and seaborn, as well          importance when a student has personally wrestled with it.
as maps and choropleths using GeoPandas, and assigned students
programming assignments that involved writing code to create            Modeling Data
a visualization matching one in an image. With that practical           In this section we sought to expose students to introductory
grounding, students were ready to visualize their own project data      approaches in each of regression, classification, and clustering
PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS                                               73

in Python. Specifically, we practiced using scikit-learn to work            And finally, to supplement the technical components of the
with linear regression, logistic regression, decision trees, random     course we also had readings with associated journal entries sub-
forests, and Gaussian mixture models. Our focus was not on the          mitted at a cadence of roughly two per module. Journal prompts
theoretical underpinnings of any particular model, but rather on        are described above and available on the course website.
the kinds of problems that regression, classification, or clustering
models, respectively, are able to solve, as well as some basic ideas
about model assessment. The uniform and approachable scikit-            Conclusion
learn API [Bui13] was crucial in supporting this focus, since it        Various listings of key competencies in data science have been
allowed us to focus less on syntax around any one model, and more       proposed [NAS18]. For example, [Dev17] suggests the following
on the larger contours of modeling, with all its associated promise     pillars for an undergraduate data science curriculum: computa-
and perils. We spent a good deal of time building an understanding      tional and statistical thinking, mathematical foundations, model
of train-test splits and their role in model assessment.                building and assessment, algorithms and software foundation,
     Student projects were required to include a modeling com-          data curation, and knowledge transference—communication and
ponent. Just the process of deciding which of regression, clas-         responsibility. As we sought to contribute to the training of
sification, or clustering were appropriate for a given dataset and      data-science informed practitioners of international relations, we
policy question is highly non-trivial for beginners. The diversity of   focused on helping students build an initial competency especially
student projects and datasets meant students had to grapple with        in the last four of these.
this decision process in its full complexity. We were delighted by          We can point to several key aspects of the course that made
the variety of modeling approaches students used in their projects,     it successful. Primary among them was the fact that the majority
as well as by students’ thoughtful discussions of the limitations of    of class time was spent in supported programming. This means
their analysis.                                                         that students were able to ask their instructors or peers as soon
     To accompany this section of the course, students were as-         as questions arose. Novice programmers who aren’t part of a
signed readings focusing on some of the societal impacts of data        formal computer science program often don’t have immediate
modeling and algorithms more broadly. These readings included           access to the resources necessary to get "unstuck." for the novice
a chapter from O’Neil’s Weapons of Math Destruction [One16] as          programmer, even learning how to google technical terms can be a
well as Buolamwini and Gebru’s Gender Shades [Buo18]. Both of           challenge. This sort of immediate debugging and feedback helped
these readings emphasize the capacity of algorithms to exacerbate       students remain confident and optimistic about their projects. This
inequalities and highlight the importance of transparency and           was made all the more effective since we were co-teaching the
ethical data practices. These readings resonated especially strongly    course and had double the resources to troubleshoot. Co-teaching
with our students, many of whom had recently taken courses in           also had the unforeseen benefit of making our classroom a place
cyber policy and ethics in artificial intelligence.                     where the growth mindset was actively modeled and nurtured:
                                                                        where one instructor wasn’t able to answer a question, the other
Assessments
                                                                        instructor often could. Finally, it was precisely the motivation of
Formal assessment was based on four components, already alluded         learning data science in context that allowed students to maintain a
to throughout this note. The largest was the ongoing policy             sense of ownership over their work and build connections between
project which had benchmarks with rolling due dates throughout          their other courses.
the semester. Moreover, time spent practicing coding skills in          access to the resources necessary to get "unstuck." For the novice
class was often done in service of the project. For example, in         dents arrive excited to learn, but also nervous and occasionally
week 4, when students learned to set up their local computing           heavy with the baggage they carry from prior experience in
environments, they also had time to practice loading, reading, and      quantitative courses. However, with a sufficient supported learning
saving data files associated with their chosen project datasets. This   environment it’s possible to impart relevant skills. It was a measure
brought challenges, since often students sitting side-by-side were      of the success of the course how many students told us that the
dealing with different operating systems and data formats. But          course had helped them overcome negative prior beliefs about
from this challenge emerged many organic conversations about            their ability to code. Teaching data science skills in context and
file types and the importance of naming conventions. The rubric         with relevant projects that leverage students’ existing expertise and
for the final project is shown in Fig 4.                                outside reading situates the new knowledge in a place that feels
     The policy project culminated with in-class “micro presenta-       familiar and accessible to students. This contextualization allows
tions” and a policy paper. We dedicated two days of class in week       students to gain some mastery while simultaneously playing to
13 for in-class presentations, for which each student presented         their strengths and interests.
one slide consisting of a descriptive title, one visualization, and
several “key takeaways” from the project. This extremely restric-
tive format helped students to think critically about the narrative     R EFERENCES
information conveyed in a visualization, and was designed to
create time for robust conversation around each presentation.           [Ana16] Anaconda Software Distribution. Computer software. Vers. 2-2.4.0.
     In addition to the policy project, each of the three course                Anaconda, Nov. 2016. Web. https://anaconda.com.
                                                                        [Boy12] Boyd, Danah, and Kate Crawford. Critical questions for big data:
modules also had an associated set of Python exercises (available               Provocations for a cultural, technological, and scholarly phe-
on the course website). Students were given ample time both in                  nomenon. Information, communication & society 15.5 (2012):662-
and out of class to ask questions about the exercises. Overall, these           679. https://doi.org/10.1080/1369118X.2012.678878
exercises proved to be the most technically challenging component       [Bui13] Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa,
                                                                                Andreas Mueller, Olivier Grisel, Vlad Niculae et al. API design for
of the course, but we invited students to resubmit after an initial             machine learning software: experiences from the scikit-learn project.
round of grading.                                                               arXiv preprint arXiv:1309.0238 (2013).
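
As a concrete companion to the Modeling Data module described above, the following sketch shows the uniform scikit-learn fit/predict pattern and a train-test split; the synthetic data stands in for a student's merged policy dataset and is purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a merged policy dataset (illustrative only).
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)

    # Hold out a test set so that assessment is not done on training data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    # The same fit/predict pattern applies to the other estimators used in
    # the course (decision trees, random forests, Gaussian mixture models).
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

Swapping LogisticRegression for, say, RandomForestClassifier changes only the estimator line, which is the uniformity the course relied on.
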
74                                                                                                  PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)




        Fig. 4: Rubric for the policy project that formed a core component of the formal assessment of students throughout the course.


[Buo18] Buolamwini, Joy, and Timnit Gebru. Gender shades: Intersectional        [Ped11] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent
        accuracy disparities in commercial gender classification. Conference            Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al.
        on fairness, accountability and transparency. PMLR, 2018. http://               Scikit-learn: Machine learning in Python. the Journal of machine
        proceedings.mlr.press/v81/buolamwini18a.html                                    Learning research 12 (2011): 2825-2830. https://dl.acm.org/doi/10.
[Cai19] Cairo, Alberto. How charts lie: Getting smarter about visual infor-             5555/1953048.2078195
        mation. WW Norton & Company, 2019.                                      [Pen16] Penuel, William R., Daniela K. DiGiacomo, Katie Van Horne, and
[Dev17] De Veaux, Richard D., Mahesh Agarwal, Maia Averett, Benjamin                    Ben Kirshner. A Social Practice Theory of Learning and Becoming
        S. Baumer, Andrew Bray, Thomas C. Bressoud, Lance Bryant et al.                 across Contexts and Time. Frontline Learning Research 4, no. 4
        Curriculum guidelines for undergraduate programs in data science.               (2016): 30-38. http://dx.doi.org/10.14786/flr.v4i4.205
        Annual Review of Statistics and Its Application 4 (2017): 15-30.        [Pro22] Project Jupyter, 2022. jupyterlab/jupyterlab: JupyterLab 3.4.3 https:
        https://doi.org/10.1146/annurev-statistics-060116-053930                        //github.com/jupyterlab/jupyterlab
[Dig20] D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. MIT           [The22] The Pandas Development Team, 2022. pandas-dev/pandas: Pandas
        press, 2020.                                                                    1.4.2. Zenodo. https://doi.org/10.5281/zenodo.6408044
[Eng01] Engeström, Yrjö. Expansive learning at work: Toward an activity         [Was21] Waskom, Michael L. Seaborn: statistical data visualization. Journal
        theoretical reconceptualization. Journal of education and work 14,              of Open Source Software 6, no. 60 (2021): 3021. https://doi.org/10.
        no. 1 (2001): 133-156. https://doi.org/10.1080/13639080020028747                21105/joss.03021
[Hun07] Hunter, J.D., Matplotlib: A 2D Graphics Environment. Computing in       [Wil19] Wilke, Claus O. Fundamentals of data visualization: a primer on
        Science & Engineering, vol. 9, no. 3 (2007): 90-95. https://doi.org/            making informative and compelling figures. O’Reilly Media, 2019.
        10.1109/MCSE.2007.55
[Jor21] Jordahl, Kelsey et al. 2021. Geopandas/geopandas: V0.10.2. Zenodo.
        https://doi.org/10.5281/zenodo.5573592.
[Mck10] McKinney, Wes. Data structures for statistical computing in python.
        In Proceedings of the 9th Python in Science Conference, vol. 445, no.
        1, pp. 51-56. 2010. https://doi.org/10.25080/Majora-92bf1922-00a
[NAS18] National Academies of Sciences, Engineering, and Medicine. Data
        science for undergraduates: Opportunities and options. National
        Academies Press, 2018.
[One16] O’Neil, Cathy. Weapons of math destruction: How big data increases
        inequality and threatens democracy. Broadway Books, 2016.
PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)                                                                                                    75




             Papyri: better documentation for the scientific
                          ecosystem in Jupyter
                                                   Matthias Bussonnier‡§∗, Camille Carvalho¶‖






Abstract—We present here the idea behind Papyri, a framework we are devel-             documentation is often displayed as raw source where no naviga-
oping to provide a better documentation experience for the scientific ecosystem.       tion is possible. On the maintainers’ side, the final documentation
In particular, we wish to provide a documentation browser (from within Jupyter         rendering is less of a priority. Rather, maintainers should aim at
or other IDEs and Python editors) that gives a unified experience, cross-library      letting users benefit from rendering improvements without
navigation, search, and indexing. By decoupling documentation generation from
                                                                                       having to rebuild all the docs.
rendering we hope this can help address some of the documentation accessi-
bility concerns, and allow customisation based on users’ preferences.
                                                                                           Conda-Forge [CFRG] has shown that concerted efforts can
                                                                                       give a much better experience to end-users, and in today’s world
Index Terms—Documentation, Jupyter, ecosystem, accessibility                           where it is ubiquitous to share library sources on code platforms,
                                                                                       perform continuous integration, and use many other shared tools, we believe
                                                                                       a better documentation framework for many of the libraries of the
Introduction                                                                           scientific Python ecosystem should be available.
Over the past decades, the Python ecosystem has grown rapidly,                             Thus, against all advice we received and based on our own
and one of the last bastions where proprietary, competing                     experience, we have decided to rebuild an opinionated documen-
tools shine is integrated documentation. Indeed, open-source                      tation framework, from scratch, and with minimal dependencies:
libraries are usually developed in distributed settings that can make                  Papyri. Papyri focuses on building an intermediate documentation
it hard to develop coherent and integrated systems.                                    representation format that lets us decouple building and rendering
    While a number of tools and documentation systems exist (and                      the docs. This greatly simplifies many operations and gives us
improvements are made every day), most efforts attempt to build                       access to many desired features that were not available up to now.
documentation in an isolated way, inherently creating a heteroge-                         In what follows, we provide the framework in which Papyri
neous framework. The consequences are twofold: (i) it becomes                         has been created and present its objectives (context and goals),
difficult for newcomers to grasp the tools properly, (ii) there is a                   we describe the Papyri features (format, installation, and usage),
lack of cohesion and of unified framework due to library authors                       then present its current implementation. We end this paper with
making their proper choices as well as having to maintain build                        comments on current challenges and future work.
scripts or services.
    Many users, colleagues, and members of the community have                          Context and objectives
been frustrated with the documentation experience in the Python                       Throughout the paper, we will draw several comparisons between
ecosystem. Given a library, who hasn’t struggled to find the                          documentation building and compiled languages. Also, we will
"official" website for the documentation? Often, users stumble                        borrow and adapt commonly used terminology. In particular, sim-
across an old documentation version that is better ranked in their                    ilarities with "ahead-of-time" (AOT) [AOT], "just-in-time" (JIT)
favorite search engine, and this significantly impacts the learning                   [JIT], intermediate representation (IR) [IR], link-time optimization
process of less experienced users.                                                     (LTO) [LTO], static vs dynamic linking will be highlighted. This
    On users’ local machine, this process is affected by lim-                          allows us to clarify the presentation of the underlying architecture.
ited documentation rendering. Indeed, while in many Integrated                         However, there is no requirement to be familiar with the above
Development Environments (IDEs) the inspector provides some                            to understand the concepts underneath Papyri. In that context, we
documentation, users do not get access to the narrative, or the full                   wish to discuss documentation building as a process from a source-
documentation gallery. For Command Line Interface (CLI) users,                         code meant for a machine to a final output targeting the flesh and
                                                                                       blood machine between the keyboard and the chair.
* Corresponding author: bussonniermatthias@gmail.com
‡ QuanSight, Inc
§ Digital Ours Lab, SARL.                                                              Current tools and limitations
¶ University of California Merced, Merced, CA, USA
|| Univ Lyon, INSA Lyon, UJM, UCBL, ECL, CNRS UMR 5208, ICJ, F-69621,                  In the scientific Python ecosystem, it is well known that Docutils
France                                                                                 [docutils] and Sphinx [sphinx] are major cornerstones for pub-
                                                                                       lishing HTML documentation for Python. In fact, they are used
Copyright © 2022 Matthias Bussonnier et al. This is an open-access article             by all the libraries in this ecosystem. While a few alternatives
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,          exist, most tools and services have some internal knowledge of
provided the original author and source are credited.                                  Sphinx. For instance, Read the Docs [RTD] provides a specific
76                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

Sphinx theme [RTD-theme] users can opt-in to, Jupyter-book
[JPYBOOK] is built on top of Sphinx, and MyST parser [MYST]
(which is made to allow markdown in documentation) targets
Sphinx as a backend, to name a few. All of the above provide an
"ahead-of-time" documentation compilation and rendering, which
is slow and computationally intensive. When a project needs its
specific plugins, extensions and configurations to properly build
(which is almost always the case), it is relatively difficult to
build documentation for a single object (like a single function,
module or class). This makes AOT tools difficult to use for
interactive exploration. One can then consider a JIT approach,
as done for Docrepr [DOCREPR] (integrated both in Jupyter and
Spyder [Spyder]). However in that case, interactive documentation
lacks inline plots, crosslinks, indexing, search and many custom          Fig. 1: The following screenshot shows the help for
directives.                                                               scipy.signal.dpss, as currently accessible (left), as shown by
                                                                          Papyri for Jupyterlab extension (right). An extended version of the
    Some of the above limitations are inherent to the design              right panel is displayed in Figure 4.
of documentation build tools that were intended to build each
project’s documentation separately. While Sphinx does provide features
like intersphinx, link resolutions are done at the documentation          raw docstrings (see for example the SymPy discussion2 on how
building phase. Thus, this is inherently unidirectional, and can          equations should be displayed in docstrings, and left panel of
break easily. To illustrate this, we consider NumPy [NP] and SciPy        Figure 1). In terms of format, markdown is appealing, however
[SP], two extremely close libraries. In order to obtain proper cross-     inconsistencies in the rendering will be created between libraries.
linked documentation, one is required to perform at least five steps:     Finally, some libraries can dynamically modify their docstring at
      •   build NumPy documentation                                       runtime. While this sometimes avoids using directives, it ends up
      •   publish NumPy object.inv file.                                  being more expensive (runtime costs, complex maintenance, and
      •   (re)build SciPy documentation using NumPy obj.inv               contribution costs).
          file.
                                                                          Objectives of the project
      •   publish SciPy object.inv file
      •   (re)build NumPy docs to make use of SciPy’s obj.inv             We now layout the objectives of the Papyri documentation frame-
                                                                          work. Let us emphasize that the project is in no way intended to
    Only then can both SciPy’s and NumPy’s documentation refer            replace or cover many features included in well-established docu-
to each other. As one can expect, cross links break every time            mentation tools such as Sphinx or Jupyter-book. Those projects are
a new version of a library is published1 . Pre-produced HTML              extremely flexible and meet the needs of their users for publishing
in IDEs and other tools is then prone to error and difficult to          a standalone documentation website or PDFs. The Papyri project
maintain. This also raises security issues: some institutions be-         addresses specific documentation challenges (mentioned above);
come reluctant to use tools like Docrepr or to view pre-produced          we present below what is (and what is not) within the scope of work.
HTML.                                                                         Goal (a): design a non-generic (non fully customisable)
                                                                          website builder. When authors want or need complete control
Docstrings format                                                         of the output and wide personalisation options, or branding, then
The Numpydoc format is ubiquitous among the scientific ecosys-            Papyri is not likely the project to look at. That is to say single-
tem [NPDOC]. It is loosely based on reStructuredText (RST)                project websites where appearance, layout, domain need to be
syntax, and despite supporting full RST syntax, docstrings rarely         controlled by the author is not part of the objectives.
contain full-featured directives. Maintainers are confronted with the       Goal (b): create a uniform documentation structure and
following dilemma:                                                        syntax. The Papyri project prescribes stricter requirements in
      •   keep the docstrings simple. This means mostly text-based        terms of format, structure, and syntax compared to other tools
          docstrings with few directives for efficient readability. The    such as Docutils and Sphinx. When possible, the documentation
          end-user may be exposed to raw docstrings, as there is no on-    follows the Diátaxis Framework [DT]. This provides a uniform
          the-fly directive interpretation. This is the case for tools    documentation setup and syntax, simplifying contributions to the
          such as IPython and Jupyter.                                    project and easing error catching at compile time. Such a strict
      •   write an extensive docstring. This includes references, and     environment is qualitatively supported by a number of documentation
          directives that potentially create graphics, tables, and more,  fixes done upstream during the development stage of the project3 .
          allowing an enriched end-user experience. However this          Since Papyri is not fully customisable, users who are already using
          may be computationally intensive, and executing code to         documentation tools such as Sphinx, mkdocs [mkdocs] and others
          view docs could be a security risk.                             should expect their project to require minor modifications to work
                                                                          with Papyri.
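
To ground this dilemma, a schematic Numpydoc-style docstring is sketched below; the function and its contents are invented for illustration, but the Parameters, Returns, and Examples sections are the kind of structure whose rendering is at stake.

    import numpy as np


    def moving_average(values, window=3):
        """Compute a simple moving average (illustrative example).

        Parameters
        ----------
        values : array_like
            Input samples.
        window : int, optional
            Size of the averaging window, by default 3.

        Returns
        -------
        numpy.ndarray
            The averaged values.

        Examples
        --------
        >>> moving_average([1.0, 2.0, 3.0, 4.0], window=2)
        array([1.5, 2.5, 3.5])
        """
        kernel = np.ones(window) / window
        return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")

Rendering the Examples section with its output, rather than as raw text, is exactly the kind of enrichment that tools such as IPython and Jupyter do not perform on the fly.
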
    Other factors impact this choice: (i) users, (ii) format, (iii)           Goal (c): provide accessibility and user proficiency. Ac-
runtime. IDE users or non-terminal users motivate a push for             cessibility is a top priority of the project. To that aim, items
extensive docstrings. Tools like Docrepr can mitigate this problem        are associated with semantic meaning as much as possible, and
by allowing partial rendering. However, users are often exposed to
                                                                            2. sympy/sympy#14963
     1. ipython/ipython#12210, numpy/numpy#21016, & #29073                  3. Tests have been performed on NumPy, SciPy.
PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER                                                                          77

documentation rendering is separated from the documentation build-         Intermediate Representation for Documentation (IRD)
ing phase. That way, accessibility features such as high contrast
                                                                               IRD format: Papyri relies on standard interchangeable
themes (for better text-to-speech (TTS) raw data), early example
                                                                       "Intermediate Representation for Documentation" (IRD) format.
highlights (for newcomers) and type annotation (for advanced
                                                                       This allows us to reduce the operational complexity of the documentation
users) can be quickly available. With the uniform documentation
                                                                       build. For example, given M documentation producers and N
structure, this provides a coherent experience where users become
                                                                       renderers, a full documentation build would be O(MN) (each
more comfortable finding information in a single location (see
                                                                       renderer needs to understand each producer). If each producer only
Figure 1).
                                                                       cares about producing IRD, and if each renderer only consumes it,
    Goal (d): make documentation building simple, fast, and            then one can reduce to O(M+N). Additionally, one can take IRD
independent. One objective of the project is to make documenta-        from multiple producers at once, and render them all to a single
tion installation and rendering relatively straightforward and fast.   target, breaking the silos between libraries.
To that aim, the project includes relative independence of doc-
                                                                           At the moment, IRD files are currently separated into four
umentation building across libraries, allowing bidirectional cross
                                                                       main categories roughly following the Diátaxis framework [DT]
links (i.e. both forward and backward links between pages) to
                                                                       and some technical needs:
be maintained more easily. In other words, a single library can be
built without the need to access documentation from another. Also,         •   API files describe the documentation for a single ob-
the project should include straightforward lookup documentation                ject, expressed as a JSON object. When possible, the
for an object from the interactive read–eval–print loop (REPL).                information is encoded semantically (Objective (c)). Files
Finally, efforts are put to limit the installation speed (to avoid             are organized based on the fully-qualified name of the
polynomial growth when installing packages on large distributed                Python object they reference, and contain either absolute
systems).                                                                      reference to another object (library, version and identi-
                                                                               fier), or delayed references to objects that may exist in
                                                                               another library. Some extra per-object meta information
The Papyri solution
                                                                               like file/line number of definitions can be stored as well.
In this section we describe in more detail how Papyri has been             •   Narrative files are similar to API files, except that they do
implemented to address the objectives mentioned above.                         not represent a given object, but possess a previous/next
                                                                               page. They are organised in an ordered tree related to the
                                                                               table of content.
Making documentation a multi-step process
                                                                           •   Example files are a non-ordered collection of files.
When using current documentation tools, customisation made by              •   Assets files are untouched binary resource archive files that
maintainers usually falls into the following two categories:                   can be referenced by any of the above three ones. They are
                                                                               the only ones that contain backward references, and no
   •   simpler input convenience,                                              forward references.
   •   modification of final rendering.
                                                                           In addition to the four categories above, metadata about the
     This first category often requires arbitrary code execution and   current package is stored: this includes library name, current
must import the library currently being built. This is the case        version, PyPi name, GitHub repository slug4 , maintainers’ names,
for example for the use of .. code-block:::, or custom                 logo, issue tracker and others. In particular, metadata allows
:rc: directive. The second one offers a more user friendly en-         us to auto-generate links to issue trackers, and to source files
vironment. For example, sphinx-copybutton [sphinx-copybutton]          when rendering. In order to properly resolve some references and
adds a button to easily copy code snippets in a single click,          normalize links convention, we also store a mapping from fully
and pydata-sphinx-theme [pydata-sphinx-theme] or sphinx-rtd-           qualified names to canonical ones.
dark-mode provide a different appearance. As a consequence,
                                                                           Let us make some remarks about the current stage of IRD for-
developers must make choices on behalf of their end-users: this
                                                                       mat. The exact structure of package metadata has not been defined
may concern syntax highlights, type annotations display, light/dark
                                                                       yet. At the moment it is reduced to the minimum functionality.
theme.
                                                                       While formats such as codemeta [CODEMETA] could be adopted,
     Being able to modify extensions and re-render the documenta-      in order to avoid information duplication we rely on metadata
tion without the rebuilding and executing stage is quite appealing.    either present in the published packages already or extracted from
Thus, the building phase in Papyri (collecting documentation           Github repository sources. Also, IRD files must be standardized
information) is separated from the rendering phase (Objective (c)):    in order to achieve a uniform syntax structure (Objective (b)).
at this step, Papyri has no knowledge and no configuration options     In this paper, we do not discuss IRD files distribution. Last, the
that permit to modify the appearance of the final documentation.       final specification of IRD files is still in progress and regularly
Additionally, the optional rendering process has no knowledge of       undergoes major changes (even now). Thus, we invite contributors
the building step, and can be run without accessing the libraries      to consult the current state of implementation on the GitHub
involved.                                                              repository [Papyri]. Once the IRD format is more stable, this will
     This kind of technique is commonly used in the field of           be published as a JSON schema, with full specification and more
compilers with the usage of Single Compilation Unit [SCU] and          in-depth description.
Intermediate Representation [IR], but to our knowledge, it has not
been implemented for documentation in the Python ecosystem.
                                                                         4. "slug" is the common term that refers to the various combinations
As mentioned before, this separation is key to achieving many          of organization name/user name/repository name, that uniquely identifies a
features proposed in Objectives (c), (d) (see Figure 2).               repository on a platform like GitHub.
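
Purely as an illustration (the real IRD schema is still evolving and is not reproduced here), an API-file entry might carry fields along the following lines, written as a Python dictionary; every field name and value below is an assumption made for this sketch.

    # Hypothetical sketch of an IRD "API file" entry; all field names and
    # values are illustrative and do not reflect the actual schema.
    ird_api_entry = {
        "qualified_name": "examplepkg.stats.moving_average",
        "library": {"name": "examplepkg", "version": "1.2.3"},
        "summary": "Compute a simple moving average.",
        "sections": {
            "Parameters": [
                {"name": "values", "type": "array_like", "desc": "Input samples."},
            ],
        },
        "references": [
            # Absolute reference: library, version, and identifier are known.
            {"kind": "absolute", "library": "numpy",
             "version": "1.22.4", "identifier": "numpy.convolve"},
            # Delayed reference: the target may exist in another library.
            {"kind": "delayed", "identifier": "scipy.signal.fftconvolve"},
        ],
        "source": {"file": "examplepkg/stats.py", "line": 42},
    }

Because an absolute reference pins a library and version while a delayed reference does not, the rendering step can resolve the latter against whatever IRD bundles happen to be installed locally.
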
78                                                                                                  PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)




Fig. 2: Sketch representing how to build documentation with Papyri. Step 1: Each project builds an IRD bundle that contains semantic
information about the project documentation. Step 2: the IRD bundles are published online. Step 3: users install IRD bundles locally on their
machine, pages get crosslinked, indexed, etc. Step 4: IDEs render documentation on-the-fly, taking into consideration users’ preferences.


IRD bundles: Once a library has collected IRD representations for all documentation items (functions, classes, narrative sections, tutorials, examples), Papyri consolidates them into what we will refer to as IRD bundles. A bundle gathers all IRD files and metadata for a single version of a library.5 Bundles are a convenient unit to speak about publication, installation, or update of a given library's documentation files.

   5. One could have IRD bundles not attached to a particular library. For example, this can be done if an author wishes to provide only a set of examples or tutorials. We will not discuss this case further here.

Unlike package installation, IRD bundles do not have the notion of dependencies. Thus, a fully fledged package manager is not necessary, and one can simply download the corresponding files and unpack them at the installation phase.

Additionally, IRD bundles for multiple versions of the same library (or conflicting libraries) are not inherently problematic, as they can be shared across multiple environments.

From a security standpoint, installing IRD bundles does not require the execution of arbitrary code. This is a critical element for adoption in deployments. There is also an opportunity to provide localized variants at IRD installation time (IRD bundle translations have not been explored exhaustively at the moment).

IRD and high level usage

Papyri-based documentation involves three broad categories of stakeholders (library maintainers, end-users, IDE developers) and processes. This leads to certain requirements for IRD files and bundles.

On the maintainers' side, the goal is to ensure that Papyri can build IRD files and publish IRD bundles. Creation of IRD files and bundles is the most computationally intensive step. It may require complex dependencies or specific plugins, so this can be a multi-step process, or one can use external tooling (not related to Papyri nor using Python) to create them. Visual appearance and rendering of documentation are not taken into account in this process. Overall, building IRD files and bundles takes about the same amount of time as running a full Sphinx build. The limiting factor is often associated with executing library examples and code snippets. For example, building the SciPy & NumPy documentation IRD files on a 2021 Macbook Pro M1 (base model), including executing examples in most docstrings and type inferring most examples (with most variables semantically inferred), can take several minutes.

End-users are responsible for installing desired IRD bundles. In most cases, these will be IRD bundles from already installed libraries. While Papyri is not currently integrated with package managers or IDEs, one could imagine this process being automatic, or on demand. This step should be fairly efficient, as it mostly requires downloading and unpacking IRD files.

Finally, IDE developers want to make sure IRD files can be properly rendered and browsed by their users when requested. This may potentially take into account users' preferences, and may provide added value such as indexing, searching, and bookmarks, as seen in rustdoc or devdocs.io.

Current implementation

We present here some of the technological choices made in the current Papyri implementation. At the moment, it only targets a subset of the projects and users that could make use of IRD files and bundles. As a consequence, it is constrained in order to minimize the current scope and development effort. Understanding the implementation is not necessary to use Papyri, either as a project maintainer or as a user, but it can help understanding some of the current limitations.

Additionally, nothing prevents alternative and complementary implementations with different choices: as long as other implementations can produce (or consume) IRD bundles, they should be perfectly compatible and work together.

The following sections are thus mostly informative, to help understand the state of the current code base. In particular, we restricted ourselves to:

    •   producing IRD bundles for the core scientific Python projects (NumPy, SciPy, Matplotlib...);
    •   rendering IRD documentation for a single user on their local machine.

Finally, some of the technological choices have no other justification than the main developer having an interest in them, or making iterations on the IRD format and main code base faster.

IRD files generation

The current implementation of Papyri only targets some compatibility with Sphinx (a website and PDF documentation builder), reStructuredText (RST) as the narrative documentation syntax, and Numpydoc (both a project and a standard for docstring formatting).

These are widely used by a majority of the core scientific Python ecosystem, and thus having Papyri and IRD bundles compatible with existing projects is critical. We estimate that about 85%-90% of current documentation pages built with Sphinx, RST and Numpydoc can be built with Papyri. Future work includes extensions to be compatible with MyST (a project to bring markdown syntax to Sphinx), but this is not a priority.

     To understand RST syntax in narrative documentation, RST documents need to be parsed. To do so, Papyri uses the tree-sitter [TS] and tree-sitter-rst [TSRST] projects, allowing us to extract an "Abstract Syntax Tree" (AST) from the text files. When using tree-sitter, AST nodes contain byte offsets into the original text buffer, so one can easily "unparse" an AST node when necessary. This is relatively convenient for handling custom directives and edge cases (for instance, when projects rely on a loose definition of the RST syntax). Let us provide an example: RST directives are usually of the form:
.. directive:: arguments

     body
While technically there is no space before the ::, Docutils and Sphinx will not create errors when building the documentation. Due to our choice of a rigid (but unified) structure, we use tree-sitter, which indicates an error node if there is an extra space. This allows us to check for error nodes, unparse, add heuristics to restore a proper syntax, then parse again to obtain the new node.

Alternatively, a number of directives like warnings, notes and admonitions still contain valid RST. Instead of storing the directive with the raw text, we parse the full document (potentially finding invalid syntax), and unparse to the raw text only if the directive requires it.

Serialisation of data structures into IRD files currently uses a custom serialiser. Future work may include swapping to msgspec [msgspec]. The AST objects are completely typed; however, they contain a number of unions and sequences of unions. It turns out that many frameworks like pydantic [pydantic] do not support sequences of unions where each item in the union may be of a different type. To our knowledge, there are only a few other documentation-related projects that treat the AST as an intermediate object with a stable format that can be manipulated by external tools. The most popular one is Pandoc [pandoc], a project meant to convert from many document types to many others.

The current Papyri strategy is to type-infer all code examples with Jedi [JEDI], and to pre-syntax-highlight using Pygments when possible.

IRD File Installation

Download and installation of IRD files is done concurrently using httpx [httpx], with Trio [Trio] as an async framework, allowing us to download files concurrently.

The current implementation of Papyri targets Python documentation and is written in Python. We can then query the installed versions of Python libraries and infer the appropriate version of the requested documentation. At the moment, the implementation tentatively guesses relevant library versions when the exact version number is missing from the install command.

For convenience and performance, IRD bundles are post-processed and stored in a different format. For local rendering, we mostly need to perform the following operations:

    1)   query graph information about cross-links across documents;
    2)   render a single page;
    3)   access raw data (e.g. images).

We also assume that IRD files may be infrequently updated, that disk space is limited, and that installing or running services (like a database server) is not necessarily possible. This provides an adapted framework to test Papyri on an end-user machine.

With those requirements, we decided to use a combination of SQLite (an in-process database engine), Concise Binary Object Representation (CBOR) and raw storage, to better reflect the access patterns (see Figure 3).

Fig. 3: Sketch representing how Papyri stores information in 3 different formats depending on access patterns: a SQLite database for relationship information, on-disk CBOR files for more compact storage of IRD, and raw files (e.g. images). A GraphStore API abstracts all access and takes care of maintaining consistency.

SQLite allows us to easily query for object existence and graph information (relationships between objects) at runtime. It is optimized for infrequent read access. Currently, many queries are done at runtime, when rendering documentation. The goal is to move most of the SQLite information-resolving step (such as looking for inter-library links) to installation time, once the codebase and the IRD format have stabilized. SQLite is less strongly typed than other relational or graph databases and needs custom logic, but it is ubiquitous on all systems and does not need a separate server process, making it an easy choice of database.

CBOR is a more space-efficient alternative to JSON. In particular, keys in IRD are often highly redundant and can be highly optimized when using CBOR. Storing IRD in CBOR thus reduces disk usage and can also allow faster deserialization without requiring potentially CPU-intensive compression/decompression. This is a good compromise for potentially low-performance users' machines.

Raw storage is used for binary blobs which need to be accessed without further processing. This typically refers to images, and raw storage can be accessed with standard tools like image viewers.

Finally, access to all of these resources is provided via an internal GraphStore API, which is agnostic of the backend but ensures consistency of operations like adding/removing/replacing documents. Figure 3 summarizes this process.

Of course, the above choices depend on the context in which documentation is rendered and viewed. For example, an online archive intended to browse documentation for multiple projects and versions may decide to use an actual graph database for object relationships, and store other files on a Content Delivery Network or blob storage for random access.
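The split between relational queries, compact document blobs, and raw files can be illustrated with a short, self-contained sketch. This is not Papyri's actual GraphStore implementation; the table, field and file names are invented for the example, and the cbor2 package simply stands in for a CBOR library.

# Illustrative sketch only -- not Papyri's GraphStore API.
# Relationship data goes into SQLite, document bodies are CBOR blobs,
# and images would stay as raw files on disk.
import sqlite3
import cbor2  # third-party: pip install cbor2

con = sqlite3.connect("papyri-sketch.db")
con.execute("CREATE TABLE IF NOT EXISTS links (src TEXT, dest TEXT)")
con.execute("INSERT INTO links VALUES (?, ?)",
            ("numpy.linspace", "numpy.arange"))
con.commit()

# A hypothetical document payload, serialized compactly with CBOR.
doc = {"qualified_name": "numpy.linspace", "sections": {"Examples": []}}
with open("numpy.linspace.cbor", "wb") as f:
    f.write(cbor2.dumps(doc))

# At render time: resolve outgoing links, then load the document body.
dests = [row[0] for row in
         con.execute("SELECT dest FROM links WHERE src = ?",
                     ("numpy.linspace",))]
with open("numpy.linspace.cbor", "rb") as f:
    page = cbor2.loads(f.read())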

Documentation Rendering

The current Papyri implementation includes a number of rendering engines (presented below). Each of them mostly consists of fetching a single page with its metadata, walking the IRD AST tree, and rendering each node according to users' preferences.

    •   An ASCII terminal renderer uses Jinja2 [Jinja2]. This can be useful for piping documentation to other tools like grep, less or cat, and for working in a highly restricted environment, making sure that reading the documentation is coherent. This can serve as a proxy for screen reading.
    •   A textual user interface browser renders using urwid. Navigation within the terminal is possible, one can reflow long lines on resized windows, and even open image files in external viewers. Nonetheless, several bugs have been encountered in urwid. The project aims at replacing the CLI IPython question mark operator (obj?) interface (which currently only shows raw docstrings) with a new one written with Rich/Textual. For this interface, having images stored raw on disk is useful, as it allows us to directly call a system image viewer to display them.
    •   A JIT rendering engine uses Jinja2, Quart [quart] and Trio. Quart is an async version of Flask [flask]. This option contains the most features and is therefore the main one used for development. This environment lets us iterate over the rendering engine rapidly. When exploring the user interface design and navigation, we found that a plain list of back references has limited uses. Indeed, it can be challenging to judge the relevance of back references, as well as their relationship to each other. By playing with a network graph visualisation (see Figure 5), we can identify clusters of similar information within back references. Of course, this identification has limits, especially when pages have a large number of back references (where the graph becomes too busy). This also illustrates a strength of the Papyri architecture: creating this network visualization did not require any regeneration of the documentation; one simply updates the template and re-renders the current page as needed.
    •   A static AOT renderer for all the pages that can be rendered ahead of time uses the same class as the JIT rendering. Basically, it loops through all entries in the SQLite database and renders each item independently. This renderer is mostly used for exhaustive testing and performance measurements of Papyri. It can render most of the API documentation of IPython, Astropy [astropy], Dask and Distributed [Dask], Matplotlib [MPL], [MPL-DOI], NetworkX [NX], NumPy [NP], Pandas, Papyri, SciPy, scikit-image and others: ~28000 pages in ~60 seconds (that is, ~450 pages/s on a recent Macbook Pro M1).

For all of the above renderers, profiling shows that documentation rendering is mostly limited by object de-serialisation from disk and the Jinja2 templating engine. In the early project development phase, we attempted to write a static HTML renderer in a compiled language (Rust, using compiled and type-checked templates). This provided a speedup of roughly a factor of 10; however, its implementation is now out of sync with the main Papyri code base.

Finally, a JupyterLab extension is currently in progress. The documentation presents itself as a side panel and is capable of basic browsing and rendering (see Figure 1 and Figure 4). The extension uses TypeScript, React and native JupyterLab components. Future goals include improving or replacing the JupyterLab question mark operator (obj?) and the JupyterLab Inspector (when possible). A screenshot of the current development version of the JupyterLab extension can be seen in Figure 4.

Challenges

We mentioned above some limitations we encountered (in rendering usage, for instance) and what will be done in the future to address them. We describe below some limitations related to syntax choices, and broader opportunities that arise from the Papyri project.

Limitations

The decoupling of the building and rendering phases is key in Papyri. However, it requires us to come up with a method that uniquely identifies each object. In particular, this is essential in order to link to any object's documentation without accessing the IRD bundles built from all the libraries. To that aim, we use the fully qualified name of an object: each object is identified by the concatenation of the module in which it is defined with its local name. Nonetheless, several particular cases need specific treatment.

    •   To mirror the Python syntax, it is easy to use . to concatenate both parts. Unfortunately, that leads to some ambiguity when modules re-export objects that have the same name as a submodule. For example, if one writes

        # module mylib/__init__.py

        from .mything import mything

        then mylib.mything is ambiguous with respect to both the mything submodule and the re-exported object. In future versions, the chosen convention will use : as a module/name separator (see the sketch after this list).
    •   Decorated functions or other dynamic approaches to exposing functions to users end up having <locals> in their fully qualified names, which is invalid.
    •   Many built-in functions (np.sin, np.cos, etc.) do not have a fully qualified name that can be extracted by object introspection. We believe it should be possible to identify those via other means, like a docstring hash (to be explored).
    •   Fully qualified names are often not canonical names (i.e. the name typically used for import). While we made efforts to create a mapping from one to the other, finding the canonical name automatically is not always straightforward.
    •   There are also challenges with case sensitivity. For example, on MacOS file systems, a couple of objects may unfortunately refer to the same IRD file on disk. To address this, a case-sensitive hash is appended at the end of the filename.
    •   Many libraries have a syntax that looks right once rendered to HTML while not following proper RST syntax, or a syntax that relies on specificities of Docutils and Sphinx rendering/parsing.
    •   Many custom directive plugins cannot be reused from Sphinx. These will need to be reimplemented.
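A minimal sketch of the separator convention mentioned in the first bullet (illustrative only, not Papyri's actual resolution code): with ., mylib.mything cannot be told apart from the submodule, while a module:name identifier keeps the two parts unambiguous.

# Illustrative only -- not Papyri's actual resolver.
def split_qualified(qualname: str) -> tuple[str, str]:
    """Split a 'module:object' identifier into module and object parts."""
    module, _, name = qualname.partition(":")
    return module, name

# The module part may itself contain dots without creating ambiguity.
assert split_qualified("mylib:mything") == ("mylib", "mything")
assert split_qualified("mylib.mything:mything") == ("mylib.mything", "mything")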




Fig. 5: Local graph (made with D3.js [D3js]) representing the connections among the most important nodes around the current page, across many libraries, when viewing numpy.ndarray. Nodes are sized with respect to the number of incoming links, and colored with respect to their library. This graph is generated at rendering time, and is updated depending on the libraries currently installed. It helps identify related functions and documentation, but can become challenging to read for highly connected items, as seen here for numpy.ndarray.

Fig. 4: Example of the extended view of the Papyri documentation in the JupyterLab extension (here for SciPy). Code examples can now include plots. Most tokens in the examples are linked to the corresponding page. An early navigation bar is visible at the top.

Future possibilities

Beyond what has been presented in this paper, there are several opportunities to improve and extend what Papyri can offer the scientific Python ecosystem.

The first area is the ability to build IRD bundles on Continuous Integration platforms. Services like GitHub Actions, Azure Pipelines and many others are already set up to test packages. We hope to leverage this infrastructure to build IRD files and make them available to users.

A second area is the hosting of intermediate IRD files. While the current prototype is hosted by an HTTP index using GitHub Pages, this is likely not a sustainable hosting platform, as disk space is limited. To our knowledge, IRD files are smaller in size than HTML documentation, and we hope that other platforms like Read the Docs can be leveraged. This could provide a single domain that renders the documentation for multiple libraries, thus avoiding the display of many library subdomains and contributing to a more unified experience for users.

It should also be possible for projects to avoid much of the dynamic docstring interpolation that is used to document *args and **kwargs. This would make sources easier to read, and potentially provide some speedup at library import time.

Once a library and its users rely on an IDE that supports Papyri for documentation, the docstring syntax could be exchanged for markdown.

As IRD files are structured, it should be feasible to provide cross-version information in the documentation. For example, if one installs multiple versions of IRD bundles for a library, then, assuming the user does not use the latest version, the renderer could inspect IRD files from previous/future versions to indicate the range of versions for which the documentation has not changed. With additional effort, it should be possible to infer when a parameter was removed, or will be removed, or to simply display the difference between two versions.
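As a purely hypothetical illustration of this cross-version idea (the field and parameter names below are invented and do not reflect the actual IRD format), comparing the documented parameters of one object across two installed bundle versions reduces to a set difference:

# Hypothetical sketch -- invented field names, not the actual IRD format.
old_doc = {"version": "1.0", "params": {"start", "stop", "num"}}
new_doc = {"version": "2.0", "params": {"start", "stop", "num", "device"}}

added = new_doc["params"] - old_doc["params"]
removed = old_doc["params"] - new_doc["params"]
print(f"added in {new_doc['version']}: {sorted(added)}")      # ['device']
print(f"removed in {new_doc['version']}: {sorted(removed)}")  # []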

Conclusion

To address some of the current limitations in documentation accessibility, building, and maintenance, we have provided a new documentation framework called Papyri. We presented its features and underlying implementation choices (such as crosslink maintenance, decoupling the building and rendering phases, enriching the rendering features, and using the IRD format to create a unified syntax structure). While the project is still at an early stage, clear impacts can already be seen on the availability of high-quality documentation for end-users and on the workload reduction for maintainers. Building the IRD format opened a wide range of technical possibilities, and contributes to improving users' experience (and therefore the success of the scientific Python ecosystem). This may become necessary for users to navigate an exponentially growing ecosystem.

Acknowledgments

The authors want to thank S. Gallegos (author of tree-sitter-rst), J. L. Cano Rodríguez and E. Holscher (Read the Docs), C. Holdgraf (2i2c), B. Granger and F. Pérez (Jupyter Project), T. Allard and I. Presedo-Floyd (QuanSight) for their useful feedback and help on this project.

Funding

M. B. received a 2-year grant from the Chan Zuckerberg Initiative (CZI) Essential Open Source Software for Science (EOSS) program, grant EOSS4-0000000017, via the NumFOCUS 501(c)(3) nonprofit, to develop the Papyri project.

REFERENCES
[AOT]                  https://en.wikipedia.org/wiki/Ahead-of-time_compilation
[CFRG]                 conda-forge community. (2015). The conda-forge Project: Community-based Software Distribution Built on the conda Package Format and Ecosystem. Zenodo. http://doi.org/10.5281/zenodo.4774216
[CODEMETA]             https://codemeta.github.io/
[D3js]                 https://d3js.org/
[DOCREPR]              https://github.com/spyder-ide/docrepr
[DT]                   https://diataxis.fr/
[Dask]                 Dask Development Team (2016). Dask: Library for dynamic task scheduling. https://dask.org
[IR]                   https://en.wikipedia.org/wiki/Intermediate_representation
[JEDI]                 https://github.com/davidhalter/jedi
[JIT]                  https://en.wikipedia.org/wiki/Just-in-time_compilation
[JPYBOOK]              https://jupyterbook.org/
[Jinja2]               https://jinja.palletsprojects.com/
[LTO]                  https://en.wikipedia.org/wiki/Interprocedural_optimization
[MPL-DOI]              https://doi.org/10.5281/zenodo.6513224
[MPL]                  J. D. Hunter, "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
[MYST]                 https://myst-parser.readthedocs.io/en/latest/
[NPDOC]                https://numpydoc.readthedocs.io/en/latest/format.html
[NP]                   Harris, C. R., Millman, K. J., van der Walt, S. J., et al. Array programming with NumPy. Nature 585, 357-362 (2020). DOI: 10.1038/s41586-020-2649-2
[NX]                   Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, "Exploring network structure, dynamics, and function using NetworkX", in Proceedings of the 7th Python in Science Conference (SciPy 2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11-15, Aug 2008.
[Papyri]               https://github.com/jupyter/papyri
[RTD-theme]            https://sphinx-rtd-theme.readthedocs.io/en/stable/
[RTD]                  https://readthedocs.org/
[SCU]                  https://en.wikipedia.org/wiki/Single_Compilation_Unit
[SP]                   Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-272. 10.1038/s41592-019-0686-2
[Spyder]               https://www.spyder-ide.org/
[TSRST]                https://github.com/stsewd/tree-sitter-rst
[TS]                   https://tree-sitter.github.io/tree-sitter/
[astropy]              The Astropy Project: Building an inclusive, open-science project and status of the v2.0 core package. https://doi.org/10.48550/arXiv.1801.02634
[docutils]             https://docutils.sourceforge.io/
[flask]                https://flask.palletsprojects.com/en/2.1.x/
[httpx]                https://www.python-httpx.org/
[mkdocs]               https://www.mkdocs.org/
[msgspec]              https://pypi.org/project/msgspec
[pandoc]               https://pandoc.org/
[pydantic]             https://pydantic-docs.helpmanual.io/
[pydata-sphinx-theme]  https://pydata-sphinx-theme.readthedocs.io/en/stable/
[quart]                https://pgjones.gitlab.io/quart/
[sphinx-copybutton]    https://sphinx-copybutton.readthedocs.io/en/latest/
[sphinx]               https://www.sphinx-doc.org/en/master/
[Trio]                 https://trio.readthedocs.io/




  Bayesian Estimation and Forecasting of Time Series
                    in statsmodels
                                                                          Chad Fulton‡∗






Abstract—Statsmodels, a Python library for statistical and econometric analysis, has traditionally focused on frequentist inference, including in its models for time series data. This paper introduces the powerful features for Bayesian inference of time series models that exist in statsmodels, with applications to model fitting, forecasting, time series decomposition, data simulation, and impulse response functions.

Index Terms—time series, forecasting, Bayesian inference, Markov chain Monte Carlo, statsmodels

Introduction

Statsmodels [SP10] is a well-established Python library for statistical and econometric analysis, with support for a wide range of important model classes, including linear regression, ANOVA, generalized linear models (GLM), generalized additive models (GAM), mixed effects models, and time series models, among many others. In most cases, model fitting proceeds by using frequentist inference, such as maximum likelihood estimation (MLE). In this paper, we focus on the class of time series models [MPS11], support for which has grown substantially in statsmodels over the last decade. After introducing several of the most important new model classes – which are by default fitted using MLE – and their features – which include forecasting, time series decomposition and seasonal adjustment, data simulation, and impulse response analysis – we describe the powerful functions that enable users to apply Bayesian methods to a wide range of time series models.

Support for Bayesian inference in Python outside of statsmodels has also grown tremendously, particularly in the realm of probabilistic programming, and includes powerful libraries such as PyMC3 [SWF16], PyStan [CGH+17], and TensorFlow Probability [DLT+17]. Meanwhile, ArviZ [KCHM19] provides many excellent tools for associated diagnostics and visualizations. The aim of these libraries is to provide support for Bayesian analysis of a large class of models, and they make available both advanced techniques, including auto-tuning algorithms, and flexible model specification. By contrast, here we focus on simpler techniques. However, while the libraries above do include some support for time series models, this has not been their primary focus. As a result, introducing Bayesian inference for the well-developed stable of time series models in statsmodels, and providing access to the rich associated feature set already mentioned, presents a complementary option to these more general-purpose libraries.1

* Corresponding author: chad.t.fulton@frb.gov
‡ Federal Reserve Board of Governors

Copyright © 2022 Chad Fulton. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

   1. In addition, it is possible to combine the sampling algorithms of PyMC3 with the time series models of statsmodels, although we will not discuss this approach in detail here. See, for example, https://www.statsmodels.org/v0.13.0/examples/notebooks/generated/statespace_sarimax_pymc3.html.

Time series analysis in statsmodels

A time series is a sequence of observations ordered in time, and time series data appear commonly in statistics, economics, finance, climate science, control systems, and signal processing, among many other fields. One distinguishing characteristic of many time series is that observations that are close in time tend to be more correlated, a feature known as autocorrelation. While successful analyses of time series data must account for this, statistical models can harness it to decompose a time series into trend, seasonal, and cyclical components, produce forecasts of future data, and study the propagation of shocks over time.

We now briefly review the models for time series data that are available in statsmodels and describe their features.2

   2. In addition to statistical models, statsmodels also provides a number of tools for exploratory data analysis, diagnostics, and hypothesis testing related to time series data; see https://www.statsmodels.org/stable/tsa.html.

Exponential smoothing models

Exponential smoothing models are constructed by combining one or more simple equations that each describe some aspect of the evolution of univariate time series data. While originally somewhat ad hoc, these models can be defined in terms of a proper statistical model (for example, see [HKOS08]). They have enjoyed considerable popularity in forecasting (for example, see the implementation in R described by [HA18]). A prototypical example that allows for trending data and a seasonal component – often known as the additive "Holt-Winters' method" – can be written as

    l_t = α (y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1})
    b_t = β (l_t − l_{t−1}) + (1 − β) b_{t−1}
    s_t = γ (y_t − l_{t−1} − b_{t−1}) + (1 − γ) s_{t−m}

where l_t is the level of the series, b_t is the trend, s_t is the seasonal component of period m, and α, β, γ are parameters of the model. When augmented with an error term with some given probability distribution (usually Gaussian), likelihood-based inference can be used to estimate the parameters.
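As a quick numerical illustration of these recursions (the parameter and state values below are arbitrary; in practice α, β and γ are estimated), a single update step can be written out directly:

# One step of the additive Holt-Winters recursions above, with
# arbitrary illustrative values (seasonal period m = 4 assumed).
alpha, beta, gamma = 0.5, 0.1, 0.2

y_t = 4.1                                  # new observation
l_prev, b_prev, s_prev_m = 3.8, 0.05, 0.2  # l_{t-1}, b_{t-1}, s_{t-m}

l_t = alpha * (y_t - s_prev_m) + (1 - alpha) * (l_prev + b_prev)
b_t = beta * (l_t - l_prev) + (1 - beta) * b_prev
s_t = gamma * (y_t - l_prev - b_prev) + (1 - gamma) * s_prev_m
print(l_t, b_t, s_t)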

In statsmodels, additive exponential smoothing models can be constructed using the statespace.ExponentialSmoothing class.3 The following code shows how to apply the additive Holt-Winters model above to quarterly data on consumer prices:

import numpy as np
import statsmodels.api as sm

# Load data
mdata = sm.datasets.macrodata.load().data
# Compute annualized consumer price inflation
y = np.log(mdata['cpi']).diff().iloc[1:] * 400

# Construct the Holt-Winters model
model_hw = sm.tsa.statespace.ExponentialSmoothing(
    y, trend=True, seasonal=12)

   3. A second class, ETSModel, can also be used for both additive and multiplicative models, and can exhibit superior performance with maximum likelihood estimation. However, it lacks some of the features relevant for Bayesian inference discussed in this paper.

Structural time series models

Structural time series models, introduced by [Har90] and also sometimes known as unobserved components models, similarly decompose a univariate time series into trend, seasonal, cyclical, and irregular components:

    y_t = µ_t + γ_t + c_t + ε_t

where µ_t is the trend, γ_t is the seasonal component, c_t is the cyclical component, and ε_t ∼ N(0, σ²) is the error term. However, this equation can be augmented in many ways, for example to include explanatory variables or an autoregressive component. In addition, there are many possible specifications for the trend, seasonal, and cyclical components, so that a wide variety of time series characteristics can be accommodated. In statsmodels, these models can be constructed from the UnobservedComponents class; a few examples are given in the following code:

# "Local level" model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# "Local linear trend", with seasonal component
model_lltrend = sm.tsa.UnobservedComponents(
    y, 'lltrend', seasonal=4)

These models have become popular for time series analysis and forecasting, as they are flexible and the estimated components are intuitive. Indeed, Google's CausalImpact library [BGK+15] uses a Bayesian structural time series approach directly, and Facebook's Prophet library [TL17] uses a conceptually similar framework and is estimated using PyStan.

Autoregressive moving-average models

Autoregressive moving-average (ARMA) models, ubiquitous in time series applications, are well-supported in statsmodels, including their generalizations, abbreviated as "SARIMAX", that allow for integrated time series data, explanatory variables, and seasonal effects.4 A general version of this model, excluding integration, can be written as

    y_t = x_t β + ξ_t
    ξ_t = φ_1 ξ_{t−1} + · · · + φ_p ξ_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}

where ε_t ∼ N(0, σ²). These are constructed in statsmodels with the ARIMA class; the following code shows how to construct a variety of autoregressive moving-average models for the consumer price data:

# AR(2) model
model_ar2 = sm.tsa.ARIMA(y, order=(2, 0, 0))

# ARMA(1, 1) model with explanatory variable
X = mdata['realint']
model_arma11 = sm.tsa.ARIMA(
    y, order=(1, 0, 1), exog=X)

# SARIMAX(p, d, q)x(P, D, Q, s) model
model_sarimax = sm.tsa.ARIMA(
    y, order=(p, d, q), seasonal_order=(P, D, Q, s))

   4. Note that in statsmodels, models with explanatory variables are in the form of "regression with SARIMA errors".

While this class of models often produces highly competitive forecasts, it does not produce a decomposition of a time series into, for example, trend and seasonal components.

Vector autoregressive models

While the SARIMAX models above handle univariate series, statsmodels also has support for the multivariate generalization to vector autoregressive (VAR) models.5 These models are written

    y_t = ν + Φ_1 y_{t−1} + · · · + Φ_p y_{t−p} + ε_t

where y_t is now considered as an m × 1 vector. As a result, the intercept ν is also an m × 1 vector, the coefficients Φ_i are each m × m matrices, and the error term is ε_t ∼ N(0_m, Ω), with Ω an m × m matrix. These models can be constructed in statsmodels using the VARMAX class, as follows:6

# Multivariate dataset
z = (np.log(mdata[['realgdp', 'realcons', 'cpi']])
       .diff().iloc[1:])

# VAR(1) model
model_var = sm.tsa.VARMAX(z, order=(1, 0))

   5. statsmodels also supports vector moving-average (VMA) models using the same model class as described here for the VAR case, but, for brevity, we do not explicitly discuss them here.
   6. A second class, VAR, can also be used to fit VAR models, using least squares. However, it lacks some of the features relevant for Bayesian inference discussed in this paper.

Dynamic factor models

statsmodels also supports a second model for multivariate time series: the dynamic factor model (DFM). These models, often used for dimension reduction, posit a few unobserved factors, with autoregressive dynamics, that are used to explain the variation in the observed dataset. In statsmodels, there are two model classes, DynamicFactor and DynamicFactorMQ, that can fit versions of the DFM. Here we focus on the DynamicFactor class, for which the model can be written

    y_t = Λ f_t + ε_t
    f_t = Φ_1 f_{t−1} + · · · + Φ_p f_{t−p} + η_t

Here again, the observation y_t is assumed to be m × 1, but the factors f_t are k × 1, where it is possible that k << m. As before, we assume conformable coefficient matrices and Gaussian errors.

The following code shows how to construct a DFM in statsmodels:

# DFM with 2 factors that evolve as a VAR(3)
model_dfm = sm.tsa.DynamicFactor(
    z, k_factors=2, factor_order=3)
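Each of the classes above shares the same estimation and forecasting interface, which the next section builds on. As a brief sketch (using the Holt-Winters model constructed earlier; the number of forecast periods is arbitrary), the default maximum-likelihood workflow looks like this:

# Fit the additive Holt-Winters model from above by maximum likelihood
# (the default approach in statsmodels) and produce forecasts.
results_hw = model_hw.fit()
print(results_hw.summary())
fcast_hw = results_hw.forecast(8)  # forecast eight quarters ahead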




                                         Fig. 1: Selected functionality of state space models in statsmodels.


Linear Gaussian state space models

In statsmodels, each of the model classes introduced above (statespace.ExponentialSmoothing, UnobservedComponents, ARIMA, VARMAX, DynamicFactor, and DynamicFactorMQ) is implemented as part of a broader class of models, referred to as linear Gaussian state space models (hereafter, for brevity, simply "state space models" or SSM). This class of models can be written as

    y_t     = d_t + Z_t α_t + ε_t          ε_t ∼ N(0, H_t)
    α_{t+1} = c_t + T_t α_t + R_t η_t      η_t ∼ N(0, Q_t)

where α_t represents an unobserved vector containing the "state" of the dynamic system. In general, the model is multivariate, with y_t and ε_t m × 1 vectors, α_t k × 1, and η_t r × 1.

Powerful tools exist for state space models to estimate the values of the unobserved state vector, compute the value of the likelihood function for frequentist inference, and perform posterior sampling for Bayesian inference. These tools include the celebrated Kalman filter and smoother and a simulation smoother, all of which are important for conducting Bayesian inference for these models.7 The implementation in statsmodels largely follows the treatment in [DK12], and is described in more detail in [Ful15].

   7. Statsmodels currently contains two implementations of simulation smoothers for the linear Gaussian state space model. The default is the "mean correction" simulation smoother of [DK02]. The precision-based simulation smoother of [CJ09] can alternatively be used by specifying method='cfa' when creating the simulation smoother object.

In addition to these key tools, state space models also admit general implementations of useful features such as forecasting, data simulation, time series decomposition, and impulse response analysis. As a consequence, each of these features extends to each of the time series models described above. Figure 1 presents a diagram showing how to produce these features, and the code below briefly introduces a subset of them.

# Construct the model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# Construct a simulation smoother
sim_ll = model_ll.simulation_smoother()

# Parameter values (variance of error and
# variance of level innovation, respectively)
params = [4, 0.75]

# Compute the log-likelihood of these parameters
llf = model_ll.loglike(params)

# `smooth` applies the Kalman filter and smoother
# with a given set of parameters and returns a
# Results object
results_ll = model_ll.smooth(params)

# Produce forecasts for the next 4 periods
fcast = results_ll.forecast(4)

# Produce a draw from the posterior distribution
# of the state vector
sim_ll.simulate()
draw = sim_ll.simulated_state

Nearly identical code could be used for any of the model classes introduced above, since they are all implemented as part of the same state space model framework. In the next section, we show how these features can be used to perform Bayesian inference with these models.

Bayesian inference via Markov chain Monte Carlo

We begin by giving a cursory overview of the key elements of Bayesian inference required for our purposes here.8 In brief, the Bayesian approach stems from Bayes' theorem, in which the posterior distribution for an object of interest is derived as proportional to the product of a prior distribution and the likelihood function:

    p(A|B)     ∝    p(B|A)    ×    p(A)
    posterior       likelihood      prior

Here, we will be interested in the posterior distribution of the parameters of our model and of the unobserved states, conditional on the chosen model specification and the observed time series data. While in most cases the form of the posterior cannot be derived analytically, simulation-based methods such as Markov chain Monte Carlo (MCMC) can be used to draw samples that approximate the posterior distribution nonetheless. While PyMC3, PyStan, and TensorFlow Probability emphasize Hamiltonian Monte Carlo (HMC) and no-U-turn sampling (NUTS) MCMC methods, we focus on the simpler random walk Metropolis-Hastings (MH) and Gibbs sampling (GS) methods. These are standard MCMC methods that have enjoyed great success in time series applications and which are simple to implement, given the state space framework already available in statsmodels. In addition, the ArviZ library is designed to work with MCMC output from any source, and we can easily adapt it to our use.

With either Metropolis-Hastings or Gibbs sampling, our procedure will produce a sequence of sample values (of parameters and/or the unobserved state vector) that approximate draws from the posterior distribution arbitrarily well, as the length
of the chain of samples becomes very large.
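Regardless of which of the two samplers produces it, the resulting chain is a plain NumPy array, so it can be handed to ArviZ for convergence checks before any posterior summaries are reported. The following is a minimal sketch, assuming samples is a one-dimensional array of draws like those produced by the samplers below (the name is illustrative):

import arviz as az

idata = az.convert_to_inference_data(samples)
print(az.summary(idata))  # posterior mean/sd, effective sample size, r_hat
az.plot_trace(idata)      # visual check that the chain is mixing well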

Random walk Metropolis-Hastings
In random walk Metropolis-Hastings (MH), we begin with an arbi-
trary point as the initial sample, and then iteratively construct new
samples in the chain as follows. At each iteration, (a) construct a
proposal by perturbing the previous sample by a Gaussian random
variable, and then (b) accept the proposal with some probability.
If a proposal is accepted, it becomes the next sample in the chain,
while if it is rejected then the previous sample value is carried over.
Here, we show how to implement Metropolis-Hastings estimation
of the variance parameter in a simple model, which only requires
the use of the log-likelihood computation introduced above.

import arviz as az
from scipy import stats

# Construct the model
model_rw = sm.tsa.UnobservedComponents(y, 'rwalk')

# Specify the prior distribution. With MH, this
# can be freely chosen by the user
prior = stats.uniform(0.0001, 100)

# Specify the Gaussian perturbation distribution
perturb = stats.norm(scale=0.1)

# Storage
niter = 100000
samples_rw = np.zeros(niter + 1)

# Initialization
samples_rw[0] = y.diff().var()
llf = model_rw.loglike(samples_rw[0])
prior_llf = prior.logpdf(samples_rw[0])

# Iterations
for i in range(1, niter + 1):
   # Compute the proposal value
   proposal = samples_rw[i - 1] + perturb.rvs()

   # Compute the acceptance probability
   proposal_llf = model_rw.loglike(proposal)
   proposal_prior_llf = prior.logpdf(proposal)
   accept_prob = np.exp(
      proposal_llf - llf
      + proposal_prior_llf - prior_llf)

   # Accept or reject the value
   if accept_prob > stats.uniform.rvs():
      samples_rw[i] = proposal
      llf = proposal_llf
      prior_llf = proposal_prior_llf
   else:
      samples_rw[i] = samples_rw[i - 1]

# Convert for use with ArviZ and plot posterior
samples_rw = az.convert_to_inference_data(
   samples_rw)
# Eliminate the first 10000 samples as burn-in;
# thin by factor of 10 to reduce autocorrelation
az.plot_posterior(samples_rw.posterior.sel(
   {'draw': np.s_[10000::10]}), kind='bin',
   point_estimate='median')

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 2.

Fig. 2: Approximate posterior distribution of variance parameter, random walk model, Metropolis-Hastings; U.S. Industrial Production.

   8. While a detailed description of these issues is out of the scope of this paper, there are many superb references on this topic. We refer the interested reader to [WH99], which provides a book-length treatment of Bayesian inference for state space models, and [KN99], which provides many examples and applications.

Gibbs sampling

Gibbs sampling (GS) is a special case of Metropolis-Hastings (MH) that is applicable when it is possible to produce draws directly from the conditional distributions of every variable, even though it is still not possible to derive the general form of the joint posterior. While this approach can be superior to random walk MH when it is applicable, the ability to derive the conditional distributions typically requires the use of a "conjugate" prior – i.e., a prior from some specific family of distributions. For example, above we specified a uniform distribution as the prior when sampling via MH, but that is not possible with Gibbs sampling. Here, we show how to implement Gibbs sampling estimation of the variance parameters, now making use of inverse Gamma priors, and the simulation smoother introduced above.

# Construct the model and simulation smoother
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
sim_ll = model_ll.simulation_smoother()

# Specify the prior distributions. With GS, we must
# choose an inverse Gamma prior for each variance
priors = [stats.invgamma(0.01, scale=0.01)] * 2

# Storage
niter = 100000
samples_ll = np.zeros((niter + 1, 2))

# Initialization
samples_ll[0] = [y.diff().var(), 1e-5]

# Iterations
for i in range(1, niter + 1):
   # (a) Update the model parameters
   model_ll.update(samples_ll[i - 1])

   # (b) Draw from the conditional posterior of
   # the state vector
   sim_ll.simulate()
   sample_state = sim_ll.simulated_state.T

   # (c) Compute / draw from conditional posterior
   # of the parameters:
   # ...observation error variance
   resid = y - sample_state[:, 0]
   post_shape = len(resid) / 2 + 0.01
   post_scale = np.sum(resid**2) / 2 + 0.01
   samples_ll[i, 0] = stats.invgamma(
      post_shape, scale=post_scale).rvs()

   # ...level error variance
   resid = sample_state[1:] - sample_state[:-1]
   post_shape = len(resid) / 2 + 0.01
   post_scale = np.sum(resid**2) / 2 + 0.01
   samples_ll[i, 1] = stats.invgamma(
      post_shape, scale=post_scale).rvs()

# Convert for use with ArviZ and plot posterior
samples_ll = az.convert_to_inference_data(
   {'parameters': samples_ll[None, ...]},
   coords={'parameter': model_ll.param_names},
   dims={'parameters': ['parameter']})
az.plot_pair(samples_ll.posterior.sel(
   {'draw': np.s_[10000::10]}), kind='hexbin');

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 3.

Fig. 3: Approximate posterior joint distribution of variance parameters, local level model, Gibbs sampling; CPI inflation.

Illustrative examples
For clarity and brevity, the examples in the previous section gave
results for simple cases. However, these basic methods carry
through to each of the models introduced earlier, including in cases
with multivariate data and hundreds of parameters. Moreover, the
Metropolis-Hastings approach can be combined with the Gibbs
sampling approach, so that if the end user wishes to use Gibbs
sampling for some parameters, they are not restricted to choose
only conjugate priors for all parameters.
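For example, the sketch below modifies the Gibbs sampler for the local level model so that the observation error variance, given a non-conjugate uniform prior, is updated with a random walk Metropolis-Hastings step conditional on the sampled states, while the level variance keeps its conjugate inverse Gamma prior and Gibbs update. The variables model_ll, sim_ll, samples_ll, niter, and y are assumed to be defined as in the Gibbs sampling example above; the prior and proposal scale here are illustrative choices.

import numpy as np
from scipy import stats

prior_obs = stats.uniform(0.0001, 100)  # non-conjugate prior for the obs. variance
perturb = stats.norm(scale=0.1)         # random walk proposal

def log_conditional(sig2, resid):
    # log of p(y | states, sig2) x p(sig2), conditional on the sampled states
    if sig2 <= 0:
        return -np.inf
    return (stats.norm(scale=sig2 ** 0.5).logpdf(resid).sum()
            + prior_obs.logpdf(sig2))

for i in range(1, niter + 1):
    # (a) Update the model with the previous draws of the parameters
    model_ll.update(samples_ll[i - 1])

    # (b) Draw the unobserved states from their conditional posterior
    sim_ll.simulate()
    sample_state = sim_ll.simulated_state.T

    # (c1) Random walk MH step for the observation error variance
    resid = y - sample_state[:, 0]
    current = samples_ll[i - 1, 0]
    proposal = current + perturb.rvs()
    accept_prob = np.exp(log_conditional(proposal, resid)
                         - log_conditional(current, resid))
    if accept_prob > stats.uniform.rvs():
        samples_ll[i, 0] = proposal
    else:
        samples_ll[i, 0] = current

    # (c2) Gibbs step for the level variance, using its conjugate
    #      inverse Gamma(0.01, 0.01) prior as in the example above
    resid = sample_state[1:, 0] - sample_state[:-1, 0]
    post_shape = len(resid) / 2 + 0.01
    post_scale = np.sum(resid ** 2) / 2 + 0.01
    samples_ll[i, 1] = stats.invgamma(post_shape, scale=post_scale).rvs()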
    In addition to sampling the posterior distributions of the parameters, this method allows sampling other objects of interest, including forecasts of observed variables, impulse response functions, and the unobserved state vector. This last possibility is especially useful in cases such as the structural time series model, in which the unobserved states correspond to interpretable elements such as the trend and seasonal components. We provide several illustrative examples of the various types of analysis that are possible.

Forecasting and Time Series Decomposition

In our first example, we apply the Gibbs sampling approach to a structural time series model in order to forecast U.S. Industrial Production and to produce a decomposition of the series into level, trend, and seasonal components. The model is

   y_t = µ_t + γ_t + ε_t          (observation equation)
   µ_t = β_t + µ_{t-1} + ζ_t      (level)
   β_t = β_{t-1} + ξ_t            (trend)
   γ_t = γ_{t-s} + η_t            (seasonal)

Here, we set the seasonal periodicity to s=12, since Industrial Production is a monthly variable. We can construct this model in Statsmodels as9

model = sm.tsa.UnobservedComponents(
   y, 'lltrend', seasonal=12)

   9. This model is often referred to as a "local linear trend" model (with additionally a seasonal component); lltrend is an abbreviation of this name.

    To produce the time-series decomposition into level, trend, and seasonal components, we will use samples from the posterior of the state vector (µ_t, β_t, γ_t) for each time period t. These are immediately available when using the Gibbs sampling approach; in the earlier example, the draw at each iteration was assigned to the variable sample_state. To produce forecasts, we need to draw from the posterior predictive distribution for horizons h = 1, 2, ..., H. This can be easily accomplished by using the simulate method introduced earlier. To be concrete, we can accomplish these tasks by modifying section (b) of our Gibbs sampler iterations as follows:




                          Fig. 6: "Causal impact" of COVID-19 on U.S. Sales in Manufacturing and Trade Industries.


# (b') Draw from the conditional posterior of
# the state vector
model.update(params[i - 1])
sim.simulate()
# save the draw for use later in time series
# decomposition
states[i] = sim.simulated_state.T

# Draw from the posterior predictive distribution
# using the `simulate` method
n_fcast = 48
fcast[i] = model.simulate(
   params[i - 1], n_fcast,
   initial_state=states[i, -1]).to_frame()

These forecasts and the decomposition into level, trend, and seasonal components are summarized in Figures 4 and 5, which show the median values along with 80% credible intervals. Notably, the intervals shown incorporate both the uncertainty arising from the stochastic terms in the model and the uncertainty that comes from having to estimate the model's parameters.10

Fig. 4: Data and forecast with 80% credible interval; U.S. Industrial Production.

Fig. 5: Estimated level, trend, and seasonal components, with 80% credible interval; U.S. Industrial Production.

   10. The popular Prophet library, [TL17], similarly uses an additive model combined with Bayesian sampling methods to produce forecasts and decompositions, although its underlying model is a GAM rather than a state space model.

Causal impacts

A closely related procedure described in [BGK+15] uses a Bayesian structural time series model to estimate the "causal impact" of some event on some observed variable. This approach stops estimation of the model just before the date of an event and produces a forecast by drawing from the posterior predictive density, using the procedure described just above. It then uses the difference between the actual path of the data and the forecast to estimate the impact of the event.
    An example of this approach is shown in Figure 6, in which we use this method to illustrate the effect of the COVID-19 pandemic on U.S. Sales in Manufacturing and Trade Industries.11

   11. In this example, we used a local linear trend model with no seasonal component.
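The impact estimate itself is then just a comparison of the observed data with the posterior predictive draws generated by the forecasting step above. A minimal sketch of that comparison, assuming fcast_draws is an array of retained posterior predictive draws with shape (number of draws, number of post-event periods) and actual holds the observed values over the same periods (both names are illustrative):

import numpy as np

# Pointwise impact: observed value minus each forecast draw
impact_draws = actual[None, :] - fcast_draws

# Posterior median and 80% credible interval for the impact path
impact_median = np.median(impact_draws, axis=0)
impact_lower, impact_upper = np.percentile(impact_draws, [10, 90], axis=0)

# Cumulative impact since the event
cum_impact = np.median(impact_draws.cumsum(axis=1), axis=0)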
Extensions

There are many extensions to the time series models presented here that are made possible when using Bayesian inference. First, it is easy to create custom state space models within the statsmodels framework. As one example, the statsmodels documentation describes how to create a model that extends the typical VAR described above with time-varying parameters.12 These custom state space models automatically inherit all the functionality described above, so that Bayesian inference can be conducted in exactly the same way.

   12. For details, see https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html.
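As a sketch of what such a custom model can look like, the class below implements a local level model by subclassing statsmodels' MLEModel, closely following the pattern used in the statsmodels documentation; the class name, parameter names, and starting values are illustrative. Because the subclass exposes loglike and simulation_smoother just like the built-in models, the Metropolis-Hastings and Gibbs sampling code shown earlier applies to it unchanged.

from statsmodels.tsa.statespace.mlemodel import MLEModel

class LocalLevel(MLEModel):
    # y_t = mu_t + e_t,   mu_t = mu_{t-1} + u_t
    def __init__(self, endog):
        super().__init__(endog, k_states=1,
                         initialization='approximate_diffuse',
                         loglikelihood_burn=1)
        self['design', 0, 0] = 1.0      # y_t loads on the level
        self['transition', 0, 0] = 1.0  # the level follows a random walk
        self['selection', 0, 0] = 1.0   # the level innovation enters directly

    @property
    def param_names(self):
        return ['sigma2.irregular', 'sigma2.level']

    @property
    def start_params(self):
        return [1.0, 1.0]

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self['obs_cov', 0, 0] = params[0]    # observation error variance
        self['state_cov', 0, 0] = params[1]  # level innovation variance

# The custom model plugs into the same tools used above
# (y is the observed series used in the earlier examples)
custom_ll = LocalLevel(y)
llf = custom_ll.loglike([4, 0.75])
sim_custom = custom_ll.simulation_smoother()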
    Second, because the general state space model available in statsmodels and introduced above allows for time-varying system matrices, it is possible using Gibbs sampling methods to introduce support for automatic outlier handling, stochastic volatility, and regime switching models, even though these are largely infeasible in statsmodels when using frequentist methods such as maximum likelihood estimation.13

   13. See, for example, [SW16] for an application of these techniques that handles outliers, [KSC98] for stochastic volatility, and [KN98] for an application to dynamic factor models with regime switching.

Conclusion

This paper introduces the suite of time series models available in statsmodels and shows how Bayesian inference using Markov chain Monte Carlo methods can be applied to estimate their parameters and produce analyses of interest, including time series decompositions and forecasts.

References

[BGK+15] Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9:247–274, 2015. doi:10.1214/14-aoas788.
[CGH+17] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1), January 2017. URL: https://www.osti.gov/pages/biblio/1430202-stan-probabilistic-programming-language, doi:10.18637/jss.v076.i01.
[CJ09] Joshua C.C. Chan and Ivan Jeliazkov. Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1-2):101–120, January 2009. URL: https://www.inderscienceonline.com/doi/abs/10.1504/IJMMNO.2009.03009.
[DK02] J. Durbin and S. J. Koopman. A simple and efficient simulation smoother for state space time series analysis. Biometrika, 89(3):603–616, August 2002. URL: http://biomet.oxfordjournals.org/content/89/3/603, doi:10.1093/biomet/89.3.603.
[DK12] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford University Press, May 2012.
[DLT+17] Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous. TensorFlow Distributions. Technical Report arXiv:1711.10604, arXiv, November 2017. URL: http://arxiv.org/abs/1711.10604, doi:10.48550/arXiv.1711.10604.
[Ful15] Chad Fulton. Estimating time series models by state space methods in Python: Statsmodels. 2015.
[HA18] Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
[Har90] Andrew C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990.
[HKOS08] Rob Hyndman, Anne B. Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, June 2008.
[KCHM19] Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo Martin. ArviZ a unified library for exploratory analysis of Bayesian models in Python. Journal of Open Source Software, 4(33):1143, 2019. URL: https://doi.org/10.21105/joss.01143, doi:10.21105/joss.01143.
[KN98] Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning Points, A New Coincident Index, and Tests of Duration Dependence Based on a Dynamic Factor Model With Regime Switching. The Review of Economics and Statistics, 80(2):188–201, May 1998. URL: https://doi.org/10.1162/003465398557447, doi:10.1162/003465398557447.
[KN99] Chang-Jin Kim and Charles R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. The MIT Press, 1999. URL: http://ideas.repec.org/b/mtp/titles/0262112388.html.
[KSC98] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. The Review of Economic Studies, 65(3):361–393, July 1998. URL: http://restud.oxfordjournals.org/content/65/3/361, doi:10.1111/1467-937X.00050.
[MPS11] Wes McKinney, Josef Perktold, and Skipper Seabold. Time Series Analysis in Python with statsmodels. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 10th Python in Science Conference, pages 107–113, 2011. doi:10.25080/Majora-ebaa42b7-012.
[SP10] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and Statistical Modeling with Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 92–96, 2010. doi:10.25080/Majora-92bf1922-011.
[SW16] James H. Stock and Mark W. Watson. Core Inflation and Trend Inflation. Review of Economics and Statistics, 98(4):770–784, March 2016. URL: http://dx.doi.org/10.1162/REST_a_00608, doi:10.1162/REST_a_00608.
[SWF16] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, April 2016. URL: https://peerj.com/articles/cs-55, doi:10.7717/peerj-cs.55.
[TL17] Sean J. Taylor and Benjamin Letham. Forecasting at scale. Technical Report e3190v2, PeerJ Inc., September 2017. URL: https://peerj.com/preprints/3190, doi:10.7287/peerj.preprints.3190v2.
[WH99] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer, New York, 2nd edition, March 1999.




 Python vs. the pandemic: a case study in high-stakes
                 software development
     Cliff C. Kerr‡§∗ , Robyn M. Stuart¶k , Dina Mistry∗∗ , Romesh G. Abeysuriyak , Jamie A. Cohen‡ , Lauren George†† ,
                         Michał Jastrzebski‡‡ , Michael Famulare‡ , Edward Wenger‡ , Daniel J. Klein‡





* Corresponding author: cliff@covasim.org
‡ Institute for Disease Modeling, Bill & Melinda Gates Foundation, Seattle, USA
§ School of Physics, University of Sydney, Sydney, Australia
¶ Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
|| Burnet Institute, Melbourne, Australia
** Twitter, Seattle, USA
†† Microsoft, Seattle, USA
‡‡ GitHub, San Francisco, USA

Copyright © 2022 Cliff C. Kerr et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—When it became clear in early 2020 that COVID-19 was going to be a major public health threat, politicians and public health officials turned to academic disease modelers like us for urgent guidance. Academic software development is typically a slow and haphazard process, and we realized that business-as-usual would not suffice for dealing with this crisis. Here we describe the case study of how we built Covasim (covasim.org), an agent-based model of COVID-19 epidemiology and public health interventions, by using standard Python libraries like NumPy and Numba, along with less common ones like Sciris (sciris.org). Covasim was created in a few weeks, an order of magnitude faster than the typical model development process, and achieves performance comparable to C++ despite being written in pure Python. It has become one of the most widely adopted COVID models, and is used by researchers and policymakers in dozens of countries. Covasim's rapid development was enabled not only by leveraging the Python scientific computing ecosystem, but also by adopting coding practices and workflows that lowered the barriers to entry for scientific contributors without sacrificing either performance or rigor.

Index Terms—COVID-19, SARS-CoV-2, Epidemiology, Mathematical modeling, NumPy, Numba, Sciris

Background

For decades, scientists have been concerned about the possibility of another global pandemic on the scale of the 1918 flu [Gar05]. Despite a number of "close calls" – including SARS in 2002 [AFG+04]; Ebola in 2014-2016 [Tea14]; and flu outbreaks including 1957, 1968, and H1N1 in 2009 [SHK16], some of which led to 1 million or more deaths – the last time we experienced the emergence of a planetary-scale new pathogen was when HIV spread globally in the 1980s [CHL+08].
    In 2015, Bill Gates gave a TED talk stating that the world was not ready to deal with another pandemic [Hof20]. While the Bill & Melinda Gates Foundation (BMGF) has not historically focused on pandemic preparedness, its expertise in disease surveillance, modeling, and drug discovery made it well placed to contribute to a global pandemic response plan. Founded in 2008, the Institute for Disease Modeling (IDM) has provided analytical support for BMGF (which it has been a part of since 2020) and other global health partners, with a focus on eradicating malaria and polio. Since its creation, IDM has built up a portfolio of computational tools to understand, analyze, and predict the dynamics of different diseases.
    When "coronavirus disease 2019" (COVID-19) and the virus that causes it (SARS-CoV-2) were first identified in late 2019, our team began summarizing what was known about the virus [Fam19]. By early February 2020, even though it was more than a month before the World Health Organization (WHO) declared a pandemic [Med20], it had become clear that COVID-19 would become a major public health threat. The outbreak on the Diamond Princess cruise ship [RSWS20] was the impetus for us to start modeling COVID in detail. Specifically, we needed a tool to (a) incorporate new data as soon as it became available, (b) explore policy scenarios, and (c) predict likely future epidemic trajectories.
    The first step was to identify which software tool would form the best starting point for our new COVID model. Infectious disease models come in two major types: agent-based models track the behavior of individual "people" (agents) in the simulation, with each agent's behavior represented by a random (probabilistic) process. Compartmental models track populations of people over time, typically using deterministic difference equations. The richest modeling framework used by IDM at the time was EMOD, which is a multi-disease agent-based model written in C++ and based on JSON configuration files [BGB+18]. We also considered Atomica, a multi-disease compartmental model written in Python and based on Excel input files [KAK+19]. However, both of these options posed significant challenges: as a compartmental model, Atomica would have been unable to capture the individual-level detail necessary for modeling the Diamond Princess outbreak (such as passenger-crew interactions); EMOD had sufficient flexibility, but developing new disease modules had historically required months rather than days.
    As a result, we instead started developing Covasim ("COVID-19 Agent-based Simulator") [KSM+21] from a nascent agent-based model written in Python, LEMOD-FP ("Light-EMOD for Family Planning"). LEMOD-FP was used to model reproductive health choices of women in Senegal; this model had in turn been based on an even simpler agent-based model of measles vaccination programs in Nigeria ("Value-of-Information Simulator" or VoISim). We subsequently applied the lessons we learned
from developing Covasim to turn LEMOD-FP into a new family planning model, "FPsim", which will be launched later this year [OVCC+22].

Fig. 1: Daily reported global COVID-19-related deaths (top; smoothed with a one-week rolling window), relative to the timing of known variants of concern (VOCs) and variants of interest (VOIs), as well as Covasim releases (bottom).

    Parallel to the development of Covasim, other research teams at IDM developed their own COVID models, including one based on the EMOD framework [SWC+22], and one based on an earlier influenza model [COSF20]. However, while both of these models saw use in academic contexts [KCP+20], neither was able to incorporate new features quickly enough, nor was easy enough to use, for widespread external adoption in a policy context.
    Covasim, by contrast, had immediate real-world impact. The first version was released on 10 March 2020, and on 12 March 2020, its output was presented by Washington State Governor Jay Inslee during a press conference as justification for school closures and social distancing measures [KMS+21].
    Since the early days of the pandemic, Covasim releases have coincided with major events in the pandemic, especially the identification of new variants of concern (Fig. 1). Covasim was quickly adopted globally, including applications in the UK regarding school closures [PGKS+20], Australia regarding outbreak control [SAK+21], and Vietnam regarding lockdown measures [PSN+21].
    To date, Covasim has been downloaded from PyPI over 100,000 times [PeP22], has been used in dozens of academic studies [KMS+21], and informed decision-making on every continent (Fig. 2), making it one of the most widely used COVID models [KSM+21]. We believe key elements of its success include (a) the simplicity of its architecture; (b) its high performance, enabled by the use of NumPy arrays and Numba decorators; and (c) our emphasis on prioritizing usability, including flexible type handling and careful choices of default settings. In the remainder of this paper, we outline these principles in more detail, in the hope that these will provide a useful roadmap for other groups wanting to quickly develop high-performance, easy-to-use scientific computing libraries.

Software architecture and implementation

Covasim conceptual design and usage

Covasim is a standard susceptible-exposed-infectious-recovered (SEIR) model (Fig. 3). As noted above, it is an agent-based model, meaning that individual people and their interactions with one another are simulated explicitly (rather than implicitly, as in a compartmental model).
    The fundamental calculation that Covasim performs is to determine the probability that a given person, on a given time step, will change from one state to another, such as from susceptible to exposed (i.e., that person was infected), from undiagnosed to diagnosed, or from critically ill to dead. Covasim is fully open-source and available on GitHub (http://covasim.org) and PyPI (pip install covasim), and comes with comprehensive documentation, including tutorials (http://docs.covasim.org).
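As a rough illustration of this kind of per-timestep state transition, the sketch below draws Bernoulli outcomes for an entire population at once using NumPy; the state names and probabilities are made up for the example and do not correspond to Covasim's internal API.

import numpy as np

n_agents = 100_000
rng = np.random.default_rng(1)

# Illustrative per-agent state arrays
susceptible = np.ones(n_agents, dtype=bool)
exposed = np.zeros(n_agents, dtype=bool)

# Illustrative per-step probabilities
p_contact = 0.1   # probability of contact with an infectious person today
p_infect = 0.05   # probability that such a contact causes infection

# One time step: which susceptible agents become exposed today
contacted = rng.random(n_agents) < p_contact
newly_exposed = susceptible & contacted & (rng.random(n_agents) < p_infect)
exposed[newly_exposed] = True
susceptible[newly_exposed] = False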
    The first principle of Covasim's design philosophy is that "Common tasks should be simple" – for example, defining parameters, running a simulation, and plotting results. The following example illustrates this principle; it creates a simulation with a custom parameter value, runs it, and plots the results:

import covasim as cv
cv.Sim(pop_size=100e3).run().plot()

The second principle of Covasim's design philosophy is "Uncommon tasks can't always be simple, but they still should be possible." Examples include writing a custom goodness-of-fit function or defining a new population structure. To some extent, the second principle is at odds with the first, since the more flexibility an interface has, typically the more complex it is as well.
    To illustrate the tension between these two principles, the following code shows how to run two simulations to determine the impact of a custom intervention aimed at protecting the elderly in Japan, with results shown in Fig. 4:

import covasim as cv

# Define a custom intervention
def elderly(sim, old=70):
    if sim.t == sim.day('2020-04-01'):
        elderly = sim.people.age > old
        sim.people.rel_sus[elderly] = 0.0

# Set custom parameters
pars = dict(
    pop_type = 'hybrid', # More realistic population
    location = 'japan', # Japan's population pyramid
    pop_size = 50e3, # Have 50,000 people total
    pop_infected = 100, # 100 infected people
    n_days = 90, # Run for 90 days
)

# Run multiple sims in parallel and plot key results
label = 'Protect the elderly'
s1 = cv.Sim(pars, label='Default')
s2 = cv.Sim(pars, interventions=elderly, label=label)
msim = cv.parallel(s1, s2)
msim.plot(['cum_deaths', 'cum_infections'])

Similar design philosophies have been articulated previously, such as for Grails [AJ09], among others.1

   1. Other similar philosophical statements include "The manifesto of Matplotlib is: simple and common tasks should be simple to perform; provide options for more complex tasks" (Data Processing Using Python) and "Simple, common tasks should be simple to perform; Options should be provided to enable more complex tasks" (Instrumental).




Fig. 2: Locations where Covasim has been used to help produce a paper, report, or policy recommendation.

Fig. 3: Basic Covasim disease model. The blue arrow shows the process of reinfection.

Fig. 4: Illustrative result of a simulation in Covasim focused on exploring an intervention for protecting the elderly.

Simplifications using Sciris

A key component of Covasim's architecture is heavy reliance on Sciris (http://sciris.org) [KAH+ng], a library of functions for scientific computing that provide additional flexibility and ease-of-use on top of NumPy, SciPy, and Matplotlib, including parallel computing, array operations, and high-performance container datatypes.
    As shown in Fig. 5, Sciris significantly reduces the number of lines of code required to perform common scientific tasks, allowing the user to focus on the code's scientific logic rather than the low-level implementation. Key Covasim features that rely on Sciris include: ensuring consistent dictionary, list, and array types (e.g., allowing the user to provide inputs as either lists or arrays); referencing ordered dictionary elements by index; handling and interconverting dates (e.g., allowing the user to provide either a date string or a datetime object); saving and loading files; and running simulations in parallel.

Array-based architecture

In a typical agent-based simulation, the outermost loop is over time, while the inner loops iterate over different agents and agent states. For a simulation like Covasim, with roughly 700 (daily) timesteps to represent the first two years of the pandemic, tens or hundreds of thousands of agents, and several dozen states, this requires on the order of one billion update steps.
    However, we can take advantage of the fact that each state (such as agent age or their infection status) has the same data type, and thus we can avoid an explicit loop over agents by instead representing agents as entries in NumPy vectors, and performing operations on these vectors. These two architectures are shown in Fig. 6.




Fig. 5: Comparison of functionally identical code implemented without Sciris (left) and with (right). In this example, tasks that together take
30 lines of code without Sciris can be accomplished in 7 lines with it.

Fig. 6: The standard object-oriented approach for implementing agent-based models (top), compared to the array-based approach used in Covasim (bottom).

Fig. 7: Performance comparison for FPsim from an explicit loop-based approach compared to an array-based approach, showing a factor of ~70 speed improvement for large population sizes.

Compared to the explicitly object-oriented implementation of an agent-based model, the array-based version is 1-2 orders of magnitude faster for population sizes larger than 10,000 agents. The relative performance of these two approaches is shown in Fig. 7 for FPsim (which, like Covasim, was initially implemented using an object-oriented approach before being converted to an array-based approach). To illustrate the difference between object-based and array-based implementations, the following example shows how aging and death would be implemented in each:

# Object-based agent simulation

class Person:

    def age_person(self):
        self.age += 1
        return

    def check_died(self):
        rand = np.random.random()
        if rand < self.death_prob:
            self.alive = False
        return

class Sim:

    def run(self):
        for t in self.time_vec:
            for person in self.people:
                if person.alive:
                    person.age_person()
                    person.check_died()

# Array-based agent simulation

class People:

    def age_people(self, inds):
        self.age[inds] += 1
        return

    def check_died(self, inds):
        rands = np.random.rand(len(inds))
        died = rands < self.death_probs[inds]
        self.alive[inds[died]] = False
        return

class Sim:

    def run(self):
        for t in self.time_vec:
            alive = sc.findinds(self.people.alive)
            self.people.age_people(inds=alive)
            self.people.check_died(inds=alive)

Numba optimization

Numba is a compiler that translates subsets of Python and NumPy into machine code [LPS15]. Each low-level numerical function was tested with and without Numba decoration; in some cases speed improvements were negligible, while in other cases they were considerable. For example, the following function is roughly 10 times faster with the Numba decorator than without:

import numpy as np
import numba as nb

@nb.njit((nb.int32, nb.int32), cache=True)
def choose_r(max_n, n):
    return np.random.choice(max_n, n, replace=True)

Since Covasim is stochastic, calculations rarely need to be exact; as a result, most numerical operations are performed as 32-bit operations.
    Together, these speed optimizations allow Covasim to run at roughly 5-10 million simulated person-days per second of CPU time – a speed comparable to agent-based models implemented purely in C or C++ [HPN+21]. Practically, this means that most users can run Covasim analyses on their laptops without needing to use cloud-based or HPC computing resources.

Lessons for scientific software development

Accessible coding and design

Since Covasim was designed to be used by scientists and health officials, not developers, we made a number of design decisions that prioritized accessibility for our audience over other principles of good software design.
    First, Covasim is designed to accept user inputs that are as flexible as possible. For example, a date can be specified as an integer number of days from the start of the simulation, as a string (e.g. '2020-04-04'), or as a datetime object. Similarly, numeric inputs that can have either one or multiple values (such as the change in transmission rate following one or multiple lockdowns) can be provided as a scalar, list, or NumPy array. As long as the input is unambiguous, we prioritized ease-of-use and simplicity of the interface over rigorous type checking. Since Covasim is a
top-level library (i.e., it does not perform low-level functions as part of other libraries), this prioritization has been welcomed by its users.
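A minimal sketch of the kind of input normalization this implies is shown below; the helper names and behavior are illustrative only and are not Covasim's (or Sciris's) actual API.

import datetime as dt
import numpy as np

def to_day(value, start_day):
    # Convert an int offset, 'YYYY-MM-DD' string, or date(time) to a day number
    if isinstance(value, (int, np.integer)):
        return int(value)
    if isinstance(value, str):
        value = dt.date.fromisoformat(value)
    if isinstance(value, dt.datetime):
        value = value.date()
    return (value - start_day).days

def to_array(value):
    # Accept a scalar, list, or NumPy array and always return a 1-D float array
    return np.atleast_1d(np.asarray(value, dtype=float))

start = dt.date(2020, 3, 1)
assert to_day('2020-04-04', start) == to_day(dt.date(2020, 4, 4), start) == 34
assert to_array(0.7).shape == (1,) and to_array([0.7, 0.4]).shape == (2,)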
    Second, "advanced" Python programming paradigms – such as method and function decorators, lambda functions, multiple inheritance, and "dunder" methods – have been avoided where possible, even when they would otherwise be good coding practice. This is because a relatively large fraction of Covasim users, including those with relatively limited Python backgrounds, need to inspect and modify the source code. A Covasim user coming from an R programming background, for example, may not have encountered the NumPy function intersect1d() before, but they can quickly look it up and understand it as being equivalent to R's intersect() function. In contrast, an R user who has not encountered method decorators before is unlikely to be able to look them up and understand their meaning (indeed, they may not even know what terms to search for). While Covasim indeed does use each of the "advanced" methods listed above (e.g., the Numba decorators described above), they have been kept to a minimum and sequestered in particular files the user is less likely to interact with.
    Third, testing for Covasim presented a major challenge. Given that Covasim was being used to make decisions that affected tens of millions of people, even the smallest errors could have potentially catastrophic consequences. Furthermore, errors could arise not only in the software logic, but also in an incorrectly entered parameter value or a misinterpreted scientific study. Compounding these challenges, features often had to be developed and used on a timescale of hours or days to be of use to policymakers, a speed which was incompatible with traditional software testing approaches. In addition, the rapidly evolving codebase made it difficult to write even simple regression tests. Our solution was to use a hierarchical testing approach: low-level functions were tested through a standard software unit test approach, while new features and higher-level outputs were tested extensively by infectious disease modelers who varied inputs corresponding to realistic scenarios, and checked the outputs (predominantly in the form of graphs) against their intuition. We found that these high-level "sanity checks" were far more effective in catching bugs than formal software tests, and as a result shifted the emphasis of our test suite to prioritize the former. Public releases of Covasim have held up well to extensive scrutiny, both by our external collaborators and by "COVID skeptics" who were highly critical of other COVID models [Den20].
    Finally, since much of our intended audience has little to no Python experience, we provided as many alternative ways of accessing Covasim as possible. For R users, we provide examples of how to run Covasim using the reticulate package [AUTE17], which allows Python to be called from within R. For specific applications, such as our test-trace-quarantine work (http://ttq-app.covasim.org), we developed bespoke webapps via Jupyter notebooks [GP21] and Voilà [Qua19]. To help non-experts gain intuition about COVID epidemic dynamics, we also developed a generic JavaScript-based webapp interface for Covasim (http://app.covasim.org), but it does not have sufficient flexibility to answer real-world policy questions.

Workflow and team management

Covasim was developed by a team of roughly 75 people with widely disparate backgrounds: from those with 20+ years of enterprise-level software development experience and no public health background, through to public health experts with virtually no prior experience in Python. Roughly 45% of Covasim contributors had significant Python expertise, while 60% had public health experience; only about half a dozen contributors (<10%) had significant experience in both areas.
    These half-dozen contributors formed a core group (including the authors of this paper) that oversaw overall Covasim development. Using GitHub for both software and project management, we created issues and assigned them to other contributors based on urgency and skillset match. All pull requests were reviewed by at least one person from this group, and often two, prior to merge. While the danger of accepting changes from contributors with limited Python experience is self-evident, considerable risks were also posed by contributors who lacked epidemiological insight. For example, some of the proposed tests were written based on assumptions that were true for a given time and place, but which were not valid for other geographical contexts.
    One surprising outcome was that even though Covasim is largely a software project, after the initial phase of development (i.e., the first 4-8 weeks), we found that relatively few tasks could be assigned to the developers as opposed to the epidemiologists and infectious disease modelers on the project. We believe there are several reasons for this. First, epidemiologists tended to be much more aware of knowledge they were missing (e.g., what a particular NumPy function did), and were more readily able to fill that gap (e.g., look it up in the documentation or on Stack Overflow). By contrast, developers without expertise in epidemiology were less able to identify gaps in their knowledge and address them (e.g., by finding a study on Google Scholar). As a consequence, many of the epidemiologists' software skills improved markedly over the first few months, while the developers' epidemiology knowledge increased more slowly. Second, and more importantly, we found that once transparent and performant coding practices had been implemented, epidemiologists were able to successfully adapt them to new contexts even without complete understanding of the code. Thus, for developing a scientific software tool, we propose that a successful staffing plan would consist of a roughly equal ratio of developers and domain experts during the early development phase, followed by a rapid (on a timescale of weeks) ramp-down of developers and ramp-up of domain experts.
    Acknowledging that Covasim's potential user base includes many people who have limited coding skills, we developed a three-tiered support model to maximize Covasim's real-world policy impact (Fig. 8). For "mode 1" engagements, we perform the analyses using Covasim ourselves. While this mode typically ensures high quality and efficiency, it is highly resource-constrained and thus used only for our highest-profile engagements, such as with the Vietnam Ministry of Health [PSN+21] and Washington State Department of Health [KMS+21]. For "mode 2" engagements, we offer our partners training on how to use Covasim, and let them lead analyses with our feedback. This is our preferred mode of engagement, since it balances efficiency and sustainability, and has been used for contexts including the United Kingdom [PGKS+20] and Australia [SLSS+22]. Finally, "mode 3" partnerships, in which Covasim is downloaded and used without our direct input, are of course the default approach in the open-source software ecosystem, including for Python. While this mode is by far the most scalable, in practice, relatively few health departments or ministries of health have the time and internal technical capacity to use this mode; instead, most of the mode 3 uptake of Covasim has
been by academic groups [LG+ 21]. Thus, we provide mode 1 and            [AUTE17]    JJ Allaire, Kevin Ushey, Yuan Tang, and Dirk Eddelbuettel.
mode 2 partnerships to make Covasim’s impact more immediate                          reticulate: R Interface to Python, 2017. URL: https://github.
                                                                                     com/rstudio/reticulate.
and direct than would be possible via mode 3 alone.
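As a rough illustration of what "mode 3" self-service use involves, a minimal Covasim run is only a few lines of Python. The sketch below is based on Covasim's documented top-level API (cv.Sim, sim.run, sim.plot, cv.change_beta); the parameter values are purely illustrative and not recommendations for any real setting.

import covasim as cv

# Illustrative parameters only: a small synthetic population followed for 90 days
pars = dict(pop_size=50_000, n_days=90, beta=0.015)

# Example intervention: reduce transmissibility by 30% from day 30 onward
intervention = cv.change_beta(days=30, changes=0.7)

sim = cv.Sim(pars, interventions=intervention)
sim.run()
sim.plot()  # summary time series (infections, diagnoses, deaths, etc.)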
                                                                         [BGB+ 18]   Anna Bershteyn, Jaline Gerardin, Daniel Bridenbecker, Christo-
                                                                                     pher W Lorton, Jonathan Bloedow, Robert S Baker, Guil-
Future directions                                                                    laume Chabot-Couture, Ye Chen, Thomas Fischle, Kurt Frey,
                                                                                     et al. Implementation and applications of EMOD, an individual-
While the need for COVID modeling is hopefully starting to                           based multi-disease modeling platform. Pathogens and disease,
decrease, we and our collaborators are continuing development                        76(5):fty059, 2018. doi:10.1093/femspd/fty059.
of Covasim by updating parameters with the latest scientific             [CHL+ 08]   Myron S Cohen, Nick Hellmann, Jay A Levy, Kevin DeCock,
                                                                                     Joep Lange, et al. The spread, treatment, and prevention of
evidence, implementing new immune dynamics [CSN+ 21], and                            HIV-1: evolution of a global pandemic. The Journal of Clin-
providing other usability and bug-fix updates. We also continue                      ical Investigation, 118(4):1244–1254, 2008. doi:10.1172/
to provide support and training workshops (including in-person                       JCI34706.
workshops, which were not possible earlier in the pandemic).             [COSF20]    Dennis L Chao, Assaf P Oron, Devabhaktuni Srikrishna, and
                                                                                     Michael Famulare. Modeling layered non-pharmaceutical inter-
    We are using what we learned during the development of                           ventions against SARS-CoV-2 in the United States with Corvid.
Covasim to build a broader suite of Python-based disease mod-                        MedRxiv, 2020. doi:10.1101/2020.04.08.20058487.
eling tools (tentatively named "*-sim" or "Starsim"). The suite          [CSN+ 21]   Jamie A Cohen, Robyn Margaret Stuart, Rafael C Núñez,
of Starsim tools under development includes models for family                        Katherine Rosenfeld, Bradley Wagner, Stewart Chang, Cliff
                                                                                     Kerr, Michael Famulare, and Daniel J Klein. Mechanistic mod-
planning [OVCC+ 22], polio, respiratory syncytial virus (RSV),                       eling of SARS-CoV-2 immune memory, variants, and vaccines.
and human papillomavirus (HPV). To date, each tool in this                           medRxiv, 2021. doi:10.1101/2021.05.31.21258018.
suite uses an independent codebase, and is related to Covasim            [Den20]     Denim, Sue. Another Computer Simulation, Another Alarmist
only through the shared design principles described above, and                       Prediction, 2020. URL: https://dailysceptic.org/schools-paper.
                                                                         [Fam19]     Mike Famulare. nCoV: preliminary estimates of the confirmed-
by having used the Covasim codebase as the starting point for                        case-fatality-ratio and infection-fatality-ratio, and initial pan-
development.                                                                         demic risk assessment. Institute for Disease Modeling, 2019.
    A major open question is whether the disease dynamics im-            [Gar05]     Laurie Garrett. The next pandemic. Foreign Aff., 84:3, 2005.
plemented in Covasim and these related models have sufficient                        doi:10.2307/20034417.
                                                                         [GP21]      Brian E. Granger and Fernando Pérez. Jupyter: Thinking and
overlap to be refactored into a single disease-agnostic modeling                     storytelling with code and data. Computing in Science & En-
library, which the disease-specific modeling libraries would then                    gineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021.
import. This "core and specialization" approach was adopted by                       3059263.
EMOD and Atomica, and while both frameworks continue to be               [Hof20]     Bert Hofman. The global pandemic. Horizons: Journal of
                                                                                     International Relations and Sustainable Development, (16):60–
used, no multi-disease modeling library has yet seen widespread                      69, 2020.
adoption within the disease modeling community. The alternative          [HPN+ 21]   Robert Hinch, William JM Probert, Anel Nurtay, Michelle
approach, currently used by the Starsim suite, is for each disease                   Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana
model to be a self-contained library. A shared library would                         Bulas Cruz, Lele Zhao, Andrea Stewart, et al. OpenABM-
                                                                                     Covid19—An agent-based model for non-pharmaceutical inter-
reduce code duplication, and allow new features and bug fixes                        ventions against COVID-19 including contact tracing. PLoS
to be immediately rolled out to multiple models simultaneously.                      computational biology, 17(7):e1009146, 2021.             doi:10.
However, it would also increase interdependencies, and with them                   1371/journal.pcbi.1009146.
code complexity and the risk of introducing subtle bugs. Which of       [KAH+ ng]   Cliff C Kerr, Romesh G Abeysuriya, Vlad-Ștefan Harbuz,
these two options is preferable will likely depend on the speed with               George L Chadderdon, Parham Saidi, Paula Sanz-Leon, James
which new disease models need to be implemented. We hope that                      Jansson, Maria del Mar Quiroga, Sherrie Hughes, Rowan
for the foreseeable future, none                                                   Martin-Kelly, Jamie Cohen, Robyn M Stuart, and Anna
to be implemented. We hope that for the foreseeable future, none                     Nachesa. Sciris: a Python library to simplify scientific com-
will need to be implemented as quickly as Covasim.                                   puting. Available at http://paper.sciris.org, 2022 (forthcoming).
                                                                         [KAK+ 19]   David J Kedziora, Romesh Abeysuriya, Cliff C Kerr, George L
                                                                                     Chadderdon, Vlad-Ștefan Harbuz, Sarah Metzger, David P Wil-
Acknowledgements                                                                     son, and Robyn M Stuart. The Cascade Analysis Tool: software
                                                                                     to analyze and optimize care cascades. Gates Open Research, 3,
We thank additional contributors to Covasim, including Katherine                     2019. doi:10.12688/gatesopenres.13031.2.
Rosenfeld, Gregory R. Hart, Rafael C. Núñez, Prashanth Selvaraj,         [KCP+ 20]   Joel R Koo, Alex R Cook, Minah Park, Yinxiaohe Sun, Haoyang
Brittany Hagedorn, Amanda S. Izzo, Greer Fowler, Anna Palmer,                        Sun, Jue Tao Lim, Clarence Tam, and Borame L Dickens.
                                                                                     Interventions to mitigate early spread of sars-cov-2 in singapore:
Dominic Delport, Nick Scott, Sherrie L. Kelly, Caroline S. Ben-                      a modelling study. The Lancet Infectious Diseases, 20(6):678–
nette, Bradley G. Wagner, Stewart T. Chang, Assaf P. Oron, Paula                     688, 2020. doi:10.1016/S1473-3099(20)30162-6.
Sanz-Leon, and Jasmina Panovska-Griffiths. We also wish to thank         [KMS+ 21]   Cliff C Kerr, Dina Mistry, Robyn M Stuart, Katherine Rosenfeld,
Maleknaz Nayebi and Natalie Dean for helpful discussions on                          Gregory R Hart, Rafael C Núñez, Jamie A Cohen, Prashanth
                                                                                     Selvaraj, Romesh G Abeysuriya, Michał Jastrzębski, et al. Con-
code architecture and workflow practices, respectively.                              trolling COVID-19 via test-trace-quarantine. Nature Commu-
                                                                                     nications, 12(1):1–12, 2021. doi:10.1038/s41467-021-
                                                                                     23276-9.
R EFERENCES                                                              [KSM+ 21]   Cliff C Kerr, Robyn M Stuart, Dina Mistry, Romesh G Abey-
[AFG+ 04]   Roy M Anderson, Christophe Fraser, Azra C Ghani, Christl A               suriya, Katherine Rosenfeld, Gregory R Hart, Rafael C Núñez,
            Donnelly, Steven Riley, Neil M Ferguson, Gabriel M Leung,                Jamie A Cohen, Prashanth Selvaraj, Brittany Hagedorn, et al.
            Tai H Lam, and Anthony J Hedley. Epidemiology, transmis-                 Covasim: an agent-based model of COVID-19 dynamics and
            sion dynamics and control of sars: the 2002–2003 epidemic.               interventions. PLOS Computational Biology, 17(7):e1009149,
            Philosophical Transactions of the Royal Society of London.               2021. doi:10.1371/journal.pcbi.1009149.
            Series B: Biological Sciences, 359(1447):1091–1105, 2004.    [LG+ 21]    Junjiang Li, Philippe Giabbanelli, et al. Returning to a normal
            doi:10.1098/rstb.2004.1490.                                              life via COVID-19 vaccines in the United States: a large-
[AJ09]      Bashar Abdul-Jawad. Groovy and Grails Recipes. Springer,                 scale Agent-Based simulation study. JMIR medical informatics,
            2009.                                                                    9(4):e27419, 2021. doi:10.2196/27419.




Fig. 8: The three pathways to impact with Covasim, from high bandwidth/small scale to low bandwidth/large scale. IDM: Institute for Disease
Modeling; OSS: open-source software; GPG: global public good; PyPI: Python Package Index.


[LPS15]    Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A                    the impact of COVID-19 vaccines in a representative COVAX
           llvm-based python jit compiler. In Proceedings of the Second                   AMC country setting due to ongoing internal migration: A
           Workshop on the LLVM Compiler Infrastructure in HPC, pages                     modeling study. PLOS Global Public Health, 2(1):e0000053,
           1–6, 2015. doi:10.1145/2833157.2833162.                                        2022. doi:10.1371/journal.pgph.0000053.
[Med20]    The Lancet Respiratory Medicine. COVID-19: delay, mitigate,          [Tea14]   WHO Ebola Response Team. Ebola virus disease in west
           and communicate. The Lancet Respiratory Medicine, 8(4):321,                    africa—the first 9 months of the epidemic and forward projec-
           2020. doi:10.1016/S2213-2600(20)30128-4.                                       tions. New England Journal of Medicine, 371(16):1481–1495,
[OVCC+ 22] Michelle L O’Brien, Annie Valente, Guillaume Chabot-Couture,
                                                                                                 2014. doi:10.1056/NEJMoa1411100.
           Joshua Proctor, Daniel Klein, Cliff Kerr, and Marita Zimmer-
           mann. FPSim: An agent-based model of family planning for
           informed policy decision-making. In PAA 2022 Annual Meeting.
           PAA, 2022.
[PeP22]    PePy. PePy download statistics, 2022. URL: https://pepy.tech/
           project/covasim.
[PGKS+ 20] Jasmina Panovska-Griffiths, Cliff C Kerr, Robyn M Stuart, Dina
           Mistry, Daniel J Klein, Russell M Viner, and Chris Bonell.
           Determining the optimal strategy for reopening schools, the
           impact of test and trace interventions, and the risk of occurrence
           of a second COVID-19 epidemic wave in the UK: a modelling
           study. The Lancet Child & Adolescent Health, 4(11):817–827,
           2020. doi:10.1016/S2352-4642(20)30250-9.
[PSN+ 21] Quang D Pham, Robyn M Stuart, Thuong V Nguyen, Quang C
           Luong, Quang D Tran, Thai Q Pham, Lan T Phan, Tan Q Dang,
           Duong N Tran, Hung T Do, et al. Estimating and mitigating the
           risk of COVID-19 epidemic rebound associated with reopening
           of international borders in Vietnam: a modelling study. The
           Lancet Global Health, 9(7):e916–e924, 2021. doi:10.1016/
           S2214-109X(21)00103-0.
[Qua19]    QuantStack. And voilà! Jupyter Blog, 2019. URL: https://blog.
           jupyter.org/and-voil%C3%A0-f6a2c08a4a93.
[RSWS20] Joacim Rocklöv, Henrik Sjödin, and Annelies Wilder-Smith.
           COVID-19 outbreak on the Diamond Princess cruise ship: esti-
           mating the epidemic potential and effectiveness of public health
           countermeasures. Journal of Travel Medicine, 27(3):taaa030,
           2020. doi:10.1093/jtm/taaa030.
[SAK+ 21] Robyn M Stuart, Romesh G Abeysuriya, Cliff C Kerr, Dina
           Mistry, Dan J Klein, Richard T Gray, Margaret Hellard, and
           Nick Scott. Role of masks, testing and contact tracing in
           preventing COVID-19 resurgences: a case study from New
           South Wales, Australia. BMJ open, 11(4):e045941, 2021.
           doi:10.1136/bmjopen-2020-045941.
[SHK16]    Patrick R Saunders-Hastings and Daniel Krewski. Review-
           ing the history of pandemic influenza: understanding patterns
           of emergence and transmission. Pathogens, 5(4):66, 2016.
           doi:10.3390/pathogens5040066.
[SLSS+ 22] Paula Sanz-Leon, Nathan J Stevenson, Robyn M Stuart,
           Romesh G Abeysuriya, James C Pang, Stephen B Lambert,
           Cliff C Kerr, and James A Roberts. Risk of sustained SARS-
           CoV-2 transmission in Queensland, Australia. Scientific reports,
           12(1):1–9, 2022. doi:10.1101/2021.06.08.21258599.
[SWC+ 22]  Prashanth Selvaraj, Bradley G Wagner, Dennis L Chao,
           Maïna L’Azou Jackson, J Gabrielle Breugelmans, Nicholas Jack-
           son, and Stewart T Chang. Rural prioritization may increase




      Pylira: deconvolution of images in the presence of
                        Poisson noise
Axel Donath‡∗ , Aneta Siemiginowska‡ , Vinay Kashyap‡ , Douglas Burke‡ , Karthik Reddy Solipuram§ , David van Dyk¶






Abstract—All physical and astronomical imaging observations are degraded by           of the signal intensity to the signal variance. Any statistically
the finite angular resolution of the camera and telescope systems. The recovery       correct post-processing or reconstruction method thus requires a
of the true image is limited by both how well the instrument characteristics          careful treatment of the Poisson nature of the measured image.
are known and by the magnitude of measurement noise. In the case of                      To maximise the scientific use of the data, it is often desired to
high signal-to-noise ratio data, the image can be sharpened or “deconvolved”
                                                                                      correct the degradation introduced by the imaging process. Besides
robustly by using established standard methods such as the Richardson-Lucy
method. However, the situation changes for sparse data and the low signal-to-
                                                                                      the correction for non-uniform exposure and background noise, this
noise regime frequently encountered in X-ray and gamma-ray                            also includes the correction for the "blurring" introduced by the
astronomy, where deconvolution inevitably leads to an amplification of noise          point spread function (PSF) of the instrument. The latter
and poorly reconstructed images. Nevertheless, the results in this regime can         process is often called "deconvolution". Depending on whether
be improved by making use of physically meaningful prior assumptions and              the PSF of the instrument is known or not, one distinguishes
statistically principled modeling techniques. One proposed method is the LIRA         between "blind deconvolution" and "non-blind deconvolution"
algorithm, which requires smoothness of the reconstructed image at multiple           processes. For astronomical observations, the PSF can often either
scales. In this contribution, we introduce a new Python package called Pylira,
                                                                                      be simulated, given a model of the telescope and detector, or
which exposes the original C implementation of the LIRA algorithm to Python
                                                                                      inferred directly from the data by observing far distant objects,
users. We briefly describe the package structure, development setup and show
a Chandra as well as Fermi-LAT analysis example.
                                                                                      which appear as a point source to the instrument.
                                                                                          While in other branches of astronomy deconvolution methods
Index Terms—deconvolution, point spread function, poisson, low counts, X-ray,         are already part of the standard analysis, such as the CLEAN
gamma-ray                                                                             algorithm for radio data, developed by [Hog74], this is not the
                                                                                      case for X-ray and gamma-ray astronomy. As any deconvolution
                                                                                      method aims to enhance small-scale structures in an image, the
Introduction                                                                          problem becomes increasingly hard to solve in the regime of low signal-
Any physical and astronomical imaging process is affected by                          to-noise ratio, where small-scale structures are more affected by
the limited angular resolution of the instrument or telescope. In                     noise.
addition, the quality of the resulting image is also degraded by
background or instrumental measurement noise and non-uniform                          The Deconvolution Problem
exposure. For short wavelengths and associated low intensities of                     Basic Statistical Model
the signal, the imaging process consists of recording individual                      Assuming the data in each pixel di in the recorded counts image
photons (often called "events") originating from a source of                          follows a Poisson distribution, the total likelihood of obtaining the
interest. This imaging process is typical for X-ray and gamma-                        measured image from a model image of the expected counts λi
ray telescopes, but images taken by magnetic resonance imaging                        with N pixels is given by:
or fluorescence microscopy show Poisson noise too. For each
individual photon, the incident direction, energy and arrival time
                                                                                                       L(d|λ) = ∏_{i=1}^{N} λ_i^{d_i} exp(−λ_i) / d_i!          (1)
are measured. Based on this information, the event can be binned
into two dimensional data structures to form an actual image.
                                                                                      By taking the logarithm, dropping the constant terms and inverting
    As a consequence of the low intensities associated with the                         the sign one can transform the product into a sum over pixels,
recording of individual events, the measured signal follows Pois-                       which is also often called the Cash [Cas79] fit statistic:
son statistics. This imposes a non-linear relationship between the
measured signal and true underlying intensity as well as a coupling                                   C(λ|d) = ∑_{i=1}^{N} (λ_i − d_i log λ_i)                  (2)
* Corresponding author: axel.donath@cfa.harvard.edu
‡ Center for Astrophysics | Harvard & Smithsonian                                     Where the expected counts λi are given by the convolution of the
§ University of Maryland Baltimore County                                             true underlying flux distribution xi with the PSF pk :
¶ Imperial College London
                                                                                                            λ_i = ∑_k x_k p_{i−k}                                (3)
Copyright © 2022 Axel Donath et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the          This operation is often called "forward modelling" or "forward
original author and source are credited.                                              folding" with the instrument response.
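To make Eqs. (2) and (3) concrete, the forward folding and the Cash statistic can be written in a few lines of NumPy/SciPy. The sketch below is for orientation only and is not Pylira's internal implementation; the function names are ours.

import numpy as np
from scipy.signal import fftconvolve

def expected_counts(flux, psf):
    """Forward-fold the true flux image with the PSF (Eq. 3)."""
    return fftconvolve(flux, psf, mode="same")

def cash(flux, counts, psf):
    """Cash statistic C(lambda|d) = sum_i (lambda_i - d_i log lambda_i), Eq. 2."""
    lam = np.clip(expected_counts(flux, psf), 1e-12, None)  # guard against log(0)
    return np.sum(lam - counts * np.log(lam))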

Richardson Lucy (RL)
To obtain the most likely value of xn given the data, one searches
for a maximum of the total likelihood function, or equivalently a
minimum of C. This high-dimensional optimization problem can
e.g., be solved by a classic gradient descent approach. Assuming
the pixel values xi of the true image to be independent parameters,
one can take the derivative of Eq. 2 with respect to the individual
xi . This way one obtains a rule for how to update the current set
of pixels xn in each iteration of the optimization:
                                  x_{n+1} = x_n − α · ∂C(d|x)/∂x_i                               (4)
Where α is a factor to define the step size. This method is in
general equivalent to the gradient descent and backpropagation
methods used in modern machine learning techniques. This ba-
sic principle of solving the deconvolution problem for images
with Poisson noise was proposed by [Ric72] and [Luc74]. Their
method, named after the original authors, is often known as the
                                                                         Fig. 1: The images show the result of the RL algorithm applied
Richardson & Lucy (RL) method. It was shown by [Ric72] that              to a simulated example dataset with varying numbers of iterations.
this converges to a maximum likelihood solution of Eq. 2. A              The image in the upper left shows the simulated counts. Those have
Python implementation of the standard RL method is available             been derived from the ground truth (upper mid) by convolving with a
e.g. in the Scikit-Image package [vdWSN+ 14].                            Gaussian PSF of width σ = 3 pix and applying Poisson noise to it.
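The behaviour illustrated in Fig. 1 can be reproduced with the scikit-image implementation mentioned above. The snippet below is a minimal sketch; note that the keyword controlling the number of iterations has changed name across scikit-image releases (iterations in older versions, num_iter in more recent ones).

import numpy as np
from scipy.signal import fftconvolve
from skimage.restoration import richardson_lucy

rng = np.random.default_rng(42)

# Simulated dataset: a single point source blurred by a Gaussian PSF (sigma = 3 pix)
yy, xx = np.mgrid[:32, :32]
psf = np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / (2 * 3 ** 2))
psf /= psf.sum()

flux = np.zeros((32, 32))
flux[16, 16] = 500
lam = np.clip(fftconvolve(flux, psf, mode="same"), 0, None)
counts = rng.poisson(lam).astype(float)

# More iterations sharpen the point source but amplify the noise (cf. Fig. 1)
deconvolved = richardson_lucy(counts, psf, num_iter=50, clip=False)

Re-running the last line with different numbers of iterations reproduces the progression from over-smoothed to noise-dominated reconstructions shown in Fig. 1.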
    Instead of the iterative, gradient descent based optimization it     The illustration uses the implementation of the RL algorithm from the
is also possible to sample from the posterior distribution using a       Scikit-Image package [vdWSN+ 14].
simple Metropolis-Hastings [Has70] approach and uniform prior.
This is demonstrated in one of the Pylira online tutorials (Intro-
                                                                         the smoothness of the reconstructed image on multiple spatial
duction to Deconvolution using MCMC Methods).
                                                                         scales. Starting from the full resolution, the image pixels xi are
                                                                         collected into 2 by 2 groups Qk . The four pixel values associated
RL Reconstruction Quality
                                                                         with each group are divided by their sum to obtain a grid of “split
While technically the RL method converges to a maximum like-             proportions” with respect to the image down-sized by a factor of
lihood solution, it mostly still results in poorly restored images,      two along both axes. This process is repeated using the down sized
especially if extended emission regions are present in the image.        image with pixel values equal to the sums over the 2 by 2 groups
The problem is illustrated in Fig. 1 using a simulated example           from the full-resolution image, and the process continues until the
image. While for a low number of iterations, the RL method still         resolution of the image is only a single pixel, containing the total
results in a smooth intensity distribution, the structure of the image   sum of the full-resolution image. This multi-scale representation
decomposes more and more into a set of point-like sources with           is illustrated in Fig. 2.
growing number of iterations.                                                 For each of the 2x2 groups of the re-normalized images a
    Because of the PSF convolution, an extended emission region          Dirichlet distribution is introduced as a prior:
can decompose into multiple nearby point sources and still lead
to good model prediction, when compared with the data. Those                                φk ∝ Dirichlet(αk , αk , αk , αk )            (6)
almost equally good solutions correspond to many narrow local            and multiplied across all 2x2 groups and resolution levels k. For
minima or "spikes" in the global likelihood surface. Depending on        each resolution level a smoothing parameter αk is introduced.
the start estimate for the reconstructed image x the RL method           These hyper-parameters can be interpreted as having an infor-
will follow the steepest gradient and converge towards the nearest       mation content equivalent of adding αk "hallucinated" counts in
narrow local minimum. This problem has been described by                 each grouping. This effectively results in a smoothing of the
multiple authors, such as [PR94] and [FBPW95].                           image at the given resolution level. The distribution of α values
                                                                         at each resolution level is then further described by a hyper-prior
Multi-Scale Prior & LIRA
                                                                         distribution:
One solution to this problem was described in [ECKvD04] and                                      p(α_k) = exp(−δ α_k^3 / 3)                 (7)
[CSv+ 11]. First, the simple forward folded model described in
Eq. 3 can be extended by taking into account the non-uniform             Resulting in a fully hierarchical Bayesian model. A more com-
exposure ei and an additional known background component bi :            plete and detailed description of the prior definition is given in
                                                                         [ECKvD04].
                          λ_i = ∑_k (e_k · (x_k + b_k)) p_{i−k}                 (5)      The problem is then solved by using a Gibbs MCMC sampling
                                                                                approach. After a "burn-in" phase the sampling process typically
The background bi can be more generally understood as a "base-           reaches convergence and starts sampling from the posterior distri-
line" image and thus include known structures, which are not of          bution. The reconstructed image is then computed as the mean of
interest for the deconvolution process. For example, a bright point     the posterior samples. As a full distribution of values is available
source can model the core of an AGN while studying its jets.            for each pixel, this information can also be used to compute
    Second, the authors proposed to extend the Poisson log-             the associated error of the reconstructed value. This is another
likelihood function (Equation 2) by a log-prior term that controls      main advantage over RL or Maximum A Posteriori (MAP) algorithms.
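The multi-scale grouping underlying this prior is easy to make concrete. The sketch below is for orientation only and is not Pylira's implementation (the prior is evaluated inside the wrapped C code); the function name multiscale_split is ours. It builds the "split proportions" of every 2x2 group at each resolution level and stops at the 1x1 image holding the total sum.

import numpy as np

def multiscale_split(image):
    """Split-proportion pyramid of a square image whose size is a power of two."""
    levels = []
    current = np.asarray(image, dtype=float)
    while current.shape[0] > 1:
        n = current.shape[0] // 2
        # Sum over non-overlapping 2x2 groups Q_k
        groups = current.reshape(n, 2, n, 2).sum(axis=(1, 3))
        # Express each pixel as a fraction of its 2x2 group total
        upsampled = np.repeat(np.repeat(groups, 2, axis=0), 2, axis=1)
        levels.append(current / upsampled)
        current = groups
    return levels, current  # 'current' ends up as the 1x1 total sum

# Example with a 4x4 image, as in Fig. 2
levels, total = multiscale_split(np.arange(1.0, 17.0).reshape(4, 4))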

                                                                           1   $ sudo apt-get install r-base-dev r-base r-mathlib
                                                                           2   $ pip install pylira

                                                                          For more detailed instructions see Pylira installation instructions.

                                                                          API & Subpackages
                                                                          Pylira is structured in multiple sub-packages. The pylira.src
                                                                          module contains the original C implementation and the Pybind11
                                                                          wrapper code. The pylira.core sub-package contains the
                                                                          main Python API, pylira.utils includes utility functions
                                                                          for plotting and serialisation, and pylira.data implements
                                                                          multiple pre-defined datasets for testing and tutorials.

                                                                          Analysis Examples
                                                                          Simple Point Source
                                                                          Pylira was designed to offer a simple Python class-based user
                                                                          interface with a short learning curve for users who are
                                                                          familiar with Python in general and
                                                                          more specifically with Numpy. A typical complete usage example
                                                                          of the Pylira package is shown in the following:
Fig. 2: The image illustrates the multi-scale decomposition used in
the LIRA prior for a 4x4 pixels example image. Each quadrant of 2x2        1   import numpy as np
sub-images is labelled with QN . The sub-pixels in each quadrant are       2   from pylira import LIRADeconvolver
labelled Λij.                                                                     3   from pylira.data import point_source_gauss_psf
                                                                           4
                                                                           5   # create example dataset
                                                                           6   data = point_source_gauss_psf()
The Pylira Package                                                         7
                                                                           8   # define initial flux image
Dependencies & Development                                                 9   data["flux_init"] = data["flux"]
The Pylira package is a thin Python wrapper around the original           10

LIRA implementation provided by the authors of [CSv+ 11]. The             11   deconvolve = LIRADeconvolver(
                                                                          12       n_iter_max=3_000,
original algorithm was implemented in C and made available as a           13       n_burn_in=500,
package for the R Language [R C20]. Thus the implementation de-           14       alpha_init=np.ones(5)
pends on the RMath library, which is still a required dependency of       15   )
                                                                          16
Pylira. The Python wrapper was built using the Pybind11 [JRM17]           17   result = deconvolve.run(data=data)
package, which keeps the code overhead introduced by the                 18
wrapper to a minimum. For the data handling, Pylira relies on            19   # plot pixel traces, result shown in Figure 3
Numpy [HMvdW+ 20] arrays and, for the serialisation to the FITS data     20   result.plot_pixel_traces_region(
                                                                         21       center_pix=(16, 16), radius_pix=3
format, on Astropy [Col18]. The (interactive) plotting functionality     22   )
is achieved via Matplotlib [Hun07] and Ipywidgets [wc15], which           23

are both optional dependencies. Pylira is openly developed on             24   # plot pixel traces, result shown in Figure 4
                                                                          25   result.plot_parameter_traces()
Github at https://github.com/astrostat/pylira. It relies on GitHub        26
Actions as a continuous integration service and uses the Read             27   # finally serialise the result
the Docs service to build and deploy the documentation. The on-           28   result.write("result.fits")
line documentation can be found on https://pylira.readthedocs.io.         The main interface is exposed via the LIRADeconvolver
Pylira implements a set of unit tests to assure compatibility             class, which takes the configuration of the algorithm on initial-
and reproducibility of the results with different versions of the         isation. Typical configuration parameters include the total num-
dependencies and across different platforms. As Pylira relies on          ber of iterations n_iter_max and the number of "burn-in"
random sampling for the MCMC process, exact reproducibility             iterations, to be excluded from the posterior mean computation.
of results is hard to achieve on different platforms; however, the      The data, represented by a simple Python dict data structure,
agreement of results is at least guaranteed in the statistical limit of   contains a "counts", "psf" and optionally "exposure"
drawing many samples.                                                     and "background" array. The dataset is then passed to the
                                                                          LIRADeconvolver.run() method to execute the deconvolu-
Installation
                                                                          tion. The result is a LIRADeconvolverResult object, which
Pylira is available via the Python package index (pypi.org),              features the possibility to write the result as a FITS file, as well
currently at version 0.1. As Pylira still depends on the RMath            as to inspect the result with diagnostic plots. The result of the
library, it is required to install this first. The recommended way      computation is shown in the left panel of Fig. 3.
to install Pylira on macOS is:
1     $ brew install r                                                    Diagnostic Plots
2     $ pip install pylira
                                                                          To validate the quality of the results Pylira provides many built-
On Linux the RMath dependency can be installed using standard             in diagnostic plots. One of these diagnostic plot is shown in the
package managers. For example on Ubuntu, one would do                     right panel of Fig. 3. The plot shows the image sampling trace


[Figure 3: two panels — the posterior mean image (left) and the "Pixel trace for (16, 16)" as a function of the number of iterations (right), with the burn-in phase, valid samples, mean, and 1 std. deviation indicated.]

Fig. 3: The curves show the traces of the value of the pixel of interest for a simulated point source and its neighboring pixels (see code example).
The image on the left shows the posterior mean. The white circle in the image shows the circular region defining the neighboring pixels. The
blue line on the right plot shows the trace of the pixel of interest. The solid horizontal orange line shows the mean value (excluding burn-in)
of the pixel across all iterations and the shaded orange area the 1 σ error region. The burn in phase is shown in transparent blue and ignored
while computing the mean. The shaded gray lines show the traces of the neighboring pixels.


for a single pixel of interest and its surrounding circular region of                   Chandra is a space-based X-ray observatory, which has been
interest. This visualisation allows the user to assess the stability               in operation since 1999. It consists of nested cylindrical paraboloid
of a small region in the image, e.g. an astronomical point source,               and hyperboloid surfaces, which form an imaging optical system
during the MCMC sampling process. Due to the correlation with                    for X-rays. In the focal plane, it has multiple instruments for dif-
neighbouring pixels, the actual value of a pixel might vary in the               ferent scientific purposes. This includes a high-resolution camera
sampling process, which appears as "dips" in the trace of the pixel              (HRC) and an Advanced CCD Imaging Spectrometer (ACIS). The
of interest and anti-correlated "peaks" in one or multiple of                    typical angular resolution is 0.5 arcsecond and the covered
the surrounding pixels. In the example, a stable state of the pixel              energy range is 0.1 to 10 keV.
of interest is reached after approximately 1000 iterations. This                        Figure 5 shows the result of the Pylira algorithm applied to
suggests that the number of burn-in iterations, which was defined                  Chandra data of the Galactic Center region between 0.5 and 7 keV.
beforehand, should be increased.                                                   The PSF was obtained from simulations using the simulate_psf
    Pylira relies on an MCMC sampling approach to sample                           tool from the official Chandra science tools ciao 4.14 [FMA+ 06].
a series of reconstructed images from the posterior likelihood                   The algorithm achieves an improved spatial resolution as well
defined by Eq. 2. Along with the sampling, it marginalises over                    as a reduced noise level and higher contrast of the image in the
the smoothing hyper-parameters and optimizes them in the same                      right panel compared to the unprocessed counts data shown in the
process. To diagnose the validity of the results it is important to                left panel.
visualise the sampling traces of both the sampled images as well                        As a second example, we use data from the Fermi Large Area
as hyper-parameters.                                                               Telescope (LAT). The Fermi-LAT is a satellite-based imaging
    Figure 4 shows another typical diagnostic plot created by the                  gamma-ray detector, which covers an energy range of 20 MeV
code example above. In a multi-panel figure, the user can inspect                  to >300 GeV. The angular resolution varies strongly with energy
the traces of the total log-posterior as well as the traces of the                 and ranges from 0.1 to >10 degree1 .
smoothing parameters. Each panel corresponds to the smoothing                           Figure 6 shows the result of the Pylira algorithm applied to
hyper parameter introduced for each level of the multi-scale                       Fermi-LAT data above 1 GeV to the region around the Galactic
representation of the reconstructed image. The figure also shows                   Center. The PSF was obtained from simulations using the gtpsf
the mean value along with the 1 σ error region. In this case,                      tool from the official Fermitools v2.0.19 [Fer19]. First, one can
the algorithm shows stable convergence after a burn-in phase of                    see that the algorithm achieves again a considerable improvement
approximately 200 iterations for the log-posterior as well as all of               in the spatial resolution compared to the raw counts. It clearly
the multi-scale smoothing parameters.                                            resolves multiple point sources to the left of the bright Galactic Center
                                                                                   source.

Astronomical Analysis Examples                                                     Summary & Outlook

Both in the X-ray as well as in the gamma-ray regime, the Galactic                 The Pylira package provides Python wrappers for the LIRA al-
Center is a complex emission region. It shows point sources,                       gorithm. It allows the deconvolution of low-counts data following
extended sources, as well as underlying diffuse emission and thus                    1. https://www.slac.stanford.edu/exp/glast/groups/canda/lat_Performance.
represents a challenge for any astronomical data analysis.                         htm
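For observations such as these, the inputs to Pylira are plain NumPy arrays collected into the data dictionary introduced earlier. The sketch below shows the rough pattern; the file names are placeholders, the counts and PSF images are assumed to have been produced beforehand with the instrument-specific tools mentioned in the text, and the configuration values (including the number of alpha_init entries) simply mirror the earlier code example.

import numpy as np
from astropy.io import fits
from pylira import LIRADeconvolver

# Placeholder file names; produced e.g. with ciao/simulate_psf or Fermitools/gtpsf
counts = fits.getdata("counts.fits").astype(float)
psf = fits.getdata("psf.fits").astype(float)
psf /= psf.sum()                          # normalise the PSF to unit sum

data = {
    "counts": counts,
    "psf": psf,
    "exposure": np.ones_like(counts),     # optional; uniform exposure assumed here
    "background": np.zeros_like(counts),  # optional baseline ("background") image
    "flux_init": counts.copy(),           # starting image, as in the earlier example
}

deconvolve = LIRADeconvolver(n_iter_max=3_000, n_burn_in=500, alpha_init=np.ones(5))
result = deconvolve.run(data=data)
result.write("result-obs.fits")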



[Figure 4: six panels showing the trace of the log-posterior ("Logpost") and of the smoothing parameters "Smoothingparam0" through "Smoothingparam4" versus the number of iterations, with burn-in, valid samples, mean, and 1 std. deviation indicated.]

Fig. 4: The curves show the traces of the log posterior value as well as traces of the values of the prior parameter values. The SmoothingparamN
parameters correspond to the smoothing parameters αN per multi-scale level. The solid horizontal orange lines show the mean value, the shaded
orange area the 1 σ error region. The burn in phase is shown transparent and ignored while estimating the mean.



[Figure 5: Chandra counts image (left, with the PSF shown inset) and the deconvolved image (right), plotted in Right Ascension and Declination with a counts colour scale.]
Fig. 5: Pylira applied to Chandra ACIS data of the Galactic Center region, using the observation IDs 4684 and 4684. The image on the left
shows the raw observed counts between 0.5 and 7 keV. The image on the right shows the deconvolved version. The LIRA hyperprior values
were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included.


[Figure 6: Fermi-LAT counts image (left, with the PSF shown inset) and the deconvolved image (right), plotted in Galactic Longitude and Latitude with a counts colour scale.]
Fig. 6: Pylira applied to Fermi-LAT data from the Galactic Center region. The image on the left shows the raw measured counts between
5 and 1000 GeV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1,
ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included.


Poisson statistics using a Bayesian sampling approach and a multi-                           [CSv+ 11]   A. Connors, N. M. Stein, D. van Dyk, V. Kashyap, and
scale smoothing prior assumption. The results can be easily written                                      A. Siemiginowska. LIRA — The Low-Counts Image Restora-
                                                                                                         tion and Analysis Package: A Teaching Version via R. In I. N.
to FITS files and inspected by plotting the trace of the sampling                                        Evans, A. Accomazzi, D. J. Mink, and A. H. Rots, editors,
process. This allows users to check for general convergence as                                           Astronomical Data Analysis Software and Systems XX, volume
well as pixel to pixel correlations for selected regions of interest.                                    442 of Astronomical Society of the Pacific Conference Series,
The package is openly developed on GitHub and includes tests                                             page 463, July 2011.
                                                                                             [ECKvD04]   David N. Esch, Alanna Connors, Margarita Karovska, and
and documentation, such that it can be maintained and improved                                           David A. van Dyk. An image restoration technique with
in the future, while ensuring consistency of the results. It comes                                       error estimates. The Astrophysical Journal, 610(2):1213–
with multiple built-in test datasets and explanatory tutorials in                                        1227, aug 2004. URL: https://doi.org/10.1086/421761, doi:
                                                                                                         10.1086/421761.
the form of Jupyter notebooks. Future plans include the support                              [FBPW95]    D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G.
for parallelisation or distributed computing, more flexible prior                                        Walker. Blind deconvolution by means of the richardson–
definitions and the possibility to account for systematic errors on                                      lucy algorithm. J. Opt. Soc. Am. A, 12(1):58–65, Jan 1995.
the PSF during the sampling process.                                                                     URL: http://opg.optica.org/josaa/abstract.cfm?URI=josaa-12-
                                                                                                         1-58, doi:10.1364/JOSAA.12.000058.
                                                                                             [Fer19]     Fermi Science Support Development Team. Fermitools: Fermi
Acknowledgements

This work was conducted under the auspices of the CHASC International Astrostatistics Center. CHASC is supported by NSF grants DMS-21-13615, DMS-21-13397, and DMS-21-13605; by the UK Engineering and Physical Sciences Research Council [EP/W015080/1]; and by NASA 18-APRA18-0019. We thank CHASC members for many helpful discussions, especially Xiao-Li Meng and Katy McKeough. DvD was also supported in part by a Marie Skłodowska-Curie RISE Grant (H2020-MSCA-RISE-2019-873089) provided by the European Commission. Aneta Siemiginowska, Vinay Kashyap, and Doug Burke further acknowledge support from the NASA contract to the Chandra X-ray Center, NAS8-03060.





   Codebraid Preview for VS Code: Pandoc Markdown
             Preview with Jupyter Kernels
                                                                     Geoffrey M. Poore‡∗






Abstract—Codebraid Preview is a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Unlike typical Markdown previews, all Pandoc features are fully supported because Pandoc itself generates the preview. The Markdown source and the preview are fully integrated with features like bidirectional scroll sync. The preview supports LaTeX math via KaTeX. Code blocks and inline code can be executed with Codebraid, using either its built-in execution system or Jupyter kernels. For executed code, any combination of the code and its output can be displayed in the preview as well as the final document. Code execution is non-blocking, so the preview always remains live and up-to-date even while code is still running.

Index Terms—reproducibility, dynamic report generation, literate programming, Python, Pandoc, Markdown, Project Jupyter

* Corresponding author: gpoore@uu.edu
‡ Union University

Copyright © 2022 Geoffrey M. Poore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Pandoc [JM22] is increasingly a foundational tool for creating scientific and technical documents. It provides Pandoc's Markdown and other Markdown variants that add critical features absent in basic Markdown, such as citations, footnotes, mathematics, and tables. At the same time, Pandoc simplifies document creation by providing conversion from Markdown (and other formats) to formats like LaTeX, HTML, Microsoft Word, and PowerPoint. Pandoc is especially useful for documents with embedded code that is executed during the build process. RStudio's RMarkdown [RSt20] and more recently Quarto [RSt22] leverage Pandoc to convert Markdown documents to other formats, with code execution provided by knitr [YX15]. JupyterLab [GP21] centers the writing experience around an interactive, browser-based notebook instead of a Markdown document, but still relies on Pandoc for export to formats other than HTML [Jup22]. There are also ways to interact with a Jupyter Notebook as a Markdown document, such as Jupytext [MWtJT20] and Pandoc's own native Jupyter support.

Writing with Pandoc's Markdown or a similar Markdown variant has advantages when multiple output formats are required, since Pandoc provides the conversion capabilities. Pandoc Markdown variants can also serve as a simpler syntax when creating HTML, LaTeX, or similar documents. They allow HTML and LaTeX to be intermixed with Markdown syntax. They also support including raw chunks of text in other formats such as reStructuredText. When executable code is involved, the RMarkdown-style approach of Markdown with embedded code can sometimes be more convenient than a browser-based Jupyter notebook since the writing process involves more direct interaction with the complete document source.

While using a Pandoc Markdown variant as a source format brings many advantages, the actual writing process itself can be less than ideal, especially when executable code is involved. Pandoc Markdown variants are so powerful precisely because they provide so many extensions to Markdown, but this also means that they can only be fully rendered by Pandoc itself. When text editors such as VS Code provide a built-in Markdown preview, typically only a small subset of Pandoc features is supported, so the representation of the document output will be inaccurate. Some editors provide a visual Markdown editing mode, in which a partially rendered version of the document is displayed in the editor and menus or keyboard shortcuts may replace the direct entry of Markdown syntax. These generally suffer from the same issue. This is only exacerbated when the document embeds code that is executed during the build process, since that goes even further beyond basic Markdown.

An alternative is to use Pandoc itself to generate HTML or PDF output, and then display this as a preview. Depending on the text editor used, the HTML or PDF might be displayed within the text editor in a panel beside the document source, or in a separate browser window or PDF viewer. For example, Quarto offers both possibilities, depending on whether RStudio, VS Code, or another editor is used.[1] While this approach resolves the inaccuracy issues of a basic Markdown preview, it also gives up features such as scroll sync that tightly integrate the Markdown source with the preview. In the case of executable code, there is the additional issue of a time delay in rendering the preview. Pandoc itself can typically convert even a relatively long document in under one second. However, when code is executed as part of the document build process, preview update is blocked until code execution completes.

[1] The RStudio editor is unique in also offering a Pandoc-based visual editing mode, starting with version 1.4 from January 2021 (https://www.rstudio.com/blog/announcing-rstudio-1-4/).

This paper introduces Codebraid Preview, a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Codebraid Preview provides a Pandoc-based preview while avoiding most of the traditional drawbacks of this approach.

The next section provides an overview of features. This is followed by sections focusing on scroll sync, LaTeX support, and code execution as examples of solutions and remaining challenges in creating a better Pandoc writing experience.

Overview of Codebraid Preview

Codebraid Preview can be installed through the VS Code extension manager. Development is at https://github.com/gpoore/codebraid-preview-vscode. Pandoc must be installed separately (https://pandoc.org/). For code execution capabilities, Codebraid must also be installed (https://github.com/gpoore/codebraid).

The preview panel can be opened using the VS Code command palette, or by clicking the Codebraid Preview button that is visible when a Markdown document is open. The preview panel takes the document in its current state, converts it into HTML using Pandoc, and displays the result using a webview. An example is shown in Figure 1. Since the preview is generated by Pandoc, all Pandoc features are fully supported.

By default, the preview updates automatically whenever the Markdown source is changed. There is a short user-configurable minimum update interval. For shorter documents, sub-second updates are typical.

The preview uses the same styling CSS as VS Code's built-in Markdown preview, so it automatically adjusts to the VS Code color theme. For example, changing between light and dark themes changes the background and text colors in the preview.

Codebraid Preview leverages recent Pandoc advances to provide bidirectional scroll sync between the Markdown source and the preview for all CommonMark-based Markdown variants that Pandoc supports (commonmark, gfm, commonmark_x). By default, Codebraid Preview treats Markdown documents as commonmark_x, which is CommonMark with Pandoc extensions for features like math, footnotes, and special list types. The preview still works for other Markdown variants, but scroll sync is disabled. By default, scroll sync is fully bidirectional, so scrolling either the source or the preview will cause the other to scroll to the corresponding location. Scroll sync can instead be configured to be only from source to preview or only from preview to source. As far as I am aware, this is the first time that scroll sync has been implemented in a Pandoc-based preview.

The same underlying features that make scroll sync possible are also used to provide other preview capabilities. Double-clicking in the preview moves the cursor in the editor to the corresponding line of the Markdown source.

Since many Markdown variants support LaTeX math, the preview includes math support via KaTeX [EA22].

Codebraid Preview can simply be used for writing plain Pandoc documents. Optional execution of embedded code is possible with Codebraid [GMP19], using its built-in code execution system or Jupyter kernels. When Jupyter kernels are used, it is possible to obtain the same output that would be present in a Jupyter notebook, including rich output such as plots and mathematics. It is also possible to specify a custom display so that only a selected combination of code, stdout, stderr, and rich output is shown while the rest are hidden. Code execution is decoupled from the preview process, so the Markdown source can be edited and the preview can update even while code is running in the background. As far as I am aware, no previous software for executing code in Markdown has supported building a document with partial code output before execution has completed.

There is also support for document export with Pandoc, using the VS Code command palette or the export-with-Pandoc button.

Scroll sync

Tight source-preview integration requires a source map, or a mapping from characters in the source to characters in the output. Due to Pandoc's parsing algorithms, tracking source location during parsing is not possible in the general case.[2]

[2] See for example https://github.com/jgm/pandoc/issues/4565.

Pandoc 2.11.3 was released in December 2020. It added a sourcepos extension for CommonMark and formats based on it, including GitHub-Flavored Markdown (GFM) and commonmark_x (CommonMark plus extensions similar to Pandoc's Markdown). The CommonMark parser uses a different parsing algorithm from Pandoc's Markdown parser, and this algorithm permits tracking source location. For the first time, it was possible to construct a source map for a Pandoc input format.

Codebraid Preview defaults to commonmark_x as an input format, since it provides the most features of all CommonMark-based formats. Features continue to be added to commonmark_x and it is gradually nearing feature parity with Pandoc's Markdown. Citations are perhaps the most important feature currently missing.[3]

[3] The Pandoc Roadmap at https://github.com/jgm/pandoc/wiki/Roadmap summarizes current commonmark_x capabilities.

Codebraid Preview provides full bidirectional scroll sync between source and preview for all CommonMark-based formats, using data provided by sourcepos. In the output HTML, the first image or inline text element created by each Markdown source line is given an id attribute corresponding to the source line number. When the source is scrolled to a given line range, the preview scrolls to the corresponding HTML elements using these id attributes. When the preview is scrolled, the visible HTML elements are detected via the Intersection Observer API.[4] Then their id attributes are used to determine the corresponding Markdown line range, and the source scrolls to those lines.

[4] For technical details, https://www.w3.org/TR/intersection-observer/. For an overview, https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API.

Scroll sync is slightly more complicated when working with output that is generated by executed code. For example, if a code block is executed and creates several plots in the preview, there isn't necessarily a way to trace each individual plot back to a particular line of code in the Markdown source. In such cases, the line range of the executed code is mapped proportionally to the vertical space occupied by its output.

Pandoc supports multi-file documents. It can be given a list of files to combine into a single output document. Codebraid Preview provides scroll sync for multi-file documents. For example, suppose a document is divided into two files in the same directory, chapter_1.md and chapter_2.md. Treating these as a single document involves creating a YAML configuration file _codebraid_preview.yaml that lists the files:

    input-files:
    - chapter_1.md
    - chapter_2.md

Now launching a preview from either chapter_1.md or chapter_2.md will display a preview that combines both files. When the preview is scrolled, the editor scrolls to the corresponding source location, automatically switching between chapter_1.md and chapter_2.md depending on the part of the preview that is visible.
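To make the sourcepos-based mapping concrete, here is a rough sketch (not the extension's actual implementation) that asks Pandoc for the JSON AST of a CommonMark file with sourcepos enabled and collects the data-pos attributes it adds; it assumes Pandoc 2.11.3 or later is on the PATH and uses the chapter_1.md file from the example above:

```python
# Sketch: collect the source positions that Pandoc's sourcepos extension
# records in the AST, which is the raw material for a source map.
import json
import subprocess

def pandoc_ast(path):
    result = subprocess.run(
        ["pandoc", "--from=commonmark_x+sourcepos", "--to=json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def collect_data_pos(node, found):
    """Recursively gather every data-pos attribute value in the AST."""
    if isinstance(node, list):
        if len(node) == 2 and node[0] == "data-pos" and isinstance(node[1], str):
            found.append(node[1])
        for item in node:
            collect_data_pos(item, found)
    elif isinstance(node, dict):
        for value in node.values():
            collect_data_pos(value, found)

positions = []
collect_data_pos(pandoc_ast("chapter_1.md"), positions)
for pos in positions[:10]:
    print(pos)   # each value records the file and start/end positions as emitted by Pandoc
```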




Fig. 1: Screenshot of a Markdown document with Codebraid Preview in VS Code. This document uses Codebraid to execute code with Jupyter
kernels, so all plots and math visible in the preview are generated during document build.


The preview still works when the input format is set to a non-CommonMark format, but in that case scroll sync is disabled. If Pandoc adds sourcepos support for additional input formats in the future, scroll sync will work automatically once Codebraid Preview adds those formats to the supported list. It is possible to attempt to reconstruct a source map by performing a parallel string search on Pandoc output and the original source. This can be error-prone due to text manipulation during format conversion, but in the future it may be possible to construct a good enough source map to extend basic scroll sync support to additional input formats.

LaTeX support

Support for mathematics is one of the key features provided by many Markdown variants in Pandoc, including commonmark_x. Math support in the preview panel is supplied by KaTeX [EA22], which is a JavaScript library for rendering LaTeX math in the browser.

One of the disadvantages of using Pandoc to create the preview is that every update of the preview is a complete update. This makes the preview more sensitive to HTML rendering time. In contrast, in a Jupyter notebook, it is common to write Markdown in multiple cells which are rendered separately and independently.

MathJax [Mat22] provides a broader range of LaTeX support than KaTeX, and is used in software such as JupyterLab and Quarto. While MathJax performance has improved significantly since the release of version 3.0 in 2019, KaTeX can still have a speed advantage, so it is currently the default due to the importance of HTML rendering. In the future, optional MathJax support may be needed to provide broader math support. For some applications, it may also be worth considering caching pre-rendered or image versions of equations to improve performance.

Code execution

Optional support for executing code embedded in Markdown documents is provided by Codebraid [GMP19]. Codebraid uses Pandoc to convert a document into an abstract syntax tree (AST), then extracts any inline or block code marked with Codebraid attributes from the AST, executes the code, and finally formats the code output so that Pandoc can use it to create the final output document. Code execution is performed with Codebraid's own built-in system or with Jupyter kernels. For example, the code block

    ```{.python .cb-run}
    print("Hello *world!*")
    ```

would result in

    Hello world!

after processing by Codebraid and finally Pandoc. The .cb-run is a Codebraid attribute that marks the code block for execution and specifies the default display of code output. Further examples of Codebraid usage are visible in Figure 1.

Mixing a live preview with executable code provides potential usability and security challenges. By default, code only runs when the user selects execution in the VS Code command palette or clicks the Codebraid execute button. When the preview automatically updates as a result of Markdown source changes, it only uses cached code output. Stale cached output is detected by hashing executed code, and then marked in the preview to alert the user.
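As a simplified stand-in for this pipeline (not Codebraid's actual implementation; doc.md is a hypothetical input file), the sketch below pulls cb-run code blocks out of Pandoc's JSON AST, executes them with the current Python interpreter, and keys the captured output by a hash of the code, the same idea behind the stale-output detection just described:

```python
# Simplified stand-in for the Codebraid workflow described above: find
# CodeBlock elements whose classes include "cb-run", run them, and cache
# their output under a hash of the code so changed code can be detected.
import hashlib
import json
import subprocess
import sys

def markdown_to_ast(path):
    out = subprocess.run(["pandoc", "--from=markdown", "--to=json", path],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def run_marked_code(ast):
    cache = {}
    for block in ast["blocks"]:
        if block["t"] != "CodeBlock":
            continue
        (identifier, classes, attributes), code = block["c"]
        if "cb-run" not in classes:
            continue
        key = hashlib.sha256(code.encode("utf-8")).hexdigest()
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True)
        cache[key] = {"stdout": proc.stdout, "stderr": proc.stderr}
    return cache

if __name__ == "__main__":
    for key, output in run_marked_code(markdown_to_ast("doc.md")).items():
        print(key[:12], repr(output["stdout"]))
```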

The standard approach to executing code within Markdown documents blocks the document build process until all code has finished running. Code is extracted from the Markdown source and executed. Then the output is combined with the original source and passed on to Pandoc or another Markdown application for final conversion. This is the approach taken by RMarkdown, Quarto, and similar software, as well as by Codebraid until recently. This design works well for building a document a single time, but blocking until all code has executed is not ideal in the context of a document preview.

Codebraid now offers a new mode of code execution that allows a document to be rebuilt continuously during code execution, with each build including all code output available at that time. This process involves the following steps:

1) The user selects code execution. Codebraid Preview passes the document to Codebraid. Codebraid begins code execution.
2) As soon as any code output is available, Codebraid immediately streams this back to Codebraid Preview. The output is in a format compatible with the YAML metadata block at the start of Pandoc Markdown documents. The output includes a hash of the code that was executed, so that code changes can be detected later.
3) If the document is modified while code is running or if code output is received, Codebraid Preview rebuilds the preview. It creates a copy of the document with all current Codebraid output inserted into the YAML metadata block at the start of the document. This modified document is then passed to Pandoc. Pandoc runs with a Lua filter[5] that modifies the document AST before final conversion. The filter removes all code marked with Codebraid attributes from the AST, and replaces it with the corresponding code output stored in the AST metadata. If code has been modified since execution began, this is detected with the hash of the code, and an HTML class is added to the output that will mark it visually as stale output. Code that does not yet have output is replaced by a visible placeholder to indicate that code is still running. When the Lua filter finishes AST modifications, Pandoc completes the document build, and the preview updates.
4) As long as code is executing, the previous process repeats whenever the preview needs to be rebuilt.
5) Once code execution completes, the most recent output is reused for all subsequent preview updates until the next time the user chooses to execute code. Any code changes continue to be detected by hashing the code during the build process, so that the output can be marked visually as stale in the preview.

[5] For an overview of Lua filters, see https://pandoc.org/lua-filters.html.

The overall result of this process is twofold. First, building a document involving executed code is nearly as fast as building a plain Pandoc document. The additional output metadata plus the filter are the only extra elements involved in the document build, and Pandoc Lua filters have excellent performance. Second, the output for each code chunk appears in the preview almost immediately after the chunk finishes execution.

While this build process is significantly more interactive than what has been possible previously, it also suggests additional avenues for future exploration. Codebraid's built-in code execution system is designed to execute a predefined sequence of code chunks and then exit. Jupyter kernels are currently used in the same manner to avoid any potential issues with out-of-order execution. However, Jupyter kernels can receive and execute code indefinitely, which is how they commonly function in Jupyter notebooks. Instead of starting a new Jupyter kernel at the beginning of each code execution cycle, it would be possible to keep the kernel from the previous execution cycle and only pass modified code chunks to it. This would allow the same out-of-order execution issues that are possible in a Jupyter notebook. Yet that would make possible much more rapid code output, particularly in cases where large datasets must be loaded or significant preprocessing is required.

Conclusion

Codebraid Preview represents a significant advance in tools for writing with Pandoc. For the first time, it is possible to preview a Pandoc Markdown document using Pandoc itself while having features like scroll sync between the Markdown source and the preview. When embedded code needs to be executed, it is possible to see code output in the preview and to continue editing the document during code execution, instead of having to wait until code finishes running.

Codebraid Preview or future previewers that follow this approach may be perfectly adequate for shorter and even some longer documents, but at some point a combination of document length, document complexity, and mathematical content will strain what is possible and ultimately decrease preview update frequency. Every update of the preview involves converting the entire document with Pandoc and then rendering the resulting HTML.

On the parsing side, Pandoc's move toward CommonMark-based Markdown variants may eventually lead to enough standardization that other implementations with the same syntax and features are possible. This in turn might enable entirely new approaches. An ideal scenario would be a Pandoc-compatible JavaScript-based parser that can parse multiple Markdown strings while treating them as having a shared document state for things like labels, references, and numbering. For example, this could allow Pandoc Markdown within a Jupyter notebook, with all Markdown content sharing a single document state, maybe with each Markdown cell being automatically updated based on Markdown changes elsewhere.

Perhaps more practically, on the preview display side, there may be ways to optimize how the HTML generated by Pandoc is loaded in the preview. A related consideration might be alternative preview formats. There is a significant tradition of tight source-preview integration in LaTeX (for example, [Lau08]). In principle, Pandoc's sourcepos extension should make possible Markdown to PDF synchronization, using LaTeX as an intermediary.

REFERENCES

[EA22] Emily Eisenberg and Sophie Alpert. KaTeX: The fastest math typesetting library for the web, 2022. URL: https://katex.org/.
[GMP19] Geoffrey M. Poore. Codebraid: Live Code in Pandoc Markdown. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 54–61, 2019. doi:10.25080/Majora-7ddc1dd1-008.

[GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021.3059263.
[JM22] John MacFarlane. Pandoc: a universal document converter, 2006–2022. URL: https://pandoc.org/.
[Jup22] Jupyter Development Team. nbconvert: Convert Notebooks to other formats, 2015–2022. URL: https://nbconvert.readthedocs.io.
[Lau08] Jérôme Laurens. Direct and reverse synchronization with SyncTeX. TUGboat, 29(3):365–371, 2008.
[Mat22] MathJax. MathJax: Beautiful and accessible math in all browsers, 2009–2022. URL: https://www.mathjax.org/.
[MWtJT20] Marc Wouts and the Jupytext Team. Jupyter notebooks as Markdown documents, Julia, Python or R scripts, 2018–2020. URL: https://jupytext.readthedocs.io/.
[RSt20] RStudio Inc. R Markdown, 2016–2020. URL: https://rmarkdown.rstudio.com/.
[RSt22] RStudio Inc. Welcome to Quarto, 2022. URL: https://quarto.org/.
[YX15] Yihui Xie. Dynamic Documents with R and knitr. Chapman & Hall/CRC Press, 2015.




            Incorporating Task-Agnostic Information in
          Task-Based Active Learning Using a Variational
                           Autoencoder
                      Curtis Godwin‡†∗, Meekail Zain§†∗, Nathan Safir‡, Bella Humphrey§, Shannon P Quinn§¶






Abstract—It is often much easier and less expensive to collect data than to label it. Active learning (AL) ([Set09]) responds to this issue by selecting which unlabeled data are best to label next. Standard approaches utilize task-aware AL, which identifies informative samples based on a trained supervised model. Task-agnostic AL ignores the task model and instead makes selections based on learned properties of the dataset. We seek to combine these approaches and measure the contribution of incorporating task-agnostic information into standard AL, with the suspicion that the extra information in the task-agnostic features may improve the selection process. We test this on various AL methods using a ResNet classifier with and without added unsupervised information from a variational autoencoder (VAE). Although the results do not show a significant improvement, we investigate the effects on the acquisition function and suggest potential approaches for extending the work.

Index Terms—active learning, variational autoencoder, deep learning, pytorch, semi-supervised learning, unsupervised learning

† These authors contributed equally.
* Corresponding author: cmgodwin263@gmail.com, meekail.zain@uga.edu
‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA
§ Department of Computer Science, University of Georgia, Athens, GA 30602 USA
¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 USA

Copyright © 2022 Curtis Godwin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

In deep learning, the capacity for data gathering often significantly outpaces the labeling. This is easily observed in the field of bioimaging, where ground-truth labeling usually requires the expertise of a clinician. For example, producing a large quantity of CT scans is relatively simple, but having them labeled for COVID-19 by cardiologists takes much more time and money. These constraints ultimately limit the contribution of deep learning to many crucial research problems.

This labeling issue has compelled advancements in the field of active learning (AL) ([Set09]). In a typical AL setting, there is a set of labeled data and a (usually larger) set of unlabeled data. A model is trained on the labeled data, then the model is analyzed to evaluate which unlabeled points should be labeled to best improve the loss objective after further training. AL acknowledges labeling constraints by specifying a budget of points that can be labeled at a time and evaluating against this budget.

In AL, the model for which we select new labels is referred to as the task model. If this model is a classifier neural network, the space in which it maps inputs before classifying them is known as the latent space or representation space. A recent branch of AL ([SS18], [SCN+18], [YK19]), prominent for its applications to deep models, focuses on mapping unlabeled points into the task model's latent space before comparing them.

These methods are limited in their analysis by the labeled data they must train on, failing to make use of potentially useful information embedded in the unlabeled data. We therefore suggest that this family of methods may be improved by extending their representation spaces to include unsupervised features learned over the entire dataset. For this purpose, we opt to use a variational autoencoder (VAE) ([KW13]), which is a prominent method for unsupervised representation learning. Our main contributions are (a) a new methodology for extending AL methods using VAE features and (b) an experiment comparing AL performance across two recent feature-based AL methods using the new method.

Related Literature

Active learning

Much of the early active learning (AL) literature is based on shallower, less computationally demanding networks since deeper architectures were not well-developed at the time. Settles ([Set09]) provides a review of these early methods. The modern approach uses an acquisition function, which involves ranking all available unlabeled points by some chosen heuristic H and choosing to label the points of highest ranking.

The popularity of the acquisition approach has led to a widely-used evaluation procedure, which we describe in Algorithm 1.
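A rough sketch of this kind of loop (a paraphrase under the description in the text, not the exact Algorithm 1) might look like the following, with the task model, heuristic, and labeling oracle passed in as placeholder callables:

```python
# Sketch of the acquisition-style evaluation loop: train on the labeled
# pool, record accuracy, rank the unlabeled pool with a heuristic H, and
# move the top-b points into the labeled pool; repeat.
def active_learning_curve(train_fn, accuracy_fn, heuristic, labeled,
                          unlabeled, budget, rounds):
    history = []
    for _ in range(rounds):
        model = train_fn(labeled)
        history.append(accuracy_fn(model))
        scores = {u: heuristic(model, u, labeled) for u in unlabeled}
        batch = sorted(unlabeled, key=scores.get, reverse=True)[:budget]
        labeled.extend(batch)      # in practice an oracle supplies the labels here
        unlabeled[:] = [u for u in unlabeled if u not in batch]
    return history

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    import random
    random.seed(0)
    pool = list(range(20))
    curve = active_learning_curve(
        train_fn=lambda pts: set(pts),
        accuracy_fn=lambda model: len(model) / 20,
        heuristic=lambda model, u, labeled: random.random(),
        labeled=pool[:3], unlabeled=pool[3:], budget=3, rounds=4)
    print(curve)
```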

This procedure trains a task model T on the initial labeled data, records its test accuracy, then uses H to label a set of unlabeled points. We then once again train T on the labeled data and record its accuracy. This is repeated until a desired number of labels is reached, and then the accuracies can be graphed against the number of available labels to demonstrate performance over the course of labeling. We can use this evaluation algorithm to separately evaluate multiple acquisition functions on their resulting accuracy graphs. This is utilized in many AL papers to show the efficacy of their suggested heuristics in comparison to others ([WZL+16], [SS18], [SCN+18], [YK19]).

The prevailing approach to point selection has been to choose unlabeled points for which the model is most uncertain, the assumption being that uncertain points will be the most informative ([BRK21]). A popular early method was to label the unlabeled points of highest Shannon entropy ([Sha48]) under the task model, which is a measure of uncertainty between the classes of the data. This method is now more commonly used in combination with a representativeness measure ([WZL+16]) to avoid selecting condensed clusters of very similar points.

Recent heuristics using deep features

For convolutional neural networks (CNNs) in image classification settings, the task model T can be decomposed into a feature-generating module

    T_f : \mathbb{R}^n \to \mathbb{R}^f,

which maps the input data vectors to the output of the final fully connected layer before classification, and a classification module

    T_c : \mathbb{R}^f \to \{0, 1, \dots, c\},

where c is the number of classes.

Recent deep learning-based AL methods have approached the notion of model uncertainty in terms of the rich features generated by the learned model. Core-set ([SS18]) and MedAL ([SCN+18]) select unlabeled points that are the furthest from the labeled set in terms of L2 distance between the learned features. For core-set, each point constructing the set S in step 6 of Algorithm 1 is chosen by

    u^* = \argmax_{u \in U} \min_{\ell \in L} \| T_f(u) - T_f(\ell) \|_2,    (1)

where U is the unlabeled set and L is the labeled set. The analogous operation for MedAL is

    u^* = \argmax_{u \in U} \frac{1}{|L|} \sum_{i=1}^{|L|} \| T_f(u) - T_f(L_i) \|_2.    (2)

Note that after a point u^* is chosen, the selection of the next point assumes the previous u^* to be in the labeled set. This way we discourage choosing sets that are closely packed together, leading to sets that are more diverse in terms of their features. This effect is more pronounced in the core-set method since it takes the minimum distance whereas MedAL uses the average distance.

Another recent method ([YK19]) trains a regression network to predict the loss of the task model, then takes the heuristic H in Algorithm 1 to select the unlabeled points of highest predicted loss. To implement this, the loss prediction network P is attached to a ResNet task model T and is trained jointly with T. The inputs to P are the features output by the ResNet's four residual blocks. These features are mapped into the same dimensionality via a fully connected layer and then concatenated to form a representation c. An additional fully connected layer then maps c into a single value constituting the loss prediction.

When attempting to train a network to directly predict T's loss during training, the ground truth losses naturally decrease as T is optimized, resulting in a moving objective. The authors of ([YK19]) find that a more stable ground truth is the inequality between the losses of given pairs of points. In this case, P is trained on pairs of labeled points, so that P is penalized for producing predicted loss pairs that exhibit a different inequality than the corresponding true loss pair.

More specifically, for each batch of labeled data L_batch ⊂ L that is propagated through T during training, the batch of true losses is computed and split randomly into a batch of pairs P_batch. The loss prediction network produces a corresponding batch of predicted loss pairs, denoted P̃_batch. The following pair loss is then computed given each p ∈ P_batch and its corresponding p̃ ∈ P̃_batch:

    \mathcal{L}_{pair}(p, \tilde{p}) = \max(0, -\mathbb{I}(p) \cdot (\tilde{p}^{(1)} - \tilde{p}^{(2)}) + \xi),    (3)

where \mathbb{I} is the following indicator function for pair inequality:

    \mathbb{I}(p) = \begin{cases} 1, & p^{(1)} > p^{(2)} \\ -1, & p^{(1)} \le p^{(2)} \end{cases}.    (4)

Variational Autoencoders

Variational autoencoders (VAEs) ([KW13]) are an unsupervised method for modeling data using Bayesian posterior inference. We begin with the Bayesian assumption that the data is well-modeled by some distribution, often a multivariate Gaussian. We also assume that this data distribution can be inferred reasonably well by a lower dimensional random variable, also often modeled by a multivariate Gaussian.

The inference process then consists of an encoding into the lower dimensional latent variable, followed by a decoding back into the data dimension. We parametrize both the encoder and the decoder as neural networks, jointly optimizing their parameters with the following loss function ([KW19]):

    \mathcal{L}_{\theta,\phi}(x) = \log p_\theta(x|z) + [\log p_\theta(z) - \log q_\phi(z|x)],    (5)

where \phi and \theta are the parameters of the encoder and the decoder, respectively. The first term is the reconstruction error, penalizing the parameters for producing poor reconstructions of the input data. The second term is the regularization error, encouraging the encoding to resemble a pre-selected prior distribution, commonly a unit Gaussian prior.

The encoder of a well-optimized VAE can be used to generate latent encodings with rich features which are sufficient to approximately reconstruct the data. The features also have some geometric consistency, in the sense that the encoder is encouraged to generate encodings in the pattern of a Gaussian distribution.

Methods

We observe that the notions of uncertainty developed in the core-set and MedAL methods rely on distances between feature vectors modeled by the task model T. Additionally, loss prediction relies on a fully connected layer mapping from a feature space to a single value, producing different predictions depending on the values of the relevant feature vector. Thus all of these methods utilize spatial reasoning in a vector space.
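As a concrete reading of the pair loss in Eqs. (3)-(4) above (our PyTorch sketch, not the reference implementation of [YK19]), consecutive elements of a batch are treated as pairs, and a predicted pair is penalized whenever its ordering fails to agree with the ordering of the true losses by at least the margin ξ:

```python
# Sketch of the pair loss from Eqs. (3)-(4): pair up consecutive losses in
# a batch and penalize predicted pairs whose ordering does not agree with
# the ordering of the true losses by at least a margin xi.
import torch

def pair_loss(true_losses, pred_losses, xi=1.0):
    t1, t2 = true_losses[0::2], true_losses[1::2]    # true loss pairs
    p1, p2 = pred_losses[0::2], pred_losses[1::2]    # predicted loss pairs
    indicator = torch.where(t1 > t2, torch.ones_like(t1), -torch.ones_like(t1))
    per_pair = torch.clamp(-indicator * (p1 - p2) + xi, min=0.0)
    return per_pair.mean()

# Toy usage with random tensors standing in for true and predicted losses.
true = torch.rand(8)
pred = torch.rand(8, requires_grad=True)
loss = pair_loss(true, pred)
loss.backward()
print(float(loss))
```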

trained only on the labeled points at a given timestep in the la-        ensure that the task models being compared were supplied with
beling procedure. Since variational autoencoder (VAE) encodings          the same initial set of labels.
are not limited by the contents of the labeled set, we suggest that          With four NVIDIA 2080 GPUs, the total runtime for the
the aforementioned methods may benefit by expanding the vector           MNIST experiments was 5113s for core-set and 4955s for loss
spaces they investigate to include VAE features learned across           prediction; for ChestMNIST, the total runtime was 7085s for core-
the entire dataset, including the unlabeled data. These additional       set and 7209s for loss prediction.
features will constitute representative and previously inaccessible
information regarding the data, which may improve the active
learning process.
    We implement this by first training a VAE model V on the
given dataset. V can then be used as a function returning the
VAE features for any given datapoint. We append these additional
features to the relevant vector spaces using vector concatenation,
an operation we denote with the symbol _. The modified point
selection operation in core-set then becomes
  u∗ = argmax min ||([T f (u) _ αV (u)] − [T f (``) _ αV (``)]||2 ,
         u∈U    ` ∈L
                                                               (6)
where α is a hyperparameter that scales the influence of the VAE
features in computing the vector distance. To similarly modify the
loss prediction method, we concatenate the VAE features to the           Fig. 1: The average MNIST results using the core-set heuristic versus
final ResNet feature concatenation c before the loss prediction,         the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.
so that the extra information is factored into the training of the
prediction network P.


Experiments
In order to measure the efficacy of the newly proposed methods,
we generate accuracy graphs using Algorithm 1, freezing all
settings except the selection heuristic H . We then compare the
performance of the core-set and loss prediction heuristics with
their VAE-augmented counterparts.
    We use ResNet-18 pretrained on ImageNet as the task model,
using the SGD optimizer with learning rate 0.001 and momen-
tum 0.9. We train on the MNIST ([Den12]) and ChestMNIST
([YSN21]) datasets. ChestMNIST consists of 112,120 chest X-ray
images resized to 28x28 and is one of several benchmark medical
image datasets introduced in ([YSN21]).
                                                                         Fig. 2: The average MNIST results using the loss prediction heuristic
    For both datasets we experiment on randomly selected subsets,        versus the VAE-augmented loss prediction heuristic for Algorithm 1
using 25000 points for MNIST and 30000 points for ChestMNIST.            over 5 runs.
In both cases we begin with 3000 initial labels and label 3000
points per active learning step. We opt to retrain the task model
after each labeling step instead of fine-tuning.
    We use a similar training strategy as in ([SCN+ 18]), training
the task model until >99% train accuracy before selecting new
points to label. This ensures that the ResNet is similarly well fit to
the labeled data at each labeling iteration. This is implemented by
training for 10 epochs on the initial training set and increasing the
training epochs by 5 after each labeling iteration.
    The VAEs used for the experiments are trained for 20 epochs
using an Adam optimizer with learning rate 0.001 and weight
decay 0.005. The VAE encoder architecture consists of four con-
volutional downsampling filters and two linear layers to learn the
low dimensional mean and log variance. The decoder consists of
an upsampling convolution and four size-preserving convolutions
to learn the reconstruction.
                                                                         Fig. 3: The average ChestMNIST results using the core-set heuristic
    Experiments were run five times, each with a separate set of         versus the VAE-augmented core-set heuristic for Algorithm 1 over 5
randomly chosen initial labels, with the displayed results showing       runs.
the average validation accuracies across all runs. Figures 1 and
3 show the core-set results, while Figures 2 and 4 show the loss
prediction results. In all cases, shared random seeds were used.

Fig. 4: The average ChestMNIST results using the loss prediction
heuristic versus the VAE-augmented loss prediction heuristic for
Algorithm 1 over 5 runs.

    To investigate the qualitative difference between the VAE and
non-VAE approaches, we performed an additional experiment
to visualize an example of core-set selection. We first train the
ResNet-18 with the same hyperparameter settings on 1000 initial
labels from the ChestMNIST dataset, then randomly choose 1556
(5%) of the unlabeled points from which to select 100 points to
label. These smaller sizes were chosen to promote visual clarity in
the output graphs.
    We use t-SNE ([VdMH08]) dimensionality reduction to show
the ResNet features of the labeled set, the unlabeled set, and the
points chosen to be labeled by core-set.
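A typical way to produce such a visualization with scikit-learn and
matplotlib is sketched below; this is not the authors' code, and the
feature arrays and index masks are random stand-ins.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

feats = np.random.randn(500, 512)               # stand-in for ResNet features
labeled = np.zeros(500, dtype=bool); labeled[:100] = True
selected = np.zeros(500, dtype=bool); selected[100:110] = True

emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
plt.scatter(*emb[~labeled & ~selected].T, s=5, label="unlabeled")
plt.scatter(*emb[labeled].T, s=5, label="labeled")
plt.scatter(*emb[selected].T, s=20, marker="x", label="chosen by core-set")
plt.legend()
plt.show()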
Fig. 5: A t-SNE visualization of the ChestMNIST points chosen by
core-set.

Fig. 6: A t-SNE visualization of the ChestMNIST points chosen by
core-set when the ResNet features are augmented with VAE features.


Discussion

Overall, the VAE-augmented active learning heuristics did not
exhibit a significant performance difference when compared with
their counterparts. The only case of a significant p-value (<0.05)
occurred during loss prediction on the MNIST dataset at 21000
labels.
    The t-SNE visualizations in Figures 5 and 6 show some of
the influence that the VAE features have on the core-set selection
process. In Figure 5, the selected points tend to be more spread
out, while in Figure 6 they cluster at one edge. This appears to
mirror the transformation of the rest of the data, which is more
spread out without the VAE features, but becomes condensed in
the center when they are introduced, approaching the shape of a
Gaussian distribution.
    It seems that with the added VAE features, the selected points
are further out of distribution in the latent space. This makes sense
because points tend to be more sparse at the tails of a Gaussian
distribution and core-set prioritizes points that are well-isolated
from other points.
    One reason for the lack of performance improvement may be
the homogeneous nature of the VAE, where the optimization goal
is reconstruction rather than classification. This could be improved
by using a multimodal prior in the VAE, which may do a better
job of modeling relevant differences between points.


Conclusion

Our original intuition was that additional unsupervised information
may improve established active learning methods, especially
when using a modern unsupervised representation method such as
a VAE. The experimental results did not support this hypothesis,
but additional investigation of the VAE features showed a notable
change in the task model latent space. Though this did not result in
superior point selections in our case, it is of interest whether
different approaches to latent space augmentation in active learning
may fare better.
    Future work may explore the use of class-conditional VAEs
in a similar application, since a VAE that can utilize the available
class labels may produce more effective representations, and it
could be retrained along with the task model after each labeling
iteration.


REFERENCES

[BRK21]   Samuel Budd, Emma C Robinson, and Bernhard Kainz. A
          survey on active learning and human-in-the-loop deep learning
          for medical image analysis. Medical Image Analysis, 71:102062,
          2021. doi:10.1016/j.media.2021.102062.
[Den12]   Li Deng. The mnist database of handwritten digit images for
          machine learning research. IEEE Signal Processing Magazine,
          29(6):141–142, 2012. doi:10.1109/MSP.2012.2211477.
[KW13]    Diederik P Kingma and Max Welling. Auto-encoding variational
          bayes. arXiv preprint arXiv:1312.6114, 2013.
[KW19]    Diederik P. Kingma and Max Welling.                    An Intro-
          duction to Variational Autoencoders.             Now Publishers,
          2019. URL: https://doi.org/10.1561%2F9781680836233, doi:
          10.1561/9781680836233.
[SCN+18]  Asim Smailagic, Pedro Costa, Hae Young Noh, Devesh
          Walawalkar, Kartik Khandelwal, Adrian Galdran, Mostafa Mir-
          shekari, Jonathon Fagert, Susu Xu, Pei Zhang, et al. Medal:
          Accurate and robust deep active learning for medical image
          analysis. In 2018 17th IEEE international conference on machine
          learning and applications (ICMLA), pages 481–488. IEEE, 2018.
          doi:10.1109/icmla.2018.00078.
[Set09]   Burr Settles. Active learning literature survey. 2009.
[Sha48]   Claude Elwood Shannon. A mathematical theory of communica-
          tion. The Bell system technical journal, 27(3):379–423, 1948.
[SS18]    Ozan Sener and Silvio Savarese. Active learning for convolutional
          neural networks: A core-set approach. In International Conference
          on Learning Representations, 2018. URL: https://openreview.net/
          forum?id=H1aIuk-RW.
[VdMH08] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data
          using t-sne. Journal of machine learning research, 9(11), 2008.
[WZL+ 16] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang
          Lin. Cost-effective active learning for deep image classification.
          IEEE Transactions on Circuits and Systems for Video Technol-
          ogy, 27(12):2591–2600, 2016. doi:10.1109/tcsvt.2016.
          2589879.
[YK19]    Donggeun Yoo and In So Kweon. Learning loss for active
          learning. In Proceedings of the IEEE/CVF conference on
          computer vision and pattern recognition, pages 93–102, 2019.
          doi:10.1109/CVPR.2019.00018.
[YSN21] Jiancheng Yang, Rui Shi, and Bingbing Ni. Medmnist classi-
          fication decathlon: A lightweight automl benchmark for med-
          ical image analysis. In 2021 IEEE 18th International Sym-
          posium on Biomedical Imaging (ISBI), pages 191–195, 2021.
          doi:10.1109/ISBI48211.2021.9434062.




                     Awkward Packaging: building Scikit-HEP
                                              Henry Schreiner‡∗ , Jim Pivarski‡ , Eduardo Rodrigues§






Abstract—Scikit-HEP has grown rapidly over the last few years, not just to serve
the needs of the High Energy Physics (HEP) community, but in many ways,
the Python ecosystem at large. AwkwardArray, boost-histogram/hist, and iminuit
are examples of libraries that are used beyond the original HEP focus. In this
paper we will look at key packages in the ecosystem, and how the collection of
30+ packages was developed and maintained. Also we will look at some of the
software ecosystem contributions made to packages like cibuildwheel, pybind11,
nox, scikit-build, build, and pipx that support this effort. We will also discuss the
Scikit-HEP developer pages and initial WebAssembly support.

Index Terms—packaging, ecosystem, high energy physics, community project

* Corresponding author: henryfs@princeton.edu
‡ Princeton University
§ University of Liverpool

Copyright © 2022 Henry Schreiner et al. This is an open-access article
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.


Introduction

High Energy Physics (HEP) has always had intense computing
needs due to the size and scale of the data collected. The
World Wide Web was invented at the CERN Physics laboratory
in Switzerland in 1989 when scientists in the EU were trying
to communicate results and datasets with scientists in the US,
and vice-versa [LCC+09]. Today, HEP has the largest scientific
machine in the world, at CERN: the Large Hadron Collider (LHC),
27 km in circumference [EB08], with multiple experiments with
thousands of collaborators processing over a petabyte of raw data
every day, with 100 petabytes being stored per year at CERN. This
is one of the largest scientific datasets in the world of exabyte scale
[PJ11], which is roughly comparable in order of magnitude to all
of astronomy or YouTube [SLF+15].
    In the mid nineties, HEP users were beginning to look for
a new language to replace Fortran. A few HEP scientists started
investigating the use of Python around the release of 1.0.0 in 1994
[Tem22]. A year later, the ROOT project for an analysis toolkit
(and framework) was released, quickly making C++ the main
language for HEP. The ROOT project also needed an interpreted
language to drive analysis code. Python was rejected for this role
due to being "exotic" at the time, and because it was considered too
much to ask physicists to code in two languages. Instead, ROOT
provided a C++ interpreter, called CINT, which later was replaced
with Cling, which is the basis for the clang-repl project in LLVM
today [IVL22].
    Python would start showing up in the late 90s in experiment
frameworks as a configuration language. These frameworks were
primarily written in C++, but were made of many configurable
parts [Lam98]. The glueing together of the system was done in
Python, a model still popular today, though some experiments are
now using Python + Numba as an alternative model, such as for
example the Xenon1T experiment [RTA+17], [RS21].
    In the early 2000s, the use of Python in HEP exploded, heavily
driven by experiments like LHCb developing frameworks and user
tools for scripting. ROOT started providing Python bindings in
2004 [LGMM05] that were not considered Pythonic [GTW20],
and still required a complex multi-hour build of ROOT to use1.
Analyses still consisted largely of ROOT, with Python sometimes
showing up.

    1. Almost 20 years later ROOT's Python bindings have been rewritten for
easier Pythonizations, and installing ROOT in Conda is now much easier,
thanks in large part to efforts from Scikit-HEP developers.

    By the mid 2010s, a marked change had occurred, driven by
the success of Python in Data Science, especially in education.
Many new students were coming into HEP with little or no
C++ experience, but with existing knowledge of Python and the
growing Python data science ecosystem, like NumPy and Pandas.
Several HEP experiment analyses were performed in, or driven
by, Python, with ROOT only being used for things that were
not available in the Python ecosystem. Some of these were HEP
specific: ROOT is also a data format, so users needed to be able
to read data from ROOT files. Others were less specific: HEP
users have intense histogram requirements due to the data sizes,
large portions of HEP data are "jagged" rather than rectangular;
vector manipulation was important (especially Lorentz vectors, a
four dimensional relativistic vector with a non-Euclidean metric);
and data fitting was important, especially with complex models
and accurate error estimation.


Beginnings of a scikit

In 2016, the ecosystem for Python in HEP was rather fragmented.
Physicists were developing tools in isolation, without knowing
about the overlaps with other tools, and without making them
interoperable. There were a handful of popular packages that
were useful in HEP spread around among different authors. The
ROOTPy project had several packages that made the ROOT-
Python bridge a little easier than the built-in PyROOT, such as the
root-numpy and related root-pandas packages. The C++ MINUIT
fitting library was integrated into ROOT, but the iminuit package
[Dea20] provided an easy to install standalone Python package
with an extracted copy of MINUIT. Several other specialized
standalone C++ packages had bindings as well. Many of the initial
authors were transitioning to a less-code centric role or leaving
for industry, leaving projects like ROOTPy and iminuit without
maintainers.
Fig. 1: The Scikit-HEP ecosystem and affiliated packages.

    Eduardo Rodrigues, a scientist working on the LHCb ex-
periment for the University of Cincinnati, started working on a
package called scikit-hep that would provide a set of tools useful
for physicists working on HEP analysis. The initial version of the
scikit-hep package had a simple vector library, HEP related units
and conversions, several useful statistical tools, and provenance
recording functionality.
    He also placed the scikit-hep GitHub repository in a Scikit-
HEP GitHub organization, and asked several of the other HEP
related packages to join. The ROOTPy project was ending, with
the primary author moving on, and so several of the then-popular
packages2 that were included in the ROOTPy organization were
happily transferred to Scikit-HEP. Several other existing HEP
libraries, primarily interfacing to existing C++ simulation and
tracking frameworks, also joined, like PyJet and NumPythia. Some
of these libraries have been retired or replaced today, but were an
important part of Scikit-HEP's initial growth.

    2. The primary package of the ROOTPy project, also called ROOTPy, was
not transferred, but instead had a final release and then died. It was an
inspiration for the new PyROOT bindings, and influenced later Scikit-HEP
packages like mplhep. The transferred libraries have since been replaced by
integrated ROOT functionality. All these packages required ROOT, which is
not on PyPI, so were not suited for a Python-centric ecosystem.


First initial success

In 2016, the largest barrier to using Python in HEP in a Pythonic
way was ROOT. It was challenging to compile, had many non-
Python dependencies, was huge compared to most Python li-
braries, and didn't play well with Python packaging. It was not
Pythonic, meaning it had very little support for Python protocols
like iteration, buffers, keyword arguments, tab completion and
inspection, and dunder methods; it didn't follow conventions for
useful reprs or Python naming conventions; it was simply a direct
on-demand C++ binding, including pointers. Many Python analyses
started with a "convert data" step using PyROOT to read ROOT
files and convert them to a Python friendly format like HDF5.
Then the bulk of the analysis would use reproducible Python
virtual environments or Conda environments.
    This changed when Jim Pivarski introduced the Uproot pack-
age, a pure-Python implementation of a ROOT file reader (and
later writer) that could remove the initial conversion environment
by simply pip installing a package. It also had a simple, Pythonic
interface and produced outputs Python users could immediately
use, like NumPy arrays, instead of PyROOT's wrapped C++
pointers.
    Uproot needed to do more than just be a file format
reader/writer; it needed to provide a way to represent the special
structure and common objects that ROOT files could contain.
This led to the development of two related packages that would
support uproot. One, uproot-methods, included Pythonic access to
functionality provided by ROOT for its core classes, like spatial
and Lorentz vectors. The other was AwkwardArray, which would
grow to become one of the most important and most general
packages in Scikit-HEP. This package allows NumPy-like idioms
for array-at-a-time manipulation on jagged data structures. A
jagged array is a (possibly structured) array with a variable length
dimension. These are very common and relevant in HEP; events
have a variable number of tracks, tracks have a variable number
of hits in the detector, etc. Many other fields also have jagged
data structures. While there are formats to store such structures,
computations on jagged structures have usually been closer to SQL
queries on multiple tables than direct object manipulation. Pandas
handles this through multiple indexing and a lot of duplication.
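For readers unfamiliar with jagged arrays, a minimal illustration of the
NumPy-like idioms AwkwardArray provides (not taken from the paper; the
values are invented) is:

import awkward as ak

# Three "events" with 3, 0, and 2 "tracks": a jagged (variable-length) array.
events = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

print(ak.num(events))          # [3, 0, 2]  -- number of tracks per event
print(ak.sum(events, axis=1))  # per-event sums, a reduction over the ragged axis
print(events[events > 2.0])    # boolean masking that preserves the jagged structure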
    Uproot was a huge hit with incoming HEP students (see Fig 2);
suddenly they could access HEP data using a library installed with
pip or conda and no external compiler or library requirements, and
could easily use tools they already knew that were compatible with
the Python buffer protocol, like NumPy, Pandas and the rapidly
growing machine learning frameworks. There were still some gaps
and pain points in the ecosystem, but an analysis without writing
C++ (interpreted or compiled) or compiling ROOT manually was
finally possible. Scikit-HEP did not and does not intend to replace
ROOT, but it provides alternative solutions that work natively in
the Python "Big Data" ecosystem.
    Several other useful HEP libraries were also written. Particle
was written for accessing the Particle Data Group (PDG) particle
data in a simple and Pythonic way. DecayLanguage originally
provided tooling for decay definitions, but was quickly expanded
to include tools to read and validate "DEC" decay files, an existing
text format used to configure simulations in HEP.


Building compiled packages

In 2018, HEP physicist and programmer Hans Dembinski pro-
posed a histogram library to the Boost libraries, the most influen-
tial C++ library collection; many additions to the standard library
are based on Boost. Boost.Histogram provided a histogram-as-
an-object concept from HEP, but was designed around C++14
templating, using composable axes and storage types. It originally
had an initial Python binding, written in Boost::Python. Henry
Schreiner proposed the creation of a standalone binding to be
written with pybind11 in Scikit-HEP. The original bindings were
removed, Boost::Histogram was accepted into the Boost libraries,
and work began on boost-histogram. IRIS-HEP, a multi-institution
project for sustainable HEP software, had just started, which was
providing funding for several developers to work on Scikit-HEP
project packages such as this one. This project would pioneer
standalone C++ library development and deployment for Scikit-
HEP.
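As an illustration of the histogram-as-an-object design with composable
axes and storage types, a small boost-histogram example might look like
the following; the axis choices and fill values are hypothetical, not from
the paper.

import boost_histogram as bh

# A two-dimensional histogram built from composable axes and a
# weighted-storage policy.
h = bh.Histogram(
    bh.axis.Regular(50, 0.0, 10.0, metadata="pT [GeV]"),
    bh.axis.IntCategory([211, 321, 2212]),
    storage=bh.storage.Weight(),
)
h.fill([1.2, 3.4, 7.8], [211, 321, 211], weight=[1.0, 0.5, 2.0])
h.fill([2.2], [2212], weight=[1.5])        # repeated fills accumulate in place
counts = h.view()["value"]                 # NumPy view of the accumulated weights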
    There were already a variety of attempts at histogram libraries,
but none of them filled the requirements of HEP physicists:
fills on pre-existing histograms, simple manipulation of multi-
dimensional histograms, competitive performance, and easy
installation in clusters or for students. Any new attempt here would
have to be clearly better than the existing collection of diverse
attempts (see Fig 3). The development of a library with compiled
components intended to be usable everywhere required good
support for building libraries that was lacking both in Scikit-
HEP and to an extent the broader Python ecosystem. Previous
advancements in the packaging ecosystem, such as the wheel
format for distributing binary platform dependent Python packages
and the manylinux specification and docker image that allowed a
single compiled wheel to target many distributions of Linux, helped,
but there still were many challenges to making a library
redistributable on all platforms.

Fig. 2: Adoption of scientific Python libraries and Scikit-HEP among members of the CMS experiment (one of the four major LHC experiments).
CMS requires users to fork github:cms-sw/cmssw, which can be used to identify 3484 physicist users, who created 16656 non-fork repos.
This plot quantifies adoption by counting "#include X", "import X", and "from X import" strings in the users' code to measure
adoption of various libraries (most popular by category are shown).

Fig. 3: Developer activity on histogram libraries in HEP: number of unique committers to each library per month, smoothed (derived from git
logs). Illustrates the convergence of a fractured community (around 2017) into a unified one (now).

    The boost-histogram library only depended on header-only
components of the Boost libraries, and the header-only pybind11
package, so it was able to avoid a separate compile step or
linking to external dependencies, which simplified the initial build
process. All needed files were collected from git submodules and
packed into a source distribution (SDist), and everything was built
using only setuptools, making build-from-source simple on any
system supporting C++14. This did not include RHEL 7, a popular
platform in HEP at the time, and on any platform building could
take several minutes and required several gigabytes of memory
to resolve the heavy C++ templating in the Boost libraries and
pybind11.
    The first stand-alone development was azure-wheel-helpers, a
set of files that helped produce wheels on the new Azure Pipelines
platform. Building redistributable wheels requires a variety of
techniques, even without shared libraries, that vary dramatically
between platforms and were/are poorly documented. On Linux,
everything needs to be built inside a controlled manylinux image,
and post-processed by the auditwheel tool. On macOS, this in-
cludes downloading an official CPython binary for Python to allow
older versions of macOS to be targeted (10.9+), several special
environment variables, especially when cross compiling to Apple
Silicon, and post processing with the delocate tool. Windows is
the simplest, as most versions of CPython work identically there.
azure-wheel-helpers worked well, and was quickly adapted for
the other packages in Scikit-HEP that included non-ROOT binary
components. Work here would eventually be merged into the
existing and general cibuildwheel package, which would become
the build tool for all non-ROOT binary packages in Scikit-HEP, as
well as over 600 other packages like matplotlib and numpy, and
was accepted into the PyPA (Python Packaging Authority).
    The second major development was the upstreaming of CI
and build system developments to pybind11.
Pybind11 is a C++ API for Python designed for writing bindings
to C++, and provided significant benefits to our packages over
(mis)using Cython for bindings; Cython was designed to transpile
a Python-like language to C (or C++), and just happened to support
bindings since you can call C and C++ from it, but it was not what
it was designed for. Benefits of pybind11 included reduced code
complexity and duplication, no pre-process step (cythonize), no
need to pin NumPy when building, and a cross-package API. The
iMinuit package was later moved from Cython to pybind11 as
well, and pybind11 became the Scikit-HEP recommended binding
tool. We contributed a variety of fixes and features to pybind11,
including positional-only and keyword-only arguments, the option
to prepend to the overload chain, and an API for type access
and manipulation. We also completely redesigned CMake inte-
gration, added a new pure-Setuptools helpers file, and completely
redesigned the CI using GitHub Actions, running over 70 jobs on
a variety of systems and compilers. We also helped modernize and
improve all the example projects with simpler builds, new CI, and
cibuildwheel support.
    This example of a project with binary components being
usable everywhere then encouraged the development of Awkward
1.0, a rewrite of AwkwardArray replacing the Python-only code
with compiled code using pybind11, fixing some long-standing
limitations, like an inability to slice past two dimensions or select
"n choose k" for k > 5; these simply could not be expressed
using Awkward 0's NumPy expressions, but can be solved with
custom compiled kernels. This also enabled further developments
in backends [PEL20].


Broader ecosystem

Scikit-HEP had become a "toolset" for HEP analysis in Python, a
collection of packages that worked together, instead of a "toolkit"
like ROOT, which is one monopackage that tries to provide every-
thing [R+20]. A toolset is more natural in the Python ecosystem,
where we have good packaging tools and many existing libraries.
Scikit-HEP only needed to fill existing gaps, instead of covering
every possible aspect of an analysis like ROOT did. The original
scikit-hep package had its functionality pulled out into existing or
new separate packages such as HEPUnits and Vector, and the core
scikit-hep package instead became a metapackage with no unique
functionality on its own. Instead, it installs a useful subset of our
libraries for a physicist wanting to quickly get started on a new
analysis.
    Scikit-HEP was quickly becoming the center of HEP specific
Python software (see Fig. 1). Several other projects or packages
joined Scikit-HEP. iMinuit, a popular HEP and astrophysics fitting
library, was probably the most widely used single package to
have joined. PyHF and cabinetry also joined; these were larger
frameworks that could drive a significant part of an analysis
internally using other Scikit-HEP tools.
    Other packages, like GooFit, Coffea, and zFit, were not added,
but were built on Scikit-HEP packages and had developers work-
ing closely with Scikit-HEP maintainers. Scikit-HEP introduced
an "affiliated" classification for these packages, which allowed
an external package to be listed on the Scikit-HEP website
and encouraged collaboration. Coffea had a strong influence
on histogram design, and zFit has contributed code to Scikit-
HEP. Currently all affiliated packages have at least one Scikit-
HEP developer as a maintainer, though that is not a requirement.
An affiliated package fills a particular need for the community.
Scikit-HEP does not need to attempt to develop a package that
others are providing, but rather tries to ensure that the externally
provided package works well with the broader HEP ecosystem.
The affiliated classification is also used on broader ecosystem
packages like pybind11 and cibuildwheel that we recommend and
share maintainers with.

Fig. 4: The collection of histogram packages and related packages in
Scikit-HEP.

    Histogramming was designed to be a collection of specialized
packages (see Fig. 4) with carefully defined interoperability:
boost-histogram for manipulation and filling, Hist for a user-
friendly interface and simple plotting tools, histoprint for display-
ing histograms, and the existing mplhep and uproot packages also
needed to be able to work with histograms. This ecosystem was
built and is held together with UHI, which is a formal specification
agreed upon by several developers of different libraries, backed by
a statically typed Protocol, for a PlottableHistogram object. Pro-
ducers of histograms, like boost-histogram/hist and uproot, provide
objects that follow this specification, and users of histograms,
such as mplhep and histoprint, take any object that follows this
specification. The UHI library is not required at runtime, though it
does also provide a few simple utilities to help a library also accept
ROOT histograms, which do not (currently) follow the Protocol, so
several libraries have decided to include it at runtime too. By using
a static type checker like MyPy to statically enforce a Protocol,
libraries can communicate without depending on each other or on
a shared runtime dependency or class inheritance. This has been
a great success story for Scikit-HEP, and we expect Protocols to
continue to be used in more places in the ecosystem.
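A simplified sketch of how such a statically checked Protocol decouples
producers from consumers is shown below; this is an illustrative stand-in,
not the actual UHI PlottableHistogram definition.

from typing import Any, Protocol, Sequence, runtime_checkable

@runtime_checkable
class SupportsPlotting(Protocol):
    """Illustrative stand-in for a PlottableHistogram-style Protocol."""
    def values(self) -> Any: ...          # bin contents
    @property
    def axes(self) -> Sequence[Any]: ...  # axis objects (edges, labels, ...)

def plot(hist: SupportsPlotting) -> None:
    # A consumer (e.g. a plotting library) only needs the Protocol; any object
    # with .values() and .axes type-checks, with no shared base class required.
    print(hist.values(), list(hist.axes))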
    The design for Scikit-HEP as a toolset is of many parts that
all work well together. One example of a package pulling together
many components is uproot-browser, a tool that combines uproot,
Hist, and Python libraries like textual and plotext to provide a
terminal browser for ROOT files.
    Scikit-HEP's external contributions continued to grow. One of
the most notable ones was our work on cibuildwheel. This was
a Python package that supported building redistributable wheels
on multiple CI systems. Unlike our own azure-wheel-helpers or
the competing multibuild package, it was written in Python, so
good practices in Python package design could apply, like unit
and integration tests and static checks, and it was easy to remain
independent of the underlying CI system. Building wheels on
Linux requires a docker image, macOS requires the python.org
Python, and Windows can use any copy of Python; cibuildwheel
uses this to supply Python in all cases, which keeps it from
depending on the CI's support for a particular Python version. We
merged our improvements to cibuildwheel, like better Windows
support, VCS versioning support, and better PEP 518 support.
We dropped azure-wheel-helpers, and eventually a scikit-build
maintainer joined the cibuildwheel project. cibuildwheel would
go on to join the PyPA, and is now in use in over 600 packages,
including numpy, matplotlib, mypy, and scikit-learn.
    Our continued contributions to cibuildwheel included a
TOML-based configuration system for cibuildwheel 2.0, an over-
ride system to make supporting multiple manylinux and musllinux
targets easier, a way to build directly from SDists, an option to use
build instead of pip, the automatic detection of Python version
requirements, and better globbing support for build specifiers. We
also helped improve the code quality in various ways, including
fully statically typing the codebase, applying various checks and
style controls, automating CI processes, and improving support for
special platforms like CPython 3.8 on macOS Apple Silicon.
    We also have helped with build, nox, pyodide, and many other
packages, improving the tooling we depend on to develop scikit-
build and giving back to the community.


The Scikit-HEP Developer Pages

A variety of packaging best practices were coming out of the
boost-histogram work, supporting both ease of installation for
users as well as various static checks and styling to keep the
package easy to maintain and reduce bugs. These techniques
would also be useful to apply to Scikit-HEP's nearly thirty other
packages, but applying them one-by-one was not scalable. The
development and adoption of azure-wheel-helpers included a se-
ries of blog posts that covered the Azure Pipelines platform and
wheel building details. This ended up serving as the inspiration
for a new set of pages on the Scikit-HEP website for developers
interested in making Python packages. Unlike blog posts, these
would be continuously maintained and extended over the years,
serving as a template and guide for updating and adding packages
to Scikit-HEP, and educating new developers.
    These pages grew to describe the best practices for developing
and maintaining a package, covering recommended configuration,
style checking, testing, continuous integration setup, task runners,
and more. Shortly after the introduction of the developer pages,
Scikit-HEP developers started asking for a template to quickly
produce new packages following the guidelines. This was eventu-
ally produced; the "cookiecutter" based template is kept in sync
with the developer pages; any new addition to one is also added
to the other. The developer pages are also kept up to date using a
CI job that bumps any GitHub Actions or pre-commit versions to
the most recent versions weekly. Some portions of the developer
pages have been contributed to packaging.python.org, as well.
    The cookiecutter was developed to be able to support multiple
build backends; the original design was to target both pure Python
and pybind11 based binary builds. This has expanded to include
11 different backends by mid 2022, including Rust extensions,
many PEP 621 based backends, and a Scikit-Build based backend
for pybind11 in addition to the classic Setuptools one. This has
helped work out bugs and influence the design of several PEP
621 packages, including helping with the addition of PEP 621 to
Setuptools.
    The most recent addition to the pages was based on a new
repo-review package which evaluates an existing repository to
see what parts of the guidelines are being followed. This was
helpful for monitoring adoption of the developer pages, especially
newer additions, across the Scikit-HEP packages. This package
was then implemented directly into the Scikit-HEP pages, using
Pyodide to run Python in WebAssembly directly inside a user's
browser. Now anyone visiting the page can enter their repository
and branch, and see the adoption report in a couple of seconds.


Working toward the future

Scikit-HEP is looking toward the future in several different areas.
We have been working with the Pyodide developers to support
WebAssembly; boost-histogram is compiled into Pyodide 0.20,
and Pyodide's support for pybind11 packages is significantly bet-
ter due to that work, including adding support for C++ exception
handling. PyHF's documentation includes a live Pyodide kernel,
and a try-pyhf site (based on the repo-review tool) lets users run
a model without installing anything; it can even be saved as a
webapp on mobile devices.
    We have also been working with Scikit-Build to try to provide
a modern build experience in Python using CMake. This project
is just starting, but we expect over the next year or two that
the usage of CMake as a first class build tool for binaries in
Python will be possible using modern developments and avoiding
distutils/setuptools hacks.


Summary

The Scikit-HEP project started in Autumn 2016 and has grown
to be a core component in many HEP analyses. It has also
provided packages that are growing in usage outside of HEP, like
AwkwardArray, boost-histogram/Hist, and iMinuit. The tooling
developed and improved by Scikit-HEP has helped Scikit-HEP
developers as well as the broader Python community.


REFERENCES

[Dea20]   Hans Dembinski and Piti Ongmongkolkul et al. scikit-
          hep/iminuit. Dec 2020. URL: https://doi.org/10.5281/zenodo.
          3949207, doi:10.5281/zenodo.3949207.
[EB08]    Lyndon Evans and Philip Bryant. LHC machine. Journal of
          Instrumentation, 3(08):S08001, 2008.
[GTW20]   Galli, Massimiliano, Tejedor, Enric, and Wunsch, Stefan. "A new
          PyROOT: Modern, interoperable and more pythonic". EPJ Web
          Conf., 245:06004, 2020. URL: https://doi.org/10.1051/epjconf/
          202024506004, doi:10.1051/epjconf/202024506004.
[IVL22]   Ioana Ifrim, Vassil Vassilev, and David J Lange. GPU Ac-
          celerated Automatic Differentiation With Clad. arXiv preprint
          arXiv:2203.06139, 2022.
[Lam98]   Stephan Lammel. Computing models of CDF and DØ
          in Run II. Computer Physics Communications, 110(1):32–
          37, 1998. URL: https://www.sciencedirect.com/science/article/
          pii/S0010465597001501, doi:10.1016/s0010-4655(97)
          00150-1.
[LCC+09]  Barry M Leiner, Vinton G Cerf, David D Clark, Robert E
          Kahn, Leonard Kleinrock, Daniel C Lynch, Jon Postel, Larry G
          Roberts, and Stephen Wolff. A brief history of the internet.
          ACM SIGCOMM Computer Communication Review, 39(5):22–
          31, 2009.
[LGMM05]  W Lavrijsen, J Generowicz, M Marino, and P Mato. Reflection-
          Based Python-C++ Bindings. 2005. URL: https://cds.cern.ch/
          record/865620, doi:10.5170/CERN-2005-002.441.
[PEL20]   Jim Pivarski, Peter Elmer, and David Lange. Awkward arrays
          in Python, C++, and Numba. In EPJ Web of Conferences,
          volume 245, page 05023. EDP Sciences, 2020. doi:10.1051/
          epjconf/202024505023.
[PJ11]    Andreas J Peters and Lukasz Janyst. Exabyte scale storage at
          CERN. In Journal of Physics: Conference Series, volume 331,
          page 052015. IOP Publishing, 2011. doi:10.1088/1742-
          6596/331/5/052015.

[R+20]     Eduardo Rodrigues et al. The Scikit HEP Project – overview and
            prospects. EPJ Web of Conferences, 245:06028, 2020. arXiv:
            2007.03577, doi:10.1051/epjconf/202024506028.
[RS21]      Olivier Rousselle and Tom Sykora. Fast simulation of Time-
            of-Flight detectors at the LHC. In EPJ Web of Conferences,
            volume 251, page 03027. EDP Sciences, 2021. doi:10.1051/
            epjconf/202125103027.
[RTA+17]   D Remenska, C Tunnell, J Aalbers, S Verhoeven, J Maassen, and
            J Templon. Giving pandas ROOT to chew on: experiences with
            the XENON1T Dark Matter experiment. In Journal of Physics:
            Conference Series, volume 898, page 042003. IOP Publishing,
            2017.
[SLF+15]   Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H
            Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer,
            Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big
            data: astronomical or genomical? PLoS biology, 13(7):e1002195,
            2015.
[Tem22]     Jeffrey Templon. Reflections on the uptake of the Python pro-
            gramming language in Nuclear and High-Energy Physics, March
           2022. URL: https://doi.org/10.5281/zenodo.6353621,
            doi:10.5281/zenodo.6353621.




  Keeping your Jupyter notebook code quality bar high
        (and production ready) with Ploomber
                                                                    Ido Michael‡∗






    This paper walks through this interactive tutorial. It is highly
recommended to run it interactively so it is easier to follow and to
see the results in real time. There is a Binder link in there as well,
so you can launch it instantly.

* Corresponding author: ido@ploomber.io
‡ Ploomber

Copyright © 2022 Ido Michael. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.

Fig. 1: In this pipeline none of the tasks were executed - it's all red.

1. Introduction

Notebooks are an excellent environment for data exploration:
they allow us to write code interactively and get visual feedback,
providing an unbeatable experience for understanding our data.
    However, this convenience comes at a cost; if we are not
careful about adding and removing code cells, we may have an
irreproducible notebook. Arbitrary execution order is a prevalent
problem: a recent analysis found that about 36% of notebooks on
GitHub did not execute in linear order. To ensure our notebooks
run, we must continuously test them to catch these problems.
    A second notable problem is the size of notebooks: the more
cells we have, the more difficult it is to debug since there are more
variables and code involved.
    Software engineers typically break down projects into multiple
steps and test continuously to prevent broken and unmaintainable
code. However, applying these ideas for data analysis requires
extra work; multiple notebooks imply we have to ensure the output
from one stage becomes the input for the next one. Furthermore,
we can no longer press "Run all cells" in Jupyter to test our
analysis from start to finish.
    Ploomber provides all the necessary tools to build multi-
stage, reproducible pipelines in Jupyter that feel like a single
notebook. Users can easily break down their analysis into multiple
notebooks and execute them all with a single command.

2. Refactoring a legacy notebook

If you already have a Python project in a single notebook, you
can use our tool Soorgeon to automatically refactor it into a
Ploomber pipeline. Soorgeon statically analyzes your code, cleans
up unnecessary imports, and makes sure your monolithic notebook
is broken down into smaller components. It does that by scanning
the markdown in the notebook and analyzing the headers; each
H2 header in our example is marking a new self-contained task.
In addition, it can transform a notebook to a single-task pipeline
and then the user can split it into smaller tasks as they see fit.
    To refactor the notebook, we use the soorgeon refactor
command:

soorgeon refactor nb.ipynb

After running the refactor command, we can take a look at the
local directory and see that we now have multiple Python tasks
that are ready for production:

ls playground

We can see that we have a few new files. pipeline.yaml
contains the pipeline declaration, and tasks/ contains the stages
that Soorgeon identified based on our H2 Markdown headings:

ls playground/tasks

One of the best ways to onboard new people and explain what
each workflow is doing is by plotting the pipeline (note that we're
now using ploomber, which is the framework for developing
pipelines):

ploomber plot

This command will generate the plot below for us, which will
allow us to stay up to date with changes that are happening in our
pipeline and get the current status of tasks that were executed or
failed to execute.
    Soorgeon correctly identified the stages in our
original nb.ipynb notebook. It even detected that
the last two tasks (linear-regression and
random-forest-regressor) are independent of each
other!
    We can also get a summary of the pipeline with ploomber
status:

cd playground
ploomber status




Fig. 2: Here we can see the status, runtime, and location of each of our pipeline's tasks.

tasks:
  - source: script.py
    product:
      nb: output/executed.ipynb
      data: output/data.csv

  # more tasks here...

The previous pipeline has a single task (script.py) and generates two outputs: output/executed.ipynb and output/data.csv. You may be wondering why we have a notebook as an output: Ploomber converts scripts to notebooks before execution; hence, our script is considered the source and the notebook a byproduct of the execution. Using scripts as sources (instead of notebooks) makes it simpler to use git. However, this does not mean you have to give up interactive development, since Ploomber integrates with Jupyter, allowing you to edit scripts as notebooks.
    In this case, since we used Soorgeon to refactor an existing notebook, we did not have to write the pipeline.yaml file.
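Although Soorgeon wrote the pipeline for us here, it helps to know what a task script looks like when edited as a notebook. The sketch below illustrates the "percent" format such scripts use, with a cell tagged "parameters" declaring the upstream dependencies and the product; the task name 'load' and the 'data' product key are illustrative assumptions, not the exact contents Soorgeon generated for this project.

# %% tags=["parameters"]
# Ploomber fills these in at runtime: 'upstream' lists the tasks this
# script depends on and is replaced by a mapping of task names to their
# products when the task executes; 'product' comes from pipeline.yaml.
upstream = ['load']
product = None

# %%
import pandas as pd

# Read the input produced by the hypothetical upstream 'load' task
df = pd.read_csv(upstream['load']['data'])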
4. Building the pipeline

Let's build the pipeline (this will take ~30 seconds):

cd playground
ploomber build

We can see which tasks ran during this command, how long they took to execute, and the contribution of each task to the overall pipeline execution runtime.

Fig. 3: Here we can see the build outputs.

Navigate to playground/output/ and you'll see all the outputs: the executed notebooks, data files, and trained model.

ls playground/output

Fig. 4: These are the post-build artifacts.

In this figure, we can see all of the data that was collected during the pipeline, any artifacts that might be useful to the user, and some of the execution history that is saved in the notebook's context.

5. Testing and quality checks

Open tasks/train-test-split.py as a notebook by right-clicking on it and then Open With -> Notebook, and add the following code after the cell with # noqa:

# Sample data quality checks after loading the raw data
# Check nulls
assert not df['HouseAge'].isnull().values.any()

# Check a specific range - no outliers
assert df['HouseAge'].between(0, 100).all()

# Exact expected row count
assert len(df) == 11085

We'll do the same for tasks/linear-regression.py; open the file and add the tests:

# Sample tests after the notebook ran
# Check task test input exists
assert Path(upstream['train-test-split']['X_test']).exists()

# Check task train input exists
assert Path(upstream['train-test-split']['y_train']).exists()

# Validating output type
assert 'pkl' in upstream['train-test-split']['X_test']

Adding these snippets allows us to validate that the data we're looking for exists and has the quality we expect. For instance, in the first block we check that there are no missing values and that the houses in our sample are at most 100 years old.




    In the second snippet, we check that the train and test inputs exist, which is crucial for training the model.

6. Maintaining the pipeline

Let's look again at our pipeline plot:

Image('playground/pipeline.png')

The arrows in the diagram represent input/output dependencies and depict the execution order. For example, the first task (load) loads some data, then clean uses that data as input and processes it, then train-test-split splits our dataset into training and test sets. Finally, we use those datasets to train a linear regression and a random forest regressor.
    Soorgeon extracted and declared these dependencies for us, but if we want to modify the existing pipeline, we need to declare such dependencies ourselves. Let's see how.
    We can also see that the pipeline is green, meaning all of its tasks have been executed recently.

7. Adding a new task

Let's say we want to train another model and decide to try a Gradient Boosting Regressor. First, we modify the pipeline.yaml file and add a new task.
    Open playground/pipeline.yaml and add the following lines at the end:

- source: tasks/gradient-boosting-regressor.py
  product:
    nb: output/gradient-boosting-regressor.ipynb

Now, let's create a base file by executing ploomber scaffold:

cd playground
ploomber scaffold

This is the output of the command:

Found spec at 'pipeline.yaml'
Adding /Users/ido/ploomber-workshop/playground/tasks/gradient-boosting-regressor.py...
Created 1 new task sources.

We can see it created the task source for our new task; we just have to fill it in now.
    Let's see how the plot looks now:

cd playground
ploomber plot

Fig. 5: Now we see an independent new task.

You can see that Ploomber recognizes the new file, but it does not have any dependency, so let's tell Ploomber that it should execute after train-test-split.
    Open playground/tasks/gradient-boosting-regressor.py as a notebook by right-clicking on it and then Open With -> Notebook:

Fig. 6: lab-open-with-notebook.

At the top of the notebook, you'll see the following:

upstream = None

This special variable indicates which tasks should execute before the notebook we're currently working on. In this case, we want to get training data so we can train our new model, so we change the upstream variable:

upstream = ['train-test-split']

Let's generate the plot again:

cd playground
ploomber plot

Ploomber now recognizes our dependency declaration!

Fig. 7: The new task is attached to the pipeline.

    Open playground/tasks/gradient-boosting-regressor.py as a notebook by right-clicking on it and then Open With -> Notebook and add the following code:

from pathlib import Path
import pickle

import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor

y_train = pickle.loads(Path(
    upstream['train-test-split']['y_train']).read_bytes())
y_test = pickle.loads(Path(
    upstream['train-test-split']['y_test']).read_bytes())
X_test = pickle.loads(Path(
    upstream['train-test-split']['X_test']).read_bytes())
X_train = pickle.loads(Path(
    upstream['train-test-split']['X_train']).read_bytes())

gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
sns.scatterplot(x=y_test, y=y_pred)

8. Incremental builds

Data workflows require a lot of iteration. For example, you may want to generate a new feature or model. However, it's wasteful to re-execute every task with every minor change. Therefore, one of Ploomber's core features is incremental builds, which automatically skip tasks whose source code hasn't changed.
    Run the pipeline again:

cd playground
ploomber build

You can see that only the gradient-boosting-regressor task ran!
    Incremental builds allow us to iterate faster without keeping track of task changes.
    Check out playground/output/gradient-boosting-regressor.ipynb, which contains the output notebook with the model evaluation plot.

9. Parallel execution and Ploomber cloud execution

This section can run locally or on the cloud. To set up the cloud, we'll need to register for an API key.
    Ploomber cloud allows you to scale your experiments into the cloud without provisioning machines and without dealing with infrastructure.
    Open playground/pipeline.yaml and add the following code instead of the source task:

- source: tasks/random-forest-regressor.py

This is how your task should look in the end:

- source: tasks/random-forest-regressor.py
  name: random-forest-
  product:
    nb: output/random-forest-regressor.ipynb
  grid:
    # creates 4 tasks (2 * 2)
    n_estimators: [5, 10]
    criterion: [gini, entropy]

In addition, we'll need to add a flag to tell the pipeline to execute independent tasks in parallel. Open playground/pipeline.yaml and add the following lines above the tasks section (line 1):

# Execute independent tasks in parallel
executor: parallel

ploomber plot

Fig. 8: We can see this pipeline has multiple new tasks.

ploomber build

10. Execution in the cloud

When working with datasets that fit in memory, running your pipeline is simple enough, but sometimes you may need more computing power for your analysis. Ploomber makes it simple to execute your code in a distributed environment without code changes.
    Check out Soopervisor, the package that implements exporting Ploomber projects to the cloud with support for:

   •   Kubernetes (Argo Workflows)
   •   AWS Batch
   •   Airflow

11. Resources

Thanks for taking the time to go through this tutorial! We hope you consider using Ploomber for your next project. If you have any questions or need help, please reach out to us! (contact info below).
    Here are a few resources to dig deeper:

   •   GitHub
   •   Documentation
   •   Code examples
   •   JupyterCon 2020 talk
   •   Argo Community Meeting talk
   •   Pangeo Showcase talk (AWS Batch demo)
   •   Jupyter project

12. Contact

   •   Twitter
   •   Join us on Slack
   •   E-mail us




   Likeness: a toolkit for connecting the social fabric of
                place to human dynamics
                                                     Joseph V. Tuccillo‡∗ , James D. Gaboardi‡






Abstract—The ability to produce richly-attributed synthetic populations is key for understanding human dynamics, responding to emergencies, and preparing for future events, all while protecting individual privacy. The Likeness toolkit accomplishes these goals with a suite of Python packages: pymedm/pymedm_legacy, livelike, and actlike. This production process is initialized in pymedm (or pymedm_legacy), which utilizes census microdata records as the foundation on which disaggregated spatial allocation matrices are built. The next step, performed by livelike, is the generation of a fully autonomous agent population attributed with hundreds of demographic census variables. The agent population synthesized in livelike is then attributed with residential coordinates in actlike based on block assignment and, finally, allocated to an optimal daytime activity location via the street network. We present a case study in Knox County, Tennessee, synthesizing 30 populations of public K–12 school students & teachers and allocating them to schools. Validation shows our results are highly promising, replicating reported school enrollment and teacher capacity with a high degree of fidelity.

Index Terms—activity spaces, agent-based modeling, human dynamics, population synthesis

Introduction

Human security fundamentally involves the functional capacity that individuals possess to withstand adverse circumstances, mediated by the social and physical environments in which they live [Hew97]. Attention to human dynamics is a key piece of the human security puzzle, as it reveals spatial policy interventions most appropriate to the ways in which people within a community behave and interact in daily life. For example, "one size fits all" solutions do not exist for mitigating disease spread, promoting physical activity, or enabling access to healthy food sources. Rather, understanding these outcomes requires examination of processes like residential sorting, mobility, and social transmission.

* Corresponding author: tuccillojv@ornl.gov
‡ Oak Ridge National Laboratory

Copyright © 2022 Oak Ridge National Laboratory. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

    Modeling these processes at scale and with respect to individual privacy is most commonly achieved through agent-based simulations on synthetic populations [SEM14]. Synthetic populations consist of individual agents that, when viewed in aggregate, closely recreate the makeup of an area's observed population [HHSB12], [TMKD17]. Modeling human dynamics with synthetic populations is common across research areas including spatial epidemiology [DKA+08], [BBE+08], [HNB+11], [NCA13], [RSF+21], [SNGJ+09], public health [BCD+06], [BFH+17], [SPH11], [TCR08], [MCB+08], and transportation [BBM96], [ZFJ14]. However, a persistent limitation across these applications is that synthetic populations often do not capture a wide enough range of individual characteristics to assess how human dynamics are linked to human security problems (e.g., how a person's age, limited transportation access, and linguistic isolation may interact with their housing situation in a flood evacuation emergency).
    In this paper, we introduce Likeness [TG22], a Python toolkit for connecting the social fabric of place to human dynamics via models that support increased spatial, temporal, and demographic fidelity. Likeness is an extension of the UrbanPop framework developed at Oak Ridge National Laboratory (ORNL) that embraces a new paradigm of "vivid" synthetic populations [TM21], [Tuc21], in which individual agents may be attributed in potentially hundreds of ways, across subjects spanning demographics, socioeconomic status, housing, and health. Vivid synthetic populations benefit human dynamics research both by enabling more precise geolocation of population segments and by providing a deeper understanding of how individual and neighborhood characteristics are coupled. UrbanPop's early development was motivated by linking models of residential sorting and worker commute behaviors [MNP+17], [MPN+17], [ANM+18]. Likeness expands upon the UrbanPop approach by providing a novel integrated model that pairs vivid residential synthetic populations with an activity simulation model on real-world transportation networks, with travel destinations based on points of interest (POIs) curated from location services and federal critical facilities data.
    We first provide an overview of Likeness' capabilities, then provide a more detailed walkthrough of its central workflow with respect to livelike, a package for population synthesis and residential characterization, and actlike, a package for activity allocation. We provide preliminary usage examples for Likeness based on 1) social contact networks in POIs and 2) 24-hour POI occupancy characteristics. Finally, we discuss existing limitations and the outlook for future development.

Overview of Core Capabilities and Workflow

UrbanPop initially combined the vivid synthetic populations produced from the American Community Survey (ACS) using the Penalized-Maximum Entropy Dasymetric Modeling (P-MEDM) method, which is detailed later, with a commute model based on origin-destination flows, to generate a detailed dataset of daytime and nighttime synthetic populations across the United States [MPN+17]. Our development of Likeness is motivated by extending the existing capabilities of UrbanPop to routing libraries available in Python like osmnx [1] and pandana [2] [Boe17], [FW12]. In doing so, we are able to simulate travel to regular daytime activities (work and school) based on real-world transportation networks. Likeness continues to use the P-MEDM approach, but is fully integrated with the U.S. Census Bureau's ACS Summary File (SF) and Census Microdata APIs, enabling the production of activity models on-the-fly.
    Likeness features three core capabilities supporting activity simulation with vivid synthetic populations (Figure 1). The first, spatial allocation, is provided by the pymedm and pmedm_legacy packages and uses Iterative Proportional Fitting (IPF) to downscale census microdata records to small neighborhood areas, providing a basis for population synthesis. Baseline residential synthetic populations are then created and stratified into agent segments (e.g., grade 10 students, hospitality workers) using the livelike package. Finally, the actlike package models travel across agent segments of interest to POIs outside places of residence at varying times of day.

Spatial Allocation: the pymedm & pmedm_legacy packages

Synthetic populations are typically generated from census microdata, which consists of a sample of publicly available long-form responses to official statistical surveys. To preserve respondent confidentiality, census microdata is often published at spatial scales the size of a city or larger. Spatial allocation with IPF provides a maximum-likelihood estimator for microdata responses in small (e.g., neighborhood) areas based on aggregate data published about those areas (known as "constraints"), resulting in a baseline for population synthesis [WCC+09], [BBM96], [TMKD17]. UrbanPop is built upon a regularized implementation of IPF, the P-MEDM method, that permits many more input census variables than traditional approaches [LNB13], [NBLS14]. The P-MEDM objective function (Eq. 1) is written as:

    max  − ∑_{it} (n w_{it} / N d_{it}) log(w_{it} / d_{it}) − ∑_k e_k² / (2 σ_k²)        (1)

where w_{it} is the estimate of variable i in zone t, d_{it} is the synthetic estimate of variable i in location t, n is the number of microdata responses, and N is the total population size. Uncertainty in the variable estimates is handled by adding an error term to the allocation, ∑_k e_k² / (2 σ_k²), where e_k is the error between the synthetic and published estimate of ACS variable k and σ_k is the ACS standard error for the estimate of variable k. This is accomplished by leveraging the uncertainty in the input variables: the "tighter" the margins of error on the estimate of variable k in place t, the more leverage it holds upon the solution [NBLS14].
    The P-MEDM procedure outputs an allocation matrix that estimates the probability of individuals matching responses from the ACS Public-Use Microdata Sample (PUMS) at the scale of census block groups (typically 300–6000 people) or tracts (1200–8000 people), depending upon the use case.
    Downscaling the PUMS from the Public-Use Microdata Area (PUMA) level at which it is offered (100,000 or more people) to these neighborhood scales then enables us to produce synthetic populations (the livelike package) and simulate their travel to POIs (the actlike package) in an integrated model. This approach provides a new means of modeling population mobility and activity spaces with respect to real-world transportation networks and POIs, in turn enabling investigation of social processes from the atomic (e.g., person) level in human systems.
    Likeness offers two implementations of P-MEDM. The first, the pymedm package, is written natively in Python based on scipy.optimize.minimize, and while fully operational remains in development and is currently suitable for one-off simulations. The second, the pmedm_legacy package, uses rpy2 as a bridge to [NBLS14]'s original implementation of P-MEDM [3] in R/C++ and is currently more stable and scalable. We offer conda environments specific to each package, based on user preferences.
    Each package's functionality centers around a PMEDM class, which contains the information required to solve the P-MEDM problem:

   •   The individual (household) level constraints based on ACS PUMS. To preserve households from the PUMS in the synthetic population, the person-level constraints describing household members are aggregated to the household level and merged with household-level constraints.
   •   PUMS household sample weights.
   •   The target (e.g., block group) and aggregate (e.g., tract) zone constraints based on population-level estimates available in the ACS SF.
   •   The target/aggregate zone 90% margins of error and associated standard errors (SE = MOE / 1.645).

    The PMEDM classes feature a solve() method that returns an optimized P-MEDM solution and allocation matrix. Through a diagnostics module, users may then evaluate a P-MEDM solution based on the proportion of published 90% MOEs from the summary-level ACS data preserved at the target (allocation) scale.

  1. https://github.com/gboeing/osmnx
  2. https://github.com/UDST/pandana
  3. https://bitbucket.org/nnnagle/pmedmrcpp




                                                 Fig. 1: Core capabilities and workflow of Likeness.


    The primary livelike class is acs.puma, which stores the information about a single PUMA necessary for spatial allocation of the PUMS data to block groups/tracts with P-MEDM. The process of creating an acs.puma is integrated with the U.S. Census Bureau's ACS SF and Census Microdata 5-Year Estimates (5YE) APIs [4]. This enables generation of an acs.puma class with a high-level call involving just a few parameters: 1) the PUMA's Federal Information Processing Standard (FIPS) code, 2) the constraints file, loaded as a pandas.DataFrame, and 3) the target ACS vintage (year). An example call to build an acs.puma for the Knoxville City, TN PUMA (FIPS 4701603) using the ACS 2015–2019 5-Year Estimates is:

acs.puma(
    fips="4701603",
    constraints=constraints,
    year=2019
)

The censusdata package [5] is used internally to fetch population-level (SF) constraints, standard errors, and MOEs from the ACS 5YE API, while the acs.extract_pums_constraints function is used to fetch individual-level constraints and weights from the Census Microdata 5YE API.
    Spatial allocation is then carried out by passing the acs.puma attributes to a pymedm.PMEDM or pmedm_legacy.PMEDM (depending on user preference).

  4. https://www.census.gov/data/developers/data-sets.html
  5. https://pypi.org/project/CensusData
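To make that hand-off concrete, the sketch below shows roughly how these pieces could be wired together. It is illustrative only: the PMEDM constructor arguments are not spelled out in this paper, so the call shown here (passing the acs.puma object directly) is an assumption, as are the import path for acs and the constraints file name.

import pandas as pd
from livelike import acs    # assumed import path for the acs module
import pymedm

# User-specified constraints file (PUMS/SF variable bridge) as a DataFrame
constraints = pd.read_csv("constraints.csv")   # hypothetical file name

# Describe the PUMA of interest (Knoxville City, TN)
knox = acs.puma(
    fips="4701603",
    constraints=constraints,
    year=2019,
)

# Hand the puma's constraints, weights, and standard errors to the solver;
# the exact argument list of pymedm.PMEDM is an assumption here.
pmd = pymedm.PMEDM(knox)
solution = pmd.solve()    # optimized P-MEDM solution and allocation matrix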
Population Synthesis

The homesim module provides support for population synthesis on the spatial allocation matrix within a solved P-MEDM object. The population synthesis procedure involves converting the fractional estimates from the allocation matrix (n household IDs by m zones) to an integer representation such that whole people/households are preserved. The homesim module features an implementation of [LB13]'s "Truncate, Replicate, Sample" (TRS) method (a rough sketch of the idea follows the list below). TRS works by separating each cell of the allocation matrix into whole-number (integer) and fractional components, then incrementing the whole-number estimates by a random sample of unit weights performed with sampling probabilities based on the fractional component. Because TRS is stochastic, the homesim.hsim() function generates multiple (default 30) realizations of the residential population. The results are provided as a pandas.DataFrame in long format, attributed by:

   •   PUMS Household ID (h_id)
   •   Simulation number (sim)
   •   Target zone FIPS code (geoid)
   •   Household count (count)
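The following is a minimal NumPy sketch of the TRS integerization idea described above; it is not the livelike implementation, and the function name and the households-by-zones array layout are assumptions for illustration.

import numpy as np

def trs(allocation, rng=None):
    """Truncate, Replicate, Sample a fractional households-by-zones matrix."""
    rng = np.random.default_rng(rng)
    whole = np.floor(allocation).astype(int)     # truncate: keep integer part
    frac = allocation - whole                    # fractional remainders
    out = whole.copy()                           # replicate the integer part
    # sample: top up each zone, using the remainders as sampling probabilities
    extra = np.rint(frac.sum(axis=0)).astype(int)
    for j, k in enumerate(extra):
        total = frac[:, j].sum()
        if k == 0 or total == 0:
            continue
        picks = rng.choice(len(allocation), size=k, p=frac[:, j] / total)
        np.add.at(out[:, j], picks, 1)
    return out

# One stochastic realization; repeating this (e.g., 30 times) mirrors the
# multiple simulations produced by homesim.hsim().
realization = trs(np.array([[0.2, 1.7], [1.9, 0.4]]), rng=0)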
    Since household and person-level attributes are combined when creating the acs.puma class, person-level records from the PUMS are assumed to be joined to the synthesized household IDs many-to-one. For example, if two people, A01 and A03, in household A have some attribute of interest, and there are 3 households of type A in zone G, then we estimate that a total of 6 people with that attribute from household A reside in zone G.

Agent Generation

The synthetic populations can then be segmented into different groups of agents (e.g., workers by industry, students by grade) for activity modeling with the actlike package. Agent segments may be identified in several ways:

   •   Using acs.extract_pums_segment_ids() to fetch the person IDs (household serial number + person line number) from the Census Microdata API matching some criteria of interest (e.g., public school students in 10th grade).
   •   Using acs.extract_pums_descriptors() to fetch criteria that may be queried from the Census Microdata API. This is useful when dealing with criteria

       more specific than can be directly controlled for in the P-MEDM problem (e.g., detailed NAICS code of worker, exact number of hours worked).

    The function est.tabulate_by_serial() is then used to tabulate agents by target zone and simulation by appending them to the synthetic population based on household ID, then aggregating the person-level counts. This routine is flexible in that a user can use any set of criteria available from the PUMS to define customized agents for mobility modeling purposes.

Other Capabilities

    Population Statistics: In addition to agent creation, the livelike.est module also supports the creation of population statistics. These can be used to estimate the compositional characteristics of small neighborhood areas and POIs, for example to simulate social contact networks (see Students). To accomplish this, the results of est.tabulate_by_serial (see Agent Generation) are converted to proportional estimates for POIs (est.to_prop()), then averaged across simulations to produce Monte Carlo estimates and errors (est.monte_carlo_estimate()).
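Conceptually, these Monte Carlo estimates are just means and errors taken across the simulated populations. The pandas sketch below illustrates the idea only; the column names and values are hypothetical, and the real est.to_prop()/est.monte_carlo_estimate() functions operate on livelike's own tabulations.

import pandas as pd

# Hypothetical long-format tabulation: one row per (sim, geoid, profile)
counts = pd.DataFrame({
    "sim":     [1, 1, 2, 2],
    "geoid":   ["470930001001"] * 4,
    "profile": ["A", "B", "A", "B"],
    "count":   [30, 10, 28, 14],
})

# Proportion of each profile within its zone, per simulation
totals = counts.groupby(["sim", "geoid"])["count"].transform("sum")
counts["prop"] = counts["count"] / totals

# Monte Carlo estimate and error: mean and spread across simulations
summary = counts.groupby(["geoid", "profile"])["prop"].agg(["mean", "std"])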
    Multiple ACS Vintages and PUMAs: The multi module extends the capabilities of livelike to multiple ACS 5YE vintages (dating back to 2016), as well as to multiple PUMAs (e.g., a metropolitan area). Using multi.make_pumas() or multi.make_multiyear_pumas(), multiple PUMAs or multiple years may be stored in a dict that enables iterative runs for spatial allocation (multi.make_pmedm_problems()), population synthesis (multi.homesim()), and agent creation (multi.extract_pums_segment_ids(), multi.extract_pums_segment_ids_multiyear(), multi.extract_pums_descriptors(), and multi.extract_pums_descriptors_multiyear()). This functionality is currently available for pmedm_legacy only.

Activity Allocation: the actlike package

The actlike package [GT22] allocates agents from the synthetic populations generated by livelike to POIs, like schools and workplaces, based on optimal allocation over transportation networks derived from osmnx and pandana [Boe17], [FW12]. Solutions are the product of a modified integer program (Transportation Problem [Hit41], [Koo49], [MS01], [MS15]) modeled in pulp or mip [MOD11], [ST20], whereby supply (students/workers) is "shipped" to demand locations (schools/workplaces), with potentially relaxed minimum and maximum capacity constraints at demand locations. Impedance from nighttime to daytime locations (Origin-Destination [OD] pairs) can be modeled by either network distance or network travel time.

Location Synthesis

Following the generation of synthetic households for the study universe, locations for all households across the 30 default simulations must be created. In order to intelligently site pseudo-neighborhood clusters of random points, we adopt a dasymetric [QC13] approach, which we term intelligent block-based (IBB) allocation, whereby household locations are only placed within blocks known to have been populated at a particular period in time and are placed with a greater frequency proportional to reported household density [LB13]. We employ population and housing counts within 2010 Decennial Census blocks to formulate a modified Variable Size Bin Packing Problem [FL86], [CGSdG08] for each populated block group, which allows for an optimal placement of household points and is accomplished by the actlike.block_denisty_allocation() function that creates and solves an actlike.block_allocation.BinPack instance.

Activity Allocation

Once household location attribution is complete, individual agents must be allocated from households (nighttime locations) to probable activity spaces (daytime locations). This is achieved through spatial network modeling over the streets within a study area via OpenStreetMap [6], utilizing osmnx for network extraction & preprocessing and pandana for shortest path and route calculations. The underlying impedance metric for shortest path calculation, handled in actlike.calc_cost_mtx() and associated internal functions, can take the form of either distance or travel time. Moreover, household and activity locations must be connected to nearby network edges for realistic representations within network space [GFH20].
    With a cost matrix from all residences to daytime locations calculated, the simulated population can then be "sent" to the likely activity spaces by utilizing an instance of actlike.ActivityAllocation to generate an adapted Transportation Problem. This mixed integer program, solved using the solve() method, optimally associates all population within an activity space with the objective of minimizing the total cost of impedance (Eq. 2), subject to potentially relaxed minimum and maximum capacity constraints (Eqs. 4 & 5). Each decision variable (x_ij) represents a potential allocation from origin i to destination j that must be an integer greater than or equal to zero (Eqs. 6 & 7). The problem is formulated as follows:

    min  ∑_{i∈I} ∑_{j∈J} c_ij x_ij                        (2)

    s.t.  ∑_{j∈J} x_ij = O_i        ∀i ∈ I;               (3)

    s.t.  ∑_{i∈I} x_ij ≥ minD_j     ∀j ∈ J;               (4)

    s.t.  ∑_{i∈I} x_ij ≤ maxD_j     ∀j ∈ J;               (5)

    s.t.  x_ij ≥ 0                  ∀i ∈ I  ∀j ∈ J;       (6)

    s.t.  x_ij ∈ Z                  ∀i ∈ I  ∀j ∈ J.       (7)

where

    i ∈ I   = each household in the set of origins
    j ∈ J   = each school in the set of destinations
    x_ij    = allocation decision from i ∈ I to j ∈ J
    c_ij    = cost between all i, j pairs
    O_i     = population in origin i for i ∈ I
    minD_j  = minimum capacity at j for j ∈ J
    maxD_j  = maximum capacity at j for j ∈ J

  6. https://www.openstreetmap.org/about

The key to this adapted formulation of the classic Transportation Problem is the use of minimum and maximum capacity thresholds that are generated endogenously within actlike.ActivityAllocation and are tuned to reflect the uncertainty of both the population estimates generated by livelike and the reported (or predicted) capacities at activity locations. Moreover, network impedance from origins to destinations (c_ij) can be randomly reduced through an internal process by passing an integer value to the reduce_seed keyword argument. By triggering this functionality, the count and magnitude of the reduction are determined algorithmically. A random reduction of this nature is beneficial in generating dispersed solutions that do not resemble compact clusters, an example being the replication of a private school's student body, which does not adhere to public school attendance zones.
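As a point of reference, the capacity-bounded formulation in Eqs. 2-7 can be written down directly with pulp, one of the solvers cited above. The sketch below is illustrative only; it is not the actlike.ActivityAllocation implementation, and the tiny input instance is made up.

import pulp

def allocate(cost, origin_pop, min_cap, max_cap):
    """Solve the capacity-bounded transportation problem of Eqs. 2-7."""
    I, J = range(len(origin_pop)), range(len(min_cap))
    prob = pulp.LpProblem("activity_allocation", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (I, J), lowBound=0, cat="Integer")  # Eqs. 6-7
    prob += pulp.lpSum(cost[i][j] * x[i][j] for i in I for j in J)     # Eq. 2
    for i in I:                                                        # Eq. 3
        prob += pulp.lpSum(x[i][j] for j in J) == origin_pop[i]
    for j in J:                                                        # Eqs. 4-5
        prob += pulp.lpSum(x[i][j] for i in I) >= min_cap[j]
        prob += pulp.lpSum(x[i][j] for i in I) <= max_cap[j]
    prob.solve()
    return {(i, j): int(x[i][j].value()) for i in I for j in J if x[i][j].value()}

# Tiny illustrative instance: 3 origin households, 2 schools
print(allocate(cost=[[1, 4], [2, 3], [5, 1]],
               origin_pop=[2, 1, 3], min_cap=[1, 1], max_cap=[4, 4]))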
    After the optimal solution is found for an actlike.ActivityAllocation instance, the selected decisions are isolated from the non-zero decision variables with the realized_allocations() method. These allocations are then used to generate solution routes with the network_routes() function, which represent the shortest paths along the network traversed from residential locations to assigned activity spaces. Solutions can be further validated with Canonical Correlation Analysis, in instances where the agent segments are stratified, and with simple linear regression where a single segment of agents is used. Validation is discussed further in Validation & Diagnostics.

Case Study: K–12 Public Schools in Knox County, TN

To illustrate Likeness' capability to simulate POI travel among specific population segments, we provide a case study of travel to POIs, in this case K–12 schools, in Knox County, TN. Our choice of K–12 schools was motivated by several factors. First, they serve as common destinations for the two major groups, workers and students, expected to consistently travel on a typical business day [RWM+17]. Second, a complete inventory of public school locations, as well as faculty and enrollment sizes, is available publicly through federal open data sources. In this case, we obtained school locations and faculty sizes from the Homeland Infrastructure Foundation-Level Database (HIFLD) [7] and student enrollment sizes by grade from the National Center for Education Statistics (NCES) Common Core of Data [8].
    We chose the Knox County School District, which coincides with Knox County's boundaries, as our study area. We used the livelike package to create 30 synthetic populations for the Knoxville Core-Based Statistical Area (CBSA), then for each simulation we:

   •   Isolated agent segments from the synthetic population. K–12 educators consist of full-time workers employed as primary and secondary education teachers (2018 Standard Occupation Classification System codes 2300–2320) in elementary and secondary schools (NAICS 6111). We separated out student agents by public schools and by grade level (Kindergarten through Grade 12).
   •   Performed IBB allocation to simulate the household locations of workers and students. Our selection of household locations for workers and students varied geographically. Because school attendance in Knox County is restricted by district boundaries, we only placed student households in the PUMAs intersecting with the district (FIPS 4701601, 4701602, 4701603, 4701604). However, because educators may live outside school district boundaries, we simulated their household locations throughout the Knoxville CBSA.
   •   Used actlike to perform optimal allocation of workers and students over road networks in Knox County/Knoxville CBSA. Across the 30 simulations and 14 segments identified, we produced a total of 420 travel simulations. Network impedance was measured in geographic distance for all student simulations and travel time for all educator simulations.

    Figure 2 demonstrates the optimal allocations, routing, and network space for a single simulation of 10th grade public school students in Knox County, TN. Students, shown in households as small black dots, are associated with schools, represented by transparent colored circles sized according to reported enrollment. The network space connecting student residential locations to assigned schools is displayed in a matching color. Further, the inset in Figure 2 provides the pseudo-school attendance zone for 10th graders at one school in central Knoxville and demonstrates the adherence to network space.

Students

Our study of K–12 students examines social contact networks with respect to potentially underserved student populations via the compositional characteristics of POIs (schools).
    We characterized each school's student body by identifying student profiles based on several criteria: minority race/ethnicity, poverty status, single caregiver households, and unemployed caregiver households (householder and/or spouse/partner). We defined 6 student profiles using an implementation of the density-based K-Modes clustering algorithm [CLB09] with a distance heuristic designed to optimize cluster separation [NLHH07], available through the kmodes package [9] [dV21]. Student profile labels were appended to the student travel simulation results, then used to produce Monte Carlo proportional estimates of profiles by school.
    The results in Figure 3 reveal strong dissimilarities in student makeup between schools on the periphery of Knox County and those nearer to Knoxville's downtown core in the center of the county. We estimate that the former are largely composed of students in married families, above poverty, and with employed caregivers, whereas the latter are characterized more strongly by single caregiver living arrangements and, particularly in areas north of the downtown core, economic distress (pop-out map).

Workers (Educators)

We evaluated the results of our K–12 educator simulations with respect to POI occupancy characteristics, as informed by commute and work statistics obtained from the PUMS. Specifically, we used the work arrival time associated with each synthetic worker (PUMS JWAP) to timestamp the start of each work day, and incremented this by daily hours worked (derived from PUMS WKHP) to create a second timestamp for work departure. The estimated departure time assumes that each educator travels to the school for a typical 5-day workweek, and is estimated as JWAP + WKHP / 5.
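In code, that departure estimate is a one-line timestamp calculation; the values below are illustrative, not drawn from the PUMS records used in the study.

from datetime import datetime, timedelta

# Illustrative synthetic educator: JWAP arrival time and WKHP weekly hours
arrival = datetime(2019, 9, 3, 7, 25)     # 7:25 AM work arrival (JWAP)
wkhp = 40                                 # usual hours worked per week (WKHP)

# Departure estimate assuming a 5-day workweek: JWAP + WKHP / 5
departure = arrival + timedelta(hours=wkhp / 5)
print(departure.strftime("%H:%M"))        # 15:25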
  7. https://hifld-geoplatform.opendata.arcgis.com
  8. https://nces.ed.gov/ccd/files.asp
  9. https://pypi.org/project/kmodes




                 Fig. 2: Optimal allocations for one simulation of 10th grade public school students in Knox County, TN.




Fig. 3: Compositional characteristics of K–12 public schools in Knox County, TN based on 6 student profiles. Glyph plot methodology adapted from [GLC+15].




                            Fig. 4: Hourly worker occupancy estimates for K–12 schools in Knox County, TN.


    Roughly 50 educator agents per simulation were not attributed with work
arrival times, possibly due to the source PUMS respondents being away from
their typical workplaces (e.g., on summer or winter break) but still working
virtually when they were surveyed. We filled in these unknown arrival times
with the modal arrival time observed across all simulations (7:25 AM).
    Figure 4 displays the hourly proportion of educators present at each school
in Knox County between 7:00 AM (t700) and 6:00 PM (t1800). Morning worker
arrivals occur more rapidly than afternoon departures. Between the hours of
7:00 AM and 9:00 AM (t700–t900), schools transition from nearly empty of
workers to being close to capacity. In the afternoon, workers begin to
gradually depart at 3:00 PM (t1500), with somewhere between 50%–70% of workers
still present by 4:00 PM (t1600); workers then depart in earnest from 5:00 PM
into 6:00 PM (t1700–t1800), by which time most have returned home.
    Geographic differences are also visible and may be a function of (1) a
higher concentration of a particular school type (e.g., elementary, middle,
high) in a given area and (2) staggered starts between these types (to
accommodate bus schedules, etc.), with elementary schools in particular
starting much earlier than middle and high schools10 . For example, schools
near the center of Knox County reach worker capacity more quickly in the
morning, starting around 8:00 AM (t800), but also empty out more rapidly than
schools in surrounding areas, beginning around 4:00 PM (t1600).

Validation & Diagnostics

A determination of modeling output robustness was needed to validate our
results. Specifically, we aimed to ensure the preservation of relative facility
size and composition. To perform this validation, we tested the optimal
allocations generated by Likeness against the maximally adjusted reported
enrollment & faculty employment counts. We used the maximum adjusted value to
account for scenarios where the population synthesis phase resulted in a total
demographic segment greater than the reported total facility capacity. We
employed Canonical Correlation Analysis (CCA) [Kna78] for the K–12 public
school student allocations due to their stratified nature, and an ordinary
least squares (OLS) simple linear regression for the educator allocations
[PVG+ 11]. Because CCA is a multivariate measure, it is only a suitable
diagnostic for activity allocation when multiple segments (e.g., students by
grade) are of interest. For educators, which we treated as a single agent
segment without stratification, we used OLS regression instead. The CCA for
students was performed in two components: Between-Destination, which measures
capacity across facilities, and Within-Destination, which measures capacity
across strata.
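The following sketch outlines how diagnostics of this kind can be computed with
scikit-learn. It is illustrative only: it uses randomly generated stand-in
counts rather than Likeness output, and it omits the Between-/Within-Destination
decomposition of the CCA.

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in data: rows are schools, columns are grade strata.
reported_students = rng.integers(50, 400, size=(90, 4)).astype(float)
allocated_students = reported_students + rng.normal(scale=5.0, size=(90, 4))

# Students (stratified): canonical correlation between allocated and reported.
cca = CCA(n_components=1).fit(allocated_students, reported_students)
u_sc, v_sc = cca.transform(allocated_students, reported_students)
students_r2 = np.corrcoef(u_sc[:, 0], v_sc[:, 0])[0, 1] ** 2

# Educators (single unstratified segment): ordinary least squares R2.
reported_edu = rng.integers(10, 80, size=(120, 1)).astype(float)
allocated_edu = reported_edu + rng.normal(scale=1.0, size=(120, 1))
educators_r2 = LinearRegression().fit(reported_edu, allocated_edu).score(
    reported_edu, allocated_edu)

print(f"students R2 ~ {students_r2:.4f}, educators R2 ~ {educators_r2:.4f}")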
    Descriptive Monte Carlo statistics from the 30 simulations were run on the
resultant coefficients of determination (R2), which show goodness of fit
(approaching 1). As seen in Table 1, all models performed exceedingly well,
though the Within-Destination CCA performed slightly less well than both the
Between-Destination CCA and the OLS linear regression. In fact, the global
minimum of all R2 scores approaches 0.99 (students – Within-Destination), which
demonstrates robust preservation of true capacities in our synthetic activity
modeling.

  10. https://www.knoxschools.org/Page/5553

  K–12                                    R2 Type                    Min      Median   Mean     Max
  Students (public schools)               Between-Destination CCA    0.9967   0.9974   0.9973   0.9976
                                          Within-Destination CCA     0.9883   0.9894   0.9896   0.9910
  Educators (public & private schools)    OLS Linear Regression      0.9977   0.9983   0.9983   0.9991


      TABLE 1: Validating optimal allocations considering reported enrollment at public schools & faculty employment at all schools.


Furthermore, a global maximum of greater than 0.999 is seen for educators,
which indicates a near perfect replication of relative faculty sizes by school.

Discussion

Our Case Study demonstrates the twofold benefits of modeling human dynamics
with vivid synthetic populations. Using Likeness, we are able both to produce a
more reasoned estimate of the neighborhoods in which people reside and interact
than existing synthetic population frameworks, and to support more nuanced
characterization of human activities at specific POIs (e.g., social contact
networks, occupancy).
    The examples provided in the Case Study show how this refined understanding
of human dynamics can benefit planning applications. For example, in the event
of a localized emergency, the results of Students could be used to examine
schools for which rendezvous with caregivers might pose an added challenge for
students (e.g., more students from single caregiver vs. married family
households). Additionally, the POI occupancy dynamics demonstrated in Workers
(Educators) could be used to assess the times at which worker commutes to/from
places of employment might be most sensitive to a nearby disruption. Another
application in the public health sphere might be to use occupancy estimates to
anticipate the best time of day to reach workers, during a vaccination
campaign, for example.
    Our case study had several limitations that we plan to overcome in future
work. First, we assumed that all travel within our study area occurs along road
networks. While road-based travel is the dominant means of travel in the
Knoxville CBSA, this assumption is not transferable to other urban areas within
the United States. Our eventual goal is to build in additional modes of travel
like public transit, walk/bike, and ferries by expanding our ingest of
OpenStreetMap features.
    Second, we do not yet offer direct support for non-traditional schools
(e.g., populations with special needs, families on military bases). For
example, the Tennessee School for the Deaf falls within our study area, and its
compositional estimate could be refined by reapportioning the students most
likely to be in attendance to that location.
    Third, we did not account for teachers in virtual schools, which may form a
portion of the missing work arrival times discussed in Workers (Educators).
Work-from-home populations can be better incorporated into our travel
simulations by applying work schedules from time-use surveys to
probabilistically assign in-person or remote status based on occupation. We are
particularly interested in using this technique with Likeness to better
understand changing patterns of life during the COVID-19 pandemic in 2020.

Conclusion

The Likeness toolkit enhances agent creation for modeling human dynamics
through its dual capabilities of high-fidelity ("vivid") agent characterization
and travel along real-world transportation networks to POIs. These capabilities
benefit planners and urban researchers by providing a richer understanding of
how spatial policy interventions can be designed with respect to how people
live, move, and interact. Likeness strives to be flexible toward a variety of
research applications linked to human security, among them spatial
epidemiology, transportation equity, and environmental hazards.
    Several ongoing developments will further Likeness' capabilities. First, we
plan to expand our support for POIs curated from location services (e.g.,
Google, Facebook, Here, TomTom, FourSquare) by the ORNL PlanetSense project
[TBP+ 15], incorporating factors like facility size, hours of operation, and
popularity curves to refine the destination capacity estimates required to
perform actlike simulations. Second, along with multi-modal travel, we plan to
incorporate multiple trip models based on large-scale human activity datasets
like the American Time Use Survey11 and National Household Travel Survey12 .
Together, these improvements will extend our travel simulations to
"non-obligate" population segments traveling to civic, social, and recreational
activities [BMWR22]. Third, the current procedure for spatial allocation uses
block groups as the target scale for population synthesis. However, there are a
limited number of constraining variables available at the block group level. To
include a larger volume of constraints (e.g., vehicle access, language), we are
exploring an additional tract-level approach. P-MEDM in this case is run on
cross-covariances between tracts and "supertract" aggregations created with the
Max-p-regions problem [DAR12], [WRK21] implemented in PySAL's spopt [RA07],
[FGK+ 21], [RAA+ 21], [FBG+ 22].
    As a final note, the Likeness toolkit is being developed on top of key open
source dependencies in the Scientific Python ecosystem, the core of which are,
of course, numpy [HMvdW+ 20] and scipy [VGO+ 20]. Although an exhaustive list
would be prohibitive, major packages not previously mentioned include geopandas
[JdBF+ 21], matplotlib [Hun07], networkx [HSS08], pandas [pdt20], [WM10], and
shapely [G+ ]. Our goal is to contribute to the community with releases of the
packages comprising Likeness, but since this is an emerging project its
development to date has been limited to researchers at ORNL. However, we plan
to provide a fully open-sourced code base within the coming year through
GitHub13 .

  11. https://www.bls.gov/tus
  12. https://nhts.ornl.gov
  13. https://github.com/ORNL

Acknowledgements

This material is based upon work supported by the U.S. Department of Energy
under contract no. DE-AC05-00OR22725.

REFERENCES

[ANM+ 18]    H.M. Abdul Aziz, Nicholas N. Nagle, April M. Morton,
             Michael R. Hilliard, Devin A. White, and Robert N. Stew-

             art. Exploring the impact of walk–bike infrastructure, safety       [GFH20]     James D. Gaboardi, David C. Folch, and Mark W. Horner.
             perception, and built-environment on active transportation                      Connecting Points to Spatial Networks: Effects on Discrete
             mode choice: a random parameter model using New York                            Optimization Models. Geographical Analysis, 52(2):299–322,
             City commuter data. Transportation, 45(5):1207–1229, 2018.                      2020. doi:10.1111/gean.12211.
             doi:10.1007/s11116-017-9760-8.                                      [GLC+ 15]   Isabella Gollini, Binbin Lu, Martin Charlton, Christopher
[BBE+ 08]    Christopher L. Barrett, Keith R. Bisset, Stephen G. Eubank,                     Brunsdon, and Paul Harris. GWmodel: An R package for
             Xizhou Feng, and Madhav V. Marathe. EpiSimdemics: an ef-                        exploring spatial heterogeneity using geographically weighted
             ficient algorithm for simulating the spread of infectious disease               models. Journal of Statistical Software, 63(17):1–50, 2015.
             over large realistic social networks. In SC’08: Proceedings of                  doi:10.18637/jss.v063.i17.
             the 2008 ACM/IEEE Conference on Supercomputing, pages               [GT22]      James D. Gaboardi and Joseph V. Tuccillo. Simulating Travel
             1–12. IEEE, 2008. doi:10.1109/SC.2008.5214892.                                  to Points of Interest for Demographically-rich Synthetic Popu-
[BBM96]      Richard J. Beckman, Keith A. Baggerly, and Michael D.                           lations, February 2022. American Association of Geographers
             McKay. Creating synthetic baseline populations. Transporta-                     Annual Meeting. doi:10.5281/zenodo.6335783.
             tion Research Part A: Policy and Practice, 30(6):415–429,           [Hew97]     Kenneth Hewitt. Vulnerability Perspectives: the Human Ecol-
             1996. doi:10.1016/0965-8564(96)00004-3.                                         ogy of Endangerment. In Regions of Risk: A Geographical
[BCD+ 06]    Dimitris Ballas, Graham Clarke, Danny Dorling, Jan Rigby,                       Introduction to Disasters, chapter 6, pages 141–164. Addison
             and Ben Wheeler. Using geographical information systems and                     Wesley Longman, 1997.
             spatial microsimulation for the analysis of health inequalities.    [HHSB12]    Kirk Harland, Alison Heppenstall, Dianna Smith, and Mark H.
             Health Informatics Journal, 12(1):65–79, 2006. doi:10.                          Birkin. Creating realistic synthetic populations at varying
             1177/1460458206061217.                                                          spatial scales: A comparative critique of population synthesis
[BFH+ 17]    Komal Basra, M. Patricia Fabian, Raymond R. Holberger,                          techniques. Journal of Artificial Societies and Social Simula-
             Robert French, and Jonathan I. Levy. Community-engaged                          tion, 15(1):1, 2012. doi:10.18564/jasss.1909.
             modeling of geographic and demographic patterns of mul-             [Hit41]     Frank L. Hitchcock. The Distribution of a Product from
             tiple public health risk factors. International Journal of                      Several Sources to Numerous Localities. Journal of Mathe-
             Environmental Research and Public Health, 14(7):730, 2017.                      matics and Physics, 20(1-4):224–230, 1941. doi:10.1002/
             doi:10.3390/ijerph14070730.                                                     sapm1941201224.
[BMWR22]     Christa Brelsford, Jessica J. Moehl, Eric M. Weber, and             [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der
             Amy N. Rose. Segmented Population Models: Improving the                         Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric
             LandScan USA Non-Obligate Population Estimate (NOPE).                           Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith,
             American Association of Geographers 2022 Annual Meeting,                        Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk-
             2022.                                                                           wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río,
[Boe17]      Geoff Boeing. OSMnx: New methods for acquiring, con-                            Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin
             structing, analyzing, and visualizing complex street networks.                  Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi,
             Computers, Environment and Urban Systems, 65:126–139,                           Christoph Gohlke, and Travis E. Oliphant. Array programming
             September 2017. doi:10.1016/j.compenvurbsys.                                    with NumPy. Nature, 585(7825):357–362, September 2020.
             2017.05.004.                                                                    doi:10.1038/s41586-020-2649-2.
[CGSdG08]    Isabel Correia, Luís Gouveia, and Francisco Saldanha-da             [HNB+ 11]   Jan A.C. Hontelez, Nico Nagelkerke, Till Bärnighausen, Roel
             Gama. Solving the variable size bin packing problem                             Bakker, Frank Tanser, Marie-Louise Newell, Mark N. Lurie,
             with discretized formulations. Computers & Operations Re-                       Rob Baltussen, and Sake J. de Vlas. The potential impact of
             search, 35(6):2103–2113, June 2008. doi:10.1016/j.                              RV144-like vaccines in rural South Africa: a study using the
             cor.2006.10.014.                                                                STDSIM microsimulation model. Vaccine, 29(36):6100–6106,
                                                                                             2011. doi:10.1016/j.vaccine.2011.06.059.
[CLB09]      Fuyuan Cao, Jiye Liang, and Liang Bai. A new initialization
             method for categorical data clustering. Expert Systems with         [HSS08]     Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart.
             Applications, 36(7):10223–10228, 2009. doi:10.1016/j.                           Exploring Network Structure, Dynamics, and Function using
             eswa.2009.01.060.                                                               NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod
                                                                                             Millman, editors, Proceedings of the 7th Python in Science
[DAR12]      Juan C. Duque, Luc Anselin, and Sergio J. Rey. THE MAX-
                                                                                             Conference, pages 11 – 15, Pasadena, CA USA, 2008. URL:
             P-REGIONS PROBLEM*. Journal of Regional Science,
                                                                                             https://www.osti.gov/biblio/960616.
             52(3):397–419, 2012. doi:10.1111/j.1467-9787.
                                                                                 [Hun07]     J. D. Hunter. Matplotlib: A 2D graphics environment. Com-
             2011.00743.x.
                                                                                             puting in Science & Engineering, 9(3):90–95, 2007. doi:
[DKA+ 08]    M. Diaz, J.J. Kim, G. Albero, S. De Sanjose, G. Clifford, F.X.                  10.1109/MCSE.2007.55.
             Bosch, and S.J. Goldie. Health and economic impact of HPV
                                                                                 [JdBF+ 21]  Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann,
             16 and 18 vaccination and cervical cancer screening in India.
                                                                                             James McBride, Jacob Wasserman, Adrian Garcia Badaracco,
             British Journal of Cancer, 99(2):230–238, 2008. doi:10.
                                                                                             Jeffrey Gerard, Alan D. Snow, Jeff Tratner, Matthew Perry,
             1038/sj.bjc.6604462.
                                                                                             Carson Farmer, Geir Arne Hjelle, Micah Cochran, Sean
[dV21]       Nelis J. de Vos. kmodes categorical clustering library. https:                  Gillies, Lucas Culbertson, Matt Bartos, Brendan Ward, Gia-
             //github.com/nicodv/kmodes, 2015–2021.                                          como Caria, Mike Taves, Nick Eubank, sangarshanan, John
[FBG+ 22]    Xin Feng, Germano Barcelos, James D. Gaboardi, Elijah                           Flavin, Matt Richards, Sergio Rey, maxalbert, Aleksey Bi-
             Knaap, Ran Wei, Levi J. Wolf, Qunshan Zhao, and Sergio J.                       logur, Christopher Ren, Dani Arribas-Bel, Daniel Mesejo-
             Rey. spopt: a python package for solving spatial optimization                   León, and Leah Wasser. geopandas/geopandas: v0.10.2, Octo-
             problems in PySAL. Journal of Open Source Software,                             ber 2021. doi:10.5281/zenodo.5573592.
             7(74):3330, 2022. doi:10.21105/joss.03330.                          [Kna78]     Thomas R. Knapp. Canonical Correlation Analysis: A general
[FGK+ 21]    Xin Feng, James D. Gaboardi, Elijah Knaap, Sergio J. Rey,                       parametric significance-testing system. Psychological Bulletin,
             and Ran Wei. pysal/spopt, jan 2021. URL: https://github.com/                    85(2):410–416, 1978. doi:10.1037/0033-2909.85.
             pysal/spopt, doi:10.5281/zenodo.4444156.                                        2.410.
[FL86]       D.K. Friesen and M.A. Langston. Variable Sized Bin Packing.         [Koo49]     Tjalling C. Koopmans. Optimum Utilization of the Transporta-
             SIAM Journal on Computing, 15(1):222–230, February 1986.                        tion System. Econometrica, 17:136–146, 1949. Publisher:
             doi:10.1137/0215016.                                                            [Wiley, Econometric Society]. doi:10.2307/1907301.
[FW12]       Fletcher Foti and Paul Waddell. A Generalized Com-                  [LB13]      Robin Lovelace and Dimitris Ballas. ‘Truncate, replicate,
             putational Framework for Accessibility: From the Pedes-                         sample’: A method for creating integer weights for spa-
             trian to the Metropolitan Scale. In Transportation Re-                          tial microsimulation. Computers, Environment and Urban
             search Board Annual Conference, pages 1–14, 2012.                               Systems, 41:1–11, September 2013. doi:10.1016/j.
             URL: https://onlinepubs.trb.org/onlinepubs/conferences/2012/                    compenvurbsys.2013.03.004.
             4thITM/Papers-A/0117-000062.pdf.                                    [LNB13]     Stefan Leyk, Nicholas N. Nagle, and Barbara P. Buttenfield.
[G+ ]        Sean Gillies et al. Shapely: manipulation and analysis of                       Maximum Entropy Dasymetric Modeling for Demographic
             geometric objects, 2007–. URL: https://github.com/shapely/                      Small Area Estimation. Geographical Analysis, 45(3):285–
             shapely.                                                                        306, July 2013. doi:10.1111/gean.12011.

[MCB+ 08]   Karyn Morrissey, Graham Clarke, Dimitris Ballas, Stephen                       Scan USA 2016 [Data set]. Technical report, Oak Ridge
            Hynes, and Cathal O’Donoghue. Examining access to GP                           National Laboratory, 2017. doi:10.48690/1523377.
            services in rural Ireland using microsimulation analysis. Area,   [SEM14]      Samarth Swarup, Stephen G. Eubank, and Madhav V. Marathe.
            40(3):354–364, 2008. doi:10.1111/j.1475-4762.                                  Computational epidemiology as a challenge domain for multi-
            2008.00844.x.                                                                  agent systems. In Proceedings of the 2014 international con-
[MNP+ 17]   April M. Morton, Nicholas N. Nagle, Jesse O. Piburn,                           ference on Autonomous agents and multi-agent systems, pages
            Robert N. Stewart, and Ryan McManamay. A hybrid dasy-                          1173–1176, 2014. URL: https://www.ifaamas.org/AAMAS/
            metric and machine learning approach to high-resolution                        aamas2014/proceedings/aamas/p1173.pdf.
            residential electricity consumption modeling. In Advances         [SNGJ+ 09]   Beate Sander, Azhar Nizam, Louis P. Garrison Jr., Maarten J.
            in Geocomputation, pages 47–58. Springer, 2017. doi:                           Postma, M. Elizabeth Halloran, and Ira M. Longini Jr. Eco-
            10.1007/978-3-319-22786-3_5.                                                   nomic evaluation of influenza pandemic mitigation strate-
[MOD11]     Stuart     Mitchell,     Michael    O’Sullivan,    and     Iain                gies in the United States using a stochastic microsimulation
            Dunning.         PuLP: A Linear Programming Toolkit                            transmission model. Value in Health, 12(2):226–233, 2009.
            for Python.            Technical report, 2011.            URL:                 doi:10.1111/j.1524-4733.2008.00437.x.
            https://www.dit.uoi.gr/e-class/modules/document/file.php/         [SPH11]      Dianna M. Smith, Jamie R. Pearce, and Kirk Harland. Can
            216/PAPERS/2011.%20PuLP%20-%20A%20Linear%                                      a deterministic spatial microsimulation model provide reli-
            20Programming%20Toolkit%20for%20Python.pdf.                                    able small-area estimates of health behaviours? An example
[MPN+ 17]   April M. Morton, Jesse O. Piburn, Nicholas N. Nagle, H.M.                      of smoking prevalence in New Zealand. Health & Place,
            Aziz, Samantha E. Duchscherer, and Robert N. Stewart. A                        17(2):618–624, 2011. doi:10.1016/j.healthplace.
            simulation approach for modeling high-resolution daytime                       2011.01.001.
            commuter travel flows and distributions of worker subpopula-      [ST20]       Haroldo G. Santos and Túlio A.M. Toffolo. Mixed Integer Lin-
            tions. In GeoComputation 2017, Leeds, UK, pages 1–5, 2017.                     ear Programming with Python. Technical report, 2020. URL:
            URL: http://www.geocomputation.org/2017/papers/44.pdf.                         https://python-mip.readthedocs.io/_/downloads/en/latest/pdf/.
[MS01]      Harvey J. Miller and Shih-Lung Shaw. Geographic Informa-          [TBP+ 15]    Gautam S. Thakur, Budhendra L. Bhaduri, Jesse O. Piburn,
            tion Systems for Transportation: Principles and Applications.                  Kelly M. Sims, Robert N. Stewart, and Marie L. Urban.
            Oxford University Press, New York, 2001.                                       PlanetSense: a real-time streaming and spatio-temporal an-
[MS15]      Harvey J. Miller and Shih-Lung Shaw. Geographic Informa-                       alytics platform for gathering geo-spatial intelligence from
            tion Systems for Transportation in the 21st Century. Geogra-                   open source data. In Proceedings of the 23rd SIGSPATIAL
            phy Compass, 9(4):180–189, 2015. doi:10.1111/gec3.                             International Conference on Advances in Geographic Informa-
            12204.                                                                         tion Systems, pages 1–4, 2015. doi:10.1145/2820783.
[NBLS14]    Nicholas N. Nagle, Barbara P. Buttenfield, Stefan Leyk, and                    2820882.
            Seth Spielman. Dasymetric modeling and uncertainty. Annals        [TCR08]      Melanie N. Tomintz, Graham P. Clarke, and Janette E. Rigby.
            of the Association of American Geographers, 104(1):80–95,                      The geography of smoking in Leeds: estimating individual
            2014. doi:10.1080/00045608.2013.843439.                                        smoking rates and the implications for the location of stop
[NCA13]     Markku Nurhonen, Allen C. Cheng, and Kari Auranen. Pneu-                       smoking services. Area, 40(3):341–353, 2008. doi:10.
            mococcal transmission and disease in silico: a microsimu-                      1111/j.1475-4762.2008.00837.x.
            lation model of the indirect effects of vaccination. PloS         [TG22]       Joseph V. Tuccillo and James D. Gaboardi. Connecting Vivid
            one, 8(2):e56079, 2013. doi:10.1371/journal.pone.                              Population Data to Human Dynamics, June 2022. Distilling
            0056079.                                                                       Diversity by Tapping High-Resolution Population and Survey
[NLHH07]    Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and                        Data. doi:10.5281/zenodo.6607533.
            Zengyou He. On the impact of dissimilarity measure in             [TM21]       Joseph V. Tuccillo and Jessica Moehl. An Individual-
            k-modes clustering algorithm. IEEE Transactions on Pat-                        Oriented Typology of Social Areas in the United States, May
            tern Analysis and Machine Intelligence, 29(3):503–507, 2007.                   2021. 2021 ACS Data Users Conference. doi:10.5281/
            doi:10.1109/TPAMI.2007.53.                                                     zenodo.6672291.
[pdt20]     The pandas development team. pandas-dev/pandas: Pandas,           [TMKD17]     Matthias Templ, Bernhard Meindl, Alexander Kowarik, and
            February 2020. doi:10.5281/zenodo.3509134.                                     Olivier Dupriez. Simulation of synthetic complex data: The
[PVG+ 11]   F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,                            R package simPop. Journal of Statistical Software, 79:1–38,
            B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,                  2017. doi:10.18637/jss.v079.i10.
            V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,              [Tuc21]      Joseph V. Tuccillo. An Individual-Centered Approach for
            M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn:                         Geodemographic Classification. In 11th International Con-
            Machine Learning in Python. Journal of Machine Learning                        ference on Geographic Information Science 2021 Short Paper
            Research, 12:2825–2830, 2011. URL: https://www.jmlr.org/                       Proceedings, pages 1–6, 2021. doi:10.25436/E2H59M.
            papers/v12/pedregosa11a.html.                                     [VGO+ 20]    Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt
[QC13]      Fang Qiu and Robert Cromley. Areal Interpolation and                           Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski,
            Dasymetric Modeling: Areal Interpolation and Dasymetric                        Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté-
            Modeling. Geographical Analysis, 45(3):213–215, July 2013.                     fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar-
            doi:10.1111/gean.12016.                                                        rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric
[RA07]      Sergio J. Rey and Luc Anselin. PySAL: A Python Library of                      Jones, Robert Kern, Eric Larson, C.J. Carey, İlhan Polat,
            Spatial Analytical Methods. The Review of Regional Studies,                    Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde,
            37(1):5–27, 2007. URL: https://rrs.scholasticahq.com/article/                  Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quin-
            8285.pdf, doi:10.52324/001c.8285.                                              tero, Charles R. Harris, Anne M. Archibald, Antônio H.
[RAA+ 21]   Sergio J. Rey, Luc Anselin, Pedro Amaral, Dani Arribas-                        Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy
            Bel, Renan Xavier Cortes, James David Gaboardi, Wei Kang,                      1.0 Contributors. SciPy 1.0: Fundamental Algorithms for
            Elijah Knaap, Ziqi Li, Stefanie Lumnitz, Taylor M. Oshan,                      Scientific Computing in Python. Nature Methods, 17:261–272,
            Hu Shao, and Levi John Wolf. The PySAL Ecosystem:                              2020. doi:10.1038/s41592-019-0686-2.
            Philosophy and Implementation. Geographical Analysis, 2021.       [WCC+ 09]    William D. Wheaton, James C. Cajka, Bernadette M. Chas-
            doi:10.1111/gean.12276.                                                        teen, Diane K. Wagener, Philip C. Cooley, Laxminarayana
[RSF+ 21]   Krishna P. Reddy, Fatma M. Shebl, Julia H.A. Foote, Guy                        Ganapathi, Douglas J. Roberts, and Justine L. Allpress.
            Harling, Justine A. Scott, Christopher Panella, Kieran P. Fitz-                Synthesized population databases: A US geospatial database
            maurice, Clare Flanagan, Emily P. Hyle, Anne M. Neilan, et al.                 for agent-based models.       Methods report (RTI Press),
            Cost-effectiveness of public health strategies for COVID-19                    2009(10):905, 2009. doi:10.3768/rtipress.2009.
            epidemic control in South Africa: a microsimulation modelling                  mr.0010.0905.
            study. The Lancet Global Health, 9(2):e120–e129, 2021.            [WM10]       Wes McKinney. Data Structures for Statistical Computing in
            doi:10.1016/S2214-109X(20)30452-6.                                             Python. In Stéfan van der Walt and Jarrod Millman, editors,
[RWM+ 17]   Amy N. Rose, Eric M. Weber, Jessica J. Moehl, Melanie L.                       Proceedings of the 9th Python in Science Conference, pages 56
            Laverdiere, Hsiu-Han Yang, Matthew C. Whitehead, Kelly M.                      – 61, 2010. doi:10.25080/Majora-92bf1922-00a.
            Sims, Nathan E. Trombley, and Budhendra L. Bhaduri. Land-         [WRK21]      Ran Wei, Sergio J. Rey, and Elijah Knaap. Efficient re-

             gionalization for spatially explicit neighborhood delineation.
             International Journal of Geographical Information Science,
             35(1):135–151, 2021. doi:10.1080/13658816.2020.
             1759806.
[ZFJ14]      Yi Zhu and Joseph Ferreira Jr. Synthetic population gener-
             ation at disaggregated spatial scales for land use and trans-
             portation microsimulation. Transportation Research Record,
             2429(1):168–177, 2014. doi:10.3141/2429-18.




                      poliastro: a Python library for interactive
                                    astrodynamics
                                              Juan Luis Cano Rodríguez‡∗ , Jorge Martínez Garrido‡

                                      https://www.youtube.com/watch?v=VCpTgU1pb5k



Abstract—Space is more popular than ever, with the growing public awareness of
interplanetary scientific missions, as well as the increasingly large number of
satellite companies planning to deploy satellite constellations. Python has
become a fundamental technology in the astronomical sciences, and it has also
caught the attention of the Space Engineering community.
     One of the requirements for designing a space mission is studying the
trajectories of satellites, probes, and other artificial objects, usually
ignoring non-gravitational forces or treating them as perturbations: the
so-called n-body problem. However, for preliminary design studies and most
practical purposes, it is sufficient to consider only two bodies: the object
under study and its attractor.
     Even though the two-body problem has many analytical solutions, orbit
propagation (the initial value problem) and targeting (the boundary value
problem) remain computationally intensive because of long propagation times,
tight tolerances, and vast solution spaces. On the other hand, astrodynamics
researchers often do not share the source code they used to run analyses and
simulations, which makes it challenging to try out new solutions.
     This paper presents poliastro, an open-source Python library for
interactive astrodynamics that features an easy-to-use API and tools for quick
visualization. poliastro implements core astrodynamics algorithms (such as the
resolution of the Kepler and Lambert problems) and leverages numba, a
Just-in-Time compiler for scientific Python, to optimize the running time.
Thanks to Astropy, poliastro can perform seamless coordinate frame conversions
and use proper physical units and timescales. At the moment, poliastro is the
longest-lived Python library for astrodynamics, has contributors from all
around the world, and several New Space companies and people in academia use
it.

Index Terms—astrodynamics, orbital mechanics, orbit propagation, orbit
visualization, two-body problem

* Corresponding author: hello@juanlu.space
‡ Unaffiliated

Copyright © 2022 Juan Luis Cano Rodríguez et al. This is an open-access article
distributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.

Introduction

History

The term "astrodynamics" was coined by the American astronomer Samuel Herrick,
who received encouragement from the space pioneer Robert H. Goddard, and refers
to the branch of space science dealing with the motion of artificial celestial
bodies ([Dub73], [Her71]). However, the roots of its mathematical foundations
go back several centuries.
    Kepler first introduced his laws of planetary motion in 1609 and 1619 and
derived his famous transcendental equation (1), which we now see as capturing a
restricted form of the two-body problem. This work was generalized by Newton to
give birth to the n-body problem, and many other mathematicians worked on it
throughout the centuries (Daniel and Johann Bernoulli, Euler, Gauss). Poincaré
established in the 1890s that no general closed-form solution exists for the
n-body problem, since the resulting dynamical system is chaotic [Bat99].
Sundman proved in the 1900s the existence of convergent solutions for a few
restricted cases with n = 3.

                          M = E − e sin E                                    (1)

In 1903 Tsiolkovsky evaluated the conditions required for artificial objects to
leave the orbit of the Earth; this is considered a foundational contribution to
the field of astrodynamics. Tsiolkovsky devised equation 2, which relates the
increase in velocity to the effective exhaust velocity of the thrusted gases
and the fraction of propellant used.

                          ∆v = v_e ln (m_0 / m_f)                            (2)

Further developments by Kondratyuk, Hohmann, and Oberth in the early 20th
century all added to the growing field of orbital mechanics, which in turn
enabled the development of space flight in the USSR and the United States in
the 1950s and 1960s.

The two-body problem

In a system of i ∈ 1, ..., n bodies subject to their mutual attraction, by
application of Newton's law of universal gravitation, the total force f_i
affecting m_i due to the presence of the other n − 1 masses is given by
[Bat99]:

                          f_i = −G ∑_{j≠i} (m_i m_j / |r_ij|³) r_ij          (3)

where G = 6.67430 · 10⁻¹¹ N m² kg⁻² is the universal gravitational constant,
and r_ij denotes the position vector from m_i to m_j. Applying Newton's second
law of motion results in a system of n differential equations:

                          d²r_i / dt² = −G ∑_{j≠i} (m_j / |r_ij|³) r_ij      (4)

By setting n = 2 in 4 and subtracting the two resulting equalities, one arrives
at the fundamental equation of the two-body problem:

                          d²r / dt² = −(µ / r³) r                            (5)

where µ = G(m_1 + m_2) = G(M + m). When m ≪ M (for example, an artificial
satellite orbiting a planet), one can consider µ = GM a property of the
attractor.
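To make equation (5) concrete, the following short sketch integrates it
numerically with plain SciPy for an assumed near-circular low Earth orbit. This
is an illustration only, not how poliastro propagates orbits.

import numpy as np
from scipy.integrate import solve_ivp

MU_EARTH = 398600.4418  # km^3 / s^2, assumed value of GM for the Earth

def two_body(t, state, mu=MU_EARTH):
    # Equation (5): acceleration points toward the attractor, scaled by mu/r^3.
    r = state[:3]
    v = state[3:]
    a = -mu * r / np.linalg.norm(r) ** 3
    return np.concatenate((v, a))

# Illustrative initial conditions for a roughly 700 km altitude circular orbit.
r0 = np.array([7078.0, 0.0, 0.0])   # km
v0 = np.array([0.0, 7.5, 0.0])      # km / s
sol = solve_ivp(two_body, (0.0, 6000.0), np.concatenate((r0, v0)),
                rtol=1e-9, atol=1e-12, dense_output=True)
print(sol.y[:3, -1])  # position vector after 6000 s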

Keplerian vs non-Keplerian motion

Conveniently manipulating equation 5 leads to several properties [Bat99] that
were already published by Johannes Kepler in the 1610s, namely:

   1)  The orbit always describes a conic section (an ellipse, a parabola, or a
       hyperbola), with the attractor at one of the two foci, and can be
       written in polar coordinates as r = p / (1 + e cos ν) (Kepler's first
       law).
   2)  The magnitude of the specific angular momentum h = r² dθ/dt is constant
       and equal to two times the areal velocity (Kepler's second law).
   3)  For closed (circular and elliptical) orbits, the period is related to
       the size of the orbit through P = 2π √(a³/µ) (Kepler's third law).

    For many practical purposes it is usually sufficient to limit the study to
one object orbiting an attractor and ignore all other external forces of the
system, hence restricting the study to trajectories governed by equation 5.
Such trajectories are called "Keplerian", and several problems can be
formulated for them:

   •   The initial-value problem, usually called propagation, which involves
       determining the position and velocity of an object after an elapsed
       period of time given some initial conditions.
   •   Preliminary orbit determination, which involves using exact or
       approximate methods to derive a Keplerian orbit from a set of
       observations.
   •   The boundary-value problem, often named the Lambert problem, which
       involves determining a Keplerian orbit from boundary conditions, usually
       departure and arrival position vectors and a time of flight.

    Fortunately, most of these problems boil down to finding numerical
solutions to relatively simple algebraic relations between time and angular
variables: for elliptic motion (0 ≤ e < 1) it is the Kepler equation, and
equivalent relations exist for the other eccentricity regimes [Bat99].
Numerical solutions for these equations can be found in a number of different
ways, each one with different complexity and precision tradeoffs. In the
Methods section we list the ones implemented by poliastro.
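To make the elliptic case concrete, the sketch below solves Kepler's equation
(1) for the eccentric anomaly with a plain Newton-Raphson iteration and then
converts it to true anomaly. It is a generic textbook illustration, not one of
the specific algorithms that poliastro implements.

import numpy as np

def solve_kepler(M, ecc, tol=1e-12, maxiter=50):
    # Newton-Raphson on f(E) = E - ecc*sin(E) - M, with a common starting guess.
    E = M if ecc < 0.8 else np.pi
    for _ in range(maxiter):
        delta = (E - ecc * np.sin(E) - M) / (1 - ecc * np.cos(E))
        E -= delta
        if abs(delta) < tol:
            break
    return E

M = np.radians(30.0)   # mean anomaly
ecc = 0.1              # eccentricity (elliptic motion, 0 <= e < 1)
E = solve_kepler(M, ecc)
# True anomaly from eccentric anomaly: tan(nu/2) = sqrt((1+e)/(1-e)) tan(E/2).
nu = 2 * np.arctan2(np.sqrt(1 + ecc) * np.sin(E / 2),
                    np.sqrt(1 - ecc) * np.cos(E / 2))
print(np.degrees(E), np.degrees(nu))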
    On the other hand, there are many situations in which natural and
artificial orbital perturbations must be taken into account so that the actual
non-Keplerian motion can be properly analyzed:

   •   Interplanetary travel in the proximity of other planets. In a first
       approximation it is usually enough to study the trajectory in segments
       and focus the analysis on the closest attractor, hence patching several
       Keplerian orbits along the way (the so-called "patched-conic
       approximation") [Bat99]. The boundary surface that separates one segment
       from the other is called the sphere of influence.
   •   Use of solar sails, electric propulsion, or other means of continuous
       thrust. Devising the optimal guidance laws that minimize travel time or
       fuel consumption under these conditions is usually treated as an
       optimization problem of a dynamical system, and as such it is
       particularly challenging [Con14].
   •   Artificial satellites in the vicinity of a planet. This is the regime in
       which all of the commercial space industry operates, especially for
       those satellites in Low-Earth Orbit (LEO).

State of the art

In our view, at the time of creating poliastro there were a number of issues
with existing open source astrodynamics software that posed a barrier of entry
for novices and amateur practitioners. Most of these barriers still exist today
and are described in the following paragraphs. The goals of the project can be
condensed as follows:

   1)  Set an example on reproducibility and good coding practices in
       astrodynamics.
   2)  Become approachable software, even for novices.
   3)  Offer performant software that can also be used in scripting and
       interactive workflows.

    The most mature software libraries for astrodynamics are arguably Orekit
[noa22c], a "low level space dynamics library written in Java" with an open
governance model, and SPICE [noa22d], a toolkit developed by NASA's Navigation
and Ancillary Information Facility at the Jet Propulsion Laboratory. Other
similar, smaller projects that appeared later on and are still maintained to
this day include PyKEP [IBD+ 20], beyond [noa22a], tudatpy [noa22e], sbpy
[MKDVB+ 19], Skyfield [Rho20] (Python), CelestLab (Scilab) [noa22b],
astrodynamics.jl (Julia) [noa] and Nyx (Rust) [noa21a]. In addition, there are
some Graphical User Interface (GUI) based open source programs used for Mission
Analysis and orbit visualization, such as GMAT [noa20] and gpredict [noa18],
and complete web applications for tracking constellations of satellites like
the SatNOGS project by the Libre Space Foundation [noa21b].
    The level of quality and maintenance of these packages is somewhat
heterogeneous. Community-led projects with strong corporate backing like Orekit
are in excellent health, while on the other hand smaller projects developed by
volunteers (beyond, astrodynamics.jl) or with limited institutional support
(PyKEP, GMAT) suffer from a lack of maintenance. Part of the problem might stem
from the fact that most scientists are never taught how to build software
efficiently, let alone the skills to collaboratively develop software in the
open [WAB+ 14], and astrodynamicists are no exception.
    On the other hand, it is often difficult to translate the advances in
astrodynamics research into software. Classical algorithms developed throughout
the 20th century are described in papers that are sometimes difficult to find,
and source code or validation data is almost never available. When it comes to
modern research carried out in the digital era, source code and validation data
are still difficult to obtain, even though they are supposedly provided "upon
reasonable request" [SSM18] [GBP22].
    It is no surprise that astrodynamics software often requires deep
expertise. However, there are often implicit assumptions that are not
documented with an adequate level of detail, which originate widespread
misconceptions and lead even seasoned professionals to make conceptual
mistakes. Some of the most notorious misconceptions arise around the use of
general perturbations data (OMMs and TLEs) [Fin07], the geometric
interpretation of the mean anomaly [Bat99], or coordinate transformations
[VCHK06].
    Finally, few of the open source software libraries mentioned above are
amenable to scripting or interactive use, as promoted by computational
notebooks like Jupyter [KRKP+ 16].
    The following sections discuss the various areas of current research that
an astrodynamicist will engage in, and how poliastro improves their workflow.

Methods

Software Architecture

The architecture of poliastro emerges from the following set of conflicting
requirements:

   1)  There should be a high-level API that enables users to perform orbital
       calculations in a straightforward way and prevent typical mistakes.
   2)  The running time of the algorithms should be within the same order of
       magnitude of existing compiled implementations.
   3)  The library should be written in a popular open-source language to
       maximize adoption and lower the barrier to external contributors.

    One of the most typical mistakes we set ourselves to prevent with the
high-level API is dimensional errors. Addition and subtraction operations of
physical quantities are defined only for quantities with the same units
[Dro53]: for example, the operation 1 km + 100 m requires a scale
transformation of at least one of the operands, since they have different units
(kilometers and meters) but the same dimension (length), whereas the operation
1 km + 1 kg is directly not allowed because the dimensions are incompatible
(length and mass). As such, software systems operating with physical quantities
should raise exceptions when adding different dimensions, and transparently
perform the required scale transformations when adding different units of the
same dimension.
    With this in mind, we evaluated several Python packages for unit handling
(see [JGAZJT+ 18] for a recent survey) and chose astropy.units [TPWS+ 18].

radius = 6000  # km
altitude = 500  # m
# Wrong!
distance = radius + altitude

from astropy import units as u
# Correct
distance = (radius << u.km) + (altitude << u.m)

This notion of providing a "safe" API extends to other parts of the library by
leveraging other capabilities of the Astropy project. For example, timestamps
use astropy.time objects, which take care of the appropriate handling of time
scales (such as TDB or UTC), reference frame conversions leverage
astropy.coordinates, and so forth.
    One of the drawbacks of existing unit packages is that they impose a
significant performance penalty. Even though astropy.units is integrated with
NumPy, hence allowing the creation of array quantities, all the unit
compatibility checks are implemented in Python and require lots of
introspection, and this can slow down mathematical operations by several orders
of magnitude. As such, to fulfill our desired performance requirement for
poliastro, we envisioned a two-layer architecture:

   •   The Core API follows a procedural style, and all the functions receive
       Python numerical types and NumPy arrays for maximum performance.
   •   The High level API is object-oriented, all the methods receive Astropy
       Quantity objects with physical units, and computations are deferred to
       the Core API.

Fig. 1: poliastro two-layer architecture ("Nice, high level API" layered on top
of "Dangerous™ algorithms").
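To make the split concrete, here is a minimal, self-contained sketch of the
pattern with hypothetical function names; it is illustrative only and not
poliastro's actual internals. A numba-compiled core function works on plain
floats, and a thin unit-checked high-level function wraps it.

from astropy import units as u
from numba import njit
import numpy as np

@njit
def _period_fast(a, k):
    # Core layer: plain floats, no unit handling (Kepler's third law).
    return 2 * np.pi * np.sqrt(a ** 3 / k)

@u.quantity_input(a=u.km, k=u.km ** 3 / u.s ** 2)
def period(a, k):
    # High level layer: validate units, then defer the math to the core layer.
    return _period_fast(a.to_value(u.km),
                        k.to_value(u.km ** 3 / u.s ** 2)) << u.s

print(period(7000 << u.km, 398600.4418 << u.km ** 3 / u.s ** 2))  # ~5.8e3 s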
This notion of providing a "safe" API extends to other parts            Usage
of the library by leveraging other capabilities of the Astropy
                                                                        Basic Orbit and Ephem creation
project. For example, timestamps use astropy.time objects,
which take care of the appropriate handling of time scales              The two central objects of the poliastro high level API are Orbit
(such as TDB or UTC), reference frame conversions leverage              and Ephem:
astropy.coordinates, and so forth.                                         •    Orbit objects represent an osculating (hence Keplerian)
    One of the drawbacks of existing unit packages is that                      orbit of a dimensionless object around an attractor at a
they impose a significant performance penalty. Even though                      given point in time and a certain reference frame.
astropy.units is integrated with NumPy, hence allowing                     •    Ephem objects represent an ephemerides, a sequence of
the creation of array quantities, all the unit compatibility checks             spatial coordinates over a period of time in a certain
are implemented in Python and require lots of introspection, and                reference frame.
this can slow down mathematical operations by several orders of             There are six parameters that uniquely determine a Keplerian
magnitude. As such, to fulfill our desired performance requirement      orbit, plus the gravitational parameter of the corresponding attrac-
for poliastro, we envisioned a two-layer architecture:                  tor (k or µ). Optionally, an epoch that contextualizes the orbit
      •    The Core API follows a procedural style, and all the         can be included as well. This set of six parameters is not unique,
           functions receive Python numerical types and NumPy           and several of them have been developed over the years to serve
           arrays for maximum performance.                              different purposes. The most widely used ones are:
      •    The High level API is object-oriented, all the methods          •    Cartesian elements: Three components for the position
           receive Astropy Quantity objects with physical units,                (x, y, z) and three components for the velocity (vx , vy , vz ).
           and computations are deferred to the Core API.                       This set has no singularities.
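The following minimal sketch illustrates this two-layer pattern. It is not code from poliastro itself, and the function names are hypothetical: the core function receives plain floats so it can be JIT compiled, while the high level wrapper converts Astropy quantities on the way in and out.

import math

from astropy import units as u
from numba import njit


@njit
def orbital_period_core(k, a):
    # Core layer: plain floats only, so Numba can compile it
    return 2 * math.pi * math.sqrt(a**3 / k)


def orbital_period(k, a):
    # High level layer: accepts Quantity objects, strips the units,
    # defers to the core function and reattaches the unit at the end
    return orbital_period_core(
        k.to_value(u.km**3 / u.s**2),
        a.to_value(u.km),
    ) << u.s


# Period of an orbit with a 7000 km semimajor axis around the Earth
period = orbital_period(398600.4 << u.km**3 / u.s**2, 7000 << u.km)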
can be used from Python, and one of them is using a fast, compiled language for the CPU intensive parts. Successful examples of this include NumPy, written in C [HMvdW+20], SciPy, featuring a mix of FORTRAN, C, and C++ code [VGO+20], and pandas, making heavy use of Cython [BBC+11]. However, having to write code in two different languages hinders the development speed, makes debugging more difficult, and narrows the potential contributor base (what the Julia creators called "The Two Language Problem" [BEKS17]).
    As authors of poliastro we wanted to use Python as the sole programming language of the implementation, and the best solution we found to improve its performance was to use Numba, an LLVM-based Python JIT compiler [LPS15].

Usage

Basic Orbit and Ephem creation

The two central objects of the poliastro high level API are Orbit and Ephem:

   •   Orbit objects represent an osculating (hence Keplerian) orbit of a dimensionless object around an attractor at a given point in time and a certain reference frame.
   •   Ephem objects represent an ephemeris, a sequence of spatial coordinates over a period of time in a certain reference frame.

    There are six parameters that uniquely determine a Keplerian orbit, plus the gravitational parameter of the corresponding attractor (k or µ). Optionally, an epoch that contextualizes the orbit can be included as well. This set of six parameters is not unique, and several of them have been developed over the years to serve different purposes. The most widely used ones are:

   •   Cartesian elements: Three components for the position (x, y, z) and three components for the velocity (vx, vy, vz). This set has no singularities.
   •   Classical Keplerian elements: Two components for the shape of the conic (usually the semimajor axis a or semiparameter p and the eccentricity e), three Euler angles for the orientation of the orbital plane in space (inclination i, right ascension of the ascending node Ω, and argument of periapsis ω), and one polar angle for the position of the body along the conic (usually true anomaly f or ν). This set of elements has an easy geometrical interpretation and the advantage that, in pure two-body motion, five of them are fixed (a, e, i, Ω, ω) and only one is time-dependent (ν), which greatly simplifies the analytical treatment of orbital perturbations. However, they suffer from singularities stemming from the Euler angles ("gimbal lock"), and equations expressed in them are ill-conditioned near such singularities.
   •   Walker modified equinoctial elements: Six parameters (p, f, g, h, k, L). Only L is time-dependent and this set has no singularities; however, the geometrical interpretation of the rest of the elements is lost [WIO85].

    Here is how to create an Orbit from Cartesian and from classical Keplerian elements. Walker modified equinoctial elements are supported as well.

from astropy import units as u

from poliastro.bodies import Earth, Sun
from poliastro.twobody import Orbit
from poliastro.constants import J2000

# Data from Curtis, example 4.3
r = [-6045, -3490, 2500] << u.km
v = [-3.457, 6.618, 2.533] << u.km / u.s

orb_curtis = Orbit.from_vectors(
    Earth,  # Attractor
    r, v    # Elements
)

# Data for Mars at J2000 from JPL HORIZONS
a = 1.523679 << u.au
ecc = 0.093315 << u.one
inc = 1.85 << u.deg
raan = 49.562 << u.deg
argp = 286.537 << u.deg
nu = 23.33 << u.deg

orb_mars = Orbit.from_classical(
    Sun,
    a, ecc, inc, raan, argp, nu,
    J2000  # Epoch
)

When displayed on an interactive REPL, Orbit objects provide basic information about the geometry, the attractor, and the epoch:

>>> orb_curtis
7283 x 10293 km x 153.2 deg (GCRS) orbit
around Earth (X) at epoch J2000.000 (TT)

>>> orb_mars
1 x 2 AU x 1.9 deg (HCRS) orbit
around Sun (X) at epoch J2000.000 (TT)

Similarly, Ephem objects can be created using a variety of classmethods as well. Thanks to the astropy.coordinates built-in low-fidelity ephemerides, as well as its capability to remotely access the JPL HORIZONS system, the user can seamlessly build an object that contains the time history of the position of any Solar System body:

from astropy.time import Time
from astropy.coordinates import solar_system_ephemeris

from poliastro.ephem import Ephem

# Configure high fidelity ephemerides globally
# (requires network access)
solar_system_ephemeris.set("jpl")

# For predefined poliastro attractors
earth = Ephem.from_body(Earth, Time.now().tdb)

# For the rest of the Solar System bodies
ceres = Ephem.from_horizons("Ceres", Time.now().tdb)

There are some crucial differences between Orbit and Ephem objects:

   •   Orbit objects have an attractor, whereas Ephem objects do not. Ephemerides can originate from complex trajectories that don't necessarily conform to the ideal two-body problem.
   •   Orbit objects capture a precise instant in a two-body motion plus the necessary information to propagate it forward in time indefinitely, whereas Ephem objects represent a bounded time history of a trajectory. This is because the equations for the two-body motion are known, whereas an ephemeris is either an observation or a prediction that cannot be extrapolated in any case without external knowledge. As such, Orbit objects have a .propagate method, but Ephem ones do not. This prevents users from attempting to propagate the position of the planets, which will always yield poor results compared to the excellent ephemerides calculated by external entities.

    Finally, both types have methods to convert between them:

   •   Ephem.from_orbit is the equivalent of sampling a two-body motion over a given time interval. As explained above, the resulting Ephem loses the information about the original attractor.
   •   Orbit.from_ephem is the equivalent of calculating the osculating orbit at a certain point of a trajectory, assuming a given attractor. The resulting Orbit loses the information about the original, potentially complex trajectory.
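As an illustration of these conversions (this example is not part of the original text, variable names are illustrative, and it assumes Ephem.from_orbit accepts an Orbit plus an array of epochs), a two-body motion can be sampled into an Ephem and an osculating Orbit recovered from it:

import numpy as np
from astropy import units as u

from poliastro.bodies import Earth
from poliastro.ephem import Ephem
from poliastro.examples import iss
from poliastro.twobody import Orbit

# Sample the ISS two-body motion at one epoch per minute over one hour
epochs = iss.epoch + np.linspace(0, 60, num=61) * u.min
iss_ephem = Ephem.from_orbit(iss, epochs)

# Recover an osculating Orbit at the first epoch,
# reattaching Earth as the attractor
iss_again = Orbit.from_ephem(Earth, iss_ephem, epochs[0])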
Orbit propagation

Orbit objects have a .propagate method that takes an elapsed time and returns another Orbit with new orbital elements and an updated epoch:

>>> from poliastro.examples import iss
>>> iss
6772 x 6790 km x 51.6 deg (GCRS) ...

>>> iss.nu.to(u.deg)
<Quantity 46.59580468 deg>

>>> iss_30m = iss.propagate(30 << u.min)

>>> (iss_30m.epoch - iss.epoch).datetime
datetime.timedelta(seconds=1800)

>>> (iss_30m.nu - iss.nu).to(u.deg)
<Quantity 116.54513153 deg>

The default propagation algorithm is an analytical procedure described in [FCM13] that works seamlessly in the near parabolic region. In addition, poliastro implements analytical propagation algorithms as described in [DB83], [OG86], [Mar95], [Mik87], [PP13], [Cha22], and [VM07].
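As a quick sanity check (this snippet is not in the original text), propagating by one full orbital period should yield essentially the same true anomaly again, since the default algorithm models pure two-body motion; any residual is numerical error:

>>> iss_period = iss.propagate(iss.period)
>>> (iss_period.nu - iss.nu).to(u.deg)  # approximately 0 deg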
Fig. 2: Osculating (Keplerian) vs perturbed (true) orbit (source: Wikipedia, CC BY-SA 3.0)

Natural perturbations

As showcased in Figure 2, at any point in a trajectory we can define an ideal Keplerian orbit with the same position and velocity under the attraction of a point mass: this is called the osculating orbit. Some numerical propagation methods exist that model the true, perturbed orbit as a deviation from an evolving, osculating orbit. poliastro implements Cowell's method [CC10], which consists of adding all the perturbation accelerations and then integrating the resulting differential equation with any numerical method of choice:

    d²r/dt² = −(µ/r³) r + a_d                                    (6)

The resulting equation is usually integrated using high order numerical methods, since the integration times are quite large and the tolerances comparatively tight. An in-depth discussion of such methods can be found in [HNW09]. poliastro uses Dormand-Prince 8(5,3) (DOP853), a commonly used method available in SciPy [HMvdW+20].
    There are several natural perturbations included: J2 and J3 gravitational terms, several atmospheric drag models (exponential, [Jac77], [AAAA62], [AAA+76]), and helpers for third body gravitational attraction and radiation pressure as described in [?].

import numpy as np
from numba import njit

# Assumes J2_perturbation, atmospheric_drag_exponential, func_twobody,
# cowell and propagate have been imported from the corresponding
# poliastro modules, and that R, C_D, A_over_m, H0 and rho0 (attractor
# radius, drag coefficient, area-to-mass ratio and exponential
# atmosphere parameters) are defined elsewhere.

@njit
def combined_a_d(
    t0, state, k, j2, r_eq, c_d, a_over_m, h0, rho0
):
    return (
        J2_perturbation(
            t0, state, k, j2, r_eq
        ) + atmospheric_drag_exponential(
            t0, state, k, r_eq, c_d, a_over_m, h0, rho0
        )
    )


def f(t0, state, k):
    du_kep = func_twobody(t0, state, k)
    ax, ay, az = combined_a_d(
        t0,
        state,
        k,
        j2=Earth.J2.value,
        r_eq=R,
        c_d=C_D,
        a_over_m=A_over_m,
        h0=H0,
        rho0=rho0,
    )
    du_ad = np.array([0, 0, 0, ax, ay, az])

    return du_kep + du_ad


# orbit is an Orbit and tofs an array of times of flight
rr = propagate(
    orbit,
    tofs,
    method=cowell,
    f=f,
)

Continuous thrust control laws

Beyond natural perturbations, spacecraft can modify their trajectory on purpose by using impulsive maneuvers (as explained in the next section) as well as continuous thrust guidance laws. The user can define custom guidance laws by providing a perturbation acceleration in the same way natural perturbations are used (a sketch of this pattern follows the example below). In addition, poliastro includes several analytical solutions for continuous thrust guidance laws with specific purposes, as studied in [CR17]: optimal transfer between circular coplanar orbits [Ede61] [Bur67], optimal transfer between circular inclined orbits [Ede61] [Kec97], quasi-optimal eccentricity-only change [Pol97], simultaneous eccentricity and inclination change [Pol00], and argument of periapsis adjustment [Pol98]. A much more rigorous analysis of a similar set of laws can be found in [DCV21].

from poliastro.twobody.thrust import change_ecc_inc

ecc_f = 0.0 << u.one
inc_f = 20.0 << u.deg
f = 2.4e-6 << (u.km / u.s**2)

a_d, _, t_f = change_ecc_inc(orbit, ecc_f, inc_f, f)
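The guidance acceleration returned by such functions can then be fed to Cowell's method like any natural perturbation. The following sketch is not from the original text; it reuses func_twobody, cowell, propagate, orbit and tofs from the previous example and assumes the returned a_d callable follows the same (t0, state, k) convention as the perturbation accelerations shown earlier:

def f_guided(t0, state, k):
    # Keplerian part plus the thrust guidance acceleration
    du_kep = func_twobody(t0, state, k)
    ax, ay, az = a_d(t0, state, k)
    du_ad = np.array([0, 0, 0, ax, ay, az])
    return du_kep + du_ad


rr = propagate(
    orbit,
    tofs,
    method=cowell,
    f=f_guided,
)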
Impulsive maneuvers

Impulsive maneuvers are modeled considering a change in the velocity of a spacecraft while its position remains fixed. The poliastro.maneuver.Maneuver class provides various constructors to instantiate popular impulsive maneuvers in the framework of the non-perturbed two-body problem:

   •   Maneuver.impulse
   •   Maneuver.hohmann
   •   Maneuver.bielliptic
   •   Maneuver.lambert

from poliastro.maneuver import Maneuver

orb_i = Orbit.circular(Earth, alt=700 << u.km)
hoh = Maneuver.hohmann(orb_i, r_f=36000 << u.km)

Once instantiated, Maneuver objects provide information regarding total ∆v and ∆t:

>>> hoh.get_total_cost()
<Quantity 3.6173981270031357 km / s>

>>> hoh.get_total_time()
<Quantity 15729.741535747102 s>

Maneuver objects can be applied to Orbit instances using the apply_maneuver method:

>>> orb_i
7078 x 7078 km x 0.0 deg (GCRS) orbit
around Earth (X)

>>> orb_f = orb_i.apply_maneuver(hoh)
>>> orb_f
36000 x 36000 km x 0.0 deg (GCRS) orbit
around Earth (X)
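The remaining constructors follow the same pattern. For instance, a one-burn maneuver can be built and applied to the same initial orbit (this snippet is not from the original text and assumes Maneuver.impulse accepts a single velocity change expressed as a Quantity vector):

from astropy import units as u

from poliastro.maneuver import Maneuver

# A single 10 m/s burn expressed as a velocity change vector
dv = [0, 10, 0] << (u.m / u.s)
single_burn = Maneuver.impulse(dv)

# Apply it to the 700 km circular orbit defined above
orb_burned = orb_i.apply_maneuver(single_burn)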
Targeting

Targeting is the problem of finding the orbit connecting two positions over a finite amount of time. Within the context of the non-perturbed two-body problem, targeting is just a matter of solving the boundary value problem (BVP), also known as Lambert's problem. Because targeting tries to find an orbit, the problem is included in the Initial Orbit Determination field.
    The poliastro.iod package contains izzo and vallado modules. These provide a lambert function for solving the targeting problem. Nevertheless, a Maneuver.lambert constructor is also provided so users can keep taking advantage of Orbit objects.

# Declare departure and arrival datetimes
date_launch = time.Time(
    '2011-11-26 15:02', scale='tdb'
)
date_arrival = time.Time(
    '2012-08-06 05:17', scale='tdb'
)

# Define initial and final orbits
orb_earth = Orbit.from_ephem(
    Sun, Ephem.from_body(Earth, date_launch),
    date_launch
)
orb_mars = Orbit.from_ephem(
    Sun, Ephem.from_body(Mars, date_arrival),
    date_arrival
)

# Compute targeting maneuver and apply it
man_lambert = Maneuver.lambert(orb_earth, orb_mars)
orb_trans, orb_target = orb_earth.apply_maneuver(
    man_lambert, intermediate=True
)

    Targeting is closely related to quick mission design by means of porkchop diagrams. These are contour plots showing all combinations of departure and arrival dates with the specific energy for each transfer orbit. They allow for quick identification of the optimal transfer dates between two bodies.
    The poliastro.plotting.porkchop module provides the PorkchopPlotter class, which allows the user to generate these diagrams.

from poliastro.plotting.porkchop import (
    PorkchopPlotter
)
from poliastro.util import time_range

# Generate all launch and arrival dates
launch_span = time_range(
    "2020-03-01", end="2020-10-01", periods=int(150)
)
arrival_span = time_range(
    "2020-10-01", end="2021-05-01", periods=int(150)
)

# Create an instance of the porkchop and plot it
porkchop = PorkchopPlotter(
    Earth, Mars, launch_span, arrival_span,
)

The previous code, with some additional customization, generates figure 3.

Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy showing latest missions to the Martian planet.

Plotting

For visualization purposes, poliastro provides the poliastro.plotting package, which contains various utilities for generating 2D and 3D graphics using different backends such as matplotlib [Hun07] and Plotly [Inc15].
    Generated graphics can be static or interactive. The main difference between these two is the ability to modify the camera view in a dynamic way when using interactive plotters.
    The most important classes in the poliastro.plotting package are StaticOrbitPlotter and OrbitPlotter3D. In addition, the poliastro.plotting.misc module contains the plot_solar_system function, which allows the user to visualize the inner and outer Solar System both in 2D and 3D, as requested by users.
    The following example illustrates the plotting capabilities of poliastro. At first, the orbits to be plotted are computed and their plotting style is declared:

from poliastro.plotting.misc import plot_solar_system

# Current datetime
now = Time.now().tdb

# Obtain Florence and Halley orbits
florence = Orbit.from_sbdb("Florence")
halley_1835_ephem = Ephem.from_horizons(
    "90000031", now
)
halley_1835 = Orbit.from_ephem(
    Sun, halley_1835_ephem, halley_1835_ephem.epochs[0]
)

# Define orbit labels and color style
florence_style = {"label": "Florence", "color": "#000000"}
halley_style = {"label": "Halley", "color": "#84B0B8"}

The static two-dimensional plot can be created using the following code:

# Generate a static 2D figure
frame2D = plot_solar_system(
    epoch=now, outer=False
)
frame2D.plot(florence, **florence_style)
frame2D.plot(halley_1835, **halley_style)

As a result, figure 4 is obtained.
    The interactive three-dimensional plot can be created using the following code:
# Generate an interactive 3D figure
frame3D = plot_solar_system(
    epoch=now, outer=False,
    use_3d=True, interactive=True
)
frame3D.plot(florence, **florence_style)
frame3D.plot(halley_1835, **halley_style)

As a result, figure 5 is obtained.

Fig. 4: Two-dimensional view of the inner Solar System, Florence, and Halley.

Fig. 5: Three-dimensional view of the inner Solar System, Florence, and Halley.

Commercial Earth satellites

Fig. 6: Natural perturbations affecting Low-Earth Orbit (LEO) motion (source: [VM07])

Figure 6 gives a clear picture of the most important natural perturbations affecting satellites in LEO, namely: the first harmonic of the geopotential field J2 (representing the attractor oblateness), the atmospheric drag, and the higher order harmonics of the geopotential field.
    At least the most significant of these perturbations need to be taken into account when propagating LEO orbits, and therefore the methods for purely Keplerian motion are not enough. As seen above, poliastro implements a number of these perturbations already; however, numerical methods are much slower than analytical ones, and this can render them unsuitable for large scale simulations, satellite conjunction assessment, propagation in constrained hardware, and so forth.
    To address this issue, semianalytical propagation methods were devised that attempt to strike a balance between the fast running times of analytical methods and the necessary inclusion of perturbation forces. One such family of semianalytical methods is the Simplified General Perturbation (SGP) models, first developed in [HK66] and then refined in [LC69] into what we know these days as the SGP4 propagator [HR80] [VCHK06]. Even though certain elements of the reference frame used by SGP4 are not properly specified [VCHK06] and its accuracy might still be too limited for certain applications [Ko09] [Lar16], it is nowadays the most widely used propagation method, thanks in large part to the dissemination of General Perturbations orbital data by the US 501(c)(3) CelesTrak (which itself obtains it from the 18th Space Defense Squadron of the US Space Force).
    The starting point of SGP4 is a special element set that uses Brouwer mean orbital elements [Bro59] plus a ballistic coefficient based on an approximation of the atmospheric drag [LC69], and its results are expressed in a special coordinate system called True Equator Mean Equinox (TEME). Special care needs to be taken to avoid mixing mean elements with osculating elements, and to convert the output of the propagation to the appropriate reference frame. These element sets have been traditionally distributed in a compact text representation called Two-Line Element sets (TLEs) (see figure 7 for an example). However, this format is quite cryptic and suffers from a number of shortcomings, so recently there has been a push to use the Orbit Data Messages international standard developed by the Consultative Committee for Space Data Systems (CCSDS 502.0-B-2).

1 25544U 98067A   22156.15037205 .00008547 00000+0 15823-3 0 9994
2 25544 51.6449   36.2070 0004577 196.3587 298.4146 15.49876730343319

Fig. 7: Two-Line Element set (TLE) for the ISS (retrieved on 2022-06-05)

    At the moment, general perturbations data both in OMM and TLE format can be integrated with poliastro thanks to the sgp4 Python library and the Ephem class as follows:

from warnings import warn

from astropy import units as u
from astropy.coordinates import (
    TEME,
    GCRS,
    CartesianRepresentation,
    CartesianDifferential,
)

from poliastro.ephem import Ephem
from poliastro.frames import Planes


def ephem_from_gp(sat, times):
    errors, rs, vs = sat.sgp4_array(times.jd1, times.jd2)
    if not (errors == 0).all():
        warn(
            "Some objects could not be propagated, "
            "proceeding with the rest",
            stacklevel=2,
        )
        rs = rs[errors == 0]
        vs = vs[errors == 0]
        times = times[errors == 0]

    cart_teme = CartesianRepresentation(
        rs << u.km,
        xyz_axis=-1,
        differentials=CartesianDifferential(
            vs << (u.km / u.s),
            xyz_axis=-1,
        ),
    )

    cart_gcrs = (
        TEME(cart_teme, obstime=times)
        .transform_to(GCRS(obstime=times))
        .cartesian
    )

    return Ephem(
        cart_gcrs,
        times,
        plane=Planes.EARTH_EQUATOR,
    )
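A typical way of using this helper (this usage snippet is not part of the original listing; line1 and line2 are placeholders for the two TLE lines) is to build the satellite record with the sgp4 library's Satrec.twoline2rv constructor and then sample it over a time window:

import numpy as np

from astropy import units as u
from astropy.time import Time
from sgp4.api import Satrec

# line1 and line2 are assumed to hold the two lines of a TLE such as
# the one shown in figure 7, e.g. read from a file or from CelesTrak
sat = Satrec.twoline2rv(line1, line2)

# One epoch per minute over a 90 minute window
times = Time("2022-06-05 00:00:00", scale="utc") + (
    np.linspace(0, 90, num=91) * u.min
)

iss_gcrs = ephem_from_gp(sat, times)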
However, no native integration with SGP4 has been implemented yet in poliastro, for technical and non-technical reasons. On one hand, this propagator is too different from the other methods, and we have not yet devised how to add it to the library in a way that does not create confusion. On the other hand, adding such a propagator to poliastro would probably open the floodgates of corporate users of the library, and we would like to first devise a sustainability strategy for the project, which is addressed in the next section.

Future work

Despite the fact that poliastro has existed for almost a decade, for most of its history it has been developed by volunteers in their free time, and only in the past five years has it received funding through various Summer of Code programs (SOCIS 2017, GSOC 2018-2021) and institutional grants (NumFOCUS 2020, 2021). The funded work has had an overwhelmingly positive impact on the project; however, the lack of a dedicated maintainer has caused some technical debt to accrue over the years, and some parts of the project are in need of refactoring or better documentation.
    Historically, poliastro has tried to implement algorithms that were applicable for all the planets in the Solar System; however, some of them have proved to be very difficult to generalize for bodies other than the Earth. For cases like these, poliastro ships a poliastro.earth package, but going forward we would like to continue embracing a generic approach that can serve other bodies as well.
    Several open source projects have successfully used poliastro or were created taking inspiration from it, like spacetech-ssa by IBM1 or mubody [BBVPFSC22]. AGI (previously Analytical Graphics, Inc., now Ansys Government Initiatives) published a series of scripts to automate the commercial tool STK from Python leveraging poliastro2. However, we have observed that there is still lots of repeated code across similar open source libraries written in Python, which means that there is an opportunity to provide a "kernel" of algorithms that can be easily reused. Although poliastro.core started as a separate layer to isolate fast, non-safe functions as described above, we think we could move it to an external package so it can be depended upon by projects that do not want to use some of the higher level poliastro abstractions or drag its large number of heavy dependencies.
    Finally, the sustainability of the project cannot yet be taken for granted: the project has reached a level of complexity that already warrants dedicated development effort that cannot be covered with short-lived grants. Such funding could potentially come from the private sector, but although there is evidence that several for-profit companies are using poliastro, we have very little information about how it is being used and what problems those users are having, let alone what avenues for funded work could potentially work. Organizations like the Libre Space Foundation advocate for a strong copyleft licensing model to convince commercial actors to contribute to the commons, but in principle that goes against the permissive licensing that the wider Scientific Python ecosystem, including poliastro, has adopted. With the advent of new business models and the ever increasing reliance on open source by the private sector, a variety of ways to engage commercial users and include them in the conversation exist. However, these have not been explored yet.

    1. https://github.com/IBM/spacetech-ssa
    2. https://github.com/AnalyticalGraphicsInc/STKCodeExamples/

Acknowledgements

The authors would like to thank Prof. Michèle Lavagna for her original guidance and inspiration, David A. Vallado for his encouragement and for publishing the source code for the algorithms from his book for free, Dr. T.S. Kelso for his tireless efforts in maintaining CelesTrak, Alejandro Sáez for sharing the dream of a better way, Prof. Dr. Manuel Sanjurjo Rivo for believing in my work, Helge Eichhorn for his enthusiasm and decisive influence in poliastro, the whole OpenAstronomy collaboration for opening the door for us, the NumFOCUS organization for their immense support, and Alexandra Elbakyan for enabling scientific progress worldwide.

REFERENCES

[AAA+76]    United States Committee on Extension to the Standard Atmosphere, United States National Aeronautics, Space Administration, United States National Oceanic, Atmospheric Administration, and United States Air Force. U.S. Standard Atmosphere, 1976. NOAA - SIT 76-1562. National Oceanic and Amospheric [sic] Administration, 1976. URL: https://books.google.es/books?id=x488AAAAIAAJ.
[AAAA62]    United States Committee on Extension to the Standard Atmosphere, United States National Aeronautics, Space Administration, and United States Environmental Science Services Administration. U.S. Standard Atmosphere, 1962: ICAO Standard Atmosphere to 20 Kilometers; Proposed ICAO Extension to 32 Kilometers; Tables and Data to 700 Kilometers. U.S. Government Printing Office, 1962. URL: https://books.google.es/books?id=fWdTAAAAMAAJ.
[Bat99]     Richard H. Battin. An Introduction to the Mathematics and Methods of Astrodynamics, Revised Edition. American Institute of Aeronautics and Astronautics, Inc., Reston, VA, January 1999. URL: https://arc.aiaa.org/doi/book/10.2514/4.861543, doi:10.2514/4.861543.
[BBC+11]    Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(2):31-39, March 2011. URL: http://ieeexplore.ieee.org/document/5582062/, doi:10.1109/MCSE.2010.118.
[BBVPFSC22] Juan Bermejo Ballesteros, José María Vergara Pérez, Alejandro Fernández Soler, and Javier Cubas. Mubody, an astrodynamics open-source Python library focused on libration points. Barcelona, Spain, April 2022. URL: https://sseasymposium.org/wp-content/uploads/2022/04/4thSSEA_AllAbstracts.pdf.

[BEKS17]    Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Vi-                        Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant,
            ral B. Shah.        Julia: A Fresh Approach to Numerical                      Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer
            Computing. SIAM Review, 59(1):65–98, January 2017.                            Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro-
            URL: https://epubs.siam.org/doi/10.1137/141000671, doi:                       gramming with NumPy. Nature, 585(7825):357–362, Septem-
            10.1137/141000671.                                                            ber 2020. URL: https://www.nature.com/articles/s41586-020-
[Bro59]     Dirk Brouwer. Solution of the problem of artificial satellite                 2649-2, doi:10.1038/s41586-020-2649-2.
            theory without drag. The Astronomical Journal, 64:378,           [HNW09]      E. Hairer, S. P. Nørsett, and Gerhard Wanner. Solving ordi-
            November 1959. URL: http://adsabs.harvard.edu/cgi-bin/bib_                    nary differential equations I: nonstiff problems. Number 8
            query?1959AJ.....64..378B, doi:10.1086/107958.                                in Springer series in computational mathematics. Springer,
[Bur67]     E.G.C. Burt.            On space manoeuvres with con-                         Heidelberg ; London, 2nd rev. ed edition, 2009. OCLC:
            tinuous thrust.            Planetary and Space Science,                       ocn620251790.
            15(1):103–122,       January     1967.         URL:     https:   [HR80]       Felix R. Hoots and Ronald L. Roehrich. Models for prop-
            //linkinghub.elsevier.com/retrieve/pii/0032063367900700,                      agation of NORAD element sets. Technical report, Defense
            doi:10.1016/0032-0633(67)90070-0.                                             Technical Information Center, Fort Belvoir, VA, December
[CC10]      Philip Herbert Cowell and Andrew Claude Crommelin. Inves-                     1980. URL: http://www.dtic.mil/docs/citations/ADA093554.
            tigation of the Motion of Halley’s Comet from 1759 to 1910.      [Hun07]      J. D. Hunter. Matplotlib: A 2D graphics environment. Com-
            Neill & Company, limited, 1910.                                               puting in Science & Engineering, 9(3):90–95, 2007. Pub-
[Cha22]     Kevin Charls. Recursive solution to Kepler’s problem for                      lisher: IEEE COMPUTER SOC. doi:10.1109/MCSE.
            elliptical orbits - application in robust Newton-Raphson and                  2007.55.
            co-planar closest approach estimation. 2022. Publisher:          [IBD+ 20]    Dario Izzo, Will Binns, Dariomm098, Alessio Mereta,
            Unpublished Version Number: 1. URL: https://rgdoi.net/                        Christopher Iliffe Sprague, Dhennes, Bert Van Den Abbeele,
            10.13140/RG.2.2.18578.58563/1, doi:10.13140/RG.2.                             Chris Andre, Krzysztof Nowak, Nat Guy, Alberto Isaac Bar-
            2.18578.58563/1.                                                              quín Murguía, Pablo, Frédéric Chapoton, GiacomoAcciarini,
[Con14]     Bruce A. Conway. Spacecraft trajectory optimization. Num-                     Moritz V. Looz, Dietmarwo, Mike Heddes, Anatoli Babenia,
            ber 29 in Cambridge aerospace series. Cambridge university                    Baptiste Fournier, Johannes Simon, Jonathan Willitts, Ma-
            press, Cambridge (GB), 2014.                                                  teusz Polnik, Sanjeev Narayanaswamy, The Gitter Badger,
[CR17]      Juan Luis Cano Rodríguez. Study of analytical solutions for                   and Jack Yarndley. esa/pykep: Optimize, October 2020.
            low-thrust trajectories. Master’s thesis, Universidad Politéc-                URL: https://zenodo.org/record/4091753, doi:10.5281/
            nica de Madrid, March 2017.                                                   ZENODO.4091753.
[DB83]      J. M. A. Danby and T. M. Burkardt. The solution of Kepler’s      [Inc15]      Plotly Technologies Inc. Collaborative data science, 2015.
            equation, I. Celestial Mechanics, 31(2):95–107, October                       Place: Montreal, QC Publisher: Plotly Technologies Inc. URL:
            1983. URL: http://link.springer.com/10.1007/BF01686811,                       https://plot.ly.
            doi:10.1007/BF01686811.                                          [Jac77]      L. G. Jacchia. Thermospheric Temperature, Density, and
[DCV21]     Marilena Di Carlo and Massimiliano Vasile. Analytical                         Composition: New Models. SAO Special Report, 375, March
            solutions for low-thrust orbit transfers. Celestial Mechanics                 1977. ADS Bibcode: 1977SAOSR.375.....J. URL: https:
            and Dynamical Astronomy, 133(7):33, July 2021. URL: https:                    //ui.adsabs.harvard.edu/abs/1977SAOSR.375.....J.
            //link.springer.com/10.1007/s10569-021-10033-9, doi:10.          [JGAZJT+ 18] Nathan J. Goldbaum, John A. ZuHone, Matthew J. Turk,
            1007/s10569-021-10033-9.                                                      Kacper Kowalik, and Anna L. Rosen. unyt: Handle, ma-
[Dro53]     S. Drobot. On the foundations of Dimensional Analysis.                        nipulate, and convert data with units in Python. Jour-
            Studia Mathematica, 14(1):84–99, 1953. URL: http://www.                       nal of Open Source Software, 3(28):809, August 2018.
            impan.pl/get/doi/10.4064/sm-14-1-84-99, doi:10.4064/                          URL: http://joss.theoj.org/papers/10.21105/joss.00809, doi:
            sm-14-1-84-99.                                                                10.21105/joss.00809.
[Dub73]     G. N. Duboshin. Book Review: Samuel Herrick. Astrodynam-         [Kec97]      Jean Albert Kechichian. Reformulation of Edelbaum’s Low-
            ics. Soviet Astronomy, 16:1064, June 1973. ADS Bibcode:                       Thrust Transfer Problem Using Optimal Control Theory.
            1973SvA....16.1064D. URL: https://ui.adsabs.harvard.edu/                      Journal of Guidance, Control, and Dynamics, 20(5):988–
            abs/1973SvA....16.1064D.                                                      994, September 1997. URL: https://arc.aiaa.org/doi/10.2514/
[Ede61]     Theodore N. Edelbaum. Propulsion Requirements for Con-                        2.4145, doi:10.2514/2.4145.
            trollable Satellites. ARS Journal, 31(8):1079–1089, August       [Ko09]       TS Kelso and others. Analysis of the Iridium 33-Cosmos
            1961. URL: https://arc.aiaa.org/doi/10.2514/8.5723, doi:                      2251 collision. Advances in the Astronautical Sciences,
            10.2514/8.5723.                                                               135(2):1099–1112, 2009. Publisher: Citeseer.
[FCM13]     Davide Farnocchia, Davide Bracali Cioci, and Andrea Milani.      [KRKP+ 16]   Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez,
            Robust resolution of Kepler’s equation in all eccentricity                    Brian E Granger, Matthias Bussonnier, Jonathan Frederic,
            regimes. Celestial Mechanics and Dynamical Astronomy,                         Kyle Kelley, Jessica B Hamrick, Jason Grout, Sylvain Cor-
            116(1):21–34, May 2013. URL: http://link.springer.com/10.                     lay, and others. Jupyter Notebooks-a publishing format for
            1007/s10569-013-9476-9, doi:10.1007/s10569-013-                               reproducible computational workflows., volume 2016. 2016.
            9476-9.                                                          [Lar16]      Martin Lara. Analytical and Semianalytical Propagation
[Fin07]     D Finkleman. "TLE or Not TLE?" That is the Question (AAS                      of Space Orbits: The Role of Polar-Nodal Variables. In
            07-126). ADVANCES IN THE ASTRONAUTICAL SCIENCES,                              Gerard Gómez and Josep J. Masdemont, editors, Astro-
            127(1):401, 2007. Publisher: Published for the American                       dynamics Network AstroNet-II, volume 44, pages 151–
            Astronautical Society by Univelt; 1999.                                       166. Springer International Publishing, Cham, 2016. Se-
[GBP22]     Mirko Gabelica, Ružica Bojčić, and Livia Puljak. Many                       ries Title: Astrophysics and Space Science Proceedings.
            researchers were not compliant with their published                           URL: http://link.springer.com/10.1007/978-3-319-23986-6_
            data sharing statement: mixed-methods study.            Jour-                 11, doi:10.1007/978-3-319-23986-6_11.
            nal of Clinical Epidemiology, page S089543562200141X,            [LC69]       M. H. Lane and K. Cranford.              An improved ana-
            May 2022. URL: https://linkinghub.elsevier.com/retrieve/                      lytical drag theory for the artificial satellite problem.
            pii/S089543562200141X, doi:10.1016/j.jclinepi.                                In Astrodynamics Conference, Princeton,NJ,U.S.A., August
            2022.05.019.                                                                  1969. American Institute of Aeronautics and Astronautics.
[Her71]     Samuel Herrick. Astrodynamics. Van Nostrand Reinhold Co,                      URL: https://arc.aiaa.org/doi/10.2514/6.1969-925, doi:10.
            London, New York, 1971.                                                       2514/6.1969-925.
[HK66]      CG Hilton and JR Kuhlman. Mathematical models for the            [LPS15]      Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a
            space defense center. Philco-Ford Publication No. U-3871,                     LLVM-based Python JIT compiler. In Proceedings of the Sec-
            17:28, 1966.                                                                  ond Workshop on the LLVM Compiler Infrastructure in HPC
[HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der                       - LLVM ’15, pages 1–6, Austin, Texas, 2015. ACM Press.
            Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric                    URL: http://dl.acm.org/citation.cfm?doid=2833157.2833162,
            Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith,                    doi:10.1145/2833157.2833162.
            Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van           [Mar95]      F. Landis Markley.        Kepler Equation solver.      Celes-
            Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del                   tial Mechanics & Dynamical Astronomy, 63(1):101–111,

            1995. URL: http://link.springer.com/10.1007/BF00691917,                         S. Fabbro, L. A. Ferreira, T. Finethy, R. T. Fox, L. H.
            doi:10.1007/BF00691917.                                                         Garrison, S. L. J. Gibbons, D. A. Goldstein, R. Gommers, J. P.
[Mik87]     Seppo Mikkola. A cubic approximation for Kepler’s equa-                         Greco, P. Greenfield, A. M. Groener, F. Grollier, A. Hagen,
            tion.    Celestial Mechanics, 40(3-4):329–334, September                        P. Hirst, D. Homeier, A. J. Horton, G. Hosseinzadeh, L. Hu,
            1987. URL: http://link.springer.com/10.1007/BF01235850,                         J. S. Hunkeler, Ž. Ivezić, A. Jain, T. Jenness, G. Kanarek,
            doi:10.1007/BF01235850.                                                         S. Kendrew, N. S. Kern, W. E. Kerzendorf, A. Khvalko,
[MKDVB+ 19] Michael Mommert, Michael Kelley, Miguel De Val-Borro,                           J. King, D. Kirkby, A. M. Kulkarni, A. Kumar, A. Lee,
            Jian-Yang Li, Giannina Guzman, Brigitta Sipőcz, Josef                          D. Lenz, S. P. Littlefair, Z. Ma, D. M. Macleod, M. Mastropi-
            Ďurech, Mikael Granvik, Will Grundy, Nick Moskovitz,                           etro, C. McCully, S. Montagnac, B. M. Morris, M. Mueller,
            Antti Penttilä, and Nalin Samarasinha. sbpy: A Python                           S. J. Mumford, D. Muna, N. A. Murphy, S. Nelson, G. H.
            module for small-body planetary astronomy.                 Jour-                Nguyen, J. P. Ninan, M. Nöthe, S. Ogaz, S. Oh, J. K. Parejko,
            nal of Open Source Software, 4(38):1426, June 2019.                             N. Parley, S. Pascual, R. Patil, A. A. Patil, A. L. Plunkett,
            URL: http://joss.theoj.org/papers/10.21105/joss.01426, doi:                     J. X. Prochaska, T. Rastogi, V. Reddy Janga, J. Sabater,
            10.21105/joss.01426.                                                            P. Sakurikar, M. Seifert, L. E. Sherbert, H. Sherwood-Taylor,
[noa]       Astrodynamics.jl.          URL: https://github.com/JuliaSpace/                  A. Y. Shih, J. Sick, M. T. Silbiger, S. Singanamalla, L. P.
            Astrodynamics.jl.                                                               Singer, P. H. Sladen, K. A. Sooley, S. Sornarajah, O. Stre-
[noa18]     gpredict, January 2018.          URL: https://github.com/csete/                 icher, P. Teuben, S. W. Thomas, G. R. Tremblay, J. E. H.
            gpredict/releases/tag/v2.2.1.                                                   Turner, V. Terrón, M. H. van Kerkwijk, A. de la Vega,
[noa20]     GMAT, July 2020. URL: https://sourceforge.net/projects/                         L. L. Watkins, B. A. Weaver, J. B. Whitmore, J. Woillez,
            gmat/files/GMAT/GMAT-R2020a/.                                                   V. Zabalza, and (Astropy Contributors). The Astropy Project:
[noa21a]    nyx, November 2021. URL: https://gitlab.com/nyx-space/                          Building an Open-science Project and Status of the v2.0
            nyx/-/tags/1.0.0.                                                               Core Package. The Astronomical Journal, 156(3):123, August
[noa21b]    SatNOGS, October 2021.                 URL: https://gitlab.com/                 2018. URL: https://iopscience.iop.org/article/10.3847/1538-
            librespacefoundation/satnogs/satnogs-client/-/tags/1.7.                         3881/aabc4f, doi:10.3847/1538-3881/aabc4f.
[noa22a]    beyond, January 2022. URL: https://pypi.org/project/beyond/         [VCHK06]    David Vallado, Paul Crawford, Ricahrd Hujsak, and T.S.
            0.7.4/.                                                                         Kelso. Revisiting Spacetrack Report #3. In AIAA/AAS Astro-
[noa22b]    celestlab, January 2022.          URL: https://atoms.scilab.org/                dynamics Specialist Conference and Exhibit, Keystone, Col-
            toolboxes/celestlab/3.4.1.                                                      orado, August 2006. American Institute of Aeronautics and
[noa22c]    Orekit, June 2022.         URL: https://gitlab.orekit.org/orekit/               Astronautics. URL: https://arc.aiaa.org/doi/10.2514/6.2006-
            orekit/-/releases/11.2.                                                         6753, doi:10.2514/6.2006-6753.
[noa22d]    SPICE, January 2022. URL: https://naif.jpl.nasa.gov/naif/           [VGO+ 20]   Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt
            toolkit.html.                                                                   Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski,
[noa22e]    tudatpy, January 2022. URL: https://github.com/tudat-team/                      Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté-
            tudatpy/releases/tag/0.6.0.                                                     fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar-
[OG86]      A. W. Odell and R. H. Gooding. Procedures for solving                           rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric
            Kepler’s equation. Celestial Mechanics, 38(4):307–334, April                    Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat,
            1986. URL: http://link.springer.com/10.1007/BF01238923,                         Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde,
            doi:10.1007/BF01238923.                                                         Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin-
[Pol97]     James E Pollard. Simplified approach for assessment of low-                     tero, Charles R. Harris, Anne M. Archibald, Antônio H.
            thrust elliptical orbit transfers. In 25th International Electric               Ribeiro, Fabian Pedregosa, Paul van Mulbregt, SciPy 1.0
            Propulsion Conference, Cleveland, OH, pages 97–160, 1997.                       Contributors, Aditya Vijaykumar, Alessandro Pietro Bardelli,
[Pol98]     James Pollard. Evaluation of low-thrust orbital maneuvers.                      Alex Rothberg, Andreas Hilboll, Andreas Kloeckner, Anthony
            In 34th AIAA/ASME/SAE/ASEE Joint Propulsion Confer-                             Scopatz, Antony Lee, Ariel Rokem, C. Nathan Woods, Chad
            ence and Exhibit, Cleveland,OH,U.S.A., July 1998. Ameri-                        Fulton, Charles Masson, Christian Häggström, Clark Fitzger-
            can Institute of Aeronautics and Astronautics. URL: https:                      ald, David A. Nicholson, David R. Hagen, Dmitrii V. Pasech-
            //arc.aiaa.org/doi/10.2514/6.1998-3486, doi:10.2514/6.                          nik, Emanuele Olivetti, Eric Martin, Eric Wieser, Fabrice
            1998-3486.                                                                      Silva, Felix Lenders, Florian Wilhelm, G. Young, Gavin A.
[Pol00]     J. E. Pollard. Simplified analysis of low-thrust orbital maneu-                 Price, Gert-Ludwig Ingold, Gregory E. Allen, Gregory R. Lee,
            vers. Technical report, Defense Technical Information Center,                   Hervé Audren, Irvin Probst, Jörg P. Dietrich, Jacob Silterra,
            Fort Belvoir, VA, August 2000. URL: http://www.dtic.mil/                        James T Webber, Janko Slavič, Joel Nothman, Johannes Buch-
            docs/citations/ADA384536.                                                       ner, Johannes Kulick, Johannes L. Schönberger, José Vinícius
[PP13]      Adonis Reinier Pimienta-Penalver. Accurate Kepler equation                      de Miranda Cardoso, Joscha Reimer, Joseph Harrington, Juan
            solver without transcendental function evaluations. State                       Luis Cano Rodríguez, Juan Nunez-Iglesias, Justin Kuczynski,
            University of New York at Buffalo, 2013.                                        Kevin Tritz, Martin Thoma, Matthew Newville, Matthias
[Rho20]     Brandon Rhodes. Skyfield: Generate high precision research-                     Kümmerer, Maximilian Bolingbroke, Michael Tartre, Mikhail
            grade positions for stars, planets, moons, and Earth satellites,                Pak, Nathaniel J. Smith, Nikolai Nowaczyk, Nikolay She-
            February 2020.                                                                  banov, Oleksandr Pavlyk, Per A. Brodtkorb, Perry Lee,
[SSM18]     Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An                           Robert T. McGibbon, Roman Feldbauer, Sam Lewis, Sam
            empirical analysis of journal policy effectiveness for                          Tygier, Scott Sievert, Sebastiano Vigna, Stefan Peterson,
            computational reproducibility. Proceedings of the National                      Surhud More, Tadeusz Pudlik, Takuya Oshima, Thomas J.
            Academy of Sciences, 115(11):2584–2589, March 2018.                             Pingel, Thomas P. Robitaille, Thomas Spura, Thouis R. Jones,
            URL:        https://pnas.org/doi/full/10.1073/pnas.1708290115,                  Tim Cera, Tim Leslie, Tiziano Zito, Tom Krauss, Utkarsh
            doi:10.1073/pnas.1708290115.                                                    Upadhyay, Yaroslav O. Halchenko, and Yoshiki Vázquez-
                                                                                            Baeza. SciPy 1.0: fundamental algorithms for scientific
[TPWS+ 18]  The Astropy Collaboration, A. M. Price-Whelan, B. M.
                                                                                            computing in Python. Nature Methods, 17(3):261–272,
            Sipőcz, H. M. Günther, P. L. Lim, S. M. Crawford, S. Conseil,
                                                                                            March 2020. URL: http://www.nature.com/articles/s41592-
            D. L. Shupe, M. W. Craig, N. Dencheva, A. Ginsburg, J. T.
                                                                                            019-0686-2, doi:10.1038/s41592-019-0686-2.
            VanderPlas, L. D. Bradley, D. Pérez-Suárez, M. de Val-Borro,
            (Primary Paper Contributors), T. L. Aldcroft, K. L. Cruz, T. P.     [VM07]      David A. Vallado and Wayne D. McClain. Fundamentals
            Robitaille, E. J. Tollerud, (Astropy Coordination Commit-                       of astrodynamics and applications. Number 21 in Space
            tee), C. Ardelean, T. Babej, Y. P. Bach, M. Bachetti, A. V.                     technology library. Microcosm Press [u.a.], Hawthorne, Calif.,
            Bakanov, S. P. Bamford, G. Barentsen, P. Barmby, A. Baum-                       3. ed., 1. printing edition, 2007.
            bach, K. L. Berry, F. Biscani, M. Boquien, K. A. Bostroem,          [WAB+ 14]   Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P.
            L. G. Bouma, G. B. Brammer, E. M. Bray, H. Breytenbach,                         Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Had-
            H. Buddelmeijer, D. J. Burke, G. Calderone, J. L. Cano                          dock, Kathryn D. Huff, Ian M. Mitchell, Mark D. Plumbley,
            Rodríguez, M. Cara, J. V. M. Cardoso, S. Cheedella, Y. Copin,                   Ben Waugh, Ethan P. White, and Paul Wilson. Best Practices
            L. Corrales, D. Crichton, D. D’Avella, C. Deil, É. Depagne,                     for Scientific Computing. PLoS Biology, 12(1):e1001745,
            J. P. Dietrich, A. Donath, M. Droettboom, N. Earl, T. Erben,                    January 2014. URL: https://dx.plos.org/10.1371/journal.pbio.
146                                                                      PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

          1001745, doi:10.1371/journal.pbio.1001745.
[WIO85]   M. J. H. Walker, B. Ireland, and Joyce Owens. A set modified
          equinoctial orbit elements. Celestial Mechanics, 36(4):409–
          419, August 1985. URL: http://link.springer.com/10.1007/
          BF01227493, doi:10.1007/BF01227493.
PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)                                                                                              147




   A New Python API for Webots Robotics Simulations
                                                                   Justin C. Fisher‡∗

Abstract—Webots is a popular open-source package for 3D robotics simulations. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching or games. Webots has provided a simple API allowing Python programs to control robots and/or the simulated world, but this API is inefficient and does not provide many "pythonic" conveniences. A new Python API for Webots is presented that is more efficient and provides a more intuitive, easily usable, and "pythonic" interface.

Index Terms—Webots, Python, Robotics, Robot Operating System (ROS), Open Dynamics Engine (ODE), 3D Physics Simulation

* Corresponding author: fisher@smu.edu
‡ Southern Methodist University, Department of Philosophy

Copyright © 2022 Justin C. Fisher. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. Introduction

Webots is a popular open-source package for 3D robotics simulations [Mic01], [Webots]. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching or games. Webots uses the Open Dynamics Engine [ODE], which allows physical simulations of Newtonian bodies, collisions, joints, springs, friction, and fluid dynamics. Webots provides the means to simulate a wide variety of robot components, including motors, actuators, wheels, treads, grippers, light sensors, ultrasound sensors, pressure sensors, range finders, radar, lidar, and cameras (with many of these sensors drawing their inputs from GPU processing of the simulation). A typical simulation will involve one or more robots, each with somewhere between 3 and 30 moving parts (though more would be possible), each running its own controller program to process information taken in by its sensors to determine what control signals to send to its devices. A simulated world typically involves a ground surface (which may be a sloping polygon mesh) and dozens of walls, obstacles, and/or other objects, which may be stationary or moving in the physics simulation.

Webots has historically provided a simple Python API, allowing Python programs to control individual robots or the simulated world. This Python API is a thin wrapper over a C++ API, which itself is a wrapper over Webots' core C API. These nested layers of API-wrapping are inefficient. Furthermore, this API is not very "pythonic" and did not provide many of the conveniences that help to make development in Python fast, intuitive, and easy to learn. This paper presents a new Python API [NewAPI01] that more efficiently interfaces directly with the Webots C API and provides a more intuitive, easily usable, and "pythonic" interface for controlling Webots robots and simulations.

In qualitative terms, the old API feels like one is awkwardly using Python to call C and C++ functions, whereas the new API feels much simpler, much easier, and like it is fully intended for Python. Here is a representative (but far from comprehensive) list of examples:

  • Unlike the old API, the new API contains helpful Python type annotations and docstrings.
  • Webots employs many vectors, e.g., for 3D positions, 4D rotations, and RGB colors. The old API typically treats these as lists or integers (24-bit colors). In the new API these are Vector objects, with conveniently addressable components (e.g. vector.x or color.red), convenient helper methods like vector.magnitude and vector.unit_vector, and overloaded vector arithmetic operations, akin to (and interoperable with) NumPy arrays.
  • The new API also provides easy interfacing between high-resolution Webots sensors (like cameras and Lidar) and NumPy arrays, to make it much more convenient to use Webots with popular Python packages like NumPy [NumPy], [Har01], SciPy [Scipy], [Vir01], PIL/PILLOW [PIL] or OpenCV [OpenCV], [Brad01]. For example, converting a Webots camera image to a NumPy array is now as simple as camera.array, and this now allows the array to share memory with the camera, making this extremely fast regardless of image size.
  • The old API often requires that all function parameters be given explicitly in every call, whereas the new API gives many parameters commonly used default values, allowing them often to be omitted, and keyword arguments to be used where needed.
  • Most attributes are now accessible (and alterable, when applicable) by pythonic properties like motor.velocity.
  • Many devices now have Python methods like __bool__ overloaded in intuitive ways. E.g., you can now use if bumper to detect if a bumper has been pressed, rather than the old if bumper.getValue().
  • Pythonic container-like interfaces are now provided. You may now use for target in radar to iterate through the various targets a radar device has detected, or for packet in receiver to iterate through communication packets that a receiver device has received (and it now automatically handles a wide variety of Python objects, not just strings).
  • The old API requires supervisor controllers to use a wide variety of separate functions to traverse and interact with the simulation's scene tree, including different functions for different VRML datatypes (like SFVec3f or MFInt32). The new API automatically handles these datatypes and translates intuitive Python syntax (like dot-notation and square-bracket indexing) to the Webots equivalents. E.g., you can now move a particular crate 1 meter in the x direction using a command like world.CRATES[3].translation += [1,0,0]. Under the old API, this would require numerous function calls (calling getNodeFromDef to find the CRATES node, getMFNode to find the child with index 3, getSFField to find its translation field, and getSFVec3f to retrieve that field's value, then some list manipulation to alter the x-component of that value, and finally a call to setSFVec3f to set the new value).
As another example illustrating how much easier the new API is to use, here are two lines from Webots' sample supervisor_draw_trail, as it would appear in the old Python API.

f = supervisor.getField(supervisor.getRoot(),
                        "children")
f.importMFNodeFromString(-1, trail_plan)

And here is how that looks written using the new API:

world.children.append(trail_plan)

The new API is mostly backwards-compatible with the old Python Webots API, and provides an option to display deprecation warnings with helpful advice for changing to the new API.

The new Python API is planned for inclusion in an upcoming Webots release, to replace the old one. In the meantime, an early-access version is available, distributed under the Apache 2.0 license, the same permissive open-source license that Webots is distributed under.

In what follows, the history and motivation for this new API is discussed, including its use in teaching an interdisciplinary undergraduate Cognitive Science course called Minds, Brains and Robotics. Some of the design decisions for the new API are discussed, which will not only aid in understanding it, but also have broader relevance to parallel dilemmas that face many other software developers. And some metrics are given to quantify how the new API has improved over the old.

2. History and Motivation

Much of this new API was developed by the author in the course of teaching an interdisciplinary Southern Methodist University undergraduate Cognitive Science course entitled Minds, Brains and Robotics (PHIL 3316). Before the Covid pandemic, this course had involved lab activities where students built and programmed physical robots. The pandemic forced these activities to become virtual. Fortunately, Webots simulations actually have many advantages over physical robots, including not requiring any specialized hardware (beyond a decent personal computer), making much more interesting uses of altitude rather than having the robots confined to a safely flat surface, allowing robots to engage in dangerous or destructive activities that would be risky or expensive with physical hardware, allowing a much broader array of sensors including high-resolution cameras, and enabling full-fledged neural network and computational vision simulations. For example, an early activity in this class involves building Braitenberg-style vehicles [Bra01] that use light sensors and cameras to detect a lamp carried by a hovering drone, as well as ultrasound and touch sensors to detect obstacles. Using these sensors, the robots navigate towards the lamp in a cluttered playground sandbox that includes sloping sand, an exterior wall, and various obstacles including a puddle of water and platforms from which robots may fall.

This interdisciplinary class draws students with diverse backgrounds and programming skills. Accommodating those with fewer skills required simplifying many of the complexities of the old Webots API. It also required setting up tools to use Webots "supervisor" powers to help manipulate the simulated world, e.g. to provide students easier customization options for their robots. The old Webots API makes the use of such supervisor powers tedious and difficult, even for experienced coders, so this practically required developing new tools to streamline the process. These factors led to the development of an interface that would be much easier for novice students to adapt to, and that would make it much easier for an experienced programmer to make full use of supervisor powers to manipulate the simulated world. Discussion of this with the core Webots development team then led to the decision to incorporate these improvements into Webots, where they can be of benefit to a much broader community.

3. Design Decisions

This section discusses some design decisions that arose in developing this API, and discusses the factors that drove these decisions. This may help give the reader a better understanding of this API, and also of relevant considerations that would arise in many other development scenarios.

3.1 Shifting from functions to properties

The old Python API for Webots consists largely of methods like motor.getVelocity() and motor.setVelocity(new_velocity). In the new API these have quite uniformly been changed to Python properties, so these purposes are now accomplished with motor.velocity and motor.velocity = new_velocity.
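
To picture the pattern (this is an illustrative sketch, not the library's actual implementation, which routes these calls through the Webots C API), a Python property can hide a getter/setter pair behind attribute syntax; the _c_get_velocity and _c_set_velocity helpers below are hypothetical stand-ins for the underlying C-API calls:

class Motor:
    """Sketch of wrapping getter/setter calls behind a property."""

    def __init__(self):
        self._velocity = 0.0  # stand-in for state that really lives in the simulator

    def _c_get_velocity(self) -> float:
        # Hypothetical stand-in for the C-API read call.
        return self._velocity

    def _c_set_velocity(self, value: float) -> None:
        # Hypothetical stand-in for the C-API write call.
        self._velocity = value

    @property
    def velocity(self) -> float:
        """Target velocity, read through to the simulator."""
        return self._c_get_velocity()

    @velocity.setter
    def velocity(self, new_velocity: float) -> None:
        self._c_set_velocity(new_velocity)

motor = Motor()
motor.velocity = 2.5   # plays the role of motor.setVelocity(2.5)
print(motor.velocity)  # plays the role of motor.getVelocity()
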
Reduction of wordiness and punctuation helps to make programs easier to read and to understand, and it reduces the cognitive load on coders. However, there are also drawbacks.

One drawback is that properties can give the mistaken impression that some attributes are computationally cheap to get or set. In cases where this impression would be misleading, more traditional method calls were retained and/or the comparative expense of the operation was clearly documented.

Two other drawbacks are related. One is that inviting ordinary users to assign properties to API objects might lead them to assign other attributes that could cause problems. Since Python lacks true privacy protections, it has always faced this sort of worry, but this worry becomes even worse when users start to feel familiar moving beyond just using defined methods to interact with an object.

Relatedly, Python debugging provides direct feedback in cases where a user misspells motor.setFoo(v) but not when someone misspells motor.foo = v. If a user inadvertently types motor.setFool(v) they will get an AttributeError noting that motor lacks a setFool attribute. But if a user inadvertently types motor.fool = v, then Python will silently create a new .fool attribute for motor, and the user will often have no idea what has gone wrong.

These two drawbacks both involve users setting an attribute they shouldn't: either an attribute that has another purpose, or one that doesn't. Defenses against the first include "hiding" important attributes behind a leading "_", or protecting them with a Python property, which can also help provide useful doc-strings. Unfortunately it's much harder to protect against misspellings in this piecemeal fashion.

This led to the decision to have robot devices like motors and cameras employ a blanket __setattr__ that will generate warnings if non-property attributes of devices are set from outside the module. So the user who inadvertently types motor.fool = v will immediately be warned of their mistake. This does incur a performance cost, but that cost is often worthwhile when it saves development time and frustration. For cases when performance is crucial, and/or a user wants to live dangerously and meddle inside API objects, this layer of protection can be deactivated.
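
One minimal way to implement this kind of blanket __setattr__ guard is sketched below; it is a simplification (it checks for a leading underscore or a known class attribute rather than checking which module the assignment comes from), and the class and warning text are illustrative rather than the library's actual code:

import warnings

class Device:
    """Sketch of a device base class that warns when an unknown,
    non-property attribute is assigned."""

    def __setattr__(self, name, value):
        # Private names and names defined on the class (e.g. properties)
        # pass through silently; anything else triggers a warning.
        if not name.startswith("_") and not hasattr(type(self), name):
            warnings.warn(f"Setting unknown attribute {name!r} on "
                          f"{type(self).__name__}; is it a misspelled property?",
                          stacklevel=2)
        super().__setattr__(name, value)

class Motor(Device):
    def __init__(self):
        self._velocity = 0.0

    @property
    def velocity(self):
        return self._velocity

    @velocity.setter
    def velocity(self, value):
        self._velocity = value

motor = Motor()
motor.velocity = 2.5  # fine: velocity is a declared property
motor.fool = 2.5      # warns immediately about the likely misspelling
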
An alternative approach, suggested by Matthew Feickert, would have been to use __slots__ rather than an ordinary __dict__ to store device attributes, which would also have the effect of raising an error if users attempt to modify unexpected attributes. Not having a __dict__ can make it harder to do some things like cached properties and multiple inheritance. But in cases where such issues don't arise or can be worked around, readers facing similar challenges may find __slots__ to be a preferable solution.
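
For comparison, here is a minimal sketch of that __slots__ alternative (the slot names are illustrative); a misspelled attribute now raises an AttributeError instead of being silently created:

class SlottedMotor:
    """Sketch of a device class that stores attributes in __slots__."""

    __slots__ = ("velocity",)  # only these names may be assigned

    def __init__(self):
        self.velocity = 0.0

motor = SlottedMotor()
motor.velocity = 2.5      # fine: "velocity" is a declared slot
try:
    motor.velocty = 2.5   # misspelled name
except AttributeError as err:
    print(err)            # 'SlottedMotor' object has no attribute 'velocty'
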
3.2 Backwards Compatibility

The new API offers many new ways of doing things, many of which would seem "better" by most metrics, with the main drawback being just that they differ from old ways. The possibility of making a clean break from the old API was considered, but that would stop old code from working, alienate veteran users, and risk causing a schism akin to the deep one that arose between Python 2 and Python 3 communities when Python 3 opted against backwards compatibility.

Another option would have been to refrain from adding a "new-and-better" feature to avoid introducing redundancies or backward incompatibilities. But that has obvious drawbacks too.

Instead, a compromise was typically adopted: to provide both the "new-and-better" way and the "worse-old" way. This redundancy was eased by shifting from getFoo / setFoo methods to properties, and from CamelCase to pythonic snake_case, which reduced the number of name collisions between old and new. Employing the "worse-old" way leads to a deprecation warning that includes helpful advice regarding shifting to the "new-and-better" way of doing things. This may help users to transition more gradually to the new ways, or they can shut these warnings off to help preserve good will, and hopefully avoid a schism.
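
One way such a compromise can look in practice is sketched below (the method names follow the getFoo/setFoo pattern discussed above, but the warning text and mechanics are illustrative, not the library's actual implementation):

import warnings

class Motor:
    """Sketch of keeping a deprecated CamelCase method alongside the
    new snake_case property."""

    def __init__(self):
        self._velocity = 0.0

    @property
    def velocity(self):
        return self._velocity

    @velocity.setter
    def velocity(self, value):
        self._velocity = value

    def getVelocity(self):
        """Deprecated: read the velocity property instead."""
        warnings.warn("getVelocity() is deprecated; use motor.velocity",
                      DeprecationWarning, stacklevel=2)
        return self.velocity

    def setVelocity(self, value):
        """Deprecated: assign to the velocity property instead."""
        warnings.warn("setVelocity() is deprecated; use motor.velocity = ...",
                      DeprecationWarning, stacklevel=2)
        self.velocity = value

motor = Motor()
motor.setVelocity(1.0)  # still works, but emits a deprecation warning
motor.velocity = 2.0    # the preferred, warning-free spelling
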
3.3 Separating robot and world

In Webots there is a distinction between "ordinary robots" whose capabilities are generally limited to using the robot's own devices, and "supervisor robots" who share those capabilities, but also have virtual omniscience and omnipotence over most aspects of the simulated world. In the old API, supervisor controller programs import a Supervisor subclass of Robot, but typically still call this unusually powerful robot robot, which has led to many confusions.

In the new API these two sorts of powers are strictly separated. Importing robot provides an object that can be used to control the devices in the robot itself. Importing world provides an object that can be used to observe and enact changes anywhere in the simulated world (presuming that the controller has such permissions, of course). In many use cases, supervisor robots don't actually have bodies and devices of their own, and just use their supervisor powers incorporeally, so all they will need is world. In the case where a robot's controller wants to exert both forms of control, it can import both robot to control its own body, and world to control the rest of the world.

This distinction helps to make things more intuitively clear. It also frees world from having all the properties and methods that robot has, which in turn reduces the risk of name-collisions as world takes on the role of serving as the root of the proxy scene tree. In the new API, world.children refers to the children field of the root of the scene tree which contains (almost) all of the simulated world, world.WorldInfo refers to one of these children, a WorldInfo node, and world.ROBOT2 dynamically returns a node within the world whose Webots DEF-name is "ROBOT2". These uses of world would have been much less intuitive if users thought of world as being a special sort of robot, rather than as being their handle on controlling the simulated world. Other sorts of supervisor functionality also are very intuitively associated with world, like world.save(filename) to save the state of the simulated world, or world.mode = 'PAUSE'.

Having world attributes dynamically fetch nodes and fields from the scene tree did come with some drawbacks. There is a risk of name-collisions, though these are rare since Webots field-names are known in advance, and nodes are typically sought by ALL-CAPS DEF-names, which won't collide with world's lower-case and MixedCase attributes. Linters like MyPy and PyCharm also cannot anticipate such dynamic references, which is unfortunate, but does not stop such dynamic references from being extremely useful.
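
The dynamic lookup described above can be pictured with a small, self-contained stand-in: a proxy whose __getattr__ falls back to a dictionary standing in for the scene tree (the real world object resolves names through the Webots C API; the node and field data here are invented purely for illustration):

class SceneProxy:
    """Toy stand-in for a proxy scene-tree root: unknown attribute names
    fall back to a lookup among fields and DEF-named nodes."""

    def __init__(self, entries):
        self._entries = entries

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so ordinary
        # attributes and methods are unaffected.
        try:
            return self._entries[name]
        except KeyError:
            raise AttributeError(f"no field or DEF-named node {name!r}") from None

# Invented example data: a "children" field and a DEF-named node "ROBOT2".
world = SceneProxy({"children": ["WorldInfo", "Viewpoint", "ROBOT2"],
                    "ROBOT2": {"translation": [0.0, 0.0, 0.0]}})

print(world.children)               # dynamic field lookup
print(world.ROBOT2["translation"])  # dynamic DEF-name lookup
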

4. Readability Metrics

A main advantage of the new API is that it allows Webots controllers to be written in a manner that is easier for coders to read, write, and understand. Qualitatively, this difference becomes quite apparent upon a cursory inspection of examples like the one given in section 1. As another representative example, here are three lines from Webots' included supervisor_draw_trail sample as they would appear in the old Python API:

trail_node = world.getFromDef("TRAIL")
point_field = trail_node.getField("coord")\
                        .getSFNode()\
                        .getField("point")
index_field = trail_node.getField("coordIndex")

And here is their equivalent in the new API:

point_field = world.TRAIL.coord.point
index_field = world.TRAIL.coordIndex

Brief inspection should reveal that the latter code is much easier to read, write and understand, not just because it is shorter, but also because its punctuation is limited to standard Python syntax for traversing attributes of objects, because it reduces the need to introduce new variables like trail_node for things that it already makes easy to reference (via world.TRAIL, which the new API automatically caches for fast repeat reference), and because it invisibly handles selecting appropriate C-API functions like getField and getSFNode, saving the user from needing to learn and remember all these functions (of which there are many).

    Metric                                    New API   Old API
    Lines of Code (with blanks, comments)     43        49
    Source Lines of Code (without those)      29        35
    Logical Lines of Code (single commands)   27        38
    Cyclomatic Complexity                     5 (A)     8 (B)

TABLE 1: Length and Complexity Metrics. Raw measures for supervisor_draw_trail as it would be written with the new Python API for Webots or the old Python API for Webots. The "lines of code" measures differ with respect to how they count blank lines, comments, and lines that combine multiple commands. Cyclomatic complexity measures the number of potential branching points in the code.

This intuitive impression is confirmed by automated metrics for code readability. The measures in what follows consider the full supervisor_draw_trail sample controller (from which the above snippet was drawn), since this is the Webots sample controller that makes the most sustained use of supervisor functionality to perform a fairly plausible supervisor task (maintaining the position of a streamer that trails behind the robot). Webots provides this sample controller in C [SDTC], but it was re-implemented using both the Old Python API and the New Python API [Metrics], maintaining straightforward correspondence between the two, with the only differences being directly due to the differences in the APIs.

Some raw measures for the two controllers are shown in Table 1. These were gathered using the Radon code-analysis tools [Radon]. (These metrics, as well as those below, may be reproduced by (1) installing Radon [Radon], (2) downloading the source files to compare and the script for computing metrics [Metrics], (3) ensuring that the path at the top of the script refers to the local location of the source files to be compared, and (4) running this script.) Multiple metrics are reported because theorists disagree about which are most relevant in assessing code readability, because some of these play a role in computing other metrics discussed below, and because this may help to allay potential worries that a few favorable metrics might have been cherry-picked. This paper provides some explanation of these metrics and of their potential significance, while remaining neutral regarding which, if any, of these metrics is best.
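
For readers who prefer not to use the comparison script, essentially the same figures can be obtained directly from Radon's command-line interface; the file names below are placeholders for the old- and new-API re-implementations of supervisor_draw_trail:

import subprocess

# Placeholder file names; substitute the downloaded source files.
sources = ["supervisor_draw_trail_old_api.py",
           "supervisor_draw_trail_new_api.py"]

for src in sources:
    # Radon subcommands: raw = line counts, cc = cyclomatic complexity,
    # hal = Halstead metrics, mi = Maintainability Index.
    for report in ("raw", "cc", "hal", "mi"):
        print(f"== radon {report} {src} ==")
        subprocess.run(["radon", report, src], check=True)
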
The "lines of code" measures reflect that the new API makes it easier to do more things with less code. The measures differ in how they count blank lines, comments, multi-line statements, and multi-statement lines like if p: q(). Line counts can be misleading, especially when the code with fewer lines has longer lines, though upcoming measures will show that that is not the case here.

Cyclomatic Complexity counts the number of potential branching points that appear within the code, like if, while and for [McC01]. Cyclomatic Complexity is strongly correlated with other plausible measures of code readability involving indentation structure [Hin01]. The new API's score is lower/"better" due to its automatically converting vector-like values to the format needed for importing new nodes into the Webots simulation, and due to its automatic caching allowing a simpler loop to remove unwanted nodes. By Radon's reckoning this difference in complexity already gives the old API a "B" grade, as compared to the new API's "A". These complexity measures would surely rise in more complex controllers employed in larger simulations, but they would rise less quickly under the new API, since it provides many simpler ways of doing things, and need never do any worse since it provides backwards-compatible options.

Another collection of classic measures of code readability was developed by Halstead [Hal01]. These measures (especially volume) have been shown to correlate with human assessments of code readability [Bus01], [Pos01]. These measures generally penalize a program for using a "vocabulary" involving more operators and operands. Table 2 shows these metrics, as computed by Radon. (Again all measures are reported, while remaining neutral about which are most significant.) The new API scores significantly lower/"better" on these metrics, due in large part to its automatically selecting among many different C-API calls without these needing to appear in the user's code. E.g. having motor.velocity as a unified property involves fewer unique names than having users write both setVelocity() and getVelocity(), and often forming a third local velocity variable. And having world.children[-1] access the last child of that field in the simulation saves having to count getField and getMFNode in the vocabulary, and often also saves forming additional local variables for nodes or fields gotten in this way. Both of these factors also help the new API to greatly reduce parentheses counts.

    Halstead Metric                                   New API   Old API
    Vocabulary = (n1) operators + (n2) operands       18        54
    Length = (N1) operator + (N2) operand instances   38        99
    Volume = Length * log2(Vocabulary)                158       570
    Difficulty = (n1 * N2) / (2 * n2)                 4.62      4.77
    Effort = Difficulty * Volume                      731       2715
    Time = Effort / 18                                41        151
    Bugs = Volume / 3000                              0.05      0.19

TABLE 2: Halstead Metrics. Halstead metrics for supervisor_draw_trail as it would be written with the new and old Python APIs for Webots. Lower numbers are commonly construed as being better.
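
As a quick sanity check on Table 2, the derived rows follow from the Vocabulary, Length, and Difficulty rows by the standard Halstead formulas (small differences from the table come from Radon using unrounded intermediate values):

from math import log2

# (vocabulary, length, difficulty) for each version, taken from Table 2.
measurements = {"new API": (18, 38, 4.62), "old API": (54, 99, 4.77)}

for name, (vocabulary, length, difficulty) in measurements.items():
    volume = length * log2(vocabulary)  # ~158 (new) vs ~570 (old)
    effort = difficulty * volume        # ~731 (new) vs ~2715 (old)
    time = effort / 18                  # ~41 vs ~151
    bugs = volume / 3000                # ~0.05 vs ~0.19
    print(f"{name}: volume={volume:.0f}, effort={effort:.0f}, "
          f"time={time:.0f}, bugs={bugs:.2f}")
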
Lastly, the Maintainability Index and variants thereof are intended to measure how easy source code is to support and change [Oman01]. Variants of the Maintainability Index are commonly used, including in Microsoft Visual Studio. These measures combine Halstead Volume, Source Lines of Code, and Cyclomatic Complexity, all mentioned above, and two variants (SEI and Radon) also provide credit for percentage of comment lines. (Both samples compared here include 5 comment lines, but these compose a higher percentage of the new API's shorter code.) Different versions of this measure weight and curve these factors somewhat differently, but since the new API outperforms the old on each factor, all versions agree that it gets the higher/"better" score, as shown in Table 3. (These measures were computed based on the input components as counted by Radon.)

    Maintainability Index version    New API   Old API
    Original [Oman01]                89        79
    Software Engineering Institute   78        62
    Microsoft Visual Studio          52        46
    Radon                            82        75

TABLE 3: Maintainability Index Metrics. Maintainability Index metrics for supervisor_draw_trail as it would be written with the new and old versions of the Python API for Webots, according to different versions of the Maintainability Index. Higher numbers are commonly construed as being better.

There are potential concerns about each of these measures of code readability, and one can easily imagine playing a form of "code golf" to optimize some of these scores without actually improving readability (though it would be difficult to do this for all scores at once). Fortunately, most plausible measures of readability have been observed to be strongly correlated across ordinary cases [Pos01], so the clear and unanimous agreement between these measures is a strong confirmation that the new API is indeed more readable. Other plausible measures of readability would take into account factors like whether the operands are ordinary English words [Sca01], or how deeply nested (or indented) the code ends up being [Hin01], both of which would also favor the new API. So the mathematics confirm what was likely obvious from visual comparison of code samples above, that the new API is indeed more "readable" than the old.
5. Conclusions

A new Python API for Webots robotic simulations was presented. It more efficiently interfaces directly with the Webots C API and provides a more intuitive, easily usable, and "pythonic" interface for controlling Webots robots and simulations. Motivations for the API and some of its design decisions were discussed, including the decisions to use Python properties, to add new functionality alongside deprecated backwards compatibility, and to separate robot and supervisor/world functionality. Advantages of the new API were discussed and quantified using automated code readability metrics.

More Information

An early-access version of the new API and a variety of sample programs and metric computations: https://github.com/Justin-Fisher/new_python_api_for_webots

Lengthy discussion of the new API and its planned inclusion in Webots: https://github.com/cyberbotics/webots/pull/3801

Webots home page, including free download of Webots: https://cyberbotics.com/

References

[Brad01]   Bradski, G. The OpenCV Library. Dr Dobb's Journal of Software Tools. 2000.
[Bra01]    Braitenberg, V. Vehicles: Experiments in synthetic psychology. Cambridge, MA: MIT Press. 1984.
[Bus01]    Buse, R and W Weimer. Learning a metric for code readability. IEEE Transactions on Software Engineering, 36(4): 546-58. 2010. doi: 10.1109/TSE.2009.70.
[Metrics]  Fisher, J. Readability Metrics for a New Python API for Webots Robotics Simulations. 2022. doi: 10.5281/zenodo.6813819.
[Hal01]    Halstead, M. Elements of software science. Elsevier, New York. 1977.
[Har01]    Harris, C., K. Millman, S. van der Walt, et al. Array programming with NumPy. Nature 585, 357-62. 2020. doi: 10.1038/s41586-020-2649-2.
[Hin01]    Hindle, A, MW Godfrey and RC Holt. "Reading beside the lines: Indentation as a proxy for complexity metric." Program Comprehension, The 16th IEEE International Conference, 133-42. 2008. doi: 10.1109/icpc.2008.13.
[McC01]    McCabe, TJ. "A Complexity Measure." IEEE Transactions on Software Engineering, 2(4): 308-320. 1976.
[Mic01]    Michel, O. "Webots: Professional Mobile Robot Simulation." Journal of Advanced Robotics Systems, 1(1): 39-42. 2004. doi: 10.5772/5618.
[NewAPI01] https://github.com/Justin-Fisher/new_python_api_for_webots
[NumPy]    Numerical Python (NumPy). https://www.numpy.org
[ODE]      Open Dynamics Engine. https://www.ode.org/
[Oman01]   Oman, P and J Hagemeister. "Metrics for assessing a software system's maintainability." Proceedings Conference on Software Maintenance, 337-44. 1992. doi: 10.1109/ICSM.1992.242525.
[OpenCV]   Open Source Computer Vision Library for Python. https://github.com/opencv/opencv-python
[PIL]      Python Imaging Library. https://python-pillow.org/
[Pos01]    Posnett, D, A Hindle and P Devanbu. "A simpler model of software readability." Proceedings of the 8th working conference on mining software repositories, 73-82. 2011.
[Radon]    Radon. https://radon.readthedocs.io/en/latest/index.html
[Sca01]    Scalabrino, S, M Linares-Vasquez, R Oliveto and D Poshyvanyk. "A Comprehensive Model for Code Readability." Journal of Software: Evolution and Process, 1-29. 2017. doi: 10.1002/smr.1958.
[Scipy]    https://www.scipy.org
[SDTC]     https://cyberbotics.com/doc/guide/samples-howto#supervisor_draw_trail-wbt
[SDTNew]   https://github.com/Justin-Fisher/new_python_api_for_webots/blob/d180bcc7f505f8168246bee379f8067dfaf373ea/webots_new_python_api_samples/controllers/supervisor_draw_trail_python/supervisor_draw_trail_new_api_bare_bones.py
[SDTOld]   https://github.com/Justin-Fisher/new_python_api_for_webots/blob/d180bcc7f505f8168246bee379f8067dfaf373ea/webots_new_python_api_samples/controllers/supervisor_draw_trail_python/supervisor_draw_trail_old_api_bare_bones.py
[Vir01]    Virtanen, P, R. Gommers, T. Oliphant, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-72. 2020. doi: 10.1038/s41592-019-0686-2.
[Webots]   Webots Open Source Robotic Simulator. https://cyberbotics.com/
          pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling

                                                                   Jyotika Singh‡∗

Abstract—pyAudioProcessing is a Python-based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models. MATLAB is a popular language of choice for a vast amount of research in the audio and speech processing domain. In contrast, Python remains the language of choice for a vast majority of machine learning research and functionality. This library contains features built in Python that were originally published in MATLAB. pyAudioProcessing allows the user to compute various features from audio files including Gammatone Frequency Cepstral Coefficients (GFCC), Mel Frequency Cepstral Coefficients (MFCC), spectral features, chroma features, and others such as beat-based and cepstrum-based features from audio. One can use these features along with one's own classification backend or any of the popular scikit-learn classifiers that have been integrated into pyAudioProcessing. Cleaning functions to strip unwanted portions from the audio are another offering of the library. It further contains integrations with other audio functionalities such as frequency and time-series visualizations and audio format conversions. This software aims to provide machine learning engineers, data scientists, researchers, and students with a set of baseline models to classify audio. The library is available at https://github.com/jsingh811/pyAudioProcessing and is under the GPL-3.0 license.

Index Terms—pyAudioProcessing, audio processing, audio data, audio classification, audio feature extraction, gfcc, mfcc, spectral features, spectrogram, chroma

* Corresponding author: singhjyotika811@gmail.com
‡ Placemakr

Copyright © 2022 Jyotika Singh. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The motivation behind this software is to make available complex audio features in Python for a variety of audio processing tasks. Python is a popular choice for machine learning tasks. Having solutions for computing complex audio features using Python enables easier and unified usage of Python for building machine learning algorithms on audio. This not only implies the need for resources to guide solutions for audio processing, but also signifies the need for Python guides and implementations to solve audio and speech cleaning, transformation, and classification tasks.

Different data processing techniques work well for different types of data. For example, in natural language processing, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued numerical vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning [Wik22b]. Word embeddings work great for many applications surrounding textual data [JS21]. However, passing numbers, an audio signal, or an image through a word embeddings generation method is not likely to return any meaningful numerical representation that can be used to train machine learning models. Different data types correlate with feature formation techniques specific to their domain rather than a one-size-fits-all. These methods for audio signals are very specific to audio and speech signal processing, which is a domain of digital signal processing. Digital signal processing is a field of its own and is not feasible to master in an ad-hoc fashion. This calls for sought-after and useful audio signal processes to be available to users in a ready-to-use state.

There are two popular approaches for feature building in audio classification tasks:

1. Computing spectrograms from audio signals as images and using an image classification pipeline for the remainder.
2. Computing features from audio files directly as numerical vectors and applying them to a classification backend.

pyAudioProcessing includes the capability of computing spectrograms, but focuses most functionalities around the latter for building audio models. This tool contains implementations of various widely used audio feature extraction techniques, and integrates with popular scikit-learn classifiers including support vector machine (SVM), SVM radial basis function kernel (RBF), random forest, logistic regression, k-nearest neighbors (k-NN), gradient boosting, and extra trees. Audio data can be cleaned, trained, tested, and classified using pyAudioProcessing [Sin21].
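
As a generic illustration of the first approach (this uses SciPy directly and is not pyAudioProcessing's own API), a spectrogram can be computed from a signal and treated as an image-like array; the synthetic chirp below stands in for a real recording:

import numpy as np
from scipy import signal

# Synthetic stand-in for a real recording: a 3-second chirp sampled at 16 kHz.
fs = 16000
t = np.linspace(0, 3, 3 * fs, endpoint=False)
audio = signal.chirp(t, f0=200.0, t1=3.0, f1=4000.0)

# Short-time power spectrogram: a (frequency bins x time frames) array that
# can feed an image classification pipeline.
freqs, times, spec = signal.spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
log_spec = 10 * np.log10(spec + 1e-10)  # decibel scale, common for modeling

print(log_spec.shape)
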
solutions for computing complex audio features using Python
enables easier and unified usage of Python for building machine                              Some other useful libraries for the domain of audio pro-
learning algorithms on audio. This not only implies the need for                        cessing include librosa [MRL+ 15], spafe [Mal20], essentia
resources to guide solutions for audio processing, but also signifies                   [BWG+ 13], pyAudioAnalysis [Gia15], and paid services from
the need for Python guides and implementations to solve audio and                       service providers such as Google1 .
speech cleaning, transformation, and classification tasks.                                   The use of pyAudioProcessing in the community inspires the
    Different data processing techniques work well for different                        need and growth of this software. It is referenced in a text book
types of data. For example, in natural language processing, word                        titled Artificial Intelligence with Python Cookbook published by
embedding is a term used for the representation of words for                            Packt Publishing in October 2020 [Auf20]. Additionally, pyAu-
text analysis, typically in the form of a real-valued numerical                         dioProcessing is a part of specific admissions requirement for a
vector that encodes the meaning of the word such that the words                         funded PhD project at University of Portsmouth2 . It is further
                                                                                        referenced in this thesis paper titled "Master Thesis AI Method-
* Corresponding author: singhjyotika811@gmail.com                                       ologies for Processing Acoustic Signals AI Usage for Processing
‡ Placemakr                                                                             Acoustic Signals" [Din21], in recent research on audio processing
                                                                                        for assessing attention levels in Attention Deficit Hyperactivity
Copyright © 2022 Jyotika Singh. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the               1. https://developers.google.com/learn/pathways/get-started-audio-
original author and source are credited.                                                classification
Disorder (ADHD) students [BGSR21], and more. There are thus far 16000+ downloads via
pip for pyAudioProcessing, with 1000+ downloads in the last month [PeP22]. As
several different audio features still need development, new issues are created on
GitHub, and contributions to the code by the open-source community are welcome to
grow the tool faster.


Core Functionalities

pyAudioProcessing aims to provide an end-to-end processing solution for converting
between audio file formats, visualizing time and frequency domain representations,
cleaning with silence and low-activity segment removal from audio, building features
from raw audio samples, and training a machine learning model that can then be used
to classify unseen raw audio samples (e.g., into categories such as music, speech,
etc.). This library allows the user to extract features such as Mel Frequency
Cepstral Coefficients (MFCC) [CD14], Gammatone Frequency Cepstral Coefficients
(GFCC) [JDHP17], spectral features, chroma features, and other beat-based and
cepstrum-based features from audio to use with one's own classification backend or
scikit-learn classifiers that have been built into pyAudioProcessing. The classifier
implementation examples that are a part of this software aim to give the users a
sample solution to audio classification problems and help build the foundation to
tackle new and unseen problems.
    pyAudioProcessing provides seven core functionalities comprising different
stages of audio signal processing.
    1. Converting audio files to .wav format to give the users the ability to work
with different types of audio and to increase compatibility with code and processes
that work best with the .wav audio type.
    2. Audio visualization in time-series and frequency representation, including
spectrograms.
    3. Segmenting and removing low-activity segments from audio files for removing
unwanted audio segments that are less likely to represent meaningful information.
    4. Building numerical features from audio that can be used to train machine
learning models. The set of features supported evolves with time as research informs
new and improved algorithms.
    5. Ability to export the features built with this library to use with any custom
machine learning backend of the user's choosing.
    6. Capability that allows users to train scikit-learn classifiers using features
of their choosing directly from raw data. pyAudioProcessing
        a). runs automatic hyper-parameter tuning
        b). returns to the user the training model metrics along with a
    cross-validation confusion matrix (a cross-validation confusion matrix is an
    evaluation matrix from which we can estimate the performance of the model broken
    down by each class/category) for model evaluation (a minimal scikit-learn sketch
    of this evaluation style appears below)
        c). allows the user to test the created classifier with the same features
    used for training
    7. Includes pre-trained models to provide users with baseline audio classifiers.

    2. https://www.port.ac.uk/study/postgraduate-research/research-degrees/phd/explore-our-projects/detection-of-emotional-states-from-speech-and-text


Methods and Results

Pre-trained models

pyAudioProcessing offers pre-trained audio classification models for the Python
community to aid in quick baseline establishment. This is an evolving feature as new
datasets and classification problems gain prominence in the field.
    Some of the pre-trained models include the following.
    1. Audio type classifier to determine speech versus music: Trained a Support
Vector Machine (SVM) classifier for classifying audio into two possible classes -
music, speech. This classifier was trained using Mel Frequency Cepstral Coefficients
(MFCC), spectral features, and chroma features. This model was trained on manually
created and curated samples for speech and music. The per-class evaluation metrics
are shown in Table 1.

    Class      Accuracy      Precision      F1
    music      97.60%        98.79%         98.19%
    speech     98.80%        97.63%         98.21%

TABLE 1: Per-class evaluation metrics for audio type (speech vs music)
classification pre-trained model.

    2. Audio type classifier to determine speech versus music versus bird sounds:
Trained a Support Vector Machine (SVM) classifier for classifying audio into three
possible classes - music, speech, birds. This classifier was trained using Mel
Frequency Cepstral Coefficients (MFCC), spectral features, and chroma features. The
per-class evaluation metrics are shown in Table 2.

    Class      Accuracy      Precision      F1
    music      94.60%        96.93%         95.75%
    speech     97.00%        97.79%         97.39%
    birds      100.00%       96.89%         98.42%

TABLE 2: Per-class evaluation metrics for audio type (speech vs music vs bird sound)
classification pre-trained model.

    3. Music genre classifier using the GTZAN dataset [TEC01]: Trained an SVM
classifier using Gammatone Frequency Cepstral Coefficients (GFCC), Mel Frequency
Cepstral Coefficients (MFCC), spectral features, and chroma features to classify
music into 10 genre classes - blues, classical, country, disco, hiphop, jazz, metal,
pop, reggae, rock. The per-class evaluation metrics are shown in Table 3.
    These models aim to present the capability of audio feature generation
algorithms in extracting meaningful numeric patterns from the audio data. One can
train their own classifiers using similar features and a different machine learning
backend for researching and exploring improvements.

Audio features

There are multiple types of features one can extract from audio. Information about
getting started with audio processing is well described in [Sin19].
pyAudioProcessing allows users to compute GFCC, MFCC, other cepstral features,
spectral features, temporal features, chroma features, and more. Details on how to
extract these features are present in the project documentation on GitHub.
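    To make the evaluation style of functionality 6 concrete, the following is a
minimal sketch using scikit-learn directly on feature vectors that have been
exported from the library (functionality 5). It is not pyAudioProcessing's own API;
the files features.npy and labels.npy are hypothetical placeholders for whatever
features and labels the user has built.

# Minimal sketch: evaluating a baseline audio classifier with a
# cross-validation confusion matrix, assuming X holds exported per-file
# feature vectors and y the class labels (placeholder file names).
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

X = np.load("features.npy")   # hypothetical exported feature matrix
y = np.load("labels.npy")     # hypothetical class labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Cross-validated predictions yield a confusion matrix broken down by class.
y_pred = cross_val_predict(clf, X, y, cv=5)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred, digits=4))

    The per-class accuracy, precision, and F1 numbers reported in Tables 1-3 are the
kind of metrics such a report produces.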
    Class      Accuracy      Precision      F1
    pop        72.36%        78.63%         75.36%
    met        87.31%        85.52%         86.41%
    dis        62.84%        59.45%         61.10%
    blu        83.02%        72.96%         77.66%
    reg        79.82%        69.72%         74.43%
    cla        90.61%        86.38%         88.44%
    rock       53.10%        51.50%         52.29%
    hip        60.94%        77.22%         68.12%
    cou        58.34%        62.53%         60.36%
    jazz       78.10%        85.17%         81.48%

TABLE 3: Per-class evaluation metrics for music genre classification pre-trained
model.


Generally, features useful in different audio prediction tasks (especially speech)
include Linear Prediction Coefficients (LPC) and Linear Prediction Cepstral
Coefficients (LPCC), Bark Frequency Cepstral Coefficients (BFCC), Power Normalized
Cepstral Coefficients (PNCC), and spectral features like spectral flux, entropy,
roll off, centroid, spread, and energy entropy.
    While MFCC features find use in most commonly encountered audio processing tasks
such as audio type classification and speech classification, GFCC features have been
found to have application in speaker identification or speaker diarization (the
process of partitioning an input audio stream into homogeneous segments according to
the human speaker identity [Wik22a]). Applications, comparisons, and uses can be
found in [ZW13], [pat21], and [pat22].
    The pyAudioProcessing library includes computation of these features for audio
segments of a single audio, followed by computing the mean and standard deviation of
all the signal segments.

       Mel Frequency Cepstral Coefficients (MFCC):

    The mel scale relates perceived frequency, or pitch, of a pure tone to its
actual measured frequency. Humans are much better at discerning small changes in
pitch at low frequencies compared to high frequencies. Incorporating this scale
makes our features match more closely what humans hear. The mel-frequency scale is
approximately linear for frequencies below 1 kHz and logarithmic for frequencies
above 1 kHz, as shown in Figure 1. This is motivated by the fact that the human
auditory system becomes less frequency-selective as frequency increases above 1 kHz.
    The signal is divided into segments and a spectrum is computed. Passing a
spectrum through the mel filter bank, followed by taking the log magnitude and a
discrete cosine transform (DCT), produces the mel cepstrum. DCT extracts the
signal's main information and peaks. For this very property, DCT is also widely used
in applications such as JPEG and MPEG compressions. The peaks after DCT contain the
gist of the audio information. Typically, the first 13-20 coefficients extracted
from the mel cepstrum are called the MFCCs. These hold very useful information about
audio and are often used to train machine learning models. The process of developing
these coefficients can be seen in the form of an illustration in Figure 1, and a
simplified code sketch of this pipeline is shown later in this section. MFCC for a
sample speech audio can be seen in Figure 2.

       Gammatone Frequency Cepstral Coefficients (GFCC):

    Another filter inspired by human hearing is the gammatone filter bank. The
gammatone filter bank shape looks similar to the mel filter bank, except that the
peaks are smoother than the triangular shape of the mel filters. Gammatone filters
are conceived to be a good approximation to the human auditory filters and are used
as a front-end simulation of the cochlea. Since the human ear is the perfect
receiver and distinguisher of speakers, with or without noise, construction of
gammatone filters that mimic auditory filters became desirable. Thus, this approach
has many applications in speech processing because it aims to replicate how we hear.
    GFCCs are formed by passing the spectrum through a gammatone filter bank,
followed by loudness compression and DCT, as seen in Figure 3. The first
(approximately) 22 features are called GFCCs. GFCCs have a number of applications in
speech processing, such as speaker identification. GFCC for a sample speech audio
can be seen in Figure 4.

       Temporal features:

    Temporal features from audio are extracted from the signal information in its
time domain representations. Examples include signal energy, entropy, zero crossing
rate, etc. Some sample mean temporal features can be seen in Figure 5.

       Spectral features:

    Spectral features, on the other hand, derive information contained in the
frequency domain representation of an audio signal. The signal can be converted from
the time domain to the frequency domain using the Fourier transform. Useful features
from the signal spectrum include fundamental frequency, spectral entropy, spectral
spread, spectral flux, spectral centroid, spectral roll-off, etc. Some sample mean
spectral features can be seen in Figure 6.
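    Before moving on to chroma features, the cepstral pipeline described under MFCC
above (frame → spectrum → filter bank → log → DCT) can be illustrated with a short,
self-contained sketch. This is a simplified illustration, not the implementation
used inside pyAudioProcessing; the frame length, filter count, and coefficient count
are arbitrary choices.

# Simplified mel-cepstrum sketch for a single audio frame; illustrative only.
import numpy as np
from scipy.fft import dct

def mel_filter_bank(n_filters, n_fft, sample_rate):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        bank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        bank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return bank

def frame_to_cepstrum(frame, sample_rate, n_filters=26, n_coeffs=13):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2                   # frame spectrum
    energies = mel_filter_bank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)                   # log magnitude
    return dct(log_energies, type=2, norm="ortho")[:n_coeffs] # keep first coefficients

# Usage on a random frame (stand-in for a real 25 ms audio segment at 16 kHz)
coeffs = frame_to_cepstrum(np.random.randn(400), sample_rate=16000)

    Swapping the triangular mel filters for gammatone-shaped filters in the same
pipeline is, conceptually, how the GFCC variant described above differs.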



       Chroma features:

    Chroma features are highly popular for music audio data. In Western music, the
term chroma feature or chromagram closely relates to the twelve different pitch
classes. Chroma-based features, which are also referred to as "pitch class
profiles", are a powerful tool for analyzing music whose pitches can be meaningfully
categorized (often into twelve categories: A, A#, B, C, C#, D, D#, E, F, F#, G, G#)
and whose tuning approximates to the equal-tempered scale [con22]. A prime
characteristic of chroma features is that they capture the harmonic and melodic
attributes of audio, while being robust to changes in timbre and instrumentation.
Some sample mean chroma features can be seen in Figure 7.


                    Fig. 1: MFCC from audio spectrum.

              Fig. 2: MFCC from a sample speech audio.


Audio data cleaning/de-noising

Oftentimes an audio sample has multiple segments present in the same signal that do
not contain anything but silence or a slight degree of background noise compared to
the rest of the audio. For most applications, those low-activity segments make up
the irrelevant information of the signal.
    The audio clip shown in Figure 8 is a human saying the word "london" and
represents the audio plotted in the time domain, with signal amplitude on the y-axis
and sample number on the x-axis. The areas where the signal looks closer to zero/low
in amplitude are areas where speech is absent and represent the pauses the speaker
took while saying the word "london".
    Figure 9 shows the spectrogram of the same audio signal. A spectrogram contains
time on the x-axis and frequency on the y-axis. A spectrogram is a visual
representation of the spectrum of frequencies of a signal as it varies with time.
When applied to an audio signal, spectrograms are sometimes called sonographs,
voiceprints, or voicegrams. When the data are represented in a 3D plot they may be
called waterfalls. As [Wik21] mentions, spectrograms are used extensively in the
fields of music, linguistics, sonar, radar, speech processing, seismology, and
others. Spectrograms of audio can be used to identify spoken words phonetically, and
to analyze the various calls of animals. A spectrogram can be generated by an
optical spectrometer, a bank of band-pass filters, by Fourier transform or by a
wavelet transform. A spectrogram is usually depicted as a heat map, i.e., as an
image with the intensity shown by varying the color or brightness.
    After applying the algorithm for signal alteration to remove irrelevant and
low-activity audio segments, the resultant audio's time-series plot looks like
Figure 10, and its spectrogram looks like Figure 11. It can be seen that the
low-activity areas are now missing from the audio and the resultant audio contains
more activity-filled regions. This algorithm removes silences as well as
low-activity regions from the audio.
    These visualizations were produced using pyAudioProcessing and can be produced
for any audio signal using the library.

       Impact of cleaning on feature formation for a classification task:

    A spoken location name classification problem was considered for this
evaluation. The dataset consisted of 23 samples for training per class and 17
samples for testing per class. The total number of classes is 2 - london and boston.
This dataset was manually created and can be found linked in the project readme of
pyAudioProcessing. For comparative purposes, the classifier is kept constant at SVM,
and the parameter C is chosen based on a grid search for each experiment based on
best precision, recall, and F1 score (a code sketch of this style of grid search
follows below). Results in Table 4 show the impact of applying the low-activity
region removal using pyAudioProcessing prior to training the model using MFCC
features.

    Features         boston acc      london acc
    mfcc              0.765           0.412
    clean+mfcc        0.823           0.471

TABLE 4: Performance comparison on test data between MFCC feature trained model with
and without cleaning.

    It can be seen that the accuracies increased when audio samples were cleaned
prior to training the model. This is especially useful in cases where silence or
low-activity regions in the audio do not contribute to the predictions and act as
noise in the signal.
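    The grid search over C mentioned above can be reproduced with scikit-learn
alone. The following is a minimal sketch rather than the exact script behind Table
4; the feature files and the macro-F1 scoring choice are illustrative assumptions.

# Minimal sketch of an SVM grid search over C; arrays and scoring are
# illustrative assumptions, not the exact experiment behind Table 4.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_train = np.load("train_mfcc_features.npy")   # hypothetical exported features
y_train = np.load("train_labels.npy")          # hypothetical labels

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="f1_macro",   # select C by F1 across classes
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)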



Integrations

pyAudioProcessing integrates with third-party tools such as scikit-learn,
matplotlib, and pydub to offer additional functionalities.


                    Fig. 3: GFCC from audio spectrum.

              Fig. 4: GFCC from a sample speech audio.

        Fig. 5: Temporal extractions from a sample speech audio.

         Fig. 6: Spectral features from a sample speech audio.

          Fig. 7: Chroma features from a sample speech audio.


       Training, classification, and evaluation:

    The library contains integrations with scikit-learn classifiers for passing
audio through feature extraction followed by classification directly using the raw
audio samples as input. Training results include computation of cross-validation
results along with hyperparameter tuning details.

       Audio format conversion:

    Some applications and integrations work best with the .wav data format.
pyAudioProcessing integrates with tools that perform format conversion and presents
them as a functionality via the library.

       Audio visualization:

    Spectrograms are 2-D images representing sequences of spectra with time along
one axis, frequency along the other, and brightness or color representing the
strength of a frequency component at each time frame [Wys17]. Not only can one see
whether there is more or less energy at, for example, 2 Hz vs 10 Hz, but one can
also see how energy levels vary over time [PNS]. Some of the convolutional neural
network architectures for images can be applied to audio signals on top of the
spectrograms. This is a different route of building audio models: developing
spectrograms followed by image processing. Time-series, frequency-domain, and
spectrogram (both time and frequency domains) visualizations can be retrieved using
pyAudioProcessing and its integrations. See Figures 10 and 9 as examples.
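    As an illustration of the kind of conversion and visualization wrapped by the
library, the following sketch uses pydub and matplotlib directly rather than
pyAudioProcessing's own wrappers; the file names are placeholders.

# Sketch: convert an mp3 to wav with pydub, then plot a time-series view and
# a spectrogram with matplotlib. File names are hypothetical placeholders.
from pydub import AudioSegment
from scipy.io import wavfile
import matplotlib.pyplot as plt

AudioSegment.from_file("speech.mp3").export("speech.wav", format="wav")

rate, samples = wavfile.read("speech.wav")
if samples.ndim > 1:            # keep a single channel for plotting
    samples = samples[:, 0]

fig, (ax_time, ax_spec) = plt.subplots(2, 1)
ax_time.plot(samples)                     # time-series view
ax_time.set_ylabel("amplitude")
ax_spec.specgram(samples, Fs=rate)        # spectrogram view
ax_spec.set_xlabel("time (s)")
ax_spec.set_ylabel("frequency (Hz)")
plt.show()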



      Fig. 8: Time-series representation of speech for "london".

             Fig. 9: Spectrogram of speech for "london".

  Fig. 10: Time-series representation of cleaned speech for "london".

        Fig. 11: Spectrogram of cleaned speech for "london".


Conclusion

In this paper pyAudioProcessing, an open-source Python library, is presented. The
tool implements and integrates a wide range of audio processing functionalities.
Using pyAudioProcessing, one can read and visualize audio signals, clean audio
signals by removal of irrelevant content, build and extract complex features such as
GFCC, MFCC, and other spectrum- and cepstrum-based features, build classification
models, and use pre-built trained baseline models to classify different types of
audio. Wrappers along with command-line usage examples are provided in the
software's readme and wiki for giving the user a guide and the flexibility of usage.
pyAudioProcessing has been used in active research around audio processing and can
be used as the basis for further Python-based research efforts.
    pyAudioProcessing is updated frequently in order to apply enhancements and new
functionalities aligned with recent research efforts of the digital signal
processing and machine learning community. Some of the ongoing implementations
include additions of cepstral features such as LPCC, integration with deep learning
backends, and a variety of spectrogram formations that can be used for image
classification-based audio classification tasks.


REFERENCES

[Auf20]   Ben Auffarth. Artificial Intelligence with Python Cookbook. Packt
          Publishing, 10 2020.
[BGSR21]  Srivi Balaji, Meghana Gopannagari, Svanik Sharma, and Preethi Rajgopal.
          Developing a machine learning algorithm to assess attention levels in ADHD
          students in a virtual learning setting using audio and video processing.
          International Journal of Recent Technology and Engineering (IJRTE), 10,
          5 2021. doi:10.35940/ijrte.A5965.0510121.
[BWG+13]  Dmitry Bogdanov, N Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera,
          Oscar Mayor, G Roma, Justin Salamon, Jose Zapata, and Xavier Serra.
          Essentia: an audio analysis library for music information retrieval.
          11 2013.
[CD14]    Paresh M. Chauhan and Nikita P. Desai. Mel frequency cepstral coefficients
          (mfcc) based speaker identification in noisy environment using wiener
          filter. In 2014 International Conference on Green Computing Communication
          and Electrical Engineering (ICGCCEE), pages 1-5, 2014.
          doi:10.1109/ICGCCEE.2014.6921394.
[con22]   Wikipedia contributors. Chroma feature — wikipedia the free encyclopedia,
          2022. Online; accessed 18-May-2022. URL:
          https://en.wikipedia.org/w/index.php?title=Chroma_feature&oldid=1066722932.
[Din21]   Vincent Dinger. Master Thesis AI Methodologies for Processing Acoustic
          Signals (original German title: Master Thesis KI Methodiken für die
          Verarbeitung akustischer Signale). PhD thesis, Kaiserslautern University
          of Applied Sciences, 03 2021. doi:10.13140/RG.2.2.15872.97287.
[Gia15]   Theodoros Giannakopoulos. pyaudioanalysis: An open-source python library
          for audio signal analysis. PloS one, 10(12), 2015.
          doi:10.1371/journal.pone.0144610.
[JDHP17]  Medikonda Jeevan, Atul Dhingra, M. Hanmandlu, and Bijaya Panigrahi. Robust
          Speaker Verification Using GFCC Based i-Vectors, volume 395, pages 85-91.
          Springer, 10 2017. doi:10.1007/978-81-322-3592-7_9.
[JS21]    Jyotika Singh. Social Media Analysis using Natural Language Processing
          Techniques. In Meghann Agarwal, Chris Calloway, Dillon Niederhut, and
          David Shupe, editors, Proceedings of the 20th Python in Science
          Conference, pages 52-58, 2021. URL:
          http://conference.scipy.org/proceedings/scipy2021/pdfs/jyotika_singh.pdf,
          doi:10.25080/majora-1b6fd038-009.
[Mal20] Ayoub Malek. spafe/spafe: 0.1.2, April 2020. URL: https://github.
         com/SuperKogito/spafe.
[MRL+ 15] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt
         McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and
         music signal analysis in python. In Proceedings of the 14th python
         in science conference, volume 8, 2015. doi:10.5281/zenodo.
         4792298.
[pat21]  Method for optimizing media and marketing content using cross-
         platform video intelligence, 2021. URL: https://patents.google.com/
         patent/US10949880B2/en.
[pat22]  Media and marketing optimization with cross platform consumer
         and content intelligence, 2022. URL: https://patents.google.com/
         patent/US20210201349A1/en.
[PeP22] PePy. PePy download statistics, 2022. URL: https://pepy.tech/
         project/pyAudioProcessing.
[PNS]    PNSN.       What is a spectrogram?           URL: https://pnsn.org/
         spectrograms/what-is-a-spectrogram#.
[Sin19] Jyotika Singh. An introduction to audio processing and machine
         learning using python, 2019. URL: https://opensource.com/article/
         19/9/audio-processing-machine-learning-python.
[Sin21] Jyotika Singh.         jsingh811/pyAudioProcessing: Audio pro-
         cessing, feature extraction and classification, July 2021.
         URL: https://github.com/jsingh811/pyAudioProcessing, doi:10.
         5281/zenodo.5121041.
[TEC01] George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical
         genre classification of audio signals, 2001. URL: http://ismir2001.
         ismir.net/pdf/tzanetakis.pdf.
[Wik21] Wikipedia contributors. Spectrogram — Wikipedia, the free
         encyclopedia, 2021. [Online; accessed 19-July-2021]. URL:
         https://en.wikipedia.org/w/index.php?title=Spectrogram&oldid=
         1031156666.
[Wik22a] Wikipedia contributors.       Speaker diarisation — Wikipedia,
         the free encyclopedia, 2022.           [Online; accessed 23-June-
         2022]. URL: https://en.wikipedia.org/w/index.php?title=Speaker_
         diarisation&oldid=1090834931.
[Wik22b] Wikipedia contributors.         Word embedding — Wikipedia,
         the free encyclopedia, 2022.           [Online; accessed 23-June-
         2022].     URL: https://en.wikipedia.org/w/index.php?title=Word_
         embedding&oldid=1091348337.
[Wys17] Lonce Wyse. Audio spectrogram representations for processing
         with convolutional neural networks. 06 2017.
[ZW13] Xiaojia Zhao and DeLiang Wang. Analyzing noise robustness
         of mfcc and gfcc features in speaker identification. In 2013
         IEEE International Conference on Acoustics, Speech and Signal
         Processing, pages 7204–7208, 2013. doi:10.1109/ICASSP.
         2013.6639061.
Phylogeography: Analysis of genetic and climatic data
                 of SARS-CoV-2
                                     Aleksandr Koshkarov‡§¶*, Wanlin Li‡¶, My-Linh Luu||, Nadia Tahiri‡






Abstract—As the SARS-CoV-2 pandemic reaches its peak, researchers around the globe
are combining efforts to investigate the genetics of different variants to better
deal with its distribution. This paper discusses phylogeographic approaches to
examine how patterns of divergence within SARS-CoV-2 coincide with geographic
features, such as climatic features. First, we propose a Python-based bioinformatic
pipeline called aPhylogeo for phylogeographic analysis, written in Python 3, that
helps researchers better understand the distribution of the virus in specific
regions via a configuration file, and then run all the analysis operations in a
single run. In particular, the aPhylogeo tool determines which parts of the genetic
sequence undergo a high mutation rate depending on geographic conditions, using a
sliding window that moves along the genetic sequence alignment in user-defined steps
and a window size. As a Python-based cross-platform program, aPhylogeo works on
Windows®, MacOS X®, and GNU/Linux. The implementation of this pipeline is publicly
available on GitHub (https://github.com/tahiri-lab/aPhylogeo). Second, we present an
example of analysis of our new aPhylogeo tool on real data (SARS-CoV-2) to
understand the occurrence of different variants.

Index Terms—Phylogeography, SARS-CoV-2, Bioinformatics, Genetic, Climatic Condition

* Corresponding author: Nadia.Tahiri@USherbrooke.ca
‡ Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1,
Canada
§ Center of Artificial Intelligence, Astrakhan State University, Astrakhan, 414056,
Russia
¶ Contributed equally
|| Department of Computer Science, University of Quebec at Montreal, Montreal, QC,
Canada

Copyright © 2022 Aleksandr Koshkarov et al. This is an open-access article
distributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.


Introduction

The global pandemic caused by severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2) is at its peak, and more and more variants of SARS-CoV-2 have been
described over time. Among these, some are considered variants of concern (VOC) by
the World Health Organization (WHO) due to their impact on global public health,
such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron
(B.1.1.529) [CRA+22]. Although significant progress has been made in vaccine
development and mass vaccination is being implemented in many countries, the
continued emergence of new variants of SARS-CoV-2 threatens to reverse the progress
made to date. Researchers around the world collaborate to better understand the
genetics of the different variants, along with the factors that influence the
epidemiology of this infectious disease. Genetic studies of the different variants
contributed to the development of vaccines to better combat the spread of the virus.
Studying the factors (e.g., environment, host, agent of transmission) that influence
epidemiology helps us to limit the continued spread of infection and prepare for the
future re-emergence of diseases caused by subtypes of coronavirus [LFZK06]. However,
few studies report associations between environmental factors and the genetics of
different variants. Different variants of SARS-CoV-2 are expected to spread
differently depending on geographical conditions, such as the meteorological
parameters. The main objective of this study is to find clear correlations between
genetics and the geographic distribution of different variants of SARS-CoV-2.
    Several studies showed that COVID-19 cases and related climatic factors
correlate significantly with each other ([OCFC20], [SDdPS+20], and [SMVS+22]).
Oliveiros et al. [OCFC20] reported a decrease in the rate of SARS-CoV-2 progression
with the onset of spring and summer in the northern hemisphere. Sobral et al.
[SDdPS+20] suggested a negative correlation between mean temperature by country and
the number of SARS-CoV-2 infections, along with a positive correlation between
rainfall and SARS-CoV-2 transmission. This contrasts with the results of the study
by Sabarathinam et al. [SMVS+22], which showed that an increase in temperature led
to an increase in the spread of SARS-CoV-2. The results of Chen et al. [CPK+21]
imply that a country located 1000 km closer to the equator can expect 33% fewer
cases of SARS-CoV-2 per million population. Some virus variants may be more stable
in environments with specific climatic factors. Sabarathinam et al. [SMVS+22]
compared mutation patterns of SARS-CoV-2 with time series of changes in
precipitation, humidity, and temperature. They suggested that temperatures between
43°F and 54°F, humidity of 67-75%, and precipitation of 2-4 mm may be the optimal
environment for the transition of the mutant form from D614 to G614.
    In this study, we examine the geospatial lineage of SARS-CoV-2 by combining
genetic data and metadata from associated sampling locations. Thus, an association
between genetics and the geographic distribution of SARS-CoV-2 variants can be
found. We focus on developing a new algorithm to find relationships between a
reference tree (i.e., a tree of geographic species distributions, a temperature
tree, a habitat precipitation tree, or others) and their genetic compositions. This
new algorithm can help find which genes or which subparts of a gene are sensitive or
favorable to a given environment.
Problem statement and proposal

Phylogeography is the study of the principles and processes that govern the
distribution of genealogical lineages, particularly at the intraspecific level. The
geographic distribution of species is often correlated with the patterns associated
with the species' genes ([A+00] and [KM02]). In a phylogeographic study, three major
processes should be considered (see [Nag92] for more details), which are:

   1)  Genetic drift is the result of allele sampling errors. These errors are due
       to generational transmission of alleles and geographical barriers. Genetic
       drift is a function of the size of the population. Indeed, the larger the
       population, the lower the genetic drift. This is explained by the ability to
       maintain genetic diversity in the original population. By convention, we say
       that an allele is fixed if it reaches a frequency of 100%, and that it is
       lost if it reaches a frequency of 0%.
   2)  Gene flow or migration is an important process for conducting a
       phylogeographic study. It is the transfer of alleles from one population to
       another, increasing intrapopulation diversity and decreasing interpopulation
       diversity.
   3)  Selection occurs in many forms in all species. Here we indicate the two that
       are most important for a phylogeographic study. (a) Sexual selection is a
       phenomenon resulting from an attractive characteristic between two species.
       Therefore, this selection is a function of the size of the population.
       (b) Natural selection is a function of fertility, mortality, and adaptation
       of a species to a habitat.

    Populations living in different environments with varying climatic conditions
are subject to pressures that can lead to evolutionary divergence and reproductive
isolation ([OS98] and [Sch01]). Phylogeny and geography are then correlated. This
study, therefore, aims to present an algorithm to show the possible correlation
between certain genes or gene fragments and the geographical distribution of
species.
    Most studies in phylogeography consider only genetic data without directly
considering climatic data. They indirectly take this information as a basis for
locating the habitat of the species. We have developed the first version of a
phylogeography tool that integrates climate data. The sliding window strategy
provides more robust results, as it particularly highlights the areas sensitive to
climate adaptation.


Methods and Python scripts

In order to achieve our goal, we designed a workflow and then developed a script in
Python version 3.9 called aPhylogeo for phylogeographic analysis (see [LLKT22] for
more details). It interacts with multiple bioinformatics programs, taking climatic
data and nucleotide data as input, and performs multiple phylogenetic analyses on
nucleotide sequencing data using a sliding window approach. The process is divided
into three main steps (see Figure 1).
    The first step involves collecting data to search for quality viral sequences
that are essential for the conditions of our results. All sequences were retrieved
from the NCBI Virus website (National Center for Biotechnology Information,
https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). In total, 20 regions were selected
to represent 38 gene sequences of SARS-CoV-2. After collecting genetic data, we
extracted 5 climatic factors for the 20 regions, i.e., Temperature, Humidity,
Precipitation, Wind speed, and Sky surface shortwave downward irradiance. This data
was obtained from the NASA website (https://power.larc.nasa.gov/).
    In the second step, trees are created with climatic data and genetic data,
respectively. For climatic data, we calculated the dissimilarity between each pair
of variants (i.e., from different climatic conditions), resulting in a symmetric
square matrix (a small illustrative sketch of such a matrix appears at the end of
this section). From this matrix, the neighbor joining algorithm was used to
construct the climate tree. The same approach was implemented for genetic data.
Using nucleotide sequences from the 38 SARS-CoV-2 lineages, phylogenetic
reconstruction is repeated to construct genetic trees, considering only the data
within a window that moves along the alignment in user-defined steps and window size
(their length is denoted by the number of base pairs (bp)).
    In the third step, the phylogenetic trees constructed in each sliding window are
compared to the climatic trees using the Robinson and Foulds (RF) topological
distance [RF81]. The distance was normalized by 2n−6, where n is the number of
leaves (i.e., taxa). The proposed approach considers bootstrapping. The
implementation of the sliding window technique provides a more accurate
identification of regions with high gene mutation rates.
    As a result, we highlighted a correlation between parts of genes with a high
rate of mutations depending on the geographic distribution of viruses, which
emphasizes the emergence of new variants (i.e., Alpha, Beta, Delta, Gamma, and
Omicron).
    The creation of phylogenetic trees, as mentioned above, is an important part of
the solution and includes the main steps of the developed pipeline. This function is
intended for genetic data. The main parameters of this part are as follows:

def create_phylo_tree(gene,
                      window_size,
                      step_size,
                      bootstrap_threshold,
                      rf_threshold,
                      data_names):

    number_seq = align_sequence(gene)
    sliding_window(window_size, step_size)
    ...
    for file in files:
        try:
            ...
            create_bootstrap()
            run_dnadist()
            run_neighbor()
            run_consense()
            filter_results(gene,
                           bootstrap_threshold,
                           rf_threshold,
                           data_names,
                           number_seq,
                           file)
            ...
        except Exception as error:
            raise

This function takes gene data, window size, step size, bootstrap threshold,
threshold for the Robinson and Foulds distance, and data names as input parameters.
Then the function sequentially connects the main steps of the pipeline:
align_sequence(gene), sliding_window(window_size, step_size), create_bootstrap(),
run_dnadist(), run_neighbor(), run_consense(), and filter_results with its
parameters. As a result, we obtain a phylogenetic tree (or several trees), which is
written to a file.
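    The paper does not spell out the exact dissimilarity measure used for the
climatic data. As a minimal sketch of the second step, a Euclidean distance over
standardized climate vectors (an assumption for illustration, not necessarily the
measure used in aPhylogeo) produces the kind of symmetric square matrix fed to
neighbor joining:

# Sketch: build a symmetric dissimilarity matrix from per-region climate
# vectors (temperature, humidity, precipitation, wind speed, irradiance).
# The Euclidean metric and the standardization step are illustrative
# assumptions, not necessarily what aPhylogeo itself uses.
import numpy as np

# rows = regions/variants, columns = the 5 climatic factors (hypothetical values)
climate = np.array([
    [12.1, 71.0, 2.5, 3.1, 4.8],
    [25.4, 55.0, 0.9, 4.0, 6.2],
    [ 5.3, 80.0, 3.8, 5.2, 3.1],
])

# standardize each factor so no single unit dominates the distance
z = (climate - climate.mean(axis=0)) / climate.std(axis=0)

# pairwise Euclidean distances -> symmetric square matrix
diff = z[:, None, :] - z[None, :, :]
dissimilarity = np.sqrt((diff ** 2).sum(axis=-1))
print(dissimilarity.round(3))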
    We have created a function (create_tree) to create the climate trees. The
function is described as follows:

def create_tree(file_name, names):
    for i in range(1, len(names)):
        create_matrix(file_name,
                      names[0],
                      names[i],
                      "infile")

        os.system("./exec/neighbor " +
                  "< input/input.txt")

        subprocess.call(["mv",
                         "outtree",
                         "intree"])

        subprocess.call(["rm",
                         "infile",
                         "outfile"])

        os.system("./exec/consense " +
                  "< input/input.txt")

        newick_file = names[i].replace(" ", "_") + "_newick"

        subprocess.call(["rm",
                         "outfile"])

        subprocess.call(["mv",
                         "outtree",
                         newick_file])

    The sliding window strategy can detect genetic fragments depending on
environmental parameters, but this work requires time-consuming data preprocessing
and the use of several bioinformatics programs. For example, we need to verify that
each sequence identifier in the sequencing data always matches the corresponding
metadata. If samples are added or removed, we need to check whether the sequencing
dataset matches the metadata and make changes accordingly. In the next stage, we
need to align the sequences (multiple sequence alignment, MSA) and integrate
everything step by step into specific software such as MUSCLE [Edg04], the Phylip
package (i.e., Seqboot, DNADist, Neighbor, and Consense) [Fel05], RF [RF81], and
raxmlHPC [Sta14]. The use of each software package requires expertise in
bioinformatics. In addition, the intermediate analysis steps inevitably generate
many files, the management of which not only consumes the time of the biologist, but
is also subject to errors, which reduces the reproducibility of the study. At
present, there are only a few systems designed to automate the analysis of
phylogeography. In this context, the development of a computer program for a better
understanding of the nature and evolution of coronavirus is essential for the
advancement of clinical research.
    The following sliding window function illustrates moving the sliding window
through an alignment with window size and step size as parameters. The first 11
characters are allocated to species names, plus a space.

def sliding_window(window_size=0, step=0):
    try:
        f = open("infile", "r")
        ...
        # slide the window along the sequence
        start = 0
        end = start + window_size
        while end <= length:
            index = 0
            with open("out", "r") as aligned, ... as out:
                ...
                for line in aligned:
                    if line != "\n":
                        species = list_names[index]
                        # pad the species name to the 11-character field
                        padding = 11 - len(species)
                        out.write(species)
                        out.write(" " * padding)
                        # keep only the bases inside the current window
                        out.write(line[start:end])
                        index = index + 1
            start = start + step
            end = end + step
    except Exception:
        print("An error occurred.")


Algorithmic complexity

The complexity of the algorithm described in the previous section depends on the
complexity of the various external programs used and the number of windows that the
alignment can contain, plus one for the total alignment that the program will
process.
    Recall the complexities of the different external programs used in the
algorithm:

   •  SeqBoot program: O(r × n × SA)
   •  DNADist program: O(n²)
   •  Neighbor program: O(n³)
   •  Consense program: O(r × n²)
   •  RaxML program: O(e × n × SA)
   •  RF program: O(n²),

where n is the number of species (or taxa), r is the number of replicates, SA is the
size of the multiple sequence alignment (MSA), and e is the number of refinement
steps performed by the RaxML algorithm. For all SA ∈ ℕ* and for all WS, S ∈ ℕ, the
number of windows can be evaluated as follows (Eq. 1):

    nb = ⌈(SA − WS) / S⌉ + 1,                                                   (1)

where WS is the window size, and S is the step.
162                                                                                          PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)




Fig. 1: The workflow of the algorithm. The operations within this workflow include several blocks. The blocks are highlighted by three different
colors. The first block (grey color) is responsible for creating the trees based on the climate data. The second block (green color) performs the
function of input parameter validation. The third block (blue color) allows the creation of phylogenetic trees. This is the most important block
and the basis of this study, through the results of which the user receives the output data with the necessary calculations.
PHYLOGEOGRAPHY: ANALYSIS OF GENETIC AND CLIMATIC DATA OF SARS-COV-2                                                                        163

C.36, C.36.1, C.36.2, C.36.3 and C.36.3.1, only C.36 was used as
a sample for analysis.
   2)   Selection of the lineages that are clearly dominant in a particular region compared to other regions.

Through significant advances in the generation and exchange of SARS-CoV-2 genomic data in real time, the international spread of lineages is tracked and recorded on the website (cov-lineages.org/global_report.html) [OHP+21]. Based on the statistical information provided by the website, our study focuses on SARS-CoV-2 lineages that were first identified (Earliest date) and widely disseminated in a particular country (Most common country) during a certain period (Table 1).
    We list four examples of the distribution of a set of lineages:

   •    Both lineages A.2.3 and B.1.1.107 have 100% distribution in the United Kingdom. Both lineages D.2 and D.3 have 100% distribution in Australia. B.1.1.172, L.4 and P.1.13 have 100% distribution in the United States. Finally, AH.1, AK.2, and C.7 have 100% distribution in Switzerland, Germany, and Denmark, respectively.
   •    The country with the widest distribution of L.2 is the Netherlands (77.0%), followed by Germany (19.0%). Due to the 58% difference in the distribution of L.2 between the two locations, we consider the Netherlands the main distribution country of L.2, and it was therefore selected as a sample.
   •    Similarly, the most predominant country of distribution of C.37 is Peru (44%), followed by Chile (19.0%), a difference of 25%. Among all samples in this study, C.37 was the lineage with the smallest difference in distribution percentage between the two leading countries. Considering the need to increase the diversity of the geographical distribution of the samples, C.37 was also selected.
   •    In contrast, the distribution of C.6 is 17.0% in France, 14.0% in Angola, 13.0% in Portugal, and 8.0% in Switzerland; we concluded that C.6 does not show a tendency in terms of geographic distribution and, therefore, it was not included as a sample for analysis.

    In accordance with the above principles, we selected 38 lineages with regional characteristics for further study. Based on location information, complete nucleotide sequencing data for these 38 lineages were collected from the NCBI Virus website (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). When multiple sequencing results were available for the same lineage in the same country, we selected the sequence whose collection date was closest to the earliest date presented. If there were several sequencing results for the same country on the same date, the sequence with the least number of ambiguous characters (N per nucleotide) was selected (Table 1).
    Based on the sampling locations (consistent with the most common country, but accurate to specific cities) of each lineage sequence in Table 1, combined with the time when the lineage was first discovered, we obtained data on the climatic conditions at the time each lineage was first discovered. The meteorological parameters include Temperature at 2 meters, Specific humidity at 2 meters, Precipitation corrected, Wind speed at 10 meters, and All sky surface shortwave downward irradiance. The daily data for the above parameters were collected from the NASA POWER website (https://power.larc.nasa.gov/); a small retrieval sketch is shown after Table 1. Considering that the spread of the virus in a country and the reporting of data take time, we collected climatological data for the three days before the earliest reporting date corresponding to each lineage and averaged them for analysis (Fig. 2).
    Although the selection of samples was based on the phylogenetic cluster and transmission of each lineage, most of the sites involved represent different meteorological conditions. As shown in Figure 2, the 38 samples involved temperatures ranging from -4 C to 32.6 C, with an average temperature of 15.3 C. The Specific humidity ranged from 2.9 g/kg to 19.2 g/kg, with an average of 8.3 g/kg. The variability of Wind speed and All sky surface shortwave downward irradiance was relatively small across samples compared to the other parameters. The Wind speed ranged from 0.7 m/s to 9.3 m/s with an average of 4.0 m/s, and All sky surface shortwave downward irradiance ranged from 0.8 kW-hr/m2/day to 8.6 kW-hr/m2/day with an average of 4.5 kW-hr/m2/day. In contrast to the other parameters, 75% of the cities involved receive less than 2.2 mm of precipitation per day, and only 5 cities have more than 5 mm of precipitation per day. The minimum precipitation is 0 mm/day, the maximum is 12 mm/day, and the average value is 2.1 mm/day.


Fig. 2: Climatic conditions of each lineage in the most common country at the time of first detection. The climate factors involved include Temperature at 2 meters (C), Specific humidity at 2 meters (g/kg), Precipitation corrected (mm/day), Wind speed at 10 meters (m/s), and All sky surface shortwave downward irradiance (kW-hr/m2/day).


Results

In this section, we describe the results obtained on our dataset (see the Dataset section) using our new algorithm (see the Method section).
    The size of the sliding window and the advance step of the sliding window play an important role in the analysis. We restricted our conditions to certain values. For comparison, we applied five combinations of parameters (window size and step size) to the same dataset. These include the choice of different window sizes (20bp, 50bp, 200bp) and step sizes (10bp, 50bp, 200bp). These combinations of window sizes and steps provide an opportunity to have three different movement strategies (overlapping, non-overlapping, with gaps). Here we fixed the pair (window

                                     Lineage     Most Common Country     Earliest Date   Sequence Accession
                                     A.2.3       United Kingdom 100.0%   2020-03-12      OW470304.1
                                     AE.2        Bahrain 100.0%          2020-06-23      MW341474
                                     AH.1        Switzerland 100.0%      2021-01-05      OD999779
                                     AK.2        Germany 100.0%          2020-09-19      OU077014
                                     B.1.1.107   United Kingdom 100.0%   2020-06-06      OA976647
                                     B.1.1.172   USA 100.0%              2020-04-06      MW035925
                                     BA.2.24     Japan 99.0%             2022-01-27      BS004276
                                     C.1         South Africa 93.0%      2020-04-16      OM739053.1
                                     C.7         Denmark 100.0%          2020-05-11      OU282540
                                     C.17        Egypt 69.0%             2020-04-04      MZ380247
                                     C.20        Switzerland 85.0%       2020-10-26      OU007060
                                     C.23        USA 90.0%               2020-05-11      ON134852
                                     C.31        USA 87.0%               2020-08-11      OM052492
                                     C.36        Egypt 34.0%             2020-03-13      MW828621
                                     C.37        Peru 43.0%              2021-02-02      OL622102
                                     D.2         Australia 100.0%        2020-03-19      MW320730
                                     D.3         Australia 100.0%        2020-06-14      MW320869
                                     D.4         United Kingdom 80.0%    2020-08-13      OA967683
                                     D.5         Sweden 65.0%            2020-10-12      OU370897
                                     Q.2         Italy 99.0%             2020-12-15      OU471040
                                     Q.3         USA 99.0%               2020-07-08      ON129429
                                     Q.6         France 92.0%            2021-03-02      ON300460
                                     Q.7         France 86.0%            2021-01-29      ON442016
                                     L.2         Netherlands 73.0%       2020-03-23      LR883305
                                     L.4         USA 100.0%              2020-06-29      OK546730
                                     N.1         USA 91.0%               2020-03-25      MT520277
                                     N.3         Argentina 96.0%         2020-04-17      MW633892
                                     N.4         Chile 92.0%             2020-03-25      MW365278
                                     N.6         Chile 98.0%             2020-02-16      MW365092
                                     N.7         Uruguay 100.0%          2020-06-18      MW298637
                                     N.8         Kenya 94.0%             2020-06-23      OK510491
                                     N.9         Brazil 96.0%            2020-09-25      MZ191508
                                     M.2         Switzerland 90.0%       2020-10-26      OU009929
                                     P.1.7.1     Peru 94.0%              2021-02-07      OK594577
                                     P.1.13      USA 100.0%              2021-02-24      OL522465
                                     P.2         Brazil 58.0%            2020-04-13      ON148325
                                     P.3         Philippines 83.0%       2021-01-08      OL989074
                                     P.7         Brazil 71.0%            2020-07-01      ON148327


TABLE 1: SARS-CoV-2 lineages analyzed. The lineage assignments covered in the table were last updated on March 1, 2022. Among all Pango lineages of SARS-CoV-2, 38 lineages were analyzed. The corresponding sequencing data were found in the NCBI database based on the date of earliest detection and the most common country. The table also gives the percentage of the virus in the most common country compared to all countries where the virus is present.
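The daily meteorological values referenced above were taken from the NASA POWER service. The sketch below shows one way such a retrieval could be scripted; the endpoint, the community setting, the parameter codes (T2M, QV2M, PRECTOTCORR, WS10M, ALLSKY_SFC_SW_DWN), and the JSON layout are assumptions based on the public POWER documentation rather than part of the aPhylogeo code, and the coordinates in the example are an illustrative placeholder.

    import requests

    POWER_URL = "https://power.larc.nasa.gov/api/temporal/daily/point"
    # Parameter codes assumed from the POWER docs: temperature at 2 m, specific
    # humidity at 2 m, corrected precipitation, wind speed at 10 m, and all-sky
    # surface shortwave downward irradiance.
    PARAMS = "T2M,QV2M,PRECTOTCORR,WS10M,ALLSKY_SFC_SW_DWN"

    def daily_climate(lat, lon, start, end):
        """Fetch daily values for the five meteorological parameters.

        start/end are YYYYMMDD strings; returns a dict keyed by parameter code.
        """
        resp = requests.get(POWER_URL, params={
            "parameters": PARAMS,
            "community": "AG",        # community setting assumed; see POWER docs
            "latitude": lat,
            "longitude": lon,
            "start": start,
            "end": end,
            "format": "JSON",
        }, timeout=60)
        resp.raise_for_status()
        # Response layout assumed: values live under properties/parameter.
        return resp.json()["properties"]["parameter"]

    # Example: the three days before the earliest reporting date of lineage
    # A.2.3 (2020-03-12, United Kingdom); the coordinates are a placeholder,
    # not the actual sampling city used in the study.
    if __name__ == "__main__":
        data = daily_climate(51.5, -0.13, "20200309", "20200311")
        for code, by_day in data.items():
            values = list(by_day.values())
            print(code, sum(values) / len(values))

Averaging the three days preceding the earliest reporting date, as described above, then reduces each parameter to a single value per lineage.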


size, step size) at some values: (20, 10), (20, 50), (50, 50), (200, 50) and (200, 200).

   1)   Robinson and Foulds baseline and bootstrap threshold: the phylogenetic trees constructed in each sliding window are compared to the climatic trees using the Robinson and Foulds topological distance (the RF distance). We defined the value of the RF distance obtained for regions without any mutations as the baseline. Although different sample sizes and sample sequence characteristics can cause differences in the baseline, regions without any mutation are often accompanied by very low bootstrap values. Using the distribution of bootstrap values and combining it with validation of the alignment visualization, we confirmed that the RF baseline value in this study was 50, and that the bootstrap values corresponding to this baseline were smaller than 10.
   2)   Sliding window: the implementation of the sliding window technology with a bootstrap threshold provides a more accurate identification of regions with high gene mutation rates. Figure 3 shows the general pattern of the RF distance changes over alignment windows for the different climate conditions, restricted to bootstrap values greater than 10. The trend of RF value variation under the different climatic conditions does not vary much throughout this whole sequence sliding window scan, which may be related to the correlation between the climatic factors (Wind Speed, Downward Irradiance, Precipitation, Humidity, Temperature). Windows starting from or containing position 28550bp were screened in all five scans for the different combinations of window size and step size. The window formed from position 29200bp to position 29470bp is screened out in all four scans except for the combination of 50bp window size with 50bp step size. As Figure 3 shows, if there are gaps in the scan (window size: 20bp, step size: 50bp), some potential mutation windows are not screened compared to the other movement strategies, because the sequences in the gap part are not computed by the algorithm. In addition, when the window size is small, the capture of the window mutation signal becomes more sensitive, especially when the number of samples is small. In that case, a single base change in a single sequence can cause a change in the value of the RF distance. Therefore, high quality sequencing data is required to prevent errors
caused by ambiguous characters (N in nucleotide) on the RF distance values. In cases where a larger window size (200bp) is selected, the overlapping movement strategy (window size: 200bp, step size: 50bp) allows the signal of base mutations to be repeatedly verified and enhanced in adjacent window scans compared to the non-overlapping strategy (window size: 200bp, step size: 200bp). In this situation, the range of the RF distance values is relatively large, and the number of windows eventually screened is relatively greater. Due to the small number of SARS-CoV-2 lineage sequences analyzed in this study, we chose to scan the alignment with a larger window and the overlapping movement strategy for further analysis (window size: 200bp, step size: 50bp).
   3)   Comparison between genetic trees and climatic trees: the RF distance quantifies the difference between a phylogenetic tree constructed in a specific sliding window and a climatic tree constructed from the corresponding climatic data. Relatively low RF distance values represent relatively more similarity between the phylogenetic tree and the climatic tree. With our algorithm based on the sliding window technique, regions with high mutation rates can be identified (Fig. 4). Subsequently, we compare the RF values of these regions. In cases where there is a correlation between the occurrence of mutations and the climate factors studied, the regions with relatively low RF distance values (the alignment positions 15550bp-15600bp and 24650bp-24750bp) are more likely to be correlated with climate factors than the other loci screened for mutations.

    In addition, we can state that we have made an effort to make our tool as independent as possible of the input data and parameters. Our pipeline can also be applied to phylogeographic studies of other species. In cases where it is determined (or assumed) that the occurrence of a mutation is associated with certain geographic factors, our pipeline can help to highlight mutant regions, and specific mutant regions within them, that are more likely to be associated with that geographic parameter. Our algorithm can provide a reference for further biological studies.


Conclusions and future work

In this paper, a bioinformatics pipeline for phylogeographic analysis is designed to help researchers better understand the distribution of viruses in specific regions using genetic and climate data. We propose a new algorithm called aPhylogeo [LLKT22] that allows the user to quickly and intuitively create trees from genetic and climate data. Using a sliding window, the algorithm finds specific regions of the viral genetic sequences that can be correlated to the climatic conditions of the region. To our knowledge, this is the first study of its kind that incorporates climate data into this type of analysis. It aims to help the scientific community by facilitating research in the field of phylogeography. Our solution runs on Windows®, MacOS X® and GNU/Linux, and the code is freely available to researchers and collaborators on GitHub (https://github.com/tahiri-lab/aPhylogeo).
    As future work on the project, we plan to incorporate the following additional features:

   1)   We can handle large amounts of data, especially when considering many countries and longer time periods (dates). In addition, since the size of the sliding window and the forward step play an important role in the analysis, we need to perform several tests to choose the best combination of parameters. In this case, it is important to provide faster performance of this solution, and we plan to adapt the code to parallelize the computations. In addition, we intend to use the resources of Compute Canada and Compute Quebec for these high-load calculations.
   2)   To enable further analysis of this topic, it would be interesting to relate the results obtained, especially the values obtained from the best positions of the multiple sequence alignments, to the dimensional structure of the proteins, or to the map of the selective pressure exerted on the indicated alignment fragments.
   3)   We can envisage a study that would consist in selecting only different phenotypes of a single species, for example Homo sapiens, in different geographical locations. In this case, we would have to consider a larger geographical area in order to significantly increase the variation of the selected climatic parameters. This type of research would consist in observing the evolution of the genes of the selected species according to different climatic parameters.
   4)   We intend to develop a website that can help biologists, ecologists and other interested professionals to perform the calculations in their phylogeography projects faster and more easily. We plan to create a user-friendly interface with input of the necessary initial parameters and the possibility to save the results (for example, by sending them by email). These results will include the calculated parameters and visualizations.


Acknowledgements

The authors thank the SciPy conference and the reviewers for their valuable comments on this paper. This work was supported by the Natural Sciences and Engineering Research Council of Canada and the University of Sherbrooke grant.


References

[A+00]      John C Avise et al. Phylogeography: the history and formation of species. Harvard University Press, 2000. doi:10.1093/icb/41.1.134.
[CPK+21]    Simiao Chen, Klaus Prettner, Michael Kuhn, Pascal Geldsetzer, Chen Wang, Till Bärnighausen, and David E Bloom. Climate and the spread of covid-19. Scientific Reports, 11(1):1–6, 2021. doi:10.1038/s41598-021-87692-z.
[CRA+22]    Marco Cascella, Michael Rajnik, Abdul Aleem, Scott C Dulebohn, and Raffaela Di Napoli. Features, evaluation, and treatment of coronavirus (covid-19). Statpearls [internet], 2022.
[Edg04]     Robert C Edgar. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5(1):1–19, 2004. doi:10.1186/1471-2105-5-113.
[Fel05]     Joseph Felsenstein. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, 2005.
[KM02]      L Lacey Knowles and Wayne P Maddison. Statistical phylogeography. Molecular Ecology, 11(12):2623–2635, 2002. doi:10.1146/annurev.ecolsys.38.091206.095702.
[LFZK06]    Kun Lin, Daniel Yee-Tak Fong, Biliu Zhu, and Johan Karlberg. Environmental factors on the sars epidemic: air temperature, passage of time and multiplicative effect of hospital infection. Epidemiology & Infection, 134(2):223–230, 2006. doi:10.1017/S0950268805005054.




Fig. 3: Heatmap of the Robinson and Foulds topological distance over alignment windows. Five different combinations of parameters were applied: (a) window size = 20bp and step size = 10bp; (b) window size = 20bp and step size = 50bp; (c) window size = 50bp and step size = 50bp; (d) window size = 200bp and step size = 50bp; and (e) window size = 200bp and step size = 200bp. The Robinson and Foulds topological distance was used to quantify the distance between the phylogenetic tree constructed in each sliding window and the climatic tree constructed from the corresponding climatic data (wind speed, downward irradiance, precipitation, humidity, temperature).


Fig. 4: Robinson and Foulds topological distance normalized changes over the alignment windows. Multiple phylogenetic analyses were performed using a sliding window (window size = 200 bp and step size = 50 bp). Phylogenetic reconstruction was repeated considering only the data within a window that moved along the alignment in steps. The normalized RF topological distance was used to quantify the distance between the phylogenetic tree constructed in each sliding window and the climate tree constructed from the corresponding climate data (Wind speed, Downward irradiance, Precipitation, Humidity, Temperature). Only regions with high genetic mutation rates are marked in the figure.


[LLKT22]    Wanlin Li, My-Lin Luu, Aleksandr Koshkarov, and Nadia Tahiri. aPhylogeo (version 1.0), July 2022. URL: https://github.com/tahiri-lab/aPhylogeo, doi:10.5281/zenodo.6773603.
[Nag92]     Thomas Nagylaki. Rate of evolution of a quantitative character. Proceedings of the National Academy of Sciences, 89(17):8121–8124, 1992. doi:10.1073/pnas.89.17.8121.
[OCFC20]    Barbara Oliveiros, Liliana Caramelo, Nuno C Ferreira, and Francisco Caramelo. Role of temperature and humidity in the modulation of the doubling time of covid-19 cases. MedRxiv, 2020. doi:10.1101/2020.03.05.20031872.
[OHP+21]    Áine O'Toole, Verity Hill, Oliver G Pybus, Alexander Watts, Issac I Bogoch, Kamran Khan, Jane P Messina, The COVID-19 Genomics UK (COG-UK) consortium, et al. Tracking the international spread of sars-cov-2 lineages b.1.1.7 and b.1.351/501y-v2 with grinch. Wellcome Open Research, 6, 2021. doi:10.12688/wellcomeopenres.16661.2.
[OS98]      Matthew R Orr and Thomas B Smith. Ecology and speciation. Trends in Ecology & Evolution, 13(12):502–506, 1998. doi:10.1016/s0169-5347(98)01511-0.
[OSU+21]    Áine O'Toole, Emily Scher, Anthony Underwood, Ben Jackson, Verity Hill, John T McCrone, Rachel Colquhoun, Chris Ruis, Khalil Abu-Dahab, Ben Taylor, et al. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evolution, 7(2):veab064, 2021. doi:10.1093/ve/veab064.
[RF81]      David F Robinson and Leslie R Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981. doi:10.1016/0025-5564(81)90043-2.
[RHO+20]    Andrew Rambaut, Edward C Holmes, Áine O'Toole, Verity Hill, John T McCrone, Christopher Ruis, Louis du Plessis, and Oliver G Pybus. A dynamic nomenclature proposal for sars-cov-2 lineages to assist genomic epidemiology. Nature Microbiology, 5(11):1403–1407, 2020. doi:10.1038/s41564-020-0770-5.
[Sch01]     Dolph Schluter. Ecology and the origin of species. Trends in Ecology & Evolution, 16(7):372–380, 2001. doi:10.1016/s0169-5347(01)02198-x.
[SDdPS+20]  Marcos Felipe Falcão Sobral, Gisleia Benini Duarte, Ana Iza Gomes da Penha Sobral, Marcelo Luiz Monteiro Marinho, and André de Souza Melo. Association between climate variables and global transmission of sars-cov-2. Science of The Total Environment, 729:138997, 2020. doi:10.1016/j.scitotenv.2020.138997.
[SMVS+22]   Chidambaram Sabarathinam, Prasanna Mohan Viswanathan, Venkatramanan Senapathi, Shankar Karuppannan, Dhanu Radha Samayamanthula, Gnanachandrasamy Gopalakrishnan, Ramanathan Alagappan, and Prosun Bhattacharya. Sars-cov-2 phase i transmission and mutability linked to the interplay of climatic variables: a global observation on the pandemic spread. Environmental Science and Pollution Research, pages 1–18, 2022. doi:10.1007/s11356-021-17481-8.
[Sta14]     Alexandros Stamatakis. Raxml version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, 2014. doi:10.1093/bioinformatics/btu033.




Global optimization software library for research and education
                                                                         Nadia Udler‡∗






Abstract—Machine learning models are often represented by functions given by computer programs. Optimization of such functions is a challenging task because traditional derivative-based optimization methods with guaranteed convergence properties cannot be used. This software allows the user to create new optimization methods with desired properties, based on basic modules. These basic modules are designed in accordance with the approach for constructing global optimization methods based on potential theory [KAP]. These methods do not use derivatives of the objective function and as a result work with nondifferentiable functions (or functions given by computer programs, or black-box functions), yet have guaranteed convergence. The software helps to understand the principles of learning algorithms. It may be used by researchers to design their own variations or hybrids of known heuristic optimization methods, and by students to understand how known heuristic optimization methods work and how certain parameters affect the behavior of the method.

Index Terms—global optimization, black-box functions, algorithmically defined functions, potential functions

* Corresponding author: nadiakap@optonline.net
‡ University of Connecticut (Stamford)

Copyright © 2022 Nadia Udler. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction

Optimization lies at the heart of machine learning and data science. One of the most relevant problems in machine learning is the automatic selection of the algorithm depending on the objective. This is necessary in many applications such as robotics, simulation of biological or chemical processes, and trading strategy optimization, to name a few [KHNT]. We developed a library of optimization methods as a first step toward self-adapting algorithms. The optimization methods in this library work with all objectives, including very onerous ones such as black-box functions and functions given by computer code, and the convergence of the methods is guaranteed. The library allows the user to create customized derivative-free learning algorithms with desired properties by combining building blocks from this library or from other Python libraries.
    The library is intended primarily for educational purposes, and its focus is on transparency of the methods rather than on efficiency of implementation.
    The library can be used by researchers to design optimization methods with desired properties by varying parameters of the general algorithm.
    As an example, consider the variant of simulated annealing (SA) proposed in [FGSB], where different values of the parameters (Boltzmann distribution parameters, step size, etc.) are used depending on the distance to the optimal point. In this paper the basic SA algorithm is used as a starting point. We can offer a more basic module as a starting point (and, by specifying the distribution as 'exponential', obtain the variant of SA), thus achieving more flexible design opportunities for a custom optimization algorithm. Note that convergence of the newly created hybrid algorithm does not need to be verified when using minpy basic modules, whereas the previously mentioned SA-based hybrid has to be verified separately (see [GLUQ]).
    Test functions are included in the library. They represent a broad range of use cases covering the difficult functions mentioned above. In this paper we describe the approach underlying these optimization methods. The distinctive feature of these methods is that they are not heuristic in nature. The algorithms are derived based on potential theory [KAP], and their convergence is guaranteed by their derivation method [KPP]. Recently, potential theory was applied to prove convergence of well-known heuristic methods, for example see [BIS] for convergence of PSO, and to re-prove convergence of well-known gradient-based methods, in particular first-order methods - see [NBAG] for convergence of gradient descent and [ZALO] for mirror descent. For the potential function approach for stochastic first-order optimization methods see [ATFB].


Outline of the approach

The approach works for non-smooth or algorithmically defined functions. For a detailed description of the approach see [KAP], [KP]. In this approach the original optimization problem is replaced with a randomized problem, allowing the use of Monte Carlo methods for calculating integrals. This is especially important if the objective function is given by its values (no analytical formula) and derivatives are not known. The original problem is restated in the framework of gradient (subgradient) methods, employing the standard theory (convergence theorems for gradient (subgradient) methods), whereas no derivatives of the objective function are needed. At the same time, the method obtained is a method of nonlocal search, unlike other gradient methods. It will be shown that instead of measuring the gradient of the objective function we can measure the gradient of the potential function at each iteration step, and the value of the gradient can be obtained using values of the objective function only, in the framework of Monte Carlo methods for calculating integrals. Furthermore, this value does not have to be precise, because it is recalculated at each iteration step. It will also be shown that well-known zero-order optimization methods (methods that do not use derivatives of the objective function but its values only)
are generalized into their adaptive extensions. The generalization of zero-order methods (that are heuristic in nature) is obtained using a standardized methodology, namely the gradient (subgradient) framework. We consider the unconstrained optimization problem

                   f(x_1, x_2, ..., x_n) → min,  x ∈ R^n                     (1)

By randomizing we get

                        F(X) = E[f(X)] → min                                 (2)

where X is a random vector from R^n, {X} is a set of such random vectors, and E[·] is the expectation operator.
    Problem 2 is equivalent to problem 1 in the sense that any realization of the random vector X*, where X* is a solution to 2, that has a nonzero probability will be a solution to problem 1 (see [KAP] for the proof).
    Note that 2 is the stochastic optimization problem of the functional F(X).
    To study the gradient nature of the solution algorithms for problem 2, a variation of the objective functional F(X) will be considered.
    The suggested approach makes it possible to obtain optimization methods in a systematic way, similar to the methodology adopted in smooth optimization. The derivation includes randomization of the original optimization problem, finding the directional derivative for the randomized problem, and choosing the moving direction Y based on the condition that the directional derivative in the direction of Y is less than or equal to 0.
    Because of the randomization, the expression for the directional derivative doesn't contain the differential characteristics of the original function. We obtain the condition for selecting the direction of search Y in terms of its characteristics - the conditional expectation. The conditional expectation is a vector function (or vector field) and can be decomposed (following the theorem on decomposition of a vector field) into the sum of the gradient of a scalar function P and a function with zero divergence. P is called a potential function. As a result, the original problem is reduced to optimization of the potential function; furthermore, the potential function is specific to each iteration step. Next, we arrive at a partial differential equation that connects P and the original function. To define computational algorithms it is necessary to specify the dynamics of the random vectors. For example, the dynamics can be expressed in the form of densities. For a certain class of distributions, for example the normal distribution, the dynamics can be written in terms of the expectation and covariance matrix. It is also possible to express the dynamics in mixed characteristics.


Expression for directional derivative

The derivative of the objective functional F(X) in the direction of the random vector Y at the point X⁰ (the Gateaux derivative) is:

    δ_Y F(X⁰) = [d/dε F(X⁰ + εY)]_{ε=0} = [d/dε F(X^ε)]_{ε=0}
              = [d/dε ∫_{R^n} f(x) p_{X^ε}(x) dx]_{ε=0}

where the density function of the random vector X^ε = X⁰ + εY may be expressed in terms of the joint density function p_{X⁰,Y}(x, y) of X⁰ and Y as follows:

    p_{X^ε}(x) = ∫_{R^n} p_{X⁰,Y}(x − εy, y) dy                              (3)

The following relation (a property of the divergence) will be needed later:

    d/dε p_{X⁰,Y}(x − εy, y) = (−∇_x p_{X⁰,Y}(x − εy, y), y)
                             = −div_x( p_{X⁰,Y}(x − εy, y) y )               (4)

where (·, ·) denotes the dot product.
    Assuming differentiability of the integrals (for example, by selecting an appropriate p_{X⁰,Y}(x, y)) and using (3) and (4), we get

    δ_Y F(X⁰) = [d/dε ∫_{R^n} ∫_{R^n} f(x) p_{X⁰,Y}(x − εy, y) dx dy]_{ε=0}
              = ∫_{R^n} f(x) ( ∫_{R^n} [d/dε p_{X⁰,Y}(x − εy, y)]_{ε=0} dy ) dx
              = −∫_{R^n} f(x) ( ∫_{R^n} div_x( p_{X⁰,Y}(x, y) y ) dy ) dx
              = −∫_{R^n} f(x) div_x [ ∫_{R^n} p_{X⁰,Y}(x, y) y dy ] dx

Using the formula for the conditional distribution, p_{Y/X⁰=x}(y) = p_{X⁰,Y}(x, y) / p_{X⁰}(x), where p_{X⁰}(x) = ∫_{R^n} p_{X⁰,Y}(x, u) du, we get

    δ_Y F(X⁰) = −∫_{R^n} f(x) div_x [ p_{X⁰}(x) ∫_{R^n} p_{Y/X⁰=x}(y) y dy ] dx

Denote y(x) = ∫_{R^n} y p_{Y/X⁰=x}(y) dy = E[Y / X⁰ = x]. Taking into account the normalization condition for the density, we arrive at the following expression for the directional derivative:

    δ_Y F(X⁰) = −∫_{R^n} (f(x) − C) div_x [ p_{X⁰}(x) y(x) ] dx

where C is an arbitrarily chosen constant.
    Considering the solution to δ_Y F(X⁰) → min_Y allows us to obtain gradient-like optimization algorithms that use only objective function values (and do not use derivatives of the objective function).


Potential function as a solution to Poisson's equation

Decomposing the vector field p_{X⁰}(x) y(x) into a potential field ∇φ₀(x) and a divergence-free component W₀(x):

    p_{X⁰}(x) y(x) = ∇φ₀(x) + W₀(x)

we arrive at Poisson's equation for the potential function:

    Δφ₀(x) = −L [f(x) − C] p_u(x)

where L is a constant.
    The solution to Poisson's equation approaching 0 at infinity may be written in the following form:

    φ₀(x) = ∫_{R^n} E(x, ξ) [f(ξ) − C] p_u(ξ) dξ

where E(x, ξ) is a fundamental solution to Laplace's equation. Then for the potential component we have

    Δφ₀(x) = −L E[Δ_x E(x, u) (f(x) − C)]

To conclude, a representation for the gradient-like direction is obtained. This direction maximizes the rate of decrease of the objective functional F(X). Therefore, this representation can be used for computing the gradient of the objective function f(x) using only its values. The gradient direction of the objective function f(x) is determined by the gradient of the potential function φ₀(x), which, in turn, is determined by Poisson's equation.
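To make the preceding derivation more concrete, the following minimal sketch estimates a gradient-like direction from objective values only: random directions are sampled, weighted by the deviation of the objective from an average level (playing the role of the constant C), and averaged. This is a generic Monte Carlo zero-order estimator written purely for illustration; it is not the potential-function construction itself, and the function names, sample counts, and step sizes are arbitrary choices.

    import numpy as np

    def estimate_direction(f, x, sigma=0.1, samples=200, rng=None):
        # Monte Carlo estimate of a descent direction from objective values only.
        # Random directions are drawn around x; each is weighted by how much the
        # objective at the perturbed point deviates from the average level.
        rng = np.random.default_rng() if rng is None else rng
        y = rng.standard_normal((samples, x.size))
        fvals = np.array([f(x + sigma * yk) for yk in y])
        c = fvals.mean()                       # baseline level, the role of C
        g = ((fvals - c)[:, None] * y).mean(axis=0) / sigma
        return -g                              # move against the estimated gradient

    def zero_order_minimize(f, x0, step=0.05, iters=300, **kw):
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            x = x + step * estimate_direction(f, x, **kw)
        return x

    if __name__ == "__main__":
        target = np.array([1.0, 2.0, 3.0])
        quadratic = lambda v: float(np.sum((v - target) ** 2))
        print(zero_order_minimize(quadratic, [0.0, 0.0, 0.0]))

Because the direction is re-estimated at every iteration, the estimate does not need to be precise, which mirrors the remark above that the gradient of the potential function is recalculated at each step.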
GLOBAL OPTIMIZATION SOFTWARE LIBRARY FOR RESEARCH AND EDUCATION                                                                                      169

Practical considerations                                                      The code is organized in such a way that it allows to pair the
The dynamics of the expectation of objective function may be              algorithm with objective function. The new algorithm may be im-
written in the space of random vectors as follows:                        plmented as method of class Minimize. Newly created algorithm
                                                                          can be paired with test objectivve function supplied with a library
                       XN+1 = XN + αN+1YN+1                               or with externally supplied objective function (implemented in
                                                                          separate python module). New algorithms can be made more or
where N - iteration number, Y N+1 - random vector that defines            less universal, that is, may have different number of parameters
direction of move at ( N+1)th iteration, αN+1 -step size on (N+1)th       that user can specify. For example, it is possible to create Nelder
iteration. Y N+1 must be feasible at each iteration, i.e. the objective   and Mead algorithm (NM) using basic modules, and this would
functional should decrease: F(X N+1 ) < (X N ). Applying expection        be an example of the most specific algorithm. It is also possible
to (12) and presenting E[YN+1 asconditional expectation Ex E[Y /X]        to create Stochastic Extention of NM (more generic than classic
we get:                                                                   NM, similar to Simplicial Homology Global Optimisation [ESF]
               XN+1 = E[XN ] + αN+1 EX N E[Y N+1 /X N ]                   method) and with certain settings of adjustable parameters it may
                                                                          work identical to classic NM. Library repository may be found
Replacing mathematical expectations E[XN ] and YN+1 ] with their
estimates E^{N+1} and y(X^N), we get:

    E^{N+1} = E^N + α_{N+1} E_{X^N}[y(X^N)]

Note that the expression for y(X^N) was obtained in the previous section up to certain parameters. By setting these parameters to particular values we can obtain stochastic extensions of well-known heuristics such as the Nelder-Mead algorithm or the Covariance Matrix Adaptation Evolution Strategy. In the minpy library we use several common building blocks to create different algorithms. Customized algorithms may be defined by combining these common blocks and varying their parameters.
    Main building blocks include computing the center of mass of the sample points and finding the Newtonian potential.

Key takeaways, example algorithm, and code organization

Many industry professionals and researchers use mathematical optimization packages to search for better solutions to their problems. Examples of such problems include minimization of free energy in physical systems [FW], robot gait optimization [PHS], designing materials for 3D printing [ZM], [TMAACBA], wine production [CTC], [CWC], and optimizing chemical reactions [VNJT]. These problems may involve "black box optimization", where the structure of the objective function is unknown and is revealed through a small sequence of expensive trials. Software implementations of these methods are becoming more user friendly. As a rule, however, certain modeling skills are needed to formulate a real-world problem in a way suitable for applying a software package. Moreover, selecting an optimization method appropriate for the model is a challenging task. Our educational software helps users of such optimization packages and may be considered a companion to them. The focus of our software is on transparency of the methods rather than on efficiency. A principal benefit of our software is the unified approach to constructing algorithms, whereby any particular algorithm is obtained from the generalized algorithm by changing certain parameters. Well-known heuristic algorithms such as the Nelder-Mead (NM) algorithm may be obtained using this generalized approach, as well as new algorithms. Although some derivative-free optimization packages (the MATLAB Global Optimization Toolbox, TensorFlow Probability optimizers, Excel Evolutionary Solver, the scikit-learn Stochastic Gradient Descent class, the scipy.optimize.shgo method) put considerable effort into transparency and educational value, they do not have the same level of flexibility and generality as our system. An example of educational-only optimization software is [SAS]; it is limited to teaching Particle Swarm Optimization.
here: https://github.com/nadiakap/MinPy_edu

The following algorithms demonstrate steps similar to those of the Nelder-Mead (NM) algorithm but select only points with objective function values smaller than or equal to the mean level of the objective function. Such an improvement to NM assures its convergence [KPP]. Unlike NM, they are derived from the generic approach. The first variant (NM-stochastic) resembles NM but corrects some of its drawbacks, and the second variant (NM-nonlocal) has some similarity to random search as well as to NM and helps to resolve some other issues of the classical NM algorithm.

Steps of NM-stochastic:

   1) Initialize the search by generating K ≥ n separate realizations u_0^i, i = 1, ..., K, of the random vector U_0, and set m_0 = (1/K) Σ_{i=1}^{K} u_0^i.
   2) On step j = 1, 2, ...

      a. Compute the mean level

             c_{j-1} = (1/K) Σ_{i=1}^{K} f(u_{j-1}^i)

      b. Calculate the new set of vertices:

             u_j^i = m_{j-1} + ε_{j-1} (f(u_{j-1}^i) - c_{j-1}) (m_{j-1} - u_{j-1}^i) / ||m_{j-1} - u_{j-1}^i||^n

      c. Set m_j = (1/K) Σ_{i=1}^{K} u_j^i.
      d. Adjust the step size ε_{j-1} so that f(m_j) < f(m_{j-1}). If a suitable ε_{j-1} cannot be obtained within the specified number of trials, set m_j = m_{j-1}.
      e. Use the sample standard deviation as the termination criterion:

             D_j = ( (1/(K-1)) Σ_{i=1}^{K} (f(u_j^i) - c_j)^2 )^{1/2}

Note that classic simplex search methods do not use values of the objective function to calculate reflection/expansion/contraction coefficients. Those coefficients are the same for all vertices, whereas in NM-stochastic the distance each vertex travels depends on the difference between its objective function value and the average value across all vertices, f(u_j^i) - c_j. NM-stochastic shares the following drawbacks with classic simplex methods: (a) the simplex may collapse into a nearly degenerate figure, and the usually proposed remedy is to restart the simplex every once in a while; (b) only the initial vertices are randomly generated, and the path of all subsequent vertices is deterministic. The next variant of the algorithm (NM-nonlocal) maintains the randomness of the vertices on each step, while adjusting the distribution of U_0 to mimic the pattern of the modified vertices. The corrected algorithm has much higher exploration power than the first algorithm (similar to the exploration power of random search algorithms) and has the exploitation power of direct-search algorithms.
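To make the listing concrete, the following NumPy sketch implements the NM-stochastic steps as written above. It is an illustrative reading of the algorithm, not code from the minpy library (see the repository linked above); the function name, the halving rule used to adjust ε, and the default parameter values are our own choices.

    import numpy as np

    def nm_stochastic(f, n, K=None, eps=0.5, max_trials=20, tol=1e-8, max_steps=500, seed=0):
        """Illustrative sketch of the NM-stochastic steps listed above."""
        rng = np.random.default_rng(seed)
        K = K if K is not None else n + 1            # K >= n realizations of U_0
        u = rng.standard_normal((K, n))              # step 1: initial vertices u_0^i
        m = u.mean(axis=0)                           # m_0 = (1/K) sum_i u_0^i
        for _ in range(max_steps):                   # step 2: j = 1, 2, ...
            fu = np.array([f(ui) for ui in u])
            c = fu.mean()                            # (a) mean level
            d = m - u                                # directions toward the center of mass
            scale = np.linalg.norm(d, axis=1) ** n + 1e-12
            step, f_m = eps, f(m)
            for _ in range(max_trials):              # (d) shrink the step until f improves
                u_new = m + step * (fu - c)[:, None] * d / scale[:, None]   # (b) new vertices
                m_new = u_new.mean(axis=0)           # (c) new center of mass
                if f(m_new) < f_m:
                    u, m = u_new, m_new
                    break
                step *= 0.5                          # otherwise keep m_{j-1}
            D = np.sqrt(np.sum((fu - c) ** 2) / (K - 1))   # (e) sample standard deviation
            if D < tol:
                break
        return m

    # Example: minimize a shifted sphere function in five dimensions.
    x_best = nm_stochastic(lambda x: float(np.sum((x - 3.0) ** 2)), n=5)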

Steps of NM-nonlocal:

   1) Choose a starting point x_0 and set m_0 = x_0.
   2) On step j = 1, 2, ..., obtain K separate realizations u_j^i, i = 1, ..., K, of the random vector U_j.

      a. Compute f(u_j^i), i = 1, ..., K, and the sample mean level

             c_j = (1/K) Σ_{i=1}^{K} f(u_j^i)

      b. Generate the new estimate of the mean:

             m_j = m_{j-1} + ε_j (1/K) Σ_{i=1}^{K} (f(u_j^i) - c_j) (m_{j-1} - u_j^i) / ||m_{j-1} - u_j^i||^n

         Adjust the step size ε_j so that f(m_j) < f(m_{j-1}). If a suitable ε_j cannot be obtained within the specified number of trials, set m_j = m_{j-1}.

      c. Use the sample standard deviation as the termination criterion:

             D_j = ( (1/(K-1)) Σ_{i=1}^{K} (f(u_j^i) - c_j)^2 )^{1/2}
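For comparison, one NM-nonlocal update can be sketched in the same style. Fresh realizations of U_j are drawn on every step, which is what gives this variant its higher exploration power; how U_j is sampled here (Gaussian around the current mean with spread sigma) and the helper's name are assumptions of this sketch, not part of the paper.

    import numpy as np

    def nm_nonlocal_update(f, m, rng, K, sigma, eps):
        """One step of the NM-nonlocal recursion above (illustrative sketch)."""
        n = m.size                                         # dimension, also the norm exponent
        u = m + sigma * rng.standard_normal((K, n))        # fresh realizations of U_j
        fu = np.array([f(ui) for ui in u])
        c = fu.mean()                                      # (a) sample mean level c_j
        d = m - u
        scale = np.linalg.norm(d, axis=1) ** n + 1e-12
        correction = np.mean((fu - c)[:, None] * d / scale[:, None], axis=0)
        return m + eps * correction                        # (b) new estimate of the mean

Between calls, ε_j would be reduced until f(m_j) < f(m_{j-1}) and the sampling distribution adjusted to mimic the accepted vertices, exactly as the steps above prescribe.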
REFERENCES

[KAP]      Kaplinskii, A.I., Pesin, A.M., Propoi, A.I. Analysis of search methods of optimization based on potential theory. I: Nonlocal properties. Automation and Remote Control, Volume 55, N.9, Part 2, September, pp. 1316-1323 (rus. pp. 97-105), 1994.
[KP]       Kaplinskii, A.I. and Propoi, A.I. Nonlocal optimization methods of the first order based on potential theory. Automation and Remote Control, Volume 55, N.7, Part 2, July, pp. 1004-1011 (rus. pp. 97-102), 1994.
[KPP]      Kaplinskii, A.I., Pesin, A.M., Propoi, A.I. Analysis of search methods of optimization based on potential theory. III: Convergence of methods. Automation and Remote Control, Volume 55, N.11, Part 1, November, pp. 1604-1610 (rus. pp. 66-72), 1994.
[NBAG]     Nikhil Bansal, Anupam Gupta. Potential-function proofs for gradient methods. Theory of Computing, Volume 15, Article 4, pp. 1-32, 2019. https://doi.org/10.4086/toc.2019.v015a004
[ATFB]     Adrien Taylor, Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. arXiv:1902.00947 [math.OC], 2019. https://doi.org/10.48550/arXiv.1902.00947
[ZALO]     Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. Innovations in Theoretical Computer Science Conference (ITCS), 2017, pp. 3:1-3:22. https://doi.org/10.4230/LIPIcs.ITCS.2017.3
[BIS]      Berthold Immanuel Schmitt. Convergence Analysis for Particle Swarm Optimization. FAU University Press, 2015.
[FGSB]     Juan Frausto-Solis, Ernesto Liñán-García, Juan Paulo Sánchez-Hernández, J. Javier González-Barbosa, Carlos González-Flores, Guadalupe Castilla-Valdez. Multiphase Simulated Annealing Based on Boltzmann and Bose-Einstein Distribution Applied to Protein Folding Problem. Advances in Bioinformatics, Volume 2016, Article ID 7357123. https://doi.org/10.1155/2016/7357123
[GLUQ]     Gong, G., Liu, Y., Qian, M. Simulated annealing with a potential function with discontinuous gradient on R^d. Sci. China Ser. A-Math. 44, 571-578, 2001. https://doi.org/10.1007/BF02876705
[PHS]      Valdez, S.I., Hernandez, E., Keshtkar, S. (2020). A Hybrid EDA/Nelder-Mead for Concurrent Robot Optimization. In: Madureira, A., Abraham, A., Gandhi, N., Varela, M. (eds) Hybrid Intelligent Systems. HIS 2018. Advances in Intelligent Systems and Computing, vol 923. Springer, Cham. https://doi.org/10.1007/978-3-030-14347-3_20
[FW]       Fan, Yi, Wang, Pengjun, Heidari, Ali Asghar, Chen, Huiling, Turabieh, Hamza, Mafarja, Majdi. Random reselection particle swarm optimization for optimal design of solar photovoltaic modules. Energy, Elsevier, vol. 239(PA), 2022. https://doi.org/10.1016/j.energy.2021.121865
[VNJT]     Fath, Verena, Kockmann, Norbert, Otto, Jürgen, Röder, Thorsten. Self-optimising processes and real-time-optimisation of organic syntheses in a microreactor system using Nelder–Mead and design of experiments. React. Chem. Eng., 2020, 5, 1281-1299. https://doi.org/10.1039/D0RE00081G
[ZM]       Plüss, T., Zimmer, F., Hehn, T., Murk, A. Characterisation and Comparison of Material Parameters of 3D-Printable Absorbing Materials. Materials 2022, 15, 1503. https://doi.org/10.3390/ma15041503
[TMAACBA]  Thoufeili Taufek, Yupiter H.P. Manurung, Mohd Shahriman Adenan, Syidatul Akma, Hui Leng Choo, Borhen Louhichi, Martin Bednardz, and Izhar Aziz. 3D Printing and Additive Manufacturing, 2022. http://doi.org/10.1089/3dp.2021.0197
[CTC]      Vismara, P., Coletta, R., Trombettoni, G. Constrained global optimization for wine blending. Constraints 21, 597–615 (2016). https://doi.org/10.1007/s10601-015-9235-5
[CWC]      Terry Hui-Ye Chiu, Chienwen Wu, Chun-Hao Chen. A Generalized Wine Quality Prediction Framework by Evolutionary Algorithms. International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 6, Nº7, 2021. https://doi.org/10.9781/ijimai.2021.04.006
[KHNT]     Pascal Kerschke, Holger H. Hoos, Frank Neumann, Heike Trautmann. Automated Algorithm Selection: Survey and Perspectives. Evol Comput 2019; 27(1): 3–45. https://doi.org/10.1162/evco_a_00242
[SAS]      Leandro dos Santos Coelho, Cezar Augusto Sierakowski. A software tool for teaching of particle swarm optimization fundamentals. Advances in Engineering Software, Volume 39, Issue 11, 2008, Pages 877-887, ISSN 0965-9978. https://doi.org/10.1016/j.advengsoft.2008.01.005
[ESF]      Endres, S.C., Sandrock, C., Focke, W.W. A simplicial homology algorithm for Lipschitz optimisation. J Glob Optim 72, 181–217 (2018). https://doi.org/10.1007/s10898-018-0645-y




       Temporal Word Embeddings Analysis for Disease
                       Prevention
Nathan Jacobi‡∗, Ivan Mo‡§, Albert You‡, Krishi Kishore‡, Zane Page‡, Shannon P. Quinn‡¶, Tim Heckman‖






Abstract—Human languages' semantics and structure constantly change over time through mediums such as culturally significant events. By viewing the semantic changes of words during notable events, contexts of existing and novel words can be predicted for similar, current events. By studying the initial outbreak of a disease and the associated semantic shifts of select words, we hope to be able to spot social media trends to prevent future outbreaks faster than traditional methods. To explore this idea, we generate a temporal word embedding model that allows us to study word semantics evolving over time. Using these temporal word embeddings, we use machine learning models to predict words associated with the disease outbreak.

Index Terms—Natural Language Processing, Word Embeddings, Bioinformatics, Social Media, Disease Prediction

∗ Corresponding author: Nathan.Jacobi@uga.edu
‡ Computer Science Department, University of Georgia
§ Linguistics Department, University of Georgia
¶ Cellular Biology Department, University of Georgia
‖ Public Health Department, University of Georgia

Copyright © 2022 Nathan Jacobi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction & Background

Human languages experience continual changes to their semantic structures. Natural language processing techniques allow us to examine these semantic alterations through methods such as word embeddings. Word embeddings provide low dimension numerical representations of words, mapping lexical meanings into a vector space. Words that lie close together in this vector space represent close semantic similarities [MCCD13]. This numerical vector space allows for quantitative analysis of semantics and contextual meanings, allowing for more use in machine learning models that utilize human language.
    We hypothesize that disease outbreaks can be predicted faster than traditional methods by studying word embeddings and their semantic shifts during past outbreaks. By surveying the context of select medical terms and other words associated with a disease during the initial outbreak, we create a generalized model that can be used to catch future similar outbreaks quickly. By leveraging social media activity, we predict similar semantic trends can be found in real time. Additionally, this allows novel terms to be evaluated in context without requiring a priori knowledge of them, allowing potential outbreaks to be detected early in their lifespans, thus minimizing the resultant damage to public health.
    Given a corpus spanning a fixed time period, multiple word embeddings can be created at set temporal intervals, which can then be studied to track contextual drift over time. However, a common issue in these so-called "temporal word embeddings" is that they are often unaligned, i.e. the embeddings do not lie within the same embedding space. Past proposed solutions to aligning temporal word embeddings require multiple separate alignment problems to be solved, or for "anchor words" – words that have no contextual shifts between times – to be used for mapping one time period to the next [HLJ16]. Yao et al. propose a solution to this alignment issue, shown to produce accurate and aligned temporal word embeddings, through solving one joint alignment problem across all time slices, which we utilize here [YSD+18].

Methodology

Data Collection & Pre-Processing

Our data set is a corpus D of over 7 million tweets collected from Scott County, Indiana from January 1st, 2014 until January 17th, 2017. The data was lent to us by Twitter after a data request and has not yet been made publicly available. During this time period, an HIV outbreak was taking place in Scott County, with an eventual 215 confirmed cases being linked to the outbreak [PPH+16]. Gonsalves et al. predict that an additional 126 undiagnosed HIV cases were linked to this same outbreak [GC18]. The state's response led to questioning whether the outbreak could have been stemmed or further prevented with an earlier response [Gol17]. Our corpus was selected with a focus on tweets related to the outbreak. By closely studying the semantic shifts during this outbreak, we hope to accurately predict similar future outbreaks before they reach large case numbers, allowing for a critical earlier response.
    To study semantic shifts through time, the corpus was split into 18 temporal buckets, each spanning a 2 month period. All data utilized in scripts was handled via the pandas Python package. The corpus within each bucket is represented by D_t, with t representing the temporal slice. Within each 2 month period, tweets were split into 12 pre-processed output csv files. Pre-processing steps first removed retweets, links, images, emojis, and punctuation. Common stop words were removed from the tweets using the NLTK Python package, and each tweet was tokenized. A vocabulary dictionary was then generated for each of the 18 temporal buckets, containing each unique word and a count of its occurrences within its respective bucket. The vocabulary dictionaries for each bucket were then combined into a global vocabulary dictionary, containing the total counts for each unique word across all 18 buckets.
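The pre-processing and vocabulary construction described above can be pictured with a short pandas/NLTK sketch. This is not the authors' pipeline (their code is linked in a footnote later in the paper); the csv column name, the regular expressions, and the retweet check are illustrative assumptions.

    import re
    from collections import Counter

    import pandas as pd
    from nltk.corpus import stopwords          # requires the NLTK 'stopwords' corpus
    from nltk.tokenize import word_tokenize    # requires the NLTK 'punkt' tokenizer data

    STOP_WORDS = set(stopwords.words("english"))

    def clean_tweet(text):
        """Strip links, punctuation, and other non-word characters from one tweet."""
        text = re.sub(r"http\S+", " ", text.lower())     # links
        return re.sub(r"[^a-z0-9#@\s]", " ", text)       # punctuation, emoji, symbols

    def bucket_vocabulary(csv_path, text_column="text"):
        """Tokenize one pre-processed bucket file and count each unique word."""
        counts = Counter()
        for raw in pd.read_csv(csv_path)[text_column].dropna().astype(str):
            if raw.lower().startswith("rt "):            # drop retweets
                continue
            tokens = [t for t in word_tokenize(clean_tweet(raw)) if t not in STOP_WORDS]
            counts.update(tokens)
        return counts

    # The per-bucket dictionaries are then merged into the global vocabulary:
    # global_vocab = sum((bucket_vocabulary(p) for p in bucket_csv_paths), Counter())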

Our experiments utilized two vocabulary dictionaries: the first being the 10,000 most frequently occurring words from the global vocabulary, for ensuring proper generation of embedding vectors, and the second being a combined vocabulary of 15,000 terms, including our target HIV/AIDS related terms. This combined vocabulary consisted of the top 10,000 words across D as well as an additional 473 HIV/AIDS related terms that occurred at least 8 times within the corpus. The 10,000th most frequent term in D occurred 39 times, so to ensure results were not influenced by sparsity in the less frequent HIV/AIDS terms, 4,527 randomly selected terms with occurrences between 10 and 25 times were added to the vocabulary, bringing it to a total of 15,000 terms. The HIV/AIDS related terms came from a list of 1,031 terms we compiled, primarily coming from the U.S. Department of Veterans Affairs published list of HIV/AIDS related terms, and other terms we thought were pertinent to include, such as HIV medications and terms relating to sexual health [Aff05].

Temporally Aligned Vector Generation

Generating word2vec embeddings is typically done through 2 primary methods: continuous bag-of-words (CBOW) and skip-gram; however, many other models exist [MCCD13]. Our methods use a CBOW approach to generating embeddings, which generates a word's vector embedding based on the context the word appears in, i.e. the words in a window range surrounding the target word. Following pre-processing of our corpus, steps for generating word embeddings were applied to each temporal bucket. For each time bucket, co-occurrence matrices were first created, with a window size w = 5. These matrices contained the total occurrences of each word against every other word within a window range L of 5 words within the corpus at time t. Each co-occurrence matrix was of dimensions |V| × |V|. Following the generation of each of these co-occurrence matrices, a |V| × |V| dimensioned Positive Pointwise Mutual Information matrix was calculated. The value in each cell was calculated as follows:

    PPMI(t, L)_{w,c} = max{PMI(D_t, L)_{w,c}, 0},

where w and c are two words in V. Embeddings generated by word2vec can be approximated by PMI matrices, where given embedding vectors utilize the following equation [YSD+18]:

    u_w^T u_c ≈ PMI(D, L)_{w,c}

Each embedding u has a reduced dimensionality d, typically around 25 - 200. Each PPMI from our data set is created independently from each other temporal bucket. After these PPMI matrices are made, temporal word embeddings can be created using the method proposed by Yao et al. [YSD+18]. The proposed solution focuses on the equation:

    U(t) U(t)^T ≈ PPMI(t, L),

where U(t) is the set of embeddings from time period t. Decomposing each PPMI(t) will yield embedding U(t); however, each U(t) is not guaranteed to be in the same embedding space. Yao et al. derive U(t)A = B with the following equations [YSD+18]:

    A = U(t)^T U(t) + (γ + λ + 2τ) I,
    B = Y(t) U(t) + γ U(t) + τ (U(t-1) + U(t+1))

    1. All code used can be found here: https://github.com/quinngroup/Twitter-Embedding-Analysis/
    2. γ represents the forcing regularizer, λ the Frobenius norm regularizer, and τ the smoothing regularizer.
    3. Y(t) represents PPMI(t).
    4. The original equation uses W(t), but this acts as identical to U(t) in the code. We replaced it here to improve readability.
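To make the notation concrete, the sketch below builds a PPMI matrix from a co-occurrence matrix and performs one solve of U(t)A = B using the definitions of A and B above. It is a schematic reading of the equations, not the implementation from footnote 1; the regularizer values are placeholders.

    import numpy as np
    from scipy import linalg

    def ppmi(cooc):
        """Positive pointwise mutual information from a |V| x |V| co-occurrence matrix."""
        total = cooc.sum()
        word = cooc.sum(axis=1, keepdims=True)
        context = cooc.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(cooc * total / (word * context))
        pmi[~np.isfinite(pmi)] = 0.0
        return np.maximum(pmi, 0.0)                      # PPMI = max(PMI, 0)

    def update_U(Y_t, U_t, U_prev, U_next, gamma=1.0, lam=1.0, tau=1.0):
        """Solve U(t)A = B for U(t), with A and B as defined above (one update)."""
        d = U_t.shape[1]
        A = U_t.T @ U_t + (gamma + lam + 2 * tau) * np.eye(d)
        B = Y_t @ U_t + gamma * U_t + tau * (U_prev + U_next)
        return linalg.solve(A.T, B.T).T                  # U(t) = B A^{-1}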
                                                                                  unsupervised k-means clustering and a supervised neural network.
  1. All code used can be found here https://github.com/quinngroup/Twitter-
Embedding-Analysis/                                                               K-means Clustering
  2. γ represents the forcing regularizer. λ represents the Frobenius norm
regularizer. τ represents the smoothing regularizer.
                                                                                  To examine any similarities within shifts, k-means clustering was
  3. Y(t) represents PPMI(t).                                                     performed on the data sets at first. Initial attempts at k-means with
  4. The original equation uses W(t), but this acts as identical to U(t) in the   the 100 dimensional embeddings yielded extremely large inertial
code. We replaced it here to improve readability.                                 values and poor results. In an attempt to reduce inertia, features

In an attempt to reduce this inertia, the features for the data that k-means would be performed on were reassessed. K-means was performed on a reduced dimensionality data set, with embedding vectors of dimensionality d = 10; however, this led to strict convergence and poor results again. The data set with the change in an embedding's vector, data_121, continued to contain the changes of vectors between each time bucket and its next. However, rather than the 10 dimensional position vectors for both time buckets, 2 dimensional positions were used instead, generated by UMAP from the 10 dimensional vectors. The second data set, data_201, always led to strict convergence on clustering, even when reduced to just the 10 dimensional representations. Therefore, k-means was performed explicitly on the data_121 set, with the 2 dimensional representations alongside the 100 dimensional change in the vectors. Separate two dimensional UMAP representations were generated for use as a feature and for visual examination. The data set also did not have the term's label listed as a feature for clustering.
    Inertia at convergence on clustering for k-means was reduced significantly, by as much as 86%, after features were reassessed, yielding significantly better results. Following the clustering, the results were analyzed to determine which clusters contained higher than average incidence rates of medical terms and HIV/AIDS related terms. These clusters can then be considered target clusters, and large incidences of words being clustered within them can be flagged as indicative of a possible outbreak.

Neural Network Predictions

In addition to the k-means model, we created a neural network model for binary classification of our terms. Our target class was terms that we hypothesized were closely related to the HIV epidemic in Scott County, i.e. any word in our HIV terms list. Several iterations with varying numbers of layers, activation functions, and nodes within each layer were attempted to maximize performance. Each model used an 80% training, 20% testing split on these data, with two variations performed of this split on training and testing data. The first was randomly splitting all 255,000 observations, without regard to some observations for a term being in the training set and some being in the testing set. This split of data will be referred to as "mixed" data, as the terms are mixed between the splits. The second split divided the 15,000 words into 80% training and 20% testing. After the vocabulary was split, the corresponding observations in the data were split accordingly, leaving all observations for each term within the same split. Additionally, we tested a neural network that would accept the same data as input, either data_201 or data_121, with the addition of the label assigned to that observation by the k-means model as a feature. The goal of these models, in addition to correctly identifying terms we classified as related to the outbreak, was to discover new terms that shift in ways similar to the HIV terms we labeled.
    The neural network model used was four layers, with three ReLU layers of 128, 256, and 256 neurons, followed by a single-neuron sigmoid output layer. This neural network was constructed using the Keras module of the TensorFlow library. The main difference between the models was the input data itself. The input data were data_201 with and without k-means labels, and data_121 with and without k-means labels. On each of these, there were two splits of the training and testing data, as in the previously mentioned "mixed" terms. Parameters of the neural network layers were adjusted, but results did not improve significantly across the data sets. All models were trained with a varying number of epochs: 50, 100, 150, and 200. Additionally, several certainty thresholds for a positive classification were tested on each of the models. The best results from each will be listed in the results section. As we begin implementation of these models on other HIV outbreak related data sets, the proper certainty thresholds can be better determined.

Results

Analysis of Embeddings

To ensure accuracy in the word embeddings generated by this model, we utilized word2vec (w2v), a proven neural network method of embeddings [MCCD13]. For each temporal bucket, a static w2v embedding of d = 100 was generated to compare to the temporal embedding generated from the same bucket. These vectors were generated from the same corpus as the ones generated by the dynamic model. As the vectors do not lie within the same embedding space, the vectors cannot be directly compared. As the temporal embeddings generated by the alignment model are influenced by other temporal buckets, we hypothesize notably different vectors. Methods for testing quality in [YSD+18] rely on a semi-supervised approach: the corpus used is an annotated set of New York Times articles, and the sections (Sports, Business, Politics, etc.) are given alongside the text and can be used to assess the strength of an embedding. Additionally, that corpus spans over 20 years, allowing for metrics such as checking the closest word to leaders or titles, such as "president" or "NYC mayor", throughout time. These methods show that this dynamic word embedding alignment model yields accurate results.
    Major differences can be attributed to the word2vec model only being given a section of the corpus at a time, while our model had access to the entire corpus across all temporal buckets. Terms that might not have appeared in the given time bucket might still appear in the embeddings generated by our model, but not at all within the word2vec embeddings. For example, most embeddings generated by the word2vec model did not often have hashtagged terms in their top 10 closest terms, while embeddings generated by our model often did. As hashtagged terms are very related to ongoing events, keeping these terms can give useful information about this outbreak. Modern hashtagged terms will likely be the most common novel terms that we have no prior knowledge of, and we hypothesize that these terms will be relevant to ongoing outbreaks.
    Given that our corpus spans a significantly shorter time period than the New York Times set, and does not have annotations, we use existing baseline data sets of word similarities. We evaluated the accuracy of both models' vectors using baseline sources for the semantic similarity of terms. The first source used was SimLex-999, which contains 999 word pairings, with corresponding human generated similarity scores on a scale of 0-10, where 10 is the highest similarity [HRK15]. Cosine similarities for each pair of terms in SimLex-999 were calculated for both the w2v model vectors as well as the vectors generated by the dynamic model for each temporal bucket. Pairs containing terms that were not present in the model generated vectors were omitted from that model's similarity measurements. The cosine similarities were then compared to the assigned SimLex scores using Spearman's rank correlation coefficient. The results of this baseline can be seen in Table 1. The Spearman's coefficient of both sets of embeddings, averaged across all 18 temporal buckets, was .151334 for the w2v vectors and .15506 for the dynamic word embedding (dwe) vectors. The dwe vectors slightly outperformed the w2v baseline in this test of word similarities.



                          Time Bucket   w2v Score (MEN)   dwe Score (MEN)   Difference (MEN)   w2v Score (SL)   dwe Score (SL)   Difference (SL)

                          0          0.437816     0.567757     0.129941     0.136146     0.169702     0.033556
                          1          0.421271     0.561996     0.140724     0.131751     0.167809     0.036058
                          2          0.481644     0.554162     0.072518     0.113067     0.165794     0.052727
                          3          0.449981     0.543395     0.093413     0.137704     0.163349     0.025645
                          4          0.360462     0.532634     0.172172     0.169419     0.158774     -0.010645
                          5          0.353343     0.521376     0.168032     0.133773     0.157173     0.023400
                          6          0.365653     0.511323     0.145669     0.173503     0.154299     -0.019204
                          7          0.358100     0.502065     0.143965     0.196332     0.152701     -0.043631
                          8          0.380266     0.497222     0.116955     0.152287     0.154338     0.002051
                          9          0.405048     0.496563     0.091514     0.149980     0.148919     -0.001061
                          10         0.403719     0.499463     0.095744     0.145412     0.142114     -0.003298
                          11         0.381033     0.504986     0.123952     0.181667     0.141901     -0.039766
                          12         0.378455     0.511041     0.132586     0.159254     0.144187     -0.015067
                          13         0.391209     0.514521     0.123312     0.145519     0.147816     0.002297
                          14         0.405100     0.519095     0.113995     0.151422     0.152477     0.001055
                          15         0.419895     0.522854     0.102959     0.117026     0.154963     0.037937
                          16         0.400947     0.524462     0.123515     0.158833     0.157687     -0.001146
                          17         0.321936     0.525109     0.203172     0.170925     0.157068     -0.013857
                          Average    0.395326     0.522779     0.127453     0.151334     0.155059     0.003725



TABLE 1: Spearman's correlation coefficients for w2v vectors and dynamic word embedding (dwe) vectors for all 18 temporal buckets against the MEN and SimLex-999 (SL) word pair data sets.




                                Fig. 1: 2 Dimensional Representation of Embeddings from Time Bucket 0.
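The benchmark behind Table 1 compares cosine similarities from each embedding against human similarity judgments using Spearman's rank correlation. The sketch below shows that computation in outline; it is not the authors' evaluation script, and the (word1, word2, score) pair format is an assumption about how the SimLex-999 and MEN files are parsed. Pairs with out-of-vocabulary terms are skipped, as described in the text.

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def benchmark_correlation(embeddings, pairs):
        """Spearman's rho between human scores and embedding cosine similarities.

        embeddings: dict mapping word -> vector for one temporal bucket
        pairs: iterable of (word1, word2, human_score) tuples from SimLex-999 or MEN
        """
        human, model = [], []
        for w1, w2, score in pairs:
            if w1 in embeddings and w2 in embeddings:    # skip out-of-vocabulary pairs
                human.append(score)
                model.append(cosine(embeddings[w1], embeddings[w2]))
        rho, _ = spearmanr(human, model)
        return rho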




                               Fig. 2: 2 Dimensional Representation of Embeddings from Time Bucket 17.


However, it should be noted that these Spearman's coefficients are very low compared to baselines such as in [WWC+19], where the average Spearman's coefficient amongst common models was .38133 on this data set of words. These models, however, were trained on a corpus generated from Wikipedia pages (wiki2010). The lower Spearman's coefficients can likely be attributed to our corpus. In 2014-2017, when this corpus was generated, Twitter had a 140 character limit on tweets. The limited characters have been shown to affect users' language within their tweets [BTKSDZ19], possibly affecting our embeddings. Boot et al. show that Twitter increasing the character limit to 280 characters in 2017 impacted the language within tweets. As we test this pipeline on more Twitter data from various time intervals, the character increase in 2017 is something to keep in mind.
    The second source of baseline was the MEN Test Collection, containing 3,000 pairs with similarity scores of 0-50, with 50 being the most similar [BTB14]. Following the same methodology for assessing the strength of embeddings as we did for the SimLex-999 set, the Spearman's coefficients from this set yielded much better results than from the SimLex-999 set. The average of the Spearman's coefficients, across all 18 temporal buckets, was .39532 for the w2v embeddings and .52278 for the dwe embeddings. The dwe significantly outperformed the w2v baseline on this set, but still did not reach the average correlation of .7306 that other common models achieved in the baseline tests in [WWC+19].
    Two dimensional representations of the embeddings, generated by UMAP, can be seen in Figure 1 and Figure 2. Figure 1 represents the embedding generated for the first time bucket, while Figure 2 represents the embedding generated for the final time bucket. These UMAP representations use cosine distance as their metric rather than Euclidean distance, leading to denser clusters and more accurate representations of nearby terms within the embedding space. The section of terms outlying from the main grouping appears to be terms that do not appear often within that temporal cluster itself, but may appear several times in a later temporal bucket. Figure 1 contains a zoomed in view of this outlying group, as well as a subgrouping on the outskirts of the main group containing food related terms. The majority of these terms are ones that would likely be hashtagged frequently during a brief time period within one temporal bucket. These terms are still relevant to study, as hashtagged terms that appear frequently for a brief period of time are most likely strongly attached to an ongoing event. In future iterations, the length of each temporal bucket will be decreased, hopefully giving more temporal buckets access to terms that currently appear within only one.

K-Means Clustering Results

The results of the k-means clustering can be seen below in Figures 4 and 5. Figure 4 shows the results of k-means clustering with the corresponding 2 dimensional UMAP positions generated from the 10 dimensional vectors that were used as features in the clustering.

      Cluster     All Words     HIV Terms           Difference

      0           0.173498      0.287048            0.113549
      1           0.231063      0.238876            0.007814
      2           0.220039      0.205600            -0.014440
      3           0.023933      0.000283            -0.023651
      4           0.108078      0.105581            -0.002498
      5           0.096149      0.084276            -0.011873
      6           0.023525      0.031391            0.007866
      7           0.123714      0.046946            -0.076768



TABLE 2: Distribution of HIV terms and all terms within k-means
clusters
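For reference, the clustering step that produced Table 2 can be sketched as follows. The feature layout (the 100 dimensional shift plus 2 dimensional UMAP positions), the cosine metric, and k = 8 follow the description and the table above; the function name, seeds, and other details are illustrative assumptions rather than the authors' code.

    import numpy as np
    import umap                                    # umap-learn package
    from sklearn.cluster import KMeans

    def cluster_shifts(delta_100d, init_10d, fin_10d, n_clusters=8, seed=42):
        """Cluster per-bucket semantic shifts: 100-d deltas plus 2-d UMAP positions."""
        def to_2d(X):
            return umap.UMAP(n_components=2, metric="cosine", random_state=seed).fit_transform(X)
        features = np.hstack([delta_100d, to_2d(init_10d), to_2d(fin_10d)])
        km = KMeans(n_clusters=n_clusters, random_state=seed)
        labels = km.fit_predict(features)          # cluster assignment per observation
        return labels, km.inertia_                 # inertia is the quantity discussed above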




Fig. 3: Bar graph showing k-means clustering distribution of HIV terms against all terms.

Fig. 4: Results of k-means clustering shown over the 2 dimensional UMAP representation of the 10 dimensional embeddings.

Fig. 5: Results of k-means clustering shown over the 2 dimensional UMAP representation of the full data set.

Figure 5 shows the results of k-means clustering with the corresponding 2 dimensional UMAP representation of the entire data set used in clustering. The k-means clustering revealed semantic shifts of HIV related terms being clustered with higher incidence than other terms in one cluster. Incidence rates for all terms and for HIV terms in each cluster can be seen in Table 2 and Figure 3. This increased incidence rate of HIV related terms in certain clusters leads us to hypothesize that semantic shifts of terms in future data sets can be clustered using the same k-means model and analyzed to search for outbreaks. Clustering of terms in future data sets can be compared to these clustering results, and similarities between the data can be recognized.

Neural Network Results

The neural network models we generated showed promising results on classification of HIV related terms. The goal of the models was to identify and discover terms surrounding the HIV outbreak; therefore, we were not concerned about the rate of false positive terms. False positive terms likely had semantic shifts very similar to the HIV related terms, and therefore can be related to the outbreak. These terms can be labeled as potentially HIV related while studying future data sets, which can aid in identifying whether an outbreak is ongoing during the time the tweets in the corpus were posted. We looked for a balance of finding false positive terms without lowering our certainty threshold so far as to include too many terms. Results of the testing data for the data_201 set can be seen in Table 3, and results of the testing data for the data_121 set can be seen in Table 4. The certainty threshold for the unmixed split in both sets was .01, and .1 for the mixed split in both sets. The difference in certainty thresholds was due to any mixed term data set having an extremely large number of false positives at .01, but more reasonable results at .1.
    These results show that classification of terms surrounding the Scott County HIV outbreak is achievable, but the model will need to be refined on more data. It can be seen that the mixed term split of data led to a high rate of true positives; however, it quickly became much more specific to terms outside of our target class at higher epochs, with false positives dropping to lower rates. Additionally, accuracy on data_201 begins to increase between the 150 and 200 epoch models for the unmixed split, so even higher epoch models might improve results further for the unmixed split. Outliers, such as the true positives in data_121 with 100 epochs without k-means labels, can be explained by the certainty threshold. If the certainty threshold had been .05 for that model, there would have been 86 true positives and 1,129 false positives. A precise certainty threshold can be found as we test this model on other HIV related data sets and control data sets.

                      With K-Means Label                                          Without K-Means Label

             Epochs   Accuracy Precision   Recall   TP     FP      TN      FN     Accuracy Precision   Recall   TP     FP      TN      FN
             50       0.9589 0.0513        0.0041   8      148     48897   1947   0.9571 0.1538        0.0266   52     286     48759   1903
             100      0.9589 0.0824        0.0072   14     156     48889   1941   0.9608 0.0893        0.0026   5      51      48994   1950
             150      0.6915 0.0535        0.4220   825    14602   34443   1130   0.7187 0.0451        0.3141   614    13006   36039   1341
             200      0.7397 0.0388        0.2435   476    11797   37248   1479   0.7566 0.0399        0.2317   453    10912   38133   1502
             50Mix    0.9881 0.9107        0.7967   1724   169     48667   440    0.9811 0.9417        0.5901   1277   79      48757   887
             100Mix   0.9814 0.9418        0.5980   1294   80      48756   870    0.9823 0.9090        0.6465   1399   140     48696   765
             150Mix   0.9798 0.9595        0.5471   1184   50      48786   980    0.9752 0.9934        0.4191   907    6       48830   1257
             200Mix   0.9736 0.9846        0.3835   830    13      48823   1334   0.9770 0.9834        0.4658   1008   17      48819   1156



TABLE 3: Results of the neural network run on the data_201 set. The epochs column shows the number of training epochs on the models, as
well as if the words were mixed between the training and testing data, denoted by "Mix".

                      With K-Means Label                                          Without K-Means Label

             Epochs   Accuracy Precision   Recall   TP     FP      TN      FN     Accuracy Precision   Recall   TP     FP      TN      FN
             50       0.9049 0.0461        0.0752   147    3041    46004   1808   0.9350 0.0652        0.0522   102    1463    47582   1853
             100      0.9555 0.1133        0.0235   46     360     48685   1909   0.8251 0.0834        0.3565   697    7663    41382   1258
             150      0.9554 0.0897        0.0179   35     355     48690   1920   0.9572 0.0957        0.0138   27     255     48790   1928
             200      0.9496 0.0335        0.0113   22     635     48410   1933   0.9525 0.0906        0.0266   52     522     48523   1903
             50Mix    0.9285 0.2973        0.5018   1086   2567    46269   1078   0.9487 0.4062        0.4501   974    1424    47412   1190
             100Mix   0.9475 0.3949        0.4464   966    1480    47356   1198   0.9492 0.4192        0.5134   1111   1539    47297   1053
             150Mix   0.9344 0.3112        0.4496   973    2154    46682   1191   0.9514 0.4291        0.4390   950    1264    47572   1214
             200Mix   0.9449 0.3779        0.4635   1003   1651    47185   1161   0.9500 0.4156        0.4395   951    1337    47499   1213



TABLE 4: Results of the neural network on the data_121 set. The epochs column shows the number of training epochs on the models, as well
as if the words were mixed between the training and testing data, denoted by "Mix".
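The classifier evaluated in Tables 3 and 4 is described as three ReLU layers (128, 256, and 256 neurons) followed by a single sigmoid output, built with Keras. A minimal sketch consistent with that description is shown below; the optimizer, loss, batch size, and thresholding helper are our assumptions, since the text does not specify them.

    import tensorflow as tf

    def build_classifier(n_features):
        """Three ReLU layers (128, 256, 256) and a single-neuron sigmoid output."""
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(n_features,)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    def predict_with_threshold(model, X, threshold=0.01):
        """Flag a term as HIV-related when the sigmoid certainty exceeds the threshold."""
        return (model.predict(X) >= threshold).astype(int).ravel()

    # Usage sketch: X holds the data_201 or data_121 features (minus the label), y is the HIV-term label.
    # model = build_classifier(X_train.shape[1])
    # model.fit(X_train, y_train, epochs=150, batch_size=256, validation_split=0.1)
    # y_hat = predict_with_threshold(model, X_test, threshold=0.01)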


With enough experimentation and data, a data set can be run through our pipeline and a certainty of there being a potential HIV outbreak in the region the data originated from can be generated by a future model.

Conclusion

Our results prove promising, with high accuracy and decent recall on classification of HIV/AIDS related terms, as well as potentially discovering new terms related to the outbreak. Given more HIV related data sets and control data sets, we could begin examining and generating thresholds of what might be indicative of an outbreak. To improve results, metrics for our word2vec baseline model and statistical analysis could be further explored, as could the previously mentioned noise and biases in our data. Additionally, sparsity of data in earlier temporal buckets may lead to some loss of accuracy. Fine tuning hyperparameters of the alignment model through grid searching would likely further improve these results. We predict that given more data sets containing tweets from areas and times that had HIV/AIDS outbreaks similar to Scott County, as well as control data sets that are not directly related to an HIV outbreak, we could determine a threshold of words that would define a county as potentially undergoing an HIV outbreak. With a refined pipeline and model such as this, we hope to be able to begin biosurveillance to try to prevent future outbreaks.

Future Work

Case studies of previous data sets related to other diseases and collection of more modern tweets could not only provide critical insight into relevant medical activity, but also further strengthen and expand our model and its credibility. There is a large source of data potentially related to HIV/AIDS on Twitter, so finding and collecting this data would be a crucial first step. One potent example of data could be from the 220 United States counties determined by the CDC to be considered vulnerable to HIV and/or viral hepatitis outbreaks due to injection drug use, similar to the outbreak that occurred in Scott County [VHRH+16]. Our next data set being studied is tweets from Cabell County, West Virginia, from January of 2018 through 2020. During this time an HIV outbreak similar to the one that took place in Scott County in 2014 occurred [AMK20]. The end goal is to create a pipeline that can perform live semantic shift analysis at set intervals of time within these counties, and classify these shifts as they happen. A future model can predict whether or not the number of terms classified as HIV related is indicative of an outbreak. If enough terms classified by our model as potentially indicative of an outbreak are detected, or if this future model predicts a possible outbreak, public health officials can be notified and the severity of a possible outbreak can be mitigated if properly handled.
    Expansion into other social media platforms would increase the variety of data our model has access to, and therefore what our model is able to respond to. With the foundational model established, we will be able to focus on converting the data and addressing the differences between social networks (e.g. audience and online etiquette). Reddit and Instagram are two points of interest due to their increasing prevalence, as well as the vastness of available data.
    An idea for future implementation following the generation

of a generalized model would be creating a web application. The ideal audience would be medical officials and organizations, but even public or research use for trend prediction could be valuable. The application would give users the ability to pick from a given glossary of medical terms, defining their own set of significant words to run our model on. Our model would then expose any potential trends or insights for the given terms in contemporary data, allowing for quicker responses to activity. Customization of the data pool could also be a feature, where tweets and other social media posts are filtered to specified geographic regions or time windows, yielding more specific results.

Additionally, we would like to reassess our embedding model to try to improve the embeddings generated and our understanding of the semantic shifts. This project has been ongoing for several years, and new models, such as the use of bidirectional encoders, as in BERT [DCLT18], have proven to have high performance. BERT-based models have also been used for temporal embedding studies, such as in [LMD+ 19], a study focused on clinical corpora. We predict that updating our pipeline to match more modern methodology can lead to more effective disease detection.
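A minimal sketch of how such contextual embeddings might be extracted, assuming the Hugging Face transformers and torch packages (illustrative only; this is not the pipeline used in this work, and the model choice and pooling strategy are assumptions):

import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical example: mean-pooled BERT embeddings for a few HIV-related terms.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Return one mean-pooled vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

vectors = embed(["needle exchange", "hiv testing", "overdose"])
print(vectors.shape)   # torch.Size([3, 768])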



REFERENCES

[Aff05]    Veteran Affairs. Glossary of HIV/AIDS terms: Veterans affairs,
           Dec 2005. URL: https://www.hiv.va.gov/provider/glossary/
           index.asp.
[AMK20]    A Atkins, RP McClung, and M Kilkenny. Notes from the
           field: Outbreak of Human Immunodeficiency Virus infection
           among persons who inject drugs — Cabell County, West
           Virginia, 2018–2019. Morbidity and Mortality Weekly Report,
           69(16):499–500, 2020. doi:10.15585/mmwr.mm6916a2.
[BTB14]    Elia Bruni, Nam Khanh Tran, and Marco Baroni. Multimodal
           distributional semantics. J. Artif. Int. Res., 49(1):1–47, 2014.
           doi:10.1613/jair.4135.
[BTKSDZ19] Arnout Boot, Erik Tjon Kim Sang, Katinka Dijkstra, and
           Rolf Zwaan. How character limit affects language usage
           in tweets. Palgrave Communications, 5(76), 2019. doi:
           10.1057/s41599-019-0280-3.
[DCLT18]   Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
           Toutanova. BERT: Pre-training of deep bidirectional transform-
           ers for language understanding, 2018. doi:10.18653/v1/
           N19-1423.
[GC18]     Gregg S Gonsalves and Forrest W Crawford. Dynamics of
           the HIV outbreak and response in Scott County, IN, USA,
           2011–15: A modelling study. The Lancet HIV, 5(10), 2018.
           URL: https://pubmed.ncbi.nlm.nih.gov/30220531/.
[Gol17]    Nicholas J. Golding. The needle and the damage done: In-
           diana’s response to the 2015 HIV epidemic and the need to
           change state and federal policies regarding needle exchanges
           and intravenous drug users. Indiana Health Law Review,
           14(2):173, 2017. doi:10.18060/3911.0038.
[HLJ16]    William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Di-
           achronic word embeddings reveal statistical laws of seman-
           tic change. CoRR, abs/1605.09096, 2016. arXiv:1605.09096,
           doi:10.48550/arXiv.1605.09096.
[HRK15]    Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-
           999: Evaluating semantic models with (genuine) similarity
           estimation. Computational Linguistics, 41(4):665–695, 2015.
           doi:10.1162/COLI_a_00237.
[LMD+ 19]  Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard,
           and Guergana Savova. A BERT-based universal model for both
           within- and cross-sentence clinical temporal relation extraction.
           In Proceedings of the 2nd Clinical Natural Language Process-
           ing Workshop, pages 65–71. Association for Computational
           Linguistics, 2019. doi:10.18653/v1/W19-1908.
[MCCD13]   Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
           Efficient estimation of word representations in vector space,
           2013. doi:10.48550/ARXIV.1301.3781.
[PPH+ 16]   Philip J. Peters, Pamela Pontones, Karen W. Hoover, Monita R. Patel, Romeo R. Galang, Jessica Shields, Sara J. Blosser, Michael W. Spiller, Brittany Combs, William M. Switzer, et al. HIV infection linked to injection use of Oxymorphone in Indiana, 2014–2015. New England Journal of Medicine, 375(3):229–239, 2016. doi:10.1056/NEJMoa1515195.
[VHRH+ 16]  Michelle M. Van Handel, Charles E. Rose, Elaine J. Hallisey, Jessica L. Kolling, Jon E. Zibbell, Brian Lewis, Michele K. Bohm, Christopher M. Jones, Barry E. Flanagan, Azfar-E-Alam Siddiqi, et al. County-level vulnerability assessment for rapid dissemination of HIV or HCV infections among persons who inject drugs, United States. JAIDS Journal of Acquired Immune Deficiency Syndromes, 73(3):323–331, 2016. doi:10.1097/qai.0000000000001098.
[WWC+ 19]   Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, and C.-C. Jay Kuo. Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8(1), 2019. doi:10.1017/atsip.2019.12.
[YSD+ 18]   Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 673–681, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3159652.3159703.




 Design of a Scientific Data Analysis Support Platform
                                      Nathan Martindale‡∗, Jason Hite‡, Scott Stewart‡, Mark Adams‡






Abstract—Software data analytic workflows are a critical aspect of modern scientific research and play a crucial role in testing scientific hypotheses. A typical scientific data analysis life cycle in a research project must include several steps that may not be fundamental to testing the hypothesis, but are essential for reproducibility. This includes tasks that have analogs to software engineering practices, such as versioning code, sharing code among research team members, maintaining a structured codebase, and tracking associated resources such as software environments. Tasks unique to scientific research include designing, implementing, and modifying code that tests a hypothesis. This work refers to this code as an experiment, which is defined as a software analog to physical experiments.

A software experiment manager should support tracking and reproducing individual experiment runs, organizing and presenting results, and storing and reloading intermediate data on long-running computations. A software experiment manager with these features would reduce the time a researcher spends on tedious busywork and would enable more effective collaboration. This work discusses the necessary design features in more depth, some of the existing software packages that support this workflow, and a custom-developed open-source solution to address these needs.

Index Terms—reproducible research, experiment life cycle, data analysis support

* Corresponding author: martindalena@ornl.gov
‡ Oak Ridge National Laboratory

Copyright © 2022 Oak Ridge National Laboratory. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Introduction

Modern science increasingly uses software as a tool for conducting research and scientific data analyses. The growing number of libraries and frameworks facilitating this work has greatly lowered the barrier to usage, allowing more researchers to benefit from this paradigm. However, as a result of the dependence on software, there is a need for more thorough integration of sound software engineering practices with the scientific process. The fragility of complex environments containing heavily interconnected packages, coupled with a lack of provenance of the artifacts generated throughout the development of an experiment, increases the potential for long-term problems, undetected bugs, and failure to reproduce previous analyses.

Fundamentally, science revolves around the ability for others to repeat and reproduce prior published works, and this has become a difficult task with many computation-based studies. Often, scientists outside of a computer science field may not have training in software engineering best practices, or they may simply disregard them because the focus of a researcher is on scientific publications rather than the analysis software itself. Lack of documentation and provenance of research artifacts and frequent failure to publish repositories for data and source code have led to a crisis in reproducibility in artificial intelligence (AI) and other fields that rely heavily on computation [SBB13], [DMR+ 09], [Hut18]. One study showed that quantifiably few machine learning (ML) papers document specifics of how they ran their experiments [GGA18]. This gap between established practices from the software engineering field and how computational research is conducted has been studied for some time, and the problems that can stem from it are discussed at length in [Sto18].

To mitigate these issues, computation-based research requires better infrastructure and tooling [Pen11] as well as applying relevant software engineering principles [Sto18], [Dub05] to allow data scientists to ensure their work is effective, correct, and reproducible. In this paper we focus on the ability to manage reproducible workflows for scientific experiments and data analyses. We discuss the features that software to support this might require, compare some of the existing tools that address them, and finally present the open-source tool Curifactory, which incorporates the proposed design elements.

Related Work

Reproducibility of AI experiments has been separated into three different degrees [GK18]: Experiment reproducibility, or repeatability, refers to using the same code implementation with the same data to obtain the same results. Data reproducibility, or replicability, is when a different implementation with the same data outputs the same results. Finally, method reproducibility describes when a different implementation with different data is able to achieve consistent results. These degrees are discussed in [GGA18], comparing the implications and trade-offs on the amount of work for the original researcher versus an external researcher, and the degree of generality afforded by a reproduced implementation. A repeatable experiment places the greatest burden on the original researcher, requiring the full codebase and experiment to be sufficiently documented and published so that a peer is able to correctly repeat it. At the other end of the spectrum, method reproducibility demands the greatest burden on the external researcher, as they must implement and run the experiment from scratch. For the remainder of this paper, we refer
to "reproducibility" as experiment reproducibility (repeatability). Tooling that is able to assist with documentation and organization of a published experiment reduces the amount of work for the original researcher and still allows for the lowest level of burden on external researchers to verify and extend previous work.

In an effort to encourage better reproducibility based on datasets, the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles [WDA+ 16] were established. These principles recommend that data should have unique and persistent identifiers, use common standards, and provide rich metadata description and provenance, allowing both humans and machines to effectively parse them. These principles have been extended more broadly to software [LGK+ 20], computational workflows [GCS+ 20], and to entire data pipelines [MLC+ 21].

Various works have surveyed software engineering practices and identified practices that provide value in scientific computing contexts, including various forms of unit and regression testing, proper source control usage, formal verification, bug tracking, and agile development methods [Sto18], [Dub05]. In particular, [Sto18] described many concepts from agile development as being well suited to an experimental context, where the current knowledge and goals may be fairly dynamic throughout the project. They noted that although many of these techniques could be directly applied, some required adaptation to make sense in the scientific software domain.

Similar to this paper, two other works [DGST09], [WWG21] discuss sets of design aspects and features that a workflow manager would need. Deelman et al. describe the life cycle of a workflow as composition, mapping, execution, and provenance capture [DGST09]. A workflow manager must then support each of these aspects. Composition is how the workflow is constructed, such as through a graphical interface or with a text configuration file. Mapping and execution are determining the resources to be used for a workflow and then utilizing those resources to run it, including distributing to cloud compute and external representational state transfer (REST) services. This also refers to scheduling subworkflows/tasks to reuse intermediate artifacts as available. Provenance, which is crucial for enabling repeatability, is how all artifacts, library versions, and other relevant metadata are tracked during the execution of a workflow.

Wratten, Wilm, and Göke surveyed many bioinformatics pipeline and workflow management tools, listing the challenges that tooling should address: data provenance, portability, scalability, and re-entrancy [WWG21]. Provenance is defined the same way as in [DGST09], and they further state the need for generating reports that include the tracking information and metadata for the associated experiment run. Portability (allowing setup and execution of an experiment in a different environment) can be a challenge because of the dependency requirements of a given system and the ease with which the environment can be specified and reinitialized on a different machine or operating system. Scalability is especially important when large-scale data, many compute-heavy steps, or both are involved throughout the workflow. Scalability in a manager involves allowing execution on a high-performance computing (HPC) system or with some form of parallel compute. Finally, they mention re-entrancy, or the ability to resume execution of a compute step from where it last stopped, preventing unnecessary recomputation of prior steps.

One area of the literature that needs further discussion is the design of automated provenance tracking systems. Existing workflow management tools generally require source code modifications to take full advantage of all features. This can entail a significant learning curve and places additional burden on the researcher. To address this, some sources propose automatic documentation of experiments and code through static source code analysis [NFP+ 20], [Red19].

Beyond the preexisting body of knowledge about software engineering principles, other works [SNTH13], [KHS09] describe recommended rules and practices to follow when conducting computation-based research. These include avoiding manual data manipulation in favor of scripted changes, keeping detailed records of how results are produced (manual provenance), tracking the versions of libraries and programs used, and tracking random seeds. Many of these ideas can be assisted or encapsulated through appropriate infrastructure decisions, which is the premise on which this work bases its software reviews.

Although this paper focuses on the scientific workflow, a growing related field tackles many of the same issues from an industry standpoint: machine learning operations (MLOps) [Goy20]. MLOps, an ML-oriented version of DevOps, is concerned with supporting an entire data science life cycle, from data acquisition to deployment of a production model. Many of the same challenges are present: reproducibility and provenance are crucial in both production and research workflows [RMRO21]. Infrastructure, tools, and practices developed for MLOps may also hold value in the scientific community.

A taxonomy for ML tools that we reference throughout this work is from [QCL21], which describes a characterization of tools consisting of three primary categories: general, analysis support, and reproducibility support, each of which is further subdivided into aspects to describe a tool. For example, these subaspects include data visualization, web dashboard capabilities, experiment logging, and the interaction modes the tool supports, such as a command line interface (CLI) or application programming interface (API).

Design Features

We combine the two sets of capabilities from [DGST09] and [WWG21] with the taxonomy from [QCL21] to propose a set of six design features that are important for an experiment manager. These include orchestration, parameterization, caching, reproducibility, reporting, and scalability. The crossover between these proposed feature sets is shown in Table 1. We expand on each of these in more depth in the subsections below.

Orchestration

Orchestration of an experiment refers to the mechanisms used to chain and compose a sequence of smaller logical steps into an overarching pipeline. This provides a higher-level view of an experiment and helps abstract away some of the implementation details. Operation of most workflow managers is based on a directed acyclic graph (DAG), which specifies the stages/steps as nodes and the edges connecting them as their respective inputs and outputs. The intent with orchestration is to encourage designing distinct, reusable steps that can easily be composed in different ways to support testing different hypotheses or overarching experiment runs. This allows greater focus on the design of the experiments than on the implementation of the underlying functions that the experiments consist of. As discussed in the taxonomy [QCL21], pipeline creation can consist of a combination of scripts, configuration files, or a visual tool. This aspect falls within the composition capability discussed in [DGST09].
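As a toy illustration of this DAG-based orchestration idea (a generic sketch, not taken from any of the tools surveyed below), stages can declare named inputs and outputs, and a small runner can derive the execution order from those declarations:

from graphlib import TopologicalSorter

STAGES = {
    # name: (function, inputs, outputs)
    "load":  (lambda art: {"data": [3, 1, 2]},            [],        ["data"]),
    "clean": (lambda art: {"clean": sorted(art["data"])}, ["data"],  ["clean"]),
    "train": (lambda art: {"model": sum(art["clean"])},   ["clean"], ["model"]),
}

def run_pipeline(stages):
    # A stage depends on whichever stage produces each of its inputs.
    producers = {out: name for name, (_, _, outs) in stages.items() for out in outs}
    graph = {name: {producers[i] for i in ins} for name, (_, ins, _) in stages.items()}
    artifacts = {}
    for name in TopologicalSorter(graph).static_order():
        func = stages[name][0]
        artifacts.update(func(artifacts))   # each stage adds its outputs
    return artifacts

print(run_pipeline(STAGES))   # {'data': [3, 1, 2], 'clean': [1, 2, 3], 'model': 6}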

                       This work          [DGST09]             [WWG21]                   Taxonomy [QCL21]

                      Orchestration       Composition          —                         Reproducibility/pipeline creation
                      Parameterization    —                    —                         —
                      Caching             —                    Re-entrancy               —
                      Reproducibility     Provenance           Provenance, portability   Reproducibility
                      Reporting           —                    —                         Analysis/visualization, web dashboard
                      Scalability         Mapping, execution   Scalability               Analysis/computational resources



                                         TABLE 1: Comparing design features listed in various works.


Parameterization

Parameterization specifies how a compute pipeline is customized for a particular run by passing in configuration values to change aspects of the experiment. The ability to customize analysis code is crucial to conducting a compute-based experiment, providing a mechanism to manipulate a variable under test to verify or reject a hypothesis.

Conventionally, parameterization is done either through specifying parameters in a CLI call or by passing configuration files in a format like JSON or YAML. As discussed in [DGST09], parameterization sometimes consists of more complicated needs, such as conducting parameter sweeps or grid searches. There are libraries dedicated to managing parameter searches like this, such as hyperopt [BYC13] used in [RMRO21] (a minimal usage sketch appears at the end of this section).

Although not provided as a design capability in the other works, we claim the mechanisms provided for parameterization are important, as these mechanisms are the primary way to configure, modify, and vary experiment execution without explicitly changing the code itself or modifying hard-coded values. This means that a recorded parameter set can better "describe" an experiment run, increasing provenance and making it easier for another researcher to understand what pieces of an experiment can be readily changed and explored.

Some support is provided for this in [DGST09], stating that the necessity of running many slight variations on workflows sometimes leads to the creation of ad hoc scripts to generate the variants, which leads to increased complexity in the organization of the codebase. Improved mechanisms to parameterize the same workflow for many variants help to manage this complexity.

Caching

Refining experiment code and finding bugs is often a lengthy iterative process, and removing the friction of constantly rerunning all intermediate steps every time an experiment is wrong can improve efficiency. Caching values between each step of an experiment allows execution to resume at a certain spot in the pipeline, rather than starting from scratch every time. This is defined as re-entrancy in [WWG21].

In addition to increasing the speed of rerunning experiments and running new experiments that combine old results for analysis, caching is useful to help find and debug mistakes throughout an experiment. Cached outputs from each step allow manual interrogation outside of the experiment. For example, if a cleaning step was implemented incorrectly and a user noticed an invalid value in an output data table, they could use a notebook to load and manipulate the intermediate artifact tables for that data to determine what stage introduced the error and what code should be used to correctly fix it.

Reproducibility

Mechanisms for reproducibility are one of the most important features for a successful data analysis support platform. Reproducibility is challenging because of the complexity of constantly evolving codebases, complicated and changing dependency graphs, and inconsistent hardware and environments. Reproducibility entails two subcomponents: provenance and portability. This falls under the provenance aspect from [DGST09], both data provenance and portability from [WWG21], and the entire reproducibility support section of the taxonomy [QCL21].

Data provenance is about tracking the history, configuration, and steps taken to produce an intermediate or final data artifact. In ML this would include the cleaning/munging steps used and the intermediate tables created in the process, but provenance can apply more broadly to any type of artifact an experiment may produce, such as ML models themselves, or "model provenance" [SH18]. Applying provenance beyond just data is critical, as models may be sensitive to the specific sets of training data and conditions used to produce them [Hut18]. This means that everything required to directly and exactly reproduce a given artifact is recorded, such as the manipulations applied to its predecessors and all hyperparameters used within those manipulations.

Portability refers to the ability to take an experiment and execute it outside of the initial computing environment it was created in [WWG21]. This can be a challenge if all software dependency versions are not strictly defined, or when some dependencies may not be available in all environments. Minimally, allowing portability requires keeping explicit track of all packages and the versions used. A 2017 study [OBA17] found that even this minimal step is rarely taken. Another mechanism to support portability is the use of containerization, such as with Docker or Podman [SH18].

Reporting

Reporting is an important step for analyzing the results of an experiment, through visualizations, summaries, comparisons of results, or combinations thereof. As a design capability, reporting refers to the mechanisms available for the system to export or retrieve these results for human analysis. Although data visualization and analysis can be done manually by the scientist, tools to assist with making these steps easier and to keep results organized are valuable from a project management standpoint. Mechanisms for this might include a web interface for exploring individual or multiple runs. Under the taxonomy [QCL21], this falls primarily within analysis support, such as data visualization or a web dashboard.
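Returning briefly to the Parameterization subsection above, the parameter-search style supported by libraries such as hyperopt [BYC13] can be sketched as follows (a minimal, hypothetical search space and objective; not an example drawn from the cited works):

from hyperopt import fmin, hp, tpe

def objective(params):
    # Stand-in for training and evaluating a model; returns a loss to minimize.
    return (params["lr"] - 0.01) ** 2 + 1.0 / params["layers"]

space = {
    "lr": hp.loguniform("lr", -7, 0),          # lr sampled from exp(U(-7, 0))
    "layers": hp.choice("layers", [1, 2, 3]),  # discrete choice of layer count
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # note: for hp.choice, `best` reports the index of the chosen option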

Scalability

Many data analytic problems require large amounts of space and compute resources, often beyond what can be handled on an individual machine. To efficiently support running a large experiment, mechanisms for scaling execution are important and could include anything from supporting parallel computation on an experiment or stage level, to allowing the execution of jobs on remote machines or within an HPC context. This falls within both mapping and execution from [DGST09], the scalability aspect from [WWG21], and the computational resources category within the taxonomy [QCL21].

Existing Tools

A wide range of pipeline and workflow tools have been developed to support many of these design features, and some of the more common examples include DVC [KPP+ 22] and MLFlow [MLf22]. We briefly survey and analyze a small sample of these tools to demonstrate the diversity of ideas and their applicability in different situations. Table 2 compares the support of each design feature by each tool.

DVC

DVC [KPP+ 22] is a Git-like version control tool for datasets. Orchestration is done by specifying stages, or runnable script commands, either in YAML or directly on the CLI. A stage is specified with output file paths and input file paths as dependencies, allowing an implicit pipeline or DAG to form, representing all the processing steps. Parameterization is done by defining within a YAML file what the possible parameters are, along with the default values. When running the DAG, parameters can be customized on the CLI. Since inputs and outputs are file paths, caching and re-entrancy come for free, and DVC will intelligently determine if certain stages do not need to be re-computed.

A saved experiment or state is frozen into each commit, so all parameters and artifacts are available at any point. No explicit tracking of the environment (e.g., software versions and hardware info) is present, but this could be manually included by tracking it in a separate file. Reporting can be done by specifying per-stage metrics to track in the YAML configuration. The CLI includes a way to generate HTML files on the fly to render requested plots. There is also an external "Iterative Studio" project, which provides a live web dashboard to view continually updating HTML reports from DVC. For scalability, parallel runs can be achieved by queuing an experiment multiple times in the CLI.

MLFlow

MLFlow [MLf22] is a framework for managing the entire life cycle of an ML project, with an emphasis on scalability and deployment. It has no specific mechanisms for orchestration, instead allowing the user to intersperse MLFlow API calls in an existing codebase (a minimal sketch of this tracking pattern appears just before Table 2). Runnable scripts can be provided as entry points into a configuration YAML, along with the parameters that can be provided to them. Parameters are changed through the CLI. Although MLFlow has extensive capabilities for tracking artifacts, there are no automatic re-entrancy methods. Reproducibility is a strong feature, and provenance and portability are well supported. The tracking module provides provenance by recording metadata such as the Git commit, parameters, metrics, and any user-specified artifacts in the code. Portability is done by allowing the environment for an entry point to be specified as a Conda environment or Docker container. MLFlow then ensures that the environment is set up and active before running. The CLI even allows directly specifying a GitHub link to an mlflow-enabled project to download, set up, and then run the associated experiment. For reporting, the MLFlow tracking UI lets the user view and compare various runs and their associated artifacts through a web dashboard. For scalability, both distributed storage for saving/loading artifacts and execution of runs on distributed clusters are supported.

Sacred

Sacred [GKC+ 17] is a Python library and CLI tool to help organize and reproduce experiments. Orchestration is managed through the use of Python decorators, a "main" for experiment entry point functions and "capture" for parameterizable functions, where function arguments are automatically populated from the active configuration when called. Parameterization is done directly in Python through applying a config decorator to a function that assigns variables. Configurations can also be written to or read from JSON and YAML files, so parameters must be simple types. Different observers can be specified to automatically track much of the metadata, environment information, and current parameters, and within the code the user can specify additional artifacts and resources to track during the run. Each run will store the requested outputs, although there is no re-entrant use of these cached values. Portability is supported through the ability to print the versions of libraries needed to run a particular experiment. Reporting can be done through a specific type of observer, and the user can provide custom templated reports that are generated at the end of each run.

Kedro

Kedro [ABC+ 22] is another Python library/CLI tool for managing reproducible and modular experiments. Orchestration is particularly well done with "node" and "pipeline" abstractions, a node referring to a single compute step with defined inputs and outputs, and a pipeline implemented as an ordered list of nodes. Pipelines can be composed and joined to create an overarching workflow. Possible parameters are defined in a YAML file and either set in other parameter files or configured on the CLI. Similar to MLFlow, while tracked outputs are cached, there is no automatic mechanism for re-entrancy. Provenance is achieved by storing user-specified metrics and tracked datasets for each run, and it has a few different mechanisms for portability. This includes the ability to export an entire project into a Docker container. A separate Kedro-Viz tool provides a web dashboard to show a map of experiments, as well as showing each tracked experiment run and allowing comparison of metrics and outputs between them. Projects can be deployed to several different cloud providers, such as Databricks and Dask clusters, allowing for several options for scalability.

Curifactory

Curifactory [MHSA22] is a Python API and CLI tool for organizing, tracking, reproducing, and exporting computational research experiments and data analysis workflows. It is intended primarily for smaller teams conducting research, rather than production-level or large-scale ML projects. Curifactory is available on GitHub1 with an open-source BSD-3-Clause license. Below, we describe the mechanisms within Curifactory to support each of the six capabilities, and compare it with the tools discussed above.

1. https://github.com/ORNL/curifactory
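As a concrete point of reference before the comparison in Table 2, the tracking-call style described above for MLFlow might look roughly like the following minimal sketch (illustrative only; the parameter names, metric, and file are hypothetical):

import mlflow

with mlflow.start_run(run_name="baseline"):
    # Log the configuration used for this run.
    mlflow.log_param("train_test_ratio", 0.8)
    mlflow.log_param("layers", "(100,)")

    accuracy = 0.93   # stand-in for a real training/evaluation step
    mlflow.log_metric("accuracy", accuracy)

    # Any file can be stored alongside the run as an artifact.
    with open("summary.txt", "w") as f:
        f.write(f"accuracy={accuracy}\n")
    mlflow.log_artifact("summary.txt")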

                              Orchestration   Parameterization   Caching    Provenance      Portability   Reporting   Scalability
                DVC           +               +                  ++         +               +             +           +
                MLFlow                        +                  *          ++              ++            ++          ++
                Sacred        +               ++                 *          ++              +             +
                Kedro         +               +                  *          +               ++            ++          ++
                Curifactory   +               ++                 ++         ++              ++            +           +



TABLE 2: Supported design features in each tool. Note, + indicates that a feature is supported, ++ indicates very strong support, and *
indicates tooling that supports caching artifacts as a provenance tool but does not provide a mechanism for automatically reloading cached
values as a form of re-entrancy.



Fig. 1: Stages are composed into an experiment.

Orchestration

Curifactory provides several abstractions, the lowest level of which is a stage. A stage is a function that takes a defined set of input variable names, a defined set of output variable names, and an optional set of caching strategies for the outputs. Stages are similar to Kedro’s nodes but implemented with @stage() decorators on the target function rather than passing the target function to a node() call. One level up from a stage is an experiment: an experiment describes the orchestration of these stages as shown in Figure 1, functionally chaining them together without needing to explicitly manage what variables are passed between the stages.

@stage(inputs=None, outputs=["data"])
def load_data(record):
    # every stage has the currently active record
    # passed to it, which contains the "state", or
    # all previous output values associated with
    # the current argset, as defined in the
    # Parameterization section
    # ...

@stage(inputs=["data"], outputs=["model", "stats"])
def train_model(record, data):
    # ...

@stage(inputs=["model"], outputs=["results"])
def test_model(record, model):
    # ...

def run(argsets, manager):
    """An example experiment definition.

    The primary intent of an experiment is to run
    each set of arguments through the desired
    stages, in order to compare results at the end.
    """
    for argset in argsets:
        # A record is the "pipeline state"
        # associated with each set of arguments.
        # Stages take and return a record,
        # automatically handling pushing and
        # pulling inputs and outputs from the
        # record state.
        record = Record(manager, argset)
        test_model(train_model(load_data(record)))

Parameterization

Parameterization in Curifactory is done directly in Python scripts. The user defines a dataclass with the parameters they need throughout their various stages in order to customize the experiment, and they can then define parameter files that each return one or more instances of this arguments class. All stages in an experiment are automatically given access to the current argument set in use while an experiment is running.

While configuration can also be done directly in Python in Sacred, Curifactory makes a different trade-off: a parameter file or get_params() function in Curifactory returns an array of one or more argument sets, and arguments can directly include complex Python objects. Unlike Sacred, this means Curifactory cannot directly translate back and forth from static configuration files, but in exchange it allows grid searches to be defined directly and easily in a single parameter file, as well as allowing argument sets to be composed or even inherit from other argument set instances. Importantly, Curifactory can still encode representations of arguments into JSON for provenance, but this is a one-directional transformation.

This approach allows a great deal of flexibility, and is valuable in experiments where a large range of parameters needs to be tested or there is significant repetition among parameter sets. For example, in an experiment testing different effects of model training hyperparameters, there may be several parameter files meant to vary only the arguments needed for model training while using the same base set of data cleaning arguments. Composing these parameter sets from a common imported set means that any subsequent changes to the data cleaning arguments only need to

be modified in one place, rather than in each individual parameter file.

@dataclass
class MyArgs(curifactory.ExperimentArgs):
    """Define the possible arguments needed in the
    stages."""
    random_seed: int = 42
    train_test_ratio: float = 0.8
    layers: tuple = (100,)
    activation: str = "relu"

def get_params():
    """Define a simple grid search: return
    many arguments instances for testing."""
    args = []
    layer_sizes = [10, 20, 50, 100]
    for size in layer_sizes:
        args.append(MyArgs(name=f"network_{size}",
            layers=(size,)))
    return args

Caching

Curifactory supports per-stage caching, similar to memoization, through a set of easy-to-use caching strategies. When a stage executes, it uses the specified cache mechanism to store the stage outputs to disk, with a filename based on the experiment, stage, and a hash of the arguments. When the experiment is re-executed, if it finds an existing output on disk based on this name, it short-circuits the stage computation and simply reloads the previously cached files, allowing a form of re-entrancy. Adding this caching ability to a stage is done by simply providing the list of caching strategies to the stage decorator, one for each output:

@stage(
    inputs=["data"],
    outputs=["training_set", "testing_set"],
    cachers=[PandasCSVCacher]*2
)
def split_data(record, data):
    # stage definition

Reproducibility

As mentioned before, reproducibility consists of tracking provenance and metadata of artifacts as well as providing a means to set up and repeat an experiment in a different compute environment. To handle provenance, Curifactory automatically records metadata for every experiment run executed, including a logfile of the console output, the current Git commit hash, the argument sets used and the rendered versions of those arguments, and the CLI command used to start the run. The final reports from each run also include a graphical representation of the stage DAG, and show each output artifact and its cache file location.

Curifactory has two mechanisms to fully track and export an experiment run. The first is to execute a "full store" run, which creates a single exported folder containing all metadata mentioned above, along with a copy of every cache file created, the output run report (mentioned below), as well as a Python requirements.txt and Conda environment dump, containing a list of all packages in the environment and their respective versions. This run folder can then be distributed. Reproducing from the folder consists of setting up an environment based on the Conda/Python dependencies as needed, and running the experiment command using the exported folder as the cache directory.

The second mechanism is a command to create a Docker container that includes the environment, entire codebase, and artifact cache for a specific experiment run. Curifactory comes with a default Dockerfile for this purpose, and running the experiment with the Docker flag creates an image that exposes a Jupyter notebook to repeat the run and keep the artifacts in memory, as well as a file server pointing to the appropriate cache for manual exploration and inspection. Directly reproducing the experiment can be done either through the exposed notebook or by running the Curifactory experiment command inside of the image.

Fig. 2: Metadata block at the top of a report.

Reporting

While Curifactory does not run a live web dashboard like MLFlow, DVC’s Iterative Studio, and Kedro-Viz, every experiment run outputs an HTML experiment report and updates a top-level index HTML page linking to the new report, which can be browsed from a file manager or statically served if running from an external compute resource. Although simplistic, this reduces the dependencies and infrastructure needed to achieve a basic level of reporting, and produces stand-alone folders for consumption outside of the original environment if needed.

Every report from Curifactory includes all relevant metadata mentioned above, including the machine host name, experiment sequential run number, Git commit hash, parameters, and command line string. Stage code can add user-defined objects to output in each report, such as tables, figures, and so on. Curifactory comes with a default set of helpers for several basic types of output visualizations, including basic line plots, entire Matplotlib figures, and dataframes.

The output report also contains a graphical representation of the DAG for the experiment, rendered using Graphviz, and shows the artifacts produced by each stage and the file path where they are cached. Some of the components of this report are rendered in Figures 2, 3, 4, and 5.
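The static-report approach can be illustrated with a generic sketch (this is not Curifactory's reporting API): render a Matplotlib figure to a base64 string and write a self-contained HTML page that can be browsed from a file manager or statically served:

import base64, io
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def figure_to_img_tag(fig):
    # Embed the figure directly in the page so the report is a single file.
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    data = base64.b64encode(buf.getvalue()).decode("ascii")
    return f'<img src="data:image/png;base64,{data}"/>'

fig, ax = plt.subplots()
ax.plot([0.61, 0.72, 0.78, 0.81])
ax.set_title("validation accuracy per epoch")

html = f"<html><body><h1>Run 42 report</h1>{figure_to_img_tag(fig)}</body></html>"
with open("report.html", "w") as f:
    f.write(html)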

                                                                           Conclusion
                                                                           The complexity in modern software, environments, and data ana-
                                                                           lytic approaches threaten the reproducibility and effectiveness of
                                                                           computation-based studies. This has been compounded by the lack
                                                                           of standardization in infrastructure tools and software engineering
                                                                           principles applied within scientific research domains. While many
                                                                           novel tools and systems are in development to address these
                                                                           shortcomings, several design critieria must be met, including the
                                                                           ability to easily compose and orchestrate experiments, parameter-
                                                                           ize them to manipulate variables under test, cache intermediate
                                                                           artifacts, record provenance of all artifacts and allow the software
                                                                           to port to other systems, produce output visualizations and reports
                                                                           for analysis, and scale execution to the resource requirements
                                                                           of the experiment. We developed Curifactory to address these
                                                                           criteria specifically for small research teams running Python based
                                                                           experiments.
       Fig. 3: User-defined objects to report ("reportables").

                                                                           Acknowledgements
                                                                           The authors would like to acknowledge the US Department of
                                                                           Energy, National Nuclear Security Administration’s Office of De-
                                                                           fense Nuclear Nonproliferation Research and Development (NA-
                                                                           22) for supporting this work.


                                                                           R EFERENCES
                                                                           [ABC+ 22] Sajid Alam, Lorena Bălan, Gabriel Comym, Yetunde Dada, Ivan
                                                                                     Danov, Lim Hoang, Rashida Kanchwala, Jiri Klein, Antony
                                                                                     Milne, Joel Schwarzmann, Merel Theisen, and Susanna Wong.
                                                                                     Kedro. https://kedro.org/, March 2022.
                                                                           [BYC13]   James Bergstra, Daniel Yamins, and David Cox. Making a Sci-
Fig. 4: Graphviz rendering of experiment DAG. Each large colored
                                                                                     ence of Model Search: Hyperparameter Optimization in Hundreds
area represents a single record associated with a specific argset. White             of Dimensions for Vision Architectures. In Proceedings of the
ellipses are stages, and the blocks in between them are the input and                30th International Conference on Machine Learning, pages 115–
output artifacts.                                                                    123. PMLR, February 2013.
                                                                           [DGST09] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor.
                                                                                     Workflows and e-Science: An overview of workflow system
                                                                                     features and capabilities. Future Generation Computer Systems,
Scalability

Curifactory has no integrated method of executing portions of jobs on external compute resources, as Kedro and MLflow do, but it does allow local multi-process parallelization of parameter sets. When an experiment run would otherwise execute a series of stages for each argument set in sequence, Curifactory can divide the collection of argument sets into one subcollection per process and run the experiment in parallel on each subcollection. By taking advantage of the caching mechanism, once all parallel runs complete, the experiment reruns in a single process to aggregate all of the precached values into a single report.
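The divide-run-aggregate pattern described above can be sketched with Python's standard multiprocessing module; the helper names and the in-memory cache below are illustrative stand-ins, not Curifactory's actual API.

import multiprocessing as mp

cache = {}  # stand-in for Curifactory's on-disk cache of stage outputs

def run_stages(argset):
    # Execute the experiment's stages for one argument set and return
    # (argset, result) so the parent process can record it.
    result = sum(argset.values())  # placeholder computation
    return argset, result

def run_experiment(argsets, processes=4):
    # Divide the argument sets across worker processes and run them in parallel.
    with mp.Pool(processes=processes) as pool:
        for argset, result in pool.map(run_stages, argsets):
            cache[tuple(sorted(argset.items()))] = result
    # Final single-process pass: aggregate the precached values into one report.
    return {"n_argsets": len(argsets), "results": dict(cache)}

if __name__ == "__main__":
    argsets = [{"lr": lr, "layers": n} for lr in (0.1, 0.01) for n in (2, 4)]
    print(run_experiment(argsets))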
Fig. 5: Graphviz rendering of each record in more depth, showing cache file paths and artifact data types.





   The Geoscience Community Analysis Toolkit: An
  Open Development, Community Driven Toolkit in the
            Scientific Python Ecosystem
            Orhan Eroglu‡∗ , Anissa Zacharias‡ , Michaela Sizemore‡ , Alea Kootz‡ , Heather Craker‡ , John Clyne‡

                                      https://www.youtube.com/watch?v=34zFGkDwJPc




Abstract—The Geoscience Community Analysis Toolkit (GeoCAT) team develops and maintains data analysis and visualization tools on structured and unstructured grids for the geosciences community in the Scientific Python Ecosystem (SPE). In response to increasing geoscientific data sizes, GeoCAT prioritizes scalability, ensuring its implementations are scalable from personal laptops to HPC clusters. Another major goal of the GeoCAT team is to ensure community involvement throughout the whole project lifecycle, which is realized through an open development mindset by encouraging users and contributors to get involved in decision-making. With this model, we not only have our project stack open-sourced but also ensure that most of the project assets directly related to the software development lifecycle are publicly accessible.

Index Terms—data analysis, geocat, geoscience, open development, open source, scalability, visualization

* Corresponding author: oero@ucar.edu
‡ National Center for Atmospheric Research

Copyright © 2022 Orhan Eroglu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The Geoscience Community Analysis Toolkit (GeoCAT) team, established in 2019, leads the software engineering efforts of the National Center for Atmospheric Research (NCAR) “Pivot to Python” initiative [Geo19]. Before then, NCAR Command Language (NCL) [BBHH12] was developed by NCAR as an interpreted, domain-specific language aimed at supporting the analysis and visualization needs of the global geosciences community. NCL had been serving several tens of thousands of users for decades. It is still available for use but is no longer actively developed, having entered maintenance mode.

The initiative had an initial two-year roadmap with three major milestones: (1) replicating NCL's computational routines in Python, (2) training and support for transitioning NCL users into Python, and (3) moving tools into an open development model. GeoCAT aims to create scalable data analysis and visualization tools on structured and unstructured grids for the geosciences community in the SPE. The GeoCAT team is committed to open development, which helps the team prioritize community involvement at every level of the project lifecycle alongside having the whole software stack open-sourced.

GeoCAT has seven Python tools for geoscientific computation and visualization. These tools are built upon the Pangeo [HRA18] ecosystem. In particular, they rely on Xarray [HH17] and Dask [MR15], are compatible with NumPy, and use Jupyter Notebooks for demonstration purposes. Dask compatibility allows the GeoCAT functions to scale from personal laptops to high performance computing (HPC) systems such as NCAR's Casper, Cheyenne, and upcoming Derecho clusters [CKZ+ 22]. Additionally, GeoCAT utilizes Numba, an open-source just-in-time (JIT) compiler [LPS15], to translate Python and NumPy code into machine code for faster execution wherever possible. GeoCAT's visualization components rely on Matplotlib [Hun07] for most of the plotting functionalities, Cartopy [Met15] for projections, as well as the Datashader and Holoviews stack [Anaa] for big data rendering. Figure 1 shows these technologies with their essential roles around GeoCAT.

Briefly, GeoCAT-comp houses computational operators for applications ranging from regridding and interpolation to climatology and meteorology. GeoCAT-examples provides over 140 publication-quality plotting scripts in Python for Earth sciences. It also houses Jupyter notebooks with high-performance, interactive plots that enable features such as pan and zoom on fine-resolution, unstructured geoscience data (e.g. ~3 km data rendered within a few tens of seconds to a few minutes on personal laptops). This is achieved by making use of the connectivity information in the unstructured grid and rendering data via the Datashader and Holoviews ecosystem [Anaa]. GeoCAT-viz enables higher-level implementation of Matplotlib and Cartopy plotting capabilities through its variety of easy-to-use visualization convenience functions for GeoCAT-examples. GeoCAT also maintains WRF-Python (Weather Research and Forecasting), which works with WRF-ARW model output and provides diagnostic and interpolation routines.

GeoCAT was recently awarded Project Raijin, an NSF EarthCube-funded effort [NSF21] [CEMZ21]. Its goal is to enhance the open-source analysis and visualization tool landscape by developing community-owned, sustainable, scalable tools that facilitate operating on unstructured climate and global weather data in the SPE. Throughout this three-year project, GeoCAT will work on the development of data analysis and visualization functions that operate directly on the native grid, as well as establish an active community of user-contributors.




Fig. 1: The core Python technologies on which GeoCAT relies


This paper provides insights into GeoCAT's software stack and current status, team scope and near-term plans, and open development methodology, as well as current pathways of community involvement.

GeoCAT Software

The GeoCAT team develops and maintains several open-source software tools. Before describing those tools, it is vital to explain in detail how the team implements continuous integration and continuous delivery/deployment (CI/CD) consistently for all of those tools.

Continuous Integration and Continuous Delivery/Deployment (CI/CD)

GeoCAT employs a continuous delivery model, with a monthly package release cycle on package management systems and package indexes such as Conda [Anab] and PyPI [Pyt]. This model helps the team make new functions available as soon as they are implemented and address potential errors quickly. To assist this process, the team utilizes multiple tools throughout GitHub assets to ensure automation, unit testing and code coverage, as well as licensing and reproducibility. Figure 2, for example, shows the set of badges displaying the near real-time status of each CI/CD implementation on the GitHub repository homepage of one of our software tools.

CI build tests of our repositories are implemented and automated (for pushed commits, pull requests, and daily scheduled execution) via GitHub Actions workflows [Git], with the CI badge shown in Figure 2 displaying the status (i.e. pass or fail) of those workflows. Similarly, the CONDA-BUILDS badge shows whether the conda recipe works successfully for the repository. The Python package "codecov" [cod] analyzes the percentage of code coverage from unit tests in the repository. Additionally, the overall results as well as details for each code script can be seen via the COVERAGE badge. Each of our software repositories has a corresponding documentation page that is populated mostly automatically through the Sphinx Python documentation generator [Bra21] and published through ReadTheDocs [rea] via an automated building and versioning schema. The DOCS badge provides a link to the documentation page along with showing failures, if any, in the documentation rendering process. Figure 3 shows the documentation homepage of GeoCAT-comp. The NCAR and PYPI badges in the Package row show and link to the latest versions of the software tool distributed through NCAR's Conda channel and PyPI, respectively. The LICENSE badge provides a link to our software license, Apache License version 2.0 [Apa04], for all of the GeoCAT stack, enabling the redistribution of the open-source software products on an "as is" basis. Finally, to provide reproducibility of our software products (either for the latest or any older version), we publish version-specific Digital Object Identifiers (DOIs), which can be accessed through the DOI badge. This allows the end-user to accurately cite the specific version of the GeoCAT tools they used for science or research purposes.

Fig. 2: GeoCAT-comp's badges at the beginning of its README file (i.e. the home page of the GitHub repository) [geob]

GeoCAT-comp (and GeoCAT-f2py)

GeoCAT-comp is the computational component of the GeoCAT project, as can be seen in Figure 4. GeoCAT-comp houses implementations of geoscience data analysis functions. Novel research and development is conducted for analyzing both structured and unstructured grid data from various research fields such as climate, weather, atmosphere, and ocean, among others. In addition, some of the functionalities of GeoCAT-comp are inspired by or reimplemented from NCL in order to address the first goal of the "Pivot to Python" effort. For that purpose, 114 NCL routines were selected, excluding some functionalities such as date routines, which could be handled by other packages in the Python ecosystem today. These functions were ranked by order of website documentation access from most to least, and prioritization was made based on those ranks.




            Fig. 3: GeoCAT-comp documentation homepage built with Sphinx using a theme provided by ReadTheDocs [geoa]


Today, GeoCAT-comp provides the same or similar capabilities for about 39% (44 out of 114) of those functions.

Some of the functions made available through GeoCAT-comp are listed below; the GeoCAT-comp documentation [geoa] provides signatures and descriptions as well as links to usage examples:

   •   Spherical harmonics (both decomposition and recomposition as well as area weighting)
   •   Fourier transforms such as band-block, band-pass, low-pass, and high-pass
   •   Meteorological variable computations such as relative humidity, dew-point temperature, heat index, saturation vapor pressure, and more
   •   Climatology functions such as climate averages over multiple years, daily/monthly/seasonal averages, as well as anomalies
   •   Regridding of curvilinear grid to rectilinear grid, unstructured grid to rectilinear grid, curvilinear grid to unstructured grid, and vice versa
   •   Interpolation methods such as bilinear interpolation of a rectilinear to another rectilinear grid, hybrid-sigma levels to isobaric levels, and sigma to hybrid coordinates
   •   Empirical orthogonal function (EOF) analysis

Many of the computational functions in GeoCAT are implemented in pure Python. However, there are others that were originally implemented in Fortran but are now wrapped in Python with the help of NumPy's F2PY, the Fortran-to-Python interface generator. This is mostly because re-implementing some functions would require understanding complicated algorithm flows and implementing extensive unit tests, which would end up taking much more time than wrapping their already-implemented Fortran routines in Python. Furthermore, outside contributors from a science background are likely to keep adding new functions to GeoCAT from their older Fortran routines in the future. To facilitate contribution, the whole GeoCAT-comp structure is split into two repositories with respect to being either pure-Python or Python with compiled code (i.e. Fortran) implementations. These implementation layers are handled with the GeoCAT-comp and GeoCAT-f2py repositories, respectively.

The GeoCAT-comp code base does not explicitly contain or require any compiled code, making it more accessible to the general Python community at large. In addition, GeoCAT-f2py is automatically installed through the GeoCAT-comp installation, and all functions contained in the "geocat.f2py" package are imported transparently into the "geocat.comp" namespace. Thus, GeoCAT-comp serves as a user API to access the entire computational toolkit even though its GitHub repository itself only contains pure Python code from the developer's perspective. Whenever prospective contributors want to contribute computational functionality in pure Python, GeoCAT-comp is the only GitHub repository they need to deal with. Therefore, there is no onus on contributors of pure Python code to build, compile, or test any compiled code (e.g. Fortran) at the GeoCAT-comp level.

GeoCAT-examples (and GeoCAT-viz)

GeoCAT-examples [geoe] was created to address a few of the original milestones of NCAR's "Pivot to Python" initiative: (1) to provide the geoscience community with well-documented visualization examples for several plotting classes in the SPE, and (2) to help transition NCL users into the Python ecosystem by providing such resources. It was born in early 2020 as the result of a multi-day hackathon event among the GeoCAT team and several other scientists and developers from various NCAR labs/groups. It has since grown to house novel visualization examples and showcase the capabilities of other GeoCAT components, like GeoCAT-comp, along with newer technologies like interactive plotting notebooks. Figure 5 illustrates one of the unique GeoCAT-examples cases, aimed at exploring best practices for data visualization such as choosing color-blind-friendly colormaps.

The GeoCAT-examples [geod] gallery contains over 140 example Python plotting scripts, demonstrating functionalities from Python packages like Matplotlib, Cartopy, NumPy, and Xarray. The gallery includes plots from a range of visualization categories such as box plots, contours, meteograms, overlays, projections, shapefiles, streamlines, and trajectories, among others. The plotting categories and scripts under GeoCAT-examples cover almost all of the NCL plot types and techniques. In addition, GeoCAT-examples houses plotting examples for individual GeoCAT-comp analysis functions.
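As an illustration of the kind of script the gallery collects (this is a generic sketch with synthetic data, not an actual GeoCAT-examples entry), a minimal Matplotlib and Cartopy map plot might look as follows.

import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Synthetic global field on a 1-degree latitude-longitude grid.
lon = np.linspace(-180, 180, 361)
lat = np.linspace(-90, 90, 181)
data = np.cos(np.deg2rad(lat))[:, None] * np.sin(np.deg2rad(3 * lon))[None, :]

# Filled contours on a Robinson projection with coastlines.
ax = plt.axes(projection=ccrs.Robinson())
ax.coastlines(linewidth=0.5)
mesh = ax.contourf(lon, lat, data, transform=ccrs.PlateCarree(), cmap="viridis")
plt.colorbar(mesh, orientation="horizontal", label="synthetic field")
plt.savefig("global_field.png", dpi=150)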




                                   Fig. 4: GeoCAT project structure with all of the software tools [geoc]


Fig. 5: Comparison between NCL (left) and Python (right) when choosing a colormap; GeoCAT-examples aims at choosing color-blind-friendly colormaps [SEKZ22]

Despite Matplotlib and Cartopy's capabilities to reproduce almost all NCL plots, there was one significant caveat to using their low-level implementations compared with NCL: NCL's high-level plotting functions allowed scientists to plot most cases in only tens of lines of code (LOC), while the Matplotlib and Cartopy stack required writing a few hundred LOC. In order to build a higher-level implementation on top of Matplotlib and Cartopy while recreating NCL-like plots (from vital plotting capabilities that were not readily available in the Python ecosystem at the time, such as Taylor diagrams and curly vectors, to more stylistic changes such as font sizes and color schemes that resemble NCL plots), the GeoCAT-viz library [geof] was implemented. Use of functions from this library in GeoCAT-examples significantly reduces the LOC requirements for most of the visualization examples to numbers comparable to those of NCL. Figure 6 shows Taylor diagram and curly vector examples that have been created with the help of GeoCAT-viz. To exemplify how GeoCAT-viz helps keep the LOC comparable to NCL, one of the Taylor diagrams (i.e. Taylor_6) took 80 LOC in NCL, and its Python implementation in GeoCAT-examples takes 72 LOC. If many of the Matplotlib functions (e.g. figure and axes initialization, adjustment of several axes parameters, calls to plotting functions for the Taylor diagram, management of grids, addition of titles, contours, etc.) used in this example weren't wrapped in GeoCAT-viz [geof], the same visualization would easily end up at around two hundred LOC.

Fig. 6: Taylor diagram and curly vector examples created with the help of GeoCAT-viz

Recently, the GeoCAT team has been focused on interactive plotting technologies, especially for larger data sets that contain millions of data points. This effort has centered on unstructured grid visualization as part of Project Raijin, which is detailed in a later section of this manuscript. That is because unstructured meshes are a great research and application field for big data and interactivity, such as zooming in and out on regions of interest. As a result of this effort, we created a new notebooks gallery under GeoCAT-examples to house such interactive data visualizations. The first notebook in this gallery, a screenshot from which is shown in Figure 7, is implemented via the Datashader and Holoviews ecosystem [Anaa], and it provides a high-performance, interactive visualization of a Model for Prediction Across Scales (MPAS) Global Storm-Resolving Model weather simulation dataset. The interactivity features are pan and zoom to reveal greater data fidelity globally and regionally. The data used in this work are courtesy of the DYAMOND effort [SSA+ 19] and have varying resolutions from 30 km to 3.75 km. Our notebook in the gallery uses the 30 km resolution data so that users can download it and work with it in their local configuration. However, our work with the 3.75 km resolution data (i.e. about 42 million hexagonal cells globally) showed that rendering the data took only a few minutes on a decent laptop, even without any parallelization. The main reason behind such high performance was that we used the cell-to-node connectivity information in the MPAS data to render the native grid directly (i.e. without remapping to a structured grid), along with utilizing the Datashader stack. Without the connectivity information, a much more costly Delaunay triangulation would be required. The notebook provides a comparison between these two approaches as well.
GeoCAT-datafiles

GeoCAT-datafiles is GeoCAT's small data storage component, maintained as a GitHub repository. This tool houses many datasets in different file formats, such as NetCDF, which can be used along with other GeoCAT tools or for ad-hoc data needs in any other Python script. The datasets can be accessed by the end-user through a lightweight convenience function:

geocat.datafiles.get("folder_name/filename")

GeoCAT-datafiles fetches the file by simply reading it from local storage, if it is already there, or otherwise downloading it from the GeoCAT-datafiles repository, with the help of the Pooch framework [USR+ 20].
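The read-local-or-download behavior can be sketched directly with Pooch; the base URL, cache name, and file name below are placeholders rather than actual GeoCAT-datafiles locations.

import pooch

def get(filename):
    # pooch.retrieve returns the local path to the file, downloading it only
    # when it is not already present in the local cache directory.
    return pooch.retrieve(
        url=f"https://example.org/datafiles/{filename}",  # placeholder base URL
        known_hash=None,  # hash checking skipped in this sketch
        fname=filename,
        path=pooch.os_cache("my-datafiles"),  # placeholder cache name
    )

# local_path = get("example.nc")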
                                                                        being developed in the geoscientific research and development
WRF-Python                                                              workflows need to be scalable from personal devices (e.g. laptops)
                                                                        to HPC (e.g. NCAR’s Casper, Cheyenne, and upcoming Derecho
WRF-Python was created in early 2017 in order to replicate NCL’s
                                                                        clusters) and cloud platforms (e.g. AWS).
Weather Research and Forecasting (WRF) package in the SPE, and
                                                                             In order to keep up with the scalability objectives, GeoCAT
it covers 100% of the routines in that package. About two years
                                                                        functions are implemented to operate on Dask arrays in addition
later, NCAR’s “Pivot to Python” initiative was announced, and the
                                                                        to natively supporting NumPy arrays and Xarray DataArrays.
GeoCAT team has taken over development and maintenance of
                                                                        Therefore, the GeoCAT functions can trivially and transparently be
WRF-Python.
                                                                        parallelized to be run on shared-memory and distributed-memory
     The package focuses on creating a Python package that elim-
                                                                        platforms after having Dask cluster/client properly configured and
inates the need to work across multiple software platforms when
                                                                        functions fed with Dask arrays or Dask-backed Xarray DataArrays
using WRF datasets. It contains more than 30 computational
                                                                        (i.e. chunked Xarray DataArrays that wrap up Dask arrays).
Open Development

To ensure community involvement at every level of the development lifecycle, GeoCAT is committed to an open development model. In order to implement this model, GeoCAT provides all of its software tools as GitHub repositories with public GitHub project boards and roadmaps, issue tracking and development reviewing, and comprehensive documentation for users and contributors, such as the Contributor's Guide [geoc] and toolkit-specific documentation, along with community announcements on the GeoCAT blog. Furthermore, GeoCAT encourages community feedback and contribution at any level with inclusive and welcoming language. As a result, community requests and feedback have played a significant role in forming and revising the GeoCAT roadmap and the projects' scope.




                     Fig. 7: The interactive plot interface from the MPAS visualization notebook in GeoCAT-examples


Fig. 8: Regular grid (left) vs MPAS-A & CAM-SE grids

Community engagement

To further promote engagement with the geoscience community, GeoCAT organizes and attends various community events. First of all, scientific conferences and meetings are great venues for a scientific software engineering project such as this one to share updates and progress with the community. For instance, the American Meteorological Society (AMS) Annual Meeting and the American Geophysical Union (AGU) Fall Meeting are two significant scientific events at which the GeoCAT team has presented one or more publications every year since its inception in order to inform the community. The annual Scientific Computing with Python (SciPy) conference is another great fit to showcase what GeoCAT has been conducting in geoscience. The team has also attended The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) a few times to keep up to date with the industry state of the art in these technologies.

Creating internship projects is another way of improving community interactions, as it triggers collaboration among GeoCAT, institutions, students, and universities in general. The GeoCAT team thus encourages undergraduate and graduate student engagement in the Python ecosystem through participation in NCAR's Summer Internships in Parallel Computational Science (SIParCS). Such programs are quite beneficial for both students and scientific software development teams. To exemplify, GeoCAT-examples and GeoCAT-viz in particular have received significant contributions through SIParCS in the 2020 and 2021 summers (i.e. tens of visualization examples as well as important infrastructural changes were made available by our interns) [CKZ+ 22] [LLZ+ 21] [CFS21]. Furthermore, the team has created three essential projects and one collaboration project for the SIParCS 2022 summer, through which advanced geoscientific visualization, unstructured grid visualization and data analysis, Fortran to Python algorithm and code development, as well as GPU optimization for GeoCAT-comp routines will be investigated.

Project Pythia

The GeoCAT effort is also a part of the NSF-funded Project Pythia. Project Pythia aims to provide a public, web-accessible training resource that helps educate earth scientists to more effectively use the SPE and cloud computing for dealing with big data in the geosciences. GeoCAT helps with Pythia development through content creation and infrastructure contributions. GeoCAT has also contributed several Python tutorials (such as NumPy, Matplotlib, Cartopy, etc.) to the educational resources created through Project Pythia. These materials consist of live tutorial sessions, interactive Jupyter notebook demonstrations, Q&A sessions, as well as published video recordings of the events on Pythia's YouTube channel. As a result, this helps us engage with the community through multiple channels.

Future directions

GeoCAT aims to keep increasing the number of data analysis and visualization functionalities on both structured and unstructured meshes at the same pace as has been achieved so far. The team will continue prioritizing scalability and open development in the future development and maintenance of its software tools landscape.
To achieve our scalability goals, we will ensure our implementations are compatible with the state of the art and up to date with the best practices of the technologies we are using, e.g. Dask. To enhance community involvement in our open development model, we will continue interacting with community members through significant events such as Pangeo community meetings, scientific conferences, and tutorials and workshops, both GeoCAT's own and those of other community members; and we will keep up our timely communication with stakeholders through GitHub assets and other communication channels.
References

[Anaa]    Anaconda. Datashader. https://datashader.org/. Online; accessed 29 June 2022.
[Anab]    Anaconda, Inc. Conda package manager. https://docs.conda.io/en/latest/. Online; accessed 18 May 2022.
[Apa04]   Apache Software Foundation. Apache License, version 2.0. https://www.apache.org/licenses/LICENSE-2.0, 2004. Online; accessed 18 May 2022.
[BBHH12]  David Brown, Rick Brownrigg, Mary Haley, and Wei Huang. NCAR Command Language (NCL), 2012. doi:10.5065/D6WD3XH5.
[Bra21]   Georg Brandl. Sphinx documentation. URL: http://sphinx-doc.org/sphinx.pdf, 2021.
[CEMZ21]  John Clyne, Orhan Eroglu, Brian Medeiros, and Colin M. Zarzycki. Project Raijin: Community geoscience analysis tools for unstructured grids. In AGU Fall Meeting 2021. AGU, 2021.
[CFS21]   Heather Rose Craker, Claire Anne Fiorino, and Michaela Victoria Sizemore. Rebuilding the NCL visualization gallery in Python. In 101st American Meteorological Society Annual Meeting. AMS, 2021.
[CKZ+ 22] Heather Craker, Alea Kootz, Anissa Zacharias, Michaela Sizemore, and Orhan Eroglu. NCAR's GeoCAT Announcement of Computational Tools. In 102nd American Meteorological Society Annual Meeting. AMS, 2022.
[cod]     Codecov. https://about.codecov.io/. Online; accessed 18 May 2022.
[geoa]    GeoCAT-comp documentation page. https://geocat-comp.readthedocs.io/en/latest/index.html. Online; accessed 20 May 2022. doi:10.5281/zenodo.6607205.
[geob]    GeoCAT-comp GitHub repository. https://github.com/NCAR/geocat-comp. Online; accessed 20 May 2022. doi:10.5281/zenodo.6607205.
[geoc]    GeoCAT Contributor's Guide. https://geocat.ucar.edu/pages/contributing.html. Online; accessed 20 May 2022. doi:10.5065/a8pp-4358.
[geod]    GeoCAT-examples documentation page. https://geocat-examples.readthedocs.io/en/latest/index.html. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678258.
[geoe]    GeoCAT-examples GitHub repository. https://github.com/NCAR/geocat-examples. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678258.
[geof]    GeoCAT-viz GitHub repository. https://github.com/NCAR/geocat-viz. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678345.
[Geo19]   GeoCAT. The future of NCL and the Pivot to Python. https://www.ncl.ucar.edu/Document/Pivot_to_Python, 2019. Online; accessed 17 May 2022. doi:10.5065/D6WD3XH5.
[Git]     GitHub. GitHub Actions. https://docs.github.com/en/actions. Online; accessed 18 May 2022.
[HH17]    Stephan Hoyer and Joseph Hamman. xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1):10, 2017. doi:10.5334/jors.148.
[HRA18]   Joseph Hamman, Matthew Rocklin, and Ryan Abernathy. Pangeo: A big-data ecosystem for scalable earth system science. EGU General Assembly Conference Abstracts, 2018.
[Hun07]   J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[LLZ+ 21] Erin Lincoln, Jiaqi Li, Anissa Zacharias, Michaela Sizemore, Orhan Eroglu, and Julia Kent. Expanding and strengthening the transition from NCL to Python visualizations. In AGU Fall Meeting 2021. AGU, 2021.
[LPS15]   Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6, 2015. doi:10.1145/2833157.2833162.
[Met15]   Met Office. Cartopy: a cartographic Python library with a Matplotlib interface. Exeter, Devon, 2010–2015. URL: http://scitools.org.uk/cartopy.
[MR15]    Matthew Rocklin. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Kathryn Huff and James Bergstra, editors, Proceedings of the 14th Python in Science Conference, pages 126–132, 2015. doi:10.25080/Majora-7b98e3ed-013.
[NSF21]   NSF. Collaborative Research: EarthCube Capabilities: Raijin: Community geoscience analysis tools for unstructured mesh data. https://nsf.gov/awardsearch/showAward?AWD_ID=2126458&HistoricalAwards=false, 2021. Online; accessed 17 May 2022.
[Pyt]     Python Software Foundation. The Python Package Index - PyPI. https://pypi.org/. Online; accessed 18 May 2022.
[rai]     Raijin homepage. https://raijin.ucar.edu/. Online; accessed 21 May 2022.
[rea]     ReadTheDocs. https://readthedocs.org/. Online; accessed 18 May 2022.
[SEKZ22]  Michaela Sizemore, Orhan Eroglu, Alea Kootz, and Anissa Zacharias. Pivoting to Python: Lessons Learned in Recreating the NCAR Command Language in Python. 102nd American Meteorological Society Annual Meeting, 2022.
[SSA+ 19] Bjorn Stevens, Masaki Satoh, Ludovic Auger, Joachim Biercamp, Christopher S. Bretherton, Xi Chen, Peter Düben, Falko Judt, Marat Khairoutdinov, Daniel Klocke, et al. DYAMOND: the DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains. Progress in Earth and Planetary Science, 6(1):1–17, 2019. doi:10.1186/s40645-019-0304-z.
[USR+ 20] Leonardo Uieda, Santiago Rubén Soler, Rémi Rampin, Hugo van Kemenade, Matthew Turk, Daniel Shapero, Anderson Banihirwe, and John Leeman. Pooch: A friend to fetch your data files. Journal of Open Source Software, 5(45):1943, 2020. doi:10.21105/joss.01943.
[uxa]     UXarray GitHub repository. https://github.com/UXARRAY/uxarray. Online; accessed 20 May 2022. doi:10.5281/zenodo.5655065.




popmon: Analysis Package for Dataset Shift Detection
                                         Simon Brugman‡∗ , Tomas Sostak§ , Pradyot Patil‡ , Max Baak‡






Abstract—popmon is an open-source Python package to check the stability of a tabular dataset. popmon creates histograms of features binned in time-slices, and compares the stability of its profiles and distributions using statistical tests, both over time and with respect to a reference dataset. It works with numerical, ordinal and categorical features, on both pandas and Spark dataframes, and the histograms can be higher-dimensional, e.g. it can also track correlations between sets of features. popmon can automatically detect and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using monitoring business rules that are either static or dynamic. popmon results are presented in a self-contained report.

Index Terms—dataset shift detection, population shift, covariate shift, histogramming, profiling

* Corresponding author: simon.brugman@ing.com
‡ ING Analytics Wholesale Banking
§ Vinted

Copyright © 2022 Simon Brugman et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Fig. 1: The popmon package logo

Introduction

Tracking model performance is crucial to guarantee that a model behaves as designed and trained initially, and for determining whether to promote a model with the same initial design but trained on different data to production. Model performance depends directly on the data used for training and the data predicted on. Changes in the latter (e.g. certain word frequencies, user demographics, etc.) can affect the performance and make predictions unreliable.

Given that input data often change over time, it is important to track changes in both input distributions and delivered predictions periodically, and to act on them when they are significantly different from past instances – e.g. to diagnose and retrain an incorrect model in production. Predictions may be far ahead in time, so the performance can only be verified later, for example in one year. Taking action at that point might already be too late.

To make monitoring both more consistent and semi-automatic, ING Bank has created a generic Python package called popmon. popmon monitors the stability of data populations over time and detects dataset shifts, based on techniques from statistical process control and the dataset shift literature.

popmon employs so-called dynamic monitoring rules to flag and alert on changes observed over time. Using a specified reference dataset, from which observed levels of variation are extracted automatically, popmon sets allowed boundaries on the input data. If the reference dataset changes over time, the effective ranges on the input data can change accordingly. Dynamic monitoring rules make it easy to detect which (combinations of) features are most affected by changing distributions.

popmon is light-weight. For example, only one line is required to generate a stability report.

import popmon  # registers the popmon API

# df is an existing pandas (or Spark) dataframe with a "date" column
report = popmon.df_stability_report(
    df,
    time_axis="date",
    time_width="1w",
    time_offset="2022-1-1",
)
report.to_file("report.html")
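To illustrate the idea behind dynamic monitoring rules with a conceptual sketch (this is not popmon's implementation), one can derive allowed bounds per feature from a reference period and flag later time slices that fall outside them:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=210, freq="D"),
    "amount": rng.normal(100, 10, 210),
})
df.loc[df["date"] > "2022-06-15", "amount"] += 30  # injected shift

# Profile the feature per weekly time slice.
weekly = df.set_index("date")["amount"].resample("W").mean()

# Derive allowed boundaries from the reference period only.
reference = weekly[:"2022-03-31"]
low = reference.mean() - 4 * reference.std()
high = reference.mean() + 4 * reference.std()

# Time slices outside the dynamic bounds would trigger an alert.
alerts = weekly[(weekly < low) | (weekly > high)]
print(alerts)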
     Given that input data often change over time, it is important to                       The package is built on top of Python’s scientific computing
track changes in both input distributions and delivered predictions                         ecosystem (numpy, scipy [HMvdW+ 20], [VGO+ 20]) and sup-
periodically, and to act on them when they are significantly                                ports pandas and Apache Spark dataframes [pdt20], [WM10],
different from past instances – e.g. to diagnose and retrain an                             [ZXW+ 16]. This paper discusses how popmon monitors for
incorrect model in production. Predictions may be far ahead in                              dataset changes. The popmon code is modular in design and user
time, so the performance can only be verified later, for example in                         configurable. The project is available as open-source software.1
one year. Taking action at that point might already be too late.
     To make monitoring both more consistent and semi-automatic,                            Related work
ING Bank has created a generic Python package called popmon.
                                                                                            Many algorithms detecting dataset shift exist that follow a similar
popmon monitors the stability of data populations over time and
                                                                                            structure [LLD+ 18], using various data structures and algorithms
detects dataset shifts, based on techniques from statistical process
                                                                                            at each step [DKVY06], [QAWZ15]. However, few are readily
control and the dataset shift literature.
                                                                                            available to use in production. popmon offers both a framework
     popmon employs so-called dynamic monitoring rules to flag
                                                                                            that generalizes pipelines needed to implement those algorithms,
and alert on changes observed over time. Using a specified refer-
                                                                                            and default data drift pipelines, built on histograms with statistical
ence dataset, from which observed levels of variation are extracted
                                                                                            comparisons and profiles (see Sec. data representation).
automatically, popmon sets allowed boundaries on the input data.
                                                                                                Other families of tools have been developed that work on
If the reference dataset changes over time, the effective ranges on
                                                                                            individual data points, for model explanations (e.g. SHAP [LL17],
the input data can change accordingly. Dynamic monitoring rules
                                                                                            feature attributions [SLL20]), rule-based data monitoring (e.g.
* Corresponding author: simon.brugman@ing.com
                                                                                            Great Expectations, Deequ [GCSG22], [SLS+ 18]) and outlier
‡ ING Analytics Wholesale Banking                                                           detection (e.g. [RGL19], [LPO17]).
§ Vinted                                                                                        alibi-detect [KVLC+ 20], [VLKV+ 22] is somewhat
Copyright © 2022 Simon Brugman et al. This is an open-access article                        similar to popmon. This is an open-source Python library that
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,                  1. See https://github.com/ing-bank/popmon for code, documentation, tutori-
provided the original author and source are credited.                                       als and example stability reports.

Contributions

The advantage of popmon's dynamic monitoring rules over conventional static ones is that little prior knowledge of the input data is required to set sensible limits on the desired level of stability. This makes popmon a scalable solution over multiple datasets.
    To the best of our knowledge, no other monitoring tool exists that suits our criteria to monitor models in production for dataset shift. In particular, no other light-weight, open-source package is available that performs such extensive stability tests of a pandas or Spark dataset.
    We believe the combination of wide applicability, out-of-the-box performance, available statistical tests, and configurability makes popmon an ideal addition to the toolbox of any data scientist or machine learning engineer.

Approach

popmon tests the dataset stability and reports the results through a sequence of steps (Fig. 2):

    1)  The data are represented by histograms of features, binned in time-slices (Sec. data representation).
    2)  The data are arranged according to the selected reference type (Sec. comparisons).
    3)  The stability of the profiles and distributions of those histograms is compared using statistical tests, both with respect to a reference and over time. This works with numerical, ordinal and categorical features, and the histograms can be higher-dimensional, e.g. they can also track correlations between any two features (Sec. comparisons).
    4)  popmon can automatically flag and alert on changes observed over time, such as trends, anomalies, changing correlations, etc., using monitoring rules (Sec. alerting).
    5)  Results are reported to the user via a dedicated, self-contained report (Sec. reporting).

Fig. 2: Step-by-step overview of popmon's pipeline as described in section approach onward. (Diagram: the source data, optionally together with an external reference dataset, are partitioned on the time-axis; histograms per feature are built for each partition; comparisons between historical and new data yield metrics tracked over time; dynamic bounds derived from the reference distribution set traffic light bounds; the results are reported.)

Dataset shift

In the context of supervised learning, one can distinguish dataset shift as a shift in various distributions:

    1)  Covariate shift: shift in the independent variables (p(x)).
    2)  Prior probability shift: shift in the target variable (the class, p(y)).
    3)  Concept shift: shift in the relationship between the independent and target variables (i.e. p(x|y)).

    Note that there is a lot of variation in the terminology used; referring to probabilities prevents this ambiguity. For more information on dataset shift see Quinonero-Candela et al. [QCSSL08].
    popmon is primarily interested in monitoring the distributions of features p(x) and labels p(y) for monitoring trained classifiers. These data in deployment should ideally resemble the training data.

However, the package can be used more widely, for instance by monitoring interactions between features and the label, or the distribution of model predictions.

Temporal representation

popmon requires features to be distributed as a function of time (bins), which can be provided in two ways:

    1)  Time axis. Two-dimensional (or higher) distributions are provided, where the first dimension is time and the second is the feature to monitor. To get time slices, the time column needs to be specified, e.g. "date", including the bin width, e.g. one week ("1w"), and the offset, which is the lower edge of one time-bin, e.g. a certain start date ("2022-1-1").
    2)  Ordered data batches. A set of distributions of features is provided, corresponding to a new batch of data. This batch is considered a new time-slice, and is stitched to an existing set of batches, in order of incoming batches, where each batch is assigned a unique, increasing index. Together the indices form an artificial, binned time-axis.

Data representation

popmon uses histogram-based monitoring to track potential dataset shift and outliers over time, as detailed in the next subsection.
    In the literature, alternative data representations are also employed, such as kdq-trees [DKVY06]. Different data representations are in principle compatible with the popmon pipeline, as it is similarly structured to alternative methods (see [LLD+ 18], c.f. Fig 5).
    Dimensionality reduction techniques may be used to transform the input dataset into a space where the distances between instances are more meaningful for comparison, before using popmon, or in-between steps. For example a linear projection may be used as a preprocessing step, by taking the principal components of PCA as in [QAWZ15]. Machine learning classifiers or autoencoders have also been used for this purpose [LWS18], [RGL19] and can be particularly helpful for high-dimensional data such as images or text.

Histogram-based monitoring

There are multiple reasons behind the histogram-based monitoring approach taken in popmon.
    Histograms are small in size, and thus are efficiently stored and transferred, regardless of the input dataset size. Once data records have been aggregated feature-wise, with a minimum number of entries per bin, they are typically no longer privacy sensitive (e.g. knowing the number of records with age 30-35 in a dataset).
    popmon is primarily looking for changes in data distributions. Solely monitoring the (main) profiles of a distribution, such as the mean, standard deviation and min and max values, does not necessarily capture the changes in a feature's distribution. Well-known examples of this are Anscombe's Quartet [Ans73] and the dinosaurs datasets [MF17], where – between different datasets – the means and correlation between two features are identical, but the distributions are different. Histograms of the corresponding features (or feature pairs), however, do capture the corresponding changes.

Implementation

For the creation of histograms from data records the open-source histogrammar package has been adopted. histogrammar has been implemented in both Scala and Python [PS21], [PSSE16], and works on Spark and pandas dataframes respectively. The two implementations have been tested extensively to guarantee compatibility. The histograms coming out of histogrammar form the basis of the monitoring code in popmon, which otherwise does not require input dataframes. In other words, the monitoring code itself has no Spark or pandas data dependencies, keeping the code base relatively simple.

Histogram types

Three types of histograms are typically used:

   •   Normal histograms, meant for numerical features with known, fixed ranges. The bin specifications are the lowest and highest expected values and the number of (equidistant) bins.
   •   Categorical histograms, for categorical and ordinal features, typically boolean or string-based. A categorical histogram accepts any value: when not yet encountered, it creates a new bin. No bin specifications are required.
   •   Sparse histograms are open-ended histograms, for numerical features with no known range. The bin specifications only need the bin-width, and optionally the origin (the lower edge of bin zero, with a default value of zero). Sparse histograms accept any value. When the value is not yet encountered, a new bin gets created.

    For normal and sparse histograms reasonable bin specifications can be derived automatically. Both categorical and sparse histograms are dictionaries with histogram properties. New (index, bin) pairs get created whenever needed. Although this could result in out-of-memory problems, e.g. when histogramming billions of unique strings, in practice this is typically not an issue, as this can be easily mitigated. Features may be transformed into a representation with a lower number of distinct values, e.g. via embedding or substrings; or one selects the top-n most frequently occurring values.
    Open-ended histograms are ideal for monitoring dataset shift and outliers: they capture any kind of (large) data change. When there is a drift, there is no need to change the low- and high-range values. The same holds for outlier detection: if a new maximum or minimum value is found, it is still captured.
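    As an illustration of these three histogram types, the following sketch builds them directly with the histogrammar Python package; the feature names, bin settings and records are made up for the example and are not taken from popmon itself.

import histogrammar as hg

# Normal histogram: numerical feature with a known, fixed range.
hist_age = hg.Bin(num=20, low=0.0, high=100.0, quantity=lambda d: d["age"])

# Categorical histogram: boolean/string feature, bins are created on demand.
hist_country = hg.Categorize(quantity=lambda d: d["country"])

# Sparse histogram: open-ended numerical feature; only the bin width
# (and optionally the origin) needs to be specified.
hist_amount = hg.SparselyBin(binWidth=10.0, origin=0.0, quantity=lambda d: d["amount"])

for record in [{"age": 31, "country": "NL", "amount": 12.5},
               {"age": 64, "country": "LT", "amount": 250.0}]:
    hist_age.fill(record)
    hist_country.fill(record)
    hist_amount.fill(record)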
Dimensionality

A histogram can be multi-dimensional, and any combination of types is possible. The first dimension is always the time axis, which is always represented by a sparse histogram. The second dimension is the feature to monitor over time. When adding a third axis for another feature, the heatmap between those two features is created over time. For example, when monitoring financial transactions: the first axis could be time, the second axis client type, and the third axis transaction amount.
    Usually one feature is followed over time, or at maximum two. The synthetic datasets in section synthetic datasets contain examples of higher-dimensional histograms for known interactions.
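    Such multi-dimensional histograms can be expressed by nesting the histogrammar primitives shown above. The sketch below, again with made-up field names, nests a sparse amount histogram inside a categorical client-type histogram, mirroring the transaction example; it illustrates the nesting style rather than popmon's internal construction.

import histogrammar as hg

# 2D histogram: client type (categorical) x transaction amount (sparse bins).
client_vs_amount = hg.Categorize(
    quantity=lambda d: d["client_type"],
    value=hg.SparselyBin(binWidth=100.0, quantity=lambda d: d["amount"]),
)

client_vs_amount.fill({"client_type": "retail", "amount": 340.0})
client_vs_amount.fill({"client_type": "corporate", "amount": 12500.0})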

Additivity

Histograms are additive. As an example, a batch of data records arrives each week. A new batch arrives, containing timestamps that were missing in a previous batch. When histograms are made of the new batch, these can be readily summed with the histograms of the previous batches. The missing records are immediately put into the right time-slices.
    It is important that the bin specifications are the same between different batches of data, otherwise their histograms cannot be summed and comparisons are impossible.
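    A minimal sketch of this additivity, again using histogrammar directly (the feature name and bin width are illustrative): two histograms filled from different batches, but with identical bin specifications, can simply be added.

import histogrammar as hg

def batch_histogram(records):
    # Identical bin specifications for every batch keep the histograms additive.
    h = hg.SparselyBin(binWidth=10.0, quantity=lambda d: d["amount"])
    for r in records:
        h.fill(r)
    return h

batch1 = batch_histogram([{"amount": 25.0}, {"amount": 31.0}])
batch2 = batch_histogram([{"amount": 27.0}])  # late records for the same week

combined = batch1 + batch2  # histogrammar aggregators support '+'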
Limitations

There is one downside to using histograms: since the data get aggregated into bins, and profiles and statistical tests are obtained from the histograms, a slightly lower resolution is achieved than on the full dataset. In practice, however, this is a non-issue; histograms work well for data monitoring. The reference type and time-axis binning configuration allow the user to select an effective resolution.

Comparisons

In popmon the monitoring of data stability is based on statistical process control (SPC) techniques. SPC is a standard method to manage the data quality of high-volume data processing operations, for example in a large data warehouse [Eng99]. The idea is as follows. Most features have multiple sources of variation from underlying processes. When these processes are stable, the variation of a feature over time should remain within a known set of limits. The level of variation is obtained from a reference dataset, one that is deemed stable and trustworthy.
    For each feature in the input data (except the time column), the stability is determined by taking the reference dataset – for example the data on which a classification model was trained – and contrasting each time slot in the input data.
    The comparison can be done in two ways:

    1)  Comparisons: statistically comparing each time slot to the reference data (for example using Kolmogorov-Smirnov testing, χ2 testing, or the Pearson correlation).
    2)  Profiles: for example, tracking the mean of a distribution over time and contrasting this to the reference data. Similar analyses can be done for other summary statistics, such as the median, min, max or quantiles. This is related to the CUSUM technique [Pag54], a well-known method in SPC.

Reference types

Consider X to be an N-dimensional dataset representing our reference data, and X′ to be our incoming data. A covariate shift occurs when p(X) ≠ p(X′) is detected. Different choices for X and X′ may detect different types of drift (e.g. sudden, gradual, incremental). p(X) is referred to as the reference dataset.
    Many change-detection algorithms use a window-based solution that compares a static reference to a test window [DKVY06], or a sliding window for both, where the reference is dynamically updated [QAWZ15]. A static reference is a wise choice for monitoring a trained classifier: the performance of such a classifier depends on the similarity of the test data to the training data. Moreover, it may pick up an incremental departure (trend) from the initial distribution that will not be significant in comparison to the adjacent time-slots. A sliding reference, on the other hand, is updated with more recent data, which incorporates this trend. Consider the case where the data contain a price field that is yearly indexed to inflation; then using a static reference may alert purely on the trend.
    Reference implementations are provided for common scenarios, such as working with a fixed dataset, a batched dataset or streaming data. For instance, a fixed dataset is common for exploratory data analysis and one-off monitoring, whereas batched or streaming data is more common in a production setting.
    The reference may be static or dynamic. Four different reference types are possible:

    1)  Self-reference. Using the full dataset on which the stability report is built as a reference. This method is static: each time slot is compared to all the slots in the dataset. This is the default reference setting.
    2)  External reference. Using an external reference set, for example the training data of your classifier, to identify which time slots are deviating. This is also a static method: each time slot is compared to the full reference set.
    3)  Rolling reference. Using a rolling window on the input dataset, allowing one to compare each time slot to a window of preceding time slots. This method is dynamic: one can set the size of the window and the shift from the current time slot. By default the 10 preceding time slots are used.
    4)  Expanding reference. Using an expanding reference, allowing one to compare each time slot to all preceding time slots. This is also a dynamic method, with variable window size, since all available previous time slots are used. For example, with ten available time slots the window size is 9.
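    The reference type is selected when building the report. The snippet below extends the one-line example from the introduction; the reference_type and reference keyword arguments follow popmon's documentation at the time of writing (if the installed version differs, the corresponding settings may be named differently), and the dataframe contents are purely illustrative.

import pandas as pd
import popmon

df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=100, freq="D"),
    "amount": range(100),
})
train_df = df.head(50)  # e.g. the data a model was trained on

# Static, external reference: compare each time slot to the training data.
report = popmon.df_stability_report(
    df,
    time_axis="date",
    time_width="1w",
    time_offset="2022-1-1",
    reference_type="external",
    reference=train_df,
)
report.to_file("report_external.html")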
Statistical comparisons

Users may have various reasons to prefer one two-sample test over another. The appropriate comparison depends on our confidence in the reference dataset [Ric22], and certain tests may be more common in some fields. Many common tests are related [DKVY06], e.g. the χ2 function is the first-order expansion of the KL distance function.
    Therefore, popmon provides an extensible framework that allows users to provide custom two-sample tests using a simple syntax, via the registry pattern:

@Comparisons.register(key="jsd", description="JSD")
def jensen_shannon_divergence(p, q):
    m = 0.5 * (p + q)
    return (
        0.5 *
        (kl_divergence(p, m) + kl_divergence(q, m))
    )

Most commonly used test statistics are implemented, such as the Population Stability Index and the Jensen-Shannon divergence. The implementations of the χ2 and Kolmogorov-Smirnov tests account for statistical fluctuations in both the input and reference distributions. For example, this is relevant when comparing adjacent, low-statistics time slices.

Profiles

Tracking the distribution of values of interest over time is achieved via profiles. These are functions of the input histogram. Metrics may be defined for all dimensions (e.g. count, correlations), or for specific dimensions, as in the case of 1D numerical histograms (e.g. quantiles). Extending the existing set of profiles is possible via a syntax similar to the one above:

@Profiles.register(
    key=["q5", "q50", "q95"],
    description=[
        "5% percentile",
        "50% percentile (median)",
        "95% percentile"
    ],
    dim=1,
    type="num"
)
def profile_quantiles(values, counts):
    return logic_goes_here(values, counts)
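Here logic_goes_here stands for the actual quantile computation. One possible stand-in on binned data, with bin centres in values and bin entries in counts, is sketched below; this is a hypothetical implementation, not the one shipped with popmon.

import numpy as np

def logic_goes_here(values, counts, quantiles=(0.05, 0.50, 0.95)):
    # Approximate quantiles from binned data (bin centres + bin counts).
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cdf = np.cumsum(counts) / counts.sum()
    return [float(values[np.searchsorted(cdf, q)]) for q in quantiles]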

Denote x_i(t) as profile i of feature x at time t, for example the 5% quantile of the histogram of incoming transaction amounts in a given week. Identical bin specifications are assumed between the reference and incoming data. x̄_i is defined as the average of that metric on the reference data, and σ_{x_i} as the corresponding standard deviation.
    The normalized residual between the incoming and reference data, also known as the "pull" or "Z-score", is given by:

    pull_i(t) = \frac{x_i(t) - \bar{x}_i}{\sigma_{x_i}}

When the underlying sources of variation are stable, and assuming the reference dataset is asymptotically large and independent from the incoming data, pull_i(t) follows a normal distribution centered around zero and with unit width, N(0, 1), as dictated by the central limit theorem [Fis11].
    In practice, the criteria for normality are hardly ever met. Typically the distribution is wider with larger tails. Yet, approximately normal behaviour is exhibited. Chebyshev's inequality [Che67] guarantees that, for a wide class of distributions, no more than 1/k² of the distribution's values can be k or more standard deviations away from the mean. For example, a minimum of 75% (88.9%) of values must lie within two (three) standard deviations of the mean. These boundaries reoccur in Sec. dynamic monitoring rules.

Alerting

For alerting, popmon uses traffic-light-based monitoring rules, raising green, yellow or red alerts to the user. Green alerts signal the data are fine, yellow alerts serve as warnings of meaningful deviations, and red alerts need critical attention. These monitoring rules can be static or dynamic, as explained in this section.

Static monitoring rules

Static monitoring rules are traditional data quality rules (e.g. [RD00]). Denote x_i(t) as metric i of feature x at time t, for example the number of NaNs encountered in feature x on a given day. As an example, the following traffic lights might be set on x_i(t):

    TL(x_i, t) = \begin{cases} \text{Green}, & \text{if } x_i(t) \leq 1 \\ \text{Yellow}, & \text{if } 1 < x_i(t) \leq 10 \\ \text{Red}, & \text{if } x_i(t) > 10 \end{cases}

The thresholds of this monitoring rule are fixed, and considered static over time. They need to be set by hand, to sensible values. This requires domain knowledge of the data and the processes that produce it. Setting these traffic light ranges is a time-costly process when covering many features and corresponding metrics.

Fig. 3: A snapshot of part of the HTML stability report. It shows the aggregated traffic light overview. This view can be used to prioritize features for inspection.

Dynamic monitoring rules

Dynamic monitoring rules are complementary to static rules. The levels of variation in feature metrics are assumed to have been measured on the reference data. Per feature metric, incoming data are compared against the reference levels. When (significantly) outside of the known bounds, instability of the underlying sources is assumed, and a warning gets raised to the user.
    popmon's dynamic monitoring rules raise traffic lights to the user whenever the normalized residual pull_i(t) falls outside certain, configurable ranges. By default:

    TL(pull_i, t) = \begin{cases} \text{Green}, & \text{if } |pull_i(t)| \leq 4 \\ \text{Yellow}, & \text{if } 4 < |pull_i(t)| \leq 7 \\ \text{Red}, & \text{if } |pull_i(t)| > 7 \end{cases}

If the reference dataset is changing over time, the effective ranges on x_i(t) can change as well. The advantage of this approach over static rules is that significant deviations in the incoming data can be flagged and alerted to the user for a large set of features and corresponding metrics, requiring little (or no) prior knowledge of the data at hand. The relevant knowledge is all extracted from the reference dataset.
    With multiple feature metrics, many dynamic monitoring tests can be performed on the same dataset. This raises the multiple comparisons problem: the more inferences are made, the more likely erroneous red flags are raised. To compensate for a large number of tests being made, typically one can set wider traffic light boundaries, reducing the false positive rate.2 The boundaries control the size of the deviations – or the number of red and yellow alerts – that the user would like to be informed of.
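    Putting the pieces together, a dynamic rule amounts to a pull computation followed by a threshold check. A small self-contained sketch, with the default boundaries of 4 and 7 used above, is:

def pull(metric_value, ref_mean, ref_std):
    # Normalized residual of a metric value against the reference level.
    return (metric_value - ref_mean) / ref_std

def traffic_light(pull_value, yellow=4.0, red=7.0):
    # Map |pull| onto the popmon-style traffic light colours.
    p = abs(pull_value)
    if p <= yellow:
        return "green"
    if p <= red:
        return "yellow"
    return "red"

# Example: a weekly 5% quantile of 130 against a reference mean of 100 and
# standard deviation of 5 gives a pull of 6, i.e. a yellow alert.
print(traffic_light(pull(130.0, 100.0, 5.0)))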
Reporting

popmon outputs monitoring results as HTML stability reports. The reports offer multiple views of the data (histograms and heatmaps), the profiles and comparisons, and traffic light alerts. There are several reasons for providing self-contained reports: they can be opened in the browser, easily shared, stored as artifacts, and tracked using tools such as MLFlow.

    2. Alternatively one may apply the Bonferroni correction to counteract this problem [Bon36].




The reports also have no need for an advanced infrastructure setup, and can be created and viewed in many environments: from a local machine, a (restricted) environment, to a public cloud. If, however, a certain dashboarding tool is available, then the metrics computed by popmon are exposed and can be exported into that tool, for example Kibana [Ela22]. One downside of producing self-contained reports is that they can get large when the plots are pre-rendered and embedded. This is mitigated by embedding plots as JSON that are (lazily) rendered on the client side. Plotly express [Plo22] powers the interactive embedded plots in popmon as of v1.0.0.
    Note that multiple reference types can be used in the same stability report. For instance, popmon's default reference pipelines always include a rolling comparison with window size 1, i.e. comparing to the preceding time slot.

Synthetic datasets

In the literature synthetic datasets are commonly used to test the effectiveness of dataset shift monitoring approaches [LLD+ 18]. One can test the detection for all kinds of shifts, as the generation process controls when and how the shift happens. popmon has been tested on multiple such artificial datasets: Sine1, Sine2, Mixed, Stagger, Circles, LED, SEA and Hyperplane [PVP18], [SK], [Fan04]. These datasets cover myriad dataset shift characteristics: sudden and gradual drifts, dependency of the label on just one or multiple features, binary and multiclass labels, and containing unrelated features. The dataset descriptions and sample popmon configurations are available in the code repository.
    The reports generated by popmon capture features and time bins where the dataset shift is occurring for all tested datasets. Interactions between features and the label can be used for feature selection, in addition to monitoring the individual feature distributions. The sudden and gradual drifts are clearly visible using a rolling reference, see Fig. 4 for examples. The drift in the Hyperplane dataset, incremental and gradual, is not expected to be detected using a rolling reference or self-reference. Moreover, the dataset is synthesized so that the distribution of the features and the class balance do not change [Fan04].

Fig. 4: LED: Pearson correlation compared with the previous histogram. The shifting points are correctly identified at every 5th of the LED dataset. Similar patterns are visible for other comparisons, e.g. χ2.

Fig. 5: Sine1: The dataset shifts around data points 20,000, 40,000, 60,000 and 80,000 of the Sine1 dataset are clearly visible.

    The process to monitor this dataset could be set up in multiple ways, one of which is described here. A logistic regression model is trained on the first 10% of the data, which is also used as a static reference. The predictions of this model are added to the dataset, simulating a machine learning model in production. popmon is able to pick up the divergence between the predictions and the class label, as depicted in Figure 6.

Fig. 6: Hyperplane: The incremental drift compared to the reference dataset is observed for the PhiK correlation between the predictions and the label.
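    A setup along these lines can be sketched as follows. The synthetic dataframe, the scikit-learn model and the popmon keyword arguments (reference_type, reference) are illustrative assumptions used to mirror the description above, not the authors' exact configuration.

import numpy as np
import pandas as pd
import popmon
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in for a drifting stream with two features and a label.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=n, freq="min"),
    "x0": rng.normal(size=n),
    "x1": rng.normal(size=n),
})
df["label"] = (df["x0"] + df["x1"] > 0).astype(int)

# Train on the first 10% of the data and add the predictions as a column to monitor.
train = df.iloc[: n // 10].copy()
model = LogisticRegression().fit(train[["x0", "x1"]], train["label"])
df["prediction"] = model.predict(df[["x0", "x1"]])

# Compare every time slot against the static training slice.
report = popmon.df_stability_report(
    df,
    time_axis="date",
    time_width="1d",
    reference_type="external",
    reference=df.iloc[: n // 10],
)
report.to_file("hyperplane_like_report.html")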
Conclusion

This paper has presented popmon, an open-source Python package to check the stability of a tabular dataset. Built around histogram-based monitoring, it runs on a dataset of arbitrary size, supporting both pandas and Spark dataframes. Using the variations observed in a reference dataset, popmon can automatically detect and flag deviations in incoming data, requiring little prior domain knowledge. As such, popmon is a scalable solution that can be applied to many datasets. By default its findings get presented in a single HTML report. This makes popmon ideal for both exploratory data analysis and as a monitoring tool for machine learning models running in production. We believe the combination of out-of-the-box performance and presented features makes popmon an excellent addition to the data practitioner's toolbox.

Acknowledgements

We thank our colleagues from the ING Analytics Wholesale Banking team for fruitful discussions, all past contributors to popmon, and in particular Fabian Jansen and Ilan Fridman Rojas for carefully reading the manuscript. This work is supported by ING Bank.

REFERENCES

[Ans73]     F.J. Anscombe. Graphs in statistical analysis. American Statistician, 27(1), pages 17–21, 1973. URL: https://doi.org/10.2307/2682899, doi:10.2307/2682899.
[Bon36]     Carlo Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8:3–62, 1936.
[Che67]     Pafnutii Lvovich Chebyshev. Des valeurs moyennes. Liouville's J. Math. Pures Appl., 12:177–184, 1867.
[DKVY06]    Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2006.
[Ela22]     Elastic. Kibana, 2022. URL: https://github.com/elastic/kibana.
[Eng99]     Larry English. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. Wiley, 1999.
[Fan04]     Wei Fan. Systematic data selection to mine concept-drifting data streams. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, page 128–137, New York, NY, USA, 2004. Association for Computing Machinery. URL: https://doi.org/10.1145/1014052.1014069, doi:10.1145/1014052.1014069.
[Fis11]     Hans Fischer. The Central Limit Theorem from Laplace to Cauchy: Changes in Stochastic Objectives and in Analytical Methods, pages 17–74. Springer New York, New York, NY, 2011. URL: https://doi.org/10.1007/978-0-387-87857-7_2, doi:10.1007/978-0-387-87857-7_2.
[GCSG22]    Abe Gong, James Campbell, Superconductive, and Great Expectations. Great Expectations, 2022. URL: https://github.com/great-expectations/great_expectations, doi:10.5281/zenodo.5683574.
[HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.
[KVLC+ 20]  Janis Klaise, Arnaud Van Looveren, Clive Cox, Giovanni Vacanti, and Alexandru Coca. Monitoring and explainability of models in production. arXiv preprint arXiv:2007.06299, 2020. URL: https://doi.org/10.48550/arXiv.2007.06299, doi:10.48550/arXiv.2007.06299.
[LL17]      Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
[LLD+ 18]   Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018. doi:10.1109/TKDE.2018.2876857.
[LPO17]     David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.
[LWS18]     Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International conference on machine learning, pages 3122–3130. PMLR, 2018.
[MF17]      Justin Matejka and George Fitzmaurice. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI conference on human factors in computing systems, pages 1290–1294, 2017. URL: https://doi.org/10.1145/3025453.3025912, doi:10.1145/3025453.3025912.
[Pag54]     Ewan S Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954. URL: https://doi.org/10.2307/2333009, doi:10.2307/2333009.
[pdt20]     The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134.
[Plo22]     Plotly Development Team. Plotly.py: The interactive graphing library for Python (includes Plotly Express), 6 2022. URL: https://github.com/plotly/plotly.py.
[PS21]      Jim Pivarski and Alexey Svyatkovskiy. histogrammar/histogrammar-scala: v1.0.20, April 2021. URL: https://doi.org/10.5281/zenodo.4660177, doi:10.5281/zenodo.4660177.
[PSSE16]    Jim Pivarski, Alexey Svyatkovskiy, Ferdinand Schenck, and Bill Engels. histogrammar-python: 1.0.0, September 2016. URL: https://doi.org/10.5281/zenodo.61418, doi:10.5281/zenodo.61418.
[PVP18]     Ali Pesaranghader, Herna Viktor, and Eric Paquet. Reservoir of diverse adaptive learners and stacking fast hoeffding drift detection methods for evolving data streams. Machine Learning, 107(11):1711–1743, 2018. URL: https://doi.org/10.1007/s10994-018-5719-z, doi:10.1007/s10994-018-5719-z.
[QAWZ15]    Abdulhakim A Qahtan, Basma Alharbi, Suojin Wang, and Xiangliang Zhang. A pca-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944, 2015. doi:10.1145/2783258.2783359.
[QCSSL08]   Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. Mit Press, 2008.
[RD00]      Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[RGL19]     Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/846c260d715e5b854ffad5f70a516c88-Abstract.html.
[Ric22]     Oliver E Richardson. Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function. In International Conference on Artificial Intelligence and Statistics, pages 2706–2735. PMLR, 2022.
[SK]        W Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, page 377–382, New York, NY, USA. Association for Computing Machinery. URL: https://doi.org/10.1145/502512.502568, doi:10.1145/502512.502568.
[SLL20]     Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Visualizing the impact of feature attribution baselines. Distill, 2020. https://distill.pub/2020/attribution-baselines. doi:10.23915/distill.00022.
[SLS+ 18]   Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. Automating large-scale data quality verification. Proc. VLDB Endow., 11(12):1781–1794, aug 2018. URL: https://doi.org/10.14778/3229863.3229867, doi:10.14778/3229863.3229867.
[VGO+ 20]   Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
[VLKV+ 22]   Arnaud Van Looveren, Janis Klaise, Giovanni Vacanti, Oliver
             Cobb, Ashley Scillitoe, and Robert Samoilescu. Alibi Detect:
             Algorithms for outlier, adversarial and drift detection, 4 2022.
             URL: https://github.com/SeldonIO/alibi-detect.
[WM10]       Wes McKinney. Data Structures for Statistical Computing in
             Python. In Stéfan van der Walt and Jarrod Millman, editors,
             Proceedings of the 9th Python in Science Conference, pages
             56–61, 2010. doi:10.25080/Majora-92bf1922-00a.
[ZXW+ 16]    Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata
             Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh
             Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Gh-
             odsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache
             spark: A unified engine for big data processing. Commun.
             ACM, 59(11):56–65, oct 2016. URL: https://doi.org/10.1145/
             2934664, doi:10.1145/2934664.




    pyDAMPF: a Python package for modeling
mechanical properties of hygroscopic materials under
           interaction with a nanoprobe
Willy Menacho‡§, Gonzalo Marcelo Ramírez-Ávila‡§, Horacio V. Guzman¶‖‡§∗






Abstract—pyDAMPF is a tool oriented to the Atomic Force Microscopy (AFM) community, which allows the simulation of the physical properties of materials under variable relative humidity (RH). In particular, pyDAMPF is mainly focused on the mechanical properties of polymeric hygroscopic nanofibers that play an essential role in designing tissue scaffolds for implants and filtering devices. Those mechanical properties have been mostly studied from a very coarse perspective reaching a micrometer scale. However, at the nanoscale, the mechanical response of polymeric fibers becomes cumbersome due to both experimental and theoretical limitations. For example, the response of polymeric fibers to RH demands advanced models that consider sub-nanometric changes in the local structure of each single polymer chain. From an experimental viewpoint, choosing the optimal cantilevers to scan the fibers under variable RH is not trivial.
    In this article, we show how to use pyDAMPF to choose one optimal nanoprobe for planned experiments with a hygroscopic polymer. Along these lines, we show how to evaluate common and non-trivial operational parameters of AFM cantilevers from different manufacturers. Our results show in a stepwise approach the most relevant parameters to compare the cantilevers based on a non-invasive criterion of measurements. The computing engine is written in Fortran, and wrapped into Python. This aims to reuse physics code without losing interoperability with high-level packages. We have also introduced an in-house and transparent method for allowing multi-thread computations to the users of the pyDAMPF code, which we benchmarked for various computing architectures (PC, Google Colab and an HPC facility), resulting in a very favorable speed-up compared to former AFM simulators.

Index Terms—Materials science, Nanomechanical properties, AFM, f2py, multi-threading CPUs, numerical simulations, polymers

Introduction and Motivation

This article provides an overview of pyDAMPF, which is a BSD licensed, Python and Fortran modeling tool that enables AFM users to simulate the interaction between a probe (cantilever) and materials at the nanoscale under diverse environments. The code is packaged in a bundle and hosted on GitHub at (https://github.com/govarguz/pyDAMPF).
    Despite the recent open-source availability of dynamic AFM simulation packages [GGG15], [MHR08], a broad usage for the assessment and planning of experiments has yet to come. One of the problems is that it is often hard to simulate several operational parameters at once. For example, most scientists evaluate different AFM cantilevers before starting new experiments. A typical evaluation criterion is the maximum exerted force that prevents invasivity of the nanoprobe into the sample. The variety of AFM cantilevers depends on the geometrical and material characteristics used for their fabrication. Moreover, manufacturers' nanofabrication techniques may change from time to time, according to the necessities of the experiments, like sharper tips and/or higher oscillation frequencies. From a simulation perspective, evaluating observables for reaching optimal results on upcoming experiments is nowadays possible for tens or hundreds of cantilevers. On top of other operational parameters in the case of dynamic AFM, like the oscillation amplitude A0 and set-point Asp, among other expected material properties that may feed simulations, this easily creates simulation batches of thousands of cases. Given this context, we focus this article on choosing a cantilever out of an initial pyDAMPF database of 30. In fact, many of them are similar in terms of spring constant kc, cantilever volume Vc and also tip radius RT. We then focus on seven archetypical and distinct cases/cantilevers to understand the characteristics of each of the parameters specified in the manufacturers' datasheets, by evaluating the maximum (peak) forces.
    We present four scenarios comparing a total of seven cantilevers and the same sample, where we use a Poly-Vinyl Acetate (PVA) fiber as a test case. The first scenario (Figure 1) illustrates the difference between air and a moist environment. In the second one, only very soft and very stiff cantilever spring constants are compared (see Figure 1b). At the same time, the different volumes along the 30 cantilevers are depicted in Figure 3. A final and very common comparison is scenario 4, comparing one of the parameters most sensitive to the force, the tip's radius (see Figure 4).
    The quantitative analysis for these four scenarios is presented, as well as the advantages of computing several simulation cases at once with our in-house development. Such a comparison is performed on the most common computers used in science, namely personal computers (PC), cloud (Colab) and supercomputing (a small Xeon-based cluster). We reach a speed-up of 20 over the former implementation [GGG15].

‡ Instituto de Investigaciones Físicas.
§ Carrera de Física, Universidad Mayor de San Andrés. Campus Universitario Cota Cota. La Paz, Bolivia
* Corresponding author: horacio.guzman@ijs.si
¶ Department of Theoretical Physics
‖ Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Copyright © 2022 Willy Menacho et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
vided the original author and source are credited.                                       over the former implementation [GGG15].

    Another novelty of pyDAMPF is the detailed [GS05] calculation of the environment-related parameters, like the quality factor Q.
    Here, we summarize the main features of pyDAMPF:

   •   Highly efficient structure in terms of time-to-result, at least one order of magnitude faster than existing approaches.
   •   Easy to use for scientists without a computing background, in particular in the use of multi-threads.
   •   It supports the addition of further AFM cantilevers and parameters into the code database.
   •   Allows an interactive analysis, including a graphical and table-based comparison of results through Jupyter Notebooks.

   The results presented in this article are available as a Google Colaboratory notebook, which makes it easy to explore pyDAMPF and these examples.

Methods

Processing inputs

pyDAMPF comes with an initial database of 30 cantilevers, which can be extended at any time by editing the file cantilevers_data.txt. The program inputs_processor.py then reads the cantilever database and asks for further physical and operational variables required to start the simulations. This will generate tempall.txt, which contains all cases, e.g. 30, to be simulated with

    Serial method: This method is completely transparent to the user and will execute all the simulation cases found in the file tempall.txt by running the script inputs_processor.py. Our in-house development creates an individual folder for each simulation case, which can be executed in one thread.

import os
import shutil

def serial_method(tcases, factor, tempall):
    # gen_limites and change_dir are helper routines defined elsewhere in pyDAMPF.
    lst = gen_limites(tcases, factor)
    change_dir()
    for i in range(1, factor + 1):
        direc = os.getcwd()
        direc2 = direc + '/pyDAMPF_BASE/'
        direc3 = direc + '/SERIALBASIC_0/' + str(i) + '/'
        # Copy the simulation template into a fresh folder for this block of cases.
        shutil.copytree(direc2, direc3)
    os.chdir(direc + '/SERIALBASIC_0/1/nrun/')
    exec(open('generate_cases.py').read())

As arguments, the serial method requires the total number of simulation cases obtained from tempall.txt. In contrast, the factor parameter has, in this case, a default value of 1.
and these examples.                                                               Parallel method: The parallel method uses more than one
                                                                         computational thread. It is similar to the serial method; however,
                                                                         this method distributes the total load along the available threads
Methods                                                                  and executes in a parallel-fashion. This method comprises two
                                                                         parts: first, a function that takes care of the bookkeeping of cases
Processing inputs
                                                                         and folders:
pyDAMPF counts with an initial database of 30 cantilevers,
                                                                         def Parallel_method(tcases, factor, tempall):
which can be extended at any time by accessing to the file can-              lst = gen_limites(tcases, factor)
tilevers_data.txt then, the program inputs_processor.py reads the            change_dir()
cantilever database and asks for further physical and operational            for i in range(1,factor+1):
                                                                                 lim_inferior=lst[i-1][0]
variables, required to start the simulations. This will generate                 lim_superior=lst[i-1][1]
tempall.txt, which contains all cases e.g. 30 to be simulated with               direc =os.getcwd()
pyDAMPF                                                                          direc2 =direc+'/pyDAMPF_BASE/'
                                                                                 direc3 =direc+'/SERIALBASIC_0/'+str(i)+'/'
def inputs_processor(variables,data):                                            shutil.copytree ( direc2,direc3)
    a,b = np.shape(data)                                                         factorantiguo = ' factor=1'
    final = gran_permutador( variables, data)                                    factornuevo='factor='+str(factor)
    f_name = ' tempall.txt'                                                      rangoantiguo = '( 0,paraleliz)'
    np.savetxt(f_name,final)                                                     rangonuevo='('+str(lim_inferior)+','
    directory = os.getcwd()                                                                  +str(lim_superior)+')'
    shutil.copy(directory+'/tempall.txt',directory+'                             os.chdir(direc+'/PARALLELBASIC_0/'+str(i))
            /EXECUTE_pyDAMPF/')                                                  pyname =' nrun/generate_cases.py'
    shutil.copy(directory+'/tempall.txt',directory+'                             newpath=direc+'/PARALLELBASIC_0/'+str(i)+'/'
            /EXECUTE_pyDAMPF/pyDAMPF_BASE/nrun/')                                            +pyname
                                                                                 reemplazo(newpath,factorantiguo,factornuevo)
The variables inside the argument of the function inputs_processor               reemplazo(newpath,rangoantiguo,rangonuevo)
are interactively requested from a shell command line. Then the                  os.chdir(direc)
file tempall.txt is generated and copied to the folders that will
                                                                         This part generates serial-like folders for each thread’s number of
contain the simulations.
                                                                         cases to be executed.
                                                                             The second part of the parallel method will execute pyDAMPF,
Execute pyDAMPF                                                          which contains at the same time two scripts. One for executing
                                                                         pyDAMPF in a common UNIX based desktop or laptop. While the
For execution in a single or multi-thread way, we require first          second is a python script that generated SLURM code to launch
to wrap our numeric core from Fortran to Python by using                 jobs in HPC facilities.
f2py [Vea20]. Namely, the file pyDAMPF.f90 within the folder
EXECUTE_pyDAMPF.                                                            •   Execution with SLURM
       Compilation with f2py: This step is only required once
                                                                            It runs pyDAMPF in different threads under the SLURM
and depends on the computer architecture the code for this reads:
                                                                         queuing system.
f2py -c --fcompiler=gnu95 pyDAMPF.f90 -m mypyDAMPF
                                                                         def cluster(factor):
                                                                             for i in range(1,factor+1):
This command-line generates mypyDAMPF.so, which will be
                                                                                  with open('jobpyDAMPF'+str(i)+'.x','w')
automatically located in the simulation folders.                                              as ssf :
    Once we have obtained the numerical code as Python modules,                       ssf.write('#/bin/bashl|n ')
we need to choose the execution mode, which can be serial or                          ssf.write('#SBATCH--time=23:00:00
                                                                             \n')
parallel. Whereby parallel refers to multi-threading capabilities                     ssf.write('#SBATCH--constraint=
only within this first version of the code.                                  epyc3\n')
204                                                                                         PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

                   ssf.write('\n')
                   ssf.write('ml Anaconda3/2019.10\n')
                   ssf.write('\n')
                   ssf.write('ml foss/2018a\n')
                   ssf.write('\n')
                   ssf.write('cd/home/$<USER>/pyDAMPF/
          EXECUTE_pyDAMPF/PARALLELBASIC_0/'+str(i)+'/nrun
          \n')
                   ssf.write('\n')
                   ssf.write('echo$pwd\n')
                   ssf.write('\n')
                   ssf.write('python3 generate_cases.py
          \n')
                   ssf.close();
               os.system(sbatch jobpyDAMPF)'+str(i)+'
          .x;')
               os.system(rm jobpyDAMPF)'+str(i)+'.x;')

The above script generates SLURM jobs for a chosen set of
threads; after launched, those jobs files are erased in order to
improve bookkeeping.
Fig. 1: Schematic of the tip-sample interface comparing air with air at a given Relative Humidity.

   •   Parallel execution with UNIX-based laptops or desktops

   Usually, microscope (AFM) computers have no SLURM preinstalled; for such a configuration, we run the following script:
def compute(factor):
    direc = os.getcwd()
    for i in range(1, factor+1):
        os.chdir(direc+'/PARALLELBASIC_0/'+str(i)+'/nrun')
        os.system('python3 generate_cases.py &')
        os.chdir(direc)

This function allows the proper execution of the parallel case without a queuing system, where a slight delay might appear between the execution of one thread and the next.
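To give a feel for how these pieces fit together, a driver along the following lines could tie them up. This is a minimal sketch, not part of the pyDAMPF sources; it reuses the Parallel_method, cluster and compute functions listed above, assumes tempall.txt is in the current directory, and the thread count of 10 is only an example.

import shutil
import numpy as np

# Count the simulation cases generated by inputs_processor.py.
tempall = np.loadtxt('tempall.txt')
tcases = tempall.shape[0]
factor = 10                      # number of computational threads (example value)

# Create one folder per thread and patch its generate_cases.py.
Parallel_method(tcases, factor, tempall)

# Launch the jobs with SLURM if available, otherwise in the background.
if shutil.which('sbatch') is not None:
    cluster(factor)
else:
    compute(factor)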

Analysis

       Graphically:

   •   With static graphics, as shown in Figures 5, 9, 13 and 17:

python3 Graphical_analysis.py

   •   With interactive graphics, as shown in Figure 18:

pip install plotly
jupyter notebook Graphical_analysis.ipynb

Fig. 2: Schematic of the tip-sample interface comparing a hard (stiff) cantilever with a soft cantilever.

       Quantitatively:

   •   With a static data table:

python3 Quantitative_analysis.py

   •   With interactive tables: Quantitative_analysis.ipynb uses tabloo, a minimalistic dashboard application for tabular data visualization with an easy installation:

pip install tabloo
jupyter notebook Quantitative_analysis.ipynb
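Inside the notebook, the interactive table view boils down to a couple of lines. The sketch below is a minimal example of our own, not taken from Quantitative_analysis.ipynb; the file name results.csv is a placeholder for the tables produced by the quantitative analysis.

import pandas as pd
import tabloo

# Placeholder file; the notebook loads the pyDAMPF result tables instead.
results = pd.read_csv('results.csv')

# Opens a sortable, filterable table view in the browser.
tabloo.show(results)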


Results and discussions

We show four scenarios to be tackled in this test case for pyDAMPF (Figures 1 to 4). As described in the introduction, the first scenario (Figure 1) compares air and a moist environment, the second tackles soft and stiff cantilevers (see Figure 2), next is Figure 3, with the cantilever volume comparison, and finally the force as a function of the tip's radius (see Figure 4). Further details of the cantilevers depicted here are included in Table 22.

Fig. 3: Schematic of the tip-sample interface comparing a cantilever with a high volume with a cantilever with a small volume.




Fig. 4: Schematic of the tip-sample interface comparing a cantilever with a wide tip with a cantilever with a sharp tip.

Fig. 5: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete oscillations are shown, corresponding to air and air at a given Relative Humidity. The simulations were performed for Asp/A0 = 0.8.

Fig. 6: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete oscillations are shown, corresponding to a hard (stiff) cantilever and a soft cantilever. The simulations were performed for Asp/A0 = 0.8.

Fig. 7: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete oscillations are shown, corresponding to a cantilever with a high volume and a cantilever with a small volume. The simulations were performed for Asp/A0 = 0.8.
    The AFM is widely used for mechanical property mapping of matter [Gar20]. Hence, the first comparison of the four scenarios addresses the force response versus time according to a Hertzian interaction [Guz17]. In Figure 5, we see that humid air (RH = 60.1%) changes the measurement conditions by almost 10%. Using a stiffer cantilever (kc = 2.7 [N/m]) will also increase the force by almost 50% with respect to the softer one (kc = 0.8 [N/m]), see Figure 6. Interestingly, regarding the cantilever's volume, the smaller cantilever results in the highest force, almost five-fold that of the cantilever with the largest volume (Figure 7). Finally, a tip radius difference between 8 and 20 nm will impact the force by roughly 40 pN (Figure 8).

Fig. 8: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete oscillations are shown, corresponding to a cantilever with a wide tip and a cantilever with a sharp tip. The simulations were performed for Asp/A0 = 0.8.

    Now, if we consider literature values for different RH [FCK+ 12], [HLLB09], we can evaluate the peak (maximum) forces. This force, in all cases depicted in Figure 9, shows a monotonically increasing behavior with higher Young's modulus. Remarkably, the force varies in a range of 25% from dried PVA to PVA at RH = 60.1% (see Figure 9).




Fig. 9: Peak force reached for a PVA sample subjected to different relative humidities (0.0%, 29.5%, 39.9% and 60.1%), corresponding to air and air at a given Relative Humidity. The simulations were performed for Asp/A0 = 0.8.

Fig. 10: Peak force reached for a PVA sample subjected to different relative humidities (0.0%, 29.5%, 39.9% and 60.1%), corresponding to a hard (stiff) cantilever and a soft cantilever. The simulations were performed for Asp/A0 = 0.8.

Fig. 11: Peak force reached for a PVA sample subjected to different relative humidities (0.0%, 29.5%, 39.9% and 60.1%), corresponding to a cantilever with a high volume and a cantilever with a small volume. The simulations were performed for Asp/A0 = 0.8.

Fig. 12: Peak force reached for a PVA sample subjected to different relative humidities (0.0%, 29.5%, 39.9% and 60.1%), corresponding to a cantilever with a wide tip and a cantilever with a sharp tip. The simulations were performed for Asp/A0 = 0.8.
    In order to properly describe operational parameters in dynamic AFM, we analyze the dependence of the peak force on the set-point amplitude Asp. Figure 13 compares the peak forces of the different cantilevers as a function of Asp. The sensitivity of the peak force is higher for the cantilevers with varying kc and Vc. Hertzian mechanics gives a peak force that depends on the square root of the tip radius, so the radii listed in Table 22 do not influence the force much; however, they could strongly influence resolution [GG13].
    Figure 17 shows the dependence of the peak force as a function of kc, Vc, and RT, respectively, for all the cantilevers listed in Table 22, constituting a graphical summary of the seven analyzed cantilevers for completeness of the analysis.
    Another way to summarize the results of AFM simulations is to show the force vs. distance curves (see Fig. 18), which in this case show, for example, exactly how a stiffer cantilever may penetrate more into the sample, by simply checking the distance the cantilever reaches. On the other hand, it is also immediately apparent that a cantilever with a small volume has less damping from the environment and thus also indents more into the sample than the ones with higher volume. Although these types of plots are the easiest to make, they carry a lot of experimental information. In addition, pyDAMPF can plot such 3D figures interactively, which enables a detailed comparison of those curves.

Fig. 13: Dependence of the maximum force on the set-point amplitude, corresponding to air and air at a given Relative Humidity.




Fig. 14: Dependence of the maximum force on the set-point amplitude, corresponding to a hard (stiff) cantilever and a soft cantilever.

Fig. 15: Dependence of the maximum force on the set-point amplitude, corresponding to a cantilever with a high volume and a cantilever with a small volume.

Fig. 16: Dependence of the maximum force on the set-point amplitude, corresponding to a cantilever with a wide tip and a cantilever with a sharp tip.

Fig. 17: Dependence of the maximum force on the most important characteristics of each cantilever, filtering the cantilevers used for the scenarios. The figure shows the maximum force as a function of: (a) the force constant k, (b) the cantilever tip radius, and (c) the cantilever volume, respectively. The simulations were performed for Asp/A0 = 0.8.

Fig. 18: Three-dimensional plots of the various cantilevers provided by the manufacturer and those in the pyDAMPF database that establish a given maximum force at a given distance between the tip and the sample, for a PVA polymer subjected to RH = 0% with E = 930 [MPa].

    As we aim at a massive use of pyDAMPF, we also performed the corresponding benchmarks on four different computing platforms, where two of them resemble the standard PC or laptop found in the labs, and the other two target cloud and HPC facilities, respectively (see Table 23 for details).




Fig. 19: Three-dimensional plots of the various cantilevers provided by the manufacturer and those in the pyDAMPF database that establish a given maximum force at a given distance between the tip and the sample, for a PVA polymer subjected to RH = 60.1% with E = 248.8 [MPa].

Fig. 20: Comparison of the times taken by the parallel method and the serial method.

Fig. 21: Speed-up of the parallel method.

Fig. 22: Data used for Figs. 5, 9 and 13 with A0 = 10 [nm]. Observe that the quality factor and Young's modulus have three different values, respectively, for RH1 = 29.5%, RH2 = 39.9% and RH3 = 60.1%. ** The values presented for the quality factor Q were calculated in the Google Colaboratory notebook "Q calculation", using the method proposed by [GS05], [Sad98].

Fig. 23: Computers used to run pyDAMPF and the former work [GGG15]. * The free version of Colab provides this capability; there are two paid versions, known as Colab Pro and Colab Pro+, which provide much greater capacity but are only available in some countries.

Fig. 24: Execution times per computational thread, for each computer. Note that each thread consists of 9 simulation cases, the sum over threads giving the total of 90 cases for evaluating 3 different Young's moduli and 30 cantilevers at the same time.

    Figure 20 shows the average run time for the serial and parallel implementations. Despite a slightly higher performance for the HPC cluster nodes, a high-end computer (PC 2) may also reach similar values, which is our current goal. Another striking aspect, observed by looking at the speed-up, is the maximum and minimum run times, which clearly show the on-demand character of cloud services, as their maxima and minima show the highest variations.
    To calculate the speed-up we use the following equation:

                            S = t_total / t_thread

where S is the speed-up, t_thread is the execution time of a computational thread, and t_total is the sum of the times shown in Table 24. For our calculations we used the highest, the average and the lowest execution time per thread.
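As a minimal illustration of this bookkeeping (a sketch of our own; the timings below are placeholders, not the measured values of Table 24), the three speed-up figures can be computed from a list of per-thread execution times as follows:

# Placeholder per-thread execution times in seconds.
thread_times = [312.0, 305.5, 298.7, 310.2]

t_total = sum(thread_times)   # sum of the times over all threads
for label, t_thread in (("highest", max(thread_times)),
                        ("average", sum(thread_times) / len(thread_times)),
                        ("lowest", min(thread_times))):
    print(label, "speed-up:", t_total / t_thread)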

Limitations

The main limitation of dynamic AFM simulators based on continuum modeling is that sometimes molecular behavior is overlooked. Such a limitation comes from the multiple time and length scales behind the physics of complex systems, as is the case for polymers and biopolymers. In this regard, several efforts on the multiscale modeling of materials have been proposed, mainly joining efforts to bridge the multiscale gap [GTK+ 19]. We also plan to do so, within a current project, by modeling the polymeric fibers as molecular chains and providing "feedback" between models in a top-down strategy. Code-wise, the implementation will also be gradually improved. Nonetheless, maintaining scientific code is a challenging task, in particular without the support of our students once they finish their theses. In this respect, we will seek software funding and more community contributions.

Future work

There are several improvements that are planned for pyDAMPF.

   •   We plan to include a link to molecular dynamics simulations of polymer chains in a multiscale-like approach.
   •   We plan to use experimental values with less uncertainty to boost semi-empirical models based on pyDAMPF.
   •   The code is still not very clean and some internal cleanup is necessary. This is especially true for the Python backend, which may require a refactoring.
   •   Some AI optimization was also envisioned, particularly for optimizing criteria and comparing operational parameters.

Conclusions

In summary, pyDAMPF is a highly efficient and adaptable simulation tool aimed at analyzing, planning and interpreting dynamic AFM experiments.
    It is important to keep in mind that pyDAMPF uses the cantilever manufacturers' information to analyze, evaluate and choose a certain nanoprobe that fulfills experimental criteria. If this is not the case, it will advise the experimentalists on what to expect from their measurements and the response a material may have. We currently support multi-thread execution using an in-house development. However, in our outlook, we plan to extend the code to GPUs by using transpiling tools like compyle [Ram20], as the availability of GPUs also increases in standard workstations. In addition, we have shown how to reuse a widely tested Fortran code [GPG13] and wrap it as a Python module to profit from pythonic libraries and interactivity via Jupyter notebooks. Implementing new interaction forces for the simulator is straightforward; moreover, this code already includes the state-of-the-art contact, viscous, van der Waals, capillary and electrostatic forces used for physics at the interfaces. We also plan to soon implement semi-empirical analysis and multiscale modeling with molecular dynamics simulations.

Acknowledgments

H.V.G. thanks the Slovenian Research Agency for financial support (Funding No. P1-0055). We gratefully acknowledge the fruitful discussions with Tomas Corrales and our joint Fondecyt Regular project 1211901.

REFERENCES

[FCK+ 12] Kathrin Friedemann, Tomas Corrales, Michael Kappl, Katharina Landfester, and Daniel Crespy. Facile and large-scale fabrication of anisometric particles from fibers synthesized by colloid electrospinning. Small, 8:144–153, 2012. doi:10.1002/smll.201101247.
[Gar20] Ricardo Garcia. Nanomechanical mapping of soft materials with the atomic force microscope: methods, theory and applications. The Royal Society of Chemistry, 49:5850–5884, 2020. doi:10.1039/d0cs00318b.
[GG13] Horacio V. Guzman and Ricardo Garcia. Peak forces and lateral resolution in amplitude modulation force microscopy in liquid. Beilstein Journal of Nanotechnology, 4:852–859, 2013. doi:10.3762/bjnano.4.96.
[GGG15] Horacio V. Guzman, Pablo D. Garcia, and Ricardo Garcia. Dynamic force microscopy simulator (dForce): A tool for planning and understanding tapping and bimodal AFM experiments. Beilstein Journal of Nanotechnology, 6:369–379, 2015. doi:10.3762/bjnano.6.36.
[GPG13] Horacio V. Guzman, Alma P. Perrino, and Ricardo Garcia. Peak forces in high-resolution imaging of soft matter in liquid. ACS Nano, 7:3198–3204, 2013. doi:10.1021/nn4012835.
[GS05] Christopher P. Green and John E. Sader. Frequency response of cantilever beams immersed in viscous fluids near a solid surface with applications to the atomic force microscope. Journal of Applied Physics, 98:114913, 2005. doi:10.1063/1.2136418.
[GTK+ 19] Horacio V. Guzman, Nikita Tretyakov, Hideki Kobayashi, Aoife C. Fogarty, Karsten Kreis, Jakub Krajniak, Christoph Junghans, Kurt Kremer, and Torsten Stuehn. ESPResSo++ 2.0: Advanced methods for multiscale molecular simulation. Computer Physics Communications, 238:66–76, 2019. doi:10.1016/j.cpc.2018.12.017.
[Guz17] Horacio V. Guzman. Scaling law to determine peak forces in tapping-mode AFM experiments on finite elastic soft matter systems. Beilstein Journal of Nanotechnology, 8:968–974, 2017. doi:10.3762/bjnano.8.98.
[HLLB09] Fei Hang, Dun Lu, Shuang Wu Li, and Asa H. Barber. Stress-strain behavior of individual electrospun polymer fibers using combination AFM and SEM. Materials Research Society, 1185:1185–II07–10, 2009. doi:10.1557/PROC-1185-II07-10.
[MHR08] John Melcher, Shuiqing Hu, and Arvind Raman. VEDA: A web-based virtual environment for dynamic atomic force microscopy. Review of Scientific Instruments, 79:061301, 2008. doi:10.1063/1.2938864.
[Ram20] Prabhu Ramachandran. Compyle: a Python package for parallel computing. In Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe, editors, Proceedings of the 19th Python in Science Conference, pages 32–39, 2020. doi:10.25080/majora-342d178e-005.
[Sad98] John E. Sader. Frequency response of cantilever beams immersed in viscous fluids with applications to the atomic force microscope. Journal of Applied Physics, 84:64–76, 1998. doi:10.1063/1.368002.
[Vea20] Pauli Virtanen et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.




Improving PyDDA's atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods

Robert Jackson‡*, Rebecca Gjini§, Sri Hari Krishna Narayanan‡, Matt Menickelly, Paul Hovland‡, Jan Hückelheim‡, Scott Collis‡






* Corresponding author: rjackson@anl.gov
‡ Argonne National Laboratory, 9700 Cass Ave., Argonne, IL, 60439
§ University of California at San Diego

Copyright © 2022 Robert Jackson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Meteorologists require information about the spatiotemporal distribution of winds in thunderstorms in order to analyze how physical and dynamical processes govern thunderstorm evolution. Knowledge of such processes is vital for predicting severe and hazardous weather events. However, acquiring wind observations in thunderstorms is a non-trivial task. There are a variety of instruments that can measure winds, including radars, anemometers, and vertically pointing wind profilers. The difficulty in acquiring a three-dimensional volume of the wind field from these sensors is that they typically measure either point observations or only the component of the wind field parallel to the direction of the antenna. Therefore, in order to obtain 3D wind fields, the weather radar community uses a weak variational technique that finds a 3D wind field that minimizes a cost function J.

            J(V) = µm Jm + µo Jo + µv Jv + µb Jb + µs Js        (1)

Here, Jm is how much the wind field V violates the anelastic mass continuity equation. Jo is how much the wind field is different from the radar observations. Jv is how much the wind field violates the vertical vorticity equation. Jb is how much the wind field differs from a prescribed background. Finally, Js is related to the smoothness of the wind field, quantified as the Laplacian of the wind field. The scalars µx are weights determining the relative contribution of each cost function to the total J. The flexibility in this formulation potentially allows for factoring in the uncertainties that are inherent in the measurements. This formulation is expandable to include cost functions related to data from other sources such as weather forecast models and soundings. For more specific information on these cost functions, see [SPG09] and [PSX12].
    PyDDA is an open source Python package that implements the weak variational technique for retrieving winds. It was originally developed in order to modernize existing codes for weak variational retrievals such as CEDRIC [MF98] and Multidop [LSKJ17], as detailed in the 2019 SciPy Conference proceedings (see [JCL+ 20], [RJSCTL+ 19]). It provided a much easier to use and more portable interface for wind retrievals than was provided by these packages. In PyDDA versions 0.5 and prior, the implementation of Equation (1) uses NumPy [HMvdW+ 20] to calculate J and its gradient. In order to find the wind field V that minimizes J, PyDDA used the limited memory Broyden–Fletcher–Goldfarb–Shanno bounded (L-BFGS-B) algorithm from SciPy [VGO+ 20]. L-BFGS-B requires gradients of J in order to minimize J. Considering the antiquity of the CEDRIC and Multidop packages, these first steps provided the transition to Python that was needed in order to enhance the accessibility of wind retrieval software for the scientific community. For more information about PyDDA versions 0.5 and prior, consult [RJSCTL+ 19] and [JCL+ 20].
    However, there are further improvements that still needed to be made in order to optimize both the accuracy and speed of the PyDDA retrievals. For example, the cost functions and gradients in PyDDA 0.5 are implemented in NumPy, which does not take advantage of GPU architectures for potential speedups [HMvdW+ 20]. In addition, the gradients of the cost function that are required for the weak variational technique are hand-coded, even though packages such as Jax [BFH+ 18] and TensorFlow [AAB+ 15] can automatically calculate these gradients. These needs motivated new features for the release of PyDDA 1.0. In PyDDA 1.0, we utilize Jax and TensorFlow's automatic differentiation capabilities for differentiating J, making these calculations less prone to human error and more efficient.
    Finally, upgrading PyDDA to use Jax and TensorFlow allows it to take advantage of GPUs, increasing the speed of retrievals. This paper shows how Jax and TensorFlow are used to automatically calculate the gradient of J and improve the performance of PyDDA's wind retrievals using GPUs.
    In addition, a drawback of the weak variational technique is that it requires user-specified constants µ. This creates the possibility that winds retrieved from different datasets may not be physically consistent with each other, affecting reproducibility. Therefore, for the PyDDA 1.1 release, this paper also details a new approach that uses Augmented Lagrangian solvers in order to place strong constraints on the wind field such that it satisfies a mass continuity constraint to within a specified tolerance while minimizing the rest of the cost function. This new approach also takes advantage of the automatically calculated gradients that are implemented in PyDDA 1.0. This paper will show that this new approach eliminates the need for user-specified constants, ensuring the reproducibility of the results produced by PyDDA.

Weak variational technique

This section summarizes the weak variational technique that was implemented in PyDDA previous to version 1.0 and is currently the default option for PyDDA 1.1. PyDDA currently uses the weak variational formulation given by Equation (1). For this proceedings, we will focus our attention on the mass continuity cost function Jm and the observational cost function Jo. In PyDDA, Jm is given as the discrete volume integral of the square of the anelastic mass continuity equation

    Jm (u, v, w) = ∑_volume [ δ(ρs u)/δx + δ(ρs v)/δy + δ(ρs w)/δz ]²,        (2)

where u is the zonal component of the wind field and v is the meridional component of the wind field. ρs is the density of air, which is approximated in PyDDA as ρs(z) = e^(−z/10000), where z is the height in meters. The physical interpretation of this equation is that a column of air in the atmosphere is only allowed to compress in order to generate changes in air density in the vertical direction. Therefore, wind convergence at the surface will generate vertical air motion. A corollary of this is that divergent winds must occur in the presence of a downdraft. At the scales of winds observed by PyDDA, this is a reasonable approximation of the winds in the atmosphere.
    The cost function Jo quantifies how much the wind field is different from the winds measured by each radar. Since a scanning radar will scan a storm while pointing at an elevation angle θ and an azimuth angle φ, the wind field must first be projected to the radar's coordinates. After that, PyDDA finds the total square error between the analysis wind field and the radar observed winds as done in Equation (3).

    Jo (u, v, w) = ∑_volume (u cos θ sin φ + v cos θ cos φ + (w − wt) sin θ)²        (3)

Here, wt is the terminal velocity of the particles scanned by the radar volume. This is approximated using empirical relationships between wt and the radar reflectivity Z. PyDDA then uses the limited memory Broyden–Fletcher–Goldfarb–Shanno bounded (L-BFGS-B) algorithm (see, e.g., [LN89]) to find the u, v, and w that solve the optimization problem

    min_{u,v,w} J(u, v, w) := µm Jm (u, v, w) + µv Jv (u, v, w).        (4)

For experiments using the weak variational technique, we run the optimization until either the L∞ norm of the gradient of J is less than 10^−8 or the maximum change in u, v, and w between iterations is less than 0.01 m/s, as done by [PSX12]. Typically, the second criterion is reached first. Before PyDDA 1.0, PyDDA utilized SciPy's L-BFGS-B implementation. However, as of PyDDA 1.0 one can also use TensorFlow's L-BFGS-B implementation, which is used here for the experiments with the weak variational technique [AAB+ 15].

Using automatic differentiation

The optimization problem in Equation (4) requires the gradients of J. In PyDDA 0.5 and prior, the gradients of the cost function J were calculated by finding the closed form of the gradient by hand and then coding the closed form in Python. The code snippet below provides an example of how the cost function Jm is implemented in PyDDA using NumPy.

def calculate_mass_continuity(u, v, w, z, dx, dy, dz, coeff):
    dudx = np.gradient(u, dx, axis=2)
    dvdy = np.gradient(v, dy, axis=1)
    dwdz = np.gradient(w, dz, axis=0)

    # Divergence of the wind field at each grid point.
    div = dudx + dvdy + dwdz

    return coeff * np.sum(np.square(div)) / 2.0

In order to hand code the gradient of the cost function above, one has to write the closed form of the derivative into another function like the one below.

def calculate_mass_continuity_gradient(u, v, w, z, dx,
                                       dy, dz, coeff):
    dudx = np.gradient(u, dx, axis=2)
    dvdy = np.gradient(v, dy, axis=1)
    dwdz = np.gradient(w, dz, axis=0)
    div = dudx + dvdy + dwdz

    grad_u = -np.gradient(div, dx, axis=2) * coeff
    grad_v = -np.gradient(div, dy, axis=1) * coeff
    grad_w = -np.gradient(div, dz, axis=0) * coeff

    y = np.stack([grad_u, grad_v, grad_w], axis=0)
    return y.flatten()

Hand coding these functions can be labor intensive for complicated cost functions. In addition, there is no guarantee that there is a closed form solution for the gradient. Therefore, we tested using both Jax and TensorFlow to automatically compute the gradients of J. Computing the gradients of J using Jax can be done in two lines of code using jax.vjp:

primals, fun_vjp = jax.vjp(
    calculate_radial_vel_cost_function,
    vrs, azs, els, u, v, w, wts, rmsVr, weights,
    coeff)
_, _, _, p_x1, p_y1, p_z1, _, _, _, _ = fun_vjp(1.0)

Calculating the gradients using automatic differentiation with TensorFlow is also a simple code snippet using tf.GradientTape:

with tf.GradientTape() as tape:
    tape.watch(u)
    tape.watch(v)
    tape.watch(w)
    loss = calculate_radial_vel_cost_function(
        vrs, azs, els, u, v, w,
        wts, rmsVr, weights, coeff)

grad = tape.gradient(loss, [u, v, w])

As one can see, there is no more need to derive the closed form of the gradient of the cost function. Rather, the cost function itself is now the input to a snippet of code that automatically provides the derivative. In PyDDA 1.0, there are now three different engines that the user can specify. The classic "scipy" mode uses the NumPy-based cost function and hand-coded gradients used by versions of PyDDA previous to 1.0. In addition, there are now TensorFlow and Jax modes that use both cost functions and automatically generated gradients produced with TensorFlow or Jax.
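To illustrate how such an engine removes the hand-coded derivative entirely, the short sketch below is our own illustration rather than code from PyDDA; the grid shape, spacings, and the function name mass_continuity_cost are placeholders. It differentiates a mass-continuity-style cost with jax.grad:

import jax
import jax.numpy as jnp

def mass_continuity_cost(u, v, w, dx, dy, dz, coeff=1.0):
    # Same structure as the NumPy cost above, written with jax.numpy
    # so that JAX can trace and differentiate it.
    dudx = jnp.gradient(u, axis=2) / dx
    dvdy = jnp.gradient(v, axis=1) / dy
    dwdz = jnp.gradient(w, axis=0) / dz
    div = dudx + dvdy + dwdz
    return coeff * jnp.sum(jnp.square(div)) / 2.0

# Gradients with respect to u, v and w are generated automatically.
grad_fn = jax.grad(mass_continuity_cost, argnums=(0, 1, 2))

u = v = w = jnp.ones((8, 16, 16))      # placeholder wind components
grad_u, grad_v, grad_w = grad_fn(u, v, w, 1000.0, 1000.0, 500.0)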

Improving performance with GPU capabilities

The implementation of a TensorFlow-based engine provides PyDDA the capability to take advantage of CUDA-compatible GPUs.

Fig. 1: The time in seconds of execution of the Hurricane Florence retrieval example when using the TensorFlow and SciPy engines on an Intel Core i7 MacBook in CPU mode and on a node of Argonne National Laboratory's Lambda cluster, utilizing a single NVIDIA Tesla A100 GPU for the calculation.

   Method                    0.5 km      1 km       2.5 km    5.0 km

   SciPy Engine              ~50 days    5771.2 s   871.5 s   226.9 s
   TensorFlow Engine         7372.5 s    341.5 s    28.1 s    7.0 s
   NVIDIA Tesla A100 GPU     89.4 s      12.0 s     3.5 s     2.6 s

TABLE 1: Run times for each of the benchmarks in Figure 1.

Given that weather radar datasets can span decades, and processing each 10 minute time period of data given by the radar can take on the order of 1-2 minutes with PyDDA using regular CPU operations, if this time were reduced to seconds then processing winds from years of radar data would become tenable. Therefore, we used the TensorFlow-based PyDDA with the weak variational technique on the Hurricane Florence example in the PyDDA Documentation. On 14 September 2018, Hurricane Florence was within range of 2 radars from the NEXRAD network: KMHX stationed in Newport, NC and KLTX stationed in Wilmington, NC. In addition, the High Resolution Rapid Refresh model runs provided an additional constraint for the wind retrieval. For more information on this example, see [RJSCTL+ 19]. The analysis domain spans 400 km by 400 km horizontally, and the horizontal resolution was allowed to vary for different runs in order to compare how both the CPU and GPU-based retrievals' performance would be affected by grid resolution. The time of completion of each of these retrievals is shown in Figure 1.
    Figure 1 and Table 1 show that, in general, the retrievals took anywhere from 10 to 100 fold less time on the GPU compared to the CPU. The discrepancy in performance between the GPU and CPU-based retrievals increases as resolution decreases, demonstrating the importance of the GPU for conducting high-resolution wind retrievals. In Table 1, using a GPU to retrieve the Hurricane Florence example at 1 km resolution reduces the run time from 341 s to 12 s. Therefore, these performance improvements show that PyDDA's TensorFlow-based engine now enables it to handle spatial scales of hundreds of km at a 1 km resolution. For a day of data at this resolution, assuming five minutes between scans, an entire day of data can be processed in 57 minutes. With the use of multi-GPU clusters and selecting for cases where precipitation is present, this enables the ability to process winds from multi-year radar datasets within days instead of months.
    In addition, simply using TensorFlow's implementation of L-BFGS-B as well as the TensorFlow-calculated cost function and gradients provides a significant performance improvement compared to the original "scipy" engine in PyDDA 0.5, being up to a factor of 30 faster. In fact, running PyDDA's original "scipy" engine on the 0.5 km resolution data for the Hurricane Florence example would have likely taken 50 days to complete on an Intel Core i7-based MacBook laptop. Therefore, that particular run was not tenable to do and is therefore not shown in Figure 1. In any case, this shows that upgrading the calculations to use TensorFlow's automatically generated gradients and L-BFGS-B implementation provides a very significant speedup to the processing time.

Augmented Lagrangian method

The release of PyDDA 1.0 focused on improving its performance and gradient accuracy by using automatic differentiation for calculating the gradient. For PyDDA 1.1, the PyDDA development team focused on implementing a technique that enables the user to automatically determine the weight coefficients µ. This technique builds upon the automatic differentiation work done for PyDDA 1.0 by using the automatically generated gradients. In this work, we consider a constrained reformulation of Equation (4) that requires wind fields returned by PyDDA to (approximately) satisfy mass continuity constraints. That is, we focus on the constrained optimization problem

            min_{u,v,w}  Jv (u, v, w)
            s. to        Jm (u, v, w) = 0,                      (5)

where we now interpret Jm as a vector mapping that outputs, at each grid point in the discretized volume, δ(ρs u)/δx + δ(ρs v)/δy + δ(ρs w)/δz. Notice that the formulation in Equation (5) has no dependencies on the scalars µ.
    To solve the optimization problem in Equation (5), we implemented an augmented Lagrangian method with a filter mechanism inspired by [LV20]. An augmented Lagrangian method considers the Lagrangian associated with an equality-constrained optimization problem, in this case L0 (u, v, w, λ) = Jv (u, v, w) − λ^T Jm (u, v, w), where λ is a vector of Lagrange multipliers of the same length as the number of grid points in the discretized volume. The Lagrangian is then augmented with an additional squared-penalty term on the constraints to yield Lµ (u, v, w, λ) = L0 (u, v, w, λ) + (µ/2) ||Jm (u, v, w)||², where we have intentionally used µ > 0 as the scalar in the penalty term to make comparisons with Equation (4) transparent. It is well known (see, for instance, Theorem 17.5 of [NW06]) that under some not overly restrictive conditions there exists a finite µ̄ such that if µ ≥ µ̄, then each local solution of Equation (5) corresponds to a strict local minimizer
of Lµ (u, v, w, λ*) for a suitable choice of multipliers λ*. Essentially, augmented Lagrangian methods solve a short sequence of unconstrained problems Lµ (u, v, w, λ), with different values of µ, until a solution is returned that is a local, feasible solution to Equation (5). In our implementation of an augmented Lagrangian method, the coarse minimization of Lµ (u, v, w, λ) is performed by the SciPy implementation of L-BFGS-B with the TensorFlow implementation of the cost function and gradients. Additionally, in our implementation, we employ a filter mechanism (see a survey in [FLT06]) recently proposed for augmented Lagrangian methods in [LV20] in order to guarantee convergence. We defer details to that paper, but note that the feasibility restoration phase (the minimization of a squared constraint violation) required by such a filter method is also performed by the SciPy implementation of L-BFGS-B.
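To make the overall control flow concrete, the following is a small self-contained sketch of an augmented Lagrangian loop of this general shape on a toy quadratic problem. It is our own illustration, not PyDDA's implementation, and it omits the filter and feasibility-restoration logic described above; the cost J_v, the constraint J_m, and all tolerances are placeholders.

import numpy as np
from scipy.optimize import minimize

def J_v(x):          # toy "data-fitting" term
    return np.sum((x - 1.0) ** 2)

def J_m(x):          # toy equality constraint, feasible when J_m(x) = 0
    return np.array([x.sum()])

def augmented_lagrangian(x0, mu=1.0, outer_iters=10, tol=1e-3):
    x, lam = x0.copy(), np.zeros(J_m(x0).shape)
    for _ in range(outer_iters):
        def L_mu(x):
            c = J_m(x)
            return J_v(x) - lam @ c + 0.5 * mu * c @ c
        # Coarse minimization of the augmented Lagrangian with L-BFGS-B,
        # mirroring the role L-BFGS-B plays in PyDDA 1.1.
        x = minimize(L_mu, x, method="L-BFGS-B").x
        c = J_m(x)
        if np.max(np.abs(c)) < tol:      # feasible enough: stop
            break
        lam -= mu * c                    # first-order multiplier update
        mu *= 2.0                        # tighten the penalty
    return x

x_opt = augmented_lagrangian(np.array([3.0, -2.0]))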
     The PyDDA documentation contains an example of a
mesoscale convective system (MCS) that was sampled by a C-
band Polarization Radar (CPOL) and a Bureau of Meteorology
Australia radar on 20 Jan 2006 in Darwin, Australia. For more
details on this storm and the radar network configuration, see
[CPMW13]. For more information about the CPOL radar dataset,
see [JCL+ 18]. This example with its data is included in the
PyDDA Documentation as the "Example of retrieving and plotting
winds."
    Figure 2 shows the winds retrieved by the Augmented Lagrangian technique with µ = 1 and by the weak variational technique with µ = 1. Both techniques capture similar horizontal wind fields in this storm. However, the Augmented Lagrangian technique resolves an updraft that is not present in the wind field generated by the weak variational technique. Since there is horizontal wind convergence in this region, we expect an updraft to be present in this boxed region in order for the solution to be physically realistic. Therefore, for µ = 1, the Augmented Lagrangian technique resolves the updrafts present in the storm better than the weak variational technique does, and adjusting µ would be required for the weak variational technique to resolve the updraft.

Fig. 2: The PyDDA retrieved winds overlaid over reflectivity from the C-band Polarization Radar for the MCS that passed over Darwin, Australia on 20 Jan 2006. The winds were retrieved using the weak variational technique with µ = 1 (a) and the Augmented Lagrangian technique with µ = 1 (b). The contours represent vertical velocities at 3.05 km altitude. The boxed region shows the updrafts that generated the heavy precipitation.
    We solve the unconstrained formulation (4) using the implementation of L-BFGS-B currently employed in PyDDA; we fix the value µv = 1 and vary µm = 2^j for j = 0, 1, 2, ..., 16. We also solve the constrained formulation (5) using our implementation of a filter Augmented Lagrangian method, and instead vary the initial guess of the penalty parameter µ = 2^j for j = 0, 1, 2, ..., 16. For the initial state, we use the wind profile from the weather balloon launch at 00 UTC 20 Jan 2006 from Darwin and apply it to the whole analysis domain. A summary of results is shown in Figures 3 and 4. We applied a maximum constraint violation tolerance of 10⁻³ to the filter Augmented Lagrangian method. This tolerance assumes that the winds do not violate the mass continuity constraint by more than 0.001 m² s⁻². Notice that such a tolerance is impossible to supply to the weak variational method, highlighting the key advantage of employing a constrained method. Notice that in this example, only 5 settings of µm lead to sufficiently feasible solutions returned by the variational technique.
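    The difference in how a feasibility tolerance enters the two formulations can be illustrated on a toy two-variable problem; the quadratic Jv, the single linear constraint Jm, and the use of SciPy's trust-constr solver below are stand-ins chosen for illustration only and are unrelated to PyDDA's retrieval code.

import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# Toy stand-ins for the data-fitting and mass continuity terms
# (two unknowns, one constraint), chosen only to illustrate the sweep.
Jv = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
Jm = lambda x: np.array([x[0] + x[1] - 1.0])

x0 = np.zeros(2)

# Weak variational analogue: the constraint only enters as a weighted
# penalty, so the final violation depends on the weight mu_m chosen by
# the user and cannot be requested directly.
for j in range(0, 17, 4):
    mu_m = 2.0 ** j
    cost = lambda x, w=mu_m: Jv(x) + 0.5 * w * float(Jm(x) @ Jm(x))
    x = minimize(cost, x0, method="L-BFGS-B").x
    print(f"mu_m = 2^{j:2d}: max violation = {np.abs(Jm(x)).max():.2e}")

# Constrained analogue: the violation is driven toward zero by the
# solver itself, here trust-constr with an equality constraint.
con = NonlinearConstraint(lambda x: Jm(x), 0.0, 0.0)
res = minimize(Jv, x0, method="trust-constr", constraints=[con])
print("constrained solve violation:", np.abs(Jm(res.x)).max())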
    Finally, a variable of interest to atmospheric scientists for winds inside MCSes is the vertical wind velocity. It provides a measure of the intensity of the storm by demonstrating the amount of upscale growth contributing to intensification. Figure 5 shows the mean updraft velocities inside the box in Figure 2 as a function of height for each of the runs of the TensorFlow L-BFGS-B and Augmented Lagrangian techniques.

Fig. 3: The x-axis shows, on a logarithmic scale, the maximum constraint violation in units of divergence of the wind field, and the y-axis shows the value of the data-fitting term Jv at the optimal solution. The legend lists the number of function/gradient calls made by the filter Augmented Lagrangian method, which is the dominant cost of both approaches. The dashed line at 10⁻³ denotes the tolerance on the maximum constraint violation that was supplied to the filter Augmented Lagrangian method.

Fig. 4: As Fig. 3, but for the weak variational technique that uses L-BFGS-B.

Fig. 5: The mean updraft velocity obtained by (left) the weak variational and (right) the Augmented Lagrangian technique inside the updrafts in the boxed region of Figure 2. Each line represents a different value of µ for the given technique.
Table 2 summarizes the mean and spread of the solutions in Figure 5. For the updraft velocities produced by the Augmented Lagrangian technique, there is a 1 m/s spread of velocities for given values of µ at altitudes below 7.5 km in Table 2. At an altitude of 10 km, this spread is 1.9 m/s. This is likely due to the reduced spatial coverage of the radars at higher altitudes. However, for the weak variational technique, the sensitivity of the retrieval to µ is much more pronounced, with up to 2.8 m/s differences between retrievals. The Augmented Lagrangian technique therefore makes the vertical velocities less sensitive to µ, which shows that it will result in more reproducible wind fields from radar wind networks, since it is less sensitive to user-defined parameters than the weak variational technique. However, a limitation of this technique is that, for now, it is limited to two radars and to the mass continuity and vertical vorticity constraints.

                        Min     Mean    Max     Std. Dev.
    Weak variational
    2.5 km              1.2     1.8     2.7     0.6
    5 km                2.2     2.9     4.0     0.7
    7.5 km              3.2     3.9     5.0     0.4
    10 km               2.3     3.3     4.9     1.0
    Aug. Lagrangian
    2.5 km              1.8     2.8     3.3     0.5
    5 km                3.1     3.3     3.5     0.1
    7.5 km              3.2     3.5     3.9     0.1
    10 km               3.0     4.3     4.9     0.5

TABLE 2: Minimum, mean, maximum, and standard deviation of w (m/s) for select levels in Figure 5.

Concluding remarks

Atmospheric wind retrievals are vital for forecasting severe weather events. This motivated us to develop PyDDA, an open source package for atmospheric wind retrievals.
In the original releases of PyDDA (versions 0.5 and prior), the original goal of PyDDA was to convert legacy wind retrieval packages such as CEDRIC and Multidop into a fully Pythonic, open source form accessible to the scientific community. However, there remained many improvements to be made to PyDDA to optimize the speed of the retrievals and to make it easier to add constraints.
    This motivated two major changes to PyDDA's wind retrieval routine for PyDDA 1.0. The first was to simplify the wind retrieval process by automating the calculation of the gradient of the cost function used for the weak variational technique. To do this, we utilized the automatic differentiation capabilities of Jax and TensorFlow. This also allows PyDDA to take advantage of GPU resources, significantly speeding up retrieval times for mesoscale retrievals at kilometer-scale resolution. In addition, running the TensorFlow-based version of PyDDA provided significant performance improvements even when using a CPU.
    These automatically generated gradients were then used to implement an Augmented Lagrangian technique in PyDDA 1.1 that allows for automatically determining the weights for each cost function in the retrieval. The Augmented Lagrangian technique guarantees convergence to a physically realistic solution, something that is not always the case for a given set of weights in the weak variational technique. Therefore, this both creates more reproducible wind retrievals and simplifies the process of retrieving winds for the non-specialist user. However, since the Augmented Lagrangian technique currently only supports ingesting radar data into the retrieval, plans for PyDDA 1.2 and beyond include expanding it to support multiple data sources such as models and rawinsondes.

Code Availability

PyDDA is available for public use, with documentation and examples, at https://openradarscience.org/PyDDA. The GitHub repository that hosts PyDDA's source code is available at https://github.com/openradar/PyDDA.

Acknowledgments

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ('Argonne'). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This material is also based upon work funded by program development funds from the Mathematics and Computer Science and Environmental Science departments at Argonne National Laboratory.

References

[AAB+15]     Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. URL: https://www.tensorflow.org/.
[BFH+18]     James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL: http://github.com/google/jax.
[CPMW13]     Scott Collis, Alain Protat, Peter T. May, and Christopher Williams. Statistics of storm updraft velocities from TWP-ICE including verification with profiling measurements. Journal of Applied Meteorology and Climatology, 52(8):1909–1922, 2013. doi:10.1175/JAMC-D-12-0230.1.
[FLT06]      Roger Fletcher, Sven Leyffer, and Philippe Toint. A brief history of filter methods. Technical report, Argonne National Laboratory, 2006. URL: http://www.optimization-online.org/DB_FILE/2006/10/1489.pdf.
[HMvdW+20]   Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[JCL+18]     R. C. Jackson, S. M. Collis, V. Louf, A. Protat, and L. Majewski. A 17 year climatology of the macrophysical properties of convection in Darwin. Atmospheric Chemistry and Physics, 18(23):17687–17704, 2018. doi:10.5194/acp-18-17687-2018.
[JCL+20]     Robert Jackson, Scott Collis, Timothy Lang, Corey Potvin, and Todd Munson. PyDDA: A Pythonic direct data assimilation framework for wind retrievals. Journal of Open Research Software, 8(1):20, 2020. doi:10.5334/jors.264.
[LN89]       Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989. doi:10.1007/bf01589116.
[LSKJ17]     Timothy Lang, Mario Souto, Shahin Khobahi, and Bobby Jackson. nasa/multidop: Multidop v0.3, October 2017. doi:10.5281/zenodo.1035904.
[LV20]       Sven Leyffer and Charlie Vanaret. An augmented Lagrangian filter method. Mathematical Methods of Operations Research, 92(2):343–376, 2020. doi:10.1007/s00186-020-00713-x.
[MF98]       L. Jay Miller and Sherri M. Fredrick. Custom editing and
             display of reduced information in cartesian space. Technical
             report, National Center for Atmospheric Research, 1998.
[NW06]       Jorge Nocedal and Stephen J. Wright. Numerical Optimization.
             Springer, New York, NY, USA, second edition, 2006.
[PSX12]      Corey K. Potvin, Alan Shapiro, and Ming Xue. Impact of
             a vertical vorticity constraint in variational dual-doppler wind
             analysis: Tests with real and simulated supercell data. Journal
             of Atmospheric and Oceanic Technology, 29(1):32 – 49, 2012.
             doi:10.1175/JTECH-D-11-00019.1.
[RJSCTL+ 19] Robert Jackson, Scott Collis, Timothy Lang, Corey Potvin,
             and Todd Munson. PyDDA: A new Pythonic Wind Re-
             trieval Package. In Chris Calloway, David Lippa, Dillon
             Niederhut, and David Shupe, editors, Proceedings of the
             18th Python in Science Conference, pages 111 – 117, 2019.
             doi:10.25080/Majora-7ddc1dd1-010.
[SPG09]      Alan Shapiro, Corey K. Potvin, and Jidong Gao. Use of a verti-
             cal vorticity equation in variational dual-doppler wind analysis.
             Journal of Atmospheric and Oceanic Technology, 26(10):2089
             – 2106, 2009. doi:10.1175/2009JTECHA1256.1.
[VGO+ 20]    Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt
             Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski,
             Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté-
             fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar-
             rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric
             Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat,
             Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde,
             Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin-
             tero, Charles R. Harris, Anne M. Archibald, Antônio H.
             Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy
             1.0 Contributors. SciPy 1.0: Fundamental Algorithms for
             Scientific Computing in Python. Nature Methods, 17:261–272,
             2020. doi:10.1038/s41592-019-0686-2.
RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible

João Lemes Gribel Soares‡*, Mateus Stano Junqueira‡, Oscar Mauricio Prada Ramirez‡, Patrick Sampaio dos Santos Brandão‡§, Adriano Augusto Antongiovanni‡, Guilherme Fernandes Alves‡, Giovani Hidalgo Ceotto‡
Abstract—In recent years we are seeing exponential growth in the space sector, with new companies emerging in it. On top of that, more people are becoming fascinated to participate in the aerospace revolution, which motivates students and hobbyists to build more High Powered and Sounding Rockets. However, rocketry is still a very inaccessible field, with a high knowledge entry barrier and very specific terminology. To make it more accessible, people need an active community with flexible, easy-to-use, and well-documented tools. RocketPy is a software solution created to address all those issues, solving the trajectory simulation of High-Power rockets while being built on top of SciPy and the Python scientific environment. The code allows for a sophisticated 6 degrees of freedom simulation of a rocket's flight trajectory, including high-fidelity variable mass effects as well as descent under parachutes. All of this is packaged into an architecture that facilitates complex simulations, such as multi-stage rockets, design and trajectory optimization, and dispersion analysis. In this work, the flexibility and usability of RocketPy are indicated in three example simulations: a basic trajectory simulation, a dynamic stability analysis, and a Monte Carlo dispersion simulation. The code structure and the main implemented methods are also presented.

Index Terms—rocketry, flight, rocket trajectory, flexibility, Monte Carlo analysis

* Corresponding author: jgribel@usp.br
‡ Escola Politécnica of the University of São Paulo
§ École Centrale de Nantes

Copyright © 2022 João Lemes Gribel Soares et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

When it comes to rockets, there is a wide field ranging from orbital rockets to model rockets. Between them, two types of rockets are relevant to this work: sounding rockets and High-Powered Rockets (HPRs). Sounding rockets are mainly used by government agencies for scientific experiments in suborbital flights, while HPRs are generally used for educational purposes, with increasing popularity in university competitions such as the annual Spaceport America Cup, which hosts more than 100 rocket design teams from all over the world. After the university-built rocket TRAVELER IV [AEH+19] successfully reached space by crossing the Kármán line in 2019, both sounding rockets and HPRs can now be seen as two converging categories in terms of overall flight trajectory.
    HPRs are becoming bigger and more robust, increasing their potential hazard along with their capacity, making safety an important issue. Moreover, performance is always a requirement, both for saving financial and time resources and for efficiently meeting launch performance goals.
    In this scenario, crucial parameters should be determined before a safe launch can be performed. Examples include calculating, with high accuracy and certainty, the most likely impact or landing region. This information greatly increases range safety and the possibility of recovering the rocket [Wil18]. As another example, it is important to determine the altitude of the rocket's apogee in order to avoid collisions with other aircraft and prevent airspace violations.
    To better attend to those issues, RocketPy was created as a computational tool that can accurately predict all dynamic parameters involved in the flight of sounding, model, and High-Powered Rockets, given parameters such as the rocket geometry, motor characteristics, and environmental conditions. It is an open source project, well structured and documented, allowing collaborators to contribute new features with minimum effort regarding legacy code modification [CSA+21].

Background

Rocketry terminology

To better understand the current work, some specific terms regarding the rocketry field are stated below:

    •   Apogee: the point at which a body is furthest from Earth
    •   Degrees of freedom: the maximum number of independent values in an equation
    •   Flight Trajectory: the 3-dimensional path, over time, of the rocket during its flight
    •   Launch Rail: guidance for the rocket to accelerate to a stable flight speed
    •   Powered Flight: the phase of the flight where the motor is active
    •   Free Flight: the phase of the flight where the motor is inactive and nothing but its inertia is influencing the rocket's trajectory
    •   Standard Atmosphere: average pressure, temperature, and air density for various altitudes
    •   Nozzle: the part of the rocket's engine that accelerates the exhaust gases
    •   Static hot-fire test: a test to measure the integrity of the motor and determine its thrust curve
    •   Thrust Curve: the evolution of the thrust force generated by a motor
    •   Static Margin: a non-dimensional distance used to analyze the stability of the rocket
    •   Nosecone: the forward-most section of a rocket, shaped for aerodynamics
    •   Fin: a flattened appendage of the rocket providing stability during flight, keeping it on its flight trajectory

Flight Model

The flight model of a high-powered rocket takes into account at least three different phases:
    1. The first phase consists of a linear movement along the launch rail: the motion of the rocket is restricted to one dimension, which means that only the translation along the rail needs to be modeled. During this phase, four forces can act on the rocket: weight, engine thrust, rail reactions, and aerodynamic forces.
    2. After completely leaving the rail, a phase with 6 degrees of freedom (DOF) is established, which includes powered flight and free flight: the rocket is free to move in three-dimensional space, and weight, engine thrust, and normal and axial aerodynamic forces are still important.
    3. Once apogee is reached, a parachute is usually deployed, characterizing the third phase of flight: the parachute descent. In this last phase, the parachute is launched from the rocket, which is usually divided into two or more parts joined by ropes. This phase ends at the point of impact.

Design: RocketPy Architecture

Four main classes organize the dataflow during the simulations: Motor, Rocket, Environment, and Flight [CSA+21]. Furthermore, there is also a helper class named Function, which will be described further below. In the Motor class, the main physical and geometric parameters of the motor are configured, such as nozzle geometry, grain parameters, mass, inertia, and thrust curve. This first class acts as an input to the Rocket class, where the user is also asked to define certain parameters of the rocket such as the inertial mass tensor, geometry, drag coefficients, and parachute description. Finally, the Flight class joins the rocket and motor parameters with information from another class called Environment, such as wind, atmospheric, and Earth models, to generate a simulation of the rocket's trajectory. This modular architecture, along with its well-structured and documented code, facilitates complex simulations, starting with the use of Jupyter Notebooks that people can adapt for their specific use case. Fig. 1 illustrates the RocketPy architecture.

Fig. 1: RocketPy classes interaction [CSA+21]

Function

Variable interpolation meshes/grids from different sources can lead to problems when coupling different data types. To solve this, RocketPy employs a dedicated Function class which allows for more natural and dynamic handling of these objects, structuring them as Rⁿ → R mathematical functions.
    Through the use of its methods, this approach allows for quick and easy arithmetic operations between lambda expressions and list-defined interpolated functions, as well as scalars. Different interpolation methods are available to choose from, among them simple polynomial, spline, and Akima ([Aki70]). Extrapolation of Function objects outside the domain constrained by a given dataset is also allowed.
    Furthermore, evaluation of definite integrals of these Function objects is among their feature set. By exploiting the chosen interpolation option, RocketPy calculates the values quickly and precisely through the use of different analytical methods. If numerical integration is required, the class makes use of SciPy's implementation of the QUADPACK Fortran library [PdDKÜK83]. For 1-dimensional Functions, evaluation of derivatives at a point is made possible through a simple finite difference method.
    Finally, to increase usability and readability, all Function object instances are callable and can be presented in multiple ways depending on the given arguments. If no argument is given, a Matplotlib figure opens and the plot of the function is shown inside its domain. Only 2-dimensional and 3-dimensional functions can be plotted. This is especially useful for the post-processing methods, where various pieces of information on the classes responsible for the definition of the rocket and its flight are presented, providing for more concise code. If an n-sized array is passed instead, RocketPy will try to evaluate the value of the Function at the given point using different methods, returning its value. If another Function object is passed, the class will try to match their respective domain and co-domain in order to return a third instance representing the composition of functions, in the likes of h(x) = (g ∘ f)(x) = g(f(x)). With different Function objects defined, the comparePlots method can be used to plot different functions on a single graph. An example of the usage of the Function class can be found in the Examples section.
    By imitating, in syntax, commonly used mathematical notation, RocketPy allows for more understandable and human-readable code, especially in the implementation of the more extensive and cluttered rocket equations of motion.
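    As a small illustration of the behaviour described above, a Function can be built from either a callable or a set of data points and then combined with other Functions. This is a minimal sketch based on this section and RocketPy's documented usage; the interpolation keyword and the exact semantics of composition are assumptions, not a definitive API reference.

from rocketpy import Function

# A Function wrapping a lambda expression
f = Function(lambda x: x ** 2, inputs="x", outputs="f(x)")

# A Function interpolating a list of data points (the interpolation
# keyword is assumed to select the spline method described above)
g = Function(
    [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)],
    inputs="x",
    outputs="g(x)",
    interpolation="spline",
)

# Instances are callable at a point...
print(f(3), g(2.5))

# ...support arithmetic with scalars and with each other...
h = f + g * 2

# ...and, when called with another Function, are assumed to return the
# composition f(g(x)) as a new Function object
composed = f(g)

# Calling an instance with no arguments, e.g. h(), would open a
# Matplotlib plot of the Function over its domain.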
Environment

The Environment class reads, processes, and stores all the information regarding wind and atmospheric model data. It receives as inputs the launch point coordinates, as well as the length of the launch rail, and then provides the Flight class with six profiles as a function of altitude: wind speed in the east and north directions, atmospheric pressure, air density, dynamic viscosity, and speed of sound. For instance, an Environment object can be set as representing New Mexico, United States:

from rocketpy import Environment

ex_env = Environment(
    railLength=5.2,
    latitude=32.990254,
    longitude=-106.974998,
    elevation=1400
)

RocketPy requires datetime library information specifying the year, month, day, and hour to compute the weather conditions on the specified day of launch. An optional argument, the timezone, may also be specified. If the user prefers to omit it, RocketPy will assume the datetime object is given in standard UTC time, as follows:

import datetime

tomorrow = (
    datetime.date.today() +
    datetime.timedelta(days=1)
)

date_info = (
    tomorrow.year,
    tomorrow.month,
    tomorrow.day,
    12
)  # Hour given in UTC time

By default, the International Standard Atmosphere [ISO75] static atmospheric model is loaded. However, it is easy to set other models by importing data from different meteorological agencies' public datasets, such as Wyoming Upper-Air Soundings and the European Centre for Medium-Range Weather Forecasts (ECMWF), or to set a customized atmospheric model based on user-defined functions. As RocketPy supports integration with different meteorological agencies' datasets, it allows for a sophisticated definition of weather conditions, including forecasts and historical reanalysis scenarios.
    In this case, NOAA's RUC Soundings data model is used, a worldwide and open-source meteorological model made available online. The file name is set as GFS, indicating the use of the Global Forecast System provided by NOAA, which features a forecast on a quarter-degree, equally spaced longitude/latitude grid with a temporal resolution of three hours.

ex_env.setAtmosphericModel(
    type='Forecast',
    file='GFS')
ex_env.info()

What is happening on the back end of this code snippet is RocketPy utilizing the OPeNDAP protocol to retrieve data arrays from NOAA's server. It parses them using the netCDF4 data management system, allowing for the retrieval of pressure, temperature, wind velocity, and surface elevation data as a function of altitude. The Environment class then computes the following parameters: wind speed, wind heading, speed of sound, air density, and dynamic viscosity. Finally, plots of the evaluated parameters as a function of altitude are all passed on to the mission analyst by calling the Env.info() method.

Motor

RocketPy is flexible enough to work with most types of motors used in sounding rockets. The main function of the Motor class is to provide the thrust curve, the propulsive mass, the inertia tensor, and the position of the motor's center of mass as a function of time. Geometric parameters regarding the propellant grains and the motor's nozzle must be provided, as well as a thrust curve as a function of time. The latter is preferably obtained empirically from a static hot-fire test; however, many of the curves for commercial motors are freely available online [Cok98].
    Alternatively, for homemade motors, there is a wide range of open-source internal ballistics simulators, such as OpenMotor [Rei22], that can predict the produced thrust with high accuracy for a given sizing and propellant combination. There are different types of rocket motors: solid motors, liquid motors, and hybrid motors. Currently, a robust SolidMotor class has been fully implemented and tested. For example, a typical solid motor can be created as an object in the following way:

from rocketpy import SolidMotor

ex_motor = SolidMotor(
    thrustSource='Motor_file.eng',
    burnOut=2,
    reshapeThrustCurve=False,
    grainNumber=5,
    grainSeparation=3/1000,
    grainOuterRadius=33/1000,
    grainInitialInnerRadius=15/1000,
    grainInitialHeight=120/1000,
    grainDensity=1782.51,
    nozzleRadius=49.5/2000,
    throatRadius=21.5/2000,
    interpolationMethod='linear')

Rocket

The Rocket class is responsible for creating and defining the rocket's core characteristics. Mostly composed of physical attributes, such as mass and moments of inertia, the rocket object is responsible for storing and calculating mechanical parameters.
    A rocket object can be defined with the following code:

from rocketpy import Rocket

ex_rocket = Rocket(
    motor=ex_motor,
    radius=127 / 2000,
    mass=19.197 - 2.956,
    inertiaI=6.60,
    inertiaZ=0.0351,
    distanceRocketNozzle=-1.255,
    distanceRocketPropellant=-0.85704,
    powerOffDrag="data/rocket/powerOffDragCurve.csv",
    powerOnDrag="data/rocket/powerOnDragCurve.csv",
)

As stated in the Design: RocketPy Architecture section, a fundamental input of the rocket is its motor, an object of the Motor class that must be previously defined. Some inputs are fairly simple and can be easily obtained from a CAD model of the rocket, such as the radius, mass, and moments of inertia about two different axes. The distance inputs are relative to the center of mass and define the positions of the motor nozzle and of the center of mass of the motor propellant. The powerOffDrag and powerOnDrag inputs receive .csv data representing the drag coefficient as a function of rocket speed for the cases where the motor is off and where the motor is still burning, respectively.
    At this point, the simulation would run a rocket consisting of a tube of a certain diameter, with its center of mass specified and a motor at its end. For a better simulation, a few more important aspects, called aerodynamic surfaces, should then be defined. Three of them are accepted in the code: the nosecone, fins, and tail. They can simply be added via the following methods:

nose_cone = ex_rocket.addNose(
    length=0.55829, kind="vonKarman",
    distanceToCM=0.71971
)
fin_set = ex_rocket.addFins(
    4, span=0.100, rootChord=0.120, tipChord=0.040,
    distanceToCM=-1.04956
)
tail = ex_rocket.addTail(
    topRadius=0.0635, bottomRadius=0.0435,
    length=0.06, distanceToCM=-1.194656
)
All these methods receive defining geometrical parameters and their distance to the rocket's center of mass (distanceToCM) as inputs. Each of these surfaces generates, during the flight, a lift force that can be calculated via a lift coefficient, which is computed from geometrical properties, as shown in [Bar67]. Further on, these coefficients are used to calculate the center of pressure and subsequently the static margin. In each of these methods, the static margin is reevaluated.
    Finally, parachutes can be added in a similar manner to the aerodynamic surfaces. However, a few inputs regarding the electronics involved in the activation of the parachute are required. The most interesting of them are the trigger and samplingRate inputs, which are used to define the parachute's activation. The trigger is a function that returns a boolean value signifying when the parachute should be activated, and the samplingRate is the time interval at which the trigger is evaluated during the simulation time steps.

def parachute_trigger(p, y):
    # y is the flight state vector; y[2] is the altitude and y[5] is
    # the vertical velocity, p is the freestream pressure
    vel_z = y[5]
    height = y[2]
    # deploy while descending below 800 m
    return vel_z < 0 and height < 800

ex_parachute = ex_rocket.addParachute(
    'ParachuteName',
    CdS=10.0,
    trigger=parachute_trigger,
    samplingRate=105,
    lag=1.5,
    noise=(0, 8.3, 0.5)
)

With the rocket fully defined, the Rocket.info() and Rocket.allInfo() methods can be called, giving us information and plots of the calculations performed in the class. One of the most relevant outputs of the Rocket class is the static margin, as it is important for rocket stability and makes several analyses possible. It is visualized through the time plot in Fig. 2, which shows the variation of the static margin as the motor burns its propellant.

Fig. 2: Static Margin

Flight

The Flight class is responsible for the integration of the rocket's equations of motion over time [CSA+21]. Data from instances of the Rocket class and the Environment class are used as input to initialize it, along with parameters such as launch heading and inclination relative to the Earth's surface:

from rocketpy import Flight

ex_flight = Flight(
    rocket=ex_rocket,
    environment=ex_env,
    inclination=85,
    heading=0
)

Once the simulation is initialized, run, and completed, the instance of the Flight class stores relevant raw data. The Flight.postProcess() method can then be used to compute secondary parameters such as the rocket's Mach number during flight and its angle of attack.
    To perform the numerical integration of the equations of motion, the Flight class uses the LSODA solver [Pet83] implemented in SciPy's scipy.integrate module [VGO+20]. Usually, well-designed rockets result in non-stiff equations of motion. However, during flight, rockets may become unstable due to variations in their inertial and aerodynamic properties, which can result in a stiff system. LSODA switches automatically between the nonstiff Adams method and the stiff BDF method, depending on the detected stiffness, and therefore handles both cases well.
    Since a rocket's flight trajectory is composed of multiple phases, each with its own set of governing equations, RocketPy employs a couple of clever methods to run the numerical integration. The Flight class uses a FlightPhases container to hold each FlightPhase. The FlightPhases container orchestrates the different FlightPhase instances and composes them during the flight.
    This is crucial because there are events that may or may not happen during the simulation, such as the triggering of a parachute ejection system (which may or may not fail) or the activation of a premature flight termination event. There are also events, such as the departure from the launch rail or the apogee, that are known to occur but whose timestamp is unknown until the simulation is run. All of these events can trigger new flight phases, characterized by a change in the rocket's equations of motion. Furthermore, such events can happen close to each other and provoke delayed phases.
    To handle this, the Flight class has a mechanism for creating new phases and adding them dynamically, in the appropriate order, to the FlightPhases container.
    The constructor of the FlightPhase class takes the following arguments:

    •   t: a timestamp that symbolizes at which instant such flight phase should begin;
    •   derivative: a function that returns the time derivatives of the rocket's state vector (i.e., calculates the equations of motion for this flight phase);
    •   callbacks: a list of callback functions to be run when the flight phase begins (which can be useful if some parameters of the rocket need to be modified before the flight phase begins).

    The constructor of the Flight class initializes the FlightPhases container with a rail phase and also a dummy max time phase which marks the maximum flight duration. Then, it loops through the elements of the container.
    Inside the loop, an important attribute of the current flight phase is set: FlightPhase.timeBound, the maximum
timestamp of the flight phase, which is always equal to the initial timestamp of the next flight phase. Ordinarily, it would be possible to run the LSODA solver from FlightPhase.t to FlightPhase.timeBound. However, this is not an option, because the events which can trigger new flight phases need to be checked throughout the simulation. While scipy.integrate.solve_ivp does offer the events argument to aid in this, it is not possible to use it with most of the events that need to be tracked, since they cannot be expressed in the necessary form.
    As an example, consider the very common event of a parachute ejection system. To simulate real-time algorithms, the necessary inputs to the ejection algorithm need to be supplied at regular intervals to match the desired sampling rate. Furthermore, the ejection algorithm cannot be called multiple times without real data, since it generally stores all the inputs it gets in order to decide whether the rocket has reached apogee and the parachute release mechanism should be triggered. Discrete controllers can present the same peculiar properties.
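    To picture the mechanism, the following generic sketch steps SciPy's LSODA integrator between consecutive sampling times of a toy one-dimensional flight and feeds a trigger function at every stop. It is an illustration of the idea only, not RocketPy's FlightPhases implementation; the names derivatives and sampled_trigger and the toy dynamics are invented for this example.

import numpy as np
from scipy.integrate import solve_ivp

# Toy vertical dynamics after burnout: state = [altitude, vertical velocity]
def derivatives(t, state):
    g = 9.80665
    return [state[1], -g]

def sampled_trigger(t, state):
    # Deploy once the rocket is descending below 800 m
    return state[1] < 0 and state[0] < 800

state = np.array([0.0, 120.0])   # leaving the rail at 120 m/s, straight up
t, dt = 0.0, 0.1                 # trigger sampled at 10 Hz
deployed_at = None

while t < 60.0 and deployed_at is None:
    # Step the LSODA solver only up to the next sampling time so the
    # trigger is fed with data at the desired rate; solve_ivp's own
    # `events` argument cannot express this kind of stateful, sampled check.
    sol = solve_ivp(derivatives, (t, t + dt), state, method="LSODA")
    t, state = sol.t[-1], sol.y[:, -1]
    if sampled_trigger(t, state):
        deployed_at = t          # a descent-under-parachute phase would start here

print(f"parachute deployment detected at t = {deployed_at:.1f} s")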
    To handle this, each instance of the FlightPhase class holds a TimeNodes container, which stores all the required time steps, or TimeNodes, that the integration algorithm should stop at so that the events can be checked, usually by feeding the necessary data to parachute and discrete control trigger functions. When it comes to discrete controllers, they may change some parameters in the rocket once they are called. On the other hand, parachute triggers rarely actually trigger, and thus rarely invoke the creation of a new flight phase characterized by descent under parachute governing equations of motion.
    The Flight class can take advantage of this fact by employing overshootable time nodes: time nodes at which the integrator does not need to stop. This allows the integration algorithm to use more optimized time steps and significantly reduces the number of iterations needed to perform a simulation. Once a new time step is taken, the Flight class checks all overshootable time nodes that have passed and feeds their event triggers with interpolated data. When an event is triggered, the simulation is rolled back to that state.
    In summary, throughout a simulation, the Flight class loops through each non-overshootable TimeNode of each element of the FlightPhases container. At each TimeNode, the event triggers are fed with the necessary input data. Once an event is triggered, a new FlightPhase is created and added to the main container. These loops continue until the simulation is completed, either by reaching the maximum flight duration or by reaching a terminal event, such as ground impact.
    Once the simulation is completed, raw data can already be accessed. To compute secondary parameters, Flight.postProcess() is used. It takes advantage of the fact that the FlightPhases container keeps all relevant flight information to essentially retrace the trajectory and capture more information about the flight.
    Once secondary parameters are computed, the Flight.allInfo method can be used to show and plot all the relevant information, as illustrated in Fig. 3.

Fig. 3: 3D flight trajectory, an output of the Flight.allInfo method

The adaptability of the Code and Accessibility

RocketPy's development started in 2017, and since the beginning, certain requirements were kept in mind:

    •   Execution times should be fast. There is a high interest in performing sensitivity analysis, optimization studies, and Monte Carlo simulations, which require a large number of simulations to be performed (10,000 to 100,000).
    •   The code structure should be flexible. This is important due to the diversity of possible scenarios that exist in a rocket design context. Each user will have their own simulation requirements and should be able to modify and adapt new features to meet their needs. For this reason, the code was designed in a fashion such that each major component is separated into self-encapsulated classes, responsible for a single functionality. This tenet follows the concepts of the so-called Single Responsibility Principle (SRP) [MNK03].
    •   Finally, the software should aim to be accessible. The source code was openly published on GitHub (https://github.com/Projeto-Jupiter/RocketPy), where the community started to be built and a group of developers, known as the RocketPy Team, are currently assigned as dedicated maintainers. The job involves not only helping to improve the code, but also working towards building a healthy ecosystem of Python, rocketry, and scientific computing enthusiasts alike, thus facilitating access to high-quality simulation without a great level of specialization.

    The following examples demonstrate how RocketPy can be a useful tool during the design and operation of a rocket model, enabling functionalities not previously available in other simulation software.

Examples

Using RocketPy for Rocket Design

    1) Apogee by Mass using a Function helper class

    Because of performance and safety reasons, apogee is one of the most important results in rocketry competitions, and it is highly valuable for teams to understand how different rocket parameters can change it. Since a direct relation is not available for this kind of computation, the ability to run simulations quickly is used to evaluate how the apogee is affected by the mass of the rocket. This kind of function is heavily used during the early phases of a rocket's design.
222                                                                                       PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022)

An example of how this could be achieved:

from rocketpy import Environment, SolidMotor, Rocket, Flight, Function

def apogee(mass):
    # Prepare Environment
    ex_env = Environment(...)

    ex_env.setAtmosphericModel(
        type="CustomAtmosphere",
        wind_v=-5
    )

    # Prepare Motor
    ex_motor = SolidMotor(...)

    # Prepare Rocket
    ex_rocket = Rocket(
        ...,
        mass=mass,
        ...
    )

    ex_rocket.setRailButtons([0.2, -0.5])
    nose_cone = ex_rocket.addNose(...)
    fin_set = ex_rocket.addFins(...)
    tail = ex_rocket.addTail(...)

    # Simulate Flight until Apogee
    ex_flight = Flight(...)
    return ex_flight.apogee

apogee_by_mass = Function(
    apogee, inputs="Mass (kg)",
    outputs="Estimated Apogee (m)"
)
apogee_by_mass.plot(8, 20, 20)

The possibility of generating this relation between mass and apogee as a graph shows the flexibility of RocketPy and also the importance of the simulation being designed to run fast.

   2) Dynamic Stability Analysis

In this analysis the integration of three different RocketPy classes will be explored: Function, Rocket, and Flight. The motivation is to investigate how static stability translates into dynamic stability, i.e. how different static margins result in different dynamic behavior, which also depends on the rocket's rotational inertia.
    We can assume the objects stated in the [motor] and [rocket] sections and just add a couple of variations on some input data to visualize the output effects. More specifically, the idea is to explore how the dynamic stability of the studied rocket varies when the position of the set of fins is changed by a certain factor.
    To do that, we have to simulate multiple flights with different static margins, which is achieved by varying the rocket's fin positions. This can be done through a simple Python loop, as described below:

simulation_results = []
for factor in [0.5, 0.7, 0.9, 1.1, 1.3]:
    # remove previous fin set
    ex_rocket.aerodynamicSurfaces.remove(fin_set)
    fin_set = ex_rocket.addFins(
        4, span=0.1, rootChord=0.120, tipChord=0.040,
        distanceToCM=-1.04956 * factor
    )
    ex_flight = Flight(
        rocket=ex_rocket,
        environment=env,
        inclination=90,
        heading=0,
        maxTimeStep=0.01,
        maxTime=5,
        terminateOnApogee=True,
        verbose=True,
    )
    ex_flight.postProcess()
    simulation_results += [(
        ex_flight.attitudeAngle,
        ex_rocket.staticMargin(0),
        ex_rocket.staticMargin(ex_flight.outOfRailTime),
        ex_rocket.staticMargin(ex_flight.tFinal)
    )]
Function.comparePlots(
    simulation_results,
    xlabel="Time (s)",
    ylabel="Attitude Angle (deg)",
)

The simulations themselves are run through a loop in which the Flight class is called, the simulation is performed, the desired parameters are saved into a list, and the next iteration follows. The post-process flight data method is used to make RocketPy evaluate additional result parameters after each simulation. Finally, the Function.comparePlots() method is used to plot the final result, as shown in Fig. 4.

Fig. 4: Dynamic Stability example, unstable rocket presented on the blue line.

Monte Carlo Simulation

When simulating a rocket's trajectory, many input parameters may not be completely reliable due to several uncertainties in measurements made during the design or construction phase of the rocket. These uncertainties can be considered together in a group of Monte Carlo simulations [RK16] built on top of RocketPy.
    The Monte Carlo method here is applied by running a significant number of simulations in which each iteration has a different set of inputs, randomly sampled from a previously known probability distribution, for instance a Gaussian with a given mean and standard deviation. Almost every input presents some kind of uncertainty, with the exception of discrete quantities such as the number of fins or propellant grains. Moreover, some inputs, such as wind conditions, system failures, or the aerodynamic coefficient curves, behave differently and must receive special treatment.
    Statistical analysis can then be made on all the simulations, with the main result being the 1σ, 2σ, and 3σ ellipses representing the possible area of impact and the area where the apogee is reached (Fig. 5). All ellipses can be evaluated based on the method presented by [Che66].
Fig. 5: 1σ, 2σ, and 3σ dispersion ellipses for both apogee and landing points.
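The kind of dispersion plot shown in Fig. 5 can be sketched with standard scientific Python tools. The snippet below is only an illustration, not RocketPy's implementation (which follows [Che66]): it draws kσ ellipses from the sample covariance of a synthetic cloud of landing points, with x and y standing in for the impact coordinates collected from the Monte Carlo runs.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def add_sigma_ellipses(ax, x, y, ks=(1, 2, 3)):
    # Principal axes of the point cloud from its 2x2 sample covariance
    cov = np.cov(x, y)
    vals, vecs = np.linalg.eigh(cov)
    angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))
    for k in ks:
        ax.add_patch(Ellipse((x.mean(), y.mean()),
                             width=2 * k * np.sqrt(vals[-1]),
                             height=2 * k * np.sqrt(vals[0]),
                             angle=angle, fill=False))

# Synthetic landing points, used purely for illustration
rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([400, 80], [[900, 300], [300, 400]], 500).T
fig, ax = plt.subplots()
ax.scatter(x, y, s=4)
add_sigma_ellipses(ax, x, y)
ax.set_xlabel("East (m)")
ax.set_ylabel("North (m)")
plt.show()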

When performing the Monte Carlo simulations with RocketPy, all the inputs - i.e. the parameters along with their respective standard deviations - are stored in a dictionary. The randomized set of inputs is then generated using a yield function:

def sim_settings(analysis_params, iter_number):
    i = 0
    while i < iter_number:
        # Generate a simulation setting
        sim_setting = {}
        for p_key, p_value in analysis_params.items():
            if type(p_value) is tuple:
                sim_setting[p_key] = normal(*p_value)
            else:
                sim_setting[p_key] = choice(p_value)
        # Update counter
        i += 1
        # Yield a simulation setting
        yield sim_setting
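The paper does not show the imports or the contents of analysis_params; assuming normal and choice come from numpy.random (consistent with the call signatures above), a hypothetical dictionary mixing Gaussian-sampled and discretely-sampled inputs could look like this:

from numpy.random import normal, choice  # assumed source of the samplers above

# Illustrative values only: tuples are (mean, standard deviation) pairs drawn
# with normal(); lists are sampled uniformly with choice().
analysis_params = {
    "rocketMass": (16.241, 0.1),       # kg
    "inclination": (85.0, 1.0),        # deg
    "railLength": [5.2, 5.2, 6.0],     # m, discrete options
}

first_setting = next(sim_settings(analysis_params, iter_number=10))
print(first_setting)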
Here analysis_params is the dictionary with the inputs and iter_number is the total number of simulations to be performed. Each time it is advanced, the function yields one dictionary with one set of inputs, which is used to run a simulation. The sim_settings generator is then called again and another simulation is run, until the loop reaches the requested number of simulations:
for s in sim_settings(analysis_params, iter_number):
    # Define all classes to simulate with the current
    # set of inputs generated by sim_settings

    # Prepare Environment
    ex_env = Environment(...)
    # Prepare Motor
    ex_motor = SolidMotor(...)
    # Prepare Rocket
    ex_rocket = Rocket(...)
    nose_cone = ex_rocket.addNose(...)
    fin_set = ex_rocket.addFins(...)
    tail = ex_rocket.addTail(...)

    # Consider any possible errors in the simulation
    try:
        # Simulate Flight until Apogee
        ex_flight = Flight(...)

        # Function to export all output and input
        # data to a text file (.txt)
        export_flight_data(s, ex_flight)
    except Exception as E:
        # if an error occurs, export the error
        # message to a text file
        print(E)
        export_flight_error(s)

Finally, the set of inputs for each simulation, along with its set of outputs, is stored in a .txt file. This allows for long-term data storage and the possibility of appending new simulations to previously finished ones. The stored output data can be used to study the final probability distribution of key parameters, as illustrated in Fig. 6.

Fig. 6: Distribution of apogee altitude.
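The export helpers called above are not shown in the paper. A minimal sketch of what they might look like follows; the file names, the format, and the particular Flight attributes saved here are assumptions chosen for illustration.

def export_flight_data(setting, flight, filename="dispersion_results.txt"):
    # Append one line per simulation: the sampled inputs plus a few outputs
    results = {
        "apogee": flight.apogee,
        "xImpact": flight.xImpact,   # attribute names assumed for illustration
        "yImpact": flight.yImpact,
    }
    with open(filename, "a") as file:
        file.write(str({**setting, **results}) + "\n")

def export_flight_error(setting, filename="dispersion_errors.txt"):
    # Keep the inputs of failed runs so they can be inspected later
    with open(filename, "a") as file:
        file.write(str(setting) + "\n")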
Finally, it is also worth mentioning that the information generated by Monte Carlo simulations built on top of RocketPy may be of utmost importance to safety and operational management during rocket launches, since it allows for a more reliable prediction of the landing site and apogee coordinates.

Validation of the results: Unit, Dimensionality and Acceptance Tests

Validation is a big problem for libraries like RocketPy, where true values for results such as apogee and maximum velocity are very hard to obtain or simply not available. Therefore, in order to make RocketPy more robust and easier to modify, while maintaining precise results, some innovative testing strategies have been implemented.
    First of all, unit tests were implemented for all classes and their methods, ensuring that each function works properly. Given a set of different inputs that each function can receive, the respective outputs are tested against expected results, which can be based on real data or augmented example cases. The test fails if the output deviates considerably from the established conditions, or if an unexpected error occurs along the way.
    Since RocketPy relies heavily on mathematical functions to express the governing equations, implementation errors can occur due to the convoluted nature of such expressions. Hence, to reduce the probability of such errors, there is a second layer of testing which evaluates whether such equations are dimensionally correct.
    To accomplish this, RocketPy makes use of the numericalunits library, which defines a set of independent base units as randomly-chosen positive floating point numbers. In a dimensionally-correct function, the units all cancel out when the final answer is divided by its resulting unit, and thus the result is deterministic, not random. On the other hand, if the function contains dimensionally-incorrect equations, there will be random factors causing a randomly-varying final answer. In practice, RocketPy runs two calculations: one without numericalunits, and another with the dimensionality variables. The results are then compared to assess whether the dimensionality is correct.
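The idea can be reproduced outside RocketPy in a few lines. In this sketch (not taken from RocketPy's test suite), the same quantity is computed under two different random unit assignments; a dimensionally consistent formula gives the same value once expressed in its expected unit, while an inconsistent one would not.

import numericalunits as nu

def dynamic_pressure(rho, v):
    # Dimensionally consistent: 0.5 * density * velocity**2 has units of pressure
    return 0.5 * rho * v**2

values = []
for _ in range(2):
    nu.reset_units()                      # re-randomize the base units
    Pa = nu.kg / (nu.m * nu.s**2)         # pascal built from base units
    rho = 1.225 * nu.kg / nu.m**3
    v = 30.0 * nu.m / nu.s
    values.append(dynamic_pressure(rho, v) / Pa)

assert abs(values[0] - values[1]) < 1e-9 * abs(values[0])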
Here is an example. First, a SolidMotor object and a Rocket object are initialized without numericalunits:

import pytest

@pytest.fixture
def unitless_solid_motor():
    return SolidMotor(
        thrustSource="Cesaroni_M1670.eng",
        burnOut=3.9,
        grainNumber=5,
        grainSeparation=0.005,
        grainDensity=1815,
        ...
    )

@pytest.fixture
def unitless_rocket(unitless_solid_motor):
    return Rocket(
        motor=unitless_solid_motor,
        radius=0.0635,
        mass=16.241,
        inertiaI=6.60,
        inertiaZ=0.0351,
        distanceRocketNozzle=-1.255,
        distanceRocketPropellant=-0.85704,
        ...
    )

Then, a SolidMotor object and a Rocket object are initialized with numericalunits:

import numericalunits

@pytest.fixture
def m():
    return numericalunits.m

@pytest.fixture
def kg():
    return numericalunits.kg

@pytest.fixture
def unitful_motor(kg, m):
    return SolidMotor(
        thrustSource="Cesaroni_M1670.eng",
        burnOut=3.9,
        grainNumber=5,
        grainSeparation=0.005 * m,
        grainDensity=1815 * (kg / m**3),
        ...
    )

@pytest.fixture
def unitful_rocket(kg, m, unitful_motor):
    return Rocket(
        motor=unitful_motor,
        radius=0.0635 * m,
        mass=16.241 * kg,
        inertiaI=6.60 * (kg * m**2),
        inertiaZ=0.0351 * (kg * m**2),
        distanceRocketNozzle=-1.255 * m,
        distanceRocketPropellant=-0.85704 * m,
        ...
    )

Then, to ensure that the equations implemented in both classes (Rocket and SolidMotor) are dimensionally correct, the values computed in the two cases can be compared. For example, the Rocket class computes the rocket's static margin, which is a non-dimensional value, so the result from both calculations should be the same:

def test_static_margin_dimension(
    unitless_rocket,
    unitful_rocket
):
    ...
    s1 = unitless_rocket.staticMargin(0)
    s2 = unitful_rocket.staticMargin(0)
    assert abs(s1 - s2) < 1e-6

In case the value of interest has units, such as the position of the center of pressure of the rocket, which has units of length, the value must be divided by the relevant unit before comparison:

def test_cp_position_dimension(
    unitless_rocket,
    unitful_rocket,
    m
):
    ...
    cp1 = unitless_rocket.cpPosition(0)
    cp2 = unitful_rocket.cpPosition(0) / m
    assert abs(cp1 - cp2) < 1e-6

If the assertion fails, we can assume that the formula responsible for calculating the center of pressure position was implemented incorrectly, probably with a dimensional error.
    Finally, some tests at a larger scale, known as acceptance tests, were implemented to validate outcomes such as apogee, apogee time, maximum velocity, and maximum acceleration against real flight data. A required accuracy for such values was established after the publication of the experimental data by [CSA+ 21]. Such tests are crucial for ensuring that the code does not lose precision as a result of new updates.
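An acceptance test can then be as simple as an assertion with a relative tolerance. The sketch below is illustrative only: the fixture name, the measured value, and the 5% threshold are invented, whereas the real tests compare against the flight data published in [CSA+ 21].

import pytest

MEASURED_APOGEE = 3000.0   # m, hypothetical measured value from a real flight
RELATIVE_TOLERANCE = 0.05  # hypothetical required accuracy

def test_apogee_against_flight_data(simulated_flight):
    # simulated_flight is a hypothetical fixture returning a Flight object
    assert simulated_flight.apogee == pytest.approx(
        MEASURED_APOGEE, rel=RELATIVE_TOLERANCE
    )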
These three layers of testing ensure that the code is trustworthy, and that new features can be implemented without degrading the results.

Conclusions

RocketPy is an easy-to-use tool for simulating high-powered rocket trajectories, built with SciPy and the Python scientific environment. The software's modular architecture is based on four main classes and helper classes, with well-documented code that allows complex simulations to be easily adapted to various needs using the supplied Jupyter Notebooks. The code can be a useful tool during rocket design and operation, allowing key parameters such as apogee and dynamic stability to be calculated, as well as high-fidelity 6-DOF vehicle trajectories with a wide variety of customizable parameters, from launch to the point of impact. RocketPy is an ever-evolving framework that is accessible to anyone interested, with an active community maintaining it and working on future features such as the implementation of other engine types, including hybrid and liquid motors, and even orbital flights.

Installing RocketPy

RocketPy was made to run on Python 3.6+ and requires the packages NumPy >= 1.0, SciPy >= 1.0 and Matplotlib >= 3.0. For a complete experience, we also recommend netCDF4 >= 1.4. All these packages, except netCDF4, will be installed automatically if the user does not have them. To install, execute:

pip install rocketpy

or

conda install -c conda-forge rocketpy

The source code, documentation and more examples are available at https://github.com/Projeto-Jupiter/RocketPy
Acknowledgments
The authors would like to thank the University of São Paulo for the support during the development of the current publication, and
also all members of Projeto Jupiter and the RocketPy Team who
contributed to the making of the RocketPy library.

REFERENCES
[AEH+ 19]  Adam Aitoumeziane, Peter Eusebio, Conor Hayes, Vivek Ra-
           machandran, Jamie Smith, Jayasurya Sridharan, Luke St Regis,
           Mark Stephenson, Neil Tewksbury, Madeleine Tran, and Hao-
           nan Yang. Traveler IV Apogee Analysis. Technical report,
           USC Rocket Propulsion Laboratory, Los Angeles, 2019. URL:
           http://www.uscrpl.com/s/Traveler-IV-Whitepaper.
[Aki70]    Hiroshi Akima. A new method of interpolation and smooth
           curve fitting based on local procedures. Journal of the ACM
           (JACM), 17(4):589–602, 1970. doi:10.1145/321607.
           321609.
[Bar67]    James S Barrowman. The Practical Calculation of the Aero-
           dynamic Characteristics of Slender Finned Vehicles. PhD
           thesis, Catholic University of America, Washington, DC United
           States, 1967.
[Che66]    Victor Chew. Confidence, Prediction, and Tolerance Re-
           gions for the Multivariate Normal Distribution. Journal of
           the American Statistical Association, 61(315), 1966. doi:
           10.1080/01621459.1966.10480892.
[Cok98]    J Coker. Thrustcurve.org — rocket motor performance data
           online, 1998. URL: https://www.thrustcurve.org/.
[CSA+ 21]  Giovani H Ceotto, Rodrigo N Schmitt, Guilherme F Alves, Lu-
           cas A Pezente, and Bruno S Carmo. Rocketpy: Six degree-of-
           freedom rocket trajectory simulator. Journal of Aerospace En-
           gineering, 34(6), 2021. doi:10.1061/(ASCE)AS.1943-
           5525.0001331.
[ISO75]    ISO Central Secretary. Standard Atmosphere. Technical Report
           ISO 2533:1975, International Organization for Standardization,
           Geneva, CH, 5 1975.
[MNK03]    Robert C Martin, James Newkirk, and Robert S Koss. Agile
           software development: principles, patterns, and practices, vol-
           ume 2. Prentice Hall Upper Saddle River, NJ, 2003.
[PdDKÜK83] Robert Piessens, Elise de Doncker-Kapenga, Christoph W
           Überhuber, and David K Kahaner. Quadpack: a subroutine
           package for automatic integration, volume 1. Springer Science
           & Business Media, 1983. doi:10.1007/978-3-642-
           61786-7.
[Pet83]    Linda Petzold. Automatic Selection of Methods for Solving
           Stiff and Nonstiff Systems of Ordinary Differential Equa-
           tions. SIAM Journal on Scientific and Statistical Computing,
           4(1):136–148, 3 1983. doi:10.1137/0904010.
[Rei22]    A Reilley. openmotor: An open-source internal ballistics
           simulator for rocket motor experimenters, 2022. URL: https:
           //github.com/reilleya/openMotor.
[RK16]     Reuven Y Rubinstein and Dirk P Kroese. Simulation and the
           Monte Carlo method. John Wiley & Sons, 2016. doi:10.
           1002/9781118631980.
[VGO+ 20]  Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haber-
           land, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu
           Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van
           der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman,
           Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert
           Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W.
           Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert
           Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris,
           Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa,
           Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fun-
           damental Algorithms for Scientific Computing in Python. Na-
           ture Methods, 17:261–272, 2020. doi:10.1038/s41592-
           019-0686-2.
[Wil18]    Paul D. Wilde. Range safety requirements and methods for
           sounding rocket launches. Journal of Space Safety Engineer-
           ing, 5(1):14–21, 3 2018. doi:10.1016/j.jsse.2018.
           01.002.



Wailord: Parsers and Reproducibility for Quantum Chemistry

Rohit Goswami‡§∗

* Corresponding author: rog32@hi.is
‡ Science Institute, University of Iceland
§ Quansight Austin, TX, USA

Copyright © 2022 Rohit Goswami. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract—Data driven advances dominate the applied sciences landscape, with quantum chemistry being no exception to the rule. Dataset biases and human error are key bottlenecks in the development of reproducible and generalized insights. At a computational level, we demonstrate how changing the granularity of the abstractions employed in data generation from simulations can aid in reproducible work. In particular, we introduce wailord (https://wailord.xyz), a free-and-open-source Python library to shorten the gap between data-analysis and computational chemistry, with a focus on the ORCA suite binaries. A two level hierarchy and exhaustive unit-testing ensure the ability to reproducibly describe and analyze "computational experiments". wailord offers both input generation, with enhanced analysis, and raw output analysis, for traditionally executed ORCA runs. The design focuses on treating output and input generation in terms of a mini domain specific language instead of more imperative approaches, and we demonstrate how this abstraction facilitates chemical insights.

Index Terms—quantum chemistry, parsers, reproducible reports, computational inference

Introduction

The use of computational methods for chemistry is ubiquitous and few modern chemists retain the initial skepticism of the field [Koh99], [Sch86]. Machine learning has been further earmarked [MSH19], [Dra20], [SGT+ 19] as an effective accelerator for computational chemistry at every level, from DFT [GLL+ 16] to alchemical searches [DBCC16] and saddle point searches [ÁJ18]. However, these methods trade technical rigor for vast amounts of data, and so the ability to reproduce results becomes increasingly more important. Independently, the ability to reproduce results [Pen11], [SNTH13] in all fields of computational research has spawned a veritable flock of methodological and programmatic advances [CAB+ 19], including the sophisticated provenance tracking of AiiDA [PCS+ 16], [HZU+ 20].

Dataset bias

Dataset bias [EIS+ 20], [BS19], [RBA+ 19] has gained prominence in the machine learning literature, but has not yet percolated through to the chemical sciences community. At its core, the argument for dataset biases in generic machine learning problems of image and text classification can be linked to the difficulty in obtaining labeled results for training purposes. This is not an issue in the computational physical sciences at all, as the training data can often be labeled without human intervention. This is especially true when simulations are carried out at varying levels of accuracy. However, this also leads to a heavy reliance on high accuracy calculations on "benchmark" datasets and results [HMSE+ 21], [SEJ+ 19].
    Compute is expensive, and the reproduction of data which is openly available is often hard to justify as a valid scientific endeavor. Rather than focus on the observable outputs of calculations, we assert that it is best to have reproducible confidence in the elements of the workflow. In the following sections, we outline wailord, a library which implements a two level structure for interacting with ORCA [Nee12] to implement an end-to-end workflow to analyze and prepare datasets. Our focus on ORCA is due to its rapid and responsive development cycles, the fact that it is free to use (though not open source), and its large repertoire of computational chemistry calculations. Notably, the black-box nature of ORCA (in that the source is not available) mirrors that of many other packages (which are not free) like VASP [Haf08]. Using ORCA, then, allows us to design a workflow which is well suited to working with many software suites in the community.
    We shall understand wailord through the lens of what is often known as a design pattern in the practice of computational science and engineering, that is, a template or description for solving commonly occurring problems in the design of programs.

Structure and Implementation

Python has grown to become the lingua franca for much of the scientific community [Oli07], [MA11], in no small part because of its interactive nature. In particular, the REPL (read-evaluate-print-loop) structure which has been prioritized (from IPython to Jupyter) is one of the prime motivations for the use of Python as an exploratory tool. Additionally, PyPI, the Python package index, accelerates the widespread dissemination of software packages. Thus wailord is implemented as a free and open source Python library.

Structure

Data generation involves a set of known configurations (say, xyz inputs) and a series of common calculations whose outputs are required. Computational chemistry packages tend to be focused on acceleration and setup details on a per-job scale.
wailord, in contrast, considers the outputs of simulations to form a tree, where the actual run and its inputs are the leaves, and each layer of the tree structure holds information which is collated into a single dataframe presented to the user.
    Downstream tasks for simulations of chemical systems involve questions phrased as queries or comparative measures. With that in mind, wailord generates pandas dataframes which are indistinguishable from standard machine learning information sources, to trivialize the data-munging and preparation process. The outputs of wailord represent concrete information; it is not meant to store runs like the ASE database [LMB+ 17], nor to run a process to manage discrete workflows like AiiDA [HZU+ 20].
    By construction, it also differs from existing "interchange" formats such as those favored by materials data repositories like the QCArchive project [SAB+ 21], and is partially close in spirit to the cclib endeavor [OTL08].

Implementation

Two classes form the backbone of the data-harvesting process. The intended point of interface with a user is the orcaExp class, which collects information from multiple ORCA outputs and produces dataframes that include relevant metadata (theory, basis, system, etc.) along with the requested results (energy surfaces, energies, angles, geometries, frequencies, etc.). A lower level "orca visitor" class is meant to parse each individual ORCA output. Until the release of ORCA 5, which promises structured property files, the outputs are necessarily parsed with regular expressions, but validated extensively. The focus on ORCA has allowed for more exotic helper functions, like the calculation of rate constants from orcaVis files. However, beyond this functionality offered by the quantum chemistry software (ORCA), a computational chemistry workflow requires data to be more malleable. To this end, the plain-text or binary outputs of quantum chemistry software must be further worked on (post-processed) to gain insights. This means, for example, that the outputs may be entered into a spreadsheet, into a plain text note, or into a lab notebook, but in practice programming languages are a good level of abstraction. Of the programming languages, Python, as a general purpose language with a high rate of community adoption, is a good starting place.
    Python has a rich set of structures implemented in the standard library, which have been liberally used for structuring outputs. Furthermore, there have been efforts to bring the grammar of graphics [WW05] and tidy-data [WAB+ 19] approaches to the pandas package, which have also been adopted internally, including strict unit adherence using the pint library. The user is not burdened by these implementation details and is instead ensured a pandas data-frame for all operations, both at the orcaVis level and the orcaExp level.
    Software industry practices have been followed throughout the development process. In particular, the entire package is written in a test-driven-development (TDD) fashion, which has been proven many times over for academia [DJS08] and industry [BN06]. In essence, each feature is accompanied by a test-case. This is meant to ensure that once the end-user is able to run the test-suite, they are guaranteed the features promised by the software. Additionally, this means that potential bugs can be submitted as test cases, which helps isolate errors for fixes. Furthermore, software testing allows for coverage metrics, thereby enhancing user and developer confidence in different components of any large code-base.

Fig. 1: Some implemented workflows including the two input YML files. VPT2 stands for second-order vibrational perturbation theory and Orca_vis objects are part of wailord's class structure. PES stands for potential energy surface.

User Interface

The core user interface is depicted in Fig. 1. The test suites cover standard usage and serve as ad-hoc tutorials. Additionally, Jupyter notebooks are also able to effectively run wailord, which facilitates its use over SSH connections to high-performance-computing (HPC) clusters. The user is able to describe the nature of the calculations required in a simple YAML file format. A command line interface can then be used to generate inputs, or another YAML file may be passed to describe the paths needed. A very basic harness script for submissions is also generated, which can be rate limited to ensure optimal runs on an HPC cluster.

Design and Usage

A simulation study can be broken into:
   •   Inputs + Configuration for runs + Data for structures
   •   Outputs per run
   •   Post-processing and aggregation

From a software design perspective, it is important to recognize the right level of abstraction for the given problem. An object-oriented pattern is seen to be the correct design paradigm. However, though combining test driven development and object oriented design is robust and extensible, the design of wailord is meant to tackle the problem at the level of a domain specific language. Recall from formal language theory [AA07] that a grammar is essentially meant to specify the entire possible set of inputs and outputs for a given language. A grammar can be expressed as a series of tokens (terminal symbols) and non-terminal symbols (syntactic variables), along with rules defining valid combinations of these.
    It may appear that there is little but splitting hairs between parsing data line by line, as is traditionally done in libraries, and defining the exact structural relations between allowed symbols. However, this design, apart from disallowing invalid inputs, also makes sense from a pedagogical perspective.
For example, of the inputs, structured data like configurations (XYZ formats) are best handled by concrete grammars, where each rule is followed in order:

grammar_xyz = Grammar(
    r"""
    meta = natoms ws coord_block ws?
    natoms = number
    coord_block = (aline ws)+
    aline = (atype ws cline)
    atype = ~"[a-zA-Z]" / ~"[0-9]"
    cline = (float ws float ws float)
    float = pm number "." number
    pm              = ~"[+-]?"
    number          = ~"\\d+"
    ws              = ~"\\s*"
    """
)

This definition maps neatly onto the exact specification of an xyz file:

2
H     -2.8   2.8     0.1
H     -3.2   3.4     0.2

Here we recognize that the overarching structure is the number of atoms, followed by multiple coordinate blocks, followed by optional whitespace. We then define each coordinate block as one or many aline constructs, each of which is an atype with whitespace and three float values representing coordinates. Finally, we define the positive, negative, numeric and whitespace symbols to round out the grammar. This is the exact form of every valid xyz file. The parsimonious library allows handling such grammatical constructs in a Pythonic manner.
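As a quick check of the grammar defined above (assuming Grammar was imported with "from parsimonious.grammar import Grammar"), parsing a well-formed frame returns a parse tree, while a malformed line raises ParseError. The snippet below is illustrative and not part of wailord itself.

from parsimonious.exceptions import ParseError

good = "2\nH -2.8 2.8 0.1\nH -3.2 3.4 0.2\n"
bad = "2\nH -2.8 2.8\n"   # only two coordinates on the atom line

tree = grammar_xyz.parse(good)   # succeeds, returns the parse tree
print(tree.children[0].text)     # the natoms node, here "2"

try:
    grammar_xyz.parse(bad)
except ParseError as err:
    print("rejected:", err)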
However, the generation of inputs is facilitated through the use of generalized templates for "experiments" controlled by cookiecutter. This allows for validations on the workflow during setup itself.
    For the purposes of the simulation study, one "experiment" consists of multiple single-shot runs, each of which can take a long time.
    Concretely, the top-level "experiment" is controlled by a YAML file:

project_slug: methylene
project_name: singlet_triplet_methylene
outdir: "./lab6"
desc: An experiment to calculate singlet and triplet
states differences at a QCISD(T) level
author: Rohit
year: "2020"
license: MIT
orca_root: "/home/orca/"
orca_yml: "orcaST_meth.yml"
inp_xyz: "ch2_631ppg88_trip.xyz"

Where each run is then controlled individually:

qc:
  active: True
  style: ["UHF", "QCISD", "QCISD(T)"]
  calculations: ["OPT"]
  basis_sets:
    - 6-311++G**
xyz: "inp.xyz"
spin:
  - "0 1" # Singlet
  - "0 3" # Triplet
extra: "!NUMGRAD"
viz:
  molden: True
  chemcraft: True
jobscript: "basejob.sh"

Usage is then facilitated by a high-level call:

waex.cookies.gen_base(
    template="basicExperiment",
    absolute=False,
    filen="./lab6/expCookieST_meth.yml",
)

The resulting directory tree can be sent to a high performance computing (HPC) cluster and, once it has been executed via the generated run-script helper, analysis can proceed locally.

mdat = waio.orca.genEBASet(
    Path("buildOuts") / "methylene",
    deci=4,
)
print(mdat.to_latex(
    index=False,
    caption="CH2 energies and angles at various levels of theory, with NUMGRAD",
))

In certain situations ordering may also be relevant (e.g. for generating curves of varying density functional theoretic complexity); this is handled as well.
    For the outputs, similar to the key ideas behind signac, nix, spack and other tools, control is largely taken away from the user in terms of the auto-generated directory structure. The outputs of each run are largely collected through regular expressions, due to the ever changing nature of the outputs of closed source software.
    Importantly, for a code which is meant to confer insights, the concept of units is key. wailord with ORCA has first class support for units using pint.
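Independently of wailord, the following snippet shows the kind of safety pint provides: quantities carry their units, conversions are explicit, and mixing incompatible dimensions raises an error instead of silently producing a number.

import pint

ureg = pint.UnitRegistry()
bond = 0.7414 * ureg.angstrom
print(bond.to("bohr"))            # explicit, checked conversion

try:
    bond + 1.0 * ureg.second      # adding a length to a time is rejected
except pint.DimensionalityError as err:
    print(err)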
Dissociation of H2

As a concrete example, we demonstrate a popular pedagogical exercise, namely obtaining the binding energy curves of the H2 molecule at varying basis sets for Hartree-Fock and related methods, along with the reference results of Kolos and Wolniewicz [KW68]. We first recognize that even for a moderate 9 basis sets with 33 points each, we expect around 1814 data points. Where each basis set requires a separate run, this is easily expected to be tedious.
    Naively, this would require modifying and generating ORCA input files by hand:

!UHF 3-21G ENERGY
%paras
    R = 0.4, 2.0, 33 # x-axis of H1
end

*xyz 0 1
H    0.00      0.0000000         0.0000000
H    {R}       0.0000000         0.0000000
*

We can instead formulate the requirement declaratively as:

qc:
  active: True
  style: ["UHF", "QCISD", "QCISD(T)"]
  calculations: ["ENERGY"] # Same as single point or SP
  basis_sets:
    - 3-21G
    - 6-31G
    - 6-311G
    - 6-311G*
    - 6-311G**
    - 6-311++G**
    - 6-311++G(2d,2p)
    - 6-311++G(2df,2pd)
    - 6-311++G(3df,3pd)
xyz: "inp.xyz"
spin:
  - "0 1"
params:
  - name: R
    range: [0.4, 2.00]
    points: 33
    slot:
      xyz: True
      atype: "H"
      anum: 1 # Start from 0
      axis: "x"
extra: Null
jobscript: "basejob.sh"

This run configuration is coupled with an experiment setup file, similar to the one in the previous section. With this in place, generating a data-set of all the required data is fairly trivial:

kolos = pd.read_csv(
    "../kolos_H2.ene",
    skiprows=4,
    header=None,
    names=["bond_length", "Actual Energy"],
    sep=" ",
)
kolos["theory"] = "Kolos"

expt = waio.orca.orcaExp(expfolder=Path("buildOuts") / "h2")
h2dat = expt.get_energy_surface()

Finally, the resulting data can be plotted using tidy principles:

imgname = "images/plotH2A.png"
p1a = (
    p9.ggplot(
        data=h2dat,
        mapping=p9.aes(x="bond_length", y="Actual Energy", color="theory"),
    )
    + p9.geom_point()
    + p9.geom_point(
        mapping=p9.aes(x="bond_length", y="SCF Energy"),
        color="black", alpha=0.1, shape="*", show_legend=True,
    )
    + p9.geom_point(
        mapping=p9.aes(x="bond_length", y="Actual Energy", color="theory"),
        data=kolos, show_legend=True,
    )
    + p9.scales.scale_y_continuous(
        breaks=np.arange(h2dat["Actual Energy"].min(),
                         h2dat["Actual Energy"].max(), 0.05)
    )
    + p9.ggtitle("Scan of an H2 bond length (dark stars are SCF energies)")
    + p9.labels.xlab("Bond length in Angstrom")
    + p9.labels.ylab("Actual Energy (Hartree)")
    + p9.facet_wrap("basis")
)
p1a.save(imgname, width=10, height=10, dpi=300)

This gives rise to the concise representation in Fig. 2, from which all required inference can be drawn.

Fig. 2: Plots generated from tidy principles for post-processing wailord parsed outputs.

In this particular case, it is possible to see the deviations from the experimental results at varying levels of theory for different basis sets.

Conclusions

We have discussed wailord in the context of generating, in a reproducible manner, the structured inputs and output datasets which facilitate chemical insight. The formulation of bespoke datasets tailored to the study of specific properties across a wide range of materials at varying levels of theory has been shown. The test-driven-development approach is a robust methodology for interacting with closed source software. The design patterns expressed, of which the wailord library is a concrete implementation, are expected to be augmented with more workflows, in particular with a focus on nudged elastic band calculations. The methodology here has been applied to ORCA; however, the two level structure generalizes to most quantum chemistry codes as well.
    Importantly, we note that the ideas expressed form a design pattern for interacting with a plethora of computational tools in a reproducible manner. By defining appropriate scopes for our structured parsers and generating deterministic directory trees, along with a judicious use of regular expressions for output data harvesting, we are able to leverage tidy-data principles to analyze the results of a large number of single-shot runs.
    Taken together, this tool-set and methodology can be used to generate elegant reports combining code and concepts into a seamless whole. Beyond this, the interpretation of each computational experiment in terms of a concrete domain specific language is expected to reduce the requirement of having to re-run benchmark calculations.

Acknowledgments

R Goswami thanks H. Jónsson and V. Ásgeirsson for discussions on the design of computational experiments for inference in computational chemistry. This work was partially supported by the Icelandic Research Fund, grant number 217436052.

REFERENCES

[AA07]      Alfred V. Aho and Alfred V. Aho, editors. Compilers: Principles, Techniques, & Tools. Pearson/Addison Wesley, Boston, 2nd edition, 2007.

[ÁJ18]      Vilhjálmur Ásgeirsson and Hannes Jónsson. Exploring Potential Energy Surfaces with Saddle Point Searches. In Wanda Andreoni and Sidney Yip, editors, Handbook of Materials Modeling, pages 1–26. Springer International Publishing, Cham, 2018. doi:10.1007/978-3-319-42913-7_28-1.

[BN06]      Thirumalesh Bhat and Nachiappan Nagappan. Evaluating the efficacy of test-driven development: Industrial case studies. In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, ISESE '06, pages 356–363, New York, NY, USA, September 2006. Association for Computing Machinery. doi:10.1145/1159733.1159787.

[BS19]      Avrim Blum and Kevin Stangl. Recovering from Biased Data: Can Fairness Constraints Improve Accuracy? arXiv:1912.01094 [cs, stat], December 2019. arXiv:1912.01094.
[CAB+ 19]   The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, Anna Krystalli, Alexander Morley, Martin O'Reilly, and Kirstie Whitaker. The Turing Way: A Handbook for Reproducible Data Science. Zenodo, March 2019.

[DBCC16]    Sandip De, Albert P. Bartók, Gábor Csányi, and Michele Ceriotti. Comparing molecules and solids across structural and alchemical space. Physical Chemistry Chemical Physics, 18(20):13754–13769, May 2016. doi:10.1039/C6CP00415F.

[DJS08]     Chetan Desai, David Janzen, and Kyle Savage. A survey of evidence for test-driven development in academia. ACM SIGCSE Bulletin, 40(2):97–101, June 2008. doi:10.1145/1383602.1383644.

[Dra20]     Pavlo O. Dral. Quantum Chemistry in the Age of Machine Learning. The Journal of Physical Chemistry Letters, 11(6):2336–2347, March 2020. doi:10.1021/acs.jpclett.9b03664.

[EIS+ 20]   Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying Statistical Bias in Dataset Replication. arXiv:2005.09619 [cs, stat], May 2020. arXiv:2005.09619.

[GLL+ 16]   Ting Gao, Hongzhi Li, Wenze Li, Lin Li, Chao Fang, Hui Li, Li-Hong Hu, Yinghua Lu, and Zhong-Min Su. A machine learning correction for DFT non-covalent interactions based on the S22, S66 and X40 benchmark databases. Journal of Cheminformatics, 8(1):24, May 2016. doi:10.1186/s13321-016-0133-7.

[Haf08]     Jürgen Hafner. Ab-initio simulations of materials using VASP: Density-functional theory and beyond. Journal of Computational Chemistry, 29(13):2044–2078, 2008. doi:10.1002/jcc.21057.

[HMSE+ 21]  Johannes Hoja, Leonardo Medrano Sandonas, Brian G. Ernst, Alvaro Vazquez-Mayagoitia, Robert A. DiStasio Jr., and Alexandre Tkatchenko. QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. Scientific Data, 8(1):43, February 2021. doi:10.1038/s41597-021-00812-2.

[HZU+ 20]   Sebastiaan P. Huber, Spyros Zoupanos, Martin Uhrin, Leopold Talirz, Leonid Kahle, Rico Häuselmann, Dominik Gresch, Tiziano Müller, Aliaksandr V. Yakutovich, Casper W. Andersen, Francisco F. Ramirez, Carl S. Adorf, Fernando Gargiulo, Snehal Kumbhar, Elsa Passaro, Conrad Johnston, Andrius Merkys, Andrea Cepellotti, Nicolas Mounet, Nicola Marzari, Boris Kozinsky, and Giovanni Pizzi. AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance. Scientific Data, 7(1):300, September 2020. doi:10.1038/s41597-020-00638-4.

[Koh99]     W. Kohn. Nobel Lecture: Electronic structure of matter, wave functions and density functionals. Reviews of Modern Physics, 71(5):1253–1266, October 1999. doi:10.1103/RevModPhys.71.1253.

[Nee12]     Frank Neese. The ORCA program system. WIREs Computational Molecular Science, 2(1):73–78, 2012. doi:10.1002/wcms.81.

[Oli07]     T. E. Oliphant. Python for Scientific Computing. Computing in Science Engineering, 9(3):10–20, May 2007. doi:10/fjzzc8.

[OTL08]     Noel M. O'Boyle, Adam L. Tenderholt, and Karol M. Langner. Cclib: A library for package-independent computational chemistry algorithms. Journal of Computational Chemistry, 29(5):839–845, 2008. doi:10.1002/jcc.20823.

[PCS+ 16]   Giovanni Pizzi, Andrea Cepellotti, Riccardo Sabatini, Nicola Marzari, and Boris Kozinsky. AiiDA: Automated interactive infrastructure and database for computational science. Computational Materials Science, 111:218–230, January 2016. doi:10.1016/j.commatsci.2015.09.013.

[Pen11]     Roger D. Peng. Reproducible Research in Computational Science. Science, 334(6060):1226–1227, December 2011. doi:10/fdv356.

[RBA+ 19]   Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, pages 5301–5310. PMLR, May 2019.

[SAB+ 21]   Daniel G. A. Smith, Doaa Altarawy, Lori A. Burns, Matthew Welborn, Levi N. Naden, Logan Ward, Sam Ellis, Benjamin P. Pritchard, and T. Daniel Crawford. The MolSSI QCArchive project: An open-source platform to compute, organize, and share quantum chemistry data. WIREs Computational Molecular Science, 11(2):e1491, 2021. doi:10.1002/wcms.1491.

[Sch86]     Henry F. Schaefer. Methylene: A Paradigm for Computational Quantum Chemistry. Science, 231(4742):1100–1107, March 1986. doi:10.1126/science.231.4742.1100.

[SEJ+ 19]   Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins: Structure, Function, and Bioinformatics, 87(12):1141–1148, 2019. doi:10.1002/prot.25834.

[SGT+ 19]   K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller, and R. J. Maurer. Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions. Nature Communications, 10(1):5024, November 2019. doi:10.1038/s41467-019-12875-2.

[SNTH13]    Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig. Ten Simple Rules for Reproducible Computational Research. PLOS Computational Biology, 9(10):e1003285, October 2013. doi:10/pjb.

[WAB+ 19]   Hadley Wickham, Mara Averick, Jennifer Bryan, Winston
[KW68]     W. Kolos and L. Wolniewicz. Improved Theoretical Ground-                      Chang, Lucy D’Agostino McGowan, Romain François, Garrett
           State Energy of the Hydrogen Molecule. The Journal of Chem-                   Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn,
           ical Physics, 49(1):404–410, July 1968. doi:10.1063/1.                        Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache,
           1669836.                                                                      Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel,
[LMB+ 17] Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist,                     Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke,
           Ivano E. Castelli, Rune Christensen, Marcin Du\lak, Jesper                    Kara Woo, and Hiroaki Yutani. Welcome to the Tidyverse.
           Friis, Michael N. Groves, Bjørk Hammer, Cory Hargus, Eric D.                  Journal of Open Source Software, 4(43):1686, November 2019.
           Hermes, Paul C. Jennings, Peter Bjerre Jensen, James Kermode,                 doi:10.21105/joss.01686.
           John R. Kitchin, Esben Leonhard Kolsbjerg, Joseph Kubal, Kris-    [WW05]      Leland Wilkinson and Graham Wills. The Grammar of Graph-
           ten Kaasbjerg, Steen Lysgaard, Jón Bergmann Maronsson, Tris-                  ics. Statistics and Computing. Springer, New York, 2nd ed
           tan Maxson, Thomas Olsen, Lars Pastewka, Andrew Peterson,                     edition, 2005.
           Carsten Rostgaard, Jakob Schiøtz, Ole Schütt, Mikkel Strange,
           Kristian S. Thygesen, Tejs Vegge, Lasse Vilhelmsen, Michael
           Walter, Zhenhua Zeng, and Karsten W. Jacobsen. The atomic
           simulation environment—a Python library for working with
           atoms. Journal of Physics: Condensed Matter, 29(27):273002,
           June 2017. doi:10.1088/1361-648X/aa680e.
[MA11]     K. J. Millman and M. Aivazis. Python for Scientists and
           Engineers. Computing in Science Engineering, 13(2):9–12,
           March 2011. doi:10/dc343g.
[MSH19]    Ralf Meyer, Klemens S. Schmuck, and Andreas W. Hauser.
           Machine Learning in Computational Chemistry: An Evalua-
           tion of Method Performance for Nudged Elastic Band Cal-
           culations. Journal of Chemical Theory and Computation,
           15(11):6513–6523, November 2019. doi:10.1021/acs.
           jctc.9b00708.




Variational Autoencoders For Semi-Supervised Deep Metric Learning

Nathan Safir‡∗, Meekail Zain§, Curtis Godwin‡, Eric Miller‡, Bella Humphrey§, Shannon P Quinn§¶
‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA
§ Department of Computer Science, University of Georgia, Athens, GA 30602 USA
¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 USA
∗ Corresponding author: nssafir@gmail.com

Copyright © 2022 Nathan Safir et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Deep metric learning (DML) methods generally do not incorporate unlabelled data. We propose borrowing components of the variational autoencoder (VAE) methodology to extend DML methods to train on semi-supervised datasets. We experimentally evaluate the atomic benefits of performing DML on the VAE latent space, such as the enhanced ability to train using unlabelled data and to induce bias given prior knowledge. We find that jointly training DML with an autoencoder and VAE may be potentially helpful for some semi-supervised datasets, but that a training routine of alternating between the DML loss and an additional unsupervised loss across epochs is generally unviable.

Index Terms—Variational Autoencoders, Metric Learning, Deep Learning, Representation Learning, Generative Models

Introduction

Within the broader field of representation learning, metric learning is an area which looks to define a distance metric which is smaller between similar objects (such as objects of the same class) and larger between dissimilar objects. Oftentimes, a map is learned from inputs into a low-dimensional latent space where euclidean distance exhibits this relationship, encouraged by training said map against a loss (cost) function based on the euclidean distance between sets of similar and dissimilar objects in the latent space. Existing metric learning methods are generally unable to learn from unlabelled data, which is problematic because unlabelled data is often easier to obtain and is potentially informative.

We take inspiration from variational autoencoders (VAEs), a generative representation learning architecture, for using unlabelled data to create accurate representations. Specifically, we look to evaluate three atomic improvement proposals that detail how pieces of the VAE architecture can create a better deep metric learning (DML) model on a semi-supervised dataset. From here, we can ascertain which specific qualities of how VAEs process unlabelled data are most helpful in modifying DML methods to train with semi-supervised datasets.

First, we propose that the autoencoder structure of the VAE helps the clustering of unlabelled points, as the reconstruction loss may help incorporate semantic information from unlabelled sources. Second, we propose that the structure of the VAE latent space, as it is confined by a prior distribution, can be used to induce bias in the latent space of a DML system. For instance, if we know a dataset contains N-many classes, creating a prior distribution that is a learnable mixture of N gaussians may help produce better representations. Third, we propose that performing DML on the latent space of the VAE, so that the DML task can be jointly optimized with the VAE to incorporate unlabelled data, may help produce better representations.

Each of the three improvement proposals will be evaluated experimentally. The improvement proposals will be evaluated by comparing a standard DML implementation to the same DML implementation:

   •  jointly optimized with an autoencoder
   •  while structuring the latent space around a prior distribution using the VAE's KL-divergence loss term between the approximated posterior and prior
   •  jointly optimized with a VAE

Our primary contribution is evaluating these three improvement proposals. Our secondary contribution is presenting the results of the joint approaches for VAEs and DML for more recent metric losses that have not been jointly optimized with a VAE in previous literature.

Related Literature

The goal of this research is to investigate how components of the variational autoencoder can help the performance of deep metric learning in semi-supervised tasks. We draw on previous literature to find not only prior attempts at this specific research goal but also work on adjacent research questions that proves insightful. In this review of the literature, we discuss previous related work in the areas of Semi-Supervised Metric Learning and VAEs with Metric Losses.

Semi-Supervised Metric Learning

There have been previous approaches to designing metric learning architectures which incorporate unlabelled data into the metric learning training regimen for semi-supervised datasets. One of the original approaches is the MPCK-MEANS algorithm proposed by Bilenko et al. ([BBM04]), which adds a penalty for placing labelled inputs in the same cluster if they are of different classes, or in different clusters if they are of the same class. This penalty is proportional to the metric distance between the pair of inputs.
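As a rough illustration of this family of pairwise-constraint penalties, the sketch below penalizes a must-link pair (same class) in proportion to its embedding distance and a cannot-link pair (different classes) in proportion to how far it falls inside a margin. It is a simplified, hypothetical example (the names, the margin parameter, and the use of PyTorch are assumptions), not the MPCK-MEANS algorithm itself, which learns a metric inside a K-means procedure.

    import torch

    def pairwise_constraint_penalty(z_i, z_j, same_class, margin=1.0):
        # z_i, z_j: batches of embeddings; same_class: boolean tensor per pair.
        # Must-link violations are penalized by the distance between the pair;
        # cannot-link violations by how far the pair falls inside the margin.
        dist = torch.norm(z_i - z_j, p=2, dim=-1)
        must_link = same_class.float() * dist
        cannot_link = (1.0 - same_class.float()) * torch.clamp(margin - dist, min=0.0)
        return (must_link + cannot_link).mean()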

Baghshah and Shouraki ([BS09]) also look to impose similar constraints by introducing a loss term to preserve locally linear relationships between labelled and unlabelled data in the input space. Wang et al. ([WYF13]) also use a regularizer term to preserve the topology of the input space. Using VAEs, in a sense, draws on this theme: though there is no explicit term to enforce that the topology of the input space is preserved, a topology of the inputs is intended to be learned through a low-dimensional manifold in the latent space.

One more recent common general approach to this problem is to use the unlabelled data's proximity to the labelled data to estimate labels for unlabelled data, effectively transforming unlabelled data into labelled data. Dutta et al. ([DHS21]) and Li et al. ([LYZ+ 19]) propose a model which uses affinity propagation on a k-Nearest-Neighbors graph to label partitions of unlabelled data based on their closest neighbors in the latent space. Wu et al. ([WFZ20]) also look to assign pseudo-labels to unlabelled data, but not through a graph-based approach. Instead, the proposed model looks to approximate "soft" pseudo-labels for unlabelled data from the metric learning similarity measure between the embedding of unlabelled data and the center of each class of the labelled data.

Several of the recent graph-based approaches can be considered state-of-the-art for semi-supervised metric learning. Li et al.'s paper states their methods achieve 98.9 percent clustering accuracy on the MNIST dataset with 10% labelled data, outperforming two similar state-of-the-art methods, DFCM ([ARJM18]) and SDEC ([RHD+ 19]), by roughly 8 points. Dutta et al.'s method also outperforms five other state-of-the-art methods on the R@1 metric (the percentage of test examples that have at least one nearest neighbor from the same class) by at least 1.2 on the MNIST dataset, as well as the Fashion-MNIST and CIFAR-10 datasets. It is difficult to compare the two approaches as the evaluation metrics used in each paper differ. Li et al.'s paper has been cited rather heavily relative to other papers in the field and can be considered state of the art for semi-supervised DML on MNIST. The paper also provides a helpful metric (98.9 percent clustering accuracy on the MNIST dataset with 10% labelled data) to use as a reference point for the results in this paper.

VAEs with Metric Loss

Some approaches to incorporating labelled data into VAEs use a metric loss to govern the latent space more explicitly. Lin et al. ([LDD+ 18]) model the intra-class invariance (i.e. the class-related information of a data point) and intra-class variance (i.e. the distinct features of a data point not unique to its class) separately. Like several other models in this section, this paper's proposed model incorporates a metric loss term for the latent vectors representing intra-class invariance and the latent vectors representing both intra-class invariance and intra-class variance.

Kulkarni et al. ([KCJ20]) incorporate labelled information into the VAE methodology in two ways. First, a modified architecture called the CVAE is used in which the encoder and generator of the VAE are conditioned not only on the input X and latent vector z, respectively, but also on the label Y. The CVAE was introduced in previous papers ([SLY15], [DCGO19]). Second, the authors add a metric loss, specifically a multi-class N-pair loss ([Soh16]), to the overall loss function of the model. While it is unclear how the CVAE technique would be adapted in a semi-supervised setting, as there is not a label Y associated with each datapoint X, we also experiment with adding a (different) metric loss to the overall VAE loss function.

Most recently, Grosnit et al. ([GTM+ 21]) leverage a new training algorithm for combining VAEs and DML for Bayesian Optimization, and test said algorithm using simple, contrastive, and triplet metric losses. We look to build on this literature by also testing a combined VAE DML architecture on more recent metric losses, albeit using a simpler training regimen.

Deep Metric Learning (DML)

Metric learning attempts to create representations for data by training against the similarity or dissimilarity of samples. In a more technical sense, there are two notable functions in DML systems. Function fθ is a neural network which maps the input data X to the latent points Z (i.e. fθ : X → Z, where θ is the network parameters). Generally, Z exists in a space of much lower dimensionality than X (e.g. X is a set of 28 × 28 pixel pictures such that X ⊂ R^(28×28) and Z ⊂ R^10).

The function D_fθ(x, y) = D(fθ(x), fθ(y)) represents the distance between two inputs x, y ∈ X. To create a useful embedding model fθ, we would like fθ to produce large values of D_fθ(x, y) when x and y are dissimilar and small values of D_fθ(x, y) when x and y are similar. In some cases, dissimilarity and similarity can refer to when inputs are of different and the same classes, respectively.

It is common for the Euclidean metric (i.e. the L2 metric) to be used as a distance function in metric learning. The generalized Lp metric can be defined as follows, where z_0, z_1 ∈ R^d:

    D_p(z_0, z_1) = ||z_0 − z_1||_p = ( ∑_{i=1}^{d} |z_{0i} − z_{1i}|^p )^{1/p}

If we have chosen fθ (a neural network) and the distance function D (the L2 metric), the remaining component to be defined in a metric learning system is the loss function for training f. In practice, we will be using triplet loss ([SKP15]), one of the most common metric learning loss functions.

Methodology

We look to discover the potential of applying components of the VAE methodology to DML systems. We test this through presenting incremental modifications to the basic DML architecture. Each modified architecture corresponds to an improvement proposal about how a specific part of the VAE training regime and loss function may be adapted to assist the performance of a DML method on a semi-supervised dataset.

The general method we take for creating modified DML models involves extending the training regimen to two phases, a supervised and an unsupervised phase. In the supervised phase, the modified DML model behaves identically to the base DML model, training on the same metric loss function. In the unsupervised phase, the DML model trains against an unsupervised loss inspired by the VAE. This may require extra steps to be added to the DML architecture. In the pseudocode, s refers to a boolean variable representing whether the current phase is supervised, and α is a hyperparameter which modulates the impact of the unsupervised loss on the total loss for the DML autoencoder.
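The referenced pseudocode is not reproduced here; as a stand-in, the listing below is a minimal PyTorch-style sketch of the alternating schedule, with the boolean s selecting the phase and α weighting the unsupervised (reconstruction) loss. The model, decoder, and loader names are illustrative assumptions, not the authors' implementation.

    import torch.nn.functional as F

    def train_alternating(model, decoder, labelled_loader, unlabelled_loader,
                          optimizer, epochs, alpha):
        for epoch in range(epochs):
            s = (epoch % 2 == 0)  # supervised phase on even epochs, unsupervised on odd
            loader = labelled_loader if s else unlabelled_loader
            for batch in loader:
                optimizer.zero_grad()
                if s:
                    # supervised phase: metric (triplet) loss on labelled triplets
                    anchor, positive, negative = batch
                    loss = F.triplet_margin_loss(model(anchor), model(positive), model(negative))
                else:
                    # unsupervised phase: reconstruction loss weighted by alpha
                    x = batch
                    loss = alpha * F.mse_loss(decoder(model(x)), x)
                loss.backward()
                optimizer.step()

For the prior-based variant described under Improvement Proposal 2 below, the reconstruction term in the unsupervised phase would be replaced by a KL-divergence term against the chosen prior.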




Improvement Proposal 1

We first look to evaluate the improvement proposal that adding a reconstruction loss to a DML system can improve the quality of clustering in the latent representations on a semi-supervised dataset. Reconstruction loss in and of itself enforces a similar semantic mapping onto the latent space as a metric loss, but can be computed without labelled data. In theory, we believe that the added constraint that the latent vector must be reconstructed to approximate the original input will train the spatial positioning to reflect semantic information. Following this reasoning, observations which share similar semantic information, specifically observations of the same class (even if not labelled as such), should intuitively be positioned nearby within the latent space. To test if this intuition occurs in practice, we evaluate whether a DML model with an autoencoder structure and reconstruction loss (described in further detail below) will perform better than a plain DML model in terms of clustering quality. This will be especially evident for semi-supervised datasets in which the amount of labelled data is not feasible for solely supervised DML.

Given a semi-supervised dataset, we assume a standard DML system will use only the labelled data and train given a metric loss Lmetric (see Algorithm 1). Our modified model, the DML Autoencoder, extends the DML model's training regime by adding a decoder network which takes the latent point z as input and produces an output x̂. The unsupervised loss LU is equal to the reconstruction loss.

Improvement Proposal 2

Say we are aware that a dataset has n classes. It may be useful to encourage that there are n clusters in the latent space of a DML model. This can be enforced by using a prior distribution containing n-many Gaussians. As we wish to measure only the effect of inducing bias on the representation without adding any complexity to the model, the prior distribution will not be learnable (unlike VAE with VampPrior). By testing whether the classes of points in the latent space are organized along the prior components, we can test whether bias can be induced using a prior to constrain the latent space of a DML. By testing whether clustering improves performance, we can evaluate whether this inductive bias is helpful.

Given a fully supervised dataset, we assume a standard DML system will use only the labelled data and train given a metric loss Lmetric. Our modified model extends the DML system's training regime by setting the unsupervised loss to a KL divergence term that measures the difference between posterior distributions and a prior distribution. It should also be noted that, like the VAE encoder, we will map the input not to a latent point but to a latent distribution. The latent point is stochastically sampled from the latent distribution during training. Mapping the input to a distribution instead of a point will allow us to calculate the KL divergence.

In practice, we will be evaluating a DML model with a unit prior and a DML model with a mixture of gaussians (GMM) prior. The latter model constructs the prior as a mixture of n gaussians, each centered at a vertex of the unit hypercube (i.e. each side is 2 units long) in the latent space. The logvar of each component is set equal to one. Constructing the prior in this way is beneficial in that it ensures that each component is evenly spaced within the latent space, but is limiting in that there must be exactly 2^d components in the GMM prior. Thus, to test, we use a dataset with 10 classes and a latent space dimensionality of 4, such that there are 2^4 = 16 gaussian components in the GMM prior. Though the number of prior components is greater than the number of classes, the latent mapping may still exhibit the pattern of classes forming clusters around the prior components, as the extra components may be made redundant.

The drawback of the decision to set the GMM components' means to the coordinates of the unit hypercube's vertices is that the manifold of the chosen dataset may not necessarily exist in 4 dimensions. Choosing gaussian components from a d-dimensional hypersphere in the latent space R^d would solve this issue, but there does not appear to be a solution for choosing n evenly spaced points spanning d dimensions on a d-dimensional hypersphere. KL divergence is calculated with a Monte Carlo approximation for the GMM prior and analytically for the unit prior.

Improvement Proposal 3

The third improvement proposal we look to evaluate is that, given a semi-supervised dataset, optimizing a DML model jointly with a VAE on the VAE's latent space will produce better clustering than the DML model individually. The intuition behind this approach is that DML methods can learn from only supervised data and VAE methods can learn from only unsupervised data; the proposed methodology will optimize both tasks simultaneously to learn from both supervised and unsupervised data.

The MetricVAE implementation we create jointly optimizes the VAE task and the DML task on the VAE latent space. The unsupervised loss is set to the VAE loss. The implementation uses the VAE with VampPrior model instead of the vanilla VAE.
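To make the fixed GMM prior of Improvement Proposal 2 concrete, the sketch below builds a non-learnable mixture with one component at each vertex of the [-1, 1]^d hypercube and estimates KL(q ‖ p) by Monte Carlo, as described above. Unit-variance components and the torch.distributions API are assumptions made for illustration; this is not the authors' code.

    import itertools
    import torch
    from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

    def hypercube_gmm_prior(d):
        # One Gaussian per vertex of the [-1, 1]^d hypercube: 2**d components,
        # equal weights, fixed (non-learnable) means, unit variance assumed.
        vertices = torch.tensor(list(itertools.product([-1.0, 1.0], repeat=d)))
        weights = Categorical(torch.full((2 ** d,), 1.0 / 2 ** d))
        components = Independent(Normal(vertices, torch.ones_like(vertices)), 1)
        return MixtureSameFamily(weights, components)

    def monte_carlo_kl(q, prior, n_samples=64):
        # KL(q || prior) estimated as the average of log q(z) - log prior(z)
        # over samples drawn from the approximate posterior q.
        z = q.rsample((n_samples,))
        return (q.log_prob(z) - prior.log_prob(z)).mean()

For example, with q = Independent(Normal(mu, sigma), 1) produced by the encoder, monte_carlo_kl(q, hypercube_gmm_prior(4)) would supply the unsupervised loss term in the alternating schedule sketched earlier.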




Results

Experimental Configuration

Each set of experiments shares a similar hyperparameter search space. Below we describe the hyperparameters that are included in the search space of each experiment and the evaluation method.

Learning Rate (lr): Through informal experimentation, we have found that a learning rate of 0.001 causes the models to converge consistently (relative to 0.005 and 0.0005). The learning rate is thus set to 0.001 in each experiment.

Latent Space Dimensionality (lsdim): Latent space dimensionality refers to the dimensionality of the vector output of the encoder of a DML network or the dimensionality of the posterior distribution of a VAE (also the dimensionality of the latent space). When the latent space dimensionality is 2, we see the added benefit of being able to plot the latent representations directly (though we can accomplish this through dimensionality reduction methods like tSNE for higher dimensionalities as well). Example values for this hyperparameter used in experiments are 2, 4, and 10.

Alpha: Alpha (α) is a hyperparameter which refers to the balance between the unsupervised and supervised losses of some of the modified DML models. More details about the role of α in the model implementations are discussed in the methodology section. Potential values for alpha lie between 0 (exclusive) and 1 (inclusive). We do not include 0 in this set because if α is set to 0, the model is equivalent to the fully supervised plain DML model, as the unsupervised loss would not be included. If α is set to 1, then the model trains on only the unsupervised loss; for instance, if the DML Autoencoder has α set to 1, then the model is equivalent to an autoencoder.

Partial Labels Percentage (pl%): The partial labels percentage hyperparameter refers to the percentage of the dataset that is labelled, and thus the size of the partition of the dataset that can be used for labelled training. Of course, each of the datasets we use is fully labelled, so a partially labelled dataset can be trivially constructed by ignoring some of the labels. As the sizes of the datasets vary, each percentage can refer to a different number of labelled samples. Values for the partial labels percentage we use across experiments include 0.01, 0.1, and 10 (with each value referring to a percentage).

Datasets: Two datasets are used for evaluating the models. The first dataset is MNIST ([LC10]), a very popular dataset in machine learning containing greyscale images of handwritten digits. The second dataset we use is the OrganAMNIST dataset from MedMNIST v2 ([YSW+ 21]). This dataset contains 2D slices from computed tomography images from the Liver Tumor Segmentation Benchmark; the labels correspond to the classification of 11 different body organs. The decision to use a second dataset was motivated by generalizability: as the improvement proposals are tested over more datasets, the results supporting the improvement proposals become more generalizable. The decision to use the OrganAMNIST dataset specifically is motivated in part by the Quinn Research Group working on similar tasks for biomedical imaging ([ZRS+ 20]). It is also motivated in part because OrganAMNIST is a more difficult dataset, at least for the classification task, as the leading accuracy for MNIST is .9991 ([ALP+ 20]) while the leading accuracy for OrganAMNIST is .951 ([YSW+ 21]). The MNIST and OrganAMNIST datasets are similar in dimensionality (1 x 28 x 28), number of samples (60,000 and 58,850, respectively), and in that they are both greyscale.

Fig. 1: Sample images from the MNIST (left) and OrganAMNIST of MedMNIST (right) datasets.

Evaluation: We evaluate the results by running each model on a test partition of data. We then take the latent points Z generated by the model and the corresponding labels Y. Three classifiers (sklearn's implementations of RandomForest, MLP, and kNN) each output predicted labels Ŷ for the latent points. In most of the charts shown, however, we only include the kNN classification output due to space constraints and the lack of meaningful difference between the outputs of the classifiers. We finally measure the quality of the predicted labels Ŷ using the Adjusted Mutual Information score (AMI) and accuracy (which is still helpful but is also easier to interpret in some cases). This scoring metric is common in research that looks to evaluate clustering performance ([ZG21], [EKGB16]). We will be using sklearn's implementation of AMI ([PVG+ 11]). The performance of a classifier on the latent points can intuitively be used as a measure of the quality of clustering.

Improvement Proposal 1 Results: Benefits of Reconstruction Loss

In evaluating the first improvement proposal, we compare the performance of the plain DML model to the DML Autoencoder model. We do so by comparing the performance of the plain DML system and the DML Autoencoder across a search space containing the lsdim, alpha, and pl% hyperparameters and both datasets.

In Table 1 and Table 2, we observe that for relatively small amounts of labelled samples (the partial labels percentages of 0.01 and 0.1 correspond to 6 and 60 labelled samples, respectively), the DML Autoencoder severely outperforms the DML model. However, when the number of labelled samples increases (the partial labels percentage of 10 corresponds to 6,000 labelled samples), the DML model significantly outperforms the DML Autoencoder. This trend is not too surprising: when there is sufficient data to train unsupervised methods and insufficient data to train supervised methods, as is the case for the 0.01 and 0.1 partial labels percentages, the unsupervised method will likely perform better.

The data looks to show that adding a reconstruction loss to a DML system can improve the quality of clustering in the latent representations on a semi-supervised dataset when there are small amounts (roughly less than 100 samples) of labelled data and a sufficient quantity of unlabelled data. But an important caveat is that it is not convincing that the DML Autoencoder effectively combined the unsupervised and supervised losses to create a superior model, as a plain autoencoder (i.e. the DML Autoencoder
with α = 1) outperforms the DML for partial labels percentages of 0.1% or less and underperforms the DML for the partial labels percentage of 10%.

Improvement Proposal 2 Results: Incorporating Inductive Bias with a Prior

In evaluating the second improvement proposal, we compare the performance of the plain DML model to the DML with a unit prior and the DML with a GMM prior. The DML with the GMM prior will have 2^2 = 4 gaussian components when lsdim = 2 and 2^4 = 16 components when lsdim = 4. Our broad intention is to see if changing the shape (specifically the number of components) of the prior can induce bias by affecting the pattern of embeddings. We hypothesize that when the GMM prior contains n components and n is slightly greater than or equal to the number of classes, each class will cluster around one of the prior components. We will test this for the GMM prior with 16 components (lsdim = 4), as both the MNIST and MedMNIST datasets have 10 classes. We are unable to set the number of GMM components to 10 as our GMM sampling method only allows for the number of components to equal a power of 2. Baseline models include a plain DML and a DML with a unit prior (the distribution N(0, 1)).

In Table 3, it is very evident that across both datasets, the DML models with any prior distribution all devolve to the null model (i.e. the classifier is no better than random selection). From the visualizations of the latent embeddings, we see that the embedded data for the DML models with priors appears completely random. In the case of the GMM prior, it also does not appear to take on the shape of the prior or reflect the number of components in the prior. This may be due to the training routine of the DML models. As the KL divergence loss, which can be said to "fit" the embeddings to the prior, trains on alternating epochs with the supervised DML loss, it is possible that the two losses are not balanced correctly during the training process. From the discussed results, it is fair to state that adding a prior distribution to a DML model by training the model on the KL divergence between the prior and approximated posterior distributions on alternating epochs is not an effective way to induce bias in the latent space.

Improvement Proposal 3 Results: Jointly Optimizing DML with VAE

To evaluate the third improvement proposal, we compare the performance of DMLs to MetricVAEs (defined in the previous section) across several metric losses. We run experiments for triplet loss, supervised loss, and center loss DML and MetricVAE models. To evaluate the improvement proposal, we assess whether the model performance improves for the MetricVAE over the DML for the same metric loss and other hyperparameters.

Like the previous improvement proposal, the proposed MetricVAE model does not perform better than the null model. As with improvement proposal 2, it is possible this is because the training routine of alternating between the supervised loss (in this case, metric loss) and the unsupervised loss (in this case, VAE loss) is not optimal for training the model.

We have also trained a separate combined VAE and DML model which trains on both the unsupervised and supervised losses each epoch instead of alternating between the two. In the results for this model, we see that an alpha value over zero (i.e. incorporating the supervised metric loss into the overall MVAE loss function) can help improve performance, especially at lower dimensionalities. Given our analysis of the data, we see that incorporating the DML loss into the VAE is potentially helpful, but only when training the unsupervised and supervised losses jointly. Even in that case, it is unclear whether the MVAE performs better than the corresponding DML model, even if it does perform better than the corresponding VAE model.
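A minimal sketch of the difference between the two training schemes discussed above: the jointly trained variant optimizes the combined objective L = L_U + γ·L_S on every batch rather than alternating losses across epochs (Table 4 uses the same form of the objective). The function and variable names are illustrative assumptions, and the reconstruction term stands in only schematically for the full VAE loss; this is not the authors' implementation.

    import torch.nn.functional as F

    def joint_step(model, decoder, batch, optimizer, gamma):
        # One optimization step on L = L_U + gamma * L_S, where a reconstruction
        # term stands in for the unsupervised (VAE) loss L_U and a triplet loss
        # provides the supervised metric term L_S.
        (anchor, positive, negative), x = batch  # labelled triplet plus raw inputs
        optimizer.zero_grad()
        supervised = F.triplet_margin_loss(model(anchor), model(positive), model(negative))
        unsupervised = F.mse_loss(decoder(model(x)), x)
        loss = unsupervised + gamma * supervised
        loss.backward()
        optimizer.step()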




Fig. 2: Table 1: Comparison of the DML (left) and DML Autoencoder (right) models for the MNIST dataset. Bolded values indicate best
performance for each partial labels percentage partition (pl%).




Fig. 3: Table 2: Comparison of the DML (left) and DML Autoencoder (right) models for the MedMNIST dataset.


Conclusion

In this work, we have set out to determine how DML can be extended for semi-supervised datasets by borrowing components of the variational autoencoder. We have formalized this approach through defining three specific improvement proposals. To evaluate each improvement proposal, we have created several variations of the DML model, such as the DML Autoencoder, the DML with Unit/GMM Prior, and the MVAE. We then tested the performance of the models across several semi-supervised partitions of two datasets, along with other configurations of hyperparameters.

We have determined from the analysis of our results that there is too much dissenting data to clearly accept any of the three improvement proposals. For improvement proposal 1, while the DML Autoencoder outperforms the DML for semi-supervised datasets with small amounts of labelled data, its performance is not consistently much better than that of a plain autoencoder which uses no labelled data. For improvement proposal 2, each of the DML models with an added prior performed extremely poorly, near or at the level of the null model. For improvement proposal 3, we see the same extremely poor performance from the MVAE models.

From the results for improvement proposals 1 and 3, we find that there may be potential in incorporating the autoencoder and VAE loss terms into DML systems. However, we were unable to show that any of these improvement proposals would consistently outperform both the DML and fully unsupervised architectures in semi-supervised settings. We also found that the training routine used for the improvement proposals, in which the loss function would alternate between supervised and unsupervised losses each epoch, was not effective. This is especially evident in comparing the two combined VAE DML models for improvement proposal 3.

Future Work

In the future, it would be worthwhile to evaluate these improvement proposals using a different training routine. We have stated previously that perhaps the extremely poor performance of the DML with a prior and the MVAE models may be due to alternating between training against a supervised and an unsupervised loss. Further research could look to develop or compare several different training routines. One alternative would be alternating between losses at each batch instead of each epoch. Another alternative, specifically for the MVAE, may be first training the DML on labelled data, training a GMM on its outputs, and then using that GMM as the prior distribution for the VAE.

Another potentially interesting avenue for future study is investigating a fourth improvement proposal: the ability to define a Riemannian metric on the latent space. Previous research has shown a Riemannian metric can be computed on the latent space of the VAE by computing the pull-back metric of the VAE's decoder function ([AHS20]). Through the Riemannian metric we could calculate metric losses such as triplet loss with a geodesic instead of a euclidean distance. The geodesic distance may be a more accurate representation of similarity in the latent space than euclidean distance, as it accounts for the structure of the input data.

Fig. 4: Table 3: Comparison of the DML model (left) and the DML with prior models with a unit gaussian prior (center) and GMM prior (right) for the MNIST dataset.

Fig. 5: Comparison of latent spaces for the DML with unit prior (left) and the DML with GMM prior containing 4 components (right) for lsdim = 2 on the OrganAMNIST dataset. The gaussian components are shown in black with the radius equal to the variance (1). There appears to be no evidence of the distinct gaussian components in the latent space on the right. It does appear that the unit prior may regularize the magnitude of the latent vectors.

Fig. 6: Graph of reconstruction loss (a component of the unsupervised loss) of the MVAE across epochs. The unsupervised loss does not converge despite being trained on each epoch.

Fig. 7: Table 4: Experiments performed on the MVAE architecture trained on the objective function L = L_U + γ·L_S on the fully labelled (fully supervised) MNIST dataset. The best results for classification accuracy on the MVAE embeddings in a given latent dimensionality are bolded.

REFERENCES

[AHS20]   Georgios Arvanitidis, Søren Hauberg, and Bernhard Schölkopf. Geometrically enriched latent spaces. arXiv preprint arXiv:2008.00565, 2020. doi:10.48550/arXiv.2008.00565.
[ALP+ 20] Sanghyeon An, Min Jun Lee, Sanglee Park, Heerin Yang, and Jungmin So. An ensemble of simple convolutional neural network models for MNIST digit recognition. CoRR, abs/2008.10400, 2020. URL: https://arxiv.org/abs/2008.10400, arXiv:2008.10400, doi:10.48550/arXiv.2008.10400.
[ARJM18]  Ali Arshad, Saman Riaz, Licheng Jiao, and Aparna Murthy. Semi-supervised deep fuzzy c-mean clustering for software fault prediction. IEEE Access, 6:25675–25685, 2018. doi:10.1109/ACCESS.2018.2835304.
[BBM04]   Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the twenty-first international conference on Machine learning, page 11, 2004. doi:10.1145/1015330.1015360.
[BS09]    Mahdieh Soleymani Baghshah and Saeed Bagheri Shouraki. Semi-supervised metric learning using pairwise constraints. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.
[DCGO19]  Sara Dahmani, Vincent Colotte, Valérian Girard, and Slim Ouni. Conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis. In INTERSPEECH 2019 - 20th Annual Conference of the International Speech Communication Association, 2019. doi:10.21437/interspeech.2019-2848.
[DHS21]     Ujjal Kr Dutta, Mehrtash Harandi, and Chellu Chandra Sekhar.
            Semi-supervised metric learning: A deep resurrection. 2021.
            doi:10.48550/arXiv.2105.05061.
[EKGB16]    Scott Emmons, Stephen Kobourov, Mike Gallant, and Katy
            Börner. Analysis of network clustering algorithms and clus-
            ter quality metrics at scale. PloS one, 11(7):e0159161, 2016.
            doi:10.1371/journal.pone.0159161.
[GTM+ 21]   Antoine Grosnit, Rasul Tutunov, Alexandre Max Maraval, Ryan-
            Rhys Griffiths, Alexander I Cowen-Rivers, Lin Yang, Lin Zhu,
            Wenlong Lyu, Zhitang Chen, Jun Wang, et al. High-dimensional
            bayesian optimisation with variational autoencoders and deep
            metric learning. arXiv preprint arXiv:2106.03609, 2021. doi:
            10.48550/arXiv.2106.03609.
[KCJ20]     Ajinkya Kulkarni, Vincent Colotte, and Denis Jouvet. Deep
            variational metric learning for transfer of expressivity in multi-
            speaker text to speech. In International Conference on Statistical
            Language and Speech Processing, pages 157–168. Springer, 2020.
            doi:10.1007/978-3-030-59430-5_13.
[LC10]      Yann LeCun and Corinna Cortes. MNIST handwritten digit
            database. 2010. URL: http://yann.lecun.com/exdb/mnist/ [cited
            2016-01-14 14:24:11].
[LDD+ 18]   Xudong Lin, Yueqi Duan, Qiyuan Dong, Jiwen Lu, and Jie Zhou.
            Deep variational metric learning. In Proceedings of the European
            Conference on Computer Vision (ECCV), pages 689–704, 2018.
            doi:10.1007/978-3-030-01267-0_42.
[LYZ+ 19]   Xiaocui Li, Hongzhi Yin, Ke Zhou, Hongxu Chen, Shazia Sadiq,
            and Xiaofang Zhou. Semi-supervised clustering with deep metric
            learning. In International Conference on Database Systems for
            Advanced Applications, pages 383–386. Springer, 2019. doi:
            10.1007/978-3-030-18590-9_50.
[PVG+ 11]   F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
            O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
            J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,
            and E. Duchesnay. Scikit-learn: Machine learning in Python.
            Journal of Machine Learning Research, 12:2825–2830, 2011.
[RHD+ 19]   Yazhou Ren, Kangrong Hu, Xinyi Dai, Lili Pan, Steven CH Hoi,
            and Zenglin Xu. Semi-supervised deep embedded clustering. Neu-
            rocomputing, 325:121–130, 2019. doi:10.1016/j.neucom.
            2018.10.016.
[SKP15]     Florian Schroff, Dmitry Kalenichenko, and James Philbin.
            Facenet: A unified embedding for face recognition and clus-
            tering. In Proceedings of the IEEE conference on computer
            vision and pattern recognition, pages 815–823, 2015. doi:
            10.1109/cvpr.2015.7298682.
[SLY15]     Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning struc-
            tured output representation using deep conditional generative
            models. Advances in neural information processing systems,
            28:3483–3491, 2015.
[Soh16]     Kihyuk Sohn. Improved deep metric learning with multi-class n-
            pair loss objective. In Advances in neural information processing
            systems, pages 1857–1865, 2016.
[WFZ20]     Sanyou Wu, Xingdong Feng, and Fan Zhou. Metric learning
            by similarity network for deep semi-supervised learning. In
            Developments of Artificial Intelligence Technologies in Compu-
            tation and Robotics: Proceedings of the 14th International FLINS
            Conference (FLINS 2020), pages 995–1002. World Scientific,
            2020. doi:10.1142/9789811223334_0120.
[WYF13]     Qianying Wang, Pong C Yuen, and Guocan Feng. Semi-
            supervised metric learning via topology preserving multiple semi-
            supervised assumptions. Pattern Recognition, 46(9):2576–2587,
            2013. doi:10.1016/j.patcog.2013.02.015.
[YSW+ 21]   Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao,
            Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2:
            A large-scale lightweight benchmark for 2d and 3d biomedical
            image classification. arXiv preprint arXiv:2110.14795, 2021.
            doi:10.48550/arXiv.2110.14795.
[ZG21]      Zhen Zhu and Yuan Gao. Finding cross-border collaborative
            centres in biopharma patent networks: A clustering comparison
            approach based on adjusted mutual information. In International
            Conference on Complex Networks and Their Applications, pages
            62–72. Springer, 2021. doi:10.1007/978-3-030-93409-
            5_6.
[ZRS+ 20]   Meekail Zain, Sonia Rao, Nathan Safir, Quinn Wyner, Isabella
            Humphrey, Alexa Eldridge, Chenxiao Li, BahaaEddin AlAila,
            and Shannon P. Quinn. Towards an unsupervised spatiotemporal
            representation of cilia video using a modular generative pipeline.
            2020. doi:10.25080/majora-342d178e-017.




A Python Pipeline for Rapid Application Development (RAD)

Scott D. Christensen‡∗, Marvin S. Brown‡, Robert B. Haehnel‡, Joshua Q. Church‡, Amanda Catlett‡, Dallon C. Schofield‡, Quyen T. Brannon‡, Stacy T. Smith‡
Abstract—Rapid Application Development (RAD) is the ability to rapidly pro-              Python ecosystem provides a rich set of tools that can be applied to
totype an interactive interface through frequent feedback, so that it can be             various data sources to provide valuable insights. These insitghts
quickly deployed and delivered to stakeholders and customers. RAD is a critical          can be integrated into decision support systems that can enhance
capability needed to meet the ever-evolving demands in scientific research and           the information available when making mission critical decisions.
data science. To further this capability in the Python ecosystem, we have curated
                                                                                         Yet, while the opportunities are vast, the ability to get the resources
and developed a set of open-source tools, including Panel, Bokeh, and Tethys
Platform. These tools enable prototyping interfaces in a Jupyter Notebook and
                                                                                         necessary to pursue those opportunities requires effective and
facilitate the progression of the interface into a fully-featured, deployable web-       timely communication of the value and feasibility of a proposed
application.                                                                             project.
                                                                                             We have found that rapid prototyping is a very impactful way
Index Terms—web app, Panel, Tethys, Tethys Platform, Bokeh, Jupyter                      to concretely show the value that can be obtained from a proposal.
                                                                                         Moreover, it also illustrates with clarity that the project is feasible
                                                                                         and likely to succeed. Many scientific workflows are developed in
Introduction
                                                                                         Python, and often the prototyping phase is done in a Jupyter Note-
With the tools for data science continually improving and an al-                         book. The Jupyter environment provides an easy way to quickly
most innumerable supply of new data sources, there are seemingly                         modify code and visualize output. However, the visualizations are
endless opportunities to create new insights and decision support                        interlaced with the code and thus it does not serve as an ideal way
systems. Yet, an investment of resources are needed to extract                           demonstrate the prototype to stakeholders, that may not be familiar
the value from data using new and improved tools. Well-timed                             with Jupyter Notebooks or code. The Jupyter Dashboard project
and impactful proposals are necessary to gain the support and                            was addressing this issue before support for it was dropped in
resources needed from stakeholders and decision makers to pursue                         2017. To address this technical gap, we worked with the Holoviz
these opportunities. The ability to rapidly prototype capabilities                       team to develop the Panel library. [Panel] Panel is a high-level
and new ideas provides a powerful visual tool to communicate                             Python library for developing apps and dashboards. It enables
the impact of a proposal. Interactive applications are even more                         building layouts with interactive widgets in a Jupyter Notebook
impactful by engaging the user in the data analysis process.                             environment, but can then easily transition to serving the same
     After a prototype is implemented to communicate ideas and                           code on a standalone secure webserver. This capability enabled
feasibility of a project, additional success is determined by the                        us to rapidly prototype workflows and dashboards that could be
ability to produce the end product on time and within budget.                            directly accessed by potential sponsors.
If the deployable product needs to be completely re-written using                            Panel makes prototyping and deploying simple. It can also
different tools, programming languages, and/or frameworks from the                       be iterative. As new features are developed we can continue to
prototype, then significantly more time and resources are required.                      work in the Jupyter Notebook environment and then seamlessly
The ability to quickly mature a prototype to a production-ready                          transition the new code to a deployed application. As appli-
application using the same tool stack can make the difference in                         cations continue to mature, they often require production-level
the success of a project.                                                                features. Panel apps are deployed via Bokeh, and the Bokeh
                                                                                         framework lacks some aspects that are needed in some production
Background                                                                               applications (e.g. a user management system for authentication
                                                                                         and permissions, and a database to persist data beyond a session).
At the US Army Engineer Research and Development Center                                  Bokeh doesn’t provide either of these aspects natively.
(ERDC) there are evolving needs to support the missions of the                               Tethys Platform is a Django-based web framework that is
US Army Corps of Engineers and our partners. The scientific                              geared toward making scientific web applications easier to de-
                                                                                         velop by scientists and engineers. [Swain] It provides a Python
* Corresponding author: Scott.D.Christensen@usace.army.mil
‡ US Army Engineer Research and Development Center                                       Software Development Kit (SDK) that enables web apps to be
                                                                                         created almost purely in Python, while still leaving the flexibility
Copyright © 2022 Scott D. Christensen et al. This is an open-access article              to add custom HTML, JavaScript, and CSS. Tethys provides
distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium,            user management and role-based permissions control. It also
provided the original author and source are credited.                                    enables database persistence and computational job management

[Christensen], in addition to many visualization tools. Tethys of-
fers the power of a fully-featured web framework without the need
to be an expert in full-stack web development. However, Tethys
lacks the ease of prototyping in a Jupyter Notebook environment
that is provided by Panel.
    To support both the rapid prototyping capability provided
by Panel and the production-level features of Tethys Platform,
we needed a pipeline that could take our Panel-based code
and integrate it into the Tethys Platform framework. Through
collaborations with the Bokeh development team and developers
at Aquaveo, LLC, we were able to create that integration of
Panel (Bokeh) and Tethys. This paper demonstrates the seamless
pipeline that facilitates Rapid Application Development (RAD).
In the next section we describe how the RAD pipeline is used at
the ERDC for a particular use case, but first we will provide some
background on the use case itself.                                    Fig. 1: Collective Sweep Inputs Stage rendered in a Jupyter Notebook.
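    The pattern is illustrated by the following minimal sketch, which is generic and not taken from the Helios app; the widget, function, and file names are invented for illustration. The same layout renders inline in a Jupyter Notebook and can be served unchanged with the panel serve command.

import numpy as np
import panel as pn

pn.extension()

# Stand-in for an analysis step; a real workflow would wrap model setup,
# HPC submission, and result visualization here
def summarize(amplitude):
    x = np.linspace(0, 2 * np.pi, 200)
    return f"Peak response: {amplitude * np.sin(x).max():.3f}"

amplitude = pn.widgets.FloatSlider(name='Amplitude', start=0.1, end=5.0, value=1.0)
dashboard = pn.Column(amplitude, pn.bind(summarize, amplitude))

dashboard.servable()  # shows inline in a notebook; 'panel serve app.py' deploys it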

Use Case
Helios is a computational fluid dynamics (CFD) code for simulat-
ing rotorcraft. It is very computationally demanding and requires
High Performance Computing (HPC) resources to execute any-
thing but the most basic of models. At the ERDC we often face a
need to run parameter sweeps to determine the effects of varying
a particular parameter (or set of parameters). Setting up a Helios
model to run on the HPC is a somewhat involved process that
requires file management and creating a script to submit the job
to the queueing system. When executing a parameter sweep the
process becomes even more cumbersome, and is often avoided.
    While tedious to perform manually, the process of modifying
input files, transferring to the HPC, and generating and submitting
job scripts to the HPC queueing system can be automated
with Python. Furthermore, it can be made much more accessible,
even to those without extensive knowledge of how Helios works,
through a web-based interface.
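    A purely illustrative sketch of that kind of automation, using only the Python standard library, is shown below. The input format, file names, and PBS directives are invented, and the actual Helios workflow described in the Methods section submits jobs through UIT+ rather than hand-rolled scripts.

from pathlib import Path
from string import Template

# Hypothetical job-script template; directives and file names are made up
PBS_TEMPLATE = Template("""\
#!/bin/bash
#PBS -N helios_${case_name}
#PBS -l walltime=24:00:00
cd $${PBS_O_WORKDIR}
helios --input ${input_file}
""")

def write_sweep(collective_values, workdir='sweep'):
    # One modified input file and one job script per parameter value
    for value in collective_values:
        case = Path(workdir) / f"collective_{value:+.1f}"
        case.mkdir(parents=True, exist_ok=True)
        (case / 'helios.inp').write_text(f"collective = {value}\n")
        (case / 'run.pbs').write_text(
            PBS_TEMPLATE.substitute(case_name=case.name, input_file='helios.inp'))
        # Submission (e.g. qsub) would be issued through UIT+ in practice

write_sweep([-2.0, 0.0, 2.0, 4.0])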

Methods
To automate the process of submitting Helios model parameter
sweeps to the HPC via a simple interactive web application            Fig. 2: Collective Sweep Inputs Stage rendered as a stand-alone
we developed and used the RAD pipeline. Initially three Helios        Bokeh app.
parameter sweep workflows were identified:
   1)    Collective Sweep
                                                                      API to execute commands on the login nodes of the DoD HPC
   2)    Speed Sweep
                                                                      systems. The PyUIT library provides a Python wrapper for the
   3)    Ensemble Analysis
                                                                      UIT+ REST API. Additionally, it provides Panel-based interfaces
   The process of submitting each of these workflows to the HPC       for each of the workflow steps listed above. Panel refers to a
was similar. They each involved the same basic steps:                 workflow comprised of a sequence of steps as a pipeline, and
                                                                      each step in the pipeline is called a stage. Thus, PyUIT provides a
   1)    Authentication to the HPC                                    template stage class for each step in the basic HPC workflow.
   2)    Connecting to a specific HPC system
                                                                          The PyUIT pipeline stages were customized to create inter-
   3)    Specifying the parameter sweep inputs
                                                                      faces for each of the three Helios workflows. Other than the
   4)    Submitting the job to the queueing system
                                                                      inputs stage, the rest of the stages are the same for each of the
   5)    Monitoring the job as it runs
                                                                      workflows (See figures 1, 2, and 3). The inputs stage allows the
   6)    Visualizing the results
                                                                      user to select a Helios input file and then provides inputs to allow
    In fact, these steps are essentially the same for any job being   the user to specify the values for the parameter(s) that will be
submitted to the HPC. To ensure that we were able to reuse            varied in the sweep. Each of these stages was first created in a
as much code as possible we created PyUIT, a generic, open-           Jupyter Notebook. We were then able to deploy each workflow as
source Python library that enables this workflow. The ability to      a standalone Bokeh application. Finally we integrated the Panel-
authenticate and connect to the DoD HPC systems is enabled            based app into Tethys to leverage the compute job management
by a service called User Interface Toolkit Plus (UIT+). [PyUIT]       system and single-sign-on authentication.
UIT+ provides an OAuth2 authentication service and a RESTful              As additional features are required, we are able to leverage




                                                                      Fig. 5: The Helios Tethys App is the framework for launching each of
                                                                      the three Panel-based Helios parameter sweep workflows.



Fig. 3: Collective Sweep Inputs Stage rendered in the Helios Tethys
App.


the same pipeline: first developing the capability in a Jupyter
Notebook, then testing with a Bokeh-served app, and finally, a
full integration into Tethys.
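    The stage-and-pipeline pattern that PyUIT builds on can be sketched with Panel's param-based Pipeline API, as below. The stage names and fields are invented for illustration and are not PyUIT's actual classes.

import param
import panel as pn

pn.extension()

class SweepInputsStage(param.Parameterized):
    # Hypothetical inputs stage: choose the range of a single sweep parameter
    start = param.Number(default=0.0)
    stop = param.Number(default=10.0)

    @param.output(sweep_values=param.List)
    def output(self):
        return [self.start, self.stop]

    def panel(self):
        return pn.Column(self.param.start, self.param.stop)

class SubmitStage(param.Parameterized):
    # Receives the previous stage's output; a real stage would build and
    # submit the HPC jobs here
    sweep_values = param.List(default=[])

    def panel(self):
        return pn.pane.Str(f"Would submit one job per value in {self.sweep_values}")

pipeline = pn.pipeline.Pipeline()
pipeline.add_stage('Inputs', SweepInputsStage)
pipeline.add_stage('Submit', SubmitStage)
pipeline  # displaying the pipeline in a notebook renders the staged interface

The same object can then be embedded in a Bokeh-served app or a Tethys page, which is the progression the RAD pipeline relies on.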

Results
By integrating the Panel workflows into the Helios Tethys app
we can take advantage of Tethys Platform features, such as the
jobs table, which persists metadata about computational jobs in a
database.
                                                                      Fig. 6: Actions associated with a job. The available actions depend
                                                                      on the job’s status.


                                                                      results to view. The pages that display the results are built with
                                                                      Panel, but Tethys enables them to be populated with information
                                                                      about the job from the database. Figure 7 shows the Tracking Data
                                                                      tab of the results viewer page. The plot is a dynamic Bokeh plot
                                                                      that enables the user to select the data to plot on each axis. This
                                                                      particular plot is showing the variation of the coefficient of drag of
                                                                      the fuselage body over the simulation time.
                                                                          Figure 8 shows what is called CoViz data, or data that is
                                                                      extracted from the solution as the model is running. This image is
                                                                      showing an isosurface colored by density.
Fig. 4: Helios Tethys App home page showing a table of previously
submitted Helios simulations.
                                                                      Conclusion
                                                                      The Helios Tethys App has demonstrated the value of the RAD pi-
    Each of the three workflows can be launched from the home
                                                                      peline, which enables both rapid prototyping and rapid progression
page of the Helios Tethys app as shown in Figure 5. Although
                                                                      to production. This enables researchers to quickly communicate
the home page was created in the Tethys framework, once the
                                                                      and prove ideas and deliver successful products on time. In
workflows are launched the same Panel code that was previously
                                                                      addition to the Helios Tethys App, RAD has been instrumental
developed is called to display the workflow (refer to figures 1, 2,
                                                                      for the mission success of various projects at the ERDC.
and 3).
    From the Tethys Jobs Table different actions are available for
each job including viewing results once the job has completed (see    R EFERENCES
Figure 6).                                                            [Christensen] Christensen, S. D., Swain, N. R., Jones, N. L., Nelson, E.
    Viewing job results is much more natural in the Tethys app. Helios                     J., Snow, A. D., & Dolder, H. G. (2017). A Comprehensive
jobs often take multiple days to complete. By embedding the                         Python Toolkit for Accessing High-Throughput Computing to
                                                                                    Support Large Hydrologic Modeling Tasks. JAWRA Journal
Helios Panel workflows in Tethys users can leave the web app                        of the American Water Resources Association, 53(2), 333-343.
(ending their session), and then come back later and pull up the                    https://doi.org/10.1111/1752-1688.12455




Fig. 7: Timeseries output associated with a Helios Speed Sweep run.




  Fig. 8: Isosurface visualization from a Helios Speed Sweep run.


[Panel]      https://www.panel.org
[PyUIT]      https://github.com/erdc/pyuit
[Swain]      Swain, N. R., Christensen, S. D., Snow, A. D., Dolder, H.,
             Espinoza-Dávalos, G., Goharian, E., Jones, N. L., Ames, D.P.,
             & Burian, S. J. (2016). A new open source platform for
             lowering the barrier for environmental web app development.
             Environmental Modelling & Software, 85, 11-26. https://doi.
             org/10.1016/j.envsoft.2016.08.003




            Monaco: A Monte Carlo Library for Performing
               Uncertainty and Sensitivity Analyses
                                                                    W. Scott Shambaugh∗






Abstract—This paper introduces monaco, a Python library for conducting                   integration), tailored towards training neural nets, or require a
Monte Carlo simulations of computational models, and performing uncertainty              deep statistical background to use. See [OGA+ 20], [RJS+ 21], and
analysis (UA) and sensitivity analysis (SA) on the results. UA and SA are critical       [DSICJ20] for an overview of the currently available Python tools
to effective and responsible use of models in science, engineering, and public           for performing UA and SA. For the domain expert who wants to
policy; however, their use is uncommon. By providing a simple, general, and
                                                                                         perform UA and SA on their existing models, there is not an easy
rigorous-by-default library that wraps around existing models, monaco makes
UA and SA easy and accessible to practitioners with a basic knowledge of
                                                                                         tool to do both in a single shot. monaco was written to address
statistics.                                                                              this gap.

Index Terms—Monte Carlo, Modeling, Uncertainty Quantification, Uncertainty
Analysis, Sensitivity Analysis, Decision-Making, Ensemble Prediction, VARS, D-
VARS


Introduction                                                                                               Fig. 1: The monaco project logo.
Computational models form the backbone of decision-making
processes in science, engineering, and public policy. However,
our increased reliance on these models stands in contrast to the                         Motivation for Monte Carlo Approach
difficulty in understanding them as we add increasing complexity
                                                                                         Mathematical Grounding
to try and capture ever more of the fine details of real-world
interactions. Practitioners will often take the results of their large,                  Randomized Monte Carlo sampling offers a cure to the curse of
complex model as a point estimate, with no knowledge of how                              dimensionality: consider an investigation of the output from k
uncertain those results are [FST16]. Multiple-scenario modeling                          input factors y = f(x_1, x_2, ..., x_k) where each factor is uniformly
(e.g. looking at a worst-case, most-likely, and best-case scenario)                      sampled between 0 and 1, x_i ∈ U[0, 1]. The input space is then a
is an improvement, but a complete global exploration of the input                        k-dimensional hypercube with volume 1. If each input is varied
space is needed. That gives insight into the overall distribution of                     one at a time (OAT), then the volume V of the convex hull of the
results (UA) as well as the relative influence of the different input                    sampled points forms a hyperoctahedron with volume V = 1/k! (or,
factors on the output variance (SA). This complete understanding is                      optimistically, a hypersphere with V = π^(k/2) / (2^k · Γ(k/2+1))), both of which
critical for effective and responsible use of models in any decision-                    decrease super-exponentially as k increases. Unless the model is
making process, and policy papers have identified UA and SA as                           known to be linear, this leaves the input space wholly unexplored.
key modeling practices [ALMR20] [EPA09].                                                 In contrast, the volume of the convex hull of n → ∞ random
     Despite the importance of UA and SA, recent literature reviews                      samples as is obtained with a Monte Carlo approach will converge
show that they are uncommon – in 2014 only 1.3% of all published                         to V = 1, with much better coverage within that volume as well
papers [FST16] using modeling performed any SA. And even                                 [DFM92]. See Fig. 2.
when performed, best practices are usually lacking – amongst
papers which specifically claimed to perform sensitivity analysis,                       Benefits and Drawbacks of Basic Monte Carlo Sampling
a 2019 review found only 21% performed global (as opposed to                             monaco focuses on forward uncertainty propagation with basic
local or zero) UA, and 41% performed global SA [SAB+ 19].                                Monte Carlo sampling. This has several benefits:
     Typically, UA and SA are done using Monte Carlo simula-
                                                                                             •   The method is conceptually simple, lowering the barrier of
tions, for reasons explored in the following section. There are
                                                                                                 entry and increasing the ease of communicating results to
Monte Carlo frameworks available, however existing options are
                                                                                                 a broader audience.
largely domain-specific, focused on narrow sub-problems (e.g.
                                                                                             •   The same sample points can be used for UA and SA. Gen-
* Corresponding author: wsshambaugh@gmail.com                                                    erally, Bayesian methods such as Markov Chain Monte
                                                                                                 Carlo provide much faster convergence on UA quantities
Copyright © 2022 W. Scott Shambaugh. This is an open-access article dis-                         of interest, but their undersampling of regions that do not
tributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, pro-                     contribute to the desired quantities is inadequate for SA
vided the original author and source are credited.                                               and complete exploration of the input space. The author’s




                                                                       Fig. 3: Monte Carlo workflow for understanding the full behavior of
                                                                       a computational model, inspired by [SAB+ 19].
Fig. 2: Volume fraction V of a k-dimensional hypercube enclosed by
the convex hull of n → ∞ random samples versus OAT samples along
the principal axes of the input space.                                 monaco Structure
                                                                       Overall Structure
       experience aligns with    [SAB+ 19]   in that there is great    Broadly, each input factor and model output is a variable that
       practical benefit in broad sampling without pigeonholing        can be thought of as lists (rows) containing the full range of
       one’s purview to particular posteriors, through uncovering      randomized values. Cases are slices (columns) that take the i’th
       bugs and edge cases in regions of input space that were         input and output value for each variable, and represent a single
       not being previously considered.                                run of the model. Each case is run on its own, and the output
   •   It can be applied to domains that are not data-rich. See for    values are collected into output variables. Fig. 4 shows a visual
       example NASA’s use of Monte Carlo simulations during            representation of this.
       rocket design prior to collecting test flight data [HB10].
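    To put numbers on the volume formulas from the Mathematical Grounding section, a quick standalone check (not part of monaco) evaluates both expressions as k grows:

from math import factorial, gamma, pi

# Fraction of the unit hypercube covered by the convex hull of OAT samples:
# a hyperoctahedron (1/k!) or, at best, an inscribed hypersphere
# (pi^(k/2) / (2^k * Gamma(k/2 + 1))); random sampling converges to 1
for k in (2, 5, 10, 20):
    octahedron = 1 / factorial(k)
    sphere = pi ** (k / 2) / (2 ** k * gamma(k / 2 + 1))
    print(f"k={k:2d}  octahedron={octahedron:.2e}  sphere={sphere:.2e}")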
    However, basic Monte Carlo sampling is subject to the classi-
cal drawbacks of the method such as poor sampling of rare events
and the slow σ/√n convergence on quantities of interest. If the
outputs and regions of interest are firmly known at the outset, then
other sampling methods will be more efficient [KTB13].
    Additionally, given that any conclusions are conditional on
the correctness of the underlying model and input parameters,
the task of validation is critical to confidence in the UA and SA
results. However, this is currently out of scope for the library
and must be performed with other tools. In a data-poor domain,
hypothesis testing or probabilistic prediction measures like loss
scores can be used to anchor the outputs against a small number
of real-life test data. More generally, the "inverse problem" of
model and parameter validation is a deep field unto itself and
[C+ 12] and [SLKW08] are recommended as overviews of some
methods. If monaco’s scope is too limited for the reader’s needs,
the author recommends UQpy [OGA+ 20] for UA and SA, and
PyMC [SWF16] or Stan [CGH+ 17] as good general-purpose
                                                                       Fig. 4: Structure of a monaco simulation, showing the relationship
probabilistic programming Python libraries.                            between the major objects and functions. This maps onto the central
                                                                       block in Fig. 3.
Workflow
UA and SA of any model follows a common workflow. Probability
distributions for the model inputs are defined, and randomly           Simulation Setup
sampled values for a large number of cases are fed to the model.      is formed by passing it a name, the number of random draws
The outputs from each case are collected and the full set of          ndraws, and a dict fcns of the handles for three user-defined
inputs and outputs can be analyzed. Typically, UA is performed         ncases, and a dict fcns of the handles for three user-defined
by generating histograms, scatter plots, and summary statistics for    functions detailed in the next section. A random seed that then
the output variables, and SA is performed by looking at the effect     seeds the entire simulation can also be passed in here, and is
of input on output variables through scatter plots, performing         highly recommended for repeatability of results.
regressions, and calculating sensitivity indices. These results can        Input variables then need to be defined. monaco takes in the
then be compared to real-world test data to validate the model or      handle to any of scipy.stat’s continuous or discrete probability
inform revisions to the model and input variables. See Fig. 3.         distributions, as well as the required arguments for that probability
     Note that with model and input parameter validation currently     distribution [VGO+ 20]. If nonnumeric inputs are desired, the
outside monaco’s scope, closing that part of the workflow loop is      method can also take in a nummap dictionary which maps the
left up to the user.                                                   randomly drawn integers to values of other types.
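    A minimal setup sketch, separate from the dice example at the end of the paper, is shown below; the model functions are trivial placeholders, and the nummap keyword follows the description above.

import monaco as mc
from scipy.stats import norm, randint

# Placeholder model functions following the three-function structure
# described in the next section
def sketch_run(wind_speed, material):
    load = wind_speed * (2.0 if material == 'steel' else 1.0)
    return (load, )

def sketch_preprocess(case):
    return (case.invals['wind_speed'].val, case.invals['material'].val)

def sketch_postprocess(case, load):
    case.addOutVal(name='Load', val=load)

fcns = {'run'        : sketch_run,
        'preprocess' : sketch_preprocess,
        'postprocess': sketch_postprocess}

sim = mc.Sim(name='Setup Sketch', ndraws=256, fcns=fcns, seed=12362398)

# A continuous input: a scipy.stats distribution handle plus its arguments
sim.addInVar(name='wind_speed', dist=norm,
             distkwargs={'loc': 5.0, 'scale': 2.0})

# A nonnumeric input: drawn integers are mapped to labels via a nummap dict
sim.addInVar(name='material', dist=randint,
             distkwargs={'low': 0, 'high': 2+1},
             nummap={0: 'steel', 1: 'aluminum', 2: 'titanium'})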

    At this point the sim can be run. The randomized drawing                                   nonnumeric, a valmap dict assigning numbers to
of input values, creation of cases, running of those cases, and                                each unique value is automatically generated.
extraction of output values are automatically executed.
                                                                                 4)    Calculate statistics & sensitivities for input & output
User-Defined Functions                                                                 variables.
                                                                                 5)    Plot variables, their statistics, and sensitivities.
The user needs to define three functions to wrap monaco’s Monte
Carlo structure around their existing computational model. First
                                                                              Incorporating into Existing Workflows
is a run function which either calls or directly implements their
model. Second is a preprocess function which takes in a Case                  If the user wants to use existing workflows for generating, run-
object, extracts the randomized inputs, and structures them with              ning, post-processing, or examining results, any combination of
any other invariant data to pass to the run function. Third is a              monaco’s major steps can be replaced with external tooling by
postprocess function which takes in a Case object as well as the              saving and loading input and output variables to file. For example,
results from the model, and extracts the desired output values. The           monaco can be used only for its parallel processing backend by
Python call chain is as follows:                                              importing existing randomly drawn input variables, running the
postprocess(case, *run(*preprocess(case)))
                                                                              simulation, and exporting the output variables for outside analysis.
                                                                              Or, it can be used only for its plotting and analysis capabilities by
Or, equivalently, expanding the Python star notation into pseu-               feeding it inputs and outputs generated elsewhere.
docode:
siminput = (siminput1, siminput2, ...)                                        Resource Usage
             = preprocess(case)
simoutput = (simoutput1, simoutput2, ...)                                     Note that monaco’s computational and storage overhead in cre-
              = run(*siminput)                                                ating easily-interrogatable objects for each variable, value, and
              = run(siminput1, siminput2, ...)                                case makes it an inefficient choice for computationally simple
_ = postprocess(case, *simoutput)
                                                                              applications with high n, such as Monte Carlo integration. If the
  = postprocess(case, simoutput1, simoutput2, ...)
                                                                              preprocessed sim input and raw output for each case (which for
These three functions must be passed to the simulation in a dict              some models may dominate storage) is not retained, then the
with keys ’run’, ’preprocess’, and ’postprocess’. See the example             storage bottleneck will be the creation of a Val object for each
code at the end of the paper for a simple worked example.                     case’s input and output values with minimum size 0.5 kB. The
                                                                              maximum n will be driven by the size of the RAM on the host
Examining Results
                                                                              machine being capable of holding at least 0.5 · n · (k_in + k_out) kB.
After running, users should generally do all of the following                 On the computational bottleneck side, monaco is best suited for
UA and SA tasks to get a full picture of the behavior of their                models where the model runtime dominates the random variate
computational model.                                                          generation and the few hundred microseconds of dask.delayed
      •    Plot the results (UA & SA).                                        task switching time.
      •    Calculate statistics for input or output variables (UA).
      •    Calculate sensitivity indices to rank importance of the            Technical Features
           input variables on variance of the output variables (SA).
                                                                              Sampling Methods
      •    Investigate specific cases with outlier or puzzling results.
      •    Save the results to file or pass them to other programs.           Random sampling of the percentiles for each variable can be done
                                                                              using scipy’s pseudo-random number generator (PRNG), or with
Data Flow                                                                     any of the low-discrepancy methods from the scipy.stats.qmc quasi-
A summary of the process and data flow:                                       Monte Carlo (QMC) module. QMC in general provides faster
                                                                              O(log(n)^k · n^-1) convergence compared to the O(n^-1/2) conver-
      1)    Instantiate a Sim object.                                         gence of random sampling [Caf98]. Available low-discrepancy
      2)    Add input variables to the sim with specified probability         options are regular or scrambled Sobol sequences, regular or
            distributions.                                                    scrambled Halton sequences, or Latin Hypercube Sampling. In
      3)    Run the simulation. This executes the following:                  general, the ’sobol_random’ method that generates scrambled
               a)   Random percentiles p_i ∈ U[0, 1] are drawn                 Sobol sequences [Sob67] [Owe20] is recommended in nearly
                    ndraws times for each of the input variables.             all cases as the sequence with the fastest QMC convergence
               b)   These percentiles are transformed into random             [CKK18], balanced integration properties as long as the number of
                    values via the inverse cumulative density function        cases is a power of 2, and a fairly flat frequency spectrum (though
                    of the target probability distribution x_i = F^-1(p_i).   sampling spectra are rarely a concern) [PCX+ 18]. See Fig. 5 for a
               c)   If nonnumeric inputs are desired, the numbers are         visual comparison of some of the options.
                    converted to objects via a nummap dict.
               d)   Case objects are created and populated with the           Order Statistics, or, How Many Cases to Run?
                    input values for each case.                               How many Monte Carlo cases should one run? One answer would
               e)   Each case is run by structuring the input values          be to choose n ≥ 2^k with a sampling method that implements a
                    with the preprocess function, passing them to             (t,m,s) digital net (such as a Sobol or Halton sequence), which
                    the run function, and collecting the output values        guarantees that there will be at least one sample point in every
                    with the postprocess function.                            hyperoctant of the input space [JK08]. This should be considered
               f)   The output values are collected into output vari-         a lower bound for SA, with the number of cases run being some
                    ables and saved back to the sim. If the values are        integer multiple of 2^k.
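    Steps (a) and (b), together with the scrambled-Sobol option described under Sampling Methods, can be reproduced outside of monaco with plain scipy, as in this small sketch:

from scipy.stats import norm, qmc

# Draw scrambled-Sobol percentiles and push them through an inverse CDF,
# here for two standard-normal inputs
sampler = qmc.Sobol(d=2, scramble=True, seed=2022)
percentiles = sampler.random_base2(m=8)   # 2**8 = 256 points in [0, 1)^2
values = norm.ppf(percentiles)            # x_i = F^-1(p_i)
print(values.shape)                       # (256, 2)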

                                                                       Sensitivity Indices
                                                                       Sensitivity indices give a measure of the relationship between the
                                                                       variance of a scalar output variable and the variance of each of the
                                                                       input variables. In other words, they measure which of the input
                                                                       ranges have the largest effect on an output range. It is crucial that
                                                                       sensitivity indices are global rather than local measures – global
                                                                       sensitivity has the stronger theoretical grounding and there is no
                                                                       reason to rely on local measures in scenarios such as automated
                                                                       computer experiments where data can be easily and arbitrarily
                                                                       sampled [SRA+ 08] [PBPS22].
                                                                       With computer-designed experiments, it is possible to use a
                                                                       specially constructed sample set to directly calculate
                                                                       global sensitivity indices such as the Total-Order Sobol index
                                                                       [Sob01], or the IVARS100 index [RG16]. However, this special
                                                                       construction requires either sacrificing the desirable UA properties
                                                                       of low-discrepancy sampling, or conducting an additional Monte
                                                                       Carlo analysis of the model with a different sample set. For this
                                                                       reason, monaco uses the D-VARS approach to calculating global
                                                                       sensitivity indices, which allows for using a set of given data
                                                                       [SR20]. This is the first publicly available implementation of
                                                                       the D-VARS algorithm.

Fig. 5: 256 uniform and normal samples along with the 2D frequency     Plotting
spectra for PRNG random sampling (top), Sobol sampling (middle),       monaco includes a plotting module that takes in input and output
and scrambled Sobol sampling (bottom, default).                        variables and quickly creates histograms, empirical CDFs, scatter
                                                                       plots, or 2D or 3D "spaghetti plots" depending on what is most ap-
                                                                       propriate for each variable. Variable statistics and their confidence
    Along a similar vein, [DFM92] suggests that with random
                                                                       intervals are automatically shown on plots when applicable.
sampling n ≥ 2.136^k is sufficient to ensure that the volume fraction
V approaches 1. The author hypothesizes that for a digital net, the    Vector Data
n ≥ λ^k condition will be satisfied with some λ ≤ 2, and so n ≥ 2^k
will suffice for this condition to hold. However, these methods of     If the values for an output variable are length s lists, NumPy
choosing the number of cases may undersample for low k and be          arrays, or Pandas dataframes, they are treated as timeseries with s
infeasible for high k.                                                 steps. Variable statistics for these variables are calculated on the
    A rigorous way of choosing the number of cases is to first         ensemble of values at each step, giving time-varying statistics.
choose a statistical interval (e.g. a confidence interval for a            The plotting module will automatically plot size (1, s) arrays
percentile, or a tolerance interval to contain a percent of the        against the step number as 2-D lines, size (2, s) arrays as 2-D
population), and then use order statistics to calculate the minimum    parametric lines, and size (3, s) arrays as 3-D parametric lines.
n required to obtain that result at a desired confidence level. This
                                                                       Parallel Processing
approach is independent of k, making UA of high-dimensional
models tractable. monaco implements order statistics routines          monaco uses dask.distributed [Roc15] as a parallel processing
for calculating these statistical intervals with a distribution-free   backend, and supports preprocessing, running, and postprocessing
approach that makes no assumptions about the normality or other        cases in a parallel arrangement. Users familiar with dask can
shape characteristics of the output distribution. See Chapter 5 of     extend the parallelization of their simulation from their single
[HM91] for background.                                                 machine to a distributed cluster.
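    For readers new to dask.distributed, a generic cluster setup looks like the sketch below; this is plain dask, and how a particular simulation is attached to the cluster is left to monaco's documentation.

from dask.distributed import Client, LocalCluster

# A local cluster on one machine; Client('tcp://scheduler:8786') would
# attach to a remote distributed cluster instead
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
print(client.dashboard_link)  # live view of task progress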
    A more qualitative UA method would simply be to choose a               For simple simulations such as the example code at the end of
reasonably high n (say, n = 2^10), manually examine the results to     the paper, the overhead of setting up a dask server may outweigh
ensure high-interest areas are not being undersampled, and rely        the speedup from parallel computation, and in those cases monaco
on bootstrapping of the desired variable statistics to obtain the      also supports running single-threaded in a single for-loop.
required confidence levels.
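    As a worked illustration of the order-statistics sizing (a generic calculation, not a call into monaco's own routines): to claim with confidence c that the largest of n observed outputs exceeds the p-th percentile of the true output distribution, the chance that all n samples fall below that percentile, p^n, must be at most 1 - c, giving n ≥ log(1 - c)/log(p).

from math import ceil, log

def min_cases(p, confidence):
    # Smallest n with 1 - p**n >= confidence
    return ceil(log(1 - confidence) / log(p))

print(min_cases(0.99, 0.90))    # 230 cases
print(min_cases(0.999, 0.95))   # 2995 cases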
                                                                       The Median Case
Variable Statistics                                                    A "nominal" run is often useful as a baseline to compare other
For any input or output variable, a statistic can be calculated        cases against. If desired, the user can set a flag to force the
for the ensemble of values. monaco builds in some common               first case to be the median 50th percentile draw of all the input
statistics (mean, percentile, etc), or alternatively the user can      variables prior to random sampling.
pass in a custom one. To obtain a confidence interval for this
statistic, the results are resampled with replacement using the        Debugging Cases
scipy.stats.bootstrap module. The number of bootstrap samples          By default, all the raw results from each case’s simulation run
is determined using an order statistic approach as outlined in the     prior to postprocessing are saved to the corresponding Case object.
previous section, and multiplying that number by a scaling factor      Individual cases can be interrogated by looking at these raw
(default 10x) for smoothness of results.                               results, or by indicating that their results should be highlighted

in plots. If some cases fail to run, monaco will mark them as                         fcns=fcns, seed=seed)
incomplete and those specific cases can be rerun without requiring
                                                                      # Generate the input variables
the full set of cases to be recomputed. A debug flag can be set to    sim.addInVar(name='die1', dist=randint,
not skip over failed cases and instead stop at a breakpoint or dump                distkwargs={'low': 1, 'high': 6+1})
the stack trace on encountering an exception.                         sim.addInVar(name='die2', dist=randint,
                                                                                   distkwargs={'low': 1, 'high': 6+1})
Saving and Loading to File                                            # Run the Simulation
The base Sim object and the Case objects can be serialized and        sim.runSim()
saved to or loaded from .mcsim and .mccase files respectively,        The results of the simulation can then be analyzed and examined.
which are stored in a results directory. The Case objects are saved   Fig. 6 shows the plots this code generates.
separately since the raw results from a run of the simulation         # Calculate the mean and 5-95th percentile
may be arbitrarily large, and the Sim object can be comparatively     # statistics for the dice sum
lightweight. Loading the Sim object from file will automatically      sim.outvars['Sum'].addVarStat('mean')
attempt to load the cases in the same directory, but can also stand   sim.outvars['Sum'].addVarStat('percentile',
                                                                                                    {'p':[0.05, 0.95]})
alone if the raw results are not needed.
    Alternatively, the numerical representations for input and out-   # Plots a histogram of the dice sum
put variables can be saved to and loaded from .json or .csv files.    mc.plot(sim.outvars['Sum'])
This is useful for interfacing with external tooling, but discards    # Creates a scatter plot of the sum vs the roll
the metadata that would be present by saving to monaco’s native       # number, showing randomness
objects.                                                              mc.plot(sim.outvars['Sum'],
                                                                              sim.outvars['Roll Number'])

Example                                                               # Calculate the sensitivity of the dice sum to
                                                                      # each of the input variables
Presented here is a simple example showing a Monte Carlo              sim.calcSensitivities('Sum')
simulation of rolling two 6-sided dice and looking at their sum.      sim.outvars['Sum'].plotSensitivities()
   The user starts with their run function which here directly
implements their computational model. They must then create
preprocess and postprocess functions to feed in the randomized
input values and collect the outputs from that model.
# The 'run' function, which implements the
# existing computational model (or wraps it)
def example_run(die1, die2):
    dicesum = die1 + die2
    return (dicesum, )

# The 'preprocess' function grabs the random
# input values for each case and structures it
# with any other data in the format the 'run'
# function expects
def example_preprocess(case):
    die1 = case.invals['die1'].val
    die2 = case.invals['die2'].val
    return (die1, die2)

# The 'postprocess' function takes the output
# from the 'run' function and saves off the
# outputs for each case
def example_postprocess(case, dicesum):
    case.addOutVal(name='Sum', val=dicesum)
    case.addOutVal(name='Roll Number',
                   val=case.ncase)
    return None

The monaco simulation is initialized, given input variables with      Fig. 6: Output from the example code which calculates the sum of two
specified probability distributions (here a random integer between    random dice rolls. The top plot shows a histogram of the 2-dice sum
1 and 6), and run.                                                    with the mean and 5–95th percentiles marked, the middle plot shows
                                                                      the randomness over the set of rolls, and the bottom plot shows that
import monaco as mc
from scipy.stats import randint                                       each of the dice contributes 50% to the variance of the sum.

# dict structure for the three input functions
fcns = {'run'        : example_run,                                   Case Studies
        'preprocess' : example_preprocess,
        'postprocess': example_postprocess}                           These two case studies are toy models meant to be illustrative of
                                                                      potential uses, and not of expertise or rigor in their respective
# Initialize the simulation                                           domains. Please see https://github.com/scottshambaugh/monaco/
ndraws = 1024 # Arbitrary for this example                            tree/main/examples for their source code as well as several more
seed = 123456 # Recommended for repeatability
                                                                      Monte Carlo implementation examples across a range of domains
sim = mc.Sim(name='Dice Roll', ndraws=ndraws,                         including financial modeling, pandemic spread, and integration.

Baseball                                                                      The calculated win probabilities from this simulation are
This case study models the trajectory of a baseball in flight             93.4% Democratic, 6.2% Republican, and 0.4% Tie. The 25–75th
after being hit for varying speeds, angles, topspins, aerodynamic         percentile range for the number of electoral votes for the Demo-
conditions, and mass properties. From assumed initial conditions          cratic candidate is 281–412, and the actual election result was 306
immediately after being hit, the physics of the ball’s ballistic flight   electoral votes. See Fig. 8.
are calculated over time until it hits the ground.
    Fig. 7 shows some plots of the results. A baseball team might
use analyses like this to determine where outfielders should be
placed to catch a ball for a hitter with known characteristics, or
determine what aspect of a hit a batter should focus on to improve
their home run potential.




                                                                          Fig. 8: Predicted electoral votes for the Democratic 2020 US Pres-
                                                                          idential candidate with the median and 25-75th percentile interval
                                                                          marked (top), and a map of the predicted Democratic win probability
                                                                          per state (bottom).



                                                                          Conclusion
                                                                          This paper has introduced the ideas underlying Monte Carlo
                                                                          analysis and discussed when it is appropriate to use for conducting
                                                                          UA and SA. It has shown how monaco implements a rigorous,
                                                                          parallel Monte Carlo process, and how to use it through a simple
                                                                          example and two case studies. This library is geared towards
                                                                          scientists, engineers, and policy analysts that have a computational
                                                                          model in their domain of expertise, enough statistical knowledge
                                                                          to define a probability distribution, and a desire to ensure their
                                                                          model will make accurate predictions of reality. The author hopes
                                                                          this tool will help contribute to easier and more widespread use of
Fig. 7: 100 simulated baseball trajectories (top), and the relationship   UA and SA in improved decision-making.
between launch angle and landing distance (bottom). Home runs are
highlighted in orange.
                                                                          Further Information
                                                                          monaco is available on PyPI as the package monaco, has API
Election
                                                                          documentation at https://monaco.rtfd.io/, and is hosted on github
This case study attempts to predict the result of the 2020 US             at https://github.com/scottshambaugh/monaco/.
presidential election, based on polling data from FiveThirtyEight
3 weeks prior to the election [Fiv20].
    Each state independently casts a normally distributed percent-        R EFERENCES
age of votes for the Democratic, Republican, and Other candidates,
                                                                          [ALMR20] I Azzini, G Listorti, TA Mara, and R Rosati. Uncertainty and
based on polling. Also assumed is a uniform ±3% national                           sensitivity analysis for policy decision making. An Introductory
swing due to polling error which is applied to all states equally.                 Guide. Joint Research Centre, European Commission, Luxem-
That summed percentage is then normalized so the total for all                     bourg, 2020. doi:10.2760/922129.
candidates is 100%. The winner of each state’s election assigns           [C+ 12]  National Research Council et al. Assessing the reliability of
                                                                                   complex models: mathematical and statistical foundations of
their electoral votes to that candidate, and the candidate that wins               verification, validation, and uncertainty quantification. National
at least 270 of the 538 electoral votes is the winner.                             Academies Press, 2012. doi:10.17226/13395.





          Enabling Active Learning Pedagogy and Insight
            Mining with a Grammar of Model Analysis
                                                                    Zachary del Rosario‡∗






Abstract—Modern engineering models are complex, with dozens of inputs,
uncertainties arising from simplifying assumptions, and dense output data.
While major strides have been made in the computational scalability of complex
models, relatively less attention has been paid to user-friendly, reusable tools
to explore and make sense of these models. Grama is a Python package aimed at
supporting these activities. Grama is a grammar of model analysis: an ontology
that specifies data (in tidy form), models (with quantified uncertainties), and
the verbs that connect these objects. This definition enables a reusable set of
evaluation "verbs" that provide a consistent analysis toolkit across different
grama models. This paper presents three case studies that illustrate pedagogy
and engineering work with grama: 1. Providing teachable moments through errors
for learners, 2. Providing reusable tools to help users self-initiate productive
modeling behaviors, and 3. Enabling exploratory model analysis (EMA) –
exploratory data analysis augmented with data generation.

Index Terms—engineering, engineering education, exploratory model analysis,
software design, uncertainty quantification

* Corresponding author: zdelrosario@olin.edu
‡ Assistant Professor of Engineering and Applied Statistics, Olin College of
Engineering

Copyright © 2022 Zachary del Rosario. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.

Introduction

Modern engineering relies on scientific computing. Computational advances
enable faster analysis and design cycles by reducing the need for physical
experiments. For instance, finite-element analysis enables computational study
of aerodynamic flutter, and Reynolds-averaged Navier-Stokes simulation supports
the simulation of jet engines. Both of these are enabling technologies that
support the design of modern aircraft [KN05]. Modern areas of computational
research include heterogeneous computing environments [MV15], task-based
parallelism [BTSA12], and big data [SS13]. Another line of work considers the
development of integrated tools to unite diverse disciplinary perspectives in a
single, unified environment (e.g., the integration of multiple physical
phenomena in a single code [EVB+ 20] or the integration of a computational
solver and data analysis tools [MTW+ 22]). Such integrated computational
frameworks are highlighted as essential for applications such as computational
analysis and design of aircraft [SKA+ 14]. While engineering computation has
advanced along the aforementioned axes, the conceptual understanding of
practicing engineers has lagged in key areas.
    Every aircraft you have ever flown on has been designed using
probabilistically-flawed, potentially dangerous criteria [dRFI21]. The
fundamental issue underlying these criteria is a flawed heuristic for
uncertainty propagation; initial human subjects work suggests that engineers'
tendency to misdiagnose sources of variability as inconsequential noise may
contribute to the persistent application of flawed design criteria [AFD+ 21].
These flawed treatments of uncertainty are not limited to engineering design;
recent work by Kahneman et al. [KSS21] highlights widespread failures to
recognize or address variability in human judgment, leading to bias in hiring,
economic loss, and an unacceptably capricious application of justice.
    Grama was originally developed to support model analysis under uncertainty;
in particular, to enable active learning [FEM+ 14] – a form of teaching
characterized by active student engagement shown to be superior to lecture
alone. This toolkit aims to integrate the disciplinary perspectives of
computational engineering and statistical analysis within a unified environment
to support a coding to learn pedagogy [Bar16] – a teaching philosophy that uses
code to teach a discipline, rather than as a means to teach computer science or
coding itself. The design of grama is heavily inspired by the Tidyverse
[WAB+ 19], an integrated set of R packages organized around the 'tidy data'
concept [Wic14]. Grama uses the tidy data concept and introduces an analogous
concept for models.

Grama: A Grammar of Model Analysis

Grama [dR20] is an integrated set of tools for working with data and models.
Pandas [pdt20], [WM10] is used as the underlying data class, while grama
implements a Model class. A grama model includes a number of functions –
mathematical expressions or simulations – and domain/distribution information
for the deterministic/random inputs. The following code illustrates a simple
grama model with both deterministic and random inputs. (Throughout,
import grama as gr is assumed.)

# Each cp_* function adds information to the model
md_example = (
    gr.Model("An example model")
    # Overloaded `>>` provides pipe syntax
    >> gr.cp_vec_function(
        fun=lambda df: gr.df_make(f=df.x+df.y+df.z),
        var=["x", "y", "z"],
        out=["f"],
    )
    >> gr.cp_bounds(x=(-1, +1))
    >> gr.cp_marginals(
        y=gr.marg_mom("norm", mean=0, sd=1),
        z=gr.marg_mom("uniform", mean=0, sd=1),
    )
    >> gr.cp_copula_gaussian(
        df_corr=gr.df_make(
            var1="y",
            var2="z",
            corr=0.5,
        )
    )
)
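As a usage sketch (this snippet is not from the paper, though every verb in it
appears elsewhere in this article), the model defined above can be evaluated
immediately, for example by drawing a Monte Carlo sample at the nominal
deterministic input:

## Sketch: evaluate md_example by random sampling
df_example = (
    md_example
    >> gr.ev_sample(n=100, df_det="nom", seed=101)
)
print(df_example.describe())  # summarize the sampled inputs and output f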

While an engineer's interpretation of the term "model" focuses on the
input-to-output mapping (the simulation), and a statistician's interpretation
of the term "model" focuses on a distribution, the grama model integrates both
perspectives in a single object.
    Grama models are intended to be evaluated to generate data. The data can
then be analyzed using visual and statistical means. Models can be composed to
add more information, or fit to a dataset. Figure 1 illustrates this interplay
between data and models in terms of the four categories of function "verbs"
provided in grama.

Fig. 1: Verb categories in grama. These grama functions start with an
identifying prefix, e.g. ev_* for evaluation verbs.

Defaults for Concise Code

Grama verbs are designed with sensible default arguments to enable concise
code. For instance, the following code visualizes input sweeps across the
model's three inputs, similar to a ceteris paribus profile [KBB19], [Bie20].

(
    ## Concise default analysis
    md_example
    >> gr.ev_sinews(df_det="swp")
    >> gr.pt_auto()
)

This code uses the default number of sweeps and sweep density, and constructs a
visualization of the results. The resulting plot is shown in Figure 2.

Fig. 2: Input sweep generated from the code above. Each panel visualizes the
effect of changing a single input, with all other inputs held constant.

    Grama imports the plotnine package for data visualization [HK21], both to
provide an expressive grammar of graphics and to implement a variety of
"autoplot" routines. These are called via a dispatcher gr.pt_auto() which uses
metadata from evaluation verbs to construct a default visual. Combined with
sensible defaults for keyword arguments, these tools provide a concise syntax
even for sophisticated analyses. The same code can be slightly modified to
change a default argument value, or to use plotnine to create a more tailored
visual.

(
    md_example
    ## Override default parameters
    >> gr.ev_sinews(df_det="swp", n_sweeps=10)
    >> gr.pt_auto()
)
(
    md_example
    >> gr.ev_sinews(df_det="swp")
    ## Construct a targeted plot
    >> gr.tf_filter(DF.sweep_var == "x")
    >> gr.ggplot(gr.aes("x", "f", group="sweep_ind"))
    + gr.geom_line()
)

This system of defaults is important for pedagogical design: Introductory grama
code can be made extremely simple when first introducing a concept. However,
the defaults can be overridden to carry out sophisticated and targeted
analyses. We will see in the Case Studies below how this concise syntax
encourages sound analysis among students.

Pedagogy Case Studies

The following two case studies illustrate how grama is designed to support
pedagogy: the formal method and practice of teaching. In particular, grama is
designed for an active learning pedagogy [FEM+ 14], a style of teaching
characterized by active student engagement.

Teachable Moments through Errors for Learners

An advantage of a unified modeling environment like grama is the opportunity to
introduce design errors for learners in order to provide teachable moments.
    It is common in probabilistic modeling to make problematic assumptions. For
instance, Cullen and Frey [CF99] note that modelers frequently and erroneously
treat the normal distribution as a default choice for all unknown quantities.
Another common issue is to assume, by default, the independence of all random
inputs to a model. This is often done tacitly, with the independence assumption
unstated. These assumptions are problematic, as they can adversely impact the
validity of a probabilistic analysis [dRFI21].
    To highlight the dependency issue for novice modelers, grama uses error
messages to provide just-in-time feedback to a user who does not articulate
their modeling choices. For example, the following code builds a model with no
dependency structure specified. The result is an error message that summarizes
the conceptual issue and points the user to a primer on random variable
modeling.

md_flawed = (
    gr.Model("An example model")
    >> gr.cp_vec_function(
        fun=lambda df: gr.df_make(f=df.x+df.y+df.z),
        var=["x", "y", "z"],
        out=["f"],
    )
    >> gr.cp_bounds(x=(-1, +1))
    >> gr.cp_marginals(
        y=gr.marg_mom("norm", mean=0, sd=1),
        z=gr.marg_mom("uniform", mean=0, sd=1),
    )
    ## NOTE: No dependency specified
)
(
    md_flawed
    ## This code will throw an Error
    >> gr.ev_sample(n=1000, df_det="nom")
)

    Error ValueError: Present model copula must be defined for sampling.
    Use CopulaIndependence only when inputs can be guaranteed independent.
    See the Documentation chapter on Random Variable Modeling for more
    information.
    https://py-grama.readthedocs.io/en/latest/source/rv_modeling.html
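As a sketch of how a learner might respond to this message (the resolution
below is illustrative, not code from the paper), the fix is to state the
dependency structure explicitly, either by declaring independence deliberately
or by supplying a copula as in md_example above:

## Sketch: resolve the error by declaring the dependency structure
md_fixed = (
    md_flawed
    # A deliberate, documented choice rather than a silent default
    >> gr.cp_copula_independence()
)
(
    md_fixed
    # Sampling now proceeds without error
    >> gr.ev_sample(n=1000, df_det="nom")
)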
    Grama is designed both as a teaching tool and a scientific modeling
toolkit. For the student, grama offers teachable moments to help the novice
grow as a modeler. For the scientist, grama enforces practices that promote
scientific reproducibility.

Encouraging Sound Analysis

As mentioned above, concise grama syntax is desirable to encourage sound
analysis practices. Grama is designed to support higher-level learning outcomes
[Blo56]. For instance, rather than focusing on applying programming constructs
to generate model results, grama is intended to help users study model results
("evaluate," according to Bloom's Taxonomy). Sound computational analysis
demands study of simulation results (e.g., to check for numerical
instabilities). This case study makes this learning outcome distinction
concrete by considering parameter sweeps.
    Generating a parameter sweep similar to Figure 2 with standard Python
libraries requires a considerable amount of boilerplate code, manual
coordination of model information, and explicit loop construction. The
following code generates parameter sweep data using standard libraries (the
code assumes import numpy as np and import pandas as pd). Note that this code
sweeps through values of x holding values of y fixed; additional code would be
necessary to construct a sweep through y, as sketched after the listing.

## Parameter sweep: Manual approach
# Gather model info
x_lo = -1; x_up = +1;
y_lo = -1; y_up = +1;
f_model = lambda x, y: x**2 * y
# Analysis parameters
nx = 10               # Grid resolution for x
y_const = [-1, 0, +1] # Constant values for y
# Generate data
data = np.zeros((nx * len(y_const), 3))
for i, x in enumerate(
        np.linspace(x_lo, x_up, num=nx)
    ):
    for j, y in enumerate(y_const):
        data[i + j*nx, 0] = f_model(x, y)
        data[i + j*nx, 1] = x
        data[i + j*nx, 2] = y
# Package data for visual
df_manual = pd.DataFrame(
    data=data,
    columns=["f", "x", "y"],
)
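The complementary sweep through y, holding x at a few constant values, is
sketched below; this block is illustrative (not part of the original listing)
and simply mirrors the loops above with the roles of x and y exchanged.

## Sketch: additional boilerplate needed to sweep y at fixed x
ny = 10               # Grid resolution for y
x_const = [-1, 0, +1] # Constant values for x
data_y = np.zeros((ny * len(x_const), 3))
for i, y in enumerate(np.linspace(y_lo, y_up, num=ny)):
    for j, x in enumerate(x_const):
        data_y[i + j*ny, 0] = f_model(x, y)
        data_y[i + j*ny, 1] = x
        data_y[i + j*ny, 2] = y
df_manual_y = pd.DataFrame(data=data_y, columns=["f", "x", "y"])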
The ability to write low-level programming constructs – such as the loops
above – is an obviously worthy learning outcome in a course on scientific
computing. However, not all courses should focus on low-level programming
constructs. Grama is not designed to support low-level learning outcomes;
instead, the package is designed to support a "coding to learn" philosophy
[Bar16] focused on higher-order learning outcomes to support sound modeling
practices.
    Parameter sweep functionality can be achieved in grama without explicit
loop management and with sensible defaults for the analysis parameters. This
provides a "quick and dirty" tool to inspect a model's behavior. A grama
approach to parameter sweeps is shown below.

## Parameter sweep: Grama approach
# Gather model info
md_gr = (
    gr.Model()
    >> gr.cp_vec_function(
        fun=lambda df: gr.df_make(f=df.x**2 * df.y),
        var=["x", "y"],
        out=["f"],
    )
    >> gr.cp_bounds(
        x=(-1, +1),
        y=(-1, +1),
    )
)
# Generate data
df_gr = gr.eval_sinews(
    md_gr,
    df_det="swp",
    n_sweeps=3,
)
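For completeness, the visualization step implied here can be sketched as a
single call (this line is not shown in the original listing, but it uses the
same gr.plot_auto dispatcher that appears in the student example below):

# Visualize the sweep data with the autoplot dispatcher
gr.plot_auto(df_gr)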
Once a model is implemented in grama, generating and visualizing a parameter
sweep is trivial, requiring just two lines of code and zero initial choices for
analysis parameters. The practical outcome of this software design is that
users will tend to self-initiate parameter sweeps: While students will rarely
choose to write the extensive boilerplate code necessary for a parameter sweep
(unless required to do so), students writing code in grama will tend to
self-initiate sound analysis practices.
    For example, the following code is unmodified from a student report
(included with permission of the author, on condition of anonymity). The
original author implemented an ordinary differential equation model to simulate
the track time "finish_time" of an electric formula car, and sought to
study the impact of variables such as the gear ratio "GR" on
"finish_time". While the assignment did not require a parameter sweep,
the student chose to carry out their own study. The code below is a
self-initiated parameter sweep of the track time model.

## Unedited student code
md_car = (
    gr.Model("Accel Model")
    >> gr.cp_function(
        fun = calculate_finish_time,
        var = ["GR", "dt_mass", "I_net" ],
        out = ["finish_time"],
    )
    >> gr.cp_bounds(
        GR=(+1,+4),
        dt_mass=(+5,+15),
        I_net=(+.2,+.3),
    )
)

gr.plot_auto(
    gr.eval_sinews(
        md_car,
        df_det="swp",
        #skip=True,
        n_density=20,
        n_sweeps=5,
        seed=101,
    )
)

Fig. 3: Input sweep generated from the student code above. The image has been
cropped for space, and the results are generated with an older version of
grama. The jagged response at higher values of the input is evidence of solver
instabilities.

The parameter sweep shown in Figure 3 gives an overall impression of the effect
of input "GR" on the output "finish_time". This particular input
tends to dominate the results. However, variable results at higher values of
"GR" provide evidence of numerical instability in the ODE solver
underlying the model. Without this sort of model evaluation, the student author
would not have discovered the limitations of the model.

Exploratory Model Analysis Case Study

This final case study illustrates how grama supports exploratory model
analysis. This iterative process is a computational approach to mining insights
into physical systems. The following use case illustrates the approach by
considering the design of boat hull cross-sections.

Static Stability of Boat Hulls

Stability is a key consideration in boat hull design. One of the most
fundamental aspects of stability is static stability: the behavior of a boat
when perturbed away from static equilibrium [LE00]. Figure 4 illustrates the
physical mechanism governing stability at small perturbations from an upright
orientation.
    As a boat is rotated away from its upright orientation, its center of
buoyancy (COB) will tend to migrate. If the boat is in vertical equilibrium,
its buoyant force will be equal in magnitude to its weight. A stable boat is a
hull whose COB migrates in such a way that a restoring torque is generated
(Fig. 4). However, this upright stability is not guaranteed; Figure 5
illustrates a boat design that does not provide a restoring torque near its
upright angle. An upright-unstable boat will tend to capsize spontaneously.

Fig. 4: Schematic boat hull rotated to 22.5°. The forces due to gravity and
buoyancy act at the center of mass (COM) and center of buoyancy (COB),
respectively. Note that this hull is upright stable, as the couple will rotate
the boat to upright.

Fig. 5: Schematic boat hull rotated to 22.5°. Gravity and buoyancy are
annotated as in Figure 4. Note that this hull is upright unstable, as the
couple will rotate the boat away from upright.

    Naval engineers analyze the stability of a boat design by constructing a
moment curve, such as the one pictured in Figure 6. This curve depicts the net
moment due to buoyancy at various angles, assuming the vessel is in vertical
equilibrium. From this figure we can see that the design is upright-stable, as
it possesses a negative slope at upright θ = 0°. Note that a boat may not have
an unlimited range of stability, as Figure 6 exhibits an angle of vanishing
stability (AVS) beyond which the boat does not recover to upright.
    The classical way to build intuition about boat stability is via
mathematical derivations [LE00]. In the following section we present an
alternative way to build intuition through exploratory model analysis.

Fig. 6: Total moment on a boat hull as it is rotated through 180°. A negative
slope at upright θ = 0° is required for upright stability. Stability is lost at
the angle of vanishing stability (AVS).

EMA for Insight Mining

Generation and post-processing of the moment curve are implemented in the grama
model md_performance (the analysis reported here is available as a jupyter
notebook). This model parameterizes a 2d boat hull via its height H, width W,
shape of corner n, the vertical height of the center of mass f_com (as a
fraction of the height), and the displacement ratio d (the ratio of the boat's
mass to maximum water mass displaced). Note that a boat with d > 1 is incapable
of flotation. A smaller value of d corresponds to a boat that floats higher in
the water. The model md_performance returns stability = -dMdtheta_0
(the negative of the moment curve slope at upright) as well as the mass and AVS
angle. A positive value of stability indicates upright stability, while
a larger value of angle indicates a wider range of stability.
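To make the stability metric concrete, here is a small sketch (not from the
paper; df_moment and its columns theta and M are hypothetical) of estimating
the moment-curve slope at upright with a finite difference:

import numpy as np

# Hypothetical moment curve: theta in degrees, M the net moment
theta = df_moment["theta"].values
M = df_moment["M"].values
dM_dtheta = np.gradient(M, theta)      # finite-difference slope of the curve
i_upright = np.argmin(np.abs(theta))   # index nearest theta = 0
stability = -dM_dtheta[i_upright]      # positive value => upright stable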
    The EMA process begins by generating data from the model. However, the
generation of a moment curve is a nontrivial calculation. One should exercise
care in choosing an initial sample of designs to analyze. The statistical
problem of selecting efficient input values for a computer model is called the
design of computer experiments [SSW89]. The grama verb gr.tf_sp() implements
the support points algorithm [MJ18] to reduce a large dataset of target points
to a smaller (but representative) sample. The following code generates a sample
of input design values via gr.ev_sample() with the skip=True argument, uses
gr.tf_sp() to "compact" this large sample, then evaluates the performance model
at the smaller sample.

df_boats = (
    md_performance
    >> gr.ev_sample(
        n=5e3,
        df_det="nom",
        seed=101,
        skip=True,
    )
    >> gr.tf_sp(n=1000, seed=101)
    >> gr.tf_md(md=md_performance)
)

With an initial sample generated, we can perform an exploratory analysis
relating the inputs and outputs. The verb gr.tf_iocorr() computes correlations
between every pair of input variables var and outputs out. The routine also
attaches metadata, enabling an autoplot as a tileplot of the correlation
values.

(
    df_boats
    >> gr.tf_iocorr(
        var=["H", "W", "n", "d", "f_com"],
        out=["mass", "angle", "stability"],
    )
    >> gr.pt_auto()
)

Fig. 7: Tile plot of input/output correlations; autoplot gr.pt_auto()
visualization of gr.tf_iocorr() output.

    The correlations in Figure 7 suggest that stability is positively
impacted by increasing the width W and displacement ratio d of a boat, and by
decreasing the height H, shape factor n, and vertical location of the center of
mass f_com. The correlations also suggest a similar impact of each variable on
the AVS angle, but with a weaker dependence on H. These results also
suggest that f_com has the strongest effect on both stability and
angle.
    Correlations are a reasonable first check of input/output behavior, but
linear correlation quantifies only an average, linear association. A second
pass at the data would be to fit an accurate surrogate model and inspect
parameter sweeps. The following code defines a gaussian process fit [RW05] for
both stability and angle, and estimates model error using k-folds
cross validation [JWHT13]. Note that a non-default kernel is necessary for a
reasonable fit of the latter output (RBF is imported as
from sklearn.gaussian_process.kernels import RBF).

## Define fitting procedure
ft_common = gr.ft_gp(
    var=["H", "W", "n", "d", "f_com"],
    out=["angle", "stability"],
    kernels=dict(
        stability=None, # Use default
        angle=RBF(length_scale=0.1),
    )
)
## Estimate model accuracy via k-folds CV
(
    df_boats
    >> gr.tf_kfolds(
        ft=ft_common,
        out=["angle", "stability"],
    )
)

    angle    stability    k
    0.771    0.979        0
    0.815    0.976        1
    0.835    0.95         2
    0.795    0.962        3
    0.735    0.968        4

TABLE 1: Accuracy (R²) estimated via k-fold cross validation of the gaussian
process model.

    The k-folds CV results (Tab. 1) suggest a highly accurate model for
stability, and a moderately accurate model for angle. The following
code defines the surrogate model over a domain that includes the original
dataset, and performs parameter sweeps across all inputs.

md_fit = (
    df_boats
    >> ft_common()
    >> gr.cp_marginals(
        H=gr.marg_mom("uniform", mean=2.0, cov=0.30),
        W=gr.marg_mom("uniform", mean=2.5, cov=0.35),
        n=gr.marg_mom("uniform", mean=1.0, cov=0.30),
        d=gr.marg_mom("uniform", mean=0.5, cov=0.30),
        f_com=gr.marg_mom(
            "uniform",
            mean=0.55,
            cov=0.47,
        ),
    )
    >> gr.cp_copula_independence()
)
(
    md_fit
    >> gr.ev_sinews(df_det="swp", n_sweeps=5)
    >> gr.pt_auto()
)

Fig. 8: Parameter sweeps for fitted GP model. Model *_mean and predictive
uncertainty *_sd values are reported for each output angle, stability.

Figure 8 displays parameter sweeps for the surrogate model of stability
and angle. Note that the surrogate model reports both a mean trend
*_mean and a predictive uncertainty *_sd. The former is the model's prediction
for future values, while the latter quantifies the model's confidence in each
prediction.
    The parameter sweeps of Figure 8 show a consistent and strong effect of
f_com on the stability_mean of the boat; note that all the sweeps across
f_com for stability_mean tend to be monotone with a fairly steep slope. This is
in agreement with the correlation results of Figure 7; the f_com sweeps tend to
have the steepest slopes. Given the high accuracy of the model for
stability (as measured by k-folds CV), this trend is reasonably
trustworthy.
    However, the same figure shows an inconsistent (non-monotone) effect of
most inputs on the AVS angle_mean. These results are in agreement with the
k-fold CV results shown above. Clearly, the surrogate model for angle is
untrustworthy, and we should resist trusting conclusions from the parameter
sweeps for angle_mean. This undermines the conclusion we drew from the
input/output correlations pictured in Figure 7. Clearly, angle exhibits
more complex behavior than a simple linear correlation with each of the boat
design variables.
    A different analysis of the boat hull angle data helps develop useful
insights. We pursue an active subspace analysis of the data to reduce the
dimensionality of the input space by identifying directions that best explain
variation in the output [dCI17], [Con15]. The verb gr.tf_polyridge() implements
the variable projection algorithm of Hokanson and Constantine [HC18]. The
following code pursues a two-dimensional reduction of the input space. Note
that the hyperparameter n_degree=6 is set via a cross-validation study.

## Find two important directions
df_weights = (
    df_boats
    >> gr.tf_polyridge(
        var=["H", "W", "n", "d", "f_com"],
        out="angle",
        n_degree=6, # Set via CV study
        n_dim=2,    # Seek 2d subspace
    )
)

    Direction    H          W         n          d          f_com
    1            -0.0277    0.0394    -0.1187    0.4009     -0.9071
    2            -0.6535    0.3798    -0.0157    -0.6120    -0.2320

TABLE 2: Subspace weights in df_weights.

    The subspace weights are reported in Table 2. Note that the leading
direction 1 is dominated by the displacement ratio d and COM location f_com.
Essentially, this describes the "loading" of the vessel. The second direction
corresponds to "widening and shortening" of the hull cross-section (in addition
to lowering d and f_com).
    Using the subspace weights in Table 2 to produce a 2d projection of the
feature space enables visualizing all boat geometries in a single plot. Figure
9 reveals that this 2d projection is very successful at separating
universally-stable (angle==180), upright-unstable (angle==0), and intermediate
cases (0 < angle < 180). Intermediate cases are concentrated at higher values
of the second active variable. There is a phase transition between
universally-stable and upright-unstable vessels at lower values of the second
active variable.
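The 2d projection plotted in Figure 9 can be reproduced in sketch form directly
from the subspace weights. The snippet below is not from the paper: it assumes
df_weights stores one row per direction with one column per input, assumes
numpy/pandas are imported as np/pd as before, and skips any input
standardization the original analysis may have used.

## Sketch: project each boat design onto the 2d active subspace
inputs = ["H", "W", "n", "d", "f_com"]
X = df_boats[inputs].values     # (n_designs, 5) design matrix
B = df_weights[inputs].values   # (2, 5) subspace weights, as in Table 2
df_active = pd.DataFrame(X @ B.T, columns=["active1", "active2"])
df_active["angle"] = df_boats["angle"].values  # color points by AVS angle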
Fig. 9: Boat design feature vectors projected to 2d active subspace. The origin
corresponds to the mean feature vector.

    Interpreting Figure 9 in light of Table 2 provides us with deep insight
about boat stability: Since active variable 1 corresponds to loading (high
displacement ratio d with a low COM f_com), we can see that the boat's loading
conditions are key to determining its stability. Since active variable 2
depends on the aspect ratio (higher width, shorter height), Figure 9 suggests
that only wider boats will tend to exhibit intermediate stability.

Conclusions

Grama is a Python implementation of a grammar of model analysis. The grammar's
design supports an active learning approach to teaching sound scientific
modeling practices. Two case studies demonstrated the teaching benefits of
grama: errors for learners help guide novices toward a more sound analysis,
while concise syntax encourages novices to carry out sound analysis practices.
Grama can also be used for exploratory model analysis (EMA) – an exploratory
procedure to mine a scientific model for useful insights. A case study of boat
hull design demonstrated EMA. In particular, the example explored and explained
the relationship between boat design parameters and two metrics of boat
stability.
    Several ideas from the grama project are of interest to other practitioners
and developers in scientific computing. Grama was designed to support model
analysis under uncertainty. However, the data/model and four-verb ontology
(Fig. 1) underpinning grama is a much more general idea. This design enables
very concise model analysis syntax, which provides much of the benefit behind
grama.
    The design idiom of errors for learners is not simply focused on writing
"useful" error messages, but is rather a design orientation to use errors to
introduce teachable moments. In addition to writing error messages "for humans"
[Bry20], an errors for learners philosophy designs errors not simply to avoid
fatal program behavior, but rather introduces exceptions to prevent
conceptually invalid analyses. For instance, in the case study presented above,
designing gr.ev_sample() to assume independent random inputs when a copula is
unspecified would lead to code that throws errors less frequently. However,
this would silently endorse the conceptually problematic mentality of
"independence is the default." While throwing an error message for an
unspecified dependence structure leads to more frequent errors, it serves as a
frequent reminder that dependency is an important part of a model involving
random inputs.
    Finally, exploratory model analysis holds benefits for both learners and
practitioners of scientific modeling. EMA is an alternative to derivation for
the activities in an active learning approach. Rather than structuring courses
around deriving and implementing scientific models, course exercises could have
students explore the behavior of a pre-implemented model to better understand
physical phenomena. Lorena Barba [Bar16] describes some of the benefits of this
style of lesson design. EMA is also an important part of the modeling
practitioner's toolkit as a means to verify a model's implementation and to
develop new insights. Grama supports both novices and practitioners in
performing EMA through a concise syntax.

REFERENCES

[AFD+ 21] Riya Aggarwal, Mira Flynn, Sam Daitzman, Diane Lam, and Zachary
          Riggins del Rosario. A qualitative study of engineering students'
          reasoning about statistical variability. In 2021 Fall ASEE Middle
          Atlantic Section Meeting, 2021. URL: https://peer.asee.org/38421.
[Bar16]   Lorena Barba. Computational thinking: I do not think it means what
          you think it means. Technical report, 2016. URL:
          https://lorenabarba.com/blog/computational-thinking-i-do-not-think-it-means-what-you-think-it-means/.
[Bie20]   Przemyslaw Biecek. ceterisParibus: Ceteris Paribus Profiles, 2020.
          R package version 0.4.2. URL:
          https://cran.r-project.org/package=ceterisParibus.
[Blo56]   Benjamin Samuel Bloom. Taxonomy of educational objectives: The
          classification of educational goals. Addison-Wesley Longman Ltd.,
          1956.
[Bry20]   Jennifer Bryan. object of type closure is not subsettable. 2020.
          rstudio::conf 2020. URL: https://rstd.io/debugging.
[BTSA12]  Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken.
          Legion: Expressing locality and independence with logical regions.
          In SC'12: Proceedings of the International Conference on High
          Performance Computing, Networking, Storage and Analysis, pages 1–11.
          IEEE, 2012. URL: https://ieeexplore.ieee.org/document/6468504,
          doi:10.1109/SC.2012.71.
[CF99]    Alison C Cullen and H Christopher Frey. Probabilistic Techniques In
          Exposure Assessment: A Handbook For Dealing With Variability And
          Uncertainty In Models And Inputs. Springer Science & Business Media,
          1999.
[Con15]   Paul G. Constantine. Active Subspaces: Emerging Ideas for Dimension
          Reduction in Parameter Studies. SIAM Philadelphia, 2015.
          doi:10.1137/1.9781611973860.
[dCI17]   Zachary del Rosario, Paul G. Constantine, and Gianluca Iaccarino.
          Developing design insight through active subspaces. In 19th AIAA
          Non-Deterministic Approaches Conference, page 1090, 2017. URL:
          https://arc.aiaa.org/doi/10.2514/6.2017-1090,
          doi:10.2514/6.2017-1090.
[dR20]    Zachary del Rosario. Grama: A grammar of model analysis. Journal of
          Open Source Software, 5(51):2462, 2020. URL:
          https://doi.org/10.21105/joss.02462, doi:10.21105/joss.02462.
[dRFI21]  Zachary del Rosario, Richard W Fenrich, and Gianluca Iaccarino. When
          are allowables conservative? AIAA Journal, 59(5):1760–1772, 2021.
          URL: https://doi.org/10.2514/1.J059578, doi:10.2514/1.J059578.
[EVB+ 20] M Esmaily, L Villafane, AJ Banko, G Iaccarino, JK Eaton, and A Mani.
          A benchmark for particle-laden turbulent duct flow: A joint
          computational and experimental study. International Journal of
          Multiphase Flow, 132:103410, 2020. URL:
          https://www.sciencedirect.com/science/article/abs/pii/S030193222030519X,
          doi:10.1016/j.ijmultiphaseflow.2020.103410.
[FEM+ 14] Scott Freeman, Sarah L Eddy, Miles McDonough, Michelle K Smith,
          Nnadozie Okoroafor, Hannah Jordt, and Mary Pat Wenderoth. Active
          learning increases student performance in science, engineering, and
          mathematics. Proceedings of the National Academy of Sciences,
          111(23):8410–8415, 2014. doi:10.1073/pnas.1319030111.
[HC18]    Jeffrey M Hokanson and Paul G Constantine. Data-driven polynomial
          ridge approximation using variable projection. SIAM Journal on
          Scientific Computing, 40(3):A1566–A1589, 2018.
          doi:10.1137/17M1117690.

[HK21]    Hassan Kibirige, Greg Lamp, Jan Katins, gdowding, austin, matthias-k,
          Tyler Funnell, Florian Finkernagel, Jonas Arnfred, Dan Blanchard,
          et al. has2k1/plotnine: v0.8.0, March 2021.
          doi:10.5281/zenodo.4636791.
[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshi-
          rani. An Introduction to Statistical Learning: with Applications in
          R, volume 112. Springer, 2013. URL: https://www.statlearning.
          com/.
[KBB19]   Michał Kuźba, Ewa Baranowska, and Przemysław Biecek. pyce-
          terisparibus: explaining machine learning models with ceteris
          paribus profiles in python. Journal of Open Source Software,
          4(37):1389, 2019. URL: https://joss.theoj.org/papers/10.21105/
          joss.01389, doi:10.21105/joss.01389.
[KN05]    Andy Keane and Prasanth Nair. Computational Approaches For
          Aerospace Design: The Pursuit Of Excellence. John Wiley &
          Sons, 2005.
[KSS21]   Daniel Kahneman, Olivier Sibony, and Cass R Sunstein. Noise:
          A flaw in human judgment. Little, Brown, 2021.
[LE00]    Lars Larsson and Rolf Eliasson. Principles of Yacht Design.
          McGraw Hill Companies, 2000.
[MJ18]    Simon Mak and V Roshan Joseph. Support points. The Annals
          of Statistics, 46(6A):2562–2592, 2018. doi:10.1214/17-
          AOS1629.
[MTW+ 22] Kazuki Maeda, Thiago Teixeira, Jonathan M Wang, Jeffrey M
          Hokanson, Caetano Melone, Mario Di Renzo, Steve Jones, Javier
          Urzay, and Gianluca Iaccarino. An integrated heterogeneous
          computing framework for ensemble simulations of laser-induced
          ignition. arXiv preprint arXiv:2202.02319, 2022. URL: https:
          //arxiv.org/abs/2202.02319, doi:10.48550/arXiv.2202.
          02319.
[MV15]    Sparsh Mittal and Jeffrey S Vetter. A survey of cpu-gpu heteroge-
          neous computing techniques. ACM Computing Surveys (CSUR),
          47(4):1–35, 2015. URL: https://dl.acm.org/doi/10.1145/2788396,
          doi:10.1145/2788396.
[pdt20]   The pandas development team. pandas-dev/pandas: Pandas,
          February 2020. URL: https://doi.org/10.5281/zenodo.3509134,
          doi:10.5281/zenodo.3509134.
[RW05]    Carl Edward Rasmussen and Christopher K. I. Williams. Gaus-
          sian Processes for Machine Learning. The MIT Press, 11
          2005.      URL: https://doi.org/10.7551/mitpress/3206.001.0001,
          doi:10.7551/mitpress/3206.001.0001.
[SKA+ 14] Jeffrey P Slotnick, Abdollah Khodadoust, Juan Alonso, David
          Darmofal, William Gropp, Elizabeth Lurie, and Dimitri J
          Mavriplis. Cfd vision 2030 study: A path to revolutionary
          computational aerosciences. Technical report, 2014. URL:
          https://ntrs.nasa.gov/citations/20140003093.
[SS13]    Seref Sagiroglu and Duygu Sinanc. Big data: A review.
          In 2013 International Conference on Collaboration Technolo-
          gies and Systems (CTS), pages 42–47. IEEE, 2013. URL:
          https://ieeexplore.ieee.org/document/6567202, doi:10.1109/
          CTS.2013.6567202.
[SSW89]   Jerome Sacks, Susannah B. Schiller, and William J. Welch.
          Designs for computer experiments. Technometrics, 31(1):41–
          47, 1989. URL: http://www.jstor.org/stable/1270363, doi:10.
          2307/1270363.
[WAB 19] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang,
      +

          Lucy D’Agostino McGowan, Romain François, Garrett Grole-
          mund, Alex Hayes, Lionel Henry, Jim Hester, et al. Welcome
          to the tidyverse. Journal of Open Source Software, 4(43):1686,
          2019. doi:10.21105/joss.01686.
[Wic14]   Hadley Wickham. Tidy data. Journal of Statistical Software,
          59(10):1–23, 2014. doi:10.18637/jss.v059.i10.
[WM10]    Wes McKinney. Data Structures for Statistical Computing in
          Python. In Stéfan van der Walt and Jarrod Millman, editors,
          Proceedings of the 9th Python in Science Conference, pages 56 –
          61, 2010. doi:10.25080/Majora-92bf1922-00a.




   Low Level Feature Extraction for Cilia Segmentation
Meekail Zain‡†∗, Eric Miller§†, Shannon P Quinn‡¶, Cecilia Lo||






Abstract—Cilia are organelles found on the surface of some cells in the human
body that sweep rhythmically to transport substances. Dysfunction of ciliary
motion is often indicative of diseases known as ciliopathies, which disrupt
the functionality of macroscopic structures within the lungs, kidneys and other
organs [LWL+18]. Phenotyping ciliary motion is an essential step towards un-
derstanding ciliopathies; however, this is generally an expert-intensive process
[QZD+15]. A means of automatically parsing recordings of cilia to determine
useful information would greatly reduce the amount of expert intervention re-
quired. This would not only improve overall throughput, but also mitigate human
error, and greatly improve the accessibility of cilia-based insights. Such automa-
tion is difficult to achieve due to the noisy, partially occluded and potentially out-
of-phase imagery used to represent cilia, as well as the fact that cilia occupy a
minority of any given image. Segmentation of cilia mitigates these issues, and is
thus a critical step in enabling a powerful pipeline. However, cilia are notoriously
difficult to properly segment in most imagery, imposing a bottleneck on the
pipeline. Experimentation on and evaluation of alternative methods for feature
extraction of cilia imagery hence provide the building blocks of a more potent
segmentation model. Current experiments show up to a 10% improvement over
base segmentation models using a novel combination of feature extractors.


Index Terms—cilia, segmentation, u-net, deep learning
Fig. 1: A sample frame from the cilia dataset

† These authors contributed equally.
* Corresponding author: meekail.zain@uga.edu
‡ Department of Computer Science, University of Georgia, Athens, GA 30602 USA
§ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA
¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 USA
|| Department of Developmental Biology, University of Pittsburgh, Pittsburgh, PA 15261 USA

Copyright © 2022 Meekail Zain et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction
Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances [Ish17]. Dysfunction of ciliary motion often indicates diseases known as ciliopathies, which on a larger scale disrupt the functionality of structures within the lungs, kidneys and other organs. Phenotyping ciliary motion is an essential step towards understanding ciliopathies. However, this is generally an expert-intensive process [LWL+18], [QZD+15]. A means of automatically parsing recordings of cilia to determine useful information would greatly reduce the amount of expert intervention required, thus increasing throughput while alleviating the potential for human error. Hence, Zain et al. (2020) discuss the construction of a generative pipeline to model and analyze ciliary motion, a prevalent field of investigation in the Quinn Research Group at the University of Georgia [ZRS+20].

The current pipeline consists of three major stages: preprocessing, where segmentation masks and optical flow representations are created to supplement raw cilia video data; appearance, where a model learns a condensed spatial representation of the cilia; and dynamics, which learns a representation from the video, encoded as a series of latent points from the appearance module. In the primary module, the segmentation mask is essential in scoping downstream analysis to the cilia themselves, so inaccuracies at this stage directly affect the overall performance of the pipeline. However, due to the high variance of ciliary structure, as well as the noisy and out-of-phase imagery available, segmentation attempts have been prone to error.

While segmentation masks for such a pipeline could be manually generated, the process requires intensive expert labor [DvBB+21]. Requiring manual segmentation before analysis thus greatly increases the barrier to entry for this tool. Not only would it increase the financial strain of adopting ciliary analysis as a clinical tool, but it would also serve as an insurmountable barrier to entry for communities that do not have reliable access to such clinicians in the first place, such as many developing nations and rural populations. Not only can automated segmentation mitigate these barriers to entry, but it can also simplify existing treatment and analysis infrastructure. In particular, it has the potential to reduce the magnitude of work required by an expert clinician, thereby
decreasing costs and increasing clinician throughput [QZD+15], [ZRS+20]. Furthermore, manual segmentation imparts clinician-specific bias which reduces the reproducibility of results, making it difficult to verify novel techniques and claims [DvBB+21].

A thorough review of previous segmentation models, specifically those using the same dataset, shows that current results are poor, impeding tasks further along the pipeline. For this study, model architectures utilize various methods of feature extraction that are hypothesized to improve the accuracy of a base segmentation model, such as using zero-phased PCA maps and Sparse Autoencoder reconstructions with various parameters as a data augmentation tool. Various experiments with these methods provide a summary of both qualitative and quantitative results necessary in ascertaining the viability of such feature extractors to aid in segmentation.

Fig. 2: The classical U-Net architecture, which serves as both a baseline and backbone model for this research

Related Works
Lu et al. (2018) utilized a Dense Net segmentation model as an upstream to a CNN-based Long Short-Term Memory (LSTM) time-series model for classifying cilia based on spatiotemporal patterns [LMZ+18]. While the model reports good classification accuracy and a high F-1 score, the underlying dataset only contains 75 distinct samples and the results must therefore be taken with great care. Furthermore, Lu et al. did not report the separate performance of the upstream segmentation network. Their approach did, however, inspire the follow-up methodology of Zain et al. (2020) for segmentation. In particular, they employ a Dense Net segmentation model as well; however, they first augment the underlying images with the calculated optical flow. In this way, their segmentation strategy employs both spatial and temporal information. To compare against [LMZ+18], the authors evaluated their segmentation model in the same way, as an upstream to a CNN/LSTM classification network. Their model improved the classification accuracy two points above that of Lu et al. (2018). Their reported intersection-over-union (IoU) score is 33.06% and marks the highest performance achieved on this dataset.

One alternative segmentation model, often used in biomedical image processing and analysis where labelled data sets are relatively small, is the U-Net architecture (Fig. 2) [RFB15]. Developed by Ronneberger et al., U-Nets consist of two parts: contraction and expansion. The contraction path follows the standard strategy of most convolutional neural networks (CNNs), where convolutions are followed by Rectified Linear Unit (ReLU) activation functions and max pooling layers. While max pooling downsamples the images, the convolutions double the number of channels. Upon expansion, up-convolutions are applied to up-sample the image while reducing the number of channels. At each stage, the network concatenates the up-sampled image with the image of corresponding size (cropped to account for border pixels) from a layer in the contracting path. A final layer uses pixel-wise (1 × 1) convolutions to map each pixel to a corresponding class, building a segmentation. Before training, data is generally augmented to provide both invariance in rotation and scale as well as a larger amount of training data. In general, U-Nets have shown high performance on biomedical data sets with low quantities of labelled images, as well as reasonably fast training times on graphics processing units (GPUs) [RFB15]. However, in a few past experiments with cilia data, the U-Net architecture has had low segmentation accuracy [LMZ+18]. Difficulties modeling cilia with CNN-based architectures include their fine high-variance structure, spatial sparsity, color homogeneity (with respect to the background and ambient cells), as well as inconsistent shape and distribution across samples. Hence, various enhancements to the pure U-Net model are necessary for reliable cilia segmentation.

Methodology
The U-Net architecture is the backbone of the model due to its well-established performance in the biomedical image analysis domain. This paper focuses on extracting and highlighting the underlying features in the image through various means. Therefore, optimization of the U-Net backbone itself is not a major consideration of this project. Indeed, the relative performance of the various modified U-Nets sufficiently communicates the efficacy of the underlying methods. Each feature extraction method maps the underlying raw image to a corresponding feature map. To evaluate the usefulness of these feature maps, the model concatenates these augmentations to the original image and uses the aggregate data as input to a U-Net that is slightly modified to accept multiple input channels.

The feature extractors of interest are Zero-phase PCA sphering (ZCA) and a Sparse Autoencoder (SAE), on both of which the following subsections provide more detail. Roughly speaking, these are both lossy, non-bijective transformations which map a single image to a single feature map. In the case of ZCA, the feature maps empirically tend to preserve edges and reduce the rest of the image to arbitrary noise, thereby emphasizing local structure (since cell structure tends not to be well-preserved). The SAE instead acts as a harsh compression and filters out both linear and non-linear features, preserving global structure. Each extractor is evaluated by considering the performance of a U-Net model trained on multi-channel inputs, where the first channel is the original image and the second and/or third channels are the feature maps extracted by these methods. In particular, the objective is for the doubly-augmented data, or the "composite" model, to achieve state-of-the-art performance on this challenging dataset.

The ZCA implementation utilizes SciPy linear algebra solvers, and both the U-Net and SAE architectures use the PyTorch deep learning library. Next, the evaluation stage employs canonical segmentation quality metrics, such as the Jaccard score and Dice coefficient, on various models. When applied to the composite model, these metrics determine any potential improvements to the state-of-the-art for cilia segmentation.
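As a concrete reference for these two metrics, the short NumPy sketch below computes the Jaccard (IoU) and Dice scores for a pair of binary masks. It is an illustration only, not the project's evaluation code, and the function name and the 0.5 thresholding in the usage comment are assumptions.

    import numpy as np

    def iou_and_dice(pred_mask, true_mask):
        """Jaccard (IoU) and Dice scores for two same-shaped binary masks."""
        pred = pred_mask.astype(bool)
        true = true_mask.astype(bool)
        intersection = np.logical_and(pred, true).sum()
        union = np.logical_or(pred, true).sum()
        total = pred.sum() + true.sum()
        iou = intersection / union if union else 1.0
        dice = 2 * intersection / total if total else 1.0
        return iou, dice

    # Example: threshold a sigmoid U-Net output at 0.5 before scoring.
    # iou, dice = iou_and_dice(probs > 0.5, ground_truth_mask)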
Cilia Data
As in the Zain paper, the input data is a limited set of grayscale cilia imagery, from both healthy patients and those diagnosed with ciliopathies, with corresponding ground truth masks provided by experts. The images are cropped to 128 × 128 patches. The images are cropped at random coordinates in order to increase the size and variance of the sample space, and each image is cropped a number of times proportional to its resolution. Additionally, crops that contain less than fifteen percent cilia are excluded from the training/test sets. This method increases the size of the training set from 253 images to 1409 images. Finally, standard min-max contrast normalization maps the luminosity to the interval [0, 1].
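A minimal sketch of this cropping and normalization step is shown below, assuming a grayscale image and its binary ground-truth mask as NumPy arrays. The function name, the random generator, and the per-image crop count are illustrative choices rather than the authors' code.

    import numpy as np

    PATCH, MIN_CILIA_FRAC = 128, 0.15

    def random_patches(image, mask, n_crops, rng=np.random.default_rng(0)):
        """Return min-max normalized 128x128 crops containing at least 15% cilia.

        `n_crops` would be chosen proportional to the source image resolution.
        """
        h, w = image.shape
        kept = []
        for _ in range(n_crops):
            y = rng.integers(0, h - PATCH + 1)
            x = rng.integers(0, w - PATCH + 1)
            img_crop = image[y:y + PATCH, x:x + PATCH].astype(np.float32)
            msk_crop = mask[y:y + PATCH, x:x + PATCH]
            if msk_crop.mean() < MIN_CILIA_FRAC:   # discard crops with too little cilia
                continue
            lo, hi = img_crop.min(), img_crop.max()
            kept.append(((img_crop - lo) / (hi - lo + 1e-8), msk_crop))
        return kept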

Zero-phase PCA sphering (ZCA)
The first augmentation of the underlying data concatenates the input to the backbone U-Net model with the ZCA-transformed data. ZCA maps the underlying data to a version of the data that is "rotated" through the dataspace to ensure certain spectral properties. ZCA in effect can implicitly normalize the data using the most significant (by empirical variance) spatial features present across the dataset. Given a matrix X with rows representing samples and columns for each feature, a sphering (or whitening) transformation W is one which decorrelates X. That is, the covariance of WX must be equal to the identity matrix. By the spectral theorem, the symmetric matrix XX^T (the covariance matrix corresponding to the data, assuming the data is centered) can be decomposed into PDP^T, where P is an orthogonal matrix of eigenvectors and D a diagonal matrix of corresponding eigenvalues of the covariance matrix. ZCA uses the sphering matrix W = PD^{-1/2}P^T and can be thought of as a transformation into the eigenspace of the covariance matrix (projection onto the data's principal axes, as the minimal projection residual is onto the axes with maximal variance), followed by normalization of variance along every axis, and rotation back into the original image space. In order to reduce the amount of two-way correlation in images, Krizhevsky applies ZCA whitening to preprocess CIFAR-10 data before classification and shows that this process nicely preserves features, such as edges [LjWD19].

This ZCA implementation uses the Python SciPy library, which builds on top of low-level hardware-optimized routines such as BLAS and LAPACK to efficiently calculate many linear algebra operations. In particular, these experiments implement ZCA as a generalized whitening technique. While the normal ZCA calculation selects the whitening matrix W = PD^{-1/2}P^T, a more applicable alternative is W = P(D + εI)^{-1/2}P^T, where ε is a hyperparameter which attenuates eigenvalue sensitivity. This new "whitening" is not a proper whitening, since it does not guarantee an identity covariance matrix. It does, however, serve a similar purpose and actually lends some benefits.

Most importantly, it is indeed a generalization of canonical ZCA. That is to say, ε = 0 recovers canonical ZCA, where λ → 1/√λ gives the spectrum of W on the eigenvalues. Otherwise, ε > 0 results in the map λ → 1/√(λ + ε). In this case, while all eigenvalues map to smaller values compared to the original map, the smallest eigenvalues map to significantly smaller values compared to the original map. This means that ε serves to "dampen" the effects of whitening for particularly small eigenvalues. This is a valuable feature, since oftentimes in image analysis low eigenvalues (and the span of their corresponding eigenvectors) tend to capture high-frequency data. Such data is essential for tasks such as texture analysis, and thus tuning the value of ε helps to preserve this data. ZCA maps for various values of ε on a sample image are shown in figure 3.

Fig. 3: Comparison of ZCA maps on a cilia sample image with various levels of ε. The original image is followed by maps with ε = 1e−4, ε = 1e−5, ε = 1e−6, and ε = 1e−7, from left to right.
Sparse Autoencoder (SAE)
Similar in aim to ZCA, an SAE can augment the underlying images to further filter and reduce noise while allowing the construction and retention of potentially nonlinear spatial features. Autoencoders are deep learning models that first compress data into a low-level latent space and then attempt to reconstruct images from the low-level representation. SAEs in particular add an additional constraint, usually via the loss function, that encourages sparsity (i.e., less activation) in the hidden layers of the network. Xu et al. use the SAE architecture for breast cancer nuclear detection and show that the architecture preserves essential, high-level, and often nonlinear aspects of the initial imagery, such as shape and color, even when unlabelled [XXL+16]. An adaptation of the first two terms of their loss function enforces sparsity:

    L_SAE(θ) = (1/N) Σ_{k=1}^{N} L(x^{(k)}, d_θ̂(e_θ̌(x^{(k)}))) + α (1/n) Σ_{j=1}^{n} KL(ρ || ρ̂_j)

The first term is a standard reconstruction loss (mean squared error), whereas the latter is the mean Kullback-Leibler (KL) divergence between ρ̂_j, the average activation of the j-th hidden neuron in the encoder, and ρ, the enforced activation. For the experiments performed here, ρ = 0.05 remains constant while the values of α vary, specifically 1e−2, 1e−3, and 1e−4, and for each a static dataset is created for feeding into the segmentation model. A larger α prioritizes sparsity over reconstruction accuracy, which, to an extent, is hypothesized to retain significant low-level features of the cilia. Reconstructions with various values of α are shown in figure 4.

Fig. 4: Comparison of SAE reconstructions from different training instances with various levels of α (the activation loss weight). From left to right: original image, α = 1e−2 reconstruction, α = 1e−3 reconstruction, α = 1e−4 reconstruction.
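Expressed in PyTorch, the loss above might look like the sketch below, where the encoder's hidden activations are assumed to be sigmoid outputs in [0, 1] and ρ̂ is their mean over the batch. The function and argument names are illustrative rather than taken from the project code.

    import torch
    import torch.nn.functional as F

    def sae_loss(x, x_hat, hidden, rho=0.05, alpha=1e-3):
        """MSE reconstruction term plus an alpha-weighted KL sparsity penalty."""
        recon = F.mse_loss(x_hat, x)
        # Mean activation of each hidden unit over the batch, kept away from 0 and 1.
        rho_hat = hidden.mean(dim=0).clamp(1e-6, 1 - 1e-6)
        kl = (rho * torch.log(rho / rho_hat)
              + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).mean()
        return recon + alpha * kl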
A significant amount of freedom can be found in the potential architectural choices for the SAE. A focus on low-to-medium complexity models both provides efficiency and minimizes overfitting and artifacts as a consequence of degenerate autoencoding. One important danger to be aware of is that SAEs, and indeed all AEs, are at risk of a degenerate solution wherein a sufficiently complex decoder essentially learns to become a hashmap of arbitrary (and potentially random) encodings.

The SAE will therefore utilize a CNN architecture, as opposed to more modern transformer-style architectures, since the simplicity and induced spatial bias provide potent defenses against overfitting and mode collapse. Furthermore, the decoder will use Spatial Broadcast Decoding (SBD), which provides a method for decoding from a latent vector using size-preserving convolutions, thereby preserving the spatial bias even in decoding and eliminating the artifacts generated by alternate decoding strategies such as "transposed" convolutions [WMBL19].

Spatial Broadcast Decoding (SBD)
Spatial Broadcast Decoding provides an alternative to "transposed" (or "skip") convolutions for upsampling images in the decoder portion of CNN-based autoencoders. Rather than maintaining the square shape, and hence the associated spatial properties, of the latent representation, the output of the encoder is reshaped into a single one-dimensional tensor per input image, which is then tiled to the shape of the desired image (in this case, 128 × 128). In this way, the initial dimension of the latent vector becomes the number of input channels when fed into the decoder, and two additional channels are added to represent 2-dimensional spatial coordinates. In its initial publication, SBD has been shown to provide effective results in disentangling latent space representations in various autoencoder models.

Fig. 5: Illustration and pseudocode for Spatial Broadcast Decoding [WMBL19]
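A minimal PyTorch sketch of such a decoder is given below: the latent vector is tiled across a 128 × 128 grid, two coordinate channels are appended, and only stride-1 (size-preserving) convolutions follow. The layer widths and depths here are placeholders, not the architecture used in the experiments.

    import torch
    import torch.nn as nn

    class SpatialBroadcastDecoder(nn.Module):
        """Tile a latent vector to H x W, append x/y coordinates, then apply
        size-preserving (stride-1) convolutions to produce the reconstruction."""

        def __init__(self, latent_dim, out_channels=1, size=128):
            super().__init__()
            self.size = size
            ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                                    torch.linspace(-1, 1, size), indexing="ij")
            self.register_buffer("coords", torch.stack([xs, ys])[None])  # (1, 2, H, W)
            self.net = nn.Sequential(
                nn.Conv2d(latent_dim + 2, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, out_channels, 1),
            )

        def forward(self, z):                      # z: (batch, latent_dim)
            b = z.shape[0]
            tiled = z[:, :, None, None].expand(-1, -1, self.size, self.size)
            coords = self.coords.expand(b, -1, -1, -1)
            return self.net(torch.cat([tiled, coords], dim=1))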
U-Net
All models use a standard U-Net and undergo the same training process to provide a solid basis for analysis. Besides the number of input channels to the model (1, plus the number of augmentation channels from SAE and ZCA, up to 3 total channels), the model architecture is identical for all runs. A single-channel (original image) U-Net is trained first as a basis point for analysis. The model is then trained on the two-channel inputs provided by ZCA (the original image concatenated with the ZCA-mapped one) for the various ε values of the dataset, and similarly on the two-channel inputs provided by the SAE for the various α values. Finally, composite models are trained with a few combinations of ZCA and SAE hyperparameters. Each training process uses binary cross entropy loss with a learning rate of 1e−3 for 225 epochs.
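The sketch below illustrates this setup: the raw crop and its feature maps are stacked as channels, and a model is optimized with binary cross entropy at a learning rate of 1e−3 for 225 epochs. The tensors and the single-convolution stand-in for the U-Net are placeholders so the snippet stays self-contained; they are not the project's actual data or network.

    import torch
    import torch.nn as nn

    # Placeholder crop, feature maps, and mask; in practice these come from the
    # preprocessing, ZCA, and SAE steps described above.
    raw, zca_map, sae_recon = (torch.rand(128, 128) for _ in range(3))
    mask = (torch.rand(128, 128) > 0.85).float()

    x = torch.stack([raw, zca_map, sae_recon]).unsqueeze(0)  # (1, 3, 128, 128) composite input
    unet = nn.Conv2d(3, 1, kernel_size=3, padding=1)         # stand-in for a 3-channel U-Net

    criterion = nn.BCEWithLogitsLoss()                       # binary cross entropy on logits
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)

    for _ in range(225):
        optimizer.zero_grad()
        loss = criterion(unet(x), mask.view(1, 1, 128, 128))
        loss.backward()
        optimizer.step()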

Results
Figures 6, 7, 8, and 9 show masks produced on validation data from instances of the four model types. While the former three show results near the end of training (about 200-250 epochs), figure 9 was taken only 10 epochs into the training process. Notably, this model, the composite pipeline, produced usable artifacts in mere minutes of training, whereas the other models did not produce similar results until after about 10-40 epochs.

Fig. 6: Artifacts generated during the training of U-Net. From left to right: original image, generated segmentation mask (pre-threshold), ground-truth segmentation mask

Fig. 7: Artifacts generated during the training of ZCA+U-Net. From left to right: original image, ZCA-mapped image, generated segmentation mask (pre-threshold), ground-truth segmentation mask

Fig. 8: Artifacts generated during the training of SAE+U-Net. From left to right: original image, SAE-reconstructed image, generated segmentation mask (pre-threshold), ground-truth segmentation mask

Fig. 9: Artifacts generated 10 epochs into the training of the composite U-Net. From left to right: original image, ZCA-mapped image, SAE-mapped image, generated segmentation mask (pre-threshold), ground-truth segmentation mask

Figure 10 provides a summary of the experiments performed with SAE- and ZCA-augmented data, along with a few composite models and a base U-Net for comparison. These models were produced with data augmentation at the various values of α (for the Sparse Autoencoder loss function) and ε (for ZCA) discussed above. While the table provides five metrics, those of primary importance are the Intersection over Union (IoU), or Jaccard score, and the Dice (or F1) score, which are the most commonly used metrics for evaluating the performance of segmentation models. Most feature extraction models at least marginally improve the performance of the U-Net in terms of IoU and Dice scores, and the best-performing composite model (with ε of 1e−4 for ZCA and α of 1e−3 for SAE) provides an improvement of approximately 10% over the base U-Net in these metrics. There does not seem to be an obvious correlation between which feature extraction hyperparameters provided the best performance for the individual ZCA+U-Net and SAE+U-Net models versus those for the composite pipeline, but further experiments may assist in analyzing this possibility.
The base U-Net does outperform the others in precision, however. Analysis of the predicted masks from various models, some of which are shown in figure 11, shows that the base U-Net model tends to under-predict cilia, explaining the relatively high precision. Previous endeavors in cilia segmentation also revealed this pattern.

Model          ε (ZCA)   α (SAE)    IoU      Accuracy   Recall    Dice     Precision
U-Net (base)      —         —       0.399    0.759      0.501     0.529    0.692
ZCA + U-Net     1e−4        —       0.395    0.754      0.509     0.513    0.625
                1e−5        —       0.401    0.732      0.563     0.539    0.607
                1e−6        —       0.408    0.756      0.543     0.546    0.644
                1e−7        —       0.419    0.758      0.563     0.557    0.639
SAE + U-Net       —       1e−2      0.380    0.719      0.568     0.520    0.558
                  —       1e−3      0.398    0.751      0.512     0.526    0.656
                  —       1e−4      0.416    0.735      0.607     0.555    0.603
Composite       1e−4      1e−2      0.401    0.761      0.506     0.521    0.649
                1e−4      1e−3      0.441*   0.767      0.580     0.585*   0.661
                1e−4      1e−4      0.305    0.722      0.398     0.424    0.588
                1e−5      1e−2      0.392    0.707      0.624*    0.530    0.534
                1e−5      1e−3      0.413    0.770*     0.514     0.546    0.678
                1e−5      1e−4      0.413    0.751      0.565     0.550    0.619
                1e−6      1e−2      0.392    0.719      0.602     0.527    0.571
                1e−6      1e−3      0.395    0.759      0.480     0.521    0.711*
                1e−6      1e−4      0.405    0.729      0.587     0.545    0.591
                1e−7      1e−2      0.383    0.753      0.487     0.503    0.655
                1e−7      1e−3      0.380    0.736      0.526     0.519    0.605
                1e−7      1e−4      0.293    0.674      0.445     0.418    0.487

Fig. 10: A summary of segmentation scores on test data for a base U-Net model, ZCA+U-Net, SAE+U-Net, and a composite model, with various feature extraction hyperparameters. The best result for each scoring metric is marked with an asterisk.

Fig. 11: Comparison of predicted masks and ground truth for three test images. The columns show, from left to right, the original image, its ZCA map, its SAE reconstruction, the ground truth mask, and the masks predicted by the base U-Net, ZCA + U-Net, SAE + U-Net, and composite models. ZCA-mapped images with ε = 1e−4 and SAE reconstructions with α = 1e−3 are used where applicable.

Conclusions
This paper highlights the current shortcomings of automated, deep-learning based segmentation models for cilia, specifically on the data provided to the Quinn Research Group, and provides two additional methods, Zero-Phase PCA Sphering (ZCA) and Sparse Autoencoders (SAE), for performing feature-extracting augmentations with the purpose of aiding a U-Net model in segmentation. An analysis of U-Nets with various combinations of these feature extractors and parameters helps determine the feasibility of low-level feature extraction for improving cilia segmentation, and results from initial experiments show up to 10% increases in relevant metrics.

While these improvements have, in general, been marginal, these results show that pre-segmentation feature extraction methods, particularly the avenues explored here, provide a worthwhile path of exploration and research for improving cilia segmentation.

The implications internal to other projects within the research group sponsoring this research are clear. As discussed earlier, later pipelines of ciliary representation and modeling are currently being bottlenecked by the poor segmentation masks produced by base U-Nets, and the under-segmented predictions provided by the original model limit the scope of what these later stages may achieve. Better predictions hence tend to transfer to better downstream results.

These results also have significant implications outside of the specific task of cilia segmentation and modeling. The inherent problem that motivated the introduction of feature extraction into the segmentation process was the poor quality of the given dataset. From occlusion to poor lighting to blurred images, these are problems that typically plague segmentation models in the real world, where data sets are not of ideal quality. For many modern computer vision tasks, segmentation is a necessary technique for beginning the analysis of certain objects in an image, including all manner of objects from people to vehicles to landscapes. Many images for these tasks are likely to come from low-resolution sources, whether satellite data or security cameras, and are likely to face similar problems as the given cilia dataset in terms of image quality. Even if this is not the case, manual labelling, like that of this dataset and convenient in many other instances, is prone to error and is likely to bottleneck results. As the experiments have shown, feature extraction through SAE and ZCA maps is a potential avenue for the improvement of such models and would be an interesting topic to explore on other problematic datasets.

Especially compelling, aside from the raw numeric results, is how soon the composite pipelines began to produce usable masks on training data. As discussed earlier, most original U-Net models would take at least 40-50 epochs before showing any accurate predictions on training data. However, when feeding in composite SAE and ZCA data along with the original image, unusually accurate masks were produced within just a couple of minutes, with usable results at 10 epochs. This has potential implications in scenarios such as one-shot and/or unsupervised learning, where models cannot train over a large dataset.

Future Research
While this work establishes a primary direction and a novel perspective for segmenting cilia, there are many interesting and valuable directions for future planned research. In particular, a novel and still-developing alternative to the convolution layer known as the Sharpened Cosine Similarity (SCS) layer has begun to attract some attention. While regular CNNs are proficient at filtering, developing invariance to certain forms of noise and perturbation, they are notoriously poor at serving as a spatial indicator for features. Convolution activations can be high due to changes in luminosity and do not necessarily imply the distribution of the underlying luminosity, therefore losing precise spatial information. By design, SCS avoids these faults by considering the mathematical case of a "normalized" convolution, wherein neither the magnitude of the input nor of the kernel affects the final output. Instead, SCS activations are dictated purely by the relative magnitudes of the weights in the kernel, which is to say by the spatial distribution of features in the input [Pis22]. Domain knowledge suggests that cilia, while able to vary greatly, all share relatively unique spatial distributions when compared to non-cilia such as cells, out-of-phase structures, microscopy artifacts, etc. Therefore, SCS may provide a strong augmentation to the backbone U-Net model by acting as an additional layer in tandem with the
already existing convolution layers. This way, the model is a true generalization of the canonical U-Net and is less likely to suffer poor performance due to the introduction of SCS.
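As a rough illustration of the idea (following the general formulation popularized in [Pis22], not a planned implementation for this pipeline), a 2D sharpened cosine similarity layer can be sketched in PyTorch as below. The per-filter exponent p and the small constant q are assumptions of this sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharpenedCosineSimilarity2d(nn.Module):
        """Cosine similarity between each input patch and each kernel, raised to a
        learned exponent, so activations reflect the spatial distribution of the
        patch rather than its magnitude."""

        def __init__(self, in_ch, out_ch, kernel_size=3, q=1e-3):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
            self.p = nn.Parameter(torch.ones(out_ch))   # per-filter sharpening exponent
            self.q = q

        def forward(self, x):
            dots = F.conv2d(x, self.weight, padding="same")            # patch/kernel dot products
            ones = torch.ones_like(self.weight[:1])                    # (1, in_ch, k, k) summing kernel
            patch_norm = torch.sqrt(F.conv2d(x * x, ones, padding="same") + 1e-12)
            kernel_norm = self.weight.flatten(1).norm(dim=1).view(1, -1, 1, 1)
            cos = dots / ((patch_norm + self.q) * (kernel_norm + self.q))
            return torch.sign(cos) * cos.abs() ** self.p.view(1, -1, 1, 1)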
Another avenue of exploration would be a more robust ablation study on some of the hyperparameters of the feature extractors used. While most of the hyperparameters were chosen based on either canonical choices [XXL+16] or empirical study (e.g., ε for ZCA whitening), a more comprehensive hyperparameter search would be worth consideration. This would be especially valuable for the composite model, since the choice of the most optimal hyperparameters depends on the downstream task and may therefore be different for the composite model than what was found for the individual models.

More robust data augmentation could additionally improve results. Image cropping and basic augmentation methods alone provided only minor improvements of the base U-Net over the state of the art. Regarding the cropping method, an upper threshold on the percentage of cilia per image may be worth implementing, as cropped images containing over approximately 90% cilia produced poor results, likely due to a lack of surrounding context. Additionally, rotations and lighting/contrast adjustments could further augment the data set during the training process.

Re-segmenting the cilia images by hand, a planned endeavor, will likely provide more accurate masks for the training process. This is an especially difficult task for the cilia dataset, as the poor lighting and focus cause even medical professionals to disagree on the exact location of cilia in certain instances. However, the research group associated with this paper is currently in the process of setting up a web interface for such professionals to "vote" on segmentation masks. Additionally, it is likely worth experimenting with various thresholds for converting U-Net outputs into masks, and potentially some form of region growing to dynamically aid the process.

Finally, it is possible to train the SAE and U-Net jointly as an end-to-end system. Current experimentation has foregone this path due to the additional computational and memory complexity, and has instead opted for separate training to at least justify this direction of exploration. Training in an end-to-end fashion could lead to a more optimal result and potentially even an interesting latent representation of ciliary features in the image. It is worth noting that larger end-to-end systems like this tend to be more difficult to train and balance, and such architectures can fall into degenerate solutions more readily.

REFERENCES

[DvBB+21] Cenna Doornbos, Ronald van Beek, Ernie MHF Bongers, Dorien Lugtenberg, Peter Klaren, Lisenka ELM Vissers, Ronald Roepman, Machteld M Oud, et al. Cell-based assay for ciliopathy patients to improve accurate diagnosis using ALPACA. European Journal of Human Genetics, 29(11):1677–1689, 2021. doi:10.1038/s41431-021-00907-9.
[Ish17] Takashi Ishikawa. Axoneme structure from motile cilia. Cold Spring Harbor Perspectives in Biology, 9(1):a028076, 2017. doi:10.1101/cshperspect.a028076.
[LjWD19] Hui Li, Xiao-jun Wu, and Tariq S. Durrani. Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology, 102:103039, 2019. doi:10.1016/j.infrared.2019.103039.
[LMZ+18] Charles Lu, M. Marx, M. Zahid, C. W. Lo, Chakra Chennubhotla, and Shannon P. Quinn. Stacked neural networks for end-to-end ciliary motion analysis. CoRR, 2018. doi:10.48550/arXiv.1803.07534.
[LWL+18] Fangzhao Li, Changjian Wang, Xiaohui Liu, Yuxing Peng, and Shiyao Jin. A composite model of wound segmentation based on traditional methods and deep neural networks. Computational Intelligence and Neuroscience, 2018, 2018. doi:10.1155/2018/4149103.
[Pis22] Raphael Pisoni. Sharpened cosine distance as an alternative for convolutions, Jan 2022. URL: https://www.rpisoni.dev.
[QZD+15] Shannon P. Quinn, Maliha J. Zahid, John R. Durkin, Richard J. Francis, Cecilia W. Lo, and S. Chakra Chennubhotla. Automated identification of abnormal respiratory ciliary motion in nasal biopsies. Science Translational Medicine, 7(299):299ra124, 2015. doi:10.1126/scitranslmed.aaa1233.
[RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. CoRR, 2015. doi:10.48550/arXiv.1505.04597.
[WMBL19] Nicholas Watters, Loïc Matthey, Christopher P. Burgess, and Alexander Lerchner. Spatial broadcast decoder: A simple architecture for learning disentangled representations in VAEs. CoRR, 2019. doi:10.48550/arXiv.1901.07017.
[XXL+16] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong Wu, Jinghai Tang, and Anant Madabhushi. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Transactions on Medical Imaging, 35(1):119–130, 2016. doi:10.1109/TMI.2015.2458702.
[ZRS+20] Meekail Zain, Sonia Rao, Nathan Safir, Quinn Wyner, Isabella Humphrey, Alex Eldridge, Chenxiao Li, BahaaEddin AlAila, and Shannon Quinn. Towards an unsupervised spatiotemporal representation of cilia video using a modular generative pipeline. In Proceedings of the Python in Science Conference, 2020. doi:10.25080/majora-342d178e-017.