Proceedings of the 21st Python in Science Conference P ROCEEDINGS OF THE 21 ST P YTHON IN S CIENCE C ONFERENCE Edited by Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe. SciPy 2022 Austin, Texas July 11 - July 17, 2022 Copyright c 2022. The articles in the Proceedings of the Python in Science Conference are copyrighted and owned by their original authors This is an open-access publication and is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For more information, please see: http://creativecommons.org/licenses/by/3.0/ ISSN:2575-9752 https://doi.org/10.25080/majora-212e5952-046 O RGANIZATION Conference Chairs J ONATHAN G UYER, NIST A LEXANDRE C HABOT-L ECLERC, Enthought, Inc. Program Chairs M ATT H ABERLAND, Cal Poly J ULIE H OLLEK, Mozilla M ADICKEN M UNK, University of Illinois G UEN P RAWIROATMODJO, Microsoft Corp Communications A RLISS C OLLINS, NumFOCUS M ATT D AVIS, Populus D AVID N ICHOLSON, Embedded Intelligence Birds of a Feather A NDREW R EID, NIST A NASTASIIA S ARMAKEEVA, George Washington University Proceedings M EGHANN A GARWAL, Overhaul C HRIS C ALLOWAY, University of North Carolina D ILLON N IEDERHUT, Novi Labs D AVID S HUPE, Caltech’s IPAC Astronomy Data Center Financial Aid S COTT C OLLIS, Argonne National Laboratory N ADIA TAHIRI, Université de Montréal Tutorials M IKE H EARNE, USGS L OGAN T HOMAS, Enthought, Inc. Sprints TANIA A LLARD, Quansight Labs B RIGITTA S IP ŐCZ, Caltech/IPAC Diversity C ELIA C INTAS, IBM Research Africa B ONNY P M C C LAIN, O’Reilly Media FATMA TARLACI, OpenTeams Activities PAUL A NZEL, Codecov I NESSA PAWSON, Albus Code Sponsors K RISTEN L EISER, Enthought, Inc. Financial C HRIS C HAN, Enthought, Inc. B ILL C OWAN, Enthought, Inc. J ODI H AVRANEK, Enthought, Inc. Logistics K RISTEN L EISER, Enthought, Inc. Proceedings Reviewers A ILEEN N IELSEN A JIT D HOBALE A LEJANDRO C OCA -C ASTRO A LEXANDER YANG B HUPENDRA A R AUT B RADLEY D ICE B RIAN G UE C ADIOU C ORENTIN C ARL S IMON A DORF C HEN Z HANG C HIARA M ARMO C HITARANJAN M AHAPATRA C HRIS C ALLOWAY D ANIEL W HEELER D AVID N ICHOLSON D AVID S HUPE D ILLON N IEDERHUT D IPTORUP D EB J ELENA M ILOSEVIC M ICHAL M ACIEJEWSKI E D R OGERS H IMAGHNA B HATTACHARJEE H ONGSUP S HIN I NDRANEIL PAUL I VAN M ARROQUIN J AMES L AMB J YH -M IIN L IN J YOTIKA S INGH K ARTHIK M URUGADOSS K EHINDE A JAYI K ELLY L. R OWLAND K ELVIN L EE K EVIN M AIK J ABLONKA K EVIN W. B EAM K UNTAO Z HAO M ARUTHI NH M ATT C RAIG M ATTHEW F EICKERT M EGHANN A GARWAL M ELISSA W EBER M ENDONÇA O NURALP S OYLEMEZ R OHIT G OSWAMI RYAN B UNNEY S HUBHAM S HARMA S IDDHARTHA S RIVASTAVA S USHANT M ORE T ETSUO K OYAMA T HOMAS N ICHOLAS V ICTORIA A DESOBA V IDHI C HUGH V IVEK S INHA W ENDUO Z HOU Z UHAL C AKIR ACCEPTED TALK S LIDES B UILDING B INARY E XTENSIONS WITH PYBIND 11, SCIKIT- BUILD , AND CIBUILDWHEEL, Henry Schreiner, and Joe Rickerby, and Ralf Grosse-Kunstleve, and Wenzel Jakob, and Matthieu Darbois, and Aaron Gokaslan, and Jean-Christophe Fillion- Robin, and Matt McCormick doi.org/10.25080/majora-212e5952-033 P YTHON D EVELOPMENT S CHEMES FOR M ONTE C ARLO N EUTRONICS ON H IGH P ERFORMANCE C OMPUTING, Jack- son P. Morgan, and Kyle E. Niemeyer doi.org/10.25080/majora-212e5952-034 AWKWARD PACKAGING : B UILDING SCIKIT-HEP, Henry Schreiner, and Jim Pivarski, and Eduardo Rodrigues doi.org/10.25080/majora-212e5952-035 D EVELOPMENT OF A CCESSIBLE , A ESTHETICALLY-P LEASING C OLOR S EQUENCES, Matthew A. Petroff doi.org/10.25080/majora-212e5952-036 C UTTING E DGE C LIMATE S CIENCE IN THE C LOUD WITH PANGEO, Julius Busecke doi.org/10.25080/majora-212e5952-037 P YLIRA : DECONVOLUTION OF IMAGES IN THE PRESENCE OF P OISSON NOISE, Axel Donath, and Aneta Siemiginowska, and Vinay Kashyap, and Douglas Burke, and Karthik Reddy Solipuram, and David van Dyk doi.org/10.25080/majora-212e5952-038 A CCELERATING S CIENCE WITH THE G ENERATIVE T OOLKIT FOR S CIENTIFIC D ISCOVERY (GT4SD), GT4SD team doi.org/10.25080/majora-212e5952-039 MM ODEL : A MODULAR MODELING FRAMEWORK FOR SCIENTIFIC PROTOTYPING, Peter Sun, and John A. Marohn doi.org/10.25080/majora-212e5952-03a M ONACO : Q UANTIFY U NCERTAINTY AND S ENSITIVITIES IN Y OUR C OMPUTATIONAL M ODELS WITH A M ONTE C ARLO L IBRARY, W. Scott Shambaugh doi.org/10.25080/majora-212e5952-03b UF UNCS AND DT YPES : NEW POSSIBILITIES IN N UM P Y, Sebastian Berg, and Stéfan van der Walt doi.org/10.25080/majora-212e5952-03c P ER P YTHON AD ASTRA : INTERACTIVE A STRODYNAMICS WITH POLIASTRO , Juan Luis Cano Rodrı́guez doi.org/10.25080/majora-212e5952-03d PYAMPUTE : A P YTHON LIBRARY FOR DATA AMPUTATION, Rianne M Schouten, and Davina Zamanzadeh, and Prabhant Singh doi.org/10.25080/majora-212e5952-03e S CIENTIFIC P YTHON : F ROM G IT H UB TO T IK T OK, Juanita Gomez Romero, and Stéfan van der Walt, and K. Jarrod Millman, and Melissa Weber Mendonça, and Inessa Pawson doi.org/10.25080/majora-212e5952-03f S CIENTIFIC P YTHON : B Y MAINTAINERS , FOR MAINTAINERS, Pamphile T. Roy, and Stéfan van der Walt, and K. Jarrod Millman, and Melissa Weber Mendonça doi.org/10.25080/majora-212e5952-040 I MPROVING RANDOM SAMPLING IN P YTHON : SCIPY. STATS . SAMPLING AND SCIPY. STATS . QMC, Pamphile T. Roy, and Matt Haberland, and Christoph Baumgarten, and Tirth Patel doi.org/10.25080/majora-212e5952-041 P ETABYTE - SCALE OCEAN DATA ANALYTICS ON STAGGERED GRIDS VIA THE GRID UFUNC PROTOCOL IN X GCM, Thomas Nicholas, and Julius Busecke, and Ryan Abernathey doi.org/10.25080/majora-212e5952-042 ACCEPTED P OSTERS O PTIMAL R EVIEW A SSIGNMENTS FOR THE S CI P Y C ONFERENCE U SING B INARY I NTEGER L INEAR P ROGRAMMING IN S CI P Y 1.9, Matt Haberland, and Nicholas McKibben doi.org/10.25080/majora-212e5952-029 C ONTRIBUTING TO O PEN S OURCE S OFTWARE : F ROM NOT KNOWING P YTHON TO BECOMING A S PYDER CORE DE - VELOPER , Daniel Althviz Moré doi.org/10.25080/majora-212e5952-02a S EMI -S UPERVISED S EMANTIC A NNOTATOR (S3A): T OWARD E FFICIENT S EMANTIC I MAGE L ABELING, Nathan Jessu- run, and Olivia P. Dizon-Paradis, and Dan E. Capecci, and Damon L. Woodard, and Navid Asadizanjani doi.org/10.25080/majora-212e5952-02b B IOFRAME : O PERATING ON G ENOMIC I NTERVAL D ATAFRAMES, Nezar Abdennur, and Geoffrey Fudenberg, and Ilya M. Flyamer, and Aleksandra Galitsyna, and Anton Goloborodko, and Maxim Imakaev, and Trevor Manz, and Sergey V. Venev doi.org/10.25080/majora-212e5952-02c L IKENESS : A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS, Joseph V. Tuccillo, and James D. Gaboardi doi.org/10.25080/majora-212e5952-02d PYA UDIO P ROCESSING : A UDIO P ROCESSING , F EATURE E XTRACTION , AND M ACHINE L EARNING M ODELING , Jy- otika Singh doi.org/10.25080/majora-212e5952-02e K IWI : P YTHON T OOL FOR T EX P ROCESSING AND C LASSIFICATION, Neelima Pulagam, and Sai Marasani, and Brian Sass doi.org/10.25080/majora-212e5952-02f P HYLOGEOGRAPHY: A NALYSIS OF GENETIC AND CLIMATIC DATA OF SARS-C O V-2, Wanlin Li, and Aleksandr Koshkarov, and My-Linh Luu, and Nadia Tahiri doi.org/10.25080/majora-212e5952-030 D ESIGN OF A S CIENTIFIC D ATA A NALYSIS S UPPORT P LATFORM, Nathan Martindale, and Jason Hite, and Scott Stewart, and Mark Adams doi.org/10.25080/majora-212e5952-031 O PENING ARM: A PIVOT TO COMMUNITY SOFTWARE TO MEET THE NEEDS OF USERS AND STAKEHOLDERS OF THE PLANET ’ S LARGEST CLOUD OBSERVATORY , Zachary Sherman, and Scott Collis, and Max Grover, and Robert Jackson, and Adam Theisen doi.org/10.25080/majora-212e5952-032 S CI P Y TOOLS P LENARIES S CI P Y T OOLS P LENARY - CEL TEAM, Inessa Pawson doi.org/10.25080/majora-212e5952-043 S CI P Y T OOLS P LENARY ON M ATPLOTLIB, Elliott Sales de Andrade doi.org/10.25080/majora-212e5952-044 S CI P Y T OOLS P LENARY - N UM P Y, Inessa Pawson doi.org/10.25080/majora-212e5952-045 L IGHTNING TALKS D OWNSAMPLING T IME S ERIES D ATA FOR V ISUALIZATIONS, Delaina Moore doi.org/10.25080/majora-212e5952-027 A NALYSIS AS A PPLICATIONS : Q UICK INTRODUCTION TO LOCKFILES, Matthew Feickert doi.org/10.25080/majora-212e5952-028 S CHOLARSHIP R ECIPIENTS A MAN G OEL, University of Delhi A NURAG S AHA R OY, Saarland University I SURU F ERNANDO, University of Illinois at Urbana Champaign K ELLY M EEHAN, US Forest Service K ADAMBARI D EVARAJAN, University of Rhode Island K RISHNA K ATYAL, Thapar Institute of Engineering and Technology M ATTHEW M URRAY, Dask N AMAN G ERA, Sympy, LPython R OHIT G OSWAMI, University of Iceland S IMON C ROSS, QuTIP TANYA A KUMU, IBM Research Z UHAL C AKIR, Purdue University C ONTENTS The Advanced Scientific Data Format (ASDF): An Update 1 Perry Greenfield, Edward Slavich, William Jamieson, Nadia Dencheva Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling 7 Nathan Jessurun, Daniel E. Capecci, Olivia P. Dizon-Paradis, Damon L. Woodard, Navid Asadizanjani Galyleo: A General-Purpose Extensible Visualization Solution 13 Rick McGeer, Andreas Bergen, Mahdiyar Biazi, Matt Hemmings, Robin Schreiber USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application 22 Amanda Catlett, Theresa R. Coumbe, Scott D. Christensen, Mary A. Byrant Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI 26 Luigi Cruz, Wael Farah, Richard Elkins Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration 28 Pi-Yueh Chuang, Lorena A. Barba atoMEC: An open-source average-atom Python code 37 Timothy J. Callow, Daniel Kotik, Eli Kraisler, Attila Cangi Automatic random variate generation in Python 46 Christoph Baumgarten, Tirth Patel Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger Materials Suite 52 Alexandr Fonari, Farshad Fallah, Michael Rauch A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space 60 Seyed Alireza Vaezi, Gianni Orlando, Mojtaba Fazli, Gary Ward, Silvia Moreno, Shannon Quinn The myth of the normal curve and what to do about it 64 Allan Campopiano Python for Global Applications: teaching scientific Python in context to law and diplomacy students 69 Anna Haensch, Karin Knudson Papyri: better documentation for the scientific ecosystem in Jupyter 75 Matthias Bussonnier, Camille Carvalho Bayesian Estimation and Forecasting of Time Series in statsmodels 83 Chad Fulton Python vs. the pandemic: a case study in high-stakes software development 90 Cliff C. Kerr, Robyn M. Stuart, Dina Mistry, Romesh G. Abeysuriya, Jamie A. Cohen, Lauren George, Michał Jastrzebski, Michael Famulare, Edward Wenger, Daniel J. Klein Pylira: deconvolution of images in the presence of Poisson noise 98 Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels 105 Geoffrey M. Poore Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder 110 Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn Awkward Packaging: building Scikit-HEP 115 Henry Schreiner, Jim Pivarski, Eduardo Rodrigues Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber 121 Ido Michael Likeness: a toolkit for connecting the social fabric of place to human dynamics 125 Joseph V. Tuccillo, James D. Gaboardi poliastro: a Python library for interactive astrodynamics 136 Juan Luis Cano Rodrı́guez, Jorge Martı́nez Garrido A New Python API for Webots Robotics Simulations 147 Justin C. Fisher pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling 152 Jyotika Singh Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2 159 Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri Global optimization software library for research and education 167 Nadia Udler Temporal Word Embeddings Analysis for Disease Prevention 171 Nathan Jacobi, Ivan Mo, Albert You, Krishi Kishore, Zane Page, Shannon P. Quinn, Tim Heckman Design of a Scientific Data Analysis Support Platform 179 Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python Ecosystem 187 Orhan Eroglu, Anissa Zacharias, Michaela Sizemore, Alea Kootz, Heather Craker, John Clyne popmon: Analysis Package for Dataset Shift Detection 194 Simon Brugman, Tomas Sostak, Pradyot Patil, Max Baak pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe 202 Willy Menacho, Gonzalo Marcelo Ramı́rez-Ávila, Horacio V. Guzman Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods 210 Robert Jackson, Rebecca Gjini, Sri Hari Krishna Narayanan, Matt Menickelly, Paul Hovland, Jan Hückelheim, Scott Collis RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible 217 João Lemes Gribel Soares, Mateus Stano Junqueira, Oscar Mauricio Prada Ramirez, Patrick Sampaio dos Santos Brandão, Adriano Augusto Antongiovanni, Guilherme Fernandes Alves, Giovani Hidalgo Ceotto Wailord: Parsers and Reproducibility for Quantum Chemistry 226 Rohit Goswami Variational Autoencoders For Semi-Supervised Deep Metric Learning 231 Nathan Safir, Meekail Zain, Curtis Godwin, Eric Miller, Bella Humphrey, Shannon P Quinn A Python Pipeline for Rapid Application Development (RAD) 240 Scott D. Christensen, Marvin S. Brown, Robert B. Haehnel, Joshua Q. Church, Amanda Catlett, Dallon C. Schofield, Quyen T. Brannon, Stacy T. Smith Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses 244 W. Scott Shambaugh Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis 251 Zachary del Rosario Low Level Feature Extraction for Cilia Segmentation 259 Meekail Zain, Eric Miller, Shannon P Quinn, Cecilia Lo PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1 The Advanced Scientific Data Format (ASDF): An Update Perry Greenfield‡∗ , Edward Slavich‡† , William Jamieson‡† , Nadia Dencheva‡† F Abstract—We report on progress in developing and extending the new (ASDF) by outlining our near term plans for further improvements and format we have developed for the data from the James Webb and Nancy Grace extensions. Roman Space Telescopes since we reported on it at a previous Scipy. While the format was developed as a replacement for the long-standard FITS format Summary of Motivations used in astronomy, it is quite generic and not restricted to use with astronomical • Suitable as an archival format: data. We will briefly review the format, and extensions and changes made to the standard itself, as well as to the reference Python implementation we have – Old versions continue to be supported by developed to support it. The standard itself has been clarified in a number libraries. of respects. Recent improvements to the Python implementation include an – Format is sufficiently transparent (e.g., not improved framework for conversion between complex Python objects and ASDF, requiring extensive documentation to de- better control of the configuration of extensions supported and versioning of extensions, tools for display and searching of the structured metadata, bet- code) for the fundamental set of capabili- ter developer documentation, tutorials, and a more maintainable and flexible ties. schema system. This has included a reorganization of the components to make – Metadata is easily viewed with any text the standard free from astronomical assumptions. A important motivator for the editor. format was the ability to support serializing functional transforms in multiple dimensions as well as expressions built out of such transforms, which has now • Intrinsically hierarchical been implemented. More generalized compression schemes are now enabled. • Avoids duplication of shared items We are currently working on adding chunking support and will discuss our plan • Based on existing standard(s) for metadata and structure for further enhancements. • No tight constraints on attribute lengths or their values. • Clearly versioned Index Terms—data formats, standards, world coordinate systems, yaml • Supports schemas for validating files for basic structure and value requirements • Easily extensible, both for the standard, and for local or Introduction domain-specific conventions. The Advanced Scientific Data Format (ASDF) was originally developed in 2015. That original version was described in a paper Basics of ASDF Format [Gre15]. That paper described the shortcomings of the widely used • Format consists of a YAML header optionally followed by astronomical standard format FITS [FIT16] as well as those of one or more binary blocks for containing binary data. existing potential alternatives. It is not the goal of this paper to • The YAML [http://yaml.org] header contains all the meta- rehash those points in detail, though it is useful to summarize the data and defines the structural relationship of all the data basic points here. The remainder of this paper will describe where elements. we are using ASDF, what lessons we have learned from using • YAML tags are used to indicate to libraries the semantics ASDF for the James Webb Space Telescope, and summarize the of subsections of the YAML header that libraries can use to most important changes we have made to the standard, the Python construct special software objects. For example, a tag for library that we use to read and write ASDF files, and best practices a data array would indicate to a Python library to convert for using the format. it into a numpy array. We will give an example of a more advanced use case that • YAML anchors and alias are used to share common ele- illustrates some of the powerful advantages of ASDF, and that ments to avoid duplication. its application is not limited to astronomy, but suitable for much • JSON Schema [http://json-schema.org/specification.html], of scientific and engineering data, as well as models. We finish [http://json-schema.org/understanding-json-schema/] is used for schemas to define expectations for tag content * Corresponding author: perry@stsci.edu and whole headers combined with tools to validate actual ‡ Space Telescope Science Institute † These authors contributed equally. ASDF files against these schemas. • Binary blocks are referenced in the YAML to link binary Copyright © 2022 Perry Greenfield et al. This is an open-access article data to YAML attributes. distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • Support for arrays embedded in YAML or in a binary provided the original author and source are credited. block. 2 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) • Streaming support for a single binary block. Changes for 1.6 • Permit local definitions of tags and schemas outside of the Addition of the manifest mechanism standard. • While developed for astronomy, useful for general scien- The manifest is a YAML document that explicitly lists the tags and tific or engineering use. other features introduced by an extension to the ASDF standard. • Aims to be language neutral. It provides a more straightforward way of associating tags with schemas, allowing multiple tags to share the same schema, and generally making it simpler to visualize how tags and schemas Current and planned uses are associated (previously these associations were implied by the James Webb Space Telescope (JWST) Python implementation but were not documented elsewhere). NASA requires JWST data products be made available in the FITS format. Nevertheless, all the calibration pipelines operate Handling of null values and their interpretation on the data using an internal objects very close to the the ASDF The standard didn’t previously specify the behavior regarding null representation. The JWST calibration pipeline uses ASDF to values. The Python library previously removed attributes from the serialize data that cannot be easily represented in FITS, such as YAML tree when the corresponding Python attribute has a None World Coordinate System information. The calibration software value upon writing to an ADSF file. On reading files where the is also capable of reading and producing data products as pure attribute was missing but the schema indicated a default value, ASDF files. the library would create the Python attribute with the default. As mentioned in the next item, we no longer use this mechanism, and Nancy Grace Roman Space Telescope now when written, the attribute appears in the YAML tree with This telescope, with the same mirror size as the Hubble Space a null value if the Python value is None and the schema permits Telescope (HST), but a much larger field of view than HST, will null values. be launched in 2026 or thereabouts. It is to be used mostly in survey mode and is capable of producing very large mosaicked Interpretation of default values in schema images. It will use ASDF as its primary data format. The use of default values in schemas is discouraged since the Daniel K Inoue Solar Telescope interpretation by libraries is prone to confusion if the assemblage This telescope is using ASDF for much of the early data products of schemas conflict with regard to the default. We have stopped to hold the metadata for a combined set of data which can involve using defaults in the Python library and recommend that the ASDF many thousands of files. Furthermore, the World Coordinate file always be explicit about the value rather than imply it through System information is stored using ASDF for all the referenced the schema. If there are practical cases that preclude always data. writing out all values (e.g., they are only relevant to one mode and usually are irrelevant), it should be the library that manages whether such attributes are written conditionally rather using the Vera Rubin Telescope (for World Coordinate System interchange) schema default mechanism. There have been users outside of astronomy using ASDF, as well as contributors to the source code. Add alternative tag URI scheme We now recommend that tag URIs begin with asdf:// Changes to the standard (completed and proposed) These are based on lessons learned from usage. Be explicit about what kind of complex YAML keys are supported The current version of the standard is 1.5.0 (1.6.0 being developed). For example, not all legal YAML keys are supported. Namely The following items reflect areas where we felt improvements YAML arrays, which are not hashable in Python. Likewise, were needed. general YAML objects are not either. The Standard now limits keys to string, integer, or boolean types. If more complex keys are Changes for 1.5 required, they should be encoded in strings. Moving the URI authority from stsci.edu to asdf-format.org Still to be done This is to remove the standard from close association with STScI Upgrade to JSON Schema draft-07 and make it clear that the format is not intended to be controlled There is interest in some of the new features of this version, by one institution. however, this is problematic since there are aspects of this version that are incompatible with draft-04, thus requiring all previous Moving astronomy-specific schemas out of standard schemas to be updated. These primarily affect the previous inclusion of World Coordinate Tags, which are strongly associated with astronomy. Remaining Replace extensions section of file history are those related to time and unit standards, both of obvious gen- erality, but the implementation must be based on some standards, This section is considered too specific to the concept of Python and currently the astropy-based ones are as good or better than extensions, and is probably best replaced with a more flexible any. system for listing extensions used. THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE 3 Changes to Python ASDF package Easier and more flexible mechanism to create new extensions (2.8.0) The previous system for defining extensions to ASDF, now deprecated, has been replaced by a new system that makes the association between tags, schemas, and conversion code more straightforward, as well as providing more intuitive names for the methods and attributes, and makes it easier to handle reference cycles if they are present in the code (also added to the original Tag handling classes). Introduced global configuration mechanism (2.8.0) This reworks how ASDF resources are located, and makes it easier to update the current configuration, as well as track down the location of the needed resources (e.g., schemas and converters), as well as removing performance issues that previously required extracting information from all the resource files thus slowing the Fig. 1: A plot of the compound model defined in the first segment of first asdf.open call. code. Added info/search methods and command line tools (2.6.0) These allow displaying the hierarchical structure of the header and file. This is made possible by the fact that expressions of models the values and types of the attributes. Initially, such introspection are straightforward to represent in YAML structure. stopped at any tagged item. A subsequent change provides mech- Despite the fact that the models are in some sense executable, anisms to see into tagged items (next item). An example of these they are perfectly safe so long as the library they are implemented tools is shown in a later section. in is safe (e.g., it doesn’t implement an "execute any OS com- mand" model). Furthermore, the representation in ASDF does not Added mechanism for info to display tagged item contents (2.9.0) explicitly use Python code. In principle it could be written or read This allows the library that converts the YAML to Python objects in any computer language. to expose a summary of the contents of the object by supplying The following illustrates a relatively simple but not trivial an optional "dunder" method that the info mechanism can take example. advantage of. First we define a 1D model and plot it. import numpy as np Added documentation on how ASDF library internals work import astropy.modeling.models as amm import astropy.units as u These appear in the readthedocs under the heading "Developer import asdf Overview". from matplotlib import pyplot as plt # Define 3 model components with units Plugin API for block compressors (2.8.0) g1 = amm.Gaussian1D(amplitude=100*u.Jy, This enables a localized extension to support further compression mean=120*u.MHz, options. stddev=5.*u.MHz) g2 = amm.Gaussian1D(65*u.Jy, 140*u.MHz, 3*u.MHz) powerlaw = amm.PowerLaw1D(amplitude=10*u.Jy, Support for asdf:// URI scheme (2.8.0) x_0=100*u.MHz, Support for ASDF Standard 1.6.0 (2.8.0) alpha=3) # Define a compound model This is still subject to modifications to the 1.6.0 standard. model = g1 + g2 + powerlaw x = np.arange(50, 200) * u.MHz Modified handling of defaults in schemas and None values (2.8.0) plt.plot(x, model(x)) As described previously. The following code will save the model to an ASDF file, and read it back in Using ASDF to store models af = asdf.AsdfFile() af.tree = {'model': model} This section highlights one aspect of ASDF that few other formats af.write_to('model.asdf') support in an archival way, e.g., not using a language-specific af2 = asdf.open('model.asdf') model2 = af2['model'] mechanism, such as Python’s pickle. The astropy package contains model2 is model a modeling subpackage that defines a number of analytical, as well False as a few table-based, models that can be combined in many ways, model2(103.5) == model(103.5) such as arithmetically, in composition, or multi-dimensional. Thus True it is possible to define fairly complex multi-dimensional models, Listing the relevant part of the ASDF file illustrates how the model many of which can use the built in fitting machinery. has been saved in the YAML header (reformatted to fit in this paper These models, and their compound constructs can be saved column). in ASDF files and later read in to recreate the corresponding model: !transform/add-1.2.0 astropy objects that were used to create the entries in the ASDF forward: 4 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) - !transform/add-1.2.0 something that the FITS format had no hope of managing, nor any forward: other scientific format that we are aware of. - !transform/gaussian1d-1.0.0 amplitude: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 Jy, value: 100.0} Displaying the contents of ASDF files bounding_box: - !unit/quantity-1.1.0 Functionality has been added to display the structure and content {unit: !unit/unit-1.0.0 MHz, value: 92.5} of the header (including data item properties), with a number of - !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 147.5} options of what depth to display, how many lines to display, etc. bounds: An example of the info use is shown in Figure 2. stddev: [1.1754943508222875e-38, null] There is also functionality to search for items in the file by inputs: [x] mean: !unit/quantity-1.1.0 attribute name and/or values, also using pattern matching for {unit: !unit/unit-1.0.0 MHz, value: 120.0} either. The search results are shown as attribute paths to the items outputs: [y] that were found. stddev: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 5.0} - !transform/gaussian1d-1.0.0 ASDF Extension/Converter System amplitude: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 Jy, value: 65.0} There are a number of components that are involved. Converters bounding_box: encapsulate the code that handles converting Python objects to - !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 123.5} and from their ASDF representation. These are classes that inherit - !unit/quantity-1.1.0 from the basic Converter class and define two Class attributes: {unit: !unit/unit-1.0.0 MHz, value: 156.5} tags, types each of which is a list of associated tag(s) and class(es) bounds: that the specific converter class will handle (each converter can stddev: [1.1754943508222875e-38, null] inputs: [x] handle more than one tag type and more than one class). The mean: !unit/quantity-1.1.0 ASDF machinery uses this information to map tags to converters {unit: !unit/unit-1.0.0 MHz, value: 140.0} when reading ASDF content, and to map types to converters when outputs: [y] stddev: !unit/quantity-1.1.0 saving these objects to an ASDF file. {unit: !unit/unit-1.0.0 MHz, value: 3.0} Each converter class is expected to supply two methods: inputs: [x] to_yaml_tree and from_yaml_tree that construct the outputs: [y] YAML content and convert the YAML content to Python class - !transform/power_law1d-1.0.0 alpha: 3.0 instances respectively. amplitude: !unit/quantity-1.1.0 A manifest file is used to associate tags and schema ID’s {unit: !unit/unit-1.0.0 Jy, value: 10.0} so that if a schema has been defined, that the ASDF content inputs: [x] outputs: [y] can be validated against the schema (as well as providing extra x_0: !unit/quantity-1.1.0 information for the ASDF content in the info command). Normally {unit: !unit/unit-1.0.0 MHz, value: 100.0} the converters and manifest are registered with the ASDF library inputs: [x] using standard functions, and this registration is normally (but is outputs: [y] ... not required to be) triggered by use of Python entry points defined in the setup.cfg file so that this extension is automatically Note that there are extra pieces of information that define the recognized when the extension package is installed. model more precisely. These include: One can of course write their own custom code to convert the contents of ASDF files however they want. The advantage of the • many tags indicating special items. These include different tag/converter system is that the objects can be anywhere in the tree kinds of transforms (i.e., functions), quantities (i.e., num- structure and be properly saved and recovered without having any bers with units), units, etc. implied knowledge of what attribute or location the object is at. • definitions of the units used. Furthermore, it brings with it the ability to validate the contents • indications of the valid range of the inputs or parameters by use of schema files. (bounds) Jupyter tutorials that show how to use converters can be found • each function shows the mapping of the inputs and the at: naming of the outputs of each function. • the addition operator is itself a transform. • https://github.com/asdf-format/tutorials/blob/master/ Your_first_ASDF_converter.ipynb Without the use of units, the YAML would be simpler. But • https://github.com/asdf-format/tutorials/blob/master/ the point is that the YAML easily accommodates expression trees. Your_second_ASDF_converter.ipynb The tags are used by the library to construct the astropy models, units and quantities as Python objects. However, nothing in the above requires the library to be written in Python. ASDF Roadmap for STScI Work This machinery can handle multidimensional models and sup- The planned enhancements to ASDF are understandably focussed ports both the combining of models with arithmetic operators as on the needs of STScI missions. Nevertheless, we are particularly well as pipelining the output of one model into another. This interested in areas that have wider benefit to the general scientific system has been used to define complex coordinate transforms and engineering community, and such considerations increase the from telescope detectors to sky coordinates for imaging, and priority of items necessary to STScI. Furthermore, we are eager wavelengths for spectrographs, using over 100 model components, to aid others working on ASDF by providing advice, reviews, and THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE 5 Fig. 2: This shows part of the output of the info command that shows the structure of a Roman Space Telescope test file (provided by the Roman Telescopes Branch at STScI). Displayed is the relative depth of the item, its type, value, and a title extracted from the associated schema to be used as explanatory information. possibly collaborative coding effort. STScI is committed to the Redefining versioning semantics long-term support of ADSF. Previously the meaning of different levels of versioning The following is a list of planned work, in order of decreasing were unclear. The normal inclination is to treat schema priority. version using the typical semantic versioning system de- fined for software. But schemas are not software and Chunking Support we are inclined to use the proposed system for schemas [url: https://snowplowanalytics.com/blog/2014/05/13/introducing- Since the Roman mission is expected to deal with large data schemaver-for-semantic-versioning-of-schemas/] To summarize: sets and mosaicked images, support for chunking is considered in this case the three levels of versioning correspond to: essential. We expect to layer the support in our Python library Model.Revision.Addition where a schema change: on zarr [https://zarr.dev/], with two different representations, one where all data is contained within the ADSF file in separate • [Model] prevents working with historical data blocks, and one where the blocks are saved in individual files. • [Revision] may prevent working with historical data Both representations have important advantages and use cases. • [Addition] is compatible with all historical data Integration into astronomy display tools Improvements to binary block management It is essential that astronomers be able to visualize the data These enhancements are needed to enable better chunking support contained within ASDF files conveniently using the commonly and other capabilities. available tool, such as SAOImage DS9 [Joy03] and Ginga [Jes13]. 6 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Cloud optimized storage [McK10] W. McKinney. Data structures for statistical computing in python, Proceedigns of the 9th Python in Science Conference, p56-61, 2010. Much of the future data processing operations for STScI are https://doi.org/10.25080/Majora-92bf1922-00a expected to be performed on the cloud, so having ASDF efficiently [Pen09] W. Pence, R. Seaman, R. L. White, Lossless Astronomical Image support such uses is important. An important element of this is Compression and the Effects of Noise, Publications of the Astro- making the format work efficiently with object storage services nomical Society of the Pacific, 121:414-427, April 2009. https: //doi.org/10.48550/arXiv.0903.2140 such as AWS S3 and Google Cloud Storage. [Pen10] W. Pence, R. L. White, R. Seaman. Optimal Compression of Floating- Point Astronomical Images Without Significant Loss of Information, IDL support Publications of the Astronomical Society of the Pacific, 122:1065- 1076, September 2010. https://doi.org/10.1086/656249 While Python is rapidly surpassing the use of IDL in astronomy, [Joy03] W. A. Joye, E. Mandel. New Features of SAOImage DS9, Astronomi- there is still much IDL code being used, and many of those still cal Data Analysis Software and Systems XII ASP Conference Series, using IDL are in more senior and thus influential positions (they 295:489, 2003. aren’t quite dead yet). So making ASDF data at least readable to IDL is a useful goal. Support Rice compression Rice compression [Pen09], [Pen10] has proven a useful lossy compression algorithm for astronomical imaging data. Supporting it will be useful to astronomers, particularly for downloading large imaging data sets. Pandas Dataframe support Pandas [McK10] has proven to be a useful tool to many as- tronomers, as well as many in the sciences and engineering, so support will enhance the uptake of ASDF. Compact, easy-to-read schema summaries Most scientists and even scientific software developers tend to find JSON Schema files tedious to interpret. A more compact, and intuitive rendering of the contents would be very useful. Independent implementation Having ASDF accepted as a standard data format requires a library that is divorced from a Python API. Initially this can be done most easily by layering it on the Python library, but ultimately there should be an independent implementation which includes support for C/C++ wrappers. This is by far the item that will require the most effort, and would benefit from outside involvement. Provide interfaces to other popular packages This is a catch all for identifying where there would be significant advantages to providing the ability to save and recover information in the ASDF format as an interchange option. Sources of Information • ASDF Standard: https://asdf-standard.readthedocs.io/en/ latest/ • Python ASDF package documentation: https://asdf. readthedocs.io/en/stable/ • Repository: https://github.com//asdf-format/asdf • Tutorials: https://github.com/asdf-format/tutorials R EFERENCES [Gre15] P. Greenfield, M. Droettboom, E. Bray. ASDF: A new data format for astronomy, Astronomy and Computing, 12:240-251, September 2015. https://doi.org/10.1016/j.ascom.2015.06.004 [FIT16] FITS Working Group. Definition of the Flexible Image Transport System, International Astronomical Union, http://fits.gsfc.nasa.gov/ fits_standard.html, July 2016. [Jes13] E. Jeschke. Ginga: an open-source astronomical image viewer and toolkit, Proc. of the 12th Python in Science Conference., p58- 64,January 2013. https://doi.org/10.25080/Majora-8b375195-00a PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 7 Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling Nathan Jessurun‡∗ , Daniel E. Capecci‡ , Olivia P. Dizon-Paradis‡ , Damon L. Woodard‡ , Navid Asadizanjani‡ F Abstract—Most semantic image annotation platforms suffer severe bottlenecks when handling large images, complex regions of interest, or numerous distinct foreground regions in a single image. We have developed the Semi-Supervised Semantic Annotator (S3A) to address each of these issues and facilitate rapid collection of ground truth pixel-level labeled data. Such a feat is accomplished through a robust and easy-to-extend integration of arbitrary python image pro- cessing functions into the semantic labeling process. Importantly, the framework devised for this application allows easy visualization and machine learning prediction of arbitrary formats and amounts of per-component metadata. To our knowledge, the ease and flexibility offered are unique to S3A among all open- source alternatives. Index Terms—Semantic annotation, Image labeling, Semi-supervised, Region of interest Introduction Labeled image data is essential for training, tuning, and evaluating Fig. 1. Common use cases for semantic segmentation involve relatively few fore- ground objects, low-resolution data, and limited complexity per object. Images the performance of many machine learning applications. Such retrieved from https://cocodataset.org/#explore. labels are typically defined with simple polygons, ellipses, and bounding boxes (i.e., "this rectangle contains a cat"). However, this approach can misrepresent more complex shapes with holes and greatly hinders scalability. As such, several tools have been or multiple regions as shown later in Figure 9. When high accuracy proposed to alleviate the burden of collecting these ground-truth is required, labels must be specified at or close to the pixel-level labels [itL18]. Unfortunately, existing tools are heavily biased - a process known as semantic labeling or semantic segmentation. toward lower-resolution images with few regions of interest (ROI), A detailed description of this process is given in [CZF+ 18]. similar to Figure 1. While this may not be an issue for some Examples can readily be found in several popular datasets such datasets, such assumptions are crippling for high-fidelity images as COCO, depicted in Figure 1. with hundreds of annotated ROIs [LSA+ 10], [WYZZ09]. Semantic segmentation is important in numerous domains With improving hardware capabilities and increasing need for including printed circuit board assembly (PCBA) inspection (dis- high-resolution ground truth segmentation, there are a continu- cussed later in the case study) [PJTA20], [AML+ 19], quality ally growing number of applications that require high-resolution control during manufacturing [FRLL18], [AVK+ 01], [AAV+ 02], imaging with the previously described characteristics [MKS18], manuscript restoration / digitization [GNP+ 04], [KBO16], [JB92], [DS20]. In these cases, the existing annotation tooling greatly [TFJ89], [FNK92], and effective patient diagnosis [SKM+ 10], impacts productivity due to the previously referenced assumptions [RLO+ 17], [YPH+ 06], [IGSM14]. In all these cases, imprecise and lack of support [Spa20]. annotations severely limit the development of automated solutions In response to these bottlenecks, we present the Semi- and can decrease the accuracy of standard trained segmentation Supervised Semantic Annotation (S3A) annotation and prototyping models. platform -- an application which eases the process of pixel-level Quality semantic segmentation is difficult due to a reliance on labeling in large, complex scenes.1 Its graphical user interface is large, high-quality datasets, which are often created by manually shown in Figure 2. The software includes live app-level property labeling each image. Manual annotation is error-prone, costly, customization, real-time algorithm modification and feedback, region prediction assistance, constrained component table editing * Corresponding author: njessurun@ufl.edu ‡ University of Florida based on allowed data types, various data export formats, and a highly adaptable set of plugin interfaces for domain-specific exten- Copyright © 2022 Nathan Jessurun et al. This is an open-access article sions to S3A. Beyond software improvements, these features play distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, significant roles in bridging the gap between human annotation provided the original author and source are credited. efforts and scalable, automated segmentation methods [BWS+ 10]. 8 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Improve Semi- segmentation supervised techniques labeling Update Generate models training data Fig. 3. S3A’s can iteratively annotate, evaluate, and update its internals in real- time. to specify (but can be modified or customized if desired). As a re- sult, incorporating additional/customized application functionality can require as little as one line of code. Processes interface with Fig. 2. S3A’s interface. The main view consists of an image to annotate, a PyQtGraph parameters to gain access to data-customized widget component table of prior annotations, and a toolbar which changes functionality types and more (https://github.com/pyqtgraph/pyqtgraph). depending on context. These processes can also be arbitrarily nested and chained, which is critical for developing hierarchical image processing models, an example of which is shown in Figure 4. This frame- work is used for all image and region processing within S3A. Note that for image processes, each portion of the hierarchy yields Application Overview intermediate outputs to determine which stage of the process flow Design decisions throughout S3A’s architecture have been driven is responsible for various changes. This, in turn, reduces the by the following objectives: effort required to determine which parameters must be adjusted to achieve optimal performance. • Metadata should have significance rather than be treated as an afterthought, Plugins for User Extensions • High-resolution images should have minimal impact on the annotation workflow, The previous section briefly described how custom user functions • ROI density and complexity should not limit annotation are easily wrapped within a process, exposing its parameters workflow, and within S3A in a GUI format. A rich plugin interface is built on top • Prototyping should not be hindered by application com- of this capability in which custom functions, table field predictors, plexity. default action hooks, and more can be directly integrated into S3A. In all cases, only a few lines of code are required to achieve most These motives were selected upon noticing the general lack integrations between user code and plugin interface specifications. of solutions for related problems in previous literature and tool- The core plugin infrastructure consists of a function/property reg- ing. Moreover, applications that do address multiple aspects of istration mechanism and an interaction window that shows them complex region annotation often require an enterprise service and in the UI. As such, arbitrary user functions can be "registered" in cannot be accessed under open-source policies. one line of code to a plugin, where it will be effectively exposed to While the first three points are highlighted in the case study, the user within S3A. A trivial example is depicted in Figure 5, but the subsections below outline pieces of S3A’s architecture that more complex behavior such as OCR integration is possible with prove useful for iterative algorithm prototyping and dataset gen- similar ease (see this snippet for an implementation leveraging eration as depicted in Figure 3. Note that beyond the facets easyocr). illustrated here, S3A possesses multiple additional characteris- Plugin features are heavily oriented toward easing the pro- tics as outlined in its documentation (https://gitlab.com/s3a/s3a/- cess of automation both for general annotation needs and niche /wikis/docs/User’s-Guide). datasets. In either case, incorporating existing library functions is converted into a trivial task directly resulting in lower annotation Processing Framework time and higher labeling accuracy. At the root of S3A’s functionality and configurability lies its adaptive processing framework. Functions exposed within S3A are Adaptable I/O thinly wrapped using a Process structure responsible for parsing An extendable I/O framework allows annotations to be used in signature information to provide documentation, parameter infor- a myriad of ways. Out-of-the-box, S3A easily supports instance- mation, and more to the UI. Hence, all graphical depictions are level segmentation outputs, facilitating deep learning model train- abstracted beyond the concern of the user while remaining trivial ing. As an example, Figure 6 illustrates how each instance in the image becomes its own pair of image and mask data. When several 1. A preliminary version was introduced in an earlier publication [JPRA20], but significant changes to the framework and tool capabilities have been instances overlap, each is uniquely distinguishable depending employed since then. on the characteristic of their label field. Particularly helpful for SEMI-SUPERVISED SEMANTIC ANNOTATOR (S3A): TOWARD EFFICIENT SEMANTIC LABELING 9 Fig. 4. Outputs of each processing stage can be quickly viewed in context after an iteration of annotating. Upon inspecting the results, it is clear the failure point is a low k value during K-means clustering and segmentation. The woman’s shirt is not sufficiently distinguishable from the background palette to denote a separate entity. The red dot is an indicator of where the operator clicked during annotation. from qtpy import QtWidgets from s3a import ( S3A, __main__, RandomToolsPlugin, ) def hello_world(win: S3A): QtWidgets.QMessageBox.information( win, "Hello World", "Hello World!" ) RandomToolsPlugin.deferredRegisterFunc( hello_world ) __main__.mainCli() Fig. 6. Multiple export formats exist, among which is a utility that crops com- ponents out of the image, optionally padding with scene pixels and resizing to Fig. 5. Simple standalone functions can be easily exposed to the user through ensure all shapes are equal. Each sub-image and mask is saved accordingly, the random tools plugin. Note that if tunable parameters were included in the which is useful for training on multiple forms of machine learning models. function signature, pressing "Open Tools" (the top menu option) allows them to be altered. binations for functions outside S3A in the event they are utilized in a different framework. models with fixed input sizes, these exports can optionally be forced to have a uniform shape (e.g., 512x512 pixels) while main- taining their aspect ratio. This is accomplished by incorporating Case Study additional scene pixels around each object until the appropriate Both the inspiration and developing efforts for S3A were initially size is obtained. Models trained on these exports can be directly driven by optical printed circuit board (PCB) assurance needs. plugged back into S3A’s processing framework, allowing them In this domain, high-resolution images can contain thousands to generate new annotations or refine preliminary user efforts. of complex objects in a scene, as seen in Figure 7. Moreover, The described I/O framework is also heavily modularized such numerous components are not representable by cardinal shapes that custom dataset specifications can easily be incorporated. In such as rectangles, circles, etc. Hence, high-count polygonal this manner, future versions of S3A will facilitate interoperability regions dominated a significant portion of the annotated regions. with popular formats such as COCO and Pascal VOC [LMB+ 14], The computational overhead from displaying large images and [EGW+ 10]. substantial numbers of complex regions either crashed most anno- tation platforms or prevented real-time interaction. In response, S3A was designed to fill the gap in open-source annotation Deep, Portable Customizability platforms that addressed each issue while requiring minimal setup Beyond the features previously outlined, S3A provides numerous and allowing easy prototyping of arbitrary image processing tasks. avenues to configure shortcuts, color schemes, and algorithm The subsections below describe how the S3A labeling platform workflows. Several examples of each can be seen in the user was utilized to collect a large database of PCB annotations along guide. Most customizable components prototyped within S3A can with their associated metadata2 . also be easily ported to external workflows after development. Hierarchical processes have states saved in YAML files describing Large Images with Many Annotations all parameters, which can be reloaded to create user profiles. In optical PCB assurance, one method of identifying component Alternatively, these same files can describe ideal parameter com- defects is to localize and characterize all objects in the image. Each 10 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 8. Regardless of total image size and number of annotations, Python processing is be limited to the ROI or viewbox size for just the selected object based on user preferences. The depiction shows Grab Cut operating on a user- defined initial region within a much larger (8000x6000) image. The resulting Fig. 7. Example PCB segmentation. In contrast to typical semgentation tasks, region was available in 1.94 seconds on low-grade hardware. the scene contains over 4,000 objects with numerous complex shapes. component can then be cross-referenced against genuine proper- ties such as length/width, associated text, allowed orientations, etc. However, PCB surfaces can contain hundreds to thousands of components at several magnitudes of size, necessitating high- resolution images for in-line scanning. To handle this problem more generally, S3A separates the editing and viewing experi- Fig. 9. Annotated objects in S3A can incorporate both holes and distinct regions ences. In other words, annotation time is orders of magnitude through a multi-polygon container. Holes are represented as polygons drawn on faster since only edits in one region at a time and on a small subset top of existing foreground, and can be arbitrarily nested (i.e. island foreground is of the full image are considered during assisted segmentation. All also possible). other annotations are read-only until selected for alteration. For instance, Figure 8 depicts user inputs on a small ROI out of a key performance improvement when thousands of regions (each much larger image. The resulting component shape is proposed with thousands of points) are in the same field of view. When within seconds and can either be accepted or modified further by low polygon counts are required, S3A also supports RDP polygon the user. While PCB annotations initially inspired this approach, it simplification down to a user-specified epsilon parameter [Ram]. is worth noting that the architectural approach applies to arbitrary domains of image segmentation. Complex Metadata Another key performance improvement comes from resizing Most annotation software support robust implementation of im- the processed region to a user-defined maximum size. For instance, age region, class, and various text tags ("metadata"). However, if an ROI is specified across a large portion of the image but this paradigm makes collecting type-checked or input-sanitized the maximum processing size is 500x500 pixels, the processed metadata more difficult. This includes label categories such as area will be downsampled to a maximum dimension length of object rotation, multiclass specifications, dropdown selections, 500 before intensive algorithms are run. The final output will and more. In contrast, S3A treats each metadata field the same be upsampled back to the initial region size. In this manner, way as object vertices, where they can be algorithm-assisted, optionally sacrificing a small amount of output accuracy can directly input by the user, or part of a machine learning prediction drastically accelerate runtime performance for larger annotated framework. Note that simple properties such as text strings or objects. numbers can be directly input in the table cells with minimal need for annotation assistance3 . In conrast, custom fields can provide Complex Vertices/Semantic Segmentation plugin specifications which allow more advanced user interaction. Multiple types of PCB components possess complex shapes which Finally, auto-populated fields like annotation timestamp or author might contain holes or noncontiguous regions. Hence, it is bene- can easily be constructed by providing a factory function instead ficial for software like S3A to represent these features inherently of default value in the parameter specification. with a ComplexXYVertices object: that is, a collection of This capability is particularly relevant in the field of optical polygons which either describe foreground regions or holes. This PCB assurance. White markings on the PCB surface, known is enabled by thinly wrapping opencv’s contour and hierarchy as silkscreen, indicate important aspects of nearby components. logic. Example components difficult to accomodate with single- Thus, understanding the silkscreen’s orientation, alphanumeric polygon annotation formats are illustrated in Figure 9. characters, associated component, logos present, and more provide At the same time, S3A also supports high-count polygons several methods by which to characterize / identify features with no performance losses. Since region edits are performed by of their respective devices. Both default and customized input image processing algorithms, there is no need for each vertex validators were applied to each field using parameter specifica- to be manually placed or altered by human input. Thus, such tions, custom plugins, or simple factories as described above. A non-interactive shapes can simply be rendered as a filled path summary of the metadata collected for one component is shown without a large number of event listeners present. This is the in Figure 10. SEMI-SUPERVISED SEMANTIC ANNOTATOR (S3A): TOWARD EFFICIENT SEMANTIC LABELING 11 results depending on the initial image complexity [VGSG+ 19]. Hence, these methods would be significantly easier to incorporate into S3A if a generalized windowing framework was incorporated which allows users to specify all necessary parameters such as window overlap, size, sampling frequency, etc. A preliminary version of this is implemented for categorical-based model pre- diction, but a more robust feature set for interactive segmentation is strongly preferable. Aggregation of Human Annotation Habits Several times, it has been noted that manual segmentation of Fig. 10. Metadata can be collected, validated, and customized with ease. A mix image data is not a feasible or scalable approach for remotely of default properties (strings, numbers, booleans), factories (timestamp, author), large datasets. However, there are multiple cases in which human and custom plugins (yellow circle representing associated device) are present. intuition can greatly outperform even complex neural networks, depending on the specific segmentation challenge [RLFF15]. For this reason, it would be ideal to capture data points possessing Conclusion and Future Work information about the human decision-making process and apply The Semi-Supervised Semantic Annotator (S3A) is proposed to them to images at scale. This may include taking into account hu- address the difficult task of pixel-level annotations of image data. man labeling time per class, hesitation between clicks, relationship For high-resolution images with numerous complex regions of between shape boundary complexity and instance quantity, and interest, existing labeling software faces performance bottlenecks more. By aggregating such statistics, a pattern may arise which can attempting to extract ground-truth information. Moreover, there is be leveraged as an additional automated annotation technique. a lack of capabilities to convert such a labeling workflow into an automated procedure with feedback at every step. Each of these challenges is overcome by various features within S3A specifically R EFERENCES designed for such tasks. As a result, S3A provides not only tremen- [AAV+ 02] C Anagnostopoulos, I Anagnostopoulos, D Vergados, G Kouzas, dous time savings during ground truth annotation, but also allows E Kayafas, V Loumos, and G Stassinopoulos. High performance an annotation pipeline to be directly converted into a prediction computing algorithms for textile quality control. Mathematics scheme. Furthermore, the rapid feedback accessible at every stage and Computers in Simulation, 60(3):389–400, September 2002. doi:10.1016/S0378-4754(02)00031-9. of annotation expedites prototyping of novel solutions to imaging [AML+ 19] Mukhil Azhagan, Dhwani Mehta, Hangwei Lu, Sudarshan domains in which few examples of prior work exist. Nonetheless, Agrawal, Mark Tehranipoor, Damon L Woodard, Navid multiple avenues exist for improving S3A’s capabilities in each of Asadizanjani, and Praveen Chawla. A review on automatic bill of material generation and visual inspection on PCBs. In these areas. Several prominent future goals are highlighted in the ISTFA 2019: Proceedings of the 45th International Symposium following sections. for Testing and Failure Analysis, page 256. ASM International, 2019. Dynamic Algorithm Builder [AVK+ 01] C. Anagnostopoulos, D. Vergados, E. Kayafas, V. Loumos, and G. Stassinopoulos. A computer vision approach for textile Presently, processing workflows can be specified in a sequential quality control. The Journal of Visualization and Computer YAML file which describes each algorithm and their respective Animation, 12(1):31–44, 2001. doi:10.1002/vis.245. parameters. However, this is not easy to adapt within S3A, [BWS+ 10] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual especially by inexperienced annotators. Future iterations of S3A recognition with humans in the loop. In Kostas Daniilidis, Petros will incoroprate graphical flowcharts which make this process Maragos, and Nikos Paragios, editors, Computer Vision – ECCV drastically more intuitive and provide faster feedback. Frameworks 2010, pages 438–451, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. like Orange [DCE+ ] perform this task well, and S3A would [CZF+ 18] Qimin Cheng, Qian Zhang, Peng Fu, Conghuan Tu, and Sen Li. strongly benefit from adding the relevant capabilities. A survey and analysis on automatic image annotation. Pattern Recognition, 79:242–259, 2018. doi:10.1016/j.patcog. Image Navigation Assistance 2018.02.017. [DCE+ ] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Several aspects of image navigation can be incorporated to sim- Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, plify the handling of large images. For instance, a "minimap" tool Marko Toplak, and Anže Starič. Orange: Data mining toolbox in Python. 14(1):2349–2353. would allow users to maintain a global image perspective while [DS20] Polina Demochkina and Andrey V. Savchenko. Improving making local edits. Furthermore, this sense of scale aids intuition the accuracy of one-shot detectors for small objects in x-ray of how many regions of similar component density, color, etc. exist images. In 2020 International Russian Automation Confer- within the entire image. ence (RusAutoCon), page 610–614. IEEE, September 2020. URL: https://ieeexplore.ieee.org/document/9208097/, doi:10. Second, multiple strategies for annotating large images lever- 1109/RusAutoCon49822.2020.9208097. age a windowing approach, where they will divide the total image [EGW+ 10] Mark Everingham, Luc Gool, Christopher K. Williams, John into several smaller pieces in a gridlike fashion. While this has its Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision, 88(2):303–338, jun disadvantages, it is fast, easy to automate, and produces reasonable 2010. URL: https://doi.org/10.1007/s11263-009-0275-4, doi: 10.1007/s11263-009-0275-4. 2. For those curious, the dataset and associated paper are accessible at https: [FNK92] H. Fujisawa, Y. Nakano, and K. Kurino. Segmentation methods //www.trust-hub.org/#/data/pcb-images. for character recognition: From segmentation to document struc- 3. For a list of input validators and supported primitive types, refer to ture analysis. Proceedings of the IEEE, 80(7):1079–1092, July PyQtGraph’s Parameter documentation. 1992. doi:10.1109/5.156471. 12 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [FRLL18] Max K. Ferguson, Ak Ronay, Yung-Tsun Tina Lee, and Kin- IEEE Transactions on Medical Imaging, 36(2):674–683, Febru- cho. H. Law. Detection and segmentation of manufacturing ary 2017. doi:10.1109/TMI.2016.2621185. defects with convolutional neural networks and transfer learn- [SKM+ 10] Sascha Seifert, Michael Kelm, Manuel Moeller, Saikat Mukher- ing. Smart and sustainable manufacturing systems, 2, 2018. jee, Alexander Cavallaro, Martin Huber, and Dorin Comaniciu. doi:10.1520/SSMS20180033. Semantic annotation of medical images. In Brent J. Liu and [GNP+ 04] Basilios Gatos, Kostas Ntzios, Ioannis Pratikakis, Sergios William W. Boonn, editors, Medical Imaging 2010: Advanced Petridis, T. Konidaris, and Stavros J. Perantonis. A segmentation- PACS-based Imaging Informatics and Therapeutic Applications, free recognition technique to assist old greek handwritten volume 7628, pages 43 – 50. International Society for Optics and manuscript OCR. In Simone Marinai and Andreas R. Dengel, Photonics, SPIE, 2010. URL: https://doi.org/10.1117/12.844207, editors, Document Analysis Systems VI, Lecture Notes in Com- doi:10.1117/12.844207. puter Science, pages 63–74, Berlin, Heidelberg, 2004. Springer. [Spa20] SpaceNet. Multi-Temporal Urban Development Challenge. doi:10.1007/978-3-540-28640-0_7. https://spacenet.ai/sn7-challenge/, June 2020. [IGSM14] D. K. Iakovidis, T. Goudas, C. Smailis, and I. Maglogiannis. [TFJ89] T. Taxt, P.J. Flynn, and A.K. Jain. Segmentation of document Ratsnake: A versatile image annotation tool with application images. IEEE Transactions on Pattern Analysis and Machine to computer-aided diagnosis, 2014. doi:10.1155/2014/ Intelligence, 11(12):1322–1329, December 1989. doi:10. 286856. 1109/34.41371. [itL18] Humans in the Loop. The best image annotation platforms [VGSG+ 19] Juan P. Vigueras-Guillén, Busra Sari, Stanley F. Goes, Hans G. for computer vision (+ an honest review of each), October Lemij, Jeroen van Rooij, Koenraad A. Vermeer, and Lucas J. 2018. URL: https://hackernoon.com/the-best-image-annotation- van Vliet. Fully convolutional architecture vs sliding-window platforms-for-computer-vision-an-honest-review-of-each- cnn for corneal endothelium cell segmentation. BMC Biomedical dac7f565fea. Engineering, 1(1):4, January 2019. doi:10.1186/s42490- [JB92] Anil K. Jain and Sushil Bhattacharjee. Text segmentation using 019-0003-2. gabor filters for automatic document processing. Machine Vision [WYZZ09] C. Wang, Shuicheng Yan, Lei Zhang, and H. Zhang. Multi- and Applications, 5(3):169–184, June 1992. doi:10.1007/ label sparse coding for automatic image annotation. In 2009 BF02626996. IEEE Conference on Computer Vision and Pattern Recognition, [JPRA20] Nathan Jessurun, Olivia Paradis, Alexandra Roberts, and Navid page 1643–1650, June 2009. doi:10.1109/CVPR.2009. Asadizanjani. Component Detection and Evaluation Framework 5206866. (CDEF): A Semantic Annotation Tool. Microscopy and Micro- [YPH 06] Paul A. Yushkevich, Joseph Piven, Heather Cody Hazlett, + analysis, 26(S2):1470–1474, August 2020. doi:10.1017/ Rachel Gimpel Smith, Sean Ho, James C. Gee, and Guido S1431927620018243. Gerig. User-guided 3D active contour segmentation of anatom- [KBO16] Made Windu Antara Kesiman, Jean-Christophe Burie, and Jean- ical structures: Significantly improved efficiency and reliability. Marc Ogier. A new scheme for text line and character seg- NeuroImage, 31(3):1116–1128, July 2006. doi:10.1016/j. mentation from gray scale images of palm leaf manuscript. neuroimage.2006.01.015. In 2016 15th International Conference on Frontiers in Hand- writing Recognition (ICFHR), pages 325–330, October 2016. doi:10.1109/ICFHR.2016.0068. [LMB+ 14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Euro- pean conference on computer vision, pages 740–755. Springer, 2014. [LSA+ 10] L’ubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H. S. Torr. What, where and how many? combining object detectors and crfs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 424–437, Berlin, Heidelberg, 2010. Springer Berlin Hei- delberg. [MKS18] S. Mohajerani, T. A. Krammer, and P. Saeedi. A cloud detection algorithm for remote sensing images using fully convolutional neural networks. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), page 1–5, August 2018. doi:10.1109/MMSP.2018.8547095. [PJTA20] Olivia P Paradis, Nathan T Jessurun, Mark Tehranipoor, and Navid Asadizanjani. Color normalization for robust automatic bill of materials generation and visual inspection of pcbs. In ISTFA 2020: Papers Accepted for the Planned 46th International Symposium for Testing and Failure Analysis, International Symposium for Testing and Failure Analysis, pages 172–179, 2020. URL: https://doi.org/10.31399/asm.cp. istfa2020p0172https://dl.asminternational.org/istfa/proceedings- pdf/ISTFA2020/83348/172/425605/istfa2020p0172.pdf, doi:10.31399/asm.cp.istfa2020p0172. [Ram] Urs Ramer. An iterative procedure for the polygonal approx- imation of plane curves. 1(3):244–256. URL: https://www. sciencedirect.com/science/article/pii/S0146664X72800170, doi:10.1016/S0146-664X(72)80017-0. [RLFF15] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In 2015 IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), page 2121–2131. IEEE, June 2015. URL: http://ieeexplore.ieee.org/document/7298824/, doi:10. 1109/CVPR.2015.7298824. [RLO+ 17] Martin Rajchl, Matthew C. H. Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A. Rutherford, Joseph V. Hajnal, Bernhard Kainz, and Daniel Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 13 Galyleo: A General-Purpose Extensible Visualization Solution Rick McGeer‡∗ , Andreas Bergen‡ , Mahdiyar Biazi‡ , Matt Hemmings‡ , Robin Schreiber‡ F Abstract—Galyleo is an open-source, extensible dashboarding solution inte- Jupyter’s web interface is primarily to offer textboxes for code grated with JupyterLab [jup]. Galyleo is a standalone web application integrated entry. Entered code is sent to the server for evaluation and as an iframe [LS10] into a JupyterLab tab. Users generate data for the dash- text/HTML results returned. Visualization in a Jupyter Notebook board inside a Jupyter Notebook [KRKP+ 16], which transmits the data through is either given by images rendered server-side and returned as message passing [mdn] to the dashboard; users use drag-and-drop operations inline image tags, or by JavaScript/HTML5 libraries which have to add widgets to filter, and charts to display the data, shapes, text, and images. The dashboard is saved as a JSON [Cro06] file in the user’s filesystem in the a corresponding server-side Python library. The Python library same directory as the Notebook. generates HTML5/JavaScript code for rendering. The limiting factor is that the visualization library must be in- Index Terms—JupyterLab, JupyterLab extension, Data visualization tegrated with the Python backend by a developer, and only a subset of the rich array of visualization, charting, and mapping libraries Introduction available on the HTML5/JavaScript platform is integrated. The HTML5/JavaScript platform is as rich a client-side visualization Current dashboarding solutions [hol22a] [hol22b] [plo] [pan22] platform as Python is a server-side platform. for Jupyter either involve external, heavyweight tools, ingrained Galyleo set out to offer the best of both worlds: Python, R, and HTML/CSS coding, complex publication, or limited control over Julia as a scalable analytics platform coupled with an extensible layout, and have restricted widget sets and visualization libraries. JavaScript/HTML5 visualization and interaction platform. It offers Graphics objects require a great deal of configuration: size, posi- a no-code client-side environment, for several reasons. tion, colors, fonts must be specified for each object. Thus library solutions involve a significant amount of fairly simple code. Con- 1) The Jupyter analytics community is comfortable with versely, visualization involves analytics, an inherently complex server-side analytics environments (the 100+ kernels set of operations. Visualization tools such as Tableau [DGHP13] available in Jupyter, including Python, R and Julia) but or Looker [loo] combine visualization and analytics in a single less so with the JavaScript visualization platform. application presented through a point-and-click interface. Point- 2) Configuration of graphical objects takes a lot of low-value and-click interfaces are limited in the number and complexity configuration code; conversely, it is relatively easy to do of operations supported. The complexity of an operation isn’t by hand. reduced by having a simple point-and-click interface; instead, the user is confronted with the challenge of trying to do something These insights lead to a mixed interface, combining a drag- complicated by pointing. The result is that tools encapsulate and-drop interface for the design and configuration of visual complex operations in a few buttons, and that leads to a limited objects, and a coding, server-side interface for analytics programs. number of operations with reduced options and/or tools with steep Extension of the widget set was an important consideration. A learning curves. widget is a client-side object with a physical component. Galyleo In contrast, Jupyter is simply a superior analytics environment is designed to be extensible both by adding new visualization in every respect over a standalone visualization tool: its various libraries and components and by adding new widgets. kernels and their libraries provide a much broader range of analyt- Publication of interactive dashboards has been a further chal- ics capabilities; its programming interface is a much cleaner and lenge. A design goal of Galyleo was to offer a simple scheme, simpler way to perform complex operations; hardware resources where a dashboard could be published to the web with a single can scale far more easily than they can for a visualization tool; click. and connectors to data sources are both plentiful and extensible. These then, are the goals of Galyleo: Both standalone visualization tools and Jupyter libraries have a limited set of visualizations. Jupyter is a server-side platform. 1) Simple, drag-and-drop design of interactive dashboards in a visual editor. The visual design of a Galyleo dashboard * Corresponding author: rick.mcgeer@engageLively.com ‡ engageLively should be no more complex than design of a PowerPoint or Google slide; Copyright © 2022 Rick McGeer et al. This is an open-access article distributed 2) Radically simplify the dashboard-design interface by cou- under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the pling it to a powerful, Jupyter back end to do the analytics original author and source are credited. work, separating visualization and analytics concerns; 14 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 1: Figure 1. A New Galyleo Dashboard Fig. 3: Figure 3. Dataflow in Galyleo As the user creates and manipulates the visual elements, the editor continuously saves the table as a JSON file, which can also be edited with Jupyter’s built-in text editor. Workflow The goal of Galyleo is simplicity and transparency. Data prepa- ration is handled in Jupyter, and the basic abstract item, the GalyleoTable is generally created and manipulated there, using an open-source Python library. When a table is ready, the Galyleo- Client library is invoked to send it to the dashboard, where it appears in the table tab of the sidebar. The dashboard author then creates visual elements such as sliders, lists, dropdowns etc., Fig. 2: Figure 2. The Galyleo Dashboard Studio which select rows of the table, and uses these filtered lists as inputs to charts. The general idea is that the author should be 3) Maximimize extensibility for visualization and widgets able to seamlessly move between manipulating and creating data on the client side and analytics libraries, data sources and tables in the Notebook, and filtering and visualizing them in the hardware resources on the server side; dashboard. 4) Easy, simple publication; Data Flow and Conceptual Picture The Galyleo Data Model and Architecture is discussed in detail Using Galyleo below. The central idea is to have a few, orthogonal, easily-grasped The general usage model of Galyleo is that a Notebook is being concepts which make data manipulation easy and intuitive. The edited and executed in one tab of JupyterLab, and a corresponding basic concepts are as follows: dashboard file is being edited and executed in another; as the Notebook executes, it uses the Galyleo Client library to send 1) Table: A Table is a list of records, equivalent to a Pandas data to the dashboard file. To JupyterLab, the Galyleo Dashboard DataFrame [pdt20] [WM10] or a SQL Table. In general, Studio is just another editor; it reads and writes .gd.json files in in Galyleo, a Table is expected to be produced by an the current directory. external source, generally a Jupyter Notebook 2) Filter: A Filter is a logical function which applies to a The Dashboard Studio single column of a Table Table, and selects rows from the Table. Each Filter corresponds to a widget; widgets set A new Galyleo Dashboard can be launched from the JupyterLab the values Filter use to select Table rows launcher or from the File>New menu, as shown in Figure 1. 3) View A View is a subset of a Table selected by one or An existing dashboard is saved as a .gd.json file, and is more Filters. To create a view, the user chooses a Table, denoted with the Galyleo star logo. It can be opened in the usual and then chooses one or more Tilters to apply to the Table way, with a double-click. to select the rows for the View. The user can also statically Once a file is opened, or a new file created, a new Galyleo tab select a subset of the columns to include in the View. opens onto it. It resembles a simplified form of a Tableau, Looker, 4) Chart A Chart is a generic term for an object that displays or PowerBI editor. The collapsible right-hand sidebar offers the data graphically. Its input is a View or a Table. Each Chart ability to view Tables, and view, edit, or create Views, Filters, has a single data source. and Charts. The bottom half of the right sidebar gives controls for styling of text and shapes. The data flow is straightforward. A Table is updated from The top bar handles the introduction of decorative and styling an external source, or the user manipulates a widget. When this elements to the dashboard: labels and text, simple shapes such as happens, the affected item signals the dashboard controller that it ellipses, rectangles, polygons, lines, and images. All images are has been updated. The controller then signals all charts to redraw referenced by URL. themselves. Each Chart will then request updated data from its GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 15 source Table or View. A View then requests its configured filters for their current logic functions, and passes these to the source Table with a request to apply the filters and return the rows which are selected by all the filters (in the future, a more general Boolean will be applied; the UI elements to construct this function are under design). The Table then returns the rows which pass the filters; the View selects the static subset of columns it supports, and passes this to its Charts, which then redraw themselves. Each item in this flow conceptually has a single data source, but multiple data targets. There can be multiple Views over a Table, but each View has a single Table as a source. There can be multiple charts fed by a View, but each Chart has a single Table or View as a source. It’s important to note that there are no special cases. There is no distinction, as there is in most visualization systems, between a "Dimension" or a "Measure"; there are simply columns of data, Fig. 4: Figure 4. A Published Galyleo Dashboard which can be either a value or category axis for any Chart. From this simplicity, significant generality is achieved. For example, a filter selects values from any column, whether that column is and configuration gives instant feedback and tight control over providing value or category. Applying a range filter to a category appearance. For example, the authors of a LaTeX paper (including column gives natural telescoping and zooming on the x-axis of a this one) can’t control the placement of figures within the text. The chart, without change to the architecture. fourth, which is correct, is that configuration code is more verbose, error-prone, and time-consuming than manual configuration. Drilldowns What is less often appreciated is that when operations become An important operation for any interactive dashboard is drill- sufficiently complex, coding is a much simpler interface than downs: expanding detail for a datapoint on a chart. The user manual configuration. For example, building a pivot table in a should be able to click on a chart and see a detailed view of spreadsheet using point-and-click operations have "always had a the data underlying the datapoint. This was naturally implemented reputation for being complicated" [Dev]. It’s three lines of code in in our system by associating a filter with every chart: every chart Python, even without using the Pandas pivot_table method. Most in Galyleo is also a Select Filter, and it can be used as a Filter in analytics procedures are far more easily done in code. a view, just as any other widget can be. As a result, Galyleo is an appropriate-code environment, which is an environment which combines a coding interface Publishing The Dashboard for complex, large-scale, or abstract operations and a point- Once the dashboard is complete, it can be published to the and-click interface for simple, concrete, small-scale operations. web simply by moving the dashboard file to any place it get Galyleo combines broadly powerful Jupyter-based code and low- an URL (e.g. a github repo). It can then be viewed by visiting code libraries for analytics paired with fast GUI-based design and https://galyleobeta.engagelively.com/public/galyleo/index.html? configuration for graphical elements and layout. dashboard=<url of dashboard file>. The attached figure shows a published Galyleo Dashboard, which displays Florence Galyleo Data Model And Architecture Nightingale’s famous Crimean War dataset. Using the double sliders underneath the column charts telescope the x axes, The Galyleo data Model and architecture closely model the effectively permitting zooming on a range; clicking on a column dashboard architecture discussed in the previous section. They are shows the detailed death statistics for that month in the pie chart based on the idea of a few simple, generalizable structures, which above the column chart. are largely independent of each other and communicate through simple interfaces. No-Code, Low-Code, and Appropriate-Code The GalyleoTable Galyleo is an appropriate-code environment, meaning that it offers A GalyleoTable is the fundamental data structure in Galyleo. It efficient creation to developers at every step. It offers What-You- is a logical, not a physical abstraction; it simply responds to See-Is-What-You-Get (WYSIWYG) design tools where appro- the GalyleoTable API. A GalyleoTable is a pair (columns, rows), priate, low-code where appropriate, and full code creation tools where columns is a list of pairs (name, type), where type is one where appropriate. of {string, boolean, number, date}, and rows is a list of lists of No-code and low-code environments, where users construct primitive values, where the length of each component list is the applications through a visual interface, are popular for several length of the list of columns and the type of the kth entry in each reasons. The first is the assumption that coding is time-consuming list is the type specified by the kth column. and hard, which isn’t always or necessarily true; the second is Small, public tables may be contained in the dashboard file; the assumption that coding is a skill known to only a small these are called explicit tables. However, explicitly representing fraction of the population, which is becoming less true by the the table in the dashboard file has a number of disadvantages: day. 40% of Berkeley undergraduates take Data 8, in which every assignment involves programming in a Jupyter Notebook. 1) An explicit table is in the memory of the client viewing The third, particularly for graphics code, is that manual design the dashboard; if it is too large, it may cause signifi- 16 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) cant performance problems on the dashboard author or viewer’s device 2) Since the dashboard file is accessible on the web, any data within it is public 3) The data may be continuously updated from a source, and it’s inconvenient to re-run the Notebook to update the data. Therefore, the GalyleoTable can be of one of three types: 1) A data server that implements the Table REST API 2) A JavaScript object within the dashboard page itself 3) A JavaScript messenger in the page that implements a messaging version of the API Fig. 5: Figure 5. Galyleo Dataflow with Remote Tables An explicit table is simply a special case of (2) -- in this case, the JavaScript object is simply a linear list of rows. Comments These are not exclusive. The JavaScript messenger case is designed to support the ability of a containing application within Again, simplicity and orthogonality have shown tremendous bene- the browser to handle viewer authentication, shrinking the security fits here. Though filters conceptually act as selectors on rows, they vulnerability footprint and ensuring that the client application may perform a variety of roles in implementations. For example, controls the data going to the dashboard. In general, aside from a table produced by a simulator may be controlled by a parameter performing tasks like authentication, the messenger will call an value given by a Filter function. external data server for the values themselves. Whether in a Data Server, a containing application, or a Extending Galyleo JavaScript object, Tables support three operations: Every element of the Galyleo system, whether it is a widget, Chart, Table Server, or Filter is defined exclusively through a small set 1) Get all the values for a specific column of public APIs. This is done to permit easy extension, by either 2) Get the max/min/increment for a specific numeric column the Galyleo team, users, or third parties. A Chart is defined as an 3) Get the rows which match a boolean function, passed in object which has a physical HTML representation, and it supports as a parameter to the operation four JavaScript methods: redraw (draw the chart), set data (set the Of course, (3) is the operation that we have seen above, to chart’s data), set options (set the chart’s options), and supports populate a view and a chart. (1) and (2) populate widgets on the table (a boolean which returns true if and only if the chart can dashboard; (1) is designed for a select filter, which is a widget draw the passed-in data set). In addition, it exports out a defined that lets a user pick a specific set of values for a column; (2) is JSON structure which indicates what options it supports and the an optimization for numeric filters, so that the entire list of values types of their values; this is used by the Chart Editor to display a for the column need not be sent -- rather, only the start and end configurator for the chart. values, and the increment between them. Similarly, the underlying lively.next system supports user design of new filters. Again, a filter is simply an object with a Each type of table specifies a source, additional information physical presence, that the user can design in lively, and supports a (in the case of a data server, for example, any header variables specific API -- broadly, set the choices and hand back the Boolean that must be specified in order to fetch the data), and, optionally, function as a JSON object which will be used to filter the data. a polling interval. The latter is designed to handle live data; the dashboard will query the data source at each polling interval to lively.next see if the data has changed. Any system can be used to extend Galyleo; at the end of the The choice of these three table instantiations (REST, day, all that need be done is encapsulate a widget or chart in JavaScript object, messenger) is that they provide the key founda- a snippet of HTML with a JavaScript interface that matches tional building block for future extensions; it’s easy to add a SQL the Galyleo protocol. This is done most easily and quickly connection on top of a REST interface, or a Python simulator. by using lively.next [SKH21]. lively.next is the latest in a line of Smalltalk- and Squeak-inspired [IKM+ 97] JavaScript/HTML Filters integrated development environments that began with the Lively Tables must be filtered in situ. One of the key motivators behind Kernel [IPU+ 08] [KIH+ 09] and continued through the Lively Web remote tables is in keeping large amounts of data from hitting the [LKI+ 12] [IFH+ 16] [TM17]. Galyleo is an application built in browser. This is largely defeated if the entire table is sent to the Lively, following the work done in [HIK+ 16]. dashboard and then filtered there. As a result, there is a Filter API Lively shares with Jupyter an emphasis on live programming together with the Table API whereever there are tables. [KRB18], orwhere a Read-Evaluate-Act Loop (REAL) program- The data flow of the previous section remains unchanged; ming style. It adds to that a combination of visual and text it is simply that the filter functions are transmitted to wherever programming [ABF20], where physical objects are positioned and the tables happen to be. The dataflow in the case of remote configured largely by hand as done with any drawing or design tables (whether messenger-based or REST-based) is shown here, program (e.g., PowerPoint, Illustrator, DrawPad, Google Draw) with operations that are resident where the table is situated and and programmed with a built-in editor and workspace, similar in operations resident on the dashboard clearly shown. concept if not form to a Jupyter Notebook. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 17 2) acceptsDataset(<Table or View>) returns a boolean de- pending on whether this chart can draw the data in this view. For example, a Table Chart can draw any tabular data; a Geo Chart typically requires that the first column be a place specifier. In addition, it has a read-only property: 1) optionSpec: A JSON structure describing the options for the chart. This is a dictionary, which specifies the name of each option, and its type (color, number, string, boolean, or enum with values given). Each type corresponds to a specific UI widget that the chart editor uses. And two read write properties: 1) options: The current options, as a JSON dictionary. This Fig. 6: Figure 6. The lively.next environment matches exactly the JSON dictionary in optionSpec, with values in place of the types. 2) dataSource: a string, the name of the current Galyleo Lively abstracts away HTML and CSS tags in graphical Table or Galyleo View objects called "Morphs". Morphs [MS95] were invented as the user interface layer for Self [US87], and have been used as Typically, an extension to Galyleo’s charting capabilities is the foundation of the graphics system in Squeak and Scratch done by incorporating the library as described in the previous [MRR+ 10]. In this UI, every physical object is a Morph; these section, implementing the API given in this section, and then can be as simple as a simple polygon or text string to a full publishing the result as a component application. Morphs are combined via composition, similar to the way that objects are grouped in a presentation or drawing program. Extending Galyleo’s Widget Set The composition is simply another Morph, which in turn can be A widget is a graphical item used to filter data. It operates on a composed with other Morphs. In this manner, complex Morphs single column on any table in the current data set. It is either a can be built up from collections of simpler ones. For example, range filter (which selects a range of numeric values) or a select a slider is simply the composition of a circle (the knob) with a filter (which selects a specific value, or a set of specific values). thin, long rectangle (the bar). Each Morph can be individually The API that is implemented consists only of properties. programmed as a JavaScript object, or can inherit base level 1) valueChanged : a signal, which is fired whenever the behavior and extend it. value of the widget is changed In lively.next, each morph turns into a snippet of HTML, CSS, 2) value: read-write. The current value of the widget and JavaScript code and the entire application turns into a web 3) filter: read-only. The current filter function, as a JSON page. The programmer doesn’t see the HTML and CSS code structure directly; these are auto-generated. Instead, the programmer writes 4) allValues: read-write, select filters only. JavaScript code for both logic and configuration (to the extent that 5) column: read-only. The name of the column of this the configuration isn’t done by hand). The code is bundled with widget. Set when the widget is created the object and integrated in the web page. 6) numericSpec: read-write. A dictionary containing the Morphs can be set as reusable components by a simple numeric specification for a numeric or date filter declaration. They can then be reused in any lively design. Widgets are typically designed as a standard Lively graphical Incorporating New Libraries component, much as the slider described above. Libraries are typically incorporated into lively.next by attaching them to a convenient physical object, importing the library from a Integration into Jupyter Lab: The Galyleo Extension package manager such as npm, and then writing a small amount Galyleo is a standalone web application that is integrated into of code to expose the object’s API. The simplest form of this is to JupyterLab using an iframe inside a JupyterLab tab for physical assign the module to an instance variable so it has an addressable design. A small JupyterLab extension was built that implements name, but typically a few convenience methods are written as well. the JupyterLab editor API. The JupyterLab extension has two In this way, a large number of libraries have been incorporated major functions: to handle read/write/undo requests from the as reusable components in lively.next, including Google Maps, JupyterLab menus and file browser, and receive and transmit Google Charts [goo], Chartjs [cha], D3 [BOH11], Leaflet.js [lea], messages from the running Jupyter kernels to update tables on OpenLayers [ope], cytoscape:ono and many more. the Dashboard Studio, and to handle the reverse messages where Extending Galyleo’s Charting and Visualization capabilities the studio requests data from the kernel. Standard Jupyter and browser mechanisms are used. File sys- A Galyleo Chart is anything that changes its display based on tem requests come to the extension from the standard Jupyter API, tabular data from a Galyleo Table or Galyleo View. It responds to exactly the same requests and mechanisms that are sent to a Mark- a specific API, which includes two principal methods: down or Notebook editor. The extension receives them, and then 1) drawChart: redraw the chart using the current tabular data uses standard browser-based messaging (window.postMessage) to from the input or view signal the standalone web app. Similarly, when the extension 18 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) of environments hosted by a server is arbitrary, and the cost is only the cost of maintaining the Dockerfile for each environment. An environment is easy to design for a specific class, project, or task; it’s simply adding libraries and executables to a base Dockerfile. It must be tested, of course, but everything must be. And once it is tested, the burden of software maintenance and installation is removed from the user; the user is already in a task- customized, curated environment. Of course, the usual installation tools (apt, pip, conda, easy_install ) can be pre-loaded (they’re just executables) so if the environment designer missed something it can be added by the end user. Though a user can only be in one environment at a time, persistent storage is shared across all environments, meaning Fig. 7: Figure 7. Galyleo Extension Architecture switching environments is simply a question of swapping one environment out and starting another. Viewed in this light, a JupyterHub is a multi-purpose computer makes a request of JupyterLab, it does so through this mechanism in the Cloud, with an easy-to-use UI that presents through a and a receiver in the extension gets it and makes the appropriate browser. JupyterLab isn’t simply an IDE; it’s the window system method calls within JupyterLab to achieve the objective. and user interface for this computer. The JupyterLab launcher is When a kernel makes a request through the Galyleo Client, the desktop for this computer (and it changes what’s presented, this is handled exactly the same way. A Jupyter messaging server depending on the environment); the file browser is the computer’s within the extension receives the message from the kernel, and file browser, and the JupyterLab API is the equivalent of the Win- then uses browser messaging to contact the application with the dows or MacOS desktop APIs and window system that permits request, and does the reverse on a Galyleo message to the kernel. third parties to build applications for this. This is a highly efficient method of interaction, since browser- based messaging is in-memory transactions on the client machine. This Jupyter Computer has a large number of advantages over It’s important to note that there is nothing Galyleo-specific a standard desktop or laptop computer. It can be accessed from any about the extension: the Galyleo Extension is a general method device, anywhere on Earth with an Internet connection. Software for any standalone web editor (e.g., a slide or drawing editor) to installation and maintenance issues are nonexistent. Data loss due be integrated into JupyterLab. The JupyterLab connection is a few to hardware failure is extremely unlikely; backups are still required tens of lines of code in the Galyleo Dashboard. The extension is to prevent accidental data loss (e.g., erroneous file deletion), but slightly more complex, but it can be configured for a different they are far easier to do in a Cloud environment. Hardware application with a simple data structure which specifies the URL resources such as disk, RAM, and CPU can be added rapidly, of the application, file type and extension to be manipulated, and on a permanent or temporary basis. Relatively exotic resources message list. (e.g., GPUs) can also be added, again on an on-demand, temporary basis. The advantages go still further than that. Any resource that The Jupyter Computer can be accessed over a network connection can be added to The implications of the Galyleo Extension go well beyond vi- the Jupyter Computer simply by adding the appropriate accessor sualization and dashboards and easy publication in JupyterLab. library to an environment’s Dockerfile. For example, a database JupyterLab is billed as the next-generation integrated Develop- solution such as Snowflake, BigQuery, or Amazon Aurora (or ment Environment for Jupyter, but in fact it is substantially more one of many others) can be "installed" by adding the relevant than that. It is the user interface and windowing system for Cloud- library module to the environment. Of course, the user will need based personal computing. Inspired by previous extensions such to order the database service from the relevant provider, and obtain as the Vega Extension, the Galyleo Extensions seeks to provide authentication tokens, and so on -- but this is far less troublesome the final piece of the puzzle. than even maintaining the library on the desktop. Consider a Jupyter server in the Cloud, served from a Jupyter- However, to date the Jupyter Computer only supports a few Hub such as the Berkeley Data Hub. It’s built from a base window-based applications, and adding a new application is a Ubuntu image, with the standard Jupyter libraries installed and, time-consuming development task. The applications supported are importantly, a UI that includes a Linux terminal interface. Any familiar and easy to enumerate: a Notebook editor, of course; a Linux executable can be installed in the Jupyter server image, as Markdown Viewer; a CSV Viewer; a JSON Viewer (not inline can any Jupyter kernel, and any collection of libraries. The Jupyter editor), and a text editor that is generally used for everything from server has per-user persistent storage, which is organized in a Python files to Markdown to CSV. standard Linux filesystem. This makes the Jupyter server a curated This is a small subset of the rich range of JavaScript/HTML5 execution environment with a Linux command-line interface and applications which have significant value for Jupyter Computer a Notebook interface for Jupyter execution. users. For example, the Ace Code Editor supports over 110 A JupyterHub similar to Berkeley Data Hub (essentially, languages and has the functionality of popular desktop editors anything built from Zero 2 Jupyter Hub or Q-Hub) comes with a such as Vim and Sublime Text. There are over 1100 open-source number of "environments". The user chooses the environment on drawing applications on the JavaScript/HTML5 platform; multiple startup. Each environment comes with a built-in set of libraries and spreadsheet applications, the most notable being jExcel, and many executables designed for a specific task or set of tasks. The number more. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 19 Fig. 8: Figure 8. Galyleo Extension Application-Side messaging Fig. 9: Figure 9. Generations of Internet Computing Up until now, adding a new application to JupyterLab involved writing a hand-coded extension in Typescript, and compiling it into JupyterLab. However, the Galyleo Extension has been the user uses any of a wide variety of text editors to prepare the designed so that any HTML5/JavaScript application can be added document, any of a wide variety of productivity and illustrator easily, simply by configuring the Galyleo Extension with a small programs to prepare the images, runs this through a local sequence JSON file. of commands (e.g., pdflatex paper; bibtex paper; pdflatex paper. The promise of the Galyleo Extension is that it can be adapted Usually Github or another repository is used for storage and to any open-source JavaScript/HTML5 application very easily. collaboration. The Galyleo Extension merely needs the: In a Cloud service, this is another matter. There is at most one editor, selected by the service, on the site. There is no • URL of the application image editing or illustrator program that reads and writes files • File extension that the application reads/writes on the site. Auxiliary tools, such as a bib searcher, aren’t present • URL of an image for the launcher or aren’t customizable. The service has its own siloed storage, • Name of the application for the file menu its own text editor, and its own document-preparation pipeline. The application must implement a small messaging client, The tools (aside from the core document-preparation program) using the standard JavaScript messaging interface, and implement are primitive. The online service has two advantages over the the calls the Galyleo Extension makes. The conceptual picture is personal-device service. Collaboration is generally built-in, with shown im Figure 8. multiple people having access to the project, and the software need And it must support (at a minimum) messages to read and not be maintained. Aside from that, the personal-device experience write the file being edited. is generally superior. In particular, the user is free to pick their own editor, and doesn’t have to orchestrate multiple downloads and The Third Generation of Network Computing uploads from various websites. The usual collection of command- The World-Wide Web and email comprised the first generation line utilities are available to small touchups. of Internet computing (the Internet had been around for a decade The third generation of Internet Computing represented by the before the Web, and earlier networks dated from the sixties, but Jupyter Computer. This offers a Cloud experience similar to the the Web and email were the first mass-market applications on personal computer, but with the scalability, reliability, and ease of the network), and they were very simple -- both were document- collaboration of the Cloud. exchange applications, using slightly different protocols. The second generation of Network applications were the siloed pro- Conclusion and Further Work ductivity applications, where standard desktop applications moved The vision of the Jupyter Computer, bringing the power of the to the Cloud. The most famous example is of course GSuite Cloud to the personal computing experience has been started and Office 365, but there were and are many others -- Canva, with Galyleo. It will not end there. At the heart of it is a Loom, Picasa, as well as a large number of social/chat/social composition of two broadly popular platforms: HTML5/JavaScript media applications. What they all had in common was that they for presentation and interaction, and the various Jupyter kernels were siloed applications which, with the exception of the office for server-side analytics. Galyleo is a start at seamless interaction suites, didn’t even share a common store. In many ways, this of these two platforms. Continuing and extending this is further second generation of network applications recapitulates the era development of narrow-waist protocols to permit maximal inde- immediately prior to the introduction of the personal computer. pendent development and extension. That era was dominated by single-application computers such as word processors, which were simply computers with a hardcoded program loaded into ROM. Acknowledgements The Word Processor era was due to technological limitations The authors wish to thank Alex Yang, Diptorup Deb, and for -- the processing power and memory to run multiple programs their insightful comments, and Meghann Agarwal for stewardship. simply wasn’t available on low-end hardware, and PC operating We have received invaluable help from Robert Krahn, Marko systems didn’t yet exist. In some sense, the current second genera- Röder, Jens Lincke and Linus Hagemann. We thank the en- tion of Internet Computing suffers from similar technological con- gageLively team for all of their support and help: Tim Braman, straints. The "Operating System" for Internet Computing doesn’t Patrick Scaglia, Leighton Smith, Sharon Zehavi, Igor Zhukovsky, yet exist. The Jupyter Computer can provide it. Deepak Gupta, Steve King, Rick Rasmussen, Patrick McCue, To see the difference that this can make, consider LaTeX (per- Jeff Wade, Tim Gibson. The JupyterLab development commu- haps preceded by Docutils, as is the case for SciPy) preparation of nity has been helpful and supportive; we want to thank Tony a document. On a personal computer, it’s fairly straightforward; Fast, Jason Grout, Mehmet Bektas, Isabela Presedo-Floyd, Brian 20 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Granger, and Michal Krassowski. The engageLively Technology [hol22b] Installation - holoviews v1.14.9, May 2022. URL: https: Advisory Board has helped shape these ideas: Ani Mardurkar, //holoviews.org/. [IFH+ 16] Daniel Ingalls, Tim Felgentreff, Robert Hirschfeld, Robert Priya Joseph, David Peterson, Sunil Joshi, Michael Czahor, Isha Krahn, Jens Lincke, Marko Röder, Antero Taivalsaari, and Oke, Petrus Zwart, Larry Rowe, Glenn Ricart, Sunil Joshi, Antony Tommi Mikkonen. A world of active objects for work and play: Ng. We want to thank the people from the AWS team that have The first ten years of lively. In Proceedings of the 2016 ACM helped us tremendously: Matt Vail, Omar Valle, Pat Santora. International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2016, page Galyleo has been dramatically improved with the assistance of our 238–249, New York, NY, USA, 2016. Association for Comput- Japanese colleagues at KCT and Pacific Rim Technologies: Yoshio ing Machinery. URL: https://doi.org/10.1145/2986012.2986029, Nakamura, Ted Okasaki, Ryder Saint, Yoshikazu Tokushige, and doi:10.1145/2986012.2986029. Naoyuki Shimazaki. Our undestanding of Jupyter in an academic [IKM+ 97] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. Back to the future: The story of squeak, a prac- context came from our colleagues and friends at Berkeley, the tical smalltalk written in itself. In Proceedings of the 12th University of Victoria, and UBC: Shawna Dark, Hausi Müller, ACM SIGPLAN Conference on Object-Oriented Programming, Ulrike Stege, James Colliander, Chris Holdgraf, Nitesh Mor. Use Systems, Languages, and Applications, OOPSLA ’97, page 318–326, New York, NY, USA, 1997. Association for Comput- of Jupyter in a research context was emphasized by Andrew ing Machinery. URL: https://doi.org/10.1145/263698.263754, Weidlea, Eli Dart, Jeff D’Ambrogia. We benefitted enormously doi:10.1145/263698.263754. from the CITRIS Foundry: Alic Chen, Jing Ge, Peter Minor, Kyle [IPU+ 08] Daniel Ingalls, Krzysztof Palacz, Stephen Uhler, Antero Taival- Clark, Julie Sammons, Kira Gardner. The Alchemist Accelerator saari, and Tommi Mikkonen. The lively kernel a self-supporting system on a web page. In Workshop on Self-sustaining Systems, was central to making this product: Ravi Belani, Arianna Haider, pages 31–50. Springer, 2008. doi:10.1007/978-3-540- Jasmine Sunga, Mia Scott, Kenn So, Aaron Kalb, Adam Frankl. 89275-5_2. Kris Singh was a constant source of inspiration and help. Larry [jup] Jupyterlab documentation. URL: https://jupyterlab.readthedocs. Singer gave us tremendous help early on. Vibhu Mittal more io/en/stable/. than anyone inspired us to pursue this road. Ken Lutz has been [KIH+ 09] Robert Krahn, Dan Ingalls, Robert Hirschfeld, Jens Lincke, and Krzysztof Palacz. Lively wiki a development environment for a constant sounding board and inspiration, and worked hand-in- creating and sharing active web content. In Proceedings of the hand with us to develop this product. Our early customers and 5th International Symposium on Wikis and Open Collaboration, partners have been and continue to be a source of inspiration, WikiSym ’09, New York, NY, USA, 2009. Association for Computing Machinery. URL: https://doi.org/10.1145/1641309. support, and experience that is absolutely invaluable: Jonathan 1641324, doi:10.1145/1641309.1641324. Tan, Roger Basu, Jason Koeller, Steve Schwab, Michael Collins, [KRB18] Juraj Kubelka, Romain Robbes, and Alexandre Bergel. The road Alefiya Hussain, Geoff Lawler, Jim Chimiak, Fraukë Tillman, to live programming: Insights from the practice. In Proceedings Andy Bavier, Andy Milburn, Augustine Bui. All of our customers of the 40th International Conference on Software Engineering, ICSE ’18, page 1090–1101, New York, NY, USA, 2018. Associ- are really partners, none moreso than the fantastic teams at Tanjo ation for Computing Machinery. URL: https://doi.org/10.1145/ AI and Ultisim: Bjorn Nordwall, Ken Lane, Jay Sanders, Eric 3180155.3180200, doi:10.1145/3180155.3180200. Smith, Miguel Matos, Linda Bernard, Kevin Clark, and Richard [KRKP+ 16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Boyd. We want to especially thank our investors, who bet on this Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul technology and company. Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. Jupyter Notebooks - a publishing format for reproducible computational workflows. IOS Press, 2016. URL: R EFERENCES https://eprints.soton.ac.uk/403913/. [lea] An open-source javascript library for interactive maps. URL: [ABF20] Leif Andersen, Michael Ballantyne, and Matthias Felleisen. https://leafletjs.com/. Adding interactive visual syntax to textual code. Proc. ACM [LKI+ 12] Jens Lincke, Robert Krahn, Dan Ingalls, Marko Roder, and Program. Lang., 4(OOPSLA), nov 2020. URL: https://doi.org/ Robert Hirschfeld. The lively partsbin–a cloud-based repository 10.1145/3428290, doi:10.1145/3428290. for collaborative development of active web content. In 2012 [BOH11] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 data- 45th Hawaii International Conference on System Sciences, pages driven documents. IEEE Transactions on Visualization and Com- 693–701, 2012. doi:10.1109/HICSS.2012.42. puter Graphics, 17(12):2301–2309, dec 2011. URL: https://doi. [loo] Looker. URL: https://looker.com/. org/10.1109/TVCG.2011.185, doi:10.1109/TVCG.2011. [LS10] Bruce Lawson and Remy Sharp. Introducing HTML5. New 185. Riders Publishing, USA, 1st edition, 2010. [cha] Chart.js. URL: https://www.chartjs.org/. [mdn] Window.postmessage() - web apis: Mdn. URL: https://developer. [Cro06] D. Crockford. The application/json media type for javascript mozilla.org/en-US/docs/Web/API/Window/postMessage. object notation (json). RFC 4627, RFC Editor, July 2006. http:// [MRR+ 10] John Maloney, Mitchel Resnick, Natalie Rusk, Brian Silverman, www.rfc-editor.org/rfc/rfc4627.txt. URL: http://www.rfc-editor. and Evelyn Eastmond. The scratch programming language org/rfc/rfc4627.txt, doi:10.17487/rfc4627. and environment. ACM Transactions on Computing Educa- [Dev] Erik Devaney. How to create a pivot table in excel: A step-by- tion (TOCE), 10(4):1–15, 2010. URL: https://doi.org/10.1145/ step tutorial. URL: https://blog.hubspot.com/marketing/how-to- 1868358.1868363, doi:10.1145/1868358.1868363. create-pivot-table-tutorial-ht. [DGHP13] Marcello D’Agostino, Dov M Gabbay, Reiner Hähnle, and [MS95] John H Maloney and Randall B Smith. Directness and liveness in Joachim Posegga. Handbook of tableau methods. Springer the morphic user interface construction environment. In Proceed- Science & Business Media, 2013. ings of the 8th annual ACM symposium on User interface and software technology, pages 21–28, 1995. URL: https://doi.org/ [goo] Charts: google developers. URL: https://developers.google.com/ 10.1145/215585.215636, doi:10.1145/215585.215636. chart/. [HIK+ 16] Matthew Hemmings, Daniel Ingalls, Robert Krahn, Rick [ope] Openlayers. URL: https://openlayers.org/. McGeer, Glenn Ricart, Marko Röder, and Ulrike Stege. Livetalk: [pan22] Panel, May 2022. URL: https://panel.holoviz.org/. A framework for collaborative browser-based replicated- [pdt20] The pandas development team. pandas-dev/pandas: Pandas, computation applications. In 2016 28th International Tele- February 2020. URL: https://doi.org/10.5281/zenodo.3509134, traffic Congress (ITC 28), volume 01, pages 270–277, 2016. doi:10.5281/zenodo.3509134. doi:10.1109/ITC-28.2016.144. [plo] Dash overview. URL: https://plotly.com/dash/. [hol22a] High-level tools to simplify visualization in python, Apr 2022. [SKH21] Robin Schrieber, Robert Krahn, and Linus Hagemann. URL: https://holoviz.org/. lively.next, 2021. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 21 [TM17] Antero Taivalsaari and Tommi Mikkonen. The web as a software platform: Ten years later. In International Conference on Web Information Systems and Technologies, volume 2, pages 41–50. SCITEPRESS, 2017. doi:10.5220/0006234800410050. [US87] David Ungar and Randall B. Smith. Self: The power of simplic- ity. volume 22, page 227–242, New York, NY, USA, dec 1987. Association for Computing Machinery. URL: https://doi.org/10. 1145/38807.38828, doi:10.1145/38807.38828. [WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56 – 61, 2010. doi:10.25080/Majora-92bf1922-00a. 22 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application Amanda Catlett‡∗ , Theresa R. Coumbe‡ , Scott D. Christensen‡ , Mary A. Byrant‡ F Abstract—In the early 1990s the Automated Coastal Engineering Systems, the goal of deploying the ACES tools as a web-based application, ACES, was created with the goal of providing state-of-the-art computer-based and ultimately renamed it to: USACE Coastal Engineering Toolkit tools to increase the accuracy, reliability, and cost-effectiveness of Corps coastal (UCET). engineering endeavors. Over the past 30 years, ACES has become less and less The RAD team focused on updating the Python codebase accessible to engineers. An updated version of ACES was necessary for use in utilizing Python’s object-oriented programming and the newly coastal engineering. Our goal was to bring the tools in ACES to a user-friendly web-based dashboard that would allow a wide range of users to be able to easily developed HoloViz ecosystem. The team refactored the code to and quickly visualize results. We will discuss how we restructured the code implement inheritance so the code is clean, readable, and scalable. using class inheritance and the three libraries Param, Panel, and HoloViews to The tools were expanded to a Graphical User Interface (GUI) so create an extensible, interactive, graphical user interface. We have created the the implementation to a web-app would provide a user-friendly USACE Coastal Engineering Toolkit, UCET, which is a web-based application experience. This was done by using the HoloViz-maintained that contains 20 of the tools in ACES. UCET serves as an outline for the process libraries: Param, Panel, and Holoviews. of taking a model or set of tools and developing web-based application that can This paper will discuss some of the steps that were taken produce visualizations of the results. by the RAD team to update the Python codebase to create a panel application of the coastal engineering tools. In particular, Index Terms—GUI, Param, Panel, HoloViews refactoring the input and output variables with the Param library, the class hierarchy used, and utilization of Panel and HoloViews Introduction for a user-friendly experience. The Automated Coastal Engineering System (ACES) was devel- oped in response to the charge by the LTG E. R. Heiberg III, Refactoring Using Param who was the Chief of Engineers at the time, to provide improved design capabilities to the Corps coastal specialists. [Leenknecht] Each coastal tool in UCET has two classes, the model class and the In 1992, ACES was presented as an interactive computer-based GUI class. The model class holds input and output variables and design and analysis system in the field of coastal engineering. The the methods needed to run the model. Whereas the GUI class holds tools consist of seven functional areas which are: Wave Prediction, information for GUI visualization. To make implementation of the Wave Theory, Structural Design, Wave Runup Transmission and GUI more seamless we refactored model variables to utilize the Overtopping, Littoral Process, and Inlet Processes. These func- Param library. Param is a library that has the goal of simplifying tional areas contain classical theory describing wave motion, to the codebase by letting the programmer explicitly declare the types expressions resulting from tests of structures in wave flumes, and and values of parameters accepted by the code. Param can also be numerical models describing the exchange of energy from the at- seamlessly used when implementing the GUI through Panel and mosphere to the sea surface. The math behind these uses anything HoloViews. from simple algebraic expressions, both theoretical and empirical, Each UCET tool’s model class declares the input and output to numerically intense algorithms. [Leenknecht][UG][shankar] values used in the model as class parameters. Each input and Originally, ACES was written in FORTRAN 77 resulting in output variables are declared and given the following metadata a decreased ability to use the tool as technology has evolved. In features: 2017, the codebase was converted from FORTRAN 77 to MAT- • default: each input variable is defined as a Param with a LAB and Python. This conversion ensured that coastal engineers default value defined from the 1992 ACES user manual using this tool base would not need training in yet another coding • bounds: each input variable is defined with range values language. In 2020, the Engineered Resilient Systems (ERS) Rapid defined in the 1992 ACES user manual Application Development (RAD) team undertook the project with • doc or docstrings: input and output variables have the expected variable and description of the variable defined * Corresponding author: amanda.r.catlett@erdc.dren.mil ‡ ERDC as a doc. This is used as a label over the input and output widgets. Most docstrings follow the pattern of Copyright © 2022 Amanda Catlett et al. This is an open-access article <variable>:<description of variable [units, if any]> distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • constant: the output variables all set constant equal True, provided the original author and source are credited. thereby restricting the user’s ability to manipulate the USACE COASTAL ENGINEERING TOOLKIT AND A METHOD OF CREATING A WEB-BASED APPLICATION 23 value. Note that when calculations are being done they will classes. In figure 1 the model classes are labeled as: Base-Tool need to be inside a with param.edit_constant(self) function Class, Graph-Tool Class, Water-Tool Class, and Graph-Water-Tool • precedence: input and output variables will use prece- Class and each has a corresponding GUI class. dence when there are instances where the variable does Due to the inheritance in UCET, the first two questions that not need to be seen. can be asked when adding a tool are: ‘Does this tool need water variables for the calculation?’ and ‘Does this tool have a graph?’. The following is an example of an input parameter: The developer can then add a model class and a GUI class and H = param.Number( inherit based on figure 1. For instance, Linear Wave Theory is doc='H: wave height [{distance_unit}]', default=6.3, an application that yields first-order approximations for various bounds=(0.1, 200) parameters of wave motion as predicted by the wave theory. It ) provides common items of interest such as water surface elevation, An example of an output variable is: general wave properties, particle kinematics and pressure as a function of wave height and period, water depth, and position L = param.Number( doc='L: Wavelength [{distance_unit}]', in the wave form. This tool uses water density and has multiple constant=True graphs in its output. Therefore, Linear Wave Theory is considered ) a Graph-Water-Tool and the model class will inherit from Water- The model’s main calculation functions mostly remained un- TypeDriver and the GUI class will inherit the linear wave theory changed. However, the use of Param eliminated the need for code model class, WaterTypeGui, and TabularDataGui. that handled type checking and bounds checks. GUI Implementation Using Panel and HoloViews Class Hierarchy Each UCET tool has a GUI class where the Panel and HoloView UCET has twenty tools from six of the original seven functional libraries are implemented. Panel is a hierarchical container that areas of ACES. When we designed our class hierarchy, we focused can layout panes, widgets, or other Panels in an arrangement on the visualization of the web application rather than functional that forms an app or dashboard. The Pane is used to render any areas. Thus, each tool’s class can be categorized into Base-Tool, widget-like object such as Spinner, Tabulator, Buttons, CheckBox, Graph-Tool, Water-Tool, or Graph-Water-Tool. The Base-Tool has Indicators, etc. Those widgets are used to gather user input and the coastal engineering models that do not have any water property run the specific tool’s model. inputs (such as water density) in the calculations and no graphical UCET utilizes the following widgets to gather user input: output. The Graph-Tool has the coastal engineering models that do not have any water property inputs in the calculations but have • Spinner: single numeric input values a graphical output. Water-Tool has the coastal engineering models • Tabulator: table input data that have water property inputs in the calculations and no graphical • CheckBox: true or false values output. Graph-Water-Tool has the coastal engineering models that • Drop down: items that have a list of pre-selected values, have water property inputs in the calculations and has a graphical such as which units to use output. Figure 1 shows a flow of inheritance for each of those classes. UCET utilizes indicators.Number, Tabulator, and graphs to There are two types of general categories for the classes in visualize the outputs of the coastal engineering models. A single the UCET codebase: utility and tool-specific. Utility classes have number is shown using indicators.Number and graph data is methods and functions that are utilized across more than one tool. displayed using the Tabulator widget to show the data of the graph. The Utility classes are: The graphs are created using HoloViews and have tool options such as pan, zooming, and saving. Buttons are used to calculate, • BaseDriver: holds methods and functions that each tool save the current run, and save the graph data. needs to collect data, run coastal engineering models, and All of these widgets are organized into 5 pan- print data. els: title, options, inputs, outputs, and graph. The • WaterDriver: has the methods that make water density BaseGui/WaterTypeGui/TabularDataGui have methods that and water weight available to the models that need those organize the widgets within the 5 panels that most tools follow. inputs for the calculations. The “options” panel has a row that holds the dropdown selections • BaseGui: has the functions and methods for the visualiza- for units and water type (if the tool is a Water-Tool). Some tools tion and utilization of all inputs and outputs within each have a second row in the “options” panel with other drop-down tool’s GUI. options. The input panel has two columns for spinner widgets • WaterTypeGui: has the widget for water selection. with a calculation button at the bottom left. The output panel has • TabulatorDataGui: holds the functions and methods used two columns of indicators.Number for the single numeric output for visualizing plots and the ability to download the data values. At the bottom of the output panel there is a button to “save that is used for plotting. the current profile”. The graph panel is tabbed where the first Each coastal tool in UCET has two classes, the model class and tab shows the graph and the second tab shows the data provided the GUI class. The model class holds input and output variables within the graph. An visual outline of this can ben seen in the and the methods needed to run the model. The model class either following figure. Some of the UCET tools have more complicated directly inherits from the BaseDriver or the WaterTypeDriver. The input or output visualizations and that tool’s GUI class will add tool’s GUI class holds information for GUI visualization that is or modify methods to meet the needs of that tool. different from the BaseGui, WaterTypeGUI, and TabulatorDataGui The general outline of a UCET tool for the GUI. 24 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) zero the point is outside the waveform. Therefore, if a user makes a combination where the sum is less than zero, UCET will post a warning to tell the user that the point is outside the waveform. Current State See the below figure for an example The developers have been UCET approaches software development from the perspective of documenting this project using GitHub and JIRA. someone within the field of Research and Development. Each An example of a warning message based on chosen inputs. tool within UCET is not inherently complex from the traditional software perspective. However, this codebase enables researchers Results to execute complex coastal engineering models in a user-friendly Linear Wave Theory was described in the class hierarchy example. environment by leveraging open-source libraries in the scientific This Graph-Water-Tool utilizes most of the BaseGui methods. The Python ecosystem such as: Param, Panel, and HoloViews. biggest difference is instead of having three graphs in the graph Currently, UCET is only deployed using a command line panel there is a plot selector drop down where the user can select interface panel serve command. UCET is awaiting the Security which graph they want to see. Technical Implementation Guide process before it can be launched Windspeed Adjustment and Wave Growth provides a quick as a website. As part of this security vetting process we plan to and simple estimate for wave growth over open-water and re- leverage continuous integration/continuous development (CI/CD) stricted fetches in deep and shallow water. This is a Base-Tool tools to automate the deployment process. While this process is as there are no graphs and no water variables for the calculations. happening, we have started to get feedback from coastal engineers This tool has four additional options in the options panel where to update the tools usability, accuracy, and adding suggested the user can select the wind observation type, fetch type, wave features. To minimize the amount of computer science knowledge equation type, and if knots are being used. Based on the selection the coastal engineers need, our team created a batch script. This of these options, the input and output variables will change so only script creates a conda environment, activates and runs the panel what is used or calculated for those selections are seen. serve command to launch the app on a local host. The user only needs to click on the batch script for this to take place. Conclusion and Future Work Other tests are being created to ensure the accuracy of the Thirty years ago, ACES was developed to provide improved tools using a testing framework to compare output from UCET design capabilities to Corps coastal specialists and while these with that of the FORTRAN original code. The biggest barrier to tools are still used today, it became more and more difficult for this testing strategy is getting data from the FORTRAN to compare users to access them. Five years ago, there was a push to update with Python. Currently, there are tests for most of the tools that the code base to one that coastal specialists would be more familiar read a CSV file of input and output results from FORTRAN and with: MATLAB and Python. Within the last two years the RAD compare with what the Python code is calculating. team was able to finalize the update so that the user can access Our team has also compiled an updated user guide on how to these tools without having years of programming experience. We use the tool, what to expect from the tool, and a deeper description were able to do this by utilizing classes, inheritance, and the on any warning messages that might appear as the user adds input Param, Panel, and HoloViews libraries. The use of inheritance values. An example of a warning message would be, if a user has allowed for shorter code-bases and also has made it so new chooses input values that make it so the application does not make tools can be added to the toolkit. Param, Panel, and HoloViews physical sense, a warning message will appear under the output work cohesively together to not only run the models but make a header and replace all output values. For a more concrete example: simple interface. Linear Wave Theory has a vertical coordinate (z) and the water Future work will involve expanding UCET to include current depth (d) as input values and when those values sum is less than coastal engineering models, and completing the security vetting USACE COASTAL ENGINEERING TOOLKIT AND A METHOD OF CREATING A WEB-BASED APPLICATION 25 process to deploy to a publicly accessible website. We plan to incorporate an automated CI/CD to ensure smooth deployment of future versions. We also will continue to incorporate feedback from users and refine the code to ensure the application provides a quality user experience. R EFERENCES [Leenknecht] David A. Leenknecht, Andre Szuwalski, and Ann R. Sherlock. 1992. Automated Coastal Engineering System -Technical Refer- ence. Technical report. https://usace.contentdm.oclc.org/digital/ collection/p266001coll1/id/2321/ [panel] “Panel: A High-Level App and Dashboarding Solution for Python.” Panel 0.12.6 Documentation, Panel Contributors, 2019, https://panel.holoviz.org/. [holoviz] “High-Level Tools to Simplify Visualization in Python.” HoloViz 0.13.0 Documentation, HoloViz Authors, 2017, https: //holoviz.org. [UG] David A. Leenknecht, et al. “Automated Tools for Coastal Engineering.” Journal of Coastal Research, vol. 11, no. 4, Coastal Education & Research Foundation, Inc., 1995, pp. 1108-24. https://usace.contentdm.oclc.org/digital/collection/ p266001coll1/id/2321/ [shankar] N.J. Shankar, M.P.R. Jayaratne, Wave run-up and overtopping on smooth and rough slopes of coastal structures, Ocean Engi- neering, Volume 30, Issue 2, 2003, Pages 221-238, ISSN 0029- 8018, https://doi.org/10.1016/S0029-8018(02)00016-1 Fig. 1: Screen shot of Linear Wave Theory Fig. 2: Screen shot of Windspeed Adjustment and Wave Growth 26 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI Luigi Cruz‡∗ , Wael Farah‡ , Richard Elkins‡ F Abstract—A common technique adopted by the Search For Extraterrestrial In- by an analog-to-digital converter as voltages and transmitted to a telligence (SETI) community is monitoring electromagnetic radiation for signs of processing logic to extract useful information from it. The data extraterrestrial technosignatures using ground-based radio observatories. The stream generated by a radio telescope can easily reach the rate analysis is made using a Python-based software called TurboSETI to detect nar- of terabits per second because of the ultra-wide bandwidth radio rowband drifting signals inside the recordings that can mean a technosignature. spectrum. The current workflow utilized by the Breakthrough The data stream generated by a telescope can easily reach the rate of terabits per second. Our goal was to improve the processing speeds by writing a GPU- Listen, the largest scientific research program aimed at finding accelerated backend in addition to the original CPU-based implementation of the evidence of extraterrestrial intelligence, consists in pre-processing de-doppler algorithm used to integrate the power of drifting signals. We discuss and storing the incoming data as frequency-time binary files how we ported a CPU-only program to leverage the parallel capabilities of a ([LCS+ 19]) in persistent storage for later analysis. This post- GPU using CuPy, Numba, and custom CUDA kernels. The accelerated backend analysis is made possible using a Python-based software called reached a speed-up of an order of magnitude over the CPU implementation. TurboSETI ([ESF+ 17]) to detect narrowband signals that could be drifting in frequency owing to the relative radial velocity between Index Terms—gpu, numba, cupy, seti, turboseti the observer on earth, and the transmitter. The offline processing speed of TurboSETI is directly related to the scientific output of 1. Introduction an observation. Each voltage file ingested by TurboSETI is often on the order of a few hundreds of gigabytes. To process data The Search for Extraterrestrial Intelligence (SETI) is a broad term efficiently without Python overhead, the program uses Numpy for utilized to describe the effort of locating any scientific proof of near machine-level performance. To measure a potential signal’s past or present technology that originated beyond the bounds of drift rate, TurboSETI uses a de-doppler algorithm to align the Earth. SETI can be performed in a plethora of ways: either actively frequency axis according to a pre-set drift rate. Another algorithm by deploying orbiters and rovers around planets/moons within the called “hitsearch” ([ESF+ 17]) is then utilized to identify any solar system, or passively by either searching for biosignatures in signal present in the recorded spectrum. These two algorithms exoplanet atmospheres or “listening” to technologically-capable are the most resource-hungry elements of the pipeline consuming extraterrestrial civilizations. One of the most common techniques almost 90% of the running time. adopted by the SETI community is monitoring electromagnetic radiation for narrowband signs of technosignatures using ground- based radio observatories. This search can be performed in mul- 2. Approach tiple ways: equipment primarily built for this task, like the Allen Multiple methods were utilized in this effort to write a GPU- Telescope Array (California, USA), renting observation time, or accelerated backend and optimize the CPU implementation of in the background while the primary user is conducting other ob- TurboSETI. In this section, we enumerate all three main methods. servations. Other radio-observatories useful for this search include the MeerKAT Telescope (Northern Cape, South Africa), Green 2.1. CuPy Bank Telescope (West Virginia, USA), and the Parkes Telescope The original implementation of TurboSETI heavily depends on (New South Wales, Australia). The operation of a radio-telescope Numpy ([HMvdW+ 20]) for data processing. To keep the number is similar to an optical telescope. Instead of using optics to of modifications as low as possible, we implemented the GPU- concentrate light into an optical sensor, a radio-telescope operates accelerated backend using CuPy ([OUN+ 17]). This open-source by concentrating electromagnetic waves into an antenna using a library offers GPU acceleration backed by NVIDIA CUDA and large reflective structure called a “dish” ([Reb82]). The interac- AMD ROCm while using a Numpy style API. This enabled us tion between the metallic antenna and the electromagnetic wave to reuse most of the code between the CPU and GPU-based generates a faint electrical current. This effect is then quantized implementations. * Corresponding author: lfcruz@seti.org 2.1. Numba ‡ SETI Institute Some computationally heavy methods of the original CPU-based Copyright © 2022 Luigi Cruz et al. This is an open-access article distributed implementation of TurboSETI were written in Cython. This ap- under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the proach has disadvantages: the developer has to be familiar with original author and source are credited. Cython syntax to alter the code; the code requires additional logic SEARCH FOR EXTRATERRESTRIAL INTELLIGENCE: GPU ACCELERATED TURBOSETI 27 Double-Precision (float64) 4. Conclusion Impl. Device File A File B File C The original implementation of TurboSETI worked exclusively Cython CPU 0.44 min 25.26 min 23.06 min on the CPU to process data. We implemented a GPU-accelerated Numba CPU 0.36 min 20.67 min 22.44 min backend to leverage the massive parallelization capabilities of a CuPy GPU 0.05 min 2.73 min 3.40 min graphical device. The benchmark performed shows that the new CPU and GPU implementation takes significantly less time to TABLE 1 process observation data resulting in more science being produced. Double precision processing time benchmark with Cython, Numba and CuPy Based on the results, the recommended configuration to run the implementation. program is with single-precision calculations on a GPU device. Single-Precision (float32) R EFERENCES [ESF+ 17] J. Emilio Enriquez, Andrew Siemion, Griffin Foster, Vishal Impl. Device File A File B File C Gajjar, Greg Hellbourg, Jack Hickish, Howard Isaacson, Numba CPU 0.26 min 16.13 min 16.15 min Danny C. Price, Steve Croft, David DeBoer, Matt Lebof- CuPy GPU 0.03 min 1.52 min 2.14 min sky, David H. E. MacMahon, and Dan Werthimer. The breakthrough listen search for intelligent life: 1.1–1.9 TABLE 2 ghz observations of 692 nearby stars. The Astrophys- Single precision processing time benchmark with Numba and CuPy ical Journal, 849(2):104, Nov 2017. URL: https://ui. implementation. adsabs.harvard.edu/abs/2017ApJ...849..104E/abstract, doi: 10.3847/1538-4357/aa8d1b. [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, to be compiled at installation time. Consequently, it was decided Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del to replace Cython with pure Python methods decorated with the Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Numba ([LPS15]) accelerator. By leveraging the power of the Just- Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer In-Time (JIT) compiler from Low Level Virtual Machine (LLVM), Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro- gramming with NumPy. Nature, 585(7825):357–362, Septem- Numba can compile Python code into assembly code as well ber 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, as apply Single Instruction/Multiple Data (SIMD) acceleration doi:10.1038/s41586-020-2649-2. instructions to achieve near machine-level speeds. [LCS 19] + Matthew Lebofsky, Steve Croft, Andrew P. V. Siemion, Danny C. Price, J. Emilio Enriquez, Howard Isaacson, David H. E. MacMahon, David Anderson, Bryan Brzycki, Jeff Cobb, 2.2. Single-Precision Floating-Point Daniel Czech, David DeBoer, Julia DeMarines, Jamie Drew, The original implementation of the software handled the input Griffin Foster, Vishal Gajjar, Nectaria Gizani, Greg Hellbourg, Eric J. Korpela, and Brian Lacki. The breakthrough listen data as double-precision floating-point numbers. This behavior search for intelligent life: Public data, formats, reduction, and would cause all the mathematical operations to take significantly archiving. Publications of the Astronomical Society of the longer to process because of the extended precision. The ultimate Pacific, 131(1006):124505, Nov 2019. URL: https://arxiv.org/ abs/1906.07391, doi:10.1088/1538-3873/ab3e82. precision of the output product is inherently limited by the preci- [LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: sion of the original input data which in most cases is represented A llvm-based python jit compiler. In Proceedings of the by an 8-bit signed integer. Therefore, the addition of a single- Second Workshop on the LLVM Compiler Infrastructure in precision floating-point number decreased the processing time HPC, LLVM ’15, New York, NY, USA, 2015. Association for Computing Machinery. URL: https://doi.org/10.1145/ without compromising the useful precision of the output data. 2833157.2833162, doi:10.1145/2833157.2833162. [OUN 17] + Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido, and Crissman Loomis. Cupy: A numpy-compatible library 3. Results for nvidia gpu calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty- To test the speed improvements between implementations we used first Annual Conference on Neural Information Processing files from previous observations coming from different observato- Systems (NIPS), 2017. URL: http://learningsys.org/nips17/ ries. Table 1 indicates the processing times it took to process three assets/papers/paper_16.pdf. [Reb82] Grote Reber. Cosmic Static, pages 61–69. Springer Nether- different files in double-precision mode. We can notice that the lands, Dordrecht, 1982. URL: https://doi.org/10.1007/978- CPU implementation based on Numba is measurably faster than 94-009-7752-5_6, doi:10.1007/978-94-009-7752- the original CPU implementation based on Cython. At the same 5_6. time, the GPU-accelerated backend processed the data from 6.8 to 9.3 times faster than the original CPU-based implementation. Table 2 indicates the same results as Table 1 but with single- precision floating points. The original Cython implementation was left out because it doesn’t support single-precision mode. Here, the same data was processed from 7.5 to 10.6 times faster than the Numba CPU-based implementation. To illustrate the processing time improvement, a single obser- vation containing 105 GB of data was processed in 12 hours by the original CPU-based TurboSETI implementation on an i7-7700K Intel CPU, and just 1 hour and 45 minutes by the GPU-accelerated backend on a GTX 1070 Ti NVIDIA GPU. 28 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration Pi-Yueh Chuang‡∗ , Lorena A. Barba‡ F Abstract—Though PINNs (physics-informed neural networks) are now deemed PINN (physics-informed neural network) method denotes an ap- as a complement to traditional CFD (computational fluid dynamics) solvers proach to incorporate deep learning in CFD applications, where rather than a replacement, their ability to solve the Navier-Stokes equations solving partial differential equations plays the key role. These par- without given data is still of great interest. This report presents our not-so- tial differential equations include the well-known Navier-Stokes successful experiments of solving the Navier-Stokes equations with PINN as equations—one of the Millennium Prize Problems. The universal a replacement to traditional solvers. We aim to, with our experiments, prepare readers for the challenges they may face if they are interested in data-free PINN. approximation theorem ([Hor]) implies that neural networks can In this work, we used two standard flow problems: 2D Taylor-Green vortex at model the solution to the Navier-Stokes equations with high Re = 100 and 2D cylinder flow at Re = 200. The PINN method solved the 2D fidelity and capture complicated flow details as long as networks Taylor-Green vortex problem with acceptable results, and we used this flow as an are big enough. The idea of PINN methods can be traced back accuracy and performance benchmark. About 32 hours of training were required to [DPT], while the name PINN was coined in [RPK]. Human- for the PINN method’s accuracy to match the accuracy of a 16 × 16 finite- provided data are not necessary in applying PINN [LMMK], mak- difference simulation, which took less than 20 seconds. The 2D cylinder flow, on ing it a potential alternative to traditional CFD solvers. Sometimes the other hand, did not produce a physical solution. The PINN method behaved it is branded as unsupervised learning—it does not rely on human- like a steady-flow solver and did not capture the vortex shedding phenomenon. provided data, making it sound very "AI." It is now common to By sharing our experience, we would like to emphasize that the PINN method is still a work-in-progress, especially in terms of solving flow problems without any see headlines like "AI has cracked the Navier-Stokes equations" in given data. More work is needed to make PINN feasible for real-world problems recent popular science articles ([Hao]). in such applications. (Reproducibility package: [Chu22].) Though data-free PINN as an alternative to traditional CFD solvers may sound attractive, PINN can also be used under data- Index Terms—computational fluid dynamics, deep learning, physics-informed driven configurations, for which it is better suited. Cai et al. neural network [CMW+ ] state that PINN is not meant to be a replacement of existing CFD solvers due to its inferior accuracy and efficiency. The most useful applications of PINN should be those with 1. Introduction some given data, and thus the models are trained against the Recent advances in computing and programming techniques have data. For example, when we have experimental measurements or motivated practitioners to revisit deep learning applications in partial simulation results (coarse-grid data, limited numbers of computational fluid dynamics (CFD). We use the verb "revisit" snapshots, etc.) from traditional CFD solvers, PINN may be useful because deep learning applications in CFD already existed going to reconstruct the flow or to be a surrogate model. back to at least the 1990s, for example, using neural networks as Nevertheless, data-free PINN may offer some advantages over surrogate models ([LS], [FS]). Another example is the work of traditional solvers, and using data-free PINN to replace traditional Lagaris and his/her colleagues ([LLF]) on solving partial differen- solvers is still of great interest to researchers (e.g., [KDYI]). First, tial equations with fully-connected neural networks back in 1998. it is a mesh-free scheme, which benefits engineering problems Similar work with radial basis function networks can be found where fluid flows interact with objects of complicated geometries. in reference [LLQH]. Nevertheless, deep learning applications Simulating these fluid flows with traditional numerical methods in CFD did not get much attention until this decade, thanks to usually requires high-quality unstructured meshes with time- modern computing technology, including GPUs, cloud computing, consuming human intervention in the pre-processing stage before high-level libraries like PyTorch and TensorFlow, and their Python actual simulations. The second benefit of PINN is that the trained APIs. models approximate the governing equations’ general solutions, Solving partial differential equations with deep learning is meaning there is no need to solve the equations repeatedly for particularly interesting to CFD researchers and practitioners. The different flow parameters. For example, a flow model taking boundary velocity profiles as its input arguments can predict * Corresponding author: pychuang@gwu.edu flows under different boundary velocity profiles after training. ‡ Department of Mechanical and Aerospace Engineering, The George Wash- ington University, Washington, DC 20052, USA Conventional numerical methods, on the contrary, require repeated simulations, each one covering one boundary velocity profile. Copyright © 2022 Pi-Yueh Chuang et al. This is an open-access article This feature could help in situations like engineering design op- distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, timization: the process of running sets of experiments to conduct provided the original author and source are credited. parameter sweeps and find the optimal values or geometries for EXPERIENCE REPORT OF PHYSICS-INFORMED NEURAL NETWORKS IN FLUID SIMULATIONS: PITFALLS AND FRUSTRATION 29 products. Given these benefits, researchers continue studying and and momentum equations: improving the usability of data-free PINN (e.g., [WYP], [DZ], [WTP], [SS]). ∂U ~ ~ = − 1 ∇p + ν∇2U ~ · ∇)U ~ +~g + (U (2) Data-free PINN, however, is not ready nor meant to replace ∂t ρ traditional CFD solvers. This claim may be obvious to researchers where ρ = ρ(~x,t), ν = ν(~x,t), and p = p(~x,t) are scalar fields experienced in PINN, but it may not be clear to others, especially denoting density, kinematic viscosity, and pressure, respectively. to CFD end-users without ample expertise in numerical methods. ~x denotes the spatial coordinate, and ~x = [x, y]T in two di- Even in literature that aims to improve PINN, it’s common to see mensions. The density and viscosity fields are usually known only the success stories with simple CFD problems. Important in- and given, while the pressure field is unknown. U ~ = U(~ ~ x,t) = formation concerning the feasibility of PINN in practical and real- [u(x, y,t), v(x, y,t)]T is a vector field for flow velocity. All of them world applications is often missing from these success stories. For are functions of the spatial coordinate in the computational domain example, few reports discuss the required computing resources, Ω and time before a given limit T . The gravitational field ~g may the computational cost of training, the convergence properties, or also be a function of space and time, though it is usually a constant. the error analysis of PINN. PINN suffers from performance and A solution to the Navier-Stokes equations is subjected to an initial solvability issues due to the need for high-order automatic differ- condition and boundary conditions: entiation and multi-objective nonlinear optimization. Evaluating high-order derivatives using automatic differentiation increases ~ x,t) = U U(~ ~ 0 (~x), ∀~x ∈ Ω, t = 0 the computational graphs of neural networks. And multi-objective U(~x,t) = UΓ (~x,t), ∀~x ∈ Γ, t ∈ [0, T ] ~ ~ (3) optimization, which reduces all the residuals of the differential p(~x,t) = pΓ (x,t), ∀~x ∈ Γ, t ∈ [0, T ] equations, initial conditions, and boundary conditions, makes the training difficult to converge to small-enough loss values. where Γ represents the boundary of the computational domain. Fluid flows are sensitive nonlinear dynamical systems in which a small change or error in inputs may produce a very different 2.1. The PINN method flow field. So to get correct solutions, the optimization in PINN The basic form of the PINN method ([RPK], [CMW+ ]) starts from needs to minimize the loss to values very close to zero, further approximating U~ and p with a neural network: compromising the method’s solvability and performance. " # This paper reports on our not-so-successful PINN story as a ~ U (~x,t) ≈ G(~x,t; Θ) (4) lesson learned to readers, so they can be aware of the challenges p they may face if they consider using data-free PINN in real-world applications. Our story includes two computational experiments Here we use a single network that predicts both pressure and as case studies to benchmark the PINN method’s accuracy and velocity fields. It is also possible to use different networks for them computational performance. The first case study is a Taylor- separately. Later in this work, we will use GU and G p to denote Green vortex, solved successfully though not to our complete the predicted velocity and pressure from the neural network. Θ at satisfaction. We will discuss the performance of PINN using this this point represents the free parameters of the network. case study. The second case study, flow over a cylinder, did not To determine the free parameters, Θ, ideally, we hope the even result in a physical solution. We will discuss the frustration approximate solution gives zero residuals for equations (1), (2), we encountered with PINN in this case study. and (3). That is We built our PINN solver with the help of NVIDIA’s Modulus r1 (~x,t; Θ) ≡ ∇ · GU = 0 library ([noa]). Modulus is a high-level Python package built on ∂ GU 1 top of PyTorch that helps users develop PINN-based differential r2 (~x,t; Θ) ≡ + (GU · ∇)GU + ∇G p − ν∇2 GU −~g = 0 equation solvers. Also, in each case study, we also carried out sim- ∂t ρ (5) ulations with our CFD solver, PetIBM ([CMKAB18]). PetIBM is r3 (~x; Θ) ≡ GU ~ t=0 − U0 = 0 a traditional solver using staggered-grid finite difference methods r4 (~x,t; Θ) ≡ GU − U~ Γ = 0, ∀~x ∈ Γ with MPI parallelization and GPU computing. PetIBM simulations r5 (~x,t; Θ) ≡ G p − pΓ = 0, ∀~x ∈ Γ in each case study served as baseline data. For all cases, config- urations, post-processing scripts, and required Singularity image And the set of desired parameter, Θ = θ , is the common zero root definitions can be found at reference [Chu22]. of all the residuals. This paper is structured as follows: the second section briefly The derivatives of G with respect to ~x and t are usually ob- describes the PINN method and an analogy to traditional CFD tained using automatic differentiation. Nevertheless, it is possible methods. The third and fourth sections provide our computational to use analytical derivatives when the chosen network architecture experiments of the Taylor-Green vortex in 2D and a 2D laminar is simple enough, as reported by early-day literature ([LLF], cylinder flow with vortex shedding. Most discussions happen [LLQH]). in the corresponding case studies. The last section presents the If residuals in (5) are not complicated, and if the number of conclusion and discussions that did not fit into either one of the the parameters, NΘ , is small enough, we may numerically find the cases. zero root by solving a system of NΘ nonlinear equations generated from a suitable set of NΘ spatial-temporal points. However, the 2. Solving Navier-Stokes equations with PINN scenario rarely happens as G is usually highly complicated and NΘ is large. Moreover, we do not even know if such a zero root The incompressible Navier-Stokes equations in vector form are exists for the equations in (5). composed of the continuity equation: Instead, in PINN, the condition is relaxed. We do not seek the ∇ ·U ~ =0 (1) zero root of (5) but just hope to find a set of parameters that make 30 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) the residuals sufficiently close to zero. Consider the sum of the l2 2.2. An analogy to conventional numerical methods norms of residuals: For readers with a background in numerical methods for partial ( 5 x∈Ω differential equations, we would like to make an analogy between r(~x,t; Θ = θ ) ≡ ∑ kri (~x,t; Θ = θ )k , ∀ 2 (6) traditional numerical methods and PINN. i=1 t ∈ [0, T ] In obtaining strong solutions to differential equations, we can The θ that makes residuals closest to zero (or even equal to zero describe the solution workflows of most numerical methods with if such θ exists) also makes (6) minimal because r(~x,t; Θ) ≥ 0. In five stages: other words, ( 1) Designing the approximate solution with undetermined x∈Ω parameters θ = arg min r(~x,t; Θ) ∀ (7) Θ t ∈ [0, T ] 2) Choosing proper approximation for derivatives 3) Obtaining the so-called modified equation by substituting This poses a fundamental difference between the PINN method approximate derivatives into the differential equations and traditional CFD schemes, making it potentially more difficult and initial/boundary conditions for the PINN method to achieve the same accuracy as the tradi- 4) Generating a system of linear/nonlinear algebraic equa- tional schemes. We will discuss this more in section 3. Note that tions in practice, each loss term on the right-hand-side of equation (6) is 5) Solving the system of equations weighted. We ignore the weights here for demonstrating purpose. To solve (7), theoretically, we can use any number of spatial- For example, to solve ∇U 2 (x) = s(x), the most naive spectral temporal points, which eases the need of computational resources, method ([Tre]) approximates the solution with U(x) ≈ G(x) = N compared to finding the zero root directly. Gradient-descent- ∑ ci φi (x), where ci represents undetermined parameters, and φi (x) based optimizers further reduce the computational cost, especially i=1 denotes a set of either polynomials, trigonometric functions, or in terms of memory usage and the difficulty of parallelization. complex exponentials. Next, obtaining the first derivative of U is Alternatively, Quasi-Newton methods may work but only when N NΘ is small enough. straightforward—we can just assume U 0 (x) ≈ G0 (x) = ∑ ci φi0 (x). i=1 However, even though equation (7) may be solvable, it is still The second-order derivative may be more tricky. One can assume a significantly expensive task. While typical data-driven learning N requires one back-propagation pass on the derivatives of the loss U 00 (x) ≈ G00 = ∑ ci φi00 (x). Or, another choice for nodal bases (i.e., i=1 function, here automatic differentiation is needed to evaluate the N derivatives of G with respect to ~x and t. The first-order derivatives when φi (x) is chosen to make ci ≡ G(xi )) is U 00 (x) ≈ ∑ ci G0 (xi ). i=1 require one back-propagation on the network, while the second- Because φi (x) is known, the derivatives are analytical. After sub- order derivatives present in the diffusion term ∇2 GU require an stituting the approximate solution and derivatives in to the target additional back-propagation on the first-order derivatives’ com- differential equation, we need to solve for parameters c1 , · · · , cN . putational graph. Finally, to update parameters in an optimizer, We do so by selecting N points from the computational domain the gradients of G with respect to parameters Θ requires another and creating a system of N linear equations: back-propagation on the graph of the second-order derivatives. This all leads to a very large computational graph. We will see the φ100 (x1 ) · · · φN00 (x1 ) c1 s(x1 ) . .. . .. . . performance of the PINN method in the case studies. . . . .. − .. = 0 (8) In summary, when viewing the PINN method as supervised φ1 (xN ) · · · φN (xN ) cN 00 00 s(xN ) machine learning, the inputs of a network are spatial-temporal coordinates, and the outputs are the physical quantities of our Finally, we determine the parameters by solving this linear system. interest. The loss or objective functions in PINN are governing Though this example uses a spectral method, the workflow also equations that regulate how the target physical quantities should applies to many other numerical methods, such as finite difference behave. The use of governing equations eliminates the need for methods, which can be reformatted as a form of spectral method. true answers. A trivial example is using Bernoulli’s equation as With this workflow in mind, it should be easy to see the anal- the loss function, i.e., loss = 2gu2 p + ρg − H0 + z(x), and a neural ogy between PINN and conventional numerical methods. Aside network predicts the flow speed u and pressure p at a given from using much more complicated approximate solutions, the location x along a streamline. (The gravitational acceleration major difference lies in how to determine the unknown parameters g, density ρ, energy head H0 , and elevation z(x) are usually in the approximate solutions. While traditional methods solve the known and given.) Such a loss function regulates the relationship zero-residual conditions, PINN relies on searching the minimal between predicted u and p and does not need true answers for residuals. A secondary difference is how to approximate deriva- the two quantities. Unlike Bernoulli’s equation, most governing tives. Conventional numerical methods use analytical or numerical equations in physics are usually differential equations (e.g., heat differentiation of the approximate solutions, and the PINN meth- equations). The main difference is that now the PINN method ods usually depends on automatic differentiation. This difference needs automatic differentiation to evaluate the loss. Regardless may be minor as we are still able to use analytical differentiation of the forms of governing equations, spatial-temporal coordinates for simple network architectures with PINN. However, automatic are the only data required during training. Hence, throughout this differentiation is a major factor affecting PINN’s performance. paper, training data means spatial-temporal points and does not 3. Case 1: Taylor-Green vortex: accuracy and performance involve any true answers to predicted quantities. (Note in some literature, the PINN method is applied to applications that do need 3.1. 2D Taylor-Green vortex true answers, see [CMW+ ]. These applications are out of scope The Taylor-Green vortex represents a family of flows with a here.) specific form of analytical initial flow conditions in both 2D EXPERIENCE REPORT OF PHYSICS-INFORMED NEURAL NETWORKS IN FLUID SIMULATIONS: PITFALLS AND FRUSTRATION 31 Fig. 2: Total residuals (loss) with respect to training iterations. Fig. 1: Contours of u and v at t = 32 to demonstrate the solution of 2D Taylor-Green vortex. variants). We carried out the training using different numbers of and 3D. The 2D Taylor-Green vortex has closed-form analytical GPUs to investigate the performance of the PINN solver. All cases solutions with periodic boundary conditions, and hence they are were trained up to 1 million iterations. Note that the parallelization standard benchmark cases for verifying CFD solvers. In this work, was done with weak scaling, meaning increasing the number of we used the following 2D Taylor-Green vortex: GPUs would not reduce the workload of each GPU. Instead, increasing the number of GPUs would increase the total and x y ν u(x, y,t) = V0 cos( ) sin( ) exp(−2 2 t) per-iteration numbers of training points. Therefore, our expected L L L x y ν outcome was that all cases required about the same wall time to v(x, y,t) = −V0 sin( ) cos( ) exp(−2 2 t) (9) finish, while the residual from using 8 GPUs would converge the L L L ρ 2x 2y ν fastest. p(x, y,t) = − V02 cos( ) + cos( ) exp(−4 2 t) After training, the PINN solver’s prediction errors (i.e., accu- 4 L L L racy) were evaluated on cell centers of a 512 × 512 Cartesian mesh where V0 represents the peak (and also the lowest) velocity at against the analytical solution. With these spatially distributed t = 0. Other symbols carry the same meaning as those in section errors, we calculated the L2 error norm for a given t: 2. sZ r The periodic boundary conditions were applied to x = −Lπ, L2 = error(x, y)2 dΩ ≈ ∑ ∑ errori,2 j ∆Ωi, j (10) x = Lπ, y = −Lπ, and y = Lπ. We used the following parameters Ω i j in this work: V0 = L = ρ = 1.0 and ν = 0.01. These parameters correspond to Reynolds number Re = 100. Figure 1 shows a where i and j here are the indices of a cell center in the Cartesian snapshot of velocity at t = 32. mesh. ∆Ωi, j is the corresponding cell area, 4π 2 /5122 in this case. We compared accuracy and performance against results using 3.2. Solver and runtime configurations PetIBM. All PetIBM simulations in this section were done with 1 K40 GPU and 6 CPU cores (Intel i7-5930K) on our old lab The neural network used in the PINN solver is a fully-connected workstation. We carried out 7 PetIBM simulations with different neural network with 6 hidden layers and 256 neurons per layer. spatial resolutions: 2k × 2k for k = 4, 5, . . . , 10. The time step size The activation functions are SiLU ([HG]). We used Adam for for each spatial resolution was ∆t = 0.1/2k−4 . optimization, and its initial parameters are the defaults from Py- A special note should be made here: the PINN solver used Torch. The learning rate exponentially decayed through PyTorch’s single-precision floats, while PetIBM used double-precision floats. ExponentialLR with gamma equal to 0.951/10000 . Note we did It might sound unfair. However, this discrepancy does not change not conduct hyperparameter optimization, given the computational the qualitative findings and conclusions, as we will see later. cost. The hyperparameters are mostly the defaults used by the 3D Taylor-Green example in Modulus ([noa]). The training data were simply spatial-temporal coordinates. 3.3. Results Before the training, the PINN solver pre-generated 18,432,000 Figure 2 shows the convergence history of the total residuals spatial-temporal points to evaluate the residuals of the Navier- (equation (6)). Using more GPUs in weak scaling (i.e., more Stokes equations (the r1 and r2 in equation (5)). These training training points) did not accelerate the convergence, contrary to points were randomly chosen from the spatial domain [−π, π] × what we expected. All cases converged at a similar rate. Though [−π, π] and temporal domain (0, 100]. The solver used only 18,432 without a quantitative criterion or justification, we considered that points in each training iteration, making it a batch training. For further training would not improve the accuracy. Figure 3 gives a the residual of the initial condition (the r3 ), the solver also pre- visual taste of what the predictions from the neural network look generated 18,432,000 random spatial points and used only 18,432 like. per iteration. Note that for r3 , the points were distributed in space The result visually agrees with that in figure 1. However, as only because t = 0 is a fixed condition. Because of the periodic shown in figure 4, the error magnitudes from the PINN solver boundary conditions, the solver did not require any training points are much higher than those from PetIBM. Figure 4 shows the for r4 and r5 . prediction errors with respect to t. We only present the error on The hardware used for the PINN solver was a single node of the u velocity as those for v and p are similar. The accuracy of NVIDIA’s DGX-A100. It was equipped with 8 A100 GPUs (80GB the PINN solver is similar to that of the 16 × 16 simulation with 32 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 5: L2 error norm versus wall time. Fig. 3: Contours of u and v at t = 32 from the PINN solver. 3.4. Discussion A notice should be made regarding the results: we do not claim that these results represent the most optimized configuration of the PINN method. Neither do we claim the qualitative conclusions apply to all other hyperparameter configurations. These results merely reflect the outcomes of our computational experiments with respect to the specific configuration abovementioned. They should be deemed experimental data rather than a thorough anal- ysis of the method’s characteristics. The Taylor-Green vortex serves as a good benchmark case because it reduces the number of required residual constraints: residuals r4 and r5 are excluded from r in equation 6. This means Fig. 4: L2 error norm versus simulation time. the optimizer can concentrate only on the residuals of initial conditions and the Navier-Stokes equations. Using more GPUs (thus using more training points, i.e., spatio- PetIBM. Using more GPUs, which implies more training points, temporal points) did not speed up the convergence, which may does not improve the accuracy. indicate that the per-iteration number of points on a single GPU Regardless of the magnitudes, the trends of the errors with is already big enough. The number of training points mainly respect to t are similar for both PINN and PetIBM. For PetIBM, affects the mean gradients of the residual with respect to model the trend shown in figure 4 indicates that the temporal error is parameters, which then will be used to update parameters by bounded, and the scheme is stable. However, this concept does gradient-descent-based optimizers. If the number of points is not apply to PINN as it does not use any time-marching schemes. already big enough on a single GPU, then using more points or What this means for PINN is still unclear to us. Nevertheless, more GPUs is unlikely to change the mean gradients significantly, it shows that PINN is able to propagate the influence of initial causing the convergence solely to rely on learning rates. conditions to later times, which is a crucial factor for solving The accuracy of the PINN solver was acceptable but not hyperbolic partial differential equations. satisfying, especially when considering how much time it took Figure 5 shows the computational cost of PINN and PetIBM to achieve such accuracy. The low accuracy to some degree was in terms of the desired accuracy versus the required wall time. We not surprising. Recall the theory in section 2. The PINN method only show the PINN results of 8 A100 GPUs on this figure. We only seeks the minimal residual on the total residual’s hyperplane. believe this type of plot may help evaluate the computational cost It does not try to find the zero root of the hyperplane and does not in engineering applications. According to the figure, for example, even care whether such a zero root exists. Furthermore, by using a achieving an accuracy of 10−3 at t = 2 requires less than 1 second gradient-descent-based optimizer, the resulting minimum is likely for PetIBM with 1 K40 and 6 CPU cores, but it requires more than just a local minimum. It makes sense that it is hard for the residual 8 hours for PINN with at least 1 A100 GPU. to be close to zero, meaning it is hard to make errors small. Table 1 lists the wall time per 1 thousand iterations and the Regarding the performance result in figure 5, we would like scaling efficiency. As indicated previously, weak scaling was used to avoid interpreting the result as one solver being better than the in PINN, which follows most machine learning applications. other one. The proper conclusion drawn from the figure should be as follows: when using the PINN solver as a CFD simulator for a specific flow condition, PetIBM outperforms the PINN solver. 1 GPUs 2 GPUs 4 GPUs 8 GPUs As stated in section 1, the PINN method can solve flows under Time (sec/1k iters) 85.0 87.7 89.1 90.1 different flow parameters in one run—a capability that PetIBM Efficiency (%) 100 97 95 94 does not have. The performance result in figure 5 only considers a limited application of the PINN solver. One issue for this case study was how to fairly compare TABLE 1: Weak scaling performance of the PINN solver using the PINN solver and PetIBM, especially when investigating the NVIDIA A100-80GB GPUs accuracy versus the workload/problem size or time-to-solution EXPERIENCE REPORT OF PHYSICS-INFORMED NEURAL NETWORKS IN FLUID SIMULATIONS: PITFALLS AND FRUSTRATION 33 versus problem size. Defining the problem size in PINN is not as straightforward as we thought. Let us start with degrees of freedom—in PINN, it is called the number of model parame- ters, and in traditional CFD solvers, it is called the number of unknowns. The PINN solver and traditional CFD solvers are all trying to determine the free parameters in models (that is, approximate solutions). Hence, the number of degrees of freedom determines the problem sizes and workloads. However, in PINN, problem sizes and workloads do not depend on degrees of freedom solely. The number of training points also plays a critical role in workloads. We were not sure if it made sense to define a problem size as the sum of the per-iteration number of training points and the number of model parameters. For example, 100 model parameters plus 100 training points is not equivalent to 150 model parameters plus 50 training points in terms of workloads. So without a proper definition of problem size and workload, it was not clear how to fairly compare PINN and traditional CFD methods. Nevertheless, the gap between the performances of PINN and Fig. 6: Demonstration of velocity and vorticity fields at t = 200 from a PetIBM simulation. PetIBM is too large, and no one can argue that using other metrics would change the conclusion. Not to mention that the PINN solver ran on A100 GPUs, while PetIBM ran on a single K40 GPU 200. Figure 6 shows the velocity and vorticity snapshots at t = 200. in our lab, a product from 2013. This is also not a surprising As shown in the figure, this type of flow displays a phenomenon conclusion because, as indicated in section 2, the use of automatic called vortex shedding. Though vortex shedding makes the flow differentiation for temporal and spatial derivatives results in a huge always unsteady, after a certain time, the flow reaches a periodic computational graph. In addition, the PINN solver uses gradient- stage and the flow pattern repeats after a certain period. descent based method, which is a first-order method and limits the The Navier-Stokes equations can be deemed as a dynamical performance. system. Instability appears in the flow under some flow conditions Weak scaling is a natural choice of the PINN solver when it and responds to small perturbations, causing the vortex shedding. comes to distributed computing. As we don’t know a proper way In nature, the vortex shedding comes from the uncertainty and to define workload, simply copying all model parameters to all perturbation existing everywhere. In CFD simulations, the vortex processes and using the same number of training points on all shedding is caused by small numerical and rounding errors in processes works well. calculations. Interested readers should consult reference [Wil]. 4. Case 2: 2D cylinder flows: harder than we thought 4.2. Solver and runtime configurations This case study shows what really made us frustrated: a 2D For the PINN solver, we tested with two networks. Both were cylinder flow at Reynolds number Re = 200. We failed to even fully-connected neural networks: one with 256 neurons per layer, produce a solution that qualitatively captures the key physical while the other one with 512 neurons per layer. All other net- phenomenon of this flow: vortex shedding. work configurations were the same as those in section 3, except we allowed human intervention to manually adjust the learning 4.1. Problem description rates during training. Our intention for this case study was to The computational domain is [−8, 25] × [−8, 8], and a cylinder successfully obtain physical solutions from the PINN solver, with a radius of 0.5 sits at coordinate (0, 0). The velocity boundary rather than conducting a performance and accuracy benchmark. conditions are (u, v) = (1, 0) along x = −8, y = −8, and y = 8. On Therefore, we would adjust the learning rate to accelerate the the cylinder surface is the no-slip condition, i.e., (u, v) = (0, 0). convergence or to escape from local minimums. This decision was At the outlet (x = 25), we enforced a pressure boundary condition in line with common machine learning practice. We did not carry p = 0. The initial condition is (u, v) = (0, 0). Note that this initial out hyperparameter optimization. These parameters were chosen condition is different from most traditional CFD simulations. because they work in Modulus’ examples and in the Taylor-Green Conventionally, CFD simulations use (u, v) = (1, 0) for cylinder vortex experiment. flows. A uniform initial condition of u = 1 does not satisfy The PINN solver pre-generated 40, 960, 000 spatial-temporal the Navier-Stokes equations due to the no-slip boundary on the points from a spatial domain in [−8, 25] × [−8, 8] and temporal cylinder surface. Conventional CFD solvers are usually able to domain (0, 200] to evaluate residuals of the Navier-Stokes equa- correct the solution during time-marching by propagating bound- tions, and used 40, 960 points per iteration. The number of pre- ary effects into the domain through numerical schemes’ stencils. generated points for the initial condition was 2, 048, 000, and the In our experience, using u = 1 or u = 0 did not matter for PINN per-iteration number is 2, 048. On each boundary, the numbers of because both did not give reasonable results. Nevertheless, the pre-generated and per-iteration points are 8,192,000 and 8,192. PINN solver’s results shown in this section were obtained using a Both cases used 8 A100 GPUs, which scaled these numbers up uniform u = 0 for the initial condition. with a factor of 8. For example, during each iteration, a total of The density, ρ, is one, and the kinematic viscosity is ν = 327, 680 points were actually used to evaluate the Navier-Stokes 0.005. These parameters correspond to Reynolds number Re = equations’ residuals. Both cases ran up to 64 hours in wall time. 34 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 7: Training history of the 2D cylinder flow at Re = 200. One PetIBM simulation was carried out as a baseline. This simulation had a spatial resolution of 1485 × 720, and the time step size is 0.005. Figure 6 was rendered using this simulation. The hardware used was 1 K40 GPU plus 6 cores of i7-5930K Fig. 8: Velocity and vorticity at t = 200 from PINN. CPU. It took about 1.7 hours to finish. The quantity of interest is the drag coefficient. We consider both the friction drag and pressure drag in the coefficient calcula- tion as follows: 2 Z ∂ U~ ·~t CD = ρν ny − pnx dS (11) ρU02 D ∂~n S Here, U0 = 1 is the inlet velocity. ~n = [nx , ny ]T and ~t = [ny , −nx ]T are the normal and tangent vectors, respectively. S represents the cylinder surface. The theoretical lift coefficient (CL ) for this flow is zero due to the symmetrical geometry. 4.3. Results Note, as stated in section 3.4, we deem the results as experimental data under a specific experiment configuration. Hence, we do not claim that the results and qualitative conclusions will apply to Fig. 9: Drag and lift coefficients with respect to t other hyperparameter configuration. Figure 7 shows the convergence history. The bumps in the practice. Our viewpoints may be subjective, and hence we leave history correspond to our manual adjustment of the learning rates. them here in the discussion. After 64 hours of training, the total loss had not converged to an Allow us to start this discussion with a hypothetical situation. obvious steady value. However, we decided not to continue the If one asks why we chose such a spatial and temporal resolution training because, as later results will show, it is our judgment call for a conventional CFD simulation, we have mathematical or that the results would not be correct even if the training converged. physical reasons to back our decision. However, if the person asks Figure 8 provides a visualization of the predicted velocity why we chose 6 hidden layers and 256 neurons per layer, we will and vorticity at t = 200. And in figure 9 are the drag and lift not be able to justify it. "It worked in another case!" is probably the coefficients versus simulation time. From both figures, we couldn’t best answer we can offer. The situation also indicates that we have see any sign of vortex shedding with the PINN solver. systematic approaches to improve a conventional simulation but We provide a comparison against the values reported by others can only improve PINN’s results through computer experiments. in table 2. References [GS74] and [For80] calculate the drag Most traditional numerical methods have rigorous analytical coefficients using steady flow simulations, which were popular derivations and analyses. Each parameter used in a scheme has decades ago because of their inexpensive computational costs. a meaning or a purpose in physical or numerical aspects. The The actual flow is not a steady flow, and these steady-flow simplest example is the spatial resolution in the finite difference coefficient values are lower than unsteady-flow predictions. The method, which controls the truncation errors in derivatives. Or, drag coefficient from the PINN solver is closer to the steady-flow predictions. Unsteady simulations Steady simulations 4.4. Discussion PetIBM PINN [DSY07] [RKM09] [GS74] [For80] While researchers may be interested in why the PINN solver 1.38 0.95 1.25 1.34 0.97 0.83 behaves like a steady flow solver, in this section, we would like to focus more on the user experience and the usability of PINN in TABLE 2: Comparison of drag coefficients, CD EXPERIENCE REPORT OF PHYSICS-INFORMED NEURAL NETWORKS IN FLUID SIMULATIONS: PITFALLS AND FRUSTRATION 35 the choice of the limiters in finite volume methods, used to inhibit CFD solvers. The literature shows researchers have shifted their the oscillation in solutions. So when a conventional CFD solver attention to hybrid-mode applications. For example, in [JEA+ 20], produces unsatisfying or even non-physical results, practitioners the authors combined the concept of PINN and a traditional CFD usually have systematic approaches to identify the cause or solver to train a model that takes in low-resolution CFD simulation improve the outcomes. Moreover, when necessary, practitioners results and outputs high-resolution flow fields. know how to balance the computational cost and the accuracy, For people with a strong background in numerical methods or which is a critical point for using computer-aided engineering. CFD, we would suggest trying to think out of the box. During Engineering always concerns the costs and outcomes. our work, we realized our mindset and ideas were limited by what On the other hand, the PINN method lacks well-defined we were used to in CFD. An example is the initial conditions. procedures to control the outcome. For example, we know the We were used to only having one set of initial conditions when numbers of neurons and layers control the degrees of freedom in a the temporal derivative in differential equations is only first-order. model. With more degrees of freedom, a neural network model can However, in PINN, nothing limits us from using more than one approximate a more complicated phenomenon. However, when we initial condition. We can generate results at t = 0, 1, . . . ,tn using feel that a neural network is not complicated enough to capture a a traditional CFD solver and add the residuals corresponding to physical phenomenon, what strategy should we use to adjust the these time snapshots to the total residual, so the PINN method neurons and layers? Should we increase neurons or layers first? may perform better in predicting t > tn . In other words, the PINN By how much? solver becomes the traditional CFD solvers’ replacement only for Moreover, when it comes to something non-numeric, it is even t > tn ([noa]). more challenging to know what to use and why to use it. For As discussed in [THM+ ], solving partial differential equations instance, what activation function should we use and why? Should with deep learning is still a work-in-progress. It may not work in we use the same activation everywhere? Not to mention that we many situations. Nevertheless, it does not mean we should stay are not yet even considering a different network architecture here. away from PINN and discard this idea. Stepping away from a new Ultimately, are we even sure that increasing the network’s thing gives zero chance for it to evolve, and we will never know complexity is the right path? Our assumption that the network if PINN can be improved to a mature state that works well. Of is not complicated enough may just be wrong. course, overly promoting its bright side with only success stories The following situation happened in this case study. Before does not help, either. Rather, we should honestly face all troubles, we realized the PINN solver behaved like a steady-flow solver, we difficulties, and challenges. Knowing the problem is the first step attributed the cause to model complexity. We faced the problem to solving it. of how to increase the model complexity systematically. Theoret- ically, we could follow the practice of the design of experiments Acknowledgements (e.g., through grid search or Taguchi methods). However, given the computational cost and the number of hyperparameters/options of We appreciate the support by NVIDIA, through sponsoring the PINN, a proper design of experiments is not affordable for us. access to its high-performance computing cluster. Furthermore, the design of experiments requires the outcome to change with changes in inputs. In our case, the vortex shedding R EFERENCES remains absent regardless of how we changed hyperparameters. [Chu22] Pi-Yueh Chuang. barbagroup/scipy-2022-repro-pack: Let us move back to the flow problem to conclude this 20220530, May 2022. URL: https://doi.org/10.5281/zenodo. case study. The model complexity may not be the culprit here. 6592457, doi:10.5281/zenodo.6592457. Vortex shedding is the product of the dynamical systems of the [CMKAB18] Pi-Yueh Chuang, Olivier Mesnard, Anush Krishnan, and Lorena Navier-Stokes equations and the perturbations from numerical A. Barba. PetIBM: toolbox and applications of the immersed- boundary method on distributed-memory architectures. Journal calculations (which implicitly mimic the perturbations in nature). of Open Source Software, 3(25):558, May 2018. URL: http:// Suppose the PINN solver’s prediction was the steady-state solution joss.theoj.org/papers/10.21105/joss.00558, doi:10.21105/ to the flow. We may need to introduce uncertainties and perturba- joss.00558. tions in the neural network or the training data, such as a perturbed [CMW+ ] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed neural net- initial condition described in [LD15]. As for why PINN predicts works (PINNs) for fluid mechanics: a review. 37(12):1727– the steady-state solution, we cannot answer it currently. 1738. URL: https://link.springer.com/10.1007/s10409-021- 01148-1, doi:10.1007/s10409-021-01148-1. [DPT] M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network- 5. Further discussion and conclusion based approximations for solving partial differential equations. 10(3):195–201. URL: https://onlinelibrary.wiley.com/doi/10. Because of the widely available deep learning libraries, such as 1002/cnm.1640100303, doi:10.1002/cnm.1640100303. PyTorch, and the ease of Python, implementing a PINN solver is [DSY07] Jian Deng, Xue-Ming Shao, and Zhao-Sheng Yu. Hydro- dynamic studies on two traveling wavy foils in tandem relatively more straightforward nowadays. This may be one reason arrangement. Physics of Fluids, 19(11):113104, Novem- why the PINN method suddenly became so popular in recent ber 2007. URL: http://aip.scitation.org/doi/10.1063/1.2814259, years. This paper does not intend to discourage people from trying doi:10.1063/1.2814259. the PINN method. Instead, we share our failures and frustration [DZ] Yifan Du and Tamer A. Zaki. Evolutional deep neural network. 104(4):045303. URL: https://link. using PINN so that interested readers may know what immediate aps.org/doi/10.1103/PhysRevE.104.045303, doi:10.1103/ challenges should be resolved for PINN. PhysRevE.104.045303. Our paper is limited to using the PINN solver as a replacement [For80] Bengt Fornberg. A numerical study of steady for traditional CFD solvers. However, as the first section indicates, viscous flow past a circular cylinder. Journal of Fluid Mechanics, 98(04):819, June 1980. URL: http: PINN can do more than solving one specific flow under specific //www.journals.cambridge.org/abstract_S0022112080000419, flow parameters. Moreover, PINN can also work with traditional doi:10.1017/S0022112080000419. 36 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [FS] William E. Faller and Scott J. Schreck. Unsteady fluid mechan- [THM+ ] Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick ics applications of neural networks. 34(1):48–55. URL: http: Schnell, Felix Trost, and Kiwon Um. Physics-based deep //arc.aiaa.org/doi/10.2514/2.2134, doi:10.2514/2.2134. learning. Number: arXiv:2109.05237. URL: http://arxiv.org/ [GS74] V.A. Gushchin and V.V. Shchennikov. A numerical method abs/2109.05237, arXiv:2109.05237[physics]. of solving the navier-stokes equations. USSR Computa- [Tre] Lloyd N. Trefethen. Spectral Methods in MATLAB. Soft- tional Mathematics and Mathematical Physics, 14(2):242–250, ware, environments, tools. Society for Industrial and Applied January 1974. URL: https://linkinghub.elsevier.com/retrieve/ Mathematics. URL: http://epubs.siam.org/doi/book/10.1137/1. pii/0041555374900615, doi:10.1016/0041-5553(74) 9780898719598, doi:10.1137/1.9780898719598. 90061-5. [Wil] C. H. K. Williamson. Vortex dynamics in the [Hao] Karen Hao. AI has cracked a key mathematical puzzle for cylinder wake. 28(1):477–539. URL: http://www. understanding our world. URL: https://www.technologyreview. annualreviews.org/doi/10.1146/annurev.fl.28.010196.002401, com/2020/10/30/1011435/ai-fourier-neural-network-cracks- doi:10.1146/annurev.fl.28.010196.002401. navier-stokes-and-partial-differential-equations/. [WTP] Sifan Wang, Yujun Teng, and Paris Perdikaris. Under- [HG] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units standing and mitigating gradient flow pathologies in physics- (GELUs). Publisher: arXiv Version Number: 4. URL: https:// informed neural networks. 43(5):A3055–A3081. URL: https: arxiv.org/abs/1606.08415, doi:10.48550/ARXIV.1606. //epubs.siam.org/doi/10.1137/20M1318043, doi:10.1137/ 08415. 20M1318043. [WYP] Sifan Wang, Xinling Yu, and Paris Perdikaris. When [Hor] Kurt Hornik. Approximation capabilities of multilayer feedfor- and why PINNs fail to train: A neural tangent ward networks. 4(2):251–257. URL: https://linkinghub.elsevier. kernel perspective. 449:110768. URL: https: com/retrieve/pii/089360809190009T, doi:10.1016/0893- //linkinghub.elsevier.com/retrieve/pii/S002199912100663X, 6080(91)90009-T. doi:10.1016/j.jcp.2021.110768. [JEA+ 20] Chiyu “Max” Jiang, Soheil Esmaeilzadeh, Kamyar Aziz- zadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. Meshfreeflownet: A physics-constrained deep continuous space- time super-resolution framework. In SC20: International Con- ference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2020. doi:10.1109/SC41405. 2020.00013. [KDYI] Hasan Karali, Umut M. Demirezen, Mahmut A. Yukselen, and Gokhan Inalhan. A novel physics informed deep learning method for simulation-based modelling. In AIAA Scitech 2021 Forum. American Institute of Aeronautics and Astronautics. URL: https://arc.aiaa.org/doi/10.2514/6.2021-0177, doi:10. 2514/6.2021-0177. [LD15] Mouna Laroussi and Mohamed Djebbi. Vortex Shedding for Flow Past Circular Cylinder: Effects of Initial Conditions. Universal Journal of Fluid Mechanics, 3:19–32, 2015. [LLF] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neu- ral networks for solving ordinary and partial differential equations. 9(5):987–1000. URL: http://ieeexplore.ieee.org/ document/712178/, arXiv:physics/9705023, doi:10. 1109/72.712178. [LLQH] Jianyu Li, Siwei Luo, Yingjian Qi, and Yaping Huang. Numer- ical solution of elliptic partial differential equation using radial basis function neural networks. 16(5):729–734. URL: https: //linkinghub.elsevier.com/retrieve/pii/S0893608003000832, doi:10.1016/S0893-6080(03)00083-2. [LMMK] Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. DeepXDE: A deep learning library for solving differential equations. 63(1):208–228. URL: https://epubs.siam.org/doi/10. 1137/19M1274067, doi:10.1137/19M1274067. [LS] Dennis J. Linse and Robert F. Stengel. Identification of aerodynamic coefficients using computational neural networks. 16(6):1018–1025. Publisher: Springer US, Place: Boston, MA. URL: http://link.springer.com/10.1007/0-306-48610-5_9, doi:10.2514/3.21122. [noa] Modulus. URL: https://docs.nvidia.com/deeplearning/modulus/ index.html. [RKM09] B.N. Rajani, A. Kandasamy, and Sekhar Majumdar. Nu- merical simulation of laminar flow past a circular cylin- der. Applied Mathematical Modelling, 33(3):1228–1247, March 2009. arXiv: DOI: 10.1002/fld.1 Publisher: Elsevier Inc. ISBN: 02712091 10970363. URL: http://dx.doi.org/10.1016/j.apm. 2008.01.017, doi:10.1016/j.apm.2008.01.017. [RPK] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics- informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. 378:686–707. URL: https: //linkinghub.elsevier.com/retrieve/pii/S0021999118307125, doi:10.1016/j.jcp.2018.10.045. [SS] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. 375:1339–1364. URL: https: //linkinghub.elsevier.com/retrieve/pii/S0021999118305527, doi:10.1016/j.jcp.2018.08.029. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 37 atoMEC: An open-source average-atom Python code Timothy J. Callow‡§∗ , Daniel Kotik‡§ , Eli Kraisler¶ , Attila Cangi‡§ F Abstract—Average-atom models are an important tool in studying matter under methods are often denoted as "first-principles" because, formally extreme conditions, such as those conditions experienced in planetary cores, speaking, they yield the exact properties of the system, under cer- brown and white dwarfs, and during inertial confinement fusion. In the right tain well-founded theoretical approximations. Density-functional context, average-atom models can yield results with similar accuracy to simu- theory (DFT), initially developed as a ground-state theory [HK64], lations which require orders of magnitude more computing time, and thus can [KS65] but later extended to non-zero temperatures [Mer65], greatly reduce financial and environmental costs. Unfortunately, due to the wide range of possible models and approximations, and the lack of open-source [PPF+ 11], is one such theory and has been used extensively to codes, average-atom models can at times appear inaccessible. In this paper, we study materials under WDM conditions [GDRT14]. Even though present our open-source average-atom code, atoMEC. We explain the aims and DFT reformulates the Schrödinger equation in a computationally structure of atoMEC to illuminate the different stages and options in an average- efficient manner [Koh99], the cost of running calculations be- atom calculation, and to facilitate community contributions. We also discuss the comes prohibitively expensive at higher temperatures. Formally, use of various open-source Python packages in atoMEC, which have expedited it scales as O(N 3 τ 3 ), with N the particle number (which usually its development. also increases with temperature) and τ the temperature [CRNB18]. This poses a serious computational challenge in the WDM regime. Index Terms—computational physics, plasma physics, atomic physics, materi- Furthermore, although DFT is a formally exact theory, in prac- als science tice it relies on approximations for the so-called "exchange- correlation" energy, which is, roughly speaking, responsible for Introduction simulating all the quantum interactions between electrons. Exist- ing exchange-correlation approximations have not been rigorously The study of matter under extreme conditions — materials tested under WDM conditions. An alternative method used in exposed to high temperatures, high pressures, or strong elec- the WDM community is path-integral Monte–Carlo [DGB18], tromagnetic fields — is critical to our understanding of many which yields essentially exact properties; however, it is even more important scientific and technological processes, such as nuclear limited by computational cost than DFT, and becomes unfeasibly fusion and various astrophysical and planetary physics phenomena expensive at lower temperatures due to the fermion sign problem. [GFG+ 16]. Of particular interest within this broad field is the It is therefore of great interest to reduce the computational warm dense matter (WDM) regime, which is typically character- complexity of the aforementioned methods. The use of graphics ized by temperatures in the range of 103 − 106 degrees (Kelvin), processing units in DFT calculations is becomingly increasingly and densities ranging from dense gases to highly compressed common, and has been shown to offer significant speed-ups solids (∼ 0.01 − 1000 g cm−3 ) [BDM+ 20]. In this regime, it is relative to conventional calculations using central processing units important to account for the quantum mechanical nature of the [MED11], [JFC+ 13]. Some other examples of promising develop- electrons (and in some cases, also the nuclei). Therefore conven- ments to reduce the cost of DFT calculations include machine- tional methods from plasma physics, which either neglect quantum learning-based solutions [SRH+ 12], [BVL+ 17], [EFP+ 21] and effects or treat them coarsely, are usually not sufficiently accurate. stochastic DFT [CRNB18], [BNR13]. However, in this paper, On the other hand, methods from condensed-matter physics and we focus on an alternative class of models known as "average- quantum chemistry, which account fully for quantum interactions, atom" models. Average-atom models have a long history in plasma typically target the ground-state only, and become computationally physics [CHKC22]: they account for quantum effects, typically intractable for systems at high temperatures. using DFT, but reduce the complex system of interacting electrons Nevertheless, there are methods which can, in principle, be and nuclei to a single atom immersed in a plasma (the "average" applied to study materials at any given temperature and den- atom). An illustration of this principle (reduced to two dimensions sity whilst formally accounting for quantum interactions. These for visual purposes) is shown in Fig. 1. This significantly reduces * Corresponding author: t.callow@hzdr.de the cost relative to a full DFT simulation, because the particle ‡ Center for Advanced Systems Understanding (CASUS), D-02826 Görlitz, number is restricted to the number of electrons per nucleus, and Germany spherical symmetry is exploited to reduce the three-dimensional § Helmholtz-Zentrum Dresden-Rossendorf, D-01328 Dresden, Germany ¶ Fritz Haber Center for Molecular Dynamics and Institute of Chemistry, The problem to one dimension. Hebrew University of Jerusalem, 9091401 Jerusalem, Israel Naturally, to reduce the complexity of the problem as de- scribed, various approximations must be introduced. It is im- Copyright © 2022 Timothy J. Callow et al. This is an open-access article portant to understand these approximations and their limitations distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, for average-atom models to have genuine predictive capabilities. provided the original author and source are credited. Unfortunately, this is not always the case: although average-atom 38 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Theoretical background Properties of interest in the warm dense matter regime include the equation-of-state data, which is the relation between the density, energy, temperature and pressure of a material [HRD08]; the mean ionization state and the electron ionization energies, which tell us about how tightly bound the electrons are to the nuclei; and the electrical and thermal conductivities. These properties yield information pertinent to our understanding of stellar and planetary physics, the Earth’s core, inertial confinement fusion, and more besides. To exactly obtain these properties, one needs (in theory) to determine the thermodynamic ensemble of the quantum states (the so-called wave-functions) representing the electrons and nuclei. Fig. 1: Illustration of the average-atom concept. The many-body Fortunately, they can be obtained with reasonable accuracy using and fully-interacting system of electron density (shaded blue) and models such as average-atom models; in this section, we elaborate nuclei (red points) on the left is mapped into the much simpler system of independent atoms on the right. Any of these identical on how this is done. atoms represents the "average-atom". The effects of interaction from We shall briefly review the key theory underpinning the type of neighboring atoms are implicitly accounted for in an approximate average-atom model implemented in atoMEC. This is intended for manner through the choice of boundary conditions. readers without a background in quantum mechanics, to give some context to the purposes and mechanisms of the code. For a compre- hensive derivation of this average-atom model, we direct readers to Ref. [CHKC22]. The average-atom model we shall describe models share common concepts, there is no unique formal theory falls into a class of models known as ion-sphere models, which underpinning them. Therefore a variety of models and codes exist, are the simplest (and still most widely used) class of average-atom and it is not typically clear which models can be expected to model. There are alternative (more advanced) classes of model perform most accurately under which conditions. In a previous such as ion-correlation [Roz91] and neutral pseudo-atom models paper [CHKC22], we addressed this issue by deriving an average- [SS14] which we have not yet implemented in atoMEC, and thus atom model from first principles, and comparing the impact of we do not elaborate on them here. different approximations within this model on some common As demonstrated in Fig. 1, the idea of the ion-sphere model properties. is to map a fully-interacting system of many electrons and In this paper, we focus on computational aspects of average- nuclei into a set of independent atoms which do not interact atom models for WDM. We introduce atoMEC [CKTS+ 21]: explicitly with any of the other spheres. Naturally, this depends an open-source average-atom code for studying Matter under on several assumptions and approximations, but there is formal Extreme Conditions. One of the main aims of atoMEC is to im- justification for such a mapping [CHKC22]. Furthermore, there prove the accessibility and understanding of average-atom models. are many examples in which average-atom models have shown To the best of our knowledge, open-source average-atom codes good agreement with more accurate simulations and experimental are in scarce supply: with atoMEC, we aim to provide a tool that data [FB19], which further justifies this mapping. people can use to run average-atom simulations and also to add Although the average-atom picture is significantly simplified their own models, which should facilitate comparisons of different relative to the full many-body problem, even determining the approximations. The relative simplicity of average-atom codes wave-functions and their ensemble weights for an atom at finite means that they are not only efficient to run, but also efficient temperature is a complex problem. Fortunately, DFT reduces this to develop: this means, for example, that they can be used as a complexity further, by establishing that the electron density — a test-bed for new ideas that could be later implemented in full DFT far less complex entity than the wave-functions — is sufficient to codes, and are also accessible to those without extensive prior determine all physical observables. The most popular formulation expertise, such as students. atoMEC aims to facilitate development of DFT, known as Kohn–Sham DFT (KS-DFT) [KS65], allows us by following good practice in software engineering (for example to construct the fully-interacting density from a non-interacting extensive documentation), a careful design structure, and of course system of electrons, simplifying the problem further still. Due to through the choice of Python and its widely used scientific stack, the spherical symmetry of the atom, the non-interacting electrons in particular the NumPy [HMvdW+ 20] and SciPy [VGO+ 20] — known as KS electrons (or KS orbitals) — can be represented libraries. as a wave-function that is a product of radial and angular compo- nents, This paper is structured as follows: in the next section, we briefly review the key theoretical points which are important φnlm (r) = Xnl (r)Ylm (θ , φ ) , (1) to understand the functionality of atoMEC, assuming no prior where n, l, and m are the quantum numbers of the orbitals, which physical knowledge of the reader. Following that, we present come from the fact that the wave-function is an eigenfunction of the key functionality of atoMEC, discuss the code structure the Hamiltonian operator, and Ylm (θ , φ ) are the spherical harmonic and algorithms, and explain how these relate to the theoretical aspects introduced. Finally, we present an example case study: functions.1 The radial coordinate r represents the absolute distance we consider helium under the conditions often experienced in from the nucleus. the outer layers of a white dwarf star, and probe the behavior 1. Please note that the notation in Eq. (1) does not imply Einstein sum- of a few important properties, namely the band-gap, pressure, and mation notation. All summations in this paper are written explicitly; Einstein ionization degree. summation notation is not used. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 39 We therefore only need to determine the radial KS orbitals energy required to excite an electron bound to the nucleus to being Xnl (r). These are determined by solving the radial KS equation, a free (conducting) electron. These predicted ionization energies which is similar to the Schrödinger equation for a non-interacting can be used, for example, to help understand ionization potential system, with an additional term in the potential to mimic the depression, an important but somewhat controversial effect in effects of electron-electron interaction (within the single atom). WDM [STJ+ 14]. Another property that can be straightforwardly The radial KS equation is given by: obtained from the energy levels and their occupation numbers is 2 the mean ionization state Z̄ 2 , d 2 d l(l + 1) − + − + vs [n](r) Xnl (r) = εnl Xnl (r). (2) dr2 r dr r2 Z̄ = ∑(2l + 1) fnl (εnl , µ, τ) (6) n,l We have written the above equation in a way that emphasizes that it is an eigenvalue equation, with the eigenvalues εnl being the which is an important input parameter for various models, such energies of the KS orbitals. as adiabats which are used to model inertial confinement fusion On the left-hand side, the terms in the round brackets come [KDF+ 11]. from the kinetic energy operator acting on the orbitals. The vs [n](r) Various other interesting properties can also be calculated term is the KS potential, which itself is composed of three different following some post-processing of the output of an SCF cal- terms, culation, for example the pressure exerted by the electrons and Z RWS ions. Furthermore, response properties, i.e. those resulting from Z n(x)x2 δ Fxc [n] an external perturbation like a laser pulse, can also be obtained vs [n](r) = − + 4π dx + , (3) r 0 max(r, x) δ n(r) from the output of an SCF cycle. These properties include, for where RWS is the radius of the atomic sphere, n(r) is the electron example, electrical conductivities [Sta16] and dynamical structure density, Z the nuclear charge, and Fxc [n] the exchange-correlation factors [SPS+ 14]. free energy functional. Thus the three terms in the potential are respectively the electron-nuclear attraction, the classical Hartree Code structure and details repulsion, and the exchange-correlation (xc) potential. In the following sections, we describe the structure of the code We note that the KS potential and its constituents are function- in relation to the physical problem being modeled. Average-atom als of the electron density n(r). Were it not for this dependence models typically rely on various parameters and approximations. on the density, solving Eq. 2 just amounts to solving an ordinary In atoMEC, we have tried to structure the code in a way that makes linear differential equation (ODE). However, the electron density clear which parameters come from the physical problem studied is in fact constructed from the orbitals in the following way, compared to choices of the model and numerical or algorithmic n(r) = 2 ∑(2l + 1) fnl (εnl , µ, τ)|Xnl (r)|2 , (4) choices. nl atoMEC.Atom: Physical parameters where fnl (εnl , µ, τ) is the Fermi–Dirac distribution, given by The first step of any simulation in WDM (which also applies to 1 simulations in science more generally) is to define the physical fnl (εnl , µ, τ) = , (5) 1 + e(εnl −µ)/τ parameters of the problem. These parameters are unique in the where τ is the temperature, and µ is the chemical potential, which sense that, if we had an exact method to simulate the real system, is determined by fixing the number of electrons to be equal to then for each combination of these parameters there would be a a pre-determined value Ne (typically equal to the nuclear charge unique solution. In other words, regardless of the model — be Z). The Fermi–Dirac distribution therefore assigns weights to the it average atom or a different technique — these parameters are KS orbitals in the construction of the density, with the weight always required and are independent of the model. depending on their energy. In average-atom models, there are typically three parameters Therefore, the KS potential that determines the KS orbitals via defining the physical problem, which are: the ODE (2), is itself dependent on the KS orbitals. Consequently, • the atomic species; the KS orbitals and their dependent quantities (the density and • the temperature of the material, τ; KS potential) must be determined via a so-called self-consistent • the mass density of the material, ρm . field (SCF) procedure. An initial guess for the orbitals, Xnl0 (r), is used to construct the initial density n0 (r) and potential v0s (r). The mass density also directly corresponds to the mean dis- The ODE (2) is then solved to update the orbitals. This process is tance between two nuclei (atomic centers), which in the average- iterated until some appropriately chosen quantities — in atoMEC atom model is equal to twice the radius of the atomic sphere, RWS . the total free energy, density and KS potential — are converged, An additional physical parameter not mentioned above is the net i.e. ni+1 (r) = ni (r), vi+1 i i+1 = F i , within some charge of the material being considered, i.e. the difference be- s (r) = vs (r), F reasonable numerical tolerance. In Fig. 2, we illustrate the life- tween the nuclear charge Z and the electron number Ne . However, cycle of the average-atom model described so far, including the we usually assume zero net charge in average-atom simulations SCF procedure. On the left-hand side of this figure, we show the (i.e. the number of electrons is equal to the atomic charge). physical choices and mathematical operations, and on the right- In atoMEC, these physical parameters are controlled by the hand side, the representative classes and functions in atoMEC. In Atom object. As an example, we consider aluminum under ambi- the following section, we shall discuss some aspects of this figure ent conditions, i.e. at room temperature, τ = 300 K, and normal in more detail. metallic density, ρm = 2.7 g cm−3 . We set this up as: Some quantities obtained from the completion of the SCF pro- 2. The summation in Eq. (6) is often shown as an integral because the cedure are directly of interest. For example, the energy eigenvalues energies above a certain threshold form a continuous distribution (in most εnl are related to the electron ionization energies, i.e. the amount of models). 40 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Schematic of the average-atom model set-up and the self-consistent field (SCF) cycle. On the left-hand side, the physical choices and mathematical operations that define the model and SCF cycle are shown. On the right-hand side, the (higher-order) functions and classes in atoMEC corresponding to the items on the left-hand side are shown. Some liberties are taken with the code snippets in the right-hand column of the figure to improve readability; more precisely, some non-crucial intermediate steps are not shown, and some parameters are also not shown or simplified. The dotted lines represent operations that are taken care of within the models.CalcEnergy function, but are shown nevertheless to improve understanding. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 41 Fig. 4: Auto-generated print statement from calling the models.ISModel object. Fig. 3: Auto-generated print statement from calling the atoMEC.Atom object. with a "quantum" treatment of the unbound electrons, and choose the LDA exchange functional (which is also the default). This from atoMEC import Atom Al = Atom("Al", 300, density=2.7, units_temp="K") model is set up as: from atoMEC import models By default, the above code automatically prints the output seen model = models.ISModel(Al, bc="neumann", in Fig. 3. We see that the first two arguments of the Atom object xfunc_id="lda_x", unbound="quantum") are the chemical symbol of the element being studied, and the By default, the above code prints the output shown in Fig. temperature. In addition, at least one of "density" or "radius" must 4. The first (and only mandatory) input parameter to the be specified. In atoMEC, the default (and only permitted) units for models.ISModel object is the Atom object that we generated the mass density are g cm−3 ; all other input and output units in earlier. Together with the optional spinpol and spinmag atoMEC are by default Hartree atomic units, and hence we specify parameters in the models.ISModel object, this sets either the "K" for Kelvin. total number of electrons (spinpol=False) or the number of The information in Fig. 3 displays the chosen parameters in electrons in each spin channel (spinpol=True). units commonly used in the plasma and condensed-matter physics The remaining information displayed in Fig. 4 shows directly communities, as well as some other information directly obtained the chosen model parameters, or the default values where these from these parameters. The chemical symbol ("Al" in this case) parameters are not specified. The exchange and correlation func- is passed to the mendeleev library [men14] to generate this data, tionals - set by the parameters xfunc_id and cfunc_id - are which is used later in the calculation. passed to the LIBXC library [LSOM18] for processing. So far, This initial stage of the average-atom calculation, i.e. the only the "local density" family of approximations is available specification of physical parameters and initialization of the Atom in atoMEC, and thus the default values are usually a sensible object, is shown in the top row at the top of Fig. 2. choice. For more information on exchange and correlation func- atoMEC.models: Model parameters tionals, there are many reviews in the literature, for example Ref. [CMSY12]. After the physical parameters are set, the next stage of the average- This stage of the average-atom calculation, i.e. the specifica- atom calculation is to choose the model and approximations within tion of the model and the choices of approximation within that, is that class of model. As discussed, so far the only class of model shown in the second row of Fig. 2. implemented in atoMEC is the ion-sphere model. Within this model, there are still various choices to be made by the user. ISModel.CalcEnergy: SCF calculation and numerical parameters In some cases, these choices make little difference to the results, Once the physical parameters and model have been defined, the but in other cases they have significant impact. The user might next stage in the average-atom calculation (or indeed any DFT have some physical intuition as to which is most important, or calculation) is the SCF procedure. In atoMEC, this is invoked alternatively may want to run the same physical parameters with by the ISModel.CalcEnergy function. This function is called several different model parameters to examine the effects. Some CalcEnergy because it finds the KS orbitals (and associated KS choices available in atoMEC, listed approximately in decreasing density) which minimize the total free energy. order of impact (but this can depend strongly on the system under Clearly, there are various mathematical and algorithmic consideration), are: choices in this calculation. These include, for example: the basis in • the boundary conditions used to solve the KS equations; which the KS orbitals and potential are represented, the algorithm • the treatment of the unbound electrons, which means used to solve the KS equations (2), and how to ensure smooth those electrons not tightly bound to the nucleus, but rather convergence of the SCF cycle. In atoMEC, the SCF procedure delocalized over the whole atomic sphere; currently follows a single pre-determined algorithm, which we • the choice of exchange and correlation functionals, the briefly review below. central approximations of DFT [CMSY12]; In atoMEC, we represent the radial KS quantities (orbitals, • the spin polarization and magnetization. density and potential) on a logarithmic grid, i.e. x = log(r). Furthermore, we make a transformation of the orbitals Pnl (x) = We do not discuss the theory and impact of these different Xnl (x)ex/2 . Then the equations to be solved become: choices in this paper. Rather, we direct readers to Refs. [CHKC22] and [CKC22] in which all of these choices are discussed. d2 Pnl (x) − 2e2x (W (x) − εnl )Pnl (x) = 0 (7) In atoMEC, the ion-sphere model is controlled by the dx2 models.ISModel object. Continuing with our aluminum ex- 1 1 2 −2x ample, we choose the so-called "neumann" boundary condition, W (x) = vs [n](x) + l+ e . (8) 2 2 42 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) In atoMEC, we solve the KS equations using a matrix imple- a unique set of physical and model inputs — these parameters mentation of Numerov’s algorithm [PGW12]. This means we should be independently varied until some property (such as the diagonalize the following equation: total free energy) is considered suitably converged with respect to that parameter. Changing the SCF parameters should not affect the Ĥ ~P = ~ε B̂~P , where (9) final results (within the convergence tolerances), only the number Ĥ = T̂ + B̂ +Ws (~x) , (10) of iterations in the SCF cycle. 1 Let us now consider an example SCF calculation, using the T̂ = − e−2~x  , (11) 2 Atom and model objects we have already defined: Iˆ−1 − 2Iˆ0 + Iˆ1  = , and (12) from atoMEC import config dx2 config.numcores = -1 # parallelize Iˆ−1 + 10Iˆ0 + Iˆ1 B̂ = , (13) 12 nmax = 3 # max value of principal quantum number lmax = 3 # max value of angular quantum number In the above, Iˆ−1/0/1 are lower shift, identify, and upper shift matrices. # run SCF calculation The Hamiltonian matrix Ĥ is sparse and we only seek a subset scf_out = model.CalcEnergy( nmax, of eigenstates with lower energies: therefore there is no need to lmax, perform a full diagonalization, which scales as O(N 3 ), with N grid_params={"ngrid": 1500}, being the size of the radial grid. Instead, we use SciPy’s sparse ma- scf_params={"mixfrac": 0.7}, ) trix diagonalization function scipy.sparse.linalg.eigs, which scales more efficiently and allows us to go to larger grid We see that the first two parameters passed to the CalcEnergy sizes. function are the nmax and lmax quantum numbers, which specify After each step in the SCF cycle, the relative changes in the the number of eigenstates to compute. Precisely speaking, there free energy F, density n(r) and potential vs (r) are computed. is a unique Hamiltonian for each value of the angular quantum Specifically, the quantities computed are number l (and in a spin-polarized calculation, also for each F i − F i−1 spin quantum number). The sparse diagonalization routine then ∆F = (14) computes the first nmax eigenvalues for each Hamiltonian. In Fi R atoMEC, these diagonalizations can be run in parallel since they dr|ni (r) − ni−1 (r)| ∆n = R (15) are independent for each value of l. This is done by setting the drni (r) R config.numcores variable to the number of cores desired dr|vs (r) − vi−1 i s (r)| (config.numcores=-1 uses all the available cores) and han- ∆v = R i . (16) drvs (r) dled via the joblib library [Job20]. Once all three of these metrics fall below a certain threshold, the The remaining parameters passed to the CalcEnergy func- SCF cycle is considered converged and the calculation finishes. tion are optional; in the above, we have specified a grid size The SCF cycle is an example of a non-linear system and thus of 1500 points and a mixing fraction α = 0.7. The above code is prone to chaotic (non-convergent) behavior. Consequently a automatically prints the output seen in Fig. 5. This output shows range of techniques have been developed to ensure convergence the SCF cycle and, upon completion, the breakdown of the total [SM91]. Fortunately, the tendency for calculations not to converge free energy into its various components, as well as other useful becomes less likely for temperatures above zero (and especially information such as the KS energy levels and their occupations. as temperatures increase). Therefore we have implemented only Additionally, the output of the SCF function is a dictionary a simple linear mixing scheme in atoMEC. The potential used in containing the staticKS.Orbitals, staticKS.Density, each diagonalization step of the SCF cycle is not simply the one staticKS.Potential and staticKS.Density objects. generated from the most recent density, but a mix of that potential For example, one could extract the eigenfunctions as follows: and the previous one, orbs = scf_out["orbitals"] # orbs object vs (r) = αvis (r) + (1 − α)vi−1 (i) ks_eigfuncs = orbs.eigfuncs # eigenfunctions s (r) . (17) In general, a lower value of the mixing fraction α makes the The initialization of the SCF procedure is shown in the third and SCF cycle more stable, but requires more iterations to converge. fourth rows of Fig. 2, with the SCF procedure itself shown in the Typically a choice of α ≈ 0.5 gives a reasonable balance between remaining rows. speed and stability. This completes the section on the code structure and We can thus summarize the key parameters in an SCF calcu- algorithmic details. As discussed, with the output of an lation as follows: SCF calculation, there are various kinds of post-processing one can perform to obtain other properties of interest. So • the maximum number of eigenstates to compute, in terms far in atoMEC, these are limited to the computation of of both the principal and angular quantum numbers; the pressure (ISModel.CalcPressure), the electron • the numerical grid parameters, in particular the grid size; localization function (atoMEC.postprocess.ELFTools) • the convergence tolerances, Eqs. (14) to (16); and the Kubo–Greenwood conductivity • the SCF parameters, i.e. the mixing fraction and the (atoMEC.postprocess.conductivity). We refer maximum number of iterations. readers to our pre-print [CKC22] for details on how the electron The first three items in this list essentially control the accuracy localization function and the Kubo–Greenwood conductivity can of the calculation. In principle, for each SCF calculation — i.e. be used to improve predictions of the mean ionization state. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 43 Fig. 6: Helium density-of-states (DOS) as a function of energy, for different mass densities ρm , and at temperature τ = 50 kK. Black dots indicate the occupations of the electrons in the permitted energy ranges. Dashed black lines indicate the band-gap (the energy gap between the insulating and conducting bands). Between 5 and 6 g cm−3 , the band-gap disappears. and temperature) and electrical conductivity. To calculate the insulator-to-metallic transition point, the key quantity is the electronic band-gap. The concept of band- structures is a complicated topic, which we try to briefly describe in layman’s terms. In solids, electrons can occupy certain energy ranges — we call these the energy bands. In insulating materials, there is a gap between these energy ranges that electrons are forbidden from occupying — this is the so-called band-gap. In conducting materials, there is no such gap, and therefore electrons can conduct electricity because they can be excited into any part of the energy spectrum. Therefore, a simple method to determine the insulator-to-metallic transition is to determine the density at which the band-gap becomes zero. In Fig. 6, we plot the density-of-states (DOS) as a function of energy, for different densities and at fixed temperature τ = 50 kK. The DOS shows the energy ranges that the electrons are allowed to occupy; we also show the actual energies occupied by the electrons (according to Fermi–Dirac statistics) with the black dots. We can clearly see in this figure that the band-gap (the region where the DOS is zero) becomes smaller as a function of density. From Fig. 5: Auto-generated print statement from calling the this figure, it seems the transition from insulating to metallic state ISModel.CalcEnergy function happens somewhere between 5 and 6 g cm−3 . In Fig. 7, we plot the band-gap as a function of density, for a fixed temperature τ = 50 kK. Visually, it appears that the relation- Case-study: Helium ship between band-gap and density is linear at this temperature. In this section, we consider an application of atoMEC in the This is confirmed using a linear fit, which has a coefficient of WDM regime. Helium is the second most abundant element in the determination value of almost exactly one, R2 = 0.9997. Using this universe (after hydrogen) and therefore understanding its behavior fit, the band-gap is predicted to close at 5.5 g cm−3 . Also in this under a wide range of conditions is important for our under- figure, we show the fraction of ionized electrons, which is given by standing of many astrophysical processes. Of particular interest Z̄/Ne , using Eq. (6) to calculate Z̄, and Ne being the total electron are the conditions under which helium is expected to undergo a number. The ionization fraction also relates to the conductivity of transition from insulating to metallic behavior in the outer layers the material, because ionized electrons are not bound to any nuclei of white dwarfs, which are characterized by densities of around and therefore free to conduct electricity. We see that the ionization 1 − 20 g cm−3 and temperatures of 10 − 50 kK [PR20]. These fraction mostly increases with density (excepting some strange conditions are a typical example of the WDM regime. Besides behavior around ρm = 1 g cm−3 ), which is further evidence of the predicting the point at which the insulator-to-metallic transition transition from insulating to conducting behaviour with increasing occurs in the density-temperature spectrum, other properties of density. interest include equation-of-state data (relating pressure, density, As a final analysis, we plot the pressure as a function of mass 44 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) open-source scientific libraries — especially the Python libraries NumPy, SciPy, joblib and mendeleev, as well as LIBXC. We finish this paper by emphasizing that atoMEC is still in the early stages of development, and there are many opportunities to improve and extend the code. These include, for example: • adding new average-atom models, and different approxi- mations to the existing models.ISModel model; • optimizing the code, in particular the routines in the numerov module; • adding new postprocessing functionality, for example to compute structure factors; • improving the structure and design choices of the code. Fig. 7: Band-gap (red circles) and ionization fraction (blue squares) for helium as a function of mass density, at temperature τ = 50 kK. Of course, these are just a snapshot of the avenues for future The relationship between the band-gap and the density appears to be development in atoMEC. We are open to contributions in these linear. areas and many more besides. Acknowledgements This work was partly funded by the Center for Advanced Systems Understanding (CASUS) which is financed by Germany’s Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament. R EFERENCES [BDM+ 20] M. Bonitz, T. Dornheim, Zh. A. Moldabekov, S. Zhang, P. Hamann, H. Kählert, A. Filinov, K. Ramakrishna, and J. Vor- berger. Ab initio simulation of warm dense matter. Phys. Plas- mas, 27(4):042710, 2020. doi:10.1063/1.5143225. [BNR13] Roi Baer, Daniel Neuhauser, and Eran Rabani. Self- averaging stochastic Kohn-Sham density-functional theory. Fig. 8: Helium pressure (logarithmic scale) as a function of mass Phys. Rev. Lett., 111:106402, Sep 2013. doi:10.1103/ density and temperature. The pressure increases with density and PhysRevLett.111.106402. temperature (as expected), with a stronger dependence on density. [BVL+ 17] Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn- Sham equations with machine learning. Nature Communica- density and temperature in Fig. 8. The pressure is given by the tions, 8(1):872, Oct 2017. doi:10.1038/s41467-017- 00839-3. sum of two terms: (i) the electronic pressure, calculated using [CHKC22] T. J. Callow, S. B. Hansen, E. Kraisler, and A. Cangi. the method described in Ref. [FB19], and (ii) the ionic pressure, First-principles derivation and properties of density-functional calculated using the ideal gas law. We observe that the pressure average-atom models. Phys. Rev. Research, 4:023055, Apr 2022. doi:10.1103/PhysRevResearch.4.023055. increases with both density and temperature, which is the expected [CKC22] Timothy J. Callow, Eli Kraisler, and Attila Cangi. Accurate behavior. Under these conditions, the density dependence is much and efficient computation of mean ionization states with an stronger, especially for higher densities. average-atom Kubo-Greenwood approach, 2022. doi:10. The code required to generate the above results and plots can 48550/ARXIV.2203.05863. [CKTS+ 21] Timothy Callow, Daniel Kotik, Ekaterina Tsve- be found in this repository. toslavova Stankulova, Eli Kraisler, and Attila Cangi. atomec, August 2021. If you use this software, please cite it Conclusions and future work using these metadata. doi:10.5281/zenodo.5205719. [CMSY12] Aron J. Cohen, Paula Mori-Sánchez, and Weitao Yang. Chal- In this paper, we have presented atoMEC: an average-atom Python lenges for density functional theory. Chemical Reviews, code for studying materials under extreme conditions. The open- 112(1):289–320, 2012. doi:10.1021/cr200107z. [CRNB18] Yael Cytter, Eran Rabani, Daniel Neuhauser, and Roi Baer. source nature of atoMEC, and the choice to use (pure) Python as Stochastic density functional theory at finite temperatures. the programming language, is designed to improve the accessibil- Phys. Rev. B, 97:115207, Mar 2018. doi:10.1103/ ity of average-atom models. PhysRevB.97.115207. We gave significant attention to the code structure in this [DGB18] Tobias Dornheim, Simon Groth, and Michael Bonitz. The uniform electron gas at warm dense matter conditions. Phys. paper, and tried as much as possible to connect the functions Rep., 744:1 – 86, 2018. doi:10.1016/j.physrep. and objects in the code with the underlying theory. We hope that 2018.04.001. this not only improves atoMEC from a user perspective, but also [EFP+ 21] J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. facilitates new contributions from the wider average-atom, WDM Stephens, A. P. Thompson, A. Cangi, and S. Rajamanickam. Accelerating finite-temperature kohn-sham density functional and scientific Python communities. Another aim of the paper was theory with deep neural networks. Phys. Rev. B, 104:035120, to communicate how atoMEC benefits from a strong ecosystem of Jul 2021. doi:10.1103/PhysRevB.104.035120. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 45 [FB19] Gérald Faussurier and Christophe Blancard. Pressure in warm temperature density-functional theory. Phys. Rev. Lett., and hot dense matter using the average-atom model. Phys. Rev. 107:163001, Oct 2011. doi:10.1103/PhysRevLett. E, 99:053201, May 2019. doi:10.1103/PhysRevE.99. 107.163001. 053201. [PR20] Martin Preising and Ronald Redmer. Metallization of dense [GDRT14] Frank Graziani, Michael P Desjarlais, Ronald Redmer, and fluid helium from ab initio simulations. Phys. Rev. B, Samuel B Trickey. Frontiers and challenges in warm dense 102:224107, Dec 2020. doi:10.1103/PhysRevB.102. matter, volume 96. Springer Science & Business, 2014. doi: 224107. 10.1007/978-3-319-04912-0. [Roz91] Balazs F. Rozsnyai. Photoabsorption in hot plasmas based [GFG+ 16] S H Glenzer, L B Fletcher, E Galtier, B Nagler, R Alonso- on the ion-sphere and ion-correlation models. Phys. Rev. A, Mori, B Barbrel, S B Brown, D A Chapman, Z Chen, C B 43:3035–3042, Mar 1991. doi:10.1103/PhysRevA.43. Curry, F Fiuza, E Gamboa, M Gauthier, D O Gericke, A Glea- 3035. son, S Goede, E Granados, P Heimann, J Kim, D Kraus, [SM91] H. B. Schlegel and J. J. W. McDouall. Do You Have SCF Sta- M J MacDonald, A J Mackinnon, R Mishra, A Ravasio, bility and Convergence Problems?, pages 167–185. Springer C Roedel, P Sperling, W Schumaker, Y Y Tsui, J Vorberger, Netherlands, Dordrecht, 1991. doi:10.1007/978-94- U Zastrau, A Fry, W E White, J B Hasting, and H J Lee. 011-3262-6_2. Matter under extreme conditions experiments at the Linac [SPS+ 14] A. N. Souza, D. J. Perkins, C. E. Starrett, D. Saumon, and Coherent Light Source. J. Phys. B, 49(9):092001, apr 2016. S. B. Hansen. Predictions of x-ray scattering spectra for warm doi:10.1088/0953-4075/49/9/092001. dense matter. Phys. Rev. E, 89:023108, Feb 2014. doi: [HK64] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. 10.1103/PhysRevE.89.023108. Phys. Rev., 136(3B):B864–B871, Nov 1964. doi:10.1103/ [SRH+ 12] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert PhysRev.136.B864. Müller, and Kieron Burke. Finding density functionals with [HMvdW 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der + machine learning. Phys. Rev. Lett., 108:253002, Jun 2012. Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric doi:10.1103/PhysRevLett.108.253002. Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, [SS14] C.E. Starrett and D. Saumon. A simple method for determining Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk- the ionic structure of warm dense matter. High Energy Density wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Physics, 10:35–42, 2014. doi:10.1016/j.hedp.2013. Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin 12.001. Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, [Sta16] C.E. Starrett. Kubo–Greenwood approach to conductivity Christoph Gohlke, and Travis E. Oliphant. Array programming in dense plasmas with average atom models. High Energy with NumPy. Nature, 585(7825):357–362, September 2020. Density Physics, 19:58–64, 2016. doi:10.1016/j.hedp. doi:10.1038/s41586-020-2649-2. 2016.04.001. [HRD08] Bastian Holst, Ronald Redmer, and Michael P. Desjarlais. [STJ+ 14] Sang-Kil Son, Robert Thiele, Zoltan Jurek, Beata Ziaja, and Thermophysical properties of warm dense hydrogen using Robin Santra. Quantum-mechanical calculation of ionization- quantum molecular dynamics simulations. Phys. Rev. B, potential lowering in dense plasmas. Phys. Rev. X, 4:031004, 77:184201, May 2008. doi:10.1103/PhysRevB.77. Jul 2014. doi:10.1103/PhysRevX.4.031004. 184201. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt [JFC+ 13] Weile Jia, Jiyun Fu, Zongyan Cao, Long Wang, Xuebin Chi, Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Weiguo Gao, and Lin-Wang Wang. Fast plane wave density Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- functional theory molecular dynamics calculations on multi- fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- GPU machines. Journal of Computational Physics, 251:102– rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric 115, 2013. doi:10.1016/j.jcp.2013.05.005. Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, [Job20] Joblib Development Team. Joblib: running Python functions Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, as pipeline jobs. https://joblib.readthedocs.io/, 2020. Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- tero, Charles R. Harris, Anne M. Archibald, Antônio H. [KDF+ 11] A. L. Kritcher, T. Döppner, C. Fortmann, T. Ma, O. L. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy Landen, R. Wallace, and S. H. Glenzer. In-Flight Measure- 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for ments of Capsule Shell Adiabats in Laser-Driven Implosions. Scientific Computing in Python. Nature Methods, 17:261–272, Phys. Rev. Lett., 107:015002, Jul 2011. doi:10.1103/ 2020. doi:10.1038/s41592-019-0686-2. PhysRevLett.107.015002. [Koh99] W. Kohn. Nobel lecture: Electronic structure of matter—wave functions and density functionals. Rev. Mod. Phys., 71:1253– 1266, 10 1999. doi:10.1103/RevModPhys.71.1253. [KS65] W. Kohn and L. J. Sham. Self-consistent equations including exchange and correlation effects. Phys. Rev., 140(4A):A1133– A1138, Nov 1965. doi:10.1103/PhysRev.140. A1133. [LSOM18] Susi Lehtola, Conrad Steigemann, Micael J.T. Oliveira, and Miguel A.L. Marques. Recent developments in LIBXC — A comprehensive library of functionals for density functional theory. SoftwareX, 7:1–5, 2018. doi:10.1016/j.softx. 2017.11.002. [MED11] Stefan Maintz, Bernhard Eck, and Richard Dronskowski. Speeding up plane-wave electronic-structure calculations us- ing graphics-processing units. Computer Physics Communi- cations, 182(7):1421–1427, 2011. doi:10.1016/j.cpc. 2011.03.010. [men14] mendeleev – A Python resource for properties of chemical elements, ions and isotopes, ver. 0.9.0. https://github.com/ lmmentel/mendeleev, 2014. [Mer65] N. David Mermin. Thermal properties of the inhomogeneous electron gas. Phys. Rev., 137:A1441–A1443, Mar 1965. doi: 10.1103/PhysRev.137.A1441. [PGW12] Mohandas Pillai, Joshua Goglio, and Thad G. Walker. Matrix numerov method for solving schrödinger’s equation. Amer- ican Journal of Physics, 80(11):1017–1019, 2012. doi: 10.1119/1.4748813. [PPF+ 11] S. Pittalis, C. R. Proetto, A. Floris, A. Sanna, C. Bersier, K. Burke, and E. K. U. Gross. Exact conditions in finite- 46 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Automatic random variate generation in Python Christoph Baumgarten‡∗ , Tirth Patel F Abstract—The generation of random variates is an important tool that is re- • For inversion methods, the structural properties of the quired in many applications. Various software programs or packages contain underlying uniform random number generator are pre- generators for standard distributions like the normal, exponential or Gamma, served and the numerical accuracy of the methods can be e.g., the programming language R and the packages SciPy and NumPy in controlled by a parameter. Therefore, inversion is usually Python. However, it is not uncommon that sampling from new/non-standard dis- the only method applied for simulations using quasi-Monte tributions is required. Instead of deriving specific generators in such situations, so-called automatic or black-box methods have been developed. These allow Carlo (QMC) methods. the user to generate random variates from fairly large classes of distributions • Depending on the use case, one can choose between a fast by only specifying some properties of the distributions (e.g. the density and/or setup with slow marginal generation time and vice versa. cumulative distribution function). In this note, we describe the implementation of such methods from the C library UNU.RAN in the Python package SciPy and The latter point is important depending on the use case: if a provide a brief overview of the functionality. large number of samples is required for a given distribution with fixed shape parameters, a slower setup that only has to be run once Index Terms—numerical inversion, generation of random variates can be accepted if the marginal generation times are low. If small to moderate samples sizes are required for many different shape parameters, then it is important to have a fast setup. The former Introduction situation is referred to as the fixed-parameter case and the latter as The generation of random variates is an important tool that is the varying parameter case. required in many applications. Various software programs or Implementations of various methods are available in the packages contain generators for standard distributions, e.g., R C library UNU.RAN ([HL07]) and in the associated R pack- ([R C21]) and SciPy ([VGO+ 20]) and NumPy ([HMvdW+ 20]) age Runuran (https://cran.r-project.org/web/packages/Runuran/ in Python. Standard references for these algorithms are the books index.html, [TL03]). The aim of this note is to introduce the [Dev86], [Dag88], [Gen03], and [Knu14]. An interested reader Python implementation in the SciPy package that makes some will find many references to the vast existing literature in these of the key methods in UNU.RAN available to Python users in works. While relying on general methods such as the rejection SciPy 1.8.0. These general tools can be seen as a complement principle, the algorithms for well-known distributions are often to the existing specific sampling methods: they might lead to specifically designed for a particular distribution. This is also the better performance in specific situations compared to the existing case in the module stats in SciPy that contains more than 100 generators, e.g., if a very large number of samples are required for distributions and the module random in NumPy with more than a fixed parameter of a distribution or if the implemented sampling 30 distributions. However, there are also so-called automatic or method relies on a slow default that is based on numerical black-box methods for sampling from large classes of distributions inversion of the CDF. For advanced users, they also offer various with a single piece of code. For such algorithms, information options that allow to fine-tune the generators (e.g., to control the about the distribution such as the density, potentially together with time needed for the setup step). its derivative, the cumulative distribution function (CDF), and/or the mode must be provided. See [HLD04] for a comprehensive overview of these methods. Although the development of such Automatic algorithms in SciPy methods was originally motivated to generate variates from non- Many of the automatic algorithms described in [HLD04] and standard distributions, these universal methods have advantages [DHL10] are implemented in the ANSI C library, UNU.RAN that make their usage attractive even for sampling from standard (Universal Non-Uniform RANdom variate generators). Our goal distributions. We mention some of the important properties (see was to provide a Python interface to the most important methods [LH00], [HLD04], [DHL10]): from UNU.RAN to generate univariate discrete and continuous non-uniform random variates. The following generators have been • The algorithms can be used to sample from truncated implemented in SciPy 1.8.0: distributions. • TransformedDensityRejection: Transformed * Corresponding author: christoph.baumgarten@gmail.com ‡ Unaffiliated Density Rejection (TDR) ([H9̈5], [GW92]) • NumericalInverseHermite: Hermite interpolation Copyright © 2022 Christoph Baumgarten et al. This is an open-access article based INVersion of CDF (HINV) ([HL03]) distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • NumericalInversePolynomial: Polynomial inter- provided the original author and source are credited. polation based INVersion of CDF (PINV) ([DHL10]) AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 47 • SimpleRatioUniforms: Simple Ratio-Of-Uniforms by computing tangents at suitable design points. Note that by its (SROU) ([Ley01], [Ley03]) nature any rejection method requires not always the same number • DiscreteGuideTable: (Discrete) Guide Table of uniform variates to generate one non-uniform variate; this method (DGT) ([CA74]) makes the use of QMC and of some variance reduction methods • DiscreteAliasUrn: (Discrete) Alias-Urn method more difficult or impossible. On the other hand, rejection is often (DAU) ([Wal77]) the fastest choice for the varying parameter case. The Ratio-Of-Uniforms method (ROU, [KM77]) is another Before describing the implementation in SciPy in Section general method that relies on rejection. The underlying principle is scipy_impl, we give a short introduction to random variate gener- that p if (U,V ) is uniformly distributed on the set A f := {(u, v) : 0 < ation in Section intro_rv_gen. v ≤ f (u/v), a < u/v < b} where f is a PDF with support (a, b), then X := U/V follows a distribution according to f . In general, it A very brief introduction to random variate generation is not possible to sample uniform values on A f directly. However, It is well-known that random variates can be generated by inver- if A f ⊂ R := [u− , u+ ] × [0, v+ ] for finite constants u− , u+ , v+ , one sion of the CDF F of a distribution: if U is a uniform random can apply the rejection method: generate uniform values (U,V ) on number on (0, 1), X := F −1 (U) is distributed according to F. the bounding rectangle R until (U,V ) ∈ A f and return X = U/V . Unfortunately, the inverse CDF can only be expressed in closed Automatic methods relying on the ROU method such as SROU form for very few distributions, e.g., the exponential or Cauchy and automatic ROU ([Ley00]) need a setup step to find a suitable distribution. If this is not the case, one needs to rely on imple- region S ∈ R2 such that A f ⊂ S and such that one can generate mentations of special functions to compute the inverse CDF for (U,V ) uniformly on S efficiently. standard distributions like the normal, Gamma or beta distributions or numerical methods for inverting the CDF are required. Such Description of the SciPy interface procedures, however, have the disadvantage that they may be slow SciPy provides an object-oriented API to UNU.RAN’s methods. or inaccurate, and developing fast and robust inversion algorithms To initialize a generator, two steps are required: such as HINV and PINV is a non-trivial task. HINV relies on Hermite interpolation of the inverse CDF and requires the CDF 1) creating a distribution class and object, and PDF as an input. PINV only requires the PDF. The algorithm 2) initializing the generator itself. then computes the CDF via adaptive Gauss-Lobatto integration In step 1, a distributions object must be created that im- and an approximation of the inverse CDF using Newton’s polyno- plements required methods (e.g., pdf, cdf). This can either mial interpolation. Note that an approximation of the inverse CDF be a custom object or a distribution object from the classes can be achieved by interpolating the points (F(xi ), xi ) for points rv_continuous or rv_discrete in SciPy. Once the gen- xi in the domain of F, i.e., no evaluation of the inverse CDF is erator is initialized from the distribution object, it provides a required. rvs method to sample random variates from the given dis- For discrete distributions, F is a step-function. To compute tribution. It also provides a ppf method that approximates the inverse CDF F −1 (U), the simplest idea would be to apply the inverse CDF if the initialized generator uses an inversion sequential search: if X takes values 0, 1, 2, . . . with probabil- method. The following example illustrates how to initialize the ities p0 , p1 , p2 , . . . , start with j = 0 and keep incrementing j NumericalInversePolynomial (PINV) generator for the until F( j) = p0 + · · · + p j ≥ U. When the search terminates, standard normal distribution: X = j = F −1 (U). Clearly, this approach is generally very slow import numpy as np and more efficient methods have been developed: if X takes L from scipy.stats import sampling distinct values, DGT realizes very fast inversion using so-called from math import exp guide tables / hash tables to find the index j. In contrast DAU is # create a distribution class with implementation not an inversion method but uses the alias method, i.e., tables are # of the PDF. Note that the normalization constant precomputed to write X as an equi-probable mixture of L two- # is not required point distributions (the alias values). class StandardNormal: def pdf(self, x): The rejection method has been suggested in [VN51]. In its return exp(-0.5 * x**2) simplest form, assume that f is a bounded density on [a, b], i.e., f (x) ≤ M for all x ∈ [a, b]. Sample two independent uniform # create a distribution object and initialize the # generator random variates on U on [0, 1] and V on [a, b] until M ·U ≤ f (V ). dist = StandardNormal() Note that the accepted points (U,V ) are uniformly distributed in rng = sampling.NumericalInversePolynomial(dist) the region between the x-axis and the graph of the PDF. Hence, X := V has the desired distribution f . This is a special case of # sample 100,000 random variates from the given # distribution the general version: if f , g are two densities on an interval J such rvs = rng.rvs(100000) that f (x) ≤ c · g(x) for all x ∈ J and a constant c ≥ 1, sample U uniformly distributed on [0, 1] and X distributed according to As NumericalInversePolynomial generator uses an in- g until c · U · g(X) ≤ f (X). Then X has the desired distribution version method, it also provides a ppf method that approximates f . It can be shown that the expected number of iterations before the inverse CDF: the acceptance condition is met is equal to c. Hence, the main # evaluate the approximate PPF at a few points ppf = rng.ppf([0.1, 0.5, 0.9]) challenge is to find hat functions g for which c is small and from which random variates can be generated efficiently. TDR solves It is also easy to sample from a truncated distribution by passing this problem by applying a transformation T to the density such a domain argument to the constructor of the generator. For that x 7→ T ( f (x)) is concave. A hat function can then be found example, to sample from truncated normal distribution: 48 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) # truncate the distribution by passing a reference/random/bit_generators/index.html. To change the uni- # `domain` argument form random number generator, a random_state parameter rng = sampling.NumericalInversePolynomial( dist, domain=(-1, 1) can be passed as shown in the example below: ) # 64-bit PCG random number generator in NumPy urng = np.random.Generator(np.random.PCG64()) While the default options of the generators should work well in # The above line can also be replaced by: many situations, we point out that there are various parameters that # ``urng = np.random.default_rng()`` the user can modify, e.g., to provide further information about the # as PCG64 is the default generator starting # from NumPy 1.19.0 distribution (such as mode or center) or to control the numerical accuracy of the approximated PPF. (u_resolution). Details # change the uniform random number generator by can be found in the SciPy documentation https://docs.scipy.org/ # passing the `random_state` argument rng = sampling.NumericalInversePolynomial( doc/scipy/reference/. The above code can easily be generalized to dist, random_state=urng sample from parametrized distributions using instance attributes ) in the distribution class. For example, to sample from the gamma We also point out that the PPF of inversion methods can be applied distribution with shape parameter alpha, we can create the to sequences of quasi-random numbers. SciPy provides different distribution class with parameters as instance attributes: sequences in its QMC module (scipy.stats.qmc). class Gamma: NumericalInverseHermite provides a qrvs method def __init__(self, alpha): self.alpha = alpha which generates random variates using QMC methods present in SciPy (scipy.stats.qmc) as uniform random number def pdf(self, x): generators3 . The next example illustrates how to use qrvs with a return x**(self.alpha-1) * exp(-x) generator created directly from a SciPy distribution object. def support(self): from scipy import stats return 0, np.inf from scipy.stats import qmc # initialize a distribution object with varying # 1D Halton sequence generator. # parameters qrng = qmc.Halton(d=1) dist1 = Gamma(2) dist2 = Gamma(3) rng = sampling.NumericalInverseHermite(stats.norm()) # initialize a generator for each distribution # generate quasi random numbers using the Halton rng1 = sampling.NumericalInversePolynomial(dist1) # sequence as uniform variates rng2 = sampling.NumericalInversePolynomial(dist2) qrvs = rng.qrvs(size=100, qmc_engine=qrng) In the above example, the support method is used to set the domain of the distribution. This can alternatively be done by Benchmarking passing a domain parameter to the constructor. To analyze the performance of the implementation, we tested the In addition to continuous distribution, two UNU.RAN methods methods applied to several standard distributions against the gen- have been added in SciPy to sample from discrete distributions. In erators in NumPy and the original UNU.RAN C library. In addi- this case, the distribution can be either be represented using a tion, we selected one non-standard distribution to demonstrate that probability vector (which is passed to the constructor as a Python substantial reductions in the runtime can be achieved compared to list or NumPy array) or a Python object with the implementation other implementations. All the benchmarks were carried out using of the probability mass function. In the latter case, a finite domain NumPy 1.22.4 and SciPy 1.8.1 running in a single core on Ubuntu must be passed to the constructor or the object should implement 20.04.3 LTS with Intel(R) Core(TM) i7-8750H CPU (2.20GHz the support method1 . clock speed, 16GB RAM). We run the benchmarks with NumPy’s # Probability vector to represent a discrete MT19937 (Mersenne Twister) and PCG64 random number gen- # distribution. Note that the probability vector erators (np.random.MT19937 and np.random.PCG64) in # need not be vectorized pv = [0.1, 9.0, 2.9, 3.4, 0.3] Python and use NumPy’s C implementation of MT19937 in the UNU.RAN C benchmarks. As explained above, the use of PCG64 # PCG64 uniform RNG with seed 123 is recommended, and MT19937 is only included to compare the urng = np.random.default_rng(123) rng = sampling.DiscreteAliasUrn( speed of the Python implementation and the C library by relying pv, random_state=urng on the same uniform number generator (i.e., differences in the ) performance of the uniform number generation are not taken into account). The code for all the benchmarks can be found on # sample from the given discrete distribution rvs = rng.rvs(100000) https://github.com/tirthasheshpatel/unuran_benchmarks. The methods used in NumPy to generate normal, gamma, and beta random variates are: Underlying uniform pseudo-random number generators NumPy provides several generators for uniform pseudo-random • the ziggurat algorithm ([MT00b]) to sample from the numbers2 . It is highly recommended to use NumPy’s default standard normal distribution, random number generator np.random.PCG64 for better speed 2. By default, NumPy’s legacy random number generator, MT19937 and performance, see [O’N14] and https://numpy.org/doc/stable/ (np.random.RandomState()) is used as the uniform random number generator for consistency with the stats module in SciPy. 1. Support for discrete distributions with infinite domain hasn’t been added 3. In SciPy 1.9.0, qrvs will be added to yet. NumericalInversePolynomial. AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 49 • the rejection algorithms in Chapter XII.2.6 in [Dev86] if 70-200 times faster. This clearly shows the benefit of using a α < 1 and in [MT00a] if α > 1 for the Gamma distribution, black-box algorithm. • Johnk’s algorithm ([Jöh64], Section IX.3.5 in [Dev86]) if max{α, β } ≤ 1, otherwise a ratio of two Gamma variates Conclusion with shape parameter α and β (see Section IX.4.1 in The interface to UNU.RAN in SciPy provides easy access to [Dev86]) for the beta distribution. different algorithms for non-uniform variate generation for large Benchmarking against the normal, gamma, and beta distributions classes of univariate continuous and discrete distributions. We have shown that the methods are easy to use and that the al- Table 1 compares the performance for the standard normal, gorithms perform very well both for standard and non-standard Gamma and beta distributions. We recall that the density of the distributions. A comprehensive documentation suite, a tutorial Gamma distribution with shape parameter a > 0 is given by and many examples are available at https://docs.scipy.org/doc/ x ∈ (0, ∞) 7→ xa−1 e−x and the density of the beta distribution with α−1 (1−x)β −1 scipy/reference/stats.sampling.html and https://docs.scipy.org/doc/ shape parameters α, β > 0 is given by x ∈ (0, 1) 7→ x B(α,β ) scipy/tutorial/stats/sampling.html. Various methods have been im- where Γ(·) and B(·, ·) are the Gamma and beta functions. The plemented in SciPy, and if specific use cases require additional results are reported in Table 1. functionality from UNU.RAN, the methods can easily be added We summarize our main observations: to SciPy given the flexible framework that has been developed. 1) The setup step in Python is substantially slower than Another area of further development is to better integrate SciPy’s in C due to expensive Python callbacks, especially for QMC generators for the inversion methods. PINV and HINV. However, the time taken for the setup is Finally, we point out that other sampling methods like Markov low compared to the sampling time if large samples are Chain Monte Carlo and copula methods are not part of SciPy. Rel- drawn. Note that as expected, SROU has a very fast setup evant Python packages in that context are PyMC ([PHF10]), PyS- such that this method is suitable for the varying parameter tan relying on Stan ([Tea21]), Copulas (https://sdv.dev/Copulas/) case. and PyCopula (https://blent-ai.github.io/pycopula/). 2) The sampling time in Python is slightly higher than in C for the MT19937 random number generator. If the Acknowledgments recommended PCG64 generator is used, the sampling The authors wish to thank Wolfgang Hörmann and Josef Leydold time in Python is slightly lower. The only exception for agreeing to publish the library under a BSD license and for is SROU: due to Python callbacks, the performance is helpful feedback on the implementation and this note. In addition, substantially slower than in C. However, as the main we thank Ralf Gommers, Matt Haberland, Nicholas McKibben, advantage of SROU is the fast setup time, the main use Pamphile Roy, and Kai Striega for their code contributions, re- case is the varying parameter case (i.e., the method is not views, and helpful suggestions. The second author was supported supposed to be used to generate large samples). by the Google Summer of Code 2021 program5 . 3) PINV, HINV, and TDR are at most about 2x slower than the specialized NumPy implementation for the normal R EFERENCES distribution. For the Gamma and beta distribution, they even perform better for some of the chosen shape pa- [CA74] Hui-Chuan Chen and Yoshinori Asau. On gener- ating random variates from an empirical distribution. rameters. These results underline the strong performance AIIE Transactions, 6(2):163–166, 1974. doi:10.1080/ of these black-box approaches even for standard distribu- 05695557408974949. tions. [Dag88] John Dagpunar. Principles of random variate generation. 4) While the application of PINV requires bounded densi- Oxford University Press, USA, 1988. [Dev86] Luc Devroye. Non-Uniform Random Variate Generation. ties, no issues are encountered for α = 0.05 since the Springer-Verlag, New York, 1986. doi:10.1007/978-1- unbounded part is cut off by the algorithm. However, the 4613-8643-8. setup can fail for very small values of α. [DHL10] Gerhard Derflinger, Wolfgang Hörmann, and Josef Leydold. Random variate generation by numerical inversion when only the density is known. ACM Transactions on Modeling and Benchmarking against a non-standard distribution Computer Simulation (TOMACS), 20(4):1–25, 2010. doi: We benchmark the performance of PINV to sample from the 10.1145/1842722.1842723. [Gen03] James E Gentle. Random number generation and Monte Carlo generalized normal distribution ([Sub23]) whose density is given p methods, volume 381. Springer, 2003. doi:10.1007/ pe−|x| by x ∈ (−∞, ∞) 7→ 2Γ(1/p) against the method proposed in [NP09] b97336. and against the implementation in SciPy’s gennorm distribu- [GW92] Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: tion. The approach in [NP09] relies on transforming Gamma Series C (Applied Statistics), 41(2):337–348, 1992. doi:10. variates to the generalized normal distribution whereas SciPy 2307/2347565. relies on computing the inverse of CDF of the Gamma distri- [H9̈5] Wolfgang Hörmann. A rejection technique for sampling from bution (https://docs.scipy.org/doc/scipy/reference/generated/scipy. T-concave distributions. ACM Trans. Math. Softw., 21(2):182– 193, 1995. doi:10.1145/203082.203089. special.gammainccinv.html). The results for different values of p [HL03] Wolfgang Hörmann and Josef Leydold. Continuous random are shown in Table 2. variate generation by fast numerical inversion. ACM Trans- PINV is usually about twice as fast than the special- actions on Modeling and Computer Simulation (TOMACS), 13(4):347–362, 2003. doi:10.1145/945511.945517. ized method and about 15-150 times faster than SciPy’s implementation4 . We also found an R package pgnorm (https: 4. In SciPy 1.9.0, the speed will be improved by implementing the method //cran.r-project.org/web/packages/pgnorm/) that implements vari- from [NP09] ous approaches from [KR13]. In that case, PINV is usually about 5. https://summerofcode.withgoogle.com/projects/#5912428874825728 50 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Python C Distribution Method Setup Sampling (PCG64) Sampling (MT19937) Setup Sampling (MT19937) PINV 4.6 29.6 36.5 0.27 32.4 HINV 2.5 33.7 40.9 0.38 36.8 Standard normal TDR 0.2 37.3 47.8 0.02 41.4 SROU 8.7 µs 2510 2160 0.5 µs 232 NumPy - 17.6 22.4 - - PINV 196.0 29.8 37.2 37.9 32.5 Gamma(0.05) HINV 24.5 36.1 43.8 1.9 40.7 NumPy - 55.0 68.1 - - PINV 16.5 31.2 38.6 2.0 34.5 Gamma(0.5) HINV 4.9 34.2 41.7 0.6 37.9 NumPy - 86.4 99.2 - - PINV 5.3 30.8 38.7 0.5 34.6 HINV 5.3 33 40.6 0.4 36.8 Gamma(3.0) TDR 0.2 38.8 49.6 0.03 44 NumPy - 36.5 47.1 - - PINV 21.4 33.1 39.9 2.4 37.3 Beta(0.5, 0.5) HINV 2.1 38.4 45.3 0.2 42 NumPy - 101 112 - - HINV 0.2 37 44.3 0.01 41.1 Beta(0.5, 1.0) NumPy - 125 138 - - PINV 15.7 30.5 37.2 1.7 34.3 HINV 4.1 33.4 40.8 0.4 37.1 Beta(1.3, 1.2) TDR 0.2 46.8 57.8 0.03 45 NumPy - 74.3 97 - - PINV 9.7 30.2 38.2 0.9 33.8 HINV 5.8 33.7 41.2 0.4 37.4 Beta(3.0, 2.0) TDR 0.2 42.8 52.8 0.02 44 NumPy - 72.6 92.8 - - TABLE 1 Average time taken (reported in milliseconds, unless mentioned otherwise) to sample 1 million random variates from the standard normal distribution. The mean is computed over 7 iterations. Standard deviations are not reported as they were very small (less than 1% of the mean in the large majority of cases). Note that not all methods can always be applied, e.g., TDR cannot be applied to the Gamma distribution if a < 1 since the PDF is not log-concave in that case. As NumPy uses rejection algorithms with precomputed constants, no setup time is reported. p 0.25 0.45 0.75 1 1.5 2 5 8 Nardon and Pianca (2009) 100 101 101 45 148 120 128 122 SciPy’s gennorm distribution 832 1000 1110 559 5240 6720 6230 5950 Python (PINV Method, PCG64 urng) 50 47 45 41 40 37 38 38 TABLE 2 Comparing SciPy’s implementation and a specialized method against PINV to sample 1 million variates from the generalized normal distribution for different values of the parameter p. Time reported in milliseconds. The mean is computer over 7 iterations. [HL07] Wolfgang Hörmann and Josef Leydold. UNU.RAN - Univer- ates. ACM Transactions on Mathematical Software (TOMS), sal Non-Uniform RANdom number generators, 2007. https: 3(3):257–260, 1977. doi:10.1145/355744.355750. //statmath.wu.ac.at/unuran/doc.html. [Knu14] Donald E Knuth. The Art of Computer Programming, Volume [HLD04] Wolfgang Hörmann, Josef Leydold, and Gerhard Derflinger. 2: Seminumerical algorithms. Addison-Wesley Professional, Automatic nonuniform random variate generation. Springer, 2014. doi:10.2307/2317055. 2004. doi:10.1007/978-3-662-05946-3. [KR13] Steve Kalke and W-D Richter. Simulation of the p-generalized [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Gaussian distribution. Journal of Statistical Computation Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric and Simulation, 83(4):641–667, 2013. doi:10.1080/ Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, 00949655.2011.631187. Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van [Ley00] Josef Leydold. Automatic sampling with the ratio-of-uniforms Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del method. ACM Transactions on Mathematical Software Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, (TOMS), 26(1):78–98, 2000. doi:10.1145/347837. Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer 347863. Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro- [Ley01] Josef Leydold. A simple universal generator for continuous gramming with NumPy. Nature, 585(7825):357–362, 2020. and discrete univariate T-concave distributions. ACM Transac- doi:10.1038/s41586-020-2649-2. tions on Mathematical Software (TOMS), 27(1):66–82, 2001. [Jöh64] MD Jöhnk. Erzeugung von betaverteilten und gammaverteilten doi:10.1145/382043.382322. Zufallszahlen. Metrika, 8(1):5–15, 1964. doi:10.1007/ [Ley03] Josef Leydold. Short universal generators via generalized bf02613706. ratio-of-uniforms method. Mathematics of Computation, [KM77] Albert J Kinderman and John F Monahan. Computer gen- 72(243):1453–1471, 2003. doi:10.1090/s0025-5718- eration of random variables using the ratio of uniform devi- 03-01511-4. AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 51 [LH00] Josef Leydold and Wolfgang Hörmann. Universal algorithms as an alternative for generating non-uniform continuous ran- dom variates. In Proceedings of the International Conference on Monte Carlo Simulation 2000., pages 177–183, 2000. [MT00a] George Marsaglia and Wai Wan Tsang. A simple method for generating gamma variables. ACM Transactions on Math- ematical Software (TOMS), 26(3):363–372, 2000. doi: 10.1145/358407.358414. [MT00b] George Marsaglia and Wai Wan Tsang. The ziggurat method for generating random variables. Journal of statistical soft- ware, 5(1):1–7, 2000. doi:10.18637/jss.v005.i08. [NP09] Martina Nardon and Paolo Pianca. Simulation techniques for generalized Gaussian densities. Journal of Statistical Computation and Simulation, 79(11):1317–1329, 2009. doi: 10.1080/00949650802290912. [O’N14] Melissa E. O’Neill. PCG: A family of simple fast space- efficient statistically good algorithms for random number gen- eration. Technical Report HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, September 2014. [PHF10] Anand Patil, David Huard, and Christopher J Fonnesbeck. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35(4):1, 2010. doi:10.18637/jss. v035.i04. [R C21] R Core Team. R: A language and environment for statistical computing, 2021. https://www.R-project.org/. [Sub23] M.T. Subbotin. On the law of frequency of error. Mat. Sbornik, 31(2):296–301, 1923. [Tea21] Stan Development Team. Stan modeling language users guide and reference manual, version 2.28., 2021. https://mc-stan.org. [TL03] Günter Tirler and Josef Leydold. Automatic non-uniform random variate generation in r. In Proceedings of DSC, page 2, 2003. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, pages 1–12, 2020. doi:10.1038/ s41592-019-0686-2. [VN51] John Von Neumann. Various techniques used in connection with random digits. Appl. Math Ser, 12(36-38):3, 1951. [Wal77] Alastair J Walker. An efficient method for generating discrete random variables with general distributions. ACM Transac- tions on Mathematical Software (TOMS), 3(3):253–256, 1977. doi:10.1145/355744.355749. 52 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger Materials Suite Alexandr Fonari‡∗ , Farshad Fallah‡ , Michael Rauch‡ F Abstract—The use of several open source scientific packages in the open-source and many of which blend the two to optimize capa- Schrödinger Materials Science Suite will be discussed. A typical workflow for bilities and efficiency. For example, the main simulation engine materials discovery will be described, discussing how open source packages for molecular quantum mechanics is the Jaguar [BHH+ 13] pro- have been incorporated at every stage. Some recent implementations of ma- prietary code. The proprietary classical molecular dynamics code chine learning for materials discovery will be discussed, as well as how open Desmond (distributed by Schrödinger, Inc.) [SGB+ 14] is used to source packages were leveraged to achieve results faster and more efficiently. obtain physical properties of soft materials, surfaces and polymers. Index Terms—materials, active learning, OLED, deposition, evaporation For periodic quantum mechanics, the main simulation engine is the open source code Quantum ESPRESSO (QE) [GAB+ 17]. One of the co-authors of this proceedings (A. Fonari) contributes to Introduction the QE code in order to make integration with the Materials Suite more seamless and less error-prone. As part of this integration, A common materials discovery practice or workflow is to start support for using the portable XML format for input and output with reading an experimental structure of a material or generating in QE has been implemented in the open source Python package a structure in silico, computing its properties of interest (e.g. qeschema [BDBF]. elastic constants, electrical conductivity), tuning the material by Figure 2 gives an overview of some of the various products that modifying its structure (e.g. doping) or adding and removing compose the Schrödinger Materials Science Suite. The various atoms (deposition, evaporation), and then recomputing the proper- workflows are implemented mainly in Python (some of them ties of the modified material (Figure 1). Computational materials described below), calling on proprietary or open-source code discovery leverages such workflows to empower researchers to where appropriate, to improve the performance of the software explore vast design spaces and uncover root causes without (or in and reduce overall maintenance. conjunction with) laboratory experimentation. The materials discovery cycle can be run in a high-throughput Software tools for computational materials discovery can be manner, enumerating different structure modifications in a system- facilitated by utilizing existing libraries that cover the fundamental atic fashion, such as doping ratio in a semiconductor or depositing mathematics used in the calculations in an optimized fashion. This different adsorbates. As we will detail herein, there are several use of existing libraries allows developers to devote more time open source packages that allow the user to generate a large to developing new features instead of re-inventing established number of structures, run calculations in high throughput manner methods. As a result, such a complementary approach improves and analyze the results. For example, the open source package the performance of computational materials software and reduces pymatgen [ORJ+ 13] facilitates generation and analysis of periodic overall maintenance. structures. It can generate inputs for and read outputs of QE, the The Schrödinger Materials Science Suite [LLC22] is a propri- commercial codes VASP and Gaussian, and several other formats. etary computational chemistry/physics platform that streamlines To run and manage workflow jobs in a high-throughput manner, materials discovery workflows into a single graphical user inter- open source packages such as Custodian [ORJ+ 13] and AiiDA face (Materials Science Maestro). The interface is a single portal [HZU+ 20] can be used. for structure building and enumeration, physics-based modeling and machine learning, visualization and analysis. Tying together the various modules are a wide variety of scientific packages, some Materials import and generation of which are proprietary to Schrödinger, Inc., some of which are For reading and writing of material structures, several open source packages (e.g. OpenBabel [OBJ+ 11], RDKit [LTK+ 22]) have * Corresponding author: sasha.fonari@schrodinger.com ‡ Schrödinger Inc., 1540 Broadway, 24th Floor. New York, NY 10036 implemented functionality for working with several commonly used formats (e.g. CIF, PDB, mol, xyz). Periodic structures Copyright © 2022 Alexandr Fonari et al. This is an open-access article of materials, mainly coming from single crystal X-ray/neutron distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, diffraction experiments, are distributed in CIF (Crystallographic provided the original author and source are credited. Information File), PDB (Protein Data Bank) and lately mmCIF UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 53 Fig. 1: Example of a workflow for computational materials discovery. Fig. 2: Some example products that compose the Schrödinger Materials Science Suite. formats [WF05]. Correctly reading experimental structures is of work went into this project) and others to correctly read and significant importance, since the rest of the materials discovery convert periodic structures in OpenBabel. By version 3.1.1 (the workflow depends on it. In addition to atom coordinates and most recent at writing time), the authors are not aware of any periodic cell information, structural data also contains symme- structures read incorrectly by OpenBabel. In general, non-periodic try operations (listed explicitly or by the means of providing molecular formats are simpler to handle because they only contain a space group) that can be used to decrease the number of atom coordinates but no cell or symmetry information. OpenBabel computations required for a particular system by accounting for has Python bindings but due to the GPL license limitation, it is symmetry. This can be important, especially when scaling high- called as a subprocess from the Schrödinger Materials Suite. throughput calculations. From file, structure is read in a structure Another important consideration in structure generation is object through which atomic coordinates (as a NumPy array) and modeling of substitutional disorder in solid alloys and materials chemical information of the material can be accessed and updated. with point defects (intermetallics, semiconductors, oxides and Structure object is similar to the one implemented in open source their crystalline surfaces). In such cases, the unit cell and atomic packages such as pymatgen [ORJ+ 13] and ASE [LMB+ 17]. All sites of the crystal or surface slab are well defined while the chem- the structure manipulations during the workflows are done by ical species occupying the site may vary. In order to simulate sub- using structure object interface (see structure deformation example stitutional disorder, one must generate the ensemble of structures below). Example of Structure object definition in pymatgen: that includes all statistically significant atomic distributions in a class Structure: given unit cell. This can be achieved by a brute force enumeration of all symmetrically unique atomic structures with a given number def __init__(self, lattice, species, coords, ...): of vacancies, impurities or solute atoms. The open source library """Create a periodic structure.""" enumlib [HF08] implements algorithms for such a systematic One consideration of note is that PDB, CIF and mmCIF structure enumeration of periodic structures. The enumlib package consists formats allow description of the positional disorder (for example, of several Fortran binaries and Python scripts that can be run as a a solvent molecule without a stable position within the cell subprocess (no Python bindings). This allows the user to generate which can be described by multiple sets of coordinates). Another a large set of symmetrically nonequivalent materials with different complication is that experimental data spans an interval of almost compositions (e.g. doping or defect concentration). a century: one of the oldest crystal structures deposited in the Recently, we applied this approach in simultaneous study of Cambridge Structural Database (CSD) [GBLW16] dates to 1924 the activity and stability of Pt based core-shell type catalysts for [HM24]. These nuances and others present nontrivial technical the oxygen reduction reaction [MGF+ 19]. We generated a set of challenges for developers. Thus, it has been a continuous effort stable doped Pt/transition metal/nitrogen surfaces using periodic by Schrödinger, Inc. (at least 39 commits and several weeks of enumeration. Using QE to perform periodic density functional 54 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Jaguar that took 457,265 CPU hours (~52 years) [MAS+ 20]. An- other similar case study is the high-throughput molecular dynam- ics simulations (MD) of thermophysical properties of polymers for various applications [ABG+ 21]. There, using Desmond we com- puted the glass transition temperature (Tg ) of 315 polymers and compared the results with experimental measurements [Bic02]. This study took advantage of GPU (graphics processing unit) support as implemented in Desmond, as well as the job scheduler API described above. Other workflows implemented in the Schrödinger Materials Science Suite utilize open source packages as well. For soft mate- rials (polymers, organic small molecules and substrates composed of soft molecules), convex hull and related mathematical methods Fig. 3: Example of the job submission process. are important for finding possible accessible solvent voids (during submerging or sorption) and adsorbate sites (during molecular deposition). These methods are conveniently implemented in the theory (DFT) calculations, we assessed surface phase diagrams open source SciPy [VGO+ 20] and NumPy [HMvdW+ 20] pack- for Pt alloys and identified the avenues for stabilizing the cost ages. Thus, we implemented molecular deposition and evaporation effective core-shell systems by a judicious choice of the catalyst workflows by using the Desmond MD engine as the backend core material. Such catalysts may prove critical in electrocatalysis in tandem with the convex hull functionality. This workflow for fuel cell applications. enables simulation of the deposition and evaporation of the small molecules on a substrate. We utilized the aforementioned deposition workflow in the study of organic light-emitting diodes Workflow capabilities (OLEDs), which are fabricated using a stepwise process, where In the last section, we briefly described a complete workflow from new layers are deposited on top of previous layers. Both vacuum structure generation and enumeration to periodic DFT calculations and solution deposition processes have been used to prepare these to analysis. In order to be able to run a massively parallel films, primarily as amorphous thin film active layers lacking screening of materials, a highly scalable and stable queuing system long-range order. Each of these deposition techniques introduces (job scheduler) is required. We have implemented a job queuing changes to the film structure and consequently, different charge- system on top of the most used queuing systems (LSF, PBS, transfer and luminescent properties [WKB+ 22]. SGE, SLURM, TORQUE, UGE) and exposed a Python API to As can be seen from above, a workflow is usually some submit and monitor jobs. In line with technological advancements, sort of structure modification through the structure object with cloud is also supported by means of a virtual cluster configured a subsequent call to a backend code and analysis of its output if with SLURM. This allows the user to submit a large number it succeeds. Input for the next iteration depends on the output of jobs, limited only by SLURM scheduling capabilities and of the previous iteration in some workflows. Due to the large cloud resources. In order to accommodate job dependencies in chemical and manipulation space of the materials, sometimes it workflows, for each job, a parent job (or multiple parent jobs) can very tricky to keep code for all workflows follow the same code be defined forming a directed graph of jobs (Figure 3). logic. For every workflow and/or functionality in the Materials There could be several reasons for a job to fail. Depending Science Suite, some sort of peer reviewed material (publication, on the reason of failure, there are several restart and recovery conference presentation) is created where implemented algorithms mechanisms in place. The lowest level is the restart mechanism are described to facilitate reproducibility. (in SLURM it is called requeue) which is performed by the queuing system itself. This is triggered when a node goes down. Data fitting algorithms and use cases On the cloud, preemptible instances (nodes) can go offline at any moment. In addition, workflows implemented in the proprietary Materials simulation engines for QM, periodic DFT, and classical Schrödinger Materials Science Suite have built-in methods for MD (referred to herein as backends) are frequently written in handling various types of failure. For example, if the simulation compiled languages with enabled parallelization for CPU or GPU is not converging to a requested energy accuracy, it is wasteful hardware. These backends are called from Python workflows to blindly restart the calculation without changing some input using the job queuing systems described above. Meanwhile, pack- parameters. However, in the case of a failure due to full disk ages such as SciPy and NumPy provide sophisticated numerical space, it is reasonable to try restart with hopes to get a node with function optimization and fitting capabilities. Here, we describe more empty disk space. If a job fails (and cannot be restarted), examples of how the Schrödinger suite can be used to combine all its children (if any) will not start, thus saving queuing and materials simulations with popular optimization routines in the computational time. SciPy ecosystem. Having developed robust systems for running calculations, job Recently we implemented convex analysis of queuing and troubleshooting (autonomously, when applicable), the stress strain curve (as described here [PKD18]). the developed workflows have allowed us and our customers to scipy.optimize.minimize is used for a constrained perform massive screenings of materials and their properties. For minimization with boundary conditions of a function related to example, we reported a massive screening of 250,000 charge- the stress strain curve. The stress strain curve is obtained from a conducting organic materials, totaling approximately 3,619,000 series of MD simulations on deformed cells (cell deformations DFT SCF (self-consistent field) single-molecule calculations using are defined by strain type and deformation step). The pressure UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 55 tensor of a deformed cell is related to stress. This analysis allowed and AutoQSAR [DDS+ 16] from the Schrödinger suite. Depending prediction of elongation at yield for high density polyethylene on the type of materials, benchmark data can be obtained using polymer. Figure 4 shows obtained calculated yield of 10% vs. different codes available in the Schrödinger suite: experimental value within 9-18% range [BAS+ 20]. • small molecules and finite systems - Jaguar The scipy.optimize package is used for a least-squares • periodic systems - Quantum ESPRESSO fit of the bulk energies at different cell volumes (compressed • larger polymeric and similar systems - Desmond and expanded) in order to obtain the bulk modulus and equation of state (EOS) of a material. In the Schrödinger suite this was Different materials systems require different descriptors for implemented as a part of an EOS workflow, in which fitting is featurization. For example, for crystalline periodic systems, we performed on the results obtained from a series of QE calculations have implemented several sets of tailored descriptors. Genera- performed on the original as well as compressed and expanded tion of these descriptors again uses a mix of open source and (deformed) cells. An example of deformation applied to a structure Schrödinger proprietary tools. Specifically: in pymatgen: • elemental features such as atomic weight, number of from pymatgen.analysis.elasticity import strain valence electrons in s, p and d-shells, and electronegativity from pymatgen.core import lattice from pymatgen.core import structure • structural features such as density, volume per atom, and packing fraction descriptors implemented in the open deform = strain.Deformation([ source matminer package [WDF+ 18] [1.0, 0.02, 0.02], • intercalation descriptors such as cation and anion counts, [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]) crystal packing fraction, and average neighbor ionicity [SYC+ 17] implemented in the Schrödinger suite latt = lattice.Lattice([ • three-dimensional smooth overlap of atomic positions [3.84, 0.00, 0.00], [1.92, 3.326, 0.00], (SOAP) descriptors implemented in the open source [0.00, -2.22, 3.14], DScribe package [HJM+ 20]. ]) We are currently training models that use these descriptors st = structure.Structure( to predict properties, such as bulk modulus, of a set of Li- latt, containing battery related compounds [Cha]. Several models will ["Si", "Si"], [[0, 0, 0], [0.75, 0.5, 0.75]]) be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+ 11]) and AutoQSAR. strained_st = deform.apply_to_structure(st) For isolated small molecules and extended non-periodic sys- This is also an example of loosely coupled (embarrassingly tems, RDKit can be used to generate a large number of atomic and parallel) jobs. In particular, calculations of the deformed cells molecular descriptors. A lot of effort has been devoted to ensure only depend on the bulk calculation and do not depend on each that RDKit can be used on a wide variety of materials that are other. Thus, all the deformation jobs can be submitted in parallel, supported by the Schrödinger suite. At the time of writing, the 4th facilitating high-throughput runs. most active contributor to RDKit is Ricardo Rodriguez-Schmidt Structure refinement from powder diffraction experiment is an- from Schrödinger [RDK]. other example where more complex optimization is used. Powder Recently, active learning (AL) combined with DFT has re- diffraction is a widely used method in drug discovery to assess ceived much attention to address the challenge of leveraging purity of the material and discover known or unknown crystal exhaustive libraries in materials informatics [VPB21], [SPA+ 19]. polymorphs [KBD+ 21]. In particular, there is interest in fitting of On our side, we have implemented a workflow that employs active the experimental powder diffraction intensity peaks to the indexed learning (AL) for intelligent and iterative identification of promis- peaks (Pawley refinement) [JPS92]. Here we employed the open ing materials candidates within a large dataset. In the framework of source lmfit package [NSA+ 16] to perform a minimization of AL, the predicted value with associated uncertainty is considered the multivariable Voigt-like function that represents the entire to decide what materials to be added in each iteration, aiming to diffraction spectrum. This allows the user to refine (optimize) unit improve the model performance in the next iteration (Figure 5). cell parameters coming from the indexing data and as the result, Since it could be important to consider multiple properties goodness of fit (R-factor) between experimental and simulated simultaneously in material discovery, multiple property optimiza- spectrum is minimized. tion (MPO) has also been implemented as a part of the AL work- flow [KAG+ 22]. MPO allows scaling and combining multiple properties into a single score. We employed the AL workflow Machine learning techniques to determine the top candidates for hole (positively charged Of late, there is great interest in machine learning assisted mate- carrier) transport layer (HTL) by evaluating 550 molecules in 10 rials discovery. There are several components required to perform iterations using DFT calculations for a dataset of ~9,000 molecules machine learning assisted materials discovery. In order to train a [AKA+ 22]. Resulting model was validated by randomly picking model, benchmark data from simulation and/or experimental data a molecule from the dataset, computing properties with DFT and is required. Besides benchmark data, computation of the relevant comparing those to the predicted values. According to the semi- descriptors is required (see below). Finally, a model based on classical Marcus equation [Mar93], high rates of hole transfer are benchmark data and descriptors is generated that allows prediction inversely proportional to hole reorganization energies. Thus, MPO of properties for novel materials. There are several techniques to scores were computed based on minimizing hole reorganization generate the model, such as linear or non-linear fitting to neural energy and targeting oxidation potential to an appropriate level to networks. Tools include the open source DeepChem [REW+ 19] ensure a low energy barrier for hole injection from the anode 56 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Left: The uniaxial stress/strain curve of a polymer calculated using Desmond through the stress strain workflow. The dark grey band indicates an inflection that marks the yield point. Right: Constant strain simulation with convex analysis indicates elongation at yield. The red curve shows simulated stress versus strain. The blue curve shows convex analysis. Fig. 5: Active learning workflow for the design and discovery of novel optoelectronics molecules. into the emissive layer. In this workflow, we used RDKit to of similar items (similar molecules). In this case, benchmark data compute descriptors for the chemical structures. These descriptors is only needed for few representatives of each cluster. We are generated on the initial subset of structures are given as vectors to currently working on applying this approach to train models for an algorithm based on Random Forest Regressor as implemented predicting physical properties of soft materials (polymers). in scikit-learn. Bayesian optimization is employed to tune the hyperparameters of the model. In each iteration, a trained model Conclusions is applied for making predictions on the remaining materials in We present several examples of how Schrödinger Materials Suite the dataset. Figure 6 (A) displays MPO scores for the HTL dataset integrates open source software packages. There is a wide range estimated by AL as a function of hole reorganization energies that of applications in materials science that can benefit from already are separately calculated for all the materials. This figure indicates existing open source code. Where possible, we report issues to that there are many materials in the dataset with desired low hole the package authors and submit improvements and bug fixes in reorganization energies but are not suitable for HTL due to their the form of the pull requests. We are thankful to all who have improper oxidation potentials, suggesting that MPO is important contributed to open source libraries, and have made it possible for to evaluate the optoelectronic performance of the materials. Figure us to develop a platform for accelerating innovation in materials 6 (B) presents MPO scores of the materials used in the training and drug discovery. We will continue contributing to these projects dataset of AL, demonstrating that the feedback loop in the AL and we hope to further give back to the scientific community by workflow efficiently guides the data collection as the size of the facilitating research in both academia and industry. We hope that training set increases. this report will inspire other scientific companies to give back to To appreciate the computational efficiency of such an ap- the open source community in order to improve the computational proach, it is worth noting that performing DFT calculations for materials field and make science more reproducible. all of the 9,000 molecules in the dataset would increase the computational cost by a factor of 15 versus the AL workflow. It Acknowledgments seems that AL approach can be useful in the cases where problem The authors acknowledge Bradley Dice and Wenduo Zhou for space is broad (like chemical space), but there are many clusters their valuable comments during the review of the manuscript. UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 57 Fig. 6: A: MPO score of all materials in the HTL dataset. B: Those used in the training set as a function of the hole reorganization energy ( λh ). R EFERENCES tal Engineering and Materials, 72, 2016. doi:10.1107/ S2052520616003954. [ABG+ 21] Mohammad Atif Faiz Afzal, Andrea R. Browning, Alexan- [HF08] Gus L.W. Hart and Rodney W. Forcade. Algo- der Goldberg, Mathew D. Halls, Jacob L. Gavartin, Tsuguo rithm for generating derivative structures. Physical Re- Morisato, Thomas F. Hughes, David J. Giesen, and Joseph E. view B - Condensed Matter and Materials Physics, 77, Goose. High-throughput molecular dynamics simulations and 2008. URL: https://github.com/msg-byu/enumlib/, doi:10. validation of thermophysical properties of polymers for var- 1103/PhysRevB.77.224115. ious applications. ACS Applied Polymer Materials, 3, 2021. [HJM+ 20] Lauri Himanen, Marc O.J. Jager, Eiaki V. Morooka, Fil- doi:10.1021/acsapm.0c00524. ippo Federici Canova, Yashasvi S. Ranawat, David Z. Gao, [AKA+ 22] Hadi Abroshan, H. Shaun Kwak, Yuling An, Christopher Patrick Rinke, and Adam S. Foster. Dscribe: Library of Brown, Anand Chandrasekaran, Paul Winget, and Mathew D. descriptors for machine learning in materials science. Com- Halls. Active learning accelerates design and optimization puter Physics Communications, 247, 2020. URL: https: of hole-transporting materials for organic electronics. Fron- //singroup.github.io/dscribe/latest/, doi:10.1016/j.cpc. tiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021. 2019.106949. 800371. [HM24] O Hassel and H Mark. The crystal structure of graphite. [BAS+ 20] A. R. Browning, M. A. F. Afzal, J. Sanders, A. Goldberg, Physik. Z, 25:317–337, 1924. A. Chandrasekaran, and H. S. Kwak. Polyolefin molecular [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der simulation for critical physical characteristics. International Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Polyolefins Conference, 2020. Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, [BDBF] Davide Brunato, Pietro Delugas, Giovanni Borghi, and Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Alexandr Fonari. qeschema. URL: https://github.com/QEF/ Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del qeschema. Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, [BHH+ 13] Art D. Bochevarov, Edward Harder, Thomas F. Hughes, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Jeremy R. Greenwood, Dale A. Braden, Dean M. Philipp, Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array David Rinaldo, Mathew D. Halls, Jing Zhang, and Richard A. programming with numpy, 2020. URL: https://numpy.org/, Friesner. Jaguar: A high-performance quantum chemistry doi:10.1038/s41586-020-2649-2. software program with strengths in life and materials sci- [HZU+ 20] Sebastiaan P. Huber, Spyros Zoupanos, Martin Uhrin, Leopold ences. International Journal of Quantum Chemistry, 113, Talirz, Leonid Kahle, Rico Hauselmann, Dominik Gresch, 2013. doi:10.1002/qua.24481. Tiziano Müller, Aliaksandr V. Yakutovich, Casper W. Ander- [Bic02] Jozef Bicerano. Prediction of polymer properties. cRc Press, sen, Francisco F. Ramirez, Carl S. Adorf, Fernando Gargiulo, 2002. Snehal Kumbhar, Elsa Passaro, Conrad Johnston, Andrius [Cha] A. Chandrasekaran. Active learning accelerated design of ionic Merkys, Andrea Cepellotti, Nicolas Mounet, Nicola Marzari, materials. in progress. Boris Kozinsky, and Giovanni Pizzi. Aiida 1.0, a scalable com- [DDS+ 16] Steven L. Dixon, Jianxin Duan, Ethan Smith, Christopher putational infrastructure for automated reproducible workflows D. Von Bargen, Woody Sherman, and Matthew P. Repasky. and data provenance. Scientific Data, 7, 2020. URL: https:// Autoqsar: An automated machine learning tool for best- www.aiida.net/, doi:10.1038/s41597-020-00638-4. practice quantitative structure-activity relationship modeling. [JPS92] J. Jansen, R. Peschar, and H. Schenk. Determination of Future Medicinal Chemistry, 8, 2016. doi:10.4155/fmc- accurate intensities from powder diffraction data. i. whole- 2016-0093. pattern fitting with a least-squares procedure. Journal [GAB+ 17] P. Giannozzi, O. Andreussi, T. Brumme, O. Bunau, M. Buon- of Applied Crystallography, 25, 1992. doi:10.1107/ giorno Nardelli, M. Calandra, R. Car, C. Cavazzoni, S0021889891012104. D. Ceresoli, M. Cococcioni, N. Colonna, I. Carnimeo, A. Dal [KAG+ 22] H. Shaun Kwak, Yuling An, David J. Giesen, Thomas F. Corso, S. De Gironcoli, P. Delugas, R. A. Distasio, A. Ferretti, Hughes, Christopher T. Brown, Karl Leswing, Hadi Abroshan, A. Floris, G. Fratesi, G. Fugallo, R. Gebauer, U. Gerstmann, and Mathew D. Halls. Design of organic electronic materials F. Giustino, T. Gorni, J. Jia, M. Kawamura, H. Y. Ko, with a goal-directed generative model powered by deep neural A. Kokalj, E. Kücükbenli, M. Lazzeri, M. Marsili, N. Marzari, networks and high-throughput molecular simulations. Fron- F. Mauri, N. L. Nguyen, H. V. Nguyen, A. Otero-De-La- tiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021. Roza, L. Paulatto, S. Poncé, D. Rocca, R. Sabatini, B. Santra, 800370. M. Schlipf, A. P. Seitsonen, A. Smogunov, I. Timrov, T. Thon- [KBD+ 21] James A Kaduk, Simon J L Billinge, Robert E Dinnebier, hauser, P. Umari, N. Vast, X. Wu, and S. Baroni. Advanced Nathan Henderson, Ian Madsen, Radovan Černý, Matteo capabilities for materials modelling with quantum espresso. Leoni, Luca Lutterotti, Seema Thakral, and Daniel Chateigner. Journal of Physics Condensed Matter, 29, 2017. URL: Powder diffraction. Nature Reviews Methods Primers, 1:77, https://www.quantum-espresso.org/, doi:10.1088/1361- 2021. URL: https://doi.org/10.1038/s43586-021-00074-7, 648X/aa8f79. doi:10.1038/s43586-021-00074-7. [GBLW16] Colin R. Groom, Ian J. Bruno, Matthew P. Lightfoot, and [LLC22] Schrödinger LLC. Schrödinger release 2022-2: Materials Suzanna C. Ward. The cambridge structural database. science suite, 2022. URL: https://www.schrodinger.com/ Acta Crystallographica Section B: Structural Science, Crys- platform/materials-science. 58 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [LMB+ 17] Ask Hjorth Larsen, Jens JØrgen Mortensen, Jakob Blomqvist, Ho, Douglas J. Ierardi, Lev Iserovich, Jeffrey S. Kuskin, Ivano E. Castelli, Rune Christensen, Marcin Dułak, Jesper Richard H. Larson, Timothy Layman, Li Siang Lee, Adam K. Friis, Michael N. Groves, BjØrk Hammer, Cory Hargus, Lerer, Chester Li, Daniel Killebrew, Kenneth M. Macken- Eric D. Hermes, Paul C. Jennings, Peter Bjerre Jensen, zie, Shark Yeuk Hai Mok, Mark A. Moraes, Rolf Mueller, James Kermode, John R. Kitchin, Esben Leonhard Kols- Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, Daniel bjerg, Joseph Kubal, Kristen Kaasbjerg, Steen Lysgaard, Ramot, John K. Salmon, Daniele P. Scarpazza, U. Ben Schafer, Jón Bergmann Maronsson, Tristan Maxson, Thomas Olsen, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, Lars Pastewka, Andrew Peterson, Carsten Rostgaard, Jakob Ping Tak Peter Tang, Michael Theobald, Horia Toma, Brian SchiØtz, Ole Schütt, Mikkel Strange, Kristian S. Thygesen, Towles, Benjamin Vitale, Stanley C. Wang, and Cliff Young. Tejs Vegge, Lasse Vilhelmsen, Michael Walter, Zhenhua Zeng, Anton 2: Raising the bar for performance and programmabil- and Karsten W. Jacobsen. The atomic simulation envi- ity in a special-purpose molecular dynamics supercomputer. ronment - a python library for working with atoms, 2017. volume 2015-January, 2014. doi:10.1109/SC.2014.9. URL: https://wiki.fysik.dtu.dk/ase/, doi:10.1088/1361- [SPA+ 19] Gabriel R. Schleder, Antonio C.M. Padilha, Carlos Mera 648X/aa680e. Acosta, Marcio Costa, and Adalberto Fazzio. From dft to [LTK+ 22] Greg Landrum, Paolo Tosco, Brian Kelley, Ric, sriniker, machine learning: Recent approaches to materials science - gedeck, Riccardo Vianello, NadineSchneider, Eisuke a review. JPhys Materials, 2, 2019. doi:10.1088/2515- Kawashima, Andrew Dalke, Dan N, David Cosgrove, 7639/ab084b. Gareth Jones, Brian Cole, Matt Swain, Samo Turk, [SYC+ 17] Austin D Sendek, Qian Yang, Ekin D Cubuk, Karel- AlexanderSavelyev, Alain Vaucher, Maciej Wójcikowski, Alexander N Duerloo, Yi Cui, and Evan J Reed. Holistic Ichiru Take, Daniel Probst, Kazuya Ujihara, Vincent F. computational structure screening of more than 12000 can- Scalfani, guillaume godin, Axel Pahl, Francois Berenger, didates for solid lithium-ion conductor materials. Energy and JLVarjo, strets123, JP, and DoliathGavid. rdkit. 6 2022. URL: Environmental Science, 10:306–320, 2017. doi:10.1039/ https://rdkit.org/, doi:10.5281/ZENODO.6605135. c6ee02697d. [Mar93] Rudolph A. Marcus. Electron transfer reactions in chemistry. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt theory and experiment. Reviews of Modern Physics, 65, 1993. Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, doi:10.1103/RevModPhys.65.599. Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- [MAS+ 20] Nobuyuki N. Matsuzawa, Hideyuki Arai, Masaru Sasago, Eiji fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- Fujii, Alexander Goldberg, Thomas J. Mustard, H. Shaun rod Millman, Nikolay Mayorov, Andrew R.J. Nelson, Eric Kwak, David J. Giesen, Fabio Ranalli, and Mathew D. Halls. Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat, Massive theoretical screen of hole conducting organic mate- Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, rials in the heteroacene family by using a cloud-computing Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- environment. Journal of Physical Chemistry A, 124, 2020. tero, Charles R. Harris, Anne M. Archibald, Antônio H. doi:10.1021/acs.jpca.9b10998. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, Aditya Vi- [MGF+ 19] Thomas Mustard, Jacob Gavartin, Alexandr Fonari, Caroline jaykumar, Alessandro Pietro Bardelli, Alex Rothberg, An- Krauter, Alexander Goldberg, H Kwak, Tsuguo Morisato, dreas Hilboll, Andreas Kloeckner, Anthony Scopatz, Antony Sudharsan Pandiyan, and Mathew Halls. Surface reactivity Lee, Ariel Rokem, C. Nathan Woods, Chad Fulton, Charles and stability of core-shell solid catalysts from ab initio combi- Masson, Christian Haggström, Clark Fitzgerald, David A. natorial calculations. volume 258, 2019. Nicholson, David R. Hagen, Dmitrii V. Pasechnik, Emanuele [NSA+ 16] Matthew Newville, Till Stensitzki, Daniel B Allen, Michal Olivetti, Eric Martin, Eric Wieser, Fabrice Silva, Felix Lenders, Rawlik, Antonino Ingargiola, and Andrew Nelson. Lmfit: Non- Florian Wilhelm, G. Young, Gavin A. Price, Gert Ludwig linear least-square minimization and curve-fitting for python. Ingold, Gregory E. Allen, Gregory R. Lee, Hervé Audren, Irvin Astrophysics Source Code Library, page ascl–1606, 2016. Probst, Jörg P. Dietrich, Jacob Silterra, James T. Webber, Janko URL: https://lmfit.github.io/lmfit-py/. Slavič, Joel Nothman, Johannes Buchner, Johannes Kulick, [OBJ+ 11] Noel M. O’Boyle, Michael Banck, Craig A. James, Chris Johannes L. Schönberger, José Vinícius de Miranda Cardoso, Morley, Tim Vandermeersch, and Geoffrey R. Hutchison. Joscha Reimer, Joseph Harrington, Juan Luis Cano Rodríguez, Open babel: An open chemical toolbox. Journal of Chem- Juan Nunez-Iglesias, Justin Kuczynski, Kevin Tritz, Martin informatics, 3, 2011. URL: https://openbabel.org/, doi: Thoma, Matthew Newville, Matthias Kümmerer, Maximilian 10.1186/1758-2946-3-33. Bolingbroke, Michael Tartre, Mikhail Pak, Nathaniel J. Smith, [ORJ+ 13] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Nikolai Nowaczyk, Nikolay Shebanov, Oleksandr Pavlyk, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Per A. Brodtkorb, Perry Lee, Robert T. McGibbon, Roman Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Feldbauer, Sam Lewis, Sam Tygier, Scott Sievert, Sebastiano Ceder. Python materials genomics (pymatgen): A robust, open- Vigna, Stefan Peterson, Surhud More, Tadeusz Pudlik, Takuya source python library for materials analysis. Computational Oshima, Thomas J. Pingel, Thomas P. Robitaille, Thomas Materials Science, 68, 2013. URL: https://pymatgen.org/, Spura, Thouis R. Jones, Tim Cera, Tim Leslie, Tiziano Zito, doi:10.1016/j.commatsci.2012.10.028. Tom Krauss, Utkarsh Upadhyay, Yaroslav O. Halchenko, and [PKD18] Paul N. Patrone, Anthony J. Kearsley, and Andrew M. Di- Yoshiki Vázquez-Baeza. Scipy 1.0: fundamental algorithms enstfrey. The role of data analysis in uncertainty quantifica- for scientific computing in python. Nature Methods, 17, 2020. tion: Case studies for materials modeling. volume 0, 2018. doi:10.1038/s41592-019-0686-2. doi:10.2514/6.2018-0927. [VPB21] Rama Vasudevan, Ghanshyam Pilania, and Prasanna V. Bal- [PVG+ 11] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vin- achandran. Machine learning for materials design and dis- cent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blon- covery. Journal of Applied Physics, 129, 2021. doi: del, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake 10.1063/5.0043300. Vanderplas, Alexandre Passos, David Cournapeau, Matthieu [WDF+ 18] Logan Ward, Alexander Dunn, Alireza Faghaninia, Nils E.R. Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit- Zimmermann, Saurabh Bajaj, Qi Wang, Joseph Montoya, learn: Machine learning in python. Journal of Machine Jiming Chen, Kyle Bystrom, Maxwell Dylla, Kyle Chard, Learning Research, 12, 2011. URL: https://scikit-learn.org/. Mark Asta, Kristin A. Persson, G. Jeffrey Snyder, Ian Foster, [RDK] Rdkit contributors. URL: https://github.com/rdkit/rdkit/ and Anubhav Jain. Matminer: An open source toolkit for graphs/contributors. materials data mining. Computational Materials Science, [REW+ 19] Bharath Ramsundar, Peter Eastman, Patrick Walters, 152, 2018. URL: https://hackingmaterials.lbl.gov/matminer/, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep doi:10.1016/j.commatsci.2018.05.018. Learning for the Life Sciences. O’Reilly Media, 2019. [WF05] John D. Westbrook and Paula M.D. Fitzgerald. The pdb https://www.amazon.com/Deep-Learning-Life-Sciences- format, mmcif formats, and other data formats, 2005. doi: Microscopy/dp/1492039837. 10.1002/0471721204.ch8. [SGB+ 14] David E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Bat- [WKB+ 22] Paul Winget, H. Shaun Kwak, Christopher T. Brown, Alexandr son, J. Adam Butts, Jack C. Chao, Martin M. Deneroff, Ron O. Fonari, Kevin Tran, Alexander Goldberg, Andrea R. Brown- Dror, Amos Even, Christopher H. Fenton, Anthony Forte, ing, and Mathew D. Halls. Organic thin films for oled appli- Joseph Gagliardo, Gennette Gill, Brian Greskamp, C. Richard cations: Influence of molecular structure, deposition method, UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 59 and deposition conditions. International Conference on the Science and Technology of Synthetic Metals, 2022. 60 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space Seyed Alireza Vaezi‡∗ , Gianni Orlando‡ , Mojtaba Fazli§ , Gary Ward¶ , Silvia Moreno‡ , Shannon Quinn‡ F Abstract—Toxoplasma gondii is the parasitic protozoan that causes dissem- individuals, the infection has fatal implications in fetuses and inated toxoplasmosis, a disease that is estimated to infect around one-third immunocompromised individuals [SG12] . T. gondii’s virulence of the world’s population. While the disease is commonly asymptomatic, the is directly linked to its lytic cycle which is comprised of invasion, success of the parasite is in large part due to its ability to easily spread through replication, egress, and motility. Studying the motility of T. gondii nucleated cells. The virulence of T. gondii is predicated on the parasite’s motility. is crucial in understanding its lytic cycle in order to develop Thus the inspection of motility patterns during its lytic cycle has become a topic of keen interest. Current cell tracking projects usually focus on cell images potential treatments. captured in 2D which are not a true representation of the actual motion of a For this reason, we present a novel pipeline to detect, segment, cell. Current 3D tracking projects lack a comprehensive pipeline covering all track, and classify the motility pattern of T. gondii in 3D space. phases of preprocessing, cell detection, cell instance segmentation, tracking, One of the main goals is to make our pipeline intuitively easy and motion classification, and merely implement a subset of the phases. More- to use so that the users who are not experienced in the fields of over, current 3D segmentation and tracking pipelines are not targeted for users machine learning (ML), deep learning (DL), or computer vision with less experience in deep learning packages. Our pipeline, TSeg, on the (CV) can still benefit from it. The other objective is to equip it with other hand, is developed for segmenting, tracking, and classifying the motility the most robust and accurate set of segmentation and detection phenotypes of T. gondii in 3D microscopic images. Although TSeg is built initially tools so that the end product has a broad generalization, allowing focusing on T. gondii, it provides generic functions to allow users with similar but distinct applications to use it off-the-shelf. Interacting with all of TSeg’s it to perform well and accurately for various cell types right off modules is possible through our Napari plugin which is developed mainly off the the shelf. familiar SciPy scientific stack. Additionally, our plugin is designed with a user- PlantSeg uses a variant of 3D U-Net, called Residual 3D U- friendly GUI in Napari which adds several benefits to each step of the pipeline Net, for preprocessing and segmentation of multiple cell types such as visualization and representation in 3D. TSeg proves to fulfill a better [WCV+ 20]. PlantSeg performs best among Deep Learning algo- generalization, making it capable of delivering accurate results with images of rithms for 3D Instance Segmentation and is very robust against other cell types. image noise [KPR+ 21]. The segmentation module also includes the optional use of CellPose [SWMP21]. CellPose is a generalized Introduction segmentation algorithm trained on a wide range of cell types Quantitative cell research often requires the measurement of and is the first step toward increased optionality in TSeg. The different cell properties including size, shape, and motility. This Cell Tracking module consolidates the cell particles across the z- step is facilitated using segmentation of imaged cells. With flu- axis to materialize cells in 3D space and estimates centroids for orescent markers, computational tools can be used to complete each cell. The tracking module is also responsible for extracting segmentation and identify cell features and positions over time. the trajectories of cells based on the movements of centroids 2D measurements of cells can be useful, but the more difficult task throughout consecutive video frames, which is eventually the input of deriving 3D information from cell images is vital for metrics of the motion classifier module. such as motility and volumetric qualities. Most of the state-of-the-art pipelines are restricted to 2D space Toxoplasmosis is an infection caused by the intracellular which is not a true representative of the actual motion of the parasite Toxoplasma gondii. T. gondii is one of the most suc- organism. Many of them require knowledge and expertise in pro- cessful parasites, infecting at least one-third of the world’s pop- gramming, or in machine learning and deep learning models and ulation. Although Toxoplasmosis is generally benign in healthy frameworks, thus limiting the demographic of users that can use them. All of them solely include a subset of the aforementioned * Corresponding author: sv22900@uga.edu modules (i.e. detection, segmentation, tracking, and classification) ‡ University of Georgia § harvard University [SWMP21]. Many pipelines rely on the user to train their own ¶ University of Vermont model, hand-tailored for their specific application. This demands high levels of experience and skill in ML/DL and consequently Copyright © 2022 Seyed Alireza Vaezi et al. This is an open-access article undermines the possibility and feasibility of quickly utilizing an distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, off-the-shelf pipeline and still getting good results. provided the original author and source are credited. To address these we present TSeg. It segments T. gondii cells A NOVEL PIPELINE FOR CELL INSTANCE SEGMENTATION, TRACKING AND MOTILITY CLASSIFICATION OF TOXOPLASMA GONDII IN 3D SPACE 61 As an example, Fazli et al. [FVMQ18] identified three distinct motility types for T. gondii with two-dimensional data, however, they also acknowledge and state that based established heuristics from previous works there are more than three motility phenotypes for T. gondii. The focus on 2D research is understandable due to several factors. 3D data is difficult to capture as tools for capturing 3D slices and the computational requirements for analyzing this data are not available in most research labs. Most segmentation tools are unable to track objects in 3D space as the assignment of related centroids is more difficult. The additional noise from cap- ture and focus increases the probability of incorrect assignment. 3D data also has issues with overlapping features and increased computation required per frame of time. Fazli et al. [FVMQ18] studies the motility patterns of T. gondii and provides a computational pipeline for identifying motility phenotypes of T. gondii in an unsupervised, data-driven way. In that work Ca2+ is added to T. gondii cells inside a Fetal Bovine Serum. T. gondii cells react to Ca2+ and become motile and fluorescent. The images of motile T. gondii cells were captured using an LSM 710 confocal microscope. They use Python 3 and associated scientific computing libraries (NumPy, SciPy, scikit- learn, matplotlib) in their pipeline to track and cluster the trajecto- ries of T. gondii. Based on this work Fazli et al. [FVM+ 18] work on another pipeline consisting of preprocessing, sparsification, cell detection, and cell tracking modules to track T. gondii in 3D video microscopy where each frame of the video consists of image slices taken 1 micro-meters of focal depth apart along the z-axis Fig. 1: The overview of TSeg’s architecture. direction. In their latest work Fazli et al. [FSA+ 19] developed a lightweight and scalable pipeline using task distribution and paral- lelism. Their pipeline consists of multiple modules: reprocessing, in 3D microscopic images, tracks their trajectories, and classifies sparsification, cell detection, cell tracking, trajectories extraction, the motion patterns observed throughout the 3D frames. TSeg is parametrization of the trajectories, and clustering. They could comprised of four modules: pre-processing, segmentation, track- classify three distinct motion patterns in T. gondii using the same ing, and classification. We developed TSeg as a plugin for Napari data from their previous work. [SLE+ 22] - an open-source fast and interactive image viewer for While combining open source tools is not a novel architecture, Python designed for browsing, annotating, and analyzing large little has been done to integrate 3D cell tracking tools. Fazeli et multi-dimensional images. Having TSeg implemented as a part of al. [FRF+ 20] motivated by the same interest in providing better Napari not only provides a user-friendly design but also gives more tools to non-software professionals created a 2D cell tracking advanced users the possibility to attach and execute their custom pipeline. This pipeline combines Stardist [WSH+ 20] and Track- code and even interact with the steps of the pipeline if needed. Mate [TPS+ 17] for automated cell tracking. This pipeline begins The preprocessing module is equipped with basic and extra filters with the user loading cell images and centroid approximations to and functionalities to aid in the preparation of the input data. the ZeroCostDL4Mic [vCLJ+ 21] platform. ZeroCostDL4Mic is TSeg gives its users the advantage of utilizing the functionalities a deep learning training tool for those with no coding expertise. that PlantSeg and CellPose provide. These functionalities can be Once the platform is trained and masks for the training set are chosen in the pre-processing, detection, and segmentation steps. made for hand-drawn annotations, the training set can be input This brings forth a huge variety of algorithms and pre-built models to Stardist. Stardist performs automated object detection using to select from, making TSeg not only a great fit for T. gindii, but Euclidean distance to probabilistically determine cell pixels versus also a variety of different cell types. background pixels. Lastly, Trackmate uses segmentation images to The rest of this paper is structured as follows: After briefly re- track labels between timeframes and display analytics. viewing the literature in Related Work, we move on to thoroughly This Stardist pipeline is similar in concept to TSeg. Both describe the details of our work in the Method section. Following create an automated segmentation and tracking pipeline but TSeg that, the Results section depicts the results of comprehensive tests is oriented to 3D data. Cells move in 3-dimensional space that of our plugin on T. gondii cells. is not represented in a flat plane. TSeg also does not require the manual training necessary for the other pipeline. Individuals Related Work with low technical expertise should not be expected to create The recent solutions in generalized and automated segmentation masks for training or even understand the training of deep neural tools are focused on 2D cell images. Segmentation of cellular networks. Lastly, this pipeline does not account for imperfect structures in 2D is important but not representative of realistic datasets without the need for preprocessing. All implemented environments. Microbiological organisms are free to move on the algorithms in TSeg account for microscopy images with some z-axis and tracking without taking this factor into account cannot amount of noise. guarantee a full representation of the actual motility patterns. Wen et al. [WMV+ 21] combines multiple existing new tech- 62 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) nologies including deep learning and presents 3DeeCellTracker. user. The full code of TSeg is available on GitHub under the MIT 3DeeCellTracker segments and tracks cells on 3D time-lapse open source license at https://github.com/salirezav/tseg. TSeg can images. Using a small subset of their dataset they train the deep be installed through Napari’s plugins menu. learning architecture 3D U-Net for segmentation. For tracking, a combination of two strategies was used to increase accuracy: Computational Pipeline local cell region strategies, and spatial pattern strategy. Kapoor Pre-Processing: Due to the fast imaging speed in data et al. [KC21] presents VollSeg that uses deep learning methods acquisition, the image slices will inherently have a vignetting to segment, track, and analyze cells in 3D with irregular shape artifact, meaning that the corners of the images will be slightly and intensity distribution. It is a Jupyter Notebook-based Python darker than the center of the image. To eliminate this artifact we package and also has a UI in Napari. For tracking, a custom added adaptive thresholding and logarithmic correction to the pre- tracking code is developed based on Trackmate. processing module. Furthermore, another prevalent artifact on our Many segmentation tools require some amount of knowledge dataset images was a Film-Grain noise (AKA salt and pepper in Machine or Deep Learning concepts. Training the neural noise). To remove or reduce such noise a simple gaussian blur network in creating masks is a common step for open-source filter and a sharpening filter are included. segmentation tools. Automating this process makes the pipeline Cell Detection and Segmentation: TSeg’s Detection and more accessible to microbiology researchers. Segmentation modules are in fact backed by PlantSeg and Cell- Pose. The Detection Module is built only based on PlantSeg’s Method CNN Detection Module [WCV+ 20] , and for the Segmentation Module, only one of the three tools can be selected to be executed Data as the segmentation tool in the pipeline. Naturally, each of the tools Our dataset consists of 11 videos of T. gondii cells under a demands specific interface elements different from the others since microscope, obtained from different experiments with different each accepts different input values and various parameters. TSeg numbers of cells. The videos are on average around 63 frames in orchestrates this and makes sure the arguments and parameters are length. Each frame has a stack of 41 image slices of size 500×502 passed to the corresponding selected segmentation tool properly pixels along the z-axis (z-slices). The z-slices are captured 1µm and the execution will be handled accordingly. The parameters apart in optical focal length making them 402µm×401µm×40µm include but are not limited to input data location, output directory, in volume. The slices were recorded in raw format as RGB TIF and desired segmentation algorithm. This allows the end-user images but are converted to grayscale for our purpose. This data complete control over the process and feedback from each step is captured using a PlanApo 20x objective (NA = 0:75) on a of the process. The preprocessed images and relevant parameters preheated Nikon Eclipse TE300 epifluorescence microscope. The are sent to a modular segmentation controller script. As an effort image stacks were captured using an iXon 885 EMCCD camera to allow future development on TSeg, the segmentation controller (Andor Technology, Belfast, Ireland) cooled to -70oC and driven script shows how the pipeline integrates two completely different by NIS Elements software (Nikon Instruments, Melville, NY) as segmentation packages. While both PlantSeg and CellPose use part of related research by Ward et al. [LRK+ 14]. The camera was conda environments, PlantSeg requires modification of a YAML set to frame transfer sensor mode, with a vertical pixel shift speed file for initialization while CellPose initializes directly from com- of 1:0 µs, vertical clock voltage amplitude of +1, readout speed mand line parameters. In order to implement PlantSeg, TSeg gen- of 35MHz, conversion gain of 3:8×, EM gain setting of 3 and 22 erates a YAML file based on GUI input elements. After parameters binning, and the z-slices were imaged with an exposure time of are aligned, the conda environment for the chosen segmentation 16ms. algorithm is opened in a subprocess. The $CONDA_PREFIX environment variable allows the bash command to start conda and Software context switch to the correct segmentation environment. Napari Plugin: TSeg is developed as a plugin for Napari - Tracking: Features in each segmented image are found a fast and interactive multi-dimensional image viewer for python using the scipy label function. In order to reduce any leftover that allows volumetric viewing of 3D images [SLE+ 22]. Plugins noise, any features under a minimum size are filtered out and enable developers to customize and extend the functionality of considered leftover noise. After feature extraction, centroids are Napari. For every module of TSeg, we developed its corresponding calculated using the center of mass function in scipy. The centroid widget in the GUI, plus a widget for file management. The widgets of the 3D cell can be used as a representation of the entire have self-explanatory interface elements with tooltips to guide body during tracking. The tracking algorithm goes through each the inexperienced user to traverse through the pipeline with ease. captured time instance and connects centroids to the likely next Layers in Napari are the basic viewable objects that can be shown movement of the cell. Tracking involves a series of measures in or- in the Napari viewer. Seven different layer types are supported der to avoid incorrect assignments. An incorrect assignment could in Napari: Image, Labels, Points, Shapes, Surface, Tracks, and lead to inaccurate result sets and unrealistic motility patterns. If the Vectors, each of which corresponds to a different data type, same number of features in each frame of time could be guaranteed visualization, and interactivity [SLE+ 22]. After its execution, the from segmentation, minimum distance could assign features rather viewable output of each widget gets added to the layers. This accurately. Since this is not a guarantee, the Hungarian algorithm allows the user to evaluate and modify the parameters of the must be used to associate a COST with the assignment of feature widget to get the best results before continuing to the next widget. tracking. The Hungarian method is a combinatorial optimization Napari supports bidirectional communication between the viewer algorithm that solves the assignment problem in polynomial time. and the Python kernel and has a built-in console that allows users COST for the tracking algorithm determines which feature is the to control all the features of the viewer programmatically. This next iteration of the cell’s tracking through the complete time adds more flexibility and customizability to TSeg for the advanced series. The combination of distance between centroids for all A NOVEL PIPELINE FOR CELL INSTANCE SEGMENTATION, TRACKING AND MOTILITY CLASSIFICATION OF TOXOPLASMA GONDII IN 3D SPACE 63 previous points and the distance to the potential new centroid. [LRK+ 14] Jacqueline Leung, Mark Rould, Christoph Konradt, Christopher If an optimal next centroid can’t be found within an acceptable Hunter, and Gary Ward. Disruption of tgphil1 alters specific parameters of toxoplasma gondii motility measured in a quanti- distance of the current point, the tracking for the cell is considered tative, three-dimensional live motility assay. PloS one, 9:e85763, as complete. Likewise, if a feature is not assigned to a current 01 2014. doi:10.1371/journal.pone.0085763. centroid, this feature is considered a new object and is tracked as [SG12] Geita Saadatnia and Majid Golkar. A review on human toxoplas- the algorithm progresses. The complete path for each feature is mosis. Scandinavian journal of infectious diseases, 44(11):805– 814, 2012. doi:10.3109/00365548.2012.693197. then stored for motility analysis. [SLE+ 22] Nicholas Sofroniew, Talley Lambert, Kira Evans, Juan Nunez- Motion Classification: To classify the motility pattern of Iglesias, Grzegorz Bokota, Philip Winston, Gonzalo Peña- T. gondii in 3D space in an unsupervised fashion we implement Castellanos, Kevin Yamauchi, Matthias Bussonnier, Draga Don- cila Pop, Ahmet Can Solak, Ziyang Liu, Pam Wadhwa, Al- and use the method that Fazli et. al. introduced [FSA+ 19]. In that ister Burt, Genevieve Buckley, Andrew Sweet, Lukasz Mi- work, they used an autoregressive model (AR); a linear dynamical gas, Volker Hilsenstein, Lorenzo Gaifas, Jordão Bragantini, system that encodes a Markov-based transition prediction method. Jaime Rodríguez-Guerra, Hector Muñoz, Jeremy Freeman, Peter The reason is that although K-means is a favorable clustering Boone, Alan Lowe, Christoph Gohlke, Loic Royer, Andrea PIERRÉ, Hagai Har-Gil, and Abigail McGovern. napari: a multi- algorithm, there are a few drawbacks to it and to the conventional dimensional image viewer for Python, May 2022. If you use methods that draw them impractical. Firstly, K-means assumes Eu- this software, please cite it using these metadata. URL: https: clidian distance, but AR motion parameters are geodesics that do //doi.org/10.5281/zenodo.6598542, doi:10.5281/zenodo. 6598542. not reside in a Euclidean space, and secondly, K-means assumes [SWMP21] Carsen Stringer, Tim Wang, Michalis Michaelos, and Marius isotropic clusters, however, although AR motion parameters may Pachitariu. Cellpose: a generalist algorithm for cellular segmen- exhibit isotropy in their space, without a proper distance metric, tation. Nature methods, 18(1):100–106, 2021. doi:10.1101/ this issue cannot be clearly examined [FSA+ 19]. 2020.02.02.931238. [TPS+ 17] Jean-Yves Tinevez, Nick Perry, Johannes Schindelin, Genevieve M. Hoopes, Gregory D. Reynolds, Emmanuel Laplantine, Sebastian Y. Bednarek, Spencer L. Shorte, and Conclusion and Discussion Kevin W. Eliceiri. Trackmate: An open and extensible platform TSeg is an easy to use pipeline designed to study the motility for single-particle tracking. Methods, 115:80–90, 2017. Image Processing for Biologists. URL: https://www.sciencedirect. patterns of T. gondii in 3D space. It is developed as a plugin com/science/article/pii/S1046202316303346, doi:https: for Napari and is equipped with a variety of deep learning based //doi.org/10.1016/j.ymeth.2016.09.016. segmentation tools borrowed from PlantSeg and CellPose, making [vCLJ+ 21] Lucas von Chamier, Romain F Laine, Johanna Jukkala, Christoph Spahn, Daniel Krentzel, Elias Nehme, Martina it a suitable off-the-shelf tool for applications incorporating im- Lerche, Sara Hernández-Pérez, Pieta K Mattila, Eleni Kari- ages of cell types not limited to T. gondii. Future work on TSeg nou, et al. Democratising deep learning for microscopy with includes the expantion of implemented algorithms and tools in its zerocostdl4mic. Nature communications, 12(1):1–18, 2021. preprocessing, segmentation, tracking, and clustering modules. doi:10.1038/s41467-021-22518-0. [WCV+ 20] Adrian Wolny, Lorenzo Cerrone, Athul Vijayan, Rachele To- fanelli, Amaya Vilches Barro, Marion Louveaux, Christian Wenzl, Sören Strauss, David Wilson-Sánchez, Rena Lymbouri- R EFERENCES dou, Susanne S Steigleder, Constantin Pape, Alberto Bailoni, Salva Duran-Nebreda, George W Bassel, Jan U Lohmann, Mil- [FRF+ 20] Elnaz Fazeli, Nathan H Roy, Gautier Follain, Romain F Laine, tos Tsiantis, Fred A Hamprecht, Kay Schneitz, Alexis Maizel, Lucas von Chamier, Pekka E Hänninen, John E Eriksson, Jean- and Anna Kreshuk. Accurate and versatile 3d segmenta- Yves Tinevez, and Guillaume Jacquemet. Automated cell track- tion of plant tissues at cellular resolution. eLife, 9:e57613, ing using stardist and trackmate. F1000Research, 9, 2020. jul 2020. URL: https://doi.org/10.7554/eLife.57613, doi:10. doi:10.12688/f1000research.27019.1. 7554/eLife.57613. [FSA+ 19] Mojtaba Sedigh Fazli, Rachel V Stadler, BahaaEddin Alaila, [WMV+ 21] Chentao Wen, Takuya Miura, Venkatakaushik Voleti, Kazushi Stephen A Vella, Silvia NJ Moreno, Gary E Ward, and Shannon Yamaguchi, Motosuke Tsutsumi, Kei Yamamoto, Kohei Otomo, Quinn. Lightweight and scalable particle tracking and motion Yukako Fujie, Takayuki Teramoto, Takeshi Ishihara, Kazuhiro clustering of 3d cell trajectories. In 2019 IEEE International Aoki, Tomomi Nemoto, Elizabeth Mc Hillman, and Koutarou D Conference on Data Science and Advanced Analytics (DSAA), Kimura. 3DeeCellTracker, a deep learning-based pipeline for pages 412–421. IEEE, 2019. doi:10.1109/dsaa.2019. segmenting and tracking cells in 3D time lapse images. Elife, 10, 00056. March 2021. URL: https://doi.org/10.7554/eLife.59187, doi: [FVM 18] Mojtaba S Fazli, Stephen A Vella, Silvia NJ Moreno, Gary E + 10.7554/eLife.59187. Ward, and Shannon P Quinn. Toward simple & scalable 3d cell [WSH+ 20] Martin Weigert, Uwe Schmidt, Robert Haase, Ko Sugawara, tracking. In 2018 IEEE International Conference on Big Data and Gene Myers. Star-convex polyhedra for 3d object detec- (Big Data), pages 3217–3225. IEEE, 2018. doi:10.1109/ tion and segmentation in microscopy. In 2020 IEEE Winter BigData.2018.8622403. Conference on Applications of Computer Vision (WACV). IEEE, [FVMQ18] Mojtaba S Fazli, Stephen A Velia, Silvia NJ Moreno, and mar 2020. URL: https://doi.org/10.1109%2Fwacv45572.2020. Shannon Quinn. Unsupervised discovery of toxoplasma gondii 9093435, doi:10.1109/wacv45572.2020.9093435. motility phenotypes. In 2018 IEEE 15th International Sympo- sium on Biomedical Imaging (ISBI 2018), pages 981–984. IEEE, 2018. doi:10.1109/isbi.2018.8363735. [KC21] Varun Kapoor and Claudia Carabaña. Cell tracking in 3d using deep learning segmentations. In Python in Science Con- ference, pages 154–161, 2021. doi:10.25080/majora- 1b6fd038-014. [KPR+ 21] Anuradha Kar, Manuel Petit, Yassin Refahi, Guillaume Cerutti, Christophe Godin, and Jan Traas. Assessment of deep learning algorithms for 3d instance segmentation of confocal image datasets. bioRxiv, 2021. URL: https: //www.biorxiv.org/content/early/2021/06/10/2021.06.09.447748, arXiv:https://www.biorxiv.org/content/ early/2021/06/10/2021.06.09.447748.full. pdf, doi:10.1101/2021.06.09.447748. 64 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) The myth of the normal curve and what to do about it Allan Campopiano∗ F Index Terms—Python, R, robust statistics, bootstrapping, trimmed mean, data science, hypothesis testing Reliance on the normal curve as a tool for measurement is almost a given. It shapes our grading systems, our measures of intelligence, and importantly, it forms the mathematical backbone of many of our inferential statistical tests and algorithms. Some even call it “God’s curve” for its supposed presence in nature [Mic89]. Scientific fields that deal in explanatory and predictive statis- tics make particular use of the normal curve, often using it to conveniently define thresholds beyond which a result is considered statistically significant (e.g., t-test, F-test). Even familiar machine learning models have, buried in their guts, an assumption of the normal curve (e.g., LDA, gaussian naive Bayes, logistic & linear regression). The normal curve has had a grip on us for some time; the Fig. 1: Standard normal (orange) and contaminated normal (blue). The variance of the contaminated curve is more than 10 times that aphorism by Cramer [Cra46] still rings true for many today: of the standard normal curve. This can cause serious issues with “Everyone believes in the [normal] law of errors, the statistical power when using traditional hypothesis testing methods. experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact.” new Python library for robust hypothesis testing will be introduced Many students of statistics learn that N=40 is enough to ignore along with an interactive tool for robust statistics education. the violation of the assumption of normality. This belief stems from early research showing that the sampling distribution of the The contaminated normal mean quickly approaches normal, even when drawing from non- normal distributions—as long as samples are sufficiently large. It One of the most striking counterexamples of “N=40 is enough” is common to demonstrate this result by sampling from uniform is shown when sampling from the so-called contaminated normal and exponential distributions. Since these look nothing like the [Tuk60][Tan82]. This distribution is also bell shaped and sym- normal curve, it was assumed that N=40 must be enough to avoid metrical but it has slightly heavier tails when compared to the practical issues when sampling from other types of non-normal standard normal curve. That is, it contains outliers and is difficult distributions [Wil13]. (Others reached similar conclusions with to distinguish from a normal distribution with the naked eye. different methodology [Gle93].) Consider the distributions in Figure 1. The variance of the normal Two practical issues have since been identified based on this distribution is 1 but the variance of the contaminated normal is early research: (1) The distributions under study were light tailed 10.9! (they did not produce outliers), and (2) statistics other than the The consequence of this inflated variance is apparent when sample mean were not tested and may behave differently. In examining statistical power. To demonstrate, Figure 2 shows two the half century following these early findings, many important pairs of distributions: On the left, there are two normal distribu- discoveries have been made—calling into question the usefulness tions (variance 1) and on the right there are two contaminated of the normal curve [Wil13]. distributions (variance 10.9). Both pairs of distributions have a The following sections uncover various pitfalls one might mean difference of 0.8. Wilcox [Wil13] showed that by taking encounter when assuming normality—especially as they relate to random samples of N=40 from each normal curve, and comparing hypothesis testing. To help researchers overcome these problems, a them with Student’s t-test, statistical power was approximately 0.94. However, when following this same procedure for the * Corresponding author: allan@deepnote.com contaminated groups, statistical power was only 0.25. The point here is that even small apparent departures from Copyright © 2022 Allan Campopiano. This is an open-access article dis- normality, especially in the tails, can have a large impact on tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- commonly used statistics. The problems continue to get worse vided the original author and source are credited. when examining effect sizes but these findings are not discussed THE MYTH OF THE NORMAL CURVE AND WHAT TO DO ABOUT IT 65 Fig. 2: Two normal curves (left) and two contaminated normal curves (right). Despite the obvious effect sizes (∆ = 0.8 for both pairs) as well as the visual similarities of the distributions, power is only ~0.25 under contamination; however, power is ~0.94 under normality (using Student’s t-test). in this article. Interested readers should see Wilcox’s 1992 paper Fig. 3: Actual t-distribution (orange) and assumed t-distribution (blue). When simulating a t-distribution based on a lognormal curve, [Wil92]. T does not follow the assumed shape. This can cause poor probability Perhaps one could argue that the contaminated normal dis- coverage and increased Type I Error when using traditional hypothe- tribution actually represents an extreme departure from normal- sis testing approaches. ity and therefore should not be taken seriously; however, dis- tributions that generate outliers are likely common in practice [HD82][Mic89][Wil09]. A reasonable goal would then be to Modern robust methods choose methods that perform well under such situations and When it comes to hypothesis testing, one intuitive way of dealing continue to perform well under normality. In addition, serious with the issues described above would be to (1) replace the issues still exist even when examining light-tailed and skewed sample mean (and standard deviation) with a robust alternative distributions (e.g., lognormal), and statistics other than the sample and (2) use a non-parametric resampling technique to estimate the mean (e.g., T). These findings will be discussed in the following sampling distribution (rather than assuming a theoretical shape)1 . section. Two such candidates are the 20% trimmed mean and the percentile bootstrap test, both of which have been shown to have practical value when dealing with issues of outliers and non-normality Student’s t-distribution [CvNS18][Wil13]. Another common statistic is the T value obtained from Student’s t-test. As will be demonstrated, T is more sensitive to violations of The trimmed mean normality than the sample mean (which has already been shown to not be robust). This is despite the fact that the t-distribution is The trimmed mean is nothing more than sorting values, removing also bell shaped, light tailed, and symmetrical—a close relative of a proportion from each tail, and computing the mean on the the normal curve. remaining values. Formally, The assumption is that T follows a t-distribution (and with • Let X1 ...Xn be a random sample and X(1) ≤ X(2) ... ≤ X(n) large samples it approaches normality). We can test this assump- be the observations in ascending order tion by generating random samples from a lognormal distribution. • The proportion to trim is γ(0 ≤ γ ≤ .5) Specifically, 5000 datasets of sample size 20 were randomly drawn • Let g = bγnc. That is, the proportion to trim multiplied by from a lognormal distribution using SciPy’s lognorm.rvs n, rounded down to the nearest integer function. For each dataset, T was calculated and the resulting t- distribution was plotted. Figure 3 shows that the assumption that Then, in symbols, the trimmed mean can be expressed as T follows a t-distribution does not hold. follows: With N=20, the assumption is that with a probability of 0.95, X(g+1) + ... + X(n−g) T will be between -2.09 and 2.09. However, when sampling from X̄t = n − 2g a lognormal distribution in the manner just described, there is actually a 0.95 probability that T will be between approximately If the proportion to trim is 0.2, more than twenty percent of -4.2 and 1.4 (i.e., the middle 95% of the actual t-distribution is the values would have to be altered to make the trimmed mean much wider than the assumed t-distribution). Based on this result arbitrarily large or small. The sample mean, on the other hand, we can conclude that sampling from skewed distributions (e.g., can be made to go to ±∞ (arbitrarily large or small) by changing lognormal) leads to increased Type I Error when using Student’s a single value. The trimmed mean is more robust than the sample t-test [Wil98]. mean in all measures of robustness that have been studied [Wil13]. In particular the 20% trimmed mean has been shown to have “Surely the hallowed bell-shaped curve has cracked practical value as it avoids issues associated with the median (not from top to bottom. Perhaps, like the Liberty Bell, it discussed here) and still protects against outliers. should be enshrined somewhere as a memorial to more heroic days — Earnest Ernest, Philadelphia Inquirer. 10 1. Another option is to use a parametric test that assumes a different November 1974. [FG81]” underlying model. 66 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) The percentile bootstrap test best experienced in tandem with Wilcox’s book “Introduction to In most traditional parametric tests, there is an assumption that Robust Estimation and Hypothesis Testing”. the sampling distribution has a particular shape (normal, f- Hypothesize brings many of these functions into the open- distribution, t-distribution, etc). We can use these distributions source Python library ecosystem with the goal of lowering the to test the null hypothesis; however, as discussed, the theoretical barrier to modern robust methods—even for those who have distributions are not always approximated well when violations of not had extensive training in statistics or coding. With modern assumptions occur. Non-parametric resampling techniques such browser-based notebook environments (e.g., Deepnote), learning as bootstrapping and permutation tests build empirical sampling to use Hypothesize can be relatively straightforward. In fact, every distributions, and from these, one can robustly derive p-values and statistical test listed in the docs is associated with a hosted note- CIs. One example is the percentile bootstrap test [Efr92][TE93]. book, pre-filled with sample data and code. But certainly, simply The percentile bootstrap test can be thought of as an al- pip install Hypothesize to use Hypothesize in any en- gorithm that uses the data at hand to estimate the underlying vironment that supports Python. See van Noordt and Willoughby sampling distribution of a statistic (pulling yourself up by your [vNW21] and van Noordt et al. [vNDTE22] for examples of own bootstraps, as the saying goes). This approach is in contrast Hypothesize being used in applied research. to traditional methods that assume the sampling distribution takes The API for Hypothesize is organized by single- and two- a particular shape). The percentile boostrap test works well with factor tests, as well as measures of association. Input data for small sample sizes, under normality, under non-normality, and it the groups, conditions, and measures are given in the form of a easily extends to multi-group tests (ANOVA) and measures of Pandas DataFrame [pdt20][WM10]. By way of example, one can association (correlation, regression). For a two-sample case, the compare two independent groups (e.g., placebo versus treatment) steps to compute the percentile bootstrap test can be described as using the 20% trimmed mean and the percentile bootstrap test, as follows: follows (note that Hypothesize uses the naming conventions found in WRS): 1) Randomly resample with replacement n values from group one from hypothesize.utilities import trim_mean from hypothesize.compare_groups_with_single_factor \ 2) Randomly resample with replacement n values from import pb2gen group two 3) Compute X̄1 − X̄2 based on you new sample (the mean results = pb2gen(df.placebo, df.treatment, trim_mean) difference) 4) Store the difference & repeat steps 1-3 many times (say, As shown below, the results are returned as a Python dictionary 1000) containing the p-value, confidence intervals, and other important 5) Consider the middle 95% of all differences (the confi- details. dence interval) { 6) If the confidence interval contains zero, there is no 'ci': [-0.22625614592148624, 0.06961754796950131], statistical difference, otherwise, you can reject the null 'est_1': 0.43968438076483285, 'est_2': 0.5290985245430996, hypothesis (there is a statistical difference) 'est_dif': -0.08941414377826673, 'n1': 50, 'n2': 50, Implementing and teaching modern robust methods 'p_value': 0.27, 'variance': 0.005787027326924963 Despite over a half a century of convincing findings, and thousands } of papers, robust statistical methods are still not widely adopted in applied research [EHM08][Wil98]. This may be due to various For measuring associations, several options exist in Hypothesize. false beliefs. For example, One example is the Winsorized correlation which is a robust • Classical methods are robust to violations of assumptions alternative to Pearson’s R. For example, • Correcting non-normal distributions by transforming the from hypothesize.measuring_associations import wincor data will solve all issues • Traditional non-parametric tests are suitable replacements results = wincor(df.height, df.weight, tr=.2) for parametric tests that violate assumptions returns the Winsorized correlation coefficient and other relevant Perhaps the most obvious reason for the lack of adoption of statistics: modern methods is a lack of easy-to-use software and training re- { sources. In the following sections, two resources will be presented: 'cor': 0.08515087411576182, one for implementing robust methods and one for teaching them. 'nval': 50, 'sig': 0.558539575073185, 'wcov': 0.004207827245660796 Robust statistics for Python } Hypothesize is a robust null hypothesis significance testing (NHST) library for Python [CW20]. It is based on Wilcox’s WRS package for R which contains hundreds of functions for computing A case study using real-world data robust measures of central tendency and hypothesis testing. At It is helpful to demonstrate that robust methods in Hypothesize the time of this writing, the WRS library in R contains many (and in other libraries) can make a practical difference when more functions than Hypothesize and its value to researchers dealing with real-world data. In a study by Miller on sexual who use inferential statistics cannot be understated. WRS is attitudes, 1327 men and 2282 women were asked how many sexual THE MYTH OF THE NORMAL CURVE AND WHAT TO DO ABOUT IT 67 partners they desired over the next 30 years (the data are available from Rand R. Wilcox’s site). When comparing these groups using Student’s t-test, we get the following results: { 'ci': [-1491.09, 4823.24], 't_value': 1.035308, 'p_value': 0.300727 } That is, we fail to reject the null hypothesis at the α = 0.05 level using Student’s test for independent groups. However, if we switch to a robust analogue of the t-test, one that utilizes bootstrapping and trimmed means, we can indeed reject the null hypothesis. Here are the corresponding results from Hypothesize’s yuenbt test (based on [Yue74]): from hypothesize.compare_groups_with_single_factor \ import yuenbt Fig. 4: An example of the robust stats simulator in Deepnote’s hosted notebook environment. A minimalist UI can lower the barrier-to-entry results = yuenbt(df.males, df.females, to robust statistics concepts. tr=.2, alpha=.05) { The robust statistics simulator allows users to interact with the 'ci': [1.41, 2.11], following parameters: 'test_stat': 9.85, 'p_value': 0.0 • Distribution shape } • Level of contamination The point here is that robust statistics can make a practi- • Sample size cal difference with real-world data (even when N is consid- • Skew and heaviness of tails ered large). Many other examples of robust statistics making a Each of these characteristics can be adjusted independently in practical difference with real-world data have been documented order to compare classic approaches to their robust alternatives. [HD82][Wil09][Wil01]. The two measures that are used to evaluate the performance of It is important to note that robust methods may also fail to classic and robust methods are the standard error and Type I Error. reject when a traditional test rejects (remember that traditional Standard error is a measure of how much an estimator varies tests can suffer from increased Type I Error). It is also possible across random samples from our population. We want to choose that both approaches yield the same or similar conclusions. The estimators that have a low standard error. Type I Error is also exact pattern of results depends largely on the characteristics of the known as False Positive Rate. We want to choose methods that underlying population distribution. To be able to reason about how keep Type I Error close to the nominal rate (usually 0.05). The robust statistics behave when compared to traditional methods the robust statistics simulator can guide these decisions by providing robust statistics simulator has been created and is described in the empirical evidence as to why particular estimators and statistical next section. tests have been chosen. Robust statistics simulator Conclusion Having a library of robust statistical functions is not enough to make modern methods commonplace in applied research. Ed- This paper gives an overview of the issues associated with the ucators and practitioners still need intuitive training tools that normal curve. The concern with traditional methods, in terms of demonstrate the core issues surrounding classical methods and robustness to violations of normality, have been known for over how robust analogues compare. a half century and modern alternatives have been recommended; As mentioned, computational notebooks that run in the cloud however, for various reasons that have been discussed, modern offer a unique solution to learning beyond that of static textbooks robust methods have not yet become commonplace in applied and documentation. Learning can be interactive and exploratory research settings. since narration, visualization, widgets (e.g., buttons, slider bars), One reason is the lack of easy-to-use software and teaching and code can all be experienced in a ready-to-go compute envi- resources for robust statistics. To help fill this gap, Hypothesize, a ronment—with no overhead related to local environment setup. peer-reviewed and open-source Python library was developed. In As a compendium to Hypothesize, and a resource for un- addition, to help clearly demonstrate and visualize the advantages derstanding and teaching robust statistics in general, the robust of robust methods, the robust statistics simulator was created. statistics simulator repository has been developed. It is a notebook- Using these tools, practitioners can begin to integrate robust based collection of interactive demonstrations aimed at clearly and statistical methods into their inferential testing repertoire. visually explaining the conditions under which classic methods fail relative to robust methods. A hosted notebook with the Acknowledgements rendered visualizations of the simulations can be accessed here. The author would like to thank Karlynn Chan and Rand R. Wilcox and seen in Figure 4. Since the simulations run in the browser and as well as Elizabeth Dlha and the entire Deepnote team for their require very little understanding of code, students and teachers can support of this project. In addition, the author would like to thank easily onboard to the study of robust statistics. Kelvin Lee for his insightful review of this manuscript. 68 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) R EFERENCES [WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56 – [Cra46] Harold Cramer. Mathematical methods of statistics, princeton 61, 2010. doi:10.25080/Majora-92bf1922-00a. univ. Press, Princeton, NJ, 1946. URL: https://books.google.ca/ [Yue74] Karen K Yuen. The two-sample trimmed t for unequal population books?id=CRTKKaJO0DYC. variances. Biometrika, 61(1):165–170, 1974. doi:10.2307/ [CvNS18] Allan Campopiano, Stefon JR van Noordt, and Sidney J Sega- 2334299. lowitz. Statslab: An open-source eeg toolbox for comput- ing single-subject effects using robust statistics. Behavioural Brain Research, 347:425–435, 2018. doi:10.1016/j.bbr. 2018.03.025. [CW20] Allan Campopiano and Rand R. Wilcox. Hypothesize: Ro- bust statistics for python. Journal of Open Source Software, 5(50):2241, 2020. doi:10.21105/joss.02241. [Efr92] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer, 1992. doi:10.1007/978-1-4612-4380-9_41. [EHM08] David M Erceg-Hurn and Vikki M Mirosevich. Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. American Psychologist, 63(7):591, 2008. doi:10.1037/0003-066X.63.7.591. [FG81] Joseph Fashing and Ted Goertzel. The myth of the normal curve a theoretical critique and examination of its role in teaching and research. Humanity & Society, 5(1):14–31, 1981. doi:10. 1177/016059768100500103. [Gle93] John R Gleason. Understanding elongation: The scale contami- nated normal family. Journal of the American Statistical Asso- ciation, 88(421):327–337, 1993. doi:10.1080/01621459. 1993.10594325. [HD82] MaryAnn Hill and WJ Dixon. Robustness in real life: A study of clinical laboratory data. Biometrics, pages 377–396, 1982. doi:10.2307/2530452. [Mic89] Theodore Micceri. The unicorn, the normal curve, and other improbable creatures. Psychological bulletin, 105(1):156, 1989. doi:10.1037/0033-2909.105.1.156. [pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134. [Tan82] WY Tan. Sampling distributions and robustness of t, f and variance-ratio in two samples and anova models with respect to departure from normality. Comm. Statist.-Theor. Meth., 11:2485– 2511, 1982. URL: https://pascal-francis.inist.fr/vibad/index.php? action=getRecordDetail&idt=PASCAL83X0380619. [TE93] Robert J Tibshirani and Bradley Efron. An introduction to the bootstrap. Monographs on statistics and applied probabil- ity, 57:1–436, 1993. URL: https://books.google.ca/books?id= gLlpIUxRntoC. [Tuk60] J. W. Tukey. A survey of sampling from contaminated distribu- tions. Contributions to Probability and Statistics, pages 448–485, 1960. URL: https://ci.nii.ac.jp/naid/20000755025/en/. [vNDTE22] Stefon van Noordt, James A Desjardins, BASIS Team, and Mayada Elsabbagh. Inter-trial theta phase consistency during face processing in infants is associated with later emerging autism. Autism Research, 15(5):834–846, 2022. doi:10. 1002/aur.2701. [vNW21] Stefon van Noordt and Teena Willoughby. Cortical matura- tion from childhood to adolescence is reflected in resting state eeg signal complexity. Developmental cognitive neuroscience, 48:100945, 2021. doi:10.1016/j.dcn.2021.100945. [Wil92] Rand R Wilcox. Why can methods for comparing means have relatively low power, and what can you do to correct the prob- lem? Current Directions in Psychological Science, 1(3):101–105, 1992. doi:10.1111/1467-8721.ep10768801. [Wil98] Rand R Wilcox. How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53(3):300, 1998. doi:10.1037/0003-066X.53.3.300. [Wil01] Rand R Wilcox. Fundamentals of modern statistical meth- ods: Substantially improving power and accuracy, volume 249. Springer, 2001. URL: https://link.springer.com/book/10.1007/ 978-1-4757-3522-2. [Wil09] Rand R Wilcox. Robust ancova using a smoother with boot- strap bagging. British Journal of Mathematical and Sta- tistical Psychology, 62(2):427–437, 2009. doi:10.1348/ 000711008X325300. [Wil13] Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic press, 2013. doi:10.1016/c2010-0- 67044-1. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 69 Python for Global Applications: teaching scientific Python in context to law and diplomacy students Anna Haensch‡§∗ , Karin Knudson‡§ F Abstract—For students across domains and disciplines, the message has been the students and faculty at the Fletcher School are eager to communicated loud and clear: data skills are an essential qualification for today’s seize upon our current data moment to expand their quantitative job market. This includes not only the traditional introductory stats coursework offerings. With this in mind, The Fletcher School reached out to but also machine learning, artificial intelligence, and programming in Python or the co-authors to develop a course in data science, situated in the R. Consequently, there has been significant student-initiated demand for data context of international diplomacy. analytic and computational skills sometimes with very clear objectives in mind, and other times guided by a vague sense of “the work I want to do will require In response, we developed the (Python-based) course, Data this.” Now we have options. If we train students using “black box” algorithms Science for Global Applications, which had its inaugural offering without attending to the technical choices involved, then we run the risk of in the Spring semester of 2022. The course had 30 enrolled unleashing practitioners who might do more harm than good. On the other hand, Fletcher School students, primarily from the MALD program. courses that completely unpack the “black box” can be so steeped in theory that When the course was announced we had a flood of interest from the barrier to entry becomes too high for students from social science and policy Fletcher students who were extremely interested in broadening backgrounds, thereby excluding critical voices. In sum, both of these options their studies with this course. With a goal of keeping a close lead to a pitfall that has gained significant media attention over recent years: the interactive atmosphere we capped enrollment at 30. To inform the harms caused by algorithms that are implemented without sufficient attention to human context. In this paper, we - two mathematicians turned data scientists direction of our course, we surveyed students on their background - present a framework for teaching introductory data science skills in a highly in programming (see Fig. 1) and on their motivations for learning contextualized and domain flexible environment. We will present example course data science (see Fig 2). Students reported only very limited outlines at the semester, weekly, and daily level, and share materials that we experience with programming - if any at all - with that experience think hold promise. primarily in Excel and Tableau. Student motivations varied, but the goal to get a job where they were able to make a meaningful Index Terms—computational social science, public policy, data science, teach- social impact was the primary motivation. ing with Python Introduction As data science continues to gain prominence in the public eye, and as we become more aware of the many facets of our lives that intersect with data-driven technologies and policies every day, universities are broadening their academic offerings to keep up with what students and their future employers demand. Not only are students hoping to obtain more hard skills in data science (e.g. Python programming experience), but they are interested in applying tools of data science across domains that haven’t Fig. 1: The majority of the 30 students enrolled in the course had little historically been part of the quantitative curriculum. The Master to no programming experience, and none reported having "a lot" of of Arts in Law and Diplomacy (MALD) is the flagship program of experience. Those who did have some experience were most likely to the Fletcher School of Law and International Diplomacy at Tufts have worked in Excel or Tableau. University. Historically, the program has contained core elements of quantitative reasoning with a focus on business, finance, and The MALD program, which is interdisciplinary by design, pro- international development, as is typical in graduate programs in vides ample footholds for domain specific data science. Keeping international relations. Like academic institutions more broadly, this in mind, as a throughline for the course, each student worked to develop their own quantitative policy project. Coursework and * Corresponding author: anna.haensch@tufts.edu discussions were designed to move this project forward from ‡ Tufts University § Data Intensive Studies Center initial policy question, to data sourcing and visualizing, and eventually to modeling and analysis. Copyright © 2022 Anna Haensch et al. This is an open-access article dis- In what follows we will describe how we structured our tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- course with the goal of empowering beginner programmers to use vided the original author and source are credited. Python for data science in the context of international relations 70 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) might understand in the abstract that the way the handling of missing data can substantially affect the outcome of an analysis, but will likely have a stronger understanding if they have had to consider how to deal with missing data in their own project. We used several course structures to support connecting data science and Python "skills" with their context. Students had readings and journaling assignments throughout the semester on topics that connected data science with society. In their journal responses, students were asked to connect the ideas in the reading to their other academic/professional interests, or ideas from other classes with the following prompt: Your reflection should be a 250-300 word narrative. Be sure to tie the reading back into your own studies, experiences, and areas of interest. For each reading, Fig. 2: The 30 enrolled students were asked to indicate which were come up with 1-2 discussion questions based on the con- relevant motivations for taking the course. Curiosity and a desire to cepts discussed in the readings. This can be a curiosity make a meaningful social impact were among the top motivations our question, where you’re interested in finding out more, students expressed. a critical question, where you challenge the author’s assumptions or decisions, or an application question, where you think about how concepts from the reading and diplomacy. We will also share details about course content would apply to a particular context you are interested in and structure, methods of assessment, and Python programming exploring.1 resources that we deployed through Google Colab. All of the materials described here can be found on the public course page These readings (highlighted in gray in Fig 3), assignments, and https://karink520.github.io/data-science-for-global-applications/. the related in-class discussions were interleaved among Python exercises meant to give students practice with skills including Course Philosophy and Goals manipulating DataFrames in pandas [The22], [Mck10], plotting in Matplotlib [Hun07] and seaborn [Was21], mapping with GeoPan- Our high level goals for the course were i) to empower students das [Jor21], and modeling with scikit-learn [Ped11]. Student with the skills to gain insight from data using Python and ii) to projects included a thorough data audit component requiring deepen students’ understanding of how the use of data science students to explore data sources and their human context in detail. affects society. As we sought to achieve these high level goals Precise details and language around the data audit can be found within the limited time scope of a single semester, the following on the course website. core principles were essential in shaping our course design. Below, we briefly describe each of these principles and share some Managing Fears & Concerns Through Supported Programming examples of how they were reflected in the course structure. In a We surmised that students who are new to programming and subsequent section we will more precisely describe the content of possibly intimidated by learning the unfamiliar skill would do the course, whereupon we will further elaborate on these principles well in an environment that included plenty of what we call and share instructional materials. But first, our core principles: supported programming - that is, practicing programming in class Connecting the Technical and Social with immediate access to instructor and peer support. In the pre-course survey we created, many students identified To understand the impact of data science on the world (and the concerns about their quantitative preparation, whether they would potential policy implications of such impact), it helps to have be able to keep up with the course, and how hard programming hands-on practice with data science. Conversely, to effectively might be. We sought to acknowledge these concerns head-on, and ethically practice data science, it is important to understand assure students of our full confidence in their ability to master how data science lives in the world. Thus, the "hard" skills of the material, and provide them with all the resources they needed coding, wrangling data, visualizing, and modeling are best taught to succeed. intertwined with a robust study of ways in which data science is A key resource to which we thought all students needed used and misused. access was instructor attention. In addition to keeping the class There is an increasing need to educate future policy-makers size capped at 30 people, with both co-instructors attending all with knowledge of how data science algorithms can be used course meetings, we structured class time to maximize the time and misused. One way to approach meeting this need, especially students spent actually doing data science in class. We sought for students within a less technically-focused program, would to keep demonstrations short, and intersperse them with coding be to teach students about how algorithms can be used without exercises so that students could practice with new ideas right actually teaching them to use algorithms. However, we argue that away. Our Colab notebooks included in the course materials show students will gain a deeper understanding of the societal and one way that we wove student practice time throughout. Drawing ethical implications of data science if they also have practical insight from social practice theory of learning (e.g. [Eng01], data science skills. For example, a student could gain a broad [Pen16]), we sought to keep in mind how individual practice and understanding of how biased training data might lead to biased learning pathways develop in relation to their particular social and algorithmic predictions, but such understanding is likely to be deeper and more memorable when a student has actually practiced 1. This journaling prompt was developed by our colleague Desen Ozkan at training a model using different training data. Similarly, someone Tufts University. PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS 71 institutional context. Crucially, we devoted a great deal of in-class and preparing data for exploratory data analysis, visualizing and time to students doing data science, and a great deal of energy annotating data, and finally modeling and analyzing data. All into making this practice time a positive and empowering social of this was done with the goal of answering a policy question experience. During student practice time, we were circulating developed by the student, allowing the student to flex some throughout the room, answering student questions and helping domain expertise to supplement the (sometimes overwhelming!) students to problem solve and debug, and encouraging students programmatic components. to work together and help each other. A small organizational Our project explicitly required that students find two datasets change we made in the first weeks of the semester that proved of interest and merge them for the final analysis. This presented to have outsized impact was moving our office hours to hold them both logistical and technical challenges. As one student pointed directly after class in an almost-adjacent room, to make it as easy out after finally finding open data: hearing people talk about the as possible for students to attend office hours. Students were vocal need for open data is one thing, but you really realize what that in their appreciation of office hours. means when you’ve spent weeks trying to get access to data that We contend that the value of supported programming time you know exists. Understanding the provenance of the data they is two-fold. First, it helps beginning programmers learn more were working with helped students assess the biases and limita- quickly. While learning to code necessarily involves challenges, tions, and also gave students a strong sense of ownership over students new to a language can sometimes struggle for an un- their final projects. An unplanned consequence of the broad scope productively long time on things like simple syntax issues. When of the policy project was that we, the instructors, learned nearly students have help available, they can move forward from minor as much about international diplomacy as the students learned issues faster and move more efficiently into building a meaningful about programming and data science, a bidirectional exchange of understanding. Secondly, supported programming time helps stu- knowledge that we surmised to have contributed to student feeling dents to understand that they are not alone in the challenges they of empowerment and a positive class environment. are facing in learning to program. They can see other students learning and facing similar challenges, can have the empowering Course Structure experience of helping each other out, and when asking for help can notice that even their instructors sometimes rely on resources We broke the course into three modules, each with focused like StackOverflow. An unforeseen benefit we believe co-teaching reading/journaling topics, Python exercises, and policy project had was to give us as instructors the opportunity to consult benchmarks: (i) getting and cleaning data, (ii) visualizing data, with each other during class time and share different approaches. and (iii) modeling data. In what follows we will describe the key These instructor interactions modeled for students how even as goals of each module and highlight the readings and exercises that experienced practitioners of data science, we too were constantly we compiled to work towards these goals. learning. Getting and Cleaning Data Lastly, a small but (we thought) important aspect of our setup was teaching students to set up a computing environment on Getting, cleaning, and wrangling data typically make up a signif- their own laptops, with Python, conda [Ana16], and JupyterLab icant proportion of the time involved in a data science project. [Pro22]. Using the command line and moving from an environ- Therefore, we devoted significant time in our course to learning ment like Google Colab to one’s own computer can both present these skills, focusing on loading and manipulating data using significant barriers, but doing so successfully can be an important pandas. Key skills included loading data into a pandas DataFrame, part of helping students feel like ‘real’ programmers. We devoted working with missing data, and slicing, grouping, and merging an entire class period to helping students with installation and DataFrames in various ways. After initial exposure and practice setup on their own computers. with example datasets, students applied their skills to wrangling We considered it an important measure of success how many the diverse and sometimes messy and large datasets that they found students told us at the end of the course that the class had helped for their individual projects. Since one requirement of the project them overcome sometimes longstanding feelings that technical was to integrate more than one dataset, merging was of particular skills like coding and modeling were not for them. importance. During this portion of the course, students read and discussed Leveraging Existing Strengths To Enhance Student Ownership Boyd and Crawford’s Critical Questions for Big Data [Boy12] Even as beginning programmers, students are capable of creating a which situates big data in the context of knowledge itself and meaningful policy-related data science project within the semester, raises important questions about access to data and privacy. Ad- starting from formulating a question and finding relevant datasets. ditional readings included selected chapters from D’Ignazio and Working on the project throughout the semester (not just at the Klein’s Data Feminism [Dig20] which highlights the importance end) gave essential context to data science skills as students could of what we choose to count and what it means when data is translate into what an idea might mean for "their" data. Giving missing. students wide leeway in their project topic allowed the project to be a point of connection between new data science skills and their Visualizing Data existing domain knowledge. Students chose projects within their A fundamental component to communicating findings from data particular areas of interest or expertise, and a number chose to is well-executed data visualization. We chose to place this module additionally connect their project for this course to their degree in the middle of the course, since it was important that students capstone project. have a common language for interpreting and communicating their Project benchmarks were placed throughout the semester analysis before moving to the more complicated aspects of data (highlighted in green in Fig 3) allowing students a concrete modeling. In developing this common language, we used Wilke’s way to develop their new skills in identifying datasets, loading Fundamentals of Data Visualization [Wil19] and Cairo’s How 72 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Course outline for a 13-week semester with two 70 minute instructional blocks each week. Course readings are highlighted in gray and policy project benchmarks are highlighted in green. Chart’s Lie [Cai19] as a backbone for this section of the course. using Python. Having the concrete target of how a student wanted In addition to reading the text materials, students were tasked with their visualization to look seemed to be a motivating starting finding visualizations “in the wild,” both good and bad. Course point from which to practice coding and debugging. We spent discussions centered on the found visualizations, with Wilke and several class periods on supported programming time for students Cairo’s writings as a common foundation. From the readings and to develop their visualizations. discussions, students became comfortable with the language and Working on building the narratives of their project and devel- taxonomy around visualizations and began to develop a better ap- oping their own visualizations in the context of the course readings preciation of what makes a visualization compelling and readable. gave students a heightened sense of attention to detail. During Students were able to formulate a plan about how they could best one day of class when students shared visualizations and gave visualize their data. The next task was to translate these plans into feedback to one another, students commented and inquired about Python. incredibly small details of each others’ presentations, for example, To help students gain a level of comfort with data visualization how to adjust y-tick alignment on a horizontal bar chart. This sort in Python, we provided instruction and examples of working of tiny detail is hard to convey in a lecture, but gains outsized with a variety of charts using Matplotlib and seaborn, as well importance when a student has personally wrestled with it. as maps and choropleths using GeoPandas, and assigned students programming assignments that involved writing code to create Modeling Data a visualization matching one in an image. With that practical In this section we sought to expose students to introductory grounding, students were ready to visualize their own project data approaches in each of regression, classification, and clustering PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS 73 in Python. Specifically, we practiced using scikit-learn to work And finally, to supplement the technical components of the with linear regression, logistic regression, decision trees, random course we also had readings with associated journal entries sub- forests, and gaussian mixture models. Our focus was not on the mitted at a cadence of roughly two per module. Journal prompts theoretical underpinnings of any particular model, but rather on are described above and available on the course website. the kinds of problems that regression, classification, or clustering models respectively, are able to solve, as well as some basic ideas about model assessment. The uniform and approachable scikit- Conclusion learn API [Bui13] was crucial in supporting this focus, since it Various listings of key competencies in data science have been allowed us to focus less on syntax around any one model, and more proposed [NAS18]. For example, [Dev17] suggests the following on the larger contours of modeling, with all its associated promise pillars for an undergraduate data science curriculum: computa- and perils. We spent a good deal of time building an understanding tional and statistical thinking, mathematical foundations, model of train-test splits and their role in model assessment. building and assessment, algorithms and software foundation, Student projects were required to include a modeling com- data curation, and knowledge transference—communication and ponent. Just the process of deciding which of regression, clas- responsibility. As we sought to contribute to the training of sification, or clustering were appropriate for a given dataset and data-science informed practitioners of international relations, we policy question is highly non-trivial for beginners. The diversity of focused on helping students build an initial competency especially student projects and datasets meant students had to grapple with in the last four of these. this decision process in its full complexity. We were delighted by We can point to several key aspects of the course that made the variety of modeling approaches students used in their projects, it successful. Primary among them was the fact that the majority as well as by students’ thoughtful discussions of the limitations of of class time was spent in supported programming. This means their analysis. that students were able to ask their instructors or peers as soon To accompany this section of the course, students were as- as questions arose. Novice programmers who aren’t part of a signed readings focusing on some of the societal impacts of data formal computer science program often don’t have immediate modeling and algorithms more broadly. These readings included access to the resources necessary to get "unstuck." for the novice a chapter from O’Neil’s Weapons of Math Destruction [One16] as programmer, even learning how to google technical terms can be a well as Buolamwini and Gebru’s Gender Shades [Buo18]. Both of challenge. This sort of immediate debugging and feedback helped these readings emphasize the capacity of algorithms to exacerbate students remain confident and optimistic about their projects. This inequalities and highlight the importance of transparency and was made all the more effective since we were co-teaching the ethical data practices. These readings resonated especially strongly course and had double the resources to troubleshoot. Co-teaching with our students, many of whom had recently taken courses in also had the unforeseen benefit of making our classroom a place cyber policy and ethics in artificial intelligence. where the growth mindset was actively modeled and nurtured: where one instructor wasn’t able to answer a question, the other Assessments instructor often could. Finally, it was precisely the motivation of Formal assessment was based on four components, already alluded learning data science in context that allowed students to maintain a to throughout this note. The largest was the ongoing policy sense of ownership over their work and build connections between project which had benchmarks with rolling due dates throughout their other courses. the semester. Moreover, time spent practicing coding skills in Learning programming from the ground up is difficult. Stu- class was often done in service of the project. For example, in dents arrive excited to learn, but also nervous and occasionally week 4, when students learned to set up their local computing heavy with the baggage they carry from prior experience in environments, they also had time to practice loading, reading, and quantitative courses. However, with a sufficient supported learning saving data files associated with their chosen project datasets. This environment it’s possible to impart relevant skills. It was a measure brought challenges, since often students sitting side-by-side were of the success of the course how many students told us that the dealing with different operating systems and data formats. But course had helped them overcome negative prior beliefs about from this challenge emerged many organic conversations about their ability to code. Teaching data science skills in context and file types and the importance of naming conventions. The rubric with relevant projects that leverage students’ existing expertise and for the final project is shown in Fig 4. outside reading situates the new knowledge in a place that feels The policy project culminated with in-class “micro presenta- familiar and accessible to students. This contextualization allows tions” and a policy paper. We dedicated two days of class in week students to gain some mastery while simultaneously playing to 13 for in-class presentations, for which each student presented their strengths and interests. one slide consisting of a descriptive title, one visualization, and several “key takeaways” from the project. This extremely restric- tive format helped students to think critically about the narrative R EFERENCES information conveyed in a visualization, and was designed to create time for robust conversation around each presentation. [Ana16] Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. In addition to the policy project, each of the three course Anaconda, Nov. 2016. Web. https://anaconda.com. [Boy12] Boyd, Danah, and Kate Crawford. Critical questions for big data: modules also had an associated set of Python exercises (available Provocations for a cultural, technological, and scholarly phe- on the course website). Students were given ample time both in nomenon. Information, communication & society 15.5 (2012):662- and out of class to ask questions about the exercises. Overall, these 679. https://doi.org/10.1080/1369118X.2012.678878 exercises proved to be the most technically challenging component [Bui13] Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae et al. API design for of the course, but we invited students to resubmit after an initial machine learning software: experiences from the scikit-learn project. round of grading. arXiv preprint arXiv:1309.0238 (2013). 74 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Rubric for the policy project that formed a core component of the formal assessment of students throughout the course. [Buo18] Buolamwini, Joy, and Timnit Gebru. Gender shades: Intersectional [Ped11] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent accuracy disparities in commercial gender classification. Conference Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al. on fairness, accountability and transparency. PMLR, 2018. http:// Scikit-learn: Machine learning in Python. the Journal of machine proceedings.mlr.press/v81/buolamwini18a.html Learning research 12 (2011): 2825-2830. https://dl.acm.org/doi/10. [Cai19] Cairo, Alberto. How charts lie: Getting smarter about visual infor- 5555/1953048.2078195 mation. WW Norton & Company, 2019. [Pen16] Penuel, William R., Daniela K. DiGiacomo, Katie Van Horne, and [Dev17] De Veaux, Richard D., Mahesh Agarwal, Maia Averett, Benjamin Ben Kirshner. A Social Practice Theory of Learning and Becoming S. Baumer, Andrew Bray, Thomas C. Bressoud, Lance Bryant et al. across Contexts and Time. Frontline Learning Research 4, no. 4 Curriculum guidelines for undergraduate programs in data science. (2016): 30-38. http://dx.doi.org/10.14786/flr.v4i4.205 Annual Review of Statistics and Its Application 4 (2017): 15-30. [Pro22] Project Jupyter, 2022. jupyterlab/jupyterlab: JupyterLab 3.4.3 https: https://doi.org/10.1146/annurev-statistics-060116-053930 //github.com/jupyterlab/jupyterlab [Dig20] D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. MIT [The22] The Pandas Development Team, 2022. pandas-dev/pandas: Pandas press, 2020. 1.4.2. Zenodo. https://doi.org/10.5281/zenodo.6408044 [Eng01] Engeström, Yrjö. Expansive learning at work: Toward an activity [Was21] Waskom, Michael L. Seaborn: statistical data visualization. Journal theoretical reconceptualization. Journal of education and work 14, of Open Source Software 6, no. 60 (2021): 3021. https://doi.org/10. no. 1 (2001): 133-156. https://doi.org/10.1080/13639080020028747 21105/joss.03021 [Hun07] Hunter, J.D., Matplotlib: A 2D Graphics Environment. Computing in [Wil19] Wilke, Claus O. Fundamentals of data visualization: a primer on Science & Engineering, vol. 9, no. 3 (2007): 90-95. https://doi.org/ making informative and compelling figures. O’Reilly Media, 2019. 10.1109/MCSE.2007.55 [Jor21] Jordahl, Kelsey et al. 2021. Geopandas/geopandas: V0.10.2. Zenodo. https://doi.org/10.5281/zenodo.5573592. [Mck10] McKinney, Wes. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, vol. 445, no. 1, pp. 51-56. 2010. https://doi.org/10.25080/Majora-92bf1922-00a [NAS18] National Academies of Sciences, Engineering, and Medicine. Data science for undergraduates: Opportunities and options. National Academies Press, 2018. [One16] O’Neil, Cathy. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 75 Papyri: better documentation for the scientific ecosystem in Jupyter Matthias Bussonnier‡§∗ , Camille Carvalho¶k F Abstract—We present here the idea behind Papyri, a framework we are devel- documentation is often displayed as raw source where no naviga- oping to provide a better documentation experience for the scientific ecosystem. tion is possible. On the maintainers’ side, the final documentation In particular, we wish to provide a documentation browser (from within Jupyter rendering is less of a priority. Rather, maintainers should aim at or other IDEs and Python editors) that gives a unified experience, cross library making users gain from improvement in the rendering without navigation search and indexing. By decoupling documentation generation from having to rebuild all the docs. rendering we hope this can help address some of the documentation accessi- bility concerns, and allow customisation based on users’ preferences. Conda-Forge [CFRG] has shown that concerted efforts can give a much better experience to end-users, and in today’s world Index Terms—Documentation, Jupyter, ecosystem, accessibility where it is ubiquitous to share libraries source on code platforms, perform continuous integration and many other tools, we believe a better documentation framework for many of the libraries of the Introduction scientific Python should be available. Over the past decades, the Python ecosystem has grown rapidly, Thus, against all advice we received and based on our own and one of the last bastion where some of the proprietary competi- experience, we have decided to rebuild an opinionated documen- tion tools shine is integrated documentation. Indeed, open-source tation framework, from scratch, and with minimal dependencies: libraries are usually developed in distributed settings that can make Papyri. Papyri focuses on building an intermediate documentation it hard to develop coherent and integrated systems. representation format, that lets us decouple building, and rendering While a number of tools and documentations exists (and the docs. This highly simplifies many operations and gives us improvements are made everyday), most efforts attempt to build access to many desired features that were not available up to now. documentation in an isolated way, inherently creating a heteroge- In what follows, we provide the framework in which Papyri neous framework. The consequences are twofolds: (i) it becomes has been created and present its objectives (context and goals), difficult for newcomers to grasp the tools properly, (ii) there is a we describe the Papyri features (format, installation, and usage), lack of cohesion and of unified framework due to library authors then present its current implementation. We end this paper with making their proper choices as well as having to maintain build comments on current challenges and future work. scripts or services. Many users, colleagues, and members of the community have Context and objectives been frustrated with the documentation experience in the Python Through out the paper, we will draw several comparisons between ecosystem. Given a library, who hasn’t struggled to find the documentation building and compiled languages. Also, we will "official" website for the documentation ? Often, users stumble borrow and adapt commonly used terminology. In particular, sim- across an old documentation version that is better ranked in their ilarities with "ahead-of-time" (AOT) [AOT], "just-in-time"" (JIT) favorite search engine, and this impacts significantly the learning [JIT], intermediate representation (IR) [IR], link-time optimization process of less experienced users. (LTO) [LTO], static vs dynamic linking will be highlighted. This On users’ local machine, this process is affected by lim- allows us to clarify the presentation of the underlying architecture. ited documentation rendering. Indeed, while in many Integrated However, there is no requirement to be familiar with the above Development Environments (IDEs) the inspector provides some to understand the concepts underneath Papyri. In that context, we documentation, users do not get access to the narrative, or the full wish to discuss documentation building as a process from a source- documentation gallery. For Command Line Interface (CLI) users, code meant for a machine to a final output targeting the flesh and blood machine between the keyboard and the chair. * Corresponding author: bussonniermatthias@gmail.com ‡ QuanSight, Inc § Digital Ours Lab, SARL. Current tools and limitations ¶ University of California Merced, Merced, CA, USA || Univ Lyon, INSA Lyon, UJM, UCBL, ECL, CNRS UMR 5208, ICJ, F-69621, In the scientific Python ecosystem, it is well known that Docutils France [docutils] and Sphinx [sphinx] are major cornerstones for pub- lishing HTML documentation for Python. In fact, they are used Copyright © 2022 Matthias Bussonnier et al. This is an open-access article by all the libraries in this ecosystem. While a few alternatives distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, exist, most tools and services have some internal knowledge of provided the original author and source are credited. Sphinx. For instance, Read the Docs [RTD] provides a specific 76 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Sphinx theme [RTD-theme] users can opt-in to, Jupyter-book [JPYBOOK] is built on top of Sphinx, and MyST parser [MYST] (which is made to allow markdown in documentation) targets Sphinx as a backend, to name a few. All of the above provide an "ahead-of-time" documentation compilation and rendering, which is slow and computationally intensive. When a project needs its specific plugins, extensions and configurations to properly build (which is almost always the case), it is relatively difficult to build documentation for a single object (like a single function, module or class). This makes AOT tools difficult to use for interactive exploration. One can then consider a JIT approach, as done for Docrepr [DOCREPR] (integrated both in Jupyter and Spyder [Spyder]). However in that case, interactive documentation lacks inline plots, crosslinks, indexing, search and many custom Fig. 1: The following screenshot shows the help for directives. scipy.signal.dpss, as currently accessible (left), as shown by Papyri for Jupyterlab extension (right). An extended version of the Some of the above limitations are inherent to the design right pannel is displayed in Figure 4. of documentation build tools that were intended for a separate documentation construction. While Sphinx does provide features like intersphinx, link resolutions are done at the documentation raw docstrings (see for example the SymPy discussion2 on how building phase. Thus, this is inherently unidirectional, and can equations should be displayed in docstrings, and left panel of break easily. To illustrate this, we consider NumPy [NP] and SciPy Figure 1). In terms of format, markdown is appealing, however [SP], two extremely close libraries. In order to obtain proper cross- inconsistencies in the rendering will be created between libraries. linked documentation, one is required to perform at least five steps: Finally, some libraries can dynamically modify their docstring at • build NumPy documentation runtime. While this sometime avoids using directives, it ends up • publish NumPy object.inv file. being more expensive (runtime costs, complex maintenance, and • (re)build SciPy documentation using NumPy obj.inv contribution costs). file. Objectives of the project • publish SciPy object.inv file • (re)build NumPy docs to make use of SciPy’s obj.inv We now layout the objectives of the Papyri documentation frame- work. Let us emphasize that the project is in no way intended to Only then can both SciPy’s and NumPy’s documentation refer replace or cover many features included in well-established docu- to each other. As one can expect, cross links break every time mentation tools such as Sphinx or Jupyter-book. Those projects are a new version of a library is published1 . Pre-produced HTML extremely flexible and meet the needs of their users for publishing in IDEs and other tools are then prone to error and difficult to a standalone documentation website of PDFs. The Papyri project maintain. This also raises security issues: some institutions be- addresses specific documentation challenges (mentioned above), come reluctant to use tools like Docrepr or viewing pre-produced we present below what is (and what is not) the scope of work. HTML. Goal (a): design a non-generic (non fully customisable) website builder. When authors want or need complete control Docstrings format of the output and wide personalisation options, or branding, then The Numpydoc format is ubiquitous among the scientific ecosys- Papyri is not likely the project to look at. That is to say single- tem [NPDOC]. It is loosely based on reStructuredText (RST) project websites where appearance, layout, domain need to be syntax, and despite supporting full RST syntax, docstrings rarely controlled by the author is not part of the objectives. contain full-featured directive. Maintainers are confronted to the Goal (b): create a uniform documentation structure and following dilemma: syntax. The Papyri project prescribes stricter requirements in • keep the docstrings simple. This means mostly text-based terms of format, structure, and syntax compared to other tools docstrings with few directive for efficient readability. The such as Docutils and Sphinx. When possible, the documentation end-user may be exposed to raw docstring, there is no on- follows the Diátaxis Framework [DT]. This provides a uniform the-fly directive interpretation. This is the case for tools documentation setup and syntax, simplifying contributions to the such as IPython and Jupyter. project and easing error catching at compile time. Such strict envi- • write an extensive docstring. This includes references, and ronment is qualitatively supported by a number of documentation directive that potentially creates graphics, tables and more, fixes done upstream during the development stage of the project3 . allowing an enriched end-user experience. However this Since Papyri is not fully customisable, users who are already using may be computationally intensive, and executing code to documentation tools such as Sphinx, mkdocs [mkdocs] and others view docs could be a security risk. should expect their project to require minor modifications to work with Papyri. Other factors impact this choice: (i) users, (ii) format, (iii) Goal (c): provide accessibility and user proficiency. Ac- runtime. IDE users or non-Terminal users motivate to push for cessibility is a top priority of the project. To that aim, items extensive docstrings. Tools like Docrepr can mitigate this problem are associated to semantic meaning as much as possible, and by allowing partial rendering. However, users are often exposed to 2. sympy/sympy#14963 1. ipython/ipython#12210, numpy/numpy#21016, & #29073 3. Tests have been performed on NumPy, SciPy. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 77 documentation rendering is separated from documentation build- Intermediate Representation for Documentation (IRD) ing phase. That way, accessibility features such as high contract IRD format: Papyri relies on standard interchangeable themes (for better text-to-speech (TTS) raw data), early example "Intermediate Representation for Documentation" (IRD) format. highlights (for newcomers) and type annotation (for advanced This allows to reduce operation complexity of the documentation users) can be quickly available. With the uniform documentation build. For example, given M documentation producers and N structure, this provides a coherent experience where users become renderers, a full documentation build would be O(MN) (each more comfortable finding information in a single location (see renderer needs to understand each producer). If each producer only Figure 1). cares about producing IRD, and if each renderer only consumes it, Goal (d): make documentation building simple, fast, and then one can reduce to O(M+N). Additionally, one can take IRD independent. One objective of the project is to make documenta- from multiple producers at once, and render them all to a single tion installation and rendering relatively straightforward and fast. target, breaking the silos between libraries. To that aim, the project includes relative independence of doc- At the moment, IRD files are currently separated into four umentation building across libraries, allowing bidirectional cross main categories roughly following the Diátaxis framework [DT] links (i.e. both forward and backward links between pages) to and some technical needs: be maintained more easily. In other words, a single library can be built without the need to access documentation from another. Also, • API files describe the documentation for a single ob- the project should include straightforward lookup documentation ject, expressed as a JSON object. When possible, the for an object from the interactive read–eval–print loop (REPL). information is encoded semantically (Objective (c)). Files Finally, efforts are put to limit the installation speed (to avoid are organized based on the fully-qualified name of the polynomial growth when installing packages on large distributed Python object they reference, and contain either absolute systems). reference to another object (library, version and identi- fier), or delayed references to objects that may exist in another library. Some extra per-object meta information The Papyri solution like file/line number of definitions can be stored as well. In this section we describe in more detail how Papyri has been • Narrative files are similar to API files, except that they do implemented to address the objectives mentioned above. not represent a given object, but possess a previous/next page. They are organised in an ordered tree related to the table of content. Making documentation a multi-step process • Example files are a non-ordered collection of files. When using current documentation tools, customisation made by • Assets files are untouched binary resource archive files that maintainers usually falls into the following two categories: can be referenced by any of the above three ones. They are the only ones that contain backward references, and no • simpler input convenience, forward references. • modification of final rendering. In addition to the four categories above, metadata about the This first category often requires arbitrary code execution and current package is stored: this includes library name, current must import the library currently being built. This is the case version, PyPi name, GitHub repository slug4 , maintainers’ names, for example for the use of .. code-block:::, or custom logo, issue tracker and others. In particular, metadata allows :rc: directive. The second one offers a more user friendly en- us to auto-generate links to issue trackers, and to source files vironment. For example, sphinx-copybutton [sphinx-copybutton] when rendering. In order to properly resolve some references and adds a button to easily copy code snippets in a single click, normalize links convention, we also store a mapping from fully and pydata-sphinx-theme [pydata-sphinx-theme] or sphinx-rtd- qualified names to canonical ones. dark-mode provide a different appearance. As a consequence, Let us make some remarks about the current stage of IRD for- developers must make choices on behalf of their end-users: this mat. The exact structure of package metadata has not been defined may concern syntax highlights, type annotations display, light/dark yet. At the moment it is reduced to the minimum functionality. theme. While formats such as codemeta [CODEMETA] could be adopted, Being able to modify extensions and re-render the documenta- in order to avoid information duplication we rely on metadata tion without the rebuilding and executing stage is quite appealing. either present in the published packages already or extracted from Thus, the building phase in Papyri (collecting documentation Github repository sources. Also, IRD files must be standardized information) is separated from the rendering phase (Objective (c)): in order to achieve a uniform syntax structure (Objective (b)). at this step, Papyri has no knowledge and no configuration options In this paper, we do not discuss IRD files distribution. Last, the that permit to modify the appearance of the final documentation. final specification of IRD files is still in progress and regularly Additionally, the optional rendering process has no knowledge of undergoes major changes (even now). Thus, we invite contributors the building step, and can be run without accessing the libraries to consult the current state of implementation on the GitHub involved. repository [Papyri]. Once the IRD format is more stable, this will This kind of technique is commonly used in the field of be published as a JSON schema, with full specification and more compilers with the usage of Single Compilation Unit [SCU] and in-depth description. Intermediate Representation [IR], but to our knowledge, it has not been implemented for documentation in the Python ecosystem. 4. "slug" is the common term that refers to the various combinations As mentioned before, this separation is key to achieving many of organization name/user name/repository name, that uniquely identifies a features proposed in Objectives (c), (d) (see Figure 2). repository on a platform like GitHub. 78 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Sketch representing how to build documentation with Papyri. Step 1: Each project builds an IRD bundle that contains semantic information about the project documentation. Step 2: the IRD bundles are publihsed online. Step 3: users install IRD bundles locally on their machine, pages get corsslinked, indexed, etc. Step 4: IDEs render documentation on-the-fly, taking into consideration users’ preferences. IRD bundles: Once a library has collected IRD repre- package managers or IDEs, one could imagine this process being sentation for all documentation items (functions, class, narrative automatic, or on demand. This step should be fairly efficient as it sections, tutorials, examples), Papyri consolidates them into what mostly requires downloading and unpacking IRD files. we will refer to as IRD bundles. A Bundle gathers all IRD files Finally, IDEs developers want to make sure IRD files can be and metadata for a single version of a library5 . Bundles are a properly rendered and browsed by their users when requested. convenient unit to speak about publication, installation, or update This may potentially take into account users’ preferences, and may of a given library documentation files. provide added values such as indexing, searching, bookmarks and Unlike package installation, IRD bundles do not have the others, as seen in rustsdocs, devdocs.io. notion of dependencies. Thus, a fully fledged package manager is not necessary, and one can simply download corresponding files Current implementation and unpack them at the installation phase. We present here some of the technological choices made in the Additionally, IRD bundles for multiple versions of the same current Papyri implementation. At the moment, it is only targeting library (or conflicting libraries) are not inherently problematic as a subset of projects and users that could make use of IRD files and they can be shared across multiple environments. bundles. As a consequence, it is constrained in order to minimize From a security standpoint, installing IRD bundles does not the current scope and efforts development. Understanding the require the execution of arbitrary code. This is a critical element implementation is not necessary to use Papyri neither as a project for adoption in deployments. There exists as well an opportunity to maintainer nor as a user, but it can help understanding some of the provide localized variants at the IRD installation time (IRD bundle current limitations. translations haven’t been explored exhaustively at the moment). Additionally, nothing prevents alternatives and complementary implementations with different choices: as long as other imple- IRD and high level usage mentations can produce (or consume) IRD bundles, they should Papyri-based documentation involves three broad categories of be perfectly compatible and work together. stakeholders (library maintainers, end-users, IDE developers), and The following sections are thus mostly informative to under- processes. This leads to certain requirements for IRD files and stand the state of the current code base. In particular we restricted bundles. ourselves to: On the maintainers’ side, the goal is to ensure that Papyri can build IRD files, and publish IRD bundles. Creation of IRD files • Producing IRD bundles for the core scientific Python and bundles is the most computationally intensive step. It may projects (Numpy, SciPy, Matplotlib...) require complex dependencies, or specific plugins. Thus, this can • Rendering IRD documentation for a single user on their be a multi-step process, or one can use external tooling (not related local machine. to Papyri nor using Python) to create them. Visual appearance Finally, some of the technological choices have no other and rendering of documentation is not taken into account in this justification than the main developer having interests in them, or process. Overall, building IRD files and bundles takes about the making iterations on IRD format and main code base faster. same amount of time as running a full Sphinx build. The limiting factor is often associated to executing library examples and code IRD files generation snippets. For example, building SciPy & NumPy documentation The current implementation of Papyri only targets some compat- IRD files on a 2021 Macbook Pro M1 (base model), including ibility with Sphinx (a website and PDF documentation builder), executing examples in most docstrings and type inferring most reStructuredText (RST) as narrative documentation syntax and examples (with most variables semantically inferred) can take Numpydoc (both a project and standard for docstring formatting). several minutes. These are widely used by a majority of the core scientific End-users are responsible for installing desired IRD bundles. Python ecosystem, and thus having Papyri and IRD bundles In most cases, it will consist of IRD bundles from already compatible with existing projects is critical. We estimate that installed libraries. While Papyri is not currently integrated with about 85%-90% of current documentation pages being built with Sphinx, RST and Numpydoc can be built with Papyri. Future work 5. One could have IRD bundles not attached to a particular library. For example, this can be done if an author wishes to provide only a set of examples includes extensions to be compatible with MyST (a project to or tutorials. We will not discuss this case further here. bring markdown syntax to Sphinx), but this is not a priority. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 79 To understand RST Syntax in narrative documentation, RST documents need to be parsed. To do so, Papyri uses tree-sitter [TS] and tree-sitter-rst [TSRST] projects, allowing us to extract an "Abstract Syntax Tree" (AST) from the text files. When using tree- sitter, AST nodes contain bytes-offsets into the original text buffer. Then one can easily "unparse" an AST node when necessary. This is relatively convenient for handling custom directives and edge cases (for instance, when projects rely on a loose definition of the RST syntax). Let us provide an example: RST directives are usually of the form: .. directive:: arguments body Fig. 3: Sketch representing how Papyri stores information in 3 While technically there is no space before the ::, Docutils and different format depending on access patterns: a SQLite database for Sphinx will not create errors when building the documentation. relationship information, on-disk CBOR files for more compact storate Due to our choice of a rigid (but unified) structure, we use tree- of IRD, and RAW files (e.g. Images). A GraphStore API abstracts all sitter that indicates an error node if there is an extra space. This access and takes care of maintinaing consistency. allows us to check for error nodes, unparse, add heuristics to restore a proper syntax, then parse again to obtain the new node. (like a database server) are not necessary available. This provides Alternatively, a number of directives like warnings, notes an adapted framework to test Papyri on an end-user machine. admonitions still contain valid RST. Instead of storing the With those requirements we decided to use a combination of directive with the raw text, we parse the full document (potentially SQLite (an in-process database engine), Concise Binary Object finding invalid syntax), and unparse to the raw text only if the Representation (CBOR) and raw storage to better reflect the access directive requires it. pattern (see Figure 3). Serialisation of data structure into IRD files is currently us- SQLite allows us to easily query for object existence, and ing a custom serialiser. Future work includes maybe swapping graph information (relationship between objects) at runtime. It is to msgspec [msgspec]. The AST objects are completely typed, optimized for infrequent reading access. Currently many queries however they contain a number of unions and sequences of unions. are done at runtime, when rendering documentation. The goal is to It turns out, many frameworks like pydantic [pydantic] do not move most of SQLite information resolving step at the installation support sequences of unions where each item in the union may time (such as looking for inter-libraries links) once the codebase be of a different type. To our knowledge, there are just few other and IRD format have stabilized. SQLite is less strongly typed than documentation related projects that treat AST as an intermediate other relational or graph database and needs custom logic, but object with a stable format that can be manipulated by external is ubiquitous on all systems and does not need a separate server tools. In particular, the most popular one is Pandoc [pandoc], a process, making it an easy choice of database. project meant to convert from many document types to plenty of CBOR is a more space efficient alternative to JSON. In par- other ones. ticular, keys in IRD are often highly redundant, and can be highly The current Papyri strategy is to type-infer all code examples optimized when using CBOR. Storing IRD in CBOR thus reduces with Jedi [JEDI], and pre-syntax highlight using pygments when disk usage and can also allow faster deserialization without possible. requiring potentially CPU intensive compression/decompression. IRD File Installation This is a good compromise for potentially low performance users’ Download and installation of IRD files is done concurrently using machines. httpx [httpx], with Trio [Trio] as an async framework, allowing us Raw storage is used for binary blobs which need to be accessed to download files concurrently. without further processing. This typically refers to images, and The current implementation of Papyri targets Python doc- raw storage can be accessed with standard tools like image umentation and is written in Python. We can then query the viewers. existing version of Python libraries installed, and infer the ap- Finally, access to all of these resources is provided via an propriate version of the requested documentation. At the moment, internal GraphStore API which is agnostic of the backend, but the implementation is set to tentatively guess relevant libraries ensures consistency of operations like adding/removing/replacing versions when the exact version number is missing from the install documents. Figure 3 summarizes this process. command. Of course the above choices depend on the context where For convenience and performance, IRD bundles are being post- documentation is rendered and viewed. For example, an online processed and stored in a different format. For local rendering, we archive intended to browse documentation for multiple projects mostly need to perform the following operations: and versions may decide to use an actual graph database for object relationship, and store other files on a Content Delivery Network 1) Query graph information about cross-links across docu- or blob storage for random access. ments. 2) Render a single page. Documentation Rendering 3) Access raw data (e.g. images). The current Papyri implementation includes a certain number We also assume that IRD files may be infrequently updated, of rendering engines (presented below). Each of them mostly that disk space is limited, and that installing or running services consists of fetching a single page with its metadata, and walking 80 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) through the IRD AST tree, and rendering each node with users’ Future goals include improving/replacing the JupyterLab’s ques- preferences. tion mark operator (obj?) and the JupyterLab Inspector (when possible). A screenshot of the current development version of the • An ASCII terminal renders using Jinja2 [Jinja2]. This JupyterLab extension can be seen in Figure 4. can be useful for piping documentation to other tools like grep, less, cat. Then one can work in a highly restricted environment, making sure that reading the docu- Challenges mentation is coherent. This can serve as a proxy for screen We mentioned above some limitations we encountered (in ren- reading. dering usage for instance) and what will be done in the future • A Textual User Interface browser renders using urwid. to address them. We provide below some limitations related to Navigation within the terminal is possible, one can reflow syntax choices, and broader opportunities that arise from the long lines on resized windows, and even open image files Papyri project. in external editors. Nonetheless, several bugs have been encountered in urwid. The project aims at replacing the Limitations CLI IPython question mark operator (obj?) interface The decoupling of the building and rendering phases is key in (which currently only shows raw docstrings) in urwid with Papyri. However, it requires us to come up with a method that a new one written with Rich/Textual. For this interface, uniquely identifies each object. In particular, this is essential in having images stored raw on disk is useful as it allows us order to link any object documentation without accessing the IRD to directly call into a system image viewer to display them. bundles build from all the libraries. To that aim, we use the fully • A JIT rendering engine uses Jinja2, Quart [quart], Trio. qualified names of an object. Namely, each object is identified Quart is an async version of flask [flask]. This option by the concatenation of the module in which it is defined, with contains the most features, and therefore is the main one its local name. Nonetheless, several particular cases need specific used for development. This environment lets us iterate over treatment. the rendering engine rapidly. When exploring the User In- • To mirror the Python syntax, is it easy to use . to terface design and navigation, we found that a list of back concatenate both parts. Unfortunately, that leads to some references has limited uses. Indeed, it is can be challenging ambiguity when modules re-export functions have the to judge the relevance of back references, as well as their same name. For example, if one types relationship to each other. By playing with a network # module mylib/__init__.py graph visualisation (see Figure 5)), we can identify clusters of similar information within back references. Of course, from .mything import mything this identification has limits especially when pages have a then mylib.mything is ambiguous both with respect large number of back references (where the graph becomes to the mything submodule, and the reexported object. too busy). This illustrate as well a strength of the Papyri In future versions, the chosen convention will use : as a architecture: creating this network visualization did not module/name separator. require any regeneration of the documentation, one simply • Decorated functions or other dynamic approaches to ex- updates the template and re-renders the current page as pose functions to users end up having <local>> in their needed. fully qualified names, which is invalid. • A static AOT rendering of all the existing pages that can • Many built-in functions (np.sin, np.cos, etc.) do not be rendered ahead of time uses the same class as the JIT have a fully qualified name that can be extracted by object rendering. Basically, this loops through all entries in the introspection. We believe it should be possible to identify SQLite database and renders each item independently. This those via other means like docstring hash (to be explored). renderer is mostly used for exhaustive testing and perfor- • Fully qualified names are often not canonical names (i.e. mance measures for Papyri. This can render most of the the name typically used for import). While we made efforts API documentation of IPython, Astropy [astropy], Dask to create a mapping from one to another, finding the canon- and distributed [Dask], Matplotlib [MPL], [MPL-DOI], ical name automatically is not always straightforward. Networkx [NX], NumPy [NP], Pandas, Papyri, SciPy, • There are also challenges with case sensitivity. For ex- Scikit-image and others. It can represent ~28000 pages ample for MacOS file systems, a couple of objects may in ~60 seconds (that is ~450 pages/s on a recent Macbook unfortunately refer to the same IRD file on disk. To address pro M1). this, a case-sensitive hash is appended at the end of the For all of the above renderers, profiling shows that docu- filename. mentation rendering is mostly limited by object de-serialisation • Many libraries have a syntax that looks right once ren- from disk and Jinja2 templating engine. In the early project dered to HTML while not following proper syntax, or a development phase, we attempted to write a static HTML renderer syntax that relies on specificities of Docutils and Sphinx in a compiled language (like Rust, using compiled and typed rendering/parsing. checked templates). This provided a speedup of roughly a factor • Many custom directive plugins cannot be reused from 10. However, its implementation is now out of sync with the main Sphinx. These will need to be reimplemented. Papyri code base. Finally, a JupyterLab extension is currently in progress. The Future possibilities documentation then presents itself as a side-panel and is capable Beyond what has been presented in this paper, there are several of basic browsing and rendering (see Figure 1 and Figure 4). The opportunities to improve and extend what Papyri can allow for the model uses typescript, react and native JupyterLab component. scientific Python ecosystem. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 81 Fig. 5: Local graph (made with D3.js [D3js]) representing the connections among the most important nodes around current page across many libraries, when viewing numpy.ndarray. Nodes are sized with respect to the number of incomming links, and colored with respect to their library. This graph is generated at rendering time, and is updated depending on the libraries currently installed. This graph helps identify related functions and documentation. It can become challenging to read for highly connected items as seen here for numpy.ndarray. The first area is the ability to build IRD bundles on Continuous Integration platforms. Services like GitHub action, Azure pipeline and many others are already setup to test packages. We hope to leverage this infrastructure to build IRD files and make them available to users. A second area is hosting of intermediate IRD files. While the current prototype is hosted by http index using GitHub pages, it is likely not a sustainable hosting platform as disk space is limited. To our knowledge, IRD files are smaller in size than HTML documentation, we hope that other platforms like Read the Docs can be leveraged. This could provide a single domain that renders the documentation for multiple libraries, thus avoiding the display of many library subdomains. This contributes to giving a more unified experience for users. It should be possible for projects to avoid using many dy- namic docstrings interpolation that are used to document *args and **kwargs. This would make sources easier to read, and potentially have some speedup at the library import time. Once a (given and appropriately used by its users) library uses an IDE that supports Papyri for documentation, docstring syntax could be exchanged for markdown. As IRD files are structured, it should be feasible to provide cross-version information in the documentation. For example, if one installs multiple versions of IRD bundles for a library, then assuming the user does not use the latest version, the renderer Fig. 4: Example of extended view of the Papyri documentation for could inspect IRD files from previous/future versions to indi- Jupyterlab extension (here for SciPy). Code examples can now include cate the range of versions for which the documentation has not plots. Most token in each examples are linked to the corresponding changed. Upon additional efforts, it should be possible to infer page. Early navigation bar is visible at the top. when a parameter was removed, or will be removed, or to simply display the difference between two versions. 82 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Conclusion [RTD-theme] https://sphinx-rtd-theme.readthedocs.io/en/stable/ [RTD] https://readthedocs.org/ To address some of the current limitations in documentation [SCU] https://en.wikipedia.org/wiki/Single_Compilation_ accessibility, building and maintaining, we have provided a new Unit documentation framework called Papyri. We presented its features [SP] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, and underlying implementation choices (such as crosslink main- Evgeni Burovski, Pearu Peterson, Warren Weckesser, tenance, decoupling building and rendering phases, enriching the Jonathan Bright, Stéfan J. van der Walt, Matthew rendering features, using the IRD format to create a unified syntax Brett, Joshua Wilson, K. Jarrod Millman, Nikolay structure, etc.). While the project is still at its early stage, clear Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, impacts can already be seen on the availability of high-quality Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef documentation for end-users, and on the workload reduction for Perktold, Robert Cimrman, Ian Henriksen, E.A. Quin- maintainers. Building IRD format opened a wide range of tech- tero, Charles R Harris, Anne M. Archibald, Antônio nical possibilities, and contributes to improving users’ experience H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2020) SciPy 1.0: Fundamen- (and therefore the success of the scientific Python ecosystem). This tal Algorithms for Scientific Computing in Python. may become necessary for users to navigate in an exponentially Nature Methods, 17(3), 261-272. 10.1038/s41592- growing ecosystem. 019-0686-2 [Spyder] https://www.spyder-ide.org/ [TSRST] https://github.com/stsewd/tree-sitter-rst Acknowledgments [TS] https://tree-sitter.github.io/tree-sitter/ [astropy] The Astropy Project: Building an inclusive, open- The authors want to thank S. Gallegos (author of tree-sitter-rst), J. science project and status of the v2.0 core package, L. Cano Rodríguez and E. Holscher (Read The Docs), C. Holdgraf https://doi.org/10.48550/arXiv.1801.02634 (2i2c), B. Granger and F. Pérez (Jupyter Project), T. Allard and I. [docutils] https://docutils.sourceforge.io/ [flask] https://flask.palletsprojects.com/en/2.1.x/ Presedo-Floyd (QuanSight) for their useful feedback and help on [httpx] https://www.python-httpx.org/ this project. [mkdocs] https://www.mkdocs.org/ [msgspec] https://pypi.org/project/msgspec [pandoc] https://pandoc.org/ Funding [pydantic] https://pydantic-docs.helpmanual.io/ M. B. received a 2-year grant from the Chan Zuckerberg Initia- [pydata-sphinx-theme] https://pydata-sphinx-theme.readthedocs.io/en/stable/ [quart] https://pgjones.gitlab.io/quart/ tive (CZI) Essential Open Source Software for Science (EOS) [sphinx-copybutton] https://sphinx-copybutton.readthedocs.io/en/latest/ – EOSS4-0000000017 via the NumFOCUS 501(3)c non profit to [sphinx] https://www.sphinx-doc.org/en/master/ develop the Papyri project. [Trio] https://trio.readthedocs.io/ R EFERENCES [AOT] https://en.wikipedia.org/wiki/Ahead-of-time_ compilation [CFRG] conda-forge community. (2015). The conda-forge Project: Community-based Software Distribution Built on the conda Package Format and Ecosystem. Zenodo. http://doi.org/10.5281/zenodo.4774216 [CODEMETA] https://codemeta.github.io/ [D3js] https://d3js.org/ [DOCREPR] https://github.com/spyder-ide/docrepr [DT] https://diataxis.fr/ [Dask] Dask Development Team (2016). Dask: Library for dynamic task scheduling, https://dask.org [IR] https://en.wikipedia.org/wiki/Intermediate_ representation [JEDI] https://github.com/davidhalter/jedi [JIT] https://en.wikipedia.org/wiki/Just-in-time_ compilation [JPYBOOK] https://jupyterbook.org/ [Jinja2] https://jinja.palletsprojects.com/ [LTO] https://en.wikipedia.org/wiki/Interprocedural_ optimization [MPL-DOI] https://doi.org/10.5281/zenodo.6513224 [MPL] J.D. Hunter, "Matplotlib: A 2D Graphics Environ- ment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007, [MYST] https://myst-parser.readthedocs.io/en/latest/ [NPDOC] https://numpydoc.readthedocs.io/en/latest/format.html [NP] Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Ar- ray programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2 [NX] Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, “Exploring network structure, dynamics, and function using NetworkX”, in Proceedings of the 7th Python in Science Conference (SciPy2008), Gäel Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008 [Papyri] https://github.com/jupyter/papyri PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 83 Bayesian Estimation and Forecasting of Time Series in statsmodels Chad Fulton‡∗ F Abstract—Statsmodels, a Python library for statistical and econometric inference for the well-developed stable of time series models analysis, has traditionally focused on frequentist inference, including in its mod- in statsmodels, and providing access to the rich associated els for time series data. This paper introduces the powerful features for Bayesian feature set already mentioned, presents a complementary option inference of time series models that exist in statsmodels, with applications to these more general-purpose libraries.1 to model fitting, forecasting, time series decomposition, data simulation, and impulse response functions. Time series analysis in statsmodels Index Terms—time series, forecasting, bayesian inference, Markov chain Monte A time series is a sequence of observations ordered in time, and Carlo, statsmodels time series data appear commonly in statistics, economics, finance, climate science, control systems, and signal processing, among Introduction many other fields. One distinguishing characteristic of many time Statsmodels [SP10] is a well-established Python library for series is that observations that are close in time tend to be more statistical and econometric analysis, with support for a wide range correlated, a feature known as autocorrelation. While successful of important model classes, including linear regression, ANOVA, analyses of time series data must account for this, statistical generalized linear models (GLM), generalized additive models models can harness it to decompose a time series into trend, (GAM), mixed effects models, and time series models, among seasonal, and cyclical components, produce forecasts of future many others. In most cases, model fitting proceeds by using data, and study the propagation of shocks over time. frequentist inference, such as maximum likelihood estimation We now briefly review the models for time series data that are (MLE). In this paper, we focus on the class of time series available in statsmodels and describe their features.2 models [MPS11], support for which has grown substantially in Exponential smoothing models statsmodels over the last decade. After introducing several Exponential smoothing models are constructed by combining of the most important new model classes – which are by default one or more simple equations that each describe some aspect fitted using MLE – and their features – which include forecasting, of the evolution of univariate time series data. While originally time series decomposition and seasonal adjustment, data simula- somewhat ad hoc, these models can be defined in terms of a tion, and impulse response analysis – we describe the powerful proper statistical model (for example, see [HKOS08]). They have functions that enable users to apply Bayesian methods to a wide enjoyed considerable popularity in forecasting (for example, see range of time series models. Support for Bayesian inference in Python outside of the implementation in R described by [HA18]). A prototypical statsmodels has also grown tremendously, particularly in example that allows for trending data and a seasonal component the realm of probabilistic programming, and includes powerful – often known as the additive "Holt-Winters’ method" – can be libraries such as PyMC3 [SWF16], PyStan [CGH+ 17], and written as TensorFlow Probability [DLT+ 17]. Meanwhile, ArviZ lt = α(yt − st−m ) + (1 − α)(lt−1 + bt−1 ) [KCHM19] provides many excellent tools for associated diagnos- bt = β (lt − lt−1 ) + (1 − β )bt−1 tics and vizualisations. The aim of these libraries is to provide st = γ(yt − lt−1 − bt−1 ) + (1 − γ)st−m support for Bayesian analysis of a large class of models, and they make available both advanced techniques, including auto- where lt is the level of the series, bt is the trend, st is the tuning algorithms, and flexible model specification. By contrast, seasonal component of period m, and α, β , γ are parameters of here we focus on simpler techniques. However, while the libraries the model. When augmented with an error term with some given above do include some support for time series models, this has probability distribution (usually Gaussian), likelihood-based infer- not been their primary focus. As a result, introducing Bayesian ence can be used to estimate the parameters. In statsmodels, * Corresponding author: chad.t.fulton@frb.gov 1. In addition, it is possible to combine the sampling algorithms of PyMC3 ‡ Federal Reserve Board of Governors with the time series models of statsmodels, although we will not discuss this approach in detail here. See, for example, https://www.statsmodels.org/v0. Copyright © 2022 Chad Fulton. This is an open-access article distributed 13.0/examples/notebooks/generated/statespace_sarimax_pymc3.html. under the terms of the Creative Commons Attribution License, which permits 2. In addition to statistical models, statsmodels also provides a number unrestricted use, distribution, and reproduction in any medium, provided the of tools for exploratory data analysis, diagnostics, and hypothesis testing original author and source are credited. related to time series data; see https://www.statsmodels.org/stable/tsa.html. 84 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) additive exponential smoothing models can be constructed using # ARMA(1, 1) model with explanatory variable the statespace.ExponentialSmoothing class.3 The fol- X = mdata['realint'] model_arma11 = sm.tsa.ARIMA( lowing code shows how to apply the additive Holt-Winters model y, order=(1, 0, 1), exog=X) above to model quarterly data on consumer prices: # SARIMAX(p, d, q)x(P, D, Q, s) model import statsmodels.api as sm model_sarimax = sm.tsa.ARIMA( # Load data y, order=(p, d, q), seasonal_order=(P, D, Q, s)) mdata = sm.datasets.macrodata.load().data While this class of models often produces highly competitive # Compute annualized consumer price inflation y = np.log(mdata['cpi']).diff().iloc[1:] * 400 forecasts, it does not produce a decomposition of a time series into, for example, trend and seasonal components. # Construct the Holt-Winters model model_hw = sm.tsa.statespace.ExponentialSmoothing( Vector autoregressive models y, trend=True, seasonal=12) While the SARIMAX models above handle univariate series, statsmodels also has support for the multivariate generaliza- Structural time series models tion to vector autoregressive (VAR) models.5 These models are Structural time series models, introduced by [Har90] and also written sometimes known as unobserved components models, similarly yt = ν + Φ1 yt−1 + · · · + Φ p yt−p + εt decompose a univariate time series into trend, seasonal, cyclical, and irregular components: where yt is now considered as an m × 1 vector. As a result, the intercept ν is also an m × 1 vector, the coefficients Φi are each yt = µt + γt + ct + εt m × m matrices, and the error term is εt ∼ N(0m , Ω), with Ω an where µt is the trend, γt is the seasonal component, ct is the cycli- m×m matrix. These models can be constructed in statsmodels cal component, and εt ∼ N(0, σ 2 ) is the error term. However, this using the VARMAX class, as follows6 equation can be augmented in many ways, for example to include # Multivariate dataset explanatory variables or an autoregressive component. In addition, z = (np.log(mdata['realgdp', 'realcons', 'cpi']) .diff().iloc[1:]) there are many possible specifications for the trend, seasonal, and cyclical components, so that a wide variety of time series # VAR(1) model characteristics can be accommodated. In statsmodels, these model_var = sm.tsa.VARMAX(z, order=(1, 0)) models can be constructed from the UnobservedComponents class; a few examples are given in the following code: Dynamic factor models # "Local level" model statsmodels also supports a second model for multivariate model_ll = sm.tsa.UnobservedComponents(y, 'llevel') time series: the dynamic factor model (DFM). These models, often # "Local linear trend", with seasonal component model_arma11 = sm.tsa.UnobservedComponents( used for dimension reduction, posit a few unobserved factors, with y, 'lltrend', seasonal=4) autoregressive dynamics, that are used to explain the variation in the observed dataset. In statsmodels, there are two model These models have become popular for time series analysis and classes, DynamicFactor` and DynamicFactorMQ, that can forecasting, as they are flexible and the estimated components are fit versions of the DFM. Here we focus on the DynamicFactor intuitive. Indeed, Google’s Causal Impact library [BGK+ 15] uses class, for which the model can be written a Bayesian structural time series approach directly, and Facebook’s Prophet library [TL17] uses a conceptually similar framework and yt = Λ ft + εt is estimated using PyStan. ft = Φ1 ft−1 + · · · + Φ p ft−p + ηt Autoregressive moving-average models Here again, the observation is assumed to be m × 1, but the factors are k × 1, where it is possible that k << m. As before, we assume Autoregressive moving-average (ARMA) models, ubiquitous in conformable coefficient matrices and Gaussian errors. time series applications, are well-supported in statsmodels, The following code shows how to construct a DFM in including their generalizations, abbreviated as "SARIMAX", that statsmodels allow for integrated time series data, explanatory variables, and seasonal effects.4 A general version of this model, excluding # DFM with 2 factors that evolve as a VAR(3) model_dfm = sm.tsa.DynamicFactor( integration, can be written as z, k_factors=2, factor_order=3) yt = xt β + ξt ξt = φ1 ξt−1 + · · · + φ p ξt−p + εt + θ1 εt−1 + · · · + θq εt−q Linear Gaussian state space models In statsmodels, each of the model classes introduced where εt ∼ N(0, σ 2 ). These are constructed in statsmodels above ( statespace.ExponentialSmoothing, with the ARIMA class; the following code shows how to construct UnobservedComponents, ARIMA, VARMAX, a variety of autoregressive moving-average models for consumer price data: 4. Note that in statsmodels, models with explanatory variables are in # AR(2) model the form of "regression with SARIMA errors". model_ar2 = sm.tsa.ARIMA(y, order=(2, 0, 0)) 5. statsmodels also supports vector moving-average (VMA) models using the same model class as described here for the VAR case, but, for brevity, 3. A second class, ETSModel, can also be used for both additive and we do not explicitly discuss them here. multiplicative models, and can exhibit superior performance with maximum 6. A second class, VAR, can also be used to fit VAR models, using least likelihood estimation. However, it lacks some of the features relevant for squares. However, it lacks some of the features relevant for Bayesian inference Bayesian inference discussed in this paper. discussed in this paper. BAYESIAN ESTIMATION AND FORECASTING OF TIME SERIES IN STATSMODELS 85 Fig. 1: Selected functionality of state space models in statsmodels. DynamicFactor, and DynamicFactorMQ) are implemented fcast = results_ll.forecast(4) as part of a broader class of models, referred to as linear Gaussian # Produce a draw from the posterior distribution state space models (hereafter for brevity, simply "state space # of the state vector models" or SSM). This class of models can be written as sim_ll.simulate() draw = sim_ll.simulated_state yt = dt + Zt αt + εt εt ∼ N(0, Ht ) αt+1 = ct + Tt αt + Rt ηt ηt ∼ N(0, Qt ) Nearly identical code could be used for any of the model classes introduced above, since they are all implemented as part of the where αt represents an unobserved vector containing the "state" same state space model framework. In the next section, we show of the dynamic system. In general, the model is multivariate, with how these features can be used to perform Bayesian inference with yt and εt m × 1 vector, αt k × 1, and ηt r times 1. these models. Powerful tools exist for state space models to estimate the values of the unobserved state vector, compute the value of the likelihood function for frequentist inference, and perform posterior sampling for Bayesian inference. These tools include the Bayesian inference via Markov chain Monte Carlo celebrated Kalman filter and smoother and a simulation smoother, all of which are important for conducting Bayesian inference for We begin by giving a cursory overview of the key elements these models.7 The implementation in statsmodels largely of Bayesian inference required for our purposes here.8 In brief, follows the treatment in [DK12], and is described in more detail the Bayesian approach stems from Bayes’ theorem, in which in [Ful15]. the posterior distribution for an object of interest is derived as In addition to these key tools, state space models also admit proportional to the combination of a prior distribution and the general implementations of useful features such as forecasting, likelihood function data simulation, time series decomposition, and impulse response analysis. As a consequence, each of these features extends to each p(A|B) ∝ p(B|A) × p(A) of the time series models described above. Figure 1 presents a | {z } | {z } |{z} diagram showing how to produce these features, and the code posterior likelihood prior below briefly introduces a subset of them. # Construct the Model Here, we will be interested in the posterior distribution of the pa- model_ll = sm.tsa.UnobservedComponents(y, 'llevel') rameters of our model and of the unobserved states, conditional on the chosen model specification and the observed time series data. # Construct a simulation smoother sim_ll = model_ll.simulation_smoother() While in most cases the form of the posterior cannot be derived an- alytically, simulation-based methods such as Markov chain Monte # Parameter values (variance of error and Carlo (MCMC) can be used to draw samples that approximate # variance of level innovation, respectively) the posterior distribution nonetheless. While PyMC3, PyStan, params = [4, 0.75] and TensorFlow Probability emphasize Hamiltonian Monte Carlo # Compute the log-likelihood of these parameters (HMC) and no-U-turn sampling (NUTS) MCMC methods, we llf = model_ll.loglike(params) focus on the simpler random walk Metropolis-Hastings (MH) and # `smooth` applies the Kalman filter and smoother Gibbs sampling (GS) methods. These are standard MCMC meth- # with a given set of parameters and returns a ods that have enjoyed great success in time series applications and # Results object which are simple to implement, given the state space framework results_ll = model_ll.smooth(params) already available in statsmodels. In addition, the ArviZ library # Produce forecasts for the next 4 periods is designed to work with MCMC output from any source, and we can easily adapt it to our use. 7. Statsmodels currently contains two implementations of simulation With either Metropolis-Hastings or Gibbs sampling, our pro- smoothers for the linear Gaussian state space model. The default is the "mean cedure will produce a sequence of sample values (of parameters correction" simulation smoother of [DK02]. The precision-based simulation smoother of [CJ09] can alternatively be used by specifying method='cfa' and / or the unobserved state vector) that approximate draws from when creating the simulation smoother object. the posterior distribution arbitrarily well, as the number of length 86 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) of the chain of samples becomes very large. Random walk Metropolis-Hastings In random walk Metropolis-Hastings (MH), we begin with an arbi- trary point as the initial sample, and then iteratively construct new samples in the chain as follows. At each iteration, (a) construct a proposal by perturbing the previous sample by a Gaussian random variable, and then (b) accept the proposal with some probability. If a proposal is accepted, it becomes the next sample in the chain, while if it is rejected then the previous sample value is carried over. Here, we show how to implement Metropolis-Hastings estimation of the variance parameter in a simple model, which only requires the use of the log-likelihood computation introduced above. Fig. 2: Approximate posterior distribution of variance parameter, import arviz as az random walk model, Metropolis-Hastings; U.S. Industrial Production. from scipy import stats # Construct the model model_rw = sm.tsa.UnobservedComponents(y, 'rwalk') # Specify the prior distribution. With MH, this # can be freely chosen by the user prior = stats.uniform(0.0001, 100) # Specify the Gaussian perturbation distribution perturb = stats.norm(scale=0.1) # Storage niter = 100000 samples_rw = np.zeros(niter + 1) # Initialization samples_rw[0] = y.diff().var() llf = model_rw.loglike(samples_rw[0]) prior_llf = prior.logpdf(samples_rw[0]) Fig. 3: Approximate posterior joint distribution of variance parame- # Iterations ters, local level model, Gibbs sampling; CPI inflation. for i in range(1, niter + 1): # Compute the proposal value proposal = samples_rw[i - 1] + perturb.rvs() Gibbs sampling # Compute the acceptance probability Gibbs sampling (GS) is a special case of Metropolis-Hastings proposal_llf = model_rw.loglike(proposal) (MH) that is applicable when it is possible to produce draws proposal_prior_llf = prior.logpdf(proposal) accept_prob = np.exp( directly from the conditional distributions of every variable, even proposal_llf - llf though it is still not possible to derive the general form of the joint + prior_llf - proposal_prior_llf) posterior. While this approach can be superior to random walk MH when it is applicable, the ability to derive the conditional # Accept or reject the value if accept_prob > stats.uniform.rvs(): distributions typically requires the use of a "conjugate" prior – i.e., samples_rw[i] = proposal a prior from some specific family of distributions. For example, llf = proposal_llf above we specified a uniform distribution as the prior when prior_llf = proposal_prior_llf else: sampling via MH, but that is not possible with Gibbs sampling. samples_rw[i] = samples_rw[i - 1] Here, we show how to implement Gibbs sampling estimation of the variance parameter, now making use of an inverse Gamma # Convert for use with ArviZ and plot posterior prior, and the simulation smoother introduced above. samples_rw = az.convert_to_inference_data( samples_rw) # Construct the model and simulation smoother # Eliminate the first 10000 samples as burn-in; model_ll = sm.tsa.UnobservedComponents(y, 'llevel') # thin by factor of 10 to reduce autocorrelation sim_ll = model_ll.simulation_smoother() az.plot_posterior(samples_rw.posterior.sel( {'draw': np.s_[10000::10]}), kind='bin', # Specify the prior distributions. With GS, we must point_estimate='median') # choose an inverse Gamma prior for each variance priors = [stats.invgamma(0.01, scale=0.01)] * 2 The approximate posterior distribution, constructed from the sam- ple chain, is shown in Figure 2. # Storage niter = 100000 samples_ll = np.zeros((niter + 1, 2)) 8. While a detailed description of these issues is out of the scope of this paper, there are many superb references on this topic. We refer the interested # Initialization reader to [WH99], which provides a book-length treatment of Bayesian samples_ll[0] = [y.diff().var(), 1e-5] inference for state space models, and [KN99], which provides many examples and applications. # Iterations BAYESIAN ESTIMATION AND FORECASTING OF TIME SERIES IN STATSMODELS 87 for i in range(1, niter + 1): # (a) Update the model parameters model_ll.update(samples_ll[i - 1]) # (b) Draw from the conditional posterior of # the state vector sim_ll.simulate() sample_state = sim_ll.simulated_state.T # (c) Compute / draw from conditional posterior # of the parameters: # ...observation error variance resid = y - sample_state[:, 0] post_shape = len(resid) / 2 + 0.01 post_scale = np.sum(resid**2) / 2 + 0.01 samples_ll[i, 0] = stats.invgamma( post_shape, scale=post_scale).rvs() # ...level error variance resid = sample_state[1:] - sample_state[:-1] post_shape = len(resid) / 2 + 0.01 Fig. 4: Data and forecast with 80% credible interval; U.S. Industrial post_scale = np.sum(resid**2) / 2 + 0.01 Production. samples_ll[i, 1] = stats.invgamma( post_shape, scale=post_scale).rvs() # Convert for use with ArviZ and plot posterior samples_ll = az.convert_to_inference_data( {'parameters': samples_ll[None, ...]}, coords={'parameter': model_ll.param_names}, dims={'parameters': ['parameter']}) az.plot_pair(samples_ll.posterior.sel( {'draw': np.s_[10000::10]}), kind='hexbin'); The approximate posterior distribution, constructed from the sam- ple chain, is shown in Figure 3. Illustrative examples For clarity and brevity, the examples in the previous section gave results for simple cases. However, these basic methods carry through to each of the models introduced earlier, including in cases with multivariate data and hundreds of parameters. Moreover, the Metropolis-Hastings approach can be combined with the Gibbs sampling approach, so that if the end user wishes to use Gibbs sampling for some parameters, they are not restricted to choose only conjugate priors for all parameters. In addition to sampling the posterior distributions of the parameters, this method allows sampling other objects of inter- est, including forecasts of observed variables, impulse response functions, and the unobserved state vector. This last possibility is especially useful in cases such as the structural time series Fig. 5: Estimated level, trend, and seasonal components, with 80% model, in which the unobserved states correspond to interpretable credible interval; U.S. Industrial Production. elements such as the trend and seasonal components. We provide several illustrative examples of the various types of analysis that are possible. model = sm.tsa.UnobservedComponents( y, 'lltrend', seasonal=12) Forecasting and Time Series Decomposition To produce the time-series decomposition into level, trend, and In our first example, we apply the Gibbs sampling approach to seasonal components, we will use samples from the posterior of a structural time series model in order to forecast U.S. Industrial the state vector (µt , βt , γt ) for each time period t. These are im- Production and to produce a decomposition of the series into level, mediately available when using the Gibbs sampling approach; in trend, and seasonal components. The model is the earlier example, the draw at each iteration was assigned to the yt = µt + γt + εt observation equation variable sample_state. To produce forecasts, we need to draw from the posterior predictive distribution for horizons h = 1, 2, . . . H. µt = βt + µt−1 + ζt level This can be easily accomplished by using the simulate method βt = βt−1 + ξt trend introduced earlier. To be concrete, we can accomplish these tasks γt = γt−s + ηt seasonal by modifying section (b) of our Gibbs sampler iterations as Here, we set the seasonal periodicity to s=12, since Industrial follows: Production is a monthly variable. We can construct this model 9. This model is often referred to as a "local linear trend" model (with in Statsmodels as9 additionally a seasonal component); lltrend is an abbreviation of this name. 88 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 6: "Causal impact" of COVID-19 on U.S. Sales in Manufacturing and Trade Industries. # (b') Draw from the conditional posterior of on U.S. Sales in Manufacturing and Trade Industries.11 # the state vector model.update(params[i - 1]) sim.simulate() Extensions # save the draw for use later in time series # decomposition There are many extensions to the time series models presented states[i] = sim.simulated_state.T here that are made possible when using Bayesian inference. # Draw from the posterior predictive distribution First, it is easy to create custom state space models within the # using the `simulate` method statsmodels framework. As one example, the statsmodels n_fcast = 48 documentation describes how to create a model that extends the fcast[i] = model.simulate( params[i - 1], n_fcast, typical VAR described above with time-varying parameters.12 initial_state=states[i, -1]).to_frame() These custom state space models automatically inherit all the functionality described above, so that Bayesian inference can be These forecasts and the decomposition into level, trend, and sea- conducted in exactly the same way. sonal components are summarized in Figures 4 and 5, which show Second, because the general state space model available in the median values along with 80% credible intervals. Notably, the statsmodels and introduced above allows for time-varying intervals shown incorporate for both the uncertainty arising from system matrices, it is possible using Gibbs sampling methods the stochastic terms in the model as well as the need to estimate to introduce support for automatic outlier handling, stochastic the models’ parameters.10 volatility, and regime switching models, even though these are largely infeasible in statsmodels when using frequentist meth- Casual impacts ods such as maximum likelihood estimation.13 A closely related procedure described in [BGK+ 15] uses a Bayesian structural time series model to estimate the "causal Conclusion impact" of some event on some observed variable. This approach stops estimation of the model just before the date of an event This paper introduces the suite of time series models available in and produces a forecast by drawing from the posterior predictive statsmodels and shows how Bayesian inference using Markov density, using the procedure described just above. It then uses the chain Monte Carlo methods can be applied to estimate their difference between the actual path of the data and the forecast to parameters and produce analyses of interest, including time series estimate impact of the event. decompositions and forecasts. An example of this approach is shown in Figure 6, in which we 11. In this example, we used a local linear trend model with no seasonal use this method to illustrate the effect of the COVID-19 pandemic component. 12. For details, see https://www.statsmodels.org/devel/examples/notebooks/ 10. The popular Prophet library, [TL17], similarly uses an additive model generated/statespace_tvpvar_mcmc_cfa.html. combined with Bayesian sampling methods to produce forecasts and decom- 13. See, for example, [SW16] for an application of these techniques that positions, although its underlying model is a GAM rather than a state space handles outliers, [KSC98] for stochastic volatility, and [KN98] for an applica- model. tion to dynamic factor models with regime switching. BAYESIAN ESTIMATION AND FORECASTING OF TIME SERIES IN STATSMODELS 89 R EFERENCES [SWF16] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ [BGK+ 15] Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, Computer Science, 2:e55, April 2016. Publisher: PeerJ Inc. and Steven L. Scott. Inferring causal impact using Bayesian URL: https://peerj.com/articles/cs-55, doi:10.7717/peerj- structural time-series models. Annals of Applied Statistics, 9:247– cs.55. 274, 2015. doi:10.1214/14-aoas788. [TL17] Sean J. Taylor and Benjamin Letham. Forecasting at scale. [CGH+ 17] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Technical Report e3190v2, PeerJ Inc., September 2017. ISSN: Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, 2167-9843. URL: https://peerj.com/preprints/3190, doi:10. Jiqiang Guo, Peter Li, and Allen Riddell. Stan : A 7287/peerj.preprints.3190v2. Probabilistic Programming Language. Journal of Statisti- [WH99] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic cal Software, 76(1), January 2017. Institution: Columbia Models. Springer, New York, 2nd edition edition, March 1999. Univ., New York, NY (United States); Harvard Univ., Cam- 00000. bridge, MA (United States). URL: https://www.osti.gov/pages/ biblio/1430202-stan-probabilistic-programming-language, doi: 10.18637/jss.v076.i01. [CJ09] Joshua C.C. Chan and Ivan Jeliazkov. Efficient simulation and in- tegrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1-2):101–120, January 2009. Publisher: Inderscience Publish- ers. URL: https://www.inderscienceonline.com/doi/abs/10.1504/ IJMMNO.2009.03009. [DK02] J. Durbin and S. J. Koopman. A simple and efficient simula- tion smoother for state space time series analysis. Biometrika, 89(3):603–616, August 2002. URL: http://biomet.oxfordjournals. org/content/89/3/603, doi:10.1093/biomet/89.3.603. [DK12] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford University Press, May 2012. [DLT+ 17] Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous. TensorFlow Distributions. Technical Report arXiv:1711.10604, arXiv, November 2017. arXiv:1711.10604 [cs, stat] type: article. URL: http://arxiv.org/ abs/1711.10604, doi:10.48550/arXiv.1711.10604. [Ful15] Chad Fulton. Estimating time series models by state space methods in python: Statsmodels. 2015. [HA18] Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018. [Har90] Andrew C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990. [HKOS08] Rob Hyndman, Anne B. Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, June 2008. Google-Books-ID: GSyzox8Lu9YC. [KCHM19] Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo Mar- tin. ArviZ a unified library for exploratory analysis of Bayesian models in Python. Journal of Open Source Software, 4(33):1143, 2019. Publisher: The Open Journal. URL: https://doi.org/10. 21105/joss.01143, doi:10.21105/joss.01143. [KN98] Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning Points, A New Coincident Index, and Tests of Duration Depen- dence Based on a Dynamic Factor Model With Regime Switch- ing. The Review of Economics and Statistics, 80(2):188–201, May 1998. Publisher: MIT Press. URL: https://doi.org/10.1162/ 003465398557447, doi:10.1162/003465398557447. [KN99] Chang-Jin Kim and Charles R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. MIT Press Books, The MIT Press, 1999. URL: http://ideas.repec.org/b/mtp/titles/0262112388.html. [KSC98] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. The Review of Economic Studies, 65(3):361–393, July 1998. 01855. URL: http://restud.oxfordjournals.org/content/65/ 3/361, doi:10.1111/1467-937X.00050. [MPS11] Wes McKinney, Josef Perktold, and Skipper Seabold. Time Series Analysis in Python with statsmodels. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 10th Python in Science Conference, pages 107 – 113, 2011. doi:10.25080/ Majora-ebaa42b7-012. [SP10] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and Statistical Modeling with Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 92 – 96, 2010. doi:10.25080/Majora- 92bf1922-011. [SW16] James H. Stock and Mark W. Watson. Core Inflation and Trend Inflation. Review of Economics and Statistics, 98(4):770–784, March 2016. 00000. URL: http://dx.doi.org/10.1162/REST_a_ 00608, doi:10.1162/REST_a_00608. 90 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Python vs. the pandemic: a case study in high-stakes software development Cliff C. Kerr‡§∗ , Robyn M. Stuart¶k , Dina Mistry∗∗ , Romesh G. Abeysuriyak , Jamie A. Cohen‡ , Lauren George†† , Michał Jastrzebski‡‡ , Michael Famulare‡ , Edward Wenger‡ , Daniel J. Klein‡ F Abstract—When it became clear in early 2020 that COVID-19 was going to modeling, and drug discovery made it well placed to contribute to be a major public health threat, politicians and public health officials turned to a global pandemic response plan. Founded in 2008, the Institute academic disease modelers like us for urgent guidance. Academic software for Disease Modeling (IDM) has provided analytical support for development is typically a slow and haphazard process, and we realized that BMGF (which it has been a part of since 2020) and other global business-as-usual would not suffice for dealing with this crisis. Here we describe health partners, with a focus on eradicating malaria and polio. the case study of how we built Covasim (covasim.org), an agent-based model of COVID-19 epidemiology and public health interventions, by using standard Since its creation, IDM has built up a portfolio of computational Python libraries like NumPy and Numba, along with less common ones like tools to understand, analyze, and predict the dynamics of different Sciris (sciris.org). Covasim was created in a few weeks, an order of magnitude diseases. faster than the typical model development process, and achieves performance When "coronavirus disease 2019" (COVID-19) and the virus comparable to C++ despite being written in pure Python. It has become one that causes it (SARS-CoV-2) were first identified in late 2019, of the most widely adopted COVID models, and is used by researchers and our team began summarizing what was known about the virus policymakers in dozens of countries. Covasim’s rapid development was enabled [Fam19]. By early February 2020, even though it was more than not only by leveraging the Python scientific computing ecosystem, but also by a month before the World Health Organization (WHO) declared adopting coding practices and workflows that lowered the barriers to entry for a pandemic [Med20], it had become clear that COVID-19 would scientific contributors without sacrificing either performance or rigor. become a major public health threat. The outbreak on the Diamond Index Terms—COVID-19, SARS-CoV-2, Epidemiology, Mathematical modeling, Princess cruise ship [RSWS20] was the impetus for us to start NumPy, Numba, Sciris modeling COVID in detail. Specifically, we needed a tool to (a) incorporate new data as soon as it became available, (b) explore policy scenarios, and (c) predict likely future epidemic trajectories. Background The first step was to identify which software tool would form For decades, scientists have been concerned about the possibility the best starting point for our new COVID model. Infectious of another global pandemic on the scale of the 1918 flu [Gar05]. disease models come in two major types: agent-based models track Despite a number of "close calls" – including SARS in 2002 the behavior of individual "people" (agents) in the simulation, [AFG+ 04]; Ebola in 2014-2016 [Tea14]; and flu outbreaks in- with each agent’s behavior represented by a random (probabilis- cluding 1957, 1968, and H1N1 in 2009 [SHK16], some of which tic) process. Compartmental models track populations of people led to 1 million or more deaths – the last time we experienced over time, typically using deterministic difference equations. The the emergence of a planetary-scale new pathogen was when HIV richest modeling framework used by IDM at the time was EMOD, spread globally in the 1980s [CHL+ 08]. which is a multi-disease agent-based model written in C++ and In 2015, Bill Gates gave a TED talk stating that the world was based on JSON configuration files [BGB+ 18]. We also considered not ready to deal with another pandemic [Hof20]. While the Bill Atomica, a multi-disease compartmental model written in Python & Melinda Gates Foundation (BMGF) has not historically focused and based on Excel input files [KAK+ 19]. However, both of on pandemic preparedness, its expertise in disease surveillance, these options posed significant challenges: as a compartmental model, Atomica would have been unable to capture the individual- * Corresponding author: cliff@covasim.org level detail necessary for modeling the Diamond Princess out- ‡ Institute for Disease Modeling, Bill & Melinda Gates Foundation, Seattle, break (such as passenger-crew interactions); EMOD had sufficient USA flexibility, but developing new disease modules had historically § School of Physics, University of Sydney, Sydney, Australia ¶ Department of Mathematical Sciences, University of Copenhagen, Copen- required months rather than days. hagen, Denmark As a result, we instead started developing Covasim ("COVID- || Burnet Institute, Melbourne, Australia 19 Agent-based Simulator") [KSM+ 21] from a nascent agent- ** Twitter, Seattle, USA based model written in Python, LEMOD-FP ("Light-EMOD for †† Microsoft, Seattle, USA ‡‡ GitHub, San Francisco, USA Family Planning"). LEMOD-FP was used to model reproductive health choices of women in Senegal; this model had in turn Copyright © 2022 Cliff C. Kerr et al. This is an open-access article distributed been based on an even simpler agent-based model of measles under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the vaccination programs in Nigeria ("Value-of-Information Simula- original author and source are credited. tor" or VoISim). We subsequently applied the lessons we learned PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 91 scientific computing libraries. Software architecture and implementation Covasim conceptual design and usage Covasim is a standard susceptible-exposed-infectious-recovered (SEIR) model (Fig. 3). As noted above, it is an agent-based model, meaning that individual people and their interactions with one another are simulated explicitly (rather than implicitly, as in a compartmental model). The fundamental calculation that Covasim performs is to determine the probability that a given person, on a given time step, will change from one state to another, such as from susceptible to exposed (i.e., that person was infected), from undiagnosed to diagnosed, or from critically ill to dead. Covasim is fully open- source and available on GitHub (http://covasim.org) and PyPI (pip install covasim), and comes with comprehensive documentation, including tutorials (http://docs.covasim.org). The first principle of Covasim’s design philosophy is that "Common tasks should be simple" – for example, defining pa- rameters, running a simulation, and plotting results. The following example illustrates this principle; it creates a simulation with a custom parameter value, runs it, and plots the results: Fig. 1: Daily reported global COVID-19-related deaths (top; import covasim as cv smoothed with a one-week rolling window), relative to the timing of cv.Sim(pop_size=100e3).run().plot() known variants of concern (VOCs) and variants of interest (VOIs), as The second principle of Covasim’s design philosophy is "Un- well as Covasim releases (bottom). common tasks can’t always be simple, but they still should be possible." Examples include writing a custom goodness-of-fit from developing Covasim to turn LEMOD-FP into a new family function or defining a new population structure. To some extent, planning model, "FPsim", which will be launched later this year the second principle is at odds with the first, since the more [OVCC+ 22]. flexibility an interface has, typically the more complex it is as Parallel to the development of Covasim, other research teams well. at IDM developed their own COVID models, including one based To illustrate the tension between these two principles, the on the EMOD framework [SWC+ 22], and one based on an earlier following code shows how to run two simulations to determine the influenza model [COSF20]. However, while both of these models impact of a custom intervention aimed at protecting the elderly in saw use in academic contexts [KCP+ 20], neither were able to Japan, with results shown in Fig. 4: incorporate new features quickly enough, or were easy enough to import covasim as cv use, for widespread external adoption in a policy context. # Define a custom intervention Covasim, by contrast, had immediate real-world impact. The def elderly(sim, old=70): first version was released on 10 March 2020, and on 12 March if sim.t == sim.day('2020-04-01'): elderly = sim.people.age > old 2020, its output was presented by Washington State Governor Jay sim.people.rel_sus[elderly] = 0.0 Inslee during a press conference as justification for school closures and social distancing measures [KMS+ 21]. # Set custom parameters Since the early days of the pandemic, Covasim releases have pars = dict( pop_type = 'hybrid', # More realistic population coincided with major events in the pandemic, especially the iden- location = 'japan', # Japan's population pyramid tification of new variants of concern (Fig. 1). Covasim was quickly pop_size = 50e3, # Have 50,000 people total adopted globally, including applications in the UK regarding pop_infected = 100, # 100 infected people n_days = 90, # Run for 90 days school closures [PGKS+ 20], Australia regarding outbreak control ) [SAK+ 21], and Vietnam regarding lockdown measures [PSN+ 21]. To date, Covasim has been downloaded from PyPI over # Run multiple sims in parallel and plot key results 100,000 times [PeP22], has been used in dozens of academic label = 'Protect the elderly' s1 = cv.Sim(pars, label='Default') studies [KMS+ 21], and informed decision-making on every con- s2 = cv.Sim(pars, interventions=elderly, label=label) tinent (Fig. 2), making it one of the most widely used COVID msim = cv.parallel(s1, s2) models [KSM+ 21]. We believe key elements of its success include msim.plot(['cum_deaths', 'cum_infections']) (a) the simplicity of its architecture; (b) its high performance, Similar design philosophies have been articulated by previously, enabled by the use of NumPy arrays and Numba decorators; such as for Grails [AJ09] among others1 . and (c) our emphasis on prioritizing usability, including flexible type handling and careful choices of default settings. In the 1. Other similar philosophical statements include "The manifesto of Mat- remainder of this paper, we outline these principles in more detail, plotlib is: simple and common tasks should be simple to perform; provide options for more complex tasks" (Data Processing Using Python) and "Simple, in the hope that these will provide a useful roadmap for other common tasks should be simple to perform; Options should be provided to groups wanting to quickly develop high-performance, easy-to-use enable more complex tasks" (Instrumental). 92 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Locations where Covasim has been used to help produce a paper, report, or policy recommendation. Fig. 3: Basic Covasim disease model. The blue arrow shows the process of reinfection. Fig. 4: Illustrative result of a simulation in Covasim focused on Simplifications using Sciris exploring an intervention for protecting the elderly. A key component of Covasim’s architecture is heavy reliance on Sciris (http://sciris.org) [KAH+ ng], a library of functions for running simulations in parallel. scientific computing that provide additional flexibility and ease- of-use on top of NumPy, SciPy, and Matplotlib, including paral- Array-based architecture lel computing, array operations, and high-performance container In a typical agent-based simulation, the outermost loop is over datatypes. time, while the inner loops iterate over different agents and agent As shown in Fig. 5, Sciris significantly reduces the number states. For a simulation like Covasim, with roughly 700 (daily) of lines of code required to perform common scientific tasks, timesteps to represent the first two years of the pandemic, tens allowing the user to focus on the code’s scientific logic rather than or hundreds of thousands of agents, and several dozen states, this the low-level implementation. Key Covasim features that rely on requires on the order of one billion update steps. Sciris include: ensuring consistent dictionary, list, and array types However, we can take advantage of the fact that each state (e.g., allowing the user to provide inputs as either lists or arrays); (such as agent age or their infection status) has the same data referencing ordered dictionary elements by index; handling and type, and thus we can avoid an explicit loop over agents by instead interconverting dates (e.g., allowing the user to provide either a representing agents as entries in NumPy vectors, and performing date string or a datetime object); saving and loading files; and operations on these vectors. These two architectures are shown in PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 93 Fig. 5: Comparison of functionally identical code implemented without Sciris (left) and with (right). In this example, tasks that together take 30 lines of code without Sciris can be accomplished in 7 lines with it. 94 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) for t in self.time_vec: for person in self.people: if person.alive: person.age_person() person.check_died() # Array-based agent simulation class People: def age_people(self, inds): self.age[inds] += 1 return def check_died(self, inds): rands = np.random.rand(len(inds)) died = rands < self.death_probs[inds]: self.alive[inds[died]] = False return Fig. 6: The standard object-oriented approach for implementing agent-based models (top), compared to the array-based approach class Sim: used in Covasim (bottom). def run(self): for t in self.time_vec: alive = sc.findinds(self.people.alive) self.people.age_people(inds=alive) self.people.check_died(inds=alive) Numba optimization Numba is a compiler that translates subsets of Python and NumPy into machine code [LPS15]. Each low-level numerical function was tested with and without Numba decoration; in some cases speed improvements were negligible, while in other cases they were considerable. For example, the following function is roughly 10 times faster with the Numba decorator than without: import numpy as np import numba as nb @nb.njit((nb.int32, nb.int32), cache=True) def choose_r(max_n, n): Fig. 7: Performance comparison for FPsim from an explicit loop- return np.random.choice(max_n, n, replace=True) based approach compared to an array-based approach, showing a factor of ~70 speed improvement for large population sizes. Since Covasim is stochastic, calculations rarely need to be exact; as a result, most numerical operations are performed as 32-bit operations. Fig. 6. Compared to the explicitly object-oriented implementation Together, these speed optimizations allow Covasim to run at of an agent-based model, the array-based version is 1-2 orders of roughly 5-10 million simulated person-days per second of CPU magnitude faster for population sizes larger than 10,000 agents. time – a speed comparable to agent-based models implemented The relative performance of these two approaches is shown in purely in C or C++ [HPN+ 21]. Practically, this means that most Fig. 7 for FPsim (which, like Covasim, was initially implemented users can run Covasim analyses on their laptops without needing using an object-oriented approach before being converted to an to use cloud-based or HPC computing resources. array-based approach). To illustrate the difference between object- based and array-based implementations, the following example Lessons for scientific software development shows how aging and death would be implemented in each: Accessible coding and design # Object-based agent simulation Since Covasim was designed to be used by scientists and health class Person: officials, not developers, we made a number of design decisions that preferenced accessibility to our audience over other principles def age_person(self): of good software design. self.age += 1 return First, Covasim is designed to have as flexible of user inputs as possible. For example, a date can be specified as an integer def check_died(self): number of days from the start of the simulation, as a string (e.g. rand = np.random.random() if rand < self.death_prob: '2020-04-04'), or as a datetime object. Similarly, numeric self.alive = False inputs that can have either one or multiple values (such as the return change in transmission rate following one or multiple lockdowns) can be provided as a scalar, list, or NumPy array. As long as the class Sim: input is unambiguous, we prioritized ease-of-use and simplicity def run(self): of the interface over rigorous type checking. Since Covasim is a PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 95 top-level library (i.e., it does not perform low-level functions as health background, through to public health experts with virtually part of other libraries), this prioritization has been welcomed by no prior experience in Python. Roughly 45% of Covasim con- its users. tributors had significant Python expertise, while 60% had public Second, "advanced" Python programming paradigms – such health experience; only about half a dozen contributors (<10%) as method and function decorators, lambda functions, multiple had significant experience in both areas. inheritance, and "dunder" methods – have been avoided where These half-dozen contributors formed a core group (including possible, even when they would otherwise be good coding prac- the authors of this paper) that oversaw overall Covasim develop- tice. This is because a relatively large fraction of Covasim users, ment. Using GitHub for both software and project management, including those with relatively limited Python backgrounds, need we created issues and assigned them to other contributors based to inspect and modify the source code. A Covasim user coming on urgency and skillset match. All pull requests were reviewed by from an R programming background, for example, may not have at least one person from this group, and often two, prior to merge. encountered the NumPy function intersect1d() before, but While the danger of accepting changes from contributors with they can quickly look it up and understand it as being equivalent limited Python experience is self-evident, considerable risks were to R’s intersect() function. In contrast, an R user who has also posed by contributors who lacked epidemiological insight. not encountered method decorators before is unlikely to be able to For example, some of the proposed tests were written based on look them up and understand their meaning (indeed, they may not assumptions that were true for a given time and place, but which even know what terms to search for). While Covasim indeed does were not valid for other geographical contexts. use each of the "advanced" methods listed above (e.g., the Numba One surprising outcome was that even though Covasim is decorators described above), they have been kept to a minimum largely a software project, after the initial phase of development and sequestered in particular files the user is less likely to interact (i.e., the first 4-8 weeks), we found that relatively few tasks could with. be assigned to the developers as opposed to the epidemiologists Third, testing for Covasim presented a major challenge. Given and infectious disease modelers on the project. We believe there that Covasim was being used to make decisions that affected tens are several reasons for this. First, epidemiologists tended to be of millions of people, even the smallest errors could have poten- much more aware of knowledge they were missing (e.g., what tially catastrophic consequences. Furthermore, errors could arise a particular NumPy function did), and were more readily able not only in the software logic, but also in an incorrectly entered to fill that gap (e.g., look it up in the documentation or on parameter value or a misinterpreted scientific study. Compounding Stack Overflow). By contrast, developers without expertise in these challenges, features often had to be developed and used epidemiology were less able to identify gaps in their knowledge on a timescale of hours or days to be of use to policymakers, and address them (e.g., by finding a study on Google Scholar). a speed which was incompatible with traditional software testing As a consequence, many of the epidemiologists’ software skills approaches. In addition, the rapidly evolving codebase made it improved markedly over the first few months, while the develop- difficult to write even simple regression tests. Our solution was to ers’ epidemiology knowledge increased more slowly. Second, and use a hierarchical testing approach: low-level functions were tested more importantly, we found that once transparent and performant through a standard software unit test approach, while new features coding practices had been implemented, epidemiologists were able and higher-level outputs were tested extensively by infectious to successfully adapt them to new contexts even without complete disease modelers who varied inputs corresponding to realistic understanding of the code. Thus, for developing a scientific scenarios, and checked the outputs (predominantly in the form software tool, we propose that a successful staffing plan would of graphs) against their intuition. We found that these high-level consist of a roughly equal ratio of developers and domain experts "sanity checks" were far more effective in catching bugs than during the early development phase, followed by a rapid (on a formal software tests, and as a result shifted the emphasis of timescale of weeks) ramp-down of developers and ramp-up of our test suite to prioritize the former. Public releases of Covasim domain experts. have held up well to extensive scrutiny, both by our external Acknowledging that Covasim’s potential user base includes collaborators and by "COVID skeptics" who were highly critical many people who have limited coding skills, we developed a three- of other COVID models [Den20]. tiered support model to maximize Covasim’s real-world policy Finally, since much of our intended audience has little to impact (Fig. 8). For "mode 1" engagements, we perform the anal- no Python experience, we provided as many alternative ways of yses using Covasim ourselves. While this mode typically ensures accessing Covasim as possible. For R users, we provide exam- high quality and efficiency, it is highly resource-constrained and ples of how to run Covasim using the reticulate package thus used only for our highest-profile engagements, such as with [AUTE17], which allows Python to be called from within R. the Vietnam Ministry of Health [PSN+ 21] and Washington State For specific applications, such as our test-trace-quarantine work Department of Health [KMS+ 21]. For "mode 2" engagements, we (http://ttq-app.covasim.org), we developed bespoke webapps via offer our partners training on how to use Covasim, and let them Jupyter notebooks [GP21] and Voilà [Qua19]. To help non-experts lead analyses with our feedback. This is our preferred mode of gain intuition about COVID epidemic dynamics, we also devel- engagement, since it balances efficiency and sustainability, and has oped a generic JavaScript-based webapp interface for Covasim been used for contexts including the United Kingdom [PGKS+ 20] (http://app.covasim.org), but it does not have sufficient flexibility and Australia [SLSS+ 22]. Finally, "mode 3" partnerships, in to answer real-world policy questions. which Covasim is downloaded and used without our direct input, are of course the default approach in the open-source software Workflow and team management ecosystem, including for Python. While this mode is by far the Covasim was developed by a team of roughly 75 people with most scalable, in practice, relatively few health departments or widely disparate backgrounds: from those with 20+ years of ministries of health have the time and internal technical capacity to enterprise-level software development experience and no public use this mode; instead, most of the mode 3 uptake of Covasim has 96 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) been by academic groups [LG+ 21]. Thus, we provide mode 1 and [AUTE17] JJ Allaire, Kevin Ushey, Yuan Tang, and Dirk Eddelbuettel. mode 2 partnerships to make Covasim’s impact more immediate reticulate: R Interface to Python, 2017. URL: https://github. com/rstudio/reticulate. and direct than would be possible via mode 3 alone. [BGB+ 18] Anna Bershteyn, Jaline Gerardin, Daniel Bridenbecker, Christo- pher W Lorton, Jonathan Bloedow, Robert S Baker, Guil- Future directions laume Chabot-Couture, Ye Chen, Thomas Fischle, Kurt Frey, et al. Implementation and applications of EMOD, an individual- While the need for COVID modeling is hopefully starting to based multi-disease modeling platform. Pathogens and disease, decrease, we and our collaborators are continuing development 76(5):fty059, 2018. doi:10.1093/femspd/fty059. of Covasim by updating parameters with the latest scientific [CHL+ 08] Myron S Cohen, Nick Hellmann, Jay A Levy, Kevin DeCock, Joep Lange, et al. The spread, treatment, and prevention of evidence, implementing new immune dynamics [CSN+ 21], and HIV-1: evolution of a global pandemic. The Journal of Clin- providing other usability and bug-fix updates. We also continue ical Investigation, 118(4):1244–1254, 2008. doi:10.1172/ to provide support and training workshops (including in-person JCI34706. workshops, which were not possible earlier in the pandemic). [COSF20] Dennis L Chao, Assaf P Oron, Devabhaktuni Srikrishna, and Michael Famulare. Modeling layered non-pharmaceutical inter- We are using what we learned during the development of ventions against SARS-CoV-2 in the United States with Corvid. Covasim to build a broader suite of Python-based disease mod- MedRxiv, 2020. doi:10.1101/2020.04.08.20058487. eling tools (tentatively named "*-sim" or "Starsim"). The suite [CSN+ 21] Jamie A Cohen, Robyn Margaret Stuart, Rafael C Nùñez, of Starsim tools under development includes models for family Katherine Rosenfeld, Bradley Wagner, Stewart Chang, Cliff Kerr, Michael Famulare, and Daniel J Klein. Mechanistic mod- planning [OVCC+ 22], polio, respiratory syncytial virus (RSV), eling of SARS-CoV-2 immune memory, variants, and vaccines. and human papillomavirus (HPV). To date, each tool in this medRxiv, 2021. doi:10.1101/2021.05.31.21258018. suite uses an independent codebase, and is related to Covasim [Den20] Denim, Sue. Another Computer Simulation, Another Alarmist only through the shared design principles described above, and Prediction, 2020. URL: https://dailysceptic.org/schools-paper. [Fam19] Mike Famulare. nCoV: preliminary estimates of the confirmed- by having used the Covasim codebase as the starting point for case-fatality-ratio and infection-fatality-ratio, and initial pan- development. demic risk assessment. Institute for Disease Modeling, 2019. A major open question is whether the disease dynamics im- [Gar05] Laurie Garrett. The next pandemic. Foreign Aff., 84:3, 2005. plemented in Covasim and these related models have sufficient doi:10.2307/20034417. [GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and overlap to be refactored into a single disease-agnostic modeling storytelling with code and data. Computing in Science & En- library, which the disease-specific modeling libraries would then gineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021. import. This "core and specialization" approach was adopted by 3059263. EMOD and Atomica, and while both frameworks continue to be [Hof20] Bert Hofman. The global pandemic. Horizons: Journal of International Relations and Sustainable Development, (16):60– used, no multi-disease modeling library has yet seen widespread 69, 2020. adoption within the disease modeling community. The alternative [HPN+ 21] Robert Hinch, William JM Probert, Anel Nurtay, Michelle approach, currently used by the Starsim suite, is for each disease Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana model to be a self-contained library. A shared library would Bulas Cruz, Lele Zhao, Andrea Stewart, et al. OpenABM- Covid19—An agent-based model for non-pharmaceutical inter- reduce code duplication, and allow new features and bug fixes ventions against COVID-19 including contact tracing. PLoS to be immediately rolled out to multiple models simultaneously. computational biology, 17(7):e1009146, 2021. doi:10. However, it would also increase interdependencies that would have 1371/journal.pcbi.1009146. the effect of increasing code complexity, increasing the risk of [KAH+ ng] Cliff C Kerr, Romesh G Abeysuriya, Vlad-S, tefan Harbuz, George L Chadderdon, Parham Saidi, Paula Sanz-Leon, James introducing subtle bugs. Which of these two options is preferable Jansson, Maria del Mar Quiroga, Sherrie Hughes, Rowan likely depends on the speed with which new disease models need Martin-and Kelly, Jamie Cohen, Robyn M Stuart, and Anna to be implemented. We hope that for the foreseeable future, none Nachesa. Sciris: a Python library to simplify scientific com- will need to be implemented as quickly as Covasim. puting. Available at http://paper.sciris.org, 2022 (forthcoming). [KAK+ 19] David J Kedziora, Romesh Abeysuriya, Cliff C Kerr, George L Chadderdon, Vlad-S, tefan Harbuz, Sarah Metzger, David P Wil- Acknowledgements son, and Robyn M Stuart. The Cascade Analysis Tool: software to analyze and optimize care cascades. Gates Open Research, 3, We thank additional contributors to Covasim, including Katherine 2019. doi:10.12688/gatesopenres.13031.2. Rosenfeld, Gregory R. Hart, Rafael C. Núñez, Prashanth Selvaraj, [KCP+ 20] Joel R Koo, Alex R Cook, Minah Park, Yinxiaohe Sun, Haoyang Brittany Hagedorn, Amanda S. Izzo, Greer Fowler, Anna Palmer, Sun, Jue Tao Lim, Clarence Tam, and Borame L Dickens. Interventions to mitigate early spread of sars-cov-2 in singapore: Dominic Delport, Nick Scott, Sherrie L. Kelly, Caroline S. Ben- a modelling study. The Lancet Infectious Diseases, 20(6):678– nette, Bradley G. Wagner, Stewart T. Chang, Assaf P. Oron, Paula 688, 2020. doi:10.1016/S1473-3099(20)30162-6. Sanz-Leon, and Jasmina Panovska-Griffiths. We also wish to thank [KMS+ 21] Cliff C Kerr, Dina Mistry, Robyn M Stuart, Katherine Rosenfeld, Maleknaz Nayebi and Natalie Dean for helpful discussions on Gregory R Hart, Rafael C Núñez, Jamie A Cohen, Prashanth Selvaraj, Romesh G Abeysuriya, Michał Jastrz˛ebski, et al. Con- code architecture and workflow practices, respectively. trolling COVID-19 via test-trace-quarantine. Nature Commu- nications, 12(1):1–12, 2021. doi:10.1038/s41467-021- 23276-9. R EFERENCES [KSM+ 21] Cliff C Kerr, Robyn M Stuart, Dina Mistry, Romesh G Abey- [AFG+ 04] Roy M Anderson, Christophe Fraser, Azra C Ghani, Christl A suriya, Katherine Rosenfeld, Gregory R Hart, Rafael C Núñez, Donnelly, Steven Riley, Neil M Ferguson, Gabriel M Leung, Jamie A Cohen, Prashanth Selvaraj, Brittany Hagedorn, et al. Tai H Lam, and Anthony J Hedley. Epidemiology, transmis- Covasim: an agent-based model of COVID-19 dynamics and sion dynamics and control of sars: the 2002–2003 epidemic. interventions. PLOS Computational Biology, 17(7):e1009149, Philosophical Transactions of the Royal Society of London. 2021. doi:10.1371/journal.pcbi.1009149. Series B: Biological Sciences, 359(1447):1091–1105, 2004. [LG+ 21] Junjiang Li, Philippe Giabbanelli, et al. Returning to a normal doi:10.1098/rstb.2004.1490. life via COVID-19 vaccines in the United States: a large- [AJ09] Bashar Abdul-Jawad. Groovy and Grails Recipes. Springer, scale Agent-Based simulation study. JMIR medical informatics, 2009. 9(4):e27419, 2021. doi:10.2196/27419. PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 97 Fig. 8: The three pathways to impact with Covasim, from high bandwidth/small scale to low bandwidth/large scale. IDM: Institute for Disease Modeling; OSS: open-source software; GPG: global public good; PyPI: Python Package Index. [LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A the impact of COVID-19 vaccines in a representative COVAX llvm-based python jit compiler. In Proceedings of the Second AMC country setting due to ongoing internal migration: A Workshop on the LLVM Compiler Infrastructure in HPC, pages modeling study. PLOS Global Public Health, 2(1):e0000053, 1–6, 2015. doi:10.1145/2833157.2833162. 2022. doi:10.1371/journal.pgph.0000053. [Med20] The Lancet Respiratory Medicine. COVID-19: delay, mitigate, [Tea14] WHO Ebola Response Team. Ebola virus disease in west and communicate. The Lancet Respiratory Medicine, 8(4):321, africa—the first 9 months of the epidemic and forward projec- 2020. doi:10.1016/S2213-2600(20)30128-4. tions. New England Journal of Medicine, 371(16):1481–1495, [OVCC 22] Michelle L O’Brien, Annie Valente, Guillaume Chabot-Couture, + 2014. doi:10.1056/NEJMoa1411100. Joshua Proctor, Daniel Klein, Cliff Kerr, and Marita Zimmer- mann. FPSim: An agent-based model of family planning for informed policy decision-making. In PAA 2022 Annual Meeting. PAA, 2022. [PeP22] PePy. PePy download statistics, 2022. URL: https://pepy.tech/ project/covasim. [PGKS+ 20] Jasmina Panovska-Griffiths, Cliff C Kerr, Robyn M Stuart, Dina Mistry, Daniel J Klein, Russell M Viner, and Chris Bonell. Determining the optimal strategy for reopening schools, the impact of test and trace interventions, and the risk of occurrence of a second COVID-19 epidemic wave in the UK: a modelling study. The Lancet Child & Adolescent Health, 4(11):817–827, 2020. doi:10.1016/S2352-4642(20)30250-9. [PSN+ 21] Quang D Pham, Robyn M Stuart, Thuong V Nguyen, Quang C Luong, Quang D Tran, Thai Q Pham, Lan T Phan, Tan Q Dang, Duong N Tran, Hung T Do, et al. Estimating and mitigating the risk of COVID-19 epidemic rebound associated with reopening of international borders in Vietnam: a modelling study. The Lancet Global Health, 9(7):e916–e924, 2021. doi:10.1016/ S2214-109X(21)00103-0. [Qua19] QuantStack. And voilá! Jupyter Blog, 2019. URL: https://blog. jupyter.org/and-voil%C3%A0-f6a2c08a4a93. [RSWS20] Joacim Rocklöv, Henrik Sjödin, and Annelies Wilder-Smith. COVID-19 outbreak on the Diamond Princess cruise ship: esti- mating the epidemic potential and effectiveness of public health countermeasures. Journal of Travel Medicine, 27(3):taaa030, 2020. doi:10.1093/jtm/taaa030. [SAK+ 21] Robyn M Stuart, Romesh G Abeysuriya, Cliff C Kerr, Dina Mistry, Dan J Klein, Richard T Gray, Margaret Hellard, and Nick Scott. Role of masks, testing and contact tracing in preventing COVID-19 resurgences: a case study from New South Wales, Australia. BMJ open, 11(4):e045941, 2021. doi:10.1136/bmjopen-2020-045941. [SHK16] Patrick R Saunders-Hastings and Daniel Krewski. Review- ing the history of pandemic influenza: understanding patterns of emergence and transmission. Pathogens, 5(4):66, 2016. doi:10.3390/pathogens5040066. [SLSS+ 22] Paula Sanz-Leon, Nathan J Stevenson, Robyn M Stuart, Romesh G Abeysuriya, James C Pang, Stephen B Lambert, Cliff C Kerr, and James A Roberts. Risk of sustained SARS- CoV-2 transmission in Queensland, Australia. Scientific reports, 12(1):1–9, 2022. doi:10.1101/2021.06.08.21258599. [SWC 22] Prashanth Selvaraj, Bradley G Wagner, Dennis L Chao, + Maïna L’Azou Jackson, J Gabrielle Breugelmans, Nicholas Jack- son, and Stewart T Chang. Rural prioritization may increase 98 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Pylira: deconvolution of images in the presence of Poisson noise Axel Donath‡∗ , Aneta Siemiginowska‡ , Vinay Kashyap‡ , Douglas Burke‡ , Karthik Reddy Solipuram§ , David van Dyk¶ F Abstract—All physical and astronomical imaging observations are degraded by of the signal intensity to the signal variance. Any statistically the finite angular resolution of the camera and telescope systems. The recovery correct post-processing or reconstruction method thus requires a of the true image is limited by both how well the instrument characteristics careful treatment of the Poisson nature of the measured image. are known and by the magnitude of measurement noise. In the case of a To maximise the scientific use of the data, it is often desired to high signal to noise ratio data, the image can be sharpened or “deconvolved” correct the degradation introduced by the imaging process. Besides robustly by using established standard methods such as the Richardson-Lucy method. However, the situation changes for sparse data and the low signal to correction for non-uniform exposure and background noise this noise regime, such as those frequently encountered in X-ray and gamma-ray also includes the correction for the "blurring" introduced by the astronomy, where deconvolution leads inevitably to an amplification of noise point spread function (PSF) of the instrument. Where the latter and poorly reconstructed images. However, the results in this regime can process is often called "deconvolution". Depending on whether be improved by making use of physically meaningful prior assumptions and the PSF of the instrument is known or not, one distinguishes statistically principled modeling techniques. One proposed method is the LIRA between the "blind deconvolution" and "non blind deconvolution" algorithm, which requires smoothness of the reconstructed image at multiple process. For astronomical observations, the PSF can often either scales. In this contribution, we introduce a new python package called Pylira, be simulated, given a model of the telescope and detector, or which exposes the original C implementation of the LIRA algorithm to Python inferred directly from the data by observing far distant objects, users. We briefly describe the package structure, development setup and show a Chandra as well as Fermi-LAT analysis example. which appear as a point source to the instrument. While in other branches of astronomy deconvolution methods Index Terms—deconvolution, point spread function, poisson, low counts, X-ray, are already part of the standard analysis, such as the CLEAN gamma-ray algorithm for radio data, developed by [Hog74], this is not the case for X-ray and gamma-ray astronomy. As any deconvolution method aims to enhance small-scale structures in an image, it Introduction becomes increasingly hard to solve for the regime of low signal- Any physical and astronomical imaging process is affected by to-noise ratio, where small-scale structures are more affected by the limited angular resolution of the instrument or telescope. In noise. addition, the quality of the resulting image is also degraded by background or instrumental measurement noise and non-uniform The Deconvolution Problem exposure. For short wavelengths and associated low intensities of Basic Statistical Model the signal, the imaging process consists of recording individual Assuming the data in each pixel di in the recorded counts image photons (often called "events") originating from a source of follows a Poisson distribution, the total likelihood of obtaining the interest. This imaging process is typical for X-ray and gamma- measured image from a model image of the expected counts λi ray telescopes, but images taken by magnetic resonance imaging with N pixels is given by: or fluorescence microscopy show Poisson noise too. For each individual photon, the incident direction, energy and arrival time N exp −di λidi L (d|λ ) = ∏ (1) is measured. Based on this information, the event can be binned i di ! into two dimensional data structures to form an actual image. By taking the logarithm, dropping the constant terms and inverting As a consequence of the low intensities associated to the the sign one can transform the product into a sum over pixels, recording of individual events, the measured signal follows Pois- which is also often called the Cash [Cas79] fit statistics: son statistics. This imposes a non-linear relationship between the N measured signal and true underlying intensity as well as a coupling C (λ |d) = ∑(λi − di log λi ) (2) i * Corresponding author: axel.donath@cfa.harvard.edu ‡ Center for Astrophysics | Harvard & Smithsonian Where the expected counts λi are given by the convolution of the § University of Maryland Baltimore County true underlying flux distribution xi with the PSF pk : ¶ Imperial College London λi = ∑ xi pi−k (3) Copyright © 2022 Axel Donath et al. This is an open-access article distributed k under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the This operation is often called "forward modelling" or "forward original author and source are credited. folding" with the instrument response. PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 99 Richardson Lucy (RL) To obtain the most likely value of xn given the data, one searches a maximum of the total likelihood function, or equivalently a of minimum C . This high dimensional optimization problem can e.g., be solved by a classic gradient descent approach. Assuming the pixels values xi of the true image as independent parameters, one can take the derivative of Eq. 2 with respect to the individual xi . This way one obtains a rule for how to update the current set of pixels xn in each iteration of the optimization: ∂ C (d|x) xn+1 = xn − α · (4) ∂ xi Where α is a factor to define the step size. This method is in general equivalent to the gradient descent and backpropagation methods used in modern machine learning techniques. This ba- sic principle of solving the deconvolution problem for images with Poisson noise was proposed by [Ric72] and [Luc74]. Their method, named after the original authors, is often known as the Fig. 1: The images show the result of the RL algorithm applied Richardson & Lucy (RL) method. It was shown by [Ric72] that to a simulated example dataset with varying numbers of iterations. this converges to a maximum likelihood solution of Eq. 2. A The image in the upper left shows the simulated counts. Those have Python implementation of the standard RL method is available been derived from the ground truth (upper mid) by convolving with a e.g. in the Scikit-Image package [vdWSN+ 14]. Gaussian PSF of width σ = 3 pix and applying Poisson noise to it. Instead of the iterative, gradient descent based optimization it The illustration uses the implementation of the RL algorithm from the is also possible to sample from the posterior distribution using a Scikit-Image package [vdWSN+ 14]. simple Metropolis-Hastings [Has70] approach and uniform prior. This is demonstrated in one of the Pylira online tutorials (Intro- the smoothness of the reconstructed image on multiple spatial duction to Deconvolution using MCMC Methods). scales. Starting from the full resolution, the image pixels xi are collected into 2 by 2 groups Qk . The four pixel values associated RL Reconstruction Quality with each group are divided by their sum to obtain a grid of “split While technically the RL method converges to a maximum like- proportions” with respect to the image down-sized by a factor of lihood solution, it mostly still results in poorly restored images, two along both axes. This process is repeated using the down sized especially if extended emission regions are present in the image. image with pixel values equal to the sums over the 2 by 2 groups The problem is illustrated in Fig. 1 using a simulated example from the full-resolution image, and the process continues until the image. While for a low number of iterations, the RL method still resolution of the image is only a single pixel, containing the total results in a smooth intensity distribution, the structure of the image sum of the full-resolution image. This multi-scale representation decomposes more and more into a set of point-like sources with is illustrated in Fig. 2. growing number of iterations. For each of the 2x2 groups of the re-normalized images a Because of the PSF convolution, an extended emission region Dirichlet distribution is introduced as a prior: can decompose into multiple nearby point sources and still lead to good model prediction, when compared with the data. Those φk ∝ Dirichlet(αk , αk , αk , αk ) (6) almost equally good solutions correspond to many narrow local and multiplied across all 2x2 groups and resolution levels k. For minima or "spikes" in the global likelihood surface. Depending on each resolution level a smoothing parameter αk is introduced. the start estimate for the reconstructed image x the RL method These hyper-parameters can be interpreted as having an infor- will follow the steepest gradient and converge towards the nearest mation content equivalent of adding αk "hallucinated" counts in narrow local minimum. This problem has been described by each grouping. This effectively results in a smoothing of the multiple authors, such as [PR94] and [FBPW95]. image at the given resolution level. The distribution of α values at each resolution level is the further described by a hyper-prior Multi-Scale Prior & LIRA distribution: One solution to this problem was described in [ECKvD04] and p(αk ) = exp (−δ α 3 /3) (7) [CSv+ 11]. First, the simple forward folded model described in Eq. 3 can be extended by taking into account the non-uniform Resulting in a fully hierarchical Bayesian model. A more com- exposure ei and an additional known background component bi : plete and detailed description of the prior definition is given in [ECKvD04]. λi = ∑ (ei · (xi + bi )) pi−k (5) The problem is then solved by using a Gibbs MCMC sampling k approach. After a "burn-in" phase the sampling process typically The background bi can be more generally understood as a "base- reaches convergence and starts sampling from the posterior distri- line" image and thus include known structures, which are not of bution. The reconstructed image is then computed as the mean of interest for the deconvolution process. E.g., a bright point source the posterior samples. As for each pixel a full distribution of its to model the core of an AGN while studying its jets. values is available, the information can also be used to compute Second, the authors proposed to extend the Poisson log- the associated error of the reconstructed value. This is another likelihood function (Equation 2) by a log-prior term that controls main advantage over RL or Maxium A-Postori (MAP) algorithms. 100 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1 $ sudo apt-get install r-base-dev r-base r-mathlib 2 $ pip install pylira For more detailed instructions see Pylira installation instructions. API & Subpackages Pylira is structured in multiple sub-packages. The pylira.src module contains the original C implementation and the Pybind11 wrapper code. The pylira.core sub-package contains the main Python API, pylira.utils includes utility functions for plotting and serialisation. And pylira.data implements multiple pre-defined datasets for testing and tutorials. Analysis Examples Simple Point Source Pylira was designed to offer a simple Python class based user interface, which allows for a short learning curve of using the package for users who are familiar with Python in general and more specifically with Numpy. A typical complete usage example of the Pylira package is shown in the following: Fig. 2: The image illustrates the multi-scale decomposition used in the LIRA prior for a 4x4 pixels example image. Each quadrant of 2x2 1 import numpy as np sub-images is labelled with QN . The sub-pixels in each quadrant are 2 from pylira import LIRADeconvolver labelled Λi j . . 3 from pylira.data import point_source_gauss_psf 4 5 # create example dataset 6 data = point_source_gauss_psf() The Pylira Package 7 8 # define initial flux image Dependencies & Development 9 data["flux_init"] = data["flux"] The Pylira package is a thin Python wrapper around the original 10 LIRA implementation provided by the authors of [CSv+ 11]. The 11 deconvolve = LIRADeconvolver( 12 n_iter_max=3_000, original algorithm was implemented in C and made available as a 13 n_burn_in=500, package for the R Language [R C20]. Thus the implementation de- 14 alpha_init=np.ones(5) pends on the RMath library, which is still a required dependency of 15 ) 16 Pylira. The Python wrapper was built using the Pybind11 [JRM17] 17 result = deconvolve.run(data=data) package, which allows to reduce the code overhead introduced by 18 the wrapper to a minimum. For the data handling, Pylira relies on 19 # plot pixel traces, result shown in Figure 3 Numpy [HMvdW+ 20] arrays for the serialisation to the FITS data 20 result.plot_pixel_traces_region( 21 center_pix=(16, 16), radius_pix=3 format on Astropy [Col18]. The (interactive) plotting functionality 22 ) is achieved via Matplotlib [Hun07] and Ipywidgets [wc15], which 23 are both optional dependencies. Pylira is openly developed on 24 # plot pixel traces, result shown in Figure 4 25 result.plot_parameter_traces() Github at https://github.com/astrostat/pylira. It relies on GitHub 26 Actions as a continuous integration service and uses the Read 27 # finally serialise the result the Docs service to build and deploy the documentation. The on- 28 result.write("result.fits") line documentation can be found on https://pylira.readthedocs.io. The main interface is exposed via the LIRADeconvolver Pylira implements a set of unit tests to assure compatibility class, which takes the configuration of the algorithm on initial- and reproducibility of the results with different versions of the isation. Typical configuration parameters include the total num- dependencies and across different platforms. As Pylira relies on ber of iterations n_iter_max and the number of "burn-in" random sampling for the MCMC process an exact reproducibility iterations, to be excluded from the posterior mean computation. of results is hard to achieve on different platforms; however the The data, represented by a simple Python dict data structure, agreement of results is at least guaranteed in the statistical limit of contains a "counts", "psf" and optionally "exposure" drawing many samples. and "background" array. The dataset is then passed to the LIRADeconvolver.run() method to execute the deconvolu- Installation tion. The result is a LIRADeconvolverResult object, which Pylira is available via the Python package index (pypi.org), features the possibility to write the result as a FITS file, as well currently at version 0.1. As Pylira still depends on the RMath as to inspect the result with diagnostic plots. The result of the library, it is required to install this first. So the recommended way computation is shown in the left panel of Fig. 3. to install Pylira is on MacOS is: 1 $ brew install r Diagnostic Plots 2 $ pip install pylira To validate the quality of the results Pylira provides many built- On Linux the RMath dependency can be installed using standard in diagnostic plots. One of these diagnostic plot is shown in the package managers. For example on Ubuntu, one would do right panel of Fig. 3. The plot shows the image sampling trace PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 101 Pixel trace for (16, 16) 30 800 1000 700 25 800 600 20 500 600 Posterior Mean Burn in Valid 15 400 Mean 400 1 Std. Deviation 300 10 200 200 5 100 0 0 0 5 10 15 20 25 30 0 500 1000 1500 2000 2500 3000 Number of Iterations Fig. 3: The curves show the traces of value the pixel of interest for a simulated point source and its neighboring pixels (see code example). The image on the left shows the posterior mean. The white circle in the image shows the circular region defining the neighboring pixels. The blue line on the right plot shows the trace of the pixel of interest. The solid horizontal orange line shows the mean value (excluding burn-in) of the pixel across all iterations and the shaded orange area the 1 σ error region. The burn in phase is shown in transparent blue and ignored while computing the mean. The shaded gray lines show the traces of the neighboring pixels. for a single pixel of interest and its surrounding circular region of Chandra is a space-based X-ray observatory, which has been interest. This visualisation allows the user to assess the stability in operation since 1999. It consists of nested cylindrical paraboloid of a small region in the image e.g. an astronomical point source and hyperboloid surfaces, which form an imaging optical system during the MCMC sampling process. Due to the correlation with for X-rays. In the focal plane, it has multiple instruments for dif- neighbouring pixels, the actual value of a pixel might vary in the ferent scientific purposes. This includes a high-resolution camera sampling process, which appears as "dips" in the trace of the pixel (HRC) and an Advanced CCD Imaging Spectrometer (ACIS). The of interest and anti-correlated "peaks" in the one or mutiple of typical angular resolution is 0.5 arcsecond and the covered energy the surrounding pixels. In the example a stable state of the pixels ranges from 0.1 - 10 keV. of interest is reached after approximately 1000 iterations. This Figure 5 shows the result of the Pylira algorithm applied to suggests that the number of burn-in iterations, which was defined Chandra data of the Galactic Center region between 0.5 and 7 keV. beforehand, should be increased. The PSF was obtained from simulations using the simulate_psf Pylira relies on an MCMC sampling approach to sample tool from the official Chandra science tools ciao 4.14 [FMA+ 06]. a series of reconstructed images from the posterior likelihood The algorithm achieves both an improved spatial resolution as well defined by Eq. 2. Along with the sampling, it marginalises over as a reduced noise level and higher contrast of the image in the the smoothing hyper-parameters and optimizes them in the same right panel compared to the unprocessed counts data shown in the process. To diagnose the validity of the results it is important to left panel. visualise the sampling traces of both the sampled images as well As a second example, we use data from the Fermi Large Area as hyper-parameters. Telescope (LAT). The Fermi-LAT is a satellite-based imaging Figure 4 shows another typical diagnostic plot created by the gamma-ray detector, which covers an energy range of 20 MeV code example above. In a multi-panel figure, the user can inspect to >300 GeV. The angular resolution varies strongly with energy the traces of the total log-posterior as well as the traces of the and ranges from 0.1 to >10 degree1 . smoothing parameters. Each panel corresponds to the smoothing Figure 6 shows the result of the Pylira algorithm applied to hyper parameter introduced for each level of the multi-scale Fermi-LAT data above 1 GeV to the region around the Galactic representation of the reconstructed image. The figure also shows Center. The PSF was obtained from simulations using the gtpsf the mean value along with the 1 σ error region. In this case, tool from the official Fermitools v2.0.19 [Fer19]. First, one can the algorithm shows stable convergence after a burn-in phase of see that the algorithm achieves again a considerable improvement approximately 200 iterations for the log-posterior as well as all of in the spatial resolution compared to the raw counts. It clearly the multi-scale smoothing parameters. resolves multiple point sources left to the bright Galactic Center source. Astronomical Analysis Examples Summary & Outlook Both in the X-ray as well as in the gamma-ray regime, the Galactic The Pylira package provides Python wrappers for the LIRA al- Center is a complex emission region. It shows point sources, gorithm. It allows the deconvolution of low-counts data following extended sources, as well as underlying diffuse emission and thus 1. https://www.slac.stanford.edu/exp/glast/groups/canda/lat_Performance. represents a challenge for any astronomical data analysis. htm 102 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Logpost Smoothingparam0 Smoothingparam1 Burn in 0.35 0.35 1500 Valid 0.30 Mean 0.30 1 Std. Deviation 0.25 0.25 1000 0.20 0.20 500 0.15 0.15 0 0.10 0.10 0.05 0.05 500 0.00 0.00 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of Iterations Number of Iterations Number of Iterations Smoothingparam2 Smoothingparam3 Smoothingparam4 0.200 0.175 0.20 0.175 0.150 0.150 0.15 0.125 0.125 0.100 0.100 0.10 0.075 0.075 0.05 0.050 0.050 0.025 0.025 0.00 0.000 0.000 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of Iterations Number of Iterations Number of Iterations Fig. 4: The curves show the traces of the log posterior value as well as traces of the values of the prior parameter values. The SmoothingparamN parameters correspond to the smoothing parameters αN per multi-scale level. The solid horizontal orange lines show the mean value, the shaded orange area the 1 σ error region. The burn in phase is shown transparent and ignored while estimating the mean. Counts Deconvolved 500 PSF 257 132 -29°00'25" 68 Declination 35 Counts 18 30" 9 5 2 35" 17h45m40.6s40.4s 40.2s 40.0s 39.8s 39.6s 17h45m40.6s40.4s 40.2s 40.0s 39.8s 39.6s Right Ascension Right Ascension Fig. 5: Pylira applied to Chandra ACIS data of the Galactic Center region, using the observation IDs 4684 and 4684. The image on the left shows the raw observed counts between 0.5 and 7 keV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included. PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 103 Counts Deconvolved 200 0°40' PSF 120 72 20' 43 Galactic Latitude 00' 26 Counts 16 -0°20' 9 5 40' 2 0°40' 20' 00' 359°40' 20' 0°40' 20' 00' 359°40' 20' Galactic Longitude Galactic Longitude Fig. 6: Pylira applied to Fermi-LAT data from the Galactic Center region. The image on the left shows the raw measured counts between 5 and 1000 GeV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included. Poisson statistics using a Bayesian sampling approach and a multi- [CSv+ 11] A. Connors, N. M. Stein, D. van Dyk, V. Kashyap, and scale smoothing prior assumption. The results can be easily written A. Siemiginowska. LIRA — The Low-Counts Image Restora- tion and Analysis Package: A Teaching Version via R. In I. N. to FITS files and inspected by plotting the trace of the sampling Evans, A. Accomazzi, D. J. Mink, and A. H. Rots, editors, process. This allows users to check for general convergence as Astronomical Data Analysis Software and Systems XX, volume well as pixel to pixel correlations for selected regions of interest. 442 of Astronomical Society of the Pacific Conference Series, The package is openly developed on GitHub and includes tests page 463, July 2011. [ECKvD04] David N. Esch, Alanna Connors, Margarita Karovska, and and documentation, such that it can be maintained and improved David A. van Dyk. An image restoration technique with in the future, while ensuring consistency of the results. It comes error estimates. The Astrophysical Journal, 610(2):1213– with multiple built-in test datasets and explanatory tutorials in 1227, aug 2004. URL: https://doi.org/10.1086/421761, doi: 10.1086/421761. the form of Jupyter notebooks. Future plans include the support [FBPW95] D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G. for parallelisation or distributed computing, more flexible prior Walker. Blind deconvolution by means of the richardson– definitions and the possibility to account for systematic errors on lucy algorithm. J. Opt. Soc. Am. A, 12(1):58–65, Jan 1995. the PSF during the sampling process. URL: http://opg.optica.org/josaa/abstract.cfm?URI=josaa-12- 1-58, doi:10.1364/JOSAA.12.000058. [Fer19] Fermi Science Support Development Team. Fermitools: Fermi Acknowledgements Science Tools. Astrophysics Source Code Library, record ascl:1905.011, May 2019. arXiv:1905.011. This work was conducted under the auspices of the CHASC [FMA+ 06] Antonella Fruscione, Jonathan C. McDowell, Glenn E. Allen, International Astrostatistics Center. CHASC is supported by NSF Nancy S. Brickhouse, Douglas J. Burke, John E. Davis, Nick Durham, Martin Elvis, Elizabeth C. Galle, Daniel E. Har- grants DMS-21-13615, DMS-21-13397, and DMS-21-13605; by ris, David P. Huenemoerder, John C. Houck, Bish Ishibashi, the UK Engineering and Physical Sciences Research Council Margarita Karovska, Fabrizio Nicastro, Michael S. Noble, [EP/W015080/1]; and by NASA 18-APRA18-0019. We thank Michael A. Nowak, Frank A. Primini, Aneta Siemiginowska, CHASC members for many helpful discussions, especially Xiao- Randall K. Smith, and Michael Wise. CIAO: Chandra’s data analysis system. In David R. Silva and Rodger E. Doxsey, Li Meng and Katy McKeough. DvD was also supported in part editors, Society of Photo-Optical Instrumentation Engineers by a Marie-Skodowska-Curie RISE Grant (H2020-MSCA-RISE- (SPIE) Conference Series, volume 6270 of Society of Photo- 2019-873089) provided by the European Commission. Aneta Optical Instrumentation Engineers (SPIE) Conference Series, page 62701V, June 2006. doi:10.1117/12.671760. Siemiginowska, Vinay Kashyap, and Doug Burke further acknowl- [Has70] W. K. Hastings. Monte Carlo Sampling Methods using Markov edge support from NASA contract to the Chandra X-ray Center Chains and their Applications. Biometrika, 57(1):97–109, NAS8-03060. April 1970. doi:10.1093/biomet/57.1.97. [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric R EFERENCES Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van [Cas79] W. Cash. Parameter estimation in astronomy through ap- Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del plication of the likelihood ratio. The Astrophysical Journal, Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, 228:939–947, March 1979. doi:10.1086/156922. Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer [Col18] Astropy Collaboration. The Astropy Project: Building an Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro- Open-science Project and Status of the v2.0 Core Package. The gramming with NumPy. Nature, 585(7825):357–362, Septem- Astrophysical Journal, 156(3):123, September 2018. arXiv: ber 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, 1801.02634, doi:10.3847/1538-3881/aabc4f. doi:10.1038/s41586-020-2649-2. 104 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [Hog74] J. A. Hogbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astronomy and As- trophysics Supplement, 15:417, June 1974. [Hun07] J. D. Hunter. Matplotlib: A 2d graphics environment. Com- puting in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55. [JRM17] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. py- bind11 – seamless operability between c++11 and python, 2017. https://github.com/pybind/pybind11. [Luc74] L. B. Lucy. An iterative technique for the rectification of observed distributions. Astronomical Journal, 79:745, June 1974. doi:10.1086/111605. [PR94] K. M. Perry and S. J. Reeves. Generalized Cross-Validation as a Stopping Rule for the Richardson-Lucy Algorithm. In Robert J. Hanisch and Richard L. White, editors, The Restora- tion of HST Images and Spectra - II, page 97, January 1994. doi:10.1002/ima.1850060412. [R C20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL: https://www.R-project.org/. [Ric72] William Hadley Richardson. Bayesian-Based Iterative Method of Image Restoration. Journal of the Optical Society of America (1917-1983), 62(1):55, January 1972. doi:10. 1364/josa.62.000055. [vdWSN+ 14] Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez- Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu, and the scikit-image con- tributors. scikit-image: image processing in Python. PeerJ, 2:e453, 6 2014. URL: https://doi.org/10.7717/peerj.453, doi: 10.7717/peerj.453. [wc15] Jupyter widgets community. ipywidgets, a github repository. Retrieved from https://github.com/jupyter-widgets/ipywidgets, 2015. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 105 Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels Geoffrey M. Poore‡∗ F Abstract—Codebraid Preview is a VS Code extension that provides a live including raw chunks of text in other formats such as reStructured- preview of Pandoc Markdown documents with optional support for executing Text. When executable code is involved, the RMarkdown-style embedded code. Unlike typical Markdown previews, all Pandoc features are fully approach of Markdown with embedded code can sometimes be supported because Pandoc itself generates the preview. The Markdown source more convenient than a browser-based Jupyter notebook since the and the preview are fully integrated with features like bidirectional scroll sync. writing process involves more direct interaction with the complete The preview supports LaTeX math via KaTeX. Code blocks and inline code can be executed with Codebraid, using either its built-in execution system or Jupyter document source. kernels. For executed code, any combination of the code and its output can be While using a Pandoc Markdown variant as a source format displayed in the preview as well as the final document. Code execution is non- brings many advantages, the actual writing process itself can blocking, so the preview always remains live and up-to-date even while code is be less than ideal, especially when executable code is involved. still running. Pandoc Markdown variants are so powerful precisely because they provide so many extensions to Markdown, but this also means Index Terms—reproducibility, dynamic report generation, literate programming, that they can only be fully rendered by Pandoc itself. When text Python, Pandoc, Markdown, Project Jupyter editors such as VS Code provide a built-in Markdown preview, typically only a small subset of Pandoc features is supported, Introduction so the representation of the document output will be inaccurate. Some editors provide a visual Markdown editing mode, in which Pandoc [JM22] is increasingly a foundational tool for creating sci- a partially rendered version of the document is displayed in the entific and technical documents. It provides Pandoc’s Markdown editor and menus or keyboard shortcuts may replace the direct and other Markdown variants that add critical features absent in entry of Markdown syntax. These generally suffer from the same basic Markdown, such as citations, footnotes, mathematics, and issue. This is only exacerbated when the document embeds code tables. At the same time, Pandoc simplifies document creation that is executed during the build process, since that goes even by providing conversion from Markdown (and other formats) to further beyond basic Markdown. formats like LaTeX, HTML, Microsoft Word, and PowerPoint. An alternative is to use Pandoc itself to generate HTML or Pandoc is especially useful for documents with embedded code PDF output, and then display this as a preview. Depending on the that is executed during the build process. RStudio’s RMarkdown text editor used, the HTML or PDF might be displayed within the [RSt20] and more recently Quarto [RSt22] leverage Pandoc to text editor in a panel beside the document source, or in a separate convert Markdown documents to other formats, with code exe- browser window or PDF viewer. For example, Quarto offers both cution provided by knitr [YX15]. JupyterLab [GP21] centers the possibilities, depending on whether RStudio, VS Code, or another writing experience around an interactive, browser-based notebook editor is used.1 While this approach resolves the inaccuracy issues instead of a Markdown document, but still relies on Pandoc for of a basic Markdown preview, it also gives up features such as export to formats other than HTML [Jup22]. There are also ways scroll sync that tightly integrate the Markdown source with the to interact with a Jupyter Notebook as a Markdown document, preview. In the case of executable code, there is the additional such as Jupytext [MWtJT20] and Pandoc’s own native Jupyter issue of a time delay in rendering the preview. Pandoc itself can support. typically convert even a relatively long document in under one Writing with Pandoc’s Markdown or a similar Markdown second. However, when code is executed as part of the document variant has advantages when multiple output formats are required, build process, preview update is blocked until code execution since Pandoc provides the conversion capabilities. Pandoc Mark- completes. down variants can also serve as a simpler syntax when creating HTML, LaTeX, or similar documents. They allow HTML and This paper introduces Codebraid Preview, a VS Code exten- LaTeX to be intermixed with Markdown syntax. They also support sion that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Codebraid * Corresponding author: gpoore@uu.edu Preview provides a Pandoc-based preview while avoiding most ‡ Union University of the traditional drawbacks of this approach. The next section Copyright © 2022 Geoffrey M. Poore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 1. The RStudio editor is unique in also offering a Pandoc-based visual unrestricted use, distribution, and reproduction in any medium, provided the editing mode, starting with version 1.4 from January 2021 (https://www. original author and source are credited. rstudio.com/blog/announcing-rstudio-1-4/). 106 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) provides an overview of features. This is followed by sections There is also support for document export with Pandoc, using focusing on scroll sync, LaTeX support, and code execution as the VS Code command palette or the export-with-Pandoc button. examples of solutions and remaining challenges in creating a better Pandoc writing experience. Scroll sync Tight source-preview integration requires a source map, or a Overview of Codebraid Preview mapping from characters in the source to characters in the output. Codebraid Preview can be installed through the VS Code ex- Due to Pandoc’s parsing algorithms, tracking source location tension manager. Development is at https://github.com/gpoore/ during parsing is not possible in the general case.2 codebraid-preview-vscode. Pandoc must be installed separately Pandoc 2.11.3 was released in December 2020. It added (https://pandoc.org/). For code execution capabilities, Codebraid a sourcepos extension for CommonMark and formats must also be installed (https://github.com/gpoore/codebraid). based on it, including GitHub-Flavored Markdown (GFM) and The preview panel can be opened using the VS Code command commonmark_x (CommonMark plus extensions similar to Pan- palette, or by clicking the Codebraid Preview button that is visible doc’s Markdown). The CommonMark parser uses a different when a Markdown document is open. The preview panel takes the parsing algorithm from the Pandoc’s Markdown parser, and this document in its current state, converts it into HTML using Pandoc, algorithm permits tracking source location. For the first time, it and displays the result using a webview. An example is shown in was possible to construct a source map for a Pandoc input format. Figure 1. Since the preview is generated by Pandoc, all Pandoc Codebraid Preview defaults to commonmark_x as an input features are fully supported. format, since it provides the most features of all CommonMark- By default, the preview updates automatically whenever the based formats. Features continue to be added to commonmark_x Markdown source is changed. There is a short user-configurable and it is gradually nearing feature parity with Pandoc’s Mark- minimum update interval. For shorter documents, sub-second down. Citations are perhaps the most important feature currently updates are typical. missing.3 The preview uses the same styling CSS as VS Code’s built- Codebraid Preview provides full bidirectional scroll sync be- in Markdown preview, so it automatically adjusts to the VS Code tween source and preview for all CommonMark-based formats, color theme. For example, changing between light and dark themes using data provided by sourcepos. In the output HTML, the changes the background and text colors in the preview. first image or inline text element created by each Markdown Codebraid Preview leverages recent Pandoc advances to pro- source line is given an id attribute corresponding to the source vide bidirectional scroll sync between the Markdown source line number. When the source is scrolled to a given line range, and the preview for all CommonMark-based Markdown variants the preview scrolls to the corresponding HTML elements using that Pandoc supports (commonmark, gfm, commonmark_x). these id attributes. When the preview is scrolled, the visible By default, Codebraid Preview treats Markdown documents as HTML elements are detected via the Intersection Observer API.4 commonmark_x, which is CommonMark with Pandoc exten- Then their id attributes are used to determine the corresponding sions for features like math, footnotes, and special list types. The Markdown line range, and the source scrolls to those lines. preview still works for other Markdown variants, but scroll sync is Scroll sync is slightly more complicated when working with disabled. By default, scroll sync is fully bidirectional, so scrolling output that is generated by executed code. For example, if a code either the source or the preview will cause the other to scroll to block is executed and creates several plots in the preview, there the corresponding location. Scroll sync can instead be configured isn’t necessarily a way to trace each individual plot back to a to be only from source to preview or only from preview to source. particular line of code in the Markdown source. In such cases, the As far as I am aware, this is the first time that scroll sync has been line range of the executed code is mapped proportionally to the implemented in a Pandoc-based preview. vertical space occupied by its output. The same underlying features that make scroll sync possible Pandoc supports multi-file documents. It can be given a list are also used to provide other preview capabilities. Double- of files to combine into a single output document. Codebraid clicking in the preview moves the cursor in the editor to the Preview provides scroll sync for multi-file documents. For ex- corresponding line of the Markdown source. ample, suppose a document is divided into two files in the same Since many Markdown variants support LaTeX math, the directory, chapter_1.md and chapter_2.md. Treating these preview includes math support via KaTeX [EA22]. as a single document involves creating a YAML configuration file Codebraid Preview can simply be used for writing plain Pan- _codebraid_preview.yaml that lists the files: doc documents. Optional execution of embedded code is possible input-files: with Codebraid [GMP19], using its built-in code execution system - chapter_1.md or Jupyter kernels. When Jupyter kernels are used, it is possible - chapter_2.md to obtain the same output that would be present in a Jupyter Now launching a preview from either chapter_1.md or notebook, including rich output such as plots and mathematics. It chapter_2.md will display a preview that combines both is also possible to specify a custom display so that only a selected files. When the preview is scrolled, the editor scrolls to the combination of code, stdout, stderr, and rich output is shown while corresponding source location, automatically switching between the rest are hidden. Code execution is decoupled from the preview process, so the Markdown source can be edited and the preview 2. See for example https://github.com/jgm/pandoc/issues/4565. can update even while code is running in the background. As far as 3. The Pandoc Roadmap at https://github.com/jgm/pandoc/wiki/Roadmap summarizes current commonmark_x capabilities. I am aware, no previous software for executing code in Markdown 4. For technical details, https://www.w3.org/TR/intersection-observer/. For has supported building a document with partial code output before an overview, https://developer.mozilla.org/en-US/docs/Web/API/Intersection_ execution has completed. Observer_API. CODEBRAID PREVIEW FOR VS CODE: PANDOC MARKDOWN PREVIEW WITH JUPYTER KERNELS 107 Fig. 1: Screenshot of a Markdown document with Codebraid Preview in VS Code. This document uses Codebraid to execute code with Jupyter kernels, so all plots and math visible in the preview are generated during document build. chapter_1.md and chapter_2.md depending on the part of of HTML rendering. In the future, optional MathJax support may the preview that is visible. be needed to provide broader math support. For some applications, The preview still works when the input format is set to a non- it may also be worth considering caching pre-rendered or image CommonMark format, but in that case scroll sync is disabled. If versions of equations to improve performance. Pandoc adds sourcepos support for additional input formats in the future, scroll sync will work automatically once Codebraid Code execution Preview adds those formats to the supported list. It is possible to attempt to reconstruct a source map by performing a parallel Optional support for executing code embedded in Markdown string search on Pandoc output and the original source. This can documents is provided by Codebraid [GMP19]. Codebraid uses be error-prone due to text manipulation during format conversion, Pandoc to convert a document into an abstract syntax tree (AST), but in the future it may be possible to construct a good enough then extracts any inline or block code marked with Codebraid source map to extend basic scroll sync support to additional input attributes from the AST, executes the code, and finally formats the formats. code output so that Pandoc can use it to create the final output document. Code execution is performed with Codebraid’s own built-in system or with Jupyter kernels. For example, the code LaTeX support block Support for mathematics is one of the key features provided by ```{.python .cb-run} many Markdown variants in Pandoc, including commonmark_x. print("Hello *world!*") Math support in the preview panel is supplied by KaTeX [EA22], ``` which is a JavaScript library for rendering LaTeX math in the would result in browser. One of the disadvantages of using Pandoc to create the preview Hello world! is that every update of the preview is a complete update. This after processing by Codebraid and finally Pandoc. The .cb-run makes the preview more sensitive to HTML rendering time. In is a Codebraid attribute that marks the code block for execution contrast, in a Jupyter notebook, it is common to write Markdown and specifies the default display of code output. Further examples in multiple cells which are rendered separately and independently. of Codebraid usage are visible in Figure 1. MathJax [Mat22] provides a broader range of LaTeX support Mixing a live preview with executable code provides potential than KaTeX, and is used in software such as JupyterLab and usability and security challenges. By default, code only runs when Quarto. While MathJax performance has improved significantly the user selects execution in the VS Code command palette or since the release of version 3.0 in 2019, KaTeX can still have a clicks the Codebraid execute button. When the preview automati- speed advantage, so it is currently the default due to the importance cally updates as a result of Markdown source changes, it only uses 108 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) cached code output. Stale cached output is detected by hashing While this build process is significantly more interactive than executed code, and then marked in the preview to alert the user. what has been possible previously, it also suggests additional The standard approach to executing code within Markdown avenues for future exploration. Codebraid’s built-in code execution documents blocks the document build process until all code has system is designed to execute a predefined sequence of code finished running. Code is extracted from the Markdown source and chunks and then exit. Jupyter kernels are currently used in the executed. Then the output is combined with the original source and same manner to avoid any potential issues with out-of-order passed on to Pandoc or another Markdown application for final execution. However, Jupyter kernels can receive and execute code conversion. This is the approach taken by RMarkdown, Quarto, indefinitely, which is how they commonly function in Jupyter note- and similar software, as well as by Codebraid until recently. This books. Instead of starting a new Jupyter kernel at the beginning of design works well for building a document a single time, but each code execution cycle, it would be possible to keep the kernel blocking until all code has executed is not ideal in the context from the previous execution cycle and only pass modified code of a document preview. chunks to it. This would allow the same out-of-order execution Codebraid now offers a new mode of code execution that al- issues that are possible in a Jupyter notebook. Yet that would lows a document to be rebuilt continuously during code execution, make possible much more rapid code output, particularly in cases with each build including all code output available at that time. where large datasets must be loaded or significant preprocessing This process involves the following steps: is required. 1) The user selects code execution. Codebraid Preview Conclusion passes the document to Codebraid. Codebraid begins Codebraid Preview represents a significant advance in tools for code execution. writing with Pandoc. For the first time, it is possible to preview 2) As soon as any code output is available, Codebraid a Pandoc Markdown document using Pandoc itself while having immediately streams this back to Codebraid Preview. The features like scroll sync between the Markdown source and the output is in a format compatible with the YAML metadata preview. When embedded code needs to be executed, it is possible block at the start of Pandoc Markdown documents. The to see code output in the preview and to continue editing the output includes a hash of the code that was executed, so document during code execution, instead of having to wait until that code changes can be detected later. code finishes running. 3) If the document is modified while code is running or if Codebraid Preview or future previewers that follow this ap- code output is received, Codebraid Preview rebuilds the proach may be perfectly adequate for shorter and even some longer preview. It creates a copy of the document with all current documents, but at some point a combination of document length, Codebraid output inserted into the YAML metadata block document complexity, and mathematical content will strain what is at the start of the document. This modified document is possible and ultimately decrease preview update frequency. Every then passed to Pandoc. Pandoc runs with a Lua filter5 that update of the preview involves converting the entire document modifies the document AST before final conversion. The with Pandoc and then rendering the resulting HTML. filter removes all code marked with Codebraid attributes On the parsing side, Pandoc’s move toward CommonMark- from the AST, and replaces it with the corresponding based Markdown variants may eventually lead to enough stan- code output stored in the AST metadata. If code has dardization that other implementations with the same syntax and been modified since execution began, this is detected features are possible. This in turn might enable entirely new with the hash of the code, and an HTML class is added approaches. An ideal scenario would be a Pandoc-compatible to the output that will mark it visually as stale output. JavaScript-based parser that can parse multiple Markdown strings Code that does not yet have output is replaced by a while treating them as having a shared document state for things visible placeholder to indicate that code is still running. like labels, references, and numbering. For example, this could When the Lua filter finishes AST modifications, Pandoc allow Pandoc Markdown within a Jupyter notebook, with all completes the document build, and the preview updates. Markdown content sharing a single document state, maybe with 4) As long as code is executing, the previous process repeats each Markdown cell being automatically updated based on Mark- whenever the preview needs to be rebuilt. down changes elsewhere. 5) Once code execution completes, the most recent output is Perhaps more practically, on the preview display side, there reused for all subsequent preview updates until the next may be ways to optimize how the HTML generated by Pandoc is time the user chooses to execute code. Any code changes loaded in the preview. A related consideration might be alternative continue to be detected by hashing the code during the preview formats. There is a significant tradition of tight source- build process, so that the output can be marked visually preview integration in LaTeX (for example, [Lau08]). In principle, as stale in the preview. Pandoc’s sourcepos extension should make possible Mark- The overall result of this process is twofold. First, building down to PDF synchronization, using LaTeX as an intermediary. a document involving executed code is nearly as fast as building a plain Pandoc document. The additional output metadata plus R EFERENCES the filter are the only extra elements involved in the document [EA22] Emily Eisenberg and Sophie Alpert. KaTeX: The fastest math build, and Pandoc Lua filters have excellent performance. Second, typesetting library for the web, 2022. URL: https://katex.org/. the output for each code chunk appears in the preview almost [GMP19] Geoffrey M. Poore. Codebraid: Live Code in Pandoc Mark- immediately after the chunk finishes execution. down. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 54 – 61, 2019. doi:10.25080/Majora- 5. For an overview of Lua filters, see https://pandoc.org/lua-filters.html. 7ddc1dd1-008. CODEBRAID PREVIEW FOR VS CODE: PANDOC MARKDOWN PREVIEW WITH JUPYTER KERNELS 109 [GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021. 3059263. [JM22] John MacFarlane. Pandoc: a universal document converter, 2006– 2022. URL: https://pandoc.org/. [Jup22] Jupyter Development Team. nbconvert: Convert Notebooks to other formats, 2015–2022. URL: https://nbconvert.readthedocs. io. [Lau08] Jerôme Laurens. Direct and reverse synchronization with Sync- TEX. TUGBoat, 29(3):365–371, 2008. [Mat22] MathJax. MathJax: Beautiful and accessible math in all browsers, 2009–2022. URL: https://www.mathjax.org/. [MWtJT20] Marc Wouts and the Jupytext Team. Jupyter notebooks as Markdown documents, Julia, Python or R scripts, 2018–2020. URL: https://jupytext.readthedocs.io/. [RSt20] RStudio Inc. R Markdown, 2016–2020. URL: https://rmarkdown. rstudio.com/. [RSt22] RStudio Inc. Welcome to Quarto, 2022. URL: https://quarto.org/. [YX15] Yihui Xie. Dynamic Documents with R and knitr. Chapman & Hall/CRC Press, 2015. 110 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder Curtis Godwin‡†∗ , Meekail Zain§†∗ , Nathan Safir‡ , Bella Humphrey§ , Shannon P Quinn§¶ F Abstract—It is often much easier and less expensive to collect data than to constraints by specifying a budget of points that can be labeled at label it. Active learning (AL) ([Set09]) responds to this issue by selecting which a time and evaluating against this budget. unlabeled data are best to label next. Standard approaches utilize task-aware In AL, the model for which we select new labels is referred to AL, which identifies informative samples based on a trained supervised model. as the task model. If this model is a classifier neural network, the Task-agnostic AL ignores the task model and instead makes selections based space in which it maps inputs before classifying them is known on learned properties of the dataset. We seek to combine these approaches and measure the contribution of incorporating task-agnostic information into as the latent space or representation space. A recent branch of standard AL, with the suspicion that the extra information in the task-agnostic AL ([SS18], [SCN+ 18], [YK19]), prominent for its applications features may improve the selection process. We test this on various AL methods to deep models, focuses on mapping unlabeled points into the task using a ResNet classifier with and without added unsupervised information from model’s latent space before comparing them. a variational autoencoder (VAE). Although the results do not show a significant These methods are limited in their analysis by the labeled improvement, we investigate the effects on the acquisition function and suggest data they must train on, failing to make use of potentially useful potential approaches for extending the work. information embedded in the unlabeled data. We therefore suggest that this family of methods may be improved by extending their Index Terms—active learning, variational autoencoder, deep learning, pytorch, representation spaces to include unsupervised features learned semi-supervised learning, unsupervised learning over the entire dataset. For this purpose, we opt to use a variational autoencoder (VAE) ([KW13]) , which is a prominent method for unsupervised representation learning. Our main contributions are Introduction (a) a new methodology for extending AL methods using VAE In deep learning, the capacity for data gathering often signifi- features and (b) an experiment comparing AL performance across cantly outpaces the labeling. This is easily observed in the field two recent feature-based AL methods using the new method. of bioimaging, where ground-truth labeling usually requires the expertise of a clinician. For example, producing a large quantity Related Literature of CT scans is relatively simple, but having them labeled for Active learning COVID-19 by cardiologists takes much more time and money. Much of the early active learning (AL) literature is based on These constraints ultimately limit the contribution of deep learning shallower, less computationally demanding networks since deeper to many crucial research problems. architectures were not well-developed at the time. Settles ([Set09]) This labeling issue has compelled advancements in the field of provides a review of these early methods. The modern approach active learning (AL) ([Set09]). In a typical AL setting, there is a uses an acquisition function, which involves ranking all available set of labeled data and a (usually larger) set of unlabeled data. A unlabeled points by some chosen heuristic H and choosing to model is trained on the labeled data, then the model is analyzed to label the points of highest ranking. evaluate which unlabeled points should be labeled to best improve the loss objective after further training. AL acknowledges labeling † These authors contributed equally. * Corresponding author: cmgodwin263@gmail.com, meekail.zain@uga.edu ‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA * Corresponding author: cmgodwin263@gmail.com, meekail.zain@uga.edu § Department of Computer Science, University of Georgia, Athens, GA 30602 USA ¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 USA Copyright © 2022 Curtis Godwin et al. This is an open-access article dis- tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- The popularity of the acquisition approach has led to a widely- vided the original author and source are credited. used evaluation procedure, which we describe in Algorithm 1. INCORPORATING TASK-AGNOSTIC INFORMATION IN TASK-BASED ACTIVE LEARNING USING A VARIATIONAL AUTOENCODER 111 This procedure trains a task model T on the initial labeled data, representation c. An additional fully connected layer then maps records its test accuracy, then uses H to label a set of unlabeled c into a single value constituting the loss prediction. points. We then once again train T on the labeled data and record When attempting to train a network to directly predict T ’s its accuracy. This is repeated until a desired number of labels is loss during training, the ground truth losses naturally decrease as reached, and then the accuracies can be graphed against the num- T is optimized, resulting in a moving objective. The authors of ber of available labels to demonstrate performance over the course ([YK19]) find that a more stable ground truth is the inequality of labeling. We can use this evaluation algorithm to separately between the losses of given pairs of points. In this case, P is evaluate multiple acquisition functions on their resulting accuracy trained on pairs of labeled points, so that P is penalized for graphs. This is utilized in many AL papers to show the efficacy producing predicted loss pairs that exhibit a different inequality of their suggested heuristics in comparison to others ([WZL+ 16], than the corresponding true loss pair. [SS18], [SCN+ 18], [YK19]). More specifically, for each batch of labeled data Lbatch ⊂ L The prevailing approach to point selection has been to choose that is propagated through T during training, the batch of true unlabeled points for which the model is most uncertain, the as- losses is computed and split randomly into a batch of pairs Pbatch . sumption being that uncertain points will be the most informative The loss prediction network produces a corresponding batch of ([BRK21]). A popular early method was to label the unlabeled predicted loss pairs, denoted Pebatch . The following pair loss is then points of highest Shannon entropy ([Sha48]) under the task model, computed given each p ∈ Pbatch and its corresponding p̃ ∈ Pebatch : which is a measure of uncertainty between the classes of the data. This method is now more commonly used in combination L pair (p, p̃) = max(0, −I (p) · ( p̃(1) − p̃(2) ) + ξ ), (3) with a representativeness measure ([WZL+ 16]) to avoid selecting where I is the following indicator function for pair inequality: condensed clusters of very similar points. ( 1, p(1) > p(2) I (p) = . (4) Recent heuristics using deep features −1, p(1) ≤ p(2) For convolutional neural networks (CNNs) in image classification settings, the task model T can be decomposed into a feature- Variational Autoencoders generating module Variational autoencoders (VAEs) ([KW13]) are an unsupervised T f : Rn → R f , method for modeling data using Bayesian posterior inference. We begin with the Bayesian assumption that the data is well- which maps the input data vectors to the output of the final fully modeled by some distribution, often a multivariate Gaussian. We connected layer before classification, and a classification module also assume that this data distribution can be inferred reasonably well by a lower dimensional random variable, also often modeled Tc : R f → {0, 1, ..., c}, by a multivariate Gaussian. where c is the number of classes. The inference process then consists of an encoding into the Recent deep learning-based AL methods have approached the lower dimensional latent variable, followed by a decoding back notion of model uncertainty in terms of the rich features generated into the data dimension. We parametrize both the encoder and the by the learned model. Core-set ([SS18]) and MedAL ([SCN+ 18]) decoder as neural networks, jointly optimizing their parameters select unlabeled points that are the furthest from the labeled set with the following loss function ([KW19]): in terms of L2 distance between the learned features. For core-set, Lθ ,φ (x) = log pθ (x|z) + [log pθ (z) − log qφ (z|x)], (5) each point constructing the set S in step 6 of Algorithm 1 is chosen by where θ and φ are the parameters of the encoder and the decoder, u∗ = argmax min ||(T f (u) − T f (``))||2 , (1) respectively. The first term is the reconstruction error, penalizing u∈U ` ∈L the parameters for producing poor reconstructions of the input where U is the unlabeled set and L is the labeled set. The data. The second term is the regularization error, encouraging the analogous operation for MedAL is encoding to resemble a pre-selected prior distribution, commonly a unit Gaussian prior. 1 |L| The encoder of a well-optimized VAE can be used to gen- u∗ = argmax u∈U ∑ ||T f (u) − T f (Li )||2 . |L| i=1 (2) erate latent encodings with rich features which are sufficient to approximately reconstruct the data. The features also have some Note that after a point u∗ is chosen, the selection of the next point geometric consistency, in the sense that the encoder is encouraged assumes the previous u∗ to be in the labeled set. This way we to generate encodings in the pattern of a Gaussian distribution. discourage choosing sets that are closely packed together, leading to sets that are more diverse in terms of their features. This effect is more pronounced in the core-set method since it takes the Methods minimum distance whereas MedAL uses the average distance. We observe that the notions of uncertainty developed in the core- Another recent method ([YK19]) trains a regression network set and MedAL methods rely on distances between feature vectors to predict the loss of the task model, then takes the heuristic H modeled by the task model T . Additionally, loss prediction relies in Algorithm 1 to select the unlabeled points of highest predicted on a fully connected layer mapping from a feature space to a single loss. To implement this, the loss prediction network P is attached value, producing different predictions depending on the values of to a ResNet task model T and is trained jointly with T . The the relevant feature vector. Thus all of these methods utilize spatial inputs to P are the features output by the ResNet’s four residual reasoning in a vector space. blocks. These features are mapped into the same dimensionality Furthermore, in each of these methods, the heuristic H only via a fully connected layer and then concatenated to form a has access to information learned by the task model, which is 112 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) trained only on the labeled points at a given timestep in the la- ensure that the task models being compared were supplied with beling procedure. Since variational autoencoder (VAE) encodings the same initial set of labels. are not limited by the contents of the labeled set, we suggest that With four NVIDIA 2080 GPUs, the total runtime for the the aforementioned methods may benefit by expanding the vector MNIST experiments was 5113s for core-set and 4955s for loss spaces they investigate to include VAE features learned across prediction; for ChestMNIST, the total runtime was 7085s for core- the entire dataset, including the unlabeled data. These additional set and 7209s for loss prediction. features will constitute representative and previously inaccessible information regarding the data, which may improve the active learning process. We implement this by first training a VAE model V on the given dataset. V can then be used as a function returning the VAE features for any given datapoint. We append these additional features to the relevant vector spaces using vector concatenation, an operation we denote with the symbol _. The modified point selection operation in core-set then becomes u∗ = argmax min ||([T f (u) _ αV (u)] − [T f (``) _ αV (``)]||2 , u∈U ` ∈L (6) where α is a hyperparameter that scales the influence of the VAE features in computing the vector distance. To similarly modify the loss prediction method, we concatenate the VAE features to the Fig. 1: The average MNIST results using the core-set heuristic versus final ResNet feature concatenation c before the loss prediction, the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs. so that the extra information is factored into the training of the prediction network P. Experiments In order to measure the efficacy of the newly proposed methods, we generate accuracy graphs using Algorithm 1, freezing all settings except the selection heuristic H . We then compare the performance of the core-set and loss prediction heuristics with their VAE-augmented counterparts. We use ResNet-18 pretrained on ImageNet as the task model, using the SGD optimizer with learning rate 0.001 and momen- tum 0.9. We train on the MNIST ([Den12]) and ChestMNIST ([YSN21]) datasets. ChestMNIST consists of 112,120 chest X-ray images resized to 28x28 and is one of several benchmark medical image datasets introduced in ([YSN21]). Fig. 2: The average MNIST results using the loss prediction heuristic For both datasets we experiment on randomly selected subsets, versus the VAE-augmented loss prediction heuristic for Algorithm 1 using 25000 points for MNIST and 30000 points for ChestMNIST. over 5 runs. In both cases we begin with 3000 initial labels and label 3000 points per active learning step. We opt to retrain the task model after each labeling step instead of fine-tuning. We use a similar training strategy as in ([SCN+ 18]), training the task model until >99% train accuracy before selecting new points to label. This ensures that the ResNet is similarly well fit to the labeled data at each labeling iteration. This is implemented by training for 10 epochs on the initial training set and increasing the training epochs by 5 after each labeling iteration. The VAEs used for the experiments are trained for 20 epochs using an Adam optimizer with learning rate 0.001 and weight decay 0.005. The VAE encoder architecture consists of four con- volutional downsampling filters and two linear layers to learn the low dimensional mean and log variance. The decoder consists of an upsampling convolution and four size-preserving convolutions to learn the reconstruction. Fig. 3: The average ChestMNIST results using the core-set heuristic Experiments were run five times, each with a separate set of versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 randomly chosen initial labels, with the displayed results showing runs. the average validation accuracies across all runs. Figures 1 and 3 show the core-set results, while Figures 2 and 4 show the loss To investigate the qualitative difference between the VAE and prediction results. In all cases, shared random seeds were used to non-VAE approaches, we performed an additional experiment INCORPORATING TASK-AGNOSTIC INFORMATION IN TASK-BASED ACTIVE LEARNING USING A VARIATIONAL AUTOENCODER 113 Fig. 4: The average ChestMNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs. to visualize an example of core-set selection. We first train the ResNet-18 with the same hyperparameter settings on 1000 initial labels from the ChestMNIST dataset, then randomly choose 1556 Fig. 6: A t-SNE visualization of the ChestMNIST points chosen by (5%) of the unlabeled points from which to select 100 points to core-set when the ResNet features are augmented with VAE features. label. These smaller sizes were chosen to promote visual clarity in the output graphs. We use t-SNE ([VdMH08]) dimensionality reduction to show process. In 5, the selected points tend to be more spread out, the ResNet features of the labeled set, the unlabeled set, and the while in 6 they cluster at one edge. This appears to mirror the points chosen to be labeled by core-set. transformation of the rest of the data, which is more spread out without the VAE features, but becomes condensed in the center when they are introduced, approaching the shape of a Gaussian distribution. It seems that with the added VAE features, the selected points are further out of distribution in the latent space. This makes sense because points tend to be more sparse at the tails of a Guassian distribution and core-set prioritizes points that are well-isolated from other points. One reason for the lack of performance improvement may be the homogeneous nature of the VAE, where the optimization goal is reconstruction rather than classification. This could be improved by using a multimodal prior in the VAE, which may do a better job of modeling relevant differences between points. Conclusion Our original intuition was that additional unsupervised informa- tion may improve established active learning methods, especially when using a modern unsupervised representation method such as a VAE. The experimental results did not indicate this hypothesis, but additional investigation of the VAE features showed a notable change in the task model latent space. Though this did not result in Fig. 5: A t-SNE visualization of the ChestMNIST points chosen by superior point selections in our case, it is of interest whether dif- core-set. ferent approaches to latent space augmentation in active learning may fare better. Future work may explore the use of class-conditional VAEs Discussion in a similar application, since a VAE that can utilize the available class labels may produce more effective representations, and it Overall, the VAE-augmented active learning heuristics did not could be retrained along with the task model after each labeling exhibit a significant performance difference when compared with iteration. their counterparts. The only case of a significant p-value (<0.05) occurred during loss prediction on the MNIST dataset at 21000 labels. R EFERENCES The t-SNE visualizations in Figures 5 and 6 show some of [BRK21] Samuel Budd, Emma C Robinson, and Bernhard Kainz. A the influence that the VAE features have on the core-set selection survey on active learning and human-in-the-loop deep learning 114 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) for medical image analysis. Medical Image Analysis, 71:102062, 2021. doi:10.1016/j.media.2021.102062. [Den12] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. doi:10.1109/MSP.2012.2211477. [KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [KW19] Diederik P. Kingma and Max Welling. An Intro- duction to Variational Autoencoders. Now Publishers, 2019. URL: https://doi.org/10.1561%2F9781680836233, doi: 10.1561/9781680836233. [SCN 18] Asim Smailagic, Pedro Costa, Hae Young Noh, Devesh + Walawalkar, Kartik Khandelwal, Adrian Galdran, Mostafa Mir- shekari, Jonathon Fagert, Susu Xu, Pei Zhang, et al. Medal: Accurate and robust deep active learning for medical image analysis. In 2018 17th IEEE international conference on machine learning and applications (ICMLA), pages 481–488. IEEE, 2018. doi:10.1109/icmla.2018.00078. [Set09] Burr Settles. Active learning literature survey. 2009. [Sha48] Claude Elwood Shannon. A mathematical theory of communica- tion. The Bell system technical journal, 27(3):379–423, 1948. [SS18] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. URL: https://openreview.net/ forum?id=H1aIuk-RW. [VdMH08] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [WZL+ 16] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technol- ogy, 27(12):2591–2600, 2016. doi:10.1109/tcsvt.2016. 2589879. [YK19] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 93–102, 2019. doi:10.1109/CVPR.2019.00018. [YSN21] Jiancheng Yang, Rui Shi, and Bingbing Ni. Medmnist classi- fication decathlon: A lightweight automl benchmark for med- ical image analysis. In 2021 IEEE 18th International Sym- posium on Biomedical Imaging (ISBI), pages 191–195, 2021. doi:10.1109/ISBI48211.2021.9434062. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 115 Awkward Packaging: building Scikit-HEP Henry Schreiner‡∗ , Jim Pivarski‡ , Eduardo Rodrigues§ F Abstract—Scikit-HEP has grown rapidly over the last few years, not just to serve parts [Lam98]. The glueing together of the system was done in the needs of the High Energy Physics (HEP) community, but in many ways, Python, a model still popular today, though some experiments are the Python ecosystem at large. AwkwardArray, boost-histogram/hist, and iminuit now using Python + Numba as an alternative model, such as for are examples of libraries that are used beyond the original HEP focus. In this example the Xenon1T experiment [RTA+ 17], [RS21]. paper we will look at key packages in the ecosystem, and how the collection of In the early 2000s, the use of Python HEP exploded, heavily 30+ packages was developed and maintained. Also we will look at some of the software ecosystem contributions made to packages like cibuildwheel, pybind11, driven by experiments like LHCb developing frameworks and user nox, scikit-build, build, and pipx that support this effort. We will also discuss the tools for scripting. ROOT started providing Python bindings in Scikit-HEP developer pages and initial WebAssembly support. 2004 [LGMM05] that were not considered Pythonic [GTW20], and still required a complex multi-hour build of ROOT to use1 . Index Terms—packaging, ecosystem, high energy physics, community project Analyses still consisted largely of ROOT, with Python sometimes showing up. By the mid 2010’s, a marked change had occurred, driven by Introduction the success of Python in Data Science, especially in education. High Energy Physics (HEP) has always had intense computing Many new students were coming into HEP with little or no needs due to the size and scale of the data collected. The C++ experience, but with existing knowledge of Python and the World Wide Web was invented at the CERN Physics laboratory growing Python data science ecosystem, like NumPy and Pandas. in Switzerland in 1989 when scientists in the EU were trying Several HEP experiment analyses were performed in, or driven to communicate results and datasets with scientist in the US, by, Python, with ROOT only being used for things that were and vice-versa [LCC+ 09]. Today, HEP has the largest scientific not available in the Python ecosystem. Some of these were HEP machine in the world, at CERN: the Large Hadron Collider (LHC), specific: ROOT is also a data format, so users needed to be able 27 km in circumference [EB08], with multiple experiments with to read data from ROOT files. Others were less specific: HEP thousands of collaborators processing over a petabyte of raw data users have intense histogram requirements due to the data sizes, every day, with 100 petabytes being stored per year at CERN. This large portions of HEP data are "jagged" rather than rectangular; is one of the largest scientific datasets in the world of exabyte scale vector manipulation was important (especially Lorenz Vectors, a [PJ11], which is roughly comparable in order of magnitude to all four dimensional relativistic vector with a non-Euclidean metric); of astronomy or YouTube [SLF+ 15]. and data fitting was important, especially with complex models In the mid nineties, HEP users were beginning to look for and accurate error estimation. a new language to replace Fortran. A few HEP scientists started investigating the use of Python around the release of 1.0.0 in 1994 Beginnings of a scikit [Tem22]. A year later, the ROOT project for an analysis toolkit (and framework) was released, quickly making C++ the main In 2016, the ecosystem for Python in HEP was rather fragmented. language for HEP. The ROOT project also needed an interpreted Physicists were developing tools in isolation, without knowing language to driving analysis code. Python was rejected for this role out the overlaps with other tools, and without making them due to being "exotic" at the time, and because it was considered too interoperable. There were a handful of popular packages that much to ask physicists to code in two languages. Instead, ROOT were useful in HEP spread around among different authors. The provided a C++ interpreter, called CINT, which later was replaced ROOTPy project had several packages that made the ROOT- with Cling, which is the basis for the clang-repl project in LLVM Python bridge a little easier than the built-in PyROOT, such as the today [IVL22]. root-numpy and related root-pandas packages. The C++ MINUIT Python would start showing up in the late 90’s in experiment fitting library was integrated into ROOT, but the iminuit package frameworks as a configuration language. These frameworks were [Dea20] provided an easy to install standalone Python package primarily written in C++, but were made of many configurable with an extracted copy of MINUIT. Several other specialized standalone C++ packages had bindings as well. Many of the initial * Corresponding author: henryfs@princeton.edu authors were transitioning to a less-code centric role or leaving ‡ Princeton University § University of Liverpool for industry, leaving projects like ROOTPy and iminuit without maintainers. Copyright © 2022 Henry Schreiner et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, 1. Almost 20 years later ROOT’s Python bindings have been rewritten for which permits unrestricted use, distribution, and reproduction in any medium, easier Pythonizations, and installing ROOT in Conda is now much easier, provided the original author and source are credited. thanks in large part to efforts from Scikit-HEP developers. 116 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) later writer) that could remove the initial conversion environment by simply pip installing a package. It also had a simple, Pythonic numpythia interface and produced outputs Python users could immediately use, like NumPy arrays, instead of PyROOT’s wrapped C++ pyhepmc nndrone pointers. Uproot needed to do more than just be file format reader/writer; it needed to provide a way to represent the special pylhe structure and common objects that ROOT files could contain. This lead to the development of two related packages that would hepunits support uproot. One, uproot-methods, included Pythonic access to functionality provided by ROOT for its core classes, like spatial and Lorentz vectors. The other was AwkwardArray, which would uhi grow to become one of the most important and most general histoprint packages in Scikit-HEP. This package allows NumPy-like idioms for array-at-a-time manipulation on jagged data structures. A jagged array is a (possibly structured) array with a variable length dimension. These are very common and relevant in HEP; events have a variable number of tracks, tracks have a variable number Fig. 1: The Scikit-HEP ecosystem and affiliated packages. of hits in the detector, etc. Many other fields also have jagged data structures. While there are formats to store such structures, computations on jagged structures have usually been closer to SQL Eduardo Rodrigues, a scientist working on the LHCb ex- queries on multiple tables than direct object manipulation. Pandas periment for the University of Cincinnati, started working on a handles this through multiple indexing and a lot of duplication. package called scikit-hep that would provide a set to tools useful Uproot was a huge hit with incoming HEP students (see Fig 2); for physicists working on HEP analysis. The initial version of the suddenly they could access HEP data using a library installed with scikit-hep package had a simple vector library, HEP related units pip or conda and no external compiler or library requirements, and and conversions, several useful statistical tools, and provenance could easily use tools they already knew that were compatible with recording functionality, the Python buffer protocol, like NumPy, Pandas and the rapidly He also placed the scikit-hep GitHub repository in a Scikit- growing machine learning frameworks. There were still some gaps HEP GitHub organization, and asked several of the other HEP and pain points in the ecosystem, but an analysis without writing related packages to join. The ROOTPy project was ending, with C++ (interpreted or compiled) and compiling ROOT manually was the primary author moving on, and so several of the then-popular finally possible. Scikit-HEP did not and does not intend to replace packages2 that were included in the ROOTPy organization were ROOT, but it provides alternative solutions that work natively in happily transferred to Scikit-HEP. Several other existing HEP the Python "Big Data" ecosystem. libraries, primarily interfacing to existing C++ simulation and Several other useful HEP libraries were also written. Particle tracking frameworks, also joined, like PyJet and NumPythia. Some was written for accessing the Particle Data Group (PDG) particle of these libraries have been retired or replaced today, but were an data in a simple and Pythonic way. DecayLanguage originally important part of Scikit-HEP’s initial growth. provided tooling for decay definitions, but was quickly expanded to include tools to read and validate "DEC" decay files, an existing First initial success text format used to configure simulations in HEP. In 2016, the largest barrier to using Python in HEP in a Pythonic way was ROOT. It was challenging to compile, had many non- Building compiled packages Python dependencies, was huge compared to most Python li- braries, and didn’t play well with Python packaging. It was not In 2018, HEP physicist and programmer Hans Dembinski pro- Pythonic, meaning it had very little support for Python protocols posed a histogram library to the Boost libraries, the most influen- like iteration, buffers, keyword arguments, tab completion and tial C++ library collection; many additions to the standard library inspect in, dunder methods, didn’t follow conventions for useful are based on Boost. Boost.Histogram provided a histogram-as- reprs, and Python naming conventions; it was simply a direct on- an-object concept from HEP, but was designed around C++14 demand C++ binding, including pointers. Many Python analyses templating, using composable axes and storage types. It originally started with a "convert data" step using PyROOT to read ROOT had an initial Python binding, written in Boost::Python. Henry files and convert them to a Python friendly format like HDF5. Schreiner proposed the creation of a standalone binding to be Then the bulk of the analysis would use reproducible Python written with pybind11 in Scikit-HEP. The original bindings were virtual environments or Conda environments. removed, Boost::Histogram was accepted into the Boost libraries, This changed when Jim Pivarski introduced the Uproot pack- and work began on boost-histogram. IRIS-HEP, a multi-institution age, a pure-Python implementation of a ROOT file reader (and project for sustainable HEP software, had just started, which was providing funding for several developers to work on Scikit-HEP 2. The primary package of the ROOTPy project, also called ROOTPy, was project packages such as this one. This project would pioneer not transferred, but instead had a final release and then died. It was an standalone C++ library development and deployment for Scikit- inspiration for the new PyROOT bindings, and influenced later Scikit-HEP HEP. packages like mplhep. The transferred libraries have since been replaced by integrated ROOT functionality. All these packages required ROOT, which is There were already a variety of attempts at histogram libraries, not on PyPI, so were not suited for a Python-centric ecosystem. but none of them filled the requirements of HEP physicists: AWKWARD PACKAGING: BUILDING SCIKIT-HEP 117 ROOT (C++ and PyROOT) (as a baseline for scale) Scientific Python P HE Scikit-HEP in on CMSSW config th (Python but not data analysis) Py c ntifi ie Sc PyROOT of ag es e ack Us EPp kit -H ci of S Use Fig. 2: Adoption of scientific Python libraries and Scikit-HEP among members of the CMS experiment (one of the four major LHC experiments). CMS requires users to fork github:cms-sw/cmssw, which can be used to identify 3484 physicist users, who created 16656 non-fork repos. This plot quantifies adoption by counting "#include X", "import X", and "from X import" strings in the users’ code to measure adoption of various libraries (most popular by category are shown). bo lhep gram, com mainstream Python adoption to in HEP: when many histogram hist st::His libraries lived and died , mp Boo ROOT histogram part of ROOT (395 C++ files) YODA histograms histograms YODA in rootpy in Coffea Fig. 3: Developer activity on histogram libraries in HEP: number of unique committers to each library per month, smoothed (derived from git logs). Illustrates the convergence of a fractured community (around 2017) into a unified one (now). fills on pre-existing histograms, simple manipulation of multi- pybind11. dimensional histograms, competitive performance, and easy to The first stand-alone development was azure-wheel-helpers, a install in clusters or for students. Any new attempt here would set of files that helped produce wheels on the new Azure Pipelines have to be clearly better than the existing collection of diverse platform. Building redistributable wheels requires a variety of attempts (see Fig 3). The development of a library with compiled techniques, even without shared libraries, that vary dramatically components intended to be usable everywhere required good between platforms and were/are poorly documented. On Linux, support for building libraries that was lacking both in Scikit- everything needs to be built inside a controlled manylinux image, HEP and to an extent the broader Python ecosystem. Previous and post-processed by the auditwheel tool. On macOS, this in- advancements in the packaging ecosystem, such as the wheel cludes downloading an official CPython binary for Python to allow format for distributing binary platform dependent Python packages older versions of macOS to be targeted (10.9+), several special and the manylinux specification and docker image that allowed a environment variables, especially when cross compiling to Apple single compiled wheel to target many distributions of Linux, but Silicon, and post processing with the develwheel tool. Windows is there still were many challenges to making a library redistributable the simplest, as most versions of CPython work identically there. on all platforms. azure-wheel-helpers worked well, and was quickly adapted for The boost-histogram library only depended on header-only the other packages in Scikit-HEP that included non-ROOT binary components of the Boost libraries, and the header-only pybind11 components. Work here would eventually be merged into the package, so it was able to avoid a separate compile step or existing and general cibuildwheel package, which would become linking to external dependencies, which simplified the initial build the build tool for all non-ROOT binary packages in Scikit-HEP, as process. All needed files were collected from git submodules and well as over 600 other packages like matplotlib and numpy, and packed into a source distribution (SDist), and everything was built was accepted into the PyPA (Python Packaging Authority). using only setuptools, making build-from-source simple on any The second major development was the upstreaming of CI system supporting C++14. This did not include RHEL 7, a popular and build system developments to pybind11. Pybind11 is a C++ platform in HEP at the time, and on any platform building could API for Python designed for writing a binding to C++, and take several minutes and required several gigabytes of memory provided significant benefits to our packages over (mis)-using to resolve the heavy C++ templating in the Boost libraries and Cython for bindings; Cython was designed to transpile a Python- 118 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) like language to C (or C++), and just happened to support bindings since you can call C and C++ from it, but it was not what it Boost::Histogram was designed for. Benefits of pybind11 included reduced code thin wrapper complexity and duplication, no pre-process step (cythonize), no need to pin NumPy when building, and a cross-package API. The boost-histogram iMinuit package was later moved from Cython to pybind11 as fully featured well, and pybind11 became the Scikit-HEP recommended binding tool. We contributed a variety of fixes and features to pybind11, hist including positional-only and keyword-only arguments, the option plotting in to prepend to the overload chain, and an API for type access Matplotlib and manipulation. We also completely redesigned CMake inte- gration, added a new pure-Setuptools helpers file, and completely mplhep plotting in terminal redesigned the CI using GitHub Actions, running over 70 jobs on a variety of systems and compilers. We also helped modernize and histoprint improve all the example projects with simpler builds, new CI, and cibuildwheel support. This example of a project with binary components being Fig. 4: The collection of histogram packages and related packages in usable everywhere then encouraged the development of Awkward Scikit-HEP. 1.0, a rewrite of AwkwardArray replacing the Python-only code with compiled code using pybind11, fixing some long-standing limitations, like an inability to slice past two dimensions or select broader HEP ecosystem. The affiliated classification is also used "n choose k" for k > 5; these simply could not be expressed on broader ecosystem packages like pybind11 and cibuildwheel using Awkward 0’s NumPy expressions, but can be solved with that we recommend and share maintainers with. custom compiled kernels. This also enabled further developments in backends [PEL20]. Histogramming was designed to be a collection of specialized packages (see Fig. 4) with carefully defined interoperability; boost-histogram for manipulation and filling, Hist for a user- Broader ecosystem friendly interface and simple plotting tools, histoprint for display- Scikit-HEP had become a "toolset" for HEP analysis in Python, a ing histograms, and the existing mplhep and uproot packages also collection of packages that worked together, instead of a "toolkit" needed to be able to work with histograms. This ecosystem was like ROOT, which is one monopackage that tries to provide every- built and is held together with UHI, which is a formal specification thing [R+ 20]. A toolset is more natural in the Python ecosystem, agreed upon by several developers of different libraries, backed by where we have good packaging tools and many existing libraries. a statically typed Protocol, for a PlottableHistogram object. Pro- Scikit-HEP only needed to fill existing gaps, instead of covering ducers of histograms, like boost-histogram/hist and uproot provide every possible aspect of an analysis like ROOT did. The original objects that follow this specification, and users of histograms, scikit-hep package had its functionality pulled out into existing or such as mplhep and histoprint take any object that follows this new separate packages such as HEPUnits and Vector, and the core specification. The UHI library is not required at runtime, though it scikit-hep package instead became a metapackage with no unique does also provide a few simple utilities to help a library also accept functionality on its own. Instead, it installs a useful subset of our ROOT histograms, which do not (currently) follow the Protocol, so libraries for a physicist wanting to quickly get started on a new several libraries have decided to include it at runtime too. By using analysis. a static type checker like MyPy to statically enforce a Protocol, Scikit-HEP was quickly becoming the center of HEP specific libraries that can communicate without depending on each other Python software (see Fig. 1). Several other projects or packages or on a shared runtime dependency and class inheritance. This has joined Scikit-HEP iMinuit, a popular HEP and astrophysics fitting been a great success story for Scikit-HEP, and We expect Protocols library, was probably the most widely used single package to to continue to be used in more places in the ecosystem. have joined. PyHF and cabinetry also joined; these were larger The design for Scikit-HEP as a toolset is of many parts that frameworks that could drive a significant part of an analysis all work well together. One example of a package pulling together internally using other Scikit-HEP tools. many components is uproot-browser, a tool that combines uproot, Other packages, like GooFit, Coffea, and zFit, were not added, Hist, and Python libraries like textual and plotext to provide a but were built on Scikit-HEP packages and had developers work- terminal browser for ROOT files. ing closely with Scikit-HEP maintainers. Scikit-HEP introduced Scikit-HEP’s external contributions continued to grow. One of an "affiliated" classification for these packages, which allowed the most notable ones was our work on cibuildwheel. This was an external package to be listed on the Scikit-HEP website a Python package that supported building redistributable wheels and encouraged collaboration. Coffea had a strong influence on multiple CI systems. Unlike our own azure-wheel-helpers or on histogram design, and zFit has contributed code to Scikit- the competing multibuild package, it was written in Python, so HEP. Currently all affiliated packages have at least one Scikit- good practices in Python package design could apply, like unit HEP developer as a maintainer, though that is currently not a and integration tests, static checks, and it was easy to remain requirement. An affiliated package fills a particular need for the independent of the underlying CI system. Building wheels on community. Scikit-HEP doesn’t have to, or need to, attempt to Linux requires a docker image, macOS requires the python.org develop a package that others are providing, but rather tries to Python, and Windows can use any copy of Python - cibuildwheel ensure that the externally provided package works well with the uses this to supply Python in all cases, which keeps it from AWKWARD PACKAGING: BUILDING SCIKIT-HEP 119 depending on the CI’s support for a particular Python version. We helpful for monitoring adoption of the developer pages, especially merged our improvements to cibuildwheel, like better Windows newer additions, across the Scikit-HEP packages. This package support, VCS versioning support, and better PEP 518 support. was then implemented directly into the Scikit-HEP pages, using We dropped azure-wheel-helpers, and eventually a scikit-build Pyodide to run Python in WebAssembly directly inside a user’s maintainer joined the cibuildwheel project. cibuildwheel would browser. Now anyone visiting the page can enter their repository go on to join the PyPA, and is now in use in over 600 packages, and branch, and see the adoption report in a couple of seconds. including numpy, matplotlib, mypy, and scikit-learn. Our continued contributions to cibuildwheel included a Working toward the future TOML-based configuration system for cibuildwheel 2.0, an over- Scikit-HEP is looking toward the future in several different areas. ride system to make supporting multiple manylinux and musllinux We have been working with the Pyodide developers to support targets easier, a way to build directly from SDists, an option to use WebAssembly; boost-histogram is compiled into Pyodide 0.20, build instead of pip, the automatic detection of Python version and Pyodide’s support for pybind11 packages is significantly bet- requirements, and better globbing support for build specifiers. We ter due to that work, including adding support for C++ exception also helped improve the code quality in various ways, including handling. PyHF’s documentation includes a live Pyodide kernel, fully statically typing the codebase, applying various checks and and a try-pyhf site (based on the repo-review tool) lets users run style controls, automating CI processes, and improving support for a model without installing anything - it can even be saved as a special platforms like CPython 3.8 on macOS Apple Silicon. webapp on mobile devices. We also have helped with build, nox, pyodide, and many other We have also been working with Scikit-Build to try to provide packages, improving the tooling we depend on to develop scikit- a modern build experience in Python using CMake. This project build and giving back to the community. is just starting, but we expect over the next year or two that the usage of CMake as a first class build tool for binaries in The Scikit-HEP Developer Pages Python will be possible using modern developments and avoiding A variety of packaging best practices were coming out of the distutils/setuptools hacks. boost-histogram work, supporting both ease of installation for users as well as various static checks and styling to keep the Summary package easy to maintain and reduce bugs. These techniques The Scikit-HEP project started in Autumn 2016 and has grown would also be useful apply to Scikit-HEP’s nearly thirty other to be a core component in many HEP analyses. It has also packages, but applying them one-by-one was not scalable. The provided packages that are growing in usage outside of HEP, like development and adoption of azure-wheel-helpers included a se- AwkwardArray, boost-histogram/Hist, and iMinuit. The tooling ries of blog posts that covered the Azure Pipelines platform and developed and improved by Scikit-HEP has helped Scikit-HEP wheel building details. This ended up serving as the inspiration developers as well as the broader Python community. for a new set of pages on the Scikit-HEP website for developers interested in making Python packages. Unlike blog posts, these would be continuously maintained and extended over the years, R EFERENCES serving as a template and guide for updating and adding packages [Dea20] Hans Dembinski and Piti Ongmongkolkul et al. scikit- to Scikit-HEP, and educating new developers. hep/iminuit. Dec 2020. URL: https://doi.org/10.5281/zenodo. 3949207, doi:10.5281/zenodo.3949207. These pages grew to describe the best practices for developing [EB08] Lyndon Evans and Philip Bryant. Lhc machine. Journal of and maintaining a package, covering recommended configuration, instrumentation, 3(08):S08001, 2008. style checking, testing, continuous integration setup, task runners, [GTW20] Galli, Massimiliano, Tejedor, Enric, and Wunsch, Stefan. "a new and more. Shortly after the introduction of the developer pages, pyroot: Modern, interoperable and more pythonic". EPJ Web Conf., 245:06004, 2020. URL: https://doi.org/10.1051/epjconf/ Scikit-HEP developers started asking for a template to quickly 202024506004, doi:10.1051/epjconf/202024506004. produce new packages following the guidelines. This was eventu- [IVL22] Ioana Ifrim, Vassil Vassilev, and David J Lange. GPU Ac- ally produced; the "cookiecutter" based template is kept in sync celerated Automatic Differentiation With Clad. arXiv preprint with the developer pages; any new addition to one is also added arXiv:2203.06139, 2022. [Lam98] Stephan Lammel. Computing models of cdf and dØ to the other. The developer pages are also kept up to date using a in run ii. Computer Physics Communications, 110(1):32– CI job that bumps any GitHub Actions or pre-commit versions to 37, 1998. URL: https://www.sciencedirect.com/science/article/ the most recent versions weekly. Some portions of the developer pii/S0010465597001501, doi:10.1016/s0010-4655(97) 00150-1. pages have been contributed to packaging.python.org, as well. [LCC+ 09] Barry M Leiner, Vinton G Cerf, David D Clark, Robert E The cookie cutter was developed to be able to support multiple Kahn, Leonard Kleinrock, Daniel C Lynch, Jon Postel, Larry G build backends; the original design was to target both pure Python Roberts, and Stephen Wolff. A brief history of the internet. and Pybind11 based binary builds. This has expanded to include ACM SIGCOMM Computer Communication Review, 39(5):22– 31, 2009. 11 different backends by mid 2022, including Rust extensions, [LGMM05] W Lavrijsen, J Generowicz, M Marino, and P Mato. Reflection- many PEP 621 based backends, and a Scikit-Build based backend Based Python-C++ Bindings. 2005. URL: https://cds.cern.ch/ for pybind11 in addition to the classic Setuptools one. This has record/865620, doi:10.5170/CERN-2005-002.441. [PEL20] Jim Pivarski, Peter Elmer, and David Lange. Awkward arrays helped work out bugs and influence the design of several PEP in python, c++, and numba. In EPJ Web of Conferences, 621 packages, including helping with the addition of PEP 621 to volume 245, page 05023. EDP Sciences, 2020. doi:10.1051/ Setuptools. epjconf/202024505023. The most recent addition to the pages was based on a new [PJ11] Andreas J Peters and Lukasz Janyst. Exabyte scale storage at CERN. In Journal of Physics: Conference Series, volume 331, repo-review package which evaluates and existing repository to page 052015. IOP Publishing, 2011. doi:10.1088/1742- see what parts of the guidelines are being followed. This was 6596/331/5/052015. 120 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [R+ 20] Eduardo Rodrigues et al. The Scikit HEP Project – overview and prospects. EPJ Web of Conferences, 245:06028, 2020. arXiv: 2007.03577, doi:10.1051/epjconf/202024506028. [RS21] Olivier Rousselle and Tom Sykora. Fast simulation of Time- of-Flight detectors at the LHC. In EPJ Web of Conferences, volume 251, page 03027. EDP Sciences, 2021. doi:10.1051/ epjconf/202125103027. [RTA+ 17] D Remenska, C Tunnell, J Aalbers, S Verhoeven, J Maassen, and J Templon. Giving pandas ROOT to chew on: experiences with the XENON1T Dark Matter experiment. In Journal of Physics: Conference Series, volume 898, page 042003. IOP Publishing, 2017. [SLF+ 15] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: astronomical or genomical? PLoS biology, 13(7):e1002195, 2015. [Tem22] Jeffrey Templon. Reflections on the uptake of the Python pro- gramming language in Nuclear and High-Energy Physics, March 2022. None. URL: https://doi.org/10.5281/zenodo.6353621, doi:10.5281/zenodo.6353621. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 121 Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber Ido Michael‡∗ F This paper walks through this interactive tutorial. It is highly recommended running this interactively so it’s easier to follow and see the results in real-time. There’s a binder link in there as well, so you can launch it instantly. Fig. 1: In this pipeline none of the tasks were executed - it’s all red. 1. Introduction Notebooks are an excellent environment for data exploration: In addition, it can transform a notebook to a single-task pipeline they allow us to write code interactively and get visual feedback, and then the user can split it into smaller tasks as they see fit. providing an unbeatable experience for understanding our data. To refactor the notebook, we use the soorgeon refactor However, this convenience comes at a cost; if we are not command: careful about adding and removing code cells, we may have an soorgeon refactor nb.ipynb irreproducible notebook. Arbitrary execution order is a prevalent After running the refactor command, we can take a look at the problem: a recent analysis found that about 36% of notebooks on local directory and see that we now have multiple python tasks GitHub did not execute in linear order. To ensure our notebooks which that are ready for production: run, we must continuously test them to catch these problems. ls playground A second notable problem is the size of notebooks: the more cells we have, the more difficult it is to debug since there are more We can see that we have a few new files. pipeline.yaml variables and code involved. contains the pipeline declaration, and tasks/ contains the stages Software engineers typically break down projects into multiple that Soorgeon identified based on our H2 Markdown headings: steps and test continuously to prevent broken and unmaintainable ls playground/tasks code. However, applying these ideas for data analysis requires extra work; multiple notebooks imply we have to ensure the output One of the best ways to onboard new people and explain what from one stage becomes the input for the next one. Furthermore, each workflow is doing is by plotting the pipeline (note that we’re we can no longer press “Run all cells” in Jupyter to test our now using ploomber, which is the framework for developing analysis from start to finish. pipelines): Ploomber provides all the necessary tools to build multi- ploomber plot stage, reproducible pipelines in Jupyter that feel like a single This command will generate the plot below for us, which will notebook. Users can easily break down their analysis into multiple allow us to stay up to date with changes that are happening in our notebooks and execute them all with a single command. pipeline and get the current status of tasks that were executed or failed to execute. 2. Refactoring a legacy notebook Soorgeon correctly identified the stages in our If you already have a python project in a single notebook, you original nb.ipynb notebook. It even detected that can use our tool Soorgeon to automatically refactor it into a the last two tasks (linear-regression, and Ploomber pipeline. Soorgeon statically analyzes your code, cleans random-forest-regressor) are independent of each up unnecessary imports, and makes sure your monolithic notebook other! is broken down into smaller components. It does that by scanning We can also get a summary of the pipeline with ploomber the markdown in the notebook and analyzing the headers; each status: H2 header in our example is marking a new self-contained task. cd playground ploomber status * Corresponding author: ido@ploomber.io ‡ Ploomber 3. The pipeline.yaml file Copyright © 2022 Ido Michael. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits To develop a pipeline, users create a pipeline.yaml file and unrestricted use, distribution, and reproduction in any medium, provided the declare the tasks and their outputs as follows: original author and source are credited. 122 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Here we can see the build outputs Fig. 2: In here we can see the status of each of our pipeline’s tasks, runtime and location. tasks: - source: script.py product: nb: output/executed.ipynb data: output/data.csv # more tasks here... The previous pipeline has a single task (script.py) and generates two outputs: output/executed.ipynb and output/data.csv. You may be wondering why we have a notebook as an output: Ploomber converts scripts to notebooks before execution; hence, our script is considered the source and the notebook a byproduct of the execution. Using scripts as sources (instead of notebooks) makes it simpler to use git. However, this does not mean you have to give up interactive development since Ploomber integrates with Jupyter, allowing you to edit scripts as notebooks. Fig. 4: These are the post build artifacts In this case, since we used soorgeon to refactor an existing notebook, we did not have to write the pipeline.yaml file. # Sample data quality checks after loading the raw data # Check nulls 4. Building the pipeline assert not df['HouseAge'].isnull().values.any() Let’s build the pipeline (this will take ~30 seconds): # Check a specific range - no outliers cd playground assert df['HouseAge'].between(0,100).any() ploomber build # Exact expected row count We can see which are the tasks that ran during this command, how assert len(df) == 11085 long they took to execute, and the contributions of each task to the overall pipeline execution runtime. ** We’ll do the same for tasks/linear-regression.py, open the file Navigate to playground/output/ and you’ll see all the and add the tests: outputs: the executed notebooks, data files and trained model. # Sample tests after the notebook ran # Check task test input exists ls playground/output assert Path(upstream['train-test-split']['X_test']).exists() In this figure, we can see all of the data that was collected during # Check task train input exists the pipeline, any artifacts that might be useful to the user, and some assert Path(upstream['train-test-split']['y_train']).exists() of the execution history that is saved on the notebook’s context. # Validating output type assert 'pkl' in upstream['train-test-split']['X_test'] 5. Testing and quality checks Adding these snippets will allow us to validate that the data we’re ** Open tasks/train-test-split.py as a notebook by right-clicking looking for exists and has the quality we expect. For instance, in on it and then Open With -> Notebook and add the following the first test we’re checking there are no missing rows, and that code after the cell with # noqa: the data sample we have are for houses up to 100 years old. KEEPING YOUR JUPYTER NOTEBOOK CODE QUALITY BAR HIGH (AND PRODUCTION READY) WITH PLOOMBER 123 Fig. 6: lab-open-with-notebook Fig. 5: Now we see an independent new task In the second snippet, we’re checking that there are train and test inputs which are crucial for training the model. 6. Maintaining the pipeline Let’s look again at our pipeline plot: Fig. 7: The new task is attached to the pipeline Image('playground/pipeline.png') The arrows in the diagram represent input/output dependencies At the top of the notebook, you’ll see the following: and depict the execution order. For example, the first task (load) upstream = None loads some data, then clean uses such data as input and processes it, then train-test-split splits our dataset into This special variable indicates which tasks should execute before training and test sets. Finally, we use those datasets to train a the notebook we’re currently working on. In this case, we want to linear regression and a random forest regressor. get training data so we can train our new model so we change the Soorgeon extracted and declared this dependencies for us, but upstream variable: if we want to modify the existing pipeline, we need to declare upstream = ['train-test-split'] such dependencies. Let’s see how. We can also see that the pipeline is green, meaning all of the Let’s generate the plot again: tasks in it have been executed recently. cd playground ploomber plot 7. Adding a new task Ploomber now recognizes our dependency declaration! Let’s say we want to train another model and decide to try Gradient Open Boosting Regressor. First, we modify the pipeline.yaml file playground/tasks/gradient-boosting-regressor.py and add a new task: as a notebook by right-clicking on it and then Open With -> Open playground/pipeline.yaml and add the follow- Notebook and add the following code: ing lines at the end from pathlib import Path - source: tasks/gradient-boosting-regressor.py import pickle product: nb: output/gradient-boosting-regressor.ipynb import seaborn as sns Now, let’s create a base file by executing ploomber from sklearn.ensemble import GradientBoostingRegressor scaffold: y_train = pickle.loads(Path( cd playground upstream['train-test-split']['y_train']).read_bytes()) ploomber scaffold y_test = pickle.loads(Path( upstream['train-test-split']['y_test']).read_bytes()) This is the output of the command: ` X_test = pickle.loads(Path( Found spec at 'pipeline.yaml' Adding upstream['train-test-split']['X_test']).read_bytes()) /Users/ido/ploomber-workshop/playground/ X_train = pickle.loads(Path( upstream['train-test-split']['X_train']).read_bytes()) tasks/ gradient-boosting-regressor.py... Created 1 new task sources. ` gbr = GradientBoostingRegressor() We can see it created the task sources for our new task, we just gbr.fit(X_train, y_train) have to fill those in right now. y_pred = gbr.predict(X_test) Let’s see how the plot looks now: sns.scatterplot(x=y_test, y=y_pred) cd playground ploomber plot You can see that Ploomber recognizes the new file, but it does not 8. Incremental builds have any dependency, so let’s tell Ploomber that it should execute Data workflows require a lot of iteration. For example, you may after train-test-split: want to generate a new feature or model. However, it’s wasteful Open to re-execute every task with every minor change. Therefore, playground/tasks/gradient-boosting-regressor.py one of Ploomber’s core features is incremental builds, which automatically skip tasks whose source code hasn’t changed. as a notebook by right-clicking on it and then Open With -> Run the pipeline again: Notebook: 124 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 11. Resources Thanks for taking the time to go through this tutorial! We hope you consider using Ploomber for your next project. If you have any questions or need help, please reach out to us! (contact info below). Here are a few resources to dig deeper: • GitHub • Documentation • Code examples Fig. 8: We can see this pipeline has multiple new tasks. • JupyterCon 2020 talk • Argo Community Meeting talk • Pangeo Showcase talk (AWS Batch demo) cd playground • Jupyter project ploomber build You can see that only the gradient-boosting-regressor 10. Contact task ran! Incremental builds allow us to iterate faster without keeping • Twitter track of task changes. • Join us on Slack Check out playground/output/ • E-mail us gradient-boosting-regressor.ipynb, which contains the output notebooks with the model evaluation plot. 9. Parallel execution and Ploomber cloud execution This section can run locally or on the cloud. To setup the cloud we’ll need to register for an api key Ploomber cloud allows you to scale your experiments into the cloud without provisioning machines and without dealing with infrastrucutres. Open playground/pipeline.yaml and add the following code instead of the source task: - source: tasks/random-forest-regressor.py This is how your task should look like in the end - source: tasks/random-forest-regressor.py name: random-forest- product: nb: output/random-forest-regressor.ipynb grid: # creates 4 tasks (2 * 2) n_estimators: [5, 10] criterion: [gini, entropy] In addition, we’ll need to add a flag to tell the pipeline to execute in parallel. Open playground/pipeline.yaml and add the following code above the -tasks section (line 1): yaml # Execute independent tasks in parallel executor: parallel ploomber plot ploomber build 10. Execution in the cloud When working with datasets that fit in memory, running your pipeline is simple enough, but sometimes you may need more computing power for your analysis. Ploomber makes it simple to execute your code in a distributed environment without code changes. Check out Soopervisor, the package that implements exporting Ploomber projects in the cloud with support for: • Kubernetes (Argo Workflows) • AWS Batch • Airflow PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 125 Likeness: a toolkit for connecting the social fabric of place to human dynamics Joseph V. Tuccillo‡∗ , James D. Gaboardi‡ F Abstract—The ability to produce richly-attributed synthetic populations is Modeling these processes at scale and with respect to indi- key for understanding human dynamics, responding to emergencies, and vidual privacy is most commonly achieved through agent-based preparing for future events, all while protecting individual privacy. The Like- simulations on synthetic populations [SEM14]. Synthetic popula- ness toolkit accomplishes these goals with a suite of Python packages: tions consist of individual agents that, when viewed in aggregate, pymedm/pymedm_legacy, livelike, and actlike. This production closely recreate the makeup of an area’s observed population process is initialized in pymedm (or pymedm_legacy) that utilizes census microdata records as the foundation on which disaggregated spatial allocation [HHSB12], [TMKD17]. Modeling human dynamics with syn- matrices are built. The next step, performed by livelike, is the generation of thetic populations is common across research areas including spa- a fully autonomous agent population attributed with hundreds of demographic tial epidemiology [DKA+ 08], [BBE+ 08], [HNB+ 11], [NCA13], census variables. The agent population synthesized in livelike is then [RSF+ 21], [SNGJ+ 09], public health [BCD+ 06], [BFH+ 17], attributed with residential coordinates in actlike based on block assignment [SPH11], [TCR08], [MCB+ 08], and transportation [BBM96], and, finally, allocated to an optimal daytime activity location via the street [ZFJ14]. However, a persistent limitation across these applications network. We present a case study in Knox County, Tennessee, synthesizing 30 is that synthetic populations often do not capture a wide enough populations of public K–12 school students & teachers and allocating them to range of individual characteristics to assess how human dynamics schools. Validation of our results shows they are highly promising by replicating are linked to human security problems (e.g., how a person’s age, reported school enrollment and teacher capacity with a high degree of fidelity. limited transportation access, and linguistic isolation may interact Index Terms—activity spaces, agent-based modeling, human dynamics, popu- with their housing situation in a flood evacuation emergency). lation synthesis In this paper, we introduce Likeness [TG22], a Python toolkit for connecting the social fabric of place to human dynamics via Introduction models that support increased spatial, temporal, and demographic Human security fundamentally involves the functional capacity fidelity. Likeness is an extension of the UrbanPop framework de- that individuals possess to withstand adverse circumstances, me- veloped at Oak Ridge National Laboratory (ORNL) that embraces diated by the social and physical environments in which they live a new paradigm of "vivid" synthetic populations [TM21], [Tuc21], [Hew97]. Attention to human dynamics is a key piece of the in which individual agents may be attributed in potentially hun- human security puzzle, as it reveals spatial policy interventions dreds of ways, across subjects spanning demographics, socioe- most appropriate to the ways in which people within a community conomic status, housing, and health. Vivid synthetic populations behave and interact in daily life. For example, "one size fits all" benefit human dynamics research both by enabling more precise solutions do not exist for mitigating disease spread, promoting geolocation of population segments, as well as providing a deeper physical activity, or enabling access to healthy food sources. understanding of how individual and neighborhood characteris- Rather, understanding these outcomes requires examination of tics are coupled. UrbanPop’s early development was motivated processes like residential sorting, mobility, and social transmis- by linking models of residential sorting and worker commute sion. behaviors [MNP+ 17], [MPN+ 17], [ANM+ 18]. Likeness expands upon the UrbanPop approach by providing a novel integrated * Corresponding author: tuccillojv@ornl.gov ‡ Oak Ridge National Laboratory model that pairs vivid residential synthetic populations with an activity simulation model on real-world transportation networks, Copyright © 2022 Oak Ridge National Laboratory. This is an open-access with travel destinations based on points of interest (POIs) curated article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any from location services and federal critical facilities data. medium, provided the original author and source are credited. Notice: This manuscript has been authored by UT-Battelle, LLC under We first provide an overview of Likeness’ capabilities, then Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. provide a more detailed walkthrough of its central workflow with The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government respect to livelike, a package for population synthesis and retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or residential characterization, and actlike a package for activity reproduce the published form of this manuscript, or allow others to do so, for allocation. We provide preliminary usage examples for Likeness United States Government purposes. The Department of Energy will provide based on 1) social contact networks in POIs 2) 24-hour POI public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public- occupancy characteristics. Finally, we discuss existing limitations access-plan). and the outlook for future development. 126 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Overview of Core Capabilities and Workflow the ACS Public-Use Microdata Sample (PUMS) at the scale UrbanPop initially combined the vivid synthetic populations pro- of census block groups (typically 300–6000 people) or tracts duced from the American Community Survey (ACS) using the (1200–8000 people), depending upon the use-case. Penalized-Maximum Entropy Dasymetric Modeling (P-MEDM) Downscaling the PUMS from the Public-Use Microdata Area method, which is detailed later, with a commute model based on (PUMA) level at which it is offered (100,000 or more people) to origin-destination flows, to generate a detailed dataset of daytime these neighborhood scales then enables us to produce synthetic and nighttime synthetic populations across the United States populations (the livelike package) and simulate their travel [MPN+ 17]. Our development of Likeness is motivated by extend- to POIs (the actlike package) in an integrated model. This ap- ing the existing capabilities of UrbanPop to routing libraries avail- proach provides a new means of modeling population mobility and able in Python like osmnx1 and pandana2 [Boe17], [FW12]. activity spaces with respect to real-world transportation networks In doing so, we are able to simulate travel to regular daytime and POIs, in turn enabling investigation of social processes from activities (work and school) based on real-world transportation the atomic (e.g., person) level in human systems. networks. Likeness continues to use the P-MEDM approach, but Likeness offers two implementations of P-MEDM. The first, is fully integrated with the U.S. Census Bureau’s ACS Summary the pymedm package, is written natively in Python based on File (SF) and Census Microdata APIs, enabling the production of scipy.optimize.minimize, and while fully operational re- activity models on-the-fly. mains in development and is currently suitable for one-off simu- Likeness features three core capabilities supporting activ- lations. The second, the pmedm_legacy package, uses rpy2 as ity simulation with vivid synthetic populations (Figure 1). a bridge to [NBLS14]’s original implementation of P-MEDM3 in The first, spatial allocation, is provided by the pymedm and R/C++ and is currently more stable and scalable. We offer conda pmedm_legacy packages and uses Iterative Proportional Fitting environments specific to each package, based on user preferences. (IPF) to downscale census microdata records to small neighbor- Each package’s functionality centers around a PMEDM class, hood areas, providing a basis for population synthesis. Baseline which contains information required to solve the P-MEDM prob- residential synthetic populations are then created and stratified into lem: agent segments (e.g., grade 10 students, hospitality workers) using • The individual (household) level constraints based on ACS the livelike package. Finally, the actlike package models PUMS. To preserve households from the PUMS in the syn- travel across agent segments of interest to POIs outside places of thetic population, the person-level constraints describing residence at varying times of day. household members are aggregated to the household level and merged with household-level constraints. Spatial Allocation: the pymedm & pmedm_legacy packages • PUMS household sample weights. Synthetic populations are typically generated from census micro- • The target (e.g., block group) and aggregate (e.g., tract) data, which consists of a sample of publicly available longform zone constraints based on population-level estimates avail- responses to official statistical surveys. To preserve respondent able in the ACS SF. confidentiality, census microdata is often published at spatial • The target/aggregate zone 90% margins of error and asso- scales the size of a city or larger. Spatial allocation with IPF ciated standard errors (SE = 1.645 × MOE). provides a maximum-likelihood estimator for microdata responses The PMEDM classes feature a solve() method that returns in small (e.g., neighborhood) areas based on aggregate data an optimized P-MEDM solution and allocation matrix. Through published about those areas (known as "constraints"), resulting a diagnostics module, users may then evaluate a P-MEDM in a baseline for population synthesis [WCC+ 09], [BBM96], solution based on the proportion of published 90% MOEs from [TMKD17]. UrbanPop is built upon a regularized implementation the summary-level ACS data preserved at the target (allocation) of IPF, the P-MEDM method, that permits many more input census scale. variables than traditional approaches [LNB13], [NBLS14]. The P- MEDM objective function (Eq. 1) is written as: Population Synthesis: the livelike package n wit wit e2 The livelike package generates baseline residential synthetic max − ∑ log − ∑ k2 (1) it N dit dit k 2σk populations and performs agent segmentation for activity simula- tion. where wit is the estimate of variable i in zone t, dit is the synthetic estimate of variable i in location t, n is the number of microdata Specifying and Solving Spatial Allocation Problems responses, and N is the total population size. Uncertainty in The livelike workflow is oriented around a user-specified variable estimates is handled by adding an error term to the e2 constraints file containing all of the information necessary to allocation ∑k 2σk2 , where ek is the error between the synthetic specify a P-MEDM problem for a PUMA of interest. "Constraints" k and published estimate of ACS variable k and σk is the ACS are variables from the ACS common among people/households standard error for the estimate of variable k. This is accomplished (PUMS) and populations (SF) that are used as both model inputs by leveraging the uncertainty in the input variables: the "tighter" and descriptors. The constraints file includes information for the margins of error on the estimate of variable k in place t, the bridging PUMS variable definitions with those from the SF using more leverage it holds upon the solution [NBLS14]. helper functions provided by the livelike.pums module, The P-MEDM procedure outputs an allocation matrix that including table IDs, sampling universe (person/household), and estimates the probability of individuals matching responses from tags for the range of ACS vintages (years) for which the variables are relevant. 1. https://github.com/gboeing/osmnx 2. https://github.com/UDST/pandana 3. https://bitbucket.org/nnnagle/pmedmrcpp LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 127 Fig. 1: Core capabilities and workflow of Likeness. The primary livelike class is the acs.puma, which stores implementation of [LB13]’s "Truncate, Replicate, Sample" (TRS) information about a single PUMA necessary for spatial allocation method. TRS works by separating each cell of the allocation of the PUMS data to block groups/tracts with P-MEDM. The matrix into whole-number (integer) and fractional components, process of creating an acs.puma is integrated with the U.S. then incrementing the whole-number estimates by a random Census Bureau’s ACS SF and Census Microdata 5-Year Estimates sample of unit weights performed with sampling probabilities (5YE) APIs4 . This enables generation of an acs.puma class based on the fractional component. Because TRS is stochastic, with a high-level call involving just a few parameters: 1) the the homesim.hsim() function generates multiple (default 30) PUMA’s Federal Information Processing Standard (FIPS) code 2) realizations of the residential population. The results are provided the constraints file, loaded as a pandas.DataFrame and 3) the as a pandas.DataFrame in long format, attributed by: target ACS vintage (year). An example call to build an acs.puma • PUMS Household ID (h_id) for the Knoxville City, TN PUMA (FIPS 4701603) using the ACS • Simulation number (sim) 2015–2019 5-Year Estimates is: • Target zone FIPS code (geoid) acs.puma( fips="4701603", • Household count (count) constraints=constraints, year=2019 Since household and person-level attributes are combined ) when creating the acs.puma class, person-level records from the PUMS are assumed to be joined to the synthesized household The censusdata package5 is used internally to IDs many-to-one. For example, if two people, A01 and A03, in fetch population-level (SF) constraints, standard errors, household A have some attribute of interest, and there are 3 and MOEs from the ACS 5YE API, while the households of type A in zone G, then we estimate that a total acs.extract_pums_constraints function is used to of 6 people with that attribute from household A reside in zone G. fetch individual-level constraints and weights from the Census Microdata 5YE API. Agent Generation Spatial allocation is then carried out by passing the acs.puma attributes to a pymedm.PMEDM or The synthetic populations can then be segmented into different pmedm_legacy.PMEDM (depending on user preference). groups of agents (e.g., workers by industry, students by grade) for activity modeling with the actlike package. Agent segments Population Synthesis may be identified in several ways: The homesim module provides support for population synthe- • Using acs.extract_pums_segment_ids() to sis on the spatial allocation matrix within a solved P-MEDM fetch the person IDs (household serial number + person object. The population synthesis procedure involves converting line number) from the Census Microdata API matching the fractional estimates from the allocation matrix (n household some criteria of interest (e.g., public school students in IDs by m zones) to integer representation such that whole peo- 10th grade). ple/households are preserved. This homesim module features an • Using acs.extract_pums_descriptors() to 4. https://www.census.gov/data/developers/data-sets.html fetch criteria that may be queried from the Census 5. https://pypi.org/project/CensusData Microdata API. This is useful when dealing with criteria 128 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) more specific than can be directly controlled for in the in time and are placed with a greater frequency proportional P-MEDM problem (e.g., detailed NAICS code of worker, to reported household density [LB13]. We employ population exact number of hours worked). and housing counts within 2010 Decennial Census blocks to formulate a modified Variable Size Bin Packing Problem [FL86], The function est.tabulate_by_serial() is then used [CGSdG08] for each populated block group, which allows for to tabulate agents by target zone and simulation by appending an optimal placement of household points and is accomplished them to the synthetic population based on household ID, then by the actlike.block_denisty_allocation() aggregating the person-level counts. This routine is flexible in that function that creates and solves an a user can use any set of criteria available from the PUMS to actlike.block_allocation.BinPack instance. define customized agents for mobility modeling purposes. Other Capabilities Activity Allocation Population Statistics: In addition to agent creation, the Once household location attribution is complete, individual agents livelike.est module also supports the creation of popula- must be allocated from households (nighttime locations) to prob- tion statistics. This can be used to estimate the compositional able activity spaces (daytime locations). This is achieved through characteristics of small neighborhood areas and POIs, for ex- spatial network modeling over the streets within a study area via ample to simulate social contact networks (see Students). To OpenStreetMap6 utilizing osmnx for network extraction & pre- accomplish this, the results of est.tabulate_by_serial processing and pandana for shortest path and route calculations. (see Agent Generation) are converted to proportional esti- The underlying impedance metric for shortest path calculation, mates to facilitate POIs (est.to_prop()), then averaged handled in actlike.calc_cost_mtx() and associated in- across simulations to produce Monte Carlo estimates and errors ternal functions, can either take the form of distance or travel time. est.monte_carlo_estimate()). Moreover, household and activity locations must be connected to Multiple ACS Vintages and PUMAs: The multi nearby network edges for realistic representations within network module extends the capabilities of livelike to space [GFH20]. multiple ACS 5YE vintages (dating back to 2016), as With a cost matrix from all residences to daytime loca- well as multiple PUMAs (e.g., a metropolitan area) via tions calculated, the simulated population can then be "sent" the multi module. Using multi.make_pumas() to the likely activity spaces by utilizing an instance of or multi.make_multiyear_pumas(), multiple actlike.ActivityAllocation to generate an adapted PUMAs/multiple years may be stored in a dict Transportation Problem. This mixed integer program, solved using that enables iterative runs for spatial allocation the solve() method, optimally associates all population within (multi.make_pmedm_problems()), population an activity space with the objective of minimizing the total cost of synthesis (multi.homesim()), and agent cre- impedance (Eq. 2), being subject to potentially relaxed minimum ation (multi.extract_pums_segment_ids(), and maximum capacity constraints (Eq. 4 & 5). Each decision multi.extract_pums_segment_ids_multiyear(), variable (xi j ) represents a potential allocation from origin i to multi.extract_pums_descriptors(), and destination j that must be an integer greater than or equal to zero multi.extract_pums_descriptors_multiyear()). (Eq. 6 & 7). The problem is formulated as follows: This functionality is currently available for pmedm_legacy only. min ∑ ∑ ci j xi j (2) i∈I j∈J Activity Allocation: the actlike package s.t. ∑ xi j = Oi ∀i ∈ I; (3) The actlike package [GT22] allocates agents from synthetic j∈J populations generated by livelike POI, like schools and work- places, based on optimal allocation about transportation networks s.t. ∑ xi j ≥ minD j ∀ j ∈ J; (4) i∈I derived from osmnx and pandana [Boe17], [FW12]. Solutions are the product of a modified integer program (Transportation s.t. ∑ xi j ≤ maxD j ∀ j ∈ J; (5) Problem [Hit41], [Koo49], [MS01], [MS15]) modeled in pulp i∈I or mip [MOD11], [ST20], whereby supply (students/workers) s.t. xi j ≥ 0 ∀i ∈ I ∀ j ∈ J; (6) are "shipped" to demand locations (schools/workplaces), with potentially relaxed minimum and maximum capacity constraints at s.t. xi j ∈ Z ∀i ∈ I ∀ j ∈ J. (7) demand locations. Impedance from nighttime to daytime locations (Origin-Destination [OD] pairs) can be modeled by either network where distance or network travel time. i ∈ I = each household in the set of origins j ∈ J = each school in the set of destinations Location Synthesis xi j = allocation decision from i ∈ I to j ∈ J Following the generation of synthetic households for the study ci j = cost between all i, j pairs universe, locations for all households across the 30 default simulations must be created. In order to intelligently site pseudo- Oi = population in origin i for i ∈ I neighborhood clusters of random points, we adopt a dasymetric minD j = minimum capacity j for j ∈ J [QC13] approach, which we term intelligent block-based (IBB) maxD j = maximum capacity j for j ∈ J allocation, whereby household locations are only placed within blocks known to have been populated at a particular period 6. https://www.openstreetmap.org/about LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 129 The key to this adapted formulation of the classic Trans- Because school attendance in Knox County is restricted by portation Problem is the utilization of minimum and maxi- district boundaries, we only placed student households in mum capacity thresholds that are generated endogenously within the PUMAs intersecting with the district (FIPS 4701601, actlike.ActivityAllocation and are tuned to reflect 4701602, 4701603, 4701604). However, because educators the uncertainty of both the population estimates generated by may live outside school district boundaries, we simulated livelike and the reported (or predicted) capacities at activity their household locations throughout the Knoxville CBSA. locations. Moreover, network impedance from origins to destina- • Used actlike to perform optimal allocation of tions (ci j ) can be randomly reduced through an internal process workers and students about road networks in Knox by passing in an integer value to the reduce_seed keyword ar- County/Knoxville CBSA. Across the 30 simulations and gument. By triggering this functionality, the count and magnitude 14 segments identified, we produced a total of 420 travel of reduction is determined algorithmically. A random reduction simulations. Network impedance was measured in geo- of this nature is beneficial in generating dispersed solutions that graphic distance for all student simulations and travel time do not resemble compact clusters, with an example being the for all educator simulations. replication of a private school’s student body that does not adhere Figure 2 demonstrates the optimal allocations, routing, and to public school attendance zones. network space for a single simulation of 10th grade public school After the optimal solution is found for an students in Knox County, TN. Students, shown in households actlike.ActivityAllocation instance, selected as small black dots, are associated with schools, represented by decisions are isolated from non-zero decision variables transparent colored circles sized according to reported enrollment. with the realized_allocations() method. These The network space connecting student residential locations to allocations are then used to generate solution routes with the assigned schools is displayed in a matching color. Further, the network_routes() function that represent the shortest path inset in Figure 2 provides the pseudo-school attendance zone for along the network traversed from residential locations to assigned 10th graders at one school in central Knoxville and demonstrates activity spaces. Solutions can be further validated with Canonical the adherence to network space. Correlation Analysis, in instances where the agent segments are stratified, and simple linear regression for those where a single Students segment of agents is used. Validation is discussed further in Validation & Diagnostics. Our study of K–12 students examines social contact networks with respect to potentially underserved student populations via the compositional characteristics of POIs (schools). Case Study: K–12 Public Schools in Knox County, TN We characterized each school’s student body by identifying To illustrate Likeness’ capability to simulate POI travel among student profiles based on several criteria: minority race/ethnicity, specific population segments, we provide a case study of travel to poverty status, single caregiver households, and unemployed care- POIs, in this case K–12 schools, in Knox County, TN. Our choice giver households (householder and/or spouse/parnter). We defined of K–12 schools was motivated by several factors. First, they serve 6 student profiles using an implementation of the density-based as common destinations for the two major groups—workers and K-Modes clustering algorithm [CLB09] with a distance heuris- students—expected to consistently travel on a typical business tic designed to optimize cluster separation [NLHH07] available day [RWM+ 17]. Second, a complete inventory of public school through the kmodes package9 [dV21]. Student profile labels were locations, as well as faculty and enrollment sizes, is available appended to the student travel simulation results, then used to publicly through federal open data sources. In this case, we produce Monte Carlo proportional estimates of profiles by school. obtained school locations and faculty sizes from the Homeland The results in Figure 3 reveal strong dissimilarities in student Infrastructure Foundation-Level Database (HIFLD)7 and student makeup between schools on the periphery of Knox County and enrollment sizes by grade from the National Center for Education those nearer to Knoxville’s downtown core in the center of the Statistics (NCES) Common Core of Data8 . county. We estimate that the former are largely composed of We chose the Knox County School District, which coincides students in married families, above poverty, and with employed with Knox County’s boundaries, as our study area. We used the caregivers, whereas the latter are characterized more strongly by livelike package to create 30 synthetic populations for the single caregiver living arrangements and, particularly in areas Knoxville Core-Based Statistical Area (CBSA), then for each north of the downtown core, economic distress (pop-out map). simulation we: • Isolated agent segments from the synthetic population. Workers (Educators) K–12 educators consist of full-time workers employed as primary and secondary education teachers (2018 Standard We evaluated the results of our K–12 educator simulations with Occupation Classification System codes 2300–2320) in respect to POI occupancy characteristics, as informed by commute elementary and secondary schools (NAICS 6111). We and work statistics obtained from the PUMS. Specifically, we used separated out student agents by public schools and by work arrival times associated with each synthetic worker (PUMS grade level (Kindergarten through Grade 12). JWAP) to timestamp the start of each work day, and incremented • Performed IBB allocation to simulate the household loca- this by daily hours worked (derived from PUMS W KHP) to create tions of workers and students. Our selection of household a second timestamp for work departure. The estimated departure locations for workers and students varied geographically. time assumes that each educator travels to the school for a typical 5-day workweek, and is estimated as JWAP + W KHP 5 . 7. https://hifld-geoplatform.opendata.arcgis.com 8. https://nces.ed.gov/ccd/files.asp 9. https://pypi.org/project/kmodes 130 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Optimal allocations for one simulation of 10th grade public school students in Knox County, TN. Fig. 3: Compositional characteristics of K–12 public schools in Knox County, TN based on 6 student profiles. Glyph plot methodolgy adapted from [GLC+ 15]. LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 131 Fig. 4: Hourly worker occupancy estimates for K–12 schools in Knox County, TN. Roughly 50 educator agents per simulation were not attributed Validation & Diagnostics with work arrival times, possibly due to the source PUMS re- A determination of modeling output robustness was needed to spondents being away from their typical workplaces (e.g., on validate our results. Specifically, we aimed to ensure the preser- summer or winter break) but still working virtually when they vation of relative facility size and composition. To perform this were surveyed. We filled in these unkown arrival times with the validation, we tested the optimal allocations of those generated by modal arrival time observed across all simulations (7:25 AM). Likeness against the maximally adjusted reported enrollment & faculty employment counts. We used the maximum adjusted value to account for scenarios where the population synthesis phase Figure 4 displays the hourly proportion of educators present resulted in a total demographic segment greater than reported total at each school in Knox County between 7:00 AM (t700) and facility capacity. We employed Canonical Correlation Analysis 6:00 PM (t1800). Morning worker arrivals occur more rapidly (CCA) [Kna78] for the K–12 public school student allocations than afternoon departures. Between the hours of 7:00 AM and due to their stratified nature, and an ordinary least squares (OLS) 9:00 AM (t700–t900), schools transition from nearly empty simple linear regression for the educator allocations [PVG+ 11]. of workers to being close to capacity. In the afternoon, workers Because CCA is a multivariate measure, it is only a suitable begin to gradually depart at 3:00 PM (t1500) with somewhere diagnostic for activity allocation when multiple segments (e.g., between 50%–70% of workers still present by 4:00 PM (t1600), students by grade) are of interest. For educators, which we then workers begin to depart in earnest at 5:00 PM into 6:00 PM treated as a single agent segment without stratification, we used (t1700–t1800), by which most have returned home. OLS regression instead. The CCA for students was performed in two components: Between-Destination, which measures capacity across facilities, and Within-Destination, which measures capacity Geographic differences are also visible and may be a function across strata. of (1) a higher concentration of a particular school type (e.g., Descriptive Monte Carlo statistics from the 30 simulations elementary, middle, high) in this area and (2) staggered starts were run on the resultant coefficients of determination (R2 ), between these types (to accommodate bus schedules, etc.). This which show a goodness of fit (approaching 1). As seen in Table could be due in part to concentrations of different school schedules 1, all models performed exceedingly well, though the Within- by grade level, especially elementary schools starting much earlier Destination CCA performed slightly less well than both the than middle and high schools10 . For example, schools near the Between-Destination CCA and the OLS linear regression. In fact, center of Knox County reach worker capacity more quickly in the the global minimum of all R2 scores approaches 0.99 (students morning, starting around 8:00 AM (t800), but also empty out – Within-Destination), which demonstrates robust preservation of more rapidly than schools in surrounding areas beginning around 4:00 PM (t1600). 10. https://www.knoxschools.org/Page/5553 132 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) K–12 R2 Type Min Median Mean Max Between-Destination CCA 0.9967 0.9974 0.9973 0.9976 Students (public schools) Within-Destination CCA 0.9883 0.9894 0.9896 0.9910 Educators (public & private schools) OLS Linear Regression 0.9977 0.9983 0.9983 0.9991 TABLE 1: Validating optimal allocations considering reported enrollment at public schools & faculty employment at all schools. true capacities in our synthetic activity modeling. Furthermore, agent characterization and travel along real-world transportation a global maximum of greater than 0.999 is seen for educators, networks to POIs. These capabilities benefit planners and urban which indicates a near perfect replication of relative faculty sizes researchers by providing a richer understanding of how spatial by school. policy interventions can be designed with respect to how people live, move, and interact. Likeness strives to be flexible toward a Discussion variety of research applications linked to human security, among Our Case Study demonstrates the twofold benefits of modeling them spatial epidemiology, transportation equity, and environmen- human dynamics with vivid synthetic populations. Using Like- tal hazards. ness, we are able to both produce a more reasoned estimate of the Several ongoing developments will further Likeness’ capa- neighborhoods in which people reside and interact than existing bilities. First, we plan to expand our support for POIs curated synthetic population frameworks, as well as support more nuanced by location services (e.g., Google, Facebook, Here, TomTom, characterization of human activities at specific POIs (e.g., social FourSquare) by the ORNL PlanetSense project [TBP+ 15] by contact networks, occupancy). incorporating factors like facility size, hours of operation, and pop- The examples provided in the Case Study show how this ularity curves to refine the destination capacity estimates required refined understanding of human dynamics can benefit planning to perform actlike simulations. Second, along with multi- applications. For example, in the event of a localized emergency, modal travel, we plan to incorporate multiple trip models based the results of Students could be used to examine schools for on large-scale human activity datasets like the American Time Use which rendezvous with caregivers might pose an added challenge Survey11 and National Household Travel Survey12 . Together, these towards students (e.g., more students from single caregiver vs. improvements will extend our travel simulations to "non-obligate" married family households). Additionally, the POI occupancy population segments traveling to civic, social, and recreational dynamics demonstrated in Workers (Educators) could be used activities [BMWR22]. Third, the current procedure for spatial to assess the times at which worker commutes to/from places allocation uses block groups as the target scale for population of employment might be most sensitive to a nearby disruption. synthesis. However, there are a limited number of constraining Another application in the public health sphere might be to use variables available at the block group level. To include a larger occupancy estimates to anticipate the best time of day to reach volume of constraints (e.g., vehicle access, language), we are workers, during a vaccination campaign, for example. exploring an additional tract-level approach. P-MEDM in this Our case study had several limitations that we plan to over- case is run on cross-covariances between tracts and "supertract" come in future work. First, we assumed that all travel within our aggregations created with the Max-p-regions problem [DAR12], study area occurs along road networks. While road-based travel [WRK21] implemented in PySAL’s spopt [RA07], [FGK+ 21], is the dominant means of travel in the Knoxville CBSA, this [RAA+ 21], [FBG+ 22]. assumption is not transferable to other urban areas within the As a final note, the Likeness toolkit is being developed on top United States. Our eventual goal is to build in additional modes of of key open source dependencies in the Scientific Python ecosys- travel like public transit, walk/bike, and ferries by expanding our tem, the core of which are, of course, numpy [HMvdW+ 20] ingest of OpenStreetMap features. and scipy [VGO+ 20]. Although an exhaustive list would be Second, we do not yet offer direct support for non-traditional prohibitive, major packages not previously mentioned include schools (e.g., populations with special needs, families on military geopandas [JdBF+ 21], matplotlib [Hun07], networkx bases). For example, the Tennessee School for the Deaf falls [HSS08], pandas [pdt20], [WM10], and shapely [G+ ]. Our within our study area, and its compositional estimate could be goal is contribute to the community with releases of the packages refined if we reapportioned students more likely in attendance to comprising Likeness, but since this is an emerging project its that location. development to date has been limited to researchers at ORNL. Third, we did not account for teachers in virtual schools, However, we plan to provide a fully open-sourced code base which may form a portion of the missing work arrival times within the coming year through GitHub13 . discussed in Workers (Educators). Work-from-home populations Acknowledgements can be better incorporated into our travel simulations by apply- ing work schedules from time-use surveys to probabilistically This material is based upon the work supported by the U.S. assign in-person or remote status based on occupation. We are Department of Energy under contract no. DE-AC05-00OR22725. particularly interested in using this technique with Likeness to better understand changing patterns of life during the COVID-19 R EFERENCES pandemic in 2020. [ANM+ 18] H.M. Abdul Aziz, Nicholas N. Nagle, April M. Morton, Michael R. Hilliard, Devin A. White, and Robert N. Stew- Conclusion 11. https://www.bls.gov/tus The Likeness toolkit enhances agent creation for modeling human 12. https://nhts.ornl.gov dynamics through its dual capabilities of high-fidelity ("vivid") 13. https://github.com/ORNL LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 133 art. Exploring the impact of walk–bike infrastructure, safety [GFH20] James D. Gaboardi, David C. Folch, and Mark W. Horner. perception, and built-environment on active transportation Connecting Points to Spatial Networks: Effects on Discrete mode choice: a random parameter model using New York Optimization Models. Geographical Analysis, 52(2):299–322, City commuter data. Transportation, 45(5):1207–1229, 2018. 2020. doi:10.1111/gean.12211. doi:10.1007/s11116-017-9760-8. [GLC+ 15] Isabella Gollini, Binbin Lu, Martin Charlton, Christopher [BBE+ 08] Christopher L. Barrett, Keith R. Bisset, Stephen G. Eubank, Brunsdon, and Paul Harris. GWmodel: An R package for Xizhou Feng, and Madhav V. Marathe. EpiSimdemics: an ef- exploring spatial heterogeneity using geographically weighted ficient algorithm for simulating the spread of infectious disease models. Journal of Statistical Software, 63(17):1–50, 2015. over large realistic social networks. In SC’08: Proceedings of doi:10.18637/jss.v063.i17. the 2008 ACM/IEEE Conference on Supercomputing, pages [GT22] James D. Gaboardi and Joseph V. Tuccillo. Simulating Travel 1–12. IEEE, 2008. doi:10.1109/SC.2008.5214892. to Points of Interest for Demographically-rich Synthetic Popu- [BBM96] Richard J. Beckman, Keith A. Baggerly, and Michael D. lations, February 2022. American Association of Geographers McKay. Creating synthetic baseline populations. Transporta- Annual Meeting. doi:10.5281/zenodo.6335783. tion Research Part A: Policy and Practice, 30(6):415–429, [Hew97] Kenneth Hewitt. Vulnerability Perspectives: the Human Ecol- 1996. doi:10.1016/0965-8564(96)00004-3. ogy of Endangerment. In Regions of Risk: A Geographical [BCD+ 06] Dimitris Ballas, Graham Clarke, Danny Dorling, Jan Rigby, Introduction to Disasters, chapter 6, pages 141–164. Addison and Ben Wheeler. Using geographical information systems and Wesley Longman, 1997. spatial microsimulation for the analysis of health inequalities. [HHSB12] Kirk Harland, Alison Heppenstall, Dianna Smith, and Mark H. Health Informatics Journal, 12(1):65–79, 2006. doi:10. Birkin. Creating realistic synthetic populations at varying 1177/1460458206061217. spatial scales: A comparative critique of population synthesis [BFH+ 17] Komal Basra, M. Patricia Fabian, Raymond R. Holberger, techniques. Journal of Artificial Societies and Social Simula- Robert French, and Jonathan I. Levy. Community-engaged tion, 15(1):1, 2012. doi:10.18564/jasss.1909. modeling of geographic and demographic patterns of mul- [Hit41] Frank L. Hitchcock. The Distribution of a Product from tiple public health risk factors. International Journal of Several Sources to Numerous Localities. Journal of Mathe- Environmental Research and Public Health, 14(7):730, 2017. matics and Physics, 20(1-4):224–230, 1941. doi:10.1002/ doi:10.3390/ijerph14070730. sapm1941201224. [BMWR22] Christa Brelsford, Jessica J. Moehl, Eric M. Weber, and [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Amy N. Rose. Segmented Population Models: Improving the Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric LandScan USA Non-Obligate Population Estimate (NOPE). Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, American Association of Geographers 2022 Annual Meeting, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk- 2022. wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, [Boe17] Geoff Boeing. OSMnx: New methods for acquiring, con- Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin structing, analyzing, and visualizing complex street networks. Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Computers, Environment and Urban Systems, 65:126–139, Christoph Gohlke, and Travis E. Oliphant. Array programming September 2017. doi:10.1016/j.compenvurbsys. with NumPy. Nature, 585(7825):357–362, September 2020. 2017.05.004. doi:10.1038/s41586-020-2649-2. [CGSdG08] Isabel Correia, Luís Gouveia, and Francisco Saldanha-da [HNB+ 11] Jan A.C. Hontelez, Nico Nagelkerke, Till Bärnighausen, Roel Gama. Solving the variable size bin packing problem Bakker, Frank Tanser, Marie-Louise Newell, Mark N. Lurie, with discretized formulations. Computers & Operations Re- Rob Baltussen, and Sake J. de Vlas. The potential impact of search, 35(6):2103–2113, June 2008. doi:10.1016/j. RV144-like vaccines in rural South Africa: a study using the cor.2006.10.014. STDSIM microsimulation model. Vaccine, 29(36):6100–6106, 2011. doi:10.1016/j.vaccine.2011.06.059. [CLB09] Fuyuan Cao, Jiye Liang, and Liang Bai. A new initialization method for categorical data clustering. Expert Systems with [HSS08] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Applications, 36(7):10223–10228, 2009. doi:10.1016/j. Exploring Network Structure, Dynamics, and Function using eswa.2009.01.060. NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science [DAR12] Juan C. Duque, Luc Anselin, and Sergio J. Rey. THE MAX- Conference, pages 11 – 15, Pasadena, CA USA, 2008. URL: P-REGIONS PROBLEM*. Journal of Regional Science, https://www.osti.gov/biblio/960616. 52(3):397–419, 2012. doi:10.1111/j.1467-9787. [Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Com- 2011.00743.x. puting in Science & Engineering, 9(3):90–95, 2007. doi: [DKA+ 08] M. Diaz, J.J. Kim, G. Albero, S. De Sanjose, G. Clifford, F.X. 10.1109/MCSE.2007.55. Bosch, and S.J. Goldie. Health and economic impact of HPV [JdBF+ 21] Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann, 16 and 18 vaccination and cervical cancer screening in India. James McBride, Jacob Wasserman, Adrian Garcia Badaracco, British Journal of Cancer, 99(2):230–238, 2008. doi:10. Jeffrey Gerard, Alan D. Snow, Jeff Tratner, Matthew Perry, 1038/sj.bjc.6604462. Carson Farmer, Geir Arne Hjelle, Micah Cochran, Sean [dV21] Nelis J. de Vos. kmodes categorical clustering library. https: Gillies, Lucas Culbertson, Matt Bartos, Brendan Ward, Gia- //github.com/nicodv/kmodes, 2015–2021. como Caria, Mike Taves, Nick Eubank, sangarshanan, John [FBG+ 22] Xin Feng, Germano Barcelos, James D. Gaboardi, Elijah Flavin, Matt Richards, Sergio Rey, maxalbert, Aleksey Bi- Knaap, Ran Wei, Levi J. Wolf, Qunshan Zhao, and Sergio J. logur, Christopher Ren, Dani Arribas-Bel, Daniel Mesejo- Rey. spopt: a python package for solving spatial optimization León, and Leah Wasser. geopandas/geopandas: v0.10.2, Octo- problems in PySAL. Journal of Open Source Software, ber 2021. doi:10.5281/zenodo.5573592. 7(74):3330, 2022. doi:10.21105/joss.03330. [Kna78] Thomas R. Knapp. Canonical Correlation Analysis: A general [FGK+ 21] Xin Feng, James D. Gaboardi, Elijah Knaap, Sergio J. Rey, parametric significance-testing system. Psychological Bulletin, and Ran Wei. pysal/spopt, jan 2021. URL: https://github.com/ 85(2):410–416, 1978. doi:10.1037/0033-2909.85. pysal/spopt, doi:10.5281/zenodo.4444156. 2.410. [FL86] D.K. Friesen and M.A. Langston. Variable Sized Bin Packing. [Koo49] Tjalling C. Koopmans. Optimum Utilization of the Transporta- SIAM Journal on Computing, 15(1):222–230, February 1986. tion System. Econometrica, 17:136–146, 1949. Publisher: doi:10.1137/0215016. [Wiley, Econometric Society]. doi:10.2307/1907301. [FW12] Fletcher Foti and Paul Waddell. A Generalized Com- [LB13] Robin Lovelace and Dimitris Ballas. ‘Truncate, replicate, putational Framework for Accessibility: From the Pedes- sample’: A method for creating integer weights for spa- trian to the Metropolitan Scale. In Transportation Re- tial microsimulation. Computers, Environment and Urban search Board Annual Conference, pages 1–14, 2012. Systems, 41:1–11, September 2013. doi:10.1016/j. URL: https://onlinepubs.trb.org/onlinepubs/conferences/2012/ compenvurbsys.2013.03.004. 4thITM/Papers-A/0117-000062.pdf. [LNB13] Stefan Leyk, Nicholas N. Nagle, and Barbara P. Buttenfield. [G+ ] Sean Gillies et al. Shapely: manipulation and analysis of Maximum Entropy Dasymetric Modeling for Demographic geometric objects, 2007–. URL: https://github.com/shapely/ Small Area Estimation. Geographical Analysis, 45(3):285– shapely. 306, July 2013. doi:10.1111/gean.12011. 134 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [MCB+ 08] Karyn Morrissey, Graham Clarke, Dimitris Ballas, Stephen Scan USA 2016 [Data set]. Technical report, Oak Ridge Hynes, and Cathal O’Donoghue. Examining access to GP National Laboratory, 2017. doi:10.48690/1523377. services in rural Ireland using microsimulation analysis. Area, [SEM14] Samarth Swarup, Stephen G. Eubank, and Madhav V. Marathe. 40(3):354–364, 2008. doi:10.1111/j.1475-4762. Computational epidemiology as a challenge domain for multi- 2008.00844.x. agent systems. In Proceedings of the 2014 international con- [MNP+ 17] April M. Morton, Nicholas N. Nagle, Jesse O. Piburn, ference on Autonomous agents and multi-agent systems, pages Robert N. Stewart, and Ryan McManamay. A hybrid dasy- 1173–1176, 2014. URL: https://www.ifaamas.org/AAMAS/ metric and machine learning approach to high-resolution aamas2014/proceedings/aamas/p1173.pdf. residential electricity consumption modeling. In Advances [SNGJ+ 09] Beate Sander, Azhar Nizam, Louis P. Garrison Jr., Maarten J. in Geocomputation, pages 47–58. Springer, 2017. doi: Postma, M. Elizabeth Halloran, and Ira M. Longini Jr. Eco- 10.1007/978-3-319-22786-3_5. nomic evaluation of influenza pandemic mitigation strate- [MOD11] Stuart Mitchell, Michael O’Sullivan, and Iain gies in the United States using a stochastic microsimulation Dunning. PuLP: A Linear Programming Toolkit transmission model. Value in Health, 12(2):226–233, 2009. for Python. Technical report, 2011. URL: doi:10.1111/j.1524-4733.2008.00437.x. https://www.dit.uoi.gr/e-class/modules/document/file.php/ [SPH11] Dianna M. Smith, Jamie R. Pearce, and Kirk Harland. Can 216/PAPERS/2011.%20PuLP%20-%20A%20Linear% a deterministic spatial microsimulation model provide reli- 20Programming%20Toolkit%20for%20Python.pdf. able small-area estimates of health behaviours? An example [MPN+ 17] April M. Morton, Jesse O. Piburn, Nicholas N. Nagle, H.M. of smoking prevalence in New Zealand. Health & Place, Aziz, Samantha E. Duchscherer, and Robert N. Stewart. A 17(2):618–624, 2011. doi:10.1016/j.healthplace. simulation approach for modeling high-resolution daytime 2011.01.001. commuter travel flows and distributions of worker subpopula- [ST20] Haroldo G. Santos and Túlio A.M. Toffolo. Mixed Integer Lin- tions. In GeoComputation 2017, Leeds, UK, pages 1–5, 2017. ear Programming with Python. Technical report, 2020. URL: URL: http://www.geocomputation.org/2017/papers/44.pdf. https://python-mip.readthedocs.io/_/downloads/en/latest/pdf/. [MS01] Harvey J. Miller and Shih-Lung Shaw. Geographic Informa- [TBP+ 15] Gautam S. Thakur, Budhendra L. Bhaduri, Jesse O. Piburn, tion Systems for Transportation: Principles and Applications. Kelly M. Sims, Robert N. Stewart, and Marie L. Urban. Oxford University Press, New York, 2001. PlanetSense: a real-time streaming and spatio-temporal an- [MS15] Harvey J. Miller and Shih-Lung Shaw. Geographic Informa- alytics platform for gathering geo-spatial intelligence from tion Systems for Transportation in the 21st Century. Geogra- open source data. In Proceedings of the 23rd SIGSPATIAL phy Compass, 9(4):180–189, 2015. doi:10.1111/gec3. International Conference on Advances in Geographic Informa- 12204. tion Systems, pages 1–4, 2015. doi:10.1145/2820783. [NBLS14] Nicholas N. Nagle, Barbara P. Buttenfield, Stefan Leyk, and 2820882. Seth Spielman. Dasymetric modeling and uncertainty. Annals [TCR08] Melanie N. Tomintz, Graham P. Clarke, and Janette E. Rigby. of the Association of American Geographers, 104(1):80–95, The geography of smoking in Leeds: estimating individual 2014. doi:10.1080/00045608.2013.843439. smoking rates and the implications for the location of stop [NCA13] Markku Nurhonen, Allen C. Cheng, and Kari Auranen. Pneu- smoking services. Area, 40(3):341–353, 2008. doi:10. mococcal transmission and disease in silico: a microsimu- 1111/j.1475-4762.2008.00837.x. lation model of the indirect effects of vaccination. PloS [TG22] Joseph V. Tuccillo and James D. Gaboardi. Connecting Vivid one, 8(2):e56079, 2013. doi:10.1371/journal.pone. Population Data to Human Dynamics, June 2022. Distilling 0056079. Diversity by Tapping High-Resolution Population and Survey [NLHH07] Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Data. doi:10.5281/zenodo.6607533. Zengyou He. On the impact of dissimilarity measure in [TM21] Joseph V. Tuccillo and Jessica Moehl. An Individual- k-modes clustering algorithm. IEEE Transactions on Pat- Oriented Typology of Social Areas in the United States, May tern Analysis and Machine Intelligence, 29(3):503–507, 2007. 2021. 2021 ACS Data Users Conference. doi:10.5281/ doi:10.1109/TPAMI.2007.53. zenodo.6672291. [pdt20] The pandas development team. pandas-dev/pandas: Pandas, [TMKD17] Matthias Templ, Bernhard Meindl, Alexander Kowarik, and February 2020. doi:10.5281/zenodo.3509134. Olivier Dupriez. Simulation of synthetic complex data: The [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, R package simPop. Journal of Statistical Software, 79:1–38, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, 2017. doi:10.18637/jss.v079.i10. V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, [Tuc21] Joseph V. Tuccillo. An Individual-Centered Approach for M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Geodemographic Classification. In 11th International Con- Machine Learning in Python. Journal of Machine Learning ference on Geographic Information Science 2021 Short Paper Research, 12:2825–2830, 2011. URL: https://www.jmlr.org/ Proceedings, pages 1–6, 2021. doi:10.25436/E2H59M. papers/v12/pedregosa11a.html. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt [QC13] Fang Qiu and Robert Cromley. Areal Interpolation and Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Dasymetric Modeling: Areal Interpolation and Dasymetric Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- Modeling. Geographical Analysis, 45(3):213–215, July 2013. fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- doi:10.1111/gean.12016. rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric [RA07] Sergio J. Rey and Luc Anselin. PySAL: A Python Library of Jones, Robert Kern, Eric Larson, C.J. Carey, İlhan Polat, Spatial Analytical Methods. The Review of Regional Studies, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, 37(1):5–27, 2007. URL: https://rrs.scholasticahq.com/article/ Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quin- 8285.pdf, doi:10.52324/001c.8285. tero, Charles R. Harris, Anne M. Archibald, Antônio H. [RAA+ 21] Sergio J. Rey, Luc Anselin, Pedro Amaral, Dani Arribas- Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy Bel, Renan Xavier Cortes, James David Gaboardi, Wei Kang, 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Elijah Knaap, Ziqi Li, Stefanie Lumnitz, Taylor M. Oshan, Scientific Computing in Python. Nature Methods, 17:261–272, Hu Shao, and Levi John Wolf. The PySAL Ecosystem: 2020. doi:10.1038/s41592-019-0686-2. Philosophy and Implementation. Geographical Analysis, 2021. [WCC+ 09] William D. Wheaton, James C. Cajka, Bernadette M. Chas- doi:10.1111/gean.12276. teen, Diane K. Wagener, Philip C. Cooley, Laxminarayana [RSF+ 21] Krishna P. Reddy, Fatma M. Shebl, Julia H.A. Foote, Guy Ganapathi, Douglas J. Roberts, and Justine L. Allpress. Harling, Justine A. Scott, Christopher Panella, Kieran P. Fitz- Synthesized population databases: A US geospatial database maurice, Clare Flanagan, Emily P. Hyle, Anne M. Neilan, et al. for agent-based models. Methods report (RTI Press), Cost-effectiveness of public health strategies for COVID-19 2009(10):905, 2009. doi:10.3768/rtipress.2009. epidemic control in South Africa: a microsimulation modelling mr.0010.0905. study. The Lancet Global Health, 9(2):e120–e129, 2021. [WM10] Wes McKinney. Data Structures for Statistical Computing in doi:10.1016/S2214-109X(20)30452-6. Python. In Stéfan van der Walt and Jarrod Millman, editors, [RWM+ 17] Amy N. Rose, Eric M. Weber, Jessica J. Moehl, Melanie L. Proceedings of the 9th Python in Science Conference, pages 56 Laverdiere, Hsiu-Han Yang, Matthew C. Whitehead, Kelly M. – 61, 2010. doi:10.25080/Majora-92bf1922-00a. Sims, Nathan E. Trombley, and Budhendra L. Bhaduri. Land- [WRK21] Ran Wei, Sergio J. Rey, and Elijah Knaap. Efficient re- LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 135 gionalization for spatially explicit neighborhood delineation. International Journal of Geographical Information Science, 35(1):135–151, 2021. doi:10.1080/13658816.2020. 1759806. [ZFJ14] Yi Zhu and Joseph Ferreira Jr. Synthetic population gener- ation at disaggregated spatial scales for land use and trans- portation microsimulation. Transportation Research Record, 2429(1):168–177, 2014. doi:10.3141/2429-18. 136 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) poliastro: a Python library for interactive astrodynamics Juan Luis Cano Rodríguez‡∗ , Jorge Martínez Garrido‡ https://www.youtube.com/watch?v=VCpTgU1pb5k F Abstract—Space is more popular than ever, with the growing public awareness problem. This work was generalized by Newton to give birth to of interplanetary scientific missions, as well as the increasingly large number the n-body problem, and many other mathematicians worked on of satellite companies planning to deploy satellite constellations. Python has it throughout the centuries (Daniel and Johann Bernoulli, Euler, become a fundamental technology in the astronomical sciences, and it has also Gauss). Poincaré established in the 1890s that no general closed- caught the attention of the Space Engineering community. form solution exists for the n-body problem, since the resulting One of the requirements for designing a space mission is studying the trajectories of satellites, probes, and other artificial objects, usually ignoring dynamical system is chaotic [Bat99]. Sundman proved in the non-gravitational forces or treating them as perturbations: the so-called n-body 1900s the existence of convergent solutions for a few restricted problem. However, for preliminary design studies and most practical purposes, it with n = 3. is sufficient to consider only two bodies: the object under study and its attractor. M = E − e sin E (1) Even though the two-body problem has many analytical solutions, or- In 1903 Tsiokovsky evaluated the conditions required for artificial bit propagation (the initial value problem) and targeting (the boundary value problem) remain computationally intensive because of long propagation times, objects to leave the orbit of the earth; this is considered as a foun- tight tolerances, and vast solution spaces. On the other hand, astrodynamics dational contribution to the field of astrodynamics. Tsiokovsky researchers often do not share the source code they used to run analyses and devised equation 2 which relates the increase in velocity with the simulations, which makes it challenging to try out new solutions. effective exhaust velocity of thrusted gases and the fraction of used This paper presents poliastro, an open-source Python library for interactive propellant. m0 astrodynamics that features an easy-to-use API and tools for quick visualization. ∆v = ve ln (2) poliastro implements core astrodynamics algorithms (such as the resolution mf of the Kepler and Lambert problems) and leverages numba, a Just-in-Time Further developments by Kondratyuk, Hohmann, and Oberth in compiler for scientific Python, to optimize the running time. Thanks to Astropy, the early 20th century all added to the growing field of orbital poliastro can perform seamless coordinate frame conversions and use proper mechanics, which in turn enabled the development of space flight physical units and timescales. At the moment, poliastro is the longest-lived Python library for astrodynamics, has contributors from all around the world, in the USSR and the United States in the 1950s and 1960s. and several New Space companies and people in academia use it. The two-body problem In a system of i ∈ 1, ..., n bodies subject to their mutual attraction, Index Terms—astrodynamics, orbital mechanics, orbit propagation, orbit visu- alization, two-body problem by application of Newton’s law of universal gravitation, the total force fi affecting mi due to the presence of the other n − 1 masses is given by [Bat99]: Introduction n mi m j fi = −G ∑ r 3 ij (3) History j6=i |ri j | The term "astrodynamics" was coined by the American as- where G = 6.67430 · 10−11 N m2 kg−2 is the universal gravita- tronomer Samuel Herrick, who received encouragement from tional constant, and ri j denotes the position vector from mi to m j . the space pioneer Robert H. Goddard, and refers to the branch Applying Newton’s second law of motion results in a system of n of space science dealing with the motion of artificial celestial differential equations: bodies ([Dub73], [Her71]). However, the roots of its mathematical foundations go back several centuries. d2 ri n mj 2 = −G ∑ r 3 ij (4) Kepler first introduced his laws of planetary motion in 1609 dt j6=i i j | |r and 1619 and derived his famous transcendental equation (1), By setting n = 2 in 4 and subtracting the two resulting equali- which we now see as capturing a restricted form of the two-body ties, one arrives to the fundamental equation of the two-body problem: * Corresponding author: hello@juanlu.space ‡ Unaffiliated d2 r µ =− 3r (5) dt 2 r Copyright © 2022 Juan Luis Cano Rodríguez et al. This is an open-access where µ = G(m1 + m2 ) = G(M + m). When m M (for example, article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any an artificial satellite orbiting a planet), one can consider µ = GM medium, provided the original author and source are credited. a property of the attractor. POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 137 Keplerian vs non-keplerian motion State of the art Conveniently manipulating equation 5 leads to several properties In our view, at the time of creating poliastro there were a number [Bat99] that were already published by Johannes Kepler in the of issues with existing open source astrodynamics software that 1610s, namely: posed a barrier of entry for novices and amateur practitioners. 1) The orbit always describes a conic section (an ellipse, a Most of these barriers still exist today and are described in the parabola, or an hyperbola), with the attractor at one of following paragraphs. The goals of the project can be condensed the two foci and can be written in polar coordinates like as follows: r = 1+epcos ν (Kepler’s first law). 1) Set an example on reproducibility and good coding prac- 2) The magnitude of the specific angular momentum h = tices in astrodynamics. r2 ddtθ is constant an equal to two times the areal velocity 2) Become an approachable software even for novices. (Kepler’s second law). 3) Offer a performant software that can be also used in 3) For closed (circular and elliptical) orbits, the periodq is scripting and interactive workflows. 3 related to the size of the orbit through P = 2π aµ (Kepler’s third law). The most mature software libraries for astrodynamics are arguably Orekit [noa22c], a "low level space dynamics library For many practical purposes it is usually sufficient to limit written in Java" with an open governance model, and SPICE the study to one object orbiting an attractor and ignore all other [noa22d], a toolkit developed by NASA’s Navigation and An- external forces of the system, hence restricting the study to cillary Information Facility at the Jet Propulsion Laboratory. trajectories governed by equation 5. Such trajectories are called Other similar, smaller projects that appeared later on and that "Keplerian", and several problems can be formulated for them: are still maintained to this day include PyKEP [IBD+ 20], be- • The initial-value problem, which is usually called prop- yond [noa22a], tudatpy [noa22e], sbpy [MKDVB+ 19], Skyfield agation, involves determining the position and velocity of [Rho20] (Python), CelestLab (Scilab) [noa22b], astrodynamics.jl an object after an elapse period of time given some initial (Julia) [noa] and Nyx (Rust) [noa21a]. In addition, there are conditions. some Graphical User Interface (GUI) based open source programs • Preliminary orbit determination, which involves using used for Mission Analysis and orbit visualization, such as GMAT exact or approximate methods to derive a Keplerian orbit [noa20] and gpredict [noa18], and complete web applications for from a set of observations. tracking constellations of satellites like the SatNOGS project by • The boundary-value problem, often named the Lambert the Libre Space Foundation [noa21b]. problem, which involves determining a Keplerian orbit The level of quality and maintenance of these packages is from boundary conditions, usually departure and arrival somewhat heterogeneous. Community-led projects with a strong position vectors and a time of flight. corporate backing like Orekit are in excellent health, while on the other hand smaller projects developed by volunteers (beyond, Fortunately, most of these problems boil down to finding astrodynamics.jl) or with limited institutional support (PyKEP, numerical solutions to relatively simple algebraic relations be- GMAT) suffer from lack of maintenance. Part of the problem tween time and angular variables: for elliptic motion (0 ≤ e < 1) might stem from the fact that most scientists are never taught how it is the Kepler equation, and equivalent relations exist for the to build software efficiently, let alone the skills to collaboratively other eccentricity regimes [Bat99]. Numerical solutions for these develop software in the open [WAB+ 14], and astrodynamicists are equations can be found in a number of different ways, each one no exception. with different complexity and precision tradeoffs. In the Methods On the other hand, it is often difficult to translate the advances section we list the ones implemented by poliastro. in astrodynamics research to software. Classical algorithms devel- On the other hand, there are many situations in which natural oped throughout the 20th century are described in papers that are and artificial orbital perturbations must be taken into account so sometimes difficult to find, and source code or validation data that the actual non-Keplerian motion can be properly analyzed: is almost never available. When it comes to modern research • Interplanetary travel in the proximity of other planets. On carried in the digital era, source code and validation data is a first approximation it is usually enough to study the still difficult, even though they are supposedly provided "upon trajectory in segments and focus the analysis on the closest reasonable request" [SSM18] [GBP22]. attractor, hence patching several Keplerian orbits along It is no surprise that astrodynamics software often requires the way (the so-called "patched-conic approximation") deep expertise. However, there are often implicit assumptions that [Bat99]. The boundary surface that separates one segment are not documented with an adequate level of detail which orig- from the other is called the sphere of influence. inate widespread misconceptions and lead even seasoned profes- • Use of solar sails, electric propulsion, or other means sionals to make conceptual mistakes. Some of the most notorious of continuous thrust. Devising the optimal guidance laws misconceptions arise around the use of general perturbations data that minimize travel time or fuel consumption under these (OMMs and TLEs) [Fin07], the geometric interpretation of the conditions is usually treated as an optimization problem mean anomaly [Bat99], or coordinate transformations [VCHK06]. of a dynamical system, and as such it is particularly Finally, few of the open source software libraries mentioned challenging [Con14]. above are amenable to scripting or interactive use, as promoted by • Artificial satellites in the vicinity of a planet. This is computational notebooks like Jupyter [KRKP+ 16]. the regime in which all the commercial space industry The following sections will now discuss the various areas of operates, especially for those satellites in Low-Earth Orbit current research that an astrodynamicist will engage in, and how (LEO). poliastro improves their workflow. 138 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Methods Nice, high level API Software Architecture The architecture of poliastro emerges from the following set of conflicting requirements: Dangerous™ algorithms 1) There should be a high-level API that enables users to perform orbital calculations in a straightforward way and Fig. 1: poliastro two-layer architecture prevent typical mistakes. 2) The running time of the algorithms should be within the Most of the methods of the High level API consist only same order of magnitude of existing compiled implemen- of the necessary unit compatibility checks, plus a wrapper over tations. the corresponding Core API function that performs the actual 3) The library should be written in a popular open-source computation. language to maximize adoption and lower the barrier to @u.quantity_input(E=u.rad, ecc=u.one) external contributors. def E_to_nu(E, ecc): """True anomaly from eccentric anomaly.""" One of the most typical mistakes we set ourselves to prevent return ( with the high-level API is dimensional errors. Addition and E_to_nu_fast( E.to_value(u.rad), substraction operations of physical quantities are defined only for ecc.value quantities with the same units [Dro53]: for example, the operation ) << u.rad 1 km + 100 m requires a scale transformation of at least one ).to(E.unit) of the operands, since they have different units (kilometers and As a result, poliastro offers a unit-safe API that performs the least meters) but the same dimension (length), whereas the operation amount of computation possible to minimize the performance 1 km + 1 kg is directly not allowed because dimensions are penalty of unit checks, and also a unit-unsafe API that offers incompatible (length and mass). As such, software systems oper- maximum performance at the cost of not performing any unit ating with physical quantities should raise exceptions when adding validation checks. different dimensions, and transparently perform the required scale Finally, there are several options to write performant code that transformations when adding different units of the same dimen- can be used from Python, and one of them is using a fast, compiled sion. language for the CPU intensive parts. Successful examples of this With this in mind, we evaluated several Python packages for include NumPy, written in C [HMvdW+ 20], SciPy, featuring a unit handling (see [JGAZJT+ 18] for a recent survey) and chose mix of FORTRAN, C, and C++ code [VGO+ 20], and pandas, astropy.units [TPWS+ 18]. making heavy use of Cython [BBC+ 11]. However, having to radius = 6000 # km write code in two different languages hinders the development altitude = 500 # m speed, makes debugging more difficult, and narrows the potential # Wrong! contributor base (what Julia creators called "The Two Language distance = radius + altitude Problem" [BEKS17]). As authors of poliastro we wanted to use Python as the from astropy import units as u sole programming language of the implementation, and the best # Correct solution we found to improve its performance was to use Numba, distance = (radius << u.km) + (altitude << u.m) a LLVM-based Python JIT compiler [LPS15]. This notion of providing a "safe" API extends to other parts Usage of the library by leveraging other capabilities of the Astropy Basic Orbit and Ephem creation project. For example, timestamps use astropy.time objects, which take care of the appropriate handling of time scales The two central objects of the poliastro high level API are Orbit (such as TDB or UTC), reference frame conversions leverage and Ephem: astropy.coordinates, and so forth. • Orbit objects represent an osculating (hence Keplerian) One of the drawbacks of existing unit packages is that orbit of a dimensionless object around an attractor at a they impose a significant performance penalty. Even though given point in time and a certain reference frame. astropy.units is integrated with NumPy, hence allowing • Ephem objects represent an ephemerides, a sequence of the creation of array quantities, all the unit compatibility checks spatial coordinates over a period of time in a certain are implemented in Python and require lots of introspection, and reference frame. this can slow down mathematical operations by several orders of There are six parameters that uniquely determine a Keplerian magnitude. As such, to fulfill our desired performance requirement orbit, plus the gravitational parameter of the corresponding attrac- for poliastro, we envisioned a two-layer architecture: tor (k or µ). Optionally, an epoch that contextualizes the orbit • The Core API follows a procedural style, and all the can be included as well. This set of six parameters is not unique, functions receive Python numerical types and NumPy and several of them have been developed over the years to serve arrays for maximum performance. different purposes. The most widely used ones are: • The High level API is object-oriented, all the methods • Cartesian elements: Three components for the position receive Astropy Quantity objects with physical units, (x, y, z) and three components for the velocity (vx , vy , vz ). and computations are deferred to the Core API. This set has no singularities. POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 139 • Classical Keplerian elements: Two components for the shape of the conic (usually the semimajor axis a or from poliastro.ephem import Ephem semiparameter p and the eccentricity e), three Euler angles # Configure high fidelity ephemerides globally for the orientation of the orbital plane in space (inclination # (requires network access) i, right ascension of the ascending node Ω, and argument solar_system_ephemeris.set("jpl") of periapsis ω), and one polar angle for the position of the # For predefined poliastro attractors body along the conic (usually true anomaly f or ν). This earth = Ephem.from_body(Earth, Time.now().tdb) set of elements has an easy geometrical interpretation and the advantage that, in pure two-body motion, five of them # For the rest of the Solar System bodies ceres = Ephem.from_horizons("Ceres", Time.now().tdb) are fixed (a, e, i, Ω, ω) and only one is time-dependent (ν), which greatly simplifies the analytical treatment of There are some crucial differences between Orbit and Ephem orbital perturbations. However, they suffer from singular- objects: ities steming from the Euler angles ("gimbal lock") and • Orbit objects have an attractor, whereas Ephem objects equations expressed in them are ill-conditioned near such do not. Ephemerides can originate from complex trajecto- singularities. ries that don’t necessarily conform to the ideal two-body • Walker modified equinoctial elements: Six parameters problem. (p, f , g, h, k, L). Only L is time-dependent and this set has • Orbit objects capture a precise instant in a two-body mo- no singularities, however the geometrical interpretation of tion plus the necessary information to propagate it forward the rest of the elements is lost [WIO85]. in time indefinitely, whereas Ephem objects represent a Here is how to create an Orbit from cartesian and from clas- bounded time history of a trajectory. This is because the sical Keplerian elements. Walker modified equinoctial elements equations for the two-body motion are known, whereas are supported as well. an ephemeris is either an observation or a prediction from astropy import units as u that cannot be extrapolated in any case without external knowledge. As such, Orbit objects have a .propagate from poliastro.bodies import Earth, Sun method, but Ephem ones do not. This prevents users from from poliastro.twobody import Orbit from poliastro.constants import J2000 attempting to propagate the position of the planets, which will always yield poor results compared to the excellent # Data from Curtis, example 4.3 ephemerides calculated by external entities. r = [-6045, -3490, 2500] << u.km v = [-3.457, 6.618, 2.533] << u.km / u.s Finally, both types have methods to convert between them: • Ephem.from_orbit is the equivalent of sampling a orb_curtis = Orbit.from_vectors( Earth, # Attractor two-body motion over a given time interval. As explained r, v # Elements above, the resulting Ephem loses the information about ) the original attractor. # Data for Mars at J2000 from JPL HORIZONS • Orbit.from_ephem is the equivalent of calculating a = 1.523679 << u.au the osculating orbit at a certain point of a trajectory, ecc = 0.093315 << u.one assuming a given attractor. The resulting Orbit loses inc = 1.85 << u.deg the information about the original, potentially complex raan = 49.562 << u.deg argp = 286.537 << u.deg trajectory. nu = 23.33 << u.deg Orbit propagation orb_mars = Orbit.from_classical( Orbit objects have a .propagate method that takes an elapsed Sun, a, ecc, inc, raan, argp, nu, time and returns another Orbit with new orbital elements and an J2000 # Epoch updated epoch: ) >>> from poliastro.examples import iss When displayed on an interactive REPL, Orbit objects provide >>> iss basic information about the geometry, the attractor, and the epoch: >>> 6772 x 6790 km x 51.6 deg (GCRS) ... >>> orb_curtis 7283 x 10293 km x 153.2 deg (GCRS) orbit >>> iss.nu.to(u.deg) around Earth (X) at epoch J2000.000 (TT) <Quantity 46.59580468 deg> >>> orb_mars >>> iss_30m = iss.propagate(30 << u.min) 1 x 2 AU x 1.9 deg (HCRS) orbit around Sun (X) at epoch J2000.000 (TT) >>> (iss_30m.epoch - iss.epoch).datetime datetime.timedelta(seconds=1800) Similarly, Ephem objects can be created using a variety of class- methods as well. Thanks to astropy.coordinates built-in >>> (iss_30m.nu - iss.nu).to(u.deg) <Quantity 116.54513153 deg> low-fidelity ephemerides, as well as its capability to remotely The default propagation algorithm is an analytical procedure access the JPL HORIZONS system, the user can seamlessly build described in [FCM13] that works seamlessly in the near parabolic an object that contains the time history of the position of any Solar System body: region. In addition, poliastro implements analytical propagation from astropy.time import Time algorithms as described in [DB83], [OG86], [Mar95], [Mik87], from astropy.coordinates import solar_system_ephemeris [PP13], [Cha22], and [VM07]. 140 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) rr = propagate( orbit, tofs, method=cowell, f=f, ) Continuous thrust control laws Beyond natural perturbations, spacecraft can modify their trajec- tory on purpose by using impulsive maneuvers (as explained in the next section) as well as continuous thrust guidance laws. The user can define custom guidance laws by providing a perturbation Fig. 2: Osculating (Keplerian) vs perturbed (true) orbit (source: acceleration in the same way natural perturbations are used. In Wikipedia, CC BY-SA 3.0) addition, poliastro includes several analytical solutions for con- tinuous thrust guidance laws with specific purposes, as studied in [CR17]: optimal transfer between circular coplanar orbits [Ede61] Natural perturbations [Bur67], optimal transfer between circular inclined orbits [Ede61] As showcased in Figure 2, at any point in a trajectory we [Kec97], quasi-optimal eccentricity-only change [Pol97], simulta- can define an ideal Keplerian orbit with the same position and neous eccentricity and inclination change [Pol00], and agument of velocity under the attraction of a point mass: this is called the periapsis adjustment [Pol98]. A much more rigorous analysis of a osculating orbit. Some numerical propagation methods exist that similar set of laws can be found in [DCV21]. model the true, perturbed orbit as a deviation from an evolving, from poliastro.twobody.thrust import change_ecc_inc osculating orbit. poliastro implements Cowell’s method [CC10], which consists in adding all the perturbation accelerations and then ecc_f = 0.0 << u.one inc_f = 20.0 << u.deg integrating the resulting differential equation with any numerical f = 2.4e-6 << (u.km / u.s**2) method of choice: d2 r µ a_d, _, t_f = change_ecc_inc(orbit, ecc_f, inc_f, f) 2 = − 3 r + ad (6) dt r The resulting equation is usually integrated using high order Impulsive maneuvers numerical methods, since the integration times are quite large and the tolerances comparatively tight. An in-depth discussion of Impulsive maneuvers are modeled considering a change in the such methods can be found in [HNW09]. poliastro uses Dormand- velocity of a spacecraft while its position remains fixed. The Prince 8(5,3) (DOP853), a commonly used method available in poliastro.maneuver.Maneuver class provides various SciPy [HMvdW+ 20]. constructors to instantiate popular impulsive maneuvers in the There are several natural perturbations included: J2 and J3 framework of the non-perturbed two-body problem: gravitational terms, several atmospheric drag models (exponential, • Maneuver.impulse [Jac77], [AAAA62], [AAA+ 76]), and helpers for third body • Maneuver.hohmann gravitational attraction and radiation pressure as described in [?]. • Maneuver.bielliptic @njit • Maneuver.lambert def combined_a_d( t0, state, k, j2, r_eq, c_d, a_over_m, h0, rho0 ): from poliastro.maneuver import Maneuver return ( J2_perturbation( orb_i = Orbit.circular(Earth, alt=700 << u.km) t0, state, k, j2, r_eq hoh = Maneuver.hohmann(orb_i, r_f=36000 << u.km) ) + atmospheric_drag_exponential( t0, state, k, r_eq, c_d, a_over_m, h0, rho0Once instantiated, Maneuver objects provide information regard- ) ing total ∆v and ∆t: ) >>> hoh.get_total_cost() <Quantity 3.6173981270031357 km / s> def f(t0, state, k): du_kep = func_twobody(t0, state, k) >>> hoh.get_total_time() ax, ay, az = combined_a_d( <Quantity 15729.741535747102 s> t0, state, Maneuver objects can be applied to Orbit instances using the k, R=R, apply_maneuver method. C_D=C_D, >>> orb_i A_over_m=A_over_m, 7078 x 7078 km x 0.0 deg (GCRS) orbit H0=H0, around Earth (X) rho0=rho0, J2=Earth.J2.value, >>> orb_f = orb_i.apply_maneuver(hoh) ) >>> orb_f du_ad = np.array([0, 0, 0, ax, ay, az]) 36000 x 36000 km x 0.0 deg (GCRS) orbit around Earth (X) return du_kep + du_ad POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 141 Targeting Earth - Mars for year 2020-2021, C3 launch 2021-05 34.1 Targeting is the problem of finding the orbit connecting two Days of flight .0 .8 .0 Arrival velocity km/s 41.90 434553750..04273 400 24 0 31.0 200 35.7 positions over a finite amount of time. Within the context of 5. 43.4 313.80.8 2021-04 .0 the non-perturbed two-body problem, targeting is just a matter 26 .4 37.24 32.6 41.9 of solving the BVP, also known as Lambert’s problem. Because 24.8 targeting tries to find for an orbit, the problem is included in the 2021-03 32.59 29.5 34.1 37.2 410.9.0 18..86 .9 Initial Orbit Determination field. 30 20.2 3 27 17.1 27.93 The poliastro.iod package contains izzo and 2021-02 Arrival date 23.3 38.8 vallado modules. These provide a lambert function for solv- km2 / s2 15.5 40.3 45.0 23.28 3.8 5 29. 26.4 ing the targeting problem. Nevertheless, a Maneuver.lambert 21.7 2021-01 27.9 constructor is also provided so users can keep taking advantage of 18.62 32.6 Orbit objects. 13.97 2020-12 # Declare departure and arrival datetimes date_launch = time.Time( 9.31 '2011-11-26 15:02', scale='tdb' 5.0 2020-11 ) Perseverance 4.66 .0 Tianwen-1 100 date_arrival = time.Time( '2012-08-06 05:17', scale='tdb' Hope Mars 2020-10 0.00 ) 3 4 5 6 7 8 9 0 0-0 0-0 0-0 0-0 0-0 0-0 0-0 0-1 202 202 202 202 202 202 202 202 # Define initial and final orbits Launch date orb_earth = Orbit.from_ephem( Sun, Ephem.from_body(Earth, date_launch), Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy showing date_launch latest missions to the Martian planet. ) orb_mars = Orbit.from_ephem( Sun, Ephem.from_body(Mars, date_arrival), date_arrival Generated graphics can be static or interactive. The main ) difference between these two is the ability to modify the camera view in a dynamic way when using interactive plotters. # Compute targetting maneuver and apply it man_lambert = Maneuver.lambert(orb_earth, orb_mars) The most important classes in the poliastro.plotting orb_trans, orb_target = ss0.apply_maneuver( package are StaticOrbitPlotter and OrbitPlotter3D. man_lambert, intermediate=true In addition, the poliastro.plotting.misc module con- ) tains the plot_solar_system function, which allows the user Targeting is closely related to quick mission design by means of to visualize inner and outter both in 2D and 3D, as requested by porkchop diagrams. These are contour plots showing all combi- users. nations of departure and arrival dates with the specific energy for The following example illustrates the plotting capabilities of each transfer orbit. They allow for quick identification of the most poliastro. At first, orbits to be plotted are computed and their optimal transfer dates between two bodies. plotting style is declared: The poliastro.plotting.porkchop provides the from poliastro.plotting.misc import plot_solar_system PorkchopPlotter class which allows the user to generate these diagrams. # Current datetime now = Time.now().tdb from poliastro.plotting.porkchop import ( PorkchopPlotter # Obtain Florence and Halley orbits ) florence = Orbit.from_sbdb("Florence") from poliastro.utils import time_range halley_1835_ephem = Ephem.from_horizons( "90000031", now # Generate all launch and arrival dates ) launch_span = time_range( halley_1835 = Orbit.from_ephem( "2020-03-01", end="2020-10-01", periods=int(150) Sun, halley_1835_ephem, halley_1835_ephem.epochs[0] ) ) arrival_span = time_range( "2020-10-01", end="2021-05-01", periods=int(150) # Define orbit labels and color style ) florence_style = {label: "Florence", color: "#000000"} halley_style = {label: "Florence", color: "#84B0B8"} # Create an instance of the porkchop and plot it porkchop = PorkchopPlotter( The static two-dimensional plot can be created using the following Earth, Mars, launch_span, arrival_span, code: ) # Generate a static 2D figure Previous code, with some additional customization, generates frame2D = rame = plot_solar_system( figure 3. epoch=now, outer=False ) frame2D.plot(florence, **florence_style) Plotting frame2D.plot(florence, **halley_style) For visualization purposes, poliastro provides the As a result, figure 4 is obtained. poliastro.plotting package, which contains various The interactive three-dimensional plot can be created using the utilities for generating 2D and 3D graphics using different following code: backends such as matplotlib [Hun07] and Plotly [Inc15]. 142 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Two-dimensional view of the inner Solar System, Florence, and Halley. Fig. 6: Natural perturbations affecting Low-Earth Orbit (LEO) mo- tion (source: [VM07]) # Generate an interactive 3D figure frame3D = rame = plot_solar_system( the Simplified General Perturbation (SGP) models, first developed epoch=now, outer=False, in [HK66] and then refined in [LC69] into what we know these use_3d=True, interactive=true ) days as the SGP4 propagator [HR80] [VCHK06]. Even though frame3D.plot(florence, **florence_style) certain elements of the reference frame used by SGP4 are not frame3D.plot(florence, **halley_style) properly specified [VCHK06] and that its accuracy might still be As a result, figure 5 is obtained. too limited for certain applications [Ko09] [Lar16], it is nowadays the most widely used propagation method thanks in large part to the dissemination of General Perturbations orbital data by the US 501(c)(3) CelesTrak (which itself obtains it from the 18th Space Defense Squadron of the US Space Force). The starting point of SGP4 is a special element set that uses Brouwer mean orbital elements [Bro59] plus a ballistic coefficient based on an approximation of the atmospheric drag [LC69], and its results are expressed in a special coordinate system called True Equator Mean Equinox (TEME). Special care needs to be taken to avoid mixing mean elements with osculating elements, and to convert the output of the propagation to the appropriate reference frame. These element sets have been traditionally distributed in a compact text representation called Two-Line Element sets (TLEs) (see 7 for an example). However this format is quite cryptic and Fig. 5: Three-dimensional view of the inner Solar System, Florence, suffers from a number of shortcomings, so recently there has and Halley. been a push to use the Orbit Data Messages international standard developed by the Consultive Committee for Space Data Systems Commercial Earth satellites (CCSDS 502.0-B-2). Figure 6 gives a clear picture of the most important natural pertur- 1 25544U 98067A 22156.15037205 .00008547 00000+0 15823-3 0 9994 bations affecting satellites in LEO, namely: the first harmonic of 2 25544 51.6449 36.2070 0004577 196.3587 298.4146 15.49876730343319 the geopotential field J2 (representing the attractor oblateness), Fig. 7: Two-Line Element set (TLE) for the ISS (retrieved on 2022- the atmospheric drag, and the higher order harmonics of the 06-05) geopotential field. At least the most significant of these perturbations need to be At the moment, general perturbations data both in OMM and taken into account when propagating LEO orbits, and therefore TLE format can be integrated with poliastro thanks to the sgp4 the methods for purely Keplerian motion are not enough. As Python library and the Ephem class as follows: seen above, poliastro implements a number of these perturbations from astropy.coordinates import TEME, GCRS already - however, numerical methods are much slower than analytical ones, and this can render them unsuitable for large from poliastro.ephem import Ephem from poliastro.frames import Planes scale simulations, satellite conjunction assesment, propagation in constrained hardware, and so forth. To address this issue, semianalytical propagation methods def ephem_from_gp(sat, times): were devised that attempt to strike a balance between the fast errors, rs, vs = sat.sgp4_array(times.jd1, times.jd2) if not (errors == 0).all(): running times of analytical methods and the necessary inclusion warn( of perturbation forces. One of such semianalytical methods are "Some objects could not be propagated, " POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 143 "proceeding with the rest", do not want to use some of the higher level poliastro abstractions stacklevel=2, or drag its large number of heavy dependencies. ) rs = rs[errors == 0] Finally, the sustainability of the project cannot yet be taken for vs = vs[errors == 0] granted: the project has reached a level of complexity that already times = times[errors == 0] warrants dedicated development effort that cannot be covered with short-lived grants. Such funding could potentially come from the cart_teme = CartesianRepresentation( rs << u.km, private sector, but although there is evidence that several for-profit xyz_axis=-1, companies are using poliastro, we have very little information of differentials=CartesianDifferential( how is it being used and what problems are those users having, vs << (u.km / u.s), xyz_axis=-1, let alone what avenues for funded work could potentially work. ), Organizations like the Libre Space Foundation advocate for a ) strong copyleft licensing model to convince commercial actors to cart_gcrs = ( contribute to the commons, but in principle that goes against the TEME(cart_teme, obstime=times) .transform_to(GCRS(obstime=times)) permissive licensing that the wider Scientific Python ecosystem, .cartesian including poliastro, has adopted. With the advent of new business ) models and the ever increasing reliance in open source by the private sector, a variety of ways to engage commercial users and return Ephem( cart_gcrs, include them in the conversation exist. However, these have not times, been explored yet. plane=Planes.EARTH_EQUATOR ) Acknowledgements However, no native integration with SGP4 has been implemented The authors would like to thank Prof. Michèle Lavagna for her yet in poliastro, for technical and non-technical reasons. On one original guidance and inspiration, David A. Vallado for his en- hand, this propagator is too different from the other methods, and couragement and for publishing the source code for the algorithms we have not yet devised how to add it to the library in a way from his book for free, Dr. T.S. Kelso for his tireless efforts in that does not create confusion. On the other hand, adding such maintaining CelesTrak, Alejandro Sáez for sharing the dream of a propagator to poliastro would probably open the flood gates of a better way, Prof. Dr. Manuel Sanjurjo Rivo for believing in my corporate users of the library, and we would like to first devise work, Helge Eichhorn for his enthusiasm and decisive influence a sustainability strategy for the project, which is addressed in the in poliastro, the whole OpenAstronomy collaboration for opening next section. the door for us, the NumFOCUS organization for their immense support, and Alexandra Elbakyan for enabling scientific progress worldwide. Future work Despite the fact that poliastro has existed for almost a decade, for R EFERENCES most of its history it has been developed by volunteers on their [AAA+ 76] United States Committee on Extension to the Standard At- free time, and only in the past five years it has received funding mosphere, United States National Aeronautics, Space Ad- through various Summer of Code programs (SOCIS 2017, GSOC ministration, United States National Oceanic, Atmospheric Administration, and United States Air Force. U.S. Stan- 2018-2021) and institutional grants (NumFOCUS 2020, 2021). dard Atmosphere, 1976. NOAA - SIT 76-1562. National The funded work has had an overwhemingly positive impact on Oceanic and Amospheric [sic] Administration, 1976. URL: the project, however the lack of a dedicated maintainer has caused https://books.google.es/books?id=x488AAAAIAAJ. [AAAA62] United States Committee on Extension to the Standard At- some technical debt to accrue over the years, and some parts of mosphere, United States National Aeronautics, Space Admin- the project are in need of refactoring or better documentation. istration, and United States Environmental Science Services Historically, poliastro has tried to implement algorithms that Administration. U.S. Standard Atmosphere, 1962: ICAO were applicable for all the planets in the Solar System, however Standard Atmosphere to 20 Kilometers; Proposed ICAO Ex- tension to 32 Kilometers; Tables and Data to 700 Kilo- some of them have proved to be very difficult to generalize for meters. U.S. Government Printing Office, 1962. URL: bodies other than the Earth. For cases like these, poliastro ships a https://books.google.es/books?id=fWdTAAAAMAAJ. poliastro.earth package, but going forward we would like [Bat99] Richard H. Battin. An Introduction to the Mathematics and Methods of Astrodynamics, Revised Edition. American to continue embracing a generic approach that can serve other Institute of Aeronautics and Astronautics, Inc., Reston, VA, bodies as well. January 1999. URL: https://arc.aiaa.org/doi/book/10.2514/4. Several open source projects have successfully used poliastro 861543, doi:10.2514/4.861543. or were created taking inspiration from it, like spacetech-ssa [BBC+ 11] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dal- cin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The by IBM1 or mubody [BBVPFSC22]. AGI (previously Analytical Best of Both Worlds. Computing in Science & Engineering, Graphics, Inc., now Ansys Government Initiatives) published a 13(2):31–39, March 2011. URL: http://ieeexplore.ieee.org/ series of scripts to automate the commercial tool STK from Python document/5582062/, doi:10.1109/MCSE.2010.118. [BBVPFSC22] Juan Bermejo Ballesteros, José María Vergara Pérez, leveraging poliastro2 . However, we have observed that there is still Alejandro Fernández Soler, and Javier Cubas. Mu- lots of repeated code across similar open source libraries written body, an astrodynamics open-source Python library fo- in Python, which means that there is an opportunity to provide cused on libration points. Barcelona, Spain, April a "kernel" of algorithms that can be easily reused. Although 2022. URL: https://sseasymposium.org/wp-content/uploads/ 2022/04/4thSSEA_AllAbstracts.pdf. poliastro.core started as a separate layer to isolate fast, non- safe functions as described above, we think we could move it to 1. https://github.com/IBM/spacetech-ssa an external package so it can be depended upon by projects that 2. https://github.com/AnalyticalGraphicsInc/STKCodeExamples/ 144 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [BEKS17] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Vi- Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, ral B. Shah. Julia: A Fresh Approach to Numerical Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Computing. SIAM Review, 59(1):65–98, January 2017. Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro- URL: https://epubs.siam.org/doi/10.1137/141000671, doi: gramming with NumPy. Nature, 585(7825):357–362, Septem- 10.1137/141000671. ber 2020. URL: https://www.nature.com/articles/s41586-020- [Bro59] Dirk Brouwer. Solution of the problem of artificial satellite 2649-2, doi:10.1038/s41586-020-2649-2. theory without drag. The Astronomical Journal, 64:378, [HNW09] E. Hairer, S. P. Nørsett, and Gerhard Wanner. Solving ordi- November 1959. URL: http://adsabs.harvard.edu/cgi-bin/bib_ nary differential equations I: nonstiff problems. Number 8 query?1959AJ.....64..378B, doi:10.1086/107958. in Springer series in computational mathematics. Springer, [Bur67] E.G.C. Burt. On space manoeuvres with con- Heidelberg ; London, 2nd rev. ed edition, 2009. OCLC: tinuous thrust. Planetary and Space Science, ocn620251790. 15(1):103–122, January 1967. URL: https: [HR80] Felix R. Hoots and Ronald L. Roehrich. Models for prop- //linkinghub.elsevier.com/retrieve/pii/0032063367900700, agation of NORAD element sets. Technical report, Defense doi:10.1016/0032-0633(67)90070-0. Technical Information Center, Fort Belvoir, VA, December [CC10] Philip Herbert Cowell and Andrew Claude Crommelin. Inves- 1980. URL: http://www.dtic.mil/docs/citations/ADA093554. tigation of the Motion of Halley’s Comet from 1759 to 1910. [Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Com- Neill & Company, limited, 1910. puting in Science & Engineering, 9(3):90–95, 2007. Pub- [Cha22] Kevin Charls. Recursive solution to Kepler’s problem for lisher: IEEE COMPUTER SOC. doi:10.1109/MCSE. elliptical orbits - application in robust Newton-Raphson and 2007.55. co-planar closest approach estimation. 2022. Publisher: [IBD+ 20] Dario Izzo, Will Binns, Dariomm098, Alessio Mereta, Unpublished Version Number: 1. URL: https://rgdoi.net/ Christopher Iliffe Sprague, Dhennes, Bert Van Den Abbeele, 10.13140/RG.2.2.18578.58563/1, doi:10.13140/RG.2. Chris Andre, Krzysztof Nowak, Nat Guy, Alberto Isaac Bar- 2.18578.58563/1. quín Murguía, Pablo, Frédéric Chapoton, GiacomoAcciarini, [Con14] Bruce A. Conway. Spacecraft trajectory optimization. Num- Moritz V. Looz, Dietmarwo, Mike Heddes, Anatoli Babenia, ber 29 in Cambridge aerospace series. Cambridge university Baptiste Fournier, Johannes Simon, Jonathan Willitts, Ma- press, Cambridge (GB), 2014. teusz Polnik, Sanjeev Narayanaswamy, The Gitter Badger, [CR17] Juan Luis Cano Rodríguez. Study of analytical solutions for and Jack Yarndley. esa/pykep: Optimize, October 2020. low-thrust trajectories. Master’s thesis, Universidad Politéc- URL: https://zenodo.org/record/4091753, doi:10.5281/ nica de Madrid, March 2017. ZENODO.4091753. [DB83] J. M. A. Danby and T. M. Burkardt. The solution of Kepler’s [Inc15] Plotly Technologies Inc. Collaborative data science, 2015. equation, I. Celestial Mechanics, 31(2):95–107, October Place: Montreal, QC Publisher: Plotly Technologies Inc. URL: 1983. URL: http://link.springer.com/10.1007/BF01686811, https://plot.ly. doi:10.1007/BF01686811. [Jac77] L. G. Jacchia. Thermospheric Temperature, Density, and [DCV21] Marilena Di Carlo and Massimiliano Vasile. Analytical Composition: New Models. SAO Special Report, 375, March solutions for low-thrust orbit transfers. Celestial Mechanics 1977. ADS Bibcode: 1977SAOSR.375.....J. URL: https: and Dynamical Astronomy, 133(7):33, July 2021. URL: https: //ui.adsabs.harvard.edu/abs/1977SAOSR.375.....J. //link.springer.com/10.1007/s10569-021-10033-9, doi:10. [JGAZJT+ 18] Nathan J. Goldbaum, John A. ZuHone, Matthew J. Turk, 1007/s10569-021-10033-9. Kacper Kowalik, and Anna L. Rosen. unyt: Handle, ma- [Dro53] S. Drobot. On the foundations of Dimensional Analysis. nipulate, and convert data with units in Python. Jour- Studia Mathematica, 14(1):84–99, 1953. URL: http://www. nal of Open Source Software, 3(28):809, August 2018. impan.pl/get/doi/10.4064/sm-14-1-84-99, doi:10.4064/ URL: http://joss.theoj.org/papers/10.21105/joss.00809, doi: sm-14-1-84-99. 10.21105/joss.00809. [Dub73] G. N. Duboshin. Book Review: Samuel Herrick. Astrodynam- [Kec97] Jean Albert Kechichian. Reformulation of Edelbaum’s Low- ics. Soviet Astronomy, 16:1064, June 1973. ADS Bibcode: Thrust Transfer Problem Using Optimal Control Theory. 1973SvA....16.1064D. URL: https://ui.adsabs.harvard.edu/ Journal of Guidance, Control, and Dynamics, 20(5):988– abs/1973SvA....16.1064D. 994, September 1997. URL: https://arc.aiaa.org/doi/10.2514/ [Ede61] Theodore N. Edelbaum. Propulsion Requirements for Con- 2.4145, doi:10.2514/2.4145. trollable Satellites. ARS Journal, 31(8):1079–1089, August [Ko09] TS Kelso and others. Analysis of the Iridium 33-Cosmos 1961. URL: https://arc.aiaa.org/doi/10.2514/8.5723, doi: 2251 collision. Advances in the Astronautical Sciences, 10.2514/8.5723. 135(2):1099–1112, 2009. Publisher: Citeseer. [FCM13] Davide Farnocchia, Davide Bracali Cioci, and Andrea Milani. [KRKP+ 16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Robust resolution of Kepler’s equation in all eccentricity Brian E Granger, Matthias Bussonnier, Jonathan Frederic, regimes. Celestial Mechanics and Dynamical Astronomy, Kyle Kelley, Jessica B Hamrick, Jason Grout, Sylvain Cor- 116(1):21–34, May 2013. URL: http://link.springer.com/10. lay, and others. Jupyter Notebooks-a publishing format for 1007/s10569-013-9476-9, doi:10.1007/s10569-013- reproducible computational workflows., volume 2016. 2016. 9476-9. [Lar16] Martin Lara. Analytical and Semianalytical Propagation [Fin07] D Finkleman. "TLE or Not TLE?" That is the Question (AAS of Space Orbits: The Role of Polar-Nodal Variables. In 07-126). ADVANCES IN THE ASTRONAUTICAL SCIENCES, Gerard Gómez and Josep J. Masdemont, editors, Astro- 127(1):401, 2007. Publisher: Published for the American dynamics Network AstroNet-II, volume 44, pages 151– Astronautical Society by Univelt; 1999. 166. Springer International Publishing, Cham, 2016. Se- [GBP22] Mirko Gabelica, Ružica Bojčić, and Livia Puljak. Many ries Title: Astrophysics and Space Science Proceedings. researchers were not compliant with their published URL: http://link.springer.com/10.1007/978-3-319-23986-6_ data sharing statement: mixed-methods study. Jour- 11, doi:10.1007/978-3-319-23986-6_11. nal of Clinical Epidemiology, page S089543562200141X, [LC69] M. H. Lane and K. Cranford. An improved ana- May 2022. URL: https://linkinghub.elsevier.com/retrieve/ lytical drag theory for the artificial satellite problem. pii/S089543562200141X, doi:10.1016/j.jclinepi. In Astrodynamics Conference, Princeton,NJ,U.S.A., August 2022.05.019. 1969. American Institute of Aeronautics and Astronautics. [Her71] Samuel Herrick. Astrodynamics. Van Nostrand Reinhold Co, URL: https://arc.aiaa.org/doi/10.2514/6.1969-925, doi:10. London, New York, 1971. 2514/6.1969-925. [HK66] CG Hilton and JR Kuhlman. Mathematical models for the [LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a space defense center. Philco-Ford Publication No. U-3871, LLVM-based Python JIT compiler. In Proceedings of the Sec- 17:28, 1966. ond Workshop on the LLVM Compiler Infrastructure in HPC [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der - LLVM ’15, pages 1–6, Austin, Texas, 2015. ACM Press. Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric URL: http://dl.acm.org/citation.cfm?doid=2833157.2833162, Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, doi:10.1145/2833157.2833162. Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van [Mar95] F. Landis Markley. Kepler Equation solver. Celes- Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del tial Mechanics & Dynamical Astronomy, 63(1):101–111, POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 145 1995. URL: http://link.springer.com/10.1007/BF00691917, S. Fabbro, L. A. Ferreira, T. Finethy, R. T. Fox, L. H. doi:10.1007/BF00691917. Garrison, S. L. J. Gibbons, D. A. Goldstein, R. Gommers, J. P. [Mik87] Seppo Mikkola. A cubic approximation for Kepler’s equa- Greco, P. Greenfield, A. M. Groener, F. Grollier, A. Hagen, tion. Celestial Mechanics, 40(3-4):329–334, September P. Hirst, D. Homeier, A. J. Horton, G. Hosseinzadeh, L. Hu, 1987. URL: http://link.springer.com/10.1007/BF01235850, J. S. Hunkeler, Ž. Ivezić, A. Jain, T. Jenness, G. Kanarek, doi:10.1007/BF01235850. S. Kendrew, N. S. Kern, W. E. Kerzendorf, A. Khvalko, [MKDVB+ 19] Michael Mommert, Michael Kelley, Miguel De Val-Borro, J. King, D. Kirkby, A. M. Kulkarni, A. Kumar, A. Lee, Jian-Yang Li, Giannina Guzman, Brigitta Sipőcz, Josef D. Lenz, S. P. Littlefair, Z. Ma, D. M. Macleod, M. Mastropi- Ďurech, Mikael Granvik, Will Grundy, Nick Moskovitz, etro, C. McCully, S. Montagnac, B. M. Morris, M. Mueller, Antti Penttilä, and Nalin Samarasinha. sbpy: A Python S. J. Mumford, D. Muna, N. A. Murphy, S. Nelson, G. H. module for small-body planetary astronomy. Jour- Nguyen, J. P. Ninan, M. Nöthe, S. Ogaz, S. Oh, J. K. Parejko, nal of Open Source Software, 4(38):1426, June 2019. N. Parley, S. Pascual, R. Patil, A. A. Patil, A. L. Plunkett, URL: http://joss.theoj.org/papers/10.21105/joss.01426, doi: J. X. Prochaska, T. Rastogi, V. Reddy Janga, J. Sabater, 10.21105/joss.01426. P. Sakurikar, M. Seifert, L. E. Sherbert, H. Sherwood-Taylor, [noa] Astrodynamics.jl. URL: https://github.com/JuliaSpace/ A. Y. Shih, J. Sick, M. T. Silbiger, S. Singanamalla, L. P. Astrodynamics.jl. Singer, P. H. Sladen, K. A. Sooley, S. Sornarajah, O. Stre- [noa18] gpredict, January 2018. URL: https://github.com/csete/ icher, P. Teuben, S. W. Thomas, G. R. Tremblay, J. E. H. gpredict/releases/tag/v2.2.1. Turner, V. Terrón, M. H. van Kerkwijk, A. de la Vega, [noa20] GMAT, July 2020. URL: https://sourceforge.net/projects/ L. L. Watkins, B. A. Weaver, J. B. Whitmore, J. Woillez, gmat/files/GMAT/GMAT-R2020a/. V. Zabalza, and (Astropy Contributors). The Astropy Project: [noa21a] nyx, November 2021. URL: https://gitlab.com/nyx-space/ Building an Open-science Project and Status of the v2.0 nyx/-/tags/1.0.0. Core Package. The Astronomical Journal, 156(3):123, August [noa21b] SatNOGS, October 2021. URL: https://gitlab.com/ 2018. URL: https://iopscience.iop.org/article/10.3847/1538- librespacefoundation/satnogs/satnogs-client/-/tags/1.7. 3881/aabc4f, doi:10.3847/1538-3881/aabc4f. [noa22a] beyond, January 2022. URL: https://pypi.org/project/beyond/ [VCHK06] David Vallado, Paul Crawford, Ricahrd Hujsak, and T.S. 0.7.4/. Kelso. Revisiting Spacetrack Report #3. In AIAA/AAS Astro- [noa22b] celestlab, January 2022. URL: https://atoms.scilab.org/ dynamics Specialist Conference and Exhibit, Keystone, Col- toolboxes/celestlab/3.4.1. orado, August 2006. American Institute of Aeronautics and [noa22c] Orekit, June 2022. URL: https://gitlab.orekit.org/orekit/ Astronautics. URL: https://arc.aiaa.org/doi/10.2514/6.2006- orekit/-/releases/11.2. 6753, doi:10.2514/6.2006-6753. [noa22d] SPICE, January 2022. URL: https://naif.jpl.nasa.gov/naif/ [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt toolkit.html. Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, [noa22e] tudatpy, January 2022. URL: https://github.com/tudat-team/ Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- tudatpy/releases/tag/0.6.0. fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- [OG86] A. W. Odell and R. H. Gooding. Procedures for solving rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Kepler’s equation. Celestial Mechanics, 38(4):307–334, April Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, 1986. URL: http://link.springer.com/10.1007/BF01238923, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, doi:10.1007/BF01238923. Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- [Pol97] James E Pollard. Simplified approach for assessment of low- tero, Charles R. Harris, Anne M. Archibald, Antônio H. thrust elliptical orbit transfers. In 25th International Electric Ribeiro, Fabian Pedregosa, Paul van Mulbregt, SciPy 1.0 Propulsion Conference, Cleveland, OH, pages 97–160, 1997. Contributors, Aditya Vijaykumar, Alessandro Pietro Bardelli, [Pol98] James Pollard. Evaluation of low-thrust orbital maneuvers. Alex Rothberg, Andreas Hilboll, Andreas Kloeckner, Anthony In 34th AIAA/ASME/SAE/ASEE Joint Propulsion Confer- Scopatz, Antony Lee, Ariel Rokem, C. Nathan Woods, Chad ence and Exhibit, Cleveland,OH,U.S.A., July 1998. Ameri- Fulton, Charles Masson, Christian Häggström, Clark Fitzger- can Institute of Aeronautics and Astronautics. URL: https: ald, David A. Nicholson, David R. Hagen, Dmitrii V. Pasech- //arc.aiaa.org/doi/10.2514/6.1998-3486, doi:10.2514/6. nik, Emanuele Olivetti, Eric Martin, Eric Wieser, Fabrice 1998-3486. Silva, Felix Lenders, Florian Wilhelm, G. Young, Gavin A. [Pol00] J. E. Pollard. Simplified analysis of low-thrust orbital maneu- Price, Gert-Ludwig Ingold, Gregory E. Allen, Gregory R. Lee, vers. Technical report, Defense Technical Information Center, Hervé Audren, Irvin Probst, Jörg P. Dietrich, Jacob Silterra, Fort Belvoir, VA, August 2000. URL: http://www.dtic.mil/ James T Webber, Janko Slavič, Joel Nothman, Johannes Buch- docs/citations/ADA384536. ner, Johannes Kulick, Johannes L. Schönberger, José Vinícius [PP13] Adonis Reinier Pimienta-Penalver. Accurate Kepler equation de Miranda Cardoso, Joscha Reimer, Joseph Harrington, Juan solver without transcendental function evaluations. State Luis Cano Rodríguez, Juan Nunez-Iglesias, Justin Kuczynski, University of New York at Buffalo, 2013. Kevin Tritz, Martin Thoma, Matthew Newville, Matthias [Rho20] Brandon Rhodes. Skyfield: Generate high precision research- Kümmerer, Maximilian Bolingbroke, Michael Tartre, Mikhail grade positions for stars, planets, moons, and Earth satellites, Pak, Nathaniel J. Smith, Nikolai Nowaczyk, Nikolay She- February 2020. banov, Oleksandr Pavlyk, Per A. Brodtkorb, Perry Lee, [SSM18] Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An Robert T. McGibbon, Roman Feldbauer, Sam Lewis, Sam empirical analysis of journal policy effectiveness for Tygier, Scott Sievert, Sebastiano Vigna, Stefan Peterson, computational reproducibility. Proceedings of the National Surhud More, Tadeusz Pudlik, Takuya Oshima, Thomas J. Academy of Sciences, 115(11):2584–2589, March 2018. Pingel, Thomas P. Robitaille, Thomas Spura, Thouis R. Jones, URL: https://pnas.org/doi/full/10.1073/pnas.1708290115, Tim Cera, Tim Leslie, Tiziano Zito, Tom Krauss, Utkarsh doi:10.1073/pnas.1708290115. Upadhyay, Yaroslav O. Halchenko, and Yoshiki Vázquez- Baeza. SciPy 1.0: fundamental algorithms for scientific [TPWS+ 18] The Astropy Collaboration, A. M. Price-Whelan, B. M. computing in Python. Nature Methods, 17(3):261–272, Sipőcz, H. M. Günther, P. L. Lim, S. M. Crawford, S. Conseil, March 2020. URL: http://www.nature.com/articles/s41592- D. L. Shupe, M. W. Craig, N. Dencheva, A. Ginsburg, J. T. 019-0686-2, doi:10.1038/s41592-019-0686-2. VanderPlas, L. D. Bradley, D. Pérez-Suárez, M. de Val-Borro, (Primary Paper Contributors), T. L. Aldcroft, K. L. Cruz, T. P. [VM07] David A. Vallado and Wayne D. McClain. Fundamentals Robitaille, E. J. Tollerud, (Astropy Coordination Commit- of astrodynamics and applications. Number 21 in Space tee), C. Ardelean, T. Babej, Y. P. Bach, M. Bachetti, A. V. technology library. Microcosm Press [u.a.], Hawthorne, Calif., Bakanov, S. P. Bamford, G. Barentsen, P. Barmby, A. Baum- 3. ed., 1. printing edition, 2007. bach, K. L. Berry, F. Biscani, M. Boquien, K. A. Bostroem, [WAB+ 14] Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. L. G. Bouma, G. B. Brammer, E. M. Bray, H. Breytenbach, Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Had- H. Buddelmeijer, D. J. Burke, G. Calderone, J. L. Cano dock, Kathryn D. Huff, Ian M. Mitchell, Mark D. Plumbley, Rodríguez, M. Cara, J. V. M. Cardoso, S. Cheedella, Y. Copin, Ben Waugh, Ethan P. White, and Paul Wilson. Best Practices L. Corrales, D. Crichton, D. D’Avella, C. Deil, É. Depagne, for Scientific Computing. PLoS Biology, 12(1):e1001745, J. P. Dietrich, A. Donath, M. Droettboom, N. Earl, T. Erben, January 2014. URL: https://dx.plos.org/10.1371/journal.pbio. 146 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1001745, doi:10.1371/journal.pbio.1001745. [WIO85] M. J. H. Walker, B. Ireland, and Joyce Owens. A set modified equinoctial orbit elements. Celestial Mechanics, 36(4):409– 419, August 1985. URL: http://link.springer.com/10.1007/ BF01227493, doi:10.1007/BF01227493. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 147 A New Python API for Webots Robotics Simulations Justin C. Fisher‡∗ F Abstract—Webots is a popular open-source package for 3D robotics simula- In qualitative terms, the old API feels like one is awkwardly tions. It can also be used as a 3D interactive environment for other physics- using Python to call C and C++ functions, whereas the new API based modeling, virtual reality, teaching or games. Webots has provided a sim- feels much simpler, much easier, and like it is fully intended for ple API allowing Python programs to control robots and/or the simulated world, Python. Here is a representative (but far from comprehensive) list but this API is inefficient and does not provide many "pythonic" conveniences. of examples: A new Python API for Webots is presented that is more efficient and provides a more intuitive, easily usable, and "pythonic" interface. • Unlike the old API, the new API contains helpful Python Index Terms—Webots, Python, Robotics, Robot Operating System (ROS), type annotations and docstrings. Open Dynamics Engine (ODE), 3D Physics Simulation • Webots employs many vectors, e.g., for 3D positions, 4D rotations, and RGB colors. The old API typically treats these as lists or integers (24-bit colors). In the new API 1. Introduction these are Vector objects, with conveniently addressable Webots is a popular open-source package for 3D robotics sim- components (e.g. vector.x or color.red), conve- ulations [Mic01], [Webots]. It can also be used as a 3D in- nient helper methods like vector.magnitude and teractive environment for other physics-based modeling, virtual vector.unit_vector, and overloaded vector arith- reality, teaching or games. Webots uses the Open Dynamics metic operations, akin to (and interoperable with) NumPy Engine [ODE], which allows physical simulations of Newtonian arrays. bodies, collisions, joints, springs, friction, and fluid dynamics. • The new API also provides easy interfacing between Webots provides the means to simulate a wide variety of robot high-resolution Webots sensors (like cameras and Lidar) components, including motors, actuators, wheels, treads, grippers, and Numpy arrays, to make it much more convenient to light sensors, ultrasound sensors, pressure sensors, range finders, use Webots with popular Python packages like Numpy radar, lidar, and cameras (with many of these sensors drawing [NumPy], [Har01], Scipy [Scipy], [Vir01], PIL/PILLOW their inputs from GPU processing of the simulation). A typical [PIL] or OpenCV [OpenCV], [Brad01]. For example, simulation will involve one or more robots, each with somewhere converting a Webots camera image to a NumPy array is between 3 and 30 moving parts (though more would be possible), now as simple as camera.array and this now allows each running its own controller program to process information the array to share memory with the camera, making this taken in by its sensors to determine what control signals to send to extremely fast regardless of image size. its devices. A simulated world typically involves a ground surface • The old API often requires that all function parameters be (which may be a sloping polygon mesh) and dozens of walls, given explicitly in every call, whereas the new API gives obstacles, and/or other objects, which may be stationary or moving many parameters commonly used default values, allowing in the physics simulation. them often to be omitted, and keyword arguments to be Webots has historically provided a simple Python API, allow- used where needed. ing