Authors: Chris Calloway, David Shupe, Dillon Niederhut, Meghann Agarwal
License CC-BY-3.0
Proceedings of the 21st Python in Science Conference
Edited by Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe.
SciPy 2022, Austin, Texas, July 11 - July 17, 2022

Copyright © 2022. The articles in the Proceedings of the Python in Science Conference are copyrighted and owned by their original authors. This is an open-access publication and is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For more information, please see: http://creativecommons.org/licenses/by/3.0/
ISSN: 2575-9752
https://doi.org/10.25080/majora-212e5952-046

ORGANIZATION

Conference Chairs: Jonathan Guyer, NIST; Alexandre Chabot-Leclerc, Enthought, Inc.
Program Chairs: Matt Haberland, Cal Poly; Julie Hollek, Mozilla; Madicken Munk, University of Illinois; Guen Prawiroatmodjo, Microsoft Corp
Communications: Arliss Collins, NumFOCUS; Matt Davis, Populus; David Nicholson, Embedded Intelligence
Birds of a Feather: Andrew Reid, NIST; Anastasiia Sarmakeeva, George Washington University
Proceedings: Meghann Agarwal, Overhaul; Chris Calloway, University of North Carolina; Dillon Niederhut, Novi Labs; David Shupe, Caltech's IPAC Astronomy Data Center
Financial Aid: Scott Collis, Argonne National Laboratory; Nadia Tahiri, Université de Montréal
Tutorials: Mike Hearne, USGS; Logan Thomas, Enthought, Inc.
Sprints: Tania Allard, Quansight Labs; Brigitta Sipőcz, Caltech/IPAC
Diversity: Celia Cintas, IBM Research Africa; Bonny P McClain, O'Reilly Media; Fatma Tarlaci, OpenTeams
Activities: Paul Anzel, Codecov; Inessa Pawson, Albus Code
Sponsors: Kristen Leiser, Enthought, Inc.
Financial: Chris Chan, Enthought, Inc.; Bill Cowan, Enthought, Inc.; Jodi Havranek, Enthought, Inc.
Logistics: Kristen Leiser, Enthought, Inc.

Proceedings Reviewers: Aileen Nielsen, Ajit Dhobale, Alejandro Coca-Castro, Alexander Yang, Bhupendra A Raut, Bradley Dice, Brian Gue, Cadiou Corentin, Carl Simon Adorf, Chen Zhang, Chiara Marmo, Chitaranjan Mahapatra, Chris Calloway, Daniel Wheeler, David Nicholson, David Shupe, Dillon Niederhut, Diptorup Deb, Jelena Milosevic, Michal Maciejewski, Ed Rogers, Himaghna Bhattacharjee, Hongsup Shin, Indraneil Paul, Ivan Marroquin, James Lamb, Jyh-Miin Lin, Jyotika Singh, Karthik Murugadoss, Kehinde Ajayi, Kelly L. Rowland, Kelvin Lee, Kevin Maik Jablonka, Kevin W. Beam, Kuntao Zhao, Maruthi NH, Matt Craig, Matthew Feickert, Meghann Agarwal, Melissa Weber Mendonça, Onuralp Soylemez, Rohit Goswami, Ryan Bunney, Shubham Sharma, Siddhartha Srivastava, Sushant More, Tetsuo Koyama, Thomas Nicholas, Victoria Adesoba, Vidhi Chugh, Vivek Sinha, Wenduo Zhou, Zuhal Cakir

ACCEPTED TALK SLIDES

Building Binary Extensions with pybind11, scikit-build, and cibuildwheel, Henry Schreiner, Joe Rickerby, Ralf Grosse-Kunstleve, Wenzel Jakob, Matthieu Darbois, Aaron Gokaslan, Jean-Christophe Fillion-Robin, and Matt McCormick. doi.org/10.25080/majora-212e5952-033
Python Development Schemes for Monte Carlo Neutronics on High Performance Computing, Jackson P. Morgan and Kyle E. Niemeyer. doi.org/10.25080/majora-212e5952-034
Awkward Packaging: Building Scikit-HEP, Henry Schreiner, Jim Pivarski, and Eduardo Rodrigues. doi.org/10.25080/majora-212e5952-035
Development of Accessible, Aesthetically-Pleasing Color Sequences, Matthew A. Petroff. doi.org/10.25080/majora-212e5952-036
Cutting Edge Climate Science in the Cloud with Pangeo, Julius Busecke. doi.org/10.25080/majora-212e5952-037
Pylira: Deconvolution of Images in the Presence of Poisson Noise, Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, and David van Dyk. doi.org/10.25080/majora-212e5952-038
Accelerating Science with the Generative Toolkit for Scientific Discovery (GT4SD), GT4SD team. doi.org/10.25080/majora-212e5952-039
MModel: A Modular Modeling Framework for Scientific Prototyping, Peter Sun and John A. Marohn. doi.org/10.25080/majora-212e5952-03a
Monaco: Quantify Uncertainty and Sensitivities in Your Computational Models with a Monte Carlo Library, W. Scott Shambaugh. doi.org/10.25080/majora-212e5952-03b
UFuncs and DTypes: New Possibilities in NumPy, Sebastian Berg and Stéfan van der Walt. doi.org/10.25080/majora-212e5952-03c
Per Python ad Astra: Interactive Astrodynamics with poliastro, Juan Luis Cano Rodríguez. doi.org/10.25080/majora-212e5952-03d
pyampute: A Python Library for Data Amputation, Rianne M Schouten, Davina Zamanzadeh, and Prabhant Singh. doi.org/10.25080/majora-212e5952-03e
Scientific Python: From GitHub to TikTok, Juanita Gomez Romero, Stéfan van der Walt, K. Jarrod Millman, Melissa Weber Mendonça, and Inessa Pawson. doi.org/10.25080/majora-212e5952-03f
Scientific Python: By Maintainers, For Maintainers, Pamphile T. Roy, Stéfan van der Walt, K. Jarrod Millman, and Melissa Weber Mendonça. doi.org/10.25080/majora-212e5952-040
Improving Random Sampling in Python: scipy.stats.sampling and scipy.stats.qmc, Pamphile T. Roy, Matt Haberland, Christoph Baumgarten, and Tirth Patel. doi.org/10.25080/majora-212e5952-041
Petabyte-Scale Ocean Data Analytics on Staggered Grids via the Grid UFunc Protocol in xGCM, Thomas Nicholas, Julius Busecke, and Ryan Abernathey. doi.org/10.25080/majora-212e5952-042

ACCEPTED POSTERS

Optimal Review Assignments for the SciPy Conference Using Binary Integer Linear Programming in SciPy 1.9, Matt Haberland and Nicholas McKibben. doi.org/10.25080/majora-212e5952-029
Contributing to Open Source Software: From Not Knowing Python to Becoming a Spyder Core Developer, Daniel Althviz Moré. doi.org/10.25080/majora-212e5952-02a
Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Image Labeling, Nathan Jessurun, Olivia P. Dizon-Paradis, Dan E. Capecci, Damon L. Woodard, and Navid Asadizanjani. doi.org/10.25080/majora-212e5952-02b
Bioframe: Operating on Genomic Interval Dataframes, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra Galitsyna, Anton Goloborodko, Maxim Imakaev, Trevor Manz, and Sergey V. Venev. doi.org/10.25080/majora-212e5952-02c
Likeness: A Toolkit for Connecting the Social Fabric of Place to Human Dynamics, Joseph V. Tuccillo and James D.
Gaboardi doi.org/10.25080/majora-212e5952-02d PYA UDIO P ROCESSING : A UDIO P ROCESSING , F EATURE E XTRACTION , AND M ACHINE L EARNING M ODELING , Jy- otika Singh doi.org/10.25080/majora-212e5952-02e K IWI : P YTHON T OOL FOR T EX P ROCESSING AND C LASSIFICATION, Neelima Pulagam, and Sai Marasani, and Brian Sass doi.org/10.25080/majora-212e5952-02f P HYLOGEOGRAPHY: A NALYSIS OF GENETIC AND CLIMATIC DATA OF SARS-C O V-2, Wanlin Li, and Aleksandr Koshkarov, and My-Linh Luu, and Nadia Tahiri doi.org/10.25080/majora-212e5952-030 D ESIGN OF A S CIENTIFIC D ATA A NALYSIS S UPPORT P LATFORM, Nathan Martindale, and Jason Hite, and Scott Stewart, and Mark Adams doi.org/10.25080/majora-212e5952-031 O PENING ARM: A PIVOT TO COMMUNITY SOFTWARE TO MEET THE NEEDS OF USERS AND STAKEHOLDERS OF THE PLANET ’ S LARGEST CLOUD OBSERVATORY , Zachary Sherman, and Scott Collis, and Max Grover, and Robert Jackson, and Adam Theisen doi.org/10.25080/majora-212e5952-032 S CI P Y TOOLS P LENARIES S CI P Y T OOLS P LENARY - CEL TEAM, Inessa Pawson doi.org/10.25080/majora-212e5952-043 S CI P Y T OOLS P LENARY ON M ATPLOTLIB, Elliott Sales de Andrade doi.org/10.25080/majora-212e5952-044 S CI P Y T OOLS P LENARY - N UM P Y, Inessa Pawson doi.org/10.25080/majora-212e5952-045 L IGHTNING TALKS D OWNSAMPLING T IME S ERIES D ATA FOR V ISUALIZATIONS, Delaina Moore doi.org/10.25080/majora-212e5952-027 A NALYSIS AS A PPLICATIONS : Q UICK INTRODUCTION TO LOCKFILES, Matthew Feickert doi.org/10.25080/majora-212e5952-028 S CHOLARSHIP R ECIPIENTS A MAN G OEL, University of Delhi A NURAG S AHA R OY, Saarland University I SURU F ERNANDO, University of Illinois at Urbana Champaign K ELLY M EEHAN, US Forest Service K ADAMBARI D EVARAJAN, University of Rhode Island K RISHNA K ATYAL, Thapar Institute of Engineering and Technology M ATTHEW M URRAY, Dask N AMAN G ERA, Sympy, LPython R OHIT G OSWAMI, University of Iceland S IMON C ROSS, QuTIP TANYA A KUMU, IBM Research Z UHAL C AKIR, Purdue University C ONTENTS The Advanced Scientific Data Format (ASDF): An Update 1 Perry Greenfield, Edward Slavich, William Jamieson, Nadia Dencheva Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling 7 Nathan Jessurun, Daniel E. Capecci, Olivia P. Dizon-Paradis, Damon L. Woodard, Navid Asadizanjani Galyleo: A General-Purpose Extensible Visualization Solution 13 Rick McGeer, Andreas Bergen, Mahdiyar Biazi, Matt Hemmings, Robin Schreiber USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application 22 Amanda Catlett, Theresa R. Coumbe, Scott D. Christensen, Mary A. Byrant Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI 26 Luigi Cruz, Wael Farah, Richard Elkins Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration 28 Pi-Yueh Chuang, Lorena A. Barba atoMEC: An open-source average-atom Python code 37 Timothy J. 
Callow, Daniel Kotik, Eli Kraisler, Attila Cangi Automatic random variate generation in Python 46 Christoph Baumgarten, Tirth Patel Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger Materials Suite 52 Alexandr Fonari, Farshad Fallah, Michael Rauch A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space 60 Seyed Alireza Vaezi, Gianni Orlando, Mojtaba Fazli, Gary Ward, Silvia Moreno, Shannon Quinn The myth of the normal curve and what to do about it 64 Allan Campopiano Python for Global Applications: teaching scientific Python in context to law and diplomacy students 69 Anna Haensch, Karin Knudson Papyri: better documentation for the scientific ecosystem in Jupyter 75 Matthias Bussonnier, Camille Carvalho Bayesian Estimation and Forecasting of Time Series in statsmodels 83 Chad Fulton Python vs. the pandemic: a case study in high-stakes software development 90 Cliff C. Kerr, Robyn M. Stuart, Dina Mistry, Romesh G. Abeysuriya, Jamie A. Cohen, Lauren George, Michał Jastrzebski, Michael Famulare, Edward Wenger, Daniel J. Klein Pylira: deconvolution of images in the presence of Poisson noise 98 Axel Donath, Aneta Siemiginowska, Vinay Kashyap, Douglas Burke, Karthik Reddy Solipuram, David van Dyk Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels 105 Geoffrey M. Poore Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder 110 Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn Awkward Packaging: building Scikit-HEP 115 Henry Schreiner, Jim Pivarski, Eduardo Rodrigues Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber 121 Ido Michael Likeness: a toolkit for connecting the social fabric of place to human dynamics 125 Joseph V. Tuccillo, James D. Gaboardi poliastro: a Python library for interactive astrodynamics 136 Juan Luis Cano Rodrı́guez, Jorge Martı́nez Garrido A New Python API for Webots Robotics Simulations 147 Justin C. Fisher pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling 152 Jyotika Singh Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2 159 Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri Global optimization software library for research and education 167 Nadia Udler Temporal Word Embeddings Analysis for Disease Prevention 171 Nathan Jacobi, Ivan Mo, Albert You, Krishi Kishore, Zane Page, Shannon P. Quinn, Tim Heckman Design of a Scientific Data Analysis Support Platform 179 Nathan Martindale, Jason Hite, Scott Stewart, Mark Adams The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python Ecosystem 187 Orhan Eroglu, Anissa Zacharias, Michaela Sizemore, Alea Kootz, Heather Craker, John Clyne popmon: Analysis Package for Dataset Shift Detection 194 Simon Brugman, Tomas Sostak, Pradyot Patil, Max Baak pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe 202 Willy Menacho, Gonzalo Marcelo Ramı́rez-Ávila, Horacio V. 
Guzman Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods 210 Robert Jackson, Rebecca Gjini, Sri Hari Krishna Narayanan, Matt Menickelly, Paul Hovland, Jan Hückelheim, Scott Collis RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible 217 João Lemes Gribel Soares, Mateus Stano Junqueira, Oscar Mauricio Prada Ramirez, Patrick Sampaio dos Santos Brandão, Adriano Augusto Antongiovanni, Guilherme Fernandes Alves, Giovani Hidalgo Ceotto Wailord: Parsers and Reproducibility for Quantum Chemistry 226 Rohit Goswami Variational Autoencoders For Semi-Supervised Deep Metric Learning 231 Nathan Safir, Meekail Zain, Curtis Godwin, Eric Miller, Bella Humphrey, Shannon P Quinn A Python Pipeline for Rapid Application Development (RAD) 240 Scott D. Christensen, Marvin S. Brown, Robert B. Haehnel, Joshua Q. Church, Amanda Catlett, Dallon C. Schofield, Quyen T. Brannon, Stacy T. Smith Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses 244 W. Scott Shambaugh Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis 251 Zachary del Rosario Low Level Feature Extraction for Cilia Segmentation 259 Meekail Zain, Eric Miller, Shannon P Quinn, Cecilia Lo PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1 The Advanced Scientific Data Format (ASDF): An Update Perry Greenfield‡∗ , Edward Slavich‡† , William Jamieson‡† , Nadia Dencheva‡† F Abstract—We report on progress in developing and extending the new (ASDF) by outlining our near term plans for further improvements and format we have developed for the data from the James Webb and Nancy Grace extensions. Roman Space Telescopes since we reported on it at a previous Scipy. While the format was developed as a replacement for the long-standard FITS format Summary of Motivations used in astronomy, it is quite generic and not restricted to use with astronomical • Suitable as an archival format: data. We will briefly review the format, and extensions and changes made to the standard itself, as well as to the reference Python implementation we have – Old versions continue to be supported by developed to support it. The standard itself has been clarified in a number libraries. of respects. Recent improvements to the Python implementation include an – Format is sufficiently transparent (e.g., not improved framework for conversion between complex Python objects and ASDF, requiring extensive documentation to de- better control of the configuration of extensions supported and versioning of extensions, tools for display and searching of the structured metadata, bet- code) for the fundamental set of capabili- ter developer documentation, tutorials, and a more maintainable and flexible ties. schema system. This has included a reorganization of the components to make – Metadata is easily viewed with any text the standard free from astronomical assumptions. A important motivator for the editor. format was the ability to support serializing functional transforms in multiple dimensions as well as expressions built out of such transforms, which has now • Intrinsically hierarchical been implemented. More generalized compression schemes are now enabled. • Avoids duplication of shared items We are currently working on adding chunking support and will discuss our plan • Based on existing standard(s) for metadata and structure for further enhancements. • No tight constraints on attribute lengths or their values. 
• Clearly versioned Index Terms—data formats, standards, world coordinate systems, yaml • Supports schemas for validating files for basic structure and value requirements • Easily extensible, both for the standard, and for local or Introduction domain-specific conventions. The Advanced Scientific Data Format (ASDF) was originally developed in 2015. That original version was described in a paper Basics of ASDF Format [Gre15]. That paper described the shortcomings of the widely used • Format consists of a YAML header optionally followed by astronomical standard format FITS [FIT16] as well as those of one or more binary blocks for containing binary data. existing potential alternatives. It is not the goal of this paper to • The YAML [http://yaml.org] header contains all the meta- rehash those points in detail, though it is useful to summarize the data and defines the structural relationship of all the data basic points here. The remainder of this paper will describe where elements. we are using ASDF, what lessons we have learned from using • YAML tags are used to indicate to libraries the semantics ASDF for the James Webb Space Telescope, and summarize the of subsections of the YAML header that libraries can use to most important changes we have made to the standard, the Python construct special software objects. For example, a tag for library that we use to read and write ASDF files, and best practices a data array would indicate to a Python library to convert for using the format. it into a numpy array. We will give an example of a more advanced use case that • YAML anchors and alias are used to share common ele- illustrates some of the powerful advantages of ASDF, and that ments to avoid duplication. its application is not limited to astronomy, but suitable for much • JSON Schema [http://json-schema.org/specification.html], of scientific and engineering data, as well as models. We finish [http://json-schema.org/understanding-json-schema/] is used for schemas to define expectations for tag content * Corresponding author: perry@stsci.edu and whole headers combined with tools to validate actual ‡ Space Telescope Science Institute † These authors contributed equally. ASDF files against these schemas. • Binary blocks are referenced in the YAML to link binary Copyright © 2022 Perry Greenfield et al. This is an open-access article data to YAML attributes. distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • Support for arrays embedded in YAML or in a binary provided the original author and source are credited. block. 2 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) • Streaming support for a single binary block. Changes for 1.6 • Permit local definitions of tags and schemas outside of the Addition of the manifest mechanism standard. • While developed for astronomy, useful for general scien- The manifest is a YAML document that explicitly lists the tags and tific or engineering use. other features introduced by an extension to the ASDF standard. • Aims to be language neutral. It provides a more straightforward way of associating tags with schemas, allowing multiple tags to share the same schema, and generally making it simpler to visualize how tags and schemas Current and planned uses are associated (previously these associations were implied by the James Webb Space Telescope (JWST) Python implementation but were not documented elsewhere). 
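To make the YAML-header-plus-binary-block layout described above concrete, the following minimal sketch uses the reference Python library; the file name and tree contents are arbitrary examples, not taken from any mission data.

import asdf
import numpy as np

# An arbitrary example tree: the nested mapping is written into the
# YAML header, while the NumPy array is stored in a binary block.
tree = {
    "meta": {"instrument": "example", "exposure_time": 100.0},
    "data": np.arange(64, dtype="float32").reshape(8, 8),
}

af = asdf.AsdfFile(tree)
af.write_to("example.asdf")

with asdf.open("example.asdf") as af:
    print(af["meta"]["exposure_time"])   # metadata read from the YAML header
    data = np.asarray(af["data"])        # array loaded from the binary block
    print(data.shape, data.dtype)
    af.info()                            # display the hierarchical structure

Opening the resulting file in a text editor shows the metadata as ordinary YAML, which is one of the archival properties listed above.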
NASA requires JWST data products be made available in the FITS format. Nevertheless, all the calibration pipelines operate Handling of null values and their interpretation on the data using an internal objects very close to the the ASDF The standard didn’t previously specify the behavior regarding null representation. The JWST calibration pipeline uses ASDF to values. The Python library previously removed attributes from the serialize data that cannot be easily represented in FITS, such as YAML tree when the corresponding Python attribute has a None World Coordinate System information. The calibration software value upon writing to an ADSF file. On reading files where the is also capable of reading and producing data products as pure attribute was missing but the schema indicated a default value, ASDF files. the library would create the Python attribute with the default. As mentioned in the next item, we no longer use this mechanism, and Nancy Grace Roman Space Telescope now when written, the attribute appears in the YAML tree with This telescope, with the same mirror size as the Hubble Space a null value if the Python value is None and the schema permits Telescope (HST), but a much larger field of view than HST, will null values. be launched in 2026 or thereabouts. It is to be used mostly in survey mode and is capable of producing very large mosaicked Interpretation of default values in schema images. It will use ASDF as its primary data format. The use of default values in schemas is discouraged since the Daniel K Inoue Solar Telescope interpretation by libraries is prone to confusion if the assemblage This telescope is using ASDF for much of the early data products of schemas conflict with regard to the default. We have stopped to hold the metadata for a combined set of data which can involve using defaults in the Python library and recommend that the ASDF many thousands of files. Furthermore, the World Coordinate file always be explicit about the value rather than imply it through System information is stored using ASDF for all the referenced the schema. If there are practical cases that preclude always data. writing out all values (e.g., they are only relevant to one mode and usually are irrelevant), it should be the library that manages whether such attributes are written conditionally rather using the Vera Rubin Telescope (for World Coordinate System interchange) schema default mechanism. There have been users outside of astronomy using ASDF, as well as contributors to the source code. Add alternative tag URI scheme We now recommend that tag URIs begin with asdf:// Changes to the standard (completed and proposed) These are based on lessons learned from usage. Be explicit about what kind of complex YAML keys are supported The current version of the standard is 1.5.0 (1.6.0 being developed). For example, not all legal YAML keys are supported. Namely The following items reflect areas where we felt improvements YAML arrays, which are not hashable in Python. Likewise, were needed. general YAML objects are not either. The Standard now limits keys to string, integer, or boolean types. If more complex keys are Changes for 1.5 required, they should be encoded in strings. Moving the URI authority from stsci.edu to asdf-format.org Still to be done This is to remove the standard from close association with STScI Upgrade to JSON Schema draft-07 and make it clear that the format is not intended to be controlled There is interest in some of the new features of this version, by one institution. 
however, this is problematic since there are aspects of this version that are incompatible with draft-04, thus requiring all previous Moving astronomy-specific schemas out of standard schemas to be updated. These primarily affect the previous inclusion of World Coordinate Tags, which are strongly associated with astronomy. Remaining Replace extensions section of file history are those related to time and unit standards, both of obvious gen- erality, but the implementation must be based on some standards, This section is considered too specific to the concept of Python and currently the astropy-based ones are as good or better than extensions, and is probably best replaced with a more flexible any. system for listing extensions used. THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE 3 Changes to Python ASDF package Easier and more flexible mechanism to create new extensions (2.8.0) The previous system for defining extensions to ASDF, now deprecated, has been replaced by a new system that makes the association between tags, schemas, and conversion code more straightforward, as well as providing more intuitive names for the methods and attributes, and makes it easier to handle reference cycles if they are present in the code (also added to the original Tag handling classes). Introduced global configuration mechanism (2.8.0) This reworks how ASDF resources are located, and makes it easier to update the current configuration, as well as track down the location of the needed resources (e.g., schemas and converters), as well as removing performance issues that previously required extracting information from all the resource files thus slowing the Fig. 1: A plot of the compound model defined in the first segment of first asdf.open call. code. Added info/search methods and command line tools (2.6.0) These allow displaying the hierarchical structure of the header and file. This is made possible by the fact that expressions of models the values and types of the attributes. Initially, such introspection are straightforward to represent in YAML structure. stopped at any tagged item. A subsequent change provides mech- Despite the fact that the models are in some sense executable, anisms to see into tagged items (next item). An example of these they are perfectly safe so long as the library they are implemented tools is shown in a later section. in is safe (e.g., it doesn’t implement an "execute any OS com- mand" model). Furthermore, the representation in ASDF does not Added mechanism for info to display tagged item contents (2.9.0) explicitly use Python code. In principle it could be written or read This allows the library that converts the YAML to Python objects in any computer language. to expose a summary of the contents of the object by supplying The following illustrates a relatively simple but not trivial an optional "dunder" method that the info mechanism can take example. advantage of. First we define a 1D model and plot it. import numpy as np Added documentation on how ASDF library internals work import astropy.modeling.models as amm import astropy.units as u These appear in the readthedocs under the heading "Developer import asdf Overview". from matplotlib import pyplot as plt # Define 3 model components with units Plugin API for block compressors (2.8.0) g1 = amm.Gaussian1D(amplitude=100*u.Jy, This enables a localized extension to support further compression mean=120*u.MHz, options. 
stddev=5.*u.MHz) g2 = amm.Gaussian1D(65*u.Jy, 140*u.MHz, 3*u.MHz) powerlaw = amm.PowerLaw1D(amplitude=10*u.Jy, Support for asdf:// URI scheme (2.8.0) x_0=100*u.MHz, Support for ASDF Standard 1.6.0 (2.8.0) alpha=3) # Define a compound model This is still subject to modifications to the 1.6.0 standard. model = g1 + g2 + powerlaw x = np.arange(50, 200) * u.MHz Modified handling of defaults in schemas and None values (2.8.0) plt.plot(x, model(x)) As described previously. The following code will save the model to an ASDF file, and read it back in Using ASDF to store models af = asdf.AsdfFile() af.tree = {'model': model} This section highlights one aspect of ASDF that few other formats af.write_to('model.asdf') support in an archival way, e.g., not using a language-specific af2 = asdf.open('model.asdf') model2 = af2['model'] mechanism, such as Python’s pickle. The astropy package contains model2 is model a modeling subpackage that defines a number of analytical, as well False as a few table-based, models that can be combined in many ways, model2(103.5) == model(103.5) such as arithmetically, in composition, or multi-dimensional. Thus True it is possible to define fairly complex multi-dimensional models, Listing the relevant part of the ASDF file illustrates how the model many of which can use the built in fitting machinery. has been saved in the YAML header (reformatted to fit in this paper These models, and their compound constructs can be saved column). in ASDF files and later read in to recreate the corresponding model: !transform/add-1.2.0 astropy objects that were used to create the entries in the ASDF forward: 4 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) - !transform/add-1.2.0 something that the FITS format had no hope of managing, nor any forward: other scientific format that we are aware of. - !transform/gaussian1d-1.0.0 amplitude: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 Jy, value: 100.0} Displaying the contents of ASDF files bounding_box: - !unit/quantity-1.1.0 Functionality has been added to display the structure and content {unit: !unit/unit-1.0.0 MHz, value: 92.5} of the header (including data item properties), with a number of - !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 147.5} options of what depth to display, how many lines to display, etc. bounds: An example of the info use is shown in Figure 2. stddev: [1.1754943508222875e-38, null] There is also functionality to search for items in the file by inputs: [x] mean: !unit/quantity-1.1.0 attribute name and/or values, also using pattern matching for {unit: !unit/unit-1.0.0 MHz, value: 120.0} either. The search results are shown as attribute paths to the items outputs: [y] that were found. stddev: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 5.0} - !transform/gaussian1d-1.0.0 ASDF Extension/Converter System amplitude: !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 Jy, value: 65.0} There are a number of components that are involved. Converters bounding_box: encapsulate the code that handles converting Python objects to - !unit/quantity-1.1.0 {unit: !unit/unit-1.0.0 MHz, value: 123.5} and from their ASDF representation. 
These are classes that inherit - !unit/quantity-1.1.0 from the basic Converter class and define two Class attributes: {unit: !unit/unit-1.0.0 MHz, value: 156.5} tags, types each of which is a list of associated tag(s) and class(es) bounds: that the specific converter class will handle (each converter can stddev: [1.1754943508222875e-38, null] inputs: [x] handle more than one tag type and more than one class). The mean: !unit/quantity-1.1.0 ASDF machinery uses this information to map tags to converters {unit: !unit/unit-1.0.0 MHz, value: 140.0} when reading ASDF content, and to map types to converters when outputs: [y] stddev: !unit/quantity-1.1.0 saving these objects to an ASDF file. {unit: !unit/unit-1.0.0 MHz, value: 3.0} Each converter class is expected to supply two methods: inputs: [x] to_yaml_tree and from_yaml_tree that construct the outputs: [y] YAML content and convert the YAML content to Python class - !transform/power_law1d-1.0.0 alpha: 3.0 instances respectively. amplitude: !unit/quantity-1.1.0 A manifest file is used to associate tags and schema ID’s {unit: !unit/unit-1.0.0 Jy, value: 10.0} so that if a schema has been defined, that the ASDF content inputs: [x] outputs: [y] can be validated against the schema (as well as providing extra x_0: !unit/quantity-1.1.0 information for the ASDF content in the info command). Normally {unit: !unit/unit-1.0.0 MHz, value: 100.0} the converters and manifest are registered with the ASDF library inputs: [x] using standard functions, and this registration is normally (but is outputs: [y] ... not required to be) triggered by use of Python entry points defined in the setup.cfg file so that this extension is automatically Note that there are extra pieces of information that define the recognized when the extension package is installed. model more precisely. These include: One can of course write their own custom code to convert the contents of ASDF files however they want. The advantage of the • many tags indicating special items. These include different tag/converter system is that the objects can be anywhere in the tree kinds of transforms (i.e., functions), quantities (i.e., num- structure and be properly saved and recovered without having any bers with units), units, etc. implied knowledge of what attribute or location the object is at. • definitions of the units used. Furthermore, it brings with it the ability to validate the contents • indications of the valid range of the inputs or parameters by use of schema files. (bounds) Jupyter tutorials that show how to use converters can be found • each function shows the mapping of the inputs and the at: naming of the outputs of each function. • the addition operator is itself a transform. • https://github.com/asdf-format/tutorials/blob/master/ Your_first_ASDF_converter.ipynb Without the use of units, the YAML would be simpler. But • https://github.com/asdf-format/tutorials/blob/master/ the point is that the YAML easily accommodates expression trees. Your_second_ASDF_converter.ipynb The tags are used by the library to construct the astropy models, units and quantities as Python objects. However, nothing in the above requires the library to be written in Python. ASDF Roadmap for STScI Work This machinery can handle multidimensional models and sup- The planned enhancements to ASDF are understandably focussed ports both the combining of models with arithmetic operators as on the needs of STScI missions. Nevertheless, we are particularly well as pipelining the output of one model into another. 
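The following is a minimal sketch of the converter pattern described above, not code from the asdf package itself: the Rectangle class and the example.org tag and extension URIs are invented for illustration, and a real extension would normally also ship a schema and register itself through an entry point rather than at runtime.

import asdf
from asdf.extension import Converter, Extension

class Rectangle:
    # Hypothetical user object to be serialized to ASDF.
    def __init__(self, width, height):
        self.width = width
        self.height = height

class RectangleConverter(Converter):
    # Tag(s) this converter reads and Python type(s) it writes.
    tags = ["asdf://example.org/tags/rectangle-1.0.0"]
    types = [Rectangle]

    def to_yaml_tree(self, obj, tag, ctx):
        # Called on write: return plain YAML-serializable content.
        return {"width": obj.width, "height": obj.height}

    def from_yaml_tree(self, node, tag, ctx):
        # Called on read: rebuild the Python object from the YAML node.
        return Rectangle(node["width"], node["height"])

class RectangleExtension(Extension):
    extension_uri = "asdf://example.org/extensions/rectangle-1.0.0"
    tags = ["asdf://example.org/tags/rectangle-1.0.0"]
    converters = [RectangleConverter()]

# Register for this session; an installed package would normally do this
# through an asdf.extensions entry point instead.
asdf.get_config().add_extension(RectangleExtension())

asdf.AsdfFile({"shape": Rectangle(3.0, 4.0)}).write_to("rectangle.asdf")

Because the converter is looked up by type when writing and by tag when reading, the Rectangle object can sit anywhere in the tree and still round-trips without the caller knowing its location.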
This interested in areas that have wider benefit to the general scientific system has been used to define complex coordinate transforms and engineering community, and such considerations increase the from telescope detectors to sky coordinates for imaging, and priority of items necessary to STScI. Furthermore, we are eager wavelengths for spectrographs, using over 100 model components, to aid others working on ASDF by providing advice, reviews, and THE ADVANCED SCIENTIFIC DATA FORMAT (ASDF): AN UPDATE 5 Fig. 2: This shows part of the output of the info command that shows the structure of a Roman Space Telescope test file (provided by the Roman Telescopes Branch at STScI). Displayed is the relative depth of the item, its type, value, and a title extracted from the associated schema to be used as explanatory information. possibly collaborative coding effort. STScI is committed to the Redefining versioning semantics long-term support of ADSF. Previously the meaning of different levels of versioning The following is a list of planned work, in order of decreasing were unclear. The normal inclination is to treat schema priority. version using the typical semantic versioning system de- fined for software. But schemas are not software and Chunking Support we are inclined to use the proposed system for schemas [url: https://snowplowanalytics.com/blog/2014/05/13/introducing- Since the Roman mission is expected to deal with large data schemaver-for-semantic-versioning-of-schemas/] To summarize: sets and mosaicked images, support for chunking is considered in this case the three levels of versioning correspond to: essential. We expect to layer the support in our Python library Model.Revision.Addition where a schema change: on zarr [https://zarr.dev/], with two different representations, one where all data is contained within the ADSF file in separate • [Model] prevents working with historical data blocks, and one where the blocks are saved in individual files. • [Revision] may prevent working with historical data Both representations have important advantages and use cases. • [Addition] is compatible with all historical data Integration into astronomy display tools Improvements to binary block management It is essential that astronomers be able to visualize the data These enhancements are needed to enable better chunking support contained within ASDF files conveniently using the commonly and other capabilities. available tool, such as SAOImage DS9 [Joy03] and Ginga [Jes13]. 6 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Cloud optimized storage [McK10] W. McKinney. Data structures for statistical computing in python, Proceedigns of the 9th Python in Science Conference, p56-61, 2010. Much of the future data processing operations for STScI are https://doi.org/10.25080/Majora-92bf1922-00a expected to be performed on the cloud, so having ASDF efficiently [Pen09] W. Pence, R. Seaman, R. L. White, Lossless Astronomical Image support such uses is important. An important element of this is Compression and the Effects of Noise, Publications of the Astro- making the format work efficiently with object storage services nomical Society of the Pacific, 121:414-427, April 2009. https: //doi.org/10.48550/arXiv.0903.2140 such as AWS S3 and Google Cloud Storage. [Pen10] W. Pence, R. L. White, R. Seaman. Optimal Compression of Floating- Point Astronomical Images Without Significant Loss of Information, IDL support Publications of the Astronomical Society of the Pacific, 122:1065- 1076, September 2010. 
https://doi.org/10.1086/656249 While Python is rapidly surpassing the use of IDL in astronomy, [Joy03] W. A. Joye, E. Mandel. New Features of SAOImage DS9, Astronomi- there is still much IDL code being used, and many of those still cal Data Analysis Software and Systems XII ASP Conference Series, using IDL are in more senior and thus influential positions (they 295:489, 2003. aren’t quite dead yet). So making ASDF data at least readable to IDL is a useful goal. Support Rice compression Rice compression [Pen09], [Pen10] has proven a useful lossy compression algorithm for astronomical imaging data. Supporting it will be useful to astronomers, particularly for downloading large imaging data sets. Pandas Dataframe support Pandas [McK10] has proven to be a useful tool to many as- tronomers, as well as many in the sciences and engineering, so support will enhance the uptake of ASDF. Compact, easy-to-read schema summaries Most scientists and even scientific software developers tend to find JSON Schema files tedious to interpret. A more compact, and intuitive rendering of the contents would be very useful. Independent implementation Having ASDF accepted as a standard data format requires a library that is divorced from a Python API. Initially this can be done most easily by layering it on the Python library, but ultimately there should be an independent implementation which includes support for C/C++ wrappers. This is by far the item that will require the most effort, and would benefit from outside involvement. Provide interfaces to other popular packages This is a catch all for identifying where there would be significant advantages to providing the ability to save and recover information in the ASDF format as an interchange option. Sources of Information • ASDF Standard: https://asdf-standard.readthedocs.io/en/ latest/ • Python ASDF package documentation: https://asdf. readthedocs.io/en/stable/ • Repository: https://github.com//asdf-format/asdf • Tutorials: https://github.com/asdf-format/tutorials R EFERENCES [Gre15] P. Greenfield, M. Droettboom, E. Bray. ASDF: A new data format for astronomy, Astronomy and Computing, 12:240-251, September 2015. https://doi.org/10.1016/j.ascom.2015.06.004 [FIT16] FITS Working Group. Definition of the Flexible Image Transport System, International Astronomical Union, http://fits.gsfc.nasa.gov/ fits_standard.html, July 2016. [Jes13] E. Jeschke. Ginga: an open-source astronomical image viewer and toolkit, Proc. of the 12th Python in Science Conference., p58- 64,January 2013. https://doi.org/10.25080/Majora-8b375195-00a PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 7 Semi-Supervised Semantic Annotator (S3A): Toward Efficient Semantic Labeling Nathan Jessurun‡∗ , Daniel E. Capecci‡ , Olivia P. Dizon-Paradis‡ , Damon L. Woodard‡ , Navid Asadizanjani‡ F Abstract—Most semantic image annotation platforms suffer severe bottlenecks when handling large images, complex regions of interest, or numerous distinct foreground regions in a single image. We have developed the Semi-Supervised Semantic Annotator (S3A) to address each of these issues and facilitate rapid collection of ground truth pixel-level labeled data. Such a feat is accomplished through a robust and easy-to-extend integration of arbitrary python image pro- cessing functions into the semantic labeling process. Importantly, the framework devised for this application allows easy visualization and machine learning prediction of arbitrary formats and amounts of per-component metadata. 
To our knowledge, the ease and flexibility offered are unique to S3A among all open- source alternatives. Index Terms—Semantic annotation, Image labeling, Semi-supervised, Region of interest Introduction Labeled image data is essential for training, tuning, and evaluating Fig. 1. Common use cases for semantic segmentation involve relatively few fore- ground objects, low-resolution data, and limited complexity per object. Images the performance of many machine learning applications. Such retrieved from https://cocodataset.org/#explore. labels are typically defined with simple polygons, ellipses, and bounding boxes (i.e., "this rectangle contains a cat"). However, this approach can misrepresent more complex shapes with holes and greatly hinders scalability. As such, several tools have been or multiple regions as shown later in Figure 9. When high accuracy proposed to alleviate the burden of collecting these ground-truth is required, labels must be specified at or close to the pixel-level labels [itL18]. Unfortunately, existing tools are heavily biased - a process known as semantic labeling or semantic segmentation. toward lower-resolution images with few regions of interest (ROI), A detailed description of this process is given in [CZF+ 18]. similar to Figure 1. While this may not be an issue for some Examples can readily be found in several popular datasets such datasets, such assumptions are crippling for high-fidelity images as COCO, depicted in Figure 1. with hundreds of annotated ROIs [LSA+ 10], [WYZZ09]. Semantic segmentation is important in numerous domains With improving hardware capabilities and increasing need for including printed circuit board assembly (PCBA) inspection (dis- high-resolution ground truth segmentation, there are a continu- cussed later in the case study) [PJTA20], [AML+ 19], quality ally growing number of applications that require high-resolution control during manufacturing [FRLL18], [AVK+ 01], [AAV+ 02], imaging with the previously described characteristics [MKS18], manuscript restoration / digitization [GNP+ 04], [KBO16], [JB92], [DS20]. In these cases, the existing annotation tooling greatly [TFJ89], [FNK92], and effective patient diagnosis [SKM+ 10], impacts productivity due to the previously referenced assumptions [RLO+ 17], [YPH+ 06], [IGSM14]. In all these cases, imprecise and lack of support [Spa20]. annotations severely limit the development of automated solutions In response to these bottlenecks, we present the Semi- and can decrease the accuracy of standard trained segmentation Supervised Semantic Annotation (S3A) annotation and prototyping models. platform -- an application which eases the process of pixel-level Quality semantic segmentation is difficult due to a reliance on labeling in large, complex scenes.1 Its graphical user interface is large, high-quality datasets, which are often created by manually shown in Figure 2. The software includes live app-level property labeling each image. Manual annotation is error-prone, costly, customization, real-time algorithm modification and feedback, region prediction assistance, constrained component table editing * Corresponding author: njessurun@ufl.edu ‡ University of Florida based on allowed data types, various data export formats, and a highly adaptable set of plugin interfaces for domain-specific exten- Copyright © 2022 Nathan Jessurun et al. This is an open-access article sions to S3A. 
Beyond software improvements, these features play distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, significant roles in bridging the gap between human annotation provided the original author and source are credited. efforts and scalable, automated segmentation methods [BWS+ 10]. 8 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Improve Semi- segmentation supervised techniques labeling Update Generate models training data Fig. 3. S3A’s can iteratively annotate, evaluate, and update its internals in real- time. to specify (but can be modified or customized if desired). As a re- sult, incorporating additional/customized application functionality can require as little as one line of code. Processes interface with Fig. 2. S3A’s interface. The main view consists of an image to annotate, a PyQtGraph parameters to gain access to data-customized widget component table of prior annotations, and a toolbar which changes functionality types and more (https://github.com/pyqtgraph/pyqtgraph). depending on context. These processes can also be arbitrarily nested and chained, which is critical for developing hierarchical image processing models, an example of which is shown in Figure 4. This frame- work is used for all image and region processing within S3A. Note that for image processes, each portion of the hierarchy yields Application Overview intermediate outputs to determine which stage of the process flow Design decisions throughout S3A’s architecture have been driven is responsible for various changes. This, in turn, reduces the by the following objectives: effort required to determine which parameters must be adjusted to achieve optimal performance. • Metadata should have significance rather than be treated as an afterthought, Plugins for User Extensions • High-resolution images should have minimal impact on the annotation workflow, The previous section briefly described how custom user functions • ROI density and complexity should not limit annotation are easily wrapped within a process, exposing its parameters workflow, and within S3A in a GUI format. A rich plugin interface is built on top • Prototyping should not be hindered by application com- of this capability in which custom functions, table field predictors, plexity. default action hooks, and more can be directly integrated into S3A. In all cases, only a few lines of code are required to achieve most These motives were selected upon noticing the general lack integrations between user code and plugin interface specifications. of solutions for related problems in previous literature and tool- The core plugin infrastructure consists of a function/property reg- ing. Moreover, applications that do address multiple aspects of istration mechanism and an interaction window that shows them complex region annotation often require an enterprise service and in the UI. As such, arbitrary user functions can be "registered" in cannot be accessed under open-source policies. one line of code to a plugin, where it will be effectively exposed to While the first three points are highlighted in the case study, the user within S3A. 
A trivial example is depicted in Figure 5, but the subsections below outline pieces of S3A’s architecture that more complex behavior such as OCR integration is possible with prove useful for iterative algorithm prototyping and dataset gen- similar ease (see this snippet for an implementation leveraging eration as depicted in Figure 3. Note that beyond the facets easyocr). illustrated here, S3A possesses multiple additional characteris- Plugin features are heavily oriented toward easing the pro- tics as outlined in its documentation (https://gitlab.com/s3a/s3a/- cess of automation both for general annotation needs and niche /wikis/docs/User’s-Guide). datasets. In either case, incorporating existing library functions is converted into a trivial task directly resulting in lower annotation Processing Framework time and higher labeling accuracy. At the root of S3A’s functionality and configurability lies its adaptive processing framework. Functions exposed within S3A are Adaptable I/O thinly wrapped using a Process structure responsible for parsing An extendable I/O framework allows annotations to be used in signature information to provide documentation, parameter infor- a myriad of ways. Out-of-the-box, S3A easily supports instance- mation, and more to the UI. Hence, all graphical depictions are level segmentation outputs, facilitating deep learning model train- abstracted beyond the concern of the user while remaining trivial ing. As an example, Figure 6 illustrates how each instance in the image becomes its own pair of image and mask data. When several 1. A preliminary version was introduced in an earlier publication [JPRA20], but significant changes to the framework and tool capabilities have been instances overlap, each is uniquely distinguishable depending employed since then. on the characteristic of their label field. Particularly helpful for SEMI-SUPERVISED SEMANTIC ANNOTATOR (S3A): TOWARD EFFICIENT SEMANTIC LABELING 9 Fig. 4. Outputs of each processing stage can be quickly viewed in context after an iteration of annotating. Upon inspecting the results, it is clear the failure point is a low k value during K-means clustering and segmentation. The woman’s shirt is not sufficiently distinguishable from the background palette to denote a separate entity. The red dot is an indicator of where the operator clicked during annotation. from qtpy import QtWidgets from s3a import ( S3A, __main__, RandomToolsPlugin, ) def hello_world(win: S3A): QtWidgets.QMessageBox.information( win, "Hello World", "Hello World!" ) RandomToolsPlugin.deferredRegisterFunc( hello_world ) __main__.mainCli() Fig. 6. Multiple export formats exist, among which is a utility that crops com- ponents out of the image, optionally padding with scene pixels and resizing to Fig. 5. Simple standalone functions can be easily exposed to the user through ensure all shapes are equal. Each sub-image and mask is saved accordingly, the random tools plugin. Note that if tunable parameters were included in the which is useful for training on multiple forms of machine learning models. function signature, pressing "Open Tools" (the top menu option) allows them to be altered. binations for functions outside S3A in the event they are utilized in a different framework. models with fixed input sizes, these exports can optionally be forced to have a uniform shape (e.g., 512x512 pixels) while main- taining their aspect ratio. 
This is accomplished by incorporating Case Study additional scene pixels around each object until the appropriate Both the inspiration and developing efforts for S3A were initially size is obtained. Models trained on these exports can be directly driven by optical printed circuit board (PCB) assurance needs. plugged back into S3A’s processing framework, allowing them In this domain, high-resolution images can contain thousands to generate new annotations or refine preliminary user efforts. of complex objects in a scene, as seen in Figure 7. Moreover, The described I/O framework is also heavily modularized such numerous components are not representable by cardinal shapes that custom dataset specifications can easily be incorporated. In such as rectangles, circles, etc. Hence, high-count polygonal this manner, future versions of S3A will facilitate interoperability regions dominated a significant portion of the annotated regions. with popular formats such as COCO and Pascal VOC [LMB+ 14], The computational overhead from displaying large images and [EGW+ 10]. substantial numbers of complex regions either crashed most anno- tation platforms or prevented real-time interaction. In response, S3A was designed to fill the gap in open-source annotation Deep, Portable Customizability platforms that addressed each issue while requiring minimal setup Beyond the features previously outlined, S3A provides numerous and allowing easy prototyping of arbitrary image processing tasks. avenues to configure shortcuts, color schemes, and algorithm The subsections below describe how the S3A labeling platform workflows. Several examples of each can be seen in the user was utilized to collect a large database of PCB annotations along guide. Most customizable components prototyped within S3A can with their associated metadata2 . also be easily ported to external workflows after development. Hierarchical processes have states saved in YAML files describing Large Images with Many Annotations all parameters, which can be reloaded to create user profiles. In optical PCB assurance, one method of identifying component Alternatively, these same files can describe ideal parameter com- defects is to localize and characterize all objects in the image. Each 10 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 8. Regardless of total image size and number of annotations, Python processing is be limited to the ROI or viewbox size for just the selected object based on user preferences. The depiction shows Grab Cut operating on a user- defined initial region within a much larger (8000x6000) image. The resulting Fig. 7. Example PCB segmentation. In contrast to typical semgentation tasks, region was available in 1.94 seconds on low-grade hardware. the scene contains over 4,000 objects with numerous complex shapes. component can then be cross-referenced against genuine proper- ties such as length/width, associated text, allowed orientations, etc. However, PCB surfaces can contain hundreds to thousands of components at several magnitudes of size, necessitating high- resolution images for in-line scanning. To handle this problem more generally, S3A separates the editing and viewing experi- Fig. 9. Annotated objects in S3A can incorporate both holes and distinct regions ences. In other words, annotation time is orders of magnitude through a multi-polygon container. 
Holes are represented as polygons drawn on faster since only edits in one region at a time and on a small subset top of existing foreground, and can be arbitrarily nested (i.e. island foreground is of the full image are considered during assisted segmentation. All also possible). other annotations are read-only until selected for alteration. For instance, Figure 8 depicts user inputs on a small ROI out of a key performance improvement when thousands of regions (each much larger image. The resulting component shape is proposed with thousands of points) are in the same field of view. When within seconds and can either be accepted or modified further by low polygon counts are required, S3A also supports RDP polygon the user. While PCB annotations initially inspired this approach, it simplification down to a user-specified epsilon parameter [Ram]. is worth noting that the architectural approach applies to arbitrary domains of image segmentation. Complex Metadata Another key performance improvement comes from resizing Most annotation software support robust implementation of im- the processed region to a user-defined maximum size. For instance, age region, class, and various text tags ("metadata"). However, if an ROI is specified across a large portion of the image but this paradigm makes collecting type-checked or input-sanitized the maximum processing size is 500x500 pixels, the processed metadata more difficult. This includes label categories such as area will be downsampled to a maximum dimension length of object rotation, multiclass specifications, dropdown selections, 500 before intensive algorithms are run. The final output will and more. In contrast, S3A treats each metadata field the same be upsampled back to the initial region size. In this manner, way as object vertices, where they can be algorithm-assisted, optionally sacrificing a small amount of output accuracy can directly input by the user, or part of a machine learning prediction drastically accelerate runtime performance for larger annotated framework. Note that simple properties such as text strings or objects. numbers can be directly input in the table cells with minimal need for annotation assistance3 . In conrast, custom fields can provide Complex Vertices/Semantic Segmentation plugin specifications which allow more advanced user interaction. Multiple types of PCB components possess complex shapes which Finally, auto-populated fields like annotation timestamp or author might contain holes or noncontiguous regions. Hence, it is bene- can easily be constructed by providing a factory function instead ficial for software like S3A to represent these features inherently of default value in the parameter specification. with a ComplexXYVertices object: that is, a collection of This capability is particularly relevant in the field of optical polygons which either describe foreground regions or holes. This PCB assurance. White markings on the PCB surface, known is enabled by thinly wrapping opencv’s contour and hierarchy as silkscreen, indicate important aspects of nearby components. logic. Example components difficult to accomodate with single- Thus, understanding the silkscreen’s orientation, alphanumeric polygon annotation formats are illustrated in Figure 9. characters, associated component, logos present, and more provide At the same time, S3A also supports high-count polygons several methods by which to characterize / identify features with no performance losses. Since region edits are performed by of their respective devices. 
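As a rough sketch of the idea behind this hole-aware representation (generic OpenCV calls, not S3A's actual API), the following recovers foreground polygons and their holes from a binary mask via the contour hierarchy and simplifies each polygon with the Douglas-Peucker (RDP) algorithm at a chosen epsilon; the mask and tolerance are made-up examples.

import cv2
import numpy as np

# Example binary mask: a filled square with a smaller square hole in it.
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(mask, (20, 20), (180, 180), 255, thickness=-1)
cv2.rectangle(mask, (80, 80), (120, 120), 0, thickness=-1)

# RETR_CCOMP yields a two-level hierarchy: outer boundaries and the
# holes nested directly inside them.
contours, hierarchy = cv2.findContours(
    mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE
)

epsilon = 2.0  # user-tunable RDP tolerance, in pixels
for contour, info in zip(contours, hierarchy[0]):
    simplified = cv2.approxPolyDP(contour, epsilon, True)
    is_hole = info[3] != -1  # a parent contour exists, so this is a hole
    kind = "hole" if is_hole else "foreground"
    print(kind, "polygon with", len(simplified), "vertices")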
Complex Metadata

Most annotation software supports robust implementation of image region, class, and various text tags ("metadata"). However, this paradigm makes collecting type-checked or input-sanitized metadata more difficult. This includes label categories such as object rotation, multiclass specifications, dropdown selections, and more. In contrast, S3A treats each metadata field the same way as object vertices, where each can be algorithm-assisted, directly input by the user, or part of a machine learning prediction framework. Note that simple properties such as text strings or numbers can be directly input in the table cells with minimal need for annotation assistance³. In contrast, custom fields can provide plugin specifications which allow more advanced user interaction. Finally, auto-populated fields like annotation timestamp or author can easily be constructed by providing a factory function instead of a default value in the parameter specification; a schematic sketch of this appears below.

This capability is particularly relevant in the field of optical PCB assurance. White markings on the PCB surface, known as silkscreen, indicate important aspects of nearby components. Thus, understanding the silkscreen's orientation, alphanumeric characters, associated component, logos present, and more provides several methods by which to characterize and identify features of their respective devices. Both default and customized input validators were applied to each field using parameter specifications, custom plugins, or simple factories as described above. A summary of the metadata collected for one component is shown in Figure 10.

Fig. 10. Metadata can be collected, validated, and customized with ease. A mix of default properties (strings, numbers, booleans), factories (timestamp, author), and custom plugins (yellow circle representing associated device) are present.
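To make the field-specification idea concrete, the sketch below shows how plain defaults, validators, bounds, and factory functions can coexist in one specification. The dictionary layout, field names, and helper function are invented for illustration; S3A's real specifications are built on PyQtGraph's Parameter machinery (see footnote 3).

```python
from datetime import datetime, timezone
import getpass

# Schematic field specifications: plain defaults, bounded or validated values,
# and factory-generated fields. This layout is invented for illustration and
# is not S3A's parameter-specification format.
FIELD_SPECS = {
    "Designator": {"default": "", "validator": str},
    "Rotation":   {"default": 0.0, "validator": float, "bounds": (0.0, 360.0)},
    "Logo":       {"default": "none", "choices": ["none", "text", "graphic"]},
    # Factories run at annotation time instead of storing a fixed default.
    "Timestamp":  {"factory": lambda: datetime.now(timezone.utc).isoformat()},
    "Author":     {"factory": getpass.getuser},
}

def new_record(**overrides):
    """Build one metadata record, applying defaults, factories, and checks."""
    record = {}
    for name, spec in FIELD_SPECS.items():
        if name in overrides:
            value = spec.get("validator", lambda v: v)(overrides[name])
        elif "factory" in spec:
            value = spec["factory"]()
        else:
            value = spec["default"]
        bounds = spec.get("bounds")
        if bounds and not (bounds[0] <= value <= bounds[1]):
            raise ValueError(f"{name}={value!r} is outside {bounds}")
        if "choices" in spec and value not in spec["choices"]:
            raise ValueError(f"{name}={value!r} is not one of {spec['choices']}")
        record[name] = value
    return record

# Timestamp and Author are filled in automatically for every record.
print(new_record(Designator="C12", Rotation=90.0))
```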
Conclusion and Future Work

The Semi-Supervised Semantic Annotator (S3A) is proposed to address the difficult task of pixel-level annotation of image data. For high-resolution images with numerous complex regions of interest, existing labeling software faces performance bottlenecks attempting to extract ground-truth information. Moreover, there is a lack of capabilities to convert such a labeling workflow into an automated procedure with feedback at every step. Each of these challenges is overcome by various features within S3A specifically designed for such tasks. As a result, S3A provides not only tremendous time savings during ground truth annotation, but also allows an annotation pipeline to be directly converted into a prediction scheme. Furthermore, the rapid feedback accessible at every stage of annotation expedites prototyping of novel solutions to imaging domains in which few examples of prior work exist. Nonetheless, multiple avenues exist for improving S3A's capabilities in each of these areas. Several prominent future goals are highlighted in the following sections.

Dynamic Algorithm Builder

Presently, processing workflows can be specified in a sequential YAML file which describes each algorithm and its respective parameters. However, this is not easy to adapt within S3A, especially by inexperienced annotators. Future iterations of S3A will incorporate graphical flowcharts which make this process drastically more intuitive and provide faster feedback. Frameworks like Orange [DCE+] perform this task well, and S3A would strongly benefit from adding the relevant capabilities.

Image Navigation Assistance

Several aspects of image navigation can be incorporated to simplify the handling of large images. For instance, a "minimap" tool would allow users to maintain a global image perspective while making local edits. Furthermore, this sense of scale aids intuition of how many regions of similar component density, color, etc. exist within the entire image.

Second, multiple strategies for annotating large images leverage a windowing approach, where they divide the total image into several smaller pieces in a gridlike fashion. While this has its disadvantages, it is fast, easy to automate, and produces reasonable results depending on the initial image complexity [VGSG+19]. Hence, these methods would be significantly easier to incorporate into S3A if a generalized windowing framework were available which allows users to specify all necessary parameters such as window overlap, size, sampling frequency, etc. A preliminary version of this is implemented for categorical-based model prediction, but a more robust feature set for interactive segmentation is strongly preferable.

Aggregation of Human Annotation Habits

Several times, it has been noted that manual segmentation of image data is not a feasible or scalable approach for remotely large datasets. However, there are multiple cases in which human intuition can greatly outperform even complex neural networks, depending on the specific segmentation challenge [RLFF15]. For this reason, it would be ideal to capture data points possessing information about the human decision-making process and apply them to images at scale. This may include taking into account human labeling time per class, hesitation between clicks, the relationship between shape boundary complexity and instance quantity, and more. By aggregating such statistics, a pattern may arise which can be leveraged as an additional automated annotation technique.

2. For those curious, the dataset and associated paper are accessible at https://www.trust-hub.org/#/data/pcb-images.
3. For a list of input validators and supported primitive types, refer to PyQtGraph's Parameter documentation.

REFERENCES

[AAV+02] C Anagnostopoulos, I Anagnostopoulos, D Vergados, G Kouzas, E Kayafas, V Loumos, and G Stassinopoulos. High performance computing algorithms for textile quality control. Mathematics and Computers in Simulation, 60(3):389–400, September 2002. doi:10.1016/S0378-4754(02)00031-9.
[AML+19] Mukhil Azhagan, Dhwani Mehta, Hangwei Lu, Sudarshan Agrawal, Mark Tehranipoor, Damon L Woodard, Navid Asadizanjani, and Praveen Chawla. A review on automatic bill of material generation and visual inspection on PCBs. In ISTFA 2019: Proceedings of the 45th International Symposium for Testing and Failure Analysis, page 256. ASM International, 2019.
[AVK+01] C. Anagnostopoulos, D. Vergados, E. Kayafas, V. Loumos, and G. Stassinopoulos. A computer vision approach for textile quality control. The Journal of Visualization and Computer Animation, 12(1):31–44, 2001. doi:10.1002/vis.245.
[BWS+10] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 438–451, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[CZF+18] Qimin Cheng, Qian Zhang, Peng Fu, Conghuan Tu, and Sen Li. A survey and analysis on automatic image annotation. Pattern Recognition, 79:242–259, 2018. doi:10.1016/j.patcog.2018.02.017.
[DCE+] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, and Anže Starič. Orange: Data mining toolbox in Python. 14(1):2349–2353.
[DS20] Polina Demochkina and Andrey V. Savchenko. Improving the accuracy of one-shot detectors for small objects in x-ray images. In 2020 International Russian Automation Conference (RusAutoCon), pages 610–614. IEEE, September 2020. URL: https://ieeexplore.ieee.org/document/9208097/, doi:10.1109/RusAutoCon49822.2020.9208097.
[EGW+10] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010. URL: https://doi.org/10.1007/s11263-009-0275-4, doi:10.1007/s11263-009-0275-4.
[FNK92] H. Fujisawa, Y. Nakano, and K. Kurino. Segmentation methods for character recognition: From segmentation to document structure analysis. Proceedings of the IEEE, 80(7):1079–1092, July 1992. doi:10.1109/5.156471.
(SCIPY 2022) [FRLL18] Max K. Ferguson, Ak Ronay, Yung-Tsun Tina Lee, and Kin- IEEE Transactions on Medical Imaging, 36(2):674–683, Febru- cho. H. Law. Detection and segmentation of manufacturing ary 2017. doi:10.1109/TMI.2016.2621185. defects with convolutional neural networks and transfer learn- [SKM+ 10] Sascha Seifert, Michael Kelm, Manuel Moeller, Saikat Mukher- ing. Smart and sustainable manufacturing systems, 2, 2018. jee, Alexander Cavallaro, Martin Huber, and Dorin Comaniciu. doi:10.1520/SSMS20180033. Semantic annotation of medical images. In Brent J. Liu and [GNP+ 04] Basilios Gatos, Kostas Ntzios, Ioannis Pratikakis, Sergios William W. Boonn, editors, Medical Imaging 2010: Advanced Petridis, T. Konidaris, and Stavros J. Perantonis. A segmentation- PACS-based Imaging Informatics and Therapeutic Applications, free recognition technique to assist old greek handwritten volume 7628, pages 43 – 50. International Society for Optics and manuscript OCR. In Simone Marinai and Andreas R. Dengel, Photonics, SPIE, 2010. URL: https://doi.org/10.1117/12.844207, editors, Document Analysis Systems VI, Lecture Notes in Com- doi:10.1117/12.844207. puter Science, pages 63–74, Berlin, Heidelberg, 2004. Springer. [Spa20] SpaceNet. Multi-Temporal Urban Development Challenge. doi:10.1007/978-3-540-28640-0_7. https://spacenet.ai/sn7-challenge/, June 2020. [IGSM14] D. K. Iakovidis, T. Goudas, C. Smailis, and I. Maglogiannis. [TFJ89] T. Taxt, P.J. Flynn, and A.K. Jain. Segmentation of document Ratsnake: A versatile image annotation tool with application images. IEEE Transactions on Pattern Analysis and Machine to computer-aided diagnosis, 2014. doi:10.1155/2014/ Intelligence, 11(12):1322–1329, December 1989. doi:10. 286856. 1109/34.41371. [itL18] Humans in the Loop. The best image annotation platforms [VGSG+ 19] Juan P. Vigueras-Guillén, Busra Sari, Stanley F. Goes, Hans G. for computer vision (+ an honest review of each), October Lemij, Jeroen van Rooij, Koenraad A. Vermeer, and Lucas J. 2018. URL: https://hackernoon.com/the-best-image-annotation- van Vliet. Fully convolutional architecture vs sliding-window platforms-for-computer-vision-an-honest-review-of-each- cnn for corneal endothelium cell segmentation. BMC Biomedical dac7f565fea. Engineering, 1(1):4, January 2019. doi:10.1186/s42490- [JB92] Anil K. Jain and Sushil Bhattacharjee. Text segmentation using 019-0003-2. gabor filters for automatic document processing. Machine Vision [WYZZ09] C. Wang, Shuicheng Yan, Lei Zhang, and H. Zhang. Multi- and Applications, 5(3):169–184, June 1992. doi:10.1007/ label sparse coding for automatic image annotation. In 2009 BF02626996. IEEE Conference on Computer Vision and Pattern Recognition, [JPRA20] Nathan Jessurun, Olivia Paradis, Alexandra Roberts, and Navid page 1643–1650, June 2009. doi:10.1109/CVPR.2009. Asadizanjani. Component Detection and Evaluation Framework 5206866. (CDEF): A Semantic Annotation Tool. Microscopy and Micro- [YPH 06] Paul A. Yushkevich, Joseph Piven, Heather Cody Hazlett, + analysis, 26(S2):1470–1474, August 2020. doi:10.1017/ Rachel Gimpel Smith, Sean Ho, James C. Gee, and Guido S1431927620018243. Gerig. User-guided 3D active contour segmentation of anatom- [KBO16] Made Windu Antara Kesiman, Jean-Christophe Burie, and Jean- ical structures: Significantly improved efficiency and reliability. Marc Ogier. A new scheme for text line and character seg- NeuroImage, 31(3):1116–1128, July 2006. doi:10.1016/j. mentation from gray scale images of palm leaf manuscript. neuroimage.2006.01.015. 
In 2016 15th International Conference on Frontiers in Hand- writing Recognition (ICFHR), pages 325–330, October 2016. doi:10.1109/ICFHR.2016.0068. [LMB+ 14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Euro- pean conference on computer vision, pages 740–755. Springer, 2014. [LSA+ 10] L’ubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H. S. Torr. What, where and how many? combining object detectors and crfs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 424–437, Berlin, Heidelberg, 2010. Springer Berlin Hei- delberg. [MKS18] S. Mohajerani, T. A. Krammer, and P. Saeedi. A cloud detection algorithm for remote sensing images using fully convolutional neural networks. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), page 1–5, August 2018. doi:10.1109/MMSP.2018.8547095. [PJTA20] Olivia P Paradis, Nathan T Jessurun, Mark Tehranipoor, and Navid Asadizanjani. Color normalization for robust automatic bill of materials generation and visual inspection of pcbs. In ISTFA 2020: Papers Accepted for the Planned 46th International Symposium for Testing and Failure Analysis, International Symposium for Testing and Failure Analysis, pages 172–179, 2020. URL: https://doi.org/10.31399/asm.cp. istfa2020p0172https://dl.asminternational.org/istfa/proceedings- pdf/ISTFA2020/83348/172/425605/istfa2020p0172.pdf, doi:10.31399/asm.cp.istfa2020p0172. [Ram] Urs Ramer. An iterative procedure for the polygonal approx- imation of plane curves. 1(3):244–256. URL: https://www. sciencedirect.com/science/article/pii/S0146664X72800170, doi:10.1016/S0146-664X(72)80017-0. [RLFF15] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In 2015 IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), page 2121–2131. IEEE, June 2015. URL: http://ieeexplore.ieee.org/document/7298824/, doi:10. 1109/CVPR.2015.7298824. [RLO+ 17] Martin Rajchl, Matthew C. H. Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A. Rutherford, Joseph V. Hajnal, Bernhard Kainz, and Daniel Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 13 Galyleo: A General-Purpose Extensible Visualization Solution Rick McGeer‡∗ , Andreas Bergen‡ , Mahdiyar Biazi‡ , Matt Hemmings‡ , Robin Schreiber‡ F Abstract—Galyleo is an open-source, extensible dashboarding solution inte- Jupyter’s web interface is primarily to offer textboxes for code grated with JupyterLab [jup]. Galyleo is a standalone web application integrated entry. Entered code is sent to the server for evaluation and as an iframe [LS10] into a JupyterLab tab. Users generate data for the dash- text/HTML results returned. Visualization in a Jupyter Notebook board inside a Jupyter Notebook [KRKP+ 16], which transmits the data through is either given by images rendered server-side and returned as message passing [mdn] to the dashboard; users use drag-and-drop operations inline image tags, or by JavaScript/HTML5 libraries which have to add widgets to filter, and charts to display the data, shapes, text, and images. The dashboard is saved as a JSON [Cro06] file in the user’s filesystem in the a corresponding server-side Python library. 
The Python library same directory as the Notebook. generates HTML5/JavaScript code for rendering. The limiting factor is that the visualization library must be in- Index Terms—JupyterLab, JupyterLab extension, Data visualization tegrated with the Python backend by a developer, and only a subset of the rich array of visualization, charting, and mapping libraries Introduction available on the HTML5/JavaScript platform is integrated. The HTML5/JavaScript platform is as rich a client-side visualization Current dashboarding solutions [hol22a] [hol22b] [plo] [pan22] platform as Python is a server-side platform. for Jupyter either involve external, heavyweight tools, ingrained Galyleo set out to offer the best of both worlds: Python, R, and HTML/CSS coding, complex publication, or limited control over Julia as a scalable analytics platform coupled with an extensible layout, and have restricted widget sets and visualization libraries. JavaScript/HTML5 visualization and interaction platform. It offers Graphics objects require a great deal of configuration: size, posi- a no-code client-side environment, for several reasons. tion, colors, fonts must be specified for each object. Thus library solutions involve a significant amount of fairly simple code. Con- 1) The Jupyter analytics community is comfortable with versely, visualization involves analytics, an inherently complex server-side analytics environments (the 100+ kernels set of operations. Visualization tools such as Tableau [DGHP13] available in Jupyter, including Python, R and Julia) but or Looker [loo] combine visualization and analytics in a single less so with the JavaScript visualization platform. application presented through a point-and-click interface. Point- 2) Configuration of graphical objects takes a lot of low-value and-click interfaces are limited in the number and complexity configuration code; conversely, it is relatively easy to do of operations supported. The complexity of an operation isn’t by hand. reduced by having a simple point-and-click interface; instead, the user is confronted with the challenge of trying to do something These insights lead to a mixed interface, combining a drag- complicated by pointing. The result is that tools encapsulate and-drop interface for the design and configuration of visual complex operations in a few buttons, and that leads to a limited objects, and a coding, server-side interface for analytics programs. number of operations with reduced options and/or tools with steep Extension of the widget set was an important consideration. A learning curves. widget is a client-side object with a physical component. Galyleo In contrast, Jupyter is simply a superior analytics environment is designed to be extensible both by adding new visualization in every respect over a standalone visualization tool: its various libraries and components and by adding new widgets. kernels and their libraries provide a much broader range of analyt- Publication of interactive dashboards has been a further chal- ics capabilities; its programming interface is a much cleaner and lenge. A design goal of Galyleo was to offer a simple scheme, simpler way to perform complex operations; hardware resources where a dashboard could be published to the web with a single can scale far more easily than they can for a visualization tool; click. and connectors to data sources are both plentiful and extensible. 
These then, are the goals of Galyleo: Both standalone visualization tools and Jupyter libraries have a limited set of visualizations. Jupyter is a server-side platform. 1) Simple, drag-and-drop design of interactive dashboards in a visual editor. The visual design of a Galyleo dashboard * Corresponding author: rick.mcgeer@engageLively.com ‡ engageLively should be no more complex than design of a PowerPoint or Google slide; Copyright © 2022 Rick McGeer et al. This is an open-access article distributed 2) Radically simplify the dashboard-design interface by cou- under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the pling it to a powerful, Jupyter back end to do the analytics original author and source are credited. work, separating visualization and analytics concerns; 14 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 1: Figure 1. A New Galyleo Dashboard Fig. 3: Figure 3. Dataflow in Galyleo As the user creates and manipulates the visual elements, the editor continuously saves the table as a JSON file, which can also be edited with Jupyter’s built-in text editor. Workflow The goal of Galyleo is simplicity and transparency. Data prepa- ration is handled in Jupyter, and the basic abstract item, the GalyleoTable is generally created and manipulated there, using an open-source Python library. When a table is ready, the Galyleo- Client library is invoked to send it to the dashboard, where it appears in the table tab of the sidebar. The dashboard author then creates visual elements such as sliders, lists, dropdowns etc., Fig. 2: Figure 2. The Galyleo Dashboard Studio which select rows of the table, and uses these filtered lists as inputs to charts. The general idea is that the author should be 3) Maximimize extensibility for visualization and widgets able to seamlessly move between manipulating and creating data on the client side and analytics libraries, data sources and tables in the Notebook, and filtering and visualizing them in the hardware resources on the server side; dashboard. 4) Easy, simple publication; Data Flow and Conceptual Picture The Galyleo Data Model and Architecture is discussed in detail Using Galyleo below. The central idea is to have a few, orthogonal, easily-grasped The general usage model of Galyleo is that a Notebook is being concepts which make data manipulation easy and intuitive. The edited and executed in one tab of JupyterLab, and a corresponding basic concepts are as follows: dashboard file is being edited and executed in another; as the Notebook executes, it uses the Galyleo Client library to send 1) Table: A Table is a list of records, equivalent to a Pandas data to the dashboard file. To JupyterLab, the Galyleo Dashboard DataFrame [pdt20] [WM10] or a SQL Table. In general, Studio is just another editor; it reads and writes .gd.json files in in Galyleo, a Table is expected to be produced by an the current directory. external source, generally a Jupyter Notebook 2) Filter: A Filter is a logical function which applies to a The Dashboard Studio single column of a Table Table, and selects rows from the Table. Each Filter corresponds to a widget; widgets set A new Galyleo Dashboard can be launched from the JupyterLab the values Filter use to select Table rows launcher or from the File>New menu, as shown in Figure 1. 3) View A View is a subset of a Table selected by one or An existing dashboard is saved as a .gd.json file, and is more Filters. 
To create a view, the user chooses a Table, denoted with the Galyleo star logo. It can be opened in the usual and then chooses one or more Tilters to apply to the Table way, with a double-click. to select the rows for the View. The user can also statically Once a file is opened, or a new file created, a new Galyleo tab select a subset of the columns to include in the View. opens onto it. It resembles a simplified form of a Tableau, Looker, 4) Chart A Chart is a generic term for an object that displays or PowerBI editor. The collapsible right-hand sidebar offers the data graphically. Its input is a View or a Table. Each Chart ability to view Tables, and view, edit, or create Views, Filters, has a single data source. and Charts. The bottom half of the right sidebar gives controls for styling of text and shapes. The data flow is straightforward. A Table is updated from The top bar handles the introduction of decorative and styling an external source, or the user manipulates a widget. When this elements to the dashboard: labels and text, simple shapes such as happens, the affected item signals the dashboard controller that it ellipses, rectangles, polygons, lines, and images. All images are has been updated. The controller then signals all charts to redraw referenced by URL. themselves. Each Chart will then request updated data from its GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 15 source Table or View. A View then requests its configured filters for their current logic functions, and passes these to the source Table with a request to apply the filters and return the rows which are selected by all the filters (in the future, a more general Boolean will be applied; the UI elements to construct this function are under design). The Table then returns the rows which pass the filters; the View selects the static subset of columns it supports, and passes this to its Charts, which then redraw themselves. Each item in this flow conceptually has a single data source, but multiple data targets. There can be multiple Views over a Table, but each View has a single Table as a source. There can be multiple charts fed by a View, but each Chart has a single Table or View as a source. It’s important to note that there are no special cases. There is no distinction, as there is in most visualization systems, between a "Dimension" or a "Measure"; there are simply columns of data, Fig. 4: Figure 4. A Published Galyleo Dashboard which can be either a value or category axis for any Chart. From this simplicity, significant generality is achieved. For example, a filter selects values from any column, whether that column is and configuration gives instant feedback and tight control over providing value or category. Applying a range filter to a category appearance. For example, the authors of a LaTeX paper (including column gives natural telescoping and zooming on the x-axis of a this one) can’t control the placement of figures within the text. The chart, without change to the architecture. fourth, which is correct, is that configuration code is more verbose, error-prone, and time-consuming than manual configuration. Drilldowns What is less often appreciated is that when operations become An important operation for any interactive dashboard is drill- sufficiently complex, coding is a much simpler interface than downs: expanding detail for a datapoint on a chart. The user manual configuration. 
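The dataflow just described is compact enough to restate in code. The sketch below is a condensed Python illustration of that flow, not Galyleo's JavaScript implementation: the dashboard controller is folded into the update/refresh calls, and every class, method, and column name is invented for this example.

```python
# Condensed illustration of the Table -> Filter -> View -> Chart flow.
class Table:
    """A list of records: columns are (name, type) pairs, rows are lists."""
    def __init__(self, columns, rows):
        self.columns, self.rows, self.views = columns, rows, []

    def update(self, rows):
        """An external source pushed new data: signal every dependent view."""
        self.rows = rows
        for view in self.views:
            view.refresh()

    def select(self, logic_functions):
        """Return the rows selected by all of the supplied filter functions."""
        names = [name for name, _ in self.columns]
        return [row for row in self.rows
                if all(fn(dict(zip(names, row))) for fn in logic_functions)]

class RangeFilter:
    """Stands in for a widget: it holds the values the user picked."""
    def __init__(self, column, low, high):
        self.column, self.low, self.high = column, low, high

    def logic(self):
        return lambda record: self.low <= record[self.column] <= self.high

class View:
    """A subset of a Table selected by one or more filters."""
    def __init__(self, table, filters):
        self.table, self.filters, self.charts = table, filters, []
        table.views.append(self)

    def refresh(self):
        rows = self.table.select([f.logic() for f in self.filters])
        for chart in self.charts:  # one source, many targets
            chart.draw(rows)

class Chart:
    def __init__(self, view):
        view.charts.append(self)

    def draw(self, rows):
        print(f"redrawing with {len(rows)} rows")

deaths = Table([("month", "string"), ("count", "number")],
               [["Jan", 120], ["Feb", 310], ["Mar", 95]])
view = View(deaths, [RangeFilter("count", 100, 400)])
Chart(view)
deaths.update([["Apr", 80], ["May", 210], ["Jun", 330]])  # -> "redrawing with 2 rows"
```

A widget would work the same way: changing its value updates its filter's bounds and triggers refresh() on the views that use it, so each chart still redraws from its single data source.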
For example, building a pivot table in a should be able to click on a chart and see a detailed view of spreadsheet using point-and-click operations have "always had a the data underlying the datapoint. This was naturally implemented reputation for being complicated" [Dev]. It’s three lines of code in in our system by associating a filter with every chart: every chart Python, even without using the Pandas pivot_table method. Most in Galyleo is also a Select Filter, and it can be used as a Filter in analytics procedures are far more easily done in code. a view, just as any other widget can be. As a result, Galyleo is an appropriate-code environment, which is an environment which combines a coding interface Publishing The Dashboard for complex, large-scale, or abstract operations and a point- Once the dashboard is complete, it can be published to the and-click interface for simple, concrete, small-scale operations. web simply by moving the dashboard file to any place it get Galyleo combines broadly powerful Jupyter-based code and low- an URL (e.g. a github repo). It can then be viewed by visiting code libraries for analytics paired with fast GUI-based design and https://galyleobeta.engagelively.com/public/galyleo/index.html? configuration for graphical elements and layout. dashboard=<url of dashboard file>. The attached figure shows a published Galyleo Dashboard, which displays Florence Galyleo Data Model And Architecture Nightingale’s famous Crimean War dataset. Using the double sliders underneath the column charts telescope the x axes, The Galyleo data Model and architecture closely model the effectively permitting zooming on a range; clicking on a column dashboard architecture discussed in the previous section. They are shows the detailed death statistics for that month in the pie chart based on the idea of a few simple, generalizable structures, which above the column chart. are largely independent of each other and communicate through simple interfaces. No-Code, Low-Code, and Appropriate-Code The GalyleoTable Galyleo is an appropriate-code environment, meaning that it offers A GalyleoTable is the fundamental data structure in Galyleo. It efficient creation to developers at every step. It offers What-You- is a logical, not a physical abstraction; it simply responds to See-Is-What-You-Get (WYSIWYG) design tools where appro- the GalyleoTable API. A GalyleoTable is a pair (columns, rows), priate, low-code where appropriate, and full code creation tools where columns is a list of pairs (name, type), where type is one where appropriate. of {string, boolean, number, date}, and rows is a list of lists of No-code and low-code environments, where users construct primitive values, where the length of each component list is the applications through a visual interface, are popular for several length of the list of columns and the type of the kth entry in each reasons. The first is the assumption that coding is time-consuming list is the type specified by the kth column. and hard, which isn’t always or necessarily true; the second is Small, public tables may be contained in the dashboard file; the assumption that coding is a skill known to only a small these are called explicit tables. However, explicitly representing fraction of the population, which is becoming less true by the the table in the dashboard file has a number of disadvantages: day. 40% of Berkeley undergraduates take Data 8, in which every assignment involves programming in a Jupyter Notebook. 
1) An explicit table is in the memory of the client viewing The third, particularly for graphics code, is that manual design the dashboard; if it is too large, it may cause signifi- 16 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) cant performance problems on the dashboard author or viewer’s device 2) Since the dashboard file is accessible on the web, any data within it is public 3) The data may be continuously updated from a source, and it’s inconvenient to re-run the Notebook to update the data. Therefore, the GalyleoTable can be of one of three types: 1) A data server that implements the Table REST API 2) A JavaScript object within the dashboard page itself 3) A JavaScript messenger in the page that implements a messaging version of the API Fig. 5: Figure 5. Galyleo Dataflow with Remote Tables An explicit table is simply a special case of (2) -- in this case, the JavaScript object is simply a linear list of rows. Comments These are not exclusive. The JavaScript messenger case is designed to support the ability of a containing application within Again, simplicity and orthogonality have shown tremendous bene- the browser to handle viewer authentication, shrinking the security fits here. Though filters conceptually act as selectors on rows, they vulnerability footprint and ensuring that the client application may perform a variety of roles in implementations. For example, controls the data going to the dashboard. In general, aside from a table produced by a simulator may be controlled by a parameter performing tasks like authentication, the messenger will call an value given by a Filter function. external data server for the values themselves. Whether in a Data Server, a containing application, or a Extending Galyleo JavaScript object, Tables support three operations: Every element of the Galyleo system, whether it is a widget, Chart, Table Server, or Filter is defined exclusively through a small set 1) Get all the values for a specific column of public APIs. This is done to permit easy extension, by either 2) Get the max/min/increment for a specific numeric column the Galyleo team, users, or third parties. A Chart is defined as an 3) Get the rows which match a boolean function, passed in object which has a physical HTML representation, and it supports as a parameter to the operation four JavaScript methods: redraw (draw the chart), set data (set the Of course, (3) is the operation that we have seen above, to chart’s data), set options (set the chart’s options), and supports populate a view and a chart. (1) and (2) populate widgets on the table (a boolean which returns true if and only if the chart can dashboard; (1) is designed for a select filter, which is a widget draw the passed-in data set). In addition, it exports out a defined that lets a user pick a specific set of values for a column; (2) is JSON structure which indicates what options it supports and the an optimization for numeric filters, so that the entire list of values types of their values; this is used by the Chart Editor to display a for the column need not be sent -- rather, only the start and end configurator for the chart. values, and the increment between them. Similarly, the underlying lively.next system supports user design of new filters. 
Again, a filter is simply an object with a Each type of table specifies a source, additional information physical presence, that the user can design in lively, and supports a (in the case of a data server, for example, any header variables specific API -- broadly, set the choices and hand back the Boolean that must be specified in order to fetch the data), and, optionally, function as a JSON object which will be used to filter the data. a polling interval. The latter is designed to handle live data; the dashboard will query the data source at each polling interval to lively.next see if the data has changed. Any system can be used to extend Galyleo; at the end of the The choice of these three table instantiations (REST, day, all that need be done is encapsulate a widget or chart in JavaScript object, messenger) is that they provide the key founda- a snippet of HTML with a JavaScript interface that matches tional building block for future extensions; it’s easy to add a SQL the Galyleo protocol. This is done most easily and quickly connection on top of a REST interface, or a Python simulator. by using lively.next [SKH21]. lively.next is the latest in a line of Smalltalk- and Squeak-inspired [IKM+ 97] JavaScript/HTML Filters integrated development environments that began with the Lively Tables must be filtered in situ. One of the key motivators behind Kernel [IPU+ 08] [KIH+ 09] and continued through the Lively Web remote tables is in keeping large amounts of data from hitting the [LKI+ 12] [IFH+ 16] [TM17]. Galyleo is an application built in browser. This is largely defeated if the entire table is sent to the Lively, following the work done in [HIK+ 16]. dashboard and then filtered there. As a result, there is a Filter API Lively shares with Jupyter an emphasis on live programming together with the Table API whereever there are tables. [KRB18], orwhere a Read-Evaluate-Act Loop (REAL) program- The data flow of the previous section remains unchanged; ming style. It adds to that a combination of visual and text it is simply that the filter functions are transmitted to wherever programming [ABF20], where physical objects are positioned and the tables happen to be. The dataflow in the case of remote configured largely by hand as done with any drawing or design tables (whether messenger-based or REST-based) is shown here, program (e.g., PowerPoint, Illustrator, DrawPad, Google Draw) with operations that are resident where the table is situated and and programmed with a built-in editor and workspace, similar in operations resident on the dashboard clearly shown. concept if not form to a Jupyter Notebook. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 17 2) acceptsDataset(<Table or View>) returns a boolean de- pending on whether this chart can draw the data in this view. For example, a Table Chart can draw any tabular data; a Geo Chart typically requires that the first column be a place specifier. In addition, it has a read-only property: 1) optionSpec: A JSON structure describing the options for the chart. This is a dictionary, which specifies the name of each option, and its type (color, number, string, boolean, or enum with values given). Each type corresponds to a specific UI widget that the chart editor uses. And two read write properties: 1) options: The current options, as a JSON dictionary. This Fig. 6: Figure 6. The lively.next environment matches exactly the JSON dictionary in optionSpec, with values in place of the types. 
2) dataSource: a string, the name of the current Galyleo Lively abstracts away HTML and CSS tags in graphical Table or Galyleo View objects called "Morphs". Morphs [MS95] were invented as the user interface layer for Self [US87], and have been used as Typically, an extension to Galyleo’s charting capabilities is the foundation of the graphics system in Squeak and Scratch done by incorporating the library as described in the previous [MRR+ 10]. In this UI, every physical object is a Morph; these section, implementing the API given in this section, and then can be as simple as a simple polygon or text string to a full publishing the result as a component application. Morphs are combined via composition, similar to the way that objects are grouped in a presentation or drawing program. Extending Galyleo’s Widget Set The composition is simply another Morph, which in turn can be A widget is a graphical item used to filter data. It operates on a composed with other Morphs. In this manner, complex Morphs single column on any table in the current data set. It is either a can be built up from collections of simpler ones. For example, range filter (which selects a range of numeric values) or a select a slider is simply the composition of a circle (the knob) with a filter (which selects a specific value, or a set of specific values). thin, long rectangle (the bar). Each Morph can be individually The API that is implemented consists only of properties. programmed as a JavaScript object, or can inherit base level 1) valueChanged : a signal, which is fired whenever the behavior and extend it. value of the widget is changed In lively.next, each morph turns into a snippet of HTML, CSS, 2) value: read-write. The current value of the widget and JavaScript code and the entire application turns into a web 3) filter: read-only. The current filter function, as a JSON page. The programmer doesn’t see the HTML and CSS code structure directly; these are auto-generated. Instead, the programmer writes 4) allValues: read-write, select filters only. JavaScript code for both logic and configuration (to the extent that 5) column: read-only. The name of the column of this the configuration isn’t done by hand). The code is bundled with widget. Set when the widget is created the object and integrated in the web page. 6) numericSpec: read-write. A dictionary containing the Morphs can be set as reusable components by a simple numeric specification for a numeric or date filter declaration. They can then be reused in any lively design. Widgets are typically designed as a standard Lively graphical Incorporating New Libraries component, much as the slider described above. Libraries are typically incorporated into lively.next by attaching them to a convenient physical object, importing the library from a Integration into Jupyter Lab: The Galyleo Extension package manager such as npm, and then writing a small amount Galyleo is a standalone web application that is integrated into of code to expose the object’s API. The simplest form of this is to JupyterLab using an iframe inside a JupyterLab tab for physical assign the module to an instance variable so it has an addressable design. A small JupyterLab extension was built that implements name, but typically a few convenience methods are written as well. the JupyterLab editor API. 
The JupyterLab extension has two In this way, a large number of libraries have been incorporated major functions: to handle read/write/undo requests from the as reusable components in lively.next, including Google Maps, JupyterLab menus and file browser, and receive and transmit Google Charts [goo], Chartjs [cha], D3 [BOH11], Leaflet.js [lea], messages from the running Jupyter kernels to update tables on OpenLayers [ope], cytoscape:ono and many more. the Dashboard Studio, and to handle the reverse messages where Extending Galyleo’s Charting and Visualization capabilities the studio requests data from the kernel. Standard Jupyter and browser mechanisms are used. File sys- A Galyleo Chart is anything that changes its display based on tem requests come to the extension from the standard Jupyter API, tabular data from a Galyleo Table or Galyleo View. It responds to exactly the same requests and mechanisms that are sent to a Mark- a specific API, which includes two principal methods: down or Notebook editor. The extension receives them, and then 1) drawChart: redraw the chart using the current tabular data uses standard browser-based messaging (window.postMessage) to from the input or view signal the standalone web app. Similarly, when the extension 18 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) of environments hosted by a server is arbitrary, and the cost is only the cost of maintaining the Dockerfile for each environment. An environment is easy to design for a specific class, project, or task; it’s simply adding libraries and executables to a base Dockerfile. It must be tested, of course, but everything must be. And once it is tested, the burden of software maintenance and installation is removed from the user; the user is already in a task- customized, curated environment. Of course, the usual installation tools (apt, pip, conda, easy_install ) can be pre-loaded (they’re just executables) so if the environment designer missed something it can be added by the end user. Though a user can only be in one environment at a time, persistent storage is shared across all environments, meaning Fig. 7: Figure 7. Galyleo Extension Architecture switching environments is simply a question of swapping one environment out and starting another. Viewed in this light, a JupyterHub is a multi-purpose computer makes a request of JupyterLab, it does so through this mechanism in the Cloud, with an easy-to-use UI that presents through a and a receiver in the extension gets it and makes the appropriate browser. JupyterLab isn’t simply an IDE; it’s the window system method calls within JupyterLab to achieve the objective. and user interface for this computer. The JupyterLab launcher is When a kernel makes a request through the Galyleo Client, the desktop for this computer (and it changes what’s presented, this is handled exactly the same way. A Jupyter messaging server depending on the environment); the file browser is the computer’s within the extension receives the message from the kernel, and file browser, and the JupyterLab API is the equivalent of the Win- then uses browser messaging to contact the application with the dows or MacOS desktop APIs and window system that permits request, and does the reverse on a Galyleo message to the kernel. third parties to build applications for this. This is a highly efficient method of interaction, since browser- based messaging is in-memory transactions on the client machine. 
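To make the kernel-to-dashboard path concrete, the sketch below shapes a Pandas DataFrame into the (columns, rows) form described earlier and hands it to the front end over a Jupyter comm, one standard channel for kernel code to reach a JupyterLab extension. The comm target name and payload keys are assumptions made for this sketch, and Galyleo's own client library wraps this plumbing differently; treat it as an illustration of the relay, not the actual API.

```python
import pandas as pd
from ipykernel.comm import Comm  # assumes this runs inside a Jupyter kernel

def column_type(dtype) -> str:
    """Map Pandas dtypes onto the four Galyleo column types named in the text."""
    if pd.api.types.is_bool_dtype(dtype):
        return "boolean"
    if pd.api.types.is_numeric_dtype(dtype):
        return "number"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "date"
    return "string"

def to_table_payload(df: pd.DataFrame) -> dict:
    """Shape a DataFrame as a (columns, rows) table."""
    return {
        "columns": [[name, column_type(dtype)] for name, dtype in df.dtypes.items()],
        "rows": df.values.tolist(),
    }

df = pd.DataFrame({"month": ["Jan", "Feb"], "deaths": [120, 310]})
payload = {"name": "nightingale", "table": to_table_payload(df)}

# Hypothetical comm target: the extension side would receive this message
# and relay it to the dashboard iframe with window.postMessage.
comm = Comm(target_name="galyleo_table", data=payload)
comm.send(payload)
```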
This Jupyter Computer has a large number of advantages over It’s important to note that there is nothing Galyleo-specific a standard desktop or laptop computer. It can be accessed from any about the extension: the Galyleo Extension is a general method device, anywhere on Earth with an Internet connection. Software for any standalone web editor (e.g., a slide or drawing editor) to installation and maintenance issues are nonexistent. Data loss due be integrated into JupyterLab. The JupyterLab connection is a few to hardware failure is extremely unlikely; backups are still required tens of lines of code in the Galyleo Dashboard. The extension is to prevent accidental data loss (e.g., erroneous file deletion), but slightly more complex, but it can be configured for a different they are far easier to do in a Cloud environment. Hardware application with a simple data structure which specifies the URL resources such as disk, RAM, and CPU can be added rapidly, of the application, file type and extension to be manipulated, and on a permanent or temporary basis. Relatively exotic resources message list. (e.g., GPUs) can also be added, again on an on-demand, temporary basis. The advantages go still further than that. Any resource that The Jupyter Computer can be accessed over a network connection can be added to The implications of the Galyleo Extension go well beyond vi- the Jupyter Computer simply by adding the appropriate accessor sualization and dashboards and easy publication in JupyterLab. library to an environment’s Dockerfile. For example, a database JupyterLab is billed as the next-generation integrated Develop- solution such as Snowflake, BigQuery, or Amazon Aurora (or ment Environment for Jupyter, but in fact it is substantially more one of many others) can be "installed" by adding the relevant than that. It is the user interface and windowing system for Cloud- library module to the environment. Of course, the user will need based personal computing. Inspired by previous extensions such to order the database service from the relevant provider, and obtain as the Vega Extension, the Galyleo Extensions seeks to provide authentication tokens, and so on -- but this is far less troublesome the final piece of the puzzle. than even maintaining the library on the desktop. Consider a Jupyter server in the Cloud, served from a Jupyter- However, to date the Jupyter Computer only supports a few Hub such as the Berkeley Data Hub. It’s built from a base window-based applications, and adding a new application is a Ubuntu image, with the standard Jupyter libraries installed and, time-consuming development task. The applications supported are importantly, a UI that includes a Linux terminal interface. Any familiar and easy to enumerate: a Notebook editor, of course; a Linux executable can be installed in the Jupyter server image, as Markdown Viewer; a CSV Viewer; a JSON Viewer (not inline can any Jupyter kernel, and any collection of libraries. The Jupyter editor), and a text editor that is generally used for everything from server has per-user persistent storage, which is organized in a Python files to Markdown to CSV. standard Linux filesystem. This makes the Jupyter server a curated This is a small subset of the rich range of JavaScript/HTML5 execution environment with a Linux command-line interface and applications which have significant value for Jupyter Computer a Notebook interface for Jupyter execution. users. 
For example, the Ace Code Editor supports over 110 A JupyterHub similar to Berkeley Data Hub (essentially, languages and has the functionality of popular desktop editors anything built from Zero 2 Jupyter Hub or Q-Hub) comes with a such as Vim and Sublime Text. There are over 1100 open-source number of "environments". The user chooses the environment on drawing applications on the JavaScript/HTML5 platform; multiple startup. Each environment comes with a built-in set of libraries and spreadsheet applications, the most notable being jExcel, and many executables designed for a specific task or set of tasks. The number more. GALYLEO: A GENERAL-PURPOSE EXTENSIBLE VISUALIZATION SOLUTION 19 Fig. 8: Figure 8. Galyleo Extension Application-Side messaging Fig. 9: Figure 9. Generations of Internet Computing Up until now, adding a new application to JupyterLab involved writing a hand-coded extension in Typescript, and compiling it into JupyterLab. However, the Galyleo Extension has been the user uses any of a wide variety of text editors to prepare the designed so that any HTML5/JavaScript application can be added document, any of a wide variety of productivity and illustrator easily, simply by configuring the Galyleo Extension with a small programs to prepare the images, runs this through a local sequence JSON file. of commands (e.g., pdflatex paper; bibtex paper; pdflatex paper. The promise of the Galyleo Extension is that it can be adapted Usually Github or another repository is used for storage and to any open-source JavaScript/HTML5 application very easily. collaboration. The Galyleo Extension merely needs the: In a Cloud service, this is another matter. There is at most one editor, selected by the service, on the site. There is no • URL of the application image editing or illustrator program that reads and writes files • File extension that the application reads/writes on the site. Auxiliary tools, such as a bib searcher, aren’t present • URL of an image for the launcher or aren’t customizable. The service has its own siloed storage, • Name of the application for the file menu its own text editor, and its own document-preparation pipeline. The application must implement a small messaging client, The tools (aside from the core document-preparation program) using the standard JavaScript messaging interface, and implement are primitive. The online service has two advantages over the the calls the Galyleo Extension makes. The conceptual picture is personal-device service. Collaboration is generally built-in, with shown im Figure 8. multiple people having access to the project, and the software need And it must support (at a minimum) messages to read and not be maintained. Aside from that, the personal-device experience write the file being edited. is generally superior. In particular, the user is free to pick their own editor, and doesn’t have to orchestrate multiple downloads and The Third Generation of Network Computing uploads from various websites. The usual collection of command- The World-Wide Web and email comprised the first generation line utilities are available to small touchups. of Internet computing (the Internet had been around for a decade The third generation of Internet Computing represented by the before the Web, and earlier networks dated from the sixties, but Jupyter Computer. 
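The four items the extension needs from an application, listed above, amount to a small configuration record. The snippet below shows what such a record could look like for a hypothetical drawing editor; the key names and values are invented for illustration and are not the extension's actual schema.

```python
import json

# Hypothetical registration record for a standalone HTML5/JavaScript editor;
# the key names are invented and are not the extension's actual schema.
drawing_editor_config = {
    "applicationUrl": "https://example.org/draw/index.html",   # URL of the application
    "fileExtension": ".draw.json",                              # file type it reads and writes
    "launcherImageUrl": "https://example.org/draw/icon.svg",    # image for the launcher
    "applicationName": "Drawing Editor",                        # name for the File menu
}

print(json.dumps(drawing_editor_config, indent=2))
```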
This offers a Cloud experience similar to the the Web and email were the first mass-market applications on personal computer, but with the scalability, reliability, and ease of the network), and they were very simple -- both were document- collaboration of the Cloud. exchange applications, using slightly different protocols. The second generation of Network applications were the siloed pro- Conclusion and Further Work ductivity applications, where standard desktop applications moved The vision of the Jupyter Computer, bringing the power of the to the Cloud. The most famous example is of course GSuite Cloud to the personal computing experience has been started and Office 365, but there were and are many others -- Canva, with Galyleo. It will not end there. At the heart of it is a Loom, Picasa, as well as a large number of social/chat/social composition of two broadly popular platforms: HTML5/JavaScript media applications. What they all had in common was that they for presentation and interaction, and the various Jupyter kernels were siloed applications which, with the exception of the office for server-side analytics. Galyleo is a start at seamless interaction suites, didn’t even share a common store. In many ways, this of these two platforms. Continuing and extending this is further second generation of network applications recapitulates the era development of narrow-waist protocols to permit maximal inde- immediately prior to the introduction of the personal computer. pendent development and extension. That era was dominated by single-application computers such as word processors, which were simply computers with a hardcoded program loaded into ROM. Acknowledgements The Word Processor era was due to technological limitations The authors wish to thank Alex Yang, Diptorup Deb, and for -- the processing power and memory to run multiple programs their insightful comments, and Meghann Agarwal for stewardship. simply wasn’t available on low-end hardware, and PC operating We have received invaluable help from Robert Krahn, Marko systems didn’t yet exist. In some sense, the current second genera- Röder, Jens Lincke and Linus Hagemann. We thank the en- tion of Internet Computing suffers from similar technological con- gageLively team for all of their support and help: Tim Braman, straints. The "Operating System" for Internet Computing doesn’t Patrick Scaglia, Leighton Smith, Sharon Zehavi, Igor Zhukovsky, yet exist. The Jupyter Computer can provide it. Deepak Gupta, Steve King, Rick Rasmussen, Patrick McCue, To see the difference that this can make, consider LaTeX (per- Jeff Wade, Tim Gibson. The JupyterLab development commu- haps preceded by Docutils, as is the case for SciPy) preparation of nity has been helpful and supportive; we want to thank Tony a document. On a personal computer, it’s fairly straightforward; Fast, Jason Grout, Mehmet Bektas, Isabela Presedo-Floyd, Brian 20 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Granger, and Michal Krassowski. The engageLively Technology [hol22b] Installation - holoviews v1.14.9, May 2022. URL: https: Advisory Board has helped shape these ideas: Ani Mardurkar, //holoviews.org/. [IFH+ 16] Daniel Ingalls, Tim Felgentreff, Robert Hirschfeld, Robert Priya Joseph, David Peterson, Sunil Joshi, Michael Czahor, Isha Krahn, Jens Lincke, Marko Röder, Antero Taivalsaari, and Oke, Petrus Zwart, Larry Rowe, Glenn Ricart, Sunil Joshi, Antony Tommi Mikkonen. A world of active objects for work and play: Ng. 
We want to thank the people from the AWS team that have The first ten years of lively. In Proceedings of the 2016 ACM helped us tremendously: Matt Vail, Omar Valle, Pat Santora. International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2016, page Galyleo has been dramatically improved with the assistance of our 238–249, New York, NY, USA, 2016. Association for Comput- Japanese colleagues at KCT and Pacific Rim Technologies: Yoshio ing Machinery. URL: https://doi.org/10.1145/2986012.2986029, Nakamura, Ted Okasaki, Ryder Saint, Yoshikazu Tokushige, and doi:10.1145/2986012.2986029. Naoyuki Shimazaki. Our undestanding of Jupyter in an academic [IKM+ 97] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. Back to the future: The story of squeak, a prac- context came from our colleagues and friends at Berkeley, the tical smalltalk written in itself. In Proceedings of the 12th University of Victoria, and UBC: Shawna Dark, Hausi Müller, ACM SIGPLAN Conference on Object-Oriented Programming, Ulrike Stege, James Colliander, Chris Holdgraf, Nitesh Mor. Use Systems, Languages, and Applications, OOPSLA ’97, page 318–326, New York, NY, USA, 1997. Association for Comput- of Jupyter in a research context was emphasized by Andrew ing Machinery. URL: https://doi.org/10.1145/263698.263754, Weidlea, Eli Dart, Jeff D’Ambrogia. We benefitted enormously doi:10.1145/263698.263754. from the CITRIS Foundry: Alic Chen, Jing Ge, Peter Minor, Kyle [IPU+ 08] Daniel Ingalls, Krzysztof Palacz, Stephen Uhler, Antero Taival- Clark, Julie Sammons, Kira Gardner. The Alchemist Accelerator saari, and Tommi Mikkonen. The lively kernel a self-supporting system on a web page. In Workshop on Self-sustaining Systems, was central to making this product: Ravi Belani, Arianna Haider, pages 31–50. Springer, 2008. doi:10.1007/978-3-540- Jasmine Sunga, Mia Scott, Kenn So, Aaron Kalb, Adam Frankl. 89275-5_2. Kris Singh was a constant source of inspiration and help. Larry [jup] Jupyterlab documentation. URL: https://jupyterlab.readthedocs. Singer gave us tremendous help early on. Vibhu Mittal more io/en/stable/. than anyone inspired us to pursue this road. Ken Lutz has been [KIH+ 09] Robert Krahn, Dan Ingalls, Robert Hirschfeld, Jens Lincke, and Krzysztof Palacz. Lively wiki a development environment for a constant sounding board and inspiration, and worked hand-in- creating and sharing active web content. In Proceedings of the hand with us to develop this product. Our early customers and 5th International Symposium on Wikis and Open Collaboration, partners have been and continue to be a source of inspiration, WikiSym ’09, New York, NY, USA, 2009. Association for Computing Machinery. URL: https://doi.org/10.1145/1641309. support, and experience that is absolutely invaluable: Jonathan 1641324, doi:10.1145/1641309.1641324. Tan, Roger Basu, Jason Koeller, Steve Schwab, Michael Collins, [KRB18] Juraj Kubelka, Romain Robbes, and Alexandre Bergel. The road Alefiya Hussain, Geoff Lawler, Jim Chimiak, Fraukë Tillman, to live programming: Insights from the practice. In Proceedings Andy Bavier, Andy Milburn, Augustine Bui. All of our customers of the 40th International Conference on Software Engineering, ICSE ’18, page 1090–1101, New York, NY, USA, 2018. Associ- are really partners, none moreso than the fantastic teams at Tanjo ation for Computing Machinery. 
URL: https://doi.org/10.1145/ AI and Ultisim: Bjorn Nordwall, Ken Lane, Jay Sanders, Eric 3180155.3180200, doi:10.1145/3180155.3180200. Smith, Miguel Matos, Linda Bernard, Kevin Clark, and Richard [KRKP+ 16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Boyd. We want to especially thank our investors, who bet on this Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul technology and company. Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. Jupyter Notebooks - a publishing format for reproducible computational workflows. IOS Press, 2016. URL: R EFERENCES https://eprints.soton.ac.uk/403913/. [lea] An open-source javascript library for interactive maps. URL: [ABF20] Leif Andersen, Michael Ballantyne, and Matthias Felleisen. https://leafletjs.com/. Adding interactive visual syntax to textual code. Proc. ACM [LKI+ 12] Jens Lincke, Robert Krahn, Dan Ingalls, Marko Roder, and Program. Lang., 4(OOPSLA), nov 2020. URL: https://doi.org/ Robert Hirschfeld. The lively partsbin–a cloud-based repository 10.1145/3428290, doi:10.1145/3428290. for collaborative development of active web content. In 2012 [BOH11] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 data- 45th Hawaii International Conference on System Sciences, pages driven documents. IEEE Transactions on Visualization and Com- 693–701, 2012. doi:10.1109/HICSS.2012.42. puter Graphics, 17(12):2301–2309, dec 2011. URL: https://doi. [loo] Looker. URL: https://looker.com/. org/10.1109/TVCG.2011.185, doi:10.1109/TVCG.2011. [LS10] Bruce Lawson and Remy Sharp. Introducing HTML5. New 185. Riders Publishing, USA, 1st edition, 2010. [cha] Chart.js. URL: https://www.chartjs.org/. [mdn] Window.postmessage() - web apis: Mdn. URL: https://developer. [Cro06] D. Crockford. The application/json media type for javascript mozilla.org/en-US/docs/Web/API/Window/postMessage. object notation (json). RFC 4627, RFC Editor, July 2006. http:// [MRR+ 10] John Maloney, Mitchel Resnick, Natalie Rusk, Brian Silverman, www.rfc-editor.org/rfc/rfc4627.txt. URL: http://www.rfc-editor. and Evelyn Eastmond. The scratch programming language org/rfc/rfc4627.txt, doi:10.17487/rfc4627. and environment. ACM Transactions on Computing Educa- [Dev] Erik Devaney. How to create a pivot table in excel: A step-by- tion (TOCE), 10(4):1–15, 2010. URL: https://doi.org/10.1145/ step tutorial. URL: https://blog.hubspot.com/marketing/how-to- 1868358.1868363, doi:10.1145/1868358.1868363. create-pivot-table-tutorial-ht. [DGHP13] Marcello D’Agostino, Dov M Gabbay, Reiner Hähnle, and [MS95] John H Maloney and Randall B Smith. Directness and liveness in Joachim Posegga. Handbook of tableau methods. Springer the morphic user interface construction environment. In Proceed- Science & Business Media, 2013. ings of the 8th annual ACM symposium on User interface and software technology, pages 21–28, 1995. URL: https://doi.org/ [goo] Charts: google developers. URL: https://developers.google.com/ 10.1145/215585.215636, doi:10.1145/215585.215636. chart/. [HIK+ 16] Matthew Hemmings, Daniel Ingalls, Robert Krahn, Rick [ope] Openlayers. URL: https://openlayers.org/. McGeer, Glenn Ricart, Marko Röder, and Ulrike Stege. Livetalk: [pan22] Panel, May 2022. URL: https://panel.holoviz.org/. A framework for collaborative browser-based replicated- [pdt20] The pandas development team. pandas-dev/pandas: Pandas, computation applications. In 2016 28th International Tele- February 2020. 
USACE Coastal Engineering Toolkit and a Method of Creating a Web-Based Application

Amanda Catlett‡∗, Theresa R. Coumbe‡, Scott D. Christensen‡, Mary A. Byrant‡

Abstract—In the early 1990s the Automated Coastal Engineering System, ACES, was created with the goal of providing state-of-the-art computer-based tools to increase the accuracy, reliability, and cost-effectiveness of Corps coastal engineering endeavors. Over the past 30 years, ACES has become less and less accessible to engineers. An updated version of ACES was necessary for use in coastal engineering. Our goal was to bring the tools in ACES to a user-friendly web-based dashboard that would allow a wide range of users to easily and quickly visualize results. We will discuss how we restructured the code using class inheritance and the three libraries Param, Panel, and HoloViews to create an extensible, interactive, graphical user interface. We have created the USACE Coastal Engineering Toolkit, UCET, which is a web-based application that contains 20 of the tools in ACES. UCET serves as an outline for the process of taking a model or set of tools and developing a web-based application that can produce visualizations of the results.

Index Terms—GUI, Param, Panel, HoloViews

* Corresponding author: amanda.r.catlett@erdc.dren.mil
‡ ERDC

Copyright © 2022 Amanda Catlett et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The Automated Coastal Engineering System (ACES) was developed in response to a charge by LTG E. R. Heiberg III, who was the Chief of Engineers at the time, to provide improved design capabilities to the Corps coastal specialists. [Leenknecht]
In 1992, ACES was presented as an interactive computer-based design and analysis system in the field of coastal engineering. The tools consist of seven functional areas: Wave Prediction, Wave Theory, Structural Design, Wave Runup Transmission and Overtopping, Littoral Process, and Inlet Processes. These functional areas range from classical theory describing wave motion, to expressions resulting from tests of structures in wave flumes, to numerical models describing the exchange of energy from the atmosphere to the sea surface. The math behind these uses anything from simple algebraic expressions, both theoretical and empirical, to numerically intense algorithms. [Leenknecht][UG][shankar]

Originally, ACES was written in FORTRAN 77, resulting in a decreased ability to use the tool as technology has evolved. In 2017, the codebase was converted from FORTRAN 77 to MATLAB and Python. This conversion ensured that coastal engineers using this tool base would not need training in yet another coding language. In 2020, the Engineered Resilient Systems (ERS) Rapid Application Development (RAD) team undertook the project with the goal of deploying the ACES tools as a web-based application, and ultimately renamed it the USACE Coastal Engineering Toolkit (UCET).

The RAD team focused on updating the Python codebase utilizing Python's object-oriented programming and the newly developed HoloViz ecosystem. The team refactored the code to implement inheritance so the code is clean, readable, and scalable. The tools were given a Graphical User Interface (GUI) so the implementation as a web app would provide a user-friendly experience. This was done by using the HoloViz-maintained libraries: Param, Panel, and HoloViews.

This paper will discuss some of the steps that were taken by the RAD team to update the Python codebase to create a Panel application of the coastal engineering tools: in particular, refactoring the input and output variables with the Param library, the class hierarchy used, and the utilization of Panel and HoloViews for a user-friendly experience.

Refactoring Using Param

Each coastal tool in UCET has two classes, the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model, whereas the GUI class holds information for GUI visualization. To make implementation of the GUI more seamless we refactored the model variables to utilize the Param library. Param is a library that has the goal of simplifying the codebase by letting the programmer explicitly declare the types and values of parameters accepted by the code. Param can also be used seamlessly when implementing the GUI through Panel and HoloViews.

Each UCET tool's model class declares the input and output values used in the model as class parameters. Each input and output variable is declared and given the following metadata features:

• default: each input variable is defined as a Param with a default value taken from the 1992 ACES user manual
• bounds: each input variable is defined with range values taken from the 1992 ACES user manual
• doc or docstrings: input and output variables have the expected variable and a description of the variable defined as a doc. This is used as a label over the input and output widgets. Most docstrings follow the pattern <variable>: <description of variable [units, if any]>
• constant: the output variables all set constant equal to True, thereby restricting the user's ability to manipulate the value. Note that when calculations are being done they need to happen inside a with param.edit_constant(self) block
• precedence: input and output variables use precedence when there are instances where the variable does not need to be seen
The following is an example of an input parameter:

    H = param.Number(
        doc='H: wave height [{distance_unit}]',
        default=6.3,
        bounds=(0.1, 200)
    )

An example of an output variable is:

    L = param.Number(
        doc='L: Wavelength [{distance_unit}]',
        constant=True
    )

The model's main calculation functions mostly remained unchanged. However, the use of Param eliminated the need for code that handled type checking and bounds checks.
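To make the pattern above concrete, the sketch below shows how such parameters might sit inside a Parameterized model class. It is a minimal illustration only: the class name, the wave-period input, and the deep-water wavelength formula are assumptions for demonstration, not code taken from UCET.

    import param

    class DeepWaterWaveModel(param.Parameterized):
        # Hypothetical input and output parameters in the style described above.
        T = param.Number(doc='T: wave period [s]', default=8.0, bounds=(1.0, 1000.0))
        L = param.Number(doc='L: wavelength [m]', default=0.0, constant=True)

        def run_model(self):
            # Outputs are declared constant, so they are updated inside
            # param.edit_constant, as noted in the bullet list above.
            with param.edit_constant(self):
                self.L = 9.81 * self.T ** 2 / (2.0 * 3.141592653589793)

    model = DeepWaterWaveModel(T=10.0)
    model.run_model()   # model.L now holds the computed wavelength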
Class Hierarchy

UCET has twenty tools from six of the original seven functional areas of ACES. When we designed our class hierarchy, we focused on the visualization of the web application rather than the functional areas. Thus, each tool's class can be categorized as Base-Tool, Graph-Tool, Water-Tool, or Graph-Water-Tool. The Base-Tool has the coastal engineering models that do not have any water property inputs (such as water density) in the calculations and no graphical output. The Graph-Tool has the coastal engineering models that do not have any water property inputs in the calculations but have a graphical output. The Water-Tool has the coastal engineering models that have water property inputs in the calculations and no graphical output. The Graph-Water-Tool has the coastal engineering models that have water property inputs in the calculations and a graphical output. Figure 1 shows the flow of inheritance for each of those classes.

There are two general categories of classes in the UCET codebase: utility and tool-specific. Utility classes have methods and functions that are utilized across more than one tool. The Utility classes are:

• BaseDriver: holds the methods and functions that each tool needs to collect data, run coastal engineering models, and print data.
• WaterDriver: has the methods that make water density and water weight available to the models that need those inputs for the calculations.
• BaseGui: has the functions and methods for the visualization and utilization of all inputs and outputs within each tool's GUI.
• WaterTypeGui: has the widget for water selection.
• TabulatorDataGui: holds the functions and methods used for visualizing plots and the ability to download the data that is used for plotting.

Each coastal tool in UCET has two classes, the model class and the GUI class. The model class holds the input and output variables and the methods needed to run the model; it inherits directly from either the BaseDriver or the WaterTypeDriver. The tool's GUI class holds whatever GUI visualization information differs from the BaseGui, WaterTypeGui, and TabulatorDataGui classes. In figure 1 the model classes are labeled as Base-Tool Class, Graph-Tool Class, Water-Tool Class, and Graph-Water-Tool Class, and each has a corresponding GUI class.

Due to the inheritance in UCET, the first two questions that can be asked when adding a tool are: "Does this tool need water variables for the calculation?" and "Does this tool have a graph?". The developer can then add a model class and a GUI class and inherit based on figure 1. For instance, Linear Wave Theory is an application that yields first-order approximations for various parameters of wave motion as predicted by wave theory. It provides common items of interest such as water surface elevation, general wave properties, particle kinematics, and pressure as a function of wave height and period, water depth, and position in the wave form. This tool uses water density and has multiple graphs in its output. Therefore, Linear Wave Theory is considered a Graph-Water-Tool: its model class inherits from WaterTypeDriver, and its GUI class inherits from the linear wave theory model class, WaterTypeGui, and TabulatorDataGui.

GUI Implementation Using Panel and HoloViews

Each UCET tool has a GUI class where the Panel and HoloViews libraries are implemented. Panel is a hierarchical container that can lay out panes, widgets, or other Panels in an arrangement that forms an app or dashboard. The Pane is used to render any widget-like object such as Spinner, Tabulator, Button, CheckBox, Indicators, etc. Those widgets are used to gather user input and run the specific tool's model. UCET utilizes the following widgets to gather user input:

• Spinner: single numeric input values
• Tabulator: table input data
• CheckBox: true or false values
• Drop down: items that have a list of pre-selected values, such as which units to use

UCET utilizes indicators.Number, Tabulator, and graphs to visualize the outputs of the coastal engineering models. A single number is shown using indicators.Number, and graph data is displayed using the Tabulator widget to show the data behind the graph. The graphs are created using HoloViews and have tool options such as panning, zooming, and saving. Buttons are used to calculate, save the current run, and save the graph data.

All of these widgets are organized into five panels: title, options, inputs, outputs, and graph. BaseGui, WaterTypeGui, and TabulatorDataGui have methods that organize the widgets within the five panels that most tools follow. The "options" panel has a row that holds the drop-down selections for units and water type (if the tool is a Water-Tool); some tools have a second row in the "options" panel with other drop-down options. The input panel has two columns of spinner widgets with a calculation button at the bottom left. The output panel has two columns of indicators.Number for the single numeric output values; at the bottom of the output panel there is a button to "save the current profile". The graph panel is tabbed: the first tab shows the graph and the second tab shows the data provided within the graph. A visual outline of this can be seen in the following figure. Some of the UCET tools have more complicated input or output visualizations, and those tools' GUI classes add or modify methods to meet their needs.

Fig.: The general outline of a UCET tool GUI.
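As a rough illustration of this five-panel arrangement, the sketch below builds a skeleton layout with Panel. The widget names, labels, and layout choices are assumptions made for demonstration; they are not UCET's actual GUI classes.

    import pandas as pd
    import panel as pn

    pn.extension('tabulator')

    # Illustrative stand-ins for one tool's inputs and outputs.
    units = pn.widgets.Select(name='Units', options=['metric', 'english'])
    height = pn.widgets.Spinner(name='H: wave height [m]', value=6.3, step=0.1)
    calculate = pn.widgets.Button(name='Calculate', button_type='primary')
    wavelength = pn.indicators.Number(name='L: wavelength [m]', value=0.0, format='{value:.2f}')

    # Title, options, inputs, outputs, and a tabbed graph panel.
    app = pn.Column(
        pn.pane.Markdown('## Linear Wave Theory'),
        pn.Row(units),                                   # options panel
        pn.Row(pn.Column(height, calculate)),            # inputs panel
        pn.Row(wavelength),                              # outputs panel
        pn.Tabs(('Graph', pn.Spacer()),                  # graph panel, first tab
                ('Data', pn.widgets.Tabulator(pd.DataFrame({'x': [], 'y': []})))),
    )
    # app.servable() would expose the layout through `panel serve`.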
Current State

The developers have been documenting this project using GitHub and JIRA. UCET approaches software development from the perspective of someone within the field of research and development. Each tool within UCET is not inherently complex from the traditional software perspective. However, this codebase enables researchers to execute complex coastal engineering models in a user-friendly environment by leveraging open-source libraries in the scientific Python ecosystem such as Param, Panel, and HoloViews.

Currently, UCET is only deployed through the command-line panel serve command. UCET is awaiting the Security Technical Implementation Guide process before it can be launched as a website. As part of this security vetting process we plan to leverage continuous integration/continuous deployment (CI/CD) tools to automate the deployment process. While this process is happening, we have started to get feedback from coastal engineers to improve the tools' usability and accuracy and to add suggested features. To minimize the amount of computer-science knowledge the coastal engineers need, our team created a batch script. This script creates a conda environment, activates it, and runs the panel serve command to launch the app on a local host. The user only needs to click on the batch script for this to take place.

Other tests are being created to ensure the accuracy of the tools, using a testing framework to compare output from UCET with that of the original FORTRAN code. The biggest barrier to this testing strategy is getting data from the FORTRAN to compare with Python. Currently, there are tests for most of the tools that read a CSV file of input and output results from FORTRAN and compare them with what the Python code calculates.

Our team has also compiled an updated user guide on how to use the tool, what to expect from it, and a deeper description of any warning messages that might appear as the user adds input values. An example of a warning message would be: if a user chooses input values with which the application does not make physical sense, a warning message will appear under the output header and replace all output values. For a more concrete example, Linear Wave Theory has the vertical coordinate (z) and the water depth (d) as input values, and when their sum is less than zero the point is outside the waveform. Therefore, if a user makes a combination where the sum is less than zero, UCET will post a warning to tell the user that the point is outside the waveform. See the figure below for an example.

Fig.: An example of a warning message based on chosen inputs.
Results

Linear Wave Theory was described in the class hierarchy example. This Graph-Water-Tool utilizes most of the BaseGui methods. The biggest difference is that instead of having three graphs in the graph panel there is a plot-selector drop-down where the user can select which graph they want to see.

Windspeed Adjustment and Wave Growth provides a quick and simple estimate for wave growth over open-water and restricted fetches in deep and shallow water. This is a Base-Tool, as there are no graphs and no water variables in the calculations. This tool has four additional options in the options panel, where the user can select the wind observation type, fetch type, wave equation type, and whether knots are being used. Based on the selection of these options, the input and output variables change so that only what is used or calculated for those selections is shown.

Conclusion and Future Work

Thirty years ago, ACES was developed to provide improved design capabilities to Corps coastal specialists, and while these tools are still used today, it became more and more difficult for users to access them. Five years ago, there was a push to update the code base to one that coastal specialists would be more familiar with: MATLAB and Python. Within the last two years the RAD team was able to finalize the update so that users can access these tools without having years of programming experience. We were able to do this by utilizing classes, inheritance, and the Param, Panel, and HoloViews libraries. The use of inheritance allowed for shorter code bases and has made it so new tools can be added to the toolkit. Param, Panel, and HoloViews work cohesively together to not only run the models but also provide a simple interface.

Future work will involve expanding UCET to include current coastal engineering models and completing the security vetting process to deploy to a publicly accessible website. We plan to incorporate automated CI/CD to ensure smooth deployment of future versions. We also will continue to incorporate feedback from users and refine the code to ensure the application provides a quality user experience.

REFERENCES

[Leenknecht] David A. Leenknecht, Andre Szuwalski, and Ann R. Sherlock. 1992. Automated Coastal Engineering System - Technical Reference. Technical report. https://usace.contentdm.oclc.org/digital/collection/p266001coll1/id/2321/
[panel] "Panel: A High-Level App and Dashboarding Solution for Python." Panel 0.12.6 Documentation, Panel Contributors, 2019, https://panel.holoviz.org/.
[holoviz] "High-Level Tools to Simplify Visualization in Python." HoloViz 0.13.0 Documentation, HoloViz Authors, 2017, https://holoviz.org.
[UG] David A. Leenknecht, et al. "Automated Tools for Coastal Engineering." Journal of Coastal Research, vol. 11, no. 4, Coastal Education & Research Foundation, Inc., 1995, pp. 1108-1124. https://usace.contentdm.oclc.org/digital/collection/p266001coll1/id/2321/
[shankar] N.J. Shankar, M.P.R. Jayaratne. Wave run-up and overtopping on smooth and rough slopes of coastal structures. Ocean Engineering, Volume 30, Issue 2, 2003, Pages 221-238, ISSN 0029-8018, https://doi.org/10.1016/S0029-8018(02)00016-1

Fig. 1: Screen shot of Linear Wave Theory.
Fig. 2: Screen shot of Windspeed Adjustment and Wave Growth.

Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI

Luigi Cruz‡∗, Wael Farah‡, Richard Elkins‡

Abstract—A common technique adopted by the Search For Extraterrestrial Intelligence (SETI) community is monitoring electromagnetic radiation for signs of extraterrestrial technosignatures using ground-based radio observatories. The analysis is made using Python-based software called TurboSETI to detect narrowband drifting signals inside the recordings that can indicate a technosignature. The data stream generated by a telescope can easily reach the rate of terabits per second. Our goal was to improve the processing speed by writing a GPU-accelerated backend in addition to the original CPU-based implementation of the de-doppler algorithm used to integrate the power of drifting signals. We discuss how we ported a CPU-only program to leverage the parallel capabilities of a GPU using CuPy, Numba, and custom CUDA kernels. The accelerated backend reached a speed-up of an order of magnitude over the CPU implementation.

Index Terms—gpu, numba, cupy, seti, turboseti

* Corresponding author: lfcruz@seti.org
‡ SETI Institute

Copyright © 2022 Luigi Cruz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
1. Introduction

The Search for Extraterrestrial Intelligence (SETI) is a broad term used to describe the effort of locating any scientific proof of past or present technology that originated beyond the bounds of Earth. SETI can be performed in a plethora of ways: actively, by deploying orbiters and rovers around planets and moons within the solar system, or passively, by either searching for biosignatures in exoplanet atmospheres or "listening" to technologically capable extraterrestrial civilizations. One of the most common techniques adopted by the SETI community is monitoring electromagnetic radiation for narrowband signs of technosignatures using ground-based radio observatories. This search can be performed in multiple ways: with equipment primarily built for this task, like the Allen Telescope Array (California, USA); by renting observation time; or in the background while the primary user is conducting other observations. Other radio observatories useful for this search include the MeerKAT Telescope (Northern Cape, South Africa), the Green Bank Telescope (West Virginia, USA), and the Parkes Telescope (New South Wales, Australia).

The operation of a radio telescope is similar to that of an optical telescope. Instead of using optics to concentrate light onto an optical sensor, a radio telescope operates by concentrating electromagnetic waves onto an antenna using a large reflective structure called a "dish" ([Reb82]). The interaction between the metallic antenna and the electromagnetic wave generates a faint electrical current. This effect is then quantized by an analog-to-digital converter as voltages and transmitted to processing logic that extracts useful information from it. The data stream generated by a radio telescope can easily reach the rate of terabits per second because of the ultra-wide bandwidth of the radio spectrum. The current workflow utilized by Breakthrough Listen, the largest scientific research program aimed at finding evidence of extraterrestrial intelligence, consists of pre-processing and storing the incoming data as frequency-time binary files ([LCS+19]) in persistent storage for later analysis. This post-analysis is made possible using Python-based software called TurboSETI ([ESF+17]) to detect narrowband signals that could be drifting in frequency owing to the relative radial velocity between the observer on Earth and the transmitter. The offline processing speed of TurboSETI is directly related to the scientific output of an observation. Each voltage file ingested by TurboSETI is often on the order of a few hundred gigabytes. To process data efficiently without Python overhead, the program uses NumPy for near machine-level performance. To measure a potential signal's drift rate, TurboSETI uses a de-doppler algorithm to align the frequency axis according to a pre-set drift rate. Another algorithm called "hitsearch" ([ESF+17]) is then utilized to identify any signal present in the recorded spectrum. These two algorithms are the most resource-hungry elements of the pipeline, consuming almost 90% of the running time.

2. Approach

Multiple methods were utilized in this effort to write a GPU-accelerated backend and optimize the CPU implementation of TurboSETI. In this section, we describe the three main methods.

2.1. CuPy

The original implementation of TurboSETI heavily depends on NumPy ([HMvdW+20]) for data processing. To keep the number of modifications as low as possible, we implemented the GPU-accelerated backend using CuPy ([OUN+17]). This open-source library offers GPU acceleration backed by NVIDIA CUDA and AMD ROCm while using a NumPy-style API. This enabled us to reuse most of the code between the CPU and GPU-based implementations.
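The sketch below illustrates the kind of array-module switch this approach enables. It is not TurboSETI's actual code: the function name, the integer-channel shift-and-sum, and the arguments are simplified assumptions meant only to show how a NumPy-style API lets one code path serve both backends.

    import numpy as np

    try:
        import cupy as cp
    except ImportError:
        cp = None                                     # CPU-only fallback

    def dedoppler_sum(spectrogram, drift, use_gpu=False):
        # Toy shift-and-sum over time for a single drift rate (illustrative only).
        xp = cp if (use_gpu and cp is not None) else np
        data = xp.asarray(spectrogram)                # shape: (n_time, n_freq)
        total = xp.zeros(data.shape[1], dtype=data.dtype)
        for t in range(data.shape[0]):
            # Shift each time step by an integer number of channels and accumulate.
            total += xp.roll(data[t], -int(round(drift * t)))
        return total if xp is np else cp.asnumpy(total)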
2.2. Numba

Some computationally heavy methods of the original CPU-based implementation of TurboSETI were written in Cython. This approach has disadvantages: the developer has to be familiar with Cython syntax to alter the code, and the code requires additional logic to be compiled at installation time. Consequently, it was decided to replace Cython with pure Python methods decorated with the Numba ([LPS15]) accelerator. By leveraging the Just-In-Time (JIT) compiler from the Low Level Virtual Machine (LLVM), Numba can compile Python code into assembly code as well as apply Single Instruction/Multiple Data (SIMD) acceleration instructions to achieve near machine-level speeds.
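As a minimal illustration of this pattern (not one of TurboSETI's actual routines), a hot loop written in plain Python can be handed to Numba's JIT with a single decorator; the function below and its threshold logic are assumptions for demonstration.

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def count_hits(snr, threshold):
        # Compiled to machine code by Numba's LLVM-based JIT on first call,
        # standing in for the kind of Cython routine that was replaced.
        hits = 0
        for i in range(snr.shape[0]):
            if snr[i] > threshold:
                hits += 1
        return hits

    # Example: count_hits(np.random.rand(10_000_000).astype(np.float32), 0.999)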
2.3. Single-Precision Floating Point

The original implementation of the software handled the input data as double-precision floating-point numbers. This caused all the mathematical operations to take significantly longer because of the extended precision. The ultimate precision of the output product is inherently limited by the precision of the original input data, which in most cases is represented by an 8-bit signed integer. Therefore, the addition of a single-precision floating-point mode decreased the processing time without compromising the useful precision of the output data.

3. Results

To test the speed improvements between implementations we used files from previous observations coming from different observatories. Table 1 shows the processing times for three different files in double-precision mode. We can see that the CPU implementation based on Numba is measurably faster than the original CPU implementation based on Cython. At the same time, the GPU-accelerated backend processed the data 6.8 to 9.3 times faster than the original CPU-based implementation.

TABLE 1: Double-precision (float64) processing time benchmark with the Cython, Numba, and CuPy implementations.

    Impl.    Device    File A      File B      File C
    Cython   CPU       0.44 min    25.26 min   23.06 min
    Numba    CPU       0.36 min    20.67 min   22.44 min
    CuPy     GPU       0.05 min    2.73 min    3.40 min

Table 2 shows the same results as Table 1 but with single-precision floating points. The original Cython implementation was left out because it does not support single-precision mode. Here, the same data was processed 7.5 to 10.6 times faster than with the Numba CPU-based implementation.

TABLE 2: Single-precision (float32) processing time benchmark with the Numba and CuPy implementations.

    Impl.    Device    File A      File B      File C
    Numba    CPU       0.26 min    16.13 min   16.15 min
    CuPy     GPU       0.03 min    1.52 min    2.14 min

To illustrate the processing time improvement, a single observation containing 105 GB of data was processed in 12 hours by the original CPU-based TurboSETI implementation on an Intel i7-7700K CPU, and in just 1 hour and 45 minutes by the GPU-accelerated backend on an NVIDIA GTX 1070 Ti GPU.

4. Conclusion

The original implementation of TurboSETI worked exclusively on the CPU to process data. We implemented a GPU-accelerated backend to leverage the massive parallelization capabilities of a graphical device. The benchmark performed shows that the new CPU and GPU implementations take significantly less time to process observation data, resulting in more science being produced. Based on the results, the recommended configuration to run the program is with single-precision calculations on a GPU device.

REFERENCES

[ESF+17] J. Emilio Enriquez, Andrew Siemion, Griffin Foster, Vishal Gajjar, Greg Hellbourg, Jack Hickish, Howard Isaacson, Danny C. Price, Steve Croft, David DeBoer, Matt Lebofsky, David H. E. MacMahon, and Dan Werthimer. The Breakthrough Listen search for intelligent life: 1.1–1.9 GHz observations of 692 nearby stars. The Astrophysical Journal, 849(2):104, Nov 2017. https://ui.adsabs.harvard.edu/abs/2017ApJ...849..104E/abstract, doi:10.3847/1538-4357/aa8d1b.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.
[LCS+19] Matthew Lebofsky, Steve Croft, Andrew P. V. Siemion, Danny C. Price, J. Emilio Enriquez, Howard Isaacson, David H. E. MacMahon, David Anderson, Bryan Brzycki, Jeff Cobb, Daniel Czech, David DeBoer, Julia DeMarines, Jamie Drew, Griffin Foster, Vishal Gajjar, Nectaria Gizani, Greg Hellbourg, Eric J. Korpela, and Brian Lacki. The Breakthrough Listen search for intelligent life: public data, formats, reduction, and archiving. Publications of the Astronomical Society of the Pacific, 131(1006):124505, Nov 2019. https://arxiv.org/abs/1906.07391, doi:10.1088/1538-3873/ab3e82.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM '15, New York, NY, USA, 2015. Association for Computing Machinery. https://doi.org/10.1145/2833157.2833162, doi:10.1145/2833157.2833162.
[OUN+17] Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido, and Crissman Loomis. CuPy: a NumPy-compatible library for NVIDIA GPU calculations. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017. http://learningsys.org/nips17/assets/papers/paper_16.pdf.
[Reb82] Grote Reber. Cosmic Static, pages 61–69. Springer Netherlands, Dordrecht, 1982. https://doi.org/10.1007/978-94-009-7752-5_6, doi:10.1007/978-94-009-7752-5_6.

Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration

Pi-Yueh Chuang‡∗, Lorena A. Barba‡

* Corresponding author: pychuang@gwu.edu
‡ Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, DC 20052, USA

Copyright © 2022 Pi-Yueh Chuang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract—Though PINNs (physics-informed neural networks) are now deemed a complement to traditional CFD (computational fluid dynamics) solvers rather than a replacement, their ability to solve the Navier-Stokes equations without given data is still of great interest. This report presents our not-so-successful experiments of solving the Navier-Stokes equations with PINN as a replacement for traditional solvers. We aim, with our experiments, to prepare readers for the challenges they may face if they are interested in data-free PINN. In this work, we used two standard flow problems: the 2D Taylor-Green vortex at Re = 100 and 2D cylinder flow at Re = 200. The PINN method solved the 2D Taylor-Green vortex problem with acceptable results, and we used this flow as an accuracy and performance benchmark. About 32 hours of training were required for the PINN method's accuracy to match the accuracy of a 16 × 16 finite-difference simulation, which took less than 20 seconds. The 2D cylinder flow, on the other hand, did not produce a physical solution. The PINN method behaved like a steady-flow solver and did not capture the vortex shedding phenomenon. By sharing our experience, we would like to emphasize that the PINN method is still a work in progress, especially in terms of solving flow problems without any given data. More work is needed to make PINN feasible for real-world problems in such applications. (Reproducibility package: [Chu22].)

Index Terms—computational fluid dynamics, deep learning, physics-informed neural network

1. Introduction

Recent advances in computing and programming techniques have motivated practitioners to revisit deep learning applications in computational fluid dynamics (CFD). We use the verb "revisit" because deep learning applications in CFD already existed going back to at least the 1990s, for example, using neural networks as surrogate models ([LS], [FS]). Another example is the work of Lagaris and colleagues ([LLF]) on solving partial differential equations with fully-connected neural networks back in 1998. Similar work with radial basis function networks can be found in reference [LLQH]. Nevertheless, deep learning applications in CFD did not get much attention until this decade, thanks to modern computing technology, including GPUs, cloud computing, high-level libraries like PyTorch and TensorFlow, and their Python APIs.

Solving partial differential equations with deep learning is particularly interesting to CFD researchers and practitioners. The PINN (physics-informed neural network) method denotes an approach to incorporating deep learning in CFD applications where solving partial differential equations plays the key role. These partial differential equations include the well-known Navier-Stokes equations—one of the Millennium Prize Problems. The universal approximation theorem ([Hor]) implies that neural networks can model the solution to the Navier-Stokes equations with high fidelity and capture complicated flow details as long as the networks are big enough. The idea of PINN methods can be traced back to [DPT], while the name PINN was coined in [RPK]. Human-provided data are not necessary in applying PINN [LMMK], making it a potential alternative to traditional CFD solvers. Sometimes it is branded as unsupervised learning—it does not rely on human-provided data, making it sound very "AI." It is now common to see headlines like "AI has cracked the Navier-Stokes equations" in recent popular science articles ([Hao]).

Though data-free PINN as an alternative to traditional CFD solvers may sound attractive, PINN can also be used in data-driven configurations, for which it is better suited. Cai et al. [CMW+] state that PINN is not meant to be a replacement for existing CFD solvers due to its inferior accuracy and efficiency. The most useful applications of PINN should be those with some given data, so that the models are trained against the data. For example, when we have experimental measurements or partial simulation results (coarse-grid data, limited numbers of snapshots, etc.) from traditional CFD solvers, PINN may be useful to reconstruct the flow or to serve as a surrogate model.
Nevertheless, data-free PINN may offer some advantages over traditional solvers, and using data-free PINN to replace traditional solvers is still of great interest to researchers (e.g., [KDYI]). First, it is a mesh-free scheme, which benefits engineering problems where fluid flows interact with objects of complicated geometries. Simulating these fluid flows with traditional numerical methods usually requires high-quality unstructured meshes with time-consuming human intervention in the pre-processing stage before the actual simulations. The second benefit of PINN is that the trained models approximate the governing equations' general solutions, meaning there is no need to solve the equations repeatedly for different flow parameters. For example, a flow model taking boundary velocity profiles as its input arguments can predict flows under different boundary velocity profiles after training. Conventional numerical methods, on the contrary, require repeated simulations, each one covering one boundary velocity profile. This feature could help in situations like engineering design optimization: the process of running sets of experiments to conduct parameter sweeps and find the optimal values or geometries for products. Given these benefits, researchers continue studying and improving the usability of data-free PINN (e.g., [WYP], [DZ], [WTP], [SS]).

Data-free PINN, however, is not ready nor meant to replace traditional CFD solvers. This claim may be obvious to researchers experienced in PINN, but it may not be clear to others, especially to CFD end-users without ample expertise in numerical methods. Even in literature that aims to improve PINN, it is common to see only the success stories with simple CFD problems. Important information concerning the feasibility of PINN in practical and real-world applications is often missing from these success stories. For example, few reports discuss the required computing resources, the computational cost of training, the convergence properties, or the error analysis of PINN. PINN suffers from performance and solvability issues due to the need for high-order automatic differentiation and multi-objective nonlinear optimization. Evaluating high-order derivatives using automatic differentiation enlarges the computational graphs of neural networks. And multi-objective optimization, which reduces all the residuals of the differential equations, initial conditions, and boundary conditions, makes the training difficult to converge to small-enough loss values. Fluid flows are sensitive nonlinear dynamical systems in which a small change or error in inputs may produce a very different flow field. So to get correct solutions, the optimization in PINN needs to minimize the loss to values very close to zero, further compromising the method's solvability and performance.

This paper reports on our not-so-successful PINN story as a lesson learned for readers, so they can be aware of the challenges they may face if they consider using data-free PINN in real-world applications. Our story includes two computational experiments as case studies to benchmark the PINN method's accuracy and computational performance. The first case study is a Taylor-Green vortex, solved successfully though not to our complete satisfaction. We will discuss the performance of PINN using this case study. The second case study, flow over a cylinder, did not even result in a physical solution. We will discuss the frustration we encountered with PINN in this case study.

We built our PINN solver with the help of NVIDIA's Modulus library ([noa]). Modulus is a high-level Python package built on top of PyTorch that helps users develop PINN-based differential equation solvers. In each case study, we also carried out simulations with our CFD solver, PetIBM ([CMKAB18]). PetIBM is a traditional solver using staggered-grid finite difference methods with MPI parallelization and GPU computing. The PetIBM simulations in each case study served as baseline data. For all cases, configurations, post-processing scripts, and required Singularity image definitions can be found at reference [Chu22].

This paper is structured as follows: the second section briefly describes the PINN method and an analogy to traditional CFD methods. The third and fourth sections present our computational experiments of the Taylor-Green vortex in 2D and a 2D laminar cylinder flow with vortex shedding. Most discussions happen in the corresponding case studies. The last section presents the conclusion and discussions that did not fit into either one of the cases.

2. Solving Navier-Stokes equations with PINN

The incompressible Navier-Stokes equations in vector form are composed of the continuity equation:

    \nabla \cdot \vec{U} = 0    (1)

and the momentum equation:

    \frac{\partial \vec{U}}{\partial t} + (\vec{U} \cdot \nabla)\vec{U} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \vec{U} + \vec{g}    (2)

where ρ = ρ(x, t), ν = ν(x, t), and p = p(x, t) are scalar fields denoting density, kinematic viscosity, and pressure, respectively. x denotes the spatial coordinate, and x = [x, y]^T in two dimensions. The density and viscosity fields are usually known and given, while the pressure field is unknown. U = U(x, t) = [u(x, y, t), v(x, y, t)]^T is the vector field of flow velocity. All of them are functions of the spatial coordinate in the computational domain Ω and of time up to a given limit T. The gravitational field g may also be a function of space and time, though it is usually a constant. A solution to the Navier-Stokes equations is subject to an initial condition and boundary conditions:

    \vec{U}(\vec{x}, t) = \vec{U}_0(\vec{x}), \quad \forall \vec{x} \in \Omega, \; t = 0
    \vec{U}(\vec{x}, t) = \vec{U}_\Gamma(\vec{x}, t), \quad \forall \vec{x} \in \Gamma, \; t \in [0, T]    (3)
    p(\vec{x}, t) = p_\Gamma(\vec{x}, t), \quad \forall \vec{x} \in \Gamma, \; t \in [0, T]

where Γ represents the boundary of the computational domain.
2.1. The PINN method

The basic form of the PINN method ([RPK], [CMW+]) starts from approximating U and p with a neural network:

    \begin{bmatrix} \vec{U} \\ p \end{bmatrix}(\vec{x}, t) \approx G(\vec{x}, t; \Theta)    (4)

Here we use a single network that predicts both the pressure and velocity fields. It is also possible to use separate networks for them. Later in this work, we will use G_U and G_p to denote the predicted velocity and pressure from the neural network. Θ at this point represents the free parameters of the network.

To determine the free parameters Θ, ideally, we hope the approximate solution gives zero residuals for equations (1), (2), and (3). That is,

    r_1(\vec{x}, t; \Theta) \equiv \nabla \cdot G_U = 0
    r_2(\vec{x}, t; \Theta) \equiv \frac{\partial G_U}{\partial t} + (G_U \cdot \nabla) G_U + \frac{1}{\rho}\nabla G_p - \nu \nabla^2 G_U - \vec{g} = 0
    r_3(\vec{x}; \Theta) \equiv G_U \big\rvert_{t=0} - \vec{U}_0 = 0    (5)
    r_4(\vec{x}, t; \Theta) \equiv G_U - \vec{U}_\Gamma = 0, \quad \forall \vec{x} \in \Gamma
    r_5(\vec{x}, t; \Theta) \equiv G_p - p_\Gamma = 0, \quad \forall \vec{x} \in \Gamma

The set of desired parameters, Θ = θ, is the common zero root of all the residuals. The derivatives of G with respect to x and t are usually obtained using automatic differentiation. Nevertheless, it is possible to use analytical derivatives when the chosen network architecture is simple enough, as reported in early literature ([LLF], [LLQH]).

If the residuals in (5) are not complicated, and if the number of parameters, N_Θ, is small enough, we may numerically find the zero root by solving a system of N_Θ nonlinear equations generated from a suitable set of N_Θ spatial-temporal points. However, this scenario rarely happens, as G is usually highly complicated and N_Θ is large. Moreover, we do not even know whether such a zero root exists for the equations in (5).

Instead, in PINN, the condition is relaxed. We do not seek the zero root of (5) but just hope to find a set of parameters that make the residuals sufficiently close to zero. Consider the sum of the l2 norms of the residuals:

    r(\vec{x}, t; \Theta = \theta) \equiv \sum_{i=1}^{5} \lVert r_i(\vec{x}, t; \Theta = \theta) \rVert_2, \quad \forall \vec{x} \in \Omega, \; t \in [0, T]    (6)

The θ that makes the residuals closest to zero (or even equal to zero if such a θ exists) also makes (6) minimal, because r(x, t; Θ) ≥ 0. In other words,

    \theta = \underset{\Theta}{\arg\min}\; r(\vec{x}, t; \Theta), \quad \forall \vec{x} \in \Omega, \; t \in [0, T]    (7)

This poses a fundamental difference between the PINN method and traditional CFD schemes, making it potentially more difficult for the PINN method to achieve the same accuracy as the traditional schemes. We will discuss this more in section 3. Note that in practice, each loss term on the right-hand side of equation (6) is weighted. We ignore the weights here for demonstration purposes.

To solve (7), theoretically, we can use any number of spatial-temporal points, which eases the need for computational resources compared to finding the zero root directly. Gradient-descent-based optimizers further reduce the computational cost, especially in terms of memory usage and the difficulty of parallelization. Alternatively, quasi-Newton methods may work, but only when N_Θ is small enough.

However, even though equation (7) may be solvable, it is still a significantly expensive task. While typical data-driven learning requires one back-propagation pass on the derivatives of the loss function, here automatic differentiation is needed to evaluate the derivatives of G with respect to x and t. The first-order derivatives require one back-propagation on the network, while the second-order derivatives present in the diffusion term ∇²G_U require an additional back-propagation on the first-order derivatives' computational graph. Finally, to update parameters in an optimizer, the gradients of G with respect to the parameters Θ require another back-propagation on the graph of the second-order derivatives. This all leads to a very large computational graph. We will see the performance of the PINN method in the case studies.

In summary, when viewing the PINN method as supervised machine learning, the inputs of the network are spatial-temporal coordinates, and the outputs are the physical quantities of interest. The loss or objective functions in PINN are governing equations that regulate how the target physical quantities should behave. The use of governing equations eliminates the need for true answers. A trivial example is using Bernoulli's equation as the loss function, i.e., loss = u²/(2g) + p/(ρg) − H_0 + z(x), where a neural network predicts the flow speed u and pressure p at a given location x along a streamline. (The gravitational acceleration g, density ρ, energy head H_0, and elevation z(x) are usually known and given.) Such a loss function regulates the relationship between the predicted u and p and does not need true answers for the two quantities. Unlike Bernoulli's equation, most governing equations in physics are differential equations (e.g., heat equations). The main difference is that the PINN method then needs automatic differentiation to evaluate the loss. Regardless of the form of the governing equations, spatial-temporal coordinates are the only data required during training. Hence, throughout this paper, training data means spatial-temporal points and does not involve any true answers to the predicted quantities. (Note that in some literature, the PINN method is applied to applications that do need true answers, see [CMW+]. These applications are out of scope here.)
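To make the chain of back-propagations concrete, the sketch below shows how one residual term could be assembled with PyTorch's automatic differentiation. It is a bare-bones illustration under assumed sizes and a toy network, not the Modulus-based solver used in this paper.

    import torch

    # Toy network G(x, y, t) -> (u, v, p), standing in for the solver's model.
    net = torch.nn.Sequential(
        torch.nn.Linear(3, 64), torch.nn.SiLU(),
        torch.nn.Linear(64, 64), torch.nn.SiLU(),
        torch.nn.Linear(64, 3),
    )

    xyt = torch.rand(1024, 3, requires_grad=True)         # batch of spatial-temporal points
    u, v, p = net(xyt).unbind(dim=1)

    grad = lambda f: torch.autograd.grad(f.sum(), xyt, create_graph=True)[0]
    du = grad(u)                                           # columns: du/dx, du/dy, du/dt
    dv = grad(v)
    d2u_dx2 = grad(du[:, 0])[:, 0]                         # second back-propagation, as needed by the diffusion term

    r1 = du[:, 0] + dv[:, 1]                               # continuity residual, r1 in eq. (5)
    loss = (r1 ** 2).mean()                                # one (unweighted) term of eq. (6)
    loss.backward()                                        # further pass: gradients w.r.t. the network parameters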
2.2. An analogy to conventional numerical methods

For readers with a background in numerical methods for partial differential equations, we would like to make an analogy between traditional numerical methods and PINN. In obtaining strong solutions to differential equations, we can describe the solution workflow of most numerical methods in five stages:

1) Designing the approximate solution with undetermined parameters
2) Choosing a proper approximation for the derivatives
3) Obtaining the so-called modified equation by substituting the approximate derivatives into the differential equations and initial/boundary conditions
4) Generating a system of linear/nonlinear algebraic equations
5) Solving the system of equations

For example, to solve ∇²U(x) = s(x), the most naive spectral method ([Tre]) approximates the solution with

    U(x) \approx G(x) = \sum_{i=1}^{N} c_i \phi_i(x)

where the c_i represent undetermined parameters and the φ_i(x) denote a set of either polynomials, trigonometric functions, or complex exponentials. Next, obtaining the first derivative of U is straightforward—we can just assume

    U'(x) \approx G'(x) = \sum_{i=1}^{N} c_i \phi_i'(x)

The second-order derivative may be more tricky. One can assume U''(x) ≈ G''(x) = \sum_{i=1}^{N} c_i \phi_i''(x); or, another choice for nodal bases (i.e., when the φ_i(x) are chosen such that c_i ≡ G(x_i)) is U''(x) ≈ \sum_{i=1}^{N} c_i G'(x_i). Because the φ_i(x) are known, the derivatives are analytical. After substituting the approximate solution and derivatives into the target differential equation, we need to solve for the parameters c_1, ..., c_N. We do so by selecting N points from the computational domain and creating a system of N linear equations:

    \begin{bmatrix} \phi_1''(x_1) & \cdots & \phi_N''(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1''(x_N) & \cdots & \phi_N''(x_N) \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_N \end{bmatrix} - \begin{bmatrix} s(x_1) \\ \vdots \\ s(x_N) \end{bmatrix} = 0    (8)

Finally, we determine the parameters by solving this linear system. Though this example uses a spectral method, the workflow also applies to many other numerical methods, such as finite difference methods, which can be recast as a form of spectral method.

With this workflow in mind, it should be easy to see the analogy between PINN and conventional numerical methods. Aside from using a much more complicated approximate solution, the major difference lies in how the unknown parameters in the approximate solution are determined. While traditional methods solve the zero-residual conditions, PINN relies on searching for the minimal residuals. A secondary difference is how derivatives are approximated. Conventional numerical methods use analytical or numerical differentiation of the approximate solutions, while the PINN method usually depends on automatic differentiation. This difference may be minor, as we are still able to use analytical differentiation for simple network architectures with PINN. However, automatic differentiation is a major factor affecting PINN's performance.
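A tiny numerical rendition of this workflow, under assumed choices (a sine basis on (0, 1), equispaced interior collocation points, and a manufactured right-hand side), might look like the following; it is only meant to make equation (8) concrete.

    import numpy as np

    # Basis phi_i(x) = sin(i*pi*x), so phi_i''(x) = -(i*pi)**2 * sin(i*pi*x).
    N = 6
    x = np.linspace(0.0, 1.0, N + 2)[1:-1]               # interior collocation points
    i = np.arange(1, N + 1)
    s = -(np.pi ** 2) * np.sin(np.pi * x)                # RHS chosen so the exact solution is sin(pi*x)

    A = -((i * np.pi) ** 2) * np.sin(np.pi * np.outer(x, i))   # A[j, k] = phi_k''(x_j), as in eq. (8)
    c = np.linalg.solve(A, s)                            # c is approximately [1, 0, 0, ...]

    approx = lambda xq: np.sin(np.pi * np.outer(np.atleast_1d(xq), i)) @ c   # G(x) = sum_i c_i phi_i(x)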
3. Case 1: Taylor-Green vortex: accuracy and performance

3.1. 2D Taylor-Green vortex

The Taylor-Green vortex represents a family of flows with a specific form of analytical initial flow conditions in both 2D and 3D. The 2D Taylor-Green vortex has closed-form analytical solutions with periodic boundary conditions, and hence it is a standard benchmark case for verifying CFD solvers. In this work, we used the following 2D Taylor-Green vortex:

    u(x, y, t) = V_0 \cos\!\left(\frac{x}{L}\right) \sin\!\left(\frac{y}{L}\right) \exp\!\left(-2 \frac{\nu}{L^2} t\right)
    v(x, y, t) = -V_0 \sin\!\left(\frac{x}{L}\right) \cos\!\left(\frac{y}{L}\right) \exp\!\left(-2 \frac{\nu}{L^2} t\right)    (9)
    p(x, y, t) = -\frac{\rho}{4} V_0^2 \left[\cos\!\left(\frac{2x}{L}\right) + \cos\!\left(\frac{2y}{L}\right)\right] \exp\!\left(-4 \frac{\nu}{L^2} t\right)

where V_0 represents the peak (and also the lowest) velocity at t = 0. Other symbols carry the same meaning as those in section 2. The periodic boundary conditions were applied at x = −Lπ, x = Lπ, y = −Lπ, and y = Lπ. We used the following parameters in this work: V_0 = L = ρ = 1.0 and ν = 0.01. These parameters correspond to Reynolds number Re = 100. Figure 1 shows a snapshot of the velocity at t = 32.

Fig. 1: Contours of u and v at t = 32 to demonstrate the solution of the 2D Taylor-Green vortex.
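For reference, equation (9) translates directly into a small NumPy helper (a convenience sketch, not code taken from the paper's reproducibility package):

    import numpy as np

    def taylor_green(x, y, t, V0=1.0, L=1.0, rho=1.0, nu=0.01):
        # Analytical 2D Taylor-Green vortex of equation (9); defaults correspond to Re = 100.
        decay = np.exp(-2.0 * nu * t / L ** 2)
        u = V0 * np.cos(x / L) * np.sin(y / L) * decay
        v = -V0 * np.sin(x / L) * np.cos(y / L) * decay
        p = -0.25 * rho * V0 ** 2 * (np.cos(2 * x / L) + np.cos(2 * y / L)) * decay ** 2
        return u, v, p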
3.2. Solver and runtime configurations

The neural network used in the PINN solver is a fully-connected neural network with 6 hidden layers and 256 neurons per layer. The activation functions are SiLU ([HG]). We used Adam for optimization, and its initial parameters are the defaults from PyTorch. The learning rate decayed exponentially through PyTorch's ExponentialLR with gamma equal to 0.95^(1/10000). Note that we did not conduct hyperparameter optimization, given the computational cost. The hyperparameters are mostly the defaults used by the 3D Taylor-Green example in Modulus ([noa]).

The training data were simply spatial-temporal coordinates. Before the training, the PINN solver pre-generated 18,432,000 spatial-temporal points to evaluate the residuals of the Navier-Stokes equations (r_1 and r_2 in equation (5)). These training points were randomly chosen from the spatial domain [−π, π] × [−π, π] and the temporal domain (0, 100]. The solver used only 18,432 points in each training iteration, making it a batch training. For the residual of the initial condition (r_3), the solver also pre-generated 18,432,000 random spatial points and used only 18,432 per iteration. Note that for r_3, the points were distributed in space only, because t = 0 is a fixed condition. Because of the periodic boundary conditions, the solver did not require any training points for r_4 and r_5.

The hardware used for the PINN solver was a single node of NVIDIA's DGX-A100, equipped with 8 A100 GPUs (80GB variants). We carried out the training using different numbers of GPUs to investigate the performance of the PINN solver. All cases were trained up to 1 million iterations. Note that the parallelization was done with weak scaling, meaning that increasing the number of GPUs does not reduce the workload of each GPU. Instead, increasing the number of GPUs increases the total and per-iteration numbers of training points. Therefore, our expected outcome was that all cases would require about the same wall time to finish, while the residual from using 8 GPUs would converge the fastest.

After training, the PINN solver's prediction errors (i.e., accuracy) were evaluated on the cell centers of a 512 × 512 Cartesian mesh against the analytical solution. With these spatially distributed errors, we calculated the L2 error norm for a given t:

    L_2 = \sqrt{\int_\Omega \mathrm{error}(x, y)^2 \, \mathrm{d}\Omega} \approx \sqrt{\sum_i \sum_j \mathrm{error}_{i,j}^2 \, \Delta\Omega_{i,j}}    (10)

where i and j are the indices of a cell center in the Cartesian mesh, and ΔΩ_{i,j} is the corresponding cell area, 4π²/512² in this case.

We compared accuracy and performance against results using PetIBM. All PetIBM simulations in this section were done with 1 K40 GPU and 6 CPU cores (Intel i7-5930K) on our old lab workstation. We carried out 7 PetIBM simulations with different spatial resolutions: 2^k × 2^k for k = 4, 5, ..., 10. The time step size for each spatial resolution was Δt = 0.1/2^(k−4).

A special note should be made here: the PINN solver used single-precision floats, while PetIBM used double-precision floats. This might sound unfair. However, this discrepancy does not change the qualitative findings and conclusions, as we will see later.
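The error metric in equation (10) amounts to a weighted root-sum-square on the evaluation mesh; a small stand-alone sketch (with the PINN prediction replaced by a placeholder array) is shown below.

    import numpy as np

    n, nu, t = 512, 0.01, 32.0
    xc = -np.pi + (np.arange(n) + 0.5) * (2.0 * np.pi / n)       # cell-centered coordinates
    X, Y = np.meshgrid(xc, xc)
    cell_area = (2.0 * np.pi / n) ** 2                            # = 4*pi**2 / 512**2

    u_exact = np.cos(X) * np.sin(Y) * np.exp(-2.0 * nu * t)       # analytical u from eq. (9), V0 = L = 1
    u_predicted = u_exact                                         # placeholder for the solver's prediction
    l2_error = np.sqrt(np.sum((u_predicted - u_exact) ** 2) * cell_area)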
3.3. Results

Figure 2 shows the convergence history of the total residuals (equation (6)). Using more GPUs in weak scaling (i.e., more training points) did not accelerate the convergence, contrary to what we expected: all cases converged at a similar rate. Though without a quantitative criterion or justification, we considered that further training would not improve the accuracy. Figure 3 gives a visual taste of what the predictions from the neural network look like.

Fig. 2: Total residuals (loss) with respect to training iterations.
Fig. 3: Contours of u and v at t = 32 from the PINN solver.

The result visually agrees with that in figure 1. However, as shown in figure 4, the error magnitudes from the PINN solver are much higher than those from PetIBM. Figure 4 shows the prediction errors with respect to t. We only present the error in the u velocity, as those for v and p are similar. The accuracy of the PINN solver is similar to that of the 16 × 16 simulation with PetIBM. Using more GPUs, which implies more training points, does not improve the accuracy.

Fig. 4: L2 error norm versus simulation time.

Regardless of the magnitudes, the trends of the errors with respect to t are similar for both PINN and PetIBM. For PetIBM, the trend shown in figure 4 indicates that the temporal error is bounded and the scheme is stable. However, this concept does not apply to PINN, as it does not use any time-marching scheme. What this means for PINN is still unclear to us. Nevertheless, it shows that PINN is able to propagate the influence of initial conditions to later times, which is a crucial factor for solving hyperbolic partial differential equations.

Figure 5 shows the computational cost of PINN and PetIBM in terms of the desired accuracy versus the required wall time. We only show the PINN results of 8 A100 GPUs in this figure. We believe this type of plot may help evaluate the computational cost in engineering applications. According to the figure, for example, achieving an accuracy of 10^-3 at t = 2 requires less than 1 second for PetIBM with 1 K40 GPU and 6 CPU cores, but it requires more than 8 hours for PINN with at least 1 A100 GPU.

Fig. 5: L2 error norm versus wall time.

Table 1 lists the wall time per 1 thousand iterations and the scaling efficiency. As indicated previously, weak scaling was used in PINN, which follows most machine learning applications.

TABLE 1: Weak scaling performance of the PINN solver using NVIDIA A100-80GB GPUs.

                           1 GPU    2 GPUs    4 GPUs    8 GPUs
    Time (sec/1k iters)    85.0     87.7      89.1      90.1
    Efficiency (%)         100      97        95        94
3.4. Discussion

A note should be made regarding the results: we do not claim that these results represent the most optimized configuration of the PINN method, nor do we claim that the qualitative conclusions apply to all other hyperparameter configurations. These results merely reflect the outcomes of our computational experiments with respect to the specific configuration described above. They should be deemed experimental data rather than a thorough analysis of the method's characteristics.

The Taylor-Green vortex serves as a good benchmark case because it reduces the number of required residual constraints: residuals r_4 and r_5 are excluded from r in equation (6). This means the optimizer can concentrate only on the residuals of the initial conditions and the Navier-Stokes equations.

Using more GPUs (and thus more training points, i.e., spatial-temporal points) did not speed up the convergence, which may indicate that the per-iteration number of points on a single GPU is already big enough. The number of training points mainly affects the mean gradients of the residual with respect to the model parameters, which are then used to update the parameters by gradient-descent-based optimizers. If the number of points is already big enough on a single GPU, then using more points or more GPUs is unlikely to change the mean gradients significantly, causing the convergence to rely solely on the learning rates.

The accuracy of the PINN solver was acceptable but not satisfying, especially when considering how much time it took to achieve such accuracy. The low accuracy was, to some degree, not surprising. Recall the theory in section 2: the PINN method only seeks the minimal residual on the total residual's hyperplane. It does not try to find the zero root of the hyperplane and does not even care whether such a zero root exists. Furthermore, by using a gradient-descent-based optimizer, the resulting minimum is likely just a local minimum. It makes sense that it is hard for the residual to get close to zero, meaning it is hard to make the errors small.

Regarding the performance result in figure 5, we would like to avoid interpreting the result as one solver being better than the other. The proper conclusion drawn from the figure is as follows: when using the PINN solver as a CFD simulator for a specific flow condition, PetIBM outperforms the PINN solver. As stated in section 1, the PINN method can solve flows under different flow parameters in one run—a capability that PetIBM does not have. The performance result in figure 5 only considers a limited application of the PINN solver.

One issue for this case study was how to fairly compare the PINN solver and PetIBM, especially when investigating accuracy versus workload/problem size or time-to-solution versus problem size. Defining the problem size in PINN is not as straightforward as we thought. Let us start with degrees of freedom—in PINN they are called the number of model parameters, and in traditional CFD solvers the number of unknowns. The PINN solver and traditional CFD solvers are all trying to determine the free parameters in models (that is, approximate solutions). Hence, the number of degrees of freedom determines the problem sizes and workloads. However, in PINN, problem sizes and workloads do not depend solely on the degrees of freedom: the number of training points also plays a critical role. We were not sure whether it made sense to define the problem size as the sum of the per-iteration number of training points and the number of model parameters. For example, 100 model parameters plus 100 training points is not equivalent to 150 model parameters plus 50 training points in terms of workload. So without a proper definition of problem size and workload, it was not clear how to fairly compare PINN and traditional CFD methods. Nevertheless, the gap between the performance of PINN and PetIBM is too large, and no one can argue that using other metrics would change the conclusion. Not to mention that the PINN solver ran on A100 GPUs, while PetIBM ran on a single K40 GPU in our lab, a product from 2013. This is also not a surprising conclusion because, as indicated in section 2, the use of automatic differentiation for temporal and spatial derivatives results in a huge computational graph. In addition, the PINN solver uses a gradient-descent-based method, which is a first-order method and limits the performance.

Weak scaling is a natural choice for the PINN solver when it comes to distributed computing. As we do not know a proper way to define workload, simply copying all model parameters to all processes and using the same number of training points on all processes works well.
4. Case 2: 2D cylinder flows: harder than we thought

This case study shows what really frustrated us: a 2D cylinder flow at Reynolds number Re = 200. We failed to produce even a solution that qualitatively captures the key physical phenomenon of this flow: vortex shedding.

4.1. Problem description

The computational domain is [−8, 25] × [−8, 8], and a cylinder with a radius of 0.5 sits at coordinate (0, 0). The velocity boundary conditions are (u, v) = (1, 0) along x = −8, y = −8, and y = 8. On the cylinder surface we have the no-slip condition, i.e., (u, v) = (0, 0). At the outlet (x = 25), we enforced a pressure boundary condition p = 0. The initial condition is (u, v) = (0, 0). Note that this initial condition is different from most traditional CFD simulations. Conventionally, CFD simulations use (u, v) = (1, 0) for cylinder flows. A uniform initial condition of u = 1 does not satisfy the Navier-Stokes equations because of the no-slip boundary on the cylinder surface. Conventional CFD solvers are usually able to correct the solution during time-marching by propagating the boundary effects into the domain through the numerical schemes' stencils. In our experience, using u = 1 or u = 0 did not matter for PINN, because neither gave reasonable results. Nevertheless, the PINN solver's results shown in this section were obtained using a uniform u = 0 as the initial condition.

The density ρ is one, and the kinematic viscosity is ν = 0.005. These parameters correspond to Reynolds number Re = 200. Figure 6 shows the velocity and vorticity snapshots at t = 200. As shown in the figure, this type of flow displays a phenomenon called vortex shedding. Though vortex shedding makes the flow unsteady at all times, after a certain time the flow reaches a periodic stage and the flow pattern repeats with a fixed period.

Fig. 6: Demonstration of the velocity and vorticity fields at t = 200 from a PetIBM simulation.

The Navier-Stokes equations can be deemed a dynamical system. Under some flow conditions, instability appears in the flow and responds to small perturbations, causing the vortex shedding. In nature, the vortex shedding comes from the uncertainty and perturbations existing everywhere. In CFD simulations, the vortex shedding is triggered by small numerical and rounding errors in the calculations. Interested readers should consult reference [Wil].
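For completeness, the quoted Reynolds number follows from these parameters with the cylinder diameter as the length scale (the diameter is implied by the radius of 0.5 and also appears as D in equation (11) below); the snippet is just this arithmetic.

```python
U0 = 1.0       # freestream / inlet velocity
D = 2 * 0.5    # cylinder diameter (radius 0.5)
nu = 0.005     # kinematic viscosity
Re = U0 * D / nu
print(Re)      # -> 200.0
```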
4.2. Solver and runtime configurations

For the PINN solver, we tested two networks. Both were fully-connected neural networks: one with 256 neurons per layer, the other with 512 neurons per layer. All other network configurations were the same as those in section 3, except that we allowed human intervention to manually adjust the learning rates during training. Our intention for this case study was to successfully obtain physical solutions from the PINN solver, rather than to conduct a performance and accuracy benchmark. Therefore, we would adjust the learning rate to accelerate the convergence or to escape from local minima. This decision was in line with common machine learning practice. We did not carry out hyperparameter optimization. These parameters were chosen because they work in Modulus' examples and in the Taylor-Green vortex experiment.

The PINN solver pre-generated 40,960,000 spatial-temporal points from the spatial domain [−8, 25] × [−8, 8] and the temporal domain (0, 200] to evaluate the residuals of the Navier-Stokes equations, and used 40,960 points per iteration. The number of pre-generated points for the initial condition was 2,048,000, and the per-iteration number was 2,048. On each boundary, the numbers of pre-generated and per-iteration points were 8,192,000 and 8,192. Both cases used 8 A100 GPUs, which scaled these numbers up by a factor of 8. For example, during each iteration, a total of 327,680 points were actually used to evaluate the Navier-Stokes equations' residuals. Both cases ran for up to 64 hours of wall time.

One PetIBM simulation was carried out as a baseline. This simulation had a spatial resolution of 1485 × 720, and the time step size was 0.005. Figure 6 was rendered using this simulation. The hardware used was 1 K40 GPU plus 6 cores of an i7-5930K CPU. It took about 1.7 hours to finish.

The quantity of interest is the drag coefficient. We consider both the friction drag and the pressure drag in the coefficient calculation, as follows:

C_D = \frac{2}{\rho U_0^2 D} \int_S \left( \rho \nu \frac{\partial (\vec{U} \cdot \vec{t})}{\partial \vec{n}} n_y - p\, n_x \right) \mathrm{d}S \qquad (11)

Here, U0 = 1 is the inlet velocity; n = [nx, ny]^T and t = [ny, −nx]^T are the normal and tangent vectors, respectively; and S represents the cylinder surface. The theoretical lift coefficient (CL) for this flow is zero due to the symmetric geometry.
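Equation (11) sums a friction and a pressure contribution over the cylinder surface. The sketch below shows one way to evaluate it by discrete integration over the surface angle, assuming the surface pressure and the wall-normal derivative of the tangential velocity have already been sampled around the cylinder; the sampling itself is not shown, and the function and argument names are made up for illustration.

```python
import numpy as np

def drag_coefficient(theta, p_s, dut_dn, rho=1.0, nu=0.005, U0=1.0, D=1.0, R=0.5):
    """Discrete evaluation of equation (11).

    theta  : uniformly spaced angles parameterizing the cylinder surface
    p_s    : surface pressure sampled at each angle
    dut_dn : wall-normal derivative of the tangential velocity at each angle
    """
    nx, ny = np.cos(theta), np.sin(theta)      # outward unit normal components
    integrand = rho * nu * dut_dn * ny - p_s * nx
    dtheta = theta[1] - theta[0]
    surface_integral = np.sum(integrand) * R * dtheta   # dS = R dtheta
    return 2.0 * surface_integral / (rho * U0 ** 2 * D)
```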
4.3. Results

Fig. 7: Training history of the 2D cylinder flow at Re = 200.

Note that, as stated in section 3.4, we treat these results as experimental data obtained under a specific experiment configuration. Hence, we do not claim that the results and qualitative conclusions apply to other hyperparameter configurations.

Figure 7 shows the convergence history. The bumps in the history correspond to our manual adjustments of the learning rates. After 64 hours of training, the total loss had not converged to an obvious steady value. However, we decided not to continue the training because, as the later results show, it is our judgment call that the results would not be correct even if the training converged.

Fig. 8: Velocity and vorticity at t = 200 from PINN.

Figure 8 provides a visualization of the predicted velocity and vorticity at t = 200, and figure 9 shows the drag and lift coefficients versus simulation time. From both figures, we could not see any sign of vortex shedding with the PINN solver.

Fig. 9: Drag and lift coefficients with respect to t.

We provide a comparison against the values reported by others in table 2. References [GS74] and [For80] calculate the drag coefficients using steady flow simulations, which were popular decades ago because of their inexpensive computational costs. The actual flow is not a steady flow, and these steady-flow coefficient values are lower than the unsteady-flow predictions. The drag coefficient from the PINN solver is closer to the steady-flow predictions.

                           Unsteady simulations      Steady simulations
        PetIBM    PINN     [DSY07]    [RKM09]        [GS74]    [For80]
  C_D   1.38      0.95     1.25       1.34           0.97      0.83

TABLE 2: Comparison of drag coefficients, CD.
4.4. Discussion

While researchers may be interested in why the PINN solver behaves like a steady-flow solver, in this section we would like to focus more on the user experience and the usability of PINN in practice. Our viewpoints may be subjective, and hence we leave them here in the discussion.

Allow us to start this discussion with a hypothetical situation. If one asks why we chose such a spatial and temporal resolution for a conventional CFD simulation, we have mathematical or physical reasons to back our decision. However, if the person asks why we chose 6 hidden layers and 256 neurons per layer, we will not be able to justify it. "It worked in another case!" is probably the best answer we can offer. The situation also indicates that we have systematic approaches to improve a conventional simulation, but can only improve PINN's results through computer experiments.

Most traditional numerical methods have rigorous analytical derivations and analyses. Each parameter used in a scheme has a meaning or a purpose in physical or numerical terms. The simplest example is the spatial resolution in the finite difference method, which controls the truncation errors of the derivatives. Another is the choice of the limiters in finite volume methods, used to inhibit oscillations in the solutions. So when a conventional CFD solver produces unsatisfying or even non-physical results, practitioners usually have systematic approaches to identify the cause or improve the outcomes. Moreover, when necessary, practitioners know how to balance the computational cost and the accuracy, which is a critical point for computer-aided engineering. Engineering always concerns costs and outcomes.

On the other hand, the PINN method lacks well-defined procedures to control the outcome. For example, we know the numbers of neurons and layers control the degrees of freedom in a model, and with more degrees of freedom a neural network model can approximate a more complicated phenomenon. However, when we feel that a neural network is not complicated enough to capture a physical phenomenon, what strategy should we use to adjust the neurons and layers? Should we increase neurons or layers first? By how much?

Moreover, when it comes to something non-numeric, it is even more challenging to know what to use and why. For instance, what activation function should we use, and why? Should we use the same activation everywhere? Not to mention that we are not yet even considering a different network architecture here.

Ultimately, are we even sure that increasing the network's complexity is the right path? Our assumption that the network is not complicated enough may just be wrong.

The following situation happened in this case study. Before we realized the PINN solver behaved like a steady-flow solver, we attributed the cause to model complexity. We then faced the problem of how to increase the model complexity systematically. Theoretically, we could follow the practice of design of experiments (e.g., through grid search or Taguchi methods). However, given the computational cost and the number of hyperparameters/options of PINN, a proper design of experiments was not affordable for us. Furthermore, design of experiments requires the outcome to change with changes in the inputs; in our case, the vortex shedding remained absent regardless of how we changed the hyperparameters.

Let us move back to the flow problem to conclude this case study. The model complexity may not be the culprit here. Vortex shedding is the product of the dynamical system of the Navier-Stokes equations and the perturbations from numerical calculations (which implicitly mimic the perturbations in nature). Suppose the PINN solver's prediction is the steady-state solution to the flow. Then we may need to introduce uncertainties and perturbations into the neural network or the training data, such as the perturbed initial condition described in [LD15]. As for why PINN predicts the steady-state solution, we cannot answer that currently.
5. Further discussion and conclusion

Because of widely available deep learning libraries, such as PyTorch, and the ease of Python, implementing a PINN solver is relatively straightforward nowadays. This may be one reason why the PINN method has suddenly become so popular in recent years. This paper does not intend to discourage people from trying the PINN method. Instead, we share our failures and frustration using PINN so that interested readers may know what immediate challenges should be resolved for PINN.

Our paper is limited to using the PINN solver as a replacement for traditional CFD solvers. However, as the first section indicates, PINN can do more than solve one specific flow under specific flow parameters. Moreover, PINN can also work with traditional CFD solvers. The literature shows that researchers have shifted their attention to hybrid-mode applications. For example, in [JEA+20], the authors combined the concept of PINN and a traditional CFD solver to train a model that takes in low-resolution CFD simulation results and outputs high-resolution flow fields.

For people with a strong background in numerical methods or CFD, we would suggest trying to think outside the box. During our work, we realized our mindset and ideas were limited by what we were used to in CFD. An example is the initial conditions. We were used to having only one set of initial conditions when the temporal derivative in the differential equations is only first-order. However, in PINN, nothing prevents us from using more than one initial condition. We can generate results at t = 0, 1, . . ., tn using a traditional CFD solver and add the residuals corresponding to these time snapshots to the total residual, so that the PINN method may perform better in predicting t > tn. In other words, the PINN solver becomes the traditional CFD solvers' replacement only for t > tn ([noa]).

As discussed in [THM+], solving partial differential equations with deep learning is still a work in progress. It may not work in many situations. Nevertheless, that does not mean we should stay away from PINN and discard the idea. Stepping away from a new thing gives it zero chance to evolve, and we would never know whether PINN can be improved to a mature state that works well. Of course, overly promoting its bright side with only success stories does not help, either. Rather, we should honestly face all the troubles, difficulties, and challenges. Knowing the problem is the first step to solving it.

Acknowledgements

We appreciate the support of NVIDIA, through sponsoring access to its high-performance computing cluster.
REFERENCES

[Chu22] Pi-Yueh Chuang. barbagroup/scipy-2022-repro-pack: 20220530, May 2022. doi:10.5281/zenodo.6592457.
[CMKAB18] Pi-Yueh Chuang, Olivier Mesnard, Anush Krishnan, and Lorena A. Barba. PetIBM: toolbox and applications of the immersed-boundary method on distributed-memory architectures. Journal of Open Source Software, 3(25):558, May 2018. doi:10.21105/joss.00558.
[CMW+] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed neural networks (PINNs) for fluid mechanics: a review. 37(12):1727–1738. doi:10.1007/s10409-021-01148-1.
[DPT] M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. 10(3):195–201. doi:10.1002/cnm.1640100303.
[DSY07] Jian Deng, Xue-Ming Shao, and Zhao-Sheng Yu. Hydrodynamic studies on two traveling wavy foils in tandem arrangement. Physics of Fluids, 19(11):113104, November 2007. doi:10.1063/1.2814259.
[DZ] Yifan Du and Tamer A. Zaki. Evolutional deep neural network. 104(4):045303. doi:10.1103/PhysRevE.104.045303.
[For80] Bengt Fornberg. A numerical study of steady viscous flow past a circular cylinder. Journal of Fluid Mechanics, 98(04):819, June 1980. doi:10.1017/S0022112080000419.
[FS] William E. Faller and Scott J. Schreck. Unsteady fluid mechanics applications of neural networks. 34(1):48–55. doi:10.2514/2.2134.
[GS74] V. A. Gushchin and V. V. Shchennikov. A numerical method of solving the Navier-Stokes equations. USSR Computational Mathematics and Mathematical Physics, 14(2):242–250, January 1974. doi:10.1016/0041-5553(74)90061-5.
[Hao] Karen Hao. AI has cracked a key mathematical puzzle for understanding our world. URL: https://www.technologyreview.com/2020/10/30/1011435/ai-fourier-neural-network-cracks-navier-stokes-and-partial-differential-equations/.
[HG] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). doi:10.48550/ARXIV.1606.08415.
[Hor] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. 4(2):251–257. doi:10.1016/0893-6080(91)90009-T.
[JEA+20] Chiyu "Max" Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. MeshfreeFlowNet: A physics-constrained deep continuous space-time super-resolution framework. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2020. doi:10.1109/SC41405.2020.00013.
[KDYI] Hasan Karali, Umut M. Demirezen, Mahmut A. Yukselen, and Gokhan Inalhan. A novel physics informed deep learning method for simulation-based modelling. In AIAA Scitech 2021 Forum. American Institute of Aeronautics and Astronautics. doi:10.2514/6.2021-0177.
[LD15] Mouna Laroussi and Mohamed Djebbi. Vortex shedding for flow past circular cylinder: Effects of initial conditions. Universal Journal of Fluid Mechanics, 3:19–32, 2015.
[LLF] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. 9(5):987–1000. arXiv:physics/9705023, doi:10.1109/72.712178.
[LLQH] Jianyu Li, Siwei Luo, Yingjian Qi, and Yaping Huang. Numerical solution of elliptic partial differential equation using radial basis function neural networks. 16(5):729–734. doi:10.1016/S0893-6080(03)00083-2.
[LMMK] Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. DeepXDE: A deep learning library for solving differential equations. 63(1):208–228. doi:10.1137/19M1274067.
[LS] Dennis J. Linse and Robert F. Stengel. Identification of aerodynamic coefficients using computational neural networks. 16(6):1018–1025. doi:10.2514/3.21122.
[noa] Modulus. URL: https://docs.nvidia.com/deeplearning/modulus/index.html.
[RKM09] B. N. Rajani, A. Kandasamy, and Sekhar Majumdar. Numerical simulation of laminar flow past a circular cylinder. Applied Mathematical Modelling, 33(3):1228–1247, March 2009. doi:10.1016/j.apm.2008.01.017.
[RPK] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. 378:686–707. doi:10.1016/j.jcp.2018.10.045.
[THM+] Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, and Kiwon Um. Physics-based deep learning. arXiv:2109.05237.
[Tre] Lloyd N. Trefethen. Spectral Methods in MATLAB. Software, Environments, Tools. Society for Industrial and Applied Mathematics. doi:10.1137/1.9780898719598.
[Wil] C. H. K. Williamson. Vortex dynamics in the cylinder wake. 28(1):477–539. doi:10.1146/annurev.fl.28.010196.002401.
[WTP] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. 43(5):A3055–A3081. doi:10.1137/20M1318043.
[WYP] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. 449:110768. doi:10.1016/j.jcp.2021.110768.
[SS] Justin Sirignano and Konstantinos Spiliopoulos.
DGM: A deep learning algorithm for solving partial differential equations. 375:1339–1364. URL: https: //linkinghub.elsevier.com/retrieve/pii/S0021999118305527, doi:10.1016/j.jcp.2018.08.029. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 37 atoMEC: An open-source average-atom Python code Timothy J. Callow‡§∗ , Daniel Kotik‡§ , Eli Kraisler¶ , Attila Cangi‡§ F Abstract—Average-atom models are an important tool in studying matter under methods are often denoted as "first-principles" because, formally extreme conditions, such as those conditions experienced in planetary cores, speaking, they yield the exact properties of the system, under cer- brown and white dwarfs, and during inertial confinement fusion. In the right tain well-founded theoretical approximations. Density-functional context, average-atom models can yield results with similar accuracy to simu- theory (DFT), initially developed as a ground-state theory [HK64], lations which require orders of magnitude more computing time, and thus can [KS65] but later extended to non-zero temperatures [Mer65], greatly reduce financial and environmental costs. Unfortunately, due to the wide range of possible models and approximations, and the lack of open-source [PPF+ 11], is one such theory and has been used extensively to codes, average-atom models can at times appear inaccessible. In this paper, we study materials under WDM conditions [GDRT14]. Even though present our open-source average-atom code, atoMEC. We explain the aims and DFT reformulates the Schrödinger equation in a computationally structure of atoMEC to illuminate the different stages and options in an average- efficient manner [Koh99], the cost of running calculations be- atom calculation, and to facilitate community contributions. We also discuss the comes prohibitively expensive at higher temperatures. Formally, use of various open-source Python packages in atoMEC, which have expedited it scales as O(N 3 τ 3 ), with N the particle number (which usually its development. also increases with temperature) and τ the temperature [CRNB18]. This poses a serious computational challenge in the WDM regime. Index Terms—computational physics, plasma physics, atomic physics, materi- Furthermore, although DFT is a formally exact theory, in prac- als science tice it relies on approximations for the so-called "exchange- correlation" energy, which is, roughly speaking, responsible for Introduction simulating all the quantum interactions between electrons. Exist- ing exchange-correlation approximations have not been rigorously The study of matter under extreme conditions — materials tested under WDM conditions. An alternative method used in exposed to high temperatures, high pressures, or strong elec- the WDM community is path-integral Monte–Carlo [DGB18], tromagnetic fields — is critical to our understanding of many which yields essentially exact properties; however, it is even more important scientific and technological processes, such as nuclear limited by computational cost than DFT, and becomes unfeasibly fusion and various astrophysical and planetary physics phenomena expensive at lower temperatures due to the fermion sign problem. [GFG+ 16]. Of particular interest within this broad field is the It is therefore of great interest to reduce the computational warm dense matter (WDM) regime, which is typically character- complexity of the aforementioned methods. 
The use of graphics ized by temperatures in the range of 103 − 106 degrees (Kelvin), processing units in DFT calculations is becomingly increasingly and densities ranging from dense gases to highly compressed common, and has been shown to offer significant speed-ups solids (∼ 0.01 − 1000 g cm−3 ) [BDM+ 20]. In this regime, it is relative to conventional calculations using central processing units important to account for the quantum mechanical nature of the [MED11], [JFC+ 13]. Some other examples of promising develop- electrons (and in some cases, also the nuclei). Therefore conven- ments to reduce the cost of DFT calculations include machine- tional methods from plasma physics, which either neglect quantum learning-based solutions [SRH+ 12], [BVL+ 17], [EFP+ 21] and effects or treat them coarsely, are usually not sufficiently accurate. stochastic DFT [CRNB18], [BNR13]. However, in this paper, On the other hand, methods from condensed-matter physics and we focus on an alternative class of models known as "average- quantum chemistry, which account fully for quantum interactions, atom" models. Average-atom models have a long history in plasma typically target the ground-state only, and become computationally physics [CHKC22]: they account for quantum effects, typically intractable for systems at high temperatures. using DFT, but reduce the complex system of interacting electrons Nevertheless, there are methods which can, in principle, be and nuclei to a single atom immersed in a plasma (the "average" applied to study materials at any given temperature and den- atom). An illustration of this principle (reduced to two dimensions sity whilst formally accounting for quantum interactions. These for visual purposes) is shown in Fig. 1. This significantly reduces * Corresponding author: t.callow@hzdr.de the cost relative to a full DFT simulation, because the particle ‡ Center for Advanced Systems Understanding (CASUS), D-02826 Görlitz, number is restricted to the number of electrons per nucleus, and Germany spherical symmetry is exploited to reduce the three-dimensional § Helmholtz-Zentrum Dresden-Rossendorf, D-01328 Dresden, Germany ¶ Fritz Haber Center for Molecular Dynamics and Institute of Chemistry, The problem to one dimension. Hebrew University of Jerusalem, 9091401 Jerusalem, Israel Naturally, to reduce the complexity of the problem as de- scribed, various approximations must be introduced. It is im- Copyright © 2022 Timothy J. Callow et al. This is an open-access article portant to understand these approximations and their limitations distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, for average-atom models to have genuine predictive capabilities. provided the original author and source are credited. Unfortunately, this is not always the case: although average-atom 38 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Theoretical background Properties of interest in the warm dense matter regime include the equation-of-state data, which is the relation between the density, energy, temperature and pressure of a material [HRD08]; the mean ionization state and the electron ionization energies, which tell us about how tightly bound the electrons are to the nuclei; and the electrical and thermal conductivities. These properties yield information pertinent to our understanding of stellar and planetary physics, the Earth’s core, inertial confinement fusion, and more besides. 
To exactly obtain these properties, one needs (in theory) to determine the thermodynamic ensemble of the quantum states (the so-called wave-functions) representing the electrons and nuclei. Fig. 1: Illustration of the average-atom concept. The many-body Fortunately, they can be obtained with reasonable accuracy using and fully-interacting system of electron density (shaded blue) and models such as average-atom models; in this section, we elaborate nuclei (red points) on the left is mapped into the much simpler system of independent atoms on the right. Any of these identical on how this is done. atoms represents the "average-atom". The effects of interaction from We shall briefly review the key theory underpinning the type of neighboring atoms are implicitly accounted for in an approximate average-atom model implemented in atoMEC. This is intended for manner through the choice of boundary conditions. readers without a background in quantum mechanics, to give some context to the purposes and mechanisms of the code. For a compre- hensive derivation of this average-atom model, we direct readers to Ref. [CHKC22]. The average-atom model we shall describe models share common concepts, there is no unique formal theory falls into a class of models known as ion-sphere models, which underpinning them. Therefore a variety of models and codes exist, are the simplest (and still most widely used) class of average-atom and it is not typically clear which models can be expected to model. There are alternative (more advanced) classes of model perform most accurately under which conditions. In a previous such as ion-correlation [Roz91] and neutral pseudo-atom models paper [CHKC22], we addressed this issue by deriving an average- [SS14] which we have not yet implemented in atoMEC, and thus atom model from first principles, and comparing the impact of we do not elaborate on them here. different approximations within this model on some common As demonstrated in Fig. 1, the idea of the ion-sphere model properties. is to map a fully-interacting system of many electrons and In this paper, we focus on computational aspects of average- nuclei into a set of independent atoms which do not interact atom models for WDM. We introduce atoMEC [CKTS+ 21]: explicitly with any of the other spheres. Naturally, this depends an open-source average-atom code for studying Matter under on several assumptions and approximations, but there is formal Extreme Conditions. One of the main aims of atoMEC is to im- justification for such a mapping [CHKC22]. Furthermore, there prove the accessibility and understanding of average-atom models. are many examples in which average-atom models have shown To the best of our knowledge, open-source average-atom codes good agreement with more accurate simulations and experimental are in scarce supply: with atoMEC, we aim to provide a tool that data [FB19], which further justifies this mapping. people can use to run average-atom simulations and also to add Although the average-atom picture is significantly simplified their own models, which should facilitate comparisons of different relative to the full many-body problem, even determining the approximations. The relative simplicity of average-atom codes wave-functions and their ensemble weights for an atom at finite means that they are not only efficient to run, but also efficient temperature is a complex problem. 
Fortunately, DFT reduces this to develop: this means, for example, that they can be used as a complexity further, by establishing that the electron density — a test-bed for new ideas that could be later implemented in full DFT far less complex entity than the wave-functions — is sufficient to codes, and are also accessible to those without extensive prior determine all physical observables. The most popular formulation expertise, such as students. atoMEC aims to facilitate development of DFT, known as Kohn–Sham DFT (KS-DFT) [KS65], allows us by following good practice in software engineering (for example to construct the fully-interacting density from a non-interacting extensive documentation), a careful design structure, and of course system of electrons, simplifying the problem further still. Due to through the choice of Python and its widely used scientific stack, the spherical symmetry of the atom, the non-interacting electrons in particular the NumPy [HMvdW+ 20] and SciPy [VGO+ 20] — known as KS electrons (or KS orbitals) — can be represented libraries. as a wave-function that is a product of radial and angular compo- nents, This paper is structured as follows: in the next section, we briefly review the key theoretical points which are important φnlm (r) = Xnl (r)Ylm (θ , φ ) , (1) to understand the functionality of atoMEC, assuming no prior where n, l, and m are the quantum numbers of the orbitals, which physical knowledge of the reader. Following that, we present come from the fact that the wave-function is an eigenfunction of the key functionality of atoMEC, discuss the code structure the Hamiltonian operator, and Ylm (θ , φ ) are the spherical harmonic and algorithms, and explain how these relate to the theoretical aspects introduced. Finally, we present an example case study: functions.1 The radial coordinate r represents the absolute distance we consider helium under the conditions often experienced in from the nucleus. the outer layers of a white dwarf star, and probe the behavior 1. Please note that the notation in Eq. (1) does not imply Einstein sum- of a few important properties, namely the band-gap, pressure, and mation notation. All summations in this paper are written explicitly; Einstein ionization degree. summation notation is not used. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 39 We therefore only need to determine the radial KS orbitals energy required to excite an electron bound to the nucleus to being Xnl (r). These are determined by solving the radial KS equation, a free (conducting) electron. These predicted ionization energies which is similar to the Schrödinger equation for a non-interacting can be used, for example, to help understand ionization potential system, with an additional term in the potential to mimic the depression, an important but somewhat controversial effect in effects of electron-electron interaction (within the single atom). WDM [STJ+ 14]. Another property that can be straightforwardly The radial KS equation is given by: obtained from the energy levels and their occupation numbers is 2 the mean ionization state Z̄ 2 , d 2 d l(l + 1) − + − + vs [n](r) Xnl (r) = εnl Xnl (r). (2) dr2 r dr r2 Z̄ = ∑(2l + 1) fnl (εnl , µ, τ) (6) n,l We have written the above equation in a way that emphasizes that it is an eigenvalue equation, with the eigenvalues εnl being the which is an important input parameter for various models, such energies of the KS orbitals. 
as adiabats which are used to model inertial confinement fusion On the left-hand side, the terms in the round brackets come [KDF+ 11]. from the kinetic energy operator acting on the orbitals. The vs [n](r) Various other interesting properties can also be calculated term is the KS potential, which itself is composed of three different following some post-processing of the output of an SCF cal- terms, culation, for example the pressure exerted by the electrons and Z RWS ions. Furthermore, response properties, i.e. those resulting from Z n(x)x2 δ Fxc [n] an external perturbation like a laser pulse, can also be obtained vs [n](r) = − + 4π dx + , (3) r 0 max(r, x) δ n(r) from the output of an SCF cycle. These properties include, for where RWS is the radius of the atomic sphere, n(r) is the electron example, electrical conductivities [Sta16] and dynamical structure density, Z the nuclear charge, and Fxc [n] the exchange-correlation factors [SPS+ 14]. free energy functional. Thus the three terms in the potential are respectively the electron-nuclear attraction, the classical Hartree Code structure and details repulsion, and the exchange-correlation (xc) potential. In the following sections, we describe the structure of the code We note that the KS potential and its constituents are function- in relation to the physical problem being modeled. Average-atom als of the electron density n(r). Were it not for this dependence models typically rely on various parameters and approximations. on the density, solving Eq. 2 just amounts to solving an ordinary In atoMEC, we have tried to structure the code in a way that makes linear differential equation (ODE). However, the electron density clear which parameters come from the physical problem studied is in fact constructed from the orbitals in the following way, compared to choices of the model and numerical or algorithmic n(r) = 2 ∑(2l + 1) fnl (εnl , µ, τ)|Xnl (r)|2 , (4) choices. nl atoMEC.Atom: Physical parameters where fnl (εnl , µ, τ) is the Fermi–Dirac distribution, given by The first step of any simulation in WDM (which also applies to 1 simulations in science more generally) is to define the physical fnl (εnl , µ, τ) = , (5) 1 + e(εnl −µ)/τ parameters of the problem. These parameters are unique in the where τ is the temperature, and µ is the chemical potential, which sense that, if we had an exact method to simulate the real system, is determined by fixing the number of electrons to be equal to then for each combination of these parameters there would be a a pre-determined value Ne (typically equal to the nuclear charge unique solution. In other words, regardless of the model — be Z). The Fermi–Dirac distribution therefore assigns weights to the it average atom or a different technique — these parameters are KS orbitals in the construction of the density, with the weight always required and are independent of the model. depending on their energy. In average-atom models, there are typically three parameters Therefore, the KS potential that determines the KS orbitals via defining the physical problem, which are: the ODE (2), is itself dependent on the KS orbitals. Consequently, • the atomic species; the KS orbitals and their dependent quantities (the density and • the temperature of the material, τ; KS potential) must be determined via a so-called self-consistent • the mass density of the material, ρm . field (SCF) procedure. An initial guess for the orbitals, Xnl0 (r), is used to construct the initial density n0 (r) and potential v0s (r). 
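A compact way to see how the orbitals enter the density is to code equations (4) and (5) directly. This is an illustrative sketch with made-up array shapes, not the routine atoMEC itself uses; eigenvalues, orbitals, chemical potential, and temperature are all taken to be in Hartree atomic units.

```python
import numpy as np

def fermi_dirac(eps, mu, tau):
    """Occupation factor f_nl of equation (5)."""
    return 1.0 / (1.0 + np.exp((eps - mu) / tau))

def density(eigvals, X_nl, mu, tau):
    """Electron density n(r) of equation (4).

    eigvals : eigenvalues eps_nl with shape (nmax, lmax)
    X_nl    : radial orbitals X_nl(r) with shape (nmax, lmax, ngrid)
    """
    n_r = np.zeros(X_nl.shape[-1])
    nmax, lmax = eigvals.shape
    for n in range(nmax):
        for l in range(lmax):
            occ = (2 * l + 1) * fermi_dirac(eigvals[n, l], mu, tau)
            n_r += 2.0 * occ * np.abs(X_nl[n, l]) ** 2
    return n_r
```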
The mass density also directly corresponds to the mean dis- The ODE (2) is then solved to update the orbitals. This process is tance between two nuclei (atomic centers), which in the average- iterated until some appropriately chosen quantities — in atoMEC atom model is equal to twice the radius of the atomic sphere, RWS . the total free energy, density and KS potential — are converged, An additional physical parameter not mentioned above is the net i.e. ni+1 (r) = ni (r), vi+1 i i+1 = F i , within some charge of the material being considered, i.e. the difference be- s (r) = vs (r), F reasonable numerical tolerance. In Fig. 2, we illustrate the life- tween the nuclear charge Z and the electron number Ne . However, cycle of the average-atom model described so far, including the we usually assume zero net charge in average-atom simulations SCF procedure. On the left-hand side of this figure, we show the (i.e. the number of electrons is equal to the atomic charge). physical choices and mathematical operations, and on the right- In atoMEC, these physical parameters are controlled by the hand side, the representative classes and functions in atoMEC. In Atom object. As an example, we consider aluminum under ambi- the following section, we shall discuss some aspects of this figure ent conditions, i.e. at room temperature, τ = 300 K, and normal in more detail. metallic density, ρm = 2.7 g cm−3 . We set this up as: Some quantities obtained from the completion of the SCF pro- 2. The summation in Eq. (6) is often shown as an integral because the cedure are directly of interest. For example, the energy eigenvalues energies above a certain threshold form a continuous distribution (in most εnl are related to the electron ionization energies, i.e. the amount of models). 40 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Schematic of the average-atom model set-up and the self-consistent field (SCF) cycle. On the left-hand side, the physical choices and mathematical operations that define the model and SCF cycle are shown. On the right-hand side, the (higher-order) functions and classes in atoMEC corresponding to the items on the left-hand side are shown. Some liberties are taken with the code snippets in the right-hand column of the figure to improve readability; more precisely, some non-crucial intermediate steps are not shown, and some parameters are also not shown or simplified. The dotted lines represent operations that are taken care of within the models.CalcEnergy function, but are shown nevertheless to improve understanding. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 41 Fig. 4: Auto-generated print statement from calling the models.ISModel object. Fig. 3: Auto-generated print statement from calling the atoMEC.Atom object. with a "quantum" treatment of the unbound electrons, and choose the LDA exchange functional (which is also the default). This from atoMEC import Atom Al = Atom("Al", 300, density=2.7, units_temp="K") model is set up as: from atoMEC import models By default, the above code automatically prints the output seen model = models.ISModel(Al, bc="neumann", in Fig. 3. We see that the first two arguments of the Atom object xfunc_id="lda_x", unbound="quantum") are the chemical symbol of the element being studied, and the By default, the above code prints the output shown in Fig. temperature. In addition, at least one of "density" or "radius" must 4. The first (and only mandatory) input parameter to the be specified. 
In atoMEC, the default (and only permitted) units for models.ISModel object is the Atom object that we generated the mass density are g cm−3 ; all other input and output units in earlier. Together with the optional spinpol and spinmag atoMEC are by default Hartree atomic units, and hence we specify parameters in the models.ISModel object, this sets either the "K" for Kelvin. total number of electrons (spinpol=False) or the number of The information in Fig. 3 displays the chosen parameters in electrons in each spin channel (spinpol=True). units commonly used in the plasma and condensed-matter physics The remaining information displayed in Fig. 4 shows directly communities, as well as some other information directly obtained the chosen model parameters, or the default values where these from these parameters. The chemical symbol ("Al" in this case) parameters are not specified. The exchange and correlation func- is passed to the mendeleev library [men14] to generate this data, tionals - set by the parameters xfunc_id and cfunc_id - are which is used later in the calculation. passed to the LIBXC library [LSOM18] for processing. So far, This initial stage of the average-atom calculation, i.e. the only the "local density" family of approximations is available specification of physical parameters and initialization of the Atom in atoMEC, and thus the default values are usually a sensible object, is shown in the top row at the top of Fig. 2. choice. For more information on exchange and correlation func- atoMEC.models: Model parameters tionals, there are many reviews in the literature, for example Ref. [CMSY12]. After the physical parameters are set, the next stage of the average- This stage of the average-atom calculation, i.e. the specifica- atom calculation is to choose the model and approximations within tion of the model and the choices of approximation within that, is that class of model. As discussed, so far the only class of model shown in the second row of Fig. 2. implemented in atoMEC is the ion-sphere model. Within this model, there are still various choices to be made by the user. ISModel.CalcEnergy: SCF calculation and numerical parameters In some cases, these choices make little difference to the results, Once the physical parameters and model have been defined, the but in other cases they have significant impact. The user might next stage in the average-atom calculation (or indeed any DFT have some physical intuition as to which is most important, or calculation) is the SCF procedure. In atoMEC, this is invoked alternatively may want to run the same physical parameters with by the ISModel.CalcEnergy function. This function is called several different model parameters to examine the effects. Some CalcEnergy because it finds the KS orbitals (and associated KS choices available in atoMEC, listed approximately in decreasing density) which minimize the total free energy. order of impact (but this can depend strongly on the system under Clearly, there are various mathematical and algorithmic consideration), are: choices in this calculation. These include, for example: the basis in • the boundary conditions used to solve the KS equations; which the KS orbitals and potential are represented, the algorithm • the treatment of the unbound electrons, which means used to solve the KS equations (2), and how to ensure smooth those electrons not tightly bound to the nucleus, but rather convergence of the SCF cycle. 
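The transformed equation (7) is an eigenvalue problem on the logarithmic grid. As a rough illustration of the "discretize and call a sparse eigensolver" idea described next in the text, the sketch below uses a plain second-order central difference instead of the matrix Numerov stencils of equations (9)-(13) that atoMEC's numerov module actually implements, and it assumes the simplest implicit zero boundary values rather than the selectable boundary conditions; only W(x) sampled on the grid is required.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigs

def radial_states(x, W, n_states=3):
    """Approximate the lowest eigenpairs of equation (7),
    P'' - 2 e^{2x} (W - eps) P = 0, on a uniform logarithmic grid x.

    Uses a central-difference Laplacian (not the Numerov stencils of
    eqs (9)-(13)) and implicit P = 0 at both ends of the grid.
    """
    dx = x[1] - x[0]
    ones = np.ones(x.size)
    # Second-derivative matrix d^2/dx^2.
    lap = sparse.diags([ones[:-1], -2.0 * ones, ones[:-1]], [-1, 0, 1]) / dx**2
    # Equation (7) rearranged as a standard eigenproblem:
    #   [-1/2 e^{-2x} d^2/dx^2 + W(x)] P = eps P
    H = -0.5 * sparse.diags(np.exp(-2.0 * x)) @ lap + sparse.diags(W)
    # Only a few low-lying states are needed, so use ARPACK in
    # shift-invert mode around a value below the potential minimum
    # instead of a full O(N^3) diagonalization.
    eps, P = eigs(H, k=n_states, sigma=np.min(W) - 1.0)
    order = np.argsort(eps.real)
    return eps.real[order], P[:, order].real
```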
In atoMEC, the SCF procedure delocalized over the whole atomic sphere; currently follows a single pre-determined algorithm, which we • the choice of exchange and correlation functionals, the briefly review below. central approximations of DFT [CMSY12]; In atoMEC, we represent the radial KS quantities (orbitals, • the spin polarization and magnetization. density and potential) on a logarithmic grid, i.e. x = log(r). Furthermore, we make a transformation of the orbitals Pnl (x) = We do not discuss the theory and impact of these different Xnl (x)ex/2 . Then the equations to be solved become: choices in this paper. Rather, we direct readers to Refs. [CHKC22] and [CKC22] in which all of these choices are discussed. d2 Pnl (x) − 2e2x (W (x) − εnl )Pnl (x) = 0 (7) In atoMEC, the ion-sphere model is controlled by the dx2 models.ISModel object. Continuing with our aluminum ex- 1 1 2 −2x ample, we choose the so-called "neumann" boundary condition, W (x) = vs [n](x) + l+ e . (8) 2 2 42 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) In atoMEC, we solve the KS equations using a matrix imple- a unique set of physical and model inputs — these parameters mentation of Numerov’s algorithm [PGW12]. This means we should be independently varied until some property (such as the diagonalize the following equation: total free energy) is considered suitably converged with respect to that parameter. Changing the SCF parameters should not affect the Ĥ ~P = ~ε B̂~P , where (9) final results (within the convergence tolerances), only the number Ĥ = T̂ + B̂ +Ws (~x) , (10) of iterations in the SCF cycle. 1 Let us now consider an example SCF calculation, using the T̂ = − e−2~x  , (11) 2 Atom and model objects we have already defined: Iˆ−1 − 2Iˆ0 + Iˆ1  = , and (12) from atoMEC import config dx2 config.numcores = -1 # parallelize Iˆ−1 + 10Iˆ0 + Iˆ1 B̂ = , (13) 12 nmax = 3 # max value of principal quantum number lmax = 3 # max value of angular quantum number In the above, Iˆ−1/0/1 are lower shift, identify, and upper shift matrices. # run SCF calculation The Hamiltonian matrix Ĥ is sparse and we only seek a subset scf_out = model.CalcEnergy( nmax, of eigenstates with lower energies: therefore there is no need to lmax, perform a full diagonalization, which scales as O(N 3 ), with N grid_params={"ngrid": 1500}, being the size of the radial grid. Instead, we use SciPy’s sparse ma- scf_params={"mixfrac": 0.7}, ) trix diagonalization function scipy.sparse.linalg.eigs, which scales more efficiently and allows us to go to larger grid We see that the first two parameters passed to the CalcEnergy sizes. function are the nmax and lmax quantum numbers, which specify After each step in the SCF cycle, the relative changes in the the number of eigenstates to compute. Precisely speaking, there free energy F, density n(r) and potential vs (r) are computed. is a unique Hamiltonian for each value of the angular quantum Specifically, the quantities computed are number l (and in a spin-polarized calculation, also for each F i − F i−1 spin quantum number). The sparse diagonalization routine then ∆F = (14) computes the first nmax eigenvalues for each Hamiltonian. In Fi R atoMEC, these diagonalizations can be run in parallel since they dr|ni (r) − ni−1 (r)| ∆n = R (15) are independent for each value of l. This is done by setting the drni (r) R config.numcores variable to the number of cores desired dr|vs (r) − vi−1 i s (r)| (config.numcores=-1 uses all the available cores) and han- ∆v = R i . 
(16) drvs (r) dled via the joblib library [Job20]. Once all three of these metrics fall below a certain threshold, the The remaining parameters passed to the CalcEnergy func- SCF cycle is considered converged and the calculation finishes. tion are optional; in the above, we have specified a grid size The SCF cycle is an example of a non-linear system and thus of 1500 points and a mixing fraction α = 0.7. The above code is prone to chaotic (non-convergent) behavior. Consequently a automatically prints the output seen in Fig. 5. This output shows range of techniques have been developed to ensure convergence the SCF cycle and, upon completion, the breakdown of the total [SM91]. Fortunately, the tendency for calculations not to converge free energy into its various components, as well as other useful becomes less likely for temperatures above zero (and especially information such as the KS energy levels and their occupations. as temperatures increase). Therefore we have implemented only Additionally, the output of the SCF function is a dictionary a simple linear mixing scheme in atoMEC. The potential used in containing the staticKS.Orbitals, staticKS.Density, each diagonalization step of the SCF cycle is not simply the one staticKS.Potential and staticKS.Density objects. generated from the most recent density, but a mix of that potential For example, one could extract the eigenfunctions as follows: and the previous one, orbs = scf_out["orbitals"] # orbs object vs (r) = αvis (r) + (1 − α)vi−1 (i) ks_eigfuncs = orbs.eigfuncs # eigenfunctions s (r) . (17) In general, a lower value of the mixing fraction α makes the The initialization of the SCF procedure is shown in the third and SCF cycle more stable, but requires more iterations to converge. fourth rows of Fig. 2, with the SCF procedure itself shown in the Typically a choice of α ≈ 0.5 gives a reasonable balance between remaining rows. speed and stability. This completes the section on the code structure and We can thus summarize the key parameters in an SCF calcu- algorithmic details. As discussed, with the output of an lation as follows: SCF calculation, there are various kinds of post-processing one can perform to obtain other properties of interest. So • the maximum number of eigenstates to compute, in terms far in atoMEC, these are limited to the computation of of both the principal and angular quantum numbers; the pressure (ISModel.CalcPressure), the electron • the numerical grid parameters, in particular the grid size; localization function (atoMEC.postprocess.ELFTools) • the convergence tolerances, Eqs. (14) to (16); and the Kubo–Greenwood conductivity • the SCF parameters, i.e. the mixing fraction and the (atoMEC.postprocess.conductivity). We refer maximum number of iterations. readers to our pre-print [CKC22] for details on how the electron The first three items in this list essentially control the accuracy localization function and the Kubo–Greenwood conductivity can of the calculation. In principle, for each SCF calculation — i.e. be used to improve predictions of the mean ionization state. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 43 Fig. 6: Helium density-of-states (DOS) as a function of energy, for different mass densities ρm , and at temperature τ = 50 kK. Black dots indicate the occupations of the electrons in the permitted energy ranges. Dashed black lines indicate the band-gap (the energy gap between the insulating and conducting bands). Between 5 and 6 g cm−3 , the band-gap disappears. 
and temperature) and electrical conductivity. To calculate the insulator-to-metallic transition point, the key quantity is the electronic band-gap. The concept of band- structures is a complicated topic, which we try to briefly describe in layman’s terms. In solids, electrons can occupy certain energy ranges — we call these the energy bands. In insulating materials, there is a gap between these energy ranges that electrons are forbidden from occupying — this is the so-called band-gap. In conducting materials, there is no such gap, and therefore electrons can conduct electricity because they can be excited into any part of the energy spectrum. Therefore, a simple method to determine the insulator-to-metallic transition is to determine the density at which the band-gap becomes zero. In Fig. 6, we plot the density-of-states (DOS) as a function of energy, for different densities and at fixed temperature τ = 50 kK. The DOS shows the energy ranges that the electrons are allowed to occupy; we also show the actual energies occupied by the electrons (according to Fermi–Dirac statistics) with the black dots. We can clearly see in this figure that the band-gap (the region where the DOS is zero) becomes smaller as a function of density. From Fig. 5: Auto-generated print statement from calling the this figure, it seems the transition from insulating to metallic state ISModel.CalcEnergy function happens somewhere between 5 and 6 g cm−3 . In Fig. 7, we plot the band-gap as a function of density, for a fixed temperature τ = 50 kK. Visually, it appears that the relation- Case-study: Helium ship between band-gap and density is linear at this temperature. In this section, we consider an application of atoMEC in the This is confirmed using a linear fit, which has a coefficient of WDM regime. Helium is the second most abundant element in the determination value of almost exactly one, R2 = 0.9997. Using this universe (after hydrogen) and therefore understanding its behavior fit, the band-gap is predicted to close at 5.5 g cm−3 . Also in this under a wide range of conditions is important for our under- figure, we show the fraction of ionized electrons, which is given by standing of many astrophysical processes. Of particular interest Z̄/Ne , using Eq. (6) to calculate Z̄, and Ne being the total electron are the conditions under which helium is expected to undergo a number. The ionization fraction also relates to the conductivity of transition from insulating to metallic behavior in the outer layers the material, because ionized electrons are not bound to any nuclei of white dwarfs, which are characterized by densities of around and therefore free to conduct electricity. We see that the ionization 1 − 20 g cm−3 and temperatures of 10 − 50 kK [PR20]. These fraction mostly increases with density (excepting some strange conditions are a typical example of the WDM regime. Besides behavior around ρm = 1 g cm−3 ), which is further evidence of the predicting the point at which the insulator-to-metallic transition transition from insulating to conducting behaviour with increasing occurs in the density-temperature spectrum, other properties of density. interest include equation-of-state data (relating pressure, density, As a final analysis, we plot the pressure as a function of mass 44 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) open-source scientific libraries — especially the Python libraries NumPy, SciPy, joblib and mendeleev, as well as LIBXC. 
We finish this paper by emphasizing that atoMEC is still in the early stages of development, and there are many opportunities to improve and extend the code. These include, for example: • adding new average-atom models, and different approxi- mations to the existing models.ISModel model; • optimizing the code, in particular the routines in the numerov module; • adding new postprocessing functionality, for example to compute structure factors; • improving the structure and design choices of the code. Fig. 7: Band-gap (red circles) and ionization fraction (blue squares) for helium as a function of mass density, at temperature τ = 50 kK. Of course, these are just a snapshot of the avenues for future The relationship between the band-gap and the density appears to be development in atoMEC. We are open to contributions in these linear. areas and many more besides. Acknowledgements This work was partly funded by the Center for Advanced Systems Understanding (CASUS) which is financed by Germany’s Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament. R EFERENCES [BDM+ 20] M. Bonitz, T. Dornheim, Zh. A. Moldabekov, S. Zhang, P. Hamann, H. Kählert, A. Filinov, K. Ramakrishna, and J. Vor- berger. Ab initio simulation of warm dense matter. Phys. Plas- mas, 27(4):042710, 2020. doi:10.1063/1.5143225. [BNR13] Roi Baer, Daniel Neuhauser, and Eran Rabani. Self- averaging stochastic Kohn-Sham density-functional theory. Fig. 8: Helium pressure (logarithmic scale) as a function of mass Phys. Rev. Lett., 111:106402, Sep 2013. doi:10.1103/ density and temperature. The pressure increases with density and PhysRevLett.111.106402. temperature (as expected), with a stronger dependence on density. [BVL+ 17] Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn- Sham equations with machine learning. Nature Communica- density and temperature in Fig. 8. The pressure is given by the tions, 8(1):872, Oct 2017. doi:10.1038/s41467-017- 00839-3. sum of two terms: (i) the electronic pressure, calculated using [CHKC22] T. J. Callow, S. B. Hansen, E. Kraisler, and A. Cangi. the method described in Ref. [FB19], and (ii) the ionic pressure, First-principles derivation and properties of density-functional calculated using the ideal gas law. We observe that the pressure average-atom models. Phys. Rev. Research, 4:023055, Apr 2022. doi:10.1103/PhysRevResearch.4.023055. increases with both density and temperature, which is the expected [CKC22] Timothy J. Callow, Eli Kraisler, and Attila Cangi. Accurate behavior. Under these conditions, the density dependence is much and efficient computation of mean ionization states with an stronger, especially for higher densities. average-atom Kubo-Greenwood approach, 2022. doi:10. The code required to generate the above results and plots can 48550/ARXIV.2203.05863. [CKTS+ 21] Timothy Callow, Daniel Kotik, Ekaterina Tsve- be found in this repository. toslavova Stankulova, Eli Kraisler, and Attila Cangi. atomec, August 2021. If you use this software, please cite it Conclusions and future work using these metadata. doi:10.5281/zenodo.5205719. [CMSY12] Aron J. Cohen, Paula Mori-Sánchez, and Weitao Yang. Chal- In this paper, we have presented atoMEC: an average-atom Python lenges for density functional theory. Chemical Reviews, code for studying materials under extreme conditions. 
The open- 112(1):289–320, 2012. doi:10.1021/cr200107z. [CRNB18] Yael Cytter, Eran Rabani, Daniel Neuhauser, and Roi Baer. source nature of atoMEC, and the choice to use (pure) Python as Stochastic density functional theory at finite temperatures. the programming language, is designed to improve the accessibil- Phys. Rev. B, 97:115207, Mar 2018. doi:10.1103/ ity of average-atom models. PhysRevB.97.115207. We gave significant attention to the code structure in this [DGB18] Tobias Dornheim, Simon Groth, and Michael Bonitz. The uniform electron gas at warm dense matter conditions. Phys. paper, and tried as much as possible to connect the functions Rep., 744:1 – 86, 2018. doi:10.1016/j.physrep. and objects in the code with the underlying theory. We hope that 2018.04.001. this not only improves atoMEC from a user perspective, but also [EFP+ 21] J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. facilitates new contributions from the wider average-atom, WDM Stephens, A. P. Thompson, A. Cangi, and S. Rajamanickam. Accelerating finite-temperature kohn-sham density functional and scientific Python communities. Another aim of the paper was theory with deep neural networks. Phys. Rev. B, 104:035120, to communicate how atoMEC benefits from a strong ecosystem of Jul 2021. doi:10.1103/PhysRevB.104.035120. ATOMEC: AN OPEN-SOURCE AVERAGE-ATOM PYTHON CODE 45 [FB19] Gérald Faussurier and Christophe Blancard. Pressure in warm temperature density-functional theory. Phys. Rev. Lett., and hot dense matter using the average-atom model. Phys. Rev. 107:163001, Oct 2011. doi:10.1103/PhysRevLett. E, 99:053201, May 2019. doi:10.1103/PhysRevE.99. 107.163001. 053201. [PR20] Martin Preising and Ronald Redmer. Metallization of dense [GDRT14] Frank Graziani, Michael P Desjarlais, Ronald Redmer, and fluid helium from ab initio simulations. Phys. Rev. B, Samuel B Trickey. Frontiers and challenges in warm dense 102:224107, Dec 2020. doi:10.1103/PhysRevB.102. matter, volume 96. Springer Science & Business, 2014. doi: 224107. 10.1007/978-3-319-04912-0. [Roz91] Balazs F. Rozsnyai. Photoabsorption in hot plasmas based [GFG+ 16] S H Glenzer, L B Fletcher, E Galtier, B Nagler, R Alonso- on the ion-sphere and ion-correlation models. Phys. Rev. A, Mori, B Barbrel, S B Brown, D A Chapman, Z Chen, C B 43:3035–3042, Mar 1991. doi:10.1103/PhysRevA.43. Curry, F Fiuza, E Gamboa, M Gauthier, D O Gericke, A Glea- 3035. son, S Goede, E Granados, P Heimann, J Kim, D Kraus, [SM91] H. B. Schlegel and J. J. W. McDouall. Do You Have SCF Sta- M J MacDonald, A J Mackinnon, R Mishra, A Ravasio, bility and Convergence Problems?, pages 167–185. Springer C Roedel, P Sperling, W Schumaker, Y Y Tsui, J Vorberger, Netherlands, Dordrecht, 1991. doi:10.1007/978-94- U Zastrau, A Fry, W E White, J B Hasting, and H J Lee. 011-3262-6_2. Matter under extreme conditions experiments at the Linac [SPS+ 14] A. N. Souza, D. J. Perkins, C. E. Starrett, D. Saumon, and Coherent Light Source. J. Phys. B, 49(9):092001, apr 2016. S. B. Hansen. Predictions of x-ray scattering spectra for warm doi:10.1088/0953-4075/49/9/092001. dense matter. Phys. Rev. E, 89:023108, Feb 2014. doi: [HK64] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. 10.1103/PhysRevE.89.023108. Phys. Rev., 136(3B):B864–B871, Nov 1964. doi:10.1103/ [SRH+ 12] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert PhysRev.136.B864. Müller, and Kieron Burke. Finding density functionals with [HMvdW 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der + machine learning. Phys. Rev. 
Lett., 108:253002, Jun 2012. Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric doi:10.1103/PhysRevLett.108.253002. Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, [SS14] C.E. Starrett and D. Saumon. A simple method for determining Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk- the ionic structure of warm dense matter. High Energy Density wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Physics, 10:35–42, 2014. doi:10.1016/j.hedp.2013. Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin 12.001. Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, [Sta16] C.E. Starrett. Kubo–Greenwood approach to conductivity Christoph Gohlke, and Travis E. Oliphant. Array programming in dense plasmas with average atom models. High Energy with NumPy. Nature, 585(7825):357–362, September 2020. Density Physics, 19:58–64, 2016. doi:10.1016/j.hedp. doi:10.1038/s41586-020-2649-2. 2016.04.001. [HRD08] Bastian Holst, Ronald Redmer, and Michael P. Desjarlais. [STJ+ 14] Sang-Kil Son, Robert Thiele, Zoltan Jurek, Beata Ziaja, and Thermophysical properties of warm dense hydrogen using Robin Santra. Quantum-mechanical calculation of ionization- quantum molecular dynamics simulations. Phys. Rev. B, potential lowering in dense plasmas. Phys. Rev. X, 4:031004, 77:184201, May 2008. doi:10.1103/PhysRevB.77. Jul 2014. doi:10.1103/PhysRevX.4.031004. 184201. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt [JFC+ 13] Weile Jia, Jiyun Fu, Zongyan Cao, Long Wang, Xuebin Chi, Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Weiguo Gao, and Lin-Wang Wang. Fast plane wave density Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- functional theory molecular dynamics calculations on multi- fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- GPU machines. Journal of Computational Physics, 251:102– rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric 115, 2013. doi:10.1016/j.jcp.2013.05.005. Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, [Job20] Joblib Development Team. Joblib: running Python functions Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, as pipeline jobs. https://joblib.readthedocs.io/, 2020. Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- tero, Charles R. Harris, Anne M. Archibald, Antônio H. [KDF+ 11] A. L. Kritcher, T. Döppner, C. Fortmann, T. Ma, O. L. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy Landen, R. Wallace, and S. H. Glenzer. In-Flight Measure- 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for ments of Capsule Shell Adiabats in Laser-Driven Implosions. Scientific Computing in Python. Nature Methods, 17:261–272, Phys. Rev. Lett., 107:015002, Jul 2011. doi:10.1103/ 2020. doi:10.1038/s41592-019-0686-2. PhysRevLett.107.015002. [Koh99] W. Kohn. Nobel lecture: Electronic structure of matter—wave functions and density functionals. Rev. Mod. Phys., 71:1253– 1266, 10 1999. doi:10.1103/RevModPhys.71.1253. [KS65] W. Kohn and L. J. Sham. Self-consistent equations including exchange and correlation effects. Phys. Rev., 140(4A):A1133– A1138, Nov 1965. doi:10.1103/PhysRev.140. A1133. [LSOM18] Susi Lehtola, Conrad Steigemann, Micael J.T. Oliveira, and Miguel A.L. Marques. Recent developments in LIBXC — A comprehensive library of functionals for density functional theory. SoftwareX, 7:1–5, 2018. doi:10.1016/j.softx. 2017.11.002. [MED11] Stefan Maintz, Bernhard Eck, and Richard Dronskowski. Speeding up plane-wave electronic-structure calculations us- ing graphics-processing units. 
Computer Physics Communi- cations, 182(7):1421–1427, 2011. doi:10.1016/j.cpc. 2011.03.010. [men14] mendeleev – A Python resource for properties of chemical elements, ions and isotopes, ver. 0.9.0. https://github.com/ lmmentel/mendeleev, 2014. [Mer65] N. David Mermin. Thermal properties of the inhomogeneous electron gas. Phys. Rev., 137:A1441–A1443, Mar 1965. doi: 10.1103/PhysRev.137.A1441. [PGW12] Mohandas Pillai, Joshua Goglio, and Thad G. Walker. Matrix numerov method for solving schrödinger’s equation. Amer- ican Journal of Physics, 80(11):1017–1019, 2012. doi: 10.1119/1.4748813. [PPF+ 11] S. Pittalis, C. R. Proetto, A. Floris, A. Sanna, C. Bersier, K. Burke, and E. K. U. Gross. Exact conditions in finite- 46 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Automatic random variate generation in Python Christoph Baumgarten‡∗ , Tirth Patel F Abstract—The generation of random variates is an important tool that is re- • For inversion methods, the structural properties of the quired in many applications. Various software programs or packages contain underlying uniform random number generator are pre- generators for standard distributions like the normal, exponential or Gamma, served and the numerical accuracy of the methods can be e.g., the programming language R and the packages SciPy and NumPy in controlled by a parameter. Therefore, inversion is usually Python. However, it is not uncommon that sampling from new/non-standard dis- the only method applied for simulations using quasi-Monte tributions is required. Instead of deriving specific generators in such situations, so-called automatic or black-box methods have been developed. These allow Carlo (QMC) methods. the user to generate random variates from fairly large classes of distributions • Depending on the use case, one can choose between a fast by only specifying some properties of the distributions (e.g. the density and/or setup with slow marginal generation time and vice versa. cumulative distribution function). In this note, we describe the implementation of such methods from the C library UNU.RAN in the Python package SciPy and The latter point is important depending on the use case: if a provide a brief overview of the functionality. large number of samples is required for a given distribution with fixed shape parameters, a slower setup that only has to be run once Index Terms—numerical inversion, generation of random variates can be accepted if the marginal generation times are low. If small to moderate samples sizes are required for many different shape parameters, then it is important to have a fast setup. The former Introduction situation is referred to as the fixed-parameter case and the latter as The generation of random variates is an important tool that is the varying parameter case. required in many applications. Various software programs or Implementations of various methods are available in the packages contain generators for standard distributions, e.g., R C library UNU.RAN ([HL07]) and in the associated R pack- ([R C21]) and SciPy ([VGO+ 20]) and NumPy ([HMvdW+ 20]) age Runuran (https://cran.r-project.org/web/packages/Runuran/ in Python. Standard references for these algorithms are the books index.html, [TL03]). The aim of this note is to introduce the [Dev86], [Dag88], [Gen03], and [Knu14]. An interested reader Python implementation in the SciPy package that makes some will find many references to the vast existing literature in these of the key methods in UNU.RAN available to Python users in works. 
While relying on general methods such as the rejection SciPy 1.8.0. These general tools can be seen as a complement principle, the algorithms for well-known distributions are often to the existing specific sampling methods: they might lead to specifically designed for a particular distribution. This is also the better performance in specific situations compared to the existing case in the module stats in SciPy that contains more than 100 generators, e.g., if a very large number of samples are required for distributions and the module random in NumPy with more than a fixed parameter of a distribution or if the implemented sampling 30 distributions. However, there are also so-called automatic or method relies on a slow default that is based on numerical black-box methods for sampling from large classes of distributions inversion of the CDF. For advanced users, they also offer various with a single piece of code. For such algorithms, information options that allow to fine-tune the generators (e.g., to control the about the distribution such as the density, potentially together with time needed for the setup step). its derivative, the cumulative distribution function (CDF), and/or the mode must be provided. See [HLD04] for a comprehensive overview of these methods. Although the development of such Automatic algorithms in SciPy methods was originally motivated to generate variates from non- Many of the automatic algorithms described in [HLD04] and standard distributions, these universal methods have advantages [DHL10] are implemented in the ANSI C library, UNU.RAN that make their usage attractive even for sampling from standard (Universal Non-Uniform RANdom variate generators). Our goal distributions. We mention some of the important properties (see was to provide a Python interface to the most important methods [LH00], [HLD04], [DHL10]): from UNU.RAN to generate univariate discrete and continuous non-uniform random variates. The following generators have been • The algorithms can be used to sample from truncated implemented in SciPy 1.8.0: distributions. • TransformedDensityRejection: Transformed * Corresponding author: christoph.baumgarten@gmail.com ‡ Unaffiliated Density Rejection (TDR) ([H9̈5], [GW92]) • NumericalInverseHermite: Hermite interpolation Copyright © 2022 Christoph Baumgarten et al. This is an open-access article based INVersion of CDF (HINV) ([HL03]) distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, • NumericalInversePolynomial: Polynomial inter- provided the original author and source are credited. polation based INVersion of CDF (PINV) ([DHL10]) AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 47 • SimpleRatioUniforms: Simple Ratio-Of-Uniforms by computing tangents at suitable design points. Note that by its (SROU) ([Ley01], [Ley03]) nature any rejection method requires not always the same number • DiscreteGuideTable: (Discrete) Guide Table of uniform variates to generate one non-uniform variate; this method (DGT) ([CA74]) makes the use of QMC and of some variance reduction methods • DiscreteAliasUrn: (Discrete) Alias-Urn method more difficult or impossible. On the other hand, rejection is often (DAU) ([Wal77]) the fastest choice for the varying parameter case. The Ratio-Of-Uniforms method (ROU, [KM77]) is another Before describing the implementation in SciPy in Section general method that relies on rejection. 
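Before turning to the ROU principle in detail, here is a small, hedged illustration of the first method in the list above. Assuming SciPy >= 1.8.0, where these classes live in scipy.stats.sampling, Transformed Density Rejection only needs the (unnormalized) density and its derivative; the class below is our own example, not code from the library:

    from scipy.stats import sampling
    from math import exp

    class StandardNormal:
        # TDR requires the unnormalized density and its derivative
        def pdf(self, x):
            return exp(-0.5 * x * x)

        def dpdf(self, x):
            return -x * exp(-0.5 * x * x)

    rng = sampling.TransformedDensityRejection(StandardNormal(), random_state=123)
    rvs = rng.rvs(10000)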
The underlying principle is that if (U, V) is uniformly distributed on the set A_f := {(u, v) : 0 < v ≤ √f(u/v), a < u/v < b}, where f is a PDF with support (a, b), then X := U/V follows a distribution according to f. In general, it is not possible to sample uniform values on A_f directly. However, if A_f ⊂ R := [u−, u+] × [0, v+] for finite constants u−, u+, v+, one can apply the rejection method: generate uniform values (U, V) on the bounding rectangle R until (U, V) ∈ A_f and return X = U/V. Automatic methods relying on the ROU method such as SROU and automatic ROU ([Ley00]) need a setup step to find a suitable region S ⊂ R² such that A_f ⊂ S and such that one can generate (U, V) uniformly on S efficiently.

Before describing the implementation in SciPy in Section scipy_impl, we give a short introduction to random variate generation in Section intro_rv_gen.

A very brief introduction to random variate generation

It is well known that random variates can be generated by inversion of the CDF F of a distribution: if U is a uniform random number on (0, 1), X := F⁻¹(U) is distributed according to F. Unfortunately, the inverse CDF can only be expressed in closed form for very few distributions, e.g., the exponential or Cauchy distribution. If this is not the case, one needs to rely on implementations of special functions to compute the inverse CDF for standard distributions like the normal, Gamma or beta distributions, or numerical methods for inverting the CDF are required. Such procedures, however, have the disadvantage that they may be slow or inaccurate, and developing fast and robust inversion algorithms such as HINV and PINV is a non-trivial task. HINV relies on Hermite interpolation of the inverse CDF and requires the CDF and PDF as an input. PINV only requires the PDF. The algorithm then computes the CDF via adaptive Gauss-Lobatto integration and an approximation of the inverse CDF using Newton's polynomial interpolation. Note that an approximation of the inverse CDF can be achieved by interpolating the points (F(x_i), x_i) for points x_i in the domain of F, i.e., no evaluation of the inverse CDF is required.

For discrete distributions, F is a step function. To compute the inverse CDF F⁻¹(U), the simplest idea would be to apply sequential search: if X takes values 0, 1, 2, ... with probabilities p_0, p_1, p_2, ..., start with j = 0 and keep incrementing j until F(j) = p_0 + ... + p_j ≥ U. When the search terminates, X = j = F⁻¹(U). Clearly, this approach is generally very slow and more efficient methods have been developed: if X takes L distinct values, DGT realizes very fast inversion using so-called guide tables / hash tables to find the index j. In contrast, DAU is not an inversion method but uses the alias method, i.e., tables are precomputed to write X as an equi-probable mixture of L two-point distributions (the alias values).

The rejection method was suggested in [VN51]. In its simplest form, assume that f is a bounded density on [a, b], i.e., f(x) ≤ M for all x ∈ [a, b]. Sample two independent uniform random variates U on [0, 1] and V on [a, b] until M · U ≤ f(V). Note that the accepted points (U, V) are uniformly distributed in the region between the x-axis and the graph of the PDF. Hence, X := V has the desired distribution f. This is a special case of the general version: if f, g are two densities on an interval J such that f(x) ≤ c · g(x) for all x ∈ J and a constant c ≥ 1, sample U uniformly distributed on [0, 1] and X distributed according to g until c · U · g(X) ≤ f(X). Then X has the desired distribution f. It can be shown that the expected number of iterations before the acceptance condition is met is equal to c. Hence, the main challenge is to find hat functions g for which c is small and from which random variates can be generated efficiently. TDR solves this problem by applying a transformation T to the density such that x ↦ T(f(x)) is concave; a hat function can then be found by computing tangents at suitable design points.

Description of the SciPy interface

SciPy provides an object-oriented API to UNU.RAN's methods. To initialize a generator, two steps are required:

1) creating a distribution class and object,
2) initializing the generator itself.

In step 1, a distribution object must be created that implements the required methods (e.g., pdf, cdf). This can either be a custom object or a distribution object from the classes rv_continuous or rv_discrete in SciPy. Once the generator is initialized from the distribution object, it provides a rvs method to sample random variates from the given distribution. It also provides a ppf method that approximates the inverse CDF if the initialized generator uses an inversion method. The following example illustrates how to initialize the NumericalInversePolynomial (PINV) generator for the standard normal distribution:

    import numpy as np
    from scipy.stats import sampling
    from math import exp

    # create a distribution class with an implementation
    # of the PDF. Note that the normalization constant
    # is not required
    class StandardNormal:
        def pdf(self, x):
            return exp(-0.5 * x**2)

    # create a distribution object and initialize the
    # generator
    dist = StandardNormal()
    rng = sampling.NumericalInversePolynomial(dist)

    # sample 100,000 random variates from the given
    # distribution
    rvs = rng.rvs(100000)

As the NumericalInversePolynomial generator uses an inversion method, it also provides a ppf method that approximates the inverse CDF:

    # evaluate the approximate PPF at a few points
    ppf = rng.ppf([0.1, 0.5, 0.9])

It is also easy to sample from a truncated distribution by passing a domain argument to the constructor of the generator. For example, to sample from the truncated normal distribution:

    # truncate the distribution by passing a
    # `domain` argument
    rng = sampling.NumericalInversePolynomial(
        dist, domain=(-1, 1)
    )

While the default options of the generators should work well in many situations, we point out that there are various parameters that the user can modify, e.g., to provide further information about the distribution (such as mode or center) or to control the numerical accuracy of the approximated PPF (u_resolution). Details can be found in the SciPy documentation https://docs.scipy.org/doc/scipy/reference/.

The above code can easily be generalized to sample from parametrized distributions using instance attributes in the distribution class. For example, to sample from the gamma distribution with shape parameter alpha, we can create the distribution class with parameters as instance attributes:

    class Gamma:
        def __init__(self, alpha):
            self.alpha = alpha

        def pdf(self, x):
            return x**(self.alpha - 1) * exp(-x)

        def support(self):
            return 0, np.inf

    # initialize a distribution object with varying
    # parameters
    dist1 = Gamma(2)
    dist2 = Gamma(3)

    # initialize a generator for each distribution
    rng1 = sampling.NumericalInversePolynomial(dist1)
    rng2 = sampling.NumericalInversePolynomial(dist2)

In the above example, the support method is used to set the domain of the distribution. This can alternatively be done by passing a domain parameter to the constructor.

In addition to continuous distributions, two UNU.RAN methods have been added in SciPy to sample from discrete distributions. In this case, the distribution can either be represented using a probability vector (which is passed to the constructor as a Python list or NumPy array) or a Python object with the implementation of the probability mass function. In the latter case, a finite domain must be passed to the constructor or the object should implement the support method¹.

    # Probability vector to represent a discrete
    # distribution. Note that the probability vector
    # need not be normalized
    pv = [0.1, 9.0, 2.9, 3.4, 0.3]

    # PCG64 uniform RNG with seed 123
    urng = np.random.default_rng(123)
    rng = sampling.DiscreteAliasUrn(
        pv, random_state=urng
    )

The available NumPy bit generators are documented at https://numpy.org/doc/stable/reference/random/bit_generators/index.html. To change the uniform random number generator, a random_state parameter can be passed as shown in the example below:

    # 64-bit PCG random number generator in NumPy
    urng = np.random.Generator(np.random.PCG64())
    # The above line can also be replaced by:
    # ``urng = np.random.default_rng()``
    # as PCG64 is the default generator starting
    # from NumPy 1.19.0

    # change the uniform random number generator by
    # passing the `random_state` argument
    rng = sampling.NumericalInversePolynomial(
        dist, random_state=urng
    )

We also point out that the PPF of inversion methods can be applied to sequences of quasi-random numbers. SciPy provides different sequences in its QMC module (scipy.stats.qmc). NumericalInverseHermite provides a qrvs method which generates random variates using QMC methods present in SciPy (scipy.stats.qmc) as uniform random number generators³. The next example illustrates how to use qrvs with a generator created directly from a SciPy distribution object.

    from scipy import stats
    from scipy.stats import qmc

    # 1D Halton sequence generator.
    qrng = qmc.Halton(d=1)

    rng = sampling.NumericalInverseHermite(stats.norm())

    # generate quasi random numbers using the Halton
    # sequence as uniform variates
    qrvs = rng.qrvs(size=100, qmc_engine=qrng)

Benchmarking

To analyze the performance of the implementation, we tested the methods applied to several standard distributions against the generators in NumPy and the original UNU.RAN C library. In addition, we selected one non-standard distribution to demonstrate that substantial reductions in the runtime can be achieved compared to other implementations. All the benchmarks were carried out using NumPy 1.22.4 and SciPy 1.8.1 running on a single core on Ubuntu 20.04.3 LTS with an Intel(R) Core(TM) i7-8750H CPU (2.20 GHz clock speed, 16 GB RAM). We ran the benchmarks with NumPy's MT19937 (Mersenne Twister) and PCG64 random number generators (np.random.MT19937 and np.random.PCG64) in Python and used NumPy's C implementation of MT19937 in the UNU.RAN C benchmarks. As explained above, the use of PCG64 is recommended, and MT19937 is only included to compare the speed of the Python implementation and the C library by relying on the same uniform number generator (i.e., differences in the performance of the uniform number generation are not taken into account).
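A simplified sketch of how such a setup-versus-sampling comparison can be timed (not the benchmark scripts themselves, which are linked below) is:

    import time
    import numpy as np
    from scipy.stats import sampling
    from math import exp

    class StandardNormal:
        def pdf(self, x):
            return exp(-0.5 * x * x)

    t0 = time.perf_counter()
    rng = sampling.NumericalInversePolynomial(
        StandardNormal(), random_state=np.random.default_rng(0)
    )
    t1 = time.perf_counter()
    rvs = rng.rvs(1_000_000)  # large sample: sampling time dominates the setup
    t2 = time.perf_counter()
    print(f"setup: {(t1 - t0) * 1e3:.1f} ms, sampling: {(t2 - t1) * 1e3:.1f} ms")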
The code for all the benchmarks can be found on # sample from the given discrete distribution rvs = rng.rvs(100000) https://github.com/tirthasheshpatel/unuran_benchmarks. The methods used in NumPy to generate normal, gamma, and beta random variates are: Underlying uniform pseudo-random number generators NumPy provides several generators for uniform pseudo-random • the ziggurat algorithm ([MT00b]) to sample from the numbers2 . It is highly recommended to use NumPy’s default standard normal distribution, random number generator np.random.PCG64 for better speed 2. By default, NumPy’s legacy random number generator, MT19937 and performance, see [O’N14] and https://numpy.org/doc/stable/ (np.random.RandomState()) is used as the uniform random number generator for consistency with the stats module in SciPy. 1. Support for discrete distributions with infinite domain hasn’t been added 3. In SciPy 1.9.0, qrvs will be added to yet. NumericalInversePolynomial. AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 49 • the rejection algorithms in Chapter XII.2.6 in [Dev86] if 70-200 times faster. This clearly shows the benefit of using a α < 1 and in [MT00a] if α > 1 for the Gamma distribution, black-box algorithm. • Johnk’s algorithm ([Jöh64], Section IX.3.5 in [Dev86]) if max{α, β } ≤ 1, otherwise a ratio of two Gamma variates Conclusion with shape parameter α and β (see Section IX.4.1 in The interface to UNU.RAN in SciPy provides easy access to [Dev86]) for the beta distribution. different algorithms for non-uniform variate generation for large Benchmarking against the normal, gamma, and beta distributions classes of univariate continuous and discrete distributions. We have shown that the methods are easy to use and that the al- Table 1 compares the performance for the standard normal, gorithms perform very well both for standard and non-standard Gamma and beta distributions. We recall that the density of the distributions. A comprehensive documentation suite, a tutorial Gamma distribution with shape parameter a > 0 is given by and many examples are available at https://docs.scipy.org/doc/ x ∈ (0, ∞) 7→ xa−1 e−x and the density of the beta distribution with α−1 (1−x)β −1 scipy/reference/stats.sampling.html and https://docs.scipy.org/doc/ shape parameters α, β > 0 is given by x ∈ (0, 1) 7→ x B(α,β ) scipy/tutorial/stats/sampling.html. Various methods have been im- where Γ(·) and B(·, ·) are the Gamma and beta functions. The plemented in SciPy, and if specific use cases require additional results are reported in Table 1. functionality from UNU.RAN, the methods can easily be added We summarize our main observations: to SciPy given the flexible framework that has been developed. 1) The setup step in Python is substantially slower than Another area of further development is to better integrate SciPy’s in C due to expensive Python callbacks, especially for QMC generators for the inversion methods. PINV and HINV. However, the time taken for the setup is Finally, we point out that other sampling methods like Markov low compared to the sampling time if large samples are Chain Monte Carlo and copula methods are not part of SciPy. Rel- drawn. Note that as expected, SROU has a very fast setup evant Python packages in that context are PyMC ([PHF10]), PyS- such that this method is suitable for the varying parameter tan relying on Stan ([Tea21]), Copulas (https://sdv.dev/Copulas/) case. and PyCopula (https://blent-ai.github.io/pycopula/). 
2) The sampling time in Python is slightly higher than in C for the MT19937 random number generator. If the Acknowledgments recommended PCG64 generator is used, the sampling The authors wish to thank Wolfgang Hörmann and Josef Leydold time in Python is slightly lower. The only exception for agreeing to publish the library under a BSD license and for is SROU: due to Python callbacks, the performance is helpful feedback on the implementation and this note. In addition, substantially slower than in C. However, as the main we thank Ralf Gommers, Matt Haberland, Nicholas McKibben, advantage of SROU is the fast setup time, the main use Pamphile Roy, and Kai Striega for their code contributions, re- case is the varying parameter case (i.e., the method is not views, and helpful suggestions. The second author was supported supposed to be used to generate large samples). by the Google Summer of Code 2021 program5 . 3) PINV, HINV, and TDR are at most about 2x slower than the specialized NumPy implementation for the normal R EFERENCES distribution. For the Gamma and beta distribution, they even perform better for some of the chosen shape pa- [CA74] Hui-Chuan Chen and Yoshinori Asau. On gener- ating random variates from an empirical distribution. rameters. These results underline the strong performance AIIE Transactions, 6(2):163–166, 1974. doi:10.1080/ of these black-box approaches even for standard distribu- 05695557408974949. tions. [Dag88] John Dagpunar. Principles of random variate generation. 4) While the application of PINV requires bounded densi- Oxford University Press, USA, 1988. [Dev86] Luc Devroye. Non-Uniform Random Variate Generation. ties, no issues are encountered for α = 0.05 since the Springer-Verlag, New York, 1986. doi:10.1007/978-1- unbounded part is cut off by the algorithm. However, the 4613-8643-8. setup can fail for very small values of α. [DHL10] Gerhard Derflinger, Wolfgang Hörmann, and Josef Leydold. Random variate generation by numerical inversion when only the density is known. ACM Transactions on Modeling and Benchmarking against a non-standard distribution Computer Simulation (TOMACS), 20(4):1–25, 2010. doi: We benchmark the performance of PINV to sample from the 10.1145/1842722.1842723. [Gen03] James E Gentle. Random number generation and Monte Carlo generalized normal distribution ([Sub23]) whose density is given p methods, volume 381. Springer, 2003. doi:10.1007/ pe−|x| by x ∈ (−∞, ∞) 7→ 2Γ(1/p) against the method proposed in [NP09] b97336. and against the implementation in SciPy’s gennorm distribu- [GW92] Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: tion. The approach in [NP09] relies on transforming Gamma Series C (Applied Statistics), 41(2):337–348, 1992. doi:10. variates to the generalized normal distribution whereas SciPy 2307/2347565. relies on computing the inverse of CDF of the Gamma distri- [H9̈5] Wolfgang Hörmann. A rejection technique for sampling from bution (https://docs.scipy.org/doc/scipy/reference/generated/scipy. T-concave distributions. ACM Trans. Math. Softw., 21(2):182– 193, 1995. doi:10.1145/203082.203089. special.gammainccinv.html). The results for different values of p [HL03] Wolfgang Hörmann and Josef Leydold. Continuous random are shown in Table 2. variate generation by fast numerical inversion. ACM Trans- PINV is usually about twice as fast than the special- actions on Modeling and Computer Simulation (TOMACS), 13(4):347–362, 2003. doi:10.1145/945511.945517. 
ized method and about 15-150 times faster than SciPy’s implementation4 . We also found an R package pgnorm (https: 4. In SciPy 1.9.0, the speed will be improved by implementing the method //cran.r-project.org/web/packages/pgnorm/) that implements vari- from [NP09] ous approaches from [KR13]. In that case, PINV is usually about 5. https://summerofcode.withgoogle.com/projects/#5912428874825728 50 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Python C Distribution Method Setup Sampling (PCG64) Sampling (MT19937) Setup Sampling (MT19937) PINV 4.6 29.6 36.5 0.27 32.4 HINV 2.5 33.7 40.9 0.38 36.8 Standard normal TDR 0.2 37.3 47.8 0.02 41.4 SROU 8.7 µs 2510 2160 0.5 µs 232 NumPy - 17.6 22.4 - - PINV 196.0 29.8 37.2 37.9 32.5 Gamma(0.05) HINV 24.5 36.1 43.8 1.9 40.7 NumPy - 55.0 68.1 - - PINV 16.5 31.2 38.6 2.0 34.5 Gamma(0.5) HINV 4.9 34.2 41.7 0.6 37.9 NumPy - 86.4 99.2 - - PINV 5.3 30.8 38.7 0.5 34.6 HINV 5.3 33 40.6 0.4 36.8 Gamma(3.0) TDR 0.2 38.8 49.6 0.03 44 NumPy - 36.5 47.1 - - PINV 21.4 33.1 39.9 2.4 37.3 Beta(0.5, 0.5) HINV 2.1 38.4 45.3 0.2 42 NumPy - 101 112 - - HINV 0.2 37 44.3 0.01 41.1 Beta(0.5, 1.0) NumPy - 125 138 - - PINV 15.7 30.5 37.2 1.7 34.3 HINV 4.1 33.4 40.8 0.4 37.1 Beta(1.3, 1.2) TDR 0.2 46.8 57.8 0.03 45 NumPy - 74.3 97 - - PINV 9.7 30.2 38.2 0.9 33.8 HINV 5.8 33.7 41.2 0.4 37.4 Beta(3.0, 2.0) TDR 0.2 42.8 52.8 0.02 44 NumPy - 72.6 92.8 - - TABLE 1 Average time taken (reported in milliseconds, unless mentioned otherwise) to sample 1 million random variates from the standard normal distribution. The mean is computed over 7 iterations. Standard deviations are not reported as they were very small (less than 1% of the mean in the large majority of cases). Note that not all methods can always be applied, e.g., TDR cannot be applied to the Gamma distribution if a < 1 since the PDF is not log-concave in that case. As NumPy uses rejection algorithms with precomputed constants, no setup time is reported. p 0.25 0.45 0.75 1 1.5 2 5 8 Nardon and Pianca (2009) 100 101 101 45 148 120 128 122 SciPy’s gennorm distribution 832 1000 1110 559 5240 6720 6230 5950 Python (PINV Method, PCG64 urng) 50 47 45 41 40 37 38 38 TABLE 2 Comparing SciPy’s implementation and a specialized method against PINV to sample 1 million variates from the generalized normal distribution for different values of the parameter p. Time reported in milliseconds. The mean is computer over 7 iterations. [HL07] Wolfgang Hörmann and Josef Leydold. UNU.RAN - Univer- ates. ACM Transactions on Mathematical Software (TOMS), sal Non-Uniform RANdom number generators, 2007. https: 3(3):257–260, 1977. doi:10.1145/355744.355750. //statmath.wu.ac.at/unuran/doc.html. [Knu14] Donald E Knuth. The Art of Computer Programming, Volume [HLD04] Wolfgang Hörmann, Josef Leydold, and Gerhard Derflinger. 2: Seminumerical algorithms. Addison-Wesley Professional, Automatic nonuniform random variate generation. Springer, 2014. doi:10.2307/2317055. 2004. doi:10.1007/978-3-662-05946-3. [KR13] Steve Kalke and W-D Richter. Simulation of the p-generalized [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Gaussian distribution. Journal of Statistical Computation Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric and Simulation, 83(4):641–667, 2013. doi:10.1080/ Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, 00949655.2011.631187. Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van [Ley00] Josef Leydold. 
Automatic sampling with the ratio-of-uniforms Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del method. ACM Transactions on Mathematical Software Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, (TOMS), 26(1):78–98, 2000. doi:10.1145/347837. Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer 347863. Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array pro- [Ley01] Josef Leydold. A simple universal generator for continuous gramming with NumPy. Nature, 585(7825):357–362, 2020. and discrete univariate T-concave distributions. ACM Transac- doi:10.1038/s41586-020-2649-2. tions on Mathematical Software (TOMS), 27(1):66–82, 2001. [Jöh64] MD Jöhnk. Erzeugung von betaverteilten und gammaverteilten doi:10.1145/382043.382322. Zufallszahlen. Metrika, 8(1):5–15, 1964. doi:10.1007/ [Ley03] Josef Leydold. Short universal generators via generalized bf02613706. ratio-of-uniforms method. Mathematics of Computation, [KM77] Albert J Kinderman and John F Monahan. Computer gen- 72(243):1453–1471, 2003. doi:10.1090/s0025-5718- eration of random variables using the ratio of uniform devi- 03-01511-4. AUTOMATIC RANDOM VARIATE GENERATION IN PYTHON 51 [LH00] Josef Leydold and Wolfgang Hörmann. Universal algorithms as an alternative for generating non-uniform continuous ran- dom variates. In Proceedings of the International Conference on Monte Carlo Simulation 2000., pages 177–183, 2000. [MT00a] George Marsaglia and Wai Wan Tsang. A simple method for generating gamma variables. ACM Transactions on Math- ematical Software (TOMS), 26(3):363–372, 2000. doi: 10.1145/358407.358414. [MT00b] George Marsaglia and Wai Wan Tsang. The ziggurat method for generating random variables. Journal of statistical soft- ware, 5(1):1–7, 2000. doi:10.18637/jss.v005.i08. [NP09] Martina Nardon and Paolo Pianca. Simulation techniques for generalized Gaussian densities. Journal of Statistical Computation and Simulation, 79(11):1317–1329, 2009. doi: 10.1080/00949650802290912. [O’N14] Melissa E. O’Neill. PCG: A family of simple fast space- efficient statistically good algorithms for random number gen- eration. Technical Report HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, September 2014. [PHF10] Anand Patil, David Huard, and Christopher J Fonnesbeck. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35(4):1, 2010. doi:10.18637/jss. v035.i04. [R C21] R Core Team. R: A language and environment for statistical computing, 2021. https://www.R-project.org/. [Sub23] M.T. Subbotin. On the law of frequency of error. Mat. Sbornik, 31(2):296–301, 1923. [Tea21] Stan Development Team. Stan modeling language users guide and reference manual, version 2.28., 2021. https://mc-stan.org. [TL03] Günter Tirler and Josef Leydold. Automatic non-uniform random variate generation in r. In Proceedings of DSC, page 2, 2003. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, pages 1–12, 2020. doi:10.1038/ s41592-019-0686-2. [VN51] John Von Neumann. Various techniques used in connection with random digits. Appl. Math Ser, 12(36-38):3, 1951. [Wal77] Alastair J Walker. An efficient method for generating discrete random variables with general distributions. ACM Transac- tions on Mathematical Software (TOMS), 3(3):253–256, 1977. doi:10.1145/355744.355749. 52 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
(SCIPY 2022) Utilizing SciPy and other open source packages to provide a powerful API for materials manipulation in the Schrödinger Materials Suite Alexandr Fonari‡∗ , Farshad Fallah‡ , Michael Rauch‡ F Abstract—The use of several open source scientific packages in the open-source and many of which blend the two to optimize capa- Schrödinger Materials Science Suite will be discussed. A typical workflow for bilities and efficiency. For example, the main simulation engine materials discovery will be described, discussing how open source packages for molecular quantum mechanics is the Jaguar [BHH+ 13] pro- have been incorporated at every stage. Some recent implementations of ma- prietary code. The proprietary classical molecular dynamics code chine learning for materials discovery will be discussed, as well as how open Desmond (distributed by Schrödinger, Inc.) [SGB+ 14] is used to source packages were leveraged to achieve results faster and more efficiently. obtain physical properties of soft materials, surfaces and polymers. Index Terms—materials, active learning, OLED, deposition, evaporation For periodic quantum mechanics, the main simulation engine is the open source code Quantum ESPRESSO (QE) [GAB+ 17]. One of the co-authors of this proceedings (A. Fonari) contributes to Introduction the QE code in order to make integration with the Materials Suite more seamless and less error-prone. As part of this integration, A common materials discovery practice or workflow is to start support for using the portable XML format for input and output with reading an experimental structure of a material or generating in QE has been implemented in the open source Python package a structure in silico, computing its properties of interest (e.g. qeschema [BDBF]. elastic constants, electrical conductivity), tuning the material by Figure 2 gives an overview of some of the various products that modifying its structure (e.g. doping) or adding and removing compose the Schrödinger Materials Science Suite. The various atoms (deposition, evaporation), and then recomputing the proper- workflows are implemented mainly in Python (some of them ties of the modified material (Figure 1). Computational materials described below), calling on proprietary or open-source code discovery leverages such workflows to empower researchers to where appropriate, to improve the performance of the software explore vast design spaces and uncover root causes without (or in and reduce overall maintenance. conjunction with) laboratory experimentation. The materials discovery cycle can be run in a high-throughput Software tools for computational materials discovery can be manner, enumerating different structure modifications in a system- facilitated by utilizing existing libraries that cover the fundamental atic fashion, such as doping ratio in a semiconductor or depositing mathematics used in the calculations in an optimized fashion. This different adsorbates. As we will detail herein, there are several use of existing libraries allows developers to devote more time open source packages that allow the user to generate a large to developing new features instead of re-inventing established number of structures, run calculations in high throughput manner methods. As a result, such a complementary approach improves and analyze the results. For example, the open source package the performance of computational materials software and reduces pymatgen [ORJ+ 13] facilitates generation and analysis of periodic overall maintenance. structures. 
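To make the shape of such a discovery workflow concrete, a deliberately generic sketch follows; every name in it is a placeholder for the corresponding builder, simulation engine, or analysis step, not an actual Schrödinger or open-source API:

    # Hypothetical skeleton of a materials-discovery loop: enumerate structure
    # modifications, compute properties for each candidate, keep promising ones.
    def discovery_loop(seed_structure, modifications, compute_properties, is_promising):
        results = {}
        for label, modify in modifications.items():
            candidate = modify(seed_structure)              # e.g. doping, deposition
            results[label] = compute_properties(candidate)  # e.g. a QE or Desmond job
        return {label: props for label, props in results.items() if is_promising(props)}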
It can generate inputs for and read outputs of QE, the The Schrödinger Materials Science Suite [LLC22] is a propri- commercial codes VASP and Gaussian, and several other formats. etary computational chemistry/physics platform that streamlines To run and manage workflow jobs in a high-throughput manner, materials discovery workflows into a single graphical user inter- open source packages such as Custodian [ORJ+ 13] and AiiDA face (Materials Science Maestro). The interface is a single portal [HZU+ 20] can be used. for structure building and enumeration, physics-based modeling and machine learning, visualization and analysis. Tying together the various modules are a wide variety of scientific packages, some Materials import and generation of which are proprietary to Schrödinger, Inc., some of which are For reading and writing of material structures, several open source packages (e.g. OpenBabel [OBJ+ 11], RDKit [LTK+ 22]) have * Corresponding author: sasha.fonari@schrodinger.com ‡ Schrödinger Inc., 1540 Broadway, 24th Floor. New York, NY 10036 implemented functionality for working with several commonly used formats (e.g. CIF, PDB, mol, xyz). Periodic structures Copyright © 2022 Alexandr Fonari et al. This is an open-access article of materials, mainly coming from single crystal X-ray/neutron distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, diffraction experiments, are distributed in CIF (Crystallographic provided the original author and source are credited. Information File), PDB (Protein Data Bank) and lately mmCIF UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 53 Fig. 1: Example of a workflow for computational materials discovery. Fig. 2: Some example products that compose the Schrödinger Materials Science Suite. formats [WF05]. Correctly reading experimental structures is of work went into this project) and others to correctly read and significant importance, since the rest of the materials discovery convert periodic structures in OpenBabel. By version 3.1.1 (the workflow depends on it. In addition to atom coordinates and most recent at writing time), the authors are not aware of any periodic cell information, structural data also contains symme- structures read incorrectly by OpenBabel. In general, non-periodic try operations (listed explicitly or by the means of providing molecular formats are simpler to handle because they only contain a space group) that can be used to decrease the number of atom coordinates but no cell or symmetry information. OpenBabel computations required for a particular system by accounting for has Python bindings but due to the GPL license limitation, it is symmetry. This can be important, especially when scaling high- called as a subprocess from the Schrödinger Materials Suite. throughput calculations. From file, structure is read in a structure Another important consideration in structure generation is object through which atomic coordinates (as a NumPy array) and modeling of substitutional disorder in solid alloys and materials chemical information of the material can be accessed and updated. with point defects (intermetallics, semiconductors, oxides and Structure object is similar to the one implemented in open source their crystalline surfaces). In such cases, the unit cell and atomic packages such as pymatgen [ORJ+ 13] and ASE [LMB+ 17]. 
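As a small illustration of that kind of structure object, using the open-source pymatgen API rather than the proprietary one (the file name is hypothetical):

    from pymatgen.core import Structure
    from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

    # "material.cif" is a placeholder; from_file also reads POSCAR, CSSR and
    # other common formats
    st = Structure.from_file("material.cif")

    print(st.lattice)          # periodic cell information
    print(st.cart_coords[:3])  # Cartesian coordinates as a NumPy array
    print(st.composition.reduced_formula)

    # symmetry information, which can be used to reduce the number of calculations
    print(SpacegroupAnalyzer(st).get_space_group_symbol())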
All sites of the crystal or surface slab are well defined while the chem- the structure manipulations during the workflows are done by ical species occupying the site may vary. In order to simulate sub- using structure object interface (see structure deformation example stitutional disorder, one must generate the ensemble of structures below). Example of Structure object definition in pymatgen: that includes all statistically significant atomic distributions in a class Structure: given unit cell. This can be achieved by a brute force enumeration of all symmetrically unique atomic structures with a given number def __init__(self, lattice, species, coords, ...): of vacancies, impurities or solute atoms. The open source library """Create a periodic structure.""" enumlib [HF08] implements algorithms for such a systematic One consideration of note is that PDB, CIF and mmCIF structure enumeration of periodic structures. The enumlib package consists formats allow description of the positional disorder (for example, of several Fortran binaries and Python scripts that can be run as a a solvent molecule without a stable position within the cell subprocess (no Python bindings). This allows the user to generate which can be described by multiple sets of coordinates). Another a large set of symmetrically nonequivalent materials with different complication is that experimental data spans an interval of almost compositions (e.g. doping or defect concentration). a century: one of the oldest crystal structures deposited in the Recently, we applied this approach in simultaneous study of Cambridge Structural Database (CSD) [GBLW16] dates to 1924 the activity and stability of Pt based core-shell type catalysts for [HM24]. These nuances and others present nontrivial technical the oxygen reduction reaction [MGF+ 19]. We generated a set of challenges for developers. Thus, it has been a continuous effort stable doped Pt/transition metal/nitrogen surfaces using periodic by Schrödinger, Inc. (at least 39 commits and several weeks of enumeration. Using QE to perform periodic density functional 54 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Jaguar that took 457,265 CPU hours (~52 years) [MAS+ 20]. An- other similar case study is the high-throughput molecular dynam- ics simulations (MD) of thermophysical properties of polymers for various applications [ABG+ 21]. There, using Desmond we com- puted the glass transition temperature (Tg ) of 315 polymers and compared the results with experimental measurements [Bic02]. This study took advantage of GPU (graphics processing unit) support as implemented in Desmond, as well as the job scheduler API described above. Other workflows implemented in the Schrödinger Materials Science Suite utilize open source packages as well. For soft mate- rials (polymers, organic small molecules and substrates composed of soft molecules), convex hull and related mathematical methods Fig. 3: Example of the job submission process. are important for finding possible accessible solvent voids (during submerging or sorption) and adsorbate sites (during molecular deposition). These methods are conveniently implemented in the theory (DFT) calculations, we assessed surface phase diagrams open source SciPy [VGO+ 20] and NumPy [HMvdW+ 20] pack- for Pt alloys and identified the avenues for stabilizing the cost ages. 
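A minimal sketch of the kind of geometric test these SciPy tools enable (illustrative random coordinates only; the actual deposition and evaporation workflows described next embed this in the Desmond-based logic) is:

    import numpy as np
    from scipy.spatial import ConvexHull, Delaunay

    # illustrative Cartesian coordinates (Angstrom) of surface/substrate atoms
    points = np.random.default_rng(7).uniform(0.0, 10.0, size=(60, 3))

    hull = ConvexHull(points)
    print("hull volume (A^3):", hull.volume)

    # test whether a candidate adsorbate position lies inside the hull
    candidate = np.array([5.0, 5.0, 12.0])
    inside = Delaunay(points[hull.vertices]).find_simplex(candidate) >= 0
    print("candidate inside hull:", bool(inside))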
Thus, we implemented molecular deposition and evaporation effective core-shell systems by a judicious choice of the catalyst workflows by using the Desmond MD engine as the backend core material. Such catalysts may prove critical in electrocatalysis in tandem with the convex hull functionality. This workflow for fuel cell applications. enables simulation of the deposition and evaporation of the small molecules on a substrate. We utilized the aforementioned deposition workflow in the study of organic light-emitting diodes Workflow capabilities (OLEDs), which are fabricated using a stepwise process, where In the last section, we briefly described a complete workflow from new layers are deposited on top of previous layers. Both vacuum structure generation and enumeration to periodic DFT calculations and solution deposition processes have been used to prepare these to analysis. In order to be able to run a massively parallel films, primarily as amorphous thin film active layers lacking screening of materials, a highly scalable and stable queuing system long-range order. Each of these deposition techniques introduces (job scheduler) is required. We have implemented a job queuing changes to the film structure and consequently, different charge- system on top of the most used queuing systems (LSF, PBS, transfer and luminescent properties [WKB+ 22]. SGE, SLURM, TORQUE, UGE) and exposed a Python API to As can be seen from above, a workflow is usually some submit and monitor jobs. In line with technological advancements, sort of structure modification through the structure object with cloud is also supported by means of a virtual cluster configured a subsequent call to a backend code and analysis of its output if with SLURM. This allows the user to submit a large number it succeeds. Input for the next iteration depends on the output of jobs, limited only by SLURM scheduling capabilities and of the previous iteration in some workflows. Due to the large cloud resources. In order to accommodate job dependencies in chemical and manipulation space of the materials, sometimes it workflows, for each job, a parent job (or multiple parent jobs) can very tricky to keep code for all workflows follow the same code be defined forming a directed graph of jobs (Figure 3). logic. For every workflow and/or functionality in the Materials There could be several reasons for a job to fail. Depending Science Suite, some sort of peer reviewed material (publication, on the reason of failure, there are several restart and recovery conference presentation) is created where implemented algorithms mechanisms in place. The lowest level is the restart mechanism are described to facilitate reproducibility. (in SLURM it is called requeue) which is performed by the queuing system itself. This is triggered when a node goes down. Data fitting algorithms and use cases On the cloud, preemptible instances (nodes) can go offline at any moment. In addition, workflows implemented in the proprietary Materials simulation engines for QM, periodic DFT, and classical Schrödinger Materials Science Suite have built-in methods for MD (referred to herein as backends) are frequently written in handling various types of failure. For example, if the simulation compiled languages with enabled parallelization for CPU or GPU is not converging to a requested energy accuracy, it is wasteful hardware. These backends are called from Python workflows to blindly restart the calculation without changing some input using the job queuing systems described above. 
Meanwhile, pack- parameters. However, in the case of a failure due to full disk ages such as SciPy and NumPy provide sophisticated numerical space, it is reasonable to try restart with hopes to get a node with function optimization and fitting capabilities. Here, we describe more empty disk space. If a job fails (and cannot be restarted), examples of how the Schrödinger suite can be used to combine all its children (if any) will not start, thus saving queuing and materials simulations with popular optimization routines in the computational time. SciPy ecosystem. Having developed robust systems for running calculations, job Recently we implemented convex analysis of queuing and troubleshooting (autonomously, when applicable), the stress strain curve (as described here [PKD18]). the developed workflows have allowed us and our customers to scipy.optimize.minimize is used for a constrained perform massive screenings of materials and their properties. For minimization with boundary conditions of a function related to example, we reported a massive screening of 250,000 charge- the stress strain curve. The stress strain curve is obtained from a conducting organic materials, totaling approximately 3,619,000 series of MD simulations on deformed cells (cell deformations DFT SCF (self-consistent field) single-molecule calculations using are defined by strain type and deformation step). The pressure UTILIZING SCIPY AND OTHER OPEN SOURCE PACKAGES TO PROVIDE A POWERFUL API FOR MATERIALS MANIPULATION IN THE SCHRÖDINGER MATERIALS SUITE 55 tensor of a deformed cell is related to stress. This analysis allowed and AutoQSAR [DDS+ 16] from the Schrödinger suite. Depending prediction of elongation at yield for high density polyethylene on the type of materials, benchmark data can be obtained using polymer. Figure 4 shows obtained calculated yield of 10% vs. different codes available in the Schrödinger suite: experimental value within 9-18% range [BAS+ 20]. • small molecules and finite systems - Jaguar The scipy.optimize package is used for a least-squares • periodic systems - Quantum ESPRESSO fit of the bulk energies at different cell volumes (compressed • larger polymeric and similar systems - Desmond and expanded) in order to obtain the bulk modulus and equation of state (EOS) of a material. In the Schrödinger suite this was Different materials systems require different descriptors for implemented as a part of an EOS workflow, in which fitting is featurization. For example, for crystalline periodic systems, we performed on the results obtained from a series of QE calculations have implemented several sets of tailored descriptors. Genera- performed on the original as well as compressed and expanded tion of these descriptors again uses a mix of open source and (deformed) cells. An example of deformation applied to a structure Schrödinger proprietary tools. 
Specifically: in pymatgen: • elemental features such as atomic weight, number of from pymatgen.analysis.elasticity import strain valence electrons in s, p and d-shells, and electronegativity from pymatgen.core import lattice from pymatgen.core import structure • structural features such as density, volume per atom, and packing fraction descriptors implemented in the open deform = strain.Deformation([ source matminer package [WDF+ 18] [1.0, 0.02, 0.02], • intercalation descriptors such as cation and anion counts, [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]) crystal packing fraction, and average neighbor ionicity [SYC+ 17] implemented in the Schrödinger suite latt = lattice.Lattice([ • three-dimensional smooth overlap of atomic positions [3.84, 0.00, 0.00], [1.92, 3.326, 0.00], (SOAP) descriptors implemented in the open source [0.00, -2.22, 3.14], DScribe package [HJM+ 20]. ]) We are currently training models that use these descriptors st = structure.Structure( to predict properties, such as bulk modulus, of a set of Li- latt, containing battery related compounds [Cha]. Several models will ["Si", "Si"], [[0, 0, 0], [0.75, 0.5, 0.75]]) be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+ 11]) and AutoQSAR. strained_st = deform.apply_to_structure(st) For isolated small molecules and extended non-periodic sys- This is also an example of loosely coupled (embarrassingly tems, RDKit can be used to generate a large number of atomic and parallel) jobs. In particular, calculations of the deformed cells molecular descriptors. A lot of effort has been devoted to ensure only depend on the bulk calculation and do not depend on each that RDKit can be used on a wide variety of materials that are other. Thus, all the deformation jobs can be submitted in parallel, supported by the Schrödinger suite. At the time of writing, the 4th facilitating high-throughput runs. most active contributor to RDKit is Ricardo Rodriguez-Schmidt Structure refinement from powder diffraction experiment is an- from Schrödinger [RDK]. other example where more complex optimization is used. Powder Recently, active learning (AL) combined with DFT has re- diffraction is a widely used method in drug discovery to assess ceived much attention to address the challenge of leveraging purity of the material and discover known or unknown crystal exhaustive libraries in materials informatics [VPB21], [SPA+ 19]. polymorphs [KBD+ 21]. In particular, there is interest in fitting of On our side, we have implemented a workflow that employs active the experimental powder diffraction intensity peaks to the indexed learning (AL) for intelligent and iterative identification of promis- peaks (Pawley refinement) [JPS92]. Here we employed the open ing materials candidates within a large dataset. In the framework of source lmfit package [NSA+ 16] to perform a minimization of AL, the predicted value with associated uncertainty is considered the multivariable Voigt-like function that represents the entire to decide what materials to be added in each iteration, aiming to diffraction spectrum. This allows the user to refine (optimize) unit improve the model performance in the next iteration (Figure 5). cell parameters coming from the indexing data and as the result, Since it could be important to consider multiple properties goodness of fit (R-factor) between experimental and simulated simultaneously in material discovery, multiple property optimiza- spectrum is minimized. 
Machine learning techniques

Of late, there is great interest in machine learning assisted materials discovery. There are several components required to perform machine learning assisted materials discovery. In order to train a model, benchmark data from simulation and/or experimental data is required. Besides benchmark data, computation of the relevant descriptors is required (see below). Finally, a model based on benchmark data and descriptors is generated that allows prediction of properties for novel materials. There are several techniques to generate the model, ranging from linear and non-linear fitting to neural networks. Tools include the open source DeepChem [REW+ 19] and AutoQSAR [DDS+ 16] from the Schrödinger suite. Depending on the type of materials, benchmark data can be obtained using different codes available in the Schrödinger suite:

• small molecules and finite systems - Jaguar
• periodic systems - Quantum ESPRESSO
• larger polymeric and similar systems - Desmond

Different materials systems require different descriptors for featurization. For example, for crystalline periodic systems, we have implemented several sets of tailored descriptors. Generation of these descriptors again uses a mix of open source and Schrödinger proprietary tools. Specifically:

• elemental features such as atomic weight, number of valence electrons in s, p and d-shells, and electronegativity
• structural features such as density, volume per atom, and packing fraction descriptors implemented in the open source matminer package [WDF+ 18]
• intercalation descriptors such as cation and anion counts, crystal packing fraction, and average neighbor ionicity [SYC+ 17] implemented in the Schrödinger suite
• three-dimensional smooth overlap of atomic positions (SOAP) descriptors implemented in the open source DScribe package [HJM+ 20].

We are currently training models that use these descriptors to predict properties, such as bulk modulus, of a set of Li-containing battery related compounds [Cha]. Several models will be compared, such as kernel regression methods (as implemented in the open source scikit-learn code [PVG+ 11]) and AutoQSAR.

For isolated small molecules and extended non-periodic systems, RDKit can be used to generate a large number of atomic and molecular descriptors. A lot of effort has been devoted to ensure that RDKit can be used on a wide variety of materials that are supported by the Schrödinger suite. At the time of writing, the 4th most active contributor to RDKit is Ricardo Rodriguez-Schmidt from Schrödinger [RDK].

Recently, active learning (AL) combined with DFT has received much attention to address the challenge of leveraging exhaustive libraries in materials informatics [VPB21], [SPA+ 19]. On our side, we have implemented a workflow that employs AL for intelligent and iterative identification of promising materials candidates within a large dataset. In the framework of AL, the predicted value with associated uncertainty is considered to decide which materials to add in each iteration, aiming to improve the model performance in the next iteration (Figure 5).

Fig. 5: Active learning workflow for the design and discovery of novel optoelectronics molecules.

Since it could be important to consider multiple properties simultaneously in material discovery, multiple property optimization (MPO) has also been implemented as a part of the AL workflow [KAG+ 22]. MPO allows scaling and combining multiple properties into a single score. We employed the AL workflow to determine the top candidates for the hole (positively charged carrier) transport layer (HTL) by evaluating 550 molecules in 10 iterations using DFT calculations for a dataset of ~9,000 molecules [AKA+ 22]. The resulting model was validated by randomly picking a molecule from the dataset, computing properties with DFT and comparing those to the predicted values. According to the semi-classical Marcus equation [Mar93], high rates of hole transfer are inversely proportional to hole reorganization energies. Thus, MPO scores were computed based on minimizing the hole reorganization energy and targeting the oxidation potential to an appropriate level to ensure a low energy barrier for hole injection from the anode into the emissive layer. In this workflow, we used RDKit to compute descriptors for the chemical structures. These descriptors generated on the initial subset of structures are given as vectors to an algorithm based on the Random Forest Regressor as implemented in scikit-learn. Bayesian optimization is employed to tune the hyperparameters of the model. In each iteration, a trained model is applied for making predictions on the remaining materials in the dataset.
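A minimal sketch of this kind of forest-based prediction-with-uncertainty step is given below. The descriptor dimensions, data, and acquisition rule are placeholders, and the hyperparameter tuning (e.g., Bayesian optimization) used in the published workflow is omitted.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(2)
    X_labeled = rng.normal(size=(50, 16))   # descriptor vectors (e.g., from RDKit)
    y_labeled = rng.normal(size=50)         # e.g., DFT hole reorganization energies
    X_pool = rng.normal(size=(500, 16))     # remaining, unlabeled candidates

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_labeled, y_labeled)

    # The spread of the per-tree predictions gives a simple uncertainty estimate.
    per_tree = np.stack([tree.predict(X_pool) for tree in model.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

    # Select the next batch by balancing a low predicted value against a high
    # uncertainty (one of many possible acquisition rules).
    next_batch = np.argsort(mean - std)[:10]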
Figure 6 (A) displays MPO scores for the HTL dataset estimated by AL as a function of hole reorganization energies that are separately calculated for all the materials. This figure indicates that there are many materials in the dataset with the desired low hole reorganization energies that are nevertheless not suitable for the HTL due to their improper oxidation potentials, suggesting that MPO is important to evaluate the optoelectronic performance of the materials. Figure 6 (B) presents MPO scores of the materials used in the training dataset of AL, demonstrating that the feedback loop in the AL workflow efficiently guides the data collection as the size of the training set increases.

Fig. 6: A: MPO score of all materials in the HTL dataset. B: Those used in the training set as a function of the hole reorganization energy (λh).

To appreciate the computational efficiency of such an approach, it is worth noting that performing DFT calculations for all of the 9,000 molecules in the dataset would increase the computational cost by a factor of 15 versus the AL workflow. The AL approach can be useful in cases where the problem space is broad (like chemical space) but contains many clusters of similar items (similar molecules). In this case, benchmark data is only needed for a few representatives of each cluster. We are currently working on applying this approach to train models for predicting physical properties of soft materials (polymers).

Conclusions

We present several examples of how the Schrödinger Materials Suite integrates open source software packages. There is a wide range of applications in materials science that can benefit from already existing open source code. Where possible, we report issues to the package authors and submit improvements and bug fixes in the form of pull requests. We are thankful to all who have contributed to open source libraries, and have made it possible for us to develop a platform for accelerating innovation in materials and drug discovery. We will continue contributing to these projects and we hope to further give back to the scientific community by facilitating research in both academia and industry. We hope that this report will inspire other scientific companies to give back to the open source community in order to improve the computational materials field and make science more reproducible.

Acknowledgments

The authors acknowledge Bradley Dice and Wenduo Zhou for their valuable comments during the review of the manuscript.

References
[ABG+ 21] M. A. F. Afzal, A. R. Browning, A. Goldberg, M. D. Halls, J. L. Gavartin, T. Morisato, T. F. Hughes, D. J. Giesen, and J. E. Goose. High-throughput molecular dynamics simulations and validation of thermophysical properties of polymers for various applications. ACS Applied Polymer Materials, 3, 2021. doi:10.1021/acsapm.0c00524.
[AKA+ 22] H. Abroshan, H. S. Kwak, Y. An, C. Brown, A. Chandrasekaran, P. Winget, and M. D. Halls. Active learning accelerates design and optimization of hole-transporting materials for organic electronics. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800371.
[BAS+ 20] A. R. Browning, M. A. F. Afzal, J. Sanders, A. Goldberg, A. Chandrasekaran, and H. S. Kwak. Polyolefin molecular simulation for critical physical characteristics. International Polyolefins Conference, 2020.
[BDBF] D. Brunato, P. Delugas, G. Borghi, and A. Fonari. qeschema. URL: https://github.com/QEF/qeschema.
[BHH+ 13] A. D. Bochevarov, E. Harder, T. F. Hughes, J. R. Greenwood, D. A. Braden, D. M. Philipp, D. Rinaldo, M. D. Halls, J. Zhang, and R. A. Friesner. Jaguar: A high-performance quantum chemistry software program with strengths in life and materials sciences. International Journal of Quantum Chemistry, 113, 2013. doi:10.1002/qua.24481.
[Bic02] J. Bicerano. Prediction of Polymer Properties. CRC Press, 2002.
[Cha] A. Chandrasekaran. Active learning accelerated design of ionic materials. In progress.
[DDS+ 16] S. L. Dixon, J. Duan, E. Smith, C. D. Von Bargen, W. Sherman, and M. P. Repasky. AutoQSAR: An automated machine learning tool for best-practice quantitative structure-activity relationship modeling. Future Medicinal Chemistry, 8, 2016. doi:10.4155/fmc-2016-0093.
[GAB+ 17] P. Giannozzi et al. Advanced capabilities for materials modelling with Quantum ESPRESSO. Journal of Physics: Condensed Matter, 29, 2017. URL: https://www.quantum-espresso.org/, doi:10.1088/1361-648X/aa8f79.
[GBLW16] C. R. Groom, I. J. Bruno, M. P. Lightfoot, and S. C. Ward. The Cambridge Structural Database. Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials, 72, 2016. doi:10.1107/S2052520616003954.
[HF08] G. L. W. Hart and R. W. Forcade. Algorithm for generating derivative structures. Physical Review B, 77, 2008. URL: https://github.com/msg-byu/enumlib/, doi:10.1103/PhysRevB.77.224115.
[HJM+ 20] L. Himanen, M. O. J. Jager, E. V. Morooka, F. Federici Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, and A. S. Foster. DScribe: Library of descriptors for machine learning in materials science. Computer Physics Communications, 247, 2020. URL: https://singroup.github.io/dscribe/latest/, doi:10.1016/j.cpc.2019.106949.
[HM24] O. Hassel and H. Mark. The crystal structure of graphite. Physik. Z., 25:317–337, 1924.
[HMvdW+ 20] C. R. Harris, K. J. Millman, S. J. van der Walt, et al. Array programming with NumPy, 2020. URL: https://numpy.org/, doi:10.1038/s41586-020-2649-2.
[HZU+ 20] S. P. Huber, S. Zoupanos, M. Uhrin, et al. AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance. Scientific Data, 7, 2020. URL: https://www.aiida.net/, doi:10.1038/s41597-020-00638-4.
[JPS92] J. Jansen, R. Peschar, and H. Schenk. Determination of accurate intensities from powder diffraction data. I. Whole-pattern fitting with a least-squares procedure. Journal of Applied Crystallography, 25, 1992. doi:10.1107/S0021889891012104.
[KAG+ 22] H. S. Kwak, Y. An, D. J. Giesen, T. F. Hughes, C. T. Brown, K. Leswing, H. Abroshan, and M. D. Halls. Design of organic electronic materials with a goal-directed generative model powered by deep neural networks and high-throughput molecular simulations. Frontiers in Chemistry, 9, 2022. doi:10.3389/fchem.2021.800370.
[KBD+ 21] J. A. Kaduk, S. J. L. Billinge, R. E. Dinnebier, N. Henderson, I. Madsen, R. Černý, M. Leoni, L. Lutterotti, S. Thakral, and D. Chateigner. Powder diffraction. Nature Reviews Methods Primers, 1:77, 2021. doi:10.1038/s43586-021-00074-7.
[LLC22] Schrödinger LLC. Schrödinger Release 2022-2: Materials Science Suite, 2022. URL: https://www.schrodinger.com/platform/materials-science.
[LMB+ 17] A. Hjorth Larsen, J. J. Mortensen, J. Blomqvist, et al. The atomic simulation environment - a Python library for working with atoms, 2017. URL: https://wiki.fysik.dtu.dk/ase/, doi:10.1088/1361-648X/aa680e.
[LTK+ 22] G. Landrum, P. Tosco, B. Kelley, et al. rdkit, June 2022. URL: https://rdkit.org/, doi:10.5281/ZENODO.6605135.
[Mar93] R. A. Marcus. Electron transfer reactions in chemistry. Theory and experiment. Reviews of Modern Physics, 65, 1993. doi:10.1103/RevModPhys.65.599.
[MAS+ 20] N. N. Matsuzawa, H. Arai, M. Sasago, E. Fujii, A. Goldberg, T. J. Mustard, H. S. Kwak, D. J. Giesen, F. Ranalli, and M. D. Halls. Massive theoretical screen of hole conducting organic materials in the heteroacene family by using a cloud-computing environment. Journal of Physical Chemistry A, 124, 2020. doi:10.1021/acs.jpca.9b10998.
[MGF+ 19] T. Mustard, J. Gavartin, A. Fonari, C. Krauter, A. Goldberg, H. Kwak, T. Morisato, S. Pandiyan, and M. Halls. Surface reactivity and stability of core-shell solid catalysts from ab initio combinatorial calculations. Volume 258, 2019.
[NSA+ 16] M. Newville, T. Stensitzki, D. B. Allen, M. Rawlik, A. Ingargiola, and A. Nelson. Lmfit: Non-linear least-square minimization and curve-fitting for Python. Astrophysics Source Code Library, ascl–1606, 2016. URL: https://lmfit.github.io/lmfit-py/.
[OBJ+ 11] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3, 2011. URL: https://openbabel.org/, doi:10.1186/1758-2946-3-33.
[ORJ+ 13] S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier, K. A. Persson, and G. Ceder. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68, 2013. URL: https://pymatgen.org/, doi:10.1016/j.commatsci.2012.10.028.
[PKD18] P. N. Patrone, A. J. Kearsley, and A. M. Dienstfrey. The role of data analysis in uncertainty quantification: Case studies for materials modeling. 2018. doi:10.2514/6.2018-0927.
[PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011. URL: https://scikit-learn.org/.
[RDK] RDKit contributors. URL: https://github.com/rdkit/rdkit/graphs/contributors.
[REW+ 19] B. Ramsundar, P. Eastman, P. Walters, V. Pande, K. Leswing, and Z. Wu. Deep Learning for the Life Sciences. O'Reilly Media, 2019.
[SGB+ 14] D. E. Shaw et al. Anton 2: Raising the bar for performance and programmability in a special-purpose molecular dynamics supercomputer. Volume 2015-January, 2014. doi:10.1109/SC.2014.9.
[SPA+ 19] G. R. Schleder, A. C. M. Padilha, C. Mera Acosta, M. Costa, and A. Fazzio. From DFT to machine learning: Recent approaches to materials science - a review. JPhys Materials, 2, 2019. doi:10.1088/2515-7639/ab084b.
[SYC+ 17] A. D. Sendek, Q. Yang, E. D. Cubuk, K.-A. N. Duerloo, Y. Cui, and E. J. Reed. Holistic computational structure screening of more than 12000 candidates for solid lithium-ion conductor materials. Energy and Environmental Science, 10:306–320, 2017. doi:10.1039/c6ee02697d.
[VGO+ 20] P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17, 2020. doi:10.1038/s41592-019-0686-2.
[VPB21] R. Vasudevan, G. Pilania, and P. V. Balachandran. Machine learning for materials design and discovery. Journal of Applied Physics, 129, 2021. doi:10.1063/5.0043300.
[WDF+ 18] L. Ward, A. Dunn, A. Faghaninia, et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 2018. URL: https://hackingmaterials.lbl.gov/matminer/, doi:10.1016/j.commatsci.2018.05.018.
[WF05] J. D. Westbrook and P. M. D. Fitzgerald. The PDB format, mmCIF formats, and other data formats, 2005. doi:10.1002/0471721204.ch8.
[WKB+ 22] P. Winget, H. S. Kwak, C. T. Brown, A. Fonari, K. Tran, A. Goldberg, A. R. Browning, and M. D. Halls. Organic thin films for OLED applications: Influence of molecular structure, deposition method, and deposition conditions. International Conference on the Science and Technology of Synthetic Metals, 2022.
A Novel Pipeline for Cell Instance Segmentation, Tracking and Motility Classification of Toxoplasma Gondii in 3D Space

Seyed Alireza Vaezi‡∗, Gianni Orlando‡, Mojtaba Fazli§, Gary Ward¶, Silvia Moreno‡, Shannon Quinn‡

∗ Corresponding author: sv22900@uga.edu
‡ University of Georgia
§ Harvard University
¶ University of Vermont

Copyright © 2022 Seyed Alireza Vaezi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Toxoplasma gondii is the parasitic protozoan that causes disseminated toxoplasmosis, a disease that is estimated to infect around one-third of the world’s population. While the disease is commonly asymptomatic, the success of the parasite is in large part due to its ability to easily spread through nucleated cells. The virulence of T. gondii is predicated on the parasite’s motility. Thus the inspection of motility patterns during its lytic cycle has become a topic of keen interest. Current cell tracking projects usually focus on cell images captured in 2D, which are not a true representation of the actual motion of a cell. Current 3D tracking projects lack a comprehensive pipeline covering all phases of preprocessing, cell detection, cell instance segmentation, tracking, and motion classification, and merely implement a subset of the phases. Moreover, current 3D segmentation and tracking pipelines are not targeted for users with less experience in deep learning packages. Our pipeline, TSeg, on the other hand, is developed for segmenting, tracking, and classifying the motility phenotypes of T. gondii in 3D microscopic images. Although TSeg is built initially focusing on T. gondii, it provides generic functions to allow users with similar but distinct applications to use it off-the-shelf. Interacting with all of TSeg’s modules is possible through our Napari plugin, which is developed mainly off the familiar SciPy scientific stack. Additionally, our plugin is designed with a user-friendly GUI in Napari, which adds several benefits to each step of the pipeline such as visualization and representation in 3D. TSeg proves to fulfill a better generalization, making it capable of delivering accurate results with images of other cell types.
Introduction

Quantitative cell research often requires the measurement of different cell properties including size, shape, and motility. This step is facilitated using segmentation of imaged cells. With fluorescent markers, computational tools can be used to complete segmentation and identify cell features and positions over time. 2D measurements of cells can be useful, but the more difficult task of deriving 3D information from cell images is vital for metrics such as motility and volumetric qualities.

Toxoplasmosis is an infection caused by the intracellular parasite Toxoplasma gondii. T. gondii is one of the most successful parasites, infecting at least one-third of the world’s population. Although Toxoplasmosis is generally benign in healthy individuals, the infection has fatal implications in fetuses and immunocompromised individuals [SG12]. T. gondii’s virulence is directly linked to its lytic cycle, which is comprised of invasion, replication, egress, and motility. Studying the motility of T. gondii is crucial in understanding its lytic cycle in order to develop potential treatments.

For this reason, we present a novel pipeline to detect, segment, track, and classify the motility pattern of T. gondii in 3D space. One of the main goals is to make our pipeline intuitively easy to use so that users who are not experienced in the fields of machine learning (ML), deep learning (DL), or computer vision (CV) can still benefit from it. The other objective is to equip it with the most robust and accurate set of segmentation and detection tools so that the end product has a broad generalization, allowing it to perform well and accurately for various cell types right off the shelf.

PlantSeg uses a variant of 3D U-Net, called Residual 3D U-Net, for preprocessing and segmentation of multiple cell types [WCV+ 20]. PlantSeg performs best among deep learning algorithms for 3D instance segmentation and is very robust against image noise [KPR+ 21]. The segmentation module also includes the optional use of CellPose [SWMP21]. CellPose is a generalized segmentation algorithm trained on a wide range of cell types and is the first step toward increased optionality in TSeg. The Cell Tracking module consolidates the cell particles across the z-axis to materialize cells in 3D space and estimates centroids for each cell. The tracking module is also responsible for extracting the trajectories of cells based on the movements of centroids throughout consecutive video frames, which is eventually the input of the motion classifier module.
Most of the state-of-the-art pipelines are restricted to 2D space, which is not a true representative of the actual motion of the organism. Many of them require knowledge and expertise in programming, or in machine learning and deep learning models and frameworks, thus limiting the demographic of users that can use them. All of them solely include a subset of the aforementioned modules (i.e. detection, segmentation, tracking, and classification) [SWMP21]. Many pipelines rely on the user to train their own model, hand-tailored for their specific application. This demands high levels of experience and skill in ML/DL and consequently undermines the possibility and feasibility of quickly utilizing an off-the-shelf pipeline and still getting good results.

To address these we present TSeg. It segments T. gondii cells in 3D microscopic images, tracks their trajectories, and classifies the motion patterns observed throughout the 3D frames. TSeg is comprised of four modules: pre-processing, segmentation, tracking, and classification. We developed TSeg as a plugin for Napari [SLE+ 22] - an open-source, fast, and interactive image viewer for Python designed for browsing, annotating, and analyzing large multi-dimensional images. Having TSeg implemented as a part of Napari not only provides a user-friendly design but also gives more advanced users the possibility to attach and execute their custom code and even interact with the steps of the pipeline if needed. The preprocessing module is equipped with basic and extra filters and functionalities to aid in the preparation of the input data. TSeg gives its users the advantage of utilizing the functionalities that PlantSeg and CellPose provide. These functionalities can be chosen in the pre-processing, detection, and segmentation steps. This brings forth a huge variety of algorithms and pre-built models to select from, making TSeg not only a great fit for T. gondii, but also for a variety of different cell types.

The rest of this paper is structured as follows: After briefly reviewing the literature in Related Work, we move on to thoroughly describe the details of our work in the Method section. Following that, the Results section depicts the results of comprehensive tests of our plugin on T. gondii cells.

Related Work

The recent solutions in generalized and automated segmentation tools are focused on 2D cell images. Segmentation of cellular structures in 2D is important but not representative of realistic environments. Microbiological organisms are free to move on the z-axis, and tracking without taking this factor into account cannot guarantee a full representation of the actual motility patterns. As an example, Fazli et al. [FVMQ18] identified three distinct motility types for T. gondii with two-dimensional data; however, they also acknowledge, based on established heuristics from previous works, that there are more than three motility phenotypes for T. gondii.
The focus on 2D research is understandable due to several factors. 3D data is difficult to capture, as tools for capturing 3D slices and the computational requirements for analyzing this data are not available in most research labs. Most segmentation tools are unable to track objects in 3D space, as the assignment of related centroids is more difficult. The additional noise from capture and focus increases the probability of incorrect assignment. 3D data also has issues with overlapping features and increased computation required per frame of time.

Fazli et al. [FVMQ18] studies the motility patterns of T. gondii and provides a computational pipeline for identifying motility phenotypes of T. gondii in an unsupervised, data-driven way. In that work Ca2+ is added to T. gondii cells inside a Fetal Bovine Serum. T. gondii cells react to Ca2+ and become motile and fluorescent. The images of motile T. gondii cells were captured using an LSM 710 confocal microscope. They use Python 3 and associated scientific computing libraries (NumPy, SciPy, scikit-learn, matplotlib) in their pipeline to track and cluster the trajectories of T. gondii. Based on this work, Fazli et al. [FVM+ 18] work on another pipeline consisting of preprocessing, sparsification, cell detection, and cell tracking modules to track T. gondii in 3D video microscopy, where each frame of the video consists of image slices taken 1 micro-meter of focal depth apart along the z-axis direction. In their latest work, Fazli et al. [FSA+ 19] developed a lightweight and scalable pipeline using task distribution and parallelism. Their pipeline consists of multiple modules: preprocessing, sparsification, cell detection, cell tracking, trajectory extraction, parametrization of the trajectories, and clustering. They could classify three distinct motion patterns in T. gondii using the same data from their previous work.

Fig. 1: The overview of TSeg's architecture.

While combining open source tools is not a novel architecture, little has been done to integrate 3D cell tracking tools. Fazeli et al. [FRF+ 20], motivated by the same interest in providing better tools to non-software professionals, created a 2D cell tracking pipeline. This pipeline combines Stardist [WSH+ 20] and TrackMate [TPS+ 17] for automated cell tracking. This pipeline begins with the user loading cell images and centroid approximations to the ZeroCostDL4Mic [vCLJ+ 21] platform. ZeroCostDL4Mic is a deep learning training tool for those with no coding expertise. Once the platform is trained and masks for the training set are made for hand-drawn annotations, the training set can be input to Stardist. Stardist performs automated object detection using Euclidean distance to probabilistically determine cell pixels versus background pixels. Lastly, TrackMate uses segmentation images to track labels between timeframes and display analytics.

This Stardist pipeline is similar in concept to TSeg. Both create an automated segmentation and tracking pipeline, but TSeg is oriented to 3D data. Cells move in 3-dimensional space that is not represented in a flat plane. TSeg also does not require the manual training necessary for the other pipeline. Individuals with low technical expertise should not be expected to create masks for training or even understand the training of deep neural networks. Lastly, this pipeline does not account for imperfect datasets without the need for preprocessing. All implemented algorithms in TSeg account for microscopy images with some amount of noise.

Wen et al. [WMV+ 21] combines multiple existing new technologies including deep learning and presents 3DeeCellTracker. 3DeeCellTracker segments and tracks cells on 3D time-lapse images. Using a small subset of their dataset they train the deep learning architecture 3D U-Net for segmentation. For tracking, a combination of two strategies was used to increase accuracy: local cell region strategies, and spatial pattern strategy. Kapoor et al. [KC21] presents VollSeg, which uses deep learning methods to segment, track, and analyze cells in 3D with irregular shape and intensity distribution. It is a Jupyter Notebook-based Python package and also has a UI in Napari. For tracking, a custom tracking code is developed based on TrackMate.

Many segmentation tools require some amount of knowledge in Machine or Deep Learning concepts. Training the neural network in creating masks is a common step for open-source segmentation tools. Automating this process makes the pipeline more accessible to microbiology researchers.

Method

Data

Our dataset consists of 11 videos of T. gondii cells under a microscope, obtained from different experiments with different numbers of cells. The videos are on average around 63 frames in length. Each frame has a stack of 41 image slices of size 500×502 pixels along the z-axis (z-slices). The z-slices are captured 1µm apart in optical focal length, making them 402µm×401µm×40µm in volume. The slices were recorded in raw format as RGB TIF images but are converted to grayscale for our purpose. This data is captured using a PlanApo 20x objective (NA = 0.75) on a preheated Nikon Eclipse TE300 epifluorescence microscope. The image stacks were captured using an iXon 885 EMCCD camera (Andor Technology, Belfast, Ireland) cooled to -70°C and driven by NIS Elements software (Nikon Instruments, Melville, NY) as part of related research by Ward et al. [LRK+ 14]. The camera was set to frame transfer sensor mode, with a vertical pixel shift speed of 1.0 µs, vertical clock voltage amplitude of +1, readout speed of 35 MHz, conversion gain of 3.8×, EM gain setting of 3, and 2×2 binning, and the z-slices were imaged with an exposure time of 16 ms.

Software

Napari Plugin: TSeg is developed as a plugin for Napari - a fast and interactive multi-dimensional image viewer for Python that allows volumetric viewing of 3D images [SLE+ 22]. Plugins enable developers to customize and extend the functionality of Napari. For every module of TSeg, we developed its corresponding widget in the GUI, plus a widget for file management. The widgets have self-explanatory interface elements with tooltips to guide the inexperienced user to traverse through the pipeline with ease. Layers in Napari are the basic viewable objects that can be shown in the Napari viewer. Seven different layer types are supported in Napari: Image, Labels, Points, Shapes, Surface, Tracks, and Vectors, each of which corresponds to a different data type, visualization, and interactivity [SLE+ 22]. After its execution, the viewable output of each widget gets added to the layers. This allows the user to evaluate and modify the parameters of the widget to get the best results before continuing to the next widget. Napari supports bidirectional communication between the viewer and the Python kernel and has a built-in console that allows users to control all the features of the viewer programmatically. This adds more flexibility and customizability to TSeg for the advanced user. The full code of TSeg is available on GitHub under the MIT open source license at https://github.com/salirezav/tseg. TSeg can be installed through Napari's plugins menu.
Computational Pipeline

Pre-Processing: Due to the fast imaging speed in data acquisition, the image slices will inherently have a vignetting artifact, meaning that the corners of the images will be slightly darker than the center of the image. To eliminate this artifact we added adaptive thresholding and logarithmic correction to the pre-processing module. Furthermore, another prevalent artifact in our dataset images was film-grain noise (also known as salt-and-pepper noise). To remove or reduce such noise, a simple gaussian blur filter and a sharpening filter are included.
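The snippet below is a rough sketch of these filters using scikit-image and NumPy; it is not TSeg's actual implementation, and the specific functions and parameter values are assumptions chosen only to illustrate the steps named above.

    import numpy as np
    from skimage import exposure, filters

    # One grayscale z-slice (placeholder array; TSeg reads TIF stacks).
    slice_2d = np.random.default_rng(3).random((500, 502))

    log_corrected = exposure.adjust_log(slice_2d, gain=1.0)       # lift dark corners
    local_thresh = filters.threshold_local(log_corrected, block_size=51)
    foreground = log_corrected > local_thresh                     # adaptive threshold

    denoised = filters.gaussian(log_corrected, sigma=1.0)         # soften film-grain noise
    sharpened = filters.unsharp_mask(denoised, radius=2, amount=1.0)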
Cell Detection and Segmentation: TSeg's Detection and Segmentation modules are in fact backed by PlantSeg and CellPose. The Detection Module is built only based on PlantSeg's CNN Detection Module [WCV+ 20], and for the Segmentation Module, only one of the three tools can be selected to be executed as the segmentation tool in the pipeline. Naturally, each of the tools demands specific interface elements different from the others, since each accepts different input values and various parameters. TSeg orchestrates this and makes sure the arguments and parameters are passed to the corresponding selected segmentation tool properly and the execution will be handled accordingly. The parameters include but are not limited to the input data location, output directory, and desired segmentation algorithm. This allows the end-user complete control over the process and feedback from each step of the process. The preprocessed images and relevant parameters are sent to a modular segmentation controller script. As an effort to allow future development on TSeg, the segmentation controller script shows how the pipeline integrates two completely different segmentation packages. While both PlantSeg and CellPose use conda environments, PlantSeg requires modification of a YAML file for initialization, while CellPose initializes directly from command line parameters. In order to implement PlantSeg, TSeg generates a YAML file based on GUI input elements. After parameters are aligned, the conda environment for the chosen segmentation algorithm is opened in a subprocess. The $CONDA_PREFIX environment variable allows the bash command to start conda and context switch to the correct segmentation environment.
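A minimal sketch of this hand-off is shown below. The config keys, environment name, and command line are hypothetical placeholders (TSeg's controller script and the segmentation tool's own config schema are authoritative); the sketch only illustrates writing a YAML file from GUI-collected parameters and launching the tool in its own conda environment from a subprocess.

    import os
    import subprocess
    import yaml

    # Hypothetical parameters gathered from the GUI widgets.
    config = {
        "path": "/data/preprocessed",                       # input location (placeholder)
        "segmentation": {"output_dir": "/data/segmented"},  # placeholder key names
    }

    with open("plantseg_config.yaml", "w") as fh:
        yaml.safe_dump(config, fh)

    # TSeg consults $CONDA_PREFIX to locate conda; here we simply rely on
    # "conda run" to execute the selected tool inside its environment.
    print(os.environ.get("CONDA_PREFIX", ""))
    subprocess.run(
        ["conda", "run", "-n", "plant-seg",
         "plantseg", "--config", "plantseg_config.yaml"],
        check=True,
    )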
Tracking: Features in each segmented image are found using the scipy label function. In order to reduce any leftover noise, any features under a minimum size are filtered out. After feature extraction, centroids are calculated using the center of mass function in scipy. The centroid of the 3D cell can be used as a representation of the entire body during tracking. The tracking algorithm goes through each captured time instance and connects centroids to the likely next movement of the cell. Tracking involves a series of measures in order to avoid incorrect assignments. An incorrect assignment could lead to inaccurate result sets and unrealistic motility patterns. If the same number of features in each frame of time could be guaranteed from segmentation, minimum distance could assign features rather accurately. Since this is not a guarantee, the Hungarian algorithm must be used to associate a cost with the assignment of feature tracking. The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. The cost for the tracking algorithm determines which feature is the next iteration of the cell's tracking through the complete time series; it combines the distance between centroids for all previous points and the distance to the potential new centroid. If an optimal next centroid can't be found within an acceptable distance of the current point, the tracking for the cell is considered complete. Likewise, if a feature is not assigned to a current centroid, this feature is considered a new object and is tracked as the algorithm progresses. The complete path for each feature is then stored for motility analysis.
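The following is a condensed sketch of those steps with SciPy (labeling, size filtering, center-of-mass centroids, and Hungarian matching via linear_sum_assignment); it is not TSeg's exact code, and the min_size and max_dist values are illustrative assumptions.

    import numpy as np
    from scipy import ndimage
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def centroids(volume, min_size=20):
        # Label connected features in a binary 3D segmentation and keep the
        # ones above a minimum size, returning their centers of mass.
        labels, n = ndimage.label(volume)
        sizes = ndimage.sum(volume, labels, index=range(1, n + 1))
        keep = [i + 1 for i, s in enumerate(sizes) if s >= min_size]
        return np.array(ndimage.center_of_mass(volume, labels, keep))

    def match(prev_pts, next_pts, max_dist=15.0):
        # Hungarian assignment of centroids between consecutive frames,
        # using Euclidean distance as the cost.
        cost = cdist(prev_pts, next_pts)
        rows, cols = linear_sum_assignment(cost)
        # Pairs farther apart than max_dist end a track / start a new object.
        return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]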
Motion Classification: To classify the motility pattern of T. gondii in 3D space in an unsupervised fashion, we implement and use the method that Fazli et al. introduced [FSA+ 19]. In that work, they used an autoregressive model (AR); a linear dynamical system that encodes a Markov-based transition prediction method. The reason is that although K-means is a favorable clustering algorithm, there are a few drawbacks to it and to the conventional methods that render them impractical. Firstly, K-means assumes Euclidean distance, but AR motion parameters are geodesics that do not reside in a Euclidean space. Secondly, K-means assumes isotropic clusters; however, although AR motion parameters may exhibit isotropy in their space, without a proper distance metric this issue cannot be clearly examined [FSA+ 19].

Conclusion and Discussion

TSeg is an easy to use pipeline designed to study the motility patterns of T. gondii in 3D space. It is developed as a plugin for Napari and is equipped with a variety of deep learning based segmentation tools borrowed from PlantSeg and CellPose, making it a suitable off-the-shelf tool for applications incorporating images of cell types not limited to T. gondii. Future work on TSeg includes the expansion of implemented algorithms and tools in its preprocessing, segmentation, tracking, and clustering modules.

References

[FRF+ 20] E. Fazeli, N. H. Roy, G. Follain, R. F. Laine, L. von Chamier, P. E. Hänninen, J. E. Eriksson, J.-Y. Tinevez, and G. Jacquemet. Automated cell tracking using StarDist and TrackMate. F1000Research, 9, 2020. doi:10.12688/f1000research.27019.1.
[FSA+ 19] M. S. Fazli, R. V. Stadler, B. Alaila, S. A. Vella, S. N. J. Moreno, G. E. Ward, and S. Quinn. Lightweight and scalable particle tracking and motion clustering of 3D cell trajectories. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 412–421. IEEE, 2019. doi:10.1109/dsaa.2019.00056.
[FVM+ 18] M. S. Fazli, S. A. Vella, S. N. J. Moreno, G. E. Ward, and S. P. Quinn. Toward simple & scalable 3D cell tracking. In 2018 IEEE International Conference on Big Data (Big Data), pages 3217–3225. IEEE, 2018. doi:10.1109/BigData.2018.8622403.
[FVMQ18] M. S. Fazli, S. A. Vella, S. N. J. Moreno, and S. Quinn. Unsupervised discovery of Toxoplasma gondii motility phenotypes. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 981–984. IEEE, 2018. doi:10.1109/isbi.2018.8363735.
[KC21] V. Kapoor and C. Carabaña. Cell tracking in 3D using deep learning segmentations. In Python in Science Conference, pages 154–161, 2021. doi:10.25080/majora-1b6fd038-014.
[KPR+ 21] A. Kar, M. Petit, Y. Refahi, G. Cerutti, C. Godin, and J. Traas. Assessment of deep learning algorithms for 3D instance segmentation of confocal image datasets. bioRxiv, 2021. doi:10.1101/2021.06.09.447748.
[LRK+ 14] J. Leung, M. Rould, C. Konradt, C. Hunter, and G. Ward. Disruption of TgPhIL1 alters specific parameters of Toxoplasma gondii motility measured in a quantitative, three-dimensional live motility assay. PLoS ONE, 9:e85763, 2014. doi:10.1371/journal.pone.0085763.
[SG12] G. Saadatnia and M. Golkar. A review on human toxoplasmosis. Scandinavian Journal of Infectious Diseases, 44(11):805–814, 2012. doi:10.3109/00365548.2012.693197.
[SLE+ 22] N. Sofroniew, T. Lambert, K. Evans, J. Nunez-Iglesias, et al. napari: a multi-dimensional image viewer for Python, May 2022. doi:10.5281/zenodo.6598542.
[SWMP21] C. Stringer, T. Wang, M. Michaelos, and M. Pachitariu. Cellpose: a generalist algorithm for cellular segmentation. Nature Methods, 18(1):100–106, 2021. doi:10.1101/2020.02.02.931238.
[TPS+ 17] J.-Y. Tinevez, N. Perry, J. Schindelin, G. M. Hoopes, G. D. Reynolds, E. Laplantine, S. Y. Bednarek, S. L. Shorte, and K. W. Eliceiri. TrackMate: An open and extensible platform for single-particle tracking. Methods, 115:80–90, 2017. doi:10.1016/j.ymeth.2016.09.016.
[vCLJ+ 21] L. von Chamier, R. F. Laine, J. Jukkala, C. Spahn, D. Krentzel, E. Nehme, M. Lerche, S. Hernández-Pérez, P. K. Mattila, E. Karinou, et al. Democratising deep learning for microscopy with ZeroCostDL4Mic. Nature Communications, 12(1):1–18, 2021. doi:10.1038/s41467-021-22518-0.
[WCV+ 20] A. Wolny, L. Cerrone, A. Vijayan, R. Tofanelli, et al. Accurate and versatile 3D segmentation of plant tissues at cellular resolution. eLife, 9:e57613, 2020. doi:10.7554/eLife.57613.
[WMV+ 21] C. Wen, T. Miura, V. Voleti, K. Yamaguchi, et al. 3DeeCellTracker, a deep learning-based pipeline for segmenting and tracking cells in 3D time lapse images. eLife, 10, 2021. doi:10.7554/eLife.59187.
[WSH+ 20] M. Weigert, U. Schmidt, R. Haase, K. Sugawara, and G. Myers. Star-convex polyhedra for 3D object detection and segmentation in microscopy. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2020. doi:10.1109/wacv45572.2020.9093435.
The myth of the normal curve and what to do about it

Allan Campopiano∗

∗ Corresponding author: allan@deepnote.com

Copyright © 2022 Allan Campopiano. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Index Terms—Python, R, robust statistics, bootstrapping, trimmed mean, data science, hypothesis testing

Reliance on the normal curve as a tool for measurement is almost a given. It shapes our grading systems, our measures of intelligence, and importantly, it forms the mathematical backbone of many of our inferential statistical tests and algorithms. Some even call it "God's curve" for its supposed presence in nature [Mic89].

Scientific fields that deal in explanatory and predictive statistics make particular use of the normal curve, often using it to conveniently define thresholds beyond which a result is considered statistically significant (e.g., t-test, F-test). Even familiar machine learning models have, buried in their guts, an assumption of the normal curve (e.g., LDA, gaussian naive Bayes, logistic & linear regression).

The normal curve has had a grip on us for some time; the aphorism by Cramer [Cra46] still rings true for many today:

"Everyone believes in the [normal] law of errors, the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact."

Many students of statistics learn that N=40 is enough to ignore the violation of the assumption of normality. This belief stems from early research showing that the sampling distribution of the mean quickly approaches normal, even when drawing from non-normal distributions—as long as samples are sufficiently large. It is common to demonstrate this result by sampling from uniform and exponential distributions. Since these look nothing like the normal curve, it was assumed that N=40 must be enough to avoid practical issues when sampling from other types of non-normal distributions [Wil13]. (Others reached similar conclusions with different methodology [Gle93].)

Two practical issues have since been identified based on this early research: (1) the distributions under study were light tailed (they did not produce outliers), and (2) statistics other than the sample mean were not tested and may behave differently. In the half century following these early findings, many important discoveries have been made—calling into question the usefulness of the normal curve [Wil13].

The following sections uncover various pitfalls one might encounter when assuming normality—especially as they relate to hypothesis testing. To help researchers overcome these problems, a new Python library for robust hypothesis testing will be introduced along with an interactive tool for robust statistics education.

Fig. 1: Standard normal (orange) and contaminated normal (blue). The variance of the contaminated curve is more than 10 times that of the standard normal curve. This can cause serious issues with statistical power when using traditional hypothesis testing methods.

The contaminated normal

One of the most striking counterexamples of "N=40 is enough" is shown when sampling from the so-called contaminated normal [Tuk60][Tan82]. This distribution is also bell shaped and symmetrical, but it has slightly heavier tails when compared to the standard normal curve. That is, it contains outliers and is difficult to distinguish from a normal distribution with the naked eye. Consider the distributions in Figure 1. The variance of the normal distribution is 1 but the variance of the contaminated normal is 10.9!
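For readers who want to generate this kind of distribution themselves, a mixture of N(0, 1) and N(0, 10²) with 10% contamination is consistent with the 10.9 variance quoted above (the exact parameters behind the figures are an assumption here):

    import numpy as np

    rng = np.random.default_rng(0)

    def contaminated_normal(n, p=0.1, scale=10.0):
        # Mixture of N(0, 1) and N(0, scale**2): still bell shaped, but heavy tailed.
        outliers = rng.random(n) < p
        return np.where(outliers, rng.normal(0, scale, n), rng.normal(0, 1, n))

    x = contaminated_normal(1_000_000)
    print(x.var())  # close to 0.9 * 1 + 0.1 * 100 = 10.9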
The consequence of this inflated variance is apparent when examining statistical power. To demonstrate, Figure 2 shows two pairs of distributions: on the left, there are two normal distributions (variance 1) and on the right there are two contaminated distributions (variance 10.9). Both pairs of distributions have a mean difference of 0.8. Wilcox [Wil13] showed that by taking random samples of N=40 from each normal curve, and comparing them with Student's t-test, statistical power was approximately 0.94. However, when following this same procedure for the contaminated groups, statistical power was only 0.25.

Fig. 2: Two normal curves (left) and two contaminated normal curves (right). Despite the obvious effect sizes (∆ = 0.8 for both pairs) as well as the visual similarities of the distributions, power is only ~0.25 under contamination; however, power is ~0.94 under normality (using Student's t-test).

The point here is that even small apparent departures from normality, especially in the tails, can have a large impact on commonly used statistics. The problems continue to get worse when examining effect sizes, but these findings are not discussed in this article. Interested readers should see Wilcox's 1992 paper [Wil92].
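A small simulation along these lines can be written directly with SciPy; this sketch assumes the same 10% contamination model as above, and the exact power estimates will vary slightly with the number of replications:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def contaminated(n, p=0.1, scale=10.0):
        out = rng.random(n) < p
        return np.where(out, rng.normal(0, scale, n), rng.normal(0, 1, n))

    def power(sampler, delta=0.8, n=40, reps=5000, alpha=0.05):
        # Proportion of Student t-tests that reject when the true shift is delta.
        hits = sum(stats.ttest_ind(sampler(n), sampler(n) + delta).pvalue < alpha
                   for _ in range(reps))
        return hits / reps

    print(power(lambda n: rng.normal(0, 1, n)))  # roughly 0.94 under normality
    print(power(contaminated))                   # much lower under contamination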
Perhaps one could argue that the contaminated normal distribution actually represents an extreme departure from normality and therefore should not be taken seriously; however, distributions that generate outliers are likely common in practice [HD82][Mic89][Wil09]. A reasonable goal would then be to choose methods that perform well under such situations and continue to perform well under normality. In addition, serious issues still exist even when examining light-tailed and skewed distributions (e.g., lognormal), and statistics other than the sample mean (e.g., T). These findings will be discussed in the following section.

Student's t-distribution

Another common statistic is the T value obtained from Student's t-test. As will be demonstrated, T is more sensitive to violations of normality than the sample mean (which has already been shown to not be robust). This is despite the fact that the t-distribution is also bell shaped, light tailed, and symmetrical—a close relative of the normal curve.

The assumption is that T follows a t-distribution (and with large samples it approaches normality). We can test this assumption by generating random samples from a lognormal distribution. Specifically, 5000 datasets of sample size 20 were randomly drawn from a lognormal distribution using SciPy's lognorm.rvs function. For each dataset, T was calculated and the resulting t-distribution was plotted. Figure 3 shows that the assumption that T follows a t-distribution does not hold.

Fig. 3: Actual t-distribution (orange) and assumed t-distribution (blue). When simulating a t-distribution based on a lognormal curve, T does not follow the assumed shape. This can cause poor probability coverage and increased Type I Error when using traditional hypothesis testing approaches.

With N=20, the assumption is that with a probability of 0.95, T will be between -2.09 and 2.09. However, when sampling from a lognormal distribution in the manner just described, there is actually a 0.95 probability that T will be between approximately -4.2 and 1.4 (i.e., the middle 95% of the actual t-distribution is much wider than the assumed t-distribution). Based on this result we can conclude that sampling from skewed distributions (e.g., lognormal) leads to increased Type I Error when using Student's t-test [Wil98].

"Surely the hallowed bell-shaped curve has cracked from top to bottom. Perhaps, like the Liberty Bell, it should be enshrined somewhere as a memorial to more heroic days — Earnest Ernest, Philadelphia Inquirer. 10 November 1974. [FG81]"
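This simulation is easy to reproduce. The sketch below assumes a one-sample T computed against the known lognormal mean, which is one way to obtain quantiles near the values quoted above; the article's exact simulation code may differ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps = 20, 5000

    pop_mean = np.exp(0.5)          # mean of a lognormal with s=1, scale=1
    t_vals = np.empty(reps)
    for i in range(reps):
        x = stats.lognorm.rvs(s=1, size=n, random_state=rng)
        t_vals[i] = (x.mean() - pop_mean) / (x.std(ddof=1) / np.sqrt(n))

    # Middle 95% of the simulated T values versus the assumed +/-2.09.
    print(np.percentile(t_vals, [2.5, 97.5]))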
Modern robust methods

When it comes to hypothesis testing, one intuitive way of dealing with the issues described above would be to (1) replace the sample mean (and standard deviation) with a robust alternative and (2) use a non-parametric resampling technique to estimate the sampling distribution (rather than assuming a theoretical shape)¹. Two such candidates are the 20% trimmed mean and the percentile bootstrap test, both of which have been shown to have practical value when dealing with issues of outliers and non-normality [CvNS18][Wil13].

1. Another option is to use a parametric test that assumes a different underlying model.

The trimmed mean

The trimmed mean is nothing more than sorting values, removing a proportion from each tail, and computing the mean on the remaining values. Formally,

• Let X_1, ..., X_n be a random sample and X_(1) ≤ X_(2) ≤ ... ≤ X_(n) be the observations in ascending order
• The proportion to trim is γ (0 ≤ γ ≤ .5)
• Let g = ⌊γn⌋. That is, the proportion to trim multiplied by n, rounded down to the nearest integer

Then, in symbols, the trimmed mean can be expressed as follows:

$\bar{X}_t = \frac{X_{(g+1)} + \dots + X_{(n-g)}}{n - 2g}$

If the proportion to trim is 0.2, more than twenty percent of the values would have to be altered to make the trimmed mean arbitrarily large or small. The sample mean, on the other hand, can be made to go to ±∞ (arbitrarily large or small) by changing a single value. The trimmed mean is more robust than the sample mean in all measures of robustness that have been studied [Wil13]. In particular, the 20% trimmed mean has been shown to have practical value as it avoids issues associated with the median (not discussed here) and still protects against outliers.
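SciPy ships a matching implementation, scipy.stats.trim_mean, so the breakdown behavior described above can be checked in a couple of lines (the toy data are made up for illustration):

    import numpy as np
    from scipy import stats

    x = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.6, 95.0])

    print(x.mean())                 # 12.05: dragged upward by the single outlier
    print(stats.trim_mean(x, 0.2))  # 2.9: drops g = floor(0.2 * n) values per tail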
Implementing and teaching modern robust methods

Despite over half a century of convincing findings, and thousands of papers, robust statistical methods are still not widely adopted in applied research [EHM08][Wil98]. This may be due to various false beliefs. For example,

• Classical methods are robust to violations of assumptions
• Correcting non-normal distributions by transforming the data will solve all issues
• Traditional non-parametric tests are suitable replacements for parametric tests that violate assumptions

Perhaps the most obvious reason for the lack of adoption of modern methods is a lack of easy-to-use software and training resources. In the following sections, two resources will be presented: one for implementing robust methods and one for teaching them.

Robust statistics for Python

Hypothesize is a robust null hypothesis significance testing (NHST) library for Python [CW20]. It is based on Wilcox's WRS package for R, which contains hundreds of functions for computing robust measures of central tendency and hypothesis testing. At the time of this writing, the WRS library in R contains many more functions than Hypothesize, and its value to researchers who use inferential statistics cannot be overstated. WRS is best experienced in tandem with Wilcox's book "Introduction to Robust Estimation and Hypothesis Testing".

Hypothesize brings many of these functions into the open-source Python library ecosystem with the goal of lowering the barrier to modern robust methods—even for those who have not had extensive training in statistics or coding. With modern browser-based notebook environments (e.g., Deepnote), learning to use Hypothesize can be relatively straightforward. In fact, every statistical test listed in the docs is associated with a hosted notebook, pre-filled with sample data and code. But certainly, one can simply pip install Hypothesize to use it in any environment that supports Python. See van Noordt and Willoughby [vNW21] and van Noordt et al. [vNDTE22] for examples of Hypothesize being used in applied research.

The API for Hypothesize is organized by single- and two-factor tests, as well as measures of association. Input data for the groups, conditions, and measures are given in the form of a Pandas DataFrame [pdt20][WM10]. By way of example, one can compare two independent groups (e.g., placebo versus treatment) using the 20% trimmed mean and the percentile bootstrap test, as follows (note that Hypothesize uses the naming conventions found in WRS):

from hypothesize.utilities import trim_mean
from hypothesize.compare_groups_with_single_factor \
    import pb2gen

results = pb2gen(df.placebo, df.treatment, trim_mean)

As shown below, the results are returned as a Python dictionary containing the p-value, confidence intervals, and other important details.

{
    'ci': [-0.22625614592148624, 0.06961754796950131],
    'est_1': 0.43968438076483285,
    'est_2': 0.5290985245430996,
    'est_dif': -0.08941414377826673,
    'n1': 50,
    'n2': 50,
    'p_value': 0.27,
    'variance': 0.005787027326924963
}

For measuring associations, several options exist in Hypothesize. One example is the Winsorized correlation, which is a robust alternative to Pearson's R. For example,

from hypothesize.measuring_associations import wincor

results = wincor(df.height, df.weight, tr=.2)

returns the Winsorized correlation coefficient and other relevant statistics:

{
    'cor': 0.08515087411576182,
    'nval': 50,
    'sig': 0.558539575073185,
    'wcov': 0.004207827245660796
}
A case study using real-world data

It is helpful to demonstrate that robust methods in Hypothesize (and in other libraries) can make a practical difference when dealing with real-world data. In a study by Miller on sexual attitudes, 1327 men and 2282 women were asked how many sexual partners they desired over the next 30 years (the data are available from Rand R. Wilcox's site). When comparing these groups using Student's t-test, we get the following results:

{
    'ci': [-1491.09, 4823.24],
    't_value': 1.035308,
    'p_value': 0.300727
}

That is, we fail to reject the null hypothesis at the α = 0.05 level using Student's test for independent groups. However, if we switch to a robust analogue of the t-test, one that utilizes bootstrapping and trimmed means, we can indeed reject the null hypothesis. Here are the corresponding results from Hypothesize's yuenbt test (based on [Yue74]):

from hypothesize.compare_groups_with_single_factor \
    import yuenbt

results = yuenbt(df.males, df.females,
                 tr=.2, alpha=.05)

{
    'ci': [1.41, 2.11],
    'test_stat': 9.85,
    'p_value': 0.0
}

The point here is that robust statistics can make a practical difference with real-world data (even when N is considered large). Many other examples of robust statistics making a practical difference with real-world data have been documented [HD82][Wil09][Wil01].

It is important to note that robust methods may also fail to reject when a traditional test rejects (remember that traditional tests can suffer from increased Type I Error). It is also possible that both approaches yield the same or similar conclusions. The exact pattern of results depends largely on the characteristics of the underlying population distribution. To be able to reason about how robust statistics behave when compared to traditional methods, the robust statistics simulator has been created; it is described in the next section.

Robust statistics simulator

Having a library of robust statistical functions is not enough to make modern methods commonplace in applied research. Educators and practitioners still need intuitive training tools that demonstrate the core issues surrounding classical methods and how robust analogues compare.

As mentioned, computational notebooks that run in the cloud offer a unique solution to learning beyond that of static textbooks and documentation. Learning can be interactive and exploratory since narration, visualization, widgets (e.g., buttons, slider bars), and code can all be experienced in a ready-to-go compute environment—with no overhead related to local environment setup.

As a compendium to Hypothesize, and a resource for understanding and teaching robust statistics in general, the robust statistics simulator repository has been developed. It is a notebook-based collection of interactive demonstrations aimed at clearly and visually explaining the conditions under which classic methods fail relative to robust methods. A hosted notebook with the rendered visualizations of the simulations can be accessed here and seen in Figure 4. Since the simulations run in the browser and require very little understanding of code, students and teachers can easily onboard to the study of robust statistics.

Fig. 4: An example of the robust stats simulator in Deepnote's hosted notebook environment. A minimalist UI can lower the barrier-to-entry to robust statistics concepts.

The robust statistics simulator allows users to interact with the following parameters:

• Distribution shape
• Level of contamination
• Sample size
• Skew and heaviness of tails

Each of these characteristics can be adjusted independently in order to compare classic approaches to their robust alternatives. The two measures that are used to evaluate the performance of classic and robust methods are the standard error and Type I Error.

Standard error is a measure of how much an estimator varies across random samples from our population. We want to choose estimators that have a low standard error. Type I Error is also known as the False Positive Rate. We want to choose methods that keep Type I Error close to the nominal rate (usually 0.05). The robust statistics simulator can guide these decisions by providing empirical evidence as to why particular estimators and statistical tests have been chosen.

Conclusion

This paper gives an overview of the issues associated with the normal curve. The concerns with traditional methods, in terms of robustness to violations of normality, have been known for over half a century, and modern alternatives have been recommended; however, for various reasons that have been discussed, modern robust methods have not yet become commonplace in applied research settings.

One reason is the lack of easy-to-use software and teaching resources for robust statistics. To help fill this gap, Hypothesize, a peer-reviewed and open-source Python library, was developed. In addition, to help clearly demonstrate and visualize the advantages of robust methods, the robust statistics simulator was created. Using these tools, practitioners can begin to integrate robust statistical methods into their inferential testing repertoire.

Acknowledgements

The author would like to thank Karlynn Chan and Rand R. Wilcox as well as Elizabeth Dlha and the entire Deepnote team for their support of this project. In addition, the author would like to thank Kelvin Lee for his insightful review of this manuscript.

REFERENCES
[Cra46] Harold Cramer. Mathematical Methods of Statistics. Princeton Univ. Press, Princeton, NJ, 1946. URL: https://books.google.ca/books?id=CRTKKaJO0DYC.
[CvNS18] Allan Campopiano, Stefon JR van Noordt, and Sidney J Segalowitz. Statslab: An open-source EEG toolbox for computing single-subject effects using robust statistics. Behavioural Brain Research, 347:425–435, 2018. doi:10.1016/j.bbr.2018.03.025.
[CW20] Allan Campopiano and Rand R. Wilcox. Hypothesize: Robust statistics for Python. Journal of Open Source Software, 5(50):2241, 2020. doi:10.21105/joss.02241.
[Efr92] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pages 569–593. Springer, 1992. doi:10.1007/978-1-4612-4380-9_41.
[EHM08] David M Erceg-Hurn and Vikki M Mirosevich. Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. American Psychologist, 63(7):591, 2008. doi:10.1037/0003-066X.63.7.591.
[FG81] Joseph Fashing and Ted Goertzel. The myth of the normal curve: a theoretical critique and examination of its role in teaching and research. Humanity & Society, 5(1):14–31, 1981. doi:10.1177/016059768100500103.
[Gle93] John R Gleason. Understanding elongation: The scale contaminated normal family. Journal of the American Statistical Association, 88(421):327–337, 1993. doi:10.1080/01621459.1993.10594325.
[HD82] MaryAnn Hill and WJ Dixon. Robustness in real life: A study of clinical laboratory data. Biometrics, pages 377–396, 1982. doi:10.2307/2530452.
[Mic89] Theodore Micceri. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1):156, 1989. doi:10.1037/0033-2909.105.1.156.
[pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. doi:10.5281/zenodo.3509134.
[Tan82] WY Tan. Sampling distributions and robustness of t, F and variance-ratio in two samples and ANOVA models with respect to departure from normality. Comm. Statist.-Theor. Meth., 11:2485–2511, 1982. URL: https://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=PASCAL83X0380619.
[TE93] Robert J Tibshirani and Bradley Efron. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, 57:1–436, 1993. URL: https://books.google.ca/books?id=gLlpIUxRntoC.
[Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pages 448–485, 1960. URL: https://ci.nii.ac.jp/naid/20000755025/en/.
[vNDTE22] Stefon van Noordt, James A Desjardins, BASIS Team, and Mayada Elsabbagh. Inter-trial theta phase consistency during face processing in infants is associated with later emerging autism. Autism Research, 15(5):834–846, 2022. doi:10.1002/aur.2701.
[vNW21] Stefon van Noordt and Teena Willoughby. Cortical maturation from childhood to adolescence is reflected in resting state EEG signal complexity. Developmental Cognitive Neuroscience, 48:100945, 2021. doi:10.1016/j.dcn.2021.100945.
[Wil92] Rand R Wilcox. Why can methods for comparing means have relatively low power, and what can you do to correct the problem? Current Directions in Psychological Science, 1(3):101–105, 1992. doi:10.1111/1467-8721.ep10768801.
[Wil98] Rand R Wilcox. How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53(3):300, 1998. doi:10.1037/0003-066X.53.3.300.
[Wil01] Rand R Wilcox. Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy, volume 249. Springer, 2001. URL: https://link.springer.com/book/10.1007/978-1-4757-3522-2.
[Wil09] Rand R Wilcox. Robust ANCOVA using a smoother with bootstrap bagging. British Journal of Mathematical and Statistical Psychology, 62(2):427–437, 2009. doi:10.1348/000711008X325300.
[Wil13] Rand R Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 2013. doi:10.1016/c2010-0-67044-1.
[WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi:10.25080/Majora-92bf1922-00a.
[Yue74] Karen K Yuen. The two-sample trimmed t for unequal population variances. Biometrika, 61(1):165–170, 1974. doi:10.2307/2334299.
Python for Global Applications: teaching scientific Python in context to law and diplomacy students

Anna Haensch, Karin Knudson

Abstract—For students across domains and disciplines, the message has been communicated loud and clear: data skills are an essential qualification for today's job market. This includes not only the traditional introductory stats coursework but also machine learning, artificial intelligence, and programming in Python or R. Consequently, there has been significant student-initiated demand for data analytic and computational skills, sometimes with very clear objectives in mind, and other times guided by a vague sense of "the work I want to do will require this." Now we have options. If we train students using "black box" algorithms without attending to the technical choices involved, then we run the risk of unleashing practitioners who might do more harm than good. On the other hand, courses that completely unpack the "black box" can be so steeped in theory that the barrier to entry becomes too high for students from social science and policy backgrounds, thereby excluding critical voices. In sum, both of these options lead to a pitfall that has gained significant media attention over recent years: the harms caused by algorithms that are implemented without sufficient attention to human context. In this paper, we - two mathematicians turned data scientists - present a framework for teaching introductory data science skills in a highly contextualized and domain flexible environment. We will present example course outlines at the semester, weekly, and daily level, and share materials that we think hold promise.

The students and faculty at the Fletcher School are eager to seize upon our current data moment to expand their quantitative offerings. With this in mind, The Fletcher School reached out to the co-authors to develop a course in data science, situated in the context of international diplomacy. In response, we developed the (Python-based) course, Data Science for Global Applications, which had its inaugural offering in the Spring semester of 2022. The course had 30 enrolled Fletcher School students, primarily from the MALD program. When the course was announced we had a flood of interest from Fletcher students who were eager to broaden their studies with this course. With a goal of keeping a close interactive atmosphere we capped enrollment at 30. To inform the direction of our course, we surveyed students on their background in programming (see Fig. 1) and on their motivations for learning data science (see Fig. 2). Students reported only very limited experience with programming - if any at all - with that experience primarily in Excel and Tableau.
Student motivations varied, but the goal to get a job where they were able to make a meaningful Index Terms—computational social science, public policy, data science, teach- social impact was the primary motivation. ing with Python Introduction As data science continues to gain prominence in the public eye, and as we become more aware of the many facets of our lives that intersect with data-driven technologies and policies every day, universities are broadening their academic offerings to keep up with what students and their future employers demand. Not only are students hoping to obtain more hard skills in data science (e.g. Python programming experience), but they are interested in applying tools of data science across domains that haven’t Fig. 1: The majority of the 30 students enrolled in the course had little historically been part of the quantitative curriculum. The Master to no programming experience, and none reported having "a lot" of of Arts in Law and Diplomacy (MALD) is the flagship program of experience. Those who did have some experience were most likely to the Fletcher School of Law and International Diplomacy at Tufts have worked in Excel or Tableau. University. Historically, the program has contained core elements of quantitative reasoning with a focus on business, finance, and The MALD program, which is interdisciplinary by design, pro- international development, as is typical in graduate programs in vides ample footholds for domain specific data science. Keeping international relations. Like academic institutions more broadly, this in mind, as a throughline for the course, each student worked to develop their own quantitative policy project. Coursework and * Corresponding author: anna.haensch@tufts.edu discussions were designed to move this project forward from ‡ Tufts University § Data Intensive Studies Center initial policy question, to data sourcing and visualizing, and eventually to modeling and analysis. Copyright © 2022 Anna Haensch et al. This is an open-access article dis- In what follows we will describe how we structured our tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- course with the goal of empowering beginner programmers to use vided the original author and source are credited. Python for data science in the context of international relations 70 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) might understand in the abstract that the way the handling of missing data can substantially affect the outcome of an analysis, but will likely have a stronger understanding if they have had to consider how to deal with missing data in their own project. We used several course structures to support connecting data science and Python "skills" with their context. Students had readings and journaling assignments throughout the semester on topics that connected data science with society. In their journal responses, students were asked to connect the ideas in the reading to their other academic/professional interests, or ideas from other classes with the following prompt: Your reflection should be a 250-300 word narrative. Be sure to tie the reading back into your own studies, experiences, and areas of interest. For each reading, Fig. 2: The 30 enrolled students were asked to indicate which were come up with 1-2 discussion questions based on the con- relevant motivations for taking the course. Curiosity and a desire to cepts discussed in the readings. 
This can be a curiosity make a meaningful social impact were among the top motivations our question, where you’re interested in finding out more, students expressed. a critical question, where you challenge the author’s assumptions or decisions, or an application question, where you think about how concepts from the reading and diplomacy. We will also share details about course content would apply to a particular context you are interested in and structure, methods of assessment, and Python programming exploring.1 resources that we deployed through Google Colab. All of the materials described here can be found on the public course page These readings (highlighted in gray in Fig 3), assignments, and https://karink520.github.io/data-science-for-global-applications/. the related in-class discussions were interleaved among Python exercises meant to give students practice with skills including Course Philosophy and Goals manipulating DataFrames in pandas [The22], [Mck10], plotting in Matplotlib [Hun07] and seaborn [Was21], mapping with GeoPan- Our high level goals for the course were i) to empower students das [Jor21], and modeling with scikit-learn [Ped11]. Student with the skills to gain insight from data using Python and ii) to projects included a thorough data audit component requiring deepen students’ understanding of how the use of data science students to explore data sources and their human context in detail. affects society. As we sought to achieve these high level goals Precise details and language around the data audit can be found within the limited time scope of a single semester, the following on the course website. core principles were essential in shaping our course design. Below, we briefly describe each of these principles and share some Managing Fears & Concerns Through Supported Programming examples of how they were reflected in the course structure. In a We surmised that students who are new to programming and subsequent section we will more precisely describe the content of possibly intimidated by learning the unfamiliar skill would do the course, whereupon we will further elaborate on these principles well in an environment that included plenty of what we call and share instructional materials. But first, our core principles: supported programming - that is, practicing programming in class Connecting the Technical and Social with immediate access to instructor and peer support. In the pre-course survey we created, many students identified To understand the impact of data science on the world (and the concerns about their quantitative preparation, whether they would potential policy implications of such impact), it helps to have be able to keep up with the course, and how hard programming hands-on practice with data science. Conversely, to effectively might be. We sought to acknowledge these concerns head-on, and ethically practice data science, it is important to understand assure students of our full confidence in their ability to master how data science lives in the world. Thus, the "hard" skills of the material, and provide them with all the resources they needed coding, wrangling data, visualizing, and modeling are best taught to succeed. intertwined with a robust study of ways in which data science is A key resource to which we thought all students needed used and misused. access was instructor attention. 
In addition to keeping the class There is an increasing need to educate future policy-makers size capped at 30 people, with both co-instructors attending all with knowledge of how data science algorithms can be used course meetings, we structured class time to maximize the time and misused. One way to approach meeting this need, especially students spent actually doing data science in class. We sought for students within a less technically-focused program, would to keep demonstrations short, and intersperse them with coding be to teach students about how algorithms can be used without exercises so that students could practice with new ideas right actually teaching them to use algorithms. However, we argue that away. Our Colab notebooks included in the course materials show students will gain a deeper understanding of the societal and one way that we wove student practice time throughout. Drawing ethical implications of data science if they also have practical insight from social practice theory of learning (e.g. [Eng01], data science skills. For example, a student could gain a broad [Pen16]), we sought to keep in mind how individual practice and understanding of how biased training data might lead to biased learning pathways develop in relation to their particular social and algorithmic predictions, but such understanding is likely to be deeper and more memorable when a student has actually practiced 1. This journaling prompt was developed by our colleague Desen Ozkan at training a model using different training data. Similarly, someone Tufts University. PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS 71 institutional context. Crucially, we devoted a great deal of in-class and preparing data for exploratory data analysis, visualizing and time to students doing data science, and a great deal of energy annotating data, and finally modeling and analyzing data. All into making this practice time a positive and empowering social of this was done with the goal of answering a policy question experience. During student practice time, we were circulating developed by the student, allowing the student to flex some throughout the room, answering student questions and helping domain expertise to supplement the (sometimes overwhelming!) students to problem solve and debug, and encouraging students programmatic components. to work together and help each other. A small organizational Our project explicitly required that students find two datasets change we made in the first weeks of the semester that proved of interest and merge them for the final analysis. This presented to have outsized impact was moving our office hours to hold them both logistical and technical challenges. As one student pointed directly after class in an almost-adjacent room, to make it as easy out after finally finding open data: hearing people talk about the as possible for students to attend office hours. Students were vocal need for open data is one thing, but you really realize what that in their appreciation of office hours. means when you’ve spent weeks trying to get access to data that We contend that the value of supported programming time you know exists. Understanding the provenance of the data they is two-fold. First, it helps beginning programmers learn more were working with helped students assess the biases and limita- quickly. 
While learning to code necessarily involves challenges, tions, and also gave students a strong sense of ownership over students new to a language can sometimes struggle for an un- their final projects. An unplanned consequence of the broad scope productively long time on things like simple syntax issues. When of the policy project was that we, the instructors, learned nearly students have help available, they can move forward from minor as much about international diplomacy as the students learned issues faster and move more efficiently into building a meaningful about programming and data science, a bidirectional exchange of understanding. Secondly, supported programming time helps stu- knowledge that we surmised to have contributed to student feeling dents to understand that they are not alone in the challenges they of empowerment and a positive class environment. are facing in learning to program. They can see other students learning and facing similar challenges, can have the empowering Course Structure experience of helping each other out, and when asking for help can notice that even their instructors sometimes rely on resources We broke the course into three modules, each with focused like StackOverflow. An unforeseen benefit we believe co-teaching reading/journaling topics, Python exercises, and policy project had was to give us as instructors the opportunity to consult benchmarks: (i) getting and cleaning data, (ii) visualizing data, with each other during class time and share different approaches. and (iii) modeling data. In what follows we will describe the key These instructor interactions modeled for students how even as goals of each module and highlight the readings and exercises that experienced practitioners of data science, we too were constantly we compiled to work towards these goals. learning. Getting and Cleaning Data Lastly, a small but (we thought) important aspect of our setup was teaching students to set up a computing environment on Getting, cleaning, and wrangling data typically make up a signif- their own laptops, with Python, conda [Ana16], and JupyterLab icant proportion of the time involved in a data science project. [Pro22]. Using the command line and moving from an environ- Therefore, we devoted significant time in our course to learning ment like Google Colab to one’s own computer can both present these skills, focusing on loading and manipulating data using significant barriers, but doing so successfully can be an important pandas. Key skills included loading data into a pandas DataFrame, part of helping students feel like ‘real’ programmers. We devoted working with missing data, and slicing, grouping, and merging an entire class period to helping students with installation and DataFrames in various ways. After initial exposure and practice setup on their own computers. with example datasets, students applied their skills to wrangling We considered it an important measure of success how many the diverse and sometimes messy and large datasets that they found students told us at the end of the course that the class had helped for their individual projects. Since one requirement of the project them overcome sometimes longstanding feelings that technical was to integrate more than one dataset, merging was of particular skills like coding and modeling were not for them. importance. 
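As an illustration of the kind of merge at the heart of these projects, the pandas sketch below combines two hypothetical sources on a shared key; the file names and the country_code column are invented for the example and are not from any student project.

import pandas as pd

gdp = pd.read_csv("gdp_by_country.csv")       # hypothetical source 1
internet = pd.read_csv("internet_usage.csv")  # hypothetical source 2

# keep only the countries present in both datasets
merged = gdp.merge(internet, on="country_code", how="inner")
print(merged.isna().sum())  # inspect missing values in the combined data

An inner join keeps only rows shared by both sources, which is often where questions about coverage and missing data first surface.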
During this portion of the course, students read and discussed Leveraging Existing Strengths To Enhance Student Ownership Boyd and Crawford’s Critical Questions for Big Data [Boy12] Even as beginning programmers, students are capable of creating a which situates big data in the context of knowledge itself and meaningful policy-related data science project within the semester, raises important questions about access to data and privacy. Ad- starting from formulating a question and finding relevant datasets. ditional readings included selected chapters from D’Ignazio and Working on the project throughout the semester (not just at the Klein’s Data Feminism [Dig20] which highlights the importance end) gave essential context to data science skills as students could of what we choose to count and what it means when data is translate into what an idea might mean for "their" data. Giving missing. students wide leeway in their project topic allowed the project to be a point of connection between new data science skills and their Visualizing Data existing domain knowledge. Students chose projects within their A fundamental component to communicating findings from data particular areas of interest or expertise, and a number chose to is well-executed data visualization. We chose to place this module additionally connect their project for this course to their degree in the middle of the course, since it was important that students capstone project. have a common language for interpreting and communicating their Project benchmarks were placed throughout the semester analysis before moving to the more complicated aspects of data (highlighted in green in Fig 3) allowing students a concrete modeling. In developing this common language, we used Wilke’s way to develop their new skills in identifying datasets, loading Fundamentals of Data Visualization [Wil19] and Cairo’s How 72 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Course outline for a 13-week semester with two 70 minute instructional blocks each week. Course readings are highlighted in gray and policy project benchmarks are highlighted in green. Chart’s Lie [Cai19] as a backbone for this section of the course. using Python. Having the concrete target of how a student wanted In addition to reading the text materials, students were tasked with their visualization to look seemed to be a motivating starting finding visualizations “in the wild,” both good and bad. Course point from which to practice coding and debugging. We spent discussions centered on the found visualizations, with Wilke and several class periods on supported programming time for students Cairo’s writings as a common foundation. From the readings and to develop their visualizations. discussions, students became comfortable with the language and Working on building the narratives of their project and devel- taxonomy around visualizations and began to develop a better ap- oping their own visualizations in the context of the course readings preciation of what makes a visualization compelling and readable. gave students a heightened sense of attention to detail. During Students were able to formulate a plan about how they could best one day of class when students shared visualizations and gave visualize their data. The next task was to translate these plans into feedback to one another, students commented and inquired about Python. 
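For example, a plan for a simple comparison chart might translate into a few lines of Matplotlib such as the sketch below; the data and labels are hypothetical and purely illustrative.

import matplotlib.pyplot as plt

countries = ["Kenya", "Brazil", "Vietnam", "Norway"]  # hypothetical data
values = [42, 57, 63, 91]

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(countries, values, color="steelblue")
ax.set_xlabel("Internet users (% of population)")
ax.set_yticks(range(len(countries)))
ax.set_yticklabels(countries, va="center")  # the kind of small styling detail discussed in class
fig.tight_layout()
plt.show()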
incredibly small details of each others’ presentations, for example, To help students gain a level of comfort with data visualization how to adjust y-tick alignment on a horizontal bar chart. This sort in Python, we provided instruction and examples of working of tiny detail is hard to convey in a lecture, but gains outsized with a variety of charts using Matplotlib and seaborn, as well importance when a student has personally wrestled with it. as maps and choropleths using GeoPandas, and assigned students programming assignments that involved writing code to create Modeling Data a visualization matching one in an image. With that practical In this section we sought to expose students to introductory grounding, students were ready to visualize their own project data approaches in each of regression, classification, and clustering PYTHON FOR GLOBAL APPLICATIONS: TEACHING SCIENTIFIC PYTHON IN CONTEXT TO LAW AND DIPLOMACY STUDENTS 73 in Python. Specifically, we practiced using scikit-learn to work And finally, to supplement the technical components of the with linear regression, logistic regression, decision trees, random course we also had readings with associated journal entries sub- forests, and gaussian mixture models. Our focus was not on the mitted at a cadence of roughly two per module. Journal prompts theoretical underpinnings of any particular model, but rather on are described above and available on the course website. the kinds of problems that regression, classification, or clustering models respectively, are able to solve, as well as some basic ideas about model assessment. The uniform and approachable scikit- Conclusion learn API [Bui13] was crucial in supporting this focus, since it Various listings of key competencies in data science have been allowed us to focus less on syntax around any one model, and more proposed [NAS18]. For example, [Dev17] suggests the following on the larger contours of modeling, with all its associated promise pillars for an undergraduate data science curriculum: computa- and perils. We spent a good deal of time building an understanding tional and statistical thinking, mathematical foundations, model of train-test splits and their role in model assessment. building and assessment, algorithms and software foundation, Student projects were required to include a modeling com- data curation, and knowledge transference—communication and ponent. Just the process of deciding which of regression, clas- responsibility. As we sought to contribute to the training of sification, or clustering were appropriate for a given dataset and data-science informed practitioners of international relations, we policy question is highly non-trivial for beginners. The diversity of focused on helping students build an initial competency especially student projects and datasets meant students had to grapple with in the last four of these. this decision process in its full complexity. We were delighted by We can point to several key aspects of the course that made the variety of modeling approaches students used in their projects, it successful. Primary among them was the fact that the majority as well as by students’ thoughtful discussions of the limitations of of class time was spent in supported programming. This means their analysis. that students were able to ask their instructors or peers as soon To accompany this section of the course, students were as- as questions arose. 
Novice programmers who aren’t part of a signed readings focusing on some of the societal impacts of data formal computer science program often don’t have immediate modeling and algorithms more broadly. These readings included access to the resources necessary to get "unstuck." for the novice a chapter from O’Neil’s Weapons of Math Destruction [One16] as programmer, even learning how to google technical terms can be a well as Buolamwini and Gebru’s Gender Shades [Buo18]. Both of challenge. This sort of immediate debugging and feedback helped these readings emphasize the capacity of algorithms to exacerbate students remain confident and optimistic about their projects. This inequalities and highlight the importance of transparency and was made all the more effective since we were co-teaching the ethical data practices. These readings resonated especially strongly course and had double the resources to troubleshoot. Co-teaching with our students, many of whom had recently taken courses in also had the unforeseen benefit of making our classroom a place cyber policy and ethics in artificial intelligence. where the growth mindset was actively modeled and nurtured: where one instructor wasn’t able to answer a question, the other Assessments instructor often could. Finally, it was precisely the motivation of Formal assessment was based on four components, already alluded learning data science in context that allowed students to maintain a to throughout this note. The largest was the ongoing policy sense of ownership over their work and build connections between project which had benchmarks with rolling due dates throughout their other courses. the semester. Moreover, time spent practicing coding skills in Learning programming from the ground up is difficult. Stu- class was often done in service of the project. For example, in dents arrive excited to learn, but also nervous and occasionally week 4, when students learned to set up their local computing heavy with the baggage they carry from prior experience in environments, they also had time to practice loading, reading, and quantitative courses. However, with a sufficient supported learning saving data files associated with their chosen project datasets. This environment it’s possible to impart relevant skills. It was a measure brought challenges, since often students sitting side-by-side were of the success of the course how many students told us that the dealing with different operating systems and data formats. But course had helped them overcome negative prior beliefs about from this challenge emerged many organic conversations about their ability to code. Teaching data science skills in context and file types and the importance of naming conventions. The rubric with relevant projects that leverage students’ existing expertise and for the final project is shown in Fig 4. outside reading situates the new knowledge in a place that feels The policy project culminated with in-class “micro presenta- familiar and accessible to students. This contextualization allows tions” and a policy paper. We dedicated two days of class in week students to gain some mastery while simultaneously playing to 13 for in-class presentations, for which each student presented their strengths and interests. one slide consisting of a descriptive title, one visualization, and several “key takeaways” from the project. 
This extremely restric- tive format helped students to think critically about the narrative R EFERENCES information conveyed in a visualization, and was designed to create time for robust conversation around each presentation. [Ana16] Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. In addition to the policy project, each of the three course Anaconda, Nov. 2016. Web. https://anaconda.com. [Boy12] Boyd, Danah, and Kate Crawford. Critical questions for big data: modules also had an associated set of Python exercises (available Provocations for a cultural, technological, and scholarly phe- on the course website). Students were given ample time both in nomenon. Information, communication & society 15.5 (2012):662- and out of class to ask questions about the exercises. Overall, these 679. https://doi.org/10.1080/1369118X.2012.678878 exercises proved to be the most technically challenging component [Bui13] Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae et al. API design for of the course, but we invited students to resubmit after an initial machine learning software: experiences from the scikit-learn project. round of grading. arXiv preprint arXiv:1309.0238 (2013). 74 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Rubric for the policy project that formed a core component of the formal assessment of students throughout the course. [Buo18] Buolamwini, Joy, and Timnit Gebru. Gender shades: Intersectional [Ped11] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent accuracy disparities in commercial gender classification. Conference Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al. on fairness, accountability and transparency. PMLR, 2018. http:// Scikit-learn: Machine learning in Python. the Journal of machine proceedings.mlr.press/v81/buolamwini18a.html Learning research 12 (2011): 2825-2830. https://dl.acm.org/doi/10. [Cai19] Cairo, Alberto. How charts lie: Getting smarter about visual infor- 5555/1953048.2078195 mation. WW Norton & Company, 2019. [Pen16] Penuel, William R., Daniela K. DiGiacomo, Katie Van Horne, and [Dev17] De Veaux, Richard D., Mahesh Agarwal, Maia Averett, Benjamin Ben Kirshner. A Social Practice Theory of Learning and Becoming S. Baumer, Andrew Bray, Thomas C. Bressoud, Lance Bryant et al. across Contexts and Time. Frontline Learning Research 4, no. 4 Curriculum guidelines for undergraduate programs in data science. (2016): 30-38. http://dx.doi.org/10.14786/flr.v4i4.205 Annual Review of Statistics and Its Application 4 (2017): 15-30. [Pro22] Project Jupyter, 2022. jupyterlab/jupyterlab: JupyterLab 3.4.3 https: https://doi.org/10.1146/annurev-statistics-060116-053930 //github.com/jupyterlab/jupyterlab [Dig20] D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. MIT [The22] The Pandas Development Team, 2022. pandas-dev/pandas: Pandas press, 2020. 1.4.2. Zenodo. https://doi.org/10.5281/zenodo.6408044 [Eng01] Engeström, Yrjö. Expansive learning at work: Toward an activity [Was21] Waskom, Michael L. Seaborn: statistical data visualization. Journal theoretical reconceptualization. Journal of education and work 14, of Open Source Software 6, no. 60 (2021): 3021. https://doi.org/10. no. 1 (2001): 133-156. https://doi.org/10.1080/13639080020028747 21105/joss.03021 [Hun07] Hunter, J.D., Matplotlib: A 2D Graphics Environment. Computing in [Wil19] Wilke, Claus O. Fundamentals of data visualization: a primer on Science & Engineering, vol. 9, no. 3 (2007): 90-95. 
https://doi.org/ making informative and compelling figures. O’Reilly Media, 2019. 10.1109/MCSE.2007.55 [Jor21] Jordahl, Kelsey et al. 2021. Geopandas/geopandas: V0.10.2. Zenodo. https://doi.org/10.5281/zenodo.5573592. [Mck10] McKinney, Wes. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, vol. 445, no. 1, pp. 51-56. 2010. https://doi.org/10.25080/Majora-92bf1922-00a [NAS18] National Academies of Sciences, Engineering, and Medicine. Data science for undergraduates: Opportunities and options. National Academies Press, 2018. [One16] O’Neil, Cathy. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 75 Papyri: better documentation for the scientific ecosystem in Jupyter Matthias Bussonnier‡§∗ , Camille Carvalho¶k F Abstract—We present here the idea behind Papyri, a framework we are devel- documentation is often displayed as raw source where no naviga- oping to provide a better documentation experience for the scientific ecosystem. tion is possible. On the maintainers’ side, the final documentation In particular, we wish to provide a documentation browser (from within Jupyter rendering is less of a priority. Rather, maintainers should aim at or other IDEs and Python editors) that gives a unified experience, cross library making users gain from improvement in the rendering without navigation search and indexing. By decoupling documentation generation from having to rebuild all the docs. rendering we hope this can help address some of the documentation accessi- bility concerns, and allow customisation based on users’ preferences. Conda-Forge [CFRG] has shown that concerted efforts can give a much better experience to end-users, and in today’s world Index Terms—Documentation, Jupyter, ecosystem, accessibility where it is ubiquitous to share libraries source on code platforms, perform continuous integration and many other tools, we believe a better documentation framework for many of the libraries of the Introduction scientific Python should be available. Over the past decades, the Python ecosystem has grown rapidly, Thus, against all advice we received and based on our own and one of the last bastion where some of the proprietary competi- experience, we have decided to rebuild an opinionated documen- tion tools shine is integrated documentation. Indeed, open-source tation framework, from scratch, and with minimal dependencies: libraries are usually developed in distributed settings that can make Papyri. Papyri focuses on building an intermediate documentation it hard to develop coherent and integrated systems. representation format, that lets us decouple building, and rendering While a number of tools and documentations exists (and the docs. This highly simplifies many operations and gives us improvements are made everyday), most efforts attempt to build access to many desired features that were not available up to now. documentation in an isolated way, inherently creating a heteroge- In what follows, we provide the framework in which Papyri neous framework. The consequences are twofolds: (i) it becomes has been created and present its objectives (context and goals), difficult for newcomers to grasp the tools properly, (ii) there is a we describe the Papyri features (format, installation, and usage), lack of cohesion and of unified framework due to library authors then present its current implementation. 
We end this paper with making their proper choices as well as having to maintain build comments on current challenges and future work. scripts or services. Many users, colleagues, and members of the community have Context and objectives been frustrated with the documentation experience in the Python Through out the paper, we will draw several comparisons between ecosystem. Given a library, who hasn’t struggled to find the documentation building and compiled languages. Also, we will "official" website for the documentation ? Often, users stumble borrow and adapt commonly used terminology. In particular, sim- across an old documentation version that is better ranked in their ilarities with "ahead-of-time" (AOT) [AOT], "just-in-time"" (JIT) favorite search engine, and this impacts significantly the learning [JIT], intermediate representation (IR) [IR], link-time optimization process of less experienced users. (LTO) [LTO], static vs dynamic linking will be highlighted. This On users’ local machine, this process is affected by lim- allows us to clarify the presentation of the underlying architecture. ited documentation rendering. Indeed, while in many Integrated However, there is no requirement to be familiar with the above Development Environments (IDEs) the inspector provides some to understand the concepts underneath Papyri. In that context, we documentation, users do not get access to the narrative, or the full wish to discuss documentation building as a process from a source- documentation gallery. For Command Line Interface (CLI) users, code meant for a machine to a final output targeting the flesh and blood machine between the keyboard and the chair. * Corresponding author: bussonniermatthias@gmail.com ‡ QuanSight, Inc § Digital Ours Lab, SARL. Current tools and limitations ¶ University of California Merced, Merced, CA, USA || Univ Lyon, INSA Lyon, UJM, UCBL, ECL, CNRS UMR 5208, ICJ, F-69621, In the scientific Python ecosystem, it is well known that Docutils France [docutils] and Sphinx [sphinx] are major cornerstones for pub- lishing HTML documentation for Python. In fact, they are used Copyright © 2022 Matthias Bussonnier et al. This is an open-access article by all the libraries in this ecosystem. While a few alternatives distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, exist, most tools and services have some internal knowledge of provided the original author and source are credited. Sphinx. For instance, Read the Docs [RTD] provides a specific 76 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Sphinx theme [RTD-theme] users can opt-in to, Jupyter-book [JPYBOOK] is built on top of Sphinx, and MyST parser [MYST] (which is made to allow markdown in documentation) targets Sphinx as a backend, to name a few. All of the above provide an "ahead-of-time" documentation compilation and rendering, which is slow and computationally intensive. When a project needs its specific plugins, extensions and configurations to properly build (which is almost always the case), it is relatively difficult to build documentation for a single object (like a single function, module or class). This makes AOT tools difficult to use for interactive exploration. One can then consider a JIT approach, as done for Docrepr [DOCREPR] (integrated both in Jupyter and Spyder [Spyder]). However in that case, interactive documentation lacks inline plots, crosslinks, indexing, search and many custom Fig. 
1: The following screenshot shows the help for directives. scipy.signal.dpss, as currently accessible (left), as shown by Papyri for Jupyterlab extension (right). An extended version of the Some of the above limitations are inherent to the design right pannel is displayed in Figure 4. of documentation build tools that were intended for a separate documentation construction. While Sphinx does provide features like intersphinx, link resolutions are done at the documentation raw docstrings (see for example the SymPy discussion2 on how building phase. Thus, this is inherently unidirectional, and can equations should be displayed in docstrings, and left panel of break easily. To illustrate this, we consider NumPy [NP] and SciPy Figure 1). In terms of format, markdown is appealing, however [SP], two extremely close libraries. In order to obtain proper cross- inconsistencies in the rendering will be created between libraries. linked documentation, one is required to perform at least five steps: Finally, some libraries can dynamically modify their docstring at • build NumPy documentation runtime. While this sometime avoids using directives, it ends up • publish NumPy object.inv file. being more expensive (runtime costs, complex maintenance, and • (re)build SciPy documentation using NumPy obj.inv contribution costs). file. Objectives of the project • publish SciPy object.inv file • (re)build NumPy docs to make use of SciPy’s obj.inv We now layout the objectives of the Papyri documentation frame- work. Let us emphasize that the project is in no way intended to Only then can both SciPy’s and NumPy’s documentation refer replace or cover many features included in well-established docu- to each other. As one can expect, cross links break every time mentation tools such as Sphinx or Jupyter-book. Those projects are a new version of a library is published1 . Pre-produced HTML extremely flexible and meet the needs of their users for publishing in IDEs and other tools are then prone to error and difficult to a standalone documentation website of PDFs. The Papyri project maintain. This also raises security issues: some institutions be- addresses specific documentation challenges (mentioned above), come reluctant to use tools like Docrepr or viewing pre-produced we present below what is (and what is not) the scope of work. HTML. Goal (a): design a non-generic (non fully customisable) website builder. When authors want or need complete control Docstrings format of the output and wide personalisation options, or branding, then The Numpydoc format is ubiquitous among the scientific ecosys- Papyri is not likely the project to look at. That is to say single- tem [NPDOC]. It is loosely based on reStructuredText (RST) project websites where appearance, layout, domain need to be syntax, and despite supporting full RST syntax, docstrings rarely controlled by the author is not part of the objectives. contain full-featured directive. Maintainers are confronted to the Goal (b): create a uniform documentation structure and following dilemma: syntax. The Papyri project prescribes stricter requirements in • keep the docstrings simple. This means mostly text-based terms of format, structure, and syntax compared to other tools docstrings with few directive for efficient readability. The such as Docutils and Sphinx. When possible, the documentation end-user may be exposed to raw docstring, there is no on- follows the Diátaxis Framework [DT]. This provides a uniform the-fly directive interpretation. 
This is the case for tools documentation setup and syntax, simplifying contributions to the such as IPython and Jupyter. project and easing error catching at compile time. Such strict envi- • write an extensive docstring. This includes references, and ronment is qualitatively supported by a number of documentation directive that potentially creates graphics, tables and more, fixes done upstream during the development stage of the project3 . allowing an enriched end-user experience. However this Since Papyri is not fully customisable, users who are already using may be computationally intensive, and executing code to documentation tools such as Sphinx, mkdocs [mkdocs] and others view docs could be a security risk. should expect their project to require minor modifications to work with Papyri. Other factors impact this choice: (i) users, (ii) format, (iii) Goal (c): provide accessibility and user proficiency. Ac- runtime. IDE users or non-Terminal users motivate to push for cessibility is a top priority of the project. To that aim, items extensive docstrings. Tools like Docrepr can mitigate this problem are associated to semantic meaning as much as possible, and by allowing partial rendering. However, users are often exposed to 2. sympy/sympy#14963 1. ipython/ipython#12210, numpy/numpy#21016, & #29073 3. Tests have been performed on NumPy, SciPy. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 77 documentation rendering is separated from documentation build- Intermediate Representation for Documentation (IRD) ing phase. That way, accessibility features such as high contract IRD format: Papyri relies on standard interchangeable themes (for better text-to-speech (TTS) raw data), early example "Intermediate Representation for Documentation" (IRD) format. highlights (for newcomers) and type annotation (for advanced This allows to reduce operation complexity of the documentation users) can be quickly available. With the uniform documentation build. For example, given M documentation producers and N structure, this provides a coherent experience where users become renderers, a full documentation build would be O(MN) (each more comfortable finding information in a single location (see renderer needs to understand each producer). If each producer only Figure 1). cares about producing IRD, and if each renderer only consumes it, Goal (d): make documentation building simple, fast, and then one can reduce to O(M+N). Additionally, one can take IRD independent. One objective of the project is to make documenta- from multiple producers at once, and render them all to a single tion installation and rendering relatively straightforward and fast. target, breaking the silos between libraries. To that aim, the project includes relative independence of doc- At the moment, IRD files are currently separated into four umentation building across libraries, allowing bidirectional cross main categories roughly following the Diátaxis framework [DT] links (i.e. both forward and backward links between pages) to and some technical needs: be maintained more easily. In other words, a single library can be built without the need to access documentation from another. Also, • API files describe the documentation for a single ob- the project should include straightforward lookup documentation ject, expressed as a JSON object. When possible, the for an object from the interactive read–eval–print loop (REPL). information is encoded semantically (Objective (c)). 
Files Finally, efforts are put to limit the installation speed (to avoid are organized based on the fully-qualified name of the polynomial growth when installing packages on large distributed Python object they reference, and contain either absolute systems). reference to another object (library, version and identi- fier), or delayed references to objects that may exist in another library. Some extra per-object meta information The Papyri solution like file/line number of definitions can be stored as well. In this section we describe in more detail how Papyri has been • Narrative files are similar to API files, except that they do implemented to address the objectives mentioned above. not represent a given object, but possess a previous/next page. They are organised in an ordered tree related to the table of content. Making documentation a multi-step process • Example files are a non-ordered collection of files. When using current documentation tools, customisation made by • Assets files are untouched binary resource archive files that maintainers usually falls into the following two categories: can be referenced by any of the above three ones. They are the only ones that contain backward references, and no • simpler input convenience, forward references. • modification of final rendering. In addition to the four categories above, metadata about the This first category often requires arbitrary code execution and current package is stored: this includes library name, current must import the library currently being built. This is the case version, PyPi name, GitHub repository slug4 , maintainers’ names, for example for the use of .. code-block:::, or custom logo, issue tracker and others. In particular, metadata allows :rc: directive. The second one offers a more user friendly en- us to auto-generate links to issue trackers, and to source files vironment. For example, sphinx-copybutton [sphinx-copybutton] when rendering. In order to properly resolve some references and adds a button to easily copy code snippets in a single click, normalize links convention, we also store a mapping from fully and pydata-sphinx-theme [pydata-sphinx-theme] or sphinx-rtd- qualified names to canonical ones. dark-mode provide a different appearance. As a consequence, Let us make some remarks about the current stage of IRD for- developers must make choices on behalf of their end-users: this mat. The exact structure of package metadata has not been defined may concern syntax highlights, type annotations display, light/dark yet. At the moment it is reduced to the minimum functionality. theme. While formats such as codemeta [CODEMETA] could be adopted, Being able to modify extensions and re-render the documenta- in order to avoid information duplication we rely on metadata tion without the rebuilding and executing stage is quite appealing. either present in the published packages already or extracted from Thus, the building phase in Papyri (collecting documentation Github repository sources. Also, IRD files must be standardized information) is separated from the rendering phase (Objective (c)): in order to achieve a uniform syntax structure (Objective (b)). at this step, Papyri has no knowledge and no configuration options In this paper, we do not discuss IRD files distribution. Last, the that permit to modify the appearance of the final documentation. final specification of IRD files is still in progress and regularly Additionally, the optional rendering process has no knowledge of undergoes major changes (even now). 
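As a purely illustrative sketch, and not the actual IRD schema (which the authors note is still in progress), an entry in an API file could carry the kind of semantic fields described above; everything here, including the field names, is hypothetical.

ird_api_entry = {
    "qualified_name": "example_pkg.submodule.func",  # hypothetical object
    "signature": "func(x, axis=0)",
    "sections": {
        "Summary": ["Compute something useful."],
        "Parameters": [{"name": "x", "type": "ndarray", "desc": ["One-dimensional input array."]}],
    },
    # absolute reference: library, version and identifier are all known
    "references": [{"library": "numpy", "version": "1.22", "ref": "numpy.mean"}],
    # delayed reference: resolved later, if/when the target library's bundle is installed
    "delayed_references": ["scipy.signal.dpss"],
    "source": {"file": "submodule.py", "line": 42},  # per-object metadata
}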
Thus, we invite contributors the building step, and can be run without accessing the libraries to consult the current state of implementation on the GitHub involved. repository [Papyri]. Once the IRD format is more stable, this will This kind of technique is commonly used in the field of be published as a JSON schema, with full specification and more compilers with the usage of Single Compilation Unit [SCU] and in-depth description. Intermediate Representation [IR], but to our knowledge, it has not been implemented for documentation in the Python ecosystem. 4. "slug" is the common term that refers to the various combinations As mentioned before, this separation is key to achieving many of organization name/user name/repository name, that uniquely identifies a features proposed in Objectives (c), (d) (see Figure 2). repository on a platform like GitHub. 78 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Sketch representing how to build documentation with Papyri. Step 1: Each project builds an IRD bundle that contains semantic information about the project documentation. Step 2: the IRD bundles are publihsed online. Step 3: users install IRD bundles locally on their machine, pages get corsslinked, indexed, etc. Step 4: IDEs render documentation on-the-fly, taking into consideration users’ preferences. IRD bundles: Once a library has collected IRD repre- package managers or IDEs, one could imagine this process being sentation for all documentation items (functions, class, narrative automatic, or on demand. This step should be fairly efficient as it sections, tutorials, examples), Papyri consolidates them into what mostly requires downloading and unpacking IRD files. we will refer to as IRD bundles. A Bundle gathers all IRD files Finally, IDEs developers want to make sure IRD files can be and metadata for a single version of a library5 . Bundles are a properly rendered and browsed by their users when requested. convenient unit to speak about publication, installation, or update This may potentially take into account users’ preferences, and may of a given library documentation files. provide added values such as indexing, searching, bookmarks and Unlike package installation, IRD bundles do not have the others, as seen in rustsdocs, devdocs.io. notion of dependencies. Thus, a fully fledged package manager is not necessary, and one can simply download corresponding files Current implementation and unpack them at the installation phase. We present here some of the technological choices made in the Additionally, IRD bundles for multiple versions of the same current Papyri implementation. At the moment, it is only targeting library (or conflicting libraries) are not inherently problematic as a subset of projects and users that could make use of IRD files and they can be shared across multiple environments. bundles. As a consequence, it is constrained in order to minimize From a security standpoint, installing IRD bundles does not the current scope and efforts development. Understanding the require the execution of arbitrary code. This is a critical element implementation is not necessary to use Papyri neither as a project for adoption in deployments. There exists as well an opportunity to maintainer nor as a user, but it can help understanding some of the provide localized variants at the IRD installation time (IRD bundle current limitations. translations haven’t been explored exhaustively at the moment). 
Additionally, nothing prevents alternatives and complementary implementations with different choices: as long as other imple- IRD and high level usage mentations can produce (or consume) IRD bundles, they should Papyri-based documentation involves three broad categories of be perfectly compatible and work together. stakeholders (library maintainers, end-users, IDE developers), and The following sections are thus mostly informative to under- processes. This leads to certain requirements for IRD files and stand the state of the current code base. In particular we restricted bundles. ourselves to: On the maintainers’ side, the goal is to ensure that Papyri can build IRD files, and publish IRD bundles. Creation of IRD files • Producing IRD bundles for the core scientific Python and bundles is the most computationally intensive step. It may projects (Numpy, SciPy, Matplotlib...) require complex dependencies, or specific plugins. Thus, this can • Rendering IRD documentation for a single user on their be a multi-step process, or one can use external tooling (not related local machine. to Papyri nor using Python) to create them. Visual appearance Finally, some of the technological choices have no other and rendering of documentation is not taken into account in this justification than the main developer having interests in them, or process. Overall, building IRD files and bundles takes about the making iterations on IRD format and main code base faster. same amount of time as running a full Sphinx build. The limiting factor is often associated to executing library examples and code IRD files generation snippets. For example, building SciPy & NumPy documentation The current implementation of Papyri only targets some compat- IRD files on a 2021 Macbook Pro M1 (base model), including ibility with Sphinx (a website and PDF documentation builder), executing examples in most docstrings and type inferring most reStructuredText (RST) as narrative documentation syntax and examples (with most variables semantically inferred) can take Numpydoc (both a project and standard for docstring formatting). several minutes. These are widely used by a majority of the core scientific End-users are responsible for installing desired IRD bundles. Python ecosystem, and thus having Papyri and IRD bundles In most cases, it will consist of IRD bundles from already compatible with existing projects is critical. We estimate that installed libraries. While Papyri is not currently integrated with about 85%-90% of current documentation pages being built with Sphinx, RST and Numpydoc can be built with Papyri. Future work 5. One could have IRD bundles not attached to a particular library. For example, this can be done if an author wishes to provide only a set of examples includes extensions to be compatible with MyST (a project to or tutorials. We will not discuss this case further here. bring markdown syntax to Sphinx), but this is not a priority. PAPYRI: BETTER DOCUMENTATION FOR THE SCIENTIFIC ECOSYSTEM IN JUPYTER 79 To understand RST Syntax in narrative documentation, RST documents need to be parsed. To do so, Papyri uses tree-sitter [TS] and tree-sitter-rst [TSRST] projects, allowing us to extract an "Abstract Syntax Tree" (AST) from the text files. When using tree- sitter, AST nodes contain bytes-offsets into the original text buffer. Then one can easily "unparse" an AST node when necessary. 
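To make the byte-offset "unparse" step concrete, here is a minimal sketch of ours (not taken from the Papyri code base), assuming the py-tree-sitter bindings circa 2022 (~0.20) and a local checkout of tree-sitter-rst compiled into a shared library; paths and the build step are assumptions, and the actual Papyri internals may differ.

from tree_sitter import Language, Parser

# Assumption: tree-sitter-rst has been cloned into vendor/ and is
# compiled here into a loadable grammar (older py-tree-sitter API).
Language.build_library("build/languages.so", ["vendor/tree-sitter-rst"])
RST = Language("build/languages.so", "rst")

parser = Parser()
parser.set_language(RST)

source = b".. note::\n\n   Parsed with tree-sitter-rst.\n"
tree = parser.parse(source)

def unparse(node):
    # Nodes carry byte offsets into the original buffer, so the exact
    # original text of any subtree can always be recovered.
    return source[node.start_byte:node.end_byte].decode()

for child in tree.root_node.children:
    print(child.type, repr(unparse(child)))

Because the original bytes are always recoverable from a node, heuristics can rewrite a problematic span and re-parse only that span, which is the strategy described next.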
Working at the AST level in this way is relatively convenient for handling custom directives and edge cases (for instance, when projects rely on a loose definition of the RST syntax). Let us provide an example. RST directives are usually of the form:

.. directive:: arguments
    body

While technically there should be no space before the ::, Docutils and Sphinx will not raise errors when building the documentation if one is present. Due to our choice of a rigid (but unified) structure, tree-sitter emits an error node if there is an extra space. This allows us to check for error nodes, unparse, apply heuristics to restore proper syntax, and then parse again to obtain the new node.

Alternatively, a number of directives, such as warnings and note admonitions, still contain valid RST. Instead of storing the directive with its raw text, we parse the full document (potentially finding invalid syntax) and unparse back to the raw text only if the directive requires it.

Serialisation of the data structures into IRD files currently uses a custom serialiser; future work may include swapping to msgspec [msgspec]. The AST objects are completely typed; however, they contain a number of unions and sequences of unions. It turns out that many frameworks, like pydantic [pydantic], do not support sequences of unions where each item in the union may be of a different type. To our knowledge, only a few other documentation-related projects treat the AST as an intermediate object with a stable format that can be manipulated by external tools; the most popular one is Pandoc [pandoc], a project meant to convert between many document formats.

The current Papyri strategy is to type-infer all code examples with Jedi [JEDI], and to pre-syntax-highlight them using Pygments when possible.

IRD File Installation

Download and installation of IRD files is done concurrently using httpx [httpx], with Trio [Trio] as an async framework, allowing us to fetch many files at once. The current implementation of Papyri targets Python documentation and is written in Python. We can therefore query the versions of the Python libraries that are installed, and infer the appropriate version of the requested documentation. At the moment, the implementation tentatively guesses the relevant library versions when the exact version number is missing from the install command.

For convenience and performance, IRD bundles are post-processed and stored in a different format at installation time. For local rendering, we mostly need to perform the following operations:

1) query graph information about cross-links across documents;
2) render a single page;
3) access raw data (e.g. images).

We also assume that IRD files may be infrequently updated, that disk space is limited, and that installing or running services (like a database server) is not necessarily possible. This provides an adapted framework to test Papyri on an end-user machine.

With those requirements, we decided to use a combination of SQLite (an in-process database engine), Concise Binary Object Representation (CBOR), and raw storage, to better reflect the access patterns (see Figure 3).

Fig. 3: Sketch representing how Papyri stores information in three different formats depending on access patterns: a SQLite database for relationship information, on-disk CBOR files for more compact storage of IRD, and raw files (e.g. images). A GraphStore API abstracts all access and takes care of maintaining consistency.

SQLite allows us to easily query for object existence and for graph information (relationships between objects) at runtime; it is optimized for infrequent read access. Currently, many queries are done at runtime, when rendering documentation. The goal is to move most of the SQLite information-resolving step (such as looking for inter-library links) to installation time, once the code base and the IRD format have stabilized. SQLite is less strongly typed than other relational or graph databases and needs custom logic, but it is ubiquitous on all systems and does not need a separate server process, making it an easy choice of database.

CBOR is a more space-efficient alternative to JSON. In particular, keys in IRD are often highly redundant and encode compactly in CBOR. Storing IRD in CBOR thus reduces disk usage, and can also allow faster deserialization without requiring potentially CPU-intensive compression and decompression. This is a good compromise for potentially low-performance users' machines.

Raw storage is used for binary blobs which need to be accessed without further processing. This typically refers to images, and raw storage can be accessed with standard tools like image viewers.

Finally, access to all of these resources is provided via an internal GraphStore API which is agnostic of the backend, but ensures the consistency of operations like adding, removing, or replacing documents. Figure 3 summarizes this process.

Of course, the above choices depend on the context in which documentation is rendered and viewed. For example, an online archive intended for browsing the documentation of multiple projects and versions may decide to use an actual graph database for object relationships, and store other files on a Content Delivery Network or blob storage for random access.
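As an illustration of this storage split, the sketch below (our own toy example, not Papyri's actual schema or file layout) stores a small IRD-like document as CBOR on disk, records its outgoing links in SQLite, and keeps a binary asset as a raw file; it uses the third-party cbor2 package and the standard-library sqlite3 module.

import json
import pathlib
import sqlite3
import cbor2

store = pathlib.Path("docstore")
store.mkdir(exist_ok=True)

# Hypothetical IRD-like payload for one API page.
page = {
    "qualname": "mylib.solve",
    "sections": {"Summary": "Solve the thing.", "Examples": ">>> solve(1)"},
    "links": ["mylib.Solver", "numpy.ndarray"],
}

# Compact on-disk representation: CBOR is usually smaller than the
# equivalent JSON text for key-heavy payloads like this one.
encoded = cbor2.dumps(page)
(store / "mylib.solve.cbor").write_bytes(encoded)
print("cbor:", len(encoded), "json:", len(json.dumps(page).encode()))

# Relationship information goes to SQLite, so existence and
# cross-link queries stay cheap at render time.
db = sqlite3.connect(str(store / "links.db"))
db.execute("CREATE TABLE IF NOT EXISTS links (src TEXT, dest TEXT)")
db.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [(page["qualname"], dest) for dest in page["links"]],
)
db.commit()

# Binary assets stay as raw files, readable by standard tools.
(store / "figure.png").write_bytes(b"\x89PNG placeholder")

# Reading a page back only requires decoding one CBOR blob.
print(cbor2.loads((store / "mylib.solve.cbor").read_bytes())["qualname"])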
Documentation Rendering

The current Papyri implementation includes a number of rendering engines (presented below). Each of them essentially consists of fetching a single page with its metadata, walking through the IRD AST tree, and rendering each node according to the user's preferences.

• An ASCII terminal renderer uses Jinja2 [Jinja2]. This can be useful for piping documentation to other tools like grep, less or cat, and for working in a highly restricted environment while making sure that reading the documentation remains coherent. It can also serve as a proxy for screen reading.

• A textual user interface browser renders using urwid. Navigation within the terminal is possible, one can reflow long lines when windows are resized, and even open image files in external editors. Nonetheless, several bugs have been encountered in urwid. The project aims at replacing the CLI IPython question mark operator (obj?) interface (which currently only shows raw docstrings), and at rewriting this urwid interface with Rich/Textual. For this interface, having images stored raw on disk is useful, as it allows us to call directly into a system image viewer to display them.

• A just-in-time (JIT) rendering engine uses Jinja2, Quart [quart] and Trio; Quart is an async version of Flask [flask]. This option contains the most features and is therefore the main one used for development, as it lets us iterate on the rendering engine rapidly. When exploring the user interface design and navigation, we found that a plain list of back references has limited use: it can be challenging to judge the relevance of back references, as well as their relationship to each other. By playing with a network graph visualisation (see Figure 5), we can identify clusters of similar information within the back references. Of course, this identification has limits, especially when pages have a large number of back references (where the graph becomes too busy). It also illustrates a strength of the Papyri architecture: creating this network visualization did not require any regeneration of the documentation; one simply updates the template and re-renders the current page as needed.

• A static ahead-of-time (AOT) renderer for all the pages that can be rendered ahead of time uses the same classes as the JIT rendering. Basically, it loops through all entries in the SQLite database and renders each item independently. This renderer is mostly used for exhaustive testing and performance measurements of Papyri. It can render most of the API documentation of IPython, Astropy [astropy], Dask and Distributed [Dask], Matplotlib [MPL], [MPL-DOI], NetworkX [NX], NumPy [NP], pandas, Papyri, SciPy, scikit-image and others; this represents ~28000 pages in ~60 seconds (that is, ~450 pages/s on a recent MacBook Pro M1).

Fig. 5: Local graph (made with D3.js [D3js]) representing the connections among the most important nodes around the current page across many libraries, when viewing numpy.ndarray. Nodes are sized with respect to the number of incoming links, and colored with respect to their library. This graph is generated at rendering time and is updated depending on the libraries currently installed. It helps identify related functions and documentation, though it can become challenging to read for highly connected items, as seen here for numpy.ndarray.

For all of the above renderers, profiling shows that documentation rendering is mostly limited by object deserialisation from disk and by the Jinja2 templating engine. In the early phase of the project, we attempted to write a static HTML renderer in a compiled language (Rust, using compiled and type-checked templates). This provided a speedup of roughly a factor of 10; however, its implementation is now out of sync with the main Papyri code base.

Finally, a JupyterLab extension is currently in progress. The documentation presents itself as a side panel and is capable of basic browsing and rendering (see Figure 1 and Figure 4). The model uses TypeScript, React and native JupyterLab components. Future goals include improving or replacing JupyterLab's question mark operator (obj?) and the JupyterLab Inspector (when possible). A screenshot of the current development version of the JupyterLab extension can be seen in Figure 4.

Fig. 4: Example of the extended view of the Papyri documentation JupyterLab extension (here for SciPy). Code examples can now include plots, most tokens in each example are linked to the corresponding page, and an early navigation bar is visible at the top.

Challenges

We mentioned above some of the limitations we encountered (in rendering usage, for instance) and what will be done in the future to address them. We describe below some limitations related to syntax choices, and broader opportunities that arise from the Papyri project.

Limitations

The decoupling of the building and rendering phases is key in Papyri. However, it requires us to come up with a method that uniquely identifies each object. In particular, this is essential in order to link to any object's documentation without accessing the IRD bundles built from all the libraries. To that aim, we use the fully qualified name of an object: each object is identified by the concatenation of the module in which it is defined with its local name. Nonetheless, several particular cases need specific treatment (a short sketch of the naming problem follows this list).

• To mirror the Python syntax, it is easy to use . to concatenate both parts. Unfortunately, this leads to ambiguity when a module re-exports a function that has the same name as a submodule. For example, if one types

# module mylib/__init__.py
from .mything import mything

then mylib.mything is ambiguous between the mything submodule and the re-exported object. In future versions, the chosen convention will use : as the module/name separator.

• Decorated functions and other dynamic approaches to exposing functions to users end up having <locals> in their fully qualified names, which is invalid.

• Many built-in functions (np.sin, np.cos, etc.) do not have a fully qualified name that can be extracted by object introspection. We believe it should be possible to identify those via other means, like a docstring hash (to be explored).

• Fully qualified names are often not the canonical names (i.e. the name typically used for imports). While we made efforts to create a mapping from one to the other, finding the canonical name automatically is not always straightforward.

• There are also challenges with case sensitivity. For example, on macOS file systems, two distinct objects may unfortunately map to the same IRD file on disk. To address this, a case-sensitive hash is appended at the end of the filename.

• Many libraries have a syntax that looks right once rendered to HTML while not following proper RST syntax, or a syntax that relies on specifics of Docutils' and Sphinx's rendering and parsing.

• Many custom directive plugins cannot be reused from Sphinx; these will need to be reimplemented.
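To make the naming problem concrete, here is a small self-contained sketch of ours (mylib is a made-up package, built in memory for the example) showing how introspection yields a fully qualified name that differs from the canonical import path, and why a mapping between the two is needed. The : separator mirrors the convention mentioned above.

import types

# Hypothetical layout: mylib/_impl.py defines `solve`,
# and mylib/__init__.py re-exports it.
_impl = types.ModuleType("mylib._impl")
exec("def solve(x):\n    return x", _impl.__dict__)
mylib = types.ModuleType("mylib")
mylib.solve = _impl.solve

obj = mylib.solve
fully_qualified = f"{obj.__module__}:{obj.__qualname__}"
canonical = "mylib:solve"  # what users actually import and type

print(fully_qualified)  # "mylib._impl:solve", not what users see
print(canonical)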
Future possibilities

Beyond what has been presented in this paper, there are several opportunities to improve and extend what Papyri can offer the scientific Python ecosystem.

The first area is the ability to build IRD bundles on continuous integration platforms. Services like GitHub Actions, Azure Pipelines and many others are already set up to test packages; we hope to leverage this infrastructure to build IRD files and make them available to users.

A second area is the hosting of intermediate IRD files. While the current prototype is hosted as an HTTP index using GitHub Pages, this is likely not a sustainable hosting platform, as disk space is limited. To our knowledge, IRD files are smaller than HTML documentation, and we hope that other platforms like Read the Docs can be leveraged. This could provide a single domain that renders the documentation for multiple libraries, avoiding many per-library subdomains and giving users a more unified experience.

It should also become possible for projects to avoid the many dynamic docstring interpolations that are used to document *args and **kwargs. This would make sources easier to read, and potentially speed up library import time. Once a library (and its users) rely on an IDE that supports Papyri for documentation, the docstring syntax could even be exchanged for Markdown.

As IRD files are structured, it should be feasible to provide cross-version information in the documentation. For example, if one installs multiple versions of the IRD bundles for a library and is not using the latest version, the renderer could inspect IRD files from previous and future versions to indicate the range of versions for which the documentation has not changed. With additional effort, it should be possible to infer when a parameter was removed or will be removed, or to simply display the difference between two versions.

Conclusion

To address some of the current limitations in documentation accessibility, building and maintenance, we have provided a new documentation framework called Papyri. We presented its features and underlying implementation choices (such as crosslink maintenance, decoupling the building and rendering phases, enriching the rendering features, and using the IRD format to create a unified syntax structure). While the project is still at an early stage, clear impacts can already be seen on the availability of high-quality documentation for end-users, and on the workload reduction for maintainers. Building the IRD format has opened a wide range of technical possibilities and contributes to improving users' experience (and therefore the success of the scientific Python ecosystem). This may become necessary for users to navigate an exponentially growing ecosystem.

Acknowledgments

The authors want to thank S. Gallegos (author of tree-sitter-rst), J. L. Cano Rodríguez and E. Holscher (Read the Docs), C. Holdgraf (2i2c), B. Granger and F. Pérez (Jupyter Project), and T. Allard and I. Presedo-Floyd (QuanSight) for their useful feedback and help on this project.

Funding

M. B. received a 2-year grant from the Chan Zuckerberg Initiative (CZI) Essential Open Source Software for Science (EOSS), EOSS4-0000000017, via the NumFOCUS 501(c)(3) non-profit, to develop the Papyri project.

REFERENCES
[AOT] https://en.wikipedia.org/wiki/Ahead-of-time_compilation
[CFRG] conda-forge community. The conda-forge Project: Community-based Software Distribution Built on the conda Package Format and Ecosystem. Zenodo, 2015. http://doi.org/10.5281/zenodo.4774216
[CODEMETA] https://codemeta.github.io/
[D3js] https://d3js.org/
[DOCREPR] https://github.com/spyder-ide/docrepr
[DT] https://diataxis.fr/
[Dask] Dask Development Team (2016). Dask: Library for dynamic task scheduling. https://dask.org
[IR] https://en.wikipedia.org/wiki/Intermediate_representation
[JEDI] https://github.com/davidhalter/jedi
[JIT] https://en.wikipedia.org/wiki/Just-in-time_compilation
[JPYBOOK] https://jupyterbook.org/
[Jinja2] https://jinja.palletsprojects.com/
[LTO] https://en.wikipedia.org/wiki/Interprocedural_optimization
[MPL-DOI] https://doi.org/10.5281/zenodo.6513224
[MPL] J. D. Hunter. "Matplotlib: A 2D Graphics Environment". Computing in Science & Engineering, 9(3):90-95, 2007.
[MYST] https://myst-parser.readthedocs.io/en/latest/
[NPDOC] https://numpydoc.readthedocs.io/en/latest/format.html
[NP] C. R. Harris, K. J. Millman, S. J. van der Walt, et al. Array programming with NumPy. Nature 585, 357-362, 2020. doi:10.1038/s41586-020-2649-2
[NX] A. A. Hagberg, D. A. Schult, and P. J. Swart. "Exploring network structure, dynamics, and function using NetworkX". Proceedings of the 7th Python in Science Conference (SciPy 2008), pp. 11-15, Pasadena, CA, USA, 2008.
[Papyri] https://github.com/jupyter/papyri
[RTD-theme] https://sphinx-rtd-theme.readthedocs.io/en/stable/
[RTD] https://readthedocs.org/
[SCU] https://en.wikipedia.org/wiki/Single_Compilation_Unit
[SP] P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3):261-272, 2020. doi:10.1038/s41592-019-0686-2
[Spyder] https://www.spyder-ide.org/
[TSRST] https://github.com/stsewd/tree-sitter-rst
[TS] https://tree-sitter.github.io/tree-sitter/
[astropy] The Astropy Project: Building an inclusive, open-science project and status of the v2.0 core package. https://doi.org/10.48550/arXiv.1801.02634
[docutils] https://docutils.sourceforge.io/
[flask] https://flask.palletsprojects.com/en/2.1.x/
[httpx] https://www.python-httpx.org/
[mkdocs] https://www.mkdocs.org/
[msgspec] https://pypi.org/project/msgspec
[pandoc] https://pandoc.org/
[pydantic] https://pydantic-docs.helpmanual.io/
[pydata-sphinx-theme] https://pydata-sphinx-theme.readthedocs.io/en/stable/
[quart] https://pgjones.gitlab.io/quart/
[sphinx-copybutton] https://sphinx-copybutton.readthedocs.io/en/latest/
[sphinx] https://www.sphinx-doc.org/en/master/
[Trio] https://trio.readthedocs.io/

Bayesian Estimation and Forecasting of Time Series in statsmodels

Chad Fulton (Federal Reserve Board of Governors). Corresponding author: chad.t.fulton@frb.gov. Copyright © 2022 Chad Fulton. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Statsmodels, a Python library for statistical and econometric analysis, has traditionally focused on frequentist inference, including in its models for time series data. This paper introduces the powerful features for Bayesian inference of time series models that exist in statsmodels, with applications to model fitting, forecasting, time series decomposition, data simulation, and impulse response functions.

Index Terms—time series, forecasting, Bayesian inference, Markov chain Monte Carlo, statsmodels

Introduction

Statsmodels [SP10] is a well-established Python library for statistical and econometric analysis, with support for a wide range of important model classes, including linear regression, ANOVA, generalized linear models (GLM), generalized additive models (GAM), mixed effects models, and time series models, among many others. In most cases, model fitting proceeds by using frequentist inference, such as maximum likelihood estimation (MLE). In this paper, we focus on the class of time series models [MPS11], support for which has grown substantially in statsmodels over the last decade. After introducing several of the most important new model classes (which are by default fitted using MLE) and their features (which include forecasting, time series decomposition and seasonal adjustment, data simulation, and impulse response analysis), we describe the powerful functions that enable users to apply Bayesian methods to a wide range of time series models.

Support for Bayesian inference in Python outside of statsmodels has also grown tremendously, particularly in the realm of probabilistic programming, and includes powerful libraries such as PyMC3 [SWF16], PyStan [CGH+17], and TensorFlow Probability [DLT+17]. Meanwhile, ArviZ [KCHM19] provides many excellent tools for the associated diagnostics and visualisations. The aim of these libraries is to provide support for Bayesian analysis of a large class of models, and they make available both advanced techniques, including auto-tuning algorithms, and flexible model specification. By contrast, here we focus on simpler techniques. However, while the libraries above do include some support for time series models, this has not been their primary focus. As a result, introducing Bayesian inference for the well-developed stable of time series models in statsmodels, and providing access to the rich associated feature set already mentioned, presents a complementary option to these more general-purpose libraries.1

1. In addition, it is possible to combine the sampling algorithms of PyMC3 with the time series models of statsmodels, although we will not discuss this approach in detail here. See, for example, https://www.statsmodels.org/v0.13.0/examples/notebooks/generated/statespace_sarimax_pymc3.html.

Time series analysis in statsmodels

A time series is a sequence of observations ordered in time, and time series data appear commonly in statistics, economics, finance, climate science, control systems, and signal processing, among many other fields. One distinguishing characteristic of many time series is that observations that are close in time tend to be more correlated, a feature known as autocorrelation. While successful analyses of time series data must account for this, statistical models can also harness it to decompose a time series into trend, seasonal, and cyclical components, produce forecasts of future data, and study the propagation of shocks over time.

We now briefly review the models for time series data that are available in statsmodels and describe their features.2

2. In addition to statistical models, statsmodels also provides a number of tools for exploratory data analysis, diagnostics, and hypothesis testing related to time series data; see https://www.statsmodels.org/stable/tsa.html.

Exponential smoothing models

Exponential smoothing models are constructed by combining one or more simple equations that each describe some aspect of the evolution of univariate time series data. While originally somewhat ad hoc, these models can be defined in terms of a proper statistical model (for example, see [HKOS08]). They have enjoyed considerable popularity in forecasting (for example, see the implementation in R described by [HA18]). A prototypical example that allows for trending data and a seasonal component, often known as the additive "Holt-Winters' method", can be written as

l_t = α (y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1})
b_t = β (l_t − l_{t−1}) + (1 − β) b_{t−1}
s_t = γ (y_t − l_{t−1} − b_{t−1}) + (1 − γ) s_{t−m}

where l_t is the level of the series, b_t is the trend, s_t is the seasonal component of period m, and α, β, γ are parameters of the model. When augmented with an error term following some given probability distribution (usually Gaussian), likelihood-based inference can be used to estimate the parameters.
In statsmodels, additive exponential smoothing models can be constructed using the statespace.ExponentialSmoothing class.3 The following code shows how to apply the additive Holt-Winters model above to quarterly data on consumer prices:

import numpy as np
import statsmodels.api as sm

# Load data
mdata = sm.datasets.macrodata.load().data
# Compute annualized consumer price inflation
y = np.log(mdata['cpi']).diff().iloc[1:] * 400

# Construct the Holt-Winters model
model_hw = sm.tsa.statespace.ExponentialSmoothing(
    y, trend=True, seasonal=12)

3. A second class, ETSModel, can also be used for both additive and multiplicative models, and can exhibit superior performance with maximum likelihood estimation. However, it lacks some of the features relevant for Bayesian inference discussed in this paper.

Structural time series models

Structural time series models, introduced by [Har90] and sometimes known as unobserved components models, similarly decompose a univariate time series into trend, seasonal, cyclical, and irregular components:

y_t = µ_t + γ_t + c_t + ε_t

where µ_t is the trend, γ_t is the seasonal component, c_t is the cyclical component, and ε_t ∼ N(0, σ²) is the error term. However, this equation can be augmented in many ways, for example to include explanatory variables or an autoregressive component. In addition, there are many possible specifications for the trend, seasonal, and cyclical components, so that a wide variety of time series characteristics can be accommodated. In statsmodels, these models can be constructed from the UnobservedComponents class; a few examples are given in the following code:

# "Local level" model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# "Local linear trend", with seasonal component
model_llts = sm.tsa.UnobservedComponents(
    y, 'lltrend', seasonal=4)

These models have become popular for time series analysis and forecasting, as they are flexible and the estimated components are intuitive. Indeed, Google's CausalImpact library [BGK+15] uses a Bayesian structural time series approach directly, and Facebook's Prophet library [TL17] uses a conceptually similar framework that is estimated using PyStan.

Autoregressive moving-average models

Autoregressive moving-average (ARMA) models, ubiquitous in time series applications, are well supported in statsmodels, including their generalizations, abbreviated as "SARIMAX", that allow for integrated time series data, explanatory variables, and seasonal effects.4 A general version of this model, excluding integration, can be written as

y_t = x_t β + ξ_t
ξ_t = φ_1 ξ_{t−1} + · · · + φ_p ξ_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}

where ε_t ∼ N(0, σ²). These are constructed in statsmodels with the ARIMA class; the following code shows how to construct a variety of autoregressive moving-average models for the consumer price data:

# AR(2) model
model_ar2 = sm.tsa.ARIMA(y, order=(2, 0, 0))

# ARMA(1, 1) model with explanatory variable
X = mdata['realint']
model_arma11 = sm.tsa.ARIMA(
    y, order=(1, 0, 1), exog=X)

# SARIMAX(p, d, q)x(P, D, Q, s) model
model_sarimax = sm.tsa.ARIMA(
    y, order=(p, d, q), seasonal_order=(P, D, Q, s))

While this class of models often produces highly competitive forecasts, it does not produce a decomposition of a time series into, for example, trend and seasonal components.

4. Note that in statsmodels, models with explanatory variables take the form of "regression with SARIMA errors".

Vector autoregressive models

While the SARIMAX models above handle univariate series, statsmodels also has support for the multivariate generalization to vector autoregressive (VAR) models.5 These models are written

y_t = ν + Φ_1 y_{t−1} + · · · + Φ_p y_{t−p} + ε_t

where y_t is now considered an m × 1 vector. As a result, the intercept ν is also an m × 1 vector, the coefficients Φ_i are each m × m matrices, and the error term is ε_t ∼ N(0_m, Ω), with Ω an m × m matrix. These models can be constructed in statsmodels using the VARMAX class, as follows:6

# Multivariate dataset
z = (np.log(mdata[['realgdp', 'realcons', 'cpi']])
     .diff().iloc[1:])

# VAR(1) model
model_var = sm.tsa.VARMAX(z, order=(1, 0))

5. statsmodels also supports vector moving-average (VMA) models using the same model class as described here for the VAR case but, for brevity, we do not explicitly discuss them here.
6. A second class, VAR, can also be used to fit VAR models, using least squares. However, it lacks some of the features relevant for Bayesian inference discussed in this paper.

Dynamic factor models

statsmodels also supports a second class of multivariate time series models: the dynamic factor model (DFM). These models, often used for dimension reduction, posit a few unobserved factors, with autoregressive dynamics, that are used to explain the variation in the observed dataset. In statsmodels, there are two model classes, DynamicFactor and DynamicFactorMQ, that can fit versions of the DFM. Here we focus on the DynamicFactor class, for which the model can be written

y_t = Λ f_t + ε_t
f_t = Φ_1 f_{t−1} + · · · + Φ_p f_{t−p} + η_t

Here again, the observation y_t is m × 1, but the factors f_t are k × 1, where it is possible that k << m. As before, we assume conformable coefficient matrices and Gaussian errors. The following code shows how to construct a DFM in statsmodels:

# DFM with 2 factors that evolve as a VAR(3)
model_dfm = sm.tsa.DynamicFactor(
    z, k_factors=2, factor_order=3)

Linear Gaussian state space models

In statsmodels, each of the model classes introduced above (statespace.ExponentialSmoothing, UnobservedComponents, ARIMA, VARMAX, DynamicFactor, and DynamicFactorMQ) is implemented as part of a broader class of models, referred to as linear Gaussian state space models (hereafter, for brevity, simply "state space models" or SSM). This class of models can be written as

y_t = d_t + Z_t α_t + ε_t,      ε_t ∼ N(0, H_t)
α_{t+1} = c_t + T_t α_t + R_t η_t,  η_t ∼ N(0, Q_t)

where α_t represents an unobserved vector containing the "state" of the dynamic system. In general, the model is multivariate, with y_t and ε_t m × 1 vectors, α_t k × 1, and η_t r × 1.

Fig. 1: Selected functionality of state space models in statsmodels.

Powerful tools exist for state space models to estimate the values of the unobserved state vector, compute the value of the likelihood function for frequentist inference, and perform posterior sampling for Bayesian inference. These tools include the celebrated Kalman filter and smoother and a simulation smoother, all of which are important for conducting Bayesian inference for these models.7 The implementation in statsmodels largely follows the treatment in [DK12], and is described in more detail in [Ful15].

7. Statsmodels currently contains two implementations of simulation smoothers for the linear Gaussian state space model. The default is the "mean correction" simulation smoother of [DK02]. The precision-based simulation smoother of [CJ09] can alternatively be used by specifying method='cfa' when creating the simulation smoother object.

In addition to these key tools, state space models also admit general implementations of useful features such as forecasting, data simulation, time series decomposition, and impulse response analysis. As a consequence, each of these features extends to each of the time series models described above. Figure 1 presents a diagram showing how to produce these features, and the code below briefly introduces a subset of them.

# Construct the model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')

# Construct a simulation smoother
sim_ll = model_ll.simulation_smoother()

# Parameter values (variance of error and
# variance of level innovation, respectively)
params = [4, 0.75]

# Compute the log-likelihood of these parameters
llf = model_ll.loglike(params)

# `smooth` applies the Kalman filter and smoother
# with a given set of parameters and returns a
# Results object
results_ll = model_ll.smooth(params)

# Produce forecasts for the next 4 periods
fcast = results_ll.forecast(4)

# Produce a draw from the posterior distribution
# of the state vector
sim_ll.simulate()
draw = sim_ll.simulated_state

Nearly identical code could be used for any of the model classes introduced above, since they are all implemented as part of the same state space model framework. In the next section, we show how these features can be used to perform Bayesian inference with these models.

Bayesian inference via Markov chain Monte Carlo

We begin by giving a cursory overview of the key elements of Bayesian inference required for our purposes here.8 In brief, the Bayesian approach stems from Bayes' theorem, in which the posterior distribution for an object of interest is derived as proportional to the combination of a prior distribution and the likelihood function:

p(A | B) ∝ p(B | A) × p(A)
(posterior ∝ likelihood × prior)

Here, we will be interested in the posterior distribution of the parameters of our model and of the unobserved states, conditional on the chosen model specification and the observed time series data. While in most cases the form of the posterior cannot be derived analytically, simulation-based methods such as Markov chain Monte Carlo (MCMC) can be used to draw samples that approximate the posterior distribution nonetheless. While PyMC3, PyStan, and TensorFlow Probability emphasize Hamiltonian Monte Carlo (HMC) and no-U-turn sampling (NUTS) MCMC methods, we focus on the simpler random walk Metropolis-Hastings (MH) and Gibbs sampling (GS) methods. These are standard MCMC methods that have enjoyed great success in time series applications and which are simple to implement, given the state space framework already available in statsmodels. In addition, the ArviZ library is designed to work with MCMC output from any source, and we can easily adapt it to our use.

8. While a detailed description of these issues is outside the scope of this paper, there are many superb references on this topic. We refer the interested reader to [WH99], which provides a book-length treatment of Bayesian inference for state space models, and [KN99], which provides many examples and applications.

With either Metropolis-Hastings or Gibbs sampling, our procedure will produce a chain of sample values (of the parameters and/or the unobserved state vector) that approximate draws from the posterior distribution arbitrarily well as the length of the chain of samples becomes very large.
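To make the prior-likelihood-posterior mechanics concrete before introducing the samplers, here is a small sketch of ours (not from the paper): for Gaussian data with known mean and an inverse-Gamma prior on the variance, the posterior is available in closed form, and this is exactly the conditional update that the Gibbs sampler shown below exploits.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=200)  # true variance = 4

# Inverse-Gamma(a0, b0) prior on the variance (weakly informative choice)
a0, b0 = 0.01, 0.01

# Conjugate update: posterior is Inverse-Gamma(a0 + n/2, b0 + sum(x^2)/2)
a_post = a0 + data.size / 2
b_post = b0 + np.sum(data**2) / 2
posterior = stats.invgamma(a_post, scale=b_post)

print(posterior.mean())         # close to the true variance of 4
print(posterior.interval(0.9))  # 90% credible interval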
Random walk Metropolis-Hastings

In random walk Metropolis-Hastings (MH), we begin with an arbitrary point as the initial sample, and then iteratively construct new samples in the chain as follows. At each iteration, (a) construct a proposal by perturbing the previous sample by a Gaussian random variable, and then (b) accept the proposal with some probability. If a proposal is accepted, it becomes the next sample in the chain, while if it is rejected the previous sample value is carried over. Here, we show how to implement Metropolis-Hastings estimation of the variance parameter in a simple model, which only requires the log-likelihood computation introduced above.

import arviz as az
from scipy import stats

# Construct the model
model_rw = sm.tsa.UnobservedComponents(y, 'rwalk')

# Specify the prior distribution. With MH, this
# can be freely chosen by the user
prior = stats.uniform(0.0001, 100)

# Specify the Gaussian perturbation distribution
perturb = stats.norm(scale=0.1)

# Storage
niter = 100000
samples_rw = np.zeros(niter + 1)

# Initialization
samples_rw[0] = y.diff().var()
llf = model_rw.loglike(samples_rw[0])
prior_llf = prior.logpdf(samples_rw[0])

# Iterations
for i in range(1, niter + 1):
    # Compute the proposal value
    proposal = samples_rw[i - 1] + perturb.rvs()

    # Compute the acceptance probability
    proposal_llf = model_rw.loglike(proposal)
    proposal_prior_llf = prior.logpdf(proposal)
    accept_prob = np.exp(
        proposal_llf - llf
        + proposal_prior_llf - prior_llf)

    # Accept or reject the value
    if accept_prob > stats.uniform.rvs():
        samples_rw[i] = proposal
        llf = proposal_llf
        prior_llf = proposal_prior_llf
    else:
        samples_rw[i] = samples_rw[i - 1]

# Convert for use with ArviZ and plot posterior
samples_rw = az.convert_to_inference_data(
    samples_rw)
# Eliminate the first 10000 samples as burn-in;
# thin by a factor of 10 to reduce autocorrelation
az.plot_posterior(samples_rw.posterior.sel(
    {'draw': np.s_[10000::10]}), kind='bin',
    point_estimate='median')

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 2.

Fig. 2: Approximate posterior distribution of the variance parameter, random walk model, Metropolis-Hastings; U.S. Industrial Production.

Gibbs sampling

Gibbs sampling (GS) is a special case of Metropolis-Hastings (MH) that is applicable when it is possible to produce draws directly from the conditional distributions of every variable, even though it is still not possible to derive the general form of the joint posterior. While this approach can be superior to random walk MH when it is applicable, the ability to derive the conditional distributions typically requires the use of a "conjugate" prior, i.e., a prior from some specific family of distributions. For example, above we specified a uniform distribution as the prior when sampling via MH, but that is not possible with Gibbs sampling. Here, we show how to implement Gibbs sampling estimation of the variance parameters, now making use of an inverse-Gamma prior and the simulation smoother introduced above.

# Construct the model and simulation smoother
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
sim_ll = model_ll.simulation_smoother()

# Specify the prior distributions. With GS, we must
# choose an inverse Gamma prior for each variance
priors = [stats.invgamma(0.01, scale=0.01)] * 2

# Storage
niter = 100000
samples_ll = np.zeros((niter + 1, 2))

# Initialization
samples_ll[0] = [y.diff().var(), 1e-5]

# Iterations
for i in range(1, niter + 1):
    # (a) Update the model parameters
    model_ll.update(samples_ll[i - 1])

    # (b) Draw from the conditional posterior of
    # the state vector
    sim_ll.simulate()
    sample_state = sim_ll.simulated_state.T

    # (c) Compute / draw from conditional posterior
    # of the parameters:
    # ...observation error variance
    resid = y - sample_state[:, 0]
    post_shape = len(resid) / 2 + 0.01
    post_scale = np.sum(resid**2) / 2 + 0.01
    samples_ll[i, 0] = stats.invgamma(
        post_shape, scale=post_scale).rvs()

    # ...level error variance
    resid = sample_state[1:] - sample_state[:-1]
    post_shape = len(resid) / 2 + 0.01
    post_scale = np.sum(resid**2) / 2 + 0.01
    samples_ll[i, 1] = stats.invgamma(
        post_shape, scale=post_scale).rvs()

# Convert for use with ArviZ and plot posterior
samples_ll = az.convert_to_inference_data(
    {'parameters': samples_ll[None, ...]},
    coords={'parameter': model_ll.param_names},
    dims={'parameters': ['parameter']})
az.plot_pair(samples_ll.posterior.sel(
    {'draw': np.s_[10000::10]}), kind='hexbin')

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 3.

Fig. 3: Approximate joint posterior distribution of the variance parameters, local level model, Gibbs sampling; CPI inflation.

Illustrative examples

For clarity and brevity, the examples in the previous section gave results for simple cases. However, these basic methods carry through to each of the models introduced earlier, including in cases with multivariate data and hundreds of parameters. Moreover, the Metropolis-Hastings approach can be combined with the Gibbs sampling approach, so that an end user who wishes to use Gibbs sampling for some parameters is not restricted to choosing conjugate priors for all parameters.

In addition to sampling the posterior distributions of the parameters, this method allows sampling other objects of interest, including forecasts of observed variables, impulse response functions, and the unobserved state vector. This last possibility is especially useful in cases such as the structural time series model, in which the unobserved states correspond to interpretable elements such as the trend and seasonal components. We provide several illustrative examples of the various types of analysis that are possible.

Forecasting and Time Series Decomposition

In our first example, we apply the Gibbs sampling approach to a structural time series model in order to forecast U.S. Industrial Production and to produce a decomposition of the series into level, trend, and seasonal components. The model is

y_t = µ_t + γ_t + ε_t        (observation equation)
µ_t = β_t + µ_{t−1} + ζ_t    (level)
β_t = β_{t−1} + ξ_t          (trend)
γ_t = γ_{t−s} + η_t          (seasonal)

Here, we set the seasonal periodicity to s = 12, since Industrial Production is a monthly variable. We can construct this model in statsmodels as9

model = sm.tsa.UnobservedComponents(
    y, 'lltrend', seasonal=12)

9. This model is often referred to as a "local linear trend" model (with, additionally, a seasonal component); lltrend is an abbreviation of this name.

To produce the time series decomposition into level, trend, and seasonal components, we will use samples from the posterior of the state vector (µ_t, β_t, γ_t) for each time period t. These are immediately available when using the Gibbs sampling approach; in the earlier example, the draw at each iteration was assigned to the variable sample_state. To produce forecasts, we need to draw from the posterior predictive distribution for horizons h = 1, 2, ..., H. This can easily be accomplished by using the simulate method introduced earlier. To be concrete, we can accomplish these tasks by modifying section (b) of our Gibbs sampler iterations as follows:

# (b') Draw from the conditional posterior of
# the state vector
model.update(params[i - 1])
sim.simulate()
# save the draw for use later in time series
# decomposition
states[i] = sim.simulated_state.T

# Draw from the posterior predictive distribution
# using the `simulate` method
n_fcast = 48
fcast[i] = model.simulate(
    params[i - 1], n_fcast,
    initial_state=states[i, -1]).to_frame()

These forecasts and the decomposition into level, trend, and seasonal components are summarized in Figures 4 and 5, which show the median values along with 80% credible intervals. Notably, the intervals shown incorporate both the uncertainty arising from the stochastic terms in the model and the need to estimate the models' parameters.10

Fig. 4: Data and forecast with 80% credible interval; U.S. Industrial Production.

Fig. 5: Estimated level, trend, and seasonal components, with 80% credible interval; U.S. Industrial Production.

10. The popular Prophet library [TL17] similarly uses an additive model combined with Bayesian sampling methods to produce forecasts and decompositions, although its underlying model is a GAM rather than a state space model.

Causal impacts

A closely related procedure described in [BGK+15] uses a Bayesian structural time series model to estimate the "causal impact" of some event on an observed variable. This approach stops estimation of the model just before the date of an event and produces a forecast by drawing from the posterior predictive density, using the procedure described just above. It then uses the difference between the actual path of the data and the forecast to estimate the impact of the event.

An example of this approach is shown in Figure 6, in which we use this method to illustrate the effect of the COVID-19 pandemic on U.S. Sales in Manufacturing and Trade Industries.11

Fig. 6: "Causal impact" of COVID-19 on U.S. Sales in Manufacturing and Trade Industries.

11. In this example, we used a local linear trend model with no seasonal component.

Extensions

There are many extensions to the time series models presented here that are made possible when using Bayesian inference. First, it is easy to create custom state space models within the statsmodels framework. As one example, the statsmodels documentation describes how to create a model that extends the typical VAR described above with time-varying parameters.12 These custom state space models automatically inherit all the functionality described above, so that Bayesian inference can be conducted in exactly the same way.

Second, because the general state space model available in statsmodels and introduced above allows for time-varying system matrices, it is possible using Gibbs sampling methods to introduce support for automatic outlier handling, stochastic volatility, and regime switching models, even though these are largely infeasible in statsmodels when using frequentist methods such as maximum likelihood estimation.13

12. For details, see https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html.
13. See, for example, [SW16] for an application of these techniques that handles outliers, [KSC98] for stochastic volatility, and [KN98] for an application to dynamic factor models with regime switching.

Conclusion

This paper introduces the suite of time series models available in statsmodels and shows how Bayesian inference using Markov chain Monte Carlo methods can be applied to estimate their parameters and produce analyses of interest, including time series decompositions and forecasts.

REFERENCES
[BGK+15] Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9:247-274, 2015. doi:10.1214/14-aoas788.
[CGH+17] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1), 2017. doi:10.18637/jss.v076.i01.
[CJ09] Joshua C. C. Chan and Ivan Jeliazkov. Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1-2):101-120, 2009. https://www.inderscienceonline.com/doi/abs/10.1504/IJMMNO.2009.03009.
[DK02] J. Durbin and S. J. Koopman. A simple and efficient simulation smoother for state space time series analysis. Biometrika, 89(3):603-616, 2002. doi:10.1093/biomet/89.3.603.
[DK12] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford University Press, 2012.
[DLT+17] Joshua V. Dillon, Ian Langmore, Dustin Tran, et al. TensorFlow Distributions. arXiv:1711.10604, 2017. doi:10.48550/arXiv.1711.10604.
[Ful15] Chad Fulton. Estimating time series models by state space methods in Python: statsmodels. 2015.
[HA18] Rob J. Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
[Har90] Andrew C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990.
[HKOS08] Rob Hyndman, Anne B. Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, 2008.
[KCHM19] Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo Martin. ArviZ: a unified library for exploratory analysis of Bayesian models in Python. Journal of Open Source Software, 4(33):1143, 2019. doi:10.21105/joss.01143.
[KN98] Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning Points, A New Coincident Index, and Tests of Duration Dependence Based on a Dynamic Factor Model With Regime Switching. The Review of Economics and Statistics, 80(2):188-201, 1998. doi:10.1162/003465398557447.
[KN99] Chang-Jin Kim and Charles R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. MIT Press, 1999.
[KSC98] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. The Review of Economic Studies, 65(3):361-393, 1998. doi:10.1111/1467-937X.00050.
[MPS11] Wes McKinney, Josef Perktold, and Skipper Seabold. Time Series Analysis in Python with statsmodels. Proceedings of the 10th Python in Science Conference, pp. 107-113, 2011. doi:10.25080/Majora-ebaa42b7-012.
[SP10] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference, pp. 92-96, 2010. doi:10.25080/Majora-92bf1922-011.
[SW16] James H. Stock and Mark W. Watson. Core Inflation and Trend Inflation. Review of Economics and Statistics, 98(4):770-784, 2016. doi:10.1162/REST_a_00608.
[SWF16] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016. doi:10.7717/peerj-cs.55.
[TL17] Sean J. Taylor and Benjamin Letham. Forecasting at scale. PeerJ Preprints e3190v2, 2017. doi:10.7287/peerj.preprints.3190v2.
[WH99] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer, New York, 2nd edition, 1999.
Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, June 2008. Google-Books-ID: GSyzox8Lu9YC. [KCHM19] Ravin Kumar, Colin Carroll, Ari Hartikainen, and Osvaldo Mar- tin. ArviZ a unified library for exploratory analysis of Bayesian models in Python. Journal of Open Source Software, 4(33):1143, 2019. Publisher: The Open Journal. URL: https://doi.org/10. 21105/joss.01143, doi:10.21105/joss.01143. [KN98] Chang-Jin Kim and Charles R. Nelson. Business Cycle Turning Points, A New Coincident Index, and Tests of Duration Depen- dence Based on a Dynamic Factor Model With Regime Switch- ing. The Review of Economics and Statistics, 80(2):188–201, May 1998. Publisher: MIT Press. URL: https://doi.org/10.1162/ 003465398557447, doi:10.1162/003465398557447. [KN99] Chang-Jin Kim and Charles R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. MIT Press Books, The MIT Press, 1999. URL: http://ideas.repec.org/b/mtp/titles/0262112388.html. [KSC98] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. The Review of Economic Studies, 65(3):361–393, July 1998. 01855. URL: http://restud.oxfordjournals.org/content/65/ 3/361, doi:10.1111/1467-937X.00050. [MPS11] Wes McKinney, Josef Perktold, and Skipper Seabold. Time Series Analysis in Python with statsmodels. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 10th Python in Science Conference, pages 107 – 113, 2011. doi:10.25080/ Majora-ebaa42b7-012. [SP10] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and Statistical Modeling with Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 92 – 96, 2010. doi:10.25080/Majora- 92bf1922-011. [SW16] James H. Stock and Mark W. Watson. Core Inflation and Trend Inflation. Review of Economics and Statistics, 98(4):770–784, March 2016. 00000. URL: http://dx.doi.org/10.1162/REST_a_ 00608, doi:10.1162/REST_a_00608. 90 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Python vs. the pandemic: a case study in high-stakes software development Cliff C. Kerr‡§∗ , Robyn M. Stuart¶k , Dina Mistry∗∗ , Romesh G. Abeysuriyak , Jamie A. Cohen‡ , Lauren George†† , Michał Jastrzebski‡‡ , Michael Famulare‡ , Edward Wenger‡ , Daniel J. Klein‡ F Abstract—When it became clear in early 2020 that COVID-19 was going to modeling, and drug discovery made it well placed to contribute to be a major public health threat, politicians and public health officials turned to a global pandemic response plan. Founded in 2008, the Institute academic disease modelers like us for urgent guidance. Academic software for Disease Modeling (IDM) has provided analytical support for development is typically a slow and haphazard process, and we realized that BMGF (which it has been a part of since 2020) and other global business-as-usual would not suffice for dealing with this crisis. Here we describe health partners, with a focus on eradicating malaria and polio. the case study of how we built Covasim (covasim.org), an agent-based model of COVID-19 epidemiology and public health interventions, by using standard Since its creation, IDM has built up a portfolio of computational Python libraries like NumPy and Numba, along with less common ones like tools to understand, analyze, and predict the dynamics of different Sciris (sciris.org). 
Covasim was created in a few weeks, an order of magnitude diseases. faster than the typical model development process, and achieves performance When "coronavirus disease 2019" (COVID-19) and the virus comparable to C++ despite being written in pure Python. It has become one that causes it (SARS-CoV-2) were first identified in late 2019, of the most widely adopted COVID models, and is used by researchers and our team began summarizing what was known about the virus policymakers in dozens of countries. Covasim’s rapid development was enabled [Fam19]. By early February 2020, even though it was more than not only by leveraging the Python scientific computing ecosystem, but also by a month before the World Health Organization (WHO) declared adopting coding practices and workflows that lowered the barriers to entry for a pandemic [Med20], it had become clear that COVID-19 would scientific contributors without sacrificing either performance or rigor. become a major public health threat. The outbreak on the Diamond Index Terms—COVID-19, SARS-CoV-2, Epidemiology, Mathematical modeling, Princess cruise ship [RSWS20] was the impetus for us to start NumPy, Numba, Sciris modeling COVID in detail. Specifically, we needed a tool to (a) incorporate new data as soon as it became available, (b) explore policy scenarios, and (c) predict likely future epidemic trajectories. Background The first step was to identify which software tool would form For decades, scientists have been concerned about the possibility the best starting point for our new COVID model. Infectious of another global pandemic on the scale of the 1918 flu [Gar05]. disease models come in two major types: agent-based models track Despite a number of "close calls" – including SARS in 2002 the behavior of individual "people" (agents) in the simulation, [AFG+ 04]; Ebola in 2014-2016 [Tea14]; and flu outbreaks in- with each agent’s behavior represented by a random (probabilis- cluding 1957, 1968, and H1N1 in 2009 [SHK16], some of which tic) process. Compartmental models track populations of people led to 1 million or more deaths – the last time we experienced over time, typically using deterministic difference equations. The the emergence of a planetary-scale new pathogen was when HIV richest modeling framework used by IDM at the time was EMOD, spread globally in the 1980s [CHL+ 08]. which is a multi-disease agent-based model written in C++ and In 2015, Bill Gates gave a TED talk stating that the world was based on JSON configuration files [BGB+ 18]. We also considered not ready to deal with another pandemic [Hof20]. While the Bill Atomica, a multi-disease compartmental model written in Python & Melinda Gates Foundation (BMGF) has not historically focused and based on Excel input files [KAK+ 19]. However, both of on pandemic preparedness, its expertise in disease surveillance, these options posed significant challenges: as a compartmental model, Atomica would have been unable to capture the individual- * Corresponding author: cliff@covasim.org level detail necessary for modeling the Diamond Princess out- ‡ Institute for Disease Modeling, Bill & Melinda Gates Foundation, Seattle, break (such as passenger-crew interactions); EMOD had sufficient USA flexibility, but developing new disease modules had historically § School of Physics, University of Sydney, Sydney, Australia ¶ Department of Mathematical Sciences, University of Copenhagen, Copen- required months rather than days. 
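To make the distinction between the two model types concrete, the following toy sketch (not code from Covasim, EMOD, or Atomica; all names and parameter values are invented) contrasts a deterministic compartmental update with a stochastic per-agent update:

import numpy as np

def compartmental_step(S, I, R, beta=0.3, gamma=0.1):
    # Deterministic difference equations: only population totals are tracked
    N = S + I + R
    new_infections = beta * S * I / N
    new_recoveries = gamma * I
    return S - new_infections, I + new_infections - new_recoveries, R + new_recoveries

def agent_step(infected, rng, beta=0.3, gamma=0.1):
    # Stochastic per-agent updates: each agent is simulated explicitly with its own draws
    # (recovered agents can be reinfected in this toy version)
    prob_infection = beta * infected.mean()
    new_infections = ~infected & (rng.random(infected.size) < prob_infection)
    recoveries = infected & (rng.random(infected.size) < gamma)
    return (infected | new_infections) & ~recoveries

rng = np.random.default_rng(1)
agents = rng.random(10_000) < 0.01   # 1% of agents start out infected
for day in range(60):
    agents = agent_step(agents, rng)

The agent-based form is what makes individual-level detail, such as the passenger-crew structure on the Diamond Princess, possible, at the cost of tracking every agent explicitly.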
hagen, Denmark As a result, we instead started developing Covasim ("COVID- || Burnet Institute, Melbourne, Australia 19 Agent-based Simulator") [KSM+ 21] from a nascent agent- ** Twitter, Seattle, USA based model written in Python, LEMOD-FP ("Light-EMOD for †† Microsoft, Seattle, USA ‡‡ GitHub, San Francisco, USA Family Planning"). LEMOD-FP was used to model reproductive health choices of women in Senegal; this model had in turn Copyright © 2022 Cliff C. Kerr et al. This is an open-access article distributed been based on an even simpler agent-based model of measles under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the vaccination programs in Nigeria ("Value-of-Information Simula- original author and source are credited. tor" or VoISim). We subsequently applied the lessons we learned PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 91 scientific computing libraries. Software architecture and implementation Covasim conceptual design and usage Covasim is a standard susceptible-exposed-infectious-recovered (SEIR) model (Fig. 3). As noted above, it is an agent-based model, meaning that individual people and their interactions with one another are simulated explicitly (rather than implicitly, as in a compartmental model). The fundamental calculation that Covasim performs is to determine the probability that a given person, on a given time step, will change from one state to another, such as from susceptible to exposed (i.e., that person was infected), from undiagnosed to diagnosed, or from critically ill to dead. Covasim is fully open- source and available on GitHub (http://covasim.org) and PyPI (pip install covasim), and comes with comprehensive documentation, including tutorials (http://docs.covasim.org). The first principle of Covasim’s design philosophy is that "Common tasks should be simple" – for example, defining pa- rameters, running a simulation, and plotting results. The following example illustrates this principle; it creates a simulation with a custom parameter value, runs it, and plots the results: Fig. 1: Daily reported global COVID-19-related deaths (top; import covasim as cv smoothed with a one-week rolling window), relative to the timing of cv.Sim(pop_size=100e3).run().plot() known variants of concern (VOCs) and variants of interest (VOIs), as The second principle of Covasim’s design philosophy is "Un- well as Covasim releases (bottom). common tasks can’t always be simple, but they still should be possible." Examples include writing a custom goodness-of-fit from developing Covasim to turn LEMOD-FP into a new family function or defining a new population structure. To some extent, planning model, "FPsim", which will be launched later this year the second principle is at odds with the first, since the more [OVCC+ 22]. flexibility an interface has, typically the more complex it is as Parallel to the development of Covasim, other research teams well. at IDM developed their own COVID models, including one based To illustrate the tension between these two principles, the on the EMOD framework [SWC+ 22], and one based on an earlier following code shows how to run two simulations to determine the influenza model [COSF20]. However, while both of these models impact of a custom intervention aimed at protecting the elderly in saw use in academic contexts [KCP+ 20], neither were able to Japan, with results shown in Fig. 
4: incorporate new features quickly enough, or were easy enough to import covasim as cv use, for widespread external adoption in a policy context. # Define a custom intervention Covasim, by contrast, had immediate real-world impact. The def elderly(sim, old=70): first version was released on 10 March 2020, and on 12 March if sim.t == sim.day('2020-04-01'): elderly = sim.people.age > old 2020, its output was presented by Washington State Governor Jay sim.people.rel_sus[elderly] = 0.0 Inslee during a press conference as justification for school closures and social distancing measures [KMS+ 21]. # Set custom parameters Since the early days of the pandemic, Covasim releases have pars = dict( pop_type = 'hybrid', # More realistic population coincided with major events in the pandemic, especially the iden- location = 'japan', # Japan's population pyramid tification of new variants of concern (Fig. 1). Covasim was quickly pop_size = 50e3, # Have 50,000 people total adopted globally, including applications in the UK regarding pop_infected = 100, # 100 infected people n_days = 90, # Run for 90 days school closures [PGKS+ 20], Australia regarding outbreak control ) [SAK+ 21], and Vietnam regarding lockdown measures [PSN+ 21]. To date, Covasim has been downloaded from PyPI over # Run multiple sims in parallel and plot key results 100,000 times [PeP22], has been used in dozens of academic label = 'Protect the elderly' s1 = cv.Sim(pars, label='Default') studies [KMS+ 21], and informed decision-making on every con- s2 = cv.Sim(pars, interventions=elderly, label=label) tinent (Fig. 2), making it one of the most widely used COVID msim = cv.parallel(s1, s2) models [KSM+ 21]. We believe key elements of its success include msim.plot(['cum_deaths', 'cum_infections']) (a) the simplicity of its architecture; (b) its high performance, Similar design philosophies have been articulated by previously, enabled by the use of NumPy arrays and Numba decorators; such as for Grails [AJ09] among others1 . and (c) our emphasis on prioritizing usability, including flexible type handling and careful choices of default settings. In the 1. Other similar philosophical statements include "The manifesto of Mat- remainder of this paper, we outline these principles in more detail, plotlib is: simple and common tasks should be simple to perform; provide options for more complex tasks" (Data Processing Using Python) and "Simple, in the hope that these will provide a useful roadmap for other common tasks should be simple to perform; Options should be provided to groups wanting to quickly develop high-performance, easy-to-use enable more complex tasks" (Instrumental). 92 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 2: Locations where Covasim has been used to help produce a paper, report, or policy recommendation. Fig. 3: Basic Covasim disease model. The blue arrow shows the process of reinfection. Fig. 4: Illustrative result of a simulation in Covasim focused on Simplifications using Sciris exploring an intervention for protecting the elderly. A key component of Covasim’s architecture is heavy reliance on Sciris (http://sciris.org) [KAH+ ng], a library of functions for running simulations in parallel. scientific computing that provide additional flexibility and ease- of-use on top of NumPy, SciPy, and Matplotlib, including paral- Array-based architecture lel computing, array operations, and high-performance container In a typical agent-based simulation, the outermost loop is over datatypes. 
time, while the inner loops iterate over different agents and agent As shown in Fig. 5, Sciris significantly reduces the number states. For a simulation like Covasim, with roughly 700 (daily) of lines of code required to perform common scientific tasks, timesteps to represent the first two years of the pandemic, tens allowing the user to focus on the code’s scientific logic rather than or hundreds of thousands of agents, and several dozen states, this the low-level implementation. Key Covasim features that rely on requires on the order of one billion update steps. Sciris include: ensuring consistent dictionary, list, and array types However, we can take advantage of the fact that each state (e.g., allowing the user to provide inputs as either lists or arrays); (such as agent age or their infection status) has the same data referencing ordered dictionary elements by index; handling and type, and thus we can avoid an explicit loop over agents by instead interconverting dates (e.g., allowing the user to provide either a representing agents as entries in NumPy vectors, and performing date string or a datetime object); saving and loading files; and operations on these vectors. These two architectures are shown in PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 93 Fig. 5: Comparison of functionally identical code implemented without Sciris (left) and with (right). In this example, tasks that together take 30 lines of code without Sciris can be accomplished in 7 lines with it. 94 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) for t in self.time_vec: for person in self.people: if person.alive: person.age_person() person.check_died() # Array-based agent simulation class People: def age_people(self, inds): self.age[inds] += 1 return def check_died(self, inds): rands = np.random.rand(len(inds)) died = rands < self.death_probs[inds]: self.alive[inds[died]] = False return Fig. 6: The standard object-oriented approach for implementing agent-based models (top), compared to the array-based approach class Sim: used in Covasim (bottom). def run(self): for t in self.time_vec: alive = sc.findinds(self.people.alive) self.people.age_people(inds=alive) self.people.check_died(inds=alive) Numba optimization Numba is a compiler that translates subsets of Python and NumPy into machine code [LPS15]. Each low-level numerical function was tested with and without Numba decoration; in some cases speed improvements were negligible, while in other cases they were considerable. For example, the following function is roughly 10 times faster with the Numba decorator than without: import numpy as np import numba as nb @nb.njit((nb.int32, nb.int32), cache=True) def choose_r(max_n, n): Fig. 7: Performance comparison for FPsim from an explicit loop- return np.random.choice(max_n, n, replace=True) based approach compared to an array-based approach, showing a factor of ~70 speed improvement for large population sizes. Since Covasim is stochastic, calculations rarely need to be exact; as a result, most numerical operations are performed as 32-bit operations. Fig. 6. Compared to the explicitly object-oriented implementation Together, these speed optimizations allow Covasim to run at of an agent-based model, the array-based version is 1-2 orders of roughly 5-10 million simulated person-days per second of CPU magnitude faster for population sizes larger than 10,000 agents. 
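The following self-contained sketch restates the array-based pattern of Fig. 6 (illustrative only, not Covasim's implementation; np.flatnonzero stands in for Sciris's sc.findinds, and the death probability is an arbitrary placeholder):

import numpy as np

class People:
    def __init__(self, n, death_prob=1e-4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.age = self.rng.uniform(0, 90, n)        # one NumPy array per agent state
        self.alive = np.ones(n, dtype=bool)
        self.death_probs = np.full(n, death_prob)

    def age_people(self, inds):
        self.age[inds] += 1                          # vectorized update, no per-agent loop
                                                     # (ages in timestep units, as in the simplified Fig. 6 example)

    def check_died(self, inds):
        died = self.rng.random(len(inds)) < self.death_probs[inds]
        self.alive[inds[died]] = False

people = People(100_000)
for t in range(365):                                 # daily timesteps
    alive = np.flatnonzero(people.alive)             # indices of agents still alive
    people.age_people(alive)
    people.check_died(alive)

Because each state lives in a contiguous array, the inner work is handed to NumPy (and, for the hot low-level kernels, to Numba) rather than to the Python interpreter.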
time – a speed comparable to agent-based models implemented The relative performance of these two approaches is shown in purely in C or C++ [HPN+ 21]. Practically, this means that most Fig. 7 for FPsim (which, like Covasim, was initially implemented users can run Covasim analyses on their laptops without needing using an object-oriented approach before being converted to an to use cloud-based or HPC computing resources. array-based approach). To illustrate the difference between object- based and array-based implementations, the following example Lessons for scientific software development shows how aging and death would be implemented in each: Accessible coding and design # Object-based agent simulation Since Covasim was designed to be used by scientists and health class Person: officials, not developers, we made a number of design decisions that preferenced accessibility to our audience over other principles def age_person(self): of good software design. self.age += 1 return First, Covasim is designed to have as flexible of user inputs as possible. For example, a date can be specified as an integer def check_died(self): number of days from the start of the simulation, as a string (e.g. rand = np.random.random() if rand < self.death_prob: '2020-04-04'), or as a datetime object. Similarly, numeric self.alive = False inputs that can have either one or multiple values (such as the return change in transmission rate following one or multiple lockdowns) can be provided as a scalar, list, or NumPy array. As long as the class Sim: input is unambiguous, we prioritized ease-of-use and simplicity def run(self): of the interface over rigorous type checking. Since Covasim is a PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 95 top-level library (i.e., it does not perform low-level functions as health background, through to public health experts with virtually part of other libraries), this prioritization has been welcomed by no prior experience in Python. Roughly 45% of Covasim con- its users. tributors had significant Python expertise, while 60% had public Second, "advanced" Python programming paradigms – such health experience; only about half a dozen contributors (<10%) as method and function decorators, lambda functions, multiple had significant experience in both areas. inheritance, and "dunder" methods – have been avoided where These half-dozen contributors formed a core group (including possible, even when they would otherwise be good coding prac- the authors of this paper) that oversaw overall Covasim develop- tice. This is because a relatively large fraction of Covasim users, ment. Using GitHub for both software and project management, including those with relatively limited Python backgrounds, need we created issues and assigned them to other contributors based to inspect and modify the source code. A Covasim user coming on urgency and skillset match. All pull requests were reviewed by from an R programming background, for example, may not have at least one person from this group, and often two, prior to merge. encountered the NumPy function intersect1d() before, but While the danger of accepting changes from contributors with they can quickly look it up and understand it as being equivalent limited Python experience is self-evident, considerable risks were to R’s intersect() function. In contrast, an R user who has also posed by contributors who lacked epidemiological insight. 
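Returning briefly to the flexible input handling described earlier in this section, the following hypothetical helper (invented for illustration; Covasim's own date handling is provided via Sciris and differs in detail) shows the kind of normalization involved in accepting a day as an integer offset, an ISO date string, or a datetime object:

import datetime as dt

def to_day(value, start_date=dt.date(2020, 3, 1)):
    # Normalize an int offset, a 'YYYY-MM-DD' string, or a date/datetime to a day index
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = dt.date.fromisoformat(value)
    if isinstance(value, dt.datetime):
        value = value.date()
    return (value - start_date).days

assert to_day(31) == 31
assert to_day('2020-04-01') == 31
assert to_day(dt.datetime(2020, 4, 1)) == 31

Accepting all three forms costs a few lines of normalization while keeping the public interface unambiguous.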
not encountered method decorators before is unlikely to be able to For example, some of the proposed tests were written based on look them up and understand their meaning (indeed, they may not assumptions that were true for a given time and place, but which even know what terms to search for). While Covasim indeed does were not valid for other geographical contexts. use each of the "advanced" methods listed above (e.g., the Numba One surprising outcome was that even though Covasim is decorators described above), they have been kept to a minimum largely a software project, after the initial phase of development and sequestered in particular files the user is less likely to interact (i.e., the first 4-8 weeks), we found that relatively few tasks could with. be assigned to the developers as opposed to the epidemiologists Third, testing for Covasim presented a major challenge. Given and infectious disease modelers on the project. We believe there that Covasim was being used to make decisions that affected tens are several reasons for this. First, epidemiologists tended to be of millions of people, even the smallest errors could have poten- much more aware of knowledge they were missing (e.g., what tially catastrophic consequences. Furthermore, errors could arise a particular NumPy function did), and were more readily able not only in the software logic, but also in an incorrectly entered to fill that gap (e.g., look it up in the documentation or on parameter value or a misinterpreted scientific study. Compounding Stack Overflow). By contrast, developers without expertise in these challenges, features often had to be developed and used epidemiology were less able to identify gaps in their knowledge on a timescale of hours or days to be of use to policymakers, and address them (e.g., by finding a study on Google Scholar). a speed which was incompatible with traditional software testing As a consequence, many of the epidemiologists’ software skills approaches. In addition, the rapidly evolving codebase made it improved markedly over the first few months, while the develop- difficult to write even simple regression tests. Our solution was to ers’ epidemiology knowledge increased more slowly. Second, and use a hierarchical testing approach: low-level functions were tested more importantly, we found that once transparent and performant through a standard software unit test approach, while new features coding practices had been implemented, epidemiologists were able and higher-level outputs were tested extensively by infectious to successfully adapt them to new contexts even without complete disease modelers who varied inputs corresponding to realistic understanding of the code. Thus, for developing a scientific scenarios, and checked the outputs (predominantly in the form software tool, we propose that a successful staffing plan would of graphs) against their intuition. We found that these high-level consist of a roughly equal ratio of developers and domain experts "sanity checks" were far more effective in catching bugs than during the early development phase, followed by a rapid (on a formal software tests, and as a result shifted the emphasis of timescale of weeks) ramp-down of developers and ramp-up of our test suite to prioritize the former. Public releases of Covasim domain experts. 
have held up well to extensive scrutiny, both by our external Acknowledging that Covasim’s potential user base includes collaborators and by "COVID skeptics" who were highly critical many people who have limited coding skills, we developed a three- of other COVID models [Den20]. tiered support model to maximize Covasim’s real-world policy Finally, since much of our intended audience has little to impact (Fig. 8). For "mode 1" engagements, we perform the anal- no Python experience, we provided as many alternative ways of yses using Covasim ourselves. While this mode typically ensures accessing Covasim as possible. For R users, we provide exam- high quality and efficiency, it is highly resource-constrained and ples of how to run Covasim using the reticulate package thus used only for our highest-profile engagements, such as with [AUTE17], which allows Python to be called from within R. the Vietnam Ministry of Health [PSN+ 21] and Washington State For specific applications, such as our test-trace-quarantine work Department of Health [KMS+ 21]. For "mode 2" engagements, we (http://ttq-app.covasim.org), we developed bespoke webapps via offer our partners training on how to use Covasim, and let them Jupyter notebooks [GP21] and Voilà [Qua19]. To help non-experts lead analyses with our feedback. This is our preferred mode of gain intuition about COVID epidemic dynamics, we also devel- engagement, since it balances efficiency and sustainability, and has oped a generic JavaScript-based webapp interface for Covasim been used for contexts including the United Kingdom [PGKS+ 20] (http://app.covasim.org), but it does not have sufficient flexibility and Australia [SLSS+ 22]. Finally, "mode 3" partnerships, in to answer real-world policy questions. which Covasim is downloaded and used without our direct input, are of course the default approach in the open-source software Workflow and team management ecosystem, including for Python. While this mode is by far the Covasim was developed by a team of roughly 75 people with most scalable, in practice, relatively few health departments or widely disparate backgrounds: from those with 20+ years of ministries of health have the time and internal technical capacity to enterprise-level software development experience and no public use this mode; instead, most of the mode 3 uptake of Covasim has 96 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) been by academic groups [LG+ 21]. Thus, we provide mode 1 and [AUTE17] JJ Allaire, Kevin Ushey, Yuan Tang, and Dirk Eddelbuettel. mode 2 partnerships to make Covasim’s impact more immediate reticulate: R Interface to Python, 2017. URL: https://github. com/rstudio/reticulate. and direct than would be possible via mode 3 alone. [BGB+ 18] Anna Bershteyn, Jaline Gerardin, Daniel Bridenbecker, Christo- pher W Lorton, Jonathan Bloedow, Robert S Baker, Guil- Future directions laume Chabot-Couture, Ye Chen, Thomas Fischle, Kurt Frey, et al. Implementation and applications of EMOD, an individual- While the need for COVID modeling is hopefully starting to based multi-disease modeling platform. Pathogens and disease, decrease, we and our collaborators are continuing development 76(5):fty059, 2018. doi:10.1093/femspd/fty059. of Covasim by updating parameters with the latest scientific [CHL+ 08] Myron S Cohen, Nick Hellmann, Jay A Levy, Kevin DeCock, Joep Lange, et al. The spread, treatment, and prevention of evidence, implementing new immune dynamics [CSN+ 21], and HIV-1: evolution of a global pandemic. 
The Journal of Clin- providing other usability and bug-fix updates. We also continue ical Investigation, 118(4):1244–1254, 2008. doi:10.1172/ to provide support and training workshops (including in-person JCI34706. workshops, which were not possible earlier in the pandemic). [COSF20] Dennis L Chao, Assaf P Oron, Devabhaktuni Srikrishna, and Michael Famulare. Modeling layered non-pharmaceutical inter- We are using what we learned during the development of ventions against SARS-CoV-2 in the United States with Corvid. Covasim to build a broader suite of Python-based disease mod- MedRxiv, 2020. doi:10.1101/2020.04.08.20058487. eling tools (tentatively named "*-sim" or "Starsim"). The suite [CSN+ 21] Jamie A Cohen, Robyn Margaret Stuart, Rafael C Nùñez, of Starsim tools under development includes models for family Katherine Rosenfeld, Bradley Wagner, Stewart Chang, Cliff Kerr, Michael Famulare, and Daniel J Klein. Mechanistic mod- planning [OVCC+ 22], polio, respiratory syncytial virus (RSV), eling of SARS-CoV-2 immune memory, variants, and vaccines. and human papillomavirus (HPV). To date, each tool in this medRxiv, 2021. doi:10.1101/2021.05.31.21258018. suite uses an independent codebase, and is related to Covasim [Den20] Denim, Sue. Another Computer Simulation, Another Alarmist only through the shared design principles described above, and Prediction, 2020. URL: https://dailysceptic.org/schools-paper. [Fam19] Mike Famulare. nCoV: preliminary estimates of the confirmed- by having used the Covasim codebase as the starting point for case-fatality-ratio and infection-fatality-ratio, and initial pan- development. demic risk assessment. Institute for Disease Modeling, 2019. A major open question is whether the disease dynamics im- [Gar05] Laurie Garrett. The next pandemic. Foreign Aff., 84:3, 2005. plemented in Covasim and these related models have sufficient doi:10.2307/20034417. [GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and overlap to be refactored into a single disease-agnostic modeling storytelling with code and data. Computing in Science & En- library, which the disease-specific modeling libraries would then gineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021. import. This "core and specialization" approach was adopted by 3059263. EMOD and Atomica, and while both frameworks continue to be [Hof20] Bert Hofman. The global pandemic. Horizons: Journal of International Relations and Sustainable Development, (16):60– used, no multi-disease modeling library has yet seen widespread 69, 2020. adoption within the disease modeling community. The alternative [HPN+ 21] Robert Hinch, William JM Probert, Anel Nurtay, Michelle approach, currently used by the Starsim suite, is for each disease Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana model to be a self-contained library. A shared library would Bulas Cruz, Lele Zhao, Andrea Stewart, et al. OpenABM- Covid19—An agent-based model for non-pharmaceutical inter- reduce code duplication, and allow new features and bug fixes ventions against COVID-19 including contact tracing. PLoS to be immediately rolled out to multiple models simultaneously. computational biology, 17(7):e1009146, 2021. doi:10. However, it would also increase interdependencies that would have 1371/journal.pcbi.1009146. the effect of increasing code complexity, increasing the risk of [KAH+ ng] Cliff C Kerr, Romesh G Abeysuriya, Vlad-S, tefan Harbuz, George L Chadderdon, Parham Saidi, Paula Sanz-Leon, James introducing subtle bugs. 
Which of these two options is preferable Jansson, Maria del Mar Quiroga, Sherrie Hughes, Rowan likely depends on the speed with which new disease models need Martin-and Kelly, Jamie Cohen, Robyn M Stuart, and Anna to be implemented. We hope that for the foreseeable future, none Nachesa. Sciris: a Python library to simplify scientific com- will need to be implemented as quickly as Covasim. puting. Available at http://paper.sciris.org, 2022 (forthcoming). [KAK+ 19] David J Kedziora, Romesh Abeysuriya, Cliff C Kerr, George L Chadderdon, Vlad-S, tefan Harbuz, Sarah Metzger, David P Wil- Acknowledgements son, and Robyn M Stuart. The Cascade Analysis Tool: software to analyze and optimize care cascades. Gates Open Research, 3, We thank additional contributors to Covasim, including Katherine 2019. doi:10.12688/gatesopenres.13031.2. Rosenfeld, Gregory R. Hart, Rafael C. Núñez, Prashanth Selvaraj, [KCP+ 20] Joel R Koo, Alex R Cook, Minah Park, Yinxiaohe Sun, Haoyang Brittany Hagedorn, Amanda S. Izzo, Greer Fowler, Anna Palmer, Sun, Jue Tao Lim, Clarence Tam, and Borame L Dickens. Interventions to mitigate early spread of sars-cov-2 in singapore: Dominic Delport, Nick Scott, Sherrie L. Kelly, Caroline S. Ben- a modelling study. The Lancet Infectious Diseases, 20(6):678– nette, Bradley G. Wagner, Stewart T. Chang, Assaf P. Oron, Paula 688, 2020. doi:10.1016/S1473-3099(20)30162-6. Sanz-Leon, and Jasmina Panovska-Griffiths. We also wish to thank [KMS+ 21] Cliff C Kerr, Dina Mistry, Robyn M Stuart, Katherine Rosenfeld, Maleknaz Nayebi and Natalie Dean for helpful discussions on Gregory R Hart, Rafael C Núñez, Jamie A Cohen, Prashanth Selvaraj, Romesh G Abeysuriya, Michał Jastrz˛ebski, et al. Con- code architecture and workflow practices, respectively. trolling COVID-19 via test-trace-quarantine. Nature Commu- nications, 12(1):1–12, 2021. doi:10.1038/s41467-021- 23276-9. R EFERENCES [KSM+ 21] Cliff C Kerr, Robyn M Stuart, Dina Mistry, Romesh G Abey- [AFG+ 04] Roy M Anderson, Christophe Fraser, Azra C Ghani, Christl A suriya, Katherine Rosenfeld, Gregory R Hart, Rafael C Núñez, Donnelly, Steven Riley, Neil M Ferguson, Gabriel M Leung, Jamie A Cohen, Prashanth Selvaraj, Brittany Hagedorn, et al. Tai H Lam, and Anthony J Hedley. Epidemiology, transmis- Covasim: an agent-based model of COVID-19 dynamics and sion dynamics and control of sars: the 2002–2003 epidemic. interventions. PLOS Computational Biology, 17(7):e1009149, Philosophical Transactions of the Royal Society of London. 2021. doi:10.1371/journal.pcbi.1009149. Series B: Biological Sciences, 359(1447):1091–1105, 2004. [LG+ 21] Junjiang Li, Philippe Giabbanelli, et al. Returning to a normal doi:10.1098/rstb.2004.1490. life via COVID-19 vaccines in the United States: a large- [AJ09] Bashar Abdul-Jawad. Groovy and Grails Recipes. Springer, scale Agent-Based simulation study. JMIR medical informatics, 2009. 9(4):e27419, 2021. doi:10.2196/27419. PYTHON VS. THE PANDEMIC: A CASE STUDY IN HIGH-STAKES SOFTWARE DEVELOPMENT 97 Fig. 8: The three pathways to impact with Covasim, from high bandwidth/small scale to low bandwidth/large scale. IDM: Institute for Disease Modeling; OSS: open-source software; GPG: global public good; PyPI: Python Package Index. [LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A the impact of COVID-19 vaccines in a representative COVAX llvm-based python jit compiler. 
In Proceedings of the Second AMC country setting due to ongoing internal migration: A Workshop on the LLVM Compiler Infrastructure in HPC, pages modeling study. PLOS Global Public Health, 2(1):e0000053, 1–6, 2015. doi:10.1145/2833157.2833162. 2022. doi:10.1371/journal.pgph.0000053. [Med20] The Lancet Respiratory Medicine. COVID-19: delay, mitigate, [Tea14] WHO Ebola Response Team. Ebola virus disease in west and communicate. The Lancet Respiratory Medicine, 8(4):321, africa—the first 9 months of the epidemic and forward projec- 2020. doi:10.1016/S2213-2600(20)30128-4. tions. New England Journal of Medicine, 371(16):1481–1495, [OVCC 22] Michelle L O’Brien, Annie Valente, Guillaume Chabot-Couture, + 2014. doi:10.1056/NEJMoa1411100. Joshua Proctor, Daniel Klein, Cliff Kerr, and Marita Zimmer- mann. FPSim: An agent-based model of family planning for informed policy decision-making. In PAA 2022 Annual Meeting. PAA, 2022. [PeP22] PePy. PePy download statistics, 2022. URL: https://pepy.tech/ project/covasim. [PGKS+ 20] Jasmina Panovska-Griffiths, Cliff C Kerr, Robyn M Stuart, Dina Mistry, Daniel J Klein, Russell M Viner, and Chris Bonell. Determining the optimal strategy for reopening schools, the impact of test and trace interventions, and the risk of occurrence of a second COVID-19 epidemic wave in the UK: a modelling study. The Lancet Child & Adolescent Health, 4(11):817–827, 2020. doi:10.1016/S2352-4642(20)30250-9. [PSN+ 21] Quang D Pham, Robyn M Stuart, Thuong V Nguyen, Quang C Luong, Quang D Tran, Thai Q Pham, Lan T Phan, Tan Q Dang, Duong N Tran, Hung T Do, et al. Estimating and mitigating the risk of COVID-19 epidemic rebound associated with reopening of international borders in Vietnam: a modelling study. The Lancet Global Health, 9(7):e916–e924, 2021. doi:10.1016/ S2214-109X(21)00103-0. [Qua19] QuantStack. And voilá! Jupyter Blog, 2019. URL: https://blog. jupyter.org/and-voil%C3%A0-f6a2c08a4a93. [RSWS20] Joacim Rocklöv, Henrik Sjödin, and Annelies Wilder-Smith. COVID-19 outbreak on the Diamond Princess cruise ship: esti- mating the epidemic potential and effectiveness of public health countermeasures. Journal of Travel Medicine, 27(3):taaa030, 2020. doi:10.1093/jtm/taaa030. [SAK+ 21] Robyn M Stuart, Romesh G Abeysuriya, Cliff C Kerr, Dina Mistry, Dan J Klein, Richard T Gray, Margaret Hellard, and Nick Scott. Role of masks, testing and contact tracing in preventing COVID-19 resurgences: a case study from New South Wales, Australia. BMJ open, 11(4):e045941, 2021. doi:10.1136/bmjopen-2020-045941. [SHK16] Patrick R Saunders-Hastings and Daniel Krewski. Review- ing the history of pandemic influenza: understanding patterns of emergence and transmission. Pathogens, 5(4):66, 2016. doi:10.3390/pathogens5040066. [SLSS+ 22] Paula Sanz-Leon, Nathan J Stevenson, Robyn M Stuart, Romesh G Abeysuriya, James C Pang, Stephen B Lambert, Cliff C Kerr, and James A Roberts. Risk of sustained SARS- CoV-2 transmission in Queensland, Australia. Scientific reports, 12(1):1–9, 2022. doi:10.1101/2021.06.08.21258599. [SWC 22] Prashanth Selvaraj, Bradley G Wagner, Dennis L Chao, + Maïna L’Azou Jackson, J Gabrielle Breugelmans, Nicholas Jack- son, and Stewart T Chang. Rural prioritization may increase 98 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
(SCIPY 2022) Pylira: deconvolution of images in the presence of Poisson noise Axel Donath‡∗ , Aneta Siemiginowska‡ , Vinay Kashyap‡ , Douglas Burke‡ , Karthik Reddy Solipuram§ , David van Dyk¶ F Abstract—All physical and astronomical imaging observations are degraded by of the signal intensity to the signal variance. Any statistically the finite angular resolution of the camera and telescope systems. The recovery correct post-processing or reconstruction method thus requires a of the true image is limited by both how well the instrument characteristics careful treatment of the Poisson nature of the measured image. are known and by the magnitude of measurement noise. In the case of a To maximise the scientific use of the data, it is often desired to high signal to noise ratio data, the image can be sharpened or “deconvolved” correct the degradation introduced by the imaging process. Besides robustly by using established standard methods such as the Richardson-Lucy method. However, the situation changes for sparse data and the low signal to correction for non-uniform exposure and background noise this noise regime, such as those frequently encountered in X-ray and gamma-ray also includes the correction for the "blurring" introduced by the astronomy, where deconvolution leads inevitably to an amplification of noise point spread function (PSF) of the instrument. Where the latter and poorly reconstructed images. However, the results in this regime can process is often called "deconvolution". Depending on whether be improved by making use of physically meaningful prior assumptions and the PSF of the instrument is known or not, one distinguishes statistically principled modeling techniques. One proposed method is the LIRA between the "blind deconvolution" and "non blind deconvolution" algorithm, which requires smoothness of the reconstructed image at multiple process. For astronomical observations, the PSF can often either scales. In this contribution, we introduce a new python package called Pylira, be simulated, given a model of the telescope and detector, or which exposes the original C implementation of the LIRA algorithm to Python inferred directly from the data by observing far distant objects, users. We briefly describe the package structure, development setup and show a Chandra as well as Fermi-LAT analysis example. which appear as a point source to the instrument. While in other branches of astronomy deconvolution methods Index Terms—deconvolution, point spread function, poisson, low counts, X-ray, are already part of the standard analysis, such as the CLEAN gamma-ray algorithm for radio data, developed by [Hog74], this is not the case for X-ray and gamma-ray astronomy. As any deconvolution method aims to enhance small-scale structures in an image, it Introduction becomes increasingly hard to solve for the regime of low signal- Any physical and astronomical imaging process is affected by to-noise ratio, where small-scale structures are more affected by the limited angular resolution of the instrument or telescope. In noise. addition, the quality of the resulting image is also degraded by background or instrumental measurement noise and non-uniform The Deconvolution Problem exposure. 
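As context for the sections that follow, this minimal sketch simulates the imaging process just described (a true flux image blurred by the instrument PSF and recorded as Poisson-distributed counts) and then applies the standard Richardson-Lucy deconvolution mentioned in the abstract, using the implementation available in scikit-image; the image size, PSF width, and source brightness are invented for the example:

import numpy as np
from scipy.signal import fftconvolve
from skimage.restoration import richardson_lucy

rng = np.random.default_rng(42)

# True flux: a single point source on a faint flat background
flux = np.full((64, 64), 0.5)
flux[32, 32] += 300.0

# Gaussian PSF kernel, normalized to unit sum
y, x = np.mgrid[-12:13, -12:13]
psf = np.exp(-(x**2 + y**2) / (2 * 3.0**2))
psf /= psf.sum()

# Forward model: blur with the PSF, then draw Poisson-distributed counts
expected = fftconvolve(flux, psf, mode="same").clip(min=0)   # clip tiny FFT round-off before sampling
counts = rng.poisson(expected).astype(float)

# Classical Richardson-Lucy deconvolution from scikit-image
deconvolved = richardson_lucy(counts, psf, 30, clip=False)

For bright, high signal-to-noise data this works well; the low-count regime discussed below is where the noise amplification problems appear.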
For short wavelengths and associated low intensities of Basic Statistical Model the signal, the imaging process consists of recording individual Assuming the data in each pixel di in the recorded counts image photons (often called "events") originating from a source of follows a Poisson distribution, the total likelihood of obtaining the interest. This imaging process is typical for X-ray and gamma- measured image from a model image of the expected counts λi ray telescopes, but images taken by magnetic resonance imaging with N pixels is given by: or fluorescence microscopy show Poisson noise too. For each individual photon, the incident direction, energy and arrival time N exp −di λidi L (d|λ ) = ∏ (1) is measured. Based on this information, the event can be binned i di ! into two dimensional data structures to form an actual image. By taking the logarithm, dropping the constant terms and inverting As a consequence of the low intensities associated to the the sign one can transform the product into a sum over pixels, recording of individual events, the measured signal follows Pois- which is also often called the Cash [Cas79] fit statistics: son statistics. This imposes a non-linear relationship between the N measured signal and true underlying intensity as well as a coupling C (λ |d) = ∑(λi − di log λi ) (2) i * Corresponding author: axel.donath@cfa.harvard.edu ‡ Center for Astrophysics | Harvard & Smithsonian Where the expected counts λi are given by the convolution of the § University of Maryland Baltimore County true underlying flux distribution xi with the PSF pk : ¶ Imperial College London λi = ∑ xi pi−k (3) Copyright © 2022 Axel Donath et al. This is an open-access article distributed k under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the This operation is often called "forward modelling" or "forward original author and source are credited. folding" with the instrument response. PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 99 Richardson Lucy (RL) To obtain the most likely value of xn given the data, one searches a maximum of the total likelihood function, or equivalently a of minimum C . This high dimensional optimization problem can e.g., be solved by a classic gradient descent approach. Assuming the pixels values xi of the true image as independent parameters, one can take the derivative of Eq. 2 with respect to the individual xi . This way one obtains a rule for how to update the current set of pixels xn in each iteration of the optimization: ∂ C (d|x) xn+1 = xn − α · (4) ∂ xi Where α is a factor to define the step size. This method is in general equivalent to the gradient descent and backpropagation methods used in modern machine learning techniques. This ba- sic principle of solving the deconvolution problem for images with Poisson noise was proposed by [Ric72] and [Luc74]. Their method, named after the original authors, is often known as the Fig. 1: The images show the result of the RL algorithm applied Richardson & Lucy (RL) method. It was shown by [Ric72] that to a simulated example dataset with varying numbers of iterations. this converges to a maximum likelihood solution of Eq. 2. A The image in the upper left shows the simulated counts. Those have Python implementation of the standard RL method is available been derived from the ground truth (upper mid) by convolving with a e.g. in the Scikit-Image package [vdWSN+ 14]. 
Gaussian PSF of width σ = 3 pix and applying Poisson noise to it. Instead of the iterative, gradient descent based optimization it The illustration uses the implementation of the RL algorithm from the is also possible to sample from the posterior distribution using a Scikit-Image package [vdWSN+ 14]. simple Metropolis-Hastings [Has70] approach and uniform prior. This is demonstrated in one of the Pylira online tutorials (Intro- the smoothness of the reconstructed image on multiple spatial duction to Deconvolution using MCMC Methods). scales. Starting from the full resolution, the image pixels xi are collected into 2 by 2 groups Qk . The four pixel values associated RL Reconstruction Quality with each group are divided by their sum to obtain a grid of “split While technically the RL method converges to a maximum like- proportions” with respect to the image down-sized by a factor of lihood solution, it mostly still results in poorly restored images, two along both axes. This process is repeated using the down sized especially if extended emission regions are present in the image. image with pixel values equal to the sums over the 2 by 2 groups The problem is illustrated in Fig. 1 using a simulated example from the full-resolution image, and the process continues until the image. While for a low number of iterations, the RL method still resolution of the image is only a single pixel, containing the total results in a smooth intensity distribution, the structure of the image sum of the full-resolution image. This multi-scale representation decomposes more and more into a set of point-like sources with is illustrated in Fig. 2. growing number of iterations. For each of the 2x2 groups of the re-normalized images a Because of the PSF convolution, an extended emission region Dirichlet distribution is introduced as a prior: can decompose into multiple nearby point sources and still lead to good model prediction, when compared with the data. Those φk ∝ Dirichlet(αk , αk , αk , αk ) (6) almost equally good solutions correspond to many narrow local and multiplied across all 2x2 groups and resolution levels k. For minima or "spikes" in the global likelihood surface. Depending on each resolution level a smoothing parameter αk is introduced. the start estimate for the reconstructed image x the RL method These hyper-parameters can be interpreted as having an infor- will follow the steepest gradient and converge towards the nearest mation content equivalent of adding αk "hallucinated" counts in narrow local minimum. This problem has been described by each grouping. This effectively results in a smoothing of the multiple authors, such as [PR94] and [FBPW95]. image at the given resolution level. The distribution of α values at each resolution level is the further described by a hyper-prior Multi-Scale Prior & LIRA distribution: One solution to this problem was described in [ECKvD04] and p(αk ) = exp (−δ α 3 /3) (7) [CSv+ 11]. First, the simple forward folded model described in Eq. 3 can be extended by taking into account the non-uniform Resulting in a fully hierarchical Bayesian model. A more com- exposure ei and an additional known background component bi : plete and detailed description of the prior definition is given in [ECKvD04]. λi = ∑ (ei · (xi + bi )) pi−k (5) The problem is then solved by using a Gibbs MCMC sampling k approach. 
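The 2 by 2 grouping that underlies the LIRA prior can be sketched in a few lines of NumPy (illustrative only; it assumes a square image whose side length is a power of two, and Pylira's actual implementation lives in the wrapped C code):

import numpy as np

def multiscale_split_proportions(image):
    # Returns the per-level "split proportions" plus the final 1x1 image (the total sum).
    # Blocks summing to zero would need special handling and are ignored here.
    levels = []
    current = image.astype(float)
    while current.shape[0] > 1:
        n = current.shape[0] // 2
        groups = current.reshape(n, 2, n, 2)             # group pixels into 2x2 blocks
        sums = groups.sum(axis=(1, 3))                   # next-coarser image: sum over each block
        levels.append(groups / sums[:, None, :, None])   # proportions within each block
        current = sums
    return levels, current

image = np.arange(16, dtype=float).reshape(4, 4) + 1
levels, total = multiscale_split_proportions(image)
assert np.isclose(total[0, 0], image.sum())              # the coarsest level holds the total counts

Each level of split proportions is what the Dirichlet prior with smoothing parameter alpha_k acts on.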
After a "burn-in" phase the sampling process typically The background bi can be more generally understood as a "base- reaches convergence and starts sampling from the posterior distri- line" image and thus include known structures, which are not of bution. The reconstructed image is then computed as the mean of interest for the deconvolution process. E.g., a bright point source the posterior samples. As for each pixel a full distribution of its to model the core of an AGN while studying its jets. values is available, the information can also be used to compute Second, the authors proposed to extend the Poisson log- the associated error of the reconstructed value. This is another likelihood function (Equation 2) by a log-prior term that controls main advantage over RL or Maxium A-Postori (MAP) algorithms. 100 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 1 $ sudo apt-get install r-base-dev r-base r-mathlib 2 $ pip install pylira For more detailed instructions see Pylira installation instructions. API & Subpackages Pylira is structured in multiple sub-packages. The pylira.src module contains the original C implementation and the Pybind11 wrapper code. The pylira.core sub-package contains the main Python API, pylira.utils includes utility functions for plotting and serialisation. And pylira.data implements multiple pre-defined datasets for testing and tutorials. Analysis Examples Simple Point Source Pylira was designed to offer a simple Python class based user interface, which allows for a short learning curve of using the package for users who are familiar with Python in general and more specifically with Numpy. A typical complete usage example of the Pylira package is shown in the following: Fig. 2: The image illustrates the multi-scale decomposition used in the LIRA prior for a 4x4 pixels example image. Each quadrant of 2x2 1 import numpy as np sub-images is labelled with QN . The sub-pixels in each quadrant are 2 from pylira import LIRADeconvolver labelled Λi j . . 3 from pylira.data import point_source_gauss_psf 4 5 # create example dataset 6 data = point_source_gauss_psf() The Pylira Package 7 8 # define initial flux image Dependencies & Development 9 data["flux_init"] = data["flux"] The Pylira package is a thin Python wrapper around the original 10 LIRA implementation provided by the authors of [CSv+ 11]. The 11 deconvolve = LIRADeconvolver( 12 n_iter_max=3_000, original algorithm was implemented in C and made available as a 13 n_burn_in=500, package for the R Language [R C20]. Thus the implementation de- 14 alpha_init=np.ones(5) pends on the RMath library, which is still a required dependency of 15 ) 16 Pylira. The Python wrapper was built using the Pybind11 [JRM17] 17 result = deconvolve.run(data=data) package, which allows to reduce the code overhead introduced by 18 the wrapper to a minimum. For the data handling, Pylira relies on 19 # plot pixel traces, result shown in Figure 3 Numpy [HMvdW+ 20] arrays for the serialisation to the FITS data 20 result.plot_pixel_traces_region( 21 center_pix=(16, 16), radius_pix=3 format on Astropy [Col18]. The (interactive) plotting functionality 22 ) is achieved via Matplotlib [Hun07] and Ipywidgets [wc15], which 23 are both optional dependencies. Pylira is openly developed on 24 # plot pixel traces, result shown in Figure 4 25 result.plot_parameter_traces() Github at https://github.com/astrostat/pylira. 
It relies on GitHub 26 Actions as a continuous integration service and uses the Read 27 # finally serialise the result the Docs service to build and deploy the documentation. The on- 28 result.write("result.fits") line documentation can be found on https://pylira.readthedocs.io. The main interface is exposed via the LIRADeconvolver Pylira implements a set of unit tests to assure compatibility class, which takes the configuration of the algorithm on initial- and reproducibility of the results with different versions of the isation. Typical configuration parameters include the total num- dependencies and across different platforms. As Pylira relies on ber of iterations n_iter_max and the number of "burn-in" random sampling for the MCMC process an exact reproducibility iterations, to be excluded from the posterior mean computation. of results is hard to achieve on different platforms; however the The data, represented by a simple Python dict data structure, agreement of results is at least guaranteed in the statistical limit of contains a "counts", "psf" and optionally "exposure" drawing many samples. and "background" array. The dataset is then passed to the LIRADeconvolver.run() method to execute the deconvolu- Installation tion. The result is a LIRADeconvolverResult object, which Pylira is available via the Python package index (pypi.org), features the possibility to write the result as a FITS file, as well currently at version 0.1. As Pylira still depends on the RMath as to inspect the result with diagnostic plots. The result of the library, it is required to install this first. So the recommended way computation is shown in the left panel of Fig. 3. to install Pylira is on MacOS is: 1 $ brew install r Diagnostic Plots 2 $ pip install pylira To validate the quality of the results Pylira provides many built- On Linux the RMath dependency can be installed using standard in diagnostic plots. One of these diagnostic plot is shown in the package managers. For example on Ubuntu, one would do right panel of Fig. 3. The plot shows the image sampling trace PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 101 Pixel trace for (16, 16) 30 800 1000 700 25 800 600 20 500 600 Posterior Mean Burn in Valid 15 400 Mean 400 1 Std. Deviation 300 10 200 200 5 100 0 0 0 5 10 15 20 25 30 0 500 1000 1500 2000 2500 3000 Number of Iterations Fig. 3: The curves show the traces of value the pixel of interest for a simulated point source and its neighboring pixels (see code example). The image on the left shows the posterior mean. The white circle in the image shows the circular region defining the neighboring pixels. The blue line on the right plot shows the trace of the pixel of interest. The solid horizontal orange line shows the mean value (excluding burn-in) of the pixel across all iterations and the shaded orange area the 1 σ error region. The burn in phase is shown in transparent blue and ignored while computing the mean. The shaded gray lines show the traces of the neighboring pixels. for a single pixel of interest and its surrounding circular region of Chandra is a space-based X-ray observatory, which has been interest. This visualisation allows the user to assess the stability in operation since 1999. It consists of nested cylindrical paraboloid of a small region in the image e.g. an astronomical point source and hyperboloid surfaces, which form an imaging optical system during the MCMC sampling process. Due to the correlation with for X-rays. 
In the focal plane, it has multiple instruments for dif- neighbouring pixels, the actual value of a pixel might vary in the ferent scientific purposes. This includes a high-resolution camera sampling process, which appears as "dips" in the trace of the pixel (HRC) and an Advanced CCD Imaging Spectrometer (ACIS). The of interest and anti-correlated "peaks" in the one or mutiple of typical angular resolution is 0.5 arcsecond and the covered energy the surrounding pixels. In the example a stable state of the pixels ranges from 0.1 - 10 keV. of interest is reached after approximately 1000 iterations. This Figure 5 shows the result of the Pylira algorithm applied to suggests that the number of burn-in iterations, which was defined Chandra data of the Galactic Center region between 0.5 and 7 keV. beforehand, should be increased. The PSF was obtained from simulations using the simulate_psf Pylira relies on an MCMC sampling approach to sample tool from the official Chandra science tools ciao 4.14 [FMA+ 06]. a series of reconstructed images from the posterior likelihood The algorithm achieves both an improved spatial resolution as well defined by Eq. 2. Along with the sampling, it marginalises over as a reduced noise level and higher contrast of the image in the the smoothing hyper-parameters and optimizes them in the same right panel compared to the unprocessed counts data shown in the process. To diagnose the validity of the results it is important to left panel. visualise the sampling traces of both the sampled images as well As a second example, we use data from the Fermi Large Area as hyper-parameters. Telescope (LAT). The Fermi-LAT is a satellite-based imaging Figure 4 shows another typical diagnostic plot created by the gamma-ray detector, which covers an energy range of 20 MeV code example above. In a multi-panel figure, the user can inspect to >300 GeV. The angular resolution varies strongly with energy the traces of the total log-posterior as well as the traces of the and ranges from 0.1 to >10 degree1 . smoothing parameters. Each panel corresponds to the smoothing Figure 6 shows the result of the Pylira algorithm applied to hyper parameter introduced for each level of the multi-scale Fermi-LAT data above 1 GeV to the region around the Galactic representation of the reconstructed image. The figure also shows Center. The PSF was obtained from simulations using the gtpsf the mean value along with the 1 σ error region. In this case, tool from the official Fermitools v2.0.19 [Fer19]. First, one can the algorithm shows stable convergence after a burn-in phase of see that the algorithm achieves again a considerable improvement approximately 200 iterations for the log-posterior as well as all of in the spatial resolution compared to the raw counts. It clearly the multi-scale smoothing parameters. resolves multiple point sources left to the bright Galactic Center source. Astronomical Analysis Examples Summary & Outlook Both in the X-ray as well as in the gamma-ray regime, the Galactic The Pylira package provides Python wrappers for the LIRA al- Center is a complex emission region. It shows point sources, gorithm. It allows the deconvolution of low-counts data following extended sources, as well as underlying diffuse emission and thus 1. https://www.slac.stanford.edu/exp/glast/groups/canda/lat_Performance. represents a challenge for any astronomical data analysis. htm 102 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
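For real observations such as the Chandra example above, an analysis might be set up along the following lines; this is a hedged sketch using only the API elements shown earlier, with hypothetical file names and with the choice of initial flux image (the counts themselves) being an assumption rather than a Pylira recommendation:

import numpy as np
from astropy.io import fits
from pylira import LIRADeconvolver

counts = fits.getdata("gc-counts.fits").astype(float)   # hypothetical counts image
psf = fits.getdata("gc-psf.fits").astype(float)         # hypothetical PSF, e.g. produced with simulate_psf
psf /= psf.sum()                                        # normalize the PSF to unit sum

data = {
    "counts": counts,
    "psf": psf,
    "flux_init": counts.copy(),   # assumed starting point for the chain
    # optional "exposure" and "background" arrays are omitted here
}

deconvolve = LIRADeconvolver(
    n_iter_max=3_000,
    n_burn_in=500,
    alpha_init=np.ones(5),
)
result = deconvolve.run(data=data)

result.plot_parameter_traces()    # check convergence of the smoothing parameters
result.write("gc-result.fits")

As with the simulated example, the burn-in length should be revisited after inspecting the traces.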
(SCIPY 2022) Logpost Smoothingparam0 Smoothingparam1 Burn in 0.35 0.35 1500 Valid 0.30 Mean 0.30 1 Std. Deviation 0.25 0.25 1000 0.20 0.20 500 0.15 0.15 0 0.10 0.10 0.05 0.05 500 0.00 0.00 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of Iterations Number of Iterations Number of Iterations Smoothingparam2 Smoothingparam3 Smoothingparam4 0.200 0.175 0.20 0.175 0.150 0.150 0.15 0.125 0.125 0.100 0.100 0.10 0.075 0.075 0.05 0.050 0.050 0.025 0.025 0.00 0.000 0.000 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of Iterations Number of Iterations Number of Iterations Fig. 4: The curves show the traces of the log posterior value as well as traces of the values of the prior parameter values. The SmoothingparamN parameters correspond to the smoothing parameters αN per multi-scale level. The solid horizontal orange lines show the mean value, the shaded orange area the 1 σ error region. The burn in phase is shown transparent and ignored while estimating the mean. Counts Deconvolved 500 PSF 257 132 -29°00'25" 68 Declination 35 Counts 18 30" 9 5 2 35" 17h45m40.6s40.4s 40.2s 40.0s 39.8s 39.6s 17h45m40.6s40.4s 40.2s 40.0s 39.8s 39.6s Right Ascension Right Ascension Fig. 5: Pylira applied to Chandra ACIS data of the Galactic Center region, using the observation IDs 4684 and 4684. The image on the left shows the raw observed counts between 0.5 and 7 keV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included. PYLIRA: DECONVOLUTION OF IMAGES IN THE PRESENCE OF POISSON NOISE 103 Counts Deconvolved 200 0°40' PSF 120 72 20' 43 Galactic Latitude 00' 26 Counts 16 -0°20' 9 5 40' 2 0°40' 20' 00' 359°40' 20' 0°40' 20' 00' 359°40' 20' Galactic Longitude Galactic Longitude Fig. 6: Pylira applied to Fermi-LAT data from the Galactic Center region. The image on the left shows the raw measured counts between 5 and 1000 GeV. The image on the right shows the deconvolved version. The LIRA hyperprior values were chosen as ms_al_kap1=1, ms_al_kap2=0.02, ms_al_kap3=1. No baseline background model was included. Poisson statistics using a Bayesian sampling approach and a multi- [CSv+ 11] A. Connors, N. M. Stein, D. van Dyk, V. Kashyap, and scale smoothing prior assumption. The results can be easily written A. Siemiginowska. LIRA — The Low-Counts Image Restora- tion and Analysis Package: A Teaching Version via R. In I. N. to FITS files and inspected by plotting the trace of the sampling Evans, A. Accomazzi, D. J. Mink, and A. H. Rots, editors, process. This allows users to check for general convergence as Astronomical Data Analysis Software and Systems XX, volume well as pixel to pixel correlations for selected regions of interest. 442 of Astronomical Society of the Pacific Conference Series, The package is openly developed on GitHub and includes tests page 463, July 2011. [ECKvD04] David N. Esch, Alanna Connors, Margarita Karovska, and and documentation, such that it can be maintained and improved David A. van Dyk. An image restoration technique with in the future, while ensuring consistency of the results. It comes error estimates. The Astrophysical Journal, 610(2):1213– with multiple built-in test datasets and explanatory tutorials in 1227, aug 2004. URL: https://doi.org/10.1086/421761, doi: 10.1086/421761. the form of Jupyter notebooks. Future plans include the support [FBPW95] D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G. 
Acknowledgements

This work was conducted under the auspices of the CHASC International Astrostatistics Center. CHASC is supported by NSF grants DMS-21-13615, DMS-21-13397, and DMS-21-13605; by the UK Engineering and Physical Sciences Research Council [EP/W015080/1]; and by NASA 18-APRA18-0019. We thank CHASC members for many helpful discussions, especially Xiao-Li Meng and Katy McKeough. DvD was also supported in part by a Marie Skłodowska-Curie RISE Grant (H2020-MSCA-RISE-2019-873089) provided by the European Commission. Aneta Siemiginowska, Vinay Kashyap, and Doug Burke further acknowledge support from NASA contract to the Chandra X-ray Center NAS8-03060.

REFERENCES

[Cas79] W. Cash. Parameter estimation in astronomy through application of the likelihood ratio. The Astrophysical Journal, 228:939–947, March 1979. doi:10.1086/156922.
[Col18] Astropy Collaboration. The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package. The Astronomical Journal, 156(3):123, September 2018. arXiv:1801.02634, doi:10.3847/1538-3881/aabc4f.
[CSv+11] A. Connors, N. M. Stein, D. van Dyk, V. Kashyap, and A. Siemiginowska. LIRA — The Low-Counts Image Restoration and Analysis Package: A Teaching Version via R. In I. N. Evans, A. Accomazzi, D. J. Mink, and A. H. Rots, editors, Astronomical Data Analysis Software and Systems XX, volume 442 of Astronomical Society of the Pacific Conference Series, page 463, July 2011.
[ECKvD04] David N. Esch, Alanna Connors, Margarita Karovska, and David A. van Dyk. An image restoration technique with error estimates. The Astrophysical Journal, 610(2):1213–1227, August 2004. doi:10.1086/421761.
[FBPW95] D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G. Walker. Blind deconvolution by means of the Richardson–Lucy algorithm. J. Opt. Soc. Am. A, 12(1):58–65, January 1995. doi:10.1364/JOSAA.12.000058.
[Fer19] Fermi Science Support Development Team. Fermitools: Fermi Science Tools. Astrophysics Source Code Library, record ascl:1905.011, May 2019.
[FMA+06] Antonella Fruscione, Jonathan C. McDowell, Glenn E. Allen, Nancy S. Brickhouse, Douglas J. Burke, John E. Davis, Nick Durham, Martin Elvis, Elizabeth C. Galle, Daniel E. Harris, David P. Huenemoerder, John C. Houck, Bish Ishibashi, Margarita Karovska, Fabrizio Nicastro, Michael S. Noble, Michael A. Nowak, Frank A. Primini, Aneta Siemiginowska, Randall K. Smith, and Michael Wise. CIAO: Chandra's data analysis system. In David R. Silva and Rodger E. Doxsey, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 6270, page 62701V, June 2006. doi:10.1117/12.671760.
[Has70] W. K. Hastings. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika, 57(1):97–109, April 1970. doi:10.1093/biomet/57.1.97.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi:10.1038/s41586-020-2649-2.
[Hog74] J. A. Högbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astronomy and Astrophysics Supplement, 15:417, June 1974.
[Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[JRM17] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 – seamless operability between C++11 and Python, 2017. https://github.com/pybind/pybind11.
[Luc74] L. B. Lucy. An iterative technique for the rectification of observed distributions. Astronomical Journal, 79:745, June 1974. doi:10.1086/111605.
[PR94] K. M. Perry and S. J. Reeves. Generalized Cross-Validation as a Stopping Rule for the Richardson-Lucy Algorithm. In Robert J. Hanisch and Richard L. White, editors, The Restoration of HST Images and Spectra – II, page 97, January 1994. doi:10.1002/ima.1850060412.
[R C20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL: https://www.R-project.org/.
[Ric72] William Hadley Richardson. Bayesian-Based Iterative Method of Image Restoration. Journal of the Optical Society of America (1917–1983), 62(1):55, January 1972. doi:10.1364/josa.62.000055.
[vdWSN+14] Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu, and the scikit-image contributors. scikit-image: image processing in Python. PeerJ, 2:e453, June 2014. doi:10.7717/peerj.453.
[wc15] Jupyter widgets community. ipywidgets, a GitHub repository. Retrieved from https://github.com/jupyter-widgets/ipywidgets, 2015.
Codebraid Preview for VS Code: Pandoc Markdown Preview with Jupyter Kernels

Geoffrey M. Poore

Abstract—Codebraid Preview is a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Unlike typical Markdown previews, all Pandoc features are fully supported because Pandoc itself generates the preview. The Markdown source and the preview are fully integrated with features like bidirectional scroll sync. The preview supports LaTeX math via KaTeX. Code blocks and inline code can be executed with Codebraid, using either its built-in execution system or Jupyter kernels. For executed code, any combination of the code and its output can be displayed in the preview as well as the final document. Code execution is non-blocking, so the preview always remains live and up-to-date even while code is still running.

Index Terms—reproducibility, dynamic report generation, literate programming, Python, Pandoc, Markdown, Project Jupyter

Introduction

Pandoc [JM22] is increasingly a foundational tool for creating scientific and technical documents. It provides Pandoc's Markdown and other Markdown variants that add critical features absent in basic Markdown, such as citations, footnotes, mathematics, and tables. At the same time, Pandoc simplifies document creation by providing conversion from Markdown (and other formats) to formats like LaTeX, HTML, Microsoft Word, and PowerPoint. Pandoc is especially useful for documents with embedded code that is executed during the build process. RStudio's RMarkdown [RSt20] and more recently Quarto [RSt22] leverage Pandoc to convert Markdown documents to other formats, with code execution provided by knitr [YX15]. JupyterLab [GP21] centers the writing experience around an interactive, browser-based notebook instead of a Markdown document, but still relies on Pandoc for export to formats other than HTML [Jup22]. There are also ways to interact with a Jupyter Notebook as a Markdown document, such as Jupytext [MWtJT20] and Pandoc's own native Jupyter support.

Writing with Pandoc's Markdown or a similar Markdown variant has advantages when multiple output formats are required, since Pandoc provides the conversion capabilities. Pandoc Markdown variants can also serve as a simpler syntax when creating HTML, LaTeX, or similar documents. They allow HTML and LaTeX to be intermixed with Markdown syntax. They also support including raw chunks of text in other formats such as reStructuredText. When executable code is involved, the RMarkdown-style approach of Markdown with embedded code can sometimes be more convenient than a browser-based Jupyter notebook since the writing process involves more direct interaction with the complete document source.
While using a Pandoc Markdown variant as a source format brings many advantages, the actual writing process itself can be less than ideal, especially when executable code is involved. Pandoc Markdown variants are so powerful precisely because they provide so many extensions to Markdown, but this also means that they can only be fully rendered by Pandoc itself. When text editors such as VS Code provide a built-in Markdown preview, typically only a small subset of Pandoc features is supported, so the representation of the document output will be inaccurate. Some editors provide a visual Markdown editing mode, in which a partially rendered version of the document is displayed in the editor and menus or keyboard shortcuts may replace the direct entry of Markdown syntax. These generally suffer from the same issue. This is only exacerbated when the document embeds code that is executed during the build process, since that goes even further beyond basic Markdown.

An alternative is to use Pandoc itself to generate HTML or PDF output, and then display this as a preview. Depending on the text editor used, the HTML or PDF might be displayed within the text editor in a panel beside the document source, or in a separate browser window or PDF viewer. For example, Quarto offers both possibilities, depending on whether RStudio, VS Code, or another editor is used.¹ While this approach resolves the inaccuracy issues of a basic Markdown preview, it also gives up features such as scroll sync that tightly integrate the Markdown source with the preview. In the case of executable code, there is the additional issue of a time delay in rendering the preview. Pandoc itself can typically convert even a relatively long document in under one second. However, when code is executed as part of the document build process, preview update is blocked until code execution completes.

This paper introduces Codebraid Preview, a VS Code extension that provides a live preview of Pandoc Markdown documents with optional support for executing embedded code. Codebraid Preview provides a Pandoc-based preview while avoiding most of the traditional drawbacks of this approach. The next section provides an overview of features. This is followed by sections focusing on scroll sync, LaTeX support, and code execution as examples of solutions and remaining challenges in creating a better Pandoc writing experience.

* Corresponding author: gpoore@uu.edu
‡ Union University

Copyright © 2022 Geoffrey M. Poore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

¹ The RStudio editor is unique in also offering a Pandoc-based visual editing mode, starting with version 1.4 from January 2021 (https://www.rstudio.com/blog/announcing-rstudio-1-4/).
Overview of Codebraid Preview

Codebraid Preview can be installed through the VS Code extension manager. Development is at https://github.com/gpoore/codebraid-preview-vscode. Pandoc must be installed separately (https://pandoc.org/). For code execution capabilities, Codebraid must also be installed (https://github.com/gpoore/codebraid).

The preview panel can be opened using the VS Code command palette, or by clicking the Codebraid Preview button that is visible when a Markdown document is open. The preview panel takes the document in its current state, converts it into HTML using Pandoc, and displays the result using a webview. An example is shown in Figure 1. Since the preview is generated by Pandoc, all Pandoc features are fully supported.

By default, the preview updates automatically whenever the Markdown source is changed. There is a short user-configurable minimum update interval. For shorter documents, sub-second updates are typical.

The preview uses the same styling CSS as VS Code's built-in Markdown preview, so it automatically adjusts to the VS Code color theme. For example, changing between light and dark themes changes the background and text colors in the preview.

Codebraid Preview leverages recent Pandoc advances to provide bidirectional scroll sync between the Markdown source and the preview for all CommonMark-based Markdown variants that Pandoc supports (commonmark, gfm, commonmark_x). By default, Codebraid Preview treats Markdown documents as commonmark_x, which is CommonMark with Pandoc extensions for features like math, footnotes, and special list types. The preview still works for other Markdown variants, but scroll sync is disabled. By default, scroll sync is fully bidirectional, so scrolling either the source or the preview will cause the other to scroll to the corresponding location. Scroll sync can instead be configured to be only from source to preview or only from preview to source. As far as I am aware, this is the first time that scroll sync has been implemented in a Pandoc-based preview.

The same underlying features that make scroll sync possible are also used to provide other preview capabilities. Double-clicking in the preview moves the cursor in the editor to the corresponding line of the Markdown source.

Since many Markdown variants support LaTeX math, the preview includes math support via KaTeX [EA22].

Codebraid Preview can simply be used for writing plain Pandoc documents. Optional execution of embedded code is possible with Codebraid [GMP19], using its built-in code execution system or Jupyter kernels. When Jupyter kernels are used, it is possible to obtain the same output that would be present in a Jupyter notebook, including rich output such as plots and mathematics. It is also possible to specify a custom display so that only a selected combination of code, stdout, stderr, and rich output is shown while the rest are hidden. Code execution is decoupled from the preview process, so the Markdown source can be edited and the preview can update even while code is running in the background. As far as I am aware, no previous software for executing code in Markdown has supported building a document with partial code output before execution has completed.

There is also support for document export with Pandoc, using the VS Code command palette or the export-with-Pandoc button.

Scroll sync

Tight source-preview integration requires a source map, or a mapping from characters in the source to characters in the output. Due to Pandoc's parsing algorithms, tracking source location during parsing is not possible in the general case.²

Pandoc 2.11.3 was released in December 2020. It added a sourcepos extension for CommonMark and formats based on it, including GitHub-Flavored Markdown (GFM) and commonmark_x (CommonMark plus extensions similar to Pandoc's Markdown). The CommonMark parser uses a different parsing algorithm from the Pandoc's Markdown parser, and this algorithm permits tracking source location. For the first time, it was possible to construct a source map for a Pandoc input format.
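To make the source map concrete, the following minimal sketch shells out to an installed Pandoc (2.11.3 or later is assumed) and requests sourcepos data for a tiny made-up document; the exact attribute names in the generated HTML (data-pos style source ranges) may vary between Pandoc versions and are shown here only as an assumption:

```python
# Minimal sketch: inspect the source-position data that the sourcepos
# extension attaches to CommonMark-based output.
import subprocess

doc = "# Title\n\nSome *emphasised* text.\n"
html = subprocess.run(
    ["pandoc", "--from=commonmark_x+sourcepos", "--to=html"],
    input=doc, capture_output=True, text=True, check=True,
).stdout
print(html)  # elements carry source-range attributes, e.g. data-pos="...1:1-1:8"
```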
Codebraid Preview defaults to commonmark_x as an input format, since it provides the most features of all CommonMark-based formats. Features continue to be added to commonmark_x and it is gradually nearing feature parity with Pandoc's Markdown. Citations are perhaps the most important feature currently missing.³

Codebraid Preview provides full bidirectional scroll sync between source and preview for all CommonMark-based formats, using data provided by sourcepos. In the output HTML, the first image or inline text element created by each Markdown source line is given an id attribute corresponding to the source line number. When the source is scrolled to a given line range, the preview scrolls to the corresponding HTML elements using these id attributes. When the preview is scrolled, the visible HTML elements are detected via the Intersection Observer API.⁴ Then their id attributes are used to determine the corresponding Markdown line range, and the source scrolls to those lines.

Scroll sync is slightly more complicated when working with output that is generated by executed code. For example, if a code block is executed and creates several plots in the preview, there isn't necessarily a way to trace each individual plot back to a particular line of code in the Markdown source. In such cases, the line range of the executed code is mapped proportionally to the vertical space occupied by its output.

Pandoc supports multi-file documents. It can be given a list of files to combine into a single output document. Codebraid Preview provides scroll sync for multi-file documents. For example, suppose a document is divided into two files in the same directory, chapter_1.md and chapter_2.md. Treating these as a single document involves creating a YAML configuration file _codebraid_preview.yaml that lists the files:

input-files:
- chapter_1.md
- chapter_2.md

Now launching a preview from either chapter_1.md or chapter_2.md will display a preview that combines both files. When the preview is scrolled, the editor scrolls to the corresponding source location, automatically switching between chapter_1.md and chapter_2.md depending on the part of the preview that is visible.

² See for example https://github.com/jgm/pandoc/issues/4565.
³ The Pandoc Roadmap at https://github.com/jgm/pandoc/wiki/Roadmap summarizes current commonmark_x capabilities.
⁴ For technical details, https://www.w3.org/TR/intersection-observer/. For an overview, https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API.
Fig. 1: Screenshot of a Markdown document with Codebraid Preview in VS Code. This document uses Codebraid to execute code with Jupyter kernels, so all plots and math visible in the preview are generated during document build.

The preview still works when the input format is set to a non-CommonMark format, but in that case scroll sync is disabled. If Pandoc adds sourcepos support for additional input formats in the future, scroll sync will work automatically once Codebraid Preview adds those formats to the supported list. It is possible to attempt to reconstruct a source map by performing a parallel string search on Pandoc output and the original source. This can be error-prone due to text manipulation during format conversion, but in the future it may be possible to construct a good enough source map to extend basic scroll sync support to additional input formats.

LaTeX support

Support for mathematics is one of the key features provided by many Markdown variants in Pandoc, including commonmark_x. Math support in the preview panel is supplied by KaTeX [EA22], which is a JavaScript library for rendering LaTeX math in the browser.

One of the disadvantages of using Pandoc to create the preview is that every update of the preview is a complete update. This makes the preview more sensitive to HTML rendering time. In contrast, in a Jupyter notebook, it is common to write Markdown in multiple cells which are rendered separately and independently. MathJax [Mat22] provides a broader range of LaTeX support than KaTeX, and is used in software such as JupyterLab and Quarto. While MathJax performance has improved significantly since the release of version 3.0 in 2019, KaTeX can still have a speed advantage, so it is currently the default due to the importance of HTML rendering. In the future, optional MathJax support may be needed to provide broader math support. For some applications, it may also be worth considering caching pre-rendered or image versions of equations to improve performance.

Code execution

Optional support for executing code embedded in Markdown documents is provided by Codebraid [GMP19]. Codebraid uses Pandoc to convert a document into an abstract syntax tree (AST), then extracts any inline or block code marked with Codebraid attributes from the AST, executes the code, and finally formats the code output so that Pandoc can use it to create the final output document. Code execution is performed with Codebraid's own built-in system or with Jupyter kernels. For example, the code block

```{.python .cb-run}
print("Hello *world!*")
```

would result in

Hello world!

after processing by Codebraid and finally Pandoc. The .cb-run is a Codebraid attribute that marks the code block for execution and specifies the default display of code output. Further examples of Codebraid usage are visible in Figure 1.

Mixing a live preview with executable code provides potential usability and security challenges. By default, code only runs when the user selects execution in the VS Code command palette or clicks the Codebraid execute button. When the preview automatically updates as a result of Markdown source changes, it only uses cached code output.
Stale cached output is detected by hashing executed code, and then marked in the preview to alert the user.

The standard approach to executing code within Markdown documents blocks the document build process until all code has finished running. Code is extracted from the Markdown source and executed. Then the output is combined with the original source and passed on to Pandoc or another Markdown application for final conversion. This is the approach taken by RMarkdown, Quarto, and similar software, as well as by Codebraid until recently. This design works well for building a document a single time, but blocking until all code has executed is not ideal in the context of a document preview.

Codebraid now offers a new mode of code execution that allows a document to be rebuilt continuously during code execution, with each build including all code output available at that time. This process involves the following steps:

1) The user selects code execution. Codebraid Preview passes the document to Codebraid. Codebraid begins code execution.
2) As soon as any code output is available, Codebraid immediately streams this back to Codebraid Preview. The output is in a format compatible with the YAML metadata block at the start of Pandoc Markdown documents. The output includes a hash of the code that was executed, so that code changes can be detected later.
3) If the document is modified while code is running or if code output is received, Codebraid Preview rebuilds the preview. It creates a copy of the document with all current Codebraid output inserted into the YAML metadata block at the start of the document. This modified document is then passed to Pandoc. Pandoc runs with a Lua filter⁵ that modifies the document AST before final conversion. The filter removes all code marked with Codebraid attributes from the AST, and replaces it with the corresponding code output stored in the AST metadata. If code has been modified since execution began, this is detected with the hash of the code, and an HTML class is added to the output that will mark it visually as stale output. Code that does not yet have output is replaced by a visible placeholder to indicate that code is still running. When the Lua filter finishes AST modifications, Pandoc completes the document build, and the preview updates.
4) As long as code is executing, the previous process repeats whenever the preview needs to be rebuilt.
5) Once code execution completes, the most recent output is reused for all subsequent preview updates until the next time the user chooses to execute code. Any code changes continue to be detected by hashing the code during the build process, so that the output can be marked visually as stale in the preview.
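The hash-based staleness check in steps 2 and 5 can be illustrated with a small sketch. This is not Codebraid's actual implementation, only the general idea of keying cached output by a hash of the code that produced it:

```python
# Minimal sketch: cached output is looked up by a hash of the code chunk,
# so any edit to the chunk makes the cached result count as stale.
import hashlib

def code_hash(code: str) -> str:
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

cached_output = {code_hash('print("Hello *world!*")'): "Hello *world!*"}

def output_for(chunk: str):
    key = code_hash(chunk)
    if key in cached_output:
        return cached_output[key], False       # up to date
    return "<calculating output...>", True     # stale or still running

print(output_for('print("Hello *world!*")'))
print(output_for('print("Hello, edited world!")'))
```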
The overall result of this process is twofold. First, building a document involving executed code is nearly as fast as building a plain Pandoc document. The additional output metadata plus the filter are the only extra elements involved in the document build, and Pandoc Lua filters have excellent performance. Second, the output for each code chunk appears in the preview almost immediately after the chunk finishes execution.

While this build process is significantly more interactive than what has been possible previously, it also suggests additional avenues for future exploration. Codebraid's built-in code execution system is designed to execute a predefined sequence of code chunks and then exit. Jupyter kernels are currently used in the same manner to avoid any potential issues with out-of-order execution. However, Jupyter kernels can receive and execute code indefinitely, which is how they commonly function in Jupyter notebooks. Instead of starting a new Jupyter kernel at the beginning of each code execution cycle, it would be possible to keep the kernel from the previous execution cycle and only pass modified code chunks to it. This would allow the same out-of-order execution issues that are possible in a Jupyter notebook. Yet that would make possible much more rapid code output, particularly in cases where large datasets must be loaded or significant preprocessing is required.

Conclusion

Codebraid Preview represents a significant advance in tools for writing with Pandoc. For the first time, it is possible to preview a Pandoc Markdown document using Pandoc itself while having features like scroll sync between the Markdown source and the preview. When embedded code needs to be executed, it is possible to see code output in the preview and to continue editing the document during code execution, instead of having to wait until code finishes running.

Codebraid Preview or future previewers that follow this approach may be perfectly adequate for shorter and even some longer documents, but at some point a combination of document length, document complexity, and mathematical content will strain what is possible and ultimately decrease preview update frequency. Every update of the preview involves converting the entire document with Pandoc and then rendering the resulting HTML.

On the parsing side, Pandoc's move toward CommonMark-based Markdown variants may eventually lead to enough standardization that other implementations with the same syntax and features are possible. This in turn might enable entirely new approaches. An ideal scenario would be a Pandoc-compatible JavaScript-based parser that can parse multiple Markdown strings while treating them as having a shared document state for things like labels, references, and numbering. For example, this could allow Pandoc Markdown within a Jupyter notebook, with all Markdown content sharing a single document state, maybe with each Markdown cell being automatically updated based on Markdown changes elsewhere.

Perhaps more practically, on the preview display side, there may be ways to optimize how the HTML generated by Pandoc is loaded in the preview. A related consideration might be alternative preview formats. There is a significant tradition of tight source-preview integration in LaTeX (for example, [Lau08]). In principle, Pandoc's sourcepos extension should make possible Markdown to PDF synchronization, using LaTeX as an intermediary.

⁵ For an overview of Lua filters, see https://pandoc.org/lua-filters.html.

REFERENCES

[EA22] Emily Eisenberg and Sophie Alpert. KaTeX: The fastest math typesetting library for the web, 2022. URL: https://katex.org/.
[GMP19] Geoffrey M. Poore. Codebraid: Live Code in Pandoc Markdown. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 54–61, 2019. doi:10.25080/Majora-7ddc1dd1-008.
[GP21] Brian E. Granger and Fernando Pérez. Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2):7–14, 2021. doi:10.1109/MCSE.2021.3059263.
[JM22] John MacFarlane. Pandoc: a universal document converter, 2006–2022. URL: https://pandoc.org/.
[Jup22] Jupyter Development Team. nbconvert: Convert Notebooks to other formats, 2015–2022. URL: https://nbconvert.readthedocs.io.
[Lau08] Jérôme Laurens. Direct and reverse synchronization with SyncTeX. TUGboat, 29(3):365–371, 2008.
[Mat22] MathJax. MathJax: Beautiful and accessible math in all browsers, 2009–2022. URL: https://www.mathjax.org/.
[MWtJT20] Marc Wouts and the Jupytext Team. Jupyter notebooks as Markdown documents, Julia, Python or R scripts, 2018–2020. URL: https://jupytext.readthedocs.io/.
[RSt20] RStudio Inc. R Markdown, 2016–2020. URL: https://rmarkdown.rstudio.com/.
[RSt22] RStudio Inc. Welcome to Quarto, 2022. URL: https://quarto.org/.
[YX15] Yihui Xie. Dynamic Documents with R and knitr. Chapman & Hall/CRC Press, 2015.

Incorporating Task-Agnostic Information in Task-Based Active Learning Using a Variational Autoencoder

Curtis Godwin, Meekail Zain, Nathan Safir, Bella Humphrey, Shannon P Quinn

Abstract—It is often much easier and less expensive to collect data than to label it. Active learning (AL) ([Set09]) responds to this issue by selecting which unlabeled data are best to label next. Standard approaches utilize task-aware AL, which identifies informative samples based on a trained supervised model. Task-agnostic AL ignores the task model and instead makes selections based on learned properties of the dataset. We seek to combine these approaches and measure the contribution of incorporating task-agnostic information into standard AL, with the suspicion that the extra information in the task-agnostic features may improve the selection process. We test this on various AL methods using a ResNet classifier with and without added unsupervised information from a variational autoencoder (VAE). Although the results do not show a significant improvement, we investigate the effects on the acquisition function and suggest potential approaches for extending the work.

Index Terms—active learning, variational autoencoder, deep learning, pytorch, semi-supervised learning, unsupervised learning

Introduction

In deep learning, the capacity for data gathering often significantly outpaces the labeling. This is easily observed in the field of bioimaging, where ground-truth labeling usually requires the expertise of a clinician. For example, producing a large quantity of CT scans is relatively simple, but having them labeled for COVID-19 by cardiologists takes much more time and money. These constraints ultimately limit the contribution of deep learning to many crucial research problems.

This labeling issue has compelled advancements in the field of active learning (AL) ([Set09]). In a typical AL setting, there is a set of labeled data and a (usually larger) set of unlabeled data. A model is trained on the labeled data, then the model is analyzed to evaluate which unlabeled points should be labeled to best improve the loss objective after further training. AL acknowledges labeling constraints by specifying a budget of points that can be labeled at a time and evaluating against this budget.
In AL, the model for which we select new labels is referred to as the task model. If this model is a classifier neural network, the space in which it maps inputs before classifying them is known as the latent space or representation space. A recent branch of AL ([SS18], [SCN+18], [YK19]), prominent for its applications to deep models, focuses on mapping unlabeled points into the task model's latent space before comparing them.

These methods are limited in their analysis by the labeled data they must train on, failing to make use of potentially useful information embedded in the unlabeled data. We therefore suggest that this family of methods may be improved by extending their representation spaces to include unsupervised features learned over the entire dataset. For this purpose, we opt to use a variational autoencoder (VAE) ([KW13]), which is a prominent method for unsupervised representation learning. Our main contributions are (a) a new methodology for extending AL methods using VAE features and (b) an experiment comparing AL performance across two recent feature-based AL methods using the new method.

† These authors contributed equally.
* Corresponding author: cmgodwin263@gmail.com, meekail.zain@uga.edu
‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA
§ Department of Computer Science, University of Georgia, Athens, GA 30602 USA
¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 USA

Copyright © 2022 Curtis Godwin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Related Literature

Active learning

Much of the early active learning (AL) literature is based on shallower, less computationally demanding networks since deeper architectures were not well-developed at the time. Settles ([Set09]) provides a review of these early methods. The modern approach uses an acquisition function, which involves ranking all available unlabeled points by some chosen heuristic H and choosing to label the points of highest ranking.

The popularity of the acquisition approach has led to a widely-used evaluation procedure, which we describe in Algorithm 1. This procedure trains a task model T on the initial labeled data, records its test accuracy, then uses H to label a set of unlabeled points. We then once again train T on the labeled data and record its accuracy. This is repeated until a desired number of labels is reached, and then the accuracies can be graphed against the number of available labels to demonstrate performance over the course of labeling. We can use this evaluation algorithm to separately evaluate multiple acquisition functions on their resulting accuracy graphs. This is utilized in many AL papers to show the efficacy of their suggested heuristics in comparison to others ([WZL+16], [SS18], [SCN+18], [YK19]).
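The evaluation loop can be sketched as follows; this is an illustration of the procedure described above rather than the authors' code, with the task-model training, test evaluation, and heuristic H passed in as placeholder callables:

```python
# Minimal sketch of the accuracy-vs-labels evaluation procedure.
def active_learning_curve(train_fn, eval_fn, heuristic,
                          labeled, unlabeled, budget=3000, steps=8):
    accuracies = []
    for _ in range(steps):
        train_fn(labeled)                           # (re)train task model T on current labels
        accuracies.append(eval_fn())                # record its test accuracy
        ranked = sorted(unlabeled, key=heuristic, reverse=True)
        chosen, unlabeled = ranked[:budget], ranked[budget:]
        labeled = labeled + chosen                  # "label" the top-ranked points
    return accuracies
```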
More specifically, for each batch of labeled data Lbatch ⊂ L The prevailing approach to point selection has been to choose that is propagated through T during training, the batch of true unlabeled points for which the model is most uncertain, the as- losses is computed and split randomly into a batch of pairs Pbatch . sumption being that uncertain points will be the most informative The loss prediction network produces a corresponding batch of ([BRK21]). A popular early method was to label the unlabeled predicted loss pairs, denoted Pebatch . The following pair loss is then points of highest Shannon entropy ([Sha48]) under the task model, computed given each p ∈ Pbatch and its corresponding p̃ ∈ Pebatch : which is a measure of uncertainty between the classes of the data. This method is now more commonly used in combination L pair (p, p̃) = max(0, −I (p) · ( p̃(1) − p̃(2) ) + ξ ), (3) with a representativeness measure ([WZL+ 16]) to avoid selecting where I is the following indicator function for pair inequality: condensed clusters of very similar points. ( 1, p(1) > p(2) I (p) = . (4) Recent heuristics using deep features −1, p(1) ≤ p(2) For convolutional neural networks (CNNs) in image classification settings, the task model T can be decomposed into a feature- Variational Autoencoders generating module Variational autoencoders (VAEs) ([KW13]) are an unsupervised T f : Rn → R f , method for modeling data using Bayesian posterior inference. We begin with the Bayesian assumption that the data is well- which maps the input data vectors to the output of the final fully modeled by some distribution, often a multivariate Gaussian. We connected layer before classification, and a classification module also assume that this data distribution can be inferred reasonably well by a lower dimensional random variable, also often modeled Tc : R f → {0, 1, ..., c}, by a multivariate Gaussian. where c is the number of classes. The inference process then consists of an encoding into the Recent deep learning-based AL methods have approached the lower dimensional latent variable, followed by a decoding back notion of model uncertainty in terms of the rich features generated into the data dimension. We parametrize both the encoder and the by the learned model. Core-set ([SS18]) and MedAL ([SCN+ 18]) decoder as neural networks, jointly optimizing their parameters select unlabeled points that are the furthest from the labeled set with the following loss function ([KW19]): in terms of L2 distance between the learned features. For core-set, Lθ ,φ (x) = log pθ (x|z) + [log pθ (z) − log qφ (z|x)], (5) each point constructing the set S in step 6 of Algorithm 1 is chosen by where θ and φ are the parameters of the encoder and the decoder, u∗ = argmax min ||(T f (u) − T f (``))||2 , (1) respectively. The first term is the reconstruction error, penalizing u∈U ` ∈L the parameters for producing poor reconstructions of the input where U is the unlabeled set and L is the labeled set. The data. The second term is the regularization error, encouraging the analogous operation for MedAL is encoding to resemble a pre-selected prior distribution, commonly a unit Gaussian prior. 1 |L| The encoder of a well-optimized VAE can be used to gen- u∗ = argmax u∈U ∑ ||T f (u) − T f (Li )||2 . |L| i=1 (2) erate latent encodings with rich features which are sufficient to approximately reconstruct the data. 
The features also have some Note that after a point u∗ is chosen, the selection of the next point geometric consistency, in the sense that the encoder is encouraged assumes the previous u∗ to be in the labeled set. This way we to generate encodings in the pattern of a Gaussian distribution. discourage choosing sets that are closely packed together, leading to sets that are more diverse in terms of their features. This effect is more pronounced in the core-set method since it takes the Methods minimum distance whereas MedAL uses the average distance. We observe that the notions of uncertainty developed in the core- Another recent method ([YK19]) trains a regression network set and MedAL methods rely on distances between feature vectors to predict the loss of the task model, then takes the heuristic H modeled by the task model T . Additionally, loss prediction relies in Algorithm 1 to select the unlabeled points of highest predicted on a fully connected layer mapping from a feature space to a single loss. To implement this, the loss prediction network P is attached value, producing different predictions depending on the values of to a ResNet task model T and is trained jointly with T . The the relevant feature vector. Thus all of these methods utilize spatial inputs to P are the features output by the ResNet’s four residual reasoning in a vector space. blocks. These features are mapped into the same dimensionality Furthermore, in each of these methods, the heuristic H only via a fully connected layer and then concatenated to form a has access to information learned by the task model, which is 112 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) trained only on the labeled points at a given timestep in the la- ensure that the task models being compared were supplied with beling procedure. Since variational autoencoder (VAE) encodings the same initial set of labels. are not limited by the contents of the labeled set, we suggest that With four NVIDIA 2080 GPUs, the total runtime for the the aforementioned methods may benefit by expanding the vector MNIST experiments was 5113s for core-set and 4955s for loss spaces they investigate to include VAE features learned across prediction; for ChestMNIST, the total runtime was 7085s for core- the entire dataset, including the unlabeled data. These additional set and 7209s for loss prediction. features will constitute representative and previously inaccessible information regarding the data, which may improve the active learning process. We implement this by first training a VAE model V on the given dataset. V can then be used as a function returning the VAE features for any given datapoint. We append these additional features to the relevant vector spaces using vector concatenation, an operation we denote with the symbol _. The modified point selection operation in core-set then becomes u∗ = argmax min ||([T f (u) _ αV (u)] − [T f (``) _ αV (``)]||2 , u∈U ` ∈L (6) where α is a hyperparameter that scales the influence of the VAE features in computing the vector distance. To similarly modify the loss prediction method, we concatenate the VAE features to the Fig. 1: The average MNIST results using the core-set heuristic versus final ResNet feature concatenation c before the loss prediction, the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs. so that the extra information is factored into the training of the prediction network P. 
Experiments

In order to measure the efficacy of the newly proposed methods, we generate accuracy graphs using Algorithm 1, freezing all settings except the selection heuristic H. We then compare the performance of the core-set and loss prediction heuristics with their VAE-augmented counterparts.

We use ResNet-18 pretrained on ImageNet as the task model, using the SGD optimizer with learning rate 0.001 and momentum 0.9. We train on the MNIST ([Den12]) and ChestMNIST ([YSN21]) datasets. ChestMNIST consists of 112,120 chest X-ray images resized to 28x28 and is one of several benchmark medical image datasets introduced in ([YSN21]).

For both datasets we experiment on randomly selected subsets, using 25000 points for MNIST and 30000 points for ChestMNIST. In both cases we begin with 3000 initial labels and label 3000 points per active learning step. We opt to retrain the task model after each labeling step instead of fine-tuning.

We use a similar training strategy as in ([SCN+18]), training the task model until >99% train accuracy before selecting new points to label. This ensures that the ResNet is similarly well fit to the labeled data at each labeling iteration. This is implemented by training for 10 epochs on the initial training set and increasing the training epochs by 5 after each labeling iteration.

The VAEs used for the experiments are trained for 20 epochs using an Adam optimizer with learning rate 0.001 and weight decay 0.005. The VAE encoder architecture consists of four convolutional downsampling filters and two linear layers to learn the low dimensional mean and log variance. The decoder consists of an upsampling convolution and four size-preserving convolutions to learn the reconstruction.

Experiments were run five times, each with a separate set of randomly chosen initial labels, with the displayed results showing the average validation accuracies across all runs. Figures 1 and 3 show the core-set results, while Figures 2 and 4 show the loss prediction results. In all cases, shared random seeds were used to ensure that the task models being compared were supplied with the same initial set of labels.

With four NVIDIA 2080 GPUs, the total runtime for the MNIST experiments was 5113s for core-set and 4955s for loss prediction; for ChestMNIST, the total runtime was 7085s for core-set and 7209s for loss prediction.

Fig. 1: The average MNIST results using the core-set heuristic versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.

Fig. 2: The average MNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs.

Fig. 3: The average ChestMNIST results using the core-set heuristic versus the VAE-augmented core-set heuristic for Algorithm 1 over 5 runs.

Fig. 4: The average ChestMNIST results using the loss prediction heuristic versus the VAE-augmented loss prediction heuristic for Algorithm 1 over 5 runs.

To investigate the qualitative difference between the VAE and non-VAE approaches, we performed an additional experiment to visualize an example of core-set selection. We first train the ResNet-18 with the same hyperparameter settings on 1000 initial labels from the ChestMNIST dataset, then randomly choose 1556 (5%) of the unlabeled points from which to select 100 points to label. These smaller sizes were chosen to promote visual clarity in the output graphs. We use t-SNE ([VdMH08]) dimensionality reduction to show the ResNet features of the labeled set, the unlabeled set, and the points chosen to be labeled by core-set.
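A sketch of this qualitative check with scikit-learn's t-SNE follows; the feature array and selection indices below are random stand-ins for the actual ResNet features and core-set selections:

```python
# Minimal sketch: project features with t-SNE and highlight the selected points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(1656, 512)                           # stand-in for ResNet features
chosen = np.random.choice(len(features), 100, replace=False)   # stand-in for core-set picks

embedded = TSNE(n_components=2, init="pca").fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], s=5, alpha=0.3, label="all points")
plt.scatter(embedded[chosen, 0], embedded[chosen, 1], s=15, label="selected by core-set")
plt.legend()
plt.show()
```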
Fig. 5: A t-SNE visualization of the ChestMNIST points chosen by core-set.

Fig. 6: A t-SNE visualization of the ChestMNIST points chosen by core-set when the ResNet features are augmented with VAE features.

Discussion

Overall, the VAE-augmented active learning heuristics did not exhibit a significant performance difference when compared with their counterparts. The only case of a significant p-value (<0.05) occurred during loss prediction on the MNIST dataset at 21000 labels.

The t-SNE visualizations in Figures 5 and 6 show some of the influence that the VAE features have on the core-set selection process. In Figure 5, the selected points tend to be more spread out, while in Figure 6 they cluster at one edge. This appears to mirror the transformation of the rest of the data, which is more spread out without the VAE features, but becomes condensed in the center when they are introduced, approaching the shape of a Gaussian distribution. It seems that with the added VAE features, the selected points are further out of distribution in the latent space. This makes sense because points tend to be more sparse at the tails of a Gaussian distribution and core-set prioritizes points that are well-isolated from other points.

One reason for the lack of performance improvement may be the homogeneous nature of the VAE, where the optimization goal is reconstruction rather than classification. This could be improved by using a multimodal prior in the VAE, which may do a better job of modeling relevant differences between points.

Conclusion

Our original intuition was that additional unsupervised information may improve established active learning methods, especially when using a modern unsupervised representation method such as a VAE. The experimental results did not support this hypothesis, but additional investigation of the VAE features showed a notable change in the task model latent space. Though this did not result in superior point selections in our case, it is of interest whether different approaches to latent space augmentation in active learning may fare better.

Future work may explore the use of class-conditional VAEs in a similar application, since a VAE that can utilize the available class labels may produce more effective representations, and it could be retrained along with the task model after each labeling iteration.

REFERENCES

[BRK21] Samuel Budd, Emma C Robinson, and Bernhard Kainz. A survey on active learning and human-in-the-loop deep learning for medical image analysis. Medical Image Analysis, 71:102062, 2021. doi:10.1016/j.media.2021.102062.
[Den12] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. doi:10.1109/MSP.2012.2211477.
[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[KW19] Diederik P. Kingma and Max Welling. An Introduction to Variational Autoencoders. Now Publishers, 2019. doi:10.1561/9781680836233.
[SCN+18] Asim Smailagic, Pedro Costa, Hae Young Noh, Devesh Walawalkar, Kartik Khandelwal, Adrian Galdran, Mostafa Mirshekari, Jonathon Fagert, Susu Xu, Pei Zhang, et al. MedAL: Accurate and robust deep active learning for medical image analysis. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 481–488. IEEE, 2018. doi:10.1109/icmla.2018.00078.
[Set09] Burr Settles. Active learning literature survey. 2009.
[Sha48] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
[SS18] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=H1aIuk-RW.
[VdMH08] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[WZL+16] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016. doi:10.1109/tcsvt.2016.2589879.
[YK19] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019. doi:10.1109/CVPR.2019.00018.
[YSN21] Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195, 2021. doi:10.1109/ISBI48211.2021.9434062.

Awkward Packaging: building Scikit-HEP

Henry Schreiner, Jim Pivarski, Eduardo Rodrigues

Abstract—Scikit-HEP has grown rapidly over the last few years, not just to serve the needs of the High Energy Physics (HEP) community, but in many ways, the Python ecosystem at large. AwkwardArray, boost-histogram/hist, and iminuit are examples of libraries that are used beyond the original HEP focus. In this paper we will look at key packages in the ecosystem, and how the collection of 30+ packages was developed and maintained. Also we will look at some of the software ecosystem contributions made to packages like cibuildwheel, pybind11, nox, scikit-build, build, and pipx that support this effort. We will also discuss the Scikit-HEP developer pages and initial WebAssembly support.

Index Terms—packaging, ecosystem, high energy physics, community project

Introduction

High Energy Physics (HEP) has always had intense computing needs due to the size and scale of the data collected. The World Wide Web was invented at the CERN Physics laboratory in Switzerland in 1989 when scientists in the EU were trying to communicate results and datasets with scientists in the US, and vice-versa [LCC+09]. Today, HEP has the largest scientific machine in the world, at CERN: the Large Hadron Collider (LHC), 27 km in circumference [EB08], with multiple experiments with thousands of collaborators processing over a petabyte of raw data every day, with 100 petabytes being stored per year at CERN. This is one of the largest scientific datasets in the world, of exabyte scale [PJ11], which is roughly comparable in order of magnitude to all of astronomy or YouTube [SLF+15].
Some of these were HEP machine in the world, at CERN: the Large Hadron Collider (LHC), specific: ROOT is also a data format, so users needed to be able 27 km in circumference [EB08], with multiple experiments with to read data from ROOT files. Others were less specific: HEP thousands of collaborators processing over a petabyte of raw data users have intense histogram requirements due to the data sizes, every day, with 100 petabytes being stored per year at CERN. This large portions of HEP data are "jagged" rather than rectangular; is one of the largest scientific datasets in the world of exabyte scale vector manipulation was important (especially Lorenz Vectors, a [PJ11], which is roughly comparable in order of magnitude to all four dimensional relativistic vector with a non-Euclidean metric); of astronomy or YouTube [SLF+ 15]. and data fitting was important, especially with complex models In the mid nineties, HEP users were beginning to look for and accurate error estimation. a new language to replace Fortran. A few HEP scientists started investigating the use of Python around the release of 1.0.0 in 1994 Beginnings of a scikit [Tem22]. A year later, the ROOT project for an analysis toolkit (and framework) was released, quickly making C++ the main In 2016, the ecosystem for Python in HEP was rather fragmented. language for HEP. The ROOT project also needed an interpreted Physicists were developing tools in isolation, without knowing language to driving analysis code. Python was rejected for this role out the overlaps with other tools, and without making them due to being "exotic" at the time, and because it was considered too interoperable. There were a handful of popular packages that much to ask physicists to code in two languages. Instead, ROOT were useful in HEP spread around among different authors. The provided a C++ interpreter, called CINT, which later was replaced ROOTPy project had several packages that made the ROOT- with Cling, which is the basis for the clang-repl project in LLVM Python bridge a little easier than the built-in PyROOT, such as the today [IVL22]. root-numpy and related root-pandas packages. The C++ MINUIT Python would start showing up in the late 90’s in experiment fitting library was integrated into ROOT, but the iminuit package frameworks as a configuration language. These frameworks were [Dea20] provided an easy to install standalone Python package primarily written in C++, but were made of many configurable with an extracted copy of MINUIT. Several other specialized standalone C++ packages had bindings as well. Many of the initial * Corresponding author: henryfs@princeton.edu authors were transitioning to a less-code centric role or leaving ‡ Princeton University § University of Liverpool for industry, leaving projects like ROOTPy and iminuit without maintainers. Copyright © 2022 Henry Schreiner et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, 1. Almost 20 years later ROOT’s Python bindings have been rewritten for which permits unrestricted use, distribution, and reproduction in any medium, easier Pythonizations, and installing ROOT in Conda is now much easier, provided the original author and source are credited. thanks in large part to efforts from Scikit-HEP developers. 116 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) later writer) that could remove the initial conversion environment by simply pip installing a package. 
It also had a simple, Pythonic numpythia interface and produced outputs Python users could immediately use, like NumPy arrays, instead of PyROOT’s wrapped C++ pyhepmc nndrone pointers. Uproot needed to do more than just be file format reader/writer; it needed to provide a way to represent the special pylhe structure and common objects that ROOT files could contain. This lead to the development of two related packages that would hepunits support uproot. One, uproot-methods, included Pythonic access to functionality provided by ROOT for its core classes, like spatial and Lorentz vectors. The other was AwkwardArray, which would uhi grow to become one of the most important and most general histoprint packages in Scikit-HEP. This package allows NumPy-like idioms for array-at-a-time manipulation on jagged data structures. A jagged array is a (possibly structured) array with a variable length dimension. These are very common and relevant in HEP; events have a variable number of tracks, tracks have a variable number Fig. 1: The Scikit-HEP ecosystem and affiliated packages. of hits in the detector, etc. Many other fields also have jagged data structures. While there are formats to store such structures, computations on jagged structures have usually been closer to SQL Eduardo Rodrigues, a scientist working on the LHCb ex- queries on multiple tables than direct object manipulation. Pandas periment for the University of Cincinnati, started working on a handles this through multiple indexing and a lot of duplication. package called scikit-hep that would provide a set to tools useful Uproot was a huge hit with incoming HEP students (see Fig 2); for physicists working on HEP analysis. The initial version of the suddenly they could access HEP data using a library installed with scikit-hep package had a simple vector library, HEP related units pip or conda and no external compiler or library requirements, and and conversions, several useful statistical tools, and provenance could easily use tools they already knew that were compatible with recording functionality, the Python buffer protocol, like NumPy, Pandas and the rapidly He also placed the scikit-hep GitHub repository in a Scikit- growing machine learning frameworks. There were still some gaps HEP GitHub organization, and asked several of the other HEP and pain points in the ecosystem, but an analysis without writing related packages to join. The ROOTPy project was ending, with C++ (interpreted or compiled) and compiling ROOT manually was the primary author moving on, and so several of the then-popular finally possible. Scikit-HEP did not and does not intend to replace packages2 that were included in the ROOTPy organization were ROOT, but it provides alternative solutions that work natively in happily transferred to Scikit-HEP. Several other existing HEP the Python "Big Data" ecosystem. libraries, primarily interfacing to existing C++ simulation and Several other useful HEP libraries were also written. Particle tracking frameworks, also joined, like PyJet and NumPythia. Some was written for accessing the Particle Data Group (PDG) particle of these libraries have been retired or replaced today, but were an data in a simple and Pythonic way. DecayLanguage originally important part of Scikit-HEP’s initial growth. provided tooling for decay definitions, but was quickly expanded to include tools to read and validate "DEC" decay files, an existing First initial success text format used to configure simulations in HEP. 
In 2016, the largest barrier to using Python in HEP in a Pythonic way was ROOT. It was challenging to compile, had many non- Building compiled packages Python dependencies, was huge compared to most Python li- braries, and didn’t play well with Python packaging. It was not In 2018, HEP physicist and programmer Hans Dembinski pro- Pythonic, meaning it had very little support for Python protocols posed a histogram library to the Boost libraries, the most influen- like iteration, buffers, keyword arguments, tab completion and tial C++ library collection; many additions to the standard library inspect in, dunder methods, didn’t follow conventions for useful are based on Boost. Boost.Histogram provided a histogram-as- reprs, and Python naming conventions; it was simply a direct on- an-object concept from HEP, but was designed around C++14 demand C++ binding, including pointers. Many Python analyses templating, using composable axes and storage types. It originally started with a "convert data" step using PyROOT to read ROOT had an initial Python binding, written in Boost::Python. Henry files and convert them to a Python friendly format like HDF5. Schreiner proposed the creation of a standalone binding to be Then the bulk of the analysis would use reproducible Python written with pybind11 in Scikit-HEP. The original bindings were virtual environments or Conda environments. removed, Boost::Histogram was accepted into the Boost libraries, This changed when Jim Pivarski introduced the Uproot pack- and work began on boost-histogram. IRIS-HEP, a multi-institution age, a pure-Python implementation of a ROOT file reader (and project for sustainable HEP software, had just started, which was providing funding for several developers to work on Scikit-HEP 2. The primary package of the ROOTPy project, also called ROOTPy, was project packages such as this one. This project would pioneer not transferred, but instead had a final release and then died. It was an standalone C++ library development and deployment for Scikit- inspiration for the new PyROOT bindings, and influenced later Scikit-HEP HEP. packages like mplhep. The transferred libraries have since been replaced by integrated ROOT functionality. All these packages required ROOT, which is There were already a variety of attempts at histogram libraries, not on PyPI, so were not suited for a Python-centric ecosystem. but none of them filled the requirements of HEP physicists: AWKWARD PACKAGING: BUILDING SCIKIT-HEP 117 ROOT (C++ and PyROOT) (as a baseline for scale) Scientific Python P HE Scikit-HEP in on CMSSW config th (Python but not data analysis) Py c ntifi ie Sc PyROOT of ag es e ack Us EPp kit -H ci of S Use Fig. 2: Adoption of scientific Python libraries and Scikit-HEP among members of the CMS experiment (one of the four major LHC experiments). CMS requires users to fork github:cms-sw/cmssw, which can be used to identify 3484 physicist users, who created 16656 non-fork repos. This plot quantifies adoption by counting "#include X", "import X", and "from X import" strings in the users’ code to measure adoption of various libraries (most popular by category are shown). bo lhep gram, com mainstream Python adoption to in HEP: when many histogram hist st::His libraries lived and died , mp Boo ROOT histogram part of ROOT (395 C++ files) YODA histograms histograms YODA in rootpy in Coffea Fig. 3: Developer activity on histogram libraries in HEP: number of unique committers to each library per month, smoothed (derived from git logs). 
Illustrates the convergence of a fractured community (around 2017) into a unified one (now). fills on pre-existing histograms, simple manipulation of multi- pybind11. dimensional histograms, competitive performance, and easy to The first stand-alone development was azure-wheel-helpers, a install in clusters or for students. Any new attempt here would set of files that helped produce wheels on the new Azure Pipelines have to be clearly better than the existing collection of diverse platform. Building redistributable wheels requires a variety of attempts (see Fig 3). The development of a library with compiled techniques, even without shared libraries, that vary dramatically components intended to be usable everywhere required good between platforms and were/are poorly documented. On Linux, support for building libraries that was lacking both in Scikit- everything needs to be built inside a controlled manylinux image, HEP and to an extent the broader Python ecosystem. Previous and post-processed by the auditwheel tool. On macOS, this in- advancements in the packaging ecosystem, such as the wheel cludes downloading an official CPython binary for Python to allow format for distributing binary platform dependent Python packages older versions of macOS to be targeted (10.9+), several special and the manylinux specification and docker image that allowed a environment variables, especially when cross compiling to Apple single compiled wheel to target many distributions of Linux, but Silicon, and post processing with the develwheel tool. Windows is there still were many challenges to making a library redistributable the simplest, as most versions of CPython work identically there. on all platforms. azure-wheel-helpers worked well, and was quickly adapted for The boost-histogram library only depended on header-only the other packages in Scikit-HEP that included non-ROOT binary components of the Boost libraries, and the header-only pybind11 components. Work here would eventually be merged into the package, so it was able to avoid a separate compile step or existing and general cibuildwheel package, which would become linking to external dependencies, which simplified the initial build the build tool for all non-ROOT binary packages in Scikit-HEP, as process. All needed files were collected from git submodules and well as over 600 other packages like matplotlib and numpy, and packed into a source distribution (SDist), and everything was built was accepted into the PyPA (Python Packaging Authority). using only setuptools, making build-from-source simple on any The second major development was the upstreaming of CI system supporting C++14. This did not include RHEL 7, a popular and build system developments to pybind11. Pybind11 is a C++ platform in HEP at the time, and on any platform building could API for Python designed for writing a binding to C++, and take several minutes and required several gigabytes of memory provided significant benefits to our packages over (mis)-using to resolve the heavy C++ templating in the Boost libraries and Cython for bindings; Cython was designed to transpile a Python- 118 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) like language to C (or C++), and just happened to support bindings since you can call C and C++ from it, but it was not what it Boost::Histogram was designed for. Benefits of pybind11 included reduced code thin wrapper complexity and duplication, no pre-process step (cythonize), no need to pin NumPy when building, and a cross-package API. 
The boost-histogram iMinuit package was later moved from Cython to pybind11 as fully featured well, and pybind11 became the Scikit-HEP recommended binding tool. We contributed a variety of fixes and features to pybind11, hist including positional-only and keyword-only arguments, the option plotting in to prepend to the overload chain, and an API for type access Matplotlib and manipulation. We also completely redesigned CMake inte- gration, added a new pure-Setuptools helpers file, and completely mplhep plotting in terminal redesigned the CI using GitHub Actions, running over 70 jobs on a variety of systems and compilers. We also helped modernize and histoprint improve all the example projects with simpler builds, new CI, and cibuildwheel support. This example of a project with binary components being Fig. 4: The collection of histogram packages and related packages in usable everywhere then encouraged the development of Awkward Scikit-HEP. 1.0, a rewrite of AwkwardArray replacing the Python-only code with compiled code using pybind11, fixing some long-standing limitations, like an inability to slice past two dimensions or select broader HEP ecosystem. The affiliated classification is also used "n choose k" for k > 5; these simply could not be expressed on broader ecosystem packages like pybind11 and cibuildwheel using Awkward 0’s NumPy expressions, but can be solved with that we recommend and share maintainers with. custom compiled kernels. This also enabled further developments in backends [PEL20]. Histogramming was designed to be a collection of specialized packages (see Fig. 4) with carefully defined interoperability; boost-histogram for manipulation and filling, Hist for a user- Broader ecosystem friendly interface and simple plotting tools, histoprint for display- Scikit-HEP had become a "toolset" for HEP analysis in Python, a ing histograms, and the existing mplhep and uproot packages also collection of packages that worked together, instead of a "toolkit" needed to be able to work with histograms. This ecosystem was like ROOT, which is one monopackage that tries to provide every- built and is held together with UHI, which is a formal specification thing [R+ 20]. A toolset is more natural in the Python ecosystem, agreed upon by several developers of different libraries, backed by where we have good packaging tools and many existing libraries. a statically typed Protocol, for a PlottableHistogram object. Pro- Scikit-HEP only needed to fill existing gaps, instead of covering ducers of histograms, like boost-histogram/hist and uproot provide every possible aspect of an analysis like ROOT did. The original objects that follow this specification, and users of histograms, scikit-hep package had its functionality pulled out into existing or such as mplhep and histoprint take any object that follows this new separate packages such as HEPUnits and Vector, and the core specification. The UHI library is not required at runtime, though it scikit-hep package instead became a metapackage with no unique does also provide a few simple utilities to help a library also accept functionality on its own. Instead, it installs a useful subset of our ROOT histograms, which do not (currently) follow the Protocol, so libraries for a physicist wanting to quickly get started on a new several libraries have decided to include it at runtime too. By using analysis. 
a static type checker like MyPy to statically enforce a Protocol, Scikit-HEP was quickly becoming the center of HEP specific libraries that can communicate without depending on each other Python software (see Fig. 1). Several other projects or packages or on a shared runtime dependency and class inheritance. This has joined Scikit-HEP iMinuit, a popular HEP and astrophysics fitting been a great success story for Scikit-HEP, and We expect Protocols library, was probably the most widely used single package to to continue to be used in more places in the ecosystem. have joined. PyHF and cabinetry also joined; these were larger The design for Scikit-HEP as a toolset is of many parts that frameworks that could drive a significant part of an analysis all work well together. One example of a package pulling together internally using other Scikit-HEP tools. many components is uproot-browser, a tool that combines uproot, Other packages, like GooFit, Coffea, and zFit, were not added, Hist, and Python libraries like textual and plotext to provide a but were built on Scikit-HEP packages and had developers work- terminal browser for ROOT files. ing closely with Scikit-HEP maintainers. Scikit-HEP introduced Scikit-HEP’s external contributions continued to grow. One of an "affiliated" classification for these packages, which allowed the most notable ones was our work on cibuildwheel. This was an external package to be listed on the Scikit-HEP website a Python package that supported building redistributable wheels and encouraged collaboration. Coffea had a strong influence on multiple CI systems. Unlike our own azure-wheel-helpers or on histogram design, and zFit has contributed code to Scikit- the competing multibuild package, it was written in Python, so HEP. Currently all affiliated packages have at least one Scikit- good practices in Python package design could apply, like unit HEP developer as a maintainer, though that is currently not a and integration tests, static checks, and it was easy to remain requirement. An affiliated package fills a particular need for the independent of the underlying CI system. Building wheels on community. Scikit-HEP doesn’t have to, or need to, attempt to Linux requires a docker image, macOS requires the python.org develop a package that others are providing, but rather tries to Python, and Windows can use any copy of Python - cibuildwheel ensure that the externally provided package works well with the uses this to supply Python in all cases, which keeps it from AWKWARD PACKAGING: BUILDING SCIKIT-HEP 119 depending on the CI’s support for a particular Python version. We helpful for monitoring adoption of the developer pages, especially merged our improvements to cibuildwheel, like better Windows newer additions, across the Scikit-HEP packages. This package support, VCS versioning support, and better PEP 518 support. was then implemented directly into the Scikit-HEP pages, using We dropped azure-wheel-helpers, and eventually a scikit-build Pyodide to run Python in WebAssembly directly inside a user’s maintainer joined the cibuildwheel project. cibuildwheel would browser. Now anyone visiting the page can enter their repository go on to join the PyPA, and is now in use in over 600 packages, and branch, and see the adoption report in a couple of seconds. including numpy, matplotlib, mypy, and scikit-learn. 
Our continued contributions to cibuildwheel included a Working toward the future TOML-based configuration system for cibuildwheel 2.0, an over- Scikit-HEP is looking toward the future in several different areas. ride system to make supporting multiple manylinux and musllinux We have been working with the Pyodide developers to support targets easier, a way to build directly from SDists, an option to use WebAssembly; boost-histogram is compiled into Pyodide 0.20, build instead of pip, the automatic detection of Python version and Pyodide’s support for pybind11 packages is significantly bet- requirements, and better globbing support for build specifiers. We ter due to that work, including adding support for C++ exception also helped improve the code quality in various ways, including handling. PyHF’s documentation includes a live Pyodide kernel, fully statically typing the codebase, applying various checks and and a try-pyhf site (based on the repo-review tool) lets users run style controls, automating CI processes, and improving support for a model without installing anything - it can even be saved as a special platforms like CPython 3.8 on macOS Apple Silicon. webapp on mobile devices. We also have helped with build, nox, pyodide, and many other We have also been working with Scikit-Build to try to provide packages, improving the tooling we depend on to develop scikit- a modern build experience in Python using CMake. This project build and giving back to the community. is just starting, but we expect over the next year or two that the usage of CMake as a first class build tool for binaries in The Scikit-HEP Developer Pages Python will be possible using modern developments and avoiding A variety of packaging best practices were coming out of the distutils/setuptools hacks. boost-histogram work, supporting both ease of installation for users as well as various static checks and styling to keep the Summary package easy to maintain and reduce bugs. These techniques The Scikit-HEP project started in Autumn 2016 and has grown would also be useful apply to Scikit-HEP’s nearly thirty other to be a core component in many HEP analyses. It has also packages, but applying them one-by-one was not scalable. The provided packages that are growing in usage outside of HEP, like development and adoption of azure-wheel-helpers included a se- AwkwardArray, boost-histogram/Hist, and iMinuit. The tooling ries of blog posts that covered the Azure Pipelines platform and developed and improved by Scikit-HEP has helped Scikit-HEP wheel building details. This ended up serving as the inspiration developers as well as the broader Python community. for a new set of pages on the Scikit-HEP website for developers interested in making Python packages. Unlike blog posts, these would be continuously maintained and extended over the years, R EFERENCES serving as a template and guide for updating and adding packages [Dea20] Hans Dembinski and Piti Ongmongkolkul et al. scikit- to Scikit-HEP, and educating new developers. hep/iminuit. Dec 2020. URL: https://doi.org/10.5281/zenodo. 3949207, doi:10.5281/zenodo.3949207. These pages grew to describe the best practices for developing [EB08] Lyndon Evans and Philip Bryant. Lhc machine. Journal of and maintaining a package, covering recommended configuration, instrumentation, 3(08):S08001, 2008. style checking, testing, continuous integration setup, task runners, [GTW20] Galli, Massimiliano, Tejedor, Enric, and Wunsch, Stefan. "a new and more. 
Shortly after the introduction of the developer pages, pyroot: Modern, interoperable and more pythonic". EPJ Web Conf., 245:06004, 2020. URL: https://doi.org/10.1051/epjconf/ Scikit-HEP developers started asking for a template to quickly 202024506004, doi:10.1051/epjconf/202024506004. produce new packages following the guidelines. This was eventu- [IVL22] Ioana Ifrim, Vassil Vassilev, and David J Lange. GPU Ac- ally produced; the "cookiecutter" based template is kept in sync celerated Automatic Differentiation With Clad. arXiv preprint with the developer pages; any new addition to one is also added arXiv:2203.06139, 2022. [Lam98] Stephan Lammel. Computing models of cdf and dØ to the other. The developer pages are also kept up to date using a in run ii. Computer Physics Communications, 110(1):32– CI job that bumps any GitHub Actions or pre-commit versions to 37, 1998. URL: https://www.sciencedirect.com/science/article/ the most recent versions weekly. Some portions of the developer pii/S0010465597001501, doi:10.1016/s0010-4655(97) 00150-1. pages have been contributed to packaging.python.org, as well. [LCC+ 09] Barry M Leiner, Vinton G Cerf, David D Clark, Robert E The cookie cutter was developed to be able to support multiple Kahn, Leonard Kleinrock, Daniel C Lynch, Jon Postel, Larry G build backends; the original design was to target both pure Python Roberts, and Stephen Wolff. A brief history of the internet. and Pybind11 based binary builds. This has expanded to include ACM SIGCOMM Computer Communication Review, 39(5):22– 31, 2009. 11 different backends by mid 2022, including Rust extensions, [LGMM05] W Lavrijsen, J Generowicz, M Marino, and P Mato. Reflection- many PEP 621 based backends, and a Scikit-Build based backend Based Python-C++ Bindings. 2005. URL: https://cds.cern.ch/ for pybind11 in addition to the classic Setuptools one. This has record/865620, doi:10.5170/CERN-2005-002.441. [PEL20] Jim Pivarski, Peter Elmer, and David Lange. Awkward arrays helped work out bugs and influence the design of several PEP in python, c++, and numba. In EPJ Web of Conferences, 621 packages, including helping with the addition of PEP 621 to volume 245, page 05023. EDP Sciences, 2020. doi:10.1051/ Setuptools. epjconf/202024505023. The most recent addition to the pages was based on a new [PJ11] Andreas J Peters and Lukasz Janyst. Exabyte scale storage at CERN. In Journal of Physics: Conference Series, volume 331, repo-review package which evaluates and existing repository to page 052015. IOP Publishing, 2011. doi:10.1088/1742- see what parts of the guidelines are being followed. This was 6596/331/5/052015. 120 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [R+ 20] Eduardo Rodrigues et al. The Scikit HEP Project – overview and prospects. EPJ Web of Conferences, 245:06028, 2020. arXiv: 2007.03577, doi:10.1051/epjconf/202024506028. [RS21] Olivier Rousselle and Tom Sykora. Fast simulation of Time- of-Flight detectors at the LHC. In EPJ Web of Conferences, volume 251, page 03027. EDP Sciences, 2021. doi:10.1051/ epjconf/202125103027. [RTA+ 17] D Remenska, C Tunnell, J Aalbers, S Verhoeven, J Maassen, and J Templon. Giving pandas ROOT to chew on: experiences with the XENON1T Dark Matter experiment. In Journal of Physics: Conference Series, volume 898, page 042003. IOP Publishing, 2017. [SLF+ 15] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: astronomical or genomical? 
PLoS biology, 13(7):e1002195, 2015. [Tem22] Jeffrey Templon. Reflections on the uptake of the Python pro- gramming language in Nuclear and High-Energy Physics, March 2022. None. URL: https://doi.org/10.5281/zenodo.6353621, doi:10.5281/zenodo.6353621. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 121 Keeping your Jupyter notebook code quality bar high (and production ready) with Ploomber Ido Michael‡∗ F This paper walks through this interactive tutorial. It is highly recommended running this interactively so it’s easier to follow and see the results in real-time. There’s a binder link in there as well, so you can launch it instantly. Fig. 1: In this pipeline none of the tasks were executed - it’s all red. 1. Introduction Notebooks are an excellent environment for data exploration: In addition, it can transform a notebook to a single-task pipeline they allow us to write code interactively and get visual feedback, and then the user can split it into smaller tasks as they see fit. providing an unbeatable experience for understanding our data. To refactor the notebook, we use the soorgeon refactor However, this convenience comes at a cost; if we are not command: careful about adding and removing code cells, we may have an soorgeon refactor nb.ipynb irreproducible notebook. Arbitrary execution order is a prevalent After running the refactor command, we can take a look at the problem: a recent analysis found that about 36% of notebooks on local directory and see that we now have multiple python tasks GitHub did not execute in linear order. To ensure our notebooks which that are ready for production: run, we must continuously test them to catch these problems. ls playground A second notable problem is the size of notebooks: the more cells we have, the more difficult it is to debug since there are more We can see that we have a few new files. pipeline.yaml variables and code involved. contains the pipeline declaration, and tasks/ contains the stages Software engineers typically break down projects into multiple that Soorgeon identified based on our H2 Markdown headings: steps and test continuously to prevent broken and unmaintainable ls playground/tasks code. However, applying these ideas for data analysis requires extra work; multiple notebooks imply we have to ensure the output One of the best ways to onboard new people and explain what from one stage becomes the input for the next one. Furthermore, each workflow is doing is by plotting the pipeline (note that we’re we can no longer press “Run all cells” in Jupyter to test our now using ploomber, which is the framework for developing analysis from start to finish. pipelines): Ploomber provides all the necessary tools to build multi- ploomber plot stage, reproducible pipelines in Jupyter that feel like a single This command will generate the plot below for us, which will notebook. Users can easily break down their analysis into multiple allow us to stay up to date with changes that are happening in our notebooks and execute them all with a single command. pipeline and get the current status of tasks that were executed or failed to execute. 2. Refactoring a legacy notebook Soorgeon correctly identified the stages in our If you already have a python project in a single notebook, you original nb.ipynb notebook. It even detected that can use our tool Soorgeon to automatically refactor it into a the last two tasks (linear-regression, and Ploomber pipeline. 
Soorgeon statically analyzes your code, cleans random-forest-regressor) are independent of each up unnecessary imports, and makes sure your monolithic notebook other! is broken down into smaller components. It does that by scanning We can also get a summary of the pipeline with ploomber the markdown in the notebook and analyzing the headers; each status: H2 header in our example is marking a new self-contained task. cd playground ploomber status * Corresponding author: ido@ploomber.io ‡ Ploomber 3. The pipeline.yaml file Copyright © 2022 Ido Michael. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits To develop a pipeline, users create a pipeline.yaml file and unrestricted use, distribution, and reproduction in any medium, provided the declare the tasks and their outputs as follows: original author and source are credited. 122 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Here we can see the build outputs Fig. 2: In here we can see the status of each of our pipeline’s tasks, runtime and location. tasks: - source: script.py product: nb: output/executed.ipynb data: output/data.csv # more tasks here... The previous pipeline has a single task (script.py) and generates two outputs: output/executed.ipynb and output/data.csv. You may be wondering why we have a notebook as an output: Ploomber converts scripts to notebooks before execution; hence, our script is considered the source and the notebook a byproduct of the execution. Using scripts as sources (instead of notebooks) makes it simpler to use git. However, this does not mean you have to give up interactive development since Ploomber integrates with Jupyter, allowing you to edit scripts as notebooks. Fig. 4: These are the post build artifacts In this case, since we used soorgeon to refactor an existing notebook, we did not have to write the pipeline.yaml file. # Sample data quality checks after loading the raw data # Check nulls 4. Building the pipeline assert not df['HouseAge'].isnull().values.any() Let’s build the pipeline (this will take ~30 seconds): # Check a specific range - no outliers cd playground assert df['HouseAge'].between(0,100).any() ploomber build # Exact expected row count We can see which are the tasks that ran during this command, how assert len(df) == 11085 long they took to execute, and the contributions of each task to the overall pipeline execution runtime. ** We’ll do the same for tasks/linear-regression.py, open the file Navigate to playground/output/ and you’ll see all the and add the tests: outputs: the executed notebooks, data files and trained model. # Sample tests after the notebook ran # Check task test input exists ls playground/output assert Path(upstream['train-test-split']['X_test']).exists() In this figure, we can see all of the data that was collected during # Check task train input exists the pipeline, any artifacts that might be useful to the user, and some assert Path(upstream['train-test-split']['y_train']).exists() of the execution history that is saved on the notebook’s context. # Validating output type assert 'pkl' in upstream['train-test-split']['X_test'] 5. Testing and quality checks Adding these snippets will allow us to validate that the data we’re ** Open tasks/train-test-split.py as a notebook by right-clicking looking for exists and has the quality we expect. 
For instance, in on it and then Open With -> Notebook and add the following the first test we’re checking there are no missing rows, and that code after the cell with # noqa: the data sample we have are for houses up to 100 years old. KEEPING YOUR JUPYTER NOTEBOOK CODE QUALITY BAR HIGH (AND PRODUCTION READY) WITH PLOOMBER 123 Fig. 6: lab-open-with-notebook Fig. 5: Now we see an independent new task In the second snippet, we’re checking that there are train and test inputs which are crucial for training the model. 6. Maintaining the pipeline Let’s look again at our pipeline plot: Fig. 7: The new task is attached to the pipeline Image('playground/pipeline.png') The arrows in the diagram represent input/output dependencies At the top of the notebook, you’ll see the following: and depict the execution order. For example, the first task (load) upstream = None loads some data, then clean uses such data as input and processes it, then train-test-split splits our dataset into This special variable indicates which tasks should execute before training and test sets. Finally, we use those datasets to train a the notebook we’re currently working on. In this case, we want to linear regression and a random forest regressor. get training data so we can train our new model so we change the Soorgeon extracted and declared this dependencies for us, but upstream variable: if we want to modify the existing pipeline, we need to declare upstream = ['train-test-split'] such dependencies. Let’s see how. We can also see that the pipeline is green, meaning all of the Let’s generate the plot again: tasks in it have been executed recently. cd playground ploomber plot 7. Adding a new task Ploomber now recognizes our dependency declaration! Let’s say we want to train another model and decide to try Gradient Open Boosting Regressor. First, we modify the pipeline.yaml file playground/tasks/gradient-boosting-regressor.py and add a new task: as a notebook by right-clicking on it and then Open With -> Open playground/pipeline.yaml and add the follow- Notebook and add the following code: ing lines at the end from pathlib import Path - source: tasks/gradient-boosting-regressor.py import pickle product: nb: output/gradient-boosting-regressor.ipynb import seaborn as sns Now, let’s create a base file by executing ploomber from sklearn.ensemble import GradientBoostingRegressor scaffold: y_train = pickle.loads(Path( cd playground upstream['train-test-split']['y_train']).read_bytes()) ploomber scaffold y_test = pickle.loads(Path( upstream['train-test-split']['y_test']).read_bytes()) This is the output of the command: ` X_test = pickle.loads(Path( Found spec at 'pipeline.yaml' Adding upstream['train-test-split']['X_test']).read_bytes()) /Users/ido/ploomber-workshop/playground/ X_train = pickle.loads(Path( upstream['train-test-split']['X_train']).read_bytes()) tasks/ gradient-boosting-regressor.py... Created 1 new task sources. ` gbr = GradientBoostingRegressor() We can see it created the task sources for our new task, we just gbr.fit(X_train, y_train) have to fill those in right now. y_pred = gbr.predict(X_test) Let’s see how the plot looks now: sns.scatterplot(x=y_test, y=y_pred) cd playground ploomber plot You can see that Ploomber recognizes the new file, but it does not 8. Incremental builds have any dependency, so let’s tell Ploomber that it should execute Data workflows require a lot of iteration. For example, you may after train-test-split: want to generate a new feature or model. 
However, it’s wasteful Open to re-execute every task with every minor change. Therefore, playground/tasks/gradient-boosting-regressor.py one of Ploomber’s core features is incremental builds, which automatically skip tasks whose source code hasn’t changed. as a notebook by right-clicking on it and then Open With -> Run the pipeline again: Notebook: 124 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 11. Resources Thanks for taking the time to go through this tutorial! We hope you consider using Ploomber for your next project. If you have any questions or need help, please reach out to us! (contact info below). Here are a few resources to dig deeper: • GitHub • Documentation • Code examples Fig. 8: We can see this pipeline has multiple new tasks. • JupyterCon 2020 talk • Argo Community Meeting talk • Pangeo Showcase talk (AWS Batch demo) cd playground • Jupyter project ploomber build You can see that only the gradient-boosting-regressor 10. Contact task ran! Incremental builds allow us to iterate faster without keeping • Twitter track of task changes. • Join us on Slack Check out playground/output/ • E-mail us gradient-boosting-regressor.ipynb, which contains the output notebooks with the model evaluation plot. 9. Parallel execution and Ploomber cloud execution This section can run locally or on the cloud. To setup the cloud we’ll need to register for an api key Ploomber cloud allows you to scale your experiments into the cloud without provisioning machines and without dealing with infrastrucutres. Open playground/pipeline.yaml and add the following code instead of the source task: - source: tasks/random-forest-regressor.py This is how your task should look like in the end - source: tasks/random-forest-regressor.py name: random-forest- product: nb: output/random-forest-regressor.ipynb grid: # creates 4 tasks (2 * 2) n_estimators: [5, 10] criterion: [gini, entropy] In addition, we’ll need to add a flag to tell the pipeline to execute in parallel. Open playground/pipeline.yaml and add the following code above the -tasks section (line 1): yaml # Execute independent tasks in parallel executor: parallel ploomber plot ploomber build 10. Execution in the cloud When working with datasets that fit in memory, running your pipeline is simple enough, but sometimes you may need more computing power for your analysis. Ploomber makes it simple to execute your code in a distributed environment without code changes. Check out Soopervisor, the package that implements exporting Ploomber projects in the cloud with support for: • Kubernetes (Argo Workflows) • AWS Batch • Airflow PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 125 Likeness: a toolkit for connecting the social fabric of place to human dynamics Joseph V. Tuccillo‡∗ , James D. Gaboardi‡ F Abstract—The ability to produce richly-attributed synthetic populations is Modeling these processes at scale and with respect to indi- key for understanding human dynamics, responding to emergencies, and vidual privacy is most commonly achieved through agent-based preparing for future events, all while protecting individual privacy. The Like- simulations on synthetic populations [SEM14]. Synthetic popula- ness toolkit accomplishes these goals with a suite of Python packages: tions consist of individual agents that, when viewed in aggregate, pymedm/pymedm_legacy, livelike, and actlike. 
This production closely recreate the makeup of an area’s observed population process is initialized in pymedm (or pymedm_legacy) that utilizes census microdata records as the foundation on which disaggregated spatial allocation [HHSB12], [TMKD17]. Modeling human dynamics with syn- matrices are built. The next step, performed by livelike, is the generation of thetic populations is common across research areas including spa- a fully autonomous agent population attributed with hundreds of demographic tial epidemiology [DKA+ 08], [BBE+ 08], [HNB+ 11], [NCA13], census variables. The agent population synthesized in livelike is then [RSF+ 21], [SNGJ+ 09], public health [BCD+ 06], [BFH+ 17], attributed with residential coordinates in actlike based on block assignment [SPH11], [TCR08], [MCB+ 08], and transportation [BBM96], and, finally, allocated to an optimal daytime activity location via the street [ZFJ14]. However, a persistent limitation across these applications network. We present a case study in Knox County, Tennessee, synthesizing 30 is that synthetic populations often do not capture a wide enough populations of public K–12 school students & teachers and allocating them to range of individual characteristics to assess how human dynamics schools. Validation of our results shows they are highly promising by replicating are linked to human security problems (e.g., how a person’s age, reported school enrollment and teacher capacity with a high degree of fidelity. limited transportation access, and linguistic isolation may interact Index Terms—activity spaces, agent-based modeling, human dynamics, popu- with their housing situation in a flood evacuation emergency). lation synthesis In this paper, we introduce Likeness [TG22], a Python toolkit for connecting the social fabric of place to human dynamics via Introduction models that support increased spatial, temporal, and demographic Human security fundamentally involves the functional capacity fidelity. Likeness is an extension of the UrbanPop framework de- that individuals possess to withstand adverse circumstances, me- veloped at Oak Ridge National Laboratory (ORNL) that embraces diated by the social and physical environments in which they live a new paradigm of "vivid" synthetic populations [TM21], [Tuc21], [Hew97]. Attention to human dynamics is a key piece of the in which individual agents may be attributed in potentially hun- human security puzzle, as it reveals spatial policy interventions dreds of ways, across subjects spanning demographics, socioe- most appropriate to the ways in which people within a community conomic status, housing, and health. Vivid synthetic populations behave and interact in daily life. For example, "one size fits all" benefit human dynamics research both by enabling more precise solutions do not exist for mitigating disease spread, promoting geolocation of population segments, as well as providing a deeper physical activity, or enabling access to healthy food sources. understanding of how individual and neighborhood characteris- Rather, understanding these outcomes requires examination of tics are coupled. UrbanPop’s early development was motivated processes like residential sorting, mobility, and social transmis- by linking models of residential sorting and worker commute sion. behaviors [MNP+ 17], [MPN+ 17], [ANM+ 18]. 
Likeness expands upon the UrbanPop approach by providing a novel integrated * Corresponding author: tuccillojv@ornl.gov ‡ Oak Ridge National Laboratory model that pairs vivid residential synthetic populations with an activity simulation model on real-world transportation networks, Copyright © 2022 Oak Ridge National Laboratory. This is an open-access with travel destinations based on points of interest (POIs) curated article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any from location services and federal critical facilities data. medium, provided the original author and source are credited. Notice: This manuscript has been authored by UT-Battelle, LLC under We first provide an overview of Likeness’ capabilities, then Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. provide a more detailed walkthrough of its central workflow with The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government respect to livelike, a package for population synthesis and retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or residential characterization, and actlike a package for activity reproduce the published form of this manuscript, or allow others to do so, for allocation. We provide preliminary usage examples for Likeness United States Government purposes. The Department of Energy will provide based on 1) social contact networks in POIs 2) 24-hour POI public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public- occupancy characteristics. Finally, we discuss existing limitations access-plan). and the outlook for future development. 126 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Overview of Core Capabilities and Workflow the ACS Public-Use Microdata Sample (PUMS) at the scale UrbanPop initially combined the vivid synthetic populations pro- of census block groups (typically 300–6000 people) or tracts duced from the American Community Survey (ACS) using the (1200–8000 people), depending upon the use-case. Penalized-Maximum Entropy Dasymetric Modeling (P-MEDM) Downscaling the PUMS from the Public-Use Microdata Area method, which is detailed later, with a commute model based on (PUMA) level at which it is offered (100,000 or more people) to origin-destination flows, to generate a detailed dataset of daytime these neighborhood scales then enables us to produce synthetic and nighttime synthetic populations across the United States populations (the livelike package) and simulate their travel [MPN+ 17]. Our development of Likeness is motivated by extend- to POIs (the actlike package) in an integrated model. This ap- ing the existing capabilities of UrbanPop to routing libraries avail- proach provides a new means of modeling population mobility and able in Python like osmnx1 and pandana2 [Boe17], [FW12]. activity spaces with respect to real-world transportation networks In doing so, we are able to simulate travel to regular daytime and POIs, in turn enabling investigation of social processes from activities (work and school) based on real-world transportation the atomic (e.g., person) level in human systems. networks. Likeness continues to use the P-MEDM approach, but Likeness offers two implementations of P-MEDM. The first, is fully integrated with the U.S. 
Census Bureau’s ACS Summary the pymedm package, is written natively in Python based on File (SF) and Census Microdata APIs, enabling the production of scipy.optimize.minimize, and while fully operational re- activity models on-the-fly. mains in development and is currently suitable for one-off simu- Likeness features three core capabilities supporting activ- lations. The second, the pmedm_legacy package, uses rpy2 as ity simulation with vivid synthetic populations (Figure 1). a bridge to [NBLS14]’s original implementation of P-MEDM3 in The first, spatial allocation, is provided by the pymedm and R/C++ and is currently more stable and scalable. We offer conda pmedm_legacy packages and uses Iterative Proportional Fitting environments specific to each package, based on user preferences. (IPF) to downscale census microdata records to small neighbor- Each package’s functionality centers around a PMEDM class, hood areas, providing a basis for population synthesis. Baseline which contains information required to solve the P-MEDM prob- residential synthetic populations are then created and stratified into lem: agent segments (e.g., grade 10 students, hospitality workers) using • The individual (household) level constraints based on ACS the livelike package. Finally, the actlike package models PUMS. To preserve households from the PUMS in the syn- travel across agent segments of interest to POIs outside places of thetic population, the person-level constraints describing residence at varying times of day. household members are aggregated to the household level and merged with household-level constraints. Spatial Allocation: the pymedm & pmedm_legacy packages • PUMS household sample weights. Synthetic populations are typically generated from census micro- • The target (e.g., block group) and aggregate (e.g., tract) data, which consists of a sample of publicly available longform zone constraints based on population-level estimates avail- responses to official statistical surveys. To preserve respondent able in the ACS SF. confidentiality, census microdata is often published at spatial • The target/aggregate zone 90% margins of error and asso- scales the size of a city or larger. Spatial allocation with IPF ciated standard errors (SE = 1.645 × MOE). provides a maximum-likelihood estimator for microdata responses The PMEDM classes feature a solve() method that returns in small (e.g., neighborhood) areas based on aggregate data an optimized P-MEDM solution and allocation matrix. Through published about those areas (known as "constraints"), resulting a diagnostics module, users may then evaluate a P-MEDM in a baseline for population synthesis [WCC+ 09], [BBM96], solution based on the proportion of published 90% MOEs from [TMKD17]. UrbanPop is built upon a regularized implementation the summary-level ACS data preserved at the target (allocation) of IPF, the P-MEDM method, that permits many more input census scale. variables than traditional approaches [LNB13], [NBLS14]. The P- MEDM objective function (Eq. 1) is written as: Population Synthesis: the livelike package n wit wit e2 The livelike package generates baseline residential synthetic max − ∑ log − ∑ k2 (1) it N dit dit k 2σk populations and performs agent segmentation for activity simula- tion. where wit is the estimate of variable i in zone t, dit is the synthetic estimate of variable i in location t, n is the number of microdata Specifying and Solving Spatial Allocation Problems responses, and N is the total population size. 
Uncertainty in The livelike workflow is oriented around a user-specified variable estimates is handled by adding an error term to the e2 constraints file containing all of the information necessary to allocation ∑k 2σk2 , where ek is the error between the synthetic specify a P-MEDM problem for a PUMA of interest. "Constraints" k and published estimate of ACS variable k and σk is the ACS are variables from the ACS common among people/households standard error for the estimate of variable k. This is accomplished (PUMS) and populations (SF) that are used as both model inputs by leveraging the uncertainty in the input variables: the "tighter" and descriptors. The constraints file includes information for the margins of error on the estimate of variable k in place t, the bridging PUMS variable definitions with those from the SF using more leverage it holds upon the solution [NBLS14]. helper functions provided by the livelike.pums module, The P-MEDM procedure outputs an allocation matrix that including table IDs, sampling universe (person/household), and estimates the probability of individuals matching responses from tags for the range of ACS vintages (years) for which the variables are relevant. 1. https://github.com/gboeing/osmnx 2. https://github.com/UDST/pandana 3. https://bitbucket.org/nnnagle/pmedmrcpp LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 127 Fig. 1: Core capabilities and workflow of Likeness. The primary livelike class is the acs.puma, which stores implementation of [LB13]’s "Truncate, Replicate, Sample" (TRS) information about a single PUMA necessary for spatial allocation method. TRS works by separating each cell of the allocation of the PUMS data to block groups/tracts with P-MEDM. The matrix into whole-number (integer) and fractional components, process of creating an acs.puma is integrated with the U.S. then incrementing the whole-number estimates by a random Census Bureau’s ACS SF and Census Microdata 5-Year Estimates sample of unit weights performed with sampling probabilities (5YE) APIs4 . This enables generation of an acs.puma class based on the fractional component. Because TRS is stochastic, with a high-level call involving just a few parameters: 1) the the homesim.hsim() function generates multiple (default 30) PUMA’s Federal Information Processing Standard (FIPS) code 2) realizations of the residential population. The results are provided the constraints file, loaded as a pandas.DataFrame and 3) the as a pandas.DataFrame in long format, attributed by: target ACS vintage (year). An example call to build an acs.puma • PUMS Household ID (h_id) for the Knoxville City, TN PUMA (FIPS 4701603) using the ACS • Simulation number (sim) 2015–2019 5-Year Estimates is: • Target zone FIPS code (geoid) acs.puma( fips="4701603", • Household count (count) constraints=constraints, year=2019 Since household and person-level attributes are combined ) when creating the acs.puma class, person-level records from the PUMS are assumed to be joined to the synthesized household The censusdata package5 is used internally to IDs many-to-one. For example, if two people, A01 and A03, in fetch population-level (SF) constraints, standard errors, household A have some attribute of interest, and there are 3 and MOEs from the ACS 5YE API, while the households of type A in zone G, then we estimate that a total acs.extract_pums_constraints function is used to of 6 people with that attribute from household A reside in zone G. 
fetch individual-level constraints and weights from the Census Microdata 5YE API. Agent Generation Spatial allocation is then carried out by passing the acs.puma attributes to a pymedm.PMEDM or The synthetic populations can then be segmented into different pmedm_legacy.PMEDM (depending on user preference). groups of agents (e.g., workers by industry, students by grade) for activity modeling with the actlike package. Agent segments Population Synthesis may be identified in several ways: The homesim module provides support for population synthe- • Using acs.extract_pums_segment_ids() to sis on the spatial allocation matrix within a solved P-MEDM fetch the person IDs (household serial number + person object. The population synthesis procedure involves converting line number) from the Census Microdata API matching the fractional estimates from the allocation matrix (n household some criteria of interest (e.g., public school students in IDs by m zones) to integer representation such that whole peo- 10th grade). ple/households are preserved. This homesim module features an • Using acs.extract_pums_descriptors() to 4. https://www.census.gov/data/developers/data-sets.html fetch criteria that may be queried from the Census 5. https://pypi.org/project/CensusData Microdata API. This is useful when dealing with criteria 128 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) more specific than can be directly controlled for in the in time and are placed with a greater frequency proportional P-MEDM problem (e.g., detailed NAICS code of worker, to reported household density [LB13]. We employ population exact number of hours worked). and housing counts within 2010 Decennial Census blocks to formulate a modified Variable Size Bin Packing Problem [FL86], The function est.tabulate_by_serial() is then used [CGSdG08] for each populated block group, which allows for to tabulate agents by target zone and simulation by appending an optimal placement of household points and is accomplished them to the synthetic population based on household ID, then by the actlike.block_denisty_allocation() aggregating the person-level counts. This routine is flexible in that function that creates and solves an a user can use any set of criteria available from the PUMS to actlike.block_allocation.BinPack instance. define customized agents for mobility modeling purposes. Other Capabilities Activity Allocation Population Statistics: In addition to agent creation, the Once household location attribution is complete, individual agents livelike.est module also supports the creation of popula- must be allocated from households (nighttime locations) to prob- tion statistics. This can be used to estimate the compositional able activity spaces (daytime locations). This is achieved through characteristics of small neighborhood areas and POIs, for ex- spatial network modeling over the streets within a study area via ample to simulate social contact networks (see Students). To OpenStreetMap6 utilizing osmnx for network extraction & pre- accomplish this, the results of est.tabulate_by_serial processing and pandana for shortest path and route calculations. (see Agent Generation) are converted to proportional esti- The underlying impedance metric for shortest path calculation, mates to facilitate POIs (est.to_prop()), then averaged handled in actlike.calc_cost_mtx() and associated in- across simulations to produce Monte Carlo estimates and errors ternal functions, can either take the form of distance or travel time. 
est.monte_carlo_estimate()). Moreover, household and activity locations must be connected to Multiple ACS Vintages and PUMAs: The multi nearby network edges for realistic representations within network module extends the capabilities of livelike to space [GFH20]. multiple ACS 5YE vintages (dating back to 2016), as With a cost matrix from all residences to daytime loca- well as multiple PUMAs (e.g., a metropolitan area) via tions calculated, the simulated population can then be "sent" the multi module. Using multi.make_pumas() to the likely activity spaces by utilizing an instance of or multi.make_multiyear_pumas(), multiple actlike.ActivityAllocation to generate an adapted PUMAs/multiple years may be stored in a dict Transportation Problem. This mixed integer program, solved using that enables iterative runs for spatial allocation the solve() method, optimally associates all population within (multi.make_pmedm_problems()), population an activity space with the objective of minimizing the total cost of synthesis (multi.homesim()), and agent cre- impedance (Eq. 2), being subject to potentially relaxed minimum ation (multi.extract_pums_segment_ids(), and maximum capacity constraints (Eq. 4 & 5). Each decision multi.extract_pums_segment_ids_multiyear(), variable (xi j ) represents a potential allocation from origin i to multi.extract_pums_descriptors(), and destination j that must be an integer greater than or equal to zero multi.extract_pums_descriptors_multiyear()). (Eq. 6 & 7). The problem is formulated as follows: This functionality is currently available for pmedm_legacy only. min ∑ ∑ ci j xi j (2) i∈I j∈J Activity Allocation: the actlike package s.t. ∑ xi j = Oi ∀i ∈ I; (3) The actlike package [GT22] allocates agents from synthetic j∈J populations generated by livelike POI, like schools and work- places, based on optimal allocation about transportation networks s.t. ∑ xi j ≥ minD j ∀ j ∈ J; (4) i∈I derived from osmnx and pandana [Boe17], [FW12]. Solutions are the product of a modified integer program (Transportation s.t. ∑ xi j ≤ maxD j ∀ j ∈ J; (5) Problem [Hit41], [Koo49], [MS01], [MS15]) modeled in pulp i∈I or mip [MOD11], [ST20], whereby supply (students/workers) s.t. xi j ≥ 0 ∀i ∈ I ∀ j ∈ J; (6) are "shipped" to demand locations (schools/workplaces), with potentially relaxed minimum and maximum capacity constraints at s.t. xi j ∈ Z ∀i ∈ I ∀ j ∈ J. (7) demand locations. Impedance from nighttime to daytime locations (Origin-Destination [OD] pairs) can be modeled by either network where distance or network travel time. i ∈ I = each household in the set of origins j ∈ J = each school in the set of destinations Location Synthesis xi j = allocation decision from i ∈ I to j ∈ J Following the generation of synthetic households for the study ci j = cost between all i, j pairs universe, locations for all households across the 30 default simulations must be created. In order to intelligently site pseudo- Oi = population in origin i for i ∈ I neighborhood clusters of random points, we adopt a dasymetric minD j = minimum capacity j for j ∈ J [QC13] approach, which we term intelligent block-based (IBB) maxD j = maximum capacity j for j ∈ J allocation, whereby household locations are only placed within blocks known to have been populated at a particular period 6. 
https://www.openstreetmap.org/about LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 129 The key to this adapted formulation of the classic Trans- Because school attendance in Knox County is restricted by portation Problem is the utilization of minimum and maxi- district boundaries, we only placed student households in mum capacity thresholds that are generated endogenously within the PUMAs intersecting with the district (FIPS 4701601, actlike.ActivityAllocation and are tuned to reflect 4701602, 4701603, 4701604). However, because educators the uncertainty of both the population estimates generated by may live outside school district boundaries, we simulated livelike and the reported (or predicted) capacities at activity their household locations throughout the Knoxville CBSA. locations. Moreover, network impedance from origins to destina- • Used actlike to perform optimal allocation of tions (ci j ) can be randomly reduced through an internal process workers and students about road networks in Knox by passing in an integer value to the reduce_seed keyword ar- County/Knoxville CBSA. Across the 30 simulations and gument. By triggering this functionality, the count and magnitude 14 segments identified, we produced a total of 420 travel of reduction is determined algorithmically. A random reduction simulations. Network impedance was measured in geo- of this nature is beneficial in generating dispersed solutions that graphic distance for all student simulations and travel time do not resemble compact clusters, with an example being the for all educator simulations. replication of a private school’s student body that does not adhere Figure 2 demonstrates the optimal allocations, routing, and to public school attendance zones. network space for a single simulation of 10th grade public school After the optimal solution is found for an students in Knox County, TN. Students, shown in households actlike.ActivityAllocation instance, selected as small black dots, are associated with schools, represented by decisions are isolated from non-zero decision variables transparent colored circles sized according to reported enrollment. with the realized_allocations() method. These The network space connecting student residential locations to allocations are then used to generate solution routes with the assigned schools is displayed in a matching color. Further, the network_routes() function that represent the shortest path inset in Figure 2 provides the pseudo-school attendance zone for along the network traversed from residential locations to assigned 10th graders at one school in central Knoxville and demonstrates activity spaces. Solutions can be further validated with Canonical the adherence to network space. Correlation Analysis, in instances where the agent segments are stratified, and simple linear regression for those where a single Students segment of agents is used. Validation is discussed further in Validation & Diagnostics. Our study of K–12 students examines social contact networks with respect to potentially underserved student populations via the compositional characteristics of POIs (schools). 
Case Study: K–12 Public Schools in Knox County, TN

To illustrate Likeness' capability to simulate POI travel among specific population segments, we provide a case study of travel to POIs, in this case K–12 schools, in Knox County, TN. Our choice of K–12 schools was motivated by several factors. First, they serve as common destinations for the two major groups (workers and students) expected to consistently travel on a typical business day [RWM+ 17]. Second, a complete inventory of public school locations, as well as faculty and enrollment sizes, is available publicly through federal open data sources. In this case, we obtained school locations and faculty sizes from the Homeland Infrastructure Foundation-Level Database (HIFLD)⁷ and student enrollment sizes by grade from the National Center for Education Statistics (NCES) Common Core of Data⁸.

7. https://hifld-geoplatform.opendata.arcgis.com
8. https://nces.ed.gov/ccd/files.asp

We chose the Knox County School District, which coincides with Knox County's boundaries, as our study area. We used the livelike package to create 30 synthetic populations for the Knoxville Core-Based Statistical Area (CBSA), then for each simulation we:

• Isolated agent segments from the synthetic population. K–12 educators consist of full-time workers employed as primary and secondary education teachers (2018 Standard Occupation Classification System codes 2300–2320) in elementary and secondary schools (NAICS 6111). We separated out student agents by public schools and by grade level (Kindergarten through Grade 12).
• Performed IBB allocation to simulate the household locations of workers and students. Our selection of household locations for workers and students varied geographically. Because school attendance in Knox County is restricted by district boundaries, we only placed student households in the PUMAs intersecting with the district (FIPS 4701601, 4701602, 4701603, 4701604). However, because educators may live outside school district boundaries, we simulated their household locations throughout the Knoxville CBSA.
• Used actlike to perform optimal allocation of workers and students about road networks in Knox County/Knoxville CBSA. Across the 30 simulations and 14 segments identified, we produced a total of 420 travel simulations. Network impedance was measured in geographic distance for all student simulations and travel time for all educator simulations.

Figure 2 demonstrates the optimal allocations, routing, and network space for a single simulation of 10th grade public school students in Knox County, TN. Students, shown in households as small black dots, are associated with schools, represented by transparent colored circles sized according to reported enrollment. The network space connecting student residential locations to assigned schools is displayed in a matching color. Further, the inset in Figure 2 provides the pseudo-school attendance zone for 10th graders at one school in central Knoxville and demonstrates the adherence to network space.

Fig. 2: Optimal allocations for one simulation of 10th grade public school students in Knox County, TN.

Students

Our study of K–12 students examines social contact networks with respect to potentially underserved student populations via the compositional characteristics of POIs (schools). We characterized each school's student body by identifying student profiles based on several criteria: minority race/ethnicity, poverty status, single caregiver households, and unemployed caregiver households (householder and/or spouse/partner). We defined 6 student profiles using an implementation of the density-based K-Modes clustering algorithm [CLB09] with a distance heuristic designed to optimize cluster separation [NLHH07], available through the kmodes package⁹ [dV21]. Student profile labels were appended to the student travel simulation results, then used to produce Monte Carlo proportional estimates of profiles by school.

9. https://pypi.org/project/kmodes

The results in Figure 3 reveal strong dissimilarities in student makeup between schools on the periphery of Knox County and those nearer to Knoxville's downtown core in the center of the county. We estimate that the former are largely composed of students in married families, above poverty, and with employed caregivers, whereas the latter are characterized more strongly by single caregiver living arrangements and, particularly in areas north of the downtown core, economic distress (pop-out map).

Fig. 3: Compositional characteristics of K–12 public schools in Knox County, TN based on 6 student profiles. Glyph plot methodology adapted from [GLC+ 15].
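For readers unfamiliar with k-modes, the fragment below sketches the clustering step on toy categorical data using the kmodes package cited above; the attribute coding, sample size, and random values are invented, and the authors' actual feature construction is not reproduced here.

import numpy as np
from kmodes.kmodes import KModes

# Toy categorical student attributes: [minority race/ethnicity, poverty,
# single caregiver, unemployed caregiver], each coded as a 0/1 string.
rng = np.random.default_rng(0)
students = rng.integers(0, 2, size=(500, 4)).astype(str)

# The Cao et al. initialization is the density-based variant cited above.
km = KModes(n_clusters=6, init="Cao", n_init=1, verbose=0)
profiles = km.fit_predict(students)

# Profile labels can then be joined back onto the travel simulation results
# to build Monte Carlo proportional estimates of profiles by school.
print(np.bincount(profiles))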
Workers (Educators)

We evaluated the results of our K–12 educator simulations with respect to POI occupancy characteristics, as informed by commute and work statistics obtained from the PUMS. Specifically, we used work arrival times associated with each synthetic worker (PUMS JWAP) to timestamp the start of each work day, and incremented this by daily hours worked (derived from PUMS WKHP) to create a second timestamp for work departure. The estimated departure time assumes that each educator travels to the school for a typical 5-day workweek, and is estimated as JWAP + WKHP/5.

Roughly 50 educator agents per simulation were not attributed with work arrival times, possibly due to the source PUMS respondents being away from their typical workplaces (e.g., on summer or winter break) but still working virtually when they were surveyed. We filled in these unknown arrival times with the modal arrival time observed across all simulations (7:25 AM).

Figure 4 displays the hourly proportion of educators present at each school in Knox County between 7:00 AM (t700) and 6:00 PM (t1800). Morning worker arrivals occur more rapidly than afternoon departures. Between the hours of 7:00 AM and 9:00 AM (t700–t900), schools transition from nearly empty of workers to being close to capacity. In the afternoon, workers begin to gradually depart at 3:00 PM (t1500), with somewhere between 50%–70% of workers still present by 4:00 PM (t1600); workers then begin to depart in earnest at 5:00 PM into 6:00 PM (t1700–t1800), by which point most have returned home.

Geographic differences are also visible and may be a function of (1) a higher concentration of a particular school type (e.g., elementary, middle, high) in a given area and (2) staggered starts between these types (to accommodate bus schedules, etc.). This could be due in part to concentrations of different school schedules by grade level, especially elementary schools starting much earlier than middle and high schools¹⁰. For example, schools near the center of Knox County reach worker capacity more quickly in the morning, starting around 8:00 AM (t800), but also empty out more rapidly than schools in surrounding areas beginning around 4:00 PM (t1600).

10. https://www.knoxschools.org/Page/5553

Fig. 4: Hourly worker occupancy estimates for K–12 schools in Knox County, TN.
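A compact way to build this kind of hourly occupancy table is sketched below with pandas; the worker records are toy values, and real PUMS JWAP codes (which are interval codes rather than clock strings) would need to be decoded first.

import pandas as pd

# Toy synthetic educators: arrival time (stand-in for JWAP) and usual
# weekly hours (WKHP).
workers = pd.DataFrame({
    "school": ["A", "A", "B", "B"],
    "jwap":   ["07:25", "08:00", "07:25", "09:00"],
    "wkhp":   [40, 35, 45, 20],
})

arrive = pd.to_timedelta(workers["jwap"] + ":00")
depart = arrive + pd.to_timedelta(workers["wkhp"] / 5, unit="h")  # JWAP + WKHP/5

# Hourly proportion of workers present at each school, 7:00 AM-6:00 PM.
hours = pd.timedelta_range("07:00:00", "18:00:00", freq="h")
occupancy = pd.DataFrame({
    f"t{h.components.hours:d}00": ((arrive <= h) & (depart > h))
    for h in hours
}).groupby(workers["school"]).mean()
print(occupancy)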
Validation & Diagnostics

A determination of modeling output robustness was needed to validate our results. Specifically, we aimed to ensure the preservation of relative facility size and composition. To perform this validation, we tested the optimal allocations generated by Likeness against the maximally adjusted reported enrollment & faculty employment counts. We used the maximum adjusted value to account for scenarios where the population synthesis phase resulted in a total demographic segment greater than the reported total facility capacity. We employed Canonical Correlation Analysis (CCA) [Kna78] for the K–12 public school student allocations due to their stratified nature, and an ordinary least squares (OLS) simple linear regression for the educator allocations [PVG+ 11]. Because CCA is a multivariate measure, it is only a suitable diagnostic for activity allocation when multiple segments (e.g., students by grade) are of interest. For educators, which we treated as a single agent segment without stratification, we used OLS regression instead. The CCA for students was performed in two components: Between-Destination, which measures capacity across facilities, and Within-Destination, which measures capacity across strata.

Descriptive Monte Carlo statistics from the 30 simulations were run on the resultant coefficients of determination (R²), which show goodness of fit (approaching 1). As seen in Table 1, all models performed exceedingly well, though the Within-Destination CCA performed slightly less well than both the Between-Destination CCA and the OLS linear regression. In fact, the global minimum of all R² scores approaches 0.99 (students – Within-Destination), which demonstrates robust preservation of true capacities in our synthetic activity modeling. Furthermore, a global maximum of greater than 0.999 is seen for educators, which indicates a near perfect replication of relative faculty sizes by school.

TABLE 1: Validating optimal allocations considering reported enrollment at public schools & faculty employment at all schools.

K–12 segment                          | R² Type                 | Min    | Median | Mean   | Max
Students (public schools)             | Between-Destination CCA | 0.9967 | 0.9974 | 0.9973 | 0.9976
Students (public schools)             | Within-Destination CCA  | 0.9883 | 0.9894 | 0.9896 | 0.9910
Educators (public & private schools)  | OLS Linear Regression   | 0.9977 | 0.9983 | 0.9983 | 0.9991
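As a rough illustration of these diagnostics, the snippet below computes an OLS R² with scipy and a first-pair canonical correlation with scikit-learn on synthetic data; it only sketches the general technique, assuming allocated-versus-reported count arrays, and does not reproduce the authors' exact Between-/Within-Destination decomposition.

import numpy as np
from scipy import stats
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)

# Educators (single segment): allocated faculty vs. adjusted reported faculty.
reported = rng.integers(20, 120, size=30).astype(float)
allocated = reported + rng.normal(0, 2, size=30)
ols = stats.linregress(reported, allocated)
print("OLS R^2:", ols.rvalue ** 2)

# Students (stratified by grade): one row per school, one column per grade.
reported_by_grade = rng.integers(50, 300, size=(30, 13)).astype(float)
allocated_by_grade = reported_by_grade + rng.normal(0, 5, size=(30, 13))
cca = CCA(n_components=1).fit(reported_by_grade, allocated_by_grade)
u, v = cca.transform(reported_by_grade, allocated_by_grade)
print("CCA R^2 (first canonical pair):",
      np.corrcoef(u[:, 0], v[:, 0])[0, 1] ** 2)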
Discussion

Our Case Study demonstrates the twofold benefits of modeling human dynamics with vivid synthetic populations. Using Likeness, we are able both to produce a more reasoned estimate of the neighborhoods in which people reside and interact than existing synthetic population frameworks, and to support more nuanced characterization of human activities at specific POIs (e.g., social contact networks, occupancy).

The examples provided in the Case Study show how this refined understanding of human dynamics can benefit planning applications. For example, in the event of a localized emergency, the results of Students could be used to examine schools for which rendezvous with caregivers might pose an added challenge to students (e.g., more students from single caregiver vs. married family households). Additionally, the POI occupancy dynamics demonstrated in Workers (Educators) could be used to assess the times at which worker commutes to/from places of employment might be most sensitive to a nearby disruption. Another application in the public health sphere might be to use occupancy estimates to anticipate the best time of day to reach workers, for example during a vaccination campaign.

Our case study had several limitations that we plan to overcome in future work. First, we assumed that all travel within our study area occurs along road networks. While road-based travel is the dominant means of travel in the Knoxville CBSA, this assumption is not transferable to other urban areas within the United States. Our eventual goal is to build in additional modes of travel like public transit, walk/bike, and ferries by expanding our ingest of OpenStreetMap features. Second, we do not yet offer direct support for non-traditional schools (e.g., populations with special needs, families on military bases). For example, the Tennessee School for the Deaf falls within our study area, and its compositional estimate could be refined if we reapportioned the students most likely in attendance to that location. Third, we did not account for teachers in virtual schools, which may form a portion of the missing work arrival times discussed in Workers (Educators). Work-from-home populations can be better incorporated into our travel simulations by applying work schedules from time-use surveys to probabilistically assign in-person or remote status based on occupation. We are particularly interested in using this technique with Likeness to better understand changing patterns of life during the COVID-19 pandemic in 2020.

Conclusion

The Likeness toolkit enhances agent creation for modeling human dynamics through its dual capabilities of high-fidelity ("vivid") agent characterization and travel along real-world transportation networks to POIs. These capabilities benefit planners and urban researchers by providing a richer understanding of how spatial policy interventions can be designed with respect to how people live, move, and interact. Likeness strives to be flexible toward a variety of research applications linked to human security, among them spatial epidemiology, transportation equity, and environmental hazards.

Several ongoing developments will further Likeness' capabilities. First, we plan to expand our support for POIs curated by location services (e.g., Google, Facebook, Here, TomTom, FourSquare) via the ORNL PlanetSense project [TBP+ 15], incorporating factors like facility size, hours of operation, and popularity curves to refine the destination capacity estimates required to perform actlike simulations. Second, along with multi-modal travel, we plan to incorporate multiple trip models based on large-scale human activity datasets like the American Time Use Survey¹¹ and National Household Travel Survey¹². Together, these improvements will extend our travel simulations to "non-obligate" population segments traveling to civic, social, and recreational activities [BMWR22]. Third, the current procedure for spatial allocation uses block groups as the target scale for population synthesis. However, there are a limited number of constraining variables available at the block group level. To include a larger volume of constraints (e.g., vehicle access, language), we are exploring an additional tract-level approach. P-MEDM in this case is run on cross-covariances between tracts and "supertract" aggregations created with the Max-p-regions problem [DAR12], [WRK21] implemented in PySAL's spopt [RA07], [FGK+ 21], [RAA+ 21], [FBG+ 22].

As a final note, the Likeness toolkit is being developed on top of key open source dependencies in the Scientific Python ecosystem, the core of which are, of course, numpy [HMvdW+ 20] and scipy [VGO+ 20]. Although an exhaustive list would be prohibitive, major packages not previously mentioned include geopandas [JdBF+ 21], matplotlib [Hun07], networkx [HSS08], pandas [pdt20], [WM10], and shapely [G+]. Our goal is to contribute to the community with releases of the packages comprising Likeness, but since this is an emerging project its development to date has been limited to researchers at ORNL. However, we plan to provide a fully open-sourced code base within the coming year through GitHub¹³.

11. https://www.bls.gov/tus
12. https://nhts.ornl.gov
13. https://github.com/ORNL

Acknowledgements

This material is based upon work supported by the U.S. Department of Energy under contract no. DE-AC05-00OR22725.

References

[ANM+ 18] H.M. Abdul Aziz, Nicholas N. Nagle, April M. Morton, Michael R. Hilliard, Devin A. White, and Robert N. Stewart. Exploring the impact of walk–bike infrastructure, safety [GFH20] James D. Gaboardi, David C. Folch, and Mark W. Horner. perception, and built-environment on active transportation Connecting Points to Spatial Networks: Effects on Discrete mode choice: a random parameter model using New York Optimization Models.
Geographical Analysis, 52(2):299–322, City commuter data. Transportation, 45(5):1207–1229, 2018. 2020. doi:10.1111/gean.12211. doi:10.1007/s11116-017-9760-8. [GLC+ 15] Isabella Gollini, Binbin Lu, Martin Charlton, Christopher [BBE+ 08] Christopher L. Barrett, Keith R. Bisset, Stephen G. Eubank, Brunsdon, and Paul Harris. GWmodel: An R package for Xizhou Feng, and Madhav V. Marathe. EpiSimdemics: an ef- exploring spatial heterogeneity using geographically weighted ficient algorithm for simulating the spread of infectious disease models. Journal of Statistical Software, 63(17):1–50, 2015. over large realistic social networks. In SC’08: Proceedings of doi:10.18637/jss.v063.i17. the 2008 ACM/IEEE Conference on Supercomputing, pages [GT22] James D. Gaboardi and Joseph V. Tuccillo. Simulating Travel 1–12. IEEE, 2008. doi:10.1109/SC.2008.5214892. to Points of Interest for Demographically-rich Synthetic Popu- [BBM96] Richard J. Beckman, Keith A. Baggerly, and Michael D. lations, February 2022. American Association of Geographers McKay. Creating synthetic baseline populations. Transporta- Annual Meeting. doi:10.5281/zenodo.6335783. tion Research Part A: Policy and Practice, 30(6):415–429, [Hew97] Kenneth Hewitt. Vulnerability Perspectives: the Human Ecol- 1996. doi:10.1016/0965-8564(96)00004-3. ogy of Endangerment. In Regions of Risk: A Geographical [BCD+ 06] Dimitris Ballas, Graham Clarke, Danny Dorling, Jan Rigby, Introduction to Disasters, chapter 6, pages 141–164. Addison and Ben Wheeler. Using geographical information systems and Wesley Longman, 1997. spatial microsimulation for the analysis of health inequalities. [HHSB12] Kirk Harland, Alison Heppenstall, Dianna Smith, and Mark H. Health Informatics Journal, 12(1):65–79, 2006. doi:10. Birkin. Creating realistic synthetic populations at varying 1177/1460458206061217. spatial scales: A comparative critique of population synthesis [BFH+ 17] Komal Basra, M. Patricia Fabian, Raymond R. Holberger, techniques. Journal of Artificial Societies and Social Simula- Robert French, and Jonathan I. Levy. Community-engaged tion, 15(1):1, 2012. doi:10.18564/jasss.1909. modeling of geographic and demographic patterns of mul- [Hit41] Frank L. Hitchcock. The Distribution of a Product from tiple public health risk factors. International Journal of Several Sources to Numerous Localities. Journal of Mathe- Environmental Research and Public Health, 14(7):730, 2017. matics and Physics, 20(1-4):224–230, 1941. doi:10.1002/ doi:10.3390/ijerph14070730. sapm1941201224. [BMWR22] Christa Brelsford, Jessica J. Moehl, Eric M. Weber, and [HMvdW+ 20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Amy N. Rose. Segmented Population Models: Improving the Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric LandScan USA Non-Obligate Population Estimate (NOPE). Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, American Association of Geographers 2022 Annual Meeting, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerk- 2022. wijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, [Boe17] Geoff Boeing. OSMnx: New methods for acquiring, con- Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin structing, analyzing, and visualizing complex street networks. Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Computers, Environment and Urban Systems, 65:126–139, Christoph Gohlke, and Travis E. Oliphant. Array programming September 2017. doi:10.1016/j.compenvurbsys. with NumPy. Nature, 585(7825):357–362, September 2020. 2017.05.004. 
doi:10.1038/s41586-020-2649-2. [CGSdG08] Isabel Correia, Luís Gouveia, and Francisco Saldanha-da [HNB+ 11] Jan A.C. Hontelez, Nico Nagelkerke, Till Bärnighausen, Roel Gama. Solving the variable size bin packing problem Bakker, Frank Tanser, Marie-Louise Newell, Mark N. Lurie, with discretized formulations. Computers & Operations Re- Rob Baltussen, and Sake J. de Vlas. The potential impact of search, 35(6):2103–2113, June 2008. doi:10.1016/j. RV144-like vaccines in rural South Africa: a study using the cor.2006.10.014. STDSIM microsimulation model. Vaccine, 29(36):6100–6106, 2011. doi:10.1016/j.vaccine.2011.06.059. [CLB09] Fuyuan Cao, Jiye Liang, and Liang Bai. A new initialization method for categorical data clustering. Expert Systems with [HSS08] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Applications, 36(7):10223–10228, 2009. doi:10.1016/j. Exploring Network Structure, Dynamics, and Function using eswa.2009.01.060. NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science [DAR12] Juan C. Duque, Luc Anselin, and Sergio J. Rey. THE MAX- Conference, pages 11 – 15, Pasadena, CA USA, 2008. URL: P-REGIONS PROBLEM*. Journal of Regional Science, https://www.osti.gov/biblio/960616. 52(3):397–419, 2012. doi:10.1111/j.1467-9787. [Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Com- 2011.00743.x. puting in Science & Engineering, 9(3):90–95, 2007. doi: [DKA+ 08] M. Diaz, J.J. Kim, G. Albero, S. De Sanjose, G. Clifford, F.X. 10.1109/MCSE.2007.55. Bosch, and S.J. Goldie. Health and economic impact of HPV [JdBF+ 21] Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann, 16 and 18 vaccination and cervical cancer screening in India. James McBride, Jacob Wasserman, Adrian Garcia Badaracco, British Journal of Cancer, 99(2):230–238, 2008. doi:10. Jeffrey Gerard, Alan D. Snow, Jeff Tratner, Matthew Perry, 1038/sj.bjc.6604462. Carson Farmer, Geir Arne Hjelle, Micah Cochran, Sean [dV21] Nelis J. de Vos. kmodes categorical clustering library. https: Gillies, Lucas Culbertson, Matt Bartos, Brendan Ward, Gia- //github.com/nicodv/kmodes, 2015–2021. como Caria, Mike Taves, Nick Eubank, sangarshanan, John [FBG+ 22] Xin Feng, Germano Barcelos, James D. Gaboardi, Elijah Flavin, Matt Richards, Sergio Rey, maxalbert, Aleksey Bi- Knaap, Ran Wei, Levi J. Wolf, Qunshan Zhao, and Sergio J. logur, Christopher Ren, Dani Arribas-Bel, Daniel Mesejo- Rey. spopt: a python package for solving spatial optimization León, and Leah Wasser. geopandas/geopandas: v0.10.2, Octo- problems in PySAL. Journal of Open Source Software, ber 2021. doi:10.5281/zenodo.5573592. 7(74):3330, 2022. doi:10.21105/joss.03330. [Kna78] Thomas R. Knapp. Canonical Correlation Analysis: A general [FGK+ 21] Xin Feng, James D. Gaboardi, Elijah Knaap, Sergio J. Rey, parametric significance-testing system. Psychological Bulletin, and Ran Wei. pysal/spopt, jan 2021. URL: https://github.com/ 85(2):410–416, 1978. doi:10.1037/0033-2909.85. pysal/spopt, doi:10.5281/zenodo.4444156. 2.410. [FL86] D.K. Friesen and M.A. Langston. Variable Sized Bin Packing. [Koo49] Tjalling C. Koopmans. Optimum Utilization of the Transporta- SIAM Journal on Computing, 15(1):222–230, February 1986. tion System. Econometrica, 17:136–146, 1949. Publisher: doi:10.1137/0215016. [Wiley, Econometric Society]. doi:10.2307/1907301. [FW12] Fletcher Foti and Paul Waddell. A Generalized Com- [LB13] Robin Lovelace and Dimitris Ballas. 
‘Truncate, replicate, putational Framework for Accessibility: From the Pedes- sample’: A method for creating integer weights for spa- trian to the Metropolitan Scale. In Transportation Re- tial microsimulation. Computers, Environment and Urban search Board Annual Conference, pages 1–14, 2012. Systems, 41:1–11, September 2013. doi:10.1016/j. URL: https://onlinepubs.trb.org/onlinepubs/conferences/2012/ compenvurbsys.2013.03.004. 4thITM/Papers-A/0117-000062.pdf. [LNB13] Stefan Leyk, Nicholas N. Nagle, and Barbara P. Buttenfield. [G+ ] Sean Gillies et al. Shapely: manipulation and analysis of Maximum Entropy Dasymetric Modeling for Demographic geometric objects, 2007–. URL: https://github.com/shapely/ Small Area Estimation. Geographical Analysis, 45(3):285– shapely. 306, July 2013. doi:10.1111/gean.12011. 134 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [MCB+ 08] Karyn Morrissey, Graham Clarke, Dimitris Ballas, Stephen Scan USA 2016 [Data set]. Technical report, Oak Ridge Hynes, and Cathal O’Donoghue. Examining access to GP National Laboratory, 2017. doi:10.48690/1523377. services in rural Ireland using microsimulation analysis. Area, [SEM14] Samarth Swarup, Stephen G. Eubank, and Madhav V. Marathe. 40(3):354–364, 2008. doi:10.1111/j.1475-4762. Computational epidemiology as a challenge domain for multi- 2008.00844.x. agent systems. In Proceedings of the 2014 international con- [MNP+ 17] April M. Morton, Nicholas N. Nagle, Jesse O. Piburn, ference on Autonomous agents and multi-agent systems, pages Robert N. Stewart, and Ryan McManamay. A hybrid dasy- 1173–1176, 2014. URL: https://www.ifaamas.org/AAMAS/ metric and machine learning approach to high-resolution aamas2014/proceedings/aamas/p1173.pdf. residential electricity consumption modeling. In Advances [SNGJ+ 09] Beate Sander, Azhar Nizam, Louis P. Garrison Jr., Maarten J. in Geocomputation, pages 47–58. Springer, 2017. doi: Postma, M. Elizabeth Halloran, and Ira M. Longini Jr. Eco- 10.1007/978-3-319-22786-3_5. nomic evaluation of influenza pandemic mitigation strate- [MOD11] Stuart Mitchell, Michael O’Sullivan, and Iain gies in the United States using a stochastic microsimulation Dunning. PuLP: A Linear Programming Toolkit transmission model. Value in Health, 12(2):226–233, 2009. for Python. Technical report, 2011. URL: doi:10.1111/j.1524-4733.2008.00437.x. https://www.dit.uoi.gr/e-class/modules/document/file.php/ [SPH11] Dianna M. Smith, Jamie R. Pearce, and Kirk Harland. Can 216/PAPERS/2011.%20PuLP%20-%20A%20Linear% a deterministic spatial microsimulation model provide reli- 20Programming%20Toolkit%20for%20Python.pdf. able small-area estimates of health behaviours? An example [MPN+ 17] April M. Morton, Jesse O. Piburn, Nicholas N. Nagle, H.M. of smoking prevalence in New Zealand. Health & Place, Aziz, Samantha E. Duchscherer, and Robert N. Stewart. A 17(2):618–624, 2011. doi:10.1016/j.healthplace. simulation approach for modeling high-resolution daytime 2011.01.001. commuter travel flows and distributions of worker subpopula- [ST20] Haroldo G. Santos and Túlio A.M. Toffolo. Mixed Integer Lin- tions. In GeoComputation 2017, Leeds, UK, pages 1–5, 2017. ear Programming with Python. Technical report, 2020. URL: URL: http://www.geocomputation.org/2017/papers/44.pdf. https://python-mip.readthedocs.io/_/downloads/en/latest/pdf/. [MS01] Harvey J. Miller and Shih-Lung Shaw. Geographic Informa- [TBP+ 15] Gautam S. Thakur, Budhendra L. Bhaduri, Jesse O. Piburn, tion Systems for Transportation: Principles and Applications. Kelly M. 
Sims, Robert N. Stewart, and Marie L. Urban. Oxford University Press, New York, 2001. PlanetSense: a real-time streaming and spatio-temporal an- [MS15] Harvey J. Miller and Shih-Lung Shaw. Geographic Informa- alytics platform for gathering geo-spatial intelligence from tion Systems for Transportation in the 21st Century. Geogra- open source data. In Proceedings of the 23rd SIGSPATIAL phy Compass, 9(4):180–189, 2015. doi:10.1111/gec3. International Conference on Advances in Geographic Informa- 12204. tion Systems, pages 1–4, 2015. doi:10.1145/2820783. [NBLS14] Nicholas N. Nagle, Barbara P. Buttenfield, Stefan Leyk, and 2820882. Seth Spielman. Dasymetric modeling and uncertainty. Annals [TCR08] Melanie N. Tomintz, Graham P. Clarke, and Janette E. Rigby. of the Association of American Geographers, 104(1):80–95, The geography of smoking in Leeds: estimating individual 2014. doi:10.1080/00045608.2013.843439. smoking rates and the implications for the location of stop [NCA13] Markku Nurhonen, Allen C. Cheng, and Kari Auranen. Pneu- smoking services. Area, 40(3):341–353, 2008. doi:10. mococcal transmission and disease in silico: a microsimu- 1111/j.1475-4762.2008.00837.x. lation model of the indirect effects of vaccination. PloS [TG22] Joseph V. Tuccillo and James D. Gaboardi. Connecting Vivid one, 8(2):e56079, 2013. doi:10.1371/journal.pone. Population Data to Human Dynamics, June 2022. Distilling 0056079. Diversity by Tapping High-Resolution Population and Survey [NLHH07] Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Data. doi:10.5281/zenodo.6607533. Zengyou He. On the impact of dissimilarity measure in [TM21] Joseph V. Tuccillo and Jessica Moehl. An Individual- k-modes clustering algorithm. IEEE Transactions on Pat- Oriented Typology of Social Areas in the United States, May tern Analysis and Machine Intelligence, 29(3):503–507, 2007. 2021. 2021 ACS Data Users Conference. doi:10.5281/ doi:10.1109/TPAMI.2007.53. zenodo.6672291. [pdt20] The pandas development team. pandas-dev/pandas: Pandas, [TMKD17] Matthias Templ, Bernhard Meindl, Alexander Kowarik, and February 2020. doi:10.5281/zenodo.3509134. Olivier Dupriez. Simulation of synthetic complex data: The [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, R package simPop. Journal of Statistical Software, 79:1–38, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, 2017. doi:10.18637/jss.v079.i10. V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, [Tuc21] Joseph V. Tuccillo. An Individual-Centered Approach for M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Geodemographic Classification. In 11th International Con- Machine Learning in Python. Journal of Machine Learning ference on Geographic Information Science 2021 Short Paper Research, 12:2825–2830, 2011. URL: https://www.jmlr.org/ Proceedings, pages 1–6, 2021. doi:10.25436/E2H59M. papers/v12/pedregosa11a.html. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt [QC13] Fang Qiu and Robert Cromley. Areal Interpolation and Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Dasymetric Modeling: Areal Interpolation and Dasymetric Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- Modeling. Geographical Analysis, 45(3):213–215, July 2013. fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- doi:10.1111/gean.12016. rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric [RA07] Sergio J. Rey and Luc Anselin. PySAL: A Python Library of Jones, Robert Kern, Eric Larson, C.J. Carey, İlhan Polat, Spatial Analytical Methods. 
The Review of Regional Studies, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, 37(1):5–27, 2007. URL: https://rrs.scholasticahq.com/article/ Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quin- 8285.pdf, doi:10.52324/001c.8285. tero, Charles R. Harris, Anne M. Archibald, Antônio H. [RAA+ 21] Sergio J. Rey, Luc Anselin, Pedro Amaral, Dani Arribas- Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy Bel, Renan Xavier Cortes, James David Gaboardi, Wei Kang, 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Elijah Knaap, Ziqi Li, Stefanie Lumnitz, Taylor M. Oshan, Scientific Computing in Python. Nature Methods, 17:261–272, Hu Shao, and Levi John Wolf. The PySAL Ecosystem: 2020. doi:10.1038/s41592-019-0686-2. Philosophy and Implementation. Geographical Analysis, 2021. [WCC+ 09] William D. Wheaton, James C. Cajka, Bernadette M. Chas- doi:10.1111/gean.12276. teen, Diane K. Wagener, Philip C. Cooley, Laxminarayana [RSF+ 21] Krishna P. Reddy, Fatma M. Shebl, Julia H.A. Foote, Guy Ganapathi, Douglas J. Roberts, and Justine L. Allpress. Harling, Justine A. Scott, Christopher Panella, Kieran P. Fitz- Synthesized population databases: A US geospatial database maurice, Clare Flanagan, Emily P. Hyle, Anne M. Neilan, et al. for agent-based models. Methods report (RTI Press), Cost-effectiveness of public health strategies for COVID-19 2009(10):905, 2009. doi:10.3768/rtipress.2009. epidemic control in South Africa: a microsimulation modelling mr.0010.0905. study. The Lancet Global Health, 9(2):e120–e129, 2021. [WM10] Wes McKinney. Data Structures for Statistical Computing in doi:10.1016/S2214-109X(20)30452-6. Python. In Stéfan van der Walt and Jarrod Millman, editors, [RWM+ 17] Amy N. Rose, Eric M. Weber, Jessica J. Moehl, Melanie L. Proceedings of the 9th Python in Science Conference, pages 56 Laverdiere, Hsiu-Han Yang, Matthew C. Whitehead, Kelly M. – 61, 2010. doi:10.25080/Majora-92bf1922-00a. Sims, Nathan E. Trombley, and Budhendra L. Bhaduri. Land- [WRK21] Ran Wei, Sergio J. Rey, and Elijah Knaap. Efficient re- LIKENESS: A TOOLKIT FOR CONNECTING THE SOCIAL FABRIC OF PLACE TO HUMAN DYNAMICS 135 gionalization for spatially explicit neighborhood delineation. International Journal of Geographical Information Science, 35(1):135–151, 2021. doi:10.1080/13658816.2020. 1759806. [ZFJ14] Yi Zhu and Joseph Ferreira Jr. Synthetic population gener- ation at disaggregated spatial scales for land use and trans- portation microsimulation. Transportation Research Record, 2429(1):168–177, 2014. doi:10.3141/2429-18. 136 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) poliastro: a Python library for interactive astrodynamics Juan Luis Cano Rodríguez‡∗ , Jorge Martínez Garrido‡ https://www.youtube.com/watch?v=VCpTgU1pb5k F Abstract—Space is more popular than ever, with the growing public awareness problem. This work was generalized by Newton to give birth to of interplanetary scientific missions, as well as the increasingly large number the n-body problem, and many other mathematicians worked on of satellite companies planning to deploy satellite constellations. Python has it throughout the centuries (Daniel and Johann Bernoulli, Euler, become a fundamental technology in the astronomical sciences, and it has also Gauss). Poincaré established in the 1890s that no general closed- caught the attention of the Space Engineering community. 
form solution exists for the n-body problem, since the resulting One of the requirements for designing a space mission is studying the trajectories of satellites, probes, and other artificial objects, usually ignoring dynamical system is chaotic [Bat99]. Sundman proved in the non-gravitational forces or treating them as perturbations: the so-called n-body 1900s the existence of convergent solutions for a few restricted problem. However, for preliminary design studies and most practical purposes, it with n = 3. is sufficient to consider only two bodies: the object under study and its attractor. M = E − e sin E (1) Even though the two-body problem has many analytical solutions, or- In 1903 Tsiokovsky evaluated the conditions required for artificial bit propagation (the initial value problem) and targeting (the boundary value problem) remain computationally intensive because of long propagation times, objects to leave the orbit of the earth; this is considered as a foun- tight tolerances, and vast solution spaces. On the other hand, astrodynamics dational contribution to the field of astrodynamics. Tsiokovsky researchers often do not share the source code they used to run analyses and devised equation 2 which relates the increase in velocity with the simulations, which makes it challenging to try out new solutions. effective exhaust velocity of thrusted gases and the fraction of used This paper presents poliastro, an open-source Python library for interactive propellant. m0 astrodynamics that features an easy-to-use API and tools for quick visualization. ∆v = ve ln (2) poliastro implements core astrodynamics algorithms (such as the resolution mf of the Kepler and Lambert problems) and leverages numba, a Just-in-Time Further developments by Kondratyuk, Hohmann, and Oberth in compiler for scientific Python, to optimize the running time. Thanks to Astropy, the early 20th century all added to the growing field of orbital poliastro can perform seamless coordinate frame conversions and use proper mechanics, which in turn enabled the development of space flight physical units and timescales. At the moment, poliastro is the longest-lived Python library for astrodynamics, has contributors from all around the world, in the USSR and the United States in the 1950s and 1960s. and several New Space companies and people in academia use it. The two-body problem In a system of i ∈ 1, ..., n bodies subject to their mutual attraction, Index Terms—astrodynamics, orbital mechanics, orbit propagation, orbit visu- alization, two-body problem by application of Newton’s law of universal gravitation, the total force fi affecting mi due to the presence of the other n − 1 masses is given by [Bat99]: Introduction n mi m j fi = −G ∑ r 3 ij (3) History j6=i |ri j | The term "astrodynamics" was coined by the American as- where G = 6.67430 · 10−11 N m2 kg−2 is the universal gravita- tronomer Samuel Herrick, who received encouragement from tional constant, and ri j denotes the position vector from mi to m j . the space pioneer Robert H. Goddard, and refers to the branch Applying Newton’s second law of motion results in a system of n of space science dealing with the motion of artificial celestial differential equations: bodies ([Dub73], [Her71]). However, the roots of its mathematical foundations go back several centuries. 
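To make the transcendental equation (1) concrete, here is a minimal Newton iteration for recovering the eccentric anomaly E from the mean anomaly M and the eccentricity e; this is only an illustrative sketch, not poliastro's own (optimized, Numba-compiled) routine.

import math

def eccentric_anomaly(M, ecc, tol=1e-12, maxiter=50):
    """Solve Kepler's equation M = E - e*sin(E) for E with Newton's method."""
    E = M if ecc < 0.8 else math.pi  # standard starting guess
    for _ in range(maxiter):
        delta = (E - ecc * math.sin(E) - M) / (1 - ecc * math.cos(E))
        E -= delta
        if abs(delta) < tol:
            return E
    raise RuntimeError("Newton iteration did not converge")

E = eccentric_anomaly(M=1.0, ecc=0.3)
print(E, E - 0.3 * math.sin(E))  # the second value recovers M = 1.0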
d2 ri n mj 2 = −G ∑ r 3 ij (4) Kepler first introduced his laws of planetary motion in 1609 dt j6=i i j | |r and 1619 and derived his famous transcendental equation (1), By setting n = 2 in 4 and subtracting the two resulting equali- which we now see as capturing a restricted form of the two-body ties, one arrives to the fundamental equation of the two-body problem: * Corresponding author: hello@juanlu.space ‡ Unaffiliated d2 r µ =− 3r (5) dt 2 r Copyright © 2022 Juan Luis Cano Rodríguez et al. This is an open-access where µ = G(m1 + m2 ) = G(M + m). When m M (for example, article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any an artificial satellite orbiting a planet), one can consider µ = GM medium, provided the original author and source are credited. a property of the attractor. POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 137 Keplerian vs non-keplerian motion State of the art Conveniently manipulating equation 5 leads to several properties In our view, at the time of creating poliastro there were a number [Bat99] that were already published by Johannes Kepler in the of issues with existing open source astrodynamics software that 1610s, namely: posed a barrier of entry for novices and amateur practitioners. 1) The orbit always describes a conic section (an ellipse, a Most of these barriers still exist today and are described in the parabola, or an hyperbola), with the attractor at one of following paragraphs. The goals of the project can be condensed the two foci and can be written in polar coordinates like as follows: r = 1+epcos ν (Kepler’s first law). 1) Set an example on reproducibility and good coding prac- 2) The magnitude of the specific angular momentum h = tices in astrodynamics. r2 ddtθ is constant an equal to two times the areal velocity 2) Become an approachable software even for novices. (Kepler’s second law). 3) Offer a performant software that can be also used in 3) For closed (circular and elliptical) orbits, the periodq is scripting and interactive workflows. 3 related to the size of the orbit through P = 2π aµ (Kepler’s third law). The most mature software libraries for astrodynamics are arguably Orekit [noa22c], a "low level space dynamics library For many practical purposes it is usually sufficient to limit written in Java" with an open governance model, and SPICE the study to one object orbiting an attractor and ignore all other [noa22d], a toolkit developed by NASA’s Navigation and An- external forces of the system, hence restricting the study to cillary Information Facility at the Jet Propulsion Laboratory. trajectories governed by equation 5. Such trajectories are called Other similar, smaller projects that appeared later on and that "Keplerian", and several problems can be formulated for them: are still maintained to this day include PyKEP [IBD+ 20], be- • The initial-value problem, which is usually called prop- yond [noa22a], tudatpy [noa22e], sbpy [MKDVB+ 19], Skyfield agation, involves determining the position and velocity of [Rho20] (Python), CelestLab (Scilab) [noa22b], astrodynamics.jl an object after an elapse period of time given some initial (Julia) [noa] and Nyx (Rust) [noa21a]. In addition, there are conditions. 
some Graphical User Interface (GUI) based open source programs • Preliminary orbit determination, which involves using used for Mission Analysis and orbit visualization, such as GMAT exact or approximate methods to derive a Keplerian orbit [noa20] and gpredict [noa18], and complete web applications for from a set of observations. tracking constellations of satellites like the SatNOGS project by • The boundary-value problem, often named the Lambert the Libre Space Foundation [noa21b]. problem, which involves determining a Keplerian orbit The level of quality and maintenance of these packages is from boundary conditions, usually departure and arrival somewhat heterogeneous. Community-led projects with a strong position vectors and a time of flight. corporate backing like Orekit are in excellent health, while on the other hand smaller projects developed by volunteers (beyond, Fortunately, most of these problems boil down to finding astrodynamics.jl) or with limited institutional support (PyKEP, numerical solutions to relatively simple algebraic relations be- GMAT) suffer from lack of maintenance. Part of the problem tween time and angular variables: for elliptic motion (0 ≤ e < 1) might stem from the fact that most scientists are never taught how it is the Kepler equation, and equivalent relations exist for the to build software efficiently, let alone the skills to collaboratively other eccentricity regimes [Bat99]. Numerical solutions for these develop software in the open [WAB+ 14], and astrodynamicists are equations can be found in a number of different ways, each one no exception. with different complexity and precision tradeoffs. In the Methods On the other hand, it is often difficult to translate the advances section we list the ones implemented by poliastro. in astrodynamics research to software. Classical algorithms devel- On the other hand, there are many situations in which natural oped throughout the 20th century are described in papers that are and artificial orbital perturbations must be taken into account so sometimes difficult to find, and source code or validation data that the actual non-Keplerian motion can be properly analyzed: is almost never available. When it comes to modern research • Interplanetary travel in the proximity of other planets. On carried in the digital era, source code and validation data is a first approximation it is usually enough to study the still difficult, even though they are supposedly provided "upon trajectory in segments and focus the analysis on the closest reasonable request" [SSM18] [GBP22]. attractor, hence patching several Keplerian orbits along It is no surprise that astrodynamics software often requires the way (the so-called "patched-conic approximation") deep expertise. However, there are often implicit assumptions that [Bat99]. The boundary surface that separates one segment are not documented with an adequate level of detail which orig- from the other is called the sphere of influence. inate widespread misconceptions and lead even seasoned profes- • Use of solar sails, electric propulsion, or other means sionals to make conceptual mistakes. Some of the most notorious of continuous thrust. Devising the optimal guidance laws misconceptions arise around the use of general perturbations data that minimize travel time or fuel consumption under these (OMMs and TLEs) [Fin07], the geometric interpretation of the conditions is usually treated as an optimization problem mean anomaly [Bat99], or coordinate transformations [VCHK06]. 
of a dynamical system, and as such it is particularly Finally, few of the open source software libraries mentioned challenging [Con14]. above are amenable to scripting or interactive use, as promoted by • Artificial satellites in the vicinity of a planet. This is computational notebooks like Jupyter [KRKP+ 16]. the regime in which all the commercial space industry The following sections will now discuss the various areas of operates, especially for those satellites in Low-Earth Orbit current research that an astrodynamicist will engage in, and how (LEO). poliastro improves their workflow. 138 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Methods Nice, high level API Software Architecture The architecture of poliastro emerges from the following set of conflicting requirements: Dangerous™ algorithms 1) There should be a high-level API that enables users to perform orbital calculations in a straightforward way and Fig. 1: poliastro two-layer architecture prevent typical mistakes. 2) The running time of the algorithms should be within the Most of the methods of the High level API consist only same order of magnitude of existing compiled implemen- of the necessary unit compatibility checks, plus a wrapper over tations. the corresponding Core API function that performs the actual 3) The library should be written in a popular open-source computation. language to maximize adoption and lower the barrier to @u.quantity_input(E=u.rad, ecc=u.one) external contributors. def E_to_nu(E, ecc): """True anomaly from eccentric anomaly.""" One of the most typical mistakes we set ourselves to prevent return ( with the high-level API is dimensional errors. Addition and E_to_nu_fast( E.to_value(u.rad), substraction operations of physical quantities are defined only for ecc.value quantities with the same units [Dro53]: for example, the operation ) << u.rad 1 km + 100 m requires a scale transformation of at least one ).to(E.unit) of the operands, since they have different units (kilometers and As a result, poliastro offers a unit-safe API that performs the least meters) but the same dimension (length), whereas the operation amount of computation possible to minimize the performance 1 km + 1 kg is directly not allowed because dimensions are penalty of unit checks, and also a unit-unsafe API that offers incompatible (length and mass). As such, software systems oper- maximum performance at the cost of not performing any unit ating with physical quantities should raise exceptions when adding validation checks. different dimensions, and transparently perform the required scale Finally, there are several options to write performant code that transformations when adding different units of the same dimen- can be used from Python, and one of them is using a fast, compiled sion. language for the CPU intensive parts. Successful examples of this With this in mind, we evaluated several Python packages for include NumPy, written in C [HMvdW+ 20], SciPy, featuring a unit handling (see [JGAZJT+ 18] for a recent survey) and chose mix of FORTRAN, C, and C++ code [VGO+ 20], and pandas, astropy.units [TPWS+ 18]. making heavy use of Cython [BBC+ 11]. However, having to radius = 6000 # km write code in two different languages hinders the development altitude = 500 # m speed, makes debugging more difficult, and narrows the potential # Wrong! contributor base (what Julia creators called "The Two Language distance = radius + altitude Problem" [BEKS17]). 
As authors of poliastro we wanted to use Python as the from astropy import units as u sole programming language of the implementation, and the best # Correct solution we found to improve its performance was to use Numba, distance = (radius << u.km) + (altitude << u.m) a LLVM-based Python JIT compiler [LPS15]. This notion of providing a "safe" API extends to other parts Usage of the library by leveraging other capabilities of the Astropy Basic Orbit and Ephem creation project. For example, timestamps use astropy.time objects, which take care of the appropriate handling of time scales The two central objects of the poliastro high level API are Orbit (such as TDB or UTC), reference frame conversions leverage and Ephem: astropy.coordinates, and so forth. • Orbit objects represent an osculating (hence Keplerian) One of the drawbacks of existing unit packages is that orbit of a dimensionless object around an attractor at a they impose a significant performance penalty. Even though given point in time and a certain reference frame. astropy.units is integrated with NumPy, hence allowing • Ephem objects represent an ephemerides, a sequence of the creation of array quantities, all the unit compatibility checks spatial coordinates over a period of time in a certain are implemented in Python and require lots of introspection, and reference frame. this can slow down mathematical operations by several orders of There are six parameters that uniquely determine a Keplerian magnitude. As such, to fulfill our desired performance requirement orbit, plus the gravitational parameter of the corresponding attrac- for poliastro, we envisioned a two-layer architecture: tor (k or µ). Optionally, an epoch that contextualizes the orbit • The Core API follows a procedural style, and all the can be included as well. This set of six parameters is not unique, functions receive Python numerical types and NumPy and several of them have been developed over the years to serve arrays for maximum performance. different purposes. The most widely used ones are: • The High level API is object-oriented, all the methods • Cartesian elements: Three components for the position receive Astropy Quantity objects with physical units, (x, y, z) and three components for the velocity (vx , vy , vz ). and computations are deferred to the Core API. This set has no singularities. POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 139 • Classical Keplerian elements: Two components for the shape of the conic (usually the semimajor axis a or from poliastro.ephem import Ephem semiparameter p and the eccentricity e), three Euler angles # Configure high fidelity ephemerides globally for the orientation of the orbital plane in space (inclination # (requires network access) i, right ascension of the ascending node Ω, and argument solar_system_ephemeris.set("jpl") of periapsis ω), and one polar angle for the position of the # For predefined poliastro attractors body along the conic (usually true anomaly f or ν). This earth = Ephem.from_body(Earth, Time.now().tdb) set of elements has an easy geometrical interpretation and the advantage that, in pure two-body motion, five of them # For the rest of the Solar System bodies ceres = Ephem.from_horizons("Ceres", Time.now().tdb) are fixed (a, e, i, Ω, ω) and only one is time-dependent (ν), which greatly simplifies the analytical treatment of There are some crucial differences between Orbit and Ephem orbital perturbations. 
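For reference, the creation and propagation steps discussed above can be combined into one short, runnable snippet; the state vectors are the Curtis example values quoted earlier, and the epoch here is an arbitrary assumption.

from astropy import units as u
from astropy.time import Time
from poliastro.bodies import Earth
from poliastro.twobody import Orbit

# State vectors from Curtis, example 4.3 (as quoted above).
r = [-6045, -3490, 2500] << u.km
v = [-3.457, 6.618, 2.533] << u.km / u.s

orb = Orbit.from_vectors(Earth, r, v, epoch=Time("2022-06-05", scale="tdb"))
orb_later = orb.propagate(30 << u.min)   # returns a new Orbit with updated epoch
print(orb)
print(orb_later.nu.to(u.deg))            # true anomaly after 30 minutes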
However, they suffer from singular- objects: ities steming from the Euler angles ("gimbal lock") and • Orbit objects have an attractor, whereas Ephem objects equations expressed in them are ill-conditioned near such do not. Ephemerides can originate from complex trajecto- singularities. ries that don’t necessarily conform to the ideal two-body • Walker modified equinoctial elements: Six parameters problem. (p, f , g, h, k, L). Only L is time-dependent and this set has • Orbit objects capture a precise instant in a two-body mo- no singularities, however the geometrical interpretation of tion plus the necessary information to propagate it forward the rest of the elements is lost [WIO85]. in time indefinitely, whereas Ephem objects represent a Here is how to create an Orbit from cartesian and from clas- bounded time history of a trajectory. This is because the sical Keplerian elements. Walker modified equinoctial elements equations for the two-body motion are known, whereas are supported as well. an ephemeris is either an observation or a prediction from astropy import units as u that cannot be extrapolated in any case without external knowledge. As such, Orbit objects have a .propagate from poliastro.bodies import Earth, Sun method, but Ephem ones do not. This prevents users from from poliastro.twobody import Orbit from poliastro.constants import J2000 attempting to propagate the position of the planets, which will always yield poor results compared to the excellent # Data from Curtis, example 4.3 ephemerides calculated by external entities. r = [-6045, -3490, 2500] << u.km v = [-3.457, 6.618, 2.533] << u.km / u.s Finally, both types have methods to convert between them: • Ephem.from_orbit is the equivalent of sampling a orb_curtis = Orbit.from_vectors( Earth, # Attractor two-body motion over a given time interval. As explained r, v # Elements above, the resulting Ephem loses the information about ) the original attractor. # Data for Mars at J2000 from JPL HORIZONS • Orbit.from_ephem is the equivalent of calculating a = 1.523679 << u.au the osculating orbit at a certain point of a trajectory, ecc = 0.093315 << u.one assuming a given attractor. The resulting Orbit loses inc = 1.85 << u.deg the information about the original, potentially complex raan = 49.562 << u.deg argp = 286.537 << u.deg trajectory. nu = 23.33 << u.deg Orbit propagation orb_mars = Orbit.from_classical( Orbit objects have a .propagate method that takes an elapsed Sun, a, ecc, inc, raan, argp, nu, time and returns another Orbit with new orbital elements and an J2000 # Epoch updated epoch: ) >>> from poliastro.examples import iss When displayed on an interactive REPL, Orbit objects provide >>> iss basic information about the geometry, the attractor, and the epoch: >>> 6772 x 6790 km x 51.6 deg (GCRS) ... >>> orb_curtis 7283 x 10293 km x 153.2 deg (GCRS) orbit >>> iss.nu.to(u.deg) around Earth (X) at epoch J2000.000 (TT) <Quantity 46.59580468 deg> >>> orb_mars >>> iss_30m = iss.propagate(30 << u.min) 1 x 2 AU x 1.9 deg (HCRS) orbit around Sun (X) at epoch J2000.000 (TT) >>> (iss_30m.epoch - iss.epoch).datetime datetime.timedelta(seconds=1800) Similarly, Ephem objects can be created using a variety of class- methods as well. 
Thanks to astropy.coordinates built-in >>> (iss_30m.nu - iss.nu).to(u.deg) <Quantity 116.54513153 deg> low-fidelity ephemerides, as well as its capability to remotely The default propagation algorithm is an analytical procedure access the JPL HORIZONS system, the user can seamlessly build described in [FCM13] that works seamlessly in the near parabolic an object that contains the time history of the position of any Solar System body: region. In addition, poliastro implements analytical propagation from astropy.time import Time algorithms as described in [DB83], [OG86], [Mar95], [Mik87], from astropy.coordinates import solar_system_ephemeris [PP13], [Cha22], and [VM07]. 140 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) rr = propagate( orbit, tofs, method=cowell, f=f, ) Continuous thrust control laws Beyond natural perturbations, spacecraft can modify their trajec- tory on purpose by using impulsive maneuvers (as explained in the next section) as well as continuous thrust guidance laws. The user can define custom guidance laws by providing a perturbation Fig. 2: Osculating (Keplerian) vs perturbed (true) orbit (source: acceleration in the same way natural perturbations are used. In Wikipedia, CC BY-SA 3.0) addition, poliastro includes several analytical solutions for con- tinuous thrust guidance laws with specific purposes, as studied in [CR17]: optimal transfer between circular coplanar orbits [Ede61] Natural perturbations [Bur67], optimal transfer between circular inclined orbits [Ede61] As showcased in Figure 2, at any point in a trajectory we [Kec97], quasi-optimal eccentricity-only change [Pol97], simulta- can define an ideal Keplerian orbit with the same position and neous eccentricity and inclination change [Pol00], and agument of velocity under the attraction of a point mass: this is called the periapsis adjustment [Pol98]. A much more rigorous analysis of a osculating orbit. Some numerical propagation methods exist that similar set of laws can be found in [DCV21]. model the true, perturbed orbit as a deviation from an evolving, from poliastro.twobody.thrust import change_ecc_inc osculating orbit. poliastro implements Cowell’s method [CC10], which consists in adding all the perturbation accelerations and then ecc_f = 0.0 << u.one inc_f = 20.0 << u.deg integrating the resulting differential equation with any numerical f = 2.4e-6 << (u.km / u.s**2) method of choice: d2 r µ a_d, _, t_f = change_ecc_inc(orbit, ecc_f, inc_f, f) 2 = − 3 r + ad (6) dt r The resulting equation is usually integrated using high order Impulsive maneuvers numerical methods, since the integration times are quite large and the tolerances comparatively tight. An in-depth discussion of Impulsive maneuvers are modeled considering a change in the such methods can be found in [HNW09]. poliastro uses Dormand- velocity of a spacecraft while its position remains fixed. The Prince 8(5,3) (DOP853), a commonly used method available in poliastro.maneuver.Maneuver class provides various SciPy [HMvdW+ 20]. constructors to instantiate popular impulsive maneuvers in the There are several natural perturbations included: J2 and J3 framework of the non-perturbed two-body problem: gravitational terms, several atmospheric drag models (exponential, • Maneuver.impulse [Jac77], [AAAA62], [AAA+ 76]), and helpers for third body • Maneuver.hohmann gravitational attraction and radiation pressure as described in [?]. 
• Maneuver.bielliptic @njit • Maneuver.lambert def combined_a_d( t0, state, k, j2, r_eq, c_d, a_over_m, h0, rho0 ): from poliastro.maneuver import Maneuver return ( J2_perturbation( orb_i = Orbit.circular(Earth, alt=700 << u.km) t0, state, k, j2, r_eq hoh = Maneuver.hohmann(orb_i, r_f=36000 << u.km) ) + atmospheric_drag_exponential( t0, state, k, r_eq, c_d, a_over_m, h0, rho0Once instantiated, Maneuver objects provide information regard- ) ing total ∆v and ∆t: ) >>> hoh.get_total_cost() <Quantity 3.6173981270031357 km / s> def f(t0, state, k): du_kep = func_twobody(t0, state, k) >>> hoh.get_total_time() ax, ay, az = combined_a_d( <Quantity 15729.741535747102 s> t0, state, Maneuver objects can be applied to Orbit instances using the k, R=R, apply_maneuver method. C_D=C_D, >>> orb_i A_over_m=A_over_m, 7078 x 7078 km x 0.0 deg (GCRS) orbit H0=H0, around Earth (X) rho0=rho0, J2=Earth.J2.value, >>> orb_f = orb_i.apply_maneuver(hoh) ) >>> orb_f du_ad = np.array([0, 0, 0, ax, ay, az]) 36000 x 36000 km x 0.0 deg (GCRS) orbit around Earth (X) return du_kep + du_ad POLIASTRO: A PYTHON LIBRARY FOR INTERACTIVE ASTRODYNAMICS 141 Targeting Earth - Mars for year 2020-2021, C3 launch 2021-05 34.1 Targeting is the problem of finding the orbit connecting two Days of flight .0 .8 .0 Arrival velocity km/s 41.90 434553750..04273 400 24 0 31.0 200 35.7 positions over a finite amount of time. Within the context of 5. 43.4 313.80.8 2021-04 .0 the non-perturbed two-body problem, targeting is just a matter 26 .4 37.24 32.6 41.9 of solving the BVP, also known as Lambert’s problem. Because 24.8 targeting tries to find for an orbit, the problem is included in the 2021-03 32.59 29.5 34.1 37.2 410.9.0 18..86 .9 Initial Orbit Determination field. 30 20.2 3 27 17.1 27.93 The poliastro.iod package contains izzo and 2021-02 Arrival date 23.3 38.8 vallado modules. These provide a lambert function for solv- km2 / s2 15.5 40.3 45.0 23.28 3.8 5 29. 26.4 ing the targeting problem. Nevertheless, a Maneuver.lambert 21.7 2021-01 27.9 constructor is also provided so users can keep taking advantage of 18.62 32.6 Orbit objects. 13.97 2020-12 # Declare departure and arrival datetimes date_launch = time.Time( 9.31 '2011-11-26 15:02', scale='tdb' 5.0 2020-11 ) Perseverance 4.66 .0 Tianwen-1 100 date_arrival = time.Time( '2012-08-06 05:17', scale='tdb' Hope Mars 2020-10 0.00 ) 3 4 5 6 7 8 9 0 0-0 0-0 0-0 0-0 0-0 0-0 0-0 0-1 202 202 202 202 202 202 202 202 # Define initial and final orbits Launch date orb_earth = Orbit.from_ephem( Sun, Ephem.from_body(Earth, date_launch), Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy showing date_launch latest missions to the Martian planet. ) orb_mars = Orbit.from_ephem( Sun, Ephem.from_body(Mars, date_arrival), date_arrival Generated graphics can be static or interactive. The main ) difference between these two is the ability to modify the camera view in a dynamic way when using interactive plotters. # Compute targetting maneuver and apply it man_lambert = Maneuver.lambert(orb_earth, orb_mars) The most important classes in the poliastro.plotting orb_trans, orb_target = ss0.apply_maneuver( package are StaticOrbitPlotter and OrbitPlotter3D. man_lambert, intermediate=true In addition, the poliastro.plotting.misc module con- ) tains the plot_solar_system function, which allows the user Targeting is closely related to quick mission design by means of to visualize inner and outter both in 2D and 3D, as requested by porkchop diagrams. These are contour plots showing all combi- users. 
Targeting

Targeting is the problem of finding the orbit connecting two positions over a finite amount of time. Within the context of the non-perturbed two-body problem, targeting is just a matter of solving the boundary value problem (BVP), also known as Lambert's problem. Because targeting tries to find an orbit, the problem is included in the Initial Orbit Determination field.

The poliastro.iod package contains the izzo and vallado modules. These provide a lambert function for solving the targeting problem. Nevertheless, a Maneuver.lambert constructor is also provided so users can keep taking advantage of Orbit objects.

# Declare departure and arrival datetimes
date_launch = time.Time('2011-11-26 15:02', scale='tdb')
date_arrival = time.Time('2012-08-06 05:17', scale='tdb')

# Define initial and final orbits
orb_earth = Orbit.from_ephem(
    Sun, Ephem.from_body(Earth, date_launch), date_launch
)
orb_mars = Orbit.from_ephem(
    Sun, Ephem.from_body(Mars, date_arrival), date_arrival
)

# Compute the targeting maneuver and apply it
man_lambert = Maneuver.lambert(orb_earth, orb_mars)
orb_trans, orb_target = orb_earth.apply_maneuver(
    man_lambert, intermediate=True
)

Targeting is closely related to quick mission design by means of porkchop diagrams. These are contour plots showing all combinations of departure and arrival dates with the specific energy of each transfer orbit. They allow for quick identification of the optimal transfer dates between two bodies.

The poliastro.plotting.porkchop module provides the PorkchopPlotter class, which allows the user to generate these diagrams.

from poliastro.plotting.porkchop import PorkchopPlotter
from poliastro.util import time_range

# Generate all launch and arrival dates
launch_span = time_range(
    "2020-03-01", end="2020-10-01", periods=150
)
arrival_span = time_range(
    "2020-10-01", end="2021-05-01", periods=150
)

# Create an instance of the porkchop and plot it
porkchop = PorkchopPlotter(
    Earth, Mars, launch_span, arrival_span,
)

The previous code, with some additional customization, generates figure 3.

[Fig. 3: Porkchop plot for Earth-Mars transfer arrival energy, showing the latest missions to Mars (Perseverance, Tianwen-1, and Hope Mars).]

Plotting

For visualization purposes, poliastro provides the poliastro.plotting package, which contains various utilities for generating 2D and 3D graphics using different backends such as matplotlib [Hun07] and Plotly [Inc15]. Generated graphics can be static or interactive. The main difference between the two is the ability to modify the camera view in a dynamic way when using interactive plotters.

The most important classes in the poliastro.plotting package are StaticOrbitPlotter and OrbitPlotter3D. In addition, the poliastro.plotting.misc module contains the plot_solar_system function, which allows the user to visualize the inner and outer planets both in 2D and 3D, as requested by users.

The following example illustrates the plotting capabilities of poliastro. First, the orbits to be plotted are computed and their plotting style is declared:

from poliastro.plotting.misc import plot_solar_system

# Current datetime
now = Time.now().tdb

# Obtain Florence and Halley orbits
florence = Orbit.from_sbdb("Florence")
halley_1835_ephem = Ephem.from_horizons("90000031", now)
halley_1835 = Orbit.from_ephem(
    Sun, halley_1835_ephem, halley_1835_ephem.epochs[0]
)

# Define orbit labels and color styles
florence_style = {"label": "Florence", "color": "#000000"}
halley_style = {"label": "Halley", "color": "#84B0B8"}

The static two-dimensional plot can be created using the following code:

# Generate a static 2D figure
frame2D = plot_solar_system(epoch=now, outer=False)
frame2D.plot(florence, **florence_style)
frame2D.plot(halley_1835, **halley_style)

As a result, figure 4 is obtained.

[Fig. 4: Two-dimensional view of the inner Solar System, Florence, and Halley.]

The interactive three-dimensional plot can be created using the following code:

# Generate an interactive 3D figure
frame3D = plot_solar_system(
    epoch=now, outer=False,
    use_3d=True, interactive=True
)
frame3D.plot(florence, **florence_style)
frame3D.plot(halley_1835, **halley_style)

As a result, figure 5 is obtained.

[Fig. 5: Three-dimensional view of the inner Solar System, Florence, and Halley.]
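The plotter classes named above can also be used directly when only a few orbits need to be drawn. A minimal sketch, assuming the same label and color keywords used in the styles above:

from poliastro.plotting import StaticOrbitPlotter

# Draw a single orbit with the default matplotlib backend
plotter = StaticOrbitPlotter()
plotter.plot(florence, label="Florence", color="#000000")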
Commercial Earth satellites

[Fig. 6: Natural perturbations affecting Low-Earth Orbit (LEO) motion (source: [VM07])]

Figure 6 gives a clear picture of the most important natural perturbations affecting satellites in LEO, namely: the first harmonic of the geopotential field J2 (representing the attractor oblateness), the atmospheric drag, and the higher order harmonics of the geopotential field.

At least the most significant of these perturbations need to be taken into account when propagating LEO orbits, and therefore the methods for purely Keplerian motion are not enough. As seen above, poliastro already implements a number of these perturbations; however, numerical methods are much slower than analytical ones, and this can render them unsuitable for large scale simulations, satellite conjunction assessment, propagation on constrained hardware, and so forth.

To address this issue, semianalytical propagation methods were devised that attempt to strike a balance between the fast running times of analytical methods and the necessary inclusion of perturbation forces. One such family of semianalytical methods is the Simplified General Perturbation (SGP) models, first developed in [HK66] and then refined in [LC69] into what we know these days as the SGP4 propagator [HR80] [VCHK06]. Even though certain elements of the reference frame used by SGP4 are not properly specified [VCHK06], and its accuracy might still be too limited for certain applications [Ko09] [Lar16], it is nowadays the most widely used propagation method, thanks in large part to the dissemination of General Perturbations orbital data by the US 501(c)(3) CelesTrak (which itself obtains it from the 18th Space Defense Squadron of the US Space Force).

The starting point of SGP4 is a special element set that uses Brouwer mean orbital elements [Bro59] plus a ballistic coefficient based on an approximation of the atmospheric drag [LC69], and its results are expressed in a special coordinate system called True Equator Mean Equinox (TEME). Special care needs to be taken to avoid mixing mean elements with osculating elements, and to convert the output of the propagation to the appropriate reference frame.

These element sets have been traditionally distributed in a compact text representation called Two-Line Element sets (TLEs); see figure 7 for an example. However, this format is quite cryptic and suffers from a number of shortcomings, so recently there has been a push to use the Orbit Data Messages international standard developed by the Consultative Committee for Space Data Systems (CCSDS 502.0-B-2).

1 25544U 98067A   22156.15037205  .00008547  00000+0  15823-3 0  9994
2 25544  51.6449  36.2070 0004577 196.3587 298.4146 15.49876730343319

[Fig. 7: Two-Line Element set (TLE) for the ISS (retrieved on 2022-06-05)]

At the moment, general perturbations data in both OMM and TLE format can be integrated with poliastro thanks to the sgp4 Python library and the Ephem class as follows:

from warnings import warn

from astropy import units as u
from astropy.coordinates import (
    TEME,
    GCRS,
    CartesianDifferential,
    CartesianRepresentation,
)

from poliastro.ephem import Ephem
from poliastro.frames import Planes


def ephem_from_gp(sat, times):
    errors, rs, vs = sat.sgp4_array(times.jd1, times.jd2)
    if not (errors == 0).all():
        warn(
            "Some objects could not be propagated, "
            "proceeding with the rest",
            stacklevel=2,
        )
        rs = rs[errors == 0]
        vs = vs[errors == 0]
        times = times[errors == 0]

    cart_teme = CartesianRepresentation(
        rs << u.km,
        xyz_axis=-1,
        differentials=CartesianDifferential(
            vs << (u.km / u.s),
            xyz_axis=-1,
        ),
    )
    cart_gcrs = (
        TEME(cart_teme, obstime=times)
        .transform_to(GCRS(obstime=times))
        .cartesian
    )

    return Ephem(
        cart_gcrs,
        times,
        plane=Planes.EARTH_EQUATOR,
    )
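A possible way to exercise this helper is shown below, reusing the ISS TLE from figure 7 together with the sgp4 library's Satrec class; the epoch span is an illustrative choice and not part of the paper's example.

from sgp4.api import Satrec

from poliastro.util import time_range

# The two TLE lines for the ISS shown in figure 7
line1 = "1 25544U 98067A   22156.15037205  .00008547  00000+0  15823-3 0  9994"
line2 = "2 25544  51.6449  36.2070 0004577 196.3587 298.4146 15.49876730343319"

sat = Satrec.twoline2rv(line1, line2)

# Epochs at which to sample the resulting ephemerides (illustrative span)
times = time_range("2022-06-05", end="2022-06-06", periods=100)

iss_ephem = ephem_from_gp(sat, times)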
However, no native integration with SGP4 has been implemented yet in poliastro, for technical and non-technical reasons. On one hand, this propagator is too different from the other methods, and we have not yet devised how to add it to the library in a way that does not create confusion. On the other hand, adding such a propagator to poliastro would probably open the flood gates of corporate users of the library, and we would like to first devise a sustainability strategy for the project, which is addressed in the next section.

Future work

Despite the fact that poliastro has existed for almost a decade, for most of its history it has been developed by volunteers in their free time, and only in the past five years has it received funding through various Summer of Code programs (SOCIS 2017, GSOC 2018-2021) and institutional grants (NumFOCUS 2020, 2021). The funded work has had an overwhelmingly positive impact on the project; however, the lack of a dedicated maintainer has caused some technical debt to accrue over the years, and some parts of the project are in need of refactoring or better documentation.

Historically, poliastro has tried to implement algorithms that are applicable to all the planets in the Solar System; however, some of them have proved very difficult to generalize for bodies other than the Earth. For cases like these, poliastro ships a poliastro.earth package, but going forward we would like to continue embracing a generic approach that can serve other bodies as well.

Several open source projects have successfully used poliastro or were created taking inspiration from it, like spacetech-ssa by IBM¹ or mubody [BBVPFSC22]. AGI (previously Analytical Graphics, Inc., now Ansys Government Initiatives) published a series of scripts to automate the commercial tool STK from Python leveraging poliastro². However, we have observed that there is still lots of repeated code across similar open source libraries written in Python, which means that there is an opportunity to provide a "kernel" of algorithms that can be easily reused. Although poliastro.core started as a separate layer to isolate fast, non-safe functions as described above, we think we could move it to an external package so it can be depended upon by projects that do not want to use some of the higher level poliastro abstractions or drag in its large number of heavy dependencies.

Finally, the sustainability of the project cannot yet be taken for granted: the project has reached a level of complexity that already warrants dedicated development effort that cannot be covered with short-lived grants. Such funding could potentially come from the private sector, but although there is evidence that several for-profit companies are using poliastro, we have very little information on how it is being used and what problems those users are having, let alone what avenues for funded work could potentially work. Organizations like the Libre Space Foundation advocate for a strong copyleft licensing model to convince commercial actors to contribute to the commons, but in principle that goes against the permissive licensing that the wider Scientific Python ecosystem, including poliastro, has adopted. With the advent of new business models and the ever increasing reliance on open source by the private sector, a variety of ways to engage commercial users and include them in the conversation exist. However, these have not been explored yet.

Acknowledgements

The authors would like to thank Prof. Michèle Lavagna for her original guidance and inspiration, David A. Vallado for his encouragement and for publishing the source code for the algorithms from his book for free, Dr. T.S. Kelso for his tireless efforts in maintaining CelesTrak, Alejandro Sáez for sharing the dream of a better way, Prof. Dr. Manuel Sanjurjo Rivo for believing in my work, Helge Eichhorn for his enthusiasm and decisive influence on poliastro, the whole OpenAstronomy collaboration for opening the door for us, the NumFOCUS organization for their immense support, and Alexandra Elbakyan for enabling scientific progress worldwide.

1. https://github.com/IBM/spacetech-ssa
2. https://github.com/AnalyticalGraphicsInc/STKCodeExamples/

REFERENCES

[AAA+76] United States Committee on Extension to the Standard Atmosphere, United States National Aeronautics and Space Administration, United States National Oceanic and Atmospheric Administration, and United States Air Force. U.S. Standard Atmosphere, 1976. NOAA - SIT 76-1562. National Oceanic and Atmospheric Administration, 1976.
[AAAA62] United States Committee on Extension to the Standard Atmosphere, United States National Aeronautics and Space Administration, and United States Environmental Science Services Administration. U.S. Standard Atmosphere, 1962. U.S. Government Printing Office, 1962.
[Bat99] Richard H. Battin. An Introduction to the Mathematics and Methods of Astrodynamics, Revised Edition. AIAA, Reston, VA, 1999. doi:10.2514/4.861543.
[BBC+11] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(2):31-39, 2011. doi:10.1109/MCSE.2010.118.
[BBVPFSC22] Juan Bermejo Ballesteros, José María Vergara Pérez, Alejandro Fernández Soler, and Javier Cubas. Mubody, an astrodynamics open-source Python library focused on libration points. Barcelona, Spain, 2022. https://sseasymposium.org/wp-content/uploads/2022/04/4thSSEA_AllAbstracts.pdf.
[BEKS17] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A Fresh Approach to Numerical Computing. SIAM Review, 59(1):65-98, 2017. doi:10.1137/141000671.
[Bro59] Dirk Brouwer. Solution of the problem of artificial satellite theory without drag. The Astronomical Journal, 64:378, 1959. doi:10.1086/107958.
[Bur67] E. G. C. Burt. On space manoeuvres with continuous thrust. Planetary and Space Science, 15(1):103-122, 1967. doi:10.1016/0032-0633(67)90070-0.
[CC10] Philip Herbert Cowell and Andrew Claude Crommelin. Investigation of the Motion of Halley's Comet from 1759 to 1910. Neill & Company, 1910.
[Cha22] Kevin Charls. Recursive solution to Kepler's problem for elliptical orbits - application in robust Newton-Raphson and co-planar closest approach estimation. Unpublished, version 1, 2022. doi:10.13140/RG.2.2.18578.58563/1.
[Con14] Bruce A. Conway. Spacecraft trajectory optimization. Cambridge Aerospace Series 29. Cambridge University Press, 2014.
[CR17] Juan Luis Cano Rodríguez. Study of analytical solutions for low-thrust trajectories. Master's thesis, Universidad Politécnica de Madrid, 2017.
[DB83] J. M. A. Danby and T. M. Burkardt. The solution of Kepler's equation, I. Celestial Mechanics, 31(2):95-107, 1983. doi:10.1007/BF01686811.
[DCV21] Marilena Di Carlo and Massimiliano Vasile. Analytical solutions for low-thrust orbit transfers. Celestial Mechanics and Dynamical Astronomy, 133(7):33, 2021. doi:10.1007/s10569-021-10033-9.
[Dro53] S. Drobot. On the foundations of Dimensional Analysis. Studia Mathematica, 14(1):84-99, 1953. doi:10.4064/sm-14-1-84-99.
[Dub73] G. N. Duboshin. Book Review: Samuel Herrick. Astrodynamics. Soviet Astronomy, 16:1064, 1973.
[Ede61] Theodore N. Edelbaum. Propulsion Requirements for Controllable Satellites. ARS Journal, 31(8):1079-1089, 1961. doi:10.2514/8.5723.
[FCM13] Davide Farnocchia, Davide Bracali Cioci, and Andrea Milani. Robust resolution of Kepler's equation in all eccentricity regimes. Celestial Mechanics and Dynamical Astronomy, 116(1):21-34, 2013. doi:10.1007/s10569-013-9476-9.
[Fin07] D. Finkleman. "TLE or Not TLE?" That is the Question (AAS 07-126). Advances in the Astronautical Sciences, 127(1):401, 2007.
[GBP22] Mirko Gabelica, Ružica Bojčić, and Livia Puljak. Many researchers were not compliant with their published data sharing statement: mixed-methods study. Journal of Clinical Epidemiology, 2022. doi:10.1016/j.jclinepi.2022.05.019.
[Her71] Samuel Herrick. Astrodynamics. Van Nostrand Reinhold Co, London, New York, 1971.
[HK66] C. G. Hilton and J. R. Kuhlman. Mathematical models for the space defense center. Philco-Ford Publication No. U-3871, 17:28, 1966.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, et al. Array programming with NumPy. Nature, 585(7825):357-362, 2020. doi:10.1038/s41586-020-2649-2.
[HNW09] E. Hairer, S. P. Nørsett, and Gerhard Wanner. Solving ordinary differential equations I: nonstiff problems. Springer Series in Computational Mathematics 8. Springer, Heidelberg, 2nd rev. ed., 2009.
[HR80] Felix R. Hoots and Ronald L. Roehrich. Models for propagation of NORAD element sets. Technical report, Defense Technical Information Center, Fort Belvoir, VA, 1980.
[Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90-95, 2007. doi:10.1109/MCSE.2007.55.
[IBD+20] Dario Izzo, Will Binns, et al. esa/pykep: Optimize, 2020. doi:10.5281/ZENODO.4091753.
[Inc15] Plotly Technologies Inc. Collaborative data science, 2015. https://plot.ly.
[Jac77] L. G. Jacchia. Thermospheric Temperature, Density, and Composition: New Models. SAO Special Report, 375, 1977.
[JGAZJT+18] Nathan J. Goldbaum, John A. ZuHone, Matthew J. Turk, Kacper Kowalik, and Anna L. Rosen. unyt: Handle, manipulate, and convert data with units in Python. Journal of Open Source Software, 3(28):809, 2018. doi:10.21105/joss.00809.
[Kec97] Jean Albert Kechichian. Reformulation of Edelbaum's Low-Thrust Transfer Problem Using Optimal Control Theory. Journal of Guidance, Control, and Dynamics, 20(5):988-994, 1997. doi:10.2514/2.4145.
[Ko09] T. S. Kelso et al. Analysis of the Iridium 33-Cosmos 2251 collision. Advances in the Astronautical Sciences, 135(2):1099-1112, 2009.
[KRKP+16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, et al. Jupyter Notebooks: a publishing format for reproducible computational workflows. 2016.
[Lar16] Martin Lara. Analytical and Semianalytical Propagation of Space Orbits: The Role of Polar-Nodal Variables. In Astrodynamics Network AstroNet-II, Astrophysics and Space Science Proceedings 44, pages 151-166. Springer, 2016. doi:10.1007/978-3-319-23986-6_11.
[LC69] M. H. Lane and K. Cranford. An improved analytical drag theory for the artificial satellite problem. In Astrodynamics Conference, Princeton, NJ, 1969. doi:10.2514/6.1969-925.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (LLVM '15), pages 1-6, Austin, Texas, 2015. doi:10.1145/2833157.2833162.
[Mar95] F. Landis Markley. Kepler Equation solver. Celestial Mechanics & Dynamical Astronomy, 63(1):101-111, 1995. doi:10.1007/BF00691917.
[Mik87] Seppo Mikkola. A cubic approximation for Kepler's equation. Celestial Mechanics, 40(3-4):329-334, 1987. doi:10.1007/BF01235850.
[MKDVB+19] Michael Mommert, Michael Kelley, Miguel De Val-Borro, et al. sbpy: A Python module for small-body planetary astronomy. Journal of Open Source Software, 4(38):1426, 2019. doi:10.21105/joss.01426.
[noa] Astrodynamics.jl. https://github.com/JuliaSpace/Astrodynamics.jl.
[noa18] gpredict, 2018. https://github.com/csete/gpredict/releases/tag/v2.2.1.
[noa20] GMAT, 2020. https://sourceforge.net/projects/gmat/files/GMAT/GMAT-R2020a/.
[noa21a] nyx, 2021. https://gitlab.com/nyx-space/nyx/-/tags/1.0.0.
[noa21b] SatNOGS, 2021. https://gitlab.com/librespacefoundation/satnogs/satnogs-client/-/tags/1.7.
[noa22a] beyond, 2022. https://pypi.org/project/beyond/0.7.4/.
[noa22b] celestlab, 2022. https://atoms.scilab.org/toolboxes/celestlab/3.4.1.
[noa22c] Orekit, 2022. https://gitlab.orekit.org/orekit/orekit/-/releases/11.2.
[noa22d] SPICE, 2022. https://naif.jpl.nasa.gov/naif/toolkit.html.
[noa22e] tudatpy, 2022. https://github.com/tudat-team/tudatpy/releases/tag/0.6.0.
[OG86] A. W. Odell and R. H. Gooding. Procedures for solving Kepler's equation. Celestial Mechanics, 38(4):307-334, 1986. doi:10.1007/BF01238923.
[Pol97] James E. Pollard. Simplified approach for assessment of low-thrust elliptical orbit transfers. In 25th International Electric Propulsion Conference, Cleveland, OH, pages 97-160, 1997.
[Pol98] James Pollard. Evaluation of low-thrust orbital maneuvers. In 34th AIAA/ASME/SAE/ASEE Joint Propulsion Conference and Exhibit, Cleveland, OH, 1998. doi:10.2514/6.1998-3486.
[Pol00] J. E. Pollard. Simplified analysis of low-thrust orbital maneuvers. Technical report, Defense Technical Information Center, Fort Belvoir, VA, 2000.
[PP13] Adonis Reinier Pimienta-Penalver. Accurate Kepler equation solver without transcendental function evaluations. State University of New York at Buffalo, 2013.
[Rho20] Brandon Rhodes. Skyfield: Generate high precision research-grade positions for stars, planets, moons, and Earth satellites, 2020.
[SSM18] Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11):2584-2589, 2018. doi:10.1073/pnas.1708290115.
[TPWS+18] The Astropy Collaboration, A. M. Price-Whelan, B. M. Sipőcz, et al. The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package. The Astronomical Journal, 156(3):123, 2018. doi:10.3847/1538-3881/aabc4f.
[VCHK06] David Vallado, Paul Crawford, Richard Hujsak, and T. S. Kelso. Revisiting Spacetrack Report #3. In AIAA/AAS Astrodynamics Specialist Conference and Exhibit, Keystone, Colorado, 2006. doi:10.2514/6.2006-6753.
[VGO+20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261-272, 2020. doi:10.1038/s41592-019-0686-2.
[VM07] David A. Vallado and Wayne D. McClain. Fundamentals of astrodynamics and applications. Space Technology Library 21. Microcosm Press, Hawthorne, CA, 3rd ed., 2007.
[WAB+14] Greg Wilson, D. A. Aruliah, C. Titus Brown, et al. Best Practices for Scientific Computing. PLoS Biology, 12(1):e1001745, 2014. doi:10.1371/journal.pbio.1001745.
[WIO85] M. J. H. Walker, B. Ireland, and Joyce Owens. A set of modified equinoctial orbit elements. Celestial Mechanics, 36(4):409-419, 1985. doi:10.1007/BF01227493.

A New Python API for Webots Robotics Simulations

Justin C. Fisher

* Corresponding author: fisher@smu.edu
‡ Southern Methodist University, Department of Philosophy

Copyright © 2022 Justin C. Fisher. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Webots is a popular open-source package for 3D robotics simulations. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching or games. Webots has provided a simple API allowing Python programs to control robots and/or the simulated world, but this API is inefficient and does not provide many "pythonic" conveniences. A new Python API for Webots is presented that is more efficient and provides a more intuitive, easily usable, and "pythonic" interface.

Index Terms—Webots, Python, Robotics, Robot Operating System (ROS), Open Dynamics Engine (ODE), 3D Physics Simulation

1. Introduction

Webots is a popular open-source package for 3D robotics simulations [Mic01], [Webots]. It can also be used as a 3D interactive environment for other physics-based modeling, virtual reality, teaching or games. Webots uses the Open Dynamics Engine [ODE], which allows physical simulations of Newtonian bodies, collisions, joints, springs, friction, and fluid dynamics. Webots provides the means to simulate a wide variety of robot components, including motors, actuators, wheels, treads, grippers, light sensors, ultrasound sensors, pressure sensors, range finders, radar, lidar, and cameras (with many of these sensors drawing their inputs from GPU processing of the simulation). A typical simulation will involve one or more robots, each with somewhere between 3 and 30 moving parts (though more would be possible), each running its own controller program to process information taken in by its sensors to determine what control signals to send to its devices. A simulated world typically involves a ground surface (which may be a sloping polygon mesh) and dozens of walls, obstacles, and/or other objects, which may be stationary or moving in the physics simulation.

Webots has historically provided a simple Python API, allowing Python programs to control individual robots or the simulated world. This Python API is a thin wrapper over a C++ API, which itself is a wrapper over Webots' core C API. These nested layers of API-wrapping are inefficient. Furthermore, this API is not very "pythonic" and did not provide many of the conveniences that help to make development in Python fast, intuitive, and easy to learn. This paper presents a new Python API [NewAPI01] that more efficiently interfaces directly with the Webots C API and provides a more intuitive, easily usable, and "pythonic" interface for controlling Webots robots and simulations.

In qualitative terms, the old API feels like one is awkwardly using Python to call C and C++ functions, whereas the new API feels much simpler, much easier, and like it is fully intended for Python. Here is a representative (but far from comprehensive) list of examples:

• Unlike the old API, the new API contains helpful Python type annotations and docstrings.
• Webots employs many vectors, e.g., for 3D positions, 4D rotations, and RGB colors. The old API typically treats these as lists or integers (24-bit colors). In the new API these are Vector objects, with conveniently addressable components (e.g. vector.x or color.red), convenient helper methods like vector.magnitude and vector.unit_vector, and overloaded vector arithmetic operations, akin to (and interoperable with) NumPy arrays.
• The new API also provides easy interfacing between high-resolution Webots sensors (like cameras and Lidar) and NumPy arrays, to make it much more convenient to use Webots with popular Python packages like NumPy [NumPy], [Har01], Scipy [Scipy], [Vir01], PIL/PILLOW [PIL] or OpenCV [OpenCV], [Brad01]. For example, converting a Webots camera image to a NumPy array is now as simple as camera.array, and this now allows the array to share memory with the camera, making this extremely fast regardless of image size (a short sketch of this contrast follows this list).
• The old API often requires that all function parameters be given explicitly in every call, whereas the new API gives many parameters commonly used default values, allowing them often to be omitted, and keyword arguments to be used where needed.
• Most attributes are now accessible (and alterable, when applicable) by pythonic properties like motor.velocity.
• Many devices now have Python methods like __bool__ overloaded in intuitive ways. E.g., you can now use if bumper to detect if a bumper has been pressed, rather than the old if bumper.getValue().
• Pythonic container-like interfaces are now provided. You may now use for target in radar to iterate through the various targets a radar device has detected, or for packet in receiver to iterate through communication packets that a receiver device has received (and it now automatically handles a wide variety of Python objects, not just strings).
• The old API requires supervisor controllers to use a wide variety of separate functions to traverse and interact with the simulation's scene tree, including different functions for different VRML datatypes (like SFVec3f or MFInt32). The new API automatically handles these datatypes and translates intuitive Python syntax (like dot-notation and square-bracket indexing) to the Webots equivalents. E.g., you can now move a particular crate 1 meter in the x direction using a command like world.CRATES[3].translation += [1,0,0]. Under the old API, this would require numerous function calls (calling getNodeFromDef to find the CRATES node, getMFNode to find the child with index 3, getSFField to find its translation field, and getSFVec3f to retrieve that field's value, then some list manipulation to alter the x-component of that value, and finally a call to setSFVec3f to set the new value).
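To illustrate the camera interfacing point above, here is a sketch of the old-API conversion that camera.array replaces. The device name, the sampling period, and the 4-channel (BGRA) reshape reflect typical Webots usage rather than code taken from this paper.

from controller import Robot
import numpy as np

# Old API: obtain the device, enable it, then copy and reshape the raw bytes.
robot = Robot()
camera = robot.getDevice("camera")   # "camera" is an assumed device name
camera.enable(32)                    # sampling period in milliseconds

image = np.frombuffer(camera.getImage(), dtype=np.uint8).reshape(
    (camera.getHeight(), camera.getWidth(), 4)
)

# Under the new API the same array is simply camera.array, and it shares
# memory with the camera rather than copying it.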
As another example illustrating how much easier the new API is to use, here are two lines from Webots' sample supervisor_draw_trail, as they would appear in the old Python API:

f = supervisor.getField(supervisor.getRoot(), "children")
f.importMFNodeFromString(-1, trail_plan)

And here is how that looks written using the new API:

world.children.append(trail_plan)

The new API is mostly backwards-compatible with the old Python Webots API, and provides an option to display deprecation warnings with helpful advice for changing to the new API.

The new Python API is planned for inclusion in an upcoming Webots release, to replace the old one. In the meantime, an early-access version is available, distributed under the Apache 2.0 licence, the same permissive open-source license that Webots is distributed under.

In what follows, the history and motivation for this new API are discussed, including its use in teaching an interdisciplinary undergraduate Cognitive Science course called Minds, Brains and Robotics. Some of the design decisions for the new API are discussed, which will not only aid in understanding it, but also have broader relevance to parallel dilemmas that face many other software developers. And some metrics are given to quantify how the new API has improved over the old.

2. History and Motivation.

Much of this new API was developed by the author in the course of teaching an interdisciplinary Southern Methodist University undergraduate Cognitive Science course entitled Minds, Brains and Robotics (PHIL 3316). Before the Covid pandemic, this course had involved lab activities where students build and program physical robots. The pandemic forced these activities to become virtual. Fortunately, Webots simulations actually have many advantages over physical robots, including not requiring any specialized hardware (beyond a decent personal computer), making much more interesting uses of altitude rather than having the robots confined to a safely flat surface, allowing robots to engage in dangerous or destructive activities that would be risky or expensive with physical hardware, allowing a much broader array of sensors including high-resolution cameras, and enabling full-fledged neural network and computational vision simulations. For example, an early activity in this class involves building Braitenberg-style vehicles [Bra01] that use light sensors and cameras to detect a lamp carried by a hovering drone, as well as ultrasound and touch sensors to detect obstacles. Using these sensors, the robots navigate towards the lamp in a cluttered playground sandbox that includes sloping sand, an exterior wall, and various obstacles including a puddle of water and platforms from which robots may fall.

This interdisciplinary class draws students with diverse backgrounds and programming skills. Accommodating those with fewer skills required simplifying many of the complexities of the old Webots API. It also required setting up tools to use Webots "supervisor" powers to help manipulate the simulated world, e.g. to provide students easier customization options for their robots. The old Webots API makes the use of such supervisor powers tedious and difficult, even for experienced coders, so this practically required developing new tools to streamline the process. These factors led to the development of an interface that would be much easier for novice students to adapt to, and that would make it much easier for an experienced programmer to make extensive use of supervisor powers to manipulate the simulated world. Discussion of this with the core Webots development team then led to the decision to incorporate these improvements into Webots, where they can be of benefit to a much broader community.
3. Design Decisions.

This section discusses some design decisions that arose in developing this API, and the factors that drove these decisions. This may help give the reader a better understanding of this API, and also of relevant considerations that would arise in many other development scenarios.

3.1. Shifting from functions to properties.

The old Python API for Webots consists largely of methods like motor.getVelocity() and motor.setVelocity(new_velocity). In the new API these have quite uniformly been changed to Python properties, so these purposes are now accomplished with motor.velocity and motor.velocity = new_velocity.

Reduction of wordiness and punctuation helps to make programs easier to read and to understand, and it reduces the cognitive load on coders. However, there are also drawbacks.

One drawback is that properties can give the mistaken impression that some attributes are computationally cheap to get or set. In cases where this impression would be misleading, more traditional method calls were retained and/or the comparative expense of the operation was clearly documented.

Two other drawbacks are related. One is that inviting ordinary users to assign properties to API objects might lead them to assign other attributes that could cause problems. Since Python lacks true privacy protections, it has always faced this sort of worry, but this worry becomes even worse when users start to feel familiar moving beyond just using defined methods to interact with an object.

Relatedly, Python debugging provides direct feedback in cases where a user misspells motor.setFoo(v), but not when someone misspells motor.foo = v. If a user inadvertently types motor.setFool(v) they will get an AttributeError noting that motor lacks a setFool attribute. But if a user inadvertently types motor.fool = v, then Python will silently create a new .fool attribute for motor, and the user will often have no idea what has gone wrong.
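A minimal plain-Python illustration of this asymmetry; the Motor class here is a stand-in used only for the demonstration, not Webots code.

class Motor:
    """Stand-in for a Webots motor device (not the real class)."""

    def setVelocity(self, v):
        self.velocity = v

motor = Motor()

try:
    motor.setFool(3.0)   # misspelled method name: Python complains immediately
except AttributeError as error:
    print(error)         # 'Motor' object has no attribute 'setFool'

motor.fool = 3.0         # misspelled attribute name: silently creates .fool, no error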
However, there are also drawbacks. the new API has improved over the old. One drawback is that properties can give the mistaken impres- sion that some attributes are computationally cheap to get or set. In cases where this impression would be misleading, more traditional 2. History and Motivation. method calls were retained and/or the comparative expense of the Much of this new API was developed by the author in the operation was clearly documented. course of teaching an interdisciplinary Southern Methodist Uni- Two other drawbacks are related. One is that inviting ordinary versity undergraduate Cognitive Science course entitled Minds, users to assign properties to API objects might lead them to assign Brains and Robotics (PHIL 3316). Before the Covid pandemic, other attributes that could cause problems. Since Python lacks this course had involved lab activities where students build and true privacy protections, it has always faced this sort of worry, but program physical robots. The pandemic forced these activities this worry becomes even worse when users start to feel familiar to become virtual. Fortunately, Webots simulations actually have moving beyond just using defined methods to interact with an many advantages over physical robots, including not requiring object. any specialized hardware (beyond a decent personal computer), Relatedly, Python debugging provides direct feedback in making much more interesting uses of altitude rather than having cases where a user misspells motor.setFoo(v) but not when the robots confined to a safely flat surface, allowing robots someone mispells ’motor.foo = v‘. If a user inadvertently types to engage in dangerous or destructive activities that would be motor.setFool(v) they will get an AttributeError risky or expensive with physical hardware, allowing a much noting that motor lacks a setFool attribute. But if a user broader array of sensors including high-resolution cameras, and inadvertently types motor.fool = v, then Python will silently enabling full-fledged neural network and computational vision create a new .fool attribute for motor and the user will often simulations. For example, an early activity in this class involves have no idea what has gone wrong. building Braitenburg-style vehicles [Bra01] that use light sensors These two drawbacks both involve users setting an attribute and cameras to detect a lamp carried by a hovering drone, as they shouldn’t: either an attribute that has another purpose, or one A NEW PYTHON API FOR WEBOTS ROBOTICS SIMULATIONS 149 that doesn’t. Defenses against the first include "hiding" important in the simulated world (presuming that the controller has such attributes behind a leading "_", or protecting them with a Python permissions, of course). In many use cases, supervisor robots don’t property, which can also help provide useful doc-strings. Unfor- actually have bodies and devices of their own, and just use their tunately it’s much harder to protect against misspellings in this supervisor powers incorporeally, so all they will need is world. piece-meal fashion. In the case where a robot’s controller wants to exert both forms This led to the decision to have robot devices like motors of control, it can import both robot to control its own body, and and cameras employ a blanket __setattr__ that will generate world to control the rest of the world. warnings if non-property attributes of devices are set from outside This distinction helps to make things more intuitively clear. the module. 
An alternative approach, suggested by Matthew Feickert, would have been to use __slots__ rather than an ordinary __dict__ to store device attributes, which would also have the effect of raising an error if users attempt to modify unexpected attributes. Not having a __dict__ can make it harder to do some things like cached properties and multiple inheritance. But in cases where such issues don't arise or can be worked around, readers facing similar challenges may find __slots__ to be a preferable solution.

3.2 Backwards Compatibility.

The new API offers many new ways of doing things, many of which would seem "better" by most metrics, with the main drawback being just that they differ from the old ways. The possibility of making a clean break from the old API was considered, but that would stop old code from working, alienate veteran users, and risk causing a schism akin to the deep one that arose between the Python 2 and Python 3 communities when Python 3 opted against backwards compatibility.

Another option would have been to refrain from adding a "new-and-better" feature to avoid introducing redundancies or backward incompatibilities. But that has obvious drawbacks too.

Instead, a compromise was typically adopted: to provide both the "new-and-better" way and the "worse-old" way. This redundancy was eased by shifting from getFoo / setFoo methods to properties, and from CamelCase to pythonic snake_case, which reduced the number of name collisions between old and new. Employing the "worse-old" way leads to a deprecation warning that includes helpful advice regarding shifting to the "new-and-better" way of doing things. This may help users to transition more gradually to the new ways, or they can shut these warnings off to help preserve good will, and hopefully avoid a schism.

3.3 Separating robot and world.

In Webots there is a distinction between "ordinary robots", whose capabilities are generally limited to using the robot's own devices, and "supervisor robots", who share those capabilities but also have virtual omniscience and omnipotence over most aspects of the simulated world. In the old API, supervisor controller programs import a Supervisor subclass of Robot, but typically still call this unusually powerful robot robot, which has led to many confusions.

In the new API these two sorts of powers are strictly separated. Importing robot provides an object that can be used to control the devices in the robot itself. Importing world provides an object that can be used to observe and enact changes anywhere in the simulated world (presuming that the controller has such permissions, of course). In many use cases, supervisor robots don't actually have bodies and devices of their own, and just use their supervisor powers incorporeally, so all they will need is world. In the case where a robot's controller wants to exert both forms of control, it can import both robot to control its own body, and world to control the rest of the world.

This distinction helps to make things more intuitively clear. It also frees world from having all the properties and methods that robot has, which in turn reduces the risk of name-collisions as world takes on the role of serving as the root of the proxy scene tree. In the new API, world.children refers to the children field of the root of the scene tree, which contains (almost) all of the simulated world, world.WorldInfo refers to one of these children, a WorldInfo node, and world.ROBOT2 dynamically returns a node within the world whose Webots DEF-name is "ROBOT2". These uses of world would have been much less intuitive if users thought of world as being a special sort of robot, rather than as being their handle on controlling the simulated world. Other sorts of supervisor functionality are also very intuitively associated with world, like world.save(filename) to save the state of the simulated world, or world.mode = 'PAUSE'.

Having world attributes dynamically fetch nodes and fields from the scene tree did come with some drawbacks. There is a risk of name-collisions, though these are rare since Webots field-names are known in advance, and nodes are typically sought by ALL-CAPS DEF-names, which won't collide with world's lower-case and MixedCase attributes. Linters like MyPy and PyCharm also cannot anticipate such dynamic references, which is unfortunate, but does not stop such dynamic references from being extremely useful.
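Gathering the world interactions mentioned in this section into one place, a short sketch follows. The bare import world form, the CRATES DEF-name, and the save/mode calls are modeled on the examples in the text, but the snippet is illustrative rather than a verbatim controller, and the file name is a placeholder.

import world   # new-API supervisor handle on the simulated world

# Dynamic scene-tree access by DEF-name and field name
world.CRATES[3].translation += [1, 0, 0]   # move a crate 1 meter in the x direction

# Other supervisor functionality associated with world
world.save("backup.wbt")                   # save the state of the simulated world
world.mode = 'PAUSE'                       # pause the simulation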
4. Readability Metrics

A main advantage of the new API is that it allows Webots controllers to be written in a manner that is easier for coders to read, write, and understand. Qualitatively, this difference becomes quite apparent upon a cursory inspection of examples like the one given in section 1. As another representative example, here are three lines from Webots' included supervisor_draw_trail sample as they would appear in the old Python API:

trail_node = world.getFromDef("TRAIL")
point_field = trail_node.getField("coord")\
    .getSFNode()\
    .getField("point")
index_field = trail_node.getField("coordIndex")

And here is their equivalent in the new API:

point_field = world.TRAIL.coord.point
index_field = world.TRAIL.coordIndex

Brief inspection should reveal that the latter code is much easier to read, write and understand, not just because it is shorter, but also because its punctuation is limited to standard Python syntax for traversing attributes of objects, because it reduces the need to introduce new variables like trail_node for things that it already makes easy to reference (via world.TRAIL, which the new API automatically caches for fast repeat reference), and because it invisibly handles selecting appropriate C-API functions like getField and getSFNode, saving the user from needing to learn and remember all these functions (of which there are many).

This intuitive impression is confirmed by automated metrics for code readability. The measures in what follows consider the full supervisor_draw_trail sample controller (from which the above snippet was drawn), since this is the Webots sample controller that makes the most sustained use of supervisor functionality to perform a fairly plausible supervisor task (maintaining the position of a streamer that trails behind the robot). Webots provides this sample controller in C [SDTC], but it was re-implemented using both the old Python API and the new Python API [Metrics], maintaining straightforward correspondence between the two, with the only differences being directly due to the differences in the APIs.

Some raw measures for the two controllers are shown in Table 1. These were gathered using the Radon code-analysis tools [Radon]. (These metrics, as well as those below, may be reproduced by (1) installing Radon [Radon], (2) downloading the source files to compare and the script for computing metrics [Metrics], (3) ensuring that the path at the top of the script refers to the local location of the source files to be compared, and (4) running this script.) Multiple metrics are reported because theorists disagree about which are most relevant in assessing code readability, because some of these play a role in computing other metrics discussed below, and because this may help to allay potential worries that a few favorable metrics might have been cherry-picked. This paper provides some explanation of these metrics and of their potential significance, while remaining neutral regarding which, if any, of these metrics is best.
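For readers who would rather query Radon directly from Python than run the paper's script, a sketch using Radon's own API follows; the file name is a placeholder, and the exact shape of Radon's return objects varies somewhat between versions.

from pathlib import Path

from radon.complexity import cc_visit
from radon.metrics import h_visit, mi_visit
from radon.raw import analyze

# Path to one of the two re-implemented controllers (placeholder file name)
source = Path("supervisor_draw_trail_new_api.py").read_text()

print(analyze(source))                 # raw measures: LOC, SLOC, LLOC, comments, ...
print(sum(block.complexity             # total cyclomatic complexity over functions/classes
          for block in cc_visit(source)))
print(h_visit(source))                 # Halstead vocabulary, length, volume, ...
print(mi_visit(source, multi=True))    # Maintainability Index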
The "lines of code" measures reflect that the new API makes it easier to do more things with less code. The measures differ in how they count blank lines, comments, multi-line statements, and multi-statement lines like if p: q(). Line counts can be misleading, especially when the code with fewer lines has longer lines, though upcoming measures will show that that is not the case here.

Cyclomatic Complexity counts the number of potential branching points that appear within the code, like if, while and for [McC01]. Cyclomatic Complexity is strongly correlated with other plausible measures of code readability involving indentation structure [Hin01]. The new API's score is lower/"better" due to its automatically converting vector-like values to the format needed for importing new nodes into the Webots simulation, and due to its automatic caching allowing a simpler loop to remove unwanted nodes. By Radon's reckoning this difference in complexity already gives the old API a "B" grade, as compared to the new API's "A". These complexity measures would surely rise in more complex controllers employed in larger simulations, but they would rise less quickly under the new API, since it provides many simpler ways of doing things, and they need never do any worse, since the new API provides backwards-compatible options.

TABLE 1: Length and Complexity Metrics. Raw measures for supervisor_draw_trail as it would be written with the new Python API for Webots or the old Python API for Webots. The "lines of code" measures differ with respect to how they count blank lines, comments, and lines that combine multiple commands. Cyclomatic complexity measures the number of potential branching points in the code.

Metric                                     New API   Old API
Lines of Code (with blanks, comments)      43        49
Source Lines of Code (without those)       29        35
Logical Lines of Code (single commands)    27        38
Cyclomatic Complexity                      5 (A)     8 (B)

Another collection of classic measures of code readability was developed by Halstead [Hal01]. These measures (especially volume) have been shown to correlate with human assessments of code readability [Bus01], [Pos01]. They generally penalize a program for using a "vocabulary" involving more operators and operands. Table 2 shows these metrics, as computed by Radon. (Again all measures are reported, while remaining neutral about which are most significant.) The new API scores significantly lower/"better" on these metrics, due in large part to its automatically selecting among many different C-API calls without these needing to appear in the user's code. E.g., having motor.velocity as a unified property involves fewer unique names than having users write both setVelocity() and getVelocity(), and often forming a third local velocity variable. And having world.children[-1] access the last child of that field in the simulation saves having to count getField and getMFNode in the vocabulary, and often also saves forming additional local variables for nodes or fields gotten in this way. Both of these factors also help the new API to greatly reduce parentheses counts.

TABLE 2: Halstead Metrics. Halstead metrics for supervisor_draw_trail as it would be written with the new and old Python APIs for Webots. Lower numbers are commonly construed as being better.

Halstead Metric                                     New API   Old API
Vocabulary = (n1) operators + (n2) operands         18        54
Length = (N1) operator + (N2) operand instances     38        99
Volume = Length * log2(Vocabulary)                  158       570
Difficulty = (n1 * N2) / (2 * n2)                   4.62      4.77
Effort = Difficulty * Volume                        731       2715
Time = Effort / 18                                  41        151
Bugs = Volume / 3000                                0.05      0.19
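Using the new-API counts reported in Table 2, the derived Halstead quantities can be reproduced directly from the formulas in the table's first column; small differences from the tabulated values arise from rounding of the inputs.

from math import log2

vocabulary, length, difficulty = 18, 38, 4.62   # new-API counts from Table 2

volume = length * log2(vocabulary)   # ~158
effort = difficulty * volume         # ~731 (within rounding)
time_required = effort / 18          # ~41 seconds
bugs = volume / 3000                 # ~0.05

print(volume, effort, time_required, bugs)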
Line counts can be (SEI and Radon) also provide credit for percentage of comment misleading, especially when the code with fewer lines has longer lines. (Both samples compared here include 5 comment lines, but lines, though upcoming measures will show that that is not the these compose a higher percentage of the new API’s shorter code). case here. Different versions of this measure weight and curve these factors Cyclomatic Complexity counts the number of potential somewhat differently, but since the new API outperforms the old branching points that appear within the code, like if, while and on each factor, all versions agree that it gets the higher/"better" for. [McC01] Cyclomatic Complexity is strongly correlated with score, as shown in Table 3. (These measures were computed based other plausible measures of code readability involving indentation on the input components as counted by Radon.) structure [Hin01]. The new API’s score is lower/"better" due to its There are potential concerns about each of these measures automatically converting vector-like values to the format needed of code readability, and one can easily imagine playing a form for importing new nodes into the Webots simulation, and due to of "code golf" to optimize some of these scores without actually its automatic caching allowing a simpler loop to remove unwanted improving readability (though it would be difficult to do this for all nodes. By Radon’s reckoning this difference in complexity already scores at once). Fortunately, most plausible measures of readabil- gives the old API a "B" grade, as compared to the new API’s "A". ity have been observed to be strongly correllated across ordinary These complexity measures would surely rise in more complex cases, [Pos01] so the clear and unanimous agreement between controllers employed in larger simulations, but they would rise less these measures is a strong confirmation that the new API is indeed A NEW PYTHON API FOR WEBOTS ROBOTICS SIMULATIONS 151 Maintainability Index version New API Old API [NewAPI01] https://github.com/Justin-Fisher/new_python_api_for_webots [NumPy] Numerical Python (NumPy). https://www.numpy.org Original [Oman01] 89 79 [ODE] Open Dynamics Engine. https://www.ode.org/ Software Engineering Institute 78 62 [Oman01] Oman, P and J Hagemeister. "Metrics for assessing a software Microsoft Visual Studio 52 46 system’s maintainability," Proceedings Conference on Software Maintenance, 337-44. 1992. doi: 10.1109/ICSM.1992.242525. Radon 82 75 [OpenCV] Open Source Computer Vision Library for Python. https:// github.com/opencv/opencv-python TABLE 3 [PIL] Python Imaging Library. https://python-pillow.org/ Maintainability Index Metrics. Maintainability Index metrics for [Pos01] Posnet, D, A Hindle and P Devanbu. "A simpler model of supervisor_draw_trail as it would be written with the new and old software readability." Proceedings of the 8th working conference versions of the Python API for Webots, according to different versions of the on mining software repositories, 73-82. 2011. Maintainability Index. Higher numbers are commonly construed as being better. [Radon] Radon. https://radon.readthedocs.io/en/latest/index.html [Sca01] Scalabrino, S, M Linares-Vasquez, R Oliveto and D Poshy- vanyk. "A Comprehensive Model for Code Readability." Jounal of Software: Evolution and Process, 1-29. 2017. doi: 10.1002/smr.1958. [Scipy] https://www.scipy.org more readable. 
Other plausible measures of readability would take [SDTC] https://cyberbotics.com/doc/guide/samples-howto#supervisor_ into account factors like whether the operands are ordinary English draw_trail-wbt words, [Sca01] or how deeply nested (or indented) the code ends [SDTNew] https://github.com/Justin-Fisher/new_python_api_for_webots/ blob/d180bcc7f505f8168246bee379f8067dfaf373ea/webots_ up being, [Hin01] both of which would also favor the new API. new_python_api_samples/controllers/supervisor_draw_trail_ So the mathematics confirm what was likely obvious from visual python/supervisor_draw_trail_new_api_bare_bones.py comparison of code samples above, that the new API is indeed [SDTOld] https://github.com/Justin-Fisher/new_python_api_for_webots/ more "readable" than the old. blob/d180bcc7f505f8168246bee379f8067dfaf373ea/webots_ new_python_api_samples/controllers/supervisor_draw_trail_ python/supervisor_draw_trail_old_api_bare_bones.py 5. Conclusions [Vir01] Virtanen, P, R. Gommers, T. Oliphant, et al. SciPy 1.0: Funda- A new Python API for Webots robotic simulations was presented. mental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-72. 2020. doi: 10.1038/s41592-019-0686-2. It more efficiently interfaces directly with the Webots C API and [Webots] Webots Open Source Robotic Simulator. https://cyberbotics. provides a more intuitive, easily usable, and "pythonic" interface com/ for controlling Webots robots and simulations. Motivations for the API and some of its design decisions were discussed, including decisions use python properties, to add new functionality along- side deprecated backwards compatibility, and to separate robot and supervisor/world functionality. Advantages of the new API were discussed and quantified using automated code readability metrics. More Information An early-access version of the new API and a variety of sam- ple programs and metric computations: https://github.com/Justin- Fisher/new_python_api_for_webots Lengthy discussion of the new API and its planned inclusion in Webots: https://github.com/cyberbotics/webots/pull/3801 Webots home page, including free download of Webots: https: //cyberbotics.com/ R EFERENCES [Brad01] Bradski, G. The OpenCV Library. Dr Dobb’s Journal of Soft- ware Tools. 2000. [Bra01] Braitenberg, V. Vehicles: Experiments in synthetic psychology. Cambridge, MA: MIT Press. 1984. [Bus01] Buse, R and W Weimer. Learning a metric for code readability. IEEE Transactions on Software Engineering, 36(4): 546-58. 2010. doi: 10.1109/TSE.2009.70. [Metrics] Fisher, J. Readability Metrics for a New Python API for Webots Robotics Simulations. 2022. doi: 10.5281/zenodo.6813819. [Hal01] Halstead, M. Elements of software science. Elsevier New York. 1977. [Har01] Harris, C., K. Millman, S. van der Walt, et al. Array pro- gramming with NumPy. Nature 585, 357–62. 2020. doi: 10.1038/s41586-020-2649-2. [Hin01] Hindle, A, MW Godfrey and RC Holt. "Reading beside the lines: Indentation as a proxy for complexity metric." Program Comprehension. The 16th IEEE International Conference, 133- 42. 2008. doi: 10.1109/icpc.2008.13. [McC01] McCabe, TJ. "A Complexity Measure" , 2(4): 308-320. 1976. [Mic01] Michel, O. "Webots: Professional Mobile Robot Simulation. Journal of Advanced Robotics Systems. 1(1): 39-42. 2004. doi: 10.5772/5618. 152 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling

Jyotika Singh (Placemakr). Corresponding author: singhjyotika811@gmail.com. Copyright © 2022 Jyotika Singh. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract: pyAudioProcessing is a Python based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models. MATLAB is a popular language of choice for a vast amount of research in the audio and speech processing domain. On the contrary, Python remains the language of choice for a vast majority of machine learning research and functionality. This library contains features built in Python that were originally published in MATLAB. pyAudioProcessing allows the user to compute various features from audio files including Gammatone Frequency Cepstral Coefficients (GFCC), Mel Frequency Cepstral Coefficients (MFCC), spectral features, chroma features, and others such as beat-based and cepstrum-based features from audio. One can use these features along with one's own classification backend or any of the popular scikit-learn classifiers that have been integrated into pyAudioProcessing. Cleaning functions to strip unwanted portions from the audio are another offering of the library. It further contains integrations with other audio functionalities such as frequency and time-series visualizations and audio format conversions. This software aims to provide machine learning engineers, data scientists, researchers, and students with a set of baseline models to classify audio. The library is available at https://github.com/jsingh811/pyAudioProcessing and is under the GPL-3.0 license.

Index Terms: pyAudioProcessing, audio processing, audio data, audio classification, audio feature extraction, gfcc, mfcc, spectral features, spectrogram, chroma
Introduction

The motivation behind this software is to make available complex audio features in Python for a variety of audio processing tasks. Python is a popular choice for machine learning tasks. Having solutions for computing complex audio features using Python enables easier and unified usage of Python for building machine learning algorithms on audio. This not only implies the need for resources to guide solutions for audio processing, but also signifies the need for Python guides and implementations to solve audio and speech cleaning, transformation, and classification tasks.

Different data processing techniques work well for different types of data. For example, in natural language processing, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued numerical vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning [Wik22b]. Word embeddings work great for many applications surrounding textual data [JS21]. However, passing numbers, an audio signal, or an image through a word embeddings generation method is not likely to return any meaningful numerical representation that can be used to train machine learning models. Different data types call for feature formation techniques specific to their domain rather than a one-size-fits-all. Such methods for audio signals are very specific to audio and speech signal processing, which is a domain of digital signal processing. Digital signal processing is a field of its own and is not feasible to master in an ad-hoc fashion. This calls for the need to have sought-after and useful processes for audio signals in a ready-to-use state for users.

There are two popular approaches for feature building in audio classification tasks:
1. Computing spectrograms from audio signals as images and using an image classification pipeline for the remainder.
2. Computing features from audio files directly as numerical vectors and applying them to a classification backend.
pyAudioProcessing includes the capability of computing spectrograms, but focuses most functionalities around the latter for building audio models. This tool contains implementations of various widely used audio feature extraction techniques, and integrates with popular scikit-learn classifiers including support vector machine (SVM), SVM with radial basis function kernel (RBF), random forest, logistic regression, k-nearest neighbors (k-NN), gradient boosting, and extra trees. Audio data can be cleaned, trained, tested, and classified using pyAudioProcessing [Sin21].

Some other useful libraries for the domain of audio processing include librosa [MRL+15], spafe [Mal20], essentia [BWG+13], pyAudioAnalysis [Gia15], and paid services from service providers such as Google (https://developers.google.com/learn/pathways/get-started-audio-classification).

The use of pyAudioProcessing in the community inspires the need and growth of this software. It is referenced in the textbook Artificial Intelligence with Python Cookbook, published by Packt Publishing in October 2020 [Auf20]. Additionally, pyAudioProcessing is part of a specific admissions requirement for a funded PhD project at the University of Portsmouth (https://www.port.ac.uk/study/postgraduate-research/research-degrees/phd/explore-our-projects/detection-of-emotional-states-from-speech-and-text). It is further referenced in the thesis "Master Thesis AI Methodologies for Processing Acoustic Signals / AI Usage for Processing Acoustic Signals" [Din21], in recent research on audio processing for assessing attention levels in Attention Deficit Hyperactivity Disorder (ADHD) students [BGSR21], and more. There are thus far 16000+ downloads via pip for pyAudioProcessing, with 1000+ downloads in the last month [PeP22]. As several different audio features need development, new issues are created on GitHub, and contributions to the code by the open-source community are welcome to grow the tool faster.
Core Functionalities

pyAudioProcessing aims to provide an end-to-end processing solution for converting between audio file formats, visualizing time and frequency domain representations, cleaning audio by removing silence and low-activity segments, building features from raw audio samples, and training a machine learning model that can then be used to classify unseen raw audio samples (e.g., into categories such as music, speech, etc.). This library allows the user to extract features such as Mel Frequency Cepstral Coefficients (MFCC) [CD14], Gammatone Frequency Cepstral Coefficients (GFCC) [JDHP17], spectral features, chroma features and other beat-based and cepstrum-based features from audio, to use with one's own classification backend or with the scikit-learn classifiers that have been built into pyAudioProcessing. The classifier implementation examples that are a part of this software aim to give users a sample solution to audio classification problems and help build the foundation to tackle new and unseen problems.

pyAudioProcessing provides seven core functionalities comprising different stages of audio signal processing.

1. Converting audio files to .wav format, giving users the ability to work with different types of audio and increasing compatibility with code and processes that work best with the .wav audio type.
2. Audio visualization in time-series and frequency representation, including spectrograms.
3. Segmenting and removing low-activity segments from audio files, i.e., removing unwanted audio segments that are less likely to represent meaningful information.
4. Building numerical features from audio that can be used to train machine learning models. The set of features supported evolves with time as research informs new and improved algorithms.
5. Ability to export the features built with this library for use with any custom machine learning backend of the user's choosing.
6. Capability that allows users to train scikit-learn classifiers using features of their choosing directly from raw data. pyAudioProcessing a) runs automatic hyper-parameter tuning, b) returns to the user the training model metrics along with a cross-validation confusion matrix (an evaluation matrix from which we can estimate the performance of the model broken down by each class/category) for model evaluation, and c) allows the user to test the created classifier with the same features used for training (a minimal scikit-learn sketch of this flow is shown after this list).
7. Includes pre-trained models to provide users with baseline audio classifiers.
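The training capability in functionality 6 maps onto standard scikit-learn building blocks. The following is a minimal, illustrative sketch (not pyAudioProcessing's internal code) of hyper-parameter tuning and a cross-validation confusion matrix for an SVM; the feature matrix X and labels y are placeholders standing in for features exported from the library's feature-building step.

    # Hypothetical feature matrix X (n_samples x n_features) and labels y,
    # standing in for features exported by the library.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, cross_val_predict
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 40))                  # placeholder features
    y = np.repeat(["music", "speech"], 30)         # placeholder labels

    # Automatic hyper-parameter tuning over the SVM regularization parameter C
    grid = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=5)
    grid.fit(X, y)

    # Cross-validation confusion matrix for the best estimator
    y_cv = cross_val_predict(grid.best_estimator_, X, y, cv=5)
    print(grid.best_params_)
    print(confusion_matrix(y, y_cv, labels=["music", "speech"]))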
Methods and Results

Pre-trained models

pyAudioProcessing offers pre-trained audio classification models for the Python community to aid in quick baseline establishment. This is an evolving feature as new datasets and classification problems gain prominence in the field. Some of the pre-trained models include the following.

1. Audio type classifier to determine speech versus music: a Support Vector Machine (SVM) classifier trained to classify audio into two possible classes, music and speech. This classifier was trained using Mel Frequency Cepstral Coefficients (MFCC), spectral features, and chroma features, on manually created and curated samples for speech and music. The per-class evaluation metrics are shown in Table 1.

2. Audio type classifier to determine speech versus music versus bird sounds: an SVM classifier trained to classify audio into three possible classes, music, speech, and birds. This classifier was trained using MFCC, spectral features, and chroma features. The per-class evaluation metrics are shown in Table 2.

3. Music genre classifier using the GTZAN dataset [TEC01]: an SVM classifier trained using Gammatone Frequency Cepstral Coefficients (GFCC), MFCC, spectral features, and chroma features to classify music into 10 genre classes: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock. The per-class evaluation metrics are shown in Table 3.

TABLE 1: Per-class evaluation metrics for the audio type (speech vs music) classification pre-trained model.

    Class     Accuracy    Precision    F1
    music     97.60%      98.79%       98.19%
    speech    98.80%      97.63%       98.21%

TABLE 2: Per-class evaluation metrics for the audio type (speech vs music vs bird sound) classification pre-trained model.

    Class     Accuracy    Precision    F1
    music     94.60%      96.93%       95.75%
    speech    97.00%      97.79%       97.39%
    birds     100.00%     96.89%       98.42%

TABLE 3: Per-class evaluation metrics for the music genre classification pre-trained model.

    Class    Accuracy    Precision    F1
    pop      72.36%      78.63%       75.36%
    met      87.31%      85.52%       86.41%
    dis      62.84%      59.45%       61.10%
    blu      83.02%      72.96%       77.66%
    reg      79.82%      69.72%       74.43%
    cla      90.61%      86.38%       88.44%
    rock     53.10%      51.50%       52.29%
    hip      60.94%      77.22%       68.12%
    cou      58.34%      62.53%       60.36%
    jazz     78.10%      85.17%       81.48%

These models aim to present the capability of audio feature generation algorithms in extracting meaningful numeric patterns from audio data. One can train their own classifiers using similar features and a different machine learning backend for researching and exploring improvements.
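Per-class summaries of the kind reported in Tables 1-3 are commonly produced with scikit-learn's reporting utilities. The snippet below is only an illustration of that reporting step; the true and predicted labels are toy placeholders, not the models' actual outputs.

    # Illustrative per-class precision/recall/F1 report; labels are placeholders.
    from sklearn.metrics import classification_report

    y_true = ["music", "music", "speech", "speech", "birds", "birds"]
    y_pred = ["music", "speech", "speech", "speech", "birds", "birds"]

    print(classification_report(y_true, y_pred, digits=4))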
Audio features

There are multiple types of features one can extract from audio. Information about getting started with audio processing is well described in [Sin19]. pyAudioProcessing allows users to compute GFCC, MFCC, other cepstral features, spectral features, temporal features, chroma features, and more. Details on how to extract these features are present in the project documentation on GitHub. Generally, features useful in different audio prediction tasks (especially speech) include Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Bark Frequency Cepstral Coefficients (BFCC), Power Normalized Cepstral Coefficients (PNCC), and spectral features like spectral flux, entropy, roll-off, centroid, spread, and energy entropy. While MFCC features find use in most commonly encountered audio processing tasks such as audio type and speech classification, GFCC features have been found to have application in speaker identification and speaker diarization (the process of partitioning an input audio stream into homogeneous segments according to the human speaker identity [Wik22a]). Applications, comparisons and uses can be found in [ZW13], [pat21], and [pat22]. The pyAudioProcessing library computes these features for segments of a single audio signal, followed by the mean and standard deviation over all the signal segments.

Mel Frequency Cepstral Coefficients (MFCC):

The mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes our features match more closely what humans hear. The mel-frequency scale is approximately linear for frequencies below 1 kHz and logarithmic for frequencies above 1 kHz, as shown in Figure 1. This is motivated by the fact that the human auditory system becomes less frequency-selective as frequency increases above 1 kHz. The signal is divided into segments and a spectrum is computed. Passing a spectrum through the mel filter bank, followed by taking the log magnitude and a discrete cosine transform (DCT), produces the mel cepstrum. The DCT extracts the signal's main information and peaks; for this very property, the DCT is also widely used in applications such as JPEG and MPEG compression. The peaks after the DCT contain the gist of the audio information. Typically, the first 13-20 coefficients extracted from the mel cepstrum are called the MFCCs. These hold very useful information about audio and are often used to train machine learning models. The process of developing these coefficients is illustrated in Figure 1. MFCC for a sample speech audio can be seen in Figure 2.

[Fig. 1: MFCC from audio spectrum.]
[Fig. 2: MFCC from a sample speech audio.]

Gammatone Frequency Cepstral Coefficients (GFCC):

Another filter inspired by human hearing is the gammatone filter bank. The gammatone filter bank shape looks similar to the mel filter bank, except that the peaks are smoother than the triangular shape of the mel filters. Gammatone filters are conceived to be a good approximation to the human auditory filters and are used as a front-end simulation of the cochlea. Since the human ear is the ideal receiver and distinguisher of speakers, in the presence or absence of noise, the construction of gammatone filters that mimic auditory filters became desirable. Thus the gammatone filter bank has many applications in speech processing, because it aims to replicate how we hear. GFCCs are formed by passing the spectrum through a gammatone filter bank, followed by loudness compression and a DCT, as seen in Figure 3. The first (approximately) 22 features are called GFCCs. GFCCs have a number of applications in speech processing, such as speaker identification. GFCC for a sample speech audio can be seen in Figure 4.

[Fig. 3: GFCC from audio spectrum.]
[Fig. 4: GFCC from a sample speech audio.]
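The MFCC pipeline described above (mel filter bank, log magnitude, DCT) can be sketched in a few lines using librosa and SciPy. This is an illustration of the process under stated assumptions, not pyAudioProcessing's exact implementation; "speech.wav" is a placeholder file name.

    # Illustration of the described MFCC pipeline (mel filter bank -> log -> DCT).
    # Not pyAudioProcessing's implementation; "speech.wav" is a placeholder.
    import librosa
    import numpy as np
    from scipy.fftpack import dct

    y, sr = librosa.load("speech.wav", sr=None)

    # Mel-scaled power spectrum of short overlapping frames
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                              hop_length=512, n_mels=40)

    # Log magnitude followed by a DCT over the mel bands; keep the first 13
    # coefficients per frame as the MFCCs
    log_mel = np.log(mel_spec + 1e-10)
    mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13, :]

    print(mfcc.shape)   # (13, number_of_frames)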
Temporal features:

Temporal features from audio are extracted from the signal information in its time domain representation. Examples include signal energy, entropy, and zero crossing rate. Some sample mean temporal features can be seen in Figure 5.

[Fig. 5: Temporal extractions from a sample speech audio.]

Spectral features:

Spectral features, on the other hand, derive information contained in the frequency domain representation of an audio signal. The signal can be converted from the time domain to the frequency domain using the Fourier transform. Useful features from the signal spectrum include fundamental frequency, spectral entropy, spectral spread, spectral flux, spectral centroid, and spectral roll-off. Some sample mean spectral features can be seen in Figure 6.

[Fig. 6: Spectral features from a sample speech audio.]

Chroma features:

Chroma features are highly popular for music audio data. In Western music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into the twelve categories A, A#, B, C, C#, D, D#, E, F, F#, G, G#) and whose tuning approximates the equal-tempered scale [con22]. A prime characteristic of chroma features is that they capture the harmonic and melodic attributes of audio, while being robust to changes in timbre and instrumentation. Some sample mean chroma features can be seen in Figure 7.

[Fig. 7: Chroma features from a sample speech audio.]
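A few of the temporal, spectral, and chroma descriptors named above can be computed with librosa, one of the related libraries cited earlier. The sketch below is illustrative only (pyAudioProcessing ships its own extractors); the file path is a placeholder, and the mean/standard-deviation summary mirrors the per-segment aggregation described above.

    # Illustrative temporal, spectral, and chroma descriptors via librosa.
    # "speech.wav" is a placeholder path.
    import librosa
    import numpy as np

    y, sr = librosa.load("speech.wav", sr=None)

    zcr = librosa.feature.zero_crossing_rate(y)               # temporal
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral centroid
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)    # spectral roll-off
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # 12 pitch-class energies

    # Summarize frame-level features as mean and standard deviation
    summary = {name: (float(np.mean(f)), float(np.std(f)))
               for name, f in [("zcr", zcr), ("centroid", centroid),
                               ("rolloff", rolloff), ("chroma", chroma)]}
    print(summary)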
Audio data cleaning/de-noising

Oftentimes an audio sample has multiple segments within the same signal that contain nothing but silence or a slight degree of background noise compared to the rest of the audio. For most applications, those low-activity segments make up the irrelevant information of the signal.

The audio clip shown in Figure 8 is a human saying the word "london", plotted in the time domain with signal amplitude on the y-axis and sample number on the x-axis. The areas where the signal is close to zero or low in amplitude are areas where speech is absent and represent the pauses the speaker took while saying the word "london". Figure 9 shows the spectrogram of the same audio signal. A spectrogram contains time on the x-axis and frequency on the y-axis. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams; when the data are represented in a 3D plot they may be called waterfalls. As [Wik21] mentions, spectrograms are used extensively in the fields of music, linguistics, sonar, radar, speech processing, seismology, and others. Spectrograms of audio can be used to identify spoken words phonetically and to analyze the various calls of animals. A spectrogram can be generated by an optical spectrometer, a bank of band-pass filters, by Fourier transform or by a wavelet transform. A spectrogram is usually depicted as a heat map, i.e., as an image with the intensity shown by varying the color or brightness.

After applying the algorithm for signal alteration to remove irrelevant and low-activity audio segments, the resultant audio's time-series plot looks like Figure 10 and its spectrogram looks like Figure 11. It can be seen that the low-activity areas are now missing from the audio and the resultant audio contains more activity-filled regions. This algorithm removes silences as well as low-activity regions from the audio. These visualizations were produced using pyAudioProcessing and can be produced for any audio signal using the library.

[Fig. 8: Time-series representation of speech for "london".]
[Fig. 9: Spectrogram of speech for "london".]
[Fig. 10: Time-series representation of cleaned speech for "london".]
[Fig. 11: Spectrogram of cleaned speech for "london".]

Impact of cleaning on feature formation for a classification task:

A spoken location name classification problem was considered for this evaluation. The dataset consisted of 23 training samples per class and 17 test samples per class, with 2 classes in total: london and boston. This dataset was manually created and can be found linked in the project readme of pyAudioProcessing. For comparative purposes, the classifier is kept constant at SVM, and the parameter C is chosen by grid search for each experiment based on the best precision, recall and F1 score. Results in Table 4 show the impact of applying the low-activity region removal using pyAudioProcessing prior to training the model on MFCC features. It can be seen that the accuracies increased when audio samples were cleaned prior to training the model. This is especially useful in cases where silence or low-activity regions in the audio do not contribute to the predictions and act as noise in the signal.

TABLE 4: Performance comparison on test data between MFCC-feature-trained models with and without cleaning.

    Features      boston acc    london acc
    mfcc          0.765         0.412
    clean+mfcc    0.823         0.471
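The idea behind low-activity removal can be illustrated with a simple energy threshold on short, non-overlapping frames. This is only a minimal sketch of the concept, not pyAudioProcessing's exact algorithm; the threshold choice and the file names are assumptions.

    # Minimal sketch of energy-based low-activity removal (illustrative only).
    # "london.wav" and the 0.5 * median threshold are assumptions.
    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("london.wav", sr=None)

    frame = hop = 2048   # non-overlapping frames
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

    # Keep frames whose energy exceeds a fraction of the clip's median energy
    keep = rms > 0.5 * np.median(rms)

    # Stitch the retained frames back together
    segments = [y[i * hop: i * hop + frame] for i, k in enumerate(keep) if k]
    cleaned = np.concatenate(segments) if segments else y

    sf.write("london_cleaned.wav", cleaned, sr)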
Integrations

pyAudioProcessing integrates with third-party tools such as scikit-learn, matplotlib, and pydub to offer additional functionalities.

Training, classification, and evaluation: The library contains integrations with scikit-learn classifiers for passing audio through feature extraction followed by classification, directly using the raw audio samples as input. Training results include computation of cross-validation results along with hyperparameter tuning details.

Audio format conversion: Some applications and integrations work best with the .wav data format. pyAudioProcessing integrates with tools that perform format conversion and presents them as a functionality of the library.

Audio visualization: Spectrograms are 2-D images representing sequences of spectra with time along one axis, frequency along the other, and brightness or color representing the strength of a frequency component at each time frame [Wys17]. Not only can one see whether there is more or less energy at, for example, 2 Hz vs 10 Hz, but one can also see how energy levels vary over time [PNS]. Some of the convolutional neural network architectures for images can be applied to audio signals on top of spectrograms. This is a different route of building audio models: developing spectrograms followed by image processing. Time-series, frequency-domain, and spectrogram (both time and frequency domain) visualizations can be retrieved using pyAudioProcessing and its integrations. See Figures 9 and 10 as examples.
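Format conversion of the kind described above is commonly handled with pydub, one of the third-party tools named at the start of this section. The snippet below is an illustration of that step rather than pyAudioProcessing's own wrapper; the file names are placeholders and pydub requires ffmpeg to be installed for non-wav inputs.

    # Converting an audio file to .wav with pydub (illustration only;
    # file names are placeholders, and ffmpeg must be available).
    from pydub import AudioSegment

    audio = AudioSegment.from_file("recording.mp3")
    audio.export("recording.wav", format="wav")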
Conclusion

In this paper pyAudioProcessing, an open-source Python library, is presented. The tool implements and integrates a wide range of audio processing functionalities. Using pyAudioProcessing, one can read and visualize audio signals, clean audio signals by removal of irrelevant content, build and extract complex features such as GFCC, MFCC, and other spectrum- and cepstrum-based features, build classification models, and use pre-built trained baseline models to classify different types of audio. Wrappers along with command-line usage examples are provided in the software's readme and wiki to give the user guidance and flexibility of usage. pyAudioProcessing has been used in active research around audio processing and can be used as the basis for further Python-based research efforts. pyAudioProcessing is updated frequently in order to apply enhancements and new functionalities reflecting recent research efforts of the digital signal processing and machine learning community. Some of the ongoing implementations include additions of cepstral features such as LPCC, integration with deep learning backends, and a variety of spectrogram formations that can be used for image-classification-based audio classification tasks.

REFERENCES

[Auf20] Ben Auffarth. Artificial Intelligence with Python Cookbook. Packt Publishing, 10 2020.
[BGSR21] Srivi Balaji, Meghana Gopannagari, Svanik Sharma, and Preethi Rajgopal. Developing a machine learning algorithm to assess attention levels in ADHD students in a virtual learning setting using audio and video processing. International Journal of Recent Technology and Engineering (IJRTE), 10, 5 2021. doi:10.35940/ijrte.A5965.0510121.
[BWG+13] Dmitry Bogdanov, N Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, G Roma, Justin Salamon, Jose Zapata, and Xavier Serra. Essentia: an audio analysis library for music information retrieval. 11 2013.
[CD14] Paresh M. Chauhan and Nikita P. Desai. Mel frequency cepstral coefficients (MFCC) based speaker identification in noisy environment using Wiener filter. In 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), pages 1-5, 2014. doi:10.1109/ICGCCEE.2014.6921394.
[con22] Wikipedia contributors. Chroma feature - Wikipedia, the free encyclopedia, 2022. Online; accessed 18-May-2022. URL: https://en.wikipedia.org/w/index.php?title=Chroma_feature&oldid=1066722932.
[Din21] Vincent Dinger. Master Thesis KI Methodiken für die Verarbeitung akustischer Signale / AI Usage for Processing Acoustic Signals. PhD thesis, Kaiserslautern University of Applied Sciences, 03 2021. doi:10.13140/RG.2.2.15872.97287.
[Gia15] Theodoros Giannakopoulos. pyAudioAnalysis: An open-source python library for audio signal analysis. PloS one, 10(12), 2015. doi:10.1371/journal.pone.0144610.
[JDHP17] Medikonda Jeevan, Atul Dhingra, M. Hanmandlu, and Bijaya Panigrahi. Robust Speaker Verification Using GFCC Based i-Vectors, volume 395, pages 85-91. Springer, 10 2017. doi:10.1007/978-81-322-3592-7_9.
[JS21] Jyotika Singh. Social Media Analysis using Natural Language Processing Techniques. In Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe, editors, Proceedings of the 20th Python in Science Conference, pages 52-58, 2021. URL: http://conference.scipy.org/proceedings/scipy2021/pdfs/jyotika_singh.pdf, doi:10.25080/majora-1b6fd038-009.
[Mal20] Ayoub Malek. spafe/spafe: 0.1.2, April 2020. URL: https://github.com/SuperKogito/spafe.
[MRL+15] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, volume 8, 2015. doi:10.5281/zenodo.4792298.
[pat21] Method for optimizing media and marketing content using cross-platform video intelligence, 2021. URL: https://patents.google.com/patent/US10949880B2/en.
[pat22] Media and marketing optimization with cross platform consumer and content intelligence, 2022. URL: https://patents.google.com/patent/US20210201349A1/en.
[PeP22] PePy. PePy download statistics, 2022. URL: https://pepy.tech/project/pyAudioProcessing.
[PNS] PNSN. What is a spectrogram? URL: https://pnsn.org/spectrograms/what-is-a-spectrogram#.
[Sin19] Jyotika Singh. An introduction to audio processing and machine learning using python, 2019. URL: https://opensource.com/article/19/9/audio-processing-machine-learning-python.
[Sin21] Jyotika Singh. jsingh811/pyAudioProcessing: Audio processing, feature extraction and classification, July 2021. URL: https://github.com/jsingh811/pyAudioProcessing, doi:10.5281/zenodo.5121041.
[TEC01] George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals, 2001. URL: http://ismir2001.ismir.net/pdf/tzanetakis.pdf.
[Wik21] Wikipedia contributors. Spectrogram - Wikipedia, the free encyclopedia, 2021. [Online; accessed 19-July-2021]. URL: https://en.wikipedia.org/w/index.php?title=Spectrogram&oldid=1031156666.
[Wik22a] Wikipedia contributors. Speaker diarisation - Wikipedia, the free encyclopedia, 2022. [Online; accessed 23-June-2022]. URL: https://en.wikipedia.org/w/index.php?title=Speaker_diarisation&oldid=1090834931.
[Wik22b] Wikipedia contributors. Word embedding - Wikipedia, the free encyclopedia, 2022. [Online; accessed 23-June-2022]. URL: https://en.wikipedia.org/w/index.php?title=Word_embedding&oldid=1091348337.
[Wys17] Lonce Wyse. Audio spectrogram representations for processing with convolutional neural networks. 06 2017.
[ZW13] Xiaojia Zhao and DeLiang Wang. Analyzing noise robustness of MFCC and GFCC features in speaker identification. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7204-7208, 2013. doi:10.1109/ICASSP.2013.6639061.


Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2

Aleksandr Koshkarov, Wanlin Li, My-Linh Luu, Nadia Tahiri. Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada (Koshkarov, Li, Tahiri); Center of Artificial Intelligence, Astrakhan State University, Astrakhan, 414056, Russia (Koshkarov); Department of Computer Science, University of Quebec at Montreal, Montreal, QC, Canada (Luu). Aleksandr Koshkarov and Wanlin Li contributed equally. Corresponding author: Nadia.Tahiri@USherbrooke.ca. Copyright © 2022 Aleksandr Koshkarov et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract: As the SARS-CoV-2 pandemic reaches its peak, researchers around the globe are combining efforts to investigate the genetics of different variants to better deal with its distribution. This paper discusses phylogeographic approaches to examine how patterns of divergence within SARS-CoV-2 coincide with geographic features, such as climatic features. First, we propose a Python-based bioinformatic pipeline called aPhylogeo for phylogeographic analysis, written in Python 3, that helps researchers better understand the distribution of the virus in specific regions via a configuration file, and then runs all the analysis operations in a single run. In particular, the aPhylogeo tool determines which parts of the genetic sequence undergo a high mutation rate depending on geographic conditions, using a sliding window that moves along the genetic sequence alignment in user-defined steps and a window size. As a Python-based cross-platform program, aPhylogeo works on Windows®, MacOS X® and GNU/Linux. The implementation of this pipeline is publicly available on GitHub (https://github.com/tahiri-lab/aPhylogeo). Second, we present an example analysis with our new aPhylogeo tool on real data (SARS-CoV-2) to understand the occurrence of different variants.

Index Terms: Phylogeography, SARS-CoV-2, Bioinformatics, Genetic, Climatic Condition

Introduction

The global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is at its peak, and more and more variants of SARS-CoV-2 have been described over time. Among these, some are considered variants of concern (VOC) by the World Health Organization (WHO) due to their impact on global public health, such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron (B.1.1.529) [CRA+22]. Although significant progress has been made in vaccine development and mass vaccination is being implemented in many countries, the continued emergence of new variants of SARS-CoV-2 threatens to reverse the progress made to date. Researchers around the world collaborate to better understand the genetics of the different variants, along with the factors that influence the epidemiology of this infectious disease. Genetic studies of the different variants have contributed to the development of vaccines to better combat the spread of the virus. Studying the factors (e.g., environment, host, agent of transmission) that influence epidemiology helps us to limit the continued spread of infection and prepare for the future re-emergence of diseases caused by subtypes of coronavirus [LFZK06]. However, few studies report associations between environmental factors and the genetics of different variants. Different variants of SARS-CoV-2 are expected to spread differently depending on geographical conditions, such as meteorological parameters. The main objective of this study is to find clear correlations between the genetics and the geographic distribution of different variants of SARS-CoV-2.
Several studies showed that COVID-19 cases and related climatic factors correlate significantly with each other ([OCFC20], [SDdPS+20], and [SMVS+22]). Oliveiros et al. [OCFC20] reported a decrease in the rate of SARS-CoV-2 progression with the onset of spring and summer in the northern hemisphere. Sobral et al. [SDdPS+20] suggested a negative correlation between mean temperature by country and the number of SARS-CoV-2 infections, along with a positive correlation between rainfall and SARS-CoV-2 transmission. This contrasts with the results of the study by Sabarathinam et al. [SMVS+22], which showed that an increase in temperature led to an increase in the spread of SARS-CoV-2. The results of Chen et al. [CPK+21] imply that a country located 1000 km closer to the equator can expect 33% fewer cases of SARS-CoV-2 per million population. Some virus variants may be more stable in environments with specific climatic factors. Sabarathinam et al. [SMVS+22] compared mutation patterns of SARS-CoV-2 with time series of changes in precipitation, humidity, and temperature. They suggested that temperatures between 43°F and 54°F, humidity of 67-75%, and precipitation of 2-4 mm may be the optimal environment for the transition of the mutant form from D614 to G614.

In this study, we examine the geospatial lineage of SARS-CoV-2 by combining genetic data and metadata from the associated sampling locations. Thus, an association between genetics and the geographic distribution of SARS-CoV-2 variants can be found. We focus on developing a new algorithm to find relationships between a reference tree (i.e., a tree of geographic species distributions, a temperature tree, a habitat precipitation tree, or others) and the genetic composition of the species. This new algorithm can help find which genes, or which subparts of a gene, are sensitive or favorable to a given environment.
Problem statement and proposal

Phylogeography is the study of the principles and processes that govern the distribution of genealogical lineages, particularly at the intraspecific level. The geographic distribution of species is often correlated with the patterns associated with the species' genes ([A+00] and [KM02]). In a phylogeographic study, three major processes should be considered (see [Nag92] for more details):

1) Genetic drift is the result of allele sampling errors. These errors are due to generational transmission of alleles and geographical barriers. Genetic drift is a function of the size of the population: the larger the population, the lower the genetic drift. This is explained by the ability to maintain genetic diversity in the original population. By convention, we say that an allele is fixed if it reaches a frequency of 100%, and that it is lost if it reaches a frequency of 0%.
2) Gene flow, or migration, is an important process for conducting a phylogeographic study. It is the transfer of alleles from one population to another, increasing intrapopulation diversity and decreasing interpopulation diversity.
3) There are many forms of selection in all species. Here we indicate the two most important of them, as they are essential for a phylogeographic study. (a) Sexual selection is a phenomenon resulting from an attractive characteristic between two individuals; therefore, this selection is a function of the size of the population. (b) Natural selection is a function of fertility, mortality, and adaptation of a species to a habitat.

Populations living in different environments with varying climatic conditions are subject to pressures that can lead to evolutionary divergence and reproductive isolation ([OS98] and [Sch01]). Phylogeny and geography are then correlated. This study, therefore, aims to present an algorithm that shows the possible correlation between certain genes or gene fragments and the geographical distribution of species.

Most studies in phylogeography consider only genetic data, without directly considering climatic data; they take this information only indirectly, as a basis for locating the habitat of the species. We have developed the first version of a phylogeographic pipeline that integrates climate data. The sliding window strategy provides more robust results, as it particularly highlights the areas sensitive to climate adaptation.
Methods and Python scripts

In order to achieve our goal, we designed a workflow and then developed a script in Python version 3.9 called aPhylogeo for phylogeographic analysis (see [LLKT22] for more details). It interacts with multiple bioinformatic programs, taking climatic data and nucleotide data as input, and performs multiple phylogenetic analyses on nucleotide sequencing data using a sliding window approach. The process is divided into three main steps (see Figure 1).

The first step involves collecting data and searching for quality viral sequences, which are essential for the validity of our results. All sequences were retrieved from the NCBI Virus website (National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). In total, 20 regions were selected to represent 38 gene sequences of SARS-CoV-2. After collecting the genetic data, we extracted 5 climatic factors for the 20 regions, i.e., temperature, humidity, precipitation, wind speed, and sky surface shortwave downward irradiance. These data were obtained from the NASA website (https://power.larc.nasa.gov/).

In the second step, trees are created from the climatic data and the genetic data, respectively. For the climatic data, we calculated the dissimilarity between each pair of variants (i.e., from different climatic conditions), resulting in a symmetric square matrix. From this matrix, the neighbor joining algorithm was used to construct the climate tree. The same approach was implemented for the genetic data. Using nucleotide sequences from the 38 SARS-CoV-2 lineages, phylogenetic reconstruction is repeated to construct genetic trees, considering only the data within a window that moves along the alignment in user-defined steps and window size (their length is denoted by the number of base pairs (bp)).

In the third step, the phylogenetic trees constructed in each sliding window are compared to the climatic trees using the Robinson and Foulds (RF) topological distance [RF81]. The distance was normalized by 2n-6, where n is the number of leaves (i.e., taxa). The proposed approach considers bootstrapping. The implementation of the sliding window technique provides a more accurate identification of regions with high gene mutation rates. As a result, we highlighted a correlation between parts of the genes with a high rate of mutations and the geographic distribution of viruses, which emphasizes the emergence of new variants (i.e., Alpha, Beta, Delta, Gamma, and Omicron).

The sliding window strategy can detect genetic fragments that depend on environmental parameters, but this work requires time-consuming data preprocessing and the use of several bioinformatics programs. For example, we need to verify that each sequence identifier in the sequencing data always matches the corresponding metadata. If samples are added or removed, we need to check whether the sequencing dataset matches the metadata and make changes accordingly. In the next stage, we need to align the sequences (multiple sequence alignment, MSA) and integrate everything step by step into specific software such as MUSCLE [Edg04], the Phylip package (i.e., Seqboot, DNADist, Neighbor, and Consense) [Fel05], RF [RF81], and raxmlHPC [Sta14]. The use of each piece of software requires expertise in bioinformatics. In addition, the intermediate analysis steps inevitably generate many files, the management of which not only consumes the time of the biologist but is also subject to errors, which reduces the reproducibility of the study. At present, there are only a few systems designed to automate phylogeographic analysis. In this context, the development of a computer program for a better understanding of the nature and evolution of coronavirus is essential for the advancement of clinical research.

[Fig. 1: The workflow of the algorithm. The operations within this workflow include several blocks, highlighted by three different colors. The first block (grey) is responsible for creating the trees based on the climate data. The second block (green) performs input parameter validation. The third block (blue) allows the creation of the phylogenetic trees; it is the most important block and the basis of this study, through which the user receives the output data with the necessary calculations.]
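The climate-tree part of the second step (pairwise dissimilarities followed by neighbor joining) can be sketched as follows. This is a minimal illustration under stated assumptions: the lineage names and climate values are placeholders, and Biopython's neighbor-joining implementation is used here for compactness, whereas the pipeline itself calls the Phylip Neighbor program.

    # Minimal sketch: dissimilarity matrix over standardized climate factors,
    # then neighbor joining. Names/values are placeholders; Biopython stands in
    # for the Phylip "neighbor" program used by the pipeline.
    import numpy as np
    from Bio import Phylo
    from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

    lineages = ["A.2.3", "D.2", "C.37", "Q.2"]
    climate = np.array([      # temperature, humidity, precipitation, wind, irradiance
        [7.1, 5.9, 1.2, 4.4, 2.1],
        [21.5, 9.8, 0.4, 5.1, 6.0],
        [24.0, 12.3, 3.1, 2.9, 5.5],
        [3.2, 4.1, 2.0, 3.7, 1.4],
    ])

    z = (climate - climate.mean(axis=0)) / climate.std(axis=0)   # standardize factors

    # Lower-triangular Euclidean dissimilarity matrix (diagonal included),
    # in the layout Biopython's DistanceMatrix expects
    matrix = [[float(np.linalg.norm(z[i] - z[j])) for j in range(i + 1)]
              for i in range(len(lineages))]
    dm = DistanceMatrix(names=lineages, matrix=matrix)

    climate_tree = DistanceTreeConstructor().nj(dm)
    Phylo.draw_ascii(climate_tree)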
The creation of phylogenetic trees, as mentioned above, is an important part of the solution and includes the main steps of the developed pipeline. The following function is intended for genetic data; its main parameters are as follows:

    def create_phylo_tree(gene,
                          window_size,
                          step_size,
                          bootstrap_threshold,
                          rf_threshold,
                          data_names):
        number_seq = align_sequence(gene)
        sliding_window(window_size, step_size)
        ...
        for file in files:
            try:
                ...
                create_bootstrap()
                run_dnadist()
                run_neighbor()
                run_consense()
                filter_results(gene,
                               bootstrap_threshold,
                               rf_threshold,
                               data_names,
                               number_seq,
                               file)
                ...
            except Exception as error:
                raise

This function takes the gene data, window size, step size, bootstrap threshold, threshold for the Robinson and Foulds distance, and data names as input parameters. The function then sequentially connects the main steps of the pipeline: align_sequence(gene), sliding_window(window_size, step_size), create_bootstrap(), run_dnadist(), run_neighbor(), run_consense(), and filter_results with its parameters. As a result, we obtain a phylogenetic tree (or several trees), which is written to a file.
We have created a function (create_tree) to create the climate trees. The function is described as follows:

    def create_tree(file_name, names):
        for i in range(1, len(names)):
            create_matrix(file_name,
                          names[0],
                          names[i],
                          "infile")
            os.system("./exec/neighbor " +
                      "< input/input.txt")
            subprocess.call(["mv", "outtree", "intree"])
            subprocess.call(["rm", "infile", "outfile"])
            os.system("./exec/consense " +
                      "< input/input.txt")
            newick_file = names[i].replace(" ", "_") + "_newick"
            subprocess.call(["rm", "outfile"])
            subprocess.call(["mv", "outtree", newick_file])

The following sliding window function illustrates moving the sliding window along an alignment, with the window size and step size as parameters. The first 11 characters of each output line are allocated to the species name, plus a space.

    def sliding_window(window_size=0, step=0):
        try:
            f = open("infile", "r")
            ...
            # slide the window along the sequence
            start = 0
            fin = start + window_size
            while fin <= longueur:
                index = 0
                with open("out", "r") as f, ... as out:
                    ...
                    for line in f:
                        if line != "\n":
                            espece = list_names[index]
                            nb_espace = 11 - len(espece)
                            out.write(espece)
                            for i in range(nb_espace):
                                out.write(" ")
                            out.write(line[debut:fin])
                            index = index + 1
                    out.close()
                f.close()
                start = start + step
                fin = fin + step
        except:
            print("An error occurred.")
Algorithmic complexity

The complexity of the algorithm described in the previous section depends on the complexity of the various external programs used and on the number of windows that the alignment can contain, plus one for the total alignment that the program will process. Recall the complexities of the different external programs used in the algorithm:

• SeqBoot program: O(r × n × SA)
• DNADist program: O(n²)
• Neighbor program: O(n³)
• Consense program: O(r × n²)
• RaxML program: O(e × n × SA)
• RF program: O(n²),

where n is the number of species (or taxa), r is the number of replicates, SA is the size of the multiple sequence alignment (MSA), and e is the number of refinement steps performed by the RaxML algorithm. For all SA ∈ N* and for all WS, S ∈ N, the number of windows can be evaluated as follows (Eq. 1):

    nb = (SA − WS) / S + 1,    (1)

where WS is the window size and S is the step.
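Equation (1) can be expressed as a small helper function. Integer (floor) division is assumed below, since the number of windows must be a whole number; the alignment length used in the example is illustrative only.

    # Number of sliding windows per Eq. (1); floor division is assumed since
    # the count of windows must be an integer.
    def number_of_windows(alignment_length, window_size, step):
        return (alignment_length - window_size) // step + 1

    # Illustrative values: a 29,000 bp alignment with the (200 bp, 50 bp)
    # combination retained later in the Results section.
    print(number_of_windows(29000, 200, 50))   # 577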
Dataset

The following two principles were applied to select the samples for analysis.

1) Selection of SARS-CoV-2 Pango lineages that are dispersed in different phylogenetic clusters whenever possible.

The Pango lineage nomenclature system is hierarchical and fine-scaled and is designed to capture the leading edge of pandemic transmission. Each Pango lineage aims to define an epidemiologically relevant phylogenetic cluster, for instance, an introduction into a distinct geographic area with evidence of onward transmission [RHO+20]. On one side, Pango lineages signify groups or clusters of infections with shared ancestry: if the entire pandemic can be thought of as a vast branching tree of transmission, then the Pango lineages represent individual branches within that tree. On the other side, Pango lineages are intended to highlight epidemiologically relevant events, such as the appearance of the virus in a new location, a rapid increase in the number of cases, or the evolution of viruses with new phenotypes [OSU+21]. Therefore, to have some sequence diversity in the selected samples, we avoided selecting lineages belonging to the same or similar phylogenetic clusters. For example, among C.36, C.36.1, C.36.2, C.36.3 and C.36.3.1, only C.36 was used as a sample for analysis.

2) Selection of lineages that are clearly dominant in a particular region compared to other regions.

Through significant advances in the generation and exchange of SARS-CoV-2 genomic data in real time, the international spread of lineages is tracked and recorded on the website cov-lineages.org/global_report.html [OHP+21]. Based on the statistical information provided by the website, our study focuses on SARS-CoV-2 lineages that were first identified (Earliest date) and widely disseminated in a particular country (Most common country) during a certain period (Table 1). We list four examples of the distribution of a set of lineages:

• Lineages A.2.3 and B.1.1.107 both have 100% distribution in the United Kingdom. Lineages D.2 and D.3 both have 100% distribution in Australia. B.1.1.172, L.4 and P.1.13 have 100% distribution in the United States. Finally, AH.1, AK.2 and C.7 have 100% distribution in Switzerland, Germany, and Denmark, respectively.
• The country with the widest distribution of L.2 is the Netherlands (77.0%), followed by Germany (19.0%). Due to the 58% difference in the distribution of L.2 between the two locations, we consider the Netherlands the main distribution country of L.2 and, therefore, it was selected as a sample.
• Similarly, the most predominant country of distribution of C.37 is Peru (44%), followed by Chile (19.0%), a difference of 25%. Among all samples of this study, C.37 was the lineage with the smallest difference in distribution percentage between the top two countries. Considering the need to increase the diversity of the geographical distribution of the samples, C.37 was also selected.
• In contrast, the distribution of C.6 is 17.0% in France, 14.0% in Angola, 13.0% in Portugal, and 8.0% in Switzerland; we concluded that C.6 does not show a tendency in terms of geographic distribution and it was therefore not included as a sample for analysis.

In accordance with the above principles, we selected 38 lineages with regional characteristics for further study. Based on the location information, complete nucleotide sequencing data for these 38 lineages were collected from the NCBI Virus website (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). When multiple sequencing results were available for the same lineage in the same country, we selected the sequence whose collection date was closest to the earliest date presented. If there were several sequencing results for the same country on the same date, the sequence with the least number of ambiguous characters (N per nucleotide) was selected (Table 1).

Based on the sampling locations of each lineage sequence in Table 1 (consistent with the most common country, but accurate to specific cities), combined with the time when the lineage was first discovered, we obtained data on the climatic conditions at the time each lineage was first discovered. The meteorological parameters include Temperature at 2 meters, Specific humidity at 2 meters, Precipitation corrected, Wind speed at 10 meters, and All sky surface shortwave downward irradiance. The daily data for the above parameters were collected from the NASA website (https://power.larc.nasa.gov/). Considering that the spread of the virus within a country and the reporting of data statistics take time, we collected climatological data for the three days before the earliest reporting date corresponding to each lineage and averaged them for analysis (Fig. 2).

[Fig. 2: Climatic conditions of each lineage in the most common country at the time of first detection. The climate factors involved include Temperature at 2 meters (C), Specific humidity at 2 meters (g/kg), Precipitation corrected (mm/day), Wind speed at 10 meters (m/s), and All sky surface shortwave downward irradiance (kW-hr/m2/day).]

Although the selection of samples was based on the phylogenetic cluster of each lineage and its transmission, most of the sites involved represent different meteorological conditions. As shown in Figure 2, the 38 samples involved temperatures ranging from -4 C to 32.6 C, with an average temperature of 15.3 C. The specific humidity ranged from 2.9 g/kg to 19.2 g/kg with an average of 8.3 g/kg. The variability of wind speed and all sky surface shortwave downward irradiance was relatively small across samples compared to the other parameters. The wind speed ranged from 0.7 m/s to 9.3 m/s with an average of 4.0 m/s, and the all sky surface shortwave downward irradiance ranged from 0.8 kW-hr/m2/day to 8.6 kW-hr/m2/day with an average of 4.5 kW-hr/m2/day. In contrast to the other parameters, 75% of the cities involved receive less than 2.2 mm of precipitation per day, and only 5 cities have more than 5 mm of precipitation per day. The minimum precipitation is 0 mm/day, the maximum is 12 mm/day, and the average value is 2.1 mm/day.

TABLE 1: SARS-CoV-2 lineages analyzed. The lineage assignments covered in the table were last updated on March 1, 2022. Among all Pango lineages of SARS-CoV-2, 38 lineages were analyzed. Corresponding sequencing data were found in the NCBI database based on the date of earliest detection and the most common country. The table also gives the percentage of the virus in the most common country compared to all countries where the virus is present.

    Lineage      Most common country (share)    Earliest date    Sequence accession
    A.2.3        United Kingdom 100.0%          2020-03-12       OW470304.1
    AE.2         Bahrain 100.0%                 2020-06-23       MW341474
    AH.1         Switzerland 100.0%             2021-01-05       OD999779
    AK.2         Germany 100.0%                 2020-09-19       OU077014
    B.1.1.107    United Kingdom 100.0%          2020-06-06       OA976647
    B.1.1.172    USA 100.0%                     2020-04-06       MW035925
    BA.2.24      Japan 99.0%                    2022-01-27       BS004276
    C.1          South Africa 93.0%             2020-04-16       OM739053.1
    C.7          Denmark 100.0%                 2020-05-11       OU282540
    C.17         Egypt 69.0%                    2020-04-04       MZ380247
    C.20         Switzerland 85.0%              2020-10-26       OU007060
    C.23         USA 90.0%                      2020-05-11       ON134852
    C.31         USA 87.0%                      2020-08-11       OM052492
    C.36         Egypt 34.0%                    2020-03-13       MW828621
    C.37         Peru 43.0%                     2021-02-02       OL622102
    D.2          Australia 100.0%               2020-03-19       MW320730
    D.3          Australia 100.0%               2020-06-14       MW320869
    D.4          United Kingdom 80.0%           2020-08-13       OA967683
    D.5          Sweden 65.0%                   2020-10-12       OU370897
    Q.2          Italy 99.0%                    2020-12-15       OU471040
    Q.3          USA 99.0%                      2020-07-08       ON129429
    Q.6          France 92.0%                   2021-03-02       ON300460
    Q.7          France 86.0%                   2021-01-29       ON442016
    L.2          Netherlands 73.0%              2020-03-23       LR883305
    L.4          USA 100.0%                     2020-06-29       OK546730
    N.1          USA 91.0%                      2020-03-25       MT520277
    N.3          Argentina 96.0%                2020-04-17       MW633892
    N.4          Chile 92.0%                    2020-03-25       MW365278
    N.6          Chile 98.0%                    2020-02-16       MW365092
    N.7          Uruguay 100.0%                 2020-06-18       MW298637
    N.8          Kenya 94.0%                    2020-06-23       OK510491
    N.9          Brazil 96.0%                   2020-09-25       MZ191508
    M.2          Switzerland 90.0%              2020-10-26       OU009929
    P.1.7.1      Peru 94.0%                     2021-02-07       OK594577
    P.1.13       USA 100.0%                     2021-02-24       OL522465
    P.2          Brazil 58.0%                   2020-04-13       ON148325
    P.3          Philippines 83.0%              2021-01-08       OL989074
    P.7          Brazil 71.0%                   2020-07-01       ON148327
Results

In this section, we describe the results obtained on our dataset (see the Dataset section) using our new algorithm (see the Methods section).

The size of the sliding window and the advance step of the sliding window play an important role in the analysis. We restricted our conditions to certain values. For comparison, we applied five combinations of parameters (window size and step size) to the same dataset. These include different window sizes (20 bp, 50 bp, 200 bp) and step sizes (10 bp, 50 bp, 200 bp). These combinations of window sizes and steps provide three different movement strategies (overlapping, non-overlapping, with gaps). Here we fixed the pair (window size, step size) at the values (20, 10), (20, 50), (50, 50), (200, 50) and (200, 200).

1) Robinson and Foulds baseline and bootstrap threshold: the phylogenetic trees constructed in each sliding window are compared to the climatic trees using the Robinson and Foulds topological distance (the RF distance). We defined the value of the RF distance obtained for regions without any mutations as the baseline. Although different sample sizes and sample sequence characteristics can cause differences in the baseline, regions without any mutation are often accompanied by very low bootstrap values. Using the distribution of bootstrap values and combining it with validation of the alignment visualization, we confirmed that the RF baseline value in this study was 50, and the bootstrap values corresponding to this baseline were smaller than 10.

2) Sliding window: the implementation of the sliding window technique with a bootstrap threshold provides a more accurate identification of regions with high gene mutation rates. Figure 3 shows the general pattern of the RF distance changes over alignment windows under different climatic conditions, for bootstrap values greater than 10. The trend of RF value variation under different climatic conditions does not vary much throughout this whole sliding-window scan of the sequence, which may be related to the correlation between the climatic factors (wind speed, downward irradiance, precipitation, humidity, temperature). Windows starting from or containing position 28550 bp were screened in all five scans for the different combinations of window size and step size. The window formed from position 29200 bp to position 29470 bp is screened out in all scans except for the combination of 50 bp window size with 50 bp step size. As Figure 3 shows, if there are gaps in the scan (window size: 20 bp, step size: 50 bp), some potential mutation windows are not screened compared to the other movement strategies, because the sequences in the gap part are not computed by the algorithm. In addition, when the window size is small, the capture of the window mutation signal becomes more sensitive, especially when the number of samples is small. In that case, a single base change in a single sequence can cause a change in the value of the RF distance; therefore, high quality sequencing data is required to prevent errors in the RF distance values caused by ambiguous characters (N in nucleotide). In cases where a larger window size (200 bp) is selected, the overlapping movement strategy (window size: 200 bp, step size: 50 bp) allows the signal of base mutations to be repeatedly verified and enhanced in adjacent window scans, compared to the non-overlapping strategy (window size: 200 bp, step size: 200 bp). In this situation, the range of RF distance values is relatively large, and the number of windows eventually screened is relatively greater. Due to the small number of SARS-CoV-2 lineage sequences that we analyzed in this study, we chose to scan the alignment with a larger window and an overlapping movement strategy for further analysis (window size: 200 bp, step size: 50 bp).

[Fig. 3: Heatmap of Robinson and Foulds topological distance over alignment windows. Five different combinations of parameters were applied: (a) window size = 20 bp and step size = 10 bp; (b) window size = 20 bp and step size = 50 bp; (c) window size = 50 bp and step size = 50 bp; (d) window size = 200 bp and step size = 50 bp; and (e) window size = 200 bp and step size = 200 bp. The Robinson and Foulds topological distance was used to quantify the distance between a phylogenetic tree constructed in a given sliding window and a climatic tree constructed from the corresponding climatic data (wind speed, downward irradiance, precipitation, humidity, temperature).]
the RF distance quantified the difference between a phy- 3) We can envisage a study that would consist in selecting logenetic tree constructed in specific sliding windows and only different phenotypes of a single species, for exam- a climatic tree constructed in corresponding climatic data. ple, Homo Sapiens, in different geographical locations. In Relatively low RF distance values represent relatively this case, we would have to consider a larger geographical more similarity between the phylogenetic tree and the area in order to significantly increase the variation of climatic tree. With our algorithm based on the sliding the selected climatic parameters. This type of research window technique, regions with high mutation rates can would consist in observing the evolution of the genes be identified (Fig 4). Subsequently, we compare the of the selected species according to different climatic RF values of these regions. In cases where there is a parameters. correlation between the occurrence of mutations and the 4) We intend to develop a website that can help biologists, climate factors studied, the regions with relatively low ecologists and other interested professionals to perform RF distance values (the alignment position of 15550bp calculations in their phylogeography projects faster and – 15600bp and 24650bp-24750bp) are more likely to easier. We plan to create a user-friendly interface with be correlated with climate factors than the other loci the input of the necessary initial parameters and the screened for mutations. possibility to save the results (for example, by sending them to an email). These results will include calculated In addition, we can state that we have made an effort to parameters and visualizations. make our tool as independent as possible of the input data and parameters. Our pipeline can also be applied to phylogeographic studies of other species. In cases where it is determined (or Acknowledgements assumed) that the occurrence of a mutation is associated with The authors thank SciPy conference and reviewers for their valu- certain geographic factors, our pipeline can help to highlight able comments on this paper. This work was supported by Natural mutant regions and specific mutant regions within them that are Sciences and Engineering Research Council of Canada and the more likely to be associated with that geographic parameter. Our University of Sherbrooke grant. algorithm can provide a reference for further biological studies. R EFERENCES Conclusions and future work [A+ 00] John C Avise et al. Phylogeography: the history and formation In this paper, a bioinformatics pipeline for phylogeographic of species. Harvard University Press, 2000. doi:10.1093/ analysis is designed to help researchers better understand the icb/41.1.134. distribution of viruses in specific regions using genetic and climate [CPK+ 21] Simiao Chen, Klaus Prettner, Michael Kuhn, Pascal Geldsetzer, data. We propose a new algorithm called aPhylogeo [LLKT22] Chen Wang, Till Bärnighausen, and David E Bloom. Climate and the spread of covid-19. Scientific Reports, 11(1):1–6, 2021. that allows the user to quickly and intuitively create trees from doi:10.1038/s41598-021-87692-z. genetic and climate data. Using a sliding window, the algorithm [CRA+ 22] Marco Cascella, Michael Rajnik, Abdul Aleem, Scott C Dule- finds specific regions on the viral genetic sequences that can bohn, and Raffaela Di Napoli. Features, evaluation, and treat- ment of coronavirus (covid-19). Statpearls [internet], 2022. 
be correlated to the climatic conditions of the region. To our [Edg04] Robert C Edgar. Muscle: a multiple sequence alignment method knowledge, this is the first study of its kind that incorporates with reduced time and space complexity. BMC bioinformatics, climate data into this type of study. It aims to help the scientific 5(1):1–19, 2004. doi:10.1186/1471-2105-5-113. community by facilitating research in the field of phylogeography. [Fel05] Joseph Felsenstein. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Our solution runs on Windows®, MacOS X® and GNU/Linux Sciences, University of Washington, Seattle, 2005. and the code is freely available to researchers and collaborators on [KM02] L Lacey Knowles and Wayne P Maddison. Statistical phylo- GitHub (https://github.com/tahiri-lab/aPhylogeo). geography. Molecular Ecology, 11(12):2623–2635, 2002. doi: As a future work on the project, we plan to incorporate the 10.1146/annurev.ecolsys.38.091206.095702. [LFZK06] Kun Lin, Daniel Yee-Tak Fong, Biliu Zhu, and Johan Karl- following additional features: berg. Environmental factors on the sars epidemic: air tem- perature, passage of time and multiplicative effect of hospital 1) We can handle large amounts of data, especially when infection. Epidemiology & Infection, 134(2):223–230, 2006. considering many countries and longer time periods doi:10.1017/S0950268805005054. 166 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 3: Heatmap of Robinson and Foulds topological distance over alignment windows. Five different combinations of parameters were applied (a) window size = 20bp and step size = 10bp; (b) window size = 20bp and step size = 50bp; (c) window size = 50bp and step size = 50bp; (d) window size = 200bp and step size = 50bp; and (e) window size = 200bp and step size = 200bp. Robinson and Foulds topological distance was used to quantify the distance between a phylogenetic tree constructed in certain sliding windows and a climatic tree constructed in corresponding climatic data (wind speed, downward irradiance, precipitation, humidity, temperature). grinch. Wellcome open research, 6, 2021. doi:10.12688/ wellcomeopenres.16661.2. [OS98] Matthew R Orr and Thomas B Smith. Ecology and speciation. Trends in Ecology & Evolution, 13(12):502–506, 1998. doi: 10.1016/s0169-5347(98)01511-0. [OSU+ 21] Áine O’Toole, Emily Scher, Anthony Underwood, Ben Jack- son, Verity Hill, John T McCrone, Rachel Colquhoun, Chris Ruis, Khalil Abu-Dahab, Ben Taylor, et al. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evolution, 7(2):veab064, 2021. doi: 10.1093/ve/veab064. [RF81] David F Robinson and Leslie R Foulds. Comparison of phyloge- netic trees. Mathematical biosciences, 53(1-2):131–147, 1981. doi:10.1016/0025-5564(81)90043-2. [RHO+ 20] Andrew Rambaut, Edward C Holmes, Áine O’Toole, Verity Hill, John T McCrone, Christopher Ruis, Louis du Plessis, and Oliver G Pybus. A dynamic nomenclature proposal for sars- cov-2 lineages to assist genomic epidemiology. Nature micro- biology, 5(11):1403–1407, 2020. doi:10.1038/s41564- 020-0770-5. [Sch01] Dolph Schluter. Ecology and the origin of species. Trends in ecology & evolution, 16(7):372–380, 2001. doi:10.1016/ s0169-5347(01)02198-x. Fig. 4: Robinson and Foulds topological distance normalized changes [SDdPS 20] Marcos Felipe Falcão Sobral, Gisleia Benini Duarte, Ana + over the alignment windows. 
Multiple phylogenetic analyses were Iza Gomes da Penha Sobral, Marcelo Luiz Monteiro Marinho, performed using a sliding window (window size = 200 bp and step size and André de Souza Melo. Association between climate vari- = 50 bp). Phylogenetic reconstruction was repeated considering only ables and global transmission of sars-cov-2. Science of The data within a window that moved along the alignment in steps. The Total Environment, 729:138997, 2020. doi:10.1016/j. RF normalized topological distance was used to quantify the distance scitotenv.2020.138997. between the phylogenetic tree constructed in each sliding window and [SMVS+ 22] Chidambaram Sabarathinam, Prasanna Mohan Viswanathan, Venkatramanan Senapathi, Shankar Karuppannan, Dhanu Radha the climate tree constructed in the corresponding climate data (Wind Samayamanthula, Gnanachandrasamy Gopalakrishnan, Ra- speed, Downward irradiance, Precipitation, Humidity, Temperature). manathan Alagappan, and Prosun Bhattacharya. Sars-cov-2 Only regions with high genetic mutation rates were marked in the phase i transmission and mutability linked to the interplay of figure. climatic variables: a global observation on the pandemic spread. Environmental Science and Pollution Research, pages 1–18, 2022. doi:10.1007/s11356-021-17481-8. [Sta14] Alexandros Stamatakis. Raxml version 8: a tool for phy- [LLKT22] Wanlin Li, My-Lin Luu, Aleksandr Koshkarov, and Nadia logenetic analysis and post-analysis of large phylogenies. Tahiri. aPhylogeo (version 1.0), July 2022. URL: https:// Bioinformatics, 30(9):1312–1313, 2014. doi:10.1093/ github.com/tahiri-lab/aPhylogeo, doi:doi.org/10.5281/ bioinformatics/btu033. zenodo.6773603. [Nag92] Thomas Nagylaki. Rate of evolution of a quantitative character. Proceedings of the National Academy of Sciences, 89(17):8121– 8124, 1992. doi:10.1073/pnas.89.17.8121. [OCFC20] Barbara Oliveiros, Liliana Caramelo, Nuno C Ferreira, and Francisco Caramelo. Role of temperature and humidity in the modulation of the doubling time of covid-19 cases. MedRxiv, 2020. doi:10.1101/2020.03.05.20031872. [OHP+ 21] Áine O’Toole, Verity Hill, Oliver G Pybus, Alexander Watts, Issac I Bogoch, Kamran Khan, Jane P Messina, The COVID, Genomics UK, et al. Tracking the international spread of sars-cov-2 lineages b. 1.1. 7 and b. 1.351/501y-v2 with PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 167 Global optimization software library for research and education Nadia Udler‡∗ F Abstract—Machine learning models are often represented by functions given the distance to optimal point. In this paper the basic SA algorithm by computer programs. Optimization of such functions is a challenging task is used as a starting point. We can offer more basic module as a because traditional derivative based optimization methods with guaranteed starting point ( and by specifying distribution as ’exponential’ get convergence properties cannot be used.. This software allows to create new the variant of SA) thus achieving more flexible design opportuni- optimization methods with desired properties, based on basic modules. These ties for custom optimization algorithm. Note that convergence of basic modules are designed in accordance with approach for constructing global optimization methods based on potential theory [KAP]. 
These methods do not the newly created hybrid algorithm does not need to be verified use derivatives of objective function and as a result work with nondifferentiable when using minpy basic modules, whereas previously mentioned functions (or functions given by computer programs, or black box functions), but SA-based hybrid has to be verified separately ( see [GLUQ]) have guaranteed convergence. The software helps to understand principles of Testing functions are included in the library. They represent learning algorithms. This software may be used by researchers to design their broad range of use cases covering above mentioned difficult own variations or hybrids of known heuristic optimization methods. It may be functions. In this paper we describe the approach underlying these used by students to understand how known heuristic optimization methods work optimization methods. The distinctive feature of these methods and how certain parameters affect the behavior of the method. is that they are not heuristic in nature. The algorithms are de- Index Terms—global optimization, black-box functions, algorithmically defined rived based on potential theory [KAP], and their convergence is functions, potential functions guaranteed by their derivation method [KPP]. Recently potential theory was applied to prove convergence of well known heuristic methods, for example see [BIS] for convergence of PSO, and to Introduction re prove convergence of well known gradient based methods, in Optimization lies at the heart of machine learning and data particular, first order methods - see [NBAG] for convergence of science. One of the most relevant problems in machine learning is gradient descent and [ZALO] for mirror descent. For potential automatic selection of the algorithm depending on the objective. functions approach for stochastic first order optimization methods This is necessary in many applications such as robotics, simulating see [ATFB]. biological or chemical processes, trading strategies optimization, to name a few [KHNT]. We developed a library of optimization methods as a first step for self-adapting algorithms. Optimization Outline of the approach methods in this library work with all objectives including very The approach works for non-smooth or algorithmically defined onerous ones, such as black box functions and functions given by functions. For detailed description of the approach see [KAP], computer code, and the convergences of methods is guaranteed. [KP]. In this approach the original optimization problem is re- This library allows to create customized derivative free learning placed with a randomized problem, allowing the use of Monte- algorithms with desired properties by combining building blocks Carlo methods for calculating integrals. This is especially impor- from this library or other Python libraries. tant if the objective function is given by its values (no analytical The library is intended primarily for educational purposes formula) and derivatives are not known. The original problem and its focus is on transparency of the methods rather than on is restated in the framework of gradient (sub gradient) methods, efficiency of implementation. employing the standard theory (convergence theorems for gradient The library can be used by researches to design optimization (sub gradient) methods), whereas no derivatives of the objective methods with desired properties by varying parameters of the function are needed. At the same time, the method obtained is general algorithm. 
a method of nonlocal search unlike other gradient methods. It As an example, consider variant of simulated annealing (SA) will be shown, that instead of measuring the gradient of the proposed in [FGSB] where different values of parameters ( Boltz- objective function we can measure the gradient of the potential man distribution parameters, step size, etc.) are used depending of function at each iteration step , and the value of the gradient can be obtained using values of objective function only, in the * Corresponding author: nadiakap@optonline.net ‡ University of Connecticut (Stamford) framework of Monte Carlo methods for calculating integrals. Furthermore, this value does not have to be precise, because Copyright © 2022 Nadia Udler. This is an open-access article distributed it is recalculated at each iteration step. It will also be shown under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the that well-known zero-order optimization methods ( methods that original author and source are credited. do not use derivatives of objective function but its values only) 168 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) are generalized into their adaptive extensions. The generalization where ( , ) defines dot product. of zero-order methods (that are heuristic in nature) is obtained Assuming differentiability of the integrals (for example, by using standardized methodology, namely, gradient (sub gradient) selecting the appropriate pxε (x, y) and using 3, 4 we get framework. We consider the unconstrained optimization problem Z Z d f (x1 , x2 , ..xn ) → min (1) δY F(X 0 ) = [ f (x)pxε (x − εy, y)dxdy]ε=0 = x∈Rn dε Rn Rn By randomizing we get dR R d R R = [ dε Rn f (x) Rn pxε (x−εy, y)dxdy]ε=0 = [ Rn f (x)( dε Rn pxε (x− F(X) = E[ f (X)] → min (2) εy, y)dy)dx]ε=0 = x∈Rn R R d = R Rn f (x)( Rn [ dε pxε (x − εy, y)]ε=0 dy)dx = where X is a random vector from Rn , {X} is a set of such random R − Rn f (x)( Rn [divx (pxε (x, y)y)]dy)dx = vectors, and E[·] is the expectation operator. Problem 2 is equivalent to problem 1 in the sense that any Z Z realization of the random vector X ∗ , where X ∗ is a solution to 2, − f (x)divx [ (pxε (x, y)y)dy]dx Rn Rn that has a nonzero probability, will be a solution to problem 1 (see [KAP] for proof). p ε (x,y) Using formula for conditional distribution pY /X 0 =x (y) = px εy (x)) , Note that 2 is the stochastic optimization problem of the R x functional F(X) . where pxε (x) = Rn pxε y (x, u)du R R To study the gradient nature of the solution algorithms for we get δY F(X 0 ) = − Rn f (x)divx [pxε (x) Rn pY /X 0 =x (y)ydy]dx R problem 2, a variation of objective functional F(X) will be consid- Denote y(x) = Rn ypY /X 0 =x (y)dy = E[Y /X 0 = x] ered. Taking into account normalization condition for density we The suggested approach makes it possible to obtain opti- arrive at the following expression for directional derivative: mization methods in systematic way, similar to the methodology Z adopted in smooth optimization. Derivation includes random- δY F(X 0 ) = − ( f (x) −C)divx [px0 (x)y(x)]dx ization of the original optimization problem, finding directional Rn derivative for the randomized problem and choosing moving direction Y based on the condition that directional derivative in where C is arbitrary chosen constant the direction of Y is being less or equal to 0. 
Considering solution to δY F(X 0 ) → minY allows to obtain Because of randomization, the expression for directional gradient-like algorithms for optimization that use only objective derivative doesn’t contain the differential characteristics of the function values ( do not use derivatives of objective function) original function. We obtain the condition for selecting the di- rection of search Y in terms of its characteristics - conditional expectation. Conditional expectation is a vector function (or Potential function as a solution to Poisson’s equation vector field) and can be decomposed (following the theorem of Decomposing vector field px0 (x)y(x) into potential field ∇ϕ0 (x) decomposition of the vector field) into the sum of the gradient and divergence-free component W0 (x): of scalar function P and a function with zero divergence. P is called a potential function. As a result the original problem is px0 (x)y(x) = ∇φ0 (x) +W0 (x) reduced to optimization of the potential function, furthermore, the potential function is specific for each iteration step. Next, we arrive at partial differential equation that connects P and the original we arrive at Poisson’s equation for potential function: function. To define computational algorithms it is necessary to specify the dynamics of the random vectors. For example, the ∆ϕ0 (x) = −L[ f (x) −C]pu (x) dynamics can be expressed in a form of densities. For certain class of distributions, for example normal distribution, the dynamics can where L is a constant be written in terms of expectation and covariance matrix. It is also Solution to Poisson’s equation approaching 0 at infinity may possible to express the dynamics in mixed characteristics. be written in the following form Z Expression for directional derivative ϕ0 (x) = E(x, ξ )[ f (ξ ) −C]pu (ξ )dξ Rn Derivative of objective functional F(X) in the direction of the random vector Y at the point X 0 (Gateaux derivative) is: where E(x, ξ ) is a fundamental solution to Laplace’s equation. d δY F(X 0 ) = dε d F(X 0 + εY )ε=0 = dε F(X ε )dxε=0 = Then for potential component ∆ϕ0 (x) we have d R dε f (X)pxε (x)ε=0 where density function of the random vector X ε = X 0 + εY ∆ϕ0 (x) = −LE[∆x E(x, u)( f (x) −C)] may be expressed in terms of joint density function pX 0 ,Y (x, y) of X 0 and Y as follows: To conclude, the representation for gradient-like direction is Z obtained. This direction maximizes directional derivative of the pxε (x) = pxε (x − εy, y)dy (3) Rn objective functional F(X). Therefore, this representation can be The following relation (property of divergence) will be needed used for computing the gradient of the objective function f(x) later using only its values. Gradient direction of the objective function d f(x) is determined by the gradient of the potential function ϕ0 (x), pxε (x − εy, y) = (−∇x pxε (x, y), y) = −divx (pxε (x, y)y) (4) which, in turn, is determined by Poisson’s equation. dε GLOBAL OPTIMIZATION SOFTWARE LIBRARY FOR RESEARCH AND EDUCATION 169 Practical considerations The code is organized in such a way that it allows to pair the The dynamics of the expectation of objective function may be algorithm with objective function. The new algorithm may be im- written in the space of random vectors as follows: plmented as method of class Minimize. Newly created algorithm can be paired with test objectivve function supplied with a library XN+1 = XN + αN+1YN+1 or with externally supplied objective function (implemented in separate python module). 
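As an illustration of this pairing of an algorithm with an externally supplied objective, the organization could look roughly as follows. This is a design sketch only: the class and method names are hypothetical and it is not the actual minpy interface.

```python
# Illustrative sketch (not the minpy API): algorithms are methods of a
# Minimize-like class, and the objective is supplied externally, e.g. from a
# separate module of test functions.
import numpy as np

def sphere(x):                      # stand-in for a library or user-supplied objective
    return float(np.sum(np.asarray(x) ** 2))

class Minimize:
    def __init__(self, objective, x0, step=0.5, max_iter=500, seed=0):
        self.f = objective
        self.x = np.asarray(x0, dtype=float)
        self.step, self.max_iter = step, max_iter
        self.rng = np.random.default_rng(seed)

    def random_search(self):
        """A basic zero-order module: accept random moves that decrease f."""
        fx = self.f(self.x)
        for _ in range(self.max_iter):
            candidate = self.x + self.step * self.rng.normal(size=self.x.shape)
            fc = self.f(candidate)
            if fc < fx:
                self.x, fx = candidate, fc
        return self.x, fx

print(Minimize(sphere, x0=[3.0, -2.0]).random_search())
```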
New algorithms can be made more or where N - iteration number, Y N+1 - random vector that defines less universal, that is, may have different number of parameters direction of move at ( N+1)th iteration, αN+1 -step size on (N+1)th that user can specify. For example, it is possible to create Nelder iteration. Y N+1 must be feasible at each iteration, i.e. the objective and Mead algorithm (NM) using basic modules, and this would functional should decrease: F(X N+1 ) < (X N ). Applying expection be an example of the most specific algorithm. It is also possible to (12) and presenting E[YN+1 asconditional expectation Ex E[Y /X] to create Stochastic Extention of NM (more generic than classic we get: NM, similar to Simplicial Homology Global Optimisation [ESF] XN+1 = E[XN ] + αN+1 EX N E[Y N+1 /X N ] method) and with certain settings of adjustable parameters it may work identical to classic NM. Library repository may be found Replacing mathematical expectations E[XN ] and YN+1 ] with their N+1 here: https://github.com/nadiakap/MinPy_edu estimates E and y(X N ) we get: The following algorithms demonstrate steps similar to steps of E N+1 N = E + αN+1 E X N [y(X N )] Nelder and Mead algorithm (NM) but select only those points with objective function values smaller or equal to mean level of objec- Note that expression for y(X N ) was obtained in the previos section tive funtion. Such an improvement to NM assures its convergence up to certain parameters. By setting parameters to certain values [KPP]. Unlike NM, they are derived from the generic approach. we can obtain stochastic extensions of well known heuristics such First variant (NM-stochastic) resembles NM but corrects some as Nelder and Mead algorithm or Covariance Matrix Adaptation of its drawbacks, and second variant (NM-nonlocal) has some Evolution Strategy. In minpy library we use several common build- similarity to random search as well as to NM and helps to resolve ing blocks to create different algorithms. Customized algorithms some other issues of classical NM algorithm. may be defined by combining these common blocks and varying Steps of NM-stochastic: their parameters. 1) Initialize the search by generating K ≥ n separate real- Main building blocks include computing center of mass of the izations of ui0 , i=1,..K of the random vector U0 , and set sample points and finding newtonian potential. m0 = K1 ∑Ki=0 ui0 2) On step j = 1, 2, ... Key takeaways, example algorithm, and code organization 1 a.Compute the mean level c j−1 = K ∑Ki=1 f (uij−1 ) Many industry professionals and researchers utilize mathematical b.Calculate new set of vertices: optimization packages to search for better solutions of their m j−1 − uij−1 problems. Examples of such problem include minimization of uij = m j−1 + ε j−1 ( f (uij−1 ) − c j−1 ) free energy in physical system [FW], robot gait optimization ||m j−1 − uij−1 ||n from robotics [PHS], designing materials for 3D printing [ZM], [TMAACBA], wine production [CTC], [CWC], optimizing chem- c.Set m j = K1 ∑Ki=0 uij ical reactions [VNJT]. These problems may involve "black box d.Adjust the step size ε j−1 so that f (m j ) < f (m j−1 ). If optimization", where the structure of the objective function is approximate ε j−1 cannot be obtained within the specified number unknown and is revealed through a small sequence of expen- of trails, then set mk = m j−1 sive trials. Software implementations for these methods become e.Use sample standard deviation as termination criterion: more user friendly. 
As a rule, however, certain modeling skills 1 K are needed to formulate real world problem in a way suitable Dj = ( ∑ ( f (uij ) − c j )2 )1/2 K − 1 i=1 for applying software package. Moreover, selecting optimization method appropriate for the model is a challenging task. Our Note that classic simplex search methods do not use values of educational software helps users of such optimization packages objective function to calculate reflection/expantion/contraction co- and may be considered as a companion to them. The focus efficients. Those coefficients are the same for all vertices, whereas of our software is on transparency of the methods rather than in NM-stochastic the distance each vertex will travel depends on efficiency. A principal benefit of our software is the unified on the difference between objective function value and average approach for constructing algorithms whereby any other algorithm value across all vertices ( f (uij ) − c j ). NM-stochastic shares the is obtained from the generalized algorithm by changing certain following drawbacks with classic simplex methods: a. simlex may parameters. Well known heuristic algorithms such as Nelder and collapse into a nearly degenerate figure, and usually proposed Mead (NM) algorithm may be obtained using this generalized remedy is to restart the simlex every once in a while, b. only initial approach, as well as new algorithms. Although some derivative- vertices are randomly generated, and the path of all subsequent free optimization packages (matlab global optimization toolbox, vertices is deterministic. Next variant of the algorithm (NM- Tensorflow Probability optimizers, Excel Evolutionary Solver, nonlocal) maintains the randomness of vertices on each step, while scikit-learn Stochastic Gradient Descent class, scipy.optimize.shgo adjusting the distribution of U0 to mimic the pattern of the modi- method) put a lot of effort in transparency and educational value, fied vertices. The corrected algorithm has much higher exploration they don’t have the same level of flexibility and generality as our power than the first algorithm (similar to the exploration power of system. An example of educational-only optimization software is random search algorithms), and has exploitation power of direct - [SAS]. It is limited to teach Particle Swarm Optimization. search algorithms. 170 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Steps of NM - nonlocal [VNJT] Fath, Verena, Kockmann, Norbert, Otto, Jürgen, Röder, Thorsten, Self-optimising processes and real-time-optimisation 1) Choose a starting point x0 and set m0 = x0 . of organic syntheses in a microreactor system using Nelder–Mead and design of experiments, React. Chem. Eng., 2. On step j = 1, 2, ... Obtain K separate realizations of uii , 2020,5, 1281-1299, https://doi.org/10.1039/D0RE00081G i=1,..K of the random vector U j [ZM] Plüss, T.; Zimmer, F.; Hehn, T.; Murk, A. Characterisation and Comparison of Material Parameters of 3D-Printable Absorbing a.Compute f (uij−1 ), j = 1, 2, ..K, and the sample mean level Materials. Materials 2022, 15, 1503. https://doi.org/10.3390/ ma15041503 1 K [TMAACBA] Thoufeili Taufek, Yupiter H.P. Manurung, Mohd Shahriman c j−1 = ∑ f (uij−1 ) K i=1 Adenan, Syidatul Akma, Hui Leng Choo, Borhen Louhichi, Martin Bednardz, and Izhar Aziz.3D Printing and Additive Manufacturing, 2022, http://doi.org/10.1089/3dp.2021.0197 b.Generate the new estimate of the mean: [CTC] Vismara, P., Coletta, R. & Trombettoni, G. 
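The vertex update, mean level, and termination criterion of NM-stochastic listed above can be prototyped in a few lines of NumPy. The sketch below is a simplified illustration (a fixed number of step-halving trials and no restart logic), not the library implementation.

```python
# Simplified sketch of NM-stochastic: each vertex moves relative to the centre
# of mass by an amount proportional to how its objective value differs from the
# mean level; the step size is halved until the new centre of mass improves.
import numpy as np

def nm_stochastic(f, n, K=20, eps=1.0, max_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(K, n))                            # step 1: K random vertices
    for _ in range(max_iter):
        fvals = np.array([f(ui) for ui in u])
        if np.std(fvals, ddof=1) < tol:                    # step 2e: termination criterion D_j
            break
        c = fvals.mean()                                   # step 2a: mean level c_{j-1}
        m = u.mean(axis=0)                                 # centre of mass m_{j-1}
        diff = m - u
        norms = np.linalg.norm(diff, axis=1) ** n + 1e-12  # guard against division by zero
        step = eps
        for _ in range(30):                                # step 2d: shrink eps until f(m_j) < f(m_{j-1})
            u_new = m + step * (fvals - c)[:, None] * diff / norms[:, None]   # step 2b
            if f(u_new.mean(axis=0)) < f(m):
                u = u_new
                break
            step *= 0.5
    return u.mean(axis=0)

# Example usage on a shifted sphere function in five dimensions.
print(nm_stochastic(lambda x: np.sum((x - 1.0) ** 2), n=5))
```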
Constrained global m j−1 − uij optimization for wine blending. Constraints 21, 597–615 1 K m j = m j−1 + ε j ∑ K i=1 [( f (uij ) − c j ) ||m j−1 − uij ||n ] [CWC] (2016), https://doi.org/10.1007/s10601-015-9235-5 Terry Hui-Ye Chiu, Chienwen Wu, Chun-Hao Chen, A Gen- eralized Wine Quality Prediction Framework by Evolutionary Adjust the step size ε j−1 so that f (m j ) < f (m j−1 ). If approximate Algorithms, International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 6, Nº7,2021, https://doi.org/10. ε j−1 cannot be obtained within the specified number of trails, then 9781/ijimai.2021.04.006 set mk = m j−1 [KHNT] Pascal Kerschke, Holger H. Hoos, Frank Neumann, Heike c.Use sample standard deviation as termination criterion Trautmann; Automated Algorithm Selection: Survey and Per- spectives. Evol Comput 2019; 27 (1): 3–45, https://doi.org/10. 1 K 1162/evco_a_00242 Dj = ( ∑ ( f (uij ) − c j )2 )1/2 K − 1 i=1 [SAS] Leandro dos Santos Coelho, Cezar Augusto Sierakowski, A software tool for teaching of particle swarm optimization fundamentals, Advances in Engineering Software, Volume 39, Issue 11, 2008, Pages 877-887, ISSN 0965-9978, https://doi. R EFERENCES org/10.1016/j.advengsoft.2008.01.005. [ESF] Endres, S.C., Sandrock, C. & Focke, W.W. A simplicial ho- [KAP] Kaplinskii, A.I.,Pesin, A.M.,Propoi, A.I.(1994). Analysis of mology algorithm for Lipschitz optimisation. J Glob Optim 72, search methods of optimization based on potential theory. I: 181–217 (2018), https://doi.org/10.1007/s10898-018-0645-y Nonlocal properties. Automation and Remote Control. Volume 55, N.9, Part 2, September, pp.1316-1323 (rus. pp.97-105), 1994 [KP] Kaplinskii, A.I. and Propoi, A.I., Nonlocal Optimization Meth- ods ofthe First Order Based on Potential Theory, Automation and Remote Control. Volume 55, N.7, Part 2, July, pp.1004- 1011 (rus. pp.97-102), 1994 [KPP] Kaplinskii, A.I., Pesin, A.M.,Propoi, A.I. Analysis of search methods of optimization based on potential theory. III: Conver- gence of methods. Automation and remote Control, Volume 55, N.11, Part 1, November, pp.1604-1610 (rus. pp.66-72 ), 1994. [NBAG] Nikhil Bansal, Anupam Gupta, Potential-function proofs for gradient methods, Theory of Computing, Volume 15, (2019) Article 4 pp. 1-32, https://doi.org/10.4086/toc.2019.v015a004 [ATFB] Adrien Taylor, Francis Bach, Stochastic first-order meth- ods: non-asymptotic and computer-aided analyses via potential functions, arXiv:1902.00947 [math.OC], 2019, https://doi.org/10.48550/arXiv.1902.00947 [ZALO] Zeyuan Allen-Zhu and Lorenzo Orecchia, Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent, Inno- vations in Theoretical Computer Science Conference (ITCS), 2017, pp. 3:1-3:22, https://doi.org/10.4230/LIPIcs.ITCS.2017.3 [BIS] Berthold Immanuel Schmitt, Convergence Analysis for Particle Swarm Optimization, FAU University Press, 2015 [FGSB] FJuan Frausto-Solis, Ernesto Liñán-García, Juan Paulo Sánchez-Hernández, J. Javier González-Barbosa, Carlos González-Flores, Guadalupe Castilla-Valdez, Multiphase Sim- ulated Annealing Based on Boltzmann and Bose-Einstein Distribution Applied to Protein Folding Problem, Advances in Bioinformatics, Volume 2016, Article ID 7357123, https: //doi.org/10.1155/2016/7357123 [GLUQ] Gong G., Liu, Y., Qian M, Simulated annealing with a potential function with discontinuous gradient on Rd , Ici. China Ser. A- Math. 44, 571-578, 2001, https://doi.org/10.1007/BF02876705 [PHS] Valdez, S.I., Hernandez, E., Keshtkar, S. (2020). 
A Hybrid EDA/Nelder-Mead for Concurrent Robot Optimization. In: Madureira, A., Abraham, A., Gandhi, N., Varela, M. (eds) Hybrid Intelligent Systems. HIS 2018. Advances in Intel- ligent Systems and Computing, vol 923. Springer, Cham. https://doi.org/10.1007/978-3-030-14347-3_20 [FW] Fan, Yi & Wang, Pengjun & Heidari, Ali Asghar & Chen, Huiling & HamzaTurabieh, & Mafarja, Majdi, 2022. "Random reselection particle swarm optimization for optimal design of solar photovoltaic modules," Energy, Elsevier, vol. 239(PA), https://doi.org/10.1016/j.energy.2021.121865 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 171 Temporal Word Embeddings Analysis for Disease Prevention Nathan Jacobi‡∗ , Ivan Mo‡§ , Albert You‡ , Krishi Kishore‡ , Zane Page‡ , Shannon P. Quinn‡¶ , Tim Heckmank F Abstract—Human languages’ semantics and structure constantly change over then be studied to track contextual drift over time. However, a time through mediums such as culturally significant events. By viewing the common issue in these so-called “temporal word embeddings” semantic changes of words during notable events, contexts of existing and is that they are often unaligned — i.e. the embeddings do not novel words can be predicted for similar, current events. By studying the initial lie within the same embedding space. Past proposed solutions outbreak of a disease and the associated semantic shifts of select words, we to aligning temporal word embeddings require multiple separate hope to be able to spot social media trends to prevent future outbreaks faster than traditional methods. To explore this idea, we generate a temporal word alignment problems to be solved, or for “anchor words” – words embedding model that allows us to study word semantics evolving over time. that have no contextual shifts between times – to be used for Using these temporal word embeddings, we use machine learning models to mapping one time period to the next [HLJ16]. Yao et al. propose a predict words associated with the disease outbreak. solution to this alignment issue, shown to produce accurate and aligned temporal word embeddings, through solving one joint Index Terms—Natural Language Processing, Word Embeddings, Bioinformat- alignment problem across all time slices, which we utilize here ics, Social Media, Disease Prediction [YSD+ 18]. Introduction & Background Methodology Human languages experience continual changes to their semantic Data Collection & Pre-Processing structures. Natural language processing techniques allow us to Our data set is a corpus D of over 7 million tweets collected examine these semantic alterations through methods such as word from Scott County, Indiana from the dates January 1st, 2014 until embeddings. Word embeddings provide low dimension numerical January 17th, 2017. The data was lent to us from Twitter after representations of words, mapping lexical meanings into a vector a data request, and has not yet been made publicly available. space. Words that lie close together in this vector space represent During this time period, an HIV outbreak was taking place in close semantic similarities [MCCD13]. This numerical vector Scott County, with an eventual 215 confirmed cases being linked space allows for quantitative analysis of semantics and contextual to the outbreak [PPH+ 16]. Gonsalves et al. predicts an additional meanings, allowing for more use in machine learning models that 126 undiagnosed HIV cases were linked to this same outbreak utilize human language. 
We hypothesize that disease outbreaks can be predicted faster [GC18]. The state’s response led to questioning if the outbreak than traditional methods by studying word embeddings and their could have been stemmed or further prevented with an earlier semantic shifts during past outbreaks. By surveying the context response [Gol17]. Our corpus was selected with a focus on tweets of select medical terms and other words associated with a disease related to the outbreak. By closely studying the semantic shifts during the initial outbreak, we create a generalized model that can during this outbreak, we hope to accurately predict similar future be used to catch future similar outbreaks quickly. By leveraging outbreaks before they reach large case numbers, allowing for a social media activity, we predict similar semantic trends can be critical earlier response. found in real time. Additionally, this allows novel terms to be To study semantic shifts through time, the corpus was split evaluated in context without requiring a priori knowledge of them, into 18 temporal buckets, each spanning a 2 month period. All data allowing potential outbreaks to be detected early in their lifespans, utilized in scripts was handled via the pandas Python package. The thus minimizing the resultant damage to public health. corpus within each bucket is represented by Dt , with t representing Given a corpus spanning a fixed time period, multiple word the temporal slice. Within each 2 month period, tweets were split embeddings can be created at set temporal intervals, which can into 12 pre-processed output csv files. Pre-processing steps first removed retweets, links, images, emojis, and punctuation. Com- * Corresponding author: Nathan.Jacobi@uga.edu mon stop words were removed from the tweets using the NLTK ‡ Computer Science Department, University of Georgia Python package, and each tweet was tokenized. A vocabulary § Linguistics Department, University of Georgia ¶ Cellular Biology Department, University of Georgia dictionary was then generated for each of the 18 temporal buckets, || Public Health Department, University of Georgia containing each unique word and a count of its occurrences within its respective bucket. The vocabulary dictionaries for each Copyright © 2022 Nathan Jacobi et al. This is an open-access article dis- bucket were then combined into a global vocabulary dictionary, tributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- containing the total counts for each unique word across all 18 vided the original author and source are credited. buckets. Our experiments utilized two vocabulary dictionaries: the 172 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) first being the 10,000 most frequently occurring words from the B = Y (t)U(t) + γU(t) + τ(U(t − 1) +U(t + 1)) global vocabulary for ensuring proper generation of embedding vectors, the second being a combined vocabulary of 15,000 terms, To decompose PPMI(t) in our model, SciPy’s linear algebra including our target HIV/AIDS related terms. This combined package was utilized to solve for eigendecomposition of each vocabulary consisted of the top 10,000 words across D as well PPMI(t), and the top 100 terms were kept to generate an em- as an additional 473 HIV/AIDS related terms that occurred at bedding of d = 100. The alignment was then applied, yielding least 8 times within the corpus. 
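A sketch of the pre-processing and per-bucket vocabulary construction described above, assuming the tweets for one temporal bucket arrive as a pandas DataFrame with a "text" column; the column name and regular expressions are illustrative assumptions rather than the authors' exact code.

```python
# Sketch of tweet cleaning and per-bucket vocabulary counts.
# Requires nltk.download("punkt") and nltk.download("stopwords").
import re
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))

def clean_tweet(text):
    text = re.sub(r"http\S+", " ", text)            # remove links
    text = re.sub(r"[^a-z#\s]", " ", text.lower())  # drop punctuation/emojis, keep hashtags
    return [t for t in word_tokenize(text) if t not in STOP]

def bucket_vocabulary(df):
    """Token counts for one 2-month temporal bucket of tweets."""
    df = df[~df["text"].str.startswith("RT")]       # drop retweets
    counts = Counter()
    for text in df["text"]:
        counts.update(clean_tweet(text))
    return counts

# Global vocabulary = sum of the 18 per-bucket vocabularies, e.g.:
# global_vocab = sum((bucket_vocabulary(d) for d in bucket_frames), Counter())
```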
The 10,000th most frequent term 18 temporally aligned word embedding sets of our vocabulary, in D occurred 39 times, so to ensure results were not influenced with dimensions |V | × d, or 15,000 x 100. These word embedding by sparsity in the less frequent HIV/AIDS terms, 4,527 randomly sets are aligned spatially and in terms of rotations, however there selected terms with occurrences between 10 and 25 times were appears to be some spatial drift that we hope to remove by tuning added to the vocabulary, bringing it to a total of 15,000 terms. hyperparameters. Following alignment, these vectors are usable The HIV/AIDS related terms came from a list of 1,031 terms we for experimentation and analysis. compiled, primarily coming from the U.S. Department of Veteran Predictions for Detecting Modern Shifts Affairs published list of HIV/AIDS related terms, and other terms we thought were pertinent to include, such as HIV medications Following the generation of temporally aligned word embedding, and terms relating to sexual health [Aff05]. they can be used for semantic shift analysis. Using the word embedding vectors generated for each temporal bucket, 2 new Temporally Aligned Vector Generation data sets were created to use for determining patterns in the Generating word2vec embeddings is typically done through 2 semantic shifts surrounding HIV outbreaks. Both of these data primary methods: continuous bag-of-words (CBOW) and skip- sets were constructed using our second vocabulary of 15,000 gram, however many other various models exist [MCCD13]. Our terms, including the 473 HIV/AIDS related terms, and each term’s methods use a CBOW approach at generating embeddings, which embedding of d = 100 that were generated by the dynamic generates a word’s vector embedding based on the context the embedding model. The first experimental data set was the shift word appears in, i.e. the words in a window range surrounding in the d = 100 embedding vector between each time bucket and the target word. Following pre-processing of our corpus, steps the one that immediately followed it. These shifts were calculated for generating word embeddings were applied to each temporal by simply subtracting the next temporal and initial vectors from bucket. For each time bucket, co-occurrence matrices were first each other. In addition to the change in the 100 dimensional vector created, with a window size w = 5. These matrices contained between each time bucket and its next, the initial and next 10 the total occurrences of each word against every other within a dimensional embeddings were included from each, which were window range L of 5 words within the corpus at time t. Each generated using the same dynamic embedding model. This yielded co-occurrence matrix was of dimensions |V | × |V |. Following the each word having 17 observations and 121 features: {d_vec0 . . . generation of each of these co-occurrence matrices, a |V | × |V | d_vec99, v_init_0 . . . v_init_9, v_fin_0 . . . v_fin_9, label}. This dimensioned Positive Pointwise Mutual Information matrix was data set will be referred to as "data_121". The reasoning to include calculated. The value in each cell was calculated as follows: these lower dimensional embeddings was so that both the shift and initial and next positions in the embedding space would be PPMI(t, L)w,c = max{PMI(Dt , L)w,c , 0}, used in our machine learning algorithms. The other experimental where w and c are two words in V. 
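For a single temporal bucket, the co-occurrence and PPMI matrices defined above can be computed as in the following dense-NumPy sketch; a vocabulary of |V| = 15,000 would in practice call for scipy.sparse matrices instead of a dense array.

```python
# Sketch of the per-bucket co-occurrence matrix (window L = 5) and its PPMI.
import numpy as np

def cooccurrence(tweets, word2id, L=5):
    V = len(word2id)
    C = np.zeros((V, V))
    for tokens in tweets:                        # each tweet is a list of tokens
        ids = [word2id[t] for t in tokens if t in word2id]
        for i, w in enumerate(ids):
            for c in ids[max(0, i - L): i + L + 1]:
                if c != w:
                    C[w, c] += 1
    return C

def ppmi(C):
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total    # row marginals
    pc = C.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    return np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)   # clip PMI at zero
```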
Embeddings generated by data set was constructed similarly, but rather than subtracting the word2vec can be approximated by PMI matrices, where given two vectors and including lower dimensions vectors, the initial embedding vectors utilize the following equation [YSD+ 18]: and next 100 dimensional vectors were listed as features. This allowed machine learning algorithms to have access to the full uTw uc ≈ PMI(D, L)w,c positional information of each vector alongside the shift between Each embedding u has a reduced dimensionality d, typically the two. This yielded each word having 17 observations and 201 around 25 - 200. Each PPMI from our data set is created inde- features: {vec_init0 . . . vec_init99, vec_fin0 . . . vec_fin99, label}. pendently from each other temporal bucket. After these PPMI This data set will be referred to as "data_201". With the 15,000 matrices are made, temporal word embeddings can be created terms each having 17 observations, it led to a total of 255,000 using the method proposed by Yao et al. [YSD+ 18]. The proposed observations. It should be noted that in addition to the vector solution focuses on the equation: information, the data sets also listed the number of days since the outbreak began, the predicted number of cases at that point U(t)U(t)T ≈ PPMI(t, L) in time, from [GC18], and the total magnitude of the shift in the where U is a set of embeddings from time period t. Decomposing vector between the corresponding time buckets. All these features each PPMI(t) will yield embedding U(t), however each U(t) is not were dropped prior to use within the models, as the magnitude guaranteed to be in the same embedding space. Yao et al. derives feature was colinear with the other positional features, and the case U(t)A = B with the following equation234 [YSD+ 18]: and day data will not be available in predicting modern outbreaks. A = U(t)T U(t) + (γ + λ + 2τ)I, Using these data, two machine learning algorithms were applied: unsupervised k-means clustering and a supervised neural network. 1. All code used can be found here https://github.com/quinngroup/Twitter- Embedding-Analysis/ K-means Clustering 2. γ represents the forcing regularizer. λ represents the Frobenius norm regularizer. τ represents the smoothing regularizer. To examine any similarities within shifts, k-means clustering was 3. Y(t) represents PPMI(t). performed on the data sets at first. Initial attempts at k-means with 4. The original equation uses W(t), but this acts as identical to U(t) in the the 100 dimensional embeddings yielded extremely large inertial code. We replaced it here to improve readability. values and poor results. In an attempt to reduce inertia, features TEMPORAL WORD EMBEDDINGS ANALYSIS FOR DISEASE PREVENTION 173 for data that k-means would be performed onto were assessed. 100, 150, and 200. Additionally, several certainty thresholds for a K-means was performed on a reduced dimensionality data set, positive classification were tested on each of the models. The best with embedding vectors of dimensionality d = 10, however this results from each will be listed in the results section. As we begin led to strict convergence and poor results again. The data set implementation of these models on other HIV outbreak related with the change in an embeddings vector, data_121, continued data sets, the proper certainty thresholds can be better determined. to contain the changes of vectors between each time bucket and its next. 
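The per-bucket factorization step, reducing each PPMI(t) to a d = 100 embedding from its largest eigenpairs with SciPy, might look as follows; the joint temporal alignment described above is then applied across all 18 buckets and is not shown here.

```python
# Sketch: factorize one PPMI(t) into a |V| x d embedding U with U U^T ~ PPMI(t),
# keeping the top d = 100 eigenpairs (unaligned; alignment is applied afterward).
import numpy as np
from scipy.linalg import eigh

def ppmi_to_embedding(ppmi_matrix, d=100):
    n = ppmi_matrix.shape[0]
    vals, vecs = eigh(ppmi_matrix, subset_by_index=[n - d, n - 1])  # largest d eigenpairs
    vals = np.clip(vals, 0.0, None)               # guard against small negative eigenvalues
    return vecs * np.sqrt(vals)                   # |V| x d embedding for one temporal bucket
```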
However, rather than the 10 dimensional position vectors Results for both time buckets, 2 dimensional positions were used instead, generated by UMAP from the 10 dimensioned vectors. The second Analysis of Embeddings data set, data_201, always led to strict convergence on clustering, To ensure accuracy in word embeddings generated in this model, even when reduced to just the 10 dimensional representations. we utilized word2vec (w2v), a proven neural network method of Therefore, k-means was performed explicitly on the data_121 embeddings [MCCD13]. For each temporal bucket, a static w2v set, with the 2 dimensional representations alongside the 100 embedding of d = 100 was generated to compare to the temporal dimensional change in the vectors. Separate two dimensional embedding generated from the same bucket. These vectors were UMAP representations were generated for use as a feature and generated from the same corpus as the ones generated by the for visual examination. The data set also did not have the term’s dynamic model. As the vectors do not lie within the same label listed as a feature for clustering. embedding space, the vectors cannot be directly compared. As Inertia at convergence on clustering for k-means was reduced the temporal embeddings generated by the alignment model are significantly, as much as 86% after features were reassessed, yield- influenced by other temporal buckets, we hypothesize notably ing significantly better results. Following the clustering, the results different vectors. Methods for testing quality in [YSD+ 18] rely were analyzed to determine which clusters contained the higher on a semi-supervised approach: the corpus used is an annotated than average incidence rates of medical terms and HIV/AIDS set of New York Times articles, and the section (Sports, Business, related terms. These clusters can then be considered target clusters, Politics, etc.) are given alongside the text, and can be used to and large incidences of words being clustered within these can be assess strength of an embedding. Additionally, the corpus used flagged as indicative as a possible outbreak. spans over 20 years, allowing for metrics such as checking the closest word to leaders or titles, such as "president" or "NYC Neural Network Predictions mayor" throughout time. These methods show that this dynamic In addition to the k-means model, we created a neural network word embedding alignment model yields accurate results. model for binary classification of our terms. Our target class was Major differences can be attributed to the word2vec model terms that we hypothesized were closely related to the HIV epi- only being given a section of the corpus at a time, while our model demic in Scott County, i.e. any word in our HIV terms list. Several had access to the entire corpus across all temporal buckets. Terms iterations with varying number of layers, activation functions, and that might not have appeared in the given time bucket might still nodes within each layer were attempted to maximize performance. appear in the embeddings generated by our model, but not at all Each model used an 80% training, 20% testing split on these data, within the word2vec embeddings. For example, most embeddings with two variations performed of this split on training and testing generated by the word2vec model did not often have hashtagged data. 
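A sketch of the clustering step: the 10-dimensional embeddings are reduced to two dimensions with UMAP, and k-means is run on the concatenation of the 100-dimensional shift and the two 2-dimensional positions. Eight clusters are used to match the cluster indices reported in Table 2; the UMAP metric and the remaining hyperparameters are assumptions.

```python
# Sketch of k-means on the data_121-style features with 2-d UMAP positions.
import numpy as np
import umap
from sklearn.cluster import KMeans

def cluster_shifts(d_vec, v10_init, v10_fin, n_clusters=8, seed=0):
    """d_vec: (N, 100) shifts; v10_init / v10_fin: (N, 10) bucket embeddings."""
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=seed)
    xy_init = reducer.fit_transform(v10_init)
    xy_fin = reducer.fit_transform(v10_fin)
    features = np.hstack([d_vec, xy_init, xy_fin])     # term labels excluded
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(features)
    print("inertia at convergence:", km.inertia_)
    return km.labels_
```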
The first was randomly splitting all 255,000 observations, terms in their top 10 closest terms, while embeddings generated without care of some observations for a term being in both training by our model often did. As hashtagged terms are very related to set and some being in the testing set. This split of data will ongoing events, keeping these terms can give useful information be referred to as "mixed" data, as the terms are mixed between to this outbreak. Modern hashtagged terms will likely be the most the splits. The second split of data split the 15,000 words into common novel terms that we have no prior knowledge on, and we 80% training and 20% testing. After the vocabulary was split, hypothesize that these terms will be relevant to ongoing outbreaks. the corresponding observations in the data were split accordingly, Given that our corpus spans a significantly shorter time period leaving all observations for each term within the same split. than the New York Times set, and does not have annotations, we Additionally, we tested a neural network that would accept the use existing baseline data sets of word similarities. We evaluated same data as the input, either data_201 or data_121, with the the accuracy of both models’ vectors using a baseline sources addition of the label assigned to that observation by the k-means for the semantic similarity of terms. The first source used was model as a feature. The goal of these models, in addition was to SimLex-999, which contains 999 word pairings, with correspond- correctly identifying terms we classified as related to the outbreak, ing human generated similarity scores on a scale of 0-10, where was to discover new terms that shift in similar ways to the HIV 10 is the highest similarity [HRK15]. Cosine similarities for each terms we labeled. pair of terms in SimLex-999 were calculated for both the w2v The neural network model used was four layers, with three model vectors as well as vectors generated by the dynamic model ReLu layers with 128, 256, and 256 neurons, followed by a single for each temporal bucket. Pairs containing terms that were not neuron sigmoid output layer. This neural network was constructed present in the model generated vectors were omitted for that using the Keras module of the TensorFlow library. The main models similarity measurements. The cosine similarities were then difference between them was the input data itself. The input data compared to the assigned SimLex scores using the Spearman’s were data_201 with and without k-means labels, data_121 with rank correlation coefficient. The results of this baseline can be seen and without k-means labels. On each of these, there were two splits in Table 1. The Spearman’s coefficient of both sets of embeddings, of the training and testing data, as in the previously mentioned averaged across all 18 temporal buckets, was .151334 for the "mixed" terms. Parameters of the neural network layers were w2v vectors and .15506 for the dynamic word embedding (dwe) adjusted, but results did not improve significantly across the data vectors. The dwe vectors slightly outperformed the w2v baseline sets. All models were trained with a varying number of epochs: 50, in this test of word similarities. However, it should be noted that 174 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
(SCIPY 2022) Time w2v Score dwe Score Difference w2v dwe Difference Bucket (MEN) (MEN) (MEN) Score Score (SL) (SL) (SL) 0 0.437816 0.567757 0.129941 0.136146 0.169702 0.033556 1 0.421271 0.561996 0.140724 0.131751 0.167809 0.036058 2 0.481644 0.554162 0.072518 0.113067 0.165794 0.052727 3 0.449981 0.543395 0.093413 0.137704 0.163349 0.025645 4 0.360462 0.532634 0.172172 0.169419 0.158774 -0.010645 5 0.353343 0.521376 0.168032 0.133773 0.157173 0.023400 6 0.365653 0.511323 0.145669 0.173503 0.154299 -0.019204 7 0.358100 0.502065 0.143965 0.196332 0.152701 -0.043631 8 0.380266 0.497222 0.116955 0.152287 0.154338 .002051 9 0.405048 0.496563 0.091514 0.149980 0.148919 -0.001061 10 0.403719 0.499463 0.095744 0.145412 0.142114 -0.003298 11 0.381033 0.504986 0.123952 0.181667 0.141901 -0.039766 12 0.378455 0.511041 0.132586 0.159254 0.144187 -0.015067 13 0.391209 0.514521 0.123312 0.145519 0.147816 0.002297 14 0.405100 0.519095 0.113995 0.151422 0.152477 0.001055 15 0.419895 0.522854 0.102959 0.117026 0.154963 0.037937 16 0.400947 0.524462 0.123515 0.158833 0.157687 -0.001146 17 0.321936 0.525109 0.203172 0.170925 0.157068 -0.013857 Average 0.437816 0.567757 0.129941 0.151334 0.155059 0.003725 TABLE 1: Spearman’s correlation coefficients for w2v vectors and dynamic word embedding (dwe) vectors for all 18 temporal clusters against the SimLex word pair data set. Fig. 1: 2 Dimensional Representation of Embeddings from Time Bucket 0. TEMPORAL WORD EMBEDDINGS ANALYSIS FOR DISEASE PREVENTION 175 Fig. 2: 2 Dimensional Representation of Embeddings from Time Bucket 17. these Spearman’s coefficients are very low compared to baselines UMAP, can be seen in Figure 1 and Figure 2. Figure 1 represents such as in [WWC+ 19], where the average Spearman’s coefficient the embedding generated for the first time bucket, while Figure amongst common models was .38133 on this data set of words. 2 represents the embedding generated for the final time bucket. These models, however, were trained on corpus generated from These UMAP representations use cosine distance as their metric Wikipedia pages — wiki2010. The lower Spearman’s coefficients over Euclidian distance, leading to more dense clusters and more can likely be accounted to our corpus. In 2014-2017, when accurate representations of nearby terms within the embedding this corpus was generated, Twitter had a 140 character limit on space. The section of terms outlying from the main grouping tweets. The limited characters have been shown to affect user’s appears to be terms that do not appear often within that temporal language within their tweets [BTKSDZ19], possibly affecting our cluster itself, but may appear several times later in a temporal embeddings. Boot et al. show that Twitter increasing the character bucket. Figure 1 contains a zoomed in view of this outlying group, limit to 280 characters in 2017 impacted the language within the as well as a subgrouping on the outskirts of the main group, tweets. As we test this pipeline on more Twitter data from various containing food related terms. The majority of these terms are time intervals, the character increase in 2017 is something to keep ones that would likely be hashtagged frequently during a brief time in mind. period within one temporal bucket. 
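The word-pair baselines were scored by comparing cosine similarities of the model vectors to the human judgments with Spearman's rank correlation; a sketch is given below, with the parsing of the SimLex-999 or MEN files left out.

```python
# Sketch: Spearman's rank correlation between model cosine similarities and
# human similarity scores for a word-pair benchmark (SimLex-999 or MEN).
import numpy as np
from scipy.stats import spearmanr

def evaluate_pairs(embeddings, word2id, pairs):
    """pairs: iterable of (word1, word2, human_score) tuples."""
    model_sims, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 not in word2id or w2 not in word2id:
            continue                                   # omit pairs missing from the vocabulary
        u, v = embeddings[word2id[w1]], embeddings[word2id[w2]]
        model_sims.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        human_scores.append(score)
    rho, _ = spearmanr(model_sims, human_scores)
    return rho
```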
These terms are still relevant The second source of baseline was the MEN Test Collection, to study, as hashtagged terms that appear frequently for a brief containing 3,000 pairs with similarity scores of 0-50, with 50 period of time are most likely extremely attached to an ongoing being the most similar [BTB14]. Following the same methodology event. In future iterations, the length of each temporal bucket will for assessing the strength of embeddings as we did for the be decreased, hopefully giving more temporal buckets access to SimLex-999 set, the Spearman’s coefficients from this set yielded terms that only appear within one currently. much better results than from the SimLex-999 set. The average of the Spearman’s coefficients, across all 18 temporal buckets, K-Means Clustering Results was .39532 for the w2v embeddings and .52278 for the dwe The results of the k-means clustering can be seen below in embeddings. The dwe significantly outperformed the w2v baseline Figures 4 and 5. Figure 4 shows the results of k-means clustering on this set, but still did not reach the average correlation of with the corresponding 2 dimensional UMAP positions generated .7306 that other common models achieved in the baseline tests from the 10 dimensional vector that were used as features in in [WWC+ 19]. the clustering. Figure 5 shows the results of k-means clustering Two dimensional representations of embeddings, generated by with the corresponding 2 dimensional UMAP representation of the 176 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Cluster All Words HIV Terms Difference 0 0.173498 0.287048 0.113549 1 0.231063 0.238876 0.007814 2 0.220039 0.205600 -0.014440 3 0.023933 0.000283 -0.023651 4 0.108078 0.105581 -0.002498 5 0.096149 0.084276 -0.011873 6 0.023525 0.031391 0.007866 7 0.123714 0.046946 -0.076768 TABLE 2: Distribution of HIV terms and all terms within k-means clusters Fig. 4: Results of k-means clustering shown over the 2 dimensional UMAP representation of the 10 dimensional embeddings. Fig. 3: Bar graph showing k-means clustering distribution of HIV terms against all terms. entire data set used in clustering. The k-means clustering revealed semantic shifts of HIV related terms being clustered with higher incidence than other terms in one cluster. Incidence rates for all terms and HIV terms in each cluster can be seen in Table 2 and Fig. 5: Results of k-means clustering shown over the 2 dimensional Figure 3. This increased incidence rate of HIV related terms in UMAP representation of the full data set. certain clusters leads us to hypothesize that semantic shifts of terms in future datasets can be clustered using the same k-means model, and analyzed to search for outbreaks. Clustering of terms and .1 for the mixed split in both sets. The difference in certainty in future data sets can be compared to these clustering results, and thresholds was due to any mixed term data set having an extremely similarities between the data can be recognized. large number of false positives on .01, but more reasonable results on .1. Neural Network Results These results show that classification of terms surrounding Neural network models we generated showed promising results the Scott County HIV outbreak is achievable, but the model will on classification of HIV related terms. The goal of the models need to be refined on more data. It can be seen that the mixed was to identify and discover terms surrounding the HIV outbreak. 
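A sketch of the classifier described above, i.e. three ReLU layers of 128, 256, and 256 neurons followed by a single sigmoid output, built with Keras; the optimizer, loss, and the commented usage lines are assumptions beyond the stated architecture.

```python
# Sketch of the binary classifier for HIV-related terms built with Keras.
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_model(n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical usage: X is the (255,000 x 201) or (255,000 x 121) feature matrix,
# y the HIV-term indicator; a random row split corresponds to the "mixed" setting.
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
# model = build_model(X.shape[1])
# model.fit(X_tr, y_tr, epochs=50, validation_data=(X_te, y_te))
# flagged = model.predict(X_te) > 0.1   # certainty threshold used for the mixed split
```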
Neural Network Results

Neural network models we generated showed promising results on classification of HIV related terms. The goal of the models was to identify and discover terms surrounding the HIV outbreak. Therefore we were not concerned about the rate of false positive terms. False positive terms likely had semantic shifts very similar to the HIV related terms, and therefore can be related to the outbreak. These terms can be labeled as potentially HIV related while studying future data sets, which can aid in identifying whether an outbreak is ongoing during the time the tweets in the corpus were tweeted. We looked for a balance of finding false positive terms without lowering our certainty threshold to include too many terms. Results on the testing data for the data_201 set can be seen in Table 3, and results on the testing data for the data_121 set can be seen in Table 4. The certainty threshold for the unmixed split in both sets was .01, and .1 for the mixed split in both sets. The difference in certainty thresholds was due to any mixed term data set having an extremely large number of false positives at .01, but more reasonable results at .1.

These results show that classification of terms surrounding the Scott County HIV outbreak is achievable, but the model will need to be refined on more data. It can be seen that the mixed term split of data led to a high rate of true positives; however, it quickly became much more specific to terms outside of our target class at higher epochs, with false positives dropping to lower rates. Additionally, accuracy on data_201 begins to increase between the 150 and 200 epoch models for the unmixed split, so even higher epoch models might improve results further for the unmixed split. Outliers, such as the true positives in data_121 with 100 epochs without k-means labels, can be explained by the certainty threshold. If the certainty threshold were .05 for that model, there would have been 86 true positives and 1,129 false positives. A precise certainty threshold can be found as we test this model on other HIV related data sets and control data sets. With enough experimentation and data, a set can be run through our pipeline and a certainty of there being a potential HIV outbreak in the region the data originated from can be generated by a future model.

                With K-Means Label                                     Without K-Means Label
Epochs     Accuracy  Precision  Recall   TP    FP     TN     FN      Accuracy  Precision  Recall   TP    FP     TN     FN
50         0.9589    0.0513     0.0041   8     148    48897  1947    0.9571    0.1538     0.0266   52    286    48759  1903
100        0.9589    0.0824     0.0072   14    156    48889  1941    0.9608    0.0893     0.0026   5     51     48994  1950
150        0.6915    0.0535     0.4220   825   14602  34443  1130    0.7187    0.0451     0.3141   614   13006  36039  1341
200        0.7397    0.0388     0.2435   476   11797  37248  1479    0.7566    0.0399     0.2317   453   10912  38133  1502
50Mix      0.9881    0.9107     0.7967   1724  169    48667  440     0.9811    0.9417     0.5901   1277  79     48757  887
100Mix     0.9814    0.9418     0.5980   1294  80     48756  870     0.9823    0.9090     0.6465   1399  140    48696  765
150Mix     0.9798    0.9595     0.5471   1184  50     48786  980     0.9752    0.9934     0.4191   907   6      48830  1257
200Mix     0.9736    0.9846     0.3835   830   13     48823  1334    0.9770    0.9834     0.4658   1008  17     48819  1156

TABLE 3: Results of the neural network run on the data_201 set. The epochs column shows the number of training epochs for the models, as well as whether the words were mixed between the training and testing data, denoted by "Mix".

                With K-Means Label                                     Without K-Means Label
Epochs     Accuracy  Precision  Recall   TP    FP     TN     FN      Accuracy  Precision  Recall   TP    FP     TN     FN
50         0.9049    0.0461     0.0752   147   3041   46004  1808    0.9350    0.0652     0.0522   102   1463   47582  1853
100        0.9555    0.1133     0.0235   46    360    48685  1909    0.8251    0.0834     0.3565   697   7663   41382  1258
150        0.9554    0.0897     0.0179   35    355    48690  1920    0.9572    0.0957     0.0138   27    255    48790  1928
200        0.9496    0.0335     0.0113   22    635    48410  1933    0.9525    0.0906     0.0266   52    522    48523  1903
50Mix      0.9285    0.2973     0.5018   1086  2567   46269  1078    0.9487    0.4062     0.4501   974   1424   47412  1190
100Mix     0.9475    0.3949     0.4464   966   1480   47356  1198    0.9492    0.4192     0.5134   1111  1539   47297  1053
150Mix     0.9344    0.3112     0.4496   973   2154   46682  1191    0.9514    0.4291     0.4390   950   1264   47572  1214
200Mix     0.9449    0.3779     0.4635   1003  1651   47185  1161    0.9500    0.4156     0.4395   951   1337   47499  1213

TABLE 4: Results of the neural network run on the data_121 set. The epochs column shows the number of training epochs for the models, as well as whether the words were mixed between the training and testing data, denoted by "Mix".
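The aggregate metrics in Tables 3 and 4 can be derived from the raw confusion counts; the helper below shows one way to apply a certainty threshold to predicted probabilities and compute accuracy, precision, and recall. The names and the exact thresholding scheme are illustrative assumptions based on the description above, not the project's actual evaluation code.

```python
import numpy as np

def evaluate_with_threshold(probabilities, labels, threshold=0.01):
    """Score a classifier whose positive calls are gated by a certainty threshold.

    probabilities: (n_terms,) numpy array of predicted probabilities that a
        term is HIV related.
    labels: (n_terms,) numpy array of ground-truth 0/1 labels.
    threshold: minimum certainty required to flag a term as HIV related.
    """
    predictions = probabilities >= threshold

    tp = int(np.sum(predictions & (labels == 1)))
    fp = int(np.sum(predictions & (labels == 0)))
    tn = int(np.sum(~predictions & (labels == 0)))
    fn = int(np.sum(~predictions & (labels == 1)))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "TP": tp, "FP": fp, "TN": tn, "FN": fn}
```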
Conclusion

Our results are promising, with high accuracy and decent recall on classification of HIV/AIDS related terms, as well as potentially discovering new terms related to the outbreak. Given more HIV related data sets and control data sets, we could begin examining and generating thresholds of what might be indicative of an outbreak. To improve results, metrics for our word2vec baseline model and statistical analysis could be further explored, as could the previously mentioned noise and biases in our data. Additionally, sparsity of data in earlier temporal buckets may lead to some loss of accuracy. Fine tuning hyperparameters of the alignment model through grid searching would likely further improve these results. We predict that given more data sets containing tweets from areas and times that had HIV/AIDS outbreaks similar to Scott County, as well as control data sets that are not directly related to an HIV outbreak, we could determine a threshold of words that would define a county as potentially undergoing an HIV outbreak. With a refined pipeline and model such as this, we hope to be able to begin biosurveillance to try to prevent future outbreaks.

Future Work

Case studies of previous datasets related to other diseases and collection of more modern tweets could not only provide critical insight into relevant medical activity, but also further strengthen and expand our model and its credibility. There is a large source of data potentially related to HIV/AIDS on Twitter, so finding and collecting this data would be a crucial first step. One potent example could be data from the 220 United States counties determined by the CDC to be vulnerable to HIV and/or viral hepatitis outbreaks due to injection drug use, similar to the outbreak that occurred in Scott County [VHRH+16]. Our next data set to be studied is tweets from Cabell County, West Virginia, from January of 2018 through 2020. During this time an HIV outbreak similar to the one that took place in Scott County in 2014 occurred [AMK20]. The end goal is to create a pipeline that can perform live semantic shift analysis at set intervals of time within these counties and classify these shifts as they happen. A future model can predict whether or not the number of terms classified as HIV related is indicative of an outbreak. If enough terms classified by our model as potentially indicative of an outbreak are detected, or if this future model predicts a possible outbreak, public health officials can be notified and the severity of a possible outbreak can be mitigated if properly handled.

Expansion into other social media platforms would increase the variety of data our model has access to, and therefore what our model is able to respond to. With the foundational model established, we will be able to focus on converting the data and addressing the differences between social networks (e.g. audience and online etiquette). Reddit and Instagram are two points of interest due to their increasing prevalence, as well as the vastness of available data.
An idea for future implementation following the generation of a generalized model would be creating a web application. The ideal audience would be medical officials and organizations, but even public or research use for trend prediction could be potent. The application would give users the ability to pick from a given glossary of medical terms, defining their own set of significant words to run our model on. Our model would then expose any potential trends or insights for the given terms in contemporary data, allowing for quicker responses to activity. Customization of the data pool could also be a feature, where tweets and other social media posts are filtered to specified geographic regions or time windows, yielding more specific results.

Additionally, we would like to reassess our embedding model to try to improve the embeddings generated and our understanding of the semantic shifts. This project has been ongoing for several years, and new models, such as the use of bidirectional encoders, as in BERT [DCLT18], have proven to have high performance. BERT based models have also been used for temporal embedding studies, such as in [LMD+19], a study focused on clinical corpora. We predict that updating our pipeline to match more modern methodology can lead to more effective disease detection.

REFERENCES

[Aff05] Veteran Affairs. Glossary of HIV/AIDS terms: Veterans Affairs, Dec 2005. URL: https://www.hiv.va.gov/provider/glossary/index.asp.

[AMK20] A. Atkins, R. P. McClung, and M. Kilkenny. Notes from the field: Outbreak of Human Immunodeficiency Virus infection among persons who inject drugs — Cabell County, West Virginia, 2018–2019. Morbidity and Mortality Weekly Report, 69(16):499–500, 2020. doi:10.15585/mmwr.mm6916a2.

[BTB14] Elia Bruni, Nam Khanh Tran, and Marco Baroni. Multimodal distributional semantics. J. Artif. Int. Res., 49(1):1–47, 2014. doi:10.1613/jair.4135.

[BTKSDZ19] Arnout Boot, Erik Tjon Kim Sang, Katinka Dijkstra, and Rolf Zwaan. How character limit affects language usage in tweets. Palgrave Communications, 5(76), 2019. doi:10.1057/s41599-019-0280-3.

[PPH+16] Philip J. Peters, Pamela Pontones, Karen W. Hoover, Monita R. Patel, Romeo R. Galang, Jessica Shields, Sara J. Blosser, Michael W. Spiller, Brittany Combs, William M. Switzer, et al. HIV infection linked to injection use of Oxymorphone in Indiana, 2014–2015. New England Journal of Medicine, 375(3):229–239, 2016. doi:10.1056/NEJMoa1515195.

[VHRH+16] Michelle M. Van Handel, Charles E. Rose, Elaine J. Hallisey, Jessica L. Kolling, Jon E. Zibbell, Brian Lewis, Michele K. Bohm, Christopher M. Jones, Barry E. Flanagan, Azfar-E-Alam Siddiqi, et al. County-level vulnerability assessment for rapid dissemination of HIV or HCV infections among persons who inject drugs, United States. JAIDS Journal of Acquired Immune Deficiency Syndromes, 73(3):323–331, 2016. doi:10.1097/qai.0000000000001098.

[WWC+19] Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, and C.-C. Jay Kuo. Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8(1), 2019. doi:10.1017/atsip.2019.12.

[YSD+18] Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pages 673–681, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3159652.3159703.
[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transform- ers for language understanding, 2018. doi:10.18653/v1/ N19-1423. [GC18] Gregg S Gonsalves and Forrest W Crawford. Dynamics of the HIV outbreak and response in Scott County, IN, USA, 2011–15: A modelling study. The Lancet HIV, 5(10), 2018. URL: https://pubmed.ncbi.nlm.nih.gov/30220531/. [Gol17] Nicholas J. Golding. The needle and the damage done: In- diana’s response to the 2015 HIV epidemic and the need to change state and federal policies regarding needle exchanges and intravenous drug users. Indiana Health Law Review, 14(2):173, 2017. doi:10.18060/3911.0038. [HLJ16] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Di- achronic word embeddings reveal statistical laws of seman- tic change. CoRR, abs/1605.09096, 2016. arXiv:1605. 09096, doi:10.48550/arXiv.1605.09096. [HRK15] Felix Hill, Roi Reichart, and Anna Korhonen. SimLex- 999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015. doi:10.1162/COLI_a_00237. [LMD+ 19] Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Savova Guergana. A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction. In Proceedings of the 2nd Clinical Natural Language Process- ing Workshop, pages 65–71. Association for Computational Linguistics, 2019. doi:10.18653/v1/W19-1908. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013. doi:10.48550/ARXIV.1301.3781. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 179 Design of a Scientific Data Analysis Support Platform Nathan Martindale‡∗ , Jason Hite‡ , Scott Stewart‡ , Mark Adams‡ F Abstract—Software data analytic workflows are a critical aspect of modern Fundamentally, science revolves around the ability for others scientific research and play a crucial role in testing scientific hypotheses. A to repeat and reproduce prior published works, and this has typical scientific data analysis life cycle in a research project must include become a difficult task with many computation-based studies. several steps that may not be fundamental to testing the hypothesis, but are Often, scientists outside of a computer science field may not have essential for reproducibility. This includes tasks that have analogs to software training in software engineering best practices, or they may simply engineering practices such as versioning code, sharing code among research team members, maintaining a structured codebase, and tracking associated disregard them because the focus of a researcher is on scientific resources such as software environments. Tasks unique to scientific research publications rather than the analysis software itself. Lack of docu- include designing, implementing, and modifying code that tests a hypothesis. mentation and provenance of research artifacts and frequent failure This work refers to this code as an experiment, which is defined as a software to publish repositories for data and source code has led to a crisis analog to physical experiments. in reproducibility in artificial intelligence (AI) and other fields that A software experiment manager should support tracking and reproducing rely heavily on computation [SBB13], [DMR+ 09], [Hut18]. 
One individual experiment runs, organizing and presenting results, and storing and study showed that quantifiably few machine learning (ML) papers reloading intermediate data on long-running computations. A software experi- document specifics in how they ran their experiments [GGA18]. ment manager with these features would reduce the time a researcher spends This gap between established practices from the software engi- on tedious busywork and would enable more effective collaboration. This work discusses the necessary design features in more depth, some of the existing neering field and how computational research is conducted has software packages that support this workflow, and a custom developed open- been studied for some time, and the problems that can stem from source solution to address these needs. it are discussed at length in [Sto18]. To mitigate these issues, computation-based research requires Index Terms—reproducible research, experiment life cycle, data analysis sup- better infrastructure and tooling [Pen11] as well as applying port relevant software engineering principles [Sto18], [Dub05] to allow data scientists to ensure their work is effective, correct, and Introduction reproducible. In this paper we focus on the ability to manage re- producible workflows for scientific experiments and data analyses. Modern science increasingly uses software as a tool for conducting We discuss the features that software to support this might require, research and scientific data analyses. The growing number of compare some of the existing tools that address them, and finally libraries and frameworks facilitating this work has greatly low- present the open-source tool Curifactory which incorporates the ered the barrier to usage, allowing more researchers to benefit proposed design elements. from this paradigm. However, as a result of the dependence on software, there is a need for more thorough integration of sound software engineering practices with the scientific process. The Related Work fragility of complex environments containing heavily intercon- Reproducibility of AI experiments has been separated into three nected packages coupled with a lack of provenance of the artifacts different degrees [GK18]: Experiment reproduciblity, or repeata- generated throughout the development of an experiment increases bility, refers to using the same code implementation with the the potential for long-term problems, undetected bugs, and failure same data to obtain the same results. Data reproducibility, or to reproduce previous analyses. replicability, is when a different implementation with the same * Corresponding author: martindalena@ornl.gov data outputs the same results. Finally, method reproducibility ‡ Oak Ridge National Laboratory describes when a different implementation with different data is able to achieve consistent results. These degrees are discussed Copyright © 2022 Oak Ridge National Laboratory. This is an open-access article distributed under the terms of the Creative Commons Attribution in [GGA18], comparing the implications and trade-offs on the License, which permits unrestricted use, distribution, and reproduction in any amount of work for the original researcher versus an external medium, provided the original author and source are credited. researcher, and the degree of generality afforded by a reproduced Notice: This manuscript has been authored by UT-Battelle, LLC, under implementation. 
A repeatable experiment places the greatest bur- contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for pub- den on the original researcher, requiring the full codebase and lication, acknowledges that the US government retains a nonexclusive, paid-up, experiment to be sufficiently documented and published so that irrevocable, worldwide license to publish or reproduce the published form of a peer is able to correctly repeat it. At the other end of the this manuscript, or allow others to do so, for US government purposes. DOE spectrum, method reproducibility demands the greatest burden will provide public access to these results of federally sponsored research in ac- cordance with the DOE Public Access Plan (http://energy.gov/downloads/doe- on the external researcher, as they must implement and run the public-access-plan). experiment from scratch. For the remainder of this paper, we refer 180 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) to "reproducibility" as experiment reproducibility (repeatability). ifications to take full advantage of all features. This can entail Tooling that is able to assist with documentation and organization a significant learning curve and places additional burden on of a published experiment reduces the amount of work for the the researcher. To address this, some sources propose automatic original researcher and still allows for the lowest level of burden documentation of experiments and code through static source code to external researchers to verify and extend previous work. analysis [NFP+ 20], [Red19]. In an effort to encourage better reproducibility based on Beyond the preexisting body of knowledge about software datasets, the Findable, Accessible, Interoperable, and Reusable engineering principles, other works [SNTH13], [KHS09] de- (FAIR) data principles [WDA+ 16] were established. These prin- scribe recommended rules and practices to follow when conduct- ciples recommend that data should have unique and persistent ing computation-based research. These include avoiding manual identifiers, use common standards, and provide rich metadata data manipulation in favor of scripted changes, keeping detailed description and provenance, allowing both humans and machines records of how results are produced (manual provenance), tracking to effectively parse them. These principles have been extended the versions of libraries and programs used, and tracking random more broadly to software [LGK+ 20], computational workflows seeds. Many of these ideas can be assisted or encapsulated through [GCS+ 20], and to entire data pipelines [MLC+ 21]. appropriate infrastructure decisions, which is the premise on Various works have surveyed software engineering practices which this work bases its software reviews. and identified practices that provide value in scientific computing Although this paper focuses on the scientific workflow, a contexts, including various forms of unit and regression testing, growing related field tackles many of the same issues from proper source control usage, formal verification, bug tracking, an industry standpoint: machine learning operations (MLOps) and agile development methods [Sto18], [Dub05]. In particular, [Goy20]. 
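Several of the record-keeping practices recommended above (tracking library versions, keeping provenance of how results were produced, recording random seeds) can be automated with very little code. The sketch below dumps a handful of such provenance details to a JSON file alongside a run; it is a generic illustration of the practice, not the mechanism of any particular tool reviewed later in this paper.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def capture_run_metadata(random_seed, path="run_metadata.json"):
    """Record basic provenance information for the current run."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(sys.argv),          # how the run was invoked
        "python_version": sys.version,
        "platform": platform.platform(),
        "random_seed": random_seed,
        # Current commit of the codebase, if running inside a Git repository.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True).stdout.strip(),
        # Exact versions of every installed package in the environment.
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True).stdout.splitlines(),
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```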
MLOps, an ML-oriented version of DevOps, is con- [Sto18] described many concepts from agile development as being cerned with supporting an entire data science life cycle, from data well suited to an experimental context, where the current knowl- acquisition to deployment of a production model. Many of the edge and goals may be fairly dynamic throughout the project. They same challenges are present, reproducibility and provenance are noted that although many of these techniques could be directly crucial in both production and research workflows [RMRO21]. applied, some required adaptation to make sense in the scientific Infrastructure, tools, and practices developed for MLOps may also software domain. hold value in the scientific community. Similar to this paper, two other works [DGST09], [WWG21] A taxonomy for ML tools that we reference throughout this discuss sets of design aspects and features that a workflow work is from [QCL21], which describes a characterization of tools manager would need. Deelman et al. describe the life cycle of consisting of three primary categories: general, analysis support, a workflow as composition, mapping, execution, and provenance and reproducibility support, each of which is further subdivided capture [DGST09]. A workflow manager must then support each into aspects to describe a tool. For example, these subaspects of these aspects. Composition is how the workflow is constructed, include data visualization, web dashboard capabilities, experiment such as through a graphical interface or with a text configuration logging, and the interaction modes the tool supports, such as a file. Mapping and execution are determining the resources to be command line interface (CLI) or application programming inter- used for a workflow and then utilizing those resources to run it, face (API). including distributing to cloud compute and external representa- tional state transfer (REST) services. This also refers to scheduling Design Features subworkflows/tasks to reuse intermediate artifacts as available. We combine the two sets of capabilities from [DGST09] and Provenance, which is crucial for enabling repeatability, is how all [WWG21] with the taxonomy from [QCL21] to propose a set artifacts, library versions, and other relevant metadata are tracked of six design features that are important for an experiment during the execution of a workflow. manager. These include orchestration, parameterization, caching, Wratten, Wilm, and Göke surveyed many bioinformatics pi- reproducibility, reporting, and scalability. The crossover between pline and workflow management tools, listing the challenges that these proposed feature sets are shown in Table 1. We expand on tooling should address: data provenance, portability, scalability, each of these in more depth in the subsections below. and re-entrancy [WWG21]. Provenance is defined the same way as in [DGST09], and further states the need for generating Orchestration reports that include the tracking information and metadata for Orchestration of an experiment refers to the mechanisms used the associated experiment run. Portability—allowing set up and to chain and compose a sequence of smaller logical steps into execution of an experiment in a different environment—can be an overarching pipeline. This provides a higher-level view of an a challenge because of the dependency requirements of a given experiment and helps abstract away some of the implementation system and the ease with which the environment can be specified details. 
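As a minimal illustration of this orchestration idea, the sketch below chains a few small, reusable step functions into a simple linear pipeline over a shared state; real workflow managers express the same composition as a DAG with explicit inputs and outputs, and the step names here are placeholders rather than any tool's API.

```python
def load_data(state):
    # Each step reads what it needs from the shared state and adds its output.
    state["data"] = [4, 1, None, 3, 2]
    return state

def clean_data(state):
    state["clean"] = [x for x in state["data"] if x is not None]
    return state

def train_model(state):
    # Stand-in "model": just an aggregate of the cleaned data.
    state["model"] = sum(state["clean"]) / len(state["clean"])
    return state

def run_pipeline(steps):
    """Compose steps into a pipeline, passing the accumulated state along."""
    state = {}
    for step in steps:
        state = step(state)
    return state

results = run_pipeline([load_data, clean_data, train_model])
print(results["model"])
```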
Operation of most workflow managers is based on a and reinitialized on a different machine or operating system. directed acyclic graph (DAG), which specifies the stages/steps as Scalability is important especially when large scale data, many nodes and the edges connecting them as their respective inputs and compute-heavy steps, or both are involved throughout the work- outputs. The intent with orchestration is to encourage designing flow. Scalability in a manager involves allowing execution on a distinct, reusable steps that can easily be composed in different high-performance computing (HPC) system or with some form of ways to support testing different hypotheses or overarching ex- parallel compute. Finally they mention re-entrancy, or the ability periment runs. This allows greater focus on the design of the to resume execution of a compute step from where it last stopped, experiments than the implementation of the underlying functions preventing unnecessary recomputation of prior steps. that the experiments consist of. As discussed in the taxonomy One area of the literature that needs further discussion is [QCL21], pipeline creation can consist of a combination of scripts, the design of automated provenance tracking systems. Existing configuration files, or a visual tool. This aspect falls within the workflow management tools generally require source code mod- composition capability discussed in [DGST09]. DESIGN OF A SCIENTIFIC DATA ANALYSIS SUPPORT PLATFORM 181 This work [DGST09] [WWG21] Taxonomy [QCL21] Orchestration Composition — Reproducibility/pipeline creation Parameterization — — — Caching — Re-entrancy — Reproducibility Provenance Provenance, portability Reproducibility Reporting — — Analysis/visualization, web dashboard Scalability Mapping, execution Scalability Analysis/computational resources TABLE 1: Comparing design features listed in various works. Parameterization Reproducibility Parameterization specifies how a compute pipeline is customized Mechanisms for reproducibility are one of the most important fea- for a particular run by passing in configuration values to change tures for a successful data analysis support platform. Reproducibil- aspects of the experiment. The ability to customize analysis code ity is challenging because of the complexity of constantly evolving is crucial to conducting a compute-based experiment, providing a codebases, complicated and changing dependency graphs, and mechanism to manipulate a variable under test to verify or reject inconsistent hardware and environments. Reproducibility entails a hypothesis. two subcomponents: provenance and portability. This falls under Conventionally, parameterization is done either through spec- the provenance aspect from [DGST09], both data provenance and ifying parameters in a CLI call or by passing configuration files portability from [WWG21], and the entire reproducibility support in a format like JSON or YAML. As discussed in [DGST09], section of the taxonomy [QCL21]. parameterization sometimes consists of more complicated needs, such as conducting parameter sweeps or grid searches. There are Data provenance is about tracking the history, configuration, libraries dedicated to managing parameter searches like this, such and steps taken to produce an intermediate or final data artifact. as hyperopt [BYC13] used in [RMRO21]. 
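As a concrete illustration of this kind of parameterization, the sketch below expands a small base configuration plus a sweep specification into a full grid of parameter sets, one per experiment variant. It is a generic example of the pattern (the kind of thing a dedicated library automates), and the configuration keys are assumptions made for the example.

```python
import itertools
import json

# A base configuration plus the values to sweep over; in practice these
# might be loaded from a JSON or YAML file passed on the command line.
base_config = {"dataset": "corpus_2017", "train_test_ratio": 0.8}
sweep = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_layers": [(50,), (100,), (100, 50)],
}

def expand_grid(base, sweep):
    """Yield one complete parameter set per point in the sweep grid."""
    keys = list(sweep)
    for values in itertools.product(*(sweep[k] for k in keys)):
        params = dict(base)
        params.update(dict(zip(keys, values)))
        yield params

for i, params in enumerate(expand_grid(base_config, sweep)):
    # Recording each rendered parameter set alongside its run is a cheap way
    # to keep provenance of exactly which variant was executed.
    print(f"variant_{i}:", json.dumps(params))
```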
In ML this would include the cleaning/munging steps used and the intermediate tables created in the process, but provenance can Although not provided as a design capability in the other apply more broadly to any type of artifact an experiment may works, we claim the mechanisms provided for parameterization produce, such as ML models themselves, or "model provenance" are important, as these mechanisms are the primary way to con- [SH18]. Applying provenance beyond just data is critical, as figure, modify, and vary experiment execution without explicitly models may be sensitive to the specific sets of training data and changing the code itself or modifying hard-coded values. This conditions used to produce them [Hut18]. This means that every- means that a recorded parameter set can better "describe" an thing required to directly and exactly reproduce a given artifact experiment run, increasing provenance and making it easier for is recorded, such as the manipulations applied to its predecessors another researcher to understand what pieces of an experiment and all hyperparameters used within those manipulations. can be readily changed and explored. Some support is provided for this in [DGST09], stating that Portability refers to the ability to take an experiment and the necessity of running many slight variations on workflows execute it outside of the initial computing environment it was sometimes leads to the creation of ad hoc scripts to generate the created in [WWG21]. This can be a challenge if all software variants, which leads to increased complexity in the organization dependency versions are not strictly defined, or when some de- of the codebase. Improved mechanisms to parameterize the same pendencies may not be available in all environments. Minimally, workflow for many variants helps to manage this complexity. allowing portability requires keeping explicit track of all packages and the versions used. A 2017 study [OBA17] found that even this minimal step is rarely taken. Another mechanism to support Caching portability is the use of containerization, such as with Docker or Refining experiment code and finding bugs is often a lengthy Podman [SH18]. iterative process, and removing the friction of constantly rerunning all intermediate steps every time an experiment is wrong can improve efficiency. Caching values between each step of an Reporting experiment allows execution to resume at a certain spot in the pipeline, rather than starting from scratch every time. This is Reporting is an important step for analyzing the results of an defined as re-entrancy in [WWG21]. experiment, through visualizations, summaries, comparisons of In addition to increasing the speed of rerunning experiments results, or combinations thereof. As a design capability, reporting and running new experiments that combine old results for analysis, refers to the mechanisms available for the system to export or caching is useful to help find and debug mistakes throughout retrieve these results for human analysis. Although data visu- an experiment. Cached outputs from each step allow manual alization and analysis can be done manually by the scientist, interrogation outside of the experiment. For example, if a cleaning tools to assist with making these steps easier and to keep results step was implemented incorrectly and a user noticed an invalid organized are valuable from a project management standpoint. 
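To make the caching idea concrete, here is a minimal sketch of stage-level memoization to disk: if a cached artifact already exists for a given stage name and parameter hash, it is reloaded instead of recomputed. This is a generic illustration of re-entrancy, not the caching implementation of any of the tools reviewed below, and the directory layout is an assumption.

```python
import functools
import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")

def cached_stage(stage_name):
    """Decorator that short-circuits a stage if a cached output exists."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(params, *args, **kwargs):
            CACHE_DIR.mkdir(exist_ok=True)
            # Key the cache on the stage name and a hash of the parameters.
            key = hashlib.sha1(
                json.dumps(params, sort_keys=True, default=str).encode()
            ).hexdigest()[:12]
            path = CACHE_DIR / f"{stage_name}_{key}.pkl"
            if path.exists():
                with open(path, "rb") as f:
                    return pickle.load(f)   # re-entrant: reuse the prior result
            result = func(params, *args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@cached_stage("clean_data")
def clean_data(params, raw_rows):
    # ... expensive cleaning logic would go here ...
    return [row for row in raw_rows if row is not None]
```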
value in an output data table, they could use a notebook to load Mechanisms for this might include a web interface for exploring and manipulate the intermediate artifact tables for that data to individual or multiple runs. Under the taxonomy [QCL21], this determine what stage introduced the error and what code should falls primarily within analysis support, such as data visualization be used to correctly fix it. or a web dashboard. 182 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Scalability container. MLFlow then ensures that the environment is set up and Many data analytic problems require large amounts of space active before running. The CLI even allows directly specifying a and compute resources, often beyond what can be handled on GitHub link to an mlflow-enabled project to download, set up, and an individual machine. To efficiently support running a large then run the associated experiment. For reporting, the MLFlow experiment, mechanisms for scaling execution are important and tracking UI lets the user view and compare various runs and their could include anything from supporting parallel computation on associated artifacts through a web dashboard. For scalability, both an experiment or stage level, to allowing the execution of jobs on distributed storage for saving/loading artifacts as well as execution remote machines or within an HPC context. This falls within both of runs on distributed clusters is supported. mapping and execution from [DGST09], the scalability aspect Sacred from [WWG21], and the computational resources category within the taxonomy [QCL21]. Sacred [GKC+ 17] is a Python library and CLI tool to help organize and reproduce experiments. Orchestration is managed through the use of Python decorators, a "main" for experiment Existing Tools entry point functions and "capture" for parameterizable functions, A wide range of pipeline and workflow tools have been devel- where function arguments are automatically populated from the oped to support many of these design features, and some of the active configuration when called. Parameterization is done directly more common examples include DVC [KPP+ 22] and MLFlow in Python through applying a config decorator to a function that [MLf22]. We briefly survey and analyze a small sample of these assigns variables. Configurations can also be written to or read tools to demonstrate the diversity of ideas and their applicability in from JSON and YAML files, so parameters must be simple types. different situations. Table 2 compares the support of each design Different observers can be specified to automatically track much feature by each tool. of the metadata, environment information, and current parameters, and within the code the user can specify additional artifacts and DVC resources to track during the run. Each run will store the requested DVC [KPP+ 22] is a Git-like version control tool for datasets. outputs, although there is no re-entrant use of these cached values. Orchestration is done by specifying stages, or runnable script Portability is supported through the ability to print the versions of commands, either in YAML or directly on the CLI. A stage is libraries needed to run a particular experiment. Reporting can be specified with output file paths and input file paths as dependen- done through a specific type of observer, and the user can provide cies, allowing an implicit pipeline or DAG to form, representing all custom templated reports that are generated at the end of each run. the processing steps. 
Parameterization is done by defining within a YAML file what the possible parameters are, along with the default Kedro values. When running the DAG, parameters can be customized on Kedro [ABC+ 22] is another Python library/CLI tool for managing the CLI. Since inputs and outputs are file paths, caching and re- reproducible and modular experiments. Orchestration is particu- entrancy come for free, and DVC will intelligently determine if larly well done with "node" and "pipeline" abstractions, a node certain stages do not need to be re-computed. referring to a single compute step with defined inputs and outputs, A saved experiment or state is frozen into each commit, so and a pipeline implemented as an ordered list of nodes. Pipelines all parameters and artifacts are available at any point. No explicit can be composed and joined to create an overarching workflow. tracking of the environment (e.g., software versions and hardware Possible parameters are defined in a YAML file and either set info) is present, but this could be manually included by tracking it in other parameter files or configured on the CLI. Similar to in a separate file. Reporting can be done by specifying per-stage MLFlow, while tracking outputs are cached, there’s no automatic metrics to track in the YAML configuration. The CLI includes a mechanism for re-entrancy. Provenance is achieved by storing way to generate HTML files on the fly to render requested plots. user-specified metrics and tracked datasets for each run, and it There is also an external "Iterative Studio" project, which provides has a few different mechanisms for portability. This includes the a live web dashboard to view continually updating HTML reports ability to export an entire project into a Docker container. A from DVC. For scalability, parallel runs can be achieved by separate Kedro-Viz tool provides a web dashboard to show a map queuing an experiment multiple times in the CLI. of experiments, as well as showing each tracked experiment run and allowing comparison of metrics and outputs between them. MLFlow Projects can be deployed into several different cloud providers, MLFlow [MLf22] is a framework for managing the entire life such as Databricks and Dask clusters, allowing for several options cycle of an ML project, with an emphasis on scalability and de- for scalability. ployment. It has no specific mechanisms for orchestration, instead allowing the user to intersperse MLFlow API calls in an existing Curifactory codebase. Runnable scripts can be provided as entry points into Curifactory [MHSA22] is a Python API and CLI tool for organiz- a configuration YAML, along with the parameters that can be ing, tracking, reproducing, and exporting computational research provided to it. Parameters are changed through the CLI. Although experiments and data analysis workflows. It is intended primarily MLFlow has extensive capabilities for tracking artifacts, there are for smaller teams conducting research, rather than production- no automatic re-entrancy methods. Reproducibility is a strong fea- level or large-scale ML projects. Curifactory is available on ture, and provenance and portability are well supported. The track- GitHub1 with an open-source BSD-3-Clause license. Below, we ing module provides provenance by recording metadata such as the describe the mechanisms within Curifactory to support each of the Git commit, parameters, metrics, and any user-specified artifacts six capabilities, and compare it with the tools discussed above. in the code. 
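For readers unfamiliar with MLFlow's tracking module, a typical simplified usage pattern looks roughly like the following; the experiment, parameter, and metric names are placeholders meant only to indicate the flavor of the API described above, not a complete or prescriptive setup.

```python
import mlflow

# Group this execution under a named experiment; runs then appear in the
# tracking UI where they can be compared against one another.
mlflow.set_experiment("model-comparison")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 1e-3)      # configuration for this run
    mlflow.log_param("hidden_layers", "100,50")

    # ... training would happen here ...
    accuracy = 0.96

    mlflow.log_metric("accuracy", accuracy)      # results tracked across runs
    # Any existing file can be attached to the run as an artifact.
    mlflow.log_artifact("loss.png")
```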
Portability is done by allowing the environment for an entry point to be specified as a Conda environment or Docker 1. https://github.com/ORNL/curifactory DESIGN OF A SCIENTIFIC DATA ANALYSIS SUPPORT PLATFORM 183 Orchestration Parameterization Caching Provenance Portability Reporting Scalability DVC + + ++ + + + + MLFlow + * ++ ++ ++ ++ Sacred + ++ * ++ + + Kedro + + * + ++ ++ ++ Curifactory + ++ ++ ++ ++ + + TABLE 2: Supported design features in each tool. Note, + indicates that a feature is supported, ++ indicates very strong support, and * indicates tooling that supports caching artifacts as a provenance tool but does not provide a mechanism for automatically reloading cached values as a form of re-entrancy. @stage(inputs=["model"], outputs=["results"]) def test_model(record, model): # ... def run(argsets, manager): """An example experiment definition. The primary intent of an experiment is to run each set of arguments through the desired stages, in order to compare results at the end. """ for argset in argsets: # A record is the "pipeline state" # associated with each set of arguments. # Stages take and return a record, # automatically handling pushing and # pulling inputs and outputs from the # record state. record = Record(manager, argsets) test_model(train_model(load_data(record))) Parameterization Parameterization in Curifactory is done directly in Python scripts. The user defines a dataclass with the parameters they need throughout their various stages in order to customize the exper- iment, and they can then define parameter files that each return Fig. 1: Stages are composed into an experiment. one or more instances of this arguments class. All stages in an experiment are automatically given access to the current argument set in use while an experiment is running. Orchestration While configuration can also be done directly in Python in Curifactory provides several abstractions, the lowest level of which Sacred, Curifactory makes a different trade-off: A parameter file is a stage. A stage is a function that takes a defined set of input or get_params() function in Curifactory returns an array of variable names, a defined set of output variable names, and an one or more argument sets, and arguments can directly include optional set of caching strategies for the outputs. Stages are similar complex Python objects. Unlike Sacred, this means Curifactory to Kedro’s nodes but implemented with @stage() decorators on cannot directly translate back and forth from static configuration the target function rather than passing the target function to a files, but in exchange allows for grid searches to be defined directly node() call. One level up from a stage is an experiment: an and easily in a single parameter file, as well as allowing argument experiment describes the orchestration of these stages as shown in sets to be composed or even inherit from other argument set Figure 1, functionally chaining them together without needing to instances. Importantly, Curifactory can still encode representations explicitly manage what variables are passed between the stages. of arguments into JSON for provenance, but this is a one direc- tional transformation. @stage(inputs=None, outputs=["data"]) def load_data(record): This approach allows a great deal of flexibility, and is valuable # every stage has the currently active record in experiments where a large range of parameters need to be # passed to it, which contains the "state", or tested or there is significant repetition among parameter sets. 
# all previous output values associated with # the current argset, as defined in the For example, in an experiment testing different effects of model # Parameterization section training hyperparameters, there may be several parameter files # ... meant to vary only the arguments needed for model training while using the same base set of data cleaning arguments. Composing @stage(inputs=["data"], outputs=["model", "stats"]) def train_model(record, data): these parameter sets from a common imported set means that any # ... subsequent changes to the data cleaning arguments only need to 184 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) be modified in one place, rather than each individual parameter file. @dataclass class MyArgs(curifactory.ExperimentArgs): """Define the possible arguments needed in the stages.""" random_seed: int = 42 train_test_ratio: float = 0.8 layers: tuple = (100,) activation: str = "relu" def get_params(): """Define a simple grid search: return many arguments instances for testing.""" args = [] layer_sizes = [10, 20, 50, 100] for size in layer_sizes: args.append(MyArgs(name=f"network_{size}", layers=(size,))) return args Caching Curifactory supports per-stage caching, similar to memoization, through a set of easy-to-use caching strategies. When a stage executes, it uses the specified cache mechanism to store the stage outputs to disk, with a filename based on the experiment, stage, and a hash of the arguments. When the experiment is re-executed, if it finds an existing output on disk based on this name, it short- circuits the stage computation and simply reloads the previously cached files, allowing a form of re-entrancy. Adding this caching ability to a stage is done through simply providing the list of Fig. 2: Metadata block at the top of a report. caching strategies to the stage decorator, one for each output: @stage( inputs=["data"], default Dockerfile for this purpose, and running the experiment outputs=["training_set", "testing_set"], with the Docker flag creates an image that exposes a Jupyter cachers=[PandasCSVCacher]*2 notebook to repeat the run and keep the artifacts in memory, as ): def split_data(record, data): well as a file server pointing to the appropriate cache for manual # stage definition exploration and inspection. Directly reproducing the experiment can be done either through the exposed notebook or by running Reproducibility the Curifactory experiment command inside of the image. As mentioned before, reproducibility consists of tracking prove- nance and metadata of artifacts as well as providing a means to set Reporting up and repeat an experiment in a different compute environment. While Curifactory does not run a live web dashboard like MLFlow, To handle provenance, Curifactory automatically records metadata DVC’s Iterative Studio, and Kedro-viz, every experiment run for every experiment run executed, including a logfile of the outputs an HTML experiment report and updates a top-level index console output, current Git commit hash, argument sets used and HTML page linking to the new report, which can be browsed the rendered versions of those arguments, and the CLI command from a file manager or statically served if running from an used to start the run. The final reports from each run also include a external compute resource. Although simplistic, this reduces the graphical representation of the stage DAG, and shows each output dependencies and infrastructure needed to achieve a basic level artifact and what its cache file location is. 
of reporting, and produces stand-alone folders for consumption Curifactory has two mechanisms to fully track and export an outside of the original environment if needed. experiment run. The first is to execute a "full store" run, which Every report from Curifactory includes all relevant metadata creates a single exported folder containing all metadata mentioned mentioned above, including the machine host name, experiment above, along with a copy of every cache file created, the output sequential run number, Git commit hash, parameters, and com- run report (mentioned below), as well as a Python requirements.txt mand line string. Stage code can add user-defined objects to output and Conda environment dump, containing a list of all packages in in each report, such as tables, figures, and so on. Curifactory comes the environment and their respective versions. This run folder can with a default set of helpers for several basic types of output then be distributed. Reproducing from the folder consists of setting visualizations, including basic line plots, entire Matplotlib figures, up an environment based on the Conda/Python dependencies as and dataframes. needed, and running the experiment command using the exported The output report also contains a graphical representation of folder as the cache directory. the DAG for the experiment, rendered using Graphviz, and shows The second mechanism is a command to create a Docker con- the artifacts produced by each stage and the file path where they tainer that includes the environment, entire codebase, and artifact are cached. An example of some of the components of this report cache for a specific experiment run. Curifactory comes with a are rendered in figures 2, 3, 4, and 5. DESIGN OF A SCIENTIFIC DATA ANALYSIS SUPPORT PLATFORM 185 Conclusion The complexity in modern software, environments, and data ana- lytic approaches threaten the reproducibility and effectiveness of computation-based studies. This has been compounded by the lack of standardization in infrastructure tools and software engineering principles applied within scientific research domains. While many novel tools and systems are in development to address these shortcomings, several design critieria must be met, including the ability to easily compose and orchestrate experiments, parameter- ize them to manipulate variables under test, cache intermediate artifacts, record provenance of all artifacts and allow the software to port to other systems, produce output visualizations and reports for analysis, and scale execution to the resource requirements of the experiment. We developed Curifactory to address these criteria specifically for small research teams running Python based experiments. Fig. 3: User-defined objects to report ("reportables"). Acknowledgements The authors would like to acknowledge the US Department of Energy, National Nuclear Security Administration’s Office of De- fense Nuclear Nonproliferation Research and Development (NA- 22) for supporting this work. R EFERENCES [ABC+ 22] Sajid Alam, Lorena Bălan, Gabriel Comym, Yetunde Dada, Ivan Danov, Lim Hoang, Rashida Kanchwala, Jiri Klein, Antony Milne, Joel Schwarzmann, Merel Theisen, and Susanna Wong. Kedro. https://kedro.org/, March 2022. [BYC13] James Bergstra, Daniel Yamins, and David Cox. Making a Sci- Fig. 4: Graphviz rendering of experiment DAG. Each large colored ence of Model Search: Hyperparameter Optimization in Hundreds area represents a single record associated with a specific argset. 
White of Dimensions for Vision Architectures. In Proceedings of the ellipses are stages, and the blocks in between them are the input and 30th International Conference on Machine Learning, pages 115– output artifacts. 123. PMLR, February 2013. [DGST09] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An overview of workflow system features and capabilities. Future Generation Computer Systems, Scalability 25:524–540, May 2009. doi:10.1016/j.future.2008. 06.012. Curifactory has no integrated method of executing portions of jobs [DMR+ 09] David L. Donoho, Arian Maleki, Inam Ur Rahman, Morteza on external compute resources like Kedro and MLFlow, but it does Shahram, and Victoria Stodden. Reproducible Research in Com- allow local multi-process parallelization of parameter sets. When putational Harmonic Analysis. Computing in Science Engineer- an experiment run would entail executing a series of stages for ing, 11(1):8–18, January 2009. doi:10.1109/MCSE.2009. 15. each argument set in series, Curifactory can divide the collection [Dub05] P.F. Dubois. Maintaining correctness in scientific programs. of argument sets into one subcollection per process, and runs the Computing in Science Engineering, 7(3):80–85, May 2005. doi: experiment in parallel on each subcollection. By taking advantage 10.1109/MCSE.2005.54. [GCS 20] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, + of the caching mechanism, when all parallel runs complete, the Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, experiment reruns in a single process to aggregate all of the and Daniel Schober. FAIR Computational Workflows. Data precached values into a single report. Intelligence, 2(1-2):108–121, January 2020. doi:10.1162/ dint_a_00033. [GGA18] Odd Erik Gundersen, Yolanda Gil, and David W. Aha. On Repro- ducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications. AI Magazine, 39(3):56– 68, September 2018. doi:10.1609/aimag.v39i3.2816. [GK18] Odd Erik Gundersen and Sigbjørn Kjensmo. State of the Art: Reproducibility in Artificial Intelligence. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), April 2018. doi:10.1609/aaai.v32i1.11503. [GKC+ 17] Klaus Greff, Aaron Klein, Martin Chovanec, Frank Hutter, and Jürgen Schmidhuber. The Sacred Infrastructure for Computa- tional Research. In Proceedings of the 16th Python in Sci- ence Conference, pages 49–56, Austin, Texas, 2017. SciPy. doi:10.25080/shinma-7f4c6e7-008. [Goy20] A. Goyal. Machine learning operations, 2020. [Hut18] Matthew Hutson. Artificial intelligence faces reproducibility Fig. 5: Graphviz rendering of each record in more depth, showing crisis. Science, 359(6377):725–726, February 2018. doi: cache file paths and artifact data types. 10.1126/science.359.6377.725. 186 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [KHS09] Diane Kelly, Daniel Hook, and Rebecca Sanders. Five Rec- Hovig. Ten Simple Rules for Reproducible Computational Re- ommended Practices for Computational Scientists Who Write search. PLOS Computational Biology, 9(10):e1003285, October Software. Computing in Science Engineering, 11(5):48–53, 2013. doi:10.1371/journal.pcbi.1003285. September 2009. doi:10.1109/MCSE.2009.139. [Sto18] Tim Storer. Bridging the Chasm: A Survey of Software Engineer- [KPP+ 22] Ruslan Kuprieiev, Saugat Pachhai, Dmitry Petrov, Paweł ing Practice in Scientific Programming. 
ACM Computing Surveys, Redzyński, Casper da Costa-Luis, Peter Rowlands, Alexander 50(4):1–32, July 2018. doi:10.1145/3084225. Schepanovski, Ivan Shcheklein, Batuhan Taskaya, Jorge Orpinel, [WDA+ 16] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gao, Fábio Santos, David de la Iglesia Castro, Aman Sharma, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Zhanibek, Dani Hodovic, Nikita Kodenko, Andrew Grigorev, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Earl, Nabanita Dash, George Vyshnya, maykulkarni, Max Hora, Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Vera, Sanidhya Mangal, Wojciech Baranowski, Clemens Wolff, Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. and Kurian Benoy. DVC: Data Version Control - Git for Data Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair & Models. Zenodo, April 2022. doi:10.5281/zenodo. J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap 6417224. Heringa, Peter A. C. ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben [LGK+ 20] Anna-Lena Lamprecht, Leyla Garcia, Mateusz Kuzak, Car- Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Al- los Martinez, Ricardo Arcila, Eva Martin Del Pico, Victoria bert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca- Dominguez Del Angel, Stephanie van de Sandt, Jon Ison, Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Paula Andrea Martinez, Peter McQuilton, Alfonso Valencia, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Jennifer Harrow, Fotis Psomopoulos, Josep Ll Gelpi, Neil Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik Chue Hong, Carole Goble, and Salvador Capella-Gutierrez. To- van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wit- wards FAIR principles for research software. Data Science, tenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 3(1):37–59, January 2020. doi:10.3233/DS-190026. The FAIR Guiding Principles for scientific data management [MHSA22] Nathan Martindale, Jason Hite, Scott L. Stewart, and Mark and stewardship. Scientific Data, 3(1):160018, March 2016. Adams. Curifactory. https://github.com/ORNL/curifactory, doi:10.1038/sdata.2016.18. March 2022. [WWG21] Laura Wratten, Andreas Wilm, and Jonathan Göke. Reproducible, [MLC+ 21] Sonia Natalie Mitchell, Andrew Lahiff, Nathan Cummings, scalable, and shareable analysis pipelines with bioinformatics Jonathan Hollocombe, Bram Boskamp, Dennis Reddyhoff, Ryan workflow managers. Nature Methods, 18(10):1161–1168, Oc- Field, Kristian Zarebski, Antony Wilson, Martin Burke, Blair tober 2021. doi:10.1038/s41592-021-01254-9. Archibald, Paul Bessell, Richard Blackwell, Lisa A. Boden, Alys Brett, Sam Brett, Ruth Dundas, Jessica Enright, Alejandra N. Gonzalez-Beltran, Claire Harris, Ian Hinder, Christopher David Hughes, Martin Knight, Vino Mano, Ciaran McMonagle, Do- minic Mellor, Sibylle Mohr, Glenn Marion, Louise Matthews, Iain J. McKendrick, Christopher Mark Pooley, Thibaud Por- phyre, Aaron Reeves, Edward Townsend, Robert Turner, Jeremy Walton, and Richard Reeve. FAIR Data Pipeline: Provenance- driven data management for traceable scientific workflows. arXiv:2110.07117 [cs, q-bio], October 2021. arXiv:2110. 07117. [MLf22] MLflow: A Machine Learning Lifecycle Platform. https://mlflow. org/, April 2022. [NFP+ 20] Mohammad Hossein Namaki, Avrilia Floratou, Fotis Psallidas, Subru Krishnan, Ashvin Agrawal, Yinghui Wu, Yiwen Zhu, and Markus Weimer. Vamsa: Automated Provenance Tracking in Data Science Scripts. 
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pages 1542–1551, New York, NY, USA, August 2020. Association for Computing Machinery. doi: 10.1145/3394486.3403205. [OBA17] Babatunde K. Olorisade, Pearl Brereton, and Peter Andras. Re- producibility in Machine Learning-Based Studies: An Example of Text Mining. In Reproducibility in ML Workshop, 34th In- ternational Conference on Machine Learning, ICML 2017, June 2017. [Pen11] Roger D. Peng. Reproducible Research in Computational Sci- ence. Science, 334(6060):1226–1227, December 2011. doi: 10.1126/science.1213847. [QCL21] Luigi Quaranta, Fabio Calefato, and Filippo Lanubile. A Taxon- omy of Tools for Reproducible Machine Learning Experiments. In AIxIA 2021 Discussion Papers, 20th International Conference of the Italian Association for Artificial Intelligence, pages 65–76, 2021. [Red19] Sergey Redyuk. Automated Documentation of End-to-End Ex- periments in Data Science. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 2076–2080, April 2019. doi:10.1109/ICDE.2019.00243. [RMRO21] Philipp Ruf, Manav Madan, Christoph Reich, and Djaffar Ould- Abdeslam. Demystifying MLOps and Presenting a Recipe for the Selection of Open-Source Tools. Applied Sciences, 11(19):8861, January 2021. doi:10.3390/app11198861. [SBB13] Victoria Stodden, Jonathan Borwein, and David H. Bailey. Pub- lishing Standards for Computational Science: “Setting the Default to Reproducible”. Pennsylvania State University, 2013. [SH18] Peter Sugimura and Florian Hartl. Building a Reproducible Machine Learning Pipeline. arXiv:1810.04570 [cs, stat], October 2018. arXiv:1810.04570. [SNTH13] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 187 The Geoscience Community Analysis Toolkit: An Open Development, Community Driven Toolkit in the Scientific Python Ecosystem Orhan Eroglu‡∗ , Anissa Zacharias‡ , Michaela Sizemore‡ , Alea Kootz‡ , Heather Craker‡ , John Clyne‡ https://www.youtube.com/watch?v=34zFGkDwJPc F Abstract—The Geoscience Community Analysis Toolkit (GeoCAT) team de- GeoCAT has seven Python tools for geoscientific computation velops and maintains data analysis and visualization tools on structured and and visualization. These tools are built upon the Pangeo [HRA18] unstructured grids for the geosciences community in the Scientific Python ecosystem. In particular, they rely on Xarray [HH17], and Dask Ecosystem (SPE). In response to dealing with increasing geoscientific data [MR15], as well as they are compatible with Numpy and use sizes, GeoCAT prioritizes scalability, ensuring its implementations are scalable Jupyter Notebooks for demonstration purposes. Dask compatibil- from personal laptops to HPC clusters. Another major goal of the GeoCAT team is to ensure community involvement throughout the whole project lifecycle, ity allows the GeoCAT functions to scale from personal laptops which is realized through an open development mindset by encouraging users to high performance computing (HPC) systems such as NCAR’s and contributors to get involved in decision-making. With this model, we not Casper, Cheyenne, and upcoming Derecho clusters [CKZ+ 22]. 
Introduction

The Geoscience Community Analysis Toolkit (GeoCAT) team, established in 2019, leads the software engineering efforts of the National Center for Atmospheric Research (NCAR) "Pivot to Python" initiative [Geo19]. Before then, NCAR Command Language (NCL) [BBHH12] was developed by NCAR as an interpreted, domain-specific language that was aimed to support the analysis and visualization needs of the global geosciences community. NCL had been serving several tens of thousands of users for decades. It is still available for use but is no longer actively developed, as it has been placed in maintenance mode.

The initiative had an initial two-year roadmap with major milestones being: (1) replicating NCL's computational routines in Python, (2) training and support for transitioning NCL users into Python, and (3) moving tools into an open development model. GeoCAT aims to create scalable data analysis and visualization tools on structured and unstructured grids for the geosciences community in the SPE. The GeoCAT team is committed to open development, which helps the team prioritize community involvement at any level of the project lifecycle alongside having the whole software stack open-sourced.

GeoCAT has seven Python tools for geoscientific computation and visualization. These tools are built upon the Pangeo [HRA18] ecosystem. In particular, they rely on Xarray [HH17] and Dask [MR15], are compatible with NumPy, and use Jupyter Notebooks for demonstration purposes. Dask compatibility allows the GeoCAT functions to scale from personal laptops to high performance computing (HPC) systems such as NCAR's Casper, Cheyenne, and upcoming Derecho clusters [CKZ+22]. Additionally, GeoCAT utilizes Numba, an open source just-in-time (JIT) compiler [LPS15], to translate Python and NumPy code into machine code in order to get faster execution wherever possible. GeoCAT's visualization components rely on Matplotlib [Hun07] for most of the plotting functionalities, Cartopy [Met15] for projections, as well as the Datashader and Holoviews stack [Anaa] for big data rendering. Figure 1 shows these technologies with their essential roles around GeoCAT.

Briefly, GeoCAT-comp houses computational operators for applications ranging from regridding and interpolation to climatology and meteorology. GeoCAT-examples provides over 140 publication-quality plotting scripts in Python for Earth sciences. It also houses Jupyter notebooks with high-performance, interactive plots that enable features such as pan and zoom on fine-resolution, unstructured geoscience data (e.g. ~3 km data rendered within a few tens of seconds to a few minutes on personal laptops). This is achieved by making use of the connectivity information in the unstructured grid and rendering data via the Datashader and Holoviews ecosystem [Anaa]. GeoCAT-viz enables higher-level implementation of Matplotlib and Cartopy plotting capabilities through its variety of easy-to-use visualization convenience functions for GeoCAT-examples. GeoCAT also maintains WRF-Python (Weather Research and Forecasting), which works with WRF-ARW model output and provides diagnostic and interpolation routines.

GeoCAT was recently awarded Project Raijin, an NSF EarthCube-funded effort [NSF21] [CEMZ21]. Its goal is to enhance the open-source analysis and visualization tool landscape by developing community-owned, sustainable, scalable tools that facilitate operating on unstructured climate and global weather data in the SPE. Throughout this three-year project, GeoCAT will work on the development of data analysis and visualization functions that operate directly on the native grid, as well as establish an active community of user-contributors.

This paper will provide insights about GeoCAT's software stack and current status, team scope and near-term plans, and open development methodology, as well as current pathways of community involvement.
Fig. 1: The core Python technologies on which GeoCAT relies

GeoCAT Software

The GeoCAT team develops and maintains several open-source software tools. Before describing those tools, it is vital to explain how the team implements continuous integration and continuous delivery/deployment (CI/CD) consistently across all of them.

Continuous Integration and Continuous Delivery/Deployment (CI/CD)

GeoCAT employs a continuous delivery model, with a monthly package release cycle on package management systems and package indexes such as Conda [Anab] and PyPI [Pyt]. This model helps the team make new functions available as soon as they are implemented and address potential errors quickly. To assist this process, the team utilizes multiple tools throughout its GitHub assets to ensure automation, unit testing and code coverage, as well as licensing and reproducibility. Figure 2, for example, shows the set of badges displaying the near real-time status of each CI/CD implementation on the GitHub repository homepage of one of our software tools.

CI build tests of our repositories are implemented and automated (for pushed commits, pull requests, and daily scheduled execution) via GitHub Actions workflows [Git], with the CI badge shown in Figure 2 displaying the status (i.e. pass or fail) of those workflows. Similarly, the CONDA-BUILDS badge shows whether the conda recipe builds successfully for the repository. The Python package "codecov" [cod] analyzes the percentage of code coverage from unit tests in the repository. Additionally, the overall results as well as details for each code script can be seen via the COVERAGE badge. Each of our software repositories has a corresponding documentation page that is populated mostly automatically through the Sphinx Python documentation generator [Bra21] and published through ReadTheDocs [rea] via an automated building and versioning schema. The DOCS badge provides a link to the documentation page along with showing failures, if any, in the documentation rendering process. Figure 3 shows the documentation homepage of GeoCAT-comp. The NCAR and PYPI badges in the Package row show and link to the latest versions of the software tool distributed through NCAR's Conda channel and PyPI, respectively. The LICENSE badge provides a link to our software license, Apache License version 2.0 [Apa04], for all of the GeoCAT stack, enabling the redistribution of the open-source software products on an "as is" basis. Finally, to provide reproducibility of our software products (either for the latest or any older version), we publish version-specific Digital Object Identifiers (DOIs), which can be accessed through the DOI badge. This allows the end-user to accurately cite the specific version of the GeoCAT tools they used for science or research purposes.

Fig. 2: GeoCAT-comp's badges at the beginning of its README file (i.e. the home page of the GitHub repository) [geob]
GeoCAT-comp (and GeoCAT-f2py)

GeoCAT-comp is the computational component of the GeoCAT project, as can be seen in Figure 4. GeoCAT-comp houses implementations of geoscience data analysis functions. Novel research and development is conducted for analyzing both structured and unstructured grid data from various research fields such as climate, weather, atmosphere, and ocean, among others. In addition, some of the functionalities of GeoCAT-comp are inspired by or reimplemented from NCL in order to address the first goal of the "Pivot to Python" effort. For that purpose, 114 NCL routines were selected, excluding some functionalities such as date routines, which can be handled by other packages in the Python ecosystem today. These functions were ranked by order of website documentation access from most to least, and prioritization was made based on those ranks. Today, GeoCAT-comp provides the same or similar capabilities for about 39% (44 out of 114) of those functions.

Fig. 3: GeoCAT-comp documentation homepage built with Sphinx using a theme provided by ReadTheDocs [geoa]

Some of the functions that are made available through GeoCAT-comp are listed below, for which the GeoCAT-comp documentation [geoa] provides signatures and descriptions as well as links to usage examples:

• Spherical harmonics (both decomposition and recomposition, as well as area weighting)
• Fourier transforms such as band-block, band-pass, low-pass, and high-pass
• Meteorological variable computations such as relative humidity, dew-point temperature, heat index, saturation vapor pressure, and more
• Climatology functions such as climate average over multiple years, daily/monthly/seasonal averages, as well as anomalies
• Regridding of curvilinear grid to rectilinear grid, unstructured grid to rectilinear grid, curvilinear grid to unstructured grid, and vice versa
• Interpolation methods such as bilinear interpolation of a rectilinear to another rectilinear grid, hybrid-sigma levels to isobaric levels, and sigma to hybrid coordinates
• Empirical orthogonal function (EOF) analysis

Many of the computational functions in GeoCAT are implemented in pure Python. However, there are others that were originally implemented in Fortran but are now wrapped up in Python with the help of NumPy's F2PY, the Fortran-to-Python interface generator. This is mostly because re-implementing some functions would require understanding complicated algorithm flows and implementing extensive unit tests, which would end up taking too much time compared to wrapping their already-implemented Fortran routines in Python. Furthermore, outside contributors from a science background are likely to keep contributing new functions to GeoCAT from their older Fortran routines in the future. To facilitate contribution, the whole GeoCAT-comp structure is split into two repositories with respect to being either pure-Python or Python with compiled code (i.e. Fortran) implementations. These implementation layers are handled with the GeoCAT-comp and GeoCAT-f2py repositories, respectively.

The GeoCAT-comp code base does not explicitly contain or require any compiled code, making it more accessible to the general Python community at large. In addition, GeoCAT-f2py is automatically installed through the GeoCAT-comp installation, and all functions contained in the "geocat.f2py" package are imported transparently into the "geocat.comp" namespace. Thus, GeoCAT-comp serves as a user API to access the entire computational toolkit, even though its GitHub repository itself only contains pure Python code from the developer's perspective. Whenever prospective contributors want to contribute computational functionality in pure Python, GeoCAT-comp is the only GitHub repository they need to deal with. Therefore, there is no onus on contributors of pure Python code to build, compile, or test any compiled code (e.g. Fortran) at the GeoCAT-comp level.
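To give a concrete feel for this user API, the following is a minimal sketch of calling a GeoCAT-comp routine on in-memory data. The function name relhum and its argument order are assumptions to be checked against the GeoCAT-comp documentation [geoa], not a definitive usage prescription.

    import numpy as np
    import xarray as xr
    import geocat.comp as gc  # also re-exports the wrapped geocat.f2py routines

    # Toy inputs: temperature [K], mixing ratio [kg/kg], pressure [Pa]
    temperature = xr.DataArray(np.array([290.0, 295.0, 300.0]))
    mixing_ratio = xr.DataArray(np.array([0.005, 0.008, 0.012]))
    pressure = xr.DataArray(np.array([100000.0, 95000.0, 90000.0]))

    # Meteorological variable computation (relative humidity); NumPy arrays,
    # Xarray DataArrays, or Dask-backed DataArrays are accepted alike.
    rh = gc.relhum(temperature, mixing_ratio, pressure)
    print(rh)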
GeoCAT-examples (and GeoCAT-viz)

GeoCAT-examples [geoe] was created to address a few of the original milestones of NCAR's "Pivot to Python" initiative: (1) to provide the geoscience community with well-documented visualization examples for several plotting classes in the SPE, and (2) to help transition NCL users into the Python ecosystem through such resources. It was born in early 2020 as the result of a multi-day hackathon event among the GeoCAT team and several other scientists and developers from various NCAR labs/groups. It has since grown to house novel visualization examples and showcase the capabilities of other GeoCAT components, like GeoCAT-comp, along with newer technologies like interactive plotting notebooks. Figure 5 illustrates one of the unique GeoCAT-examples cases that was aimed at exploring best practices for data visualization, like choosing color blind friendly colormaps.

The GeoCAT-examples [geod] gallery contains over 140 example Python plotting scripts, demonstrating functionalities from Python packages like Matplotlib, Cartopy, NumPy, and Xarray. The gallery includes plots from a range of visualization categories such as box plots, contours, meteograms, overlays, projections, shapefiles, streamlines, and trajectories, among others. The plotting categories and scripts under GeoCAT-examples cover almost all of the NCL plot types and techniques. In addition, GeoCAT-examples houses plotting examples for individual GeoCAT-comp analysis functions.

Fig. 4: GeoCAT project structure with all of the software tools [geoc]

Fig. 5: Comparison between NCL (left) and Python (right) when choosing a colormap; GeoCAT-examples aims at choosing color blind friendly colormaps [SEKZ22]

Despite Matplotlib and Cartopy's capabilities to reproduce almost all NCL plots, there was one significant caveat with using their low-level implementations compared to NCL: NCL's high-level plotting functions allowed scientists to plot most cases in only tens of lines of code (LOC), while the Matplotlib and Cartopy stack required writing a few hundred LOC.
In order to build a higher-level implementation on top of Matplotlib and Cartopy while recreating NCL-like plots (from vital plotting capabilities that were not readily available in the Python ecosystem at the time, such as Taylor diagrams and curly vectors, to more stylistic changes such as font sizes, color schemes, etc. that resemble NCL plots), the GeoCAT-viz library [geof] was implemented. Use of functions from this library in GeoCAT-examples significantly reduces the LOC requirements for most of the visualization examples to numbers comparable to NCL's. Figure 6 shows Taylor diagram and curly vector examples that have been created with the help of GeoCAT-viz. To exemplify how GeoCAT-viz helps keep the LOC comparable to NCL, one of the Taylor diagrams (i.e. Taylor_6) took 80 LOC in NCL, and its Python implementation in GeoCAT-examples takes 72 LOC. If many of the Matplotlib functions (e.g. figure and axes initialization, adjustment of several axes parameters, calls to plotting functions for the Taylor diagram, management of grids, addition of titles, contours, etc.) used in this example weren't wrapped up in GeoCAT-viz [geof], the same visualization would easily end up at around two hundred LOC.

Fig. 6: Taylor diagram and curly vector examples that were created with the help of GeoCAT-viz

Recently, the GeoCAT team has been focused on interactive plotting technologies, especially for larger data sets that contain millions of data points. This effort was centered on unstructured grid visualization as part of Project Raijin, which is detailed in a later section of this manuscript. That is because unstructured meshes are a great research and application field for big data and interactivity, such as zooming in/out for regions of interest. As a result of this effort, we created a new notebooks gallery under GeoCAT-examples to house such interactive data visualizations. The first notebook in this gallery, a screenshot of which is shown in Figure 7, is implemented via the Datashader and Holoviews ecosystem [Anaa], and it provides a high-performance, interactive visualization of a Model for Prediction Across Scales (MPAS) Global Storm-Resolving Model weather simulation dataset. The interactivity features are pan and zoom to reveal greater data fidelity globally and regionally. The data used in this work is courtesy of the DYAMOND effort [SSA+19] and has varying resolutions from 30 km to 3.75 km. Our notebook in the gallery uses the 30 km resolution data so that users can download and work on it in their local configuration. However, our work with the 3.75 km resolution data (i.e. about 42 million hexagonal cells globally) showed that rendering the data took only a few minutes on a decent laptop, even without any parallelization. The main reason behind such high performance was that we used the cell-to-node connectivity information in the MPAS data to render the native grid directly (i.e. without remapping to a structured grid), along with utilizing the Datashader stack. Without using the connectivity information, a much more costly Delaunay triangulation would be required. The notebook provides a comparison between these two approaches as well.

Fig. 7: The interactive plot interface from the MPAS visualization notebook in GeoCAT-examples

GeoCAT-datafiles

GeoCAT-datafiles is GeoCAT's small data storage component, hosted as a GitHub repository. This tool houses many datasets in different file formats, such as NetCDF, which can be used along with other GeoCAT tools or for ad-hoc data needs in any other Python script. The datasets can be accessed by the end-user through a lightweight convenience function:

    geocat.datafiles.get("folder_name/filename")

GeoCAT-datafiles fetches the file by simply reading from local storage, if present, or downloading it from the GeoCAT-datafiles repository, if it is not in local storage, with the help of the Pooch framework [USR+20].
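A minimal sketch of this workflow, combining the fetch with Xarray, might look as follows; the file path below is illustrative only and is not necessarily a file that exists in the repository.

    import geocat.datafiles as gdf
    import xarray as xr

    # Downloaded on first use, read from the local cache afterwards (via Pooch)
    path = gdf.get("netcdf_files/example_dataset.nc")  # hypothetical file name
    ds = xr.open_dataset(path)
    print(ds)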
WRF-Python

WRF-Python was created in early 2017 in order to replicate NCL's Weather Research and Forecasting (WRF) package in the SPE, and it covers 100% of the routines in that package. About two years later, NCAR's "Pivot to Python" initiative was announced, and the GeoCAT team took over development and maintenance of WRF-Python.

The package focuses on eliminating the need to work across multiple software platforms when using WRF datasets. It contains more than 30 computational routines (e.g. diagnostic calculations, several interpolation routines) and visualization routines that aim at reducing the amount of post-processing tooling necessary to visualize WRF output files.

Even though there is no continuous development in WRF-Python, as is seen in the rest of the GeoCAT stack, the package is still maintained with timely responses and bug-fix releases for the issues reported by the user community.

Project Raijin

"Collaborative Research: EarthCube Capabilities: Raijin: Community Geoscience Analysis Tools for Unstructured Mesh Data", i.e. Project Raijin, of the consortium between NCAR and Pennsylvania State University, has been awarded by NSF 21-515 EarthCube for an award period of 1 September, 2021 - 31 August, 2024 [NSF21]. Project Raijin aims at developing community-owned, sustainable, scalable tools that facilitate operating on unstructured climate and global weather data [rai]. The GeoCAT team is in charge of the software development of Project Raijin, which mainly consists of implementing visualization and analysis functions in the SPE to be executed on native grids. While doing so, GeoCAT is also responsible for establishing an open development environment, clearly documenting the implementation work, and aligning deployments with the project milestones as well as SPE requirements and specifications.

GeoCAT has created the Xarray-based UXarray package [uxa] to recognize unstructured grid models through partnership with geoscience community groups. UXarray is built on top of the built-in Xarray Dataset functionalities while recognizing several unstructured grid formats (UGRID, SCRIP, and Exodus for now). Since there are more unstructured mesh models in the community than UXarray natively supports, its architecture will also support the addition of new models. Figure 8 shows the regularly structured "latitude-longitude" grids versus a few unstructured grid models.

The UXarray project has implemented data input/output functions for UGRID, SCRIP, and Exodus, as well as methods for surface area and integration calculations so far. The team is currently conducting open discussions (through GitHub Discussions) with community members who are interested in unstructured grids research and development, in order to prioritize the data analysis operators to be implemented throughout the project lifecycle.

Fig. 8: Regular grid (left) vs MPAS-A & CAM-SE grids

Scalability

GeoCAT is aware of the fact that today's geoscientific models are capable of generating huge volumes of data. Furthermore, these datasets, such as those produced by global convective-permitting models, are going to grow even larger in the future. Therefore, the computational and visualization functions being developed in geoscientific research and development workflows need to be scalable from personal devices (e.g. laptops) to HPC (e.g. NCAR's Casper, Cheyenne, and upcoming Derecho clusters) and cloud platforms (e.g. AWS).

In order to keep up with the scalability objectives, GeoCAT functions are implemented to operate on Dask arrays in addition to natively supporting NumPy arrays and Xarray DataArrays. Therefore, the GeoCAT functions can trivially and transparently be parallelized to run on shared-memory and distributed-memory platforms, once a Dask cluster/client is properly configured and the functions are fed with Dask arrays or Dask-backed Xarray DataArrays (i.e. chunked Xarray DataArrays that wrap Dask arrays).
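The intended usage pattern can be sketched as follows. The local Dask client, the file name, and the variable names are illustrative assumptions, and relhum again stands in for any GeoCAT-comp function.

    import xarray as xr
    from dask.distributed import Client
    import geocat.comp as gc

    # A local cluster for a laptop; on an HPC system this would typically
    # connect to a scheduler started via dask-jobqueue or similar tooling.
    client = Client()

    # Lazily open the data with Dask-backed (chunked) DataArrays
    ds = xr.open_dataset("model_output.nc", chunks={"time": 100})

    # Feeding chunked DataArrays parallelizes the computation transparently;
    # nothing is evaluated until .compute() (or .plot(), .to_netcdf(), ...)
    rh = gc.relhum(ds["temperature"], ds["mixing_ratio"], ds["pressure"])
    result = rh.compute()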
Open Development

To ensure community involvement at every level in the development lifecycle, GeoCAT is committed to an open development model. In order to implement this model, GeoCAT provides all of its software tools as GitHub repositories with public GitHub project boards and roadmaps, issue tracking and development reviewing, and comprehensive documentation for users and contributors, such as the Contributor's Guide [geoc] and toolkit-specific documentation, along with community announcements on the GeoCAT blog. Furthermore, GeoCAT encourages community feedback and contribution at any level with inclusive and welcoming language. As a result of this, community requests and feedback have played a significant role in forming and revising the GeoCAT roadmap and the projects' scope.

Community engagement

To further promote engagement with the geoscience community, GeoCAT organizes and attends various community events. First of all, scientific conferences and meetings are great venues for such a scientific software engineering project to share updates and progress with the community. For instance, the American Meteorological Society (AMS) Annual Meeting and the American Geophysical Union (AGU) Fall Meeting are two significant scientific events at which the GeoCAT team has presented one or more publications every year since its birth to inform the community. The annual Scientific Computing with Python (SciPy) conference is another great fit to showcase what GeoCAT has been conducting in geoscience. The team has also attended The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) a few times to keep up to date with the industry state of the art in these technologies.

Creating internship projects is another way of improving community interactions, as it triggers collaboration between GeoCAT, institutions, students, and universities in general. The GeoCAT team thus encourages undergraduate and graduate student engagement in the Python ecosystem through participation in NCAR's Summer Internships in Parallel Computational Science (SIParCS). Such programs are quite beneficial for both students and scientific software development teams. To exemplify, GeoCAT-examples and GeoCAT-viz in particular have received significant contributions through SIParCS in the 2020 and 2021 summers (i.e. tens of visualization examples as well as important infrastructural changes were made available by our interns) [CKZ+22] [LLZ+21] [CFS21]. Furthermore, the team has created three essential projects and one collaboration project through SIParCS for the 2022 summer, through which advanced geoscientific visualization, unstructured grid visualization and data analysis, Fortran to Python algorithm and code development, as well as GPU optimization for GeoCAT-comp routines will be investigated.

Project Pythia

The GeoCAT effort is also a part of the NSF-funded Project Pythia. Project Pythia aims to provide a public, web-accessible training resource that can help educate earth scientists to more effectively use the SPE and cloud computing for dealing with big data in geosciences. GeoCAT helps with Pythia development through content creation and infrastructure contributions. GeoCAT has also contributed several Python tutorials (such as NumPy, Matplotlib, Cartopy, etc.) to the educational resources created through Project Pythia. These materials consist of live tutorial sessions, interactive Jupyter notebook demonstrations, and Q&A sessions, as well as published video recordings of the events on Pythia's YouTube channel. As a result, it helps us engage with the community through multiple channels.
Future directions

GeoCAT aims to keep increasing the number of data analysis and visualization functionalities on both structured and unstructured meshes at the same pace as has been done so far. The team will continue prioritizing scalability and open development in the future development and maintenance of its software tools landscape. To achieve the scalability goals of our tools, we will ensure our implementations are compatible with the state of the art and up to date with the best practices of the technology we are using, e.g. Dask. To enhance the community involvement in our open development model, we will continue interacting with community members through significant events such as Pangeo community meetings, scientific conferences, and tutorials and workshops of GeoCAT's own as well as of other community members; we will keep up our timely communication with stakeholders through GitHub assets and other communication channels.

REFERENCES

[Anaa] Anaconda. Datashader. https://datashader.org/. Online; accessed 29 June 2022.
[Anab] Anaconda, Inc. Conda package manager. https://docs.conda.io/en/latest/. Online; accessed 18 May 2022.
[Apa04] Apache Software Foundation. Apache License, version 2.0. https://www.apache.org/licenses/LICENSE-2.0, 2004. Online; accessed 18 May 2022.
[BBHH12] David Brown, Rick Brownrigg, Mary Haley, and Wei Huang. NCAR Command Language (NCL), 2012. doi:10.5065/D6WD3XH5.
[Bra21] Georg Brandl. Sphinx documentation. URL: http://sphinx-doc.org/sphinx.pdf, 2021.
[CEMZ21] John Clyne, Orhan Eroglu, Brian Medeiros, and Colin M. Zarzycki. Project Raijin: Community geoscience analysis tools for unstructured grids. In AGU Fall Meeting 2021. AGU, 2021.
[CFS21] Heather Rose Craker, Claire Anne Fiorino, and Michaela Victoria Sizemore. Rebuilding the NCL visualization gallery in Python. In 101st American Meteorological Society Annual Meeting. AMS, 2021.
[CKZ+22] Heather Craker, Alea Kootz, Anissa Zacharias, Michaela Sizemore, and Orhan Eroglu. NCAR's GeoCAT Announcement of Computational Tools. In 102nd American Meteorological Society Annual Meeting. AMS, 2022.
[cod] Codecov. https://about.codecov.io/. Online; accessed 18 May 2022.
[geoa] GeoCAT-comp documentation page. https://geocat-comp.readthedocs.io/en/latest/index.html. Online; accessed 20 May 2022. doi:10.5281/zenodo.6607205.
[geob] GeoCAT-comp GitHub repository. https://github.com/NCAR/geocat-comp. Online; accessed 20 May 2022. doi:10.5281/zenodo.6607205.
[geoc] GeoCAT Contributor's Guide. https://geocat.ucar.edu/pages/contributing.html. Online; accessed 20 May 2022. doi:10.5065/a8pp-4358.
[geod] GeoCAT-examples documentation page. https://geocat-examples.readthedocs.io/en/latest/index.html. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678258.
[geoe] GeoCAT-examples GitHub repository. https://github.com/NCAR/geocat-examples. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678258.
[geof] GeoCAT-viz GitHub repository. https://github.com/NCAR/geocat-viz. Online; accessed 20 May 2022. doi:10.5281/zenodo.6678345.
[Geo19] GeoCAT. The future of NCL and the Pivot to Python. https://www.ncl.ucar.edu/Document/Pivot_to_Python, 2019. Online; accessed 17 May 2022. doi:10.5065/D6WD3XH5.
[Git] GitHub. GitHub Actions. https://docs.github.com/en/actions. Online; accessed 18 May 2022.
[HH17] Stephan Hoyer and Joseph Hamman. xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1):10, 2017. doi:10.5334/jors.148.
[HRA18] Joseph Hamman, Matthew Rocklin, and Ryan Abernathey. Pangeo: A big-data ecosystem for scalable earth system science. EGU General Assembly Conference Abstracts, 2018.
[Hun07] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.
[LLZ+21] Erin Lincoln, Jiaqi Li, Anissa Zacharias, Michaela Sizemore, Orhan Eroglu, and Julia Kent. Expanding and strengthening the transition from NCL to Python visualizations. In AGU Fall Meeting 2021. AGU, 2021.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6, 2015. doi:10.1145/2833157.2833162.
[Met15] Met Office. Cartopy: a cartographic Python library with a Matplotlib interface. Exeter, Devon, 2010–2015. URL: http://scitools.org.uk/cartopy.
[MR15] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Kathryn Huff and James Bergstra, editors, Proceedings of the 14th Python in Science Conference, pages 126–132, 2015. doi:10.25080/Majora-7b98e3ed-013.
[NSF21] NSF. Collaborative research: EarthCube capabilities: Raijin: Community geoscience analysis tools for unstructured mesh data. https://nsf.gov/awardsearch/showAward?AWD_ID=2126458&HistoricalAwards=false, 2021. Online; accessed 17 May 2022.
[Pyt] Python Software Foundation. The Python Package Index - PyPI. https://pypi.org/. Online; accessed 18 May 2022.
[rai] Raijin homepage. https://raijin.ucar.edu/. Online; accessed 21 May 2022.
[rea] ReadTheDocs. https://readthedocs.org/. Online; accessed 18 May 2022.
[SEKZ22] Michaela Sizemore, Orhan Eroglu, Alea Kootz, and Anissa Zacharias. Pivoting to Python: Lessons learned in recreating the NCAR Command Language in Python. 102nd American Meteorological Society Annual Meeting, 2022.
[SSA+19] Bjorn Stevens, Masaki Satoh, Ludovic Auger, Joachim Biercamp, Christopher S. Bretherton, Xi Chen, Peter Düben, Falko Judt, Marat Khairoutdinov, Daniel Klocke, et al. DYAMOND: the DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains. Progress in Earth and Planetary Science, 6(1):1–17, 2019. doi:10.1186/s40645-019-0304-z.
[USR+20] Leonardo Uieda, Santiago Rubén Soler, Rémi Rampin, Hugo van Kemenade, Matthew Turk, Daniel Shapero, Anderson Banihirwe, and John Leeman. Pooch: A friend to fetch your data files. Journal of Open Source Software, 5(45):1943, 2020. doi:10.21105/joss.01943.
[uxa] UXarray GitHub repository. https://github.com/UXARRAY/uxarray. Online; accessed 20 May 2022. doi:10.5281/zenodo.5655065.
popmon: Analysis Package for Dataset Shift Detection

Simon Brugman‡∗, Tomas Sostak§, Pradyot Patil‡, Max Baak‡

Abstract—popmon is an open-source Python package to check the stability of a tabular dataset. popmon creates histograms of features binned in time-slices, and compares the stability of its profiles and distributions using statistical tests, both over time and with respect to a reference dataset. It works with numerical, ordinal and categorical features, on both pandas and Spark dataframes, and the histograms can be higher-dimensional, e.g. it can also track correlations between sets of features. popmon can automatically detect and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using monitoring business rules that are either static or dynamic. popmon results are presented in a self-contained report.

Index Terms—dataset shift detection, population shift, covariate shift, histogramming, profiling

∗ Corresponding author: simon.brugman@ing.com
‡ ING Analytics Wholesale Banking
§ Vinted

Copyright © 2022 Simon Brugman et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Fig. 1: The popmon package logo

Introduction

Tracking model performance is crucial to guarantee that a model behaves as designed and trained initially, and for determining whether to promote a model with the same initial design but trained on different data to production. Model performance depends directly on the data used for training and the data predicted on. Changes in the latter (e.g. certain word frequencies, user demographics, etc.) can affect the performance and make predictions unreliable.

Given that input data often change over time, it is important to track changes in both input distributions and delivered predictions periodically, and to act on them when they are significantly different from past instances – e.g. to diagnose and retrain an incorrect model in production. Predictions may be far ahead in time, so the performance can only be verified later, for example in one year. Taking action at that point might already be too late.

To make monitoring both more consistent and semi-automatic, ING Bank has created a generic Python package called popmon. popmon monitors the stability of data populations over time and detects dataset shifts, based on techniques from statistical process control and the dataset shift literature.

popmon employs so-called dynamic monitoring rules to flag and alert on changes observed over time. Using a specified reference dataset, from which observed levels of variation are extracted automatically, popmon sets allowed boundaries on the input data. If the reference dataset changes over time, the effective ranges on the input data can change accordingly. Dynamic monitoring rules make it easy to detect which (combinations of) features are most affected by changing distributions.

popmon is light-weight. For example, only one line is required to generate a stability report.

    report = popmon.df_stability_report(
        df,
        time_axis="date",
        time_width="1w",
        time_offset="2022-1-1"
    )
    report.to_file("report.html")

The package is built on top of Python's scientific computing ecosystem (numpy, scipy [HMvdW+20], [VGO+20]) and supports pandas and Apache Spark dataframes [pdt20], [WM10], [ZXW+16]. This paper discusses how popmon monitors for dataset changes. The popmon code is modular in design and user configurable. The project is available as open-source software.¹

1. See https://github.com/ing-bank/popmon for code, documentation, tutorials and example stability reports.
Related work

Many algorithms detecting dataset shift exist that follow a similar structure [LLD+18], using various data structures and algorithms at each step [DKVY06], [QAWZ15]. However, few are readily available to use in production. popmon offers both a framework that generalizes the pipelines needed to implement those algorithms, and default data drift pipelines, built on histograms with statistical comparisons and profiles (see Sec. data representation).

Other families of tools have been developed that work on individual data points, for model explanations (e.g. SHAP [LL17], feature attributions [SLL20]), rule-based data monitoring (e.g. Great Expectations, Deequ [GCSG22], [SLS+18]) and outlier detection (e.g. [RGL19], [LPO17]).

alibi-detect [KVLC+20], [VLKV+22] is somewhat similar to popmon. This is an open-source Python library that focuses on outlier, adversarial and drift detection. It allows for monitoring of tabular, text, image and time series data, using both online and offline detectors. The backend is implemented in TensorFlow and PyTorch. Much of the reporting functionality, such as feature distributions, is restricted to the (commercial) enterprise version called seldon-deploy. Integrations for model deployment are available based on Kubernetes. The infrastructure setup thus is more complex and restrictive than for popmon, which can run on any developer's machine.

Contributions

The advantage of popmon's dynamic monitoring rules over conventional static ones is that little prior knowledge is required of the input data to set sensible limits on the desired level of stability. This makes popmon a scalable solution over multiple datasets. To the best of our knowledge, no other monitoring tool exists that suits our criteria to monitor models in production for dataset shift. In particular, no other light-weight, open-source package is available that performs such extensive stability tests of a pandas or Spark dataset. We believe the combination of wide applicability, out-of-the-box performance, available statistical tests, and configurability makes popmon an ideal addition to the toolbox of any data scientist or machine learning engineer.

Approach

popmon tests the dataset stability and reports the results through a sequence of steps (Fig. 2):

1) The data are represented by histograms of features, binned in time-slices (Sec. data representation).
2) The data is arranged according to the selected reference type (Sec. comparisons).
3) The stability of the profiles and distributions of those histograms are compared using statistical tests, both with respect to a reference and over time. It works with numerical, ordinal, categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features (Sec. comparisons).
4) popmon can automatically flag and alert on changes observed over time, such as trends, anomalies, changing correlations, etc., using monitoring rules (Sec. alerting).
5) Results are reported to the user via a dedicated, self-contained report (Sec. reporting).

Fig. 2: Step-by-step overview of popmon's pipeline as described in section approach onward.

Dataset shift

In the context of supervised learning, one can distinguish dataset shift as a shift in various distributions:

1) Covariate shift: shift in the independent variables (p(x)).
2) Prior probability shift: shift in the target variable (the class, p(y)).
3) Concept shift: shift in the relationship between the independent and target variables (i.e. p(x|y)).

Note that there is a lot of variation in the terminology used; referring to probabilities prevents this ambiguity. For more information on dataset shift see Quinonero-Candela et al. [QCSSL08].

popmon is primarily interested in monitoring the distributions of features p(x) and labels p(y) for monitoring trained classifiers. These data in deployment should ideally resemble the training data. However, the package can be used more widely, for instance by monitoring interactions between features and the label, or the distribution of model predictions.
Temporal representation

popmon requires features to be distributed as a function of time (bins), which can be provided in two ways:

1) Time axis. Two-dimensional (or higher) distributions are provided, where the first dimension is time and the second is the feature to monitor. To get time slices, the time data column needs to be specified, e.g. "date", including the bin width, e.g. one week ("1w"), and the offset, which is the lower edge of one time-bin, e.g. a certain start date ("2022-1-1").
2) Ordered data batches. A set of distributions of features is provided, corresponding to a new batch of data. This batch is considered a new time-slice, and is stitched to an existing set of batches, in order of incoming batches, where each batch is assigned a unique, increasing index. Together the indices form an artificial, binned time-axis.
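The time-axis binning of the first option can be pictured with a few lines of plain pandas; this is a sketch of the idea only, not popmon's internal implementation.

    import pandas as pd

    df = pd.DataFrame({
        "date": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-12"]),
        "amount": [10.0, 12.5, 9.9],
    })

    offset = pd.Timestamp("2022-1-1")  # lower edge of one time bin (time_offset)
    width = pd.Timedelta("7D")         # bin width of one week (time_width="1w")

    # Index of the weekly time slice each record falls into
    df["time_slice"] = ((df["date"] - offset) // width).astype(int)
    print(df)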
Data representation

popmon uses histogram-based monitoring to track potential dataset shift and outliers over time, as detailed in the next subsection. In the literature, alternative data representations are also employed, such as kdq-trees [DKVY06]. Different data representations are in principle compatible with the popmon pipeline, as it is structured similarly to alternative methods (see [LLD+18], c.f. Fig 5).

Dimensionality reduction techniques may be used to transform the input dataset into a space where the distances between instances are more meaningful for comparison, before using popmon, or in between steps. For example, a linear projection may be used as a preprocessing step, by taking the principal components of PCA as in [QAWZ15]. Machine learning classifiers or autoencoders have also been used for this purpose [LWS18], [RGL19] and can be particularly helpful for high-dimensional data such as images or text.

Histogram-based monitoring

There are multiple reasons behind the histogram-based monitoring approach taken in popmon.

Histograms are small in size, and thus are efficiently stored and transferred, regardless of the input dataset size. Once data records have been aggregated feature-wise, with a minimum number of entries per bin, they are typically no longer privacy sensitive (e.g. knowing the number of records with age 30-35 in a dataset).

popmon is primarily looking for changes in data distributions. Solely monitoring the (main) profiles of a distribution, such as the mean, standard deviation and min and max values, does not necessarily capture the changes in a feature's distribution. Well-known examples of this are Anscombe's Quartet [Ans73] and the dinosaur datasets [MF17], where – between different datasets – the means and the correlation between two features are identical, but the distributions are different. Histograms of the corresponding features (or feature pairs), however, do capture the corresponding changes.

Implementation

For the creation of histograms from data records the open-source histogrammar package has been adopted. histogrammar has been implemented in both Scala and Python [PS21], [PSSE16], and works on Spark and pandas dataframes respectively. The two implementations have been tested extensively to guarantee compatibility. The histograms coming out of histogrammar form the basis of the monitoring code in popmon, which otherwise does not require input dataframes. In other words, the monitoring code itself has no Spark or pandas dependencies, keeping the code base relatively simple.

Histogram types

Three types of histograms are typically used:

• Normal histograms, meant for numerical features with known, fixed ranges. The bin specifications are the lowest and highest expected values and the number of (equidistant) bins.
• Categorical histograms, for categorical and ordinal features, typically boolean or string-based. A categorical histogram accepts any value: when a value has not yet been encountered, it creates a new bin. No bin specifications are required.
• Sparse histograms are open-ended histograms, for numerical features with no known range. The bin specifications only need the bin width, and optionally the origin (the lower edge of bin zero, with a default value of zero). Sparse histograms accept any value. When the value is not yet encountered, a new bin gets created.

For normal and sparse histograms reasonable bin specifications can be derived automatically. Both categorical and sparse histograms are dictionaries with histogram properties. New (index, bin) pairs get created whenever needed. Although this could result in out-of-memory problems, e.g. when histogramming billions of unique strings, in practice this is typically not an issue, as it can be easily mitigated. Features may be transformed into a representation with a lower number of distinct values, e.g. via embeddings or substrings, or one selects the top-n most frequently occurring values.

Open-ended histograms are ideal for monitoring dataset shift and outliers: they capture any kind of (large) data change. When there is a drift, there is no need to change the low- and high-range values. The same holds for outlier detection: if a new maximum or minimum value is found, it is still captured.
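The three binning behaviours can be sketched in a few lines of plain Python; this is for illustration only, since popmon itself delegates histogram creation to histogrammar.

    import numpy as np
    from collections import Counter

    def normal_hist(values, low, high, num):
        # fixed range, equidistant bins
        counts, edges = np.histogram(values, bins=num, range=(low, high))
        return counts, edges

    def categorical_hist(values):
        # one bin per distinct value, created on demand
        return Counter(values)

    def sparse_hist(values, bin_width, origin=0.0):
        # open-ended: bin index -> count, new (index, bin) pairs created as needed
        indices = np.floor((np.asarray(values) - origin) / bin_width).astype(int)
        return Counter(indices.tolist())

    print(normal_hist([1.0, 2.5, 7.0], low=0, high=10, num=5)[0])
    print(categorical_hist(["cat", "dog", "cat"]))
    print(sparse_hist([0.3, 12.7, -4.1], bin_width=5.0))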
Dimensionality

A histogram can be multi-dimensional, and any combination of types is possible. The first dimension is always the time axis, which is always represented by a sparse histogram. The second dimension is the feature to monitor over time. When adding a third axis for another feature, the heatmap between those two features is created over time. For example, when monitoring financial transactions: the first axis could be time, the second axis client type, and the third axis transaction amount. Usually one feature is followed over time, or at most two. The synthetic datasets in section synthetic datasets contain examples of higher-dimensional histograms for known interactions.

Additivity

Histograms are additive. As an example, a batch of data records arrives each week. A new batch arrives, containing timestamps that were missing in a previous batch. When histograms are made of the new batch, these can be readily summed with the histograms of the previous batches. The missing records are immediately put into the right time-slices. It is important that the bin specifications are the same between different batches of data, otherwise their histograms cannot be summed and comparisons are impossible.

Limitations

There is one downside to using histograms: since the data get aggregated into bins, and profiles and statistical tests are obtained from the histograms, a slightly lower resolution is achieved than on the full dataset. In practice, however, this is a non-issue; histograms work great for data monitoring. The reference type and time-axis binning configuration allow the user to select an effective resolution.

Comparisons

In popmon the monitoring of data stability is based on statistical process control (SPC) techniques. SPC is a standard method to manage the data quality of high-volume data processing operations, for example in a large data warehouse [Eng99]. The idea is as follows. Most features have multiple sources of variation from underlying processes. When these processes are stable, the variation of a feature over time should remain within a known set of limits. The level of variation is obtained from a reference dataset, one that is deemed stable and trustworthy.

For each feature in the input data (except the time column), the stability is determined by taking the reference dataset – for example the data on which a classification model was trained – and contrasting each time slot in the input data against it.

The comparison can be done in two ways:

1) Comparisons: statistically comparing each time slot to the reference data (for example using Kolmogorov-Smirnov testing, χ2 testing, or the Pearson correlation).
2) Profiles: for example, tracking the mean of a distribution over time and contrasting it with the reference data. Similar analyses can be done for other summary statistics, such as the median, min, max or quantiles. This is related to the CUSUM technique [Pag54], a well-known method in SPC.

Reference types

Consider X to be an N-dimensional dataset representing our reference data, and X′ to be our incoming data. A covariate shift occurs when p(X) ≠ p(X′) is detected. Different choices for X and X′ may detect different types of drift (e.g. sudden, gradual, incremental). p(X) is referred to as the reference dataset.

Many change-detection algorithms use a window-based solution that compares a static reference to a test window [DKVY06], or a sliding window for both, where the reference is dynamically updated [QAWZ15]. A static reference is a wise choice for monitoring a trained classifier: the performance of such a classifier depends on the similarity of the test data to the training data. Moreover, it may pick up an incremental departure (trend) from the initial distribution that will not be significant in comparison to the adjacent time slots. A sliding reference, on the other hand, is updated with more recent data, and so incorporates this trend. Consider the case where the data contain a price field that is yearly indexed to inflation; then using a static reference may alert purely on the trend.

Reference implementations are provided for common scenarios, such as working with a fixed dataset, a batched dataset or streaming data. For instance, a fixed dataset is common for exploratory data analysis and one-off monitoring, whereas batched or streaming data is more common in a production setting. The reference may be static or dynamic. Four different reference types are possible:

1) Self-reference. Using the full dataset on which the stability report is built as a reference. This method is static: each time slot is compared to all the slots in the dataset. This is the default reference setting.
2) External reference. Using an external reference set, for example the training data of your classifier, to identify which time slots are deviating. This is also a static method: each time slot is compared to the full reference set.
3) Rolling reference. Using a rolling window on the input dataset, allowing one to compare each time slot to a window of preceding time slots. This method is dynamic: one can set the size of the window and the shift from the current time slot. By default the 10 preceding time slots are used.
4) Expanding reference. Using an expanding reference, allowing one to compare each time slot to all preceding time slots. This is also a dynamic method, with variable window size, since all available previous time slots are used. For example, with ten available time slots the window size is 9.
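Selecting one of these reference types is done when building the report. The sketch below assumes the keyword names (in particular reference_type) match the popmon release in use; consult the package documentation to confirm them.

    import pandas as pd
    import popmon

    df = pd.DataFrame({
        "date": pd.date_range("2022-1-1", periods=60, freq="D"),
        "amount": range(60),
    })

    # Rolling reference: compare each time slot to a window of preceding slots.
    report = popmon.df_stability_report(
        df,
        time_axis="date",
        time_width="1w",
        time_offset="2022-1-1",
        reference_type="rolling",   # or "self", "external", "expanding" (assumed values)
    )
    report.to_file("report_rolling.html")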
window size, since all available previous time slots are For each feature in the input data (except the time column), used. For example, with ten available time slots the the stability is determined by taking the reference dataset – for window size is 9. example the data on which a classification model was trained – and contrasting each time slot in the input data. Statistical comparisons The comparison can be done in two ways: Users may have various reasons to prefer a two-sample test over 1) Comparisons: statistically comparing each time slot another. The appropriate comparison depends on our confidence in to the reference data (for example using Kolmogorov- the reference dataset [Ric22], and certain tests may be more com- Smirnov testing, χ 2 testing, or the Pearson correlation). mon in some fields. Many common tests are related [DKVY06], 2) Profiles: for example, tracking the mean of a distribution e.g. the χ 2 function is the first-order expansion of the KL distance over time and contrasting this to the reference data. function. Similar analyses can be done for other summary statistics, Therefore, popmon provides an extensible framework that such as the median, min, max or quantiles. This is related allows users to provide custom two-sample tests using a simple to the CUsUM technique [Pag54], a well-known method syntax, via the registry pattern: in SPC. @Comparisons.register(key="jsd", description="JSD") def jensen_shannon_divergence(p, q): m = 0.5 * (p + q) Reference types return ( Consider X to be an N-dimensional dataset representing our 0.5 * reference data, and X 0 to be our incoming data. A covariate shift (kl_divergence(p, m) + kl_divergence(q, m)) ) occurs when p(X) 6= p(X 0 ) is detected. Different choices for X and X 0 may detect different types of drift (e.g. sudden, gradual, Most commonly used test statistics are implemented, such as the incremental). p(X) is referred to as the reference dataset. Population-Stability-Index and the Jensen-Shannon divergence. Many change-detection algorithms use a window-based solu- The implementations of the χ 2 and Kolmogorov-Smirnov tests tion that compares a static reference to a test window [DKVY06], account for statistical fluctuations in both the input and reference or a sliding window for both, where the reference is dynamically distributions. For example, this is relevant when comparing adja- updated [QAWZ15]. A static reference is a wise choice for mon- cent, low-statistics time slices. itoring of a trained classifier: the performance of such a classifier depends on the similarity of the test data to the training data. Profiles Moreover, it may pick up an incremental departure (trend) from Tracking the distribution of values of interest over time is achieved the initial distribution, that will not be significant in comparison to via profiles. These are functions of the input histogram. Metrics 198 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) may be defined for all dimensions (e.g. count, correlations), or for specific dimensions as in the case of 1D numerical histograms (e.g. quantiles). Extending the existing set of profiles is possible via a syntax similar as above: @Profiles.register( key=["q5", "q50", "q95"], description=[ "5% percentile", "50% percentile (median)", "95% percentile" ], dim=1, type="num" ) def profile_quantiles(values, counts): return logic_goes_here(values, counts) Denote xi (t) as the profile i of feature x at time t, for example the Fig. 3: A snapshot of part of the HTML stability report. 
Denote x_i(t) as profile i of feature x at time t, for example the 5% quantile of the histogram of incoming transaction amounts in a given week. Identical bin specifications are assumed between the reference and incoming data. x̄_i is defined as the average of that metric on the reference data, and σ_{x_i} as the corresponding standard deviation.

The normalized residual between the incoming and reference data, also known as the "pull" or "Z-score", is given by:

    pull_i(t) = (x_i(t) − x̄_i) / σ_{x_i}

When the underlying sources of variation are stable, and assuming the reference dataset is asymptotically large and independent of the incoming data, pull_i(t) follows a normal distribution centered around zero and with unit width, N(0, 1), as dictated by the central limit theorem [Fis11].

In practice, the criteria for normality are hardly ever met. Typically the distribution is wider with larger tails. Yet, approximately normal behaviour is exhibited. Chebyshev's inequality [Che67] guarantees that, for a wide class of distributions, no more than 1/k² of the distribution's values can be k or more standard deviations away from the mean. For example, a minimum of 75% (88.9%) of values must lie within two (three) standard deviations of the mean. These boundaries reoccur in Sec. dynamic monitoring rules.

Alerting

For alerting, popmon uses traffic-light-based monitoring rules, raising green, yellow or red alerts to the user. Green alerts signal the data are fine, yellow alerts serve as warnings of meaningful deviations, and red alerts need critical attention. These monitoring rules can be static or dynamic, as explained in this section.

Static monitoring rules

Static monitoring rules are traditional data quality rules (e.g. [RD00]). Denote x_i(t) as metric i of feature x at time t, for example the number of NaNs encountered in feature x on a given day. As an example, the following traffic lights might be set on x_i(t):

    TL(x_i, t) = Green   if x_i(t) ≤ 1
                 Yellow  if 1 < x_i(t) ≤ 10
                 Red     if x_i(t) > 10

The thresholds of this monitoring rule are fixed, and considered static over time. They need to be set by hand, to sensible values. This requires domain knowledge of the data and the processes that produce them. Setting these traffic light ranges is a time-costly process when covering many features and corresponding metrics.

Dynamic monitoring rules

Dynamic monitoring rules are complementary to static rules. The levels of variation in feature metrics are assumed to have been measured on the reference data. Per feature metric, incoming data are compared against the reference levels. When (significantly) outside of the known bounds, instability of the underlying sources is assumed, and a warning gets raised to the user.

popmon's dynamic monitoring rules raise traffic lights to the user whenever the normalized residual pull_i(t) falls outside certain, configurable ranges. By default:

    TL(pull_i, t) = Green   if |pull_i(t)| ≤ 4
                    Yellow  if 4 < |pull_i(t)| ≤ 7
                    Red     if |pull_i(t)| > 7

If the reference dataset is changing over time, the effective ranges on x_i(t) can change as well. The advantage of this approach over static rules is that significant deviations in the incoming data can be flagged and alerted to the user for a large set of features and corresponding metrics, requiring little (or no) prior knowledge of the data at hand. The relevant knowledge is all extracted from the reference dataset.

With multiple feature metrics, many dynamic monitoring tests can be performed on the same dataset. This raises the multiple comparisons problem: the more inferences are made, the more likely erroneous red flags are raised. To compensate for a large number of tests being made, typically one can set wider traffic light boundaries, reducing the false positive rate.² The boundaries control the size of the deviations - or number of red and yellow alerts - that the user would like to be informed of.

2. Alternatively one may apply the Bonferroni correction to counteract this problem [Bon36].
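A compact, illustrative sketch of the pull computation and the default dynamic bounds described above (not popmon's internal code) is:

    import numpy as np

    def pull(x_t, x_ref_mean, x_ref_std):
        # normalized residual ("Z-score") of a profile value against the reference
        return (x_t - x_ref_mean) / x_ref_std

    def traffic_light(pull_value, yellow=4.0, red=7.0):
        # default dynamic bounds: green <= 4 < yellow <= 7 < red
        p = abs(pull_value)
        if p > red:
            return "red"
        if p > yellow:
            return "yellow"
        return "green"

    reference = np.array([10.1, 9.8, 10.3, 10.0, 9.9])  # profile values on the reference
    incoming = 12.5                                      # profile value in a new time slot
    z = pull(incoming, reference.mean(), reference.std())
    print(z, traffic_light(z))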
Reporting

popmon outputs monitoring results as HTML stability reports. The reports offer multiple views of the data (histograms and heatmaps), the profiles and comparisons, and traffic light alerts. There are several reasons for providing self-contained reports: they can be opened in the browser, easily shared, stored as artifacts, and tracked using tools such as MLflow. The reports also have no need for an advanced infrastructure setup, and can be created and viewed in many environments: from a local machine, to a (restricted) environment, to a public cloud. If, however, a certain dashboarding tool is available, then the metrics computed by popmon are exposed and can be exported into that tool, for example Kibana [Ela22].

Fig. 3: A snapshot of part of the HTML stability report. It shows the aggregated traffic light overview. This view can be used to prioritize features for inspection.

One downside of producing self-contained reports is that they can get large when the plots are pre-rendered and embedded. This is mitigated by embedding plots as JSON that are (lazily) rendered on the client side. Plotly express [Plo22] powers the interactive embedded plots in popmon as of v1.0.0.

Note that multiple reference types can be used in the same stability report. For instance, popmon's default reference pipelines always include a rolling comparison with window size 1, i.e. comparing to the preceding time slot.

Synthetic datasets

In the literature, synthetic datasets are commonly used to test the effectiveness of dataset shift monitoring approaches [LLD+18]. One can test the detection of all kinds of shifts, as the generation process controls when and how the shift happens. popmon has been tested on multiple such artificial datasets: Sine1, Sine2, Mixed, Stagger, Circles, LED, SEA and Hyperplane [PVP18], [SK], [Fan04]. These datasets cover myriad dataset shift characteristics: sudden and gradual drifts, dependency of the label on just one or multiple features, binary and multiclass labels, and unrelated features. The dataset descriptions and sample popmon configurations are available in the code repository.

The reports generated by popmon capture the features and time bins where dataset shift is occurring for all tested datasets. Interactions between features and the label can be used for feature selection, in addition to monitoring the individual feature distributions. The sudden and gradual drifts are clearly visible using a rolling reference, see Fig. 4 for examples. The drift in the Hyperplane dataset, incremental and gradual, is not expected to be detected using a rolling reference or self-reference. Moreover, the dataset is synthesized so that the distribution of the features and the class balance do not change [Fan04].

Fig. 4: LED: Pearson correlation compared with the previous histogram. The shifting points are correctly identified at every 5th of the LED dataset. Similar patterns are visible for other comparisons, e.g. χ2.

Fig. 5: Sine1: The dataset shifts around data points 20,000, 40,000, 60,000 and 80,000 of the Sine1 dataset are clearly visible.

The process to monitor this dataset could be set up in multiple ways, one of which is described here. A logistic regression model is trained on the first 10% of the data, which is also used as a static reference. The predictions of this model are added to the dataset, simulating a machine learning model in production. popmon is able to pick up the divergence between the predictions and the class label, as depicted in Figure 6.

Fig. 6: Hyperplane: The incremental drift compared to the reference dataset is observed for the PhiK correlation between the predictions and the label.
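A rough sketch of such a setup, on a toy stand-in for the Hyperplane data, could look as follows. The column names, the "batch" time axis, and the reference/reference_type keywords are assumptions to be verified against the popmon documentation, not the configuration used for the figures above.

    import numpy as np
    import pandas as pd
    import popmon
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in for a streaming dataset with two features and a binary label
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
    df["label"] = (df["f1"] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
    df["batch"] = np.arange(len(df)) // 100  # artificial, ordered time axis

    # Train on the first 10% and add predictions, simulating a model in production
    reference = df.iloc[: len(df) // 10].copy()
    model = LogisticRegression().fit(reference[["f1", "f2"]], reference["label"])
    df["prediction"] = model.predict(df[["f1", "f2"]])
    reference["prediction"] = model.predict(reference[["f1", "f2"]])

    # Static external reference: the training slice of the data
    report = popmon.df_stability_report(
        df,
        time_axis="batch",
        reference_type="external",
        reference=reference,
    )
    report.to_file("hyperplane_style_report.html")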
Reporting

popmon outputs monitoring results as HTML stability reports. The reports offer multiple views of the data (histograms and heatmaps), the profiles and comparisons, and traffic light alerts. There are several reasons for providing self-contained reports: they can be opened in the browser, easily shared, stored as artifacts, and tracked using tools such as MLFlow. The reports also have no need for an advanced infrastructure setup, and are possible to create and view in many environments: from a local machine, a (restricted) environment, to a public cloud. If, however, a certain dashboarding tool is available, then the metrics computed by popmon are exposed and can be exported into that tool, for example Kibana [Ela22].
One downside of producing self-contained reports is that they can get large when the plots are pre-rendered and embedded. This is mitigated by embedding plots as JSON that are (lazily) rendered on the client-side. Plotly express [Plo22] powers the interactive embedded plots in popmon as of v1.0.0.
Note that multiple reference types can be used in the same stability report. For instance, popmon's default reference pipelines always include a rolling comparison with window size 1, i.e. comparing to the preceding time slot.

Synthetic datasets

In the literature synthetic datasets are commonly used to test the effectiveness of dataset shift monitoring approaches [LLD+18]. One can test the detection for all kinds of shifts, as the generation process controls when and how the shift happens. popmon has been tested on multiple of such artificial datasets: Sine1, Sine2, Mixed, Stagger, Circles, LED, SEA and Hyperplane [PVP18], [SK], [Fan04]. These datasets cover myriad dataset shift characteristics: sudden and gradual drifts, dependency of the label on just one or multiple features, binary and multiclass labels, and containing unrelated features. The dataset descriptions and sample popmon configurations are available in the code repository.

Fig. 4: LED: Pearson correlation compared with previous histogram. The shifting points are correctly identified at every 5th of the LED dataset. Similar patterns are visible for other comparisons, e.g. χ².
Fig. 5: Sine1: The dataset shifts around data points 20.000, 40.000, 60.000 and 80.000 of the Sine1 dataset are clearly visible.
Fig. 6: Hyperplane: The incremental drift compared to the reference dataset is observed for the PhiK correlation between the predictions and the label.

The reports generated by popmon capture features and time bins where the dataset shift is occurring for all tested datasets. Interactions between features and the label can be used for feature selection, in addition to monitoring the individual feature distributions. The sudden and gradual drifts are clearly visible using a rolling reference, see Fig. 4 for examples. The drift in the Hyperplane dataset, incremental and gradual, is not expected to be detected using a rolling reference or self-reference. Moreover, the dataset is synthesized so that the distribution of the features and the class balance does not change [Fan04].
The process to monitor this dataset could be set up in multiple ways, one of which is described here. A logistic regression model is trained on the first 10% of the data, which is also used as static reference. The predictions of this model are added to the dataset, simulating a machine learning model in production. popmon is able to pick up the divergence between the predictions and the class label, as depicted in Figure 6.

Conclusion

This paper has presented popmon, an open-source Python package to check the stability of a tabular dataset. Built around histogram-based monitoring, it runs on a dataset of arbitrary size, supporting both pandas and Spark dataframes. Using the variations observed in a reference dataset, popmon can automatically detect and flag deviations in incoming data, requiring little prior domain knowledge. As such, popmon is a scalable solution that can be applied to many datasets. By default its findings get presented in a single HTML report. This makes popmon ideal for both exploratory data analysis and as a monitoring tool for machine learning models running in production. We believe the combination of out-of-the-box performance and presented features makes popmon an excellent addition to the data practitioner's toolbox.

Acknowledgements

We thank our colleagues from the ING Analytics Wholesale Banking team for fruitful discussions, all past contributors to popmon, and in particular Fabian Jansen and Ilan Fridman Rojas for carefully reading the manuscript. This work is supported by ING Bank.
References

[Ans73] F.J. Anscome. Graphs in statistical analysis. American Statistician, 27(1), pages 17–21, 1973. URL: https://doi.org/10.2307/2682899, doi:10.2307/2682899.
[Bon36] Carlo Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8:3–62, 1936.
[Che67] Pafnutii Lvovich Chebyshev. Des valeurs moyennes, liouville's. J. Math. Pures Appl., 12:177–184, 1867.
[DKVY06] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2006.
[Ela22] Elastic. Kibana, 2022. URL: https://github.com/elastic/kibana.
[Eng99] Larry English. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. Wiley, 1999.
[Fan04] Wei Fan. Systematic data selection to mine concept-drifting data streams. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 128–137, New York, NY, USA, 2004. Association for Computing Machinery. URL: https://doi.org/10.1145/1014052.1014069, doi:10.1145/1014052.1014069.
[Fis11] Hans Fischer. The Central Limit Theorem from Laplace to Cauchy: Changes in Stochastic Objectives and in Analytical Methods, pages 17–74. Springer New York, New York, NY, 2011. URL: https://doi.org/10.1007/978-0-387-87857-7_2, doi:10.1007/978-0-387-87857-7_2.
[GCSG22] Abe Gong, James Campbell, Superconductive, and Great Expectations. Great Expectations, 2022. URL: https://github.com/great-expectations/great_expectations, doi:10.5281/zenodo.5683574.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.
[KVLC+20] Janis Klaise, Arnaud Van Looveren, Clive Cox, Giovanni Vacanti, and Alexandru Coca. Monitoring and explainability of models in production. arXiv preprint arXiv:2007.06299, 2020. URL: https://doi.org/10.48550/arXiv.2007.06299, doi:10.48550/arXiv.2007.06299.
[LL17] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
[LLD+18] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018. doi:10.1109/TKDE.2018.2876857.
[LPO17] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.
[LWS18] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122–3130. PMLR, 2018.
[MF17] Justin Matejka and George Fitzmaurice. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 1290–1294, 2017. URL: https://doi.org/10.1145/3025453.3025912, doi:10.1145/3025453.3025912.
[Pag54] Ewas S Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954. URL: https://doi.org/10.2307/2333009, doi:10.2307/2333009.
[pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134.
[Plo22] Plotly Development Team. Plotly.py: The interactive graphing library for Python (includes Plotly Express), 6 2022. URL: https://github.com/plotly/plotly.py.
[PS21] Jim Pivarski and Alexey Svyatkovskiy. histogrammar/histogrammar-scala: v1.0.20, April 2021. URL: https://doi.org/10.5281/zenodo.4660177, doi:10.5281/zenodo.4660177.
[PSSE16] Jim Pivarski, Alexey Svyatkovskiy, Ferdinand Schenck, and Bill Engels. histogrammar-python: 1.0.0, September 2016. URL: https://doi.org/10.5281/zenodo.61418, doi:10.5281/zenodo.61418.
[PVP18] Ali Pesaranghader, Herna Viktor, and Eric Paquet. Reservoir of diverse adaptive learners and stacking fast hoeffding drift detection methods for evolving data streams. Machine Learning, 107(11):1711–1743, 2018. URL: https://doi.org/10.1007/s10994-018-5719-z, doi:10.1007/s10994-018-5719-z.
[QAWZ15] Abdulhakim A Qahtan, Basma Alharbi, Suojin Wang, and Xiangliang Zhang. A pca-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944, 2015. doi:10.1145/2783258.2783359.
[QCSSL08] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.
[RD00] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[RGL19] Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/846c260d715e5b854ffad5f70a516c88-Abstract.html.
[Ric22] Oliver E Richardson. Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function. In International Conference on Artificial Intelligence and Statistics, pages 2706–2735. PMLR, 2022.
[SK] W Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 377–382, New York, NY, USA. Association for Computing Machinery. URL: https://doi.org/10.1145/502512.502568, doi:10.1145/502512.502568.
[SLL20] Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Visualizing the impact of feature attribution baselines. Distill, 2020. https://distill.pub/2020/attribution-baselines, doi:10.23915/distill.00022.
[SLS+18] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. Automating large-scale data quality verification. Proc. VLDB Endow., 11(12):1781–1794, August 2018. URL: https://doi.org/10.14778/3229863.3229867, doi:10.14778/3229863.3229867.
[VGO+20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
[VLKV+22] Arnaud Van Looveren, Janis Klaise, Giovanni Vacanti, Oliver Cobb, Ashley Scillitoe, and Robert Samoilescu. Alibi Detect: Algorithms for outlier, adversarial and drift detection, 4 2022. URL: https://github.com/SeldonIO/alibi-detect.
[WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi:10.25080/Majora-92bf1922-00a.
[ZXW+16] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache spark: A unified engine for big data processing. Commun. ACM, 59(11):56–65, October 2016. URL: https://doi.org/10.1145/2934664, doi:10.1145/2934664.
pyDAMPF: a Python package for modeling mechanical properties of hygroscopic materials under interaction with a nanoprobe

Willy Menacho‡§, Gonzalo Marcelo Ramírez-Ávila‡§, Horacio V. Guzman¶‖‡§∗

Abstract—pyDAMPF is a tool oriented to the Atomic Force Microscopy (AFM) community, which allows the simulation of the physical properties of materials under variable relative humidity (RH). In particular, pyDAMPF is mainly focused on the mechanical properties of polymeric hygroscopic nanofibers that play an essential role in designing tissue scaffolds for implants and filtering devices. Those mechanical properties have been mostly studied from a very coarse perspective reaching a micrometer scale. However, at the nanoscale, the mechanical response of polymeric fibers becomes cumbersome due to both experimental and theoretical limitations. For example, the response of polymeric fibers to RH demands advanced models that consider sub-nanometric changes in the local structure of each single polymer chain. From an experimental viewpoint, choosing the optimal cantilevers to scan the fibers under variable RH is not trivial.
In this article, we show how to use pyDAMPF to choose one optimal nanoprobe for planned experiments with a hygroscopic polymer. Along these lines, we show how to evaluate common and non-trivial operational parameters from an AFM cantilever of different manufacturers. Our results show in a stepwise approach the most relevant parameters to compare the cantilevers based on a non-invasive criterion of measurements. The computing engine is written in Fortran, and wrapped into Python. This aims to reuse physics code without losing interoperability with high-level packages. We have also introduced an in-house and transparent method for allowing multi-thread computations to the users of the pyDAMPF code, which we benchmarked for various computing architectures (PC, Google Colab and an HPC facility) and results in very favorable speed-up compared to former AFM simulators.

Index Terms—Materials science, Nanomechanical properties, AFM, f2py, multi-threading CPUs, numerical simulations, polymers

‡ Instituto de Investigaciones Físicas.
§ Carrera de Física, Universidad Mayor de San Andrés. Campus Universitario Cota Cota. La Paz, Bolivia
¶ Department of Theoretical Physics
‖ Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
* Corresponding author: horacio.guzman@ijs.si

Copyright © 2022 Willy Menacho et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction and Motivation

This article provides an overview of pyDAMPF, which is a BSD licensed, Python and Fortran modeling tool that enables AFM users to simulate the interaction between a probe (cantilever) and materials at the nanoscale under diverse environments. The code is packaged in a bundle and hosted on GitHub at (https://github.com/govarguz/pyDAMPF).
Despite the recent open-source availability of dynamic AFM simulation packages [GGG15], [MHR08], a broad usage for the assessment and planning of experiments has yet to come. One of the problems is that it is often hard to simulate several operational parameters at once. For example, most scientists evaluate different AFM cantilevers before starting new experiments. A typical evaluation criterion is the maximum exerted force that prevents invasivity of the nanoprobe into the sample. The variety of AFM cantilevers depends on the geometrical and material characteristics used for its fabrication. Moreover, manufacturers' nanofabrication techniques may change from time to time, according to the necessities of the experiments, like sharper tips and/or higher oscillation frequencies. From a simulation perspective, evaluating observables for reaching optimal results on upcoming experiments is nowadays possible for tens or hundreds of cantilevers, on top of other operational parameters in the case of dynamic AFM like the oscillation amplitude A0 and set-point Asp, among other expected material properties that may feed simulations and create simulation batches of easily thousands of cases. Given this context, we focus this article on choosing a cantilever out of an initial pyDAMPF database of 30. In fact, many of them are similar in terms of spring constant kc, cantilever volume Vc and also tip radius RT. Then we focus on seven archetypical and distinct cases/cantilevers to understand the characteristics of each of the parameters specified in the manufacturers' datasheets, by evaluating the maximum (peak) forces.
We present four scenarios comparing a total of seven cantilevers and the same sample, where we use as a test-case a Poly-Vinyl Acetate (PVA) fiber. The first scenario (Figure 1) illustrates the difference between air and a moist environment. In the second one, only very soft and stiff cantilever spring constants are compared (see Figure 2). At the same time, the different volumes along the 30 cantilevers are depicted in Figure 3. A final and very common comparison is scenario 4, which compares one of the parameters most sensitive to the force, the tip's radius (see Figure 4).
The quantitative analysis for these four scenarios is presented, together with the advantages of computing several simulation cases at once with our in-house development. Such a comparison is performed under the most common computers used in science, namely, personal computers (PC), cloud (Colab) and supercomputing (small Xeon based cluster). We reach a speed-up of 20 over the former implementation [GGG15].
Another novelty of pyDAMPF is the detailed calculation [GS05] of the environmental-related parameters, like the quality factor Q.
Here, we summarize the main features of pyDAMPF:

• Highly efficient structure in terms of time-to-result, at least one order of magnitude faster than existing approaches.
• Easy to use for scientists without a computing background, in particular in the use of multi-threads.
• It supports the addition of further AFM cantilevers and parameters into the code database.
• Allows an interactive analysis, including a graphical and table-based comparison of results through Jupyter Notebooks.

The results presented in this article are available as a Google Colaboratory notebook, which facilitates exploring pyDAMPF and these examples.

Methods

Processing inputs:
pyDAMPF comes with an initial database of 30 cantilevers, which can be extended at any time by editing the file cantilevers_data.txt. Then, the program inputs_processor.py reads the cantilever database and asks for further physical and operational variables required to start the simulations. This will generate tempall.txt, which contains all cases (e.g. 30) to be simulated with pyDAMPF:

    import os
    import shutil
    import numpy as np

    def inputs_processor(variables, data):
        a, b = np.shape(data)
        # gran_permutador (defined elsewhere in pyDAMPF) builds the table of simulation cases.
        final = gran_permutador(variables, data)
        f_name = 'tempall.txt'
        np.savetxt(f_name, final)
        directory = os.getcwd()
        shutil.copy(directory + '/tempall.txt',
                    directory + '/EXECUTE_pyDAMPF/')
        shutil.copy(directory + '/tempall.txt',
                    directory + '/EXECUTE_pyDAMPF/pyDAMPF_BASE/nrun/')

The variables inside the argument of the function inputs_processor are interactively requested from a shell command line. Then the file tempall.txt is generated and copied to the folders that will contain the simulations.
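To give a feel for what the generated tempall.txt holds, the sketch below builds a small table of simulation cases by permuting cantilever rows with a few operational parameters. It only illustrates the idea behind the gran_permutador helper referenced above; the column layout, parameter choices and numbers are our assumptions, not pyDAMPF's actual file format.

    import itertools
    import numpy as np

    def build_cases(cantilevers, humidities, amplitudes):
        """Return one row per (cantilever, RH, A0) combination."""
        cases = []
        for row, rh, a0 in itertools.product(cantilevers, humidities, amplitudes):
            cases.append(list(row) + [rh, a0])
        return np.array(cases)

    # Three hypothetical cantilevers: spring constant k_c [N/m], tip radius R_T [nm].
    cantilevers = [(0.8, 8.0), (2.7, 8.0), (2.7, 20.0)]
    cases = build_cases(cantilevers, humidities=[0.0, 29.5, 60.1], amplitudes=[10.0])
    np.savetxt('tempall_example.txt', cases)
    print(cases.shape)   # (9, 4): 3 cantilevers x 3 RH values x 1 amplitude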
Execute pyDAMPF

For execution in a single or multi-thread way, we first require wrapping our numeric core from Fortran to Python by using f2py [Vea20], namely the file pyDAMPF.f90 within the folder EXECUTE_pyDAMPF.
Compilation with f2py: This step is only required once and depends on the computer architecture; the command for this reads:

    f2py -c --fcompiler=gnu95 pyDAMPF.f90 -m mypyDAMPF

This command line generates mypyDAMPF.so, which will be automatically located in the simulation folders.
Once we have obtained the numerical code as Python modules, we need to choose the execution mode, which can be serial or parallel. Here, parallel refers to multi-threading capabilities only, within this first version of the code.

Serial method: This method is completely transparent to the user and will execute all the simulation cases found in the file tempall.txt by running the script inputs_processor.py. Our in-house development creates an individual folder for each simulation case, which can be executed in one thread.

    # gen_limites and change_dir are pyDAMPF helper functions defined elsewhere in the package.
    def serial_method(tcases, factor, tempall):
        lst = gen_limites(tcases, factor)
        change_dir()
        for i in range(1, factor + 1):
            direc = os.getcwd()
            direc2 = direc + '/pyDAMPF_BASE/'
            direc3 = direc + '/SERIALBASIC_0/' + str(i) + '/'
            shutil.copytree(direc2, direc3)
        os.chdir(direc + '/SERIALBASIC_0/1/nrun/')
        exec(open('generate_cases.py').read())

As arguments, the serial method requires the total number of simulation cases obtained from tempall.txt. In contrast, the factor parameter has, in this case, a default value of 1.

Parallel method: The parallel method uses more than one computational thread. It is similar to the serial method; however, this method distributes the total load along the available threads and executes in a parallel fashion. This method comprises two parts: first, a function that takes care of the bookkeeping of cases and folders:

    def Parallel_method(tcases, factor, tempall):
        lst = gen_limites(tcases, factor)
        change_dir()
        for i in range(1, factor + 1):
            lim_inferior = lst[i - 1][0]
            lim_superior = lst[i - 1][1]
            direc = os.getcwd()
            direc2 = direc + '/pyDAMPF_BASE/'
            direc3 = direc + '/SERIALBASIC_0/' + str(i) + '/'
            shutil.copytree(direc2, direc3)
            factorantiguo = 'factor=1'
            factornuevo = 'factor=' + str(factor)
            rangoantiguo = '(0,paraleliz)'
            rangonuevo = '(' + str(lim_inferior) + ',' + str(lim_superior) + ')'
            os.chdir(direc + '/PARALLELBASIC_0/' + str(i))
            pyname = 'nrun/generate_cases.py'
            newpath = direc + '/PARALLELBASIC_0/' + str(i) + '/' + pyname
            # reemplazo (defined elsewhere) rewrites the given strings inside the copied generate_cases.py.
            reemplazo(newpath, factorantiguo, factornuevo)
            reemplazo(newpath, rangoantiguo, rangonuevo)
            os.chdir(direc)

This part generates serial-like folders for each thread's number of cases to be executed.
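The listings above rely on a helper, gen_limites, that splits the total number of cases into one (lower, upper) range per thread. Its implementation is not shown in the paper; the sketch below is only our guess at the kind of bookkeeping it performs, meant to clarify how the per-thread ranges consumed by Parallel_method could be produced.

    def gen_limites_sketch(tcases, factor):
        """Split tcases simulation cases into `factor` contiguous (lower, upper) ranges."""
        per_thread = tcases // factor
        limits = []
        for i in range(factor):
            lower = i * per_thread
            # The last thread takes whatever remains after the integer division.
            upper = tcases if i == factor - 1 else (i + 1) * per_thread
            limits.append((lower, upper))
        return limits

    print(gen_limites_sketch(90, 4))   # [(0, 22), (22, 44), (44, 66), (66, 90)]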
The second part of the parallel method will execute pyDAMPF, and contains two scripts: one for executing pyDAMPF on a common UNIX based desktop or laptop, while the second is a Python script that generates SLURM code to launch jobs in HPC facilities.

• Execution with SLURM
It runs pyDAMPF in different threads under the SLURM queuing system.

    def cluster(factor):
        for i in range(1, factor + 1):
            with open('jobpyDAMPF' + str(i) + '.x', 'w') as ssf:
                ssf.write('#!/bin/bash\n')
                ssf.write('#SBATCH --time=23:00:00\n')
                ssf.write('#SBATCH --constraint=epyc3\n')
                ssf.write('\n')
                ssf.write('ml Anaconda3/2019.10\n')
                ssf.write('\n')
                ssf.write('ml foss/2018a\n')
                ssf.write('\n')
                ssf.write('cd /home/$USER/pyDAMPF/EXECUTE_pyDAMPF/PARALLELBASIC_0/' + str(i) + '/nrun\n')
                ssf.write('\n')
                ssf.write('echo $PWD\n')
                ssf.write('\n')
                ssf.write('python3 generate_cases.py\n')
            os.system('sbatch jobpyDAMPF' + str(i) + '.x')
            os.system('rm jobpyDAMPF' + str(i) + '.x')

The above script generates SLURM jobs for a chosen set of threads; after being launched, those job files are erased in order to improve bookkeeping.

• Parallel execution with UNIX based laptops or desktops
Usually, microscope (AFM) computers have no SLURM pre-installed; for such a configuration, we run the following script:

    def compute(factor):
        direc = os.getcwd()
        for i in range(1, factor + 1):
            os.chdir(direc + '/PARALLELBASIC_0/' + str(i) + '/nrun')
            os.system('python3 generate_cases.py &')
            os.chdir(direc)

This function allows the proper execution of the parallel case without a queuing system, where a slight delay might appear from thread to thread execution.

Analysis

Graphically:
• With static graphics, as shown in Figures 5, 9, 13 and 17:

    python3 Graphical_analysis.py

• With interactive graphics, as shown in Figure 18:

    pip install plotly
    jupyter notebook Graphical_analysis.ipynb

Quantitatively:
• With static data tables:

    python3 Quantitative_analysis.py

• With interactive tables: Quantitative_analysis.ipynb uses a minimalistic dashboard application for tabular data visualization, tabloo, with easy installation:

    pip install tabloo
    jupyter notebook Quantitative_analysis.ipynb

Results and discussions

In Figure 1, we show four scenarios to be tackled in this test-case for pyDAMPF. As described in the introduction, the first scenario (Figure 1) compares between air and a moist environment, the second tackles soft and stiff cantilevers (see Figure 2), next is Figure 3, with the cantilever volume comparison, and

Fig. 1: Schematic of the tip-sample interface comparing air at a given Relative Humidity with air.
Fig. 2: Schematic of the tip-sample interface comparing a hard (stiff) cantilever with a soft cantilever.
Fig. 3: Schematic of the tip-sample interface comparing a cantilever with a high volume compared with a cantilever with a small volume.
Fig. 4: Schematic of the tip-sample interface comparing a cantilever with a wide tip with a cantilever with a sharp tip.
Fig. 6: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete oscillations are shown corresponding to a hard (stiff) cantilever with a soft cantilever. The simulations were performed for Asp/A0 = 0.8.
Fig.
5: Time-varying force for PVA at RH = 60.1% for different Fig. 7: Time-varying force for PVA at RH = 60.1% for different cantilevers. The simulations show elastic (Hertz) responses. For each cantilevers. The simulations show elastic (Hertz) responses. For each curve, the maximum force value is the peak force. Two complete curve, the maximum force value is the peak force. Two complete os- oscillations are shown corresponding to air at a given Relative cillations are shown corresponding to a cantilever with a high volume Humidity with air. The simulations were performed for Asp /A0 = 0.8 compared with a cantilever with a small volume. The simulations were . performed for Asp /A0 = 0.8 . the force the tip’s radio (see Figure 4). Further details of the cantilevers depicted here are included in Table 22. The AFM is widely used for mechanical properties mapping of matter [Gar20]. Hence, the first comparison of the four scenarios points out to the force response versus time according to a Hertzian interaction [Guz17]. In Figure 5, we see the humid air (RH = 60.1%) changes the measurement conditions by almost 10%. Using a stiffer cantilever (kc = 2.7[N/m]) will also increase the force by almost 50% from the softer one (kc = 0.8[N/m]), see Figure 6. Interestingly, the cantilever’s volume, a smaller cantilever, results in the highest force by almost doubling the force by almost five folds of the smallest volume (Figure 7). Finally, the Tip radius difference between 8 and 20 nm will impact the force in roughly 40 pN (Figure 8). Fig. 8: Time-varying force for PVA at RH = 60.1% for different Now, if we consider literature values for different cantilevers. The simulations show elastic (Hertz) responses. For each RH [FCK+ 12], [HLLB09], we can evaluate the Peak or Maximum curve, the maximum force value is the peak force. Two complete Forces. This force in all cases depicted in Figure 9 shows a oscillations are shown corresponding to a cantilever with a wide tip monotonically increasing behavior with the higher Young mod- with a cantilever with a sharp tip. The simulations were performed ulus. Remarkably, the force varies in a range of 25% from dried for Asp /A0 = 0.8 . 206 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 9: Peak force reached for a PVA sample subjected to different Fig. 11: Peak force reached for a PVA sample subjected to different relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding to to air at a given Relative Humidity with air. The simulations were a cantilever with a high volume compared with a cantilever with a performed for Asp /A0 = 0.8 . small volume. The simulations were performed for Asp /A0 = 0.8 . Fig. 10: Peak force reached for a PVA sample subjected to different relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding to a hard (stiff) cantilever with a soft cantilever. The simulations were Fig. 12: Peak force reached for a PVA sample subjected to different performed for Asp /A0 = 0.8 . relative humidities 0.0%, 29.5%, 39.9% and 60.1% corresponding to a cantilever with a wide tip with a cantilever with a sharp tip. The simulations were performed for Asp /A0 = 0.8 . PVA to one at RH = 60.1% (see Figure 9). In order to properly describe operational parameters in dy- namic AFM we analyze the peak force dependence with the set- point amplitude Asp . In Figure 13, we have the comparison of peak forces for the different cantilevers as a function of Asp . 
The sensitivity of the peak force is higher for the type of cantilevers with varying kc and Vc . Nonetheless, the peak force dependence given by the Hertzian mechanics has a dependence with the square root of the tip radius, and for those Radii on Table 22 are not influencing the force much. However, they could strongly influence resolution [GG13]. Figure 17 shows the dependence of the peak force as a function of kc , Vc , and RT , respectively, for all the cantilevers listed in Table 22; constituting a graphical summary of the seven analyzed cantilevers for completeness of the analysis. Another way to summarize the results in AFM simulations if to show the Force vs. Distance curves (see Fig. 18), which in these case show exactly how for example a stiffer cantilever may Fig. 13: Dependence of the maximum force on the set-point amplitude penetrate more into the sample by simple checking the distance corresponding to air at a given Relative Humidity with air. cantilever e reaches. On the other hand, it also jumps into the PYDAMPF: A PYTHON PACKAGE FOR MODELING MECHANICAL PROPERTIES OF HYGROSCOPIC MATERIALS UNDER INTERACTION WITH A NANOPROBE 207 Fig. 14: Dependence of the maximum force on the set-point amplitude Fig. 17: Dependence of the maximum force with the most important corresponding to a hard (stiff) cantilever with a soft cantilever. characteristics of each cantilever, filtering the cantilevers used for the scenarios , the figure shows maximum force dependent on the: (a) force constant k, (b) cantilever tip radius, and (c) cantilever volume, respectively. The simulations were performed for $A_{sp}/A_{0}$ = 0.8. Fig. 15: Dependence of the maximum force on the set-point amplitude corresponding to a cantilever with a high volume compared with a cantilever with a small volume. Fig. 18: Three-dimensional plots of the various cantilevers provided by the manufacturer and those in the pyDAMPF database that establish a given maximum force at a given distance between the tip and the sample for a PVA polymer subjected to RH= 0% with E = 930 [MPa]. eyes that a cantilever with small volume f has less damping from the environment and thus it also indents more than the ones with higher volume. Although these type of plots are the easiest to make, they carry lots of experimental information. In addition, pyDAMPF can plot such 3D figures interactively that enables a detailed comparison of those curves. Fig. 16: Dependence of the maximum force on the set-point amplitude As we aim a massive use of pyDAMPF, we also perform the corresponding to a cantilever with a wide tip with a cantilever with a corresponding benchmarks on four different computing platforms, sharp tip. where two of them resembles the standard PC or Laptop found at the labs, and the other two aim to cloud and HPC facilities, 208 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 19: Three-dimensional plots of the various cantilevers provided Fig. 21: Speed up parallel method. by the manufacturer and those in the pyDAMPF database that establish a given maximum force at a given distance between the tip and the sample for a PVA polymer subjected to RH = 60.1% with E = 248.8 [MPa]. Fig. 22: Data used for Figs. 5, 9 and 13 with an A0 = 10[nm] . Observe that the quality factor and Young’s modulus have three different values respectively for RH1 = 29.5%, RH2 = 39.9% y RH3 = 60.1%. ∗∗ The values presented for Quality Factor Q were calculated at Google Colaboratory notebook Q calculation, using the method proposed by [GS05], [Sad98]. 
Fig. 23: Computers used to run pyDAMPF and Former work [GGG15], ∗ the free version of Colab provides this capability, there are two paid versions which provide much greater capacity, these versions known as Colab Pro and Colab Pro+ are only available in Fig. 20: Comparison of times taken by both the parallel method and some countries. the serial method. respectively (see Table 23 for details). Figure 20 shows the average run time for the serial and parallel implementation. Despite a slightly higher performance for the case of the HPC cluster nodes, a high-end computer (PC 2) may also reach similar values, which is our current goal. Another striking aspect observed by looking at the speed-up, is the maximum and minimum run times, which notoriously show the on-demand character of cloud services. As their maxima and minima show the highest variations. To calculate the speed up we use the following equation: ttotal S= tthread Where S is the speed up , tT hread is the execution time of a Fig. 24: Execution times per computational thread, for each computer. computational thread, and tTotal is the sum of times, shown in Note that each Thread consists of 9 simulation cases, with a sum time showing the total of 90 cases for evaluating 3 different Young moduli the table 24. For our calculations we used the highest, the average and 30 cantilevers at the same time. and the lowest execution time per thread. PYDAMPF: A PYTHON PACKAGE FOR MODELING MECHANICAL PROPERTIES OF HYGROSCOPIC MATERIALS UNDER INTERACTION WITH A NANOPROBE 209 Limitations R EFERENCES The main limitation of dynamic AFM simulators based in con- [FCK+ 12] Kathrin Friedemann, Tomas Corrales, Michael Kappl, Katharina Landfester, and Daniel Crespy. Facile and large-scale fabrication tinuum modeling is that sometimes a molecular behavior is over- of anisometric particles from fibers synthesized by colloid elec- looked. Such a limitation comes from the multiple time and length trospinning. Small, 8:144–153, 2012. doi:10.1002/smll. scales behind the physics of complex systems, as it is the case 201101247. of polymers and biopolymers. In this regard, several efforts on [Gar20] Ricardo Garcia. Nanomechanical mapping of soft materials with the atomic force microscope: methods, theory and applications. the multiscale modeling of materials have been proposed, joining The Royal Society of Chemistry, 49:5850–5884, 2020. doi:10. mainly efforts to stretch the multiscale gap [GTK+ 19]. We also 1039/d0cs00318b. plan to do so, within a current project, for modeling the polymeric [GG13] Horacio V. Guzman and Ricardo Garcia. Peak forces and lateral resolution in amplitude modulation force microscopy in liquid. fibers as molecular chains and providing "feedback" between mod- Beilstein Journal of Nanotechnology, 4:852–859, 2013. doi: els from a top-down strategy. Code-wise, the implementation will 10.3762/bjnano.4.96. be also gradually improved. Nonetheless, to maintain scientific [GGG15] Horacio V. Guzman, Pablo D. Garcia, and Ricardo Garcia. Dy- code is a challenging task. In particular without the support for namic force microscopy simulator (dforce): A tool for planning and understanding tapping and bimodal afm experiments. Beilstein our students once they finish their thesis. In this respect, we will Journal of Nanotechnology, 6:369–379, 2015. doi:10.3762/ seek software funding and more community contributions. bjnano.6.36. [GPG13] Horacio V. Guzman, Alma P. Perrino, and Ricardo Garcia. Peak forces in high-resolution imaging of soft matter in liquid. 
ACS Nano, 7:3198–3204, 2013. doi:10.1021/nn4012835. Future work [GS05] Christopher P. Green and John E. Sader. Frequency response of There are several improvements that are planned for pyDAMPF. cantilever beams immersed in viscous fluids near a solid surface with applications to the atomic force microscope. Journal of Ap- plied Physics, 98:114913, 2005. doi:10.1063/1.2136418. • We plan to include a link to molecular dynamics simula- [GTK+ 19] Horacio V. Guzman, Nikita Tretyakov, Hideki Kobayashi, Aoife C. tions of polymer chains in a multiscale like approach. Fogarty, Karsten Kreis, Jakub Krajniak, Christoph Junghans, Kurt • We plan to use experimental values with less uncertainty Kremer, and Torsten Stuehn. Espresso++ 2.0: Advanced methods for multiscale molecular simulation. Computer Physics Communi- to boost semi-empirical models based on pyDAMPF. cations, 238:66–76, 2019. doi:10.1016/j.cpc.2018.12. • The code is still not very clean and some internal cleanup 017. is necessary. This is especially true for the Python backend [Guz17] Horacio V. Guzman. Scaling law to determine peak forces which may require a refactoring. in tapping-mode afm experiments on finite elastic soft matter systems. Beilstein Journal of Nanotechnology, 8:968–974, 2017. • Some AI optimization was also envisioned, particularly for doi:10.3762/bjnano.8.98. optimizing criteria and comparing operational parameters. [HLLB09] Fei Hang, Dun Lu, Shuang Wu Li, and Asa H. Barber. Stress-strain behavior of individual electrospun polymer fibers using combina- tion afm and sem. Materials Research Society, 1185:1185–II07– Conclusions 10, 2009. doi:10.1557/PROC-1185-II07-10. [MHR08] John Melcher, Shuiqing Hu, and Arvind Raman. Veda: A In summary, pyDAMPF is a highly efficient and adaptable simu- web-based virtual environment for dynamic atomic force mi- croscopy. Review of Scientific Instruments, 79:061301, 2008. lation tool aimed at analyzing, planning and interpreting dynamic doi:10.1063/1.2938864. AFM experiments. [Ram20] Prabhu Ramachandran. Compyle: a Python package for paral- It is important to keep in mind that pyDAMPF uses cantilever lel computing. In Meghann Agarwal, Chris Calloway, Dillon Niederhut, and David Shupe, editors, Proceedings of the 19th manufacturers information to analyze, evaluate and choose a Python in Science Conference, pages 32 – 39, 2020. doi: certain nanoprobe that fulfills experimental criteria. If this will 10.25080/majora-342d178e-005. not be the case, it will advise the experimentalists on what to [Sad98] John E. Sader. Frequency response of cantilever beams immersed expect from their measurements and the response a material may in viscous fluids with applications to the atomic force microscope. Journal of Applied Physics, 84:64–76, 1998. doi:10.1063/1. have. We currently support multi-thread execution using in-house 368002. development. However, in our outlook, we plan to extend the [Vea20] Pauli Virtanen and et al. Scipy 1.0: fundamental algorithms for code to GPU by using transpiling tools, like compyle [Ram20], scientific computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2. as the availability of GPUs also increases in standard worksta- tions. In addition, we have shown how to reuse a widely tested Fortran code [GPG13] and wrap it as a python module to profit from pythonic libraries and interactivity via Jupyter notebooks. Implementing new interaction forces for the simulator is straight- forward. 
However, this code includes the state-of-the-art contact, viscous, van der Waals, capillarity and electrostatic forces used for physics at the interfaces. Moreover, we plan to implement soon semi-empirical analysis and multiscale modeling with molecular dynamics simulations. Acknowledgments H.V.G thanks the financial support by the Slovenian Research Agency (Funding No. P1-0055). We gratefully acknowledge the fruitful discussions with Tomas Corrales and our joint Fondecyt Regular project 1211901. 210 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Improving PyDDA’s atmospheric wind retrievals using automatic differentiation and Augmented Lagrangian methods Robert Jackson‡∗ , Rebecca Gjini§ , Sri Hari Krishna Narayanan‡ , Matt Menickelly, Paul Hovland‡ , Jan Hückelheim‡ , Scott Collis‡ F Introduction [LSKJ17] as detailed in the 2019 SciPy Conference proceedings Meteorologists require information about the spatiotemporal dis- (see [JCL+ 20], [RJSCTL+ 19]). It provided a much easier to tribution of winds in thunderstorms in order to analyze how use and more portable interface for wind retrievals than was physical and dynamical processes govern thunderstorm evolution. provided by these packages. In PyDDA versions 0.5 and prior, Knowledge of such processes is vital for predicting severe and the implementation of Equation (1) uses NumPy [HMvdW+ 20] hazardous weather events. However, acquiring wind observations to calculate J and its gradient. In order to find the wind field in thunderstorms is a non-trivial task. There are a variety of in- V that minimizes J, PyDDA used the limited memory Broy- struments that can measure winds including radars, anemometers, den–Fletcher–Goldfarb–Shanno bounded (L-BFGS-B) from SciPy and vertically pointing wind profilers. The difficulty in acquiring [VGO+ 20]. L-BFGS-B requires gradients of J in order to mini- a three dimensional volume of the 3D wind field from these mize J. Considering the antiquity of the CEDRIC and Multidop sensors is that these sensors typically only measure either point packages, these first steps provided the transition to Python that observations or only the component of the wind field parallel was needed in order to enhance accessibility of wind retrieval to the direction of the antenna. Therefore, in order to obtain 3D software by the scientific community. For more information wind fields, the weather radar community uses a weak variational about PyDDA versions 0.5 and prior, consult [RJSCTL+ 19] and technique that finds a 3D wind field that minimizes a cost function [JCL+ 20]. J. However, there are further improvements that still needed J(V) = µm Jm + µo Jo + µv Jv + µb Jb + µs Js (1) to be made in order to optimize both the accuracy and speed of the PyDDA retrievals. For example, the cost functions and Here, Jm is how much the wind field V violates the anelastic mass gradients in PyDDA 0.5 are implemented in NumPy which does continuity equation. Jo is how much the wind field is different not take advantage of GPU architectures for potential speedups from the radar observations. Jv is how much the wind field violates [HMvdW+ 20]. In addition, the gradients of the cost function that the vertical vorticity equation. Jb is how much the wind field are required for the weak variational technique are hand-coded differs from a prescribed background. 
Finally, Js is related to the smoothness of the wind field, quantified as the Laplacian of the wind field. The scalars µx are weights determining the relative contribution of each cost function to the total J. The flexibility in this formulation potentially allows for factoring in the uncertainties that are inherent in the measurements. This formulation is expandable to include cost functions related to data from other sources such as weather forecast models and soundings. For more specific information on these cost functions, see [SPG09] and [PSX12].
PyDDA is an open source Python package that implements the weak variational technique for retrieving winds. It was originally developed in order to modernize existing codes for the weak variational retrievals such as CEDRIC [MF98] and Multidop [LSKJ17].
Packages such as Jax [BFH+18] and TensorFlow [AAB+15] can automatically calculate these gradients. These needs motivated new features for the release of PyDDA 1.0. In PyDDA 1.0, we utilize Jax and TensorFlow's automatic differentiation capabilities for differentiating J, making these calculations less prone to human error and more efficient. Finally, upgrading PyDDA to use Jax and TensorFlow allows it to take advantage of GPUs, increasing the speed of retrievals. This paper shows how Jax and TensorFlow are used to automatically calculate the gradient of J and improve the performance of PyDDA's wind retrievals using GPUs.
In addition, a drawback to the weak variational technique is that the technique requires user specified constants µ. This therefore creates the possibility that winds retrieved from different datasets may not be physically consistent with each other, affecting reproducibility. Therefore, for the PyDDA 1.1 release, this paper also details a new approach that uses Augmented Lagrangian solvers in order to place strong constraints on the wind field such that it satisfies a mass continuity constraint to within a specified tolerance while minimizing the rest of the cost function. This new approach also takes advantage of the automatically calculated gradients that are implemented in PyDDA 1.0. This paper will show that this new approach eliminates the need for user specified constants, ensuring the reproducibility of the results produced by PyDDA.

* Corresponding author: rjackson@anl.gov
‡ Argonne National Laboratory, 9700 Cass Ave., Argonne, IL, 60439
§ University of California at San Diego

Copyright © 2022 Robert Jackson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Weak variational technique

This section summarizes the weak variational technique that was implemented in PyDDA previous to version 1.0 and is currently the default option for PyDDA 1.1. PyDDA currently uses the weak variational formulation given by Equation (1). For this proceedings, we will focus our attention on the mass continuity Jm and observational cost function Jo. In PyDDA, Jm is given as the discrete volume integral of the square of the anelastic mass continuity equation

    Jm(u, v, w) = Σ_volume ( δ(ρs u)/δx + δ(ρs v)/δy + δ(ρs w)/δz )²,   (2)

where u is the zonal component of the wind field and v is the meridional component of the wind field. ρs is the density of air, which is approximated in PyDDA as ρs(z) = e^(−z/10000), where z is the height in meters. The physical interpretation of this equation is that a column of air in the atmosphere is only allowed to compress in order to generate changes in air density in the vertical direction. Therefore, wind convergence at the surface will generate vertical air motion. A corollary of this is that divergent winds must occur in the presence of a downdraft. At the scales of winds observed by PyDDA, this is a reasonable approximation of the winds in the atmosphere.
The cost function Jo metricizes how much the wind field is different from the winds measured by each radar. Since a scanning radar will scan a storm while pointing at an elevation angle θ and an azimuth angle φ, the wind field must first be projected to the radar's coordinates. After that, PyDDA finds the total square error between the analysis wind field and the radar observed winds as done in Equation (3).

    Jo(u, v, w) = Σ_volume ( u cos θ sin φ + v cos θ cos φ + (w − wt) sin θ )²   (3)

Here, wt is the terminal velocity of the particles scanned by the radar volume. This is approximated using empirical relationships between wt and the radar reflectivity Z. PyDDA then uses the limited memory Broyden–Fletcher–Goldfarb–Shanno bounded (L-BFGS-B) algorithm (see, e.g., [LN89]) to find the u, v, and w that solve the optimization problem

    min_{u,v,w} J(u, v, w) := µm Jm(u, v, w) + µv Jv(u, v, w).   (4)

For experiments using the weak variational technique, we run the optimization until either the Linf norm of the gradient of J is less than 10⁻⁸ or when the maximum change in u, v, and w between iterations is less than 0.01 m/s, as done by [PSX12]. Typically, the second criterion is reached first. Before PyDDA 1.0, PyDDA utilized SciPy's L-BFGS-B implementation. However, as of PyDDA 1.0 one can also use TensorFlow's L-BFGS-B implementation, which is used here for the experiments with the weak variational technique [AAB+15].
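To make the projection inside Equation (3) concrete, the short function below maps a Cartesian wind vector onto the radar beam direction for a single gate. The function name, the degree-based angle convention and the example numbers are ours for illustration; this is not PyDDA's API.

    import numpy as np

    def radial_velocity(u, v, w, w_t, elevation, azimuth):
        """Project a Cartesian wind vector onto the radar beam, as in the
        observational term of Equation (3)."""
        theta = np.deg2rad(elevation)
        phi = np.deg2rad(azimuth)
        return (u * np.cos(theta) * np.sin(phi)
                + v * np.cos(theta) * np.cos(phi)
                + (w - w_t) * np.sin(theta))

    # A 10 m/s westerly wind seen by a beam pointing east at 1 degree elevation
    # appears almost entirely as radial motion (made-up numbers).
    print(radial_velocity(u=10.0, v=0.0, w=0.0, w_t=0.0, elevation=1.0, azimuth=90.0))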
Using automatic differentiation

The optimization problem in Equation (4) requires the gradients of J. In PyDDA 0.5 and prior, the gradients of the cost function J were calculated by finding the closed form of the gradient by hand and then coding the closed form in Python. The code snippet below provides an example of how the cost function Jm is implemented in PyDDA using NumPy.

    def calculate_mass_continuity(u, v, w, z, dx, dy, dz, coeff):
        dudx = np.gradient(u, dx, axis=2)
        dvdy = np.gradient(v, dy, axis=1)
        dwdz = np.gradient(w, dz, axis=0)

        div = dudx + dvdy + dwdz

        return coeff * np.sum(np.square(div)) / 2.0

In order to hand code the gradient of the cost function above, one has to write the closed form of the derivative into another function like below.

    def calculate_mass_continuity_gradient(u, v, w, z, dx, dy, dz, coeff):
        dudx = np.gradient(u, dx, axis=2)
        dvdy = np.gradient(v, dy, axis=1)
        dwdz = np.gradient(w, dz, axis=0)
        div = dudx + dvdy + dwdz

        grad_u = -np.gradient(div, dx, axis=2) * coeff
        grad_v = -np.gradient(div, dy, axis=1) * coeff
        grad_w = -np.gradient(div, dz, axis=0) * coeff

        y = np.stack([grad_u, grad_v, grad_w], axis=0)
        return y.flatten()

Hand coding these functions can be labor intensive for complicated cost functions. In addition, there is no guarantee that there is a closed form solution for the gradient. Therefore, we tested using both Jax and TensorFlow to automatically compute the gradients of J. Computing the gradients of J using Jax can be done in two lines of code using jax.vjp:

    primals, fun_vjp = jax.vjp(
        calculate_radial_vel_cost_function,
        vrs, azs, els, u, v, w, wts, rmsVr, weights, coeff)
    _, _, _, p_x1, p_y1, p_z1, _, _, _, _ = fun_vjp(1.0)

Calculating the gradients using automatic differentiation with TensorFlow is also a simple code snippet using tf.GradientTape:

    with tf.GradientTape() as tape:
        tape.watch(u)
        tape.watch(v)
        tape.watch(w)
        loss = calculate_radial_vel_cost_function(
            vrs, azs, els, u, v, w, wts, rmsVr, weights, coeff)

    grad = tape.gradient(loss, [u, v, w])

As one can see, there is no more need to derive the closed form of the gradient of the cost function. Rather, the cost function itself is now the input to a snippet of code that automatically provides the derivative. In PyDDA 1.0, there are now three different engines that the user can specify. The classic "scipy" mode uses the NumPy-based cost function and hand coded gradients used by versions of PyDDA previous to 1.0. In addition, there are now TensorFlow and Jax modes that use both cost functions and automatically generated gradients generated using TensorFlow or Jax.
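To show the overall pattern end to end, the sketch below minimizes a toy mass-continuity penalty with SciPy's L-BFGS-B while letting JAX supply the gradient. The cost function, grid and numbers are invented and far simpler than PyDDA's; only the wiring (an automatically differentiated cost fed to L-BFGS-B) mirrors the approach described above.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from scipy.optimize import minimize

    shape = (8, 16, 16)          # (z, y, x) grid, made-up size
    dz = dy = dx = 500.0         # grid spacing in meters

    def divergence_cost(winds):
        """Toy J_m: squared divergence of the wind field (periodic forward differences)."""
        u, v, w = winds.reshape((3,) + shape)
        div = ((jnp.roll(u, -1, axis=2) - u) / dx
               + (jnp.roll(v, -1, axis=1) - v) / dy
               + (jnp.roll(w, -1, axis=0) - w) / dz)
        return jnp.sum(div ** 2) / 2.0

    # value_and_grad returns both the cost and its gradient in one call.
    cost_and_grad = jax.value_and_grad(divergence_cost)

    def fun(x):
        value, grad = cost_and_grad(jnp.asarray(x))
        return float(value), np.asarray(grad, dtype=np.float64)

    x0 = np.random.default_rng(0).normal(size=3 * np.prod(shape))
    result = minimize(fun, x0, jac=True, method="L-BFGS-B",
                      options={"maxiter": 50})
    print(result.fun)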
Improving performance with GPU capabilities

The implementation of a TensorFlow-based engine provides PyDDA the capability to take advantage of CUDA-compatible GPUs.
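Before running a large retrieval it can be worth confirming that TensorFlow actually sees a CUDA device; the two lines below are a generic TensorFlow check, not a PyDDA feature.

    import tensorflow as tf

    # An empty list means TensorFlow will silently fall back to the CPU.
    print(tf.config.list_physical_devices('GPU'))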
For PyDDA 1.1, the PyDDA development SciPy Engine ~50 days 5771.2 s 871.5 s 226.9 s team focused on implementing a technique that enables the user to TensorFlow 7372.5 s 341.5 s 28.1 s 7.0 s automatically determine the weight coefficients µ. This technique Engine builds upon the automatic differentiation work done for PyDDA NVIDIA 89.4 s 12.0 s 3.5 s 2.6 s Tesla A100 1.0 by using the automatically generated gradients. In this work, GPU we consider a constrained reformulation of Equation (4) that requires wind fields returned by PyDDA to (approximately) satisfy mass continuity constraints. That is, we focus on the constrained TABLE 1: Run times for each of the benchmarks in Figure 1. optimization problem min Jv (u, v, w) u,v,w (5) Given that weather radar datasets can span decades and processing s. to Jm (u, v, w) = 0, each 10 minute time period of data given by the radar can take on the order of 1-2 minutes with PyDDA using regular CPU where we now interpret Jm as a vector mapping that outputs, at operations, if this time were reduced to seconds, then processing each grid point in the discretized volume δ (ρδ xs u) + δ (ρs v) δ (ρs w) δy + δz . winds from years of radar data would become tenable. Therefore, Notice that the formulation in Equation (5) has no dependencies we used the TensorFlow-based PyDDA using the weak variational on scalars µ. technique on the Hurricane Florence example in the PyDDA To solve the optimization problem in Equation (5), we im- Documentation. On 14 September 2018, Hurricane Florence was plemented an augmented Lagrangian method with a filter mech- within range of 2 radars from the NEXRAD network: KMHX anism inspired by [LV20]. An augmented Lagrangian method stationed in Newport, NC and KLTX stationed in Wilmington, considers the Lagrangian associated with an equality-constrained NC. In addition, the High Resolution Rapid Refresh model runs optimization problem, in this case L0 (u, v, w, λ ) = Jv (u, v, w) − provided an additional constraint for the wind retrieval. For more λ > Jm (u, v, w), where λ is a vector of Lagrange multipliers of information on this example, see [RJSCTL+ 19]. The analysis the same length as the number of grid points in the discretized domain spans 400 km by 400 km horizontally, and the horizontal volume. The Lagrangian is then augmented with an additional resolution was allowed to vary for different runs in order to com- squared-penalty term on the constraints to yield Lµ (u, v, w, λ ) = pare how both the CPU and GPU-based retrievals’ performance L0 (u, v, w, λ ) + µ2 kJm (u, v, w)k2 , where we have intentionally used would be affected by grid resolution. The time of completion of µ > 0 as the scalar in the penalty term to make comparisons each of these retrievals is shown in Figure 1. with Equation (4) transparent. It is well known (see, for instance, Figure 1 and Table 1 show that, in general, the retrievals took Theorem 17.5 of [NW06]) that under some not overly restrictive anywhere from 10 to 100 fold less time on the GPU compared to conditions there exists a finite µ̄ such that if µ ≥ µ̄, then each local the CPU. The discrepancy in performance between the GPU and solution of Equation (5) corresponds to a strict local minimizer IMPROVING PYDDA’S ATMOSPHERIC WIND RETRIEVALS USING AUTOMATIC DIFFERENTIATION AND AUGMENTED LAGRANGIAN METHODS 213 of Lµ (u, v, w, λ ∗ ) for a suitable choice of multipliers λ ∗ . 
Essen- tially, augmented Lagrangian methods solve a short sequence of unconstrained problems Lµ (u, v, w, λ ), with different values of µ until a solution is returned that is a local, feasible solution to Equation (5). In our implementation of an augmented Lagrangian method, the coarse minimization of Lµ (u, v, w, λ ) is performed by the Scipy implementation of LBFGS-B with the TensorFlow implementation of the cost function and gradients. Additionally, in our implementation, we employ a filter mechanism (see a survey in [FLT06]) recently proposed for augmented Lagrangian methods in [LV20] in order to guarantee convergence. We defer details to that paper, but note that the feasibility restoration phase (the minimization of a squared constraint violation) required by such a filter method is also performed by the SciPy implementation of LBFGS-B. The PyDDA documentation contains an example of a mesoscale convective system (MCS) that was sampled by a C- band Polarization Radar (CPOL) and a Bureau of Meteorology Australia radar on 20 Jan 2006 in Darwin, Australia. For more details on this storm and the radar network configuration, see [CPMW13]. For more information about the CPOL radar dataset, see [JCL+ 18]. This example with its data is included in the PyDDA Documentation as the "Example of retrieving and plotting winds." Figure 2 shows the winds retrieved by the Augmented La- grangian technique with µ = 1 and from the weak variational technique with µ = 1 on the right. Figure 2 shows that both tech- niques are capturing similar horizontal wind fields in this storm. However, the Augmented Lagrangian technique is resolving an updraft that is not present in the wind field generated by the weak variational technique. Since there is horizontal wind convergence in this region, we expect there to be an updraft present in this box in order for the solution to be physically realistic. Therefore, for µ = 1, the Augmented Lagrangian technique is doing a better job at resolving the updrafts present in the storm than the weak variational technique is. This shows that adjusting µ is required in order for the weak variational technique to resolve the updraft. We solve the unconstrained formulation (4) using the imple- mentation of L-BFGS-B currently employed in PyDDA; we fix the value µv = 1 and vary µm = 2 j : j = 0, 1, 2, . . . , 16. We also solve the constrained formulation (5) using our implementation of a filter Augmented Lagrangian method, and instead vary the initial guess of penalty parameter µ = 2 j : j = 0, 1, 2, . . . , 16. For the initial state, we use the wind profile from the weather balloon launch at 00 UTC 20 Jan 2006 from Darwin and apply it to the whole analysis domain. A summary of results is shown in Figures 3 and 4. We applied a maximum constraint violation tolerance of 10−3 to the filter Augmented Lagrangian method. This is a tolerance that assumes that the winds do not violate the mass continuity constraint by more than 0.001 m2 s−2 . Notice Fig. 2: The PyDDA retrieved winds overlaid over reflectivity from the that such a tolerance is impossible to supply to the weak vari- C-band Polarization Radar for the MCS that passed over Darwin, ational method, highlighting the key advantage of employing a Australia on 20 Jan 2006. The winds were retrieved using the weak constrained method. 
Fig. 3: The x-axis shows, on a logarithmic scale, the maximum constraint violation in the units of divergence of the wind field and the y-axis shows the value of the data-fitting term Jv at the optimal solution. The legend lists the number of function/gradient calls made by the filter Augmented Lagrangian method, which is the dominant cost of both approaches. The dashed line at 10⁻³ denotes the tolerance on the maximum constraint violation that was supplied to the filter Augmented Lagrangian method.

Fig. 4: As Fig. 3, but for the weak variational technique that uses L-BFGS-B.

Finally, a variable of interest to atmospheric scientists for winds inside MCSes is the vertical wind velocity. It provides a measure of the intensity of the storm by demonstrating the amount of upscale growth contributing to intensification. Figure 5 shows the mean updraft velocities inside the box in Figure 2 as a function of height for each of the runs of the TensorFlow L-BFGS-B and Augmented Lagrangian techniques. Table 2 summarizes the mean and spread of the solutions in Figure 5. For the updraft velocities produced by the Augmented Lagrangian technique, there is a 1 m/s spread of velocities produced for given values of µ at altitudes < 7.5 km in Table 2. At an altitude of 10 km, this spread is 1.9 m/s. This is likely due to the reduced spatial coverage of the radars at higher altitudes. However, for the weak variational technique, the sensitivity of the retrieval to µ is much more pronounced, with up to 2.8 m/s differences between retrievals. Therefore, using the Augmented Lagrangian technique makes the vertical velocities less sensitive to µ. This shows that using the Augmented Lagrangian technique will result in more reproducible wind fields from radar wind networks, since it is less sensitive to user-defined parameters than the weak variational technique. However, a limitation of this technique is that, for now, it is limited to two radars and to the mass continuity and vertical vorticity constraints.

Fig. 5: The mean updraft velocity obtained by (left) the weak variational and (right) the Augmented Lagrangian technique inside the updrafts in the boxed region of Figure 2. Each line represents a different value of µ for the given technique.

TABLE 2: Minimum, mean, maximum, and standard deviation of w (m/s) for select levels in Figure 5.
                      Min    Mean   Max    Std. Dev.
Weak variational
  2.5 km              1.2    1.8    2.7    0.6
  5 km                2.2    2.9    4.0    0.7
  7.5 km              3.2    3.9    5.0    0.4
  10 km               2.3    3.3    4.9    1.0
Aug. Lagrangian
  2.5 km              1.8    2.8    3.3    0.5
  5 km                3.1    3.3    3.5    0.1
  7.5 km              3.2    3.5    3.9    0.1
  10 km               3.0    4.3    4.9    0.5

Concluding remarks

Atmospheric wind retrievals are vital for forecasting severe weather events. This motivated us to develop an open source package for atmospheric wind retrievals called PyDDA. In the original releases of PyDDA (versions 0.5 and prior), the original goal of PyDDA was to convert legacy wind retrieval packages such as CEDRIC and Multidop to be fully Pythonic, open source, and accessible to the scientific community. However, there remained many improvements to be made to PyDDA to optimize the speed of the retrievals and to make it easier to add constraints to PyDDA.

This motivated two major changes to PyDDA's wind retrieval routine for PyDDA 1.0. The first major change in PyDDA 1.0 was to simplify the wind retrieval process by automating the calculation of the gradient of the cost function used for the weak variational technique. To do this, we utilized Jax and TensorFlow's capabilities to do automatic differentiation of functions. This also allows PyDDA to take advantage of GPU resources, significantly speeding up retrieval times for mesoscale retrievals at kilometer-scale resolution. In addition, running the TensorFlow-based version of PyDDA provided significant performance improvements even when using a CPU.

These automatically generated gradients were then used to implement an Augmented Lagrangian technique in PyDDA 1.1 that allows for automatically determining the weights for each cost function in the retrieval. The Augmented Lagrangian technique guarantees convergence to a physically realistic solution, something that is not always the case for a given set of weights for the weak variational technique. Therefore, this both creates more reproducible wind retrievals and simplifies the process of retrieving winds for the non-specialist user. However, since the Augmented Lagrangian technique currently only supports the ingesting of radar data into the retrieval, plans for PyDDA 1.2 and beyond include expanding the Augmented Lagrangian technique to support multiple data sources such as models and rawinsondes.

Code Availability

PyDDA is available for public use with documentation and examples available at https://openradarscience.org/PyDDA. The GitHub repository that hosts PyDDA's source code is available at https://github.com/openradar/PyDDA.

Acknowledgments

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ('Argonne'). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This material is also based upon work funded by program development funds from the Mathematics and Computer Science and Environmental Science departments at Argonne National Laboratory.

REFERENCES

[AAB+15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. URL: https://www.tensorflow.org/.
[BFH+18] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL: http://github.com/google/jax.
[CPMW13] Scott Collis, Alain Protat, Peter T. May, and Christopher Williams. Statistics of storm updraft velocities from TWP-ICE including verification with profiling measurements. Journal of Applied Meteorology and Climatology, 52(8):1909-1922, 2013. doi:10.1175/JAMC-D-12-0230.1.
[FLT06] Roger Fletcher, Sven Leyffer, and Philippe Toint. A brief history of filter methods. Technical report, Argonne National Laboratory, 2006. URL: http://www.optimization-online.org/DB_FILE/2006/10/1489.pdf.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, September 2020. doi:10.1038/s41586-020-2649-2.
[JCL+18] R. C. Jackson, S. M. Collis, V. Louf, A. Protat, and L. Majewski. A 17 year climatology of the macrophysical properties of convection in Darwin. Atmospheric Chemistry and Physics, 18(23):17687-17704, 2018. doi:10.5194/acp-18-17687-2018.
[JCL+20] Robert Jackson, Scott Collis, Timothy Lang, Corey Potvin, and Todd Munson. PyDDA: A Pythonic direct data assimilation framework for wind retrievals. Journal of Open Research Software, 8(1):20, 2020. doi:10.5334/jors.264.
[LN89] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503-528, 1989. doi:10.1007/bf01589116.
[LSKJ17] Timothy Lang, Mario Souto, Shahin Khobahi, and Bobby Jackson. nasa/multidop: Multidop v0.3, October 2017. doi:10.5281/zenodo.1035904.
[LV20] Sven Leyffer and Charlie Vanaret. An augmented Lagrangian filter method. Mathematical Methods of Operations Research, 92(2):343-376, 2020. doi:10.1007/s00186-020-00713-x.
[MF98] L. Jay Miller and Sherri M. Fredrick. Custom editing and display of reduced information in Cartesian space. Technical report, National Center for Atmospheric Research, 1998.
[NW06] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
[PSX12] Corey K. Potvin, Alan Shapiro, and Ming Xue. Impact of a vertical vorticity constraint in variational dual-Doppler wind analysis: Tests with real and simulated supercell data. Journal of Atmospheric and Oceanic Technology, 29(1):32-49, 2012. doi:10.1175/JTECH-D-11-00019.1.
[RJSCTL+ 19] Robert Jackson, Scott Collis, Timothy Lang, Corey Potvin, and Todd Munson. PyDDA: A new Pythonic Wind Re- trieval Package. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 111 – 117, 2019. doi:10.25080/Majora-7ddc1dd1-010. [SPG09] Alan Shapiro, Corey K. Potvin, and Jidong Gao. Use of a verti- cal vorticity equation in variational dual-doppler wind analysis. Journal of Atmospheric and Oceanic Technology, 26(10):2089 – 2106, 2009. doi:10.1175/2009JTECHA1256.1. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Sté- fan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jar- rod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quin- tero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 217 RocketPy: Combining Open-Source and Scientific Libraries to Make the Space Sector More Modern and Accessible João Lemes Gribel Soares‡∗ , Mateus Stano Junqueira‡ , Oscar Mauricio Prada Ramirez‡ , Patrick Sampaio dos Santos Brandão‡§ , Adriano Augusto Antongiovanni‡ , Guilherme Fernandes Alves‡ , Giovani Hidalgo Ceotto‡ F Abstract—In recent years we are seeing exponential growth in the space sector, important issue. Moreover, performance is always a requirement with new companies emerging in it. On top of that more people are becoming both for saving financial and time resources while efficiently fascinated to participate in the aerospace revolution, which motivates students launch performance goals. and hobbyists to build more High Powered and Sounding Rockets. However, In this scenario, crucial parameters should be determined be- rocketry is still a very inaccessible field, with high knowledge of entry-level and fore a safe launch can be performed. Examples include calculating concrete terms. To make it more accessible, people need an active community with flexible, easy-to-use, and well-documented tools. RocketPy is a software with high accuracy and certainty the most likely impact or landing solution created to address all those issues, solving the trajectory simulation region. This information greatly increases range safety and the for High-Power rockets being built on top of SciPy and the Python Scien- possibility of recovering the rocket [Wil18]. As another example, tific Environment. The code allows for a sophisticated 6 degrees of freedom it is important to determine the altitude of the rocket’s apogee in simulation of a rocket’s flight trajectory, including high fidelity variable mass order to avoid collision with other aircraft and prevent airspace effects as well as descent under parachutes. All of this is packaged into an violations. architecture that facilitates complex simulations, such as multi-stage rockets, To better attend to those issues, RocketPy was created as a design and trajectory optimization, and dispersion analysis. 
In this work, the computational tool that can accurately predict all dynamic param- flexibility and usability of RocketPy are indicated in three example simulations: eters involved in the flight of sounding, model, and High-Powered a basic trajectory simulation, a dynamic stability analysis, and a Monte Carlo dispersion simulation. The code structure and the main implemented methods Rockets, given parameters such as the rocket geometry, motor are also presented. characteristics, and environmental conditions. It is an open source project, well structured, and documented, allowing collaborators Index Terms—rocketry, flight, rocket trajectory, flexibility, Monte Carlo analysis to contribute with new features with minimum effort regarding legacy code modification [CSA+ 21]. Introduction Background When it comes to rockets, there is a wide field ranging from Rocketry terminology orbital rockets to model rockets. Between them, two types of rockets are relevant to this work: sounding rockets and High- To better understand the current work, some specific terms regard- Powered Rockets (HPRs). Sounding rockets are mainly used ing the rocketry field are stated below: by government agencies for scientific experiments in suborbital • Apogee: The point at which a body is furthest from earth flights while HPRs are generally used for educational purposes, • Degrees of freedom: Maximum number of independent with increasing popularity in university competitions, such as the values in an equation annual Spaceport America Cup, which hosts more than 100 rocket • Flight Trajectory: 3-dimensional path, over time, of the design teams from all over the world. After the university-built rocket during its flight rocket TRAVELER IV [AEH+ 19] successfully reached space by • Launch Rail: Guidance for the rocket to accelerate to a crossing the Kármán line in 2019, both Sounding Rockets and stable flight speed HPRs can now be seen as two converging categories in terms of • Powered Flight: Phase of the flight where the motor is overall flight trajectory. active HPRs are becoming bigger and more robust, increasing their • Free Flight: Phase of the flight where the motor is inactive potential hazard, along with their capacity, making safety an and no other component but its inertia is influencing the rocket’s trajectory * Corresponding author: jgribel@usp.br ‡ Escola Politécnica of the University of São Paulo • Standard Atmosphere: Average pressure, temperature, and § École Centrale de Nantes. air density for various altitudes • Nozzle: Part of the rocket’s engine that accelerates the Copyright © 2022 João Lemes Gribel Soares et al. This is an open-access exhaust gases article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any • Static hot-fire test: Test to measure the integrity of the medium, provided the original author and source are credited. motor and determine its thrust curve 218 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) • Thrust Curve: Evolution of thrust force generated by a Function motor Variable interpolation meshes/grids from different sources can • Static Margin: Is a non-dimensional distance to analyze lead to problems regarding coupling different data types. 
To the stability solve this, RocketPy employs a dedicated Function class which • Nosecone: The forward-most section of a rocket, shaped allows for more natural and dynamic handling of these objects, for aerodynamics structuring them as Rn → R mathematical functions. • Fin: Flattened append of the rocket providing stability Through the use of those methods, this approach allows for during flight, keeping it in the flight trajectory quick and easy arithmetic operations between lambda expressions and list-defined interpolated functions, as well as scalars. Different Flight Model interpolation methods are available to be chosen from, among them simple polynomial, spline, and Akima ([Aki70]). Extrapo- The flight model of a high-powered rocket takes into account at lation of Function objects outside the domain constrained by a least three different phases: given dataset is also allowed. 1. The first phase consists of a linear movement along the Furthermore, evaluation of definite integrals of these Function launch rail: The motion of the rocket is restricted to one dimen- objects is among their feature set. By cleverly exploiting the sion, which means that only the translation along with the rail chosen interpolation option, RocketPy calculates the values fast needs to be modeled. During this phase, four forces can act on and precisely through the use of different analytical methods. If the rocket: weight, engine thrust, rail reactions, and aerodynamic numerical integration is required, the class makes use of SciPy’s forces. implementation of the QUADPACK Fortran library [PdDKÜK83]. 2. After completely leaving the rail, a phase of 6 degrees of For 1-dimensional Functions, evaluation of derivatives at a point freedom (DOF) is established, which includes powered flight and is made possible through the employment of a simple finite free flight: The rocket is free to move in three-dimensional space difference method. and weight, engine thrust, normal and axial aerodynamic forces Finally, to increase usability and readability, all Function are still important. object instances are callable and can be presented in multiple 3. Once apogee is reached, a parachute is usually deployed, ways depending on the given arguments. If no argument is given, characterizing the third phase of flight: the parachute descent. In a Matplotlib figure opens and the plot of the function is shown in- the last phase, the parachute is launched from the rocket, which is side its domain. Only 2-dimensional and 3-dimensional functions usually divided into two or more parts joined by ropes. This phase can be plotted. This is especially useful for the post-processing ends at the point of impact. methods where various information on the classes responsible for the definition of the rocket and its flight is presented, providing for more concise code. If an n-sized array is passed instead, RocketPy Design: RocketPy Architecture will try and evaluate the value of the Function at this given point Four main classes organize the dataflow during the simulations: using different methods, returning its value. An example of the motor, rocket, environment, and flight [CSA+ 21]. Furthermore, usage of the Function class can be found in the Examples section. there is also a helper class named function, which will be described Additionally, if another Function object is passed, the class further. 
In the Motor class, the main physical and geometric will try to match their respective domain and co-domain in order parameters of the motor are configured, such as nozzle geometry, to return a third instance, representing a composition of functions, grain parameters, mass, inertia, and thrust curve. This first-class in the likes of: h(x) = (g◦ f )(x) = g( f (x)). With different Function acts as an input to the Rocket class where the user is also asked objects defined, the comparePlots method can be used to plot, in to define certain parameters of the rocket such as the inertial mass a single graph, different functions. tensor, geometry, drag coefficients, and parachute description. By imitating, in syntax, commonly used mathematical no- Finally, the Flight class joins the rocket and motor parameters with tation, RocketPy allows for more understandable and human- information from another class called Environment, such as wind, readable code, especially in the implementation of the more atmospheric, and earth models, to generate a simulation of the extensive and cluttered rocket equations of motion. rocket’s trajectory. This modular architecture, along with its well- structured and documented code, facilitates complex simulations, Environment starting with the use of Jupyter Notebooks that people can adapt The Environment class reads, processes and stores all the infor- for their specific use case. Fig. 1 illustrates RocketPy architecture. mation regarding wind and atmospheric model data. It receives as inputs launch point coordinates, as well as the length of the launch rail, and then provides the flight class with six profiles as a function of altitude: wind speed in east and north directions, atmospheric pressure, air density, dynamic viscosity, and speed of sound. For instance, an Environment object can be set as representing New Mexico, United States: 1 from rocketpy import Environment 2 3 ex_env = Environment( 4 railLength=5.2, 5 latitude=32.990254, Fig. 1: RocketPy classes interaction [CSA+ 21] 6 longitude=-106.974998, 7 elevation=1400 8 ) ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 219 RocketPy requires datetime library information specifying the of rocket motors: solid motors, liquid motors, and hybrid motors. year, month, day and hour to compute the weather conditions on Currently, a robust Solid Motor class has been fully implemented the specified day of launch. An optional argument, the timezone, and tested. For example, a typical solid motor can be created as an may also be specified. If the user prefers to omit it, RocketPy will object in the following way: assume the datetime object is given in standard UTC time, just as 1 from rocketpy import SolidMotor follows: 2 3 ex_motor = SolidMotor( 1 import datetime 4 thrustSource='Motor_file.eng', 2 tomorrow = ( 5 burnOut=2, 3 datetime.date.today() + 6 reshapeThrustCurve= False, 4 datetime.timedelta(days=1) 7 grainNumber=5, 5 ) 8 grainSeparation=3/1000, 6 9 grainOuterRadius=33/1000, 7 date_info = ( 10 grainInitialInnerRadius=15/1000, 8 tomorrow.year, 11 grainInitialHeight=120/1000, 9 tomorrow.month, 12 grainDensity= 1782.51, 10 tomorrow.day, 13 nozzleRadius=49.5/2000, 11 12 14 throatRadius=21.5/2000, 12 ) # Hour given in UTC time 15 interpolationMethod='linear') By default, the International Standard Atmosphere [ISO75] static atmospheric model is loaded. 
However, it is easy to set other Rocket models by importing data from different meteorological agencys’ The Rocket Class is responsible for creating and defining the public datasets, such as Wyoming Upper-Air Soundings and Eu- rocket’s core characteristics. Mostly composed of physical at- ropean Centre for Medium-Range Weather Forecasts (ECMWF); tributes, such as mass and moments of inertia, the rocket object or to set a customized atmospheric model based on user-defined will be responsible for storage and calculate mechanical parame- functions. As RocketPy supports integration with different meteo- ters. rological agencies’ datasets, it allows for a sophisticated definition A rocket object can be defined with the following code: of weather conditions including forecasts and historical reanalysis 1 from rocketpy import Rocket scenarios. 2 In this case, NOAA’s RUC Soundings data model is used, a 3 ex_rocket = Rocket( worldwide and open-source meteorological model made available 4 motor=ex_motor, 5 radius=127 / 2000, online. The file name is set as GFS, indicating the use of the Global 6 mass=19.197 - 2.956, Forecast System provided by NOAA, which features a forecast 7 inertiaI=6.60, with a quarter degree equally spaced longitude/latitude grid with 8 inertiaZ=0.0351, a temporal resolution of three hours. 9 distanceRocketNozzle=-1.255, 10 distanceRocketPropellant=-0.85704, 1 ex_env.setAtmosphericModel( 11 powerOffDrag="data/rocket/powerOffDragCurve.csv", 2 type='Forecast', 12 powerOnDrag="data/rocket/powerOnDragCurve.csv", 3 file='GFS') 13 ) 4 ex_env.info() As stated in [RocketPy architecture], a fundamental input of the What is happening on the back-end of this code’s snippet is Rock- rocket is its motor, an object of the Motor class that must be etPy utilizing the OPeNDAP protocol to retrieve data arrays from previously defined. Some inputs are fairly simple and can be easily NOAA’s server. It parses by using the netCDF4 data management obtained with a CAD model of the rocket such as radius, mass, system, allowing for the retrieval of pressure, temperature, wind and moment of inertia on two different axes. The distance inputs velocity, and surface elevation data as a function of altitude. The are relative to the center of mass and define the position of the Environment class then computes the following parameters: wind motor nozzle and the center of mass of the motor propellant. The speed, wind heading, speed of sound, air density, and dynamic powerOffDrag and powerOnDrag receive .csv data that represents viscosity. Finally, plots of the evaluated parameters concerning the drag coefficient as a function of rocket speed for the case where the altitude are all passed on to the mission analyst by calling the the motor is off and other for the motor still burning, respectively. Env.info() method. At this point, the simulation would run a rocket with a tube of a certain diameter, with its center of mass specified and a motor at its Motor end. For a better simulation, a few more important aspects should RocketPy is flexible enough to work with most types of motors then be defined, called Aerodynamic surfaces. Three of them are used in sound rockets. The main function of the Motor class accepted in the code, these being the nosecone, fins, and tail. They is to provide the thrust curve, the propulsive mass, the inertia can be simply added to the code via the following methods: tensor, and the position of its center of mass as a function of time. 
1 nose_cone = ex_rocket.addNose( Geometric parameters regarding propellant grains and the motor’s 2 length=0.55829, kind="vonKarman", nozzle must be provided, as well as a thrust curve as a function 3 distanceToCM=0.71971 of time. The latter is preferably obtained empirically from a static 4 ) 5 fin_set = ex_rocket.addFins( hot-fire test, however, many of the curves for commercial motors 6 4, span=0.100, rootChord=0.120, tipChord=0.040, are freely available online [Cok98]. 7 distanceToCM=-1.04956 Alternatively, for homemade motors, there is a wide range 8 ) of open-source internal ballistics simulators, such as OpenMotor 9 tail = ex_rocket.addTail( 10 topRadius=0.0635, bottomRadius=0.0435, [Rei22], can predict the produced thrust with high accuracy for a 11 length=0.06, distanceToCM=-1.194656 given sizing and propellant combination. There are different types 12 ) 220 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) All these methods receive defining geometrical parameters and the Rocket class and the Environment class are used as input to their distance to the rocket’s center of mass (distanceToCM) initialize it, along with parameters such as launch heading and as inputs. Each of these surfaces generates, during the flight, inclination relative to the Earth’s surface: a lift force that can be calculated via a lift coefficient, which 1 from rocketpy import Flight is calculated with geometrical properties, as shown in [Bar67]. 2 Further on, these coefficients are used to calculate the center of 3 ex_flight = Flight( 4 rocket=rocket, pressure and subsequently the static margin. In each of these 5 environment=env, methods, the static margin is reevaluated. 6 inclination=85, Finally, the parachutes can be added in a similar manner to 7 heading=0 ) the aerodynamic surfaces. However, a few inputs regarding the 8 electronics involved in the activation of the parachute are required. Once the simulation is initialized, run, and completed, the The most interesting of them is the trigger and samplingRate instance of the Flight class stores relevant raw data. The inputs, which are used to define the parachute’s activation. The Flight.postProcess() method can then be used to com- trigger is a function that returns a boolean value that signifies pute secondary parameters such as the rocket’s Mach number when the parachute should be activated. The samplingRate is the during flight and its angle of attack. time interval that the trigger will be evaluated in the simulation To perform the numerical integration of the equations of mo- time steps. tion, the Flight class uses the LSODA solver [Pet83] implemented 1 def parachute_trigger(p, y): by Scipy’s scipy.integrate module [VGO+ 20]. Usually, 2 if vel_z < 0 and height < 800: well-designed rockets result in non-stiff equations of motion. 3 boole = True However, during flight, rockets may become unstable due to 4 else: 5 boole = False variations in their inertial and aerodynamic properties, which can 6 return boole result in a stiff system. LSODA switches automatically between 7 the nonstiff Adams method and the stiff BDF method, depending 8 ex_parachute = ex_rocket.addParachute( 9 'ParachuteName', on the detected stiffness, perfectly handle both cases. 10 CdS=10.0, Since a rocket’s flight trajectory is composed of multiple 11 trigger=parachute_trigger, phases, each with its own set of governing equations, RocketPy 12 samplingRate=105, employs a couple of clever methods to run the numerical inte- 13 lag=1.5, 14 noise=(0, 8.3, 0.5) gration. 
The Flight class uses a FlightPhases container to 15 ) hold each FlightPhase. The FlightPhases container will orchestrate the different FlightPhase instances, and compose With the rocket fully defined, the Rocket.info() and them during the flight. Rocket.allInfo() methods can be called giving us informa- This is crucial because there are events that may or may not tion and plots of the calculations performed in the class. One of the happen during the simulation, such as the triggering of a parachute most relevant outputs of the Rocket class is the static margin, as ejection system (which may or may not fail) or the activation of a it is important for the rocket stability and makes possible several premature flight termination event. There are also events such as analyses. It is visualized through the time plot in Fig. 2, which the departure from the launch rail or the apogee that is known to shows the variation of the static margin as the motor burns its occur, but their timestamp is unknown until the simulation is run. propellant. All of these events can trigger new flight phases, characterized by a change in the rocket’s equations of motion. Furthermore, such events can happen close to each other and provoke delayed phases. To handle this, the Flight class has a mechanism for creating new phases and adding them dynamically in the appropriate order to the FlightPhases container. The constructor of the FlightPhase class takes the follow- ing arguments: • t: a timestamp that symbolizes at which instant such flight phase should begin; • derivative: a function that returns the time derivatives of the rocket’s state vector (i.e., calculates the equations of motion for this flight phase); • callbacks: a list of callback functions to be run when the flight phase begins (which can be useful if some parameters of the rocket need to be modified before the flight phase begins). Fig. 2: Static Margin The constructor of the Flight class initializes the FlightPhases container with a rail phase and also a dummy max time phase which marks the maximum flight duration. Then, Flight it loops through the elements of the container. The Flight class is responsible for the integration of the rocket’s Inside the loop, an important attribute of the current equations of motion overtime [CSA+ 21]. Data from instances of flight phase is set: FlightPhase.timeBound, the maxi- ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 221 mum timestamp of the flight phase, which is always equal to the initial timestamp of the next flight phase. Ordinar- ily, it would be possible to run the LSODA solver from FlightPhase.t to FlightPhase.timeBound. However, this is not an option because the events which can trigger new flight phases need to be checked throughout the simulation. While scipy.integrate.solve_ivp does offer the events ar- gument to aid in this, it is not possible to use it with most of the events that need to be tracked, since they cannot be expressed in the necessary form. As an example, consider the very common event of a parachute ejection system. To simulate real-time algorithms, the necessary inputs to the ejection algorithm need to be supplied at regular intervals to simulate the desired sampling rate. Furthermore, the ejection algorithm cannot be called multiple times without real data since it generally stores all the inputs it gets to calculate if the rocket has reached the apogee to trigger the parachute release mechanism. 
Discrete controllers can present the same peculiar properties. To handle this, the instance of the FlightPhase class holds Fig. 3: 3D flight trajectory, an output of the Flight.allInfo method a TimeNodes container, which stores all the required timesteps, or TimeNode, that the integration algorithm should stop at so that the events can be checked, usually by feeding the necessary Monte Carlo simulations, which require a large number of data to parachutes and discrete control trigger functions. When it simulations to be performed (10,000 ~ 100,000). comes to discrete controllers, they may change some parameters • The code structure should be flexible. This is important in the rocket once they are called. On the other hand, a parachute due to the diversity of possible scenarios that exist in a triggers rarely actually trigger, and thus, rarely invoke the creation rocket design context. Each user will have their simulation of a new flight phase characterized by descent under parachute requirements and should be able to modify and adapt new governing equations of motion. features to meet their needs. For this reason, the code was The Flight class can take advantage of this fact by employing designed in a fashion such that each major component is overshootable time nodes: time nodes that the integrator does separated into self-encapsulated classes, responsible for a not need to stop. This allows the integration algorithm to use single functionality. This tenet follows the concepts of the more optimized timesteps and significantly reduce the number of so-called Single Responsibility Principle (SRP) [MNK03]. iterations needed to perform a simulation. Once a new timestep • Finally, the software should aim to be accessible. The is taken, the Flight class checks all overshootable time nodes that source code was openly published on GitHub (https: have passed and feeds their event triggers with interpolated data. //github.com/Projeto-Jupiter/RocketPy), where the com- In case when an event is triggered, the simulation is rolled back to munity started to be built and a group of developers, known that state. as the RocketPy Team, are currently assigned as dedicated In summary, throughout a simulation, the Flight class loops maintainers. The job involves not only helping to improve through each non-overshootable TimeNode of each element of the code, but also working towards building a healthy the FlightPhases container. At each TimeNode, the event ecosystem of Python, rocketry, and scientific computing triggers are fed with the necessary input data. Once an event is enthusiasts alike; thus facilitating access to the high- triggered, a new FlightPhase is created and added to the main quality simulation without a great level of specialization. container. These loops continue until the simulation is completed, either by reaching the maximum flight duration or by reaching a The following examples demonstrate how RocketPy can be a terminal event, such as ground impact. useful tool during the design and operation of a rocket model, Once the simulation is completed, raw data can al- enabling functionalities not available by other simulation software ready be accessed. To compute secondary parameters, the before. Flight.postProcess() is used. It takes advantage of the fact that the FlightPhases container keeps all relevant flight Examples information to essentially retrace the trajectory and capture more Using RocketPy for Rocket Design information about the flight. 
Once secondary parameters are computed, the 1) Apogee by Mass using a Function helper class Flight.allInfo method can be used to show and plot Because of performance and safety reasons, apogee is one of all the relevant information, as illustrated in Fig. 3. the most important results in rocketry competitions, and it’s highly valuable for teams to understand how different Rocket parameters The adaptability of the Code and Accessibility can change it. Since a direct relation is not available for this kind RocketPy’s development started in 2017, and since the beginning, of computation, the characteristic of running simulation quickly is certain requirements were kept in mind: utilized for evaluation of how the Apogee is affected by the mass • Execution times should be fast. There is a high interest in of the Rocket. This function is highly used during the early phases performing sensitivity analysis, optimization studies and of the design of a Rocket. 222 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) An example of code of how this could be achieved: 16 terminateOnApogee=True, 17 verbose=True, 1 from rocketpy import Function 18 ) 2 19 ex_flight.postProcess() 3 def apogee(mass): 20 simulation_results += [( 4 # Prepare Environment 21 ex_flight.attitudeAngle, 5 ex_env = Environment(...) 22 ex_rocket.staticMargin(0), 6 23 ex_rocket.staticMargin(ex_flight.outOfRailTime), 7 ex_env.setAtmosphericModel( 24 ex_rocket.staticMargin(ex_flight.tFinal) 8 type="CustomAtmosphere", 25 )] 9 wind_v=-5 26 Function.comparePlots( 10 ) 27 simulation_results, 11 28 xlabel="Time (s)", 12 # Prepare Motor 29 ylabel="Attitude Angle (deg)", 13 ex_motor = SolidMotor(...) 30 ) 14 15 # Prepare Rocket The next step is to start the simulations themselves, which can 16 ex_rocket = Rocket( 17 ..., be done through a loop where the Flight class is called, perform 18 mass=mass, the simulation, save the desired parameters into a list and then 19 ... follow through with the next iteration. The post-process flight data 20 ) 21 method is being used to make RocketPy evaluate additional result 22 ex_rocket.setRailButtons([0.2, -0.5]) parameters after the simulation. 23 nose_cone = ex_rocket.addNose(.....) Finally, the Function.comparePlots() method is used to plot 24 fin_set = ex_rocket.addFins(....) the final result, as reported at Fig. 4. 25 tail = ex_rocket.addTail(....) 26 27 # Simulate Flight until Apogee 28 ex_flight = Flight(.....) 29 return ex_flight.apogee 30 31 apogee_by_mass = Function( 32 apogee, inputs="Mass (kg)", 33 outputs="Estimated Apogee (m)" 34 ) 35 apogee_by_mass.plot(8, 20, 20) The possibility of generating this relation between mass and apogee in a graph shows the flexibility of Rocketpy and also the importance of the simulation being designed to run fast. 1) Dynamic Stability Analysis In this analysis the integration of three different RocketPy classes will be explored: Function, Rocket, and Flight. The moti- vation is to investigate how static stability translates into dynamic stability, i.e. different static margins result relies on different Fig. 4: Dynamic Stability example, unstable rocket presented on blue dynamic behavior, which also depends on the rocket’s rotational line inertia. We can assume the objects stated in [motor] and [rocket] sections and just add a couple of variations on some input data Monte Carlo Simulation to visualize the output effects. 
More specifically, the idea will be When simulating a rocket’s trajectory, many input parameters to explore how the dynamic stability of the studied rocket varies may not be completely reliable due to several uncertainties in by changing the position of the set of fins by a certain factor. measurements raised during the design or construction phase of To do that, we have to simulate multiple flights with different the rocket. These uncertainties can be considered together in a static margins, which is achieved by varying the rocket’s fin group of Monte Carlo simulations [RK16] which can be built on positions. This can be done through a simple python loop, as top of RocketPy. described below: The Monte Carlo method here is applied by running a signifi- 1 simulation_results = [] cant number of simulations where each iteration has a different 2 for factor in [0.5, 0.7, 0.9, 1.1, 1.3]: set of inputs that are randomly sampled given a previously 3 # remove previous fin set known probability distribution, for instance the mean and standard 4 ex_rocket.aerodynamicSurfaces.remove(fin_set) 5 fin_set = ex_rocket.addFins( deviation of a Gaussian distribution. Almost every input data 6 4, span=0.1, rootChord=0.120, tipChord=0.040, presents some kind of uncertainty, except for the number of fins or 7 distanceToCM=-1.04956 * factor propellant grains that a rocket presents. Moreover, some inputs, 8 ) 9 ex_flight = Flight( such as wind conditions, system failures, or the aerodynamic 10 rocket=ex_rocket, coefficient curves, may behave differently and must receive special 11 environment=env, treatment. 12 inclination=90, Statistical analysis can then be made on all the simulations, 13 heading=0, 14 maxTimeStep=0.01, with the main result being the 1σ , 2σ , and 3σ ellipses representing 15 maxTime=5, the possible area of impact and the area where the apogee is ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 223 reached (Fig. 5). All ellipses can be evaluated based on the method 22 export_flight_data(s, ex_flight) presented by [Che66]. 23 except Exception as E: 24 # if an error occurs, export the error 25 # message to a text file 26 print(E) 27 export_flight_error(s) Finally, the set of inputs for each simulation along with its set of outputs, are stored in a .txt file. This allows for long-term data storage and the possibility to append simulations to previously finished ones. The stored output data can be used to study the final probability distribution of key parameters, as illustrated on Fig. 6. Fig. 5: 1 1σ , 2 2σ , and 3 3σ dispersion ellipses for both apogee and landing point When performing the Monte Carlo simulations on RocketPy, all the inputs - i.e. the parameters along with their respective standard deviations - are stored in a dictionary. The randomized set of inputs is then generated using a yield function: 1 def sim_settings(analysis_params, iter_number): Fig. 6: Distribution of apogee altitude 2 i = 0 3 while i < iter_number: 4 # Generate a simulation setting Finally, it is also worth mentioning that all the information 5 sim_setting = {} generated in the Monte Carlo simulation is based on RocketPy 6 for p_key, p_value in analysis_params.items(): may be of utmost importance to safety and operational manage- 7 if type(p_value) is tuple: 8 sim_setting[p_key] = normal(*p_value) ment during rocket launches, once it allows for a more reliable 9 else: prediction of the landing site and apogee coordinates. 
10 sim_setting[p_key] = choice(p_value) 11 # Update counter 12 i += 1 Validation of the results: Unit, Dimensionality and Acceptance 13 # Yield a simulation setting 14 yield sim_setting Tests Validation is a big problem for libraries like RocketPy, where Where analysis_params is the dictionary with the inputs and true values for some results like apogee and maximum velocity iter_number is the total number of simulations to be performed. At is very hard to obtain or simply not available. Therefore, in that time the function yields one dictionary with one set of inputs, order to make RocketPy more robust and easier to modify, while which will be used to run a simulation. Later the sim_settings maintaining precise results, some innovative testing strategies have function is called again and another simulation is run until the loop iterations reach the number of simulations: been implemented. First of all, unit tests were implemented for all classes and 1 for s in sim_settings(analysis_params, iter_number): 2 # Define all classes to simulate with the current their methods ensuring that each function is working properly. 3 # set of inputs generated by sim_settings Given a set of different inputs that each function can receive, the 4 respective outputs are tested against expected results, which can be 5 # Prepare Environment 6 ex_env = Environment(.....) based on real data or augmented examples cases. The test fails if 7 # Prepare Motor the output deviates considerably from the established conditions, 8 ex_motor = SolidMotor(.....) or an unexpected error occurs along the way. 9 # Prepare Rocket Since RocketPy relies heavily on mathematical functions to 10 ex_rocket = Rocket(.....) 11 nose_cone = ex_rocket.addNose(.....) express the governing equations, implementation errors can occur 12 fin_set = ex_rocket.addFins(....) due to the convoluted nature of such expressions. Hence, to reduce 13 tail = ex_rocket.addTail(.....) the probability of such errors, there is a second layer of testing 14 15 # Considers any possible errors in the simulation which will evaluate if such equations are dimensionally correct. 16 try: To accomplish this, RocketPy makes use of the numericalunits 17 # Simulate Flight until Apogee library, which defines a set of independent base units as randomly- 18 ex_flight = Flight(.....) chosen positive floating point numbers. In a dimensionally-correct 19 20 # Function to export all output and input function, the units all cancel out when the final answer is divided 21 # data to a text file (.txt) by its resulting unit. And thus, the result is deterministic, not 224 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) random. On the other hand, if the function contains dimensionally- 1 def test_static_margin_dimension( incorrect equations, there will be random factors causing a 2 unitless_rocket, 3 unitful_rocket randomly-varying final answer. In practice, RocketPy runs two 4 ): calculations: one without numericalunits, and another with the 5 ... dimensionality variables. The results are then compared to assess 6 s1 = unitless_rocket.staticMargin(0) if the dimensionality is correct. 7 s2 = unitful_rocket.staticMargin(0) 8 assert abs(s1 - s2) < 1e-6 Here is an example. 
First, a SolidMotor object and a Rocket object are initialized without numericalunits: In case the value of interest has units, such as the position of the 1 @pytest.fixture center of pressure of the rocket, which has units of length, then 2 def unitless_solid_motor(): such value must be divided by the relevant unit for comparison: 3 return SolidMotor( 1 def test_cp_position_dimension( 4 thrustSource="Cesaroni_M1670.eng", 2 unitless_rocket, 5 burnOut=3.9, 3 unitful_rocket 6 grainNumber=5, 4 ): 7 grainSeparation=0.005, 5 ... 8 grainDensity=1815, 6 cp1 = unitless_rocket.cpPosition(0) 9 ... 7 cp2 = unitful_rocket.cpPosition(0) / m 10 ) 8 assert abs(cp1 - cp2) < 1e-6 11 12 @pytest.fixture If the assertion fails, we can assume that the formula responsible 13 def unitless_rocket(solid_motor): 14 return Rocket( for calculating the center of pressure position was implemented 15 motor=unitless_solid_motor, incorrectly, probably with a dimensional error. 16 radius=0.0635, Finally, some tests at a larger scale, known as acceptance 17 mass=16.241, tests, were implemented to validate outcomes such as apogee, 18 inertiaI=6.60, 19 inertiaZ=0.0351, apogee time, maximum velocity, and maximum acceleration when 20 distanceRocketNozzle=-1.255, compared to real flight data. A required accuracy for such values 21 distanceRocketPropellant=-0.85704, were established after the publication of the experimental data by 22 ... [CSA+ 21]. Such tests are crucial for ensuring that the code doesn’t 23 ) lose precision as a result of new updates. Then, a SolidMotor object and a Rocket object are initialized with These three layers of testing ensure that the code is trustwor- numericalunits: thy, and that new features can be implemented without degrading 1 import numericalunits the results. 2 3 @pytest.fixture 4 def m(): Conclusions 5 return numericalunits.m 6 RocketPy is an easy-to-use tool for simulating high-powered 7 rocket trajectories built with SciPy and the Python Scientific 8 @pytest.fixture Environment. The software’s modular architecture is based on 9 def kg(): four main classes and helper classes with well-documented code 10 return numericalunits.kg 11 that allows to easily adapt complex simulations to various needs 12 @pytest.fixture using the supplied Jupyter Notebooks. The code can be a useful 13 def unitful_motor(kg, m): tool during Rocket design and operation, allowing to calculate return SolidMotor( 14 of key parameters such as apogee and dynamic stability as well 15 thrustSource="Cesaroni_M1670.eng", 16 burnOut=3.9, as high-fidelity 6-DOF vehicle trajectory with a wide variety of 17 grainNumber=5, customizable parameters, from its launch to its point of impact. 18 grainSeparation=0.005 * m, RocketPy is an ever-evolving framework and is also accessible to 19 grainDensity=1815 * (kg / m**3), 20 ... anyone interested, with an active community maintaining it and 21 ) working on future features such as the implementation of other 22 engine types, such as hybrids and liquids motors, and even orbital 23 @pytest.fixture flights. 24 def unitful_rocket(kg, m, dimensionless_motor): 25 return Rocket( 26 motor=unitful_motor, Installing RocketPy 27 radius=0.0635 * m, 28 mass=16.241 * kg, RocketPy was made to run on Python 3.6+ and requires the 29 inertiaI=6.60 * (kg * m**2), packages: Numpy >=1.0, Scipy >=1.0 and Matplotlib >= 3.0. For 30 inertiaZ=0.0351 * (kg * m**2), a complete experience we also recommend netCDF4 >= 1.4. 
All 31 distanceRocketNozzle=-1.255 * m, 32 distanceRocketPropellant=-0.85704 * m, these packages, except netCDF4, will be installed automatically if 33 ... the user does not have them. To install, execute: 34 ) pip install rocketpy Then, to ensure that the equations implemented in both classes or (Rocket and SolidMotor) are dimensionally correct, the val- conda install -c conda-forge rocketpy ues computed can be compared. For example, the Rocket class computes the rocket’s static margin, which is a non-dimensional The source code, documentation and more examples are available value and the result from both calculations should be the same: at https://github.com/Projeto-Jupiter/RocketPy ROCKETPY: COMBINING OPEN-SOURCE AND SCIENTIFIC LIBRARIES TO MAKE THE SPACE SECTOR MORE MODERN AND ACCESSIBLE 225 Acknowledgments The authors would like to thank the University of São Paulo, for the support during the development of the current publication, and also all members of Projeto Jupiter and the RocketPy Team who contributed to the making of the RocketPy library. R EFERENCES [AEH+ 19] Adam Aitoumeziane, Peter Eusebio, Conor Hayes, Vivek Ra- machandran, Jamie Smith, Jayasurya Sridharan, Luke St Regis, Mark Stephenson, Neil Tewksbury, Madeleine Tran, and Hao- nan Yang. Traveler IV Apogee Analysis. Technical report, USC Rocket Propulsion Laboratory, Los Angeles, 2019. URL: http://www.uscrpl.com/s/Traveler-IV-Whitepaper. [Aki70] Hiroshi Akima. A new method of interpolation and smooth curve fitting based on local procedures. Journal of the ACM (JACM), 17(4):589–602, 1970. doi:10.1145/321607. 321609. [Bar67] James S Barrowman. The Practical Calculation of the Aero- dynamic Characteristics of Slender Finned Vehicles. PhD thesis, Catholic University of America, Washington, DC United States, 1967. [Che66] Victor Chew. Confidence, Prediction, and Tolerance Re- gions for the Multivariate Normal Distribution. Journal of the American Statistical Association, 61(315), 1966. doi: 10.1080/01621459.1966.10480892. [Cok98] J Coker. Thrustcurve.org — rocket motor performance data online, 1998. URL: https://www.thrustcurve.org/. [CSA+ 21] Giovani H Ceotto, Rodrigo N Schmitt, Guilherme F Alves, Lu- cas A Pezente, and Bruno S Carmo. Rocketpy: Six degree-of- freedom rocket trajectory simulator. Journal of Aerospace En- gineering, 34(6), 2021. doi:10.1061/(ASCE)AS.1943- 5525.0001331. [ISO75] ISO Central Secretary. Standard Atmosphere. Technical Report ISO 2533:1975, International Organization for Standardization, Geneva, CH, 5 1975. [MNK03] Robert C Martin, James Newkirk, and Robert S Koss. Agile software development: principles, patterns, and practices, vol- ume 2. Prentice Hall Upper Saddle River, NJ, 2003. [PdDKÜK83] Robert Piessens, Elise de Doncker-Kapenga, Christoph W Überhuber, and David K Kahaner. Quadpack: a subroutine package for automatic integration, volume 1. Springer Science & Business Media, 1983. doi:10.1007/978-3-642- 61786-7. [Pet83] Linda Petzold. Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equa- tions. SIAM Journal on Scientific and Statistical Computing, 4(1):136–148, 3 1983. doi:10.1137/0904010. [Rei22] A Reilley. openmotor: An open-source internal ballistics simulator for rocket motor experimenters, 2022. URL: https: //github.com/reilleya/openMotor. [RK16] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method. John Wiley & Sons, 2016. doi:10. 1002/9781118631980. [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E. 
Wailord: Parsers and Reproducibility for Quantum Chemistry

Rohit Goswami‡§∗
∗ Corresponding author: rog32@hi.is
‡ Science Institute, University of Iceland
§ Quansight Austin, TX, USA

Copyright © 2022 Rohit Goswami. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Data driven advances dominate the applied sciences landscape, with quantum chemistry being no exception to the rule. Dataset biases and human error are key bottlenecks in the development of reproducible and generalized insights. At a computational level, we demonstrate how changing the granularity of the abstractions employed in data generation from simulations can aid in reproducible work. In particular, we introduce wailord (https://wailord.xyz), a free-and-open-source python library to shorten the gap between data-analysis and computational chemistry, with a focus on the ORCA suite binaries. A two level hierarchy and exhaustive unit-testing ensure the ability to reproducibly describe and analyze "computational experiments". wailord offers both input generation, with enhanced analysis, and raw output analysis, for traditionally executed ORCA runs. The design focuses on treating output and input generation in terms of a mini domain specific language instead of more imperative approaches, and we demonstrate how this abstraction facilitates chemical insights.

Index Terms—quantum chemistry, parsers, reproducible reports, computational inference

Introduction

The use of computational methods for chemistry is ubiquitous, and few modern chemists retain the initial skepticism of the field [Koh99], [Sch86]. Machine learning has been further earmarked [MSH19], [Dra20], [SGT+ 19] as an effective accelerator for computational chemistry at every level, from DFT [GLL+ 16] to alchemical searches [DBCC16] and saddle point searches [ÁJ18]. However, these methods trade technical rigor for vast amounts of data, and so the ability to reproduce results becomes increasingly more important. Independently, the ability to reproduce results [Pen11], [SNTH13] has become a central concern in all fields of computational research, and has spawned a veritable flock of methodological and programmatic advances [CAB+ 19], including the sophisticated provenance tracking of AiiDA [PCS+ 16], [HZU+ 20].

Dataset bias

Dataset bias [EIS+ 20], [BS19], [RBA+ 19] has gained prominence in the machine learning literature, but has not yet percolated through to the chemical sciences community. At its core, the argument for dataset biases in generic machine learning problems of image and text classification can be linked to the difficulty in obtaining labeled results for training purposes. This is not an issue in the computational physical sciences at all, as the training data can often be labeled without human intervention. This is especially true when simulations are carried out at varying levels of accuracy. However, this also leads to a heavy reliance on high accuracy calculations on "benchmark" datasets and results [HMSE+ 21], [SEJ+ 19].

Compute is expensive, and the reproduction of data which is openly available is often hard to justify as a valid scientific endeavor. Rather than focus on the observable outputs of calculations, we instead assert that it is best to be able to have reproducible confidence in the elements of the workflow. In the following sections, we will outline wailord, a library which implements a two level structure for interacting with ORCA [Nee12] to implement an end-to-end workflow to analyze and prepare datasets. Our focus on ORCA is due to its rapid and responsive development cycles, the fact that it is free to use (but not open source), and its large repertoire of computational chemistry calculations. Notably, the black-box nature of ORCA (in that the source is not available) mirrors that of many other packages (which are not free) like VASP [Haf08]. Using ORCA, then, allows us to design a workflow which is best suited for working with many software suites in the community.

We shall understand wailord through the lens of what is often known as a design pattern in the practice of computational science and engineering, that is, a template or description for solving commonly occurring problems in the design of programs.

Structure and Implementation

Python has grown to become the lingua franca for much of the scientific community [Oli07], [MA11], in no small part because of its interactive nature. In particular, the REPL (read-evaluate-print-loop) structure which has been prioritized (from IPython to Jupyter) is one of the prime motivations for the use of Python as an exploratory tool. Additionally, PyPI, the Python package index, accelerates the widespread disambiguation of software packages. Thus wailord is implemented as a free and open source Python library.

Structure

Data generation involves a set of known configurations (say, xyz inputs) and a series of common calculations whose outputs are required. Computational chemistry packages tend to be focused on acceleration and setup details on a per-job scale. wailord, in contrast, considers the outputs of simulations to form a tree, where the actual run and its inputs are the leaves, and each layer of the tree structure holds information which is collated into a single dataframe which is presented to the user.
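As a rough illustration of this tree-to-dataframe idea, and not of wailord's actual API, the sketch below walks a hypothetical directory layout of completed runs, pulls one number out of each output file with a regular expression, and collates the results into a single pandas dataframe; the layout, file names, and the "FINAL SINGLE POINT ENERGY" pattern are assumptions made for the example.

import re
from pathlib import Path
import pandas as pd

# Hypothetical pattern: ORCA-style outputs report a line such as
# "FINAL SINGLE POINT ENERGY     -1.12345678"; adjust for the code being parsed.
ENERGY_RE = re.compile(r"FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)")

def harvest(root):
    """Walk a tree of runs (root/<basis>/<molecule>/run.out) into one dataframe."""
    rows = []
    for out in Path(root).glob("*/*/run.out"):
        matches = ENERGY_RE.findall(out.read_text())
        if not matches:
            continue  # skip runs that did not finish
        rows.append(
            {
                "basis": out.parts[-3],      # first layer of the tree
                "molecule": out.parts[-2],   # second layer of the tree
                "energy_hartree": float(matches[-1]),  # last match = final energy
            }
        )
    return pd.DataFrame(rows)

# df = harvest("buildOuts")
# print(df.groupby("basis")["energy_hartree"].min())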
Downstream tasks for simulations of chemical systems involve questions phrased as queries or comparative measures. With that in mind, wailord generates pandas dataframes which are indistinguishable from standard machine learning information sources, to trivialize the data-munging and preparation process. The outputs of wailord represent concrete information; it is not meant to store runs like the ASE database [LMB+ 17], nor to run a process to manage discrete workflows like AiiDA [HZU+ 20]. By construction, it also differs from existing "interchange" formats such as those favored by materials data repositories like the QCArchive project [SAB+ 21], and is partially close in spirit to the cclib endeavor [OTL08].

Implementation

Two classes form the backbone of the data-harvesting process. The intended point of interface with a user is the orcaExp class, which collects information from multiple ORCA outputs and produces dataframes which include relevant metadata (theory, basis, system, etc.) along with the requested results (energy surfaces, energies, angles, geometries, frequencies, etc.). A lower level "orca visitor" class is meant to parse each individual ORCA output. Until the release of ORCA 5, which promises structured property files, the outputs are necessarily parsed with regular expressions, but validated extensively. The focus on ORCA has allowed for more exotic helper functions, like the calculation of rate constants from orcaVis files. However, beyond this functionality offered by the quantum chemistry software (ORCA), a computational chemistry workflow requires data to be more malleable. To this end, the plain-text or binary outputs of quantum chemistry software must be further worked on (post-processed) to gain insights. This means, for example, that the outputs may be entered into a spreadsheet, or into a plain text note, or a lab notebook; in practice, programming languages are a good level of abstraction. Of the programming languages, Python, as a general purpose programming language with a high rate of community adoption, is a good starting place.

Python has a rich set of structures implemented in the standard library, which have been liberally used for structuring outputs. Furthermore, there have been efforts to convert the grammar of graphics [WW05] and tidy-data [WAB+ 19] approaches to the pandas package, which have also been adapted internally, including strict unit adherence using the pint library. The user is not burdened by these implementation details and is instead ensured a pandas data-frame for all operations, both at the orcaVis level and the orcaExp level.

Fig. 1: Some implemented workflows, including the two input YML files. VPT2 stands for second-order vibrational perturbation theory, Orca_vis objects are part of wailord's class structure, and PES stands for potential energy surface.

User Interface

The core user interface is depicted in Fig. 1. The test suites cover standard usage and serve as ad-hoc tutorials. Additionally, Jupyter notebooks are also able to effectively run wailord, which facilitates its use over SSH connections to high-performance-computing (HPC) clusters. The user is able to describe the nature of the calculations required in a simple YAML file format. A command line interface can then be used to generate inputs, or another YAML file may be passed to describe the paths needed. A very basic harness script for submissions is also generated, which can be rate limited to ensure optimal runs on an HPC cluster.

Design and Usage

A simulation study can be broken into:
• Inputs + Configuration for runs + Data for structures
• Outputs per run
• Post-processing and aggregation

From a software design perspective, it is important to recognize the right level of abstraction for the given problem. An object-oriented pattern is seen to be the correct design paradigm. However, though combining test driven development and object
oriented design is robust and extensible, the design of wailord Software industry practices have been followed throughout the is meant to tackle the problem at the level of a domain specific development process. In particular, the entire package is written in language. Recall from formal language theory [AA07] the fact a test-driven-development (TDD) fashion which has been proven that a grammar is essentially meant to specify the entire possible many times over for academia [DJS08] and industry [BN06]. set of inputs and outputs for a given language. A grammar can In essence, each feature is accompanied by a test-case. This is be expressed as a series of tokens (terminal symbols) and non- meant to ensure that once the end-user is able to run the test- terminal (syntactic variables) symbols along with rules defining suite, they are guaranteed the features promised by the software. valid combinations of these. Additionally, this means that potential bugs can be submitted It may appear that there is little but splitting hairs between as a test case which helps isolate errors for fixes. Furthermore, parsing data line by line as is traditionally done in libraries, com- software testing allows for coverage metrics, thereby enhancing pared to defining the exact structural relations between allowed user and development confidence in different components of any symbols. However, this design, apart from disallowing invalid large code-base. inputs, also makes sense from a pedagogical perspective. 228 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) For example, of the inputs, structured data like configurations Usage is then facilitated by a high-level call. (XYZ formats) are best handled by concrete grammars, where waex.cookies.gen_base( each rule is followed in order: template="basicExperiment", absolute=False, grammar_xyz = Grammar( filen="./lab6/expCookieST_meth.yml", r""" ) meta = natoms ws coord_block ws? natoms = number The resulting directory tree can be sent to a High Performance coord_block = (aline ws)+ aline = (atype ws cline) Computing Cluster (HPC), and once executed via the generated atype = ~"[a-zA-Z]" / ~"[0-9]" run-script helper; locally analysis can proceed. cline = (float ws float ws float) mdat = waio.orca.genEBASet(Path("buildOuts") / \ float = pm number "." number "methylene", pm = ~"[+-]?" deci=4) number = ~"\\d+" print(mdat.to_latex(index=False, ws = ~"\\s*" caption="CH2 energies and angles \ """ at various levels of theory, with NUMGRAD")) ) In certain situations, ordering may be relevant as well (e.g. for gen- This definition maps neatly into the exact specification of an xyz erating curves of varying density functional theoretic complexity). file: This can be handled as well. 2 For the outputs, similar to the key ideas across signac, nix, H -2.8 2.8 0.1 spack and other tools, control is largely taken away from the user H -3.2 3.4 0.2 in terms of the auto-generated directory structure. The outputs of each run is largely collected through regular expressions, due to Where we recognize that the overarching structure is of the the ever changing nature of the outputs of closed source software. number of atoms, followed by multiple coordinate blocks followed Importantly, for a code which is meant to confer insights, by optional whitespace. We move on to define each coordinate the concept of units is key. wailord with ORCA has first class block as a line of one or many aline constructs, each of which support for units using pint. 
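Since first class unit support via pint is mentioned above, a small generic sketch may help; the numbers are invented, it assumes pint's bundled registry includes the atomic units, and it is not a transcript of wailord's own unit handling.

import pint

ureg = pint.UnitRegistry()

# Assumes pint's default definitions include atomic units (hartree, bohr);
# if not, they can be added with ureg.define().
scf_energy = -1.1175 * ureg.hartree      # e.g. a parsed SCF energy
bond_length = 0.74 * ureg.angstrom       # e.g. a parsed bond length

print(scf_energy.to("eV"))               # roughly -30.4 eV
print(bond_length.to("bohr"))            # roughly 1.40 bohr

# Dimensional mistakes fail loudly instead of propagating silently:
try:
    scf_energy + bond_length
except pint.DimensionalityError as err:
    print("caught:", err)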
is an atype with whitespace and three float values representing coordinates. Finally we define the positive, negative, numeric and Dissociation of H2 whitespace symbols to round out the grammar. This is the exact form of every valid xyz file. The parsimonious library allows As a concrete example, we demonstrate a popular pedagogical handling grammatical constructs in a Pythonic manner. exercise, namely to obtain the binding energy curves of the H2 However, the generation of inputs is facilitated through the molecule at varying basis sets and for the Hartree Fock, along with use of generalized templates for "experiments" controlled by the results of Kolos and Wolniewicz [KW68]. We first recognize, cookiecutter. This allows for validations on the workflow that even for a moderate 9 basis sets with 33 points, we expect during setup itself. around 1814 data points. Where each basis set requires a separate For the purposes of the simulation study, one "experiment" run, this is easily expected to be tedious. consists of multiple single-shot runs; each of which can take a Naively, this would require modifying and generating ORCA long time. input files. Concretely, the top-level "experiment" is controlled by a !UHF 3-21G ENERGY YAML file: %paras project_slug: methylene R = 0.4, 2.0, 33 # x-axis of H1 project_name: singlet_triplet_methylene end outdir: "./lab6" desc: An experiment to calculate singlet and triplet *xyz 0 1 states differences at a QCISD(T) level H 0.00 0.0000000 0.0000000 author: Rohit H {R} 0.0000000 0.0000000 year: "2020" * license: MIT orca_root: "/home/orca/" We can formulate the requirement imperatively as: orca_yml: "orcaST_meth.yml" qc: inp_xyz: "ch2_631ppg88_trip.xyz" active: True style: ["UHF", "QCISD", "QCISD(T)"] Where each run is then controlled individually. calculations: ["ENERGY"] # Same as single point or SP qc: basis_sets: active: True - 3-21G style: ["UHF", "QCISD", "QCISD(T)"] - 6-31G calculations: ["OPT"] - 6-311G basis_sets: - 6-311G* - 6-311++G** - 6-311G** xyz: "inp.xyz" - 6-311++G** spin: - 6-311++G(2d,2p) - "0 1" # Singlet - 6-311++G(2df,2pd) - "0 3" # Triplet - 6-311++G(3df,3pd) extra: "!NUMGRAD" xyz: "inp.xyz" viz: spin: molden: True - "0 1" chemcraft: True params: jobscript: "basejob.sh" - name: R WAILORD: PARSERS AND REPRODUCIBILITY FOR QUANTUM CHEMISTRY 229 range: [0.4, 2.00] points: 33 slot: xyz: True atype: "H" anum: 1 # Start from 0 axis: "x" extra: Null jobscript: "basejob.sh" This run configuration is coupled with an experiment setup file, similar to the one in the previous section. With this in place, generating a data-set of all the required data is fairly trivial. kolos = pd.read_csv( "../kolos_H2.ene", skiprows=4, header=None, names=["bond_length", "Actual Energy"], sep=" ", ) kolos['theory']="Kolos" expt = waio.orca.orcaExp(expfolder=Path("buildOuts") / "h2") h2dat = expt.get_energy_surface() Fig. 2: Plots generated from tidy principles for post-processing Finally, the resulting data can be plotted using tidy principles. wailord parsed outputs. imgname = "images/plotH2A.png" p1a = ( p9.ggplot( data=h2dat, mapping=p9.aes(x="bond_length", here has been applied to ORCA, however, the two level structure y="Actual Energy", has generalizations to most quantum chemistry codes as well. color="theory") Importantly, we note that the ideas expressed form a design ) pattern for interacting with a plethora of computational tools + p9.geom_point() + p9.geom_point(mapping=p9.aes(x="bond_length", in a reproducible manner. 
By defining appropriate scopes for y="SCF Energy"), our structured parsers, generating deterministic directory trees, color="black", alpha=0.1, along with a judicious use of regular expressions for output data shape='*', show_legend=True) harvesting, we are able to leverage tidy-data principles to analyze + p9.geom_point(mapping=p9.aes(x="bond_length", y="Actual Energy", the results of a large number of single-shot runs. color="theory"), Taken together, this tool-set and methodology can be used to data=kolos, generate elegant reports combining code and concepts together show_legend=True) + p9.scales.scale_y_continuous(breaks in a seamless whole. Beyond this, the interpretation of each = np.arange( h2dat["Actual Energy"].min(), computational experiment in terms of a concrete domain specific h2dat["Actual Energy"].max(), 0.05) ) language is expected to reduce the requirement of having to re-run + p9.ggtitle("Scan of an H2 \ benchmark calculations. bond length (dark stars are SCF energies)") + p9.labels.xlab("Bond length in Angstrom") + p9.labels.ylab("Actual Energy (Hatree)") Acknowledgments + p9.facet_wrap("basis") ) R Goswami thanks H. Jónsson and V. Ásgeirsson for discussions p1a.save(imgname, width=10, height=10, dpi=300) on the design of computational experiments for inference in Which gives rise to the concise representation Fig. 2 from which computation chemistry. This work was partially supported by the all required inference can be drawn. Icelandic Research Fund, grant number 217436052. In this particular case, it is possible to see the deviations from the experimental results at varying levels of theory for different R EFERENCES basis sets. [AA07] Alfred V. Aho and Alfred V. Aho, editors. Compilers: Principles, Techniques, & Tools. Pearson/Addison Wesley, Boston, 2nd ed edition, 2007. Conclusions [ÁJ18] Vilhjálmur Ásgeirsson and Hannes Jónsson. Exploring Potential Energy Surfaces with Saddle Point Searches. In Wanda Andreoni We have discussed wailord in the context of generating, in and Sidney Yip, editors, Handbook of Materials Modeling, pages a reproducible manner the structured inputs and output datasets 1–26. Springer International Publishing, Cham, 2018. doi: which facilitate chemical insight. The formulation of bespoke 10.1007/978-3-319-42913-7_28-1. datasets tailored to the study of specific properties across a wide [BN06] Thirumalesh Bhat and Nachiappan Nagappan. Evaluating the efficacy of test-driven development: Industrial case studies. In range of materials at varying levels of theory has been shown. Proceedings of the 2006 ACM/IEEE International Symposium The test-driven-development approach is a robust methodology on Empirical Software Engineering, ISESE ’06, pages 356–363, for interacting with closed source software. The design patterns New York, NY, USA, September 2006. Association for Comput- expressed, of which the wailord library is a concrete imple- ing Machinery. doi:10.1145/1159733.1159787. [BS19] Avrim Blum and Kevin Stangl. Recovering from Biased Data: mentation, is expected to be augmented with more workflows, in Can Fairness Constraints Improve Accuracy? arXiv:1912.01094 particular, with a focus on nudged elastic band. The methodology [cs, stat], December 2019. arXiv:1912.01094. 230 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [CAB+ 19] The Turing Way Community, Becky Arnold, Louise Bowler, [Nee12] Frank Neese. The ORCA program system. 
WIREs Computa- Sarah Gibson, Patricia Herterich, Rosie Higman, Anna Krys- tional Molecular Science, 2(1):73–78, 2012. doi:10.1002/ talli, Alexander Morley, Martin O’Reilly, and Kirstie Whitaker. wcms.81. The Turing Way: A Handbook for Reproducible Data Science. [Oli07] T. E. Oliphant. Python for Scientific Computing. Comput- Zenodo, March 2019. ing in Science Engineering, 9(3):10–20, May 2007. doi: [DBCC16] Sandip De, Albert P. Bartók, Gábor Csányi, and Michele 10/fjzzc8. Ceriotti. Comparing molecules and solids across struc- [OTL08] Noel M. O’boyle, Adam L. Tenderholt, and Karol M. tural and alchemical space. Physical Chemistry Chemical Langner. Cclib: A library for package-independent computa- Physics, 18(20):13754–13769, May 2016. doi:10.1039/ tional chemistry algorithms. Journal of Computational Chem- C6CP00415F. istry, 29(5):839–845, 2008. doi:10.1002/jcc.20823. [DJS08] Chetan Desai, David Janzen, and Kyle Savage. A survey [PCS+ 16] Giovanni Pizzi, Andrea Cepellotti, Riccardo Sabatini, Nicola of evidence for test-driven development in academia. ACM Marzari, and Boris Kozinsky. AiiDA: Automated interactive SIGCSE Bulletin, 40(2):97–101, June 2008. doi:10.1145/ infrastructure and database for computational science. Compu- 1383602.1383644. tational Materials Science, 111:218–230, January 2016. doi: [Dra20] Pavlo O. Dral. Quantum Chemistry in the Age of Ma- 10.1016/j.commatsci.2015.09.013. chine Learning. The Journal of Physical Chemistry Let- [Pen11] Roger D. Peng. Reproducible Research in Computational Sci- ters, 11(6):2336–2347, March 2020. doi:10.1021/acs. ence. Science, 334(6060):1226–1227, December 2011. doi: jpclett.9b03664. 10/fdv356. [EIS+ 20] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris [RBA+ 19] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. Statistical Bias in Dataset Replication. arXiv:2005.09619 [cs, On the Spectral Bias of Neural Networks. In Proceedings of stat], May 2020. arXiv:2005.09619. the 36th International Conference on Machine Learning, pages [GLL+ 16] Ting Gao, Hongzhi Li, Wenze Li, Lin Li, Chao Fang, Hui Li, Li- 5301–5310. PMLR, May 2019. Hong Hu, Yinghua Lu, and Zhong-Min Su. A machine learning [SAB+ 21] Daniel G. A. Smith, Doaa Altarawy, Lori A. Burns, Matthew correction for DFT non-covalent interactions based on the S22, Welborn, Levi N. Naden, Logan Ward, Sam Ellis, Benjamin P. S66 and X40 benchmark databases. Journal of Cheminformatics, Pritchard, and T. Daniel Crawford. The MolSSI QCArchive 8(1):24, May 2016. doi:10.1186/s13321-016-0133-7. project: An open-source platform to compute, organize, and [Haf08] Jürgen Hafner. Ab-initio simulations of materials using VASP: share quantum chemistry data. WIREs Computational Molecular Density-functional theory and beyond. Journal of Computa- Science, 11(2):e1491, 2021. doi:10.1002/wcms.1491. tional Chemistry, 29(13):2044–2078, 2008. doi:10.1002/ [Sch86] Henry F. Schaefer. Methylene: A Paradigm for Computational jcc.21057. Quantum Chemistry. Science, 231(4742):1100–1107, March 1986. doi:10.1126/science.231.4742.1100. [HMSE+ 21] Johannes Hoja, Leonardo Medrano Sandonas, Brian G. Ernst, [SEJ+ 19] Andrew W. Senior, Richard Evans, John Jumper, James Kirk- Alvaro Vazquez-Mayagoitia, Robert A. DiStasio Jr., and Alexan- patrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, dre Tkatchenko. QM7-X, a comprehensive dataset of quantum- Alexander W. R. 
Nelson, Alex Bridgland, Hugo Penedones, mechanical properties spanning the chemical space of small Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, organic molecules. Scientific Data, 8(1):43, February 2021. David T. Jones, David Silver, Koray Kavukcuoglu, and Demis doi:10.1038/s41597-021-00812-2. Hassabis. Protein structure prediction using multiple deep neural [HZU+ 20] Sebastiaan P. Huber, Spyros Zoupanos, Martin Uhrin, Leopold networks in the 13th Critical Assessment of Protein Structure Talirz, Leonid Kahle, Rico Häuselmann, Dominik Gresch, Prediction (CASP13). Proteins: Structure, Function, and Bioin- Tiziano Müller, Aliaksandr V. Yakutovich, Casper W. Andersen, formatics, 87(12):1141–1148, 2019. doi:10.1002/prot. Francisco F. Ramirez, Carl S. Adorf, Fernando Gargiulo, Snehal 25834. Kumbhar, Elsa Passaro, Conrad Johnston, Andrius Merkys, An- [SGT+ 19] K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller, drea Cepellotti, Nicolas Mounet, Nicola Marzari, Boris Kozin- and R. J. Maurer. Unifying machine learning and quantum sky, and Giovanni Pizzi. AiiDA 1.0, a scalable computa- chemistry with a deep neural network for molecular wavefunc- tional infrastructure for automated reproducible workflows and tions. Nature Communications, 10(1):5024, November 2019. data provenance. Scientific Data, 7(1):300, September 2020. doi:10.1038/s41467-019-12875-2. doi:10.1038/s41597-020-00638-4. [SNTH13] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind [Koh99] W. Kohn. Nobel Lecture: Electronic structure of matter— Hovig. Ten Simple Rules for Reproducible Computational Re- wave functions and density functionals. Reviews of Modern search. PLOS Computational Biology, 9(10):e1003285, October Physics, 71(5):1253–1266, October 1999. doi:10.1103/ 2013. doi:10/pjb. RevModPhys.71.1253. [WAB+ 19] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston [KW68] W. Kolos and L. Wolniewicz. Improved Theoretical Ground- Chang, Lucy D’Agostino McGowan, Romain François, Garrett State Energy of the Hydrogen Molecule. The Journal of Chem- Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, ical Physics, 49(1):404–410, July 1968. doi:10.1063/1. Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, 1669836. Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, [LMB+ 17] Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Ivano E. Castelli, Rune Christensen, Marcin Du\lak, Jesper Kara Woo, and Hiroaki Yutani. Welcome to the Tidyverse. Friis, Michael N. Groves, Bjørk Hammer, Cory Hargus, Eric D. Journal of Open Source Software, 4(43):1686, November 2019. Hermes, Paul C. Jennings, Peter Bjerre Jensen, James Kermode, doi:10.21105/joss.01686. John R. Kitchin, Esben Leonhard Kolsbjerg, Joseph Kubal, Kris- [WW05] Leland Wilkinson and Graham Wills. The Grammar of Graph- ten Kaasbjerg, Steen Lysgaard, Jón Bergmann Maronsson, Tris- ics. Statistics and Computing. Springer, New York, 2nd ed tan Maxson, Thomas Olsen, Lars Pastewka, Andrew Peterson, edition, 2005. Carsten Rostgaard, Jakob Schiøtz, Ole Schütt, Mikkel Strange, Kristian S. Thygesen, Tejs Vegge, Lasse Vilhelmsen, Michael Walter, Zhenhua Zeng, and Karsten W. Jacobsen. The atomic simulation environment—a Python library for working with atoms. Journal of Physics: Condensed Matter, 29(27):273002, June 2017. doi:10.1088/1361-648X/aa680e. [MA11] K. J. Millman and M. Aivazis. Python for Scientists and Engineers. Computing in Science Engineering, 13(2):9–12, March 2011. 
doi:10/dc343g. [MSH19] Ralf Meyer, Klemens S. Schmuck, and Andreas W. Hauser. Machine Learning in Computational Chemistry: An Evalua- tion of Method Performance for Nudged Elastic Band Cal- culations. Journal of Chemical Theory and Computation, 15(11):6513–6523, November 2019. doi:10.1021/acs. jctc.9b00708. PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 231 Variational Autoencoders For Semi-Supervised Deep Metric Learning Nathan Safir‡∗ , Meekail Zain§ , Curtis Godwin‡ , Eric Miller‡ , Bella Humphrey§ , Shannon P Quinn§¶ F Abstract—Deep metric learning (DML) methods generally do not incorporate loss may help incorporate semantic information from unlabelled unlabelled data. We propose borrowing components of the variational autoen- sources. Second, we propose that the structure of the VAE latent coder (VAE) methodology to extend DML methods to train on semi-supervised space, as it is confined by a prior distribution, can be used to datasets. We experimentally evaluate the atomic benefits to the perform- ing induce bias in the latent space of a DML system. For instance, DML on the VAE latent space such as the enhanced ability to train using if we know a dataset contains N -many classes, creating a prior unlabelled data and to induce bias given prior knowledge. We find that jointly training DML with an autoencoder and VAE may be potentially helpful for some distribution that is a learnable mixture of N gaussians may help semi-suprevised datasets, but that a training routine of alternating between produce better representations. Third, we propose that performing the DML loss and an additional unsupervised loss across epochs is generally DML on the latent space of the VAE so that the DML task can unviable. be jointly optimized with the VAE to incorporate unlabelled data may help produce better representations. Index Terms—Variational Autoencoders, Metric Learning, Deep Learning, Rep- Each of the three improvement proposals will be evaluated resentation Learning, Generative Models experimentally. The improvement proposals will be evaluated by comparing a standard DML implementation to the same DML Introduction implementation: Within the broader field of representation learning, metric learning is an area which looks to define a distance metric which is smaller • jointly optimized with an autoencoder between similar objects (such as objects of the same class) and • while structuring the latent space around a prior distribu- larger between dissimilar objects. Oftentimes, a map is learned tion using the VAE’s KL-divergence loss term between the from inputs into a low-dimensional latent space where euclidean approximated posterior and prior distance exhibits this relationship, encouraged by training said • jointly optimized with a VAE map against a loss (cost) function based on the euclidean distance Our primary contribution is evaluating these three improve- between sets of similar and dissimilar objects in the latent space. ment proposals. Our secondary contribution is presenting the Existing metric learning methods are generally unable to learn results of the joint approaches for VAEs and DML for more recent from unlabelled data, which is problematic because unlabelled metric losses that have not been jointly optimized with a VAE in data is often easier to obtain and is potentially informative. previous literature. 
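As a concrete anchor for the distance-based loss described in the introduction above, the following is a minimal NumPy sketch of a triplet loss over Euclidean distances in the latent space; the embeddings and margin are invented for illustration, and this is a generic formulation rather than the authors' implementation.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy latent vectors: the anchor sits near the positive (same class)
# and far from the negative (different class), so the loss is small.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, -0.1])
negative = np.array([2.0, 2.0])
print(triplet_loss(anchor, positive, negative))  # ~0.0, well separated
print(triplet_loss(anchor, negative, positive))  # large, badly embedded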
We take inspiration from variational autoencoders (VAEs), a generative representation learning architecture, for using un- Related Literature labelled data to create accurate representations. Specifically, we look to evaluate three atomic improvement proposals that detail The goal of this research is to investigate how components of the how pieces of the VAE architecture can create a better deep metric variational autoencoder can help the performance of deep metric learning (DML) model on a semi-supervised dataset. From here, learning in semi supervised tasks. We draw on previous literature we can ascertain which specific qualities of how VAEs process to find not only prior attempts at this specific research goal but unlabelled data are most helpful in modifying DML methods to also work in adjacent research questions that proves insightful. train with semi-supervised datasets. In this review of the literature, we discuss previous related work First, we propose that the autoencoder structure of the VAE in the areas of Semi-Supervised Metric Learning and VAEs with helps the clustering of unlabelled points, as the reconstruction Metric Losses. * Corresponding author: nssafir@gmail.com Semi-Supervised Metric Learning ‡ Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 USA There have been previous approaches to designing metric learning § Department of Computer Science, University of Georgia, Athens, GA 30602 architectures which incorporate unlabelled data into the metric USA ¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 learning training regimen for semi-supervised datasets. One of the USA original approaches is the MPCK-MEANS algorithm proposed by Bilenko et al. ([BBM04]), which adds a penalty for placing Copyright © 2022 Nathan Safir et al. This is an open-access article distributed labelled inputs in the same cluster which are of a different class under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the or in different clusters if they are of the same class. This penalty original author and source are credited. is proportional to the metric distance between the pair of inputs. 232 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Baghshah and Shouraki ([BS09]) also looks to impose similar also experiment with adding a (different) metric loss to the overall constraints by introducing a loss term to preserve locally linear VAE loss function. relationships between labelled and unlabelled data in the input Most recently, Grosnit et al. ([GTM+ 21]) leverage a new space. Wang et al. ([WYF13]) also use a regularizer term to training algorithm for combining VAEs and DML for Bayesian preserve the topology of the input space. Using VAEs, in a sense, Optimization and said algorithm using simple, contrastive, and draws on this theme: though there is not explicit term to enforce triplet metric losses. We look to build on this literature by also that the topology of the input space is preserved, a topology of testing a combined VAE DML architecture on more recent metric the inputs is intended to be learned through a low-dimensional losses, albeit using a simpler training regimen. manifold in the latent space. 
One more recent common general approach to this problem is to use the unlabelled data’s proximity to the labelled data Deep Metric Learning (DML) to estimate labels for unlabelled data, effectively transforming unlabelled data into labelled data. Dutta et al. ([DHS21]) and Li et Metric learning attempts to create representations for data by al. ([LYZ+ 19]) propose a model which uses affinity propagation training against the similarity or dissimilarity of samples. In a on a k-Nearest-Neighbors graph to label partitions of unlabelled more technical sense, there are two notable functions in DML data based on their closest neighbors in the latent space. Wu et al. systems. Function fθ is a neural network which maps the input ([WFZ20]) also look to assign pseudo-labels to unlabelled data, data X to the latent points Z (i.e. fθ : X 7→ Z, where θ is the but not through a graph-based approach. Instead, the proposed network parameters). Generally, Z exists in a space of much lower model looks to approximate "soft" pseudo-labels for unlabelled dimensionality than X (eg. X is a set of 28 × 28 pixel pictures such data from the metric learning similarity measure between the that X ⊂ R28×28 and Z ⊂ R10 ). embedding of unlabelled data and the center of each input of each The function D fθ (x, y) = D( fθ (x), fθ (y)) represents the dis- class of the labelled data. tance between two inputs x, y ∈ X. To create a useful embedding Several of the recent graph based approaches can be consid- model fθ , we would like for fθ to produce large values of D fθ (x, y) ered state-of-the-art for semi supervised metric learning. Li et. when x and y are dissimilar and for fθ to produce small values of al.’s paper states their methods achieve 98.9 percent clustering D fθ (x, y) when x and y are similar. In some cases, dissimilarity accuracy on the MNIST dataset with 10% labelled data, outper- and similarity can refer to when inputs are of different and the forming two similar state-of-the-art methods, DFCM ([ARJM18]) same classes, respectively. and SDEC ([RHD+ 19]), by roughly 8 points. Dutta et. al.’s method It is common for the Euclidean metric (i.e. the L2 metric) to also outperforms 5 other state for the R@1 metric (the "percentage be used as a distance function in metric learning. The generalized of test examples" that have at least one 1 "nearest neighbor from L p metric can be defined as follows, where z0 , z1 ∈ Rd . the same class.") by at leat 1.2 on the MNIST dataset, as well d as the Fashion-MNIST and CIFAR-10 datasets. It is difficult to D p (z0 , z1 ) = ||z0 − z1 || p = ( ∑ |z0i − z1i | p )1/p compare the two approaches as the evaluation metrics used in i=1 each paper differ. Li et al.’s paper has been cited rather heavily relative to other papers in the field and can be considered state If we have chosen fθ (a neural network) and the distance function of the art for semi-supervised DML on MNIST. The paper also D (the L2 metric), the remaining component to be defined in provides a helpful metric (98.9 percent clustering accuracy on the a metric learning system is the loss function for training f . In MNIST dataset with 10% labelled data) to use as a reference point practice, we will be using triplet loss ([SKP15]), one of the most for the results in this paper. common metric learning loss functions. VAEs with Metric Loss Methodology Some approaches to incorporating labelled data into VAEs use a metric loss to govern the latent space more explicitly. 
Lin et We look to discover the potential of applying components of the al. ([LDD+ 18]) model the intra-class invariance (i.e. the class- VAE methodology to DML systems. We test this through present- related information of a data point) and intra-class variance (i.e. ing incremental modifications to the basic DML architecture. Each the distinct features of a data point not unique to it’s class) modified architecture corresponds to an improvement proposal seperately. Like several other models in this section, this paper’s about how a specific part of the VAE training regime and loss proposed model incorporates a metric loss term for the latent function may be adapted to assist the performance of a DML vectors representing intra-class invariance and the latent vectors method for a semi-supervised dataset. representing both intra-class invariance and intra-class variance. The general method we will take for creating modified DML Kulkarni et al. ([KCJ20]) incorporate labelled information into models involves extending the training regimen to two phases, the VAE methodology in two ways. First, a modified architecture a supervised and unsupervised phase. In the supervised phase the called the CVAE is used in which the encoder and generator of the modified DML model behaves identically to the base DML model, VAE is not only conditioned on the input X and latent vector z, training on the same metric loss function. In the unsupervised respectively, but also on the label Y . The CVAE was introduced in phase, the DML model will train against an unsupervised loss previous papers ([SLY15]) ([DCGO19]). Second, the authors add inspired by the VAE. This may require extra steps to be added a metric loss, specifically a multi-class N-pair loss ([Soh16]), in to the DML architecture. In the pseudocode, s refers to boolean the overall loss function of the model. While it is unclear how the variable representing if the current phase is supervised. α is a CVAE technique would be adapted in a semi-supervised setting, hyperparameter which modulates the impact of the unsupervised as there is not a label Y associated with each datapoint X, we on total loss for the DML autoencoder. VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 233 Improvement Proposal 1 distribution instead of a point will allow us to calculate the KL divergence. We first look to evaluate the improvement proposal that adding In practice, we will be evaluating a DML model with a unit a reconstruction loss to a DML system can improve the quality prior and a DML model with a mixture of gaussians (GMM) prior. of clustering in the latent representations on a semi-supervised The latter model constructs the prior as a mixture of n gaussians – dataset. Reconstruction loss in and of itself enforces a similar each the vertice of the unit (i.e. each side is 2 units long) hypercube semantic mapping onto the latent space as a metric loss, but can in the latent space. The logvar of each component is set equal to be computed without labelled data. In theory, we believe that the one. Constructing the prior in this way is beneficial in that it is added constraint that the latent vector must be reconstructed to ensured that each component is evenly spaced within the latent approximate the original output will train the spatial positioning space, but is limiting in that there must be exactly 2d components to reflect semantic information. Following this reasoning, obser- in the GMM prior. 
Thus, to test, we will test a dataset with 10 vations which share similar semantic information, specifically classes on the latent space dimensionality of 4, such that there observations of the same class (even if not labelled as such), are 24 = 16 gaussian components in the GMM prior. Though the should intuitively be positioned nearby within the latent space. To number of prior components is greater than the number of classes, test if this intuition occurs in practice, we evaluate if a DML model the latent mapping may still exhibit the pattern of classes forming with an autoencoder structure and reconstruction loss (described in clusters around the prior components as the extra components may further detail below) will perform better than a plain DML model be made redundant. in terms of clustering quality. This will be especially evident for The drawback of the decision to set the GMM components’ semi-supervised datasets in which the amount of labelled data is means to the coordinates of the unit hypercube’s vertices is that not feasible for solely supervised DML. the manifold of the chosen dataset may not necessarily exist in 4 Given a semi-supervised dataset, we assume a standard DML dimensions. Choosing gaussian components from a d-dimensional system will use only the labelled data and train given a metric loss hypersphere in the latent space R d would solve this issue, but Lmetric (see Algorithm 1). Our modified model DML Autoencoder there does not appear to be a solution for choosing n evenly spaced will extend the DML model’s training regime by adding a decoder points spanning d dimensions on a d-dimensional hypersphere. network which takes the latent point z as input and produces an KL Divergence is calculated with a monte carlo approximation output x̂. The unsupervised loss LU is equal to the reconstruction for the GMM and analytically with the unit prior. loss. Improvement Proposal 3 Improvement Proposal 2 The third improvement proposal we look to evaluate is that given a semi-supervised dataset, optimizing a DML model jointly Say we are aware that a dataset has n classes. It may be useful with a VAE on the VAE’s latent space will produce superior to encourage that there are n clusters in the latent space of a clustering than the DML model individually. The intuition behind DML model. This can be enforced by using a prior distribution this approach is that DML methods can learn from only supervised containing n many Gaussians. As we wish to measure only data and VAE methods can learn from only unsupervised data; the the affect of inducing bias on the representation without adding proposed methodology will optimize both tasks simultaneously to any complexity to the model, the prior distribution will not be learn from both supervised and unsupervised data. learnable (unlike VAE with VampPrior). By testing whether the The MetricVAE implementation we create jointly optimizes classes of points in the latent space are organized along the prior the VAE task and DML task on the VAE latent space. The components we can test whether bias can be induced using a unsupervised loss is set to the VAE loss. The implementation uses prior to constrain the latent space of a DML. By testing whether the VAE with VampPrior model instead of the vanilla VAE. clustering improves performance, we can evaluate whether this inductive bias is helpful. 
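The hypercube-vertex prior and the Monte Carlo KL estimate discussed above can be sketched as follows; this is an illustrative reconstruction using torch.distributions under stated assumptions (equal mixture weights and unit-variance components), not the authors' code.

import itertools
import torch
from torch import distributions as D

def hypercube_gmm_prior(d):
    """Equal-weight GMM with 2**d components at the vertices of the
    hypercube [-1, 1]**d; unit variance is assumed for this sketch."""
    vertices = torch.tensor(list(itertools.product([-1.0, 1.0], repeat=d)))
    mix = D.Categorical(torch.ones(len(vertices)))
    comp = D.Independent(D.Normal(vertices, torch.ones_like(vertices)), 1)
    return D.MixtureSameFamily(mix, comp)

def mc_kl(q_mu, q_logvar, prior, n_samples=64):
    """Monte Carlo estimate of KL(q || prior) for a diagonal-Gaussian posterior."""
    q = D.Independent(D.Normal(q_mu, (0.5 * q_logvar).exp()), 1)
    z = q.rsample((n_samples,))
    return (q.log_prob(z) - prior.log_prob(z)).mean()

prior = hypercube_gmm_prior(d=4)   # 2**4 = 16 components for a 10-class dataset
q_mu = torch.zeros(8, 4)           # a batch of 8 approximate posteriors
q_logvar = torch.zeros(8, 4)
print(mc_kl(q_mu, q_logvar, prior))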
Results Given a fully supervised dataset, we assume a standard DML system will use only the labelled data and train given a metric loss Experimental Configuration Lmetric . Our modified model will extend the DML system’s training Each set of experiments shares a similar hyperparameter search regime by setting the unsupervised loss to a KL divergence term space. Below we describe the hyperparameters that are included that measures the difference between posterior distributions and in the search space of each experiment and the evaluation method. a prior distribution. It should also be noted that, like the VAE Learning Rate (lr): Through informal experimentation, we encoder, we will map the input not to a latent point but to a have found that the learning rate of 0.001 causes the models to latent distribution. The latent point is stochastically sampled from converge consistently (relative to 0.005 and 0.0005). The learning the latent distribution during training. Mapping the input to a rate is thus set to 0.001 in each experiment. 234 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 235 Latent Space Dimensionality (lsdim): Latent space dimen- ([YSW+ 21]). The MNIST and OrganAMNIST datasets are similar sionality refers to the dimensionality of the vector output of the in dimensionality (1 x 28 x 28), number of samples (60,000 and encoder of a DML network or the dimensionality of the posterior 58,850, respectively) and in that they are both greyscale. distribution of a VAE (also the dimensionality of the latent space). Evaluation: We evaluate the results by running each model When the latent space dimensionality is 2, we see the added benefit on a test partition of data. We then take the latent points Z of creating plots of the latent representations (though we can generated by the model and the corresponding labels Y . Three accomplish this through using dimensionality reduction methods classifiers (sklearn’s implementation of RandomForest, MLP, and like tSNE for higher dimensionalities as well). Example values for kNN) each output predicted labels Ŷ for the latent points. In this hyperparameter used in experiments are 2, 4, and 10. most of the charts shown, however, we only include the kNN Alpha: Alpha (α) is a hyperapameter which refers to the classification output due to space constraints and the lack of balance between the unsupervised and supervised losses of some meaningful difference between the output for each classifier. We of the modified DML models. More details about the role of α finally measure the quality of the predicted labels Ŷ using the in the model implementations are discussed in the methodology Adjusted Mutual Information Score (AMI) ([?]) and accuracy section of the model. Potential values for alpha are each between (which is still helpful but is also easier to interpret in some cases). 0 (exclusive) and 1 (inclusive). We do not include 0 in this set as if This scoring metric is common in research that looks to evaluate α is set to 0, the model is equivalent to the fully supervised plain clustering performance ([ZG21]) ([EKGB16]). We will be using DML model because the supervised loss would not be included. If sklearn’s implementation of AMI ([PVG+ 11]). The performance α is set to 1, then the model would train on only the unsupervised of a classifier on the latent points intuitively can be used as a loss; for instance if the DML Autoencoder had α set to 1, then the measure of quality of clustering. 
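The evaluation loop described above can be summarized in a short sketch; the latent points here are synthetic stand-ins, and while the kNN classifier, accuracy, and adjusted mutual information calls use sklearn as cited, the exact train/test handling in the paper may differ.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, adjusted_mutual_info_score

# Synthetic stand-ins for the latent points Z and labels Y produced by a model.
rng = np.random.default_rng(0)
Y = rng.integers(0, 10, size=2000)
Z = rng.normal(size=(2000, 4)) + Y[:, None]   # class-dependent offsets

Z_train, Z_test, Y_train, Y_test = train_test_split(Z, Y, random_state=0)
Y_hat = KNeighborsClassifier().fit(Z_train, Y_train).predict(Z_test)

print("accuracy:", accuracy_score(Y_test, Y_hat))
print("AMI:", adjusted_mutual_info_score(Y_test, Y_hat))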
model would be equivalent to an autoencoder. Partial Labels Percentage (pl%): The partial labels per- Improvement Proposal 1 Results: Benefits of Reconstruction Loss centage hyperparameter refers to the percentage of the dataset that In evaluating the first improvement proposal, we compare the is labelled and thus the size of the partion of the dataset that can performance of the plain DML model to the DML Autoencoder be used for labelled training. Of course, each of the datasets we model. We do so by comparing the performance of the plain use is fully labelled, so a partially labelled datset can be trivially DML system and the DML Autoencoder across a search space constructed by ignoring some of the labels. As the sizes of the containing the lsdim, alpha, and pl% hyperparameters and both dataset vary, each percentage can refer to a different number of datasets. labelled samples. Values for the partial label percentage we use In Table 1 and Table 2, we observe that for relatively small across experiments include 0.01, 0.1, and 10 (with each value amounts of labelled samples (the partial labels percentages of 0.01 referring to the percentage). and 0.1 correspond to 6 and 60 labelled samples respectively), Datasets: Two datasets are used for evaluating the models. the DML Autoencoder severely outperforms the DML model. The first dataset is MNIST ([LC10]), a very popular dataset However, when the number of labelled samples increases (the in machine learning containing greyscale images of handwritten partial labels percentage of 10 correspond to 6000 labelled sam- digits. The second dataset we use is the organ OrganAMNIST ples respectively), the DML model significantly outperforms the dataset from MedMNIST v2 ([YSW+ 21]). This dataset contains DML Autoencoder. This trend is not too surprising, as when there 2D slices from computed tomography images from the Liver is sufficient data to train unsupervised methods and insufficient Tumor Segmentation Benchmark – the labels correspond to the data to train supervised method, as is the case for the 0.01 and classification of 11 different body organs. The decision to use 0.1 partial label percentages, the unsupervised method will likely a second dataset was motivated because as the improvement perform better. proposals are tested over more datasets, the results supporting the The data looks to show that adding a reconstruction loss to a improvement proposals become more generalizable. The decision DML system can improve the quality of clustering in the latent to use the OrganAMNIST dataset specifically is motivated in representations on a semi-supervised dataset when there are small part due to the Quinn Research Group working on similar tasks amounts (roughly less than 100 samples) of labelled data and a for biomedical imaging ([ZRS+ 20]). It is also motivated in part sufficient quantity of unlabelled data. But an important caveat is because OrganAMNIST is a more difficult dataset, at least for that it is not convincing that the DML Autoencoder effectively the classfication task, as the leading accuracy for MNIST is .9991 combined the unsupervised and supervised losses to create a ([ALP+ 20]) while the leading accuracy for OrganAMNIST is .951 superior model, as a plain autoencoder (i.e. the DML Autoencoder 236 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 
1: Sample images from the MNIST (left) and OrganAMNIST of MedMNIST (right) datasets with α = 1) outperforms the DML for the partial labels percentage routine of alternating between supervised loss (in this case, metric of or less than 0.1% and underperforms the DML for the partial loss) and unsupervised (in this case, VAE loss) is not optimal for labels percentage of 10%. training the model. We have trained a seperate combined VAE and DML model Improvement Proposal 2 Results: Incorporating Inductive Bias with which trains on both the unsupervised and supervised loss each a Prior epoch instead of alternating between the two each epoch. In the In evaluating the second improvement proposal, we compare the results for this model, we see that an alpha value of over zero performance of the plain DML model to the DML with a unit prior (i.e. incorporating both the supervised metric loss into the overall and a DML with a GMM prior. The DML prior with the GMM MVAE loss function) can help improve performance especially prior will have 2^2 = 4 gaussian components when lsdim = 2 and among lower dimensionalities. Given our analysis of the data, we 2^4 = 16 components when lsdim = 4. Our broad intention is to see that incorporating the DML loss to the VAE is potentially see if changing the shape (specifically the number of components) helpful, but only when training the unsupervised and supervised of the prior can induce bias by affecting the pattern of embeddings. losses jointly. Even in that case, it is unclear whether the MVAE We hypothesize that when the GMM prior contains n components performs better than the corresponding DML model even if it does and n is slightly greater than or equal to the number of classes, perform better than the corresponding VAE model. each class will cluster around one of the prior components. We will test this for the GMM prior with 16 components (lsdim = 4) as Conclusion both the MNIST and MedMNIST datasets have 10 classes. We are unable to set the number of GMM components to 10 as our GMM Conclusion sampling method only allows for the number of components to In this work, we have set out to determine how DML can be equal a power of 2. Bseline models include a plain DML and a extended for semi-supervised datasets by borrowing components DML with a unit prior (the distribution N(0, 1)). of the variational autoencoder. We have formalized this approach In Table 3, it is very evident that across both datasets, the DML through defining three specific improvement proposals. To evalu- models with any prior distribution all devolve to the null model ate each improvement proposal, we have created several variations (i.e. the classifier is no better than random selection). From the of the DML model, such as the DML Autoencoder, DML with visualizations of the latent embeddings, we see that the embedded Unit/GMM Prior, and MVAE. We then tested the performance data for the DML models with priors appears completely random. of the models across several semi-supervised partitions of two In the case of the GMM prior, it also does not appear to take on the datasets, along with other configurations of hyperparameters. shape of the prior or reflect the number of components in the prior. We have determined from the analysis of our results, there This may be due to the training routine of the DML models. As is too much dissenting data to clearly accept any three of the the KL divergence loss, which can be said to "fit" the embeddings improvement proposals. 
For improvement proposal 1, while the to the prior, trains on alternating epochs with the supervised DML DML Autoencoder outperforms the DML for semisupervised loss, it is possible that the two losses are not balanced correctly datasets with small amounts of labelled data, it’s peformance is not during the training process. From the discussed results, it is fair consistently much better than that of a plain autoencoder which to state that adding a prior distribution to a DML model through uses no labelled data. For improvement proposal 2, each of the training the model on the KL divergence between the prior and DML models with an added prior performed extremely poorly, approximated posterior distributions on alternating epochs does is near or at the level of the null model. For improvement proposal not an effective way to induce bias in the latent space. 3, we see the same extremely poor performance from the MVAE models. Improvement Proposal 3 Results: Jointly Optimizing DML with VAE From the results in improvement proposals 1 and 3, we find To evaluate the third improvement proposal, we compare the that there may be potential in incorporating the autoencoder and performance of DMLs to MetricVAEs (defined in the previous VAE loss terms into DML systems. However, we were unable to chapter) across several metric losses. We run experiments for show that any of these improvement proposals would consistently triplet loss, supervised loss, and center loss DML and MetricVAE outperform the both the DML and fully unsupervised architectures models. To evaluate the improvement proposal, we will assess in semisupervised settings. We also found that the training routine whether the model performance improves for the MetricVAE over used for the improvement proposals, in which the loss function the DML for the same metric loss and other hyper parameters. would alternate between supervised and unsupervised losses each Like the previous improvement proposal, the proposed Metric- epoch, was not effective. This is especially evident in comparing VAE model does not perform better than the null model. As with the two combined VAE DML models for improvement proposal improvement proposal 2, it is possible this is because the training 3. VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 237 Fig. 2: Table 1: Comparison of the DML (left) and DML Autoencoder (right) models for the MNIST dataset. Bolded values indicate best performance for each partial labels percentage partition (pl%). Fig. 3: Table 2: Comparison of the DML (left) and DML Autoencoder (right) models for the MEDMNIST dataset.. Future Work R EFERENCES In the future, it would be worthwhile to evaluate these improve- [AHS20] Georgios Arvanitidis, Søren Hauberg, and Bernhard Schölkopf. Geometrically enriched latent spaces. arXiv preprint ment proposals using a different training routine. We have stated arXiv:2008.00565, 2020. doi:10.48550/arXiv.2008. previously that perhaps the extremely poor performance of the 00565. DML with a prior and MVAE models may be due to alternating [ALP+ 20] Sanghyeon An, Min Jun Lee, Sanglee Park, Heerin Yang, and on training against a supervised and unsupervised loss. Further Jungmin So. An ensemble of simple convolutional neural network models for MNIST digit recognition. CoRR, abs/2008.10400, research could look to develop or compare several different 2020. URL: https://arxiv.org/abs/2008.10400, arXiv:2008. training routines. One alternative would be alternating between 10400, doi:10.48550/arXiv.2008.10400. 
losses at each batch instead of each epoch. Another alternative, [ARJM18] Ali Arshad, Saman Riaz, Licheng Jiao, and Aparna Murthy. specifically for the MVAE, may be first training DML on labelled Semi-supervised deep fuzzy c-mean clustering for software fault prediction. IEEE Access, 6:25675–25685, 2018. doi:10. data, training a GMM on it’s outputs, and then using the GMM as 1109/ACCESS.2018.2835304. the prior distribution for the VAE. [BBM04] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrat- ing constraints and metric learning in semi-supervised clustering. Another potentially interesting avenue for future study is in In Proceedings of the twenty-first international conference on investigating a fourth improvement proposal: the ability to define Machine learning, page 11, 2004. doi:10.1145/1015330. a Riemannian metric on the latent space. Previous research has 1015360. shown a Riemannian metric can be computed on the latent space [BS09] Mahdieh Soleymani Baghshah and Saeed Bagheri Shouraki. of the VAE by computing the pull-back metric of the VAE’s Semi-supervised metric learning using pairwise constraints. In Twenty-First International Joint Conference on Artificial Intelli- decoder function ([AHS20]). Through the Riemannian metric we gence, 2009. could calculate metric losses such as triplet loss with a geodesic [DCGO19] Sara Dahmani, Vincent Colotte, Valérian Girard, and Slim Ouni. instead of euclidean distance. The geodesic distance may be a Conditional variational auto-encoder for text-driven expressive more accurate representation of similarity in the latent space than audiovisual speech synthesis. In INTERSPEECH 2019-20th Annual Conference of the International Speech Communication euclidean distance as it accounts for the structure of the input Association, 2019. doi:10.21437/interspeech.2019- data. 2848. 238 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 4: Table 3: Comparison of the DML model (left) and the DML with prior models with a unit gaussian prior (center) and GMM prior (right) models for the MNIST dataset. Fig. 5: Comparison of latent spaces for DML with unit prior (left) and DML with GMM prior containing 4 components (right) for lsdim = 2 on OrganAMNIST dataset. The gaussian components are shown as black with the raidus equal to variance (1). There appears to be no evidence of the distinct gaussian components in the latent space on the right. It does appear that the unit prior may regularize the magnitude of the latent vectors Fig. 6: Graph of reconstruction loss (componenet of unsupervised loss) of MVAE across epochs. The unsupervised loss does not converge despite being trained on each epoch. Fig. 7: Table 4: Experiments performed on MVAE architecture across fully labelled MNIST dataset that trains on objective function L = LU +γ ∗LS on fully supervised dataset. The best results for the classification accuracy on the MVAE embeddings in a given latent-dimensionality are bolded. VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED DEEP METRIC LEARNING 239 [DHS21] Ujjal Kr Dutta, Mehrtash Harandi, and Chellu Chandra Sekhar. Semi-supervised metric learning: A deep resurrection. 2021. doi:10.48550/arXiv.2105.05061. [EKGB16] Scott Emmons, Stephen Kobourov, Mike Gallant, and Katy Börner. Analysis of network clustering algorithms and clus- ter quality metrics at scale. PloS one, 11(7):e0159161, 2016. doi:10.1371/journal.pone.0159161. 
[GTM+ 21] Antoine Grosnit, Rasul Tutunov, Alexandre Max Maraval, Ryan- Rhys Griffiths, Alexander I Cowen-Rivers, Lin Yang, Lin Zhu, Wenlong Lyu, Zhitang Chen, Jun Wang, et al. High-dimensional bayesian optimisation with variational autoencoders and deep metric learning. arXiv preprint arXiv:2106.03609, 2021. doi: 10.48550/arXiv.2106.03609. [KCJ20] Ajinkya Kulkarni, Vincent Colotte, and Denis Jouvet. Deep variational metric learning for transfer of expressivity in multi- speaker text to speech. In International Conference on Statistical Language and Speech Processing, pages 157–168. Springer, 2020. doi:10.1007/978-3-030-59430-5_13. [LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL: http://yann.lecun.com/exdb/mnist/ [cited 2016-01-14 14:24:11]. [LDD+ 18] Xudong Lin, Yueqi Duan, Qiyuan Dong, Jiwen Lu, and Jie Zhou. Deep variational metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 689–704, 2018. doi:10.1007/978-3-030-01267-0_42. [LYZ+ 19] Xiaocui Li, Hongzhi Yin, Ke Zhou, Hongxu Chen, Shazia Sadiq, and Xiaofang Zhou. Semi-supervised clustering with deep metric learning. In International Conference on Database Systems for Advanced Applications, pages 383–386. Springer, 2019. doi: 10.1007/978-3-030-18590-9_50. [PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [RHD+ 19] Yazhou Ren, Kangrong Hu, Xinyi Dai, Lili Pan, Steven CH Hoi, and Zenglin Xu. Semi-supervised deep embedded clustering. Neu- rocomputing, 325:121–130, 2019. doi:10.1016/j.neucom. 2018.10.016. [SKP15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clus- tering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. doi: 10.1109/cvpr.2015.7298682. [SLY15] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning struc- tured output representation using deep conditional generative models. Advances in neural information processing systems, 28:3483–3491, 2015. [Soh16] Kihyuk Sohn. Improved deep metric learning with multi-class n- pair loss objective. In Advances in neural information processing systems, pages 1857–1865, 2016. [WFZ20] Sanyou Wu, Xingdong Feng, and Fan Zhou. Metric learning by similarity network for deep semi-supervised learning. In Developments of Artificial Intelligence Technologies in Compu- tation and Robotics: Proceedings of the 14th International FLINS Conference (FLINS 2020), pages 995–1002. World Scientific, 2020. doi:10.1142/9789811223334_0120. [WYF13] Qianying Wang, Pong C Yuen, and Guocan Feng. Semi- supervised metric learning via topology preserving multiple semi- supervised assumptions. Pattern Recognition, 46(9):2576–2587, 2013. doi:10.1016/j.patcog.2013.02.015. [YSW+ 21] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. arXiv preprint arXiv:2110.14795, 2021. doi:10.48550/arXiv.2110.14795. [ZG21] Zhen Zhu and Yuan Gao. Finding cross-border collaborative centres in biopharma patent networks: A clustering comparison approach based on adjusted mutual information. 
In International Conference on Complex Networks and Their Applications, pages 62–72. Springer, 2021. doi:10.1007/978-3-030-93409- 5_6. [ZRS+ 20] Meekail Zain, Sonia Rao, Nathan Safir, Quinn Wyner, Isabella Humphrey, Alexa Eldridge, Chenxiao Li, BahaaEddin AlAila, and Shannon P. Quinn. Towards an unsupervised spatiotemporal representation of cilia video using a modular generative pipeline. 2020. doi:10.25080/majora-342d178e-017. 240 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) A Python Pipeline for Rapid Application Development (RAD) Scott D. Christensen‡∗ , Marvin S. Brown‡ , Robert B. Haehnel‡ , Joshua Q. Church‡ , Amanda Catlett‡ , Dallon C. Schofield‡ , Quyen T. Brannon‡ , Stacy T. Smith‡ F Abstract—Rapid Application Development (RAD) is the ability to rapidly pro- Python ecosystem provides a rich set of tools that can be applied to totype an interactive interface through frequent feedback, so that it can be various data sources to provide valuable insights. These insitghts quickly deployed and delivered to stakeholders and customers. RAD is a critical can be integrated into decision support systems that can enhance capability needed to meet the ever-evolving demands in scientific research and the information available when making mission critical decisions. data science. To further this capability in the Python ecosystem, we have curated Yet, while the opportunities are vast, the ability to get the resources and developed a set of open-source tools, including Panel, Bokeh, and Tethys Platform. These tools enable prototyping interfaces in a Jupyter Notebook and necessary to pursue those opportunities requires effective and facilitate the progression of the interface into a fully-featured, deployable web- timely communication of the value and feasibility of a proposed application. project. We have found that rapid prototyping is a very impactful way Index Terms—web app, Panel, Tethys, Tethys Platform, Bokeh, Jupyter to concretely show the value that can be obtained from a proposal. Moreover, it also illustrates with clarity that the project is feasible and likely to succeed. Many scientific workflows are developed in Introduction Python, and often the prototyping phase is done in a Jupyter Note- With the tools for data science continually improving and an al- book. The Jupyter environment provides an easy way to quickly most innumerable supply of new data sources, there are seemingly modify code and visualize output. However, the visualizations are endless opportunities to create new insights and decision support interlaced with the code and thus it does not serve as an ideal way systems. Yet, an investment of resources are needed to extract demonstrate the prototype to stakeholders, that may not be familiar the value from data using new and improved tools. Well-timed with Jupyter Notebooks or code. The Jupyter Dashboard project and impactful proposals are necessary to gain the support and was addressing this issue before support for it was dropped in resources needed from stakeholders and decision makers to pursue 2017. To address this technical gap, we worked with the Holoviz these opportunities. The ability to rapidly prototype capabilities team to develop the Panel library. [Panel] Panel is a high-level and new ideas provides a powerful visual tool to communicate Python library for developing apps and dashboards. It enables the impact of a proposal. 
Interactive applications are even more building layouts with interactive widgets in a Jupyter Notebook impactful by engaging the user in the data analysis process. environment, but can then easily transition to serving the same After a prototype is implemented to communicate ideas and code on a standalone secure webserver. This capability enabled feasibility of a project, additional success is determined by the us to rapidly prototype workflows and dashboards that could be ability to produce the end product on time and within budget. directly accessed by potential sponsors. If the deployable product needs to be completely re-written using Panel makes prototyping and deploying simple. It can also different tools, programing languages, and/or frameworks from the be iterative. As new features are developed we can continue to prototype, then significantly more time and resources are required. work in the Jupyter Notebook environment and then seamlessly The ability to quickly mature a prototype to production-ready transition the new code to a deployed application. Since appli- application using the same tool stack can make the difference in cations continue to mature they often require production-level the success of a project. features. Panel apps are deployed via Bokeh, and the Bokeh framework lacks some aspects that are needed in some production Background applications (e.g. a user management system for authentication and permissions, and a database to persist data beyond a session). At the US Army Engineer Research and Development Center Bokeh doesn’t provide either of these aspects natively. (ERDC) there are evolving needs to support the missions of the Tethys Platform is a Django-based web framework that is US Army Corps of Engineers and our partners. The scientific geared toward making scientific web applications easier to de- velop by scientists and engineers. [Swain] It provides a Python * Corresponding author: Scott.D.Christensen@usace.army.mil ‡ US Army Engineer Research and Development Center Software Development Kit (SDK) that enables web apps to be created almost purely in Python, while still leaving the flexibility Copyright © 2022 Scott D. Christensen et al. This is an open-access article to add custom HTML, JavaScript, and CSS. Tethys provides distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, user management and role-based permissions control. It also provided the original author and source are credited. enables database persistence and computational job management A PYTHON PIPELINE FOR RAPID APPLICATION DEVELOPMENT (RAD) 241 [Christensen], in addition to many visualization tools. Tethys of- fers the power of a fully-featured web framework without the need to be an expert in full-stack web development. However, Tethys lacks the ease of prototyping in a Jupyter Notebook environment that is provided by Panel. To support both the rapid prototyping capability provided by Panel and the production-level features of Tethys Platform, we needed a pipeline that could take our Panel-based code and integrate it into the Tethys Platform framework. Through collaborations with the Bokeh development team and developers at Aquaveo, LLC, we were able to create that integration of Panel (Bokeh) and Tethys. This paper demonstrates the seamless pipeline that facilitates Rapid Application Development (RAD). 
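To make the prototype-then-deploy pattern described above concrete, here is a minimal sketch using the public Panel API. It is not code from the ERDC applications; the widget name, function name, and file name are placeholders.

    import panel as pn

    pn.extension()

    speed = pn.widgets.FloatSlider(name="Speed", start=0.0, end=100.0, value=50.0)

    def summarize(value):
        # Placeholder standing in for a call to a real model
        return f"Model summary at speed = {value:.1f}"

    app = pn.Column("# Prototype dashboard", speed, pn.bind(summarize, speed))

    app             # renders inline in a Jupyter Notebook during prototyping
    app.servable()  # the same object can be deployed with `panel serve app.py`

The same layout object works in both settings, which is the property that the pipeline described here relies on.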
In the next section we describe how the RAD pipeline is used at the ERDC for a particular use case, but first we will provide some background on the use case itself. Fig. 1: Collective Sweep Inputs Stage rendered in a Jupyter Notebook. Use Case Helios is a computational fluid dynamics (CFD) code for simulat- ing rotorcraft. It is very computationally demanding and requires High Performance Computing (HPC) resources to execute any- thing but the most basic of models. At the ERDC we often face a need to run parameter sweeps to determine the affects of varying a particular parameter (or set of parameters). Setting up a Helios model to run on the HPC is a somewhat involved process that requires file management and creating a script to submit the job to the queueing system. When executing a parameter sweep the process becomes even more cumbersome, and is often avoided. While tedeous to perform manually, the process of modifying input files, transferring to the HPC, and generating and submitting job scripts to the the HPC queueing system can be automated with Python. Furthermore, it can be made much more accessible, even to those without extensive knowledge of how Helios works, through a web-based interface. Methods To automate the process of submitting Helios model parameter sweeps to the HPC via a simple interactive web application Fig. 2: Collective Sweep Inputs Stage rendered as a stand-alone we developed and used the RAD pipeline. Initially three Helios Bokeh app. parameter sweep workflows were identified: 1) Collective Sweep API to execute commands on the login nodes of the DoD HPC 2) Speed Sweep systems. The PyUIT library provides a Python wrapper for the 3) Ensemble Analysis UIT+ REST API. Additionally, it provides Panel-based interfaces The process of submitting each of these workflows to the HPC for each of the workflow steps listed above. Panel refers to a was similar. They each involved the same basic steps: workflow comprised of a sequence of steps as a pipeline, and each step in the pipeline is called a stage. Thus, PyUIT provides a 1) Authentication to the HPC template stage class for each step in the basisc HPC workflow. 2) Connecting to a specific HPC system The PyUIT pipeline stages were customized to create inter- 3) Specifying the parameter sweep inputs faces for each of the three Helios workflows. Other than the 4) Submtting the job to the queuing system inputs stage, the rest of the stages are the same for each of the 5) Monitoring the job as it runs workflows (See figures 1, 2, and 3). The inputs stage allows the 6) Visualizing the results user to select a Helios input file and then provides inputs to allow In fact, these steps are essentially the same for any job being the user to specify the values for the parameter(s) that will be submitted to the HPC. To ensure that we were able to resuse varied in the sweep. Each of these stages was first created in a as much code as possible we created PyUIT, a generic, open- Jupyter Notebook. We were then able to deploy each workflow as source Python library that enables this workflow. The ability to a standalone Bokeh application. Finally we integrated the Panel- authenticate and connect to the DoD HPC systems is enabled based app into Tethys to leverage the compute job management by a service called User Interface Toolkit Plus (UIT+). [PyUIT] system and single-sign-on authentication. UIT+ provides an OAuth2 authentication service and a RESTful As additional features are required, we are able to leverage 242 PROC. 
OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 5: The Helios Tethys App is the framework for launching each of the three Panel-based Helios parameter sweep workflows. Fig. 3: Collective Sweep Inputs Stage rendered in the Helios Tethys App. the same pipeline: first developing the capability in a Jupyter Notebook, then testing with a Bokeh-served app, and finally, a full integration into Tethys. Results By integrating the Panel workflows into the Helios Tethys app we can take advantage of Tethys Platform features, such as the jobs table, which persists metadata about computational jobs in a database. Fig. 6: Actions associated with a job. The available actions depend on the job’s status. results to view. The pages that display the results are built with Panel, but Tethys enables them to be populated with information about the job from the database. Figure 7 shows the Tracking Data tab of the results viewer page. The plot is a dynamic Bokeh plot that enables the user to select the data to plot on each axis. This particular plot is showing the variation of the coeffient of drag of the fuselage body over the simulation time. Figure 8 shows what is called CoViz data, or data that is extracted from the solution as the model is running. This image is showing an isosurface colored by density. Fig. 4: Helios Tethys App home page showing a table of previously submitted Helios simulations. Conclusion The Helios Tethys App has demonstrated the value of the RAD pi- Each of the three workflows can be launched from the home pline, which enables both rapid prototyping and rapid progression page of the Helios Tethys app as shown in Figure 5. Although to production. This enables researchers to quickly communicate the home page was created in the Tethys framework, once the and prove ideas and deliver successful products on time. In workflows are launched the same Panel code that was previously addition to the Helios Tethys App, RAD has been instrumental developed is called to display the workflow (refer to figures 1, 2, for the mission success of various projects at the ERDC. and 3). From the Tethys Jobs Table different actions are available for each job including viewing results once the job has completed (see R EFERENCES 6). [Christensen] Christensen, S. D., Swain, N. R., Jones, N. L., Nelson, E. View job results is much more natural in the Tethys app. Helios J., Snow, A. D., & Dolder, H. G. (2017). A Comprehensive jobs often take multiple days to complete. By embedding the Python Toolkit for Accessing High-Throughput Computing to Support Large Hydrologic Modeling Tasks. JAWRA Journal Helios Panel workflows in Tethys users can leave the web app of the American Water Resources Association, 53(2), 333-343. (ending their session), and then come back later and pull up the https://doi.org/10.1111/1752-1688.12455 A PYTHON PIPELINE FOR RAPID APPLICATION DEVELOPMENT (RAD) 243 Fig. 7: Timeseries output associated with a Helios Speed Sweep run. Fig. 8: Isosurface visualization from a Helios Speed Sweep run. [Panel] https://www.panel.org [PyUIT] https://github.com/erdc/pyuit [Swain] Swain, N. R., Christensen, S. D., Snow, A. D., Dolder, H., Espinoza-Dávalos, G., Goharian, E., Jones, N. L., Ames, D.P., & Burian, S. J. (2016). A new open source platform for lowering the barrier for environmental web app development. Environmental Modelling & Software, 85, 11-26. https://doi. org/10.1016/j.envsoft.2016.08.003 244 PROC. OF THE 21st PYTHON IN SCIENCE CONF. 
Monaco: A Monte Carlo Library for Performing Uncertainty and Sensitivity Analyses

W. Scott Shambaugh (Corresponding author: wsshambaugh@gmail.com)

Abstract—This paper introduces monaco, a Python library for conducting Monte Carlo simulations of computational models, and performing uncertainty analysis (UA) and sensitivity analysis (SA) on the results. UA and SA are critical to effective and responsible use of models in science, engineering, and public policy, however their use is uncommon. By providing a simple, general, and rigorous-by-default library that wraps around existing models, monaco makes UA and SA easy and accessible to practitioners with a basic knowledge of statistics.

Index Terms—Monte Carlo, Modeling, Uncertainty Quantification, Uncertainty Analysis, Sensitivity Analysis, Decision-Making, Ensemble Prediction, VARS, D-VARS

Copyright © 2022 W. Scott Shambaugh. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Computational models form the backbone of decision-making processes in science, engineering, and public policy. However, our increased reliance on these models stands in contrast to the difficulty in understanding them as we add increasing complexity to try and capture ever more of the fine details of real-world interactions. Practitioners will often take the results of their large, complex model as a point estimate, with no knowledge of how uncertain those results are [FST16]. Multiple-scenario modeling (e.g. looking at a worst-case, most-likely, and best-case scenario) is an improvement, but a complete global exploration of the input space is needed. That gives insight into the overall distribution of results (UA) as well as the relative influence of the different input factors on the output variance (SA). This complete understanding is critical for effective and responsible use of models in any decision-making process, and policy papers have identified UA and SA as key modeling practices [ALMR20] [EPA09].

Despite the importance of UA and SA, recent literature reviews show that they are uncommon – in 2014 only 1.3% of all published papers [FST16] using modeling performed any SA. And even when performed, best practices are usually lacking – amongst papers which specifically claimed to perform sensitivity analysis, a 2019 review found only 21% performed global (as opposed to local or zero) UA, and 41% performed global SA [SAB+19].

Typically, UA and SA are done using Monte Carlo simulations, for reasons explored in the following section. There are Monte Carlo frameworks available, however existing options are largely domain-specific, focused on narrow sub-problems (i.e. integration), tailored towards training neural nets, or require a deep statistical background to use. See [OGA+20], [RJS+21], and [DSICJ20] for an overview of the currently available Python tools for performing UA and SA. For the domain expert who wants to perform UA and SA on their existing models, there is not an easy tool to do both in a single shot. monaco was written to address this gap.

Fig. 1: The monaco project logo.

Motivation for Monte Carlo Approach

Mathematical Grounding

Randomized Monte Carlo sampling offers a cure to the curse of dimensionality: consider an investigation of the output from k input factors y = f(x_1, x_2, ..., x_k), where each factor is uniformly sampled between 0 and 1, x_i ∈ U[0, 1]. The input space is then a k-dimensional hypercube with volume 1. If each input is varied one at a time (OAT), then the convex hull of the sampled points forms a hyperoctahedron with volume V = 1/k! (or, optimistically, a hypersphere with V = π^(k/2) / (2^k Γ(k/2 + 1))), both of which decrease super-exponentially as k increases. Unless the model is known to be linear, this leaves the input space wholly unexplored. In contrast, the volume of the convex hull of n → ∞ random samples, as is obtained with a Monte Carlo approach, will converge to V = 1, with much better coverage within that volume as well [DFM92]. See Fig. 2.

Fig. 2: Volume fraction V of a k-dimensional hypercube enclosed by the convex hull of n → ∞ random samples versus OAT samples along the principal axes of the input space.

Benefits and Drawbacks of Basic Monte Carlo Sampling

monaco focuses on forward uncertainty propagation with basic Monte Carlo sampling. This has several benefits:

• The method is conceptually simple, lowering the barrier of entry and increasing the ease of communicating results to a broader audience.
• The same sample points can be used for UA and SA. Generally, Bayesian methods such as Markov Chain Monte Carlo provide much faster convergence on UA quantities of interest, but their undersampling of regions that do not contribute to the desired quantities is inadequate for SA and complete exploration of the input space. The author's experience aligns with [SAB+19] in that there is great practical benefit in broad sampling without pigeonholing one's purview to particular posteriors, through uncovering bugs and edge cases in regions of input space that were not being previously considered.
• It can be applied to domains that are not data-rich. See for example NASA's use of Monte Carlo simulations during rocket design prior to collecting test flight data [HB10].

However, basic Monte Carlo sampling is subject to the classical drawbacks of the method, such as poor sampling of rare events and the slow σ/√n convergence on quantities of interest. If the outputs and regions of interest are firmly known at the outset, then other sampling methods will be more efficient [KTB13]. Additionally, given that any conclusions are conditional on the correctness of the underlying model and input parameters, the task of validation is critical to confidence in the UA and SA results. However, this is currently out of scope for the library and must be performed with other tools. In a data-poor domain, hypothesis testing or probabilistic prediction measures like loss scores can be used to anchor the outputs against a small number of real-life test data.

Fig. 3: Monte Carlo workflow for understanding the full behavior of a computational model, inspired by [SAB+19].

monaco Structure

Overall Structure

Broadly, each input factor and model output is a variable that can be thought of as lists (rows) containing the full range of randomized values. Cases are slices (columns) that take the i'th input and output value for each variable, and represent a single run of the model. Each case is run on its own, and the output values are collected into output variables. Fig. 4 shows a visual representation of this.
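As a quick numerical illustration of the OAT volume fractions from the Mathematical Grounding subsection above (a sketch, not part of the library), the following tabulates 1/k! and π^(k/2)/(2^k Γ(k/2 + 1)) for a few dimensions k.

    import math

    def oat_octahedron_fraction(k):
        # Hyperoctahedron spanned by OAT samples, as a fraction of the unit hypercube
        return 1.0 / math.factorial(k)

    def oat_sphere_fraction(k):
        # Optimistic hypersphere fraction: pi**(k/2) / (2**k * Gamma(k/2 + 1))
        return math.pi ** (k / 2) / (2 ** k * math.gamma(k / 2 + 1))

    for k in (2, 3, 5, 10, 20):
        print(f"k={k:2d}  octahedron={oat_octahedron_fraction(k):.3e}  "
              f"sphere={oat_sphere_fraction(k):.3e}")

Both fractions collapse toward zero rapidly with k, which is the argument made above for random sampling over OAT designs.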
More generally, the "inverse problem" of model and parameter validation is a deep field unto itself and [C+ 12] and [SLKW08] are recommended as overviews of some methods. If monaco’s scope is too limited for the reader’s needs, the author recommends UQpy [OGA+ 20] for UA and SA, and PyMC [SWF16] or Stan [CGH+ 17] as good general-purpose Fig. 4: Structure of a monaco simulation, showing the relationship probabilistic programming Python libraries. between the major objects and functions. This maps onto the central block in Fig. 3. Workflow UA and SA of any model follows a common workflow. Probability distributions for the model inputs are defined, and randomly Simulation Setup sampled values for a large number of cases are fed to the model. The base of a monaco simulation is the Sim object. This object The outputs from each case are collected and the full set of is formed by passing it a name, the number of random cases inputs and outputs can be analyzed. Typically, UA is performed ncases, and a dict fcns of the handles for three user-defined by generating histograms, scatter plots, and summary statistics for functions detailed in the next section. A random seed that then the output variables, and SA is performed by looking at the effect seeds the entire simulation can also be passed in here, and is of input on output variables through scatter plots, performing highly recommended for repeatability of results. regressions, and calculating sensitivity indices. These results can Input variables then need to be defined. monaco takes in the then be compared to real-world test data to validate the model or handle to any of scipy.stat’s continuous or discrete probability inform revisions to the model and input variables. See Fig. 3. distributions, as well as the required arguments for that probability Note that with model and input parameter validation currently distribution [VGO+ 20]. If nonnumeric inputs are desired, the outside monaco’s scope, closing that part of the workflow loop is method can also take in a nummap dictionary which maps the left up to the user. randomly drawn integers to values of other types. 246 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) At this point the sim can be run. The randomized drawing nonnumeric, a valmap dict assigning numbers to of input values, creation of cases, running of those cases, and each unique value is automatically generated. extraction of output values are automatically executed. 4) Calculate statistics & sensitivities for input & output User-Defined Functions variables. 5) Plot variables, their statistics, and sensitivities. The user needs to define three functions to wrap monaco’s Monte Carlo structure around their existing computational model. First Incorporating into Existing Workflows is a run function which either calls or directly implements their model. Second is a preprocess function which takes in a Case If the user wants to use existing workflows for generating, run- object, extracts the randomized inputs, and structures them with ning, post-processing, or examining results, any combination of any other invariant data to pass to the run function. Third is a monaco’s major steps can be replaced with external tooling by postprocess function which takes in a Case object as well as the saving and loading input and output variables to file. For example, results from the model, and extracts the desired output values. 
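To make the nonnumeric-input mechanism concrete, here is a hedged sketch. It assumes that addInVar accepts the nummap dictionary exactly as described above; apart from that assumption it reuses only calls shown in the paper (mc.Sim, addInVar, runSim, case.invals, case.addOutVal), and the function and variable names are hypothetical.

    import monaco as mc
    from scipy.stats import randint

    def mat_run(material):
        # Stand-in model: report the length of the material name
        return (len(material), )

    def mat_preprocess(case):
        return (case.invals['material'].val, )

    def mat_postprocess(case, name_length):
        case.addOutVal(name='Name Length', val=name_length)
        return None

    fcns = {'run': mat_run,
            'preprocess': mat_preprocess,
            'postprocess': mat_postprocess}
    sim = mc.Sim(name='Materials', ndraws=8, fcns=fcns, seed=12362398)

    # Integer draws 0, 1, 2 are mapped to nonnumeric values via the nummap dict
    sim.addInVar(name='material', dist=randint,
                 distkwargs={'low': 0, 'high': 3},
                 nummap={0: 'steel', 1: 'aluminum', 2: 'titanium'})
    sim.runSim()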
The Python call chain is:

    postprocess(case, *run(*preprocess(case)))

Or equivalently, expanding the Python star notation into pseudocode:

    siminput = (siminput1, siminput2, ...)
             = preprocess(case)
    simoutput = (simoutput1, simoutput2, ...)
              = run(*siminput)
              = run(siminput1, siminput2, ...)
    _ = postprocess(case, *simoutput)
      = postprocess(case, simoutput1, simoutput2, ...)

These three functions must be passed to the simulation in a dict with keys 'run', 'preprocess', and 'postprocess'. See the example code at the end of the paper for a simple worked example.

Examining Results

After running, users should generally do all of the following UA and SA tasks to get a full picture of the behavior of their computational model.

• Plot the results (UA & SA).
• Calculate statistics for input or output variables (UA).
• Calculate sensitivity indices to rank the importance of the input variables on the variance of the output variables (SA).
• Investigate specific cases with outlier or puzzling results.
• Save the results to file or pass them to other programs.

Data Flow

A summary of the process and data flow:

1) Instantiate a Sim object.
2) Add input variables to the sim with specified probability distributions.
3) Run the simulation. This executes the following:
   a) Random percentiles p_i ∈ U[0, 1] are drawn ndraws times for each of the input variables.
   b) These percentiles are transformed into random values via the inverse cumulative density function of the target probability distribution, x_i = F^-1(p_i).

For example, monaco can be used only for its parallel processing backend by importing existing randomly drawn input variables, running the simulation, and exporting the output variables for outside analysis. Or, it can be used only for its plotting and analysis capabilities by feeding it inputs and outputs generated elsewhere.

Resource Usage

Note that monaco's computational and storage overhead in creating easily-interrogatable objects for each variable, value, and case makes it an inefficient choice for computationally simple applications with high n, such as Monte Carlo integration. If the preprocessed sim input and raw output for each case (which for some models may dominate storage) is not retained, then the storage bottleneck will be the creation of a Val object for each case's input and output values, with minimum size 0.5 kB. The maximum n will be driven by the RAM of the host machine being capable of holding at least 0.5·n·(k_in + k_out) kB. On the computational bottleneck side, monaco is best suited for models where the model runtime dominates the random variate generation and the few hundred microseconds of dask.delayed task switching time.

Technical Features

Sampling Methods

Random sampling of the percentiles for each variable can be done using scipy's pseudo-random number generator (PRNG), or with any of the low-discrepancy methods from the scipy.stats.qmc quasi-Monte Carlo (QMC) module. QMC in general provides faster O(log(n)^k n^-1) convergence compared to the O(n^-1/2) convergence of random sampling [Caf98]. Available low-discrepancy options are regular or scrambled Sobol sequences, regular or scrambled Halton sequences, or Latin Hypercube Sampling. In general, the 'sobol_random' method that generates scrambled Sobol sequences [Sob67] [Owe20] is recommended in nearly all cases as the sequence with the fastest QMC convergence [CKK18], balanced integration properties as long as the number of cases is a power of 2, and a fairly flat frequency spectrum (though
sampling spectra are rarely a concern) [PCX+ 18]. See Fig. 5 for a c) If nonnumeric inputs are desired, the numbers are visual comparison of some of the options. converted to objects via a nummap dict. d) Case objects are created and populated with the Order Statistics, or, How Many Cases to Run? input values for each case. How many Monte Carlo cases should one run? One answer would e) Each case is run by structuring the inputs values be to choose n ≥ 2k with a sampling method that implements a with the preprocess function, passing them to (t,m,s) digital net (such as a Sobol or Halton sequence), which the run function, and collecting the output values guarantees that there will be at least one sample point in every with the postprocess function. hyperoctant of the input space [JK08]. This should be considered f) The output values are collected into output vari- a lower bound for SA, with the number of cases run being some ables and saved back to the sim. If the values are integer multiple of 2k . MONACO: A MONTE CARLO LIBRARY FOR PERFORMING UNCERTAINTY AND SENSITIVITY ANALYSES 247 Sensitivity Indices Sensitivity indices give a measure of the relationship between the variance of a scalar output variable to the variance of each of the input variables. In other words, they measure which of the input ranges have the largest effect on an output range. It is crucial that sensitivity indices are global rather than local measures – global sensitivity has the stronger theoretical grounding and there is no reason to rely on local measures in scenarios such as automated computer experiments where data can be easily and arbitrarily sampled [SRA+ 08] [PBPS22]. With computer-designed experiments, it is possible to con- struct a specially constructed sample set to directly calculate global sensitivity indices such as the Total-Order Sobol index [Sob01], or the IVARS100 index [RG16]. However, this special construction requires either sacrificing the desirable UA properties of low-discrepancy sampling, or conducting an additional Monte Carlo analysis of the model with a different sample set. For this reason, monaco uses the D-VARS approach to calculating global sensitivity indices, which allows for using a set of given data [SR20]. This is the first publically available implementation of the D-VARS algorithm. Fig. 5: 256 uniform and normal samples along with the 2D frequency Plotting spectra for PRNG random sampling (top), Sobol sampling (middle), monaco includes a plotting module that takes in input and output and scrambled Sobol sampling (bottom, default). variables and quickly creates histograms, empirical CDFs, scatter plots, or 2D or 3D "spaghetti plots" depending on what is most ap- propriate for each variable. Variable statistics and their confidence Along a similar vein, [DFM92] suggests that with random intervals are automatically shown on plots when applicable. sampling n ≥ 2.136k is sufficient to ensure that the volume fraction V approaches 1. The author hypothesizes that for a digital net, the Vector Data n ≥ λ k condition will be satisfied with some λ ≤ 2, and so n ≥ 2k will suffice for this condition to hold. However, these methods of If the values for an output variable are length s lists, NumPy choosing the number of cases may undersample for low k and be arrays, or Pandas dataframes, they are treated as timeseries with s infeasible for high k. steps. 
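The order-statistics reasoning about how many cases to run can be illustrated with a minimal, distribution-free calculation (a sketch, not monaco's built-in routine): the probability that the largest of n i.i.d. samples falls below the p-th population quantile is p^n, so requiring confidence c that it lies above that quantile gives n ≥ ln(1 − c)/ln(p).

    import math

    def min_cases_for_percentile_bound(p=0.99, confidence=0.95):
        # Smallest n such that max(n i.i.d. samples) exceeds the p-quantile
        # with the requested confidence: 1 - p**n >= confidence
        return math.ceil(math.log(1.0 - confidence) / math.log(p))

    for p, c in [(0.90, 0.90), (0.99, 0.90), (0.99, 0.95)]:
        print(f"p={p:.2f}, confidence={c:.2f}: n >= {min_cases_for_percentile_bound(p, c)}")

A bound of this kind is independent of the number of input dimensions k, which is the tractability argument made in this section.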
Variable statistics for these variables are calculated on the A rigorous way of choosing the number of cases is to first ensemble of values at each step, giving time-varying statistics. choose a statistical interval (e.g. a confidence interval for a The plotting module will automatically plot size (1, s) arrays percentile, or a tolerance interval to contain a percent of the against the step number as 2-D lines, size (2, s) arrays as 2-D population), and then use order statistics to calculate the minimum parametric lines, and size (3, s) arrays as 3-D parametric lines. n required to obtain that result at a desired confidence level. This Parallel Processing approach is independent of k, making UA of high-dimensional models tractable. monaco implements order statistics routines monaco uses dask.distributed [Roc15] as a parallel processing for calculating these statistical intervals with a distribution-free backend, and supports preprocessing, running, and postprocessing approach that makes no assumptions about the normality or other cases in a parallel arrangement. Users familiar with dask can shape characteristics of the output distribution. See Chapter 5 of extend the parallelization of their simulation from their single [HM91] for background. machine to a distributed cluster. A more qualitative UA method would simply be to choose a For simple simulations such as the example code at the end of reasonably high n (say, n = 210 ), manually examine the results to the paper, the overhead of setting up a dask server may outweigh ensure high-interest areas are not being undersampled, and rely the speedup from parallel computation, and in those cases monaco on bootstrapping of the desired variable statistics to obtain the also supports running single-threaded in a single for-loop. required confidence levels. The Median Case Variable Statistics A "nominal" run is often useful as a baseline to compare other For any input or output variable, a statistic can be calculated cases against. If desired, the user can set a flag to force the for the ensemble of values. monaco builds in some common first case to be the median 50th percentile draw of all the input statistics (mean, percentile, etc), or alternatively the user can variables prior to random sampling. pass in a custom one. To obtain a confidence interval for this statistic, the results are resampled with replacement using the Debugging Cases scipy.stats.bootstrap module. The number of bootstrap samples By default, all the raw results from each case’s simulation run is determined using an order statistic approach as outlined in the prior to postprocessing are saved to the corresponding Case object. previous section, and multiplying that number by a scaling factor Individual cases can be interrogated by looking at these raw (default 10x) for smoothness of results. results, or by indicating that their results should be highlighted 248 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) in plots. If some cases fail to run, monaco will mark them as fcns=fcns, seed=seed) incomplete and those specific cases can be rerun without requiring # Generate the input variables the full set of cases to be recomputed. A debug flag can be set to sim.addInVar(name='die1', dist=randint, not skip over failed cases and instead stop at a breakpoint or dump distkwargs={'low': 1, 'high': 6+1}) the stack trace on encountering an exception. 
sim.addInVar(name='die2', dist=randint, distkwargs={'low': 1, 'high': 6+1}) Saving and Loading to File # Run the Simulation The base Sim object and the Case objects can be serialized and sim.runSim() saved to or loaded from .mcsim and .mccase files respectively, The results of the simulation can then be analyzed and examined. which are stored in a results directory. The Case objects are saved Fig. 6 shows the plots this code generates. separately since the raw results from a run of the simulation # Calculate the mean and 5-95th percentile may be arbitrarily large, and the Sim object can be comparatively # statistics for the dice sum lightweight. Loading the Sim object from file will automatically sim.outvars['Sum'].addVarStat('mean') attempt to load the cases in the same directory, but can also stand sim.outvars['Sum'].addVarStat('percentile', {'p':[0.05, 0.95]}) alone if the raw results are not needed. Alternatively, the numerical representations for input and out- # Plots a histogram of the dice sum put variables can be saved to and loaded from .json or .csv files. mc.plot(sim.outvars['Sum']) This is useful for interfacing with external tooling, but discards # Creates a scatter plot of the sum vs the roll the metadata that would be present by saving to monaco’s native # number, showing randomness objects. mc.plot(sim.outvars['Sum'], sim.outvars['Roll Number']) Example # Calculate the sensitivity of the dice sum to # each of the input variables Presented here is a simple example showing a Monte Carlo sim.calcSensitivities('Sum') simulation of rolling two 6-sided dice and looking at their sum. sim.outvars['Sum'].plotSensitivities() The user starts with their run function which here directly implements their computational model. They must then create preprocess and postprocess functions to feed in the randomized input values and collect the outputs from that model. # The 'run' function, which implements the # existing computational model (or wraps it) def example_run(die1, die2): dicesum = die1 + die2 return (dicesum, ) # The 'preprocess' function grabs the random # input values for each case and structures it # with any other data in the format the 'run' # function expects def example_preprocess(case): die1 = case.invals['die1'].val die2 = case.invals['die2'].val return (die1, die2) # The 'postprocess' function takes the output # from the 'run' function and saves off the # outputs for each case def example_postprocess(case, dicesum): case.addOutVal(name='Sum', val=dicesum) case.addOutVal(name='Roll Number', val=case.ncase) return None The monaco simulation is initialized, given input variables with Fig. 6: Output from the example code which calculates the sum of two specified probability distributions (here a random integer between random dice rolls. The top plot shows a histogram of the 2-dice sum 1 and 6), and run. with the mean and 5–95th percentiles marked, the middle plot shows the randomness over the set of rolls, and the bottom plot shows that import monaco as mc from scipy.stats import randint each of the dice contributes 50% to the variance of the sum. # dict structure for the three input functions fcns = {'run' : example_run, Case Studies 'preprocess' : example_preprocess, 'postprocess': example_postprocess} These two case studies are toy models meant as illustrative of potential uses, and not of expertise or rigor in their respective # Initialize the simulation domains. 
Please see https://github.com/scottshambaugh/monaco/ ndraws = 1024 # Arbitrary for this example tree/main/examples for their source code as well as several more seed = 123456 # Recommended for repeatability Monte Carlo implementation examples across a range of domains sim = mc.Sim(name='Dice Roll', ndraws=ndraws, including financial modeling, pandemic spread, and integration. MONACO: A MONTE CARLO LIBRARY FOR PERFORMING UNCERTAINTY AND SENSITIVITY ANALYSES 249 Baseball The calculated win probabilities from this simulation are This case study models the trajectory of a baseball in flight 93.4% Democratic, 6.2% Republican, and 0.4% Tie. The 25–75th after being hit for varying speeds, angles, topspins, aerodynamic percentile range for the number of electoral votes for the Demo- conditions, and mass properties. From assumed initial conditions cratic candidate is 281–412, and the actual election result was 306 immediately after being hit, the physics of the ball’s ballistic flight electoral votes. See Fig. 8. are calculated over time until it hits the ground. Fig. 7 shows some plots of the results. A baseball team might use analyses like this to determine where outfielders should be placed to catch a ball for a hitter with known characteristics, or determine what aspect of a hit a batter should focus on to improve their home run potential. Fig. 8: Predicted electoral votes for the Democratic 2020 US Pres- idential candidate with the median and 25-75th percentile interval marked (top), and a map of the predicted Democratic win probability per state (bottom). Conclusion This paper has introduced the ideas underlying Monte Carlo analysis and discussed when it is appropriate to use for conducting UA and SA. It has shown how monaco implements a rigorous, parallel Monte Carlo process, and how to use it through a simple example and two case studies. This library is geared towards scientists, engineers, and policy analysts that have a computational model in their domain of expertise, enough statistical knowledge to define a probability distribution, and a desire to ensure their model will make accurate predictions of reality. The author hopes this tool will help contribute to easier and more widespread use of Fig. 7: 100 simulated baseball trajectories (top), and the relationship UA and SA in improved decision-making. between launch angle and landing distance (bottom). Home runs are highlighted in orange. Further Information monaco is available on PyPI as the package monaco, has API Election documentation at https://monaco.rtfd.io/, and is hosted on github This case study attempts to predict the result of the 2020 US at https://github.com/scottshambaugh/monaco/. presidential election, based on polling data from FiveThirtyEight 3 weeks prior to the election [Fiv20]. Each state independently casts a normally distributed percent- R EFERENCES age of votes for the Democratic, Republican, and Other candidates, [ALMR20] I Azzini, G Listorti, TA Mara, and R Rosati. Uncertainty and based on polling. Also assumed is a uniform ±3% national sensitivity analysis for policy decision making. An Introductory swing due to polling error which is applied to all states equally. Guide. Joint Research Centre, European Commission, Luxem- That summed percentage is then normalized so the total for all bourg, 2020. doi:10.2760/922129. candidates is 100%. The winner of each state’s election assigns [C+ 12] National Research Council et al. 
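For convenience, the dice-roll example of the previous sections reads as one listing when its scattered pieces are gathered together; nothing beyond the calls shown in the paper is used.

    import monaco as mc
    from scipy.stats import randint

    # The 'run' function implements (or wraps) the computational model
    def example_run(die1, die2):
        dicesum = die1 + die2
        return (dicesum, )

    # The 'preprocess' function grabs the random input values for each case
    def example_preprocess(case):
        die1 = case.invals['die1'].val
        die2 = case.invals['die2'].val
        return (die1, die2)

    # The 'postprocess' function saves off the outputs for each case
    def example_postprocess(case, dicesum):
        case.addOutVal(name='Sum', val=dicesum)
        case.addOutVal(name='Roll Number', val=case.ncase)
        return None

    # dict structure for the three input functions
    fcns = {'run'        : example_run,
            'preprocess' : example_preprocess,
            'postprocess': example_postprocess}

    # Initialize the simulation
    ndraws = 1024   # Arbitrary for this example
    seed = 123456   # Recommended for repeatability
    sim = mc.Sim(name='Dice Roll', ndraws=ndraws, fcns=fcns, seed=seed)

    # Generate the input variables
    sim.addInVar(name='die1', dist=randint, distkwargs={'low': 1, 'high': 6+1})
    sim.addInVar(name='die2', dist=randint, distkwargs={'low': 1, 'high': 6+1})

    # Run the simulation
    sim.runSim()

    # Calculate the mean and 5-95th percentile statistics for the dice sum
    sim.outvars['Sum'].addVarStat('mean')
    sim.outvars['Sum'].addVarStat('percentile', {'p': [0.05, 0.95]})

    # Plot a histogram of the dice sum, and the sum vs the roll number
    mc.plot(sim.outvars['Sum'])
    mc.plot(sim.outvars['Sum'], sim.outvars['Roll Number'])

    # Calculate the sensitivity of the dice sum to each input variable
    sim.calcSensitivities('Sum')
    sim.outvars['Sum'].plotSensitivities()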
Assessing the reliability of complex models: mathematical and statistical foundations of their electoral votes to that candidate, and the candidate that wins verification, validation, and uncertainty quantification. National at least 270 of the 538 electoral votes is the winner. Academies Press, 2012. doi:10.17226/13395. 250 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [Caf98] Russel E Caflisch. Monte carlo and quasi-monte carlo in science conference, volume 130, page 136. Citeseer, 2015. methods. Acta numerica, 7:1–49, 1998. doi:10.1017/ doi:10.25080/majora-7b98e3ed-013. S0962492900002804. [SAB+ 19] Andrea Saltelli, Ksenia Aleksankina, William Becker, Pamela [CGH+ 17] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Fennell, Federico Ferretti, Niels Holst, Sushan Li, and Qiongli Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Wu. Why so many published sensitivity analyses are false: A Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic systematic review of sensitivity analysis practices. Environmental programming language. Journal of statistical software, 76(1), modelling & software, 114:29–39, 2019. doi:10.1016/j. 2017. doi:10.18637/jss.v076.i01. envsoft.2019.01.012. [CKK18] Per Christensen, Andrew Kensler, and Charlie Kilpatrick. Pro- [SLKW08] Richard M Shiffrin, Michael D Lee, Woojae Kim, and Eric- gressive multi-jittered sample sequences. In Computer Graphics Jan Wagenmakers. A survey of model evaluation approaches Forum, volume 37, pages 21–33. Wiley Online Library, 2018. with a tutorial on hierarchical bayesian methods. Cog- doi:10.1111/cgf.13472. nitive Science, 32(8):1248–1284, 2008. doi:10.1080/ [DFM92] Martin E. Dyer, Zoltan Füredi, and Colin McDiarmid. Volumes 03640210802414826. spanned by random points in the hypercube. Random Struc- [Sob67] Ilya M Sobol. On the distribution of points in a cube and tures & Algorithms, 3(1):91–106, 1992. doi:10.1002/rsa. the approximate evaluation of integrals. Zhurnal Vychislitel’noi 3240030107. Matematiki i Matematicheskoi Fiziki, 7(4):784–802, 1967. doi: [DSICJ20] Dominique Douglas-Smith, Takuya Iwanaga, Barry F.W. Croke, 10.1016/0041-5553(67)90144-9. and Anthony J. Jakeman. Certain trends in uncertainty and [Sob01] Ilya M Sobol. Global sensitivity indices for nonlinear mathe- sensitivity analysis: An overview of software tools and tech- matical models and their monte carlo estimates. Mathematics niques. Environmental Modelling & Software, 124, 2020. doi: and computers in simulation, 55(1-3):271–280, 2001. doi: 10.1016/j.envsoft.2019.104588. 10.1016/s0378-4754(00)00270-6. [EPA09] US EPA. Guidance on the development, evaluation, and appli- [SR20] Razi Sheikholeslami and Saman Razavi. A fresh look at vari- cation of environmental models (epa/100/k-09/003), 2009. URL: ography: measuring dependence and possible sensitivities across https://nepis.epa.gov/Exe/ZyPDF.cgi?Dockey=P1003E4R.PDF. geophysical systems from any given data. Geophysical Re- [Fiv20] FiveThirtyEight. 2020 general election forecast - state topline search Letters, 47(20):e2020GL089829, 2020. doi:10.1029/ polls-plus data, October 2020. URL: https://github.com/ 2020gl089829. fivethirtyeight/data/tree/master/election-forecasts-2020. [SRA+ 08] Andrea Saltelli, Marco Ratto, Terry Andres, Francesca Campo- [FST16] Federico Ferretti, Andrea Saltelli, and Stefano Tarantola. Trends longo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, and in sensitivity analysis practice in the last decade. Science of Stefano Tarantola. 
Global sensitivity analysis: the primer. John the total environment, 568:666–670, 2016. doi:10.1016/j. Wiley & Sons, 2008. doi:10.1002/9780470725184. scitotenv.2016.02.133. [SWF16] John Salvatier, Thomas V Wiecki, and Christopher Fonnesbeck. [HB10] John Hanson and Bernard Beard. Applying monte carlo simu- Probabilistic programming in python using pymc3. PeerJ Com- lation to launch vehicle design and requirements verification. In puter Science, 2:e55, 2016. doi:10.7717/peerj-cs.55. AIAA Guidance, Navigation, and Control Conference. American [VGO+ 20] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haber- Institute of Aeronautics and Astronautics, 2010. doi:10.2514/ land, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu 6.2010-8433. Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Na- [HM91] Gerald J Hahn and William Q Meeker. Statistical intervals: a ture methods, 17(3):261–272, 2020. doi:10.14293/s2199- guide for practitioners. John Wiley & Sons, 1991. doi:10. 1006.1.sor-life.a7056644.v1.rysreg. 1002/9780470316771.ch5. [JK08] Stephen Joe and Frances Y Kuo. Constructing sobol sequences with better two-dimensional projections. SIAM Journal on Sci- entific Computing, 30(5):2635–2654, 2008. doi:10.1137/ 070709359. [KTB13] Dirk P Kroese, Thomas Taimre, and Zdravko I Botev. Handbook of monte carlo methods. John Wiley & Sons, 2013. doi:10. 1002/9781118014967. [OGA+ 20] Audrey Olivier, Dimitris G. Giovanis, B.S. Aakash, Mohit Chauhan, Lohit Vandanapu, and Michael D. Shields. Uqpy: A general purpose python package and development environment for uncertainty quantification. Journal of Computational Science, 47:101204, 2020. doi:10.1016/j.jocs.2020.101204. [Owe20] Art B Owen. On dropping the first sobol’point. arXiv preprint arXiv:2008.08051, 2020. doi:10.48550/arXiv. 2008.08051. [PBPS22] Arnald Puy, William Becker, Samuele Lo Piano, and An- drea Saltelli. A comprehensive comparison of total-order es- timators for global sensitivity analysis. International Journal for Uncertainty Quantification, 12(2), 2022. doi:int.j. uncertaintyquantification.2021038133. [PCX+ 18] Hélène Perrier, David Coeurjolly, Feng Xie, Matt Pharr, Pat Hanrahan, and Victor Ostromoukhov. Sequences with low- discrepancy blue-noise 2-d projections. In Computer Graphics Forum, volume 37, pages 339–353. Wiley Online Library, 2018. doi:10.1111/cgf.13366. [RG16] Saman Razavi and Hoshin V Gupta. A new framework for comprehensive, robust, and efficient global sensitivity analysis: 1. theory. Water Resources Research, 52(1):423–439, 2016. doi:10.1002/2015wr017558. [RJS+ 21] Saman Razavi, Anthony Jakeman, Andrea Saltelli, Clémentine Prieur, Bertrand Iooss, Emanuele Borgonovo, Elmar Plischke, Samuele Lo Piano, Takuya Iwanaga, William Becker, et al. The future of sensitivity analysis: An essential discipline for systems modeling and policy support. Environmental Modelling & Soft- ware, 137:104954, 2021. doi:10.1016/j.envsoft.2020. 104954. [Roc15] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th python PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) 251 Enabling Active Learning Pedagogy and Insight Mining with a Grammar of Model Analysis Zachary del Rosario‡∗ F Abstract—Modern engineering models are complex, with dozens of inputs, The fundamental issue underlying these criteria is a flawed uncertainties arising from simplifying assumptions, and dense output data. 
heuristic for uncertainty propagation; initial human subjects work While major strides have been made in the computational scalability of complex suggests that engineers’ tendency to misdiagnose sources of vari- models, relatively less attention has been paid to user-friendly, reusable tools to ability as inconsequential noise may contribute to the persistent explore and make sense of these models. Grama is a python package aimed at application of flawed design criteria [AFD+ 21]. These flawed supporting these activities. Grama is a grammar of model analysis: an ontology that specifies data (in tidy form), models (with quantified uncertainties), and treatments of uncertainty are not limited to engineering design; the verbs that connect these objects. This definition enables a reusable set recent work by Kahneman et al. [KSS21] highlights widespread of evaluation "verbs" that provide a consistent analysis toolkit across different failures to recognize or address variability in human judgment, grama models. This paper presents three case studies that illustrate pedagogy leading to bias in hiring, economic loss, and an unacceptably and engineering work with grama: 1. Providing teachable moments through capricious application of justice. errors for learners, 2. Providing reusable tools to help users self-initiate pro- Grama was originally developed to support model analysis un- ductive modeling behaviors, and 3. Enabling exploratory model analysis (EMA) der uncertainty; in particular, to enable active learning [FEM+ 14] – exploratory data analysis augmented with data generation. – a form of teaching characterized by active student engagement Index Terms—engineering, engineering education, exploratory model analysis, shown to be superior to lecture alone. This toolkit aims to integrate software design, uncertainty quantification the disciplinary perspectives of computational engineering and statistical analysis within a unified environment to support a coding to learn pedagogy [Bar16] – a teaching philosophy that Introduction uses code to teach a discipline, rather than as a means to teach Modern engineering relies on scientific computing. Computational computer science or coding itself. The design of grama is heavily advances enable faster analysis and design cycles by reducing inspired by the Tidyverse [WAB+ 19], an integrated set of R the need for physical experiments. For instance, finite-element packages organized around the ’tidy data’ concept [Wic14]. Grama analysis enables computational study of aerodynamic flutter, and uses the tidy data concept and introduces an analogous concepts Reynolds-averaged Navier-Stokes simulation supports the simu- for models. lation of jet engines. Both of these are enabling technologies that support the design of modern aircraft [KN05]. Modern ar- Grama: A Grammar of Model Analysis eas of computational research include heterogeneous computing environments [MV15], task-based parallelism [BTSA12], and big Grama [dR20] is an integrated set of tools for working with data data [SS13]. Another line of work considers the development of and models. Pandas [pdt20], [WM10] is used as the underlying integrated tools to unite diverse disciplinary perspectives in a sin- data class, while grama implements a Model class. 
A grama gle, unified environment (e.g., the integration of multiple physical model includes a number of functions – mathematical expressions phenomena in a single code [EVB+ 20] or the integration of a or simulations – and domain/distribution information for the de- computational solver and data analysis tools [MTW+ 22]). Such terministic/random inputs. The following code illustrates a simple integrated computational frameworks are highlighted as essential grama model with both deterministic and random inputs1 . for applications such as computational analysis and design of # Each cp_* function adds information to the model aircraft [SKA+ 14]. While engineering computation has advanced md_example = ( along the aforementioned axes, the conceptual understanding of gr.Model("An example model") # Overloaded `>>` provides pipe syntax practicing engineers has lagged in key areas. >> gr.cp_vec_function( Every aircraft you have ever flown on has been designed using fun=lambda df: gr.df_make(f=df.x+df.y+df.z), probabilistically-flawed, potentially dangerous criteria [dRFI21]. var=["x", "y", "z"], out=["f"], ) * Corresponding author: zdelrosario@olin.edu ‡ Assistant Professor of Engineering and Applied Statistics, Olin College of >> gr.cp_bounds(x=(-1, +1)) Engineering >> gr.cp_marginals( y=gr.marg_mom("norm", mean=0, sd=1), Copyright © 2022 Zachary del Rosario. This is an open-access article dis- z=gr.marg_mom("uniform", mean=0, sd=1), tributed under the terms of the Creative Commons Attribution License, which ) permits unrestricted use, distribution, and reproduction in any medium, pro- vided the original author and source are credited. 1. Throughout, import grama as gr is assumed. 252 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) >> gr.cp_copula_gaussian( df_corr=gr.df_make( var1="y", var2="z", corr=0.5, ) ) ) While an engineer’s interpretation of the term "model" focuses on the input-to-output mapping (the simulation), and a statistician’s interpretation of the term "model" focuses on a distribution, the grama model integrates both perspectives in a single model. Grama models are intended to be evaluated to generate data. The data can then be analyzed using visual and statistical means. Models can be composed to add more information, or fit to a dataset. Figure 1 illustrates this interplay between data and models in terms of the four categories of function "verbs" provided in Fig. 2: Input sweep generated from the code above. Each panel grama. visualizes the effect of changing a single input, with all other inputs held constant. >> gr.tf_filter(DF.sweep_var == "x") >> gr.ggplot(gr.aes("x", "f", group="sweep_ind")) + gr.geom_line() ) This system of defaults is important for pedagogical design: Introductory grama code can be made extremely simple when first Fig. 1: Verb categories in grama. These grama functions start with an introducing a concept. However, the defaults can be overridden identifying prefix, e.g. ev_* for evaluation verbs. to carry out sophisticated and targeted analyses. We will see in the Case Studies below how this concise syntax encourages sound analysis among students. Defaults for Concise Code Grama verbs are designed with sensible default arguments to Pedagogy Case Studies enable concise code. For instance, the following code visualizes input sweeps across its three inputs, similar to a ceteris paribus The following two case studies illustrate how grama is designed profile [KBB19], [Bie20]. to support pedagogy: the formal method and practice of teaching. 
In particular, grama is designed for an active learning pedagogy ( ## Concise default analysis [FEM+ 14], a style of teaching characterized by active student md_example engagement. >> gr.ev_sinews(df_det="swp") >> gr.pt_auto() Teachable Moments through Errors for Learners ) An advantage of a unified modeling environment like grama is This code uses the default number of sweeps and sweep density, the opportunity to introduce design errors for learners in order to and constructs a visualization of the results. The resulting plot is provide teachable moments. shown in Figure 2. It is common in probabilistic modeling to make problematic Grama imports the plotnine package for data visualization assumptions. For instance, Cullen and Frey [CF99] note that [HK21], both to provide an expressive grammar of graphics, but modelers frequently and erroneously treat the normal distribution also to implement a variety of "autoplot" routines. These are as a default choice for all unknown quantities. Another common called via a dispatcher gr.pt_auto() which uses metadata issue is to assume, by default, the independence of all random from evaluation verbs to construct a default visual. Combined inputs to a model. This is often done tacitly – with the indepen- with sensible defaults for keyword arguments, these tools provide dence assumption unstated. These assumptions are problematic, as a concise syntax even for sophisticated analyses. The same code they can adversely impact the validity of a probabilistic analysis can be slightly modified to change a default argument value, or to [dRFI21]. use plotnine to create a more tailored visual. To highlight the dependency issue for novice modelers, grama ( uses error messages to provide just-in-time feedback to a user md_example who does not articulate their modeling choices. For example, ## Override default parameters >> gr.ev_sinews(df_det="swp", n_sweeps=10) the following code builds a model with no dependency structure >> gr.pt_auto() specified. The result is an error message that summarizes the ) conceptual issue and points the user to a primer on random ( variable modeling. md_example md_flawed = ( >> gr.ev_sinews(df_det="swp") gr.Model("An example model") ## Construct a targeted plot >> gr.cp_vec_function( ENABLING ACTIVE LEARNING PEDAGOGY AND INSIGHT MINING WITH A GRAMMAR OF MODEL ANALYSIS 253 fun=lambda df: gr.df_make(f=df.x+df.y+df.z), data=data, var=["x", "y", "z"], columns=["f", "x", "y"], out=["f"], ) ) >> gr.cp_bounds(x=(-1, +1)) The ability to write low-level programming constructs – such >> gr.cp_marginals( as the loops above – is an obviously worthy learning outcome y=gr.marg_mom("norm", mean=0, sd=1), in a course on scientific computing. However, not all courses z=gr.marg_mom("uniform", mean=0, sd=1), ) should focus on low-level programming constructs. Grama is not ## NOTE: No dependency specified designed to support low-level learning outcomes; instead, the ) package is designed to support a "coding to learn" philosophy ( md_flawed [Bar16] focused on higher-order learning outcomes to support ## This code will throw an Error sound modeling practices. >> gr.ev_sample(n=1000, df_det="nom") Parameter sweep functionality can be achieved in grama ) without explicit loop management and with sensible defaults for the analysis parameters. This provides a "quick and dirty" tool Error ValueError: Present model copula must be de- to inspect a model’s behavior. A grama approach to parameter fined for sampling. 
Use CopulaIndependence only sweeps is shown below. when inputs can be guaranteed independent. See the ## Parameter sweep: Grama approach Documentation chapter on Random Variable Modeling # Gather model info for more information. https://py-grama.readthedocs.io/en/ md_gr = ( gr.Model() latest/source/rv_modeling.html >> gr.cp_vec_function( fun=lambda df: gr.df_make(f=df.x**2 * df.y), Grama is designed both as a teaching tool and a scientific var=["x", "y"], modeling toolkit. For the student, grama offers teachable moments out=["f"], to help the novice grow as a modeler. For the scientist, grama ) >> gr.cp_bounds( enforces practices that promote scientific reproducibility. x=(-1, +1), y=(-1, +1), Encouraging Sound Analysis ) ) As mentioned above, concise grama syntax is desirable to encour- # Generate data age sound analysis practices. Grama is designed to support higher- df_gr = gr.eval_sinews( level learning outcomes [Blo56]. For instance, rather than focusing md_gr, df_det="swp", on applying programming constructs to generate model results, n_sweeps=3, grama is intended to help users study model results ("evaluate," ) according to Bloom’s Taxonomy). Sound computational analysis Once a model is implemented in grama, generating and visualizing demands study of simulation results (e.g., to check for numerical a parameter sweep is trivial, requiring just two lines of code and instabilities). This case study makes this learning outcome distinc- zero initial choices for analysis parameters. The practical outcome tion concrete by considering parameter sweeps. of this software design is that users will tend to self-initiate Generating a parameter sweep similar to Figure 2 with stan- parameter sweeps: While students will rarely choose to write the dard Python libraries requires a considerable amount of boilerplate extensive boilerplate code necessary for a parameter sweep (unless code, manual coordination of model information, and explicit loop required to do so), students writing code in grama will tend to self- construction. The following code generates parameter sweep data initiate sound analysis practices. using standard libraries. Note that this code sweeps through values For example, the following code is unmodified from a student of x holding values of y fixed; additional code would be necessary report3 . The original author implemented an ordinary differential to construct a sweep through y2 . equation model to simulate the track time "finish_time" of ## Parameter sweep: Manual approach an electric formula car, and sought to study the impact of variables # Gather model info x_lo = -1; x_up = +1; such as the gear ratio "GR" on "finish_time". While the y_lo = -1; y_up = +1; assignment did not require a parameter sweep, the student chose f_model = lambda x, y: x**2 * y to carry out their own study. The code below is a self-initiated # Analysis parameters parameter sweep of the track time model. nx = 10 # Grid resolution for x y_const = [-1, 0, +1] # Constant values for y ## Unedited student code # Generate data md_car = ( data = np.zeros((nx * len(y_const), 3)) gr.Model("Accel Model") for i, x in enumerate( >> gr.cp_function( np.linspace(x_lo, x_up, num=nx) fun = calculate_finish_time, ): var = ["GR", "dt_mass", "I_net" ], for j, y in enumerate(y_const): out = ["finish_time"], data[i + j*nx, 0] = f_model(x, y) ) data[i + j*nx, 1] = x data[i + j*nx, 2] = y >> gr.cp_bounds( # Package data for visual GR=(+1,+4), df_manual = pd.DataFrame( dt_mass=(+5,+15), I_net=(+.2,+.3), 2. 
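For reference, a self-contained version of this manual approach, including the additional sweep through y that the text notes would require extra code, might look like the following sketch (the toy model and grid values mirror the fragment above; np and pd refer to numpy and pandas as in footnote 2):

## Parameter sweep: manual approach, both inputs (sketch)
import numpy as np
import pandas as pd

f_model = lambda x, y: x**2 * y              # same toy model as in the fragment above
x_lo, x_up = -1, +1
y_lo, y_up = -1, +1
nx, ny = 10, 10                              # grid resolution per swept input
x_const = [-1, 0, +1]                        # values held constant while sweeping y
y_const = [-1, 0, +1]                        # values held constant while sweeping x

rows = []
for y in y_const:                            # sweep x, holding y constant
    for x in np.linspace(x_lo, x_up, num=nx):
        rows.append({"sweep_var": "x", "x": x, "y": y, "f": f_model(x, y)})
for x in x_const:                            # sweep y, holding x constant (the extra code noted in the text)
    for y in np.linspace(y_lo, y_up, num=ny):
        rows.append({"sweep_var": "y", "x": x, "y": y, "f": f_model(x, y)})

df_manual = pd.DataFrame(rows)

Even in this compact form, the bookkeeping (grid choices, held-constant values, assembling the data frame) falls entirely on the user, which is the contrast the grama version is meant to highlight.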
Code assumes import numpy as np; import pandas as pd. 3. Included with permission of the author, on condition of anonymity. 254 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) ) ) gr.plot_auto( gr.eval_sinews( md_car, df_det="swp", #skip=True, n_density=20, n_sweeps=5, seed=101, ) ) Fig. 4: Schematic boat hull rotated to 22.5◦ . The forces due to gravity and buoyancy act at the center of mass (COM) and center of buoyancy (COB), respectively. Note that this hull is upright stable, as the couple will rotate the boat to upright. that a restoring torque is generated (Fig. 4). However, this upright stability is not guaranteed; Figure 5 illustrates a boat design that does not provide a restoring torque near its upright angle. An upright-unstable boat will tend to capsize spontaneously. Fig. 3: Input sweep generated from the student code above. The image has been cropped for space, and the results are generated with an older version of grama. The jagged response at higher values of the input are evidence of solver instabilities. The parameter sweep shown in Figure 2 gives an overall impres- sion of the effect of input "GR" on the output "finish_time". This particular input tends to dominate the results. However, variable results at higher values of "GR" provide evidence of numerical instability in the ODE solver underlying the model. Without this sort of model evaluation, the student author would not have discovered the limitations of the model. Exploratory Model Analysis Case Study This final case study illustrates how grama supports exploratory Fig. 5: Schematic boat hull rotated to 22.5◦ . Gravity and buoyancy model analysis. This iterative process is a computational approach are annotated as in Figure 4. Note that this hull is upright unstable, to mining insights into physical systems. The following use case as the couple will rotate the boat away from upright. illustrates the approach by considering the design of boat hull cross-sections. Naval engineers analyze the stability of a boat design by constructing a moment curve, such as the one pictured in Figure Static Stability of Boat Hulls 6. This curve depicts the net moment due to buoyancy at various Stability is a key consideration in boat hull design. One of the most angles, assuming the vessel is in vertical equilibrium. From this fundamental aspects of stability is static stability; the behavior of a figure we can see that the design is upright-stable, as it possesses boat when perturbed away from static equilibrium [LE00]. Figure a negative slope at upright θ = 0◦ . Note that a boat may not have 4 illustrates the physical mechanism governing stability at small an unlimited range of stability as Figure 6 exhibits an angle of perturbations from an upright orientation. vanishing stability (AVS) beyond which the boat does not recover As a boat is rotated away from its upright orientation, its center to upright. of buoyancy (COB) will tend to migrate. If the boat is in vertical The classical way to build intuition about boat stability is equilibrium, its buoyant force will be equal in magnitude to its via mathematical derivations [LE00]. In the following section we weight. A stable boat is a hull whose COB migrates in such a way present an alternative way to build intuition through exploratory ENABLING ACTIVE LEARNING PEDAGOGY AND INSIGHT MINING WITH A GRAMMAR OF MODEL ANALYSIS 255 gr.tf_iocorr() computes correlations between every pair of input variables var and outputs out. 
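For intuition, the correlation values that gr.tf_iocorr() tabulates can also be cross-checked with plain pandas; a minimal sketch, using the df_boats sample of designs from this case study (variable names taken from the surrounding code):

# Plain-pandas cross-check of the input/output correlations (sketch)
var = ["H", "W", "n", "d", "f_com"]
out = ["mass", "angle", "stability"]
df_corr = (
    df_boats[var + out]
    .corr()           # full pairwise correlation matrix
    .loc[var, out]    # keep only the input-vs-output block
)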
The routine also attaches metadata, enabling an autoplot as a tileplot of the correlation values. ( df_boats >> gr.tf_iocorr( var=["H", "W", "n", "d", "f_com"], out=["mass", "angle", "stability"], ) >> gr.pt_auto() ) Fig. 6: Total moment on a boat hull as it is rotated through 180◦ . A negative slope at upright θ = 0◦ is required for upright stability. Stability is lost at the angle of vanishing stability (AVS). model analysis. EMA for Insight Mining Generation and post-processing of the moment curve are imple- mented in the grama model md_performance4 . This model parameterizes a 2d boat hull via its height H, width W, shape of corner n, the vertical height of the center of mass f_com (as a fraction of the height), and the displacement ratio d (the Fig. 7: Tile plot of input/output correlations; autoplot gr.pt_auto() ratio of the boat’s mass to maximum water mass displaced). visualization of gr.tf_iocorr() output. Note that a boat with d > 1 is incapable of flotation. A smaller value of d corresponds to a boat that floats higher in The correlations in Figure 7 suggest that stability is posi- the water. The model md_performance returns stability tively impacted by increasing the width W and displacement ratio = -dMdtheta_0 (the negative of the moment curve slope at d of a boat, and by decreasing the height H, shape factor n, and upright) as well as the mass and AVS angle. A positive value vertical location of the center of mass f_com. The correlations of stability indicates upright stability, while a larger value of also suggest a similar impact of each variable on the AVS angle, angle indicates a wider range of stability. but with a weaker dependence on H. These results also suggest that The EMA process begins by generating data from the model. f_com has the strongest effect on both stability and angle. However, the generation of a moment curve is a nontrivial cal- Correlations are a reasonable first-check of input/output be- culation. One should exercise care in choosing an initial sample havior, but linear correlation quantifies only an average, linear of designs to analyze. The statistical problem of selecting efficient association. A second-pass at the data would be to fit an accurate input values for a computer model is called the design of computer surrogate model and inspect parameter sweeps. The following experiments [SSW89]. The grama verb gr.tf_sp() implements the code defines a gaussian process fit [RW05] for both stability support points algorithm [MJ18] to reduce a large dataset of target and angle, and estimates model error using k-folds cross valida- points to a smaller (but representative) sample. The following code tion [JWHT13]. Note that a non-default kernel is necessary for a generates a sample of input design values via gr.ev_sample() reasonable fit of the latter output5 . with the skip=True argument, uses gr.tf_sp() to "com- ## Define fitting procedure pact" this large sample, then evaluates the performance model at ft_common = gr.ft_gp( var=["H", "W", "n", "d", "f_com"], the smaller sample. out=["angle", "stability"], df_boats = ( kernels=dict( md_performance stability=None, # Use default >> gr.ev_sample( angle=RBF(length_scale=0.1), n=5e3, ) df_det="nom", ) seed=101, ## Estimate model accuracy via k-folds CV skip=True, ( ) df_boats >> gr.tf_sp(n=1000, seed=101) >> gr.tf_kfolds( >> gr.tf_md(md=md_performance) ft=ft_common, ) out=["angle", "stability"], ) With an initial sample generated, we can perform an ex- ) ploratory analysis relating the inputs and outputs. 
The verb 5. RBF is imported as from sklearn.gaussian_process.kernels 4. The analysis reported here is available as a jupyter notebook. import RBF. 256 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) angle stability k Direction H W n d f_com 0.771 0.979 0 1 -0.0277 0.0394 -0.1187 0.4009 -0.9071 0.815 0.976 1 2 -0.6535 0.3798 -0.0157 -0.6120 -0.2320 0.835 0.95 2 0.795 0.962 3 0.735 0.968 4 TABLE 2: Subspace weights in df_weights. TABLE 1: Accuracy (R2 ) estimated via k-fold cross validation of all the sweeps across f_com for stability_mean tend to be gaussian process model. monotone with a fairly steep slope. This is in agreement with the correlation results of Figure 7; the f_com sweeps tend to The k-folds CV results (Tab. 1) suggest a highly accurate have the steepest slopes. Given the high accuracy of the model model for stability, and a moderately accurate model for for stability (as measured by k-folds CV), this trend is angle. The following code defines the surrogate model over a reasonably trustworthy. domain that includes the original dataset, and performs parameter However, the same figure shows an inconsistent (non- sweeps across all inputs. monotone) effect of most inputs on the AVS angle_mean. md_fit = ( These results are in agreement with the k-fold CV results shown df_boats above. Clearly, the surrogate model is untrustworthy, and we >> ft_common() should resist trusting conclusions from the parameter sweeps for >> gr.cp_marginals( H=gr.marg_mom("uniform", mean=2.0, cov=0.30), angle_mean. This undermines the conclusion we drew from W=gr.marg_mom("uniform", mean=2.5, cov=0.35), the input/output correlations pictured in Figure 7. Clearly, angle n=gr.marg_mom("uniform", mean=1.0, cov=0.30), exhibits more complex behavior than a simple linear correlation d=gr.marg_mom("uniform", mean=0.5, cov=0.30), f_com=gr.marg_mom( with each of the boat design variables. "uniform", A different analysis of the boat hull angle data helps mean=0.55, develop useful insights. We pursue an active subspace analysis cov=0.47, of the data to reduce the dimensionality of the input space by ), ) identifying directions that best explain variation in the output >> gr.cp_copula_independence() [dCI17], [Con15]. The verb gr.tf_polyridge() implements ) the variable projection algorithm of Hokanson and Constantine ( [HC18]. The following code pursues a two-dimensional reduction md_fit of the input space. Note that the hyperparameter n_degree=6 is >> gr.ev_sinews(df_det="swp", n_sweeps=5) set via a cross-validation study. >> gr.pt_auto() ) ## Find two important directions df_weights = ( df_boats >> gr.tf_polyridge( var=["H", "W", "n", "d", "f_com"], out="angle", n_degree=6, # Set via CV study n_dim=2, # Seek 2d subspace ) ) The subspace weights are reported in Table 2. Note that the leading direction 1 is dominated by the displacement ratio d and COM location f_com. Essentially, this describes the "loading" of the vessel. The second direction corresponds to "widening and shortening" of the hull cross-section (in addition to lowering d and f_com). Using the subspace weights in Table 2 to produce a 2d projec- tion of the feature space enables visualizing all boat geometries in a single plot. Figure 9 reveals that this 2d projection is very suc- Fig. 8: Parameter sweeps for fitted GP model. Model *_mean and cessful at separating universally-stable (angle==180), upright- predictive uncertainty *_sd values are reported for each output unstable (angle==0), and intermediate cases (0 < angle < angle, stability. 
180). Intermediate cases are concentrated at higher values of the second active variable. There is a phase transition between Figure 8 displays parameter sweeps for the surrogate model of universally-stable and upright-unstable vessels at lower values of stability and angle. Note that the surrogate model reports the second active variable. both a mean trend *_mean and a predictive uncertainty *_sd. Interpreting Figure 9 in light of Table 2 provides us with deep The former is the model’s prediction for future values, while the insight about boat stability: Since active variable 1 corresponds to latter quantifies the model’s confidence in each prediction. loading (high displacement ratio d with a low COM f_com), we The parameter sweeps of Figure 8 show a consistent and strong can see that the boat’s loading conditions are key to determining effect of f_com on the stability_mean of the boat; note that its stability. Since active variable 2 depends on the aspect ratio ENABLING ACTIVE LEARNING PEDAGOGY AND INSIGHT MINING WITH A GRAMMAR OF MODEL ANALYSIS 257 native to derivation for the activities in an active learning approach. Rather than structuring courses around deriving and implementing scientific models, course exercises could have students explore the behavior of a pre-implemented model to better understand physical phenomena. Lorena Barba [Bar16] describes some of the benefits in this style of lesson design. EMA is also an important part of the modeling practitioner’s toolkit as a means to verify a model’s implementation and to develop new insights. Grama sup- ports both novices and practitioners in performing EMA through a concise syntax. R EFERENCES [AFD+ 21] Riya Aggarwal, Mira Flynn, Sam Daitzman, Diane Lam, and Zachary Riggins del Rosario. A qualitative study of engineer- ing students’ reasoning about statistical variability. In 2021 Fall ASEE Middle Atlantic Section Meeting, 2021. URL: https://peer.asee.org/38421. Fig. 9: Boat design feature vectors projected to 2d active subspace. [Bar16] Lorena Barba. Computational thinking: I do not think it means what you think it means. Technical re- The origin corresponds to the mean feature vector. port, 2016. URL: https://lorenabarba.com/blog/computational- thinking-i-do-not-think-it-means-what-you-think-it-means/. [Bie20] Przemyslaw Biecek. ceterisParibus: Ceteris Paribus Profiles, (higher width, shorter height), Figure 9 suggests that only wider 2020. R package version 0.4.2. URL: https://cran.r-project.org/ boats will tend to exhibit intermediate stability. package=ceterisParibus. [Blo56] Benjamin Samuel Bloom. Taxonomy of educational objectives: The classification of educational goals. Addison-Wesley Long- Conclusions man Ltd., 1956. Grama is a Python implementation of a grammar of model anal- [Bry20] Jennifer Bryan. object of type closure is not subsettable. 2020. ysis. The grammar’s design supports an active learning approach rstudio::conf 2020. URL: https://rstd.io/debugging. [BTSA12] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. to teaching sound scientific modeling practices. Two case studies Legion: Expressing locality and independence with logical re- demonstrated the teaching benefits of grama: errors for learners gions. In SC’12: Proceedings of the International Conference on help guide novices toward a more sound analysis, while concise High Performance Computing, Networking, Storage and Analy- syntax encourages novices to carry out sound analysis practices. sis, pages 1–11. IEEE, 2012. 
URL: https://ieeexplore.ieee.org/ document/6468504, doi:10.1109/SC.2012.71. Grama can also be used for exploratory model analysis (EMA) [CF99] Alison C Cullen and H Christopher Frey. Probabilistic Tech- – an exploratory procedure to mine a scientific model for useful niques In Exposure Assessment: A Handbook For Dealing With insights. A case study of boat hull design demonstrated EMA. Variability And Uncertainty In Models And Inputs. Springer Science & Business Media, 1999. In particular, the example explored and explained the relationship [Con15] Paul G. Constantine. Active Subspaces: Emerging Ideas for between boat design parameters and two metrics of boat stability. Dimension Reduction in Parameter Studies. SIAM Philadelphia, Several ideas from the grama project are of interest to other 2015. doi:10.1137/1.9781611973860. practitioners and developers in scientific computing. Grama was [dCI17] Zachary del Rosario, Paul G. Constantine, and Gianluca Iac- designed to support model analysis under uncertainty. However, carino. Developing design insight through active subspaces. In 19th AIAA Non-Deterministic Approaches Conference, page the data/model and four-verb ontology (Fig. 1) underpinning 1090, 2017. URL: https://arc.aiaa.org/doi/10.2514/6.2017-1090, grama is a much more general idea. This design enables very doi:10.2514/6.2017-1090. concise model analysis syntax, which provides much of the benefit [dR20] Zachary del Rosario. Grama: A grammar of model analysis. Jour- nal of Open Source Software, 5(51):2462, 2020. URL: https://doi. behind grama. org/10.21105/joss.02462, doi:10.21105/joss.02462. The design idiom of errors for learners is not simply focused [dRFI21] Zachary del Rosario, Richard W Fenrich, and Gianluca Iaccarino. on writing "useful" error messages, but is rather a design orien- When are allowables conservative? AIAA Journal, 59(5):1760– tation to use errors to introduce teachable moments. In addition 1772, 2021. URL: https://doi.org/10.2514/1.J059578, doi:10. 2514/1.J059578. to writing error messages "for humans" [Bry20], an errors for [EVB+ 20] M Esmaily, L Villafane, AJ Banko, G Iaccarino, JK Eaton, learners philosophy designs errors not simply to avoid fatal and A Mani. A benchmark for particle-laden turbu- program behavior, but rather introduces exceptions to prevent lent duct flow: A joint computational and experimen- conceptually invalid analyses. For instance, in the case study tal study. International Journal of Multiphase Flow, 132:103410, 2020. URL: https://www.sciencedirect.com/ presented above, designing gr.tf_sample() to assume independent science/article/abs/pii/S030193222030519X, doi:10.1016/ random inputs when a copula is unspecified would lead to code j.ijmultiphaseflow.2020.103410. that throws errors less frequently. However, this would silently [FEM+ 14] Scott Freeman, Sarah L Eddy, Miles McDonough, Michelle K endorse the conceptually problematic mentality of "independence Smith, Nnadozie Okoroafor, Hannah Jordt, and Mary Pat Wen- deroth. Active learning increases student performance in sci- is the default." While throwing an error message for an unspecified ence, engineering, and mathematics. Proceedings of the Na- dependence structure leads to more frequent errors, it serves as a tional Academy of Sciences, 111(23):8410–8415, 2014. doi: frequent reminder that dependency is an important part of a model 10.1073/pnas.1319030111. [HC18] Jeffrey M Hokanson and Paul G Constantine. Data-driven involving random inputs. 
polynomial ridge approximation using variable projection. SIAM Finally, exploratory model analysis holds benefits for both Journal on Scientific Computing, 40(3):A1566–A1589, 2018. learners and practitioners of scientific modeling. EMA is an alter- doi:10.1137/17M1117690. 258 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) [HK21] Jan Katins gdowding austin matthias-k Tyler Funnell Florian Finkernagel Jonas Arnfred Dan Blanchard et al. Hassan Kibirige, Greg Lamp. has2k1/plotnine: v0.8.0. Mar 2021. doi:10. 5281/zenodo.4636791. [JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshi- rani. An Introduction to Statistical Learning: with Applications in R, volume 112. Springer, 2013. URL: https://www.statlearning. com/. [KBB19] Michał Kuźba, Ewa Baranowska, and Przemysław Biecek. pyce- terisparibus: explaining machine learning models with ceteris paribus profiles in python. Journal of Open Source Software, 4(37):1389, 2019. URL: https://joss.theoj.org/papers/10.21105/ joss.01389, doi:10.21105/joss.01389. [KN05] Andy Keane and Prasanth Nair. Computational Approaches For Aerospace Design: The Pursuit Of Excellence. John Wiley & Sons, 2005. [KSS21] Daniel Kahneman, Olivier Sibony, and Cass R Sunstein. Noise: A flaw in human judgment. Little, Brown, 2021. [LE00] Lars Larsson and Rolf Eliasson. Principles of Yacht Design. McGraw Hill Companies, 2000. [MJ18] Simon Mak and V Roshan Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018. doi:10.1214/17- AOS1629. [MTW 22] Kazuki Maeda, Thiago Teixeira, Jonathan M Wang, Jeffrey M + Hokanson, Caetano Melone, Mario Di Renzo, Steve Jones, Javier Urzay, and Gianluca Iaccarino. An integrated heterogeneous computing framework for ensemble simulations of laser-induced ignition. arXiv preprint arXiv:2202.02319, 2022. URL: https: //arxiv.org/abs/2202.02319, doi:10.48550/arXiv.2202. 02319. [MV15] Sparsh Mittal and Jeffrey S Vetter. A survey of cpu-gpu heteroge- neous computing techniques. ACM Computing Surveys (CSUR), 47(4):1–35, 2015. URL: https://dl.acm.org/doi/10.1145/2788396, doi:10.1145/2788396. [pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL: https://doi.org/10.5281/zenodo.3509134, doi:10.5281/zenodo.3509134. [RW05] Carl Edward Rasmussen and Christopher K. I. Williams. Gaus- sian Processes for Machine Learning. The MIT Press, 11 2005. URL: https://doi.org/10.7551/mitpress/3206.001.0001, doi:10.7551/mitpress/3206.001.0001. [SKA+ 14] Jeffrey P Slotnick, Abdollah Khodadoust, Juan Alonso, David Darmofal, William Gropp, Elizabeth Lurie, and Dimitri J Mavriplis. Cfd vision 2030 study: A path to revolutionary computational aerosciences. Technical report, 2014. URL: https://ntrs.nasa.gov/citations/20140003093. [SS13] Seref Sagiroglu and Duygu Sinanc. Big data: A review. In 2013 International Conference on Collaboration Technolo- gies and Systems (CTS), pages 42–47. IEEE, 2013. URL: https://ieeexplore.ieee.org/document/6567202, doi:10.1109/ CTS.2013.6567202. [SSW89] Jerome Sacks, Susannah B. Schiller, and William J. Welch. Designs for computer experiments. Technometrics, 31(1):41– 47, 1989. URL: http://www.jstor.org/stable/1270363, doi:10. 2307/1270363. [WAB 19] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, + Lucy D’Agostino McGowan, Romain François, Garrett Grole- mund, Alex Hayes, Lionel Henry, Jim Hester, et al. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, 2019. doi:10.21105/joss.01686. [Wic14] Hadley Wickham. Tidy data. 
Journal of Statistical Software, 59(10):1–23, 2014. doi:10.18637/jss.v059.i10.
[WM10] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi:10.25080/Majora-92bf1922-00a.

Low Level Feature Extraction for Cilia Segmentation

Meekail Zain‡†∗, Eric Miller§†, Shannon P Quinn‡¶, Cecilia Lo||

Abstract—Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances. Dysfunction of ciliary motion is often indicative of diseases known as ciliopathies, which disrupt the functionality of macroscopic structures within the lungs, kidneys and other organs [LWL+ 18]. Phenotyping ciliary motion is an essential step towards understanding ciliopathies; however, this is generally an expert-intensive process [QZD+ 15]. A means of automatically parsing recordings of cilia to determine useful information would greatly reduce the amount of expert intervention required. This would not only improve overall throughput, but also mitigate human error, and greatly improve the accessibility of cilia-based insights. Such automation is difficult to achieve due to the noisy, partially occluded and potentially out-of-phase imagery used to represent cilia, as well as the fact that cilia occupy a minority of any given image. Segmentation of cilia mitigates these issues, and is thus a critical step in enabling a powerful pipeline. However, cilia are notoriously difficult to properly segment in most imagery, imposing a bottleneck on the pipeline. Experimentation on and evaluation of alternative methods for feature extraction of cilia imagery hence provide the building blocks of a more potent segmentation model. Current experiments show up to a 10% improvement over base segmentation models using a novel combination of feature extractors.

Index Terms—cilia, segmentation, u-net, deep learning

Fig. 1: A sample frame from the cilia dataset

Introduction

Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances [Ish17]. Dysfunction of ciliary motion often indicates diseases known as ciliopathies, which on a larger scale disrupt the functionality of structures within the lungs, kidneys and other organs. Phenotyping ciliary motion is an essential step towards understanding ciliopathies. However, this is generally an expert-intensive process [LWL+ 18], [QZD+ 15]. A means of automatically parsing recordings of cilia to determine useful information would greatly reduce the amount of expert intervention required, thus increasing throughput while alleviating the potential for human error.

gation in the Quinn Research Group at the University of Georgia [ZRS+ 20]. The current pipeline consists of three major stages: preprocessing, where segmentation masks and optical flow representations are created to supplement raw cilia video data; appearance, where a model learns a condensed spatial representation of the cilia; and dynamics, which learns a representation from the video, encoded as a series of latent points from the appearance module. In the primary module, the segmentation mask is essential in scoping downstream analysis to the cilia themselves, so inaccuracies at this stage directly affect the overall performance of the pipeline.
Hence, However, due to the high variance of ciliary structure, as well Zain et al. (2020) discuss the construction of a generative pipeline as the noisy and out-of-phase imagery available, segmentation to model and analyze ciliary motion, a prevalent field of investi- attempts have been prone to error. † These authors contributed equally. While segmentation masks for such a pipeline could be * Corresponding author: meekail.zain@uga.edu manually generated, the process requires intensive expert labor ‡ Department of Computer Science, University of Georgia, Athens, GA 30602 [DvBB+ 21]. Requiring manual segmentation before analysis thus USA § Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602 greatly increases the barrier to entry for this tool. Not only would USA it increase the financial strain of adopting ciliary analysis as a ¶ Department of Cellular Biology, University of Georgia, Athens, GA 30602 clinical tool, but it would also serve as an insurmountable barrier to USA || Department of Developmental Biology, University of Pittsburgh, Pittsburgh, entry for communities that do not have reliable access to such clin- PA 15261 USA icians in the first place, such as many developing nations and rural populations. Not only can automated segmentation mitigate these Copyright © 2022 Meekail Zain et al. This is an open-access article distributed barriers to entry, but it can also simplify existing treatment and under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the analysis infrastructure. In particular, it has the potential to reduce original author and source are credited. the magnitude of work required by an expert clinician, thereby 260 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) expansion. The contraction path follows the standard strategy of most convolutional neural networks (CNNs), where convolutions are followed by Rectified Linear Unit (ReLU) activation func- tions and max pooling layers. While max pooling downsamples the images, the convolutions double the number of channels. Upon expansion, up-convolutions are applied to up-sample the image while reducing the number of channels. At each stage, the network concatenates the up-sampled image with the image of corresponding size (cropped to account for border pixels) from a layer in the contracting path. A final layer uses pixel- wise (1 × 1) convolutions to map each pixel to a corresponding class, building a segmentation. Before training, data is generally augmented to provide both invariance in rotation and scale as well as a larger amount of training data. In general, U-Nets have shown high performance on biomedical data sets with low quantities Fig. 2: The classical U-Net architecture, which serves as both a of labelled images, as well as reasonably fast training times on baseline and backbone model for this research graphics processing units (GPUs) [RFB15]. However, in a few past experiments with cilia data, the U-Net architecture has had low segmentation accuracy [LMZ+ 18]. Difficulties modeling cilia decreasing costs and increasing clinician throughput [QZD+ 15], with CNN-based architectures include their fine high-variance [ZRS+ 20]. 
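As a concrete illustration of the contraction and expansion paths described above, the following is a heavily simplified PyTorch sketch of a U-Net-style network (one downsampling and one upsampling stage only, with padded convolutions in place of the cropping used in the original formulation; the backbone used in this work is deeper and wider):

import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions, each followed by a ReLU activation
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, n_classes=1):
        super().__init__()
        self.enc1 = double_conv(in_channels, 16)
        self.pool = nn.MaxPool2d(2)                                # downsample
        self.enc2 = double_conv(16, 32)                            # channels double as resolution halves
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)  # up-convolution
        self.dec1 = double_conv(32, 16)                            # 32 = 16 (skip) + 16 (upsampled)
        self.head = nn.Conv2d(16, n_classes, kernel_size=1)        # pixel-wise 1x1 convolution

    def forward(self, x):
        s1 = self.enc1(x)                    # contracting path
        b = self.enc2(self.pool(s1))         # bottleneck
        u = self.up(b)                       # expanding path
        u = torch.cat([s1, u], dim=1)        # skip connection (same spatial size here)
        return self.head(self.dec1(u))       # per-pixel class logits

# Example: a 2-channel input (image plus one feature map), 128x128 patch
net = TinyUNet(in_channels=2, n_classes=1)
mask_logits = net(torch.randn(1, 2, 128, 128))   # -> shape (1, 1, 128, 128)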
Furthermore, manual segmentation imparts clinician- structure, spatial sparsity, color homogeneity (with respect to the specific bias which reduces the reproducability of results, making background and ambient cells), as well as inconsistent shape and it difficult to verify novel techniques and claims [DvBB+ 21]. distribution across samples. Hence, various enhancements to the A thorough review of previous segmentation models, specif- pure U-Net model are necessary for reliable cilia segmentation. ically those using the same dataset, shows that current results are poor, impeding tasks further along the pipeline. For this Methodology study, model architectures utilize various methods of feature extraction that are hypothesized to improve the accuracy of a base The U-Net architecture is the backbone of the model due to its segmentation model, such as using zero-phased PCA maps and well-established performance in the biomedical image analysis Sparse Autoencoder reconstructions with various parameters as a domain. This paper focuses on extracting and highlighting the data augmentation tool. Various experiments with these methods underlying features in the image through various means. There- provide a summary of both qualitative and quantitative results fore, optimization of the U-Net backbone itself is not a major necessary in ascertaining the viability for such feature extractors consideration of this project. Indeed, the relative performance of to aid in segmentation. the various modified U-Nets sufficiently communicates the effi- cacy of the underlying methods. Each feature extraction method will map the underlying raw image to a corresponding feature Related Works map. To evaluate the usefulness of these feature maps, the model Lu et. al. (2018) utilized a Dense Net segmentation model as an concatenates these augmentations to the original image and use upstream to a CNN-based Long Short-Term Memory (LSTM) the aggregate data as input to a U-Net that is slightly modified to time-series model for classifying cilia based on spatiotemporal accept multiple input channels. patterns [LMZ+ 18]. While the model reports good classification The feature extractors of interest are Zero-phase PCA sphering accuracy and a high F-1 score, the underlying dataset only (ZCA) and a Sparse Autoencoder (SAE), on both of which the contains 75 distinct samples and the results must therefore be following subsections provide more detail. Roughly speaking, taken with great care. Furthermore, Lu et. al. did not report the these are both lossy, non-bijective transformations which map separate performance of the upstream segmentation network. Their a single image to a single feature map. In the case of ZCA, approach did, however, inspire the follow-up methodology of Zain empirically the feature maps tend to preserve edges and reduce et. al. (2020) for segmentation. In particular, they employ a Dense the rest of the image to arbitrary noise, thereby emphasizing local Net segmentation model as well, however they first augment the structure (since cell structure tends not to be well-preserved). The underlying images with the calculated optical flow. In this way, SAE instead acts as a harsh compression and filters out both linear their segmentation strategy employs both spatial and temporal and non-linear features, preserving global structure. Each extractor information. 
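Concretely, the multi-channel inputs described above amount to stacking each feature map with the raw crop along the channel axis; a small sketch with illustrative array names (assuming 128 × 128 grayscale crops held as NumPy arrays, with the ZCA and SAE outputs at the same resolution):

import numpy as np

def make_composite_input(img, zca_map=None, sae_map=None):
    # Stack the original crop with whichever feature maps are available
    channels = [img] + [m for m in (zca_map, sae_map) if m is not None]
    return np.stack(channels, axis=0)        # -> (C, 128, 128), with C in {1, 2, 3}

x = make_composite_input(np.random.rand(128, 128),
                         zca_map=np.random.rand(128, 128),
                         sae_map=np.random.rand(128, 128))
# x then feeds a U-Net whose first convolution expects C input channels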
To compare against [LMZ+ 18], the authors evaluated is evaluated by considering the performance of a U-Net model their segmentation model in the same way—as an upstream to trained on multi-channel inputs, where the first channel is the an CNN/LSTM classification network. Their model improved original image, and the second and/or third channels are the feature the classification accuracy two points above that of Charles et. maps extracted by these methods. In particular, the objective is for al. (2018). Their reported intersection-over-union (IoU) score is the doubly-augmented data, or the “composite” model, to achieve 33.06% and marks the highest performance achieved on this state-of-the-art performance on this challenging dataset. dataset. The ZCA implementation utilizes SciPy linear algebra solvers, One alternative segmentation model, often used in biomedical and both U-Net and SAE architectures use the PyTorch deep image processing and analysis, where labelled data sets are rela- learning library. Next, the evaluation stage employs canonical tively small, is the U-Net architecture (2) [RFB15]. Developed by segmentation quality metrics, such as the Jaccard score and Dice Ronneberger et. al., U-Nets consist of two parts: contraction and coefficient, on various models. When applied to the composite LOW LEVEL FEATURE EXTRACTION FOR CILIA SEGMENTATION 261 model, these metrics determine any potential improvements to the feature since often times in image analysis low eigenvalues (and state-of-the-art for cilia segmentation. the span of their corresponding eigenvectors) tend to capture high- frequency data. Such data is essential for tasks such as texture Cilia Data analysis, and thus tuning the value of ε helps to preserve this data. As in the Zain paper, the input data is a limited set of grayscale ZCA maps for various values of ε on a sample image are shown cilia imagery, from both healthy patients and those diagnosed with in figure 3. ciliopathies, with corresponding ground truth masks provided by experts. The images are cropped to 128 × 128 patches. The images are cropped at random coordinates in order to increase the size and variance of the sample space, and each image is cropped a number of times proportional its resolution. Additionally, crops that contain less than fifteen percent cilia are excluded from the Fig. 3: Comparison of ZCA maps on a cilia sample image with various training/test sets. This method increases the size of the training levels of ε. The original image is followed by maps with ε = 1e − 4, set from 253 images to 1409 images. Finally, standard minmax ε = 1e − 5, ε = 1e − 6, and ε = 1e − 7, from left to right. contrast normalization maps the luminosity to the interval [0, 1]. Zero-phase PCA sphering (ZCA) Sparse Autoencoder (SAE) The first augmentation of the underlying data concatenates the Similar in aim to ZCA, an SAE can augment the underlying input to the backbone U-Net model with the ZCA-transformed images to further filter and reduce noise while allowing the data. ZCA maps the underlying data to a version of the data that is construction and retention of potentially nonlinear spatial features. “rotated” through the dataspace to ensure certain spectral proper- Autoencoders are deep learning models that first compress data ties. ZCA in effect can implicitly normalize the data using the most into a low-level latent space and then attempt to reconstruct images significant (by empirical variance) spatial features present across from the low-level representation. 
SAEs in particular add an additional constraint, usually via the loss function, that encourages sparsity (i.e., less activation) in hidden layers of the network. Xu et al. use the SAE architecture for breast cancer nuclear detection and show that the architecture preserves essential, high-level, and often nonlinear aspects of the initial imagery—even when unlabelled—such as shape and color [XXL+ 16]. An adaptation of the first two terms of their loss function enforces sparsity:

L_SAE(θ) = (1/N) Σ_{k=1}^{N} L(x^(k), d_θ̂(e_θ̌(x^(k)))) + α (1/n) Σ_{j=1}^{n} KL(ρ || ρ̂_j).

The first term is a standard reconstruction loss (mean squared error), whereas the latter is the mean Kullback-Leibler (KL) divergence between ρ̂, the activation of a neuron in the encoder, and ρ, the enforced activation. For the case of experiments performed here, ρ = 0.05 remains constant but values of α vary, specifically 1e−2, 1e−3, and 1e−4, for each of which a static dataset is created for feeding into the segmentation model. Larger α prioritizes sparsity over reconstruction accuracy, which, to an extent, is hypothesized to retain significant low-level features of the cilia. Reconstructions with various values of α are shown in figure 4.

Fig. 4: Comparison of SAE reconstructions from different training instances with various levels of α (the activation loss weight). From left to right: original image, α = 1e−2 reconstruction, α = 1e−3 reconstruction, α = 1e−4 reconstruction.

A significant amount of freedom can be found in potential

the dataset. Given a matrix X with rows representing samples and columns for each feature, a sphering (or whitening) transformation W is one which decorrelates X. That is, the covariance of WX must be equal to the identity matrix. By the spectral theorem, the symmetric matrix XX^T—the covariance matrix corresponding to the data, assuming the data is centered—can be decomposed into PDP^T, where P is an orthogonal matrix of eigenvectors and D a diagonal matrix of corresponding eigenvalues of the covariance matrix. ZCA uses the sphering matrix W = PD^(−1/2)P^T and can be thought of as a transformation into the eigenspace of its covariance matrix—projection onto the data's principal axes, as the minimal projection residual is onto the axes with maximal variance—followed by normalization of variance along every axis and rotation back into the original image space. In order to reduce the amount of two-way correlation in images, Krizhevsky applies ZCA whitening to preprocess CIFAR-10 data before classification and shows that this process nicely preserves features, such as edges [LjWD19].

This ZCA implementation uses the Python SciPy library (SciPy), which builds on top of low-level hardware-optimized routines such as BLAS and LAPACK to efficiently calculate many linear algebra operations. In particular, these experiments implement ZCA as a generalized whitening technique. While the normal ZCA calculation selects a whitening matrix W = PD^(−1/2)P^T, a more applicable alternative is W = P(D + εI)^(−1/2)P^T, where ε is a hyperparameter which attenuates eigenvalue sensitivity. This new "whitening" is actually not a proper whitening since it does not guarantee an identity covariance matrix. It does however serve a similar purpose and actually lends some benefits.

Most importantly, it is indeed a generalization of canonical ZCA. That is to say, ε = 0 recovers canonical ZCA, and λ → √(1/λ) provides the spectrum of W on the eigenvalues. Otherwise, ε > 0 results in the map λ → √(1/(λ + ε)).
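A compact sketch of this generalized whitening, assuming X holds one centered, flattened patch per row and using SciPy's symmetric eigendecomposition (array sizes below are illustrative only):

import numpy as np
from scipy.linalg import eigh

def zca_whiten(X, eps=1e-4):
    # X: (n_samples, n_features), already centered column-wise
    cov = (X.T @ X) / X.shape[0]                     # empirical covariance
    d, P = eigh(cov)                                 # cov = P diag(d) P^T
    W = P @ np.diag(1.0 / np.sqrt(d + eps)) @ P.T    # W = P (D + eps I)^(-1/2) P^T
    return X @ W                                     # eps = 0 recovers canonical ZCA

# Toy sizes for illustration; in practice each row would be a flattened 128x128 crop
X = np.random.rand(500, 32 * 32)
X = X - X.mean(axis=0)
X_zca = zca_whiten(X, eps=1e-4)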
In this case, while all eigenvalues architectural choices for SAE. A focus on low-medium complexity map to smaller values compared to the original map, the smallest models both provides efficiency and minimizes overfitting and ar- eigenvalues map to significantly smaller values compared to the tifacts as consequence of degenerate autoencoding. One important original map. This means that ε serves to “dampen” the effects danger to be aware of is that SAEs—and indeed, all AEs—are of whitening for particularly small eigenvalues. This is a valuable at risk of a degenerate solution wherein a sufficiently complex 262 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) Fig. 6: Artifacts generated during the training of U-Net. From left to right: original image, generated segmentation mask (pre-threshold), ground-truth segmentation mask Fig. 5: Illustration and pseudocode for Spatial Broadcast Decoding [WMBL19] Fig. 7: Artifacts generated during the training of ZCA+U-Net. From decoder essentially learns to become a hashmap of arbitrary (and left to right: original image, ZCA-mapped image, generated segmen- potentially random) encodings. tation mask (pre-threshold), ground-truth segmentation mask The SAE will therefore utilize a CNN architecture, as op- posed to more modern transformer-style architectures, since the figure 9 was taken only 10 epochs into the training process. simplicity and induced spatial bias provide potent defenses against Notably, this model, the composite pipeline, produced usable overfitting and mode collapse. Furthermore the encoder will use artifacts in mere minutes of training, whereas other models did Spatial Broadcast Decoding (SBD) which provides a method for not produce similar results until after about 10-40 epochs. decoding from a latent vector using size-preserving convolutions, Figure 10 provides a summary of experiments performed with thereby preserving the spatial bias even in decoding, and eliminat- SAE and ZCA augmented data, along with a few composite models ing the artifacts generated by alternate decoding strategies such as and a base U-Net for comparison. These models were produced “transposed” convolutions [WMBL19]. with data augmentation at various values of α (for the Sparse Autoencoder loss function) and ε (for ZCA) discussed above. Spatial Broadcast Decoding (SBD) While the table provides five metrics, those of primary importance Spatial Broadcast Decoding provides an alternative method from are the Intersection over Union (IoU), or Jaccard Score, as well ”transposed” (or ”skip”) convolutions to upsample images in the as the Dice (or F1) score, which are the most commonly used decoder portion of CNN-based autoencoders. Rather than main- metrics for evaluating the performance of segmentation models. taining the square shape, and hence associated spatial properties, Most feature extraction models at least marginally improve the of the latent representation, the output of the encoder is reshaped performance in of the U-Net in terms of IoU and Dice scores, into a single one-dimensional tensor per input image, which is then and the best-performing composite model (with ε of 1e − 4 tiled to the shape of the desired image (in this case, 128 × 128). for ZCA and α of 1e − 3 for SAE) provide an improvement In this way, the initial dimension of the latent vector becomes of approximately 10% from the base U-Net in these metrics. 
the number of input channels when fed into the decoder, and two There does not seem to be an obvious correlation between which additional channels are added to represent 2-dimensional spatial feature extraction hyperparameters provided the best performance coordinates. In its initial publication, SBD has been shown to pro- for individual ZCA+U-Net and SAE+U-Net models versus those vide effective results in disentangling latent space representations for the composite pipeline, but further experiments may assist in in various autoencoder models. analyzing this possibility. The base U-Net does outperform the others in precision, U-Net All models use a standard U-Net and undergo the same training process to provide a solid basis for analysis. Besides the number of input channels to the initial model (1 plus the number of augmentation channels from SAE and ZCA, up to 3 total chan- nels), the model architecture is identical for all runs. A single- channel (original image) U-Net first trains as a basis point for analysis. The model trains on two-channel inputs provided by Fig. 8: Artifacts generated during the training of SAE+U-Net. From ZCA (original image concatenated with the ZCA-mapped one) left to right: original image, SAE-reconstructed image, generated with various ε values for the dataset, and similarly SAE with segmentation mask (pre-threshold), ground-truth segmentation mask various α values, train the model. Finally, composite models train with a few combinations of ZCA and SAE hyperparameters. Each training process uses binary cross entropy loss with a learning rate of 1e − 3 for 225 epochs. Results Fig. 9: Artifacts generated 10 epochs into the training of the compos- Figures 6, 7, 8, and 9 show masks produced on validation data ite U-Net. From left to right: original image, ZCA-mapped image, from instances of the four model types. While the former three SAE-mapped image, generated segmentation mask (pre-threshold), show results near the end of training (about 200-250 epochs), ground-truth segmentation mask LOW LEVEL FEATURE EXTRACTION FOR CILIA SEGMENTATION 263 Extractor Parameters Scores Implications internal to other projects within the research group Model ε (ZCA) α (SAE) IoU Accuracy Recall Dice Precision sponsoring this research are clear. As discussed earlier, later U-Net (base) — — 0.399 0.759 0.501 0.529 0.692 pipelines of ciliary representation and modeling are currently 1e − 4 — 0.395 0.754 0.509 0.513 0.625 being bottlenecked by the poor segmentation masks produced by 1e − 5 — 0.401 0.732 0.563 0.539 0.607 base U-Nets, and the under-segmented predictions provided by ZCA + U-Net 1e − 6 — 0.408 0.756 0.543 0.546 0.644 1e − 7 — 0.419 0.758 0.563 0.557 0.639 the original model limits the scope of what these later stages — 1e − 2 0.380 0.719 0.568 0.520 0.558 may achieve. Better predictions hence tend to transfer to better SAE + U-Net — 1e − 3 0.398 0.751 0.512 0.526 0.656 downstream results. — 1e − 4 0.416 0.735 0.607 0.555 0.603 These results also have significant implications outside of the 1e − 4 1e − 2 0.401 0.761 0.506 0.521 0.649 1e − 4 1e − 3 0.441 0.767 0.580 0.585 0.661 specific task of cilia segmentation and modeling. The inherent 1e − 4 1e − 4 0.305 0.722 0.398 0.424 0.588 problem that motivated an introduction of feature extraction into 1e − 5 1e − 2 0.392 0.707 0.624 0.530 0.534 1e − 5 1e − 3 0.413 0.770 0.514 0.546 0.678 the segmentation process was the poor quality of the given dataset. 
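Both headline metrics reduce to simple overlap measures between the thresholded prediction and the ground-truth mask; a minimal sketch of how they can be computed:

import numpy as np

def iou_and_dice(pred, truth, threshold=0.5, eps=1e-8):
    # pred: raw U-Net output in [0, 1]; truth: binary ground-truth mask
    p = (pred >= threshold).astype(bool)
    t = truth.astype(bool)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = inter / (union + eps)                      # Intersection over Union (Jaccard score)
    dice = 2 * inter / (p.sum() + t.sum() + eps)     # Dice coefficient (F1 score)
    return iou, dice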
1e − 5 1e − 4 0.413 0.751 0.565 0.550 0.619 From occlusion to poor lighting to blurred images, these are Composite 1e − 6 1e − 2 0.392 0.719 0.602 0.527 0.571 problems that typically plague segmentation models in the real 1e − 6 1e − 3 0.395 0.759 0.480 0.521 0.711 1e − 6 1e − 4 0.405 0.729 0.587 0.545 0.591 world, where data sets are not of ideal quality. For many modern 1e − 7 1e − 2 0.383 0.753 0.487 0.503 0.655 computer vision tasks, segmentation is a necessary technique to 1e − 7 1e − 3 0.380 0.736 0.526 0.519 0.605 1e − 7 1e − 4 0.293 0.674 0.445 0.418 0.487 begin analysis of certain objects in an image, including any forms of objects from people to vehicles to landscapes. Many images Fig. 10: A summary of segmentation scores on test data for a base for these tasks are likely to come from low-resolution imagery, U-Net model, ZCA+U-Net, SAE+U-Net, and a composite model, with whether that be satellite data or security cameras, and are likely various feature extraction hyperparameters. The best result for each to face similar problems as the given cilia dataset in terms of scoring metric is in bold. image quality. Even if this is not the case, manual labelling, like that of this dataset and convenient in many other instances, is Input Images Predicted Masks Original ZCA SAE Ground Truth Base U-Net ZCA + U-Net SAE + U-Net Composite prone to error and is likely to bottleneck results. As experiments have shown, feature extraction through SAE and ZCA maps are a potential avenue for improvement of such models and would be an interesting topic to explore on other problematic datsets. Especially compelling, aside from the raw numeric results, is how soon composite pipelines began to produce usable masks on training data. As discussed earlier, most original U-Net models would take at least 40-50 epochs before showing any accurate predictions on training data. However, when feeding in composite Fig. 11: Comparison of predicted masks and ground truth for three SAE and ZCA data along with the original image, unusually test images. ZCA mapped images with ε = 1e − 4 and SAE reconstruc- accurate masks were produced within just a couple minutes, with tions with α = 1e − 3 are used where applicable. usable results at 10 epochs. This has potential implications in scenarios such as one-shot and/or unsupervised learning, where models cannot train over a large datset. however. Analysis of predicted masks from various models, some of which are shown in figure 11, shows that the base U-Net Future Research model tends to under-predict cilia, explaining the relatively high While this work establishes a primary direction and a novel precision. Previous endeavors in cilia segmentation also revealed perspective for segmenting cilia, there are many interesting and this pattern. valuable directions for future planned research. In particular, a novel and still-developing alternative to the convolution layer known as a Sharpened Cosine Similarity (SCS) layer has begun to attract some attention. While regular CNNs are proficient at Conclusions filtering, developing invariance to certain forms of noise and This paper highlights the current shortcomings of automated, perturbation, they are notoriously poor at serving as a spatial deep-learning based segmentation models for cilia, specifically indicator for features. 
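As a rough illustration of this idea, one commonly used formulation of sharpened cosine similarity (following write-ups such as [Pis22]; a sketch only, not necessarily the exact layer the authors intend to adopt) compares a patch against a kernel as follows:

import torch

def sharpened_cosine_similarity(patch, kernel, p=2.0, q=1e-3):
    # patch, kernel: 1-d tensors of equal length (a flattened window and a flattened kernel)
    dot = torch.dot(patch, kernel)
    scale = (patch.norm() + q) * (kernel.norm() + q)   # q guards against near-zero norms
    cos = dot / scale                                  # magnitude-invariant similarity
    return torch.sign(cos) * cos.abs() ** p            # exponent p sharpens the response

Because the response depends on the relative, not absolute, magnitudes of patch and kernel, activations track the spatial distribution of features rather than raw luminosity.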
Convolution activations can be high due to on the data provided to the Quinn Research Group, and provides changes in luminosity and do not necessarily imply the distribu- two additional methods, Zero-Phase PCA Sphering (ZCA) and tion of the underlying luminosity, therefore losing precise spatial Sparse Autoencoders (SAE), for performing feature extracting information. By design, SCS avoids these faults by considering augmentations with the purpose of aiding a U-Net model in the mathematical case of a “normalized” convolution, wherein segmentation. An analysis of U-Nets with various combinations neither the magnitude of the input, nor of the kernel, affect the final of these feature extraction and parameters help determine the output. Instead, SCS activations are dictated purely by the relative feasibility for low-level feature extraction in improving cilia seg- magnitudes of weights in the kernel, which is to say by the spatial mentation, and results from initial experiments show up to 10% distribution of features in the input [Pis22]. Domain knowledge increases in relevant metrics. suggests that cilia, while able to vary greatly, all share relatively While these improvements, in general, have been marginal, unique spatial distributions when compared to non-cilia such as these results show that pre-segmentation based feature extraction cells, out-of-phase structures, microscopy artifacts, etc. Therefore, methods, particularly the avenues explored, provide a worthwhile SCS may provide a strong augmentation to the backbone U- path of exploration and research for improving cilia segmentation. Net model by acting as an additional layer in tandem with the 264 PROC. OF THE 21st PYTHON IN SCIENCE CONF. (SCIPY 2022) already existing convolution layers. This way, the model is a true [LWL+ 18] Fangzhao Li, Changjian Wang, Xiaohui Liu, Yuxing Peng, and generalization of the canonical U-Net and is less likely to suffer Shiyao Jin. A composite model of wound segmentation based on traditional methods and deep neural networks. Computational poor performance due to the introduction of SCS. intelligence and neuroscience, 2018, 2018. doi:10.1155/ Another avenue of exploration would be a more robust ablation 2018/4149103. study on some of the hyperparameters of the feature extractors [Pis22] Raphael Pisonir. Sharpened cosine distance as an alternative for convolutions, Jan 2022. URL: https://www.rpisoni.dev. used. While most of the hyperparameters were chosen based on [QZD 15] Shannon P Quinn, Maliha J Zahid, John R Durkin, Richard J + either canonical choices [XXL+ 16] or through empirical study Francis, Cecilia W Lo, and S Chakra Chennubhotla. Auto- (e.g. ε for ZCA whitening), a more comprehensive hyperparameter mated identification of abnormal respiratory ciliary motion in search would be worth consideration. This would be especially nasal biopsies. Science translational medicine, 7(299):299ra124 |–| 299ra124, 2015. doi:10.1126/scitranslmed. valuable for the composite model since the choice of most opti- aaa1233. mal hyperparameters is dependent on the downstream tasks and [RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- therefore may be different for the composite model than what was net: Convolutional networks for biomedical image segmentation. CoRR, 2015. doi:10.48550/arXiv.1505.04597. found for the individual models. [WMBL19] Nicholas Watters, Loïc Matthey, Christopher P. Burgess, and More robust data augmentation could additionally improve Alexander Lerchner. 
Spatial broadcast decoder: A simple archi- results. Image cropping and basic augmentation methods alone tecture for learning disentangled representations in vaes. CoRR, provided minor improvements of just the base U-Net from the 2019. doi:10.48550/arXiv.1901.07017. [XXL+ 16] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong state of the art. Regarding the cropping method, an upper threshold Wu, Jinghai Tang, and Anant Madabhushi. Stacked sparse au- for the percent of cilia per image may be worth implementing, toencoder (ssae) for nuclei detection on breast cancer histopathol- as cropped images containing over approximately 90% cilia pro- ogy images. IEEE Transactions on Medical Imaging, 35(1):119– 130, 2016. doi:10.1109/TMI.2015.2458702. duced poor results, likely due to a lack of surrounding context. [ZRS+ 20] Meekail Zain, Sonia Rao, Nathan Safir, Quinn Wyner, Isabella Additionally, rotations and lighting/contrast adjustments could Humphrey, Alex Eldridge, Chenxiao Li, BahaaEddin AlAila, further augment the data set during the training process. and Shannon Quinn. Towards an unsupervised spatiotemporal representation of cilia video using a modular generative pipeline. Re-segmenting the cilia images by hand, a planned endeavor, In Proceedings of the Python in Science Conference, 2020. will likely provide more accurate masks for the training process. doi:10.25080/majora-342d178e-017. This is an especially difficult task for the cilia dataset, as the poor lighting and focus even causes medical professionals to disagree on the exact location of cilia in certain instances. However, the re- search group associated with this paper is currently in the process of setting up a web interface for such professionals to ”vote” on segmentation masks. Additionally, it is likely worth experimenting with various thresholds for converting U-Net outputs into masks, and potentially some form of region growing to dynamically aid the process. Finally, it is possible to train the SAE and U-Net jointly as an end-to-end system. Current experimentation has foregone this path due to the additional computational and memory complexity and has instead opted for separate training to at least justify this direction of exploration. Training in an end-to-end fashion could lead to a more optimal result and potentially even an interesting latent representation of ciliary features in the image. It is worth noting that larger end-to-end systems like this tend to be more difficult to train and balance, and such architectures can fall into degenerate solutions more readily. R EFERENCES [DvBB+ 21] Cenna Doornbos, Ronald van Beek, Ernie MHF Bongers, Dorien Lugtenberg, Peter Klaren, Lisenka ELM Vissers, Ronald Roep- man, Machteld M Oud, et al. Cell-based assay for ciliopathy patients to improve accurate diagnosis using alpaca. Euro- pean Journal of Human Genetics, 29(11):1677 |–| 1689, 2021. doi:10.1038/s41431-021-00907-9. [Ish17] Takashi Ishikawa. Axoneme structure from motile cilia. Cold Spring Harbor perspectives in biology, 9(1):a028076, 2017. doi:10.1101/cshperspect.a028076. [LjWD19] Hui Li, Xiao jun Wu, and Tariq S. Durrani. Infrared and visible image fusion with resnet and zero-phase component analysis. In- frared Physics & Technology, 102:103039, 2019. doi:https: //doi.org/10.1016/j.infrared.2019.103039. [LMZ+ 18] Charles Lu, M. Marx, M. Zahid, C. W. Lo, Chakra Chennubhotla, and Shannon P. Quinn. Stacked neural networks for end-to- end ciliary motion analysis. CoRR, 2018. doi:10.48550/ arXiv.1803.07534.