EPJ Web of Conferences 214, 03022 (2019)                                 https://doi.org/10.1051/epjconf/201921403022
CHEP 2018




      Performance of the AMS Offline Software at National Energy Research Scientific
      Computing Centre and Argonne Leadership Computing Facility

      Vitali Choutko1 , Alexander Egorov1 , Alexandre Eline1 , and Baosong Shan2,∗
      1 Massachusetts Institute of Technology, Laboratory for Nuclear Science, MA-02139, United States
      2 Beihang University, School of Mathematics and System Science, Beijing 100191, China

                    Abstract. The Alpha Magnetic Spectrometer [1] (AMS) is a high energy
                    physics experiment installed and operating on board the International Space
                    Station (ISS) since May 2011 and expected to last through the year 2024 and
                    beyond. More than 50 million CPU hours have been delivered for AMS Monte
                    Carlo simulations using the NERSC and ALCF facilities in 2017. The details
                    of porting the AMS software to the 2nd Generation Intel Xeon Phi Knights
                    Landing architecture are discussed, including the MPI emulation module that
                    allows the AMS offline software to be run as multiple-node batch jobs. The
                    performance of the AMS simulation software at the NERSC Cori (KNL 7250),
                    ALCF Theta (KNL 7230), and Mira (IBM BG/Q) farms is also discussed.



      1 Introduction
      1.1 Intel Xeon Phi Knights Landing architecture
      The Intel Xeon Phi Knights Landing architecture is described in detail in Ref. [2]. Knights
      Landing is the second generation Many Integrated Core (MIC) architecture product of Intel.
      It is available in two forms, as a coprocessor or a host processor (CPU), based on Intel's 14
      nm process technology, and it also includes integrated on-package memory for significantly
      higher memory bandwidth.
           Knights Landing contains up to 72 Airmont (Atom) cores with four-way hyper-threading,
      supporting up to 384 GB of “far” DDR4 2133 RAM and 8–16 GB of stacked “near” 3D
      MCDRAM [3]. Each core has two 512-bit vector units and supports AVX-512 SIMD instructions.
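           As a quick sanity check of these hardware features, the number of hardware threads and
      the AVX-512 support can be read from /proc/cpuinfo on a KNL node. The short Python sketch
      below uses only standard Linux interfaces and is not part of the AMS offline software.

      # Quick check of a KNL node: count hardware threads and confirm AVX-512 support.
      # Uses only the standard Linux /proc/cpuinfo interface; illustrative only.
      import re

      def cpu_summary(cpuinfo_path="/proc/cpuinfo"):
          with open(cpuinfo_path) as f:
              text = f.read()
          threads = len(re.findall(r"^processor\s*:", text, re.MULTILINE))
          match = re.search(r"^flags\s*:\s*(.*)$", text, re.MULTILINE)
          flags = set(match.group(1).split()) if match else set()
          return {
              "hardware_threads": threads,      # 272 on a 68-core KNL with 4-way SMT
              "avx512f": "avx512f" in flags,    # AVX-512 foundation instructions
              "avx512er": "avx512er" in flags,  # KNL-specific exponential/reciprocal
          }

      if __name__ == "__main__":
          print(cpu_summary())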

      1.2 National Energy Research Scientific Computing Centre
      The National Energy Research Scientific Computing Centre (NERSC) [4] is a high perfor-
      mance computing facility operated by Lawrence Berkeley National Laboratory for the United
      States Department of Energy Office of Science. As the mission computing centre for the Of-
      fice of Science, NERSC houses high performance computing and data systems used by 7,000
      scientists at national laboratories and universities around the country. NERSC is located on
      the main Berkeley Lab campus in Berkeley, California.
          NERSC installed the second phase of its supercomputing system “Cori” with 9,668 com-
      pute nodes based on Knights Landing architecture in the second half of 2016. It features:
           ∗ e-mail: baosong.shan@cern.ch




© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

      • Each node contains an Intel Xeon Phi Processor 7250 @ 1.40GHz.
      • 68 cores per node with support for 4 hardware threads each (272 threads total).
      • 96 GB DDR4 2400 MHz memory per node using six 16 GB DIMMs (115.2 GB/s peak
        bandwidth). The total aggregated memory (combined with MCDRAM) is 1 PB.
      • 16 GB of on-package, high-bandwidth MCDRAM, with bandwidth projected to be 5x that
        of DDR4 DRAM (>460 GB/s), over 5x the energy efficiency of GDDR5, and over 3x the
        density of GDDR5.
          After the upgrade, Cori was ranked 5th on the TOP500 list of the world's fastest
      supercomputers in November 2016 [5].

      1.3 Argonne Leadership Computing Facility (ALCF)

      Argonne National Laboratory is a science and engineering research national laboratory
      operated by UChicago Argonne, LLC for the United States Department of Energy, located
      near Lemont, Illinois, outside Chicago. The Argonne Leadership Computing Facility, part
      of Argonne National Laboratory, is a national scientific user facility that provides
      supercomputing resources, including computing time, data storage, and expertise, to the
      scientific and engineering community in order to accelerate the pace of discovery and
      innovation in a broad range of disciplines.
          In 2017, the installation of the Theta supercomputing system was completed and it
      entered production mode. Theta comprises 3,624 Knights Landing nodes:
      • Each node contains an Intel Xeon Phi Processor 7230 @ 1.30GHz.
      • 64 cores per node with support for 4 hardware threads each (256 threads total).
      • 192 GB DDR4 and 16 GB MCDRAM memory per node.
      • 128 GB SSD per node.


      2 AMS Offline Software Practices in NERSC and ALCF
      AMS uses the NERSC (Edison and Cori) and ALCF (Theta and Mira) facilities for simulation.

      2.1 Software porting

      Mira is based on the IBM Blue Gene/Q architecture, and thanks to our experience [6] with
      JuQueen [7], we were able to run the ported binaries without any issues.
          The Knights Landing (KNL) architecture brings an important improvement for end users
      compared with its predecessor, Knights Corner (KNC): the KNL is a self-booting, standalone
      processor and is binary compatible with the standard Xeon instruction set, which means that
      it can run legacy software, compilers, tools and profilers without recompilation. This feature
      saved us from building another separate distribution of our offline software.

      2.2 Time Divided Variables deployment

      The CernVM File System (CVMFS) [8] was not available at either facility. It has very
      recently started to be deployed at NERSC, but our repository (/cvmfs/ams.cern.ch) has not
      yet been included. At NERSC we build a Docker image to provide our database of Time
      Divided Variables, in order to achieve the best performance when many MPI jobs start
      simultaneously. At ALCF the database is extracted to the local compute nodes before the
      actual simulation starts.
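          The ALCF-style deployment can be sketched as follows: an archive of the Time Divided
      Variables database is unpacked once to node-local storage (e.g. the node SSD) so that the
      simulation reads it locally. The paths, file names, and the AMS_TDV_PATH variable below are
      placeholders for illustration, not the actual production layout.

      # Sketch of staging the Time Divided Variables (TDV) database to node-local
      # storage before the simulation starts (ALCF-style deployment). All paths
      # and the environment variable are placeholders, not the production layout.
      import os
      import tarfile

      def stage_tdv(archive="/projects/AMS/tdv_db.tar.gz",
                    local_dir="/local/scratch/ams_tdv"):
          """Unpack the TDV archive on this node (if not already done) and return its path."""
          if not os.path.isdir(local_dir):      # skip if already staged on this node
              os.makedirs(local_dir, exist_ok=True)
              with tarfile.open(archive) as tar:
                  tar.extractall(local_dir)
          return local_dir

      if __name__ == "__main__":
          os.environ["AMS_TDV_PATH"] = stage_tdv()   # hypothetical variable read by the simulation
          print("TDV database staged at", os.environ["AMS_TDV_PATH"])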


     2.3 Job management

      As described in Ref. [9], a light-weight production platform was designed to automate the
      processes of reconstruction and simulation production in the AMS computing centres. The
      platform manages all the production stages, including job acquisition, submission, monitoring,
      validation, transferring, and optional scratching. The platform is based on scripting languages
      (Perl [10] and Python [11]) and an SQLite3 [12] database, and it is easy to deploy and
      customise according to the needs of different batch systems, storage, and transfer methods.
      This platform is used at both NERSC and ALCF.
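          A minimal sketch of this bookkeeping idea in Python, assuming a hypothetical table
      layout (the actual production schema and job identifiers differ), could look as follows:

      # Minimal sketch of the SQLite3-based job bookkeeping: each job moves through
      # the stages listed in the text. Table, column, and job names are illustrative
      # only, not the actual production schema.
      import sqlite3

      STAGES = ("acquired", "submitted", "running", "validated", "transferred", "scratched")

      def open_db(path="production.db"):
          db = sqlite3.connect(path)
          db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                            job_id   TEXT PRIMARY KEY,
                            dataset  TEXT,
                            stage    TEXT,
                            batch_id TEXT)""")
          return db

      def acquire_job(db, job_id, dataset):
          db.execute("INSERT OR IGNORE INTO jobs VALUES (?, ?, 'acquired', NULL)",
                     (job_id, dataset))
          db.commit()

      def advance(db, job_id, new_stage):
          assert new_stage in STAGES
          db.execute("UPDATE jobs SET stage = ? WHERE job_id = ?", (new_stage, job_id))
          db.commit()

      if __name__ == "__main__":
          db = open_db(":memory:")                  # in-memory database for the example
          acquire_job(db, "mc.he.000001", "He MC")  # hypothetical job identifier
          advance(db, "mc.he.000001", "submitted")
          print(db.execute("SELECT * FROM jobs").fetchall())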


     2.4 MPI emulation

      Large-scale jobs are preferred, or even required, at NERSC and ALCF. The MPI emulator [6]
      developed for the JuQueen platform is used to emulate the required features of Open MPI
      messaging.
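          The idea can be illustrated with a conceptual sketch: the few MPI quantities the offline
      software needs, such as the rank and the communicator size, are taken from the batch
      scheduler's environment instead of a real MPI library. The actual emulation module is part
      of the C++ offline software and is described in Ref. [6]; the Python version below, using
      the standard Slurm environment variables as an example, is illustrative only.

      # Conceptual sketch of MPI emulation: answer the few MPI queries the software
      # needs from batch-scheduler environment variables instead of linking a real
      # MPI library. The production module is part of the C++ offline software
      # (Ref. [6]); this Python version is illustrative only.
      import os

      def emulated_comm_rank():
          # Slurm exports SLURM_PROCID; fall back to 0 for a serial test run.
          return int(os.environ.get("SLURM_PROCID", 0))

      def emulated_comm_size():
          # Slurm exports SLURM_NTASKS; fall back to 1 for a serial test run.
          return int(os.environ.get("SLURM_NTASKS", 1))

      if __name__ == "__main__":
          rank, size = emulated_comm_rank(), emulated_comm_size()
          # Each emulated "rank" can now select its own share of the work, e.g. a
          # distinct random seed or event range, without any real message passing.
          print(f"emulated rank {rank} of {size}")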


     3 Results

      The AMS simulation software uses a memory- and startup-time-optimised GEANT-4.10
      package, which allows it to run on modern processors with a large number of cores and a
      limited amount of memory per core, such as Intel Xeon, IBM Blue Gene/Q and Intel KNL [13].
          In particular, jobs with up to 3,400 KNL nodes and 700,000 threads were successfully
      run at the NERSC Cori facility; jobs with up to 600 KNL nodes and ~100,000 threads at the
      ALCF Theta facility; and jobs with up to 4,096 PowerPC nodes and ~250,000 threads at the
      ALCF Mira facility.
          Figure 1 shows the measured performance of the AMS software on a single node of Intel
      KNL at ALCF Theta (a), Intel KNL at NERSC Cori (b), and Intel Xeon at NERSC Edison (c).
      The AMS software performance on the ALCF Mira hardware is similar to that shown in
      Figure 2 of Ref. [6].




      [Figure 1 plot: y-axes, GEANT4 He events/sec on Xeon Phi in flat MCDRAM mode (panels a
      and b) and on Xeon E5-2695 v2 2.40 GHz (panel c); x-axes, number of CPUs (threads).]
      Figure 1. The measured performance of the AMS software on Intel KNL hardware at ALCF Theta (a),
      NERSC Cori (b), and on Intel Xeon hardware at NERSC Edison (c). Linear scaling versus the number
      of threads used by the application is seen up to the number of physical cores in the processor.





          Figure 2 shows the AMS software's large-scale performance for jobs with up to 3,400
      KNL nodes and 700,000 threads at the NERSC facility. As shown, the AMS software
      performance scales well with the number of nodes and/or threads.




      [Figure 2 plot: panel (a) y-axis, initialization phase for Cori KNL with Shifter and Burst
      Buffer, seconds; panel (b) y-axis, He simulation, million events, Cori KNL, 4-hour job;
      x-axes, number of nodes (204 CPUs each).]



      Figure 2. The AMS software large-scale performance at the NERSC facility. (a) Job starting time
      using the Shifter and Burst Buffer technologies. (b) Number of events simulated as a function of the
      number of KNL nodes used.



          As the physics analysis of the AMS experiment moves to nuclei with higher mass and
      higher energy, the computing power requirement is growing. Figure 3 shows the CPU time
      spent on simulations from 2014 to 2018, and, as shown in Figure 4, NERSC and ALCF
      together contributed 38% of the total simulation CPU time for the AMS experiment in 2018.




      Figure 3. The CPU time, in CPU years, spent on AMS simulations from 2014 to 2018.








     Figure 4. The simulation CPU time (in million CPU hours) contribution of AMS computing centres in
     2018.



     4 Conclusions
      The AMS offline software has been deployed and tested at the NERSC and ALCF computing
      centres. The measured performance on Intel KNL shows linear scaling versus the number of
      threads up to the number of physical cores. Large-scale jobs requiring up to 3,400 KNL
      nodes and around 700,000 threads have been run on Cori and scale well with the number
      of nodes and threads. In 2017 and 2018 NERSC and ALCF contributed over one third of
      the total CPU hours for AMS simulation, and we expect both centres to make even larger
      contributions in the future.


      5 Acknowledgements
     This work has been completed using resources from the National Energy Research Scientific
     Computing Centre under Contract No. DE-AC02-05CH11231 and the Argonne Leadership
     Computing Facility under Contract No. DE-AC02-06CH11357.


     References
      [1] S. Ting, Nuclear Physics B-Proceedings Supplements 243, 12 (2013)
       [2] A. Sodani, R. Gramunt, J. Corbal, H.S. Kim, K. Vinod, S. Chinthamani, S. Hutsell,
           R. Agarwal, Y.C. Liu, IEEE Micro 36, 34 (2016)
       [3] Xeon Phi, https://en.wikipedia.org/wiki/Xeon_Phi
       [4] National Energy Research Scientific Computing Center, https://en.wikipedia.org/wiki/
           National_Energy_Research_Scientific_Computing_Center
       [5] H.W. Meuer, E. Strohmaier, J. Dongarra, H. Simon, M. Meuer, November 2016 TOP500
           supercomputer sites (2016)
      [6] V. Choutko, A. Egorov, B. Shan, Performance of the AMS Offline software on the IBM
          Blue Gene/Q architecture, in Journal of Physics: Conference Series (IOP Publishing,
          2017), Vol. 898, p. 072002
      [7] M. Stephan, J. Docter, Journal of large-scale research facilities JLSRF 1, 1 (2015)
       [8] C. Aguado Sanchez, J. Bloomer, P. Buncic, L. Franco, S. Klemer, P. Mato, CVMFS: a file
           system for the CernVM virtual appliance, in Proceedings of XII Advanced Computing
           and Analysis Techniques in Physics Research (2008), Vol. 1, p. 52




       [9] V. Choutko, O. Demakov, A. Egorov, A. Eline, B. Shan, R. Shi, Production Manage-
           ment System for AMS Computing Centres, in Journal of Physics: Conference Series
           (IOP Publishing, 2017), Vol. 898, p. 092034
      [10] L. Wall et al., The Perl Programming Language (1994)
      [11] G. Van Rossum et al., Python Programming Language, in USENIX Annual Technical
           Conference (2007), Vol. 41
      [12] M. Owens, G. Allen, SQLite (Springer, 2010)
      [13] V. Choutko, A. Egorov, A. Eline, B. Shan, Computing Strategy of the AMS Experiment,
           in Journal of Physics: Conference Series (IOP Publishing, 2015), Vol. 664, p. 032029



