EPJ Web of Conferences 214, 03022 (2019)    https://doi.org/10.1051/epjconf/201921403022
CHEP 2018

Performance of the AMS Offline Software at National Energy Research Scientific Computing Centre and Argonne Leadership Computing Facility

Vitali Choutko1, Alexander Egorov1, Alexandre Eline1, and Baosong Shan2,*

1 Massachusetts Institute of Technology, Laboratory for Nuclear Science, MA-02139, United States
2 Beihang University, School of Mathematics and System Science, Beijing 100191, China
* e-mail: baosong.shan@cern.ch

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

Abstract. The Alpha Magnetic Spectrometer [1] (AMS) is a high energy physics experiment installed and operating on board the International Space Station (ISS) since May 2011 and expected to operate through the year 2024 and beyond. More than 50 million CPU hours were delivered for AMS Monte Carlo simulations using the NERSC and ALCF facilities in 2017. The details of porting the AMS software to the 2nd-generation Intel Xeon Phi Knights Landing architecture are discussed, including the MPI emulation module that allows the AMS offline software to be run as multiple-node batch jobs. The performance of the AMS simulation software at the NERSC Cori (KNL 7250), ALCF Theta (KNL 7230), and Mira (IBM BG/Q) farms is also discussed.

1 Introduction

1.1 Intel Xeon Phi Knights Landing architecture

The Intel Xeon Phi Knights Landing architecture is described in detail in Ref. [2]. Knights Landing is the second-generation Many Integrated Core (MIC) architecture product of Intel. It is available in two forms, as a coprocessor or as a host processor (CPU), is based on Intel's 14 nm process technology, and includes integrated on-package memory for significantly higher memory bandwidth. Knights Landing contains up to 72 Airmont (Atom) cores with four-way hyper-threading, supporting up to 384 GB of "far" DDR4 2133 RAM and 8–16 GB of stacked "near" 3D MCDRAM [3]. Each core has two 512-bit vector units and supports AVX-512 SIMD instructions (a short illustration of their use is given at the end of Sect. 1.2).

1.2 National Energy Research Scientific Computing Centre

The National Energy Research Scientific Computing Centre (NERSC) [4] is a high performance computing facility operated by Lawrence Berkeley National Laboratory for the United States Department of Energy Office of Science. As the mission computing centre for the Office of Science, NERSC houses high performance computing and data systems used by 7,000 scientists at national laboratories and universities around the country. NERSC is located on the main Berkeley Lab campus in Berkeley, California.

NERSC installed the second phase of its supercomputing system "Cori", with 9,668 compute nodes based on the Knights Landing architecture, in the second half of 2016. It features:

• Each node contains an Intel Xeon Phi Processor 7250 @ 1.40 GHz.
• 68 cores per node with support for 4 hardware threads each (272 threads total).
• 96 GB DDR4 2400 MHz memory per node using six 16 GB DIMMs (115.2 GB/s peak bandwidth). The total aggregated memory (combined with MCDRAM) is 1 PB.
• 16 GB of on-package, high-bandwidth memory, with bandwidth projected to be 5x that of DDR4 DRAM memory (>460 GB/s), over 5x the energy efficiency of GDDR5, and over 3x the density of GDDR5.

After the upgrade, Cori was ranked 5th on the TOP500 list of the world's fastest supercomputers in November 2016 [5].
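Relating to the AVX-512 capability noted in Sect. 1.1: purely as an illustration, and not code taken from the AMS software, the minimal C++ kernel below shows what the two 512-bit vector units per core offer. A single AVX-512 fused multiply-add instruction updates eight double-precision values at a time; the kernel can be built for KNL with, e.g., g++ -O3 -march=knl or icc -O3 -xMIC-AVX512.

```cpp
// Illustrative only: a DAXPY-style update written with AVX-512 intrinsics,
// processing eight doubles per instruction on the KNL 512-bit vector units.
#include <immintrin.h>
#include <cstddef>

void daxpy_avx512(std::size_t n, double a, const double* x, double* y)
{
    const __m512d va = _mm512_set1_pd(a);      // broadcast the scalar a
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {               // 8 doubles per 512-bit register
        __m512d vx = _mm512_loadu_pd(x + i);
        __m512d vy = _mm512_loadu_pd(y + i);
        _mm512_storeu_pd(y + i, _mm512_fmadd_pd(va, vx, vy));  // y = a*x + y
    }
    for (; i < n; ++i)                         // scalar tail for the remainder
        y[i] = a * x[i] + y[i];
}
```

In practice such code is usually produced by the compiler's auto-vectoriser rather than written by hand; the point is simply that one FMA instruction performs 16 double-precision floating-point operations on a full 512-bit register.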
1.3 Argonne Leadership Computing Facility (ALCF)

Argonne National Laboratory is a scientific and engineering research national laboratory operated by UChicago Argonne, LLC for the United States Department of Energy, located near Lemont, Illinois, outside Chicago. The Argonne Leadership Computing Facility, part of Argonne National Laboratory, is a national scientific user facility that provides supercomputing resources (computing time, data storage, and expertise) to the scientific and engineering community in order to accelerate the pace of discovery and innovation in a broad range of disciplines.

In 2017 the installation of the Theta supercomputing system was completed and it entered production mode. Theta comprises 3,624 Knights Landing nodes:

• Each node contains an Intel Xeon Phi Processor 7230 @ 1.30 GHz.
• 64 cores per node with support for 4 hardware threads each (256 threads total).
• 192 GB DDR4 and 16 GB MCDRAM memory per node.
• 128 GB SSD per node.

2 AMS Offline Software Practices at NERSC and ALCF

AMS uses the NERSC (Edison and Cori) and ALCF (Theta and Mira) facilities for simulation.

2.1 Software porting

Mira is based on the IBM Blue Gene/Q architecture, and thanks to our experience [6] with JuQueen [7], we were able to run the ported binaries without any issue.

The Knights Landing (KNL) architecture brings an important improvement for end users compared with its predecessor, Knights Corner (KNC): KNL is a self-booting, standalone processor that is binary compatible with the standard Xeon instruction set, which means it can run legacy software, compilers, tools, and profilers without recompilation. This feature saved us from building another separate distribution of our offline software.

2.2 Time Divided Variables deployment

The CernVM File System (CVMFS) [8] was not available at either facility. It has very recently started to be deployed at NERSC, but our repository (/cvmfs/ams.cern.ch) is not yet included. At NERSC we therefore build a Docker image providing our database of Time Divided Variables, which gives the best performance when many MPI jobs start simultaneously. At ALCF the database is extracted to the local compute nodes before the actual simulation starts.

2.3 Job management

As described in Ref. [9], a lightweight production platform was designed to automate the reconstruction and simulation production processes in the AMS computing centres. The platform manages all production stages, including job acquisition, submission, monitoring, validation, transfer, and optional scratching. It is based on scripting languages (Perl [10] and Python [11]) and an sqlite3 [12] database, and it is easy to deploy and customise according to the needs of different batch systems, storage, and transfer methods. This platform is used at both NERSC and ALCF (a minimal sketch of the underlying job bookkeeping is given at the end of Sect. 2).

2.4 MPI emulation

Large-scale jobs are the preferred, and in some cases the only allowed, type of job at NERSC and ALCF. The MPI emulator [6] developed for the JuQueen platform is used to emulate the required features of Open MPI messaging.
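For the MPI emulation of Sect. 2.4, the paper does not spell out the emulator's interface, so the following is only a minimal sketch of the idea and not the actual AMS module: the handful of MPI calls the offline software needs (initialisation, rank, and size) can be provided without linking a real MPI library, by reading the task identity that the batch scheduler already exports. The environment variable names used here (SLURM_PROCID, SLURM_NTASKS) are those set by the Slurm scheduler at NERSC Cori and are an assumption of this sketch; other schedulers export equivalent variables.

```cpp
// Minimal sketch of an MPI emulation layer (illustrative, not the AMS module).
// Rank and size are taken from scheduler-provided environment variables, which
// is enough for each job instance to pick its own event range and output file
// names without a real MPI installation.
#include <cstdlib>

namespace mpiemu {

inline int getenv_int(const char* name, int fallback)
{
    const char* v = std::getenv(name);
    return v ? std::atoi(v) : fallback;
}

inline int Init(int*, char***)             { return 0; }   // nothing to start up
inline int Finalize()                      { return 0; }   // nothing to tear down
inline int Comm_rank(int /*comm*/, int* r) { *r = getenv_int("SLURM_PROCID", 0); return 0; }
inline int Comm_size(int /*comm*/, int* s) { *s = getenv_int("SLURM_NTASKS", 1); return 0; }

} // namespace mpiemu
```

Anything beyond rank and size discovery, such as true point-to-point messaging, is outside this sketch; the actual emulator of Ref. [6] implements whatever subset of Open MPI messaging the offline software requires.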
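Returning to the production platform of Sect. 2.3: the platform itself is implemented in Perl and Python on top of an sqlite3 database. Purely to illustrate the kind of job-state bookkeeping such a database enables, here is a small, self-contained C++ sketch using the sqlite3 C API; the table and column names are invented for this example and are not taken from the platform.

```cpp
// Illustration only: tracking production-job states in an sqlite3 database,
// in the spirit of the platform of Sect. 2.3 (schema invented for the example).
// Build with: g++ track_jobs.cpp -lsqlite3
#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("production.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "cannot open DB: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    const char* schema =
        "CREATE TABLE IF NOT EXISTS jobs ("
        "  id     INTEGER PRIMARY KEY,"
        "  site   TEXT,"   // e.g. 'NERSC' or 'ALCF'
        "  status TEXT);"; // acquired / submitted / running / validated / transferred
    char* err = nullptr;
    if (sqlite3_exec(db, schema, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "schema error: %s\n", err);
        sqlite3_free(err);
    }

    // Record a newly acquired job and later promote it through the stages.
    sqlite3_exec(db, "INSERT INTO jobs(site, status) VALUES('NERSC', 'acquired');",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "UPDATE jobs SET status = 'submitted' WHERE status = 'acquired';",
                 nullptr, nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}
```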
3 Results

The AMS simulation software uses a memory and start-up time optimised GEANT-4.10 package, which allows it to run on modern processors with a large number of cores and a limited amount of memory per core, such as Intel Xeon, IBM Blue Gene/Q, and Intel KNL [13].

In particular, jobs with up to 3,400 KNL nodes and 700,000 threads were successfully run at the NERSC Cori facility; jobs with up to 600 KNL nodes and about 100,000 threads at the ALCF Theta facility; and jobs with up to 4,096 PowerPC nodes and about 250,000 threads at the ALCF Mira facility.

Figure 1 shows the measured performance of the AMS software on a single node of Intel KNL at the ALCF Theta (a) and NERSC Cori (b) facilities, and of Intel Xeon at the NERSC Edison (c) facility. The AMS software performance on the ALCF Mira hardware is similar to that shown in Figure 2 of Ref. [6].

Figure 1. The measured performance of the AMS software on Intel KNL hardware at ALCF Theta (a) and NERSC Cori (b), and on Intel Xeon hardware at NERSC Edison (c). Each panel shows the GEANT4 He event rate (events/sec) as a function of the number of CPUs (threads) used; the KNL nodes were run in flat memory mode, and the Edison node is a Xeon E5-2695 v2 @ 2.40 GHz. Linear scaling versus the number of threads used in the application is seen up to the number of physical cores in the processors.

Figure 2 shows the AMS software's large-scale performance for jobs with up to 3,400 KNL nodes and 700,000 threads at the NERSC facility. As shown, the AMS software performance scales well with the number of nodes and/or threads.

Figure 2. The AMS software large-scale performance at the NERSC facility: (a) job starting time (in seconds) using the Shifter and Burst Buffer technologies, and (b) the number of He events simulated (in millions) in a 4-hour job, both as a function of the number of KNL nodes (204 CPUs each) used.

As the physics analysis of the AMS experiment moves to nuclei with higher mass and higher energy, the computing power requirements grow. Figure 3 shows the CPU time spent on simulations from 2014 to 2018, and as shown in Figure 4, NERSC and ALCF together contributed 38% of the total simulation CPU time for the AMS experiment in 2018.

Figure 3. The amount of CPU years spent on AMS simulations from 2014 to 2018.

Figure 4. The simulation CPU time (in million CPU hours) contribution of the AMS computing centres in 2018.

4 Conclusions

The AMS offline software has been deployed and tested at the NERSC and ALCF computing centres. The measured performance on Intel KNL shows linear scaling versus the number of threads up to the number of physical cores. Large-scale jobs requiring up to 3,400 KNL nodes and around 700,000 threads have been run on Cori and scale well with the number of nodes and threads. In 2017 and 2018 NERSC and ALCF contributed over one third of the total CPU hours for AMS simulation, and we expect both centres to make further contributions in the future.

5 Acknowledgements

This work has been completed using resources from the National Energy Research Scientific Computing Centre under Contract No. DE-AC02-05CH11231 and the Argonne Leadership Computing Facility under Contract No. DE-AC02-06CH11357.

References

[1] S. Ting, Nuclear Physics B - Proceedings Supplements 243, 12 (2013)
[2] A. Sodani, R. Gramunt, J. Corbal, H.S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, Y.C. Liu, IEEE Micro 36, 34 (2016)
[3] Xeon Phi, https://en.wikipedia.org/wiki/Xeon_Phi
[4] National Energy Research Scientific Computing Center, https://en.wikipedia.org/wiki/National_Energy_Research_Scientific_Computing_Center
[5] H.W. Meuer, E. Strohmaier, J. Dongarra, H. Simon, M. Meuer, November 2016 TOP500 supercomputer sites (2016)
[6] V. Choutko, A. Egorov, B. Shan, Performance of the AMS Offline software on the IBM Blue Gene/Q architecture, in Journal of Physics: Conference Series (IOP Publishing, 2017), Vol. 898, p. 072002
[7] M. Stephan, J. Docter, Journal of large-scale research facilities JLSRF 1, 1 (2015)
[8] C. Aguado Sanchez, J. Bloomer, P. Buncic, L. Franco, S. Klemer, P. Mato, CVMFS - a file system for the CernVM virtual appliance, in Proceedings of XII Advanced Computing and Analysis Techniques in Physics Research (2008), Vol. 1, p. 52
[9] V. Choutko, O. Demakov, A. Egorov, A. Eline, B. Shan, R. Shi, Production Management System for AMS Computing Centres, in Journal of Physics: Conference Series (IOP Publishing, 2017), Vol. 898, p. 092034
[10] L. Wall et al., The Perl programming language (1994)
[11] G. Van Rossum et al., Python Programming Language, in USENIX Annual Technical Conference (2007), Vol. 41
[12] M. Owens, G. Allen, SQLite (Springer, 2010)
[13] V. Choutko, A. Egorov, A. Eline, B. Shan, Computing Strategy of the AMS Experiment, in Journal of Physics: Conference Series (IOP Publishing, 2015), Vol. 664, p. 032029