EPJ Web of Conferences 173, 06007 (2018)                                 https://doi.org/10.1051/epjconf/201817306007
Mathematical Modeling and Computational Physics 2017




 Nonlinear Wave Simulation on the Xeon Phi Knights Landing
 Processor

 Ivan Hristov1,2,*, Goran Goranov3,**, and Radoslava Hristova1,2,***
 1 JINR, LIT, Dubna, Russia
 2 Sofia University, Bulgaria
 3 Technical University of Gabrovo, Bulgaria


             Abstract. We consider a standing wave simulation, interesting from a computational
             point of view, obtained by solving coupled 2D perturbed Sine-Gordon equations. We
             make an OpenMP realization which exploits both the thread and SIMD levels of
             parallelism. We test the OpenMP program on two energy-equivalent Intel architectures:
             2× Xeon E5-2695 v2 processors (code-named “Ivy Bridge-EP”) in the HybriLIT cluster,
             and a Xeon Phi 7250 processor (code-named “Knights Landing”, KNL). The results show
             2 times better performance on the KNL processor.



 1 Introduction
 The second generation Intel Xeon Phi processors, code-named Knights Landing (KNL), are ex-
 pected to deliver better performance than general-purpose CPUs such as the Intel Xeon processors
 for applications with both a high degree of parallelism and well-behaved communications with memory
 [1]. Compute-bound applications run better on KNL due to its larger (512-bit) vector registers.
 Bandwidth-bound applications also run better on KNL due to its high bandwidth memory (HBM).
     In this work we consider an example, interesting from a computational point of view, of the
 numerical solution of coupled 2D perturbed Sine-Gordon equations. Serious computational resources
 are really needed, because in some cases the computational domain may be very large – 10^6–10^8 mesh
 points – and very long time integration is also required – 10^8–10^9 time steps. Usually applications
 with stencil operations (like those in the presented work) are bandwidth-bound. The calculation of the
 transcendental sine function, however, makes our application closer to the compute-bound case,
 and hence it benefits both from using HBM and from vectorization.
     The considered systems of coupled 2D perturbed Sine-Gordon equations are of practical interest
 because it is well known that they model the dynamics of the so-called intrinsic Josephson junctions
 (IJJs) [2].
     The goals of the work are:
 • To make an OpenMP realization of a finite difference scheme for solving systems of 2D perturbed
   Sine-Gordon equations. We want this realization to exploit both the thread and SIMD levels of
   parallelism.
   * e-mail: christov_ivan@abv.bg
  ** e-mail: ph.d.g.goranov@gmail.com
 *** e-mail: radoslava@fmi.uni-sofia.bg




© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).



 • To test the OpenMP program on two different energy-equivalent Intel architectures:
   2× Xeon E5-2695 v2 processors with 24 cores and 48 threads (code-named “Ivy Bridge-EP”) in the
   HybriLIT cluster, and a Xeon Phi 7250 processor with 68 cores and 272 threads (code-named
   “Knights Landing”, KNL).


 2 Mathematical model and numerical scheme
 We consider the following systems of 2D perturbed Sine-Gordon equations:

                           S (ϕtt + αϕt + sin ϕ − γ) = ∆ϕ,      (x, y) ∈ Ω ⊂ R².                      (1)

 Here ∆ is the 2D Laplace operator, Ω is a given domain in R², and S is the Neq × Neq cyclic
 tridiagonal matrix

     S = \begin{pmatrix}
           1 & s & 0 & \cdots & 0 & s \\
           s & 1 & s & 0 & \cdots & 0 \\
             & \ddots & \ddots & \ddots & \ddots & \\
           0 & \cdots & 0 & s & 1 & s \\
           s & 0 & \cdots & 0 & s & 1
         \end{pmatrix},
 where −0.5 < s ≤ 0 and Neq is the number of equations. The unknown is the column vector
 ϕ(x, y, t) = (ϕ1, . . . , ϕNeq)^T. Neumann boundary conditions are considered:
                                                       
                             \left. \frac{\partial \varphi}{\partial n} \right|_{\partial \Omega} = 0.                                     (2)
 In (2), n denotes the exterior normal to the boundary ∂Ω. To close the problem (1)–(2), appropriate
 initial conditions are posed. The model (1)–(2) describes very well the dynamics of Neq periodically
 stacked IJJs [2]. The parameter s represents the inductive coupling between adjacent Josephson
 junctions, α is the dissipation parameter, and γ is the external current. All the units are normalized
 as in [2].
      We follow the approach for constructing the numerical scheme from [3]. We solve the problem
 numerically in rectangular domains by using second-order central finite differences for all of the
 derivatives. As a result, at every mesh point of the domain we have to solve a linear system with the
 cyclic tridiagonal matrix S. Because of the specific tridiagonal structure of S, only (9Neq − 12)
 floating point operations are needed to solve one system. The algorithm complexity is therefore
 O(Neq · Nx · Ny · Ntime_steps), where Neq is the number of equations, Nx is the number of mesh points
 in the x-direction, Ny is the number of mesh points in the y-direction, and Ntime_steps is the number
 of steps in time.
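      For concreteness, the resulting time-stepping relation can be sketched as follows (our notation,
 assuming a uniform time step τ and denoting by ∆h the standard 5-point discrete Laplacian; this is a
 sketch, not necessarily the exact formulation of [3]):

     S\left(\frac{\varphi^{n+1}-2\varphi^{n}+\varphi^{n-1}}{\tau^{2}}
           + \alpha\,\frac{\varphi^{n+1}-\varphi^{n-1}}{2\tau}
           + \sin\varphi^{n} - \gamma\right) = \Delta_{h}\varphi^{n},

 so at every mesh point the new time level ϕ^{n+1} satisfies a linear system whose matrix is S up to
 the scalar factor (1/τ² + α/(2τ)):

     \left(\frac{1}{\tau^{2}}+\frac{\alpha}{2\tau}\right) S\,\varphi^{n+1}
           = \Delta_{h}\varphi^{n}
           + S\left(\frac{2\varphi^{n}-\varphi^{n-1}}{\tau^{2}}
           + \frac{\alpha}{2\tau}\,\varphi^{n-1} - \sin\varphi^{n} + \gamma\right).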


 3 Parallelization strategy and performance scalability results
 An OpenMP Fortran realization of the above numerical scheme was made. We store the unknown
 solution at two consecutive time levels in two multidimensional arrays U1(Neq, Nx, Ny) and
 U2(Neq, Nx, Ny). To ensure good data locality, the main loop over the indexes of U1 and U2 follows
 the column-major order of multidimensional arrays in the Fortran language.
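     As a minimal illustration, the corresponding declarations might look as follows (our sketch; only
 the names U1, U2 and the index order come from the text, and the concrete sizes are those of the test
 runs reported below):

     ! Two consecutive time levels of the solution.  In Fortran the
     ! first (junction) index varies fastest in memory, so sweeping
     ! the mesh with the junction loops innermost walks contiguous data.
     INTEGER, PARAMETER :: NEQ = 8, NX = 4096, NY = 4096
     REAL(8), ALLOCATABLE :: U1(:,:,:), U2(:,:,:)
     ALLOCATE (U1(NEQ,NX,NY), U2(NEQ,NX,NY))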
     To utilize the computational capabilities of the considered processors we exploit two levels of
 parallelism: SIMD parallelism (vectorization) at the inner level and thread parallelism at the outer
 level. The smallest piece of work which we consider consists of solving 8 successive linear systems
 with the cyclic tridiagonal matrix S. Such pieces of work are distributed among the OpenMP threads.
 Both the calculation of the right-hand sides for each linear system and the solution of the 8 linear systems at once
 are vectorized. Our parallelization strategy consists of parallelizing the main nested DO loop by
 using the OMP DO directive [4]:
    !$OMP DO [ Clauses ]
        DO I = 1, Ny
           DO J = 1, Nx, 8
              DO K = J, J+7
                 ! .................
                 ! Vectorized loops for calculation of the right-hand sides
                 ! of 8 successive linear systems
                 ! .................
              ENDDO
              ! ...................
              ! Vectorized loops for solving 8 systems at once
              ! ...................
           ENDDO
        ENDDO
    !$OMP END DO
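      To make the solve step concrete, the following is a minimal sketch (our reconstruction, not the
 authors' actual code) of solving 8 cyclic tridiagonal systems S·X = R at once. It uses the
 Sherman–Morrison reduction of the cyclic system to an ordinary tridiagonal (Thomas) solve and is
 not necessarily the exact (9Neq − 12)-operation algorithm of [3]; the point is that every loop over
 the system index K has length 8 and unit stride, and therefore vectorizes:

    ! Solve 8 systems S*X = R at once (assumes NEQ >= 3).  S has unit
    ! diagonal; off-diagonal and corner entries equal CS = s.
    SUBROUTINE SOLVE8(CS, NEQ, R, X)
       INTEGER, INTENT(IN)  :: NEQ
       REAL(8), INTENT(IN)  :: CS, R(8,NEQ)
       REAL(8), INTENT(OUT) :: X(8,NEQ)
       REAL(8) :: BB(NEQ), EM(NEQ), Z(NEQ), FACT(8)
       REAL(8), PARAMETER :: GAM = -1.0D0
       INTEGER :: I, K
       ! Sherman-Morrison: fold the two corner entries into modified
       ! first and last diagonal entries of a non-cyclic matrix B.
       BB(1) = 1.0D0 - GAM
       DO I = 2, NEQ-1
          BB(I) = 1.0D0
       ENDDO
       BB(NEQ) = 1.0D0 - CS*CS/GAM
       ! Forward elimination coefficients of the Thomas algorithm
       ! (identical for all 8 systems, so computed only once).
       DO I = 2, NEQ
          EM(I) = CS/BB(I-1)
          BB(I) = BB(I) - EM(I)*CS
       ENDDO
       ! Auxiliary scalar system B*Z = U with U = (GAM, 0, ..., 0, CS).
       Z(1) = GAM
       DO I = 2, NEQ-1
          Z(I) = -EM(I)*Z(I-1)
       ENDDO
       Z(NEQ) = (CS - EM(NEQ)*Z(NEQ-1))/BB(NEQ)
       DO I = NEQ-1, 1, -1
          Z(I) = (Z(I) - CS*Z(I+1))/BB(I)
       ENDDO
       ! Thomas sweeps for the 8 right-hand sides; K is innermost and
       ! unit stride, so these loops vectorize.
       DO K = 1, 8
          X(K,1) = R(K,1)
       ENDDO
       DO I = 2, NEQ
          DO K = 1, 8
             X(K,I) = R(K,I) - EM(I)*X(K,I-1)
          ENDDO
       ENDDO
       DO K = 1, 8
          X(K,NEQ) = X(K,NEQ)/BB(NEQ)
       ENDDO
       DO I = NEQ-1, 1, -1
          DO K = 1, 8
             X(K,I) = (X(K,I) - CS*X(K,I+1))/BB(I)
          ENDDO
       ENDDO
       ! Sherman-Morrison correction: X := X - FACT*Z.
       DO K = 1, 8
          FACT(K) = (X(K,1) - CS*X(K,NEQ))/(1.0D0 + Z(1) - CS*Z(NEQ))
       ENDDO
       DO I = 1, NEQ
          DO K = 1, 8
             X(K,I) = X(K,I) - FACT(K)*Z(I)
          ENDDO
       ENDDO
    END SUBROUTINE SOLVE8

 Since the matrix S is the same at every mesh point, in the real code the coefficients EM, BB and the
 auxiliary vector Z would be precomputed once, outside the main loop.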
    The use of the -O2 optimization flag for compiling ensures automatic vectorization of the innermost
 loops. Thus the indexes of the outermost loop are distributed among the OpenMP threads, and
 all the innermost loops (all of length 8) are vectorized. Let us mention that 8 double precision words
 correspond to the length of one vector register (512 bit) in KNL processors and to two vector registers
 (256 bit each) in Ivy Bridge-EP processors. As a result we achieve about 2 times better performance
 from vectorization on “Ivy Bridge-EP” and about 4 times better performance from vectorization on
 KNL processors. Achieving 50% effectiveness from vectorization can be explained by the fact
 that our application lies somewhere between the bandwidth-bound and compute-bound cases. On the
 one hand, the application is of stencil type, which is as a rule bandwidth-bound. On the other hand,
 a calculation of the transcendental sine function is needed at every mesh point of the computational
 domain, which makes our application closer to the compute-bound case.
      In Figure 1 the computational domain size is Neq × Nx × Ny = 8 × 4096 × 4096. As seen
 from the figure, we achieve good performance scalability on both architectures and 2 times better
 performance on the KNL processor.


 4 Numerical example of a nonlinear standing wave
 To check that the realized OpenMP program really works, we reproduced the numerical results (in the
 2D case) from the classical works [5, 6]. As explained in these papers, the powerful THz radiation from
 IJJs reported in [7] corresponds to a new type of standing wave solutions with excited so-called cavity
 modes. For certain parameters α and γ, the phase ϕ(x, y, t) in a particular equation (junction) is a sum
 of three terms: a term linear with respect to time, vt, a static π-kink term pi_kink(x, y), and an
 oscillating term:
                          ϕ(x, y, t) = vt + pi_kink(x, y) + oscillating_term(x, y, t).
 The oscillating term is approximately a solution of the linear wave equation

                \varphi_{tt} = \frac{1}{1+2s}\,\Delta\varphi, \qquad (x, y) \in \Omega \subset \mathbb{R}^2,
 with Neumann boundary conditions.
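      For a rectangular domain [0, Lx] × [0, Ly] (our notation) the Neumann eigenmodes of this
 equation are the cavity modes

     \varphi_{km}(x, y, t) \propto \cos\frac{k\pi x}{L_x}\,\cos\frac{m\pi y}{L_y}\,\cos(\omega_{km} t),
     \qquad
     \omega_{km}^{2} = \frac{1}{1+2s}\left(\frac{k^{2}\pi^{2}}{L_{x}^{2}} + \frac{m^{2}\pi^{2}}{L_{y}^{2}}\right),

 and in the scenario of [5, 6] strong radiation appears when the frequency v of the oscillating term is
 close to one of the resonance frequencies ω_km.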







 Figure 1. Performance scalability as speedup relative to one vectorized thread on 2 × “Ivy Bridge-EP” processors



     In contrast to the oscillating term, which is the same for each junction (equation), the static terms
 (in our case static π kinks) have an alternating character, i.e. opposite static π kinks alternate in odd
 and even junctions (equations).


 5 Conclusions
 An OpenMP program for solving systems of 2D perturbed Sine-Gordon equations was realized and
 tested on two different Intel architectures: one computational node in the HybriLIT cluster, consisting
 of two Ivy Bridge-EP processors, and a KNL processor provided by the RSC Group, Moscow. The
 results show 2 times better performance on the KNL processor.


 6 Acknowledgements
 We greatly appreciate the opportunity to use the computational resources of the HybriLIT cluster and
 the KNL processors provided by the RSC Group, Moscow. This work is supported by the National
 Science Fund of Bulgaria under grant DFNI-I02/8 and by the National Science Fund of BMSE under
 grant I02/9/2014.


 References
 [1] J. Jeffers, J. Reinders, and A. Sodani, Intel Xeon Phi Processor High Performance Programming:
     Knights Landing Edition (Morgan Kaufmann, 2016)
 [2] S. Sakai, P. Bodin, and N.F. Pedersen, Journal of Applied Physics 73 (5), 2411–2418 (1993)
 [3] G.S. Kazacha and S.I. Serdyukova, Zhurnal Vychislitelnoi Matematiki i Matematicheskoi Fiziki
     33 (3), 417–427 (1993)
 [4] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel
     Programming (MIT Press, 2008)
 [5] S. Lin and X. Hu, Physical Review Letters 100 (24), p. 247006 (2008)
 [6] A.E. Koshelev, Physical Review B 78 (17), p. 174509 (2008)
 [7] L. Ozyuzer, A.E. Koshelev, et al., Science 318 (5854), 1291–1293 (2007)


