EPJ Web of Conferences 173, 06007 (2018)                                 https://doi.org/10.1051/epjconf/201817306007
Mathematical Modeling and Computational Physics 2017




 Nonlinear Wave Simulation on the Xeon Phi Knights Landing
 Processor

 Ivan Hristov1,2,*, Goran Goranov3,**, and Radoslava Hristova1,2,***
 1 JINR, LIT, Dubna, Russia
 2 Sofia University, Bulgaria
 3 Technical University of Gabrovo, Bulgaria


             Abstract. We consider a standing wave simulation, interesting from a computational
             point of view, obtained by solving coupled 2D perturbed Sine-Gordon equations. We
             make an OpenMP realization which exploits both the thread and SIMD levels of
             parallelism. We test the OpenMP program on two energy-equivalent Intel architectures:
             2× Xeon E5-2695 v2 processors (code-named “Ivy Bridge-EP”) in the HybriLIT cluster,
             and a Xeon Phi 7250 processor (code-named “Knights Landing”, KNL). The results show
             2 times better performance on the KNL processor.



 1 Introduction
 The second generation Intel Xeon Phi processors, code-named Knights Landing (KNL), are ex-
 pected to deliver better performance than general-purpose CPUs such as the Intel Xeon processors
 for applications with both a high degree of parallelism and well-behaved communications with memory
 [1]. Compute-bound applications run better on KNL due to its larger (512-bit) vector registers.
 Bandwidth-bound applications also run better on KNL due to its high bandwidth memory (HBM).
     In this work we consider an example, interesting from a computational point of view, of the
 numerical solution of coupled 2D perturbed Sine-Gordon equations. Serious computational resources
 are really needed, because in some cases the computational domain may be very large – 10^6–10^8 mesh
 points – and very long time integration is also required – 10^8–10^9 time steps. Usually applications
 with stencil operations (like those in the presented work) are bandwidth-bound. The calculation of the
 transcendental sine function, however, makes our application closer to the compute-bound case,
 and hence it benefits both from using HBM and from vectorization.
     The considered systems of coupled 2D perturbed Sine-Gordon equations are of practical interest
 because it is well known that they model the dynamics of the so-called intrinsic Josephson junctions
 (IJJs) [2].
     The goals of the work are:
 • To make an OpenMP realization of a finite difference scheme for solving systems of 2D perturbed
   Sine-Gordon equations. We want this realization to exploit both the thread and SIMD levels of
   parallelism.
   * e-mail: christov_ivan@abv.bg
  ** e-mail: ph.d.g.goranov@gmail.com
 *** e-mail: radoslava@fmi.uni-sofia.bg




© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).



 • To test the OpenMP program on two different energy-equivalent Intel architectures:
   2× Xeon E5-2695 v2 processors with 24 cores and 48 threads (code-named “Ivy Bridge-EP”) in the
   HybriLIT cluster, and a Xeon Phi 7250 processor with 68 cores and 272 threads (code-named
   “Knights Landing”, KNL).


 2 Mathematical model and numerical scheme
 We consider the following systems of 2D perturbed Sine-Gordon equations:

                           S (ϕtt + αϕt + sin ϕ − γ) = ∆ϕ,      (x, y) ∈ Ω ⊂ R².                      (1)

 Here ∆ is the 2D Laplace operator, Ω is a given domain in R², and S is the Neq × Neq cyclic
 tridiagonal matrix

     S = \begin{pmatrix}
           1 & s & 0 & \cdots & 0 & s \\
           s & 1 & s & 0 & \cdots & 0 \\
             & \ddots & \ddots & \ddots & \ddots & \\
           0 & \cdots & 0 & s & 1 & s \\
           s & 0 & \cdots & 0 & s & 1
         \end{pmatrix},
 where −0.5 < s ≤ 0 and Neq is the number of equations. The unknown is the column vector
 ϕ(x, y, t) = (ϕ1, . . . , ϕNeq)^T. Neumann boundary conditions are considered:
                                                       
                             \left. \frac{\partial \varphi}{\partial n} \right|_{\partial \Omega} = 0.                                     (2)
 In (2), n denotes the exterior normal to the boundary ∂Ω. To close the problem (1)–(2), appropriate
 initial conditions are posed. The model (1)–(2) describes very well the dynamics of Neq periodically
 stacked IJJs [2]. The parameter s represents the inductive coupling between adjacent Josephson
 junctions, α is the dissipation parameter, and γ is the external current. All the units are normalized
 as in [2].
      We follow the approach for constructing the numerical scheme from [3]. We solve the problem
 numerically in rectangular domains by using second-order central finite differences for all of the
 derivatives. As a result, at every mesh point of the domain we have to solve a linear system with the
 cyclic tridiagonal matrix S. Because of the specific tridiagonal structure of S, only (9Neq − 12)
 floating point operations are needed to solve one system. The algorithm complexity is therefore
 O(Neq · Nx · Ny · Ntime_steps), where Neq is the number of equations, Nx is the number of mesh points
 in the x-direction, Ny is the number of mesh points in the y-direction, and Ntime_steps is the number
 of steps in time.
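      For concreteness, the resulting time-stepping relation can be sketched as follows (our notation,
 assuming a uniform time step τ and denoting by ∆h the standard 5-point discrete Laplacian; this is a
 sketch, not necessarily the exact formulation of [3]):

     S\left(\frac{\varphi^{n+1}-2\varphi^{n}+\varphi^{n-1}}{\tau^{2}}
           + \alpha\,\frac{\varphi^{n+1}-\varphi^{n-1}}{2\tau}
           + \sin\varphi^{n} - \gamma\right) = \Delta_{h}\varphi^{n},

 so at every mesh point the new time level ϕ^{n+1} satisfies a linear system whose matrix is S up to
 the scalar factor (1/τ² + α/(2τ)):

     \left(\frac{1}{\tau^{2}}+\frac{\alpha}{2\tau}\right) S\,\varphi^{n+1}
           = \Delta_{h}\varphi^{n}
           + S\left(\frac{2\varphi^{n}-\varphi^{n-1}}{\tau^{2}}
           + \frac{\alpha}{2\tau}\,\varphi^{n-1} - \sin\varphi^{n} + \gamma\right).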


 3 Parallelization strategy and performance scalability results
 An OpenMP Fortran realization of the above numerical scheme was made. We store the unknown
 solution at two consecutive time levels in two multidimensional arrays U1(Neq, Nx, Ny) and
 U2(Neq, Nx, Ny). To ensure good data locality, the main loop over the indexes of U1 and U2 follows
 the column-major order of multidimensional arrays in the Fortran language.
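     As a minimal illustration, the corresponding declarations might look as follows (our sketch; only
 the names U1, U2 and the index order come from the text, and the concrete sizes are those of the test
 runs reported below):

     ! Two consecutive time levels of the solution.  In Fortran the
     ! first (junction) index varies fastest in memory, so sweeping
     ! the mesh with the junction loops innermost walks contiguous data.
     INTEGER, PARAMETER :: NEQ = 8, NX = 4096, NY = 4096
     REAL(8), ALLOCATABLE :: U1(:,:,:), U2(:,:,:)
     ALLOCATE (U1(NEQ,NX,NY), U2(NEQ,NX,NY))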
     To utilize the computational capabilities of the considered processors we exploit two levels of
 parallelism: SIMD parallelism (vectorization) at the inner level and thread parallelism at the outer
 level. The smallest piece of work which we consider consists of solving 8 successive linear systems
 with the cyclic tridiagonal matrix S. Such pieces of work are distributed among the OpenMP threads.
 Both the calculation of the right-hand sides for each linear system and the solution of the 8 linear systems at once
 are vectorized. Our parallelization strategy consists of parallelizing the main nested DO loop by
 using the OMP DO directive [4]:
    !$OMP DO [ Clauses ]
        DO I = 1, Ny
           DO J = 1, Nx, 8
              DO K = J, J+7
                 ! .................
                 ! Vectorized loops for calculation of the right-hand sides
                 ! of 8 successive linear systems
                 ! .................
              ENDDO
              ! ...................
              ! Vectorized loops for solving 8 systems at once
              ! ...................
           ENDDO
        ENDDO
    !$OMP END DO
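      To make the solve step concrete, the following is a minimal sketch (our reconstruction, not the
 authors' actual code) of solving 8 cyclic tridiagonal systems S·X = R at once. It uses the
 Sherman–Morrison reduction of the cyclic system to an ordinary tridiagonal (Thomas) solve and is
 not necessarily the exact (9Neq − 12)-operation algorithm of [3]; the point is that every loop over
 the system index K has length 8 and unit stride, and therefore vectorizes:

    ! Solve 8 systems S*X = R at once (assumes NEQ >= 3).  S has unit
    ! diagonal; off-diagonal and corner entries equal CS = s.
    SUBROUTINE SOLVE8(CS, NEQ, R, X)
       INTEGER, INTENT(IN)  :: NEQ
       REAL(8), INTENT(IN)  :: CS, R(8,NEQ)
       REAL(8), INTENT(OUT) :: X(8,NEQ)
       REAL(8) :: BB(NEQ), EM(NEQ), Z(NEQ), FACT(8)
       REAL(8), PARAMETER :: GAM = -1.0D0
       INTEGER :: I, K
       ! Sherman-Morrison: fold the two corner entries into modified
       ! first and last diagonal entries of a non-cyclic matrix B.
       BB(1) = 1.0D0 - GAM
       DO I = 2, NEQ-1
          BB(I) = 1.0D0
       ENDDO
       BB(NEQ) = 1.0D0 - CS*CS/GAM
       ! Forward elimination coefficients of the Thomas algorithm
       ! (identical for all 8 systems, so computed only once).
       DO I = 2, NEQ
          EM(I) = CS/BB(I-1)
          BB(I) = BB(I) - EM(I)*CS
       ENDDO
       ! Auxiliary scalar system B*Z = U with U = (GAM, 0, ..., 0, CS).
       Z(1) = GAM
       DO I = 2, NEQ-1
          Z(I) = -EM(I)*Z(I-1)
       ENDDO
       Z(NEQ) = (CS - EM(NEQ)*Z(NEQ-1))/BB(NEQ)
       DO I = NEQ-1, 1, -1
          Z(I) = (Z(I) - CS*Z(I+1))/BB(I)
       ENDDO
       ! Thomas sweeps for the 8 right-hand sides; K is innermost and
       ! unit stride, so these loops vectorize.
       DO K = 1, 8
          X(K,1) = R(K,1)
       ENDDO
       DO I = 2, NEQ
          DO K = 1, 8
             X(K,I) = R(K,I) - EM(I)*X(K,I-1)
          ENDDO
       ENDDO
       DO K = 1, 8
          X(K,NEQ) = X(K,NEQ)/BB(NEQ)
       ENDDO
       DO I = NEQ-1, 1, -1
          DO K = 1, 8
             X(K,I) = (X(K,I) - CS*X(K,I+1))/BB(I)
          ENDDO
       ENDDO
       ! Sherman-Morrison correction: X := X - FACT*Z.
       DO K = 1, 8
          FACT(K) = (X(K,1) - CS*X(K,NEQ))/(1.0D0 + Z(1) - CS*Z(NEQ))
       ENDDO
       DO I = 1, NEQ
          DO K = 1, 8
             X(K,I) = X(K,I) - FACT(K)*Z(I)
          ENDDO
       ENDDO
    END SUBROUTINE SOLVE8

 Since the matrix S is the same at every mesh point, in the real code the coefficients EM, BB and the
 auxiliary vector Z would be precomputed once, outside the main loop.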
    The use of the -O2 optimization flag for compiling ensures automatic vectorization of the innermost
 loops. Thus the indexes of the outermost loop are distributed among the OpenMP threads, and
 all the innermost loops (all of length 8) are vectorized. Let us mention that 8 double precision words
 correspond to the length of one vector register (512 bit) in KNL processors and to two vector registers
 (256 bit each) in Ivy Bridge-EP processors. As a result we achieve about 2 times better performance
 from vectorization on “Ivy Bridge-EP” and about 4 times better performance from vectorization on
 KNL processors. Achieving 50% effectiveness from vectorization can be explained by the fact
 that our application lies somewhere between the bandwidth-bound and compute-bound cases. On the
 one hand, the application is of stencil type, which is as a rule bandwidth-bound. On the other hand,
 a calculation of the transcendental sine function is needed at every mesh point of the computational
 domain, which makes our application closer to the compute-bound case.
      In Figure 1 the computational domain size is Neq × Nx × Ny = 8 × 4096 × 4096. As seen
 from the figure, we achieve good performance scalability on both architectures and 2 times better
 performance on the KNL processor.


 4 Numerical example of a nonlinear standing wave
 To check that the realized OpenMP program really works, we reproduced the numerical results (in the
 2D case) from the classical works [5, 6]. As explained in these papers, the powerful THz radiation from
 IJJs reported in [7] corresponds to a new type of standing wave solutions with excited so-called cavity
 modes. For certain parameters α and γ, the phase ϕ(x, y, t) in a particular equation (junction) is a sum
 of three terms: a term linear with respect to time, vt, a static π-kink term pi_kink(x, y), and an
 oscillating term:
                          ϕ(x, y, t) = vt + pi_kink(x, y) + oscillating_term(x, y, t).
 The oscillating term is approximately a solution of the linear wave equation

                \varphi_{tt} = \frac{1}{1+2s}\,\Delta\varphi, \qquad (x, y) \in \Omega \subset \mathbb{R}^2,
 with Neumann boundary conditions.
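      For a rectangular domain [0, Lx] × [0, Ly] (our notation) the Neumann eigenmodes of this
 equation are the cavity modes

     \varphi_{km}(x, y, t) \propto \cos\frac{k\pi x}{L_x}\,\cos\frac{m\pi y}{L_y}\,\cos(\omega_{km} t),
     \qquad
     \omega_{km}^{2} = \frac{1}{1+2s}\left(\frac{k^{2}\pi^{2}}{L_{x}^{2}} + \frac{m^{2}\pi^{2}}{L_{y}^{2}}\right),

 and in the scenario of [5, 6] strong radiation appears when the frequency v of the oscillating term is
 close to one of the resonance frequencies ω_km.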







 Figure 1. Performance scalability as speedup relative to one vectorized thread on 2 × “Ivy Bridge-EP” processors



     In contrast to the oscillating term, which is the same for each junction (equation), the static terms
 (in our case static π kinks) have an alternating character, i.e. opposite static π kinks alternate in odd
 and even junctions (equations).


 5 Conclusions
 An OpenMP program for solving systems of 2D perturbed Sine-Gordon equations was realized and
 tested on two different Intel architectures: one computational node in the HybriLIT cluster, consisting
 of two Ivy Bridge-EP processors, and a KNL processor provided by the RSC Group, Moscow. The
 results show 2 times better performance on the KNL processor.


 6 Acknowledgements
 We greatly appreciate the opportunity to use the computational resources of the HybriLIT cluster and
 the KNL processors provided by the RSC Group, Moscow. This work is supported by the National
 Science Fund of Bulgaria under grant DFNI-I02/8 and by the National Science Fund of BMSE under
 grant I02/9/2014.


 References
 [1] J. Jeffers, J. Reinders, and A. Sodani, Intel Xeon Phi Processor High Performance Programming:
     Knights Landing Edition (Morgan Kaufmann, 2016)
 [2] S. Sakai, P. Bodin, and N.F. Pedersen, Journal of Applied Physics 73 (5), 2411–2418 (1993)
 [3] G.S. Kazacha and S.I. Serdyukova, Zhurnal Vychislitelnoi Matematiki i Matematicheskoi Fiziki
     33 (3), 417–427 (1993)
 [4] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel
     Programming (MIT Press, 2008)
 [5] S. Lin and X. Hu, Physical Review Letters 100 (24), p. 247006 (2008)
 [6] A.E. Koshelev, Physical Review B 78 (17), p. 174509 (2008)
 [7] L. Ozyuzer, A.E. Koshelev, et al., Science 318 (5854), 1291–1293 (2007)


