Parallelizing Julia with a Non-Invasive DSL∗
Todd A. Anderson1 , Hai Liu2 , Lindsey Kuper3 , Ehsan Totoni4 ,
Jan Vitek5 , and Tatiana Shpeisman6

1       Parallel Computing Lab, Intel Labs
2       Parallel Computing Lab, Intel Labs
3       Parallel Computing Lab, Intel Labs
4       Parallel Computing Lab, Intel Labs
5       Northeastern University / Czech Technical University Prague
6       Parallel Computing Lab, Intel Labs


        Abstract
Computational scientists often prototype software using productivity languages that offer high-
level programming abstractions. When higher performance is needed, they are obliged to rewrite
their code in a lower-level efficiency language. Different solutions have been proposed to address
this trade-off between productivity and efficiency. One promising approach is to create embedded
domain-specific languages that sacrifice generality for productivity and performance, but practical
experience with DSLs points to some roadblocks preventing widespread adoption. This paper
proposes a non-invasive domain-specific language that makes as few visible changes to the host
programming model as possible. We present ParallelAccelerator, a library and compiler for high-
level, high-performance scientific computing in Julia. ParallelAccelerator’s programming model is
aligned with existing Julia programming idioms. Our compiler exposes the implicit parallelism
in high-level array-style programs and compiles them to fast, parallel native code. Programs
can also run in “library-only” mode, letting users benefit from the full Julia environment and
libraries. Our results show encouraging performance improvements with very few changes to
source code required. In particular, few to no additional type annotations are necessary.

1998 ACM Subject Classification D.1.3 Parallel Programming

Keywords and phrases parallelism, scientific computing, domain-specific languages, Julia

Digital Object Identifier 10.4230/LIPIcs.ECOOP.2017.4

Supplementary Material ECOOP Artifact Evaluation approved artifact available at
http://dx.doi.org/10.4230/DARTS.3.2.7

    1    Introduction
Computational scientists often prototype software using a productivity language [8, 19] for
scientific computing, such as MATLAB, Python with the NumPy library, R, or most recently
Julia. Productivity languages free the programmer from having to think about low-level
issues such as how to manage memory or schedule parallel threads, and they typically
come ready to perform common scientific computing tasks through extensive libraries. The
productivity-language programmer can thus work at a level of abstraction that matches
their domain expertise. However, a dilemma arises when the programmer, having produced
a prototype, wants to handle larger problem sizes. The next step is to manually port the


∗
    Prof. Vitek’s research has been supported by the European Research Council (ERC) under the European
    Union’s Horizon 2020 research and innovation program (grant agreement No 695412).
             © Todd A. Anderson, Hai Liu, Lindsey Kuper, Ehsan Totoni, Jan Vitek, and Tatiana Shpeisman;
             licensed under Creative Commons License CC-BY
31st European Conference on Object-Oriented Programming (ECOOP 2017).
Editor: Peter Müller; Article No. 4; pp. 4:1–4:29
                    Leibniz International Proceedings in Informatics
                    Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany


      code to an efficiency language like C++ and parallelize it with tools like MPI. This takes
      considerable effort and requires a different skill set. While the result can be fast, it is harder
      to maintain or experiment with.
          Ideally the productivity language could be automatically converted to efficient code and
      parallelized. Unfortunately, automatic parallelization has proved elusive, and efficient com-
      pilation of dynamic languages remains an open problem. An alternative is to embed, in
      the high-level language, a domain-specific language (DSL) specifically designed for high-
      performance scientific computing. That domain-specific language will have a restricted set
      of features, carefully designed to allow for efficient code generation and parallelization. Rep-
      resentative examples are DSLs developed using the Delite framework [6, 27, 28] for Scala,
      Copperhead [7] and DSLs developed using the SEJITS framework [8] for Python, and Accel-
      erate [9] for Haskell. Brown et al. posit a “pick 2-out-of-3” trilemma between performance,
      productivity, and generality [6]. DSLs choose performance and productivity at the cost of
      generality by targeting a particular domain. This allows implementations to make stronger
      assumptions about programmer intent and employ domain-specific optimizations.
    Practical experience with DSLs points to some roadblocks preventing widespread adop-
      tion. DSLs have a learning curve that may put off users reluctant to invest time in learning
      new technologies. DSLs have functionality cliffs; the fear of hitting a limitation late in the
project discourages some users. DSLs can lack robustness; they may have long com-
      pile times, be unable to use the host’s debugger, or place limits on supported libraries and
      platforms.
          This paper proposes a non-invasive domain-specific language that makes as few visible
      changes to the host programming model as possible. It aims to help developers parallelize
      scientific code with minimal alterations using a hybrid compiler and library approach. With
      the compiler, ParallelAccelerator provides a new parallel execution model for code that uses
      parallelizable constructs, but any program can also run single-threaded with the default
      semantics of the host language. This is also the case when the compiler encounters constructs
      that inhibit parallelization. In library mode, all features of the host language are available.
      The initial learning curve is thus small; users can start writing programs in our DSL by
      adding a single annotation. During development, the library mode allows users to sidestep
      any compilation overheads, and to retain access to all features of the host language including
      its debugger and libraries. The contribution of this paper is a design that leverages existing
      technologies and years of research in the high-performance computing community to create
      a system that works surprisingly well. The paper also explores the combination of features
      needed from a dynamic language for this to work.
          ParallelAccelerator is embedded in the Julia programming language. Julia is challenging
      to parallelize: its code is untyped, all operators are dynamically bound, and eval allows
      loading new code at any time. While some features cannot be parallelized, their presence
      in other parts of the program does not prevent us from generating efficient code. Julia
      is also interesting because it is among the fastest dynamic languages of the day. Julia’s
      LLVM-based just-in-time compiler generates efficient code, giving the language competitive
      baseline performance. Languages like MATLAB or Python have much more “fat” that can
      be trimmed.
          Julia provides high-level constructs in its standard library for scientific computing as
      well as bindings to high-performance native libraries (e.g., BLAS libraries). While many
      library calls can run in parallel, the real issue is that library calls do not compose in parallel.
      Thus porting to an efficiency language is often about making the parallelism explicit and
      manually composing constructs to achieve greater performance. ParallelAccelerator focuses
on programs written in array style, a style of programming that is widely used in scientific
languages such as MATLAB and R; it identifies the implicit parallelism in array operations
and automatically composes them. The standard Julia compiler does not optimize array-
style code.
    ParallelAccelerator provides the @acc annotation, short for “accelerate”, which instructs
the compiler to focus on the annotated block or function and attempt to parallelize its
execution. Plain host-language functions and @acc-annotated code can be intermingled
and invoke each other freely. ParallelAccelerator parallelizes array-style constructs already
existing in Julia, such as element-wise array operations, reductions on arrays, and array com-
prehensions, and introduces only a single new construct, runStencil for stencil computations.
While adding annotations is easy, we do rely on users to write code in array style.
    ParallelAccelerator is implemented in Julia and is released as a package.1 There is a
small array runtime component written in C. The compiler uses a combination of type
specialization and devirtualization of the Julia code and generates monomorphic OpenMP
C++ code for every @acc-annotated function and its transitive closure. Two features of
Julia made our implementation possible. The first is Julia’s macro system that allows us to
intercept calls to an accelerated function for each unique type signature. The second is access
to Julia’s type inference results. Accurate type information allows us to generate code that
uses efficient unboxed representations and calls to native operations on primitive data types.
Our results demonstrate that ParallelAccelerator can provide orders-of-magnitude speedup
over Julia on a variety of scientific computing workloads, and can achieve performance close
to optimized parallel C++.

    2      Background

2.1        Julia
The Julia programming language [3] is a high-level dynamic language that targets scientific
computing. Like other productivity languages, Julia has a read-eval-print loop for continuous
and immediate user interaction with the program being developed. Type annotations can
be omitted on function arguments and variables, in which case they behave as dynamically
typed variables [4]. Julia has a full complement of meta-programming features, including
macros that operate on abstract syntax trees and eval which allows users to construct code
as text strings and evaluate it in the environment of the current module.
    Julia comes with an extensive base library, largely written in Julia itself, for everyday
programming tasks as well as a range of functions appropriate for scientific and numerical
computing. In particular, the notation for array and vector computation is close to that
of MATLAB, which suggests an easy transition to Julia for MATLAB programmers. For
example, the following are a selected few array operators and functions:

        Unary: - ! log exp sin cos
        Binary: .+ .- .* ./ .== .!= .> .< .>= .<=

    Julia supports multiple dispatch; that is to say, functions can be (optionally) annotated
with types and Julia allows functions to be overloaded based on the types of their arguments.
Hence the same - (negation) operator that usually takes scalar operands can be overloaded
to take an array object as its argument, and returns the negation of each of its elements in a


1
    Source code is available on GitHub: https://github.com/intellabs/ParallelAccelerator.jl.






      new array. For instance, -[1,2,3] evaluates to [-1,-2,-3], and [1,2,3] .* [3,2,1] evaluates
      to [3,4,3] where .* stands for element-wise multiplication. The resolution of any function
      call is typically dynamic: at each call, the runtime system will check the tags of arguments
      and find the function definition that is the most applicable for these types.
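    As a minimal illustration of multiple dispatch (describe is our own example, not a
function from the Julia base library), the most specific method is selected according to the
runtime types of the arguments:

describe(x::Int) = "an integer"
describe(x::Float64) = "a float"
describe(x::AbstractArray) = "an array of $(length(x)) elements"

describe(3)          # evaluates to "an integer"
describe([1.0, 2.0]) # evaluates to "an array of 2 elements"
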
          The Julia execution engine is an aggressively specializing just-in-time compiler that emits
      intermediate representation for the LLVM compiler. Julia has a fast C function call API,
       a garbage collector, and, more recently, native threads. The compiler does not optimize
      array-style code, so users tend to write (error-prone) explicit loops.
          There are a number of design choices in Julia that facilitate the job of ParallelAccelerator.
      Optional type annotations are useful, in particular on data type declarations. In Python,
      programmers have no way to limit the range of values the fields of a class can take. In R
      or MATLAB, things are even worse, as there are not even classes; all data types are built
      up dynamically out of basic building blocks such as lists and arrays. In Julia, if a field is
      declared to be of some type T, then the runtime system will insert checks at every assignment
      (unless the compiler can prove they are redundant). Julia differentiates between abstract
      types, which cannot be instantiated but can have subtypes, and concrete types, which can be
      instantiated but cannot have subtypes. In Java terminology, concrete types are final. This
      property is helpful because the memory layout of a concrete type is thus known, and the
      compiler can optimize them (e.g., by stack allocation or field stripping). The eval function
      is not allowed to execute in the scope of the current function, as it does in JavaScript, which
      means that the damage that it can do is limited to changing global variables (and if the
      variables are typed, those changes must be type-preserving) and defining new functions.
      Julia’s reflection capacities are limited, so it is not possible to modify the shape of data
      structures or add local variables to existing frames as in JavaScript or R.
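    For illustration, the following sketch shows a typed field and the runtime check that it
entails (Point is our own example; Julia releases contemporaneous with this paper spell
mutable struct as type):

mutable struct Point
    x::Float64       # every assignment to this field is checked
end

p = Point(1.0)
p.x = 2.5            # fine: 2.5 is a Float64
p.x = "hello"        # raises an error: a String cannot be converted to Float64
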

      2.2    Related Work
      One way to improve the performance of high-level languages is to reduce interpreter over-
      head, as some of these languages are still executed by interpreting abstract syntax trees or
      bytecode. For instance, there is work on compiling MATLAB to machine code [10, 5, 22, 16],
      but due to the untyped and dynamic nature of the language, a sophisticated just-in-time
      compiler performing complex type inference is needed to get any performance improvements.
      Several projects have explored how to speed up the R language. Riposte [29] uses tracing
      techniques to extract commonly taken operation sequences and efficiently schedule vector
      operations. Early versions of FastR [14] exclusively relied on runtime specialization to re-
      move high-level overheads; more recent versions also generate native code [25]. Pydron [21]
      provides semi-automatic parallelization of Python programs but requires explicit program-
      mer annotations of side-effect free, parallelizable functions. Numba [17] is a JIT compiler
      for Python, and allows the user to define NumPy math kernels called UFuncs in Python
      and run them in parallel on either CPU or GPU. Julia is simpler to optimize, because it
      intentionally omits some of the most dynamic features of other productivity languages for
      scientific computing. For instance, Julia does not allow a function to delete arbitrary local
      variables of its caller (which R allows). ParallelAccelerator takes advantage of the existing Ju-
      lia compiler. That compiler performs one and only one major optimization: it aggressively
      specializes functions on the run-time type of their arguments. This is how Julia obtains
      similar benefits to FastR but with a simpler runtime infrastructure. ParallelAccelerator only
      needs a little help to generate efficient parallel code. The main difference between ParallelAc-
      celerator and Riposte is the use of static analysis rather than dynamic liveness information.
      The difference between ParallelAccelerator and Pydron is the reduction in the number of
      programmer-provided annotations.


julia> @acc f(x) = x .+ x .* x
f (generic function with 1 method)

julia> f([1,2,3])
3-element Array{Int64,1}:
  2
  6
 12


     Figure 1 A “hello world” motivating example for ParallelAccelerator.


    Another way to improve performance is to trade generality for efficiency with domain-
specific languages (DSLs). Delite [6, 27, 28] is a Scala compiler framework and runtime for
high-performance embedded DSLs that leverages Lightweight Modular Staging [24] for run-
time code generation. Our compiler design is inspired by Delite’s Domain IR and Parallel
IR, but does not prevent users from using the host language (by contrast, Delite-based DSLs
such as OptiML support only a subset of Scala [26]). Copperhead [7] provides composable
primitives for data-parallel operations embedded in a subset of Python, and leverages im-
plicit data parallelism for efficiency. DSLs such as Patus [11, 12] target stencil computations.
PolyMage [20] and Halide [23] are highly optimized DSL implementations for image process-
ing pipelines. ParallelAccelerator addresses the lack of generality of these DSLs by providing
full access to the host language. The SEJITS methodology [8] similarly allows full access
to the host language, and employs specializers (micro-compilers) for specific computational
“motifs” [2], which are “stovepipes” [15] from kernels to execution platform. Individual spe-
cializers use domain-specific optimizations to efficiently implement specific kernels, but do
not share a common intermediate representation or runtime like ParallelAccelerator, limiting
composability.
    Pochoir [30] is a DSL for stencil computations, embedded in C++ using template metapro-
gramming. Like ParallelAccelerator, it can be used in two modes. In the first mode, which
is analogous to ParallelAccelerator’s library-only mode, the programmer can compile Pochoir
programs to unoptimized code using an ordinary C++ compiler. In the second mode, the
programmer compiles the code using the Pochoir compiler, which acts as a preprocessor to
the C++ compiler and transforms the code into parallel C++.
    Improving the speed of array-style code in Julia is the goal of packages such as De-
vectorize.jl [18]. It provides a macro for automatically translating array-style code into
devectorized code. Like Devectorize.jl, the focus of ParallelAccelerator is on speeding up
code written in array style. However, since our approach is compiler-based rather than
library-based, we can do much more in terms of compiler optimizations, and the addition of
parallelism provides a substantial further speedup.


 3      Motivating Examples

We illustrate how ParallelAccelerator speeds up scientific computing codes by example. Con-
sider a trivial @acc-annotated function declared and run in the Julia REPL as shown in
Figure 1.
    When compiling an @acc-annotated function such as f, ParallelAccelerator optimizes high-
level array operations, such as pointwise array addition (.+) and multiplication (.*). It
compiles them to C++ with OpenMP directives, so that a C++ compiler can then generate
high-performance native code. The C++ code that ParallelAccelerator produces for f is shown





 1   void f271(j2c_array<double> &x, j2c_array<double> *ret)
 2   {
 3       int64_t idx, len;
 4       double tmp1, tmp2, ssa0, ssa1;
 5       j2c_array<double> new_arr;
 6
 7       len = x.ARRAYSIZE(1);
 8       new_arr = j2c_array<double>::new_j2c_array_1d(NULL, len);
 9   #pragma omp parallel private(tmp1, ssa1, ssa0, tmp2)
10       {
11   #pragma omp for private(idx)
12           for (idx = 1; idx <= len; idx++)
13           {
14               tmp1 = x.ARRAYELEM(idx);
15               ssa1 = tmp1 * tmp1;
16               tmp2 = x.ARRAYELEM(idx);
17               ssa0 = tmp2 + ssa1;
18               new_arr.ARRAYELEM(idx) = ssa0;
19           }
20       }
21       *ret = new_arr;
22   }


              Figure 2 The generated C++ code for the motivating example from Figure 1.


           in Figure 2. In line 1, the function name f is mangled to produce a unique C function name
           and the input array x can be seen followed by the pointer argument ret that is used to return
           the output array. In line 7, the length of the array x is saved and used as the upper bound of
           the for loop in line 12. In line 8, memory is allocated for the output array, similarly matching
           the length of x. The parallel OpenMP for loop defined on lines 9-20 iterates through each
           element of x, multiplies each element by itself, adds each element to that product, and stores
           the result in the corresponding index in the output array. On line 21, the output array is
           returned from the function by storing it in ret.
               The performance improvements that ParallelAccelerator delivers over standard Julia are
           in part a result of exposing parallelism and exploiting parallel hardware, and in part a result
           of eliminating run-time inefficiencies such as unneeded array bounds checks and intermedi-
           ate array allocations. In fact, ParallelAccelerator often provides a substantial performance
           improvement over standard Julia even when running on one thread; see Section 6 for details.

           3.1    Black-Scholes option pricing
           The Black-Scholes formula for option pricing is a classic high-performance computing bench-
           mark. Figure 3 shows an implementation of the Black-Scholes formula, written in a high-
           level array style in Julia. The arguments to the blackscholes function, sptprice, strike, rate,
           volatility, and time, are all arrays of floating-point numbers. blackscholes performs several
           computations involving pointwise addition (.+), subtraction (.-), multiplication (.*), and
           division (./) on these arrays. To understand this example, it is not necessary to understand
           the details of the Black-Scholes formula; the important thing to notice about the code is
           that it does many pointwise array arithmetic operations. When run on arrays of 100 million
           elements, this code takes 22 seconds to run under standard Julia.
               The many pointwise array operations in this code make it a good candidate for speeding
           up with ParallelAccelerator. Doing so requires only minor changes to the code: we need only
           import the ParallelAccelerator library, then annotate the blackscholes function with @acc.
           With the addition of @acc, the running time drops to 13.1s on one thread, and when we


function blackscholes(sptprice, strike, rate, volatility, time)
    logterm = log10(sptprice ./ strike)
    powterm = .5 .* volatility .* volatility
    den = volatility .* sqrt(time)
    d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
    d2 = d1 .- den
    NofXd1 = 0.5 .+ 0.5 .* erf(0.707106781 .* d1)
    NofXd2 = 0.5 .+ 0.5 .* erf(0.707106781 .* d2)
    futureValue = strike .* exp(-rate .* time)
    c1 = futureValue .* NofXd2
    call = sptprice .* NofXd1 .- c1
    put = call .- futureValue .+ sptprice
end


   Figure 3 An implementation of the Black-Scholes option pricing algorithm. Adding an @acc
annotation to this function improves performance by 41.6× on 36 cores.


function blur(img, iterations)
    w, h = size(img)
    for i = 1:iterations
       img[3:w-2, 3:h-2] =
        img[3-2:w-4, 3-2:h-4] * 0.0030 + img[3-1:w-3, 3-2:h-4] * 0.0133 + ... +
        img[3-2:w-4, 3-1:h-3] * 0.0133 + img[3-1:w-3, 3-1:h-3] * 0.0596 + ... +
        img[3-2:w-4, 3+0:h-2] * 0.0219 + img[3-1:w-3, 3+0:h-2] * 0.0983 + ... +
        img[3-2:w-4, 3+1:h-1] * 0.0133 + img[3-1:w-3, 3+1:h-1] * 0.0596 + ... +
        img[3-2:w-4, 3+2:h-0] * 0.0030 + img[3-1:w-3, 3+2:h-0] * 0.0133 + ...
    end
    return img
end


   Figure 4 An implementation of a Gaussian blur in Julia. The ...s elide parts of the weighted
average.


enable 36-thread parallelism, running time drops to 0.5s, for a total speedup of 41.6× over
standard Julia.
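
    Concretely, the only changes to the source are importing the package and adding the
annotation. The following sketch shows the shape of the change on a toy function
(scale_and_shift is our own example; blackscholes is annotated in exactly the same way):

using ParallelAccelerator

@acc function scale_and_shift(a, x, y)
    a .* x .+ y     # the array-style body itself is unchanged
end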


3.2    Gaussian blur
In this example, we consider a stencil computation that blurs an image using a Gaussian blur.
Stencil computations, which update the elements of an array according to a fixed pattern
called a stencil, are common in scientific computing. In this case, we are blurring an image,
represented as a 2D array of pixels, by setting the value of each output pixel to a weighted
average of the values of the corresponding input pixel and its neighbors. (At the borders of
the image, we do not have enough neighboring pixels to compute an output pixel value, so
we skip those pixels and do not assign to them.) Figure 4 gives a Julia implementation of
a Gaussian blur that blurs an image (img) a given number of times (iterations). The blur
function is in array style and does not explicitly loop over all pixels in the image; instead,
the loop counter refers to the number of times the blur should be iteratively applied. When
run for 100 iterations on a large grayscale input image of 7095x5322 pixels, this code takes
877s to run under standard Julia.





@acc function blur(img, iterations)
    buf = Array(Float32, size(img)...)
    runStencil(buf, img, iterations, :oob_skip) do b, a
        b[0,0] =
              (a[-2,-2] * 0.0030 + a[-1,-2] * 0.0133 + ... +
               a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + ... +
               a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + ... +
               a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + ... +
               a[-2, 2] * 0.0030 + a[-1, 2] * 0.0133 + ...)
        return a, b
    end
    return img
end


            Figure 5 Gaussian blur implemented using ParallelAccelerator.


          Using ParallelAccelerator, we can rewrite the blur function as shown in Figure 5. In
      addition to the @acc annotation, we replace the for loop in the original code with ParallelAc-
      celerator’s runStencil construct.2 The runStencil construct uses relative rather than absolute
      indexing into the input and output arrays. The :oob_skip argument tells runStencil to skip
      pixels at the borders of the image; unlike in the original code, we do not have to adjust
      indices to do this manually. Section 4.5 covers the semantics of runStencil in detail. With
these changes, the running time under the ParallelAccelerator compiler drops to 38s on one
core and to only 1.4s on 36 cores, for a speedup of over 600× over standard Julia.

      3.3       Two-dimensional wave equation simulation
      The previous two examples focused on orders-of-magnitude performance improvements en-
      abled by ParallelAccelerator. For this example, we focus instead on the flexibility that Par-
      allelAccelerator provides through its approach of extending Julia, rather than offering an
      invasive DSL alternative to programming in standard Julia.
          This example uses the two-dimensional wave equation to simulate the interaction of two
      waves. The two-dimensional wave equation is a partial differential equation (PDE) that
      describes the propagation of waves across a surface, such as the vibrations of a drum head.
      From the discretized PDE, one can derive a formula for the future (f ) position of a point
      (x, y) on the surface, based on its current (c) and past (p) positions (r is a constant):

f(x, y) = 2c(x, y) − p(x, y) + r^2 ( c(x−∆x, y) + c(x+∆x, y) + c(x, y−∆y) + c(x, y+∆y) − 4c(x, y) )


      The above formula expressed in array-style code is:
f[2:s-1, 2:s-1] = 2*c[2:s-1, 2:s-1] - p[2:s-1, 2:s-1]
                  + r^2 * ( c[1:s-2, 2:s-1] + c[3:s, 2:s-1]
                          + c[2:s-1, 1:s-2] + c[2:s-1, 3:s]
                          - 4*c[2:s-1, 2:s-1] )

      This code is excerpted from a MATLAB program found “in the wild”3 that simulates the
      propagation of waves on a surface. It is written in an idiomatic MATLAB style, making
      heavy use of array notation. It represents f , c, and p as three 2D arrays of size s in each


      2
           Here, runStencil is being called with Julia’s built-in do-block syntax (http://docs.julialang.org/
           en/release-0.5/manual/functions/#do-block-syntax-for-function-arguments).
      3
           See footnote 8 for a link to the original code.


dimension, and updates the f array based on the contents of the c and p arrays (skipping
the boundaries of the grid, which are handled separately).
    A direct port (preserving the array style of the MATLAB code) of the full program to
Julia has a running time of 4s on a 512×512 grid. The wave equation shown above is part
of the main loop of the simulation. Most of the simulation’s execution time is spent in
computing the above wave equation. To speed up this code with ParallelAccelerator, the key
is to observe that the wave equation is performing a stencil computation. The expression
runStencil(p, c, f, 1, :oob_skip) do p, c, f
    f[0,0] = 2*c[0,0] - p[0,0]
             + r*r * ( c[-1,0] + c[1,0] + c[0,-1] + c[0,1] - 4*c[0,0] )
end

takes the place of the wave equation formula. Like the original code, it computes an updated
value for f based on the contents of c and p. Unlike the original code, it uses relative rather
than absolute indexing: assigning to f[0,0] updates all elements of f, based on the contents
of the c and p arrays. With this minor change and with the addition of an @acc annotation,
the code runs in 0.3s, delivering a speedup of 15× over standard Julia.
    However, for this example the key point is not the speedup enabled by runStencil but
rather the way in which runStencil combines seamlessly with standard Julia features. Al-
though the wave equation dominates the running time of the main loop of the simulation,
most of the code in that main loop (which we omit here) handles other important details,
such as how the simulation should behave at the boundaries of the grid. Most stencil DSLs
support setting the edges to zero values, which in physical terms means that a wave reaching
the edge would be reflected back toward the middle of the grid. However, this simulation
uses “transparent” boundaries, which simulate an infinite grid. With ParallelAccelerator,
rather than needing to add special support for this sophisticated boundary handling, we
can again use :oob_skip in our runStencil call, as was done in Figure 5 for the Gaussian
blur example in the previous section, and instead express the boundary-handling code in
standard Julia.
    The sophisticated boundary handling that this simulation requires illustrates one reason
why ParallelAccelerator takes the approach of extending Julia and identifying existing parallel
patterns, rather than providing an invasive DSL: it is difficult for a DSL designer to anticipate
all the features that the user of the DSL might need. Therefore, ParallelAccelerator does not
aim to provide an invasive DSL alternative to programming in Julia. Instead, the user
is free to use domain-specific constructs like runStencil (say, for the wave equation), but
can combine them seamlessly with standard, fully general Julia code (say, for boundary
handling) that operates on the same data structures.

 4     Parallel patterns in ParallelAccelerator
In this section, we explain the implicit parallel patterns that the ParallelAccelerator compiler
makes explicit. We also examine their differences in expressiveness, safety guarantees, and
implementation trade-offs, which should offer some new insights even to readers already
well versed in concepts like map and reduce.

4.1    Building Blocks
Array operators and functions become the building blocks for users to write scientific and
numerical programs and provide a higher level of abstraction than operating on individual
elements of arrays. There are numerous benefits to writing programs in array style:





           We can safely index array elements without bounds checking once we know the input
           array size.
           Many operations are amenable to implicit parallelization without changing their seman-
           tics.
           Many operations do not have side effects, or when they do, their side effects are well-
           specified (e.g., modifying the elements of one of the input arrays).
           Many operations can be further optimized to make better use of hardware features, such
           as caches and SIMD (Single Instruction Multiple Data) vectorization.
       In short, such array operators and functions already come with a degree of domain knowledge
       embedded in their semantics, and such knowledge enables parallel implementations and
       optimization opportunities. In ParallelAccelerator, we identify a range of parallel patterns
       corresponding to either existing functions from the base library or language features. We are
       then able to translate these patterns to more efficient implementations without sacrificing
       program safety or altering program semantics. A crucial enabling factor is the readily
       available type information in Julia’s typed AST, which makes it easy to identify what domain
       knowledge is present (e.g., dense array or sparse array), and generate safe and efficient code
       without runtime type checking.


       4.2      Map
       Many element-wise array functions are essentially what is called a map operation in func-
       tional programming. For a unary function, this operation maps from each element of the
       input array to an element of the output array, which is freshly allocated and of the same
       size as the input array. For a binary function, this operation maps from each pair of ele-
       ments from the input arrays to an element of the output array, again requiring that the two
       input arrays and the output array are of the same size. Such arity extension can also be
        applied to the output, so instead of just one output, a map can produce multiple output arrays.
       Internally, ParallelAccelerator translates element-wise array operators and functions to the
       following construct that we call multi-map, or mmap for short:

     (B1, B2, ..., Bn) = mmap((x1, x2, ..., xm) → (e1, e2, ..., en), A1, A2, ..., Am)

       Here, mmap takes an anonymous lambda function and maps it over m input arrays A1 , A2 , . . .
       to produce n output arrays B1 , B2 , . . . . The lambda function performs element-wise com-
       putation from m scalar inputs to n scalar outputs. The sequential operational semantics
       of mmap is to iterate over the array length using an index, call the lambda function with
       elements read from the input arrays at this index, and write the results to the output arrays
       at the same index.
           In addition to mmap, we introduce an in-place multi-map construct, or mmap! for short:4

     mmap!((x1, x2, ..., xm) → (e1, e2, ..., en), A1, A2, ..., Am), where m ≥ n


       The difference between mmap! and mmap is that mmap! modifies the first n out of its m
       input arrays, hence we require that m ≥ n. Below are some examples of how we translate


       4
           The ! symbol is part of a legal identifier, suggesting mutation.


user-facing array operations to either mmap or mmap!:

      log(A)    ⇒     mmap(x → log(x), A)
      A .* B    ⇒     mmap((x, y) → x*y, A, B)
      A -= B    ⇒     mmap!((x, y) → x-y, A, B)
      A .+ c    ⇒     mmap(x → x+c, A)

In the last example above, we are able to inline a scalar variable c into the lambda because
type inference is able to tell that c is not an array.
    Once we guarantee the inputs and outputs used in mmap and mmap! are of the same size,
we can avoid all bounds checking when iterating through the arrays. Operational safety is
further guaranteed by having mmap and mmap! only as internal constructs to our compiler,
rather than exposing them to users, and translating only a selected subset of higher-level
operators and functions when they are safe to parallelize. This way, we do not risk exposing
the lambda function to our users, and can rule out any unintended side effects (e.g., writing
to environment variables, or reading from a file) that may cause non-deterministic behavior
in a parallel implementation.
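
    A sequential reference implementation conveys the operational semantics of mmap (this
sketch is our own and, for brevity, produces a single output array whose element type
matches the first input; the internal construct generalizes to n outputs and infers output
types):

function mmap_ref(f, arrays...)
    len = length(arrays[1])
    @assert all(a -> length(a) == len, arrays)   # equal sizes, checked once
    out = similar(arrays[1])
    for i in 1:len
        # no per-element bounds checks are needed after the size check
        out[i] = f(map(a -> a[i], arrays)...)
    end
    return out
end

mmap_ref(log, [1.0, 2.0, 3.0])       # log(A)
mmap_ref(*, [1, 2, 3], [3, 2, 1])    # A .* B, evaluating to [3, 4, 3]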

4.3     Reduction
Beyond maps, ParallelAccelerator also supports reduction as a parallel construct. Internally
it has the form r = reduce(⊕, φ, A), where ⊕ stands for a binary infix reduction operator,
and φ represents an identity value that accompanies a particular ⊕ for a specific element
type. Mathematically, the reduce operation is equivalent to computing r = φ ⊕ a1 ⊕ · · · ⊕ an ,
where a1 , . . . , an represent all elements in array A of length n. Reductions can be made
parallel only when the operator ⊕ is associative, which means we can only safely translate
a few functions to reduce, namely:

      sum(A)         ⇒     reduce(+, 0, A)
      product(A)     ⇒     reduce(*, 1, A)
      any(A)         ⇒     reduce(||, false, A)
      all(A)         ⇒     reduce(&&, true, A)

Unlike the multi-map case, we make a design choice here not to support multi-reduction so
that we can limit ourselves to only operators rather than the more flexible lambda expression.
This is mostly an implementation constraint that can be lifted as soon as our parallel backend
supports custom user functions.
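
    A sequential reference for reduce makes the role of the identity value explicit (our own
sketch; since Julia’s && and || are control-flow forms rather than first-class functions, the
any case is written with an anonymous function here):

function reduce_ref(op, identity, A)
    r = identity
    for a in A          # a parallel version may combine chunks in any
        r = op(r, a)    # order, which is sound only if op is associative
    end
    return r
end

reduce_ref(+, 0, [1, 2, 3, 4])                      # sum(A), evaluating to 10
reduce_ref(*, 1, [1, 2, 3, 4])                      # product(A), evaluating to 24
reduce_ref((a, b) -> a || b, false, [false, true])  # any(A), evaluating to true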

4.4     Cartesian Map
So far, we have focused on a selected subset of base library functions that can be safely
translated to parallel constructs such as mmap, mmap!, and reduce. Going beyond that,
we also look for ways to translate larger program fragments. One such target is a Julia
language feature called comprehension, a functional programming concept that has grown
popular among scripting languages such as Python and Ruby. In Julia, comprehensions have
the syntax illustrated below:

    A = [f(x1, x2, ..., xn) for x1 in r1, x2 in r2, ..., xn in rn]

where variables x1 , x2 , . . . , xn are iterated over either range or array objects r1 , r2 , . . . , rn ,
and the result is an n-dimensional dense array A, each value of which is computed by calling





       function f with x1 , x2 , . . . , xn . Operationally it is equivalent to creating a rank-n array of
       a dimension that is the Cartesian product of the range of variables r1 , r2 , . . . , rn , and then
       going through n-level nested loops that use f to fill in all elements. As an example, we
       quote from the Julia user manual5 the following avg function, which takes a one-dimensional
       input array x of length n and uses an array comprehension to construct an output array of
       length n − 2, in which each element is a weighted average of the corresponding element in
       the original array and its two neighbors:
    avg(x) = [ 0.25*x[i-1] + 0.5*x[i] + 0.25*x[i+1] for i in 2:length(x)-1 ]

       This example explicitly indexes into arrays using the variable i, something that cannot be
       expressed as a vanilla map operation as discussed in Section 4.2. Most comprehensions,
       however, can still be parallelized provided the following conditions can be satisfied:
       1. There are no side effects in the comprehension body (the function f ).
       2. The size of all range objects r1 , r2 , . . . , rn can be computed ahead of time.
       3. There are no inter-dependencies among the range indices x1 , x2 , . . . , xn .
       The first condition is to ensure that the result is still deterministic when its computation
       is made parallel, and ParallelAccelerator implements a conservative code analysis to identify
       possible breaches to this rule and reject such code. The second and third are constraints that
       allow the result array to be allocated prior to the computation, and in fact Julia’s existing
       semantics for comprehension already imposes these rules. Moreover, comprehension syntax
       also rules out ways to either mention or index the output array within the body. Therefore,
       in our implementation it is always safe to write to the output array without additional
       bounds checking. It must be noted, however, that since arbitrary user code can go into the
       body, indexing into other array variables is still allowed and must be bounds-checked.
           In essence, a comprehension is still a form of our familiar map operation, where instead
       of input arrays it maps over the range objects that make up the Cartesian space of the
       output array. Internally ParallelAccelerator translates comprehension to a parallel construct
       that we call cartesianmap:

      A = cartesianmap((i1, ..., in) → f(r1[i1], ..., rn[in]), len(r1), ..., len(rn))

       In the above translation, we slightly transform the input ranges to their numerical sizes,
       and liberally make use of array indexing notation to read the actual range values so that
        xj = rj[ij] for all 1 ≤ ij ≤ len(rj) and 1 ≤ j ≤ n. This transformation makes it easier
       to enumerate all range objects uniformly. Furthermore, given the fact that cartesianmap
       iterates over the Cartesian space of its output array, we can decompose a cartesianmap
       construct into two steps: allocating the output array, and in-place mmap!-ing over it, with
       the lambda function extended to take indices as additional parameters. We omit the details
       of the transformation here.
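
    The decomposition can be sketched sequentially as follows (cartesianmap_ref is our
own illustration, specialized to two ranges):

function cartesianmap_ref(f, r1, r2)
    # step 1: allocate the output over the Cartesian space of the ranges
    A = zeros(length(r1), length(r2))
    # step 2: fill it in place; every iteration is independent
    for j in 1:length(r2), i in 1:length(r1)
        A[i, j] = f(r1[i], r2[j])
    end
    return A
end

cartesianmap_ref(*, 1:3, 1:4)   # matches [x*y for x in 1:3, y in 1:4]
                                # up to element type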

       4.5       Stencil
       A stencil computation computes new values for all elements of an array based on the current
       values of neighbors. The example we give for comprehensions in Section 4.4 would also
       qualify as a stencil computation. It may seem plausible to just translate stencils as if they
       were parallel array comprehensions, but there is a crucial difference: stencils have both


       5
            http://docs.julialang.org/en/release-0.5/manual/arrays/#comprehensions


runStencil(Apu, Apv, pu, pv, Ix, Iy, 1, :oob_src_zero) do Apu, Apv, pu, pv, Ix, Iy
  ix = Ix[0,0]
  iy = Iy[0,0]
  Apu[0,0] = ix * (ix * pu[0,0] + iy * pv[0,0]) +
             lam * (4.0f0 * pu[0,0] - (pu[-1,0] + pu[1,0] + pu[0,-1] + pu[0,1]))
  Apv[0,0] = iy * (ix * pu[0,0] + iy * pv[0,0]) +
             lam * (4.0f0 * pv[0,0] - (pv[-1,0] + pv[1,0] + pv[0,-1] + pv[0,1]))
end


    Figure 6 Stencil code excerpt from the opt-flow workload (see Section 6) that illustrates multi-
buffer operation.


input and output arrays, and they are all of the same size. This property allows us to
eliminate bounds checking not just when writing to outputs, but also when reading from
inputs. Because Julia does not have a built-in language construct for us to identify stencil
calls in its AST, we introduce a new user-facing language construct, runStencil:

   runStencil((A, B, ...) → f(A, B, ...), A, B, ..., n, s)

The runStencil function takes a function f as the stencil kernel specification, then m image
buffers (dense arrays) A, B, . . . , and optionally a trip-count n for an iterative stencil loop
(ISL), and a symbol s that specifies how to handle stencil boundaries. We also require that:
    All buffers are of the same size.
    Function f has arity m.
    In f , all buffers are relatively indexed with statically known indices, i.e., only integer
    literals or constants.
    There are no updates to environment variables or I/O in f .
    For an ISL, f may return a set of buffers, rotated in position, to indicate the sequence
    of buffer swapping in between two consecutive stencil loops.
For boundary handling, we have built-in support for a few common cases such as wrap-
arounds, but users are free to do their own boundary handling outside of the runStencil call,
as mentioned in Section 3.
    An interesting aspect of our API is that input and output buffers need not be separated
and that any buffer may be read or updated. This allows flexible stencil specifications over
multiple buffers, combined with support for rotating only a subset of them in case of an ISL.
Figure 6 shows an excerpt from the opt-flow workload (see Section 6) that demonstrates
this use case. It has six buffers, all accessed using relative indices, and only two buffers
Apu and Apv are written to. A caveat is that care must be taken when reading from and
writing to the same output buffer. Although there are stencil programs that require this
ability, output arrays must always be indexed at 0 (i.e., the current position) in order to
avoid non-determinism. We currently do not check for this situation.
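
    To make the relative-indexing and buffer-rotation conventions concrete, the following is
a sequential one-dimensional sketch of a single :oob_skip sweep (our own illustration, using
the weighted average from the avg example of Section 4.4):

function stencil_sweep!(b, a)
    # relative offsets -1, 0, +1 are resolved against the current position
    # i; border cells whose neighborhood would fall outside the array are
    # skipped, mirroring :oob_skip
    for i in 2:length(a)-1
        b[i] = 0.25*a[i-1] + 0.5*a[i] + 0.25*a[i+1]
    end
    return b, a   # rotated buffers, as an ISL kernel would return them
end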
    Our library provides two implementations of runStencil: one a pure-Julia implementa-
tion, and the other the parallel implementation that our compiler provides. In the latter
case, we translate the runStencil call to an internal construct called stencil! that has some
additional information after a static analysis to derive kernel extents, dimensions, and so
on. Since there is a Julia implementation, code that uses runStencil can run in Julia simply
by importing the ParallelAccelerator library, even when the compiler is turned off, which
can be done by setting an environment variable. Any @acc-annotated function, including





       those that use runStencil, will run in this library-only mode, but through the ordinary Julia
       compilation path.6


           5    Implementing ParallelAccelerator

       The standard Julia compiler converts programs into ASTs, then transforms them to LLVM
       IR, and finally generates native assembly code. The ParallelAccelerator compiler intercepts
       this AST, introduces new AST node types for the parallel patterns discussed in Section 4,
       performs optimizations, and finally generates OpenMP C++ code (as shown in Figure 7).
       Since we assume there is an underlying C++ compiler producing optimized native code,
       our general approach to optimization within ParallelAccelerator itself is to only perform
       those optimizations that the underlying compiler is not capable of doing, due to the loss
       of semantic information present only in our Domain AST (Section 5.1) and Parallel AST
       (Section 5.2) intermediate languages.
           Functions annotated with the @acc macro are replaced by a trampoline. Calls to those
       functions are thus redirected to the trampoline. When it is invoked, the types of all argu-
       ments are known. The combination of function name and argument types forms a key that
       the trampoline looks for within a cache of accelerated functions. If the key exists, then the
       cache has optimized code for this call. If it does not, the trampoline uses Julia’s code_typed
       function to generate a type-inferred AST specialized for the argument types. The trampo-
       line then coordinates the passage of this AST through the rest of our compilation pipeline,
       installs the result in the cache, and calls the newly optimized function. This aggressive code
       specialization mitigates the dynamic aspects of the language.
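
    The caching logic of the trampoline can be sketched as follows (all names here are ours,
and compile_specialized stands in for the real pipeline of code_typed, optimization, C++
code generation, and native compilation):

const accel_cache = Dict{Any, Function}()

# stand-in for the real specialization pipeline described above
compile_specialized(name, argtypes) = (args...) -> error("stub for $name")

function trampoline(name::Symbol, args...)
    key = (name, map(typeof, args))    # function name plus argument types
    if !haskey(accel_cache, key)       # miss: specialize and install
        accel_cache[key] = compile_specialized(name, map(typeof, args))
    end
    return accel_cache[key](args...)   # hit: invoke the optimized code
end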


       5.1      Domain Transformation
       Domain transformation takes a Julia AST and returns a Domain AST where some nodes
       have been replaced by “domain nodes” such as mmap, reduce, stencil!, etc. We pattern-
       match against the incoming AST to translate supported operators and functions to their
       internal representations, and perform necessary safety checks and code analysis to ensure
       soundness with respect to the sequential semantics.
           Since our compiler backend outputs C++, all transitively reachable functions must be
       optimized. This phase processes all call sites, finds the target function(s) and recursively
       optimizes them.
           The viability of this strategy crucially depends on the compiler’s ability to precisely
       determine the target functions at each call site. This, in turn, relies on knowing the type
       of arguments of functions. While this is generally an intractable problem for a dynamic
       language, ParallelAccelerator is saved by the fact that optimizations are performed at runtime,
       the types of the argument to the original @acc call are known, and the types of global
        constants are also known. Lastly, at the point when an @acc-annotated function is actually
        invoked, all functions
       needed for its execution are likely to have been loaded. Users can help by providing type
       annotations, but in our experience these are rarely needed.


       6
           The Julia version of runStencil did not require significant extra development effort. To the contrary,
           it was a crucial early step in implementing the native ParallelAccelerator version, because it allowed us
           to prototype the semantics of the feature and have a reference implementation.




   Figure 7 The ParallelAccelerator compiler pipeline.


5.2    Parallel Transformation
Parallel transformation lowers domain nodes down to a common “parallel for” representation
that allows a unified optimization framework for all parallel patterns. The result of this
phase is a Parallel AST which extends Julia ASTs with parfor nodes. Each parfor node
represents one or more tightly nested for loops where every point in the loops’ iteration
space is independent and is thus amenable to parallelization.
    This phase starts with standard compiler techniques to simplify and reorder code so as to
maximize later fusion. Next, the AST is lowered by replacing domain nodes with equivalent
parfor nodes. Lastly, this phase performs fusion.
    A parfor node consists of pre-statements, loop nests, reductions, body, and post-statements.
Pre- and post-statements are statements executed once before and after the loop. Pre-
statements do things such as output array allocation, storing the length of the arrays used





       by the loop or the initial value of the reduction variable. The loop nests are encoded by
       an array where each element represents one of the nested loops of the parfor. Each such
       element contains the index variable, the initial and final values of the index variable, and
       the step. All lowered domain operations have a loop nest. The reduction is an array where
       each element represents one reduction computation taking place in the parfor. Each reduc-
       tion element contains the name of the reduction variable, the initial value of the reduction
       variable, and the function by which multiple reduction values may be combined. The body
       consists of the statements that perform the element-wise computation of the parfor. The
       body is generated in three parts, input, computation, and output, which makes it easier to
       perform fusion. In the input, individual elements from input arrays are stored into variables.
       Conversely, in the output, variables containing results are stored into their destination array.
       The computation is generated by the domain node which takes the variables defined for the
       input and output sections and generates statements that perform the computation.
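    As a hedged sketch (in current Julia syntax, with illustrative field names that differ from
the compiler's actual internal representation), a parfor node can be pictured as:

      # Illustrative encoding of a parfor node; names are ours, not the compiler's.
      struct LoopNest
          index_var::Symbol   # loop index variable
          lower::Any          # initial value of the index variable
          upper::Any          # final value of the index variable
          step::Any           # loop step
      end

      struct Reduction
          var::Symbol         # name of the reduction variable
          init::Any           # initial value of the reduction variable
          combine::Function   # combines multiple reduction values
      end

      struct Parfor
          pre_statements::Vector{Any}    # e.g., allocate output arrays
          loop_nests::Vector{LoopNest}   # one element per nested loop
          reductions::Vector{Reduction}  # one element per reduction
          body::Vector{Any}              # input, computation, output statements
          post_statements::Vector{Any}
      end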
    Parfor fusion lowers loop iteration overhead, eliminates some intermediate arrays that would
otherwise have to be created, and typically improves cache behavior by allowing array elements to
remain in registers or cache across multiple uses. When two consecutive domain nodes are lowered
to parfors, we check whether they can be fused. The criteria are:

          Loop nests must be equivalent: Since loop nests are usually based on some array’s
          dimensions, the check for equivalence often boils down to whether the arrays used by both
          parfors are known to have the same shape. To make this determination, we keep track
          of how arrays are derived from other arrays and maintain a set of array size equivalence
          classes.
   The second parfor must not access any data created by the first at a different point in the
   iteration space: the second parfor must not use a reduction variable computed by the first
   parfor, and all array accesses must touch only the elements corresponding to the current point
   in the iteration space. This also means that we do not currently fuse parfors corresponding to
   stencil! nodes.
    Fusing two parfors involves appending the second parfor's pre-statements, body, reductions,
and post-statements to those of the first. Also, since the body of the second parfor uses loop
index variables specific to the second parfor's loop nest, those index variables are replaced with
the corresponding index variables of the first parfor. In addition, we eliminate redundant loads
and stores and unneeded intermediate arrays: if an array created by the first parfor is not live at
the end of the second parfor, the array is eliminated by removing its allocation statement in the
first parfor and removing all assignments to it in the fused body.
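    As a hedged before/after illustration of what fusion buys, written here as the equivalent
hand-fused Julia rather than the compiler's actual output:

      # Before fusion: two parfors and an intermediate array for A .+ B.
      unfused(A, B) = (A .+ B) .* 2.0

      # After fusion: one loop nest; the intermediate is reduced to a scalar
      # that stays in a register, and its allocation is eliminated.
      function fused(A::Vector{Float64}, B::Vector{Float64})
          out = similar(A)
          for i in eachindex(A)   # single fused loop nest
              t = A[i] + B[i]     # body of the first parfor
              out[i] = t * 2.0    # body of the second, index renamed to i
          end
          return out
      end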

       5.3    Code Generation
The ParallelAccelerator compiler produces executable code from the Parallel AST through our
CGen backend, which outputs C++ OpenMP code. That code is compiled into a native shared
library with a standard C++ compiler. ParallelAccelerator creates a proxy function that handles
marshalling of arguments and invokes the shared library with Julia's ccall mechanism. It is this
proxy function that is installed in the code cache.
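    A hedged sketch of such a proxy: the library name, symbol, and signature below are
illustrative rather than the actual generated code, but the ccall mechanism is the one described
above:

      # Hypothetical proxy installed in the code cache. It marshals the Julia
      # arrays as raw pointers and invokes the CGen-produced shared library.
      function scale_add_proxy(A::Vector{Float64}, B::Vector{Float64})
          out = Vector{Float64}(undef, length(A))
          ccall((:scale_add_accel, "libscale_add.so"), Cvoid,
                (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Csize_t),
                A, B, out, length(A))
          return out
      end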
           CGen makes a single depth-first pass over the AST. It uses the following translation
       strategy for Julia types, parfor nodes, and method invocations. Unboxed scalar types, such
       as 64-bit integers, are translated to the semantically equivalent C++ type, e.g., int64_t. Com-
       posite types, such as Tuples, become C structs. Array types are translated into reference-
       counted C++ objects provided by our array runtime. Parfor nodes are lowered into OpenMP
loops with reduction clauses and operators where appropriate. The parfor nodes also con-
tain metadata from the parallel transformation phase that describes the private variables for
each loop nest, and these are translated into OpenMP private variables. Finally, there are
three kinds of method invocations that CGen has to translate: intrinsics, foreign functions,
and other Julia functions. Intrinsics are primitive operations, such as array indexing and
arithmetic functions on scalars, and CGen translates these into the equivalent native func-
tions or operators and inlines them at the call sites. Julia calls foreign functions through
its ccall mechanism, which includes the names of the library and the function to invoke.
CGen translates such calls into normal function calls to the appropriate dynamic libraries.
Calls to Julia functions, whether part of the standard library or user-defined, cause CGen
to add the function to a worklist. When CGen is finished translating the current method,
it will translate the first function on the worklist. In this way, CGen recursively translates
all reachable Julia functions in a breadth-first order.
    CGen imposes certain limitations on the Julia features that ParallelAccelerator supports.
There are some Julia features that could be supported in CGen to some degree with ad-
ditional work, such as string processing and exceptions. More fundamentally, since CGen
must declare a single C type for every variable, CGen cannot support Julia union types (in-
cluding type Any), which occur if Julia type inference determines that a particular variable
could have different types at different points within a function. Global variables are always
type-inferred as Any and so are not supported by CGen. CGen also does not support reflec-
tion or meta-programming, such as eval. Whenever CGen is provided an AST containing
unsupported features, CGen prints a message indicating which feature caused translation to
fail and installs the original, unmodified AST for the function in the code cache so that the
program will still run, albeit unoptimized.


5.3.1      Experimental JGen Backend
We are developing an alternative backend, JGen, that builds on the experimental threading
infrastructure provided recently in Julia 0.5. JGen generates Julia task functions for each
parfor node in the AST. The arguments to the task function are determined by the parfor's
liveness information, plus a range argument that specifies which portion of the parfor's iteration
space the function should execute. The task's body is a nested for loop that iterates through
that range and executes the parfor body.
    JGen replaces parfor nodes with calls to Julia’s threading runtime, specifying the schedul-
ing function, the task, and arguments to the task. The updated AST is stored in the code
cache. When the function is called, Julia applies its regular LLVM-based compilation pipeline to
generate native code. Each Julia thread calls the backend's scheduling function, which uses
the thread id to perform a static partitioning of the complete iteration space. Alternative
implementations such as a dynamic load-balancing scheduler are possible. JGen supports
all Julia features.
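    A hedged sketch of this static partitioning, written against the present-day Base.Threads
API rather than the experimental Julia 0.5 interface used by JGen:

      using Base.Threads

      # Each thread derives its own chunk of the iteration space from its id
      # and runs the parfor body over that range.
      function run_parfor!(out::Vector{Float64}, A::Vector{Float64})
          n = length(A)
          @threads for t in 1:nthreads()
              chunk = cld(n, nthreads())      # static partitioning
              lo = (t - 1) * chunk + 1
              hi = min(t * chunk, n)
              for i in lo:hi                  # the parfor body
                  out[i] = 2.0 * A[i]
              end
          end
          return out
      end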
    Code generated by the JGen backend is currently significantly slower (about 2×) than that
generated by CGen, due to factors such as vectorization support that C++ compilers provide but
that LLVM currently lacks. Moreover, the Julia threading infrastructure on which JGen is based
is not yet considered ready for production use.7 Therefore all the performance results we present
for ParallelAccelerator in this paper use the CGen backend. However, in the long run, JGen may
become the dominant backend for ParallelAccelerator, as it is more general.


7
    See http://julialang.org/blog/2016/10/julia-0.5-highlights.






   Table 1 Description of workloads. The last column shows the compile time for @acc functions.
This compile-time cost is only incurred the first time that an @acc-accelerated function is run during
a Julia session, and is not included in the performance results in Figures 8 and 9.

    Workload        Description                       Input size                    Stencil   Comp. time
    opt-flow        Horn-Schunck optical flow         5184x2912 image                  ✓         11s
    black-scholes   Black-Scholes option pricing      100M iters                                  4s
    gaussian-blur   Gaussian blur image processing    7095x5322 image, 100 iters       ✓          2s
    laplace-3d      Laplace 3D 6-point stencil        290x290x290 array, 1K iters      ✓          2s
    quant           Quantitative option pricing       524,288 paths, 256 steps                    9s
    boltzmann       2D lattice Boltzmann fluid flow   200x200 grid                     ✓          9s
    harris          Harris corner detection           8Kx8K image                      ✓          4s
    wave-2d         2D wave equation simulation       512x512 array                    ✓          6s
    juliaset        Julia set computation             1Kx1K resolution, 10 iters                  4s
    nengo           Nengo NEF algorithm               NA = 1000, NB = 800                         7s



           6     Empirical Evaluation

Our evaluation is based on a small but hopefully representative collection of scientific workloads,
listed in Table 1. We ran the workloads on a server with two Intel® Xeon® E5-2699 v3
(“Haswell”) processors, 128GB RAM, and CentOS v6.6 Linux. Each processor has 18 physical
cores (36 cores total) with a base frequency of 2.3GHz. The cache sizes are 32KB for L1d, 32KB
for L1i, 256KB for L2, and 25MB for L3. The generated code was compiled with the Intel® C++
compiler v15.0.2 at -O3. All results shown are the average of 3 runs (out of 5 runs, with the
first and last runs discarded).
          Our speedup results are shown in Figure 8. For each workload, we measure the per-
       formance of ParallelAccelerator running in single-threaded mode (labeled “@acc (1 thread)”)
       and with multiple threads (“@acc (36 threads)”), compared to standard Julia running single-
       threaded (“Julia (1 thread)”). We used Julia 0.5.0 to run both the ParallelAccelerator and
       standard Julia workloads.
           For all workloads except opt-flow and nengo, we also compare with a MATLAB imple-
       mentation. The boltzmann, wave-2d, and juliaset workloads are based on previously exist-
       ing MATLAB code found “in the wild” with minor adjustments made for measurement
       purposes.8 For the other workloads, we wrote the MATLAB code. MATLAB runs used
       version R2015a, 8.5.0.197613. The label “Matlab (1 thread)” denotes runs of MATLAB
       with the -singleCompThread argument. MATLAB sometimes does not benefit from implicit
       parallelization of vector operations (see boltzmann or wave-2d). For the opt-flow and nengo
       workloads, we compare with Python implementations, run on version 2.7.10. Finally, for
       opt-flow, laplace-3d, and quant, expert parallel C/C++ implementations were available for
       comparison. In the rest of this section, we discuss each workload in detail. Julia and Par-
       allelAccelerator code for all the workloads we discuss is available in the ParallelAccelerator
       GitHub repository.




       8
           The original implementations are: http://www.exolete.com/lbm for boltzmann, https://www.
           piso.at/julius/index.php/projects/programmierung/13-2d-wave-equation-in-octave for wave-
           2d, and http://www.albertostrumia.it/Fractals/FractalMatlab/Jul.html for juliaset.

    Figure 8 Speedups on each workload (opt-flow, black-scholes, gaussian-blur, laplace-3d,
 quant, boltzmann, harris, wave-2d, juliaset, and nengo). Each plot compares, where available:
 Python, Matlab (1 thread), Matlab, Julia (1 thread), Julia (explicit loops, 1 thread), @acc
 (1 thread), @acc (36 threads), C (simple serial, 1 thread), and C/C++ (optimized, 36 threads).
 Improvements relative to Julia are at the top of each bar. Absolute running times are at the
 bottom of each bar. Note: This figure is best viewed in color.






       6.1     Horn-Schunck optical flow estimation
       This is our longest-running workload. It takes two 5184x2912 images (e.g., from a sequence
       of video frames) as input and computes the apparent motion of objects from one image to
       the next using the Horn-Schunck method [13]. The implementation in standard Julia9 ran
in 1882s. Single-threaded ParallelAccelerator showed an 11.8-fold improvement (160s). Running
on 36 threads yields a speedup of 68.9× over standard Julia (27s). For comparison, a
       highly optimized parallel C++ implementation runs in 22s. That implementation is about
       900 lines of C++ and uses a hand-tuned number of threads and handwritten barriers to avoid
       synchronization after each parallel for loop. With ParallelAccelerator, only 300 lines of code
       are needed (including three runStencil calls). We achieve performance within a factor of two
       of C++. This suggests it is possible, at least in some cases, to close much of the performance
       gap between productivity languages and expert C/C++ implementations. We do not show
       Python results because the Python implementation timed out. If we run on a smaller image
       size (534x388), then Python takes 1061s, Julia 20s, ParallelAccelerator 2s and C++ 0.2s.
           The Julia implementation of this workload consists of twelve functions in two modules,
       and uses explicit for loops in many places. To port this code to ParallelAccelerator, we
       added the @acc annotation to ten functions using @acc begin ... end blocks (omitting the
       two functions that perform file I/O). In one @acc-annotated function, we replaced loops with
       three runStencil calls. For example, the runStencil call shown in Figure 6 replaced a 23-line,
       doubly nested for loop. Elsewhere, we refactored code to use array comprehensions and
       aggregate array operations in place of explicit loops. These changes tended to shorten the
       code. However, it is difficult to give a line count of the changes because in the process
       of porting to ParallelAccelerator, we also refactored the code to take advantage of Julia’s
       support for multiple return values, which simplified the code considerably. We also had to
       make some modifications to work around ParallelAccelerator’s limitations; for example, we
       moved a nested function to the top level because ParallelAccelerator can only operate on
       top-level functions within a module. The overall structure of the code remained the same.
           Finally, opt-flow is the only workload we investigated in which it was necessary to add
       some type annotations to compile the code with ParallelAccelerator. In particular, type
       annotations were necessary for variables used in array comprehensions. Interestingly, though,
       this is the case only under Julia 0.5.0 and not under the previous version, 0.4.6, which
       suggests that it is not a fundamental limitation but rather an artifact of the way that Julia
       currently implements type inference.


       6.2     Black-Scholes option pricing model
       This workload (described previously in Section 3.1) uses the Black-Scholes model to calculate
the prices of European options for 100 million iterations. The Julia version runs in 22s, the
single-threaded MATLAB implementation takes 12.8s, and the default MATLAB version 7.7s. The
       gap between the single-threaded Julia and MATLAB running times bears discussion. In
       Julia, code written with explicit loops is often faster than code written in vectorized style
       with aggregate array operations.10 This is in contrast with MATLAB, which encourages
       writing in vectorized style. We also measured a devectorized Julia version of this workload


       9
          For this workload, we show results for Julia 0.4.6, because a performance regression in Julia 0.5.0 caused
          a large slowdown in the standard Julia implementation that would make the comparison unfair. All
          other workloads use Julia 0.5.0.
       10
          See, e.g., https://github.com/JuliaLang/julialang.github.com/issues/353 for a discussion.


(“Julia (explicit loops, 1 thread)”). As Figure 8 shows, it runs in 15.1s, much closer to
single-threaded MATLAB. ParallelAccelerator on one thread gives a 1.7× speedup over the
array-style Julia implementation (13.1s), and with 36 threads the running time is 0.5s, a
total speedup of 41.6× over Julia and 14.5× over the faster of the two MATLAB versions.
Because the original code was already written in array style, this speedup was achieved
non-invasively: the only modification necessary to the code was to add an @acc annotation.


6.3    Gaussian blur image processing
This workload (described previously in Section 3.2) uses a stencil computation to blur an
image using a Gaussian blur. The Julia version runs in 877s, single-threaded MATLAB
in 340s, and the default MATLAB in 186s. The ParallelAccelerator implementation uses a
single runStencil call and takes 38s on one thread, a speedup of 23.3× over Julia. 36-thread
parallelism reduces the running time to 1.4s — a total speedup of over 600× over Julia and
about 130× over the faster of the two MATLAB versions. The code modification necessary
to achieve this speedup was to replace the loop shown in Figure 4 with the equivalent
runStencil call shown in Figure 5 — an 8-line change, along with adding an @acc annotation.
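    For readers without Figure 5 at hand, here is a hedged sketch in the style of the library's
runStencil. The simple five-point averaging kernel is a stand-in for the actual Gaussian weights,
and the border-handling symbol should be treated as our recollection of the API:

      using ParallelAccelerator

      @acc function blur(img::Matrix{Float32}, iterations::Int)
          buf = similar(img)
          # Relative indices address neighbors; :oob_skip skips out-of-bounds
          # points; returning (a, b) rotates the buffers between iterations.
          runStencil(buf, img, iterations, :oob_skip) do b, a
              b[0,0] = (a[-1,0] + a[1,0] + a[0,-1] + a[0,1] + a[0,0]) / 5.0f0
              return a, b
          end
          return img
      end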


6.4    Laplace 3D 6-point stencil
This workload solves the Laplace equation on a regular 3D grid with simple Dirichlet bound-
ary conditions. The Julia implementation we compare with is written in devectorized style,
using four nested for loops. It runs in 148s, outperforming default MATLAB (961s) and
single-threaded MATLAB (739s). The ParallelAccelerator implementation uses runStencil.
    ParallelAccelerator on one thread runs in 24.8s, a speedup of 6× over standard Julia. This
running time is roughly equivalent to a simple serial C implementation (24.6s). Running
under 36 threads, the ParallelAccelerator implementation takes 5.7s, a speedup of 26× over
Julia and 130× over the faster of the two MATLAB versions. An optimized parallel C
implementation that uses SSE intrinsics runs in 2.6s. The running time of the ParallelAccel-
erator version is therefore nearly within a factor of two of highly optimized parallel C. The
runStencil implementation is very high-level: the body of the function passed to runStencil is
only one line. The C version requires four nested loops and many calls to intrinsic functions.
Furthermore, the C code is less general: each dimension must be a multiple of 4, plus 2. Indeed,
this is why we chose the problem size of 290x290x290. The ParallelAccelerator implementation,
by contrast, can handle arbitrary N×N×N input sizes.
    For this example, since the original Julia code was written in devectorized style, the
necessary code changes were to replace the loop nest with a single runStencil call and to add
the @acc annotation. We replaced 17 lines of nested for loops with a 3-line runStencil call.


6.5    Quantitative option pricing model
This workload uses a quantitative model to calculate the prices of European and American
options. An array-style Julia implementation runs in 11.6s. As with black-scholes, we also
compare with a Julia version written in devectorized style, which runs 3.7× faster (3.1s).
Single-threaded MATLAB and default MATLAB versions run in 5s and 4.3s, respectively.
Single-threaded ParallelAccelerator runs in 1.5s. With 36 threads, the running time is 0.09s,
a total speedup of 119× over array-style Julia, and 45× over the faster of the two MATLAB
versions. For comparison, an optimized parallel C++ implementation written using OpenMP
runs in 0.08s. ParallelAccelerator is about 1.3× slower than the parallel C++ version.





           Since this workload was already written in array style, and the bulk of the computation
       takes place in a single function, it should have been easy to port to ParallelAccelerator by
adding an @acc annotation. However, we encountered a problem: the @acc-annotated
       function calls the inv function (for inverting a matrix) from Julia’s linear algebra standard
       library.11 We had difficulty compiling inv through CGen because Julia has an unusual im-
       plementation of linear algebra, making it hard to generate code for most of the linear algebra
       library functions (except for BLAS library calls, which are straightforward to translate). As
       a workaround, we wrote our own implementation of inv for the @acc-annotated code to call,
       specialized to the array size needed for this workload. With that change, ParallelAccelerator
       worked well. The need for workarounds like this could be avoided by using the JGen backend
       described in Section 5.3.1, which supports all of Julia.

       6.6       2D lattice Boltzmann fluid flow model
       This workload uses the 2D lattice Boltzmann method for fluid simulation. The ParallelAc-
       celerator version uses runStencil. The Julia implementation runs in 93s, and single-threaded
       MATLAB and default MATLAB in 30s and 36s, respectively (making this an example of a
       workload where MATLAB’s default implicit parallelization hurts rather than helps). Single-
       threaded ParallelAccelerator runs in 6.6s, a speedup of 14× over Julia. With 36 threads we
       get a further speedup to 0.5s, for a total speedup of 173× over Julia and 56× over the faster
       of the two MATLAB versions.
           This workload is the only one we investigated in which a use of runStencil is longer than
       the code it replaces. The ParallelAccelerator version of the code contains a single 65-line
       runStencil call, replacing a 44-line while loop in the standard Julia implementation.12 In
       addition to the replacement of the while loop with runStencil, other, smaller differences
       between the ParallelAccelerator and Julia implementations arose because ParallelAccelerator
       does not support the transfer of BitArrays between C and Julia. Therefore the modifica-
       tions needed to run this workload with ParallelAccelerator came the closest to being invasive
       changes of any workload we studied. That said, the code is still recognizably “Julia” and
       our view is that the resulting 173× speedup justifies the effort.

       6.7       Harris corner detection
       This workload uses the Harris corner detection method to find corners in an input image.
       The ParallelAccelerator implementation uses runStencil. The Julia implementation runs in
       14s; single-threaded MATLAB and default MATLAB run in 10.4s and 8.1s, respectively. The
       single-threaded ParallelAccelerator version runs in 1.1s, a speedup of 13× over Julia. The
       addition of 36-thread parallelism results in a further speedup to 0.1s, for a total speedup of
       138× over Julia and 80× over the faster of the two MATLAB versions.
           The Harris corner detection algorithm is painful to implement without some kind of sten-
       cil abstraction. The ParallelAccelerator implementation of this workload uses five runStencil
        calls, each with a one-line function body. The standard Julia code, in the absence of
       runStencil, has a function that computes and returns the application of a stencil to an input
       2D array. This function has a similar interface to runStencil, but is less general (and, of


       11
            See http://docs.julialang.org/en/stable/stdlib/linalg/#Base.inv.
       12
            That said, the Julia implementation was a direct port from the original MATLAB code, which was
            written with extreme concision in mind (see http://exolete.com/lbm/ for a discussion), while the
            runStencil implementation was written with more of an eye toward readability.


course, cannot parallelize as runStencil does). Therefore the biggest difference between the
ParallelAccelerator and Julia implementations of this workload is that we were able to remove
the runStencil substitute function from the ParallelAccelerator version, which eliminated over
30 lines of code. The remaining differences between the versions, including addition of the
@acc annotation, are trivial.


6.8       2D wave equation simulation
This workload is the wave equation simulation described in Section 3.3. The ParallelAcceler-
ator implementation uses runStencil. The Julia implementation (4s) outperforms the default
MATLAB implementation (4.5s); however, the single-threaded MATLAB implementation
runs in 2.6s, making this another case where MATLAB’s default implicit parallelization is
unhelpful. The single-threaded ParallelAccelerator version runs in 0.5s, a speedup of 8× over
Julia. The addition of 36-thread parallelism results in a further speedup to 0.3s, for a total
speedup of about 15× over Julia and 10× over the faster of the two MATLAB versions.
    The Julia implementation of this workload is a direct port from the MATLAB version
and is written in array style, so it is amenable to speedup with ParallelAccelerator without
any invasive changes. The only nontrivial modification necessary is to replace the one-line
wave equation shown in Section 3.3 with a call to an @acc-annotated function containing a
runStencil call with an equivalent one-line body.


6.9       Julia set computation
This workload computes the Julia set fractal13 for a given complex constant at a resolution of
1000x1000 for ten successive iterations of a loop. The Julia implementation (9.8s) is written
in array style. It outperforms the single-threaded MATLAB version (24s), but is slightly
slower than the default MATLAB version (6.9s). On one thread, ParallelAccelerator runs in
6.4s, a speedup of 1.5× over standard Julia. With 36 threads we achieve a further speedup to
3.6s, for a total speedup of 2.7× over Julia and about 2× over the faster of the two MATLAB
versions. The only modification needed to the standard Julia code is to add a single @acc
annotation. The speedup enabled by ParallelAccelerator is modest because each iteration of
the loop is dependent on results from the previous iteration, and so ParallelAccelerator is
limited to parallelizing array-style operations within each iteration.


6.10       Nengo NEF algorithm
Finally, for our last example we consider a workload that demonstrates poor parallel scal-
ing with ParallelAccelerator. This workload is a demonstration of the Neural Engineering
Framework (NEF) algorithm used by Nengo, a Python software package for simulating neu-
ral systems [1]. It builds a network from two populations of neurons. We ported the NEF
Python code14 to Julia and attempted to parallelize it with ParallelAccelerator. With an
input size of 1000 neurons for the first population and 800 for the second population, the
original Python code runs in 78s. We observed an impressive speedup to 0.5s simply by
porting the code to Julia. On one thread, the ParallelAccelerator version runs in 0.3s, a 1.9×
speedup over standard Julia, but on 36 threads we observed a slowdown to 0.7s.


13
     See https://en.wikipedia.org/wiki/Julia_set.
14
     Available at http://nengo.ca/docs/html/nef_algorithm.html.







   Figure 9 The effects of individual ParallelAccelerator optimizations on a variety of workloads.
The bars in each plot show, left to right: Julia (1 thread); @acc (1 thread, no optimizations);
then the cumulative addition of +Parallelization, +Fusion, +Hoisting, and +Vectorization. Speedup
after each successive optimization (compared to the Julia baseline) is shown at the top of each
bar (higher is better). The absolute running times are shown at the bottom of each bar.


     The Julia port of the NEF algorithm is about 200 lines of code comprising several func-
tions. For the ParallelAccelerator implementation, we annotated two of the functions with
@acc, and we replaced roughly 30 lines of code in those functions that had been written us-
ing explicit for loops with their array-style equivalents. Doing so led to the modest speedup
gained by running with ParallelAccelerator on one thread. However, this workload offers
little opportunity for parallelization with ParallelAccelerator, although it might be possible
to obtain better results on a different problem size or with fewer threads.

6.11      Impact of Individual Optimizations
Figure 9 shows a breakdown of the effects of parallelism and individual optimizations imple-
mented by our compiler. The leftmost bar in each plot (labeled “Julia (1 thread)”) shows
standard Julia running times, for comparison. The second bar (“@acc (1 thread, no optimiza-
tions)”) shows running time for ParallelAccelerator with OMP_NUM_THREADS=1 and optimizations
disabled. The difference between the first and second bars in each plot illustrates the impact
of avoiding two run-time overheads: allocating and managing arrays for intermediate
computations, and bounds-checking array accesses. As the figure shows, the difference can
be substantial. This is due to the lack of optimization for array-style code in the Julia
compiler. The third bar in each plot (“+Parallelization”) shows the impact of enabling par-
allel execution with OMP_NUM_THREADS=36. Again, the benefits are usually noticeable. The last
three bars in each plot (“+Fusion”, “+Hoisting”, “+Vectorization”) each cumulatively add
an additional compile-time optimization. For many workloads, these optimizations have no
significant impact or even a slight negative impact. Some workloads, such as opt-flow and
gaussian-blur, see a noticeable speedup from vectorization. For opt-flow, black-scholes, and
quant there is a noticeable speedup from fusion. In general, the usefulness of these opti-
mizations depends on the application, but the performance to be gained from them appears
small in comparison to the improvement we see from parallelization and run-time overhead
elimination.


 7      Limitations and Future Work
While we are confident that the reported results and claimed benefits of ParallelAccelerator
will generalize to other workloads and other languages, we do acknowledge the following
limitations and opportunities to extend our work.

     Workloads: Our empirical evaluation is limited in size to the workloads we were able
     to obtain or write ourselves. A larger set of programs would increase confidence in the
     applicability and generality of the approach. We considered porting more MATLAB or
     R codes to Julia, as they are often naturally vectorized, but the differences in semantics
     of the base libraries complicates that task. It turned out that each program in our
     benchmark suite represented substantial work. Larger programs in terms of code size
     would also help validate the scalability of the compilation strategy. We intend to engage
     with the Julia community to port more Julia code currently written with loops to the
     array style supported by ParallelAccelerator.
     Programming model: ParallelAccelerator only parallelizes programs that are written in
     array style; it will not touch explicit loops. Thus it is less suitable for certain applications,
     for instance, string and graph processing applications that use pointers. Additionally, the
     ParallelAccelerator compiler must be able to statically resolve all operations being invoked.
     For this, it needs fairly accurate approximations of the type of every value.





            Furthermore, some reflective operations cannot be invoked within an @acc-annotated
            function. We can generalize the ParallelAccelerator strategy to accept more programming
            styles, although automatic parallelization may be more challenging. As for the type
            specialization, it has worked surprisingly well so far. We hypothesize that the kinds
            of array codes we work with do not use complex types for performance reasons. They
            tend to involve arrays of primitives as programmers try to match what will run fast on
            the underlying hardware. One of the reasons why allocation and object operations are
            currently not supported is that ParallelAccelerator was originally envisioned as running on
            GPUs or other accelerators with a relatively limited programming model, but for CPUs,
            we could relax that restriction.
            Code bloat: The aggressive specialization used by ParallelAccelerator has the potential
            for massive code bloat. This could occur if @acc-annotated functions were called with
            many distinct sets of argument types. We have not witnessed it so far, but it could be a
            problem and would require a smarter specialization strategy. The aggressive specializa-
            tion may lead to generating many native functions that are mostly similar. Instead of
            generating a new function for each new type signature, we could try to share the same
            implementation for signatures that behave similarly.
            User feedback: There is currently limited feedback when ParallelAccelerator fails to
            parallelize code (e.g., due to union types for some variables). While the code will run,
            users will see a warning message and will not see the expected speedups. Unlike invasive
            DSLs, with additional work we can map parallelization failures back to statements in
            the Julia program. We are considering how to provide better diagnostic information.
            Variability: The benefits of parallelization depend on both the algorithm and the target
            parallel architecture. For simplicity, we assume a shared-memory machine without any
            communication cost and parallelize all implicitly parallel operations. However, this can
     result in poor performance. For example, the distribution of array elements across cores
     may be inconsistent from one operation to the next, incurring expensive communication
     (e.g., cache-line exchange). We are considering how to expose more tuning parameters to the
            user.


        8      Conclusion

Typical high-performance DSLs require users to adopt a dedicated compiler and runtime in order
to use the domain-specific language features. Unfortunately, DSLs often face challenges that
       limit their widespread adoption, such as a steep learning curve, functionality cliffs, and a
       lack of robustness. Addressing these shortcomings requires significant engineering effort.
       Our position is that designing and implementing a DSL is difficult enough without having
       to tackle these additional challenges. Instead, we argue that implementors should focus only
       on providing high-level abstractions and a highly optimizing implementation, but users of
       the DSL should enjoy rapid development and debugging, using familiar tools on the platform
       of their choice. This is where the ability to disable the ParallelAccelerator compiler during
       development and then enable it again at deployment time comes in: it allows us to offer
       users high performance and high-level abstractions while still giving them an easy way to
       sidestep problems of compilation time, robustness, debuggability, and platform availability.
           In conclusion, ParallelAccelerator is a non-invasive DSL because it does not require whole-
       sale changes to the programming model. It allows programmers to write high-level, high-
       performance array-style code in a general-purpose productivity language by identifying im-
       plicit parallel patterns in the code and compiling them to efficient native code. It also


eliminates many of the usual overheads of high-level array languages, such as intermediate
array allocation and bounds checking. Our results demonstrate considerable speedups for
a number of scientific workloads. Since ParallelAccelerator programs can run under stan-
dard Julia, programmers can develop and debug their code using a familiar environment
and tools. ParallelAccelerator also demonstrates that with a few judicious design decisions,
scientific codes written in dynamic languages can be parallelized. While it may be the case
that scientific codes are somewhat more regular in their computational kernels than general-
purpose codes, our experience with ParallelAccelerator was mostly positive: there were very
few cases where we needed to add type annotations or where the productivity-oriented as-
pects of the Julia language prevented our compiler from doing its job. This is encouraging
as it suggests that dynamism and performance need not be mutually exclusive.

Acknowledgments. Anand Deshpande and Dhiraj Kalamkar wrote the parallel C version
of laplace-3d. Thanks to our current and former colleagues at Intel and Intel Labs who
contributed to the design and implementation of ParallelAccelerator and to the collection of
workloads we studied: Raj Barik, Neal Glew, Chunling Hu, Victor Lee, Geoff Lowney, Paul
Petersen, Hongbo Rong, Jaswanth Sreeram, Leonard Truong, and Youfeng Wu. Thanks to
the Julia Computing team for their encouragement of our work and assistance with Julia
internals.


     References
 1   The Nengo neural simulator, 2016. URL: http://nengo.ca.
 2   Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Hus-
     bands, Kurt Keutzer, David Patterson, William Lester Plishker, John Shalf, Samuel Webb
     Williams, and Katherine Yelick. The landscape of parallel computing research: A view
     from Berkeley. Technical report, UC Berkeley, 2006. URL: www.eecs.berkeley.edu/
     Pubs/TechRpts/2006/EECS-2006-183.html.
 3   Jeff Bezanson, Stefan Karpinski, Viral Shah, and Alan Edelman. Julia: A fast dynamic
     language for technical computing. CoRR, abs/1209.5145, 2012. URL: http://arxiv.org/
     abs/1209.5145.
Gavin Bierman, Erik Meijer, and Mads Torgersen. Adding dynamic types to C#. In Proceed-
     ings of the 24th European Conference on Object-oriented Programming, ECOOP’10, pages
     76–100, Berlin, Heidelberg, 2010. Springer-Verlag. URL: http://dl.acm.org/citation.
     cfm?id=1883978.1883986.
 5   João Bispo, Luís Reis, and João M. P. Cardoso. Techniques for efficient MATLAB-to-
     C compilation. In Proceedings of the 2nd ACM SIGPLAN International Workshop on
     Libraries, Languages, and Compilers for Array Programming, ARRAY 2015, pages 7–12,
     New York, NY, USA, 2015. ACM. doi:10.1145/2774959.2774961.
 6   Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin
     Odersky, and Kunle Olukotun. A heterogeneous parallel framework for domain-specific
     languages. In Proceedings of the 2011 International Conference on Parallel Architectures
     and Compilation Techniques, PACT ’11, pages 89–100, Washington, DC, USA, 2011. IEEE
     Computer Society. doi:10.1109/PACT.2011.15.
 7   Bryan Catanzaro, Michael Garland, and Kurt Keutzer. Copperhead: Compiling an embed-
     ded data parallel language. In Proceedings of the 16th ACM Symposium on Principles and
     Practice of Parallel Programming, PPoPP ’11, pages 47–56, New York, NY, USA, 2011.
     ACM. doi:10.1145/1941553.1941562.
 8   Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Krste Asanovic, James Demmel, Kurt
     Keutzer, John Shalf, Kathy Yelick, and Armando Fox. SEJITS: Getting productivity and






      performance with selective embedded JIT specialization. In Workshop on Programmable
      Models for Emerging Architecture (PMEA), 2009. URL:
      http://parlab.eecs.berkeley.edu/publication/296.
 9   Manuel M.T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod
      Grover. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the Sixth
      Workshop on Declarative Aspects of Multicore Programming, DAMP ’11, pages 3–14, New
      York, NY, USA, 2011. ACM. doi:10.1145/1926354.1926358.
10   Maxime Chevalier-Boisvert, Laurie Hendren, and Clark Verbrugge. Optimizing MATLAB
      through just-in-time specialization. In Proceedings of the 19th Joint European Conference
      on Theory and Practice of Software, International Conference on Compiler Construction,
      CC’10/ETAPS’10, pages 46–65, Berlin, Heidelberg, 2010. Springer-Verlag.
      doi:10.1007/978-3-642-11970-5_4.
11   Matthias Christen, Olaf Schenk, and Helmar Burkhart. PATUS: A code generation and
      autotuning framework for parallel iterative stencil computations on modern
      microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed
      Processing Symposium, IPDPS ’11, pages 676–687, Washington, DC, USA, 2011. IEEE
      Computer Society. doi:10.1109/IPDPS.2011.70.
12   Matthias Christen, Olaf Schenk, and Yifeng Cui. Patus for convenient high-performance
      stencils: Evaluation in earthquake simulations. In Proceedings of the 2012 International
      Conference for High Performance Computing, Networking, Storage and Analysis, SC ’12,
      pages 1–10, Washington, DC, USA, 2012. IEEE Computer Society.
      doi:10.1109/SC.2012.95.
13   Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artif. Intell.,
      17(1-3):185–203, August 1981. doi:10.1016/0004-3702(81)90024-2.
14   Tomas Kalibera, Petr Maj, Floreal Morandat, and Jan Vitek. A fast abstract syntax
      tree interpreter for R. In Proceedings of the 10th ACM SIGPLAN/SIGOPS International
      Conference on Virtual Execution Environments, VEE ’14, pages 89–102, New York, NY,
      USA, 2014. ACM. doi:10.1145/2576195.2576205.
15   Shoaib Ashraf Kamil. Productive High Performance Parallel Programming with Auto-tuned
      Domain-Specific Embedded Languages. PhD thesis, EECS Department, University of
      California, Berkeley, January 2013. URL:
      http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-1.html.
16   Vineet Kumar and Laurie Hendren. MIX10: Compiling MATLAB to X10 for high
      performance. In Proceedings of the 2014 ACM International Conference on Object Oriented
      Programming Systems Languages & Applications, OOPSLA ’14, pages 617–636, New York,
      NY, USA, 2014. ACM. doi:10.1145/2660193.2660218.
17   Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A LLVM-based Python JIT
      compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in
      HPC, LLVM ’15, pages 7:1–7:6, New York, NY, USA, 2015. ACM.
      doi:10.1145/2833157.2833162.
18   Dahua Lin. Devectorize.jl, 2015. URL: https://github.com/lindahua/Devectorize.jl.
19   Derek Lockhart, Gary Zibrat, and Christopher Batten. PyMTL: A unified framework for
      vertically integrated computer architecture research. In Proceedings of the 47th Annual
      IEEE/ACM International Symposium on Microarchitecture, MICRO-47, pages 280–292,
      Washington, DC, USA, 2014. IEEE Computer Society. doi:10.1109/MICRO.2014.50.
20   Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. PolyMage: Automatic
      optimization for image processing pipelines. In Proceedings of the Twentieth International
      Conference on Architectural Support for Programming Languages and Operating Systems,
      ASPLOS ’15, pages 429–443, New York, NY, USA, 2015. ACM.
      doi:10.1145/2694344.2694364.
21   Stefan C. Müller, Gustavo Alonso, Adam Amara, and André Csillaghy. Pydron:
      Semi-automatic parallelization for multi-core and the cloud. In Proceedings of the 11th
      USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages
      645–659, Berkeley, CA, USA, 2014. USENIX Association. URL:
      http://dl.acm.org/citation.cfm?id=2685048.2685100.
22   Ashwin Prasad, Jayvant Anantpur, and R. Govindarajan. Automatic compilation of
      MATLAB programs for synergistic execution on heterogeneous processors. In Proceedings
      of the 32nd ACM SIGPLAN Conference on Programming Language Design and
      Implementation, PLDI ’11, pages 152–163, New York, NY, USA, 2011. ACM.
      doi:10.1145/1993498.1993517.
23   Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand,
      and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism,
      locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM
      SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13,
      pages 519–530, New York, NY, USA, 2013. ACM. doi:10.1145/2491956.2462176.
24   Tiark Rompf and Martin Odersky. Lightweight modular staging: A pragmatic approach
      to runtime code generation and compiled DSLs. In Proceedings of the Ninth International
      Conference on Generative Programming and Component Engineering, GPCE ’10, pages
      127–136, New York, NY, USA, 2010. ACM. doi:10.1145/1868294.1868314.
25   Lukas Stadler, Adam Welc, Christian Humer, and Mick Jordan. Optimizing R language
      execution via aggressive speculation. In Proceedings of the 12th Symposium on Dynamic
      Languages, DLS 2016, pages 84–95, New York, NY, USA, 2016. ACM.
      doi:10.1145/2989225.2989236.
26   Arvind Sujeeth. OptiML language specification 0.2, 2012. URL:
      stanford-ppl.github.io/Delite/optiml/downloads/optiml-spec.pdf.
27   Arvind Sujeeth, HyoukJoong Lee, Kevin Brown, Tiark Rompf, Hassan Chafi, Michael Wu,
      Anand Atreya, Martin Odersky, and Kunle Olukotun. OptiML: An implicitly parallel
      domain-specific language for machine learning. In Lise Getoor and Tobias Scheffer, editors,
      Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML
      ’11, pages 609–616, New York, NY, USA, June 2011. ACM.
28   Arvind K. Sujeeth, Tiark Rompf, Kevin J. Brown, HyoukJoong Lee, Hassan Chafi, Victoria
      Popic, Michael Wu, Aleksandar Prokopec, Vojin Jovanovic, Martin Odersky, and Kunle
      Olukotun. Composition and reuse with compiled domain-specific languages. In Proceedings
      of the 27th European Conference on Object-Oriented Programming, ECOOP’13, pages
      52–78, Berlin, Heidelberg, 2013. Springer-Verlag. doi:10.1007/978-3-642-39038-8_3.
29   Justin Talbot, Zachary DeVito, and Pat Hanrahan. Riposte: A trace-driven compiler and
      parallel VM for vector code in R. In Proceedings of the 21st International Conference on
      Parallel Architectures and Compilation Techniques, PACT ’12, pages 43–52, New York,
      NY, USA, 2012. ACM. doi:10.1145/2370816.2370825.
30   Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and
      Charles E. Leiserson. The Pochoir stencil compiler. In Proceedings of the Twenty-third
      Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages
      117–128, New York, NY, USA, 2011. ACM. doi:10.1145/1989493.1989508.