Superoptimization

Authors Embecosm Limited

License CC-BY-SA-4.0

                                                                  Palamos House #208
                                                                  66-67 High Street
                                                                  Lymington
                                                                  SO41 9AL United Kingdom
                                                                  www.embecosm.com




                                   Superoptimization
        Feasibility Study by Embecosm Limited, supported by Innovate UK

Superoptimization is the process of finding the optimal instruction sequence for a given
section of code. This contrasts with the traditional process of compilation, which optimizes
code but does not guarantee optimality.
This report explores the different types of superoptimizer, and how they could be
implemented in modern technologies.
This work was led by James Pallister, Research Engineer at Embecosm and supported by
Innovate UK under their Technology Inspired Feasibility Studies initiative.
For further information about this work, please contact Dr Jeremy Bennett, Chief Executive,
jeremy.bennett@embecosm.com.

1 Introduction
Superoptimization is a technique for producing perfectly optimal code, in place of the code we
currently get from compilers. This is typically done via a brute-force search of
every possible instruction sequence, checking whether each performs the desired actions and
accepting a sequence only if it is the optimal one.
The problem is clearly very difficult, exploding in size as the length of the instruction
sequence increases. However, superoptimization is gaining traction as a method of optimizing
programs. In this feasibility study we hope to find which techniques are extensible enough
to bring out of the academic community and use in a commercial setting. This means
that the superoptimizer needs to be able to handle all the corner cases that may arise in
production code. If it turns out that superoptimization is not currently feasible, by the end
of the project we will have a research roadmap describing what needs to be done next.
Superoptimization was first conceived by Henry Massalin [1], who conducted a brute-force
search through the arithmetic instructions of the Motorola 68000. The example given in the
original paper is the signum function, reproduced in the Example code section of this report.




The C function is compiled, giving a piece of code with multiple branches and comparisons.
This could perhaps be reduced by an expert assembly programmer, but was generally
accepted as close to optimal. When the superoptimizer was

run on this program, a much shorter version was found - 4 instructions long and with no
branching.




On the left is the instruction sequence found by the superoptimizer (Motorola 68000), with
the definitions of the instructions on the far right. In the middle are three cases, computed
step by step, to illustrate how the code sequence works. How the instructions compute their
result is not obvious: they exploit the carry flag ('x') in combination with arithmetic
instructions.




                                           Analysis

2 Types of superoptimizer
The most obvious choice for a superoptimizer is a brute force enumeration, simply searching
all possible sequences in cost order until a correct sequence is found. However, there are
other ways of performing the search, and these are also described below.

2.1   Brute force
The most common type of superoptimizer is a brute force optimizer, which enumerates each
instruction sequence in turn, testing each until a correct sequence is found. The order of
this search is important, since to guarantee optimality the sequences must be searched in
cost order.
The search order is simple for some metrics, such as code size, where in many cases cost
corresponds to the number of instructions in the sequence. A superoptimizer would enumerate
sequences with one instruction, followed by two instructions, and so on.
The disadvantage with this approach is that the search space scales exponentially with the
length of the instruction sequence, making long sequences intractable. The size of the
instruction set also greatly affects the superoptimization potential; this is discussed below
in 3.1 Instruction set choice. The exponential growth can be mitigated by several techniques
that remove redundant and incorrect instruction sequences.
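
As a rough illustration of this growth (a sketch only, under the simplifying assumptions that
every instruction has the same number of register slots and that immediate constants are
ignored), the number of candidate sequences of length N drawn from I instruction forms, each
with k register slots over R registers, is (I × R^k)^N:

    # Rough search-space estimate for a brute-force superoptimizer.
    # Assumed model (not from this report): I instruction forms, each with k
    # register slots drawn from R registers, and no immediate constants.
    def num_sequences(num_insns, num_regs, slots_per_insn, length):
        per_position = num_insns * num_regs ** slots_per_insn
        return per_position ** length

    # e.g. 50 instruction forms, 16 registers, 2 register slots per instruction:
    for n in range(1, 4):
        print(n, num_sequences(50, 16, 2, n))
    # 1 12800
    # 2 163840000
    # 3 2097152000000   -- the growth is exponential in the sequence length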

2.2   Machine learning
Recently machine learning has been applied to superoptimization to traverse the search
space efficiently, allowing the search to quickly identify efficient instruction sequences [2].
With this approach it is difficult to ensure optimality of the generated sequences, although
the sequences are typically very close to optimal and the method gives good results in a
shorter time frame than any other method.

2.3   Constraint solving
The constraint solving approach to superoptimization uses a solver (typically a SMT solver)
with a set of equations to find a efficient solution. The instructions are encoded into
constraints which the solver can interpret, and then the target function is represented using
this form of the instructions. The SMT solver typically converts these constraints into
boolean equations (a SAT problem), which is given to a SAT solver. A SAT solver then tries to
work out if a solution can be found for the equations. If a solution is found then this
describes the instruction sequence performs the target operation. An off-the-shelf solver
can then produce a solution [3].
SMT solvers are typically very efficient, containing many heuristics to speed up and guide
the search. This method allows the superoptimizer to increase in performance as the
corresponding solver is developed.
However, SMT solvers typically solve a decision problem, rather than an optimization
problem, meaning that the problem domain does not translate directly into SMT constraints
– an iterative approach must be taken. This involves asking the solver to prove that there is
no instruction sequence of size N which solves the problem. The SMT solver will either return
a counterexample (the instruction sequence) or report that the problem cannot be solved
given the constraints [4]. The parameter N can then be increased until the shortest
sequence is found.
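
The outer loop of this iterative approach can be sketched as follows. This is only a sketch:
the hypothetical helper exists_sequence_of_length stands in for the encoding of the
instructions and target function as constraints and the solver query itself, which are the
hard parts and are not shown.

    # Iterative search over sequence length using a decision-procedure query.
    # exists_sequence_of_length(n) is a hypothetical helper that asks the SMT
    # solver "is there a length-n sequence equivalent to the target?" and
    # returns the sequence if one exists, or None if the query is unsatisfiable.
    def shortest_sequence(exists_sequence_of_length, max_length):
        for n in range(1, max_length + 1):
            candidate = exists_sequence_of_length(n)
            if candidate is not None:
                return candidate      # the first satisfiable length is the shortest
        return None                   # no sequence within the length bound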




3 Potential efficiency
Currently superoptimization is limited to short sequences of mostly ALU based code, with
some techniques being able to handle memory operations and a few with branches. No
superoptimizer has attempted to create arbitrary control flow.
Superoptimization often finds optimizations where multiple different representations of a
number can be exploited. For example, if the number can be operated on as both bits and
as an arithmetic number, interesting optimizations can often be found. This is also possible
with floating point numbers, although no superoptimizer has attempted this yet.
One thing a superoptimizer can successfully find is branch-free replacements for short
sequences of code, allowing (usually) costly branches to be removed [5]. Many of the brute
force superoptimizers focus on this. However, it is still difficult for superoptimizers to
produce code with branches in it, and even more difficult if loops are considered.

3.1   Instruction set choice
The instruction set chosen as the target for superoptimization has a large effect on the
efficacy of the search – the length of the resulting solution and the time taken to find it.
There are a number of trade-offs to consider when choosing the instruction set, or subset of
the instruction set to superoptimize.
The number of instructions in the instruction set is the predominant factor in whether the
superoptimizer will find a solution or not, since a smaller set allows longer sequences to be
searched, and longer sequences can typically solve more complex problems. However, if each
instruction only performs a 'small' amount of computation, then the required sequence will
need to be longer. This means that the subset of instructions chosen should be as small as
possible, and consist of 'complex' operations. The notion of operation 'complexity' is
necessarily vague, since there is no good metric to describe the amount of computation an
operation requires.
The following graph shows the superoptimization results for several architectures using the
GNU Superoptimizer (GSO). This includes the inbuilt superoptimization target functions (134
functions), as well as the superoptimization benchmarks given later in this report (25
functions). Each stacked bar shows the number of target functions that could be
superoptimized, with the resulting number of instructions shown in colour.
The black line indicates the average length of the found solution for that architecture.




Overall this graph gives an idea of the complexity of the instruction set (or the subset
implemented in GSO) for each architecture. AVR has longer instruction sequences on
average, since it is the only 8-bit architecture and by far the simplest.
An important trend to notice in the graph is that architectures on the left (lower average
sequence length) typically have 3-operand instructions, whereas many on the right are 2-
operand (implied destination). By not modifying one of the source registers in an operation,
the superoptimizer can reuse that source again later if necessary. If the instruction set only
supports 2-operand addressing then the superoptimizer must find another way of preserving
the source register, typically by inserting an additional move instruction, thus increasing
the sequence size. For example, a 3-operand add r2, r0, r1 leaves both sources intact, whereas
a 2-operand instruction set needs mov r2, r0 followed by add r2, r1 to achieve the same effect.
The average sequence length is not a measure of how good the instruction set is – it results
from design choices made when creating the instruction set. The choice of instructions in
the AVR instruction set means that the processor has a smaller silicon area than the other
processors.
The importance of moving register contents around means that instruction sets with a
conditional move perform well. The Alpha architecture and PA-RISC (hppa) both have some
support for either conditional instructions or conditional moves.
The following graph shows a breakdown of the superoptimizer benchmarks for each
architecture. Only the 25 superoptimizer benchmarks are displayed, since these are expected
to be a better indicator of overall performance than the full set (GSO's initial functions are
heavily biased towards if-conversion type optimizations). This highlights the length of the
individual result for each superoptimizer trial.




As with the previous graph, there is a general increase, from left to right, in the number of
instructions required to solve the benchmarks. These graphs suggest that the subset of AVR
instructions implemented is the least complex, whereas the instructions from the Alpha
instruction set are the most complex.

4 Difficulties
Some constructs and instructions are challenging for a superoptimizer to deal with. These
are listed below.


Memory operations         Memory operations can be challenging, since they represent a large
                          additional state that must be kept track of. This is handled in [6] by
                          constraining the amount of memory that can be accessed to 256 bytes,
                          by only using the last byte of the memory address. This quick test has
                          aliasing problems in certain cases, which must be checked later with
                          more extensive verification.

                          Memory is also problematic in the constraint solving approach, due to

                          the large number of constraints it produces – a lookup table must be
                          created with enough entries to store each unique access. Each load is
                          then compared against the entries of the lookup table.

                          Volatility must also be considered when superoptimizing memory
                          accesses.
Control flow.             Branches greatly widen the scope of what the superoptimizer can
Branches                  generate. It is relatively simple to verify whether a section of loop free
                          code is equivalent to another section of code with a different branching
                          structure (i.e. allowing branching input to the superoptimizer).
                          Generating branching output is more challenging due to the large
                          number of possibilities.
Control flow.             Loops are extremely difficult to superoptimize, particularly due to the
Loops                     problem of verifying loop equivalence. Superoptimizing across loop
                          boundaries was shown to be possible in [7], however this kept the same
                          loop structure.
Large numbers of          Many of the register combinations are redundant if some of the
register combinations     registers are orthogonal. For example:

                           mov r0, r1                             mov r3, r2
                           add r1, r2, r2                         add r2, r1, r1

                          These two sequences are functionally equivalent, apart from the
                          registers the results start and end in.
Floating point            Floating point is challenging to superoptimize because it is particularly
                          difficult to verify if the output sequence is correct. The typical method of
                          using an SMT solver to prove an instruction sequence's equivalence does
                          not work for floating point (no available floating point theory).
Precoloured registers     Instructions which imply a certain source or destination register can be
                          problematic in certain circumstances, particularly for brute-force
                          superoptimizers. The canonical form solution for large numbers of
                          register combinations (given above) has problems coping with this
                          scenario, since the reduction cannot be applied to these registers. This is
                          less of a problem when converting existing sequences to canonical form;
                          the difficulties arise when trying to iteratively generate all sequences of
                          registers in canonical form. More research is needed to efficiently
                          generate sequences with precoloured registers.




                             Choice of code to superoptimize

The choice of code to superoptimize has a large impact on the effectiveness of the solution,
and the time taken to find it. A simple way to choose which areas of code should be
superoptimized is to profile, and find which regions of code are hot. However, some of these
will not be suitable for superoptimization – this section discusses some of these cases.

5 Memory accesses
If the code frequently accesses memory then it is likely going to be difficult for a
superoptimizer to utilize. Global memory accesses are often required, because it is the data
the code is operating on. Local accesses frequently cannot be removed either, since they
have dependencies across loop edges, or the computation of the value is a significant
distance from the target region.
The following dataflow DAG shows a basic block that may not be effective to superoptimize.
The green arrows indicate the sequence of memory operations. In this analysis the arrows
are marked conservatively – the volatility is unknown so the order should be preserved.




If an arithmetic-intensive portion of the basic block can be extracted from the rest of the
basic block, then just this section could be considered for superoptimization. This has the
advantage of cutting out the computation that has few possible gains, and focuses on what
superoptimizers are typically best at.
The basic block below has a highlighted region which could be fed through a superoptimizer,
avoiding the chain of memory accesses on the right.




6 Target Codes
Superoptimization is not an appropriate technique for optimizing everything – it generally
takes too much time to be used as a generic optimization. However, for certain types of code
it can be very beneficial. These are typically libraries which are used in many places, or
applications which will run for a very long time and are heavily dependent on certain
computations. In particular, for embedded systems, reducing the size of a library is very
important since space is heavily constrained.

libgcc                            This is the runtime library that provides various support
                                  functions to applications compiled by GCC. On embedded
                                  processors which do not support floating point, this
                                  library provides the emulation, as well as other support,
                                  such as division for processors without a divide
                                  instruction.

                                  As such this library is heavily used in some embedded
                                  applications, and superoptimization could bring great benefit
                                  by reducing the overhead of this library.
compiler-rt                       This library is similar to libgcc, but is used with LLVM
                                  instead.
softfloat                         This library emulates floating point instructions using
                                  integer and bitwise operations, for processors which do
                                  not have a hardware FPU. These functions could be ideal
                                  targets for superoptimization, since they are mostly
                                  computation bound, and not reliant on memory accesses.




                                         Example code

Henry Massalin first introduced superoptimization with the following example [1]:
             int signum(int x)
             {
                 if(x < 0)
                      return -1;
                 else if(x > 0)
                      return 1;
                 else
                      return 0;
             }
This function was roughly 8 instructions in length when naively compiled, but could be
reduced to 6 instructions by an expert programmer. Massalin's superoptimizer managed to
find the result in 4 instructions:
             add.l        d0, d0
             subx.l       d1, d1
             negx.l       d0
             addx.l       d1, d1
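
For comparison (an illustrative sketch, not part of the original paper), the same branch-free
idea can be expressed at source level, since each comparison evaluates to 0 or 1:

    # Branch-free signum at source level (illustrative only).
    # (x > 0) and (x < 0) each evaluate to 0 or 1, so their difference is
    # 1, 0 or -1, matching the signum function above.
    def signum(x):
        return (x > 0) - (x < 0)

    assert [signum(v) for v in (-5, 0, 7)] == [-1, 0, 1]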

7 Superoptimization benchmarks
Several small functions were identified in [3] as being good targets for superoptimization.
These are short functions which are easy to specify but have challenging implementations,
many originating in Hacker's Delight [8]. The implementations of these programs often rely
on interpreting the values in different representations and operating on them as such,
meaning that they can be difficult for traditional compilers to optimize. These same
properties make them tractable for a superoptimizer.
The benchmarks selected by Gulwani et al. are:

     Benchmark         Parameters                              Description
P1                 x                Clear the least significant bit.
P2                 x                Check whether x is of the form 2^n − 1.
P3                 x                Isolate the right most 1-bit.
P4                 x                Form a mask of set bits from the right most set bit to the least
                                    significant bit
P5                 x                Set all from the least significant bit up to the least significant
                                    set bit.
P6                 x                Set the least significant 0-bit.
P7                 x                Isolate the right most 0-bit.
P8                 x                Form a mask of set bits from the right most cleared bit to the
                                    least significant bit.
P9                 x                abs(x).
P10                x, y             Test if the number of leading zeroes is the same for x and y.
P11                x, y             Test if x has fewer leading zeroes than y.
P12                x, y             Test if x has fewer or equal leading zeroes than y.
P13                x                The signum function [1].


P14              x, y            Floor of the average of x and y.
P15              x, y            Ceil of the average of x and y.
P16              x, y            Max of x and y.
P17              x               Turn off the right most string of 1-bits.
P18              x               Check whether x is a power of two.
P19              x, m, k         Exchange two bitfields in x, where m is a mask marking the
                                 right most field, and k is the number of bits from the start of
                                 the first bitfield to the start of the second.

                                 E.g. x = 0x00A00B00, m = 0x00000F00, k = 12
                                         x' = 0x00B00A00
P20              x               The next higher unsigned number with the same number of set
                                 bits as x.
P21              x, a, b, c      Cycle through a, b and c, returning the next number where x =
                                 a, b or c.

                                 E.g. x = a, x' = b.
                                 E.g. x = c, x' = a.
P22              x               Compute the parity of x.
P23              x               Count the number of bits set in x.
P24              x               Round x up to the next power of 2.
P25              x, y            Compute the high part of the product of x and y.
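
For illustration (these are well-known Hacker's Delight style formulations, not solutions
taken from [3]), a few of the benchmarks above have short branch-free implementations,
assuming 32-bit two's-complement arithmetic:

    # Branch-free solutions to a few of the benchmarks, for illustration only.
    # Python integers are unbounded, so results are masked to 32 bits where the
    # benchmarks assume a fixed word size.
    MASK = 0xFFFFFFFF

    def p1_clear_rightmost_set_bit(x):
        return x & (x - 1) & MASK

    def p3_isolate_rightmost_set_bit(x):
        return x & -x & MASK

    def p18_is_power_of_two(x):
        # A power of two has exactly one set bit, so clearing it leaves zero.
        return x != 0 and (x & (x - 1)) == 0

    assert p1_clear_rightmost_set_bit(0b10110) == 0b10100
    assert p3_isolate_rightmost_set_bit(0b10110) == 0b00010
    assert p18_is_power_of_two(64) and not p18_is_power_of_two(96)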


8 Other superoptimized code
In addition to the benchmarks in the previous section, other code has been superoptimized.
Integer multiplication by a constant factor has been attempted.
Granlund et al. [5] used a superoptimizer to remove small branches and common if-then-
else conditions. These were later integrated into the GCC peephole pass.




                                   Superoptimizer designs

This section discusses the design of several superoptimizers, as well as presenting a new
design. These will all require varying amounts of additional research to realise (with brute
force requiring the least, and constructive requiring the most).

9 Brute force
The GNU Superoptimizer (GSO) uses a dynamic programming approach to achieve efficient
superoptimization. For a sequence of length N, it will enumerate each candidate for the first
instruction, computing its effect on a test vector. It will then recursively enumerate the
following instructions, passing the partially computed results to each subsequent instruction.
This prevents redundant calculation of the results, and quickly prunes sequences which
cannot be correct.
This method of dynamic programming results in generated sequences which are in
canonical form – again reducing the search space. However, the same drawbacks occur: in
this framework it is difficult to implement instruction sets with implicit sources or
destinations. Constraints on registers are also challenging to work with, while still ensuring
that all possible sequences can be explored.

9.1     Design
A similar design to GSO could be taken, while changing the architecture slightly to enable a
wider range of instructions to be easily implemented. This includes modifications to the
method of selecting registers, to allow instructions with implicit destinations to be
implemented, and modifications to handle arbitrary flag variables (currently GSO only
supports the carry flag).
The superoptimizer keeps a set of current registers and flags, which are used as a quick test
to exclude incorrect sequences.
Overall the superoptimizer follows a recursive design:
        function superopt(current_len, max_len, machine_state, sequence):
            if(current_len > max_len)
                if(machine_state matches target function)
                    verify the instruction sequence
                return
            for insn in instructions
                for registers in registers_lists
                    execute insn with registers on a copy of machine_state
                    append insn to sequence
                    superopt(current_len+1, max_len, copied_machine_state, sequence)
                    remove insn from sequence

        for i = 1 to ...
            superopt(1, i, blank_machine_state, empty_instruction_sequence)

This generates and tests instruction sequences in an iterative deepening depth-first search,
and reuses the values computed by the instructions. This search choice is a trade-off
between depth-first search and breadth-first search: plain depth-first search is not
appropriate since it does not traverse the sequences in approximate cost order, while a naïve
breadth-first search would use a very large amount of memory as the sequence length
increased.

9.1.1    Registers
The above algorithm needs a list of all of the register variants which can go with that
instruction sequence. This can be optimized by iterating through the registers in canonical
form. The following algorithm will iteratively update a list of registers so that they are always

in canonical form. The algorithm is initially given a list of 0's. The length of this list is equal
to the number of register slots in the instruction sequence.
        function iterate_register_list(reg_list, num_registers)
            n = length(reg_list)
            finished = 1
            for i = 0 to n – 1
                if reg_list[i] < max(reg_list[i+1 .. n]) + 1
                           and reg_list[i] < num_registers
                     reg_list[i]++
                     finished = 0
                     break
                else
                     reg_list[i] = 0
            return reg_list, finished

If the algorithm returns with finished set to one, then the iteration has wrapped around and
the next instruction should be selected. The set of registers should not include slots for
instructions which have an implicit register destination. For example, the AVR mul
instruction has two explicit register slots, and two implicit, fixed destinations. In this case
only the explicit register slots will be included in the list.
If the instruction has constraints on the set of registers that can be used in a slot, the
slot should not be included in the iterative canonicalisation, and should instead be looped
over by brute force. This is because the above algorithm does not generate the correct results
in this case (see research questions). For example, the AVR ldi instruction cannot use
registers r0-r15.
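
As a sketch of the underlying idea (a recursive enumerator rather than the in-place update
above, and assuming the usual convention that a slot may reuse any register already seen or
introduce exactly one fresh register), canonical register lists can be generated as follows:

    # Enumerate register assignments in canonical form: sequences that differ
    # only by a renaming of registers are generated exactly once.
    def canonical_register_lists(num_slots, num_registers):
        def rec(prefix, max_used):
            if len(prefix) == num_slots:
                yield list(prefix)
                return
            # allowed: any register seen so far, plus the next unused one
            limit = min(max_used + 1, num_registers - 1)
            for r in range(limit + 1):
                prefix.append(r)
                yield from rec(prefix, max(max_used, r))
                prefix.pop()
        yield from rec([], -1)

    # Example: 3 register slots drawn from 8 registers gives just 5 lists,
    # [0,0,0], [0,0,1], [0,1,0], [0,1,1], [0,1,2], rather than 8**3 = 512.
    for regs in canonical_register_lists(3, 8):
        print(regs)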

9.1.2    Improvements
There are many ways to reduce the search space; however, the reduction in search space is
a trade-off against the cost of implementing each technique.

Commutativity             Some instructions have commutative operations. These can be
                          exploited to reduce the search space by only considering one form
                          of the instruction. Commutativity also interacts with canonical
                          form, meaning that the ordering of the register slots can be
                          important.
Redundant outputs         This technique involves analyzing the generated instruction
                          sequence for dependencies that would have been eliminated in
                          fully optimized code.

                          The registers in use, and those used by the input sequence, are
                          known; therefore if a candidate sequence reads from a register that
                          has not been assigned to yet, it can be excluded. Writes to registers
                          which are never read from can also be removed.

                          These can be incorporated into the canonical register generator, by
                          giving it this information.
Machine state lookup      This exploits the fact that the majority of instruction sequences
                          are incorrect. The machine state is stored in a hash table, along
                          with the number of instructions (or cost) required to compute it.
                          Each time the superoptimizer is about to recurse, it first checks
                          whether the current machine state is in the table. If it is, and the
                          cost to reach that state is higher than that stored in the table,
                          then the resulting sequence cannot possibly be optimal (since
                          we've previously seen that machine state, and not found a result).


                          This leads to a 3-10x speed-up in certain circumstances. This
                          method needs tuning, since the hashtable lookup can easily slow
                          the search down. In particular, it is not beneficial to store or look
                          up the machine state if there is only one instruction left in the
                          current search.

                          This method essentially removes parts of the search space which
                          are known to be bad, typically occurring when some instruction
                          reordering does not affect the output of the sequence. This also
                          has the effect of quickly pruning sets of sequences which have
                          dead code in them.

                          Sometimes trivial sequences of instructions produce the same
                          output (e.g. add r0,-1 and sub r0,1) and if these are known to
                          be incorrect then they are excluded by this technique (a sketch of
                          this pruning is given after this table).
   Heuristics             Some heuristics can be used to change the order of the iterated
                          instructions. This does not reduce the search space, but may allow
                          the superoptimizer to find any solution quicker.
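
The machine state lookup described above can be sketched as follows (a minimal sketch,
assuming the machine state on the test vector can be reduced to a hashable tuple of register
and flag values):

    # Prune the search using previously seen machine states: if the same state
    # was already reached at equal or lower cost, the current partial sequence
    # cannot lead to a better result and the branch is cut.
    seen_states = {}

    def should_prune(machine_state, cost):
        key = tuple(machine_state)    # register and flag values on the test vector
        best = seen_states.get(key)
        if best is not None and best <= cost:
            return True               # reached before at no greater cost
        seen_states[key] = cost
        return False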




Illustration 1: Search space savings from not generating code with redundant reads and writes.
Illustration 2: Savings from implementing canonical form and commutativity.


9.1.3    Speed or Optimality
There is a certain amount of crossover with the machine learning approach if heuristics are
used to guide the search. In particular, a brute force superoptimizer has difficulty with the
possible range of constants – there are frequently too many possible constants to try them
all. GSO only uses a small range of constants, removing the guarantee of optimality but
speeding up the search. The set of constants is chosen carefully, based on constants which
appear frequently in code. The graph below shows the frequency of certain constants for a
range of benchmarks.




The graph shows spikes around powers of two, and powers of two minus one. The
superoptimizer could bias the search by preferring power-of-two constants, and then trying
other constants afterwards.
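
One possible ordering is sketched below (a sketch only; the exact constant set and order
would need to be derived from measurements such as the graph above):

    # Generate a search order for immediate constants that favours powers of two
    # and powers of two minus one, which occur most frequently in real code.
    def constant_search_order(bits=32, extra=range(-8, 9)):
        order = []
        for n in range(bits):
            for c in (1 << n, (1 << n) - 1):
                if c not in order:
                    order.append(c)
        for c in extra:
            if c not in order:
                order.append(c)
        return order

    # First few constants tried: 1, 0, 2, 4, 3, 8, 7, 16, 15, ...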

9.2       Advantages
      •    Simple to implement and a well known technique.
      •    Guarantee of optimal solution, provided the exploration order is sound.
      •    Can be parallelized easily.

9.3       Disadvantages
      •    Restricted to small sequences of code.
      •    Choosing the exact subset of the instruction set to implement in the superoptimizer,
           or the subset for superoptimization is hard.

9.4       Research Questions
      •    How can sequences be enumerated in cost order when the machine model is
           complex? This could require an amount of “fuzziness”, where, when a solution is
           found, the superoptimizer continues searching for a certain amount of extra cost.
      •    Can we build a fast way of recognizing areas of the search space which are never
           optimal, and quickly exclude them?
      •    Can the canonical form of registers be extended to cope with instructions that have
           constraints on the register slot?
      •    The machine state lookup improvement is heavily dependent on the initial values
           chosen for the registers. Is there a way of choosing a good set of values?
      •    What other techniques are there of improving a brute-force search?

10 Learning Optimizer
STOKE [9] is an optimizer based on machine learning techniques to guide the search for an
instruction sequence. This type of superoptimizer is able to find much longer code
sequences in the same amount of time (up to 16 instructions).




A similar design could be embedded into a compiler, since the search can utilize an existing
sequence as the starting point for the superoptimization. Then the sequence can be
permuted until a better sequence is found.

10.1   Design
A compiler pass, placed near the end of the compilation process, would use machine
learning techniques to attempt to improve the code. Initially the pass would work on
individual basic blocks, selected by a heuristic based on the length and type of instructions
in the basic block.
The chosen machine learning method is then applied to the instruction sequence (see the
next section), and the resulting sequence is tested (requiring the compiler pass to be able to
simulate instructions for the target machine). If the sequence passes the tests then it is
verified for correctness, and a cost model is applied. If the sequence is correct and performs
better, then either the search can stop there, integrating that sequence into the code, or the
search can continue and attempt to find better sequences. Otherwise the search continues
until the allotted time has expired.

10.2   Machine Learning
The most challenging part of this approach to superoptimization is choosing a fitness
function that guides the search towards a correct sequence. STOKE used a function which
measured the number of correct bits in the output registers, combined with a performance
model to guide it towards efficient solutions. This is necessary for creating solutions from
scratch, and when the search accepts an invalid sequence.
STOKE used a Markov chain Monte Carlo method (similar to simulated annealing) to choose
which sequence to try next. This approach mutates the instruction sequence, then accepts it
if the fitness of the solution has gone up. The solution is also accepted if it is less fit, with a
certain probability dependent on the value of the fitness.
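
The acceptance step can be sketched as follows (a sketch only; STOKE's actual proposal
distribution and acceptance function are more involved, and the temperature parameter here
is an assumption):

    import math
    import random

    # Metropolis-style acceptance: always keep an improvement, and keep a worse
    # candidate with a probability that decays with how much worse it is.
    def accept(current_fitness, candidate_fitness, temperature=1.0):
        if candidate_fitness >= current_fitness:
            return True
        delta = current_fitness - candidate_fitness
        return random.random() < math.exp(-delta / temperature)
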
Other learning schemes could be used, such as genetic programming or genetic algorithms.
The performance model can easily be swapped out for other metrics, such as code size and
energy consumption.

10.3   Advantages
•      Able to explore longer instruction sequences.
•      Reduced time to find a solution compared to other methods.
•      Giving the compiler a longer time to perform the optimizations will generally result in
       better solutions.
•      Some machine learning techniques can be parallelized easily, allowing faster
       exploration.

10.4   Disadvantages
•      No guarantee of optimality.
•      There can be many parameters to tune, which greatly affect the time and solution
       found by the optimizer.

10.5   Research Questions
•      If the code was modified since the last compilation, can the previous results of the
       search be used as an initial seed, and guided towards a correct sequence?
•      Which method of machine learning works well on this kind of problem?



•      How close to optimality does this approach come? How long does the superoptimizer
       have to be run to get within X% of optimal?

11 Constructive
This section presents the experimental design for a constructive superoptimizer. The
premise of the superoptimizer is to split the synthesis task into two components – creating
an optimal dataflow DAG, then optimally lowering this DAG to sequential instructions.
This approach allows unsupported instruction sequences to be superoptimized, by
superoptimizing 'around' them.

11.1     DAG superoptimizer
This portion of the superoptimizer takes a sequence of code as input, and produces an
optimal DAG as output.
1.     Compile the code at the desired optimization level.
2.     Construct a control flow graph from the code, and perform liveness analysis on the
       code.
3.     For the targeted sections of code, construct a dataflow DAG for each basic block.
4.     Simplify the DAG, so that register assignments are removed, and implied by the DAG.
5.     Compose the multiple basic blocks into a large DAG.
6.     Transform the DAG into SMT constraints – this is used for the verify step.
7.     Construct the skeleton DAG.
8.     Iteratively perform Counter Example Guided Inductive Synthesis to fill in the holes in
       the DAG.

11.1.1    The dataflow DAG
The dataflow DAG is constructed per basic block, and describes the dataflow within that
basic block. It is constructed by placing an edge between the instruction that generates a
value and each instruction that uses that value. If no instruction in the block creates a value
needed by an instruction, then the register holding that value must be live on entry to the
block. The liveness analysis also provides a set of registers which are live on exit from the
block. If an instruction creates a value which is not used in the basic block, the registers
which are live on exit indicate whether the instruction's output should be on an edge out of
the block.
Once the dataflow DAG has been constructed, the register names are removed. All of the
memory operations are represented sequentially unless volatility can be ignored. If volatility
is to be ignored, alias analysis can be performed to break the dependencies.
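
A minimal sketch of the construction (assuming each instruction is represented simply as a
destination register, an operation and a list of source registers; liveness and memory ordering
are omitted):

    # Build a dataflow DAG for one basic block: an edge runs from the instruction
    # that defines a value to each instruction that uses it. Values read before
    # any definition in the block must be live on entry.
    def build_dataflow_dag(block):
        last_def = {}                 # register -> index of defining instruction
        edges, live_in = set(), set()
        for i, (dest, op, srcs) in enumerate(block):
            for src in srcs:
                if src in last_def:
                    edges.add((last_def[src], i))    # producer -> consumer
                else:
                    live_in.add(src)                 # value defined outside the block
            last_def[dest] = i
        return edges, live_in

    block = [("r2", "add", ["r0", "r1"]),
             ("r3", "mul", ["r2", "r2"]),
             ("r2", "sub", ["r3", "r0"])]
    print(build_dataflow_dag(block))
    # ({(0, 1), (1, 2)}, {'r0', 'r1'}): the add feeds the mul, the mul feeds the
    # sub, and r0 and r1 are live on entry to the block.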

11.1.2    DAG composition
Multiple DAGs can be composed together by linking the registers live into each DAG with the
registers generated by the previous basic blocks. Where there are multiple previous basic
blocks, there would be multiple edges into a node for a single value; instead, intermediary
multiplexor (phi) nodes should be inserted. These multiplexors choose the value to propagate
based on the selector's value, which is chosen based on conditions generated in previous
basic blocks.

11.1.3    Skeleton DAG
The skeleton dataflow DAG is a reduced form of the previously constructed DAG, with only
the portions which are difficult to superoptimize remaining. For example, if memory

operations were not implemented in the superoptimizer, the memory operations and their
connections would remain in this DAG. The following diagrams show examples of the
skeleton DAG. The blue arrows indicate a register dependency, red indicates a dependency
on a flag and green indicates a memory dependency.

               Dataflow DAG                                      Skeleton DAG




                                                 This dataflow DAG only operates on r24 (r0
                                                 is used by the multiply, r1 is cleared after
                                                 the multiply). Therefore only r24 needs to
                                                 be considered for input and output, along
                                                 with the flag dependency feeding the
                                                 branch.




     Dataflow DAG                                      Skeleton DAG




                                       In this example, the chain of memory
                                       accesses is kept identical, with its
                                       dependencies. The rest of the instructions
                                       can be filled in using the values provided
                                       by the existing instructions.




                                       This example indicates how multiple basic
                                       blocks can be combined into a single
                                       skeleton DAG. Two outputs for r24 must
                                       be produced, along with a condition that
                                       selects between them. This is selected by a
                                       multiplexor component in the DAG. The “br
                                       A” and “br B” indicate the next destination
                                       of the code is also dependent on the flags.

                                       This form of the DAG allows the potential
                                       for control flow to be encoded, then
                                       subsequently split into separate basic
                                       blocks or a conditional move performed (as
                                       decided by the lowering step).




11.1.4    Filling in the skeleton DAG
Both the dataflow DAG and the skeleton DAG are converted into SMT constraints. The
corresponding points on the graphs can then be asserted equal – the inputs, the outputs,
and the intermediary nodes in the skeleton DAG.
Counter Example Guided Inductive Synthesis (CEGIS) can then be performed, with the
instructions in the instruction set as components. With an iterative increase in the number
of components available, this will result in an optimal solution that makes the skeleton
DAG and dataflow DAG equivalent.
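
The CEGIS loop itself can be sketched as follows (a sketch only: synthesize and verify are
hypothetical helpers standing in for solver queries over the DAG encodings described above):

    # Counter Example Guided Inductive Synthesis, in outline.
    # synthesize(examples): return a filling of the skeleton DAG that is correct
    #   on the given example inputs, or None if no filling exists.
    # verify(candidate): return None if the candidate matches the dataflow DAG
    #   on all inputs, otherwise return a counterexample input.
    def cegis(synthesize, verify, initial_examples):
        examples = list(initial_examples)
        while True:
            candidate = synthesize(examples)
            if candidate is None:
                return None                    # no solution with these components
            counterexample = verify(candidate)
            if counterexample is None:
                return candidate               # equivalent on all inputs
            examples.append(counterexample)    # refine the examples and retry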

11.2     Lowering superoptimizer
This takes the optimal DAG as input, and outputs a sequence of assembly-level
instructions. This can be performed by using integer linear programming (ILP) to represent
the ordering of the instructions and perform optimal register allocation. Additionally this
can optimize the splitting of the computed DAG into multiple basic blocks. By using an ILP
solver, the metric for superoptimization can be chosen:
•      Code size: minimize the control flow such that the total code size is small. This will
       likely result in code which performs slowly, as if-conversion may be done on the entire
       set of blocks.
•      Performance: minimize the average path length when the control flow is included. This
       could use profile data on the range of values the multiplexor expressions take,
       allowing control flow which minimizes the execution time to be inserted.
More research is needed on the exact formulation of the ILP problem (it is covered in detail
in [10]–[12]); however, the DAG itself can be lowered using traditional compiler techniques
(register allocation, scheduling).

11.3     Advantages
•      Decomposes superoptimization into smaller more manageable problems. This allows
       the compiler to substitute in non-optimal techniques if a particular subproblem is
       taking too long.
•      Able to generate multiple basic blocks.
•      Possibly able to handle large sequences of code (20 instructions).

11.4     Disadvantages
•      Complex to implement.
•      Currently untested and may not provide the speed ups necessary.
•      Difficult to parallelize (few efficient parallel SAT solvers exist).
•      Decomposition of the problem may result in no guarantee of optimality.

11.5     Research Questions
There are several additional items to be researched before this design can be realized.
•      Does a minimal DAG correspond to a minimal instruction sequence? Can the DAG
       synthesis be biased towards DAGs which make the ILP problem easier?
•      What is the best way of formulating the ILP problem?




                                             Conclusion

Today there are only a handful of areas where superoptimization can be applied effectively
in a commercial context. In particular, there are significant limitations on producing code
sequences which include memory and branching operations. This paper has shown how the
commercial value of superoptimization can be increased in both the long and short term.
Three different designs for superoptimizers were analyzed: those based on brute force, those
applying machine learning, and those using constructive approaches. Overall, the brute-
force superoptimizer has the lowest risk, being built on well known technologies. However, it
has the most limitations, being restricted to a small number of instructions. At the other
end of the scale, a superoptimizer taking a constructive approach with SMT solvers may be
able to scale to much longer sequences; however, the technology is in its infancy and needs
extensive development. The machine learning based superoptimizer has the most potential
for immediate benefits, requiring a smaller amount of additional research to develop and
offering potentially large gains. This superoptimizer may not be able to guarantee that an
optimal solution is found, but can quickly explore possible solutions and improve code
size/speed/energy.
The features of the instruction set greatly affect the efficacy of superoptimization, and there
are a number of code constructs which cannot yet be superoptimized. Some of these are
fundamentally difficult, such as a superoptimizer producing loops, while others are
implementation difficulties, such as supporting memory accesses.
As a result we can see three stages to the future deployment of superoptimization.
1.   Immediate use of the GNU superoptimizer (a brute force tool) to help customers
     optimize key sections of code, with short-term improvements to make it easy to
     generate goal functions automatically from source assembler. This is quite a niche
     application, and its commercial value is largely in cementing Embecosm's technical
     credibility.
2.   In the medium term, application of DAG analysis to increase the applicability of the
     brute force approach and development of machine learning technology to make the
     brute force approach more directed. This is still relatively high risk, and would be
     appropriate for feasibility funding by Innovate UK.
3.   In the longer term, investment in constructive approaches, with academic study to
     determine how to handle loop generation, memory access and floating point. The
     starting point for this should be research council funded academic projects, but ones in
     which Embecosm would provide some industrial support.
This study has identified some areas with scope to increase the commercial relevance of
superoptimization with relatively modest R&D investment, albeit at quite high risk. There is
much greater scope in the longer term, but this will in the first instance require major
academic investment to resolve issues with memory access, branching and floating point.
This has been a very high risk study. Embecosm would like to express our thanks to
Innovate UK for their support, without which we could not have carried out this work.




                                         References

[1]    H. Massalin, “Superoptimizer - A Look at the Smallest Program,” ACM SIGARCH
       Comput. Archit. News, pp. 122–126, 1987.
[2]    E. Schkufza, R. Sharma, and A. Aiken, “Stochastic optimization of floating-point
       programs with tunable precision,” in Proceedings of the 35th ACM SIGPLAN
       Conference on Programming Language Design and Implementation - PLDI ’14, 2014,
       pp. 53–64.
[3]    S. Gulwani, S. Jha, A. Tiwari, and R. Venkatesan, “Synthesis of loop-free programs,”
       ACM SIGPLAN Not., vol. 47, no. 6, p. 62, Aug. 2012.
[4]    R. Joshi, G. Nelson, and K. Randall, “Denali: A Goal-directed Superoptimizer.”
[5]    T. Granlund and R. Kenner, “Eliminating branches using a superoptimizer and the
       GNU C compiler,” in Proceedings of the ACM SIGPLAN 1992 conference on
       Programming language design and implementation - PLDI ’92, 1992, pp. 341–352.
[6]    S. Bansal and A. Aiken, “Automatic generation of peephole superoptimizers,” ACM
       SIGOPS Oper. Syst. Rev., vol. 40, no. 5, p. 394, Oct. 2006.
[7]    R. Sharma, E. Schkufza, B. Churchill, and A. Aiken, “Data-driven equivalence
       checking,” in Proceedings of the 2013 ACM SIGPLAN international conference on Object
       oriented programming systems languages & applications - OOPSLA ’13, 2013, vol. 48,
       no. 10, pp. 391–406.
[8]    H. S. Warren, Hacker’s Delight. Boston, MA, USA: Addison-Wesley Longman
       Publishing Co., Inc., 2002.
[9]    E. Schkufza, R. Sharma, and A. Aiken, “Stochastic superoptimization,” in
       Architectural Support for Programming Languages and Operating Systems, 2013, p.
       305.
[10]   D. W. Goodwin and K. D. Wilken, “Optimal and Near-optimal Global Register
       Allocation Using 0-1 Integer Programming,” vol. 26, no. January 1995, pp. 929–965,
       1996.
[11]   T. C. Wilson, G. W. Grewal, and D. K. Banerji, “An ILP Solution for Simultaneous
       Scheduling, Allocation, and Binding in Multiple Block Synthesis,” 1994.
[12]   D. Kastner and M. Langenbach, “Code Optimization by Integer Linear Programming,”
       in Compiler Construction, 1999, pp. 122–137.



