Palamos House #208 66-67 High Street Lymington SO41 9AL United Kingdom www.embecosm.com Superoptimization Feasibility Study by Embecosm Limited, supported by Innovate UK Superoptimization is the process of finding the optimal instruction sequence for a given section of code. This contrast with the traditional process of compilation, which optimizes code but does not ensure optimality. This report explores the different types of superoptimizer, and how they could be implemented in modern technologies. This work was led by James Pallister, Research Engineer at Embecosm and supported by Innovate UK under their Technology Inspired Feasibility Studies initiative. For further information about this work, please contact Dr Jeremy Bennett, Chief Executive, jeremy.bennett@embecosm.com. 1 Introduction Superoptimization is an idea which produces perfectly optimal code, in place of the code we currently have generated by compilers. This is typically done via a brute-force search of every possible instruction sequence, checking whether it performs the desired actions and accepting the sequence if it is the optimal one. The problem is clearly very difficult, exploding in size as the length of the instruction sequence increases. However, superoptimization gaining traction as a method of optimizing programs. In this feasibility study we hope to find which techniques are extensible enough to bring out of the academic community to be using in a commercial setting. This means that the superoptimizer needs to be able to handle all the corner cases that may arise in production code. If it turns out that superoptimization is not currently feasibly, by the end of the project we will have a research roadmap describing what will need to be done next. Superoptimization was first conceived by Henry Massalin[1], where a brute-force search was conducted through the arithmetic instructions of the Motorola 68000. The example given in the original paper is given below. The function on the left is compiled, giving the piece of code on the right, with multiple branches and comparisons. This could perhaps be reduced by an expert assembly programmer, but was generally accepted as close to optimal. When the superoptimizer was Freely available under a 1 Creative Commons Attribution-ShareAlike 4.0 International License run on this program, a much shorter version was found - 4 instructions long and with no branching. On the left is the instruction sequence found by the superoptimizer (Motorola 68000), with the definitions of the instructions on the far right. In the middle are three cases, computed step by step to illustrate how the code sequence works. How to instructions compute their result is not obvious exploiting the carry flag ('x') in combination with arithmetic instructions. Freely available under a 2 Creative Commons Attribution-ShareAlike 4.0 International License Analysis 2 Types of superoptimizer The most obvious choice for a superoptimizer is a brute force enumeration, simply searching all possible sequences in cost order until a correct sequence is found. However, there are other ways of performing the search, and these are also described below. 2.1 Brute force The most common type of superoptimizer is a brute force optimizer, which enumerates each instruction sequence in turn, testing each until a correct sequence is found. The order of this search is important, since to guarantee optimality the sequences must be searched in cost order. The search order is simple for some metrics, such as code size, where in many cases this corresponds to number of instructions in the sequences. A superoptimizer would enumerate sequences with one instruction, followed by two instructions, etc. The disadvantage with this approach is the search space scales exponentially with the length of the instructions sequences, making long sequences intractable. The size of the instruction set also greatly affects the superoptimization potential. This is discussed below in 3.1 Potential efficiency: Instruction set choice. The exponential growth can be mitigated by several techniques to remove redundant instruction sequences and incorrect sequences. 2.2 Machine learning Recently machine learning has been applied to superoptimization to traverse the search space efficiently, allowing the search to quickly identify efficient instruction sequences [2]. With this approach it is difficult to ensure optimality of the generated sequences, although the sequences are typically very close to optimal and the method gives good results in a shorter time frame than any other method. 2.3 Constraint solving The constraint solving approach to superoptimization uses a solver (typically a SMT solver) with a set of equations to find a efficient solution. The instructions are encoded into constraints which the solver can interpret, and then the target function is represented using this form of the instructions. The SMT solver typically converts these constraints into boolean equations (a SAT problem), which is given to a SAT solver. A SAT solver then tries to work out if a solution can be found for the equations. If a solution is found then this describes the instruction sequence performs the target operation. An off-the-shelf solver can then produce a solution [3]. SMT solvers are typically very efficient, containing many heuristics to speed up and guide the search. This method allows the superoptimizer to increase in performance as the corresponding solver is developed. However, SMT solvers typically solve a decision problem, rather than an optimization problem, meaning that the problem domain does not simply translate into SMT constraints – an iterative approach must be taken. This involves asking the solver “Prove there is no instruction sequence of size N to solve the problem”. The SMT solver will either return a counter example (the instruction sequence) or inform that the problem cannot be solved given the constraints [4]. The parameter N can then be increased until the shortest sequences is found. Freely available under a 3 Creative Commons Attribution-ShareAlike 4.0 International License 3 Potential efficiency Currently superoptimization is limited to short sequences of mostly ALU based code, with some techniques being able to handle memory operations and a few with branches. No superoptimizer has attempted to create arbitrary control flow. Superoptimization often finds optimizations where multiple different representations of a number can be exploited. For example, if the number can be operated on as both bits and as an arithmetic number, interesting optimizations can often be found. This is also possible with floating point numbers, although no superoptimizer has attempted this yet. One thing a superoptimizer can successfully find is branch-free replacements for short sequences of code, allowing (usually) costly branches to be removed [5]. This is focused on by many of the brute force superoptimizers. However, it is still difficult for superoptimizers to produce code with branches in them, and even more difficult if loops are considered. 3.1 Instruction set choice The instruction set chosen as the target for superoptimization has a large effect on the efficacy of the search – the length of the resulting solution and the time taken to find it. There are a number of trade-offs to consider when choosing the instruction set, or subset of the instruction set to superoptimize. The number of instructions in the instruction set is the predominant factor in whether the superoptimizer will find a solution or not, since longer sequences can typically solve more complex problems. However, if each instruction only performs a 'small' amount of computation, then the required result will be need to be longer. This means that the subset of instructions chosen should be as small as possible, and consist of 'complex' operations. The notion of operation 'complexity' is necessarily vague, since there is not a good metric to describe the amount of computation an operation requires. The following graphs shows the superoptimization results for several architectures using the GNU superoptimizer. This includes the inbuilt superoptimization target functions (134 functions), as well as the superoptimization benchmarks given in the following section (25 functions). Each stacked bar shows the number of target functions that were able to be superoptimized, with the resulting number of instructions shown in colour. The black line indicates the average length of the found solution for that architecture. Freely available under a 4 Creative Commons Attribution-ShareAlike 4.0 International License Overall this graph gives an idea of the complexity of the instruction set (or the subset implemented in GSO) for each architecture. AVR has longer instruction sequences on average, due to it being the only 8-bit architecture and by far the simplest. An important trend to notice in the graph is that architectures on the left (lower average sequence length) typically have 3 operand instructions, whereas many on the right are 2 operand (implied destination). By not modifying one of the source registers in some operations, the superoptimizer can make reuse of this source again if necessary. If the instruction set only supports 2 operand addressing then the superoptimizer must find another way of preserving the source register, typically by inserting an additional move instruction, thus increasing the sequence size. The average sequence length is not a measure of how good the instruction set – it is resultant from design choices made when creating the instruction set. The choice of instructions in the AVR instruction set means that the processor have a smaller silicon area than the other processors. The importance of moving register contents around results in instruction sets with a conditional move performing well. The alpha architecture, and PA-RISC (hppa) both have some support for either conditional instructions or conditional moves. The following graphs shows a breakdown of the superoptimizer benchmarks for each architecture. Only the 25 superoptimizer benchmarks are displayed, since it is expected these are a better indicator for the overall performance than the full set (GSO's initial functions are heavily biased towards if-conversion type optimizations). This highlights the length of the individual result for each superoptimizer trial. Freely available under a 5 Creative Commons Attribution-ShareAlike 4.0 International License As with the previous graph, there is a general increase in the number of instructions required to solve the benchmark, from left to right. From these graphs it would suggest the subset of AVR instructions implemented is the least complex, whereas the instructions from the Alpha instruction set are the most complex. 4 Difficulties Some constructs and instructions are challenging for a superoptimizer to deal with. These are listed below. Memory operations Memory operations can be challenging, since it represents a large additional state that must be kept track of. This is handled in [6] by constraining the amount of memory that can be accessed to 256 bytes, by only using the last byte of the memory address. This quick test has aliasing problems in certain cases, which must be checked later with more extensive verification. Memory is also problematic in the constraint solving approach, due to Freely available under a 6 Creative Commons Attribution-ShareAlike 4.0 International License the large number of constraints it produces – a look up table must be created with enough entries to store each unique accesses. Each entry is then compared to the lookup table when a load is performed. Volatility must also be considered when superoptimizing memory accesses. Control flow. Branches greatly widen the scope of what the superoptimizer can Branches generate. It is relatively simple to verify whether a section of loop free code is equivalent to another section of code with a different branching structure (i.e. allowing branching input to the superoptimizer). Generating branching output is more challenging due to the large number of possibilities. Control flow. Loops are extremely difficult to superoptimize, particularly due to the Loops problem of verifying loop equivalence. Superoptimizing across loop boundaries was shown to be possible in [7], however this kept the same loop structure. Large numbers of Many of the registers combinations are redundant, if some of the register combinations registers are orthogonal. For example: mov r0, r1 mov r3, r2 add r1, r2, r2 add r2, r1, r1 These two sequences are functionally equivalent, apart from the registers the results start and end in. Floating point Floating point is challenging to superoptimize because it is particularly difficult to verify if the output sequence is correct. The typical method of using an SMT solver to prove an instruction sequence's equivalence does not work for floating point (no available floating point theory). Precoloured registers Instructions which imply a certain source or destination registers can be problematic in certain circumstances, particularly for brute-force superoptimizers. The canonical form solution for a large number of register combinations (given above) has problems coping with this scenario, since the reduction cannot be applied to these registers. This is less a problem for converting existing sequences to canonical form but the difficulties arise when trying to iteratively generate all sequences of registers in canonical form. More research is needed to efficiently generate precoloured registers. Freely available under a 7 Creative Commons Attribution-ShareAlike 4.0 International License Choice of code to superoptimize The choice of code to superoptimize has a large impact on the effectiveness of the solution, and the time taken to find it. A simple way to choose which areas of code should be superoptimized is to profile, and find which regions of code are hot. However, some of these will not be suitable for superoptimization – this section discusses some of these cases. 5 Memory accesses If the code frequently accesses memory then it is likely going to be difficult for a superoptimizer to utilize. Global memory accesses are often required, because it is the data the code is operating on. Local accesses frequently cannot be removed either, since they have dependencies across loop edges, or the computation of the value is a significant distance from the target region. The following dataflow DAG shows a basic block that may not be effective to superoptimize. The green arrows indicate the sequence of memory operations. In this analysis the arrows are marked conservatively – the volatility is unknown so the order should be preserved. If an arithmetic intensive portifc13208on of the basic block can be extracted from the rest of the basic block, then just this section could be considered for superoptimization. This has the advantage of cutting out the computation that has few possible gains, and focuses on what superoptimizers are typically most performant at. The basic block below has a highlight region which could be fed through a superoptimizer, avoiding the chain of memory accesses on the right. Freely available under a 8 Creative Commons Attribution-ShareAlike 4.0 International License 6 Target Codes Superoptimization is not an appropriate technique for optimizing everything – it generally takes too much time to be used as a generic optimization. However for certain types of code it can be very beneficial. These are typically libraries which get used in many places, or applications which will run for a very long time and are heavily dependent on some forms of compute. In particular for embedded systems, reducing the size of a library is very important since space is heavily constrained. libgcc This is the runtime library that provides various support functions to applications compiled by GCC. On embedded processors which do not support floating point, this library provides the emulation, as well as other support, such as division for processors without a divide instruction. As such this library is heavily used in some embedded applications, and superoptimization could greatly benefit by reducing the overhead of this library. compiler-rt This library is similar to GCC, but is used with LLVM instead. softfloat This library emulates floating point instructions using integer and bitwise operations, for processors which do not have a hardware FPU. These functions could be ideal targets for superoptimization, since they are mostly computation bound, and not reliant on memory accesses. Freely available under a 9 Creative Commons Attribution-ShareAlike 4.0 International License Example code Henry Massalin first introduced superoptimization with the following example [1]: int signum(int x) { if(x < 0) return -1; else if(x > 0) return 1; else return 0; } This function was roughly 8 instructions in length when naively compiled, but could be reduced to 6 instructions by an expert programmer. Massalin's superoptimizer managed to find the result in 4 instructions: add.l d0, d0 subx.l d1, d1 negx.l d0 addx.l d1, d1 7 Superoptimization benchmarks Several small functions were identified in [3] as being good targets for superoptimization. These were short functions easy to specify but had challenging implementations, many originating in Hacker's Delight [8]. The implementations of these programs often rely on interpreting the values in different representations and operating on them as such, meaning that it can be difficult for traditional compilers to solve. However, these difficulties make them tractable with a superoptimizer. The benchmarks selected by Gulwani et al. are: Benchmark Parameters Description P1 x Clear the least significant bit. P2 x n Check whether x is of the form 2 −1 . P3 x Isolate the right most 1-bit. P4 x Form a mask of set bits from the right most set bit to the least significant bit P5 x Set all from the least significant bit up to the least significant set bit. P6 x Set the least significant 0-bit. P7 x Isolate the right most 0-bit. P8 x Form a mast of set bits from the right most cleared bit to the least significant bit. P9 x abs(x). P10 x, y Test if the number of leading zeroes is the same for x and y. P11 x, y Test if x has fewer leading zeroes than y. P12 x, y Test if x has fewer or equal leading zeroes than y. P13 x The signum function [1]. Freely available under a 10 Creative Commons Attribution-ShareAlike 4.0 International License P14 x, y Floor of x and y. P15 x, y Ceil of x and y. P16 x, y Max of x and y. P17 x Turn of the right most string of 1-bits. P18 x Check whether x is a power of two. P19 x, m, k Exchange two bitfields in x, where m as a mask marking the right most field, and k is the number of bits from the start of the first bitfield to the start of the second. E.g. x = 0x00A00B00, m = 0x00000F00, k = 12 x' = 0x00B00A00 P20 x The next higher unsigned number with the same number of set bits as x. P21 x, a, b, c Cycle through a, b and c, returning the next number where x = a, b or c. E.g. x = a, x' = b. E.g. x = c, x' = a. P22 x Compute the parity of x. P23 x Count the number of bits set in x. P24 x Round x up to the next power of 2. P25 x, y Compute the high part of the product of x and y. 8 Other superoptimized code In addition to the benchmarks in the previous section, other codes have been superoptimized. Integer multiplication by a constant factor has been attempted. Granlund et al. [5] used a superoptimizer to remove small branches and common if-then- else conditions. These were later integrated into GCC peephole pass. Freely available under a 11 Creative Commons Attribution-ShareAlike 4.0 International License Superoptimizer designs This section discusses the design of several superoptimizers, as well as presenting a new design. These all will require varying amounts of additional research to realise (with brute- force requiring the least, and constructive requiring the most). 9 Brute force The GNU Superoptimizer (GSO) uses a dynamic programming approach to achieve efficient superoptimization. For a sequence of length N, it will enumerate the first instructions, computing the effect on a test vector. It will then recursively enumerate sequential instructions, passing the partial computed results to each subsequent instruction. The prevents redundant calculation of the results, and quickly prunes sequences which cannot be correct. This method of dynamic programming results in generated sequences which are in canonical form – again reducing the search space. However, the same drawbacks occur and in this framework it is difficult to implement instruction sets with implicit sources or destinations. Constraints on registers are also challenging to work with, while also ensuring that all possible sequences can be explored. 9.1 Design A similar design to GSO could be taken, while changing the architecture slightly to enable a wider range of instructions to be easily implemented. This includes modifications to the method of selecting registers, to allow instructions with implicit destinations to be implemented, and modifications to any flag variable (currently GSO only supports the carry flag). The superoptimizer keeps a set of current registers and flags, which are used as a quick test to exclude incorrect sequences. Overall the superoptimizer follows a recursive design: function superopt(current_len, max_len, machine_state, sequence): if(current_len > max_len) if(machine_state matches target function) verify the instruction sequence return for insn in instructions for registers in registers_lists execute insn with registers on machine_state append insn to sequence superopt(current_len+1, max_len, machine_state) for i = 1 to ... superopt(1, i, blank_machine_state, empty_instruction_sequence) This generates and tests instruction sequences in a iterative deepening depth-first search, and reuses the values computed by the instructions. This search choice is a trade-off between depth-first search and breadth-first search: depth-first is not appropriate since it does not traverse the sequences in approximate cost order. A naïve breadth-first search would use a very large amount of memory as the sequence length increased. 9.1.1 Registers The above algorithm needs a list of all of the register variants which can go with that instruction sequence. This can be optimized by iterating through the registers in canonical form. The following algorithm will iteratively update a list of registers so that they are always Freely available under a 12 Creative Commons Attribution-ShareAlike 4.0 International License in canonical form. The algorithm is initially given a list of 0's. The length of this list is equal to the number of register slots in the instruction sequence. function iterate_register_list(reg_list, num_registers) n = length(reg_list) finished = 1 for i = 0 to n – 1 if reg_list[i] < max(reg_list[i+1 .. n]) + 1 and reg_list[i] < num_registers reg_list[i]++ finished = 0 break else reg_list[i] = 0 return reg_list, finished If the algorithm returns with finished set to one, then the algorithm has wrapped around an the next instruction should be selected. The set of registers should not include slots for instructions which have an implicit register destination. For example, the AVR mul instruction has two explicit register slots, and two implicit, fixed destinations. In this case only the explicit register slots will be included in the list. If the instruction has constraints on the set of registers that can be used in that slot, the slot should not be included in the iterative canonicalisation, and should be looped over in brute-force. This is due to the above algorithm not generating the correct results in this case (see research questions). For example, the AVR ldi instruction cannot use registers r0-r15. 9.1.2 Improvements There are many ways to reduce the search space, however, the reduction in search space is a trade off with their implementation efficiency. Commutativity Some instructions have commutative operations. These can be exploited to reduce the search space by only considering one form of the instruction. Commutativity also interacts with canonical form, meaning that the ordering of the register slots in can be important. Redundant outputs This technique involves analyzing the generated instruction sequence for dependencies that would have been eliminated in fully optimized code. The number of registers in use, and used by the input sequence is known, therefore if the sequence reads from a register that has not been assigned to yet, it can be excluded. Writes to registers which are never read from can be removed. These can be incorporated into the canonical register generator, by giving it information Machine state lookup This exploits the fact that the majority of instruction sequences are incorrect. The machine state is stored in a hash table, along with the amount of instructions (or cost) required to compute it. Each time the superoptimizer is about to recurse, it first checks whether the current machine state is in the table. If it is, and the cost to reach that state is higher than that stored in the table, then the resulting sequence cannot possibly be optimal (since we've previously seen that machine state, and not found a result). Freely available under a 13 Creative Commons Attribution-ShareAlike 4.0 International License This leads to 3-10x speed up in certain circumstances. This method needs tuning, since the hashtable lookup can easily slow the search down. In particular, it is not beneficial to store or lookup the machine state if there is only 1 instruction left in the current search. This method essentially removes parts of the search space which are known to be bad, typically occurring when some instruction reordering does not affect the output of the sequence. This also has the effect of quickly pruning sets of sequences which have dead code in them. Sometimes trivial sequences of instructions produce the same output (e.g. add r0,-1 and sub r0,1) and if these are known to be incorrect then are excluded by this technique. Heuristics Some heuristics can be used to change the order of the iterated instructions. This does not reduce the search space, but may allow the superoptimizer to find any solution quicker. Illustration 2: Savings from implementing Illustration 1: Search space savings from canonical form, and commutativity not generating code with redundant read and writes 9.1.3 Speed or Optimality There is a certain amount of cross over with the machine learning approach if heuristics are used to guide the search. In particular a brute force superoptimize has difficulty with the possible range of constants – there are frequently too many possible constants to use them all. GSO only uses a small range of constants, removing the guarantee of optimality but speeding up the search. The set of constants is chosen carefully, based on constants which appear frequently in code. The graph below shows the frequency of certain constants for a range of benchmarks. Freely available under a 14 Creative Commons Attribution-ShareAlike 4.0 International License The graph shows spikes around the powers-of-two, and powers-of-two minus 1. The superoptimizer could bias the search by preferring power-of-two constants, and then trying other constants after. 9.2 Advantages • Simple to implement and a well known technique. • Guarantee of optimal solution, provided the exploration order is sound. • Can be parallelized easily. 9.3 Disadvantages • Restricted to small sequences of code. • Choosing the exact subset of the instruction set to implement in the superoptimizer, or the subset for superoptimization is hard. 9.4 Research Questions • How can sequences be enumerated in cost order when the machine model is complex? This could require an amount of “fuzziness”, where when a solution is found, the superoptimizer continues for a certain amount of extra cost. • Can we build a fast way of recognizing areas of the search space which are never optimal, and quickly exclude them? • Can the canonical form of registers be extended to cope with instructions that have constraints on the register slot? • The machine state lookup improvement is heavily dependent on the initial values chosen for the registers. Is there a way of choosing a good set of values? • What other techniques are there of improving a brute-force search? 10 Learning Optimizer STOKE [9] is an optimizer based on machine learning techniques to guide the search for an instruction sequence. This type of superoptimizer is able to find much longer code sequences in the same amount of time (up to 16 instructions). Freely available under a 15 Creative Commons Attribution-ShareAlike 4.0 International License A similar design could be embedded into a compiler, since the search can utilize an existing sequence as the starting point for the superoptimization. Then the sequence can be permuted until a better sequence is found. 10.1 Design A compiler pass, placed near the end of the compilation process that will use machine learning techniques to attempt to improve the code. Initially the pass would work on individual basic block, selected by a heuristic based on the length and type of instructions in the basic block. The chosen machine learning method is then applied to the instruction sequence (see the next section), and this sequence is tested (requiring the compiler pass to be able to simulate instructions for the target machine). If the sequence matches then it is verified if it is correct, and a cost model applied. If the sequence is correct, and performs better then either the search can stop there, integrating that sequence into the code, or the search can continue and attempt to find better sequences. Otherwise the search continues until the allotted time has expired. 10.2 Machine Learning The most challenging part of this approach to superoptimization is choosing a fitness function that guides the search towards a correct sequence. STOKE used a function which measured the number of correct bits in the output registers, combined with the performance model to guide it towards efficient solutions. This is necessary for creating solutions from scratch, and if the search accepts an invalid sequence. STOKE used a Monte Carlo Markov chain method (similar to simulated annealing) to choose which sequence to use next. This approach mutated the instruction sequence, then accepted it if the fitness of the solution had gone up. The solution was also accepted if it was less fit, with a certain probability dependent on the value of the fitness. Other learning schemes could be used, such as genetic programming or genetic algorithms. The performance model can easily be swapped out for other metrics, such as code size and energy consumption. 10.3 Advantages • Able to explore longer instruction sequences. • Reduced time to find a solution compared to other methods. • Giving the compiler a longer time to perform the optimizations will generally result in better solutions. • Some machine learning techniques can be parallelized easily, allowing faster exploration. 10.4 Disadvantages • No guarantee of optimality. • There can be many parameters to tune, which greatly affect the time and solution found by the optimizer. 10.5 Research Questions • If the code was modified since the last compilation, can the previous results of the search be used as an initial seed, and guided towards a correct sequence. • Which method of machine learning works well on this kind of problem? Freely available under a 16 Creative Commons Attribution-ShareAlike 4.0 International License • How close to optimality does this approach come? How long does the superoptimizer have to be run to get within X% of optimal? 11 Constructive This section presents the experimental design for a constructive superoptimizer. The premise of the superoptimizer is to split the synthesis task into two components – creating an optimal dataflow DAG, then optimally lowering this DAG to sequential instructions. This approach allows unsupported instruction sequences to be superoptimized, by superoptimizing 'around' them. 11.1 DAG superoptimizer This portion of the superoptimizer takes a sequence of code as input, and produces an optimal DAG as output. 1. Compile the code at the desired optimization level. 2. Construct a control flow graph from the code, and perform liveness analysis on the code. 3. For the targeted sections of code, construct a dataflow DAG for each basic block. 4. Simplify the DAG, so that register assignments are removed, and implied by the DAG. 5. Compose the multiple basic blocks into a large DAG. 6. Transform the DAG into SMT constraints – this is used for the verify step. 7. Construct the skeleton DAG. 8. Iteratively perform Counter Example Guided Inductive Synthesis to fill in the holes in the DAG. 11.1.1 The dataflow DAG The dataflow DAG is constructed per basic block, and describes the dataflow within that basic block. This is constructed by placing an edge between the instruction that generates a value, and an instruction that uses that value. If no instruction in that block creates the value that is needed by an instruction then the register holding that instruction must be live on entry to the block. The liveness analysis provides a set of registers which are live on exit from the block. If an instruction creates a value which is not used in the basic block, the registers which are live on exit from the basic block indicate whether the instructions output should be on an edge out of the block. Once the dataflow DAG has been constructed, the register names are removed. All of the memory operations are represented sequentially unless volatility can be ignored. If volatility is to be ignored, alias analysis can be performed to break the dependencies. 11.1.2 DAG composition Multiple DAGs can be composed together, by linking the registers in to each DAG with the registers generated by the previous basic blocks. In the case where there are multiple previous basic blocks there would be multiple edges into a node for a single value. These should be intermediary multiplexor (phi) nodes inserted. These multiplexors choose the values to propagate, based on the selector's value. The selector's value is chosen based on conditions generated in previous basic blocks. 11.1.3 Skeleton DAG The skeleton dataflow DAG is a reduced form of the previously constructed DAG, with only the portions which are difficult to superoptimize remaining. For example, if memory Freely available under a 17 Creative Commons Attribution-ShareAlike 4.0 International License operations were not implemented in the superoptimizer, the memory operations and their connections would remain in this DAG. The following diagrams show examples of the skeleton DAG. The blue arrows indicate a register dependency, red indicates a dependency on a flag and green indicates a memory dependency. Dataflow DAG Skeleton DAG This dataflow DAG only operates on r24 (r0 is used by the multiply, r1 is cleared after the multiply). Therefore only r24 needs to be considered for input and output, along with the flag dependency, indicating to the branch. Freely available under a 18 Creative Commons Attribution-ShareAlike 4.0 International License Dataflow DAG Skeleton DAG In this example, the chain of memory accesses is kept identical, with the dependencies. The rest of the instruction can be filled in using the values provided by the existing instructions. This example indicates how multiple basic blocks can be combined into a single skeleton DAG. Two outputs for r24 must be produced, along with a condition that selects between them. This is selected by a multiplexor component in the DAG. The “br A” and “br B” indicate the next destination of the code is also dependent on the flags. This form of the DAG allows the potential for control flow to be encoded, then subsequently split into separate basic blocks or a conditional move performed (as decided by the lowering step). Freely available under a 19 Creative Commons Attribution-ShareAlike 4.0 International License 11.1.4 Filling in the skeleton DAG Both the dataflow DAG and the skeleton DAG are converted into SMT constraints. The corresponding points on the graph can then be asserted for equality – the inputs, the outputs, and the intermediary nodes in the skeleton DAG. Counter Example Guided Inductive Synthesis can then be performed, with the set of instructions in the instruction set as components. With an interactive increase in the number of components available, this will result in a optimal solution to make the skeleton DAG and dataflow DAG equivalent. 11.2 Lowering superoptimizer This takes in the optimal DAG as input, and outputs a sequence of assembly level instructions. This can be performed by using integer linear programming (ILP) to represent the ordering of the instructions and perform optimal register allocation. Additionally this can optimize the splitting of the computed DAG into multiple basic blocks. By using an ILP solver, the metric for superoptimization can be considered: • Code size: minimize the control flow such that the total code size is small. This will likely result in code which performs slowly, as if conversion may be done on the entire set of blocks. • Performance: minimize the average path length when the control flow is included. This could use profile data on the range of values the multiplexor expressions take, allowing control flow which minimizes the execution time to be inserted. More research is needed for the exact formulation of the ILP problem (it is covered in detail in [10]–[12]), however, the DAG itself can be lowered using traditional compiler techniques (register allocations, scheduling). 11.3 Advantages • Decomposes superoptimization into smaller more manageable problems. This allows the compiler to substitute in non-optimal techniques if a particular subproblem is taking too long. • Able to generate multiple basic blocks. • Possibly able to handle large sequences of code (20 instructions). 11.4 Disadvantages • Complex to implement. • Currently untested and may not provide the speed ups necessary. • Difficult to parallelism (few efficient parallel SAT solvers exist). • Decomposition of the problem may result in no guarantee of optimality. 11.5 Research Questions There are several additional items to be researched before this design can be realized. • Does a minimal DAG correspond to a minimal instruction sequence? Can the DAG synthesis be biased towards DAGs which make the ILP problem easier? • What is the best way of formulating the ILP problem? Freely available under a 20 Creative Commons Attribution-ShareAlike 4.0 International License Conclusion Today there are only a handful of areas where superoptimization can be applied effectively in a commercial context. In particular, there are significant limitations on producing code sequences which include memory and branching operations. This paper has shown how the commercial value of superoptimization can be increased in both long and short term. Three different designs for superoptimizers were analyzed: those based on brute-force, those applying machine learning; and those using constructive approaches. Overall, the brute- force superoptimizer has the lowest risk, being built on well known technologies. However, it has the most limitations, being restricted to a small number of instructions. On the other end of the scale, a superoptimizer taking a constructive approach with SMT solvers may be able to scale to much longer sequences, however the technology is in its infancy and needs extensive development. The machine learning based superoptimizer has the most potential for immediate benefits, with a smaller amount of additional research to develop, and potentially large gains. This superoptimizer may not be able to guarantee an optimal solution is found, but can quickly explore possible solutions and improve code size/speed/energy. The features of the instruction set greatly affect the efficacy of superoptimization and there are a number of instruction set features which cannot be superoptimized. Some of these are fundamentally difficult, such as a superoptimizer producing loops, while some are implementation difficulties, such as supporting memory accesses. As a result we can see three stages to the future deployment of superoptimization. 1. Immediate use of the GNU superoptimizer (a brute force tool) to help customers optimize key sections of code, with short-term improvements to make it easy to generate goal functions automatically from source assembler. This is quite a niche application, and its commercial value is largely in cementing Embecosm's technical credibility. 2. In the medium term, application of DAG analysis to increase the applicability of the brute force approach and development of machine learning technology to make the brute force approach more directed. This is still relatively high risk, and would be appropriate for feasibility funding by Innovate UK. 3. In the longer term, investment in constructive approaches, with academic study to determine how to handle loop generation, memory access and floating point. The starting point for this should research council funded academic projects, but in which Embecosm would provide some industrial support. However this study has identified some areas with scope to increase the commercial relevance with some relatively modest R&D investment, albeit at quite high risk. There is much greater scope longer term, but this will in the first instance require major academic investment to resolve issues with memory access, branching and floating point. This has been a very high risk study. Embecosm would like to express our thanks to Innovate UK for their support, without which we could not have carried out this work. Freely available under a 21 Creative Commons Attribution-ShareAlike 4.0 International License References [1] H. Massalin, “Superoptimizer - A Look at the Smallest Program,” ACM SIGARCH Comput. Archit. News, pp. 122–126, 1987. [2] E. Schkufza, R. Sharma, and A. Aiken, “Stochastic optimization of floating-point programs with tunable precision,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI ’14, 2013, pp. 53–64. [3] S. Gulwani, S. Jha, A. Tiwari, and R. Venkatesan, “Synthesis of loop-free programs,” ACM SIGPLAN Not., vol. 47, no. 6, p. 62, Aug. 2012. [4] R. Joshi, G. Nelson, and K. Randall, “Denali : A Goal-directed Superoptimizer.” [5] T. Granlund and R. Kenner, “Eliminating branches using a superoptimizer and the GNU C compiler,” in Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation - PLDI ’92, 1992, pp. 341–352. [6] S. Bansal and A. Aiken, “Automatic generation of peephole superoptimizers,” ACM SIGOPS Oper. Syst. Rev., vol. 40, no. 5, p. 394, Oct. 2006. [7] R. Sharma, E. Schkufza, B. Churchill, and A. Aiken, “Data-driven equivalence checking,” in Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications - OOPSLA ’13, 2013, vol. 48, no. 10, pp. 391–406. [8] H. S. Warren, Hacker’s Delight. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2002. [9] E. Schkufza, R. Sharma, and A. Aiken, “Stochastic superoptimization,” in Architectural Support for Programming Languages and Operating Systems, 2013, p. 305. [10] D. W. Goodwin and K. D. Wilken, “Optimal and Near-optimal Global Register Allocation Using 0- 1 Integer Programming,” vol. 26, no. January 1995, pp. 929–965, 1996. [11] T. C. Wilson, G. W. Grewal, and D. K. Banerji, “An ILP Solution for Simultaneous Scheduling , Allocation , and Binding in Multiple Block Synthesis,” 1994. [12] D. Kastner and M. Langenbach, “Code Optimization by Integer Linear Programming,” in Compiler Construction, 1999, pp. 122–137. Freely available under a 22 Creative Commons Attribution-ShareAlike 4.0 International License