Authors Jeremy Bennett
License CC-BY-2.0
Who Ate My Battery? Why software engineers are the key to low power system design

Jeremy Bennett, Embecosm

Abstract

Despite a decade of innovative development, and despite improvements in battery technology, a modern smartphone needs recharging far more often than its turn of the century predecessor. Yet the blame cannot be laid at the door of hardware engineers. Multiple clock domains, clock gating and dynamic voltage and frequency control have all served to make modern hardware highly power efficient. The problem lies in the software.

In this paper we consider how the entire software design process needs reworking to bring the software engineering team into low power design from day one. Central to this is the availability of good software development and debug functionality. We look at how the development environment must work seamlessly from the first architectural model to delivery of finished silicon.

The author is one of the main developers of the OpenRISC processor, which has been used to demonstrate some of these ideas. The paper concludes with a short overview of the OpenRISC project.

Low power system design

Has downloading and running the latest applications also drained your smartphone's battery? Consider how technology has advanced over the past decade.

The Ericsson T95 was launched in 2001. It had a 720 mAh Li-Ion battery, a standby time of 300 hours, talk time of 11 hours and a simple indicator of how much standby and talk time remained.

The Sony-Ericsson Xperia X10 mini was launched in 2010. It has a 910 mAh Li-polymer battery, a standby time of approximately 360 hours with 2G and 285 hours with 3G and a talk time of approximately 4 hours with 2G and 3.5 hours with 3G.

The fault cannot be laid at the door of the hardware design team. In recent years, hardware designers have become very good at low power design.
Multiple voltage domains, clock gating, dynamic frequency scaling, multiple modes of operation and a host of other techniques have helped reduce power consumption. It is a never ending battle as dimensions shrink to just tens of atoms, and leakage becomes an ever more pressing problem.

There are three main factors contributing to power loss:

• static leakage, mitigated by reducing voltage;
• dynamic leakage, mitigated by reducing frequency and switching; and
• number of components, mitigated through smaller, simpler silicon and less memory.

Note in particular that reducing voltage is a quadratic gain, and that reducing frequency is a double gain because it also allows voltage to be reduced. With chip voltages ranging from 0.6V to 1.5V, there is the potential of a ten-fold gain to be had (the voltage ratio alone gives (1.5/0.6)² ≈ 6, before counting the lower frequency). This is why it is generally more power efficient to use a multi-core chip running at a lower frequency, rather than a single core running faster.

For the hardware designer, the biggest savings by far are achieved at the architectural level. By the time we reach RTL synthesis, gates and layout, there is little scope for significant power saving. In 2010 LSI Logic and Mentor Graphics summarized this potential as shown in Figure 1.

Figure 1: Potential power saving in hardware design

For a long time, energy efficiency has been seen as a hardware problem. Yet software can undo all the design efficiency at a stroke. Famously a Linux implementation wasted 70-90% of its power, simply because a blinking cursor woke up the entire system several times a second [1]. The author was involved in a commercial project where the design team found they had to increase clock frequency (and hence power consumption) threefold, because a standard audio codec caused excessive processor stalls through cache conflicts. That project was canceled shortly afterwards.

Why focus on the system and software in particular?
Traditionally, researchers and engineers work within one or perhaps two layers of the system stack, for example as software engineers, computer architects or hardware designers, with very limited overlap. However, energy-aware computing is a challenge that requires investigating the entire system stack from application software and algorithms, via programming languages, compilers, instruction sets and microarchitectures, through to the design and manufacture of the hardware.

Ultimately software controls the hardware. Choice of algorithms and data structures will have a huge impact on power consumption. The traditional compiler focus on speed at the expense of all other considerations is very bad news for power consumption.

Few software engineers appreciate this. Power usage is invariably a secondary requirement, if it is a software requirement at all. Yet the biggest savings are to be had at the top of the architectural stack. With Kerstin Eder, the author recently suggested the LSI Logic/Mentor Graphics chart could be extended as shown in Figure 2 [6].

Figure 2: Potential power saving in system design

We do not (yet) have quantitative data to substantiate this chart. Its shape is derived from anecdotal evidence from a number of system designers.

How to tackle energy efficiency at a system level has been known for well over a decade. In their 1997 paper [2], Roy and Johnson summed up how to align software design decisions with energy efficiency as a design goal. Their key steps are (in the given order):

• choose the best algorithm to fit the hardware;
• manage memory size and memory access through algorithm tuning;
• optimize for performance, making best use of parallelism;
• use hardware support for power management; and
• generate code that minimizes switching in the CPU and data path.

One of the reasons for slow progress in this area is the lack of suitable tool flows. Eder [7] has explained exactly what is required.
We already know how to do hardware power analysis, as illustrated in Figure 3. This approach is accurate, but computationally immensely demanding, so the analysis is slow.

Figure 3: Hardware power analysis flow (from Eder 2011 [7])

We can naturally extend this to a system level analysis as shown in Figure 4. However if power analysis of gate level simulation was computationally hard, this approach to power analysis of a complete system including software and hardware is completely intractable.

Figure 4: System power analysis flow (from Eder 2011 [7])

What is needed is power analysis appropriate to the needs of system and software development. In other words a flow like that in Figure 5.

Figure 5: Software power analysis flow (from Eder 2011 [7])

The problem is in modeling power consumption, even when we are prepared to tolerate a degree of inaccuracy. This is an area where progress has been slow over the past 15 years. An early approach was to use a formulaic analysis based on the operations in executing code [8].

$$E_P = \sum_{i} B_i \times N_i + \sum_{i,j} O_{i,j} \times N_{i,j} + \sum_{k} E_k$$

The formula contains a term for the base power used by each instruction, a term for the power overhead in switching between each pair of instructions and a term for various other instruction effects such as pipeline stall. The formula is highly parameterized, and determining the values of those hundreds of parameters experimentally or from first principles is difficult. Yet without accurate parameters, the results cannot be accurate.

Wattch is an architectural level power simulator, which instead estimates system power usage by combining common functional blocks, whose power usage is already determined [3]. This is a practical approach, which is reported to offer accuracy within 10% and a performance one thousand times greater than traditional gate level power estimation.

Using these approaches we have learned some things about how to design low power systems.
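To make the instruction-level formula from [8] concrete, the following is a minimal sketch of how it might be evaluated over an executed opcode trace. It is not taken from [8]: the `EnergyModel` structure, the `estimateEnergy` function and the use of integer opcode identifiers are all assumptions made for illustration, and any parameter values would have to be measured for a real processor. Walking the trace once and adding the base cost of each instruction is equivalent to summing each base cost multiplied by its execution count.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical parameter table for the instruction-level model.
// All names are illustrative; real values must be measured per processor.
struct EnergyModel {
    std::map<int, double> base;                      // B_i: base cost of opcode i
    std::map<std::pair<int, int>, double> overhead;  // O_{i,j}: cost of opcode i followed by j
};

// Estimate E_P for an executed opcode trace plus a list of other
// per-event costs E_k (for example pipeline stalls).
// Opcodes or pairs missing from the tables contribute nothing.
double estimateEnergy(const EnergyModel& model,
                      const std::vector<int>& trace,
                      const std::vector<double>& otherEffects) {
    double total = 0.0;
    for (std::size_t n = 0; n < trace.size(); ++n) {
        // Adding B_i once per executed instruction gives sum_i B_i * N_i.
        auto b = model.base.find(trace[n]);
        if (b != model.base.end()) total += b->second;
        if (n > 0) {
            // Adding O_{i,j} once per consecutive pair gives sum_{i,j} O_{i,j} * N_{i,j}.
            auto o = model.overhead.find(std::make_pair(trace[n - 1], trace[n]));
            if (o != model.overhead.end()) total += o->second;
        }
    }
    for (double e : otherEffects) total += e;  // sum_k E_k
    return total;
}
```

The difficulty noted above is visible in the sketch: the arithmetic is trivial, but the result is only as good as the entries in the `base` and `overhead` tables, of which a real instruction set needs hundreds.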
One study minimized the Hamming distance between pairs of instructions to reduce switching [9]. This reduced power consumption by 62% in opcode switching. The problem with this approach is that it yields an ISA which is very target application specific.

Another study found that 25% of the registers in a register file accounted for 83% of the time spent accessing the register file. It thus makes sense to partition a register file into "hot" and "cold" register blocks. This led to a 54% reduction in power consumption by the register file compared to an unpartitioned register file [10].

Other researchers have looked at higher levels of abstraction still. One approach is the use of approximate calculation to reduce the computation required, where full accuracy is not required [4]. A related approach allowed the programmer to control the number of bits of accuracy used in floating point applications [5]. However both these studies must be regarded as "niche", and of little relevance to software engineering in general.

Perhaps one of the best approaches is one of the simplest: measure the power being consumed as code executes on a chip. This can be as simple as measuring the voltage drop across a resistor in the power line [11,12]. The resulting data can then be reconciled with program execution to yield an instruction by instruction power profile.

More research is needed in this field. The Energy Aware Computing (EACO) initiative began with 3 workshops during 2011. Sponsored by the Institute for Advanced Study at Bristol University, it aims to foster a European program of research in this general area. Both incremental improvements and radical new innovative approaches are sought. The conveners are Prof David May and Dr Kerstin Eder, both at Bristol University. The next workshop takes place on 18 April 2012 in Bristol.

A system wide approach to debugging software

As we have seen, successful low power design relies on having tools that can take a system-wide view.
One area where this can have most effect is in debugging software. The traditional approach to system development and debugging is shown in Figure 6.

Figure 6: Traditional system development

All too often the hardware and software teams do not communicate. They may be in a different building, different town, different state or even different country and time zone. The software engineers rely on their own ISS, often until after tape out. If only they could use the same models as the hardware engineers.

Embedded software tools such as debuggers that take a system-centric view need two characteristics.

1. They need to be peripheral aware. When the program halts, the peripherals must also halt and the tool must have visibility into peripheral state.
2. They must work with hardware models as easily as with final silicon. That is, models of the complete system, not just the CPU, whether high level or low level, software or FPGA emulation.

This is not a technical challenge for the tool developer. Most debuggers are easily extensible to access peripherals, IEEE 1149.1 JTAG (or its successors) provides a natural point of abstraction for the interface and the EDA world knows how to model complete systems.

The GNU debugger (GDB) for the OpenRISC 1200 Reference Platform System-on-Chip (ORPSoC) already supports access to peripheral state. Memory mapped peripheral registers appear as special purpose registers (SPRs). GDB is easily extended to add a command to read SPRs, for example to read the programmable interrupt controller (PIC) mask register.

    (gdb) info spr picmr
    PIC.PICMR = SPR9_0 = 0 (0x0)
    (gdb)

Similarly GDB can write the value of peripheral registers.

    (gdb) set spr picmr 0x00000007
    PIC.PICMR (SPR9_0) set to 7 (0x7), was: 0 (0x0)
    (gdb)

All that is then required is a command to control execution based on peripheral registers, for example a peripheral watchpoint.

    (gdb) pwatch picsr
    Peripheral watchpoint 2: PIC.PICSR (SPR9_2)
    (gdb)

This is yet to be implemented, since it requires extension to the underlying hardware.
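The labels SPR9_0 and SPR9_2 in the GDB output above identify a register by group and by index within that group. As a minimal sketch, assuming the SPR numbering packs a 5-bit group number above an 11-bit register index (which is our understanding of the OpenRISC 1000 architecture, and should be checked against the architecture manual), the mapping between a label and a flat SPR number could look like this. The function names `sprNumber` and `sprLabel` are hypothetical and are not part of GDB.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumption: an SPR number is 16 bits wide, with a 5-bit group in the
// top bits and an 11-bit register index in the bottom bits.
constexpr unsigned kSprIndexBits = 11;
constexpr unsigned kSprIndexMask = (1u << kSprIndexBits) - 1u;
constexpr unsigned kSprGroupMask = 0x1fu;

// Combine a group and an index into a flat SPR number.
constexpr std::uint16_t sprNumber(unsigned group, unsigned index) {
    return static_cast<std::uint16_t>(((group & kSprGroupMask) << kSprIndexBits) |
                                      (index & kSprIndexMask));
}

// Produce the "SPR<group>_<index>" style label seen in the GDB output.
std::string sprLabel(std::uint16_t spr) {
    return "SPR" + std::to_string(spr >> kSprIndexBits) + "_" +
           std::to_string(spr & kSprIndexMask);
}
```

Under this assumption the PIC registers shown above sit in group 9, so `sprNumber(9, 0)` would identify PICMR and `sprNumber(9, 2)` would identify PICSR.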
In an embedded environment the debugger client must communicate with the target, and these additional peripheral commands must be transferred to the target. Each debugger has its own communication protocol, which must be utilized for this purpose, and typically offers appropriate extension facilities.

In the case of GDB, communication is through the GDB remote serial protocol (RSP). This simple packet protocol provides one packet, qRcmd, for the express purpose of passing arbitrary commands to the target. In this case we add readspr and writespr commands and extend the RSP server on the target to read and write peripheral state according to these commands. By working through this standard interface, these extensions are robust against future GDB upgrades.

There are many types of target model, all with their own interfaces. A SystemC transaction level model (TLM) may define a class with a transactional port as interface.

    class SocTlmModel : public sc_core::sc_module
    {
      …
      tlm::tlm_transport_dbg_if<JtagPayload> jtagPort;

A SystemC cycle accurate model may define a class with sc_in and sc_out wires as interface.

    class SocCycleModel : public sc_core::sc_module
    {
      …
      sc_in<bool> jtagTck;
      sc_in<bool> jtagTms;

An FPGA model may interface through library calls to control a JTAG driver chip:

    static void jp1_ll_reset_jp1()
    {
      …
      write (lp, &data, sizeof (data));
      JP1_WAIT ();

We need a mechanism where the debugger interface does not need to be rewritten for each variant. As shown in Figure 7, the solution is to use a transaction level abstraction of JTAG as the interface between debugger and target [13].

Figure 7: Unified debug using JTAG as interface abstraction

At its simplest, JTAG is just a mechanism to write and read serial registers of arbitrary size, so naturally fits a transactional model of abstraction. The highest level of class abstraction is of the RSP server class communicating with a JTAG interface class mediated via a JTAG register class as shown in Figure 8.
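The RSP command packet discussed earlier in this section (named qRcmd in the GDB documentation) can be illustrated with a short sketch of the packet framing. The framing rules used here are those of the GDB remote serial protocol as we understand them: the payload sits between `$` and `#`, followed by a two digit hexadecimal checksum which is the modulo 256 sum of the payload characters, and the command text carried by qRcmd is hex encoded. The helper names `hexEncode`, `rspFrame` and `rspMonitorCommand` are hypothetical, and the argument syntax of readspr and writespr is not specified in this paper, so none is assumed.

```cpp
#include <cassert>
#include <string>

// Hex-encode each byte of the text as two lowercase hex digits.
std::string hexEncode(const std::string& text) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : text) {
        out += digits[c >> 4];
        out += digits[c & 0xf];
    }
    return out;
}

// Frame a payload as $<payload>#<checksum>, where the checksum is the
// modulo 256 sum of the payload characters, written as two hex digits.
std::string rspFrame(const std::string& payload) {
    static const char digits[] = "0123456789abcdef";
    unsigned sum = 0;
    for (unsigned char c : payload) sum = (sum + c) & 0xffu;
    std::string out = "$" + payload + "#";
    out += digits[sum >> 4];
    out += digits[sum & 0xf];
    return out;
}

// Build a qRcmd packet carrying an arbitrary command string for the target.
std::string rspMonitorCommand(const std::string& command) {
    return rspFrame("qRcmd," + hexEncode(command));
}
```

On the target side, the RSP server would decode the hex text, recognize the command name and then carry out the corresponding peripheral access. Because the command travels as opaque text inside a standard packet, the GDB client needs no knowledge of the packet contents beyond the command that generates it.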
Figure 8: High level class abstraction for unified debug interface

All three classes are abstract. For each interface we must provide a concrete specialization corresponding to the particular interface as shown in Figure 9.

Figure 9: JTAG interface specialization for unified debug interface

Note that these specializations are independent of any particular architecture being debugged.

For any particular architecture we must then provide specialization of the RSP server class and JTAG class. These will provide handling of architecture specific debug packets, for example translating the reading and writing of memory into the correct sequence of JTAG packet transfers. We must also provide the specific JTAG registers used by the processor's particular debug unit. This gives us the class specialization shown in Figure 10.

Figure 10: RSP interface specialization for a unified debug interface

Note how this approach allows efficient reuse of the interface. For any new architecture, providing just the architecture specialization allows the debugger to talk to all types of models and hardware for which a JTAG specialization exists. Similarly for a new type of debug interface (for example to a different JTAG chip), it is only necessary to provide a new JTAG class specialization and that interface is available to any architecture.

The SystemC transaction level modeling interface cannot be used in its vanilla form, since it presumes a byte addressed bus interface. It also assumes read and write are separate operations and has no concept of the simultaneous write and read of a shift register. However the TLM extension mechanism allows this functionality to be provided in a standard way, ensuring models will work in any TLM environment. The extension class diagram is shown in Figure 11.
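The simultaneous write and read of a shift register, which the vanilla TLM interface cannot express, can be illustrated with a small model. This is a plain C++ stand-in written for illustration only: it does not use SystemC, it is not the TLM extension described in this paper, and the class names `JtagRegisterIf` and `ShiftRegisterModel` are hypothetical. It shows an abstract interface with a single transaction that shifts new bits in and returns the bits shifted out, plus one concrete specialization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical abstract JTAG register interface: a single transaction
// shifts a sequence of bits in and returns the bits shifted out.
class JtagRegisterIf {
public:
    virtual ~JtagRegisterIf() = default;
    virtual std::vector<bool> shift(const std::vector<bool>& in) = 0;
};

// Simple concrete model of a serial shift register of fixed length.
class ShiftRegisterModel : public JtagRegisterIf {
public:
    explicit ShiftRegisterModel(std::size_t bits) : contents_(bits, false) {}

    std::vector<bool> shift(const std::vector<bool>& in) override {
        std::vector<bool> out;
        out.reserve(in.size());
        for (bool bit : in) {
            if (contents_.empty()) {
                // A zero-length register passes each bit straight through.
                out.push_back(bit);
                continue;
            }
            // The oldest bit leaves at one end as the new bit enters at the other.
            out.push_back(contents_.front());
            contents_.erase(contents_.begin());
            contents_.push_back(bit);
        }
        return out;
    }

private:
    std::vector<bool> contents_;
};
```

A caller that only wants to read the register still has to shift something in, and a caller that only wants to write still receives the old contents. That is why a bus style interface with separate read and write operations is a poor fit, and why a single combined transaction is the natural abstraction.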
Figure 11: SystemC TLM extension classes for JTAG

The specialization of this abstraction (which has no time component) to a lower level interface will require modeling of the JTAG test access port (TAP) to generate the correct sequence of clock signals on the JTAG pins. Figure 12 shows how this works as a sequence diagram.

Figure 12: JTAG TLM to cycle accurate specialization

OpenCores and OpenRISC

Much of the work described in the previous section has been demonstrated on the OpenRISC processor. We provide an introduction here to OpenRISC and the OpenCores website on which it is hosted.

OpenCores was conceived as a website to host open source silicon IP in 1999 by Damjan Lampret, then at Flextronics. While the intention was always to host a wide range of IP, from the beginning its flagship project was the development of a fully open source RISC processor.

For the first eight years, the development of opencores.org was supported by Flextronics, with most contributors coming from their European development teams. The result included a reference ASIC implementation with approximately 150,000 gates and 17 memory blocks.

Since 2007, opencores.org has been run by ORSoC AB, a Swedish hardware design house. Under their stewardship the website has grown to around 120,000 registered users (at the time of writing). Discussions are under way now to set up a completely independent foundation, with the community now mature enough to take control of its own destiny.

The OpenRISC processor has been adopted in a number of commercial applications. Beyond Semiconductor is a design house, supplying commercially hardened derivatives of the OpenRISC processor. Jennic (now part of NXP) was an early adopter of the Beyond Semiconductor designs for their Zigbee chips. Samsung use OpenRISC in their DTV SoCs. Cadence use OpenRISC as a reference architecture to demonstrate their various EDA design flows.
Most OpenCores IP designs are licensed using either BSD or GNU LGPL licenses (OpenRISC uses the latter). Associated software tool chains and documentation are typically licensed using the GNU GPL. LGPL does represent something of a problem for silicon IP, even IP that is destined to become a bitstream on an FPGA. What is the hardware equivalent of linking? Is an FPGA bitstream equivalent to a software object file?

The OpenRISC 1000 architecture defines a family of 32 and 64-bit RISC processors with a Harvard or Stanford architecture [14]. The instruction set architecture (ISA) is similar to that of MIPS or DLX, offering 32 general purpose registers, register-to-register operations, two addressing modes, delayed branches and a fast context switch.

The processor offers WishBone bus interfaces for instruction and data memory access with IEEE 1149.1 JTAG as a debugging interface. Memory management units (MMU) and caches may optionally be included.

The core instruction set features the common arithmetic/logic and control flow instructions. Optional additional instructions allow for hardware multiply/divide, additional logical instructions, floating point and vector operations. The ALU is a 4/5 stage pipeline, similar to that in early MIPS designs.

The design is completely open source, licensed under the GNU Lesser General Public License (LGPL). This means it can be included as an IP block in larger designs, without requiring that the rest of the design be open source.

The OpenRISC 1200 (OR1200) was the first design to follow the OpenRISC 1000 architecture. It is a 32-bit implementation incorporating optional MMUs and caches, a tick timer, programmable interrupt controller (PIC) and power management. The implementation is approximately 32,000 lines of Verilog. The overall design of the OR1200 is shown in Figure 13.
Figure 13: OpenRISC 1200 architecture

For development and testing purposes the OR1200 is incorporated into a System-on-Chip (SoC), the OpenRISC Reference Platform SoC (ORPSoC), which is shown in Figure 14.

Figure 14: OpenRISC Reference Platform SoC (ORPSoC)

One objective of the OpenRISC project is to use open source tools as far as possible during hardware development. While back-end FPGA tools are free (as in beer), they are not open source. However there are now good open source options for front-end EDA tools.

There are three simulation models used during OR1200 development:

• Or1ksim, the golden reference architectural model. This is an interpreting ISS running at 2-5 MIPS with a test-suite of approximately 2,500 tests.
• A 2-state cycle-accurate model in C++/SystemC generated automatically from the Verilog RTL by Verilator, which runs at around 130kHz.
• An event driven simulation model of the Verilog RTL using Icarus Verilog, which runs at around 1.4kHz.

All three models support the GDB RSP, so can be used for software development and debug. This is key to a system-centric view of the development process.

The original verification of the OR1200 used a Verilog test bench which ran a number of test programs in C and assembler, compiled using the OpenRISC 1000 GNU C compiler. The test bench comprises a total of 13 target programs, which were deemed to pass if they printed out 0xdeadbeef on exit. The limitations of this approach are clear:

• it is not exhaustive;
• there are no coverage metrics; and
• the testing is not consistent with that of Or1ksim.

A medium term objective of the OR1200 design team is to unify this test bench with that of Or1ksim.

More recently an OVM testing regime for the OR1200 was implemented by Waqas Ahmed [15]. Using Or1ksim as golden reference, Ahmed aimed to verify against 5 criteria, generating appropriate coverage metrics.

1. Does the PC update correctly?
2. Does the status register update correctly?
3. Do exceptions save context correctly?
4.
Is data stored to the correct memory address?
5. Are results stored correctly in registers?

Although Ahmed used a commercial simulator for his work, he was able to make use of the SystemC interface to Or1ksim to implement a DPI SystemVerilog wrapper. The resulting test bench allowed comparative testing of Or1ksim against the RTL as shown in Figure 15.

Constrained random test generation was used to create a set of tests to maximize coverage. Testing uncovered numerous errors where the RTL and Or1ksim disagreed, which fell into three categories.

1. Discrepancies due to ambiguities in the architectural definition. An example is the handling of unaligned addresses by l.jr and l.jalr.
2. Instructions incorrectly implemented or missing in the RTL. Examples are l.addic and l.lws.
3. Instructions incorrectly implemented or missing in Or1ksim. Examples are l.ror, l.rori and l.macrc. These are all optional instructions, but they are implemented in the OR1200 RTL.

In total 20 instructions had errors of some sort. This made for limitations in the coverage that could be achieved, since Ahmed did not have the opportunity to fix the RTL during the period of this project. However he was able to show that for many instruction set coverage criteria, he had achieved as full coverage as possible, while for others he had achieved significant coverage. There remain however a set of coverage criteria with 0% result, since all instructions in these cases had errors.

As a result of this work, the architectural specification has been updated, Or1ksim has been fixed, and changes to the RTL are in progress.

Figure 15: Dual Or1ksim and RTL test bench under OVM

The OpenRISC 1000 architecture is supported by a comprehensive tool chain. binutils 2.20.1, gcc 4.5.1 and gdb 7.2 are implemented supporting both C and C++. At present, in common with many embedded tool chains, only static libraries are supported.
The or32-elf- tool chain for bare metal applications uses the newlib library, while the or32-linux- tool chain for Linux applications uses the uClibc library. Both tool chains are regression tested on Or1ksim, with the or32-elf- tool chain also tested against the Verilator model and the or32-linux- tool chain tested on physical hardware.

A wide range of boards have board support packages, including Or1ksim, the Terasic DE0-nano and DE2 FPGA boards and the Xilinx ML510 FPGA board.

A number of open source real-time operating systems (RTOS) are supported. FreeRTOS, RTEMS and eCos have all been ported to OpenRISC.

OpenRISC supports Linux, with the implementation being adopted in the Linux 3.1 mainline. There are currently some limitations to the implementation (kernel debug, ptrace). BusyBox is supported as a command-line environment.

Debug is supported through JTAG, which is appropriate for bare-metal applications and the Linux kernel. For Linux applications, gdbserver has been implemented.

The ability to seamlessly use the tool chains on both models and physical hardware leads to a new way to carry out system verification of the hardware [16]. The GNU tools all contain a very substantial regression test-suite (around 100,000 tests). Comparative runs against the reference architectural model (Or1ksim) and the Verilog implementation (on physical hardware or as a Verilator model) should give identical results. Any discrepancies can be down to one of two reasons:

1. tests timing out on one target, due to differential performance; or
2. bugs in the hardware implementation.

As an example, early in the implementation of GCC 4.5.1, we were able to run the gcc regression tests on both targets.
On Or1ksim, the results were:

                    === gcc Summary ===

    # of expected passes            52753
    # of unexpected failures        152
    # of expected failures          77
    # of unresolved testcases       122
    # of unsupported tests          716

With a Verilator model of the RTL, the results were:

                    === gcc Summary ===

    # of expected passes            52677
    # of unexpected failures        228
    # of expected failures          77
    # of unresolved testcases       122
    # of unsupported tests          716

The 76 additional tests which failed only with the Verilator model were then examined to determine the cause. Thus we can identify that the following test timed out:

    gcc.c-torture/execute/20011008-3.c execution, -O0

A manual rerun of the command line in the log shows that this test does complete if given enough time; it just requires 115 million cycles. Inspection of the code shows it contains large nested for loops, so this is not surprising. This is an example of the first class of failure, which does not indicate any problem with the RTL.

However the following test also times out.

    gcc.c-torture/execute/20020402-3.c execution, -Os

Manual rerunning does not allow this test to complete, even after two hours. In any case inspection of its sibling tests with different optimizations shows they only require a few hundred thousand cycles to complete. This is an example of the second type of failure.

At this point we have a clear test case for an RTL failure. Typically we will run the test with tracing enabled using both RTL and Or1ksim versions, and determine where execution diverges. A VCD inspection usually quickly shows the cause of the failure.

Finally the following test completed execution, but gave the wrong result.

    gcc.c-torture/execute/20090113-1.c execution, -O2

In this case the test terminated with a Bus Error exception. This is another example of the second type of failure.

There is another type of failure possible, which is where a test passes in RTL, but fails with Or1ksim.
No such failures have been found to date, but they could indicate Or1ksim incorrectly implementing the architectural specification.

These bugs have now been corrected in the OpenRISC RTL. The compiler implementation has also been completed, and there are now no regression failures with either Or1ksim or the Verilog RTL implementation.

This approach has been used commercially. For example Embecosm developed the GNU tool chain for the Adapteva Epiphany architecture prior to first silicon tape out. Adapteva reported that the discrepancies found enabled the elimination of 50-60 hardware design flaws, leading to a processor that worked correctly with first silicon.

Summary

In summary:

• Future low power products will require a systems approach. This will mean hardware and software engineers must work together, and the approach applies throughout the life cycle.
• The greatest opportunity for power saving is in the software. Techniques for tackling this are still in their infancy. We need breakthroughs in high level power modeling and simulation.
• We need a systems oriented tool chain geared to the needs of both software and hardware and usable throughout the product life cycle.
• Embecosm's unified debugging approach is an example, which allows software debugging throughout the life cycle.
• The benefits can be seen already in the OpenRISC project, with hardware bugs identified by the software engineers.

Acknowledgments

Most of the work described in the first section of this paper is due to my colleague, Dr Kerstin Eder at the University of Bristol. It draws heavily on our joint paper for the NMI [6].

OpenRISC is a community project, to which I am just one of the contributors. It is the cumulative result of 12 years' work by a very large number of people.

References

1 E. Sperling and P. Chatterjee. 16 June, 2011. The Tao of Software. Chip Design Magazine Low Power Engineering online community blog post. chipdesignmag.com/lpd/blog/2011/06/16/the-tao-of-software.
2 K.
Roy and M.C. Johnson. 1997. Software design for low power. In W. Nebel and J. Mermet (Eds.) Low power design in deep submicron electronics. Kluwer Nato Advanced Science Institutes Series, Vol. 337, Norwell, MA, USA, pp 433-460.
3 D. Brooks, V. Tiwari, M. Martonosi, Wattch: A framework for architecture-level power analysis and optimizations, Proc. 27th International Symposium on Computer Architecture (ISCA), pp. 83-94, 2000.
4 A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze and D. Grossman, EnerJ: Approximate Data Types for Safe and General Low-Power Computation. In Proc. of PLDI, June 2011.
5 J.Y.F. Tong, D. Nagle and R.A. Rutenbar, Reducing Power by Optimizing the Necessary Precision/Range of Floating-Point Arithmetic. IEEE Transactions on VLSI Systems, 8(3), pp 273-286, June 2000.
6 Jeremy Bennett and Kerstin Eder. The software drained my battery. NMI yearbook 2011-12, pp 39-24, November 2011.
7 Kerstin Eder. Energy aware system design. Low-Power Verification - bridging the gap between hardware and software. NMI meeting, Rutherford-Appleton Laboratory, 20 June 2011.
8 V. Tiwari, S. Malik and A. Wolfe, Instruction Level Power Analysis and Optimization of Software, Journal of VLSI Signal Processing Systems, 13, pp 223-238, 1996.
9 S. Woo, J. Yoon and J. Kim, Low-Power Instruction Encoding Techniques, Proceedings of the SOC Design Conference, 2001.
10 X. Guan and Y. Fei, Register File Partitioning and Recompilation for Register File Power Reduction, ACM Transactions on Design Automation of Electronic Systems, 15(3)24, May 2010.
11 Steve Kerrison, Bristol University Design Automation and Verification, Microelectronics Group. Informal presentation during the 2nd EACO Workshop, Energy-Aware Computing: Beyond the state of the art, 13-14 July 2011.
12 Barry Lock, Power by Software Function - some real life examples. Low-Power Verification - bridging the gap between hardware and software. NMI meeting, Rutherford-Appleton Laboratory, 20 June 2011.
13 Jeremy Bennett, Using JTAG with SystemC: Implementation of a Cycle Accurate Interface, Embecosm Application Note No. 5, www.embecosm.com, January 2009.
14 The OpenRISC project. http://opencores.org/or1k/Main_Page.
15 Waqas Ahmed. Implementation and Verification of a CPU Subsystem for Multimode RF Transceivers. MSc dissertation, Royal Institute of Technology (KTH). May 2010.
16 Jeremy Bennett, Processor Verification using Open Source Tools and the GCC Regression Test Suite: A Case Study, Design Verification Club meeting, 20 September 2010, Bristol UK, Cambridge UK and Eindhoven Netherlands (multicast conference).

About the Author

Dr Jeremy Bennett is Chief Executive of Embecosm (www.embecosm.com) which provides open source services, tools and models to facilitate embedded software development with complex systems-on-chip. Contact him at jeremy.bennett@embecosm.com.

This paper was presented to the School of Electronic, Electrical and Systems Engineering on 14 December 2011.

This work is licensed under the Creative Commons Attribution 2.0 UK: England & Wales License. To view a copy of this license, visit creativecommons.org/licenses/by/2.0/uk/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

This license means you are free:

• to copy, distribute, display, and perform the work
• to make derivative works

under the following conditions:

• Attribution. You must give the original author, Jeremy Bennett, credit;
• For any reuse or distribution, you must make clear to others the license terms of this work;
• Any of these conditions can be waived if you get permission from the copyright holder, Embecosm; and
• Nothing in this license impairs or restricts the author's moral rights.