Microkernels in a Bit More Depth
COMP9242 2008/S2 Week 3
©2008 Gernot Heiser UNSW/NICTA/OKL. Distributed under Creative Commons Attribution License

Copyright Notice
These slides are distributed under the Creative Commons Attribution 3.0 License.
You are free:
• to share: to copy, distribute and transmit the work
• to remix: to adapt the work
Under the following conditions:
• Attribution. You must attribute the work (but not in any way that suggests that the author endorses you or your use of the work) as follows:
  − “Courtesy of Gernot Heiser, [Institution]”, where [Institution] is one of “UNSW”, “NICTA”, or “Open Kernel Labs”
The complete license text can be found at http://creativecommons.org/licenses/by/3.0/legalcode

Motivation
Early operating systems had very little structure.
A strictly layered approach was promoted by Dijkstra
• THE Operating System [Dij68]
Later OSes (more or less) followed that approach (e.g., Unix).
Such systems are known as monolithic kernels.

Issues of Monolithic Kernels
Advantages:
Kernel has access to everything:
• all optimisations possible
• all techniques/mechanisms/concepts implementable
Kernel can be extended by adding more code, e.g. for:
• new services
• support for new hardware
Problems:
• Widening range of services and applications
• OS bigger, more complex, slower, more error prone
• Need to support same OS on different hardware
• Like to support various OS environments
• Distribution
  − Impossible to provide all services from same (local) kernel

Evolution of the Linux Kernel
[Figure]

Approaches to Tackling Complexity
Classical software-engineering approach: modularity
• (Relatively) small, mostly self-contained components
• Well-defined interfaces between them
• Enforcement of interfaces
• Containment of faults to few modules
Doesn't work with monolithic kernels:
• All kernel code executes in privileged mode
• Faults aren't contained
• Interfaces cannot be enforced
• Performance takes priority over structure

Cross-Module Dependencies (“Spaghettiness”)
[Figure]

Evolution of the Linux Kernel, Part 2
Software-engineering study of Linux kernel [SJW+02]:
Looked at size and interdependencies of kernel “modules”
• “common coupling”: interdependency via global variables
Analysed development over time (linearised version number)
Result 1: Module size grows linearly with version number
Result 2: Interdependency grows exponentially with version!
The present Linux model is doomed!
There is no reason to believe that others are different
• e.g. Windows, MacOS, ...
Need better software engineering in operating systems!

Monolithic vs. Microkernel OS Structure
[Figure labels: Applications, User-level Servers, OS, Microkernel, Hardware; unprivileged vs. privileged]
Based on the ideas of Brinch Hansen's “Nucleus” [BH70]

Monolithic vs. Microkernel OS Structure
[Figure labels, Monolithic OS: Application, Syscall; User Mode, Kernel Mode; VFS; IPC, file system; Scheduler, virtual memory; Device drivers, dispatcher; Hardware]
[Figure labels, Microkernel OS: Application, Unix Server, File Server, Device Driver, IPC; User Mode, Kernel Mode; IPC, virtual memory; Hardware]
Monolithic OS:
• lots of privileged code
• vertical structure
• invoked by system call
Microkernel OS:
• little privileged code
• horizontal structure
• invoked by IPC

Microkernel OS
Kernel:
• Contains code which must run in supervisor mode
• Isolates hardware dependence from higher levels
• Is small and fast (extensible system)
• Provides mechanisms
User-level servers:
• Are hardware independent/portable
• Provide “OS environment”/“OS personality” (maybe several)
• May be invoked:
  − From application (via message-passing IPC)
  − From kernel (upcalls)
• Implement policies [BH70]

Downcall vs. Upcall
[Figure labels: Applications (unprivileged), Kernel (privileged); downcall (syscall), upcall]
Downcall:
• unprivileged code enters kernel mode
• implemented via trap
Upcall:
• privileged code enters user mode
• implemented via signal/IPC

Microkernel-Based Systems
[Figure labels: Classic OS, Embedded app, Native app, Security, Java, Highly-specialized component, mini-OS; three stacks of OKL4 on Hardware; captions: Classic, thin, specialized]

Microkernel-Based Systems
[Figure labels: Comp (several), App (several), Real Time App, Comms Library, Object Mgr, Loader, File System, TCP/IP, User Interface, Network, Display, Flash, OK Linux, OKL4]
Hybrid system
• Linux for legacy support or high-level API requirements
• RTOS for legacy support for real-time apps
• Highly componentised system for robustness
Provides migration path from legacy to componentised

Early Example: Hydra
Separation of mechanism from policy
• e.g. protection vs. security
No hierarchical layering of kernel
Protection, even within OS
• Uses (segregated) capabilities
Objects, encapsulation, units of protection
Unique object name, no concept of object ownership
Object persistence based on reference counting [WCC+74]

Hydra...
Can be considered the first object-oriented OS
Has been called the first microkernel OS
• by people who ignored Brinch Hansen
Has had enormous influence on later OS research
Was never widely used even at CMU because of
• poor performance
• lack of a complete environment

Popular Example: Mach
Developed at CMU by Rashid and others [RTY+88] from 1984
Successor of Accent [FR86] and RIG [Ras88]
Goals:
• Tailorability: support different OS interfaces
• Portability: almost all code H/W independent
• Real-time capability
• Multiprocessor and distribution support
• Security
Coined term microkernel

Basic Features of Mach Kernel
Task and thread management
Interprocess communication
• asynchronous message-passing
Memory object management
System call redirection
• for virtualization (although they didn't call it that)
Device support
Multiprocessor support

Mach Tasks and Threads
Thread
• active entity (basic unit of CPU utilisation)
• own stack, kernel scheduled
• may run in parallel on multiprocessor
Task
• consists of one or more threads
• provides address space and other environment
• created from “blueprint”
  − Empty or inherited address space
  − Similar approach adopted by Linux clone
• Activated by creating a thread in it
“Privileged user-state program” may control scheduling

Mach IPC: Ports
Addressing based on ports:
• port is a mailbox, allocated/destroyed via a system call
• has a fixed-size message queue associated with it
• is protected by (segregated) capabilities
• has exactly one receiver, but possibly many senders
• can have “send-once” capability to a port
  − for RPC replies (server invocation)
Can pass the receive capability for a port to another process
• give up read access to the port
Kernel detects (and cleans up) ports without senders or receiver
Processes may have many ports (UNIX server has 2000!)
• can be grouped into port sets
• supports listening to many (similar to Unix select)
Send blocks if queue is full
• blocking limited by timeout
Indirection via ports supports transparent distribution
• Local proxy port forwards message to receiver on remote node
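The port semantics listed above (a mailbox with a fixed-size queue, where a send blocks when the queue is full and the blocking is limited by a timeout) can be illustrated with a small sketch. This is a toy model rather than Mach's actual interface: the `Port` class and its `send`/`receive` methods are hypothetical names, and capabilities, port sets and send-once rights are deliberately left out.

```python
import queue


class Port:
    """Toy mailbox (hypothetical): bounded queue, many senders, one receiver."""

    def __init__(self, capacity):
        # The fixed-size message queue associated with the port.
        self._messages = queue.Queue(maxsize=capacity)

    def send(self, message, timeout):
        """Enqueue a message; return False if the queue stays full past the timeout."""
        try:
            self._messages.put(message, block=True, timeout=timeout)
            return True
        except queue.Full:
            return False

    def receive(self, timeout):
        """Dequeue the oldest message; return None if nothing arrives in time."""
        try:
            return self._messages.get(block=True, timeout=timeout)
        except queue.Empty:
            return None


port = Port(capacity=2)
# The third send finds the queue full and gives up after the timeout.
results = [port.send(f"request {i}", timeout=0.01) for i in range(3)]
first = port.receive(timeout=0.01)
```

In this sketch the receiver drains messages in FIFO order, and a sender that times out simply learns that the message was not queued; what a real sender does next is a policy decision outside the model.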

Mach IPC: Messages
Segregated capabilities:
• Threads refer to them via local indices
• Kernel marshals capabilities in messages
• Message format must identify caps
Message contents
• Send capability to destination port (mandatory)
  − Used by kernel to validate operation
• Optional send capability to reply port
  − For use by receiver to send reply
• Possibly other capabilities
• “in-line” (by-value) data
• “out-of-line” (by-reference) data, using copy-on-write
  − May contain whole address spaces

Mach IPC
[Figure: message layout: Header, Port rights (Capabilities), Out-of-line data, In-line data]
[Figure: virtual address spaces of Task 1 and Task 2 over physical memory, showing the mapping before IPC and the mapping after IPC]

Mach Virtual Memory Management
Address space constructed from memory regions
Initially empty
Populated by:
• explicit allocation
• explicitly mapping a memory object
• inheriting from parent
  − by-region inheritance: none, copy, shared
• allocated automatically by kernel during IPC
  − when passing by-reference parameters
  − kernel determines mapping location
Leads to sparse virtual memory use (unlike UNIX)
• uses complex address-map data structure to limit impact
Extensive use of copy-on-write for efficiency
• imposes alignment restrictions
• not necessarily a win for single pages
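The copy-on-write idea behind out-of-line message data can be sketched as follows. This is a simplified illustration under stated assumptions, not Mach's implementation: `Frame` and `AddressSpace` are hypothetical names, pages are tiny byte arrays, and the "kernel" chooses the receiver's mapping location by simply passing a different page number.

```python
class Frame:
    """A physical page frame (hypothetical model) with a count of mappings sharing it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 0


class AddressSpace:
    """Maps virtual page numbers to frames; shared frames are copied on first write."""

    def __init__(self):
        self.pages = {}

    def map(self, vpn, frame):
        frame.refs += 1
        self.pages[vpn] = frame

    def read(self, vpn):
        return bytes(self.pages[vpn].data)

    def write(self, vpn, offset, value):
        frame = self.pages[vpn]
        if frame.refs > 1:
            # Write fault on a shared frame: perform the physical copy now.
            frame.refs -= 1
            frame = Frame(frame.data)
            frame.refs = 1
            self.pages[vpn] = frame
        frame.data[offset] = value


sender, receiver = AddressSpace(), AddressSpace()
shared = Frame(b"abcd")
sender.map(0, shared)
receiver.map(7, shared)          # after "IPC": same frame, no data copied yet
receiver.write(7, 0, ord("X"))   # first write triggers the copy
```

Until the write, both address spaces share one frame, so the transfer itself costs only a mapping operation. The slide's caveat still applies: for a single small page, the bookkeeping and fault handling may cost more than an eager copy.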

Mach Memory Objects
Kernel doesn't support file system
Memory objects are an abstraction of secondary storage:
• can be mapped into virtual memory
• are cached by the kernel in physical memory
• pager invoked if unmapped page is touched (or R/O page written to)
  − invoke file system server to provide data
Support data sharing
• by mapping objects into several address spaces
Mach views virtual memory only as a cache for memory objects

User-Level Page Fault Handlers
All actual I/O performed by pager, which can be
• default pager (provided by kernel), or
• external pager, running at user level
Intrinsic page fault cost: 2 IPCs
(1) Check protection & locate memory object
• uses address map
(2) Check cache, invoke pager if cache miss
• uses a hashed page table
(3) Check copy-on-write
• perform physical copy if write fault
(4) Enter new mapping into H/W page tables
[Figure labels: Task, External Pager, Kernel, Map, IPC, Memory cache object]

Mach Unix Virtualization
[Figure labels: Unix Server, Application, Emulation library, Syscall redirect, Mach; steps numbered 1 to 5]
Emulation library in user address space handles IPC
Invoked by system call redirection (trampoline mechanism)
• Supports binary compatibility
• Example of what's now called para-virtualization

Mach = Microkernel?
Most OS services implemented at user level
• Using memory objects and external pagers
• Provides mechanisms, not policies
Mostly hardware independent
Big!
• 140 system calls (300 in later versions), >100 kLOC
  − Compare: Unix 6th edition had 48 syscalls (10 kLOC without drivers)
• 200 KiB text size (350 KiB in later versions)
Performance poor
• Tendency to move features into kernel
  − OSF/1
  − Darwin (base of MacOS X): complete BSD kernel inside Mach
Further information on Mach: [YTR+87, CDK94, Sin97]

Other Client-Server Systems
Lots! Most notable systems:
Amoeba: FU Amsterdam, early 1980s [TM81, TM84, MT86]
• followed by Minix ('87), Minix 3 ('05)
Chorus: INRIA (France), early 1980s [DA92, RAA+90, RAA+92]
• Commercialised by Chorus Systèmes in 1988
• Targeted embedded systems (esp. network infrastructure)
• Bought by Sun in 1997, closed down in 2002
• Chorus team spun out to create Jaluna (renamed VirtualLogix in '06)
• Now markets embedded virtualization technology
QNX: “first commercial microkernel” (early '80s)
• highly successful in automotive and other transport systems
Green Hills Integrity
• '97 for military, commercial release '02
• market leader in aerospace, military
Windows NT: Microsoft (early 1990s) [Cus93]
• Early versions (NT 3) were microkernel-ish
• Now runs main servers and most drivers in kernel mode

Critique of Microkernel Architectures
“I'm not interested in making devices look like user-level. They aren't, they shouldn't, and microkernels are just stupid.” (Linus Torvalds)
Is Linus right?

Microkernel Performance
First-generation microkernel systems ('80s, early '90s)
• Exhibited poor performance when compared to monolithic UNIX implementations
• Particularly Mach, the best-known example
  − But others weren't better
Typical result: re-kernelise systems
• Move OS services back into the kernel for performance
• Move complete OS personalities into kernel
  − Mach Unix “server” → Unix kernel co-located with Mach
  − Chorus Unix
  − Mac OS X
  − OSF/1, ...
Some spectacular failures
• most notorious: IBM Workplace OS [Phelan et al. 93]
• also the GNU Hurd
• many others...

IBM Workplace OS (1991–96)
Unify IBM's operating systems (and produce cost savings)
• DOS, OS/2, Posix, AIX, OS/400, Windows (binary compatible)
• all on same underlying platform, available concurrently
• apps can use services from multiple OSes
• “Grand Unification Theory of Operating Systems” (GUTS)
Scale across a wide range of environments
• PDAs (ARM)
• desktops (x86, PowerPC)
• massively-parallel machines (Power, ...)
Decided to base on Mach
• “Workplace OS microkernel” derived from Mach 3.0
• for providing concurrent OS personalities
• share personality neutral services (PNSs)
[Figure labels: Applications; DOS, OS/2, AIX; Personality Neutral Services; Workplace OS Microkernel]

IBM Workplace OS
Significant modifications to Mach to address its problems
• synchronous IPC, single-copy message-passing
• direct support for RPC
  − send+receive-reply without user-level capability manipulation
• migrating threads model
  − thread moves with message during IPC
• improvements in memory management
  − e.g. use mappings for message transfers
• security tokens that reduce number of rights checks
• generally simplified and optimised code base
• more than doubled overall code size
• improved IPC performance 3 times (still 8 times slower than L4)
Plagued by problems
• Schedule overruns
• Budget overruns
• On-going technical problems

IBM Workplace OS History
One of the biggest OS projects ever: US$2G
• 400 microkernel, 1500 OS/2 programmers
Jan '91: Project start
Fall '92: Demoed OS/2, DOS and Unix on Mach
Fall '93: Announced that Workplace would not replace AIX
Jan '95: completely abandoned AIX personality
Oct '95: GA release of microkernel for PowerPC
Oct '95: Workplace project cancelled, Personal Power Div closed
Early '96: shipped last version (2.0) for x86, PowerPC, ARM
Considered a prime example of vapourware
• much marketing before technology was created

IBM Workplace OS Lessons
Analysis by Fleisch, Allan [1998]
Difficult to map personality services to shared PNSs
• required extensive restructuring of existing code
• difficult to get PNS APIs right
Featurism
Focussed on microkernel, too late on personalities
Too much focus on portability of microkernel?
Poor management of huge project
• e.g. wrt shared PNSs
The analysis doesn't mention microkernel performance as an issue

Microkernel Performance
Performance problems of Mach became generally known in '93
Reasons are investigated by [Chen & Bershad 93]:
• Instrumented user and system code to collect execution traces
• Run on DECstation 5000/200 (25 MHz R3000)
• Run under Ultrix and Mach with Unix server
• Traces fed to memory system simulator
• Analyse MCPI (memory cycles per instruction)
  − Baseline MCPI (i.e. excluding idle loops)

Ultrix vs. Mach-Unix MCPI
[Figure]

Interpretation
Observations:
• Mach memory penalty higher
  − i.e. cache misses or write stalls
• Mach VM system executes more instructions than Ultrix
  − But has more functionality
Claim:
• Degraded performance is (intrinsic?) result of OS structure
• IPC cost is not a major factor [Ber92]
  − IPC cost known to be high in Mach

Assertions
1 OS has less instruction & data locality than user code
• System code has higher cache and TLB miss rates
• Particularly bad for instructions
2 System execution is more dependent on instruction cache behaviour than is user execution
• MCPIs dominated by system i-cache misses
• Note: most benchmarks were small, i.e. user code fits in cache
3 Competition between user & system code no problem
• Few conflicts between user and system caching
• TLB misses are not a relevant factor
• Note: the hardware used has direct-mapped physical caches
  − Split system/user caches wouldn't help

Self-Interference
Only examine system cache misses
[Figure: MCPI for system-only, using R3000 direct-mapped cache. Shaded: system cache misses removed by associativity]
Reductions due to associativity were obtained by running the system on a simulator and using a two-way associative cache of the same size

Assertions
4 Self-interference is a problem in system instruction reference streams
• High internal conflicts in system code
• System would benefit from higher cache associativity
5 System block memory operations are responsible for a large percentage of memory system reference costs
• Particularly true for I/O system calls
6 Write buffers are less effective for system references
• Write buffer allows limited asynchronous writes on cache misses
7 Virtual-to-physical mapping strategy can have significant impact on cache performance
• Unfortunate mapping may increase conflict misses
• “Random” mappings (Mach) are to be avoided

Other Experience with Microkernel Performance
System call costs are (inherently?) high
• Typically hundreds of cycles, 900 for Mach/i486
Context (address-space) switching costs (inherently?) high
• Getting worse (in terms of cycles) with increasing CPU/memory speed ratios [Ous90]
• IPC (involving system calls and context switches) is inherently expensive
Microkernels heavily depend on IPC, and IPC is expensive
• Is the microkernel idea flawed?
• Should some code never leave the kernel?
• Do we have to buy flexibility with performance?

A Critique of the Critique
Data presented earlier:
• Are specific to one (or a few) systems
• Results cannot be generalised without thorough analysis
• No such analysis had been done
Cannot trust the conclusions [Lie95]

Re-Analysis of Chen & Bershad's Data
[Figure: MCPI for Ultrix and Mach]

Re-Analysis of Chen & Bershad's Data
[Figure: MCPI caused by cache misses: conflict (black) vs capacity (white)]
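The difference between conflict misses and capacity misses, and the way a two-way associative cache of the same size removes self-interference, can be made concrete with a toy simulator. This is an illustrative sketch, not the memory-system simulator used in the study: `count_misses`, the 16-byte line size, the 64-line cache and the access trace are all assumptions chosen for brevity.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways, line_size=16):
    """Count misses in an LRU set-associative cache (ways=1 means direct-mapped)."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        block = address // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[block] = True
    return misses


cache_bytes = 64 * 16
# Two addresses exactly one cache size apart map to the same line when direct-mapped.
trace = [0, cache_bytes] * 100
direct_mapped = count_misses(trace, num_lines=64, ways=1)
two_way = count_misses(trace, num_lines=64, ways=2)
```

With this trace the direct-mapped cache misses on every access, although the working set is only two lines, so every miss is a conflict miss. The two-way cache of the same total size misses only on the first touch of each line.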

Conclusion
Mach system is too big
• Kernel + UNIX server + emulation library
UNIX server is essentially same
Emulation library is irrelevant (according to Chen & Bershad)
Inevitable conclusion: Mach kernel working set is too big
Can we build microkernels which avoid these problems?

Requirements for Microkernels
Fast (system call costs, IPC costs)
Small (big almost inevitably means slow)
Must be well designed
Must provide a minimal set of operations
Can this be done?
Example: kernel call cost on i486
• Mach kernel call: 900 cycles
• Inherent (hardware-dictated) cost: 107 cycles
  − roughly 800 cycles kernel overhead
• L4 kernel call: 123–180 cycles (15–73 cycles overhead)
• Obviously, Mach's performance is a result of design and implementation
  − It is not the result of the microkernel concept!

Microkernel Design Principles [Lie96]
Minimality:
• If it doesn't have to be in the kernel, it shouldn't be in the kernel
Appropriate abstractions
• which can be made fast and allow efficient implementation of services
Well written:
• It pays to shave a few cycles off TLB refill handler or the IPC path
Unportable:
• must be targeted to specific hardware
• no problem if it's small, and higher layers are portable
• Example: Liedtke reports significant rewrite of memory management when porting from 486 to Pentium
  − e.g. size and associativity of cache, TLB
• Hardware abstraction layer is too costly
We'll revisit those principles later

What Must a Microkernel Provide?
Virtual memory/address spaces
• required for protection
Threads (or equivalent, e.g. scheduler activations)
• as execution abstraction
• for exploiting multiple CPUs
Fast IPC
• the most critical operation
Unique identifiers (for IPC addressing)
• Actually, not true: can use local names
• Example: shared memory:
  − “physical” identifiers (physical addresses) only known to kernel
  − Mapped into local name space (virtual addresses)

Microkernel Should Not Provide
File system
• User-level server (as in Mach)
Device drivers
• user-level driver invoked via interrupt (= IPC)
Page-fault handler
• Use user-level pager

L4 Implementation Techniques [Liedtke '93]
Appropriate system calls to minimise number of kernel invocations
• e.g. reply & receive next
• As many syscall args as possible in registers
Efficient IPC
• Rich message structure
• Value and reference parameters in message
• Copy message only once (i.e. not user→kernel→user)
Fast thread access
• Thread UIDs (containing thread ID)
• TCBs in (mapped) VM, cache-friendly layout
• Separate kernel stack for each thread (fast interrupt handling)
General optimisations
• “hottest” kernel code is shortest
• Kernel IPC code on single page, critical data on single page
• Many H/W specific optimisations
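The first technique above, a combined "reply & receive next" call, reduces the number of kernel invocations a server makes per request. The sketch below only counts kernel entries in a toy model; `ToyKernel` and its methods are hypothetical stand-ins and do not correspond to any real L4 system-call interface.

```python
class ToyKernel:
    """Counts kernel entries; a hypothetical stand-in for a real IPC implementation."""

    def __init__(self, requests):
        self.pending = list(requests)
        self.replies = []
        self.entries = 0

    def receive(self):
        self.entries += 1
        return self.pending.pop(0) if self.pending else None

    def reply(self, result):
        self.entries += 1
        self.replies.append(result)

    def reply_and_receive(self, result):
        # One kernel entry both delivers the reply and fetches the next request.
        self.entries += 1
        self.replies.append(result)
        return self.pending.pop(0) if self.pending else None


def serve_separate(kernel):
    """Server loop using two kernel calls per request."""
    request = kernel.receive()
    while request is not None:
        kernel.reply(request * 2)
        request = kernel.receive()


def serve_combined(kernel):
    """Server loop using one combined kernel call per request."""
    request = kernel.receive()
    while request is not None:
        request = kernel.reply_and_receive(request * 2)


separate, combined = ToyKernel(range(1, 6)), ToyKernel(range(1, 6))
serve_separate(separate)
serve_combined(combined)
```

For n requests the separate loop enters the kernel 2n + 1 times and the combined loop n + 1 times, while producing the same replies. Since each entry carries a fixed hardware cost, halving the entries matters most when the kernel path itself is already short.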

Microkernel Performance [95/97]
System     CPU           MHz    RPC [µs]  cyc/IPC  semantics
L4         MIPS R4600    100    2         100      full
L4         Alpha 21164   433    0.2       43       full
L4         Pentium       166    1.5       125      full
L4         i486          50     10        250      full
IBM µk     PPC 604       60     14        420      full
QNX        i486          33     76        1254     full
Mach       MIPS R2000    16.7   190       1587     full
Mach       i486          50     230       5750     full
Amoeba     MC 68020      15     800       6000     full
Spin       Alpha 21064   133    102       6783     full
Mach       Alpha 21064   133    104       6916     full
Exo-tlrpc  MIPS R2000    116.7  6         350      restricted
Spring     SPARC V8      40     11        220      restricted
DP-Mach    i486          66     16        528      restricted
LRPC       CVAX          12.5   157       981      restricted

L4Ka::Pistachio IPC Performance
                                 C/C++               optimised
Architecture   Optimisation   Intra AS  Inter AS  Intra AS  Inter AS
Pentium-3      UKA            180       367       113       305
Itanium 2      NICTA          508       508       36        36
MIPS64         UNSW/NICTA     276       276       109       109
 - inter-CPU   UNSW/NICTA     3238      3238      690       690
PowerPC-64     UNSW/NICTA     330       518       ~200      ~200
Alpha 21264    UNSW/NICTA     440       642       ~70       ~70
ARM/XScale     UNSW/NICTA     340       340       151       151

Case in Point: L4Linux [Härtig et al. 97]
Port of Linux kernel to L4 (like Mach Unix server)
• Single-threaded (for simplicity, not performance)
• Is pager of all Linux user processes
• Maps emulation library and signal-handling code into AS
• Server AS maps physical memory (& Linux runs within)
• Copying between user and server done on physical memory
  − Use software lookup of page tables for address translation
Changes to Linux restricted to architecture-dependent part
Duplication of page tables (L4 and Linux server)
Binary compatible to native Linux via trampoline mechanism
• But also modified libc with RPC stubs
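The cyc/IPC column of the [95/97] table appears to follow from the other two numeric columns if one assumes that an RPC consists of two IPCs (request and reply), so that cycles per IPC = MHz × RPC time in µs / 2. That relation is an inference from the figures, not something the slide states; the sketch below checks it against four rows copied from the table, and `cycles_per_ipc` is a hypothetical helper name.

```python
def cycles_per_ipc(clock_mhz, rpc_microseconds):
    """Cycles for one IPC, assuming (an inference) that one RPC is two IPCs."""
    return clock_mhz * rpc_microseconds / 2


# (system, CPU, MHz, RPC time in µs, cyc/IPC as listed in the table)
rows = [
    ("L4", "MIPS R4600", 100, 2, 100),
    ("L4", "i486", 50, 10, 250),
    ("QNX", "i486", 33, 76, 1254),
    ("Mach", "i486", 50, 230, 5750),
]

# Largest difference between the computed and the listed value over these rows.
worst_gap = max(abs(cycles_per_ipc(mhz, rpc) - listed)
                for _, _, mhz, rpc, listed in rows)

# On the same i486 at 50 MHz, Mach needs this many times the cycles L4 does.
mach_vs_l4 = cycles_per_ipc(50, 230) / cycles_per_ipc(50, 10)
```

Comparing cycles rather than microseconds removes the clock-rate differences between the machines, which is why the table lists both.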

Signal Delivery in L4Linux
Separate signal-handler thread in each user process
(1) Server IPCs signal-handler thread
(2) Handler thread manipulates main user thread to save state
  − Exchange_Registers
(3) User thread IPCs Linux server
(4) Server does signal processing
(5) Server IPCs user thread to resume
[Figure labels: Linux user process (Main thread, Signal thread), Linux server; Forward signal (1), Manipulate thread (2), Enter Linux (3), Resume (5)]

L4Linux Performance: Microbenchmarks
getpid():
System                 Time [µs]  Cycles
Linux                  1.68       223
L4Linux (mod libc)     3.95       526
L4Linux (trampoline)   5.66       753
MkLinux in-kernel      15.66      2050
MkLinux server         110.6      14710

Cycle breakdown:
Client                 Cycles  Server
enter emulation lib    20
send syscall message   168     wait for msg
                       131     Linux kernel
receive reply          188     send reply
leave emulation lib    19
Hardware cost: 82 cycles (133 MHz Pentium)

L4Linux Performance
[Figure: microbenchmarks: lmbench]
[Figure: macrobenchmarks: kernel compile]

Conclusions
Mach sux ⇏ microkernels suck
L4 shows that performance might be deliverable
• L4Linux gets close to monolithic kernel performance
• Need real multi-server system to evaluate microkernel potential
Recent work substantially closer to native performance
• NICTA Wombat, OK Linux
Microkernel-based systems can perform
Mach has prejudiced community (see Linus...)
• Getting microkernels accepted is still an uphill battle
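As a sanity check on the getpid() microbenchmark figures, the five entries of the cycle breakdown should add up to the total reported for L4Linux with the modified libc. The short calculation below uses only numbers from the slide; the pairing of each cycle count with a client or server step follows the reconstructed table layout and should be read as an assumption.

```python
# Cycle breakdown for getpid() on L4Linux (modified libc), as listed on the slide.
breakdown = {
    "enter emulation lib": 20,
    "send syscall message / wait for msg": 168,
    "Linux kernel": 131,
    "receive reply / send reply": 188,
    "leave emulation lib": 19,
}

total_cycles = sum(breakdown.values())

native_linux_cycles = 223            # native Linux getpid(), from the same slide
extra_cycles = total_cycles - native_linux_cycles

# Cycles spent in the two IPC steps (request and reply).
ipc_cycles = (breakdown["send syscall message / wait for msg"]
              + breakdown["receive reply / send reply"])
```

The two IPC steps account for most of the round trip, which is why the cost of the IPC path dominates this particular microbenchmark.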

Present State
Microkernels deployed for years where reliability matters
• QNX, Integrity
• Military, aerospace, automotive
OKL4 is now being deployed where performance matters
• Mobile wireless devices
  − Qualcomm chipsets
  − Mobile phones
• Estimated deployment: 150 million devices (August '08)
• About to enter general consumer-electronics area (set-top boxes)

Liedtke's Design Principles: What Stands?
Minimality: definitely
Appropriate abstractions: yes
• but no agreement about some of them
• L4 API still developing
• NICTA seL4 is most advanced model
  − Integration with commercial OKL4 will set a new standard
Well-written: absolutely
Unportable: no
• Pistachio is proof
• but highly optimised IPC fast path (assembler)

How About His Implementation Techniques?
Appropriate system calls: yes
• But probably less critical than thought
Efficient IPC, rich message structure: less so
• OKL4 has abandoned structured messages
• Passing data in registers beneficial on some architectures
• single-copy definitely wins
• Note introduction of asynchronous notification and memcopy syscall in OKL4
Fast thread access: no (at least as promoted by Liedtke)
• Thread UIDs may be nice but are a security issue
  − Covert storage channel through global names
  − Segregated caps are the way to go (see OKL4)
• virtually-mapped linear (sparse) TCB array: no
  − Performance impact negligible [Nourai 05]
  − Wastes address space, requires exception handling in kernel (complexity)
• per-thread kernel stacks: no
  − Performance impact negligible [Warton 05]
  − Wastes physical memory (very significant for embedded use)
  − Creates multiprocessor scalability issues
Authors Gernot Heiser
License CC-BY-SA-4.0