Microkernels in a Bit More Depth
COMP9242
2008/S2 Week 3
©2008 Gernot Heiser UNSW/NICTA/OKL. Distributed under Creative Commons Attribution License 1
Copyright Notice
These slides are distributed under the Creative Commons
Attribution 3.0 License
You are free:
• to share — to copy, distribute and transmit the work
• to remix — to adapt the work
Under the following conditions:
• Attribution. You must attribute the work (but not in any way that suggests
that the author endorses you or your use of the work) as follows:
• “Courtesy of Gernot Heiser, [Institution]”, where [Institution] is one of
• “UNSW”, “NICTA”, or “Open Kernel Labs”
The complete license text can be found at
http://creativecommons.org/licenses/by/3.0/legalcode
Motivation
Early operating systems had very little structure
A strictly layered approach was promoted by Dijkstra
• THE Operating System [Dij68]
Later OSes (more or less) followed that approach (e.g., Unix).
Such systems are known as monolithic kernels
Issues of Monolithic Kernels
Advantages:
Kernel has access to everything:
• all optimisations possible
• all techniques/mechanisms/concepts implementable
Kernel can be extended by adding more code, e.g. for:
• new services
• support for new hardware
Problems:
• Widening range of services and applications
• OS bigger, more complex, slower, more error prone.
• Need to support same OS on different hardware.
• Would like to support various OS environments.
• Distribution
− Impossible to provide all services from same (local) kernel
Evolution of the Linux Kernel
Approaches to Tackling Complexity
Classical software-engineering approach: modularity
• (Relatively) small, mostly self-contained components
• Well-defined interfaces between them
• Enforcement of interfaces
• Containment of faults to few modules
Doesn’t work with monolithic kernels:
• All kernel code executes in privileged mode
• Faults aren't contained
• Interfaces cannot be enforced
• Performance takes priority over structure
Cross-Module Dependencies (“Spaghettiness”)
Evolution of the Linux Kernel — Part 2
Software-engineering study of Linux kernel [SJW+02]:
Looked at size and interdependencies of kernel “modules”
• “common coupling”: interdependency via global variables
Analyzed development over time (linearised version number)
Result 1: Module size grows linearly with version number
Result 2: Interdependency grows exponentially with version!
The present Linux model is doomed!
There is no reason to believe that others are different
• e.g. Windows, MacOS, ...
Need better software engineering in operating systems!
Monolithic vs. Microkernel OS Structure
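[Figure: layered structure of a monolithic OS (Applications unprivileged, OS privileged, on Hardware) next to a microkernel OS (Applications and User-level Servers unprivileged, Microkernel privileged, on Hardware)]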
Based on the ideas of Brinch Hansen's “Nucleus” [BH70]
Monolithic vs. Microkernel OS Structure
[Figure: a monolithic OS (application enters the kernel via syscall; VFS, IPC, file system, scheduler, virtual memory, device drivers and dispatcher all in kernel mode) next to a microkernel OS (application, Unix server, file server and device driver in user mode, communicating via IPC; only IPC and virtual memory in kernel mode)]
Monolithic OS:
• lots of privileged code
• vertical structure
• invoked by system call
Microkernel OS:
• little privileged code
• horizontal structure
• invoked by IPC
Microkernel OS
Kernel:
• Contains code which must run in supervisor mode
• Isolates hardware dependence from higher levels
• Is small and fast, making for an extensible system
• Provides mechanisms.
User-level servers:
• Are hardware independent/portable
• Provide "OS environment"/"OS personality" (maybe several)
• May be invoked:
− From application (via message-passing IPC)
− From kernel (upcalls)
• Implement policies [BH70].
Downcall vs. Upcall
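[Figure: applications (unprivileged) above the kernel (privileged); a downcall (syscall) arrow points into the kernel, an upcall arrow points up to the applications]
Downcall:
• unprivileged code enters kernel mode
• implemented via trap
Upcall:
• privileged code enters user mode
• implemented via signal/IPC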
Microkernel-Based Systems
[Figure: three system configurations, each running on OKL4 over hardware. Component labels: Classic OS, Embedded, Native app, Security mini-OS, Java, Highly-specialized component. Caption row: Classic, + thin, specialized]
Microkernel-Based Systems
[Figure: hybrid system on OKL4. Labels: Apps on OK Linux; Real-Time App; Comms Library; Object Mgr; Loader; File System; TCP/IP; User Interface; Network, Display and Flash; several further components (Comp)]
Hybrid system
• Linux for legacy support or high-level API requirements
• RTOS for legacy support for real-time apps
• Highly componentised system for robustness
Provides migration path from legacy to componentised
Early Example: Hydra
Separation of mechanism from policy
• e.g. protection vs. security
No hierarchical layering of kernel
Protection, even within OS
• Uses (segregated) capabilities
Objects, encapsulation, units of protection.
Unique object name, no concept of object ownership.
Object persistence based on reference counting [WCC+74]
Hydra...
Can be considered the first object-oriented OS
Has been called the first microkernel OS
• by people who ignored Brinch Hansen
Has had enormous influence on later OS research
Was never widely used even at CMU because of
• poor performance
• lack of a complete environment
Popular Example: Mach
Developed at CMU by Rashid and others [RTY+88] from 1984
Successor of Accent [FR86] and RIG [Ras88]
Goals:
Tailorability: support different OS interfaces
Portability: almost all code H/W independent
Real-time capability
Multiprocessor and distribution support
Security
Coined term microkernel
Basic Features of Mach Kernel
Task and thread management
Interprocess communication
• asynchronous message-passing
Memory object management
System call redirection
• for virtualization (although they didn't call it that)
Device support
Multiprocessor support
Mach Tasks and Threads
Thread
• active entity (basic unit of CPU utilisation)
• own stack, kernel scheduled
• may run in parallel on multiprocessor
Task
• consists of one or more threads
• provides address space and other environment
• created from "blueprint"
− Empty or inherited address space
− Similar approach adopted by Linux clone()
• Activated by creating a thread in it
“Privileged user-state program” may control scheduling
Mach IPC: Ports
Addressing based on ports:
• port is a mailbox, allocated/destroyed via a system call
• has a fixed-size message queue associated with it
• is protected by (segregated) capabilities
• has exactly one receiver, but possibly many senders
• can have "send-once" capability to a port
− for RPC replies (server invocation)
Can pass the receive capability for a port to another process
• give up read access to the port
Kernel detects (and cleans up) ports without senders or receiver
Processes may have many ports (UNIX server has 2000!)
• can be grouped into port sets
• supports listening to many (similar to Unix select)
Send blocks if queue is full
• blocking limited by timeout
Indirection via ports supports transparent distribution
• Local proxy port forwards message to receiver on remote node
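The queueing behaviour described above can be sketched as a toy model. This is only an illustration, not Mach code: the class `Port` and its methods `send` and `receive` are hypothetical names, the capability checks are left out, and a thread-safe bounded queue stands in for the kernel's fixed-size message queue.

```python
import queue


class Port:
    """Toy model of a Mach-style port: a bounded mailbox (hypothetical API)."""

    def __init__(self, capacity):
        # Fixed-size message queue associated with the port.
        self._queue = queue.Queue(maxsize=capacity)

    def send(self, message, timeout):
        """Enqueue a message, blocking while the queue is full.

        Blocking is limited by `timeout` (seconds). Returns True on
        success and False if the timeout expired first.
        """
        try:
            self._queue.put(message, block=True, timeout=timeout)
            return True
        except queue.Full:
            return False

    def receive(self, timeout):
        """Dequeue the oldest message, or return None after `timeout` seconds."""
        try:
            return self._queue.get(block=True, timeout=timeout)
        except queue.Empty:
            return None
```

Any number of threads may call `send` on the same `Port`, while the model assumes that only the single holder of the receive right calls `receive`.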
Mach IPC: Messages
Segregated capabilities:
• Threads refer to them via local indices
• Kernel marshals capabilities in messages
• Message format must identify caps
Message contents
• Send capability to destination port (mandatory)
− Used by kernel to validate operation
• Optional send capability to reply port
− For use by receiver to send reply
• Possibly other capabilities
• “in-line” (by-value) data
• “out-of-line” (by reference) data, using copy-on-write
− May contain whole address spaces
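A minimal sketch of what "threads refer to capabilities via local indices" and "kernel marshals capabilities" could look like. All names (`Task`, `Port`, `Message`, `Received`, `kernel_send`) are hypothetical and the model ignores rights types, queues and out-of-line data; it only shows the kernel validating the destination capability and re-indexing passed capabilities into the receiver's table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Task:
    """A task with its own capability table (local index -> kernel object)."""
    name: str
    caps: List[object] = field(default_factory=list)

    def install(self, obj) -> int:
        """Store a kernel object reference and return its task-local index."""
        self.caps.append(obj)
        return len(self.caps) - 1


@dataclass(eq=False)
class Port:
    name: str
    receiver: Task


@dataclass
class Message:
    dest: int                      # sender-local index of the destination send capability
    reply: Optional[int] = None    # sender-local index of an optional reply capability
    extra_caps: List[int] = field(default_factory=list)
    inline: bytes = b""            # by-value data


@dataclass
class Received:
    port_name: str
    reply: Optional[int]           # receiver-local index of the reply capability
    extra_caps: List[int]          # receiver-local indices of passed capabilities
    inline: bytes


def kernel_send(sender: Task, msg: Message) -> Received:
    """Validate the destination and translate capabilities for the receiver."""
    port = sender.caps[msg.dest]   # IndexError if the sender holds no such capability
    if not isinstance(port, Port):
        raise TypeError("destination capability does not name a port")
    receiver = port.receiver
    reply = None if msg.reply is None else receiver.install(sender.caps[msg.reply])
    extra = [receiver.install(sender.caps[i]) for i in msg.extra_caps]
    return Received(port.name, reply, extra, msg.inline)
```

Because indices are meaningful only inside one task's table, the message format has to tell the kernel which fields are capabilities, so that it can translate them as above.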
Mach IPC
[Figure: Mach IPC message layout (message header, port rights (capabilities), out-of-line data, in-line data) and the transfer of out-of-line data between the virtual address spaces of Task 1 and Task 2, showing the mappings to physical memory before and after IPC]
Mach Virtual Memory Management
Address space constructed from memory regions
Initially empty
Populated by:
• explicit allocation
• explicitly mapping a memory object
• inheriting from parent
− by-region inheritance: none, copy, shared
• allocated automatically by kernel during IPC
− when passing by-reference parameters
− kernel determines mapping location
Leads to sparse virtual memory use (unlike UNIX)
• uses complex address-map data structure to limit impact
Extensive use of copy-on-write for efficiency
• imposes alignment restrictions
• not necessarily a win for single pages
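Copy-on-write can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not Mach's implementation: `Frame` and `AddressSpace` are hypothetical names, pages are byte arrays, and a "write fault" is just a branch in `write`.

```python
class Frame:
    """A physical page frame with a count of mappings that reference it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    def __init__(self):
        self.pages = {}  # virtual page number -> [frame, writable]

    def map_new(self, vpn, data):
        self.pages[vpn] = [Frame(data), True]

    def share_cow(self, vpn, other, other_vpn):
        """Map one of our pages into `other`; both mappings become read-only."""
        entry = self.pages[vpn]
        entry[1] = False
        entry[0].refs += 1
        other.pages[other_vpn] = [entry[0], False]

    def read(self, vpn, offset):
        return self.pages[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.pages[vpn]
        if not entry[1]:                      # write fault on a read-only mapping
            frame = entry[0]
            if frame.refs > 1:                # still shared: make a private copy
                frame.refs -= 1
                entry[0] = Frame(frame.data)
            entry[1] = True                   # sole owner may now write
        entry[0].data[offset] = value
```

The copy happens only on the first write to a shared page, which is why the technique pays off for large regions that are mostly read.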
Mach Memory Objects
Kernel doesn't support file system
Memory objects are an abstraction of secondary storage:
• can be mapped into virtual memory
• are cached by the kernel in physical memory
• pager invoked if unmapped page is touched (or R/O page written to)
− invoke file system server to provide data
Support data sharing
• by mapping objects into several address spaces
Mach views virtual memory only as a cache for memory objects
User-Level Page Fault Handlers
All actual I/O is performed by a pager, which can be
• default pager (provided by kernel), or
• external pager, running at user level
[Figure: a task and an external pager at user level; the kernel holds the address map and the memory object and talks to the pager via IPC]
Intrinsic page fault cost: 2 IPCs
(1) Check protection & locate memory object
• uses address map
(2) Check cache, invoke pager if cache miss
• uses a hashed page table cache
(3) Check copy-on-write
• perform physical copy if write fault
(4) Enter new mapping into H/W page tables
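The four steps can be sketched as one function. This is a simplified model, not Mach's code: `Region`, `handle_page_fault`, the list-based address map, the dict-based page cache and page table, and the `pager` callback (standing in for the IPC to an external pager) are all assumptions made for the example.

```python
from typing import NamedTuple

PAGE_SIZE = 4096  # illustrative page size


class Region(NamedTuple):
    memory_object: str   # name of the backing memory object
    first_page: int      # virtual page number where the region starts
    num_pages: int
    writable: bool
    copy_on_write: bool


def handle_page_fault(regions, page_cache, page_table, pager, vaddr, is_write):
    vpn = vaddr // PAGE_SIZE
    # (1) Check protection and locate the memory object via the address map.
    region = next((r for r in regions
                   if r.first_page <= vpn < r.first_page + r.num_pages), None)
    if region is None:
        raise MemoryError("fault outside any mapped region")
    if is_write and not (region.writable or region.copy_on_write):
        raise PermissionError("write to read-only region")
    # (2) Check the page cache; ask the pager for the data on a miss.
    key = (region.memory_object, vpn - region.first_page)
    if key not in page_cache:
        page_cache[key] = pager(*key)
    frame = page_cache[key]
    # (3) Copy-on-write: a write fault gets a private physical copy.
    if is_write and region.copy_on_write:
        frame = bytearray(frame)
    # (4) Enter the new mapping into the (simulated) hardware page table.
    page_table[vpn] = frame
    return frame
```

Only step (2) leaves the kernel, and only on a cache miss, which is where the two IPCs (request to the pager and its reply) come from.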
Mach Unix Virtualization
[Figure: an application with its emulation library, the Unix server, and Mach; numbered arrows (1 to 5) trace a system call through the syscall redirect]
Emulation library in user address space handles IPC
Invoked by system call redirection (trampoline mechanism)
• Supports binary compatibility
• Example of what's now called para-virtualization
Mach = Microkernel?
Most OS services implemented at user level
• Using memory objects and external pagers
• Provides mechanisms, not policies
Mostly hardware independent
Big!
• 140 system calls (300 in later versions), >100 kLOC
− Compare: Unix 6th edition had 48 syscalls (10 kLOC without drivers)
• 200 KiB text size (350 KiB in later versions)
Performance poor
• Tendency to move features into kernel
− OSF/1
− Darwin (base of MacOS X): complete BSD kernel inside Mach
Further information on Mach: [YTR+87, CDK94, Sin97]
Other Client-Server Systems
Lots! Most notable systems:
Amoeba: VU Amsterdam (Vrije Universiteit), early 1980's [TM81, TM84, MT86]
• followed by Minix ('87), Minix 3 ('05)
Chorus: INRIA (France), early 1980's [DA92, RAA+90, RAA+92]
• Commercialised by Chorus Systèmes in 1988
• Targeted embedded systems (esp. network infrastructure)
• Bought by Sun in 1997, closed down in 2002
• Chorus team spun out to create Jaluna (renamed VirtualLogix in '06)
• Now market embedded virtualization technology
QNX: “first commercial microkernel” (early '80s)
• highly successful in automotive and other transport systems
Green Hills Integrity
• '97 for military, commercial release '02
• market leader in aerospace, military
Windows NT: Microsoft (early 1990's) [Cus93]
• Early versions (NT 3) were microkernel-ish
• Now run main servers and most drivers in kernel mode
Critique of Microkernel Architectures
I'm not interested in making devices look like user-level.
They aren't, they shouldn't, and microkernels are just stupid.
Linus Torvalds
Is Linus right?
Microkernel Performance
First generation microkernel systems ('80s, early '90s)
• Exhibited poor performance when
− Compared to monolithic UNIX implementations
• Particularly Mach, the best-known example
− But others weren't better
Typical result: re-kernelise systems
• Move OS services back into the kernel for performance
• Move complete OS personalities into kernel
− Mach Unix “server” → Unix kernel co-located with Mach
− Chorus Unix
− Mac OS X
− OSF/1....
Some spectacular failures
• most notorious: IBM Workplace OS [Phelan et al. 93]
• also the GNU Hurd
• many others...
IBM Workplace OS (1991–96)
Unify IBM's operating systems (and produce cost savings)
• DOS, OS/2, Posix, AIX, OS/400, Windows (binary compatible)
• all on same underlying platform, available concurrently
• apps can use services from multiple OSes
• “Grand Unification Theory of Operating Systems” (GUTS)
Scale across a wide range of environments
• PDAs (ARM)
• desktops (x86, PowerPC)
• massively-parallel machines (Power, ...)
Decided to base on Mach
• “Workplace OS microkernel” derived from Mach 3.0
• for providing concurrent OS personalities
• share personality neutral services (PNSs)
[Figure: Applications on top of the DOS, OS/2 and AIX personalities, which run on Personality Neutral Services over the Workplace OS Microkernel]
IBM Workplace OS
Significant modifications to Mach to address its problems
• synchronous IPC, single-copy message-passing
• direct support for RPC
− send+receive-reply without user-level capability manipulation
• migrating threads model
− thread moves with message during IPC
• improvements in memory management
− e.g. use mappings for message transfers
• security tokens that reduce number of rights checks
• generally simplified and optimised code base
• more than doubled overall code size
• improved IPC performance 3 times (still 8 times slower than L4)
Plagued by problems
• Schedule overruns
• Budget overruns
• On-going technical problems
IBM Workplace OS History
One of the biggest OS projects ever: US$2G
• 400 microkernel, 1500 OS/2 programmers
Jan '91: Project start
Fall '92: Demoed OS/2, DOS and Unix on Mach
Fall '93: Announced that Workplace would not replace AIX
Jan '95: completely abandoned AIX personality
Oct '95: GA release of microkernel for PowerPC
Oct '95: Workplace project cancelled, Personal Power Div closed
Early '96: shipped last version (2.0) for x86, PowerPC, ARM
Considered a prime example of vapourware
• much marketing before technology was created
IBM Workplace OS Lessons
Analysis by Fleisch, Allan [1998]
Difficulty mapping personality services to shared PNSs
• required extensive restructuring of existing code
• difficult to get PNS APIs right
Featurism
Focussed on microkernel, too late on personalities
Too much focus on portability of microkernel?
Poor management of huge project
• e.g. wrt shared PNSs
The analysis does not mention microkernel performance as an issue
Microkernel Performance
Performance problems of Mach became generally known in '93
Reasons were investigated by [Chen & Bershad 93]:
• Instrumented user and system code to collect execution traces
• Run on DECstation 5000/200 (25MHz R3000)
• Run under Ultrix and Mach with Unix server
• Traces fed to memory system simulator
• Analyse MCPI (memory cycles per instruction)
− Baseline MCPI (i.e. excluding idle loops)
Ultrix vs. Mach-Unix MCPI
Interpretation
Observations:
• Mach memory penalty higher
− i.e. cache misses or write stalls
• Mach VM system executes more instructions than Ultrix
− But has more functionality
Claim:
• Degraded performance is (intrinsic?) result of OS structure
• IPC cost is not a major factor [Ber92]
− IPC cost known to be high in Mach
Assertions
1 OS has less instruction & data locality than user code
• System code has higher cache and TLB miss rates
• Particularly bad for instructions
2 System execution is more dependent on instruction cache behaviour than is user execution
• MCPIs dominated by system i-cache misses
• Note: most benchmarks were small, i.e. user code fits in cache
3 Competition between user & system code is not a problem
• Few conflicts between user and system caching
• TLB misses are not a relevant factor
• Note: the hardware used has direct-mapped physical caches
− Split system/user caches wouldn't help
Self-Interference
Only examine system cache misses
[Figure: MCPI for system-only references, using the R3000 direct-mapped cache; shaded portions are system cache misses removed by associativity]
Reductions due to associativity were obtained by running the system on a simulator, using a two-way associative cache of the same size
Assertions
4 Self-interference is a problem in system instruction reference streams.
• High internal conflicts in system code
• System would benefit from higher cache associativity
5 System block memory operations are responsible for a large percentage of memory system reference costs
• Particularly true for I/O system calls
6 Write buffers are less effective for system references.
• Write buffer allows limited asynchronous writes on cache misses
7 Virtual-to-physical mapping strategy can have significant impact on cache performance
• Unfortunate mapping may increase conflict misses
• “Random” mappings (Mach) are to be avoided
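The effect of conflict misses, and how associativity removes them, can be reproduced with a tiny cache simulator. This is a sketch only: `count_misses` is a hypothetical helper, replacement is assumed to be LRU, and the cache and line sizes used in the example are illustrative rather than the parameters of the measured machine.

```python
from collections import OrderedDict


def count_misses(addresses, cache_size, line_size, ways):
    """Count misses of a set-associative cache with LRU replacement.

    ways=1 models a direct-mapped cache.
    """
    num_sets = cache_size // (line_size * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)    # evict least recently used line
            cache_set[line] = True
    return misses


# Two addresses exactly one cache size apart, touched alternately.
trace = [0, 64 * 1024] * 100
direct_mapped = count_misses(trace, cache_size=64 * 1024, line_size=16, ways=1)
two_way = count_misses(trace, cache_size=64 * 1024, line_size=16, ways=2)
```

In the direct-mapped case both addresses map to the same line and evict each other on every access; with two ways of the same total size they coexist after the first two misses. An unlucky virtual-to-physical mapping can create exactly this kind of alias in a physically indexed cache.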
Other Experience with Microkernel Performance
System call costs are (inherently?) high
• Typically hundreds of cycles, 900 for Mach/i486
Context (address-space) switching costs (inherently?) high
• Getting worse (in terms of cycles) with increasing CPU/memory speed ratios [Ous90]
• IPC (involving system calls and context switches) is inherently expensive
Microkernels heavily depend on IPC
IPC is expensive
• Is the microkernel idea flawed?
• Should some code never leave the kernel?
• Do we have to buy flexibility with performance?
A Critique of the Critique
Data presented earlier:
• Are specific to one system (or a few)
• Results cannot be generalised without thorough analysis
• No such analysis had been done
Cannot trust the conclusions [Lie95]
Re-Analysis of Chen & Bershad's Data
MCPI for Ultrix and Mach
Re-Analysis of Chen & Bershad's Data
MCPI caused by cache misses: conflict (black) vs capacity (white)
Conclusion
Mach system is too big
• Kernel + UNIX server + emulation library
UNIX server is essentially same
Emulation library is irrelevant (according to Chen & Bershad)
Inevitable conclusion: Mach kernel working set is too big
Can we build microkernels which avoid these problems?
Requirements for Microkernels
Fast (system call costs, IPC costs)
Small (big is almost inevitably slow)
Must be well designed
Must provide a minimal set of operations
Can this be done?
Example: kernel call cost on i486
• Mach kernel call: 900 cycles
• Inherent (hardware-dictated cost): 107 cycles
− 800 cycles kernel overhead
• L4 kernel call: 123–180 cycles (15–73 cycles overhead)
• Obviously, Mach’s performance is a result of design and implementation
− It is not the result of the microkernel concept!
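The overhead figures follow from subtracting the hardware-dictated cost from each measured kernel call. A short check of that arithmetic, using only the numbers on this slide (the function name `kernel_overhead` is just for illustration):

```python
HARDWARE_CYCLES = 107  # hardware-dictated cost of a kernel call on the i486


def kernel_overhead(total_cycles, hardware_cycles=HARDWARE_CYCLES):
    """Cycles spent in kernel software beyond the unavoidable hardware cost."""
    return total_cycles - hardware_cycles


mach_overhead = kernel_overhead(900)
l4_overhead = (kernel_overhead(123), kernel_overhead(180))
print("Mach overhead:", mach_overhead)
print("L4 overhead range:", l4_overhead)
```

The subtraction gives 793 cycles for Mach, which the slide rounds to 800, and 16 to 73 cycles for L4, within one cycle of the quoted range.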
Microkernel Design Principles [Lie96]
Minimality:
• If it doesn't have to be in the kernel, it shouldn't be in the kernel
Appropriate abstractions
• which can be made fast and allow efficient implementation of services
Well written:
• It pays to shave a few cycles off TLB refill handler or the IPC path
Unportable:
• must be targeted to specific hardware
• no problem if it's small, and higher layers are portable
• Example: Liedtke reports significant rewrite of memory management when porting from 486 to Pentium
− e.g. because of differences in size and associativity of cache, TLB
• Hardware abstraction layer is too costly
We'll revisit those principles later
What Must a Microkernel Provide?
Virtual memory/address spaces
• required for protection
Threads (or equivalent, e.g. scheduler activations)
• as execution abstraction
• for exploiting multiple CPUs
Fast IPC
• the most critical operation
Unique identifiers (for IPC addressing)
• Actually, not true: can use local names
• Example: shared memory:
− “physical” identifiers (physical addresses) only known to kernel
− Mapped into local name space (virtual addresses)
Microkernel Should Not Provide
File system
• User-level server (as in Mach)
Device drivers
• user-level driver invoked via interrupt (= IPC)
Page-fault handler
• Use user-level pager
L4 Implementation Techniques [Liedtke '93]
Appropriate system calls to minimise number of kernel invocations
• e.g. reply & receive next
• As many syscall args as possible in registers
Efficient IPC
• Rich message structure
• Value and reference parameters in message
• Copy message only once (i.e. not user→kernel→user)
Fast thread access
• Thread UIDs (containing thread ID)
• TCBs in (mapped) VM, cache-friendly layout
• Separate kernel stack for each thread (fast interrupt handling)
General optimisations
• “hottest” kernel code is shortest
• Kernel IPC code on single page, critical data on single page
• Many H/W specific optimisations
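Why a combined "reply & receive next" call helps can be seen by counting kernel entries in a server loop. The sketch below is a toy model, not the L4 API: `ToyKernel` and its methods `receive`, `reply` and `reply_and_receive` are hypothetical, and a kernel entry is simply a counter increment.

```python
from collections import deque


class ToyKernel:
    """Counts kernel entries made by a server (hypothetical interface)."""

    def __init__(self, requests):
        self.pending = deque(requests)
        self.replies = []
        self.entries = 0

    def _next(self):
        return self.pending.popleft() if self.pending else None

    def receive(self):
        self.entries += 1
        return self._next()

    def reply(self, result):
        self.entries += 1
        self.replies.append(result)

    def reply_and_receive(self, result):
        self.entries += 1              # one kernel entry does both jobs
        self.replies.append(result)
        return self._next()


def serve_separate(kernel, handler):
    request = kernel.receive()
    while request is not None:
        kernel.reply(handler(request))
        request = kernel.receive()


def serve_combined(kernel, handler):
    request = kernel.receive()
    while request is not None:
        request = kernel.reply_and_receive(handler(request))
```

For n requests the separate calls cost 2n + 1 kernel entries in this model, the combined call n + 1, so a busy server enters the kernel roughly half as often.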
Microkernel Performance [95/97]
System     CPU          MHz    RPC [µs]  cyc/IPC  semantics
L4         MIPS R4600   100    2         100      full
L4         Alpha 21164  433    0.2       43       full
L4         Pentium      166    1.5       125      full
L4         i486         50     10        250      full
IBM µk     PPC 604      60     14        420      full
QNX        i486         33     76        1254     full
Mach       MIPS R2000   16.7   190       1587     full
Mach       i486         50     230       5750     full
Amoeba     MC 68020     15     800       6000     full
Spin       Alpha 21064  133    102       6783     full
Mach       Alpha 21064  133    104       6916     full
Exo-tlrpc  MIPS R2000   116.7  6         350      restricted
Spring     SPARC V8     40     11        220      restricted
DP-Mach    i486         66     16        528      restricted
LRPC       CVAX         12.5   157       981      restricted
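The cyc/IPC column appears to be derived from the other two: an RPC is a round trip of two IPCs, so cycles per IPC is the clock rate in MHz times the RPC time in µs, divided by two. This relation is inferred from the table rows rather than stated on the slide; the helper name `cycles_per_ipc` is made up for the check below.

```python
def cycles_per_ipc(clock_mhz, rpc_us):
    """Cycles for one IPC, assuming an RPC consists of two IPCs."""
    return clock_mhz * rpc_us / 2


rows = [
    ("L4 / MIPS R4600", 100, 2, 100),
    ("L4 / i486", 50, 10, 250),
    ("QNX / i486", 33, 76, 1254),
    ("Mach / i486", 50, 230, 5750),
]
for name, mhz, rpc_us, listed in rows:
    print(name, cycles_per_ipc(mhz, rpc_us), "listed:", listed)
```

Rows with fractional inputs (e.g. 433 MHz and 0.2 µs) agree with the listed value only after rounding.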
L4Ka::Pistachio IPC Performance
                              C/C++               optimised
Architecture  Optimisation    Intra AS  Inter AS  Intra AS  Inter AS
Pentium-3     UKA             180       367       113       305
Itanium 2     NICTA           508       508       36        36
MIPS64        UNSW/NICTA      276       276       109       109
- inter-CPU   UNSW/NICTA      3238      3238      690       690
PowerPC-64    UNSW/NICTA      330       518       ~200      ~200
Alpha 21264   UNSW/NICTA      440       642       ~70       ~70
ARM/XScale    UNSW/NICTA      340       340       151       151
Case in Point: L4Linux [Härtig et al. 97]
Port of Linux kernel to L4 (like Mach Unix server)
• Single-threaded (for simplicity, not performance)
• Is pager of all Linux user processes
• Maps emulation library and signal-handling code into AS
• Server AS maps physical memory (& Linux runs within)
• Copying between user and server done on physical memory
− Use software lookup of page tables for address translation
Changes to Linux restricted to architecture-dependent part
Duplication of page tables (L4 and Linux server)
Binary compatible to native Linux via trampoline mechanism
• But also modified libc with RPC stubs
Signal Delivery in L4Linux
Separate signal-handler thread in each user process
(1) Server IPCs signal-handler thread
(2) Handler thread manipulates main user thread to save state
− Exchange_Registers
(3) User thread IPCs Linux server
(4) Server does signal processing
(5) Server IPCs user thread to resume
[Figure: Linux user process (main thread and signal thread) and Linux server, with arrows: Forward signal (1), Manipulate thread (2), Enter Linux (3), Resume (5)]
L4Linux Performance: Microbenchmarks
getpid():
System                  Time [µs]  Cycles
Linux                   1.68       223
L4Linux (mod libc)      3.95       526
L4Linux (trampoline)    5.66       753
MkLinux in-kernel       15.66      2050
MkLinux server          110.6      14710
Hardware cost: 82 cycles (133MHz Pentium)
Cycle breakdown:
Client                  Cycles     Server
enter emulation lib     20
send syscall message    168        wait for msg
                        131        Linux kernel
receive reply           188        send reply
leave emulation lib     19
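The cycle breakdown can be cross-checked against the table: the five components should add up to the L4Linux (mod libc) figure, and times convert to cycles via the 133 MHz clock. A small check using only numbers from this slide (the names `BREAKDOWN` and `us_to_cycles` are illustrative):

```python
CLOCK_MHZ = 133  # Pentium clock rate quoted on the slide

BREAKDOWN = [
    ("enter emulation lib", 20),
    ("send syscall message", 168),
    ("Linux kernel", 131),
    ("receive reply", 188),
    ("leave emulation lib", 19),
]


def us_to_cycles(microseconds, clock_mhz=CLOCK_MHZ):
    """Convert a time in microseconds to CPU cycles."""
    return microseconds * clock_mhz


total_cycles = sum(cycles for _, cycles in BREAKDOWN)
print("sum of breakdown:", total_cycles)
print("3.95 us in cycles:", us_to_cycles(3.95))
```

The sum matches the 526 cycles listed for L4Linux with the modified libc, of which only 131 cycles are spent in the Linux server's own getpid() handling.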
L4Linux Performance
Microbenchmarks: lmbench
Macrobenchmarks: kernel compile
Conclusions
Mach sux ⇏ microkernels suck
L4 shows that performance might be deliverable
• L4Linux gets close to monolithic kernel performance
• Need real multi-server system to evaluate microkernel potential
Recent work substantially closer to native performance
• NICTA Wombat, OK Linux
Microkernel-based systems can perform
Mach has prejudiced community (see Linus...)
• Getting microkernels accepted is still uphill battle
Present State
Microkernels deployed for years where reliability matters
• QNX, Integrity
• Military, aerospace, automotive
OKL4 is now being deployed where performance matters
• Mobile wireless devices
− Qualcomm chipsets
− Mobile phones
• Estimated deployment: 150 million devices (August '08)
• About to enter general consumer-electronics area (set-top boxes)
Liedtke's Design Principles: What Stands?
Minimality: definitely
Appropriate abstractions: yes
• but no agreement about some of them
• L4 API still developing
• NICTA seL4 is most advanced model
− Integration with commercial OKL4 will set a new standard
Well-written: absolutely
Unportable: no
• Pistachio is proof
• but highly optimised IPC fast path (assembler)
How About His Implementation Techniques?
Appropriate system calls: yes
• But probably less critical than thought
Efficient IPC, rich message structure: less so
• OKL4 has abandoned structured messages
• Passing data in registers beneficial on some architectures
• single-copy definitely wins
• Note introduction of asynchronous notification and memcopy syscall in OKL4
Fast thread access: no (at least as propagated by Liedtke)
• Thread UIDs may be nice but are a security issue
− Covert storage channel through global names
− Segregated caps are the way to go (see OKL4)
• virtually-mapped linear (sparse) TCB array: no
− Performance impact negligible [Nourai 05]
− Wastes address space, requires exception handling in kernel (complexity)
• per-thread kernel stacks: no
− Performance impact negligible [Warton 05]
− Wastes physical memory (very significant for embedded use)
− Creates multiprocessor scalability issues