Hardware/Software Co-Design for Efficient
Microkernel Execution
Martin Děcký
martin.decky@huawei.com
February 2019
Martin Děcký, FOSDEM, February 3rd
2019 Hardware/Software Co-Design for Efficient Microkernel Execution 2
Who Am I
Passionate programmer and operating systems enthusiast
With a specific inclination towards multiserver microkernels
HelenOS developer since 2004
Research Scientist from 2006 to 2018
Charles University (Prague), Distributed Systems Research Group
Senior Research Engineer since 2017
Huawei Technologies (Munich), German Research Center, Central
Software Institute, OS Kernel Lab
Microkernel Multiserver
Systems are better than
Monolithic Systems
Monolithic OS Design is Flawed
Biggs S., Lee D., Heiser G.: The Jury Is In: Monolithic OS Design Is
Flawed: Microkernel-based Designs Improve Security, ACM 9th
Asia-Pacific Workshop on Systems (APSys), 2018
“While intuitive, the benefits of the small TCB have not been quantified to
date. We address this by a study of critical Linux CVEs, where we examine
whether they would be prevented or mitigated by a microkernel-based
design. We find that almost all exploits are at least mitigated to less than
critical severity, and 40 % completely eliminated by an OS design based
on a verified microkernel, such as seL4.”
Problem Statement
Microkernel design ideas go back as far as 1969
RC 4000 Multiprogramming System nucleus (Per Brinch Hansen)
Isolation of unprivileged processes, inter-process communication,
hierarchical control
Even after 50 years they are not fully accepted as mainstream
Hardware and software used to be designed independently
Designing CPUs used to be an extremely complicated and costly process
Operating systems used to be written after the CPUs were designed
Hardware designs used to be rather conservative
Problem Statement (2)
Mainstream ISAs used to be designed in a rather conservative way
Can you name some really revolutionary ISA features since IBM
System/370 Advanced Function?
Requirements on the new ISAs usually follow the needs of the
mainstream operating systems running on the past ISAs
No wonder microkernels suffer performance penalties compared to
monolithic systems
The more fine-grained the architecture, the more penalties it suffers
Let us design the hardware with microkernels in mind!
The Vicious Cycle
CPUs do not support microkernels properly
Microkernels suffer performance penalties
Microkernels are not in the mainstream
No requirements on CPUs from microkernels
Any Ideas?
Communication between Address Spaces
Control and data flow between subsystems
Monolithic kernel
Function calls
Passing arguments in registers and on the stack
Passing direct pointers to memory structures
Multiserver microkernel
IPC via microkernel syscalls
Passing arguments in a subset of registers
Privilege level switch, address space switch
Scheduling (in case of asynchronous IPC)
Data copying or memory sharing with page granularity
Communication between Address Spaces (2)
Is the kernel round-trip of the IPC necessary?
Suggestion for synchronous IPC: Extended Jump/Call and Return instructions
that also switch the address space
Communicating parties identified by a “call gate” (capability) containing the target
address space and the PC of the IPC handler (implicit for return)
Call gates stored in a TLB-like hardware cache (CLB)
CLB populated by the microkernel similarly to TLB-only memory management
architecture
Suggestion for asynchronous IPC: Using CPU cache lines as the buffers for the
messages
Async Jump/Call, Async Return and Async Receive instructions
Using the CPU cache like an extended register stack engine
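The synchronous suggestion above can be illustrated with a rough software model. This is only a sketch under stated assumptions: the CLB is modelled as a small, fully associative cache with LRU replacement, and the names `CallGate`, `CallGateCache`, `kernel_gate_table`, `ext_call` and `ext_return` are hypothetical, not part of any real ISA. The point it tries to show is that the microkernel runs only on a CLB miss, not on every call.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallGate:
    """Hypothetical call gate: target address space and IPC handler PC."""
    asid: int
    handler_pc: int


class CallGateCache:
    """Toy model of a TLB-like cache of call gates (the CLB), LRU-replaced."""

    def __init__(self, capacity, kernel_gate_table):
        self.capacity = capacity
        self.kernel_gate_table = kernel_gate_table  # authoritative, kernel-owned
        self.entries = OrderedDict()                # gate id -> CallGate
        self.kernel_refills = 0                     # how often the kernel ran

    def lookup(self, gate_id):
        if gate_id in self.entries:
            self.entries.move_to_end(gate_id)       # hit: no kernel involvement
            return self.entries[gate_id]
        # Miss: trap to the microkernel, which validates and installs the gate.
        gate = self.kernel_gate_table[gate_id]      # KeyError models a fault
        self.kernel_refills += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)        # evict least recently used
        self.entries[gate_id] = gate
        return gate


def ext_call(clb, cpu, gate_id):
    """Model of the extended call: switch address space and jump in one step."""
    gate = clb.lookup(gate_id)
    # The model saves the caller's PC as-is; real hardware would save the
    # address of the instruction following the call.
    cpu["return_stack"].append((cpu["asid"], cpu["pc"]))
    cpu["asid"], cpu["pc"] = gate.asid, gate.handler_pc


def ext_return(cpu):
    """Model of the extended return: the target is implicit."""
    cpu["asid"], cpu["pc"] = cpu["return_stack"].pop()
```

In this model, repeated calls through the same gate hit in the CLB, so `kernel_refills` stays at one no matter how many calls follow.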
Communication between Address Spaces (3)
Bulk data
Observation: Memory sharing is actually quite efficient for large amounts
of data (multiple pages)
Overhead is caused primarily by creating and tearing down the shared
pages
Data needs to be page-aligned
Sub-page granularity and dynamic data structures
Suggestion: Using CPU cache lines as shared buffers
Much finer granularity than pages (typically 64 to 128 bytes)
A separate virtual-to-cache mapping mechanism before the standard
virtual-to-physical mapping
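To make the granularity argument concrete, the following sketch compares how many bytes end up shared beyond the actual payload at page granularity and at cache-line granularity. The 4 KiB page size and the 64-byte line size are assumptions picked for illustration, and the payload is assumed to start at a unit boundary.

```python
def units_needed(payload_bytes, unit_bytes):
    """Number of fixed-size units (pages or cache lines) covering the payload."""
    return -(-payload_bytes // unit_bytes)  # ceiling division


def sharing_overhead(payload_bytes, unit_bytes):
    """Bytes shared beyond the payload because of unit granularity."""
    return units_needed(payload_bytes, unit_bytes) * unit_bytes - payload_bytes


PAGE = 4096   # assumed page size in bytes
LINE = 64     # assumed cache line size in bytes

for payload in (48, 200, 5000):
    # Columns: payload size, overhead with pages, overhead with cache lines
    print(payload,
          sharing_overhead(payload, PAGE),
          sharing_overhead(payload, LINE))
```

For a 48-byte message the page-based scheme exposes 4048 extra bytes, while a single cache line exposes 16.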
Fast Context Switching
Current microsecond-scale latency hiding mechanisms
Hardware multi-threading
Effective
Does not scale beyond a few threads
Operating system context switching
Scales for any thread count
Too slow (order of 10 µs)
Goal: Finding a sweet spot between the two mechanisms
Fast Context Switching (2)
Suggestion: Hardware cache for contexts
Again, similar mechanism to TLB-only memory management
Dedicated instructions for context store, context restore, context switch, context
save, context load
Context data could be potentially ABI-optimized
Autonomous mechanism for event-triggered context switch (e.g. external
interrupt)
Efficient hardware mechanism for latency hiding
The equivalent of fine/coarse-grained simultaneous multithreading
The software scheduler is in charge of setting the scheduler policy
The CPU is in charge of scheduling the contexts based on ALU, cache and other resource
availability
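A minimal sketch of the hardware context cache idea, assuming a fixed number of on-chip slots with LRU replacement. The class `ContextCache` and its bookkeeping are hypothetical. The mapping of "context save" and "context load" to spilling and filling between the cache and memory is one possible reading of the instruction names above, not something the slides specify.

```python
from collections import OrderedDict


class ContextCache:
    """Toy model of an on-chip cache of thread contexts (register sets)."""

    def __init__(self, slots):
        self.slots = slots
        self.cached = OrderedDict()   # context id -> register dict, LRU order
        self.memory = {}              # backing store for spilled contexts
        self.spills = 0
        self.fills = 0

    def create(self, ctx_id, regs):
        """Software creates a context in memory; it is not cached yet."""
        self.memory[ctx_id] = dict(regs)

    def context_save(self, ctx_id):
        """Write a cached context back to memory and free its slot."""
        self.memory[ctx_id] = self.cached.pop(ctx_id)
        self.spills += 1

    def context_load(self, ctx_id):
        """Bring a context from memory into the cache, evicting if needed."""
        if len(self.cached) >= self.slots:
            victim = next(iter(self.cached))   # least recently used
            self.context_save(victim)
        self.cached[ctx_id] = self.memory.pop(ctx_id)
        self.fills += 1

    def context_switch(self, ctx_id):
        """Make ctx_id current; only a miss touches memory."""
        if ctx_id not in self.cached:
            self.context_load(ctx_id)
        self.cached.move_to_end(ctx_id)
        return self.cached[ctx_id]
```

Switching among contexts that fit in the cache causes no spills in this model, which is the behaviour that would sit between hardware multi-threading and a full software context switch.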
User Space Interrupt Processing
Extension of the fast context switching mechanism
Efficient delivery of interrupt events to user space device drivers
Without the routine microkernel intervention
An interrupt could be directly handled by a preconfigured hardware context in
user space
A clear path towards moving even the timer interrupt handler and the scheduler from
kernel space to user space
Going back to interrupt-driven handling of peripherals with extremely low latency
requirements (instead of polling)
The usual pain point: Level-triggered interrupts
Some coordination with the platform interrupt controller is probably needed
to automatically mask the interrupt source
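The level-triggered pain point can be sketched with a toy model. Everything here is hypothetical (`InterruptController`, `Cpu`, `nic_driver`): the model assumes the interrupt controller masks a line automatically at delivery and that the user space driver unmasks it once the device has been serviced. Without the automatic mask, a level-triggered line that stays asserted would be delivered again on every step.

```python
class InterruptController:
    """Toy platform interrupt controller with per-line masking."""

    def __init__(self):
        self.masked = set()
        self.asserted = set()   # level-triggered lines currently held high

    def raise_line(self, irq):
        self.asserted.add(irq)

    def deliverable(self):
        return sorted(self.asserted - self.masked)


class Cpu:
    """Toy CPU that routes interrupts to preconfigured user space contexts."""

    def __init__(self, pic):
        self.pic = pic
        self.routes = {}        # irq -> driver callback (a user space context)
        self.deliveries = []    # log of delivered interrupt numbers

    def bind(self, irq, driver):
        self.routes[irq] = driver   # configured once by the microkernel

    def step(self):
        for irq in self.pic.deliverable():
            self.pic.masked.add(irq)          # automatic mask on delivery
            self.deliveries.append(irq)
            self.routes[irq](self.pic, irq)   # runs without entering the kernel


def nic_driver(pic, irq):
    """Hypothetical driver: service the device, then unmask the line."""
    pic.asserted.discard(irq)   # the device deasserts once serviced
    pic.masked.discard(irq)
```

A driver that defers its work leaves the line masked, so the same interrupt is not delivered a second time until the driver has finished and unmasked it.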
Capabilities as First-Class Entities
Capabilities as unforgeable object identifiers
But eventually each access to an object needs to be bound-checked and
translated into the (flat) virtual address space
Suggestion: Embedding the capability reference in pointers
RV128 (128-bit variant of RISC-V) would provide 64 bits for the capability
reference and 64 bits for object offset
128-bit flat pointers are probably useless anyway
Besides the (somewhat narrow) use in the microkernel, this could be useful
for other purposes
Simplifying the implementation of managed languages’ VMs
Working with multiple virtual address spaces at once
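The pointer layout suggested above can be sketched as follows. The 64/64 split comes from the slide; the capability table format (`cap_table` mapping a capability reference to a base address and a length) and the function names are assumptions made only for this illustration.

```python
MASK64 = (1 << 64) - 1


def make_pointer(cap_ref, offset):
    """Pack a 64-bit capability reference and a 64-bit offset into 128 bits."""
    if not (0 <= cap_ref <= MASK64 and 0 <= offset <= MASK64):
        raise ValueError("field does not fit in 64 bits")
    return (cap_ref << 64) | offset


def split_pointer(ptr):
    """Return (capability reference, object offset) of a 128-bit pointer."""
    return (ptr >> 64) & MASK64, ptr & MASK64


def translate(ptr, cap_table, access_size=1):
    """Bounds-check the offset and translate to a flat virtual address.

    cap_table maps a capability reference to (base, length); this table
    layout is an assumption made for the sketch.
    """
    cap_ref, offset = split_pointer(ptr)
    base, length = cap_table[cap_ref]          # KeyError: invalid capability
    if offset + access_size > length:
        raise IndexError("access outside the object")
    return base + offset
```

With a 256-byte object registered under capability reference 0x2A at base 0x70000000, a pointer with offset 16 translates to 0x70000010, and an offset of 256 is rejected by the bounds check.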
Prior Art
Nordström S., Lindh L., Johansson L., Skoglund T.: Application Specific
Real-Time Microkernel in Hardware, 14th IEEE-NPSS Real Time
Conference, 2005
Offloading basic microkernel operations (e.g. thread creation, context
switching) to hardware shown to improve performance by 15 % on
average and up to 73 %
This was a coarse-grained approach
Hardware message passing in Intel SCC and Tilera TILE-G64/TILE-Pro64
Asynchronous message passing with tight software integration
Prior Art (2)
El Hajj I., Merritt A., Zellweger G., Milojicic D., Achermann R., Faraboschi
P., Hwu W., Roscoe T., Schwan K.: SpaceJMP: Programming with Multiple
Virtual Address Spaces, 21st ACM ASPLOS, 2016
Practical programming model for using multiple virtual address spaces on
commodity hardware (evaluated on DragonFly BSD and Barrelfish)
Useful for data-centric applications for sharing large amounts of memory between
processes
Intel IA-32 Task State Segment (TSS)
Hardware-based context switching
Historically, it has been used by Linux
The primary reason for removal was not performance, but portability
Prior Art (3)
Intel VT-x VM Functions (VMFUNC)
Efficient cross-VM function calls
Switching the EPT and passing register arguments
Current implementation limited to 512 entry points
Practically usable even for very fine-grained virtualization with the
granularity of individual functions
Liu Y., Zhou T., Chen K., Chen H., Xia Y.: Thwarting Memory Disclosure with
Efficient Hypervisor-enforced Intra-domain Isolation, 22nd ACM SIGSAC
Conference on Computer and Communications Security, 2015
– “The cost of a VMFUNC is similar with a syscall”
– “… hypervisor-level protection at the cost of system calls”
SkyBridge paper to appear at EuroSys 2019
Prior Art (4)
Woodruff J., Watson R. N. M., Chisnall D., Moore S., Anderson J., Davis B., Laurie
B., Neumann P. G., Norton R., Roe M.: The CHERI capability model: Revisiting RISC
in an age of risk, 41st ACM Annual International Symposium on Computer
Architecture, 2014
Hardware-based capability model for byte-granularity memory protection
Extension of the 64-bit MIPS ISA
Evaluated on an extended MIPS R4000 FPGA soft-core
32 capability registers (256 bits)
Limitation: Inflexible design mostly due to the tight backward compatibility with a 64-bit
ISA
Intel MPX
Several design and implementation issues, deemed not production-ready
Summary
Traditionally, hardware has not been designed to accommodate the
requirements of microkernel multiserver operating systems
Microkernels thus suffer performance penalties
This prevented them from replacing monolithic operating systems and closed
the vicious cycle
Hardware design is hopefully becoming more accessible and democratic
E.g. RISC-V
Co-designing the hardware and software might help us gain the benefits
of the microkernel multiserver design with no performance penalties
However, it requires some out-of-the-box thinking
Acknowledgements
OS Kernel Lab at Huawei Technologies
Javier Picorel
Haibo Chen
Huawei Dresden R&D Lab
Focusing on microkernel research, design and development
Basic research
Applied research
Prototype development
Collaboration with academia and other technology companies
Looking for senior operating system researchers, designers, developers and
experts
Previous microkernel experience is a big plus
“A startup within a large company”
Shaping the future product portfolio of Huawei
Including hardware/software co-design via HiSilicon
Q&A
Thank You!
Ad

Recommended

Microkernels in the Era of Data-Centric Computing
Microkernels in the Era of Data-Centric Computing
Martin Děcký
 
Formal Verification of Functional Code
Formal Verification of Functional Code
Martin Děcký
 
Lessons Learned from Porting HelenOS to RISC-V
Lessons Learned from Porting HelenOS to RISC-V
Martin Děcký
 
IPC in Microkernel Systems, Capabilities
IPC in Microkernel Systems, Capabilities
Martin Děcký
 
Unikernels, Multikernels, Virtual Machine-based Kernels
Unikernels, Multikernels, Virtual Machine-based Kernels
Martin Děcký
 
Hardware Implementation of Algorithm for Cryptanalysis
Hardware Implementation of Algorithm for Cryptanalysis
ijcisjournal
 
LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis
LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis
Pietro De Nicolao
 
Optimization of latency of temporal key Integrity protocol (tkip) using graph...
Optimization of latency of temporal key Integrity protocol (tkip) using graph...
ijcseit
 
ICCT2017: A user mode implementation of filtering rule management plane using...
ICCT2017: A user mode implementation of filtering rule management plane using...
Ruo Ando
 
Data-Centric Parallel Programming
Data-Centric Parallel Programming
inside-BigData.com
 
40520130101005
40520130101005
IAEME Publication
 
Fpga based encryption design using vhdl
Fpga based encryption design using vhdl
eSAT Publishing House
 
An Efficient PDP Scheme for Distributed Cloud Storage
An Efficient PDP Scheme for Distributed Cloud Storage
IJMER
 
Multicore Intel Processors Performance Evaluation
Multicore Intel Processors Performance Evaluation
المهندسة عائشة بني صخر
 
Towards Edge Computing as a Service: Dynamic Formation of the Micro Data-Centers
Towards Edge Computing as a Service: Dynamic Formation of the Micro Data-Centers
Faculty of Technical Sciences, University of Novi Sad
 
Iaetsd implementation of secure audit process
Iaetsd implementation of secure audit process
Iaetsd Iaetsd
 
Shilpa ppt
Shilpa ppt
shilpa kanhurkar
 
DIVISION AND REPLICATION OF DATA IN GRID FOR OPTIMAL PERFORMANCE AND SECURITY
DIVISION AND REPLICATION OF DATA IN GRID FOR OPTIMAL PERFORMANCE AND SECURITY
ijgca
 
Lec08 optimizations
Lec08 optimizations
Taras Zakharchenko
 
The effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theory
Vinícius Uchôa
 
A COMPARISON BETWEEN PARALLEL AND SEGMENTATION METHODS USED FOR IMAGE ENCRYPT...
A COMPARISON BETWEEN PARALLEL AND SEGMENTATION METHODS USED FOR IMAGE ENCRYPT...
ijcsit
 
DEF CON 27 - BRENT STONE - reverse enginerring 17 cars
DEF CON 27 - BRENT STONE - reverse enginerring 17 cars
Felipe Prado
 
NEW ALGORITHM FOR WIRELESS NETWORK COMMUNICATION SECURITY
NEW ALGORITHM FOR WIRELESS NETWORK COMMUNICATION SECURITY
ijcisjournal
 
Lec07 threading hw
Lec07 threading hw
Taras Zakharchenko
 
PERFORMANCE EVALUATION OF PARALLEL INTERNATIONAL DATA ENCRYPTION ALGORITHM ON...
PERFORMANCE EVALUATION OF PARALLEL INTERNATIONAL DATA ENCRYPTION ALGORITHM ON...
IJNSA Journal
 
Final Year Project Synopsis: Post Quantum Encryption using Neural Networks
Final Year Project Synopsis: Post Quantum Encryption using Neural Networks
JPC Hanson
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
Michael Gschwind
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture Research
dbpublications
 
Michael Gschwind et al, "An Open Source Environment for Cell Broadband Engine...
Michael Gschwind et al, "An Open Source Environment for Cell Broadband Engine...
Michael Gschwind
 
CS465Lec1.ppt computer architecture in the fall term
CS465Lec1.ppt computer architecture in the fall term
ppavani10
 

More Related Content

What's hot (18)

ICCT2017: A user mode implementation of filtering rule management plane using...
ICCT2017: A user mode implementation of filtering rule management plane using...
Ruo Ando
 
Data-Centric Parallel Programming
Data-Centric Parallel Programming
inside-BigData.com
 
40520130101005
40520130101005
IAEME Publication
 
Fpga based encryption design using vhdl
Fpga based encryption design using vhdl
eSAT Publishing House
 
An Efficient PDP Scheme for Distributed Cloud Storage
An Efficient PDP Scheme for Distributed Cloud Storage
IJMER
 
Multicore Intel Processors Performance Evaluation
Multicore Intel Processors Performance Evaluation
المهندسة عائشة بني صخر
 
Towards Edge Computing as a Service: Dynamic Formation of the Micro Data-Centers
Towards Edge Computing as a Service: Dynamic Formation of the Micro Data-Centers
Faculty of Technical Sciences, University of Novi Sad
 
Iaetsd implementation of secure audit process
Iaetsd implementation of secure audit process
Iaetsd Iaetsd
 
Shilpa ppt
Shilpa ppt
shilpa kanhurkar
 
DIVISION AND REPLICATION OF DATA IN GRID FOR OPTIMAL PERFORMANCE AND SECURITY
DIVISION AND REPLICATION OF DATA IN GRID FOR OPTIMAL PERFORMANCE AND SECURITY
ijgca
 
Lec08 optimizations
Lec08 optimizations
Taras Zakharchenko
 
The effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theory
Vinícius Uchôa
 
A COMPARISON BETWEEN PARALLEL AND SEGMENTATION METHODS USED FOR IMAGE ENCRYPT...
A COMPARISON BETWEEN PARALLEL AND SEGMENTATION METHODS USED FOR IMAGE ENCRYPT...
ijcsit
 
DEF CON 27 - BRENT STONE - reverse enginerring 17 cars
DEF CON 27 - BRENT STONE - reverse enginerring 17 cars
Felipe Prado
 
NEW ALGORITHM FOR WIRELESS NETWORK COMMUNICATION SECURITY
NEW ALGORITHM FOR WIRELESS NETWORK COMMUNICATION SECURITY
ijcisjournal
 
Lec07 threading hw
Lec07 threading hw
Taras Zakharchenko
 
PERFORMANCE EVALUATION OF PARALLEL INTERNATIONAL DATA ENCRYPTION ALGORITHM ON...
PERFORMANCE EVALUATION OF PARALLEL INTERNATIONAL DATA ENCRYPTION ALGORITHM ON...
IJNSA Journal
 
Final Year Project Synopsis: Post Quantum Encryption using Neural Networks
Final Year Project Synopsis: Post Quantum Encryption using Neural Networks
JPC Hanson
 
ICCT2017: A user mode implementation of filtering rule management plane using...
ICCT2017: A user mode implementation of filtering rule management plane using...
Ruo Ando
 
Data-Centric Parallel Programming
Data-Centric Parallel Programming
inside-BigData.com
 
Fpga based encryption design using vhdl
Fpga based encryption design using vhdl
eSAT Publishing House
 
An Efficient PDP Scheme for Distributed Cloud Storage
An Efficient PDP Scheme for Distributed Cloud Storage
IJMER
 
Iaetsd implementation of secure audit process
Iaetsd implementation of secure audit process
Iaetsd Iaetsd
 
DIVISION AND REPLICATION OF DATA IN GRID FOR OPTIMAL PERFORMANCE AND SECURITY
DIVISION AND REPLICATION OF DATA IN GRID FOR OPTIMAL PERFORMANCE AND SECURITY
ijgca
 
The effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theory
Vinícius Uchôa
 
A COMPARISON BETWEEN PARALLEL AND SEGMENTATION METHODS USED FOR IMAGE ENCRYPT...
A COMPARISON BETWEEN PARALLEL AND SEGMENTATION METHODS USED FOR IMAGE ENCRYPT...
ijcsit
 
DEF CON 27 - BRENT STONE - reverse enginerring 17 cars
DEF CON 27 - BRENT STONE - reverse enginerring 17 cars
Felipe Prado
 
NEW ALGORITHM FOR WIRELESS NETWORK COMMUNICATION SECURITY
NEW ALGORITHM FOR WIRELESS NETWORK COMMUNICATION SECURITY
ijcisjournal
 
PERFORMANCE EVALUATION OF PARALLEL INTERNATIONAL DATA ENCRYPTION ALGORITHM ON...
PERFORMANCE EVALUATION OF PARALLEL INTERNATIONAL DATA ENCRYPTION ALGORITHM ON...
IJNSA Journal
 
Final Year Project Synopsis: Post Quantum Encryption using Neural Networks
Final Year Project Synopsis: Post Quantum Encryption using Neural Networks
JPC Hanson
 

Similar to Hardware/Software Co-Design for Efficient Microkernel Execution (20)

Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
Michael Gschwind
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture Research
dbpublications
 
Michael Gschwind et al, "An Open Source Environment for Cell Broadband Engine...
Michael Gschwind et al, "An Open Source Environment for Cell Broadband Engine...
Michael Gschwind
 
CS465Lec1.ppt computer architecture in the fall term
CS465Lec1.ppt computer architecture in the fall term
ppavani10
 
Evolution of the Windows Kernel Architecture, by Dave Probert
Evolution of the Windows Kernel Architecture, by Dave Probert
yang
 
Oct2009
Oct2009
guest81ab2b4
 
Course: "Introductory course to HLS FPGA programming"
Course: "Introductory course to HLS FPGA programming"
Mirko Mariotti
 
Lecture_IIITD.pptx
Lecture_IIITD.pptx
achakracu
 
Co question 2008
Co question 2008
SANTOSH RATH
 
Japan's post K Computer
Japan's post K Computer
inside-BigData.com
 
Performance of State-of-the-Art Cryptography on ARM-based Microprocessors
Performance of State-of-the-Art Cryptography on ARM-based Microprocessors
Hannes Tschofenig
 
Aw4201337340
Aw4201337340
IJERA Editor
 
Architecting Solutions for the Manycore Future
Architecting Solutions for the Manycore Future
Talbott Crowell
 
Design of Tele command SOC-IP by AES Cryptographic Method Using VHDL
Design of Tele command SOC-IP by AES Cryptographic Method Using VHDL
dbpublications
 
dvance computer architecture computer architecture: a quantitative approach c...
dvance computer architecture computer architecture: a quantitative approach c...
mahdieh79
 
CSC 457 - Advanced Microprocessor Architecture Lecture Notes - 31.08.2021.ppt
CSC 457 - Advanced Microprocessor Architecture Lecture Notes - 31.08.2021.ppt
EricSifuna1
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
Gokhan Boranalp
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
anil0878
 
Systems Support for Many Task Computing
Systems Support for Many Task Computing
Eric Van Hensbergen
 
onur-comparch-fall2018-lecture3a-whycomparch-afterlecture.pptx
onur-comparch-fall2018-lecture3a-whycomparch-afterlecture.pptx
sivasubramanianManic2
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
Michael Gschwind
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture Research
dbpublications
 
Michael Gschwind et al, "An Open Source Environment for Cell Broadband Engine...
Michael Gschwind et al, "An Open Source Environment for Cell Broadband Engine...
Michael Gschwind
 
CS465Lec1.ppt computer architecture in the fall term
CS465Lec1.ppt computer architecture in the fall term
ppavani10
 
Evolution of the Windows Kernel Architecture, by Dave Probert
Evolution of the Windows Kernel Architecture, by Dave Probert
yang
 
Course: "Introductory course to HLS FPGA programming"
Course: "Introductory course to HLS FPGA programming"
Mirko Mariotti
 
Lecture_IIITD.pptx
Lecture_IIITD.pptx
achakracu
 
Performance of State-of-the-Art Cryptography on ARM-based Microprocessors
Performance of State-of-the-Art Cryptography on ARM-based Microprocessors
Hannes Tschofenig
 
Architecting Solutions for the Manycore Future
Architecting Solutions for the Manycore Future
Talbott Crowell
 
Design of Tele command SOC-IP by AES Cryptographic Method Using VHDL
Design of Tele command SOC-IP by AES Cryptographic Method Using VHDL
dbpublications
 
dvance computer architecture computer architecture: a quantitative approach c...
dvance computer architecture computer architecture: a quantitative approach c...
mahdieh79
 
CSC 457 - Advanced Microprocessor Architecture Lecture Notes - 31.08.2021.ppt
CSC 457 - Advanced Microprocessor Architecture Lecture Notes - 31.08.2021.ppt
EricSifuna1
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
Gokhan Boranalp
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
anil0878
 
Systems Support for Many Task Computing
Systems Support for Many Task Computing
Eric Van Hensbergen
 
onur-comparch-fall2018-lecture3a-whycomparch-afterlecture.pptx
onur-comparch-fall2018-lecture3a-whycomparch-afterlecture.pptx
sivasubramanianManic2
 
Ad

More from Martin Děcký (9)

2024 in Microkernels (a year in review lightning talk)
2024 in Microkernels (a year in review lightning talk)
Martin Děcký
 
HelenOS: 20 Years of History, 20 Years of Future Vision
HelenOS: 20 Years of History, 20 Years of Future Vision
Martin Děcký
 
Code Instrumentation, Dynamic Tracing
Code Instrumentation, Dynamic Tracing
Martin Děcký
 
Nízkoúrovňové programování
Nízkoúrovňové programování
Martin Děcký
 
Porting HelenOS to RISC-V
Porting HelenOS to RISC-V
Martin Děcký
 
What Could Microkernels Learn from Monolithic Kernels (and Vice Versa)
What Could Microkernels Learn from Monolithic Kernels (and Vice Versa)
Martin Děcký
 
FOSDEM 2014: Read-Copy-Update for HelenOS
FOSDEM 2014: Read-Copy-Update for HelenOS
Martin Děcký
 
FOSDEM 2013: Operating Systems Hot Topics
FOSDEM 2013: Operating Systems Hot Topics
Martin Děcký
 
HelenOS: State of the Union 2012
HelenOS: State of the Union 2012
Martin Děcký
 
2024 in Microkernels (a year in review lightning talk)
2024 in Microkernels (a year in review lightning talk)
Martin Děcký
 
HelenOS: 20 Years of History, 20 Years of Future Vision
HelenOS: 20 Years of History, 20 Years of Future Vision
Martin Děcký
 
Code Instrumentation, Dynamic Tracing
Code Instrumentation, Dynamic Tracing
Martin Děcký
 
Nízkoúrovňové programování
Nízkoúrovňové programování
Martin Děcký
 
Porting HelenOS to RISC-V
Porting HelenOS to RISC-V
Martin Děcký
 
What Could Microkernels Learn from Monolithic Kernels (and Vice Versa)
What Could Microkernels Learn from Monolithic Kernels (and Vice Versa)
Martin Děcký
 
FOSDEM 2014: Read-Copy-Update for HelenOS
FOSDEM 2014: Read-Copy-Update for HelenOS
Martin Děcký
 
FOSDEM 2013: Operating Systems Hot Topics
FOSDEM 2013: Operating Systems Hot Topics
Martin Děcký
 
HelenOS: State of the Union 2012
HelenOS: State of the Union 2012
Martin Děcký
 
Ad

Recently uploaded (20)

" How to survive with 1 billion vectors and not sell a kidney: our low-cost c...
" How to survive with 1 billion vectors and not sell a kidney: our low-cost c...
Fwdays
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
9-1-1 Addressing: End-to-End Automation Using FME
9-1-1 Addressing: End-to-End Automation Using FME
Safe Software
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
Python Conference Singapore - 19 Jun 2025
Python Conference Singapore - 19 Jun 2025
ninefyi
 
Hyderabad MuleSoft In-Person Meetup (June 21, 2025) Slides
Hyderabad MuleSoft In-Person Meetup (June 21, 2025) Slides
Ravi Tamada
 
AI vs Human Writing: Can You Tell the Difference?
AI vs Human Writing: Can You Tell the Difference?
Shashi Sathyanarayana, Ph.D
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
Hardware/Software Co-Design for Efficient Microkernel Execution

  • 1. Hardware/Software Co-Design for Efficient Microkernel Execution. Martin Děcký, martin.decky@huawei.com, February 2019
  • 2. Martin Děcký, FOSDEM, February 3rd 2019, Hardware/Software Co-Design for Efficient Microkernel Execution. Who Am I: Passionate programmer and operating systems enthusiast, with a specific inclination towards multiserver microkernels. HelenOS developer since 2004. Research Scientist from 2006 to 2018: Charles University (Prague), Distributed Systems Research Group. Senior Research Engineer since 2017: Huawei Technologies (Munich), German Research Center, Central Software Institute, OS Kernel Lab.
  • 3. Microkernel Multiserver Systems are better than Monolithic Systems
  • 4. Monolithic OS Design is Flawed. Biggs S., Lee D., Heiser G.: The Jury Is In: Monolithic OS Design Is Flawed: Microkernel-based Designs Improve Security, ACM 9th Asia-Pacific Workshop on Systems (APSys), 2018. “While intuitive, the benefits of the small TCB have not been quantified to date. We address this by a study of critical Linux CVEs, where we examine whether they would be prevented or mitigated by a microkernel-based design. We find that almost all exploits are at least mitigated to less than critical severity, and 40 % completely eliminated by an OS design based on a verified microkernel, such as seL4.”
  • 5. Problem Statement
  • 6. Problem Statement. Microkernel design ideas go as far back as 1969: the RC 4000 Multiprogramming System nucleus (Per Brinch Hansen), with isolation of unprivileged processes, inter-process communication and hierarchical control. Even after 50 years, they are not fully accepted as mainstream. Hardware and software used to be designed independently: designing CPUs used to be an extremely complicated and costly process; operating systems used to be written after the CPUs were designed; hardware designs used to be rather conservative.
  • 7. Problem Statement (2). Mainstream ISAs used to be designed in a rather conservative way. Can you name some really revolutionary ISA features since IBM System/370 Advanced Function? Requirements on the new ISAs usually follow the needs of the mainstream operating systems running on the past ISAs. No wonder microkernels suffer performance penalties compared to monolithic systems: the more fine-grained the architecture, the more penalties it suffers. Let us design the hardware with microkernels in mind!
  • 8. The Vicious Cycle. CPUs do not support microkernels properly.
  • 9. The Vicious Cycle. CPUs do not support microkernels properly; microkernels suffer performance penalties.
  • 10. The Vicious Cycle. CPUs do not support microkernels properly; microkernels suffer performance penalties; microkernels are not in the mainstream.
  • 11. The Vicious Cycle. CPUs do not support microkernels properly; microkernels suffer performance penalties; microkernels are not in the mainstream; no requirements on CPUs from microkernels.
  • 12. The Vicious Cycle. CPUs do not support microkernels properly; microkernels suffer performance penalties; microkernels are not in the mainstream; no requirements on CPUs from microkernels.
  • 13. Any Ideas?
  • 14. Communication between Address Spaces. Control and data flow between subsystems. Monolithic kernel: function calls; passing arguments in registers and on the stack; passing direct pointers to memory structures. Multiserver microkernel: IPC via microkernel syscalls; passing arguments in a subset of registers; privilege level switch, address space switch; scheduling (in case of asynchronous IPC); data copying or memory sharing with page granularity.
  • 15. Communication between Address Spaces (2). Is the kernel round-trip of the IPC necessary? Suggestion for synchronous IPC: extended Jump/Call and Return instructions that also switch the address space; communicating parties identified by a “call gate” (capability) containing the target address space and the PC of the IPC handler (implicit for return); call gates stored in a TLB-like hardware cache (CLB); CLB populated by the microkernel similarly to a TLB-only memory management architecture. Suggestion for asynchronous IPC: using CPU cache lines as the buffers for the messages; Async Jump/Call, Async Return and Async Receive instructions; using the CPU cache like an extended register stack engine.
  • 16. Communication between Address Spaces (3). Bulk data. Observation: memory sharing is actually quite efficient for large amounts of data (multiple pages); overhead is caused primarily by creating and tearing down the shared pages; data needs to be page-aligned; sub-page granularity and dynamic data structures. Suggestion: using CPU cache lines as shared buffers; much finer granularity than pages (typically 64 to 128 bytes); a separate virtual-to-cache mapping mechanism before the standard virtual-to-physical mapping.
  • 17. Fast Context Switching. Current microsecond-scale latency hiding mechanisms. Hardware multi-threading: effective; does not scale beyond a few threads. Operating system context switching: scales for any thread count; too slow (order of 10 µs). Goal: finding a sweet spot between the two mechanisms.
  • 18. Fast Context Switching (2). Suggestion: hardware cache for contexts. Again, a similar mechanism to TLB-only memory management; dedicated instructions for context store, context restore, context switch, context save, context load; context data could potentially be ABI-optimized; autonomous mechanism for event-triggered context switch (e.g. external interrupt). Efficient hardware mechanism for latency hiding: the equivalent of fine/coarse-grained simultaneous multithreading; the software scheduler is in charge of setting the scheduler policy; the CPU is in charge of scheduling the contexts based on ALU, cache and other resource availability.
  • 19. User Space Interrupt Processing. Extension of the fast context switching mechanism. Efficient delivery of interrupt events to user space device drivers without the routine microkernel intervention: an interrupt could be directly handled by a preconfigured hardware context in user space; a clear path towards moving even the timer interrupt handler and the scheduler from kernel space to user space; going back to interrupt-driven handling of peripherals with extremely low latency requirements (instead of polling). The usual pain point: level-triggered interrupts; some coordination with the platform interrupt controller is probably needed to automatically mask the interrupt source.
  • 20. Capabilities as First-Class Entities. Capabilities as unforgeable object identifiers, but eventually each access to an object needs to be bounds-checked and translated into the (flat) virtual address space. Suggestion: embedding the capability reference in pointers; RV128 (128-bit variant of RISC-V) would provide 64 bits for the capability reference and 64 bits for the object offset; 128-bit flat pointers are probably useless anyway. Besides the (somewhat narrow) use in the microkernel, this could be useful for other purposes: simplifying the implementation of managed languages’ VMs; working with multiple virtual address spaces at once.
  • 21. Prior Art. Nordström S., Lindh L., Johansson L., Skoglund T.: Application Specific Real-Time Microkernel in Hardware, 14th IEEE-NPSS Real Time Conference, 2005: offloading basic microkernel operations (e.g. thread creation, context switching) to hardware shown to improve performance by 15 % on average and up to 73 %; this was a coarse-grained approach. Hardware message passing in Intel SCC and Tilera TILE-G64/TILE-Pro64: asynchronous message passing with tight software integration.
  • 22. Prior Art (2). Hajj I. E., Merritt A., Zellweger G., Milojicic D., Achermann R., Faraboschi P., Hwu W., Roscoe T., Schwan K.: SpaceJMP: Programming with Multiple Virtual Address Spaces, 21st ACM ASPLOS, 2016: practical programming model for using multiple virtual address spaces on commodity hardware (evaluated on DragonFly BSD and Barrelfish); useful for data-centric applications for sharing large amounts of memory between processes. Intel IA-32 Task State Segment (TSS): hardware-based context switching; historically, it has been used by Linux; the primary reason for removal was not performance, but portability.
  • 23. Prior Art (3). Intel VT-x VM Functions (VMFUNC): efficient cross-VM function calls; switching the EPT and passing register arguments; current implementation limited to 512 entry points; practically usable even for very fine-grained virtualization with the granularity of individual functions. Liu Y., Zhou T., Chen K., Chen H., Xia Y.: Thwarting Memory Disclosure with Efficient Hypervisor-enforced Intra-domain Isolation, 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015: “The cost of a VMFUNC is similar with a syscall”; “… hypervisor-level protection at the cost of system calls”. SkyBridge paper to appear at EuroSys 2019.
  • 24. Prior Art (4). Woodruff J., Watson R. N. M., Chisnall D., Moore S., Anderson J., Davis B., Laurie B., Neumann P. G., Norton R., Roe M.: The CHERI capability model: Revisiting RISC in an age of risk, 41st ACM Annual International Symposium on Computer Architecture, 2014: hardware-based capability model for byte-granularity memory protection; extension of the 64-bit MIPS ISA; evaluated on an extended MIPS R4000 FPGA soft-core; 32 capability registers (256 bits); limitation: inflexible design mostly due to the tight backward compatibility with a 64-bit ISA. Intel MPX: several design and implementation issues, deemed not production-ready.
  • 25. Summary. Traditionally, hardware has not been designed to accommodate the requirements of microkernel multiserver operating systems. Microkernels thus suffer performance penalties; this prevented them from replacing monolithic operating systems and closed the vicious cycle. Hardware design is hopefully becoming more accessible and democratic (e.g. RISC-V). Co-designing the hardware and software might help us gain the benefits of the microkernel multiserver design with no performance penalties. However, it requires some out-of-the-box thinking.
  • 26. Acknowledgements. OS Kernel Lab at Huawei Technologies; Javier Picorel; Haibo Chen.
  • 27. Huawei Dresden R&D Lab. Focusing on microkernel research, design and development: basic research; applied research; prototype development; collaboration with academia and other technology companies. Looking for senior operating system researchers, designers, developers and experts; previous microkernel experience is a big plus. “A startup within a large company”. Shaping the future product portfolio of Huawei, including hardware/software co-design via HiSilicon.
  • 28. Q&A
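
Slide 20 suggests splitting a 128-bit RV128 pointer into a 64-bit capability reference and a 64-bit object offset, with each access bounds-checked and translated into the flat virtual address space. The following is a minimal software sketch of that idea, not the speaker's design or any RISC-V specification. The function names, the placement of the capability reference in the upper 64 bits, and the `cap_table` layout (capability reference mapped to a base address and object length) are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Illustrative model only: names, bit placement and table layout are
# assumptions, not taken from the talk or from a RISC-V specification.
OFFSET_BITS = 64
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack_pointer(cap_ref: int, offset: int) -> int:
    """Build a 128-bit pointer: capability reference in the upper 64 bits,
    object offset in the lower 64 bits (assumed layout)."""
    if not 0 <= cap_ref <= OFFSET_MASK:
        raise ValueError("capability reference must fit in 64 bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 64 bits")
    return (cap_ref << OFFSET_BITS) | offset


def unpack_pointer(ptr: int) -> Tuple[int, int]:
    """Split a 128-bit pointer back into (capability reference, offset)."""
    return ptr >> OFFSET_BITS, ptr & OFFSET_MASK


def translate(ptr: int, cap_table: Dict[int, Tuple[int, int]]) -> int:
    """Bounds-check the offset against the object length and return the
    flat virtual address. cap_table maps cap_ref -> (base, length)."""
    cap_ref, offset = unpack_pointer(ptr)
    if cap_ref not in cap_table:
        raise PermissionError("unknown capability reference")
    base, length = cap_table[cap_ref]
    if offset >= length:
        raise IndexError("offset outside the object bounds")
    return base + offset


# Hypothetical object: capability 0x2A covers 0x100 bytes at base 0x1000.
table = {0x2A: (0x1000, 0x100)}
address = translate(pack_pointer(0x2A, 0x10), table)
```

In hardware, the lookup in `translate` would presumably be served by a cache of capability entries rather than a software dictionary; the sketch only shows where the bounds check and the translation step sit relative to the pointer layout.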