Brought to you by
OSv Unikernel
Waldek Kozaczuk
OSv Committer
Optimizing Guest OS to Run Stateless and Serverless Apps in the Cloud
What is OSv?
An open-source, versatile, modular unikernel designed to run a single unmodified
Linux application securely as a microVM on top of a hypervisor, in contrast to
traditional operating systems, which were designed for a vast range of physical
machines. Or simply:
■ An OS designed to run a single application, without isolation between the
application and the kernel
■ A HIP (Highly Isolated Process) with no ability to make system calls to the
host OS
■ Supports both x86_64 and aarch64 platforms
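To make this concrete, a typical quick-start looks roughly like the sketch below, based on the scripts shipped in the OSv repository (the native-example module name and exact flag spellings are assumptions; check the repository README for your version):

# Fetch OSv and build an image bundling a sample native app
git clone https://github.com/cloudius-systems/osv.git
cd osv && git submodule update --init --recursive
./scripts/build image=native-example
# Boot the resulting image as a microVM under QEMU/KVM
./scripts/run.py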
Components of OSv
Why Stateless and Serverless Workloads?
These workloads can take advantage of OSv strengths:
■ Fast to boot and restart
■ Low memory utilization
■ Optimized networking stack
They do not need a performant, feature-rich filesystem, just enough to read
code and configuration
■ What about logs?
What and Why to Optimize
■ Short boot time
■ Low memory utilization
● Current minimum is 15 MB but can be optimized down to 10 MB
■ Small kernel size
● Directly leads to higher density of guests on the host
■ Optimized networking stack
● Improves throughput in terms of requests per second
● Improves latency
Optimize Boot Time
OSv, with a Read-Only FS and networking off, can boot as fast as ~5 ms on Firecracker and
even faster, around ~3 ms, on QEMU with the microvm machine. In general, however, the
boot time depends on many factors: the hypervisor (including settings of individual
para-virtual devices), the filesystem (ZFS, ROFS, RAMFS, or Virtio-FS), and some boot
parameters.
For example, these days a ZFS image boots in ~40 ms on Firecracker and ~200 ms on
regular QEMU. Newer versions of QEMU (>=4.0) are also typically faster to boot.
Booting on QEMU in PVH/HVM mode (aka direct kernel boot) should always be faster, as
OSv is directly invoked in 64-bit long mode.
For more details see https://github.com/cloudius-systems/osv#boot-time
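As a rough sketch of how such numbers can be reproduced (assuming the fs=rofs build option and the --bootchart OSv kernel argument behave as described in the repository README; the /hello command path for the native-example image is an assumption):

# Build a Read-Only FS image to avoid filesystem import cost at boot
./scripts/build image=native-example fs=rofs
# Boot under QEMU; --bootchart asks OSv to print a boot-time breakdown
./scripts/run.py -e '--bootchart /hello'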
Optimize Kernel ELF Size: Why?
■ Smaller kernel ELF leads to less memory utilization
■ Fewer symbols, ideally only those needed by a specific app, improve security
Current kernel size is around 6.7 MB and includes subsets of the following libraries:
libdl.so.2, ld-linux-x86-64.so.2
libresolv.so.2, libcrypt.so.1, libaio.so.1
libc.so.6, libm.so.6
libpthread.so.0
librt.so.1, libxenstore.so.3.0
libstdc++.so.6
The experiments described in the following slides help reduce the kernel size to 2.6 MB.
Optimize Kernel ELF Size: Hide STD C++
diff --git a/Makefile b/Makefile
+ --version-script=./version_script_with_public_ABI_symbols_only \
  --whole-archive \
- $(libstdc++.a) $(libgcc_eh.a) \
+ $(libgcc_eh.a) \
  $(boost-libs) \
- --no-whole-archive $(libgcc.a), \
+ --no-whole-archive $(libstdc++.a) $(libgcc.a), \
  LINK kernel.elf)
Hiding the standard C++ library helps reduce the kernel to 5.0 MB.
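The same mechanism can be demonstrated outside the OSv build with a toy shared library; the minimal sketch below hides every symbol except one:

# vs.map: export public_fn, hide everything else
cat > vs.map <<'EOF'
{
  global: public_fn;
  local: *;
};
EOF
cat > lib.c <<'EOF'
int helper(void)    { return 41; }            /* hidden by the version script */
int public_fn(void) { return helper() + 1; }  /* remains exported */
EOF
gcc -shared -fPIC -Wl,--version-script=vs.map lib.c -o libdemo.so
nm -D libdemo.so    # only public_fn appears in the dynamic symbol table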
Optimize Kernel ELF Size: Collect Garbage
Enabling garbage collection reduces the kernel size further to 4.3 MB.
diff --git a/Makefile b/Makefile
  EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) \
    -DOSV_KERNEL_VM_BASE=$(kernel_vm_base) \
-   -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift)
+   -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) -ffunction-sections -fdata-sections
- --no-whole-archive $(libstdc++.a) $(libgcc.a), \
+ --no-whole-archive $(libstdc++.a) $(libgcc.a) --gc-sections, \
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
  .start32_address : AT(ADDR(.start32_address) - OSV_KERNEL_VM_SHIFT) {
-     *(.start32_address)
+     KEEP(*(.start32_address))
  }
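Outside the OSv tree, the effect of these flags can be seen with a minimal standalone sketch: -ffunction-sections puts every function in its own section, and --gc-sections lets the linker drop the unreferenced ones (KEEP() in a linker script protects sections, such as entry points, from this collection):

cat > gc_demo.c <<'EOF'
int used(void)   { return 1; }
int unused(void) { return 2; }   /* never referenced: a GC candidate */
int main(void)   { return used(); }
EOF
gcc -ffunction-sections -fdata-sections -c gc_demo.c
# --print-gc-sections reports what was discarded (e.g. .text.unused)
gcc -Wl,--gc-sections -Wl,--print-gc-sections gc_demo.o -o gc_demo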
Optimize Kernel ELF Size: Disable ZFS
diff --git a/Makefile b/Makefile
+ifdef zfs-enabled
solaris += bsd/sys/cddl/contrib/opensolaris/uts/common/zmod/zmod_subr.o
solaris += bsd/sys/cddl/contrib/opensolaris/uts/common/zmod/zutil.o
solaris += $(zfs)
+endif
+ifdef zfs-enabled
drivers += drivers/zfs.o
+endif
We do not need ZFS for stateless and serverless workloads.
Disabling ZFS reduces the kernel size to 3.6 MB.
Optimize Kernel ELF Size: Select Platform/Drivers
diff --git a/Makefile b/Makefile
+ifdef xen-enabled
bsd += bsd/sys/xen/xenstore/xenstore.o
bsd += bsd/sys/xen/xenbus/xenbus.o
+endif
+ifdef virtio-enabled
drivers += drivers/virtio-vring.o
drivers += drivers/virtio-blk.o
+endif
For example, disabling all drivers and other platform code except what is needed
to run on Firecracker or QEMU microvm reduces the kernel size to 3.1 MB.
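In later OSv trees these experiments turned into build-time knobs; the invocation below is only a sketch, and the drivers_profile and conf_hide_symbols option names are assumptions to verify against the current build documentation:

# Build a virtio-only, ZFS-free kernel with non-public symbols hidden
# (option names assumed; check the OSv repository docs)
./scripts/build image=native-example fs=rofs \
    drivers_profile=microvm conf_hide_symbols=1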
Optimize Kernel ELF Size: App Specific Symbols
{
global:
__cxa_finalize;
__libc_start_main;
puts;
local:
*;
};
Eliminating all symbols and related code except what a specific app needs
further reduces the kernel size to 2.6 MB, which is enough to run a native or
Java “hello world” app.
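One way to derive such a list, sketched below, is to extract the undefined dynamic symbols the application expects its loader/libc to provide and wrap them in a version-script skeleton (the hello binary name is a placeholder; manual curation is still needed):

# Collect the app's undefined dynamic symbols into a version script
nm -D --undefined-only hello | awk '{print "    " $NF ";"}' | sort -u > syms.txt
{ echo '{'; echo 'global:'; cat syms.txt; echo 'local:'; echo '    *;'; echo '};'; } > app_version_script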
Optimize Memory Usage
Apart from shrinking the kernel ELF to minimize memory used, the following
optimizations can be implemented:
■ Lazy stack for application threads (WIP patch available, issue #144)
● The stack needs to be pre-faulted before calling kernel code that cannot be preempted.
■ Refine the L1/L2 memory pool logic to dynamically shrink/expand the low
watermark depending on the physical memory size
● Currently we pre-allocate 512K for each vCPU, regardless of whether the app needs it.
Optimize Number of Runs on c5n.metal
■ Disk-only boot on Firecracker
■ Almost 1,900 boots per second (total of 629,625 runs)
● 25 boots per second on single host CPU
■ Boot time percentiles:
● P50 = 8.98 ms
● P75 = 12.07 ms
● P90 = 17.15 ms
● P99 = 31.49 ms
■ The cost of the hypervisor affects boots/sec
Optimize Density on c5n.metal: Boot Time
Optimize Density on c5n.metal: Boots/second
Optimize Density on c5n.metal: CPU utilization
Optimize HTTP Requests/Sec
Each test described in the following slides involves a separate test “worker” machine connected to a
test “client” machine over a 1 Gbit network.
■ Setup:
● Test guest VM:
■ Linux guest - Fedora 33 with the firewall turned off
■ OSv guest - 0.56
■ QEMU 5.0 with vhost networking, bridged to expose the guest interface on the local
Ethernet; the same setup for the OSv and Linux guests
● Test worker machine - 8-way MacBook Pro i7 2.3GHz running Ubuntu 20.10
● Linux test “client” machine - 8-way MacBook Pro i7 2.7GHz running Fedora 33
■ Each test is executed against the guest VM with 1, 2, and 4 vCPUs where it makes sense
■ As a baseline, each test app is executed and measured on the host, with taskset limiting the CPU count
■ The load is generated by wrk, with enough load that the host CPUs pinned to the OSv or Linux VM
spike close to 100% utilization
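For reference, a wrk invocation matching the parameters used in the following slides looks like this (guest IP and port are placeholders):

# 8 threads, 100 connections, 5 seconds, with latency percentiles reported
wrk -t8 -c100 -d5s --latency http://192.168.1.100:8000/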
Linux Guest vs OSv: Nginx 1.20.1
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; response payload of 774 bytes
Host pinned to single host CPU
■ 51,076.43 requests/sec
■ 49.78 MB/sec
■ P99 latency: 2.28 ms
Linux guest with 1 vCPU
■ 25,736.96 requests/sec
■ 24.80 MB/sec
■ P99 latency: 14.04 ms
OSv with 1 vCPU (same as Linux)
■ 38,333.70 requests/sec (~1.49 of Linux guest)
■ 36.78 MB/sec
■ P99 latency: 1.75 ms (~0.12 of Linux guest)
Linux Guest vs OSv: Node.JS 14.17
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; payload of 33 bytes
Host pinned to single host CPU
■ 23,260.02 requests/sec
■ 3.95 MB/sec
■ P99 latency: 8.05 ms
Linux guest with 1 vCPU
■ 12,351.55 requests/sec
■ 2.37 MB/sec
■ P99 latency: 14.78 ms
OSv with 1 vCPU
■ 17,996.67 requests/sec (~1.46 of Linux guest)
■ 3.45 MB/sec
■ P99 latency: 7.38 ms (~0.5 of Linux guest)
Linux Guest vs OSv: Golang 1.15.13
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; response payload of 42 bytes.
Parenthesized values in the OSv rows are ratios vs. the Linux guest.

Host with CPUs pinned             1 CPU              2 CPUs             4 CPUs
Requests/sec, transfer in MB/sec  48,033.00  7.28    94,346.28  14.31   106,905.85  16.21
P99 latency in ms                 4.16               3.28               2.26

Linux Guest                       1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  24,124.64  3.84    49,856.71  7.94    93,544.90  14.90
P99 latency in ms                 8.62               9.21               8.11

OSv                               1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  40,793.25 (1.69)  6.03    74,247.88 (1.49)  10.98    82,426.27 (0.88)  12.18
P99 latency in ms                 5.35 (0.62)        15.94 (1.73)       10.90 (1.34)
Linux Guest vs OSv: Rust with Tokio and Hyper
Each test - best of 3 runs; wrk with 8 threads and 200 connections running for 5 sec; response payload of 30 bytes.
Parenthesized values in the OSv rows are ratios vs. the Linux guest.

Host with CPUs pinned             1 CPU              2 CPUs             4 CPUs
Requests/sec, transfer in MB/sec  71,011.59  9.96    153,286.61  21.49  144,677.22  20.28
P99 latency in ms                 3.29               1.85               2.78

Linux Guest                       1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  28,061.11  3.93    68,742.13  9.64    132,515.62  18.58
P99 latency in ms                 10.03              8.18               5.06

OSv                               1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  57,886.77 (2.06)  8.12    47,312.48 (0.69)  6.63    47,073.25 (0.36)  6.60
P99 latency in ms                 7.48 (0.75)        8.36 (1.02)        22.22 (4.39)
Linux Guest vs OSv: Akka HTTP 2.6 on Java8
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; response payload of 42 bytes.
Parenthesized values in the OSv rows are ratios vs. the Linux guest.

Host with CPUs pinned             1 CPU              2 CPUs             4 CPUs
Requests/sec, transfer in MB/sec  19,122.27  3.15    53,301.46  8.79    95,439.53  15.75
P99 latency in ms                 50.64              33.26              16.35

Linux Guest                       1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  10,959.20  1.81    27,018.56  4.46    51,493.63  8.50
P99 latency in ms                 691.89             96.63              40.04

OSv                               1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  38,666.84 (3.52)  6.38    64,532.12 (2.39)  10.65    81,930.80 (1.59)  13.52
P99 latency in ms                 91.94 (0.13)       30.26 (0.31)       54.80 (1.37)
Things to Optimize
■ Implement SO_REUSEPORT to improve Rust apps' throughput
■ Finish “lazy application stack” support to minimize memory used
■ Reduce lock contention in the futex implementation to improve Golang apps
■ Optimize atomic operations on a single vCPU
■ Make L1/L2 memory pool sizes self-configurable depending on the physical
memory available
■ Other open issues
● https://github.com/cloudius-systems/osv/labels/performance
● https://github.com/cloudius-systems/osv/labels/optimization
Brought to you by
Waldek Kozaczuk
https://github.com/cloudius-systems/osv
https://groups.google.com/g/osv-dev
@OSv_unikernel