Brought to you by
OSv Unikernel
Waldek Kozaczuk
OSv Committer
Optimizing Guest OS to Run Stateless and Serverless Apps in the Cloud
What is OSv?
An open-source, versatile, modular unikernel designed to run a single unmodified
Linux application securely as a microVM on top of a hypervisor, in contrast to
traditional operating systems, which were designed for a vast range of physical
machines. Or simply:
■ An OS designed to run a single application, without isolation between the
application and the kernel
■ A HIP (Highly Isolated Process) with no ability to make system calls to the
host OS
■ Supports both x86_64 and aarch64 platforms
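To make this concrete, a typical quick-start looks roughly like the sketch below, based on the scripts shipped in the OSv repository (the native-example module name and exact flag spellings are assumptions; check the repository README for your version):

# Fetch OSv and build an image bundling a sample native app
git clone https://github.com/cloudius-systems/osv.git
cd osv && git submodule update --init --recursive
./scripts/build image=native-example
# Boot the resulting image as a microVM under QEMU/KVM
./scripts/run.py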
Components of OSv
Why Stateless and Serverless Workloads?
These workloads can take advantage of OSv strengths:
■ Fast to boot and restart
■ Low memory utilization
■ Optimized networking stack
They do not need a performant, feature-rich filesystem, just enough to read
code and configuration
■ What about logs?
What and Why to Optimize
■ Short boot time
■ Low memory utilization
● Current minimum is 15 MB but can be optimized down to 10 MB
■ Small kernel size
● Directly leads to higher density of guests on the host
■ Optimized networking stack
● Improves throughput in terms of requests per second
● Improves latency
Optimize Boot Time
OSv, with a Read-Only FS and networking off, can boot as fast as ~5 ms on Firecracker and
even faster, around ~3 ms, on QEMU with the microvm machine. In general, however, the
boot time depends on many factors: the hypervisor (including settings of individual
para-virtual devices), the filesystem (ZFS, ROFS, RAMFS, or Virtio-FS), and some boot
parameters.
For example, these days a ZFS image boots in ~40 ms on Firecracker and ~200 ms on
regular QEMU. Newer versions of QEMU (>=4.0) are also typically faster to boot.
Booting on QEMU in PVH/HVM mode (aka direct kernel boot) should always be faster, as
OSv is directly invoked in 64-bit long mode.
For more details see https://github.com/cloudius-systems/osv#boot-time
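As a rough sketch of how such numbers can be reproduced (assuming the fs=rofs build option and the --bootchart OSv kernel argument behave as described in the repository README; the /hello command path for the native-example image is an assumption):

# Build a Read-Only FS image to avoid filesystem import cost at boot
./scripts/build image=native-example fs=rofs
# Boot under QEMU; --bootchart asks OSv to print a boot-time breakdown
./scripts/run.py -e '--bootchart /hello'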
Optimize Kernel ELF Size: Why?
■ Smaller kernel ELF leads to less memory utilization
■ Fewer symbols, ideally only those needed by a specific app, improve security
Current kernel size is around 6.7 MB and includes subsets of the following libraries:
libdl.so.2, ld-linux-x86-64.so.2
libresolv.so.2, libcrypt.so.1, libaio.so.1
libc.so.6, libm.so.6
libpthread.so.0
librt.so.1, libxenstore.so.3.0
libstdc++.so.6
The experiments described in the following slides help reduce the kernel size to 2.6 MB.
Optimize Kernel ELF Size: Hide STD C++
diff --git a/Makefile b/Makefile
+ --version-script=./version_script_with_public_ABI_symbols_only \
  --whole-archive \
- $(libstdc++.a) $(libgcc_eh.a) \
+ $(libgcc_eh.a) \
  $(boost-libs) \
- --no-whole-archive $(libgcc.a), \
+ --no-whole-archive $(libstdc++.a) $(libgcc.a), \
  LINK kernel.elf)
Hiding the standard C++ library helps reduce the kernel to 5.0 MB.
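The same mechanism can be demonstrated outside the OSv build with a toy shared library; the minimal sketch below hides every symbol except one:

# vs.map: export public_fn, hide everything else
cat > vs.map <<'EOF'
{
  global: public_fn;
  local: *;
};
EOF
cat > lib.c <<'EOF'
int helper(void)    { return 41; }            /* hidden by the version script */
int public_fn(void) { return helper() + 1; }  /* remains exported */
EOF
gcc -shared -fPIC -Wl,--version-script=vs.map lib.c -o libdemo.so
nm -D libdemo.so    # only public_fn appears in the dynamic symbol table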
Optimize Kernel ELF Size: Collect Garbage
Enabling garbage collection reduces the kernel size further to 4.3 MB.
diff --git a/Makefile b/Makefile
  EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) \
    -DOSV_KERNEL_VM_BASE=$(kernel_vm_base) \
-   -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift)
+   -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) -ffunction-sections -fdata-sections
- --no-whole-archive $(libstdc++.a) $(libgcc.a), \
+ --no-whole-archive $(libstdc++.a) $(libgcc.a) --gc-sections, \
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
  .start32_address : AT(ADDR(.start32_address) - OSV_KERNEL_VM_SHIFT) {
-     *(.start32_address)
+     KEEP(*(.start32_address))
  }
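Outside the OSv tree, the effect of these flags can be seen with a minimal standalone sketch: -ffunction-sections puts every function in its own section, and --gc-sections lets the linker drop the unreferenced ones (KEEP() in a linker script protects sections, such as entry points, from this collection):

cat > gc_demo.c <<'EOF'
int used(void)   { return 1; }
int unused(void) { return 2; }   /* never referenced: a GC candidate */
int main(void)   { return used(); }
EOF
gcc -ffunction-sections -fdata-sections -c gc_demo.c
# --print-gc-sections reports what was discarded (e.g. .text.unused)
gcc -Wl,--gc-sections -Wl,--print-gc-sections gc_demo.o -o gc_demo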
Optimize Kernel ELF Size: Disable ZFS
diff --git a/Makefile b/Makefile
+ifdef zfs-enabled
solaris += bsd/sys/cddl/contrib/opensolaris/uts/common/zmod/zmod_subr.o
solaris += bsd/sys/cddl/contrib/opensolaris/uts/common/zmod/zutil.o
solaris += $(zfs)
+endif
+ifdef zfs-enabled
drivers += drivers/zfs.o
+endif
We do not need ZFS for stateless and serverless workloads.
Disabling ZFS reduces the kernel size to 3.6 MB.
Optimize Kernel ELF Size: Select Platform/Drivers
diff --git a/Makefile b/Makefile
+ifdef xen-enabled
bsd += bsd/sys/xen/xenstore/xenstore.o
bsd += bsd/sys/xen/xenbus/xenbus.o
+endif
+ifdef virtio-enabled
drivers += drivers/virtio-vring.o
drivers += drivers/virtio-blk.o
+endif
For example, disabling all drivers and other platform code except what is needed
to run on Firecracker or QEMU microvm reduces the kernel size to 3.1 MB.
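In later OSv trees these experiments turned into build-time knobs; the invocation below is only a sketch, and the drivers_profile and conf_hide_symbols option names are assumptions to verify against the current build documentation:

# Build a virtio-only, ZFS-free kernel with non-public symbols hidden
# (option names assumed; check the OSv repository docs)
./scripts/build image=native-example fs=rofs \
    drivers_profile=microvm conf_hide_symbols=1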
Optimize Kernel ELF Size: App Specific Symbols
{
global:
__cxa_finalize;
__libc_start_main;
puts;
local:
*;
};
Eliminating all symbols and related code except what a specific app needs
further reduces the kernel size to 2.6 MB, which is enough to run a native or
Java “hello world” app.
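One way to derive such a list, sketched below, is to extract the undefined dynamic symbols the application expects its loader/libc to provide and wrap them in a version-script skeleton (the hello binary name is a placeholder; manual curation is still needed):

# Collect the app's undefined dynamic symbols into a version script
nm -D --undefined-only hello | awk '{print "    " $NF ";"}' | sort -u > syms.txt
{ echo '{'; echo 'global:'; cat syms.txt; echo 'local:'; echo '    *;'; echo '};'; } > app_version_script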
Optimize Memory Usage
Apart from shrinking the kernel ELF to minimize memory used, the following
optimizations can be implemented:
■ Lazy stack for application threads (WIP patch available, issue #144)
● The stack needs to be pre-faulted before calling kernel code that cannot be preempted.
■ Refine the L1/L2 memory pool logic to dynamically shrink/expand the low
watermark depending on the physical memory size
● Currently we pre-allocate 512K for each vCPU, regardless of whether the app needs it.
Optimize Number of Runs on c5n.metal
■ Disk-only boot on Firecracker
■ Almost 1,900 boots per second (total of 629,625 runs)
● 25 boots per second on single host CPU
■ Boot time percentiles:
● P50 = 8.98 ms
● P75 = 12.07 ms
● P90 = 17.15 ms
● P99 = 31.49 ms
■ The cost of the hypervisor affects boots/sec
Optimize Density on c5n.metal: Boot Time
Optimize Density on c5n.metal: Boots/second
Optimize Density on c5n.metal: CPU utilization
Optimize HTTP Requests/Sec
Each test described in the following slides involves a separate test “worker” machine connected to a
test “client” machine over a 1 Gbit network.
■ Setup:
● Test guest VM:
■ Linux guest - Fedora 33 with the firewall turned off
■ OSv guest - 0.56
■ QEMU 5.0 with vhost networking, bridged to expose the guest interface on the local
Ethernet; the same setup for the OSv and Linux guests
● Test worker machine - 8-way MacBook Pro i7 2.3GHz running Ubuntu 20.10
● Linux test “client” machine - 8-way MacBook Pro i7 2.7GHz running Fedora 33
■ Each test is executed against the guest VM with 1, 2, and 4 vCPUs where it makes sense
■ As a baseline, each test app is executed and measured on the host, with taskset limiting the CPU count
■ The load is generated by wrk, with enough load that the host CPUs pinned to the OSv or Linux VM
spike close to 100% utilization
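For reference, a wrk invocation matching the parameters used in the following slides looks like this (guest IP and port are placeholders):

# 8 threads, 100 connections, 5 seconds, with latency percentiles reported
wrk -t8 -c100 -d5s --latency http://192.168.1.100:8000/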
Linux Guest vs OSv: Nginx 1.20.1
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; response payload of 774 bytes
Host pinned to single host CPU
■ 51,076.43 requests/sec
■ 49.78 MB/sec
■ P99 latency: 2.28 ms
Linux guest with 1 vCPU
■ 25,736.96 requests/sec
■ 24.80 MB/sec
■ P99 latency: 14.04 ms
OSv with 1 vCPU (same as Linux)
■ 38,333.70 requests/sec (~1.49 of Linux guest)
■ 36.78 MB/sec
■ P99 latency: 1.75 ms (~0.12 of Linux guest)
Linux Guest vs OSv: Node.JS 14.17
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; payload of 33 bytes
Host pinned to single host CPU
■ 23,260.02 requests/sec
■ 3.95 MB/sec
■ P99 latency: 8.05 ms
Linux guest with 1 vCPU
■ 12,351.55 requests/sec
■ 2.37 MB/sec
■ P99 latency: 14.78 ms
OSv with 1 vCPU
■ 17,996.67 requests/sec (~1.46 of Linux guest)
■ 3.45 MB/sec
■ P99 latency: 7.38 ms (~0.5 of Linux guest)
Linux Guest vs OSv: Golang 1.15.13
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; response payload of 42 bytes.
Parenthesized values in the OSv rows are ratios vs. the Linux guest.

Host with CPUs pinned             1 CPU              2 CPUs             4 CPUs
Requests/sec, transfer in MB/sec  48,033.00  7.28    94,346.28  14.31   106,905.85  16.21
P99 latency in ms                 4.16               3.28               2.26

Linux Guest                       1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  24,124.64  3.84    49,856.71  7.94    93,544.90  14.90
P99 latency in ms                 8.62               9.21               8.11

OSv                               1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  40,793.25 (1.69)  6.03    74,247.88 (1.49)  10.98    82,426.27 (0.88)  12.18
P99 latency in ms                 5.35 (0.62)        15.94 (1.73)       10.90 (1.34)
Linux Guest vs OSv: Rust with Tokio and Hyper
Each test - best of 3 runs; wrk with 8 threads and 200 connections running for 5 sec; response payload of 30 bytes.
Parenthesized values in the OSv rows are ratios vs. the Linux guest.

Host with CPUs pinned             1 CPU              2 CPUs             4 CPUs
Requests/sec, transfer in MB/sec  71,011.59  9.96    153,286.61  21.49  144,677.22  20.28
P99 latency in ms                 3.29               1.85               2.78

Linux Guest                       1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  28,061.11  3.93    68,742.13  9.64    132,515.62  18.58
P99 latency in ms                 10.03              8.18               5.06

OSv                               1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  57,886.77 (2.06)  8.12    47,312.48 (0.69)  6.63    47,073.25 (0.36)  6.60
P99 latency in ms                 7.48 (0.75)        8.36 (1.02)        22.22 (4.39)
Linux Guest vs OSv: Akka HTTP 2.6 on Java8
Each test - best of 3 runs; wrk with 8 threads and 100 connections running for 5 sec; response payload of 42 bytes.
Parenthesized values in the OSv rows are ratios vs. the Linux guest.

Host with CPUs pinned             1 CPU              2 CPUs             4 CPUs
Requests/sec, transfer in MB/sec  19,122.27  3.15    53,301.46  8.79    95,439.53  15.75
P99 latency in ms                 50.64              33.26              16.35

Linux Guest                       1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  10,959.20  1.81    27,018.56  4.46    51,493.63  8.50
P99 latency in ms                 691.89             96.63              40.04

OSv                               1 vCPU             2 vCPUs            4 vCPUs
Requests/sec, transfer in MB/sec  38,666.84 (3.52)  6.38    64,532.12 (2.39)  10.65    81,930.80 (1.59)  13.52
P99 latency in ms                 91.94 (0.13)       30.26 (0.31)       54.80 (1.37)
Things to Optimize
■ Implement SO_REUSEPORT to improve Rust apps' throughput
■ Finish “lazy application stack” support to minimize memory used
■ Reduce lock contention in the futex implementation to improve Golang apps
■ Optimize atomic operations on a single vCPU
■ Make L1/L2 memory pool sizes self-configurable depending on the physical
memory available
■ Other open issues
● https://github.com/cloudius-systems/osv/labels/performance
● https://github.com/cloudius-systems/osv/labels/optimization
Brought to you by
Waldek Kozaczuk
https://github.com/cloudius-systems/osv
https://groups.google.com/g/osv-dev
@OSv_unikernel