User-space Network Processing
Ryousei Takano (AIST)
2015/12/18
Summary
• Kernel TCP/IP processing is a major cost for small-message network services
• Ported memcached to the mTCP user-level TCP/IP stack
  – 35% improvement (single-thread SET throughput)
2
Motivation
• Wanted to try mTCP + Intel DPDK (the GitHub version of mTCP supports DPDK) … orz
• Wanted a Key-Value Store running on it (on Linux):
  – Redis → orz (did not pan out)
  – Memcached → this talk
3
Memcached workload at Facebook
4
[Figure 2 panels: “Key size CDF by appearance” (key size, 0–100 bytes), “Value size CDF by appearance” (value size, 1 B–1e+06 B), and “Value size CDF by total”, each for pools USR, APP, ETC, VAR, SYS.]
Figure 2: Key and value size distributions for all traces. The leftmost CDF shows the sizes o
B. Atikoglu, et al., “Workload Analysis of a Large-Scale Key-Value Store,” ACM SIGMETRICS 2012.
here. It is important to note, however, that all Memcached
instances in this study ran on identical hardware.
2.3 Tracing Methodology
Our analysis called for complete traces of traffic passing
through Memcached servers for at least a week. This task
is particularly challenging because it requires nonintrusive
instrumentation of high-traffic volume production servers.
Standard packet sniffers such as tcpdump2
have too much
overhead to run under heavy load. We therefore imple-
mented an efficient packet sniffer called mcap. Implemented
as a Linux kernel module, mcap has several advantages over
standard packet sniffers: it accesses packet data in kernel
space directly and avoids additional memory copying; it in-
troduces only 3% performance overhead (as opposed to tcp-
dump’s 30%); and unlike standard sniffers, it handles out-
of-order packets correctly by capturing incoming traffic af-
ter all TCP processing is done. Consequently, mcap has a
complete view of what the Memcached server sees, which
eliminates the need for further processing of out-of-order
packets. On the other hand, its packet parsing is optimized
for Memcached packets, and would require adaptations for
other applications.
The captured traces vary in size from 3 TB to 7 TB each.
This data is too large to store locally on disk, adding another
challenge: how to offload this much data (at an average rate
of more than 80,000 samples per second) without interfering
with production traffic. We addressed this challenge by com-
bining local disk buffering and dynamic offload throttling to
take advantage of low-activity periods in the servers.
Finally, another challenge is this: how to effectively pro-
cess these large data sets? We used Apache HIVE3
to ana-
lyze Memcached traces. HIVE is part of the Hadoop frame-
work that translates SQL-like queries into MapReduce jobs.
We also used the Memcached “stats” command, as well as
Facebook’s production logs, to verify that the statistics we
computed, such as hit rates, are consistent with the aggre-
gated operational metrics collected by these tools.
3. WORKLOAD CHARACTERISTICS
This section describes the observed properties of each trace
[Figure 1 plots requests in millions (0–70,000) per pool (USR, APP, ETC, VAR, SYS), split into DELETE, UPDATE, and GET.]
Figure 1: Distribution of request types per pool,
over exactly 7 days. UPDATE commands aggregate
all non-DELETE writing operations, such as SET,
REPLACE, etc.
operations. DELETE operations occur when a cached
database entry is modified (but not required to be
set again in the cache). SET operations occur when
the Web servers add a value to the cache. The rela-
tively high number of DELETE operations show that
this pool represents database-backed values that are
affected by frequent user modifications.
ETC has similar characteristics to APP, but with an even
higher rate of DELETE requests (of which some may
not be currently cached). ETC is the largest and least
specific of the pools, so its workloads might be the most
representative to emulate. Because it is such a large
and heterogeneous workload, we pay special attention
to this workload throughout the paper.
VAR is the only pool sampled that is write-dominated. It
stores short-term values such as browser-window size
[Left-clipped column from the excerpt: the paper’s introduction (performance metrics, weekly patterns (Sec. 3.3), power-law distributions, serving systems (Sec. 5), paper roadmap) and the start of its Memcached description: an open-source software package serving values over the network, scaled by adding RAM or servers, with clients hashing keys to select a unique server.]
Table 1: Memcached pools sampled (in one cluster).
These pools do not match their UNIX namesakes,
but are used for illustrative purposes here instead
of their internal names.
Pool Size Description
USR few user-account status information
APP dozens object metadata of one application
ETC hundreds nonspecific, general-purpose
VAR dozens server-side browser information
SYS few system data on service location
A new item arriving after the heap is exhausted requires
the eviction of an older item in the appropriate slab. Mem-
cached uses the Least-Recently-Used (LRU) algorithm to
select the items for eviction. To this end, each slab class
has an LRU queue maintaining access history on its items.
Although LRU decrees that any accessed item be moved to
the top of the queue, this version of Memcached coalesces
repeated accesses of the same item within a short period
(one minute by default) and only moves this item to the top
the first time, to reduce overhead.
2.2 Deployment
Facebook relies on Memcached for fast access to frequently-
accessed values. Web servers typically try to read persistent
values from Memcached before trying the slower backend
databases. In many cases, the caches are demand-filled,
meaning that generally, data is added to the cache after
a client has requested it and failed.
Modifications to persistent data in the database often
propagate as deletions (invalidations) to the Memcached
tier. Some cached data, however, is transient and not backed
by persistent storage, requiring no invalidations.
Memcached workload highlights:
• USR keys are 16 B or 21 B
• 90% of VAR keys are 31 B
• USR values are only 2 B
• 90% of values are smaller than 500 B
Packet-per-second budget at 10 GbE
• Line rate: 10^10 / (8 × (64 + 20) B) ≈ 15M packets/s (64 B minimum frames plus 20 B of preamble and inter-frame gap)
• CPU at 2 GHz: 2×10^9 / (15×10^6) ≈ 133 clocks per packet per core
5
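Written out, the slide’s arithmetic is (a restatement, not new data; the 20 B is the Ethernet preamble plus inter-frame gap):

```latex
\frac{10^{10}\ \text{bit/s}}{8 \times (64 + 20)\ \text{B}} \approx 14.88 \times 10^{6}\ \text{packets/s}
\qquad
\frac{2 \times 10^{9}\ \text{cycles/s}}{15 \times 10^{6}\ \text{packets/s}} \approx 133\ \text{cycles/packet per core}
```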
6
[Figure: two CPU packages of four cores each; per-core L1 (64 KB) and L2 (256 KB), shared LLC (12 MB), and memory (xx GB). Access latencies: L1 4 cycles, L2 12 cycles, LLC 44 cycles, memory 300 cycles. Matching tables (> xx MB) exceed the caches.]
Copyright 2014 NTT Corporation
Packet processing on x86 servers
7
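A rough way to see this latency ladder for yourself is a pointer-chasing microbenchmark: chase dependent loads through a buffer sized for L1, LLC, or DRAM and time each load. A minimal sketch (illustrative only; a careful measurement would add core pinning, warmup, and cycle counters):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one serialized dependent load through a buffer of n_ptrs pointers. */
static double chase(size_t n_ptrs)
{
    void **buf = malloc(n_ptrs * sizeof(void *));
    size_t *idx = malloc(n_ptrs * sizeof(size_t));

    /* Shuffle indices into a random cyclic permutation so the hardware
     * prefetcher cannot predict the next load address. */
    for (size_t i = 0; i < n_ptrs; i++) idx[i] = i;
    for (size_t i = n_ptrs - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n_ptrs; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n_ptrs]];

    void **p = &buf[idx[0]];
    const long iters = 10 * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = (void **)*p;               /* each load depends on the previous */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile void *sink = p;           /* keep the chain live */
    (void)sink;
    free(idx);
    free(buf);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;                 /* ns per dependent load */
}

int main(void)
{
    printf("32 KB:  %.1f ns/load\n", chase(32 * 1024 / 8));         /* ~L1  */
    printf("12 MB:  %.1f ns/load\n", chase(12 * 1024 * 1024 / 8));  /* ~LLC */
    printf("256 MB: %.1f ns/load\n", chase(256UL * 1024 * 1024 / 8)); /* DRAM */
    return 0;
}
```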
2010 Sep.
Per-Packet CPU Cycles for 10G
8
Cycles needed (10G, min-sized packets, dual quad-core 2.66 GHz CPUs):
  IPv4:  Packet I/O (1,200) + IPv4 lookup (600)               = 1,800 cycles
  IPv6:  Packet I/O (1,200) + IPv6 lookup (1,600)             = 2,800
  IPsec: Packet I/O (1,200) + Encryption and hashing (5,400…) = 6,600
Your budget: 1,400 cycles
(in x86; cycle numbers are from RouteBricks [Dobrescu09] and ours)
S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
2010 Sep.
PacketShader: psio I/O Optimization
9
• Packet I/O: 1,200 reduced to 200 cycles per packet
• Main ideas: huge packet buffer; batch processing
(Diagram totals before the optimization: IPv4 = 1,800 cycles, IPv6 = 2,800, IPsec = 6,600.)
S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
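The batch-processing idea carries over directly to Intel DPDK, which this deck uses later: pay the per-packet NIC cost once per burst. A minimal sketch of a burst-receive loop (EAL/port/queue initialization via rte_eal_init() and friends is omitted; the burst size of 32 is illustrative):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32   /* amortize fixed costs over a batch of packets */

static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        /* One call returns up to BURST_SIZE packets: the cost of touching
         * the NIC's RX ring is paid once per burst, not once per packet. */
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);

        for (uint16_t i = 0; i < n; i++) {
            /* ... process pkts[i] (e.g., an IPv4 lookup) ... */
            rte_pktmbuf_free(pkts[i]);
        }
    }
}
```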
2010 Sep.
PacketShader: GPU Offloading
10
• GPU offloading for memory-intensive or compute-intensive operations (IPv4/IPv6 lookup, encryption and hashing)
• Main topic of this talk
S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
Kernel Uses the Most CPU Cycles
11
CPU usage breakdown of a web server (Lighttpd serving a 64-byte file, Linux 3.10):
  Kernel (without TCP/IP) 45%, TCP/IP 34%, Packet I/O 4%, Application 17%
  → 83% of CPU usage spent inside the kernel!
Performance bottlenecks:
  1. Shared resources
  2. Broken locality
  3. Per-packet processing
Bottleneck removed by mTCP:
  1) Efficient use of CPU cycles for TCP/IP processing → 2.35x more CPU cycles for the application
  2) 3x to 25x better performance
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
12
Inefficiencies in Kernel from Shared FD
1. Shared resources
  – Shared listening queue (protected by a lock)
  – Shared file descriptor space (linear search for finding an empty slot)
Receive-Side Scaling (H/W) gives each core (Core 0–3) its own packet queue, but all cores still contend on the single listening queue and FD space.
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
13
Inefficiencies in Kernel from Broken Locality
2. Broken locality
Receive-Side Scaling (H/W) gives each core (Core 0–3) its own packet queue, but the core that handles the interrupt is not the core that calls accept()/read()/write().
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
14
Inefficiencies in Kernel from Lack of Support for Batching
3. Per-packet, per-system-call processing
  – Inefficient per-packet processing: frequent mode switching, cache pollution, per-packet memory allocation
  – Inefficient per-system-call processing: every accept(), read(), and write() crosses the user/kernel boundary (application thread → BSD socket / Linux epoll → kernel TCP → packet I/O)
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
15
Overview of mTCP Architecture
1. Thread model: pairwise, per-core threading
2. Batching from packet I/O to application
3. mTCP API: easily portable API (BSD-like); see the sketch below
Per core, an application thread pairs with an mTCP thread through mTCP socket/epoll; the mTCP threads run on a user-level packet I/O library (PSIO) above the NIC device driver.
• [SIGCOMM’10] PacketShader: A GPU-accelerated software router, http://shader.kaist.edu/packetshader/io_engine/index.html
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
mTCP can also use Intel DPDK for packet I/O (this is what the GitHub version supports).
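A minimal sketch of what the per-core, BSD-like API looks like, following the function names in the public mTCP repository (configuration loading via mtcp_init(), core pinning, and error handling are omitted; exact signatures may differ between versions):

```c
#include <mtcp_api.h>
#include <mtcp_epoll.h>
#include <netinet/in.h>
#include <string.h>

#define MAX_EVENTS 1024

/* One of these loops runs pinned on each core (thread model, item 1). */
static void echo_loop(int core)
{
    mctx_t mctx = mtcp_create_context(core);      /* per-core mTCP thread */
    int ep = mtcp_epoll_create(mctx, MAX_EVENTS); /* mTCP's epoll (item 3) */
    struct mtcp_epoll_event ev, events[MAX_EVENTS];

    int lsock = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
    mtcp_setsock_nonblock(mctx, lsock);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8000);
    mtcp_bind(mctx, lsock, (struct sockaddr *)&addr, sizeof(addr));
    mtcp_listen(mctx, lsock, 4096);

    ev.events = MTCP_EPOLLIN;
    ev.data.sockid = lsock;
    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, lsock, &ev);

    for (;;) {
        /* Events are drained in batches, matching the stack's batching
         * from packet I/O up to the application (item 2). */
        int n = mtcp_epoll_wait(mctx, ep, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int s = events[i].data.sockid;
            if (s == lsock) {          /* connections accepted on this core */
                int c = mtcp_accept(mctx, lsock, NULL, NULL);
                mtcp_setsock_nonblock(mctx, c);
                ev.events = MTCP_EPOLLIN;
                ev.data.sockid = c;
                mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, c, &ev);
            } else {                   /* BSD-like read/write calls */
                char buf[8192];
                int r = mtcp_read(mctx, s, buf, sizeof(buf));
                if (r > 0)
                    mtcp_write(mctx, s, buf, r);
                else
                    mtcp_close(mctx, s);
            }
        }
    }
}
```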
16
Porting memcached to mTCP
• The thread model has to be restructured: connection handling moves out of the main thread into per-core threads (cf. SeaStar)
Original memcached: a main thread accept()s connections and hands them to worker threads over a pipe; the workers read()/write() through the shared kernel stack.
mTCP port: each worker thread is paired with its own mTCP thread and performs accept()/read()/write() on its own core.
17
Experimental setup
• Two machines connected back-to-back (10 GbE)
• Linux TCP vs. mTCP
• Workloads:
  – lighttpd + Apache benchmark
  – Memcached 1.4.24 + mc-benchmark

Hardware
  CPU       Intel Xeon E5-2430L/2.0 GHz (6 cores) × 2 sockets
  Memory    48 GB PC3-12800
  Ethernet  Intel X520-SR1 (10 GbE)
Software
  OS        Debian GNU/Linux 8.1
  Kernel    Linux 3.16.0-4-amd64
  Intel DPDK 2.0.0
  mTCP (commit 4603a1a, June 7 2015)
lighttpd throughput
[Graph: ×1,000 requests/second (0–180) vs. #cores (0–12) for Linux, SO_REUSEPORT, and mTCP; higher is better. Annotations on the graph: 3.3x and 5.5x.]
• Apache benchmark
• 64 B message
• 1,000 concurrency
• 100K requests
(A sketch of the SO_REUSEPORT baseline follows.)
18
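For reference, a minimal sketch of the SO_REUSEPORT pattern benchmarked above: each worker opens its own listening socket on the same port, so the kernel spreads incoming connections across listeners instead of funneling them through one shared, lock-protected queue (port and backlog values here are illustrative):

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Called once per worker thread or process; all listeners share the port. */
static int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* Allow multiple sockets to bind the same addr:port (Linux >= 3.9). */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 4096);
    return fd;  /* error handling omitted for brevity */
}
```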
19
Memcached performance
• mc-benchmark (the Redis benchmark ported to memcached)
• With mTCP (1 thread): SET +35%, GET +2% over Linux TCP (1 thread)

Throughput in requests/second (relative to TCP w/ 1 thread in parentheses):
        TCP w/ 1 thread   TCP w/ 3 threads   mTCP w/ 1 thread
SET     85,404            146,351 (1.71)     115,166 (1.35)
GET     115,079           139,575 (1.21)     116,838 (1.02)

• mc-benchmark
• 64 B message
• 500 concurrency
• 100K requests
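The SET/GET rows above exercise memcached's plain text protocol; for concreteness, this is the shape of one SET and one GET with a 64 B value (the key name is hypothetical; connect() to the server and response parsing are elided):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void set_then_get(int fd)   /* fd: TCP socket to memcached (11211) */
{
    char payload[64];
    memset(payload, 'x', sizeof(payload));   /* the 64 B message */

    /* set <key> <flags> <exptime> <bytes>\r\n<data>\r\n  ->  "STORED\r\n" */
    char req[256];
    int n = snprintf(req, sizeof(req), "set key:0 0 0 %zu\r\n",
                     sizeof(payload));
    write(fd, req, n);
    write(fd, payload, sizeof(payload));
    write(fd, "\r\n", 2);

    /* get <key>\r\n  ->  "VALUE key:0 0 64\r\n<data>\r\nEND\r\n" */
    write(fd, "get key:0\r\n", 11);

    char resp[512];
    read(fd, resp, sizeof(resp));   /* real clients parse until END\r\n */
}
```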
Discussion: porting memcached to mTCP
• The mTCP API is BSD-socket/epoll-like, so the socket-handling code itself ports with modest changes
• Restructuring the thread model was the main work
• The NIC is taken over by DPDK, and TCP/IP processing moves entirely into user space
20
Future work
• Study sensitivity to CPU speed and core type (x86 and beyond):
  – cpufreq-info(1): run cores at 1/2 clock
  – cgroups CPU throttling
  – “Xeon Phi”-class many-core processors
  – e.g., FLARE and Tilera platforms
21
Acknowledgments
This talk is based on joint work with Supachai Thongprasit [1].
[1] S. Thongprasit, V. Visoottiviseh, and R. Takano, “Toward Fast and Scalable Key-Value Stores Based on User Space TCP/IP Stack,” AINTEC 2015.
22