Proceedings ArticleDOI

Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism

09 Dec 2019, pp. 27-33
TL;DR: This work proposes an approach that combines application-level partitioning and packet steering on a programmable NIC. The approach can reduce latency and improve throughput because it uses multicore systems efficiently, and it lets applications improve their partitioning scheme without impacting clients.
Abstract: A single CPU core is not fast enough to process packets arriving from the network on commodity NICs. Applications are therefore turning to application-level partitioning and NIC offload to exploit parallelism on multicore systems and relieve the CPU. Although NIC offload techniques are not new, programmable NICs have emerged as a way to offload custom packet processing. However, it is not clear what parts of the application should be offloaded to a programmable NIC for improving parallelism. We propose an approach that combines application-level partitioning and packet steering with a programmable NIC. Applications partition data in DRAM between CPU cores, and steer requests to the correct core by parsing L7 packet headers on a programmable NIC. This approach improves request-level parallelism but keeps the partitioning scheme transparent to clients. We believe this approach can reduce latency and improve throughput because it utilizes multicore systems efficiently, and applications can improve their partitioning scheme without impacting clients.

Summary (3 min read)

1 INTRODUCTION

  • A single CPU core is not fast enough to serve packets arriving at line rate.
  • These programmable NICs come in different flavors (ASIC-based designs, FPGAs, special-purpose cores such as NPUs, or multicore systems-on-chip) [8, 31, 36], but their objective is the same: provide programmable packet processing on the NIC before packets are forwarded down to the OS network stack.
  • As shown in Figure 1, the application uses a thread-per-core approach in which data in the DRAM is partitioned between threads that are pinned to CPU cores.
  • The proposed approach keeps the partitioning scheme transparent similar to software steering but maintains low request steering overhead, similar to hardware steering.

2 BACKGROUND

  • The authors believe that combining application-level partitioning with packet steering can improve request-level parallelism.
  • To perform parallel request processing, applications use OS threads, but threads have overheads from synchronization and context switching.
  • State-of-the-art systems, summarized in Table 1, use partitioning, packet steering, and NIC offload for high performance, but their approaches differ from each other.
  • The XDP interface enables the implementation of high-performance networking applications on Linux by combining programmable packet processing and kernel-bypass [10].
  • Applications implement custom packet processing in a programming language such as C, which compiles to the eBPF virtual machine instruction set.

3 APPLICATION-LEVEL PARTITIONING AND PACKET STEERING

  • Applications must embrace parallelism to take advantage of multicore CPUs.
  • In the thread-per-core approach to improve parallelism, applications restrict the number of application threads to the number of CPU cores, and partition application data in DRAM and the resources between the CPU cores to eliminate thread synchronization and OS-level locks.
  • Current solutions are unable to steer packets to the CPU core that can independently serve the request without exposing the partitioning scheme to its clients.
  • The authors now show how application-level packet steering with a programmable NIC can solve this problem.

3.1 Partitioning in the thread-per-core model

  • In the thread-per-core approach, an arriving packet needs to be steered to the CPU core that can serve the request.
  • This approach allows the CPU cores to run independently by eliminating the need to synchronize threads on application-level data access, and avoiding OS-level locking (in some cases).
  • Steering requests to the correct CPU core either requires clients to specify the partition in the request [23], or the application threads need to redirect the requests.
  • This is because the approaches to steer the packets using traditional non-programmable NICs are restricted to L2-L4 protocol headers.
  • The CPU core clusters could also be partitioned around sub-NUMA clusters, which group CPU cores by memory controllers [28].

3.2 Partition-aware packet steering

  • Request processing is performed in different stages across the NIC, the kernel, and user space as shown in Figure 3(a).
  • The thread notification time is the time difference between the call to the write system call and the return of the read system call.
  • In a key-value store, the partition identifier can be the request key, which determines the target of the operation requested by a client.
  • If an application workload does not access the resource partitions and data sets uniformly, the NIC packet steering can perform load balancing between the CPU cores.
  • If processing a request needs a lot of CPU cycles, caching the response at NIC level can be beneficial.

3.3 Example: A Key-Value Store

  • For their discussion, the authors assume two types of requests, get and set, both of which contain a request key k.
  • Finally, a response message is generated for the request and handed over to the network stack, which performs protocol processing and hands over the packet to the NIC.
  • A program running on the NIC parses L7 packet headers to determine request keys and uses the same partitioning function p(k) to steer packets to the target CPU core.
  • The KV store spawns an OS thread for each CPU core assigned to the application with pthread_create and allocates memory regions individually for each CPU core with mmap, as sketched below.
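
The last bullet can be made concrete with a short sketch. The code below is our illustration, not the authors' implementation; the partition size and the empty worker body are placeholder assumptions. It spawns one OS thread per online CPU core, pins each thread to its core, and gives each core a private mmap'd DRAM region, so that request processing never needs cross-core locks:

    // Thread-per-core setup: one pinned thread and one private memory
    // region per CPU core (error handling elided for brevity).
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PARTITION_BYTES (64UL << 20)   /* per-core data region: 64 MiB */

    struct core_ctx {
        int cpu;       /* core this thread is pinned to */
        void *region;  /* private DRAM partition for this core */
    };

    static void *worker(void *arg)
    {
        struct core_ctx *ctx = arg;

        /* Pin the thread so its partition stays core-local. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(ctx->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Serve requests for keys that map to this partition (elided). */
        return NULL;
    }

    int main(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t *tids = calloc(ncores, sizeof(*tids));
        struct core_ctx *ctxs = calloc(ncores, sizeof(*ctxs));

        for (long i = 0; i < ncores; i++) {
            ctxs[i].cpu = (int)i;
            /* One private region per core: no sharing, no locks. */
            ctxs[i].region = mmap(NULL, PARTITION_BYTES,
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            pthread_create(&tids[i], NULL, worker, &ctxs[i]);
        }
        for (long i = 0; i < ncores; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }
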

4 DISCUSSION

  • The authors propose an approach that combines application-level partitioning and packet steering with a programmable NIC.
  • Multi-key requests and range queries may need to access data on multiple CPU cores.
  • Another possible solution to the issue of steering encrypted packets is to use homomorphic encryption techniques, which allow computation on encrypted data.
  • Programmable NICs are already a good fit for implementing the proposed approach.
  • One XDP limitation for their approach is whether the AF_XDP kernel-bypass interface can deliver packets directly to user space.

5 RELATED WORK

  • Multi-queue NICs support receive-side scaling (RSS) and Flow Director, which distribute packets to multiple NIC receive queues [1].
  • The OS maps the NIC queues to different CPUs, which enables parallel processing of the packets.
  • The partitioned packet steering approach is similar to the one proposed by Floem [30], except that the authors design around the Linux XDP and eBPF subsystems instead of having a separate compiler.
  • The MICA in-memory key-value store partitions data between CPU cores and uses NIC Flow Director to map requests to specific CPUs [23].
  • Shenango provides low latency for applications by dedicating a single CPU core that polls the NIC for arriving packets and steers them to dynamically allocated application CPU cores.

6 CONCLUSION

  • The authors have proposed a combination of application-level partitioning and packet steering with a programmable NIC to improve application-level parallelism.
  • An application partitions its resources and data in DRAM between CPU cores, and a program running on a programmable NIC inspects L7 protocol headers to steer the request to its partition.
  • The authors are currently working on a prototype in-memory KV store using this approach and are planning to compare its performance against previous works such as MICA [23].
  • This would allow some existing applications to take advantage of application-level partitioning and packet steering.
  • The authors are also exploring the limits of XDP and eBPF for their approach, and looking into alternatives to address them.


https://helda.helsinki.fi
Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism
Enberg, Pekka
ACM
2019
Enberg, P., Rao, A. & Tarkoma, S. 2019, Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism. In Proceedings of the 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms. ACM, New York, NY, USA, pp. 27-33. ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms, Orlando, Florida, United States, 09/12/2019. https://doi.org/10.1145/3359993.3366766
http://hdl.handle.net/10138/326309
https://doi.org/10.1145/3359993.3366766
acceptedVersion
Downloaded from Helda, the University of Helsinki institutional repository. This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism
Pekka Enberg, Ashwin Rao, and Sasu Tarkoma
University of Helsinki
ABSTRACT
A single CPU core is not fast enough to process packets arriving from the network on commodity NICs. Applications are therefore turning to application-level partitioning and NIC offload to exploit parallelism on multicore systems and relieve the CPU. Although NIC offload techniques are not new, programmable NICs have emerged as a way to offload custom packet processing. However, it is not clear what parts of the application should be offloaded to a programmable NIC for improving parallelism.
We propose an approach that combines application-level partitioning and packet steering with a programmable NIC. Applications partition data in DRAM between CPU cores, and steer requests to the correct core by parsing L7 packet headers on a programmable NIC. This approach improves request-level parallelism but keeps the partitioning scheme transparent to clients. We believe this approach can reduce latency and improve throughput because it utilizes multicore systems efficiently, and applications can improve their partitioning scheme without impacting clients.
CCS CONCEPTS
• Software and its engineering → Operating systems; Communications management; Multiprocessing / multiprogramming / multitasking.
KEYWORDS
XDP, eBPF, Packet Steering, Parallelism, Partitioning
ACM Reference Format:
Pekka Enberg, Ashwin Rao, and Sasu Tarkoma. 2019. Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism. In 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms (ENCP '19), December 9, 2019, Orlando, FL, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3359993.3366766
1 INTRODUCTION
A single CPU core is not fast enough to serve packets arriving at line rate. For example, the arrival rate of packets on a 40 Gbps NIC is faster than the rate at which a single CPU core can access its last-level cache (LLC), and this difference in operating speeds can prevent the CPU from keeping up with the network [15]. The performance gap is further expected to increase with 400 Gbps and beyond on the horizon. Fundamentally, the time budget to process a single packet is shrinking radically, forcing applications to embrace parallel processing and NIC offload capabilities.
Figure 1 (diagram omitted): Partition-aware packet steering on a programmable NIC. The application partitions its resources and the data in DRAM between CPU cores, and a programmable NIC steers requests to the target CPU core by inspecting protocol headers. This allows request processing to run independently on each CPU, while keeping partitioning transparent to clients. For the example key-value store shown in this figure, the NIC parses keys from client requests.
Application-level partitioning is one approach to parallelize request processing on multicore systems [23, 35, 37]. In the thread-per-core model, applications run only one OS thread per CPU core and also partition the data in DRAM between the cores [7, 35]. This enables the CPU cores to run independently by eliminating synchronization for data access and avoiding OS-level locking. However, steering requests to the CPU core that manages request data either requires clients to specify the partition [14, 23] or uses CPU cycles for the steering [6, 35].
Offloading the network processing to the NIC helps conserve CPU cycles [5, 27], and NICs are also starting to support the ability to run arbitrary programs and customize the offload. These programmable NICs come in different flavors (ASIC-based designs, FPGAs, special-purpose cores such as NPUs, or multicore systems-on-chip) [8, 31, 36], but their objective is the same: provide programmable packet processing on the NIC before packets are forwarded down to the OS network stack. Programmable NICs have a significant performance advantage over the host CPU for packet processing because they can be highly specialized and do not have to wait for packet data to DMA over the I/O bus to DRAM. However, the emergence of programmable NICs raises a question: what should applications offload to a programmable NIC for improving parallelism?

We propose an approach for improving parallelism that combines application-level partitioning and packet steering with a programmable NIC (§3). As shown in Figure 1, the application uses a thread-per-core approach in which data in the DRAM is partitioned between threads that are pinned to CPU cores. In one of our previous works, we highlighted that packet steering is a per-request overhead in the thread-per-core model, and that this overhead could be addressed with the help of a programmable NIC [7]. A programmable NIC runs a program that parses application-specific protocol headers, including L7 headers, to steer the packet to the thread responsible for serving it. For example, for a key-value (KV) store such as Memcached, a program running on the NIC parses the Memcached protocol headers to determine a request key and forwards the packet to the CPU core that manages that key (§3.3). We propose to implement our approach using Linux's Express Data Path (XDP) interface. XDP combines programmable packet processing with kernel-bypass [10], and supports offload to a programmable NIC using eBPF [18]. We present an overview of the XDP interface and eBPF in §2.
Our contributions are as follows.
  • We propose a NIC-CPU co-design using eBPF and XDP on Linux, where the NIC performs packet steering and the CPU executes application logic (§3). As an example application, we describe the design and implementation of a key-value store using eBPF and XDP. We believe this approach provides a practical solution for accelerating network-intensive applications.
  • We discuss the limitations and future research directions of our proposed NIC-CPU co-design approach in §4. Specifically, we analyze a) how applications beyond key-value stores can take advantage of this approach, b) how NIC-based packet steering can improve application-level partitioning, and c) what the limitations of eBPF and XDP for NIC offload are.
Previous approaches to application-level partitioning either require the clients to be aware of the partitioning scheme, or require expensive inter-thread communication. MICA [23] and HERD [14] expose the application partitioning scheme to clients, which makes it challenging to improve partitioning without impacting the clients. Minos implements a size-aware partitioning scheme that is transparent to clients, but it requires inter-thread communication over a software queue [6]. Similarly, the Seastar framework steers requests in user space [35] but requires expensive CPU-intensive polling to avoid thread wakeups [19]. Our proposed approach keeps the partitioning scheme transparent, similar to software steering, but maintains low request steering overhead, similar to hardware steering. It complements Floem [30], a data-flow programming language for NIC-CPU co-design, which can be used for steering packets in sharded applications. Furthermore, our approach of offloading only packet steering to a programmable NIC is easier to adopt for general-purpose applications than previous approaches that offload whole applications to a programmable NIC using a special-purpose programming language such as OpenCL [20].
2 BACKGROUND
We believe that combining application-level partitioning with packet steering can improve request-level parallelism. To motivate our approach, we begin by discussing why parallel processing and NIC offload are critical. We then highlight that kernel-bypass networking is a key enabler for this approach. Finally, we give an overview of the Express Data Path (XDP) networking interface [10] and the extended Berkeley Packet Filter (eBPF) virtual machine [18], which make our proposed application-level partitioning with packet steering approach practical on Linux.

Table 1: Partitioning and packet steering implementations. Data is partitioned per CPU core, per OS process, by the size of requested items, or per server. Requests are steered either with a hardware/software co-design or solely in hardware.

System         | Partitioning | Steering | Hardware
Seastar [35]   | CPU core     | HW/SW    | RSS
MICA [23]      | CPU core     | HW       | Flow Director
HERD [14]      | OS process   | HW       | RDMA
Minos [6]      | Size-aware   | HW/SW    | RSS and Flow Director
KV-Direct [20] | None         | HW       | FPGA
NetCache [13]  | Server       | HW       | ASIC
Parallel processing. Applications must embrace parallel processing because single-threaded CPU core speeds have stagnated [9, 34], but NIC speeds are getting faster [15]. To perform parallel request processing, applications use OS threads, but threads have overheads from synchronization and context switching. OS system calls can block an OS thread, which is why applications need to create more threads than there are CPU cores. However, having a large number of threads incurs high overheads because of context switching costs and memory footprint. To address these overheads, applications are increasingly leveraging application-level partitioning [14, 23, 35].
CPU and NIC offload co-design. State-of-the-art systems, summarized in Table 1, use partitioning, packet steering, and NIC offload for high performance, but their approaches differ from each other. KV-Direct [20] and NetCache [13] offload the whole application to hardware for high performance. However, systems that want to use the CPU for application logic have to use simple partitioning schemes for hardware steering. For example, MICA [23] and HERD [14] partition by CPU core and by OS process, and use hardware steering provided by a commodity multi-queue NIC or RDMA. Seastar [35] and Minos [6] use a combination of hardware and software steering; they either support commodity POSIX APIs or provide more advanced partitioning schemes. There is a gap in systems that want to combine CPU use with NIC offload while enabling advanced application-level partitioning, which our proposed approach aims to fill.
Kernel-bypass networking. Traditional in-kernel network stacks are designed for flexibility, but are a bottleneck for network-intensive applications for two reasons: (1) they perform too much work per packet, and (2) their system call interface is too expensive [12, 16, 33, 38]. Traditional network stacks require memory allocation and locking per packet, which is too heavy-weight for the packet processing time budgets of current NICs. Applications receive and transmit data using the POSIX sockets API, which has high overheads from system call costs and copying. Kernel-bypass networking has emerged as a solution to eliminate these overheads [12, 24, 33]. With kernel-bypass networking, the OS is eliminated from the data plane, and the NIC leverages DMA to write packets to a memory buffer that applications consume directly.

Figure 2 (diagrams omitted): XDP and eBPF configurations. Applications can use XDP and eBPF via (a) POSIX sockets without bypassing the kernel, (b) the AF_XDP kernel-bypass interface, or (c) hardware offload with a programmable NIC.
XDP and eBPF. The XDP interface enables the implementation of high-performance networking applications on Linux by combining programmable packet processing and kernel-bypass [10]. Applications implement custom packet processing in a programming language such as C, which compiles to the eBPF virtual machine instruction set. These eBPF programs run before the OS forwards the packets to the in-kernel network stack. The OS provides an in-kernel virtual machine for eBPF programs, but they can also be offloaded to a programmable NIC [18]. As shown in Figure 2, XDP supports multiple configurations: the POSIX sockets API, AF_XDP kernel-bypass, and NIC offload; the AF_XDP socket type allows applications to bypass the OS network stack entirely if needed. As eBPF is programming-language agnostic, applications can reuse the same partitioning code in the XDP program and the application. For example, applications can use the same application code implemented in C for request steering, or run existing packet processors implemented in the P4 [3] programming language on XDP [38]. The availability of XDP and eBPF in a commodity OS makes application-level partitioning with packet steering a practical approach for applications.
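
To make these configurations concrete, the following sketch shows what an XDP steering program can look like. It is our minimal illustration rather than code from the paper; the map name, the map size, and the queue-index steering policy are assumptions. The program parses the L2-L3 headers and redirects UDP packets to an AF_XDP socket registered in an XSKMAP; a partition-aware version would instead compute the index from L7 headers, as described in §3.3:

    // Minimal XDP steering sketch. Build with:
    //   clang -O2 -g -target bpf -c steer.c -o steer.o
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);   /* one AF_XDP socket per RX queue */
        __type(key, __u32);
        __type(value, __u32);
    } xsks_map SEC(".maps");

    SEC("xdp")
    int steer_prog(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Bounds checks are mandatory: the eBPF verifier rejects the
         * program if any access could run past data_end. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;
        if (ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        /* Redirect to the AF_XDP socket bound to this RX queue; the
         * third argument is the fallback action (kernel 5.3+). */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
    }

    char LICENSE[] SEC("license") = "GPL";

The program can then be attached with iproute2 (ip link set dev eth0 xdp obj steer.o sec xdp), while the user-space side creates one AF_XDP socket per queue and inserts it into xsks_map.
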
3 APPLICATION-LEVEL PARTITIONING AND PACKET STEERING
Applications must embrace parallelism to take advantage of multicore CPUs. In the thread-per-core approach to improving parallelism, applications restrict the number of application threads to the number of CPU cores, and partition application data in DRAM and the resources between the CPU cores to eliminate thread synchronization and OS-level locks. However, current solutions are unable to steer packets to the CPU core that can independently serve the request without exposing the partitioning scheme to its clients. We now show how application-level packet steering with a programmable NIC can solve this problem.
3.1 Partitioning in the thread-per-core model
Partitioning is increasingly being adopted as a strategy to improve application-level parallelism. In the thread-per-core approach, an arriving packet needs to be steered to the CPU core that can serve the request. This approach allows the CPU cores to run independently by eliminating the need to synchronize threads on application-level data access, and by avoiding OS-level locking (in some cases). However, steering requests to the correct CPU core either requires clients to specify the partition in the request [23], or the application threads need to redirect the requests. This is because the approaches to steering packets with traditional non-programmable NICs are restricted to L2-L4 protocol headers. A programmable NIC can solve the problem of request steering by inspecting L7 packet headers.
In spite of its benefits, the thread-per-core approach for partitioning data and resources has its limitations. For skewed workloads, this approach can overload some CPU cores while leaving the others underutilized. This can be addressed by binning CPU cores into clusters, and making each CPU cluster responsible for a partition of the data. Each CPU core cluster could use the traditional shared-memory approach, which is known to scale to small core counts [11]. The CPU core clusters could also be partitioned around sub-NUMA clusters, which group CPU cores by memory controllers [28].
3.2 Partition-aware packet steering
Request processing is performed in different stages across the NIC, the kernel, and user space, as shown in Figure 3(a). The NIC first performs L1 processing to queue packets in the NIC RX queues and then performs packet steering via RSS or Flow Director by L2-L4 packet headers. Finally, the OS network stack performs protocol processing and hands over the packet to user space for L7 protocol processing. Note that the kernel forwards a packet to a user space thread based on its own packet steering policy, and this steering has no knowledge of application-specific partitioning. When data is partitioned using the thread-per-core approach, the user space thread which receives the packet needs to first forward it to the remote thread responsible for serving the packet, and then notify the remote thread [7, 35].

Figure 3 (diagrams omitted): Request processing. Packets traverse multiple stages (packet queuing, packet steering, and protocol processing) before the application thread services the request; L1-L7 denote the seven layers of the OSI model. (a) With the OS network stack: packet queuing on the NIC (L1), packet steering on the NIC (L2-L4), protocol processing in the kernel (L2-L4), and request service in user space (L7). (b) With XDP and eBPF: packet queuing on the NIC (L1), packet steering on the NIC or in the kernel (L2-L7), and protocol processing (L2-L4) and request service (L7) in user space.

We measured the time required to notify a thread on an Intel Xeon E5-2686 v4 @ 2.30 GHz with two non-uniform memory access (NUMA) nodes running Ubuntu 18.04.3 LTS with Linux 4.15.0-1051-aws. In our experimental setup, we had two threads running on different CPUs on the system. The first thread notifies the second thread by writing to an eventfd file descriptor that the second thread reads. The thread notification time is the time difference between the call to the write system call and the return of the read system call. As shown in Figure 4, the 99.9th percentile of the thread wakeup delay is 4.32 µs on the same NUMA node and 6.09 µs on a remote NUMA node. In contrast, a 40 Gbps NIC can receive a 64 byte packet close to every 12 ns [15].

Figure 4 (plot omitted): Thread notification time. The cumulative distribution function (CDF) of the thread notification time for NUMA-local and NUMA-remote wakeups highlights that the time required to notify a user space thread is significantly larger than the time between successive packet arrivals on a fast NIC; a 40 Gbps NIC can receive a 64 byte packet close to every 12 ns [15].
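
The following sketch is our reconstruction of the measurement methodology described above, not the authors' benchmark code; thread pinning to specific cores and NUMA nodes, and the loop that collects many samples for the CDF, are elided. One thread timestamps immediately before writing to the eventfd, the other timestamps when its blocking read returns, and the difference is one notification-time sample:

    // One sample of the eventfd thread notification time.
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <time.h>
    #include <unistd.h>

    static int efd;
    static struct timespec t_write;   /* read by waiter after wakeup */

    static int64_t ns_between(const struct timespec *a,
                              const struct timespec *b)
    {
        return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000
             + (b->tv_nsec - a->tv_nsec);
    }

    static void *waiter(void *arg)
    {
        (void)arg;
        uint64_t val;
        struct timespec t_read;

        read(efd, &val, sizeof(val));   /* blocks until notified */
        clock_gettime(CLOCK_MONOTONIC, &t_read);
        printf("notification time: %lld ns\n",
               (long long)ns_between(&t_write, &t_read));
        return NULL;
    }

    int main(void)
    {
        efd = eventfd(0, 0);

        pthread_t tid;
        pthread_create(&tid, NULL, waiter, NULL);
        usleep(10000);   /* crude: let the waiter block in read(2) first */

        uint64_t one = 1;
        clock_gettime(CLOCK_MONOTONIC, &t_write);
        write(efd, &one, sizeof(one));   /* wakes the waiter */

        pthread_join(tid, NULL);
        return 0;
    }

In the experiment described above, the two threads would additionally be pinned to cores on the same or on different NUMA nodes to produce the NUMA-local and NUMA-remote curves of Figure 4.
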
As shown in Figure 3(b), a programmable NIC with XDP and eBPF can perform partition-aware packet steering. The NIC performs L1 processing to queue packets, and a program running on the NIC inspects L2-L7 protocol headers to steer packets directly to the CPU core that can serve the request. The user space thread then performs the protocol processing and serves the request. For example, in a key-value store, the partition identifier can be the request key, which determines the target of the operation requested by a client. In an RPC call, the partition identifier can be the name of the RPC function or a parameter of the RPC call, which determines the service thread that implements the RPC function.
If an application workload does not access the resource partitions and data sets uniformly, the NIC packet steering can perform load balancing between the CPU cores. For example, instead of steering packets to a single CPU core, the NIC can round-robin request processing between all CPU cores. The trade-off with load balancing is that the request-processing CPU cores must either access CPU-remote memory or use software steering to complete the request. Another possible optimization at the NIC level is response caching. For example, if processing a request needs a lot of CPU cycles, caching the response at the NIC level can be beneficial. For requests that are not CPU-intensive, caching responses can still improve performance, because caching eliminates transferring data over the PCIe bus. However, the trade-off with response caching is that the NIC program needs more application-specific knowledge to perform the cache lookup.
3.3 Example: A Key-Value Store
Key-value (KV) stores are a widely understood topic [6, 13, 14, 20, 23]. Although KV stores have been criticized recently [2], they serve as an easy to understand example of a network-intensive application. We believe that KV stores can benefit from application-level partitioning and packet steering, and that the lessons are applicable to a broader range of networked applications that need to embrace parallelism.
Request processing overview. For our discussion, we assume two types of requests, get and set, both of which contain a request key k. Request processing in the KV store starts when a packet containing the client request arrives on the NIC. The NIC forwards the packet to one of its RX queues, depending on how the device driver configures the NIC. The NIC obtains a DMA descriptor from the NIC RX queue, and DMAs the packet data to a region of DRAM pointed to by the DMA descriptor. The device driver notices that a new packet has arrived and forwards it to the network stack. The network stack performs the L2-L4 protocol processing, after which it hands over the packet to the application. The application parses the request and performs the necessary operations. For example, for a get request, a value is looked up from a data structure by the request key. Finally, a response message is generated for the request and handed over to the network stack, which performs protocol processing and hands over the packet to the NIC.
Design. The design goals for our KV store are as follows.
  • Improve hardware utilization with CPU and NIC offload.
  • Exploit multicore CPU parallelism efficiently.
  • Do not expose application-level partitioning to clients.
As shown in Figure 1, we propose a design that combines application-level partitioning and packet steering with a programmable NIC. Similar to previous approaches [14, 23], it runs one thread per CPU core, and it partitions the keyspace in DRAM between the CPU cores. That is, a partitioning function p(k) maps key k0 to CPU Cn, key k1 to CPU Cm, and so on. For example, the MICA KV store partitions the keys by using some bits of the hash of the keys [23]. With this application-level partitioning in place, a program running on the NIC parses L7 packet headers to determine request keys and uses the same partitioning function p(k) to steer packets to the target CPU core. For example, with the Memcache protocol, the request key is part of the request headers of all operations. The application thread parses the complete request, performs the requested operation, generates a response, and places it on the NIC TX queue. The difference to previous approaches is that the NIC steers packets directly to the target thread. This eliminates the per-request steering overheads described in §3.2.
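
Because eBPF programs are written in C, the partitioning function p(k) can literally be shared between the XDP program and the application, which is what keeps the NIC's steering decisions consistent with the application's partitions. The sketch below is our illustration of such a shared function; the FNV-1a hash, the key-length bound, and the modulo mapping are assumptions, as the paper does not prescribe a particular hash (MICA, for example, uses some bits of the key hash [23]):

    // p(k): shared between the XDP steering program and the KV store.
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_KEY_LEN 64   /* fixed bound keeps the eBPF verifier happy */

    /* FNV-1a: an arbitrary, dependency-free hash choice for this sketch. */
    static inline uint32_t hash_key(const uint8_t *key, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < MAX_KEY_LEN && i < len; i++) {
            h ^= key[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Map a request key to the CPU core that owns its partition. The
     * XDP program uses the result as the redirect index, and the KV
     * store uses it to decide which per-core region holds the item. */
    static inline uint32_t partition(const uint8_t *key, size_t len,
                                     uint32_t ncores)
    {
        return hash_key(key, len) % ncores;
    }

Compiling the same translation unit into both the eBPF object and the user-space binary is one way to guarantee the two sides never disagree about key placement.
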

Citations
Journal Article
TL;DR: It is argued that substantial reductions in the carbon intensity of datacenter computing are possible with a software-centric approach: by making energy and carbon visible to application developers on a fine-grained basis, by modifying system APIs to make it possible to make informed trade offs between performance and carbon emissions.
Abstract: The end of Dennard scaling and the slowing of Moore’s Law has put the energy use of datacenters on an unsustainable path. Datacenters are already a significant fraction of worldwide electricity use, with application demand scaling at a rapid rate. We argue that substantial reductions in the carbon intensity of datacenter computing are possible with a software-centric approach: by making energy and carbon visible to application developers on a fine-grained basis, by modifying system APIs to make it possible to make informed trade offs between performance and carbon emissions, and by raising the level of application programming to allow for flexible use of more energy efficient means of compute and storage. We also lay out a research agenda for systems software to reduce the carbon footprint of datacenter computing. 1

15 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, the authors explore using BPF to reduce the overhead of the kernel storage path by injecting user-defined functions deep in the kernel's I/O processing stack.
Abstract: The overhead of the kernel storage path accounts for half of the access latency for new NVMe storage devices. We explore using BPF to reduce this overhead, by injecting user-defined functions deep in the kernel's I/O processing stack. When issuing a series of dependent I/O requests, this approach can increase IOPS by over 2.5X and cut latency by half, by bypassing kernel layers and avoiding user-kernel boundary crossings. However, we must avoid losing important properties when bypassing the file system and block layer such as the safety guarantees of the file system and translation between physical blocks addresses and file offsets. We sketch potential solutions to these problems, inspired by exokernel file systems from the late 90s, whose time, we believe, has finally come! "As a dog returns to his vomit, so a fool repeats his folly." Attributed to King Solomon

11 citations

Journal ArticleDOI
TL;DR: The extended Berkeley Packet Filter (eBPF) as discussed by the authors is a lightweight and fast 64-bit RISC-like virtual machine (VM) inside the Linux kernel, which has received widespread adoption by both industry and academia for a wide range of application domains.
Abstract: The extended Berkeley Packet Filter (eBPF) is a lightweight and fast 64-bit RISC-like virtual machine (VM) inside the Linux kernel. eBPF has emerged as the most promising and de facto standard of executing untrusted, user-defined specialized code at run-time inside the kernel with strong performance, portability, flexibility, and safety guarantees. Due to these key benefits and availability of a rich ecosystem of compilers and tools within the Linux kernel, eBPF has received widespread adoption by both industry and academia for a wide range of application domains. The most important include enhancing performance of monitoring tools and providing a variety of new security mechanisms, data collection tools and data screening applications. In this review, we investigate the landscape of existing eBPF use-cases and trends with aim to provide a clear roadmap for researchers and developers. We first introduce the necessary background knowledge for eBPF before delving into its applications. Although, the potential use-cases of eBPF are vast, we restrict our focus on four key application domains related to networking, security, storage, and sandboxing. Then for each application domain, we analyze and summarize solution techniques along with their working principles in an effort to provide an insightful discussion that will enable researchers and practitioners to easily adopt eBPF into their designs. Finally, we delineate several exciting research avenues to fully exploit the revolutionary eBPF technology.

3 citations


References
Journal ArticleDOI
28 Jul 2014
TL;DR: This paper proposes P4 as a strawman proposal for how OpenFlow should evolve in the future, and describes how to use P4 to configure a switch to add a new hierarchical label.
Abstract: P4 is a high-level language for programming protocol-independent packet processors. P4 works in conjunction with SDN control protocols like OpenFlow. In its current form, OpenFlow explicitly specifies protocol headers on which it operates. This set has grown from 12 to 41 fields in a few years, increasing the complexity of the specification while still not providing the flexibility to add new headers. In this paper we propose P4 as a strawman proposal for how OpenFlow should evolve in the future. We have three goals: (1) Reconfigurability in the field: Programmers should be able to change the way switches process packets once they are deployed. (2) Protocol independence: Switches should not be tied to any specific network protocols. (3) Target independence: Programmers should be able to describe packet-processing functionality independently of the specifics of the underlying hardware. As an example, we describe how to use P4 to configure a switch to add a new hierarchical label.

2,214 citations


"Partition-Aware Packet Steering Usi..." refers methods in this paper

  • ...For example, applications can use the same application code implemented in C for request steering or run existing packet processors implemented in the P4 [3] programming language on XDP [38]....


Book ChapterDOI
01 Dec 2018
TL;DR: The current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact, excel at nothing and should be retired in favor of a collection of "from scratch" specialized engines.
Abstract: In previous papers [SC05, SBC+07], some of us predicted the end of "one size fits all" as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1--2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets. Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one kind of capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T., to a popular RDBMS on the standard transactional benchmark, TPC-C. We conclude that the current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of "from scratch" specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow's requirements, not continue to push code lines and architectures designed for yesterday's needs.

679 citations


"Partition-Aware Packet Steering Usi..." refers background in this paper

  • ...Application-level partitioning is one approach to parallelize request processing on multicore systems [23, 35, 37]....


Proceedings Article
Luigi Rizzo
13 Jun 2012
TL;DR: The novelty in the proposal is not only that it exceeds the performance of most of previous work, but also that it provides an architecture that is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain.
Abstract: Many applications (routers, traffic monitors, firewalls, etc.) need to send and receive packets at line rate even on very fast links. In this paper we present netmap, a novel framework that enables commodity operating systems to handle the millions of packets per seconds traversing 1..10 Gbit/s links, without requiring custom hardware or changes to applications. In building netmap, we identified and successfully reduced or removed three main packet processing costs: per-packet dynamic memory allocations, removed by preallocating resources; system call overheads, amortized over large batches; and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas. Separately, some of these techniques have been used in the past. The novelty in our proposal is not only that we exceed the performance of most of previouswork, but also that we provide an architecture that is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain. Netmap has been implemented in FreeBSD and Linux for several 1 and 10 Gbit/s network adapters. In our prototype, a single core running at 900 MHz can send or receive 14.88 Mpps (the peak packet rate on 10 Gbit/s links). This is more than 20 times faster than conventional APIs. Large speedups (5× and more) are also achieved on user-space Click and other packet forwarding applications using a libpcap emulation library running on top of netmap.

653 citations


"Partition-Aware Packet Steering Usi..." refers background in this paper

  • ...Traditional in-kernel network stacks are designed for flexibility, but are a bottleneck for network-intensive applications for two reasons: (1) they perform too much work per packet, and (2) their system call interface is too expensive [12, 16, 33, 38]....


  • ...Kernel-bypass networking has emerged as a solution to eliminate these overheads [12, 24, 33]....



Frequently Asked Questions (2)
Q1. What are the contributions in "Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism"?

The authors propose an approach that combines application-level partitioning and packet steering with a programmable NIC. They believe this approach can reduce latency and improve throughput because it utilizes multicore systems efficiently, and applications can improve their partitioning scheme without impacting clients.

The authors are currently working on a prototype in-memory KV store using this approach and are planning to compare its performance against previous works such as MICA [ 23 ].