Commit 97abb2b

anakryiko authored and Alexei Starovoitov committed

docs/bpf: Add BPF ring buffer design notes

Add commit description from patch #1 as stand-alone documentation under
Documentation/bpf, as it might be a more convenient format in the long-term
perspective.

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

1 parent c97099b commit 97abb2b

File tree

1 file changed: +209 −0 lines changed

Documentation/bpf/ringbuf.rst
@@ -0,0 +1,209 @@
===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
    :local:
    :depth: 2

Motivation
----------

There are two distinct motivations for this work, neither of which is
satisfied by the existing perf buffer, that prompted the creation of a new
ring buffer implementation:

- more efficient memory utilization by sharing a ring buffer across CPUs;
- preserving the ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer fails to satisfy both.
Both are a consequence of the choice to have a per-CPU perf ring buffer, and
both can be solved by a multi-producer, single-consumer (MPSC) ring buffer
implementation. The ordering problem could technically be solved for perf
buffer with some in-kernel counting, but given that the first problem requires
an MPSC buffer anyway, the same solution solves the second problem
automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.

One way would have been to, similarly to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``,
make ``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but not
enforce the "same CPU only" rule. This would have been a more familiar
interface, compatible with the existing perf buffer use in BPF, but it would
fail if an application needed more advanced logic to look up a ring buffer by
an arbitrary key. ``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with the
current approach. Additionally, given the performance of BPF ringbuf, many use
cases would just opt into a simple single ring buffer shared among all CPUs,
for which the current approach would be overkill.

Another approach could have introduced a new concept, alongside BPF map, to
represent a generic "container" object, which doesn't necessarily have the
key/value interface with lookup/update/delete operations. This approach would
add a lot of extra infrastructure that would have to be built for
observability and verifier support. It would also add another concept for BPF
developers to familiarize themselves with, new syntax in libbpf, etc., while
providing no additional benefits over the approach of using a map.
``BPF_MAP_TYPE_RINGBUF`` doesn't support lookup/update/delete operations, but
then neither do a few other map types (e.g., queue and stack; array doesn't
support delete, etc.).

The approach chosen has the advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc.), being
a familiar concept (no need to teach users a new type of object in a BPF
program), and utilizing existing tooling (bpftool). For the common scenario of
using a single ring buffer for all CPUs, it's as simple and straightforward as
it would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS``
map-in-maps to implement a wide variety of topologies, from one ring buffer
for each CPU (e.g., as a replacement for perf buffer use cases), to
a complicated application hashing/sharding of ring buffers (e.g., having
a small pool of ring buffers with a hash of the task's tgid as the lookup key,
to preserve order while reducing contention).

Key and value sizes are enforced to be zero. ``max_entries`` is used to
specify the size of the ring buffer and has to be a power of 2.

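As an illustration, a ring buffer map can be declared from the BPF side using
libbpf's map definition syntax. This is a sketch: the map name ``rb`` and the
256 KB size are arbitrary choices, and building it requires the kernel and
libbpf headers (``linux/bpf.h``, ``bpf_helpers.h``).

```c
/* Hypothetical BPF-side map declaration. max_entries is the data area
 * size in bytes and must be a power of 2 (in practice also a multiple
 * of the kernel page size). Key and value sizes stay zero. */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);  /* 256 KB, a power of 2 */
} rb SEC(".maps");
```
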
There are a number of similarities between perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and the new BPF ring buffer semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails and
  never blocks;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to do busy polling for new data to achieve the
  lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows one to *copy* data from some other place
  into the ring buffer, similarly to ``bpf_perf_event_output()``;
- the ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()``
  APIs split the whole process into two steps. First, a fixed amount of space
  is reserved. If successful, a pointer to data inside the ring buffer data
  area is returned, which BPF programs can use similarly to data inside
  array/hash maps. Once ready, this piece of memory is either committed or
  discarded. Discard is similar to commit, but makes the consumer ignore the
  record.

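As a sketch of the two-step flow, a BPF program might reserve, fill, and
submit a record like the following. This assumes a ring buffer map named
``rb`` and an application-defined ``struct event`` (both hypothetical), and
requires kernel/libbpf headers to build; in the libbpf helper set, the commit
step is exposed as ``bpf_ringbuf_submit()``.

```c
/* Illustrative sketch only; the tracepoint, map name, and event layout
 * are arbitrary choices for this example. */
struct event {
	int pid;
	char comm[16];
};

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx)
{
	struct event *e;

	/* Step 1: reserve a fixed-size record; returns NULL if the ring
	 * buffer is full (reservation never blocks). */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	/* Fill the record in place: no extra copy, no per-CPU scratch heap. */
	e->pid = bpf_get_current_pid_tgid() >> 32;
	bpf_get_current_comm(e->comm, sizeof(e->comm));

	/* Step 2: commit (or bpf_ringbuf_discard(e, 0) to drop it). */
	bpf_ringbuf_submit(e, 0);
	return 0;
}
```
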
``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared in some other place first. But it
allows submitting records of a length that isn't known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, so it will
simplify migration significantly.

``bpf_ringbuf_reserve()`` avoids the extra memory copy by providing a pointer
directly into ring buffer memory. In a lot of cases records are larger than
BPF stack space allows, so many programs have had to use an extra per-CPU
array as a temporary heap for preparing a sample. ``bpf_ringbuf_reserve()``
avoids this need completely. But in exchange, it only allows a known, constant
size of memory to be reserved, so that the verifier can check that the BPF
program can't access memory outside its reserved record space.
``bpf_ringbuf_output()``, while slightly slower due to the extra memory copy,
covers some use cases that are not suitable for ``bpf_ringbuf_reserve()``.

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within a single BPF program invocation.

Each reserved record is tracked by the verifier through the existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record but forget to submit (or discard) it.

The ``bpf_ringbuf_query()`` helper allows querying various properties of the
ring buffer. Currently 4 are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.

The returned values are momentary snapshots of the ring buffer state and
could be off by the time the helper returns, so this should be used only for
debugging/reporting purposes or for implementing heuristics that take into
account the highly changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with
the ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for
output/commit/discard helpers, it allows the BPF program a high degree of
control and, e.g., more efficient batched notifications. The default
self-balancing strategy, though, should be adequate for most applications and
will already work reliably and efficiently.

Design and Implementation
-------------------------

This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, both will get a record reserved (provided there is enough
space left) and can work with it and submit it independently. This applies to
NMI context as well, except that, due to the use of a spinlock during
reservation, in NMI context ``bpf_ringbuf_reserve()`` might fail to get the
lock, in which case the reservation will fail even if the ring buffer is not
full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, but that's not a problem):

- the consumer counter shows up to which logical position the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, the data is not yet
ready to be consumed, though. Each record has an 8-byte header, which contains
the length of the reserved record, as well as two extra bits: a busy bit to
denote that the record is still being worked on, and a discard bit, which
might be set at commit time if the record is discarded. In the latter case,
the consumer is supposed to skip the record and move on to the next one. The
record header also encodes the record's relative offset from the beginning of
the ring buffer data area (in pages). This allows
``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the pointer
to the record itself, without also requiring a pointer to the ring buffer
itself. The ring buffer memory location will be restored from the record
metadata header. This significantly simplifies the verifier, as well as
improving API usability.

Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
have already been committed. It is thus possible for slow producers to
temporarily hold off already-submitted records that were reserved later.

The reservation/commit/consumer protocol is verified by litmus tests in
Documentation/litmus_tests/bpf-rb/_.

One interesting implementation bit, which significantly simplifies (and thus
speeds up) the implementation of both producers and consumers, is that the
data area is mapped twice, contiguously back-to-back, in virtual memory. This
makes it possible to avoid any special measures for samples that have to wrap
around at the end of the circular buffer data area, because the next page
after the last data page is the first data page again, and thus the sample
will still appear completely contiguous in virtual memory. See the comment and
a simple ASCII diagram showing this visually in ``bpf_ringbuf_area_alloc()``.

Another feature that distinguishes BPF ringbuf from perf ring buffer is
self-pacing notifications of new data being available. The
``bpf_ringbuf_commit()`` implementation will send a notification of a new
record being available after commit only if the consumer has already caught up
right up to the record being committed. If not, the consumer still has to
catch up and thus will see the new data anyway, without needing an extra poll
notification. Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c_)
show that this allows achieving very high throughput without having to resort
to tricks like "notify only every Nth sample", which are necessary with perf
buffer. For extreme cases, when a BPF program wants more manual control of
notifications, the commit/discard/output helpers accept ``BPF_RB_NO_WAKEUP``
and ``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications
of data availability, but require extra caution and diligence in using this
API.

