Adding a size-based vector to libc++’s unstable ABI
tl;dr: We can significantly improve the runtime performance of std::vector by changing its representation from three pointers to one pointer and two integers. This document explains the details of this change, along with the justifications for making it.
Motivation
std::vector is a dynamically-sized array that grows as elements are added. Appending elements is required to have amortised constant-time complexity, so std::vector typically pre-allocates memory for n elements (called the vector’s capacity) and reallocates to a larger buffer when the number of elements exceeds the current capacity. For example, a vector of three elements might have room to insert a fourth element, but adding a fifth element would force a reallocation.
Figure 1: A visual representation of a vector growing. Each row shows the number of elements in the vector (Size), the number of elements it can store before needing to relocate the existing elements (Capacity), the memory address of each element (0x57d66d9132a0, etc.), and the element's value (0, 1, 2, etc.).
The first row shows the vector with three integers, with room for one more integer to be inserted. The second row shows the vector after a fourth integer is added; the elements’ location in memory remains unchanged from the first row. The third row shows a fifth element added, with room for a few more elements. Because the original buffer ran out of room for more elements after the fourth element was inserted, the vector reallocated the buffer to somewhere larger.
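To make the growth pattern concrete, here’s a small, self-contained sketch along the lines of Figure 1 (the exact addresses, growth factor, and initial capacity are implementation details and will vary):

```cpp
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> v;
  v.reserve(4);  // pretend the initial buffer holds four ints, as in Figure 1
  for (int i = 0; i < 5; ++i) {
    v.push_back(i);
    // Once the fifth element is appended, capacity() jumps and data() changes,
    // because the vector had to reallocate its buffer.
    std::printf("size=%zu capacity=%zu data=%p\n", v.size(), v.capacity(),
                static_cast<void*>(v.data()));
  }
}
```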
The current implementation of std::vector in libc++ achieves this by managing a triplet of pointers:
- one that points to the beginning of the buffer (begin_);
- one that points to where the next element should be inserted (end_);
- one that points to the end of the buffer (end_cap_).
Reallocation happens when end_ == end_cap_. We suspect that this representation was chosen because the C++ standard library is based on iterators, and optimising for iterator usage was likely a natural conclusion. Here’s how we use these pointers to compute some important member functions (see the sketch after the list):
- v.end() returns end_ directly;
- v.size() returns end_ - begin_;
- v.capacity() returns end_cap_ - begin_.
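As a rough sketch of this layout (member and accessor names simplified from libc++’s reserved-name spelling, and the allocator omitted):

```cpp
#include <cstddef>

// A minimal sketch of the pointer-based layout; not the real libc++ code.
template <class T>
struct pointer_based_vector {
  T* begin_;    // start of the buffer
  T* end_;      // one past the last constructed element
  T* end_cap_;  // one past the end of the allocated buffer

  T* end() const { return end_; }                             // a single load
  std::size_t size() const { return end_ - begin_; }          // load, load, subtract, divide by sizeof(T)
  std::size_t capacity() const { return end_cap_ - begin_; }  // load, load, subtract, divide by sizeof(T)
};
```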
Since size() and capacity() perform pointer subtraction, they end up being compiled to multiple instructions, relative to end().
Figure 2: The same image as in Figure 1, except that it now shows the current representation of std::vector. In this image, the size and capacity aren't immediately known, and need to be computed from begin_, end_, and end_cap_. The red box represents the sentinel value that end_cap_ points to.
Changing std::vector’s representation to directly track its size and capacity via integers is an opportunity for optimisation. We ran various load tests to contrast the performance of the pointer-based vector against the size-based vector, with and without libc++ hardening enabled. Our findings indicate that when libc++ hardening is enabled, the size-based vector offers a statistically significant edge over the pointer-based vector.
When libc++ hardening is enabled, std::vector’s read operations add calls to size() or empty() in order to validate that we have elements to read. Looking at our example from before, we see that the pointer-based vector needs to do more work on both x86_64 and ARMv8. To compute size(), the pointer-based vector needs to load two members from memory and then subtract them. Pointer subtraction requires additional operations to extract an element count, since we need to convert the difference in bytes into a number of elements by dividing by the size of the value type. This “conversion” is an integer division by a constant, implemented either as a right shift when the value type’s size is a power of two, or as an IDIV or IMUL operation when the size isn’t a power of two[1]. The size-based vector sidesteps this work by already having those values: we don’t need to do any work, except to compare them.
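As an illustration of that extra work, here’s roughly what a hardened bounds check amounts to under each representation. This is only a sketch: the member and function names are illustrative, and libc++’s actual checks go through its hardening assertion macros.

```cpp
#include <cstddef>
#include <cstdlib>

// Pointer-based: the bounds check has to recompute the size first.
template <class T>
T& checked_access_pointer_based(T* begin_, T* end_, std::size_t n) {
  // load begin_ and end_, subtract, divide by sizeof(T), then compare
  if (n >= static_cast<std::size_t>(end_ - begin_)) std::abort();
  return begin_[n];
}

// Size-based: the bounds check is a load and a compare.
template <class T>
T& checked_access_size_based(T* begin_, std::size_t size_, std::size_t n) {
  if (n >= size_) std::abort();
  return begin_[n];
}
```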
We’d like to upstream this work. In the refactored implementation (sketched after Figure 3):
- std::vector only uses one pointer, begin_, to track where the buffer is in memory;
- std::vector uses an integer, size_, to track how many elements are in the vector;
- std::vector uses an integer, capacity_, to track the total number of elements the vector can contain;
- reallocation happens when size_ == capacity_;
- v.size() directly returns size_;
- v.capacity() directly returns capacity_;
- v.end() is computed by begin_ + size_.
Figure 3: The same image as in Figure 1, except that it now shows the new representation of std::vector. In this image, the size and capacity are immediately known, but we need to compute where the next element is to be inserted.
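And the corresponding sketch of the size-based representation (again with simplified names; the actual patch preserves libc++’s reserved naming and allocator handling):

```cpp
#include <cstddef>

// A minimal sketch of the size-based layout; not the real libc++ code.
template <class T>
struct size_based_vector {
  T* begin_;              // start of the buffer
  std::size_t size_;      // number of constructed elements
  std::size_t capacity_;  // total number of elements the buffer can hold

  std::size_t size() const { return size_; }          // a single load
  std::size_t capacity() const { return capacity_; }  // a single load
  T* end() const { return begin_ + size_; }           // two loads, multiply by sizeof(T), add
};
```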
Simplifying calls to v.size() requires v.end() to take on some of its cost. end() now loads two data members and offsets the pointer by an integer. It’s worth reiterating that pointer arithmetic always includes an implicit multiplicative operation: in the worst case, a pointer subtraction needs to be divided by the size of the pointee’s type, whereas a pointer offset only needs to be multiplied by the pointee’s size. This ends up yielding nicer codegen in many cases, independently of libc++ hardening, as outlined in the table below.
| | sizeof(value_type) is a power of two | sizeof(value_type) is not a power of two |
|---|---|---|
| x86_64 -O0 | Codegen for pointer-based vector’s size() and size-based vector’s end() are equivalent. | Codegen for size-based vector’s end() is better than pointer-based vector’s size(). |
| x86_64 -O3 | Codegen for pointer-based vector’s size() and size-based vector’s end() are equivalent. | Codegen for size-based vector’s end() is better than pointer-based vector’s size(). |
| armv8-a -O0 | Size-based vector’s end() codegen is better than pointer-based vector’s size(). | Unclear if the codegen is equivalent, or if size-based vector’s end() is better than pointer-based vector’s size(). |
| armv8-a -O3 | Size-based vector’s end() codegen is better than pointer-based vector’s size(). | Codegen for size-based vector’s end() is better than pointer-based vector’s size(). |
In general, the codegen in the power-of-two cases is the same because both multiplication and division by powers of two can be implemented as bit-shifts. It’s important to remember that, unless the pointee’s size is only one byte, pointer arithmetic always involves a multiplicative operation. The difference for the optimised armv8-a code is that the bit-shift can be folded into the add instruction needed for end() as an operand modifier, whereas in the code for size() the shift has to happen after the subtraction.

In the non-power-of-two cases, the differences arise because the pointer-based vector’s size() requires division by a constant where the size-based vector’s end() requires multiplication, and integer division is typically more expensive than integer multiplication. Even though the optimiser can implement the pointer-based vector’s division with a less-expensive integer multiplication instruction, that multiplication is by a large magic number that can’t be optimised further. By contrast, the optimiser can often replace the size-based vector’s multiplication with a faster sequence of add and shift instructions.
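As a concrete (hypothetical) example, consider a 12-byte element type. The pointer-based size() has to divide a byte difference by 12, which the optimiser lowers to a multiply by a large magic constant followed by shifts, while the size-based end() multiplies an element count by 12, which can be lowered to two shifts and an add, since 12 * n == (n << 3) + (n << 2):

```cpp
#include <cstddef>
#include <cstdint>

struct Elem { std::int32_t a, b, c; };  // a hypothetical 12-byte element type
static_assert(sizeof(Elem) == 12, "illustration assumes no padding");

// Pointer-based size(): divide a byte difference by 12.
// The optimiser lowers this to a multiply by a large magic constant plus shifts.
std::size_t size_pointer_based(Elem* begin_, Elem* end_) {
  return static_cast<std::size_t>(end_ - begin_);
}

// Size-based end(): multiply an element count by 12.
// 12 * n == (n << 3) + (n << 2), so this can become two shifts and an add.
Elem* end_size_based(Elem* begin_, std::size_t size_) {
  return begin_ + size_;
}
```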
Almost all cases thus seem to favour the size-based vector’s codegen, and those that don’t end up with equivalent codegen. As the size of a value type grows, it becomes less likely that its size is a power of two. Since our load tests cover a broad spectrum of code, and our observations only indicate either a positive or neutral impact on run-time, we conclude that this trade-off is a net positive even when end() is frequently called.
Performance improvements
Microbenchmarks are unable to provide a reliable signal for this change: the optimisations that are applied are usually local, and the results can be highly variable. A change like this requires us to macrobenchmark either software in production, or software that’s representative of production software (e.g. load tests). Our load tests have been designed with this in mind, and so we ran as many as possible.
In summary, most of our load tests indicated at least a modest but statistically significant improvement. The remainder indicated no change to performance. Some of our servers will likely enjoy a 0.2–0.5% increase in queries per second. We also observed that one of our most important servers will have a 0.12% reduction in operating costs. Although modest, this is particularly noteworthy, because getting any sort of above-the-noise signal from this particular load test is exceedingly rare. We also noticed statistically significant improvements to some of our database software.
Finally, we’re preparing to benchmark our software in production, but this will take a few weeks to spin up, and then a few more weeks to gather the data. We don’t believe that this should block a discussion on whether or not to proceed.
Implementation
Changing vector’s representation also requires changing __split_buffer’s representation. We’ve currently deployed our size-based vector as a standalone implementation, alongside the upstream implementation of std::vector. Maintaining two copies of vector and __split_buffer is not ideal, so we’re in the process of unifying the changes into the upstream vector and __split_buffer in a relatively non-intrusive way.
We’ve finished the initial changes to __split_buffer, which can be found in PR#139632. The changes to vector are very similar to what’s proposed for __split_buffer, so we’ll follow up with the changes to vector once __split_buffer is good to be merged.
Deployment challenges
This change breaks ABI; we need to deploy it behind a macro such as _LIBCPP_ABI_SIZE_BASED_VECTOR. We’ve tested this on over a billion lines of code, including a significant amount of third-party code and Chromium, and surprisingly little of it fell victim to Hyrum’s law. Even more surprising: only one third-party project (OpenCV) ended up being adversely impacted by the change in representation.
Example: OpenCV
An example of third-party code falling victim to our changes is the computer vision library, OpenCV. Across its five thousand API functions, OpenCV has three array abstractions to support and type-erase array and array-like types from the C++ standard library, OpenCV, and Eigen:
- _InputArray, for read-only access;
- _OutputArray, for write-only access;
- _InputOutputArray, for read-write access.

For simplicity, we’ll collectively refer to this trio as the Array types. The Array types have two fields: obj and flags.
- obj stores the address of the abstracted array/array-like type as a blob of memory, without any type information.
- flags stores some of the type information, which is used to reinterpret_cast obj back to something from a predefined set of types.
Since std::vector is a class template, it is infeasible for the Array types to track all of its possible value types. Most vector types end up being converted to vector<unsigned char> when vector operations are required. This violates the strict aliasing rule, but since that’s IFNDR, it’s rarely caught. As such, things “just worked” for OpenCV 4.x, up until now.
Since the current implementation of vector relies on pointer arithmetic to compute the number of elements (__end_ - __begin_) and its capacity (__cap_ - __begin_), the OpenCV code gets away with pretending that a vector with 100 ints is actually a vector of 400 bytes: the parts of the implementation that make this data structure work are still intact, and thus the apparent behaviour of the code is correct, despite using the wrong type. Several key operations break when we change to our size-based vector, because the vector now directly tracks its size and capacity. For example, the end of the allocated buffer is now computed by multiplying the size by the element type’s size, which becomes incorrect when using the wrong element type. TCMalloc flushed this out by reporting that we were leaking most of the memory that these reinterpreted vectors held during resizing (you end up leaking 300 bytes in the hypothetical case above, because vector<unsigned char> now thinks that it only holds 100 chars).
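Here’s a contrived sketch of the pattern (not OpenCV’s actual code) that shows why the reinterpretation stops working once the vector tracks its size directly:

```cpp
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> ints(100);  // 100 ints, i.e. 400 bytes of element storage

  // The OpenCV-style trick: pretend the vector holds raw bytes.
  // This violates strict aliasing and is undefined behaviour.
  auto& bytes = *reinterpret_cast<std::vector<unsigned char>*>(&ints);

  // Pointer-based vector: size() is end_ - begin_, i.e. 400 for the byte view,
  // so the trick appears to work. Size-based vector: size() still reports the
  // stored size_ of 100, so 300 bytes of the buffer are effectively lost.
  std::printf("reinterpreted size: %zu\n", bytes.size());
}
```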
While OpenCV 4.x can’t use a size-based vector due to ABI commitments to their users, we’re proposing changes to the Array types for OpenCV 5.x, which is allowed to take an ABI break.
Other potential deployment issues
While OpenCV was the only third-party deployment issue that we encountered, we observed similar issues in first-party code. Additionally, we learnt that this optimisation has the potential to flush out ODR issues that might slip by when using the pointer-based implementation.
Suggestions for diagnosing failures
Memory allocator, GWP-ASan failures, and most segfaults
It’s best to start by turning on -fsanitize=address,undefined and debug symbols. This usually provides a starting point for debugging your issues. To fix this problem, you’ll need to refactor your code to eliminate invalid casts.
The OpenCV issue is an example of this class of problem.
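The shape of the fix is to replace the reinterpreting cast with a genuine byte view of the element storage. A minimal sketch of that kind of refactor (assuming C++20’s std::span is available; the function name is ours):

```cpp
#include <cstddef>
#include <span>
#include <vector>

// Instead of reinterpreting a std::vector<int> as a std::vector<unsigned char>,
// hand out a byte view over the same storage. No aliasing rules are violated.
std::span<const std::byte> as_byte_view(const std::vector<int>& v) {
  return std::as_bytes(std::span<const int>(v.data(), v.size()));
}
```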
Weird ASan issues
If AddressSanitizer reports that a type’s destructor is attempting to destroy a field that isn’t in your type, then you’re probably running afoul of the One Definition Rule. To fix this problem, rename any symbols that have the same name but different definitions (or scope them appropriately). We only observed this problem once, but it took a long time to identify.
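For illustration, the kind of ODR violation meant here looks something like the following, shown as two separate translation units in one listing (hypothetical type names; in practice the colliding definitions usually live in different libraries linked into the same binary):

```cpp
// translation_unit_a.cpp
#include <vector>
struct Config {  // one definition of Config
  std::vector<int> values;
};

// translation_unit_b.cpp, linked into the same program
#include <string>
#include <vector>
struct Config {  // a different type with the same name: an ODR violation
  std::string name;
  std::vector<int> values;
};

// Depending on which definition wins at link time, Config's destructor may try
// to destroy a std::string field that the other translation unit never
// constructed, which is what the "destroying a field that isn't in your type"
// report looks like.
```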
Segmentation faults
We know of at least two reasons that might cause a segfault:
- You have a GWP-ASan issue. See above.
- You are linking objects that are built with multiple toolchains, only some of which are using a size-based std::vector representation, with the result that you’re passing data from one representation of std::vector to a different representation (a build-time guard along the lines of the sketch below can help catch this).
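One way to catch the second case early is a build-time guard in a shared header. This is only a sketch, and it assumes the feature ships behind the _LIBCPP_ABI_SIZE_BASED_VECTOR macro mentioned above, visible once any libc++ header has been included:

```cpp
// Hypothetical guard, included by every target that must agree on the ABI.
#include <vector>  // any libc++ header pulls in the configured ABI macros

#ifndef _LIBCPP_ABI_SIZE_BASED_VECTOR
#  error "This target must be built with a size-based std::vector ABI"
#endif
```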
Potential follow-up work
A size-based string
std::basic_string has a similar representation to std::vector. It may be beneficial for us to also consider a size-based string. However, it’s important to acknowledge that the two types have different uses, which means we should redo the benchmarks before committing to such a large endeavour. If the research pays off, then we may want to entertain size-based strings as well.
A size-based __wrap_iter
__wrap_iter is a pointer-based iterator for data structures with a contiguous memory layout. It may be beneficial for us to consider a size-based __wrap_iter for the same reasons. Unsurprisingly, we should first redo the benchmarks before committing to this. The original implementation of our size-based vector also included changes to __wrap_iter, but we deemed those orthogonal to the change in vector’s representation.
Notes
[1] The specific operation is based on your build flags: -O0 chooses a division operation; -O3 chooses a multiplication operation.