Adding a size-based vector to libc++’s unstable ABI

tl;dr We can significantly improve the runtime performance of std::vector by changing its representation from three pointers to one pointer and two integers. This document explains the details of this change, along with the justifications for making it.

Motivation

std::vector is a dynamically-sized array that grows as elements are added. Appending elements is required to have amortised constant-time complexity, so std::vector typically pre-allocates memory for n elements (called the vector’s capacity), and reallocates to a larger buffer when the number of elements in the vector exceeds its current capacity. For example, a vector of three elements might be able to insert a fourth element, but adding a fifth element would force a reallocation.

Figure 1: A visual representation of a vector growing. Each row shows the number of elements in the vector (Size), the number of elements it can store before needing to relocate the existing elements (Capacity), the memory address of each element (0x57d66d9132a0, etc.), and the element's value (0, 1, 2, etc.).

The first row shows the vector with three integers and room for one more. The second row shows the vector after a fourth integer is added; the elements’ locations in memory remain unchanged from the first row. The third row shows a fifth element added, with room for a few more elements: because the original buffer ran out of room once the fourth element was inserted, the vector relocated its elements to a larger buffer.
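For concreteness, here’s a minimal program that traces this growth (the growth factor is implementation-defined; libc++ currently doubles capacity, so a typical capacity trace is 1, 2, 4, 4, 8):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 5; ++i) {
        v.push_back(i);
        // data() changes whenever a push_back forces a reallocation.
        std::printf("size=%zu capacity=%zu data=%p\n",
                    v.size(), v.capacity(), static_cast<void*>(v.data()));
    }
}
```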

The current implementation of std::vector in libc++ achieves this by managing a triplet of pointers:

  • one that points to the beginning of the buffer (begin_);
  • one that points to where the next element should be inserted (end_);
  • one that points to the end of the buffer (end_cap_).

Reallocation happens when end_ == end_cap_. We suspect that this representation was chosen because the C++ standard library is based on iterators, and optimising for iterator usage was a natural conclusion. Here’s how these pointers are used to implement some important member functions:

  • v.end() returns end_ directly;
  • v.size() returns end_ - begin_;
  • v.capacity() returns end_cap_ - begin_.

Since size() and capacity() perform pointer subtraction, they compile to multiple instructions, whereas end() simply returns a member.
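As a simplified model of this layout (libc++’s actual code also stores the allocator, via a compressed pair, but that’s immaterial here):

```cpp
#include <cstddef>

// A simplified model of the current pointer-based representation.
template <class T>
struct pointer_based_vector {
    T* begin_;    // start of the buffer
    T* end_;      // one past the last element
    T* end_cap_;  // one past the end of the buffer

    T* end() const { return end_; }                             // single load
    std::size_t size() const { return end_ - begin_; }          // loads, subtract, divide
    std::size_t capacity() const { return end_cap_ - begin_; }  // loads, subtract, divide
};
```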

Figure 2: The same image as in Figure 1, except that it now shows the current representation of std::vector. In this image, the size and capacity aren't immediately known, and need to be computed from begin_, end_, and end_cap_. The red box represents the sentinel value that end_cap_ points to.

Changing std::vector’s representation to directly track its size and capacity via integers is an opportunity for optimisation. We ran various load tests to contrast the performance of the pointer-based vector against the size-based vector, with and without libc++ hardening enabled. Our findings indicate that when libc++ hardening is enabled, the size-based vector offers a statistically significant edge over the pointer-based vector.

When libc++ hardening is enabled, std::vector’s read operations add calls to size() or empty() in order to validate that we have elements to read. Looking at our example from before, we see that the pointer-based vector needs to do more work on both x86_64 and ARMv8. To compute size(), the pointer-based vector must load two members from memory and subtract them. Pointer subtraction requires additional operations to produce an element count, since the difference in bytes has to be divided by the size of the value type. This division by a constant is implemented as either a right shift, when the value type’s size is a power of two, or an IDIV or IMUL operation when it isn’t[1]. The size-based vector sidesteps this work entirely: both values are already stored, so the hardening check reduces to a load and a comparison.
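To illustrate, here’s a rough model of a hardened element access (the helper name hardened_index is ours, and the plain assert stands in for libc++’s internal hardening macros):

```cpp
#include <cassert>
#include <cstddef>

// A rough model of operator[] with hardening enabled. With the
// pointer-based layout, v.size() costs two loads, a subtraction, and a
// division; with the size-based layout it's a single load, so the whole
// check becomes a load and a compare.
template <class Vec>
auto& hardened_index(Vec& v, std::size_t n) {
    assert(n < v.size() && "vector[] index out of bounds");
    return v.data()[n];
}
```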

We’d like to upstream this work. In the refactored implementation (a sketch, mirroring the model above, follows Figure 3):

  • std::vector only uses one pointer, begin_, to track where the buffer is in memory;
  • std::vector uses an integer, size_, to track how many elements are in the vector;
  • std::vector uses an integer, capacity_, to track the total number of elements a vector can contain;
  • Reallocation happens when size_ == capacity_;
  • v.size() directly returns size_;
  • v.capacity() directly returns capacity_;
  • v.end() is computed by begin_ + size_.

Figure 3: The same image as in Figure 1, except that it now shows the new representation of std::vector. In this image, the size and capacity are immediately known, but we need to compute where the next element is to be inserted.
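And the corresponding model of the size-based representation:

```cpp
#include <cstddef>

// A simplified model of the proposed size-based representation.
template <class T>
struct size_based_vector {
    T* begin_;              // start of the buffer
    std::size_t size_;      // number of elements currently in the vector
    std::size_t capacity_;  // total number of elements the buffer can hold

    std::size_t size() const { return size_; }          // single load
    std::size_t capacity() const { return capacity_; }  // single load
    T* end() const { return begin_ + size_; }           // loads, multiply, add
};
```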

Simplifying calls to v.size() requires v.end() to take on some of its cost: end() now loads two data members and offsets the pointer by an integer. It’s worth reiterating that pointer arithmetic always includes an implicit multiplicative operation. In the worst case, a pointer subtraction needs to be divided by the size of the pointee’s type; in contrast, a pointer offset only needs to be multiplied by it. This yields nicer codegen in many cases, independently of libc++ hardening, as outlined in the table below.
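In byte terms, the two directions of the conversion look like this (an illustrative model; the function names are ours):

```cpp
#include <cstdint>

// What pointer-based size() lowers to: subtract, then divide.
template <class T>
std::uintptr_t elements_from_pointers(std::uintptr_t begin_bytes,
                                      std::uintptr_t end_bytes) {
    return (end_bytes - begin_bytes) / sizeof(T);
}

// What size-based end() lowers to: multiply, then add.
template <class T>
std::uintptr_t end_from_size(std::uintptr_t begin_bytes, std::uintptr_t size) {
    return begin_bytes + size * sizeof(T);
}
```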

| | sizeof(value_type) is a power of two (Compiler Explorer example) | sizeof(value_type) is not a power of two (Compiler Explorer example) |
| --- | --- | --- |
| x86_64 -O0 | Codegen for pointer-based vector’s size() and size-based vector’s end() are equivalent. | Codegen for size-based vector’s end() is better than pointer-based vector’s size(). |
| x86_64 -O3 | Codegen for pointer-based vector’s size() and size-based vector’s end() are equivalent. | Codegen for size-based vector’s end() is better than pointer-based vector’s size(). |
| armv8-a -O0 | Size-based vector’s end() codegen is better than pointer-based vector’s size(). | Unclear whether the codegen is equivalent, or if size-based vector’s end() is better than pointer-based vector’s size(). |
| armv8-a -O3 | Size-based vector’s end() codegen is better than pointer-based vector’s size(). | Codegen for size-based vector’s end() is better than pointer-based vector’s size(). |

In general, the codegen in the power-of-two cases is the same because both multiplication and division by powers of two can be implemented as bit-shifts. It’s important to remember that, unless the pointee’s size is only one byte, pointer arithmetic always involves a multiplicative operation. The difference for the optimised armv8-a code is because the bit-shift can be included as an input-operand modifier in the add instruction needed for end(), but in the code for size() the shift has to happen after the subtraction.
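For example, with an eight-byte value type, both directions reduce to a shift by three; the armv8-a difference comes from where that shift can be folded (a sketch):

```cpp
#include <cstdint>

// sizeof(value_type) == 8: both conversions are a single shift.
std::uint64_t bytes_to_count(std::uint64_t byte_diff) { return byte_diff >> 3; }  // / 8
std::uint64_t count_to_bytes(std::uint64_t count) { return count << 3; }          // * 8
```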

In the non-power-of-two cases, the differences arise because the pointer-based vector’s size() requires division by a constant where the size-based vector’s end() requires multiplication, and integer division is typically more expensive than integer multiplication. Even though the optimiser can implement the pointer-based vector’s division with a less-expensive integer multiplication instruction, that multiplication is by a large magic number that can’t be further optimised. By contrast, the optimiser can often replace the size-based vector’s multiplication with a faster sequence of add and shift instructions.
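Taking a hypothetical 12-byte value type as an illustration:

```cpp
#include <cstdint>

// Pointer-based size(): optimisers replace this division with a shift plus
// a multiply-high by a "magic" constant, which can't be simplified further.
std::uint64_t bytes_to_count(std::uint64_t byte_diff) { return byte_diff / 12; }

// Size-based end(): optimisers can decompose the multiplication into
// shifts and adds, since n * 12 == (n << 3) + (n << 2).
std::uint64_t count_to_bytes(std::uint64_t count) { return count * 12; }
```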

Almost all cases thus seem to favour the size-based vector’s codegen, and those that don’t end up with equivalent codegen. As the size of a value type grows, it becomes less likely that its size is a power of two. Since our load tests cover a broad spectrum of code, and our observations only indicate either a positive or neutral impact on run-time, we conclude that this trade-off is a net-positive even when end() is frequently called.

Performance improvements

Microbenchmarks are unable to provide a reliable signal for this change. The optimisations that are applied are usually local, and the results can be highly variable. A change like this requires us to either macrobenchmark software in production, or software that’s representative of production software (e.g. load tests). Our load tests have been designed with this in mind, and so we ran as many as possible.

In summary, most of our load tests indicated at least a modest but statistically significant improvement; the remainder indicated no change to performance. Some of our servers will likely enjoy a 0.2–0.5% increase in queries per second. We also observed that one of our most important servers will have a 0.12% reduction in operating costs. Although modest, this is particularly noteworthy, because getting any sort of above-the-noise signal from this particular load test is exceedingly rare. We also noticed statistically significant improvements to some of our database software.

Finally, we’re preparing to benchmark our software in production, but this will take a few weeks to spin up, and then a few more weeks to gather the data. We don’t believe that this should block a discussion on whether or not to proceed.

Implementation

Changing vector’s representation also requires changing __split_buffer’s representation. We’ve currently deployed our size-based vector as a standalone implementation, alongside the upstream implementation of std::vector. Maintaining two copies of vector and __split_buffer is not ideal, so we’re in the process of unifying the changes into the upstream vector and __split_buffer in a relatively non-intrusive way.

We’ve finished the initial changes to __split_buffer, which can be found in PR#139632. The changes to vector are very similar to what’s proposed for __split_buffer, so we’ll follow up with them once __split_buffer is good to be merged.

Deployment challenges

This change breaks ABI; we need to deploy it behind a macro such as _LIBCPP_ABI_SIZE_BASED_VECTOR. We’ve tested this on over a billion lines of code, including a significant amount of third-party code and Chromium, and surprisingly little was subjected to Hyrum’s law. Even more surprising: only one third-party project (OpenCV) ended up being adversely impacted by the change in representation.

Example: OpenCV

An example of third-party code falling victim to our changes is the computer vision library, OpenCV. Across its five thousand API functions, OpenCV has three array abstractions to support and type-erase array and array-like types from the C++ standard library, OpenCV, and Eigen:

  • _InputArray, for read-only access;
  • _OutputArray, for write-only access;
  • _InputOutputArray, for read-write access.

For simplicity, we’ll collectively refer to this trio as the Array types. The Array types have two fields: obj and flags.

  • obj stores the address of the abstracted array/array-like type as a blob of memory, without any type information.
  • flags stores some of the type information, which is used to reinterpret_cast obj back to one of a predefined set of types.

Since std::vector is a class template, it is infeasible for the Array types to track all of its possible value types. Most vector types end up being converted to vector<unsigned char> when vector operations are required. This violates the strict aliasing rule, but since such violations are undefined behaviour that compilers rarely diagnose, they’re rarely caught. As such, things “just worked” for OpenCV 4.x, up until now.

Since the current implementation of vector relies on pointer arithmetic to compute the number of elements (__end_ - __begin_) and its capacity (__cap_ - __begin_), the OpenCV code gets away with pretending that a vector of 100 ints is actually a vector of 400 bytes: the parts of the implementation that make this data structure work are still intact, and thus the apparent behaviour of the code is correct, despite using the wrong type. Several key operations broke when we changed to our size-based vector, because the vector now directly tracks its size and capacity. For example, the end of the allocated buffer is now computed by multiplying the size by the element type’s size, which becomes incorrect when using the wrong element type. TCMalloc flushed this out by reporting that we were leaking most of the memory that these reinterpreted vectors held during resizing (you end up leaking 300 bytes in the hypothetical case above, because vector<unsigned char> now thinks that it only holds 100 chars).
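A distilled sketch of the pattern (not actual OpenCV code; this is undefined behaviour and only illustrates why the old layout happened to tolerate it):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v(100);

    // UB: pretend a vector of 100 ints is a vector of bytes.
    auto& bytes = *reinterpret_cast<std::vector<unsigned char>*>(&v);

    // Pointer-based layout: size() recomputes (end_ - begin_) in char
    // units and happens to report 400, so byte-level code "just works".
    // Size-based layout: size_ still holds 100, so the reinterpreted
    // vector believes it owns only 100 bytes; growing it copies and frees
    // the wrong amounts, leaking the other 300 bytes.
    std::printf("%zu\n", bytes.size());
}
```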

While OpenCV 4.x can’t use a size-based vector due to ABI commitments to their users, we’re proposing changes to the Array types for OpenCV 5.x, which is allowed to take an ABI break.

Other potential deployment issues

While OpenCV was the only third-party deployment issue that we encountered, we observed similar issues in first-party code. Additionally, we learnt that this optimisation has the potential to flush out ODR issues that might slip by when using the pointer-based implementation.

Suggestions for diagnosing failures

Memory allocator failures, GWP-ASan failures, and most segfaults

It’s best to start by turning on -fsanitize=address,undefined and debug symbols. This usually provides a starting point for debugging your issues. To fix this problem, you’ll need to refactor your code to eliminate invalid casts.

The OpenCV issue is an example of this class of problem.

Weird ASan issues

If AddressSanitizer reports that a type’s destructor is attempting to destroy a field that isn’t in your type, then it means that you’re probably running afoul of the One Definition Rule. To fix this problem, rename any symbols that have the same name, but different definitions (or scope them appropriately). We only observed this problem once, but it took a long time to identify.

Segmentation faults

We know of at least two reasons that might cause a segfault:

  • You have a GWP-ASan issue. See above.
  • You are linking objects that are built with multiple toolchains, only some of which are using a size-based std::vector representation, with the result that you’re passing data from one representation of std::vector to a different representation.

Potential follow-up work

A size-based string

std::basic_string has a similar representation to std::vector, so it may be beneficial to consider a size-based string as well. However, the two types have different usage patterns, which means we should redo the benchmarks before committing to such a large endeavour.

A size-based __wrap_iter

__wrap_iter is a pointer-based iterator for data structures with a contiguous memory layout. It may be beneficial for us to consider a size-based __wrap_iter for the same reasons. Unsurprisingly, we should first redo the benchmarks before committing to this. The original implementation of our size-based vector also included changes to __wrap_iter, but we deemed those orthogonal to the change in vector’s representation.

Notes


  1. The specific operation is based on your build flags: -O0 chooses a division operation; -O3 chooses a multiplication operation. ↩︎

I think this is a good change and IMHO it’s a good thing to be breaking API-violating users who have made assumptions about implementation internals outside what the standards specify.

This also opens another interesting optimization possibility that may be appropriate for certain circumstances: for some situations using 64-bit pointers, it may be a fine limitation to use only 32-bit sizes (limiting vector<byte> to 4G, etc.), and this could reduce the memory footprint of std::vector, std::string, etc. if many of those are stored in heap memory. Something to consider as an opt-in for certain ABI configurations down the line, after this lands.

That’s a great point – and I believe that this proposal already makes that possible without further changes to libc++. User code can already create a vector with a custom allocator, and the allocator can specify using size_type = uint32_t; using difference_type = int32_t;. Since the new vector representation contains two size_type fields, using such an allocator should actually reduce the size of the vector object.
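Something like this minimal sketch (hypothetical and untested against the proposed patch; it assumes the new representation really does store its integers as allocator_traits<A>::size_type):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// A std::allocator wrapper that narrows size_type/difference_type to
// 32 bits. If vector's integer members use allocator_traits' size_type,
// the vector object shrinks from 24 bytes to 16 on a 64-bit target.
template <class T>
struct narrow_alloc : std::allocator<T> {
    using size_type = std::uint32_t;
    using difference_type = std::int32_t;
    template <class U>
    struct rebind { using other = narrow_alloc<U>; };
};

using narrow_vector = std::vector<int, narrow_alloc<int>>;
```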

The proposed libc++ change will make std::vector similar to SmallVector :slight_smile:
llvm/ADT/SmallVector.h contains exactly this optimization, with this condition (see D77621, “ADT: SmallVector size/capacity use word-size integers when elements are small”):

template <class T>
using SmallVectorSizeType =
    std::conditional_t<sizeof(T) < 4, uintptr_t, uint32_t>;

@cjdb If you have more ideas for optimization, make sure to port them to SmallVector as well :slight_smile:

BTW, the LLVM Programmer’s Manual currently states

std::vector<T> is well loved and respected. However, SmallVector<T, 0> is often a better option due to the advantages listed above. std::vector is still useful when you need to store more than UINT32_MAX elements or when interfacing with code that expects vectors :).

:slight_smile:

Just curious: did you look into the impact of changing these validations to first compute the read location and check whether that’s in [begin_, end_)? It looks like that would require 2 comparisons, so it’s not obviously better, but it might avoid the pessimization of end().

I’m not sure I understand what you mean. Could you provide an example, please?

I am assuming that “libc++ hardening” means, for example, that std::vector::operator[](std::size_t i) adds a check like throwIf(i >= size()).

In this example, I believe an equivalent check is

const auto readAddress = begin_ + i;
throwIf(readAddress < begin_ || readAddress >= end_);

And this means no call to size().

readAddress always needs calculating anyway, but this does introduce one extra comparison.

I was wondering whether you’d looked into this (or if I’ve misunderstood the motivation).

Ah, thanks for clarifying. We can’t do that, because it’s undefined to have begin_ + i when i > size(), and there’s no mechanism to inform other parts of the implementation that we’re just doing a bounds check.

Thanks for the RFC and apologies for chiming in so late! As I’ve said elsewhere previously, I personally think this is something we should allow people to opt into. This seems to provide a lot of value and IMO the bigger question is how to best make this change while minimizing maintenance burden, duplication and risk of new bugs. I think those concerns are better suited for a pull request review so I’ll reach out to you to schedule some review time in the near future. But TLDR, I think this makes a lot of sense if we can figure out a way to make it live side by side with our current implementation reasonably (I’m sure we can).