Demystifying the byte type

Hi all,

In May 2021, together with Nuno Lopes and Juneyoung Lee, we proposed adding a byte type to LLVM to fix load type punning issues. The initial RFC touched on some subtle aspects of LLVM IR and its semantics, and sparked a lot of questions, concerns, and discussions.

We decided to write a post that would summarise the thread and the complicated topic:

https://gist.github.com/georgemitenkov/3def898b8845c2cc161bd216cbbdb81f

We hope that our post clarifies the initial concerns raised on the mailing list. As always, any questions, suggestions and advice are welcome!

Thanks,
George

Hi,

Thank you for the description of the problem.
I think it would be helpful to also document which CPUs you were
considering in relation to the behaviour.
I can see that the description would hold for x86(32bit) and amd64.
But, there are CPUs out there that have special instructions for doing
pointer manipulation.
You might know that the CPU type has no bearing on the discussion, in
which case it would be helpful to add a paragraph or two explaining
that.

Kind Regards

James

Hi,

The gist post seems to imply that one needs memory to be typed.
If what you describe is correct, doesn't that imply that the opaque
pointer work is a fool's errand?
I.e., if memory needs to be typed, surely pointers to that memory also
need to be typed?

Kind Regards

James

Hi James,

> If what you describe is correct, doesn't that imply that the opaque pointer work is a fool's errand?
> I.e., if memory needs to be typed, surely pointers to that memory also need to be typed?

Not at all; the issue described in the post is orthogonal to opaque pointers. The main problem that the post talks about is that LLVM does not have a type that represents a raw sequence of bits (i.e. memory). Currently, integers are used for that, which means they sometimes carry pointers (as described in the first parts of the post). This creates a problem for optimizations on integers, since we do not know whether the values that we load are “pure” integers or pointers cast to integers (and different LLVM optimizations make different assumptions about that).

So it is not about making memory “typed”, but rather creating a universal type that can be used to load from memory something that carries raw data (integers and pointers) and preserves provenance.

Hope this helps!

Thanks,
George

> But, there are CPUs out there that have special instructions for doing
> pointer manipulation.

I am not sure I see why this is relevant to the byte type?

I believe the issue is not specific to any particular architecture, as it stems from the frontend and IR optimizations. As I say in the post, it is tied to C semantics, which say that unsigned char is a universal type holder. This means that any pointer can be copied byte by byte without alias analysis in LLVM realising it, leading to incorrect IR. LLVM does not have such a type and uses untyped memory with integers carrying data (which, as described in the post, creates inconsistencies and invalidates certain LLVM optimizations).
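A minimal C sketch of that situation follows. The function names are hypothetical and the code is only an illustration, not taken from the post. It copies a pointer object one unsigned char at a time, which standard C permits, and then reads through the copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy n bytes one unsigned char at a time, the way
   a hand-written memcpy in standard C would. */
void copy_bytes(void *dst, const void *src, size_t n) {
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];   /* each byte may be a fragment of a pointer */
}

/* Copy a pointer object through copy_bytes and read through the copy. */
int read_through_copied_pointer(void) {
    int value = 42;
    int *original = &value;
    int *copy = NULL;
    copy_bytes(&copy, &original, sizeof original);
    return *copy;      /* the copy must still be usable to access value */
}
```

Each `d[i] = s[i]` is an ordinary character access at the source level, yet the bytes being moved are pieces of a pointer, so the copy has to keep whatever lets `*copy` legally reach `value`.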

Thanks,
George

Hi George,

I only made it through part 1 for now but I figured I might forget if I don't reply directly:

> Under the untyped memory model, we need to accept that every load/store has an implicit `ptrtoint`/`inttoptr` attached to it.

This is stated but I don't see it. Rather, a store of a pointer makes the pointer potentially escape (also see below).
Escaped pointers could show up as integers (among other things). So escaping a pointer (by any means) does an implicit
ptrtoint/inttoptr, but the store or the load does not necessarily do so. The transformation shown below that statement doesn't
contradict this view and SROA is still legal. All that happened is that SROA first determined, and then made
explicit, that the pointer (here %in) did not escape during the round trip through %mem. If, for example, %mem had
been passed to an unknown function, SROA would not have done this transformation because %in could now have
escaped through %mem. If %mem had been cast to int and then loaded, SROA would have made the escaping use explicit through
a ptrtoint: Compiler Explorer
Long story short, if you store a pointer (or cast it to an integer, or compare it in other than a few special ways) it might
escape and, since anything could happen to it, it loses its provenance. If you can show it doesn't escape, no provenance is
lost.
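A rough C analogue of the two cases, as a sketch only: the function names are hypothetical, and `unknown_function` stands in for a callee the optimizer cannot see into.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a function defined in another translation
   unit; the optimizer must assume it can do anything with slot. */
void unknown_function(int **slot) { (void)slot; }

/* The slot mem never leaves this function, so the loaded pointer can be
   replaced by in directly (the SROA-style forwarding described above). */
int roundtrip_private(int *in) {
    assert(in != NULL);
    int *mem = in;     /* store the pointer into a local slot */
    int *out = mem;    /* load it back */
    return *out;
}

/* Here the slot's address is passed to unknown_function first, so in may
   have escaped through mem and the forwarding is no longer justified. */
int roundtrip_escaped(int *in) {
    assert(in != NULL);
    int *mem = in;
    unknown_function(&mem);
    int *out = mem;
    return *out;
}
```

Both functions return the same value when run; the difference is only in what an optimizer is allowed to conclude about `in` after the round trip.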

> An alternative is to say that all pointer stores escape, which again has severe performance consequences and again do not align with all LLVM optimizations.

What optimizations do not treat a pointer stored away as an escaping use? That sounds like a problem.
[FWIW, I'm only aware of the Attributor but it ensures that all uses of the store are instead visited which makes this sound again (no escape can happen through the store).]

~ Johannes

From the linked document:

> Solution 3: Annotations and tags
> LLVM optimizers work with the assumption that attributes can be discarded if the optimizer does not know how to handle them.

I don't think this is necessarily the case. Such attributes can be
designed such that a missing attribute represents the most
conservative, like the `mustprogress` attribute/metadata. That is, a
missing annotation has an implicit provenance of {all}. GVN can fold q
and p after `if (q == p)` with a new provenance being the union of q
and p's provenance, like a PHINode. In other models, p and q cannot be
folded or, in the case of the proposed byte type, cannot carry
provenance information.
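For readers who want the pattern spelled out, here is a minimal C sketch of the `if (q == p)` situation. The function name is hypothetical, and the comment describes what an optimizer might want to do rather than what any specific LLVM pass does.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 and writes 10 through q when the two addresses compare equal. */
int store_if_equal(int *p, int *q) {
    assert(p != NULL && q != NULL);
    if (p == q) {
        /* After seeing p == q, an optimizer may want to rewrite this store
           as *p = 10. That is only sound if p is allowed to access the
           object q points to, i.e. if the provenance of the replacement
           covers q's provenance. */
        *q = 10;
        return 1;
    }
    return 0;
}
```

Under the union approach sketched above, the folded pointer would carry the provenance of both p and q, so the rewritten store would remain a valid access.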

> High engineering effort to enforce that attributes are preserved in every transformation and used by analyses.

IMHO, it is still lower than introducing a new first-class type.

Michael

Hi George,

>
> Hi all,
>
> In May 2021, together with Nuno Lopes and Juneyoung Lee, we proposed to add a byte type in LLVM to fix load type punning issues. Initial RFC touched some subtle aspects of LLVM IR and its semantics, and sparked a lot of questions, concerns, and discussions.
>
> We decided to write a post that would summarise the thread and the complicated topic:
>
> Byte types, or how to get rid of i8 abuse for chars in LLVM IR · GitHub
>
> We hope that our post clarifies initial concerns raised on the mailing list. As always, any questions, suggestions and advice are welcome!

Thank you for the writeup. I think a big part of the problem in understanding this comes from the name of the type. On provenance-carrying architectures (such as CHERI systems, including Arm's Morello[1]), it is unsound to copy a pointer as bytes. Pointers must be copied by provenance-carrying operations. The hardware splits registers into ones that don't carry provenance (integer, floating-point, vector) and ones that do but which can *also* be used to copy non-pointer data (capabilities).

On a CHERI system, ptrtoint does not confer provenance, and inttoptr on the result may yield either an invalid pointer or a pointer with larger bounds, depending on the environment. This reflects the machine semantics: converting a pointer to an integer is an operation that simply extracts the address (on Morello, the address is exposed as a subregister of the capability register). Converting in the opposite direction inserts the address into the capability held in the default data capability register (which, in the pure-capability ABI, is typically not a valid capability and so yields an invalid pointer, and in the hybrid ABI refers to the part of the address space used for legacy code).
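As a toy illustration of those two conversions (a simplified model in plain C, not CHERI C and not the real capability encoding; all type and function names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy capability: an address plus the bounds and validity that travel
   with it. Real capabilities carry more state than this. */
typedef struct {
    uint64_t address;
    uint64_t base;
    uint64_t length;
    bool valid;
} toy_cap;

/* Pointer to integer: only the address is extracted. */
uint64_t toy_ptrtoint(toy_cap c) { return c.address; }

/* Integer to pointer: the address is inserted into the default data
   capability (ddc), so bounds and validity come from ddc. */
toy_cap toy_inttoptr(toy_cap ddc, uint64_t address) {
    toy_cap result = ddc;
    result.address = address;
    return result;
}

/* An access is allowed only through a valid, in-bounds capability. */
bool toy_can_access(toy_cap c) {
    return c.valid && c.address >= c.base && c.address - c.base < c.length;
}
```

With an invalid `ddc` (the pure-capability case described above) the round trip yields a pointer that cannot be used; with a `ddc` covering a large region (the hybrid case) it yields a usable pointer whose bounds are wider than the original object's.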

I think that all of this is fairly aligned with your byte type.

David

[1] Morello Program – Arm®

> From the linked document:
>
> > Solution 3: Annotations and tags
> > LLVM optimizers work with the assumption that attributes can be discarded if the optimizer does not know how to handle them.
>
> I don't think this is necessarily the case.

To be really clear about this: The statement in the blog post is not true.

Michael is right that one can design the absence of the attribute to be
the conservative choice (like mustprogress), but that is not all.

Attributes cannot be dropped by the optimizer; that was never (or at least
not for a long time) an option. The same is true for operand bundles and,
to some degree, metadata.

Here are some, not all, examples of non-droppable attributes:
- convergent
- alwaysinline
- naked
- nobuiltin
- noduplicate
- noimplicitfloat
- noinline
- optnone
- byval
- preallocated

Operand bundles cannot be dropped, and having them on arbitrary instructions
would not be the worst thing. Metadata is partially dropped but is also often
already preserved as if it were required. Several backends crash if you drop
the wrong debug info metadata, which is almost as good as requiring that it
be preserved.

One can argue about whether attributes (or annotations in general) make IR less readable,
but that is not the strongest argument. IR is not meant to be readable in the same
sense as a language you would program in.

The effort to introduce new attributes, preserve them, and use them is not necessarily
high. A lot of the new attributes we introduced in the last 2 years were implemented by
students under guidance, including the latest two, which are under review: `objectsize` and `globalmemonly`.

All that said, I still believe the underlying problem here is the question of when a pointer
escapes. It is just important to keep the description of alternative options correct.

~ Johannes

The way I understand it, the problem that the byte type is meant to solve is part of a broader-scoped problem, which is the inconsistency of pointer semantics in LLVM (and other compilers, for that matter). Subtle misunderstandings between different optimization passes about how pointer semantics work cause misoptimizations, and identifying which pass is the culprit is challenging. This is not helped by the LLVM language reference being outright incorrect here: it describes provenance in terms of data dependence, even through integers, which is not how any of our analyses actually work; they generally prefer to reason with a more escape-based approach.

However, the byte type proposal feels to me like it is motivated by a minor portion of the problem, so narrow that it feels like it only really solves the “how to write memcpy in standard C” aspect of this problem. It doesn’t really address how the addition of byte types would fix miscompilations, especially anything beyond memcpy (for example, C code compiled with -fno-strict-aliasing). It doesn’t suggest any fixes to the currently known inconsistencies in the language specification. And as a result, it’s kind of dismissive as to why isolated fixes to various optimization passes are insufficient to achieve coherent semantics.

Stepping back a bit, it’s helpful to understand that, for the purposes of building an operational semantics, a pointer is not an i64 but a { i64, BOOM (Bag Of Other Metadata) }, where the BOOM contains sufficient information to explain when a load or store of a pointer is undefined behavior, including liveness information, provenance, and noalias rules [1]. Described like this, three things should be clear. First, the inttoptr instruction has to recreate the BOOM given no information, which is necessarily a pessimistic assumption (it may be useful to have intrinsics that provide less pessimistic recreation of the BOOM). Second, loads and stores of pointers in memory need to preserve the BOOM, presumably through a generally inaccessible shadow memory feature. Finally, the interaction of non-pointer types with the representation of the BOOM in memory needs to be given a definition.

Fundamentally, then, the problem is inttoptr (and to a lesser degree, ptrtoint, as it constitutes a vehicle for escaping pointers), and memory is involved only insofar as it constitutes a ‘hidden’ inttoptr (and ptrtoint). But byte doesn’t really expose the ‘hidden’ inttoptr, it just hides it in a different place. Indeed, it still retains the existing ones if you should load a pointer with an i64. To me, it appears only to be useful in giving a way to canonicalize @llvm.memcpy into a regular load type, but an entirely new type doesn’t seem necessary for that—intrinsics that give access to reading and writing shadow BOOM seem like they would be sufficient. You might argue that such intrinsics would eliminate the ability of users to write their own copies of memcpy, but even here, byte is an insufficient proposal—there’s no way to write a word-based memcpy in C with this proposal (assuming -fno-strict-aliasing, of course).

With that in mind, I’d like to ask a few questions:

Have you been tracking the WG14 study group on provenance?

Have you attempted to put together some form of provenance semantics in a tool like Alive2 to more comprehensively catalogue miscompilations in existing optimizations?

[1] My first instinct is to say that the BOOM is the set of allocations the pointer may point to, but there may be edge cases that I’m not immediately thinking of. Formal semantics is not my forte, after all.

> > From the linked document:
> >
> > > Solution 3: Annotations and tags
> > > LLVM optimizers work with the assumption that attributes can be discarded if the optimizer does not know how to handle them.
> >
> > I don't think this is necessarily the case.
>
> To be really clear about this: The statement in the blog post is not true.
>
> Michael is right that one can design the absence of the attribute to be
> the conservative choice (like mustprogress), but that is not all.
>
> Attributes cannot be dropped by the optimizer; that was never (or at least
> not for a long time) an option. The same is true for operand bundles and,
> to some degree, metadata.
>
> Here are some, not all, examples of non-droppable attributes:
> - convergent
> - alwaysinline
> - naked
> - nobuiltin
> - noduplicate
> - noimplicitfloat
> - noinline
> - optnone
> - byval
> - preallocated
>
> Operand bundles cannot be dropped, and having them on arbitrary instructions
> would not be the worst thing. Metadata is partially dropped but is also often
> already preserved as if it were required. Several backends crash if you drop
> the wrong debug info metadata, which is almost as good as requiring that it
> be preserved.

+1 on the attributes and operand bundles. These are semantic, and cannot be dropped without understanding their meaning.

Metadata, by contract, can always be dropped. Any backend which doesn't tolerate that has a bug. It may not be a bug we care about or bother to fix, but it's still a bug.