Description
Proposal
cc @rust-lang/project-portable-simd
Summary
- Set an expectation that `simd_*` intrinsics from `extern "platform-intrinsic"` will form the basis of backend-agnostic intrinsics within the standard library.
- Fill in any gaps in the existing `simd_*` intrinsics so that `core::simd` can avoid any direct LLVM dependency.
- Provide non-vectorized reference implementations for all `simd_*` intrinsics that are exposed through `compiler_builtins`.
Implementation of this MCP will be coordinated between the Portable SIMD project and Compiler team for Cranelift and LLVM implementations.
Motivations
The user-facing goal of the `core::simd` API being developed by the @rust-lang/project-portable-simd group is to let users take advantage of SIMD intrinsics in a way that's automatically portable across hardware. Realizing that goal requires special backend support so that `core::simd` can utilize existing intrinsics rather than having to target ISAs specifically. Since `rustc` has multiple compiler backends (LLVM and Cranelift), the `core::simd` implementation also needs to be portable across compiler backends in order to be portable across hardware in the way we want.
We currently have a best-effort approach to backend-agnostic SIMD intrinsics through the `extern "platform-intrinsic"` mechanism. There are a number of existing platform intrinsics, like `simd_add` and `simd_mul`, that are used throughout `core::arch`, but there are also many ISA idiosyncrasies that aren't reasonable to create platform intrinsics for, so `core::arch` also uses a mix of LLVM intrinsics and inline assembly. The story for `core::simd` is a bit different. Its goal is specifically to expose standard lowest-common-denominator functionality that's consistent across all supported ISAs. That makes `core::simd` a great fit for platform intrinsics, and `core::simd` would like to use them exclusively, rather than LLVM intrinsics.
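For context, a platform intrinsic is declared as a generic signature in an `extern "platform-intrinsic"` block. This is nightly-only syntax, shown here only as an illustrative sketch of the mechanism, not as a stable API:

```rust
// Nightly-only sketch: requires `#![feature(platform_intrinsics)]` and a
// `#[repr(simd)]` vector type at the call site.
extern "platform-intrinsic" {
    // Generic over any `#[repr(simd)]` type `T`; the backend decides how to
    // lower the call for the concrete vector shape.
    fn simd_add<T>(x: T, y: T) -> T;
}
```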
We can think of `core::simd` almost like a thin, user-friendly wrapper over platform intrinsics. Since it's going to rely on these intrinsics and make promises about their behavior across platforms, we should write reference implementations of them all in plain Rust. These reference implementations will both document the conformant behavior of these intrinsics and give compiler backends a way to support `core::simd` without necessarily having to generate vectorized code for all platforms.
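As a standalone sketch, such a reference implementation can be ordinary, non-vectorized Rust. The function name follows the `simd_fallback_add_i64` pattern used later in this proposal, and the wrapping-on-overflow behavior is an assumption for illustration, not decided semantics:

```rust
/// Illustrative plain-Rust reference implementation of an addition intrinsic
/// for `i64` lanes. Overflow behavior is part of the documented contract;
/// here it is assumed to wrap.
pub fn simd_fallback_add_i64(a: &[i64], b: &[i64], r: &mut [i64]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), r.len());
    for i in 0..r.len() {
        // Lane-wise addition, one lane per loop iteration.
        r[i] = a[i].wrapping_add(b[i]);
    }
}
```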
Implementation
The overall picture of how Rust's portable SIMD story will fit together covers a few areas of the compiler and libraries:
- Libraries like `core::simd` call platform intrinsics through `extern "platform-intrinsic"`. These are generic functions that are backend-agnostic. They work with types that are `#[repr(simd)]`.
- Compiler backends like LLVM or Cranelift receive these generic platform intrinsics and can either emit appropriate ISA-specific vector instructions, or emit a function call to a generic fallback implementation.
- The fallback implementations that backends can call are monomorphic intrinsics from `compiler_builtins`. They're named so that backends can easily find them given the platform intrinsic and the shape of the inputs. Instead of working with some generic `#[repr(simd)] struct T`, they work with `[T; N]`s.
- While the fallback implementations are monomorphic so they can live in `compiler_builtins`, they can be backed by a single generic implementation to make them easier to write.
To get a better idea of how this is expected to hang together, let's consider a working example of a conceptual `core::simd` and Rust compiler. This example implements addition for `i64x4`, a vector type with 4 `i64` lanes:
```rust
#![feature(core_intrinsics, min_specialization, min_const_generics)]
#![allow(non_camel_case_types)]

#[test]
fn i64x4_add() {
    use crate::simd::i64x4;

    let a = i64x4([1, 2, 3, 4]);
    let b = i64x4([4, 3, 2, 1]);

    assert_eq!(i64x4([5, 5, 5, 5]), a + b);
}

pub mod simd {
    /*!
    An external crate, `core::simd`.

    It wants to call platform intrinsics instead of direct backend intrinsics.
    */

    use std::ops::Add;

    use crate::compiler::{ReprSimd, platform_intrinsics};

    /**
    A vector type with 4 `i64` lanes.
    */
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct i64x4(pub [i64; 4]);

    /**
    Our vector type is `#[repr(simd)]`.
    */
    impl ReprSimd for i64x4 {
        type Type = i64;
        const SIZE: usize = 4;

        fn into_fields(self) -> Vec<Self::Type> {
            vec![self.0[0], self.0[1], self.0[2], self.0[3]]
        }

        fn from_fields(fields: Vec<Self::Type>) -> Self {
            i64x4([fields[0], fields[1], fields[2], fields[3]])
        }
    }

    /**
    We can add two of our vectors together using the `simd_add` platform intrinsic.

    Hopefully it'll generate vectorized code! But it's ok if it doesn't.
    */
    impl Add for i64x4 {
        type Output = Self;

        fn add(self, rhs: Self) -> Self::Output {
            platform_intrinsics::simd_add(self, rhs)
        }
    }
}

pub mod compiler {
    /*!
    Our Rust compiler.

    It contains a few private child modules to demonstrate where each piece lives
    and how they interact to expose the single `platform-intrinsics` API.
    */

    mod builtins {
        /*!
        The compiler builtins.

        We expose manually monomorphized SIMD fallback intrinsics here that
        codegen backends can call into if they don't support generating
        specialized code for a SIMD platform intrinsic yet.
        */

        use std::{
            intrinsics,
            ops::Add,
        };

        pub fn simd_fallback_add_i64(a: &[i64], b: &[i64], r: &mut [i64]) {
            i64::simd_add(a, b, r)
        }

        pub fn simd_fallback_add_f64(a: &[f64], b: &[f64], r: &mut [f64]) {
            f64::simd_add(a, b, r)
        }

        /**
        Let's model our `simd_add` operation as a trait so we can specialize.

        It doesn't really matter how we do this, so long as it's readable.
        */
        trait SimdAdd {
            type Output;

            fn simd_add(a: &[Self], b: &[Self], r: &mut [Self::Output])
            where
                Self: Sized;
        }

        impl<T> SimdAdd for T
        where
            T: Add + Copy,
        {
            type Output = T;

            default fn simd_add(a: &[Self], b: &[Self], r: &mut [Self::Output]) {
                for i in 0..r.len() {
                    // behavior on overflow is one of the edge cases the fallback should specify
                    r[i] = intrinsics::wrapping_add(a[i], b[i]);
                }
            }
        }

        impl SimdAdd for f64 {
            fn simd_add(a: &[Self], b: &[Self], r: &mut [Self]) {
                for i in 0..r.len() {
                    // we might want to treat floats specially to guarantee some behavior for NaN or Infinity
                    r[i] = unsafe { intrinsics::fadd_fast(a[i], b[i]) };
                }
            }
        }
    }

    mod codegen_support {
        /*!
        Some backend-agnostic helpers.
        */

        use std::{
            mem,
            any::type_name,
        };

        use super::builtins;

        /**
        Find a builtin fallback function by operation and type.

        This is what backends would do if they can't generate specific code
        for a given platform intrinsic.
        */
        pub fn find_simd_fallback<T>(op: &str) -> fn(&[T], &[T], &mut [T]) {
            match (op, type_name::<T>()) {
                ("simd_add", "i64") => unsafe {
                    mem::transmute::<
                        fn(&[i64], &[i64], &mut [i64]),
                        fn(&[T], &[T], &mut [T]),
                    >(builtins::simd_fallback_add_i64)
                },
                ("simd_add", "f64") => unsafe {
                    mem::transmute::<
                        fn(&[f64], &[f64], &mut [f64]),
                        fn(&[T], &[T], &mut [T]),
                    >(builtins::simd_fallback_add_f64)
                },
                (op, ty) => unimplemented!("{} is unsupported for {}", op, ty),
            }
        }
    }

    mod codegen_cranelift {
        /*!
        An example of the code a backend might use to implement platform intrinsics.
        */

        use super::ReprSimd;
        use super::codegen_support;

        pub fn simd_add<T>(a: T, b: T) -> T
        where
            T: ReprSimd,
        {
            // We don't generate specific code for `simd_add`, so use a fallback
            // to find the appropriate function to call
            simd_fallback("simd_add", a, b)
        }

        fn simd_fallback<T>(op: &str, a: T, b: T) -> T
        where
            T: ReprSimd,
        {
            let a = a.into_fields();
            let b = b.into_fields();
            let mut r = vec![T::Type::default(); T::SIZE];

            assert_eq!(T::SIZE, a.len());
            assert_eq!(T::SIZE, b.len());

            (codegen_support::find_simd_fallback::<T::Type>(op))(&*a, &*b, &mut *r);

            T::from_fields(r)
        }
    }

    pub mod platform_intrinsics {
        /*!
        Let's expose our backend's `simd_add` as a platform intrinsic.

        This is the 'public' API that our `simd` consumer can call.
        */

        pub use super::codegen_cranelift::simd_add;
    }

    /**
    A codeified representation of the requirements we expect `#[repr(simd)]`
    types to satisfy.

    We're using `Vec` here instead of `[T; N]` just to make things compile
    for this example.
    */
    pub trait ReprSimd {
        type Type: Default + Copy;
        const SIZE: usize;

        fn into_fields(self) -> Vec<Self::Type>;
        fn from_fields(fields: Vec<Self::Type>) -> Self;
    }
}
```
At the top level, in `core::simd`, we call into generic platform intrinsics. In our compiler backend, we have one of two options: emit ISA-specific vectorized code for `simd_add` (such as a call to x86's `_mm256_add_epi64`), or emit a call to the fallback implementation from `compiler_builtins`. The `#[repr(simd)]` attribute ensures we can convert to and from arrays for the fallback implementation. Given a `#[repr(simd)]` struct with `N` fields of type `T`, we can read them into a `[T; N]` and then write them from a `[T; N]` back into the `#[repr(simd)]` struct.
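That conversion can be sketched in plain Rust. Here `#[repr(C)]` stands in for `#[repr(simd)]` (which requires nightly), and the transmutes rest on the layout assumption this proposal relies on: a vector struct with `N` fields of type `T` has the same layout as `[T; N]`:

```rust
use std::mem;

// Stand-in for a `#[repr(simd)]` vector type; `#[repr(C)]` gives the same
// "4 consecutive i64 fields" layout for the purposes of this sketch.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct I64x4(pub i64, pub i64, pub i64, pub i64);

pub fn into_array(v: I64x4) -> [i64; 4] {
    // Assumes the struct's fields are laid out exactly like `[i64; 4]`.
    unsafe { mem::transmute(v) }
}

pub fn from_array(a: [i64; 4]) -> I64x4 {
    unsafe { mem::transmute(a) }
}
```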
We pass these fallback vectors to the intrinsic as slices instead of arrays, so that arbitrarily sized inputs can be supported.
New SIMD operations can be defined with a reference implementation in `compiler_builtins`. They can then be picked up by compiler backends automatically until those backends choose to generate specific code for them.
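For example, a new `simd_mul` operation could start life as nothing more than a reference implementation. The names are hypothetical, following the `simd_fallback_add_i64` pattern above, and this also shows how monomorphic entry points can be backed by a single generic implementation:

```rust
// A single generic implementation can back several monomorphic entry points.
fn simd_mul_generic<T: Copy, F: Fn(T, T) -> T>(a: &[T], b: &[T], r: &mut [T], mul: F) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), r.len());
    for i in 0..r.len() {
        // Lane-wise multiply using the supplied semantics.
        r[i] = mul(a[i], b[i]);
    }
}

/// Hypothetical fallback for a new `simd_mul` intrinsic on `i64` lanes;
/// wrapping semantics are an assumption for this sketch.
pub fn simd_fallback_mul_i64(a: &[i64], b: &[i64], r: &mut [i64]) {
    simd_mul_generic(a, b, r, i64::wrapping_mul)
}

/// The `f64` counterpart, using plain IEEE 754 multiplication.
pub fn simd_fallback_mul_f64(a: &[f64], b: &[f64], r: &mut [f64]) {
    simd_mul_generic(a, b, r, |x, y| x * y)
}
```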
There's an implicit bound on the `T` in `simd_add`, which is that it somehow maps to an intrinsic like `simd_fallback_add_i64`. This isn't currently communicated in its contract, and could be masked by compiler backends that don't forward to intrinsics. We suggest compiler backends always validate that an appropriate fallback exists for a given platform intrinsic and vector type, even if they don't plan to use it.
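That validation could be sketched as a lookup against a table of known fallbacks. Everything here is hypothetical, none of it is existing compiler API; the point is only that the check can run even on paths that emit specialized code and never call the fallback:

```rust
// Hypothetical table of (operation, element type) pairs for which a
// monomorphic fallback exists in `compiler_builtins`.
const KNOWN_FALLBACKS: &[(&str, &str)] = &[("simd_add", "i64"), ("simd_add", "f64")];

/// A backend could run this for every platform intrinsic call it lowers,
/// regardless of whether it intends to use the fallback.
pub fn validate_fallback_exists(op: &str, elem_ty: &str) -> Result<(), String> {
    if KNOWN_FALLBACKS.iter().any(|&(o, t)| o == op && t == elem_ty) {
        Ok(())
    } else {
        Err(format!("no fallback for `{}` on `{}` lanes", op, elem_ty))
    }
}
```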
Who defines intrinsics?
The Portable SIMD group will be responsible for determining the behavior of the `simd_*` intrinsics it needs. These will take the form of the reference/fallback implementations in `compiler_builtins`.
What do we need from `T-compiler`?
We'll need a lot of help figuring out where to put reference implementations and how to make them available to the various backends.
Mentors or Reviewers
- @bjorn3, who's been working on `rustc`'s Cranelift backend and `extern "platform-intrinsic"`.
Process
The main points of the Major Change Process are as follows:
- File an issue describing the proposal.
- A compiler team member or contributor who is knowledgeable in the area can second by writing `@rustbot second`.
  - Finding a "second" suffices for internal changes. If however you are proposing a new public-facing feature, such as a `-C flag`, then full team check-off is required.
  - Compiler team members can initiate a check-off via `@rfcbot fcp merge` on either the MCP or the PR.
- Once an MCP is seconded, the Final Comment Period begins. If no objections are raised after 10 days, the MCP is considered approved.
You can read more about Major Change Proposals on forge.
Comments
This issue is not meant to be used for technical discussion. There is a Zulip stream for that. Use this issue to leave procedural comments, such as volunteering to review, indicating that you second the proposal (or third, etc), or raising a concern that you would like to be addressed.