[VectorOps] Vector -> GPU for single Block / Warp or Group / SubGroup

Hi everyone,

This is more a statement of intent than an RFC.
As the vector dialect is starting to get more fleshed out, @ThomasRaoux, I, and others have been discussing and prototyping Vector → GPU lowering in the context of SPIR-V. However, we expect most of the lessons and techniques to translate more generally.

Some of the high-order bits are:

  1. a vector op is a good approximation for a divergence-free block / warp level chunk of computation that can be distributed across threads within a block.
  2. higher-order n-D vector operations such as pointwise, vector.contract and reductions can be lowered to proper warp-synchronized operations along certain dimensions and unrolled along others.
  3. vector.transfer_read and _write ops have masked + padding semantics that can take advantage of the predicated SIMT model + massive parallelism + dynamic HW memory hiding properties of GPUs.

This could get us to a nice SSA-based vector programming abstraction in MLIR with unsurprising mapping to the right underlying GPU features, while punting for now on more difficult issues related to thread divergence, by construction.
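To make points 1–3 concrete, here is a rough sketch of the kind of subgroup/warp-level IR we have in mind. Shapes, names and the exact op syntax are made up for this post and will likely differ from what eventually lands:

#mapA = affine_map<(m, n, k) -> (m, k)>
#mapB = affine_map<(m, n, k) -> (k, n)>
#mapC = affine_map<(m, n, k) -> (m, n)>

// One warp/subgroup-sized matmul tile expressed on n-D vectors. Out-of-bounds
// lanes of the transfers are filled with the %f0 padding value (masked semantics).
func @warp_tile(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>,
                %i: index, %j: index, %k: index) {
  %f0 = constant 0.0 : f32
  %a = vector.transfer_read %A[%i, %k], %f0 : memref<?x?xf32>, vector<16x8xf32>
  %b = vector.transfer_read %B[%k, %j], %f0 : memref<?x?xf32>, vector<8x16xf32>
  %c = vector.transfer_read %C[%i, %j], %f0 : memref<?x?xf32>, vector<16x16xf32>
  // Divergence-free chunk of computation, distributable across the threads of a
  // block / invocations of a subgroup.
  %d = vector.contract {indexing_maps = [#mapA, #mapB, #mapC],
                        iterator_types = ["parallel", "parallel", "reduction"]}
         %a, %b, %c : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>
  vector.transfer_write %d, %C[%i, %j] : vector<16x16xf32>, memref<?x?xf32>
  return
}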

After early discussions with @ThomasRaoux, the prototypes include:

  1. mapping simple vector.transfer + pointwise ops to 1 block - many threads with 1-1 multiplicity (see the sketch after this list)
  2. mapping vector.transfer + vector.contract ops to cooperative matrix ops with 1-1 multiplicity
  3. introduce a “CooperativeCompatible” op interface on relevant vector ops that would be generally ignored except for the specific purpose of lowering to GPUs with cooperative support.
  4. in the SPIR-V model, CooperativeCompatible ops require using a specific opaque type. This will likely require slicing between CooperativeCompatible ops and other ops to introduce write-read pairs and bitcasts through memory. Maybe this would even be a special cast that has to go through memory at the lower level.
  5. support more general striding and transposes in the vector.transfer → SPIR-V lowering
  6. Allow n → 1 multiplicity and explore various forms of lowering to vector type and/or unrolling
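For item 1, a minimal sketch of the intended rewrite, assuming a block/subgroup of 32 threads and made-up shapes (the scalar form below is only meant to illustrate the 1-1 element-to-thread distribution, not the exact ops the lowering will emit):

// Subgroup-level form: one vector op covering 32 contiguous elements.
%c0 = constant 0 : index
%f0 = constant 0.0 : f32
%v = vector.transfer_read %A[%c0], %f0 : memref<32xf32>, vector<32xf32>
%w = vector.transfer_read %B[%c0], %f0 : memref<32xf32>, vector<32xf32>
%r = addf %v, %w : vector<32xf32>
vector.transfer_write %r, %Out[%c0] : vector<32xf32>, memref<32xf32>

// After distribution with 1-1 multiplicity: each of the 32 threads of the block
// owns exactly one element of the vectors above.
%tid = "gpu.thread_id"() {dimension = "x"} : () -> index
%a = load %A[%tid] : memref<32xf32>
%b = load %B[%tid] : memref<32xf32>
%s = addf %a, %b : f32
store %s, %Out[%tid] : memref<32xf32>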

As part of this work we expect a bunch of other things to surface while connecting the pieces, e.g. vector.transfer canonicalizations, alignment requirements that are currently mostly not implemented, etc.

For now things get prototyped on the IREE side but we hope to start sending things upstream as we make progress.

Thanks!


Looks great! Looking forward to all this :slight_smile:
The “cooperative” aspects in particular will be interesting to have in the GPU dialect, so that we have something we can map to CUDA and other targets as well!

Thanks Nicolas for bringing this up. This is being prototyped in IREE (thanks @ThomasRaoux for mainly driving this), and there are some things here that I think are very useful to explore. I would really like some input from community experts on some aspects involved here to help this work along.

  • Is it still possible to fuse element-wise operations with the result of the cooperative matrix multiply operation, i.e. accumulate in registers and then perform some other arithmetic operations on the result while keeping it in registers? From the SPIR-V spec it is unclear to me whether this is possible. Essentially, I don't see how the type conversion from the matrix types to scalar types (needed to perform the arithmetic operations after the matmul) can be achieved. If it is possible, the vector abstraction is a clean way to lower to such an instruction sequence in SPIR-V. It would be good to get some input on this aspect.

  • Outside of SPIR-V, the NVVM → PTX path allows using wmma/mma intrinsics, which look like they could support accumulating in registers and performing some additional arithmetic operations on the result of the accumulation before finally writing out to memory (effectively achieving fusion of a matmul with its consumer). I don't have direct experience here to know whether this works, but the spec seems to allow it. Info/pointers on examples that achieve this would make a good target for the codegen.

  • Stepping back from MMA intrinsics, just using fma instructions to do the multiply-accumulate while still fusing with the consumers would already be a great thing. The approach of going from Linalg to scalar computation AFAICS requires an intermediate buffer (in scratch space memory) to fuse a matmul with its consumers. The use of the Vector dialect should help avoid that.
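At the vector level this kind of fusion would just be SSA use-def chains: the accumulator stays in (virtual) registers between the contraction and its element-wise consumer, and only the final result is written back. A rough sketch (made-up shapes; #mapA/#mapB/#mapC are the usual matmul indexing maps, as in the sketch in the original post):

// Multiply-accumulate kept as an SSA value (virtual registers)...
%acc = vector.contract {indexing_maps = [#mapA, #mapB, #mapC],
                        iterator_types = ["parallel", "parallel", "reduction"]}
         %a, %b, %c : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>
// ...fused element-wise consumer, still on vectors, no scratch buffer...
%res = addf %acc, %bias : vector<16x16xf32>
// ...and only the final result goes back to memory.
vector.transfer_write %res, %Out[%i, %j] : vector<16x16xf32>, memref<?x?xf32>

Whether the contraction then lowers to MMA/cooperative matrix intrinsics or to plain fma-based outer products is a decision taken below this level, so the fusion itself does not depend on it.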

Another request to the community: I am sure this is not the first time a vector abstraction has been applied to GPU programming. It would be great to hear from people with prior experience about what pitfalls to avoid :slight_smile:

Thanks @nicolasvasilache for the nice writeup and thanks Thomas for taking on this! It’s really great to see this happening. :slight_smile:

On the Vulkan (compute) side, shader/kernel compilation happens in two phases: an offline stage which generates SPIR-V and an online stage which translates SPIR-V to the hardware ISA. So SPIR-V is the intermediate exchange format. A cooperative matrix is a high-level (compared to normal SPIR-V instructions) opaque construct representing a small fixed-size matrix. It gives the implementation flexibility to optimize computations on it under the hood. So I think it should be possible to identify SPIR-V patterns and fuse them in the driver compiler.

Yep, a cooperative matrix itself is an opaque object. But we do have ways to get at its elements in an implementation-specific way. For example you can use OpCooperativeMatrixLengthNV to query how many components of the matrix each invocation owns, and you can extract elements with OpAccessChain / OpCompositeExtract and then insert them back with OpAccessChain / OpCompositeInsert. These can be used together with arithmetic ops to perform after-matmul element-wise operations.

Thanks @nicolasvasilache for starting this.

Yes, we are mapping directly to SPIR-V for pathfinding, but we should definitely unify CUDA and SPIR-V by going through the GPU dialect.

This is definitely the interesting part. Note that the cooperative matrix type allows some pointwise arithmetic instructions (OpSNegate, OpFNegate, OpIAdd, OpFAdd, OpISub, OpFSub, OpFDiv, OpSDiv, OpUDiv), so some fusion can definitely be done by applying those before or after the matmul-accumulate instruction.

That being said, the fact that it has a different type does make mixing different operations non-trivial. One thing I would like to understand, if someone is familiar with how the driver handles those operations, is whether there is a cheap way to do a bitcast kind of operation, so that the compiler doesn't have to do special codegen for loads/stores/arithmetic based on whether or not the data is going to be used by a matmul operation.
From @antiagainst's answer it sounds like we could just use OpCompositeExtract/OpCompositeInsert to convert to an array. This is something I'm planning to try and benchmark sometime soon, but if anybody has insights about it, it would be good to hear them.

Does this require a parametric vector size?

My understanding is that in Vulkan/SPIR-V, the subgroup size (i.e. warp size, i.e. vector size, …) is only known at runtime. E.g. see the use of gl_SubgroupSize in the Vulkan subgroup tutorial.

Example of SPIR-V shader using cooperative matrix: link

Note:

// M/N/K values filled out at pipeline creation time
layout(constant_id = 0) const uint lM = 1;
layout(constant_id = 1) const uint lN = 1;
layout(constant_id = 2) const uint lK = 1;

The code then proceeds to do e.g.

coopmatT<A_BITS, gl_ScopeSubgroup, lM, lK> matA[C_ROWS];
[[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
    uint gi = TILE_M * tileID.y + lM * i; // !! <--------- parametric "lM" vector size.
    uint gk = chunkK; 
    coopMatLoadNV(matA[i], inputA.x, coordToOffset(gi, gk, strideA, false), strideA, false); 
}

The current direction is to assume that the subgroup size is a known compile time constant.

It is true that the subgroup size is not constant in Vulkan, but [most] vendors have a fixed subgroup size given in the device information.

@antiagainst should have more information about this, but my understanding is that during compilation we assume we know the device capabilities and extensions; otherwise many things are hard to do.
Another example is cooperative matrix: the supported matrix types and dimensions are properties of the device, so it is pretty much impossible to generate code without this information.

+1 to what Thomas said above. We pass supported capabilities and extensions through the stack to drive code generation, in order to generate better code where possible. (Right now the generated code basically requires no capabilities/extensions, but that is to have a baseline and reliable fallback; the mechanism is there.)
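For reference, this is roughly what that information looks like: a target environment attribute on the module listing the SPIR-V version, capabilities, extensions and resource limits that codegen may query. The exact attribute spelling has evolved over time, so treat the following as a sketch:

// Sketch of a target environment advertising NV cooperative matrix support;
// lowering patterns can then be gated on the listed capabilities/extensions.
module attributes {
  spv.target_env = #spv.target_env<
    #spv.vce<v1.5, [Shader, GroupNonUniform, CooperativeMatrixNV],
             [SPV_NV_cooperative_matrix]>,
    {max_compute_workgroup_invocations = 128 : i32,
     max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>
} {
}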

Note that the suggestion I made above is a general way that I can think of, from reading the spec, to express various kinds of additional element-wise computations. Cooperative matrices already support common unary/binary operations, so for those common cases (+, -, *, /, etc.) there is no need to go element by element; one can just perform them on the whole coop matrix.

Regarding subgroup size: yes, you can control and vary it if the device supports VK_EXT_subgroup_size_control; then gl_SubgroupSize truly becomes a runtime-known value. But this extension is not widely supported yet. Without it, each physical device exposes a single subgroup size via VkPhysicalDeviceSubgroupProperties. So if one wants to target all possible scenarios with one kernel, then yes, one will need to treat the subgroup size as a runtime-known number. But if one just wants to target a known list of devices, it's fine to assume fixed subgroup sizes. If we end up needing to resort to that, it's not uncommon to codegen a few configurations and select among them at runtime, especially for this kind of vendor-specific logic guarded by extra extensions.


Thanks folks. That makes perfect sense!

Thanks @antiagainst, I didn't realize that they are composite types. Then we have to rely on driver compilers to optimize away the packs/unpacks.

@nicolasvasilache, @ThomasRaoux, I would be interested in your thoughts on how to use shared memory (or workgroup memory in SPIR-V parlance). There are two options in this context:

  1. Use tiling/promotion to move the input data into shared memory, then use vector transfers + vector contraction
  2. Use vector transfers to load data directly from global memory; then, when “breaking up” the vector to match the SPIR-V subgroup cooperative matmul specs, introduce scratch space memory at that point.

My current thinking is to go with (1) above, but (2) also has advantages (though it is less clear how the generated code would actually look).
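For concreteness, (1) would introduce the workgroup buffer explicitly before any vector ops, roughly like the following sketch (memory space 3 standing in for the workgroup address space; shapes and op spellings are just placeholders):

// (1) Tile and promote: copy a tile of A from global into workgroup memory,
// synchronize, then issue vector transfers / contractions on the fast buffer.
%c0 = constant 0 : index
%f0 = constant 0.0 : f32
%tileA = alloc() : memref<32x32xf32, 3>   // 3 ~ workgroup memory, placeholder
// ... cooperative copy global -> %tileA (loops or linalg.copy) ...
gpu.barrier
%va = vector.transfer_read %tileA[%c0, %c0], %f0
        : memref<32x32xf32, 3>, vector<16x8xf32>

With (2), the alloc and barrier would instead appear later, only around the points where the large vectors get broken up to match the cooperative matrix shapes.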

This is definitely an interesting question, and I'm not sure which will be the final solution at this point. (1) is the current direction, but I'm planning to experiment with (2) to figure out whether re-discovering the memory that should be promoted, as well as the sync points, is feasible.