[RFC] Extending vector distribution to support other types

Existing logic in the vector dialect ([Vector] Vector distribution (large vector to small vector) - #23 by ThomasRaoux) works well together with xegpu op distribution ([MLIR][XeGPU] Xegpu distribution patterns for load_nd, store_nd, and create_nd_tdesc. by kurapov-peter · Pull Request #112945 · llvm/llvm-project · GitHub; a crude example of stitching the xegpu and vector distribution patterns together is here: [MLIR][XeGPU] XeGPU ops distribution demo by kurapov-peter · Pull Request #111989 · llvm/llvm-project · GitHub).

warp_execute_on_lane_0 currently lives in the vector dialect and hence expects only vector types to be distributed. Xegpu introduces another type (a tensor descriptor) that should be distributed. Although the only change required for the op to accommodate the new type is in distributed type validation, there is nothing specific to the vector dialect in the distribution process. The execute_on_lane_0 op can potentially distribute any shaped type (or a narrower set of DistributionTypeInterface implementations, as suggested in the PR), so it doesn't really belong in the vector dialect.
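
For reference, here is a minimal sketch of how the op distributes a vector today (the shapes and the producer op are illustrative):

// Lane 0 logically executes the region on the full vector<32xf32>;
// outside the region each of the 32 lanes owns a vector<1xf32> slice.
%r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "producer"() : () -> vector<32xf32>
  vector.yield %v : vector<32xf32>
}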

A natural solution to this would be to move the op out of the vector dialect; however, there seems to be no good place for it at the moment. This leads to the proposal of a distribution dialect that would contain operations such as warp_execute_on_lane_0 along with all the logic related to distribution.

Today there is a simple solution to the problem (a small change to the op validation: [MLIR][Vector] Allow any shaped type to be distributed for vector.wa… by kurapov-peter · Pull Request #114215 · llvm/llvm-project · GitHub), so the proposal is more of an idea for a cleaner design.

I’d like to collect feedback and concerns first, if any. Thoughts?


FWIW, the distribution with execute_on_lane_0 was used in IREE and we are actively trying to deprecate it. It is unclear to us downstream that this is a good way forward.

That aside, I think if this is useful for you, moving it out of the vector dialect is a prerequisite. If you make it work on tensor-like types, then a question would be what does “execute on lane 0” mean for a non-vector type. It was primarily meant for experimentation with a GPU code generation flow. Just repurposing it for something else does not seem great.

An alternative would be to have a VectorTypeInterface. We could make the distribution verification check for the interface, and the xeGPU ops could implement that interface. I don't keep track of xeGPU, but since it is in tree, I am wondering why it wasn't using vector types to begin with if the intent was to use such a distribution logic.

This reminds me of the mesh dialect. Even though it sits at a higher level, its mapping of a tensor to a mesh (e.g., a grid) of devices looks very similar to mapping a vector to a map. In mesh this step is called “spmdization” and works exactly as you suggest: operations implement the spmdize method of the ShardingInterface.

I wonder if there is an opportunity for a separate dialect that can be used in both worlds. The current state of ShardingInterface looks a bit convoluted anyway, and this might be a good opportunity to separate concerns in a more generic way.

In mesh this step is called “spmdization” and works exactly as you suggest: operations implement the spmdize method of the ShardingInterface.

That does sound similar, thanks, I’ll take a look.

a question would be what does “execute on lane 0” mean for a non-vector type

I don’t have the same mental model of the thing, obviously. To me, the transition from operations working on tensors to SIMT looks roughly like assigning the right portions of data to logical threads. If we run it through “execute on lane 0”, the first thread in a subgroup takes ownership of all the data portions (whatever shaped type they are) that the other threads in the subgroup had.
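
To make that concrete, here is a hypothetical sketch of what the relaxed op could look like for a non-vector shaped type (the tensor_desc shapes and the 16-lane split are illustrative, not an existing pattern):

// Inside the region, lane 0 owns the full 8x16 descriptor; outside,
// each of the 16 lanes owns an 8x1 column of it.
%d = vector.warp_execute_on_lane_0(%laneid)[16]
    -> (!xegpu.tensor_desc<8x1xf16>) {
  %td = xegpu.create_nd_tdesc %src[0, 0]
      : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  vector.yield %td : !xegpu.tensor_desc<8x16xf16>
}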

An alternative would be to have a VectorTypeInterface. We could make the distribution verification check for the interface, and the xeGPU ops could implement that interface.

I thought about this too. It solves the problem at hand, although I don't see much of a difference between an interface and the shaped type. After all, any shaped type could implement the interface, so we'd end up with a redundant layer. I'd prefer a more generic solution here.

I am wondering why it wasn't using vector types to begin with if the intent was to use such a distribution logic.

I don’t think this was a part of the original design, but I didn’t participate in it. @Jianhui-Li may be the best person to comment on this.

I really think you are looking to move this operation out of the vector dialect. That would also be fine by me. I don't have any stake in the operation itself. I just don't think something that is supposed to be scoped to the vector dialect should suddenly be allowed for all ShapedTypes.

As discussed in the PR, creating a distribution dialect and a DistributionTypeInterface to handle the distribution semantics and transformations makes sense to me. That may help accommodate the layout that you are using in your downstream type. However, this requires careful thought and input from those experienced with vector distribution.

Regarding the VectorTypeInterface, I’ve been experimenting with it to better model scalable and fixed-length vector semantics. When I shared this with others, some viewed this interface as a way to decorate the vector type with various semantically loaded attributes, including layouts. However, this is not the intent of this interface. I’ve been digging deeper into vector/tensor layout encodings lately, and it’s now evident to me that using layouts to model in-register data transformations is a different paradigm from the existing vector model. Namely, existing vector operations like transposes, shuffles, and data permutations in general conflict with the concept of layouts. I don’t think we could accommodate layouts in the current vector type/dialect, regardless of having the VectorTypeInterface.


Agreed. I can start with something like the DistributionTypeInterface, but I’m not sure what exactly it would do yet. I could keep it empty for now and just check whether the type implements it in the warp_execute_on_lane_0 validation.

I would also like to hear from somebody with more experience in vector distribution.

I implemented some parts of execute_on_lane_0 some time ago.

Xegpu introduces another type (a tensor descriptor) that should be distributed.

What benefit do you get by extending execute_on_lane_0 to other types? If I remember correctly, we don’t have any transformations that operate on execute_on_lane_0 ops.

We have rewrite patterns that match an execute_on_lane_0(OP), where OP can be vector.transfer_read, vector.transfer_write, … These are patterns that extract a nested op from an execute_on_lane_0 op. If you are using non-vector types/ops, you cannot reuse any of these patterns.
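
As an illustration of what these patterns do (a sketch with made-up shapes; the actual patterns live in VectorDistribute.cpp):

// Before: the read sits inside the region and produces the full vector.
%r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = vector.transfer_read %mem[%c0], %pad
      : memref<32xf32>, vector<32xf32>
  vector.yield %v : vector<32xf32>
}

// After: the read is extracted from the region; each lane reads its
// own vector<1xf32> slice directly.
%r = vector.transfer_read %mem[%laneid], %pad
    : memref<32xf32>, vector<1xf32>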

So the only benefit I see is that you can avoid copying the vector.execute_on_lane_0 op in your project and save maybe 100 lines of code. But you won’t be able to use any of the functionality that we implemented around execute_on_lane_0.

Agreed. I can start with something like the DistributionTypeInterface, but I’m not sure what exactly it would do yet.

I’m also wondering about that. We should clarify that before adding a new interface.


What benefit do you get by extending execute_on_lane_0 to other types?

xegpu loads use a custom type to create a view into a memref, but return a vector, so any subsequent operations can reuse existing vector patterns. For example, in the demo ([MLIR][XeGPU] XeGPU ops distribution demo by kurapov-peter · Pull Request #111989 · llvm/llvm-project · GitHub) I’m reusing WarpOpDeadResult, WarpOpForwardOperand, WarpOpElementwise, and WarpOpConstant. The idea was that I add rewrites for xegpu ops and put those together with the vector rewrites to create a complete distribution pass. In other words, it’s not only the op but most of the rewrites as well.
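
For illustration (the shapes are made up), the descriptor is the only non-vector value in the chain, and everything after the load is plain vector IR:

// The tensor descriptor is a view into the memref; the load itself
// yields an ordinary vector that existing vector patterns understand.
%td = xegpu.create_nd_tdesc %src[0, 0]
    : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
%v = xegpu.load_nd %td
    : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
%e = math.exp %v : vector<8x16xf16>  // handled by WarpOpElementwise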

I’m also wondering about that. We should clarify that before adding a new interface.

Yup. It would also need to live inside builtin type interfaces so that vector could pick it up. At first glance, it could provide some common functionality for shape calculation. Example:

//===----------------------------------------------------------------------===//
// DistributableType
//===----------------------------------------------------------------------===//
def DistributableTypeInterface : TypeInterface<"DistributableType", [ShapedTypeInterface]> {
  let cppNamespace = "::mlir";

  let description = [{
    Interface for types that can be distributed when converting from vector-like
    representation to SIMT-like programming models. 
  }];

  let methods = [
    InterfaceMethod<
      /*description=*/"Return a distributed shape based on the original type, "
                      "the distribution map, and the factors array.",
      /*retTy=*/"::llvm::FailureOr<::llvm::SmallVector<int64_t>>",
      /*methodName=*/"getDistributedShape",
      /*args=*/(ins "::mlir::ShapedType":$originalType,
                    "::mlir::AffineMap":$map,
                    "::mlir::ArrayRef<int64_t>":$factors),
      /*methodBody=*/"",
      /*defaultImplementation=*/[{ return ::mlir::failure(); }]>,
  ];
}

I’m not sure that what you are trying to accomplish can be done with the current vector dialect’s vector distribution.

The current vector distribution implementation really only works when distributing a single dimension, and that dimension has to be maintained during the entire distribution.

For example: something like this can be distributed:

%a = vector.transfer_read %mem[%c0], %pad : memref<32xf32>, vector<32xf32>
%result = vector.reduction <add>, %a : vector<32xf32> into f32

Vector distribution will pick the result of vector.reduction and start distributing threads along the innermost dimension (or whatever dimension you choose) from that result. Notice that there is no real control over how you distribute that dimension (you will notice that the implementation distributes things implicitly).

The whole reason this works is that the result of a vector.reduction is a scalar, which is implicitly not distributed. Vector distribution only really works when the first “vector operation” is a single 1-D reduction, and the inputs to that reduction are elementwise ops or reads.
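
Roughly, the distributed form of the example above looks like this (a sketch; the exact ops the patterns emit may differ):

// Each of the 32 lanes reads its own element; the lanes then combine
// their partial values through a subgroup reduction.
%a = vector.transfer_read %mem[%laneid], %pad : memref<32xf32>, vector<1xf32>
%e = vector.extract %a[0] : f32 from vector<1xf32>
%result = gpu.subgroup_reduce add %e : (f32) -> f32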

Something like this will never work with the current implementation:

%a = vector.transfer_read %mem[%c0, %c0], %pad : memref<24x32xf32>, vector<24x32xf32>
%b = vector.transpose %a, [1, 0] : vector<24x32xf32> to vector<32x24xf32>
%c = vector.multi_reduction <add>, %b, %acc [1] : vector<32x24xf32> to vector<32xf32>

Vector distribution has no understanding of how it’s distributing things; it simply distributes greedily along the inner dimension, which is why this will not work. (Notice there are no distribution patterns for vector.transpose/vector.broadcast, because they change the “innermost dimension”. shape_cast works because it’s a reshape.)

I noticed some examples in your PRs and I see two problems:

  1. You have a create_nd_desc operation that describes a layout on the type and allows your operations to describe how to distribute. Vector dialect operations do not have this information and do not have to obey it. I did not see any examples in your PR where the distribution layout is not on the innermost dimension. Can you try some examples which do distribution on multiple dimensions?
  2. You are trying to introduce multiple dimensions in a framework that really only supports 1-D dimensions properly. Do you have examples of doing a reduction and storing it using xegpu.store_nd?

In my experience, when I tried doing vector distribution on N-D vectors, things quickly started breaking down. I don’t have any problems with vector distribution being extended, as long as there is a valid use case.

In IREE, we noticed this problem with N-D vectors and rewrote the entire transformation, with a focus on keeping the information about how things should be distributed available to each operation during distribution. I’m happy to talk about moving to a solution where we can support N-D shaped type distribution, but would like to hear your experience on it first.

On the topic of creating interfaces/new dialects/etc., we could just move this operation to be gpu.execute_on_lane_0. The operation anyway has semantics defined for a SIMT GPU model and should not really live in the vector dialect. But I think those are the easier problems to solve. The harder problem is whether this operation is even the right abstraction to do distribution. From my experience, it’s really not. You instead need an operation that retains at its boundaries how the results/operands are distributed, which can be used to infer how an operation should be distributed when moving it out of the region (scf.forall/scf.parallel have this, which is why they can do nicer things).
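
For comparison, here is a sketch of what scf.forall retains at its boundary (the shapes and the mapping are illustrative):

// The mapping attribute and the slice offsets make the ownership of
// each tensor<1xf32> piece explicit at the region boundary.
%res = scf.forall (%i) in (32) shared_outs(%o = %init) -> (tensor<32xf32>) {
  %slice = tensor.extract_slice %in[%i] [1] [1]
      : tensor<32xf32> to tensor<1xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %slice into %o[%i] [1] [1]
        : tensor<1xf32> into tensor<32xf32>
  }
} {mapping = [#gpu.thread<x>]}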


Thanks for the input!

Can you try some examples which do distribution on multiple dimensions?

Sure, can do. I don’t expect it to work right away either.

You have a create_nd_desc operation that describes a layout on the type and allows your operations to describe how to distribute.

Right, the idea was to guide distributed type inference with explicit knowledge from the xegpu level, where we have an sg_map attribute that produces the right type for other vector-based consumers. And if that didn’t work, the next step would have been to supply existing rewrites with a custom distribution function.
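
For reference, a sketch of what sg_map carries (the shapes are illustrative):

// sg_map says how the 8x16 tile is split across the 16 lanes of a
// subgroup: each work item owns an 8x1 slice of the descriptor.
%td = xegpu.create_nd_tdesc %src[0, 0] : memref<8x16xf16>
    -> !xegpu.tensor_desc<8x16xf16,
         #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>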

That said, xegpu doesn’t currently have reductions and I haven’t tried combining them.

supply existing rewrites with a custom distribution function

The current transformations make some assumptions about the implicit distribution, which was a problem I hit when trying to support more than what the distribution handles today.

Also, just to set expectations from my last message, I’m offering past experience with the distribution framework and suggesting you try out the cases I had problems with, before making major changes. If you do find the same issues, we can try to come up with a better solution.

If you don’t want to do that, you can instead send an RFC to move the execute_on_lane_0 operation to GPU dialect (since it really models a region in a GPU SIMT model). That is a good change in itself and would accomplish what you want. As long as the new distribution patterns live in the XeGPU dialect and don’t change the existing patterns, I don’t think there will be any blockers. Trying to support more vector dialect operations that do any smarter distribution, however, will probably need a rework, and there might be some blockers there.

+1

I’m not very involved in this work even though I started it at the time. As others mentioned, in retrospect I’m not sure this way of distributing shaped types can scale very well.

Since I don’t work in this area anymore, I don’t want to give an opinion on what should be done. One suggestion I would offer: if you are not sure how it should look, iterating downstream might be simpler for you and give you more freedom.

Also, just to set expectations from my last message, I’m offering past experience with the distribution framework and suggesting you try out the cases I had problems with, before making major changes. If you do find the same issues, we can try to come up with a better solution.

Sounds good to me, I’ll do some experimentation and report back.
I have a vague idea of a generic mechanism to perform this and other kinds of distribution, like the sharding Frank mentioned. Those seem to do the same thing: split work into independent portions and provide a way to do some communication by materialization (e.g., in shared memory). I’ll dig a bit more to figure out whether the requirements are indeed similar.

you can instead send an RFC to move the execute_on_lane_0 operation to GPU dialect (since it really models a region in a GPU SIMT model). That is a good change in itself and would accomplish what you want.

This seems to be somewhat orthogonal and able to unblock in-tree experimentation? In my case in-tree would actually be much more convenient, so I’m happy to do it.

Yes, you can do it in parallel. It would probably involve moving the distribution utilities and framework to the GPU dialect, exposing the distribution utilities like you did in your earlier patch, and keeping the vector distribution patterns inside the vector dialect. I would be supportive of this, not sure about others.
