[MLIR][XeGPU] XeGPU ops distribution demo #111989
Hi all!
We recently started looking into xegpu SIMT lowering, so I'm posting this crude prototype of distribution (to work items, i.e. single logical threads) to collect early feedback and motivate upcoming changes. It is based on `vector::WarpExecuteOnLane0`, which seems to produce nice, isolated rewrite patterns, integrates well with existing code, and requires relatively little code. For demonstration's sake I've combined it with existing rewriters from vector (just copied into the pass); see the test case. Xegpu ops can be sunk through the `yield` op of the `WarpExecuteOnLane0` in a similar way to the vector ops (here I'm hoisting `CreateNdDesc` ops, which I don't think is necessary, just an artifact of experimentation), and they use the `sg_map` (sub-group mapping) attribute to infer the distributed type. A rough sketch of the kind of IR involved is shown below.
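To make this concrete, here is a hand-written sketch (not actual output of the pass; function name and shapes are made up, and op/attribute spellings may not match the current dialect exactly) of a body wrapped in the warp op, where the `sg_map` on the tensor descriptor determines the per-lane result type:

```mlir
// Sub-group mapping: 16 lanes along the second dimension, 1 element per lane.
#sg_map = #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>

// Before distribution: xegpu ops live inside the warp region and produce full
// sub-group sized values; the warp op result is the per-lane slice inferred
// from the sg_map (8x16 split across 16 lanes -> 8x1 per work item).
func.func @load(%laneid: index, %src: memref<8x16xf16>) -> vector<8x1xf16> {
  %r = vector.warp_execute_on_lane_0(%laneid)[16] -> (vector<8x1xf16>) {
    %tdesc = xegpu.create_nd_tdesc %src[0, 0]
        : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16, #sg_map>
    %v = xegpu.load_nd %tdesc
        : !xegpu.tensor_desc<8x16xf16, #sg_map> -> vector<8x16xf16>
    vector.yield %v : vector<8x16xf16>
  }
  return %r : vector<8x1xf16>
}
```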
Here is a summary of what this demo does:

- Creates a `vector.warp_execute_on_lane_0` op around the function body. This assumes a single block; for the real thing it should create one for each block.
- Sinks xegpu ops out of the `warp_execute_on_lane_0` by consuming its yielded values (see the sketch right after this list).
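Conceptually (again hand-written, not pass output), after hoisting the descriptor creation and sinking the load, the IR from the sketch above would become something like:

```mlir
#sg_map = #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>

// After distribution (conceptual): create_nd_tdesc is hoisted above the warp
// op, load_nd is sunk below it, and the now-empty warp op can be dropped. The
// load directly produces the per-lane type inferred from the sg_map.
func.func @load(%laneid: index, %src: memref<8x16xf16>) -> vector<8x1xf16> {
  %tdesc = xegpu.create_nd_tdesc %src[0, 0]
      : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16, #sg_map>
  %v = xegpu.load_nd %tdesc
      : !xegpu.tensor_desc<8x16xf16, #sg_map> -> vector<8x1xf16>
  return %v : vector<8x1xf16>
}
```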
The changes needed for this would be:

- A way to combine the rewriters of both vector and xegpu into a single pass, since they'll depend on each other.
- Lifting the current restriction of distribution, which expects only `VectorType` to be distributable. Proper type distribution of xegpu's tensor descriptor type hits an error after the transformation. The question is whether it would be reasonable to relax the constraint to, say, `ShapedType`, or, more broadly, whether it would make sense to make the distribution logic not specific to vector (an illustrative snippet follows this list).
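To illustrate the restriction, this is roughly the IR the transformation would want to produce when the tensor descriptor itself has to cross the warp-op boundary (hand-written and hypothetical; the per-lane descriptor shape is just one possible choice). Since only `VectorType` results are allowed to change shape across the warp op today, a shape-changing distribution of the descriptor type is rejected:

```mlir
#sg_map = #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>

// Hypothetical: distributing the tensor descriptor itself. The yielded
// sub-group descriptor (8x16) would need to become a per-lane descriptor
// (e.g. 8x1), but non-vector result types currently cannot be distributed,
// so this fails to verify.
func.func @desc(%laneid: index, %src: memref<8x16xf16>)
    -> !xegpu.tensor_desc<8x1xf16> {
  %tdesc = vector.warp_execute_on_lane_0(%laneid)[16]
      -> (!xegpu.tensor_desc<8x1xf16>) {
    %t = xegpu.create_nd_tdesc %src[0, 0]
        : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16, #sg_map>
    vector.yield %t : !xegpu.tensor_desc<8x16xf16, #sg_map>
  }
  return %tdesc : !xegpu.tensor_desc<8x1xf16>
}
```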
There's also one interesting caveat: some of the operations in xegpu, like `load_nd`, will be lowered to intrinsics that assume a fully active subgroup; otherwise the behavior is undefined. This will impact the lowering to `scf.if`.
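For context, the existing lowering of the warp op guards whatever is left of the region body so that only lane 0 executes it, roughly like this (sketch of the pattern's output, not verbatim; the shared-memory buffer used to exchange values is omitted):

```mlir
// Sketch of what vector.warp_execute_on_lane_0 lowers to today: the remaining
// region body runs under a lane-0 guard.
func.func @guard(%laneid: index) {
  %c0 = arith.constant 0 : index
  %is_lane0 = arith.cmpi eq, %laneid, %c0 : index
  scf.if %is_lane0 {
    // Any xegpu op that was not distributed out of the warp region would end
    // up here with a single active lane, violating the full-subgroup
    // assumption of the underlying intrinsics.
  }
  return
}
```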