[MLIR][XeGPU] XeGPU ops distribution demo #111989

Closed
wants to merge 3 commits into from

Conversation

kurapov-peter
Contributor

Hi all!

We recently started looking into XeGPU SIMT lowering, so I'm posting this crude prototype of distribution (to work items, i.e., single logical threads) to collect early feedback and motivate the upcoming changes. It is based on vector::WarpExecuteOnLane0, which produces nicely isolated rewrite patterns, integrates well with existing code, and requires relatively little new code. For demonstration's sake I've combined it with the existing rewriters from vector (just copied into the pass); see the test case. XeGPU ops can be sunk through the yield op of WarpExecuteOnLane0 in the same way as vector ops (here I'm hoisting CreateNdDesc ops, which I don't think is necessary; it's just an artifact of experimentation), and the sg_map (sub-group mapping) attribute is used to infer the distributed type.
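
To make the type inference concrete, here is a hand-written sketch of the warp op before the xegpu ops are sunk out (syntax hand-prettified and approximate; %src and %laneid are illustrative names; shapes match the test case below). The region yields the full sub-group tile, while the warp op result carries the per-lane type derived from sg_map:

%r = vector.warp_execute_on_lane_0(%laneid)[16] -> (vector<24x2xf16>) {
  // %src is a memref<24x32xf16> defined above the region (the test case
  // passes it as an explicit warp-op argument instead)
  %desc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf16>
      -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
  %tile = xegpu.load_nd %desc
      : !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
      -> vector<24x32xf16>
  // the yield carries the sub-group-level type ...
  vector.yield %tile : vector<24x32xf16>
}
// ... while wi_layout = [1, 16] splits the second dimension across the 16
// lanes, so each work item gets a vector<24x2xf16> out of the warp op.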

Here is a summary of what this demo does:

  1. Move the kernel function body into an enclosing vector.warp_execute_on_lane_0 op. This assumes a single block; the real implementation should create one warp op per block (see the sketch after this list).
  2. Run rewrites that sink ops through warp_execute_on_lane_0 by consuming its yielded values.
  3. Run a set of existing vector patterns.
  4. Ignore type consistency during the transformation: the nd_desc type is treated as uniform here and does not describe the portion of the data owned by a single work item.
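
For step 1, the wrapping produces roughly the following skeleton (hand-written sketch with illustrative names; the demo additionally constrains gpu.lane_id with upper_bound = 16 to match the warp size, as visible in the dump below):

func.func @kernel(%src: memref<24x32xf16>, %dst: memref<24x32xf16>) {
  %laneid = gpu.lane_id
  vector.warp_execute_on_lane_0(%laneid)[16] {
    // the original single-block kernel body goes here, still operating on
    // full sub-group shapes; later rewrites sink ops out through the yield
  }
  return
}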

The changes needed for this would be:

  1. A mechanical refactoring that makes some of the distribution logic available outside vector, to avoid duplication.
  2. A solution for preserving xegpu descriptor type consistency during transformations.

The former also includes a way to combine the rewriters of both vector and xegpu into a single pass, since they will depend on each other.

The latter stems from the current restriction in the distribution logic that only VectorType is distributable. Properly distributing xegpu's tensor descriptor type hits the following error after the transformation:

"func.func"() <{function_type = (memref<24x32xf16>, memref<24x32xf16>) -> (), sym_name = "test_load_store_nd_distribution"}> ({
^bb0(%arg0: memref<24x32xf16>, %arg1: memref<24x32xf16>):
  %0 = "gpu.lane_id"() <{upper_bound = 16 : index}> : () -> index
  %1:2 = "vector.warp_execute_on_lane_0"(%0, %arg0, %arg1) <{warp_size = 16 : i64}> ({
  ^bb0(%arg2: memref<24x32xf16>, %arg3: memref<24x32xf16>):
    %2 = "arith.constant"() <{value = dense<1.000000e+00> : vector<24x32xf16>}> : () -> vector<24x32xf16>
    %3 = "xegpu.create_nd_tdesc"(%arg2) <{const_offsets = array<i64: 0, 0>, const_strides = array<i64: 32, 1>, operandSegmentSizes = array<i32: 1, 0, 0, 0>}> : (memref<24x32xf16>) -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
    %4 = "xegpu.create_nd_tdesc"(%arg3) <{const_offsets = array<i64: 0, 0>, const_strides = array<i64: 32, 1>, operandSegmentSizes = array<i32: 1, 0, 0, 0>}> : (memref<24x32xf16>) -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
    %5 = "xegpu.load_nd"(%3) <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>}> : (!xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>) -> vector<24x32xf16>
    %6 = "arith.addf"(%5, %2) <{fastmath = #arith.fastmath<none>}> : (vector<24x32xf16>, vector<24x32xf16>) -> vector<24x32xf16>
    "vector.yield"(%4, %6) : (!xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>, vector<24x32xf16>) -> ()
  }) : (index, memref<24x32xf16>, memref<24x32xf16>) -> (!xegpu.tensor_desc<24x2xf16, #xegpu.scatter_tdesc_attr<memory_space =  global, chunk_size = 1 : i64>>, vector<24x2xf16>)
  "xegpu.store_nd"(%1#1, %1#0) <{l1_hint = #xegpu.cache_hint<write_back>, l2_hint = #xegpu.cache_hint<uncached>}> : (vector<24x2xf16>, !xegpu.tensor_desc<24x2xf16, #xegpu.scatter_tdesc_attr<memory_space =  global, chunk_size = 1 : i64>>) -> ()
  "func.return"() : () -> ()
}) : () -> ()
llvm-project/mlir/test/Dialect/XeGPU/xegpu-distribute-to-wi.mlir:1 offset :9:1: error: 'vector.warp_execute_on_lane_0' op expected vector type for distributed operands.

The question is whether it would be reasonable to relax the constraint to, say, ShapedType. Or more broadly, would it make sense to make the distribution logic not specific to vector?
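
Concretely, a relaxed check would have to accept a yield/result pair like the following, applying the same shape ratio to the descriptor as to vectors (hypothetical sketch: the verifier currently rejects this, and the exact encoding of the distributed descriptor type is an open question; the prototype in the dump above currently produces a scatter_tdesc_attr-encoded type instead):

%desc = vector.warp_execute_on_lane_0(%laneid)[16]
    -> (!xegpu.tensor_desc<24x2xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>) {
  %d = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf16>
      -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
  vector.yield %d : !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
}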

There's also one interesting caveat: some xegpu operations, such as load_nd, lower to intrinsics that assume a fully active subgroup; otherwise the behavior is undefined. This will impact the lowering of warp_execute_on_lane_0 to scf.if.
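
For illustration, the existing lowering of warp_execute_on_lane_0 guards whatever stays inside the region with a lane-0 check, roughly as below (the shared-memory buffers used to pass values in and out are omitted). Any xegpu.load_nd left under that guard would run with only one active lane and hit the undefined-behavior case:

%c0 = arith.constant 0 : index
%is_lane0 = arith.cmpi eq, %laneid, %c0 : index
scf.if %is_lane0 {
  // ops that were not distributed execute here, on lane 0 only
}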
