[MLIR][XeGPU] XeGPU ops distribution demo #111989

Closed
wants to merge 3 commits into from

Conversation

kurapov-peter
Contributor

Hi all!

We recently started looking into XeGPU SIMT lowering, so I'm posting this crude prototype of distribution (to work items, i.e., single logical threads) to collect early feedback and motivate the upcoming changes. It is based on vector::WarpExecuteOnLane0, which produces nicely isolated rewrite patterns, integrates well with existing code, and requires relatively little new code. For demonstration's sake I've combined it with the existing rewriters from vector (just copied into the pass); see the test case. XeGPU ops can be sunk through the yield op of WarpExecuteOnLane0 in the same way as vector ops (here I'm hoisting CreateNdDesc ops, which I don't think is necessary; it's just an artifact of experimentation), and the sg_map (sub-group mapping) attribute is used to infer the distributed type.
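
To make the type inference concrete, here is a hand-written sketch of the warp op before the xegpu ops are sunk out (syntax hand-prettified and approximate; %src and %laneid are illustrative names; shapes match the test case below). The region yields the full sub-group tile, while the warp op result carries the per-lane type derived from sg_map:

%r = vector.warp_execute_on_lane_0(%laneid)[16] -> (vector<24x2xf16>) {
  // %src is a memref<24x32xf16> defined above the region (the test case
  // passes it as an explicit warp-op argument instead)
  %desc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf16>
      -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
  %tile = xegpu.load_nd %desc
      : !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
      -> vector<24x32xf16>
  // the yield carries the sub-group-level type ...
  vector.yield %tile : vector<24x32xf16>
}
// ... while wi_layout = [1, 16] splits the second dimension across the 16
// lanes, so each work item gets a vector<24x2xf16> out of the warp op.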

Here is a summary of what this demo does:

  1. Move the kernel function body into an enclosing vector.warp_execute_on_lane_0 op. This assumes a single block; the real implementation should create one warp op per block (see the sketch after this list).
  2. Run rewrites that sink ops through warp_execute_on_lane_0 by consuming its yielded values.
  3. Run a set of existing vector patterns.
  4. Ignore type consistency during the transformation: the nd_desc type is treated as uniform here and does not describe the portion of the data owned by a single work item.
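
For step 1, the wrapping produces roughly the following skeleton (hand-written sketch with illustrative names; the demo additionally constrains gpu.lane_id with upper_bound = 16 to match the warp size, as visible in the dump below):

func.func @kernel(%src: memref<24x32xf16>, %dst: memref<24x32xf16>) {
  %laneid = gpu.lane_id
  vector.warp_execute_on_lane_0(%laneid)[16] {
    // the original single-block kernel body goes here, still operating on
    // full sub-group shapes; later rewrites sink ops out through the yield
  }
  return
}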

The changes needed for this would be:

  1. A mechanical refactoring that makes some of the distribution logic available outside vector, to avoid duplication.
  2. A solution for preserving xegpu descriptor type consistency during transformations.

The former also includes a way to combine the rewriters of both vector and xegpu into a single pass, since they will depend on each other.

The latter stems from the current restriction in the distribution logic that only VectorType is distributable. Properly distributing xegpu's tensor descriptor type hits the following error after the transformation:

"func.func"() <{function_type = (memref<24x32xf16>, memref<24x32xf16>) -> (), sym_name = "test_load_store_nd_distribution"}> ({
^bb0(%arg0: memref<24x32xf16>, %arg1: memref<24x32xf16>):
  %0 = "gpu.lane_id"() <{upper_bound = 16 : index}> : () -> index
  %1:2 = "vector.warp_execute_on_lane_0"(%0, %arg0, %arg1) <{warp_size = 16 : i64}> ({
  ^bb0(%arg2: memref<24x32xf16>, %arg3: memref<24x32xf16>):
    %2 = "arith.constant"() <{value = dense<1.000000e+00> : vector<24x32xf16>}> : () -> vector<24x32xf16>
    %3 = "xegpu.create_nd_tdesc"(%arg2) <{const_offsets = array<i64: 0, 0>, const_strides = array<i64: 32, 1>, operandSegmentSizes = array<i32: 1, 0, 0, 0>}> : (memref<24x32xf16>) -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
    %4 = "xegpu.create_nd_tdesc"(%arg3) <{const_offsets = array<i64: 0, 0>, const_strides = array<i64: 32, 1>, operandSegmentSizes = array<i32: 1, 0, 0, 0>}> : (memref<24x32xf16>) -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
    %5 = "xegpu.load_nd"(%3) <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>}> : (!xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>) -> vector<24x32xf16>
    %6 = "arith.addf"(%5, %2) <{fastmath = #arith.fastmath<none>}> : (vector<24x32xf16>, vector<24x32xf16>) -> vector<24x32xf16>
    "vector.yield"(%4, %6) : (!xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>, vector<24x32xf16>) -> ()
  }) : (index, memref<24x32xf16>, memref<24x32xf16>) -> (!xegpu.tensor_desc<24x2xf16, #xegpu.scatter_tdesc_attr<memory_space =  global, chunk_size = 1 : i64>>, vector<24x2xf16>)
  "xegpu.store_nd"(%1#1, %1#0) <{l1_hint = #xegpu.cache_hint<write_back>, l2_hint = #xegpu.cache_hint<uncached>}> : (vector<24x2xf16>, !xegpu.tensor_desc<24x2xf16, #xegpu.scatter_tdesc_attr<memory_space =  global, chunk_size = 1 : i64>>) -> ()
  "func.return"() : () -> ()
}) : () -> ()
llvm-project/mlir/test/Dialect/XeGPU/xegpu-distribute-to-wi.mlir:1 offset :9:1: error: 'vector.warp_execute_on_lane_0' op expected vector type for distributed operands.

The question is whether it would be reasonable to relax the constraint to, say, ShapedType. Or more broadly, would it make sense to make the distribution logic not specific to vector?
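
Concretely, a relaxed check would have to accept a yield/result pair like the following, applying the same shape ratio to the descriptor as to vectors (hypothetical sketch: the verifier currently rejects this, and the exact encoding of the distributed descriptor type is an open question; the prototype in the dump above currently produces a scatter_tdesc_attr-encoded type instead):

%desc = vector.warp_execute_on_lane_0(%laneid)[16]
    -> (!xegpu.tensor_desc<24x2xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>) {
  %d = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf16>
      -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
  vector.yield %d : !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
}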

There's also one interesting caveat: some xegpu operations, such as load_nd, lower to intrinsics that assume a fully active subgroup; otherwise the behavior is undefined. This will impact the lowering of warp_execute_on_lane_0 to scf.if.
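
For illustration, the existing lowering of warp_execute_on_lane_0 guards whatever stays inside the region with a lane-0 check, roughly as below (the shared-memory buffers used to pass values in and out are omitted). Any xegpu.load_nd left under that guard would run with only one active lane and hit the undefined-behavior case:

%c0 = arith.constant 0 : index
%is_lane0 = arith.cmpi eq, %laneid, %c0 : index
scf.if %is_lane0 {
  // ops that were not distributed execute here, on lane 0 only
}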
