[RFC][OpenMP] Translation of allocate clause to LLVM IR in taskgroup directive

Problem:

This RFC proposes adding lowering support for the allocate clause on the taskgroup construct to LLVM IR.

Example of the allocate clause on taskgroup:

program test
  use omp_lib
  integer :: x
  x = 0
  !$omp taskgroup allocate(allocator(omp_default_mem_alloc):x) task_reduction(+:x)
  !$omp task in_reduction(+:x)
    x = x + 1
  !$omp end task
  !$omp end taskgroup
end program

Analysing Clang’s behaviour for allocate directive/clause:

  • Clang’s behavior for allocate directive with allocator and align clause

    • #pragma omp allocate(x) allocator(omp_thread_mem_alloc) :
      • Clang emits a runtime call to @__kmpc_alloc(int gtid, size_t size, omp_allocator_handle_t allocator) for the allocator clause (See LLVM_IR).
    • #pragma omp allocate(x) align(64) :
      • Clang emits a runtime call to @__kmpc_aligned_alloc(int gtid, size_t algn, size_t size, omp_allocator_handle_t allocator) for the align clause (See LLVM_IR).
  • Clang’s behavior for parallel directive with allocate clause

    • #pragma omp parallel allocate(omp_thread_mem_alloc:x) reduction(+:x) :
      • Runtime call to @__kmpc_alloc(i32 %10, i64 4, ptr inttoptr (i64 8 to ptr)) (See LLVM_IR).

    With the parallel construct, the private copies of reduction variables are allocated through a runtime call to __kmpc_alloc(int gtid, size_t size, omp_allocator_handle_t allocator).

  • Clang’s behavior for taskgroup construct with allocate clause.

    #include<omp.h>
    void test() {
      int x = 0;
      #pragma omp taskgroup allocate(omp_thread_mem_alloc:x) task_reduction(+:x)
      {
        #pragma omp task in_reduction(+:x)
        {
          x = x + 1;
        }
      }
    }
    
    1. In practice, Clang’s behavior serves as the reference when implementing features in Flang.
    2. For directives like parallel and allocate:
      • Clang emits runtime calls to @__kmpc_alloc() for the allocate/allocator clause and to @__kmpc_aligned_alloc() for the align clause.
    3. For the allocate clause on the taskgroup directive:
      • Allocation of private copies of reduction variables is handled by ___kmp_allocate(size_t size) inside __kmpc_taskred_init(int gtid, int num, void *data).
    4. From this comparison, it appears that Clang may not fully handle the allocate clause on the taskgroup construct.
      • There is no observable difference in the generated LLVM IR with or without the allocate clause on a taskgroup construct (See LLVM_IR).
    5. The expected behavior for an allocate clause on taskgroup would be to allocate private copies of reduction variables through runtime calls to @__kmpc_alloc() or @__kmpc_aligned_alloc(), but this is not observed.
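
The difference between points 2 and 3 can be modelled in a few lines of C. This is a simplified sketch, not the libomp source: model_kmpc_alloc, model_taskred_init, red_item_t and last_allocator_seen are hypothetical stand-ins, and both paths use calloc underneath. It only illustrates the shape of the two interfaces: the parallel-style path receives an allocator handle, while the taskgroup-style path sees nothing but the item size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for omp_allocator_handle_t; 8 matches the
   `inttoptr (i64 8 to ptr)` operand in the parallel example. */
typedef uintptr_t model_allocator_t;
enum { MODEL_THREAD_MEM_ALLOC = 8 };

/* Records the last allocator handle the model "runtime" was given. */
static model_allocator_t last_allocator_seen = 0;

/* Parallel-style path: the allocator handle is an explicit argument. */
static void *model_kmpc_alloc(int gtid, size_t size, model_allocator_t al) {
  (void)gtid;
  last_allocator_seen = al;
  return calloc(1, size);
}

/* Taskgroup-style path: each item carries only its size. */
typedef struct red_item {
  size_t reduce_size;
  void *reduce_priv; /* filled in by the init routine */
} red_item_t;

static void model_taskred_init(int gtid, int num, red_item_t *items,
                               int nthreads) {
  (void)gtid;
  assert(num >= 0 && nthreads > 0);
  for (int i = 0; i < num; ++i) {
    /* No allocator or alignment reaches this point, so a plain
       allocation sized per thread is the only option. */
    items[i].reduce_priv = calloc((size_t)nthreads, items[i].reduce_size);
  }
}
```

In this model, calling model_taskred_init leaves last_allocator_seen untouched, which mirrors the observation that the allocate clause has no channel into the task reduction setup.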

Solutions desired:

Here are three potential approaches for incorporating allocate clause support in taskgroup:

Solution 1:

Based on the analysis, for a taskgroup construct with an allocate clause, Clang allocates the private copies of task reduction variables using __kmp_allocate(size_t size) rather than __kmpc_alloc() or __kmpc_aligned_alloc(). Since Clang does not fully handle the allocate clause on the taskgroup construct, we could follow the same approach in Flang. If that’s the case, nothing needs to be done for the allocate clause.

Solution 2:

The runtime call __kmpc_taskred_init(), used to initialize task reduction, only receives task reduction details and no data about the allocator or alignment specified in the allocate clause. We could introduce a new runtime function that replaces __kmpc_taskred_init() and also takes data from the allocate clause, such as the allocator and alignment settings. This new runtime function would allocate the private copies according to the specified allocator and alignment (calling @__kmpc_alloc() for an allocator clause or @__kmpc_aligned_alloc() for an alignment clause).
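
A rough sketch of the dispatch such a function could perform is shown below. Everything here is hypothetical: sketch_alloc and sketch_aligned_alloc stand in for __kmpc_alloc() and __kmpc_aligned_alloc() and are backed by the C library so that the example is self-contained, and sketch_alloc_info_t and sketch_taskred_alloc_priv are invented names, not an existing or proposed libomp API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t sketch_allocator_t; /* stand-in for omp_allocator_handle_t */

/* Stand-ins for __kmpc_alloc / __kmpc_aligned_alloc (assumption: the real
   calls would be used here; the allocator handle is ignored in this mock). */
static void *sketch_alloc(int gtid, size_t size, sketch_allocator_t al) {
  (void)gtid;
  (void)al;
  return malloc(size);
}

static void *sketch_aligned_alloc(int gtid, size_t algn, size_t size,
                                  sketch_allocator_t al) {
  (void)gtid;
  (void)al;
  /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
  size_t rounded = (size + algn - 1) / algn * algn;
  return aligned_alloc(algn, rounded);
}

/* Allocate-clause data the new entry point would receive (hypothetical). */
typedef struct sketch_alloc_info {
  sketch_allocator_t allocator;
  size_t alignment; /* 0 means no alignment was requested */
} sketch_alloc_info_t;

/* Hypothetical allocation step of the replacement for __kmpc_taskred_init(). */
static void *sketch_taskred_alloc_priv(int gtid, size_t size,
                                       const sketch_alloc_info_t *info) {
  assert(info != NULL);
  if (info->alignment != 0)
    return sketch_aligned_alloc(gtid, info->alignment, size, info->allocator);
  return sketch_alloc(gtid, size, info->allocator);
}
```

The existing entry point would stay untouched under this approach, so code compiled against the current interface would keep working; only compilers that want to honor the allocate clause would call the new function.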

Solution 3:

Task reduction is initialized by __kmpc_taskred_init(int gtid, int num, void *data), where data points to an array of kmp_taskred_input entries that describe the task reductions. The structure contains the following fields:

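typedef struct kmp_taskred_input {
  void *reduce_shar;   /* shared item that the tasks reduce into */
  void *reduce_orig;   /* original item, used for initialization */
  size_t reduce_size;  /* size of the data item */
  void *reduce_init;   /* data initialization routine */
  void *reduce_fini;   /* data finalization routine */
  void *reduce_comb;   /* data combiner routine */
  kmp_taskred_flags_t flags; /* additional flags from the compiler */
} kmp_taskred_input_t;

where kmp_taskred_flags_t is defined as:

typedef struct kmp_taskred_flags {
  /*! 1 - use lazy alloc/init (e.g. big objects, num tasks < num threads) */
  unsigned lazy_priv : 1;
  unsigned reserved31 : 31;
} kmp_taskred_flags_t;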

We could use the flags field of struct kmp_taskred_input to store the allocator information from the allocate clause and then trigger a call to __kmpc_alloc() with that allocator. For alignment requirements, a default argument could be added to __kmpc_taskred_init() to carry the alignment.
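
One way to picture this is to carve a small allocator field out of the 31 reserved bits. The sketch below is only an illustration under assumptions: sketch_taskred_flags_t, its allocator field, the 8-bit width, and sketch_set_allocator are hypothetical and not part of libomp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout quoted in this RFC: one used bit, 31 reserved. */
typedef struct kmp_taskred_flags {
  unsigned lazy_priv : 1;
  unsigned reserved31 : 31;
} kmp_taskred_flags_t;

/* Hypothetical variant: 8 of the reserved bits hold an allocator id. */
typedef struct sketch_taskred_flags {
  unsigned lazy_priv : 1;
  unsigned allocator : 8; /* assumption: field name and width */
  unsigned reserved23 : 23;
} sketch_taskred_flags_t;

/* Store a small allocator id; returns 0 and leaves the flags unchanged
   if the value does not fit into the 8-bit field. */
static int sketch_set_allocator(sketch_taskred_flags_t *flags,
                                uintptr_t allocator_id) {
  assert(flags != NULL);
  if (allocator_id > 0xFFu)
    return 0;
  flags->allocator = (unsigned)allocator_id;
  return 1;
}
```

The value 8 that appears in the parallel IR (inttoptr (i64 8 to ptr)) fits into such a field, and on common ABIs the struct stays the same size as the current one. A pointer-sized handle, which is what a user-defined allocator would likely be, would not fit, so this encoding would probably cover only small predefined allocator ids.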

Please review these proposed approaches and share your thoughts. Any additional insights or suggestions would be valuable for further development.

Thank you.

Thanks @kaviya2510 for the RFC.

Does ___kmp_allocate(size_t size) allocate in the memory specified by the allocator clause? Or does it not honour the allocate clause at all?

Do you mean to say that the implementation is not complete in Clang? Or do you mean to say that the implementation is complete but most of the handling for allocating memory is handled internally by the OpenMP runtime?

Hi @kiranchandramohan, thanks for the comments.
No, ___kmp_allocate(size_t size) does not allocate in the memory specified by the allocator clause. It allocates memory for the private variables using malloc() with the required size and default alignment.

In the taskgroup directive:

  • __kmp_allocate(size_t size) is called at runtime by __kmpc_taskred_init(int gtid, int num, void *data), which initializes task reduction; size comes from the kmp_taskred_input_t struct populated for the taskgroup. All allocations are therefore managed internally by the OpenMP runtime, which does not use the allocator/align information specified by the allocate clause on taskgroup.

Based on current observations, Clang does not appear to fully support the allocate clause (allocator or align) for taskgroup as it does for other directives like parallel and allocate.
It’s unclear whether this is due to incomplete support or an intentional design choice; we need to check this with the Clang community.
I have created an RFC for the Clang community. Please take a look at it for more details.