Problem:
Adding lowering support for allocate clause in taskgroup construct to LLVM IR.
Sample example of using allocate clause for taskgroup:
program test
  use omp_lib
  integer :: x
  !$omp taskgroup allocate(allocator(omp_default_mem_alloc):x) task_reduction(+:x)
  !$omp task in_reduction(+:x)
  x = x + 1
  !$omp end task
  !$omp end taskgroup
end program
Analysing Clang’s behaviour for allocate directive/clause:
- Clang’s behavior for allocate directive with allocator and align clause
  - #pragma omp allocate(x) allocator(omp_thread_mem_alloc) :
    - It emits a runtime call to @__kmpc_alloc(int gtid, size_t size, omp_allocator_handle_t allocator) for the allocator clause (See LLVM_IR).
  - #pragma omp allocate(x) align(64) :
    - It emits a runtime call to __kmpc_aligned_alloc(int gtid, size_t algn, size_t size, omp_allocator_handle_t allocator) for the align clause (See LLVM_IR).
- Clang’s behavior for parallel directive with allocate clause
  - #pragma omp parallel allocate(omp_thread_mem_alloc:x) reduction(+:x) :
    - Runtime call to @__kmpc_alloc(i32 %10, i64 4, ptr inttoptr (i64 8 to ptr)) (See LLVM_IR).
    - With the parallel construct, the private copies of reduction variables are allocated by a runtime call to the __kmpc_alloc(int gtid, size_t size, omp_allocator_handle_t allocator) function.
- Clang’s behavior for taskgroup construct with allocate clause.
  #include <omp.h>
  void test() {
    int x = 0;
    #pragma omp taskgroup allocate(omp_thread_mem_alloc:x) task_reduction(+:x)
    {
      #pragma omp task in_reduction(+:x)
      {
        x = x + 1;
      }
    }
  }
  - In practice, Clang’s behavior is used as a reference guide for implementing features in Flang.
  - For directives like parallel and allocate:
    - Clang emits runtime calls to @__kmpc_alloc() for an allocate/allocator clause and to @__kmpc_aligned_alloc() for an align clause.
  - For the allocate clause on the taskgroup directive:
    - Allocation of private copies for reduction variables is handled by ___kmp_allocate(size_t size) within the function __kmpc_taskred_init(int gtid, int num, void *data).
  - From the above comparison, it appears that the allocate clause may not be completely handled in Clang for the taskgroup construct:
    - There is no observable difference in the generated LLVM IR with or without the allocate clause in a taskgroup construct (See LLVM_IR).
    - The expected behavior with the allocate clause in a taskgroup context would be to allocate private copies of reduction variables through runtime calls to @__kmpc_alloc() or @__kmpc_aligned_alloc(), but this behavior is not observed.
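To make the comparison above concrete, here is a minimal C sketch of the two allocation paths for the private copy of an `int` reduction variable. Everything in it is a stand-in: `mock_kmpc_alloc` and `mock_kmp_allocate` are hypothetical mocks that only record their arguments and call `malloc`, `allocator_handle_t` stands in for `omp_allocator_handle_t`, and the handle value 8 is taken from the `inttoptr (i64 8 to ptr)` operand in the parallel IR. The real runtime sizes and places these allocations differently; the sketch only shows where the allocator information is available and where it is lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for omp_allocator_handle_t (assumption for this sketch). */
typedef void *allocator_handle_t;

/* Arguments seen by the most recent mock allocation. */
typedef struct alloc_record {
  size_t size;
  uintptr_t allocator; /* 0 when no allocator reached the call */
  int used_allocator;  /* 1 for the allocator-aware path */
} alloc_record_t;

alloc_record_t last_alloc;

/* Hypothetical mock with the same parameter list as __kmpc_alloc. */
void *mock_kmpc_alloc(int gtid, size_t size, allocator_handle_t al) {
  (void)gtid;
  assert(size > 0);
  last_alloc.size = size;
  last_alloc.allocator = (uintptr_t)al;
  last_alloc.used_allocator = 1;
  return malloc(size);
}

/* Hypothetical mock of the internal allocation performed inside
   __kmpc_taskred_init: there is no allocator parameter at all. */
void *mock_kmp_allocate(size_t size) {
  assert(size > 0);
  last_alloc.size = size;
  last_alloc.allocator = 0;
  last_alloc.used_allocator = 0;
  return malloc(size);
}

/* parallel allocate(omp_thread_mem_alloc:x) reduction(+:x):
   the allocator handle (8) is passed to the runtime. */
int *private_copy_parallel(int gtid) {
  allocator_handle_t thread_alloc = (allocator_handle_t)(uintptr_t)8;
  return (int *)mock_kmpc_alloc(gtid, sizeof(int), thread_alloc);
}

/* taskgroup allocate(...) task_reduction(+:x):
   the allocator named in the clause never reaches the allocation. */
int *private_copy_taskgroup(void) {
  return (int *)mock_kmp_allocate(sizeof(int));
}
```

Calling `private_copy_parallel` leaves the handle in `last_alloc.allocator`, while `private_copy_taskgroup` leaves it at 0, which mirrors the missing difference in the taskgroup IR.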
Solutions desired:
Here are three potential approaches for incorporating allocate clause support in taskgroup:
Solution 1:
Based on the analysis, for a taskgroup construct with an allocate clause, Clang allocates the private copies of task reduction variables using __kmp_allocate(size_t size) rather than __kmpc_alloc() or __kmpc_aligned_alloc(). Since Clang does not completely handle the allocate clause within the taskgroup construct, we might consider following the same approach in Flang. If that is the case, then nothing needs to be done for the allocate clause.
Solution 2:
The runtime call __kmpc_taskred_init(), used to initialize task reduction, only receives task reduction details, without any data related to the allocator or alignment specified in the allocate clause. We could introduce a new runtime function to replace __kmpc_taskred_init() that also takes data from the allocate clause, such as the allocator and alignment settings. This new runtime function would enable allocation of private copies based on the specified allocator and alignment options (calling @__kmpc_alloc() for an allocator clause or @__kmpc_aligned_alloc() for an alignment clause).
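As a rough illustration of Solution 2, the C sketch below shows the kind of per-item data a new entry point might receive and how it could pick an allocation routine. None of these names exist in the runtime: `taskred_alloc_info_t`, `alloc_path_t`, `choose_alloc_path` and `plan_private_allocations` are hypothetical, and the sketch only decides which routine would be called rather than allocating anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for omp_allocator_handle_t (assumption for this sketch). */
typedef void *allocator_handle_t;

/* Hypothetical allocate-clause data for one reduction item. */
typedef struct taskred_alloc_info {
  allocator_handle_t allocator; /* NULL: no allocator given */
  size_t alignment;             /* 0: no alignment given */
} taskred_alloc_info_t;

/* Which routine the new entry point would use for the private copy. */
typedef enum {
  PATH_INTERNAL,     /* today's behaviour inside __kmpc_taskred_init */
  PATH_ALLOC,        /* would call __kmpc_alloc */
  PATH_ALIGNED_ALLOC /* would call __kmpc_aligned_alloc */
} alloc_path_t;

alloc_path_t choose_alloc_path(const taskred_alloc_info_t *info) {
  if (info == NULL || (info->allocator == NULL && info->alignment == 0))
    return PATH_INTERNAL;
  if (info->alignment != 0)
    return PATH_ALIGNED_ALLOC; /* alignment wins, allocator may be NULL */
  return PATH_ALLOC;
}

/* Hypothetical core of a replacement for __kmpc_taskred_init: one
   allocate-clause record per reduction item, in the same order as the
   reduction descriptors. A NULL info array means no allocate clause. */
void plan_private_allocations(int num, const taskred_alloc_info_t *info,
                              alloc_path_t *paths_out) {
  assert(num >= 0 && paths_out != NULL);
  for (int i = 0; i < num; ++i)
    paths_out[i] = choose_alloc_path(info ? &info[i] : NULL);
}
```

Keeping the allocate-clause data in a separate array, as assumed here, would leave the existing kmp_taskred_input layout untouched; whether that is preferable to extending the structure is an open design question.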
Solution 3:
The kmp_taskred_input structure is passed to __kmpc_taskred_init(int gtid, int num, void *data) to initialize task reduction, where data is a pointer to an array that contains information about the task reductions. The structure contains the following fields:
typedef struct kmp_taskred_input {
  void *reduce_shar;
  void *reduce_orig;
  size_t reduce_size;
  void *reduce_init;
  void *reduce_fini;
  void *reduce_comb;
  kmp_taskred_flags_t flags;
} kmp_taskred_input_t;
where kmp_taskred_flags_t contains
typedef struct kmp_taskred_flags {
  /*! 1 - use lazy alloc/init (e.g. big objects, num tasks < num threads) */
  unsigned lazy_priv : 1;
  unsigned reserved31 : 31;
} kmp_taskred_flags_t;
We could leverage the flags field within the structure (struct kmp_taskred_input) to store allocator flag information from the allocate clause and subsequently trigger a call to __kmpc_alloc() with the allocator settings. For alignment requirements, a default argument could be added to the function __kmpc_taskred_init() to manage alignment considerations.
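One way to picture Solution 3 is to carve a few of the 31 reserved bits out for a predefined allocator handle. The C sketch below is only an assumption about how that could look: the type `taskred_flags_sketch_t`, the `allocator` field, its 8-bit width and both helper functions are hypothetical and not part of the runtime. It relies on predefined handles being small integers, as the value 8 for omp_thread_mem_alloc in the parallel IR suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of kmp_taskred_flags_t: 8 of the 31 reserved
   bits carry a predefined allocator handle (assumed layout). */
typedef struct taskred_flags_sketch {
  unsigned lazy_priv : 1;   /* unchanged from the existing struct */
  unsigned allocator : 8;   /* 0 means no allocate clause */
  unsigned reserved23 : 23; /* remaining reserved bits */
} taskred_flags_sketch_t;

/* Store an allocator handle in the flags.
   Returns 1 on success, 0 if the handle does not fit in 8 bits
   (in which case the flags are left unchanged). */
int set_allocator(taskred_flags_sketch_t *flags, uintptr_t handle) {
  assert(flags != NULL);
  if (handle > 0xFFu)
    return 0;
  flags->allocator = (unsigned)handle;
  return 1;
}

/* Read the handle back; the runtime side could pass this value on to
   __kmpc_alloc() when allocating the private copies. */
uintptr_t get_allocator(taskred_flags_sketch_t flags) {
  return (uintptr_t)flags.allocator;
}
```

A limitation worth noting: any allocator handle that is a full pointer-sized value rather than a small predefined constant would not fit in the reserved bits, so this encoding could only cover predefined allocators unless the structure itself is extended.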
Please review these proposed approaches and share your thoughts. Any additional insights or suggestions would be highly valuable for further development.
Thank you.