Update C++ object in parallel loop using OpenACC

I am trying to parallelize a loop with a C++ object but the code is around 10x slower on the GPU than on the CPU.

This is an oversimplified version of my code:

#include <cstdlib>
#include <ctime>
#include <vector>
class Test {
public:
  Test() = default;
  ~Test() = default;

  void init() {
    // set member variables
  }

#pragma acc routine seq
  void update(int a, double b, double c, bool d //, and so on...
  ) {
    // perform a lot of calculations and update member variables
  }

  std::vector<float> a;
  std::vector<double> b;
};

Test t;
int main() {
  t.init();

  // read large data from file into `data`
  std::vector<std::vector<double>> data;
  // ...

  // iterate over each row of data
#pragma acc parallel loop
  for (int i = 1; i < data.size(); ++i) {
    t.update(data[i][0], data[i][1] //, and so on...
    );
  }

  return EXIT_SUCCESS;
}

I think I know the reason why this is slow (correct me if I am wrong). Each iteration of the loop is parallel, but it tries to update the same Test object, t (and its members). Because of this, all the threads need to synchronize, which slows the GPU down. The issue is, I don't know how to fix this, and the data I am reading is very large (which is why a GPU would be really useful here).

Doubtful, given that threads are only implicitly synchronized at the end of the vector loop. Since "update" is a sequential routine, no synchronization would occur.

I’d first double check that the compiler is able to parallelize the code by reviewing the compiler feedback messages (enabled via the “-Minfo=accel” flag).
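For example, with the NVHPC compiler, something like the following prints the feedback messages for the accelerated regions (the source and output file names are just placeholders):

nvc++ -fast -acc -gpu=managed -Minfo=accel -o output main.cpp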

Next, I’d run the code using the Nsight-Systems profiler to see where the slow performance is coming from. The most common cause is due to data movement between the device and host.

I thought the reason it may be slowing down was that, in my example code, I have one object Test t which has several data members. Since each iteration of the loop runs in parallel, all threads/vectors perform the computation (i.e. the update() method) and then write the results into that common object. Because of this, wouldn't it be slowed down, since there would have to be synchronization for each write operation? As opposed to threads/vectors each writing to their own object in an array (which I believe doesn't require synchronization).
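Just to illustrate what I mean by "their own object in an array", here is a rough sketch (the names and the calculation are made up, not from my real code):

#include <vector>

int main() {
  const int n = 1000000;
  std::vector<double> results(n);
  double* res = results.data();

// each iteration writes only to its own element, so (I assume) no
// synchronization between threads/vectors is needed
#pragma acc parallel loop
  for (int i = 0; i < n; ++i) {
    res[i] = 2.0 * i + 1.0;
  }
  return 0;
}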

I am compiling my code using these flags (in CMake):

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -w -fast -acc -gpu=managed -Minfo=accel")

The output of -Minfo=accel is a bit verbose because of STL but I believe this is the main part:

58, Generating implicit private(i)
         Generating NVIDIA GPU code
         63, #pragma acc loop seq
     58, Generating implicit copy(...<VARIABLES>) [if not already present]
     63, Loop without integer trip count will be executed in sequential mode
         Complex loop carried dependence of ->__b_St12_Vector_baseISt6vectorIdSaIdEESaIS2_EE._M_impl._M_start,data.__b_St12_Vector_baseISt6vectorIdSaIdEESaIS2_EE._M_impl._M_start+(i*24)->__b_St12_Vector_baseIdSaIdEE._M_impl._M_start,->__b_St12_Vector_baseISt6vectorIdSaIdEESaIS2_EE._M_impl._M_finish prevents parallelization
         Loop carried dependence of -> prevents parallelization
         Loop carried backward dependence of -> prevents vectorization
         Generating implicit firstprivate(...<VARIABLES>)
         Loop carried dependence of -> prevents parallelization

It says: 63, #pragma acc loop seq but I used #pragma acc parallel loop. Why is this different here?

Also, I am new to Nsight and OpenACC, so I may be wrong, but I don't see any excessive data movement in the profile. I profiled the code using two commands:

nsys profile -f true -o full_profile ./output
nsys profile -t cuda,openacc -f true -o acc_profile ./output

I have attached the profiles here as well.

Could you also tell me a good resource for profiling OpenACC code using Nsight Systems? A lot of the tutorials/videos are old, short, or just briefly go over the profile with screenshots.
Profiles.zip (2.1 MB)

It's saying that the loop is being run sequentially since the compiler can't determine the loop trip count, likely due to the use of "data.size()", which is a function call. Try assigning the value of data.size() to a scalar, then use this scalar for the upper bound.

The other messages about the dependencies only mean that the compiler analysis can’t determine if the loop is independent or not. So it’s up to the programmer to ensure the loops can be parallelized.
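For example, applied to the posted code, something along these lines (just a sketch; "n" is an illustrative name):

  // hoist the trip count into a scalar so the compiler sees a countable loop
  const int n = static_cast<int>(data.size());

#pragma acc parallel loop
  for (int i = 1; i < n; ++i) {
    t.update(data[i][0], data[i][1] //, and so on...
    );
  }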

OpenACC gets compiled to the same device code as CUDA, so any tutorial should apply. Though I think Mozhgan uses OpenACC in this tutorial, so it may be the most relevant: https://www.youtube.com/watch?v=kKANP0kL_hk

The generated kernel names from our OpenACC implementation are the function name plus the line number from the source file.

Thank you, I will follow this video! Is it possible for you to take a look at the profiles and see what is wrong? I checked the profile again, and I only see data transfers happening at the beginning and end of the program.

Also, this is a separate question, but I just wanted to confirm: is the OpenACC routine directive applied where the function is defined or where it is declared? I have a header file with a class declaration containing just the method signatures/declarations, and in another cpp file I have the definitions. When I apply the routine directive to the function definition in the cpp file, I get errors, but when I apply it to the declarations the errors go away.

You should fix the code to get the loop to parallelize (see above) and then profile it again.

is the OpenACC routine directive applied where the function is defined or where it is declared?

It needs to be visible to both the caller and the callee. Having the routine pragma in the header file with the prototype makes it visible in both spots. It doesn't hurt to have it at the definition as well, since it helps with self-documentation, but it isn't necessary.

Note that the compiler can auto-generate device routines when the definition is visible in the same compilation unit as the caller. It won't help you here since the definition is in a separate file, but it is useful for things like templates and lambdas.
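As a sketch (the file names are illustrative), the placement would look something like this:

// test.hpp -- the routine pragma sits with the declaration, so it is
// visible to every file that calls update()
class Test {
public:
#pragma acc routine seq
  void update(int a, double b, double c, bool d);
};

// test.cpp -- repeating the pragma at the definition is optional
#include "test.hpp"
#pragma acc routine seq
void Test::update(int a, double b, double c, bool d) {
  // calculations that update member variables
}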


You should fix the code to get the loop to parallelize (see above) and then profile it again.

Wow. That worked perfectly. Thank you!

