I am trying to use OpenACC to parallelize a loop that calls a method on a C++ object, but the code runs around 10x slower on the GPU than on the CPU.
This is a simplified version of my code:
#include <cstdlib>
#include <ctime>
#include <vector>
class Test {
public:
Test() = default;
~Test() = default;
void init() {
// set member variables
}
#pragma acc routine seq
void update(int a, double b, double c, bool d //, and so on...
) {
// perform a lot of calculations and update member variables
}
std::vector<float> a;
std::vector<double> b;
};
Test t;
int main() {
t.init();
// read large data from file
// ...
// iterate over each row of data
#pragma acc parallel loop
for (int i = 1; i < data.size(); ++i) {
t.update(data[i][0], data[i][1] //, and so on...
);
}
return EXIT_SUCCESS;
}
I think I know why this is slow (correct me if I am wrong). The loop iterations run in parallel, but every iteration updates the same Test object, t, and its members. Because of this, all the threads need to synchronize, which slows the GPU down. The issue is that I don’t know how to fix this. The data I am reading is very large, which is why a GPU would be really useful here.
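To make the suspected problem concrete, here is a minimal standalone sketch of the two patterns as I understand them. It is not my real code: `Row`, `shared_state_sum`, and `reduction_sum` are hypothetical names, and the single `a * b` product stands in for the much larger calculation in `update`. The second function only works if the update can be expressed as a reduction over independent rows, which may not hold for my actual member variables.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one row of input; the real rows have more fields.
struct Row {
    int a;
    double b;
};

// Pattern 1: every iteration reads and writes the same accumulator, so the
// iterations depend on each other. Left serial on purpose: putting a plain
// "parallel loop" on this without a reduction clause would be a data race.
double shared_state_sum(const std::vector<Row>& rows) {
    double total = 0.0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        total += rows[i].a * rows[i].b;
    }
    return total;
}

// Pattern 2: each iteration reads only its own row, and the combination of
// results is declared as a reduction so the compiler can parallelize it.
// A raw pointer and an explicit length are used so the data clause can
// describe the array; compilers without OpenACC support ignore the pragma.
double reduction_sum(const std::vector<Row>& rows) {
    const Row* p = rows.data();
    const int n = static_cast<int>(rows.size());
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(p[0:n])
    for (int i = 0; i < n; ++i) {
        total += p[i].a * p[i].b;
    }
    return total;
}
```

Both functions should return the same value for the same input when compiled without OpenACC; with OpenACC enabled, the reduction may combine terms in a different order, so small floating-point differences are possible.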