CUDA separable compilation + shared libraries -> "Invalid function" error

Hi,

I’m running into an issue trying to apply CUDA separable compilation to my project, which uses shared libraries. Note: I am aware that device linking cannot be done across shared library boundaries. That’s not what I’m doing. My shared libraries have a regular C/C++ interface.

After some experimentation, I believe the issue is: I’m performing device-link twice (one for each library), and the same symbol exists in both device linked objects. Why is this a problem?

I have asked the question in StackOverflow, I’ll reproduce it here.

I’m trying to use CUDA separable compilation in my project. The project is composed of a binary that depends on a few shared libraries (all built in the same build system). These shared libraries in turn use common CUDA code. When running the binary, I get a segfault similar to here. When I create a minimal example, I get “invalid device function” error instead. If I turn the shared libraries into static libraries, the error goes away . Unfortunately I don’t have control over this and need to make it work with shared libraries.

I have seen a couple similar posts here in SO, but they use CMake and the solutions usually involve changing libraries from shared to static, which I can’t do in my project. I have double-checked that I’m running the code on the right GPU (and indeed it works if I do some changes, see below), so that’s not the issue.

I believe I’m missing something when doing CUDA separable compilation, device linking or creating shared libraries.

Below is a fully reproducible minimal example of the problem:

// common.h
#ifndef COMMON_H
#define COMMON_H

__device__ int common();

#endif
// common.cu
#include "common.h"

__device__ int common()
{
    return 123;
}
// a.h
#ifndef A_H
#define A_H

__attribute__((__visibility__("default")))
void runA();

#endif
// a.cu
#include "a.h"

#include <cstdio>
#include <iostream>

#include "common.h"

__global__ void kernelA()
{
    printf("Running A: %d\n", 456 + common());
}

void runA()
{
    kernelA<<<1,1>>>();
    std::cout << cudaGetErrorString(cudaPeekAtLastError()) << std::endl;
    cudaDeviceSynchronize();
}
// b.h
#ifndef B_H
#define B_H

__attribute__((__visibility__("default")))
void runB();

#endif
// b.cu
#include "b.h"

#include <cstdio>
#include <iostream>

#include "common.h"

__global__ void kernelB()
{
    printf("Running B: %d\n", 321 + common());
}

void runB()
{
    kernelB<<<1,1>>>();
    std::cout << cudaGetErrorString(cudaPeekAtLastError()) << std::endl;
    cudaDeviceSynchronize();
}
// main.cpp
#include "a.h"
#include "b.h"

int main()
{
    runA();
    runB();
}

So basically a binary depending on 2 shared libraries A and B, both of which utilize the common() device function.

This is my build/test script:

#!/usr/bin/env bash
set -euxo pipefail

CUDA_ROOT=/usr/local/cuda-10.2
NVCC=$CUDA_ROOT/bin/nvcc
CC=/usr/bin/g++
GENCODE="arch=compute_75,code=sm_75"

# Clean previous build
rm -f *.o *.so main

# Compile relocatable CUDA code
$NVCC -gencode=$GENCODE -dc -Xcompiler -fPIC,-fvisibility=hidden common.cu -o common.cu.o
$NVCC -gencode=$GENCODE -dc -Xcompiler -fPIC,-fvisibility=hidden      a.cu -o      a.cu.o
$NVCC -gencode=$GENCODE -dc -Xcompiler -fPIC,-fvisibility=hidden      b.cu -o      b.cu.o

# Build shared library A
$NVCC -gencode=$GENCODE -dlink common.cu.o a.cu.o -o a.dlink.o
$CC -shared common.cu.o a.cu.o a.dlink.o -L$CUDA_ROOT/lib64 -lcudart -o liba.so

# Build shared library B
$NVCC -gencode=$GENCODE -dlink common.cu.o b.cu.o -o b.dlink.o
$CC -shared common.cu.o b.cu.o b.dlink.o -L$CUDA_ROOT/lib64 -lcudart -o libb.so

# Build final executable
$CC main.cpp -L. -la -lb -o main

# Run it
LD_LIBRARY_PATH=. ./main

Running it I get:

invalid device function
invalid device function

After some trial and error, I notice the problem is solved by:

  • Not linking common.cu.o in either library when device linking (obviously I need to make either library no longer use the common() function.
  • Making A and B static libraries.
  • Combining A and B into one single shared library.

Unfortunately I cannot apply these solutions in my project. Why is it a problem to have 2 shared libraries? I’ve read about the “device linker ignoring shared libraries”, but in this case it’s the host linker creating the shared library, not the device linker, so I’m hoping that’s OK?

Thanks!

Could you please file us a ticket following the instruction here Getting Help with CUDA NVCC Compiler . You can add this topic link in the description . We will take a look . Thanks.

Absolutely, thanks! Just filed the bug, let me know if I can help further.

Browsing the Forums I believe my issue is closest to this one:

If I create only 1 .so file, which links together both a.dlink.o and b.dlink.o, I get “multiple definitions” error. This cannot be caught by the linker if I build 2 independent .so files.

Another thing that I have observed is that, when I run my program and the runtime linker starts to load to my libraries, they call this function twice:

__cudaRegisterLinkedBinary_14_common_cpp1_ii_e5ca4d49

This doesn’t happen with the other device functions - they get called only once. Is this a problem? Registering the same binary twice?

Thanks!

Your original issue is talked internally and here is the info
The problem is the duplicate common.o in both libraries. If you copy common.cu to common2.cu, and then link common2.o in libb, common.o in liba, then it works.

Also here is my other workaround on trying your case

yni@node1:~/yni/CustomerBug/nvbug3373815$ nvcc -arch=sm_75 -dlink common.cu.o a.cu.o b.cu.o -o ab.dlink.o
yni@node1:~/yni/CustomerBug/nvbug3373815$
yni@node1:~/yni/CustomerBug/nvbug3373815$ g++  -shared common.cu.o a.cu.o b.cu.o ab.dlink.o -L /usr/local/cuda-11.5/lib64 -lcuda -lcudart -o libmy.so
yni@node1:~/yni/CustomerBug/nvbug3373815$ 
yni@node1:~/yni/CustomerBug/nvbug3373815$ g++ main.cpp -L . -lmy -o myMain
yni@node1:~/yni/CustomerBug/nvbug3373815$
yni@node1:~/yni/CustomerBug/nvbug3373815$
yni@node1:~/yni/CustomerBug/nvbug3373815$ LD_LIBRARY_PATH=. ./myMain
no error
Running A: 579
no error
Running B: 444

Would it be fixed? It seems that initialization code should be arranged as singleton, that is not the case currently.

Found a solution: add -fvisibility=hidden to the dlink step:

-$NVCC -gencode=$GENCODE -dlink common.cu.o a.cu.o -o a.dlink.o
+$NVCC -gencode=$GENCODE -dlink -Xcompiler -fvisibility=hidden common.cu.o a.cu.o -o a.dlink.o

-$NVCC -gencode=$GENCODE -dlink common.cu.o b.cu.o -o b.dlink.o
+$NVCC -gencode=$GENCODE -dlink -Xcompiler -fvisibility=hidden common.cu.o b.cu.o -o b.dlink.o

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.