seems that cuda doesn't support pointer to pointer problem report

yk_cadcg · March 1, 2007, 2:10am

Hi, it seems that CUDA doesn’t support pointers to pointers, like:
int** a;
cudaMalloc(&a, sizeof(int*) * N);
for(int i = 0; i < N; ++i)
    cudaMalloc(&a[i], size); // memory violation here

The above pseudocode does not work on the device, so we have to use a contiguous 1D array or cudaMalloc2D.

Any advice? Thanks a lot!

yk_cadcg · March 1, 2007, 5:44am

Has anybody run into this issue? Hoping for a quick reply…

javier1 · March 1, 2007, 9:39am

I didn’t say anything :P

Anyway, I think CUDA does not support pointers this way, so I suppose you will have to replace pointers with indexes or something similar. Keep in mind that after cudaMalloc( &a, … ), a is “pointing” to device memory, not host memory. As soon as you dereference a on the host, you are doing something that makes no sense: *a is not memory in host space.

If you just need an array of device memory chunks, the first malloc must be a regular host one. But keep in mind that you will not be able to dereference that array on the device; you can use it for things like:

cudaDeviceFunction<<<…>>>( a[1], a[2], … );

not as a parameter by itself.
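
A minimal sketch of the approach described above, assuming three chunks of equal size. The kernel addChunks and the helper runAddChunks are hypothetical names made up for this illustration. The table of pointers stays in host memory, and each device chunk is handed to the kernel as a separate argument:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds two device buffers element-wise into a third.
__global__ void addChunks(const int* x, const int* y, int* sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) sum[i] = x[i] + y[i];
}

// Hypothetical driver: returns true if the device result matches the host expectation.
bool runAddChunks()
{
    const int n = 256;
    int hx[n], hy[n], hsum[n];
    for (int i = 0; i < n; ++i) { hx[i] = i; hy[i] = 2 * i; }

    // The table itself lives in host memory; each entry holds a device address.
    int* a[3] = { 0, 0, 0 };
    for (int k = 0; k < 3; ++k)
        if (cudaMalloc((void**)&a[k], n * sizeof(int)) != cudaSuccess) return false;

    cudaMemcpy(a[0], hx, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(a[1], hy, n * sizeof(int), cudaMemcpyHostToDevice);

    // Each chunk is its own kernel argument; the host table is never passed.
    addChunks<<<(n + 127) / 128, 128>>>(a[0], a[1], a[2], n);
    cudaMemcpy(hsum, a[2], n * sizeof(int), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hsum[i] == 3 * i);
    for (int k = 0; k < 3; ++k) cudaFree(a[k]);
    return ok;
}
```

This only works while the number of chunks is small and known at compile time, since each one takes a kernel parameter slot.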

Cyril_Zeller · March 1, 2007, 10:16am

You cannot dereference a device pointer in host code (see last sentence of Section 4.2.2.4). In the same way, you cannot have cudaMalloc write to a location pointed to by a device pointer.

You need to explicitly copy to device memory with something like this:

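int** a;
cudaMalloc(&a, sizeof(int*) * N);   // device table of N pointers

int* ha[N];                         // host copy of the table, filled with device addresses
for(int i = 0; i < N; ++i)
    cudaMalloc(&ha[i], size);

// typo, corrected later in this thread: sizeof(a) is the size of a single pointer;
// the copy needs sizeof(int*) * N
cudaMemcpy(a, ha, sizeof(a), cudaMemcpyHostToDevice);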

yk_cadcg · March 2, 2007, 6:13am

Thank you very much! I will try.

jhanweck · May 24, 2007, 9:40pm

I have tried this (using floats), but in the kernel, the compiler appears confused about how to dereference the pointer.

Can you tell me what’s wrong below, or post the kernel code corresponding to your host code?

Thanks!

// one row of the array per thread block...
__global__ void kernelTestP2P(float **g_array, float *g_Out)
{
    unsigned int bid = blockIdx.x;
    unsigned int tid = threadIdx.x;

    // get the pointer to this thread block's row
    __shared__ float *g_row;    // or should this be:  float *g_row; ....
    if (tid == 0)          // ... and eliminate this line???  neither seems to work...
        g_row = g_array[bid];
    __syncthreads();

    shmem[tid] = g_row[tid];
    __syncthreads();

Can you provide an example of the kernel code?

BARRACUDA · June 4, 2007, 2:30pm

What about using cudaMalloc2D? Although I haven’t personally been able to do much with it, it seems like the thing that you’re looking for.
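
A rough sketch of the pitched 2D route, assuming the allocator meant here corresponds to cudaMallocPitch in later CUDA toolkits. The kernel scaleRows and the helper runScaleRows are hypothetical names for this illustration. The whole array is a single allocation, and each row is located with pitch arithmetic instead of a table of pointers:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one block per row, one thread per column.
__global__ void scaleRows(float* data, size_t pitch, int width)
{
    // pitch is measured in bytes, so rows are stepped through with char* arithmetic
    float* row = (float*)((char*)data + blockIdx.x * pitch);
    if (threadIdx.x < width) row[threadIdx.x] *= 2.0f;
}

// Hypothetical driver: returns true if every element was doubled.
bool runScaleRows()
{
    const int rows = 4, width = 32;
    float h[rows][width];
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < width; ++c) h[r][c] = (float)(r * width + c);

    float* d = 0;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&d, &pitch, width * sizeof(float), rows) != cudaSuccess)
        return false;

    // Host rows are tightly packed; device rows are padded out to pitch bytes.
    cudaMemcpy2D(d, pitch, h, width * sizeof(float),
                 width * sizeof(float), rows, cudaMemcpyHostToDevice);
    scaleRows<<<rows, width>>>(d, pitch, width);
    cudaMemcpy2D(h, width * sizeof(float), d, pitch,
                 width * sizeof(float), rows, cudaMemcpyDeviceToHost);
    cudaFree(d);

    bool ok = true;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < width; ++c)
            ok = ok && (h[r][c] == 2.0f * (float)(r * width + c));
    return ok;
}
```

This fits when all rows have the same length; rows of different sizes still need the pointer-table approach discussed earlier in the thread.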

AndreasBuhr · June 5, 2007, 11:56am

I think g_row is declared as a variable in shared memory in your code. When accessing g_row[tid], the compiler probably tries to read the data from shared memory as well.

You should have a look at your .ptx file. When something is loaded from global memory into shared memory, there is probably something like

ld.global.f32
st.shared.f32

Maybe your .ptx file contains something like

ld.shared.f32
st.shared.f32

instead; then you know where your problem is ;-)
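
Following that diagnosis, one thing to try is keeping the row pointer in an ordinary per-thread variable instead of a __shared__ one. The sketch below is an illustration under that assumption, not a confirmed fix for the compiler version discussed here; ROWS, COLS, the fixed-size shmem array, and runKernelTestP2P are made-up names and choices.

```cuda
#include <cuda_runtime.h>

#define ROWS 4
#define COLS 64

// One row per thread block. Every thread reads the row pointer into its own
// local variable, so the pointer itself is not a __shared__ variable.
__global__ void kernelTestP2P(float** g_array, float* g_Out)
{
    __shared__ float shmem[COLS];
    unsigned int bid = blockIdx.x;
    unsigned int tid = threadIdx.x;

    float* g_row = g_array[bid];   // per-thread copy of the row's global address
    shmem[tid] = g_row[tid];       // global -> shared
    __syncthreads();

    // Write the row out reversed, so each thread reads an element staged by another thread.
    g_Out[bid * COLS + tid] = shmem[COLS - 1 - tid];
}

// Hypothetical driver: returns true if each output row is the reversed input row.
bool runKernelTestP2P()
{
    float h_in[ROWS][COLS], h_out[ROWS * COLS];
    float* h_rows[ROWS];           // host table holding device row addresses
    float** d_rows = 0;
    float* d_out = 0;

    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c) h_in[r][c] = (float)(r * COLS + c);
        cudaMalloc((void**)&h_rows[r], COLS * sizeof(float));
        cudaMemcpy(h_rows[r], h_in[r], COLS * sizeof(float), cudaMemcpyHostToDevice);
    }
    cudaMalloc((void**)&d_rows, ROWS * sizeof(float*));
    cudaMemcpy(d_rows, h_rows, ROWS * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_out, ROWS * COLS * sizeof(float));

    kernelTestP2P<<<ROWS, COLS>>>(d_rows, d_out);
    cudaMemcpy(h_out, d_out, ROWS * COLS * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c)
            ok = ok && (h_out[r * COLS + c] == h_in[r][COLS - 1 - c]);
        cudaFree(h_rows[r]);
    }
    cudaFree(d_rows);
    cudaFree(d_out);
    return ok;
}
```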

Cyril_Zeller · June 5, 2007, 3:36pm

The code I wrote earlier had a typo; here’s the right code for the copy to work correctly:
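
The corrected listing did not survive in this copy of the thread. What follows is a plausible reconstruction, assuming the typo was the sizeof(a) in the final cudaMemcpy, which is the size of one pointer and not of the whole table. The value of N, the row size, the touchRows kernel, and the runPointerTable helper are illustrative additions used only to check the result.

```cuda
#include <cuda_runtime.h>

#define N 8   // number of rows (illustrative value)

// Hypothetical check kernel: thread i writes i into the first element of row i.
__global__ void touchRows(int** a)
{
    a[threadIdx.x][0] = threadIdx.x;
}

// Hypothetical driver: returns true if the device could follow every pointer in the table.
bool runPointerTable()
{
    const size_t size = 16 * sizeof(int);           // bytes per row (illustrative value)

    int** a = 0;
    cudaMalloc((void**)&a, sizeof(int*) * N);       // device table of N pointers

    int* ha[N];                                     // host staging copy of the table
    for (int i = 0; i < N; ++i)
        cudaMalloc((void**)&ha[i], size);           // each row is its own device allocation

    // Copy the whole table: N pointers, not sizeof(a)
    cudaMemcpy(a, ha, sizeof(int*) * N, cudaMemcpyHostToDevice);

    touchRows<<<1, N>>>(a);

    bool ok = true;
    for (int i = 0; i < N; ++i) {
        int first = -1;
        cudaMemcpy(&first, ha[i], sizeof(int), cudaMemcpyDeviceToHost);
        ok = ok && (first == i);
        cudaFree(ha[i]);
    }
    cudaFree(a);
    return ok;
}
```

The host array ha is kept around after the copy because the host cannot read the pointers back out of a by dereferencing it; ha is what later cudaMemcpy and cudaFree calls use.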

ismailfaruqi · November 17, 2011, 5:06am

Wow, I can’t believe I’m facing a problem that was solved 4 years ago!!!

This saves me from further headbanging, LOL

lclin · January 6, 2012, 8:33am

Is there a newer, easier way to handle this problem now that 4 years have passed?

Araz · March 29, 2012, 3:43pm

This page helped me a lot, so I am adding a simple example to show clearly how pointers to pointers can be used in CUDA.
//////////////////////////////P2P.cu//////////////
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <stddef.h>
#include <cuda.h>
#include <cuda_runtime.h>

static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

float Randomizer () { return (float)rand()/(float)RAND_MAX; }

__global__ void kernelTestP2P(float **in, float **out)
{
    unsigned int bid = blockIdx.x;
    unsigned int tid = threadIdx.x;

    // stage this thread block's row in dynamically sized shared memory
    extern __shared__ float shmem[];

    shmem[tid] = in[bid][tid];
    __syncthreads();
    out[bid][tid] = sinf(shmem[tid]);
}

int main(){
    int B=14, N=32;
    srand ( unsigned ( time(NULL) ) );

    // _dd: device table of device pointers
    // _hd: host table of device pointers
    // _hh: host table of host pointers
    float **in_dd, **in_hd, **in_hh;

    // allocate 1st part, B count of float*
    in_hd = (float**)malloc(sizeof(float*) * B);
    in_hh = (float**)malloc(sizeof(float*) * B);
    HANDLE_ERROR(cudaMalloc(&in_dd, sizeof(float*) * B));
    std::cout << "OK\n";

    // allocate 2nd part, N count of float
    for(int i = 0; i < B; ++i) {
        HANDLE_ERROR(cudaMalloc((void**)&in_hd[i], N*sizeof(float)));
        in_hh[i] = (float*)malloc(N*sizeof(float)); // allocate or connect the input data to it
        std::generate_n(in_hh[i], N, Randomizer);
        HANDLE_ERROR(cudaMemcpy(in_hd[i], in_hh[i], sizeof(float)*N, cudaMemcpyHostToDevice));
    }
    // copy the float* table from hd to dd
    HANDLE_ERROR(cudaMemcpy(in_dd, in_hd, sizeof(float*)*B, cudaMemcpyHostToDevice));

    float **out_dd, **out_hd, **out_hh;

    out_hh = (float**)malloc(sizeof(float*) * B);
    out_hd = (float**)malloc(sizeof(float*) * B);
    HANDLE_ERROR(cudaMalloc(&out_dd, sizeof(float*) * B));

    for(int i = 0; i < B; i++) {
        HANDLE_ERROR(cudaMalloc((void**)&out_hd[i], N*sizeof(float)));
        out_hh[i] = (float*)malloc(N*sizeof(float)); // just allocate, NO input
    }
    HANDLE_ERROR(cudaMemcpy(out_dd, out_hd, B*sizeof(float*), cudaMemcpyHostToDevice));

    size_t shMemSize = N*sizeof(float);
    dim3 GridDim = dim3(B,1,1), BlockDim = dim3(N,1,1);

    kernelTestP2P<<<GridDim,BlockDim,shMemSize>>>(in_dd, out_dd);
    HANDLE_ERROR(cudaGetLastError()); // report launch errors

    // copy each output row back through the host table of device pointers
    for(int i = 0; i < B; i++) {
        HANDLE_ERROR(cudaMemcpy(out_hh[i], out_hd[i], sizeof(float)*N, cudaMemcpyDeviceToHost));
        HANDLE_ERROR(cudaFree(in_hd[i]));
        HANDLE_ERROR(cudaFree(out_hd[i]));
    }
    HANDLE_ERROR(cudaFree(in_dd));
    HANDLE_ERROR(cudaFree(out_dd));

    std::ofstream f_out;
    f_out.open("output.txt");

    for(int i = 0; i < B; i++) {
        for(int j = 0; j < N; j++) f_out << out_hh[i][j] << " ";
        f_out << "\n";
        free(out_hh[i]);
        free(in_hh[i]);
    }
    free(out_hh);
    free(in_hh);
    free(out_hd);
    free(in_hd);
    return 0;
}
