HELP with vector sum

Hi, I want to implement an algorithm that sums two vectors and scales with their size. This is what I have done so far:

#include <stdio.h>

#define VSIZE 64

__global__ void sum(char *v1, char *v2, int length, int *res) {
	//Total number of threads within a block
	int b_threads=blockDim.x*blockDim.y*blockDim.z;

	//Total number of threads in the grid
	int g_threads=gridDim.x*gridDim.y*gridDim.z*b_threads;

	//Size of the vector partition each thread will work on
	int pt_size=length/g_threads;

	//Size of the vector partition which will be loaded by a block
	int pb_size=pt_size*b_threads;

	//Block index
	int blockId=blockIdx.x+blockIdx.y*blockDim.x;

	//Thread index (this is addressing a thread within the grid)
	int tId=threadIdx.x+threadIdx.y*blockDim.x+ blockId*b_threads;

	int bound=tId*pt_size+pt_size;

	//printf("b_threads:%d, g_threads:%d, pt_size:%d, pb_size:%d, blockId:%d, tId:%d, bound:%d\n",
		//	b_threads, g_threads, pt_size, pb_size, blockId, tId, bound);

	int i;
	for (i=tId*pt_size; i<bound; i++) {
		res[i]=v1[i]+v2[i];
	}
}

int main() {
	char v1[VSIZE], v2[VSIZE], *vc1, *vc2;
	int vres[VSIZE], *vcres;
	int size1= sizeof(char)*VSIZE;
	int size2=sizeof(int)*VSIZE;
	int i;

	for (i=0; i<VSIZE; i++) {
		v1[i]=1;
		v2[i]=2;
	}

	//initialize and copy mem to device
	cudaMalloc((void**)&vc1, size1);
	cudaMemcpy(vc1, v1, size1, cudaMemcpyHostToDevice);
	cudaMalloc((void**)&vc2, size1);
	cudaMemcpy(vc2, v2, size1, cudaMemcpyHostToDevice);
	cudaMalloc((void**)&vcres, size2);

	//set dimensions of grid and block
	dim3 dimBlock(2, 2);
	dim3 dimGrid(2, 2);

	sum<<<dimGrid, dimBlock>>>(vc1, vc2, VSIZE, vcres);

	cudaMemcpy(vres, vcres, size2, cudaMemcpyDeviceToHost);

	cudaFree(vc1);
	cudaFree(vc2);
	cudaFree(vcres);

	int k;
	for (k=0; k<VSIZE; k++) printf("position %d: %d\n", k, vres[k]);

	return 0;
}

The code works on a “short” input (e.g. vector size 32 or 64), but when the input grows I get meaningless data!

The kernel assumes that the total number of threads is always less than or equal to the size of the vectors.

How can I use shared memory in order to have faster memory access within a block?

Any help is appreciated!

In your code you are actually working with vectors (1D). Making the thread and block dimensions 2D only complicates the logic for computing the array index from the built-in CUDA variables. So in this particular case it is better to use a 1D layout.

Also, according to your logic, each CUDA thread processes a whole contiguous chunk of the vector. This causes non-coalesced accesses to global memory, so it will be much slower. It is better to launch as many CUDA threads as there are vector elements.

About using shared memory: in this case you do not have repeated accesses to the same location in global memory, so shared memory is not a good choice here. Instead, binding the input buffers to textures will give better performance than plain global memory access.
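The advice above (1D indexing, one thread per element) could look roughly like the sketch below. It is a minimal sketch, not the original poster's code: the kernel name `sum1d`, the host wrapper `sum_on_device`, and the block size of 256 are assumptions chosen for illustration, and `length` is assumed to be positive.

```cuda
#include <cuda_runtime.h>

// One thread per element: thread i handles element i, so neighbouring
// threads of a warp touch neighbouring addresses.
__global__ void sum1d(const char *v1, const char *v2, int length, int *res) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {  // the grid is rounded up, so guard the last block
        res[i] = v1[i] + v2[i];
    }
}

// Hypothetical host wrapper (name and signature are illustrative).
// `out` must have room for `length` ints. Returns true on success.
bool sum_on_device(const char *a, const char *b, int length, int *out) {
    char *da = NULL, *db = NULL;
    int *dres = NULL;
    size_t in_bytes = length * sizeof(char);
    size_t out_bytes = length * sizeof(int);

    bool ok = cudaMalloc((void **)&da, in_bytes) == cudaSuccess &&
              cudaMalloc((void **)&db, in_bytes) == cudaSuccess &&
              cudaMalloc((void **)&dres, out_bytes) == cudaSuccess &&
              cudaMemcpy(da, a, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(db, b, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;                              // assumed block size
        int blocks = (length + threads - 1) / threads;  // round up
        sum1d<<<blocks, threads>>>(da, db, length, dres);
        // The blocking copy below waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out, dres, out_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dres);
    return ok;
}
```

A call such as `sum_on_device(v1, v2, VSIZE, vres)` would stand in for the allocation, launch, and copy-back steps of `main`. The bounds check lets the vector size be anything, not only a multiple of the block size.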

Thank you for the reply.

I found the error in that code and now the algorithm works; basically I changed this line

int blockId=blockIdx.x+blockIdx.y*blockDim.x;

into this:

int blockId=blockIdx.x+blockIdx.y*gridDim.x;
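To see why the multiplier has to be `gridDim.x`: a block's row in the grid is `gridDim.x` blocks wide, just as a thread's row in a block is `blockDim.x` threads wide. The two formulas only differ when `blockDim.x` and `gridDim.x` differ (in a 2×2 grid of 2×2 blocks they give the same value). In a 3×2 grid of 4×2 blocks, for example, block (0,1) gets ID 4 with `blockDim.x` but ID 3 with `gridDim.x`. The sketch below is an illustration only; `flatten2d`, `write_ids`, and `count_covered` are hypothetical names, and the z dimensions are assumed to be 1.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Row-major flattening of a 2D index whose rows are `width` wide.
__host__ __device__ inline int flatten2d(int x, int y, int width) {
    return x + y * width;
}

// Each thread computes its global ID and stores it in out[ID].
__global__ void write_ids(int *out) {
    int threadsPerBlock = blockDim.x * blockDim.y;
    int blockId = flatten2d(blockIdx.x, blockIdx.y, gridDim.x);     // grid width
    int localId = flatten2d(threadIdx.x, threadIdx.y, blockDim.x);  // block width
    int tId = localId + blockId * threadsPerBlock;
    out[tId] = tId;
}

// Launches write_ids and counts how many slots hold their own index.
// If the IDs are a correct numbering, every slot is covered exactly once.
// Returns -1 if a CUDA call failed.
int count_covered(dim3 grid, dim3 block) {
    int total = grid.x * grid.y * block.x * block.y;
    size_t bytes = total * sizeof(int);
    std::vector<int> host(total, -1);
    int *dev = NULL;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;
    cudaMemset(dev, 0xFF, bytes);  // every int starts as -1
    write_ids<<<grid, block>>>(dev);
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int covered = 0;
    for (int i = 0; i < total; i++) {
        if (host[i] == i) covered++;
    }
    return covered;
}
```

With the corrected formula, `count_covered(dim3(3, 2), dim3(4, 2))` should report all 48 slots as covered. With `blockDim.x` as the multiplier, some block IDs would be skipped and others would point past the end of the buffer.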

I chose to use 2D blocks because I wasn’t sure about block dimensions and how to handle them. I see your point anyway and I agree!

The logic of the code is: while there are enough threads, assign one thread to each vector position; when there are not, assign a contiguous portion of the vector to each thread.

I did this because, depending on the device capability (and the size of the input), the vector could be too big to have one thread for each position. As I said I am very new to CUDA so… do you think that’s a reasonable idea?

OK about the shared memory, but I don’t think I clearly understand what texture binding is! Could you give me some reference?

thanks

Right, I managed to use texture binding! I also followed your suggestion and now handle the blocks as 1D. Here is the code; please tell me if I am using the texture binding correctly!

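#include <stdio.h>
#include <stdlib.h>

#define VSIZE 1024 //assumed size; the rounded-up grid below works for any VSIZE

//Note: texture references were deprecated in later CUDA releases and removed in CUDA 12
texture<char, 1, cudaReadModeElementType> texRef1; //Are they supposed to be global???
texture<char, 1, cudaReadModeElementType> texRef2;

//Prints the last CUDA error, if any, and exits
void checkCUDAError(const char *msg) {
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess) {
		fprintf(stderr, "CUDA error: %s %s\n", msg, cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}

//The inputs are read through the textures, so only the output is a parameter
__global__ void sum(int *res, int length)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	//The grid is rounded up, so the last block may have idle threads
	if (i < length)
		res[i] = tex1Dfetch(texRef1, i)+tex1Dfetch(texRef2, i);
}

int main() {
	char v1[VSIZE], v2[VSIZE], *vc1, *vc2;
	int vres[VSIZE], *vcres;
	int size1= sizeof(char)*VSIZE;
	int size2=sizeof(int)*VSIZE;
	int i;

	for (i=0; i<VSIZE; i++) {
		v1[i]=50;
		v2[i]=50;
	}

	cudaEvent_t start,stop;
	float time;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	//initialize and copy mem to device
	cudaMalloc((void**)&vc1, size1);
	cudaMemcpy(vc1, v1, size1, cudaMemcpyHostToDevice);
	cudaMalloc((void**)&vc2, size1);
	cudaMemcpy(vc2, v2, size1, cudaMemcpyHostToDevice);
	cudaMalloc((void**)&vcres, size2);

	//Binds the vectors to textures (the size argument is in bytes)
	cudaBindTexture(0, texRef1, vc1, size1);
	cudaBindTexture(0, texRef2, vc2, size1);
	checkCUDAError("Texture binding:");

	cudaEventRecord( start, 0 );
	//Launch the kernel defined above; round the block count up so no element is skipped
	sum<<<(VSIZE+255)/256,256>>>(vcres, VSIZE);
	//cudaThreadSynchronize() is the deprecated predecessor of this call
	cudaDeviceSynchronize();
	checkCUDAError("kernel execution");
	cudaEventRecord( stop , 0 );
	cudaEventSynchronize( stop );
	cudaEventElapsedTime( &time, start, stop );
	printf("Elapsed time: %fms\n", time);

	cudaMemcpy(vres, vcres, size2, cudaMemcpyDeviceToHost);

	cudaUnbindTexture(texRef1);
	cudaUnbindTexture(texRef2);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(vc1);
	cudaFree(vc2);
	cudaFree(vcres);

	return 0;
}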

Without texture, ~37% of the time was spent executing the kernel. Now ~5% is spent instead… that’s great! Thank you Preetha.

I have done all of this in order to practice with this kind of programming paradigm, which is new to me. In fact there is no speedup from using the GPU to sum two vectors (memory transfers take too long!!). The question is: is this right? Or is it possible to optimize the code so that the GPU can beat the CPU in such a simple computation?

Indeed there is no way the GPU can beat the CPU here, as the kernel is entirely bandwidth bound and the PCIe transfer to the card alone takes longer than the full computation on any decent CPU.
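One way to check this on a given machine is to time the copies and the kernel separately with CUDA events. The following is only a rough sketch under stated assumptions: `time_sum` and `sum1d` are hypothetical names, the block size of 256 is arbitrary, `length` is assumed to be positive, and event timings of small workloads include launch and driver overhead, so the numbers are indicative at best.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void sum1d(const char *v1, const char *v2, int length, int *res) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) res[i] = v1[i] + v2[i];
}

// Hypothetical helper: reports the time spent in host<->device copies and
// in the kernel, both in milliseconds. Returns false if something failed.
bool time_sum(int length, float *transfer_ms, float *kernel_ms) {
    std::vector<char> h1(length, 1), h2(length, 2);
    std::vector<int> hres(length, 0);
    size_t in_bytes = length * sizeof(char);
    size_t out_bytes = length * sizeof(int);

    char *d1 = NULL, *d2 = NULL;
    int *dres = NULL;
    cudaMalloc((void **)&d1, in_bytes);
    cudaMalloc((void **)&d2, in_bytes);
    cudaMalloc((void **)&dres, out_bytes);

    cudaEvent_t ev[4];
    for (int k = 0; k < 4; k++) cudaEventCreate(&ev[k]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(d1, h1.data(), in_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, h2.data(), in_bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1], 0);

    int threads = 256;
    int blocks = (length + threads - 1) / threads;
    sum1d<<<blocks, threads>>>(d1, d2, length, dres);
    cudaEventRecord(ev[2], 0);

    cudaMemcpy(hres.data(), dres, out_bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);

    float h2d = 0.0f, kern = 0.0f, d2h = 0.0f;
    cudaEventElapsedTime(&h2d, ev[0], ev[1]);   // host -> device copies
    cudaEventElapsedTime(&kern, ev[1], ev[2]);  // kernel
    cudaEventElapsedTime(&d2h, ev[2], ev[3]);   // device -> host copy
    *transfer_ms = h2d + d2h;
    *kernel_ms = kern;

    // Validate via the runtime's error state and a spot check of the result.
    bool ok = cudaGetLastError() == cudaSuccess &&
              hres[0] == 3 && hres[length - 1] == 3;

    for (int k = 0; k < 4; k++) cudaEventDestroy(ev[k]);
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dres);
    return ok;
}
```

Comparing `transfer_ms` with `kernel_ms` for a few vector sizes shows how much of the total is spent moving data rather than adding it.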

On compute capability 1.x devices the maximum supported block count per grid dimension is 65535 and the maximum supported number of threads per block is 512. So a 1D grid can hold up to 65535*512 CUDA threads in a single kernel launch. I hope your requirement does not exceed this.

If it does, it is better to split the data and call the kernel repeatedly. In any case, “assigning contiguous portions of vector to each thread” inside a kernel is not a good choice: it causes non-coalesced accesses and will definitely slow down your kernel. (Even if you use textures for the input vectors, the output is still written to global memory.)
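Splitting the data and calling the kernel repeatedly might be sketched as follows. This is an illustration under assumptions, not code from the thread: `sum1d` and `sum_in_chunks` are hypothetical names, the block size of 256 is arbitrary, and `blocks_per_launch` (assumed positive) stands in for whatever grid limit the device imposes.

```cuda
#include <cuda_runtime.h>

__global__ void sum1d(const char *v1, const char *v2, int length, int *res) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) res[i] = v1[i] + v2[i];
}

// Hypothetical helper: sums two host vectors on the device, launching the
// kernel once per chunk of at most blocks_per_launch * 256 elements.
// `out` must have room for `length` ints. Returns true on success.
bool sum_in_chunks(const char *a, const char *b, int length, int *out,
                   int blocks_per_launch) {
    const int threads = 256;
    const int chunk = blocks_per_launch * threads;  // elements per launch
    size_t in_bytes = length * sizeof(char);
    size_t out_bytes = length * sizeof(int);

    char *da = NULL, *db = NULL;
    int *dres = NULL;
    bool ok = cudaMalloc((void **)&da, in_bytes) == cudaSuccess &&
              cudaMalloc((void **)&db, in_bytes) == cudaSuccess &&
              cudaMalloc((void **)&dres, out_bytes) == cudaSuccess &&
              cudaMemcpy(da, a, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(db, b, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        for (int offset = 0; offset < length; offset += chunk) {
            int remaining = length - offset;
            int n = remaining < chunk ? remaining : chunk;
            int blocks = (n + threads - 1) / threads;
            // Each launch works on its own slice; within a slice, thread i
            // still handles element i, so accesses stay contiguous per warp.
            sum1d<<<blocks, threads>>>(da + offset, db + offset, n, dres + offset);
        }
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out, dres, out_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dres);
    return ok;
}
```

For example, `sum_in_chunks(v1, v2, 1000, vres, 1)` would cover 1000 elements with four launches of one block each. Unlike the per-thread partitioning in the original kernel, neighbouring threads still access neighbouring elements.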

Vector addition is not a good problem for the GPU to beat the corresponding CPU version, since it does not involve enough computation to compensate for the data transfer time. Remember that GPUs are good at computation, and try to exploit this strength to achieve better results.

Anyway, you can use this problem to familiarize yourself with the programming paradigm.

That’s exactly what I meant!

The next step will be the multiplication of the two vectors! Basically those vectors represent large numbers (arrays of bytes); I am trying to implement a multiple precision library with basic operations.

For the multiplication I will have to use the CUFFT library… and hopefully I’ll get what I am looking for! (Don’t tell me that even that is wasted time!!)

Thanks for the hints!