How to speed up data transfer?

Hi, I have 2 big arrays (each about 256MB) and I have to do matrix addition and store the result in another array…
Now my problem is that the data transfer is the bottleneck, since one cudaMemcpy() takes about 45ms and I have 3 of them, which is about 135ms just for transfers, plus about 50ms for kernel execution (GTX 460)…
Is there some way to speed this up? Or are streams and overlapping the only way?
Thanks
Sorry for my English

Hi! I will try to answer your question.

First of all, you are saying that you are uploading 256MB in 45ms? That means you have a throughput of approximately 5.7GB/s! Are you already using page-locked memory?

One way to speed up the memcpy is to use page-locked memory together with asynchronous memcpy calls.
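
A minimal sketch of what I mean (the buffer size, kernel name and launch configuration are placeholders, not your code, and there is no error checking):

size_t bytes = N * sizeof(int);
int *h_a, *d_a;

// page-locked (pinned) host buffer: required for truly asynchronous copies
cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);
cudaMalloc(&d_a, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

// these calls return immediately and are queued on the stream
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<blocks, threads, 0, stream>>>(d_a, N);
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);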

Another possible solution is to use mapped page-locked host memory ("zero-copy"), which lets you give your kernel a device pointer to the host storage, so the kernel streams the data in over PCIe as needed and writes it back to CPU memory when done.

Also, you could set the write-combined flag if you know that you only read from certain pointers and only write to others.
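
For example, something like this (only the allocation flags matter here; whether write-combined helps depends on your access pattern, since it is very slow for the CPU to read back):

// input buffers: written by the CPU, only read by the GPU
cudaHostAlloc(&a, HEIGHT * WIDTH * sizeof(int), cudaHostAllocWriteCombined);
cudaHostAlloc(&b, HEIGHT * WIDTH * sizeof(int), cudaHostAllocWriteCombined);

// result buffer: read back by the CPU, so keep it cacheable
cudaHostAlloc(&c, HEIGHT * WIDTH * sizeof(int), cudaHostAllocDefault);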

Check the CUDA Programming Guide 3.2 on how to use this (chapter 3.2.5).

Yes, I forgot to mention: I do use pinned memory.

OK, I tried zero-copy access and found that it is very slow, about 1480ms, while the old cudaMemcpy() version takes about 190ms. I am pretty sure I am wrong somewhere, but I just can't figure out where…

Here’s my code, so if you could help me…

Thanks

#define HEIGHT  8*1024
#define WIDTH   8*1024
#define threadsX 16
#define threadsY 16
#define blocksX  512
#define blocksY  512

#include <cuda.h>
#include <stdio.h>

__global__ void add( int *a, int *b, int *c) {
	int x = threadIdx.x + blockIdx.x * blockDim.x; // handle the data at this index
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = y + x * HEIGHT;

	while (y < HEIGHT) {
		while (x < WIDTH) {
			c[offset] = a[offset] + b[offset];
			x += blockDim.x * gridDim.x;
			offset = y + x * HEIGHT;
		}
		x = threadIdx.x + blockIdx.x * blockDim.x;
		y += blockDim.y * gridDim.y;
		offset = y + x * HEIGHT;
	}
}

int main( void ) {
	int *a, *b, *c;
	int *dev_a, *dev_b, *dev_c;
	cudaEvent_t start, stop;

	cudaSetDeviceFlags(cudaDeviceMapHost);

	cudaHostAlloc( &a, HEIGHT * WIDTH * sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc( &b, HEIGHT * WIDTH * sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc( &c, HEIGHT * WIDTH * sizeof(int), cudaHostAllocMapped);

	cudaHostGetDevicePointer( &dev_a, a, 0);
	cudaHostGetDevicePointer( &dev_b, b, 0);
	cudaHostGetDevicePointer( &dev_c, c, 0);

	// fill the arrays 'a' and 'b' on the CPU
	for (int i = 0; i < WIDTH; i++) {
		for (int j = 0; j < HEIGHT; j++) {
			a[i * HEIGHT + j] = -i + j;
			b[i * HEIGHT + j] = i * j;
		}
	}

	// capture the start time
	cudaEventCreate( &start );
	cudaEventCreate( &stop );
	cudaEventRecord( start, 0 );

	dim3 threads (threadsX, threadsY);
	dim3 blocks (blocksX, blocksY);
	add<<<blocks, threads>>>(dev_a, dev_b, dev_c);
	cudaThreadSynchronize();

	// capture the stop time
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );

	float elapsedTime;
	cudaEventElapsedTime( &elapsedTime, start, stop );
	printf( "Time to generate: %3.1f ms\n", elapsedTime );

	cudaEventDestroy( start );
	cudaEventDestroy( stop );

	cudaFreeHost(a);
	cudaFreeHost(b);
	cudaFreeHost(c);

	return 0;
}

That kernel looks like a pretty convoluted way to do simple addition. Given that the matrices are just stored in linear memory and have the same layout, why not just add corresponding words in each input array?

I made the kernel like that so I could add matrices with an arbitrary number of elements, and not be restricted by the maximum number of threads…

You can do that with a single loop and five lines of kernel code.

How could i do that?

template <typename T>
__global__ void addkernel(const T *a, const T *b, T *c, const size_t n)
{
	size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;
	size_t stride = blockDim.x * gridDim.x;

	// grid-stride loop: each thread walks through the array in steps of the total thread count
	for (size_t i = tidx; i < n; i += stride) {
		c[i] = a[i] + b[i];
	}
}

Launch it with 8 blocks per multiprocessor and as many warps per block as will give full occupancy on the architecture you are using. There are no limits anywhere except size_t and GPU memory. All reads and writes should be coalesced. It should be faster than what you are using now.
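
For example, roughly like this (the 8 blocks per SM and 256 threads per block are just a starting point to tune from, not magic numbers):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int threadsPerBlock = 256;                        // 8 warps per block
int numBlocks = 8 * prop.multiProcessorCount;     // e.g. 8 * 7 = 56 on a GTX 460

addkernel<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, (size_t)WIDTH * HEIGHT);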

You should be able to overlap the transfer to the device with the transfer from the device. See section 3.1.2 of the “best practices guide”.
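
Roughly like this (a sketch of what the guide describes; both host buffers must be pinned, and as noted further down it only actually overlaps on devices with two copy engines):

cudaStream_t up, down;
cudaStreamCreate(&up);
cudaStreamCreate(&down);

// upload the next input while the previous result is being downloaded
cudaMemcpyAsync(dev_a, a, bytes, cudaMemcpyHostToDevice, up);
cudaMemcpyAsync(c, dev_c, bytes, cudaMemcpyDeviceToHost, down);

cudaStreamSynchronize(up);
cudaStreamSynchronize(down);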

I am also facing the same data transfer bottleneck.

I have already used pinned memory (this gave me a significant improvement) and tried zero copy, but it did not give any improvement.
So I am trying to use some kind of data compression algorithm and transfer the compressed data.

Has anybody done this? Please help me. Your ideas and thoughts are warmly welcome.

Unfortunately, that is not possible on the GeForce cards. Only one DMA engine is available.

Thanks! It did the job very well… I managed to get 138ms, which is great!

You mean to launch it with 8 * 7 SMs = 56 blocks and 1024 threads? I tried that and it takes about 153ms…

I read it, but it only covers overlapping data transfer with execution. Did you mean to use 2 streams, one for the transfer to the device plus the kernel execution, and one for the transfer from the device? Thanks!

If you need to access all elements only once, try the zero-copy code I pasted, and fix the kernel with avidday’s version.

Actually, I think it’s possible on CC 2.1 devices…

Oh? Hmm, I have a GTS 450 here. Should test this…

Either that, or just do exactly what the best practices guide says. Transfers from device should be automatically overlapped with transfers to device.

But there’s also the question of whether your system memory can handle two streams of 5+ GB/s at the same time.

I was under the impression that all Fermi cards have two DMA engines. And that PCI Express allows you to transfer data in two directions at the same time. I could be wrong.

It is possible though to overlap one transfer via DMA with a transfer via zerocopy (preferably in the opposite direction so they don’t compete for PCIe bandwidth).

The second DMA engine is only enabled for Tesla and Quadro Fermi cards, not for GeForce.

Are you sure that zero copy bypasses the DMA?

pretty much

You’re quite right.

I wrote a test app. On a GeForce 560, I see almost no speedup from streams if I use cudaMemcpy to transfer data both ways. But there’s a substantial speedup from using cudaMemcpy to send data to the device and zero-copy to send data back.
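
A rough sketch of that combination (names and sizes are illustrative, not the exact test code): the inputs go over the normal DMA path, and the kernel writes its result straight into mapped host memory, so the transfer back happens while the kernel runs.

cudaSetDeviceFlags(cudaDeviceMapHost);

// result buffer: mapped (zero-copy) pinned host memory
int *c, *dev_c;
cudaHostAlloc(&c, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer(&dev_c, c, 0);

// input buffers: ordinary device memory, filled with cudaMemcpy
int *dev_a, *dev_b;
cudaMalloc(&dev_a, bytes);
cudaMalloc(&dev_b, bytes);
cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);

addkernel<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();    // after this, c on the host holds the result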