Using cudaHostAlloc

I am running a simulation in which I track how a system evolves in time. This requires several memory copies from the GPU to the CPU. I was reading CUDA by Example by Sanders and Kandrot and noticed that page-locked (pinned) memory might help speed up these copies. However, I have run into problems. Here is how I currently allocate the memory:

x = (double*)malloc(n*sizeof(double));
y = (double*)malloc(n*sizeof(double));
z = (double*)malloc(n*sizeof(double));

cudaSetDevice(0);

cudaHostAlloc((void**)&x, n*sizeof(double), cudaHostAllocDefault);
cudaHostAlloc((void**)&y, n*sizeof(double), cudaHostAllocDefault);
cudaHostAlloc((void**)&y, n*sizeof(double), cudaHostAllocDefault);


cudaHostGetDevicePointer (&xd, x, 0);
cudaHostGetDevicePointer (&yd, y, 0);
cudaHostGetDevicePointer (&zd, z, 0);

I initially had the malloc calls commented out; however, when I ran the code that way, it segfaulted immediately. My system has two GPUs, a Quadro 295 and a GTX 570 (which is device "0"), hence the cudaSetDevice call. Am I using these allocations correctly, and if so, why am I not seeing a speedup?

Thank you in advance.
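
For comparison, here is a minimal sketch of how pinned host buffers are commonly paired with ordinary device buffers when the goal is faster cudaMemcpy transfers. Two things in the snippet above stand out: the third cudaHostAlloc call writes to y a second time, so z is never given pinned memory, and calling malloc first only leaks that memory, because cudaHostAlloc overwrites the pointer. The sketch assumes plain device-to-host copies are wanted rather than mapped (zero-copy) access, so it skips cudaHostGetDevicePointer entirely; that call is normally used with buffers allocated using cudaHostAllocMapped. The CHECK macro, the value of n, and the cudaMemset standing in for the simulation kernel are illustrative assumptions, not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): stop on any CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(double);

    double *x = nullptr, *y = nullptr, *z = nullptr;     // pinned host buffers
    double *xd = nullptr, *yd = nullptr, *zd = nullptr;  // device buffers

    CHECK(cudaSetDevice(0));

    // Pinned host allocations. No malloc beforehand: cudaHostAlloc
    // allocates the memory itself and stores the address in the pointer.
    CHECK(cudaHostAlloc((void**)&x, bytes, cudaHostAllocDefault));
    CHECK(cudaHostAlloc((void**)&y, bytes, cudaHostAllocDefault));
    CHECK(cudaHostAlloc((void**)&z, bytes, cudaHostAllocDefault));

    // Separate device allocations that the kernels would work on.
    CHECK(cudaMalloc((void**)&xd, bytes));
    CHECK(cudaMalloc((void**)&yd, bytes));
    CHECK(cudaMalloc((void**)&zd, bytes));

    // Stand-in for the simulation step: fill the device buffers with zeros.
    CHECK(cudaMemset(xd, 0, bytes));
    CHECK(cudaMemset(yd, 0, bytes));
    CHECK(cudaMemset(zd, 0, bytes));

    // Device-to-host copies into the pinned buffers.
    CHECK(cudaMemcpy(x, xd, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(y, yd, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(z, zd, bytes, cudaMemcpyDeviceToHost));

    printf("x[0] = %f, y[0] = %f, z[0] = %f\n", x[0], y[0], z[0]);

    // Pinned memory is released with cudaFreeHost, not free.
    CHECK(cudaFreeHost(x));
    CHECK(cudaFreeHost(y));
    CHECK(cudaFreeHost(z));
    CHECK(cudaFree(xd));
    CHECK(cudaFree(yd));
    CHECK(cudaFree(zd));
    return 0;
}
```

Whether this gives a measurable speedup depends on how much of the run time is actually spent in the copies; timing the cudaMemcpy calls with and without pinned buffers would show that directly.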