Why do I get a segmentation fault when I try to access an array in global memory?

I am very new to CUDA, so I apologize if this question is basic or has been asked before.

void CudaRenderer::render() {
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((image->width + blockDim.x - 1) / blockDim.x,
                 (image->height + blockDim.y - 1) / blockDim.y);
    int *filteredCircles,
        *lastIndices,
        sz = gridDim.y * gridDim.x;
    cudaMalloc((void **)&filteredCircles, sizeof(int) * sz * 2000);
    cudaMalloc((void **)&lastIndices, sizeof(int) * sz);
    cudaMemset(lastIndices, 0, sizeof(int) * sz);
    filterCircles<<<gridDim, blockDim>>>(filteredCircles, lastIndices);
    for (int i = 0; i < 1; ++i)
        printf("lastIndices[%d] = %d\n", i, lastIndices[i]);
    kernelRenderCircles<<<gridDim, blockDim>>>(filteredCircles, lastIndices);
    cudaFree(filteredCircles);
    cudaDeviceSynchronize();
}

What I am trying to do here is allocate memory for two arrays in global memory, which will later be used to record data within kernels.

Before I added the printf line, the code compiled without issues but the results came out wrong. After I added the printf, it seems that any attempt to access elements of lastIndices causes a segmentation fault.

Do I need to initialize filteredCircles too, even if the initial values do not matter?

What did I do wrong here? Thank you!

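lastIndices is allocated on the device with cudaMalloc, so it holds a device address that host code cannot dereference; reading lastIndices[i] inside the host-side printf is what causes the segmentation fault. To access and print its elements, copy them back to host memory first, for example with cudaMemcpy and cudaMemcpyDeviceToHost.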
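
The copy-back step could look roughly like the sketch below. The helper name copyToHost is hypothetical (it is not part of the original code), and the sketch assumes the kernels run in the default stream, in which case a plain cudaMemcpy issued afterwards should wait for the preceding kernel before copying.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): copies `count` ints from a
// device buffer into a host vector so the values can be read on the CPU.
// Returns an empty vector if the copy fails.
std::vector<int> copyToHost(const int *devPtr, size_t count) {
    std::vector<int> host(count);
    cudaError_t err = cudaMemcpy(host.data(), devPtr, count * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        // Report the failure instead of silently returning stale data.
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        host.clear();
    }
    return host;
}
```

In render(), the printf loop would then read from the host copy rather than from the device pointer, e.g. call copyToHost(lastIndices, sz) after the filterCircles launch and print elements of the returned vector. Checking the return value of each CUDA call, as done here for cudaMemcpy, may also help track down why the results were wrong before the printf was added.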