I’m curious about the maximum number of blocks in CUDA. Let’s say we have a GeForce 8800 GTX with 16 streaming multiprocessors, 8 ALUs per multiprocessor, 16 KB of shared memory per multiprocessor, and a kernel where each block requires 1060 bytes of shared memory.
As far as I understand, the maximum number of blocks that can run simultaneously is limited by the shared memory requirements of the blocks. So I tried to calculate the maximum number of blocks for kernel execution as:
max. number of blocks per multiprocessor: 16384 B / 1060 B → 15
max. number of blocks on the device: 15 × 16 = 240
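To make my reasoning explicit, here is the same back-of-the-envelope calculation as a tiny C snippet (just my arithmetic spelled out, nothing from the CUDA API):

#include <stdio.h>

int main(void)
{
    const int smem_per_sm    = 16384; /* 16 KB shared memory per multiprocessor */
    const int smem_per_block = 1060;  /* shared memory required by one block */
    const int num_sms        = 16;    /* multiprocessors on the 8800 GTX */

    int blocks_per_sm = smem_per_sm / smem_per_block;              /* 16384/1060 = 15 */
    printf("max blocks per SM:    %d\n", blocks_per_sm);
    printf("max blocks on device: %d\n", blocks_per_sm * num_sms); /* 15*16 = 240 */
    return 0;
}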
However, this calculation is at odds with my experimental results. The aforementioned kernel achieves really good performance when it is launched on a 64×64 grid = 4096 blocks, or larger. Yet for a 512×512 grid of blocks, the kernel crashes.
So I would expect that blocks on a multiprocessor are replaced with new ones as soon as they finish processing. Is that correct, or is some other mechanism used?
Finally, how is the maximum number of blocks that can be run
a) by one multiprocessor
b) by one kernel
correctly determined?
Thanks in advance for helping me to better understand the execution model. :)
Cheers
Ana
Denis, thanks for the tips. Here are the values that I get from deviceQuery and the occupancy calculator:
Device 0: "GeForce 8800 Ultra"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804978688 bytes
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1512000 kilohertz
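As an aside, the same values can also be read programmatically with cudaGetDeviceProperties() — a minimal sketch (essentially what deviceQuery does; the fields are standard cudaDeviceProp members):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 */

    printf("Name:                    %s\n", prop.name);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Shared memory per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:    %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:     %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}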
The lines with the maximum sizes of each dimension of a block/grid are not really clear to me. So,
what is the maximum number of blocks that I can run in one kernel invocation? Is it 65535^2, or less?
1.) Select a GPU from the list (click): G80
2.) Enter your resource usage:
Threads Per Block 256
Registers Per Thread 8
Shared Memory Per Block (bytes) 1060
3.) GPU Occupancy Data is displayed here and in the graphs:
Active Threads per Multiprocessor 768
Active Warps per Multiprocessor 24
Active Thread Blocks per Multiprocessor 3
Occupancy of each Multiprocessor 100%
Maximum Simultaneous Blocks per GPU 48
---
Maximum Thread Blocks Per Multiprocessor (blocks):
Limited by Max Warps / Multiprocessor 3
Limited by Registers / Multiprocessor 4
Limited by Shared Memory / Multiprocessor 10
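To spell out where those three limits come from, here is the arithmetic as a small C snippet (my reading of the calculator; the 512-byte shared-memory allocation granularity for compute capability 1.x is taken from the calculator spreadsheet, so treat it as an assumption):

#include <stdio.h>

int main(void)
{
    /* G80 per-multiprocessor limits (compute capability 1.0) */
    const int max_warps_per_sm = 24;
    const int regs_per_sm      = 8192;
    const int smem_per_sm      = 16384;
    const int smem_granularity = 512;   /* assumed allocation unit on CC 1.x */

    /* kernel resource usage entered above */
    const int threads_per_block = 256;
    const int regs_per_thread   = 8;
    const int smem_per_block    = 1060;

    int warps_per_block = threads_per_block / 32;                       /* 8 */
    int by_warps = max_warps_per_sm / warps_per_block;                  /* 3 */
    int by_regs  = regs_per_sm / (regs_per_thread * threads_per_block); /* 4 */
    int smem_rounded = ((smem_per_block + smem_granularity - 1)
                        / smem_granularity) * smem_granularity;         /* 1536 */
    int by_smem  = smem_per_sm / smem_rounded;                          /* 10 */

    int blocks_per_sm = by_warps;       /* take the minimum of the three: 3 */
    if (by_regs < blocks_per_sm) blocks_per_sm = by_regs;
    if (by_smem < blocks_per_sm) blocks_per_sm = by_smem;

    printf("active blocks per SM:      %d\n", blocks_per_sm);      /* 3 */
    printf("blocks in flight (16 SMs): %d\n", blocks_per_sm * 16); /* 48 */
    return 0;
}

Note that the granularity rounding (1060 → 1536 bytes) would also explain why the shared memory limit comes out as 10 blocks rather than the 15 from my first post.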
According to the occupancy calculator, each multiprocessor can run 3 blocks of this kernel simultaneously. Does that mean a new block is loaded for execution on a multiprocessor as soon as it has finished processing a block, thus allowing thousands of blocks to be executed in one kernel run?
It would be great if you could help me clear up this very basic source of confusion. It would really help to know how many blocks can be expected to execute correctly in one such kernel launch. Thanks a lot in advance!
65535^2 is correct: that is the maximum number of blocks a single kernel launch can have, and it is independent of your kernel (register usage, etc.). In your case you can have 3 blocks per multiprocessor at any given time, so you will have 16 × 3 = 48 blocks in flight at once. Blocks that have finished are indeed replaced by blocks that have not run yet (which is why you cannot have communication between blocks).
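To illustrate, each block just has to do its own independent slice of the work, and the hardware runs the blocks in waves. A minimal sketch (a made-up kernel, not your code):

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            /* guard the tail of the array */
        data[i] *= 2.0f;
}

/* 4096 independent blocks; on a 16-SM G80 only ~48 of them are resident
   at any moment, the rest are issued as running blocks retire: */
/* scale<<<4096, 256>>>(d_data, 4096 * 256); */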
Thanks for the clarification! One additional question: our experimental kernel starts crashing for grids of, e.g., 512^2 thread blocks (it finishes the computation, but reports a memory access error) and 1024^2 thread blocks (it hangs completely), although it works correctly for smaller grid sizes, e.g. 256^2 and slightly larger. According to your post it should scale to 65535^2 blocks, which would be great. Do you have any tip on where to look for the source of the problem?
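In case it matters, the first thing I am double-checking is the generic indexing pattern below (a simplified sketch, not our actual kernel); as far as I understand, an unguarded write past the end of the allocation would explain exactly this kind of crash that only appears at larger grids:

__global__ void kernel(float *out, size_t n)
{
    /* linearize the 2D grid; the cast avoids int overflow for huge grids */
    size_t block = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
    size_t i = block * blockDim.x + threadIdx.x;
    if (i < n)             /* without this guard, large grids write out of bounds */
        out[i] = 0.0f;     /* placeholder for the real computation */
}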
Hello. Sorry for digging up an old thread, but I would rather not start another cloned topic.
I’m going to implement quite a big code in CUDA.
A quick look reveals there are 1188 double variables declared in the code. If I truncate them to float, they should occupy no more than 1188 registers, while there are supposed to be 8192 registers per thread block.
Still, the CUDA Occupancy Calculator reads that there is a limit of 0 concurrent threads, bounded by “registers per multiprocessor”. I entered the following:
Compute capability 1.1
6 threads per block
1188 registers per thread
0 or 1 byte of shared memory
Is that correct? Can’t I use a large number of registers per single execution thread?
Using shared memory instead is somewhat less tempting, as I would need 4752 bytes, or 9504 for double precision. That means my GTS 250 would become at most a 16-core processor when using shared memory, probably not as quick and generally not much use.
Also, a second issue:
The calculator says there are 8192 registers per thread block.
The Nsight test shows there are 8192 registers per multiprocessor.
Does this imply that I can run only one block per multiprocessor?
1188 registers per thread is far beyond the capability of any CUDA device (and of any CPU I know, for that matter); a CUDA device can have at most 124 or 63 registers per thread, depending on compute capability. Not all variables need to reside in registers at the same time, though (which is also how conventional CPUs handle this problem). Compile the code with nvcc -Xptxas=-v and the compiler will print the number of registers each kernel uses.
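If the reported count is higher than you want, you can also cap it and let the excess spill to (slow) local memory — a sketch, with a made-up kernel name:

/* Option 1: per-kernel hint; the compiler derives a register budget
   from the maximum block size declared here. */
__global__ void __launch_bounds__(64) mykernel(float *out)
{
    /* ... kernel body ... */
}

/* Option 2: cap every kernel in the file at compile time and print usage:
     nvcc -Xptxas=-v --maxrregcount=32 mykernel.cu */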