c++ - CUDA Warps and Optimal Number of Threads Per Block


From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Here are my questions:

1) If an SMX unit can work on 64 warps, that means there is a limit of 32 x 64 = 2048 threads per SMX unit. Kepler GPUs have 4 warp schedulers; does that mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does that mean I should be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended multiples of 32? This is, of course, ignoring divergence and cases where a global memory access can cause a warp to stall and force the scheduler to switch to another.

2) If the above is correct, is the best possible outcome for a single SMX unit to work on 128 threads simultaneously? And for something like a GTX Titan, which has 14 SMX units, a total of 128 x 14 = 1792 threads? The numbers I see online say otherwise: that a Titan can run 14 x 64 (max warps per SMX) x 32 (threads per warp) = 28,672 threads concurrently. How can that be, if the SMX units have only 4 warp schedulers? They cannot launch all 2048 threads per SMX at once, can they? Maybe I'm confused about the definition of the maximum number of threads the GPU can launch concurrently, as opposed to what it is allowed to queue?
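For reference, the per-device limits these calculations rely on can be queried at runtime rather than taken from numbers found online. A minimal sketch, assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("SM count:                  %d\n", prop.multiProcessorCount);
        printf("warp size:                 %d\n", prop.warpSize);
        printf("max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
        printf("max resident warps per SM: %d\n", maxWarpsPerSM);
        printf("max resident threads, whole GPU: %d\n",
               prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
        return 0;
    }

On a GTX Titan this reports 14 SMX units and 2048 threads per SMX, i.e. the 28,672 resident-thread figure above.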

I would appreciate any answers and clarification on this.

So does that mean that only 4 warps can be worked on simultaneously within a GPU kernel?

Instructions from up to 4 warps can be scheduled in any given clock cycle on a Kepler SMX. However, due to the pipelines in the execution units, in any given clock cycle, instructions may be in various stages of pipeline execution from any and all of the warps currently resident on the SMX.

And if so, does that mean I should be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended multiples of 32?

I'm not sure how you jumped from the previous point to this one. Since the instruction mix presumably varies from warp to warp (different warps are presumably at different points in the instruction stream), and the instruction mix also varies from one place in the instruction stream to another, I don't see any logical connection between 4 warps being schedulable in a given clock cycle and any need to have groups of 4 warps. A given warp may be at a point where its instructions are highly schedulable (perhaps at a sequence of SP FMA instructions, requiring SP cores, which are plentiful), and another 3 warps may be at a point in their instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, of which there are fewer). Therefore arbitrarily grouping warps into sets of 4 doesn't make much sense. Note that we don't require divergence for warps to get out of sync with each other: the natural behavior of the scheduler, coupled with the varying availability of execution resources, will cause warps that started together to end up at different points in the instruction stream. The only sizing rule that actually follows from the hardware is to keep the block size a multiple of 32, as in the sketch below.
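A minimal launch-configuration sketch under that rule (myKernel and n are placeholders, not anything from the question):

    const int n = 1 << 20;                                 // example problem size
    const int blockSize = 256;                             // any multiple of 32 works: 128, 256, 512...
    const int gridSize = (n + blockSize - 1) / blockSize;  // round up so all n elements are covered
    myKernel<<<gridSize, blockSize>>>(/* args */);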

For the second question, I think your fundamental knowledge gap is in understanding how the GPU hides latency. Suppose a GPU has the following set of 3 instructions to issue across a warp:

    ld  r0, a[idx]
    ld  r1, b[idx]
    mpy r2, r0, r1

The first instruction is an LD from global memory; it can be issued and does not stall the warp. The second instruction likewise can be issued. The warp will stall at the 3rd instruction, however, due to the latency of global memory: until r0 and r1 are populated, the multiply instruction cannot be dispatched. The latency of main memory prevents it. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. warps that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps available to the SMX. There isn't any granularity involved (such as needing exactly 4 warps). Generally speaking, the more threads/warps/blocks there are in your grid, the better chance the GPU has of hiding latency.
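In CUDA C, that three-instruction sequence is roughly what the compiler emits for the body of an elementwise multiply kernel like the sketch below (mul is a hypothetical name). Launching it over a large grid gives each SMX many resident warps, so while one warp waits on its two loads, the scheduler can issue instructions from others:

    __global__ void mul(const float *a, const float *b, float *c, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)
            c[idx] = a[idx] * b[idx];  // roughly: ld r0,a[idx]; ld r1,b[idx]; mpy r2,r0,r1
    }

    // Launch with far more warps than any SMX can issue in one cycle,
    // so stalled warps can always be swapped for ready ones:
    // mul<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);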

So it is true that a GPU cannot "launch" 2048 threads (i.e. issue instructions to 2048 threads) in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted, and until then it is helpful to have other warps "ready to go" for the next clock cycle(s).
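You can ask the runtime how large that pool of resident warps will be for a given kernel and block size, using cudaOccupancyMaxActiveBlocksPerMultiprocessor (available since CUDA 6.5). A sketch, reusing the hypothetical mul kernel from above:

    int blockSize = 256;
    int blocksPerSMX = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSMX, mul, blockSize, 0);
    printf("resident blocks per SMX: %d (= %d resident warps)\n",
           blocksPerSMX, blocksPerSMX * blockSize / 32);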

GPU latency hiding is a commonly misunderstood topic. There are many resources available to learn about it if you search for them.

