**Threads** in CUDA are the workers who work on data. In the CUDA architecture, threads are grouped into **blocks** and blocks are grouped into a **grid**. Given such an architecture, there are 2 common techniques to allocate the data among the threads. Or put another way, there are 2 ways for each thread to pick the data it should work on.

**Technique 1: Chunks of Data Per Thread**

In this technique, data is broken into many contiguous **chunks** and each thread works on one (or none) such chunk. Here is sample code for illustration:

__global__ void fooKernel( const int* dataArray, int dataNum ) { // Thread info const int blocksPerGrid = gridDim.x; const int threadsPerBlock = blockDim.x; const int totalThreadNum = blocksPerGrid * threadsPerBlock; const int curThreadIdx = ( blockIdx.x * threadsPerBlock ) + threadIdx.x; // Work allocation const int dataPerThread = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum; const int curThreadDataBegin = dataPerThread * curThreadIdx; const int curThreadDataEnd = curThreadDataBegin + dataPerThread; // Iterate data chunk of this thread for ( int idx = curThreadDataBegin; idx < curThreadDataEnd; ++idx ) { // Check if data out of bounds if ( idx >= dataNum ) continue; // Do something with data int val = dataArray[ idx ]; } return; }

Note that the work allocation calculation is:

const int dataPerThread = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;

It is not a mere division of available work (data) by available labour (threads). Due to integer division, such a simple calculation would lead to trouble if `dataNum`

is not a multiple of `totalThreadNum`

.

So, if `dataNum`

is not a multiple of `totalThreadNum`

there will always be a few threads with no work. This is an unavoidable fact of life for this technique! :-)

This is also why we need to ensure we are always accessing something inside of the input data:

if ( idx >= dataNum ) continue;

This work allocation technique is advantageous if the work done by the thread benefits by having access to elements lying in the same chunk. If this is not the case, then Technique 2 is far easier to write and understand.

**Technique 2: Iterate With a Large Increment**

Simply put, in each iteration each thread accesses the data that is a long distance away from its current data. How far away? A distance equal to the *total number of threads*. This iteration is very simple to write:

__global__ void fooKernel( const int* dataArray, int dataNum ) { // Thread info const int blocksPerGrid = gridDim.x; const int threadsPerBlock = blockDim.x; const int totalThreadNum = blocksPerGrid * threadsPerBlock; const int curThreadIdx = ( blockIdx.x * threadsPerBlock ) + threadIdx.x; // Iterate over data for ( int idx = curThreadIdx; idx < dataNum; idx += totalThreadNum ) { // Do something with data int val = dataArray[ idx ]; } return; }

The work allocation calculation and the bounds check of Technique 1 are both not needed. All the magic is in the details of the loop:

for ( int idx = curThreadIdx; // Use thread index as beginning location idx < dataNum; // Thread never goes outside the data idx += totalThreadNum // Jump a long way )

You can easily see that these techniques lie at the ends of a spectrum of possible work allocation techniques. For example, a hybrid technique would handle chunks of size `n`

and then increment by `n * totalThreadNum`

. Look at your application closely and use what works best for you! :-)