0-Basics

KernelA<<<nBlk, nTid>>>(args);

A grid of blocks

gridDim, the number of blocks in each dimension of the grid

blockIdx, the index of the block in the grid

blockDim, the number of threads in each dimension of a block

A block of threads

threadIdx, the index of the thread in the block
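To make the four built-in variables concrete, here is a minimal sketch in which every thread records what it sees. The `ThreadInfo` struct and the `recordBuiltins` and `collectBuiltins` names are hypothetical, made up for illustration, and CUDA error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical record type: the x-components one thread sees in the built-ins.
struct ThreadInfo {
    unsigned gridDimX;    // number of blocks in the grid (x)
    unsigned blockDimX;   // number of threads per block (x)
    unsigned blockIdxX;   // index of this thread's block in the grid (x)
    unsigned threadIdxX;  // index of this thread in its block (x)
};

// Each thread writes its view of the built-in variables into its own slot.
__global__ void recordBuiltins(ThreadInfo *out) {
    unsigned i = blockDim.x * blockIdx.x + threadIdx.x;
    out[i].gridDimX = gridDim.x;
    out[i].blockDimX = blockDim.x;
    out[i].blockIdxX = blockIdx.x;
    out[i].threadIdxX = threadIdx.x;
}

// Host helper (illustrative): launch nBlk blocks of nTid threads and
// copy the records back. Return codes are not checked in this sketch.
std::vector<ThreadInfo> collectBuiltins(unsigned nBlk, unsigned nTid) {
    std::vector<ThreadInfo> host(nBlk * nTid);
    size_t bytes = host.size() * sizeof(ThreadInfo);
    ThreadInfo *dev = nullptr;
    cudaMalloc(&dev, bytes);
    recordBuiltins<<<nBlk, nTid>>>(dev);
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `collectBuiltins(3, 4)` launches 12 threads; every record should report a gridDim.x of 3 and a blockDim.x of 4, while blockIdx.x and threadIdx.x differ per thread.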

// 1-d
int global_thread_id = blockDim.x * blockIdx.x + threadIdx.x;

// 2-d
int global_x = blockDim.x * blockIdx.x + threadIdx.x;
int global_y = blockDim.y * blockIdx.y + threadIdx.y;

// 3-d
int global_x = blockDim.x * blockIdx.x + threadIdx.x;
int global_y = blockDim.y * blockIdx.y + threadIdx.y;
int global_z = blockDim.z * blockIdx.z + threadIdx.z;
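The index formulas above are usually paired with a bounds check, because the grid often covers more threads than there are elements. The following is a sketch of the 2-d case under assumed names (`tagCoordinates`, `runTagCoordinates`) and an assumed 16x16 block size; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each in-range thread stores a tag (y * 1000 + x) at its row-major position.
__global__ void tagCoordinates(int *out, int width, int height) {
    int global_x = blockDim.x * blockIdx.x + threadIdx.x;
    int global_y = blockDim.y * blockIdx.y + threadIdx.y;
    // Threads past the edge of the width x height array must do nothing.
    if (global_x < width && global_y < height) {
        out[global_y * width + global_x] = global_y * 1000 + global_x;
    }
}

// Host helper (illustrative): assumes width > 0 and height > 0.
std::vector<int> runTagCoordinates(int width, int height) {
    std::vector<int> host(width * height);
    size_t bytes = host.size() * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, bytes);

    // Round the grid size up so the blocks cover the whole array.
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
                 (height + dimBlock.y - 1) / dimBlock.y, 1);
    tagCoordinates<<<dimGrid, dimBlock>>>(dev, width, height);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With a 40x20 array the grid is 3x2 blocks, so 768 x 512... more precisely 48x32 threads are launched and the bounds check discards the ones outside the 40x20 region.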

Note

There is no gridIdx: a kernel launch creates a single grid, so only blocks and threads need an index.

__host__: default, called from the CPU, runs on the CPU
__device__: called from the GPU, runs on the GPU
__global__: called from the CPU, runs on the GPU
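A small sketch of how the three qualifiers fit together. All function names here (`square`, `addOne`, `squarePlusOne`, `runOnGpu`) are hypothetical, and the combined `__host__ __device__` form is shown as an extra: it asks the compiler to build the function for both CPU and GPU. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

// __device__: callable only from code that is already running on the GPU.
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled twice, so both CPU and GPU code can call it.
__host__ __device__ float addOne(float x) { return x + 1.0f; }

// __global__: a kernel. Launched from the CPU with <<<...>>>, runs on the
// GPU, and must return void.
__global__ void squarePlusOne(float *data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        data[i] = addOne(square(data[i]));
    }
}

// __host__ (the default, written out here for clarity): ordinary CPU code.
__host__ float runOnGpu(float x) {
    float *dev = nullptr;
    cudaMalloc(&dev, sizeof(float));
    cudaMemcpy(dev, &x, sizeof(float), cudaMemcpyHostToDevice);
    squarePlusOne<<<1, 32>>>(dev, 1);  // 32 threads, only thread 0 passes i < n
    cudaMemcpy(&x, dev, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return x;
}
```

Calling `square` directly from `runOnGpu` would be a compile error, since a `__device__`-only function has no CPU version; `addOne` can be called from either side.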

dim3 dimGrid(32, 1, 1);
dim3 dimBlock(128, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(xxxx);
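The launch above can be fleshed out into a complete vector addition. This is a sketch only: the kernel signature `(a, b, c, n)` and the `vecAdd` host wrapper are assumptions, since the original argument list is not given, and the grid size is computed from n rather than fixed at 32 blocks. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed signature: c[i] = a[i] + b[i] for i in [0, n).
__global__ void vecAddKernel(const float *a, const float *b, float *c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {  // the last block may have threads past the end
        c[i] = a[i] + b[i];
    }
}

// Host wrapper (illustrative): assumes a and b are non-empty and equal in size.
std::vector<float> vecAdd(const std::vector<float> &a,
                          const std::vector<float> &b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // 128 threads per block; enough blocks to cover n (ceiling division).
    dim3 dimBlock(128, 1, 1);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(dA, dB, dC, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

With n = 4096 the ceiling division gives exactly the 32 blocks of 128 threads used in the launch above; for other sizes the `i < n` check keeps the extra threads in the last block from writing out of bounds.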