

GPUs, once strongly associated with gaming, are now becoming commonplace in industry, where frameworks like CUDA and OpenCL expose their power to general applications with serious computational needs. The key to their power is their parallelism. Where a standard CPU has between 4-8 cores, a GPU like the Nvidia GTX Titan readily comes with 2688 cores.

The key, therefore, to successfully unlocking their power is two-fold. The first part is to have an algorithm that can scale to the required level of parallelism. The second part, which may be less obvious, is to maximize GPU occupancy by carefully partitioning the computation to work with the hardware layout. In order to do that we must first look at what a GPU looks like at the logical hardware level.

The newest NVIDIA compute architecture is Kepler (compute capability 3.x), but here I will focus on describing a Fermi-based architecture (compute capability 2.x), as it is the one I am most familiar with and the one that Amazon EC2 uses in its Cluster GPU instance types.

A Fermi-based GPU is built around 16 Streaming Multiprocessors (SMs) positioned around a common L2 cache, with each SM containing 32 cores, registers, and a small chunk of shared memory. Surrounding the SMs are six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6GB of GDDR5 DRAM memory. The GigaScheduler provides a GPU-level scheduler that distributes thread blocks to the SMs' internal schedulers, while the Host Interface connects through PCI-Express to the host system.

Because an SM contains 32 cores it can only execute a maximum of 32 threads at any given time, which in CUDA-speak is called a warp. Every thread in a warp is executed in SIMD lockstep fashion, executing the same instruction but using its private registers to perform the requested operation. To get the most performance out of a GPU it is therefore important to understand warps and how to write warp-friendly code.

Warps are the primary way a GPU hides latency: if an operation will take a long time to complete - such as fetching from global memory - the warp scheduler will park the warp and schedule a different one. Once the memory access returns, the warp will be rescheduled. By optimizing the memory access pattern within a warp, accesses can be coalesced so that one call fetches data for several threads in the warp, greatly reducing the overall cost of memory latency. Additionally, as the instructions are executed in lockstep, it is important to avoid conditional code that forces the threads within a warp to execute different branches, as such divergence can also have a significant impact on the time it takes to complete a warp.
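To make the coalescing point concrete, here is a minimal sketch contrasting a coalesced access pattern with a strided one. The kernel names and parameters are my own illustration, not taken from the original article.

```
// Hypothetical kernels contrasting the two access patterns described above.
// In copy_coalesced, consecutive threads of a warp read consecutive
// addresses, so the hardware can serve the whole warp with a few wide
// memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// In copy_strided, consecutive threads read addresses `stride` elements
// apart, so a warp's reads scatter across many memory segments and cannot
// be coalesced, leaving the kernel far more exposed to memory latency.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

A similar concern applies to branching: if, say, the parity of the thread index decided which of two code paths a thread took, a warp would have to execute both paths one after the other rather than in parallel.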
Programming model

A program designed to run on a GPU is called a kernel, and in CUDA the level of parallelism for a kernel is defined by the grid size and the block size. The grid size defines the number of blocks and the shape of the cube that the blocks are distributed within. Since compute capability 2.0 the grid size can be specified in three dimensions, whereas earlier versions restricted it to only two. The block size follows a similar model but has always been able to be specified in three dimensions.

[Figure: Possible arrangement of CUDA blocks and threads]
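As a sketch of how these dimensions are expressed in code, both the grid and block sizes are passed to a kernel launch as `dim3` values. The kernel and the 16x16 block size below are illustrative choices of mine, not taken from the original text.

```
#include <cuda_runtime.h>

// Illustrative kernel operating on a 2D image stored row-major.
__global__ void scale_pixels(float *img, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= factor;
}

void launch_scale(float *d_img, int width, int height)
{
    // Block size in three dimensions (the z dimension is left at 1 here).
    dim3 block(16, 16, 1);

    // Grid size rounded up so the whole image is covered even when its
    // dimensions are not multiples of the block size.
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y,
              1);

    scale_pixels<<<grid, block>>>(d_img, width, height, 2.0f);
}
```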
How to choose size

In general, outside of the maximum allowed dimensions for blocks and grids, you want to size your blocks and grid to match your data while simultaneously maximizing occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size. Another factor to consider is that threads get scheduled in warps, so a block size should always be a multiple of 32; otherwise at least one warp will be scheduled that does not make use of all the cores in the SM.

Picking the right dimensions is somewhat of a black art, as the best choice can depend on the GPU, the kernel, the shape of the data, and the algorithm. Sometimes, for example, it makes sense to perform a bit more work in an individual thread to minimize the number of blocks that need to be scheduled. Therefore it is always good to experiment with various sizes to see what the impact is on your specific kernel.

When a thread is executing a kernel there are a few variables that CUDA exposes that can help with identifying which thread it is:
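These built-in variables are threadIdx, blockIdx, blockDim, and gridDim. As a minimal sketch of how a kernel typically combines them to compute a unique global index (the kernel itself is illustrative, not from the original article):

```
// Each thread derives its own global index from the built-in variables and
// uses it to pick which elements of the input it is responsible for.
__global__ void add_one(int *data, int n)
{
    // threadIdx - index of this thread within its block
    // blockIdx  - index of this block within the grid
    // blockDim  - size of a block, in threads
    // gridDim   - size of the grid, in blocks
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Total number of threads in the grid, used here for a grid-stride loop
    // so the kernel works even when n exceeds the number of launched threads.
    int total_threads = gridDim.x * blockDim.x;

    for (int i = global_idx; i < n; i += total_threads)
        data[i] += 1;
}
```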
