Concurrency

The GPU Thread Model Basics

Thousands of threads grouped into warps and blocks on a GPU.

A GPU runs the same program across thousands of lightweight threads at once. Unlike a CPU, which has a few powerful cores, a GPU has many simple cores designed for massive data parallelism. To organize all those threads, the GPU uses a hierarchy.

Threads are grouped into blocks, and blocks form a grid. Threads within a block can share a fast on-chip memory and synchronize with each other, while threads in different blocks generally cannot. The grid lets a single launch cover a huge problem by spreading blocks across the hardware.

  • Thread: the smallest unit; runs one instance of the program.
  • Block: a group of threads that share memory and can synchronize.
  • Grid: all the blocks for one launch, mapped onto the device.
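
To make the hierarchy concrete, here is a minimal sketch of the index arithmetic for a one-dimensional launch. It is a CPU-side model in Python, not GPU code, and the function names (blocks_needed, global_index, launch) are made up for illustration. It assumes each thread handles one data item and that the grid is rounded up to a whole number of blocks.

```python
# CPU-side model of thread/block/grid index arithmetic (not GPU code).
# All names here are hypothetical, chosen for illustration only.

def blocks_needed(n_items: int, threads_per_block: int) -> int:
    """Ceiling division: enough blocks so every item gets a thread."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_index(block_idx: int, threads_per_block: int, thread_idx: int) -> int:
    """Flat position of one thread within a one-dimensional grid."""
    return block_idx * threads_per_block + thread_idx


def launch(n_items: int, threads_per_block: int) -> list[int]:
    """Visit every (block, thread) pair and record which items get touched."""
    touched = []
    for block_idx in range(blocks_needed(n_items, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = global_index(block_idx, threads_per_block, thread_idx)
            if i < n_items:  # threads past the end of the data do nothing
                touched.append(i)
    return touched


print(blocks_needed(1000, 256))                 # 4
print(launch(1000, 256) == list(range(1000)))   # True
```

With 1000 items and 256 threads per block, the grid has 4 blocks (1024 threads), so the last 24 threads fall past the end of the data and are skipped by the bounds check.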

Inside the hardware, threads execute in fixed-size groups called warps, typically 32 threads that run in lockstep. All threads in a warp execute the same instruction together. If they take different branches, the warp runs each path in turn with the threads not on that path disabled, a slowdown called divergence.
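
The cost of divergence can be sketched with a small model. This is plain Python rather than GPU code; it assumes a warp width of 32 and a single two-way branch, and warp_passes is a hypothetical helper that counts how many branch paths each warp has to execute.

```python
# CPU-side model of warp divergence on one two-way branch (not GPU code).
WARP_SIZE = 32  # assumed warp width; the text says "typically" 32


def warp_passes(takes_branch: list[bool]) -> list[int]:
    """For each warp, count how many branch paths it must execute.

    1 means every thread in the warp agreed; 2 means the warp diverged
    and runs both paths, with part of the warp disabled each time.
    """
    passes = []
    for start in range(0, len(takes_branch), WARP_SIZE):
        warp = takes_branch[start:start + WARP_SIZE]
        passes.append(len(set(warp)))  # distinct outcomes within this warp
    return passes


# Branch on even/odd thread index: every warp contains both outcomes.
interleaved = [i % 2 == 0 for i in range(128)]
# Branch on which warp the thread belongs to: each warp agrees internally.
aligned = [(i // WARP_SIZE) % 2 == 0 for i in range(128)]

print(warp_passes(interleaved))  # [2, 2, 2, 2]
print(warp_passes(aligned))      # [1, 1, 1, 1]
```

Both inputs send half of the 128 threads down each path, but only the interleaved pattern diverges, because divergence depends on whether threads in the same warp disagree.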

Performance comes from keeping the hardware busy. Because memory latency is high, the GPU hides it by switching to other ready warps whenever one warp is waiting on memory, so launching far more threads than cores is normal and desirable.
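
A rough way to see why extra warps help is an idealized model, sketched below in Python. It assumes a single scheduler that issues for one warp at a time, and that every warp repeats the same cycle: a fixed number of compute cycles followed by a fixed memory wait. The cycle counts are illustrative placeholders, not measurements of any real device, and both function names are hypothetical.

```python
# Idealized model of latency hiding (not GPU code, not real hardware numbers).

def utilization(n_warps: int, compute_cycles: int, latency_cycles: int) -> float:
    """Fraction of cycles the scheduler has a ready warp to run.

    Each warp is busy for compute_cycles out of every
    compute_cycles + latency_cycles; more warps fill more of the gaps.
    """
    period = compute_cycles + latency_cycles
    return min(1.0, n_warps * compute_cycles / period)


def warps_to_hide(compute_cycles: int, latency_cycles: int) -> int:
    """Smallest warp count that keeps the scheduler busy in this model."""
    period = compute_cycles + latency_cycles
    return -(-period // compute_cycles)  # ceiling division


# Assumed workload: 10 cycles of compute, then a 390-cycle memory wait.
print(utilization(1, 10, 390))   # 0.025
print(utilization(8, 10, 390))   # 0.2
print(warps_to_hide(10, 390))    # 40
print(utilization(40, 10, 390))  # 1.0
```

Under these assumptions a single warp leaves the scheduler idle almost all the time, while 40 resident warps cover the whole memory wait, which is the intuition behind launching many more threads than there are cores.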

Key idea

A GPU organizes thousands of threads into warps, blocks, and a grid, running warps in lockstep and hiding memory latency by switching among many ready warps.

Check yourself

1. What is a warp on a GPU?

2. What can threads in the same block do that threads in different blocks generally cannot?

3. How does a GPU hide high memory latency?