The Tensor Cores

Beyond plain cores

A standard GPU core performs one scalar multiply-add per cycle. A tensor core instead performs an entire small matrix multiply-accumulate in a single operation: it takes two small tiles A and B, multiplies them, and adds the product into an accumulator tile C, giving D = A × B + C.
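To make the contrast concrete, here is a minimal NumPy sketch, not a model of any particular GPU. The 4 x 4 tile size and the function names scalar_fma_matmul and tile_mma are assumptions chosen for illustration. The scalar version issues one multiply-add at a time, while the tile version expresses the same result as one matrix multiply-accumulate.

```python
import numpy as np

def scalar_fma_matmul(a, b, c):
    """Compute a @ b + c one scalar multiply-add at a time, counting the steps."""
    m, k = a.shape
    n = b.shape[1]
    d = c.copy()
    fma_count = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i, j] += a[i, p] * b[p, j]  # one scalar multiply-add
                fma_count += 1
    return d, fma_count

def tile_mma(a, b, c):
    """Same result expressed as a single tile-level multiply-accumulate."""
    return a @ b + c

# Hypothetical 4 x 4 tiles (the tile size is an assumption for illustration).
a = np.arange(16, dtype=np.float32).reshape(4, 4)
b = np.eye(4, dtype=np.float32) * 2
c = np.ones((4, 4), dtype=np.float32)

d_scalar, n_fma = scalar_fma_matmul(a, b, c)
d_tile = tile_mma(a, b, c)

print(n_fma)                             # → 64
print(np.array_equal(d_scalar, d_tile))  # → True
```

A 4 x 4 x 4 tile product needs 64 scalar multiply-adds, which is the work a single tile-level operation replaces in this sketch.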

Mixed precision by design

Tensor cores were built around mixed precision:

Inputs are usually a 16-bit format such as FP16 or BF16.
Multipliers operate on these narrow inputs, which keeps the hardware small and fast.
Accumulation uses higher precision, often FP32, to protect accuracy.

This combination gives large speedups while keeping numerical error small enough for training.
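The value of a wider accumulator can be seen in a small NumPy experiment. This is only a sketch: the helper name dot_accumulate is hypothetical, and a Python loop stands in for the hardware. It computes a dot product of FP16 inputs twice, once accumulating in FP16 and once in FP32.

```python
import numpy as np

def dot_accumulate(a, b, acc_dtype):
    """Dot product of a and b, keeping the running sum in acc_dtype."""
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return acc

# 4096 products, each exactly 1.0, with FP16 inputs.
a = np.ones(4096, dtype=np.float16)
b = np.ones(4096, dtype=np.float16)

fp16_result = dot_accumulate(a, b, np.float16)
fp32_result = dot_accumulate(a, b, np.float32)

print(float(fp16_result))  # → 2048.0
print(float(fp32_result))  # → 4096.0
```

The FP16 accumulator stalls at 2048 because the next representable FP16 value above it is 2050, so adding 1 rounds back down. The FP32 accumulator has enough mantissa bits to keep counting, which is why the inputs can stay narrow while the running sum does not.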

How a big matmul uses them

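A large matrix multiply is tiled into many small blocks. Each output tile is built up by a sequence of tensor core operations, one for each pair of input tiles along the shared dimension, all adding into the same accumulator.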
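The tiling can be sketched in NumPy as three nested loops over tiles. This is an illustration under stated assumptions rather than how any real library schedules the work: the tile edge of 16 and the names mma_tile and tiled_matmul are hypothetical, and mma_tile merely stands in for one tensor core operation (FP16 inputs, FP32 accumulator).

```python
import numpy as np

TILE = 16  # assumed tile edge, chosen for illustration

def mma_tile(a_tile, b_tile, acc):
    """Stand-in for one tensor core operation: FP16 tiles in, FP32 accumulate."""
    return a_tile.astype(np.float32) @ b_tile.astype(np.float32) + acc

def tiled_matmul(a, b, tile=TILE):
    """Multiply a (m x k) by b (k x n) using tile-sized multiply-accumulates."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    assert m % tile == 0 and n % tile == 0 and k % tile == 0, \
        "this sketch assumes dimensions are multiples of the tile size"
    c = np.zeros((m, n), dtype=np.float32)
    ops = 0
    for i in range(0, m, tile):          # rows of output tiles
        for j in range(0, n, tile):      # columns of output tiles
            acc = np.zeros((tile, tile), dtype=np.float32)
            for p in range(0, k, tile):  # walk the shared dimension
                acc = mma_tile(a[i:i + tile, p:p + tile],
                               b[p:p + tile, j:j + tile], acc)
                ops += 1
            c[i:i + tile, j:j + tile] = acc
    return c, ops

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 48)).astype(np.float16)
b = rng.standard_normal((48, 64)).astype(np.float16)

c, ops = tiled_matmul(a, b)
print(ops)  # → 24
```

Here a 32 x 48 by 48 x 64 product breaks into 2 x 4 output tiles, each needing 3 steps along the shared dimension, for 24 tile operations in total.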

Practical impact

Because deep learning is dominated by matrix multiplications, tensor cores can raise effective throughput several-fold compared with regular cores. Frameworks typically enable them automatically when you use a supported precision and matrix dimensions that are multiples of the tile size; misaligned shapes may run slower or fall back to regular cores.
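One common way to satisfy the shape condition is to pad dimensions up to the next multiple. The sketch below assumes a multiple of 8 purely for illustration (the right value depends on the hardware and data type), and the helper names round_up and pad_to_multiple are hypothetical. Zero padding leaves the original part of the product unchanged.

```python
import numpy as np

def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

def pad_to_multiple(x, multiple):
    """Zero-pad a 2-D array so both dimensions are multiples of `multiple`."""
    rows, cols = x.shape
    padded = np.zeros((round_up(rows, multiple), round_up(cols, multiple)),
                      dtype=x.dtype)
    padded[:rows, :cols] = x
    return padded

MULTIPLE = 8  # assumed alignment, for illustration only

a = np.arange(15, dtype=np.float32).reshape(3, 5)
b = np.arange(35, dtype=np.float32).reshape(5, 7)

a_pad = pad_to_multiple(a, MULTIPLE)
b_pad = pad_to_multiple(b, MULTIPLE)

# The top-left block of the padded product is the original product.
c = (a_pad @ b_pad)[:3, :7]

print(a_pad.shape, b_pad.shape)  # → (8, 8) (8, 8)
print(round_up(33708, MULTIPLE))  # → 33712
```

In practice it is usually simpler to choose layer widths, batch sizes, and vocabulary sizes that already meet the alignment than to pad at run time.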

Key idea

Tensor cores execute small matrix multiply-accumulate operations in mixed precision, accelerating the matmuls at the heart of deep learning.
