Beyond plain cores
A standard GPU core performs one scalar multiply-add per cycle. A tensor core instead performs a whole small matrix multiply-accumulate as a single operation: it takes two small input tiles A and B, multiplies them, and adds the product to an accumulator tile C, giving D = A × B + C.
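As a rough illustration of what one such operation computes, here is a minimal pure-Python sketch. The 4x4 tile size and the name `tile_mma` are assumptions chosen for readability; real tile shapes depend on the GPU generation and data type, and the hardware does this work in parallel rather than in nested loops.

```python
# Illustrative sketch of one tile-level multiply-accumulate: D = A @ B + C.
# TILE = 4 is an assumed size for illustration, not a hardware specification.
TILE = 4

def tile_mma(a, b, c):
    """Return a @ b + c for TILE x TILE matrices stored as lists of lists."""
    d = [row[:] for row in c]  # start from a copy of the accumulator tile
    for i in range(TILE):
        for j in range(TILE):
            for k in range(TILE):
                d[i][j] += a[i][k] * b[k][j]  # one scalar multiply-add
    return d

identity = [[1.0 if i == j else 0.0 for j in range(TILE)] for i in range(TILE)]
ones = [[1.0] * TILE for _ in range(TILE)]

# identity @ ones + ones: every entry is 1.0 + 1.0
result = tile_mma(identity, ones, ones)
```

The three nested loops run 4 × 4 × 4 = 64 scalar multiply-adds, which is the work a tensor core of this assumed tile size would bundle into one operation.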
Mixed precision by design
Tensor cores were built around mixed precision:
- Inputs are often half precision such as FP16 or BF16.
- The multiplications run on these low-precision inputs, which is what makes them fast.
- Products are accumulated in higher precision, often FP32, so rounding error does not build up across the many additions.
This combination gives large speedups while keeping numerical error small enough for training.
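To see why the accumulator precision matters, the sketch below simulates a long dot product of FP16 inputs twice: once rounding the running sum to FP16 after every step, and once rounding it to FP32. It uses only the standard library `struct` module to round values; the helper names (`to_fp16`, `to_fp32`, `dot`) and the input values are assumptions for illustration, and this is a numerical simulation rather than a model of any particular GPU.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(xs, ys, round_acc):
    """Dot product whose running sum is rounded by round_acc after each add."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = round_acc(acc + x * y)
    return acc

n = 4096
xs = [to_fp16(0.1)] * n  # inputs stored in FP16
ys = [to_fp16(0.1)] * n

reference = sum(x * y for x, y in zip(xs, ys))  # float64 reference sum
fp16_acc = dot(xs, ys, to_fp16)  # low-precision accumulator
fp32_acc = dot(xs, ys, to_fp32)  # higher-precision accumulator

fp16_error = abs(fp16_acc - reference)
fp32_error = abs(fp32_acc - reference)
```

With the FP16 accumulator, the running sum eventually becomes large enough that each new product is smaller than half the spacing between adjacent FP16 values, so further additions round away and the sum stops growing. The FP32 accumulator has far finer spacing at the same magnitude and stays close to the reference.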
How a big matmul uses them
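A large matrix multiply is split into many small tiles. Each output tile is computed by a sequence of tensor core operations that step along the shared inner dimension and add into the same accumulator.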
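The tiling can be sketched in plain Python as follows. The tile size of 2 and the function names are assumptions for illustration; the inner function stands in for a single tensor core operation, and the outer loops show how a larger product is assembled from such operations.

```python
TILE = 2  # assumed tile size, for illustration only

def mma_tile_inplace(a, b, c, i0, j0, k0):
    """Stand-in for one tensor core operation: C_tile += A_tile @ B_tile."""
    for i in range(i0, i0 + TILE):
        for j in range(j0, j0 + TILE):
            for k in range(k0, k0 + TILE):
                c[i][j] += a[i][k] * b[k][j]

def tiled_matmul(a, b):
    """Multiply a (m x k) by b (k x n) one TILE x TILE block at a time."""
    m, k_dim, n = len(a), len(b), len(b[0])
    # This sketch only handles dimensions that are multiples of TILE.
    assert m % TILE == 0 and k_dim % TILE == 0 and n % TILE == 0
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):              # rows of output tiles
        for j0 in range(0, n, TILE):          # columns of output tiles
            for k0 in range(0, k_dim, TILE):  # walk the inner dimension
                mma_tile_inplace(a, b, c, i0, j0, k0)
    return c

a = [[1, 2, 3, 4],
     [5, 6, 7, 8]]          # 2 x 4
b = [[1, 0],
     [0, 1],
     [1, 0],
     [0, 1]]                # 4 x 2
product = tiled_matmul(a, b)  # [[4.0, 6.0], [12.0, 14.0]]
```

Here the single 2x2 output tile needs two tile operations, one for each step along the inner dimension of length 4, and both add into the same accumulator.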
Practical impact
Because deep learning is dominated by matrix multiplications, tensor cores can raise effective throughput severalfold over regular cores. Frameworks typically use them automatically when the data type is supported and the matrix dimensions are multiples of the tile size; other shapes may need padding or may fall back to a slower path.
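One common way to meet the shape requirement is to round a dimension up to the next multiple before allocating a layer. The helper below is a small sketch of that arithmetic; the name `pad_to_multiple` and the default multiple of 8 are assumptions for illustration, and the right multiple depends on the hardware, data type, and library in use.

```python
def pad_to_multiple(dim, multiple=8):
    """Round dim up to the nearest multiple (assumed value; check your hardware)."""
    return ((dim + multiple - 1) // multiple) * multiple

aligned = pad_to_multiple(1000)      # already a multiple of 8, unchanged
padded = pad_to_multiple(50257)      # rounded up to 50264
extra_columns = padded - 50257       # 7 padding columns
```

The padded entries carry no information, so a model that pads this way has to ignore or mask them wherever they would affect the result.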
Key idea
Tensor cores execute small matrix multiply-accumulate operations in mixed precision, accelerating the matmuls at the heart of deep learning.