← Lessons

quiz vs the machine

Silver1090

Machine Learning

The Tensor Cores

Specialized units that crunch small matrix multiplies at huge throughput.

4 min read · intro · beat Silver to climb

Beyond plain cores

Standard GPU cores do one multiply add per cycle each. Tensor cores instead perform a whole small matrix multiply accumulate in a single operation. They take two small tiles, multiply them, and add the result into an accumulator.

Mixed precision by design

Tensor cores were built around mixed precision:

  • Inputs are often half precision such as FP16 or BF16.
  • Multiplication happens fast in low precision.
  • Accumulation uses higher precision, often FP32, to protect accuracy.

This combination gives large speedups while keeping numerical error small enough for training.

How a big matmul uses them

A large matrix multiply is tiled into many small blocks, each handled by a tensor core operation.

Practical impact

Because deep learning is dominated by matrix multiplications, tensor cores can multiply effective throughput several times over compared with regular cores. Frameworks enable them automatically when you use supported precisions and tensor shapes that are multiples of the tile size.

Key idea

Tensor cores execute small matrix multiply accumulate operations in mixed precision, accelerating the matmuls at the heart of deep learning.

Check yourself

Answer to earn rating on the learn ladder.

1. What operation does a tensor core perform in one step?

2. Why do tensor cores accumulate in higher precision?