Three philosophies
CPUs, GPUs, and TPUs differ mainly in how far they specialize for dense linear algebra.
- A CPU has a few powerful cores, deep caches, and strong support for branching control flow. It excels at varied sequential logic and data preparation.
- A GPU has thousands of simpler cores and high memory bandwidth for general-purpose parallel math, which makes it the flexible default for deep learning.
- A TPU is an accelerator built around a large systolic array, a grid of multiply-accumulate cells that data streams through. It is specialized for matrix multiplication.
The systolic idea
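A TPU keeps weights stationary in the grid and pumps activations through it, so each weight is loaded once and then reused across many cycles. Partial sums pass directly from cell to cell instead of going back to memory. This minimizes memory traffic for matrix multiplications, but it is less flexible than a GPU: work that does not reduce to large dense matrix multiplications makes poor use of the array.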
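The data flow described above can be sketched as a small cycle-by-cycle simulation. This is a toy model in plain Python, not a description of any real TPU's internals: the function name systolic_matmul, the skewed feeding schedule, and the one-cell-per-cycle timing are assumptions chosen for illustration. Cell (k, n) holds weight W[k][n]; activations move right, partial sums move down.

```python
def systolic_matmul(X, W):
    """Toy weight-stationary systolic array computing Y = X @ W.

    X is B x K (activations), W is K x N (weights).
    Illustrative assumption: one grid cell per weight, one hop per cycle.
    Returns (Y, cycles).
    """
    B, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]   # activation register in each cell
    psum = [[0] * N for _ in range(K)]  # partial-sum register in each cell
    Y = [[0] * N for _ in range(B)]
    cycles = B + K + N - 2              # time for the last result to drain

    for t in range(cycles):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Left edge: row k receives X[b][k], delayed (skewed) by k cycles.
                    b = t - k
                    a_in = X[b][k] if 0 <= b < B else 0
                else:
                    # Interior: activation arrives from the left neighbor.
                    a_in = act[k][n - 1]
                # Partial sum arrives from the cell above (zero at the top row).
                p_in = psum[k - 1][n] if k > 0 else 0
                new_act[k][n] = a_in
                # The weight W[k][n] never moves: one multiply-accumulate per cycle.
                new_psum[k][n] = p_in + a_in * W[k][n]
        act, psum = new_act, new_psum

        # The bottom row now holds finished outputs, one diagonal at a time.
        for n in range(N):
            b = t - (K - 1) - n
            if 0 <= b < B:
                Y[b][n] = psum[K - 1][n]
    return Y, cycles


X = [[1, 2], [3, 4], [5, 6]]   # B=3 rows of activations, K=2
W = [[1, 0, 2], [0, 1, 3]]     # K=2, N=3 stationary weights
Y, cycles = systolic_matmul(X, W)
print(Y)       # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
print(cycles)  # 6
```

In this example the six weights are each loaded into the grid once and then used for all three activation rows, so the reuse per loaded weight equals the batch size B. The inner loops run sequentially here only because this is a simulation; in hardware every cell would update in the same cycle.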
Matching workload to chip
In practice
Real pipelines use all three: CPUs feed data and orchestrate, GPUs train and serve most models, and TPUs shine on very large dense workloads on platforms that support them.
Key idea
CPUs favor flexible serial control, GPUs offer broad parallelism, and TPUs use a systolic array specialized for dense matrix multiplication.