A compiler for inference
TensorRT is an inference optimizer that takes a trained model and compiles it into a highly tuned engine for a specific GPU. Rather than running operators generically, it rewrites and tunes the graph for the target hardware.
What it optimizes
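- Layer and tensor fusion merges adjacent operators (for example convolution, bias, and activation) into a single kernel, cutting memory traffic and kernel launches.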
- Precision calibration lowers weights and activations to FP16 or INT8. INT8 needs a calibration dataset so the builder can choose the value range each tensor maps onto.
- Kernel auto-tuning benchmarks candidate kernels and picks the fastest for the actual GPU and input shapes.
- Memory planning reuses buffers to shrink the footprint.
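The INT8 step in the list above is easiest to see with numbers. The sketch below is a minimal pure-Python illustration, not TensorRT's calibrator: it assumes a simple symmetric max-abs rule for picking the scale, whereas TensorRT's own calibrators use more elaborate range selection. The function names and sample values are hypothetical.

```python
def calibrate_scale(calibration_values):
    """Pick a symmetric INT8 scale so the largest observed magnitude maps to 127.

    Assumption: plain max-abs calibration, chosen here only for clarity.
    """
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round each value to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in q_values]


# Hypothetical calibration samples: the observed range is [-2.0, 1.27].
calibration = [-2.0, -0.5, 0.0, 0.75, 1.27]
scale = calibrate_scale(calibration)

# The last value lies outside the calibrated range and will saturate.
activations = [0.1, -1.0, 2.0, 5.0]
codes = quantize(activations, scale)
restored = dequantize(codes, scale)
```

Values inside the calibrated range come back within half a quantization step, while the out-of-range value saturates at the largest representable magnitude. That is why the calibration data should resemble real inputs.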
Build then run
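Work splits into two phases: a slow offline build that produces a serialized engine, and a fast load at serving time. The sketch below shows that split with the TensorRT Python API, assuming an ONNX model as input. The function names and file paths are hypothetical, and API details vary between TensorRT releases, so treat it as an outline rather than a reference implementation. Buffer allocation and the actual inference call are omitted.

```python
def build_engine(onnx_path, engine_path, use_fp16=True):
    """Offline step (slow): parse an ONNX model and write a serialized engine."""
    # Imported lazily: TensorRT is only needed when this function is called.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Request an explicit-batch network; newer releases treat this as the default.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    if use_fp16:
        # Allow the builder to pick FP16 kernels where they are faster.
        config.set_flag(trt.BuilderFlag.FP16)

    # Fusion, kernel auto-tuning, and memory planning all happen in this call.
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    with open(engine_path, "wb") as f:
        f.write(serialized)


def load_engine(engine_path):
    """Serving step (fast): deserialize a prebuilt engine and create a context."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    if engine is None:
        raise RuntimeError("engine could not be deserialized on this machine")
    return engine, engine.create_execution_context()
```

A typical flow would call `build_engine("model.onnx", "model.engine")` once per target GPU, then call `load_engine("model.engine")` at every process start.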
The cost and the catch
The build step is expensive and produces an engine specialized to one GPU, precision, and shape range. That engine is fast but not portable, so you rebuild it for each new GPU target. Dynamic shapes are supported through optimization profiles, each of which gives minimum, optimal, and maximum dimensions for an input and so bounds the sizes the engine will accept.
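The bounding behaviour of an optimization profile can be modelled in a few lines. The class below is a hypothetical stand-in written in plain Python, not a TensorRT type. In the TensorRT Python API the equivalent information is supplied with `profile.set_shape(input_name, min, opt, max)` on a profile created by the builder and registered through `config.add_optimization_profile(profile)`. The shapes used here are illustrative.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical model of one input's optimization profile."""

    min_shape: Shape
    opt_shape: Shape  # the shape the tuner optimizes for
    max_shape: Shape

    def __post_init__(self):
        ranks = {len(self.min_shape), len(self.opt_shape), len(self.max_shape)}
        if len(ranks) != 1:
            raise ValueError("min, opt, and max must have the same rank")
        for lo, opt, hi in zip(self.min_shape, self.opt_shape, self.max_shape):
            if not lo <= opt <= hi:
                raise ValueError("each dimension needs min <= opt <= max")

    def accepts(self, shape: Shape) -> bool:
        """True if every dimension of `shape` lies within [min, max]."""
        return len(shape) == len(self.min_shape) and all(
            lo <= dim <= hi
            for lo, dim, hi in zip(self.min_shape, shape, self.max_shape)
        )


# Batch size may vary from 1 to 32; the image dimensions are fixed.
profile = ShapeProfile(
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),
    max_shape=(32, 3, 224, 224),
)
in_range = profile.accepts((16, 3, 224, 224))
too_large = profile.accepts((64, 3, 224, 224))
```

A batch of 16 falls inside the profile, while a batch of 64 does not. Serving the larger batch would mean building a new engine with a wider profile.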
Key idea
TensorRT compiles a model into a GPU-specific engine using fusion, low precision, and kernel auto-tuning, trading a costly build and reduced portability for high inference speed.