What is FP8?: rawcompute.in Glossary
FP8 is an 8-bit floating-point number format supported by NVIDIA Hopper and later GPUs, enabling roughly 2x the tensor core throughput of FP16 for AI workloads.
FP8 refers to 8-bit floating-point representations used in deep learning. The NVIDIA Hopper architecture supports two FP8 variants: E4M3 (4-bit exponent, 3-bit mantissa), typically used for forward-pass activations and weights, and E5M2 (5-bit exponent, 2-bit mantissa), whose extra exponent bit provides the wider dynamic range that gradients need during backpropagation. By halving the bit-width relative to FP16, FP8 doubles the number of multiply-accumulate operations that tensor cores can perform per cycle. The H100 SXM5 achieves 3,958 TFLOPS at FP8 versus 1,979 TFLOPS at FP16 (both figures assume structured sparsity).
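The two bit layouts can be decoded with a few lines of integer arithmetic. The sketch below follows the standard E4M3/E5M2 encodings (exponent biases 7 and 15, subnormals when the exponent field is zero); `decode_fp8` is a hypothetical helper name, and the special NaN/Inf bit patterns are ignored for brevity.

```python
def decode_fp8(byte: int, exp_bits: int = 4, man_bits: int = 3, bias: int = 7) -> float:
    """Decode one byte laid out as sign | exponent | mantissa.

    Defaults correspond to E4M3; pass exp_bits=5, man_bits=2, bias=15 for E5M2.
    NaN/Inf encodings are not handled in this sketch.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# E4M3: 0x38 encodes 1.0; 0x7E is the largest finite value, 448.0.
# E5M2: 0x3C encodes 1.0 under the E5M2 parameters.
```

The small mantissas are why scaling matters: E4M3 tops out at 448, so values must be rescaled into that range before casting.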
The key challenge with FP8 is maintaining training convergence at reduced numerical precision. NVIDIA’s Transformer Engine addresses this by dynamically choosing between FP8 and FP16 per layer and per training iteration, applying per-tensor scaling factors derived from recent observed magnitudes (amax) to prevent overflow and underflow. In practice, most large transformer models train to equivalent accuracy with FP8 mixed precision, making it the default choice for Hopper-based training clusters.
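The scaling idea can be illustrated in a few lines. This is a simplified sketch, not Transformer Engine's actual API: `fp8_scale_and_clamp` is a hypothetical helper that maps a tensor's observed absolute maximum (amax) onto the representable E4M3 range before the cast, and the caller divides by the returned scale to recover original magnitudes.

```python
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def fp8_scale_and_clamp(values, amax):
    """Scale values so that |amax| maps to the edge of the E4M3 range,
    then clamp anything that still falls outside it.

    Simplified illustration of amax-based scaling; real implementations
    track amax over a history window and round to FP8 after scaling.
    """
    scale = E4M3_MAX / amax
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]
    return scaled, scale

# Dequantize later with: original ≈ scaled_value / scale
```

The scale is recomputed as magnitudes drift during training, which is what "per training iteration" refers to above.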
Why it matters when buying hardware
FP8 support is only available on Hopper-generation (H100, H200) and newer GPUs. Older Ampere GPUs (A100) top out at FP16/BF16 precision. When comparing GPU pricing, factor in effective throughput: per GPU, an H100 running FP8 delivers roughly 2x the training throughput of an A100 at FP16, which can halve your cluster size and operating costs. Always ask your vendor whether their quoted TFLOPS figures are at FP8 or FP16, and whether they assume sparsity, so you can make apples-to-apples comparisons.
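The buying math reduces to simple arithmetic. A back-of-envelope sketch, with `gpus_needed` as a hypothetical helper; the 40% default utilization is an illustrative assumption, since real model FLOPS utilization varies with model size, parallelism strategy, and interconnect.

```python
import math

def gpus_needed(target_tflops: float, per_gpu_peak_tflops: float,
                utilization: float = 0.4) -> int:
    """GPU count required to sustain a target throughput.

    per_gpu_peak_tflops should be the peak at the precision you will
    actually train in (FP8 vs FP16), quoted on the same sparsity basis.
    """
    return math.ceil(target_tflops / (per_gpu_peak_tflops * utilization))

# Doubling per-GPU effective throughput (e.g. FP8 vs FP16 at the same
# utilization) halves the GPU count for the same sustained target.
```

Running this with a vendor's FP8 and FP16 peak numbers side by side makes the cluster-size difference concrete before you sign anything.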