What are Tensor Cores?: rawcompute.in Glossary

Tensor Cores are specialised hardware units in NVIDIA GPUs that perform matrix multiply-accumulate operations in a single clock cycle, dramatically accelerating deep learning computations.

Tensor Cores are fixed-function processing units within NVIDIA GPUs, first introduced with the Volta architecture (V100). Unlike general-purpose CUDA cores, which execute scalar or vector operations, a tensor core performs a full 4x4 matrix multiply-accumulate (MMA, computing D = A x B + C) in a single clock cycle. This makes them vastly more efficient for the dense linear algebra at the heart of neural network training and inference. Each successive GPU generation has expanded tensor core capabilities: Ampere added TF32 and structured sparsity; Hopper added FP8 and the Transformer Engine.
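To make the operation concrete, here is a minimal pure-Python sketch of the arithmetic a single MMA performs on 4x4 tiles. This is illustration only: real tensor cores execute this in hardware, typically with FP16 or FP8 inputs and FP32 accumulation.

```python
def mma_4x4(a, b, c):
    """Matrix multiply-accumulate on 4x4 tiles: returns d = a @ b + c.

    Models the arithmetic of one tensor-core MMA; real hardware does this
    in a single clock at reduced input precision with FP32 accumulate.
    """
    n = 4
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]

# Multiplying by the identity and accumulating zero returns B unchanged.
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
b = [[i * 4 + j for j in range(4)] for i in range(4)]
zero = [[0] * 4 for _ in range(4)]
assert mma_4x4(identity, b, zero) == b
```

On real hardware this tile-level operation is invoked through the CUDA WMMA API or, more commonly, indirectly via cuBLAS and cuDNN.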

The H100 SXM5 contains 528 fourth-generation tensor cores, which collectively deliver up to 1,979 TFLOPS at FP16 and 3,958 TFLOPS at FP8 (peak figures with structured sparsity; dense throughput is half that). In practical terms, tensor core utilisation determines how much of the GPU’s theoretical peak performance your workload actually achieves. Well-optimised frameworks like cuDNN and cuBLAS route matrix operations to tensor cores automatically, but workload characteristics (batch size, sequence length, model architecture) affect utilisation.
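Utilisation is simply achieved throughput divided by the peak figure for the precision in use. A minimal sketch, where the sustained-throughput number is hypothetical:

```python
def tensor_core_utilisation(achieved_tflops, peak_tflops):
    """Fraction of the GPU's peak tensor-core throughput a workload achieves."""
    return achieved_tflops / peak_tflops

# Hypothetical workload: a training step sustaining 700 TFLOPS of FP16 math
# on an H100 SXM5, measured against the 1,979 TFLOPS peak quoted above.
util = tensor_core_utilisation(700, 1979)
print(f"{util:.0%}")  # prints 35%
```

The same calculation (often called model FLOPs utilisation when the numerator is derived from model FLOPs and step time) is a standard way to judge how well a training run exploits the hardware.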

Why it matters when buying hardware

When comparing GPUs, tensor core TFLOPS (not CUDA core counts) is the metric that matters for AI workloads. An A100 with 6,912 CUDA cores outperforms a consumer RTX 4090 with 16,384 CUDA cores on most training tasks because the A100’s 432 tensor cores are optimised for sustained data-centre throughput. Always prioritise tensor core performance (and the precisions supported) over raw CUDA core count when selecting GPUs for machine learning.
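The comparison can be made concrete with a toy ranking. This sketch assumes approximate datasheet figures for dense FP16 tensor TFLOPS with FP32 accumulation, the mode training typically uses (consumer GeForce parts run that mode at half rate, which is why the RTX 4090 figure is well below its marketing number); exact values vary by source.

```python
# Illustrative spec table; figures are approximate and for demonstration only.
gpus = {
    "RTX 4090": {"cuda_cores": 16384, "fp16_tensor_tflops": 165},
    "A100 SXM": {"cuda_cores": 6912, "fp16_tensor_tflops": 312},
}

# Ranking by CUDA cores and by tensor throughput gives opposite answers.
by_cuda_cores = max(gpus, key=lambda g: gpus[g]["cuda_cores"])
by_tensor = max(gpus, key=lambda g: gpus[g]["fp16_tensor_tflops"])
print(by_cuda_cores)  # prints RTX 4090
print(by_tensor)      # prints A100 SXM
```

Sorting candidate GPUs by the tensor-core throughput of the precision you will actually train in, rather than by CUDA core count, avoids exactly this trap.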
