What is VRAM?: rawcompute.in Glossary
VRAM is the dedicated memory on a GPU used to store model parameters, activations, gradients, and optimizer states during AI training and inference.
VRAM (Video RAM) refers to the dedicated memory attached to a GPU. In data-centre GPUs, this is typically HBM (High Bandwidth Memory). The A100 has 80 GB of HBM2e, the H100 SXM5 has 80 GB of HBM3, and the H200 has 141 GB of HBM3e. VRAM capacity determines the maximum model size that a single GPU can hold in memory, and VRAM bandwidth determines how fast data can be fed to the compute units.
For LLM training, VRAM must hold not just the model weights but also optimizer states, gradients, and activations for backpropagation. Adam, for example, keeps two extra values per parameter (momentum and variance), typically stored in FP32, so in bytes the optimizer state alone can be several times the FP16 weight footprint. A 7B-parameter model in FP16 requires ~14 GB for weights alone, but full training may need 60-100+ GB once optimizer states, gradients, and activations are accounted for. Techniques like gradient checkpointing, ZeRO (which partitions optimizer states, gradients, and weights across GPUs), and activation offloading can reduce per-GPU VRAM requirements at the cost of some compute or communication overhead.
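The arithmetic above can be sketched with a rough per-parameter byte count. The sketch below assumes a common mixed-precision Adam recipe of ~16 bytes per parameter (2 B FP16 weights, 2 B FP16 gradients, 4 B FP32 master weights, 4 B + 4 B FP32 momentum and variance); the function name and defaults are illustrative, and real footprints vary with framework, fragmentation, and activation memory:

```python
def training_vram_gb(params_billion, weight_bytes=2, grad_bytes=2,
                     master_bytes=4, optim_bytes=8, activation_gb=0.0):
    """Rough per-GPU VRAM estimate (in GB, 1e9 bytes) for mixed-precision
    Adam training, before any ZeRO-style partitioning.

    Per-parameter byte counts (illustrative defaults):
      weight_bytes - FP16/BF16 weights (2 B)
      grad_bytes   - FP16/BF16 gradients (2 B)
      master_bytes - FP32 master copy of weights (4 B)
      optim_bytes  - Adam momentum + variance in FP32 (4 B + 4 B)
    activation_gb: activation memory, which depends heavily on batch size,
    sequence length, and whether gradient checkpointing is enabled.
    """
    per_param = weight_bytes + grad_bytes + master_bytes + optim_bytes
    # billions of params x bytes per param = GB directly
    return params_billion * per_param + activation_gb

# 7B model, FP16 weights only (inference-style load): ~14 GB
print(training_vram_gb(7, grad_bytes=0, master_bytes=0, optim_bytes=0))  # 14.0
# 7B model with full unpartitioned training state: ~112 GB
print(training_vram_gb(7))  # 112.0
```

The unpartitioned 16 B/param estimate lands above a single 80 GB GPU for a 7B model, which is exactly why ZeRO-style sharding or checkpointing is used in practice.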
Why it matters when buying hardware
VRAM capacity is often the primary constraint when selecting GPUs for AI workloads. Before purchasing, estimate your model's total VRAM footprint (weights + optimizer states + activations + KV-cache for inference). If a single GPU's VRAM is insufficient, you will need model parallelism across multiple GPUs, which depends on fast interconnects such as NVLink (within a node) or InfiniBand (across nodes). For inference, VRAM determines the maximum batch size and context length you can serve. Rawcompute.in can help you size VRAM requirements based on your specific model architecture and training configuration.
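For the inference side, the KV-cache term can be sized from the model's attention configuration. This is a minimal sketch; the Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) and FP16 storage are assumptions for illustration, and models using grouped-query attention have fewer KV heads and a proportionally smaller cache:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per=2):
    """KV-cache size in GB (1e9 bytes): 2 tensors (K and V) per layer,
    each of shape [batch, kv_heads, seq_len, head_dim], at bytes_per
    bytes per element (2 for FP16/BF16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per / 1e9

# Llama-2-7B-like config, batch 8, 4096-token context, FP16: ~17.2 GB
print(round(kv_cache_gb(32, 32, 128, 4096, 8), 1))  # 17.2
```

Note how the cache scales linearly with both batch size and context length, which is why these two knobs trade off against each other on a fixed-VRAM GPU.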