Glossary

What is Inference?: rawcompute.in Glossary

Inference is the process of running a trained AI model to generate predictions or outputs from new input data. It is the production-serving phase of the AI lifecycle, distinct from training.

Inference refers to using a trained machine learning model to make predictions on new data. Unlike training, which involves updating billions of model parameters through iterative gradient computations, inference runs the model in a forward-pass-only mode. For LLMs, inference involves processing an input prompt through the model’s layers to generate output tokens one at a time (autoregressive generation). The key performance metrics for inference are latency (time to first token, time per output token) and throughput (tokens generated per second across all concurrent requests).
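The autoregressive loop and the two latency metrics above can be sketched in a few lines. This is a toy illustration, not a real serving stack: `fake_model_step` is a stand-in for one forward pass of an actual model, and the sleep merely simulates compute time.

```python
import time

def fake_model_step(tokens):
    # Stand-in for one forward pass; a real model would return
    # next-token logits here. The sleep simulates per-step latency.
    time.sleep(0.001)
    return len(tokens) % 50000  # hypothetical next token id

def generate(prompt_tokens, max_new_tokens):
    """Autoregressive generation: one forward pass per output token."""
    tokens = list(prompt_tokens)
    start = time.perf_counter()
    ttft = None
    for _ in range(max_new_tokens):
        next_tok = fake_model_step(tokens)
        if ttft is None:
            # Time to first token: prompt processing + first decode step.
            ttft = time.perf_counter() - start
        tokens.append(next_tok)
    total = time.perf_counter() - start
    # Time per output token: average cost of each subsequent decode step.
    tpot = (total - ttft) / max(1, max_new_tokens - 1)
    return tokens, ttft, tpot
```

Throughput is then the aggregate tokens per second across all concurrent sequences, which batching frameworks maximise by running many such loops in lockstep on one GPU.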

Inference workloads have different hardware requirements than training. Inference benefits from high memory bandwidth (to load model weights quickly for each token generation step), large VRAM (to hold the model and KV-cache for many concurrent users), and support for reduced precision (INT8, FP8, INT4) to improve throughput. Specialised inference frameworks like NVIDIA TensorRT-LLM, vLLM, and TGI optimise GPU utilisation through techniques like continuous batching, paged attention, and speculative decoding.
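The link between memory bandwidth, precision, and throughput can be made concrete with a simple roofline estimate: during decode, each generated token requires streaming the full weight set from VRAM once, so time per token is roughly weight bytes divided by memory bandwidth. The bandwidth figure below is an assumption for illustration, not a quoted spec.

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gbs):
    """Bandwidth-bound estimate for single-sequence decode:
    time/token ~= weight_bytes / memory_bandwidth, so
    tokens/sec ~= bandwidth / weight_size."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gbs / weight_gb

# Illustrative: a 70B-parameter model on a GPU with an assumed
# ~3350 GB/s of memory bandwidth (H100-class figure, approximate):
fp16 = decode_tokens_per_sec(70, 2.0, 3350)   # FP16: 2 bytes/param
int4 = decode_tokens_per_sec(70, 0.5, 3350)   # INT4: 0.5 bytes/param
```

This is why reduced precision pays off: halving bytes per parameter roughly doubles bandwidth-bound decode throughput, before accounting for batching or kernel efficiency.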

Why it matters when buying hardware

Inference GPU selection depends on your serving requirements. For high-throughput inference with many concurrent users, the NVIDIA H100 or H200 offer the best tokens-per-second performance. For cost-sensitive deployments serving lighter models, the L40S or A100 PCIe may offer a better cost per token. The key sizing questions are: what model size are you targeting, what latency SLA must you meet, and how many concurrent users do you expect? rawcompute.in helps customers select the optimal inference hardware configuration based on these parameters.
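The sizing questions above reduce to a VRAM budget: weights plus KV-cache for the expected concurrency and context length. The sketch below uses illustrative architecture parameters (layer count, KV heads, head dimension) that are assumptions, not the specs of any particular model.

```python
def vram_estimate_gb(params_b, bytes_per_param, n_layers, n_kv_heads,
                     head_dim, kv_bytes, context_len, concurrent_users):
    """Rough VRAM budget: model weights plus KV-cache.
    Each cached token stores one key and one value vector per layer:
    2 * n_layers * n_kv_heads * head_dim * kv_bytes."""
    weights = params_b * 1e9 * bytes_per_param
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_total = kv_per_token * context_len * concurrent_users
    return (weights + kv_total) / 1e9

# Illustrative 70B-class model (assumed: 80 layers, 8 KV heads via GQA,
# head_dim 128, FP16 weights and cache), 8k context, 32 concurrent users:
total_gb = vram_estimate_gb(70, 2, 80, 8, 128, 2, 8192, 32)
```

Under these assumptions the weights alone need 140 GB and the KV-cache adds roughly another 86 GB, which is why large-model serving at high concurrency typically spans multiple GPUs.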

Need hardware advice?

Tell us your requirements and we'll recommend the right setup.

WhatsApp Us

Get a Quote

We respond within 4 business hours

Same-day response · No spam, ever · GST invoice