Building an AI Training Cluster in India: Practical Guide
Building an AI training cluster is a multi-disciplinary project that spans GPU selection, server configuration, network fabric design, storage architecture, data-centre planning, and software stack deployment. This guide covers the practical considerations for building a training cluster in India, from 8 GPUs to 256 GPUs.
Step 1: Define Your Compute Requirements
Before selecting hardware, quantify your training workload:
- Model size: Parameter count, together with optimiser state and activations, determines the VRAM required per GPU
- Target training time: Combined with model FLOPs, this determines cluster size
- Parallelism strategy: Tensor parallelism (within a node via NVLink), pipeline parallelism (across nodes), and data parallelism (across all GPUs) each have different networking requirements
- Iteration speed: How often you need to retrain or fine-tune
A rough sizing formula: Total GPU-hours = model FLOPs / (per-GPU FLOPS x MFU x 3,600). For a 70B-parameter model requiring ~10^23 FLOPs on H100 GPUs at 50% MFU, you need approximately 14,000 GPU-hours: about 73 days on 8 GPUs, or 9 days on 64 GPUs.
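The formula above can be sketched as a short calculation. The per-GPU throughput figure is an assumption on our part (roughly H100 FP8 peak with sparsity, which reproduces the ~14,000 GPU-hour estimate); substitute your own precision and MFU.

```python
# Rough cluster sizing from the formula above.
# Assumption: ~3,958 TFLOPS per GPU (approx. H100 SXM FP8 with sparsity);
# dense BF16 (~989 TFLOPS) would roughly quadruple the estimate.

def gpu_hours(model_flops: float, gpu_tflops: float, mfu: float) -> float:
    """Total GPU-hours = model FLOPs / (per-GPU FLOPS x MFU x 3,600 s/h)."""
    return model_flops / (gpu_tflops * 1e12 * mfu * 3600)

def days_on(hours: float, n_gpus: int) -> float:
    """Wall-clock days if the work is spread evenly over n_gpus."""
    return hours / n_gpus / 24

total = gpu_hours(model_flops=1e23, gpu_tflops=3958, mfu=0.50)
print(f"{total:,.0f} GPU-hours")                     # ~14,000
print(f"{days_on(total, 8):.0f} days on 8 GPUs")     # ~73
print(f"{days_on(total, 64):.0f} days on 64 GPUs")   # ~9
```

Note how cluster size trades off linearly against wall-clock time only if MFU holds constant; poor networking at larger scale lowers MFU and breaks that linearity.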
Step 2: Choose Your GPU Platform
For clusters up to 8 GPUs (single node):
- 8x H100 SXM5 HGX is the standard building block
- NVLink + NVSwitch provides all-to-all GPU connectivity within the node
- A single chassis-scale server (e.g. Supermicro SYS-821GE or Dell PowerEdge XE9680, typically 6-8U) houses the full 8-GPU system
For clusters of 16-256 GPUs (multi-node):
- Each node is an 8x H100 HGX server
- Nodes are connected via InfiniBand NDR (400 Gb/s) or high-speed Ethernet
- The inter-node network fabric becomes as critical as the GPUs themselves
Step 3: Design the Network Fabric
The inter-node network is where most first-time cluster builders make mistakes. For multi-node training:
- InfiniBand NDR 400 is the recommended fabric for clusters of 16+ GPUs
- Each server needs one NVIDIA ConnectX-7 HCA per GPU (8 per server) for optimal “rail-optimised” topology, or at minimum 2-4 HCAs for smaller clusters
- A fat-tree topology using NVIDIA Quantum-2 switches provides full bisection bandwidth
- For 4-node (32-GPU) clusters, a single 64-port Quantum-2 switch (QM9700) can connect all HCAs directly
- For 8+ nodes, you need a spine-leaf switching architecture
Budget warning: InfiniBand switches and cabling can cost 15-25% of your total cluster budget. Do not underestimate this.
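The switch counts above can be estimated with a back-of-envelope two-tier fat-tree calculation. This is a sketch under stated assumptions (8 HCAs per node, 64-port NDR switches matching the Quantum-2 QM9700, non-blocking leaves splitting ports half down, half up), not a validated network design:

```python
import math

# Back-of-envelope switch count for a two-tier (leaf-spine) fat tree.
# Assumptions: 8 HCAs per node (rail-optimised), 64-port NDR switches,
# and a non-blocking design where each leaf uses half its ports for
# servers and half for spine uplinks.

def fat_tree_switches(nodes: int, hcas_per_node: int = 8,
                      switch_ports: int = 64) -> tuple[int, int]:
    """Return (leaf_switches, spine_switches) for a non-blocking fabric."""
    endpoints = nodes * hcas_per_node
    down_per_leaf = switch_ports // 2        # ports facing servers
    if endpoints < switch_ports:
        return 1, 0                          # a single flat switch suffices
    leaves = math.ceil(endpoints / down_per_leaf)
    uplinks = leaves * down_per_leaf         # full bisection: up == down
    spines = math.ceil(uplinks / switch_ports)
    return leaves, spines

for n in (4, 8, 16, 32):
    leaves, spines = fat_tree_switches(n)
    print(f"{n:2d} nodes ({n*8:3d} GPUs): {leaves} leaf, {spines} spine")
```

This matches the guidance above: 4 nodes fit one switch, while 8 or more nodes push you into a leaf-spine design, and the switch count (and cabling) grows quickly from there.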
Step 4: Plan Storage
Training data loading must keep pace with GPU compute. A poorly designed storage layer will starve your GPUs:
- Local NVMe (2-4 drives per server) for dataset caching and checkpoint writes. Plan for 4-8 TB per node
- Shared storage for the canonical dataset and model checkpoints. Options include a dedicated NFS/NAS appliance, a parallel filesystem (BeeGFS, Lustre), or S3-compatible object storage (MinIO)
- Throughput target: At minimum, your storage layer should deliver 2-5 GB/s per GPU node to avoid data loading bottlenecks
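To put numbers on the targets above, here is a hedged sizing sketch. The 14-bytes-per-parameter checkpoint figure is an assumption (bf16 weights + fp32 master weights + two fp32 Adam moments); your framework's checkpoint format may differ:

```python
# Rough storage sizing. Assumptions: the 2-5 GB/s-per-node read target
# from above, and a training checkpoint of ~14 bytes/param
# (2 bf16 weights + 4 fp32 master + 4 + 4 fp32 Adam moments).

def aggregate_read_gbps(nodes: int, per_node_gbps: float = 2.0) -> float:
    """Minimum sustained read throughput the shared tier must deliver."""
    return nodes * per_node_gbps

def checkpoint_size_gb(params_billion: float,
                       bytes_per_param: int = 14) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

def checkpoint_write_minutes(size_gb: float, write_gbps: float) -> float:
    return size_gb / write_gbps / 60

ckpt = checkpoint_size_gb(70)
print(f"70B checkpoint: {ckpt:,.0f} GB")                       # ~980 GB
print(f"write at 10 GB/s: {checkpoint_write_minutes(ckpt, 10):.1f} min")
print(f"8-node read floor: {aggregate_read_gbps(8):.0f} GB/s")
```

A near-terabyte checkpoint is why local NVMe for checkpoint staging matters: writing it synchronously through a slow shared tier can stall all GPUs for minutes per save.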
Step 5: Select a Data Centre
Not all Indian data centres can host GPU clusters. Key requirements:
- Power density: An 8x H100 server draws 10+ kW. A 4-server rack needs 40-50 kW. Verify the facility supports this per-rack power density
- Cooling capacity: Ensure the data centre can handle the heat output of GPU servers. Some facilities offer rear-door heat exchangers or direct liquid cooling support
- Network connectivity: The facility should have carrier-neutral connectivity, peering with major ISPs, and ideally access to internet exchange points
- Location: Mumbai and Chennai offer the best connectivity to international networks. Pune and Hyderabad are emerging alternatives with competitive pricing
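The power and cooling figures above can be sanity-checked with a quick budget. The per-server draw and overhead allowance here are assumptions consistent with the "10+ kW" figure; confirm actual nameplate ratings with your vendor:

```python
# Rack power and heat budget. Assumptions: ~10.2 kW per 8x H100 node,
# plus an allowance for switches, PDUs, and fans. 1 kW of IT load
# dissipates ~3,412 BTU/hr of heat that the facility must remove.

def rack_power_kw(servers: int, server_kw: float = 10.2,
                  overhead_kw: float = 3.0) -> float:
    """Peak draw for one rack: GPU servers plus networking/PDU overhead."""
    return servers * server_kw + overhead_kw

def heat_btu_per_hr(kw: float) -> float:
    return kw * 3412

kw = rack_power_kw(4)
print(f"4-server rack: {kw:.1f} kW, ~{heat_btu_per_hr(kw):,.0f} BTU/hr")
```

At ~44 kW per rack, you are well beyond the 5-10 kW racks typical of general-purpose colocation, which is why facility selection must precede hardware ordering.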
Step 6: Software Stack
A production training cluster needs:
- OS: Ubuntu 22.04/24.04 LTS with NVIDIA driver 535+ and CUDA 12.x
- Container runtime: Docker with the NVIDIA Container Toolkit (the successor to nvidia-docker)
- Job scheduler: Slurm is the standard for HPC-style training clusters; Kubernetes with GPU operator is an alternative for mixed workloads
- Communication library: NCCL 2.x for GPU-to-GPU collective operations
- Monitoring: Prometheus + Grafana with DCGM (Data Center GPU Manager) for GPU health and utilisation metrics
- Training framework: PyTorch with DeepSpeed or Megatron-LM for distributed training
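As one concrete wiring point between the stack components above, NCCL is tuned largely through environment variables. The variable names below are real NCCL knobs, but the values are illustrative assumptions for a rail-optimised InfiniBand node; verify your HCA and interface names (e.g. with ibstat and ip link) before adopting them:

```python
import os

# Illustrative NCCL settings for multi-node InfiniBand training.
# Values are assumptions for a rail-optimised 8-HCA node, not a
# recommendation; benchmark on your own fabric.

def nccl_env(hca_prefix: str = "mlx5", socket_if: str = "eth0") -> dict[str, str]:
    return {
        "NCCL_DEBUG": "WARN",             # raise to INFO when diagnosing
        "NCCL_IB_HCA": hca_prefix,        # restrict NCCL to these HCAs
        "NCCL_SOCKET_IFNAME": socket_if,  # interface for bootstrap traffic
    }

os.environ.update(nccl_env())
print(os.environ["NCCL_IB_HCA"])
```

Bake settings like these into your Slurm prolog or container image so every job launches with a consistent, tested communication configuration.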
Common Mistakes to Avoid
- Skimping on the network fabric: a 10GbE inter-node network with 8x H100 GPUs will bottleneck training severely
- Ignoring power planning: ordering a 4-server GPU cluster before confirming your data centre can deliver 50+ kW per rack
- No monitoring from day one: GPU failures, thermal throttling, and network errors are invisible without proper monitoring
- Underestimating lead times: GPU servers, InfiniBand switches, and colocation space in India can have 4-12 week lead times. Plan ahead.
rawcompute.in designs and deploys complete AI training clusters in India, including GPU servers, InfiniBand fabric, storage, and colocation placement. Contact us with your training requirements for a customised cluster design and quote.
Need this for your infrastructure? Let's talk.
We help teams across India spec and deploy hardware.