Building an AI Training Cluster in India: Practical Guide
Building an AI training cluster is a multi-disciplinary project that spans GPU selection, server configuration, network fabric design, storage architecture, data-centre planning, and software stack deployment. This guide covers the practical considerations for building a training cluster in India, from 8 GPUs to 256 GPUs.
Step 1: Define Your Compute Requirements
Before selecting hardware, quantify your training workload:
- Model size: Parameter count, together with optimiser state and activations, determines the VRAM required per GPU
- Target training time: Combined with model FLOPs, this determines cluster size
- Parallelism strategy: Tensor parallelism (within a node via NVLink), pipeline parallelism (across nodes), and data parallelism (across all GPUs) each have different networking requirements
- Iteration speed: How often you need to retrain or fine-tune
A rough sizing formula: Total GPU-hours = model FLOPs / (per-GPU FLOPS x MFU x 3,600). For a 70B-parameter model requiring ~10^23 FLOPs on H100 GPUs at 50% MFU, you need approximately 14,000 GPU-hours: about 73 days on 8 GPUs, or 9 days on 64 GPUs.
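The formula above can be sketched as a short calculation. The per-GPU throughput figure is an assumption on our part (roughly H100 FP8 peak with sparsity, which reproduces the ~14,000 GPU-hour estimate); substitute your own precision and MFU.

```python
# Rough cluster sizing from the formula above.
# Assumption: ~3,958 TFLOPS per GPU (approx. H100 SXM FP8 with sparsity);
# dense BF16 (~989 TFLOPS) would roughly quadruple the estimate.

def gpu_hours(model_flops: float, gpu_tflops: float, mfu: float) -> float:
    """Total GPU-hours = model FLOPs / (per-GPU FLOPS x MFU x 3,600 s/h)."""
    return model_flops / (gpu_tflops * 1e12 * mfu * 3600)

def days_on(hours: float, n_gpus: int) -> float:
    """Wall-clock days if the work is spread evenly over n_gpus."""
    return hours / n_gpus / 24

total = gpu_hours(model_flops=1e23, gpu_tflops=3958, mfu=0.50)
print(f"{total:,.0f} GPU-hours")                     # ~14,000
print(f"{days_on(total, 8):.0f} days on 8 GPUs")     # ~73
print(f"{days_on(total, 64):.0f} days on 64 GPUs")   # ~9
```

Note how cluster size trades off linearly against wall-clock time only if MFU holds constant; poor networking at larger scale lowers MFU and breaks that linearity.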
Step 2: Choose Your GPU Platform
For clusters up to 8 GPUs (single node):
- 8x H100 SXM5 HGX is the standard building block
- NVLink + NVSwitch provides all-to-all GPU connectivity within the node
- A single chassis-scale server (e.g. Supermicro SYS-821GE or Dell PowerEdge XE9680, typically 6-8U) houses the full 8-GPU system
For clusters of 16-256 GPUs (multi-node):
- Each node is an 8x H100 HGX server
- Nodes are connected via InfiniBand NDR (400 Gb/s) or high-speed Ethernet
- The inter-node network fabric becomes as critical as the GPUs themselves
Step 3: Design the Network Fabric
The inter-node network is where most first-time cluster builders make mistakes. For multi-node training:
- InfiniBand NDR 400 is the recommended fabric for clusters of 16+ GPUs
- Each server needs one NVIDIA ConnectX-7 HCA per GPU (8 per server) for optimal “rail-optimised” topology, or at minimum 2-4 HCAs for smaller clusters
- A fat-tree topology using NVIDIA Quantum-2 switches provides full bisection bandwidth
- For 4-node (32-GPU) clusters, a single 64-port Quantum-2 switch (QM9700) can connect all HCAs directly
- For 8+ nodes, you need a spine-leaf switching architecture
Budget warning: InfiniBand switches and cabling can cost 15-25% of your total cluster budget. Do not underestimate this.
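The switch counts above can be estimated with a back-of-envelope two-tier fat-tree calculation. This is a sketch under stated assumptions (8 HCAs per node, 64-port NDR switches matching the Quantum-2 QM9700, non-blocking leaves splitting ports half down, half up), not a validated network design:

```python
import math

# Back-of-envelope switch count for a two-tier (leaf-spine) fat tree.
# Assumptions: 8 HCAs per node (rail-optimised), 64-port NDR switches,
# and a non-blocking design where each leaf uses half its ports for
# servers and half for spine uplinks.

def fat_tree_switches(nodes: int, hcas_per_node: int = 8,
                      switch_ports: int = 64) -> tuple[int, int]:
    """Return (leaf_switches, spine_switches) for a non-blocking fabric."""
    endpoints = nodes * hcas_per_node
    down_per_leaf = switch_ports // 2        # ports facing servers
    if endpoints < switch_ports:
        return 1, 0                          # a single flat switch suffices
    leaves = math.ceil(endpoints / down_per_leaf)
    uplinks = leaves * down_per_leaf         # full bisection: up == down
    spines = math.ceil(uplinks / switch_ports)
    return leaves, spines

for n in (4, 8, 16, 32):
    leaves, spines = fat_tree_switches(n)
    print(f"{n:2d} nodes ({n*8:3d} GPUs): {leaves} leaf, {spines} spine")
```

This matches the guidance above: 4 nodes fit one switch, while 8 or more nodes push you into a leaf-spine design, and the switch count (and cabling) grows quickly from there.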
Step 4: Plan Storage
Training data loading must keep pace with GPU compute. A poorly designed storage layer will starve your GPUs:
- Local NVMe (2-4 drives per server) for dataset caching and checkpoint writes. Plan for 4-8 TB per node
- Shared storage for the canonical dataset and model checkpoints. Options include a dedicated NFS/NAS appliance, a parallel filesystem (BeeGFS, Lustre), or S3-compatible object storage (MinIO)
- Throughput target: At minimum, your storage layer should deliver 2-5 GB/s per GPU node to avoid data loading bottlenecks
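To put numbers on the targets above, here is a hedged sizing sketch. The 14-bytes-per-parameter checkpoint figure is an assumption (bf16 weights + fp32 master weights + two fp32 Adam moments); your framework's checkpoint format may differ:

```python
# Rough storage sizing. Assumptions: the 2-5 GB/s-per-node read target
# from above, and a training checkpoint of ~14 bytes/param
# (2 bf16 weights + 4 fp32 master + 4 + 4 fp32 Adam moments).

def aggregate_read_gbps(nodes: int, per_node_gbps: float = 2.0) -> float:
    """Minimum sustained read throughput the shared tier must deliver."""
    return nodes * per_node_gbps

def checkpoint_size_gb(params_billion: float,
                       bytes_per_param: int = 14) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

def checkpoint_write_minutes(size_gb: float, write_gbps: float) -> float:
    return size_gb / write_gbps / 60

ckpt = checkpoint_size_gb(70)
print(f"70B checkpoint: {ckpt:,.0f} GB")                       # ~980 GB
print(f"write at 10 GB/s: {checkpoint_write_minutes(ckpt, 10):.1f} min")
print(f"8-node read floor: {aggregate_read_gbps(8):.0f} GB/s")
```

A near-terabyte checkpoint is why local NVMe for checkpoint staging matters: writing it synchronously through a slow shared tier can stall all GPUs for minutes per save.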
Step 5: Select a Data Centre
Not all Indian data centres can host GPU clusters. Key requirements:
- Power density: An 8x H100 server draws 10+ kW. A 4-server rack needs 40-50 kW. Verify the facility supports this per-rack power density
- Cooling capacity: Ensure the data centre can handle the heat output of GPU servers. Some facilities offer rear-door heat exchangers or direct liquid cooling support
- Network connectivity: The facility should have carrier-neutral connectivity, peering with major ISPs, and ideally access to internet exchange points
- Location: Mumbai and Chennai offer the best connectivity to international networks. Pune and Hyderabad are emerging alternatives with competitive pricing
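The power and cooling figures above can be sanity-checked with a quick budget. The per-server draw and overhead allowance here are assumptions consistent with the "10+ kW" figure; confirm actual nameplate ratings with your vendor:

```python
# Rack power and heat budget. Assumptions: ~10.2 kW per 8x H100 node,
# plus an allowance for switches, PDUs, and fans. 1 kW of IT load
# dissipates ~3,412 BTU/hr of heat that the facility must remove.

def rack_power_kw(servers: int, server_kw: float = 10.2,
                  overhead_kw: float = 3.0) -> float:
    """Peak draw for one rack: GPU servers plus networking/PDU overhead."""
    return servers * server_kw + overhead_kw

def heat_btu_per_hr(kw: float) -> float:
    return kw * 3412

kw = rack_power_kw(4)
print(f"4-server rack: {kw:.1f} kW, ~{heat_btu_per_hr(kw):,.0f} BTU/hr")
```

At ~44 kW per rack, you are well beyond the 5-10 kW racks typical of general-purpose colocation, which is why facility selection must precede hardware ordering.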
Step 6: Software Stack
A production training cluster needs:
- OS: Ubuntu 22.04/24.04 LTS with NVIDIA driver 535+ and CUDA 12.x
- Container runtime: Docker with the NVIDIA Container Toolkit (the successor to nvidia-docker)
- Job scheduler: Slurm is the standard for HPC-style training clusters; Kubernetes with GPU operator is an alternative for mixed workloads
- Communication library: NCCL 2.x for GPU-to-GPU collective operations
- Monitoring: Prometheus + Grafana with DCGM (Data Center GPU Manager) for GPU health and utilisation metrics
- Training framework: PyTorch with DeepSpeed or Megatron-LM for distributed training
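As one concrete wiring point between the stack components above, NCCL is tuned largely through environment variables. The variable names below are real NCCL knobs, but the values are illustrative assumptions for a rail-optimised InfiniBand node; verify your HCA and interface names (e.g. with ibstat and ip link) before adopting them:

```python
import os

# Illustrative NCCL settings for multi-node InfiniBand training.
# Values are assumptions for a rail-optimised 8-HCA node, not a
# recommendation; benchmark on your own fabric.

def nccl_env(hca_prefix: str = "mlx5", socket_if: str = "eth0") -> dict[str, str]:
    return {
        "NCCL_DEBUG": "WARN",             # raise to INFO when diagnosing
        "NCCL_IB_HCA": hca_prefix,        # restrict NCCL to these HCAs
        "NCCL_SOCKET_IFNAME": socket_if,  # interface for bootstrap traffic
    }

os.environ.update(nccl_env())
print(os.environ["NCCL_IB_HCA"])
```

Bake settings like these into your Slurm prolog or container image so every job launches with a consistent, tested communication configuration.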
Common Mistakes to Avoid
- Skimping on the network fabric: a 10GbE inter-node network with 8x H100 GPUs will bottleneck training severely
- Ignoring power planning: ordering a 4-server GPU cluster before confirming your data centre can deliver 50+ kW per rack
- No monitoring from day one: GPU failures, thermal throttling, and network errors are invisible without proper monitoring
- Underestimating lead times: GPU servers, InfiniBand switches, and colocation space in India can have 4-12 week lead times. Plan ahead.
rawcompute.in designs and deploys complete AI training clusters in India, including GPU servers, InfiniBand fabric, storage, and colocation placement. Contact us with your training requirements for a customised cluster design and quote.
Need this for your infrastructure? Let's talk.
We help teams across India spec and deploy hardware.