
What is RDMA?: rawcompute.in Glossary

RDMA (Remote Direct Memory Access) enables one server to directly access the memory of another server over the network without involving either server's CPU, providing ultra-low latency and high throughput.

RDMA is a data-transfer mechanism that allows a network adapter to read from or write to the memory of a remote machine without interrupting either machine's CPU. Traditional TCP/IP networking requires multiple data copies and CPU involvement for every packet; RDMA bypasses the kernel network stack and eliminates those copies. The result is latency in the single-digit-microsecond range and near-line-rate throughput, which is essential for distributed GPU training, where gradients are synchronised repeatedly within every training step.
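To see why gradient synchronisation dominates the interconnect, consider the ring all-reduce commonly used for gradient averaging: each GPU sends roughly 2(N-1)/N times the gradient buffer per collective. A small sketch (the function name and the 7B/8-GPU example are illustrative, not from the article):

```python
# Illustrative sketch: estimated bytes each GPU sends per ring all-reduce,
# showing the sustained network load of per-step gradient synchronisation.

def ring_allreduce_bytes_per_gpu(num_params: int, num_gpus: int,
                                 bytes_per_param: int = 2) -> int:
    """Bytes sent by each GPU in one ring all-reduce.

    A ring all-reduce moves roughly 2 * (N - 1) / N of the gradient buffer
    through each GPU's link (reduce-scatter phase plus all-gather phase).
    """
    grad_bytes = num_params * bytes_per_param  # e.g. fp16 gradients
    return int(2 * (num_gpus - 1) / num_gpus * grad_bytes)

# Example: a 7B-parameter model across 8 GPUs with fp16 gradients
per_step = ring_allreduce_bytes_per_gpu(7_000_000_000, 8)
print(f"{per_step / 1e9:.1f} GB sent per GPU per all-reduce")
```

At tens of gigabytes per collective, shaving microseconds and CPU overhead off each transfer compounds into a large difference in end-to-end training throughput.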

RDMA can be implemented over InfiniBand (native RDMA), RoCE (RDMA over Converged Ethernet), or iWARP (RDMA over TCP). InfiniBand provides the most mature and performant RDMA implementation. NVIDIA’s GPUDirect RDMA extends this further by allowing network adapters to DMA directly into GPU memory, bypassing both system memory and CPU for GPU-to-GPU transfers across nodes.
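On Linux, you can check which transport a node's RDMA devices use: adapters appear under `/sys/class/infiniband` regardless of transport, and each port's `link_layer` file reads `InfiniBand` for native IB or `Ethernet` for RoCE. A minimal sketch, assuming the standard sysfs layout:

```python
# Illustrative sketch: enumerate RDMA-capable devices via Linux sysfs and
# report each device's link layer ("InfiniBand" = native IB, "Ethernet" = RoCE).

from pathlib import Path

def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> dict[str, str]:
    """Map each RDMA device name to the link layer of its first port."""
    devices: dict[str, str] = {}
    root = Path(sysfs_root)
    if not root.is_dir():  # no RDMA-capable adapters (or drivers not loaded)
        return devices
    for dev in sorted(root.iterdir()):
        link_layer_file = dev / "ports" / "1" / "link_layer"
        if link_layer_file.is_file():
            devices[dev.name] = link_layer_file.read_text().strip()
    return devices

print(list_rdma_devices() or "no RDMA devices found")
```

An empty result means the node has no RDMA-capable adapters visible to the kernel, which is worth verifying before debugging anything at the application layer.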

Why it matters when buying hardware

For any multi-node AI training deployment, RDMA capability is essential. Without RDMA, inter-node communication falls back to TCP sockets, adding significant latency and CPU overhead that degrades training throughput. Ensure your network adapters (ConnectX-7 or equivalent), switches, and drivers all support RDMA. If using RoCE over Ethernet, you will need a lossless Ethernet fabric with PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) properly configured. This is more complex than InfiniBand, which supports RDMA natively.
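The PFC/ECN requirement for RoCE is easy to get wrong across a large fabric, so it helps to validate the plan programmatically. A minimal sketch of such a check; the config-dict shape, key names, and priority value here are hypothetical, not a real switch API:

```python
# Illustrative sketch (all field names are hypothetical): sanity-check that
# every port in a RoCE fabric plan has PFC enabled on the priority carrying
# RoCE traffic and has ECN marking turned on.

ROCE_PRIORITY = 3  # a common convention; confirm against your fabric design

def check_lossless_config(ports: dict[str, dict]) -> list[str]:
    """Return misconfiguration messages for each port (empty list means OK)."""
    problems = []
    for name, cfg in ports.items():
        if ROCE_PRIORITY not in cfg.get("pfc_priorities", []):
            problems.append(f"{name}: PFC not enabled on priority {ROCE_PRIORITY}")
        if not cfg.get("ecn_enabled", False):
            problems.append(f"{name}: ECN marking disabled")
    return problems

# Example plan: one correctly and one incorrectly configured port
plan = {
    "leaf1/eth1": {"pfc_priorities": [3], "ecn_enabled": True},
    "leaf1/eth2": {"pfc_priorities": [], "ecn_enabled": False},
}
for problem in check_lossless_config(plan):
    print(problem)
```

A single port without PFC on the RoCE priority can drop packets under congestion and stall collectives fabric-wide, which is why checks like this are worth automating.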
