
What is RoCE?: rawcompute.in Glossary

RoCE (RDMA over Converged Ethernet) is a network protocol that carries Remote Direct Memory Access (RDMA) traffic over standard Ethernet infrastructure, offering a lower-cost alternative to InfiniBand for GPU clusters.

RoCE (pronounced “rocky”) enables RDMA over standard Ethernet networks, allowing GPU servers to communicate with low latency without requiring dedicated InfiniBand hardware. RoCE v2 encapsulates RDMA packets inside UDP/IP, making it routable across Layer 3 networks. This is a significant advantage over RoCE v1, which was limited to a single Layer 2 broadcast domain. Modern network adapters like NVIDIA ConnectX-7 support both InfiniBand and RoCE on the same hardware.
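The "routable" property of RoCE v2 comes down to ordinary UDP framing: RoCE v2 packets carry the IANA-assigned UDP destination port 4791, so any Layer 3 device can forward them like normal IP traffic. As a minimal sketch (field layout only, not a real capture), this is how a tool could recognise a RoCE v2 packet from its UDP header:

```python
import struct

ROCE_V2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2


def is_rocev2_packet(udp_header: bytes) -> bool:
    """Return True if a raw UDP header targets the RoCE v2 port.

    udp_header: the 8-byte UDP header (source port, destination port,
    length, checksum) as it appears after the IP header of a frame.
    """
    if len(udp_header) < 8:
        return False
    _src, dst, _length, _csum = struct.unpack("!HHHH", udp_header[:8])
    return dst == ROCE_V2_UDP_PORT


# Illustrative header: ephemeral source port, destination port 4791
hdr = struct.pack("!HHHH", 49152, ROCE_V2_UDP_PORT, 32, 0)
print(is_rocev2_packet(hdr))  # True
```

Because the RDMA payload is hidden behind this UDP header, switches and routers only need standard IP forwarding; the RDMA semantics are handled entirely by the NICs at each end.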

To achieve the lossless transport that RDMA requires, RoCE deployments need a carefully configured Ethernet fabric with Data Center Bridging (DCB) features: Priority Flow Control (PFC) to prevent packet drops, and ECN (Explicit Congestion Notification) to manage congestion before buffers overflow. Without these, packet loss causes RDMA transport retries that severely degrade performance. This configuration complexity is the primary trade-off versus InfiniBand, which provides lossless transport by design.
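To make the configuration burden concrete, here is an illustrative host-side sketch for an NVIDIA ConnectX NIC. The `mlnx_qos` utility ships with NVIDIA's OFED tools; the interface name, the choice of priority 3, and the sysfs paths are assumptions that vary by driver and firmware version, so treat this as a shape of the work rather than copy-paste commands:

```shell
# Sketch: host-side lossless-Ethernet setup for RoCE on an NVIDIA ConnectX NIC.
# Interface name (eth0) and priority (3) are illustrative; check your own mapping.

# Trust DSCP markings so RDMA traffic lands in the intended priority queue
mlnx_qos -i eth0 --trust dscp

# Enable Priority Flow Control on priority 3 only (one '1' in the 8-slot list)
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0

# Enable ECN-based congestion control for RoCE on priority 3;
# exact sysfs paths differ across mlx5 driver versions
echo 1 > /sys/class/net/eth0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/eth0/ecn/roce_rp/enable/3
```

The switch side needs matching PFC and ECN marking thresholds on the same priority; mismatches between host and switch settings are where most RoCE deployments go wrong.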

Why it matters when buying hardware

RoCE is a viable option for smaller GPU clusters (4-16 nodes) where the cost of InfiniBand switches is prohibitive. Many organisations already have high-speed Ethernet infrastructure (25/100/400 GbE) that can be repurposed for RoCE. However, for clusters larger than roughly 32 nodes, or for workloads that need the lowest possible latency, InfiniBand remains the better choice. If you choose RoCE, ensure your Ethernet switches support DCB and that your network team is comfortable configuring PFC and ECN. rawcompute.in supports both InfiniBand and RoCE fabric designs.
