Artificial intelligence continues to be a focal point for companies across technology and communications as demand from enterprise customers soars, but one of the underappreciated aspects of AI is that the network plays a critical role in the success of AI initiatives.
Though the network vendors haven't received the same type of “AI bump” the capital markets have given the chip companies, they have been aggressive in evolving their products to meet the demands of AI.
Arista Networks Inc., the network vendor that has done the most effective job of tying its growth to AI, on Wednesday announced new capabilities for its EOS Smart AI Suite designed to improve AI cluster performance and efficiency.
The Santa Clara company introduced a feature called Arista Cluster Load Balancing, or CLB, in its Arista EOS Smart AI Suite to maximize AI workload performance with consistent, low-latency network flows. It also announced that its Arista CloudVision Universal Network Observability, or CV UNO, now offers AI observability for enhanced troubleshooting and issue inferencing to ensure reliable job completion at scale.
Cluster Load Balancing benefits
Based on RDMA queue pairs, Cluster Load Balancing enables high bandwidth utilization between spines and leaves. AI clusters typically carry a small number of large-bandwidth flows, unlike typical network traffic such as email and web browsing, which consists of many small flows. Traditional network infrastructure was never designed for AI, so it lacks the necessary throughput for AI workloads.
That can lead to uneven traffic distribution and increased tail latency. CLB solves this issue with RDMA-aware flow placement to deliver uniform high performance for all flows while maintaining low tail latency. CLB optimizes bidirectional traffic flow, both leaf-to-spine and spine-to-leaf, to provide enterprises with balanced utilization and consistent low latency.
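Arista hasn't published the details of CLB's placement algorithm, but the general idea behind RDMA-aware load balancing can be sketched conceptually. The contrast below is between classic hash-based ECMP, which ignores flow size, and a flow-aware policy that places each large flow on the least-loaded uplink; function names and traffic figures are illustrative, not Arista's.

```python
# Conceptual sketch only: contrasts hash-based ECMP placement with a
# flow-aware "least-loaded uplink" policy, the general idea behind
# RDMA-aware load balancing. Not Arista's actual algorithm.

def ecmp_placement(flows, n_uplinks):
    """Classic ECMP: each flow is hashed to an uplink, ignoring flow size."""
    loads = [0.0] * n_uplinks
    for flow_id, gbps in flows:
        loads[hash(flow_id) % n_uplinks] += gbps
    return loads

def flow_aware_placement(flows, n_uplinks):
    """Flow-aware: place each flow on the currently least-loaded uplink."""
    loads = [0.0] * n_uplinks
    for _, gbps in sorted(flows, key=lambda f: -f[1]):  # biggest flows first
        loads[loads.index(min(loads))] += gbps
    return loads

# AI clusters carry few, very large flows (RDMA queue pairs),
# e.g. eight flows of ~50 Gbps spread across four uplinks.
flows = [(f"qp-{i}", 50.0) for i in range(8)]
print("ECMP loads:      ", ecmp_placement(flows, 4))   # often uneven
print("Flow-aware loads:", flow_aware_placement(flows, 4))  # perfectly even
```

With so few flows, hash collisions routinely pile two or three of them onto one uplink while another sits idle, which is exactly the uneven distribution and tail latency described above; the flow-aware policy always balances them evenly.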
With CV UNO, Arista is enabling the network to directly impact AI performance at an application level. “With CLB we look at network performance, but we will also integrate application-level performance, VM performance, all into one screen so that network engineers can figure out where performance issues are and quickly find the root cause,” Praful Bhaidasna, head of observability products for Arista, told me.
Quantifying CLB benefits
I asked Brendan Gibbs, vice president of AI, routing and switching platforms for Arista, to quantify the benefits that CLB delivers. He said that though all organizations are different, the performance improvements are significant. “With clusters, a general rule of thumb is about 30% of time is spent in networking,” he said. “If we can provide an extra 8% or 10% of throughput on the links customers have already deployed, it means an Arista network is going to be higher-throughput, with a lower job completion time than the next nearest competitive platform.”
The performance boost is notable. With traditional networks, which use dynamic load balancing, or DLB, to optimize traffic, the best-performing networks operate at about 90% efficiency. I asked Gibbs about CLB versus DLB and he told me it can achieve 98.3% efficiency. Given the cost of GPUs, every information technology pro I’ve talked to about AI wants more network throughput to keep the processors busy, since inefficiency leads to dollars being wasted.
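The two sets of numbers quoted above are roughly consistent with each other, which a quick back-of-the-envelope check shows (my arithmetic, not Arista's):

```python
# Back-of-the-envelope check of the figures quoted above (my arithmetic,
# not Arista's): moving from ~90% link efficiency with DLB to 98.3% with CLB.
dlb_eff, clb_eff = 0.90, 0.983

# Extra throughput on the same deployed links.
throughput_gain = clb_eff / dlb_eff - 1
print(f"Extra throughput: {throughput_gain:.1%}")  # ~9.2%

# If ~30% of job time is spent in networking (Gibbs' rule of thumb),
# the network phases shrink in proportion to the efficiency gain.
network_share = 0.30
new_job_time = (1 - network_share) + network_share * (dlb_eff / clb_eff)
print(f"Estimated job-completion-time reduction: {1 - new_job_time:.1%}")  # ~2.5%
```

The roughly 9.2% throughput gain lands squarely in the “extra 8% or 10%” range Gibbs cites, and it translates to a job-completion-time improvement of a couple of percent, which is meaningful when GPU clusters cost tens of millions of dollars.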
One of those customers is Oracle Corp., which is using Arista switches as it continues to grow its AI infrastructure. “We see a need for advanced load balancing techniques to help avoid flow contentions and increase throughput in ML networks,” Jag Brar, vice president and distinguished engineer for Oracle’s cloud infrastructure, said in Arista’s news release. “Arista’s Cluster Load Balancing feature helps do that.” I don’t normally pull quotes from press releases, but I did in this case because Oracle is usually tight-lipped about who its suppliers are. The fact it provided a quote is meaningful, as it’s out of the norm for Oracle.
AI job visibility
Arista said that CV UNO provides end-to-end AI job visibility by unifying network, system and AI job data within the Arista Network Data Lake, or NetDL. NetDL is fed by a real-time telemetry framework that streams granular network data from Arista switches, unlike traditional SNMP polling, which relies on periodic queries and can miss critical updates.
Although Arista makes great hardware, it’s the data that gives it operational and performance consistency across products. When Arista launched, each network device had its own network database, NetDB, but a few years ago the company evolved to a single data lake across its product line, and NetDL was born.
EOS NetDL delivers low-latency, high-frequency, event-driven insights into network performance. This is a key element for providing connectivity in large-scale AI training and inferencing infrastructure.
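The advantage of event-driven streaming over periodic polling is easy to see with a toy simulation (illustrative only, not Arista's or SNMP's actual API): a microburst that fills a buffer for a few milliseconds is invisible to a coarse polling interval but is captured the moment a streaming device pushes the event.

```python
# Illustrative sketch (not Arista's API): why event-driven streaming
# telemetry catches transients that periodic SNMP-style polling misses.

def poll(samples, interval):
    """SNMP-style: inspect the counter only every `interval` ticks."""
    return [samples[i] for i in range(0, len(samples), interval)]

def stream(samples, threshold):
    """Streaming: the device pushes an event the moment state changes."""
    return [(t, v) for t, v in enumerate(samples) if v >= threshold]

# Buffer utilization (%) per tick, with a 3-tick microburst at t=10..12.
util = [5] * 10 + [95, 99, 97] + [5] * 10
print("Polled every 8 ticks:", poll(util, 8))     # burst is invisible
print("Streamed events:     ", stream(util, 90))  # burst is captured
```

In the polled view the link looks healthy the entire time, while the streamed view surfaces the exact ticks where the buffer spiked; at AI-cluster scale, those missed microbursts are precisely the events that stall GPUs and stretch job completion times.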
Benefits of EOS NetDL Streamer
- AI job monitoring: A view of AI job health metrics, such as job completion times, congestion indicators and real-time insights from buffer/link utilization.
- Deep-dive analysis: Provides job-specific insights by analyzing network devices, server NICs and related flows to pinpoint performance bottlenecks precisely.
- Flow visualization: Uses the power of CV topology mapping to provide real-time, intuitive visibility into AI job flows at microsecond granularity to accelerate issue inference and resolution.
- Proactive resolution: Finds anomalies quickly and correlates network and compute performance within NetDL to ensure uninterrupted, high-efficiency AI workload execution.
Availability
Arista said CLB is available today on its 7260X3, 7280R3, 7500R3, and 7800R3 platforms. It will be supported in Q2 2025 on the 7060X6 and 7060X5 platforms. Support for the 7800R4 platform is scheduled for the second half of this year.
CV UNO is available today, and the AI observability enhancements, currently in customer trials, are expected to be available in the second half of 2025.