AI workloads pose new challenges to your storage infrastructure

By Michael McNerney, Vice President Marketing and Network Security, Supermicro.


There’s a complete makeover underway in modern enterprises, centered on what might be called the “AI revolution.” Organizations gain competitive advantages and key insights when they put advanced AI- and ML-based applications to work.

Leading examples of such workloads include large language models (LLMs) such as those behind ChatGPT, ML models trained on enormous data sets, complex 3D models, animation and virtual reality, simulations, and other data- and compute-intensive applications.

For a high-performance AI or ML solution, an entire, comprehensive set of hardware needs to work together. Behind the flashy rack-mounted hardware that houses the GPU-driven brains of any AI cluster, high-throughput, low-latency storage systems are needed to keep the cluster productive. They feed the I/O channels that deliver massive amounts of data to train models and run the complex simulations and analyses behind AI, ML, and similar workloads. Indeed, one of the biggest challenges facing businesses looking to capitalize on the growth of AI is finding a storage solution that won’t bottleneck their high-performance CPUs, GPUs, or database clusters. After all, the CPUs and GPUs need to be kept busy to reduce the TCO of the data center.
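To make that concrete, here is a rough back-of-the-envelope sketch in Python. The GPU count, per-GPU sample rate, and sample size are purely illustrative assumptions; the point is only to show how quickly the required sustained read throughput adds up.

```python
# Rough sketch: how much read throughput a training cluster needs so the
# GPUs are never waiting on storage. All figures are illustrative
# assumptions, not measured values.

gpus = 64                       # GPUs in the cluster (assumed)
samples_per_sec_per_gpu = 2000  # training samples each GPU consumes per second (assumed)
bytes_per_sample = 300 * 1024   # average size of one preprocessed sample (~300 KiB, assumed)

required_read_gbs = gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9

print(f"Sustained read throughput needed: ~{required_read_gbs:.1f} GB/s")
# If storage delivers less than this, GPUs sit idle and the effective
# cost per unit of work (and data-center TCO) goes up.
```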

Avoiding storage bottlenecks

In distributed and parallel deployments, data arrives at the distributed file system from multiple sources and must be processed at scale across various protocols and for various applications. In a typical storage system, metadata quickly becomes the bottleneck: you can only push as much data through the system as the metadata service can support. As the amount of data scales, the ability to handle metadata needs to scale proportionally.
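A minimal sketch of that effect, assuming a fixed target data throughput and a notional three metadata operations per file read, shows how the required metadata rate explodes as average file size shrinks:

```python
# Sketch of why metadata becomes the bottleneck: for a fixed data
# throughput, smaller files mean proportionally more metadata operations.
# Numbers are illustrative assumptions only.

target_throughput_gbs = 40   # data throughput the cluster needs, GB/s (assumed)
metadata_ops_per_file = 3    # e.g. lookup + open + close per file read (assumed)

for avg_file_mb in (100, 10, 1, 0.1):
    files_per_sec = target_throughput_gbs * 1024 / avg_file_mb
    metadata_ops_per_sec = files_per_sec * metadata_ops_per_file
    print(f"avg file {avg_file_mb:>5} MB -> "
          f"~{metadata_ops_per_sec:,.0f} metadata ops/s required")

# A metadata service with a fixed ceiling caps total throughput once files
# get small, which is why metadata handling has to scale with the data.
```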

WEKA distributed storage is architected to provide exactly this proportional scaling. As more data capacity is added to a system or cluster, I/O performance continues to scale linearly, from eight storage nodes (the minimum count for a WEKA cluster) to hundreds. It does so by eliminating bottlenecks and supporting even the heaviest and most demanding AI/ML (and other similar) workloads.

But there’s more to optimizing servers and clusters than providing scalable, high-performance, low-latency storage. When designing an entire system, the focus cannot be exclusively on any single feature or function. The entire architecture must work in concert to support targeted workloads. Thus, designing a system for AI applications means creating a runtime environment built from the ground up to handle data-intensive applications both quickly and satisfactorily. What the server does with the data while handling an AI workload is as essential as the data traffic into and out of any given node.

NVMe to the rescue

Another critical feature is the number of PCIe 5.0 lanes a server provides. These lanes let servers accommodate larger complements of SSDs, NICs, GPUs, and even CXL memory-expansion devices, all of which play essential roles in demanding AI and ML workloads: PCIe Gen5 SSDs for high-speed local storage; large numbers of high-speed network interfaces connecting servers to other nodes, such as storage or other specialized servers, to extend data scope and reach; and large numbers of GPUs for handling specialized, targeted tasks or workloads.
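As a rough illustration, the lane-budget sketch below assumes roughly 3.94 GB/s of usable bandwidth per PCIe 5.0 lane and a hypothetical single-socket device mix of 16 NVMe SSDs, two 400GbE-class NICs, and two GPUs. The counts and the 128-lane socket are assumptions chosen for the example, not a specific product configuration.

```python
# Simple lane-budget sketch for a PCIe 5.0 server: how many lanes an
# assumed device mix consumes and roughly how much peak bandwidth that is.

GEN5_GBS_PER_LANE = 3.94  # approx. usable GB/s per PCIe 5.0 lane

devices = {
    "NVMe SSD (x4)": {"lanes": 4,  "count": 16},  # assumed 16 Gen5 SSDs
    "NIC (x16)":     {"lanes": 16, "count": 2},   # assumed 2 x 400GbE-class NICs
    "GPU (x16)":     {"lanes": 16, "count": 2},   # assumed 2 accelerators
}

total_lanes = 0
for name, d in devices.items():
    lanes = d["lanes"] * d["count"]
    total_lanes += lanes
    print(f"{name:<14} {lanes:>3} lanes  ~{lanes * GEN5_GBS_PER_LANE:6.0f} GB/s peak")

print(f"Total lanes required: {total_lanes}")
# This mix exactly fills a CPU exposing 128 Gen5 lanes (an assumption);
# fewer lanes would force trade-offs between SSDs, NICs, and GPUs.
```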

NVMe devices have completely changed the server and cluster game. With NVMe at its base, a thoroughly reworked architecture becomes possible, one that allows storage to operate at scale and at speed alongside high-performance CPUs, GPUs, and NICs, especially with the EDSFF form factor. A single-socket design enables best-of-breed CPUs to fully saturate network cards and storage and to exploit the highest possible levels of parallelism and clustering for HPC, AI, and other next-generation solutions.
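As a quick sanity check on that point, the sketch below estimates how few Gen5 NVMe drives it takes to saturate a single 400 Gb/s NIC, assuming a notional 12 GB/s of sustained reads per drive; both figures are assumptions for illustration.

```python
# Matching NVMe storage to network bandwidth: roughly how many Gen5 NVMe
# drives does it take to saturate one 400 Gb/s NIC? Per-drive throughput
# is an illustrative assumption.

nic_gbs = 400 / 8        # 400 Gb/s NIC expressed in GB/s
drive_read_gbs = 12.0    # assumed sustained read per Gen5 NVMe drive, GB/s

drives_needed = nic_gbs / drive_read_gbs
print(f"NIC bandwidth: {nic_gbs:.0f} GB/s")
print(f"Drives needed to saturate it: ~{drives_needed:.1f}")
# A handful of NVMe drives keeps a fast NIC full; with older SATA/SAS
# media the same exercise would require dozens of devices.
```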
