Ready for testing: First-ever supercomputer powered by Intel's wildcard AI chips

The San Diego Supercomputer Center (SDSC) says it's ready to run test workloads on its experimental Voyager AI system, which looks to be the first-ever Intel Habana-based supercomputer.

The supercomputer was built in collaboration with Intel's Habana Labs and Supermicro as part of a five-year $11.25 million grant from America's National Science Foundation. And while powerful, Voyager isn't trying to win any benchmark records — it's not supposed to.

Voyager is intended to be a proving ground for AI/ML computing research and development on specialized hardware — in this case, Habana's Goya and Gaudi processors — Voyager Principal Investigator Amit Majumdar told The Register.

Introduced in 2019, Habana Labs' Goya was designed to accelerate AI inference workloads using eight tensor processor cores with support for mixed precision from FP32 to UINT8. Meanwhile, Gaudi, introduced a few months later, was a 350W chip designed with ML training in mind. It featured 32GB of onboard memory with 1TB/s of bandwidth.

Intel acquired the chip designer in late 2019 after abandoning its ill-fated Nervana collab with Meta (then Facebook). Sort of a third-time lucky thing for Intel on AI systems.

The Habana AI accelerators are deployed across the 42 Supermicro X12 systems that make up Voyager. Each X12 system is equipped with a pair of Intel's third-gen Xeon Scalable processors and eight Habana Gaudi AI processors. The cluster also employs a pair of the OEM's SuperServer 4029GP-T systems with eight Goya HL-100 PCIe cards for AI inferencing.

Because the system is designed to support very large AI models, each server is networked with six 400 Gbit/sec ports operating over the RDMA-over-converged-Ethernet protocol to a large Arista non-blocking switch.
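For a sense of scale, the figures above can be tallied with some back-of-the-envelope Python. This is only a rough sketch: the variable names are ours rather than SDSC's, it assumes every system is populated exactly as described, it reads "eight Goya cards" as eight per inference server, and it applies the 32GB quoted for Gaudi to every training chip.

```python
# Figures quoted in the article; all names here are illustrative.
GAUDI_SYSTEMS = 42        # Supermicro X12 training systems
GAUDI_PER_SYSTEM = 8      # Gaudi processors per X12 system
GOYA_SYSTEMS = 2          # SuperServer 4029GP-T inference systems
GOYA_PER_SYSTEM = 8       # assumption: eight Goya cards in each server
GAUDI_MEMORY_GB = 32      # onboard memory per Gaudi
PORTS_PER_SERVER = 6      # network ports per server
PORT_SPEED_GBIT = 400     # Gbit/sec per port

total_gaudi = GAUDI_SYSTEMS * GAUDI_PER_SYSTEM
total_goya = GOYA_SYSTEMS * GOYA_PER_SYSTEM
total_gaudi_memory_gb = total_gaudi * GAUDI_MEMORY_GB
server_bandwidth_gbit = PORTS_PER_SERVER * PORT_SPEED_GBIT

print(f"Gaudi processors:       {total_gaudi}")
print(f"Goya cards:             {total_goya}")
print(f"Total Gaudi memory:     {total_gaudi_memory_gb} GB")
print(f"Per-server network:     {server_bandwidth_gbit} Gbit/sec")
```

On those assumptions, that works out to 336 Gaudi processors, 16 Goya cards, roughly 10.7TB of accelerator memory on the training side, and 2.4 Tbit/sec of network capacity per server.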

Ready, set, test

With the Voyager system operational, SDSC has transitioned to the test-bed phase of the project.

During this period, the supercomputing center has three years to work directly with researchers to suss out the system's performance, hardware quirks, and software compatibility requirements, Majumdar explained.

The research will also explore use cases for Habana's chips, which have traditionally targeted computer vision, natural-language processing, and deep-learning workloads, Sree Ganeson, head of software product management at Habana Labs, told The Register.

"This community of scientists and researchers are going to bring a different class of problems and try to apply them too deep learning," she said. "The kinds of patterns they may bring might be different, so, it's going to be a learning [process]."

The results of this testing will be shared over the next few years during semiannual workshops and user forums.

However, not everyone will get to work on the system. Research groups will be selected with the help of an external advisory board, and the information collected will be used to develop best practices and allocation policies. This is different from category-one systems, which are opened to peer-reviewed research projects shortly after coming online, Majumdar said.

After the three years are up, the project will transition to a two-year allocation phase during which the SDSC team will step back and allow independent scientists to conduct research on the system.

While Voyager has only just come online, Majumdar claims early testing has been promising, with performance being "better than projected" and workloads porting relatively painlessly to run on Gaudi and Goya. "The software stack, porting, and running on the machine has been really smooth," he said.

What about Gaudi2 and Greco?

Voyager comes online just weeks after Intel's Habana Labs unveiled its second-gen AI training and inference processors: Gaudi2 and Greco.

Intel claims the chips offer a substantial performance boost over the previous generation and allegedly outperform Nvidia's A100 GPUs in its internal benchmarks.

The 600W Gaudi2 offers 24 tensor cores based on a 7nm manufacturing process and 96GB of HBM2e high-bandwidth memory operating at 2.45TB/s. Greco, meanwhile, offers 16GB — the same as Goya — of newer LPDDR5 in a smaller single-slot, half-height, half-length PCIe card that consumes less than half the power.
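To put the generational jump in perspective, here is a quick comparison of the headline numbers quoted in this article for Gaudi and Gaudi2. It is a rough sketch using only the power, memory, and bandwidth figures given here; the dictionary layout and names are our own, and the ratios describe spec sheets, not measured performance.

```python
# Spec figures as quoted in the article; structure and names are illustrative.
gaudi = {"power_w": 350, "memory_gb": 32, "bandwidth_tb_s": 1.0}
gaudi2 = {"power_w": 600, "memory_gb": 96, "bandwidth_tb_s": 2.45}

# Ratio of second-gen to first-gen for each quoted spec.
ratios = {key: gaudi2[key] / gaudi[key] for key in gaudi}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

In other words, Gaudi2 carries three times the memory and about 2.45 times the memory bandwidth of its predecessor, in exchange for roughly 1.7 times the power draw.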

"Gaudi2 is bigger in many ways with more tensor processor cores, more HBM2e, more scale-out ports, so whatever we learn from [Voyager] should scale even better on Gaudi2," Ganeson said. "The cutting edge work is getting done by this community. So, we get to learn and develop for what's going to be in production in the future." ®
