A leading expert has raised critical questions about the validity of claims surrounding “Zettascale” and “Exascale-class” AI supercomputers.
In an article that delves deep into the technical intricacies of these terms, Doug Eadline from HPCWire explains how terms like exascale, which traditionally denote computers achieving one quintillion floating-point operations per second (FLOPS), are often misused or misrepresented, especially in the context of AI workloads.
Eadline points out that many of the recent announcements touting “exascale” or even “zettascale” performance are based on speculative metrics rather than tested results. He writes, “How do these ‘snort your coffee’ numbers arise from unbuilt systems?” – a question that highlights the gap between theoretical peak performance and measured results in high-performance computing. The term exascale has historically been reserved for systems that sustain at least 10^18 FLOPS in double-precision (64-bit) calculations, as verified by benchmarks such as High-Performance LINPACK (HPLinpack).
Car comparison
As Eadline explains, the distinction between FLOPS in AI and HPC is crucial. While AI workloads often rely on lower-precision floating-point formats such as FP16, FP8, or even FP4, traditional HPC systems demand higher precision for accurate results.
Quoting performance in these lower-precision formats is what inflates claims of exaFLOP or even zettaFLOP performance. According to Eadline, “calling it ‘AI zetaFLOPS’ is silly because no AI was run on this unfinished machine.”
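To make the precision gap concrete, here is a minimal sketch that is not taken from Eadline's article. It adds a small step to a running total many times, once in ordinary FP64 and once with every intermediate value rounded to IEEE 754 half precision (FP16). The helper names `to_fp16` and `running_sum`, and the choice of step and count, are illustrative assumptions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def running_sum(step: float, count: int, fp16: bool = False) -> float:
    """Add `step` to a running total `count` times.

    With fp16=True, the step and every intermediate total are rounded to
    half precision, imitating an accumulator that can only store FP16.
    (Hypothetical helper for illustration only.)
    """
    if fp16:
        step = to_fp16(step)
    total = 0.0
    for _ in range(count):
        total += step
        if fp16:
            total = to_fp16(total)
    return total


# The mathematically exact answer is 0.001 * 10,000 = 10.
fp64_total = running_sum(0.001, 10_000)
fp16_total = running_sum(0.001, 10_000, fp16=True)

print(f"FP64 total: {fp64_total:.6f}")
print(f"FP16 total: {fp16_total:.6f}")
```

The FP64 total lands very close to 10, while the FP16 total stalls well short of it: once the running total grows large enough, the step is smaller than half the spacing between adjacent FP16 values and each addition rounds away to nothing. An operation counted at FP16 is therefore not interchangeable with one counted at FP64, even though both are "FLOPS".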
He further emphasizes the importance of verified benchmarks such as HPLinpack, which has been the standard for measuring HPC performance since 1993, and warns that theoretical peak numbers can be misleading.
The two supercomputers that are currently part of the exascale club – Frontier at Oak Ridge National Laboratory and Aurora at Argonne National Laboratory – have been tested with real applications, unlike many of the AI systems making exascale claims.
To explain the difference between various floating-point formats, Eadline offers a car analogy: “The average double precision FP64 car weighs about 4,000 pounds (1814 Kilos). It is great at navigating terrain, holds four people comfortably, and gets 30 MPG. Now, consider the FP4 car, which has been stripped down to 250 pounds (113 Kilos) and gets an astounding 480 MPG. Great news. You have the best gas mileage ever! Except, you don’t mention a few features of your fantastic FP4 car. First, the car has been stripped down of everything except a small engine and maybe a seat. What’s more, the wheels are 16-sided (2^4) and provide a bumpy ride as compared to the smooth FP64 sedan ride with wheels that have somewhere around 2^64 sides. There may be places where your FP4 car works just fine, like cruising down Inference Lane, but it will not do well heading down the FP64 HPC highway.”
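The arithmetic behind the analogy can be checked in a few lines. The sketch below uses only the figures quoted above, plus one loudly labeled assumption: the hypothetical helper `fp64_equivalent` pretends that peak throughput scales inversely with operand width, which is an idealization and not a claim about any real hardware.

```python
FP64_BITS = 64
FP4_BITS = 4

# Ratios taken directly from the numbers in the car analogy.
bit_ratio = FP64_BITS // FP4_BITS   # 64-bit vs 4-bit operands
weight_ratio = 4000 / 250           # pounds: FP64 car vs FP4 car
mpg_ratio = 480 / 30                # MPG: FP4 car vs FP64 car

# Number of distinct bit patterns each format can encode.
fp4_patterns = 2 ** FP4_BITS        # the "16-sided wheel"
fp64_patterns = 2 ** FP64_BITS


def fp64_equivalent(peak_flops: float, bits: int) -> float:
    """Scale a low-precision peak figure down to a notional FP64 figure.

    ASSUMPTION (illustrative only): throughput is inversely proportional
    to operand width. Real accelerators do not necessarily behave this way.
    """
    return peak_flops * bits / FP64_BITS


EXA = 10 ** 18  # one exaFLOP, i.e. 10^18 operations per second

print("bits / weight / MPG ratios:", bit_ratio, weight_ratio, mpg_ratio)
print("FP4 patterns:", fp4_patterns, "| FP64 patterns:", fp64_patterns)
print("1 exaFLOP at FP4, as notional FP64 FLOPS:", fp64_equivalent(EXA, FP4_BITS))
```

All three ratios come out to 16, matching the 64-to-4 reduction in bits. Under the stated idealization, a headline figure of one exaFLOP at FP4 would correspond to a sixteenth of that at FP64, which is the kind of gap the analogy is pointing at.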
Eadline’s article serves as a reminder that while AI and HPC are converging, the standards for measuring performance in these fields remain distinct. As he puts it, “Fuzzing things up with ‘AI FLOPS’ will not help either,” pointing out that only verified systems that meet the stringent requirements for double-precision calculations should be considered true exascale or zettascale systems.