Inside the race to develop an exascale computer, the world’s most powerful system


By Derek Slater

Nuclear fusion is a process that creates energy by smashing two atoms together. Scientists believe a fusion-based power plant could supply almost limitless energy while creating very little waste.

The problem: No one knows whether it’s safe to create the superhot gas needed to trigger fusion.

Scientists are eager to run a simulation of the fusion process. But even today’s fastest supercomputer, the Sunway TaihuLight system in Wuxi, China—which can reach a maximum calculation speed of just over 93 petaflops, or about 93 quadrillion calculations per second—is about 10 times too slow for the job.

“We’re using 95 percent of the capacity of these computers, and we aren’t close to realism,” says Amitava Bhattacharjee, a professor of astrophysical sciences with the Princeton Program in Plasma Physics.

Simulating fusion, and accounting for millions of variables and particle interactions, requires an “exascale” system that can perform one quintillion calculations every second, roughly the combined processing power of the 500 fastest supercomputers now in existence.
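The gap between the two figures is easy to check. The short sketch below uses only the numbers quoted above (93 petaflops for TaihuLight and one quintillion calculations per second for exascale); the constant and function names are illustrative.

```python
# Figures quoted in the article; names are illustrative, not from any library.
EXAFLOP = 10**18                 # one quintillion calculations per second
TAIHULIGHT_PEAK = 93 * 10**15    # about 93 petaflops


def speedup_needed(target_flops: float, current_flops: float) -> float:
    """Return how many times faster the target system is than the current one."""
    return target_flops / current_flops


if __name__ == "__main__":
    # Prints roughly 10.8x, consistent with "about 10 times too slow".
    print(f"{speedup_needed(EXAFLOP, TAIHULIGHT_PEAK):.1f}x")
```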

This need for high-powered simulations is shared by many scientific disciplines, ranging from nanoscale particle physics up to cosmology.

“These problems all involve complex underlying equations,” Bhattacharjee says. “We need exascale simulations in order to be able to make predictions in these systems where there’s a lot going on.”

The U.S. Department of Energy expects an “exascale” computer like this could lead to dramatic advances in precision medicine, quantum physics, and biofuel development. In early June, the department announced that it had selected HPE to develop the architecture for exascale computing under the DOE’s Path Forward program. The program aims to deliver its first exascale supercomputer by 2021.

Eighteen Zeros

Reaching exascale means achieving a more than tenfold improvement in computing capacity, without increasing power consumption, in less than five years.

Building an exascale computer isn’t simply a matter of adding more processing cores to a system, the classic path to more speed. At a certain point, computers stop being able to move data from memory to processors, and from processor to processor, fast enough to take advantage of the raw power.

“In the halls we say ‘processing is free,’” says Mike Vildibill, vice president of advanced technologies at HPE. At exascale speeds, the cost of moving data exceeds the cost of computation, he says. “And in exascale computing, data movement is king.”

Vildibill is part of a team of HPE engineers and researchers working on the exascale computer. He and his colleagues are designing a system with as many as 100,000 processors, capable of performing one quintillion calculations per second.

To get there, they’ve scrutinized every bottleneck to data movement in today’s computing technology, identifying better ways to transmit data over novel optical interconnects, shorten the distances data travels, and route data more efficiently inside the system. They’re also figuring out how to make many of the underlying components necessary to achieve exascale.

One of the biggest challenges: moving data around a supercomputer uses a lot of power and generates enormous heat. An exascale system built with today’s technology would consume about 650 megawatts of power—or a little less than a small nuclear power plant generates.

“You’d practically need to build a dedicated power plant right next door just to operate the system,” Vildibill says. Even if that was practical, the heat from all that energy would cripple the system.

To make data go faster while cutting power usage, the team has re-thought nearly every component and system inside today’s supercomputers. Here are four innovations that will enable exascale computing.

Memory and processors together

Each processor in a computer requires a dedicated chunk of memory, called its working memory, to perform calculations. The communication bandwidth between the processor and its working memory has a big impact on overall system speed. Today’s systems generally house memory in components called DIMMs, or dual in-line memory modules, which connect to the processors via copper wiring in the circuit board. Exascale systems will need to move the vast majority of working memory far closer to the computing elements, in an organization called “co-packaged memory,” where the fastest memory tier is designed inside the physical package that also contains the processor chip.

CPU chips keep growing more powerful, but the capacity of that fastest memory tier hasn’t kept up. As a result, it is becoming increasingly difficult to keep the entire working set of data in working memory.

“In the next generation of semiconductors we’ll see processors with computing power of well over 10 teraflops,” says Paolo Faraboschi, an HPE fellow and vice president. “To fully utilize those processors, you need at least one-tenth of that amount in bandwidth between the chip and memory, or one terabyte per second. This is very challenging with traditional discrete memory technology.”
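Faraboschi’s rule of thumb can be written as a one-line calculation. The sketch below assumes that “one-tenth of that amount” means roughly 0.1 byte of memory traffic per floating-point operation; that reading, and the function and parameter names, are assumptions made for illustration.

```python
TERA = 10**12


def required_memory_bandwidth(peak_flops: float, bytes_per_flop: float = 0.1) -> float:
    """Return the memory bandwidth (bytes/s) needed to keep a processor busy.

    bytes_per_flop is an assumed balance ratio: 0.1 matches the
    "one-tenth" rule of thumb quoted in the article.
    """
    return peak_flops * bytes_per_flop


if __name__ == "__main__":
    bandwidth = required_memory_bandwidth(10 * TERA)  # a 10-teraflop processor
    print(f"{bandwidth / TERA:.1f} TB/s")
```

For a 10-teraflop processor this gives about one terabyte per second, the figure quoted above.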

The HPE approach is straightforward: Co-packaged components mean data travels only a short distance between processor and memory, rather than moving a longer distance between separated components. This silicon-connected memory replaces DIMMs and also eliminates the overhead of attaching routing information to the data to notify the system where that data is headed.

Because co-packaged memory will be smaller than discrete memory, more traffic will need to go outside of the processor-memory complex in order to reach the memory embedded in another computing node. Keeping the system’s performance in balance requires at least one-tenth of the processor’s memory bandwidth to be available via the “fabric” of connections used for node-to-node communication, which is more than 100 gigabytes per second.

Copper and traditional memory protocols are inadequate for exascale computing because they can’t achieve that transfer rate. They simply provide too little bandwidth and consume too much power. HPE plans to break this barrier using new optical interconnects and communication protocols based on a highly efficient technique called “memory semantics.”

Silicon traffic cops

Data moves between processors and other components as they coordinate their work on big calculations. Inside a large-scale supercomputer this movement is directed by fabric routers, chips that act like traffic cops for data. Without efficient routing, data can be sent on a circuitous path or collide with other data trying to use the same path. Routers are also responsible for tolerating network failures by dynamically adjusting paths to go around faults.

The more data, the busier the router becomes. And in supercomputing, which demands extreme bandwidth and nearly instant data movement, routers are rapidly becoming a bottleneck.

Data physically flows in and out of the router via its pins. Inside the router, the data arriving via these pins is received by ports, with a single port able to gather data from multiple pins.

The traditional way to increase throughput is to focus on increasing the bandwidth or amount of data each pin can accommodate. However, the real bottleneck is in the ports. With relatively few ports carrying all the data, there are not a lot of possible paths between two processors in the network.

The HPE team focused on increasing the number of ports inside the router, rather than the bandwidth of each pin. A device made with this approach is called a high-radix router, which allows more pathways for data, more direct logical connections between devices, and more adaptive routing.

“Basically, high-radix means fewer hops,” says Al Davis, a distinguished technologist at Labs. In turn, fewer hops mean data can arrive at its destination faster. The HPE design calls for no more than four hops for data to reach any of the processors from any other point in the exascale system, even if the number of processors reaches 100,000.
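A rough way to see why more ports mean fewer hops: if every hop can fan out to as many destinations as the router has ports, then the number of places data can reach grows as a power of the port count. The sketch below computes that idealized lower bound. It is not HPE’s design; real routers spend some ports on local processors and real topologies add constraints, so the function name and the model itself are simplifying assumptions.

```python
def min_hops_bound(endpoints: int, radix: int) -> int:
    """Return the smallest h such that radix**h >= endpoints.

    This is an idealized fan-out bound: it assumes every hop can reach
    `radix` new destinations, which real networks only approximate.
    """
    if endpoints < 1 or radix < 2:
        raise ValueError("need at least 1 endpoint and a radix of 2 or more")
    hops, reachable = 0, 1
    while reachable < endpoints:
        reachable *= radix
        hops += 1
    return hops


if __name__ == "__main__":
    for radix in (8, 16, 64):
        print(f"radix {radix}: at least {min_hops_bound(100_000, radix)} hops")
```

Under this model, 100,000 endpoints need at least six hops with 8-port routers, five with 16 ports, and three with 64 ports.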

The inner topology

Routers are part of a broader data transmission fabric inside the system. The pattern in which these connections are laid out is called a topology, and the topology determines how efficiently data inside a system reaches its destination.

The topology used by most computers is decades old and wasn’t designed for supercomputers with thousands of processors that need to move data at the speed of memory. The result is that in supercomputers, most components aren’t directly connected to one another. Data moving from memory in one area of a big system to a processor in another may have to “hop” multiple times, passing through several routers to reach the destination.

Every hop costs time and fills the queue of pending data movement requests in the computing nodes, which can become saturated, adding delays.

An exascale computer needs the most efficient interconnect topology possible so it can keep the number of hops to a minimum. Because high-radix routers have more ports, they can support a topology with direct connections between a higher number of processors.

Labs researchers have designed a topology called HyperX, which looks like a flattened butterfly. In this topology, processors are divided into groups called “dimensions.” Every processor in a dimension has a direct connection to all the others, and every dimension has a direct connection to every other dimension. This approach contributes to efficient, collision-free routing and a minimum of intermediate stops.
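One simplified way to picture such a topology is as a multi-dimensional grid in which every router is wired directly to every other router that shares all but one of its coordinates. The sketch below models that idea; the grid shape, function names, and coordinate scheme are illustrative assumptions, not details of HPE’s actual design.

```python
from itertools import product
from math import prod


def all_routers(shape):
    """List every router as a coordinate tuple on a grid of the given shape."""
    return list(product(*(range(size) for size in shape)))


def neighbors(router, shape):
    """Return routers that differ from `router` in exactly one coordinate."""
    result = []
    for dim, size in enumerate(shape):
        for value in range(size):
            if value != router[dim]:
                result.append(router[:dim] + (value,) + router[dim + 1:])
    return result


def min_hops(a, b):
    """Minimum hops between two routers: one per coordinate that differs."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    shape = (4, 4, 4)  # hypothetical small example: 64 routers
    print(len(all_routers(shape)), "routers")
    print(len(neighbors((0, 0, 0), shape)), "direct links per router")
    print(min_hops((0, 1, 2), (3, 1, 0)), "hops between two sample routers")
    # A hypothetical four-dimensional grid with 18 routers per dimension
    # holds more than 100,000 routers, none more than four hops apart.
    print(prod((18,) * 4), "routers in an 18x18x18x18 grid")
```

In this model the worst-case hop count equals the number of dimensions, which is why a grid with few dimensions and many routers per dimension keeps paths short, provided the routers have enough ports.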

Optical components and fabric

Copper has served faithfully as the conductor of data—transmitted as an electrical signal—since the 1940s. At exascale computing speeds, though, copper’s shortcomings move to the forefront. The more data, the more electricity required to move it.

Not only does moving data use a lot of power, but about 50 percent of the electricity used by a computer today becomes that unfortunate byproduct, heat. And that’s been a limiting factor for supercomputing. Even if a system could add infinite processors, Davis says, “We can’t cool infinite power.”

Optical transmission, sending data by light through fiber optic cables, has the potential to move larger data volumes while using much less power. But even with an optical signal, conventional methods fail at exascale.

Most optical transmission in use today relies on VCSELs, or vertical-cavity surface-emitting lasers, to generate the light signal. VCSELs (which researchers pronounce as a word, like “vixels”) are popular because they are relatively inexpensive to make. To understand how a VCSEL works, Faraboschi says, imagine holding a flashlight and pushing the on-off button 25 billion or 50 billion times per second.

In earlier times, there were two pillars of scientific discovery: experiment and theory. Now computer simulation is becoming the third pillar.

In current optical architectures, each transmission cable needs its own VCSEL, and the system needs a huge number of cables to achieve a resilient optical fabric of interconnected components.

The U.S. Department of Energy’s goal is to make an entire exascale system cost less than $300 million. “For exascale, just the cables alone would cost you a billion dollars using today’s technology,” says Faraboschi. “Clearly we cannot use the same approach,” he adds.

A better approach at this scale is to use a different type of optical interconnect, with optical modulators deeply integrated with the other silicon components, and the cost of expensive components, such as lasers, amortized across several optical channels. Labs researchers are building photonic components entirely made of silicon, which can then modulate a laser’s signal using the on-off technique to transmit data down each optical fiber.

Integrating several other functions into a single photonic component—rather than having to connect several different components with cables or wires—is more efficient. And the high-speed optical fabric means processors can efficiently access a very large pool of memory using lightweight memory access semantics, allowing the system to work on larger datasets and solve bigger problems—an approach HPE calls Memory-Driven Computing.

Wait, there’s more

Exascale computing, and data movement in particular, presents plenty of other challenges. For example, sometimes the most efficient solution is to not move data at all. Instead of copying a big block of data from one area of memory to be closer to a distant processor, an exascale system might sometimes just want to tell that processor where the data is located—giving it the right address—and let the processor order up a specific calculation based on that data without moving the data itself.

In order to do this, HPE is working with a consortium of partners to create Gen-Z, a universal memory-semantic protocol that unifies communication and data access across multiple data storage and transmission devices.

With so many components in an exascale system, reliability is a potential headache. “Even if the parts have a high average time between failures, it’s just math,” says Cong Xu, a research scientist at Labs. “Multiply that time by tens or hundreds of thousands of components, and something is going to break every couple of hours.”
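The arithmetic behind that estimate is simple if failures are assumed to be independent and to occur at a constant rate: the failure rates of the parts add up, so the expected time between failures anywhere in the system shrinks in proportion to the part count. The sketch below illustrates this with hypothetical numbers; neither the 200,000-hour figure nor the function name comes from HPE.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """Expected time between failures anywhere in the system.

    Assumes identical components that fail independently at a constant
    rate, so the individual failure rates simply add up.
    """
    if component_count < 1:
        raise ValueError("component_count must be at least 1")
    return component_mtbf_hours / component_count


if __name__ == "__main__":
    # Hypothetical part that fails once every 200,000 hours (about 23 years).
    print(system_mtbf_hours(200_000, 100_000), "hours between failures")
```

With 100,000 such parts, something fails about every two hours.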

To address this problem, Labs has designed a huge pool of energy-efficient nonvolatile memory that can hold frequent snapshots of work in progress across the whole system—a process known as ‘checkpointing.’ If a processor fails, the system can quickly restore the last snapshot and proceed, rather than having to restart a complex computation from the beginning. For this purpose, HPE is taking advantage of many of the technologies that were prototyped in The Machine, a new system based on the Memory-Driven Computing concept.

The Labs approach executes these backup-and-restore operations in a matter of seconds, where today’s technology would require “tens of minutes,” says Xu.
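The speed of a checkpoint matters because it determines how much of the machine’s time goes to snapshots and redone work instead of science. A classic first-order estimate, Young’s approximation, picks the interval between checkpoints as the square root of twice the checkpoint cost times the time between failures. The sketch below applies it with hypothetical inputs: 20 minutes standing in for “tens of minutes,” 10 seconds for “a matter of seconds,” and a failure every two hours. It is a textbook approximation, not a description of HPE’s method, and it is only rough when the checkpoint cost is a large fraction of the time between failures.

```python
from math import sqrt


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the best time between checkpoints."""
    return sqrt(2 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate share of run time lost to checkpoints plus redone work."""
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    # First term: time spent writing checkpoints.
    # Second term: work lost, on average half an interval per failure.
    return checkpoint_cost_s / interval + interval / (2 * mtbf_s)


if __name__ == "__main__":
    mtbf = 2 * 3600  # hypothetical: one failure every two hours
    for label, cost in (("20-minute checkpoint", 20 * 60), ("10-second checkpoint", 10)):
        print(f"{label}: about {overhead_fraction(cost, mtbf):.0%} overhead")
```

Under these assumptions the slow checkpoint wastes more than half of the machine’s time, while the fast one costs about 5 percent.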

Taken together, these technologies promise to help exascale computing simulate and predict massively complicated systems—not only nuclear fusion but the ocean, the weather, the cosmos—at an unprecedented scale.

“Simulation is the bridge between what we know today and what we can predict in the future. And simulation is what exascale computing is all about.”