Intel and Micron showed results of their Hybrid Memory Cube [HMC] collaboration last week at IDF 2011. This was the first public showing of the jointly developed high-performance memory architecture, a purpose-designed memory solution for their exascale computing effort.

Intel-Micron Hybrid Memory Cube in real world

Intel has set 2018 as the year in which it plans to build a supercomputer capable of an exaflop or better execution rate. This is approximately two orders of magnitude faster than the fastest machines extant as of June 2011. The proposed machine is targeted to consume less than 20 megawatts of power, a reduction of 300 times over today’s supercomputers, according to Justin Rattner, Intel’s Chief Technology Officer.
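
As a rough sanity check on these targets, the sketch below converts the 20-megawatt ceiling and the exaflop rate into an energy budget per operation. The variable names are illustrative, and the calculation assumes the machine sustains exactly 10^18 operations per second while drawing the full 20 MW.

```python
# Back-of-envelope energy budget for the exaflop target (illustrative names).
EXAFLOP_RATE = 1e18      # floating-point operations per second
POWER_CEILING_W = 20e6   # 20 megawatts, the stated power target

# Energy available per operation, covering arithmetic, data movement and overhead.
joules_per_flop = POWER_CEILING_W / EXAFLOP_RATE
picojoules_per_flop = joules_per_flop * 1e12

# The same figure expressed as efficiency in gigaflops per watt.
gflops_per_watt = EXAFLOP_RATE / POWER_CEILING_W / 1e9

print(f"Budget: {picojoules_per_flop:.0f} pJ per flop")
print(f"Efficiency: {gflops_per_watt:.0f} GFLOPS per watt")
```

Under these assumptions the whole machine has about 20 pJ to spend on each floating-point operation, everything included.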

Background
The HMC development is a result of findings from a DARPA-funded Exaflop Feasibility Study Group started in 2007 and headed by Peter Kogge. DARPA asked the question, "What sort of technologies would engineers need by 2015 to build a supercomputer capable of executing a quintillion (10^18 = exa) floating point operations per second (exa + flop)?"

The study found that an exaflop machine would consume 1.5 gigawatts of power, or more than 0.1 percent of the U.S. power grid total. The joke about needing a brand new nuclear power plant to power the facility was suddenly no longer funny. In fact, the exaflop machine might not be possible at all, given extrapolations based on the state of the art at the time.

The study group decided to use the amount of energy required to perform one floating-point operation [FLOP] as a base metric for extrapolating total power requirements. At the time of the study this was around 70 picojoules [a picojoule is one millionth of one millionth of a joule (a joule of energy can run a 1-watt load for one second)]. The good news is that the engineers felt they could get it down to 5 to 10 pJ. The bad news is that the energy needed to move data from its source to the execution unit and then back to the result destination ended up in the range of 1000 to 10,000 pJ per flop.
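
To see why the data-movement figure is the alarming one, the per-operation energies above can be scaled to a full exaflop. The following is a minimal sketch; the function name and labels are illustrative, and it assumes every operation pays the quoted energy cost at a sustained 10^18 operations per second.

```python
# Power drawn at one exaflop for a given energy cost per operation.
EXAFLOP_RATE = 1e18  # operations per second

def power_watts(picojoules_per_flop: float) -> float:
    """Sustained power in watts at one exaflop per second."""
    return picojoules_per_flop * 1e-12 * EXAFLOP_RATE

# Energy figures quoted in the text, in picojoules per flop.
cases = {
    "arithmetic at time of study": 70,
    "arithmetic, projected low": 5,
    "arithmetic, projected high": 10,
    "data movement, low": 1_000,
    "data movement, high": 10_000,
}

for label, pj in cases.items():
    print(f"{label} ({pj:,} pJ): {power_watts(pj) / 1e6:,.0f} MW")
```

Arithmetic at 5 to 10 pJ would draw 5 to 10 MW, while data movement at 1,000 to 10,000 pJ per flop would draw 1 to 10 gigawatts, which is why the memory path became the focus.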

To get a handle on the best way to minimize power consumption, they decided to generate a detailed design of a hypothetical fundamental building block of the future supercomputer. The design was based on silicon-substrate microprocessors running at 0.5 V to reduce power (most processors are internally regulated at around 1 V today).

Bill Dally (then at Stanford University, now Chief Scientist at NVIDIA Corporation), working largely on his own, generated an outline of such a design on paper. His module included 742 separate microprocessor cores running at 1.5 GHz. Each core supported four floating-point units and a small amount of nearby cache memory for fast access. Pairs of such cores share a somewhat slower second-level cache, and all such pairs can access each other’s second-level (and even third-level) memory caches. In Dally’s design, each processor connects directly to 16 dynamic RAM chips. Each processor device also has ports for connections to up to 12 separate routers for fast off-chip data transfers.
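
The figures in this outline are enough for a rough estimate of what one such module could deliver. The sketch below assumes each floating-point unit completes one operation per clock cycle, which the description does not state, so the result is an illustrative peak rather than a number from the study.

```python
# Peak throughput of the module described above (illustrative estimate).
CORES = 742
FPUS_PER_CORE = 4
CLOCK_HZ = 1.5e9
FLOPS_PER_FPU_PER_CYCLE = 1  # assumption, not stated in the article

module_flops = CORES * FPUS_PER_CORE * CLOCK_HZ * FLOPS_PER_FPU_PER_CYCLE

# Number of such modules needed to reach 10^18 operations per second.
modules_for_exaflop = 1e18 / module_flops

print(f"Peak per module: {module_flops / 1e12:.2f} TFLOPS")
print(f"Modules for one exaflop: {modules_for_exaflop:,.0f}")
```

On that assumption each module peaks at roughly 4.45 teraflops, so an exaflop machine would need on the order of 225,000 of them.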

Hybrid Memory Cube (HMC)
The HMC, as shown, consists of four specially designed DRAMs (1 gigabit each) stacked on top of a logic interface device.

Hybrid Memory Cube high-level overview

Through-Silicon Vias (TSVs) connect through the DRAM to the bottom logic interface. Note that the TSVs allow a high degree of parallel interconnect between the DRAMs and the logic die. The interesting bit is that the logic die can be manufactured in a different process from the DRAM dies, which can significantly reduce the cost of manufacture: for instance, combining a 32nm or 22nm logic die with future memories manufactured in 16, 14, 10 and sub-10nm nodes.

Hybrid Memory Cube Architecture by Intel and Micron
 
Although not identical to Dally’s theoretical memory architecture, the HMC is too close to have been accidental. Of interest is Micron’s custom-designed DRAM with lower voltage (?) and smaller page size. The absence of external drive requirements reduces the on-chip I/O drivers to dimensions not much larger than other on-chip logic transistors, allowing for denser I/O buffering.

It also aids the layout in that the I/O channels can be located nearer the memory array – naturally implying a higher bank count per die. Micron blurred the micrograph to avoid reverse engineering by competitors.

What Through-Silicon-Via (TSV) memory looks like in reality

Micron demonstrated its ability to stack DRAMs using Through-Silicon-Via (TSV) technology. The long-term reliability and manufacturing cost of TSV technology are still the topic of an ongoing heated debate that only long-term reliability studies will put to rest.

This is a core manufacturing element on the path to HMC technology, and it will ultimately prove to be the single most important gating item in determining whether the overarching technology succeeds.

Performance Results
The following slide was the standard "comparison guide" showing the HMC’s performance against other generic commercial memory solutions. 

Comparing the 1Tb/s Hybrid Memory Cube Prototype to other DRAM modules

Against a DDR3-1333 4GB ECC module, the HMC showed 12 times the transfer performance, with the energy per bit transferred (pJ/bit) better by a factor of 7.8. Note that all solutions are ECC-grade memory modules.
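
The 12-times figure can be cross-checked against the "1Tb/s" label on the prototype. The sketch below assumes the comparison uses the theoretical peak of a standard DDR3-1333 module (about 1333.33 million transfers per second over a 64-bit data bus, ECC bits excluded); the slide itself may have used a measured value, so treat this as an approximation.

```python
# Cross-check of the 12x claim against the "1Tb/s" prototype label.
TRANSFERS_PER_SEC = 1333.33e6  # DDR3-1333 nominal transfer rate
DATA_BUS_BITS = 64             # data width of a standard DIMM, ECC bits excluded
HMC_SPEEDUP = 12               # factor quoted in the article

# Theoretical peak bandwidth of the DDR3 module.
ddr3_bits_per_sec = TRANSFERS_PER_SEC * DATA_BUS_BITS
ddr3_gbytes_per_sec = ddr3_bits_per_sec / 8 / 1e9

# The same figure scaled by the quoted speedup.
hmc_bits_per_sec = ddr3_bits_per_sec * HMC_SPEEDUP
hmc_gbytes_per_sec = ddr3_gbytes_per_sec * HMC_SPEEDUP

print(f"DDR3-1333 peak: {ddr3_gbytes_per_sec:.2f} GB/s")
print(f"12x that: {hmc_gbytes_per_sec:.0f} GB/s = {hmc_bits_per_sec / 1e12:.2f} Tb/s")
```

Twelve times the DDR3-1333 peak comes to about 128 GB/s, or roughly 1 Tb/s, which is consistent with the prototype's label.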

Memory bandwidth tested on a real-world benchmark


So, for the first time in a very, very long time Samsung does not own the first and fastest DRAM in the industry.
  
Epilogue Summary
Justin Rattner’s keynote announcement that Intel is actively developing exascale computing technology is big news. The company is committed to a strategy to become a dominant player, or at the very least a lead player, in this high-end, high-margin computing segment.

Prototype PCB is currently in testing

Many loose ends remain.

Top of the list: who owns the HMC? Micron manufactured the DRAM and Intel manufactured the logic die. It’s not clear how the IP ownership rights are split. The last time Intel dipped its toe in the memory pool, the industry ended up with the seemingly endless Rambus litigation (which is still going on).

An undertone is that DARPA is keeping this strictly an American development, which would explain why specific details are not available, micrographs are blurred and direct questions concerning the technology go unanswered.

Intel and NVIDIA settled their disputes earlier this year, allowing cross use of patents between the two companies. NVIDIA has its own ideas on supercomputer architecture, and it’s whispered that Micron is also working with them. NVIDIA delayed its annual Techcon till May next year. Guess we’ll just have to wait till then to find out what brand N has planned for its version.