Project Echelon: The Future of NVIDIA's Silicon
11/19/2010 by: Gil Russell
Nvidia’s chief scientist, William Dally, gave attendees a sneak peek at the company’s exascale efforts at Supercomputing 2010 yesterday. Nvidia is working in concert with the U.S. Department of Defense on an exascale computer called "Echelon," scheduled to debut in 2018.
Three other teams, led by Intel, MIT, and Sandia National Labs, are competing with Nvidia in the "Ubiquitous High Performance Computing" program sponsored by DARPA. The base requirement calls for a prototype petaflop-class system in a 57-kilowatt rack by 2014; follow-on work will use refinements of the prototype as part of the exascale system targeted at 2018.
Dally, who is heading the Echelon project, explained: "Our focus at Nvidia is on performance per watt, and we are starting to reuse designs across the spectrum from Tegra to Tesla chips." He said the Nvidia team plans to place 256 MB of SRAM on the device and has made advances in the way SRAM caches can be configured that significantly shorten data latency to the execution units.
Because the caches are dynamically configurable to support application-specific execution, they yield further throughput and power efficiencies. Dally revealed that the Nvidia team has already lowered the energy required per floating-point operation from 200 picojoules on its Fermi devices to 10 picojoules, a factor-of-20 reduction.
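Those per-operation figures make the stakes concrete: at exascale, arithmetic energy alone sets a hard floor on machine power. A back-of-the-envelope check (a sketch that counts only floating-point energy, ignoring data movement, memory, and cooling) shows why the 20x reduction matters:

```python
# Back-of-the-envelope: power needed just for floating-point arithmetic
# at exascale (10**18 flops/s), ignoring memory, interconnect, and cooling.
EXAFLOP = 1e18            # flops per second
FERMI_PJ = 200e-12        # joules per flop (Fermi, per Dally)
ECHELON_PJ = 10e-12       # joules per flop (Echelon's demonstrated figure)

fermi_watts = EXAFLOP * FERMI_PJ      # ~200 MW
echelon_watts = EXAFLOP * ECHELON_PJ  # ~10 MW

print(f"Fermi-class:   {fermi_watts / 1e6:.0f} MW")
print(f"Echelon-class: {echelon_watts / 1e6:.0f} MW")
```

At Fermi's 200 pJ/flop, an exaflop machine would burn around 200 megawatts on arithmetic alone; at 10 pJ/flop that floor drops to roughly 10 megawatts, within reach of a practical facility.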
Dally is convinced that the overriding parameter is the energy involved in moving data, and that data must be kept physically close to the execution units in order to realize power efficiencies:
[Figure: William Dally's joules-per-bit calculation, showing the energy needed to move bits across silicon]
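Dally's argument can be illustrated with a toy model: on-chip wire energy grows roughly linearly with distance, so the cost of fetching an operand depends on how far the bits travel. The per-bit-per-millimeter figure below is an illustrative placeholder, not a number from the talk:

```python
# Toy model of the joules/bit argument: the energy to move a 64-bit
# operand grows with wire distance, so data must live near the ALU.
PJ_PER_BIT_PER_MM = 0.1   # assumed wire energy (pJ/bit/mm), illustrative only
WORD_BITS = 64

def move_energy_pj(distance_mm: float) -> float:
    """Picojoules to move one 64-bit word a given distance on-chip."""
    return WORD_BITS * PJ_PER_BIT_PER_MM * distance_mm

print(round(move_energy_pj(0.1), 2))  # fetch from adjacent SRAM (~0.1 mm)
print(round(move_energy_pj(10), 2))   # fetch from across a ~10 mm die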
Currently Echelon is only a paper design supported by simulations. The envisioned architecture consists of 64-bit floating-point cores, each capable of four double-precision floating-point operations per clock cycle; eight of these cores make up a streaming multiprocessor unit [SMU], and 128 SMUs complete a chip. The resulting 1,024-core graphics processor is envisioned to deliver the equivalent of 10 teraflops per chip.
The Echelon chip would have twice as many cores as today’s high-end GPUs, and those existing cores can execute only one 64-bit floating-point operation per clock cycle, compared to four for Echelon’s.
Dally envisions handsets powered by just eight of the cores [one SMU unit], which works out to around 78 gigaflops double precision.
The challenge of programming a 1,024-core chip is one of the more daunting tasks faced by today’s computer architects. "We are about to see a sea change in the programming models," Dally said. "In high performance computing we went from vectorized Fortran to MPI, and now we need a new programming model for the next decade or so."
"We think it should be an evolution of CUDA," said Dally. "But there are CUDA like approaches such as OpenCL, OpenMP and DirectCompute or a whole new language," according to Dally. One thing is evident; the success of a 1024 cores will rest on the availability of complementary software support.
"If you can do a really good job computing at one scale you can do it at another," said Dally, further "Our focus at Nvidia is on performance per watt [across all products], and we are starting to reuse designs across the spectrum from Tegra to Tesla chips." - Important items when you consider that Nvidia’s core technologies have to address a wide range of application requirements…
© 2009 - 2011 Bright Side Of News*, All rights reserved.