The New Architecture - Building upon G80/Tesla and GF100/Fermi Legacy
First and foremost, the Kepler architecture represents the biggest shift in company philosophy since NV40/NV47/RSX was replaced with the original G80/Tesla architecture. Given the principal architect, we're not surprised by this development.
John Danskin serves as NVIDIA's Vice President of GPU Architecture, but in all honesty, his influence spans much farther than Tesla and Kepler. If you are a 3D aficionado, you'll remember the legendary 3dfx, where John served as the Chief GPU Architect. Sadly, 3dfx folded before the Rampage/Sage GPU architecture came to life, but we believe the G80 more than compensated for that. Our sources at Intel and Qualcomm both confirmed that those companies courted John to switch boats and join the Larrabee and Adreno projects, but now you can read why he decided to stay with the green team.
Soon after the development of G80/Tesla ended, work started on the Kepler architecture. It took several years of prep work and two years of very hard work to bring Kepler to life, with John in charge of anywhere between 50 and 100 architects who serve as the Core Architectural Team and coordinate with more than 1000 engineers in the various stages of creation, resulting in multiple ASICs that are now coming out of TSMC.
World, meet Kepler in its performance shape: the GK104 chip represents the current top of what NVIDIA has to offer
Keyword for Kepler: Efficiency
When the target for the architecture was set, the main buzzword was efficiency. NVIDIA employees were quite unhappy with the compromises they had to make in the later Tesla and Fermi architectures, where the company was cramming in more and more compute features - consuming more power and sacrificing performance.
At the same time, AMD went back to the drawing board after the R600 debacle and released a series of smaller, compact dies that performed exceptionally given their size, catching NVIDIA's parts off-guard in terms of the "three P's": Price, Performance, Power (consumption). However, with some of the key AMD figures gone, time will tell what is happening in the Red Team camp with their future ASICs.
Cornerstone of Kepler architecture: Meet SMX, the base for all Kepler GPUs
As you can see in the diagram above, NVIDIA went back to the drawing board and completely changed the way the chip operates. The company has now adopted AMD's mantra of creating a performance chip first, and then releasing lower and higher-end parts based on the same or different silicon.
While the Fermi architecture had 32 cores in a single SM unit (Streaming Multiprocessor) in the initial GF100 silicon, and 48 cores in the refreshed Fermi (i.e. GF114), Kepler's SMX cluster (Streaming Multiprocessor eXtreme) takes 192 Fermi cores and builds completely new logic around them. While NVIDIA used similar diagrams for Fermi and Kepler, we've been told that "the logic is completely different".
Eight SMXes make for the reference Kepler GPU: the GK104
The GK104 chip is clear evidence of just how different Kepler is. While Fermi (GF100/110) had 512 cores, Kepler GK104 packs eight SMX units for a grand total of 1536 cores. Those eight clusters are divided into four GPC units (Graphics Processing Cluster), but this is where the similarities with Fermi end.
There is also a difference in cache memory. While Fermi had 1MB of L1 and 768KB of L2 memory, Kepler has 512KB of shared L1 cache memory and 512KB of L2 memory. The 512/768 KB is logical, due to ties with the memory controller, while each SMX has the same amount of cache as a Fermi SM. With half as many SM units, the L1 cache was cut from 1024 to 512KB. However, Kepler increased the texture and instruction caches. The chip which features all of the parts above takes 3.54 billion transistors packed into just 294 mm².
The second GPU NVIDIA is launching today is known under the codename GK107. Consisting of a single GPC with two SMX units, the GK107 packs 384 cores, 128KB of L1 and 128KB of L2 cache memory. The memory controller is 128 bits wide and supports GDDR3 and GDDR5 memory.
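The core and cache totals quoted for the two chips follow directly from the per-SMX figures. The short Python sketch below reproduces that arithmetic using only numbers stated in the article; the function names are ours, chosen for illustration, and the Fermi SM count of 16 is inferred from 512 cores at 32 cores per SM rather than stated outright.

```python
# Per-cluster core counts quoted in the article.
CORES_PER_SMX = 192      # Kepler SMX
CORES_PER_FERMI_SM = 32  # GF100 SM


def total_cores(cluster_count: int, cores_per_cluster: int = CORES_PER_SMX) -> int:
    """Total CUDA cores for a chip built from identical clusters."""
    return cluster_count * cores_per_cluster


def l1_per_cluster_kb(total_l1_kb: int, cluster_count: int) -> float:
    """L1 cache available to each cluster, in KB."""
    return total_l1_kb / cluster_count


# (cluster count, cores per cluster, total L1 in KB), as given in the article.
# The 16 SMs for Fermi are inferred: 512 cores / 32 cores per SM.
chips = {
    "GF100/110": (16, CORES_PER_FERMI_SM, 1024),
    "GK104": (8, CORES_PER_SMX, 512),
    "GK107": (2, CORES_PER_SMX, 128),
}

for name, (clusters, per_cluster, l1_kb) in chips.items():
    cores = total_cores(clusters, per_cluster)
    l1_each = l1_per_cluster_kb(l1_kb, clusters)
    print(f"{name}: {cores} cores, {l1_each:.0f} KB L1 per cluster")
```

Under these figures every cluster, Fermi SM or Kepler SMX, ends up with the same slice of L1, which is the point the cache paragraph makes.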
In order to create Kepler, GPU architects took CUDA cores from the Fermi architecture and completely reworked how those cores operate.
Fermi vs. Kepler in raw numbers - notice the ratios and the most important part; IPC efficiency doubled on Kepler
By looking at raw numbers, Kepler features higher efficiency than Fermi, but some parts remain the same, such as the LD/ST (Load/Store) units inside each SMX unit. This results in 64-bit operations being executed much in the same way as on Fermi - at least on the consumer side of things. The most important bit is the number of instructions per clock: while Fermi (GF100/110) executed up to 1024 instructions in a single clock, i.e. 1.58 trillion instructions per second, Kepler (GK104) executes 2048 instructions per clock, i.e. 2.06 trillion instructions per second. If there was any doubt why GeForce GTX 670 Ti became GeForce GTX 680, this is it.
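The per-second throughput figures are simply instructions per clock multiplied by clock frequency. The sketch below shows that calculation in Python. The clock values are our assumptions, not taken from the article: 1544 MHz for the GeForce GTX 580 shader clock and 1006 MHz for the GeForce GTX 680 base clock. They were chosen because they reproduce the 1.58 and 2.06 figures quoted above.

```python
def instructions_per_second(per_clock: int, clock_hz: int) -> int:
    """Peak instruction throughput: instructions per clock times clock rate."""
    return per_clock * clock_hz


# Instructions per clock come from the article; clocks are assumptions
# (GTX 580 shader clock and GTX 680 base clock), not stated in the text.
FERMI = {"per_clock": 1024, "clock_hz": 1_544_000_000}
KEPLER = {"per_clock": 2048, "clock_hz": 1_006_000_000}

fermi_ips = instructions_per_second(**FERMI)
kepler_ips = instructions_per_second(**KEPLER)

print(f"Fermi:  {fermi_ips / 1e12:.2f} x 10^12 instructions/s")
print(f"Kepler: {kepler_ips / 1e12:.2f} x 10^12 instructions/s")
print(f"Per-clock ratio:  {KEPLER['per_clock'] / FERMI['per_clock']:.2f}x")
print(f"Per-second ratio: {kepler_ips / fermi_ips:.2f}x")
```

Note that the per-clock figure doubles while the per-second gain is smaller, because under these assumed clocks Kepler's cores run at a lower frequency than Fermi's shader domain.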
GPU Computing & Double Precision: Yes, the GeForce Kepler is Castrated
Not all the news is great, though. NVIDIA kept its policy of castrating double precision performance for the consumer parts, meaning the GeForce GTX 680 should not have better results than the GTX 580, which was getting hammered by AMD silicon for the past few generations. As you can read in our in-depth, 11-page review of the GeForce GTX 680, the results of SiSoft Sandra 2012 reveal that the GeForce GTX 680 gets handily beaten by the Radeon HD 7970 - which is not Double Precision castrated.
However, the Quadro and Tesla professional cards will paint a different picture. Company representatives said they cannot go into GPGPU, i.e. GPU computing, performance until the GPU Technology Conference in May, and if rumors hold true, we expect to see NVIDIA debuting a "three fourths" double precision rate. The chip itself reaches 3.09 TFLOPS in single precision (SP), but castrated GeForce parts should not expect more than one eighth of that in double precision. An unlocked Kepler should have between 1.54 and 2.32 TFLOPS of DP performance, reaching the internal target of having higher DP than Kepler SP performance. And that is something that makes GPU Computing a proposition of the future - for 1.54 DP TFLOPS, you'd have to buy no fewer than 23-26 Sandy Bridge-E processors (23x Core i7-3960X or 26x Xeon E5-2687W). The difference in price is even larger - we expect to see Quadro and Tesla parts for about $2000-2500, while 23 i7-3960X processors will set you back $24,149.77, and 26 E5-2687W are a "cheap" $49,399.74. Yes, 50,000 dollars.
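For readers who want to check the double precision and price arithmetic, here is a rough Python sketch built on the figures quoted above. The one-half rate is our assumption: it is the ratio that yields the lower 1.54 TFLOPS bound and is not named in the article. The helper names are illustrative only.

```python
# Figures quoted in the article.
SP_TFLOPS = 3.09                 # GK104 single precision peak
GPU_PRICE_USD = (2000.0, 2500.0)  # expected Quadro/Tesla price range
CPU_OPTIONS = {
    # name: (number of CPUs, total price in USD)
    "Core i7-3960X": (23, 24_149.77),
    "Xeon E5-2687W": (26, 49_399.74),
}

# DP:SP ratios. One eighth and three fourths come from the article;
# one half is an assumption that matches the quoted 1.54 TFLOPS bound.
DP_RATES = {
    "GeForce cap (1/8)": 1 / 8,
    "half rate (assumed)": 1 / 2,
    "rumored three fourths": 3 / 4,
}


def dp_tflops(sp_tflops: float, dp_rate: float) -> float:
    """Double precision peak given the SP peak and a DP:SP ratio."""
    return sp_tflops * dp_rate


def unit_price(total_usd: float, count: int) -> float:
    """Price of a single CPU implied by a total and a count."""
    return total_usd / count


for label, rate in DP_RATES.items():
    print(f"{label}: {dp_tflops(SP_TFLOPS, rate):.3f} TFLOPS DP")

gpu_low, gpu_high = GPU_PRICE_USD
for cpu, (count, total) in CPU_OPTIONS.items():
    each = unit_price(total, count)
    # Cost multiple versus the most and least expensive expected GPU price.
    print(f"{count}x {cpu}: ${each:,.2f} each, "
          f"{total / gpu_high:.1f}x to {total / gpu_low:.1f}x the GPU price")
```

These are peak theoretical numbers only; they say nothing about sustained performance on real workloads.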
© 2009 - 2013 Bright Side Of News*, All rights reserved.