The wait is over. For the first time since January 2010 (or November 2009, depending on how you count), NVIDIA is introducing a new GPU architecture, codenamed Kepler. The chips are identified as the GK1xx family (GK104 and GK107 are being launched today), and in this article we'll take a deep dive through an architecture which a couple of legendary developers said reminds them of the G80.
No Kepler in 2011 – What Went Wrong?
First and foremost, we have to start with the reason why there was no Kepler architecture in 2011, as NVIDIA had announced at the GPU Technology Conference 2010. As you can see in the picture below, the public roadmap clearly showed Kepler arriving in 2011 and Maxwell following in 2013.
On this image, we do not agree with NVIDIA's 2007 timing for the Tesla architecture launch. Tesla, or G80 as it is better known, was launched and made available in November 2006. It is true that NVIDIA launched the Tesla GPGPU part in 2007, so it all depends on how you count (silicon availability or public announcement). The Tesla architecture was used for the G80, G9x and GT200 generations of products, spanning five years (if we count mobile parts).
Then, NVIDIA presented the Fermi architecture at the inaugural GPU Technology Conference with a prototype board, revealing architectural details at the supercomputing conference late in the fourth quarter of 2009. However, actual GeForce availability did not arrive until the end of the first quarter of 2010, with Quadro and Tesla professional products following suit in the second and third quarters of 2010.
NVIDIA faced issues with the troubled birth of GF100, i.e. Fermi made a dent in the schedule, and engineers were then consumed by work on a Fermi refresh for 2011. The refresh was known as GF11x, and all of that bought time for NVIDIA to get Kepler ready for launch in the fourth quarter of 2011. Bear in mind that the company had silicon running in the office for the better part of the second half of 2011, but spent that time optimizing the silicon to push power consumption as low as possible.
The new company mantra is Performance/Watt (finally), and in this article you'll see the amount of changes NVIDIA made in order to make that mantra stick. Unfortunately, the availability of the 28nm process just wasn't there, and while some of the leaked chips we saw were manufactured in the 39th, 41st and 44th weeks of 2011, the parts simply weren't ready for prime time.
All those delays, or we should simply say the "when it's done" approach, caused numerous stories online (we don't see Apple being crucified for its delays, while AMD, Intel and NVIDIA are constantly under the magnifying glass). In our on and off the record conversations with company executives, engineers and PR staff, it was interesting to see the company attitude changing, especially after AMD introduced Southern Islands, their first true GPGPU architecture.
March 22, 2012 – the arrival of Kepler
There is a pretty good reason why NVIDIA delayed the launch of hard silicon for as long as possible and today’s articles will explain why. There are two parts available today, the GK104 (GeForce GTX 680) and GK107 (GeForce GT 640M LE, GT 640M, GT 650M, GTX 660M). In our talks with NVIDIA insiders, we were told that there are two more ASICs coming in the next couple of months, or should we say weeks.
The interesting bit is that we were told that next generation is on track, even though we doubt we are going to see Maxwell before the end of 2013. Reasons for that have nothing to do with NVIDIA, but rather with its manufacturing partners and the process node selected for it.
Without further ado, let's dig into this:
The second name on this slide bears a lot of importance, so hit the next page
The New Architecture – Building upon G80/Tesla and GF100/Fermi Legacy
First and foremost, the Kepler architecture represents the biggest shift in company philosophy since NV40/NV47/RSX was replaced with the original G80/Tesla architecture. Given the principal architect, we’re not surprised by this development.
John Danskin serves as NVIDIA's Vice President of GPU Architecture, but in all honesty his influence spans much farther than Tesla and Kepler. If you are a 3D aficionado, you'll remember the legendary 3dfx, where John served as the Chief GPU Architect. Sadly, 3dfx folded before the Rampage/Sage GPU architecture came to life, but we believe the G80 more than compensated for that. Our sources at Intel and Qualcomm both confirmed that those companies courted John to jump ship and join the Larrabee and Adreno projects, but now you can read why he decided to stay with the green team.
Soon after the development of G80/Tesla ended, work started on the Kepler architecture. It took several years of prep work and two years of very hard work to bring Kepler to life, with John in charge of between 50 and 100 architects that serve as the Core Architectural Team and coordinate with more than 1,000 engineers in the various stages of creation, resulting in the multiple ASICs that are now coming out of TSMC.
World, meet Kepler in its performance shape: the GK104 chip represents the current top of what NVIDIA has to offer
Keyword for Kepler: Efficiency
When the target for the architecture was set, the main buzzword was efficiency. NVIDIA employees were quite unhappy with the compromises they had to make in the later Tesla and Fermi architectures, where the company was cramming in more and more compute features, consuming more power and sacrificing performance.
At the same time, AMD went back to the drawing board after the R600 debacle and released a series of smaller, compact dies that performed exceptionally well given their size, catching NVIDIA's parts off-guard in terms of the "three P's": Price, Performance, Power (consumption). However, with some key AMD figures gone, time will tell what is happening in the Red Team camp with their future ASICs.
Cornerstone of Kepler architecture: Meet SMX, the base for all Kepler GPUs
As you can see in the diagram above, NVIDIA went back to the drawing board and completely changed the way the chip operates. The company has now adopted AMD's mantra of creating a performance chip first, and then releasing lower-end and higher-end parts based on the same or different silicon.
While Fermi architecture had 32 cores in a single SM unit (Streaming Multiprocessor) in the initial GF100 silicon, and 48 cores in the refreshed Fermi (i.e. GF114), Kepler’s SMX cluster (Streaming Multiprocessor eXtreme) takes 192 Fermi cores, and builds a completely new logic around them. While NVIDIA used similar diagrams between Fermi and Kepler, we’ve been told that "the logic is completely different".
Eight SMXes make for the reference Kepler GPU: the GK104
The GK104 chip is clear evidence of just how different Kepler is. While Fermi (GF100/110) had 512 cores, Kepler GK104 packs eight SMX units for a grand total of 1536 cores. Those eight clusters are divided into four GPC units (Graphics Processing Clusters), but this is where the similarities with Fermi end.
There is also a difference in cache memory. While Fermi had 1MB of L1 and 768KB of L2 cache, Kepler has 512KB of shared L1 cache and 512KB of L2 cache. The 512KB vs. 768KB of L2 is logical, since the L2 is tied to the memory controller partitions, while each SMX carries the same amount of L1 cache as a Fermi SM. With half the number of multiprocessors, total L1 was thus cut from 1024KB to 512KB. However, Kepler increased the texture and instruction caches. The chip which features all of the parts above packs 3.54 billion transistors in just 294mm².
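The cache totals above can be sanity-checked with simple arithmetic: each Fermi SM and Kepler SMX carries 64KB of configurable shared memory/L1, while L2 is built from 128KB slices tied to the memory controller partitions (six 64-bit partitions on Fermi's 384-bit bus, four on GK104's 256-bit bus). A quick back-of-the-envelope sketch of our own:

```python
# Back-of-the-envelope cache totals for GF100/110 vs. GK104.
# Each SM/SMX has 64KB of configurable shared memory + L1 cache;
# L2 is built from 128KB slices, one per 64-bit memory controller partition.

def l1_total_kb(sm_count, kb_per_sm=64):
    return sm_count * kb_per_sm

def l2_total_kb(mc_partitions, kb_per_slice=128):
    return mc_partitions * kb_per_slice

# GF100/110: 16 SMs, 6 partitions (384-bit bus)
print(l1_total_kb(16), l2_total_kb(6))   # 1024 768
# GK104: 8 SMXes, 4 partitions (256-bit bus)
print(l1_total_kb(8), l2_total_kb(4))    # 512 512
```

The numbers line up with the 1MB/768KB and 512KB/512KB figures quoted above.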
The second GPU NVIDIA is launching today goes under the codename GK107. Consisting of a single GPC with two SMX clusters, the GK107 packs 384 cores, 128KB of L1 and 128KB of L2 cache. The memory controller is 128-bit wide and supports GDDR3 and GDDR5 memory.
In order to create Kepler, GPU architects took CUDA cores from Fermi architecture and completely reworked how those cores operate.
Fermi vs. Kepler in raw numbers – notice the ratios and the most important part; IPC efficiency doubled on Kepler
Looking at the raw numbers, Kepler features higher efficiency than Fermi, but some parts remain the same, such as the LD/ST (Load/Store) units inside each SMX unit. This results in 64-bit operations being executed in much the same way as on Fermi – at least on the consumer side of things. The most important bit is the number of instructions per clock: while Fermi (GF100/110) executed up to 1024 instructions in a single clock, i.e. 1.58 trillion instructions per second, Kepler (GK104) executes 2048 instructions per clock, i.e. 2.06 trillion instructions per second. If there was any doubt why the GeForce GTX 670 Ti became the GeForce GTX 680, this is it.
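Those throughput figures fall straight out of instructions-per-clock times shader clock; a quick sketch using the well-known reference clocks (1544 MHz "hot clock" for GF110, 1006 MHz core clock for GK104):

```python
def inst_throughput(ipc, clock_hz):
    """Peak instruction throughput = instructions per clock x clock rate."""
    return ipc * clock_hz

# GF110: 1024 instructions/clock at the 1544 MHz "hot clock"
fermi = inst_throughput(1024, 1544e6)
# GK104: 2048 instructions/clock at the 1006 MHz core clock
kepler = inst_throughput(2048, 1006e6)

print(f"Fermi:  {fermi / 1e12:.2f} trillion instructions/s")   # ~1.58
print(f"Kepler: {kepler / 1e12:.2f} trillion instructions/s")  # ~2.06
```

Note that Kepler gets there despite running at roughly two thirds of Fermi's shader clock – that is the IPC doubling at work.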
GPU Computing & Double Precision: Yes, the GeForce Kepler is Castrated
Not all the news is great, though. NVIDIA kept its policy of castrating double precision performance on consumer parts, meaning the GeForce GTX 680 should not post better results than the GTX 580, which was getting hammered by AMD silicon for the past few generations. As you can read in our in-depth, 11-page review of the GeForce GTX 680, the SiSoft Sandra 2012 results reveal that the GeForce GTX 680 gets handily beaten by the Radeon HD 7970 – which is not double precision castrated.
However, the Quadro and Tesla professional cards will paint a different picture. Company representatives said they cannot go into GPGPU, i.e. GPU computing, performance until the GPU Technology Conference in May, and if rumors hold true, we expect to see NVIDIA debuting a "three fourths" double precision rate. The chip itself reaches 3.09 TFLOPS in single precision (SP), but castrated GeForce parts should not expect more than one eighth of that in double precision. Unlocked Kepler should have between 1.54 and 2.32 TFLOPS of DP performance, reaching the internal target of having higher DP performance than Fermi's SP performance. And that is something that makes GPU computing a proposition of the future – for 1.54 DP TFLOPS, you'd have to buy no fewer than 23-26 Sandy Bridge-E processors (23x Core i7-3960X or 26x Xeon E5-2687W). The difference in price is even larger – we expect to see Quadro and Tesla parts for about $2000-2500, while 23 i7-3960X processors will set you back $24,149.77 and twenty-six E5-2687W are a "cheap" $49,399.74. Yes, 50,000 dollars.
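The numbers above reduce to simple arithmetic. SP peak is cores x 2 FLOPs per clock (one fused multiply-add) x clock rate; the CPU totals use the per-unit list prices implied by the article's own sums (about $1,049.99 per i7-3960X and $1,899.99 per E5-2687W):

```python
# Single-precision peak: 1536 cores x 2 FLOPs/clock (FMA) x 1006 MHz
sp_tflops = 1536 * 2 * 1006e6 / 1e12
print(f"GK104 SP peak: {sp_tflops:.2f} TFLOPS")  # ~3.09

# Cost of matching 1.54 DP TFLOPS with Sandy Bridge-E CPUs,
# using the per-unit prices implied by the article's totals
cost_i7   = 23 * 1049.99   # 23x Core i7-3960X
cost_xeon = 26 * 1899.99   # 26x Xeon E5-2687W
print(f"23x i7-3960X: ${cost_i7:,.2f}")    # $24,149.77
print(f"26x E5-2687W: ${cost_xeon:,.2f}")  # $49,399.74
```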
How Did Kepler Become a Performance Monster?
One of the questions on a lot of minds among game developers, and certainly among application developers, was: how could Kepler be more efficient? In fact, we had a couple of in-depth conversations with AAA engine developers who were confused by our reports of Kepler having 1536 units.
First and foremost, we'll give you an efficiency example. The Samaritan demo by Epic Games required three GTX 580 boards in 3-Way SLI to run at 30 frames per second in Full HD resolution. Even going Quad SLI with two GTX 590 boards was no guarantee of smooth framerates. In terms of finances, we're talking about a $1647 investment; $1107 after the last round of GTX 580 price cuts. A single GeForce GTX 680 manages to achieve the identical framerate, and we're talking about a $499.99 investment (the cost of a plain vanilla GTX 680).
Given that the pixel load hasn't changed since three GTX 580 boards were required (3-Way GTX 580: 1536 cores, 1152-bit memory interface, 4.5GB GDDR5 memory), how did a single board with 1536 cores, a 256-bit memory interface and 2GB of GDDR5 memory manage to achieve just that?
As we wrote earlier, NVIDIA kept the Fermi cores but changed everything else. GPU architects explained that the key battle was won by completely changing the way the GPU works. First and foremost, NVIDIA moved a lot of functionality from hardware into software; a good example is the task scheduler. The number of warp schedulers remained the same (32 in both GF100/110 and GK104), but the way instructions are executed was changed by removing multiple steps.
Fermi was heavily dependent on hardware checks to make sure all instructions were properly executed. Given the amount of new instructions and functionality (native C++), it was clear why NVIDIA chose the safe route with a lot of HW checks. Kepler removes those checks and introduces a software pre-decode, which executes in just five steps.
This enabled Kepler to shed a large number of transistors and change from double-pumped "hot clock" units (first introduced by Intel with the NetBurst architecture, followed by NVIDIA with the Tesla and Fermi architectures) to normally clocked ones. We'll address this in greater detail on the next page.
Furthermore, NVIDIA removed one limitation which was holding the architecture back, and it is not something that usually gets more than a sentence, or maybe a slide. Well, not the case here. The big issue was feeding the GPU with instructions and textures, which NVIDIA witnessed as AMD's mainstream parts beat the living daylights out of Fermi and Tesla parts. Kepler addresses this by increasing the number of texture units to 128 (double Fermi's count), the same amount as on AMD's Northern and Southern Islands. Meet Bindless Textures.
Now that the number of texture units has doubled, NVIDIA also changed the way texture units operate. In the pre-Kepler era, you could use up to 128 simultaneous textures. Bindless Textures are probably the key reason why Samaritan can run on a single card, and why the performance of the GT 640M and GTX 680 is where it is. This feature increased the number of simultaneous textures from 128 to over a million. Yes, you've read that correctly. An engine developer can now run shader code on all the textures he or she plans to use, and stream those textures in as they come along. In ideal circumstances, the Samaritan demo can address operations on anywhere from 200-300 up to 1,000 textures, and Kepler gives you that level.
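To illustrate the difference, here is a purely conceptual sketch of our own (a toy model, not NVIDIA's actual API): the classic model forces every texture through a fixed table of 128 binding slots, while the bindless model hands the shader an opaque handle per resident texture, so the working set is no longer capped by the slot count.

```python
# Conceptual illustration of slot-bound vs. bindless texturing.
# This is our own toy model, not NVIDIA's actual API.

class SlotBoundGPU:
    """Pre-Kepler model: every texture must occupy one of 128 binding slots."""
    MAX_SLOTS = 128

    def __init__(self):
        self.slots = {}

    def bind(self, slot, texture):
        if not 0 <= slot < self.MAX_SLOTS:
            raise ValueError("only 128 simultaneous textures available")
        self.slots[slot] = texture


class BindlessGPU:
    """Kepler model: shaders reference resident textures via opaque handles,
    so the working set is not capped by a binding table."""

    def __init__(self):
        self._resident = {}
        self._next_handle = 0x1000

    def make_resident(self, texture):
        handle = self._next_handle
        self._next_handle += 1
        self._resident[handle] = texture
        return handle


gpu = BindlessGPU()
handles = [gpu.make_resident(f"tex_{i}") for i in range(100_000)]
print(len(handles))  # 100000 -- far beyond the 128-slot limit
```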
For a competitive comparison, AMD's Southern Islands also has 128 texture units, with the technology focus being on Partially Resident Textures, i.e. the MegaTexturing technology pioneered by John Carmack and id Software's engine. AMD has a different approach to this problem, but in our talks with engine developers, it was Bindless Textures that got the nod over the primarily OpenGL-limited PRT technology. Those two are apples and pears, though (slicing a large texture vs. addressing thousands of textures at the same time).
Just like AMD hit a power wall with the R600 (HD 2900), NVIDIA had the same experience with the first Fermi, i.e. the GeForce GTX 480. AMD's answer to power inefficiencies was the Radeon HD 3870 and later the HD 4870 – power-efficient GPUs which took the best parts of the R600 architecture and combined them with fixed-function parts, resulting in nice, power-efficient products.
With Kepler, NVIDIA took a similar approach: take what is good from Fermi and throw out everything else. NVIDIA had enough time to completely reorganize the way the GPU works. When designing Kepler, the company's architects decided to reduce power as much as possible, and the decision was made to abandon the double-pumped "hot clock" concept. The new concept increases the die area needed for the physical cores, but reduces power consumption: the cores deliver the same throughput in roughly the same physical space, but at half the power.
With the excessive control logic thrown out, enough room opened up not just to double the core count, but to triple it. While NVIDIA was designing this chip, its designation was "performance" and the target positioning was "GeForce GTX 670 Ti". However, after the company received the first 28nm silicon back, the tripled core count worked better than expected and enabled engineers to clock the parts higher than planned. At the same time, the new software-driven approach resulted in reduced complexity and easier performance tuning. As such, the part planned as the GTX 670 Titanium became the GTX 680.
Kepler's power efficiency – on average 30-50% more efficient than Fermi
In real world applications, GK104 was twice as efficient as GF110 in Battlefield 3, with other applications following suit. The worst efficiency gains were seen in The Elder Scrolls V: Skyrim and Batman: Arkham City. On average, Kepler is between 30 and 50% more power efficient than the Fermi architecture.
The real efficiency is seen when you take a look at the notebook part. The GK107 packs 384 cores, and that was enough to bring playable performance in demanding titles to the Ultrabook form factor.
When your silicon is power efficient, an interesting thing happens – you can clock it sky high and beat your competition on clock alone, should your logic lack sophistication. If you end up with sophisticated logic and low power consumption, you have a brilliant chance of taking the market.
In NVIDIA, the "holy duality" effect happened with Riva TNT2, GeForce 4 Titanium, GeForce 6000, 7000 Series and now with the GeForce GTX 600 Series. In our conversations, we learned that NVIDIA always planned to launch the GPU Boost feature, but they weren’t certain just how good the boost can be. GPU Boost is similar to Turbo mode on CPUs of today with several changes. Turbo mode on modern processors works that algorithm checks for current load of the cores, actual temperature and power consumption. Based on the information given, the Turbo mode will clock one or two cores to the Maximum TDP, or clock all available cores until maximum power consumption is reached.
Given that it controls the whole board, GPU Boost is a far more complex system. It works by having the algorithm check actual GPU and RAM power consumption, utilization, GPU temperature and similar board parameters. The end result is that GPU Boost will not just change the frequency of a single core (that would not make a lot of sense given the way a GPU works), but rather increase the frequency of both the GPU and the on-board memory. Furthermore, GPU and memory voltage will be increased up to a safe point, i.e. maximum power consumption. NVIDIA stated that the GeForce GTX 680 has two 6-pin connectors and can take a maximum of 225W, 75W less than the GTX 580. We have no doubt that custom GTX 680 boards will bring an 8+6-pin configuration (300W) for maximum performance.
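A heavily simplified sketch of how such a boost loop might behave (our own illustration; the real algorithm, limits and step sizes are NVIDIA's and not public, so every constant below is an assumption):

```python
# Heavily simplified GPU Boost-style control loop. Illustrative only:
# the real algorithm, limits and step sizes are NVIDIA's and not public.

BOARD_POWER_LIMIT_W = 195   # assumed board power target
TEMP_LIMIT_C = 98           # assumed thermal ceiling
CLOCK_STEP_MHZ = 13         # assumed boost bin granularity
BASE_MHZ, MAX_BOOST_MHZ = 1006, 1110

def boost_step(clock_mhz, board_power_w, gpu_temp_c):
    """Raise the clock one bin when there is power and thermal headroom;
    drop it one bin (never below base) when a limit is exceeded."""
    if board_power_w > BOARD_POWER_LIMIT_W or gpu_temp_c > TEMP_LIMIT_C:
        return max(BASE_MHZ, clock_mhz - CLOCK_STEP_MHZ)
    return min(MAX_BOOST_MHZ, clock_mhz + CLOCK_STEP_MHZ)

clock = BASE_MHZ
clock = boost_step(clock, board_power_w=160, gpu_temp_c=70)  # headroom: clock up
print(clock)  # 1019
clock = boost_step(clock, board_power_w=210, gpu_temp_c=85)  # over power: back down
print(clock)  # 1006
```

The key point the sketch captures is that the loop is driven by measured board power and temperature rather than by a fixed frequency table.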
GPU Boost doesn't stop at the set TDP, though. Overclocking is a favorite pastime for a lot of engineers (and users), and NVIDIA utilized GPU Boost to enable board partners to overclock their parts based on better cooling (read: parts with the most efficient cooling will have more headroom at default BIOS settings, let alone after vendor-specific BIOS tweaks).
When overclocking the GPU, GPU Boost will continue to run in the background, and if you bin the GPU/DRAM chips, you might end up with a monster. In our testing, we found that GPU Boost works quite impressively, with the maximum clock achieved being 1,316 MHz – over 30% above the default 1008 MHz. That clock was good enough to beat a GeForce GTX 590, which is based on two Fermi GPUs.
GPU Boost also plays a very important role in notebook design
Besides the desktop parts, GPU Boost is also used in notebooks. The algorithm looks at the whole computer and tries to redirect each available watt of power to the GPU, increasing performance by quite a bit. While 5W of power doesn't mean a lot in the desktop world, it represents about one fourth of the overall power budget for the Ultrabook/notebook part, resulting in notably higher performance.
New Memory Controller: More Transistors, Higher Integration, Highest Efficiency
The battle for the most efficient memory controller is also the battle for just how fast your chip is going to be. Unlike in a CPU, the memory controller is without any doubt the most important part of a GPU. Without it, the units will starve, and that was one of the reasons why Intel's Larrabee and ATI's R600 were such failures – their execution units were starved.
With Kepler, NVIDIA kept its policy of "1st gen memory controller is OK, 2nd gen kills the competition", and from the looks of it, the engineers scored once more. AMD has a natural advantage, since the company actually helps create the GDDR memory standards.
Fermi had a fairly efficient memory controller, but it could not drive memory as efficiently as the four generations of GDDR5 memory controllers in AMD's GPUs (Radeon HD 4000, HD 5000, HD 6000, HD 7000). NVIDIA stuck with GDDR3 until the very end and then switched over with the first Fermi, the GeForce GTX 480. Unfortunately for NVIDIA, the company had multiple issues with the silicon and could not dedicate the time and resources to optimize that part of the chip.
With Kepler, the memory controller was a focus, and the result is as expected – the highest stock clock for GDDR5 memory by quite a margin: the GTX 680 has memory operating at 1.50 GHz QDR (Quad Data Rate), i.e. 6000 effective MHz. By comparison, the Radeon HD 7970 has memory operating at 1.375 GHz QDR, i.e. 5500 effective MHz.
Thus, GK104 has a 256-bit memory interface and almost beats the GF110 Fermi (GTX 580), which utilizes a 384-bit memory interface. According to company representatives, overclocking the GDDR5 memory will also be quite an interesting experience, with the first reports from overclockers reaching 1.8GHz QDR, i.e. 7.2 "GHz". With a 256-bit interface that yields 230.4GB/s – still short of the 264GB/s achieved by the AMD Radeon HD 7970 at stock, but impressive nevertheless.
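Peak memory bandwidth is just bus width in bytes times the effective transfer rate, which makes these figures easy to check:

```python
def bandwidth_gbs(bus_bits, effective_gtps):
    """Peak bandwidth in GB/s = bus width in bytes x effective transfers/s."""
    return bus_bits / 8 * effective_gtps

print(bandwidth_gbs(256, 6.0))  # GTX 680 stock: 192.0 GB/s
print(bandwidth_gbs(256, 7.2))  # overclocked to 7.2 GT/s: 230.4 GB/s
print(bandwidth_gbs(384, 5.5))  # Radeon HD 7970 stock: 264.0 GB/s
```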
New Display Engine
With the Evergreen architecture, AMD surprised the world by introducing support for up to six displays from a single graphics processor. I remember speaking to NVIDIA GPU architects and executives who came to the USS Hornet for the AMD Evergreen after-party (regardless of what most people think, a lot of people from Intel, AMD, Microsoft, NVIDIA, Google and Apple are house friends, and there are no animosities of the kind a vocal minority of users imagines); they were shocked at the trick Carrell Killebrew pulled not just on NVIDIA, but on AMD itself. Eyefinity was a closely guarded secret and nobody knew about it.
Three years later, NVIDIA is finally able to answer with a brand new display engine. The focus was enabling 3D Vision Surround running off a single card, with a fourth display added for presence on social networks / internet / email while gaming on three displays.
We expect this feature to be of great importance when it comes to commercial use, especially in the financial sector. Furthermore, video professionals will be all over the card for finally enabling the holy trinity of video production: dual displays with working palettes and a calibrated TV which shows the video as it is going to be broadcast.
Tagging onto the video production bandwagon, Kepler also supports 4K display resolutions from a single cable, just like the Radeon HD 7700/7800/7900 series. DisplayPort 1.2 can now drive 3840×2160 and 4096×2160 at 60Hz, with HDMI 1.4a "high speed" (3GHz) offering 4K output at lower refresh rates.
New Video Engine: NVENC
When NVIDIA introduced the 2nd generation Tesla architecture (GT200), there was a lot of talk about GPU video encoding functionality. With the arrival of the Sandy Bridge architecture, that talk died down on the consumer level. In commercial usage, NVIDIA Quadro represents the de facto standard, greatly helped by the fact that leading companies based their software on CUDA: Adobe's Mercury Playback Engine, Blackmagic DaVinci Resolve and Avid utilize GPGPU functionality to the level where the thought of using even a dual-socket CPU configuration seems wasteful.
NVENC is the new generation of video encoding engine which claims 1080p encoding at 120-240 frames per second in popular H.264 for 2D and MVC (Multi-view Video Encoding) for 3D video.
Kepler now supports hardware acceleration for videoconferencing and wireless display, while transcoding and video editing should be brought to another level. Fittingly, Adobe is coming out with Creative Suite 6 just in time for NAB 2012, the annual broadcasters' conference which starts on April 14 in Las Vegas, NV.
The Kepler Lineup: High-End GK104, Low-End GK107 Split into 11 parts
As you already know, Kepler GPUs belong to GeForce 600 Series. However, not all 600 Series GPUs will be Kepler parts – some of them are still based on the Fermi architecture and will probably end up being replaced by the yet unannounced Kepler ASICs.
On the desktop side, NVIDIA is launching the GeForce GTX 680 today, and that is the only desktop part arriving today. After NVIDIA builds an allocation of GK107 parts, you can expect the arrival of the GT 600 series for desktop as well. Thus, there isn't much to write about the desktop part, since we have an in-depth review written by my colleague Anshel Sag, which reveals everything the desktop part has to offer.
In the mobile world, there is plenty of activity; the GeForce GT 640M was launched last week, after Acer started selling the Aspire Timeline M3 ultrabook earlier than agreed. There are multiple notebook design wins, but they are closely tied to the introduction of Intel's Ivy Bridge series of processors. Unfortunately, the situation is not as simple as it could be. Intel has changed its mind on more than one occasion when it comes to the introduction of the 3000 series of Intel Core i5 and i7 processors. Intel VP Sean Maloney said that Intel will introduce Ivy Bridge in two waves, in April and June. The high-end processors should launch in April, while the lower-end dual-core parts should arrive in time for Computex Taipei. Until then, both AMD and NVIDIA will be limited to saying they have design wins, without a way to show them.
The mobile lineup consists of no fewer than 10 models, based on three ASICs:
- 40nm Fermi: 610M, GT 635M, GTX 670M, GTX 675M
- 28nm Fermi: GT 620M, GT 630M
- 28nm Kepler: GT 640M LE, GT 640M, GT 650M, GTX 660M
Yes, you've read that correctly. NVIDIA taped out a 28nm Fermi GPU consisting of two SM units: 96 CUDA cores are paired with a 128-bit memory controller, connecting to GDDR3 or GDDR5 memory. NVIDIA claims that both the 40nm and 28nm Fermi silicon underwent an extensive rework under the new Performance/Watt mantra and that the new parts are significantly more efficient than the original Fermi parts.
The GK107 will serve in all four notebook parts, but with different amounts of enabled units. NVIDIA cites "up to 384" cores for all parts, while the memory controller is 128-bit and supports GDDR3 or GDDR5 memory. The GeForce GT 640M LE features up to 2GB of GDDR3 memory, while the GT 640M, 650M and GTX 660M support up to 2GB of GDDR3 or GDDR5 memory.
We have a notebook based on GT 640M part and will be publishing the review shortly.
Without any doubt, Kepler represents the biggest shift in NVIDIA's way of thinking in the past six years. This new architecture focuses on efficiency like no other architecture from NVIDIA. In a way, Kepler is NVIDIA's "Core architecture" moment, the same kind of shift that moved Intel towards more power efficient and higher performing designs.
This is the first time that a high-end part consumes significantly less power than its predecessor, and it actually swaps the obligatory 8+6-pin power connector setup for a more modest 6+6-pin one. The "Performance per Watt" and "Instructions Per Clock" mantras are visible in every part of the design, and where the GTX 680 stopped, the GT 640M continued. Seeing a discrete GPU inside an Ultrabook is rare enough, but seeing that GPU running Battlefield 3 in native resolution with all the details turned on is something we haven't witnessed even on high-end notebooks, let alone on a product which weighs as much as two past-gen GTX 580 cards.
According to Pat Moorhead, former Corporate VP and Corporate Fellow at AMD and a President and Principal Analyst at Moor Insights & Strategy, NVIDIA has a true winner on their hands:
"Having NVIDIA?s new high performance graphics inside Ultrabooks is good for the entire ecosystem of consumers, channel partners, OEMs, ODMs, game ISVs and of course, NVIDIA:
- Consumers get between 2-10X the gaming performance plus all the other Ultrabook attributes.
- Channel partners, OEMs, and ODMs can now offer a much more differentiated and profitable line of Ultrabooks.
- Game ISVs and their distribution partners can now participate more fully in the Ultrabook ecosystem."
Even though the products punch above expectations, we're still worried about what happened to GPGPU computing and how NVIDIA came to seriously lag behind AMD here. According to company representatives, GPGPU functionality will only be revealed in mid-May, in time for the GPU Technology Conference which takes place in San Jose, CA.
Overall, NVIDIA executed probably its best architecture launch to date – this is the first time the company launched not one but two GPUs at the same time and had hardware availability from day one (OK, Acer had availability in stores eight days ahead). We look forward to seeing how partners will take the two launched and two upcoming GPUs and compete in the market against Intel's Ivy Bridge and AMD's Fusion on the low end, and AMD's Southern Islands-based GPUs at mainstream, performance and high-end levels.