NVIDIA Kepler Analysis: Another Masterpiece from the Architects of G80 or…?

How Did Kepler Become a Performance Monster?

One of the questions on a lot of minds among game developers, and certainly among application developers, was: how could Kepler be this much more efficient? In fact, we had a couple of in-depth conversations with AAA engine developers who were confused by our reports of Kepler having 1536 units.

First and foremost, we’ll give you an efficiency example. The Samaritan demo by Epic Games required three GTX 580 boards in 3-Way SLI to run at 30 frames per second in Full HD resolution. Even going Quad SLI with two GTX 590 boards was no guarantee of smooth framerates. In terms of finances, we’re talking about a $1647 investment; $1107 after the last round of GTX 580 price cuts. A single GeForce GTX 680 manages to achieve an identical framerate for a $499.99 investment (the cost of a plain vanilla GTX 680).

Given that the pixel load hasn’t changed between the setups (3-Way GTX 580: 1536 cores, 1152-bit combined memory interface, 4.5GB GDDR5 memory), how did a single board with 1536 cores, a 256-bit memory interface and 2GB of GDDR5 memory manage to achieve just that?

As we wrote earlier, NVIDIA kept the Fermi cores but changed everything else. GPU architects explained that the key battle was won by completely changing the way the GPU works. First and foremost, NVIDIA moved a lot of functionality from hardware into software; a good example is the task scheduler. The number of warp schedulers remained the same (32 in both GF100/110 and GK104), but the way instructions are executed was changed by removing multiple steps.

Fermi was heavily dependent on hardware checks to make sure all instructions were properly executed. Given the amount of new instructions and functionality (native C++), it was clear why NVIDIA chose the safe route with a lot of hardware checks. Kepler removes those checks and introduces a software pre-decode, which executes in just five steps.
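
To illustrate the idea (and only the idea – NVIDIA has not published its scheduling code), here is a toy sketch of what resolving dependencies in software means: for math operations with known, fixed latencies, a compiler can work out how long each instruction must wait for its operands and bake that into the instruction stream, so the hardware no longer needs a scoreboard to check. Everything below, from the instruction format to the latency figure, is hypothetical.

    // Toy model of compile-time dependency resolution ("software pre-decode").
    // The instruction format, register count and latencies are made up; this
    // only illustrates the compiler, rather than the hardware, deciding how
    // long each instruction has to wait for its operands.
    #include <cstdio>
    #include <vector>

    struct Instr {
        int dst;       // destination register
        int src[2];    // source registers (-1 = unused)
        int latency;   // fixed math-pipe latency in cycles, known at compile time
    };

    // For every instruction, compute how many cycles to stall before issue so
    // that all source operands are ready. A Fermi-style design discovers this
    // at run time with a hardware scoreboard; a Kepler-style design encodes
    // the stall counts into the instruction stream ahead of time.
    std::vector<int> precomputeStalls(const std::vector<Instr>& prog) {
        std::vector<int> readyAt(64, 0);   // cycle at which each register becomes valid
        std::vector<int> stalls;
        int cycle = 0;
        for (const Instr& in : prog) {
            int ready = 0;
            for (int s : in.src)
                if (s >= 0 && readyAt[s] > ready) ready = readyAt[s];
            int stall = ready > cycle ? ready - cycle : 0;
            stalls.push_back(stall);
            cycle += stall + 1;            // one instruction issues per cycle after stalling
            readyAt[in.dst] = cycle + in.latency;
        }
        return stalls;
    }

    int main() {
        // r2 = r0 * r1; r3 = r2 + r0  (the add depends on the multiply)
        std::vector<Instr> prog = { {2, {0, 1}, 9}, {3, {2, 0}, 9} };
        std::vector<int> stalls = precomputeStalls(prog);
        for (size_t i = 0; i < stalls.size(); ++i)
            std::printf("instr %zu: stall %d cycles\n", i, stalls[i]);
        return 0;
    }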

This enabled Kepler to shed a large number of transistors and move from double-pumped “hot clock” units (first introduced by Intel with the NetBurst architecture, followed by NVIDIA with the Tesla and Fermi architectures) to normally clocked ones. We’ll address this in greater detail on the next page.

Furthermore, NVIDIA removed one limitation which was holding the architecture back, and it is not something that usually gets more than a sentence, or maybe a slide. Well, not the case here. The big issue was feeding the GPU with instructions and textures, a weakness NVIDIA witnessed as AMD’s mainstream parts beat the living daylights out of Fermi and Tesla parts. Kepler addresses this by increasing the number of texture units to 128 (double that of Fermi), the same amount as on AMD’s Northern and Southern Islands. Meet Bindless Textures.

Now that the number of texture units has doubled, NVIDIA also changed the way texture units operate. In the pre-Kepler era, you could use up to 128 simultaneous textures. Bindless Textures are probably the key reason why Samaritan can run on a single card, and why the performance of the GT 640M and GTX 680 is where it is. This feature increased the number of simultaneous textures from 128 to over a million. Yes, you’ve read that correctly. An engine developer can now run shader code on all the textures he or she plans to use and stream those textures in as they come along. In ideal circumstances, the Samaritan demo can address operations on 200-300, even 1000 textures, and Kepler gives you that level.
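
On the compute side, the same Kepler capability surfaced in CUDA as texture objects (CUDA 5.0, sm_30-class GPUs), which follow exactly the “bindless” model described above: the kernel receives plain handles instead of textures bound to a small set of fixed slots. The sketch below is our own minimal illustration, not code from Epic or NVIDIA; the 512-texture count and the 1×1 dummy textures are arbitrary.

    // Minimal sketch of "bindless" texturing through CUDA texture objects.
    // Instead of binding to a handful of fixed texture slots, the kernel
    // receives an arbitrary-length array of texture handles and picks one at
    // run time. This is not the D3D/OpenGL path a game engine would use.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    #define NUM_TEX 512   // well beyond the 128-slot limit of the bound-texture model

    __global__ void sampleAll(const cudaTextureObject_t* texs, int n, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex2D<float>(texs[i], 0.5f, 0.5f);   // sample the center texel
    }

    int main() {
        std::vector<cudaTextureObject_t> handles(NUM_TEX);
        std::vector<cudaArray_t> arrays(NUM_TEX);
        cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
        float texel = 1.0f;

        for (int i = 0; i < NUM_TEX; ++i) {
            cudaMallocArray(&arrays[i], &fmt, 1, 1);   // 1x1 dummy texture
            cudaMemcpy2DToArray(arrays[i], 0, 0, &texel, sizeof(float),
                                sizeof(float), 1, cudaMemcpyHostToDevice);

            cudaResourceDesc res = {};
            res.resType = cudaResourceTypeArray;
            res.res.array.array = arrays[i];
            cudaTextureDesc tex = {};
            tex.filterMode = cudaFilterModePoint;
            tex.readMode   = cudaReadModeElementType;
            tex.normalizedCoords = 1;
            cudaCreateTextureObject(&handles[i], &res, &tex, nullptr);
        }

        cudaTextureObject_t* dTexs; float* dOut;
        cudaMalloc(&dTexs, NUM_TEX * sizeof(cudaTextureObject_t));
        cudaMalloc(&dOut,  NUM_TEX * sizeof(float));
        cudaMemcpy(dTexs, handles.data(), NUM_TEX * sizeof(cudaTextureObject_t),
                   cudaMemcpyHostToDevice);

        sampleAll<<<(NUM_TEX + 127) / 128, 128>>>(dTexs, NUM_TEX, dOut);
        cudaDeviceSynchronize();
        std::puts("sampled from 512 textures in a single launch");
        return 0;
    }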

For a competitive comparison, AMD’s Southern Islands also has 128 texture units, with the technology focus being on Partially Resident Textures, i.e. the MegaTexture approach pioneered by John Carmack and id Software’s engine. AMD has a different approach to the problem, but in our talks with engine developers, it was Bindless Textures that got the nod over the primarily OpenGL-limited PRT technology. Those two are apples and pears, though (slicing a large texture vs. addressing thousands of textures at the same time).

Power Efficiency

Just like AMD hit a power wall with the R600 (Radeon HD 2900), NVIDIA had the same experience with the first Fermi, i.e. the GeForce GTX 480. AMD’s answer to those power inefficiencies was the Radeon HD 3870 and, later, the HD 4870 – power-efficient GPUs which took the best parts of the R600 architecture and combined them with fixed-function hardware, resulting in nice, power-efficient products.

With Kepler, NVIDIA took a similar approach – take what is good from Fermi and throw out everything else. NVIDIA had enough time to completely reorganize the way the GPU works. When designing Kepler, the company’s architects decided to cut power as much as possible, and the decision was made to abandon the double-pumped “hot clock” concept. The new concept increases the die area needed for the physical cores but reduces power consumption: the cores deliver the same throughput from roughly the same overall space, at about half the power.

With the excessive control logic thrown out, enough room opened up not just to double the core count, but to triple it. While NVIDIA was designing this chip, its designation was “performance” and the target positioning was “GeForce GTX 670 Ti”. However, after the company received the first 28nm silicon back, the tripled core count worked better than expected and enabled engineers to clock the parts higher than planned. At the same time, the new software-driven approach resulted in reduced complexity and easier performance tuning. As such, the part planned as the GTX 670 Ti became the GTX 680.

In real-world applications, GK104 was twice as efficient as GF110 in Battlefield 3, with other applications following suit. The smallest efficiency gains were seen in The Elder Scrolls V: Skyrim and Batman: Arkham City. On average, Kepler is 30-50% more power efficient than the Fermi architecture.

The real efficiency is seen when you take a look at the notebook parts. The GK107 packs 384 cores, and as the GT 640M shows, that was enough to run Battlefield 3 at native resolution inside an Ultrabook.

GPU Boost

When your silicon is power efficient, an interesting thing happens – you can clock it sky high and beat your competition on clock speed, should your logic lack sophistication. If you end up with sophisticated logic and low power consumption, you have a brilliant chance of taking the market.

At NVIDIA, that “holy duality” happened with the Riva TNT2, the GeForce 4 Titanium, the GeForce 6000 and 7000 Series, and now with the GeForce GTX 600 Series. In our conversations, we learned that NVIDIA always planned to launch the GPU Boost feature, but the company wasn’t certain just how good the boost could be. GPU Boost is similar to the Turbo mode on today’s CPUs, with several changes. Turbo mode on modern processors works by having an algorithm check the current load on the cores, the actual temperature and the power consumption. Based on that information, Turbo mode will clock one or two cores up to the maximum TDP, or clock all available cores up until maximum power consumption is reached.

Given that it controls the whole board, GPU Boost is a far more complex system. It works by having the algorithm check the actual GPU and RAM power consumption, utilization, GPU temperature and similar board-level parameters. The end result is that GPU Boost will not just change the frequency of a single core (which would not make a lot of sense given the way a GPU works), but rather increase the frequency of both the GPU and the on-board memory. Furthermore, GPU and memory voltage will be increased up to a safe point, i.e. maximum power consumption. NVIDIA stated that the GeForce GTX 680 has two 6-pin connectors and can take a maximum of 225W, 75W less than the GTX 580. We have no doubt that custom GTX 680 boards will bring an 8+6-pin configuration (300W) for maximum performance.
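
To make that description more concrete, here is a deliberately oversimplified sketch of such a feedback loop, reduced to a single core-clock knob. Every constant and every “sensor” in it is invented for illustration – NVIDIA’s actual controller lives in the driver and firmware, also weighs in memory clock, voltage and per-board telemetry, and is not public.

    // Oversimplified sketch of a GPU Boost-style feedback loop: watch board
    // power and temperature, climb one clock bin while there is headroom,
    // step back when over budget. All figures and the fake telemetry are
    // assumptions for illustration only.
    #include <algorithm>
    #include <cstdio>

    const double POWER_TARGET_W = 195.0;   // assumed board power target
    const double TEMP_LIMIT_C   = 98.0;    // assumed thermal limit
    const int    BASE_CLOCK_MHZ = 1008;    // default clock cited in the article
    const int    MAX_CLOCK_MHZ  = 1110;    // assumed ceiling
    const int    STEP_MHZ       = 13;      // assumed size of one boost bin

    // Fake telemetry: pretend power rises with clock and temperature is flat.
    double readBoardPowerW(int clockMHz) { return 150.0 + 0.6 * (clockMHz - BASE_CLOCK_MHZ); }
    double readGpuTempC() { return 70.0; }

    int boostTick(int clockMHz) {
        double power = readBoardPowerW(clockMHz);
        double temp  = readGpuTempC();
        if (power < POWER_TARGET_W && temp < TEMP_LIMIT_C)
            return std::min(clockMHz + STEP_MHZ, MAX_CLOCK_MHZ);   // headroom: raise the clock
        return std::max(clockMHz - STEP_MHZ, BASE_CLOCK_MHZ);      // over budget: back off
    }

    int main() {
        int clock = BASE_CLOCK_MHZ;
        for (int t = 0; t < 10; ++t) {
            clock = boostTick(clock);
            std::printf("tick %d: %d MHz, ~%.0f W\n", t, clock, readBoardPowerW(clock));
        }
        return 0;
    }

Run for a few ticks, the loop climbs until the simulated board power crosses the target, then oscillates around it – which is the behaviour the text describes, just stripped of everything proprietary.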

GPU Boost doesn’t stop at the set TDP, though. Overclocking is a favorite pastime for a lot of engineers (and users), and NVIDIA uses GPU Boost to let board partners overclock their parts based on better cooling (read: parts with the most efficient cooling will have more headroom at default BIOS settings, let alone after vendor-specific BIOS tweaks).

When overclocking the GPU, GPU Boost will continue to run in the background, and if you bin the GPU/DRAM chips, you might end up with a monster. In our testing, we found that GPU Boost works quite impressively, with the maximum clock achieved being 1,316 MHz – over 30% above the default 1008 MHz. That clock was good enough to beat a GeForce GTX 590, which is based on two Fermi GPUs.

Besides the desktop parts, GPU Boost is also used in notebooks. The algorithm looks at the whole computer and tries to redirect every available watt of power to the GPU, increasing performance by quite a bit. While 5W doesn’t mean a lot in the desktop world, it amounts to roughly a quarter of the overall power budget of an Ultrabook/notebook part, resulting in noticeably higher performance.

New Memory Controller: More Transistors, Higher Integration, Highest Efficiency

The battle for the most efficient memory controller is also a battle over just how fast your chip is going to be. Unlike on a CPU, the memory controller is without any doubt the most important part of a GPU. Without it, the units will starve, and that was one of the reasons why Intel’s Larrabee and ATI’s R600 were such failures – their execution units were starved.

With Kepler, NVIDIA kept its policy of “1st-gen memory controller is OK, 2nd-gen kills the competition”, and from the looks of it, the engineers scored once more. AMD has a natural advantage here, since the company actually creates the GDDR memory standards.

Fermi had a fairly efficient memory controller, but it could not drive memory as efficiently as the four generations of GDDR5 memory controllers in AMD’s GPUs (Radeon HD 4000, HD 5000, HD 6000, HD 7000). NVIDIA stuck with GDDR3 until the very end and then switched over with the first Fermi, the GeForce GTX 480. Unfortunately for NVIDIA, the company had multiple issues with that silicon and could not dedicate the time and resources to optimize the memory performance of the part.

With Kepler, the memory controller was a focus, and the result is as expected – the highest stock clock for GDDR5 memory by quite a margin: the GTX 680 has memory operating at 1.50 GHz QDR (quad data rate), i.e. 6000 MHz effective. By comparison, the Radeon HD 7970 has memory operating at 1.375 GHz QDR, i.e. 5500 MHz effective.

Thus, GK104 with its 256-bit memory interface nearly matches the GF110 Fermi (GTX 580), which utilizes a 384-bit memory interface. According to company representatives, overclocking the GDDR5 memory will also be quite an interesting experience, with the first reports from overclockers reaching 1.8 GHz QDR, i.e. 7.2 “GHz” effective. With a 256-bit interface, that yields around 230GB/s – still short of the 264GB/s achieved by the AMD Radeon HD 7970, but impressive nevertheless.
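
These figures are easy to sanity-check, since bandwidth is simply the interface width in bytes multiplied by the effective transfer rate. A quick calculation of our own, using the clocks mentioned above plus the stock GTX 580 (4.008 GT/s) and HD 7970 (5.5 GT/s) memory rates:

    // Quick sanity check of the bandwidth figures above:
    // bandwidth (GB/s) = interface width in bytes x effective transfer rate (GT/s).
    #include <cstdio>

    double bandwidthGBs(int busBits, double effectiveGTs) {
        return (busBits / 8.0) * effectiveGTs;
    }

    int main() {
        std::printf("GTX 680 stock  : %.1f GB/s\n", bandwidthGBs(256, 6.0));    // 192.0
        std::printf("GTX 680 @ 7.2  : %.1f GB/s\n", bandwidthGBs(256, 7.2));    // 230.4
        std::printf("GTX 580 stock  : %.1f GB/s\n", bandwidthGBs(384, 4.008));  // 192.4
        std::printf("HD 7970 stock  : %.1f GB/s\n", bandwidthGBs(384, 5.5));    // 264.0
        return 0;
    }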

New Display Engine

With the Evergreen architecture, AMD surprised the world by introducing support for up to six displays from a single graphics processor. I remember speaking to NVIDIA GPU architects and executives who came to the USS Hornet for the AMD Evergreen after-party (regardless of what most people think, a lot of people from Intel, AMD, Microsoft, NVIDIA, Google and Apple are on friendly terms; the animosities mostly live with a vocal minority and the users themselves); they were shocked at the trick Carrell Killebrew pulled not just on NVIDIA, but on AMD itself. Eyefinity was a closely guarded secret and nobody knew about it.

Three years later, NVIDIA is finally able to answer with a brand new display engine. The focus was enabling 3D Vision Surround to run off a single card, with a fourth display added for keeping an eye on social networks, the web or email while gaming on three displays.

We expect this feature to be of great importance when it comes to commercial use, especially in the financial sector. Furthermore, video professionals will be all over the card for finally enabling the holy trinity of video production: a dual-display setup with working palettes plus a calibrated TV which shows the video as it’s going to be broadcast.

Tagging onto the video-production bandwagon, Kepler also supports 4K display resolutions from a single cable, just like the Radeon HD 7700/7800/7900 series. DisplayPort 1.2 can drive 3840×2160 and 4096×2160 at 60Hz, while the HDMI 1.4a “high speed” (3GHz) output handles 4K at lower refresh rates.
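
A back-of-the-envelope calculation of our own shows why DisplayPort 1.2 has the headroom: at 24-bit color, a 4K image refreshed 60 times per second needs roughly 12 Gbps of pixel data, comfortably below the link’s 17.28 Gbps payload rate (blanking overhead, another few percent with reduced-blanking timings, is ignored here).

    // Rough check that 4K at 60 Hz fits into DisplayPort 1.2.
    // Active pixels only, 24 bits per pixel; blanking overhead ignored.
    #include <cstdio>

    double linkGbps(int w, int h, int hz, int bpp) {
        return (double)w * h * hz * bpp / 1e9;
    }

    int main() {
        const double dp12PayloadGbps = 17.28;   // 4 lanes x 5.4 Gbps, minus 8b/10b coding
        std::printf("3840x2160@60: %.2f Gbps (limit %.2f)\n",
                    linkGbps(3840, 2160, 60, 24), dp12PayloadGbps);
        std::printf("4096x2160@60: %.2f Gbps (limit %.2f)\n",
                    linkGbps(4096, 2160, 60, 24), dp12PayloadGbps);
        return 0;
    }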

New Video Engine: NVENC

When NVIDIA introduced its 2nd-generation Tesla architecture (GT200), there was a lot of talk about GPU video-encoding functionality. With the arrival of the Sandy Bridge architecture, that talk died down on the consumer level. In commercial usage, NVIDIA’s Quadro represents the de facto standard, greatly helped by the fact that leading companies based their software on CUDA: Adobe’s Mercury Playback Engine, Blackmagic DaVinci Resolve and Avid utilize GPGPU functionality to the level where even the thought of using a dual-socket CPU configuration instead seems wasteful.

NVENC is the new-generation video encoding engine, which claims 1080p encoding at 120-240 frames per second in the popular H.264 format for 2D and MVC (Multiview Video Coding) for 3D video.
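
Taking NVIDIA’s claimed rates at face value, the practical meaning is easy to work out: a two-hour 1080p30 clip holds 216,000 frames, so an encoder running at 240 fps gets through it in about 15 minutes, eight times faster than real time. A quick calculation of our own:

    // What the claimed NVENC rates mean in practice: time to encode a
    // two-hour 1080p30 clip at the quoted 120 and 240 frames per second.
    #include <cstdio>

    int main() {
        const double sourceFrames = 2.0 * 3600.0 * 30.0;   // 2 h of 30 fps video = 216,000 frames
        const double rates[] = {120.0, 240.0};
        for (double encodeFps : rates) {
            double minutes = sourceFrames / encodeFps / 60.0;
            std::printf("at %.0f fps: %.0f minutes (%.1fx real time)\n",
                        encodeFps, minutes, encodeFps / 30.0);
        }
        return 0;
    }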

Kepler now supports hardware acceleration for videoconferencing and wireless display, while transcoding and video editing should be brought to another level – especially given that Adobe is coming out with Creative Suite 6 just in time for NAB 2012, the annual broadcasters’ conference which starts on April 14 in Las Vegas, NV.

The Kepler Lineup: High-End GK104, Low-End GK107, Split Into 11 Parts

As you already know, Kepler GPUs belong to GeForce 600 Series. However, not all 600 Series GPUs will be Kepler parts – some of them are still based on the Fermi architecture and will probably end up being replaced by the yet unannounced Kepler ASICs.

On the desktop side, NVIDIA is launching the GeForce GTX 680 today, and that is the only desktop part arriving today. After NVIDIA builds up an allocation of GK107 parts, you can expect the arrival of a GT 600 series for desktop as well. Thus, there isn’t much more to write about the desktop part here, since my colleague Anshel Sag has written an in-depth review that reveals everything the desktop part has to offer.

In the world of mobile, there is plenty of activity; the GeForce GT 640M was launched last week, after Acer started selling the Aspire Timeline M3 Ultrabook earlier than agreed. There are multiple notebook design wins, but they’re closely tied to the introduction of Intel’s Ivy Bridge series of processors. Unfortunately, the situation is not as simple as it could be. Intel has changed its mind on more than one occasion when it comes to the introduction of the 3000 series of Intel Core i5 and i7 processors. Intel VP Sean Maloney said that Intel will introduce Ivy Bridge in two waves: April and June. The high-end processors should launch in April, while the lower-end dual-core parts should arrive in time for Computex Taipei. Until then, both AMD and NVIDIA will be limited to saying they have design wins, without a way to show them.

The mobile lineup consists of no fewer than 10 models, based on three ASICs:

  • 40nm Fermi: 610M, GT 635M, GTX 670M, GTX 675M
  • 28nm Fermi: GT 620M, GT 630M
  • 28nm Kepler: GT 640M LE, GT 640M, GT 650M, GTX 660M

Yes, you’ve read that correctly. NVIDIA taped out 28nm Fermi GPU silicon consisting of two SM units: 96 CUDA cores are paired with a 128-bit memory controller, connecting to GDDR3 or GDDR5 memory. NVIDIA claims that both the 40nm and 28nm Fermi silicon underwent extensive rework in line with the new performance-per-watt mantra and that the new parts are significantly more efficient than the original Fermi parts.

The GK107 will serve as all four Kepler notebook parts, but with a different amount of enabled units. NVIDIA cites “up to 384” cores for all parts, while the memory controller is 128-bit and supports GDDR3 or GDDR5 memory. The GeForce GT 640M LE features up to 2GB of GDDR3 memory, while the GT 640M, GT 650M and GTX 660M support up to 2GB of GDDR3 or GDDR5 memory.

We have a notebook based on the GT 640M and will be publishing the review shortly.

Conclusion

Without any doubt, Kepler represents the biggest shift in NVIDIA’s way of thinking in the past six years. This new architecture focuses on efficiency like no NVIDIA architecture before it. In a way, Kepler is NVIDIA’s “Core architecture” moment – the kind of shift that moved Intel towards more power-efficient and higher-performing designs.

This is the first time that a high-end part consumes significantly less power than its predecessor and actually drops the obligatory 8+6-pin power connector configuration for a more modest 6+6-pin one. The “performance per watt” and “instructions per clock” mantras are visible in every part of the design, and where the GTX 680 stops, the GT 640M continues. Seeing a discrete GPU inside an Ultrabook is rare enough, but seeing that GPU run Battlefield 3 at native resolution with all the details turned on is something we haven’t witnessed even on high-end notebooks, let alone on a product which weighs as much as two past-gen GTX 580 cards.

According to Pat Moorhead, former Corporate VP and Corporate Fellow at AMD, now President and Principal Analyst at Moor Insights & Strategy, NVIDIA has a true winner on its hands:

“Having NVIDIA’s new high performance graphics inside Ultrabooks is good for the entire ecosystem of consumers, channel partners, OEMs, ODMs, game ISVs and of course, NVIDIA:

  • Consumers get between 2-10X the gaming performance plus all the other Ultrabook attributes.
  • Channel partners, OEMs, and ODMs can now offer a much more differentiated and profitable line of Ultrabooks.
  • Game ISVs and their distribution partners can now participate more fully in the Ultrabook ecosystem.”

Even though the products punch above expectations, we’re still left wondering what happened to GPGPU computing and why NVIDIA now seriously lags behind AMD there. According to company representatives, the GPGPU story will only become known in mid-May, in time for the GPU Technology Conference which takes place in San Jose, CA.

Overall, NVIDIA has executed probably its best architecture launch to date – this is the first time the company has launched not one but two GPUs at the same time, with hardware availability from day one (OK, Acer had availability in stores eight days ahead). We look forward to seeing how partners will take the two launched and two upcoming GPUs and compete in the market against Intel’s Ivy Bridge and AMD’s Fusion at the low end, and against AMD’s Southern Islands-based GPUs at the mainstream, performance and high-end levels.

Original Author: Theo Valich
