Sunday, May 19, 2013

NVIDIA Kepler Analysis: Another Masterpiece from the Architects of G80 or...?




How Did Kepler Become a Performance Monster?

One question on the minds of many game developers, and certainly of application developers, was: how could Kepler be more efficient? In fact, we had a couple of in-depth conversations with AAA engine developers who were confused by our reports of Kepler having 1536 units.



First and foremost, we'll give you an efficiency example. The Samaritan demo by Epic Games required three GTX 580 boards in 3-Way SLI to run at 30 frames per second in Full HD resolution. Even going Quad SLI with two GTX 590 boards was no guarantee of smooth framerates. In terms of finances, we're talking about a $1647 investment, or $1107 after the last round of GTX 580 price cuts. A single GeForce GTX 680 achieves the identical framerate, and there we're talking about a $499.99 investment (the cost of a plain vanilla GTX 680).
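To put the price gap in numbers, here is a small Python sketch. The three totals come straight from the paragraph above; the per-board prices are our own back-calculation (total divided by three boards), and the function names are purely illustrative.

```python
# Totals quoted in the article; everything derived from them is our arithmetic.
GTX580_TOTAL_BEFORE_CUTS = 1647.00  # three GTX 580 boards, before price cuts
GTX580_TOTAL_AFTER_CUTS = 1107.00   # three GTX 580 boards, after price cuts
GTX680_PRICE = 499.99               # one plain vanilla GTX 680
BOARDS_IN_SLI = 3


def per_board(total, boards=BOARDS_IN_SLI):
    """Back-calculate the price of one board from the multi-GPU total."""
    return total / boards


def cost_ratio(multi_gpu_total, single_gpu_price=GTX680_PRICE):
    """How many times more the multi-GPU setup costs for the same framerate."""
    return multi_gpu_total / single_gpu_price


if __name__ == "__main__":
    for label, total in [("before cuts", GTX580_TOTAL_BEFORE_CUTS),
                         ("after cuts", GTX580_TOTAL_AFTER_CUTS)]:
        print(f"3-Way GTX 580 {label}: ${per_board(total):.2f} per board, "
              f"{cost_ratio(total):.2f}x the price of one GTX 680")
```

In other words, the three-board setup costs roughly 3.3 times as much as a single GTX 680 at the original prices, and still roughly 2.2 times as much after the cuts.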

This is the power of Kepler - run the Samaritan demo on a single GPU

Given that the pixel load hasn't changed, how did a single board with 1536 cores, a 256-bit memory interface and 2GB of GDDR5 memory manage to match three GTX 580 boards (3-Way GTX 580: 1536 cores, 1152-bit memory interface, 4.5GB GDDR5 memory)?

As we wrote earlier, NVIDIA kept the Fermi cores but changed everything else. The GPU architects explained that the key battle was won by completely changing the way the GPU works. First and foremost, NVIDIA moved a lot of functionality from hardware into software, and a good example is the task scheduler. The number of Warp Schedulers remained the same (32 in both GF100/110 and GK104), but the way instructions are executed was changed by removing multiple steps.

Scheduling example - Fermi versus Kepler

Fermi was heavily dependent on hardware checks to make sure all instructions were properly executed. Given the number of new instructions and the new functionality (native C++), it is clear why NVIDIA chose the safe route with a lot of hardware checks. Kepler removes the checks and introduces a software pre-decode, which executes in just five steps.
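The trade-off between runtime checks and a software pre-decode can be illustrated with a toy Python model. This is our own simplification, not NVIDIA's actual pipeline: it assumes in-order issue and fixed instruction latencies known ahead of time, and the Instr type, the four-instruction program and the 4-cycle latency are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dst: str        # register written by the instruction
    srcs: tuple     # registers read by the instruction
    latency: int    # cycles until dst is readable (assumed fixed and known)


def issue_with_scoreboard(program):
    """Runtime checks: poll source readiness every cycle before issuing."""
    ready_at, cycle, checks, issued = {}, 0, 0, []
    for ins in program:
        while True:
            checks += 1  # one dependency check per polled cycle
            if all(ready_at.get(r, 0) <= cycle for r in ins.srcs):
                break
            cycle += 1
        issued.append(cycle)
        ready_at[ins.dst] = cycle + ins.latency
        cycle += 1
    return issued, checks


def precompute_stalls(program):
    """Ahead-of-time pass: derive each instruction's wait from known latencies."""
    ready_at, cycle, stalls = {}, 0, []
    for ins in program:
        earliest = max([ready_at.get(r, 0) for r in ins.srcs] + [cycle])
        stalls.append(earliest - cycle)
        ready_at[ins.dst] = earliest + ins.latency
        cycle = earliest + 1
    return stalls


def issue_with_stall_hints(program, stalls):
    """Runtime with hints: wait the encoded stall count, no dependency checks."""
    cycle, issued = 0, []
    for _ins, stall in zip(program, stalls):
        cycle += stall
        issued.append(cycle)
        cycle += 1
    return issued


# Hypothetical program: r4 depends on r2 and r3, r2 depends on r1.
PROGRAM = [
    Instr("r1", ("r0",), 4),
    Instr("r2", ("r1",), 4),
    Instr("r3", ("r0",), 4),
    Instr("r4", ("r2", "r3"), 4),
]

if __name__ == "__main__":
    issued_hw, checks = issue_with_scoreboard(PROGRAM)
    stalls = precompute_stalls(PROGRAM)
    issued_sw = issue_with_stall_hints(PROGRAM, stalls)
    print("scoreboard issue cycles:", issued_hw, "runtime checks:", checks)
    print("stall hints:", stalls, "issue cycles:", issued_sw)
```

Both paths issue the instructions in the same cycles. The difference is where the work happens: the scoreboard version performs a dependency check on every polled cycle, while the hinted version does that analysis once, before the program runs.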

This enabled Kepler to get rid of a large number of transistors and to change the principle from double-pumped "hot clock" units (first introduced by Intel with its NetBurst architecture, followed by NVIDIA with the Tesla and Fermi architectures) to normally clocked ones. We'll address this in greater detail on the next page.

Bindless Textures is one of the key features of Kepler, replacing the sea of GPUs needed for previous workloads
 
Furthermore, NVIDIA removed one limitation that was holding the architecture back. A limitation like this usually gets no more than a sentence, or maybe a slide. Well, not the case here. The big issue was feeding the GPU with instructions and textures, which NVIDIA witnessed when AMD's mainstream parts beat the living daylights out of Fermi and Tesla parts. Kepler addresses this by increasing the number of Texture Units to 128 (double that of Fermi), the same number as on AMD's Northern and Southern Islands. Meet Bindless Textures.

Now that the number of Texture Units has doubled, NVIDIA also changed the way the Texture Units operate. In the pre-Kepler era, you could use up to 128 simultaneous textures. Bindless Textures increase that number from 128 to over a million. Yes, you've read this correctly. An engine developer can now run shader code against all the textures he or she plans to use, and use those textures as they come along. Bindless Textures are probably the key reason why Samaritan can run on a single card, and why the performance of the GT 640M and GTX 680 is where it is. In ideal circumstances, the Samaritan demo can address operations on 200-300…1000 textures, and Kepler gives you that level.
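A rough way to picture the difference is the toy Python model below. It is a sketch under our own assumptions, not a real graphics API: in the slot model a batch of work can see at most 128 bound textures, so larger texture sets have to be split into batches with rebinding in between, while in the bindless model every texture gets a handle that shader code can look up directly. The class and function names are hypothetical.

```python
import math

SLOT_LIMIT = 128  # pre-Kepler simultaneous-texture limit quoted in the article


def batches_with_bound_slots(texture_count, slot_limit=SLOT_LIMIT):
    """Slot model: work is split so each batch touches at most slot_limit textures."""
    return math.ceil(texture_count / slot_limit)


def batches_with_bindless(texture_count):
    """Bindless model: all textures are reachable at once, so one batch suffices."""
    return 1 if texture_count > 0 else 0


class BindlessTable:
    """Hypothetical handle table: textures are addressed by handle, not by slot."""

    def __init__(self):
        self._textures = {}
        self._next_handle = 1

    def register(self, texels):
        handle = self._next_handle
        self._next_handle += 1
        self._textures[handle] = texels
        return handle

    def sample(self, handle, index):
        return self._textures[handle][index]


if __name__ == "__main__":
    for count in (128, 300, 1000):
        print(f"{count} textures: {batches_with_bound_slots(count)} batch(es) "
              f"with slots, {batches_with_bindless(count)} with bindless")
```

With the texture counts mentioned for Samaritan, the slot model needs three batches for 300 textures and eight for 1000, whereas the handle-based model reaches all of them in one go.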

For competitive comparison, AMD's Southern Islands also has 128 Texture Units, with the technology focus being on Partially Resident Textures (PRT), i.e. the MegaTexture technology pioneered by John Carmack and id Software's engine. AMD has a different approach to this problem, but in our talks with engine developers, it was Bindless Textures that got the nod over PRT, which is primarily limited to OpenGL. Those two are apples and pears, though (slicing a large texture vs. addressing thousands of textures at the same time).
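For contrast with the handle-per-texture idea, the "slicing a large texture" approach can be sketched as a tile lookup. The Python below is a toy model of the general concept only, not AMD's PRT implementation: the 128-texel tile size, the class name and the miss behaviour (returning None) are all assumptions made for illustration.

```python
TILE = 128  # hypothetical tile edge, in texels


class PartiallyResidentTexture:
    """One huge virtual texture; only some of its tiles are held in memory."""

    def __init__(self, width, height, tile=TILE):
        self.width, self.height, self.tile = width, height, tile
        self.resident = {}  # (tile_x, tile_y) -> tile data

    def tile_of(self, x, y):
        """Map a texel coordinate to the tile that contains it."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError("texel outside the virtual texture")
        return (x // self.tile, y // self.tile)

    def load_tile(self, tile_xy, data):
        self.resident[tile_xy] = data

    def fetch(self, x, y):
        """Return the resident tile's data, or None to signal a miss."""
        return self.resident.get(self.tile_of(x, y))

    def total_tiles(self):
        cols = -(-self.width // self.tile)   # ceiling division
        rows = -(-self.height // self.tile)
        return cols * rows


if __name__ == "__main__":
    tex = PartiallyResidentTexture(16384, 16384)
    print("tiles in the virtual texture:", tex.total_tiles())
    print("before load:", tex.fetch(300, 5000))
    tex.load_tile(tex.tile_of(300, 5000), "rock detail")
    print("after load:", tex.fetch(300, 5000))
```

Here there is a single texture whose pieces come and go, while the bindless approach keeps many separate textures addressable at once, which is why the two are not directly comparable.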

© 2009 - 2013 Bright Side Of News*, All rights reserved.
