In the past few weeks, we’ve seen various fishy rumors about the specifications of the GK104, the first discrete GPU based on the upcoming 28nm Kepler architecture. While parts of the specifications were already known, such as the absence of hot clocks, the doubling of the Streaming Multiprocessor (SM) from 48 to 96 CUDA cores (i.e. Stream Processors) and the 256-bit memory controller, the real specifications are (finally) here, even though our information differs minimally from the information originally posted on 3DCenter.org.

NVIDIA Kepler GK104 Architectural overview: at first look, very similar to GF110, but then you take a deeper look: 1536 Stream Processors instead of 512!

First and foremost, in NVIDIA’s internal nomenclature, this part should be named GeForce GTX 660. This is a $199-249 part which would conventionally replace the $300 "GeForce GTX 560 Ti 2GB", but will offer higher performance than the GTX 580. Significantly higher, and more importantly, not just beating the $449 Radeon HD 7950 3GB, but also endangering the $549 Radeon HD 7970. Yeah, it is that fast: in some tests it will be on par with the HD 7870 Pitcairn, but in others it will even beat the HD 7970.

Why? Because we’re talking about 1536 CUDA cores divided into four Graphics Processing Clusters (GPC), each of which contains four Streaming Multiprocessors (SM). Given that there are 96 Stream Processors per SM (or CUDA cores; NVIDIA cannot seem to make up its mind about what to call them), we can see that, for instance, the entry-level Kepler has a single SM unit with 96 CUDA cores/Stream Processors. Can you say: a mobile GPU part that allegedly taped out ages ago and, just by some accident, ended up in a Samsung notebook? Only time will tell for those.

The base combinations for NVIDIA’s future GPUs are now 96 (1 SM), 384 (1 GPC), 768 (2 GPC), 1536 (4 GPC) and 2304 (6 GPC) CUDA cores/Stream Processors. Given that our sources are telling us the big monolithic die comes with 2304 SPs, the question is what can be done with the memory controller. Logic dictates that Kepler can come with the following memory controller configurations: 64-bit, 128-bit, 192-bit, 256-bit, 320-bit and 512-bit. To us, it is most logical that we will see 64-bit on the low end, 128-bit in the mainstream, 256-bit in the performance segment and 384-bit on the high end. Compute parts will offer an unprecedented amount of working memory, as NVIDIA is working hard towards becoming the alternative to Xeon and Opteron processors.
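The core counts listed above follow from multiplying out the building blocks. Here is a minimal sketch, assuming the figures quoted in this article (96 cores per SM, four SMs per GPC); the constant and function names are ours, purely for illustration.

```python
# Building blocks as quoted in the article (rumored, not confirmed by NVIDIA).
CORES_PER_SM = 96
SMS_PER_GPC = 4


def cuda_cores(gpcs: int, sms_per_gpc: int = SMS_PER_GPC) -> int:
    """Total CUDA cores for a chip built from `gpcs` full GPCs."""
    return gpcs * sms_per_gpc * CORES_PER_SM


# The entry-level part is described as a single SM rather than a full GPC.
entry_level = CORES_PER_SM

# GPC counts mentioned in the article: 1, 2, 4 (GK104) and 6 (big die).
configs = {gpcs: cuda_cores(gpcs) for gpcs in (1, 2, 4, 6)}

print("1 SM:", entry_level)
for gpcs, cores in configs.items():
    print(f"{gpcs} GPC: {cores}")
```

With four GPCs this gives the 1536 cores attributed to GK104, and six GPCs give the 2304 cores rumored for the big die.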

Continuing with the GK104 GPU, the chip has the same amount of fixed-function logic as the competing Tahiti XT: 32 ROPs (Raster OPeration Units) and 128 TMUs (Texture Mapping Units). As you can see in our architectural mockup, the decision to go with a 256-bit memory controller results in 2GB of GDDR5, and this is the only part where NVIDIA really loses to AMD: both the 7950 and the 7970 come with 3GB of GDDR5 memory. True, NVIDIA boards are expected to be priced about $100 lower ($199-249 versus $199 for the 7870, $449 for the 7950 and $549 for the 7970), which should mitigate the paper advantage of the HD 7900 Series.

How high can it go?
Just like GF110, the GK104 comes in two different versions: the GeForce board will run double-precision at one-sixth rate, while Quadro and Tesla will run at the typical half rate. Just as with AMD Southern Islands, one source told us that there is an architectural possibility of full-rate DP (instructions, cache sizes), but we do not believe in fairy tales.

The GPU clock is estimated at 950MHz, but our sources are telling us that there are different clocks running in the lab: 772MHz for clock-per-clock comparisons with the GTX 580, 925MHz for clock-per-clock comparisons with Tahiti XT, while the clock range for the shipping parts is between 950 and 1000MHz. We were told that NVIDIA did not laugh too much at the Verdetrol performance enhancing pills and that the company is trying to tweak the BIOS (more importantly, the thermal envelope) in order to get the parts running at 1GHz. If NVIDIA fails, the partners are certain to offer a 1GHz board (just as third-party vendors did with Tahiti XT).

The memory is set at 1.25GHz in Quad-Data Rate (QDR, i.e. 5GHz "effective"). This 25% boost over GF100/GF110 is something that thrilled NVIDIA engineers, since this is the first time their memory controllers were able to match AMD at a stable default clock frequency. Remember, unlike GDDR3 memory, GDDR5 is "actively driven" and the memory controller does much more than it used to. Given that AMD is actually the company that creates the memory standard, AMD’s GPU engineers have a good advantage in terms of just how high they can clock the GDDR5 memory.

This clock results in 160GB/s of video memory bandwidth, a drop from the GTX 580 (192.4GB/s), but a big boost over the GTX 560 Ti and its 128.27GB/s (excluding the OEM versions), and just a bit higher than the GTX 560 Ti OEM (GF110 die), GTX 560 Ti 448 Cores LE and GTX 570, all of which share the same GDDR5 memory clock and a bandwidth of 152GB/s.
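The bandwidth figures above can be reproduced from bus width and effective data rate. This is a rough sketch: the GK104 numbers (256-bit, 1.25GHz quad-pumped) come from this article, while the bus widths and effective rates for the Fermi boards are our assumptions based on their public reference specifications, not something stated here.

```python
def bandwidth_gb_s(bus_width_bits: int, effective_rate_mtps: float) -> float:
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mtps / 1000


# From the article: 1.25GHz memory clock, quad data rate -> 5000 MT/s.
gk104_rate = 1250 * 4

boards = {
    "GK104 (rumored)": (256, gk104_rate),
    # Assumed reference specs (bus width in bits, effective rate in MT/s):
    "GTX 580": (384, 4008),
    "GTX 560 Ti": (256, 4008),
    "GTX 570": (320, 3800),
}

for name, (bus, rate) in boards.items():
    print(f"{name}: {bandwidth_gb_s(bus, rate):.1f} GB/s")
```

The same formula explains why the narrower 256-bit bus still lands above the 320-bit GTX 570: the higher memory clock more than makes up for the missing 64 bits.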

All of this results in 2.9 to 3.05 TFLOPS single-precision, i.e. 486-500 GFLOPS double-precision. Quadro and potential Tesla versions of this board will feature unlocked double-precision, meaning an identically clocked board would have around the same amount of DP GFLOPS as the GTX 580 had in single-precision, an impressive boost indeed. In any case, that is higher than what Fermi-based Quadros and Teslas were able to achieve.
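For readers who want to check the throughput math, here is a back-of-the-envelope sketch. It assumes each CUDA core retires one fused multiply-add (2 FLOPs) per clock, and it uses 512 cores at a 1544MHz shader clock for the GTX 580; both are our assumptions, as are the function names.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock


def sp_gflops(cores: int, clock_mhz: float) -> float:
    """Peak single-precision GFLOPS."""
    return cores * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1000


def dp_gflops(cores: int, clock_mhz: float, dp_ratio: float) -> float:
    """Peak double-precision GFLOPS at a given DP:SP rate ratio."""
    return sp_gflops(cores, clock_mhz) * dp_ratio


GK104_CORES = 1536  # from the article

for clock in (950, 1000):  # shipping clock range quoted in the article
    sp = sp_gflops(GK104_CORES, clock)
    geforce_dp = dp_gflops(GK104_CORES, clock, 1 / 6)  # GeForce: one-sixth rate
    pro_dp = dp_gflops(GK104_CORES, clock, 1 / 2)      # Quadro/Tesla: half rate
    print(f"{clock}MHz: SP {sp:.1f}, DP 1/6 {geforce_dp:.1f}, DP 1/2 {pro_dp:.1f}")

# Assumed GTX 580 reference: 512 cores at a 1544MHz hot clock.
print(f"GTX 580 SP: {sp_gflops(512, 1544):.1f}")
```

Under these assumptions the 950MHz case lines up with the lower figures quoted above, while the 1GHz case comes out slightly higher than the quoted upper bounds, which were presumably rounded or based on a marginally lower clock. The half-rate DP result at 950MHz also sits close to the GTX 580's single-precision peak, which is the comparison drawn above.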

You won’t need to wait for too long, as NVIDIA is already starting pre-sale activities, and getting ready to counter AMD and their momentum with the Radeon 7700 (Cape Verde, February 15), 7800 (Pitcairn, March 6) and 7900 Series (released).