Nvidia has finally launched the first iteration of their new Maxwell architecture in the GTX 750 Ti. While this GPU is still a relatively low-end part for Nvidia and is manufactured on the same 28nm process, Maxwell is designed to deliver improved performance per watt over Kepler to a similar degree that Kepler improved over Fermi. Since Fermi was known as one of the hottest designs the company has ever made, Nvidia has put an almost obsessive focus on performance per watt. While this approach is great for a company trying to improve performance within the same power envelope, it is even better when you look at the world of mobile, where Nvidia really hopes to finally take hold. The Kepler architecture itself promises to deliver a whole host of new features to Nvidia's Tegra mobile chips, which should theoretically put them in a more advantageous position this year than last year.

Now, getting back to Maxwell, the GTX 750 Ti is Nvidia's first GPU built on the new architecture's SMM compute clusters, an update to the SMX clusters of the Kepler architecture. According to Nvidia, the SMM clusters deliver 135% of Kepler's per-core performance (a 35% improvement per core) along with a 2x improvement in performance per watt. Below is a block diagram of the new Maxwell SMM; you can compare it against the SMX of Kepler in our architectural analysis here.

As you can tell above, each SMM is made up of 128 shader cores, or as Nvidia likes to call them, CUDA cores, arranged as four sub-units of 32 cores each. The previous-generation Kepler architecture used 192 of these shader cores per cluster, arranged as one large unit rather than four smaller sub-units. Because of this design, Nvidia claims that it can allocate work to the smaller sub-units more efficiently, gaining a per-core performance improvement of 35% that, they say, more than compensates for the reduced core count. If you do the simple math and assume that 128 Kepler cores magically became 35% faster, you would be looking at the equivalent of roughly 173 Kepler cores, which still doesn't match the 192 cores you get with one SMX. Clearly, there are plenty of performance improvements outside of the core design itself, some of which can be attributed to the amount of cache this chip features: a massive 2MB of L2.
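To make that back-of-the-envelope comparison concrete, here is a quick sketch of the per-cluster math using only the figures Nvidia has quoted (128 cores per SMM, a 35% per-core uplift, and 192 cores per Kepler SMX); note how neatly the result lines up with the "90% of a Kepler SM" line from Nvidia's white paper below:

```python
# Rough per-cluster comparison built from Nvidia's own quoted numbers.
SMM_CORES = 128          # CUDA cores per Maxwell SMM
SMX_CORES = 192          # CUDA cores per Kepler SMX
PER_CORE_UPLIFT = 1.35   # Nvidia's claimed 35% per-core improvement

kepler_equivalent = SMM_CORES * PER_CORE_UPLIFT
print(f"One SMM is roughly {kepler_equivalent:.0f} Kepler-class cores")     # ~173
print(f"One SMX has {SMX_CORES} Kepler cores")
print(f"Per-cluster shortfall: {1 - kepler_equivalent / SMX_CORES:.0%}")    # ~10%, i.e. ~90% of an SMX
```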

In their white paper, they even state, "Overall, with this new design, each SM is significantly smaller while delivering about 90% of the performance of a Kepler SM, and the smaller area enables us to implement many more SMs per GPU. Comparing GK107 versus GM107 total SM related metrics, GM107 has five versus two SMs, 25% more peak texture performance, 1.7 times more CUDA cores, and about 2.3 times more delivered shader performance."

Here you can see a pretty good side-by-side comparison of Maxwell and Kepler in each architecture's lowest-performance configuration for low-power and mainstream parts. As you can tell, the overall core count increases quite substantially; however, there is a bit of misleading information here, because the GTX 650 Ti is actually a GK106 part while the GT 640 is a GK107 part. So we are essentially comparing a GT 640 against a GTX 750, and the two are clearly targeted at different performance levels and price segments. Even so, this table attempts to look at the two GPUs purely architecturally in order to show why GM107 is capable of delivering more than GK106-level performance.

If you look at GM107 against GK107, you can see that the base clock is actually lower, which helps enable a lower power envelope. The lower clock is made viable by the greater performance per watt of the cores themselves, which ultimately allows for higher performance at lower clocks. As such, the GM107 delivers a peak theoretical performance of 1,305 GFLOPS versus the GK107's 812 GFLOPS, an improvement of roughly 60%. There are also more texture units in the GM107, 40 versus Kepler's 32 in the GK107, and the overall texture fill-rate improves accordingly: 40.8 Gigatexels/s for GM107 versus 33.9 Gigatexels/s for GK107.
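As a quick sanity check on those figures, here is where the peak numbers come from. This is a sketch that assumes the usual peak-throughput formula (two FLOPs per core per clock for a fused multiply-add) and takes GK107 in its GTX 650 configuration of 384 CUDA cores and 32 texture units at 1,058 MHz, alongside GM107's 640 cores and 40 texture units at the 1,020 MHz base clock detailed later in this article:

```python
def peak_gflops(cores, clock_mhz, flops_per_clock=2):
    """Peak single-precision throughput: cores x clock x 2 (one FMA counts as 2 FLOPs)."""
    return cores * clock_mhz * flops_per_clock / 1000.0

def fill_rate_gtexels(tmus, clock_mhz):
    """Peak texture fill-rate: texture units x clock."""
    return tmus * clock_mhz / 1000.0

# GM107 (GTX 750 Ti): 640 CUDA cores, 40 texture units, 1,020 MHz base clock
print(peak_gflops(640, 1020), fill_rate_gtexels(40, 1020))   # ~1305.6 GFLOPS, ~40.8 GTexels/s

# GK107 (assumed GTX 650 configuration): 384 CUDA cores, 32 texture units, 1,058 MHz
print(peak_gflops(384, 1058), fill_rate_gtexels(32, 1058))   # ~812.5 GFLOPS, ~33.9 GTexels/s
```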

The number of ROPs remains unchanged, while the size of the L2 cache has increased by 8x. The larger L2 makes the card less dependent on memory bandwidth, which matters because cards in this class traditionally ship with fairly narrow memory buses and lower-clocked memory, and the resulting bandwidth shortfall can cripple them at higher resolutions. So, even though this is a 2GB card, it still has a 128-bit memory bus, which is to be expected from a mainstream graphics card.
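For a sense of scale, combining the 2MB L2 figure mentioned earlier with the claimed 8x increase implies roughly 256KB of L2 on the previous-generation part, which shows just how much more on-chip buffering Maxwell has to cushion that narrow bus:

```python
# Implied L2 comparison, derived only from the figures quoted in this article.
GM107_L2_KB = 2 * 1024   # 2MB of L2 on GM107
L2_SCALING = 8           # Nvidia's claimed 8x increase over the previous generation

gk107_l2_kb = GM107_L2_KB / L2_SCALING
print(f"GM107 L2: {GM107_L2_KB} KB vs. implied GK107 L2: {gk107_l2_kb:.0f} KB")  # 2048 KB vs. 256 KB
```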

Below, you have a full block diagram of the entire GTX 750 Ti, based on the GM107. There you can see that it features five SMM units, which results in a total of 640 shader cores, or 640 'CUDA cores'.

Based on this, Nvidia claims that Maxwell and the GM107 make up the "Most Efficient GPU Ever Built." They back this claim by showing how much performance has improved while power consumption has actually decreased compared to previous generations.

They are even able to show more than double the performance of the previous generation within a lower power envelope. The GTX 750 Ti is so power-stingy that it doesn't actually need a 6-pin power connector on the PCB. At 60W, the 750 Ti can operate purely off of the 75W that the PCIe Gen 3.0 slot itself provides, and an additional 35W can be drawn through a 6-pin power connector for overclocking purposes. The card that Nvidia sent us did not have a 6-pin connector soldered onto the board, but many AIBs will likely offer one as an option to differentiate themselves and to offer their own overclocked variants.
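For a rough idea of the headroom being described, here is the simple budget math using the figures above; the 110W total is just one reading of Nvidia's 35W figure (treating it as extra power on top of what the slot supplies), so treat it as a ballpark rather than a spec:

```python
# Power budget sketch based on the figures quoted above.
TDP_W = 60            # GTX 750 Ti rated board power
SLOT_W = 75           # power the PCIe slot can deliver on its own
SIX_PIN_EXTRA_W = 35  # extra draw Nvidia cites for the optional 6-pin connector

print(f"Headroom within the slot budget: {SLOT_W - TDP_W} W")                  # 15 W to spare at stock
print(f"Ballpark budget with the 6-pin:  {SLOT_W + SIX_PIN_EXTRA_W} W total")  # ~110 W for overclocked boards
```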

Beyond that, Nvidia also claims that the GPU will be capable of beating AMD's R7 260X, which recently got a price cut in order to pre-empt the launch of the GTX 750 Ti. AMD has actually released a new GPU, the R7 265, to compete with the GTX 750 Ti; however, I don't think it will be able to match the GTX 750 Ti in terms of performance per watt. It will probably offer a decent level of performance per dollar, though, if it ends up competing with the GTX 750 Ti at the $149 price point.

Additionally, the GTX 750 Ti will be shipping with a base clock of 1,020 MHz and a boost clock of up to 1,085 MHz. The memory will run at an effective 5,400 MT/s and come in a 2 GB configuration, although Nvidia says that 1 GB variants will also be available soon. The 2 GB of GDDR5 sits on the 128-bit bus, which results in a memory bandwidth of 86.4 GB/s, and the card has a texture fill-rate of 40.8 GigaTexels/s, as we stated before.
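If you are wondering where the 86.4 GB/s figure comes from, it falls straight out of the bus width and the effective memory data rate quoted above:

```python
def memory_bandwidth_gbps(bus_width_bits, effective_rate_mtps):
    """Peak memory bandwidth: (bus width in bytes) x (effective transfers per second)."""
    return (bus_width_bits / 8) * effective_rate_mtps / 1000.0

# GTX 750 Ti: 128-bit bus with GDDR5 at an effective 5,400 MT/s
print(memory_bandwidth_gbps(128, 5400))   # 86.4 GB/s
```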

The chip itself has a transistor count of 1.87 billion, manufactured on TSMC's 28nm process. What's curious about this is that Nvidia did not go to 20nm, as many people were hoping they would with this lower-end part. Without a die shrink on Maxwell, I don't really know how big we could expect a 'full' Maxwell design to be. Keep in mind that the GM107 is about 25% bigger than the GK107 in terms of die area (148 mm² vs. 118 mm²), which could make a full Maxwell chip well over 600 mm², since GK110 already has a 551 mm² die. I don't know if we can expect to see a full-size Maxwell chip until Nvidia is able to go to 20nm, and if they don't, it could be problematic for yields. The fact that the GTX 750 Ti isn't a 20nm test chip is a bit worrying for the future of Maxwell at the high end, but right now the performance-per-watt numbers look pretty promising, even if the chip may end up being enormous.
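As a very rough back-of-the-envelope, here is the scaling logic behind that 600 mm² worry. It uses the die sizes quoted above and makes the big (and purely speculative) assumption that a full Maxwell part on 28nm would grow relative to GK110 by about the same ratio that GM107 grew relative to GK107:

```python
# Hypothetical die-size extrapolation; speculation, not a known spec.
GM107_MM2 = 148.0   # GM107 die area
GK107_MM2 = 118.0   # GK107 die area
GK110_MM2 = 551.0   # GK110 (big Kepler) die area

growth_ratio = GM107_MM2 / GK107_MM2
print(f"GM107 is ~{growth_ratio - 1:.0%} larger than GK107")                                 # ~25%
print(f"A 'big Maxwell' scaled the same way: ~{GK110_MM2 * growth_ratio:.0f} mm^2 on 28nm")  # ~691 mm^2
```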

Right now, we are testing our GTX 750 Ti, putting it through its paces, and will have a full review for you in the coming days. At the $149 price point, it really looks like Nvidia is bringing the competition right back to AMD after AMD did the same to Nvidia with the Hawaii-based R9 290 and R9 290X at the high end. If Nvidia is able to scale the Maxwell architecture up to a high-end GPU with a reasonable die size, we could see Nvidia forcing AMD to respond with a new competitive high-end GPU of its own. 2014 looks to be a very interesting year for GPUs, and Maxwell is out of the gate early; we will just have to see whether AMD's answer is a 20nm die shrink or whether they can work the same magic on 28nm that Nvidia did with the GTX 750 Ti.

Nvidia also announced the GTX Titan Black Edition today, which is the full-blown 2,880-core version of the GTX Titan with all of its compute cores unlocked and running at full clock speed. This GPU will be available in limited quantities and is designed to replace last year's GTX Titan, as it also has 6GB of RAM and full double-precision performance. The primary difference is that you gain one more SMX and better overall performance, although it carries the same $999 price as the original GTX Titan. The one nice thing Nvidia did add over its predecessors is the ability to drive four different monitors from a single GPU, a first for Nvidia's desktop gaming graphics cards. Maybe we'll eventually see a GM180-based chip on 20nm with an even higher shader count than the current GTX Titan Black Edition next year-ish, but first I think we need to see a GM110 happen this year, be it on 28nm or 20nm.