Just as CES 2010 was winding down, nVidia gathered technical journalists to reveal the desktop take on its Fermi architecture and the resulting NV100 silicon. On desktop boards, this part will be known as the GF100.

Over a whole day of technical briefings, we were walked through the intricacies of what is, without any doubt, the most complex piece of silicon ever manufactured on planet Earth. As a GPU enthusiast, I find it a pleasure to see AMD and nVidia manufacturing the world’s most complex chips and competing for market dominance in a way the CPU industry never experienced. After dozens of hours of briefings on AMD’s Evergreen and nVidia’s Fermi architectures, I personally believe the picture of where computing will go in the future has been painted. For that, two important building blocks had to come out of nVidia. One is Tegra 2, and the other is the topic of today’s analysis: GF100.

Meet the chip: NV100 i.e. Fermi i.e. GF100
During the Deep Dive sessions, nVidia declined to talk about the chip specifics. However, we do feel that it would be unfair to write about the architecture without talking about the actual silicon [and the current state of affairs]. Without further ado, this is the image that you probably saw earlier, but with somewhat more complete information.

nVidia GF100 die - a massive, roughly 2.4 x 2.4 centimeter (about 570 mm2) die manufactured in 40nm at TSMC

From the information we gathered from industry sources, the chip measures around 24 x 23 millimeters, so expect a die size of around 570 mm2, a little bit smaller than the 65nm GT200 used in the GeForce GTX 280. With GF100, nVidia pretty much intends to do the same thing it did with G80: introduce a high-end desktop part and then drive the architecture into its GeForce, Quadro, Tesla and Tegra businesses. According to our sources, we won’t have to wait too long until nVidia reveals a Tegra part heavily influenced by the GF100 architecture.

The chip itself is manufactured on the 40nm process at TSMC, with current yields hovering around a lowly 25%. With the A3 revision of the silicon, sources at the company told us they expect yields in the 40% range, i.e. comparable with the majority of AMD’s 40nm silicon. That is at least what nVidia hopes to achieve. We did a more detailed article here. Regardless of how you put it, it is obvious that TSMC badly screwed the pooch, as both of its largest customers aren’t exactly feeling "peachy". Then again, according to our sources, nVidia also has the highest-yielding 40nm part, the infamous Tegra 2.

Getting back to the Fermi silicon, nVidia was able to fit 94 chips on a single 300mm wafer, meaning we’re talking about a bulk price of $53 per GF100 die if the chip yields at 100%. Given that current yields result in 23-25 workable chips per wafer, the price of a single piece of silicon is astoundingly high: around $210 per chip. This indeed means that the high-end part will carry a high price tag, but then again, we do expect high-end GF100 parts to debut in the $499 or $549 range. After you read this article, we would like to hear your thoughts on where nVidia should pitch its parts.
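The cost arithmetic above can be reproduced in a few lines. This is only a sketch: the wafer price is not stated anywhere in the article and is inferred here from the two quoted figures (94 dies per wafer, $53 per die at 100% yield), and all names are illustrative.

```python
# Figures quoted in the article.
CANDIDATE_DIES_PER_WAFER = 94   # GF100 dies that fit on one wafer
COST_AT_FULL_YIELD = 53.0       # USD per die if every die worked

# Assumption: the wafer price is inferred from the two figures above,
# it is not a number given by nVidia or TSMC.
implied_wafer_cost = CANDIDATE_DIES_PER_WAFER * COST_AT_FULL_YIELD


def good_dies_at_yield(yield_fraction: float) -> float:
    """Number of working dies per wafer for a given yield (0.0-1.0)."""
    return CANDIDATE_DIES_PER_WAFER * yield_fraction


def cost_per_good_die(good_dies: float) -> float:
    """Spread the implied wafer cost over the dies that actually work."""
    return implied_wafer_cost / good_dies


# The 23-25 workable chips per wafer mentioned in the article.
for good in (23, 25):
    print(f"{good} good dies -> ${cost_per_good_die(good):.2f} per chip")

# The hoped-for A3 yield of roughly 40%.
a3_good = good_dies_at_yield(0.40)
print(f"40% yield -> ${cost_per_good_die(a3_good):.2f} per chip")
```

Under these assumptions, 23-25 good dies bracket the quoted figure of around $210 per chip, and a 40% yield would bring the per-chip cost down considerably.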

As far as the desktop version goes, we don’t expect a change in price from the current GT200 generation at introduction, since nVidia will offset the revenue loss on the desktop side with Tesla and Quadro cards, just like AMD and Intel compensate for the lower prices of desktop CPUs with commercial-grade models built on essentially the same silicon [Athlon/Phenom and Opteron, Core and Xeon]. We do feel nVidia could have been more open about the situation, given that the company is now targeting markets where roadmaps are known years in advance.
From the looks of it, we would not be surprised if nVidia’s Fermi strategy turns out to be a repeat of GT200: launch at 40nm and then move to either 32nm or 28nm as soon as possible, bringing the cost down. We do expect to see the move down to 32nm or 28nm during 2010, but as it looks right now, GeForce "360" and "380" will have one hellishly hot introduction into the marketplace.

GF100 Architecture overview
Now that you know the silicon and the associated cost, it’s time to dig into the architecture itself. If you ask for my personal opinion, nVidia did what Intel’s engineers did when they took the Core 2 architecture and created Nehalem: analyzed the GPU architecture from all sides and increased efficiency across the board. This was mostly done by performing as many operations on-chip as possible, which ultimately led to dropping the 512-bit memory controller in favor of a simpler 384-bit one. We did ask nVidia about future iterations of GF100 and whether the company will go for Differential GDDR5 memory, given that nVidia created its own ECC version of GDDR5. Unfortunately, Jonah M. Alben didn’t want to go into the product strategy for this part.

nVidia GF100 - You’ve seen it as Fermi, this is when those units go graphical. No ECC, just throughput

GF100 has many key architectural parts, but as we all know, the main one is the GigaThread engine. As Intel experienced with Larrabee, without an extremely efficient instruction dispatcher the chip will starve for data and efficiency will drop off a cliff. In the case of GF100, nVidia took a look at the architecture as a whole and created an "Instruction & Data" feeder that should very efficiently push the tasked processes through the cores. In order to do that, several new layers of control were established.

If you have followed our coverage of "GT300" i.e. NV100 i.e. GF100 over the last couple of months, then you know the chip features 512 cores, 64 Texture Memory Units and 48 ROP units connected to a 384-bit memory interface. The aforementioned GigaThread Engine feeds several architectural layers, starting with four Graphics Processing Clusters [GPC]. Each GPC consists of four SM clusters, which in turn consist of 32 cores each. Even though nVidia calls them CUDA cores, you can forget about us calling those units by their marketing names. Unlike the Register Combiners, Shader Units and Shader Cores of the past, these cores are "the real deal": standalone units able to process multiple instructions and multiple data [MIMD]. Each of the 512 cores consists of a Dispatch Port, an Operand Collector, two processing units [INT and FP] and Result Queue registers.
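As a quick sanity check, the unit counts quoted above follow directly from the hierarchy. The sketch below uses only figures from this article (four GPCs, four SM clusters per GPC, 32 cores per SM, and the four Texture Memory Units per SM cluster described in the SM section); the variable names are illustrative.

```python
# Hierarchy as described in the article.
GPC_COUNT = 4       # Graphics Processing Clusters per chip
SMS_PER_GPC = 4     # SM clusters per GPC
CORES_PER_SM = 32   # cores per SM cluster
TMUS_PER_SM = 4     # Texture Memory Units per SM cluster

sm_count = GPC_COUNT * SMS_PER_GPC
core_count = sm_count * CORES_PER_SM
tmu_count = sm_count * TMUS_PER_SM
cores_per_gpc = SMS_PER_GPC * CORES_PER_SM

print(f"SM clusters: {sm_count}")
print(f"Cores:       {core_count}")
print(f"TMUs:        {tmu_count}")
print(f"Cores/GPC:   {cores_per_gpc}")
```

The totals match the 512 cores and 64 Texture Memory Units quoted for the full chip, with 128 cores per GPC.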

The baseline unit for future GPUs - 32 Core cluster, L1 cache, four TMUs and Polymorph Engine

The SM cluster is actually the reason why GF100 should be the most efficient GPU architecture. Looking into it, we see that 32 cores are bundled together with 64KB of dedicated memory which switches between two modes, 48/16 or 16/48, for the roles of L1 cache and Shared Memory. This dynamic switching should help game developers optimize their game performance, as the architecture is very flexible. Unfortunately, we don’t know how many cycles are needed for the memory to switch between the states. Furthermore, we have the Warp Scheduler and Master Dispatch Unit, which both feed into a very large Register File [32,768 32-bit entries; the Register File can accept different formats, i.e. 8-bit, 16-bit etc.]. Rounding out the SM cluster, we have four Texture Memory Units, a Texture Cache and probably the most important bit of them all: the Polymorph Engine.

To nVidia, the Polymorph Engine is "the end all means", a single unit consisting of Attribute Setup, Tessellator, Viewport Transform, Vertex Fetch and Stream Output. When nVidia analyzed how to address DirectX 11 Tessellation, it concluded that putting a Tessellator unit on top of GT200’s architecture would create an architectural bottleneck. As nVidia didn’t have experience with Tessellation in the way that AMD did, the company chose the "clean sheet of paper" approach and came up with what it hopes will be the most effective way to do Tessellation on a GPU.
The chip features a massive 1MB of L1 cache and 768KB of L2 cache, which reminded us of the old Duron processors, which had a massive L1 and a puny L2 cache. Given the slaughter that the small Duron performed on competing products from Intel and VIA, "where there is smoke, there is fire". In comparison, AMD’s Evergreen architecture went the opposite route, pairing a small L1 cache with a larger L2 cache.
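The 1MB L1 figure can be derived from the per-SM numbers given earlier. The sketch below assumes 16 SM clusters (four GPCs with four SMs each, as described in this article) and treats the 64KB block per SM as the "L1" being counted; the Register File size is computed from the quoted 32,768 32-bit entries. Names are illustrative.

```python
SM_COUNT = 16                    # 4 GPCs x 4 SM clusters, per the article
CONFIGURABLE_KB_PER_SM = 64      # L1 + Shared Memory block per SM
REGISTER_ENTRIES_PER_SM = 32768  # 32-bit entries in the Register File
BYTES_PER_ENTRY = 4              # 32 bits

# The two L1 / Shared Memory splits mentioned in the article, in KB.
SPLITS = [(48, 16), (16, 48)]
for l1_kb, shared_kb in SPLITS:
    # Each split must use the whole 64KB block.
    assert l1_kb + shared_kb == CONFIGURABLE_KB_PER_SM

total_configurable_kb = SM_COUNT * CONFIGURABLE_KB_PER_SM
regfile_kb_per_sm = REGISTER_ENTRIES_PER_SM * BYTES_PER_ENTRY // 1024
total_regfile_kb = SM_COUNT * regfile_kb_per_sm

print(f"L1 + Shared Memory, whole chip: {total_configurable_kb} KB")
print(f"Register File per SM:           {regfile_kb_per_sm} KB")
print(f"Register File, whole chip:      {total_regfile_kb} KB")
```

Sixteen 64KB blocks give 1024KB, i.e. the quoted 1MB, and each Register File works out to 128KB.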

We asked Henry Moreton, Distinguished Engineer at nVidia, what the overall bandwidth of the caches involved was, and we learned that GF100 packs more than 1.5 TB/s of bandwidth for L1 and a very similar figure for the L2 cache. The bandwidth naturally depends on the clocks of the GPU itself, and until we learn the final clocks, we won’t be able to tell you the exact cache bandwidth. The same applies to video memory bandwidth, as nVidia did not disclose a single clock, not even for the prototype boards.
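To illustrate why the memory bandwidth cannot be pinned down without clocks, here is a minimal sketch of the usual peak-bandwidth formula (bus width times per-pin data rate, divided by eight). Only the 384-bit bus width comes from this article; the data rates in the loop are purely hypothetical placeholders, not leaked or announced GF100 clocks.

```python
BUS_WIDTH_BITS = 384  # memory interface width quoted in the article


def memory_bandwidth_gb_s(data_rate_gbps_per_pin: float,
                          bus_width_bits: int = BUS_WIDTH_BITS) -> float:
    """Peak theoretical bandwidth in GB/s for a given per-pin data rate."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


# Hypothetical GDDR5 data rates, chosen only to show the scaling.
for rate in (3.0, 4.0, 5.0):
    print(f"{rate:.1f} Gbps per pin -> {memory_bandwidth_gb_s(rate):.0f} GB/s")
```

Whatever the final memory clock turns out to be, the bandwidth scales linearly with it, which is why no exact figure can be given yet.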

Tessellation: Efficiency is king, trouble for 5870 and 5970?
Demonstration of what can be created with simple LOD scaling

As we explained in the architecture overview, nVidia went for the "clean sheet" approach in order to implement Tessellation. Henry claims that the Polymorph Engine is the right way to address Tessellation, as opposed to a centralized unit serving multiple cores.

Unigine Tessellation sequence - Dragon and Cobblestone Road Sequence

In order to show GF100’s efficiency, nVidia used Unigine’s DirectX 11 benchmark. In a 60-second snapshot showing the Dragon and Cobblestone Road sequences, the upcoming GF100 was 50-80% faster than the Radeon HD 5870. Given the results of the dual-GPU HD 5970 in the same test, one might argue that in a very limited sequence, GF100 is faster than a dual-GPU AMD card. However, when we take the whole test into account, the situation should naturally change.

Tessellation performance in selected tests - AMD HD5870 vs. nVidia GF100

All of the nVidia speakers claimed much higher efficiency when comparing Tessellation to non-Tessellation performance, but we were unable to confirm that. Given that the slides compared AMD and nVidia hardware using nVidia’s own demos, we also asked whether this means nVidia will not deploy a GPU VendorID lock. The answer was surprisingly positive, and we got confirmation from several nVidia staffers that you should have no issues running nVidia’s DirectX 11 demos on AMD hardware. Performance is a whole other issue, though.

In terms of the so-called Geometric Realism, one of the things nVidia pulled out of a magician’s hat was a comparison of Shader and Geometry performance. The GeForce GTX 280 had 150 times more shading performance than the FX5800, but only three times more geometry performance. With GF100, nVidia increased geometry performance by a factor of eight, meaning GF100 will now have 15 times more geometry performance than NV30, i.e. FX5800. We might call this a good start. Compared to the Radeon HD 5870, nVidia claims that GF100 is 4-6x faster when using the Microsoft DirectX SDK Geometry Shader example [the infamous car demo].

Image Quality – Anti-Aliasing: Meet the "miraculous" 32x CSAA
nVidia’s Distinguished Engineer and one of the key architects of GF100 explains how 32x CSAA works

32x CSAA consists of eight color and 24 coverage samples, and all samples are used for Alpha to Coverage. 32x CSAA is able to detect 33 levels of transparency, a 33x improvement over the previous generation. As you might have guessed, the GT200 series did not support coverage samples while AMD’s Radeon HD 4000 and 5000 series did.
We will put this mode through detailed testing, but for now it looks promising. More importantly for AA in general, GF100 comes with vastly improved Transparency Multi-Sampling AA. We took a look at Left 4 Dead and saw a great improvement: a lack of any artifacts. Then again, using Age of Conan to demonstrate 32x CSAA isn’t something we would do. In any case, if the new Anti-Aliasing modes really work, that will mean nVidia has caught up with AMD in this key aspect of image quality.
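The "33 levels of transparency" figure follows from how Alpha to Coverage works in general: with N samples per pixel, anywhere from 0 to N of them can be marked as covered, giving N + 1 distinct levels. The sketch below is a simplified model of that idea, not nVidia’s hardware implementation; the function names and the rounding rule are assumptions made for illustration.

```python
COLOR_SAMPLES = 8      # from the article
COVERAGE_SAMPLES = 24  # from the article
TOTAL_SAMPLES = COLOR_SAMPLES + COVERAGE_SAMPLES


def transparency_levels(samples: int) -> int:
    """0..samples covered samples gives samples + 1 distinct levels."""
    return samples + 1


def covered_samples(alpha: float, samples: int) -> int:
    """Map an alpha value in [0, 1] to a number of covered samples.

    Simplified model: round to the nearest whole sample.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return int(alpha * samples + 0.5)


print(f"Total samples in 32x CSAA: {TOTAL_SAMPLES}")
print(f"Transparency levels:       {transparency_levels(TOTAL_SAMPLES)}")
print(f"alpha = 0.5 covers {covered_samples(0.5, TOTAL_SAMPLES)} samples")
```

With all 32 samples participating, a half-transparent texel covers 16 of them, and the mode as a whole can represent 33 distinct transparency steps.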

8xAA vs. 32xCSAA mode comparison in Age of Conan

AA Performance Penalty comparison between 8x AA and 32x CSAA

nVidia has also been criticized for its Anti-Aliasing implementation in the past, with GF100 hopefully being the chip that will change all of that.

This 8+24x mode allegedly comes at only an 8-15% penalty compared to the regular 8x AA mode, while offering much better quality.

Naturally, we won’t be able to check this until we receive the boards themselves, but the picture on the right shows the performance penalty when using a single GPU. If you decide to venture into the world of multi-GPU setups, the performance penalty should virtually disappear. Truth be told, we have seen many games where going from 4x to 8x cut performance by 30% or more while the companies claimed 5-15%. Unlike in the past, we now believe that the Radeon HD 5870 and GF100 will finally put those unfortunate titles to rest. Today’s GPUs should have more than enough computational power to perform this level of comparison. Do bear in mind that the difference between 8x and this new 32x CSAA mode comes down to nothing but the computational power of the GPU at hand.

3D Vision Surround: Excellent experience but?
Tom Petersen, Director of Technical Marketing at nVidia shows Bezel Correction on three Acer 24" 120Hz displays. The effect is definitely memorable

During CES 2010, nVidia launched 3D Vision Surround Gaming and nVidia Surround, a dual-GPU answer to AMD’s Eyefinity technology. nVidia’s setup works with GT200-class hardware such as the GeForce GTX 285, 275, 260 etc. The only real requirement is that you have three display outputs and are able to drive the resolution at hand.
When nVidia’s execs started to discuss 3D Vision Surround, it was obvious to us that this is a whole other ballgame in terms of rendering. AMD’s upcoming six-display board [codename Trillian, but also known as Radeon HD 5870 Eyefinity6 and/or HD 5890] uses a single GPU and 2GB of GDDR5 memory to drive 7680×3200, i.e. a majestic 24.5 million pixels. In order to keep the target frame rate in games [AMD set the bar at 40 fps for Eyefinity], the Cypress GPU has to process 980 million pixels every second.

In the case of 3-way 3D Vision Surround, three 1920×1080 displays result in a 5760×1080 resolution rendered 120 times each second, i.e. "only" 6.22 million pixels per frame. By a simple calculation, the GeForce needs to push 746.5 million pixels every second, and they need to be in perfect sync in order to achieve the 3D effect. Also, nVidia activated Bezel Correction, which on ATI cards works only under Linux [at the moment]. In order to calculate 3D and Bezel Correction, nVidia requires that you use two GT200 or GF100-class GPUs and three DVI connectors.
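The pixel-throughput comparison between the two multi-display setups is simple enough to reproduce. The sketch below uses only the resolutions and refresh or frame rates quoted in this article; the function names are illustrative, and the exact results differ slightly from the rounded figures in the text.

```python
def pixels_per_frame(width: int, height: int) -> int:
    """Total pixels in one frame at the given resolution."""
    return width * height


def pixels_per_second(width: int, height: int, rate_hz: int) -> int:
    """Pixels that must be delivered each second at the given rate."""
    return pixels_per_frame(width, height) * rate_hz


# AMD Eyefinity, six displays: 7680x3200 at AMD's 40 fps target.
eyefinity_frame = pixels_per_frame(7680, 3200)
eyefinity_rate = pixels_per_second(7680, 3200, 40)

# nVidia 3D Vision Surround: three 1920x1080 displays at 120 Hz.
surround_frame = pixels_per_frame(3 * 1920, 1080)
surround_rate = pixels_per_second(3 * 1920, 1080, 120)

print(f"Eyefinity6:  {eyefinity_frame / 1e6:.2f} Mpixels/frame, "
      f"{eyefinity_rate / 1e6:.1f} Mpixels/s")
print(f"3D Surround: {surround_frame / 1e6:.2f} Mpixels/frame, "
      f"{surround_rate / 1e6:.1f} Mpixels/s")
```

Even though a 3D Vision Surround frame holds roughly a quarter of the pixels of the six-display Eyefinity frame, the 120Hz requirement brings its per-second pixel load to about three quarters of the Eyefinity figure.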

We played Need For Speed: SHIFT on a dual-GF100 system and honestly, the experience was better than on an equivalent AMD Eyefinity setup. The problem I experienced on Eyefinity was that the left and right screens were a little blurred at speed, and the cockpit was positioned differently compared to 3D Vision Surround. With 3DVS, every display looked great and after a lap, you weren’t playing the game, you were in the game. In any case, it will be interesting to see how these two competing technologies pan out. We hope that there will be not a single mention of "proprietary", as this is the last thing the gaming industry needs.

There is another concern I wish to address in this article. When it comes to the financial aspect of this gaming experience, we got into a quarrel with Drew Henry about the price. According to the information at hand, many gamers are now considering or switching to Full HD-capable LCD TVs as their gaming displays, even if they stick to the PC platform. The inconvenient truth for PC gaming today is that if size is something you want, you will simply go for a 37", 42" or 46" LCD or plasma TV. You can get a brilliant 46-inch Full HD "3D Ready" Panasonic Viera [ex-Pioneer panel] plasma TV for $850, or just about the price of two 120Hz Acer panels. To add insult to injury, the experience of playing a game on such a screen is even better than 3-display Eyefinity [even with 30" displays] or 3D Vision Surround, as it achieves an effect similar to IMAX movies.

Back in 2005, nVidia and AMD seriously missed the boat with World of Warcraft and, in my opinion, they’re doing the same now with tens of millions of potential buyers. Talking about new and large investments in PC hardware rather than offering and addressing reasonable combinations is what limits companies such as AMD, Intel and nVidia. The "gamer" these companies have in mind [one who pays through the nose for all the "latest and greatest"] obviously exists in the minds of the people who run the show, but in reality exists in a volume equal to the owners of all supercars combined. If nVidia or anybody else wishes to increase the amount of high-end hardware it sells, that hardware has to be bundled with high-end hardware that doesn’t necessarily fit the standard perception. It would be excellent for us to write about an ideal $2000 setup with a single Fermi or Cypress board and a 3D Ready TV, but until support for 3DTV stops existing only in dull pre-CES press releases and actually trickles down to being part of a showcase, we don’t see this moving forward.

The high-end hardware users who usually approach us purchase a $500 graphics card every 2-3 years, once their investment has paid off. Those same users were the first to adopt 24" Dell displays [if you’re a Dell sales rep, remember the percentage of 2405WFP units attached to Dell systems vs. discrete display purchases, i.e. monitor only].
If you ask us about 3D Vision Surround, the answer is quite simple. It’s a fantastic experience. Once we calculate the setup costs, though, you’re looking at a $1500 price tag for the displays and an additional $1000-1200 for two GF100 boards. Not exactly "fantastic."

Real-world Ray Tracing App
According to Jen-Hsun, his next Ferrari: the beautiful 458 Italia, rendered using multiple rays.

During the last two years, we heard a lot about Ray tracing and games. As it happened, Ray tracing didn’t exactly gain a lot of traction in game development – we do expect that to change in 2011 and 2012, with the arrival of several titles that will put you in control of a movie. As we all know, nVidia owns Mental Images and their Ray Tracing software is second to none in the industry.

For some reason, there is no Ray Tracing presentation from nVidia without a Bugatti Veyron

For GF100, nVidia will release a free Ray tracing demo application featuring 12 cars and six different scenes. The cars will be rendered to "look like real" at perhaps a few frames a second. In the demo we saw, GF100 was several times faster at rendering the same object. Seeing a scene rendered at 1-4 frames per second on a dual-GF100 setup wasn’t exactly impressive. For comparison, the GTX 285 ran the scene at 0.33 frames a second. Even though this was much slower than the Ray Trace renderers of the past, this was a demonstration of cinema-grade quality. We still feel that true reality is a few years off, though.

When it comes to Ray tracing, the interesting bit of information was that nVidia does not use SLI to render the scene, but rather uses the computational power of both GPUs and then simply outputs the processed image. This approach is very similar to the principle used by LucidLogix. During our talks with nVidia, we learned that this non-SLI multi-GPU approach won’t be used for Ray tracing alone, but wherever it makes sense, as it adds PCIe latency to the mix.

NEXUS: Building an Ecosystem for Game Developers
Just like Star Wars, Terminator 2, Titanic or Avatar, the impact of the GF100 on content production will be valued less in benchmarks than in what it brings to users. Those movies may not have found their way into the hearts of moviegoers, but they changed the way movie production works for good.

Traditional Game Development according to nVidia

While we do not know what kind of market impact the Fermi architecture will have as such, we know that game development will never be the same. We specifically asked whether GF100 is fully addressable as a C++ co-processor. In theory, it sounds nice, but the reality is much better than that. We saw NEXUS for the first time at the nVidia GPU Technology Conference, where it ran on Tesla and Quadro boards. With GF100 ready to come out the door, the situation has changed for the better. What I saw was on-the-fly debugging of game code running on a virtual machine. Sebastien Domine, Sr. Director of Developer Tools at nVidia, demonstrated NEXUS on a system equipped with two graphics cards, running Windows 7 with Visual Studio 2008 and a virtual machine, again with Windows 7, running the game code.

NEXUS - a wholly integrated development system for developers

A bug was discovered on the virtual machine and NEXUS was able to completely freeze the rendered frame, followed by a step-by-step analysis until Sebastien found the rendering bug that had appeared in the QA process.
This is the first time that we have seen a GPU acting like a CPU, and this should make future debugging and bug tracing a breeze for deadline-burdened developers. NEXUS works on older products as well, but naturally some features weren’t supported in hardware until DirectX 11/GF100-class hardware.

NEXUS will be available in two variants: a free version and a professional version for $349. Given the price of development tools, we don’t view the professional version as necessary. The only major difference is tech support, since buying a license will grant you access to a 24-hour reply service.

GeForce GF100 board pictures – already water-cooled
nVidia showcased water-cooled 3Way SLI. The impressive part was that the whole setup was cooled with a dual 120mm radiator

nVidia showed several systems equipped with GF100 prototype boards. To us, the most interesting thing about GF100 is the fact that nVidia opened up to cooling manufacturers, enabling them to be ready for GF100 when it finally rolls out. We saw several water blocks from the usual and not-so-usual suspects. During the GF100 Deep Dive Day, the GF100 3Way SLI system came with water blocks for an all-water-cooled gaming experience. The system was as quiet as a mouse but again, we would like to see those prototype blocks on the cooler.

The system up close reveals naked fans and a heatsink that spans the PCB.

The good news is that nVidia finally decided to follow AMD’s lead and use a GT200-like mounting zone for the coolers. This means that if you have an after-market GT200 cooler, it might work with GF100 boards, just as the GT200-based GPU blocks worked on the prototype system.

Triple GF100 SLI setup reveals six DVI connectors. nVidia decided to stay conservative and not push the new standards.

Looking at the backside of the boards, we can see that nVidia is definitely sticking with a conventional dual-DVI setup for now.


Creating a smaller GF100
Even though the focus of this article is GF100 in its monolithic, three-billion-transistor form, it is quite easy to read how nVidia will create smaller dies. We would estimate that a mainstream part will consist of two GPC engines, i.e. 256 cores. Next up would be low-end silicon with a single GPC engine and 128 cores. For parts such as Tegra and its netbook/notebook/desktop chipset variants, nVidia can create a reduced-size GPC with as few as 32 cores. Unlike the GT200 derivatives, which were put on the back burner due to the resource drain caused by the Fermi architecture, GF100 should arrive in multiple ASICs as the yields on the high-end part stabilize. Then again, competition from AMD is more than strong here, and we do expect that by the time nVidia ships its first $100 Fermi-based GPU, AMD will probably have an installed base of more than 15-20 million users in the segment.

Availability? What availability?
During CES, nVidia’s partners gave us their manufacturing schedules, and it looks like February is the date for mass production of the boards. This will be an interesting launch, to say the least. Naturally, all the scheduling depends on how many workable chips come off each and every wafer, as TSMC not only has a yield issue but also suffers from a capacity issue. The foundry is starting to feel the heat from GlobalFoundries and yet isn’t able to "fire on all cylinders" and push out as many 40nm chips as possible.

nVidia did not say anything about the final products, neither the clocks nor the number of SKUs [Stock Keeping Units]. Based on the information given to us, we expect to see two SKUs at launch and several OEM designs that will follow later. GeForce 360 and 380 sound like the right way to go, even though we heard that nothing is final. The only thing that is certain is that the boards will feature two DVI connectors and that nVidia won’t focus a lot on DisplayPort, as market acceptance isn’t at the expected levels. Instead, a combination of one mini-DisplayPort and two DVI connectors is expected, with partners shipping one or two DVI-HDMI adapters.

Conclusion
From what we saw this past Monday the 11th, it is obvious that nVidia is changing its course and is not going against gamers, as many had feared. We consider those rumors to be nothing but FUD and treated them as such. We do expect GF100 GPUs to prove really efficient gaming beasts once they finally come out. Key word: finally. The architectural approach taken by nVidia was extremely brave but, so far, has not paid off in terms of units shipped. Just like AMD with its K8 architecture, nVidia is late to the market. To make matters worse, it is coming to the party with a complex and hot chip, without even discussing the lower-cost ASICs. Is that a good sign? We don’t think so.

An advanced architecture that works is going to capture the hearts of a lot of nVidia fans out there, but the focus has to be on the bottom line. While 3D Vision Surround is nothing but a brilliant experience, the price of the setup at hand sounds like a refusal to admit the market realities. By keeping chips like Fermi in the high-end spectrum, nVidia can potentially damage itself and push conventional gamers towards consoles and alternative companies.
In a way, the GF100 chip is the same as NV10, NV30 and NV50 [G80] were: a new and unproven architecture, a large die, finger-pointing at the manufacturing process or the company at hand, but also a very high return for the card owner. NV10, i.e. GeForce256, created the GPU as we know it, NV30 went on to evolve into the NV4x architecture that shipped in tens of millions of units, and NV50 still makes up the vast majority of nVidia’s current line-up [GeForce 8, 9, GTS 2x0, GTX260M/280M etc.].

We expect to see the launch of desktop parts at CeBIT 2010 in Hannover, Germany, followed by the launch of the Quadro line-up at NAB 2010 in Las Vegas or Siggraph 2010 later in the summer, depending on the time needed for ISV/OEM/ODM qualification.

The real question is: can this chip make money? It costs a lot of money to manufacture a single GF100 chip [$151?], and if the A3 revision doesn’t bring the yields up, we estimate that in the worst-case scenario the larger allocation of chips will go to the Quadro and Tesla line-ups, while desktop parts will have to wait until the die shrink.