GF100 Architecture overview
Now that you know the silicon and the associated cost, it's time to dig into the architecture itself. If you ask for my personal opinion, nVidia did what Intel's engineers did when they took the Core 2 architecture and created Nehalem: analyzed the GPU architecture from all sides and increased efficiency across the board. This was mostly done by performing as many operations on-chip as possible, which ultimately led to dropping the 512-bit memory controller in favor of a simpler 384-bit one. We did ask nVidia about future iterations of GF100 and whether the company will go for Differential GDDR5 memory, given that nVidia created its own ECC version of GDDR5. Unfortunately, Jonah M. Alben didn't want to go into the product strategy for this part.
nVidia GF100 - You've seen it as Fermi, this is when those units go graphical - No ECC, just throughput
GF100 has many key architectural parts but, as we all know, the centerpiece is the GigaThread Engine. As Intel experienced with Larrabee, without an extremely efficient instruction dispatcher the chip will starve for data and efficiency levels will drop off a cliff. In the case of GF100, nVidia took a look at the architecture as a whole and created an "Instruction & Data" feeder that should very efficiently push the tasked processes through the cores. In order to do that, several new layers of control were established.
If you followed our coverage of "GT300" i.e. NV100 i.e. GF100 over the last couple of months, then you know the chip features 512 cores, 64 Texture Memory Units and 48 ROP units connected to a 384-bit memory interface. The aforementioned GigaThread Engine feeds several architectural layers, starting with four Graphics Processing Clusters [GPC]. Each GPC consists of four SM clusters, which in turn consist of 32 cores each. Even though nVidia calls them CUDA cores, you can forget about us calling those units by their marketing names. Unlike Register Combiners, Shader Units, and Shader Cores from the past - these cores are "the real deal" - standalone units able to process multiple instructions and multiple data [MIMD]. Each of the 512 cores consists of a Dispatch Port, an Operand Collector, two processing units [INT and FP] and Result Queue registers.
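The headline unit counts follow directly from the hierarchy described above. As a quick sanity check, here is a minimal Python sketch that multiplies them out. The constant names are our own labels rather than nVidia terminology, and the figure of four Texture Memory Units per SM cluster is the one quoted in the SM cluster description.

```python
# Unit counts as quoted in the article; the names are illustrative labels only.
GPCS = 4            # Graphics Processing Clusters fed by the GigaThread Engine
SMS_PER_GPC = 4     # SM clusters per GPC
CORES_PER_SM = 32   # cores per SM cluster
TMUS_PER_SM = 4     # Texture Memory Units per SM cluster

sm_count = GPCS * SMS_PER_GPC
core_count = sm_count * CORES_PER_SM
tmu_count = sm_count * TMUS_PER_SM

print(f"{sm_count} SM clusters, {core_count} cores, {tmu_count} TMUs")
```

The result matches the 512 cores and 64 Texture Memory Units quoted for the full chip.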
The baseline unit for future GPUs - 32 Core cluster, L1 cache, four TMUs and Polymorph Engine
The SM cluster is actually the reason why the GF100 architecture should be the most efficient GPU architecture. Looking into it, we see that 32 cores are bundled together with 64KB of dedicated cache which switches between two modes, splitting L1 and Shared Memory either as 48/16 or as 16/48. This dynamic switching should help game developers optimize their game performance, as the architecture is very flexible. Unfortunately, we don't know how many cycles the memory needs to switch between the states. Furthermore, we have a Warp Scheduler and a Master Dispatch Unit, which both feed into a very large Register File [32,768 32-bit entries - the Register File can accept different formats i.e. 8-bit, 16-bit etc.]. Rounding out the SM cluster, we have four Texture Memory Units, a Texture Cache and probably the most important bit of them all - the Polymorph Engine.
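To make the two cache modes and the Register File size concrete, here is a small Python sketch under stated assumptions. The mode names `prefer_l1` and `prefer_shared` and the helper `split_for` are hypothetical labels of ours, not nVidia's API. The sizes are the ones quoted above.

```python
SM_CACHE_KB = 64  # configurable on-chip memory per SM cluster

# The two modes described in the article, as (L1 KB, Shared Memory KB).
# The mode names are hypothetical labels.
CACHE_MODES = {"prefer_l1": (48, 16), "prefer_shared": (16, 48)}

def split_for(mode):
    """Return the (L1, Shared Memory) split in KB for a given mode name."""
    l1_kb, shared_kb = CACHE_MODES[mode]
    # Both modes must partition the same 64KB block.
    assert l1_kb + shared_kb == SM_CACHE_KB
    return l1_kb, shared_kb

# Register File: 32,768 entries of 32 bits (4 bytes) each.
REGISTER_FILE_ENTRIES = 32_768
register_file_kb = REGISTER_FILE_ENTRIES * 4 // 1024

for mode in CACHE_MODES:
    l1_kb, shared_kb = split_for(mode)
    print(f"{mode}: {l1_kb}KB L1 + {shared_kb}KB Shared Memory")
print(f"Register File: {register_file_kb}KB per SM cluster")
```

Worked out this way, the Register File comes to 128KB per SM cluster, twice the size of the configurable L1/Shared Memory block.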
To nVidia, the Polymorph Engine is "the be-all and end-all", a single unit consisting of Attribute Setup, Tessellator, Viewport Transform, Vertex Fetch and Stream Output. When nVidia analyzed how to address DirectX 11 Tessellation, it concluded that putting a Tessellator unit on top of GT200's architecture would create an architectural bottleneck. As nVidia didn't have experience with Tessellation in the way that AMD did, the company chose the "clean sheet of paper" approach and came up with what it hoped would be the most effective way to do Tessellation on a GPU.
The chip features a massive 1MB of L1 cache in total [16 SM clusters with 64KB each] and 768KB of L2 cache, which reminded us of the old Duron processors with their massive L1 and puny L2 cache. Given the slaughter that small Duron performed on the competing products from Intel and VIA, "where there is smoke, there is fire". By contrast, AMD's Evergreen architecture went the opposite route, pairing a small L1 cache with a larger L2 cache.
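The 1MB figure can be checked against the per-cluster numbers. The short Python sketch below assumes, as the total suggests, that the whole 64KB block per SM cluster is counted, including whatever portion is configured as Shared Memory. The variable names are ours.

```python
SM_CLUSTERS = 16       # 4 GPCs x 4 SM clusters
CACHE_PER_SM_KB = 64   # L1 / Shared Memory block per SM cluster
L2_KB = 768            # L2 cache size quoted for the chip

total_l1_kb = SM_CLUSTERS * CACHE_PER_SM_KB
l1_to_l2_ratio = total_l1_kb / L2_KB

print(f"L1 total: {total_l1_kb}KB, L2: {L2_KB}KB, ratio {l1_to_l2_ratio:.2f}")
```

If an SM cluster runs in the 16KB L1 mode, the amount acting as true L1 cache is smaller than this total, so the Duron comparison applies best to the 48KB L1 configuration.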
We asked Henry Moreton, Distinguished Engineer at nVidia, what the overall bandwidth of the caches was, and we learned that nVidia's GF100 packs more than 1.5 TB/s of bandwidth for L1 and a very similar figure for the L2 cache. The bandwidth naturally depends on the clocks of the GPU itself, and until we learn the final clocks we won't be able to tell you the exact cache bandwidth. The same applies to video memory bandwidth, as nVidia did not disclose a single clock, not even for the prototype boards.
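To illustrate why the cache bandwidth cannot be pinned down without clocks, here is a rough Python sketch of the usual relationship: aggregate bandwidth equals the number of cache units times bytes moved per clock times clock frequency. Both the 64 bytes per clock per SM cluster and the clock values are purely hypothetical placeholders of ours, since nVidia disclosed neither. Only the count of 16 SM clusters comes from the article.

```python
def cache_bandwidth_tb_s(units, bytes_per_clock, clock_ghz):
    """Aggregate bandwidth in TB/s for `units` caches, each moving
    `bytes_per_clock` bytes per cycle at a clock of `clock_ghz` GHz."""
    bytes_per_second = units * bytes_per_clock * clock_ghz * 1e9
    return bytes_per_second / 1e12

SM_CLUSTERS = 16               # from the article
ASSUMED_BYTES_PER_CLOCK = 64   # hypothetical, not disclosed by nVidia

# Hypothetical clocks, chosen only to show how bandwidth scales.
for assumed_clock_ghz in (1.0, 1.25, 1.5):
    bw = cache_bandwidth_tb_s(SM_CLUSTERS, ASSUMED_BYTES_PER_CLOCK, assumed_clock_ghz)
    print(f"{assumed_clock_ghz:.2f} GHz -> {bw:.3f} TB/s")
```

The point is the linear scaling: with the per-clock width fixed, any change in the final clocks moves the bandwidth figure by the same proportion.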
© 2009 - 2013 Bright Side Of News*, All rights reserved.