During the recently held SC09 conference in Portland, Oregon, Intel finally managed to reach its original performance goal for Larrabee. Back in 2006, when we got the first details about Larrabee, the performance goal was "1TFLOPS @ 16 cores, 2.0 GHz clock, 150W TDP". During Justin Rattner’s keynote, Intel demonstrated the performance of LRB as it stands today.

In the SGEMM performance test [4K by 4K matrix multiply, QCD], Intel achieved 417 GFLOPS using half the cores on the prototype card, and reached 825 GFLOPS with all cores enabled. Looking at the numbers alone, one might think that these scores are below the level of ATI Radeon 4850 and nVidia GeForce GTX 280/GTX 285. Of course, there is a "but" coming: unlike the theoretical numbers usually disclosed by ATI and nVidia, this was an actual SGEMM benchmark calculation used in the HPC community.
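For readers who want to relate these GFLOPS figures to run time, here is a minimal sketch of the arithmetic. It assumes that "4K" means 4096 and uses the conventional count of 2*N^3 floating-point operations for an N x N matrix multiply; the function names are ours and are not part of any Intel or BLAS tooling.

```python
def sgemm_flops(n: int) -> int:
    """Conventional operation count for an n x n by n x n multiply:
    n^3 multiplications plus roughly n^3 additions."""
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Sustained rate in GFLOPS for one multiply that took `seconds`."""
    return sgemm_flops(n) / seconds / 1e9


def seconds_for(n: int, rate_gflops: float) -> float:
    """Time one multiply would take at a given sustained rate."""
    return sgemm_flops(n) / (rate_gflops * 1e9)


if __name__ == "__main__":
    N = 4096  # assumption: "4K" in the article means 4096
    # Rates quoted in the article for the Larrabee prototype.
    for label, rate in [("half the cores", 417), ("all cores", 825)]:
        ms = seconds_for(N, rate) * 1e3
        print(f"{label:15s} {rate:4d} GFLOPS -> {ms:.1f} ms per multiply")
```

Under these assumptions one multiply is 137,438,953,472 operations, so every figure quoted here corresponds to a fraction of a second per 4K x 4K product.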

Intel Larrabee reaches 1TFLOPS in SGEMM BLAS test, 4Kx4K matrix

The keynote continued while the engineers scrambled at the back to try to beat the 1TFLOPS barrier. A couple of minutes before the end of the keynote, Justin added the infamous "And one more thing…" Initial overclocked performance was 913 GFLOPS, moved slowly past 919 GFLOPS, bounced up to 997 GFLOPS and ultimately passed the 1TFLOPS barrier with 1006 GFLOPS. Now, we can debate the numbers all we want, but the fact of the matter is that nVidia Tesla C1060 delivers only 370 GFLOPS in an identical SGEMM 4Kx4K calculation. Thus, Larrabee today delivers roughly 2.7x the math performance of the GT200 chip in this test.

In comparison, GT200-based Tesla card reaches 370 GFLOPS...

One might argue that AMD’s GPU line-up is more efficient than nVidia’s, but unfortunately the situation is rather complex due to the state of AMD’s GPGPU development. AMD’s architecture is very strong in theoretical performance and in real-world gaming. When it comes to the GPGPU world, AMD ditched everything else to focus on OpenCL development, and the results will come in 2010. But those efforts cannot compensate for architectural limitations. As we disclosed on numerous occasions, AMD introduced the 1Fat+4Thin concept with the ATI Radeon 2900XT, in which each core cluster consists of one unit for transcendental operations and four units for Multiply-Add/Add/Integer-Add/Dot operations. Thus, the Radeon 4800 family comes with 160 such cores, set against nVidia’s 30 clusters of 8 fully featured cores each, i.e. ATI’s 160 vs. nVidia’s 240 cores.

Long story short, the real-world SGEMM performance of AMD’s FireStream 9270 board [Radeon 4870] is 300 GFLOPS, weaker than GT200. We don’t have information about SGEMM performance of Evergreen GPUs [5700, 5800, 5900 series] but as soon as we learn the numbers – we’ll let you know. The same thing goes for nVidia’s long-delayed NV100-based family of products.

But as of SC09, the top five performing products for SGEMM 4K x 4K are as follows [do note that multi-GPU products are excluded as they don’t run SGEMM]:
1.  Intel Larrabee [LRB, 45nm] – 1006 GFLOPS
2.  EVGA GeForce GTX 285 FTW – 425 GFLOPS
3.  nVidia Tesla C1060 [GT200, 65nm] – 370 GFLOPS
4.  AMD FireStream 9270 [RV770, 55nm] – 300 GFLOPS
5.  IBM PowerXCell 8i [Cell, 65nm] – 164 GFLOPS
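
To put the list in perspective, the short sketch below expresses each result as a multiple of the Tesla C1060 score. The figures are the ones quoted above; the choice of baseline and the helper name `relative_to` are ours, purely for illustration.

```python
# SGEMM 4K x 4K results quoted in the list above, in GFLOPS.
RESULTS_GFLOPS = {
    "Intel Larrabee": 1006,
    "EVGA GeForce GTX 285 FTW": 425,
    "nVidia Tesla C1060": 370,
    "AMD FireStream 9270": 300,
    "IBM PowerXCell 8i": 164,
}


def relative_to(results: dict, baseline: str) -> dict:
    """Return each result divided by the baseline product's result."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


if __name__ == "__main__":
    ratios = relative_to(RESULTS_GFLOPS, "nVidia Tesla C1060")
    # Print fastest first, with the ratio against the chosen baseline.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:26s} {RESULTS_GFLOPS[name]:5d} GFLOPS  {ratio:.2f}x")
```

Swapping the baseline string for any other entry in the table gives the same comparison from that product's point of view.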

If you’re wondering where products such as Intel Harpertown-based Core 2 Quad or Nehalem-based Core i7 stand, the answer is quite simple – i7 XE 975 at 3.33 GHz will give you 101 GFLOPS, while Core 2 Extreme QX9770 at 3.2 GHz gives out 91 GFLOPS. Regardless of how hard we tried, we weren’t able to find performance figures for AMD CPUs using a 4K by 4K matrix.

Larrabee board shown at SC09 differed from Larrabee board at IDF - note the magenta stripe on the heatsink

As you can see for yourself, Larrabee is finally starting to produce some positive results. Even though the company has had silicon for over a year and a half, the performance simply wasn’t there and naturally, whenever a development hits a snag, you either give up or give it all you’ve got. After hearing that the "champions of Intel" moved from CPU development into the Larrabee project, we can now say that Intel will deliver Larrabee at whatever price the company is ready to pay. The fact that the design cost for Larrabee is probably as high as the combined GPU R&D cost of nVidia and AMD over the past 3 years doesn’t exactly play a role here. Intel has enough cash to deliver the part and not worry about TSMC’s hiccup, which only accelerated AMD’s plans to move GPU production away from TSMC [to GlobalFoundries] in 2011, leaving nVidia as the only major client.

There are several questions that are yet to be answered, such as the efficiency of Tesla C2050/C2070 GPGPU cards. If nVidia raises the efficiency from the current 40% to an expected 80-90%, Tesla chips should give out more than 1TFLOPS, but neither Larrabee nor NV100 is out the door yet.
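
The efficiency argument can be sketched as simple arithmetic. We assume here that "efficiency" means measured SGEMM throughput divided by theoretical peak throughput; the article does not define the term, and the helper names below are hypothetical. Only the 370 GFLOPS result and the 40% and 80-90% figures come from the text.

```python
def implied_peak(measured_gflops: float, efficiency: float) -> float:
    """Theoretical peak implied by a measured rate and an efficiency,
    assuming efficiency = measured / peak."""
    return measured_gflops / efficiency


def required_peak(target_gflops: float, efficiency: float) -> float:
    """Peak a chip would need to sustain `target_gflops` at `efficiency`."""
    return target_gflops / efficiency


if __name__ == "__main__":
    # Tesla C1060: 370 GFLOPS measured at roughly 40% efficiency.
    print(f"implied C1060 peak: {implied_peak(370, 0.40):.0f} GFLOPS")
    # Peak needed to sustain 1 TFLOPS at the hoped-for efficiency range.
    for eff in (0.80, 0.90):
        need = required_peak(1000, eff)
        print(f"at {eff:.0%} efficiency: need {need:.0f} GFLOPS peak for 1 TFLOPS")
```

Under that reading, an 80-90% efficient part needs a theoretical peak somewhere between roughly 1.1 and 1.25 TFLOPS to sustain 1 TFLOPS in SGEMM.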

Also, we wonder what the restructured memory infrastructure means for the GPGPU version of AMD’s Evergreen architecture. With roughly 2x the compute power, Radeon 5870 / FireStream 9370 should give out around 600 GFLOPS in the SGEMM benchmark, but we don’t know if that number is correct.