Tuesday, March 16, 2010
Intel Larrabee finally hits 1TFLOPS - 2.7x faster than nVidia GT200!



During the recently held SC09 conference in Portland, Oregon, Intel finally managed to reach its original performance goal for Larrabee. Back in 2006, when we got the first details about Larrabee, the performance goal was "1TFLOPS @ 16 cores, 2.0 GHz clock, 150W TDP". During Justin Rattner's keynote, Intel demonstrated the performance of LRB as it stands today.

In the SGEMM performance test [4K by 4K Matrix Multiply, QCD], Intel achieved 417 GFLOPS using half the cores on the prototype card, and reached 825 GFLOPS by enabling all the cores. Looking at the numbers alone, one might think that these scores are below the level of ATI Radeon 4850 and nVidia GeForce GTX 280/GTX 285. Of course, there is a "but" coming: unlike the theoretical peak numbers usually disclosed by ATI and nVidia, this was an actual SGEMM benchmark calculation of the kind used in the HPC community.
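For readers wondering how a GFLOPS figure falls out of an SGEMM run, here is a minimal sketch. It assumes the conventional operation count of 2 x N^3 for an N by N matrix multiply and assumes that "4K" means 4096; neither detail was stated in the keynote, and the function names are ours, purely for illustration.

```python
def sgemm_flop_count(n: int) -> float:
    """Conventional operation count for C = A * B with n x n matrices."""
    # Each of the n*n outputs needs n multiplies and (about) n adds -> 2 * n^3.
    return 2.0 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Sustained GFLOPS for one n x n SGEMM that took `seconds`."""
    return sgemm_flop_count(n) / seconds / 1e9


def seconds_at(n: int, rate_gflops: float) -> float:
    """Time one n x n SGEMM takes at a given sustained rate."""
    return sgemm_flop_count(n) / (rate_gflops * 1e9)


N = 4096  # assumption: "4K" means 4096, not 4000
for label, rate in [("half the cores", 417.0), ("all cores", 825.0)]:
    print(f"{label}: {seconds_at(N, rate) * 1000:.1f} ms per multiply")
```

The point is that the score is derived from a measured wall-clock time for real work, which is why it is not directly comparable to a datasheet peak.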

Intel Larrabee reaches 1TFLOPS in SGEMM BLAS test, 4Kx4K matrix

The keynote continued while the engineers scrambled at the back to try to beat the 1TFLOPS barrier. A couple of minutes before the end of the keynote, Justin added the infamous "And one more thing…" Initial overclocked performance was 913 GFLOPS, moved slowly past 919 GFLOPS, bounced up to 997 GFLOPS and ultimately passed the 1TFLOPS barrier with 1006 GFLOPS. Now, we can debate the numbers all we want, but the fact of the matter is that nVidia Tesla C1060 delivers only 370 GFLOPS in an identical SGEMM 4Kx4K calculation. Thus, Larrabee today comes in at 2.7x the math performance of the GT200 chip.
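The ratios behind these claims are easy to check. The sketch below uses only the GFLOPS figures quoted in this article; the variable names are ours.

```python
# SGEMM 4Kx4K results quoted in the article, in GFLOPS.
half_cores = 417.0    # Larrabee, half the cores enabled
all_cores = 825.0     # Larrabee, all cores enabled
overclocked = 1006.0  # Larrabee, overclocked, end of keynote
tesla_c1060 = 370.0   # nVidia Tesla C1060 (GT200)

scaling = all_cores / half_cores        # gain from doubling the core count
overclock_gain = overclocked / all_cores  # gain from the overclock alone
vs_tesla = overclocked / tesla_c1060    # the headline ratio

print(f"half -> all cores: {scaling:.2f}x")
print(f"overclock gain:    {overclock_gain:.2f}x")
print(f"vs. Tesla C1060:   {vs_tesla:.1f}x")
```

Note that the 2.7x figure compares the overclocked Larrabee run with the Tesla card; the non-overclocked 825 GFLOPS run would give a smaller ratio.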

In comparison, GT200-based Tesla card reaches 370 GFLOPS...

One might mention that AMD's GPU line-up is more efficient than nVidia's, but unfortunately the situation is rather complex due to the state of AMD's GPGPU development. AMD's architecture is very strong in theoretical performance and in real-world gaming. In the GPGPU world, AMD ditched everything else to focus on OpenCL development, and the results will come in 2010. But those efforts cannot compensate for architectural limitations. As we disclosed on numerous occasions, AMD introduced the 1Fat+4Thin concept with the ATI Radeon 2900XT: a core cluster consists of one unit for transcendental operations and four units for Multiply-Add/Add/Integer-Add/Dot operations. Thus, the Radeon 4800 family comes with 160 such cores, comparable to nVidia's 30 clusters of 8 fully-featured cores each, i.e. ATI's 160 vs. nVidia's 240 cores.

Long story short, the real-world SGEMM performance of AMD's FireStream 9270 board [Radeon 4870] is 300 GFLOPS, weaker than GT200. We don't have information about SGEMM performance of Evergreen GPUs [5700, 5800, 5900 series] but as soon as we learn the numbers - we'll let you know. The same thing goes for nVidia's long-delayed NV100-based family of products.

But as of SC09, the top five performing products for SGEMM 4K x 4K are as follows [do note that multi-GPU products are excluded as they don't run SGEMM]:
1.  Intel Larrabee [LRB, 45nm] - 1006 GFLOPS
2.  EVGA GeForce GTX 285 FTW - 425 GFLOPS
3.  nVidia Tesla C1060 [GT200, 65nm] - 370 GFLOPS
4.  AMD FireStream 9270 [RV770, 55nm] - 300 GFLOPS
5.  IBM PowerXCell 8i [Cell, 65nm] - 164 GFLOPS

If you're wondering where products such as Intel Harpertown-based Core 2 Quad or Nehalem-based Core i7 stand, the answer is quite simple - i7 XE 975 at 3.33 GHz will give you 101 GFLOPS, while Core 2 Extreme QX9770 at 3.2 GHz gives out 91 GFLOPS. Regardless of how hard we tried, we weren't able to find performance figures for AMD CPUs using a 4K by 4K matrix.
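To put the list and the CPU figures above on one scale, here is a small sketch that ranks every quoted result and expresses it relative to the Core i7. The numbers are the ones quoted in this article; the dictionary keys are just labels we picked.

```python
# SGEMM 4Kx4K results as quoted in the article, in GFLOPS.
results = {
    "Intel Larrabee": 1006.0,
    "EVGA GeForce GTX 285 FTW": 425.0,
    "nVidia Tesla C1060": 370.0,
    "AMD FireStream 9270": 300.0,
    "IBM PowerXCell 8i": 164.0,
    "Intel Core i7 XE 975": 101.0,
    "Intel Core 2 Extreme QX9770": 91.0,
}

baseline = results["Intel Core i7 XE 975"]

# Sort from fastest to slowest and show each result relative to the Core i7.
ranking = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:30s} {score:7.0f} GFLOPS  {score / baseline:5.1f}x Core i7")
```

On these figures the overclocked Larrabee result is roughly ten times the Core i7 score.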

Larrabee board shown at SC09 differed from Larrabee board at IDF - note the magenta stripe on the heatsink

As you can see for yourself, Larrabee is finally starting to produce some positive results. Even though the company has had silicon for over a year and a half, the performance simply wasn't there, and naturally, whenever a development hits a snag, you either give up or give it all you've got. After hearing that the "champions of Intel" moved from CPU development into the Larrabee project, we can now say that Intel will deliver Larrabee at whatever price the company is ready to pay. The fact that the design cost for Larrabee is probably as high as the combined GPU R&D cost of nVidia and AMD over the past… 3 years doesn't exactly play a role here. Intel has enough cash to deliver the part and not worry about TSMC's hiccup, which only accelerated AMD's plans to move GPU production away from TSMC [to GlobalFoundries] in 2011, leaving nVidia as the only major client.

There are several questions that are yet to be answered, such as the efficiency of Tesla C2050/C2070 GPGPU cards. If nVidia raises the efficiency from the current 40% to an expected 80-90%, Tesla chips should give out more than 1TFLOPS, but neither Larrabee nor NV100 is out the door yet.
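"Efficiency" here means measured SGEMM throughput divided by theoretical peak. The sketch below shows the arithmetic. Note the assumption: the Tesla C1060 peak (240 cores x 1.296 GHz x 3 flops per clock, about 933 GFLOPS) comes from nVidia's published specifications and is not stated in this article. We deliberately do not plug in a peak figure for the C2050/C2070; instead we compute what peak would be needed to pass 1 TFLOPS at the efficiencies mentioned.

```python
def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak actually sustained in the benchmark."""
    return measured_gflops / peak_gflops


def peak_needed(target_gflops: float, eff: float) -> float:
    """Theoretical peak required to sustain `target_gflops` at efficiency `eff`."""
    return target_gflops / eff


# Assumption (not from the article): C1060 single-precision peak,
# 240 cores x 1.296 GHz x 3 flops per clock.
C1060_PEAK_GFLOPS = 240 * 1.296 * 3

print(f"C1060 efficiency: {efficiency(370.0, C1060_PEAK_GFLOPS):.0%}")
for eff in (0.8, 0.9):
    print(f"peak needed for 1 TFLOPS at {eff:.0%}: {peak_needed(1000.0, eff):.0f} GFLOPS")
```

Under that assumption the 370 GFLOPS result works out to roughly the 40% efficiency quoted above.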

Also, we wonder what the restructured memory infrastructure means for the GPGPU version of AMD's Evergreen architecture. With roughly 2x the compute power of the RV770 generation, Radeon 5870 / FireStream 9370 should give out 600 GFLOPS in the SGEMM benchmark, but we don't know if that number is correct.



© 2009 - 2010 Bright Side Of News*, All rights reserved.



Comments:

Riiight... by: Anonymous on 12/7/2009
"1 for Intel and a kick in the short and curlies to nVidia and AMD!

Microsoft should consider Larrabee for XBOX 720 since SONY has gone with Imagination Technologie Power VR Series 6 for PS4"

You really have no idea what you're on about...

ah well...


http://www.reuters.com/article/idUSTRE5B51QR20091206?type=technologyNews
by: rvalencia on 12/7/2009
GT5 Prologue (demo) = 1080p mode is 1280x1080 (2xAA) in-game while the garage/pit/showrooms are 1920x1080 with no AA. 720p mode is 1280x720 (4xAA)

Refer to http://forum.beyond3d.com/showthread.php?t=46241

GT5 Prologue's in-game 1080p is fake i.e. not real 1920x1080.
by: rvalencia on 12/7/2009
From Beyond3D's forum, a developer was able to achieve ~1 TFlops (1000 GFlops) using Radeon HD 4870 for the SGEMM benchmark.

http://forum.beyond3d.com/showthread.php?t=54842
dude by: Anonymous on 12/3/2009
sgemm has nothing to do with qcd dude - it was 'the other' number that was for qcd - the one that was, what.. 8 GFlops?
Anyways sgemm is full square matrix multiply - qcd, or Dirac solver, is all about sparse matrices and until lately has been variants of conjugate gradient algorithm (sparse matrix - full vector multiply), with an almost diagonal sparse matrix => general purpose multiply is either crazy.
Re. Theo by: Anonymous on 12/2/2009
Theo,

I'm a CUDA developer. If you've met CUDA devs who think this business is about tricks and magic they just didn't know what they were talking about.

I would have to say that 95% of all the techniques can be found in the programming guide and best practices guide.

There are a few performance enhancers that can be used on both Nvidia and AMD/ATI cards that aren't really documented. According to my reading analogous optimizations can be found both on the Cell BE and on regular Intel processors.

There is nothing in the closet about GPGPU programming these days. Everything is well documented with more and more development tools on the way because that's the way Nvidia and others want it.

There is nothing mystical about GPGPU programming other than that it's new and it's scaring the sh*t out of Intel. Like someone mentioned, they are about 2 years behind.
Larrabee is a graphics card? by: Greg442 on 12/2/2009
Wait I'm confused, Larrabee is a graphics card? not a cpu with graphics integrated? Why in the hell would Intel think it was a good idea to compete with ATI/NVidia building graphic cards? Only a fool would buy a first generation, completely “experimental" graphics card from Intel. This is an "EPIC FAILURE" considering the fact the R&D on Larrabee is rumored to exceed the cost of both ATI 5000 Series and NVidia Fermi... ridiculous
half the story by: Anonymous on 12/2/2009
Ok, I understand that gpu computing isn't about video display, but about number crunching, but the rest of the actual video cards that perform this feature ALSO render video. Come back Intel, when your card actually does both. Otherwise why slap any name to it such as Larrabee....
Larrabee by: Anonymous on 12/2/2009
Is a Graphics Card - it is not a CPU/ GPU - it requires a CPU. The 1 TFlop is a huge milestone, now we need to see Crysis 2 running on with all the shaders, anti-aliasing, physics, etc

1 for Intel and a kick in the short and curlies to nVidia and AMD!

Microsoft should consider Larrabee for XBOX 720 since SONY has gone with Imagination Technologie Power VR Series 6 for PS4
Theo, GT5 @ 1080p *IS* a trick by: Anonymous on 12/2/2009
... and a dirty one at that. In reality the RSX renders it at 1280x1080, but anamorphically (rectangular pixels instead of square ones). The picture then gets stretched sideways by PS3s internal HW scaler to what would *appear* to be 1920x1080, while the rectangular pixels become square ones so the picture does not look funny.

Wipeout HD uses the same HW scaler trick to scale on-the-fly and make it look like 1920x1080 @ solid 60 fps, when most of the time it's somewhere between 1920/1280 x 1080. Apparently the dev got a lot of praise for this stunt. I have to admit, it is sorta impressive as a technical solution, but as a gamer I'd loathe to see this happen often. And they had the nerve to present it as fully-fledged Full HD experience.
RE: Standards... by: Theo Valich on 12/2/2009
HPC standards are the ones GPU needs adjusting to, not the other way around.

Bear in mind that everybody is using their own
4K by 4K matrix is an industry standard and if different CPU vendors go by it, there is no pardon given if a GPU wants to capture not a small slice, but a lionshare of the multi-billion dollar business.

Graphics was a lot of "tricks'n'hoes", as one developer colorfully put - whenever I talk with gamedevs about their title, the amount of tricks that has to be put inside "because the HW can't support it" or "because the routine would be too slow on slower HW" etc etc etc. Only clearing those tricks out can lead to a better performance.

When you see a PlayStation 3 game rendered in 1080p, especially Gran Turismo 5 - bear in mind that is cut-down GeForce 7900 GTX with 128-bit interface.

The development of PS3 games shows the result you get when you have access to low-level hardware and there are no secrets to go around. The time has come for PC hardware to be open and clean, and let's see what kind of applications we can have without ghosts in closets.

Ed.
Larrabee scaling problem? by: Anonymous on 12/2/2009
The demo utilized only half of the available cores. Why? Is the Larrabee architecture incapable of scaling any further? Is it due to thermal/power limitation? Or both?
by: Anonymous on 12/2/2009
300Gflops for Rv770?? that's TOTALLY UN-OPTIMIZED!! some guy @ Beyond3D squeezed 2.7TFLOPS in FP32 Matrix Mul!!!
by: Anonymous on 12/2/2009
If nvidia is scalar and ati is vector based, then comparing them directly would be pointless. Can someone confirm this?
re: by: Anonymous on 12/2/2009
Yes, CUDA is easier. Nvidia is constantly offering more and more support to developers, they are really taking HPC seriously.

I believe this is also a huge benefit for gaming since it should also mean that game developers can squeeze more juice out of the hardware.

I saw people mentioning that some of the older amd/ati cards had serious trouble with coalescing memory ( using the on-chip to off-chip) bandwidth efficiently which really drives down efficiency.

CUDA is scalar, AMD is vector based...
by: Anonymous on 12/2/2009
So it means that gt200 is slower than rv770 theoretically, but easier to write code for. In other words rv770 is fast, but writing a real world application that would take 100% of its computing potential would be a very difficult task. Am I right? Is this why cuda is so popular?
easier by: Anonymous on 12/2/2009
x86 my a*s, with a 100 vector ops required
to reach even a small percent of theoretical peak. I bet writing fast code will be as complicated as writing good shaders right now, that is keeping mem/alu ratio, hot texture caches, slot occupancy etc
by: Anonymous on 12/2/2009
A more efficient implementation of SGEMM on ATI cards has achieved 92% theoretical peak.
http://cerberus.fileburst.net/showthread.php?s=fbfd66aadcfb503bc6e82afcf1f4fcc4&t=54842

In the right hands, the 5970 should be able to get 3+ TFLOPs. And you can buy the card NOW...sorta.
by: Anonymous on 12/2/2009
it basically tries to do all the hardware stuff modern gpu's do in software instead. Also it's based off the x86 instruction set, so they believe it'll be easier for programmers to work with.
what.. by: Anonymous on 12/2/2009
What is this Larabee again ??

Is it a GPU ?
Is it a CPU ?
Is it supposed be a CPU/GPU combo ?
1.8 > 0.9 by: Anonymous on 12/2/2009
Yes, there is an implementation running at 2/3 on an AMD card.

Thus it is doing around 1.8 TFLOPs. 1.8 TFLOPs is more than 0.9xx TFLOPs ( pun intended).