"CPUs are easier to make than GPUs" Now, first and foremost, we have to disclose that it is excruciatingly hard to create a graphics processor. Even though some skeptics will say that
"you just build one shader unit and then multiply it by an X factor", that is frankly, a load of bull. Today's graphics processors are massively parallel beasts that require two factors to work: drivers and massively parallel hardware. This was confirmed to us by engineers at ATI, nVidia and Intel - so forget about picking sides here.
The DirectX 11 generation graphics parts from ATI and nVidia feature wide-ranging extensions to the chip microarchitecture itself, and saying that it is easy to create such a chip is, again, a load of bull. GPUs and CPUs operate on completely different sides of the computing scale. The CPU is optimized for random operations simply because it cannot predict what it will be asked to calculate, while the driver does most of the work for the GPU and just queues hundreds of thousands of instructions waiting to be churned out. Thus, a CPU needs a shed load of cache; GPUs do not [unless you want to use them for computational purposes].
Intel LRB was designed to go head to head against these two, but the part... especially the right one. As CPUs and GPUs move in the same direction and become massively parallel beasts, the GPU is gaining cache and bandwidth. For example, ATI's Radeon 5870 GPU comes with 160KB of L1 cache, 160KB of Scratch cache and 512KB of L2 cache. The L1 cache features a bandwidth of around 1TB/s, while L1 to L2 cache bandwidth is 435GB/s. This flat out destroys any CPU cache bandwidth figures, and we're talking about a chip that works at "only" 850 MHz. Recently, SmoothCreations launched a factory overclocked card at 950 MHz for the GPU, pushing the bandwidth figures to over 1.1TB/s for L1 and almost 500GB/s for L1 to L2 cache. Bear in mind that this is a 40nm part.
On the other side of the fence, nVidia recently announced its Fermi architecture, better known as the architecture that will end up in GT300/NV70/GF100 chips. The cGPU based on the Fermi architecture features 1MB of L1 cache and 768KB of L2 cache. One megabyte of L1 cache is more than any of the higher-volume CPUs currently shipping can offer. Just look at the CPUs below:
- AMD Quad-Core Shanghai = 512KB L1 [64KB Instruction + 64KB Data per core]
- AMD Six-Core Istanbul = 768KB L1 [64KB Instruction + 64KB Data per core]
- Intel Quad-Core Nehalem = 256KB L1 [32KB Instruction + 32KB Data per core]
- Intel Six-Core Dunnington = 384KB L1 [32KB Instruction + 32KB Data per core]
- Intel Octal-Core Nehalem-EX = 512KB L1 [32KB Instruction + 32KB Data per core]
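The totals in this kind of list are simply the core count multiplied by the per-core L1 [instruction plus data]. A minimal sketch of that arithmetic, using per-core figures quoted in the list for a few of the parts; the labels and the function name are ours, for illustration only:

```python
# (label, cores, L1 instruction KB per core, L1 data KB per core)
CPUS = [
    ("AMD quad-core", 4, 64, 64),
    ("AMD six-core", 6, 64, 64),
    ("Intel quad-core Nehalem", 4, 32, 32),
    ("Intel eight-core Nehalem-EX", 8, 32, 32),
]

FERMI_L1_KB = 1024  # 1MB of L1, as quoted for Fermi

def total_l1_kb(cores, instr_kb, data_kb):
    """Chip-wide L1: every core carries its own instruction and data cache."""
    return cores * (instr_kb + data_kb)

totals = {label: total_l1_kb(c, i, d) for label, c, i, d in CPUS}

for label, kb in totals.items():
    print(f"{label}: {kb}KB L1 (Fermi has {FERMI_L1_KB / kb:.1f}x as much)")
```

Every one of these totals stays below Fermi's 1MB.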
Furthermore, nVidia features clusters of 32 Fused Multiply-Add [FMA] capable cores, each able to handle Integer or Floating-Point instructions. In comparison, Intel will support FMA with Larrabee as a cGPU and with the 2012 Haswell architecture as a CPU.
Now, does this sound easy to make? If it were easy, we would not have gone from almost 80 GPU companies at the beginning of the 21st century to eight [ATI, ARM, Imagination Technologies, nVidia, S3, SiS, Matrox], with only two making discrete products in serious volumes in the desktop and notebook segments, and two licensing very serious volumes in the handheld business. Even though Intel owns 50% of the world-wide graphics market, we cannot consider it a customer-oriented "GPU" vendor, given the performance of its parts. Just ask Microsoft how many waivers Intel hardware has in its certification process [hint: the number is higher than nVidia's and ATI's worst non-compliant hardware combined]. We know the number and one thing is sure: it ain't pretty.
Thus, Intel knew what it had to do - or risk becoming a dinosaur in an increasingly visual world. The company knew the road to Larrabee would be difficult. The only problem is that Intel's old-school thinking underestimated the size of the task at hand and the time it would take to complete such a project.
AMD's reaction: We'll merge with nVidia - a marriage that never happened
AMD's reaction to Intel's split into CPU and cGPU and future fusion parts was quite simple: Hector J. Ruiz and his executive team began to discuss a merger with nVidia, which ultimately fell through in the second half of 2005. AMD knew nVidia's roadmaps just like nVidia knew AMD's, thanks to the now-defunct SNAP [Strategic nVidia AMD Partnership], formed in order to get the contract for the first Xbox. A few weeks after those negotiations went under [Jen-Hsun’s major and unbreakable requirement was the CEO position, which Hector refused], AMD started to talk about the acquisition of ATI, which ultimately became a reality a few months down the line [July 2006].
Had this merger gone through, there is little doubt that today's ION chipset would be an MCM [Multi-Chip Module] and the world of netbooks would probably look a whole lot different [remember whose hardware was inside the world's first netbook?]. But AMD went with a less aggressive company and the strategy is paying off now.
The real question of whether AMD overpaid for ATI Technologies Inc. can only be answered once the cost of Intel building Larrabee on its own becomes a matter of public knowledge. Over the past few years, we heard several different calculations, with almost every one being well over a billion dollars. The worst case we heard was "we burned through three billion USD", but that belongs in the speculation category. Do bear in mind that the figures aren't coming from the bean counters and that the cost of slippage cannot be calculated yet.
Intel Larrabee specifications
One of the early Intel Larrabee PCB layouts - not much has changed on the current prototypes
If you are wondering what Larrabee's specifications are, the project's goal was manufacturing a chip with 16 cores and 2-4MB of L2 cache clocked at 2 GHz, all packed into a 150W power envelope [for the chip alone]. The chip was supposed to be manufactured in the well proven and paid-out 45nm process technology. In fact, the only thing that keeps the bean counting dogs away is the fact that Larrabee will be manufactured in 45nm with all the factory investment paid out, as the company has more wafers than some wafer suppliers. Still, our sources pitched the cost of a single chip [with packaging] at around $80.
The memory controller resembled ATI's R600: a "1024-bit" internal ring-bus [two way 512-bit] with over 1TB/s of bandwidth connecting to eight 64-bit memory controllers [512-bit in total] that control 1GB of GDDR5 memory clocked at 1.0 GHz [4.0 GT/s], for an external bandwidth of 256GB/s. As it turned out, the available video memory bandwidth became a problem later in development.
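For readers who want to check the external bandwidth figure, it follows from the bus width and the transfer rate. A minimal sketch, using only the numbers quoted above; the function name is ours, for illustration:

```python
def memory_bandwidth_gbs(controllers, bits_per_controller, transfer_rate_gts):
    """Peak external bandwidth in GB/s: bus width in bytes times transfers per second."""
    bus_bytes = controllers * bits_per_controller / 8  # 8 x 64-bit = 512-bit = 64 bytes
    return bus_bytes * transfer_rate_gts

# Eight 64-bit controllers, GDDR5 at 4.0 GT/s.
bandwidth = memory_bandwidth_gbs(8, 64, 4.0)
print(f"{bandwidth:.0f} GB/s")
```

This is a theoretical peak; it says nothing about how much of that bandwidth the chip could actually sustain.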
Looking at the execution core itself, it was in-order x86 [without going into tech overdrive, we can tell you that the principle is vaguely "similar" to the one Intel uses in the Atom CPU], capable of handling four threads on-the-fly [the Nehalem architecture supports two threads - Hyper-Threading]. This was a bastard child of the improved Pentium MMX core [P55c], but laid out in "in-order execution" fashion. The 16-wide Vector SIMD [Single Instruction Multiple Data] unit carries the heaviest burden - it is capable of handling 16 32-bit [512-bit] Advanced Vector Extensions. In order to have everything working properly and avoid starvation, the cores were interconnected with the aforementioned ring-bus controller. Internally, each Larrabee core features 64KB of L1 cache [32KB Data, 32KB Instruction], just like Nehalem processors.
As performance simulations commenced, it was clear that 16 x86 cores with AVX extensions would not attain the performance needed to reach the projected GPUs from ATI and nVidia in 2008-2009, thus the roadmap was expanded with 24, 32 and 48 core parts and the L2 cache was kept at 256KB per core. In the case of 32 cores, the Larrabee cGPU should have 8MB of cache, while the 48 core version should have 12MB. The 24 core version would still have 8MB of physical cache, but it remains to be seen whether that means the full 8MB is accessible to all 24 cores or whether every core keeps its 256KB and "that's that". According to the sources we spoke with, the "size of L2 cache is not an issue. We are the industry benchmark for SRAM cache density and we can put as much cache as we want. Nobody can touch us [in that perspective] and everybody knows that."
This is the plan for Intel Larrabee on the HPC side: CPU with mostly LRB cores for capturing HPC contracts.
To disclose some figures, Intel was targeting 1TFLOPS of computing power with 16 cores, but was unable to reach that target, hence the 50% increase in units [to 24 cores] for the baseline part. The 24 and 32 core parts were planned to use the 45nm process node, while the 48-core version would come out in 32nm. Please do note that the 24 core version was actually the 32-core version, but with some faulty cores disabled. The 8 and 16 core parts would be integrated inside the CPU as a part of the tock version of the Haswell architecture [the follow-up to Sandy Bridge and Ivy Bridge - Sandy Bridge would have the final GMA-based part, swapping for a Larrabee core in a 22nm refresh] and "AMD would be done".
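To put the 1TFLOPS target in context, here is a rough paper calculation of peak throughput. It is a sketch under stated assumptions, not Intel's own math: it uses the 16-wide vector unit and the 2 GHz target clock from the specifications, counts one Fused Multiply-Add per lane per clock as two floating-point operations, and assumes the 24-core part would run at the same clock.

```python
def peak_gflops(cores, simd_lanes=16, clock_ghz=2.0, flops_per_lane=2):
    """Theoretical single-precision peak in GFLOPS.

    flops_per_lane=2 assumes one fused multiply-add per lane per clock.
    """
    return cores * simd_lanes * flops_per_lane * clock_ghz

target_16 = peak_gflops(16)    # original 16-core goal
baseline_24 = peak_gflops(24)  # same clock assumed (hypothetical)

print(f"16 cores: {target_16:.0f} GFLOPS")
print(f"24 cores: {baseline_24:.0f} GFLOPS")
```

On paper, 16 cores at 2 GHz come out right around the 1TFLOPS mark, so any shortfall in clock speed or vector unit utilization would push the real part below the target. Peak numbers like these say nothing about sustained throughput.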
That was the plan. Now, let's see where the Larrabee train derailed and where it is right now.
© 2009 - 2013 Bright Side Of News*, All rights reserved.