It is never boring in the world of CPUs. Regardless of who's on top, plans for next generations tend to excite everybody in the ecosystem… if you deliver, that is. AMD has had a lot of tough times of late, and lost a lot of good people due to a lack of proper management. In this article, we bring you a look into the architecture that everybody in the industry has been impatiently waiting for. But this time, AMD cannot afford to fail.
The ex-Alpha engineering teams led by Dirk Meyer that created the K7 and K8 architectures messed everything up with Barcelona/Agena and the infamous TLB bug [Translation-Lookaside Buffer]. Shanghai/Deneb cleaned a lot of things up and AMD is back to being competitive, but Intel is pushing hard: Intel is operating in tick-tock architectural mode, and so far AMD hasn't been able to answer back. K10 and K10.5 were nothing but improvements over the K8 architecture. The last time we saw a completely new architecture from AMD, the stock market thought an online dog-food shop was worth half a billion US$, and mainstream media were touting that the world would end with that horrible Y2K bug... Yes, quite a long time ago. But before we dig into Bulldozer's architecture, let's set the record straight with a simple architectural comparison between AMD and Intel.
AMD's Kryptonite versus Intel's Tick-Tock: are things really as they seem?
Looking at public and leaked roadmaps, it looks like AMD's K11, or Bulldozer core, is shaping up to be what the Core architecture was for Intel. AMD went to the drawing board back in 2005 and started work on the "K11" architecture. Intel is touting its Tick-Tock model, but let's take a look at the real state of the market, setting marketing statements aside:
AMD Modern Architectures - Our take
- K5 [Kryptonite 5]: 1996, K6: 1997, K6-2: 1998, K6-III: 1999
- K7: 1999, K7.5: 2001
- K8: 2003, K9: 2005, K10: 2007, K10.5: 2008
Intel's Modern Architectures - Our take
- P5 [Pentium]: 1993, P55C [Pentium MMX]: 1996
- P6 [Pentium Pro]: 1995, P6B [Pentium II]: 1997, P6C [Pentium III]: 1999, P6D [Pentium M]: 2003, P6E [Core]: 2005, Core 2 ["P6F"]: 2006, Nehalem ["P6G"]: 2008
- NetBurst: 2000, NetBurst HT: 2003, NetBurst AMD64 [Prescott 2M]: 2004
- P7 [Itanium]: 2001
- Atom: 2008
As you can see, up to today AMD has delivered only three "tick" and seven "tock" architectures, with the latest one breaking the tradition: K5 launched in 1996, and K7 followed three years later. K8 was an evolution planned to debut in late 2001, but numerous [manufacturing-related] delays postponed the part until April 2003. If the current schedule holds, we will have waited eight years between K8 and K11.
At the same time, even though Intel likes to tout its "Tick-Tock" cadence, the reality is that even the Nehalem architecture is remotely based on the Pentium Pro core, and if we look through the "P6" lineage, we see that Intel has delivered five genuinely new architectures as ticks, and a gazillion tocks. Applying an "apples to apples" metric, we get eight completely new architectures between the two companies, improved numerous times. Out of those eight, one ended up as the best CPU micro-architecture of all time [P6, if there was any room for doubt], one started a multimedia revolution [P55C], one showed the right path of computing evolution to 64-bit [K8], and two were failures: NetBu[r]st and Itanic.
Thus, Intel and AMD are very much alike, even though we are bound to get criticized for these lines. It's not our fault the Core architecture was coded as P6-based in Intel's own papers. We could argue for putting Nehalem in the same basket as the Core 2 architecture, since it contains numerous improvements and nicely copies DEC Alpha's design - yes, oh shocker. In case you don't know, the IMC [Integrated Memory Controller] was massively used in the Alpha micro-architecture, the fastest architecture of its day even when running x86 code [it wasn't x86 itself - it ran x86 through a translation layer on its RISC core]. The Alpha architecture was sold to Intel, and Dirk Meyer's Alpha team switched to AMD [and that's how Opteron came to life]. Then again, we almost got Intel's short-sighted vision of "NetBust will get us to 10 GHz". Luckily for our power bills, the laws of physics made sure that 1kW TDP CPUs never came to market.
If you're interested in the Alpha 21264 and 21364, you might like to know that this decade-old CPU architecture featured a 10-channel RDRAM IMC, with two channels used for redundancy. The remaining eight achieved higher bandwidth than the Intel Core 2 Quad, a CPU released almost eight years later. Now that we're done with this look into the past, it's time to take a good look at AMD's future.
M-SPACE or how Fusion came to be…
According to our sources, the Bulldozer architecture is actually a consequence of the failed tie-up between AMD and nVidia. Back in 2005, AMD felt that it had Intel by that certain part of the male body [direct quote from an unnamed exec] and wanted to merge with nVidia. That fell through because Jen-Hsun Huang [rightfully?] wanted the CEO position, and the rest is history: AMD had already borrowed money to buy nVidia and had no choice but to seal the deal with ATI Technologies instead.
The key reason for the birth of the Bulldozer architecture is the M-SPACE design [Modular-Scalable Portable Accessible Compatible Efficient], a GPU-like "LEGO block" architectural concept that became a mantra in AMD's halls. Under the M-SPACE design guidelines, the Bulldozer [10-100W TDP] and Bobcat [1-10W] cores were supposed to address different market segments, but the way of creating a processor was exactly the same. The goal was to have Bobcat addressing the OLPC/netbook/MID market, then considered a crazy vision of Nicholas Negroponte's - can anyone today say "Nicholas was crazy"? Bulldozer was the "big daddy" core, going head to head against the Pentiums and Xeons of the day. Unfortunately for AMD, Intel got there first [Core 2, Atom].
How ATI takes a fully developed architecture and creates an affordable part, while raising the performance of the latter.
In order to understand M-SPACE, we need to take a look at graphics chips: a GPU manufacturer will release a high-end part and then decrease the number of logic units depending on the targeted die size [cost]. AMD saw M-SPACE as the way to offset its biggest disadvantage: lack of available die space. A lot has changed since then. AMD spun off its foundry operations to GlobalFoundries, and with ATI's upcoming 32nm GPUs coming from the ex-Fab38 [Fab 1, Module 2] GlobalFoundries facility in Dresden, the pressure to make M-SPACE work has increased - you now have two products being built under the same roof, unlike the past decision to send the chips to Taiwan and let TSMC glue them together.
How to create a CPU or an APU? According to AMD, the answer is M-SPACE
Under the M-SPACE concept, AMD should be able to create products such as tri-, quad- or even octal-core CPU+GPU dies, and combine such dies with another containing only CPU cores. Yes, a 12-core combination of a quad-core+GPU die and an octal-core die in a single package could be achieved under this plan. One of the key components of M-SPACE is the future CPU sockets - servers will get G34 as part of the Maranello platform [LGA] - but consumer platforms won't stay on Socket AM3 either. AMD plans to introduce G sockets across the board, since they will be a necessity for a new memory controller, display connectors, PCI Express 3.0 and so on. Socket AM3 and its 940 pins just won't cut the mustard, but 2000+ lands on a Land Grid Array might. This also means that pins are waving goodbye from mainstream consumer platforms - AMD will introduce LGA on desktops and start to push BGA [Ball Grid Array] on notebook platforms.
Can Bulldozer bulldoze the competition?
Now that you've read what M-SPACE is, it's time to address the heart(s) of the "Kryptonite 11" micro-architecture. If we take a look at a single Bulldozer core, we see a design optimized for throughput - AMD will not introduce its own version of Hyper-Threading, but rather focus on physically increasing the number of instructions per clock [IPC] through wider internal units. A good example is the newly designed 128-bit FPUs [Floating-Point Units]. Currently, 128-bit instructions are carried out using 32-bit/64-bit FPUs at reduced efficiency [more cycles are needed to process a single instruction]. According to our sources, the GPRs [General Purpose Registers] were increased to 128-bit. Once we learned of this alleged GPR depth, we asked whether that means we can, theoretically, call Bulldozer a "128-bit CPU", and whether "x86-128" is on the way. I will openly admit that I asked the question without giving it a second thought.
It was explained to me that the focus of AMD's design was to increase the number of instructions processed on the fly, meaning that most instructions should use registers in a 64+64-bit or 32+32+32+32-bit fashion, significantly raising IPC compared to the current K10.5 architecture. So, no "x86-128". For now. This new internal architecture enabled AMD to design its first Streaming SIMD Extension set, the 128-bit SSE5. Again, according to our sources, this was also the reason why Intel went into a denial frenzy over a possible implementation of the SSE5 instruction set. "They cannot do it [SSE5] until they really change their architecture. We did, and paid dearly for it [the architectural change]. But we will blow them out of the water"… were the words from one of the e-mails I exchanged with an anonymous CPU designer back when SSE5 development took place [thus, pre-AVX].
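The lane-splitting idea is easier to see in code. Below is a minimal Python sketch [our illustration, not AMD's actual design] of how one 128-bit register can be treated as two 64-bit or four 32-bit lanes, with each lane computed independently - the essence of how SIMD units raise throughput without widening the programmer-visible data types:

```python
def lanewise_add(a, b, lane_bits, reg_bits=128):
    """Add two reg_bits-wide values lane by lane; each lane wraps independently,
    mimicking how a SIMD unit splits one wide register into parallel sub-words."""
    lanes = reg_bits // lane_bits
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        la = (a >> shift) & mask
        lb = (b >> shift) & mask
        # carry never crosses a lane boundary
        result |= ((la + lb) & mask) << shift
    return result

# Two 64-bit lanes: (1, 2) + (3, 4) -> (4, 6)
two_lane = lanewise_add((1 << 64) | 2, (3 << 64) | 4, 64)

# Four 32-bit lanes: an overflowing lane wraps without disturbing its neighbours
four_lane = lanewise_add(0xFFFFFFFF, 1, 32)
```

With 64-bit lanes, the same overflowing inputs would instead carry into bit 32 of the low lane - which is exactly why the lane split, not just the raw register width, defines what the unit can do per clock.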
While it is true that 128-bit SSE instructions are currently executed more slowly due to reliance on 32 and 64-bit FPU registers, we will have to wait and see which side ends up with the better FPU: the 512-bit vector unit inside Larrabee, or Bulldozer's 128-bit units.
Intel's executives and PR managers have publicly stated that Intel will not use SSE5 in its upcoming processors, focusing instead on the 256-bit AVX [Advanced Vector Extensions]. In 2011-2012, you can expect Intel to fuse the Sandy Bridge architecture with several Larrabee cores as its second-generation Fusion CPU+GPU part [the first being 32nm Arrandale/Clarkdale], and offer 256-bit AVX on the CPU socket too.
According to our sources, this is one of the problems with the Bulldozer design - it isn't easy to design an FPU, especially when you have to divert engineering resources to fixing the Barcelona core and shuffle scientists around. One of our sources was highly critical of Dirk Meyer and those decisions, but since most of our sources still work close to the company, we would say that they all found the common goal worth more than their views on management.
One part that is bound to bring confusion is the memory controller. To be perfectly honest, both K10 [Phenom] and K10.5 [Phenom II] did a pretty lame job with the asynchronous clock between the CPU cores and a "Northbridge" block consisting of the memory controller, I/O protocols and L3 cache. The fact that the L3 cache worked at a lower clock significantly reduced its usefulness - you can get a higher performance boost by overclocking the "Northbridge" than by raising the CPU core clocks until they crash. Bulldozer brings even more complexity into the frame - M-SPACE enables GPU-like clock gating, and processors based upon the Bulldozer core should offer power efficiency one step ahead of the most efficient notebook processors. The memory controller continues to be independently clocked, and L3 cache is now a default part of the architecture for both sides in the CPU arena. As far as width goes, here comes the interesting part: AMD's memory controller can be 144-bit, 288-bit or even 576-bit [on MCM processors], though we doubt we will ever see a 576-bit interface. MCM modules will feature a unison of two dies and a merger of cores and L3 cache from one unit with the other, bypassing external memory addressing - thus remaining 288-bit wide even with two physical 288-bit interfaces embedded in silicon. With Virtualization, or AMD-V, continuing to be one of the key architectural accents, the memory controller features a lot of technologies that will make life easier for virtual hosting providers. Every core can address a single channel or use one channel for redundancy - yet another feature from the Alpha 21364 architecture.
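As a sanity check on those widths: they line up neatly with 72-bit channels [64 data bits plus 8 ECC bits, the norm on server-class DIMMs]. A quick sketch of the arithmetic [our reading of the numbers, not AMD documentation]:

```python
# Assumed per-channel layout for a server-class DDR3 interface:
DATA_BITS = 64  # data width of one DDR3 channel
ECC_BITS = 8    # ECC bits per channel on registered/ECC DIMMs

def controller_width(channels):
    """Total memory-interface width in bits for a given channel count."""
    return channels * (DATA_BITS + ECC_BITS)

# dual-channel, quad-channel, and two quad-channel dies on an MCM
widths = [controller_width(n) for n in (2, 4, 8)]  # -> [144, 288, 576]
```

In other words, 144-bit is simply dual-channel with ECC, 288-bit is quad-channel with ECC, and the 576-bit figure is what you get by counting both quad-channel interfaces of an MCM part - which is why it is unlikely to appear as a real, unified interface.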
Since AMD is pairing Bulldozer with the JEDEC-certified DDR3-1600 memory spec, you can expect memory bandwidth ranging from 25.6 to 51.2 GB/s. This part was heavily influenced by the underground overclocking department inside AMD. Those guys have exposed a *lot* of advanced memory options in the CPU design, so Orochi [the desktop version] should have no problems running DDR3-2000 or DDR3-2133 without overclocking the CPU itself - resulting in 32-34.1 GB/s. Since we mentioned overclocking… rest assured, AMD's Bulldozer isn't afraid of being in the cold, as this video demonstrates.
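Those bandwidth figures follow straight from the DDR3 math - each 64-bit channel moves 8 bytes per transfer. A quick check of the quoted ranges [our arithmetic]:

```python
def ddr3_peak_gbs(transfers_mts, channels):
    """Peak bandwidth in GB/s: mega-transfers/s times 8 bytes per 64-bit channel."""
    return transfers_mts * 8 * channels / 1000

low = ddr3_peak_gbs(1600, 2)   # dual-channel DDR3-1600 -> 25.6 GB/s
high = ddr3_peak_gbs(1600, 4)  # quad-channel DDR3-1600 -> 51.2 GB/s
oc = ddr3_peak_gbs(2133, 2)    # dual-channel DDR3-2133 -> ~34.1 GB/s
```

So the 25.6-51.2 GB/s spread corresponds to dual-channel versus quad-channel DDR3-1600, while dual-channel DDR3-2000/2133 lands in the 32-34.1 GB/s range.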
Continued on next page: Product positioning, Server, Desktop, Notebook, Conclusion
CPU becomes APU
The Bulldozer core will be implemented across the range - server, desktop, notebook - launching in servers first, followed by desktop and notebook.
Eight-core mono-die part, a member of the long-delayed Sandtiger family.
Server-wise, AMD plans to introduce three parts: single-die quad-core and octal-core models at launch, with a dual-die 16-core part to follow later. The quad-core and octal-core succeed Sao Paulo [Istanbul on Socket G34], while Magny-Cours [12-core dual-die on Socket G34] will be succeeded by Montreal, a 16-core dual-die part. Note: we first heard the "Montreal" codename back in 2006, so it might have changed by now. All of these parts sit on the Maranello platform, which will be introduced early next year.
Montreal, the successor to Magny-Cours, brings 16 physical cores to a single G34 socket. Release date unknown.
When it comes to the world of desktops, Bulldozer arrives as two parts: Orochi and Llano. Orochi is the first M-SPACE design to ship as both an Opteron and a Phenom, featuring four Bulldozer cores and 8MB of cache. Naturally, AMD "forgot" to count L1 cache in that figure [128KB per core on Agena/Deneb CPUs]. With Orochi being based on a new architecture, it is too early to say how much L1 cache it carries.
Llano is the new key processor for AMD's commercial desktop and notebook efforts. Dubbed an Accelerated Processing Unit [APU], it combines a quad-core processor with an ATI DX11 core [both manufactured at 32nm - the CPU die on SOI, the GPU die on bulk silicon].
Looking at Ontario's specs, it is clear that this dead ringer for Falcon [Kuma+ATI core] is everything Falcon was supposed to be: a dual-core CPU packed with a DirectX 11 based core, using BGA packaging and targeting the ultra-portable and netbook markets.
Meet the APU: can a new core achieve fusion with a DirectX 11 part?
As you can guess, we have seen a lot of these roadmaps over the course of the last couple of years. This one was released back in November as part of AMD's Analyst Day, and given that the first quarter of 2009 is over, we wonder when AMD will launch three of the four products mentioned in its 2009 plan. Still, Orochi, Llano and Ontario look well positioned for today's computing needs. What will happen in 2011 remains to be seen. In 2011.
Going through numerous e-mails and presentations about Bulldozer made us think that AMD really had a winner on its hands: had the company not underestimated Intel and seriously messed up its product development [yes, we know about political directions in the 90/65nm era], we would be writing an architectural preview of a product set to launch at this year's Computex, and the Falcon CPU+GPU would probably have made a killing this Back-to-School season. But what is done is done, and we won't see Bulldozer-based parts until 2011.
If we look at the specs, it is beyond any doubt that this architecture is another "hammer" - but it is a hammer for Intel's line-up of today. Intel will launch 32nm Westmere in 2010 and hold a roughly 11-month advantage over AMD in terms of manufacturing process. To make matters worse, Sandy Bridge is Intel's [allegedly] new architecture en route for 2011, and a big question looms over heads at AMD: what will the state of the market be once Bulldozer finally launches?
2011 is not too late for a Fusion "APU", though. Even though Intel will launch its 32nm Arrandale processor in Q1'10, the performance and compatibility of its integrated graphics are a far cry from usable. Intel's integrated graphics currently does little more than output a picture to the display, and a DX11-compliant, OpenCL-compliant part with decent low-resolution gaming performance will cause serious headaches for Intel. Only once Intel integrates those features into Larrabee and Sandy Bridge will we be able to speak about problems for AMD.
For us, it looks like AMD is on a path of innovation. But when will AMD stop being "late to the party"?