Recently, we exclusively unveiled that Kaveri, the successor to the current "Trinity" high-end APU (Fusion A8 and A10 family), features a GDDR5 memory interface. This time we will talk about the architectural enhancements of AMD's upcoming mainstream APU Kaveri, as well as enhancements of the Steamroller cores, which will also make their way into servers and high-end desktop systems in 2014. The information comes from a "Preliminary BIOS and Kernel Developer’s Guide for AMD Family 15h Models 30h-3Fh Processors" (a similar document, dated January 2012, can be found here), available to interested developers.

AMD's Kaveri APU will feature 4-6 Steamroller cores, a Sea Islands GPU and a northbridge supporting DDR3, DDR3L and GDDR5 memory. In this overview, we look at the GDDR5 version.

At the Hot Chips conference in August 2012, Mark Papermaster already gave a broad overview of what Steamroller will be about (PDF download). AMD has learned from the performance shortcomings of Bulldozer and, to a lesser degree, Piledriver, and is trying to continually improve on the status quo. Back in 2011, when Bulldozer was launched, AMD laid out an ambitious performance roadmap which promised some 10-15% performance improvement each year.

AMD High-Performance Core Roadmap

AMD took several measures to improve CPU performance and enhance functionality. For one, AMD increases the L1 instruction cache size from 64KB to 96KB and changes its associativity from 2-way to 3-way. For virtualization, Steamroller now supports a virtualized interrupt controller, an advanced feature of AMD's hardware virtualization. Additionally, the XSAVEOPT instruction is now supported.

How Steamroller accelerates feeding the cores, increasing IPC

The document lists the following changes to improve instructions per clock (IPC):

  • Store to load forwarding optimization
  • Dispatch and retire up to 2 stores per cycle
  • Improved memfile: tracks the last 8 stores instead of the last 3, and allows tracking of dependent stack operations.
  • Load queue (LDQ) size increased to 48, from 44.
  • Store queue (STQ) size increased to 32, from 24.
  • Increase dispatch bandwidth to 8 INT ops per cycle (4 to each core), from 4 INT ops per cycle (4 to just 1 core). 4 ops per cycle per core remains unchanged.
  • Accelerate SYSCALL/SYSRET.
  • Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
  • Improved loop prediction.
  • Increase PFB from 8 to 16 entries; the 8 additional entries can be used either for prefetch or as a loop buffer.
  • Increase snoop tag throughput.
  • Change from 4 to 3 FP pipe stages.

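The first bullet above targets a very common code pattern: a value is stored (e.g. spilled to the stack) and reloaded shortly afterwards, and the hardware forwards the data from the in-flight store to the dependent load rather than waiting for the write to reach the cache. A minimal C sketch of that pattern, purely illustrative; the `volatile` qualifier keeps the compiler from holding the value in a register, so a real store/load pair is emitted each iteration:

```c
#include <stddef.h>

/* Sums an array through a volatile stack slot. Each iteration stores to
 * `spill` and immediately loads it back; store-to-load forwarding lets the
 * load take its data straight from the pending store. */
int sum_via_spill(const int *a, size_t n) {
    volatile int spill;          /* forces a real store/load pair per iteration */
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        spill = a[i];            /* store */
        sum += spill;            /* dependent load: candidate for forwarding */
    }
    return sum;
}
```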
Improving single-core execution on the AMD Steamroller core, debuting with the Kaveri APU

While this is fairly technical, let's break down the major changes. Many of them increase the size of internal buffers or the throughput of certain paths inside the core, clearly an attempt to remove bottlenecks in the design. One of the more interesting changes is that with Steamroller, each core gets its own integer decoder, which should improve performance when both cores of a module are fully loaded. The shortened floating-point pipeline should lead to an increase in floating-point IPC, as instruction latency is reduced by 25%. On the slide you can see the projected performance implications.

AMD's "Steamroller" big core is optimized for performance/watt, based on experience gained with the Bobcat and Jaguar low-power cores

The document lists a number of other additions we omitted on purpose, notably certain performance counter features. Those don’t affect performance per se, but allow programmers of debugging tools to gain more detailed information on what is going on inside the chip.

But AMD didn’t only beef up the x86 CPU cores; it also extended the architectural features of its APU. As AMD publicized before, Kaveri will be the first APU that allows coherent memory access for the GPU part of the chip. To this end, the communication facilities between the x86 CPU cores and the GPU cores have been extended considerably. The internal interface called Onion, which connects the GPU to the coherent request queues, has been widened to 256-bit in each direction. This allows for faster data exchange between the CPU and GPU, clearly a requirement of HSA.

The Evolution of HSA - Heterogeneous System Architecture

One item in the document that caught our attention reads "Add PCIe endpoint mode". We don’t know the aim of this functionality, but we assume it can be used to attach Kaveri as a PCIe device in any other PCIe-based system, e.g. as a co-processor. Given that the SeaMicro "Freedom" fabric AMD acquired last year is based on PCIe, this opens some interesting possibilities. It would theoretically also allow Kaveri to be put on co-processor cards for servers.

Bear in mind that the "PCIe Endpoint Mode" isn’t a new feature. With the return of Jim Keller to AMD and a resurgent engineering team, we’re seeing a lot of developments come back into play that were killed during the time when AMD was ruled by Hector J. Ruiz and Dirk Meyer. "PCIe Endpoint Mode" bears resemblance to the Torrenza initiative from 2006, which called for opening Opteron sockets to 3rd-party cores. That initiative is still in use by Chinese CPU manufacturers, which are making high-performance multi-core server CPUs based on Alpha and MIPS cores that are giving Intel’s engineers plenty of grey hair. The acquisition of SeaMicro might bring AMD back into the limelight of the 3rd-party ecosystem, using Torrenza in low-volume markets and SeaMicro’s Freedom Fabric in the high-volume server and micro-server markets. Originally, the plan was disclosed to us in 2006 by a now-former AMD executive, the passionate Giuseppe Amato.