Editor's note:
This article was originally published by Juan Ramón González Álvarez at http://juanrga.com/en/AMD-kaveri-benchmark.html

It represents a very well-informed estimate of the performance of future AMD APUs based on the Kaveri core. These estimates are based on the bits of information AMD has disclosed about both Kaveri itself and the Steamroller architecture. Furthermore, the article includes some assumptions that are sensible but may not match the actual specifications. As with any performance estimate, the numbers could be off substantially because we are dealing with incomplete information. For this reason you should take the benchmark numbers with a grain of salt. However, we believe the point of view presented here is valuable.

Before you look at the article, we would like to highlight some of our previous coverage on the topic. We discussed the Steamroller improvements in detail, and talked about Kaveri's role in AMD's roadmap as well as the lack of novelty in the Bolton family of chipsets. More recently we covered AMD's heterogeneous queuing, which is an instrumental part of HSA. Now, without further ado, we present the article.

AMD Kaveri Benchmark

Abstract: The chip-maker AMD will release a new family of APUs by the end of 2013. This new family will be named Kaveri and will replace the current Richland family of APUs. In this article I will consider Kaveri performance using all the information that AMD has disclosed as well as my best guesses on the parts that remain unknown. I estimate that the top Kaveri APU will have a CPU clocked at 4 GHz with a maximum performance of 128 GFLOPS and a GPU clocked at 900 MHz with 922 GFLOPS, giving a total of 1050 GFLOPS for the whole APU. Combining all this data, I predict that the CPU of the top Kaveri APU will be about 26% faster than that of the top Trinity APU and about 17% faster than that of the top Richland APU. This would put the multi-threaded performance of the CPU of the new quad-core Kaveri APU at the same level as an Intel quad-core i5 or a six-core AMD FX with traditional software. The new Kaveri APU will show its real strength with HSA software, which will exploit the performance of both the CPU and the GPU. With HSA-enabled software, Kaveri has the potential to be much faster than an Intel i7 or an eight-core AMD FX. Some developers are finding accelerations of up to 500% when enabling HSA. A collection of APU and CPU benchmarks and scores is given.

The chip-maker AMD will release a new family of APUs by the end of 2013. This new family will be named Kaveri and will replace the current Richland family of APUs. Kaveri will combine two to four 28 nm CPU cores, based on the Steamroller architecture plus HSA improvements, with an HSA-enabled Graphics Core Next (GCN) GPU. This HSA-enabled GCN architecture can work as a traditional graphics card (rendering graphics) or as a co-processor (computing parallel tasks). Kaveri will also introduce a uniform memory model, dubbed hUMA, that will allow both the CPU and the GPU to access a common memory pool.

Many sites report some of the specs of the upcoming Kaveri. In this article I will consider Kaveri performance using all the information that AMD has disclosed as well as my best guesses on the parts that remain unknown.
Figure: First look of the top Kaveri APU. The block diagram shows two modules, each with two Steamroller x86 cores and 2 MB of L2 cache; Radeon HD graphics and multimedia engine; dual-channel DDR3 memory controller; video; PCIe Gen 3 and PCIe Gen 2; and the Bolton SCH with serial interface, SATA 2/3, Low Pin Count interface, USB 2.0/3.0, and PCIe Gen 2.

During the Hot Chips 2012 conference, AMD said that the new Steamroller modules will provide a 30% gain in IPC over the initial Bulldozer modular design. Piledriver introduced about an 8% gain over Bulldozer at the same clocks. This means that Steamroller will introduce about a 20% IPC gain over Piledriver, because 1.08 × 1.20 ≈ 1.30 when rounded to two decimal places. We know that the shared decoder used in both Bulldozer and Piledriver introduces about a 20% penalty compared with a non-clustered core design; we know this from comparing the performance of a two-threaded workload running on two cores in the same module against its performance when each core is in a different module [1]. Of course, Steamroller could be faster, as some rumors suggest; I discuss this possibility below.
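The IPC arithmetic above can be checked in a few lines. This is only a sketch of the compounding step: it takes the 30% and 8% figures quoted in the text at face value and assumes that IPC gains compound multiplicatively across generations.

```python
# IPC relative to the original Bulldozer design, as quoted in the text.
steamroller_vs_bulldozer = 1.30  # AMD's Hot Chips 2012 figure
piledriver_vs_bulldozer = 1.08   # approximate gain at equal clocks

# Assumption: gains compound multiplicatively, so dividing isolates
# the step from Piledriver to Steamroller.
steamroller_vs_piledriver = steamroller_vs_bulldozer / piledriver_vs_bulldozer

print(f"Steamroller over Piledriver: {steamroller_vs_piledriver - 1:.1%}")
# prints "Steamroller over Piledriver: 20.4%"
```

The result, roughly 20.4%, is consistent with the rounded 20% used in the rest of the article.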

AMD has not revealed the frequencies of the new APUs. However, we know that the top Kaveri APU will have a total peak performance of 1050 GFLOPS, rounded to the nearest integer. During the Hot Chips 2012 conference, AMD said that Steamroller will have an FPU with two 128-bit-wide FMAC units plus an MMX unit. This is the same configuration as Piledriver, except that the Steamroller FPU will be streamlined to save die space. Each FMAC unit will be capable of up to 8 FLOP per cycle using FMA4 instructions; those are single-precision (SP) computations. Because the two cores of a module share the two FMAC units, this amounts to 8 FLOP per cycle per core. Therefore, the maximum floating-point performance of the quad-core CPU is $4\ \text{cores} \times 8\ \text{FLOP cycle}^{-1}\,\text{core}^{-1} \times f_{\text{CPU}}$, where $f_{\text{CPU}}$ is the CPU clock frequency in cycles per second.

AMD has revealed that the top Kaveri APU will include a GPU with 512 unified shaders. Each shader can run up to 2 FLOP per cycle of SP computations, which implies the following formula for the maximum performance of the top Kaveri GPU: $512\ \text{shaders} \times 2\ \text{FLOP cycle}^{-1}\,\text{shader}^{-1} \times f_{\text{GPU}}$, where $f_{\text{GPU}}$ is the GPU clock frequency in cycles per second. The following table gives three possible combinations of frequencies for the CPU and the GPU whose combined performance gives the claimed 1050 GFLOPS.

AMD sells Radeon discrete graphics cards with core frequencies of 897, 900, and 907 MHz; of these, the Radeon HD 7750, based on the GCN architecture, has 512 shaders at 900 MHz. Selecting this frequency for the top Kaveri GPU implies a CPU clocked at 4 GHz, which is a frequency between those of the Trinity A10-5800K and the Richland A10-6800K. The very small down-clock from Richland's 4.1 GHz is more than compensated for by the new Steamroller architecture.
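The frequency pairing can be recomputed from the two peak-performance formulas. The sketch below is not the article's original table. It assumes the total is exactly 1050 GFLOPS (AMD's figure is rounded, so the results are approximate) and that the three discrete-card frequencies mentioned above are the GPU candidates. The function and constant names are illustrative only.

```python
# Figures quoted in the text.
TOTAL_GFLOPS = 1050.0        # AMD's (rounded) total for the top Kaveri APU
CPU_CORES = 4
CPU_FLOP_PER_CYCLE = 8       # per core, single precision
GPU_SHADERS = 512
GPU_FLOP_PER_CYCLE = 2       # per shader, single precision


def gpu_gflops(freq_mhz):
    """Peak GPU throughput in GFLOPS at the given clock in MHz."""
    return GPU_SHADERS * GPU_FLOP_PER_CYCLE * freq_mhz / 1000.0


def implied_cpu_ghz(gpu_freq_mhz):
    """CPU clock in GHz that makes CPU + GPU add up to TOTAL_GFLOPS."""
    cpu_gflops = TOTAL_GFLOPS - gpu_gflops(gpu_freq_mhz)
    return cpu_gflops / (CPU_CORES * CPU_FLOP_PER_CYCLE)


# Assumption: candidate GPU clocks are the three frequencies named above.
for mhz in (897, 900, 907):
    print(f"GPU {mhz} MHz: {gpu_gflops(mhz):.1f} GFLOPS, "
          f"implied CPU clock {implied_cpu_ghz(mhz):.2f} GHz")
```

Under these assumptions the implied CPU clocks are about 4.11, 4.01, and 3.79 GHz. The 900 MHz case gives about 921.6 GFLOPS for the GPU and 128.4 GFLOPS for the CPU, matching the rounded 922 and 128 GFLOPS figures.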

Combining all this data [2], I predict that the CPU of the top Kaveri APU will be about 26% faster than that of the top Trinity APU and about 17% faster than that of the top Richland APU. This would put the multi-threaded performance of the CPU of the new quad-core Kaveri APU at about the same level as an Intel quad-core i5 or a six-core AMD FX. I estimate that the Kaveri quad-core APU will have a PassMark CPU score of about 6000 points. Next, I add a collection of CPU benchmarks with estimates of the performance of the top Kaveri APU compared with the competition.
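One way to reproduce the 26% and 17% figures is to multiply the assumed 20% IPC gain by the ratio of clock frequencies. This is a reconstruction, not necessarily the author's exact derivation. It assumes performance scales linearly with IPC and clock, and it uses the base clocks of the A10-5800K (3.8 GHz) and A10-6800K (4.1 GHz) as the comparison points.

```python
IPC_GAIN = 1.20      # Steamroller over Piledriver, from the text
KAVERI_GHZ = 4.0     # estimated clock of the top Kaveri CPU

# Assumption: comparison uses base clocks, not turbo clocks.
BASE_CLOCK_GHZ = {
    "Trinity A10-5800K": 3.8,
    "Richland A10-6800K": 4.1,
}


def relative_speedup(other_ghz, kaveri_ghz=KAVERI_GHZ, ipc_gain=IPC_GAIN):
    """Kaveri CPU performance relative to an older APU (1.0 = equal)."""
    return ipc_gain * kaveri_ghz / other_ghz


for name, ghz in BASE_CLOCK_GHZ.items():
    print(f"Kaveri vs {name}: {relative_speedup(ghz) - 1:.0%} faster")
```

With these inputs the ratios come out at about 1.26 and 1.17, in line with the predictions stated above.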

Kaveri scores for the CPU are obtained by taking Trinity/Richland scores as a base and applying the assumed 20% gain from doubling the decoders per module, minus a correction factor of 5%. This correction factor is a safety margin that accounts for 'systematic' variations in benchmark scores caused by different compiler support for bdver2 flags. Those variations are of the order of 3% to 4%. In practice, I am assuming a worst-case scenario in which the base scores are above average, and thus I increase them by 15% instead of 20%. I assume that the scores obtained in this way for Kaveri are conservative and that the real silicon will probably perform better thanks to improvements from the new bdver3 flags for the Steamroller architecture and hardware improvements not considered here, such as a superior memory subsystem.
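The scoring rule described above reduces to a single scaling factor. The following is a minimal sketch. The function name and the example base score are hypothetical, and any additional clock-frequency scaling is left out because the paragraph does not describe one.

```python
ASSUMED_GAIN = 0.20   # gain attributed to doubling the decoders per module
CORRECTION = 0.05     # safety margin for compiler-related score variations


def estimate_kaveri_score(base_score):
    """Scale a Trinity/Richland benchmark score by the net 15% gain."""
    return base_score * (1.0 + ASSUMED_GAIN - CORRECTION)


# Hypothetical base score, for illustration only (not a measured result).
example_base = 1000.0
print(f"{example_base:.0f} -> {estimate_kaveri_score(example_base):.0f}")
```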

The following are all CPU performance estimates

The above benchmarks use traditional software, which uses only the performance of the CPU and ignores the rest of the APU. The new Kaveri APU will show its real strength with HSA software, which will exploit the performance of both the CPU and the GPU.

During the Hot Chips 2013 conference, AMD showed the acceleration that ordinary applications receive when HSA is enabled. Below I show an estimate of the acceleration that Kaveri will provide in an algorithm that analyzes images to detect faces. This estimate is obtained from data disclosed by AMD for an older VLIW4-architecture GPU with six compute units at 685 MHz. I scaled the observed acceleration by a factor of 8/6, which accounts for the higher number of compute units in the new Kaveri APU. I ignored further corrections arising from the fact that Kaveri will use the new GCN architecture (which is much faster at compute than VLIW4) and the new hUMA memory subsystem. Again, I assume that the HSA score obtained in this way for Kaveri is conservative and that the real silicon will probably perform much better.

Estimation of HSA performance of the top Kaveri APU

Many other massively parallel algorithms [3], from file compression to video encoding or game physics simulations, will receive similar accelerations. With HSA-enabled software, Kaveri has the potential to be much faster than an Intel i7 or an eight-core AMD FX [4]. Some developers are finding accelerations of up to 500% when enabling HSA. During Hot Chips 2013, AMD showed a 5.8x increase for an HSA-enabled cloud server workload running on the old APU with six VLIW4 compute units at 685 MHz. The new Kaveri APU would break the 6x mark easily, thanks to its eight GCN compute units clocked at around 900 MHz [4].
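As an illustration, the same 8/6 compute-unit correction used for the face-detection estimate can be applied to the 5.8x cloud-workload figure. This is only a sketch: it assumes the speedup scales linearly with the number of compute units, which is an idealization, and like the article it ignores the higher clock, the move from VLIW4 to GCN, and hUMA. The function name is hypothetical.

```python
OLD_COMPUTE_UNITS = 6   # older VLIW4 APU used in AMD's demonstration
NEW_COMPUTE_UNITS = 8   # top Kaveri APU


def scale_by_compute_units(observed_speedup,
                           old_cus=OLD_COMPUTE_UNITS,
                           new_cus=NEW_COMPUTE_UNITS):
    """Assumption: HSA speedup grows linearly with compute-unit count."""
    return observed_speedup * new_cus / old_cus


cloud_speedup = scale_by_compute_units(5.8)
print(f"Scaled cloud-workload speedup: {cloud_speedup:.1f}x")
```

Under this linear assumption the 5.8x figure becomes roughly 7.7x, which is consistent with the expectation that Kaveri would pass 6x.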

I cannot estimate the graphics performance of the GPU in Kaveri because crucial data is not known. The improvement could be anywhere from 33% to 200% over the Richland APU, because benchmarks are very sensitive to memory bandwidth and to the precise version of GCN used in the GPU (Richland uses the older VLIW4 architecture).

During the preparation of this article, AMD announced a revolutionary OS layer named MANTLE. This consists of a driver plus a low-level API for GCN graphics cards. It is highly probable that Kaveri will be fully compatible with MANTLE thanks to the GCN technology used in the integrated GPU. Being a low-level API, MANTLE is free of the overhead associated with bloated and inefficient APIs like Microsoft DirectX. During the presentation, AMD claimed up to 9x more draw calls compared to other APIs [5]. The key question is how much weight draw calls carry in a given game engine. This is something I cannot evaluate: first, because it depends on the engine and even on the game; second, because few details are known about MANTLE at the time of writing.

Please note that the Kaveri APU has not yet been released and some of the information contained here may change at the last minute. Needless to say, any mistakes that might be found in this article are entirely my responsibility and cannot be attributed to AMD or to anyone else.

Acknowledgements: I thank both Marcus Pollice and Matthias Waldhauer for their useful insights and corrections to the original draft. I also thank Matthias Waldhauer for additional corrections to the final version of this article.