OPINION: Are Benchmarks Worthless?
4/19/2012 by: Theo Valich
One of the striking messages in Rob Enderle's benchmarking piece was that we should no longer use current benchmarks and that a new type of benchmarking should be adopted, one that would emulate the real world. As it turns out, Rob was inspired by a recent AMD-organized event in Austin, TX, where select press and analysts gathered for a Trinity Review Day. Note that BSN* was not invited and, as such, we are free to fully disclose the performance of the Trinity APU (page two please).
Over the past 13 years, I have tested over 3000 components, and as patterns emerged I developed an aversion to overly optimized benchmarks. Instead, I personally, and the team here at BSN*, try to test products in situations as close to real use as possible, which results in countless hours of work on reviewing a single part. I wish to thank the respected members of AMD, Intel, NVIDIA, Futuremark, Microsoft, ORACLE, Blizzard, CroTeam, Samsung, Epic Games and many others, who worked with me for over a decade and explained what I did not understand at the very beginning.
What's wrong with Benchmarks today?
If we take a look at performance measurements today, most companies use a ubiquitous benchmark for their respective fields of interest: in the case of computing, we have consumer, prosumer, workstation, server and mobile benchmarks. Performance benchmarks are divided into synthetic and real world tests.
In this article, we will focus on the consumer/prosumer systems, which attract the majority of public interest, but are also subject to the most scrutiny (and the most money invested by the companies that have an interest in them). The players in the benchmark field are BAPCo on the business side, Futuremark on the consumer and business side, and a couple of irrelevant players (in the grand scheme of things). There are also real world tests, which involve professional applications (Adobe, Blackmagic, CyberLink, etc.) and video games from major companies like Activision, Bethesda, Electronic Arts, Take 2 and so on. Video games, as we have found, tend to be the most intensive measure of a computing platform's capabilities and stability.
BAPCo: The one-sided marching band that wins billions for one side
We start off with probably the most controversial organization of them all. BAPCo is an industry body originally founded by Intel. In fact, for a good amount of time, the allegedly independent organization operated out of Intel's offices in Silicon Valley. As Intel began to focus more and more on optimizing, more and more companies left the organization. The companies that left include (in alphabetical order) AMD, Apple, Microsoft, NVIDIA, SanDisk, VIA and many more. A good illustration of how Intel controlled the benchmark body is ARCintuition, an organization which always voted in Intel's favor. As our confidential source, a former board member of BAPCo, said: "Intel has many ways in which it influenced the outcome of BAPCo votes. One glaring example is ARCintuition, a shell company that has no other purpose than providing services to BAPCo. The company is basically an Intel sock-puppet."
A good example of the wrong side of benchmarking ethos was EEcoMark. Roughly three years ago, BAPCo wanted to release EEcoMark to the EPA (Environmental Protection Agency) to become part of the EnergyStar certification engine. This would influence every computer on the planet, since EEcoMark overnight "would have become the most important benchmark ever created by far. For instance, many organizations will not purchase systems that do not bear the EnergyStar logo. Organizations in Europe and Asia were already lined up to follow the EPA’s lead, so EEcoMark’s reach would have been worldwide. Board members proved that, if the EPA were to adopt EEcoMark, only systems containing Intel microprocessors would ever earn EnergyStar stickers. It was the only time BAPCo had ever stood up to Intel and said ‘no.’ In the end, to save face after even their most loyal OEM allies turned against them, Intel was forced to vote ‘no’ on their own benchmark!"
EEcoMark was also one of many reasons why Apple left the BAPCo organization. Thus, when we take a look at business magazines and see that BAPCo benchmarks have been used, with authors using those benchmark scores to decide whether you should purchase brand A or B…
You can easily see why such a benchmark would be flawed. Yet today, billions of dollars of revenue are made based upon test results provided by MobileMark, EEcoMark and SYSmark. SYSmark 2012 is the prime example of how an allegedly independent benchmark is deliberately manipulated in order for one side to win. In the creation of the benchmark, BAPCo used older versions of 3rd party software which did not support hardware acceleration (otherwise, for example, an AMD Bobcat, Fusion E-Series APU would rival Intel's Core i7 "Sandy Bridge" processor), and the interesting bit is that in the majority of cases, the software revision used was the last one before an accelerated version was made available.
Here at BSN*, you won't ever see us using BAPCo's products. We know how BAPCo operates and, in our opinion, given the amazing performance increases we have seen over the years from Intel, these tactics should really not be used. Yet, what is done is done.
Benchmarks - Real World - 0:1
The Curious Case of Futuremark Corporation; One Benchmark to Rule Them All?
Futuremark Corporation is a company with an interesting past, since it was started by several members of Future Crew, the legendary demo group from the 1990s. The company split into multiple entities over the course of the past decade, but its products - 3DMark, PCMark and PowerMark - are an interesting example of synthetic benchmarks mimicking real life.
For example, 3DMark is a synthetic 3D engine which swings the leadership from one generation of hardware to another, but the company has not supported Intel's graphics hardware since 2006 and the arrival of DirectX 10. 3DMark Vantage was a native DirectX 10 benchmark, while 3DMark11 is a native DirectX 11 benchmark. Up until Ivy Bridge, which Intel is releasing soon as the "3rd Generation Core" processor, Intel did not have a graphics part that fully supports Microsoft's DirectX 11 API (Application Programming Interface), and as such its graphics were not available for a direct comparison against contemporary graphics architectures from AMD and NVIDIA.
Even though the 3D engine was originally "synthetic", Futuremark released Shattered Horizon, a fully-featured 3D game which utilized the 3DMark Vantage engine. As such, the game only works on Windows Vista and Windows 7, something that publishers did not particularly like. At the end of the day, the only viable competitor for 3DMark is Unigine's Heaven benchmark, which originally had ties to AMD. However, today the Unigine team is pretty much on its own, since Heaven is a difficult benchmark for both AMD and NVIDIA, with victories being traded blow by blow (low end, mainstream, high end, etc.).
Moving on from the 3D world, there is PCMark, a benchmark that combines elements from 3DMark with a large number of system-stress tests based upon Windows behavior (the Windows startup pattern for the hard drive, for example), confidential file encryption and decryption, video encoding and decoding, and so on. However, going head to head against SYSmark proved to be a difficult task, unlike 3DMark, which has practically no competition in the synthetic 3D benchmark space.
There is also a third benchmark, a recent newcomer to the field: PowerMark. PowerMark goes head to head against MobileMark, and it will be interesting to see whether it can gain a foothold in the market. We have benchmarked an Ultrabook from Acer using both PowerMark and MobileMark and decided that, in the future, we will use PowerMark as the base for our productivity/battery suite for mobile computers.
Their fourth benchmark is Peacekeeper, a browser-based benchmark with a somewhat ironic name, since it caused a lot of PowerPoint Wars between practically every company that has a product that connects to the Internet. Even though Peacekeeper has its shortcomings, it is an overall good benchmark used for all devices in the field (smartphone, tablet, notebook, desktop, workstation etc.).
This year, Futuremark will complicate the matter further, since the company plans on releasing 3DMark for Apple iOS, Google Android, Microsoft Windows 8 and Windows RT, creating a unique benchmark that works on all platforms. This benchmark will mean that for the first time, you will be able to compare a smartphone to a desktop PC in terms of high performance. Futuremark plans to do the same thing with PCMark and PowerMark, so expect a lot of fireworks.
In terms of our Benchmark vs. Real World score, Futuremark brings the score roughly back to 1:1. Onto the real world...
Real World Dilemma
Whenever a company is in trouble in a particular set of scenarios, the marketing department will mask the shortcomings by describing scenarios in which your computing experience should be peachy, provided that "you're not a 3D designer", "you're not a hard-core gamer", "you're not a professional photographer". Sadly for those caveats, and we do hear the gaming one quite a bit, the reality is something else. For starters, we'll give you a quick example.
Why are Consumers Buying Tablets?
Back in 2009, we launched an experiment: a simple comparison of the experience offered by Apple iPod and Microsoft Zune HD devices. We masked both devices to hide the manufacturer and asked users to select the one that gave them the best experience. The result was pretty devastating when you know the market shares of both; out of 10 users, the vast majority selected the Zune HD over the iDevice, but the glory of marketing and the promise of a perfect experience caused Apple to win out. In the meantime, Apple launched the iPod Touch, powered by the same iOS that powers the iPhone and iPad, and the user experience changed for the better.
When the Apple iPad came along, it was not a new concept. In fact, Microsoft had been selling Windows XP Tablet OS for almost a decade, yet the product was kept in vertical markets and never managed to spread to the mainstream market, mainly due to its lack of touch friendliness.
Apple rode on the popularity of the iPhone, but also on something else: the iPad's experience with web pages, surfing, etc. was much better than on netbooks (which were almost exclusively powered by Intel Atom processors), and even on some low-end notebooks running on weak CPUs from Intel and AMD. Today, the financial results from companies explain the convergence, and the bottom price threshold for the compute experience has been reached. Consumers want a smooth user experience starting at $499. If you don't offer a smooth experience for that amount of money or more, don't show up to the party.
Delusion #1: You don't need Powerful Graphics for work in Windows or Mac OS
This is the single most common explanation given by the marketing staff from companies when it comes to parts which rely on whole system experience, rather than just the number crunching aspect. When it comes to experience, having a smooth user interface and workflow is paramount in today's user community. With all kudos to Android tablets, they still haven't reached the smoothness of Apple products, even though we saw Apple iPad 3, i.e. "the new iPad" starting to use SSD as a cache and in some cases not being as smooth as it should be.
This is an amazing setup... it is also very expensive. Can it actually deliver user experience promised by Apple and Intel?
Thus, let's say you are a user who just shelled out $3000 for a top-of-the-line MacBook Air and a 27" Thunderbolt Display, and have connected both via a $49 Thunderbolt cable. You know you won't play games, because you have an integrated graphics part. However, upon connecting your laptop to the display and working in, let's say, two instances of Safari with about 10 tabs each, you'll start noticing that YouTube videos do not play as smoothly as they should, and scrolling won't be as smooth as you would have expected. The natural explanation is "oh, applications are not accelerated", but in reality, should you care if you just spent $3000 on your computing experience? Looks are one thing, and the ability of integrated graphics to simply refresh 4.9 million pixels 60 times a second is another.
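To put that pixel figure into perspective, here is a rough back-of-the-envelope sketch in Python. The panel resolutions (1440x900 for a 13-inch MacBook Air, 2560x1440 for the Thunderbolt Display) and the 4 bytes per pixel are our assumptions for illustration, and the function names are made up for this sketch; none of this is a measurement.

```python
# Back-of-the-envelope estimate of how many pixels the integrated GPU
# has to refresh. Resolutions and bytes per pixel are assumptions.
LAPTOP_PANEL = (1440, 900)       # assumed: 13-inch MacBook Air
EXTERNAL_PANEL = (2560, 1440)    # assumed: 27-inch Thunderbolt Display
REFRESH_HZ = 60
BYTES_PER_PIXEL = 4              # assumed: 32-bit color


def pixel_count(resolution):
    """Number of pixels on a single panel."""
    width, height = resolution
    return width * height


def total_pixels(panels):
    """Number of pixels across all connected panels."""
    return sum(pixel_count(p) for p in panels)


def pixels_per_second(panels, refresh_hz=REFRESH_HZ):
    """Pixels that must be refreshed every second."""
    return total_pixels(panels) * refresh_hz


panels = [LAPTOP_PANEL, EXTERNAL_PANEL]
per_frame = total_pixels(panels)
per_second = pixels_per_second(panels)
gigabytes_per_second = per_second * BYTES_PER_PIXEL / 1e9

print(f"{per_frame:,} pixels per frame")
print(f"{per_second:,} pixels per second")
print(f"{gigabytes_per_second:.2f} GB/s just to repaint every frame")
```

Under these assumptions the two panels add up to just under five million pixels per frame, which is in line with the figure quoted above, and that is before a single web page or video has been composited.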
A good example of smoothness is a notebook powered by AMD's Bobcat APU. We have checked a ton of netbooks and "lightbooks", and saw that a cheap computer can offer a smooth Windows 7 UI experience. Pushing the term a bit further, both the AMD Llano and Trinity APUs offer a great experience, and you can even play computer games in native resolution with full API compliance.
For example, at the recently held Trinity Reviewers Day, AMD ran a series of surveys of the press and analysts. The company showed two systems side by side and asked the attendees to vote on which one offered the smoother experience. For the productivity test, the company ran Microsoft Excel, Internet Explorer and Word. Out of 30 people, 24 said that the AMD Fusion A10 performed better than the Intel Core i5 processor, while five were undecided. The video shake removal test relied on MotionDSP technology, and here the Trinity A10 won by 25 votes to 3 over the Core i5, with two attendees undecided. The final test was file compression, for which the latest version of WinZip was used (OpenCL support included). 26 members of the press and analysts said the Fusion A10 system won, one said Intel did better and three were undecided.
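The votes above are easier to compare as shares of the room. The following is a minimal Python sketch that tallies them; the vote counts come from the paragraph above, while the `tally` helper is hypothetical, and the Intel count for the productivity test is derived as the remainder of the 30 attendees since it was not reported separately.

```python
# Tally of the side-by-side votes reported from the Trinity Reviewers Day.
ATTENDEES = 30

# (test, votes for AMD A10, votes for Intel Core i5, undecided)
# None marks a count that was not reported and is derived as the remainder.
REPORTED_VOTES = [
    ("productivity", 24, None, 5),
    ("video shake removal", 25, 3, 2),
    ("file compression", 26, 1, 3),
]


def tally(amd, intel, undecided, total=ATTENDEES):
    """Return vote counts and the AMD share for one test."""
    if intel is None:
        intel = total - amd - undecided   # derived, not reported
    if amd + intel + undecided != total:
        raise ValueError("votes do not add up to the number of attendees")
    return {
        "amd": amd,
        "intel": intel,
        "undecided": undecided,
        "amd_share": amd / total,
    }


for test, amd, intel, undecided in REPORTED_VOTES:
    t = tally(amd, intel, undecided)
    print(f"{test}: AMD {t['amd']}, Intel {t['intel']}, "
          f"undecided {t['undecided']} ({t['amd_share']:.0%} for AMD)")
```

Keep in mind that this only restates a show of hands from 30 people; it says nothing about how large the difference between the two systems actually was.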
Delusion #2: You don't need CPU power at all
The second delusion, which we hear from companies that have a strong GPU but lack sheer CPU compute performance, is that sheer CPU performance does not matter anymore. For example, Intel will demolish its competition in sheer performance. If we take a look at the table below, you'll see that the Intel Core i5-2500K runs circles even around the 8-core FX-8150, let alone current and upcoming quad-core processors such as the A8-3870 (Llano) and A10-5800K (Trinity).
Leaked Performance table shows AMD Fusion A10-5800K Series CPU fighting against Llano based APUs and Intel Sandy Bridge CPUs.
To say CPU performance is not important is a severe misjudgment. Today, more than ever, you need compute power to use secure banking, work on complex spreadsheets, or simply have a smooth experience editing photos (provided that you don't have a high-power GPU). Furthermore, if you are a prosumer, you will need to get as much CPU power as possible, because you cannot lose business while waiting for the magically accelerated software to appear. In our own experience of high-end graphics benchmarking, we have always found that the system with the best performing processor will also deliver the highest graphical performance, even in tests where the CPU is not as critical. This is because at a certain point the processor becomes a bottleneck to all of that graphical computation, and in some cases even the fastest processors need to be overclocked.
Let Me Entertain You
After you are done with work, what you want is to be entertained (yes, more often than not people want to be entertained during work as well). The most popular entertainment options are social networks, social games, videos and games. The defense that people today don't game is borderline ridiculous, since more people play computer games than see movies in the cinema. The PC gaming industry is the largest branch of entertainment, with over $30 billion in annual revenue from software and hardware. Add in all the consoles and gaming oriented platforms such as tablets and smartphones, and you'll get an industry almost larger than Hollywood and the music industry... combined (note - not the commercial aspect, marketing, etc. - we're looking at product sales only: game boxes and movie ticket / Blu-ray sales).
How to benchmark that video experience? How to benchmark games? It is easy to say "you can't", as some analysts like to write. But that is simply not true. The question is, though: can your laptop play back a 1080p video you purchased on your desktop, even though its display only has a 1366x768 resolution? Here at BSN*, we've had our fair share of notebooks and even desktops that could not pass muster in dynamic picture resizing, an option we consider to be an absolute must. How many images can be opened in Photoshop until the system slows down to a crawl?
And what settings in computer games make for a smooth gaming experience, exactly? All of these values are what the testers on the sites you read, and appreciate or criticize, measure. There are no silver bullets that can pin down the real world usage model of each and every user, but there are exact things you can measure, and often your performance and, more importantly, your experience will be decided by something you did not expect: a hard drive that goes to sleep, a loud optical drive with a slow spin-up time, not enough USB 3.0 ports for fast transfers... and all of these can be measured.
Solving the Benchmark Dilemma, Car Style
When it comes to benchmarking computing devices, we actually don't need to reinvent the wheel. The dilemma we as an IT industry face is very similar to the one faced by car testers and journalists. The car industry has the same issues: for years, standards bodies had their own sets of measurements, for which the car makers tweaked and optimized their vehicles, only to have those same bodies change the measurements every time an erroneous calculation occurred. Be it MPG ratings, which recently caused a lot of shuffling behind the scenes, or EuroNCAP crash test ratings, which were changed in such a way that most five-star rated cars lost their rank, and some were even deemed more dangerous than before.
We live in an evolving industry, and every new generation will push the boundaries much more aggressively than the car industry does. However, just like our computers connect to displays, car tires connect to the road.
Computing reviewers are adopting the same approach, and those who have not need to do so: verify whether the performance is in accordance with the officially released figures (practically every computer component or system now comes with a review guide in which both theoretical and measured performance are disclosed), establish what the customized workloads are (no two magazines test cars in absolutely the same way, just as no two car testers will drive a car in the same way) and work out how much value per dollar you are getting.
Car industry loves to test on this near perfect circle in Italy, recently acquired by VW. Should we disregard cars not tested here?
Miles per Gallon or Liters per 100 Kilometers turn into Performance per Watt, i.e. how much juice you have to pay for in order to run your workload on the computer. 0-60 mph, i.e. 0-100 kph, is akin to boot time or time to wake for a system. Driving the car on a track is equal to running Unigine's Heaven benchmark and getting a 3DMark score for performance, which takes both the CPU and the GPU into account. Taking the car on an open road and measuring highway or city performance is what applications such as Adobe Creative Suite, Blackmagic DaVinci Resolve, CyberLink MediaEspresso, PCMark and nearly any well coded video game (taking certain developer relations into account) are for.
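The car analogy boils down to a couple of simple ratios. Here is a minimal Python sketch of performance per watt and value per dollar; the two systems and all of their numbers are hypothetical and chosen only to show the arithmetic, and the names `Result`, `perf_per_watt` and `perf_per_dollar` are ours, not taken from any benchmark suite.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    score: float       # benchmark score, higher is better
    avg_watts: float   # average power draw during the run
    price_usd: float   # what the system costs


def perf_per_watt(result):
    """The computing counterpart of miles per gallon."""
    return result.score / result.avg_watts


def perf_per_dollar(result):
    """How much benchmark score each dollar buys."""
    return result.score / result.price_usd


# Hypothetical systems, for illustration only.
systems = [
    Result("System A", score=1200.0, avg_watts=60.0, price_usd=600.0),
    Result("System B", score=1500.0, avg_watts=100.0, price_usd=1000.0),
]

for s in systems:
    print(f"{s.name}: {perf_per_watt(s):.1f} points/W, "
          f"{perf_per_dollar(s):.2f} points/$")

most_efficient = max(systems, key=perf_per_watt)
best_value = max(systems, key=perf_per_dollar)
print(f"Most efficient: {most_efficient.name}, best value: {best_value.name}")
```

In this made-up case the system with the higher raw score loses on both ratios, which is exactly the kind of result a single headline number hides.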
Benchmarks versus Real World: I am sorry, but there are no victors here. You have to measure performance taking both sides into account, and you have to select the right tools for the job. There are benefits to measuring both, as in many cases real world tests become much more subjective, especially as settings and applications vary between testers.
Conclusion: Vote with Your Wallet
At the end of the day, what decides whether or not a product is good enough for the market is the effort which reviewers have put into writing a review of the product. A good review needs to be based on principles similar to those of automotive reviews, cover what is both above and below the hood/bonnet, and test the vehicle in both controlled and uncontrolled environments.
Should you stop trusting the benchmarks today? No, you should not. But you also need to take things at face value, because the subjective experience is the most important one. And seeing your $3000 MacBook Air plus 27" Thunderbolt Display run like a turtle during a simple browsing session is enough to send you back to the store.
Real benchmarks are the ones to be trusted, while it is always good to rely on selected synthetic benchmarks for comparable numbers, such as ones from ElcomSoft, FinalWire, Futuremark, SiSoft or Unigine.
When you purchase your next computer, or a car, or just about anything measurable, think of just one thing… is this product really worth my hard earned ______________ (insert your local currency)?
© 2009 - 2011 Bright Side Of News*, All rights reserved.