In the first part of this review, we looked at the hardware, and
continued with the BIOS and Vista boot in the second part. We open the third part with a picture of how Vista welcomes you once you have this rig:
We cannot wait for Microsoft to finally release a Windows 7 Release Candidate. While Vista is working like a charm on this setup [Nova, should we state only on this setup - Ed.], it is no secret that W7 and its built-in file system optimizations for SSDs would make this setup fly.
In the meantime, pictures like this are the reason why you should leave the Welcome screen on. No matter what kind of day I am having, just looking at this overview makes me smile. An ideal graphics card for this system would be an nVidia Quadro FX 5800 or the upcoming Fire Pro 9000 series dual-GPU card.
Loading Windows with 8 physical cores yields a slightly different memory occupancy - an extra 80 MB is cached
If issues arise, disable Hyper-Threading, which brings Task Manager down to 8 cores instead of 16 threads.
Here are the first sets of benchmarks I ran, just in time for the official launch. Cutting to the chase: the first round of benchmarks, focusing mostly on
SiSoft Sandra 2009 SP3 beta, CineBench R10 and Linpack 10, used all default
system settings with SMT, aka Hyper-Threading, enabled. Turbo and NUMA
were enabled in all tests. We did encounter various issues, which we will address in this article, but all of them can be resolved by disabling Hyper-Threading.
Synthetic (does not) lie: 8-core Nehalem beats out 16-core AMD system
Oh my... are we seeing 138 GFLOPS Double-Precision from a two-socket system? nVidia now needs a new GPU to beat the raw numbers achieved by Nehalem-EP. The ATI Radeon 4870 is still safe with its 250 GFLOPS. This is an ego boost for Intel engineers par excellence.
SiSoft Sandra is useful for its
very graphical results display in multiple modes - two of which we
use here - and convenient competitive comparisons, usually up to
date, built in. In this round, we covered Sandra CPU tests, Multimedia, Memory bandwidth,
Memory latency (random and linear latency benchmarks).
Memory Latency test - Hyper-Threading enabled and 79ns latency
With 16 threads showing in Task Manager, latency comes in at 79ns. We ran this benchmark several times, and each time the score was the same...
Memory Latency test - Hyper-Threading disabled and 78ns
Testing memory latency with HT disabled results in an impressive latency of just 78ns. Bear in mind this is a combined 384-bit memory controller, just like the GeForce 8800GTX architecture.
We ran the random latency test several times, both on CPU0 and then on CPU1, where it
shaves off a nanosecond - interesting... we will thoroughly check various benchmarks and work closely with the developers to see whether these issues are software or hardware-based.
Also, note that our unit seems just a tiny bit slower than the
"other" dual W5580 that SiSoft seemingly tested. It could be a
combination of slightly slower base clock, memory speed and some
other BIOS or hardware factors that we'll investigate in due course.
Memory bandwidth... theory states 51.02 GB/s with DDR3-1066 memory, reality delivers a still impressive 36.86 GB/s
Memory results entered Warp 7 compared to weak scores achieved by Nehalem's predecessor.
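To put the two bandwidth figures in perspective, here is a minimal sketch of the arithmetic behind them. It assumes the usual layout for this platform (two sockets, three 64-bit DDR3 channels per socket) and a nominal DDR3-1066 rate of 1066.67 MT/s; the function name and parameters are our own illustration, not something taken from Sandra.

```python
# Sketch: theoretical peak memory bandwidth vs. the measured Sandra result.
# ASSUMPTIONS (not stated in the review): 2 sockets, 3 channels per socket,
# 64-bit (8-byte) channels, nominal DDR3-1066 = 1066.67 million transfers/s.

def peak_bandwidth_gbs(mega_transfers, sockets=2, channels_per_socket=3, bytes_per_transfer=8):
    """Peak bandwidth in GB/s (decimal, 1 GB = 1e9 bytes)."""
    bytes_per_sec = mega_transfers * 1e6 * bytes_per_transfer * channels_per_socket * sockets
    return bytes_per_sec / 1e9

# Combined bus width across both sockets, in bits.
total_bus_bits = 2 * 3 * 64

nominal_peak = peak_bandwidth_gbs(1066.67)  # nominal DDR3-1066 rate
quoted_peak = 51.02                         # theoretical figure quoted in the review
measured = 36.86                            # Sandra result quoted in the review

efficiency = measured / quoted_peak

print(f"Combined bus width: {total_bus_bits} bits")
print(f"Nominal peak:       {nominal_peak:.2f} GB/s")
print(f"Measured / quoted:  {efficiency:.1%}")
```

On these assumptions, the measured result works out to roughly 72% of the quoted theoretical figure. The nominal calculation lands at about 51.2 GB/s, a touch above the quoted 51.02 GB/s; that gap would be consistent with a slightly lower effective memory clock on our unit, though that is our inference rather than something we measured.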
Sandra's MultiMedia score shows that AMD still has some life left - 16 AMD cores yield almost 425 MPixel/s
AMD's pricey 16-core monster reacts to Intel's threat: helped by a combined 512-bit memory controller, it posts a fantastic 423 MPix/s.
Either way, in most cases, you can see that the W5580 blows away the
competition, with Intel's own dual Harpertown Xeon X5492 being the
closest match in many cases. In a few situations, the HT-enabled dual
W5580 beats the quad-socket Shanghai Opterons from AMD, as you can see. And even with all twelve of the DIMM slots populated with 2-rank registered
modules, neither the memory bandwidth nor the latency suffers; a good base
for very fast streaming data processing.
Linpack meets Hyper-Threading...
16 CPUs, 8 threads, HT enabled and we get 56.99 GFLOPS
Linpack forgets Hyper-Threading
8 CPUs, 8 Threads and we get 85.25 GFLOPS. Not a small difference...
If you want the best, there you go
Best shot with large data array - more improvements to come using DDR3-1333 or 1600 RAM
An interesting Linpack run with SMT enabled: I manually set the
total thread count to eight, as we wanted to see how Windows would
prioritize the threads. As we guessed, it seems to have tried to pile them all up on
one CPU, instead of taking the "physical CPUs used first,
then logical CPUs when all physical occupied" approach. The result was, well... not exactly encouraging. So, please, no
SMT/HT for Linpack-like heavy math stuff.
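The "physical CPUs first" placement described above can be sketched as a simple affinity mask calculation. This is only an illustration under a stated assumption: that the two logical CPUs of each physical core are numbered consecutively (0 and 1 share a core, 2 and 3 share the next one, and so on). Real systems may enumerate them differently, so treat the helper below as hypothetical rather than as what Windows or Linpack actually does.

```python
# Sketch: pick one logical CPU per physical core, so that eight threads land
# on eight different cores instead of doubling up on Hyper-Threading siblings.
# ASSUMPTION: sibling logical CPUs are numbered consecutively (2k and 2k+1).
# Check the actual topology on a given machine before relying on this.

def one_thread_per_core(physical_cores, threads_per_core=2):
    """Return (cpu_ids, bitmask) selecting the first logical CPU of each core."""
    cpu_ids = [core * threads_per_core for core in range(physical_cores)]
    mask = 0
    for cpu in cpu_ids:
        mask |= 1 << cpu  # set the bit for this logical CPU
    return cpu_ids, mask

cpu_ids, mask = one_thread_per_core(physical_cores=8)
print(cpu_ids)    # [0, 2, 4, 6, 8, 10, 12, 14]
print(hex(mask))  # 0x5555
```

A bitmask of this form is the kind of value that operating system affinity settings generally accept. Whether pinning the threads this way would recover the HT-off Linpack score on this system is something we have not tested.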
Turning the Hyper-Threading off in SiSoft Sandra 2009
Turning HT off will cost you around 14 GIPS and a massive 58 GFLOPS!
For Sandra's tests, leave HT on at all costs, or be punished...
Turning HT off caused a fairly small loss of performance in Integer apps and a 35 MPixel/s deficit in Floating-Point. But take a look at the superiority of just eight Nehalem cores against 16 real cores from AMD in FP operations.
Bear in mind that this AMD system is radically more expensive than our tested system. Our Nehalem-EP platform would retail for around 6000 USD (sans SSDs). In stark contrast, a 4-Socket 2.7 GHz QuadCore would set you back almost 10,000 USD! So, these results are the reason why AMD is worried. The AMD Opteron 8000 is nothing short of a golden goose, and even the dual-socket Nehalem-EP is able to compete against it! True, not when you disable Hyper-Threading...
3DMark Vantage... sort of
For the first time in Vantage, we get triple-zero score. Clean 17K of single-card awesomeness.
Unfortunately, a complete 3DMark Vantage run is missing from this part - despite three tries, the CPU portion of the famous 3D benchmark
couldn't complete with SMT on. Back in the day, we spoke with Oliver from Futuremark and he told us that 3DMark Vantage can handle 16 threads, and so can PCMark Vantage.
So, we come to SMT-off or HT-off, call it whatever you like. We were curious to see how this would impact performance - and yes, it did lower the synthetic benchmark performance, but it did prop up the real application results - at least for the compute-intensive runs
with heavy threading.
Cinebench R10 was always a stronghold of Core architecture
No room for playing around... Cinebench didn't exactly profit from Hyper-Threading, and as you can see, GPU performance even suffers when HT is on.
Look what it does for CineBench 10, not to
mention Linpack.
Sandra CPU test results are noticeably lower - in one case, the old Xeon
X5492 overtakes the new one; my suspicion is that this is due to the cache latency
and size issues in L1 and L2 here. Remember that, as mentioned,
Turbo was on for the W5580, so pretty much all the tests actually ran the CPU cores at 3.33 GHz.
Cinebench seems to do better now, and Linpack simply shines: 95
GFLOPS of actual double precision performance obtained here, beating the double precision offered by every GeForce GPU below the GTX295, i.e. the dual-GT200b GPU part. Note
that Linpack is quite sensitive to memory latency too. If you ever decide to run DDR3-1333 memory modules with CAS6 latency, you could easily see the system achieving 100 GFLOPS, very near its peak theoretical limit.
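For readers wondering where that "peak theoretical limit" sits, here is a rough sketch of the calculation. It rests on two assumptions that are not spelled out in this review: a base clock of 3.2 GHz for the W5580, and four double precision floating point operations per core per clock (one 128-bit SSE add plus one 128-bit SSE multiply). The 3.33 GHz Turbo clock and the Linpack scores are the ones quoted in this article.

```python
# Sketch: theoretical double precision peak vs. the Linpack scores quoted here.
# ASSUMPTIONS: 3.2 GHz base clock, 4 DP FLOPs per core per cycle
# (one 2-wide SSE add + one 2-wide SSE multiply each clock).

def peak_dp_gflops(cores, clock_ghz, flops_per_cycle=4):
    """Theoretical peak in GFLOPS for the whole system."""
    return cores * clock_ghz * flops_per_cycle

peak_base = peak_dp_gflops(cores=8, clock_ghz=3.2)    # assumed base clock
peak_turbo = peak_dp_gflops(cores=8, clock_ghz=3.33)  # Turbo clock from the review

linpack_runs = {
    "HT on, 8 threads": 56.99,
    "HT off, 8 threads": 85.25,
    "95 GFLOPS run": 95.0,
}

print(f"Peak at 3.20 GHz: {peak_base:.1f} GFLOPS")
print(f"Peak at 3.33 GHz: {peak_turbo:.1f} GFLOPS")
for label, gflops in linpack_runs.items():
    print(f"{label}: {gflops / peak_base:.0%} of base-clock peak")
```

On these assumptions the base-clock peak is about 102 GFLOPS, so a 100 GFLOPS run would indeed be very close to the ceiling, and the 95 GFLOPS result already sits above 90% of it.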
Of course, this is just a start anyway.
We will follow up this article soon, not just with other workstation and desktop apps, but
also with faster memory, as well as other mainboards that promise better tuning,
and of course, Linux environments.
Nehalem-EP, as the new Xeon DP series leader, lays down an
impressive processor and platform base for the new dual processor systems from Intel - whether it's a 3-D workstation, extreme
desktop PC, a mainstream server or an HPC cluster node. The near perfect
balance between the CPU, memory and I/O resources will help a lot in
many real life apps, aside from all these benchmarks. It is up to AMD
to fight and match this, not just by adding cores as in the six-core Istanbul,
but also by combining it with HyperTransport 3.1 links plus DDR3 memory,
and they had better make it real quick.
We also heard that AMD plans to bring a 256-bit memory controller with the 12-core Magny Cours in 2010, and if that turns out to be true, AMD may be able to stay in the game. But as for 2009, it is up to Intel to show how far they want to go in promoting and
gaining market share with the Gainestown chips. Bear in mind that this is not the end. Intel plans to introduce a socket drop-in upgrade, a 32nm
six-core follow-up with 12 MB of L3 cache from the Westmere "tick" generation.
The promise of a
balanced base board platform with suitable derivatives, not to mention tune-ups and software stacks addressing the workstation and server as well as the extreme PC all at once, has finally come true.
In our Video Production Studio, we are working on Intel V8 65nm, Skulltrail 45nm and 45nm Mac Pro systems, and there was just one major issue - the memory subsystem was getting choked by the requirements of the REDCODE stream. With the Nehalem-EP platform, this bottleneck is finally removed, and now Intel can truly shine and start replacing all those Opteron workstations that offered insane bandwidth [yes, a dual-socket dual-core Opteron is a better choice than a dual-socket quad-core Harpertown. Go figure]. Intel probably doesn't even have a clue what they did with this platform: if a dual-socket Opteron and nVidia Quadro FX 3000 SDI enabled the creation of BattleStar Galactica, Nehalem-EP with Quadro FX5800 in SLI should usher us into a world of TV shows with better effects than the Hollywood movies of Anno Domini 2008. With proper pricing, you can assemble a machine that will eat up everything that comes its way, be it 128, 256 or even 512 instruments in an audio production. You can now emulate complete orchestras in real-time... overall, Intel has become the king of content production. Hats off to all the guys and girls who made this possible.