As mentioned before, Nehalem-EP brings a major improvement in component compatibility between the high-end desktop, workstation and server lines by bringing socket commonality and allowing high-performance desktop memory for Core i7 to work smoothly in its dual-CPU Nehalem-EP brethren. Here we have one of the fastest memory kits around, the Kingston HyperX DDR3-2000 DIMM, made with fast Elpida chips, in 3-channel kits for Core i7 - and, yes, we tested two kits here.
The Eight-headed beast meets the best that Kingston can offer
As promised, we're returning to the scene of the crime - and testing desktop memory on a "Gainestown" workstation/server platform.
A grand total of 12 GB of memory may not sound that great compared to the 48 GB in the reference configuration we tested earlier [Part I, Part II, Part III, Ed.], but the speed and latency benefits of using unbuffered DIMMs can't be ignored.
So, how much do we benefit on a very conservative mainboard with absolutely no performance tweaks and everything going according to the basic - no XMP or EPP - SPD info in the DIMM? Remember, all of the scores you'll read here were achieved by running these modules at DDR3-1333 CL8-7-7, the lowest timings Super Micro allowed us. Unfortunately, we weren't able to see whether CL5-6-5 would work [it is stable on the ASUS X48], let alone the "default" clocks of the memory itself: DDR3-2000 CL8-8-8 or DDR3-1900 CL7-7-7, which these modules achieve with ease at a "street-legal" 1.63V on the Asus X48 Rampage Extreme. Oh yes, thanks to the much-improved memory controller in the X48 chipset, some of these i7-optimized DIMMs, including Kingston HyperX, can raise the memory performance of your Core 2 machine by quite a notch, too.
Sandra takes the helm
As you can see, 133 MHz DDR (266 MT/s) clock jump can't do miracles for overall system bandwidth
Back to the Nehalem-EP: here are the Sandra 2009 SP3 memory bandwidth and latency synthetic benchmarks, as well as Cinebench 10 and Linpack, our reference CPU benchmark. We do not care about Super-PI, Prime or similar tests: if the CPU isn't stable with Linpack, you can forget about a stable overclock.
Latency drops down by another 4ns...still, could be much better than this no-frills workstation motherboard
Memory latency decreased by a couple of nanoseconds, but it would be interesting to see sub-70ns latency on more aggressive timings.
Sandra is pretty straightforward - you jump from 36 to 39 GB/s, roughly 8% more bandwidth against a 25% theoretical gain from DDR3-1066 CL7 to DDR3-1333 CL8, with the latency penalty included. As for total memory latency, we're now down to 74ns from 78ns in random latency benchmarks. The benefit comes partly from the lack of buffers on the DIMMs, and partly from the higher DRAM clock overall. OK, not bad still...
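For readers who want to check the arithmetic, here is a minimal sketch that uses only the rounded Sandra figures quoted above. The variable names and the "realized" ratio are our own illustration, not anything Sandra reports, and the result can only be as precise as the rounded inputs.

```python
# Figures quoted in the text (Sandra memory tests, rounded).
bw_old, bw_new = 36.0, 39.0        # measured bandwidth, GB/s
lat_old, lat_new = 78.0, 74.0      # random-access latency, ns
rate_old, rate_new = 1066, 1333    # DDR3 transfer rate, MT/s


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


measured_gain = pct_change(bw_old, bw_new)            # bandwidth actually gained
theoretical_gain = pct_change(rate_old, rate_new)     # gain in raw transfer rate
latency_drop = -pct_change(lat_old, lat_new)          # positive means lower latency
# Illustrative ratio: how much of the transfer-rate bump shows up as bandwidth.
realized = measured_gain / theoretical_gain * 100.0

print(f"measured bandwidth gain:    {measured_gain:.1f}%")
print(f"theoretical bandwidth gain: {theoretical_gain:.1f}%")
print(f"latency reduction:          {latency_drop:.1f}%")
print(f"share of clock bump seen:   {realized:.0f}%")
```

With these inputs, only about a third of the extra transfer rate turns into measured bandwidth, which fits the picture of a board running conservative timings.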
Cinebench? Less than a 1% jump, but still an improvement - remember that many ray tracers are quite CPU dependent, so if this memory improvement speeds us up 1%, it means that even ray trace code does like faster RAM.
When it comes to the workstation market, the situation is pretty much the same as in the hard-core overclocking scene. You need a stable machine [albeit for a much longer period of time], and even the smallest performance gain is worth a pot of gold if you can qualify the parts.
In real numbers, our BSN* intro takes 12 hours to render on this machine in 1280x720. Yes, we are talking about the 2 second, 50 frame intro that you can see on our initial Akihabara video. With this memory installed, we saved 7min12sec. That is maybe a small gain, BUT - in our video studio, we usually run around 100-150 renders with similar demands… this memory, even with this pathetic speed jump, will result in savings measured in hours. Hours = Money [and less nervous workstation jockeys, Ed.].
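To put a number on "savings measured in hours", here is a small sketch built on the render figures above. It assumes every one of the 100-150 renders takes about as long as the intro and saves the same 7min12sec, which is our simplification rather than a measured result.

```python
# Figures quoted in the text.
render_minutes = 12 * 60          # one render: 12 hours
saved_minutes = 7 + 12 / 60       # 7 min 12 s saved per render

# Per-render saving as a share of the total render time.
saved_pct = saved_minutes / render_minutes * 100.0
print(f"saving per render: {saved_pct:.1f}%")

# Assumption: each render in the batch is similar to the intro.
for renders in (100, 150):
    hours_saved = renders * saved_minutes / 60.0
    print(f"{renders} renders: about {hours_saved:.0f} hours saved")
```

Under that assumption, the saving per render is about 1% and the total lands somewhere between 12 and 18 hours across the whole batch.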
As for Linpack, the story turns interesting: if you keep the data set small, something like a 10000x10000 matrix at 4 KB data alignment, using Kingston HyperX will up your score from 87.3 to 88.9 GFLOPS even in this "low performance" mode. A 2% speedup for what is, again, quite a CPU-intensive app is not bad. When moving to a larger matrix size like 20000x20000, the difference is less pronounced: 93.1 GFLOPS vs. 93.4 GFLOPS. At least it's still a gain, and it would definitely be way higher if the latency-sensitive Linpack could make use of the much lower latencies that these Kingstons can do at DDR3-1333: CL5-6-5 should be quite doable even at the default voltage - as soon as we receive a more performance-oriented motherboard, you can expect that we will revisit the platform with these modules as well.
Finally, yes, these DIMMs run way cooler thanks to their huge heat spreaders - I have to say that putting them horizontally, as most Nehalem-EP memory modules are mounted, made me worry that one half of the heat spreader might just decouple and fall off - and to having half the RAM dies per DIMM compared to those 4 GB registered ones from Samsung.
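The Linpack gains can be worked out the same way. The sketch below uses only the GFLOPS scores quoted above; the memory footprint estimate assumes double-precision (8-byte) matrix elements, which is our assumption about the run and not something stated in the results.

```python
# Linpack scores quoted in the text: matrix size -> (before, after) in GFLOPS.
results = {10_000: (87.3, 88.9), 20_000: (93.1, 93.4)}

BYTES_PER_ELEMENT = 8  # assumption: double-precision matrix elements


def gain_pct(before, after):
    """Relative speedup in percent."""
    return (after - before) / before * 100.0


def matrix_gb(n):
    """Approximate size of an n x n matrix in GB (decimal)."""
    return n * n * BYTES_PER_ELEMENT / 1e9


for n, (before, after) in results.items():
    print(f"{n}x{n}: {gain_pct(before, after):.2f}% faster, "
          f"matrix of about {matrix_gb(n):.1f} GB")
```

Both matrices fit comfortably in the 12 GB installed, so the shrinking gain at the larger size is not a capacity effect.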
Linpack advantage drops as matrix complexity increases
No room for doubt - Linpack is very sensitive to memory latency, and the lower you go, the higher the performance gain that is [almost] guaranteed.
For the final screenshot of the day, we leave you with a picture of what this memory could really do on a Nehalem-EP platform, if featured on a motherboard that would allow us to play with the settings:
ASUS Rampage Extreme and low latencies...
DDR3 at 1.9 GT/s, 7-7-7-18 latency thanks to Rampage Extreme
In summary, despite zero optimization on the Super Micro board side, there is a high potential benefit from Nehalem-EP memory flexibility.
It's your choice whether to use the capacious, reliable, expensive - and still fast! - server memory up to a stunning 192 GB in a single workstation, or stick with 12 or 24 GB of higher speed, cheaper Core i7 memory but without ECC and such reliability benefits. Intel finally grew up and created a multi-use platform that won't get you stuck with FB-DIMMs, old DDR2 or disastrously loud stock coolers and things like that. Nehalem-EP can use the whole Core i7 eco-system [imagine two Megahalems, Ed.].
Yes, Nehalem isn't short of memory bandwidth anyway, but this test hints at what could be done once the unlocked dual CPU versions come, together with board support for manual RAM timing settings, and a 4++ GHz dualie gets matched with six channels of Kingston HyperX 2000 modules, or maybe 2133 by that point.
Graphics cards are already beyond 100GB/s, why should CPUs be 10x slower… so, why not break the 100 GB/s total system peak memory bandwidth barrier?
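How close is that 100 GB/s barrier? A rough sketch of theoretical peak bandwidth, assuming standard 64-bit (8-byte) DDR3 channels and six channels in total across the two sockets. The function name is ours, and these are paper numbers, not measurements.

```python
def peak_bandwidth_gbs(mt_per_s, channels=6, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s.

    Assumes 64-bit DDR3 channels (8 bytes per transfer) and six
    channels in total, i.e. three per socket on a dual-CPU board.
    """
    return mt_per_s * channels * bytes_per_transfer / 1000.0


for speed in (1333, 2000, 2133):
    print(f"DDR3-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s peak")
```

On paper, six channels of DDR3-2000 stop just short of the mark, while DDR3-2133 would cross it. Measured bandwidth would of course come in lower than these peaks.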
© 2009 - 2013 Bright Side Of News*, All rights reserved.