Following our Interview with Gipsel of the Planet 3D Now! Team, we now want to provide you with the raw numbers we got from running the Milkyway@Home project, this time comparing the original client that comes from the MW@Home server with the freely available optimized clients from Gipsel. He not only optimized the plain x87 client, but also compiled it for SSE and SSE2 up to SSE4.2. Even though there is no performance gain beyond SSE2, we ran them all in our lab. Of course, we also ran the GPU client.

Gipsel has just recently released version 0.19e, which is multi-GPU capable and easier on GPU usage, so graphics no longer grind to a complete halt. You can also exclude any GPU you want (if you have an IGP in your system, you might want that excluded) and tweak many other options. Gipsel provides an excellent readme.txt with the client.

We used the following system:


Both the system and the GPU were provided by Sven Hornbruch-Vandrey, our Distributed Computing Community Manager.

Here are our crunching results:
WU                 Vanilla         x87               SSE             SSE2          SSE3           SSSE3        SSE4.1       SSE4.2       GPU
2214526_     01:01:36     00:34:54     01:02:11     00:14:36     00:14:40     00:14:36     00:14:34     00:14:44     00:00:31
2215127_     01:05:13     00:36:51     01:05:48     00:15:16     00:15:22     00:15:16     00:15:17     00:15:26     00:00:34
2215128_     01:05:23     00:36:50     01:05:55     00:15:17     00:15:19     00:15:17     00:15:16     00:15:24     00:00:34
2214655_     01:04:07     00:39:20     01:04:43     00:15:26     00:15:28     00:15:24     00:15:29     00:15:36     00:00:33
2210226_     01:02:10     00:37:30     01:02:27     00:14:57     00:14:51     00:14:53     00:14:51     00:14:58     00:00:32
2210227_     01:02:15     00:37:40     01:03:13     00:14:51     00:14:58     00:14:54     00:14:53     00:15:04     00:00:32
2214807_     01:03:47     00:36:01     01:03:55     00:15:03     00:15:03     00:15:04     00:14:59     00:15:09     00:00:33
2214808_     01:03:30     00:36:05     01:04:10     00:15:02     00:15:05     00:15:02     00:15:03     00:15:09     00:00:33
2214383_     00:35:16     00:20:17     00:35:03     00:09:07     00:09:08     00:09:10     00:09:05     00:09:08     00:00:21
2214384_     00:35:35     00:20:28     00:33:58     00:09:07     00:09:08     00:09:06     00:09:05     00:09:10     00:00:21
Total Time (HH:MM:SS)     09:38:52     05:35:56     09:41:23     02:18:42     02:19:02     02:18:42     02:18:32     02:19:48     00:05:04
Total Time (seconds)     34732     20156     34883     8322     8342     8322     8312     8388     304
Speed Comparison Index     100     172.32     99.57     417.35     416.35     417.35     417.85     414.07     11425.00
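For anyone who wants to check the table, the Speed Comparison Index is simply the vanilla client's total time divided by each client's total time, scaled to a base of 100. A quick sketch, using the "Total Time (seconds)" row above:

```python
# Recompute the Speed Comparison Index from the "Total Time (seconds)" row.
# Index = (vanilla total seconds / client total seconds) * 100, base 100.
totals = {
    "Vanilla": 34732, "x87": 20156, "SSE": 34883, "SSE2": 8322,
    "SSE3": 8342, "SSSE3": 8322, "SSE4.1": 8312, "SSE4.2": 8388,
    "GPU": 304,
}
index = {name: round(totals["Vanilla"] / secs * 100, 2)
         for name, secs in totals.items()}
print(index["x87"])   # 172.32
print(index["GPU"])   # 11425.0
```

The recomputed values match the table row for row.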

There you go. Using Gipsel's code, the ATI Radeon 4850 posts a Speed Comparison Index of 11,425 against the vanilla client's base of 100, i.e. it is 114.25 times faster than the stock Milkyway@Home client running on a single Harpertown core at 2.66 GHz! And since the project's current code is itself roughly 100x faster than the original November 2008 code (see the update below), the GPU works out to around 11,425 times the speed of that original code. Even against the fastest optimized CPU client in our table, the Radeon 4850 is still roughly 27 times faster than the Intel CPU.

Now you might have noticed that the x87 client is quite a bit faster than the SSE client. This startled us, but after verifying it and getting in touch with its creator, we were told that SSE (unlike SSE2) doesn't offer double precision, so what we're seeing is the difference between two compilers. According to Gipsel, the SSE version was compiled with a Microsoft compiler, while the x87 build was produced by the Intel compiler.

The times that we get for the GPU workunits reflect the time needed on the GPU itself. At this time, one workunit consumes approximately 2 seconds of CPU time. With the GPU client using barely any CPU time while the GPU is at full load, this opens another great opportunity to run other (non-GPU) BOINC projects at the same time. If you want to know how to tweak your BOINC settings, follow the instructions found in the GPU ReadMe.
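To put that CPU headroom in perspective, a back-of-the-envelope calculation (assuming the ~2 seconds of CPU time per workunit quoted above and the ~31-second GPU times from our table):

```python
# Rough estimate of how much of one CPU core the GPU client occupies,
# assuming ~2 s of CPU time per workunit and ~31 s of GPU time per
# workunit (the fastest GPU result in the table above).
cpu_seconds_per_wu = 2
gpu_seconds_per_wu = 31
cpu_load = cpu_seconds_per_wu / gpu_seconds_per_wu
print(f"{cpu_load:.0%} of one core")  # 6% of one core
```

In other words, well over 90% of each core stays free for other BOINC projects while the GPU crunches.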

Gipsel has given us great insight into how computing on the ATI GPU works, and we'd like to thank him for it. We're sure we will hear more from him and his excellent skills in the future.

Our recommendation for ATI GPU computing clearly points to Milkyway@Home, where we found that a single GPU (we only had the 4850 available at the time we wrote this story) increases workunit throughput to roughly 7 times that of our four-core Xeon running the optimized client. If our quad-core Xeon had four GPUs available, that would skyrocket our throughput to roughly 27 times the number of workunits done in the same time, while leaving the CPU pretty much idle.
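The factors of 7 and 27 fall straight out of the table. Assuming the SSE2 client (8,322 s total) on each of the four Xeon cores versus the GPU (304 s total):

```python
# Throughput projection from the table: SSE2 client total (8322 s) vs.
# GPU client total (304 s), on a four-core Xeon.
gpu_speedup_per_core = 8322 / 304          # one GPU vs. one optimized core
cores = 4
one_gpu_vs_quad = gpu_speedup_per_core / cores        # one GPU vs. 4 cores
four_gpus_vs_quad = 4 * gpu_speedup_per_core / cores  # four GPUs vs. 4 cores
print(round(one_gpu_vs_quad, 1))    # 6.8 -> roughly 7x the quad-core output
print(round(four_gpus_vs_quad, 1))  # 27.4 -> roughly 27x
```

One GPU replaces about 27 optimized cores, so against four cores it delivers about 7x the throughput, and four GPUs about 27x.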

UPDATE #1, March 25th, 2009 10:26AM CET: In the article, we used a base of 100 as the performance basis, and we weren't clear about it. In November 2008, the Milkyway@Home code was 100x slower than it is now, after Andreas "Gipsel" Przystawik did a brilliant set of code optimizations. This was an efficiency test of Gipsel's code, not of the original Milkyway@Home code.