We are currently faced with a great many questions, and everybody, from the scientific community to us regular people, is looking for answers. The questions range from the human lifespan and how we can prevent diseases to our place in the universe and what it looks like. To find the answers to these perplexing questions, we need A LOT of computing power.
Enter the realm of GPGPU, or GPU Computing. Our passion for playing in virtual worlds has led companies such as ATI and Nvidia to create graphics chips that at times surpass general-purpose chips by orders of magnitude: 10x, 100x, in some cases even 1500x. This performance has made GPU chips household names in the world of scientific computing. Naturally, with every competition comes the ultimate question: “Who has the best GPU Computing chip out there?” That question has sparked a lot of controversy, and there is no easy answer.
In order to give you at least one part of the answer, we took a look at what communities can do to accelerate the distributed computing effort. We interviewed Gipsel, a.k.a. Andreas Przystawik, from the Planet3DNow! community. This bright programmer caused quite a stir when he optimized the Milkyway@Home client for GPU Computing. Currently, only Milkyway@Home, Einstein@Home and SETI@Home are known to offer their source code, and that openness was the basis for Gipsel’s GPGPU effort. If you know of other projects, let us know at firstname.lastname@example.org or email@example.com.
BSN*: Greetings Mr. Przystawik, for the very first question – could you tell us about your background and programming skills prior to optimizing the Milkyway@Home client?
Gipsel: I had my very first contact with some BASIC on a KC85 at the age of 11. But I guess that doesn’t count. Actually, I am not a really experienced programmer. I had computer science classes in school, learning some Pascal. Later, at university, I attended a lecture course, “Algorithms and Data Structures”, at the Information Technology faculty, for one semester only. During that time I started with some C programming in my spare time. No big projects, it was just for fun, mostly quite low-level stuff (involving assembler), as I was mainly interested in the hardware and its possibilities. But I did not have much time for it during my studies (Physics), so I was not doing any such things for about five years. And then came Milkyway@Home.
BSN*: What kind of hardware did you have available at the time when you optimized the client?
Gipsel: I started with my normal PC. It was an old AMD Athlon XP 1800+ with an AMD760-based motherboard and an Nvidia GeForce4 MX graphics card. Later, when there was the need to test some SSE2/3 binaries, I switched to a Phenom X4 9750 on a 780G chipset (using the integrated graphics, i.e. the Radeon HD 3200). I’m still working with this box. All the GPU stuff was also done on it. Unfortunately, I don’t have a recent graphics card (never needed one so far), meaning I can’t run and test the GPU applications myself.
BSN*: What performance gain was achieved after you completed the optimization work on the code?
Gipsel: It depends on what you are comparing. The original source code of Milkyway@Home was grossly inefficient; it simply wasted a lot of time. The first things to do were not optimizations in the usual sense; one had to clean up the algorithm Milkyway@Home is using. In the meantime, most of my suggestions for improvements have been implemented in the sources maintained by Milkyway@Home. That brought the calculation times for a workunit (WU) down in a massive way.
Using my CPU-optimized code, a 65nm Core 2 or a Phenom running at 3GHz will take just slightly above four minutes to crunch one of today’s short WUs. The stock applications distributed by the project are slower; they take between about 10 and 18 minutes. In November 2008, it would have taken a full day for the same WUs on the same CPUs (MW uses longer WUs now). Taking my optimizations into account, Milkyway@Home experienced a speedup by a factor of 100 on the CPU alone.
But I think readers are mostly interested in the GPU application. An ATI Radeon HD4870 completes the same WUs in only nine seconds. Since a quad-core CPU calculates four WUs at once, a 3GHz quad will effectively complete four of the fastest WUs in about four minutes. In the same time, ATI’s Radeon HD4870 will complete 25 WUs, which is six times the throughput for about the same price. Even a last-gen Radeon HD3800 will complete 8-10 WUs in four minutes, still more than double what a fast quad-core CPU can do. If you add up all the improvements, you see that a single HD4870 is now doing more science than the whole project did a couple of months ago! If you compare the beginning of the project with today’s situation, you could claim a gain from “one WU a day” on a single Core 2 processor @3GHz to almost 10,000 WUs a day with a HD4870 [this is a living testament to what code optimization can achieve - imagine if every application had such a dedicated code-optimizer - Ed.].
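[For readers who want to check the throughput figures quoted above, here is a minimal back-of-the-envelope sketch in Python. It uses only the rounded numbers from the interview (nine seconds per WU on the HD4870, roughly four minutes per WU per core on a 3GHz quad-core); the function and variable names are our own, purely for illustration. Because the inputs are rounded, the results land near the quoted figures rather than exactly on them - Ed.]

```python
# Back-of-the-envelope check of the throughput figures quoted in the interview.
# All inputs are the rounded numbers given by Gipsel; names are illustrative only.

SECONDS_PER_DAY = 24 * 60 * 60


def wus_completed(seconds_per_wu, window_seconds, parallel_wus=1):
    """Workunits finished in a time window, with parallel_wus running at once."""
    return parallel_wus * window_seconds / seconds_per_wu


window = 4 * 60  # the four-minute window used in the comparison

# HD4870: one WU every nine seconds
gpu_in_window = wus_completed(9, window)

# 3GHz quad-core: about four minutes per WU, four WUs at once
cpu_in_window = wus_completed(4 * 60, window, parallel_wus=4)

# GPU versus quad-core throughput in the same window
ratio = gpu_in_window / cpu_in_window

# HD4870 over a full day, against the original "one WU a day" starting point
gpu_per_day = wus_completed(9, SECONDS_PER_DAY)

print(f"GPU WUs in four minutes:  {gpu_in_window:.1f}")
print(f"Quad WUs in four minutes: {cpu_in_window:.1f}")
print(f"GPU vs. quad throughput:  {ratio:.1f}x")
print(f"GPU WUs per day:          {gpu_per_day:.0f}")
```

[With these inputs the GPU comes out at roughly 27 WUs per four-minute window and 9,600 WUs per day, in line with the “25 WUs” and “almost 10,000 WUs a day” mentioned above - Ed.]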
BSN*: Did you get any help from ATI or NVIDIA when you developed your code?
Gipsel: No. I just use the publicly available documentation and tools that everyone can download from their websites, without getting any special support.
BSN*: What was the reaction of the Milkyway@Home administration after you released your client?
Gipsel: The very first reactions were not very encouraging. Actually, before my project, there was an optimized application used privately by the well-known guy “Crunch3r”. That application actually sparked my interest in the whole thing. At that time the project staff didn’t react in the best possible way and it appeared to me they handled it like a threat. But I guess after a while they were simply convinced by the new possibilities that open up with the massively increased throughput. So they are now much more open and cooperative. If all goes well, it should become possible to distribute the ATI GPU application as a stock application by the project itself.
BSN*: Are you planning to release an Nvidia client as well? If not, why not?
Gipsel: Not at the moment, and there are several reasons for that. First of all, the ATI application still needs some polishing, such as multi-GPU or Linux support. Furthermore, the project itself is working with Nvidia on a CUDA-powered version. Apparently, Nvidia gives a lot of support to BOINC projects that want to port their applications to CUDA. Together with the mature CUDA SDK, it shouldn’t take long until MW@H also gets a GPU application that supports the latest Nvidia cards.
The reason I started with ATI in the first instance was the quite massive performance advantage ATI has on current hardware for the kind of calculations done at Milkyway [double-precision format - Ed.]. I hope it will increase the interest in getting GPGPU applications ported to ATI hardware as well, which is in a lot of cases at least as capable as comparable Nvidia offerings. The fact that I’m a member of Team Planet3DNow!, a BOINC team associated with an AMD-oriented website, has no influence whatsoever.
BSN*: What do you recommend to other distributed computing projects? ATI or Nvidia?
Gipsel: I would recommend supporting both. Without going too much into the details, there are different advantages to both. Basically, one can use a very simple high-level programming model for ATI that may be enough for simple problems. If not, one has to resort to harder-to-program low-level approaches, but gets very solid performance in return.
If you need to use a lot of double precision calculations, there is simply no way around ATI from a performance standpoint, at least with current hardware. On the other hand, Nvidia has created quite a mature environment with CUDA, enabling relatively easy creation of high-performing GPU applications. From what I hear, they also offer great support to BOINC projects. But we should overcome the need to create two versions of a GPGPU application with the advent of OpenCL, which will be supported by both [AMD & Nvidia - Ed.] as well as Intel. Actually, OpenCL has a lot of resemblance to CUDA.
BSN*: What do you think about GPU computing?
Gipsel: That is almost the same as asking “what do you think of multi core?”, only taken one or two steps forward. GPU Computing opens up great possibilities. It offers an increase in performance, and also in performance per watt, by an order of magnitude or even more. Realistically speaking, GPU Computing is currently limited to a small range of applications, at least for the time being.
One has to keep in mind that not all problems can be easily ported to it. Developers actually need to implement a lot more parallelism than for conventional multithreaded applications. And we all know how long it took (and still takes!) for mainstream applications such as games to really make use of dual- or quad-core CPUs. Now imagine you have to program not for four threads, but for some thousands or even a million threads!
The question from the beginning of this article, “Who has the best GPU Computing chip out there?”, is now (partially) answered. In the case of the BOINC project Milkyway@Home, the answer is the ATI Radeon 4800 series.
However, it is clear that GPU Computing has a mountain to climb, because it has to overcome the programming model itself. Old-school programmers weren’t educated to think in parallel, while the new generation of programmers is. It will take some time, but as programmers mature and experiment, we will enter a whole new era of computing.
What Gipsel did is nothing short of amazing, and it clearly proves that optimization is a key to a good application [you lazy scoundrels at Rockstar, are you taking note for your amateur GTA4 conversion? - Ed.]. The Milkyway@Home project was accelerated by 100 times on a CPU alone, and the GPU then brought the total speedup over the original code to 10,000 times. If a 10K performance increase isn’t mind-blowing, we don’t know what is.
We would like to thank Andreas Przystawik, a.k.a. Gipsel of Planet3DNow! fame, for his time and to congratulate him on his efforts in the world of Distributed Computing. At this time, we don’t know what features will appear in the next generation of GPUs, but you can be sure that the staff here at BSN* will keep you updated each and every step of the way.
Please also take a look at our upcoming follow-up story, where we compare all versions of the Milkyway@Home client on various workunits. There we’ll present the full numbers behind the speed increases mentioned in this interview.