NVIDIA’s GPU Technology Conference 2012, now over (watch all the GTC 2012 presentations and speakers), left a trail of speculation about what the company’s next strategic move will be. "Streaming Media" (Cloud Video, Cloud Gaming, Cloud Kitchen and a sink) could easily have been the unofficial title for this year’s conference. The streaming message ran throughout the conference, down to the timbre of the startup companies NVIDIA has committed investment funding to through its GPU Ventures Program.

NVIDIA has entered a new phase of growth and has begun executing a long-term strategic plan – one that bridges to the next business plateau: affordable supercomputing.

Origins of the "Idea"
Graphics processors started life as fixed-function, raster-operation pipelines. Over time, the graphics display "engines" became increasingly programmable, leading to NVIDIA’s introduction in 1999 of the first Graphics Processing Unit (GPU). Researchers in fields such as medical imaging and geophysical exploration began experimenting with GPUs for running "general purpose" computational applications. The excellent floating-point performance of GPUs yielded remarkable speedups for a range of scientific applications, giving rise to a movement called GPGPU (General-Purpose computing on GPUs).

A major issue with programming GPGPUs at the time was that they had to be programmed through graphics APIs and shading languages such as OpenGL, HLSL, and Cg. Developers had to make their scientific applications look like graphics applications, recasting them as problems that draw triangles and polygons – which severely limited access to the tremendous performance increases the GPU was capable of unleashing.

GPGPU Developer Support Wares
Formal origins of the general-purpose GPU date to November 2006, when ATI opened access to its graphics cards under the "Close to Metal" (CTM) initiative, making programming manuals for its GPUs available. Suddenly programmers had access to compilers, debuggers, mathematical libraries and platforms for programming general-purpose GPU applications. Sadly for ATI (now AMD), Google acquired PeakStream, and the company lost a partner and momentum with the third-party toolkits. NVIDIA, on the other hand, released its first version of CUDA (with SDK) in February 2007. ATI released "ATI Stream" in December 2007 – an extended and improved version of CTM. The floating-point throughput offered by GPUs outstripped CPUs by a factor of 10 to 24.

OpenCL, initially developed by Apple Inc. [which still holds the trademark rights], was refined into an initial proposal in collaboration with technical teams representing AMD, IBM, Intel, and NVIDIA. Apple submitted the initial proposal to the Khronos Group in June 2008. The Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. OpenCL 1.0 received final approval for publication on 8 December 2008.

Using the OpenCL API, developers could launch compute kernels written in a limited subset of the C programming language on a GPU. OpenCL support was subsequently folded into NVIDIA’s CUDA architecture – the stage was set for the GPU’s "formal" entry into the market as a "General Purpose" computing engine.
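
To make the idea concrete, this is roughly what such a compute kernel looks like. The sketch below uses CUDA C rather than OpenCL (an OpenCL kernel body reads much the same, though its host-side setup and launch calls differ), and every name in it – vec_add, the vector size, and so on – is illustrative rather than taken from any vendor sample.

    // Illustrative sketch only: a GPU compute kernel written in a restricted
    // subset of C, here in CUDA C. Each thread adds one pair of elements.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1024;
        float *a, *b, *c;

        // Allocate device memory, launch n threads in blocks of 256, and wait.
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        cudaMemset(b, 0, n * sizeof(float));
        vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }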

Raster Ops Transition to Affordable Supercomputing
NVIDIA had reached a transition point in its corporate strategy. Bringing the GPU’s full performance to the larger scientific community required making it fully programmable for scientific applications and adding support for high-level languages like C, C++, and Fortran. The company had begun the transition from a graphics-processor-only focus to the next big challenge – affordable supercomputing.

NVIDIA made the decision to pursue affordable supercomputing in 2009. The company had to rethink the GPU using supercomputing as the reference frame. Essentially:

  • ECC at cache level through I/O ports
  • Full double precision IEEE 754 floating-point arithmetic support
  • Increase the number of execution units
  • Decrease power dissipation at all levels
  • Increase data bandwidth
  • Major changes to CUDA Architecture (a brief capability-check sketch follows this list)
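
As a minimal sketch of what several of these items mean to a programmer, the fragment below reads back a board’s ECC state, compute capability (which determines double-precision support), execution-unit count, and memory-bus parameters. It assumes a CUDA-capable system; the property names come from the standard CUDA runtime, while the formatting and the choice of fields are purely illustrative.

    // Minimal sketch: querying several supercomputing-oriented capabilities
    // (ECC, double-precision support, execution units, memory bandwidth inputs)
    // through the standard CUDA runtime API.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "no CUDA device found\n");
            return 1;
        }

        // Devices of compute capability 1.3 and later expose native IEEE 754
        // double precision; ECC state is reported per board.
        printf("Device             : %s\n", prop.name);
        printf("Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("ECC enabled        : %s\n", prop.ECCEnabled ? "yes" : "no");
        printf("Multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("Memory clock (kHz) : %d\n", prop.memoryClockRate);
        printf("Memory bus width   : %d bits\n", prop.memoryBusWidth);
        return 0;
    }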

NVIDIA, as a company, has lived a precarious life in the graphics fast lane and has to a large degree become an analog of that market’s growth and health. The company markets itself as being close to the edge of technology – yet it is fabless and must design competitive products using generic semiconductor processes available to all. The gestation of the idea for the company’s first real departure from a pure graphics-processor play, coupled with the fact that the company is, in reality, fairly conservative when it comes to product-line transitions, required more than a little internal argument.

Motivating Factors
At the beginning of the GPGPU movement NVIDIA had no motivation to change the raster-ops-style pipeline structure of its GPUs. Benchmark wins over competitors remained the measure of success for its designs. The advent and growth of the GPGPU movement turned what was at first a largely dissociated group into one that includes major supercomputing centers around the world. The fact that GPUs were the lowest-cost, most readily available solution for high performance computing aided the decision.

A less well known but significant growth segment is the under-$200K supercomputer market. Small and medium enterprises have been acquiring GPU-equipped units and running diverse scientific applications – now incorporated into their businesses, these systems have quickly become a mainstay of research and engineering support.

The growth of streaming video over the internet provided another forcing point in NVIDIA’s decision tree. Video streaming requires transcoding the video from one format to another so that a diverse universe of end-user devices can display it. The computational requirements mandated using GPGPU elements as part of a provider’s service.

An unexpected but natural element of streaming video is the adaptation of multi-player video games to the technology. Cloud gaming, also called gaming on demand, is a type of online gaming that allows direct, on-demand streaming of games onto a computer, similar to video on demand. It requires nothing special on the client side other than an adequate connection to the net. Transcoding for the diverse end-user device formats is mandatory – the model cannot work without it. NVIDIA realized this would reduce demand for GPUs sold into the consumer space but would elevate demand in the cloud-server segment.

Kepler – Response to the Idea
The first public announcement of Kepler came in September 2010. Kepler is the series of GPUs planned to take the company from graphics processors to affordable supercomputing. NVIDIA announced 4Q availability of the GK110-based Tesla K20 – the first GPU to exceed one teraflops of IEEE 754 double-precision throughput. The GK110 requires 7.1 billion transistors to implement – an indicator of the device’s intended market segment.

NVIDIA added several new and important innovations to the GK110 programming model:

  • Dynamic Parallelism – Dynamic Parallelism allows parallel scheduling of work on the GPU without involving the CPU. This capability allows less-structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, freeing the CPU for other tasks.
  • Hyper-Q – Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times (a combined sketch of both features follows this list).
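
As a rough illustration of both ideas – not NVIDIA’s own sample code – the sketch below has a parent kernel launch child kernels entirely from the GPU (Dynamic Parallelism, which requires compute capability 3.5 and compilation with -arch=sm_35 -rdc=true -lcudadevrt), while the host feeds work through several CUDA streams, the mechanism Hyper-Q speeds up by giving GK110 multiple independent hardware work queues. All names and sizes here are invented for the example.

    // Illustrative sketch of Dynamic Parallelism and Hyper-Q usage on GK110.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Child kernel: increments each element of its slice of the array.
    __global__ void child(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1;
    }

    // Parent kernel: the GPU itself schedules follow-on work and launches the
    // child kernel with no CPU involvement (Dynamic Parallelism).
    __global__ void parent(int *data, int n)
    {
        if (threadIdx.x == 0)
            child<<<(n + 255) / 256, 256>>>(data, n);
    }

    int main()
    {
        const int n = 4096, kStreams = 4, slice = n / kStreams;
        int *data;
        cudaMalloc(&data, n * sizeof(int));
        cudaMemset(data, 0, n * sizeof(int));

        // Issue independent work from several streams; on GK110 these map onto
        // separate hardware queues (Hyper-Q) instead of serializing on one.
        cudaStream_t streams[kStreams];
        for (int s = 0; s < kStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            parent<<<1, 32, 0, streams[s]>>>(data + s * slice, slice);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < kStreams; ++s)
            cudaStreamDestroy(streams[s]);
        cudaFree(data);
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }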

These features, considered extremely important to the success of the architecture, mark a major milestone for high performance computing semiconductor devices. One cautionary note is that the power dissipation of Kepler is still a pacing item and will likely remain so.

Idea Closure
NVIDIA fully realizes that the ground beneath its existing business is changing and that it must respond in kind in order to survive. Simply put, Kepler is the key to the next level – establishing NVIDIA as the premier supplier of high performance computing in the cloud. Affordable supercomputing in the cloud makes a very convincing argument. Having the compute power of a supercomputer at your disposal – whether for a multi-player game or for resolving protein fits in molecular modeling – removes the issue of cost: a thin-client solution that easily monetizes the cloud.

NVIDIA will most likely adopt a sales and business model somewhat similar to the one Intel uses for its Xeon series of processors – one that supports the company’s overall gross margin and ultimate profitability.

Gross Margin – The Business Goal
The health of any tech company eventually revolves around gross margin. It is an indicator of how well positioned the company’s products are [demand] and how efficiently the operation delivers products to the market. More importantly, it allows budget planning and funds the company’s competitive activities.

NVIDIA’s gross margin has suffered in the past, leaving the company in a less-than-desired position amongst analysts. The housing-market crash in the U.S., followed by financial contagion in the European Union, negatively affected sales of the low-end and mid-range product offerings – and the late arrival of new products placed additional stress on the company’s margins.

NVIDIA’s gross margin made a turnaround beginning in 2011. The company has been successful at holding margins at around 50% through 2011 and into 2012. A large percentage of the company’s stabilized margins comes from sales into the under-$200K supercomputer market – a well-kept company secret.

Down Binning
NVIDIA is a master of the down-bin art. Down-binning is a back-end operation in semiconductor manufacturing in which parts are tested and binned according to select operating parameters. This allows a spectrum of device capabilities to be offered to the customer – and it gives NVIDIA a way to remain flexible with regard to its gross profit margin.

Many companies now use GATP (Global Available-to-Promise) and CTM (Capable-to-Match) software to model customer deliverables and their ability to deliver them in advance. An interesting aspect of this type of modeling is that it makes it somewhat difficult to adhere to a specific set of technical requirements in advance of wafer sort. Sorting out dash numbers mixed with software workarounds helps explain NVIDIA’s delays at the time of new product introductions.

The GK110 by necessity will have more than a few dash numbers enabling recovery of partially good dies – a source of high-performance, lower-cost GPUs until yields improve.

Kepler Realization: TSMC vs. Samsung
NVIDIA’s Kepler production, run on TSMC’s 28 nm line, has led to a close collaboration between the two companies – TSMC has given NVIDIA priority for Kepler’s development. NVIDIA has also begun production of test chips using Samsung’s 28 nm process, lending credence to rumors that all is not well in the NVIDIA–TSMC relationship.

The current situation is somewhat unsettling – without volume 28 nm yields, the entire forward plan is at risk. We have received information that NVIDIA is interested in utilizing Samsung’s Austin, Texas facility for production and has already received test samples from that facility. Reportedly, there are other extenuating reasons for NVIDIA’s interest in this supply source, though "those reasons" are not completely clear yet.

BSN* Take
NVIDIA is well on its way to what appears to be a very bright future – if it can execute the plan. Production and power, the two P’s, are seen as the limiting factors in the company’s effort to bring Kepler into the affordable-supercomputing marketplace. The company’s improved margins and ascension to a lead position in the high performance computing cloud segment will undoubtedly be assured with Kepler. Maxwell will bring things into perspective, and you’ll know why Intel scaled Larrabee to 22 nm and branded the part Xeon Phi.

We could not help but notice that NVIDIA and Apple have long been "underground partners" in developing common standards ground between the two companies. After all, one of the industry’s dirty secrets was the relationship between the two companies on the development of OpenCL (can you say Cg really fast?), which the majority of the industry thought was a standard developed entirely by Apple. Meanwhile, the notion that Apple will be releasing an Apple TV sometime in the near future begs the question of where the company will obtain the necessary "transcoding capabilities" from the cloud. Our forward vision has improved by a few diopters, and there is even an approximate calendar date now.

As Bill Dally, NVIDIA’s Chief Scientist, said: "Where else will you get 10KW of supercomputing power on a 2 watt tablet?"