Monday, February 7, 2011

Java on GPU with Ateji PX



An explicit parallel composition operator (Ateji PX's parallel bar) is pretty useful for hybrid CPU-GPU programming.

All the GPU frameworks I know of embed GPU code in some sort of method call, sometimes indirectly. In any case, you end up writing something like this:


blablabla(); // runs on CPU

// this code must run on the GPU
output = doSomethingOnGPU(input);

blablabla(); // runs on CPU



The bottom line is that input and output data, whether implicit or explicit, are passed from the CPU to the GPU before the code runs, and back from the GPU to the CPU when the computation terminates. In other words, the GPU spends a lot of time sitting idle, waiting for I/O.

The key to GPU performance is overlapping data transfer and computation, so that input data is already present when a new computation starts. When coding in OpenCL or CUDA, this is done via asynchronous calls:


// start an asynchronous transfer on a CUDA stream;
// the call returns immediately
cudaMemcpyAsync(..., stream);

// perform computations that do not depend
// on the result of the transfer

// the kernel is launched on the same stream, so it is
// guaranteed to run only after the transfer has finished
kernel<<<grid, block, 0, stream>>>(...);

// within the stream the transfer is now complete,
// so computations that do depend on its result can run


In this example, the intent is really to perform computation and communication in parallel. But since there's no notion of parallel composition in OpenCL (or in C, or in any mainstream language for that matter), it is not possible to express that intent directly. Instead, you have to resort to this complex and not very intuitive mechanism of asynchronous calls. This is pretty low-level, and you'd be happy to let the compiler transform your intent into these low-level calls. That's precisely what Ateji PX does.

So what's the intent? Expressing that CPU and GPU must execute concurrently. In Ateji PX, this is done with two parallel branches, one of them bearing the #GPU annotation. A channel is visible from both branches and will be used for communication.


Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // code to be run on CPU
     ...
  || (#OpenCL) // code to be run on GPU
     ...
]



Note that whether code runs on the CPU or on the GPU is only a matter of modifying the annotations (this can also be decided at run-time). No other change to the source code is required.
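
For example, here is a minimal sketch of the same block with the GPU annotation removed, so that both branches run as ordinary CPU branches:


[
  || // code that already ran on the CPU, unchanged
     ...
  || // same code as before, now also run on the CPU:
     // the (#OpenCL) annotation has simply been removed
     ...
]
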

The GPU branch repeatedly waits for data on the input channel, performs the computation, and sends back a result:


    || // code to be run on GPU
       for(;;) {
         input ? data;               // receive data from the CPU on the input channel
         result = computation(data);
         output ! result;            // send the result back on the output channel
       }




The CPU branch repeatedly sends input data and waits for results:


    || // code to be run on CPU
       for(;;) {
         input ! data;               // send data to the GPU on the input channel
         output ? result;            // wait for the result on the output channel
         ... do something with the result ...
       }




Computation and communication overlap because the communication channels have been declared as asynchronous:


Chan input = new AsyncChan();
Chan output = new AsyncChan();
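

Putting the pieces together, the whole CPU-GPU pipeline looks roughly like the following sketch. The channel operators ? and ! and the AsyncChan class are used exactly as above; produceData(), computation() and useResult() are placeholder names standing in for application code:


Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // CPU branch: feed the GPU and consume its results
     for(;;) {
       data = produceData();
       input ! data;        // send work to the GPU
       output ? result;     // wait for a result
       useResult(result);
     }
  || (#OpenCL) // GPU branch: compute on each incoming item
     for(;;) {
       input ? data;        // receive work from the CPU
       result = computation(data);
       output ! result;     // send the result back
     }
]
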



That's all!

The intent is clearly expressed in the source code: two parallel branches communicate using channels, and a single annotation states that a branch should run on the GPU. No need to manage asynchronous calls, no need to use two different languages to code an application; the Ateji PX compiler does all this for you.
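
To get a sense of what the compiler is saving you, here is a rough plain-Java analogue of the same producer/consumer structure, written by hand with two threads and a BlockingQueue. This only illustrates the kind of boilerplate involved, it is not the code Ateji PX actually generates, and all class and method names are made up:


import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ManualPipeline {

    // placeholder for the actual computation
    static double computation(double data) {
        return data * data;
    }

    public static void main(String[] args) throws InterruptedException {
        // bounded queues play the role of the asynchronous channels
        BlockingQueue<Double> input  = new ArrayBlockingQueue<>(16);
        BlockingQueue<Double> output = new ArrayBlockingQueue<>(16);

        // worker thread: repeatedly take data, compute, send back a result
        // (with a real GPU, the transfers and kernel launches would have
        //  to be managed by hand here)
        Thread worker = new Thread(() -> {
            try {
                for (;;) {
                    double data = input.take();
                    output.put(computation(data));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();

        // main thread: send data, wait for results
        for (int i = 0; i < 10; i++) {
            input.put((double) i);
            double result = output.take();
            System.out.println("result = " + result);
        }
    }
}
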

The Ateji PX code is understandable and maintainable, and it can run on multicore CPUs by simply changing the #GPU annotation. You can, for instance, debug the application on the CPU before deploying it on the GPU.



We prototyped a parallel version of the Mandelbrot set demo based on this idea and achieved a 60x speedup on a standard desktop PC. A pretty impressive speedup for just adding a #GPU annotation to the source code, isn't it?