Monday, February 7, 2011

Java on GPU with Ateji PX



An explicit parallel composition operator (Ateji PX's parallel bar) is pretty useful for hybrid CPU-GPU programming.

All the frameworks I know of embed GPU code in some sort of method call, sometimes indirectly. In any case, you end up writing something like:


blablabla(); // runs on CPU

// this code must run on the GPU
output = doSomethingOnGPU(input);

blablabla(); // runs on CPU



The bottom line is that input and output data, whether implicit or explicit, are passed from CPU to GPU before the code runs, and then from GPU back to CPU when the computation terminates. In other words, the GPU spends a lot of time sitting idle, waiting for I/O.

The key to GPU performance is to overlap data transfer and computation, so that input data is already present when a new computation starts. When coding in OpenCL or CUDA, this is done via asynchronous calls:


// start an asynchronous transfer on a CUDA stream;
// the call returns immediately
cudaMemcpyAsync(devInput, hostInput, size, cudaMemcpyHostToDevice, stream);

// perform computations that do not depend
// on the result of the transfer

// the kernel is queued on the same stream, so the runtime
// guarantees it starts only after the transfer has completed
kernel<<<grid, block, 0, stream>>>(devInput);

// now we're sure the transfer is complete,
// we can perform computations that do depend
// on its result


In this example, the intent is really to perform computation and communication in parallel. But since there is no notion of parallel composition in OpenCL (or in C, or in any mainstream language, for that matter), the intent cannot be expressed directly. Instead, you have to resort to this complex and rather unintuitive mechanism of asynchronous calls. This is pretty low-level, and you would rather have the compiler transform your intent into these low-level calls. That is precisely what Ateji PX does.

So what's the intent? Expressing that the CPU and the GPU must execute concurrently. In Ateji PX, this is done with two parallel branches, one of them bearing a GPU annotation (#OpenCL in the example below). A channel is visible from both branches and will be used for communication.


Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // code to be run on CPU
     ...
  || (#OpenCL) // code to be run on GPU
     ...
]



Note that running code on the CPU or on the GPU is only a matter of modifying the annotation (the choice can even be made at run time), as sketched below. No other change to the source code is required.
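For instance, removing the annotation retargets the second branch to an ordinary CPU thread; the branch bodies stay exactly as they are (a sketch of the same block as above):

[
  || // code to be run on CPU
     ...
  || // annotation removed: this branch now runs on a CPU thread too
     ...
]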

The GPU branch repeatedly waits for data on the input channel, performs the computation, and sends back a result:


    || // code to be run on GPU
       for(;;) {
         input ? data; // receive data from CPU on the input channel
         result = computation(data);
         output ! result; // send the result on the output channel
       }




The CPU repeatedly sends input data and waits for results:


    || // code to be run on CPU
       for(;;) {
         input ! data;    // send work to the GPU branch
         output ? result; // receive the computed result
         ... do something with the result ...
       }




Computation and communication overlap because the communication channels have been declared as asynchronous:


Chan input = new AsyncChan();
Chan output = new AsyncChan();



That's all!
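
For reference, here is the whole pattern assembled from the fragments above (computation stands for the actual kernel logic):

Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // CPU branch: produce work, consume results
     for(;;) {
       input ! data;
       output ? result;
       ... do something with the result ...
     }
  || (#OpenCL) // GPU branch: receive, compute, send back
     for(;;) {
       input ? data;
       result = computation(data);
       output ! result;
     }
]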

The intent is clearly expressed in the source code: two parallel branches communicate using channels, and a single annotation states that one branch should run on the GPU. No need to manage asynchronous calls, no need to use two different languages to code one application: the Ateji PX compiler does all of this for you.

Code is understandable and maintainable, and it runs on multicore CPUs by simply changing the branch annotation. You can, for instance, debug the application on the CPU before deploying it on the GPU.



We prototyped a parallel version of the Mandelbrot set demo based on this idea, and achieved a 60x speedup on a standard desktop PC. A pretty impressive speedup for just adding one annotation to the source code, isn't it?
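The demo's code isn't reproduced here, but the idea is easy to sketch: each pixel of the Mandelbrot set can be computed independently, so the rendering loop maps naturally onto quantified parallel branches. The following is only an illustration, assuming a hypothetical mandel(x, y) method that returns the iteration count for one pixel:

int[][] image = new int[height][width];
[
  // one quantified parallel branch per image row;
  // an annotation on the branch would map it to the GPU
  || (int y : height)
     for (int x = 0; x < width; x++) {
       image[y][x] = mandel(x, y); // hypothetical per-pixel kernel
     }
]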

2 comments:

  1. Hi,

    is it possible to let the GPU run a complete JAR?
    We're not able to parallelize the program, but it's important for us to process a lot of work units in a short time, and this could be done serially or in parallel.

    That would be nice to have,

    Constellation

  2. Hi Constellation,

    There's no way a GPU of the current generation could run a complete JAR. GPUs are fast but specialized processors: they excel at certain operations but cannot handle all JVM bytecodes with decent efficiency.

    The way we do it with Ateji PX is to translate only specific fragments of the Java code, namely those that can benefit from the GPU architecture. Everything else keeps running on the CPU.

    If you can tell me more about your app in a private mail, I'll try to estimate the speedup you can expect on GPU.
