The dress of thought: 2011

Here at Ateji, we aim at offering innovative programming languages and tools. Running such a business is like performing a delicate balancing act: developing a good product takes time, and when the product is finally out you need to implement the right business model. An important choice is pricing.

Offering free software helps building a community behind the product, but does not help financing the company. This is why innovative solutions are typically launched in two steps. We have now reached the first step, where an initial stream of corporate customers guarantees a recurrent revenue. Now that we've made sure that we can feed our families, we're ready to offer a free personal edition of the software.

I am proud to announce the availability of the Free Edition of Ateji PX, the easiest Java-based parallel programming tool on the market. The Free Edition handles up to 4 cores, which is typically the upper count found in personal computers. Use it for your existing applications and future developments, and enjoy multicore programming!

Corporate servers now commonly have 16 cores or more for computational intensive applications. Corporate users can try the core-unlimited evaluation version and purchase a license when the tests are promising. Don't leave your powerful servers idle! Depending on the application, 12x speedups on 16-core servers are not uncommon.

An explicit parallel composition operator (Ateji PX's parallel bar) is pretty useful for hybrid CPU-GPU programming.

All frameworks I know embed GPU code into some sort of method call, sometimes indirectly. In any case, you end up writing something like

blablabla(); // runs on CPU

// this code must run on the GPU
output = doSomethingOnGPU(input);

blablabla(); // runs on CPU

The bottom line is that input and output data, either implicit or explicit, are passed from CPU to GPU before running the code, and then from GPU to CPU when computation terminates. In other words, the GPU spends a lot of time idle waiting for I/O.

The key to GPU performance is to overlap data transfer and computation, so that input data is already present when a new computation starts. When coding in OpenCL or CUDA, this is done via asynchronous calls :

// start asynchronous transfer
// the method call returns immediately
cudaMemcpyAsync(handle, ...);

// perform computations that do not depend
// on the result of the transfer

// wait until the transfer is finished,
// using 'handle' as a reference to the async transfer
kernel<<<grid, block>>>(handle);

// now we're sure the transfer is complete,
// we can perform computations that do depend
// on its result

In this example, the intent is really to perform computation and communication in parallel. But since there's no notion of parallel composition in OpenCL (or C, or any mainstream language for that matter), it is not possible to express directly the intent. Instead, you have to resort to this complex and not very intuitive mechanism of asynchronous calls. This is pretty low-level and you'd be happy that the compiler transform your intent into these low-level calls. That's precisely what Ateji PX does.

So what's the intent? Expressing that CPU and GPU must execute concurrently. In Ateji PX, this is done with two parallel branches, one of them bearing the #GPU annotation. A channel is visible from both branches and will be used for communication.

Chan input = new AsyncChan();
Chan output = new AsyncChan()
[
  || // code to be run on CPU
     ...
  || (#OpenCL) // code to be run on GPU
     ...
]

Note that running code on the CPU or on the GPU is only a matter of modifying the annotations (can also be determined at run-time). No other change is the source code is required.

The GPU repeatedly waits for input data on channel c, performs computation and sends a result:

    || // code to be run on GPU
       for(;;) {
         input ? data; // receive data from CPU on the input channel
         result = computation(data);
         output ! data; // send result on the output channel 
       }

The CPU repeatedly sends input data and waits for results:

    || // code to be run on CPU
       for(;;) {
         input ! data; 
         output ? data;
         ... do something with the result ...
       }

Computation and communication overlap because the communication channels have been declared as asynchronous:

Chan input = new AsyncChan();
Chan output = new AsyncChan();

That's all!

The intent is clearly expressed in the source code : we have two parallel branches that communicate using channels, add a single annotation to state that a branch should run on GPU. No need to manage asynchronous calls, no need to use two different languages for coding an application, the Ateji PX compiler does all this for you.

Code is understandable and maintainable, and can work on multicore CPUs by simply changing the #GPU annotation. You can for instance debug the application on CPU before deploying it on GPU.

We prototyped a parallel version of the Mandelbrot set demo based on this idea, and achieved a speedup of 60x on a standard desktop PC. A pretty impressive speedup for just adding a #GPU annotation in the source code, isn't it ?

The dress of thought

Friday, March 18, 2011

This blog has moved

Wednesday, March 2, 2011

Ateji PX Free Edition available

Monday, February 7, 2011

Java on GPU with Ateji PX