Friday, March 18, 2011

This blog has moved


This blog is now hosted on the Ateji web site at http://www.ateji.com/blog/.

Please update your RSS feeds.

Using this new platform will make it much easier to format and post source code. Expect a wealth of technical articles about Java, multithreading, parallel programming and Ateji PX in the coming weeks.

See you there.

Wednesday, March 2, 2011

Ateji PX Free Edition available

Here at Ateji, we aim to offer innovative programming languages and tools. Running such a business is like performing a delicate balancing act: developing a good product takes time, and when the product is finally out you need to implement the right business model. An important choice is pricing.

Offering free software helps build a community behind the product, but does not help finance the company. This is why innovative solutions are typically launched in two steps. We have now reached the first step, where an initial stream of corporate customers guarantees recurring revenue. Now that we've made sure we can feed our families, we're ready to offer a free personal edition of the software.

I am proud to announce the availability of the Free Edition of Ateji PX, the easiest Java-based parallel programming tool on the market. The Free Edition handles up to 4 cores, which is typically the maximum core count found in personal computers. Use it for your existing applications and future developments, and enjoy multicore programming!

Corporate servers now commonly have 16 cores or more for computationally intensive applications. Corporate users can try the core-unlimited evaluation version and purchase a license when the tests are promising. Don't leave your powerful servers idle! Depending on the application, 12x speedups on 16-core servers are not uncommon.

Monday, February 7, 2011

Java on GPU with Ateji PX



An explicit parallel composition operator (Ateji PX's parallel bar) is pretty useful for hybrid CPU-GPU programming.

All the frameworks I know of embed GPU code in some sort of method call, sometimes indirectly. In any case, you end up writing something like:


blablabla(); // runs on CPU

// this code must run on the GPU
output = doSomethingOnGPU(input);

blablabla(); // runs on CPU



The bottom line is that input and output data, whether implicit or explicit, are passed from the CPU to the GPU before running the code, and then from the GPU back to the CPU when the computation terminates. In other words, the GPU spends a lot of time sitting idle, waiting for I/O.

The key to GPU performance is to overlap data transfer and computation, so that input data is already present when a new computation starts. When coding in OpenCL or CUDA, this is done via asynchronous calls:


// start an asynchronous transfer on a given stream;
// the call returns immediately
cudaMemcpyAsync(..., stream);

// perform computations that do not depend
// on the result of the transfer

// launch the kernel on the same stream: it is queued
// behind the transfer and only starts once the transfer is finished
kernel<<<grid, block, 0, stream>>>(...);

// now we're sure the transfer is complete,
// we can perform computations that do depend
// on its result


In this example, the intent is really to perform computation and communication in parallel. But since there's no notion of parallel composition in OpenCL (or C, or any mainstream language for that matter), it is not possible to express the intent directly. Instead, you have to resort to this complex and not very intuitive mechanism of asynchronous calls. This is pretty low-level, and you'd be happy to have the compiler transform your intent into these low-level calls. That's precisely what Ateji PX does.

So what's the intent? Expressing that CPU and GPU must execute concurrently. In Ateji PX, this is done with two parallel branches, one of them bearing the #GPU annotation (written (#OpenCL) in the code below). A channel is visible from both branches and will be used for communication.


Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // code to be run on CPU
     ...
  || (#OpenCL) // code to be run on GPU
     ...
]



Note that running code on the CPU or on the GPU is only a matter of modifying the annotations (this can also be determined at run-time). No other change to the source code is required.

The GPU branch repeatedly waits for data on the input channel, performs its computation and sends back a result:


    || // code to be run on GPU
       for(;;) {
         input ? data; // receive data from the CPU on the input channel
         result = computation(data);
         output ! result; // send the result on the output channel
       }




The CPU branch repeatedly sends input data and waits for results:


    || // code to be run on CPU
       for(;;) {
         input ! data;    // send data to the GPU on the input channel
         output ? result; // receive the result on the output channel
         ... do something with the result ...
       }




Computation and communication overlap because the communication channels have been declared as asynchronous:


Chan input = new AsyncChan();
Chan output = new AsyncChan();
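
Putting the pieces together, the whole structure looks roughly like this (a minimal sketch reassembling the fragments above; produceData(), computation() and useResult() are hypothetical placeholders for application code):

Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // CPU branch: produce work, consume results
     for(;;) {
       input ! produceData();      // hypothetical producer
       output ? result;
       useResult(result);          // hypothetical consumer
     }
  || (#OpenCL) // GPU branch: receive, compute, send back
     for(;;) {
       input ? data;
       output ! computation(data); // hypothetical kernel code
     }
]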



That's all!

The intent is clearly expressed in the source code: we have two parallel branches that communicate using channels, plus a single annotation stating that one branch should run on the GPU. No need to manage asynchronous calls, no need to use two different languages for coding the application; the Ateji PX compiler does all this for you.

The code is understandable and maintainable, and can run on multicore CPUs by simply changing the #GPU annotation. You can, for instance, debug the application on the CPU before deploying it on the GPU.



We prototyped a parallel version of the Mandelbrot set demo based on this idea, and achieved a speedup of 60x on a standard desktop PC. A pretty impressive speedup for just adding a #GPU annotation to the source code, isn't it?

Saturday, December 4, 2010

Why multithreading is difficult

It is common wisdom that programming with Java threads is difficult (and even more so in other languages). But have you ever tried to figure out precisely why this is so?

The problem with threads (PDF) is a nicely written research report that tries to answer this question. In short, a thread is a notion that makes sense at the hardware level, when you think in terms of registers and instruction pointers. But this hardware-level notion should never have made its way up to the source code level, just as you wouldn't mix Java source code and assembly language instructions.





Let us try to go deeper: why exactly is the notion of thread ill-suited as a programming abstraction? The core of the problem can be expressed in a single word: composability.

Consider for example arithmetic expressions. You can take two numbers and compose them to form an addition:

1+2

The result of the addition is itself a number, so you can compose it again with yet another number:
(1+2)+3 or 3+(1+2)

When there's no ambiguity, you can remove the parentheses altogether:
1+2+3

In other words, arithmetic expressions can be composed using + as a composition operator (there are many others).

The exact same process works for Java statements using sequential composition:

{ a; b; }
{ a; b; }; c;
c; { a; b; }
a; b; c;



What about threads? Can you take two threads and combine them in order to obtain a thread? So that it can itself be combined with yet another thread? No, threads do not compose.
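
To make this concrete, here is a small plain-Java sketch (the class name is hypothetical) contrasting the two situations: blocks compose into blocks, but the best you can do with two threads is start and join them by hand, and the result is not itself a Thread:

public class ComposabilityDemo {
    public static void main(String[] args) throws InterruptedException {
        // Sequential composition: a block containing blocks is still a block.
        { { System.out.println("a"); System.out.println("b"); } System.out.println("c"); }

        // Threads: there is no operator taking two Threads and returning a Thread.
        // All we can do is start them and wait for both by hand.
        Thread t1 = new Thread(() -> System.out.println("first branch"));
        Thread t2 = new Thread(() -> System.out.println("second branch"));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}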



So why is composability important?

Composability provides a method for building larger pieces out of smaller ones: just compose. This is how you can build large programs by building upon smaller ones.

You want to make sure that the program behaves as expected? First make sure that each of its pieces behaves as expected. This is what unit testing is about.

Composability also provides a method for understanding what is going on: just decompose. If a divide-by-zero exception is thrown by { a; b; }, then it must have been thrown either from a or from b. Decomposition is our standard and intuitive way to debug a program.

Now what about a multithreaded program? Imagine a program with two threads that throws an exception. Can you split the program into two pieces such that the exception comes from either one or the other? No way!

This is precisely why multithreading is difficult. Multithreaded programs are impossible to test and debug, and making sure that they work properly requires a thorough analysis of the whole code, including pieces you didn't even know existed.

Parallel programming itself does not need to be difficult; the problem is that mainstream programming languages do not provide a parallel composition operator the way they do for sequential composition. Ateji's technological offering consists precisely in retrofitting languages with such an operator.



Let me stress this once again: multithreaded programming is difficult because of the lack of composability, not because of parallelism, non-determinism, or the other usual suspects.

This is true of Java threads, but also of all structures built upon threads without providing their own composition mechanism, such as tasks (the Java Executor framework) or concurrent collections (the Java concurrency framework).

Saturday, November 27, 2010

Back from SuperComputing 2010


The New Orleans skyline from the Garden District

This view from the hotel at 6am was about the only time we had a chance to see the sun. SC10 was a very busy week for the Ateji team, with a lot of business meetings (all the big guys were there) and a wealth of visitors to our booth.

From left to right: Claude, Maxence and Patrick at the Ateji booth. Most visitors were curious about this new approach to parallel programming and spent a long time chatting and asking questions. We even had a handful of teachers interested in using Ateji PX as a tool for teaching parallel programming, since it provides a general and intuitive model of parallelism on top of Java.






Ateji was part of the Disruptive Technologies exhibit. You get there after submitting your technology and vision and being selected by a panel of experts from the program committee. And yes, we got a free booth! Many thanks to John and Elizabeth for making this exhibit a success.


Parallel lines named Desire


Being recognized as a disruptive technology means that Ateji PX has the potential to deeply change the landscape of HPC and parallel programming in the coming years. We all hope for its success as an industry standard. The focus on disruptive technologies was emphasized by having Clayton Christensen, the author of "The Innovator's Dilemma" and "The Innovator's Solution", as keynote speaker. Clayton, if you happen to read this, I'd be happy to have a chat with you.



We handed out these cute USB keys labeled "Ateji - unlock Java performance", containing whitepapers, documentation and an evaluation version of the software. They were a big hit at the SC10 booth, but also with the TSA: I carried a few hundred of them in my suitcase, and it was opened and inspected every single time we boarded a plane! I now have quite a collection of their inspection flyers.

Friday, September 3, 2010

Java for HPC (High-Performance Computing)

When presenting Ateji PX to an audience with HPC or simulation background, I often hear definitive opinions of the kind "We'll never consider Java, it's too slow".

This appears to be more a cultural bias than an objective statement referring to actual benchmarks. To better understand the state of Java for high-performance computing, let us browse through three recent publications:



Java is used happily for huge HPC projects

At the European Space Agency, Java has been chosen for large CPU and data handling needs on the order of 10^21 flops and 10^15 bytes. That's a lot of zeroes; there is no doubt that we're talking about high-performance computing here.
"We are happy with the decision made and haven’t (yet) faced any major drawback due to the choice of language" [1].


"HPC developers and users usually want to use Java in their projects" [2]. Indeed, Java has many advantages over traditional HPC languages:

  • faster development

  • higher code reliability

  • portability

  • adaptive run-time optimization


There is also a lesser known but very interesting side-effect. Since the language is cleaner, it is easier for developers to concentrate on performance optimization (rather than, say, chasing memory-allocation bugs):

"In Mare Nostrum the Java version runs about four times faster than the C version [...]. Obviously, an optimisation of the C code should make it much more efficient, to at least the level of the Java code. However, this shows how the same developer did a quicker and better job in Java (a language that, unlike C, he was unfamiliar with)" [1].


Java code is no longer slow [2].

All benchmarks show that Java is not any slower than C++. That was not true ten years ago, and the disappointing results of the JavaGrande initiative for promoting Java in HPC gave it a bad reputation. But recent JVMs (Java Virtual Machines) do an impressive job of aggressive runtime optimization, adapting to the specific hardware they run on and dynamically optimizing critical code fragments.

However, performance varies greatly depending on the JVM used (version and vendor) and on the kind of computation performed [1]. One lesson to remember is that you should always test and benchmark your application with several recent JVMs from different vendors, as well as with different command-line arguments (compare -client and -server).

Obviously, there are some caveats.

There are still performance penalties in Java communications: pure Java libraries are not well suited for HPC ("the speed and scalability of the Java ProActive implementation are, as of today still lower than MPI" [3]), and wrapping native libraries with JNI is slow.

The situation has been improving recently with projects such as Java Fast Sockets, Fast-MPJ and MPJ Express, which aim at providing fast message passing without the JNI overhead.

Java HPC libraries also lack the quality and sophistication of those available in C or Fortran (but this seems to be improving). Obviously, this is a matter of having a large enough community, and the critical mass seems to have been reached.

The decision process

I cannot resist quoting in full the passage about the decision process that took place at ESA:


FORTRAN was somewhat favoured by the scientific community but was quickly discarded; the type of system to develop would have been unmaintainable, and even not feasible in some cases. For this purpose the choice of an object-oriented approach was deemed advisable. The choice was narrowed to C++ and Java.

The C++ versus Java debate lasted longer. “Orthodox” thinking stated that C++ should be used for High Performance Computing for performance reasons. “Heterodox” thinking suggested that the disadvantage of Java in performance was outweighed by faster development and higher code reliability.

However, when JIT Java VMs were released we did some benchmarks to compare C++ vs Java performances (linear algebra, FFTs, etc.). The results showed that the Java performance had become quite reasonable, even comparable to C++ code (and likely to improve!). Additionally, Java offered 100% portability and I/O was likely to be the main limiting factor rather than raw computation performance.

Java was finally chosen as the development language for DPAC. Since then hundreds of thousands of code lines have been written for the reduction system. We are happy with the decision made and haven’t (yet) faced any major drawback due to the choice of language.


And now comes Ateji PX

So there's a growing consensus about using Java for HPC applications, in industries such as space, particle physics, bioinformatics and finance.

However, Java does not make parallel programming easy. It relies on libraries providing explicit threads (a concept very ill-suited as a programming abstraction, see "The Problem with Threads") or on Runnable-like interfaces for distributed programming, leaving all the splitting and interfacing work to the programmer.

What we've tried to provide with Ateji PX is a simple and easy way to do parallel high-performance computing in Java. Basically, the only thing you need to learn is the parallel bar '||'. Here is the parallel 'Hello World' in Ateji PX:

[
  || System.out.println("Hello");
  || System.out.println("World");
]



This code contains two parallel branches, introduced by the '||' symbol. It will print either
Hello
World
or
World
Hello
depending on which branch gets scheduled first.
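
For comparison, here is a rough sketch of what you would otherwise write by hand in plain Java (the class name is hypothetical):

public class HelloWorldThreads {
    public static void main(String[] args) throws InterruptedException {
        // Two explicit threads; the output order depends on scheduling.
        Thread hello = new Thread(() -> System.out.println("Hello"));
        Thread world = new Thread(() -> System.out.println("World"));
        hello.start(); world.start();
        hello.join(); world.join();
    }
}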

Data parallelism is obtained by quantifying parallel branches. For instance, the code below increments all N elements of an array in parallel:

[
  || (int i : N) array[i]++;
]
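
For comparison again, a rough plain-Java equivalent written by hand with parallel streams could look as follows (a sketch only, not the code the Ateji PX compiler actually generates; the class name is hypothetical):

import java.util.stream.IntStream;

public class ParallelIncrement {
    public static void main(String[] args) {
        final int N = 1_000_000;
        final int[] array = new int[N];
        // Increment all N elements in parallel on the common fork/join pool.
        IntStream.range(0, N).parallel().forEach(i -> array[i]++);
    }
}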


Communication is also part of the language, and is mapped by the compiler to whatever communication library is available, such as sockets or MPI, or to shared memory when possible. Here are two communicating branches: the first one sends a value on a channel, the second one reads the value and prints it:

Chan c = new Chan();
[
  || c ! 123;
  || c ? int value; System.out.println(value);
]
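
On a shared-memory machine, you can picture such a synchronous channel as roughly analogous to a rendezvous queue in plain Java (an illustration of the idea only, not the actual mapping performed by the Ateji PX compiler; the class name is hypothetical):

import java.util.concurrent.SynchronousQueue;

public class ChannelAnalogy {
    public static void main(String[] args) throws InterruptedException {
        // A rendezvous queue: put() blocks until another thread calls take().
        SynchronousQueue<Integer> c = new SynchronousQueue<>();

        Thread sender = new Thread(() -> {
            try { c.put(123); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread receiver = new Thread(() -> {
            try { System.out.println(c.take()); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        sender.start(); receiver.start();
        sender.join(); receiver.join();
    }
}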



There is much more; have a look at the Ateji PX whitepaper for a detailed survey.

Ateji PX has proven to be efficient, demonstrating a 12.5x speedup on a 16-core server with a single '||'.

It has also proven to be easy to learn and use. We ran a 12-month beta-testing program which showed that most Java developers, given the Ateji PX documentation, can write, compile and run their first parallel program within a couple of hours.

An interesting aspect is that the source code does not depend on the actual hardware architecture being used, be it shared-memory, cluster, grid, cloud or GPU accelerator. It is the compiler that performs the boring task of mapping syntactic constructs to specific architectures and libraries.

A free evaluation version is available for download here. Have fun!

Monday, August 30, 2010

Disruptive Technology at SC'10

Ateji PX has been selected for presentation at the Disruptive Technologies exhibit, part of the SuperComputing 2010 conference.


"Each year, the SC Conference seeks out new technologies with the potential to disrupt the HPC landscape as we know it. Generally speaking, “disruptive technology” refers to drastic innovations in current practices such that they have the potential to completely transform the high-performance computing field as it currently exists — ultimately overtaking the incumbent technologies or software tools in the marketplace. For SC10, Disruptive Technologies examines new computing architectures and interfaces that will significantly impact the high-performance computing field throughout the next five to 15 years, but have not yet emerged in current systems. The Disruptive Technologies exhibits, located in the SC10 exhibit hall, will showcase technologies ranging from storage, programming, cooling and productivity software through presentations, demonstrations and an exhibit showcase.

Selected technologies for SC10 will be on display during regular exhibit hall hours. Please stop by the booth for more information on the presentations and demonstrations schedule."

See you in New Orleans, November 13-19.