Posts for the month of October 2010

GPUArray considered harmful

Back in September, everyone decided that introducing FiPy to CUDA in some facet was a good idea. The particular introduction we decided on was in the form of middleware which would slide between NumPy and our own numerix.py. This proposed middleware would shift some (or all) arrays from the system memory onto the GPU memory. That way, with the arrays on the GPU, we could perform all computations there, reaping the gleaming wonders of the all-powerful graphics processing unit. Silver bullet, right?

No, not exactly.

I'm no longer convinced that this approach will gain us anything; that is to say, I am no longer convinced that indiscriminately chucking NumPy arrays at the GPU is going to yield any performance gain for FiPy. I come to this conclusion after profiling a few varied FiPy runs in their entirety and trying, in vain, to implement the middleware described above.

Profiling

I should have done this long ago, before we even talked about possible interactions between FiPy and CUDA, just to see where performance gains are actually going to benefit us. Finally, after a few weeks of struggling with the proposed GPUArray module, I sat down and profiled a few FiPy runs to see where the cycles are being spent. After all, if we can't find a significant amount of time that's being spent doing arithmetic operations on NumPy arrays, there is no point in putting arrays on the GPU before the solver.

3D heat diffusion (1003)

I profiled the problem found here with the timing instrumentation removed. The results of the profiling were run through gprof2dot to produce this figure. The figure is a tree where each node is a function, the percentage below it is the percentage time spent in the function and all sub-functions, the parenthesized percentage is the time spent in the function itself, and all children of that node are sub-functions called within that function.

In this case, the only nodes displayed are for functions which consume 5% of runtime or more. These are the only components of a run that we should consider optimizing. Notice that the only array-specific functions found here are numeric:201:asarray and ~:0:<numpy.core.multiarray.array>. Both of these functions would take much longer when using a GPU (based on a strong hunch). Note also that we can't see any arithmetic operations in this graph. For this particular problem, it is clear that arithmetic operations on arrays are not the bottleneck, and thus using a GPU pre-solution makes absolutely no sense.

To get a better idea of which array functions were consuming time, I decided to generate another figure where each node consumes a minimum of 1% total runtime. That figure shows array allocation dominating array-specific operations, with ~:0:<numpy.core.multiarray.array> responsible for 18.6% of runtime.

No arithmetic operations are visible until we reduce the minimum percentage of runtime consumed by any node to 0.5%. That figure finally shows numerix:624:sin weighing in at 0.73%. Another array operation which appears in this graph is core:3914:reshape, the reshaping of an array, which comes in around 0.85%. Think of how this operation's time consumption would balloon if we were on a GPU.

The conclusion is clear: we should not be using a GPU for pre-solution array corralling on this run or anything like it.

Wheeler's Reactive Wetting

After profiling these toy runs, I told Wheeler about the results. He and I agreed that before making any final conclusions, we should profile a more realistic problem. We chose his reactive wetting problem. He profiled, sent me the pstats file, and I generated [http://matforge.org/fipy/export/3926/sandbox/gpuarray/wettingProf.simple.pdf this figure]. The minimum runtime consumption here is 5%.

Note that most array-relevant operations (aside from ~:0:<numpy.core.multiarray.array>) are found under inline:14:_optionalInline. Here we see that numerix:941:take is a good subtree to examine, as it is responsible for ~14% of total runtime.

The only arithmetic operation in this mess is core:3164:__mul__, boasting 5.50%. "Well, finally," you say to yourself, "there's something that will benefit from residence on the GPU." Not so fast. Only 0.01% of that 5.50% is spent in __mul__ itself. The overwhelming majority of time is spent in its child, ~:0:<numpy.core.multiarray.where>, which is presumably fiddling around in memory: yet another operation that will balloon with GPU usage.

Let's work on PyKrylov

After these damning results, it's evident to me that using Klockner's PyCuda within FiPy is a really bad idea. When using PyCuda's drop-in replacement (ahem) for numpy.ndarray, any arithmetic operation results in an intermediate copy of the gpuarray since there is no clever mechanism for lazily evaluating curried-kernel strings, or whatever.

Furthermore, I think that any solution proposing to be as convenient as Klockner's gpuarray is snake-oil unless some very clever mechanisms for incremental kernel accumulation and/or lazy evaluation enter the picture. The only way we can juice the GPU is by writing a few custom kernels at performance critical junctures: that way, we're not tap-dancing all over the GPU's limited and relatively slow memory by creating transient copies at every step of an arbitrary computation.

For these reasons, I think we (or I) should work on writing a few CUDA kernels for use by the non-Charlatan regions of PyCuda within PyKrylov. This implies work on the PyKrylov branch, which I am more than happy to undertake.