Posts for the month of March 2011

More mesh refactoring!?

Don't worry: it's almost over.

Previous mesh hierarchy

[UML diagrams of the previous hierarchy: Meshes, Topologies, Geometries]

I'm stripping out geometry and topology

Why?

  • Their inheritance trees seem to inevitably mimic those of the meshes.
  • We don't get any code reuse out of their existence, despite what I thought at their creation.

  • A huge amount of boilerplate code is devoted to passing arguments to, instantiating, and creating interfaces to geometry/topology objects that is otherwise unnecessary.
  • While geometry/topology objects do provide an added amount of orthogonality to the meshes, this benefit is negligible since it is highly likely that no other class hierarchy aside from meshes will use geometry/topology objects.

I was convinced of the harmful effects of geom/top classes when I went to expand the topology classes. In adding a few extra methods to topology, I found that I would have to nearly fully replicate the class tree of the meshes (instead of having just five topology classes) and the argument list for topology classes would not only vary from class to class, but would become unreasonably long.

Since the (lengthy) argument list for all geometry objects already varied from class to class, and thus was slightly ridiculous, I decided to think about reuniting meshes and geometry/topology. Where there is no uniform interface, no good code-reuse, and heavy coupling, merging classes is likely a good course of action.

Hard lesson learned: never refactor unless it is going to prevent duplication.

The new arrangement

I spent today merging geometries and topologies back into meshes. The new mesh hierarchy is displayed in the following UML diagrams.

[UML diagrams: Meshes, Grid Builders]

New features

  • Uniform and irregular meshes are now siblings. This results in a less confusing inheritance scheme. Gridlike?D objects are aggregated under (Uniform)Grids to avoid duplication without the use of multiple inheritance. Gridlikes are never instantiated, but their contents are almost entirely static.
  • GridBuilders are being used for the construction of grids. They warrant their own class hierarchy because the process for grid construction is largely dimensionally-independent. After the class-travaganza of geometry and topology, I can understand if this addition is met with suspicion. However, I think the reasons above are good ones.
  • The interface for all meshes is defined in AbstractMesh, which should provide some clarification.
  • Reduced NLOC after deleting the cruft classes.
  • Posted: 2011-03-27 00:49 (Updated: 2011-03-27 01:00)
  • Author: obeirne
  • Categories: (none)
  • Comments (0)

explicit terms and residuals

What to do in _prepareLinearSystem() for different values of var?

Now:

                                | var=None          | var=A
  Term()                        | error             | matrix for A
  Term(var=A)                   | matrix for A      | matrix for A
  Term(var=B)                   | matrix for B      | 0
  BinaryTerm(var=())            | error             | matrix for A
  BinaryTerm(var=(A,))          | matrix for A      | matrix for A
  BinaryTerm(var=(A, B))        | error             | matrix for A
  BinaryTerm(var=(B,))          | matrix for B      | 0
  BinaryTerm(var=(B, C))        | error             | 0
  CoupledBinaryTerm(var=(A, B)) | matrix for (A, B) | error
  CoupledBinaryTerm(var=(B, C)) | matrix for (B, C) | error

Want:

                                | var=None             | var=A
  Term()                        | error                | matrix for A
  Term(var=A)                   | matrix for A         | matrix for A
  Term(var=B)                   | matrix for B         | 0
  BinaryTerm(var=())            | error                | matrix for A
  BinaryTerm(var=(A,))          | matrix for A         | matrix for A
  BinaryTerm(var=(A, B))        | residual for A and B | matrix for A + residual for B
  BinaryTerm(var=(B,))          | matrix for B         | residual for B
  BinaryTerm(var=(B, C))        | residual for B and C | residual for B and C
  CoupledBinaryTerm(var=(A, B)) | matrix for (A, B)    | error
  CoupledBinaryTerm(var=(B, C)) | matrix for (B, C)    | error

Note: _buildMatrix() still works as it did before, i.e., it must always be passed a solution variable and it builds the matrix for that solution variable.
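
Read as a rule, the "Want" table seems to boil down to the sketch below. This is only an illustration of the table, not FiPy's actual _prepareLinearSystem(); the wanted() helper, the string results, and the variable names are all hypothetical. term_vars is the tuple of variables the term was constructed with and solve_var is the var of the column headings.

# Hedged sketch of the "Want" table above as a dispatch rule (not FiPy code).
def wanted(term_kind, term_vars, solve_var):
    # term_kind: "Term", "BinaryTerm", or "CoupledBinaryTerm"
    # term_vars: tuple of variables the term was built with (may be empty)
    # solve_var: the variable passed to _prepareLinearSystem(), or None
    if term_kind == "CoupledBinaryTerm":
        # coupled terms only make sense without an explicit solution variable
        return "matrix for (%s)" % ", ".join(term_vars) if solve_var is None else "error"

    if solve_var is None:
        if len(term_vars) == 0:
            return "error"
        if len(term_vars) == 1:
            return "matrix for %s" % term_vars[0]
        return "residual for %s" % " and ".join(term_vars)

    # an explicit solution variable was passed
    if len(term_vars) == 0:
        return "matrix for %s" % solve_var
    if term_kind == "Term":
        return "matrix for %s" % solve_var if solve_var in term_vars else "0"
    others = [v for v in term_vars if v != solve_var]
    if solve_var in term_vars:
        matrix = "matrix for %s" % solve_var
        return matrix + (" + residual for %s" % " and ".join(others) if others else "")
    return "residual for %s" % " and ".join(term_vars)

# e.g.:
#   wanted("BinaryTerm", ("A", "B"), "A")   -> "matrix for A + residual for B"
#   wanted("BinaryTerm", ("B", "C"), None)  -> "residual for B and C"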

  • Posted: 2011-03-10 22:30 (Updated: 2011-03-11 10:00)
  • Author: guyer
  • Categories: (none)
  • Comments (0)

Looking at Cython

Cython is widely accessible

  • Cython is bundled with Enthought Python Distribution, SAGE, and PythonXY
  • Cython only requires Python and GCC, so it's installable basically anywhere
  • Cython is easy_installable.
  • mpi4py is already written in Cython.

In short: I'll bet dollars to donuts that anywhere FiPy is installable, Cython is too.

Building Cython code

  1. Write up a module foo.pyx in Cython.
  2. Write a setup.py which lists foo.pyx as an extension and specifies a cmdclass entry to compile it; let's call that entry "build_ext" (a minimal sketch follows after these steps).
  3. Run
    python setup.py build_ext --inplace
    
    This translates foo.pyx down to C and compiles it into an importable extension module.
  4. Use foo as you would a Python module, despite the fact that it's (supposedly well-optimized) C code.

More details are available here.
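
For reference, a minimal setup.py along the lines of step 2 might look like the following. It's a sketch built around the foo.pyx placeholder from the steps above, not a drop-in file.

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    name="foo",
    # the cmdclass entry that lets "python setup.py build_ext" invoke Cython
    cmdclass={"build_ext": build_ext},
    ext_modules=[Extension("foo", ["foo.pyx"])],
)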

Cython integrates with NumPy

See here. Further details are available in Seljebotn's paper.

An interesting excerpt from the paper:

For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.

Most arithmetic algorithms in FiPy are specified in terms of NumPy operations, so Cython's usefulness to us may be questionable. In a direct translation from a NumPy-operation-based routine to typed Cython code (using cdef np.ndarray), I saw no speedup.

Return of the GPU

Spoiler alert: GPUArray is still snake oil

We've spoken recently of whether or not our decision to ditch GPUArray forever was ill-conceived. Certainly, my previous blog post on the topic was, at best, incomplete in that I never gave a conclusive explanation of what time is being spent where. This was a result of me using fancy dot-graph visualizations instead of sticking to that natural adversary of bullshit: raw text.

In the previous post, I never examined what Guyer refers to as terminal procedures. More importantly, though, I never presented a list of procedures sorted by time spent in each procedure itself (excluding calls made to other procedures within its body). In other words, I never presented a list of procedures sorted by what Python's Pstats refers to as total time.

In this post, I will explore whether or not arithmetic operations ex solver are taking a considerable amount of time. When I conclude that they are, I will explore whether or not the use of a temporary GPUArray for some arithmetic operations may alleviate this bottleneck. I will conclude that it won't, since GPUArray initialization and communication are stalls which cause a net increase in runtime, even considering the much better performance of arithmetic operations on the GPU.

Profiling FiPy

In order to determine if, in fact, we are spending an amount of time on ex solver arithmetic operations that would merit speeding them up somehow, I profiled a phase anisotropy run, available here.
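
Generating the profile data itself is just a matter of running the simulation under cProfile. A hedged sketch follows; anisotropy.py and anisotropy.pstats are stand-in names, not the actual files.

import cProfile

# dump the profile to a file that pstats can read back later
cProfile.run("execfile('anisotropy.py')", "anisotropy.pstats")

# or, equivalently, from the shell:
#   python -m cProfile -o anisotropy.pstats anisotropy.py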

I ran the simulation for 10 steps with dx and dy both at 1000. I then obtained profiling output with this script

import pstats

def printProfileInfo(files):
    for filename in files:
        p = pstats.Stats(filename)

        # Sort by time in self
        p.sort_stats('time').print_stats(20)

if __name__ == '__main__':
    import sys
    printProfileInfo(sys.argv[1:])

run against this pstats file. The output is here.

Let's walk through line-by-line.

  • pysparse.itsolvers.pcg: solver; of no concern right now.
  • {method 'sum' ...}: helllooooo arithmetic.
  • variable.py:1088(<lambda>): straight from The Good Book, Variable.py:

return self._BinaryOperatorVariable(lambda a,b: a*b, other)

Obviously some steamy arithmetic.

  • {method 'take' ...}: I'd guess the reason this guy is so costly is memory access. Just a guess, though.
  • meshVariable.py:230(_dot): helllooooo arithmetic.
  • harmonicCellToFaceVariable.py:42(_calcValuePy): a whole mess of arithmetic.
def _calcValuePy(self, alpha, id1, id2):
    # values of the cells on either side of each face
    cell1 = numerix.take(self.var,id1, axis=-1)
    cell2 = numerix.take(self.var,id2, axis=-1)
    # arithmetic interpolation of the two cell values (the denominator below)
    value = ((cell2 - cell1) * alpha + cell1)
    # nudge exact zeros away from zero to avoid dividing by zero
    eps = 1e-20
    value = (value == 0.) * eps + (value != 0.) * value
    cell1Xcell2 = cell1 * cell2
    # weighted harmonic mean of the two cell values, masked where the
    # denominator is vanishingly small
    value = ((value > eps) | (value < -eps)) * cell1Xcell2 / value
    # zero the result where the two cell values have opposite signs
    value = (cell1Xcell2 >= 0.) * value

    return value

Okay, so the point is clear: arithmetic out the wazoo. That numerical analysis software is stunted by calculation shouldn't come as a surprise.

Can the GPU help us?

For the sake of simplicity and to serve as a proof of concept, from here on out I'll deal exclusively with sum. If we can make sum into less of a problem with the help of Dr. Nvidia, then certainly we can remedy other arithmetic aches similarly.

As I was shuffling around $FIPYROOT/sandbox to prepare some GPU-sum benchmarks, I was pleasantly surprised to find that I'd already been through this process; I just never had the good sense to document it.

Allow me to introduce you to benchSum.py, available here.

#!/usr/bin/env python

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpua
import random

"""
Using `time`:

GPU 
========
          real   3m55.157s
          user   3m52.720s
          sys    0m1.530s

        where numLoops is 200.

NumPy
========
          real    3m52.877s
          user    3m51.680s
          sys     0m0.590s

        where numLoops is 200.

Using cProfile/pstats:
GPU
========

Tue Nov  2 20:31:07 2010    sum-gpu.prof

         800042925 function calls (800042000 primitive calls) in 372.578 CPU seconds

   Ordered by: internal time
   List reduced from 596 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
400000000  163.425    0.000  194.783    0.000 random.py:351(uniform)
        1  142.234  142.234  372.452  372.452 benchSum.py:26(compare)
      402   33.223    0.083   33.223    0.083 {numpy.core.multiarray.array}
400000006   31.358    0.000   31.358    0.000 {method 'random' of '_random.Random' objects}
      400    1.667    0.004    1.667    0.004 gpuarray.py:93(set)
      400    0.274    0.001    0.397    0.001 gpuarray.py:975(sum)
     1400    0.171    0.000    0.189    0.000 gpuarray.py:61(__init__)
        1    0.036    0.036    0.036    0.036 tools.py:170(make_default_context)
        1    0.023    0.023  372.581  372.581 benchSum.py:3(<module>)
     1199    0.020    0.000    0.025    0.000 driver.py:257(function_prepared_async_call)
      400    0.012    0.000    0.100    0.000 reduction.py:203(__call__)
      799    0.011    0.000    0.025    0.000 tools.py:470(context_dependent_memoize)
     1404    0.008    0.000    0.016    0.000 __init__.py:130(memoize)
      400    0.008    0.000    1.794    0.004 gpuarray.py:619(to_gpu)
     1400    0.007    0.000    0.018    0.000 gpuarray.py:40(splay)
        1    0.006    0.006    0.006    0.006 driver.py:1(<module>)
        2    0.006    0.003    0.012    0.006 __init__.py:2(<module>)
     1199    0.005    0.000    0.005    0.000 {pycuda._pvt_struct.pack}
      235    0.005    0.000    0.006    0.000 function_base.py:2851(add_newdoc)
      135    0.004    0.000    0.005    0.000 sre_compile.py:213(_optimize_charset)

NumPy
========

Tue Nov  2 20:38:08 2010    sum-numpy.prof

         800027142 function calls (800026254 primitive calls) in 373.266 CPU seconds

   Ordered by: internal time
   List reduced from 453 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
400000000  166.126    0.000  197.885    0.000 random.py:351(uniform)
        1  140.405  140.405  373.137  373.137 benchSum.py:30(compare)
      402   33.300    0.083   33.300    0.083 {numpy.core.multiarray.array}
400000000   31.759    0.000   31.759    0.000 {method 'random' of '_random.Random' objects}
      400    1.541    0.004    1.541    0.004 {method 'sum' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034 tools.py:170(make_default_context)
        1    0.024    0.024  373.266  373.266 benchSum.py:3(<module>)
        1    0.005    0.005    0.005    0.005 driver.py:1(<module>)
        2    0.005    0.003    0.012    0.006 __init__.py:2(<module>)
      235    0.005    0.000    0.006    0.000 function_base.py:2851(add_newdoc)
     2594    0.003    0.000    0.003    0.000 {isinstance}
      400    0.003    0.000    1.547    0.004 fromnumeric.py:1185(sum)
        6    0.003    0.001    0.004    0.001 collections.py:13(namedtuple)
      127    0.003    0.000    0.004    0.000 sre_compile.py:213(_optimize_charset)
   124/10    0.003    0.000    0.007    0.001 sre_parse.py:385(_parse)
        1    0.002    0.002    0.044    0.044 autoinit.py:1(<module>)
        3    0.002    0.001    0.019    0.006 __init__.py:1(<module>)
   287/10    0.002    0.000    0.008    0.001 sre_compile.py:38(_compile)
        1    0.002    0.002    0.002    0.002 core.py:2230(MaskedArray)
        1    0.002    0.002    0.002    0.002 numeric.py:1(<module>)
"""

def compare(numLoops, testing="gpu", arrLen=1000*1000):

        accum = 0

        for i in xrange(numLoops):
                # randomized to avoid memoization or other caching
                rand  = [random.uniform(0., 1000.) for j in xrange(arrLen)]
                rand2 = [random.uniform(0., 1000.) for j in xrange(arrLen)]
                if testing == "gpu":
                        a = gpua.to_gpu(numpy.array(rand))
                        b = gpua.to_gpu(numpy.array(rand2))
                        c = gpua.sum(a) + gpua.sum(b)
                else:
                        a = numpy.array(rand)
                        b = numpy.array(rand2)
                        c = numpy.sum(a) + numpy.sum(b)
                accum += c

                if i % 10 == 0:
                        print i

        print accum

if __name__ == '__main__':
        import sys
        compare(int(sys.argv[1]), sys.argv[2])     

A few big points here. First, the overall time for

time python benchSum.py 200 gpu 

is longer than for

time python benchSum.py 200 numpy 

Not what we expected, huh?

Examining the profiling results in the above docstring tells us that when we go to the GPU, NumPy's

400000000   31.759    0.000   31.759    0.000 {method 'random' of '_random.Random' objects}
      400    1.541    0.004    1.541    0.004 {method 'sum' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034 tools.py:170(make_default_context)

balloons into

400000006   31.358    0.000   31.358    0.000 {method 'random' of '_random.Random' objects}
      400    1.667    0.004    1.667    0.004 gpuarray.py:93(set)
      400    0.274    0.001    0.397    0.001 gpuarray.py:975(sum)
     1400    0.171    0.000    0.189    0.000 gpuarray.py:61(__init__)
        1    0.036    0.036    0.036    0.036 tools.py:170(make_default_context)

Keep this in mind.

The results in the module-level docstring above are why I hit the skids on GPUArray development last time, but I'll expound on them here.

Revisiting benchSum.py

"Trust, but verify" sez Reagan, and that's the way I was thinking when I rediscovered benchSum.py. It made sense to me that GPUArray isn't a silver bullet, but if I can't quantify that hunch then it's not science, but useless superstition. So, I decided to have some fun with benchSum.compare in the form of pstatsInterpretation.py. Let's see how:

#!/usr/bin/env python

import pstats

arraySizes = [10*10,
              100*100,
              500*500,
              1000*1000,
              2000*2000]

loopCounts = [10,
              50,
              100,
              200,
              500]

I decided that, while the default benchmark of a 1000*1000 element array run for 200 iterations through the body of benchSum.compare was a fair model of FiPy usage, more data points are always useful.

So, I tested for all combinations of (arraySizes[i], loopCounts[j]) (how many is that, kids? Remember discrete math?).

I generated the Pstat profiles with the following function.

def runProfiling():
    from benchSum import compare
    from cProfile import runctx

    for loopNum in loopCounts:
        for aSize in arraySizes:
            print "Doing %d loops on arraySize: %d..." % (loopNum, aSize)

            print "gpu."
            runctx("compare(%d, 'gpu', %d)" % (loopNum, aSize),
                   globals(), locals(),
                   "gpu-sum-%d-%d.pstats" % (loopNum, aSize))

            print "numpy."
            runctx("compare(%d, 'numpy', %d)" % (loopNum, aSize),
                   globals(), locals(),
                   "numpy-sum-%d-%d.pstats" % (loopNum, aSize))
            print

If you've forgotten how benchSum.compare works, go refresh your memory.

From the famous benchSum docstring, it is apparent that there is a correspondence between (numpy.sum) and (gpuarray.sum, gpuarray.set, gpuarray.__init__). This is the case because not only must GPUArray instantiate an ndarray but, in order to perform the arithmetic operation, it must throw the array onto the GPU: something that obviously doesn't happen with ndarray. If you don't believe me, go have a look and convince yourself now, because my conclusions all rest on this correspondence.

If I wanted to get nasty, I could include in the correspondence other calls that are made in the use of GPUArray that aren't in that of ndarray, namely gpuarray.to_gpu and gpuarray.splay, but those times are basically insignificant.

Anyway, I needed a way of ripping out individual method timing (remember, that's time spent in self, or total time) from the mass of Pstat profiles I'd accumulated.

def _getTimeForMethods(arraySize, loopCount, methodList, 
                       arrType="gpu", benchmark="sum"):

    from StringIO import StringIO
    import re

    totalTime = 0.

    for method in methodList:
        strStream = StringIO()
        name = "%s-%s-%d-%d.pstats" % (arrType,benchmark,loopCount,arraySize)
        p = pstats.Stats(name, stream = strStream)

        p.print_stats(method)   

        profString = strStream.getvalue()
        m = re.search(r"\d+\s+(\d+\.\d+)", profString)

        totalTime += float(m.group(1))

    return totalTime 

A little hairy, but it gets the job done. The above _getTimeForMethods prints the Pstats output to a string stream and then yanks out the total time with a regular expression. It does this for each method in methodList and accumulates a composite time, which is returned.
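
For instance, under the file-naming scheme used by runProfiling() above, pulling the total ndarray.sum time out of the 200-loop, 1000*1000-element NumPy profile would look something like this (an illustrative call, not part of the original script):

# e.g., total time spent in ndarray.sum for numpy-sum-200-1000000.pstats
numpySumTime = _getTimeForMethods(arraySize=1000*1000, loopCount=200,
                                  methodList=["method 'sum'"], arrType="numpy")
print numpySumTime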

I put this function to work in the following Behemoth, which interprets and graphs the Pstats results.

def graph():
    from matplotlib import pylab as plt
    import subprocess as subp

    """
    The methods which, according to the profiling results in the docstring of
    `benchSum.py`, are analogous between numpy.ndarray and gpuarray.

    We don't need to consider anything more than `sum` for numpy because numpy's
    set/init procedure is absorbed in the call to `array`, which both gpuarray
    and numpy do.
    """
    numpyMethods = ["method 'sum'"]
    gpuMethods = ["gpuarray.py\S*(set)",
                  "gpuarray.py\S*(sum)",
                  "gpuarray.py\S*(__init__)"]

    loopsConstantTimes = {"numpy": [],
                          "gpu": []}

    arrSizeConstantTimes = {"numpy": [],
                            "gpu": []}

    justSumArrSizeConstantTimesGPU = []

    for loops in [200]:
        for size in arraySizes:
            loopsConstantTimes["gpu"].append(_getTimeForMethods(size, loops,
                                                                gpuMethods))
            loopsConstantTimes["numpy"].append(_getTimeForMethods(size, 
                                                                  loops, 
                                                                  numpyMethods, 
                                                                  "numpy"))
                      
    for loops in loopCounts:
        for size in [1000*1000]:
            arrSizeConstantTimes["gpu"].append(_getTimeForMethods(size, loops,
                                                                gpuMethods))
            arrSizeConstantTimes["numpy"].append(_getTimeForMethods(size, 
                                                                  loops, 
                                                                  numpyMethods, 
                                                                  "numpy"))
            justSumArrSizeConstantTimesGPU.append(_getTimeForMethods(size, loops,
                                                          ["gpuarray.py\S*(sum)"]))          

    """
    PLOTTING ENSUES...

    If you really wanna see the gory details, go to the file in
    Matforge.
    """

Results

Okay, let's start off nice and light with a simple comparison between gpuarray.sum and ndarray.sum, then we'll dash any hope of dead-simple gpuarray usage.

Hell yeah! Look at that speed-up! Life must be a Singaporean paradise where I get paid to let other people's libraries halve my runtime!

Chya.

Whoops.

There's the whole picture. Now we're factoring in the additional effects of GPUArray usage: communication and initialization.

Here, the number of iterations is fixed at 200. The horizontal axis indicates the size of the array that we're summing. The vertical axis is the time of sum for ndarray and {sum, set, __init__} for GPUArray. Again, remember that these sets of methods are analogous. You can't have GPUArray's hot sum without its homely friends set and __init__.

Here, the array size is fixed at 1000*1000 elements. The number of iterations that the sum-accumulation procedure does is indicated on the horizontal axis. The vertical axis is, again, timing.

It's clear that ndarray wins out, even at a large array size (2000*2000).

Things I could be screwing up

Considering that the bottleneck with the GPUArray is setting the array, it would seem possible that I'm doing it wrong. The GPUArray documentation will verify that isn't the case. You just can't get around that gpuarray.set call; it's inevitable and it's what sinks your battleship.

It's fundamentally intuitive that you can't avoid the cost of communicating with the card. It's a longer trip down the bus and, since there isn't any lazy-eval/currying going on, it's a trip that'll happen on every arithmetic operation. TANSTAAFL.

To be sure, let's take a look at pycuda.gpuarray.set, just to see if it's plausible that the contents are slowing us down.

def set(self, ary):
    assert ary.size == self.size
    assert ary.dtype == self.dtype
    if self.size:
        drv.memcpy_htod(self.gpudata, ary)

Bingo! The key is in drv.memcpy_htod(...). That's where the actual transfer of data takes place; it makes sense time is lost there.

Possible recourse?

Right below pycuda.gpuarray.set is

def set_async(self, ary, stream=None):
    assert ary.size == self.size
    assert ary.dtype == self.dtype
    if self.size:
        drv.memcpy_htod_async(self.gpudata, ary, stream)

Might this be faster? Dunno. If it were, wouldn't Klockner be quick to tell us about it in the documentation?
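
The realistic way out, and the one the summary below lands on, is to amortize the transfer: push the operands to the card once, chain all the arithmetic there, and copy back once. Here's a hedged sketch of that idea using PyCUDA's ElementwiseKernel; the fused kernel, array contents, and names are purely illustrative, not FiPy code.

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpua
from pycuda.elementwise import ElementwiseKernel

# Hypothetical fused kernel: several arithmetic steps in a single pass on the card.
fused = ElementwiseKernel(
    "double *out, double *a, double *b",
    "out[i] = (a[i] - b[i]) * a[i] + b[i] * b[i]",
    "fused_arith")

a = gpua.to_gpu(numpy.random.rand(1000 * 1000))   # one trip down the bus
b = gpua.to_gpu(numpy.random.rand(1000 * 1000))   # one trip down the bus
out = gpua.empty_like(a)
fused(out, a, b)                                  # all the arithmetic stays on the GPU
print out.get().sum()                             # one trip back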

Summary

  • FiPy is bottlenecked largely by arithmetic operations.
  • GPUArray will not help us alleviate these bottlenecks by using it as a one-off Rainman every time we want to do a single arithmetic operation.
    • This is because communication is more expensive than the single arithmetic operation, even if the op is significantly faster on the GPU and even on arrays up to 2000*2000 in length.
  • The only appropriate use for PyCUDA is to write a few kernels for arithmetic-critical junctures where we can amortize the cost of communication over the speedup of multi-step arithmetic on the GPU.
  • Posted: 2011-03-01 16:50 (Updated: 2011-03-02 22:50)
  • Author: obeirne
  • Categories: (none)
  • Comments (0)