Posts by author obeirne

More mesh refactoring!?

Don't worry: it's almost over.

Previous mesh hierarchy

Meshes

Topologies

Geometries

I'm stripping out geometry and topology

Why?

  • Their inheritance trees seem to inevitably mimic those of the meshes.
  • We don't get any code reuse out of their existence, despite what I thought at their creation.

  • A huge amount of boilerplate code is devoted to passing arguments to, instantiating, and creating interfaces to geometry/topology objects that is otherwise unnecessary.
  • While geometry/topology objects do provide an added amount of orthogonality to the meshes, this benefit is negligible since it is highly likely that no other class hierarchy aside from meshes will use geometry/topology objects.

I was convinced of the harmful effects of geom/top classes when I went to expand the topology classes. In adding a few extra methods to topology, I found that I would have to nearly fully replicate the class tree of the meshes (instead of having just five topology classes) and the argument list for topology classes would not only vary from class to class, but would become unreasonably long.

Since the (lengthy) argument list for all geometry objects already varied from class to class, and thus was slightly ridiculous, I decided to think about reuniting meshes and geometry/topology. Where there is no uniform interface, no good code-reuse, and heavy coupling, merging classes is likely a good course of action.

Hard lesson learned: never refactor unless it is going to prevent duplication.

The new arrangement

I spent today merging geometries and topologies back into meshes. The new mesh hierarchy is displayed in the following UML diagram.

Meshes

Grid Builders

New features

  • Uniform and irregular meshes are now siblings. This results in a less confusing inheritance scheme. Gridlike?D objects are aggregated under (Uniform)Grids to avoid duplication without the use of multiple inheritance. Gridlikes are never instantiated, but their contents are almost entirely static.
  • GridBuilders are being used for the construction of grids. They warrant their own class hierarchy because the process for grid construction is largely dimensionally-independent. After the class-travaganza of geometry and topology, I can understand if this addition is met with suspicion. However, I think the reasons above are good ones.
  • Interface for all meshes is defined in AbstractMesh, which should provide some clarification.
  • Reduced NLOC after deleting the cruft classes.
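
To make the aggregation idea concrete, here is a minimal sketch of the arrangement described above. AbstractMesh and Gridlike2D are named in this post; Grid2D and UniformGrid2D are inferred from "(Uniform)Grids", and the method names and constructor arguments are hypothetical stand-ins, not the actual interface.

```python
class AbstractMesh(object):
    """Defines the interface every mesh must provide."""

    def getNumberOfCells(self):
        raise NotImplementedError


class Gridlike2D(object):
    """Never instantiated; holds static helpers shared by the 2D grids."""

    @staticmethod
    def numberOfCells(nx, ny):
        return nx * ny


class Grid2D(AbstractMesh):
    """Irregular 2D grid; borrows from Gridlike2D by delegation."""

    def __init__(self, nx, ny):
        self.nx, self.ny = nx, ny

    def getNumberOfCells(self):
        return Gridlike2D.numberOfCells(self.nx, self.ny)


class UniformGrid2D(AbstractMesh):
    """Uniform 2D grid; a sibling of Grid2D, not a subclass of it."""

    def __init__(self, nx, ny):
        self.nx, self.ny = nx, ny

    def getNumberOfCells(self):
        return Gridlike2D.numberOfCells(self.nx, self.ny)
```

Both grids reuse Gridlike2D without inheriting from it or from each other, which is the point of aggregating instead of reaching for multiple inheritance.
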
  • Posted: 2011-03-27 00:49 (Updated: 2011-03-27 01:00)
  • Author: obeirne

Looking at Cython

Cython is widely accessible

  • Cython is bundled with Enthought Python Distribution, SAGE, and PythonXY
  • Cython only requires Python and GCC, so it's installable basically anywhere
  • Cython is easy_installable.
  • mpi4py is already written in Cython.

In short: I'll bet dollars to donuts that anywhere FiPy is installable, Cython is too.

Building Cython code

  1. Write up a module foo.pyx in Cython.
  2. Write a setup.py which lists foo.pyx as an extension and specifies a cmdclass entry to compile it. Let's call that entry "build_ext".
  3. Run
    python setup.py build_ext --inplace

    This translates the foo module from Cython to C and then compiles that C into an extension module next to the source.
  4. Use foo as you would a Python module, despite the fact that it's (supposedly well-optimized) C code.

More details are available here.

Cython integrates with NumPy

See here. Further details are available in Seljebotn's paper.

An interesting excerpt from the paper:

For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.

Most arithmetic algorithms in FiPy are specified in terms of NumPy operations, so Cython's use to us may be questionable. In a direct translation from a NumPy-operation-based routine to typed Cython code (using cdef np.ndarray), I saw no speedup.

Return of the GPU

Spoiler alert: GPUArray is still snake oil

We've spoken recently of whether or not our decision to ditch GPUArray forever was ill-conceived. Certainly, my previous blog post on the topic was, at best, incomplete in that I never gave a conclusive explanation of what time is being spent where. This was a result of me using fancy dot-graph visualizations instead of sticking to that natural adversary of bullshit: raw text.

In the previous post, I never examined what Guyer refers to as terminal procedures. More importantly, though, I never presented a list of procedures sorted by time spent in each procedure itself (excluding calls made to other procedures within its body). In other words, I never presented a list of procedures sorted by what Python's pstats module reports as tottime, or total time.

In this post, I will explore whether or not arithmetic operations ex solver are taking a considerable amount of time. When I conclude that they are, I will explore whether or not the use of a temporary GPUArray for some arithmetic operations may alleviate this bottleneck. I will conclude that it won't, since GPUArray initialization and communication are stalls which cause a net increase in runtime, even considering the much better performance of arithmetic operations on the GPU.

Profiling FiPy

In order to determine if, in fact, we are spending an amount of time on ex solver arithmetic operations that would merit speeding them up somehow, I profiled a phase anisotropy run, available here.

I ran the simulation for 10 steps with dx and dy both at 1000. I then obtained profiling output with this script

import pstats

def printProfileInfo(files):
    for filename in files:
        p = pstats.Stats(filename)

        # Sort by time in self
        p.sort_stats('time').print_stats(20)

if __name__ == '__main__':
    import sys
    printProfileInfo(sys.argv[1:])

applied to this pstats file. The output is here.
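
For reference, a .pstats file like that one can be produced with cProfile and then sorted by time in self. This is a minimal sketch on toy functions (inner, outer, profile_to_file and report are made up for illustration), written for Python 3 instead of the Python 2 used elsewhere in this post.

```python
import cProfile
import io
import pstats


def inner(n):
    # Does the real work, so it accumulates time in self.
    return sum(i * i for i in range(n))


def outer(n):
    # Mostly delegates to inner(): small time in self, large cumulative time.
    return inner(n) + inner(n)


def profile_to_file(filename, n=200000):
    """Profile outer(n) and dump the raw statistics to `filename`."""
    profiler = cProfile.Profile()
    profiler.enable()
    outer(n)
    profiler.disable()
    profiler.dump_stats(filename)


def report(filename, limit=5):
    """Return the top `limit` rows of `filename`, sorted by time in self."""
    stream = io.StringIO()
    stats = pstats.Stats(filename, stream=stream)
    stats.sort_stats('time').print_stats(limit)
    return stream.getvalue()
```

Calling profile_to_file("demo.pstats") followed by print(report("demo.pstats")) gives a table in the same layout as the output linked above.
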

Let's walk through line-by-line.

  • pysparse.itsolvers.pcg: solver; of no concern right now.
  • {method 'sum' ...}: helllooooo arithmetic.
  • variable.py:1088(<lambda>): straight from The Good Book, Variable.py:

return self._BinaryOperatorVariable(lambda a,b: a*b, other)

Obviously some steamy arithmetic.

  • {method 'take' ...}: I'd guess the reason this guy is so costly is memory access. Just a guess, though.
  • meshVariable.py:230(_dot): helllooooo arithmetic.
  • harmonicCellToFaceVariable.py:42(_calcValuePy): a whole mess of arithmetic.
def _calcValuePy(self, alpha, id1, id2):
    cell1 = numerix.take(self.var,id1, axis=-1)
    cell2 = numerix.take(self.var,id2, axis=-1)
    value = ((cell2 - cell1) * alpha + cell1)
    eps = 1e-20
    value = (value == 0.) * eps + (value != 0.) * value
    cell1Xcell2 = cell1 * cell2
    value = ((value > eps) | (value < -eps)) * cell1Xcell2 / value
    value = (cell1Xcell2 >= 0.) * value

    return value

Okay, so the point is clear: arithmetic out the wazoo. That numerical analysis software is stunted by calculation shouldn't come as a surprise.

Can the GPU help us?

For the sake of simplicity and to serve as a proof of concept, from here on out I'll deal exclusively with sum. If we can make sum into less of a problem with the help of Dr. Nvidia, then certainly we can rid ourselves of other arithmetic aches similarly.

As I was shuffling around $FIPYROOT/sandbox to prepare some GPU-sum benchmarks, I was pleasantly surprised to find that I'd already been through this process; I just never had the good sense to document it.

Allow me to introduce you to benchSum.py, available here.

#!/usr/bin/env python

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpua
import random

"""
Using `time`:

GPU 
========
          real   3m55.157s
          user   3m52.720s
          sys    0m1.530s

        where numLoops is 200.

NumPy
========
          real    3m52.877s
          user    3m51.680s
          sys     0m0.590s

        where numLoops is 200.

Using cProfile/pstats:
GPU
========

Tue Nov  2 20:31:07 2010    sum-gpu.prof

         800042925 function calls (800042000 primitive calls) in 372.578 CPU seconds

   Ordered by: internal time
   List reduced from 596 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
400000000  163.425    0.000  194.783    0.000 random.py:351(uniform)
        1  142.234  142.234  372.452  372.452 benchSum.py:26(compare)
      402   33.223    0.083   33.223    0.083 {numpy.core.multiarray.array}
400000006   31.358    0.000   31.358    0.000 {method 'random' of '_random.Random' objects}
      400    1.667    0.004    1.667    0.004 gpuarray.py:93(set)
      400    0.274    0.001    0.397    0.001 gpuarray.py:975(sum)
     1400    0.171    0.000    0.189    0.000 gpuarray.py:61(__init__)
        1    0.036    0.036    0.036    0.036 tools.py:170(make_default_context)
        1    0.023    0.023  372.581  372.581 benchSum.py:3(<module>)
     1199    0.020    0.000    0.025    0.000 driver.py:257(function_prepared_async_call)
      400    0.012    0.000    0.100    0.000 reduction.py:203(__call__)
      799    0.011    0.000    0.025    0.000 tools.py:470(context_dependent_memoize)
     1404    0.008    0.000    0.016    0.000 __init__.py:130(memoize)
      400    0.008    0.000    1.794    0.004 gpuarray.py:619(to_gpu)
     1400    0.007    0.000    0.018    0.000 gpuarray.py:40(splay)
        1    0.006    0.006    0.006    0.006 driver.py:1(<module>)
        2    0.006    0.003    0.012    0.006 __init__.py:2(<module>)
     1199    0.005    0.000    0.005    0.000 {pycuda._pvt_struct.pack}
      235    0.005    0.000    0.006    0.000 function_base.py:2851(add_newdoc)
      135    0.004    0.000    0.005    0.000 sre_compile.py:213(_optimize_charset)

NumPy
========

Tue Nov  2 20:38:08 2010    sum-numpy.prof

         800027142 function calls (800026254 primitive calls) in 373.266 CPU seconds

   Ordered by: internal time
   List reduced from 453 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
400000000  166.126    0.000  197.885    0.000 random.py:351(uniform)
        1  140.405  140.405  373.137  373.137 benchSum.py:30(compare)
      402   33.300    0.083   33.300    0.083 {numpy.core.multiarray.array}
400000000   31.759    0.000   31.759    0.000 {method 'random' of '_random.Random' objects}
      400    1.541    0.004    1.541    0.004 {method 'sum' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034 tools.py:170(make_default_context)
        1    0.024    0.024  373.266  373.266 benchSum.py:3(<module>)
        1    0.005    0.005    0.005    0.005 driver.py:1(<module>)
        2    0.005    0.003    0.012    0.006 __init__.py:2(<module>)
      235    0.005    0.000    0.006    0.000 function_base.py:2851(add_newdoc)
     2594    0.003    0.000    0.003    0.000 {isinstance}
      400    0.003    0.000    1.547    0.004 fromnumeric.py:1185(sum)
        6    0.003    0.001    0.004    0.001 collections.py:13(namedtuple)
      127    0.003    0.000    0.004    0.000 sre_compile.py:213(_optimize_charset)
   124/10    0.003    0.000    0.007    0.001 sre_parse.py:385(_parse)
        1    0.002    0.002    0.044    0.044 autoinit.py:1(<module>)
        3    0.002    0.001    0.019    0.006 __init__.py:1(<module>)
   287/10    0.002    0.000    0.008    0.001 sre_compile.py:38(_compile)
        1    0.002    0.002    0.002    0.002 core.py:2230(MaskedArray)
        1    0.002    0.002    0.002    0.002 numeric.py:1(<module>)
"""

def compare(numLoops, testing="gpu", arrLen=1000*1000):

        accum = 0

        for i in xrange(numLoops):
                # randomized to avoid memoization or other caching
                rand  = [random.uniform(0., 1000.) for j in xrange(arrLen)]
                rand2 = [random.uniform(0., 1000.) for j in xrange(arrLen)]
                if testing == "gpu":
                        a = gpua.to_gpu(numpy.array(rand))
                        b = gpua.to_gpu(numpy.array(rand2))
                        c = gpua.sum(a) + gpua.sum(b)
                else:
                        a = numpy.array(rand)
                        b = numpy.array(rand2)
                        c = numpy.sum(a) + numpy.sum(b)
                accum += c

                if i % 10 == 0:
                        print i

        print accum

if __name__ == '__main__':
        import sys
        compare(int(sys.argv[1]), sys.argv[2])     

A few big points here. First, running

time python benchSum.py 200 gpu 

takes longer than

time python benchSum.py 200 numpy 

Not what we expected, huh?

Examining the profiling results in the above docstring tells us that when we go to GPU, NumPy's

400000000   31.759    0.000   31.759    0.000 {method 'random' of '_random.Random' objects}
      400    1.541    0.004    1.541    0.004 {method 'sum' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034 tools.py:170(make_default_context)

balloons into

400000006   31.358    0.000   31.358    0.000 {method 'random' of '_random.Random' objects}
      400    1.667    0.004    1.667    0.004 gpuarray.py:93(set)
      400    0.274    0.001    0.397    0.001 gpuarray.py:975(sum)
     1400    0.171    0.000    0.189    0.000 gpuarray.py:61(__init__)
        1    0.036    0.036    0.036    0.036 tools.py:170(make_default_context)

Keep this in mind.
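
Adding up the time-in-self figures quoted above makes the comparison explicit. The numbers are copied from the two profiles in the benchSum.py docstring; the dictionary names are just labels for this sketch.

```python
# tottime values in seconds, copied from the profiles in benchSum.py's docstring
numpy_tottime = {"ndarray.sum": 1.541}
gpu_tottime = {"gpuarray.set": 1.667,
               "gpuarray.sum": 0.274,
               "gpuarray.__init__": 0.171}

numpy_total = sum(numpy_tottime.values())
gpu_total = sum(gpu_tottime.values())

print("numpy: %.3f s" % numpy_total)
print("gpu:   %.3f s" % gpu_total)
print("gpu / numpy: %.2f" % (gpu_total / numpy_total))
```

The GPU's sum alone is cheaper than NumPy's, but once set and __init__ are counted the GPU side comes out more expensive.
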

The results in the module-level docstring above are why I hit the skids on GPUArray development last time, but I'll expound on them here.

Revisiting benchSum.py

"Trust, but verify" sez Reagan, and that's the way I was thinking when I rediscovered benchSum.py. It made sense to me that GPUArray isn't a silver bullet, but if I can't quantify that hunch then it's not science, but useless superstition. So, I decided to have some fun with benchSum.compare in the form of pstatsInterpretation.py. Let's see how:

#!/usr/bin/env python

import pstats

arraySizes = [10*10,
              100*100,
              500*500,
              1000*1000,
              2000*2000]

loopCounts = [10,
              50,
              100,
              200,
              500]

I decided that, while the default benchmark of a 1000*1000 element array run for 200 iterations through the body of benchSum.compare was a fair model of FiPy usage, more data points are always useful.

So, I tested for all combinations of (arraySizes[i], loopCounts[j]) (how many is that, kids? Remember discrete math?).
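
For the record, the count is just the size of the Cartesian product of the two lists, which itertools can confirm. The variable name combinations is introduced only for this sketch.

```python
import itertools

arraySizes = [10*10, 100*100, 500*500, 1000*1000, 2000*2000]
loopCounts = [10, 50, 100, 200, 500]

# Every (array size, loop count) pair that gets benchmarked.
combinations = list(itertools.product(arraySizes, loopCounts))

print(len(combinations))
```

Each pair is profiled twice, once for GPUArray and once for ndarray, so the number of profile files is double the number of pairs.
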

I generated the pstats profiles with the following function.

def runProfiling():
    from benchSum import compare
    from cProfile import runctx

    for loopNum in loopCounts:
        for aSize in arraySizes:
            print "Doing %d loops on arraySize: %d..." % (loopNum, aSize)

            print "gpu."
            runctx("compare(%d, 'gpu', %d)" % (loopNum, aSize),
                   globals(), locals(),
                   "gpu-sum-%d-%d.pstats" % (loopNum, aSize))

            print "numpy."
            runctx("compare(%d, 'numpy', %d)" % (loopNum, aSize),
                   globals(), locals(),
                   "numpy-sum-%d-%d.pstats" % (loopNum, aSize))
            print

If you've forgotten how benchSum.compare works, go refresh your memory.

From the famous benchSum docstring, it is apparent that there is a correspondence between (numpy.sum), and (gpuarray.sum, gpuarray.set, gpuarray.__init__). This is the case because not only must GPUArray instantiate an ndarray but, in order to perform the arithmetic operation, it must throw the array on the GPU: something that obviously doesn't happen with ndarray. If you don't believe me, go have a look and convince yourself now, because my conclusions all rest on this correspondence.

If I wanted to get nasty, I could include in the correspondence other calls that are made in the use of GPUArray that aren't in that of ndarray, namely gpuarray.to_gpu and gpuarray.splay, but those times are basically insignificant.

Anyway, I needed a way of ripping out individual method timing (remember, that's time spent in self, or total time) from the mass of Pstat profiles I'd accumulated.

def _getTimeForMethods(arraySize, loopCount, methodList, 
                       arrType="gpu", benchmark="sum"):

    from StringIO import StringIO
    import re

    totalTime = 0.

    for method in methodList:
        strStream = StringIO()
        name = "%s-%s-%d-%d.pstats" % (arrType,benchmark,loopCount,arraySize)
        p = pstats.Stats(name, stream = strStream)

        p.print_stats(method)   

        profString = strStream.getvalue()
        m = re.search(r"\d+\s+(\d+\.\d+)", profString)

        totalTime += float(m.group(1))

    return totalTime 

A little hairy, but it gets the job done. The function above, _getTimeForMethods, prints the pstats output to a string stream and then yanks out the total time with a regular expression. It does this for each method in methodList and accumulates a composite time, which is returned.

I put this function to work in the following Behemoth, which interprets and graphs the pstats results.
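
To see what the regular expression actually grabs, here it is applied to one row copied from the benchSum.py docstring. The helper name first_tottime and the choice to return 0.0 when nothing matches are assumptions of this sketch; the function above would raise an AttributeError in that case.

```python
import re

# Header plus one data row, copied from the GPU profile in benchSum.py's docstring.
sample = """   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      400    1.667    0.004    1.667    0.004 gpuarray.py:93(set)
"""

# An integer (ncalls), whitespace, then the first decimal number (tottime).
TOTTIME = re.compile(r"\d+\s+(\d+\.\d+)")


def first_tottime(report_text):
    """Return the tottime of the first data row, or 0.0 if there is none."""
    match = TOTTIME.search(report_text)
    if match is None:
        return 0.0
    return float(match.group(1))


print(first_tottime(sample))
```

The pattern relies on ncalls being the first integer that is followed by a decimal number, which holds for the report layout shown in the docstring.
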

def graph():
    from matplotlib import pylab as plt
    import subprocess as subp

    """
    The methods which, according to the profiling results in the docstring of
    `benchSum.py`, are analogous between numpy.ndarray and gpuarray.

    We don't need to consider anything more than `sum` for numpy because numpy's
    set/init procedure is absorbed in the call to `array`, which both gpuarray
    and numpy do.
    """
    numpyMethods = ["method 'sum'"]
    gpuMethods = ["gpuarray.py\S*(set)",
                  "gpuarray.py\S*(sum)",
                  "gpuarray.py\S*(__init__)"]

    loopsConstantTimes = {"numpy": [],
                          "gpu": []}

    arrSizeConstantTimes = {"numpy": [],
                            "gpu": []}

    justSumArrSizeConstantTimesGPU = []

    for loops in [200]:
        for size in arraySizes:
            loopsConstantTimes["gpu"].append(_getTimeForMethods(size, loops,
                                                                gpuMethods))
            loopsConstantTimes["numpy"].append(_getTimeForMethods(size, 
                                                                  loops, 
                                                                  numpyMethods, 
                                                                  "numpy"))
                      
    for loops in loopCounts:
        for size in [1000*1000]:
            arrSizeConstantTimes["gpu"].append(_getTimeForMethods(size, loops,
                                                                gpuMethods))
            arrSizeConstantTimes["numpy"].append(_getTimeForMethods(size, 
                                                                  loops, 
                                                                  numpyMethods, 
                                                                  "numpy"))
            justSumArrSizeConstantTimesGPU.append(_getTimeForMethods(size, loops,
                                                          ["gpuarray.py\S*(sum)"]))          

    """
    PLOTTING ENSUES...

    If you really wanna see the gory details, go to the file in
    Matforge.
    """

Results

Okay, let's start off nice and light with a simple comparison between gpuarray.sum and ndarray.sum, then we'll dash any hope of dead-simple gpuarray usage.

Hell yeah! Look at that speed-up! Life must be a Singaporean paradise where I get paid to let other people's libraries halve my runtime!

Chya.

Whoops.

There's the whole picture. Now we're factoring in the additional effects of GPUArray usage: communication and initialization.

Here, the number of iterations is fixed at 200. The horizontal axis indicates the size of the array that we're summing. The vertical axis is the time of sum for ndarray and {sum, set, __init__} for GPUArray. Again, remember that these sets of methods are analogous. You can't have GPUArray's hot sum without its homely friends set and __init__.

Here, the array size is fixed at 1000*1000 elements. The number of iterations that the sum-accumulation procedure does is indicated on the horizontal axis. The vertical axis is, again, timing.

It's clear that ndarray wins out, even at a large array size (2000*2000).

Things I could be screwing up

Considering that the bottleneck with the GPUArray is setting the array, it would seem possible that I'm doing it wrong. The GPUArray documentation will verify that isn't the case. You just can't get around that gpuarray.set call; it's inevitable and it's what sinks your battleship.

It's fundamentally intuitive that you can't avoid the cost of communicating with the card. It's a longer trip down the bus and, since there isn't any lazy-eval/currying going on, it's a trip that'll happen on every arithmetic operation. TANSTAAFL.
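
A back-of-the-envelope model shows why the transfer dominates. It assumes, hypothetically, that an array could stay on the card for k consecutive operations, so that set and __init__ are paid once while each operation costs the measured sum time. The figures are the time-in-self values from the benchSum.py docstring; break_even_ops is made up for this sketch, and the linear cost model is an assumption, not a measurement.

```python
import math


def break_even_ops(transfer, gpu_op, cpu_op):
    """Smallest whole k with transfer + k * gpu_op < k * cpu_op, or None."""
    if gpu_op >= cpu_op:
        # The GPU never catches up if its per-operation cost is not lower.
        return None
    return int(math.floor(transfer / (cpu_op - gpu_op))) + 1


# Time in self (seconds), from the benchSum.py docstring.
transfer = 1.667 + 0.171   # gpuarray.set + gpuarray.__init__
gpu_sum = 0.274            # gpuarray.sum
numpy_sum = 1.541          # ndarray.sum

print(break_even_ops(transfer, gpu_sum, numpy_sum))
```

With one operation per transfer, which is what benchSum.py does, the GPU side costs more. Under this model the balance only tips once more than one operation shares a single transfer.
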

To be sure, let's take a look at pycuda.gpuarray.set, just to see if it's plausible that the contents are slowing us up.

def set(self, ary):
    assert ary.size == self.size
    assert ary.dtype == self.dtype
    if self.size:
        drv.memcpy_htod(self.gpudata, ary)

Bingo! The key is in drv.memcpy_htod(...). That's where the actual transfer of data takes place; it makes sense that time is lost there.

Possible recourse?

Right below pycuda.gpuarray.set is

def set_async(self, ary, stream=None):
    assert ary.size == self.size
    assert ary.dtype == self.dtype
    if self.size:
        drv.memcpy_htod_async(self.gpudata, ary, stream)

Might this be faster? Dunno. If it were, wouldn't Klöckner be quick to tell us about it in the documentation?

Summary

  • FiPy is bottlenecked largely by arithmetic operations.
  • GPUArray will not help us alleviate these bottlenecks by using it as a one-off Rainman every time we want to do a single arithmetic operation.
    • This is because communication is more expensive than the single arithmetic operation, even if the op is significantly faster on the GPU and even on arrays up to 2000*2000 in length.
  • The only appropriate use for PyCUDA is writing a few kernels for arithmetic-critical junctures where we can amortize the cost of communication over the speedup of multi-step arithmetic on the GPU.
  • Posted: 2011-03-01 16:50 (Updated: 2011-03-02 22:50)
  • Author: obeirne

Buildbot setup

First, obtain Buildbot on both master and slave.

master$ easy_install buildbot
slave$ easy_install buildbot-slave

Now, create the master.

master$ buildbot create-master master_dir

This will make a directory called master_dir and fill it with all sorts of goodies. In order to be operational, the master_dir requires a master.cfg file to be present within it. Luckily, Buildbot supplies a master.cfg.sample which can easily be tailored to fit our needs.

master$ cp master.cfg.sample master.cfg
master$ $EDITOR master.cfg

Now let's do the aforementioned tailoring. I'll step through configuration piece-by-piece.

c = BuildmasterConfig = {}

All that is required of each master.cfg is that it builds a dictionary called BuildmasterConfig (which we'll call c for brevity), which contains various configuration information for Buildbot.

branches = ['trunk', 
            'branches/bad_branch',
           ]

Because of Buildbot's architecture, we must specify (in some way) the branches that Buildbot should concern itself with; I define factories later on which generate the necessary Buildbot machinery for each branch specified here.

Instead of explicitly specifying the branches we want to monitor (which is usually all of them), it may be possible to poll SVN periodically and reconfigure Buildbot when a new branch is detected, but I haven't looked into that yet.

Specifying BuildSlaves

Now, we specify Buildslaves for c.

from buildbot.buildslave import BuildSlave
c['slaves'] = [BuildSlave("danke", "bitte"),
               BuildSlave("slug", "gross")]

This addition to c implies that we will have two BuildSlaves communicating with the master: one called danke (with a slave-password of bitte), and one called slug (with a slave-password of gross). We will configure the slaves (with their passwords) after we have configured the master.

It is important that these slave passwords remain relatively private, or else there are some security vulnerabilities that entail possible execution of arbitrary code on the master's host (alluded to here).

Then, pick a port over which the master and slave will communicate. This port must be publicly accessible on their respective networks. I chose 9989.

c['slavePortnum'] = 9989

Remember to make this port publicly accessible. This may necessitate forwarding ports.

Detecting source changes

Now we configure how the Buildmaster finds out about changes in the repository. There are a variety of ways the master can be configured to keep itself informed of changes, including polling the repository periodically, but I chose to manually inform the buildmaster of changes via an addition to SVN's post-commit hook which calls a buildbot/contrib script available here.

On the master.cfg side, we add the following:

from buildbot.changes.pb import PBChangeSource
c['change_source'] = PBChangeSource()

From here, I assume that the SVN repository is hosted on the same machine as the master. Digressing a moment from master.cfg, I navigate to the SVN repository's directory.

master$ cd /path/to/svn-repo
master$ $EDITOR hooks/post-commit

Once the editor is open, add the following to notify the buildbot master of changes upon commit.

# set up PYTHONPATH to contain Twisted/buildbot perhaps, if not already
# installed site-wide

/path/to/svn_buildbot.py --repository "$REPOS" --revision "$REV" \
--bbserver localhost --bbport 9989

If the master is hosted on a different machine than the repository, the bbserver flag above can be modified accordingly.

One caveat here is that we must modify the contrib/svn_buildbot.py script to propagate branch information on to Buildbot. Open svn_buildbot.py

master$ $EDITOR /path/to/svn_buildbot.py

and enact the following change.

# this should be around line 143
# split_file = split_file_dummy

split_file = split_file_branches

Now the SVN repo is set up to play nicely with the Buildbot installation and (hopefully) the master will be informed of each subsequent commit.
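
The idea behind a branch-aware split function is to turn each changed path into a (branch, file) pair so that the master can tell branches apart. The following is only an illustration of that idea for the trunk and branches/NAME layout used here; split_branch is a hypothetical helper, not the code in svn_buildbot.py.

```python
def split_branch(path):
    """Return (branch, path_within_branch), or None for unrecognized paths."""
    pieces = path.split('/')
    if pieces[0] == 'trunk' and len(pieces) > 1:
        # trunk/<file...>
        return 'trunk', '/'.join(pieces[1:])
    if pieces[0] == 'branches' and len(pieces) > 2:
        # branches/<name>/<file...>; the branch keeps its 'branches/' prefix
        return '/'.join(pieces[:2]), '/'.join(pieces[2:])
    return None


print(split_branch('trunk/setup.py'))
print(split_branch('branches/bad_branch/setup.py'))
```

The branch names this produces match the entries of the branches list defined near the top of master.cfg.
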

Schedulers

Back to master.cfg.

Next, we want to configure Schedulers. Whenever the master is informed of a change (via whatever object is held in c['change_source']) all Schedulers attached to the list held by c['schedulers'] will be informed of the change and, depending on the particular Scheduler in question, may or may not react by triggering a build or multiple builds.

Here, I'll configure a Scheduler for each branch to react to any change in that branch by running two builds.

from buildbot.scheduler import Scheduler
from buildbot.schedulers.filter import ChangeFilter

def buildSchedulerForBranch(branch):
    cf = ChangeFilter(branch=branch)
    return Scheduler(name="%s-scheduler" % branch,
                     change_filter=cf,
                     treeStableTimer=30,
                     builderNames=["full-build-%s" % branch,
                                   "smaller-build-%s" % branch])

c['schedulers'] = []

for branch in branches:
    c['schedulers'].append(buildSchedulerForBranch(branch))

The ChangeFilter is used to discriminate branches; note that branch isn't the only criterion a ChangeFilter can consider. The parameter treeStableTimer is the number of seconds that Buildbot waits after a change before starting the build; if another change arrives in the meantime, the timer restarts.

Note that this Scheduler references "full-build-[branch]" and "smaller-build-[branch]", which are Builders we'll define right now.

Builders

Here, we specify the build procedures. First, we will define a number of BuildFactory objects, which detail generic steps that a given Build will take, and then we create the Builders themselves and attach them to the BuildmasterConfig dictionary. There is some machinery defined up front for the sake of generating separate builds for each branch with as little duplication as possible, so bear with me.

def addCheckoutStep(factory, defaultBranch='trunk'):
    """Add an SVN checkout step; `baseURL` must end with a forward slash."""
    baseURL = 'svn://slug-jamesob.no-ip.org/home/job/tmp/fake_fipy_repo/'
    factory.addStep(SVN(mode='update',
                        baseURL=baseURL,
                        # used when a build does not specify a branch
                        defaultBranch=defaultBranch))

def testAllForSolver(buildFact, solverType, parallel=False):
    """
    Add doctest steps to a build factory for a given solver type.
    """

    descStr = "for %s" % solverType
    solverArg = "--%s" % solverType
    
    buildFact.addStep(Doctest(description="testing modules " + descStr,
                              extraArgs=[solverArg, "--modules"]))

    buildFact.addStep(Doctest(description="testing examples " + descStr,
                              extraArgs=[solverArg, "--examples"]))

Here, addCheckoutStep takes a BuildFactory and adds a step which checks out whichever branch has been modified, as specified by the Scheduler reporting the changes. It does so by appending the name of the branch, as reported by the provoking Scheduler, to the baseURL set inside the function.
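
Since the branch name is appended to baseURL as-is, a missing trailing slash would produce a wrong URL. Here is a tiny sketch of the concatenation, with checkout_url as a hypothetical helper that is not part of Buildbot:

```python
def checkout_url(base_url, branch):
    """Join a base URL and a branch name, adding the slash if it is missing."""
    if not base_url.endswith('/'):
        base_url += '/'
    return base_url + branch


base = 'svn://slug-jamesob.no-ip.org/home/job/tmp/fake_fipy_repo/'
print(checkout_url(base, 'trunk'))
print(checkout_url(base, 'branches/bad_branch'))
```
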

The function testAllForSolver takes a BuildFactory and adds steps which test both modules and examples for a given solver. I should mention at this point that here I use Doctest, a class I had to add to Buildbot.

Getting doctests to work

Buildbot integrates nicely with Trial, the unit-test framework that comes bundled with Twisted. Trial, unfortunately, doesn't seem to pick up on doctests, though there are scattered allusions speaking to the contrary (here and here).

I tried a few approaches to getting Trial to process doctests (adding a __doctest__ module-level attribute, for one) to no avail. Even if there is some way to piggyback doctests onto Trial, it may involve a large modification of the FiPy source, which, if purely for the sake of making Buildbot happy, I don't think is worth it.

In light of this, I added a class, Doctest, that extends Buildbot's Trial step to handle doctests instead.

Within master.cfg, add

from buildbot.steps.python_twisted import Trial, TrialTestCaseCounter
from buildbot.steps.shell import ShellCommand
from twisted.python import log

class Doctest(Trial):
    """
    Add support for Python's doctests.
    """

    def __init__(self, description=None,
                       extraArgs=None,
                       **kwargs):

        ShellCommand.__init__(self, **kwargs)

        self.addFactoryArguments(description=description,
                                 extraArgs=extraArgs)

        self.testpath = "."
        self.command = ["python", "setup.py", "test"]
        self.extraArgs = extraArgs

        self.logfiles = {}

        if description is not None:
            self.description = [description]
            self.descriptionDone = [description + " done"]
        else:
            self.description = ["testing"]
            self.descriptionDone = ["tests"]
             
        # this counter will feed Progress along the 'test cases' metric
        self.addLogObserver('stdio', TrialTestCaseCounter())
             
    def start(self):
        if self.extraArgs is not None:
            if type(self.extraArgs) is list:
                self.command += self.extraArgs
            elif type(self.extraArgs) is str:
                self.command.append(self.extraArgs)
                
        self._needToPullTestDotLog = False
        log.msg("Doctest.start: command is '%s' with description '%s'." \
                    % (self.command, self.description))

        ShellCommand.start(self)
                 
    def _gotTestDotLog(self, cmd):
        Trial._gotTestDotLog(self, cmd)

        # swap the first word of the status text (Trial's "tests")
        # for our own description
        self.text[0] = self.description[0] + " " \
                        + " ".join(self.text[0].split(" ")[1:])

This allows us to run FiPy's test suite completely unmodified and have Buildbot interpret the results just as it would for Twisted's Trial.
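The extraArgs handling in start() is easy to try out away from Buildbot. Here is a minimal sketch of the same logic as a plain function; build_command and BASE_COMMAND are hypothetical names that appear nowhere in Buildbot or in the config above, and unlike start() the sketch copies the base command instead of extending it in place.

```python
BASE_COMMAND = ["python", "setup.py", "test"]


def build_command(extra_args=None, base=BASE_COMMAND):
    """Return the test command with extra_args appended (hypothetical helper).

    Mirrors Doctest.start(): a single string is appended as one argument,
    a list is extended onto the command, and None leaves the command alone.
    """
    command = list(base)  # copy, so repeated calls never accumulate arguments
    if extra_args is None:
        return command
    if isinstance(extra_args, str):
        command.append(extra_args)
    else:
        command.extend(extra_args)
    return command


print(build_command("--modules"))
```

Accepting either a string or a list keeps the factory definitions short, since most steps only pass a single flag.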

Back to Builders

We never completed setting the Builders up, so let's do that now.

Let's first define a few Factory objects.

from buildbot.process import factory
from buildbot.steps.source import SVN

# Run a mostly-complete test suite.
fullFactory = factory.BuildFactory()

addCheckoutStep(fullFactory)
testAllForSolver(fullFactory, "pysparse")
testAllForSolver(fullFactory, "trilinos")

# Run a less-intensive test suite.
smallerFactory = factory.BuildFactory()

addCheckoutStep(smallerFactory)
smallerFactory.addStep(Doctest(extraArgs="--modules", 
                               description="testing modules"))

Now we will define and attach the Builders themselves.

c['builders'] = []

def makeBuildersForBranch(branch):
    builders = []

    builders.append({"name": "full-build-%s" % branch,
                     "slavename": "slug",
                     "builddir": "%s-slug" % branch,
                     "factory": fullFactory})

    builders.append({"name": "smaller-build-%s" % branch,
                     "slavename": "danke",
                     "builddir": "%s-danke" % branch,
                     "factory": smallerFactory})

    return builders

for branch in branches:
    for builder in makeBuildersForBranch(branch):
        c['builders'].append(builder)
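Since every branch gets its own pair of builders, it's worth checking that no two of them end up sharing a name or a build directory. The following is a rough sketch of such a check, runnable without Buildbot; the branch names, the placeholder factory strings, and the find_duplicates helper are all assumptions made for illustration.

```python
from collections import Counter


def make_builders_for_branch(branch, full_factory="full", smaller_factory="smaller"):
    """Same shape as makeBuildersForBranch above, with placeholder factories."""
    return [
        {"name": "full-build-%s" % branch,
         "slavename": "slug",
         "builddir": "%s-slug" % branch,
         "factory": full_factory},
        {"name": "smaller-build-%s" % branch,
         "slavename": "danke",
         "builddir": "%s-danke" % branch,
         "factory": smaller_factory},
    ]


def find_duplicates(builders, key):
    """Return the sorted values of `key` that occur in more than one builder."""
    counts = Counter(builder[key] for builder in builders)
    return sorted(value for value, n in counts.items() if n > 1)


branches = ["trunk", "version-2_1"]  # hypothetical branch names
builders = [b for branch in branches for b in make_builders_for_branch(branch)]

for key in ("name", "builddir"):
    print(key, "duplicates:", find_duplicates(builders, key))
```

Both lists should come back empty as long as the branch names themselves are distinct.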

The rest

Everything left in master.cfg concerns itself with how build information is reported back to us. Though there are very many possibilities open to us (e-mail reporting to blamed developers upon broken builds, IRC bots, Skynet, etc.), I haven't explored any of them other than the default web interface that Buildbot provides. With that in mind, here's the rest of master.cfg, basically unadulterated from the vanilla sample config.

####### STATUS TARGETS

# 'status' is a list of Status Targets. The results of each build will be
# pushed to these targets. buildbot/status/*.py has a variety to choose from,
# including web pages, email senders, and IRC bots.

c['status'] = []   


from buildbot.status import html
c['status'].append(html.WebStatus(http_port=8010))  

#
# from buildbot.status import mail
# c['status'].append(mail.MailNotifier(fromaddr="buildbot@localhost",
#                                      extraRecipients=["builds@example.com"],
#                                      sendToInterestedUsers=False))
#
# from buildbot.status import words
# c['status'].append(words.IRC(host="irc.example.com", nick="bb",
#                              channels=["#example"]))
#
# from buildbot.status import client
# c['status'].append(client.PBListener(9988)) 

c['projectName'] = "FiPy"
c['projectURL'] = "http://www.ctcms.nist.gov/fipy/"
c['buildbotURL'] = "http://localhost:8010/"      

Configuring slaves

Configuring slaves is very simple.

danke$ buildslave create-slave slavedir slug-jamesob.no-ip.org:9989 danke bitte
danke$ $EDITOR slavedir/info/admin
danke$ $EDITOR slavedir/info/host

slug$ buildslave create-slave slavedir slug-jamesob.no-ip.org:9989 slug gross 
slug$ $EDITOR slavedir/info/admin
slug$ $EDITOR slavedir/info/host

Note that slug is both the master and a slave.

The form of the create-slave command is buildslave create-slave [directory for builds] [master's host address]:[port for communication] [slave-name] [slave-password].

So long as port 9989 on the master is reachable from the slaves, we're done.
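Whether the port is actually reachable can be checked from a slave before starting it. Below is a minimal sketch using only the standard library; port_is_open is a hypothetical helper, and it only shows that something accepts TCP connections on that port, not that a Buildbot master is the one listening.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False
```

Run on danke, port_is_open("slug-jamesob.no-ip.org", 9989) should come back True once the master is up. As far as I understand it, the slaves are the ones initiating the connection, so the check that matters is the one made from each slave against the master.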

In summary

Buildbot is going to be an asset to us, though I still haven't contacted Doug as to how we're going to get it up and running.

In case I've screwed anything up in translating my config file to this blog post, I've uploaded the actual file to sandbox and it is available here.

  • Posted: 2011-02-24 17:37 (Updated: 2011-03-08 12:25)
  • Author: obeirne
  • Categories: (none)
  • Comments (0)

Efficiency changes due to properties

I check out trunk at revision [4101], before the property changes were introduced, as old_trunk and run setup.py test.

(misc)old_trunk/ time python setup.py test 2> /dev/null 
running test
...

real	2m57.447s
user	2m55.135s
sys	0m1.952s

I then run setup.py test on the latest revision of trunk@[4160], which includes all property implementations aside from those in fipy.terms.

(misc)trunk/ time python setup.py test 2> /dev/null 
running test
...

real	3m0.290s
user	2m57.987s
sys	0m2.000s

There's only a three-second difference between the two (roughly 1.6%); clearly not a big deal.

How about a non-trivial benchmark like phase anisotropy? For this, I copy-script'd the anisotropy example from examples.phase, then stripped out the viewing code and reduced the number of steps to one hundred.

I then ran it on trunk@[4101], without the properties:

(misc)old_trunk/ time python phaseAnis.py 
...

real	3m50.701s
user	2m50.679s
sys	0m59.748s

And then on the current trunk@[4160], with properties:

(misc)trunk/ time python phaseAnis.py 
...

real	3m49.818s
user	2m50.455s
sys	0m59.104s

Again, we see a negligible difference (3:50 vs. 3:49) between the two runtimes. These two trials have convinced me that the property refactoring hasn't added any considerable computational burden.
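To put numbers on "negligible", the wall-clock times above can be turned into relative changes. A quick sketch follows; to_seconds and relative_change are names made up for this illustration, and the inputs are the `real` figures reported by time in the four runs above.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '2m57.447s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError("unrecognized time stamp: %r" % stamp)
    return 60 * int(match.group(1)) + float(match.group(2))


def relative_change(before, after):
    """Percent change in wall-clock time going from `before` to `after`."""
    old, new = to_seconds(before), to_seconds(after)
    return 100.0 * (new - old) / old


print("test suite: %+.2f%%" % relative_change("2m57.447s", "3m0.290s"))
print("anisotropy: %+.2f%%" % relative_change("3m50.701s", "3m49.818s"))
```

That works out to about +1.6% for the test suite and about -0.4% for the anisotropy run.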

GPUArray considered harmful

Back in September, everyone decided that introducing FiPy to CUDA in some fashion was a good idea. The approach we settled on was middleware that would slide between NumPy and our own numerix.py. This proposed middleware would shift some (or all) arrays from system memory onto GPU memory. That way, with the arrays on the GPU, we could perform all computations there, reaping the gleaming wonders of the all-powerful graphics processing unit. Silver bullet, right?

No, not exactly.

I'm no longer convinced that this approach will gain us anything; that is to say, I am no longer convinced that indiscriminately chucking NumPy arrays at the GPU is going to yield any performance gain for FiPy. I came to this conclusion after profiling a few varied FiPy runs in their entirety and trying, in vain, to implement the middleware described above.

Profiling

I should have done this long ago, before we even talked about possible interactions between FiPy and CUDA, just to see where performance gains are actually going to benefit us. Finally, after a few weeks of struggling with the proposed GPUArray module, I sat down and profiled a few FiPy runs to see where the cycles are being spent. After all, if we can't find a significant amount of time that's being spent doing arithmetic operations on NumPy arrays, there is no point in putting arrays on the GPU before the solver.

3D heat diffusion (100^3)

I profiled the problem found here with the timing instrumentation removed. The results of the profiling were run through gprof2dot to produce this figure. The figure is a tree in which each node is a function. The percentage below a node is the share of runtime spent in that function and everything it calls, the parenthesized percentage is the time spent in the function itself, and the children of a node are the functions called from within it.

In this case, the only nodes displayed are for functions which consume 5% of runtime or more. These are the only components of a run that we should consider optimizing. Notice that the only array-specific functions found here are numeric:201:asarray and ~:0:<numpy.core.multiarray.array>. Both of these functions would take much longer when using a GPU (based on a strong hunch). Note also that we can't see any arithmetic operations in this graph. For this particular problem, it is clear that arithmetic operations on arrays are not the bottleneck, and thus using a GPU pre-solution makes absolutely no sense.

To get a better idea of which array functions were consuming time, I decided to generate another figure where each node consumes a minimum of 1% total runtime. That figure shows array allocation dominating array-specific operations, with ~:0:<numpy.core.multiarray.array> responsible for 18.6% of runtime.

No arithmetic operations are visible until we reduce the minimum percentage of runtime consumed by any node to 0.5%. That figure finally shows numerix:624:sin weighing in at 0.73%. Another array operation which appears in this graph is core:3914:reshape, the reshaping of an array, which comes in around 0.85%. Think of how this operation's time consumption would balloon if we were on a GPU.
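The filtering behind these three figures can be approximated straight from a pstats dump, without drawing anything. What follows is a minimal sketch, not a description of what gprof2dot does internally: hot_functions and workload are hypothetical names, workload is a stand-in for a real FiPy run, and percentages are taken relative to the profile's total time.

```python
import cProfile
import pstats


def workload():
    """Hypothetical stand-in for a FiPy run: just burns some cycles."""
    return sum(i * i for i in range(200000))


def hot_functions(stats, threshold_pct):
    """List (label, total %, self %) for functions at or above threshold_pct.

    'total' includes time spent in callees; 'self' is the function's own time.
    """
    total = stats.total_tt  # summed own-time of every profiled function
    if total <= 0:
        return []
    rows = []
    for (filename, lineno, name), (cc, nc, tt, ct, callers) in stats.stats.items():
        total_pct = 100.0 * ct / total
        if total_pct >= threshold_pct:
            # Built-ins are recorded with filename '~' and line 0.
            label = "%s:%d:%s" % (filename, lineno, name)
            rows.append((label, total_pct, 100.0 * tt / total))
    return sorted(rows, key=lambda row: row[1], reverse=True)


profiler = cProfile.Profile()
profiler.runcall(workload)
stats = pstats.Stats(profiler)

for label, total_pct, self_pct in hot_functions(stats, 5.0):
    print("%6.2f%% (%5.2f%%)  %s" % (total_pct, self_pct, label))
```

Dropping the threshold from 5.0 to 1.0 or 0.5 corresponds to the progressively busier figures; gprof2dot exposes a comparable node threshold through its -n option. Note that the labels here carry full file paths where the figures show module names.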

The conclusion is clear: we should not be using a GPU for pre-solution array corralling on this run or anything like it.

Wheeler's Reactive Wetting

After profiling these toy runs, I told Wheeler about the results. He and I agreed that before making any final conclusions, we should profile a more realistic problem. We chose his reactive wetting problem. He profiled, sent me the pstats file, and I generated this figure (http://matforge.org/fipy/export/3926/sandbox/gpuarray/wettingProf.simple.pdf). The minimum runtime consumption here is 5%.

Note that most array-relevant operations (aside from ~:0:<numpy.core.multiarray.array>) are found under inline:14:_optionalInline. Here we see that numerix:941:take is a good subtree to examine, as it is responsible for ~14% of total runtime.

The only arithmetic operation in this mess is core:3164:__mul__, boasting 5.50%. "Well, finally," you say to yourself, "there's something that will benefit from residence on the GPU." Not so fast. Only 0.01% of that 5.50% is spent in __mul__ itself. The overwhelming majority of time is spent in its child, ~:0:<numpy.core.multiarray.where>, which is presumably fiddling around in memory: yet another operation that will balloon with GPU usage.

Let's work on PyKrylov

After these damning results, it's evident to me that using Klockner's PyCuda within FiPy is a really bad idea. When using PyCuda's drop-in replacement (ahem) for numpy.ndarray, any arithmetic operation results in an intermediate copy of the gpuarray since there is no clever mechanism for lazily evaluating curried-kernel strings, or whatever.
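The cost of operator-by-operator evaluation is easy to see with a toy model. The sketch below is pure Python and has nothing to do with PyCuda's actual implementation: CountingArray and fused_multiply_add are hypothetical names, and the class does nothing but count how many result buffers an expression creates.

```python
class CountingArray:
    """Toy array whose every arithmetic operation allocates a new result."""

    allocations = 0  # class-wide count of result buffers created so far

    def __init__(self, values):
        self.values = list(values)

    def _binary(self, other, op):
        CountingArray.allocations += 1
        return CountingArray(op(x, y) for x, y in zip(self.values, other.values))

    def __add__(self, other):
        return self._binary(other, lambda x, y: x + y)

    def __mul__(self, other):
        return self._binary(other, lambda x, y: x * y)


def fused_multiply_add(a, b, c, d):
    """Compute a*b + c*d in a single pass, allocating only the final result."""
    CountingArray.allocations += 1
    return CountingArray(w * x + y * z
                         for w, x, y, z in zip(a.values, b.values, c.values, d.values))


a, b = CountingArray([1, 2]), CountingArray([3, 4])
c, d = CountingArray([5, 6]), CountingArray([7, 8])

CountingArray.allocations = 0
naive = a * b + c * d
print("operator by operator:", CountingArray.allocations, "allocations")

CountingArray.allocations = 0
fused = fused_multiply_add(a, b, c, d)
print("fused:", CountingArray.allocations, "allocation")
```

The operator version builds two throwaway intermediates before the final sum, while the fused function, which plays the role of a handwritten kernel here, builds only the result. On a device with scarce memory, those intermediates are exactly the transient copies we'd like to avoid.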

Furthermore, I think that any solution proposing to be as convenient as Klockner's gpuarray is snake oil unless some very clever mechanisms for incremental kernel accumulation and/or lazy evaluation enter the picture. The only way we can juice the GPU is by writing a few custom kernels at performance-critical junctures: that way, we're not tap-dancing all over the GPU's limited and relatively slow memory by creating transient copies at every step of an arbitrary computation.

For these reasons, I think we (or I) should work on writing a few CUDA kernels for use by the non-Charlatan regions of PyCuda within PyKrylov. This implies work on the PyKrylov branch, which I am more than happy to undertake.