More mesh refactoring!?
Don't worry: it's almost over.
Previous mesh hierarchy
Meshes
Topologies
Geometries
I'm stripping out geometry and topology
Why?
- Their inheritance trees seem to inevitably mimic those of the meshes.
- We don't get any code reuse out of their existence, despite what I thought at their creation.
- A huge amount of boilerplate code is devoted to passing arguments to, instantiating, and creating interfaces to geometry/topology objects that is otherwise unnecessary.
- While geometry/topology objects do provide an added amount of orthogonality to the meshes, this benefit is negligible since it is highly likely that no other class hierarchy aside from meshes will use geometry/topology objects.
I was convinced of the harmful effects of geom/top classes when I went to expand the topology classes. In adding a few extra methods to topology, I found that I would have to nearly fully replicate the class tree of the meshes (instead of having just five topology classes) and the argument list for topology classes would not only vary from class to class, but would become unreasonably long.
Since the (lengthy) argument list for all geometry objects already varied from class to class, and thus was slightly ridiculous, I decided to think about reuniting meshes and geometry/topology. Where there is no uniform interface, no good code-reuse, and heavy coupling, merging classes is likely a good course of action.
Hard lesson learned: never refactor unless it is going to prevent duplication.
The new arrangement
I spent today merging geometries and topologies back into meshes. The new mesh hierarchy is displayed in the following UML diagram.
Meshes
Grid Builders
New features
- Uniform and irregular meshes are now siblings. This results in a less confusing inheritance scheme. Gridlike?D objects are aggregated under (Uniform)Grids to avoid duplication without the use of multiple inheritance. Gridlikes are never instantiated, but their contents are almost entirely static.
- GridBuilders are being used for the construction of grids. They warrant their own class hierarchy because the process for grid construction is largely dimensionally-independent. After the class-travaganza of geometry and topology, I can understand if this addition is met with suspicion. However, I think the reasons above are good ones.
- Interface for all meshes is defined in AbstractMesh, which should provide some clarification.
- Reduced NLOC after deleting the cruft classes.
Looking at Cython
Cython is widely accessible
- Cython is bundled with Enthought Python Distribution, SAGE, and PythonXY
- Cython only requires Python and GCC, so it's installable basically anywhere
- Cython is easy_installable.
- mpi4py is already written in Cython.
In short: I'll bet dollars to donuts that anywhere FiPy is installable, Cython is too.
Building Cython code
- Write up a module foo.pyx in Cython.
- Write a setup.py which lists foo.pyx as an extension and specifies a cmdclass entry to compile it. Let's call that entry "build_ext"
- Run
python setup.py build_ext --inplace
This compiles the foo module from Cython down to C. - Use foo as you would a Python module, despite the fact that it's (supposedly well-optimized) C code.
More details are available here.
Cython integrates with NumPy
See here. Further details are available in Seljebotn's paper.
An interesting excerpt from the paper:
For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.
Most arithmetic algorithms in FiPy are specified in terms of NumPy operations, so Cython's use to us may be questionable. In a direct translation from a NumPy-operation-based routine to typed Cython code (using cdef np.ndarray), I saw no speedup.
Return of the GPU
Spoiler alert: GPUArray is still snake oil
We've spoken recently of whether or not our decision to ditch GPUArray forever was ill-conceived. Certainly, my previous blog post on the topic was, at best, incomplete in that I never gave a conclusive explanation of what time is being spent where. This was a result of me using fancy dot-graph visualizations instead of sticking to that natural adversary of bullshit: raw text.
In the previous post, I never examined what Guyer refers to as terminal procedures. More importantly, though, I never presented a list of procedures sorted by time spent in each procedure itself (excluding calls made to other procedures within its body). In other words, I never presented a list of procedures sorted by what Python's Pstats refers to as total time.
In this post, I will explore whether or not arithmetic operations ex solver are taking a considerable amount of time. When I conclude that they are, I will explore whether or not the use of a temporary GPUArray for some arithmetic operations may alleviate this bottleneck. I will conclude that it won't, since GPUArray initialization and communication are stalls which cause a net gain in runtime, even considering the much better performance of arithmetic operations on the GPU.
Profiling FiPy
In order to determine if, in fact, we are spending an amount of time on ex solver arithmetic operations that would merit speeding them up somehow, I profiled a phase anisotropy run, available here.
I ran the simulation for 10 steps with dx and dy both at 1000. I then obtained profiling output with this script
import pstats def printProfileInfo(files): for filename in files: p = pstats.Stats(filename) # Sort by time in self p.sort_stats('time').print_stats(20) if __name__ == '__main__': import sys printProfileInfo(sys.argv[1:])
as generated by this pstats file. The output is here.
Let's walk through line-by-line.
- pysparse.itsolvers.pcg: solver; of no concern right now.
- {method 'sum' ...}: helllooooo arithmetic.
- variable.py:1088(<lambda>): straight from The Good Book, Variable.py:
return self._BinaryOperatorVariable(lambda a,b: a*b, other)
Obviously some steamy arithmetic.
- {method 'take' ...}: I'd guess the reason this guy is so costly is memory access. Just a guess, though.
- meshVariable.py:230(_dot): helllooooo arithmetic.
- harmonicCellToFaceVariable.py:42(_calcValuePy): a whole mess of arithmetic.
def _calcValuePy(self, alpha, id1, id2): cell1 = numerix.take(self.var,id1, axis=-1) cell2 = numerix.take(self.var,id2, axis=-1) value = ((cell2 - cell1) * alpha + cell1) eps = 1e-20 value = (value == 0.) * eps + (value != 0.) * value cell1Xcell2 = cell1 * cell2 value = ((value > eps) | (value < -eps)) * cell1Xcell2 / value value = (cell1Xcell2 >= 0.) * value return value
Okay, so the point is clear: arithmetic out the wazoo. That numerical analysis software is stunted by calculation shouldn't come as a surprise.
Can the GPU help us?
For the sake of simplicity and to serve as a proof of concept, from here on out I'll deal exclusively with sum. If we can make sum into less of a problem with the help of Dr. Nvidia, then certainly we can rid other arithmetic aches similarly.
As I was shuffling around $FIPYROOT/sandbox to prepare some GPU-sum benchmarks, I was pleasantly surprised to find that I'd already been through this process, I just never had the good sense to document it.
Allow me to introduce you to benchSum.py, available here.
#!/usr/bin/env python import numpy import pycuda.autoinit import pycuda.gpuarray as gpua import random """ Using `time`: GPU ======== real 3m55.157s user 3m52.720s sys 0m1.530s where numLoops is 200. NumPy ======== real 3m52.877s user 3m51.680s sys 0m0.590s where numLoops is 200. Using cProfile/pstats: GPU ======== Tue Nov 2 20:31:07 2010 sum-gpu.prof 800042925 function calls (800042000 primitive calls) in 372.578 CPU seconds Ordered by: internal time List reduced from 596 to 20 due to restriction <20> ncalls tottime percall cumtime percall filename:lineno(function) 400000000 163.425 0.000 194.783 0.000 random.py:351(uniform) 1 142.234 142.234 372.452 372.452 benchSum.py:26(compare) 402 33.223 0.083 33.223 0.083 {numpy.core.multiarray.array} 400000006 31.358 0.000 31.358 0.000 {method 'random' of '_random.Random' objects} 400 1.667 0.004 1.667 0.004 gpuarray.py:93(set) 400 0.274 0.001 0.397 0.001 gpuarray.py:975(sum) 1400 0.171 0.000 0.189 0.000 gpuarray.py:61(__init__) 1 0.036 0.036 0.036 0.036 tools.py:170(make_default_context) 1 0.023 0.023 372.581 372.581 benchSum.py:3(<module>) 1199 0.020 0.000 0.025 0.000 driver.py:257(function_prepared_async_call) 400 0.012 0.000 0.100 0.000 reduction.py:203(__call__) 799 0.011 0.000 0.025 0.000 tools.py:470(context_dependent_memoize) 1404 0.008 0.000 0.016 0.000 __init__.py:130(memoize) 400 0.008 0.000 1.794 0.004 gpuarray.py:619(to_gpu) 1400 0.007 0.000 0.018 0.000 gpuarray.py:40(splay) 1 0.006 0.006 0.006 0.006 driver.py:1(<module>) 2 0.006 0.003 0.012 0.006 __init__.py:2(<module>) 1199 0.005 0.000 0.005 0.000 {pycuda._pvt_struct.pack} 235 0.005 0.000 0.006 0.000 function_base.py:2851(add_newdoc) 135 0.004 0.000 0.005 0.000 sre_compile.py:213(_optimize_charset) NumPy ======== Tue Nov 2 20:38:08 2010 sum-numpy.prof 800027142 function calls (800026254 primitive calls) in 373.266 CPU seconds Ordered by: internal time List reduced from 453 to 20 due to restriction <20> ncalls tottime percall cumtime percall filename:lineno(function) 400000000 166.126 0.000 197.885 0.000 random.py:351(uniform) 1 140.405 140.405 373.137 373.137 benchSum.py:30(compare) 402 33.300 0.083 33.300 0.083 {numpy.core.multiarray.array} 400000000 31.759 0.000 31.759 0.000 {method 'random' of '_random.Random' objects} 400 1.541 0.004 1.541 0.004 {method 'sum' of 'numpy.ndarray' objects} 1 0.034 0.034 0.034 0.034 tools.py:170(make_default_context) 1 0.024 0.024 373.266 373.266 benchSum.py:3(<module>) 1 0.005 0.005 0.005 0.005 driver.py:1(<module>) 2 0.005 0.003 0.012 0.006 __init__.py:2(<module>) 235 0.005 0.000 0.006 0.000 function_base.py:2851(add_newdoc) 2594 0.003 0.000 0.003 0.000 {isinstance} 400 0.003 0.000 1.547 0.004 fromnumeric.py:1185(sum) 6 0.003 0.001 0.004 0.001 collections.py:13(namedtuple) 127 0.003 0.000 0.004 0.000 sre_compile.py:213(_optimize_charset) 124/10 0.003 0.000 0.007 0.001 sre_parse.py:385(_parse) 1 0.002 0.002 0.044 0.044 autoinit.py:1(<module>) 3 0.002 0.001 0.019 0.006 __init__.py:1(<module>) 287/10 0.002 0.000 0.008 0.001 sre_compile.py:38(_compile) 1 0.002 0.002 0.002 0.002 core.py:2230(MaskedArray) 1 0.002 0.002 0.002 0.002 numeric.py:1(<module>) """ def compare(numLoops, testing="gpu", arrLen=1000*1000): accum = 0 for i in xrange(numLoops): # randomized to avoid memoization or other caching rand = [random.uniform(0., 1000.) for j in xrange(arrLen)] rand2 = [random.uniform(0., 1000.) for j in xrange(arrLen)] if testing == "gpu": a = gpua.to_gpu(numpy.array(rand)) b = gpua.to_gpu(numpy.array(rand2)) c = gpua.sum(a) + gpua.sum(b) else: a = numpy.array(rand) b = numpy.array(rand2) c = numpy.sum(a) + numpy.sum(b) accum += c if i % 10 == 0: print i print accum if __name__ == '__main__': import sys compare(int(sys.argv[1]), sys.argv[2])
A few big points here. First, the overall time for
time python benchSum.py 200 gpu
takes longer than
time python benchSum.py 200 numpy
Not what we expected, huh?
Examining the profiling results in the above docstring tells us that when we go to GPU, Numpy's
400000000 31.759 0.000 31.759 0.000 {method 'random' of '_random.Random' objects} 400 1.541 0.004 1.541 0.004 {method 'sum' of 'numpy.ndarray' objects} 1 0.034 0.034 0.034 0.034 tools.py:170(make_default_context)
balloons into
400000006 31.358 0.000 31.358 0.000 {method 'random' of '_random.Random' objects} 400 1.667 0.004 1.667 0.004 gpuarray.py:93(set) 400 0.274 0.001 0.397 0.001 gpuarray.py:975(sum) 1400 0.171 0.000 0.189 0.000 gpuarray.py:61(__init__) 1 0.036 0.036 0.036 0.036 tools.py:170(make_default_context)
Keep this in mind.
The results in the module-level docstring above are why I hit the skids on GPUArray development last time, but I'll expound on them here.
Revisiting sumBench.py
"Trust, but verify" sez Reagan, and that's the way I was thinking when I rediscovered sumBench.py. It made sense to me that GPUArray isn't a silver bullet, but if I can't quantify that hunch then it's not science, but useless superstition. So, I decided to have some fun with sumBench.compare in the form of pstatsInterpretation.py. Let's see how:
#!/usr/bin/env python import pstats arraySizes = [10*10, 100*100, 500*500, 1000*1000, 2000*2000] loopCounts = [10, 50, 100, 200, 500]
I decided that, while the default benchmark of a 1000*1000 element array run for 200 iterations through the body of sumBench.compare was a fair model of FiPy usage, more data points are always useful.
So, I tested for all combinations of (arraySizes[i], loopCounts[j]) (how many is that, kids? Remember discrete math?).
I generated the Pstat profiles with the following function.
def runProfiling(): from benchSum import compare from cProfile import runctx for loopNum in loopCounts: for aSize in arraySizes: print "Doing %d loops on arraySize: %d..." % (loopNum, aSize) print "gpu." runctx("compare(%d, 'gpu', %d)" % (loopNum, aSize), globals(), locals(), "gpu-sum-%d-%d.pstats" % (loopNum, aSize)) print "numpy." runctx("compare(%d, 'numpy', %d)" % (loopNum, aSize), globals(), locals(), "numpy-sum-%d-%d.pstats" % (loopNum, aSize)) print
If you've forgotten how benchSum.compare works, go refresh your memory.
From the famous benchSum docstring, it is apparent that there is a correspondence between (numpy.sum), and (gpuarray.sum, gpuarray.set, gpuarray.__init__). This is the case because not only must GPUArray instantiate an ndarray but, in order to perform the arithmetic operation, it must throw the array on the GPU: something that obviously doesn't happen with ndarray. If you don't believe me, go have a look and convince yourself now, because my conclusions all rest on this correspondence.
If I wanted to get nasty, I could include in the correspondence other calls that are made in the use of GPUArray that aren't in that of ndarray, namely gpuarray.to_gpu and gpuarray.splay, but those times are basically insignificant.
Anyway, I needed a way of ripping out individual method timing (remember, that's time spent in self, or total time) from the mass of Pstat profiles I'd accumulated.
def _getTimeForMethods(arraySize, loopCount, methodList, arrType="gpu", benchmark="sum"): from StringIO import StringIO import re totalTime = 0. for method in methodList: strStream = StringIO() name = "%s-%s-%d-%d.pstats" % (arrType,benchmark,loopCount,arraySize) p = pstats.Stats(name, stream = strStream) p.print_stats(method) profString = strStream.getvalue() m = re.search(r"\d+\s+(\d+\.\d+)", profString) totalTime += float(m.group(1)) return totalTime
A little hairy, but it gets the job done. The above, _getTimeForMethods prints the Pstats output to a string stream and then yanks out the total time with a regular expression. It does this for each method in methodList and accumulates a composite time, which is returned.
I put this function to work in the following Behemoth, which interprets and graphs the Pstats results.
def graph(): from matplotlib import pylab as plt import subprocess as subp """ The methods which, according to the profiling results in the docstring of `benchSum.py`, are analogous between numpy.ndarray and gpuarray. We don't need to consider anything more than `sum` for numpy because numpy's set/init procedure is absorbed in the call to `array`, which both gpuarray and numpy do. """ numpyMethods = ["method 'sum'"] gpuMethods = ["gpuarray.py\S*(set)", "gpuarray.py\S*(sum)", "gpuarray.py\S*(__init__)"] loopsConstantTimes = {"numpy": [], "gpu": []} arrSizeConstantTimes = {"numpy": [], "gpu": []} justSumArrSizeConstantTimesGPU = [] for loops in [200]: for size in arraySizes: loopsConstantTimes["gpu"].append(_getTimeForMethods(size, loops, gpuMethods)) loopsConstantTimes["numpy"].append(_getTimeForMethods(size, loops, numpyMethods, "numpy")) for loops in loopCounts: for size in [1000*1000]: arrSizeConstantTimes["gpu"].append(_getTimeForMethods(size, loops, gpuMethods)) arrSizeConstantTimes["numpy"].append(_getTimeForMethods(size, loops, numpyMethods, "numpy")) justSumArrSizeConstantTimesGPU.append(_getTimeForMethods(size, loops, ["gpuarray.py\S*(sum)"])) """ PLOTTING ENSUES... If you really wanna see the gory details, go to the file in Matforge. """
Results
Okay, let's start off nice and light with a simple comparison between gpuarray.sum and ndarray.sum, then we'll dash any hope of dead-simple gpuarray usage.
Hell yeah! Look at that speed-up! Life must be a Singaporean paradise where I get paid to let other people's libraries halve my runtime!
Chya.
Whoops.
There's the whole picture. Now we're factoring in the additional effects of GPUArray usage: communication and initialization.
Here, the array size is fixed at 1000*1000 elements. The horizontal axis indicates the size of the array that we're summing. The vertical axis is the time of sum for ndarray and {sum, set, __init__} for GPUArray. Again, remember that these sets of methods are analogous. You can't have GPUArray's hot sum without its homely friends set and __init__.
Here, the number of iterations is fixed at 200. The number of iterations that the sum-accumulation procedure does is indicated on the horizontal axis. The vertical axis is, again, timing.
It's clear that ndarray wins out, even at a large array size (2000*2000).
Things I could be screwing up
Considering that the bottleneck with the GPUArray is setting the array, it would seem possible that I'm doing it wrong. The GPUArray documentation will verify that isn't the case. You just can't get around that gpuarray.set call; it's inevitable and it's what sinks your battleship.
It's fundamentally intuitive that you can't avoid the cost of communicating with the card. It's a longer trip down the bus and, since there isn't any lazy-eval/currying going on, it's a trip that'll happen on every arithmetic operation. TANSTAAFL.
To be sure, let's take a look at pycuda.gpuarray.set, just to see if it's plausible that the contents are slowing us up.
def set(self, ary): assert ary.size == self.size assert ary.dtype == self.dtype if self.size: drv.memcpy_htod(self.gpudata, ary)
Bingo! The key is in drv.memcpy_htod(...). That's where the actual transfer of data takes place; it makes sense time is lost there.
Possible recourse?
Right below pycuda.gpuarray.set is
def set_async(self, ary, stream=None): assert ary.size == self.size assert ary.dtype == self.dtype if self.size: drv.memcpy_htod_async(self.gpudata, ary, stream)
Might this be faster? Dunno. If it were, wouldn't Klockner be quick to tell us about it in the documentation?
Summary
- FiPy is bottlenecked largely by arithmetic operations.
- GPUArray will not help us alleviate these bottlenecks by using it as a one-off Rainman every time we want to do a single arithmetic operation.
- This is because communication is more expensive than the single arithmetic operation, even if the op is significantly faster on the GPU and even on arrays up to 2000*2000 in length.
- The only appropriate use for PyCUDA is by writing a few kernels for arithmetic-critical junctures where we can amortize the cost of communication over the speedup of multi-step arithmetic on the GPU.
Buildbot setup
First, obtain Buildbot on both master and slave.
master$ easy_install buildbot slave$ easy_install buildbot-slave
Now, create the master.
master$ buildbot create-master master_dir
This will make a directory called master_dir and fill it with all sorts of goodies. In order to be operational, the master_dir requires a master.cfg file to be present within it. Luckily, Buildbot supplies a master.cfg.sample which can easily be tailored to fit our needs.
master$ cp master.cfg.sample master.cfg master$ $EDITOR master.cfg
Now let's do the aforementioned tailoring. I'll step through configuration piece-by-piece.
c = BuildmasterConfig = {}
All that is required of each master.cfg is that it builds a dictionary called BuildmasterConfig (which we'll call c for brevity), which contains various configuration information for Buildbot.
branches = ['trunk', 'branches/bad_branch', ]
Because of Buildbot's architecture, we must specify (in some way) the branches that Buildbot should concern itself with; I define factories later on which generate the necessary Buildbot machinery for each branch specified here.
Instead of explicitly specifying the branches we want to monitor (which is usually all of them), it may be possible to poll SVN periodically and reconfigure Buildbot when a new branch is detected, but I haven't looked into that yet.
Specifying BuildSlaves
Now, we specify Buildslaves for c.
from buildbot.buildslave import BuildSlave c['slaves'] = [BuildSlave("danke", "bitte"), BuildSlave("slug", "gross")]
This addition to c implies that we will have two BuildSlaves communicating with the master: one called danke (with a slave-password of bitte), and one called slug (with a slave-password of gross). We will configure the slaves (with their passwords) after we have configured the master.
It is important that these slave passwords remain relatively private, or else there are some security vulnerabilities that entail possible execution of arbitrary code on the master's host (alluded to here).
Then, pick a port over which the master and slave will communicate. This port must be publicly accessible on their respective networks. I chose 9989.
c['slavePortnum'] = 9989
Remember to make this port publicly accessible. This may necessitate forwarding ports.
Detecting source changes
Now we configure how the Buildmaster finds out about changes in the repository. There are a variety of ways the master can be configured to keep itself informed of changes, including polling the repository periodically, but I chose to manually inform the buildmaster of changes via an addition to SVN's post-commit hook which calls a buildbot/contrib script available here.
On the master.cfg side, we add the following:
from buildbot.changes.pb import PBChangeSource c['change_source'] = PBChangeSource()
From here, I assume that the SVN repository is hosted on the same machine as the master. Digressing a moment from master.cfg, I navigate to the SVN repository's directory.
master$ cd /path/to/svn-repo master$ $EDITOR hooks/post-commit
Once the editor is open, add the following to notify the buildbot master of changes upon commit.
# set up PYTHONPATH to contain Twisted/buildbot perhaps, if not already # installed site-wide /path/to/svn_buildbot.py --repository "$REPOS" --revision "$REV" \ --bbserver localhost --bbport 9989
If the master is hosted on a different machine than the repository, the bbserver flag above can be modified accordingly.
One caveat here is that we must modify the contrib/svn_buildbot.py script to propagate branch information on to Buildbot. Open svn_buildbot.py
master$ $EDITOR /path/to/svn_buildbot.py
and enact the following change.
# this should be around line 143 # split_file = split_file_dummy split_file = split_file_branches
Now the SVN repo is set up to play nicely with the Buildbot installation and (hopefully) the master will be informed of each subsequent commit.
Schedulers
Back to master.cfg.
Next, we want to configure Schedulers. Whenever the master is informed of a change (via whatever object is held in c['change_source']) all Schedulers attached to the list held by c['schedulers'] will be informed of the change and, depending on the particular Scheduler in question, may or may not react by triggering a build or multiple builds.
Here, I'll configure a Scheduler for each branch to react to any change in that branch by running two builds.
from buildbot.scheduler import Scheduler from buildbot.schedulers.filter import ChangeFilter def buildSchedulerForBranch(branch): cf = ChangeFilter(branch=branch) return Scheduler(name="%s-scheduler" % branch, change_filter=cf, treeStableTimer=30, builderNames=["full-build-%s" % branch, "smaller-build-%s" % branch]) c['schedulers'] = [] for branch in branches: c['schedulers'].append(buildSchedulerForBranch(branch))
The ChangeFilter is used to discriminate branches; note that branch isn't the only criterion a ChangeFilter can consider. The parameter treeStableTimer is the length of time that Buildbot waits before starting the build process.
Note that this Scheduler references "full-build-[branch]" and "smaller-build-[branch]", which are Builders we'll define right now.
Builders
Here, we specify the build procedures. First, we will define a number of BuildFactorys, which detail generic steps that a given Build will take, and then we create the Builds themselves and attach them to to the BuildmasterConfig dictionary. There is some machinery defined up front for the sake of generating separate builds for each branch with as little duplication as possible, so bear with me.
def addCheckoutStep(factory, defaultBranch='trunk'): """Ensure that `baseURL` has a forward slash at the end.""" baseURL = 'svn://slug-jamesob.no-ip.org/home/job/tmp/fake_fipy_repo/' factory.addStep(SVN(mode='update', baseURL=baseURL)) def testAllForSolver(buildFact, solverType, parallel=False): """ Add doctest steps to a build factory for a given solver type. """ descStr = "for %s" % solverType solverArg = "--%s" % solverType buildFact.addStep(Doctest(description="testing modules " + descStr, extraArgs=[solverArg, "--modules"])) buildFact.addStep(Doctest(description="testing examples " + descStr, extraArgs=[solverArg, "--examples"]))
Here, addCheckoutStep takes a BuildFactory and adds a step which checks out whichever branch has been modified, as specified by the Scheduler reporting the changes. It does so by appending the name of the branch, as reported by the provoking Scheduler, to the baseURL that we pass in.
The function testAllForSolver takes a BuildFactory and adds steps which test both modules and examples for a given solver. I should mention at this point that here I use Doctest, a class I had to add to Buildbot.
Getting doctests to work
Buildbot integrates nicely with Trial, the unit-test framework that comes bundled with Twisted. Trial, unfortunately, doesn't seem to pick up on doctests, though there are scattered allusions speaking to the contrary (here and here).
I tried a few approaches to getting Trial to process doctests (adding a __doctest__ module-level attribute, for one) to no avail. Even if there is some way to piggyback doctests onto Trial, it may involve a large modification of the FiPy source, which, if purely for the sake of making Buildbot happy, I don't think is worth it.
In light of this, I added an object that extends the class that handles Trial integration to handle doctests instead.
Within master.cfg, add
from buildbot.steps.python_twisted import Trial, TrialTestCaseCounter from buildbot.steps.shell import ShellCommand from twisted.python import log class Doctest(Trial): """ Add support for Python's doctests. """ def __init__(self, description=None, extraArgs=None, **kwargs): ShellCommand.__init__(self, **kwargs) self.addFactoryArguments(description=description, extraArgs=extraArgs) self.testpath = "." self.command = ["python", "setup.py", "test"] self.extraArgs = extraArgs self.logfiles = {} if description is not None: self.description = [description] self.descriptionDone = [description + " done"] else: self.description = ["testing"] self.descriptionDone = ["tests"] # this counter will feed Progress along the 'test cases' metric self.addLogObserver('stdio', TrialTestCaseCounter()) def start(self): if self.extraArgs is not None: if type(self.extraArgs) is list: self.command += self.extraArgs elif type(self.extraArgs) is str: self.command.append(self.extraArgs) self._needToPullTestDotLog = False log.msg("Doctest.start: command is '%s' with description '%s'." \ % (self.command, self.description)) ShellCommand.start(self) def _gotTestDotLog(self, cmd): Trial._gotTestDotLog(self, cmd) # strip out "tests" from the beginning self.text[0] = self.description[0] + " " \ + " ".join(self.text[0].split(" ")[1:])
This allows us to run FiPy's test suite completely unmodified and have Buildbot interpret the results just as it would for Twisted's Trial.
Back to Builders
We never completed setting the Builders up, so let's do that now.
Let's first define a few Factory objects.
from buildbot.process import factory from buildbot.steps.source import SVN """ Run a mostly-complete test-suite. """ fullFactory = factory.BuildFactory() addCheckoutStep(fullFactory) testAllForSolver(fullFactory, "pysparse") testAllForSolver(fullFactory, "trilinos") """ Run a less-intensive test-suite. """ smallerFactory = factory.BuildFactory() addCheckoutStep(smallerFactory) smallerFactory.addStep(Doctest(extraArgs="--modules", description="testing modules"))
Now we will define and attach the Builders themselves.
c['builders'] = [] def makeBuildersForBranch(branch): builders = [] builders.append({"name": "full-build-%s" % branch, "slavename": "slug", "builddir": "%s-slug" % branch, "factory": fullFactory}) builders.append({"name": "smaller-build-%s" % branch, "slavename": "danke", "builddir": "%s-danke" % branch, "factory": smallerFactory}) return builders for branch in branches: for builder in makeBuildersForBranch(branch): c['builders'].append(builder)
The rest
Everything left in master.cfg concerns itself with how build information is reported back to us. Though there are very many possibilities open to us (e-mail reporting to blamed developers upon broken builds, IRC bots, Skynet, etc.), I haven't explored any of them other than the default web interface that Buildbot provides. With that in mind, here's the rest of master.cfg, basically unadulterated from the vanilla sample config.
####### STATUS TARGETS # 'status' is a list of Status Targets. The results of each build will be # pushed to these targets. buildbot/status/*.py has a variety to choose from, # including web pages, email senders, and IRC bots. c['status'] = [] from buildbot.status import html c['status'].append(html.WebStatus(http_port=8010)) # # from buildbot.status import mail # c['status'].append(mail.MailNotifier(fromaddr="buildbot@localhost", # extraRecipients=["[email protected]"], # sendToInterestedUsers=False)) # # from buildbot.status import words # c['status'].append(words.IRC(host="irc.example.com", nick="bb", # channels=["#example"])) # # from buildbot.status import client # c['status'].append(client.PBListener(9988)) c['projectName'] = "FiPy" c['projectURL'] = "http://www.ctcms.nist.gov/fipy/" c['buildbotURL'] = "http://localhost:8010/"
Configuring slaves
Configuring slaves is very simple.
danke$ buildslave create-slave slavedir slug-jamesob.no-ip.org:9989 danke bitte danke$ $EDITOR slavedir/info/admin danke$ $EDITOR slavedir/info/host slug$ buildslave create-slave slavedir slug-jamesob.no-ip.org:9989 slug gross slug$ $EDITOR slavedir/info/admin slug$ $EDITOR slavedir/info/host
Note that slug is both the master and a slave.
The form of the create-slave command is buildslave create-slave [directory for builds] [master's host address]:[port for communication] [slave-name] [slave-password].
So long as the port 9989 is open on both the slaves and master, we're done.
In summary
Buildbot is going to be an asset to us, though I still haven't contacted Doug as to how we're going to get it up and running.
In case I've screwed anything up in translating my config file to this blog post, I've uploaded the actual file to sandbox and it is available here.
Efficiency changes due to properties
I check out trunk at revision [4101], before the property changes were introduced, as old_trunk and run setup.py test.
(misc)old_trunk/ time python setup.py test 2> /dev/null running test ... real 2m57.447s user 2m55.135s sys 0m1.952s
I then run setup.py test on the latest revision of trunk@[4160], which includes all property implementations aside from those in fipy.terms.
(misc)trunk/ time python setup.py test 2> /dev/null running test ... real 3m0.290s user 2m57.987s sys 0m2.000s
There's only a three second difference between the two; clearly not a big deal.
How about a non-trivial benchmark like phase anisotropy? For this, I copy-script'd the anisotropy example from examples.phase, then stripped out the viewing code and reduced the number of steps to one hundred.
I then ran it on trunk@[4101], without the properties:
(misc)old_trunk/ time python phaseAnis.py ... real 3m50.701s user 2m50.679s sys 0m59.748s
And then on the current trunk@[4160], with properties:
(misc)trunk/ time python phaseAnis.py ... real 3m49.818s user 2m50.455s sys 0m59.104s
Again, we see a negligible difference (3:50 vs. 3:49) between the two runtimes. These two trials have convinced me that the property refactoring hasn't added any considerable computational burden.
GPUArray considered harmful
Back in September, everyone decided that introducing FiPy to CUDA in some facet was a good idea. The particular introduction we decided on was in the form of middleware which would slide between NumPy and our own numerix.py. This proposed middleware would shift some (or all) arrays from the system memory onto the GPU memory. That way, with the arrays on the GPU, we could perform all computations there, reaping the gleaming wonders of the all-powerful graphics processing unit. Silver bullet, right?
No, not exactly.
I'm no longer convinced that this approach will gain us anything; that is to say, I am no longer convinced that indiscriminately chucking NumPy arrays at the GPU is going to yield any performance gain for FiPy. I come to this conclusion after profiling a few varied FiPy runs in their entirety and trying, in vain, to implement the middleware described above.
Profiling
I should have done this long ago, before we even talked about possible interactions between FiPy and CUDA, just to see where performance gains are actually going to benefit us. Finally, after a few weeks of struggling with the proposed GPUArray module, I sat down and profiled a few FiPy runs to see where the cycles are being spent. After all, if we can't find a significant amount of time that's being spent doing arithmetic operations on NumPy arrays, there is no point in putting arrays on the GPU before the solver.
3D heat diffusion (1003)
I profiled the problem found here with the timing instrumentation removed. The results of the profiling were run through gprof2dot to produce this figure. The figure is a tree where each node is a function, the percentage below it is the percentage time spent in the function and all sub-functions, the parenthesized percentage is the time spent in the function itself, and all children of that node are sub-functions called within that function.
In this case, the only nodes displayed are for functions which consume 5% of runtime or more. These are the only components of a run that we should consider optimizing. Notice that the only array-specific functions found here are numeric:201:asarray and ~:0:<numpy.core.multiarray.array>. Both of these functions would take much longer when using a GPU (based on a strong hunch). Note also that we can't see any arithmetic operations in this graph. For this particular problem, it is clear that arithmetic operations on arrays are not the bottleneck, and thus using a GPU pre-solution makes absolutely no sense.
To get a better idea of which array functions were consuming time, I decided to generate another figure where each node consumes a minimum of 1% total runtime. That figure shows array allocation dominating array-specific operations, with ~:0:<numpy.core.multiarray.array> responsible for 18.6% of runtime.
No arithmetic operations are visible until we reduce the minimum percentage of runtime consumed by any node to 0.5%. That figure finally shows numerix:624:sin weighing in at 0.73%. Another array operation which appears in this graph is core:3914:reshape, the reshaping of an array, which comes in around 0.85%. Think of how this operation's time consumption would balloon if we were on a GPU.
The conclusion is clear: we should not be using a GPU for pre-solution array corralling on this run or anything like it.
Wheeler's Reactive Wetting
After profiling these toy runs, I told Wheeler about the results. He and I agreed that before making any final conclusions, we should profile a more realistic problem. We chose his reactive wetting problem. He profiled, sent me the pstats file, and I generated [http://matforge.org/fipy/export/3926/sandbox/gpuarray/wettingProf.simple.pdf this figure]. The minimum runtime consumption here is 5%.
Note that most array-relevant operations (aside from ~:0:<numpy.core.multiarray.array>) are found under inline:14:_optionalInline. Here we see that numerix:941:take is a good subtree to examine, as it is responsible for ~14% of total runtime.
The only arithmetic operation in this mess is core:3164:__mul__, boasting 5.50%. "Well, finally," you say to yourself, "there's something that will benefit from residence on the GPU." Not so fast. Only 0.01% of that 5.50% is spent in __mul__ itself. The overwhelming majority of time is spent in its child, ~:0:<numpy.core.multiarray.where>, which is presumably fiddling around in memory: yet another operation that will balloon with GPU usage.
Let's work on PyKrylov
After these damning results, it's evident to me that using Klockner's PyCuda within FiPy is a really bad idea. When using PyCuda's drop-in replacement (ahem) for numpy.ndarray, any arithmetic operation results in an intermediate copy of the gpuarray since there is no clever mechanism for lazily evaluating curried-kernel strings, or whatever.
Furthermore, I think that any solution proposing to be as convenient as Klockner's gpuarray is snake-oil unless some very clever mechanisms for incremental kernel accumulation and/or lazy evaluation enter the picture. The only way we can juice the GPU is by writing a few custom kernels at performance critical junctures: that way, we're not tap-dancing all over the GPU's limited and relatively slow memory by creating transient copies at every step of an arbitrary computation.
For these reasons, I think we (or I) should work on writing a few CUDA kernels for use by the non-Charlatan regions of PyCuda within PyKrylov. This implies work on the PyKrylov branch, which I am more than happy to undertake.