New Policy for Commit Messages
From now on, the following codes should be used for commit messages. The commit message should have one of the following three letter codes at the beginning of the first line followed by a colon.
API: an (incompatible) API change BLD: change related to building fipy BUG: bug fix DEP: deprecate something, or remove a deprecated object DEV: development tool or utility DOC: documentation ENH: enhancement MAINT: maintenance commit (refactoring, typos, etc.) REV: revert an earlier commit STY: style fix TST: addition or modification of tests REL: related to releasing fipy
These codes are the current Numpy standard (see http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html#writing-the-commit-message).
Commit and Ticket Update Automation
Jon has enabled the http://trac.edgewall.org/wiki/CommitTicketUpdater.
Use it as follows:
In future, when we make a commit that responds to a bug, include "addresses ticket:1234" or "references ticket:1234".
When you merge a pull request, include "fixes ticket:1234" or "closes ticket:1234".
These will automatically add the commit message to the ticket and, if appropriate, close the ticket.
SvnToGit
See wiki:SvnToGit.
How to represent a third order term?
A third order equation of the form:
and
I haven't tried this, but this should be possible with a coupled equation in a fully implicit manner. Here is an attempt. The sign on the diffusion term seems to make a large difference in the stability.
from fipy import Grid1D, CellVariable, TransientTerm, DiffusionTerm, Viewer from fipy import UpwindConvectionTerm, ImplicitSourceTerm m = Grid1D(nx=100, Lx=1.) phi = CellVariable(mesh=m, hasOld=True, value=0.5, name=r'$\phi$') psi = CellVariable(mesh=m, hasOld=True, name=r'$\psi$') phi.constrain(0, m.facesLeft) phi.constrain(1, m.facesRight) psi.constrain(0, m.facesRight) psi.constrain(0, m.facesLeft) eqnphi = TransientTerm(1, var=phi) == UpwindConvectionTerm([[1]], var=psi) eqnpsi = ImplicitSourceTerm(1, var=psi) == -DiffusionTerm(1, var=phi) eqn = eqnphi & eqnpsi vi = Viewer((phi, psi)) for t in range(1000): phi.updateOld() eqn.solve(dt=0.001) print phi vi.plot() raw_input('stopped')
Including the Transverse Waves in the Roe Convection Term
This is a continuation from blog:RoeConvectionTerm where the first order and second order correction are described fro the Roe convection term. These schemes have now been implemented in source:riemann/fipy/terms/roeConvectionTerm.py@5160. The scheme gives identical results to the spinning cone and block in CLAWPACK, but The second order scheme has not been tested with multiple equations yet (the spinning cone/block example only has one field).
The scheme needs to be updated to include the transverse waves. I believe this would make it truly second order accurate. Unfortunately, I'm still a little shaky on the derivation for the scheme.
Andrew Reeve's FiPy Linux Image
http://www.ctcms.nist.gov/fipy/download/reeveFiPyLinuxImage.iso
I have created a disk image that can be installed on an 8 GByte thumb drive. By installing this onto a usb thumb drive, and then booting this thumb drive through the computer's boot menu, you will be put into a fully functional linux (Arch Linux) system that contains FiPy, text editor, and associated python software (among other things). The 'wicd' program (type 'wicd-curses' in a termial window) on the thumb drive can be used to access a wireless network and the pacman package manager can be used to update/add software on the thumb drive. The image must written to the 8 Gbyte thumb drive using a low level copying program, such as 'dd' or 'rawrite'. Care should be taken to not copy over your hard drive! All data on your thumb drive will be overwritten when using these programs.
1) access your bootloaded when logging onto your computer by hitting the appropriate button when booting up. The button you need to hit to do this seems to vary with the make of the computer.
2) When you access the linux bootloaded, there are four options coresponding to different device names (sda, sdb, sdc, and sdd). Load the one that corresponds to the usb port on your computer. On the laptops I've used, 'sdb' has been to correct device name to use.
3) I think gmsh is also in the thumb drive,
4) mayavi2 has not been added to the thumbdrive, but can be added using the 'yaourt' package manages which automated building packages from source.
Let me know if people actually use these. I'd be willing to update the images one or twice a year if people find them useful.
Andy
Setting Up A Debug Environment
I've been trying to set up a debugable version of fipy in virtualenv for debugging a trilinos issue. Here are the steps:
- Install the python-dbg package from the Debian repositories.
- Use mkvirtualenv -p python-dbg debug to make the debug environment.
- Install numpy with pip install, not debug.
- Install swig in the standard way.
- Here is the do-configure for trilinos
EXTRA_ARGS=$@ TRILINOS_HOME=/users/wd15/Documents/python/trilinos-10.8.3-Source CMAKE=cmake PYTHON_EXECUTABLE=${VIRTUAL_ENV}/bin/python ${CMAKE} \ -D CMAKE_BUILD_TYPE:STRING=DEBUG \ -D Trilinos_ENABLE_PyTrilinos:BOOL=ON \ -D BUILD_SHARED_LIBS:BOOL=ON \ -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON \ -D TPL_ENABLE_MPI:BOOL=ON \ -D Trilinos_ENABLE_TESTS:BOOL=ON \ -D PYTHON_EXECUTABLE:FILEPATH=${PYTHON_EXECUTABLE} \ -D DART_TESTING_TIMEOUT:STRING=600 \ -D CMAKE_INSTALL_PREFIX:PATH=${VIRTUAL_ENV} \ -D PyTrilinos_INSTALL_PREFIX:PATH=${VIRTUAL_ENV} \ ${EXTRA_ARGS} \ ${TRILINOS_HOME}
Pyximport Weirdness
There seems to be an issue with using pyximport under certain conditions. Essentially, when pyximport is used here. When this loads from within the setup.py test environment it doesn't build on benson or on Jon's laptop. It does work on sandbox, which is using python 2.7 (rather than 2.5) and cython 0.15 (same as bunter). I'm not sure about Jon's configuration.
The error is that the compiler is trying to find the .c file in fipy/tools instead of the default .pyxblt/ directory. This manifests the following error:
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/users/wd15/.virtualenvs/riemann/lib/python2.5/site-packages/numpy/core/include -I/users/wd15/.virtualenvs/riemann/include/python2.5 -c /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c -o /users/wd15/.pyxbld/temp.linux-i686-2.5/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.o gcc: /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c: No such file or directory gcc: no input files
I have sort of figured out where the issue is occurring, but it appears that magic is occurring that doesn't seem possible. Initially to debug I turned DEBUG to True in ~/.virtualenvs/riemann/lib/python2.5/site-packages/pyximport/pyxbuild.py. That lead me to see what the problem is. If one looks in ~/.virtualenvs/riemann/lib/python2.5/site-packages/pyximport/pyximport.py one will see the following method
def get_distutils_extension(modname, pyxfilename): # try: # import hashlib # except ImportError: # import md5 as hashlib # extra = "_" + hashlib.md5(open(pyxfilename).read()).hexdigest() # modname = modname + extra extension_mod,setup_args = handle_special_build(modname, pyxfilename) print 'extension_mod',extension_mod print 'pyxfilename',pyxfilename print 'modname',modname if not extension_mod: from distutils.extension import Extension extension_mod = Extension(name = modname, sources=[pyxfilename]) import distutils.extension print 'distutils.extension.__file__',distutils.extension.__file__ print 'extension_mod.sources',extension_mod.sources print 'pyxfilename',pyxfilename ## extension_mod.sources = [pyxfilename] print 'extension_mod.sources',extension_mod.sources print type(pyxfilename) raw_input('stopped') return extension_mod,setup_args
When pyximport runs through this it gives the following output
extension_mod None pyxfilename /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx modname fipy.tools.smallMatrixVectorOpsExt in extension self.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx'] self.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx'] distutils.extension.__file__ /users/wd15/Documents/python/fipy/riemann/tmp/distutils/extension.py extension_mod.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c'] pyxfilename /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx extension_mod.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c'] <type 'str'>
What's weird here is that the pyxfilename passed to Extension has had it's extension changed from .pyx to .c. Now, this doesn't occur on sandbox, the file name never changes and everything runs smoothly. On bunter, if I explicitly set extension_mod.sources as I do in the commented line, then everything works. So what hocus-pocus is occurring in the Extension class to change .pyx to .c. Nothing! Absolutely nothing. Extension is just a stand alone class (no parents) and all it does it set self.sources = sources. In fact it I do print self.sources on the last line of Extension's __init__ it still has the .pyx file extension, but reverts to the .c file extension as soon as it returns to pyximport.py. What could possibly be occurring? Extension has no other methods. pyxfilename is a string. This seems impossible.
I finally figured out what was going on. When the tests are running, a different version of from distutils.extension import Extension is being imported. The version the tests are importing is in the virtualenv. When I run the script outside of the tests then it doesn't use the virtualenv version of distribute. The version in the virtualenv was dropping into the following:
try: from Pyrex.Distutils.build_ext import build_ext except ImportError: have_pyrex = False else: have_pyrex = True class Extension(_Extension): """Extension that uses '.c' files in place of '.pyx' files""" if not have_pyrex: # convert .pyx extensions to .c def __init__(self,*args,**kw): _Extension.__init__(self,*args,**kw) sources = [] for s in self.sources: if s.endswith('.pyx'): sources.append(s[:-3]+'c') else: sources.append(s) self.sources = sources
The code was changing all the .pyx extensions to .c. This is in distribute version 0.6.10. Updating to distribute version 0.6.24 fixed this as that snippet of code has been updated. Installing distribute was non-trivial as simply doing {{{pip install distribute --upgrade}}} broke everything because it wanted to install to the system directories instead of the virtualenv. It was unable to roll back breaking the virtualenv entirely. For some reason the --enivonment=$VIRTUAL_ENV flag was needed to avoid this.
Roe Convection Term
As a relatively painless entry to the implementation of higher order Riemann solvers in FiPy, I'm planning on implementing a basic Roe solver. Given a basic vector advection equation,
flux updates for a Roe solver can be written,
where the flux is across a face between cell and cell and the normal points from to . The question when implementing Roe solvers is how to approximate from the local Riemann problem
where
It is generally impossible to obtain an exact calculation for at the face. Strictly speaking, a Roe solver involves quite a complicated method to obtain an approximation for , . For the time being, in order to test the constant coefficient problems we will use a simple calculation for , namely,
From , we can obtain
where is the matrix of right eigenvectors and is the matrix of the absolute values of the eigenvalues along the diagonal.
I can pretty much see how to implement this algorithm in fipy without too many issues. I can see that getting from is not going to be easy, since an eigenvector/eigenvalue calculation is required for every face. At the moment the only way I see to do this is actually calculating and and reconstructing. Maybe there is a clever way to do this???
Higher Order Correction
Here we make the scheme second order on a regular grid. The correccted flux is given by,
where
is given by
and is given by
where is the cell to cell distance and is the face area. is given by,
where
The 's are given by,
Vector Diffusion Terms
We are now planning to implement equations of the form,
This has already been done partially in source:branches/vectorEquations. The convection term looks something like this
The coefficient is a rank 3 tensor in this case. The first index is always over the spatial range. The next two indices refer to the q's. We now have to address diffusion terms. The trouble is that we already have effective rank 3 coefficients for higher order diffusion terms so it's all getting a bit complicated, but it is still manageable.
For higher order terms, we generally use a list or tuple and don't embed the index as part of the tensor so I think that is automatically dealt with. So higher order terms should always be indexed with a tuple or list. That is our current policy and it makes sense. A second order diffusion term will now be of the following form,
Of course the rank of can vary depending on whether anisotropy is required. This requires some interpretation. If is rank 0 then it will be assumed that all the indices of are spatial. If q is rank 1 then it will be assumed that the last two indices are over q. For example if the variable is and the coefficient is then it's assumed that has a vector coefficient. If the variable was instead just then FiPy should throw an error as the coefficient has too many indices. Similarly, for a non-anisotropic vector equation the coefficient would simply be . The meaning of the coefficient indices changes based on context.
More mesh refactoring!?
Don't worry: it's almost over.
Previous mesh hierarchy
Meshes
Topologies
Geometries
I'm stripping out geometry and topology
Why?
- Their inheritance trees seem to inevitably mimic those of the meshes.
- We don't get any code reuse out of their existence, despite what I thought at their creation.
- A huge amount of boilerplate code is devoted to passing arguments to, instantiating, and creating interfaces to geometry/topology objects that is otherwise unnecessary.
- While geometry/topology objects do provide an added amount of orthogonality to the meshes, this benefit is negligible since it is highly likely that no other class hierarchy aside from meshes will use geometry/topology objects.
I was convinced of the harmful effects of geom/top classes when I went to expand the topology classes. In adding a few extra methods to topology, I found that I would have to nearly fully replicate the class tree of the meshes (instead of having just five topology classes) and the argument list for topology classes would not only vary from class to class, but would become unreasonably long.
Since the (lengthy) argument list for all geometry objects already varied from class to class, and thus was slightly ridiculous, I decided to think about reuniting meshes and geometry/topology. Where there is no uniform interface, no good code-reuse, and heavy coupling, merging classes is likely a good course of action.
Hard lesson learned: never refactor unless it is going to prevent duplication.
The new arrangement
I spent today merging geometries and topologies back into meshes. The new mesh hierarchy is displayed in the following UML diagram.
Meshes
Grid Builders
New features
- Uniform and irregular meshes are now siblings. This results in a less confusing inheritance scheme. Gridlike?D objects are aggregated under (Uniform)Grids to avoid duplication without the use of multiple inheritance. Gridlikes are never instantiated, but their contents are almost entirely static.
- GridBuilders are being used for the construction of grids. They warrant their own class hierarchy because the process for grid construction is largely dimensionally-independent. After the class-travaganza of geometry and topology, I can understand if this addition is met with suspicion. However, I think the reasons above are good ones.
- Interface for all meshes is defined in AbstractMesh, which should provide some clarification.
- Reduced NLOC after deleting the cruft classes.
explicit terms and residuals
What to do in _prepareLinearSystem() for different values of var?
Now:
None | A | |
Term() | error | matrix for A |
Term(var=A) | matrix for A | matrix for A |
Term(var=B) | matrix for B | 0 |
BinaryTerm(var=()) | error | matrix for A |
BinaryTerm(var=(A,)) | matrix for A | matrix for A |
BinaryTerm(var=(A, B)) | error | matrix for A |
BinaryTerm(var=(B,)) | matrix for B | 0 |
BinaryTerm(var=(B, C)) | error | 0 |
CoupledBinaryTerm(var=(A, B)) | matrix for (A, B) | error |
CoupledBinaryTerm(var=(B, C)) | matrix for (B, C) | error |
Want:
None | A | |
Term() | error | matrix for A |
Term(var=A) | matrix for A | matrix for A |
Term(var=B) | matrix for B | 0 |
BinaryTerm(var=()) | error | matrix for A |
BinaryTerm(var=(A,)) | matrix for A | matrix for A |
BinaryTerm(var=(A, B)) | residual for A and B | matrix for A + residual for B |
BinaryTerm(var=(B,)) | matrix for B | residual for B |
BinaryTerm(var=(B, C)) | residual for B and C | residual for B and C |
CoupledBinaryTerm(var=(A, B)) | matrix for (A, B) | error |
CoupledBinaryTerm(var=(B, C)) | matrix for (B, C) | error |
Note: _buildMatrix() still works as it did before, i.e., it must always be passed a solution variable and it build the matrix for that solution variable.
Looking at Cython
Cython is widely accessible
- Cython is bundled with Enthought Python Distribution, SAGE, and PythonXY
- Cython only requires Python and GCC, so it's installable basically anywhere
- Cython is easy_installable.
- mpi4py is already written in Cython.
In short: I'll bet dollars to donuts that anywhere FiPy is installable, Cython is too.
Building Cython code
- Write up a module foo.pyx in Cython.
- Write a setup.py which lists foo.pyx as an extension and specifies a cmdclass entry to compile it. Let's call that entry "build_ext"
- Run
python setup.py build_ext --inplace
This compiles the foo module from Cython down to C. - Use foo as you would a Python module, despite the fact that it's (supposedly well-optimized) C code.
More details are available here.
Cython integrates with NumPy
See here. Further details are available in Seljebotn's paper.
An interesting excerpt from the paper:
For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.
Most arithmetic algorithms in FiPy are specified in terms of NumPy operations, so Cython's use to us may be questionable. In a direct translation from a NumPy-operation-based routine to typed Cython code (using cdef np.ndarray), I saw no speedup.
Return of the GPU
Spoiler alert: GPUArray is still snake oil
We've spoken recently of whether or not our decision to ditch GPUArray forever was ill-conceived. Certainly, my previous blog post on the topic was, at best, incomplete in that I never gave a conclusive explanation of what time is being spent where. This was a result of me using fancy dot-graph visualizations instead of sticking to that natural adversary of bullshit: raw text.
In the previous post, I never examined what Guyer refers to as terminal procedures. More importantly, though, I never presented a list of procedures sorted by time spent in each procedure itself (excluding calls made to other procedures within its body). In other words, I never presented a list of procedures sorted by what Python's Pstats refers to as total time.
In this post, I will explore whether or not arithmetic operations ex solver are taking a considerable amount of time. When I conclude that they are, I will explore whether or not the use of a temporary GPUArray for some arithmetic operations may alleviate this bottleneck. I will conclude that it won't, since GPUArray initialization and communication are stalls which cause a net gain in runtime, even considering the much better performance of arithmetic operations on the GPU.
Profiling FiPy
In order to determine if, in fact, we are spending an amount of time on ex solver arithmetic operations that would merit speeding them up somehow, I profiled a phase anisotropy run, available here.
I ran the simulation for 10 steps with dx and dy both at 1000. I then obtained profiling output with this script
import pstats def printProfileInfo(files): for filename in files: p = pstats.Stats(filename) # Sort by time in self p.sort_stats('time').print_stats(20) if __name__ == '__main__': import sys printProfileInfo(sys.argv[1:])
as generated by this pstats file. The output is here.
Let's walk through line-by-line.
- pysparse.itsolvers.pcg: solver; of no concern right now.
- {method 'sum' ...}: helllooooo arithmetic.
- variable.py:1088(<lambda>): straight from The Good Book, Variable.py:
return self._BinaryOperatorVariable(lambda a,b: a*b, other)
Obviously some steamy arithmetic.
- {method 'take' ...}: I'd guess the reason this guy is so costly is memory access. Just a guess, though.
- meshVariable.py:230(_dot): helllooooo arithmetic.
- harmonicCellToFaceVariable.py:42(_calcValuePy): a whole mess of arithmetic.
def _calcValuePy(self, alpha, id1, id2): cell1 = numerix.take(self.var,id1, axis=-1) cell2 = numerix.take(self.var,id2, axis=-1) value = ((cell2 - cell1) * alpha + cell1) eps = 1e-20 value = (value == 0.) * eps + (value != 0.) * value cell1Xcell2 = cell1 * cell2 value = ((value > eps) | (value < -eps)) * cell1Xcell2 / value value = (cell1Xcell2 >= 0.) * value return value
Okay, so the point is clear: arithmetic out the wazoo. That numerical analysis software is stunted by calculation shouldn't come as a surprise.
Can the GPU help us?
For the sake of simplicity and to serve as a proof of concept, from here on out I'll deal exclusively with sum. If we can make sum into less of a problem with the help of Dr. Nvidia, then certainly we can rid other arithmetic aches similarly.
As I was shuffling around $FIPYROOT/sandbox to prepare some GPU-sum benchmarks, I was pleasantly surprised to find that I'd already been through this process, I just never had the good sense to document it.
Allow me to introduce you to benchSum.py, available here.
#!/usr/bin/env python import numpy import pycuda.autoinit import pycuda.gpuarray as gpua import random """ Using `time`: GPU ======== real 3m55.157s user 3m52.720s sys 0m1.530s where numLoops is 200. NumPy ======== real 3m52.877s user 3m51.680s sys 0m0.590s where numLoops is 200. Using cProfile/pstats: GPU ======== Tue Nov 2 20:31:07 2010 sum-gpu.prof 800042925 function calls (800042000 primitive calls) in 372.578 CPU seconds Ordered by: internal time List reduced from 596 to 20 due to restriction <20> ncalls tottime percall cumtime percall filename:lineno(function) 400000000 163.425 0.000 194.783 0.000 random.py:351(uniform) 1 142.234 142.234 372.452 372.452 benchSum.py:26(compare) 402 33.223 0.083 33.223 0.083 {numpy.core.multiarray.array} 400000006 31.358 0.000 31.358 0.000 {method 'random' of '_random.Random' objects} 400 1.667 0.004 1.667 0.004 gpuarray.py:93(set) 400 0.274 0.001 0.397 0.001 gpuarray.py:975(sum) 1400 0.171 0.000 0.189 0.000 gpuarray.py:61(__init__) 1 0.036 0.036 0.036 0.036 tools.py:170(make_default_context) 1 0.023 0.023 372.581 372.581 benchSum.py:3(<module>) 1199 0.020 0.000 0.025 0.000 driver.py:257(function_prepared_async_call) 400 0.012 0.000 0.100 0.000 reduction.py:203(__call__) 799 0.011 0.000 0.025 0.000 tools.py:470(context_dependent_memoize) 1404 0.008 0.000 0.016 0.000 __init__.py:130(memoize) 400 0.008 0.000 1.794 0.004 gpuarray.py:619(to_gpu) 1400 0.007 0.000 0.018 0.000 gpuarray.py:40(splay) 1 0.006 0.006 0.006 0.006 driver.py:1(<module>) 2 0.006 0.003 0.012 0.006 __init__.py:2(<module>) 1199 0.005 0.000 0.005 0.000 {pycuda._pvt_struct.pack} 235 0.005 0.000 0.006 0.000 function_base.py:2851(add_newdoc) 135 0.004 0.000 0.005 0.000 sre_compile.py:213(_optimize_charset) NumPy ======== Tue Nov 2 20:38:08 2010 sum-numpy.prof 800027142 function calls (800026254 primitive calls) in 373.266 CPU seconds Ordered by: internal time List reduced from 453 to 20 due to restriction <20> ncalls tottime percall cumtime percall filename:lineno(function) 400000000 166.126 0.000 197.885 0.000 random.py:351(uniform) 1 140.405 140.405 373.137 373.137 benchSum.py:30(compare) 402 33.300 0.083 33.300 0.083 {numpy.core.multiarray.array} 400000000 31.759 0.000 31.759 0.000 {method 'random' of '_random.Random' objects} 400 1.541 0.004 1.541 0.004 {method 'sum' of 'numpy.ndarray' objects} 1 0.034 0.034 0.034 0.034 tools.py:170(make_default_context) 1 0.024 0.024 373.266 373.266 benchSum.py:3(<module>) 1 0.005 0.005 0.005 0.005 driver.py:1(<module>) 2 0.005 0.003 0.012 0.006 __init__.py:2(<module>) 235 0.005 0.000 0.006 0.000 function_base.py:2851(add_newdoc) 2594 0.003 0.000 0.003 0.000 {isinstance} 400 0.003 0.000 1.547 0.004 fromnumeric.py:1185(sum) 6 0.003 0.001 0.004 0.001 collections.py:13(namedtuple) 127 0.003 0.000 0.004 0.000 sre_compile.py:213(_optimize_charset) 124/10 0.003 0.000 0.007 0.001 sre_parse.py:385(_parse) 1 0.002 0.002 0.044 0.044 autoinit.py:1(<module>) 3 0.002 0.001 0.019 0.006 __init__.py:1(<module>) 287/10 0.002 0.000 0.008 0.001 sre_compile.py:38(_compile) 1 0.002 0.002 0.002 0.002 core.py:2230(MaskedArray) 1 0.002 0.002 0.002 0.002 numeric.py:1(<module>) """ def compare(numLoops, testing="gpu", arrLen=1000*1000): accum = 0 for i in xrange(numLoops): # randomized to avoid memoization or other caching rand = [random.uniform(0., 1000.) for j in xrange(arrLen)] rand2 = [random.uniform(0., 1000.) for j in xrange(arrLen)] if testing == "gpu": a = gpua.to_gpu(numpy.array(rand)) b = gpua.to_gpu(numpy.array(rand2)) c = gpua.sum(a) + gpua.sum(b) else: a = numpy.array(rand) b = numpy.array(rand2) c = numpy.sum(a) + numpy.sum(b) accum += c if i % 10 == 0: print i print accum if __name__ == '__main__': import sys compare(int(sys.argv[1]), sys.argv[2])
A few big points here. First, the overall time for
time python benchSum.py 200 gpu
takes longer than
time python benchSum.py 200 numpy
Not what we expected, huh?
Examining the profiling results in the above docstring tells us that when we go to GPU, Numpy's
400000000 31.759 0.000 31.759 0.000 {method 'random' of '_random.Random' objects} 400 1.541 0.004 1.541 0.004 {method 'sum' of 'numpy.ndarray' objects} 1 0.034 0.034 0.034 0.034 tools.py:170(make_default_context)
balloons into
400000006 31.358 0.000 31.358 0.000 {method 'random' of '_random.Random' objects} 400 1.667 0.004 1.667 0.004 gpuarray.py:93(set) 400 0.274 0.001 0.397 0.001 gpuarray.py:975(sum) 1400 0.171 0.000 0.189 0.000 gpuarray.py:61(__init__) 1 0.036 0.036 0.036 0.036 tools.py:170(make_default_context)
Keep this in mind.
The results in the module-level docstring above are why I hit the skids on GPUArray development last time, but I'll expound on them here.
Revisiting sumBench.py
"Trust, but verify" sez Reagan, and that's the way I was thinking when I rediscovered sumBench.py. It made sense to me that GPUArray isn't a silver bullet, but if I can't quantify that hunch then it's not science, but useless superstition. So, I decided to have some fun with sumBench.compare in the form of pstatsInterpretation.py. Let's see how:
#!/usr/bin/env python import pstats arraySizes = [10*10, 100*100, 500*500, 1000*1000, 2000*2000] loopCounts = [10, 50, 100, 200, 500]
I decided that, while the default benchmark of a 1000*1000 element array run for 200 iterations through the body of sumBench.compare was a fair model of FiPy usage, more data points are always useful.
So, I tested for all combinations of (arraySizes[i], loopCounts[j]) (how many is that, kids? Remember discrete math?).
I generated the Pstat profiles with the following function.
def runProfiling(): from benchSum import compare from cProfile import runctx for loopNum in loopCounts: for aSize in arraySizes: print "Doing %d loops on arraySize: %d..." % (loopNum, aSize) print "gpu." runctx("compare(%d, 'gpu', %d)" % (loopNum, aSize), globals(), locals(), "gpu-sum-%d-%d.pstats" % (loopNum, aSize)) print "numpy." runctx("compare(%d, 'numpy', %d)" % (loopNum, aSize), globals(), locals(), "numpy-sum-%d-%d.pstats" % (loopNum, aSize)) print
If you've forgotten how benchSum.compare works, go refresh your memory.
From the famous benchSum docstring, it is apparent that there is a correspondence between (numpy.sum), and (gpuarray.sum, gpuarray.set, gpuarray.__init__). This is the case because not only must GPUArray instantiate an ndarray but, in order to perform the arithmetic operation, it must throw the array on the GPU: something that obviously doesn't happen with ndarray. If you don't believe me, go have a look and convince yourself now, because my conclusions all rest on this correspondence.
If I wanted to get nasty, I could include in the correspondence other calls that are made in the use of GPUArray that aren't in that of ndarray, namely gpuarray.to_gpu and gpuarray.splay, but those times are basically insignificant.
Anyway, I needed a way of ripping out individual method timing (remember, that's time spent in self, or total time) from the mass of Pstat profiles I'd accumulated.
def _getTimeForMethods(arraySize, loopCount, methodList, arrType="gpu", benchmark="sum"): from StringIO import StringIO import re totalTime = 0. for method in methodList: strStream = StringIO() name = "%s-%s-%d-%d.pstats" % (arrType,benchmark,loopCount,arraySize) p = pstats.Stats(name, stream = strStream) p.print_stats(method) profString = strStream.getvalue() m = re.search(r"\d+\s+(\d+\.\d+)", profString) totalTime += float(m.group(1)) return totalTime
A little hairy, but it gets the job done. The above, _getTimeForMethods prints the Pstats output to a string stream and then yanks out the total time with a regular expression. It does this for each method in methodList and accumulates a composite time, which is returned.
I put this function to work in the following Behemoth, which interprets and graphs the Pstats results.
def graph(): from matplotlib import pylab as plt import subprocess as subp """ The methods which, according to the profiling results in the docstring of `benchSum.py`, are analogous between numpy.ndarray and gpuarray. We don't need to consider anything more than `sum` for numpy because numpy's set/init procedure is absorbed in the call to `array`, which both gpuarray and numpy do. """ numpyMethods = ["method 'sum'"] gpuMethods = ["gpuarray.py\S*(set)", "gpuarray.py\S*(sum)", "gpuarray.py\S*(__init__)"] loopsConstantTimes = {"numpy": [], "gpu": []} arrSizeConstantTimes = {"numpy": [], "gpu": []} justSumArrSizeConstantTimesGPU = [] for loops in [200]: for size in arraySizes: loopsConstantTimes["gpu"].append(_getTimeForMethods(size, loops, gpuMethods)) loopsConstantTimes["numpy"].append(_getTimeForMethods(size, loops, numpyMethods, "numpy")) for loops in loopCounts: for size in [1000*1000]: arrSizeConstantTimes["gpu"].append(_getTimeForMethods(size, loops, gpuMethods)) arrSizeConstantTimes["numpy"].append(_getTimeForMethods(size, loops, numpyMethods, "numpy")) justSumArrSizeConstantTimesGPU.append(_getTimeForMethods(size, loops, ["gpuarray.py\S*(sum)"])) """ PLOTTING ENSUES... If you really wanna see the gory details, go to the file in Matforge. """
Results
Okay, let's start off nice and light with a simple comparison between gpuarray.sum and ndarray.sum, then we'll dash any hope of dead-simple gpuarray usage.
Hell yeah! Look at that speed-up! Life must be a Singaporean paradise where I get paid to let other people's libraries halve my runtime!
Chya.
Whoops.
There's the whole picture. Now we're factoring in the additional effects of GPUArray usage: communication and initialization.
Here, the array size is fixed at 1000*1000 elements. The horizontal axis indicates the size of the array that we're summing. The vertical axis is the time of sum for ndarray and {sum, set, __init__} for GPUArray. Again, remember that these sets of methods are analogous. You can't have GPUArray's hot sum without its homely friends set and __init__.
Here, the number of iterations is fixed at 200. The number of iterations that the sum-accumulation procedure does is indicated on the horizontal axis. The vertical axis is, again, timing.
It's clear that ndarray wins out, even at a large array size (2000*2000).
Things I could be screwing up
Considering that the bottleneck with the GPUArray is setting the array, it would seem possible that I'm doing it wrong. The GPUArray documentation will verify that isn't the case. You just can't get around that gpuarray.set call; it's inevitable and it's what sinks your battleship.
It's fundamentally intuitive that you can't avoid the cost of communicating with the card. It's a longer trip down the bus and, since there isn't any lazy-eval/currying going on, it's a trip that'll happen on every arithmetic operation. TANSTAAFL.
To be sure, let's take a look at pycuda.gpuarray.set, just to see if it's plausible that the contents are slowing us up.
def set(self, ary): assert ary.size == self.size assert ary.dtype == self.dtype if self.size: drv.memcpy_htod(self.gpudata, ary)
Bingo! The key is in drv.memcpy_htod(...). That's where the actual transfer of data takes place; it makes sense time is lost there.
Possible recourse?
Right below pycuda.gpuarray.set is
def set_async(self, ary, stream=None): assert ary.size == self.size assert ary.dtype == self.dtype if self.size: drv.memcpy_htod_async(self.gpudata, ary, stream)
Might this be faster? Dunno. If it were, wouldn't Klockner be quick to tell us about it in the documentation?
Summary
- FiPy is bottlenecked largely by arithmetic operations.
- GPUArray will not help us alleviate these bottlenecks by using it as a one-off Rainman every time we want to do a single arithmetic operation.
- This is because communication is more expensive than the single arithmetic operation, even if the op is significantly faster on the GPU and even on arrays up to 2000*2000 in length.
- The only appropriate use for PyCUDA is by writing a few kernels for arithmetic-critical junctures where we can amortize the cost of communication over the speedup of multi-step arithmetic on the GPU.
Buildbot setup
First, obtain Buildbot on both master and slave.
master$ easy_install buildbot slave$ easy_install buildbot-slave
Now, create the master.
master$ buildbot create-master master_dir
This will make a directory called master_dir and fill it with all sorts of goodies. In order to be operational, the master_dir requires a master.cfg file to be present within it. Luckily, Buildbot supplies a master.cfg.sample which can easily be tailored to fit our needs.
master$ cp master.cfg.sample master.cfg master$ $EDITOR master.cfg
Now let's do the aforementioned tailoring. I'll step through configuration piece-by-piece.
c = BuildmasterConfig = {}
All that is required of each master.cfg is that it builds a dictionary called BuildmasterConfig (which we'll call c for brevity), which contains various configuration information for Buildbot.
branches = ['trunk', 'branches/bad_branch', ]
Because of Buildbot's architecture, we must specify (in some way) the branches that Buildbot should concern itself with; I define factories later on which generate the necessary Buildbot machinery for each branch specified here.
Instead of explicitly specifying the branches we want to monitor (which is usually all of them), it may be possible to poll SVN periodically and reconfigure Buildbot when a new branch is detected, but I haven't looked into that yet.
Specifying BuildSlaves
Now, we specify Buildslaves for c.
from buildbot.buildslave import BuildSlave c['slaves'] = [BuildSlave("danke", "bitte"), BuildSlave("slug", "gross")]
This addition to c implies that we will have two BuildSlaves communicating with the master: one called danke (with a slave-password of bitte), and one called slug (with a slave-password of gross). We will configure the slaves (with their passwords) after we have configured the master.
It is important that these slave passwords remain relatively private, or else there are some security vulnerabilities that entail possible execution of arbitrary code on the master's host (alluded to here).
Then, pick a port over which the master and slave will communicate. This port must be publicly accessible on their respective networks. I chose 9989.
c['slavePortnum'] = 9989
Remember to make this port publicly accessible. This may necessitate forwarding ports.
Detecting source changes
Now we configure how the Buildmaster finds out about changes in the repository. There are a variety of ways the master can be configured to keep itself informed of changes, including polling the repository periodically, but I chose to manually inform the buildmaster of changes via an addition to SVN's post-commit hook which calls a buildbot/contrib script available here.
On the master.cfg side, we add the following:
from buildbot.changes.pb import PBChangeSource c['change_source'] = PBChangeSource()
From here, I assume that the SVN repository is hosted on the same machine as the master. Digressing a moment from master.cfg, I navigate to the SVN repository's directory.
master$ cd /path/to/svn-repo master$ $EDITOR hooks/post-commit
Once the editor is open, add the following to notify the buildbot master of changes upon commit.
# set up PYTHONPATH to contain Twisted/buildbot perhaps, if not already # installed site-wide /path/to/svn_buildbot.py --repository "$REPOS" --revision "$REV" \ --bbserver localhost --bbport 9989
If the master is hosted on a different machine than the repository, the bbserver flag above can be modified accordingly.
One caveat here is that we must modify the contrib/svn_buildbot.py script to propagate branch information on to Buildbot. Open svn_buildbot.py
master$ $EDITOR /path/to/svn_buildbot.py
and enact the following change.
# this should be around line 143 # split_file = split_file_dummy split_file = split_file_branches
Now the SVN repo is set up to play nicely with the Buildbot installation and (hopefully) the master will be informed of each subsequent commit.
Schedulers
Back to master.cfg.
Next, we want to configure Schedulers. Whenever the master is informed of a change (via whatever object is held in c['change_source']) all Schedulers attached to the list held by c['schedulers'] will be informed of the change and, depending on the particular Scheduler in question, may or may not react by triggering a build or multiple builds.
Here, I'll configure a Scheduler for each branch to react to any change in that branch by running two builds.
from buildbot.scheduler import Scheduler from buildbot.schedulers.filter import ChangeFilter def buildSchedulerForBranch(branch): cf = ChangeFilter(branch=branch) return Scheduler(name="%s-scheduler" % branch, change_filter=cf, treeStableTimer=30, builderNames=["full-build-%s" % branch, "smaller-build-%s" % branch]) c['schedulers'] = [] for branch in branches: c['schedulers'].append(buildSchedulerForBranch(branch))
The ChangeFilter is used to discriminate branches; note that branch isn't the only criterion a ChangeFilter can consider. The parameter treeStableTimer is the length of time that Buildbot waits before starting the build process.
Note that this Scheduler references "full-build-[branch]" and "smaller-build-[branch]", which are Builders we'll define right now.
Builders
Here, we specify the build procedures. First, we will define a number of BuildFactorys, which detail generic steps that a given Build will take, and then we create the Builds themselves and attach them to to the BuildmasterConfig dictionary. There is some machinery defined up front for the sake of generating separate builds for each branch with as little duplication as possible, so bear with me.
def addCheckoutStep(factory, defaultBranch='trunk'): """Ensure that `baseURL` has a forward slash at the end.""" baseURL = 'svn://slug-jamesob.no-ip.org/home/job/tmp/fake_fipy_repo/' factory.addStep(SVN(mode='update', baseURL=baseURL)) def testAllForSolver(buildFact, solverType, parallel=False): """ Add doctest steps to a build factory for a given solver type. """ descStr = "for %s" % solverType solverArg = "--%s" % solverType buildFact.addStep(Doctest(description="testing modules " + descStr, extraArgs=[solverArg, "--modules"])) buildFact.addStep(Doctest(description="testing examples " + descStr, extraArgs=[solverArg, "--examples"]))
Here, addCheckoutStep takes a BuildFactory and adds a step which checks out whichever branch has been modified, as specified by the Scheduler reporting the changes. It does so by appending the name of the branch, as reported by the provoking Scheduler, to the baseURL that we pass in.
The function testAllForSolver takes a BuildFactory and adds steps which test both modules and examples for a given solver. I should mention at this point that here I use Doctest, a class I had to add to Buildbot.
Getting doctests to work
Buildbot integrates nicely with Trial, the unit-test framework that comes bundled with Twisted. Trial, unfortunately, doesn't seem to pick up on doctests, though there are scattered allusions speaking to the contrary (here and here).
I tried a few approaches to getting Trial to process doctests (adding a __doctest__ module-level attribute, for one) to no avail. Even if there is some way to piggyback doctests onto Trial, it may involve a large modification of the FiPy source, which, if purely for the sake of making Buildbot happy, I don't think is worth it.
In light of this, I added an object that extends the class that handles Trial integration to handle doctests instead.
Within master.cfg, add
from buildbot.steps.python_twisted import Trial, TrialTestCaseCounter from buildbot.steps.shell import ShellCommand from twisted.python import log class Doctest(Trial): """ Add support for Python's doctests. """ def __init__(self, description=None, extraArgs=None, **kwargs): ShellCommand.__init__(self, **kwargs) self.addFactoryArguments(description=description, extraArgs=extraArgs) self.testpath = "." self.command = ["python", "setup.py", "test"] self.extraArgs = extraArgs self.logfiles = {} if description is not None: self.description = [description] self.descriptionDone = [description + " done"] else: self.description = ["testing"] self.descriptionDone = ["tests"] # this counter will feed Progress along the 'test cases' metric self.addLogObserver('stdio', TrialTestCaseCounter()) def start(self): if self.extraArgs is not None: if type(self.extraArgs) is list: self.command += self.extraArgs elif type(self.extraArgs) is str: self.command.append(self.extraArgs) self._needToPullTestDotLog = False log.msg("Doctest.start: command is '%s' with description '%s'." \ % (self.command, self.description)) ShellCommand.start(self) def _gotTestDotLog(self, cmd): Trial._gotTestDotLog(self, cmd) # strip out "tests" from the beginning self.text[0] = self.description[0] + " " \ + " ".join(self.text[0].split(" ")[1:])
This allows us to run FiPy's test suite completely unmodified and have Buildbot interpret the results just as it would for Twisted's Trial.
Back to Builders
We never completed setting the Builders up, so let's do that now.
Let's first define a few Factory objects.
from buildbot.process import factory from buildbot.steps.source import SVN """ Run a mostly-complete test-suite. """ fullFactory = factory.BuildFactory() addCheckoutStep(fullFactory) testAllForSolver(fullFactory, "pysparse") testAllForSolver(fullFactory, "trilinos") """ Run a less-intensive test-suite. """ smallerFactory = factory.BuildFactory() addCheckoutStep(smallerFactory) smallerFactory.addStep(Doctest(extraArgs="--modules", description="testing modules"))
Now we will define and attach the Builders themselves.
c['builders'] = [] def makeBuildersForBranch(branch): builders = [] builders.append({"name": "full-build-%s" % branch, "slavename": "slug", "builddir": "%s-slug" % branch, "factory": fullFactory}) builders.append({"name": "smaller-build-%s" % branch, "slavename": "danke", "builddir": "%s-danke" % branch, "factory": smallerFactory}) return builders for branch in branches: for builder in makeBuildersForBranch(branch): c['builders'].append(builder)
The rest
Everything left in master.cfg concerns itself with how build information is reported back to us. Though there are very many possibilities open to us (e-mail reporting to blamed developers upon broken builds, IRC bots, Skynet, etc.), I haven't explored any of them other than the default web interface that Buildbot provides. With that in mind, here's the rest of master.cfg, basically unadulterated from the vanilla sample config.
####### STATUS TARGETS # 'status' is a list of Status Targets. The results of each build will be # pushed to these targets. buildbot/status/*.py has a variety to choose from, # including web pages, email senders, and IRC bots. c['status'] = [] from buildbot.status import html c['status'].append(html.WebStatus(http_port=8010)) # # from buildbot.status import mail # c['status'].append(mail.MailNotifier(fromaddr="buildbot@localhost", # extraRecipients=["[email protected]"], # sendToInterestedUsers=False)) # # from buildbot.status import words # c['status'].append(words.IRC(host="irc.example.com", nick="bb", # channels=["#example"])) # # from buildbot.status import client # c['status'].append(client.PBListener(9988)) c['projectName'] = "FiPy" c['projectURL'] = "http://www.ctcms.nist.gov/fipy/" c['buildbotURL'] = "http://localhost:8010/"
Configuring slaves
Configuring slaves is very simple.
danke$ buildslave create-slave slavedir slug-jamesob.no-ip.org:9989 danke bitte danke$ $EDITOR slavedir/info/admin danke$ $EDITOR slavedir/info/host slug$ buildslave create-slave slavedir slug-jamesob.no-ip.org:9989 slug gross slug$ $EDITOR slavedir/info/admin slug$ $EDITOR slavedir/info/host
Note that slug is both the master and a slave.
The form of the create-slave command is buildslave create-slave [directory for builds] [master's host address]:[port for communication] [slave-name] [slave-password].
So long as the port 9989 is open on both the slaves and master, we're done.
In summary
Buildbot is going to be an asset to us, though I still haven't contacted Doug as to how we're going to get it up and running.
In case I've screwed anything up in translating my config file to this blog post, I've uploaded the actual file to sandbox and it is available here.
Efficiency changes due to properties
I check out trunk at revision [4101], before the property changes were introduced, as old_trunk and run setup.py test.
(misc)old_trunk/ time python setup.py test 2> /dev/null running test ... real 2m57.447s user 2m55.135s sys 0m1.952s
I then run setup.py test on the latest revision of trunk@[4160], which includes all property implementations aside from those in fipy.terms.
(misc)trunk/ time python setup.py test 2> /dev/null running test ... real 3m0.290s user 2m57.987s sys 0m2.000s
There's only a three second difference between the two; clearly not a big deal.
How about a non-trivial benchmark like phase anisotropy? For this, I copy-script'd the anisotropy example from examples.phase, then stripped out the viewing code and reduced the number of steps to one hundred.
I then ran it on trunk@[4101], without the properties:
(misc)old_trunk/ time python phaseAnis.py ... real 3m50.701s user 2m50.679s sys 0m59.748s
And then on the current trunk@[4160], with properties:
(misc)trunk/ time python phaseAnis.py ... real 3m49.818s user 2m50.455s sys 0m59.104s
Again, we see a negligible difference (3:50 vs. 3:49) between the two runtimes. These two trials have convinced me that the property refactoring hasn't added any considerable computational burden.
Benchmarking the Efficiency Branch
An efficiency comparison between source:trunk@4057 and source:branches/efficiency@4057 on bunter.
Example | Source | Flags | CPUs | 9 steps (s) | memory (kB) | ||||
1 | 2 | 3 | 4 | 5 | |||||
attachment:anisotropy.py | source:branches/efficiency@4057 | --pysparse | 1 | 25 | 25 | 29 | 29 | 333552 | |
attachment:anisotropy.py | source:branches/efficiency@4057 | --trilinos | 1 | 72 | 72 | 417776 | |||
attachment:anisotropy.py | source:branches/efficiency@4057 | --trilinos | 2 | 48 | 48 | 278368 | |||
attachment:anisotropy.py | source:branches/efficiency@4057 | --trilinos | 4 | 34 | 36 | 192424 | |||
attachment:anisotropy.py | source:trunk@4057 | --pysparse | 1 | 29 | 29 | 296020 | |||
attachment:anisotropy.py | source:trunk@4057 | --trilinos | 1 | 92 | 96 | 395408 | |||
attachment:anisotropy.py | source:trunk@4057 | --trilinos | 2 | 49 | 53 | 265860 | |||
attachment:anisotropy.py | source:trunk@4057 | --trilinos | 4 | 35 | 38 | 187092 |
Coupled Efficiency
An efficiency comparison between source:trunk@4055 and source:branches/coupled@4055 on bunter.
Notes:
- Running 1000x1000 for CH used about 1.1 GB for both trunk and branches/coupled
- --trilinos is heinously slow for CH.
- RST wiki formatting doesn't seem to work very well at least with the examples provided.
- branches/coupled appears to be no slower nor appears to use more memory than trunk.
GPUArray considered harmful
Back in September, everyone decided that introducing FiPy to CUDA in some facet was a good idea. The particular introduction we decided on was in the form of middleware which would slide between NumPy and our own numerix.py. This proposed middleware would shift some (or all) arrays from the system memory onto the GPU memory. That way, with the arrays on the GPU, we could perform all computations there, reaping the gleaming wonders of the all-powerful graphics processing unit. Silver bullet, right?
No, not exactly.
I'm no longer convinced that this approach will gain us anything; that is to say, I am no longer convinced that indiscriminately chucking NumPy arrays at the GPU is going to yield any performance gain for FiPy. I come to this conclusion after profiling a few varied FiPy runs in their entirety and trying, in vain, to implement the middleware described above.
Profiling
I should have done this long ago, before we even talked about possible interactions between FiPy and CUDA, just to see where performance gains are actually going to benefit us. Finally, after a few weeks of struggling with the proposed GPUArray module, I sat down and profiled a few FiPy runs to see where the cycles are being spent. After all, if we can't find a significant amount of time that's being spent doing arithmetic operations on NumPy arrays, there is no point in putting arrays on the GPU before the solver.
3D heat diffusion (1003)
I profiled the problem found here with the timing instrumentation removed. The results of the profiling were run through gprof2dot to produce this figure. The figure is a tree where each node is a function, the percentage below it is the percentage time spent in the function and all sub-functions, the parenthesized percentage is the time spent in the function itself, and all children of that node are sub-functions called within that function.
In this case, the only nodes displayed are for functions which consume 5% of runtime or more. These are the only components of a run that we should consider optimizing. Notice that the only array-specific functions found here are numeric:201:asarray and ~:0:<numpy.core.multiarray.array>. Both of these functions would take much longer when using a GPU (based on a strong hunch). Note also that we can't see any arithmetic operations in this graph. For this particular problem, it is clear that arithmetic operations on arrays are not the bottleneck, and thus using a GPU pre-solution makes absolutely no sense.
To get a better idea of which array functions were consuming time, I decided to generate another figure where each node consumes a minimum of 1% total runtime. That figure shows array allocation dominating array-specific operations, with ~:0:<numpy.core.multiarray.array> responsible for 18.6% of runtime.
No arithmetic operations are visible until we reduce the minimum percentage of runtime consumed by any node to 0.5%. That figure finally shows numerix:624:sin weighing in at 0.73%. Another array operation which appears in this graph is core:3914:reshape, the reshaping of an array, which comes in around 0.85%. Think of how this operation's time consumption would balloon if we were on a GPU.
The conclusion is clear: we should not be using a GPU for pre-solution array corralling on this run or anything like it.
Wheeler's Reactive Wetting
After profiling these toy runs, I told Wheeler about the results. He and I agreed that before making any final conclusions, we should profile a more realistic problem. We chose his reactive wetting problem. He profiled, sent me the pstats file, and I generated [http://matforge.org/fipy/export/3926/sandbox/gpuarray/wettingProf.simple.pdf this figure]. The minimum runtime consumption here is 5%.
Note that most array-relevant operations (aside from ~:0:<numpy.core.multiarray.array>) are found under inline:14:_optionalInline. Here we see that numerix:941:take is a good subtree to examine, as it is responsible for ~14% of total runtime.
The only arithmetic operation in this mess is core:3164:__mul__, boasting 5.50%. "Well, finally," you say to yourself, "there's something that will benefit from residence on the GPU." Not so fast. Only 0.01% of that 5.50% is spent in __mul__ itself. The overwhelming majority of time is spent in its child, ~:0:<numpy.core.multiarray.where>, which is presumably fiddling around in memory: yet another operation that will balloon with GPU usage.
Let's work on PyKrylov
After these damning results, it's evident to me that using Klockner's PyCuda within FiPy is a really bad idea. When using PyCuda's drop-in replacement (ahem) for numpy.ndarray, any arithmetic operation results in an intermediate copy of the gpuarray since there is no clever mechanism for lazily evaluating curried-kernel strings, or whatever.
Furthermore, I think that any solution proposing to be as convenient as Klockner's gpuarray is snake-oil unless some very clever mechanisms for incremental kernel accumulation and/or lazy evaluation enter the picture. The only way we can juice the GPU is by writing a few custom kernels at performance critical junctures: that way, we're not tap-dancing all over the GPU's limited and relatively slow memory by creating transient copies at every step of an arbitrary computation.
For these reasons, I think we (or I) should work on writing a few CUDA kernels for use by the non-Charlatan regions of PyCuda within PyKrylov. This implies work on the PyKrylov branch, which I am more than happy to undertake.
Reconciling the new boundary conditions with the old
As explained in this ticket, boundary conditions have now been updated to use constraints and the FixedValue/FixedFlux boundary conditions are no longer required. As a check, we ran the source:branches/version-2_1@3873 examples against source:trunk@3873 and the convection examples threw a number of errors. On the surface, this is reasonable as the new (natural) boundary condition is in/outflow and hence if one uses a FixedValue boundary condition, one is also getting an in/outflow boundary condition.
Source
Version 2.1 for this example didn't work against trunk.
We noticed that this example had an added outflow boundary condition see line 70 of source.py. This should be removed on trunk as trunk already included the outflow boundary condition. If one removes the extra RHSBC, it still passes the test on trunk. Basically, the right outflow is miniscule as the value of the variable on that edge is and hence it doesn't matter whether RHSBC is included or not. We now understand the right BC issue. Now to understand the left BC.
The version 2.1 example when run against trunk has a value very close to 2 instead of 1 on the left hand boundary. The FixedValue boundary condition adds the following to the matrix and RHS vector
where is the boundary value. In this case, it results in 1 being added to the RHS vector. The new constraint boundary conditions now add this boundary condition by default (regardless of whether has been set), thus, the RHS vector is getting a value of 2, and, hence, the left most value now has a value of 2 (the diagonal entry is 1 from the first interior face).
Note that the inflow/outflow boundary condition has the same discretization as the FixedValue value boundary condition. The only difference is that an inflow/outflow condition does not require the external face to be explicitly fixed to a value, but uses whatever value var.getFaceValue() provides. If the boundary value is set then this is the same as a FixedValue BC
Robin
This example doesn't work when run against trunk for the same reasons as above. I did notice that the FixedFlux boundary condition is only ever applied once no matter how many terms invoke it. This meant that it previously worked as expected, see source:branches/version-2_1/fipy/boundaryConditions/fixedFlux@3873#L79.
Not much point going on, I think I understand why the rest of the examples fail.