Recent posts (max 20) - Browse or Archive for more

New Policy for Commit Messages

From now on, the following codes should be used for commit messages. The commit message should have one of the following three letter codes at the beginning of the first line followed by a colon.

API: an (incompatible) API change
BLD: change related to building fipy
BUG: bug fix
DEP: deprecate something, or remove a deprecated object
DEV: development tool or utility
DOC: documentation
ENH: enhancement
MAINT: maintenance commit (refactoring, typos, etc.)
REV: revert an earlier commit
STY: style fix
TST: addition or modification of tests
REL: related to releasing fipy

These codes are the current Numpy standard (see

  • Posted: 2013-03-26 15:32 (Updated: 2013-03-29 16:29)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Commit and Ticket Update Automation

Jon has enabled the

Use it as follows:

In future, when we make a commit that responds to a bug, include "addresses ticket:1234" or "references ticket:1234".

When you merge a pull request, include "fixes ticket:1234" or "closes ticket:1234".

These will automatically add the commit message to the ticket and, if appropriate, close the ticket.

  • Posted: 2013-03-21 17:13
  • Author: wd15
  • Categories: (none)
  • Comments (2)


  • Posted: 2012-08-27 10:48 (Updated: 2012-08-27 15:08)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

How to represent a third order term?

A third order equation of the form:

$ \partial_t \phi = \partial_i \left( u_i \psi \right) $


$ \psi = \partial_j \left( \Gamma \partial_j \phi \right) $

I haven't tried this, but this should be possible with a coupled equation in a fully implicit manner. Here is an attempt. The sign on the diffusion term seems to make a large difference in the stability.

from fipy import Grid1D, CellVariable, TransientTerm, DiffusionTerm, Viewer
from fipy import UpwindConvectionTerm, ImplicitSourceTerm

m = Grid1D(nx=100, Lx=1.)

phi = CellVariable(mesh=m, hasOld=True, value=0.5, name=r'$\phi$')
psi = CellVariable(mesh=m, hasOld=True, name=r'$\psi$')

phi.constrain(0, m.facesLeft)
phi.constrain(1, m.facesRight)

psi.constrain(0, m.facesRight)
psi.constrain(0, m.facesLeft)

eqnphi = TransientTerm(1, var=phi) == UpwindConvectionTerm([[1]], var=psi)
eqnpsi = ImplicitSourceTerm(1, var=psi) == -DiffusionTerm(1, var=phi)

eqn = eqnphi & eqnpsi

vi = Viewer((phi, psi))

for t in range(1000): 
    print phi


  • Posted: 2012-06-13 10:14
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Including the Transverse Waves in the Roe Convection Term

This is a continuation from blog:RoeConvectionTerm where the first order and second order correction are described fro the Roe convection term. These schemes have now been implemented in source:riemann/fipy/terms/[email protected]. The scheme gives identical results to the spinning cone and block in CLAWPACK, but The second order scheme has not been tested with multiple equations yet (the spinning cone/block example only has one field).

The scheme needs to be updated to include the transverse waves. I believe this would make it truly second order accurate. Unfortunately, I'm still a little shaky on the derivation for the scheme.

  • Posted: 2012-03-08 17:13
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Andrew Reeve's FiPy Linux Image

I have created a disk image that can be installed on an 8 GByte thumb drive. By installing this onto a usb thumb drive, and then booting this thumb drive through the computer's boot menu, you will be put into a fully functional linux (Arch Linux) system that contains FiPy, text editor, and associated python software (among other things). The 'wicd' program (type 'wicd-curses' in a termial window) on the thumb drive can be used to access a wireless network and the pacman package manager can be used to update/add software on the thumb drive. The image must written to the 8 Gbyte thumb drive using a low level copying program, such as 'dd' or 'rawrite'. Care should be taken to not copy over your hard drive! All data on your thumb drive will be overwritten when using these programs.

1) access your bootloaded when logging onto your computer by hitting the appropriate button when booting up. The button you need to hit to do this seems to vary with the make of the computer.

2) When you access the linux bootloaded, there are four options coresponding to different device names (sda, sdb, sdc, and sdd). Load the one that corresponds to the usb port on your computer. On the laptops I've used, 'sdb' has been to correct device name to use.

3) I think gmsh is also in the thumb drive,

4) mayavi2 has not been added to the thumbdrive, but can be added using the 'yaourt' package manages which automated building packages from source.

Let me know if people actually use these. I'd be willing to update the images one or twice a year if people find them useful.


  • Posted: 2012-03-06 09:53 (Updated: 2012-03-06 11:18)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Setting Up A Debug Environment

I've been trying to set up a debugable version of fipy in virtualenv for debugging a trilinos issue. Here are the steps:

  • Install the python-dbg package from the Debian repositories.
  • Use mkvirtualenv -p python-dbg debug to make the debug environment.
  • Install numpy with pip install, not debug.
  • Install swig in the standard way.
  • Here is the do-configure for trilinos
    [email protected]
    ${CMAKE} \
      -D Trilinos_ENABLE_PyTrilinos:BOOL=ON \
      -D Trilinos_ENABLE_TESTS:BOOL=ON \
      ${EXTRA_ARGS} \
  • Posted: 2011-11-03 15:26
  • Author: wd15
  • Categories: (none)
  • Comments (9)

Pyximport Weirdness

There seems to be an issue with using pyximport under certain conditions. Essentially, when pyximport is used here. When this loads from within the test environment it doesn't build on benson or on Jon's laptop. It does work on sandbox, which is using python 2.7 (rather than 2.5) and cython 0.15 (same as bunter). I'm not sure about Jon's configuration.

The error is that the compiler is trying to find the .c file in fipy/tools instead of the default .pyxblt/ directory. This manifests the following error:

gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/users/wd15/.virtualenvs/riemann/lib/python2.5/site-packages/numpy/core/include -I/users/wd15/.virtualenvs/riemann/include/python2.5 -c /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c -o /users/wd15/.pyxbld/temp.linux-i686-2.5/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.o
gcc: /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c: No such file or directory
gcc: no input files

I have sort of figured out where the issue is occurring, but it appears that magic is occurring that doesn't seem possible. Initially to debug I turned DEBUG to True in ~/.virtualenvs/riemann/lib/python2.5/site-packages/pyximport/ That lead me to see what the problem is. If one looks in ~/.virtualenvs/riemann/lib/python2.5/site-packages/pyximport/ one will see the following method

def get_distutils_extension(modname, pyxfilename):
#    try:
#        import hashlib
#    except ImportError:
#        import md5 as hashlib
#    extra = "_" + hashlib.md5(open(pyxfilename).read()).hexdigest()  
#    modname = modname + extra
    extension_mod,setup_args = handle_special_build(modname, pyxfilename)
    print 'extension_mod',extension_mod
    print 'pyxfilename',pyxfilename
    print 'modname',modname
    if not extension_mod:
        from distutils.extension import Extension
        extension_mod = Extension(name = modname, sources=[pyxfilename])
        import distutils.extension
        print 'distutils.extension.__file__',distutils.extension.__file__
        print 'extension_mod.sources',extension_mod.sources
        print 'pyxfilename',pyxfilename
##        extension_mod.sources = [pyxfilename]
        print 'extension_mod.sources',extension_mod.sources
        print type(pyxfilename)
    return extension_mod,setup_args

When pyximport runs through this it gives the following output

extension_mod None
pyxfilename /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx
in extension
self.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx']
self.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx']
distutils.extension.__file__ /users/wd15/Documents/python/fipy/riemann/tmp/distutils/
extension_mod.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c']
pyxfilename /users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.pyx
extension_mod.sources ['/users/wd15/Documents/python/fipy/riemann/fipy/tools/smallMatrixVectorOpsExt.c']
<type 'str'>

What's weird here is that the pyxfilename passed to Extension has had it's extension changed from .pyx to .c. Now, this doesn't occur on sandbox, the file name never changes and everything runs smoothly. On bunter, if I explicitly set extension_mod.sources as I do in the commented line, then everything works. So what hocus-pocus is occurring in the Extension class to change .pyx to .c. Nothing! Absolutely nothing. Extension is just a stand alone class (no parents) and all it does it set self.sources = sources. In fact it I do print self.sources on the last line of Extension's __init__ it still has the .pyx file extension, but reverts to the .c file extension as soon as it returns to What could possibly be occurring? Extension has no other methods. pyxfilename is a string. This seems impossible.

I finally figured out what was going on. When the tests are running, a different version of from distutils.extension import Extension is being imported. The version the tests are importing is in the virtualenv. When I run the script outside of the tests then it doesn't use the virtualenv version of distribute. The version in the virtualenv was dropping into the following:

    from Pyrex.Distutils.build_ext import build_ext
except ImportError:
    have_pyrex = False
    have_pyrex = True

class Extension(_Extension):
    """Extension that uses '.c' files in place of '.pyx' files"""

    if not have_pyrex:
        # convert .pyx extensions to .c 
        def __init__(self,*args,**kw):
            sources = []
            for s in self.sources:
                if s.endswith('.pyx'):
            self.sources = sources

The code was changing all the .pyx extensions to .c. This is in distribute version 0.6.10. Updating to distribute version 0.6.24 fixed this as that snippet of code has been updated. Installing distribute was non-trivial as simply doing {{{pip install distribute --upgrade}}} broke everything because it wanted to install to the system directories instead of the virtualenv. It was unable to roll back breaking the virtualenv entirely. For some reason the --enivonment=$VIRTUAL_ENV flag was needed to avoid this.

  • Posted: 2011-08-24 11:45 (Updated: 2012-02-14 10:45)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Roe Convection Term

As a relatively painless entry to the implementation of higher order Riemann solvers in FiPy, I'm planning on implementing a basic Roe solver. Given a basic vector advection equation,

$  \partial_t \phi^i + \partial_j \left( u^{ji}_{k} \phi^k \right) = 0$

flux updates for a Roe solver can be written,

$  F^i = \frac{1}{2} \left[ n_j u^{ji-}_k \phi^{k-} + n_j u^{ji+}_k \phi^{k+} \right] - \frac{1}{2} n_j |\hat{A}^{ji}_k| \left( \phi^{k+} - \phi^{k-} \right) $

where the flux is across a face between cell $C^-$ and cell $C^+$ and the normal $n_j$ points from $C^-$ to $C^+$. The question when implementing Roe solvers is how to approximate $\hat{A}^{ji}_k$ from the local Riemann problem

$ \partial_t \phi^i + A^{ji}_{k} \partial_j \phi^k = 0 $


$ A^{ji}_{k} = \frac{ \partial \left( u^{ji}_l \phi^l \right) }{ \partial \phi_k } $

It is generally impossible to obtain an exact calculation for $A_{jik}$ at the face. Strictly speaking, a Roe solver involves quite a complicated method to obtain an approximation for $A_{jik}$, $\hat{A}_{jik}$. For the time being, in order to test the constant coefficient problems we will use a simple calculation for $\hat{A}_{jik}$, namely,

$ \hat{A}^{ji}_k = \frac{ u^{ji+}_l \phi^{l+} - u^{ji-}_l \phi^{l-} }{ \phi_{k}^+ - \phi_{k}^- } $

From $ \hat{A}^{ji}_k $, we can obtain

$ |\hat{A}^{ji}_k| = R^{ji}_l |\Gamma^{jl}_m| \left( R^{-1} \right)^{jm}_k $

where $R^{ji}_{l}$ is the matrix of right eigenvectors and $|\Gamma^{jl}_m|$ is the matrix of the absolute values of the eigenvalues along the diagonal.

I can pretty much see how to implement this algorithm in fipy without too many issues. I can see that getting $|\hat{A}^{ji}_k|$ from $\hat{A}^{ji}_k$ is not going to be easy, since an eigenvector/eigenvalue calculation is required for every face. At the moment the only way I see to do this is actually calculating $R$ and $\Gamma$ and reconstructing. Maybe there is a clever way to do this???

Higher Order Correction

Here we make the scheme second order on a regular grid. The correccted flux is given by,

$  F_{O\left(2\right)}^i = F^i + \bar{F}^i $


$ \bar{F}^i = n_j \left[ \left( M\left(\theta^{ji} \right) B^{ji}_l \right) \left( R^{-1} \right)^{jl}_k \right]  \left(\phi^{k+} - \phi^{k-} \right)$

$M$ is given by

$ M\left( \theta \right) = \max \left(0, \min \left( \left( 1 + \theta \right) / 2, 2, 2 \theta \right) \right) $

and $B$ is given by

$ B^{ji}_l = \frac{1}{2} |\Gamma^{ji}_l| \left(1 - \frac{\delta t}{\Delta x S} |\Gamma^{il}_l| \right) $

where $\Delta x$ is the cell to cell distance and $S$ is the face area. $\theta$ is given by,

$ \theta^{ji} = \frac{\alpha^{ji+-}}{\alpha^{ji}} $


$ \alpha^{ji+-} =  \left[\Gamma^{ji}_k > 0 \right] \left(\alpha^{jk-} - \alpha^{jk+} \right) + \alpha^{ji+} $

The $\alpha$'s are given by,

$\alpha^{ji} = \left( R^{-1} \right)^{ji}_k \left(\phi^{k+} - \phi^{k-} \right) $

$\alpha^{ji+} = \left( R^{-1} \right)^{ji}_k \left(\phi^{k++} - \phi^{k+} \right) $

$\alpha^{ji-} = \left( R^{-1} \right)^{ji}_k \left(\phi^{k-} - \phi^{k--} \right) $

  • Posted: 2011-06-27 17:34 (Updated: 2012-03-08 15:09)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Vector Diffusion Terms

We are now planning to implement equations of the form,

$ \partial_t q_i + \partial_j \left(F_{ji} \left( \vec{q} \right) \right)  = 0$

This has already been done partially in source:branches/vectorEquations. The convection term looks something like this

$ \partial_t q_i + \partial_j \left( u_{jik} q_k \right) = 0$

The coefficient $u_{jik}$ is a rank 3 tensor in this case. The first index is always over the spatial range. The next two indices refer to the q's. We now have to address diffusion terms. The trouble is that we already have effective rank 3 coefficients for higher order diffusion terms so it's all getting a bit complicated, but it is still manageable.

For higher order terms, we generally use a list or tuple and don't embed the index as part of the tensor so I think that is automatically dealt with. So higher order terms should always be indexed with a tuple or list. That is our current policy and it makes sense. A second order diffusion term will now be of the following form,

$ \partial_t q_i = \partial_k \left( \Gamma_{klij} \partial_l q_j \right) $

Of course the rank of $\Gamma$ can vary depending on whether anisotropy is required. This requires some interpretation. If $q$ is rank 0 then it will be assumed that all the indices of $\Gamma$ are spatial. If q is rank 1 then it will be assumed that the last two indices are over q. For example if the variable is $q_i$ and the coefficient is $\Gamma_{kij}$ then it's assumed that $\Gamma$ has a vector coefficient. If the variable was instead just $q$ then FiPy should throw an error as the coefficient has too many indices. Similarly, for a non-anisotropic vector equation the coefficient would simply be $\Gamma_{ij}$. The meaning of the coefficient indices changes based on context.

  • Posted: 2011-05-24 12:35 (Updated: 2011-05-24 12:36)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

More mesh refactoring!?

Don't worry: it's almost over.

Previous mesh hierarchy




I'm stripping out geometry and topology


  • Their inheritance trees seem to inevitably mimic those of the meshes.
  • We don't get any code reuse out of their existence, despite what I thought at their creation.

  • A huge amount of boilerplate code is devoted to passing arguments to, instantiating, and creating interfaces to geometry/topology objects that is otherwise unnecessary.
  • While geometry/topology objects do provide an added amount of orthogonality to the meshes, this benefit is negligible since it is highly likely that no other class hierarchy aside from meshes will use geometry/topology objects.

I was convinced of the harmful effects of geom/top classes when I went to expand the topology classes. In adding a few extra methods to topology, I found that I would have to nearly fully replicate the class tree of the meshes (instead of having just five topology classes) and the argument list for topology classes would not only vary from class to class, but would become unreasonably long.

Since the (lengthy) argument list for all geometry objects already varied from class to class, and thus was slightly ridiculous, I decided to think about reuniting meshes and geometry/topology. Where there is no uniform interface, no good code-reuse, and heavy coupling, merging classes is likely a good course of action.

Hard lesson learned: never refactor unless it is going to prevent duplication.

The new arrangement

I spent today merging geometries and topologies back into meshes. The new mesh hierarchy is displayed in the following UML diagram.


Grid Builders

New features

  • Uniform and irregular meshes are now siblings. This results in a less confusing inheritance scheme. Gridlike?D objects are aggregated under (Uniform)Grids to avoid duplication without the use of multiple inheritance. Gridlikes are never instantiated, but their contents are almost entirely static.
  • GridBuilders are being used for the construction of grids. They warrant their own class hierarchy because the process for grid construction is largely dimensionally-independent. After the class-travaganza of geometry and topology, I can understand if this addition is met with suspicion. However, I think the reasons above are good ones.
  • Interface for all meshes is defined in AbstractMesh, which should provide some clarification.
  • Reduced NLOC after deleting the cruft classes.
  • Posted: 2011-03-27 00:49 (Updated: 2011-03-27 01:00)
  • Author: obeirne
  • Categories: (none)
  • Comments (0)

explicit terms and residuals

What to do in _prepareLinearSystem() for different values of var?


Term()errormatrix for A
Term(var=A)matrix for Amatrix for A
Term(var=B)matrix for B0
BinaryTerm(var=())errormatrix for A
BinaryTerm(var=(A,))matrix for Amatrix for A
BinaryTerm(var=(A, B))errormatrix for A
BinaryTerm(var=(B,))matrix for B0
BinaryTerm(var=(B, C))error0
CoupledBinaryTerm(var=(A, B))matrix for (A, B)error
CoupledBinaryTerm(var=(B, C))matrix for (B, C)error


Term()errormatrix for A
Term(var=A)matrix for Amatrix for A
Term(var=B)matrix for B0
BinaryTerm(var=())errormatrix for A
BinaryTerm(var=(A,))matrix for Amatrix for A
BinaryTerm(var=(A, B))residual for A and Bmatrix for A + residual for B
BinaryTerm(var=(B,))matrix for Bresidual for B
BinaryTerm(var=(B, C))residual for B and Cresidual for B and C
CoupledBinaryTerm(var=(A, B))matrix for (A, B)error
CoupledBinaryTerm(var=(B, C))matrix for (B, C)error

Note: _buildMatrix() still works as it did before, i.e., it must always be passed a solution variable and it build the matrix for that solution variable.

  • Posted: 2011-03-10 22:30 (Updated: 2011-03-11 10:00)
  • Author: guyer
  • Categories: (none)
  • Comments (0)

Looking at Cython

Cython is widely accessible

  • Cython is bundled with Enthought Python Distribution, SAGE, and PythonXY
  • Cython only requires Python and GCC, so it's installable basically anywhere
  • Cython is easy_installable.
  • mpi4py is already written in Cython.

In short: I'll bet dollars to donuts that anywhere FiPy is installable, Cython is too.

Building Cython code

  1. Write up a module foo.pyx in Cython.
  2. Write a which lists foo.pyx as an extension and specifies a cmdclass entry to compile it. Let's call that entry "build_ext"
  3. Run
    python build_ext --inplace
    This compiles the foo module from Cython down to C.
  4. Use foo as you would a Python module, despite the fact that it's (supposedly well-optimized) C code.

More details are available here.

Cython integrates with NumPy

See here. Further details are available in Seljebotn's paper.

An interesting excerpt from the paper:

For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.

Most arithmetic algorithms in FiPy are specified in terms of NumPy operations, so Cython's use to us may be questionable. In a direct translation from a NumPy-operation-based routine to typed Cython code (using cdef np.ndarray), I saw no speedup.

Return of the GPU

Spoiler alert: GPUArray is still snake oil

We've spoken recently of whether or not our decision to ditch GPUArray forever was ill-conceived. Certainly, my previous blog post on the topic was, at best, incomplete in that I never gave a conclusive explanation of what time is being spent where. This was a result of me using fancy dot-graph visualizations instead of sticking to that natural adversary of bullshit: raw text.

In the previous post, I never examined what Guyer refers to as terminal procedures. More importantly, though, I never presented a list of procedures sorted by time spent in each procedure itself (excluding calls made to other procedures within its body). In other words, I never presented a list of procedures sorted by what Python's Pstats refers to as total time.

In this post, I will explore whether or not arithmetic operations ex solver are taking a considerable amount of time. When I conclude that they are, I will explore whether or not the use of a temporary GPUArray for some arithmetic operations may alleviate this bottleneck. I will conclude that it won't, since GPUArray initialization and communication are stalls which cause a net gain in runtime, even considering the much better performance of arithmetic operations on the GPU.

Profiling FiPy

In order to determine if, in fact, we are spending an amount of time on ex solver arithmetic operations that would merit speeding them up somehow, I profiled a phase anisotropy run, available here.

I ran the simulation for 10 steps with dx and dy both at 1000. I then obtained profiling output with this script

import pstats

def printProfileInfo(files):
    for filename in files:
        p = pstats.Stats(filename)

        # Sort by time in self

if __name__ == '__main__':
    import sys

as generated by this pstats file. The output is here.

Let's walk through line-by-line.

  • pysparse.itsolvers.pcg: solver; of no concern right now.
  • {method 'sum' ...}: helllooooo arithmetic.
  •<lambda>): straight from The Good Book,

return self._BinaryOperatorVariable(lambda a,b: a*b, other)

Obviously some steamy arithmetic.

  • {method 'take' ...}: I'd guess the reason this guy is so costly is memory access. Just a guess, though.
  • helllooooo arithmetic.
  • a whole mess of arithmetic.
def _calcValuePy(self, alpha, id1, id2):
    cell1 = numerix.take(self.var,id1, axis=-1)
    cell2 = numerix.take(self.var,id2, axis=-1)
    value = ((cell2 - cell1) * alpha + cell1)
    eps = 1e-20
    value = (value == 0.) * eps + (value != 0.) * value
    cell1Xcell2 = cell1 * cell2
    value = ((value > eps) | (value < -eps)) * cell1Xcell2 / value
    value = (cell1Xcell2 >= 0.) * value

    return value

Okay, so the point is clear: arithmetic out the wazoo. That numerical analysis software is stunted by calculation shouldn't come as a surprise.

Can the GPU help us?

For the sake of simplicity and to serve as a proof of concept, from here on out I'll deal exclusively with sum. If we can make sum into less of a problem with the help of Dr. Nvidia, then certainly we can rid other arithmetic aches similarly.

As I was shuffling around $FIPYROOT/sandbox to prepare some GPU-sum benchmarks, I was pleasantly surprised to find that I'd already been through this process, I just never had the good sense to document it.

Allow me to introduce you to, available here.

#!/usr/bin/env python

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpua
import random

Using `time`:

          real   3m55.157s
          user   3m52.720s
          sys    0m1.530s

        where numLoops is 200.

          real    3m52.877s
          user    3m51.680s
          sys     0m0.590s

        where numLoops is 200.

Using cProfile/pstats:

Tue Nov  2 20:31:07 2010

         800042925 function calls (800042000 primitive calls) in 372.578 CPU seconds

   Ordered by: internal time
   List reduced from 596 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
400000000  163.425    0.000  194.783    0.000
        1  142.234  142.234  372.452  372.452
      402   33.223    0.083   33.223    0.083 {numpy.core.multiarray.array}
400000006   31.358    0.000   31.358    0.000 {method 'random' of '_random.Random' objects}
      400    1.667    0.004    1.667    0.004
      400    0.274    0.001    0.397    0.001
     1400    0.171    0.000    0.189    0.000
        1    0.036    0.036    0.036    0.036
        1    0.023    0.023  372.581  372.581<module>)
     1199    0.020    0.000    0.025    0.000
      400    0.012    0.000    0.100    0.000
      799    0.011    0.000    0.025    0.000
     1404    0.008    0.000    0.016    0.000
      400    0.008    0.000    1.794    0.004
     1400    0.007    0.000    0.018    0.000
        1    0.006    0.006    0.006    0.006<module>)
        2    0.006    0.003    0.012    0.006<module>)
     1199    0.005    0.000    0.005    0.000 {pycuda._pvt_struct.pack}
      235    0.005    0.000    0.006    0.000
      135    0.004    0.000    0.005    0.000


Tue Nov  2 20:38:08 2010

         800027142 function calls (800026254 primitive calls) in 373.266 CPU seconds

   Ordered by: internal time
   List reduced from 453 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
400000000  166.126    0.000  197.885    0.000
        1  140.405  140.405  373.137  373.137
      402   33.300    0.083   33.300    0.083 {numpy.core.multiarray.array}
400000000   31.759    0.000   31.759    0.000 {method 'random' of '_random.Random' objects}
      400    1.541    0.004    1.541    0.004 {method 'sum' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034
        1    0.024    0.024  373.266  373.266<module>)
        1    0.005    0.005    0.005    0.005<module>)
        2    0.005    0.003    0.012    0.006<module>)
      235    0.005    0.000    0.006    0.000
     2594    0.003    0.000    0.003    0.000 {isinstance}
      400    0.003    0.000    1.547    0.004
        6    0.003    0.001    0.004    0.001
      127    0.003    0.000    0.004    0.000
   124/10    0.003    0.000    0.007    0.001
        1    0.002    0.002    0.044    0.044<module>)
        3    0.002    0.001    0.019    0.006<module>)
   287/10    0.002    0.000    0.008    0.001
        1    0.002    0.002    0.002    0.002
        1    0.002    0.002    0.002    0.002<module>)

def compare(numLoops, testing="gpu", arrLen=1000*1000):

        accum = 0

        for i in xrange(numLoops):
                # randomized to avoid memoization or other caching
                rand  = [random.uniform(0., 1000.) for j in xrange(arrLen)]
                rand2 = [random.uniform(0., 1000.) for j in xrange(arrLen)]
                if testing == "gpu":
                        a = gpua.to_gpu(numpy.array(rand))
                        b = gpua.to_gpu(numpy.array(rand2))
                        c = gpua.sum(a) + gpua.sum(b)
                        a = numpy.array(rand)
                        b = numpy.array(rand2)
                        c = numpy.sum(a) + numpy.sum(b)
                accum += c

                if i % 10 == 0:
                        print i

        print accum

if __name__ == '__main__':
        import sys
        compare(int(sys.argv[1]), sys.argv[2])     

A few big points here. First, the overall time for

time python 200 gpu 

takes longer than

time python 200 numpy 

Not what we expected, huh?

Examining the profiling results in the above docstring tells us that when we go to GPU, Numpy's

400000000   31.759    0.000   31.759    0.000 {method 'random' of '_random.Random' objects}
      400    1.541    0.004    1.541    0.004 {method 'sum' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034

balloons into

400000006   31.358    0.000   31.358    0.000 {method 'random' of '_random.Random' objects}
      400    1.667    0.004    1.667    0.004
      400    0.274    0.001    0.397    0.001
     1400    0.171    0.000    0.189    0.000
        1    0.036    0.036    0.036    0.036

Keep this in mind.

The results in the module-level docstring above are why I hit the skids on GPUArray development last time, but I'll expound on them here.


"Trust, but verify" sez Reagan, and that's the way I was thinking when I rediscovered It made sense to me that GPUArray isn't a silver bullet, but if I can't quantify that hunch then it's not science, but useless superstition. So, I decided to have some fun with in the form of Let's see how:

#!/usr/bin/env python

import pstats

arraySizes = [10*10,

loopCounts = [10,

I decided that, while the default benchmark of a 1000*1000 element array run for 200 iterations through the body of was a fair model of FiPy usage, more data points are always useful.

So, I tested for all combinations of (arraySizes[i], loopCounts[j]) (how many is that, kids? Remember discrete math?).

I generated the Pstat profiles with the following function.

def runProfiling():
    from benchSum import compare
    from cProfile import runctx

    for loopNum in loopCounts:
        for aSize in arraySizes:
            print "Doing %d loops on arraySize: %d..." % (loopNum, aSize)

            print "gpu."
            runctx("compare(%d, 'gpu', %d)" % (loopNum, aSize),
                   globals(), locals(),
                   "gpu-sum-%d-%d.pstats" % (loopNum, aSize))

            print "numpy."
            runctx("compare(%d, 'numpy', %d)" % (loopNum, aSize),
                   globals(), locals(),
                   "numpy-sum-%d-%d.pstats" % (loopNum, aSize))

If you've forgotten how works, go refresh your memory.

From the famous benchSum docstring, it is apparent that there is a correspondence between (numpy.sum), and (gpuarray.sum, gpuarray.set, gpuarray.__init__). This is the case because not only must GPUArray instantiate an ndarray but, in order to perform the arithmetic operation, it must throw the array on the GPU: something that obviously doesn't happen with ndarray. If you don't believe me, go have a look and convince yourself now, because my conclusions all rest on this correspondence.

If I wanted to get nasty, I could include in the correspondence other calls that are made in the use of GPUArray that aren't in that of ndarray, namely gpuarray.to_gpu and gpuarray.splay, but those times are basically insignificant.

Anyway, I needed a way of ripping out individual method timing (remember, that's time spent in self, or total time) from the mass of Pstat profiles I'd accumulated.

def _getTimeForMethods(arraySize, loopCount, methodList, 
                       arrType="gpu", benchmark="sum"):

    from StringIO import StringIO
    import re

    totalTime = 0.

    for method in methodList:
        strStream = StringIO()
        name = "%s-%s-%d-%d.pstats" % (arrType,benchmark,loopCount,arraySize)
        p = pstats.Stats(name, stream = strStream)


        profString = strStream.getvalue()
        m ="\d+\s+(\d+\.\d+)", profString)

        totalTime += float(

    return totalTime 

A little hairy, but it gets the job done. The above, _getTimeForMethods prints the Pstats output to a string stream and then yanks out the total time with a regular expression. It does this for each method in methodList and accumulates a composite time, which is returned.

I put this function to work in the following Behemoth, which interprets and graphs the Pstats results.

def graph():
    from matplotlib import pylab as plt
    import subprocess as subp

    The methods which, according to the profiling results in the docstring of
    ``, are analogous between numpy.ndarray and gpuarray.

    We don't need to consider anything more than `sum` for numpy because numpy's
    set/init procedure is absorbed in the call to `array`, which both gpuarray
    and numpy do.
    numpyMethods = ["method 'sum'"]
    gpuMethods = ["\S*(set)",

    loopsConstantTimes = {"numpy": [],
                          "gpu": []}

    arrSizeConstantTimes = {"numpy": [],
                            "gpu": []}

    justSumArrSizeConstantTimesGPU = []

    for loops in [200]:
        for size in arraySizes:
            loopsConstantTimes["gpu"].append(_getTimeForMethods(size, loops,
    for loops in loopCounts:
        for size in [1000*1000]:
            arrSizeConstantTimes["gpu"].append(_getTimeForMethods(size, loops,
            justSumArrSizeConstantTimesGPU.append(_getTimeForMethods(size, loops,


    If you really wanna see the gory details, go to the file in


Okay, let's start off nice and light with a simple comparison between gpuarray.sum and ndarray.sum, then we'll dash any hope of dead-simple gpuarray usage.

Hell yeah! Look at that speed-up! Life must be a Singaporean paradise where I get paid to let other people's libraries halve my runtime!



There's the whole picture. Now we're factoring in the additional effects of GPUArray usage: communication and initialization.

Here, the array size is fixed at 1000*1000 elements. The horizontal axis indicates the size of the array that we're summing. The vertical axis is the time of sum for ndarray and {sum, set, __init__} for GPUArray. Again, remember that these sets of methods are analogous. You can't have GPUArray's hot sum without its homely friends set and __init__.

Here, the number of iterations is fixed at 200. The number of iterations that the sum-accumulation procedure does is indicated on the horizontal axis. The vertical axis is, again, timing.

It's clear that ndarray wins out, even at a large array size (2000*2000).

Things I could be screwing up

Considering that the bottleneck with the GPUArray is setting the array, it would seem possible that I'm doing it wrong. The GPUArray documentation will verify that isn't the case. You just can't get around that gpuarray.set call; it's inevitable and it's what sinks your battleship.

It's fundamentally intuitive that you can't avoid the cost of communicating with the card. It's a longer trip down the bus and, since there isn't any lazy-eval/currying going on, it's a trip that'll happen on every arithmetic operation. TANSTAAFL.

To be sure, let's take a look at pycuda.gpuarray.set, just to see if it's plausible that the contents are slowing us up.

def set(self, ary):
    assert ary.size == self.size
    assert ary.dtype == self.dtype
    if self.size:
        drv.memcpy_htod(self.gpudata, ary)

Bingo! The key is in drv.memcpy_htod(...). That's where the actual transfer of data takes place; it makes sense time is lost there.

Possible recourse?

Right below pycuda.gpuarray.set is

def set_async(self, ary, stream=None):
    assert ary.size == self.size
    assert ary.dtype == self.dtype
    if self.size:
        drv.memcpy_htod_async(self.gpudata, ary, stream)

Might this be faster? Dunno. If it were, wouldn't Klockner be quick to tell us about it in the documentation?


  • FiPy is bottlenecked largely by arithmetic operations.
  • GPUArray will not help us alleviate these bottlenecks by using it as a one-off Rainman every time we want to do a single arithmetic operation.
    • This is because communication is more expensive than the single arithmetic operation, even if the op is significantly faster on the GPU and even on arrays up to 2000*2000 in length.
  • The only appropriate use for PyCUDA is by writing a few kernels for arithmetic-critical junctures where we can amortize the cost of communication over the speedup of multi-step arithmetic on the GPU.
  • Posted: 2011-03-01 16:50 (Updated: 2011-03-02 22:50)
  • Author: obeirne
  • Categories: (none)
  • Comments (0)

Buildbot setup

First, obtain Buildbot on both master and slave.

master$ easy_install buildbot
slave$ easy_install buildbot-slave

Now, create the master.

master$ buildbot create-master master_dir

This will make a directory called master_dir and fill it with all sorts of goodies. In order to be operational, the master_dir requires a master.cfg file to be present within it. Luckily, Buildbot supplies a master.cfg.sample which can easily be tailored to fit our needs.

master$ cp master.cfg.sample master.cfg
master$ $EDITOR master.cfg

Now let's do the aforementioned tailoring. I'll step through configuration piece-by-piece.

c = BuildmasterConfig = {}

All that is required of each master.cfg is that it builds a dictionary called BuildmasterConfig (which we'll call c for brevity), which contains various configuration information for Buildbot.

branches = ['trunk', 

Because of Buildbot's architecture, we must specify (in some way) the branches that Buildbot should concern itself with; I define factories later on which generate the necessary Buildbot machinery for each branch specified here.

Instead of explicitly specifying the branches we want to monitor (which is usually all of them), it may be possible to poll SVN periodically and reconfigure Buildbot when a new branch is detected, but I haven't looked into that yet.

Specifying BuildSlaves

Now, we specify Buildslaves for c.

from buildbot.buildslave import BuildSlave
c['slaves'] = [BuildSlave("danke", "bitte"),
               BuildSlave("slug", "gross")]

This addition to c implies that we will have two BuildSlaves communicating with the master: one called danke (with a slave-password of bitte), and one called slug (with a slave-password of gross). We will configure the slaves (with their passwords) after we have configured the master.

It is important that these slave passwords remain relatively private, or else there are some security vulnerabilities that entail possible execution of arbitrary code on the master's host (alluded to here).

Then, pick a port over which the master and slave will communicate. This port must be publicly accessible on their respective networks. I chose 9989.

c['slavePortnum'] = 9989

Remember to make this port publicly accessible. This may necessitate forwarding ports.

Detecting source changes

Now we configure how the Buildmaster finds out about changes in the repository. There are a variety of ways the master can be configured to keep itself informed of changes, including polling the repository periodically, but I chose to manually inform the buildmaster of changes via an addition to SVN's post-commit hook which calls a buildbot/contrib script available here.

On the master.cfg side, we add the following:

from buildbot.changes.pb import PBChangeSource
c['change_source'] = PBChangeSource()

From here, I assume that the SVN repository is hosted on the same machine as the master. Digressing a moment from master.cfg, I navigate to the SVN repository's directory.

master$ cd /path/to/svn-repo
master$ $EDITOR hooks/post-commit

Once the editor is open, add the following to notify the buildbot master of changes upon commit.

# set up PYTHONPATH to contain Twisted/buildbot perhaps, if not already
# installed site-wide

/path/to/ --repository "$REPOS" --revision "$REV" \
--bbserver localhost --bbport 9989

If the master is hosted on a different machine than the repository, the bbserver flag above can be modified accordingly.

One caveat here is that we must modify the contrib/ script to propagate branch information on to Buildbot. Open

master$ $EDITOR /path/to/

and enact the following change.

# this should be around line 143
# split_file = split_file_dummy

split_file = split_file_branches

Now the SVN repo is set up to play nicely with the Buildbot installation and (hopefully) the master will be informed of each subsequent commit.


Back to master.cfg.

Next, we want to configure Schedulers. Whenever the master is informed of a change (via whatever object is held in c['change_source']) all Schedulers attached to the list held by c['schedulers'] will be informed of the change and, depending on the particular Scheduler in question, may or may not react by triggering a build or multiple builds.

Here, I'll configure a Scheduler for each branch to react to any change in that branch by running two builds.

from buildbot.scheduler import Scheduler
from buildbot.schedulers.filter import ChangeFilter

def buildSchedulerForBranch(branch):
    cf = ChangeFilter(branch=branch)
    return Scheduler(name="%s-scheduler" % branch,
                     builderNames=["full-build-%s" % branch,
                                   "smaller-build-%s" % branch])

c['schedulers'] = []

for branch in branches:

The ChangeFilter is used to discriminate branches; note that branch isn't the only criterion a ChangeFilter can consider. The parameter treeStableTimer is the length of time that Buildbot waits before starting the build process.

Note that this Scheduler references "full-build-[branch]" and "smaller-build-[branch]", which are Builders we'll define right now.


Here, we specify the build procedures. First, we will define a number of BuildFactorys, which detail generic steps that a given Build will take, and then we create the Builds themselves and attach them to to the BuildmasterConfig dictionary. There is some machinery defined up front for the sake of generating separate builds for each branch with as little duplication as possible, so bear with me.

def addCheckoutStep(factory, defaultBranch='trunk'):
    """Ensure that `baseURL` has a forward slash at the end."""
    baseURL = 'svn://' 
def testAllForSolver(buildFact, solverType, parallel=False):
    Add doctest steps to a build factory for a given solver type.

    descStr = "for %s" % solverType
    solverArg = "--%s" % solverType
    buildFact.addStep(Doctest(description="testing modules " + descStr,
                              extraArgs=[solverArg, "--modules"]))

    buildFact.addStep(Doctest(description="testing examples " + descStr,
                              extraArgs=[solverArg, "--examples"]))

Here, addCheckoutStep takes a BuildFactory and adds a step which checks out whichever branch has been modified, as specified by the Scheduler reporting the changes. It does so by appending the name of the branch, as reported by the provoking Scheduler, to the baseURL that we pass in.

The function testAllForSolver takes a BuildFactory and adds steps which test both modules and examples for a given solver. I should mention at this point that here I use Doctest, a class I had to add to Buildbot.

Getting doctests to work

Buildbot integrates nicely with Trial, the unit-test framework that comes bundled with Twisted. Trial, unfortunately, doesn't seem to pick up on doctests, though there are scattered allusions speaking to the contrary (here and here).

I tried a few approaches to getting Trial to process doctests (adding a __doctest__ module-level attribute, for one) to no avail. Even if there is some way to piggyback doctests onto Trial, it may involve a large modification of the FiPy source, which, if purely for the sake of making Buildbot happy, I don't think is worth it.

In light of this, I added an object that extends the class that handles Trial integration to handle doctests instead.

Within master.cfg, add

from buildbot.steps.python_twisted import Trial, TrialTestCaseCounter
from import ShellCommand
from twisted.python import log

class Doctest(Trial):
    Add support for Python's doctests.

    def __init__(self, description=None,

        ShellCommand.__init__(self, **kwargs)


        self.testpath = "."
        self.command = ["python", "", "test"]
        self.extraArgs = extraArgs

        self.logfiles = {}

        if description is not None:
            self.description = [description]
            self.descriptionDone = [description + " done"]
            self.description = ["testing"]
            self.descriptionDone = ["tests"]
        # this counter will feed Progress along the 'test cases' metric
        self.addLogObserver('stdio', TrialTestCaseCounter())
    def start(self):
        if self.extraArgs is not None:
            if type(self.extraArgs) is list:
                self.command += self.extraArgs
            elif type(self.extraArgs) is str:
        self._needToPullTestDotLog = False
        log.msg("Doctest.start: command is '%s' with description '%s'." \
                    % (self.command, self.description))

    def _gotTestDotLog(self, cmd):
        Trial._gotTestDotLog(self, cmd)

        # strip out "tests" from the beginning 
        self.text[0] = self.description[0] + " " \
                        + " ".join(self.text[0].split(" ")[1:])

This allows us to run FiPy's test suite completely unmodified and have Buildbot interpret the results just as it would for Twisted's Trial.

Back to Builders

We never completed setting the Builders up, so let's do that now.

Let's first define a few Factory objects.

from buildbot.process import factory
from buildbot.steps.source import SVN

Run a mostly-complete test-suite.
fullFactory = factory.BuildFactory()

testAllForSolver(fullFactory, "pysparse")
testAllForSolver(fullFactory, "trilinos")

Run a less-intensive test-suite.
smallerFactory = factory.BuildFactory()

                               description="testing modules"))

Now we will define and attach the Builders themselves.

c['builders'] = []

def makeBuildersForBranch(branch):
    builders = []

    builders.append({"name": "full-build-%s" % branch,
                     "slavename": "slug",
                     "builddir": "%s-slug" % branch,
                     "factory": fullFactory})

    builders.append({"name": "smaller-build-%s" % branch,
                     "slavename": "danke",
                     "builddir": "%s-danke" % branch,
                     "factory": smallerFactory})

    return builders

for branch in branches:
    for builder in makeBuildersForBranch(branch):

The rest

Everything left in master.cfg concerns itself with how build information is reported back to us. Though there are very many possibilities open to us (e-mail reporting to blamed developers upon broken builds, IRC bots, Skynet, etc.), I haven't explored any of them other than the default web interface that Buildbot provides. With that in mind, here's the rest of master.cfg, basically unadulterated from the vanilla sample config.


# 'status' is a list of Status Targets. The results of each build will be
# pushed to these targets. buildbot/status/*.py has a variety to choose from,
# including web pages, email senders, and IRC bots.

c['status'] = []   

from buildbot.status import html

# from buildbot.status import mail
# c['status'].append(mail.MailNotifier(fromaddr="[email protected]",
#                                      extraRecipients=["[email protected]"],
#                                      sendToInterestedUsers=False))
# from buildbot.status import words
# c['status'].append(words.IRC(host="", nick="bb",
#                              channels=["#example"]))
# from buildbot.status import client
# c['status'].append(client.PBListener(9988)) 

c['projectName'] = "FiPy"
c['projectURL'] = ""
c['buildbotURL'] = "http://localhost:8010/"      

Configuring slaves

Configuring slaves is very simple.

danke$ buildslave create-slave slavedir danke bitte
danke$ $EDITOR slavedir/info/admin
danke$ $EDITOR slavedir/info/host

slug$ buildslave create-slave slavedir slug gross 
slug$ $EDITOR slavedir/info/admin
slug$ $EDITOR slavedir/info/host

Note that slug is both the master and a slave.

The form of the create-slave command is buildslave create-slave [directory for builds] [master's host address]:[port for communication] [slave-name] [slave-password].

So long as the port 9989 is open on both the slaves and master, we're done.

In summary

Buildbot is going to be an asset to us, though I still haven't contacted Doug as to how we're going to get it up and running.

In case I've screwed anything up in translating my config file to this blog post, I've uploaded the actual file to sandbox and it is available here.

  • Posted: 2011-02-24 17:37 (Updated: 2011-03-08 12:25)
  • Author: obeirne
  • Categories: (none)
  • Comments (0)

Efficiency changes due to properties

I check out trunk at revision [4101], before the property changes were introduced, as old_trunk and run test.

(misc)old_trunk/ time python test 2> /dev/null 
running test

real	2m57.447s
user	2m55.135s
sys	0m1.952s

I then run test on the latest revision of [email protected][4160], which includes all property implementations aside from those in fipy.terms.

(misc)trunk/ time python test 2> /dev/null 
running test

real	3m0.290s
user	2m57.987s
sys	0m2.000s

There's only a three second difference between the two; clearly not a big deal.

How about a non-trivial benchmark like phase anisotropy? For this, I copy-script'd the anisotropy example from examples.phase, then stripped out the viewing code and reduced the number of steps to one hundred.

I then ran it on [email protected][4101], without the properties:

(misc)old_trunk/ time python 

real	3m50.701s
user	2m50.679s
sys	0m59.748s

And then on the current [email protected][4160], with properties:

(misc)trunk/ time python 

real	3m49.818s
user	2m50.455s
sys	0m59.104s

Again, we see a negligible difference (3:50 vs. 3:49) between the two runtimes. These two trials have convinced me that the property refactoring hasn't added any considerable computational burden.

Benchmarking the Efficiency Branch

An efficiency comparison between source:[email protected] and source:branches/[email protected] on bunter.

Example Source Flags CPUs 9 steps (s) memory (kB)
1 2 3 4 5 source:branches/[email protected] --pysparse 1 25 25 29 29 333552 source:branches/[email protected] --trilinos 1 72 72 417776 source:branches/[email protected] --trilinos 2 48 48 278368 source:branches/[email protected] --trilinos 4 34 36 192424 source:[email protected] --pysparse 1 29 29 296020 source:[email protected] --trilinos 1 92 96 395408 source:[email protected] --trilinos 2 49 53 265860 source:[email protected] --trilinos 4 35 38 187092
  • Posted: 2010-12-29 17:07
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Coupled Efficiency

An efficiency comparison between source:[email protected] and source:branches/[email protected] on bunter.

Example Flags CPUs 9 steps (s) memory (kB)
1 2 3 4 5
source:branches/coupled/examples/phase/[email protected] --pysparse 1 29 29 322048
source:branches/coupled/examples/phase/[email protected] --trilinos 1 87 87 400576
source:branches/coupled/examples/phase/[email protected] --trilinos 2 52 54 49 55 56
source:branches/coupled/examples/phase/[email protected] --trilinos 4 34 34
source:trunk/examples/phase/[email protected] --pysparse 1 29 29 317972
source:trunk/examples/phase/[email protected] --trilinos 1 83 83 403896
source:trunk/examples/phase/[email protected] --trilinos 2 52 52
source:trunk/examples/phase/[email protected] --trilinos 4 34 33 34
source:branches/coupled/examples/cahnHilliard/[email protected] (500x500) --pysparse1 29 29 278124
source:branches/coupled/examples/cahnHilliard/[email protected] (200x200) --trilinos1 174172 138352
source:branches/coupled/examples/cahnHilliard/[email protected] (200x200) --trilinos2 139
source:branches/coupled/examples/cahnHilliard/[email protected] (200x200) --trilinos4 98
source:trunk/examples/cahnHilliard/[email protected] (500x500) --pysparse 1 30 29 29 126400
source:trunk/examples/cahnHilliard/[email protected] (200x200) --trilinos 1 173 138352
source:trunk/examples/cahnHilliard/[email protected] (200x200) --trilinos 2 140
source:trunk/examples/cahnHilliard/[email protected] (200x200) --trilinos 4 124


  • Running 1000x1000 for CH used about 1.1 GB for both trunk and branches/coupled
  • --trilinos is heinously slow for CH.
  • RST wiki formatting doesn't seem to work very well at least with the examples provided.
  • branches/coupled appears to be no slower nor appears to use more memory than trunk.
  • Posted: 2010-12-28 13:02
  • Author: wd15
  • Categories: (none)
  • Comments (0)

GPUArray considered harmful

Back in September, everyone decided that introducing FiPy to CUDA in some facet was a good idea. The particular introduction we decided on was in the form of middleware which would slide between NumPy and our own This proposed middleware would shift some (or all) arrays from the system memory onto the GPU memory. That way, with the arrays on the GPU, we could perform all computations there, reaping the gleaming wonders of the all-powerful graphics processing unit. Silver bullet, right?

No, not exactly.

I'm no longer convinced that this approach will gain us anything; that is to say, I am no longer convinced that indiscriminately chucking NumPy arrays at the GPU is going to yield any performance gain for FiPy. I come to this conclusion after profiling a few varied FiPy runs in their entirety and trying, in vain, to implement the middleware described above.


I should have done this long ago, before we even talked about possible interactions between FiPy and CUDA, just to see where performance gains are actually going to benefit us. Finally, after a few weeks of struggling with the proposed GPUArray module, I sat down and profiled a few FiPy runs to see where the cycles are being spent. After all, if we can't find a significant amount of time that's being spent doing arithmetic operations on NumPy arrays, there is no point in putting arrays on the GPU before the solver.

3D heat diffusion (1003)

I profiled the problem found here with the timing instrumentation removed. The results of the profiling were run through gprof2dot to produce this figure. The figure is a tree where each node is a function, the percentage below it is the percentage time spent in the function and all sub-functions, the parenthesized percentage is the time spent in the function itself, and all children of that node are sub-functions called within that function.

In this case, the only nodes displayed are for functions which consume 5% of runtime or more. These are the only components of a run that we should consider optimizing. Notice that the only array-specific functions found here are numeric:201:asarray and ~:0:<numpy.core.multiarray.array>. Both of these functions would take much longer when using a GPU (based on a strong hunch). Note also that we can't see any arithmetic operations in this graph. For this particular problem, it is clear that arithmetic operations on arrays are not the bottleneck, and thus using a GPU pre-solution makes absolutely no sense.

To get a better idea of which array functions were consuming time, I decided to generate another figure where each node consumes a minimum of 1% total runtime. That figure shows array allocation dominating array-specific operations, with ~:0:<numpy.core.multiarray.array> responsible for 18.6% of runtime.

No arithmetic operations are visible until we reduce the minimum percentage of runtime consumed by any node to 0.5%. That figure finally shows numerix:624:sin weighing in at 0.73%. Another array operation which appears in this graph is core:3914:reshape, the reshaping of an array, which comes in around 0.85%. Think of how this operation's time consumption would balloon if we were on a GPU.

The conclusion is clear: we should not be using a GPU for pre-solution array corralling on this run or anything like it.

Wheeler's Reactive Wetting

After profiling these toy runs, I told Wheeler about the results. He and I agreed that before making any final conclusions, we should profile a more realistic problem. We chose his reactive wetting problem. He profiled, sent me the pstats file, and I generated [ this figure]. The minimum runtime consumption here is 5%.

Note that most array-relevant operations (aside from ~:0:<numpy.core.multiarray.array>) are found under inline:14:_optionalInline. Here we see that numerix:941:take is a good subtree to examine, as it is responsible for ~14% of total runtime.

The only arithmetic operation in this mess is core:3164:__mul__, boasting 5.50%. "Well, finally," you say to yourself, "there's something that will benefit from residence on the GPU." Not so fast. Only 0.01% of that 5.50% is spent in __mul__ itself. The overwhelming majority of time is spent in its child, ~:0:<numpy.core.multiarray.where>, which is presumably fiddling around in memory: yet another operation that will balloon with GPU usage.

Let's work on PyKrylov

After these damning results, it's evident to me that using Klockner's PyCuda within FiPy is a really bad idea. When using PyCuda's drop-in replacement (ahem) for numpy.ndarray, any arithmetic operation results in an intermediate copy of the gpuarray since there is no clever mechanism for lazily evaluating curried-kernel strings, or whatever.

Furthermore, I think that any solution proposing to be as convenient as Klockner's gpuarray is snake-oil unless some very clever mechanisms for incremental kernel accumulation and/or lazy evaluation enter the picture. The only way we can juice the GPU is by writing a few custom kernels at performance critical junctures: that way, we're not tap-dancing all over the GPU's limited and relatively slow memory by creating transient copies at every step of an arbitrary computation.

For these reasons, I think we (or I) should work on writing a few CUDA kernels for use by the non-Charlatan regions of PyCuda within PyKrylov. This implies work on the PyKrylov branch, which I am more than happy to undertake.

Reconciling the new boundary conditions with the old

As explained in this ticket, boundary conditions have now been updated to use constraints and the FixedValue/FixedFlux boundary conditions are no longer required. As a check, we ran the source:branches/[email protected] examples against source:[email protected] and the convection examples threw a number of errors. On the surface, this is reasonable as the new (natural) boundary condition is in/outflow and hence if one uses a FixedValue boundary condition, one is also getting an in/outflow boundary condition.


Version 2.1 for this example didn't work against trunk.

We noticed that this example had an added outflow boundary condition see line 70 of This should be removed on trunk as trunk already included the outflow boundary condition. If one removes the extra RHSBC, it still passes the test on trunk. Basically, the right outflow is miniscule as the value of the variable on that edge is $e^{-10} \ll 1$ and hence it doesn't matter whether RHSBC is included or not. We now understand the right BC issue. Now to understand the left BC.

The version 2.1 example when run against trunk has a value very close to 2 instead of 1 on the left hand boundary. The FixedValue boundary condition adds the following to the matrix and RHS vector

$A_f \left( \vec{n} \cdot \vec{u} \right) \left(\alpha_f \phi_b + \left(1 - \alpha_f \right) \phi_P \right)$

where $\phi_b$ is the boundary value. In this case, it results in 1 being added to the RHS vector. The new constraint boundary conditions now add this boundary condition by default (regardless of whether $\phi_b$ has been set), thus, the RHS vector is getting a value of 2, and, hence, the left most value now has a value of 2 (the diagonal entry is 1 from the first interior face).

Note that the inflow/outflow boundary condition has the same discretization as the FixedValue value boundary condition. The only difference is that an inflow/outflow condition does not require the external face to be explicitly fixed to a value, but uses whatever value var.getFaceValue() provides. If the boundary value is set then this is the same as a FixedValue BC


This example doesn't work when run against trunk for the same reasons as above. I did notice that the FixedFlux boundary condition is only ever applied once no matter how many terms invoke it. This meant that it previously worked as expected, see source:branches/version-2_1/fipy/boundaryConditions/[email protected]#L79.

Not much point going on, I think I understand why the rest of the examples fail.

  • Posted: 2010-09-22 16:17 (Updated: 2010-09-23 15:31)
  • Author: wd15
  • Categories: (none)
  • Comments (0)