Fixing Bitten
Bitten is broken and I'm trying to fix it. Anyway, to cut a long story short, the slave checks out the code just fine but stops without running the tests:
[INFO ] A examples/elphf/generated/phaseDiffusion/binary.pdf
[INFO ] A examples/elphf/generated/phaseDiffusion/quaternary.png
[INFO ] A examples/elphf/generated/phaseDiffusion/ternaryAndElectrons.pdf
[INFO ] A examples/elphf/phaseDiffusion.py
[INFO ] U .
[INFO ] Checked out revision 3712.
[DEBUG ] svn exited with code 0
[INFO ] Build step checkout completed successfully
[DEBUG ] Sending POST request to 'http://matforge.org/fipy/builds/1414/steps/'
[DEBUG ] Server returned error 500: Internal Server Error (no message available)
[ERROR ] Exception raised processing step checkout. Reraising HTTP Error 500: Internal Server Error
[DEBUG ] Stopping keepalive thread
[DEBUG ] Keepalive thread exiting.
[DEBUG ] Keepalive thread stopped
[DEBUG ] Removing build directory /tmp/bittenSnQIvn/build_trunk_1414
[ERROR ] HTTP Error 500: Internal Server Error
[DEBUG ] Removing working directory /tmp/bittenSnQIvn
[INFO ] Slave exited at 2010-07-28 14:22:36
Observing the master's log file reveals the following error, which seems to occur at the same moment as the slave failure when the two logs are watched side by side with tail -f. Could this be connected?
2010-07-28 14:26:36,900 Trac[perm] WARNING: perm.permissions() is deprecated and is only present for HDF compatibility
2010-07-28 14:27:45,230 Trac[main] ERROR: 'time'
Traceback (most recent call last):
  File "/usr/local/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/web/main.py", line 423, in _dispatch_request
    dispatcher.dispatch(req)
  File "/usr/local/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/web/main.py", line 197, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/local/lib/python2.4/site-packages/Bitten-0.6dev_r562-py2.4.egg/bitten/master.py", line 93, in process_request
    return self._process_build_step(req, config, build)
  File "/usr/local/lib/python2.4/site-packages/Bitten-0.6dev_r562-py2.4.egg/bitten/master.py", line 229, in _process_build_step
    step.started = int(_parse_iso_datetime(elem.attr['time']))
  File "/usr/local/lib/python2.4/site-packages/Bitten-0.6dev_r562-py2.4.egg/bitten/util/xmlio.py", line 252, in __getitem__
    raise KeyError(name)
KeyError: 'time'
The times are out of whack because matforge is 5 minutes fast. Let me investigate further.
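Setting the clock skew aside, the KeyError itself says that the XML element the slave POSTs for the build step appears to be missing a time attribute when the master reads it. A minimal illustration of that failure mode, using plain ElementTree rather than Bitten's own xmlio wrapper (so this is not Bitten's actual code, just the same kind of lookup):

# Illustration only (not Bitten's code): a dictionary-style attribute lookup
# on an XML element with no 'time' attribute raises KeyError, which is what
# xmlio's __getitem__ does in the traceback above.
import xml.etree.ElementTree as ET

# hypothetical build-step payload with the 'time' attribute missing
elem = ET.fromstring('<step id="checkout" status="success"/>')

print(elem.attrib['status'])  # 'success'
print(elem.attrib['time'])    # raises KeyError: 'time'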
Analysis of parallel speedups
In the process of working with James I have tried to analyze some of the parallel runs more deeply. I should mention that http://www.mcs.anl.gov/~itf/dbpp is proving to be quite a useful text for understanding some of the concepts that I had overlooked. We can write an expression for the time for a given time step, based on various aspects of the parallel partitioning, something like
\[ T_i = a N_i + b O_i + c P + d \sum_{j=1}^{P} N_j + e \]

where N_i is the number of cells on node i including overlaps, O_i is the number of overlapping cells on node i and P is the total number of nodes. The terms in the equation represent, in order,

- a N_i: the local calculations (should be perfectly parallel in most of FiPy outside the solver),
- b O_i: the local processor to processor communication (questionable if this actually exists),
- c P: the global communication (probably more likely),
- d \sum_j N_j: calculations that are across the global cells (there should be none of this, very bad for scaling),
- e: a fixed penalty independent of the mesh or partitioning.
To look at the relative influence of each term, I did calculations for various grid sizes and numbers of nodes, recorded the times, and fit the data with a least squares fit, using the anisotropy problem. In the least squares fit each timing value is weighted equally and the fastest of 10 time steps is used for T_i. This is done because luggage has high variability, especially when egon is running OpenMP jobs.

Using the attached scripts I get the fitted parameters, along with the following plot. The problem with this fit is that two of the parameters come out negative (unphysical) and another is far too large. This is caused by the script not adjusting the number of crystals with the box size (comparatively less work in the solver per cell on the runs with fewer processors). As a quick fix we can assume the second and fourth terms are negligible and see how the fit looks. We get new fitted parameters and a new plot.
This needs to be rerun with updated timing values, with the number of crystals scaled up with the box size.
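For concreteness, here is a minimal sketch of how such a least-squares fit can be set up. It is not the attached script; all the numbers in it are hypothetical placeholders, and the real inputs are the fastest-of-10 step times recorded for each grid size and node count.

# A minimal sketch (not the attached scripts) of the least-squares fit for the
# timing model T_i = a*N_i + b*O_i + c*P + d*sum_j(N_j) + e.
# All numbers below are hypothetical placeholders.
import numpy as np

N_local   = np.array([10000., 5100., 2600., 40000., 20200., 10200.])   # cells on the node, incl. overlaps
N_overlap = np.array([    0.,  200.,  400.,     0.,   400.,   800.])   # overlapping cells on the node
P         = np.array([    1.,    2.,    4.,     1.,     2.,     4.])   # total number of nodes
N_total   = np.array([10000., 10000., 10000., 40000., 40000., 40000.]) # total cells over all nodes
T         = np.array([  10.5,   5.6,    3.1,   40.5,   20.7,   10.7])  # fastest of 10 time steps (s)

# each timing value weighted equally, as in the fit described above
A = np.column_stack([N_local, N_overlap, P, N_total, np.ones_like(T)])
(a, b, c, d, e), res, rank, sv = np.linalg.lstsq(A, T, rcond=None)
print(a, b, c, d, e)

# quick-fix variant: drop the O_i and sum_j(N_j) columns and refit
A2 = np.column_stack([N_local, P, np.ones_like(T)])
(a2, c2, e2), res2, rank2, sv2 = np.linalg.lstsq(A2, T, rcond=None)
print(a2, c2, e2)

Dropping columns from the design matrix is equivalent to assuming the corresponding terms are negligible, which is the quick fix described above.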
Testing Anisotropy Example on a 4000 x 4000 grid
I'm running the anisotropy example on 32 processors on luggage. The phase and theta images seem to make sense after 30 steps. It took 11593 seconds to reach this point.
I also ran a 200 x 200 grid with 1 crystal (which is roughly the area occupied by the crystals in the larger simulation). Running on 6 processors on poole, it takes about 734 s to do 690 time steps. After 690 time steps the area of the crystal is 1.05 (the area of the box is 25); the initial area of the crystal is 0.05. So, doing a few calculations:
>>> numpy.pi * (0.025 * 5.)**2
0.049087385212340517
>>> numpy.sqrt(0.05 / numpy.pi)
0.126156626101008
>>> numpy.sqrt(1.05 / numpy.pi)
0.57812228852811087
>>> (0.57812228852811087 - 0.126156626101008) / 690
0.00065502269916971432
and the rate of expansion is 0.00066 per time step. Now, for the crystals to touch and show some structure, each crystal must expand a distance of 2.5, which requires 2.5 / 0.00066 ~ 4000 time steps. The large 4000 x 4000 case will take ~18 days to get to a good point. We might not need the crystals to grow a full 2.5, but growth will probably slow down as the box fills with solid. We don't have ~18 days at this point, so on this evidence I'm going to scale back to 3000 x 3000 with 225 crystals. I think that will be pretty much guaranteed to show something good within 10 days.
Okay, the 3000 x 3000 case is taking about 100 s per time step, which works out to about 5 days for 4000 time steps. Much more reasonable.
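For the record, the back-of-the-envelope arithmetic behind those two runtime estimates, using only the numbers quoted above, is:

# rough runtime estimates from the numbers quoted above
steps = 2.5 / 0.00066              # time steps for a crystal to expand a distance of 2.5
print(steps)                       # ~3800, rounded up to ~4000 in the text
print(11593. / 30 * 4000 / 86400)  # 4000 x 4000 case at 11593 s per 30 steps: ~17.9 days
print(100. * 4000 / 86400)         # 3000 x 3000 case at ~100 s per step: ~4.6 days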
Testing the divorcePysparse branch
The tables below show wall clock simulation durations, in seconds, for the source:branches/divorcePysparse@3716 branch and source:trunk@3716 on poole, for 10 time steps of source:sandbox/anisotropy.py@3717. The simulations were run 5 times each on 1, 2 and 4 processors; a short summary script follows the tables.
source:branches/divorcePysparse@3716
1 processor | 2 processors | 4 processors |
30.80 | 17.57 | 10.14 |
37.29 | 19.73 | 9.74 |
30.51 | 19.18 | 9.73 |
32.23 | 16.86 | 10.35 |
30.27 | 16.86 | 10.27 |
source:trunk@3716
1 processor | 2 processors | 4 processors |
23.03 | 12.08 | 7.63 |
21.46 | 13.07 | 7.04 |
22.91 | 14.88 | 6.98 |
22.74 | 11.99 | 7.50 |
32.59 | 11.87 | 7.54 |
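To make the comparison easier to read at a glance, here is a small summary script. It is not part of the benchmark runs themselves; it just restates the tables above, taking the fastest of the five runs for each processor count and reporting the parallel speedup relative to one processor.

# summarize the timing tables above: fastest run and speedup per processor count
import numpy as np

# wall clock times in seconds; columns are 1, 2, 4 processors
divorcePysparse = np.array([[30.80, 17.57, 10.14],
                            [37.29, 19.73,  9.74],
                            [30.51, 19.18,  9.73],
                            [32.23, 16.86, 10.35],
                            [30.27, 16.86, 10.27]])
trunk = np.array([[23.03, 12.08, 7.63],
                  [21.46, 13.07, 7.04],
                  [22.91, 14.88, 6.98],
                  [22.74, 11.99, 7.50],
                  [32.59, 11.87, 7.54]])

for name, runs in (("divorcePysparse", divorcePysparse), ("trunk", trunk)):
    best = runs.min(axis=0)            # fastest of the 5 runs on 1, 2 and 4 processors
    print(name, best, best[0] / best)  # times and speedups relative to 1 processor

From the raw numbers, the divorcePysparse branch looks slower in absolute terms, while the parallel speedups appear comparable.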