Simulations with varying solver tolerance
| ID | status | condor ID | Tolerance | Step Time (s) | Steps | Distance (R) | Sim Time (s) | Bill's movie |
| 18 | stopped | 12087.0 | | 120 | 460 | 0.13 | 0.00256 | movie |
| 19 | stopped | 12088.0 | | 96 | 460 | 0.13 | 0.00256 | movie |
| 20 | stopped | 12089.0 | | 66 | 460 | 0.13 | 0.00256 | movie |
| 21 | stopped | 12090.0 | | 156 | 460 | 0.13 | 0.00256 | movie |
| 22 | running | 12093.0 | | -- | -- | -- | -- | movie |
Notes
- These simulations need to run for longer to ascertain the long-term effects of larger solver tolerances.
- According to the Trilinos manuals, the default stopping criterion is $\|r^{(k)}\|_2 / \|r^{(0)}\|_2 < \mathrm{tol}$, i.e. the residual norm relative to the initial residual.
- I'm still confused about how the tolerance relates to the error. Certainly the error is linear in the residual (Ae = r), which suggests that an order-of-magnitude decrease in the tolerance gives an order-of-magnitude decrease in the error. However, if the tolerance is 1 then no iterations should be done at all, which would be pathological. I am running a tolerance = 1 case, so we'll see.
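The toy check below (my own sketch, not part of the runs above) just verifies that reasoning: for a fixed matrix, shrinking the residual norm by factors of 100 shrinks the error norm by the same factors, amplified at most by the conditioning of the matrix.

import numpy

# For fixed A, the error is e = A^{-1} r, so it is linear in the residual r:
# a ten-times smaller residual leaves a ten-times smaller error, scaled by
# how ill-conditioned A is.
numpy.random.seed(0)
N = 200
A = numpy.random.rand(N, N) + N * numpy.eye(N)   # diagonally dominant test matrix
x_exact = numpy.random.rand(N)
b = numpy.dot(A, x_exact)

for res_norm in (1e-2, 1e-4, 1e-6):
    r = res_norm * numpy.ones(N) / numpy.sqrt(N)   # a residual with norm res_norm
    x = x_exact - numpy.linalg.solve(A, r)         # then b - A x = r exactly
    print(res_norm, numpy.linalg.norm(b - numpy.dot(A, x)),
          numpy.linalg.norm(x - x_exact))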
Altix and !FiPy. What can we expect?
What are the expected performance benefits for FiPy using the Altix machine?
The following questions are pertinent.
- Which examples?
- What proportion of time is spent in routines that could benefit from a parallel capability?
- How long do the jobs currently take on one processor?
- How much memory do the jobs currently use?
| job ID | PID | Status | Radius Drop Size | N | Proportion | Solver | Max Memory (GB) | Steps | Time Per Step (s) | Time (days) | Simulation time (s) | Altix (k=1.6) (days) | Poole (k=1.6) (days) |
| 14 | | stopped | | 34000 | 0.774 (@) | 0.542 | 0.41 | 450 | 185 | 0.96 | 0.0016 | 0.21 | 0.28 |
| 15 | | stopped | | 136000 | 0.806 | 0.579 | 0.99 | 680 | 863 | 6.8 | 0.00226 | 1.5 | 2.0 |
| 16 | poole 4430 | running | | 544000 | 0.822 (+) | 0.562 (+) | 3.3 | 929 (%) | 3336.0 | 35.9 (%) | 0.00275 (%) | 8.0 | 11 |
| 17 | poole 4714 | running | | 2176000 | 0.798 (&) | 0.496 (&) | 13 | 50667 (#) | 11034.0 | 190 (647.1 (#)) | 0.0293 (#) | 42 | 56 |
| | | | | 4896000 | | | 28 | | | 502 | | 112 | 148 |
| | | | | | | | 483 | | | 15420 | | 3440 | 4552 |
- (+) still need many more data points
- (%) reasonable extrapolation, waiting for more data points
- (#) very bad extrapolation, waiting for more data points
- (&) measured at only 200 steps; this number rises until 600 steps
- (@) steps are too quick to get a good number
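As context for the Proportion column above (the fraction of run time spent in routines that could benefit from a parallel capability), Amdahl's law bounds the overall speedup available from parallelizing only that fraction. The sketch below is my own illustration; the k=1.6 estimates in the table may well have been produced differently.

# Amdahl's law: with a fraction p of the run time parallelizable and a speedup
# s on that fraction, the overall speedup is 1 / ((1 - p) + p / s).
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# p values from the Proportion column; the s values here are illustrative only
for p in (0.774, 0.806, 0.822, 0.798):
    print(p, [round(amdahl(p, s), 2) for s in (2, 4, 8, 16, 1e9)])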
Notes
The wall clock time estimates assume that each simulation must run until the triple point has covered a distance of R / 10.
The memory usage seems to go like $M \approx a N^{m}$, with fitted constants $a$ and $m$.
The simulation times are estimated with a power law, $T \approx b N^{p}$, with fitted constants $b$ and $p$.
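As a cross-check on those two fits, the sketch below (mine, not the original fitting script) recovers power-law exponents by least squares on the logarithms of the N, Max Memory and Time Per Step columns from the table above.

import numpy

# columns taken from the table above
N = numpy.array([34000., 136000., 544000., 2176000.])
memory = numpy.array([0.41, 0.99, 3.3, 13.0])          # Max Memory (GB)
step_time = numpy.array([185., 863., 3336., 11034.])   # Time Per Step (s)

for label, y in (("memory (GB)", memory), ("time per step (s)", step_time)):
    # least-squares fit of log y = m log N + log a, i.e. y ~ a * N**m
    m, log_a = numpy.polyfit(numpy.log(N), numpy.log(y), 1)
    print(label, "exponent m =", round(m, 2), "prefactor a =", float(numpy.exp(log_a)))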
Figure: time spent in routines that could benefit from a parallel capability.
A quick study of solver iterations and tolerance may also yield gains.
Other functions are also costly (fractions of total run time below), but are not readily reimplemented in parallel:
| N | FillComplete | InsertGlobalValues | array |
| 34000 | 0.068 | 0.039 | 0.015 |
| 136000 | 0.052 | 0.024 | 0.017 |
| 544000 | 0.052 | 0.026 | 0.017 |
| 2176000 | 0.057 | 0.027 | 0.020 |
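These fractions were presumably collected with a profiler; the snippet below is only a sketch of the bookkeeping involved (run_simulation is a dummy stand-in for the real FiPy driver). Dividing a routine's cumulative time by the total run time gives numbers comparable to the table above.

import cProfile
import pstats

def run_simulation():
    # dummy stand-in for the real FiPy driver script
    return sum(i * i for i in range(10 ** 6))

cProfile.run("run_simulation()", "profile.out")

stats = pstats.Stats("profile.out")
# with the real driver, restrict the report to e.g. 'FillComplete' or
# 'InsertGlobalValues' and divide their cumulative times by the total
stats.sort_stats("cumulative").print_stats(10)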
Poor weave results on the Altix
The times below are very bad and I have no idea why; Poole showed great speed-ups.
| N | Threads | speedup | speedup factor | raw times (s) |
| 500 | 0 | 1.00 | 1.00 | 0.50 |
| 500 | 1 | 0.33 | 0.00 | 1.50 |
| 500 | 2 | 0.65 | 0.65 | 0.76 |
| 500 | 4 | 1.29 | 1.13 | 0.38 |
| 500 | 8 | 2.49 | 1.36 | 0.20 |
| 500 | 16 | 4.43 | 1.45 | 0.11 |
| 500 | 32 | 6.08 | 1.43 | 0.08 |
| 500 | 64 | 6.35 | 1.36 | 0.08 |
| 1000 | 0 | 1.00 | 1.00 | 12.37 |
| 1000 | 1 | 0.62 | 0.00 | 20.06 |
| 1000 | 2 | 1.05 | 1.05 | 11.76 |
| 1000 | 4 | 1.88 | 1.37 | 6.57 |
| 1000 | 8 | 2.75 | 1.40 | 4.50 |
| 1000 | 16 | 2.91 | 1.31 | 4.25 |
| 1000 | 32 | 3.11 | 1.25 | 3.98 |
| 1000 | 64 | 3.38 | 1.23 | 3.66 |
| 2000 | 0 | 1.00 | 1.00 | 179.58 |
| 2000 | 1 | 0.74 | 0.00 | 244.13 |
| 2000 | 2 | 1.43 | 1.43 | 125.62 |
| 2000 | 4 | 2.91 | 1.71 | 61.63 |
| 2000 | 8 | 4.20 | 1.61 | 42.77 |
| 2000 | 16 | 4.97 | 1.49 | 36.16 |
| 2000 | 32 | 4.89 | 1.37 | 36.74 |
| 2000 | 64 | 5.87 | 1.34 | 30.60 |
| 3000 | 0 | 1.00 | 1.00 | 510.73 |
| 3000 | 1 | 0.71 | 0.00 | 719.87 |
| 3000 | 2 | 1.26 | 1.26 | 404.31 |
| 3000 | 4 | 2.30 | 1.52 | 221.79 |
| 3000 | 8 | 3.82 | 1.56 | 133.80 |
| 3000 | 16 | 3.98 | 1.41 | 128.48 |
| 3000 | 32 | 4.07 | 1.32 | 125.33 |
| 3000 | 64 | 4.80 | 1.30 | 106.35 |
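For reference, the speedup column above is just the zero-thread (no-pragma) wall time divided by the threaded wall time, and the speedup factor column is consistent with that speedup re-expressed per doubling of the thread count. That second reading is my inference from the numbers, so take the sketch below accordingly.

import math

def table_columns(t_serial, t_threaded, nthreads):
    # overall speedup relative to the no-pragma (0-thread) baseline
    speedup = t_serial / t_threaded
    # per-doubling speedup: the factor that, applied log2(nthreads) times,
    # reproduces the overall speedup (my reading of the "speedup factor" column)
    factor = speedup ** (1.0 / math.log(nthreads, 2)) if nthreads > 1 else 0.0
    return speedup, factor

# e.g. the Altix N=500, 16-thread row: raw times 0.50 s and 0.11 s give roughly
# (4.5, 1.46), close to the tabulated 4.43 and 1.45 (the raw times are rounded)
print(table_columns(0.50, 0.11, 16))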
Posting code
Posting the code here because it can't be shared via svn.
import numpy
from scipy import weave

def multiply(N, nthreads):
    """Time an N x N matrix multiplication in inline C, optionally threaded with OpenMP."""

    i, j = numpy.indices((N, N), 'd')
    c = numpy.zeros((N, N), 'd')

    a = i + j
    b = i * j

    # triple loop with an OpenMP pragma distributing the outer loop over nthreads
    code_pragma = """
    int i, j, k;

    #pragma omp parallel shared(a,b,c,nthreads) private(i,j,k) num_threads(nthreads)
    {
        #pragma omp for schedule (dynamic)
        for (i=0; i<N; i++)
            for (j=0; j<N; j++)
                for (k=0; k<N; k++)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
    }
    """

    # the same triple loop without any OpenMP directives (the 0-thread baseline)
    code_no_pragma = """
    int i, j, k;

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    """

    args = {'N': N, 'a': a, 'b': b, 'c': c, 'nthreads': nthreads}

    if nthreads == 0:
        code = code_no_pragma
    else:
        code = code_pragma

    import time

    t0 = time.time()

    weave.inline(code,
                 args.keys(),
                 local_dict=args,
                 type_converters=None,
                 compiler='gcc',
                 force=0,
                 verbose=0,
                 extra_compile_args=['-O3', '-fopenmp'],
                 extra_link_args=['-O3', '-fopenmp'],
                 headers=['<omp.h>'])

    return time.time() - t0

from fipy.tools import dump

# start with an empty results file, then reread and rewrite it after every run
# so that partial results are always on disk
results = {}
dump.write(results, 'datafile.gz')

for N in (500, 1000, 2000, 3000, 4000, 5000, 6000):

    results = dump.read('datafile.gz')
    results[N] = {}
    dump.write(results, 'datafile.gz')

    for nthreads in range(9):

        results = dump.read('datafile.gz')
        results[N][nthreads] = []
        dump.write(results, 'datafile.gz')

        for sim in range(5):

            results = dump.read('datafile.gz')
            results[N][nthreads].append(multiply(N, nthreads))
            dump.write(results, 'datafile.gz')
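Once a run has finished (or crashed partway), the timing data can be read back and summarised. This is a sketch of my own, not part of the posted script, using the same fipy.tools.dump file.

from fipy.tools import dump
import numpy

results = dump.read('datafile.gz')
for N in sorted(results):
    serial_times = results[N].get(0, [])
    if not serial_times:
        continue                              # no no-pragma baseline yet
    baseline = numpy.mean(serial_times)
    for nthreads in sorted(results[N]):
        times = results[N][nthreads]
        if times:                             # skip combinations not yet run
            print(N, nthreads, round(baseline / numpy.mean(times), 2))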
OpenMP and weave on poole
Some results for OpenMP on poole. The test is an N by N matrix multiplication; the code is posted in the section above. The speed-up factor clearly improves as the number of threads increases. The results for N=4000 are nonsense, probably because Ed was running other jobs at the same time. I should really have recorded the resources available during each simulation by keeping a record with time.clock() and time.time() (a sketch of such bookkeeping follows the data below).
| N | speedup factor for 8 threads |
| 500 | 1.45 |
| 1000 | 1.78 |
| 2000 | 1.80 |
| 3000 | 1.82 |
| 4000 | 1.30 |
| 5000 | 1.76 |
More of the data, with the same columns as above.
| N | Threads | speedup | speedup factor | raw times (s) |
| 500 | 0 | 1.00 | 1.00 | 0.67 |
| 500 | 1 | 0.52 | 0.00 | 1.30 |
| 500 | 2 | 0.97 | 0.97 | 0.69 |
| 500 | 3 | 1.41 | 1.24 | 0.48 |
| 500 | 4 | 1.82 | 1.35 | 0.37 |
| 500 | 5 | 2.22 | 1.41 | 0.30 |
| 500 | 6 | 2.65 | 1.46 | 0.25 |
| 500 | 7 | 2.86 | 1.45 | 0.23 |
| 500 | 8 | 3.03 | 1.45 | 0.22 |
| 1000 | 0 | 1.00 | 1.00 | 11.52 |
| 1000 | 1 | 0.85 | 0.00 | 13.55 |
| 1000 | 2 | 1.65 | 1.65 | 6.98 |
| 1000 | 3 | 2.35 | 1.72 | 4.89 |
| 1000 | 4 | 3.05 | 1.75 | 3.78 |
| 1000 | 5 | 3.84 | 1.79 | 3.00 |
| 1000 | 6 | 4.43 | 1.78 | 2.60 |
| 1000 | 7 | 5.12 | 1.79 | 2.25 |
| 1000 | 8 | 5.69 | 1.78 | 2.03 |
| 2000 | 0 | 1.00 | 1.00 | 113.21 |
| 2000 | 1 | 0.89 | 0.00 | 127.71 |
| 2000 | 2 | 1.69 | 1.69 | 67.07 |
| 2000 | 3 | 2.46 | 1.77 | 45.99 |
| 2000 | 4 | 3.21 | 1.79 | 35.27 |
| 2000 | 5 | 3.96 | 1.81 | 28.56 |
| 2000 | 6 | 4.64 | 1.81 | 24.38 |
| 2000 | 7 | 5.22 | 1.80 | 21.68 |
| 2000 | 8 | 5.81 | 1.80 | 19.48 |
| 3000 | 0 | 1.00 | 1.00 | 385.15 |
| 3000 | 1 | 0.89 | 0.00 | 431.11 |
| 3000 | 2 | 1.70 | 1.70 | 226.09 |
| 3000 | 3 | 2.49 | 1.78 | 154.88 |
| 3000 | 4 | 3.24 | 1.80 | 118.80 |
| 3000 | 5 | 4.02 | 1.82 | 95.81 |
| 3000 | 6 | 4.79 | 1.83 | 80.46 |
| 3000 | 7 | 5.43 | 1.83 | 70.93 |
| 3000 | 8 | 6.03 | 1.82 | 63.85 |
| 4000 | 0 | 1.00 | 1.00 | 2706.37 |
| 4000 | 1 | 0.78 | 0.00 | 3488.43 |
| 4000 | 2 | 1.31 | 1.31 | 2066.20 |
| 4000 | 3 | 1.77 | 1.44 | 1526.49 |
| 4000 | 4 | 2.07 | 1.44 | 1308.90 |
| 4000 | 5 | 2.29 | 1.43 | 1182.51 |
| 4000 | 6 | 2.28 | 1.38 | 1185.99 |
| 4000 | 7 | 2.16 | 1.32 | 1252.05 |
| 4000 | 8 | 2.21 | 1.30 | 1224.98 |
| 5000 | 0 | 1.00 | 1.00 | 1909.29 |
| 5000 | 1 | 0.78 | 0.00 | 2457.51 |
| 5000 | 2 | 1.61 | 1.61 | 1187.01 |
| 5000 | 3 | 2.45 | 1.76 | 780.55 |
| 5000 | 4 | 3.24 | 1.80 | 589.95 |
| 5000 | 5 | 3.96 | 1.81 | 481.70 |
| 5000 | 6 | 4.47 | 1.78 | 427.20 |
| 5000 | 7 | 5.01 | 1.78 | 380.78 |
| 5000 | 8 | 5.47 | 1.76 | 349.23 |
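Regarding the missing bookkeeping mentioned above, here is a minimal sketch (mine, not part of the original benchmark) that records wall time, CPU time and the load average around each measurement, so that runs disturbed by other users, as suspected for N=4000, stand out afterwards. time.process_time() plays the role that time.clock() did in the Python 2 era of these notes.

import os
import time

def timed(func, *args):
    # record wall time, process CPU time and the 1-minute load average (Unix only)
    wall0, cpu0 = time.time(), time.process_time()
    result = func(*args)
    record = {'wall': time.time() - wall0,
              'cpu': time.process_time() - cpu0,
              'loadavg': os.getloadavg()[0]}
    return result, record

# illustrative usage with the multiply() benchmark posted earlier:
# elapsed, record = timed(multiply, 1000, 8)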
Machine details
$ uname -a
Linux poole 2.6.18-6-amd64 #1 SMP Sun Feb 10 17:50:19 UTC 2008 x86_64 GNU/Linux
8 of these:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8220
stepping        : 3
cpu MHz         : 2812.978
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 5630.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
and memory
MemTotal: 32964928 kB
Curvature expressions in !FiPy
| Curvature | Expression | FiPy Expression |
| iso | $\nabla \cdot \left( \frac{\nabla \phi}{\lvert \nabla \phi \rvert} \right)$ | (phi.getFaceGrad() / phi.getFaceGrad().getMag()).getDivergence() |
| mean | $\nabla \cdot \left( \frac{\nabla \phi}{\sqrt{1 + \lvert \nabla \phi \rvert^2}} \right)$ | (phi.getFaceGrad() / numerix.sqrt(1 + phi.getFaceGrad().getMag()**2)).getDivergence() |
| Gauss | $\frac{\phi_{xx} \phi_{yy} - \phi_{xy}^2}{\left(1 + \lvert \nabla \phi \rvert^2\right)^2}$ | ((phi.getFaceGrad() * (1,0)).getDivergence() * (phi.getFaceGrad() * (0,1)).getDivergence() - phi.getFaceGrad()[::-1].getDivergence() / 2) / (1 + phi.getFaceGrad().getMag()**2)**2 |
Is there a vector form for the Gaussian curvature? It may help the FiPy expression.
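One partial answer, a standard identity rather than anything from these notes: the numerator of the Gaussian curvature can be written as a divergence, which might map onto a single getDivergence() call, although the $(1 + |\nabla\phi|^2)^2$ denominator stays outside.

\phi_{xx}\phi_{yy} - \phi_{xy}^{2}
  = \tfrac{1}{2}\,\nabla\cdot\left[(\nabla\cdot\nabla\phi)\,\nabla\phi - (\nabla\nabla\phi)\cdot\nabla\phi\right]
\qquad\Longrightarrow\qquad
K = \frac{\nabla\cdot\left[(\nabla\cdot\nabla\phi)\,\nabla\phi - (\nabla\nabla\phi)\cdot\nabla\phi\right]}{2\left(1 + |\nabla\phi|^{2}\right)^{2}}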
A very good differential geometry book: http://www.archive.org/details/differentialgeom003681mbp.