Simulations with varying solver tolerance
ID | status | condor ID | Tolerance | Step Time (s) | Steps | Distance (R) | Sim Time (s) | Bill's movie
18 | stopped | 12087.0 |  | 120 | 460 | 0.13 | 0.00256 | movie
19 | stopped | 12088.0 |  | 96 | 460 | 0.13 | 0.00256 | movie
20 | stopped | 12089.0 |  | 66 | 460 | 0.13 | 0.00256 | movie
21 | stopped | 12090.0 |  | 156 | 460 | 0.13 | 0.00256 | movie
22 | running | 12093.0 |  | -- | -- | -- | -- | movie
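For reference, a minimal sketch of how an explicit solver tolerance is passed to FiPy (this uses the current FiPy names, LinearPCGSolver and constrain, and an illustrative 1D diffusion problem, not the actual simulation script):
{{{#!python
# Sketch only: an illustrative diffusion solve with an explicit solver tolerance.
# The mesh, boundary condition, tolerance and iteration count are placeholders.
from fipy import Grid1D, CellVariable, TransientTerm, DiffusionTerm, LinearPCGSolver

mesh = Grid1D(nx=100, dx=1.0)
phi = CellVariable(mesh=mesh, value=0.0)
phi.constrain(1.0, mesh.facesLeft)

eq = TransientTerm() == DiffusionTerm(coeff=1.0)
solver = LinearPCGSolver(tolerance=1e-10, iterations=1000)   # the knob varied above

for step in range(10):
    eq.solve(var=phi, dt=1.0, solver=solver)
}}}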
Notes
- These simulations need to run for longer to ascertain the long-term effects of larger solver tolerances.
- According to the Trilinos manuals, the default stopping criterion is the relative residual, ||r|| / ||r_0|| < tolerance.
- I'm still confused about how the tolerance relates to the error. Certainly the error is linear in the residual (Ae = r), which would say that an order of magnitude decrease in the tolerance leads to an order of magnitude decrease in the error. However, if the tolerance is 1 then no iterations should be done at all, which would be pathological. I am running a tolerance = 1 case, so we'll see. A small numpy illustration of the linear relation follows this list.
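A minimal numpy illustration of that linear relation (the matrix below is an arbitrary well-conditioned test matrix, not one of the simulation matrices): since Ae = r, the error satisfies ||e|| <= ||A^-1|| ||r||, so a tenfold drop in the residual norm drops the error bound tenfold.
{{{#!python
import numpy as np

np.random.seed(0)

n = 50
A = np.random.rand(n, n) + n * np.eye(n)   # arbitrary well-conditioned test matrix
x_true = np.random.rand(n)
b = A.dot(x_true)

Ainv_norm = np.linalg.norm(np.linalg.inv(A), 2)

# mimic successively tighter solves with successively smaller perturbations
for scale in (1e-2, 1e-3, 1e-4, 1e-5):
    x_approx = x_true + scale * np.random.rand(n)
    r = b - A.dot(x_approx)   # residual
    e = x_true - x_approx     # error; satisfies A e = r
    print(np.linalg.norm(r), np.linalg.norm(e), Ainv_norm * np.linalg.norm(r))
}}}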
Altix and !FiPy. What can we expect?
What are the expected performance benefits for FiPy using the Altix machine?
The following questions are pertinent.
- Which examples?
- What proportion of time is spent in routines that could benefit from a parallel capability?
- How long do the jobs currently take on one processor?
- How much memory do the jobs currently use?
job ID | PID | Status | Radius Drop Size | N | Proportion | Solver | Max Memory (Gb) | Steps | Time Per Step (s) | Time (days) | Simulation time (s) | Altix (k=1.6) (days) | Poole (k=1.6) (days)
14 |  | stopped |  | 34000 | 0.774 (@) | 0.542 | 0.41 | 450 | 185 | 0.96 | 0.0016 | 0.21 | 0.28
15 |  | stopped |  | 136000 | 0.806 | 0.579 | 0.99 | 680 | 863 | 6.8 | 0.00226 | 1.5 | 2.0
16 | poole 4430 | running |  | 544000 | 0.822 (+) | 0.562 (+) | 3.3 | 929 (%) | 3336.0 | 35.9 (%) | 0.00275 (%) | 8.0 | 11
17 | poole 4714 | running |  | 2176000 | 0.798 (&) | 0.496 (&) | 13 | 50667 (#) | 11034.0 | 190 (647.1 (#)) | 0.0293 (#) | 42 | 56
 |  |  |  | 4896000 |  |  | 28 |  |  | 502 |  | 112 | 148
 |  |  |  |  |  |  | 483 |  |  | 15420 |  | 3440 | 4552
- (+) still need many more data points
- (%) reasonable extrapolation, waiting for more data points
- (#) very bad extrapolation, waiting for more data points
- (&) only at 200 steps; this number rises until 600 steps
- (@) steps are too quick to get a good number
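One crude way to answer the "what can we expect?" question from the Proportion column is Amdahl's law (note this is a plain Amdahl bound, not the k=1.6 extrapolation used for the Altix and Poole columns; the processor counts are illustrative):
{{{#!python
# Amdahl's law: best-case speedup when a proportion p of the run time parallelizes
# perfectly and the remaining (1 - p) stays serial.
def amdahl(p, nproc):
    return 1.0 / ((1.0 - p) + p / nproc)

# p values from the Proportion column above; processor counts are illustrative
for p in (0.774, 0.806, 0.822, 0.798):
    print(p, [round(amdahl(p, n), 2) for n in (2, 4, 8, 16)])
}}}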
Notes
The simulations' wall clock time is the time required for the triple point to cover a distance of R / 10.
The memory usage seems to go like a power law in the number of cells, M = c1 * N^p1.
The simulation times are estimated with a similar power law, T = c2 * N^p2.
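The fitted constants can be re-derived from the table with a log-log least-squares fit; a minimal sketch using only the N, Max Memory and Time Per Step columns above (the exponents printed here are re-fits, not the original values):
{{{#!python
import numpy as np

# data taken from the table above
N       = np.array([34000., 136000., 544000., 2176000.])
memory  = np.array([0.41, 0.99, 3.3, 13.])        # Max Memory (Gb)
perstep = np.array([185., 863., 3336., 11034.])   # Time Per Step (s)

# fit y = c * N**p by least squares in log-log space
for label, y in (('memory', memory), ('time per step', perstep)):
    p, logc = np.polyfit(np.log(N), np.log(y), 1)
    print('%s: exponent %.2f, prefactor %.3g' % (label, p, np.exp(logc)))
}}}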
The following figure shows the time spent in routines that could benefit from a parallel capability.
A quick study of solver iterations and tolerance may also yield gains.
Other functions are costly but not readily reimplemented in parallel:
Radius Drop Size | N | FillComplete | InsertGlobalValues | array
 | 34000 | 0.068 | 0.039 | 0.015
 | 136000 | 0.052 | 0.024 | 0.017
 | 544000 | 0.052 | 0.026 | 0.017
 | 2176000 | 0.057 | 0.027 | 0.020
Poor weave results on the Altix
The times below are very bad. I have no idea why this is the case; poole had great speed-ups. The "speedup factor" column is unpacked in a short sketch after the table.
N | Threads | speedup | speedup factor | raw times (s) |
500 | 0 | 1.00 | 1.00 | 0.50 |
500 | 1 | 0.33 | 0.00 | 1.50 |
500 | 2 | 0.65 | 0.65 | 0.76 |
500 | 4 | 1.29 | 1.13 | 0.38 |
500 | 8 | 2.49 | 1.36 | 0.20 |
500 | 16 | 4.43 | 1.45 | 0.11 |
500 | 32 | 6.08 | 1.43 | 0.08 |
500 | 64 | 6.35 | 1.36 | 0.08 |
1000 | 0 | 1.00 | 1.00 | 12.37 |
1000 | 1 | 0.62 | 0.00 | 20.06 |
1000 | 2 | 1.05 | 1.05 | 11.76 |
1000 | 4 | 1.88 | 1.37 | 6.57 |
1000 | 8 | 2.75 | 1.40 | 4.50 |
1000 | 16 | 2.91 | 1.31 | 4.25 |
1000 | 32 | 3.11 | 1.25 | 3.98 |
1000 | 64 | 3.38 | 1.23 | 3.66 |
2000 | 0 | 1.00 | 1.00 | 179.58 |
2000 | 1 | 0.74 | 0.00 | 244.13 |
2000 | 2 | 1.43 | 1.43 | 125.62 |
2000 | 4 | 2.91 | 1.71 | 61.63 |
2000 | 8 | 4.20 | 1.61 | 42.77 |
2000 | 16 | 4.97 | 1.49 | 36.16 |
2000 | 32 | 4.89 | 1.37 | 36.74 |
2000 | 64 | 5.87 | 1.34 | 30.60 |
3000 | 0 | 1.00 | 1.00 | 510.73 |
3000 | 1 | 0.71 | 0.00 | 719.87 |
3000 | 2 | 1.26 | 1.26 | 404.31 |
3000 | 4 | 2.30 | 1.52 | 221.79 |
3000 | 8 | 3.82 | 1.56 | 133.80 |
3000 | 16 | 3.98 | 1.41 | 128.48 |
3000 | 32 | 4.07 | 1.32 | 125.33 |
3000 | 64 | 4.80 | 1.30 | 106.35 |
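For reference, the "speedup factor" column appears to be the per-doubling-of-threads factor k, i.e. speedup = k**log2(threads); this reading is my interpretation (it is not stated explicitly above), but it reproduces the tabulated values, as the spot checks below show:
{{{#!python
import math

def doubling_factor(speedup, nthreads):
    # per-doubling speedup factor k such that speedup == k ** log2(nthreads)
    return speedup ** (1.0 / math.log(nthreads, 2))

# spot checks against rows of the table above
print(round(doubling_factor(2.49, 8), 2))    # N=500,  8 threads  -> 1.36
print(round(doubling_factor(4.43, 16), 2))   # N=500,  16 threads -> 1.45
print(round(doubling_factor(2.91, 4), 2))    # N=2000, 4 threads  -> 1.71
}}}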
Posting code
Posting some code because it can't be shared using svn.
{{{#!python
import numpy
from scipy import weave

def multiply(N, nthreads):

    i, j = numpy.indices((N, N), 'd')
    c = numpy.zeros((N, N), 'd')

    a = i + j
    b = i * j

    code_pragma = """
    int i, j, k;

    #pragma omp parallel shared(a,b,c,nthreads) private(i,j,k) num_threads(nthreads)
    {

    #pragma omp for schedule (dynamic)
    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];

    }
    """

    code_no_pragma = """
    int i, j, k;

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    """

    args = {'N' : N, 'a' : a, 'b' : b, 'c' : c, 'nthreads' : nthreads}

    if nthreads == 0:
        code = code_no_pragma
    else:
        code = code_pragma

    import time

    t0 = time.time()

    weave.inline(code,
                 args.keys(),
                 local_dict=args,
                 type_converters=None,
                 compiler='gcc',
                 force=0,
                 verbose=0,
                 extra_compile_args=['-O3 -fopenmp'],
                 extra_link_args=['-O3 -fopenmp'],
                 headers=['<omp.h>'])

    return time.time() - t0

from fipy.tools import dump

results = {}

dump.write(results, 'datafile.gz')

for N in (500, 1000, 2000, 3000, 4000, 5000, 6000):

    results = dump.read('datafile.gz')
    results[N] = {}
    dump.write(results, 'datafile.gz')

    for nthreads in range(9):

        results = dump.read('datafile.gz')
        results[N][nthreads] = []
        dump.write(results, 'datafile.gz')

        for sim in range(5):

            results = dump.read('datafile.gz')
            results[N][nthreads].append(multiply(N, nthreads))
            dump.write(results, 'datafile.gz')
}}}
OpenMP and weave on poole
Some results for OpenMP on poole. The test is an N by N matrix multiplication; the code is posted above. The speedup factor is certainly improving as the threads increase. The results for N=4000 are nonsense, probably Ed running stuff on the machine at the same time. I should really have recorded the resources available during each simulation by keeping a record with time.clock() and time.time(); a sketch of that bookkeeping follows the summary table below.
N | speedup factor for 8 threads |
500 | 1.45 |
1000 | 1.78 |
2000 | 1.80 |
3000 | 1.82 |
4000 | 1.30 |
5000 | 1.76 |
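A minimal sketch of the bookkeeping suggested above (nothing like this was actually run; timed() and the wall/CPU comparison are just a suggestion): recording both wall time and CPU time around each call would show when the machine was shared, since wall time grows relative to CPU time under contention.
{{{#!python
import time

def timed(func, *args):
    # run func(*args) and return (result, wall-clock seconds, CPU seconds)
    # time.clock() is the Python 2-era CPU timer mentioned in the note above
    w0, c0 = time.time(), time.clock()
    result = func(*args)
    return result, time.time() - w0, time.clock() - c0

# e.g. wrapping the multiply() benchmark posted earlier on this page:
# _, wall, cpu = timed(multiply, 1000, 0)
# for the serial (nthreads=0) case, wall much larger than cpu suggests a loaded machine
}}}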
More of the data, with the same columns as the Altix table above.
N | Threads | speedup | speedup factor | raw times (s)
500 | 0 | 1.00 | 1.00 | 0.67
500 | 1 | 0.52 | 0.00 | 1.30 |
500 | 2 | 0.97 | 0.97 | 0.69 |
500 | 3 | 1.41 | 1.24 | 0.48 |
500 | 4 | 1.82 | 1.35 | 0.37 |
500 | 5 | 2.22 | 1.41 | 0.30 |
500 | 6 | 2.65 | 1.46 | 0.25 |
500 | 7 | 2.86 | 1.45 | 0.23 |
500 | 8 | 3.03 | 1.45 | 0.22 |
1000 | 0 | 1.00 | 1.00 | 11.52 |
1000 | 1 | 0.85 | 0.00 | 13.55 |
1000 | 2 | 1.65 | 1.65 | 6.98 |
1000 | 3 | 2.35 | 1.72 | 4.89 |
1000 | 4 | 3.05 | 1.75 | 3.78 |
1000 | 5 | 3.84 | 1.79 | 3.00 |
1000 | 6 | 4.43 | 1.78 | 2.60 |
1000 | 7 | 5.12 | 1.79 | 2.25 |
1000 | 8 | 5.69 | 1.78 | 2.03 |
2000 | 0 | 1.00 | 1.00 | 113.21 |
2000 | 1 | 0.89 | 0.00 | 127.71 |
2000 | 2 | 1.69 | 1.69 | 67.07 |
2000 | 3 | 2.46 | 1.77 | 45.99 |
2000 | 4 | 3.21 | 1.79 | 35.27 |
2000 | 5 | 3.96 | 1.81 | 28.56 |
2000 | 6 | 4.64 | 1.81 | 24.38 |
2000 | 7 | 5.22 | 1.80 | 21.68 |
2000 | 8 | 5.81 | 1.80 | 19.48 |
3000 | 0 | 1.00 | 1.00 | 385.15 |
3000 | 1 | 0.89 | 0.00 | 431.11 |
3000 | 2 | 1.70 | 1.70 | 226.09 |
3000 | 3 | 2.49 | 1.78 | 154.88 |
3000 | 4 | 3.24 | 1.80 | 118.80 |
3000 | 5 | 4.02 | 1.82 | 95.81 |
3000 | 6 | 4.79 | 1.83 | 80.46 |
3000 | 7 | 5.43 | 1.83 | 70.93 |
3000 | 8 | 6.03 | 1.82 | 63.85 |
4000 | 0 | 1.00 | 1.00 | 2706.37 |
4000 | 1 | 0.78 | 0.00 | 3488.43 |
4000 | 2 | 1.31 | 1.31 | 2066.20 |
4000 | 3 | 1.77 | 1.44 | 1526.49 |
4000 | 4 | 2.07 | 1.44 | 1308.90 |
4000 | 5 | 2.29 | 1.43 | 1182.51 |
4000 | 6 | 2.28 | 1.38 | 1185.99 |
4000 | 7 | 2.16 | 1.32 | 1252.05 |
4000 | 8 | 2.21 | 1.30 | 1224.98 |
5000 | 0 | 1.00 | 1.00 | 1909.29 |
5000 | 1 | 0.78 | 0.00 | 2457.51 |
5000 | 2 | 1.61 | 1.61 | 1187.01 |
5000 | 3 | 2.45 | 1.76 | 780.55 |
5000 | 4 | 3.24 | 1.80 | 589.95 |
5000 | 5 | 3.96 | 1.81 | 481.70 |
5000 | 6 | 4.47 | 1.78 | 427.20 |
5000 | 7 | 5.01 | 1.78 | 380.78 |
5000 | 8 | 5.47 | 1.76 | 349.23 |
Machine details
{{{
$ uname -a
Linux poole 2.6.18-6-amd64 #1 SMP Sun Feb 10 17:50:19 UTC 2008 x86_64 GNU/Linux
}}}
8 of these processors (from /proc/cpuinfo)
{{{
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8220
stepping        : 3
cpu MHz         : 2812.978
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 5630.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
}}}
and memory (from /proc/meminfo)
{{{
MemTotal: 32964928 kB
}}}
Curvature expressions in !FiPy
Curvature | Expression | FiPy Expression
iso | $\kappa = \nabla \cdot \left( \nabla\phi / |\nabla\phi| \right)$ | (phi.getFaceGrad() / phi.getFaceGrad().getMag()).getDivergence()
mean | $\kappa = \nabla \cdot \left( \nabla\phi / \sqrt{1 + |\nabla\phi|^2} \right)$ | (phi.getFaceGrad() / numerix.sqrt(1 + phi.getFaceGrad().getMag()**2)).getDivergence()
Gauss | $K = \left( \phi_{xx}\phi_{yy} - \phi_{xy}^2 \right) / \left( 1 + |\nabla\phi|^2 \right)^2$ | ((phi.getFaceGrad() * (1,0)).getDivergence() * (phi.getFaceGrad() * (0,1)).getDivergence() - phi.getFaceGrad()[::-1].getDivergence() / 2) / (1 + phi.getFaceGrad().getMag()**2)**2
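A quick numpy finite-difference sanity check of the iso expression, independent of FiPy (the grid, radius and interface band below are illustrative): for a circular level set the exact curvature at the interface is 1/R.
{{{#!python
import numpy as np

R, L, n = 1.0, 3.0, 400
x = np.linspace(-L, L, n)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing='ij')

phi = np.sqrt(X**2 + Y**2) - R                 # signed distance to a circle of radius R

phix, phiy = np.gradient(phi, dx, dx)
mag = np.maximum(np.sqrt(phix**2 + phiy**2), 1e-12)
nx, ny = phix / mag, phiy / mag                # unit normal, grad(phi) / |grad(phi)|

# iso curvature: divergence of the unit normal
kappa = np.gradient(nx, dx, dx)[0] + np.gradient(ny, dx, dx)[1]

r = np.sqrt(X**2 + Y**2)
near = np.abs(r - R) < 2 * dx                  # cells close to the interface
print(kappa[near].mean(), 1.0 / R)             # these agree to within a few percent
}}}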
Is there a vector form for the Gaussian curvature? It might simplify the FiPy expression.
A very good differential geometry book: http://www.archive.org/details/differentialgeom003681mbp