Posts for the month of April 2008

Simulations with varying solver tolerance

ID | status | condor ID | $R$ | Tolerance | Step Time (s) | Steps | Distance (R) | Sim Time (s) | $\left|\left|\frac{\phi - \phi^*}{\phi^*}\right|\right|_{\infty}$ | $\left|\left|\frac{\rho - \rho^*}{\rho^*}\right|\right|_{\infty}$ | $\left|\left|\frac{\rho_2 - \rho_2^*}{\rho_2^*}\right|\right|_{\infty}$ | Bill's movie
18 | stopped | 12087.0 | $5\times10^{-7}$ | $1\times10^{-7}$ | 120 | 460 | 0.13 | 0.00256 | $7.0\times10^{-9}$ | $1.4\times10^{-9}$ | $1.5\times10^{-9}$ | movie
19 | stopped | 12088.0 | $5\times10^{-7}$ | $1\times10^{-4}$ | 96 | 460 | 0.13 | 0.00256 | $1.2\times10^{-6}$ | $9.9\times10^{-7}$ | $9.7\times10^{-7}$ | movie
20 | stopped | 12089.0 | $5\times10^{-7}$ | $1\times10^{-1}$ | 66 | 460 | 0.13 | 0.00256 | $6.1\times10^{-3}$ | $5.9\times10^{-3}$ | $5.9\times10^{-3}$ | movie
21 | stopped | 12090.0 | $5\times10^{-7}$ | $1\times10^{-10}$ | 156 | 460 | 0.13 | 0.00256 | $0$ | $0$ | $0$ | movie
22 | running | 12093.0 | $5\times10^{-7}$ | $1$ | -- | -- | -- | -- | -- | -- | -- | movie

Notes

  • These simulations need to run for longer to ascertain the long-term effects of larger solver tolerances.
  • According to the Trilinos manuals, the default stopping criterion is the relative residual $\frac{\left|\left|\vec{r}\right|\right|_2}{\left|\left|\vec{r}^{(0)}\right|\right|_2} \le$ tolerance.
  • I'm still confused about how the tolerance relates to the error in the solution. The error is linear in the residual ($A\vec{e}=\vec{r}$), which suggests that an order-of-magnitude decrease in the tolerance should give an order-of-magnitude decrease in the error (see the bound sketched below these notes). However, a tolerance of 1 is already satisfied by the initial guess, so no iterations should be done at all, which would be pathological. I am running a tolerance = 1 case, so we'll see.
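
As a sanity check on that reasoning, the standard bound relating the relative residual to the relative error (general linear algebra, nothing Trilinos-specific, with $\vec{x}^*$ the exact solution, $\vec{e} = \vec{x}^* - \vec{x}$ and $\vec{r} = \vec{b} - A\vec{x}$) is

$A\vec{e} = \vec{r} \;\;\Rightarrow\;\; \frac{\left|\left|\vec{e}\right|\right|}{\left|\left|\vec{x}^*\right|\right|} \le \kappa(A)\, \frac{\left|\left|\vec{r}\right|\right|}{\left|\left|\vec{b}\right|\right|}$

so the error should indeed track the tolerance roughly linearly, with the condition number $\kappa(A)$ setting the worst-case prefactor (with a zero initial guess, $\vec{r}^{(0)} = \vec{b}$ and the stopping ratio above coincides with the ratio in the bound). A tolerance of 1 is already satisfied by the initial residual itself, since $\left|\left|\vec{r}^{(0)}\right|\right|_2 / \left|\left|\vec{r}^{(0)}\right|\right|_2 = 1$, which is why that case should be pathological.
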
  • Posted: 2008-04-17 22:42 (Updated: 2008-04-21 16:13)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Altix and FiPy. What can we expect?

What are the expected performance benefits for FiPy using the Altix machine?

The following questions are pertinent.

  • Which examples?
  • What proportion of time is spent in routines that could benefit from a parallel capability?
  • How long do the jobs currently take on one processor?
  • How much memory do the jobs currently use?
job ID | PID | Status | Radius | Drop Size N | Proportion | Solver | Max Memory (Gb) | Steps | Time Per Step (s) | Time (days) | Simulation time (s) | Altix (k=1.6) (days) | Poole (k=1.6) (days)
14 | | stopped | $5\times10^{-7}$ | 34000 | 0.774 (@) | 0.542 | 0.41 | 450 | 185 | 0.96 | 0.0016 | 0.21 | 0.28
15 | | stopped | $1\times10^{-6}$ | 136000 | 0.806 | 0.579 | 0.99 | 680 | 863 | 6.8 | 0.00226 | 1.5 | 2.0
16 | poole 4430 | running | $2\times10^{-6}$ | 544000 | 0.822 (+) | 0.562 (+) | 3.3 | 929 (%) | 3336.0 | 35.9 (%) | 0.00275 (%) | 8.0 | 11
17 | poole 4714 | running | $4\times10^{-6}$ | 2176000 | 0.798 (&) | 0.496 (&) | 13 | 50667 (#) | 11034.0 | 190 (647.1 (#)) | 0.0293 (#) | 42 | 56
 | | | $6\times10^{-6}$ | 4896000 | | | 28 | | | 502 | | 112 | 148
 | | | $2.5\times10^{-5}$ | $8.5\times10^{7}$ | | | 483 | | | 15420 | | 3440 | 4552
  • (+) still need many more data points
  • (%) reasonable extrapolation, waiting for more data points
  • (#) very bad extrapolation, waiting for more data points
  • (&) only at 200 steps, this number rises until 600 steps
  • (@) steps are too quick to get a good number

Notes

The simulation wall clock times are the times required for the triple point to cover a distance of R / 10.

The memory usage seems to go like $\alpha + \beta R^2$ with $\alpha=0.217$ and $\beta=7.73\times10^{11}$ (memory in Gb, $R$ in m).

The simulation times (in days) are estimated with a power law, $\alpha N^{\beta}$, with $\alpha=4.69\times10^{-6}$ and $\beta=1.2$.
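
A minimal sketch that evaluates these two fits, assuming (from the table above) that the memory fit is in Gb with $R$ in metres and that the time fit gives days:

# Fitted constants from the two notes above.
MEM_ALPHA, MEM_BETA = 0.217, 7.73e11      # memory (Gb) ~ alpha + beta * R**2, R in m
TIME_ALPHA, TIME_BETA = 4.69e-6, 1.2      # time (days) ~ alpha * N**beta

def memory_gb(R):
    """Estimated peak memory (Gb) for a drop of radius R (m)."""
    return MEM_ALPHA + MEM_BETA * R**2

def time_days(N):
    """Estimated single-processor run time (days) for drop size N."""
    return TIME_ALPHA * N**TIME_BETA

# e.g. the projected rows of the table above (agreement is to within rounding)
for R, N in ((6e-6, 4896000), (2.5e-5, 8.5e7)):
    print R, N, memory_gb(R), time_days(N)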

The following figure shows the time spent in routines that could benefit from a parallel capability.
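
One way to turn that proportion into an expected run time is Amdahl's law. The sketch below is only a rough estimate of that kind, not necessarily how the Altix and Poole columns in the table were produced; the speedup s applied to the parallelizable part is an assumed input:

def amdahl_days(total_days, p, s):
    """Amdahl's law: the serial fraction (1 - p) is unchanged while the
    parallelizable fraction p is sped up by a factor s."""
    return total_days * ((1. - p) + p / s)

# e.g. the R = 2e-6 case: ~35.9 days serial with p ~ 0.82 and an assumed
# tenfold speedup of the parallelizable part
print amdahl_days(35.9, 0.82, 10.)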

A quick study of solver iterations and tolerance may also yield gains.

Other functions are costly but not readily reimplemented in parallel:

Radius | Drop Size N | FillComplete | InsertGlobalValues | array
$5\times10^{-7}$ | 34000 | 0.068 | 0.039 | 0.015
$1\times10^{-6}$ | 136000 | 0.052 | 0.024 | 0.017
$2\times10^{-6}$ | 544000 | 0.052 | 0.026 | 0.017
$4\times10^{-6}$ | 2176000 | 0.057 | 0.027 | 0.020
  • Posted: 2008-04-14 20:39 (Updated: 2008-04-22 19:10)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Poor weave results on the Altix

The times below are very poor and I have no idea why this is the case; poole, by contrast, showed good speedups.

N Threads speedup speedup factor raw times (s)
500 0 1.00 1.00 0.50
500 1 0.33 0.00 1.50
500 2 0.65 0.65 0.76
500 4 1.29 1.13 0.38
500 8 2.49 1.36 0.20
500 16 4.43 1.45 0.11
500 32 6.08 1.43 0.08
500 64 6.35 1.36 0.08
1000 0 1.00 1.00 12.37
1000 1 0.62 0.00 20.06
1000 2 1.05 1.05 11.76
1000 4 1.88 1.37 6.57
1000 8 2.75 1.40 4.50
1000 16 2.91 1.31 4.25
1000 32 3.11 1.25 3.98
1000 64 3.38 1.23 3.66
2000 0 1.00 1.00 179.58
2000 1 0.74 0.00 244.13
2000 2 1.43 1.43 125.62
2000 4 2.91 1.71 61.63
2000 8 4.20 1.61 42.77
2000 16 4.97 1.49 36.16
2000 32 4.89 1.37 36.74
2000 64 5.87 1.34 30.60
3000 0 1.00 1.00 510.73
3000 1 0.71 0.00 719.87
3000 2 1.26 1.26 404.31
3000 4 2.30 1.52 221.79
3000 8 3.82 1.56 133.80
3000 16 3.98 1.41 128.48
3000 32 4.07 1.32 125.33
3000 64 4.80 1.30 106.35
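
For reference, the derived columns can be reproduced from the raw times under one consistent reading; the definitions below are assumptions, namely that the speedup is relative to the 0-thread (no OpenMP) run and that the speedup factor is the per-doubling speedup, speedup^(1/log2(threads)):

from math import log

def derived_columns(raw_times):
    """raw_times maps thread count (0 = no OpenMP) to wall time in seconds."""
    t0 = raw_times[0]
    for nthreads in sorted(raw_times):
        speedup = t0 / raw_times[nthreads]
        if nthreads > 1:
            factor = speedup**(1. / log(nthreads, 2))
        elif nthreads == 1:
            factor = 0.
        else:
            factor = 1.
        print nthreads, round(speedup, 2), round(factor, 2), raw_times[nthreads]

# e.g. the N = 500 rows above; small differences come from the raw times
# being rounded to two decimal places in the table
derived_columns({0: 0.50, 1: 1.50, 2: 0.76, 4: 0.38,
                 8: 0.20, 16: 0.11, 32: 0.08, 64: 0.08})
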
  • Posted: 2008-04-11 16:56
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Posting code

Posting some code because it can't be shared using svn.

import time

import numpy
from scipy import weave
from fipy.tools import dump


def multiply(N, nthreads):
    """Time an N by N matrix multiply in inline C, with or without OpenMP."""

    i, j = numpy.indices((N, N), 'd')
    c = numpy.zeros((N, N), 'd')

    a = i + j
    b = i * j

    # C kernel with an OpenMP parallel for over the outer loop.
    code_pragma = """
    int i, j, k;

    #pragma omp parallel shared(a,b,c,nthreads) private(i,j,k) num_threads(nthreads)
      {
      #pragma omp for schedule (dynamic)
      for (i=0; i<N; i++)
        for (j=0; j<N; j++)
          for (k=0; k<N; k++)
            c[i * N + j] += a[i * N + k] * b[k * N + j];
      }
    """

    # Serial reference kernel.
    code_no_pragma = """
    int i, j, k;

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        for (k=0; k<N; k++)
          c[i * N + j] += a[i * N + k] * b[k * N + j];
    """

    args = {'N': N, 'a': a, 'b': b, 'c': c, 'nthreads': nthreads}

    # nthreads == 0 means "no OpenMP at all", not zero threads.
    if nthreads == 0:
        code = code_no_pragma
    else:
        code = code_pragma

    t0 = time.time()

    weave.inline(code,
                 args.keys(),
                 local_dict=args,
                 type_converters=None,
                 compiler='gcc',
                 force=0,
                 verbose=0,
                 extra_compile_args=['-O3 -fopenmp'],
                 extra_link_args=['-O3 -fopenmp'],
                 headers=['<omp.h>'])

    return time.time() - t0


# Rewrite the results file after every run so nothing is lost if a job dies.
results = {}
dump.write(results, 'datafile.gz')

for N in (500, 1000, 2000, 3000, 4000, 5000, 6000):
    results = dump.read('datafile.gz')
    results[N] = {}
    dump.write(results, 'datafile.gz')

    for nthreads in range(9):
        results = dump.read('datafile.gz')
        results[N][nthreads] = []
        dump.write(results, 'datafile.gz')

        for sim in range(5):
            results = dump.read('datafile.gz')
            results[N][nthreads].append(multiply(N, nthreads))
            dump.write(results, 'datafile.gz')
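
To pull numbers back out afterwards, something along these lines works; it relies only on the nested dictionary layout written by the script above:

from fipy.tools import dump

results = dump.read('datafile.gz')
for N in sorted(results):
    for nthreads in sorted(results[N]):
        times = results[N][nthreads]
        if times:
            print N, nthreads, min(times), sum(times) / len(times)
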
  • Posted: 2008-04-07 18:27 (Updated: 2008-04-07 18:30)
  • Author: wd15
  • Categories: (none)
  • Comments (0)

openmp and weave on poole

Some results for OpenMP on poole. The test is an N by N matrix multiplication; the code is here. The speedup factor is certainly improving as the number of threads increases. The results for N=4000 are nonsense, probably because Ed was running stuff at the same time. I should really have recorded the resources available during each simulation by keeping a record of time.clock() and time.time().
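
A minimal sketch of that bookkeeping, comparing CPU time (time.clock()) with wall time (time.time()) around each run; for a serial run, a CPU/wall ratio well below one suggests the machine was busy with something else:

import time

def timed(func, *args):
    """Run func(*args) and return (result, wall-clock seconds, CPU seconds)."""
    wall0, cpu0 = time.time(), time.clock()
    result = func(*args)
    return result, time.time() - wall0, time.clock() - cpu0

# Self-contained check: sleeping takes wall time but almost no CPU time, so
# the CPU/wall ratio comes out close to zero.
result, wall, cpu = timed(time.sleep, 1.0)
print wall, cpu, cpu / wall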

N | speedup factor for 8 threads
500 | 1.45
1000 | 1.78
2000 | 1.80
3000 | 1.82
4000 | 1.30
5000 | 1.76

More of the data. The columns are N, threads, speedup, speedup factor and raw time (s).

500 0 1.00 1.00 0.67
500 1 0.52 0.00 1.30
500 2 0.97 0.97 0.69
500 3 1.41 1.24 0.48
500 4 1.82 1.35 0.37
500 5 2.22 1.41 0.30
500 6 2.65 1.46 0.25
500 7 2.86 1.45 0.23
500 8 3.03 1.45 0.22
1000 0 1.00 1.00 11.52
1000 1 0.85 0.00 13.55
1000 2 1.65 1.65 6.98
1000 3 2.35 1.72 4.89
1000 4 3.05 1.75 3.78
1000 5 3.84 1.79 3.00
1000 6 4.43 1.78 2.60
1000 7 5.12 1.79 2.25
1000 8 5.69 1.78 2.03
2000 0 1.00 1.00 113.21
2000 1 0.89 0.00 127.71
2000 2 1.69 1.69 67.07
2000 3 2.46 1.77 45.99
2000 4 3.21 1.79 35.27
2000 5 3.96 1.81 28.56
2000 6 4.64 1.81 24.38
2000 7 5.22 1.80 21.68
2000 8 5.81 1.80 19.48
3000 0 1.00 1.00 385.15
3000 1 0.89 0.00 431.11
3000 2 1.70 1.70 226.09
3000 3 2.49 1.78 154.88
3000 4 3.24 1.80 118.80
3000 5 4.02 1.82 95.81
3000 6 4.79 1.83 80.46
3000 7 5.43 1.83 70.93
3000 8 6.03 1.82 63.85
4000 0 1.00 1.00 2706.37
4000 1 0.78 0.00 3488.43
4000 2 1.31 1.31 2066.20
4000 3 1.77 1.44 1526.49
4000 4 2.07 1.44 1308.90
4000 5 2.29 1.43 1182.51
4000 6 2.28 1.38 1185.99
4000 7 2.16 1.32 1252.05
4000 8 2.21 1.30 1224.98
5000 0 1.00 1.00 1909.29
5000 1 0.78 0.00 2457.51
5000 2 1.61 1.61 1187.01
5000 3 2.45 1.76 780.55
5000 4 3.24 1.80 589.95
5000 5 3.96 1.81 481.70
5000 6 4.47 1.78 427.20
5000 7 5.01 1.78 380.78
5000 8 5.47 1.76 349.23

Machine details

   $ uname -a
   Linux poole 2.6.18-6-amd64 #1 SMP Sun Feb 10 17:50:19 UTC 2008 x86_64 GNU/Linux

8 of these

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8220
stepping        : 3
cpu MHz         : 2812.978
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 5630.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

and memory

MemTotal:     32964928 kB
  • Posted: 2008-04-04 22:06
  • Author: wd15
  • Categories: (none)
  • Comments (0)

Curvature expressions in FiPy

Curvature | Expression | FiPy Expression
iso | $ \nabla \cdot \left( \frac{ \nabla \phi }{|\nabla \phi| } \right) $ | (phi.getFaceGrad() / phi.getFaceGrad().getMag()).getDivergence()
mean | $ \nabla \cdot \left( \frac{ \nabla \phi }{ \left(1 + |\nabla \phi|^2 \right)^{1/2} } \right) $ | (phi.getFaceGrad() / numerix.sqrt(1 + phi.getFaceGrad().getMag()**2)).getDivergence()
Gauss | $ \frac{ \phi_{xx} \phi_{yy} - \phi_{xy}^2 }{\left(1 + |\nabla \phi|^2 \right)^2} $ | ((phi.getFaceGrad() * (1,0)).getDivergence() * (phi.getFaceGrad() * (0,1)).getDivergence() - phi.getFaceGrad()[::-1].getDivergence() / 2) / (1 + phi.getFaceGrad().getMag()**2)**2

Is there a vector form for the Gaussian curvature? It might help simplify the FiPy expression.
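
For convenience, the expressions in the table can be wrapped as plain functions of the level set variable. This is a minimal sketch that only uses the getter-style calls already quoted above, so it assumes a FiPy version in which getFaceGrad(), getMag() and getDivergence() exist:

from fipy.tools import numerix

def iso_curvature(phi):
    # div( grad(phi) / |grad(phi)| )
    return (phi.getFaceGrad() / phi.getFaceGrad().getMag()).getDivergence()

def mean_curvature(phi):
    # div( grad(phi) / sqrt(1 + |grad(phi)|^2) )
    return (phi.getFaceGrad()
            / numerix.sqrt(1 + phi.getFaceGrad().getMag()**2)).getDivergence()

def gauss_curvature(phi):
    # (phi_xx phi_yy - phi_xy^2) / (1 + |grad(phi)|^2)^2, written exactly as
    # in the table above
    grad = phi.getFaceGrad()
    return (((grad * (1, 0)).getDivergence() * (grad * (0, 1)).getDivergence()
             - grad[::-1].getDivergence() / 2)
            / (1 + grad.getMag()**2)**2)

# A quick sanity check for iso_curvature: for a signed distance function of a
# 2D circle of radius r, the value away from the interface should be close to 1/r.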

A very good differential geometry book: http://www.archive.org/details/differentialgeom003681mbp.

  • Posted: 2008-04-04 20:25
  • Author: wd15
  • Categories: (none)
  • Comments (0)