Simulations with varying solver tolerance
ID | status | condor ID | Tolerance | Step Time (s) | Steps | Distance (R) | Sim Time (s) | Bill's movie
18 | stopped | 12087.0 |  | 120 | 460 | 0.13 | 0.00256 | movie
19 | stopped | 12088.0 |  | 96 | 460 | 0.13 | 0.00256 | movie
20 | stopped | 12089.0 |  | 66 | 460 | 0.13 | 0.00256 | movie
21 | stopped | 12090.0 |  | 156 | 460 | 0.13 | 0.00256 | movie
22 | running | 12093.0 |  | -- | -- | -- | -- | movie
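For reference, a minimal sketch of how an explicit solver tolerance is passed to FiPy (this uses the current FiPy names, LinearPCGSolver and constrain, and an illustrative 1D diffusion problem, not the actual simulation script):
{{{#!python
# Sketch only: an illustrative diffusion solve with an explicit solver tolerance.
# The mesh, boundary condition, tolerance and iteration count are placeholders.
from fipy import Grid1D, CellVariable, TransientTerm, DiffusionTerm, LinearPCGSolver

mesh = Grid1D(nx=100, dx=1.0)
phi = CellVariable(mesh=mesh, value=0.0)
phi.constrain(1.0, mesh.facesLeft)

eq = TransientTerm() == DiffusionTerm(coeff=1.0)
solver = LinearPCGSolver(tolerance=1e-10, iterations=1000)   # the knob varied above

for step in range(10):
    eq.solve(var=phi, dt=1.0, solver=solver)
}}}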
Notes
- These simulations need to run for longer to ascertain the long-term effects of larger solver tolerances.
- According to the Trilinos manuals, the default stopping criterion is the relative residual, ||r|| / ||r_0|| < tolerance.
- I'm still confused about how the tolerance relates to the error. Certainly the error is linear in the residual (Ae = r), which would say that an order of magnitude decrease in the tolerance leads to an order of magnitude decrease in the error. However, if the tolerance is 1 then no iterations should be done at all, which would be pathological. I am running a tolerance = 1 case, so we'll see. A small numpy illustration of the linear relation follows this list.
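A minimal numpy illustration of that linear relation (the matrix below is an arbitrary well-conditioned test matrix, not one of the simulation matrices): since Ae = r, the error satisfies ||e|| <= ||A^-1|| ||r||, so a tenfold drop in the residual norm drops the error bound tenfold.
{{{#!python
import numpy as np

np.random.seed(0)

n = 50
A = np.random.rand(n, n) + n * np.eye(n)   # arbitrary well-conditioned test matrix
x_true = np.random.rand(n)
b = A.dot(x_true)

Ainv_norm = np.linalg.norm(np.linalg.inv(A), 2)

# mimic successively tighter solves with successively smaller perturbations
for scale in (1e-2, 1e-3, 1e-4, 1e-5):
    x_approx = x_true + scale * np.random.rand(n)
    r = b - A.dot(x_approx)   # residual
    e = x_true - x_approx     # error; satisfies A e = r
    print(np.linalg.norm(r), np.linalg.norm(e), Ainv_norm * np.linalg.norm(r))
}}}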
Altix and !FiPy. What can we expect?
What are the expected performance benefits for FiPy using the Altix machine?
The following questions are pertinent.
- Which examples?
- What proportion of time is spent in routines that could benefit from a parallel capability?
- How long do the jobs currently take on one processor?
- How much memory do the jobs currently use?
job ID | PID | Status | Radius Drop Size | N | Proportion | Solver | Max Memory (Gb) | Steps | Time Per Step (s) | Time (days) | Simulation time (s) | Altix (k=1.6) (days) | Poole (k=1.6) (days)
14 |  | stopped |  | 34000 | 0.774 (@) | 0.542 | 0.41 | 450 | 185 | 0.96 | 0.0016 | 0.21 | 0.28
15 |  | stopped |  | 136000 | 0.806 | 0.579 | 0.99 | 680 | 863 | 6.8 | 0.00226 | 1.5 | 2.0
16 | poole 4430 | running |  | 544000 | 0.822 (+) | 0.562 (+) | 3.3 | 929 (%) | 3336.0 | 35.9 (%) | 0.00275 (%) | 8.0 | 11
17 | poole 4714 | running |  | 2176000 | 0.798 (&) | 0.496 (&) | 13 | 50667 (#) | 11034.0 | 190 (647.1 (#)) | 0.0293 (#) | 42 | 56
 |  |  |  | 4896000 |  |  | 28 |  |  | 502 |  | 112 | 148
 |  |  |  |  |  |  | 483 |  |  | 15420 |  | 3440 | 4552
- (+) still need many more data points
- (%) reasonable extrapolation, waiting for more data points
- (#) very bad extrapolation, waiting for more data points
- (&) only at 200 steps; this number rises until 600 steps
- (@) steps are too quick to get a good number
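One crude way to answer the "what can we expect?" question from the Proportion column is Amdahl's law (note this is a plain Amdahl bound, not the k=1.6 extrapolation used for the Altix and Poole columns; the processor counts are illustrative):
{{{#!python
# Amdahl's law: best-case speedup when a proportion p of the run time parallelizes
# perfectly and the remaining (1 - p) stays serial.
def amdahl(p, nproc):
    return 1.0 / ((1.0 - p) + p / nproc)

# p values from the Proportion column above; processor counts are illustrative
for p in (0.774, 0.806, 0.822, 0.798):
    print(p, [round(amdahl(p, n), 2) for n in (2, 4, 8, 16)])
}}}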
Notes
The simulations' wall clock time is the time required for the triple point to cover a distance of R / 10.
The memory usage seems to go like a power law in the number of cells, M = c1 * N^p1.
The simulation times are estimated with a similar power law, T = c2 * N^p2.
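The fitted constants can be re-derived from the table with a log-log least-squares fit; a minimal sketch using only the N, Max Memory and Time Per Step columns above (the exponents printed here are re-fits, not the original values):
{{{#!python
import numpy as np

# data taken from the table above
N       = np.array([34000., 136000., 544000., 2176000.])
memory  = np.array([0.41, 0.99, 3.3, 13.])        # Max Memory (Gb)
perstep = np.array([185., 863., 3336., 11034.])   # Time Per Step (s)

# fit y = c * N**p by least squares in log-log space
for label, y in (('memory', memory), ('time per step', perstep)):
    p, logc = np.polyfit(np.log(N), np.log(y), 1)
    print('%s: exponent %.2f, prefactor %.3g' % (label, p, np.exp(logc)))
}}}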
The following figure shows the time spent in routines that could benefit from a parallel capability.
A quick study of solver iterations and tolerance may also yield gains.
Other functions are costly but not readily reimplemented in parallel:
Radius Drop Size | N | FillComplete | InsertGlobalValues | array
 | 34000 | 0.068 | 0.039 | 0.015
 | 136000 | 0.052 | 0.024 | 0.017
 | 544000 | 0.052 | 0.026 | 0.017
 | 2176000 | 0.057 | 0.027 | 0.020
Poor weave results on the Altix
The times below are very bad. I have no idea why this is the case; poole had great speed-ups. The "speedup factor" column is unpacked in a short sketch after the table.
N | Threads | speedup | speedup factor | raw times (s) |
500 | 0 | 1.00 | 1.00 | 0.50 |
500 | 1 | 0.33 | 0.00 | 1.50 |
500 | 2 | 0.65 | 0.65 | 0.76 |
500 | 4 | 1.29 | 1.13 | 0.38 |
500 | 8 | 2.49 | 1.36 | 0.20 |
500 | 16 | 4.43 | 1.45 | 0.11 |
500 | 32 | 6.08 | 1.43 | 0.08 |
500 | 64 | 6.35 | 1.36 | 0.08 |
1000 | 0 | 1.00 | 1.00 | 12.37 |
1000 | 1 | 0.62 | 0.00 | 20.06 |
1000 | 2 | 1.05 | 1.05 | 11.76 |
1000 | 4 | 1.88 | 1.37 | 6.57 |
1000 | 8 | 2.75 | 1.40 | 4.50 |
1000 | 16 | 2.91 | 1.31 | 4.25 |
1000 | 32 | 3.11 | 1.25 | 3.98 |
1000 | 64 | 3.38 | 1.23 | 3.66 |
2000 | 0 | 1.00 | 1.00 | 179.58 |
2000 | 1 | 0.74 | 0.00 | 244.13 |
2000 | 2 | 1.43 | 1.43 | 125.62 |
2000 | 4 | 2.91 | 1.71 | 61.63 |
2000 | 8 | 4.20 | 1.61 | 42.77 |
2000 | 16 | 4.97 | 1.49 | 36.16 |
2000 | 32 | 4.89 | 1.37 | 36.74 |
2000 | 64 | 5.87 | 1.34 | 30.60 |
3000 | 0 | 1.00 | 1.00 | 510.73 |
3000 | 1 | 0.71 | 0.00 | 719.87 |
3000 | 2 | 1.26 | 1.26 | 404.31 |
3000 | 4 | 2.30 | 1.52 | 221.79 |
3000 | 8 | 3.82 | 1.56 | 133.80 |
3000 | 16 | 3.98 | 1.41 | 128.48 |
3000 | 32 | 4.07 | 1.32 | 125.33 |
3000 | 64 | 4.80 | 1.30 | 106.35 |
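For reference, the "speedup factor" column appears to be the per-doubling-of-threads factor k, i.e. speedup = k**log2(threads); this reading is my interpretation (it is not stated explicitly above), but it reproduces the tabulated values, as the spot checks below show:
{{{#!python
import math

def doubling_factor(speedup, nthreads):
    # per-doubling speedup factor k such that speedup == k ** log2(nthreads)
    return speedup ** (1.0 / math.log(nthreads, 2))

# spot checks against rows of the table above
print(round(doubling_factor(2.49, 8), 2))    # N=500,  8 threads  -> 1.36
print(round(doubling_factor(4.43, 16), 2))   # N=500,  16 threads -> 1.45
print(round(doubling_factor(2.91, 4), 2))    # N=2000, 4 threads  -> 1.71
}}}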
Posting code
Posting some code because it can't be shared using svn.
{{{#!python
import numpy
from scipy import weave

def multiply(N, nthreads):

    i, j = numpy.indices((N, N), 'd')
    c = numpy.zeros((N, N), 'd')

    a = i + j
    b = i * j

    code_pragma = """
    int i, j, k;

    #pragma omp parallel shared(a,b,c,nthreads) private(i,j,k) num_threads(nthreads)
    {

    #pragma omp for schedule (dynamic)
    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];

    }
    """

    code_no_pragma = """
    int i, j, k;

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    """

    args = {'N' : N, 'a' : a, 'b' : b, 'c' : c, 'nthreads' : nthreads}

    if nthreads == 0:
        code = code_no_pragma
    else:
        code = code_pragma

    import time

    t0 = time.time()

    weave.inline(code,
                 args.keys(),
                 local_dict=args,
                 type_converters=None,
                 compiler='gcc',
                 force=0,
                 verbose=0,
                 extra_compile_args=['-O3 -fopenmp'],
                 extra_link_args=['-O3 -fopenmp'],
                 headers=['<omp.h>'])

    return time.time() - t0

from fipy.tools import dump

results = {}

dump.write(results, 'datafile.gz')

for N in (500, 1000, 2000, 3000, 4000, 5000, 6000):

    results = dump.read('datafile.gz')
    results[N] = {}
    dump.write(results, 'datafile.gz')

    for nthreads in range(9):

        results = dump.read('datafile.gz')
        results[N][nthreads] = []
        dump.write(results, 'datafile.gz')

        for sim in range(5):

            results = dump.read('datafile.gz')
            results[N][nthreads].append(multiply(N, nthreads))
            dump.write(results, 'datafile.gz')
}}}
OpenMP and weave on poole
Some results for OpenMP on poole. The test is an N by N matrix multiplication; the code is posted above. The speedup factor is certainly improving as the threads increase. The results for N=4000 are nonsense, probably Ed running stuff on the machine at the same time. I should really have recorded the resources available during each simulation by keeping a record with time.clock() and time.time(); a sketch of that bookkeeping follows the summary table below.
N | speedup factor for 8 threads |
500 | 1.45 |
1000 | 1.78 |
2000 | 1.80 |
3000 | 1.82 |
4000 | 1.30 |
5000 | 1.76 |
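A minimal sketch of the bookkeeping suggested above (nothing like this was actually run; timed() and the wall/CPU comparison are just a suggestion): recording both wall time and CPU time around each call would show when the machine was shared, since wall time grows relative to CPU time under contention.
{{{#!python
import time

def timed(func, *args):
    # run func(*args) and return (result, wall-clock seconds, CPU seconds)
    # time.clock() is the Python 2-era CPU timer mentioned in the note above
    w0, c0 = time.time(), time.clock()
    result = func(*args)
    return result, time.time() - w0, time.clock() - c0

# e.g. wrapping the multiply() benchmark posted earlier on this page:
# _, wall, cpu = timed(multiply, 1000, 0)
# for the serial (nthreads=0) case, wall much larger than cpu suggests a loaded machine
}}}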
More of the data, with the same columns as the Altix table above.
N | Threads | speedup | speedup factor | raw times (s)
500 | 0 | 1.00 | 1.00 | 0.67
500 | 1 | 0.52 | 0.00 | 1.30 |
500 | 2 | 0.97 | 0.97 | 0.69 |
500 | 3 | 1.41 | 1.24 | 0.48 |
500 | 4 | 1.82 | 1.35 | 0.37 |
500 | 5 | 2.22 | 1.41 | 0.30 |
500 | 6 | 2.65 | 1.46 | 0.25 |
500 | 7 | 2.86 | 1.45 | 0.23 |
500 | 8 | 3.03 | 1.45 | 0.22 |
1000 | 0 | 1.00 | 1.00 | 11.52 |
1000 | 1 | 0.85 | 0.00 | 13.55 |
1000 | 2 | 1.65 | 1.65 | 6.98 |
1000 | 3 | 2.35 | 1.72 | 4.89 |
1000 | 4 | 3.05 | 1.75 | 3.78 |
1000 | 5 | 3.84 | 1.79 | 3.00 |
1000 | 6 | 4.43 | 1.78 | 2.60 |
1000 | 7 | 5.12 | 1.79 | 2.25 |
1000 | 8 | 5.69 | 1.78 | 2.03 |
2000 | 0 | 1.00 | 1.00 | 113.21 |
2000 | 1 | 0.89 | 0.00 | 127.71 |
2000 | 2 | 1.69 | 1.69 | 67.07 |
2000 | 3 | 2.46 | 1.77 | 45.99 |
2000 | 4 | 3.21 | 1.79 | 35.27 |
2000 | 5 | 3.96 | 1.81 | 28.56 |
2000 | 6 | 4.64 | 1.81 | 24.38 |
2000 | 7 | 5.22 | 1.80 | 21.68 |
2000 | 8 | 5.81 | 1.80 | 19.48 |
3000 | 0 | 1.00 | 1.00 | 385.15 |
3000 | 1 | 0.89 | 0.00 | 431.11 |
3000 | 2 | 1.70 | 1.70 | 226.09 |
3000 | 3 | 2.49 | 1.78 | 154.88 |
3000 | 4 | 3.24 | 1.80 | 118.80 |
3000 | 5 | 4.02 | 1.82 | 95.81 |
3000 | 6 | 4.79 | 1.83 | 80.46 |
3000 | 7 | 5.43 | 1.83 | 70.93 |
3000 | 8 | 6.03 | 1.82 | 63.85 |
4000 | 0 | 1.00 | 1.00 | 2706.37 |
4000 | 1 | 0.78 | 0.00 | 3488.43 |
4000 | 2 | 1.31 | 1.31 | 2066.20 |
4000 | 3 | 1.77 | 1.44 | 1526.49 |
4000 | 4 | 2.07 | 1.44 | 1308.90 |
4000 | 5 | 2.29 | 1.43 | 1182.51 |
4000 | 6 | 2.28 | 1.38 | 1185.99 |
4000 | 7 | 2.16 | 1.32 | 1252.05 |
4000 | 8 | 2.21 | 1.30 | 1224.98 |
5000 | 0 | 1.00 | 1.00 | 1909.29 |
5000 | 1 | 0.78 | 0.00 | 2457.51 |
5000 | 2 | 1.61 | 1.61 | 1187.01 |
5000 | 3 | 2.45 | 1.76 | 780.55 |
5000 | 4 | 3.24 | 1.80 | 589.95 |
5000 | 5 | 3.96 | 1.81 | 481.70 |
5000 | 6 | 4.47 | 1.78 | 427.20 |
5000 | 7 | 5.01 | 1.78 | 380.78 |
5000 | 8 | 5.47 | 1.76 | 349.23 |
Machine details
{{{
$ uname -a
Linux poole 2.6.18-6-amd64 #1 SMP Sun Feb 10 17:50:19 UTC 2008 x86_64 GNU/Linux
}}}
8 of these processors (from /proc/cpuinfo)
{{{
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8220
stepping        : 3
cpu MHz         : 2812.978
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 5630.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
}}}
and memory (from /proc/meminfo)
{{{
MemTotal: 32964928 kB
}}}
Curvature expressions in !FiPy
Curvature | Expression | FiPy Expression
iso | $\kappa = \nabla \cdot \left( \nabla\phi / |\nabla\phi| \right)$ | (phi.getFaceGrad() / phi.getFaceGrad().getMag()).getDivergence()
mean | $\kappa = \nabla \cdot \left( \nabla\phi / \sqrt{1 + |\nabla\phi|^2} \right)$ | (phi.getFaceGrad() / numerix.sqrt(1 + phi.getFaceGrad().getMag()**2)).getDivergence()
Gauss | $K = \left( \phi_{xx}\phi_{yy} - \phi_{xy}^2 \right) / \left( 1 + |\nabla\phi|^2 \right)^2$ | ((phi.getFaceGrad() * (1,0)).getDivergence() * (phi.getFaceGrad() * (0,1)).getDivergence() - phi.getFaceGrad()[::-1].getDivergence() / 2) / (1 + phi.getFaceGrad().getMag()**2)**2
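A quick numpy finite-difference sanity check of the iso expression, independent of FiPy (the grid, radius and interface band below are illustrative): for a circular level set the exact curvature at the interface is 1/R.
{{{#!python
import numpy as np

R, L, n = 1.0, 3.0, 400
x = np.linspace(-L, L, n)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing='ij')

phi = np.sqrt(X**2 + Y**2) - R                 # signed distance to a circle of radius R

phix, phiy = np.gradient(phi, dx, dx)
mag = np.maximum(np.sqrt(phix**2 + phiy**2), 1e-12)
nx, ny = phix / mag, phiy / mag                # unit normal, grad(phi) / |grad(phi)|

# iso curvature: divergence of the unit normal
kappa = np.gradient(nx, dx, dx)[0] + np.gradient(ny, dx, dx)[1]

r = np.sqrt(X**2 + Y**2)
near = np.abs(r - R) < 2 * dx                  # cells close to the interface
print(kappa[near].mean(), 1.0 / R)             # these agree to within a few percent
}}}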
Is there a vector form for the Gaussian curvature? It might simplify the FiPy expression.
A very good differential geometry book: http://www.archive.org/details/differentialgeom003681mbp