openmp on the altix
The following are some timing results on the Altix. The test case is a matrix multiplication; the run time scales as N**3, where N is the side length of the matrix. The numbers are averaged over 5 runs.
N | Threads | Speedup | Scale factor | Raw time (s) |
1000 | 1 | 1.00 | 1.00 | 9.99 |
1000 | 2 | 1.69 | 1.69 | 5.93 |
1000 | 4 | 2.86 | 1.69 | 3.23 |
1000 | 8 | 4.66 | 1.67 | 2.09 |
1000 | 16 | 8.05 | 1.68 | 1.17 |
1000 | 32 | 8.08 | 1.52 | 1.05 |
1000 | 64 | 9.53 | 1.46 | 0.95 |
2000 | 1 | 1.00 | 1.00 | 134.38 |
2000 | 2 | 1.79 | 1.79 | 75.21 |
2000 | 4 | 3.26 | 1.81 | 38.96 |
2000 | 8 | 5.85 | 1.80 | 22.89 |
2000 | 16 | 10.95 | 1.82 | 12.28 |
2000 | 32 | 13.35 | 1.68 | 8.03 |
2000 | 64 | 18.32 | 1.62 | 6.81 |
4000 | 1 | 1.00 | 1.00 | 1139.32 |
4000 | 2 | 1.77 | 1.77 | 640.28 |
4000 | 4 | 3.24 | 1.80 | 348.68 |
4000 | 8 | 5.90 | 1.81 | 190.78 |
4000 | 16 | 11.03 | 1.82 | 102.81 |
4000 | 32 | 16.39 | 1.75 | 63.89 |
4000 | 64 | 22.61 | 1.68 | 44.12 |
500 | 1 | 1.00 | 1.00 | 0.32 |
500 | 2 | 1.49 | 1.49 | 0.21 |
500 | 4 | 2.91 | 1.71 | 0.11 |
500 | 8 | 2.44 | 1.35 | 0.11 |
500 | 16 | 2.91 | 1.31 | 0.11 |
500 | 32 | 1.46 | 1.08 | 0.21 |
500 | 64 | 1.46 | 1.07 | 0.21 |
The important consideration is the scale factor and how well it holds up as the number of threads increases. The C code used on the Altix is below. I am still having problems running FiPy there because distutils is missing. As a point of reference, an Intel article notes that "More typical code will have a lower limit; 1.7x-1.8x are generally considered very good speedup numbers for code run on two threads" (http://cache-www.intel.com/cd/00/00/31/64/316421_316421.pdf). There is a start-up cost associated with spawning threads, so it is recommended that the same threads be kept alive for the duration of the program; this may be difficult to achieve with weave.
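Judging from the table (this is inferred, not stated in the run script), the scale factor is the average speedup gained per doubling of the thread count:

scale factor = speedup^(1 / log2(threads))

For example, for N = 2000 on 16 threads: 10.95^(1/4) ≈ 1.82, matching the tabulated value.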
Note:
The results still need to be compared against unthreaded (serial) code.
/******************************************************************************
 * FILE: omp_mm.c
 * DESCRIPTION:
 *   OpenMP Example - Matrix Multiply - C Version
 *   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
 *   according to a predefined chunk size.
 * AUTHOR: Blaise Barney
 * LAST REVISED: 06/28/05
 *
 * Build with an OpenMP-capable compiler, e.g. "gcc -fopenmp omp_mm.c"
 * (the exact flags used on the Altix are not recorded here). Because the
 * loops use schedule(runtime), the OMP_SCHEDULE environment variable selects
 * the schedule and OMP_NUM_THREADS selects the thread count.
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int tid, nthreads, i, j, k, chunk;
    int N = 4000;

    /* Matrices are stored as flat N*N arrays of doubles. */
    double *a = malloc(N * N * sizeof(double));
    double *b = malloc(N * N * sizeof(double));
    double *c = malloc(N * N * sizeof(double));
    printf("finished allocating\n");

    chunk = 10;   /* loop iteration chunk size (used with schedule(static, chunk)) */

    /*** Spawn a parallel region explicitly scoping all variables ***/
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Starting matrix multiply example with %d threads\n", nthreads);
        }

        /*** Initialize matrices ***/
        /* (alternative tried: #pragma omp for schedule (static, chunk)) */
        #pragma omp for schedule (runtime)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i * N + j] = i + j;

        #pragma omp for schedule (runtime)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                b[i * N + j] = i * j;

        #pragma omp for schedule (runtime)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                c[i * N + j] = 0;

        /*** Do matrix multiply sharing iterations on outer loop ***/
        #pragma omp for schedule (runtime)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
    } /*** End of parallel region ***/

    free(a);
    free(b);
    free(c);

    return 0;
}
openmp/weave timings
Matrix multiplication in weave scales well with OpenMP. The code is here. The observed speedup with two threads is close to ideal.
For code doing large array multiplications of size N, the following speedups were observed with two threads.
N | Speedup |
1E7 | 1.51 |
1E6 | 1.37 |
1E5 | 1.39 |
1E4 | 1.0 |
Note that the number of repetitions of the Python-level loop was increased in inverse proportion to the array size, presumably so that each case does a comparable amount of total work.
The question remains whether we can get speedups for the smaller arrays typically used in FiPy.
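For reference, the timed kernel is essentially an elementwise array multiplication. Below is a minimal standalone C sketch of that kind of OpenMP loop (the array names, size, and repeat count are illustrative; the actual weave script is not reproduced here):

/* Sketch of an OpenMP elementwise array multiplication of the kind timed
 * above (illustrative only). With two threads, this loop is what gave the
 * ~1.5x speedup for N = 1E7 in the table. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int N = 10000000;    /* array size, e.g. 1E7 */
    int loops = 1;       /* scaled inversely with N in the timings */
    int i, l;
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    for (i = 0; i < N; i++) {
        a[i] = (double) i;
        b[i] = (double) (N - i);
    }

    for (l = 0; l < loops; l++) {
        /* the elementwise multiply is the part OpenMP parallelizes */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] * b[i];
    }

    printf("c[N/2] = %g\n", c[N / 2]);
    free(a); free(b); free(c);
    return 0;
}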
openmp and weave
The following steps are required to build OpenMP support that works with weave. I've tried this on poole and rosie.
1) Install mpfr version 2.3.1 (needed to build gcc).
$ ../configure --prefix=${USR}
$ make
$ make install
2) Get gcc version 4.3 or 4.2.4 (when it is released). One of these versions is needed; otherwise you will get the error "ImportError: libgomp.so.1: shared object cannot be dlopen()ed". libgomp was not set up for dynamic loading in earlier OpenMP-capable gcc versions.
3) Create a new directory and configure with
$ ../configure --prefix=${USR} --disable-multilib
where ${USR} points somewhere in your local directories.
4) Build and install:
$ make
$ make install
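Once the new gcc is installed, a trivial OpenMP program compiled with it and -fopenmp is a quick way to confirm that threads are spawned and that the dynamically loadable libgomp is found, before involving weave (a sketch; the file name and install path are illustrative):

/* omp_check.c -- minimal OpenMP smoke test (illustrative).
 * Compile with the freshly built compiler, e.g.
 *     ${USR}/bin/gcc -fopenmp omp_check.c -o omp_check
 * and run with OMP_NUM_THREADS set to the desired thread count. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* each thread reports its id; thread 0 also reports the team size */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid == 0)
            printf("running with %d threads\n", omp_get_num_threads());
        printf("hello from thread %d\n", tid);
    }
    return 0;
}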