OpenMP on the Altix
The following are some timing results for OpenMP on the Altix. The test case is a multiplication of two N x N matrices, so the run time goes like N**3. Each number is an average over 5 runs.
N | Threads | Speedup | Scale factor | Raw time (s) |
1000 | 1 | 1.00 | 1.00 | 9.99 |
1000 | 2 | 1.69 | 1.69 | 5.93 |
1000 | 4 | 2.86 | 1.69 | 3.23 |
1000 | 8 | 4.66 | 1.67 | 2.09 |
1000 | 16 | 8.05 | 1.68 | 1.17 |
1000 | 32 | 8.08 | 1.52 | 1.05 |
1000 | 64 | 9.53 | 1.46 | 0.95 |
2000 | 1 | 1.00 | 1.00 | 134.38 |
2000 | 2 | 1.79 | 1.79 | 75.21 |
2000 | 4 | 3.26 | 1.81 | 38.96 |
2000 | 8 | 5.85 | 1.80 | 22.89 |
2000 | 16 | 10.95 | 1.82 | 12.28 |
2000 | 32 | 13.35 | 1.68 | 8.03 |
2000 | 64 | 18.32 | 1.62 | 6.81 |
4000 | 1 | 1.00 | 1.00 | 1139.32 |
4000 | 2 | 1.77 | 1.77 | 640.28 |
4000 | 4 | 3.24 | 1.80 | 348.68 |
4000 | 8 | 5.90 | 1.81 | 190.78 |
4000 | 16 | 11.03 | 1.82 | 102.81 |
4000 | 32 | 16.39 | 1.75 | 63.89 |
4000 | 64 | 22.61 | 1.68 | 44.12 |
500 | 1 | 1.00 | 1.00 | 0.32 |
500 | 2 | 1.49 | 1.49 | 0.21 |
500 | 4 | 2.91 | 1.71 | 0.11 |
500 | 8 | 2.44 | 1.35 | 0.11 |
500 | 16 | 2.91 | 1.31 | 0.11 |
500 | 32 | 1.46 | 1.08 | 0.21 |
500 | 64 | 1.46 | 1.07 | 0.21 |
The important consideration is the scale factor, the average speedup gained per doubling of the thread count (speedup**(1/log2(Threads))), and how well it holds up as the number of threads increases. The C code used on the Altix is below. I am still having problems with FiPy because distutils is missing.
As a point of reference I found this quote in an article: "More typical code will have a lower limit; 1.7x-1.8x are generally considered very good speedup numbers for code run on two threads" (http://cache-www.intel.com/cd/00/00/31/64/316421_316421.pdf).
There is a start-up time associated with threading, so it is recommended that threads are kept alive for the duration of a program's run. This may be difficult to achieve with weave.
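To illustrate the point about thread start-up cost, here is a rough sketch contrasting one long-lived parallel region with a region that is re-opened on every pass. The functions `relax_persistent` and `relax_respawn` and the update rule are hypothetical, chosen only to give the loops something to do. The code also compiles without OpenMP, in which case the pragmas are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Run `sweeps` passes of x[i] = 0.5 * x[i] + 1.0 inside ONE parallel region,
 * so the thread team is set up once and reused for every pass. */
void relax_persistent(double *x, int n, int sweeps)
{
    int s, i;
    assert(x != NULL && n >= 0 && sweeps >= 0);
    #pragma omp parallel private(s, i) shared(x, n, sweeps)
    {
        for (s = 0; s < sweeps; s++) {
            /* work-sharing loop; its implicit barrier keeps the passes ordered */
            #pragma omp for schedule(static)
            for (i = 0; i < n; i++)
                x[i] = 0.5 * x[i] + 1.0;
        }
    }
}

/* Same arithmetic, but a parallel region is entered and left on every pass. */
void relax_respawn(double *x, int n, int sweeps)
{
    int s, i;
    assert(x != NULL && n >= 0 && sweeps >= 0);
    for (s = 0; s < sweeps; s++) {
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++)
            x[i] = 0.5 * x[i] + 1.0;
    }
}
```

Both functions produce the same values. Many OpenMP runtimes keep their worker threads alive between regions, so the actual difference in cost is best measured on the target machine rather than assumed.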
Note:
The speedups above are relative to the OpenMP code run with one thread. The results still need to be tested against unthreaded code.
/******************************************************************************
* FILE: omp_mm.c
* DESCRIPTION:
*   OpenMp Example - Matrix Multiply - C Version
*   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
*   according to a predefined chunk size.
* AUTHOR: Blaise Barney
* LAST REVISED: 06/28/05
******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
  int tid, nthreads, i, j, k, chunk;
  int N=4000;
  // int loop;
  // int loops=1;

  /* N x N matrices stored row-major in flat arrays of doubles.
     The element size is sizeof(double), not sizeof(double *). */
  double *a = malloc((size_t)N * N * sizeof(double));
  double *b = malloc((size_t)N * N * sizeof(double));
  double *c = malloc((size_t)N * N * sizeof(double));
  if (a == NULL || b == NULL || c == NULL)
    {
    fprintf(stderr, "allocation failed\n");
    return 1;
    }
  printf("finished allocating\n");

  chunk = 10;  /* set loop iteration chunk size (only used by the static schedule) */

  /*** Spawn a parallel region explicitly scoping all variables ***/
  #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
  {
    tid = omp_get_thread_num();
    if (tid == 0)
      {
      nthreads = omp_get_num_threads();
      printf("Starting matrix multiply example with %d threads\n",nthreads);
      //printf("Initializing matrices...\n");
      }

    /*** Initialize matrices ***/
    // #pragma omp for schedule (static, chunk)
    #pragma omp for schedule (runtime)
    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        a[i * N + j]= i+j;

    // #pragma omp for schedule (static, chunk)
    #pragma omp for schedule (runtime)
    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        b[i * N + j]= i*j;

    // #pragma omp for schedule (static, chunk)
    #pragma omp for schedule (runtime)
    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        c[i * N + j]= 0;

    /*** Do matrix multiply sharing iterations on outer loop ***/
    /*** Display who does which iterations for demonstration purposes ***/
    // printf("Thread %d starting matrix multiply...\n",tid);
    // #pragma omp for schedule (static, chunk)
    // for(loop=0; loop<loops; loop++)
    #pragma omp for schedule (runtime)
    for(i=0; i<N; i++)
      for(j=0; j<N; j++)
        for (k=0; k<N; k++)
          c[i * N + j] += a[i * N + k] * b[k * N + j];
  }  /*** End of parallel region ***/

  /*** Print results ***/
  //printf("******************************************************\n");
  //printf("Result Matrix:\n");
  //for (i=0; i<NRA; i++)
  //  {
  //  for (j=0; j<NCB; j++)
  //    printf("%6.2f ", c[i][j]);
  //  printf("\n");
  //  }
  //printf("******************************************************\n");
  //printf ("Done.\n");

  free(a);
  free(b);
  free(c);
  return 0;
}
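The listing above does not show how the raw times were taken, and the note asks for a comparison against unthreaded code. One possible approach, offered as a sketch only, is a timed multiply that builds both with and without OpenMP, so that the same source gives the threaded runs and the unthreaded reference. The names `wall_seconds` and `timed_matmul` are hypothetical, and the timing method is an assumption rather than the one used for the table.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Seconds from a monotonically advancing clock. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();                    /* wall-clock time */
#else
    return (double)clock() / CLOCKS_PER_SEC;   /* CPU time; fine for a serial build */
#endif
}

/* c = a * b for n x n row-major matrices.
 * Returns the elapsed seconds for the multiply only (no allocation or setup). */
double timed_matmul(const double *a, const double *b, double *c, int n)
{
    int i, j, k;
    double start;
    assert(a != NULL && b != NULL && c != NULL && n > 0);
    start = wall_seconds();
    /* Ignored by the compiler when OpenMP is not enabled. */
    #pragma omp parallel for private(j, k) schedule(runtime)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            c[(size_t)i * n + j] = sum;
        }
    return wall_seconds() - start;
}
```

With gcc, building with `-fopenmp` gives the threaded version and building without it gives the unthreaded reference. In the threaded build, the standard `OMP_NUM_THREADS` and `OMP_SCHEDULE` environment variables select the thread count and the schedule used by `schedule(runtime)`.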