OpenMP on the Altix
The following are some timing results on the Altix. The test case is a matrix multiplication; the run time scales as N**3, where N is the side length of the (square) matrices. The numbers are averaged over 5 runs.
| N    | Threads | Speedup | Scale factor | Raw time (s) |
|------|---------|---------|--------------|--------------|
| 500  | 1       | 1.00    | 1.00         | 0.32         |
| 500  | 2       | 1.49    | 1.49         | 0.21         |
| 500  | 4       | 2.91    | 1.71         | 0.11         |
| 500  | 8       | 2.44    | 1.35         | 0.11         |
| 500  | 16      | 2.91    | 1.31         | 0.11         |
| 500  | 32      | 1.46    | 1.08         | 0.21         |
| 500  | 64      | 1.46    | 1.07         | 0.21         |
| 1000 | 1       | 1.00    | 1.00         | 9.99         |
| 1000 | 2       | 1.69    | 1.69         | 5.93         |
| 1000 | 4       | 2.86    | 1.69         | 3.23         |
| 1000 | 8       | 4.66    | 1.67         | 2.09         |
| 1000 | 16      | 8.05    | 1.68         | 1.17         |
| 1000 | 32      | 8.08    | 1.52         | 1.05         |
| 1000 | 64      | 9.53    | 1.46         | 0.95         |
| 2000 | 1       | 1.00    | 1.00         | 134.38       |
| 2000 | 2       | 1.79    | 1.79         | 75.21        |
| 2000 | 4       | 3.26    | 1.81         | 38.96        |
| 2000 | 8       | 5.85    | 1.80         | 22.89        |
| 2000 | 16      | 10.95   | 1.82         | 12.28        |
| 2000 | 32      | 13.35   | 1.68         | 8.03         |
| 2000 | 64      | 18.32   | 1.62         | 6.81         |
| 4000 | 1       | 1.00    | 1.00         | 1139.32      |
| 4000 | 2       | 1.77    | 1.77         | 640.28       |
| 4000 | 4       | 3.24    | 1.80         | 348.68       |
| 4000 | 8       | 5.90    | 1.81         | 190.78       |
| 4000 | 16      | 11.03   | 1.82         | 102.81       |
| 4000 | 32      | 16.39   | 1.75         | 63.89        |
| 4000 | 64      | 22.61   | 1.68         | 44.12        |
The important consideration is the scale factor and how well it holds up as the number of threads increases; here the scale factor is the effective speedup per doubling of the thread count, i.e. the speedup raised to 1/log2(threads). As a point of reference, this quote appears in an Intel article: "More typical code will have a lower limit; 1.7x-1.8x are generally considered very good speedup numbers for code run on two threads" (http://cache-www.intel.com/cd/00/00/31/64/316421_316421.pdf).

There is a start-up cost associated with spawning threads, so it is recommended that the threads be kept alive for the duration of a program run; this may be difficult to achieve with weave. Still having problems with FiPy, as distutils is missing. The C code used on the Altix is below.
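As a cross-check, here is a minimal sketch (written for this note, not part of the timing harness; the file name scale_factor.c and the hard-coded N=1000 times are just illustrative) that reproduces the speedup and scale-factor columns from the raw times, taking the scale factor to be the speedup raised to 1/log2(threads):

/* scale_factor.c -- reproduce the speedup and scale-factor columns from the
 * raw times for the N = 1000 rows of the table above.  Assumes
 * scale factor = speedup^(1 / log2(threads)). */
#include <math.h>
#include <stdio.h>

int main (void)
{
  /* raw times (s) for N = 1000 with 1, 2, 4, ..., 64 threads */
  double times[] = {9.99, 5.93, 3.23, 2.09, 1.17, 1.05, 0.95};
  int n = sizeof(times) / sizeof(times[0]);
  int i;

  for (i = 0; i < n; i++)
  {
    int threads = 1 << i;                    /* 1, 2, 4, ... threads */
    double speedup = times[0] / times[i];    /* serial time / threaded time */
    double scale = (threads == 1) ? 1.0
                                  : pow(speedup, 1.0 / log2((double) threads));
    printf("%3d threads: speedup %.2f, scale factor %.2f\n",
           threads, speedup, scale);
  }
  return 0;
}

Build with something like cc scale_factor.c -lm. The printed speedups will not match the table exactly, since the raw times shown above are rounded.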
Note: the results still need to be compared against completely unthreaded (serial) code.
/******************************************************************************
* FILE: omp_mm.c
* DESCRIPTION:
* OpenMp Example - Matrix Multiply - C Version
* Demonstrates a matrix multiply using OpenMP. Threads share row iterations
* according to a predefined chunk size.
* AUTHOR: Blaise Barney
* LAST REVISED: 06/28/05
******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
  int tid, nthreads, i, j, k, chunk;
  int N = 4000;                              /* matrices are N x N */

  /* Allocate the matrices as flat 1-D arrays, indexed as [i * N + j]. */
  double *a = malloc(N * N * sizeof(double));
  double *b = malloc(N * N * sizeof(double));
  double *c = malloc(N * N * sizeof(double));
  printf("finished allocating\n");

  chunk = 10;  /* loop iteration chunk size (used only by the commented-out
                  static schedule below) */

  /*** Spawn a parallel region explicitly scoping all variables ***/
  #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
  {
    tid = omp_get_thread_num();
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Starting matrix multiply example with %d threads\n", nthreads);
    }

    /*** Initialize matrices; schedule(runtime) takes the loop schedule
         from the OMP_SCHEDULE environment variable ***/
    #pragma omp for schedule (runtime)
    // #pragma omp for schedule (static, chunk)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        a[i * N + j] = i + j;

    #pragma omp for schedule (runtime)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        b[i * N + j] = i * j;

    #pragma omp for schedule (runtime)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        c[i * N + j] = 0;

    /*** Do matrix multiply sharing iterations on the outer loop ***/
    #pragma omp for schedule (runtime)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          c[i * N + j] += a[i * N + k] * b[k * N + j];
  } /*** End of parallel region ***/

  /*** Print results (disabled) ***/
  // for (i = 0; i < N; i++)
  // {
  //   for (j = 0; j < N; j++)
  //     printf("%6.2f ", c[i * N + j]);
  //   printf("\n");
  // }
  // printf("Done.\n");

  free(a);
  free(b);
  free(c);
  return 0;
}
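For reference, a typical way to build and launch the program is sketched below (the exact OpenMP flag depends on the compiler: -openmp for the Intel compiler commonly found on the Altix, -fopenmp for gcc). Because the loops use schedule(runtime), the standard OMP_SCHEDULE environment variable selects the loop schedule and OMP_NUM_THREADS sets the thread count; compiling without the OpenMP flag gives the unthreaded baseline mentioned in the note above.

icc -openmp -O2 omp_mm.c -o omp_mm    # or: gcc -fopenmp -O2 omp_mm.c -o omp_mm
export OMP_NUM_THREADS=8              # number of OpenMP threads
export OMP_SCHEDULE=static            # schedule used by the schedule(runtime) loops
./omp_mm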