openmp on the altix

The following are some timing results on the Altix. The test case is a matrix multiplication, so the run time should grow like N**3, where N is the side length of the (square) matrix. The numbers are averaged over 5 runs.

N     Threads  Speedup  Scale factor  Raw time (s)
1000        1     1.00          1.00          9.99
1000        2     1.69          1.69          5.93
1000        4     2.86          1.69          3.23
1000        8     4.66          1.67          2.09
1000       16     8.05          1.68          1.17
1000       32     8.08          1.52          1.05
1000       64     9.53          1.46          0.95
2000        1     1.00          1.00        134.38
2000        2     1.79          1.79         75.21
2000        4     3.26          1.81         38.96
2000        8     5.85          1.80         22.89
2000       16    10.95          1.82         12.28
2000       32    13.35          1.68          8.03
2000       64    18.32          1.62          6.81
4000        1     1.00          1.00       1139.32
4000        2     1.77          1.77        640.28
4000        4     3.24          1.80        348.68
4000        8     5.90          1.81        190.78
4000       16    11.03          1.82        102.81
4000       32    16.39          1.75         63.89
4000       64    22.61          1.68         44.12
500         1     1.00          1.00          0.32
500         2     1.49          1.49          0.21
500         4     2.91          1.71          0.11
500         8     2.44          1.35          0.11
500        16     2.91          1.31          0.11
500        32     1.46          1.08          0.21
500        64     1.46          1.07          0.21

The important consideration is the scale factor (apparently the speedup per doubling of the thread count) and how well it holds up as the number of threads increases. As a point of reference, I found this quote in an article: "More typical code will have a lower limit; 1.7x-1.8x are generally considered very good speedup numbers for code run on two threads" (http://cache-www.intel.com/cd/00/00/31/64/316421_316421.pdf).

The C code used on the Altix (omp_mm.c) is listed below. Still having problems with fipy as distutils is missing.

There is a start-up time associated with threading, so it is recommended that threads be kept alive for the duration of a program's run. This may be difficult to achieve with weave.

Note:

The results need to be tested against unthreaded code.

/******************************************************************************
* FILE: omp_mm.c
* DESCRIPTION:
*   OpenMp Example - Matrix Multiply - C Version
*   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
*   according to a predefined chunk size.
* AUTHOR: Blaise Barney
* LAST REVISED: 06/28/05
******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
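  int   tid, nthreads, i, j, k, chunk;
  int   N = 4000;                /* matrix side length */
  //  int loop;
  //  int loops=1;
  double *a, *b, *c;             /* N x N matrices, stored row-major in flat arrays */
  /* The elements are doubles, so size the buffers with sizeof(double),
     not sizeof(double *). */
  a = malloc((size_t)N * N * sizeof(double));
  b = malloc((size_t)N * N * sizeof(double));
  c = malloc((size_t)N * N * sizeof(double));
  if (a == NULL || b == NULL || c == NULL)
    {
    fprintf(stderr, "allocation failed\n");
    return 1;
    }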
  printf("finished allocating\n");
  chunk = 10;                    /* set loop iteration chunk size */
  /*** Spawn a parallel region explicitly scoping all variables ***/
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
  {
  tid = omp_get_thread_num();
  if (tid == 0)
    {
    nthreads = omp_get_num_threads();
    printf("Starting matrix multiple example with %d threads\n",nthreads);
    //printf("Initializing matrices...\n");
    }
  /*** Initialize matrices ***/
#pragma omp for schedule (runtime)
  //  #pragma omp for schedule (static, chunk)
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      a[i * N + j]= i+j;
  //#pragma omp for schedule (static, chunk)
#pragma omp for schedule (runtime)
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      b[i * N + j]= i*j;
#pragma omp for schedule (runtime)
  //#pragma omp for schedule (static, chunk)
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      c[i * N + j]= 0;
  /*** Do matrix multiply sharing iterations on outer loop ***/
  /*** Display who does which iterations for demonstration purposes ***/
  //  printf("Thread %d starting matrix multiply...\n",tid);
#pragma omp for schedule (runtime)
  //  #pragma omp for schedule (static, chunk)
  //  for(loop=0; loop<loops; loop++)
    for(i=0; i<N; i++)
      for(j=0; j<N; j++)
        for (k=0; k<N; k++)
          c[i * N + j] += a[i * N + k] * b[k * N + j];
  }   /*** End of parallel region ***/
/*** Print results ***/
//
//printf("******************************************************\n");
//printf("Result Matrix:\n");
//for (i=0; i<N; i++)
//  {
//  for (j=0; j<N; j++)
//    printf("%6.2f   ", c[i * N + j]);
//  printf("\n");
//  }
//printf("******************************************************\n");
//printf ("Done.\n");
  free(a);
  free(b);
  free(c);
  return 0;
}

openmp/weave timings.

A matrix multiplication in weave scales well with OpenMP. The code is here. The observed speedup with two threads is almost perfect.

This code, which multiplies large arrays of size N, shows the following speedups with two threads.

N Speedup
1E7 1.51
1E6 1.37
1E5 1.39
1E4 1.0

Note that the number of loop iterations in Python was increased in inverse proportion to the size of the array.

The question remains whether we can get speedups for the smaller arrays typically used in FiPy.