caml-list - the Caml user's mailing list
* Multiplication of matrix in C and OCaml
@ 2007-02-07 23:42 Frédéric Gava
  2007-02-08  2:14 ` [Caml-list] " Jacques Garrigue
  0 siblings, 1 reply; 32+ messages in thread
From: Frédéric Gava @ 2007-02-07 23:42 UTC (permalink / raw)
  To: caml-list

Dear All,

I compared matrix multiplication in C and OCaml and I was a little 
surprised to see that the following C code (using -O2) is 8 times faster 
than the OCaml one (even with -unsafe).

Does anybody have an idea how to optimize my OCaml code, or know why there 
is this "big" difference?

many thanks

Frédéric Gava

ps: the difference is the same even when I use a non-polymorphic 
multiplication

*************** C Code ******************

typedef struct Complexe {double re; double img;} complexe;

__inline__ void* add(void *x, void *y)
{
complexe a=((complexe*) x)[0];
complexe b=((complexe*) y)[0];
return (void *) &((complexe){a.re+b.re, a.img+b.img});
}

__inline__ void* mult(void *x, void *y)
{
complexe a=((complexe*) x)[0];
complexe b=((complexe*) y)[0];
return (void *) &((complexe){a.re*b.re-a.img*b.img, a.re*b.img+a.img*b.re});
}

void multiply_generic(int n, void *zero, void* (*add)(void *a, void *b), 
void* (*mult)(void *a, void *b), void *a, void *b, void *c)
{
  register int i,j,k;
  void *tmp;
  void *cc;

   for (i=0; i<n; i++)
    {
     for (j=0;j<n;j++)
      {
       tmp=zero;
       for (k=0;k<n;k++) tmp=add(tmp,(mult(b+(k*n+j),a+(i*n+k))));
       cc=c+(i*n+j);
       cc=tmp;
      }
   }
}

complexe random_complexe(void)
{
  complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)};
  return c;
}


Call it like this (on random matrices):

complexe zero;
zero.re=(double)0;
zero.img=(double)0;
multiply_generic(i,&zero,add,mult,a,b,c);


********************* OCaml Code *******************************

let random_complex n = {Complex.re=Random.float n; im=Random.float n}

let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.)

let multiplication n zero add mul a b c =
   let tmp=ref zero in
   for i=0 to n-1 do
    for j=0 to n-1 do
     tmp:=zero;
     for k=0 to n-1 do
      tmp:= add (!tmp) (mul b.(k*n+j) a.(i*n+k))
     done;
     c.(i*n+j)<-(!tmp)
    done
   done



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-07 23:42 Multiplication of matrix in C and OCaml Frédéric Gava
@ 2007-02-08  2:14 ` Jacques Garrigue
  2007-02-08  9:27   ` Frédéric Gava
  2007-02-08  9:56   ` Frédéric Gava
  0 siblings, 2 replies; 32+ messages in thread
From: Jacques Garrigue @ 2007-02-08  2:14 UTC (permalink / raw)
  To: gava; +Cc: caml-list

From: Frédéric Gava <gava@univ-paris12.fr>

> I compared matrix multiplication in C and OCaml and I was a little 
> surprised to see that the following C code (using -O2) is 8 times faster 
> than the OCaml one (even with -unsafe).
> 
> Does anybody have an idea how to optimize my OCaml code, or know why there 
> is this "big" difference?
>
> ps: the difference is the same even when I use a non-polymorphic 
> multiplication

This is not what I see:
Multiplying 1000 times two 50x50 matrices on a Pentium M 1.8G, using
your code.

ocamlopt -unsafe polymorphic: 5.4s
ocamlopt -unsafe -inline 100 polymorphic: 5.0s
ocamlopt -unsafe monomorphic: 4.4s
gcc polymorphic: 6.5s
gcc monomorphic: 5.8s
gcc -O3 monomorphic: 4.7s

So, actually it seems that ocamlopt is even faster than gcc in many
cases.

Now, I also found a strange result:
gcc -O3 monomorphic: 1.4s

So I looked at the generated assembler:

.globl add
	.type	add, @function
add:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$56, %esp
	leal	-56(%ebp), %eax
	leave
	ret
	.size	add, .-add
	.p2align 2,,3
.globl mult
	.type	mult, @function
mult:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$56, %esp
	leal	-56(%ebp), %eax
	leave
	ret

It seems that returning a pointer to a local value is not well-defined,
and gcc just decides that, since this value is unused inside the function,
there is no need to do any computation!
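
For what it's worth, a minimal well-defined sketch (untested) would pass
and return the struct by value instead of returning a pointer to a local:

typedef struct Complexe {double re; double img;} complexe;

/* the struct is returned by value: nothing here points to a dead local,
   so the compiler cannot legally discard the computation */
static __inline__ complexe add(complexe a, complexe b)
{
  return (complexe){a.re + b.re, a.img + b.img};
}

static __inline__ complexe mult(complexe a, complexe b)
{
  return (complexe){a.re*b.re - a.img*b.img, a.re*b.img + a.img*b.re};
}

With this the compiler has to keep the arithmetic, since the result really
is returned to the caller.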

---------------------------------------------------------------------------
Jacques Garrigue      Nagoya University     garrigue at math.nagoya-u.ac.jp
		   http://www.math.nagoya-u.ac.jp/~garrigue/



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08  2:14 ` [Caml-list] " Jacques Garrigue
@ 2007-02-08  9:27   ` Frédéric Gava
  2007-02-08  9:38     ` Frédéric Gava
  2007-02-08 12:08     ` Jacques Garrigue
  2007-02-08  9:56   ` Frédéric Gava
  1 sibling, 2 replies; 32+ messages in thread
From: Frédéric Gava @ 2007-02-08  9:27 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 1287 bytes --]

> This is not what I see:
> Multiplying 1000 times two 50x50 matrices on a Pentium M 1.8G, using
> your code.

Thanks.

Curious. Here is my "cat /proc/cpuinfo" (using Ubuntu 6.06):
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.60GHz
stepping        : 6
cpu MHz         : 598.127
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov 
pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips        : 1197.16

When I multiply two 400x400 matrices, I observe that C is faster than 
OCaml (2 seconds for the C code and 16 seconds for the OCaml one). It is 
perhaps because I use MPI. Did you use -O2?

 > gcc -O3 monomorphic: 4.7s
 > Now, I also found a strange result:
 > gcc -O3 monomorphic: 1.4s

Why two different times for the same program? What is monomorphic?

Regards,
Frédéric Gava

ps: I attach the C code and the OCaml one, which uses some specific 
functions of our parallel version of OCaml (here implemented using MPI). 
I will send the .exe file to Jacques.

[-- Attachment #2: mult.c --]
[-- Type: text/x-csrc; name="mult.c", Size: 3227 bytes --]

/* param 1 = size of the matrices at the beginning
   param 2 = size increment
   param 3 = size of the matrices at the end
   param 4 = number of iterations 

mpicc -O3 -o cmult cmult.c
(using MPICH)

mpirun -arch LINUX -np 1 ./cmult 400 2 402 1

*/

#include <mpi.h>
#include <stdlib.h> 
#include <stdio.h> 
#include <math.h>
#include <string.h>

typedef struct Complexe {double re; double img;} complexe;

complexe zero;

/* __inline__ complexe complexe_add(complexe a, complexe b)
{
 return (complexe){a.re+b.re, a.img+b.img};
}


__inline__ complexe complexe_mult(complexe x, complexe y)
{
 return (complexe) {x.re*y.re-x.img*y.img, x.re*y.img+x.img*y.re};
}

*/


/* version generic */

__inline__ void* add(void *x, void *y)
{
complexe a=((complexe*) x)[0];
complexe b=((complexe*) y)[0];
return (void *) &((complexe){a.re+b.re, a.img+b.img});
}

__inline__ void* mult(void *x, void *y)
{
complexe a=((complexe*) x)[0];
complexe b=((complexe*) y)[0];
return (void *) &((complexe){a.re*b.re-a.img*b.img, a.re*b.img+a.img*b.re});
}

complexe random_complexe(void)
{
 complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)};
 return c;
}


void multiply_generic(int n, void *zero, void* (*add)(void *a, void *b), void* (*mult)(void *a, void *b), void *a, void *b, void *c)
{
 register int i,j,k;
 void *tmp;
 void *cc;
 
 void *tt1, *tt2, *tt3;

  for (i=0; i<n; i++) 
   {
    for (j=0;j<n;j++)
     {
      tmp=zero;
      for (k=0;k<n;k++)
       {
        tmp=add(tmp,(mult(b+(k*n+j),a+(i*n+k))));
        /* tt1=malloc(sizeof(complexe));
        memcpy(tt1,mult(b+(k*n+j),a+(i*n+k)),sizeof(complexe));
	tt2=malloc(sizeof(complexe));
	memcpy(tt2,add(tmp,tt1),sizeof(complexe));
	free(tt1);
	memcpy(tmp,tt2,sizeof(complexe));
	free(tt2);	 */
       }
      cc=c+(i*n+j);
      cc=tmp;
      /* memcpy(cc,tmp,sizeof(complexe)); */
     }
  }
}

/*

void multiply_complex(int n, complexe *a, complexe *b, complexe *c)
{
 register int i,j,k,t1,t2;
 complexe tmp;

  for (i=0; i<n; i++) 
   {
    for (j=0;j<n;j++)
     {
      tmp=zero;
      for (k=0;k<n;k++) tmp=complexe_add(tmp,(complexe_mult(b[k*n+j],a[i*n+k])));
      c[i*n+j]=tmp;
     }
  } 
}

*/

int main(int argc, char* argv[])
{
 complexe *a, *b, *c;
 int i,j,k,t1,pas,maxi,iter,pid;
 double chrono=0;

 zero.re=(double)0;
 zero.img=(double)0;

 MPI_Init(&argc,&argv);
 MPI_Comm_rank(MPI_COMM_WORLD,&pid);

 i=atoi(argv[1]);
 pas=atoi(argv[2]);
 maxi=atoi(argv[3]);
 iter=atoi(argv[4]);

 srand(1000);
 while (i<maxi)
  {  
   /* initialise the matrices */
   a=(complexe *)malloc(i*i*sizeof(complexe));
   b=(complexe *)malloc(i*i*sizeof(complexe));
   c=(complexe *)malloc(i*i*sizeof(complexe));
      
   for (j=0;j<i;j++)
    for (k=0,t1=j*i;k<i;k++)
     {
      a[t1+k]=random_complexe();
      b[t1+k]=random_complexe();
     }

  chrono=MPI_Wtime();  

  for (k=1;k<=iter;k++) 
   /* multiply_complex(i,a,b,c); */
   multiply_generic(i,&zero,add,mult,a,b,c);

  chrono=MPI_Wtime()-chrono;
  MPI_Barrier(MPI_COMM_WORLD);
  
  if (pid==0) printf("%f\n",chrono/iter);
  fflush(stdout);

  /* free the old matrices */
  free((complexe *)a);
  free((complexe *)b);
  free((complexe *)c);

  i=i+pas;
  }

 MPI_Finalize();
 return 0;
}

[-- Attachment #3: mult.ml --]
[-- Type: text/plain, Size: 1260 bytes --]

(* param 1 = size of the matrices at the beginning
   param 2 = size increment
   param 3 = size of the matrices at the end
   param 4 = number of iterations *)

open Bsmllib
open Bsmlcomm

let multiplication n zero add mul a b c =
  let tmp=ref zero in
  for i=0 to n-1 do
   for j=0 to n-1 do
    tmp:=zero;
    (* let rec loop tmp k =
     if k=n then tmp else loop (add tmp (mul b.(k*n+j) a.(i*n+k))) (k+1)
    in 
     c.(i*n+j)<-loop zero 0 *)
    for k=0 to n-1 do
     tmp:= add (!tmp) (mul b.(k*n+j) a.(i*n+k)) 
    done; 
    c.(i*n+j)<-(!tmp)    
   done
  done

let random_complex n = {Complex.re=Random.float n; im=Random.float n}

let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.)

let _ = 
        Bsmllib.initialize();
	Random.self_init();
        let i = ref (int_of_string (argv()).(1))
	and pas=(int_of_string (argv()).(2))
	and maxi=(int_of_string (argv()).(3)) 
	and iter=(int_of_string (argv()).(4)) in
	while !i<maxi do
         let a=matrixRandom (!i)
         and b=matrixRandom (!i)
         and c=Array.make ((!i)*(!i)) Complex.zero in
	 ignore(bench_fun (fun () -> 
                 ignore(multiplication (!i) Complex.zero Complex.add 
		                                      Complex.mul a b c)) iter);
	 i:=!i+pas;
	done


* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08  9:27   ` Frédéric Gava
@ 2007-02-08  9:38     ` Frédéric Gava
  2007-02-08 12:08     ` Jacques Garrigue
  1 sibling, 0 replies; 32+ messages in thread
From: Frédéric Gava @ 2007-02-08  9:38 UTC (permalink / raw)
  To: gava; +Cc: caml-list

> When I multiply two 400x400 matrices, 

My mistake: 600x600 matrices.

FG



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08  2:14 ` [Caml-list] " Jacques Garrigue
  2007-02-08  9:27   ` Frédéric Gava
@ 2007-02-08  9:56   ` Frédéric Gava
  2007-02-08 10:45     ` Xavier Leroy
  1 sibling, 1 reply; 32+ messages in thread
From: Frédéric Gava @ 2007-02-08  9:56 UTC (permalink / raw)
  To: Jacques Garrigue; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 443 bytes --]

I attached the pure sequential version of my program.

In the file, I give how I compile, how I run it, and the result. Here C is 5 
times faster than OCaml.

Many thanks if you can tell me why my OCaml code is so slow.

Regards,
Frédéric Gava

ps: note that with a more generic version of the C code (that is, doing 
memcpy because the data could be of any size, for example matrices of 
polynomials), the OCaml code is actually faster than the C one :-)

[-- Attachment #2: mult.c --]
[-- Type: text/x-csrc, Size: 2055 bytes --]

/* param 1 = size of the matrices at the beginning
   param 2 = size increment
   param 3 = size of the matrices at the end
   param 4 = number of iterations 

gcc -O2 -o cmult mult.c

time ./cmult 600 2 602 1

real    0m3.800s
user    0m3.764s
sys     0m0.020s

*/

#include <stdlib.h> 
#include <stdio.h> 
#include <math.h>

typedef struct Complexe {double re; double img;} complexe;

complexe zero;

/* version generic */

__inline__ void* add(void *x, void *y)
{
complexe a=((complexe*) x)[0];
complexe b=((complexe*) y)[0];
return (void *) &((complexe){a.re+b.re, a.img+b.img});
}

__inline__ void* mult(void *x, void *y)
{
complexe a=((complexe*) x)[0];
complexe b=((complexe*) y)[0];
return (void *) &((complexe){a.re*b.re-a.img*b.img, a.re*b.img+a.img*b.re});
}

complexe random_complexe(void)
{
 complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)};
 return c;
}


void multiply_generic(int n, void *zero, void* (*add)(void *a, void *b), void* (*mult)(void *a, void *b), void *a, void *b, void *c)
{
 register int i,j,k;
 void *tmp;
 void *cc;
 
  for (i=0; i<n; i++) 
   {
    for (j=0;j<n;j++)
     {
      tmp=zero;
      for (k=0;k<n;k++)
       {
        tmp=add(tmp,(mult(b+(k*n+j),a+(i*n+k))));
       }
      cc=c+(i*n+j);
      cc=tmp;
     }
  }
}


int main(int argc, char* argv[])
{
 complexe *a, *b, *c;
 int i,j,k,t1,pas,maxi,iter;

 zero.re=(double)0;
 zero.img=(double)0;

 i=atoi(argv[1]);
 pas=atoi(argv[2]);
 maxi=atoi(argv[3]);
 iter=atoi(argv[4]);

 srand(1000);
 while (i<maxi)
  {  
   /* initialise the matrices */
   a=(complexe *)malloc(i*i*sizeof(complexe));
   b=(complexe *)malloc(i*i*sizeof(complexe));
   c=(complexe *)malloc(i*i*sizeof(complexe));
      
   for (j=0;j<i;j++)
    for (k=0,t1=j*i;k<i;k++)
     {
      a[t1+k]=random_complexe();
      b[t1+k]=random_complexe();
     }

  for (k=1;k<=iter;k++) multiply_generic(i,&zero,add,mult,a,b,c);

  /* free the old matrices */
  free((complexe *)a);
  free((complexe *)b);
  free((complexe *)c);

  i=i+pas;
  }
 return 0;
}

[-- Attachment #3: mult.ml --]
[-- Type: text/plain, Size: 1372 bytes --]

(* param 1 = size of the matrices at the beginning
   param 2 = size increment
   param 3 = size of the matrices at the end
   param 4 = number of iterations 

(Objective Caml version 3.09.1)

ocamlopt -unsafe -o ocamlmult mult.ml

time ./ocamlmult 600 2 602 1

real    0m17.043s
user    0m16.937s
sys     0m0.052s

*)


let multiplication n zero add mul a b c =
  let tmp=ref zero in
  for i=0 to n-1 do
   for j=0 to n-1 do
    tmp:=zero;
    (* let rec loop tmp k =
     if k=n then tmp else loop (add tmp (mul b.(k*n+j) a.(i*n+k))) (k+1)
    in 
     c.(i*n+j)<-loop zero 0 *)
    for k=0 to n-1 do
     tmp:= add (!tmp) (mul b.(k*n+j) a.(i*n+k)) 
    done; 
    c.(i*n+j)<-(!tmp)    
   done
  done

let random_complex n = {Complex.re=Random.float n; im=Random.float n}

let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.)

let _ = 
	Random.self_init();
        let i = ref (int_of_string (Sys.argv).(1))
	and pas=(int_of_string (Sys.argv).(2))
	and maxi=(int_of_string (Sys.argv).(3)) 
	and iter=(int_of_string (Sys.argv).(4)) in
	while !i<maxi do
         let a=matrixRandom (!i)
         and b=matrixRandom (!i)
         and c=Array.make ((!i)*(!i)) Complex.zero in
	  for j=1 to iter do
           ignore(multiplication (!i) Complex.zero Complex.add 
		                                      Complex.mul a b c);
          done;
	 i:=!i+pas;
	done

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08  9:56   ` Frédéric Gava
@ 2007-02-08 10:45     ` Xavier Leroy
  2007-02-08 15:16       ` Frédéric Gava
  0 siblings, 1 reply; 32+ messages in thread
From: Xavier Leroy @ 2007-02-08 10:45 UTC (permalink / raw)
  To: gava; +Cc: Jacques Garrigue, caml-list

> In the file, I give how I compile, how I run it, and the result. Here C is 5
> times faster than OCaml.
> Many thanks if you can tell me why my OCaml code is so slow.

Because, as Jacques told you already, your C code is wrong.  "add" and
"mult" invoke undefined behaviors of C and therefore gcc feels free to
optimize these functions as no-ops at optimization levels 1 and above.
That's a major speedup, for sure.  Why don't you check your code for
correctness first before drawing conclusions on performance?

- Xavier Leroy



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08  9:27   ` Frédéric Gava
  2007-02-08  9:38     ` Frédéric Gava
@ 2007-02-08 12:08     ` Jacques Garrigue
  1 sibling, 0 replies; 32+ messages in thread
From: Jacques Garrigue @ 2007-02-08 12:08 UTC (permalink / raw)
  To: gava; +Cc: caml-list

From: Frédéric Gava <gava@univ-paris12.fr>
>  > gcc -O3 monomorphic: 4.7s
>  > Now, I also found a strange result:
>  > gcc -O3 monomorphic: 1.4s
> 
> Why two different times for the same program? What is monomorphic?

The second one is a misprint for "gcc -O3 polymorphic".
By monomorphic I mean that add and mult are used directly (actually
inlined with -O3), rather than being passed as parameters to
generic_multiply, which should be faster.
The funny thing is that this results in correct code
even with -O3, because the inlined version does not discard its
result, as the optimized add and mult functions do.
So the polymorphic code, completely broken, ends up being 3 times
faster :-)

By the way there was another evident bug in your code: you do pointer
arithmetic on void*, which is clearly wrong. Yet, this can be easily
corrected by passing sizeof(complexe) as an extra argument to
generic_multiply, and adjusting offsets with it.
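
A rough, untested sketch of what I mean (the callback type and the helper
names below are just for illustration, not the code from the original post):

#include <stdlib.h>
#include <string.h>

typedef struct Complexe {double re; double img;} complexe;

/* generic binary operation writing its result through an out pointer,
   so nothing ever returns the address of a local */
typedef void (*binop)(const void *x, const void *y, void *out);

static void add_c(const void *x, const void *y, void *out)
{
  const complexe *a = x, *b = y;
  *(complexe *)out = (complexe){a->re + b->re, a->img + b->img};
}

static void mult_c(const void *x, const void *y, void *out)
{
  const complexe *a = x, *b = y;
  *(complexe *)out = (complexe){a->re*b->re - a->img*b->img,
                                a->re*b->img + a->img*b->re};
}

void multiply_generic(int n, size_t size, const void *zero,
                      binop add, binop mult,
                      const void *a, const void *b, void *c)
{
  const char *pa = a, *pb = b;        /* char* so that offsets are in bytes */
  char *pc = c;
  char *acc = malloc(size), *prod = malloc(size), *sum = malloc(size);
  int i, j, k;

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      memcpy(acc, zero, size);                          /* acc = zero */
      for (k = 0; k < n; k++) {
        mult(pb + (size_t)(k*n + j) * size,             /* prod = b[k*n+j] * a[i*n+k] */
             pa + (size_t)(i*n + k) * size, prod);
        add(acc, prod, sum);                            /* acc = acc + prod */
        memcpy(acc, sum, size);
      }
      memcpy(pc + (size_t)(i*n + j) * size, acc, size); /* c[i*n+j] = acc */
    }
  free(acc); free(prod); free(sum);
}

/* called as: multiply_generic(n, sizeof(complexe), &zero, add_c, mult_c, a, b, c); */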

Jacques Garrigue



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08 10:45     ` Xavier Leroy
@ 2007-02-08 15:16       ` Frédéric Gava
  2007-02-09  2:58         ` Jacques Garrigue
  0 siblings, 1 reply; 32+ messages in thread
From: Frédéric Gava @ 2007-02-08 15:16 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: Jacques Garrigue, caml-list

[-- Attachment #1: Type: text/plain, Size: 1626 bytes --]

> Because, as Jacques told you already, your C code is wrong.  "add" and
> "mult" invoke undefined behaviors of C and therefore gcc feels free to
> optimize these functions as no-ops at optimization levels 1 and above.
> That's a major speedup, for sure.  Why don't you check your code for
> correctness first before drawing conclusions on performance?

Sorry for the inconvenience and this stupid error: I am a very bad C 
programmer.

But I do not obtain the performance of Jacques Garrigue :-( I am trying to 
benchmark a parallel matrix multiplication algorithm and to test the difference 
between C+MPI and OCaml+MPI (I am trying to prove that OCaml is efficient 
enough for high-performance computing; in this community, they largely prefer 
Fortran or C...)


a) with a "polymorphic" C program (using 
"multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c);")

time ./cmult 600 2 602 1
real    0m18.402s
user    0m17.333s
sys     0m0.044s

b) for a monomorphic C program (using "multiply_complex(i,a,b,c);");

time ./cmult 600 2 602 1
real    0m5.604s
user    0m5.556s
sys     0m0.036s


c) for a polymorphic OCaml program  (using 
"ignore(multiplication_polymorphic (!i) Complex.zero Complex.add 
Complex.mul a b c);")

time ./ocamlmult 600 2 602 1

real    0m16.433s
user    0m16.125s
sys     0m0.068s

d) for a monomorphic OCaml program  (using 
"ignore(multiplication_monomorphic (!i) a b c);")

time ./ocamlmult 600 2 602 1

real    0m15.460s
user    0m15.345s
sys     0m0.072s

I gain 1 second if I compile with "-inline 100". So I do not get the "at 
most twice slower" behaviour.

thanks,
Frédéric Gava

ps: this time I have tested the programs ;-)

[-- Attachment #2: mult.c --]
[-- Type: text/x-csrc, Size: 2227 bytes --]

/* param 1 = size of the matrices at the beginning
   param 2 = size increment
   param 3 = size of the matrices at the end
   param 4 = number of iterations 

with 
gcc -O3 -o cmult mult.c

*/

#include <stdlib.h> 
#include <stdio.h> 
#include <math.h>

typedef struct Complexe {double re; double img;} complexe;

complexe zero;

__inline__ complexe complexe_add(complexe a, complexe b)
{
 return (complexe){a.re+b.re, a.img+b.img};
}


__inline__ complexe complexe_mult(complexe x, complexe y)
{
 return (complexe) {x.re*y.re-x.img*y.img, x.re*y.img+x.img*y.re};
}


complexe random_complexe(void)
{
 complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)};
 return c;
}


void multiply_complex(int n, complexe *a, complexe *b, complexe *c)
{
 register int i,j,k;
 complexe tmp;

  for (i=0; i<n; i++) 
   {
    for (j=0;j<n;j++)
     {
      tmp=zero;
      for (k=0;k<n;k++) 
       tmp=complexe_add(tmp,(complexe_mult(b[k*n+j],a[i*n+k])));
      c[i*n+j]=tmp;
     }
  }
 return;
}

void multiply_complex_generic(int n, complexe (*add)(complexe x, complexe y) , complexe (*mult)(complexe x, complexe y),complexe *a, complexe *b, complexe *c)
{
 register int i,j,k;
 complexe tmp;

  for (i=0; i<n; i++) 
   {
    for (j=0;j<n;j++)
     {
      tmp=zero;
      for (k=0;k<n;k++) 
       tmp=add(tmp,(mult(b[k*n+j],a[i*n+k])));
      c[i*n+j]=tmp;
     }
  }
 return;
}


int main(int argc, char* argv[])
{
 complexe *a, *b, *c;
 int i,j,k,t1,pas,maxi,iter;

 zero.re=(double)0;
 zero.img=(double)0;
 
 i=atoi(argv[1]);
 pas=atoi(argv[2]);
 maxi=atoi(argv[3]);
 iter=atoi(argv[4]);

 srand(1000);
 while (i<maxi)
  {  
   /* initialise the matrices */
   a=(complexe *)malloc(i*i*sizeof(complexe));
   b=(complexe *)malloc(i*i*sizeof(complexe));
   c=(complexe *)malloc(i*i*sizeof(complexe));
      
   for (j=0;j<i;j++)
    for (k=0,t1=j*i;k<i;k++)
     {
      a[t1+k]=random_complexe();
      b[t1+k]=random_complexe();
     }

  for (k=1;k<=iter;k++) 
   multiply_complex(i,a,b,c);
   /* multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c); */

      

  /* free the old matrices */
  free((complexe *)a);
  free((complexe *)b);
  free((complexe *)c);

  i=i+pas;
  }
 return 0;
}

[-- Attachment #3: mult.ml --]
[-- Type: text/plain, Size: 1431 bytes --]

(* param 1 = size of the matrices at the beginning
   param 2 = size increment
   param 3 = size of the matrices at the end
   param 4 = number of iterations 

ocamlopt -unsafe -o ocamlmult mult.ml

*)

let multiplication_polymorphic n zero add mul a b c =
  for i=0 to n-1 do
   for j=0 to n-1 do
    let rec loop tmp k =
     if k=n then tmp else loop (add tmp (mul b.(k*n+j) a.(i*n+k))) (k+1)
    in 
     c.(i*n+j)<-loop zero 0
   done
  done

let multiplication_monomorphic n a b c =
 let tmp=ref Complex.zero in
  for i=0 to n-1 do
   for j=0 to n-1 do
    for k=0 to n-1 do
      tmp:=Complex.add !tmp (Complex.mul b.(k*n+j) a.(i*n+k))
    done;
    c.(i*n+j)<-(!tmp)
   done
  done
(* same efficiency with the loop function *)

let random_complex n = {Complex.re=Random.float n; im=Random.float n}

let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.)

let _ = 
	Random.self_init();
        let i = ref (int_of_string (Sys.argv).(1))
	and pas=(int_of_string (Sys.argv).(2))
	and maxi=(int_of_string (Sys.argv).(3)) 
	and iter=(int_of_string (Sys.argv).(4)) in
	while !i<maxi do
         let a=matrixRandom (!i)
         and b=matrixRandom (!i)
         and c=Array.make ((!i)*(!i)) Complex.zero in
	  for j=1 to iter do
           (* ignore(multiplication_polymorphic (!i) Complex.zero Complex.add Complex.mul a b c); *)
	   ignore(multiplication_monomorphic (!i) a b c);
          done;
	 i:=!i+pas;
	done


* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-08 15:16       ` Frédéric Gava
@ 2007-02-09  2:58         ` Jacques Garrigue
  2007-02-09  9:06           ` ls-ocaml-developer-2006
  0 siblings, 1 reply; 32+ messages in thread
From: Jacques Garrigue @ 2007-02-09  2:58 UTC (permalink / raw)
  To: gava; +Cc: caml-list

From: Frédéric Gava <gava@univ-paris12.fr>

> Sorry for the inconvenience and this stupid error: I am a very bad C 
> programmer.
> 
> But I do not obtain the performance of Jacques Garrigue :-( I am trying to 
> benchmark a parallel matrix multiplication algorithm and to test the difference 
> between C+MPI and OCaml+MPI (I am trying to prove that OCaml is efficient 
> enough for high-performance computing; in this community, they largely prefer 
> Fortran or C...)
> 
> 
> a) with a "polymorphic" C program (using 
> "multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c);")
> 
> time ./cmult 600 2 602 1
> real    0m18.402s
> user    0m17.333s
> sys     0m0.044s
> 
> b) for a monomorphic C program (using "multiply_complex(i,a,b,c);");
> 
> time ./cmult 600 2 602 1
> real    0m5.604s
> user    0m5.556s
> sys     0m0.036s

Interesting. It all depends on the compiler.
With gcc 3.4, as provided in FreeBSD, I get almost no difference
between your polymorphic and monomorphic versions. But if I switch to
gcc 4.1, the monomorphic version is indeed much faster. Actually, what
I get is:
gcc 3.4 polymorphic: 15s
gcc 4.1 polymorphic: 20s
gcc 3.4 monomorphic: 15s
gcc 4.1 monomorphic:  7s
So it looks like gcc 4.1 is better for monomorphic code, but worse for
function calls...
Note that in my case, this is still within a factor 2 of ocaml (which
is about the same as gcc 3.4).
But your C compiler may be doing some other platform specific
optimizations. The only way to know what is happening is to look at the
generated assembler.

Jacques Garrigue



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09  2:58         ` Jacques Garrigue
@ 2007-02-09  9:06           ` ls-ocaml-developer-2006
  2007-02-09 10:32             ` ls-ocaml-developer-2006
                               ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-09  9:06 UTC (permalink / raw)
  To: caml-list


Jacques Garrigue <garrigue@math.nagoya-u.ac.jp> writes:

> From: Frédéric Gava <gava@univ-paris12.fr>
>
>> Sorry for the inconvenience and this stupid error: I am a very bad C 
>> programmer.
>> 
>> But I do not obtain the performance of Jacques Garrigue :-( I am trying to 
>> benchmark a parallel matrix multiplication algorithm and to test the difference 
>> between C+MPI and OCaml+MPI (I am trying to prove that OCaml is efficient 
>> enough for high-performance computing; in this community, they largely prefer 
>> Fortran or C...)
>> 
>> 
>> a) with a "polymorphic" C program (using 
>> "multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c);")
>> 
>> time ./cmult 600 2 602 1
>> real    0m18.402s
>> user    0m17.333s
>> sys     0m0.044s
>> 
>> b) for a monomorphic C program (using "multiply_complex(i,a,b,c);");
>> 
>> time ./cmult 600 2 602 1
>> real    0m5.604s
>> user    0m5.556s
>> sys     0m0.036s
>
> Interesting. It all depends on the compiler.
> With gcc 3.4, as provided in FreeBSD, I get almost no difference
> between your polymorphic and monomorphic versions. But if I switch to
> gcc 4.1, the monomorphic version is indeed much faster. Actually, what
> I get is:
> gcc 3.4 polymorphic: 15s
> gcc 4.1 polymorphic: 20s
> gcc 3.4 monomorphic: 15s
> gcc 4.1 monomorphic:  7s
> So it looks like gcc 4.1 is better for monomorphic code, but worse for
> function calls...
> Note that in my case, this is still within a factor 2 of ocaml (which
> is about the same as gcc 3.4).

> But your C compiler may be doing some other platform specific
> optimizations. The only way to know what is happening is to look at the
> generated assembler.

I'm just wondering about that. All the data produced during the
matrix-multiplication is AFAICS not used for anything. So I wouldn't
exclude that in the case of monomorphic code gcc 4.1 is just deciding
to ditch most of the actual work after doing some data flow
analysis. Gcc 4.1 is (I think) known to do some aggressive
optimizations. I'd feel better if the code is benchmarked in a way
that the result of the multiplication is output to a file and to
subtract the constant contribution of that to the run time that the
time is measured for various problem sizes (number of matrices). One
would get a linear dependency

   t(n) = C + K*n

and by fitting a straight line to the data points one could obtain K, to
compare the efficiency of C (under various compiler versions and
options) and Ocaml against each other without having to worry about
no-op optimizations or constant startup costs (like load times and run
time initialization, which might be a bit higher with Ocaml, though
certainly not in the order of seconds).
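
A minimal sketch of such a measurement on the C side (untested; it reuses
the complexe type and the multiply_complex routine from the last mult.c,
and the (iter, time) pairs would then be fitted offline):

#include <stdio.h>
#include <time.h>

/* run the multiplication under test for an increasing number of
   repetitions and print (iter, seconds); fitting t = C + K*iter to
   these points gives K independently of constant startup costs */
void bench(int n, complexe *a, complexe *b, complexe *c)
{
  int iter, k;
  for (iter = 1; iter <= 64; iter *= 2) {
    clock_t start = clock();
    for (k = 0; k < iter; k++)
      multiply_complex(n, a, b, c);
    printf("%d %f\n", iter,
           (double)(clock() - start) / CLOCKS_PER_SEC);
  }
}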

Comparing the output to expected results would also help to ensure
that the code is correct which I still find difficult to assert at
first glance.

Regards -- Markus



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09  9:06           ` ls-ocaml-developer-2006
@ 2007-02-09 10:32             ` ls-ocaml-developer-2006
  2007-02-09 14:22               ` skaller
  2007-02-09 12:05             ` Jon Harrop
  2007-02-09 13:50             ` Brian Hurt
  2 siblings, 1 reply; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-09 10:32 UTC (permalink / raw)
  To: caml-list


ls-ocaml-developer-2006@m-e-leypold.de writes:

> optimizations. I'd feel better if the code is benchmarked in a way
> that the result of the multiplication is output to a file and to
> subtract the constant contribution of that to the run time that the
> time is measured for various problem sizes (number of matrices). 

Correction: ... the constant contribution of writing the result to a
file should be subtracted from the run time by measuring with various
problem sizes.

Hope that is understandable now.




* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09  9:06           ` ls-ocaml-developer-2006
  2007-02-09 10:32             ` ls-ocaml-developer-2006
@ 2007-02-09 12:05             ` Jon Harrop
  2007-02-09 12:35               ` ls-ocaml-developer-2006
  2007-02-09 13:50             ` Brian Hurt
  2 siblings, 1 reply; 32+ messages in thread
From: Jon Harrop @ 2007-02-09 12:05 UTC (permalink / raw)
  To: caml-list

On Friday 09 February 2007 09:06, ls-ocaml-developer-2006@m-e-leypold.de 
wrote:
>    t(n) = C + K*n

If you're assuming that the time taken will be proportional to complexity then 
that is almost certainly quite inaccurate in this case due to cache effects.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 12:05             ` Jon Harrop
@ 2007-02-09 12:35               ` ls-ocaml-developer-2006
  0 siblings, 0 replies; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-09 12:35 UTC (permalink / raw)
  To: caml-list


Jon Harrop <jon@ffconsultancy.com> writes:

> On Friday 09 February 2007 09:06, ls-ocaml-developer-2006@m-e-leypold.de 
> wrote:
>>    t(n) = C + K*n
>
> If you're assuming that the time taken will be proportional to complexity then 

Not complexity: The number of matrices multiplied ...

> that is almost certainly quite inaccurate in this case due to cache effects.

Yes, that's too much simplification. I'm aware of that. Exactly for
that reason the exact shape of the resulting graph would be
interesting, because it reflects where the different caches run out of
capacity. Still -- with 15s run time and random data I assume that the
processor L1/L2 caches don't count (I don't see why multiplying 2
different matrices one after the other should profit from caching --
mind: I scale the number of multiplications, not the size of
matrices).

BTW: With my short note I didn't want to suggest a complete
methodology of measurement, but only wanted to comment that perhaps
optimization does still take place (because the results are not used)
and just hint at better methods of measurement than just taking the
time of a single run at a constant problem size. You (or rather the
guy with the original problem) might improve on that at will.


Regards -- Markus



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09  9:06           ` ls-ocaml-developer-2006
  2007-02-09 10:32             ` ls-ocaml-developer-2006
  2007-02-09 12:05             ` Jon Harrop
@ 2007-02-09 13:50             ` Brian Hurt
  2007-02-09 14:23               ` Gerd Stolpmann
  2 siblings, 1 reply; 32+ messages in thread
From: Brian Hurt @ 2007-02-09 13:50 UTC (permalink / raw)
  To: ls-ocaml-developer-2006; +Cc: caml-list


>I'd feel better if the code is benchmarked in a way
>that the result of the multiplication is output to a file and to
>subtract the constant contribution of that to the run time that the
>time is measured for various problem sizes (number of matrices). 
>

For the C side, I recommend using clock() to time the runs.  This also 
factors out the cost of creating the random matrices.  Basically, you 
just do:

#include <time.h>

...
    {
        clock_t start, stop;
        ...
        start = clock();
        /* stuff to time here */
        stop = clock();
        printf("The amount of time taken = %f seconds\n", 
((double)(stop-start))/((double) CLOCKS_PER_SEC));
        /* write result to file */
    }

In Ocaml, Unix.gettimeofday can be used similarly:

    let start = Unix.gettimeofday() in
    (* stuff to time here *)
    let stop = Unix.gettimeofday() in
    Printf.printf "The amount of time taken = %f seconds\n" (stop -. start);
     (* write results to file *)

Brian



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 10:32             ` ls-ocaml-developer-2006
@ 2007-02-09 14:22               ` skaller
  2007-02-09 21:47                 ` ls-ocaml-developer-2006
  2007-02-09 23:56                 ` Jon Harrop
  0 siblings, 2 replies; 32+ messages in thread
From: skaller @ 2007-02-09 14:22 UTC (permalink / raw)
  To: ls-ocaml-developer-2006; +Cc: caml-list

On Fri, 2007-02-09 at 11:32 +0100,
ls-ocaml-developer-2006@m-e-leypold.de wrote:
> ls-ocaml-developer-2006@m-e-leypold.de writes:
> 
> > optimizations. I'd feel better if the code is benchmarked in a way
> > that the result of the multiplication is output to a file and to
> > subtract the constant contribution of that to the run time that the
> > time is measured for various problem sizes (number of matrices). 
> 
> Correction: ... the constant contribution of writing the result to a
> file should be subtracted from the run time by measuring with various
> problem sizes.
> 
> Hope that is understandable now.

There is no need for that IMHO. The right way to benchmark simple
things like matrix operations AND simultaneously validate them
is to evaluate known laws of arithmetic, such as the associative
law, on various samples.

This doesn't require any I/O, nor a 'known correct' program
to generate the comparison data.

Generating the samples is harder .. I once used things like
Fibonacci to generate input for integer calculator tests.
You really want to run these benchmarks for at least 5 minutes
and randomly to eliminate cache effects.

If you can generate increasing sized problems based on a single
linear integer parameter, my Python test harness can handle
randomising the tests, run multiple processes, 
and also draw a nice plot of the results, such as these:

http://felix.sourceforge.net/speed/en_flx_perf_0005.html
http://felix.sourceforge.net/speed/en_flx_perf_0012.html

full sources etc are here to read:

http://felix.sourceforge.net/speed/en_flx_perf_top.html

or you can grab from Felix svn archive (the test harness
is independent of Felix). Or I can email to you (saves
building Felix just to get the test harness ..)

Ocamlopt scores quite well thank you! This data is from
two AMD64 machines running Ubuntu Linux and is the result
of HOURS of run time.

I don't know if it is cache, random context switches, or what,
but on my boxes variations up to 20% on tests under 5 seconds are
normal. Note the measurements are real time. Other measurements
are suspect .. you really SHOULD count VM paging for example.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 13:50             ` Brian Hurt
@ 2007-02-09 14:23               ` Gerd Stolpmann
  0 siblings, 0 replies; 32+ messages in thread
From: Gerd Stolpmann @ 2007-02-09 14:23 UTC (permalink / raw)
  To: Brian Hurt; +Cc: ls-ocaml-developer-2006, caml-list

Am Freitag, den 09.02.2007, 08:50 -0500 schrieb Brian Hurt:
> For the C side, I recommend using clock() to time the runs.
> ...
> In Ocaml, Unix.gettimeofday can be used similiarly:

Ahem, I don't understand why you suggest two different methods for C and
Ocaml. First, there is Unix.times in O'Caml, which also returns the
process time like clock(). Second - believe it or not - there is also
gettimeofday in C.

Personally, I recommend relying on gettimeofday, i.e. the wall clock
time of the system, on an otherwise idle system. That's the time that
finally counts, and the way the process time is accumulated by the
kernel is somewhat dubious (but may be useful to check whether the
system was really idle during the benchmark).
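
For completeness, the C side of that would be something like this (a
minimal sketch):

#include <stdio.h>
#include <sys/time.h>

/* wall-clock time in seconds, the C counterpart of Unix.gettimeofday */
static double now(void)
{
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* usage:
     double start = now();
     ... stuff to time here ...
     printf("The amount of time taken = %f seconds\n", now() - start);
*/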

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 14:22               ` skaller
@ 2007-02-09 21:47                 ` ls-ocaml-developer-2006
  2007-02-09 21:55                   ` Andrej Bauer
  2007-02-09 23:56                 ` Jon Harrop
  1 sibling, 1 reply; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-09 21:47 UTC (permalink / raw)
  To: caml-list


skaller <skaller@users.sourceforge.net> writes:

> On Fri, 2007-02-09 at 11:32 +0100,
> ls-ocaml-developer-2006@m-e-leypold.de wrote:
>> ls-ocaml-developer-2006@m-e-leypold.de writes:
>> 
>> > optimizations. I'd feel better if the code is benchmarked in a way
>> > that the result of the multiplication is output to a file and to
>> > subtract the constant contribution of that to the run time that the
>> > time is measured for various problem sizes (number of matrices). 
>> 
>> Correction: ... the constant contribution of writing the result to a
>> file should be subtracted from the run time by measuring with various
>> problem sizes.
>> 
>> Hope that is understandable now.

> There is no need for that IMHO. The right way to benchmark simple
> things like matrix operations AND simultaneously validate them
> is to evaluate known laws of arithmetic, such as the associative
> law, on various samples.




Yes, that's another way. But what remains is the fact that (a) you
should somehow use the result of your operations to keep the compiler
from just optimizing the operation away (since the compiler probably
doesn't know about the distributive law, that would be the right way
to go, but I can just imagine that if you try to test

   (A * B) * C == A * (B * C)

(the associative law) it might just be that the compiler inlines the
function calls and then discovers that this is always true (since it
just knows the laws of float arithmetic).

Perhaps I'm just being paranoid.

> This doesn't require any I/O, nor a 'known correct' program
> to generate the comparison data.

And as far as correctness goes: The program should not invoke
undefined behaviour and if you have 2 programs in 2 different
languages that you want to benchmark against each other, they should
really do the same thing. I don't see how you could get any
meaningful results with an incorrect program or by comparing 2
programs of which one just invokes undefined behaviour in C.

> Generating the samples is harder .. I once used things like
> Fibonacci to generate input for integer calculator tests.
> You really want to run these benchmarks for at least 5 minutes
> and randomly to eliminate cache effects.
>
> If you can generate increasing sized problems based on a single
> linear integer parameter, my Python test harness can handle
> randomising the tests, run multiple processes, 
> and also draw a nice plot of the results, such as these:
>
> http://felix.sourceforge.net/speed/en_flx_perf_0005.html
> http://felix.sourceforge.net/speed/en_flx_perf_0012.html
>
> full sources etc are here to read:
>
> http://felix.sourceforge.net/speed/en_flx_perf_top.html
>
> or you can grab from Felix svn archive (the test harness
> is independent of Felix). Or I can email to you (saves
> building Felix just to get the test harness ..)

I'm not the OP (with the original problem), but I'd be interested in
the test harness (if it is documented). I've been thinking about
writing something similar for testing algorithms (sometimes the
scaling behaviour tells you what you did wrong or where the problem
lies), but haven't had the time yet.

> Ocamlopt scores quite well thank you! This data is from
> two AMD64 machines running Ubuntu Linux and is the result
> of HOURS of run time.

I'd have preferred no straight lines between the points (physicists know
why) and more data points. But yes, this is the kind of benchmark I've
been suggesting. Which version of gcc did you use? 

I'm still a bit wary of the OP's results: that gcc-generated code just
runs 3 times faster than code from another version of gcc seems to
point to an optimization bug, IMHO: there is no way that there was an
optimization potential of a factor of 3 in older gcc versions (else I
suppose the Intel cc would have generated much faster code than gcc, too,
because those people presumably knew what they were doing and weren't
likely to make the same mistakes as the gcc people: so, since I don't
believe in miracles ...)


> I don't know if it is cache, random context switches, or what,
> but on my boxes variations up to 20% on tests under 5 seconds are
> normal.

Almost the same here. I assume you didn't test in single user mode (I
tried it once and the resulting data was much smoother).

> Note the measurements are real time. Other measurements
> are suspect .. you really SHOULD count VM paging for example.

:-).

Why me?

Regards -- Markus 

  (who isn't really in the benchmark business presently, but only
   wanted to point to the importance of being earnest^W observing
   scaling behaviour instead of looking at absolute numbers)



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 21:47                 ` ls-ocaml-developer-2006
@ 2007-02-09 21:55                   ` Andrej Bauer
  2007-02-09 22:36                     ` ls-ocaml-developer-2006
  2007-02-10  1:10                     ` Brian Hurt
  0 siblings, 2 replies; 32+ messages in thread
From: Andrej Bauer @ 2007-02-09 21:55 UTC (permalink / raw)
  Cc: caml-list

I hate to add to this long discussion, but since when is floating point 
multiplication associative?

# 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;;
- : bool = false

The suggestion that the results be tested for validity by checking the 
associative law of multiplication won't work very well, and neither 
will commutativity. In this case, instead of checking that A = B holds, 
it's better to output the difference between A and B (in case A and B 
are matrices, output the sup norm of the difference to keep it down to a 
single number).
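
A quick sketch of that check, written against the complexe struct from the
C code posted earlier (here the norm is simply the largest absolute
difference of the components):

#include <math.h>

/* sup norm of the difference of two n*n complex matrices */
double sup_norm_diff(int n, const complexe *a, const complexe *b)
{
  double m = 0.0;
  int i;
  for (i = 0; i < n * n; i++) {
    double dre = fabs(a[i].re - b[i].re);
    double dim = fabs(a[i].img - b[i].img);
    if (dre > m) m = dre;
    if (dim > m) m = dim;
  }
  return m;
}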

Andrej



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 21:55                   ` Andrej Bauer
@ 2007-02-09 22:36                     ` ls-ocaml-developer-2006
  2007-02-09 23:53                       ` Jon Harrop
  2007-02-10  1:10                     ` Brian Hurt
  1 sibling, 1 reply; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-09 22:36 UTC (permalink / raw)
  To: caml-list



Andrej Bauer <Andrej.Bauer@fmf.uni-lj.si> writes:

> I hate to add to this long discussion, but since when is floating
> point multiplication associative?
>
> # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;;
> - : bool = false

Which just means that "testing associativity" wouldn't work as a
benchmark. Still I think the compiler (GCC) might take the license to
optimize (a * b) *c == a * (b * c) to false with -O3. Gcc 4.1 is
optimizing pretty aggressively (yes, even to the extent of doubtful,
sometimes simply wrong semantics) AFAIK.

> The suggestion that the results be tested for validity by checking the
> associative law of multiplication won't work very well, and neither
> will commutativity. In this case, instead of checking that A = B
> holds, it's better to output the difference between A and B (in case A
> and B are matrices, output the sup norm of the difference to keep it
> down to a single number).

:-) But wasn't I/O what Maxim wanted to avoid in a benchmark?

Regards -- Markus



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 22:36                     ` ls-ocaml-developer-2006
@ 2007-02-09 23:53                       ` Jon Harrop
  2007-02-10  1:41                         ` ls-ocaml-developer-2006
  0 siblings, 1 reply; 32+ messages in thread
From: Jon Harrop @ 2007-02-09 23:53 UTC (permalink / raw)
  To: caml-list

On Friday 09 February 2007 22:36, ls-ocaml-developer-2006@m-e-leypold.de 
wrote:
> Still I think the compiler (GCC) might take the license to
> optimize (a * b) *c == a * (b * c) to false with -O3.

The -O* flags try to be correct but you can supply flags like -ffast-math to 
opt for faster but possibly less correct code:

  http://docs.freebsd.org/info//gcc/gcc.info.Optimize_Options.html

Many compilers and languages sacrifice strict double-precision for performance 
(by not rounding from 80-bit), which is a major headache when you're 
implementing algorithms that rely upon it, e.g. the exact arithmetic in 
Triangle:

  http://www.soest.hawaii.edu/GMT/gmt/README.TRIANGLE

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 14:22               ` skaller
  2007-02-09 21:47                 ` ls-ocaml-developer-2006
@ 2007-02-09 23:56                 ` Jon Harrop
  1 sibling, 0 replies; 32+ messages in thread
From: Jon Harrop @ 2007-02-09 23:56 UTC (permalink / raw)
  To: caml-list

On Friday 09 February 2007 14:22, skaller wrote:
> The right way to benchmark simple
> things like matrix operations AND simultaneously validate them
> is to evaluate known laws of arithmetic, such as the associative
> law, on various samples.

They aren't likely to be associative (or commutative).

The best way to benchmark (IMHO) is to run programs that actually do something 
useful.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 21:55                   ` Andrej Bauer
  2007-02-09 22:36                     ` ls-ocaml-developer-2006
@ 2007-02-10  1:10                     ` Brian Hurt
  2007-02-10  1:16                       ` Robert Roessler
  1 sibling, 1 reply; 32+ messages in thread
From: Brian Hurt @ 2007-02-10  1:10 UTC (permalink / raw)
  To: Andrej.Bauer; +Cc: caml-list



On Fri, 9 Feb 2007, Andrej Bauer wrote:

> I hate to add to this long discussion, but since when is floating point 
> multiplication associative?
>
> # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;;
> - : bool = false

Round off error.  You can't represent numbers like 7/10 or 13/10 in binary 
precisely in a finite number of digits, any more than you can represent 
the numbers 1/3 or 4/7 precisely in base-10 in a finite number of digits. 
This is a fundamental problem with floating point numbers, and shows up in 
every language that has them.  But this means reordering the calculations 
will result in slightly different answers.

And no, most of the "obvious" solutions don't work.  Using rational 
arithmetic doesn't help if I ask you to calculate the euclidean length of 
the vector (1, 1) for example (hint: sqrt (1.*.1. +. 1.*.1.)).  Interval 
arithmetic fails miserably on simple algorithms like Newton's method.  And 
so on.

On the other hand, the precision given by double precision floating point 
numbers is generally more than enough- they're precise enough to hold the 
distance from here to the moon in microns, for example.

My advice: if you're doing much of anything with floating point numbers, 
get a book on numerical analysis and follow the algorithms and 
recommendations it gives you (any book except "Numerical Recipes", which 
sucks).  And don't use floating point for money.

Brian



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10  1:10                     ` Brian Hurt
@ 2007-02-10  1:16                       ` Robert Roessler
  0 siblings, 0 replies; 32+ messages in thread
From: Robert Roessler @ 2007-02-10  1:16 UTC (permalink / raw)
  To: caml-list

Brian Hurt wrote:
> On Fri, 9 Feb 2007, Andrej Bauer wrote:
> 
>> I hate to add to this long discussion, but since when is floating 
>> point multiplication associative?
>>
>> # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;;
>> - : bool = false
> 
> Round off error.  You can't represent numbers like 7/10 or 13/10 in 
> binary precisely in a finite number of digits, any more than you can 
> represent the numbers 1/3 or 4/7 precisely in base-10 in a finite number 
> of digits. This is a fundamental problem with floating point numbers, 
> and shows up in every language that has them.  But this means reordering 
> the calculations will result in slightly different answers.

I had assumed this was the point he was making...

Robert Roessler
roessler@rftp.com
http://www.rftp.com



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-09 23:53                       ` Jon Harrop
@ 2007-02-10  1:41                         ` ls-ocaml-developer-2006
  2007-02-10  2:24                           ` Jon Harrop
  0 siblings, 1 reply; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-10  1:41 UTC (permalink / raw)
  To: caml-list



Jon Harrop <jon@ffconsultancy.com> writes:

> On Friday 09 February 2007 22:36, ls-ocaml-developer-2006@m-e-leypold.de 
> wrote:
>> Still I think the compiler (GCC) might take the license to
>> optimize (a * b) *c == a * (b * c) to false with -O3.
>
> The -O* flags try to be correct but you can supply flags like -ffast-math to 

Sorry, I wanted to say "optimize (a * b) *c == a * (b * c) to true",
i.e. 1. Would the compiler be wrong/incorrect in doing so? I don't
think so.

Regards -- Markus



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10  1:41                         ` ls-ocaml-developer-2006
@ 2007-02-10  2:24                           ` Jon Harrop
  2007-02-10 14:41                             ` ls-ocaml-developer-2006
  2007-02-11 13:13                             ` Christophe Raffalli
  0 siblings, 2 replies; 32+ messages in thread
From: Jon Harrop @ 2007-02-10  2:24 UTC (permalink / raw)
  To: caml-list

On Saturday 10 February 2007 01:41, ls-ocaml-developer-2006@m-e-leypold.de 
wrote:
> Sorry, I wanted to say "optimize (a * b) *c == a * (b * c) to true",
> i.e. 1. Would the compiler be wrong/incorrect in doing so?

Yes.

> I don't think so.

Provided you only specify -O* optimisation flags, I believe that reducing that 
expression to "true" in the general case would be a compiler bug. If you 
specify flags like -ffast-math then anything can happen, but it will happen 
very quickly! ;-)

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10  2:24                           ` Jon Harrop
@ 2007-02-10 14:41                             ` ls-ocaml-developer-2006
  2007-02-10 14:52                               ` Jon Harrop
  2007-02-10 14:55                               ` Mattias Engdegård
  2007-02-11 13:13                             ` Christophe Raffalli
  1 sibling, 2 replies; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-10 14:41 UTC (permalink / raw)
  To: caml-list


Jon Harrop <jon@ffconsultancy.com> writes:

> On Saturday 10 February 2007 01:41, ls-ocaml-developer-2006@m-e-leypold.de 
> wrote:
>> Sorry, I wanted to say "optimize (a * b) *c == a * (b * c) to true",
>> i.e. 1. Would the compiler be wrong/incorrect in doing so?
>
> Yes.

Just to be sure: Would the compiler be wrong to optimize

   c * q > c * k

to just

       q > k

(all floats). And why? If not, why in the case above? I don't want
chapter and verse, but a general hand-waving in the right direction
would be nice, since I have the impression that this is exactly what Gcc
4.1 is currently doing (though for the integer case).

> Provided you only specify -O* optimisation flags, I believe that reducing that 
> expression to "true" in the general case would be a compiler bug. If you 
> specify flags like -ffast-math then anything can happen, but it will happen 
> very quickly! ;-)

:-)

Regards -- Markus



* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10 14:41                             ` ls-ocaml-developer-2006
@ 2007-02-10 14:52                               ` Jon Harrop
  2007-02-10 15:51                                 ` ls-ocaml-developer-2006
  2007-02-10 14:55                               ` Mattias Engdegård
  1 sibling, 1 reply; 32+ messages in thread
From: Jon Harrop @ 2007-02-10 14:52 UTC (permalink / raw)
  To: caml-list

On Saturday 10 February 2007 14:41, ls-ocaml-developer-2006@m-e-leypold.de 
wrote:
> Just to be sure: Would the compiler be wrong to optimize
>
>    c * q > c * k
>
> to just
>
>        q > k
>
> (all floats). And why?

Consider c=0, q=3 and k=2:

0 * 3 > 0 * 2  --> false
    3 > 2      --> true

It is just mathematically incorrect.

> If not, why, in the case above?

Floats are not reals, they are just an approximation that happens to be very 
useful. Floats do not obey the same laws (e.g. associativity). However, 
programmers may be relying upon this fact, e.g. when doing exact float 
arithmetic.

> I don't want  
> letter and verse, but a general hand waving in the right direction
> would be nice, since I have the impression, that is exactly what Gcc
> 4.1. is currently doing (though for the integer case).

Ints are completely different because they are exact (as modulo integers). So 
they do obey the same laws and they do not have special constants (like 
infinity, neg_infinity, nan, -0. and so on as floats do).

Xavier has written some enlightening posts here in the past, regarding the 
adoption of strict IEEE compliance in OCaml:

http://caml.inria.fr/pub/ml-archives/caml-list/2004/10/ffa452944f4bb9827f2cdca552f4f823.en.html

For an example of someone using the float constants that lie outside the set 
of reals, look no further than my ray tracer:

http://www.ffconsultancy.com/free/ray_tracer/languages.html

I used an initial intersection parameter of lambda=infinity to represent no 
intersection (or intersecting with the sky). I would be mortified if a 
compiler decided to optimise away my necessary and working code just because 
infinity is not in the set of real numbers.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10 14:41                             ` ls-ocaml-developer-2006
  2007-02-10 14:52                               ` Jon Harrop
@ 2007-02-10 14:55                               ` Mattias Engdegård
  1 sibling, 0 replies; 32+ messages in thread
From: Mattias Engdegård @ 2007-02-10 14:55 UTC (permalink / raw)
  To: ls-ocaml-developer-2006; +Cc: caml-list

>Just to be sure: Would the compiler be wrong to optimize
>
>   c * q > c * k
>
>to just
>
>       q > k

With IEEE floating-point it would be wrong for several reasons.

Example 1: q and k are finite, but c*q and c*k are both infinite.
Example 2: c, q and k are small, so that c*q and c*k both underflow to +-0.
Example 3: c is NaN but q and k are not.

I'm sure there are other examples, involving (but not limited to) partial or
total loss of precision, positive/negative zero, infinities and NaNs.
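All three cases are easy to reproduce in the OCaml toplevel; the helper
below simply checks whether the two formulations agree for given c, q and k:

  # let same c q k = (c *. q > c *. k) = (q > k);;
  val same : float -> float -> float -> bool = <fun>
  # same 1e300 3e10 2e10;;       (* both products overflow to infinity *)
  - : bool = false
  # same 1e-300 3e-100 2e-100;;  (* both products underflow to zero *)
  - : bool = false
  # same nan 3. 2.;;             (* comparisons with NaN are false *)
  - : bool = false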


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10 14:52                               ` Jon Harrop
@ 2007-02-10 15:51                                 ` ls-ocaml-developer-2006
  2007-02-10 16:10                                   ` Xavier Leroy
  2007-02-10 16:11                                   ` Jon Harrop
  0 siblings, 2 replies; 32+ messages in thread
From: ls-ocaml-developer-2006 @ 2007-02-10 15:51 UTC (permalink / raw)
  To: caml-list


Jon Harrop <jon@ffconsultancy.com> writes:

> On Saturday 10 February 2007 14:41, ls-ocaml-developer-2006@m-e-leypold.de 
> wrote:
>> Just to be sure: Would the compiler be wrong to optimize
>>
>>    c * q > c * k
>>
>> to just
>>
>>        q > k
>>
>> (all floats). And why?
>
> Consider c=0, q=3 and k=2:
>
> 0 * 3 > 0 * 2  --> false
>     3 > 2      --> true
>
> It is just mathematically incorrect.

Damn. I should have used '+'. I've not been paying attention. My
question should rather have been: is the compiler allowed to make
optimizations according to known mathematical laws?


>> If not, why, in the case above?
>
> Floats are not reals; they are just an approximation that happens to be very 
> useful. Floats do not obey the same laws (e.g. associativity). However, 
> programmers may be relying on the precise IEEE behaviour, e.g. when doing 
> exact float arithmetic.

Apart from my obvious error above, I wonder how much the compiler is
required to keep to those expectations, and when. An example to
illustrate: recently someone complained to the Gcc team that the
compiler optimizes

   a + 100 < 100

to a plain a < 0 when a is a signed int, so the comparison never
notices an overflow. Actually the language standard allows it: if the
addition overflows, anything is allowed (undefined behaviour), and if
it does not, a + 100 < 100 really is just a < 0. The expectation that
the overflow must be visible in the language was not justified in this
case.
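(For contrast: OCaml's native ints are defined to wrap around silently, so
there the overflow really is visible in the language, and the check is just
an ordinary computation:

   # max_int + 100 < 100;;
   - : bool = true

because max_int + 100 wraps around to a negative number.)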

I've been wondering about similar optimizations for floats, but didn't
get my examples right. A transformation done by the compiler would of
course not be allowed to increase the error.

>> I don't want
>> chapter and verse, but a general hand-wave in the right direction
>> would be nice, since I have the impression that this is exactly what Gcc
>> 4.1 is currently doing (though for the integer case).

> Ints are completely different because they are exact (as modulo integers). So 
> they do obey the same laws and they do not have special constants (like 
> infinity, neg_infinity, nan, -0. and so on as floats do).

Yes, thanks. I hope I have sufficiently justified my motivation for asking
the question, which I admit I should have thought about for a few more
minutes :-).

Still: with a certain Gcc version and flag combination the OP saw a
threefold improvement in performance. That in itself is suspicious (I
don't think that much optimization potential was left in Gcc ...), and
I would still check for optimization errors in this case. Gcc is not
bug-free either, so one should first test that the compiled program is
correct and that it really does the work it is supposed to do.

Regards -- Markus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10 15:51                                 ` ls-ocaml-developer-2006
@ 2007-02-10 16:10                                   ` Xavier Leroy
  2007-02-10 16:11                                   ` Jon Harrop
  1 sibling, 0 replies; 32+ messages in thread
From: Xavier Leroy @ 2007-02-10 16:10 UTC (permalink / raw)
  To: ls-ocaml-developer-2006; +Cc: caml-list

> [Many questions about float arithmetic and optimizations]

Do yourself a favor and read Goldberg's excellent reference:

"What Every Computer Scientist Should Know About Floating Point Arithmetic"
ACM Computing Surveys 23(1), 5-48, 1991
http://citeseer.ist.psu.edu/goldberg91what.html

It doesn't read like _TV_digest_, but it's well worth the effort.

> Is the compiler allowed to make
> optimizations according to known mathematical laws?

Yes, provided they actually hold.  Goldberg lists a few that hold in
IEEE float arithmetic (section "Optimizers").  But pretty much every
algebraic law that holds over the reals doesn't hold in floating-point
arithmetic.
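For instance, if I remember the paper correctly, dividing by 2.0 may safely
be rewritten as multiplying by 0.5 (0.5 is exact), while the analogous
rewrite with 10.0 is not valid:

  # let x = 3.0 in x /. 2. = x *. 0.5;;
  - : bool = true
  # let x = 3.0 in x /. 10. = x *. 0.1;;
  - : bool = false

The first rewrite is always safe because 0.5 is represented exactly; the
second already fails for x = 3.0 because 0.1 is not.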

> Still: with a certain Gcc version and flag combination the OP saw a
> threefold improvement in performance. That in itself is suspicious (I
> don't think that much optimization potential was left in Gcc ...), and
> I would still check for optimization errors in this case.

That's a good idea.  Note however that there are known optimizations
(loop blocking, loop interchange) that can dramatically improve the
performance of dense matrix multiply by making better use of the caches.
Automatic vectorization (generation of SSE2 instructions that operate
over pairs of double-precision floats) could also have a significant
impact, although not by a factor of 3.  There's only one way to know:
read the assembly code generated by gcc.
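To give an idea of what loop blocking looks like, here is a rough OCaml
sketch over plain float arrays (not the complex-number case from the
original post; the block size of 32 is an arbitrary, untuned guess):

let multiply_blocked n a b c =
  (* a, b and c are flat, row-major n*n float arrays; c must be
     zero-initialized.  Blocking i and k keeps a tile of a and a few rows
     of b in cache, while the innermost loop streams along contiguous
     rows of b and c. *)
  let bs = 32 in
  for ib = 0 to (n - 1) / bs do
    for kb = 0 to (n - 1) / bs do
      for i = ib * bs to min ((ib + 1) * bs) n - 1 do
        for k = kb * bs to min ((kb + 1) * bs) n - 1 do
          let aik = a.(i * n + k) in
          for j = 0 to n - 1 do
            c.(i * n + j) <- c.(i * n + j) +. aik *. b.(k * n + j)
          done
        done
      done
    done
  done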

- Xavier Leroy


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10 15:51                                 ` ls-ocaml-developer-2006
  2007-02-10 16:10                                   ` Xavier Leroy
@ 2007-02-10 16:11                                   ` Jon Harrop
  1 sibling, 0 replies; 32+ messages in thread
From: Jon Harrop @ 2007-02-10 16:11 UTC (permalink / raw)
  To: caml-list

On Saturday 10 February 2007 15:51, ls-ocaml-developer-2006@m-e-leypold.de 
wrote:
> Damn. I should have taken '+'.

Even for modulo integer arithmetic that still breaks:

  a+b > a+c

When a+c wraps around max_int and becomes negative but a+b does not,
a+b > a+c even though c > b.
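OCaml's native ints wrap, so this is easy to demonstrate in the toplevel
(the values are only chosen to force the wrap-around):

   # let a = max_int - 1 and b = 0 and c = 2 in (a + b > a + c, c > b);;
   - : bool * bool = (true, true)

so a compiler that "simplified" the comparison to b > c would change the
result.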

> Is the compiler allowed to make optimizations according to known
> mathematical laws? 

Provided they are the correct laws, yes.

> I've been wondering about similar optimizations for floats, but didn't
> get my examples right. The transformtion done by the compiler would of
> course be forbidden to increase the error.

Yes, which is actually very restrictive. That is why -ffast-math can make 
C/C++ programs run so much more quickly. In OCaml, -ffast-math has a 
different purpose (to enable the emission of trig instructions, IIRC); it 
exists on x86 but not on AMD64, whereas the same flag exists on both 
platforms for GCC.

Incidentally, why is -ffast-math not always enabled on x86 and why isn't it a 
no-op on AMD64 (rather than causing an "unknown option" error as it does 
ATM)?

> Still: with a certain Gcc version and flag combination the OP saw a
> threefold improvement in performance. That in itself is suspicious (I
> don't think that much optimization potential was left in Gcc ...), and
> I would still check for optimization errors in this case. Gcc is not
> bug-free either, so one should first test that the compiled program is
> correct and that it really does the work it is supposed to do.

Although Frederic is an expert OCaml programmer, it sounds like his C was a 
little off the mark (having undefined behaviour and allowing arbitrary 
optimisations as a consequence).

Benchmarking is a real can of worms. I've spent a long time trying to 
benchmark programs and languages and I really think the only feasible way 
forward is to measure the real time taken to solve real problems.

Even this quickly becomes apples and oranges. For example, F# has a concurrent 
GC and I have a dual core machine, so F# has the advantage that even single 
threaded programs can exploit both cores when they are allocation intensive.

So how am I supposed to compare the performance of the two languages? There is 
no panacea. In the end I must measure the time taken to perform an important 
real world task, like spinning a 3D bunny around. Is spinning a bunny serious 
enough though? If not, maybe I should use a Mandelbrot renderer:

http://caml.inria.fr/cgi-bin/hump.en.cgi?contrib=481

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Caml-list] Multiplication of matrix in C and OCaml
  2007-02-10  2:24                           ` Jon Harrop
  2007-02-10 14:41                             ` ls-ocaml-developer-2006
@ 2007-02-11 13:13                             ` Christophe Raffalli
  1 sibling, 0 replies; 32+ messages in thread
From: Christophe Raffalli @ 2007-02-11 13:13 UTC (permalink / raw)
  Cc: caml-list


Hi,

Picking one coefficient, chosen randomly from the result matrix, should take
constant time and prevent gcc from doing any optimization based on guessing
that the result is unused.
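On the OCaml side of the benchmark it could look like this (just a sketch;
the names follow the code from the first message, and the same idea applies
to the C version):

let touch n c =
  (* Read and print one randomly chosen coefficient of the result, so the
     computation cannot be treated as dead code (assumes n > 0). *)
  let i = Random.int (n * n) in
  Printf.printf "c.(%d) = %g + %gi\n" i c.(i).Complex.re c.(i).Complex.im

Calling touch n c once after the multiplication is enough.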

Christophe Raffalli


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread

Thread overview: 32+ messages
2007-02-07 23:42 Multiplication of matrix in C and OCaml Frédéric Gava
2007-02-08  2:14 ` [Caml-list] " Jacques Garrigue
2007-02-08  9:27   ` Frédéric Gava
2007-02-08  9:38     ` Frédéric Gava
2007-02-08 12:08     ` Jacques Garrigue
2007-02-08  9:56   ` Frédéric Gava
2007-02-08 10:45     ` Xavier Leroy
2007-02-08 15:16       ` Frédéric Gava
2007-02-09  2:58         ` Jacques Garrigue
2007-02-09  9:06           ` ls-ocaml-developer-2006
2007-02-09 10:32             ` ls-ocaml-developer-2006
2007-02-09 14:22               ` skaller
2007-02-09 21:47                 ` ls-ocaml-developer-2006
2007-02-09 21:55                   ` Andrej Bauer
2007-02-09 22:36                     ` ls-ocaml-developer-2006
2007-02-09 23:53                       ` Jon Harrop
2007-02-10  1:41                         ` ls-ocaml-developer-2006
2007-02-10  2:24                           ` Jon Harrop
2007-02-10 14:41                             ` ls-ocaml-developer-2006
2007-02-10 14:52                               ` Jon Harrop
2007-02-10 15:51                                 ` ls-ocaml-developer-2006
2007-02-10 16:10                                   ` Xavier Leroy
2007-02-10 16:11                                   ` Jon Harrop
2007-02-10 14:55                               ` Mattias Engdegård
2007-02-11 13:13                             ` Christophe Raffalli
2007-02-10  1:10                     ` Brian Hurt
2007-02-10  1:16                       ` Robert Roessler
2007-02-09 23:56                 ` Jon Harrop
2007-02-09 12:05             ` Jon Harrop
2007-02-09 12:35               ` ls-ocaml-developer-2006
2007-02-09 13:50             ` Brian Hurt
2007-02-09 14:23               ` Gerd Stolpmann
