From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id JAA10199; Mon, 25 Nov 2002 09:10:33 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id JAA10268 for <caml-list@pauillac.inria.fr>; Mon, 25 Nov 2002 09:10:32 +0100 (MET)
Received: from jerry.kfunigraz.ac.at (GIGAJERRY.kfunigraz.ac.at [143.50.55.161])
	by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id gAP8AV101460
	for <caml-list@inria.fr>; Mon, 25 Nov 2002 09:10:31 +0100 (MET)
Received: from tom.kfunigraz.ac.at (mc_tom [10.10.1.160])
	by jerry.kfunigraz.ac.at (8.11.2/8.11.2) with ESMTP id gAP8AVm01490
	for <caml-list@inria.fr>; Mon, 25 Nov 2002 09:10:31 +0100 (MET)
Received: from kfunigraz.ac.at (IGAM08AV.kfunigraz.ac.at [143.50.39.35])
	by tom.kfunigraz.ac.at (8.11.2/8.11.2) with ESMTP id gAP8AVJ27651
	for <caml-list@inria.fr>; Mon, 25 Nov 2002 09:10:31 +0100 (MET)
Message-ID: <3DE1E6CD.2020906@kfunigraz.ac.at>
Date: Mon, 25 Nov 2002 10:01:01 +0100
From: Siegfried Gonzi <siegfried.gonzi@kfunigraz.ac.at>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204
X-Accept-Language: en-us
MIME-Version: 1.0
To: caml-list@inria.fr
Subject: [Caml-list] Some clarifications to the language shootout page
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

Introduction
------------

The language shootout page has some nice examples where Bigloo and OCaml
stacked up well when compared to C. But I think this is only valid for
the micro benchmarks. Take the matrix-matrix multiplication (MM) for
example. Though, rarely ever will you be in dire need for a
matrix-matrix multiplication, because there exists better methods for
solving problems in linear algebra. However, the matrix-matrix
multiplication is useful to check some facts.

Method
------

I downloaded the codes for the MM from the language shootout page and
adapted the examples as follows: The overall code scaffolding has been
left as it was with the original version on the language shootout page:

a) The dimension is increased to 512x512
b) The run is just performed for one MM
c) The arrays are double precision


Results and Discussion
-----------------------

Though, the published MM codes at the language shootout page do not use
any sophisticated constructs like loop unrolling or the like, they can
nevertheless be used because they are consistent within their specific
programming language namely C, C++, Scheme (Bigloo), OCaml, Python and
Fortran 90/95. However, I have assumed that the posted codes there are
posted by masters of their specific language (not every code has been
written by the owner of the language shootout page).

The code can be found at the end of this post. It is easy to copy and
paste it for own tests: memory has been observed via "top":

compile
			real/usr/sys  		memory
--------------------------------------------------------------
bigloo -Obench bench.scm	22s/22s/0.1s		6%
bigloo -Obench bench.scm	1m30s/1m30s/0.3s	30%	
ocamlopt bench.ml		38s/38s/0.1s		3%
g++ -O6 bench.c			8s/8s/0.1s		3%
g++ -O6 bench.cpp               8.5s/8.5s/0.1s          5%
python bench.py                 4m16s/4m16s/0.1s        6%
ifc -O3 bench.f90               13s/13s/0.04s           3%

[suffixes: scm...Scheme (Bigloo); ml...OCaml; c...gnu C; cpp...gnu C++;
py...Python; f90...Fortran 90/95]
	
[Machine: laptop Celeron 1000MHz, RAM 256MB, SuSE Linux 8.0, gcc, bigloo
2.5a, ocaml 3.04, Python 2.2, Intel Fortran 90/95]

The second Bigloo version uses ordinary "+" and "-" operators. The first
version uses Bigloo's operators "+fl", "+fx",... It is incomprehensible
why the second ordinary Bigloo version uses that much on memory! For a
dimension of 1024x1024 the overall timing picture is the same, except
that the normal Bigloo version consumes 80% of the memory as compared to
the C version which consumes 20%.

Especially important is the fact that the Python version stacks up very
well; you may not forget Python is just interpreted and not compiled.

The elegant C++ version is of interest too, because it relies on the
vector container and it is possible to completely avoid the awful "*".

Some note to the Fortran version: Fortran starts at 1 (C starts at 0)
when indexing arrays; and the first index runs faster (in C the second
index runs faster). Reading from left to right is more natural (index
wise), so using C arrays is way more pleasant. Bigloo uses C's array
indexing scheme. I hate Fortran for this array indexing nightmare and
hope I didn't mix things up because I was a little bit surprised that C
is faster than Fortran 90. The Fortran 90/95 compiler stems from Intel
and is the one and only free  Fortran 90/95 compiler available for Linux (go to the
Intel site and download the 50MB if you need a Fortran 90/95 compiler).
Nowadays times are over were Fortran is that much faster than C/C++ on
ordinary home PCs. Computing in Science and Engineering had once a nice
article (Theurich, G., et al., 2001: "Making the Fortran-To-C
Transition: How Painful is it Really?", January/February 2001, p.21)
about translating  a 20,000 lines of code program (in the field of
chemistry) from Fortran 77 to C. Everybody would have excpected that the
C version runs slower than the original Fortran version, but it turned
out that the C version runs even slightly faster than the Fortran version.


Before you scream:

a) I am sure there exists an ocaml option for further optimizations;
"ocamlopt --help" did not unveil it.

b) The Bigloo (Scheme) version could surely be improved but how? /I mean
not the code scheme itslef, because this should remain consistent as
compared to the other code versions./ Fore example transposing one array
is not allowed, because this would not only reduce Bigloo's execution
time but also the time of the C version would be reduced by a factor
of 2. Surely, there exists the caveat: But mum told me, we should
not use the same algorithms for every programming language. But mum
didn't tell  that often the reality does not let you do so. And in
addition mum did not tell you that you are not a Forth programmer.

c) I post it here for soliciting comments. I could have gone to bed to
construe my own explanations and theories. But this would likely result
in a desaster, because most of the time it isn't how I would like that
it is.

The original language shootout page results on the MM are misleading:

a) Pure integers are hardly ever used in reality
b) Problems often do not fit into the cache
c) A comparsion of the FFT would be more interesting


Conclusion
----------

Nevertheless Bigloo remains still a very decent compiler in my opinion,
and I do not regret that I have made the transition from Python (which
is one of the most confusing programming languages out there) to Scheme
speak Bigloo.
People should be more wary that consulting the language shootout page
can result in misleading conclusions; as you know: benchmarking is black
art.

S. Gonzi
APPENDIX:
===============================
Normal Bigloo version
original v.: (C) M. Serrano
usage: bigloo -Obench bench.scm
       time ./a.out
===============================
(module bench-matrix)


(define (mkmatrix rows cols)
  (let ((mx (make-vector rows 0.0))
	(count 1.0))
    (do ((i 0 (+ i 1)))
	((= i rows))
      (let ((row (make-vector cols 0.0)))
	(do ((j 0 (+ j 1)))
	    ((= j cols))
	  (vector-set! row j count)
	  (set! count (+ count 1.0)))
	(vector-set! mx i row)))
    mx))

(define (num-cols mx)
  (let ((row (vector-ref mx 0)))
    (vector-length row)))

(define (num-rows mx)
  (vector-length mx))

(define (mmult rows cols m1 m2)
  (let ((m3 (make-vector rows 0.0)))
    (do ((i 0 (+ 1 i)))
	((= i rows))
      (let ((m1i (vector-ref m1 i))
	    (row (make-vector cols 0.0)))
	(do ((j 0 (+ 1 j)))
	    ((= j cols))
	  (let ((val 0.0))
	    (do ((k 0 (+ 1 k)))
		((= k cols))
	      (set! val (+ val (* (vector-ref m1i k)
				  (vector-ref (vector-ref m2 k) j)))))
	    (vector-set! row j val)))
	(vector-set! m3 i row)))
    m3))


(define (do-main size)
  (let ((mm 0.0)
	(m1 (mkmatrix size size))
	(m2 (mkmatrix size size)))
    (set! mm (mmult size size m1 m2))
    (let ((r0 (vector-ref mm 0))
	  (r2 (vector-ref mm 2))
	  (r3 (vector-ref mm 3))
	  (r4 (vector-ref mm 4)))
      (print (vector-ref r0 0) " " (vector-ref r2 3) " "
		  (vector-ref r3 2) " " (vector-ref r4 4)))))

(do-main 512)
====

===============================
Bigloo version based on +fl,...
original v.: (C) M. Serrano
usage: bigloo -Obench bench.scm
       time ./a.out
===============================
(module bench-matrix
	(option (set! *genericity* #f)))


(define (mkmatrix rows cols)
  (let ((mx (make-vector rows 0.0))
	(count 1.0))
    (do ((i 0 (+fx i 1)))
	((=fx i rows))
      (let ((row (make-vector cols 0.0)))
	(do ((j 0 (+fx j 1)))
	    ((=fx j cols))
	  (vector-set! row j count)
	  (set! count (+fl count 1.0)))
	(vector-set! mx i row)))
    mx))

(define (num-cols mx)
  (let ((row (vector-ref mx 0)))
    (vector-length row)))

(define (num-rows mx)
  (vector-length mx))

(define (mmult rows cols m1 m2)
  (let ((m3 (make-vector rows 0.0)))
    (do ((i 0 (+fx 1 i)))
	((=fx i rows))
      (let ((m1i (vector-ref m1 i))
	    (row (make-vector cols 0.0)))
	(do ((j 0 (+fx 1 j)))
	    ((=fx j cols))
	  (let ((val 0.0))
	    (do ((k 0 (+fx 1 k)))
		((=fx k cols))
	      (set! val (+fl val (*fl (vector-ref m1i k)
				  (vector-ref (vector-ref m2 k) j)))))
	    (vector-set! row j val)))
	(vector-set! m3 i row)))
    m3))


(define (do-main size)
  (let ((mm 0.0)
	(m1 (mkmatrix size size))
	(m2 (mkmatrix size size)))
    (set! mm (mmult size size m1 m2))
    (let ((r0 (vector-ref mm 0))
	  (r2 (vector-ref mm 2))
	  (r3 (vector-ref mm 3))
	  (r4 (vector-ref mm 4)))
      (print (vector-ref r0 0) " " (vector-ref r2 3) " "
		  (vector-ref r3 2) " " (vector-ref r4 4)))))

(do-main 512)
====


=========================
Ocaml version
original v.: (C) M. Mottl
usage: 	ocamlopt bench.ml
        time ./a.out
=========================
let mkmatrix rows cols =
  let count = ref 1 and last_col = cols - 1
  and m = Array.make_matrix rows cols 0.0 in
  for i = 0 to rows - 1 do
    let mi = m.(i) in
    for j = 0 to last_col do
        mi.(j) <- float_of_int(!count);
        incr count done;
  done;
  m

let rec inner_loop k v m1i m2 j =
  if k < 0 then v
  else inner_loop (k - 1) (v +. m1i.(k) *. m2.(k).(j)) m1i m2 j

let mmult rows cols m1 m2 m3 =
  let last_col = cols - 1 and last_row = rows - 1 in
  for i = 0 to last_row do
    let m1i = m1.(i) and m3i = m3.(i) in
    for j = 0 to last_col do m3i.(j) <- inner_loop last_row 0.0 m1i m2 j
done;
  done

let size = 512
let m1 = mkmatrix size size and m2 = mkmatrix size size
let m3 = Array.make_matrix size size 0.0;;
mmult size size m1 m2 m3;
Printf.printf "%f %f %f %f\n" m3.(0).(0) m3.(2).(3) m3.(3).(2) m3.(4).(4)
====

=======================
C version
usage: 	g++ -O6 bench.c
	time ./a.out
=======================
#include <iostream>
#include <stdlib.h>

using namespace std;

double **mkmatrix(int rows, int cols) {
    int i, j, count = 1;
    double **m = (double **) malloc(rows * sizeof(double *));
    for (i=0; i<rows; i++) {
	m[i] = (double *) malloc(cols * sizeof(double));
	for (j=0; j<cols; j++) {
	    m[i][j] = count++;
	}
    }
    return(m);
}

void zeromatrix(int rows, int cols, double **m) {
    int i, j;
    for (i=0; i<rows; i++)
	for (j=0; j<cols; j++)
	    m[i][j] = 0;
}

void freematrix(int rows, double **m) {
    while (--rows > -1) { free(m[rows]); }
    free(m);
}

double **mmult(int rows, int cols, double **m1, double **m2, double **m3) {
    int i, j, k;
    double val;
    for (i=0; i<rows; i++) {
	for (j=0; j<cols; j++) {
	    val = double(0.0);
	    for (k=0; k<cols; k++) {
		val += m1[i][k] * m2[k][j];
	    }
	    m3[i][j] = val;
	}
    }
    return(m3);
}

int main()
{
    int i, SIZE=512;

    double **m1 = mkmatrix(SIZE, SIZE);
    double **m2 = mkmatrix(SIZE, SIZE);
    double **mm = mkmatrix(SIZE, SIZE);

    mm = mmult(SIZE, SIZE, m1, m2, mm);

    cout << mm[0][0] << " " << mm[2][3] << " " << mm[3][2] << " " <<
mm[4][4] << endl;

    freematrix(SIZE, m1);
    freematrix(SIZE, m2);
    freematrix(SIZE, mm);
    return(0);
}
====

=========================
C++ version
(C) S. Gonzi
usage: 	g++ -O6 bench.cpp
	time ./a.out
=========================
#include <iostream>
#include <vector>

using namespace std;

vector<vector<double> > mkmatrix(int rows, int cols)
{
    double counter=double(0.0);
    vector<vector<double> > m;


    for(int i=0; i<rows; i++)
    {
	vector<double> row_vec;
	for(int j=0; j<cols; j++)
	{
	    counter = counter + 1.0;
	    row_vec.push_back(counter);
	}
	m.push_back(row_vec);
    }
	return(m);
}


vector<vector<double> > mmult(int rows, int cols,
				vector<vector<double> > m1,
			      	vector<vector<double> > m2,
			 	vector<vector<double> > m3)
{
    double val;
    for(int i=0; i<rows; i++)
    {
	for(int j=0; j<rows; j++)
	{
	    val = double(0.0);
	    for(int k=0; k<cols; k++)
	    {
		val += m1[i][k] * m2[k][j];
	    }
	    m3[i][j] = val;
	}
    }
    return(m3);
}


int main()
{
    int SIZE=512;

    vector<vector<double> > m1 = mkmatrix(SIZE,SIZE);
    vector<vector<double> > m2 = mkmatrix(SIZE,SIZE);
    vector<vector<double> > mm = mkmatrix(SIZE,SIZE);

    mm = mmult(SIZE,SIZE,m1,m2,mm);

    cout << mm[0][0] << " " << mm[2][3] << " " << mm[3][2] << " " <<
mm[4][4] << endl;

    return(0);
}
====


=============================
Python version
usage: 	time python  bench.py
=============================
def mkmatrix(rows, cols):
    count = 1.0
    mx = [ None ] * rows
    for i in xrange(rows):
        mx[i] = [0] * cols
        for j in xrange(cols):
            mx[i][j] = count
            count += 1.0
    return mx

def mmult(rows, cols, m1, m2):
    m3 = [ None ] * rows
    for i in xrange( rows ):
        m3[i] = [0] * cols
        for j in xrange( cols ):
            val = 0.0
            for k in xrange( cols ):
                val += m1[i][k] * m2[k][j]
            m3[i][j] = val
    return m3

size = 512

def main():
    m1 = mkmatrix(size, size)
    m2 = mkmatrix(size, size)
    mm = mmult(size, size, m1, m2)
    print mm[0][0], mm[2][3], mm[3][2], mm[4][4]

main()
====

=============================
Intel Fortran 90/95 version
(C) Siegfried Gonzi
usage: 	ifc -O3  bench.f90
        timer ./a.out
=============================
program main
  integer:: size=512
  double precision, POINTER,dimension(:,:):: m1, m2, mm

  call mkmatrix(size,size,m1)
  call mkmatrix(size,size,m2)
  call mkmatrix(size,size,mm)

  call mmult(size,size,m1,m2,mm)

  print *, mm(1,1)

  deallocate(m1)
  deallocate(m2)
  deallocate(mm)

contains

subroutine mkmatrix(rows, cols,m)
  integer, intent(in):: rows, cols
  double precision, POINTER, dimension(:,:):: m
  double precision:: counter
  integer:: i,j

  allocate(m(rows,cols))

  counter = 1.0
  do i=1,cols
     do j=1,rows
        counter = counter + 1.0
        m(j,i) = counter
     end do
  end do
end subroutine mkmatrix


subroutine mmult(rows, cols, m1, m2, m3)
  integer, intent(in):: rows, cols
  double precision, dimension(rows,cols), intent(in):: m1, m2
  double precision, dimension(rows,cols), intent(in out):: m3
  double precision:: val
  integer:: i,j,k

  do i=1,rows
     do j=1,rows
        val = 0.0
        do k=1,cols
           val = val + m1(k,i)*m2(j,k)
        end do
      m3(j,i) = val
      end do
   end do
end subroutine mmult
end program main
====


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners