caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] DFT in OCaml vs. C
@ 2003-03-27  7:33 Issac Trotts
  2003-03-27 10:58 ` Fabrice Le Fessant
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27  7:33 UTC (permalink / raw)
  To: OCaml List

Here's a numerical mini-benchmark comparing C to OCaml
on a simple implementation of the Discrete Fourier Transform:

  http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz

The results on my 1 GHZ Pentium III Linux box:

C:
real    0m21.273s
user    0m21.200s
sys     0m0.040s

OCaml:
real    1m51.602s
user    1m47.020s
sys     0m0.260s

So the C version was about five times as fast.  This is after looking 
for ideas
in the "Writing Efficient Numerical code in Objective Caml" page [1]
and the Great Language Shootout statistical moment page for OCaml [2].  

The OCaml code was easier to read and debug, and would be
easier to modify.  

I'd be interested if anyone on this list knows of a way
to make it perform as well as the C version (without using the FFT.)

[1] 
http://216.239.53.100/search?q=cache:5YnsSStlWiAC:caml.inria.fr/ocaml/numerical.html+efficient+numerical+ocaml&hl=en&ie=UTF-8

[2]
http://www.bagley.org/~doug/shootout/bench/moments/moments.ocaml

Issac Trotts







-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27  7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
@ 2003-03-27 10:58 ` Fabrice Le Fessant
  2003-03-27 19:40   ` Issac Trotts
  2003-03-27 14:21 ` Markus Mottl
  2003-03-27 14:32 ` Xavier Leroy
  2 siblings, 1 reply; 10+ messages in thread
From: Fabrice Le Fessant @ 2003-03-27 10:58 UTC (permalink / raw)
  To: Issac Trotts; +Cc: OCaml List


>  Here's a numerical mini-benchmark comparing C to OCaml
>  on a simple implementation of the Discrete Fourier Transform:
>  
>    http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
>  
>  The results on my 1 GHZ Pentium III Linux box:

>  I'd be interested if anyone on this list knows of a way
>  to make it perform as well as the C version (without using the FFT.)

If you really want to benchmark the numerical code, then, write a
program where there is only numerical code. Given the size of the
matrices you use (8), one can wonder if the program spends more time
to compute the FFT or to test and print the results.

- Fabrice

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27  7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
  2003-03-27 10:58 ` Fabrice Le Fessant
@ 2003-03-27 14:21 ` Markus Mottl
  2003-03-27 20:47   ` Issac Trotts
  2003-03-27 14:32 ` Xavier Leroy
  2 siblings, 1 reply; 10+ messages in thread
From: Markus Mottl @ 2003-03-27 14:21 UTC (permalink / raw)
  To: Issac Trotts; +Cc: OCaml List

On Wed, 26 Mar 2003, Issac Trotts wrote:
> The results on my 1 GHZ Pentium III Linux box:
> 
> C:
> real    0m21.273s
> user    0m21.200s
> sys     0m0.040s
> 
> OCaml:
> real    1m51.602s
> user    1m47.020s
> sys     0m0.260s

Well, another insignificant benchmark... ;-)

My timings on a 2.4 GHZ Pentium IV using GCC 2.95.4 and the latest
CVS-checkout of OCaml:

C:
real    0m24.920s
user    0m24.900s
sys     0m0.020s

OCaml:
real    0m30.397s
user    0m30.390s
sys     0m0.000s

The difference you have observed on your machine is most likely due to
cache effects and possibly also due to compiler differences (GCC version).

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27  7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
  2003-03-27 10:58 ` Fabrice Le Fessant
  2003-03-27 14:21 ` Markus Mottl
@ 2003-03-27 14:32 ` Xavier Leroy
  2003-03-27 14:55   ` Falk Hueffner
                     ` (2 more replies)
  2 siblings, 3 replies; 10+ messages in thread
From: Xavier Leroy @ 2003-03-27 14:32 UTC (permalink / raw)
  To: Issac Trotts; +Cc: OCaml List

> Here's a numerical mini-benchmark comparing C to OCaml
> on a simple implementation of the Discrete Fourier Transform:
> [...]
> So the C version was about five times as fast.
> I'd be interested if anyone on this list knows of a way
> to make it perform as well as the C version (without using the FFT.)

It can be done, but not on a Pentium 3.  Here are my timings:

                       Pentium 4        Pentium 4 SSE2     Alpha 21264
                       (2 GHz)          (2 GHz)            (500 MHz)

C                       20                20                 36
OCaml (your code)      113                40                 52
OCaml (variant 1)       90                26                 40
OCaml (variant 2)       72                38                100

Variants 1 and 2 differ on the complex multiply step:

Your code:
                let a2=c*. !a -. s*. !b 
                and b2=c*. !b +. s*. !a in
                a := a2; 
                b := b2;
Variant 1:
                let x = s *. !a in
                a := c*. !a -. s*. !b;
                b := c*. !b +. x
Variant 2:
                let olda = !a and oldb = !b in
                a := c *. olda -. s *. oldb;
                b := c *. oldb +. s *. olda

The "Pentium 4 SSE2" column is an experimental code generator for the
Pentium 4 that uses SSE2 instructions and registers for floating-point
computations.  (Before you ask: no, it's not publically available,
but will be the basis for the x86_64 code generator as soon as the
hardware becomes available.)

As you can see above, variant 1 achieves almost the performance of C
on platforms that have a regular register-based FP arithmetic unit.

However, the x86 floating-point stack (what OCaml uses for
compatibility with Pentium 3 and earlier processors) is notoriously
cranky and hard to generate efficient code for.  gcc manages to
exploit instruction-level parallelism between the "re" and "im"
computations via amazing feats (fxch instructions, etc), but the
ocamlopt x86 code generator just generates very sequential code...

So, unless you have an Alpha at hand, you'd better consider FFT.
There's an FFT implementation that I use as a benchmark here:

        http://camlcvs.inria.fr/cgi-bin/cvsweb.cgi/ocaml/test/fft.ml

and it delivers about 2/3 of the performances of C, even on the Pentium.

- Xavier Leroy

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27 14:32 ` Xavier Leroy
@ 2003-03-27 14:55   ` Falk Hueffner
  2003-03-27 16:06   ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
  2003-03-27 20:54   ` [Caml-list] DFT in OCaml vs. C Issac Trotts
  2 siblings, 0 replies; 10+ messages in thread
From: Falk Hueffner @ 2003-03-27 14:55 UTC (permalink / raw)
  To: OCaml List

Xavier Leroy <xavier.leroy@inria.fr> writes:

> It can be done, but not on a Pentium 3.  Here are my timings:
> 
>                        Pentium 4        Pentium 4 SSE2     Alpha 21264
>                        (2 GHz)          (2 GHz)            (500 MHz)
> 
> C                       20                20                 36
> OCaml (your code)      113                40                 52
> OCaml (variant 1)       90                26                 40
> OCaml (variant 2)       72                38                100

Hmm, on an Alpha with 800 MHz, I measured 25 seconds C versus 69
seconds Ocaml, i. e. a factor of 2.8 instead of 1.4... does your
version contain optimizations for non-i386, too? I'm using 3.06.

-- 
	Falk

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
  2003-03-27 14:32 ` Xavier Leroy
  2003-03-27 14:55   ` Falk Hueffner
@ 2003-03-27 16:06   ` David Monniaux
  2003-03-27 21:27     ` Issac Trotts
  2003-03-27 20:54   ` [Caml-list] DFT in OCaml vs. C Issac Trotts
  2 siblings, 1 reply; 10+ messages in thread
From: David Monniaux @ 2003-03-27 16:06 UTC (permalink / raw)
  To: Liste CAML; +Cc: Antoine Mine

> The "Pentium 4 SSE2" column is an experimental code generator for the
> Pentium 4 that uses SSE2 instructions and registers for floating-point
> computations.  (Before you ask: no, it's not publically available,

In this case, to get meaningful comparison results, you should use
gcc -march=pentium4 -msse2 or icc -march=pentium4

> and it delivers about 2/3 of the performances of C, even on the Pentium.

Let me tell you about our experience here. We are developing a large
program consisting of
- a large part of Caml code handling complex data structures
- a smaller C library handling certain numerical matrix computations that
  are triggered by the Caml code
- some C (+ assembler) libraries dealing with system-dependent issues.

I profiled the code using OProfile (http://oprofile.sourceforge.net), for
expenses in clock cycles and cache faults. Earlier attempts were made with
gprof.

It turned out that we spent a significant amount of time in:

- The Caml polymorphic compare function (15% time + some cache faults)

  Part of the problem seems to lie with the fact that the same function is
  called when comparing strings, int64's and other types, thus the
  processor has to do lots of tests and jumps just to get at the correct
  comparison function.

  Wouldn't it be reasonable to define String.compare and Int64.compare to
  call monomorphic functions?

- The garbage collector (15% time + lots of cache faults)

  There's little we can do about it. Changing the size of the minor heap,
  adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
  total running time.

  Curiously, using the compactor seems to slow things slightly.

  Would it be possible to optimize the GC cache-wise? For instance, have
  it ask the processor to "prefetch" data.

- 17% in a particular matrix function written in C. There's little we can
  do except trying to optimize it carefully and compiling it with the best
  C compiler around.

- The rest of the time is spent within the Caml code.

Now this was a bit surprising to us, because we thought we spent far more
time in the numerical computations.


Now back to the original question about DFTs. In your real-life
application, will DFT computations make a major part of the clock cycles
spent by the program?


David Monniaux            http://www.di.ens.fr/~monniaux
Laboratoire d'informatique de l'École Normale Supérieure,
Paris, France






-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27 10:58 ` Fabrice Le Fessant
@ 2003-03-27 19:40   ` Issac Trotts
  0 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 19:40 UTC (permalink / raw)
  To: OCaml List

Fabrice Le Fessant wrote:

>> Here's a numerical mini-benchmark comparing C to OCaml
>> on a simple implementation of the Discrete Fourier Transform:
>> 
>>   http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
>> 
>> The results on my 1 GHZ Pentium III Linux box:
>>    
>>
>
>  
>
>> I'd be interested if anyone on this list knows of a way
>> to make it perform as well as the C version (without using the FFT.)
>>    
>>
>
>If you really want to benchmark the numerical code, then, write a
>program where there is only numerical code. Given the size of the
>matrices you use (8), one can wonder if the program spends more time
>to compute the FFT or to test and print the results.
>
Okay, I should have made the code clearer.  The bulk of the time is spent
in the test2 function, which works on signals much longer than eight 
samples.
After removing the printf calls and test1, the time doesn't improve much:

real    1m47.500s
user    1m37.820s
sys     0m0.250s

Thanks for the suggestions.

Issac


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27 14:21 ` Markus Mottl
@ 2003-03-27 20:47   ` Issac Trotts
  0 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 20:47 UTC (permalink / raw)
  To: OCaml List

Markus Mottl wrote:

>On Wed, 26 Mar 2003, Issac Trotts wrote:
>  
>
>>The results on my 1 GHZ Pentium III Linux box:
>>
>>C:
>>real    0m21.273s
>>user    0m21.200s
>>sys     0m0.040s
>>
>>OCaml:
>>real    1m51.602s
>>user    1m47.020s
>>sys     0m0.260s
>>    
>>
>
>Well, another insignificant benchmark... ;-)
>
>My timings on a 2.4 GHZ Pentium IV using GCC 2.95.4 and the latest
>CVS-checkout of OCaml:
>
>C:
>real    0m24.920s
>user    0m24.900s
>sys     0m0.020s
>
>OCaml:
>real    0m30.397s
>user    0m30.390s
>sys     0m0.000s
>
>The difference you have observed on your machine is most likely due to
>cache effects and possibly also due to compiler differences (GCC version).
>
Probably, but the C version doesn't seem as sensitive to these things.

Issac



-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] DFT in OCaml vs. C
  2003-03-27 14:32 ` Xavier Leroy
  2003-03-27 14:55   ` Falk Hueffner
  2003-03-27 16:06   ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
@ 2003-03-27 20:54   ` Issac Trotts
  2 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 20:54 UTC (permalink / raw)
  To: OCaml List

Xavier Leroy wrote:

>>Here's a numerical mini-benchmark comparing C to OCaml
>>on a simple implementation of the Discrete Fourier Transform:
>>[...]
>>So the C version was about five times as fast.
>>I'd be interested if anyone on this list knows of a way
>>to make it perform as well as the C version (without using the FFT.)
>>    
>>
>
>It can be done, but not on a Pentium 3.  Here are my timings:
>
>                       Pentium 4        Pentium 4 SSE2     Alpha 21264
>                       (2 GHz)          (2 GHz)            (500 MHz)
>
>C                       20                20                 36
>OCaml (your code)      113                40                 52
>OCaml (variant 1)       90                26                 40
>OCaml (variant 2)       72                38                100
>
>Variants 1 and 2 differ on the complex multiply step:
>
>Your code:
>                let a2=c*. !a -. s*. !b 
>                and b2=c*. !b +. s*. !a in
>                a := a2; 
>                b := b2;
>Variant 1:
>                let x = s *. !a in
>                a := c*. !a -. s*. !b;
>                b := c*. !b +. x
>Variant 2:
>                let olda = !a and oldb = !b in
>                a := c *. olda -. s *. oldb;
>                b := c *. oldb +. s *. olda
>
>
>The "Pentium 4 SSE2" column is an experimental code generator for the
>Pentium 4 that uses SSE2 instructions and registers for floating-point
>computations.  (Before you ask: no, it's not publically available,
>but will be the basis for the x86_64 code generator as soon as the
>hardware becomes available.)
>
>As you can see above, variant 1 achieves almost the performance of C
>on platforms that have a regular register-based FP arithmetic unit.
>
>However, the x86 floating-point stack (what OCaml uses for
>compatibility with Pentium 3 and earlier processors) is notoriously
>cranky and hard to generate efficient code for.  gcc manages to
>exploit instruction-level parallelism between the "re" and "im"
>computations via amazing feats (fxch instructions, etc), but the
>ocamlopt x86 code generator just generates very sequential code...
>
>So, unless you have an Alpha at hand, you'd better consider FFT.
>There's an FFT implementation that I use as a benchmark here:
>
>        http://camlcvs.inria.fr/cgi-bin/cvsweb.cgi/ocaml/test/fft.ml
>
>and it delivers about 2/3 of the performances of C, even on the Pentium.
>
>- Xavier Leroy
>
Thanks for a very informative and helpful message.

Issac





-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
  2003-03-27 16:06   ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
@ 2003-03-27 21:27     ` Issac Trotts
  0 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 21:27 UTC (permalink / raw)
  To: OCaml List

David Monniaux wrote:

>>The "Pentium 4 SSE2" column is an experimental code generator for the
>>Pentium 4 that uses SSE2 instructions and registers for floating-point
>>computations.  (Before you ask: no, it's not publically available,
>>    
>>
>
>In this case, to get meaningful comparison results, you should use
>gcc -march=pentium4 -msse2 or icc -march=pentium4
>
>  
>
>>and it delivers about 2/3 of the performances of C, even on the Pentium.
>>    
>>
>
>Let me tell you about our experience here. We are developing a large
>program consisting of
>- a large part of Caml code handling complex data structures
>- a smaller C library handling certain numerical matrix computations that
>  are triggered by the Caml code
>- some C (+ assembler) libraries dealing with system-dependent issues.
>
>I profiled the code using OProfile (http://oprofile.sourceforge.net), for
>expenses in clock cycles and cache faults. Earlier attempts were made with
>gprof.
>
>It turned out that we spent a significant amount of time in:
>
>- The Caml polymorphic compare function (15% time + some cache faults)
>
>  Part of the problem seems to lie with the fact that the same function is
>  called when comparing strings, int64's and other types, thus the
>  processor has to do lots of tests and jumps just to get at the correct
>  comparison function.
>
>  Wouldn't it be reasonable to define String.compare and Int64.compare to
>  call monomorphic functions?
>
>- The garbage collector (15% time + lots of cache faults)
>
>  There's little we can do about it. Changing the size of the minor heap,
>  adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
>  total running time.
>
>  Curiously, using the compactor seems to slow things slightly.
>
>  Would it be possible to optimize the GC cache-wise? For instance, have
>  it ask the processor to "prefetch" data.
>
>- 17% in a particular matrix function written in C. There's little we can
>  do except trying to optimize it carefully and compiling it with the best
>  C compiler around.
>
>- The rest of the time is spent within the Caml code.
>
>Now this was a bit surprising to us, because we thought we spent far more
>time in the numerical computations.
>
>
>Now back to the original question about DFTs. In your real-life
>application, will DFT computations make a major part of the clock cycles
>spent by the program?
>
There's a small image processing experiment I want to do that will compute
lots of DFTs on small sub-images and will probably spend most of its 
clock cycles
doing the transforms.  

- Issac




-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2003-03-27 21:21 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-27  7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
2003-03-27 10:58 ` Fabrice Le Fessant
2003-03-27 19:40   ` Issac Trotts
2003-03-27 14:21 ` Markus Mottl
2003-03-27 20:47   ` Issac Trotts
2003-03-27 14:32 ` Xavier Leroy
2003-03-27 14:55   ` Falk Hueffner
2003-03-27 16:06   ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
2003-03-27 21:27     ` Issac Trotts
2003-03-27 20:54   ` [Caml-list] DFT in OCaml vs. C Issac Trotts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).