caml-list - the Caml user's mailing list
* Q: float arrays
@ 1996-10-29 11:12 Jocelyn Serot
  1996-10-30 13:40 ` Xavier Leroy
  0 siblings, 1 reply; 6+ messages in thread
From: Jocelyn Serot @ 1996-10-29 11:12 UTC (permalink / raw)
  To: caml-list


Hello,

Trying to quantify the cost of handling arrays of tuples instead of
multiple "parallel" arrays, I recently came to some results which, at
first sight, seem a little surprising.
Can someone please enlighten me (or point out the bugs I made!)?

What I am comparing here is two functionally equivalent implementations
of the same operation, namely multiplying two arrays of floats.
The first one uses a map2 higher-order function (similar to List.map2 but
for arrays); the second works on a single array of pairs.

Here's the code:

*********************** farrays.ml ******************************************

open Array

let map f a =		(* this is simply copied from stdlib *)
  let l = length a in
  if l = 0 then [||] else begin
    let r = create l (f(unsafe_get a 0)) in
    for i = 1 to l - 1 do
      unsafe_set r i (f (unsafe_get a i))
    done;
    r
  end

let map2 f a a' =		(* a slight generalization *)
  let l = length a in
  if ( length a' != l ) then invalid_arg "map2" else
  if l = 0 then [||] else begin
    let r = create l (f(unsafe_get a 0)(unsafe_get a' 0)) in
    for i = 1 to l - 1 do
      unsafe_set r i (f (unsafe_get a i) (unsafe_get a' i))
    done;
    r
  end

let main () =
  let a1 = Array.create (1024*1024) 123.456
  and a2 = Array.create (1024*1024) 123.456
  and a3 = Array.create (1024*1024) (123.456,123.456) in
  let _ = Profile.start () in
  let f1 = Profile.f2 "map2" (map2 (fun p p' -> p*.p')) in
  let _ = for i = 0 to 10 do ignore (f1 a1 a2) done in
  let f2 = Profile.f1 "map_pairs" (map (function (p,p') -> (p*.p'))) in
  let _ = for i = 0 to 10 do ignore (f2 a3) done in
  let _ = Profile.stop () in
  ()

let _ = Printexc.catch main ()

*******************************************************************************

Time profiling is done using the time-profiling module posted to the
Caml mailing list by Mark Hayden (hayden@cs.cornell.edu) in April 1996.
Both the bytecode and the native code have been profiled.

-----------------------------------------------------------------------RESULTS
BYTECODE:

jserot@tufa$ main
 Number of samples is 11049
 Each sample counts as 0.02 seconds.

 %       self     self&            self    total
 time    seconds  children  calls  ms/call ms/call name
-------------------------------------------------------
51.42   98.894   98.894       11   8.990   8.990   map2
48.58   93.446   93.446       11   8.495   8.495   map_pairs

NATIVE:

jserot@tufa$ omain
 Number of samples is 1584
 Each sample counts as 0.02 seconds.

 %       self     self&            self    total
 time    seconds  children  calls  ms/call ms/call name
-------------------------------------------------------
62.56   16.198   16.198       11   1.473   1.473   map2
37.44    9.692    9.692       11   0.881   0.881   map_pairs

nb: tufa is a sun4-51 Sparc server. Compiler is ocaml-1.02
------------------------------------------------------------------------------

What I cannot understand is that map2 seems to be slower than map_pairs (by
a factor of 1 to 2 in native-compiled mode). The little I know about the Caml
compiler is that it doesn't box floats in arrays. But I guess that in an array
of float*float, each pair has to be boxed. So accessing arguments from the
map_pairs function should be definitely slower than from map2.
Does that mean that accessing two unboxed floats takes longer than
accessing a boxed pair of floats (twice the time in native mode?!)

There is also the possibility of significant inaccuracies in my measuring
protocol...

Could someone comment?

	Jocelyn

-- 
E-mail: jserot@epcc.ed.ac.uk (->Nov 29 1996) ................................
S-mail: LASMEA - URA 1793 CNRS, Universite Blaise Pascal, 63177 Aubiere cedex
Tel: (33) 73.40.73.30 - Fax: (33) 73.40.72.62 ...............................
.... http://wwwlasmea.univ-bpclermont.fr/Personnel/Jocelyn.Serot/Welcome.html





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Q: float arrays
  1996-10-29 11:12 Q: float arrays Jocelyn Serot
@ 1996-10-30 13:40 ` Xavier Leroy
  1996-10-30 14:08   ` Jocelyn Serot
  1996-10-30 14:52   ` Thorsten Ohl
  0 siblings, 2 replies; 6+ messages in thread
From: Xavier Leroy @ 1996-10-30 13:40 UTC (permalink / raw)
  To: Jocelyn Serot; +Cc: caml-list


> Trying to quantify the cost of handling arrays of tuples instead of 
> multiple "parallel" arrays, I recently came to some results which, at 
> first sight, seem a little surprising.

> What I cannot understand is that map2 seems to be slower than
> map_pairs (by a factor of 1 to 2 in native-compiled mode). The
> little I know about the Caml compiler is that it doesn't box floats
> in arrays.

The bytecode compiler keeps floats boxed in arrays, so there is no
essential difference between an array of pairs of floats and two float
arrays. (The array of pairs entails one extra indirection and is less
space efficient, but has better locality.)

The native-code compiler unboxes floats in arrays. But unboxing is a
double-edged sword. When the float array is accessed as a float
(e.g. to perform arithmetic on it), unboxed arrays are a big win.
When the float array is accessed as a Caml value (e.g. to pass to
another function, or from a piece of polymorphic code), the unboxed
float needs to be boxed first to convert it to a regular Caml value,
and this is expensive.

Your map2 function actually performs two boxing operations at each
iteration, because the two float elements are passed to a function. It
also has to test dynamically the kind of the array (float
vs. non-float), because it's a polymorphic function.  This explains
why it's slower than map on an array of pairs (which involves no
boxing, just 3 indirections per iteration).

If you inline the "*." operation in map2, all boxing will be
eliminated (and all dynamic tests as well, since the function is now
monomorphic), and you'll get much better performance than what can be
obtained with an array of pairs:

 let product a a' =
   let l = length a in
   if length a' != l then invalid_arg "product" else
   if l = 0 then [||] else begin
     let r = create l (unsafe_get a 0 *. unsafe_get a' 0) in
     for i = 1 to l - 1 do
       unsafe_set r i (unsafe_get a i *. unsafe_get a' i)
     done;
     r
   end

Polymorphism and higher-order functions don't mix well with high
performance. If you need Fortran-like performance, there are cases
where you must write Fortran-style code.
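[For anyone replaying this today, here is a self-contained sketch of the specialized version, assuming a current compiler where Array.create has been renamed Array.make; safe indexing is used for brevity:

```ocaml
(* Monomorphic float-array product: with the type annotations the
   native compiler knows both arguments are flat float arrays, so
   the loop runs with no boxing and no dynamic tag tests. *)
let product (a : float array) (a' : float array) =
  let l = Array.length a in
  if Array.length a' <> l then invalid_arg "product"
  else if l = 0 then [||]
  else begin
    let r = Array.make l (a.(0) *. a'.(0)) in
    for i = 1 to l - 1 do
      r.(i) <- a.(i) *. a'.(i)
    done;
    r
  end

let () = assert (product [| 1.0; 2.0 |] [| 3.0; 4.0 |] = [| 3.0; 8.0 |])
```
]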

- Xavier Leroy






* Re: Q: float arrays
  1996-10-30 13:40 ` Xavier Leroy
@ 1996-10-30 14:08   ` Jocelyn Serot
  1996-10-30 14:52   ` Thorsten Ohl
  1 sibling, 0 replies; 6+ messages in thread
From: Jocelyn Serot @ 1996-10-30 14:08 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: caml-list


> [... lots of interesting comments deleted ...]
>
> If you inline the "*." operation in map2, all boxing will be
> eliminated (and all dynamic tests as well, since the function is now
> monomorphic), and you'll get much better performance than what can be
> obtained with an array of pairs:
> 
>  let product a a' =
>    let l = length a in
>    if length a' != l then invalid_arg "product" else
>    if l = 0 then [||] else begin
>      let r = create l (unsafe_get a 0 *. unsafe_get a' 0) in
>      for i = 1 to l - 1 do
>        unsafe_set r i (unsafe_get a i *. unsafe_get a' i)
>      done;
>      r
>    end

OK, I tried this. Here are the results:

jserot@tufa$ omain
 Number of samples is 1683
 Each sample counts as 0.02 seconds.

 %       self     self&            self    total
 time    seconds  children  calls  ms/call ms/call name
-------------------------------------------------------
63.10   18.527   18.527       11   1.684   1.684   map2
36.24   10.641   10.641       11   0.967   0.967   map_pairs
 0.59    0.174    0.174       11   0.016   0.016   product

Self-explanatory.

Thanks for your help,

	Jocelyn






* Re: Q: float arrays
  1996-10-30 13:40 ` Xavier Leroy
  1996-10-30 14:08   ` Jocelyn Serot
@ 1996-10-30 14:52   ` Thorsten Ohl
  1996-10-31 15:26     ` Xavier Leroy
  1996-11-05 16:56     ` Controlling inlining [was Re: Q: float arrays] rowan+
  1 sibling, 2 replies; 6+ messages in thread
From: Thorsten Ohl @ 1996-10-30 14:52 UTC (permalink / raw)
  To: caml-list; +Cc: Xavier Leroy


>>>>> "Xavier" == Xavier Leroy <Xavier.Leroy@inria.fr> writes:

Xavier> Polymorphism and higher-order functions don't mix well with
Xavier> high performance. If you need Fortran-like performance, there
Xavier> are cases where you must write Fortran-style code.

I know that I'm asking too much, but wouldn't it be nice if the
compiler did it for me?  In the example at hand,

  let f2 = map (function (p,p') -> (p*.p'))

the compiler could inline the multiplication automagically, if it
still had access to the definition of the map function.

A trivial example is code like the following:

  let exponentiate make_unit eps norm plus minus times scale x =
    let u = make_unit x in
    let rec sumup s n x1 xn =
      if norm (xn) <= eps *. norm (minus s u) then
	s
      else
	sumup (plus s xn) (n +. 1.0) x1 (scale (1.0 /. n) (times x1 xn))
    in
      (* usually, times x u == x, but u can be a projector which
	 will speed up things for special cases.  *)
      sumup u 2.0 x (times x u)
  
  let exp_ =
    exponentiate (fun _ -> 1.0) 1e-14 abs_float ( +. ) ( -. ) ( *. ) ( *. )
  
  let exp_matrix =
    exponentiate unit_like 1e-14 infinity_norm add subtract multiply scale_matrix

Here partial application really shines, and it could shine even
brighter if there were no speed penalty ...
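
[For what it's worth, the scalar instance can be sanity-checked against the built-in exp; a self-contained sketch, repeating the definitions above so that it runs on its own:

```ocaml
let exponentiate make_unit eps norm plus minus times scale x =
  let u = make_unit x in
  let rec sumup s n x1 xn =
    if norm xn <= eps *. norm (minus s u) then s
    else sumup (plus s xn) (n +. 1.0) x1 (scale (1.0 /. n) (times x1 xn))
  in
  (* usually times x u = x, but u can be a projector in special cases *)
  sumup u 2.0 x (times x u)

let exp_ =
  exponentiate (fun _ -> 1.0) 1e-14 abs_float ( +. ) ( -. ) ( *. ) ( *. )

(* the power series 1 + x + x^2/2! + ... agrees with the built-in exp *)
let () = assert (abs_float (exp_ 1.0 -. exp 1.0) < 1e-9)
```
]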

Cheers,
-Thorsten

-- 
Thorsten Ohl, Physics Department, TH Darmstadt --- PGP: AF 38 FF CE 03 8A 2E A7
http://crunch.ikp.physik.th-darmstadt.de/~ohl/ -------- 8F 2A C1 86 8C 06 32 6B






* Re: Q: float arrays
  1996-10-30 14:52   ` Thorsten Ohl
@ 1996-10-31 15:26     ` Xavier Leroy
  1996-11-05 16:56     ` Controlling inlining [was Re: Q: float arrays] rowan+
  1 sibling, 0 replies; 6+ messages in thread
From: Xavier Leroy @ 1996-10-31 15:26 UTC (permalink / raw)
  To: Thorsten Ohl; +Cc: caml-list


> I know that I'm asking too much, but wouldn't it be nice if the
> compiler did it for me?   In the example at hand,
> 
>   let f2 = map (function (p,p') -> (p*.p'))
> 
> the compiler could inline the multiplication automagically, if it
> still had access to the definition of the map function.

That's right, and I still plan to add some automatic inlining to
Objective Caml at some point. However, known inlining strategies
are very conservative and err on the safe side in order to avoid code
size blowup. This means that if your "map" functional is sufficiently
big, it will not be inlined, even though this could result in important
speedups.

I'm not trying to discourage anyone from writing polymorphic or
higher-order functions: if that makes the program clearer, do it, and
performance will be all right most of the time.

What I'm saying is that if the performance of your application depends
critically on a few functions (say, matrix operations, or a Fourier
transform), then among the first things you should consider when
hand-optimizing these functions are removing function parameters and
avoiding polymorphic array accesses. This is much more important
performance-wise than, say, removing array bound checks or turning
recursive functions into loops.
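
[His point can be seen in miniature by comparing a polymorphic fold with a float-specialized one; a sketch: on the native compiler the second runs with unboxed accesses, while the first must tag-test the array and box each element:

```ocaml
(* Polymorphic: array accesses must test the array kind at run time,
   and each float element is boxed before being passed to plus. *)
let sum_poly zero plus a =
  let s = ref zero in
  for i = 0 to Array.length a - 1 do s := plus !s a.(i) done;
  !s

(* Monomorphic: the float array type is known statically, so the
   compiler reads elements unboxed and compiles +. directly. *)
let sum_float (a : float array) =
  let s = ref 0.0 in
  for i = 0 to Array.length a - 1 do s := !s +. a.(i) done;
  !s

let () =
  let a = [| 1.0; 2.0; 3.5 |] in
  assert (sum_poly 0.0 ( +. ) a = sum_float a)
```
]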

- Xavier Leroy






* Controlling inlining [was Re: Q: float arrays]
  1996-10-30 14:52   ` Thorsten Ohl
  1996-10-31 15:26     ` Xavier Leroy
@ 1996-11-05 16:56     ` rowan+
  1 sibling, 0 replies; 6+ messages in thread
From: rowan+ @ 1996-11-05 16:56 UTC (permalink / raw)
  To: Thorsten Ohl; +Cc: caml-list, Xavier Leroy



Thorsten:
> I know that I'm asking too much, but wouldn't it be nice if the
> compiler did it for me?   In the example at hand,
> 
>   let f2 = map (function (p,p') -> (p*.p'))
> 
> the compiler could inline the multiplication automagically, if it
> still had access to the definition of the map function.
...
> Here partial application really shines and it could shine even
> brighter if there was no speed penalty ...

Xavier replied:
> That's right, and I still plan to put some automatic inlining in
> Objective Caml at some point. However, known inlining strategies
> are very conservative and err on the safe side in order to avoid code
> size blowup. This means that if your "map" functional is sufficiently
> big, it will not be inlined, even though this could resut in important
> speedups.

This is part of the motivation for my recent work with Frank Pfenning
on "Staged Computation", which is related to partial evaluation and
run-time code generation.  The basic idea is to allow the programmer
to annotate their program to indicate which parts can be done "early",
which in this case means compile-time.  The result of the early stage
is a residual program which is executed at a later stage, in this case
run-time.

Using annotations allows the programmer to identify situations where
"inlining" (or "specialization") would be beneficial, which is very
difficult for the compiler to determine.  A static type system
guarantees that the programmer's annotations are consistent, and allows
propagation of this information across module boundaries.

There are two papers so far describing this work, though both are
rather abstract.  The first [1] describes a language which allows
specialization at run-time.  The second [2] describes a system with
distinct stages like "compile-time" and "run-time".

Cheers

- Rowan

[1] Rowan Davies and Frank Pfenning.  A Modal Analysis of Staged
Computation, 23rd Annual ACM Symposium on Principles of Programming
Languages (POPL'96), St. Petersburg Beach, Florida, January 1996.  ACM
press.  
http://www.cs.cmu.edu/~rowan/papers/mlboxpoplfinal.ps.gz

[2] Rowan Davies. A Temporal Logic Approach to Binding-Time Analysis,
In E. Clarke, editor, Proceedings of the Eleventh Annual Symposium on
Logic in Computer Science, New Brunswick, New Jersey, July 1996. IEEE
Computer Society Press.
http://www.cs.cmu.edu/~rowan/papers/circle.ps.gz






