caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] ackermann microbenchmark strange results
@ 2013-04-24 10:35 ygrek
  2013-04-24 15:57 ` John Carr
  2013-04-24 17:40 ` Xavier Leroy
  0 siblings, 2 replies; 9+ messages in thread
From: ygrek @ 2013-04-24 10:35 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 1175 bytes --]

Hello,

 Got some time scratching my head over this little puzzle.
 Consider this bog-standard ackermann code :

let rec ack m n =
  match m, n with
  | 0,n -> n+1
  | m,0 -> ack (m-1) 1
  | m,n -> ack (m-1) (ack m (n-1))
in let _ = ack 4 1 ()

One could also pass m and n as a tuple. Also the call to the actual computation can be a toplevel let or not.
All in all 4 variants. Can you predict what will be the performance and what is the difference (if any) in generated
code?

All code and Makefile is attached.

Running `make bench` here consistently gives the following (ack1, ack3 - tuples, ack2, ack4 - curried) :

ack1.ml
0:03.85

ack2.ml
0:04.70

ack3.ml
0:04.60

ack4.ml
0:03.85

Tested with 3.12.1 and 4.00.1 (ack4 becomes slower).

Moreover, the generated assembly code for the main loop is the same, afaics. The only
difference is the initialization of structure fields and the initial call to ack. Please can anybody
explain the performance difference? I understand that microbenchmarks are no way the basis to draw
performance conclusions upon, but I cannot explain these results to myself in any meaninful way.
Please help! :)

-- 
 ygrek
 http://ygrek.org.ua

[-- Attachment #2: ack1.ml --]
[-- Type: application/octet-stream, Size: 125 bytes --]

let rec ack = function
  | 0,n -> n+1
  | m,0 -> ack (m-1, 1)
  | m,n -> ack (m-1, ack (m, n-1))
in let _ = ack (4, 1) in ()

[-- Attachment #3: ack2.ml --]
[-- Type: application/octet-stream, Size: 134 bytes --]

let rec ack m n =
  match m, n with
  | 0,n -> n+1
  | m,0 -> ack (m-1) 1
  | m,n -> ack (m-1) (ack m (n-1))
in let _ = ack 4 1 in ()

[-- Attachment #4: ack3.ml --]
[-- Type: application/octet-stream, Size: 117 bytes --]

let rec ack = function
  | 0,n -> n+1
  | m,0 -> ack (m-1, 1)
  | m,n -> ack (m-1, ack (m, n-1))

let _ = ack (4, 1)

[-- Attachment #5: ack4.ml --]
[-- Type: application/octet-stream, Size: 126 bytes --]

let rec ack m n =
  match m, n with
  | 0,n -> n+1
  | m,0 -> ack (m-1) 1
  | m,n -> ack (m-1) (ack m (n-1))

let _ = ack 4 1

[-- Attachment #6: Makefile --]
[-- Type: application/octet-stream, Size: 323 bytes --]


target: ack1 ack2 ack3 ack4 ack1.s ack2.s ack3.s ack4.s

ack%.s: ack%.ml
	(cp $< ack.ml; ocamlopt -S -c ack.ml; mv ack.s $@; rm ack.ml)

ack%: ack%.ml
	ocamlopt -o $@ $<

.PHONY:
bench: target
	$(foreach i,1 2 3 4,echo ack$i.ml; time -f %E ./ack$i; echo;)

.PHONY: clean
clean:
	rm -f *.s ack? *.cmi *.cmx *.o *.obj *.exe

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 10:35 [Caml-list] ackermann microbenchmark strange results ygrek
@ 2013-04-24 15:57 ` John Carr
  2013-04-24 16:08   ` Alain Frisch
                     ` (2 more replies)
  2013-04-24 17:40 ` Xavier Leroy
  1 sibling, 3 replies; 9+ messages in thread
From: John Carr @ 2013-04-24 15:57 UTC (permalink / raw)
  To: ygrek; +Cc: caml-list


Try changing loop alignment by editing assembly code.  The address of
ack is different in the different versions.  Modern Intel processors are
sensitive to code alignment.  There is a limit on the number of branch
prediction table entries per cache line.  An instruction that crosses a
cache line boundary may be slower than an instruction within a cache
line.  I am not surprised to see a 20% difference caused by an apparently
irrelevant code change.

> Moreover, the generated assembly code for the main loop is the same, afaics. The only
> difference is the initialization of structure fields and the initial call to ack. Please can anybody
> explain the performance difference? I understand that microbenchmarks are no way the basis to draw
> performance conclusions upon, but I cannot explain these results to myself in any meaninful way.
> Please help! :)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 15:57 ` John Carr
@ 2013-04-24 16:08   ` Alain Frisch
  2013-04-24 16:57     ` Anthony Tavener
  2013-04-24 17:26     ` rixed
  2013-04-24 17:35   ` Matteo Frigo
  2013-04-26  3:31   ` ygrek
  2 siblings, 2 replies; 9+ messages in thread
From: Alain Frisch @ 2013-04-24 16:08 UTC (permalink / raw)
  To: John Carr, ygrek; +Cc: caml-list

+1

I've already seen 20% speedup obtained by adding some dead code (in an 
OCaml program, but this is irrelevant).

-- Alain

On 04/24/2013 05:57 PM, John Carr wrote:
> Try changing loop alignment by editing assembly code.  The address of
> ack is different in the different versions.  Modern Intel processors are
> sensitive to code alignment.  There is a limit on the number of branch
> prediction table entries per cache line.  An instruction that crosses a
> cache line boundary may be slower than an instruction within a cache
> line.  I am not surprised to see a 20% difference caused by an apparently
> irrelevant code change.
>
>> Moreover, the generated assembly code for the main loop is the same, afaics. The only
>> difference is the initialization of structure fields and the initial call to ack. Please can anybody
>> explain the performance difference? I understand that microbenchmarks are no way the basis to draw
>> performance conclusions upon, but I cannot explain these results to myself in any meaninful way.
>> Please help! :)
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 16:08   ` Alain Frisch
@ 2013-04-24 16:57     ` Anthony Tavener
  2013-04-24 17:26     ` rixed
  1 sibling, 0 replies; 9+ messages in thread
From: Anthony Tavener @ 2013-04-24 16:57 UTC (permalink / raw)
  To: Alain Frisch; +Cc: John Carr, ygrek, caml-list

[-- Attachment #1: Type: text/plain, Size: 1328 bytes --]

Ackermann function! That's a familiar beast, but I only thought I knew it
by name (heard of it, never "saw" it). Now that I see it, it was part of a
puzzle that I extracted from asm, translated into OCaml, converted to a
recursive loop without global state... and ended up with almost exactly
that definition. I tried optimizing it for a day... and my notes look like
the Wikipedia page, now that I know what to look for. The challenge
involved computing A(4,1,p), discovering 'p' which results in '6'
(word-wrapped on a 15-bit virtual machine). Eventually I gave up on
reducing or optimizing and memoized... churning through over 20k values of
'p' in less than an hour to find my answer.

I get slightly different, but very consistent, timing (OCaml 4.00.0). In
particular, the timing of ack1 and ack3 are "toggled" compared to yours:

ack1.ml
0:04.89

ack2.ml
0:04.89

ack3.ml
0:03.98

ack4.ml
0:03.98

ack5.ml
0:05.23

I'd also expect this kind of result due to alignment.

And for fun I added ack5, which is the non-match implementation I ended up
deriving from some disassembly:

let test r7 =
  let rec loop x y =
    if x = 0 then y+1
    else if y = 0 then loop (x-1) r7
    else loop (x-1) (loop x (y-1))
  in loop 4 1
let _ = test 1

I didn't compare asm outputs, but looks like this version adds a small
"something".

[-- Attachment #2: Type: text/html, Size: 2034 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 16:08   ` Alain Frisch
  2013-04-24 16:57     ` Anthony Tavener
@ 2013-04-24 17:26     ` rixed
  2013-04-24 17:31       ` Török Edwin
  1 sibling, 1 reply; 9+ messages in thread
From: rixed @ 2013-04-24 17:26 UTC (permalink / raw)
  To: caml-list

-[ Wed, Apr 24, 2013 at 06:08:00PM +0200, Alain Frisch ]----
> +1
> 
> I've already seen 20% speedup obtained by adding some dead code (in
> an OCaml program, but this is irrelevant).

Remember me of a paper I read some years ago that was measuring the effect of
the various optimisation levels of gcc against the effect of addresses choices
(randomised using environment strings of various lengths!), and which
conclusion was that the effect of -O3 compared to -O2 was less than the effect
of "choosing" a good environment string :-)

Couldn't find it again using Google ; maybe someone remember this paper?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 17:26     ` rixed
@ 2013-04-24 17:31       ` Török Edwin
  0 siblings, 0 replies; 9+ messages in thread
From: Török Edwin @ 2013-04-24 17:31 UTC (permalink / raw)
  To: caml-list

On 04/24/2013 08:26 PM, rixed@happyleptic.org wrote:
> -[ Wed, Apr 24, 2013 at 06:08:00PM +0200, Alain Frisch ]----
>> +1
>>
>> I've already seen 20% speedup obtained by adding some dead code (in
>> an OCaml program, but this is irrelevant).
> 
> Remember me of a paper I read some years ago that was measuring the effect of
> the various optimisation levels of gcc against the effect of addresses choices
> (randomised using environment strings of various lengths!), and which
> conclusion was that the effect of -O3 compared to -O2 was less than the effect
> of "choosing" a good environment string :-)
> 
> Couldn't find it again using Google ; maybe someone remember this paper?
> 
> 

http://plasma.cs.umass.edu/emery/stabilizer

--Edwin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 15:57 ` John Carr
  2013-04-24 16:08   ` Alain Frisch
@ 2013-04-24 17:35   ` Matteo Frigo
  2013-04-26  3:31   ` ygrek
  2 siblings, 0 replies; 9+ messages in thread
From: Matteo Frigo @ 2013-04-24 17:35 UTC (permalink / raw)
  To: John Carr; +Cc: ygrek, caml-list

John Carr <jfc@MIT.EDU> writes:

> Try changing loop alignment by editing assembly code.  The address of
> ack is different in the different versions.  Modern Intel processors are
> sensitive to code alignment.  There is a limit on the number of branch
> prediction table entries per cache line.  An instruction that crosses a
> cache line boundary may be slower than an instruction within a cache
> line.  I am not surprised to see a 20% difference caused by an apparently
> irrelevant code change.

See also

    http://people.csail.mit.edu/nkushman/papers/ms-thesis.pdf

for a detailed analysis of a related problem on older (1998) hardware.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 10:35 [Caml-list] ackermann microbenchmark strange results ygrek
  2013-04-24 15:57 ` John Carr
@ 2013-04-24 17:40 ` Xavier Leroy
  1 sibling, 0 replies; 9+ messages in thread
From: Xavier Leroy @ 2013-04-24 17:40 UTC (permalink / raw)
  To: caml-list

On 24/04/13 12:35, ygrek wrote:

> Running `make bench` here consistently gives the following (ack1,
  ack3 - tuples, ack2, ack4 - curried) : [...]
> Moreover, the generated assembly code for the main loop is the same,
> afaics. The only difference is the initialization of structure
> fields and the initial call to ack. Please can anybody explain the
> performance difference?

As others said, probably code placement.  A good tool to explore these
issues under Linux is "perf", in particular "perf stats", which lets
you count various hardware events.  Running it on your 4 programs,
you'll see that the number of instructions executed is almost exactly
the same, but the branch mispredictions (for instance) vary
significantly, in a way correlated with the overall execution time.

(Note that this doesn't have anything to do with OCaml: you'd observe
crazy variations like these with any compiled language.)

> I understand that microbenchmarks are no way the basis to draw
> performance conclusions upon, but I cannot explain these results to
> myself in any meaninful way.

Nobody can except perhaps a few engineers at Intel or AMD who have the
whole microarchitecture of their processors imprinted in their heads.
(And then they will not tell you.)  See, modern microarchitectures
contain so many clever heuristics that performance is generally very
good but essentially unpredictable...

rixed@happyleptic.org adds:

> Remember me of a paper I read some years ago that was measuring the
> effect of the various optimisation levels of gcc against the effect
> of addresses choices (randomised using environment strings of
> various lengths!), and which conclusion was that the effect of -O3
> compared to -O2 was less than the effect of "choosing" a good
> environment string :-) Couldn't find it again using Google ; maybe
> someone remember this paper?

It's probably "Producing Wrong Data Without Doing Anything Obviously
Wrong!" by Diwan et al, ASPLOS 2009,
http://www-plan.cs.colorado.edu/diwan/asplos09.pdf
This paper made quite a splash in the compiler community...

- Xavier Leroy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ackermann microbenchmark strange results
  2013-04-24 15:57 ` John Carr
  2013-04-24 16:08   ` Alain Frisch
  2013-04-24 17:35   ` Matteo Frigo
@ 2013-04-26  3:31   ` ygrek
  2 siblings, 0 replies; 9+ messages in thread
From: ygrek @ 2013-04-26  3:31 UTC (permalink / raw)
  To: caml-list

On Wed, 24 Apr 2013 11:57:29 -0400
John Carr <jfc@MIT.EDU> wrote:

> 
> Try changing loop alignment by editing assembly code.  The address of
> ack is different in the different versions.  Modern Intel processors are
> sensitive to code alignment.  There is a limit on the number of branch
> prediction table entries per cache line.  An instruction that crosses a
> cache line boundary may be slower than an instruction within a cache
> line.  I am not surprised to see a 20% difference caused by an apparently
> irrelevant code change.

I was aware of the data alignment influences, but didn't think of
code alignment playing such a role. Got some interesting links to read :)
Thanks you!

-- 
 ygrek
 http://ygrek.org.ua

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-04-26  3:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-24 10:35 [Caml-list] ackermann microbenchmark strange results ygrek
2013-04-24 15:57 ` John Carr
2013-04-24 16:08   ` Alain Frisch
2013-04-24 16:57     ` Anthony Tavener
2013-04-24 17:26     ` rixed
2013-04-24 17:31       ` Török Edwin
2013-04-24 17:35   ` Matteo Frigo
2013-04-26  3:31   ` ygrek
2013-04-24 17:40 ` Xavier Leroy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).