ocaml doesn't need to optimize on amd64??

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* ocaml doesn't need to optimize on amd64??
@ 2008-01-09 14:22 Kuba Ober
  2008-01-09 17:14 ` [Caml-list] " Jon Harrop
       [not found] ` <20080110024359.GA15544@mulga.csse.unimelb.edu.au>
  0 siblings, 2 replies; 9+ messages in thread
From: Kuba Ober @ 2008-01-09 14:22 UTC (permalink / raw)
  To: caml-list

Jon & al,

why do you think that OCaml doesn't need to do certain
optimizations on amd64? Or does it apply only to 64 bit mode?
I run my benchmarks on amd64 (in 32 bit mode) and OCaml is worse
off than gcc.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
  2008-01-09 14:22 ocaml doesn't need to optimize on amd64?? Kuba Ober
@ 2008-01-09 17:14 ` Jon Harrop
  2008-01-10 22:56   ` Jon Harrop
       [not found] ` <20080110024359.GA15544@mulga.csse.unimelb.edu.au>
  1 sibling, 1 reply; 9+ messages in thread
From: Jon Harrop @ 2008-01-09 17:14 UTC (permalink / raw)
  To: caml-list

On Wednesday 09 January 2008 14:22:00 Kuba Ober wrote:
> Jon & al,
>
> why do you think that OCaml doesn't need to do certain
> optimizations on amd64?

OCaml does a (much) better job of code generation on AMD64.

> Or does it apply only to 64 bit mode? 
> I run my benchmarks on amd64 (in 32 bit mode) and OCaml is worse
> off than gcc.

You need to run in 64-bit, of course.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
       [not found] ` <20080110024359.GA15544@mulga.csse.unimelb.edu.au>
@ 2008-01-10 13:56   ` Kuba Ober
       [not found]     ` <47862F53.1050605@janestcapital.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Kuba Ober @ 2008-01-10 13:56 UTC (permalink / raw)
  To: caml-list

On Wednesday 09 January 2008, you wrote:
> On Wed, Jan 09, 2008 at 09:22:00AM -0500, Kuba Ober wrote:
> > Jon & al,
> >
> > why do you think that OCaml doesn't need to do certain
> > optimizations on amd64? Or does it apply only to 64 bit mode?
> > I run my benchmarks on amd64 (in 32 bit mode) and OCaml is worse
> > off than gcc.
>
> Register pressure.  The extra eight registers in AMD64 make a huge
> difference to a lot of code generators.  Of course, to access them
> you need to run in 64 bit mode.

Don't current x32 processors "emulate" extra registers anyway? I don't know 
what's the preferred way of telling the on-chip code dissector that you 
intend the data to say in virtual registers, but it must be something simple, 
like common, fixed memory locations or stack locations accessed in a certain 
way. I'm sure if you google on Intel's site you'll find it. In any event, the 
x32 chips have a notion of "many" virtual registers, it's just the old x32 
opcodes that don't have it. Isn't it so?

Cheers, Kuba

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
       [not found]     ` <47862F53.1050605@janestcapital.com>
@ 2008-01-10 16:36       ` Kuba Ober
  0 siblings, 0 replies; 9+ messages in thread
From: Kuba Ober @ 2008-01-10 16:36 UTC (permalink / raw)
  To: caml-list

On Thursday 10 January 2008, you wrote:
> Kuba Ober wrote:
> >On Wednesday 09 January 2008, you wrote:
> >>On Wed, Jan 09, 2008 at 09:22:00AM -0500, Kuba Ober wrote:
> >>>Jon & al,
> >>>
> >>>why do you think that OCaml doesn't need to do certain
> >>>optimizations on amd64? Or does it apply only to 64 bit mode?
> >>>I run my benchmarks on amd64 (in 32 bit mode) and OCaml is worse
> >>>off than gcc.
> >>
> >>Register pressure.  The extra eight registers in AMD64 make a huge
> >>difference to a lot of code generators.  Of course, to access them
> >>you need to run in 64 bit mode.
> >
> >Don't current x32 processors "emulate" extra registers anyway? I don't
> > know what's the preferred way of telling the on-chip code dissector that
> > you intend the data to say in virtual registers, but it must be something
> > simple, like common, fixed memory locations or stack locations accessed
> > in a certain way. I'm sure if you google on Intel's site you'll find it.
> > In any event, the x32 chips have a notion of "many" virtual registers,
> > it's just the old x32 opcodes that don't have it. Isn't it so?
>
> Short answer: no.
>
> The problem is that the code generated code needs to spill and fill
> registers on 32-bit x86 a lot more.  The extra registers are usefull in
> reordering instructions, and executing things in parallel (at the
> instruction level), but they can't avoid the memory accesses.  Now,
> there are a lot of tricks the CPUs play in order to reduce the costs of
> these memory accesses, but they can't eliminate the memory accesses.
> And that's the problem: you've now increases the memory pressure of the
> CPUs.

If a thread keeps those "virtual registers" in an isolated memory page that's 
accessed by nothing else, the CPU's caching will barely need to access the 
memory, as there won't be pressure from other cores or CPUs in the system to 
actually use that memory for anything. It will only get evicted to RAM if the 
cache pressure is high.

The "spill and fill" problem is really only appropriate in the context of the 
x86 registers actually existing somewhere. As far as I know, they are a 
virtual concept, there's no piece of silicon in modern x86 CPUs that can be 
called "the AX register". The actual opcodes referencing ax, bx, ... 
registers is recompiled on the fly by the CPU into a RISC-like code running 
on many more registers. Sure, the code rewriter has to work harder when 
there's more opcodes to move stuff around, but I think that it's implemented 
to actually recognize certain opcode patterns as "virtual register accesses".

I don't have any hard references on hand, but that's my impression so far. I 
can try and dig up relevant Intel references.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
  2008-01-09 17:14 ` [Caml-list] " Jon Harrop
@ 2008-01-10 22:56   ` Jon Harrop
  2008-01-11  0:42     ` Peng Zang
  2008-01-11 13:14     ` Kuba Ober
  0 siblings, 2 replies; 9+ messages in thread
From: Jon Harrop @ 2008-01-10 22:56 UTC (permalink / raw)
  To: caml-list

On Wednesday 09 January 2008 17:14:52 Jon Harrop wrote:
> On Wednesday 09 January 2008 14:22:00 Kuba Ober wrote:
> > Jon & al,
> >
> > why do you think that OCaml doesn't need to do certain
> > optimizations on amd64?
>
> OCaml does a (much) better job of code generation on AMD64.

I was in a bit of a rush when I wrote that and I'd like to explain what I 
meant in more detail.

OCaml's AMD64 code gen is so good that I have never heard of a reason to drop 
to C for high performance code. That is simply no longer an issue, so we are 
free to stay in the land of safety and high-level concepts which is 
enormously valuable in practice because it saves so much developer time.

In particular, this is more important than having a compiler that implements 
the high-level optimizations we discussed because such a compiler can never 
match the performance of C if its AMD64 code gen is not as good as OCaml's.

This is one of the reasons why I continue to choose OCaml over SML/NJ, MLton, 
GHC6.8, SBCL, Bigloo and almost all other FPL compilers: they have worse 
AMD64 code gens and imposes a significantly lower speed limit upon their 
users.

If you look at the ray tracer benchmark:

  http://www.ffconsultancy.com/languages/ray_tracer/results.html

Three of the five OCaml programs are faster than any SML compiled with MLton 
even though MLton implements lots of high-level optimizations that OCaml does 
not. You can draw two crucial conclusions from this:

1. Even though OCaml lacks some high-level optimizations, you don't have to 
put much effort in before OCaml beats MLton-compiled SML because OCaml's 
AMD64 code gen is so good. In particular, two of those three OCaml 
implementations don't even bother implementing any of the low-level 
optimizations that we've discussed at all!

2. OCaml lets you choose how much you want to optimize your code right up to 
the performance of C but MLton imposes quite a low speed limit: 55% slower 
than the front runners. Once you hit that limit with MLton your only option 
is to drop to C (or OCaml ;-). However, dropping into another language 
imposes its own performance hits and is even likely to undermine the 
compiler's optimization efforts.

So I agree that it would be very nice if OCaml implemented more optimizations 
along these lines but I choose OCaml with its excellent code gen over a more 
optimizing compiler that didn't have such a good code gen every day of the 
week and twice on Sundays.

AFAIK, these high-level optimizations are never likely to get implemented in 
OCaml. My first vote would actually go to a different (and more fundamentally 
important) optimization anyway: arbitrary unboxing. One of the nice things 
about F# and GHC is that you can specify in your type declaration that the 
type is to be stored unboxed. This can have profound performance 
implications.

Even in standalone code, unboxing arrays of complex numbers (which OCaml does 
not do) makes FFTs 5x faster. In the context of FFIs, the performance 
improvement can be much bigger because you can completely avoid the cost of 
copying huge quantities of data (e.g. color/texcoord/normal/position struct 
arrays in vertex buffer objects for OpenGL). In contrast, the overhead of 
lambda abstraction in numerical code is "only" a factor of 2.

PS: Kuba, your C code will most likely run a significantly faster in 64-bit as 
well.
-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
  2008-01-10 22:56   ` Jon Harrop
@ 2008-01-11  0:42     ` Peng Zang
  2008-01-11 13:14     ` Kuba Ober
  1 sibling, 0 replies; 9+ messages in thread
From: Peng Zang @ 2008-01-11  0:42 UTC (permalink / raw)
  To: caml-list; +Cc: Jon Harrop

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey Jon,

I was wondering if you've had any experience with OCaml's speed and quality of 
code generate on Intel 64 platforms.  Eg. a Core 2 Duo or one of the more 
recent Xeons.  Thanks,

Peng

On Thursday 10 January 2008 05:56:42 pm Jon Harrop wrote:
> On Wednesday 09 January 2008 17:14:52 Jon Harrop wrote:
> > On Wednesday 09 January 2008 14:22:00 Kuba Ober wrote:
> > > Jon & al,
> > >
> > > why do you think that OCaml doesn't need to do certain
> > > optimizations on amd64?
> >
> > OCaml does a (much) better job of code generation on AMD64.
>
> I was in a bit of a rush when I wrote that and I'd like to explain what I
> meant in more detail.
>
> OCaml's AMD64 code gen is so good that I have never heard of a reason to
> drop to C for high performance code. That is simply no longer an issue, so
> we are free to stay in the land of safety and high-level concepts which is
> enormously valuable in practice because it saves so much developer time.
>
> In particular, this is more important than having a compiler that
> implements the high-level optimizations we discussed because such a
> compiler can never match the performance of C if its AMD64 code gen is not
> as good as OCaml's.
>
> This is one of the reasons why I continue to choose OCaml over SML/NJ,
> MLton, GHC6.8, SBCL, Bigloo and almost all other FPL compilers: they have
> worse AMD64 code gens and imposes a significantly lower speed limit upon
> their users.
>
> If you look at the ray tracer benchmark:
>
>   http://www.ffconsultancy.com/languages/ray_tracer/results.html
>
> Three of the five OCaml programs are faster than any SML compiled with
> MLton even though MLton implements lots of high-level optimizations that
> OCaml does not. You can draw two crucial conclusions from this:
>
> 1. Even though OCaml lacks some high-level optimizations, you don't have to
> put much effort in before OCaml beats MLton-compiled SML because OCaml's
> AMD64 code gen is so good. In particular, two of those three OCaml
> implementations don't even bother implementing any of the low-level
> optimizations that we've discussed at all!
>
> 2. OCaml lets you choose how much you want to optimize your code right up
> to the performance of C but MLton imposes quite a low speed limit: 55%
> slower than the front runners. Once you hit that limit with MLton your only
> option is to drop to C (or OCaml ;-). However, dropping into another
> language imposes its own performance hits and is even likely to undermine
> the compiler's optimization efforts.
>
> So I agree that it would be very nice if OCaml implemented more
> optimizations along these lines but I choose OCaml with its excellent code
> gen over a more optimizing compiler that didn't have such a good code gen
> every day of the week and twice on Sundays.
>
> AFAIK, these high-level optimizations are never likely to get implemented
> in OCaml. My first vote would actually go to a different (and more
> fundamentally important) optimization anyway: arbitrary unboxing. One of
> the nice things about F# and GHC is that you can specify in your type
> declaration that the type is to be stored unboxed. This can have profound
> performance
> implications.
>
> Even in standalone code, unboxing arrays of complex numbers (which OCaml
> does not do) makes FFTs 5x faster. In the context of FFIs, the performance
> improvement can be much bigger because you can completely avoid the cost of
> copying huge quantities of data (e.g. color/texcoord/normal/position struct
> arrays in vertex buffer objects for OpenGL). In contrast, the overhead of
> lambda abstraction in numerical code is "only" a factor of 2.
>
> PS: Kuba, your C code will most likely run a significantly faster in 64-bit
> as well.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)

iD8DBQFHhrtgfIRcEFL/JewRAqgFAKCm1Px3uMP8nEq5lrIp/vHdbIvSsgCgiDSJ
NpRFs4RKR+W5b0S0n9SIkAo=
=Y81S
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
  2008-01-10 22:56   ` Jon Harrop
  2008-01-11  0:42     ` Peng Zang
@ 2008-01-11 13:14     ` Kuba Ober
  2008-01-12 16:20       ` Will Farr
  1 sibling, 1 reply; 9+ messages in thread
From: Kuba Ober @ 2008-01-11 13:14 UTC (permalink / raw)
  To: caml-list

> Even in standalone code, unboxing arrays of complex numbers (which OCaml
> does not do) makes FFTs 5x faster. In the context of FFIs, the performance
> improvement can be much bigger because you can completely avoid the cost of
> copying huge quantities of data (e.g. color/texcoord/normal/position struct
> arrays in vertex buffer objects for OpenGL). In contrast, the overhead of
> lambda abstraction in numerical code is "only" a factor of 2.
>
> PS: Kuba, your C code will most likely run a significantly faster in 64-bit
> as well.

Yeah, but my area of interest is really embedded realtime stuff, running 
typically on architectures which are quite resource constrained. On some of 
those your typical GC wouldn't even fit in the code memory. And I'm not even 
(most of the time) using dynamic memory allocations. None of my code really 
calls for any sort of boxing -- there's no need for it. All I need is C that 
is more expressive and easier to optimize. No run-time variants, really, all 
types are known and fixed, and data is at fixed locations in memory, or on 
the stack, or occasionally on the heap which is manually managed (C-like).

Of course that pertains to the code that gets generated, because I should be 
able to use abstract concepts while writing the code. If I pass a function to 
a function, it doesn't necessarily mean that the compiler must emit the code 
for the former, and that the latter should actually call (as call machine 
instruction) the former.

And I know that my code does immensely benefit from certain high-level 
optimizations - I have implemented them only because vendor compilers lacked 
them, and I had to resort to assembly, and unfortunately most assemblers suck 
big time :( Once you get used to an assembler with LISP macros, it's hard to 
go back... Self inflicted bait-and-switch.

There's a whole slew of stuff that's impossible to do with current compiler 
architectures as long as you have your typical compile-assemble-link process. 
There have been attempts at optimizing stuff in the linkers, but to me it's 
just so much effort and code duplication - I couldn't imagine implementing 
things that way.

Most linkers can't even cope with fairly basic *assembly* - as soon as you 
have complex operations on relocatable (or out-of-module) symbols, the 
assembler has to essentially ship an AST of the expression so that the linker 
will be able to compute the needed value. I don't know how many linkers will 
actually dig ASTs shipped in object files -- none of the embedded ones I 
tried do. Many vendor assemblers won't even report warnings if you try to do 
that, and when you link you will get an incorrect binary. It's a mess, and 
makes the linker need more and more of compiler's functionality. Which begs 
the question: why have a separate linker?

Early on I have decided to do whole-program compilation and linking is just 
the last stage of code generation. Of course you can pregenerate syntax trees 
from source files, which can save a bunch of time if you were compiling a big 
project, but I didn't even have to use that. My code compiles usually in less 
than a minute, and that's mostly because I haven't really taken time to 
optimize my optimizer: what for? It'd take me weeks sometimes to hand-write 
the assembly and debug it, waiting a minute for a compilation of a small 
10kloc code base isn't a big deal. And it lets me have a very nice and 
understandable 10kloc code base, and a hopefully not much worse compiler. Not 
having function inlining, tail call removal and polymorphism (among others) 
at the source code level is a big deal - that's why I abhor C in general, and 
vendor C implementations even more so.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
  2008-01-11 13:14     ` Kuba Ober
@ 2008-01-12 16:20       ` Will Farr
  2008-01-14 12:16         ` Kuba Ober
  0 siblings, 1 reply; 9+ messages in thread
From: Will Farr @ 2008-01-12 16:20 UTC (permalink / raw)
  To: Kuba Ober; +Cc: caml-list

Kuba,

> Yeah, but my area of interest is really embedded realtime stuff, running
> typically on architectures which are quite resource constrained. On some of
> those your typical GC wouldn't even fit in the code memory. And I'm not even
> (most of the time) using dynamic memory allocations. None of my code really
> calls for any sort of boxing -- there's no need for it. All I need is C that
> is more expressive and easier to optimize. No run-time variants, really, all
> types are known and fixed, and data is at fixed locations in memory, or on
> the stack, or occasionally on the heap which is manually managed (C-like).
>
> Of course that pertains to the code that gets generated, because I should be
> able to use abstract concepts while writing the code. If I pass a function to
> a function, it doesn't necessarily mean that the compiler must emit the code
> for the former, and that the latter should actually call (as call machine
> instruction) the former.

This may be something you have seen before and dismissed, and it's not
really OCaml at all, but have you looked at PreScheme?  It's a scheme
dialect (and, in fact, runs *un-modified* in a scheme interpreter),
plus a compiler that turns it into optimized C of the type you're
talking about.  (For example, one optimization is that all
higher-order procedures are beta-substituted away at compile time.)
It might not really fit your needs, but perhaps there's some ideas you
could steal there, in any case.  (The scheme48 guys used it to write
the VM for scheme48, which sounds to my un-expert ears like it would
have a lot in common with the tasks you're looking at doing.)

Will

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocaml doesn't need to optimize on amd64??
  2008-01-12 16:20       ` Will Farr
@ 2008-01-14 12:16         ` Kuba Ober
  0 siblings, 0 replies; 9+ messages in thread
From: Kuba Ober @ 2008-01-14 12:16 UTC (permalink / raw)
  To: Will Farr; +Cc: caml-list

> > Yeah, but my area of interest is really embedded realtime stuff, running
> > typically on architectures which are quite resource constrained. On some
> > of those your typical GC wouldn't even fit in the code memory. And I'm
> > not even (most of the time) using dynamic memory allocations. None of my
> > code really calls for any sort of boxing -- there's no need for it. All I
> > need is C that is more expressive and easier to optimize. No run-time
> > variants, really, all types are known and fixed, and data is at fixed
> > locations in memory, or on the stack, or occasionally on the heap which
> > is manually managed (C-like).
> >
> > Of course that pertains to the code that gets generated, because I should
> > be able to use abstract concepts while writing the code. If I pass a
> > function to a function, it doesn't necessarily mean that the compiler
> > must emit the code for the former, and that the latter should actually
> > call (as call machine instruction) the former.
>
> This may be something you have seen before and dismissed, and it's not
> really OCaml at all, but have you looked at PreScheme?  It's a scheme
> dialect (and, in fact, runs *un-modified* in a scheme interpreter),
> plus a compiler that turns it into optimized C of the type you're
> talking about.  (For example, one optimization is that all
> higher-order procedures are beta-substituted away at compile time.)
> It might not really fit your needs, but perhaps there's some ideas you
> could steal there, in any case.  (The scheme48 guys used it to write
> the VM for scheme48, which sounds to my un-expert ears like it would
> have a lot in common with the tasks you're looking at doing.)

If it turned it into C-- it'd be even better, because I can't count much on C 
compilers for my platforms either: Zilog Z8 compiler and assembler/linker has 
bugs which produce wrong code, and for SX28 there's, afaik, only one C 
compiler that isn't quite there yet anyway.

C-- would be easier to generate code from.

Right now my hackish platform runs in LISP, so Scheme wouldn't be so much 
different, but I don't really know how macros are done in Scheme, and I kind 
of depend on them.

Cheers, Kuba


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-01-14 12:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-09 14:22 ocaml doesn't need to optimize on amd64?? Kuba Ober
2008-01-09 17:14 ` [Caml-list] " Jon Harrop
2008-01-10 22:56   ` Jon Harrop
2008-01-11  0:42     ` Peng Zang
2008-01-11 13:14     ` Kuba Ober
2008-01-12 16:20       ` Will Farr
2008-01-14 12:16         ` Kuba Ober
     [not found] ` <20080110024359.GA15544@mulga.csse.unimelb.edu.au>
2008-01-10 13:56   ` Kuba Ober
     [not found]     ` <47862F53.1050605@janestcapital.com>
2008-01-10 16:36       ` Kuba Ober

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).