* [Caml-list] Timing Ocaml
@ 2002-06-10 5:35 Blair Zajac
2002-06-10 6:24 ` Chris Hecker
2002-06-10 15:01 ` Xavier Leroy
0 siblings, 2 replies; 13+ messages in thread
From: Blair Zajac @ 2002-06-10 5:35 UTC (permalink / raw)
To: Caml Mailing List
Reading that the bytecode interpreter for Ocaml runs 2/3 as fast
when compiled with VC 6 compared to gcc, has anybody done any
timing comparisons with VisualStudio.Net, Intel C++ 5.x or
Intel C++ 6.0?
If I were to run these timing tests with these compilers and
with different gcc versions (2.95.3, 3.0.4 and 3.1), which
script or program should I use to get a fair estimate of each
compiler's performance?
Also, in INSTALL, it says
* The GNU C compiler gcc is recommended, as the bytecode
interpreter takes advantage of gcc-specific features to enhance
performance.
What is the nature of these optimizations?
Blair
--
Blair Zajac <blair@orcaware.com>
Web and OS performance plots - http://www.orcaware.com/orca/
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
* Re: [Caml-list] Timing Ocaml
2002-06-10 5:35 [Caml-list] Timing Ocaml Blair Zajac
@ 2002-06-10 6:24 ` Chris Hecker
2002-06-10 12:02 ` Dmitry Bely
2002-06-10 15:01 ` Xavier Leroy
1 sibling, 1 reply; 13+ messages in thread
From: Chris Hecker @ 2002-06-10 6:24 UTC (permalink / raw)
To: Blair Zajac, Caml Mailing List
>* The GNU C compiler gcc is recommended, as the bytecode
> interpreter takes advantage of gcc-specific features to enhance
> performance.
>What is the nature of these optimizations?
GCC lets you take the address of a label. You can see in byterun/interp.c
that it uses a jump table instead of a switch when you're using GCC.
At least, that's what it looks like.
Chris
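To make the label-address idea concrete, here is a minimal sketch of a dispatch loop that uses a table of label addresses under GCC and falls back to a switch elsewhere. The three opcodes, the `run` function and the `NEXT` macro are invented for this illustration and are not taken from byterun/interp.c; the sketch also assumes the bytecode is well formed (only known opcodes, terminated by OP_STOP).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-instruction bytecode, invented for this sketch. */
enum { OP_PUSH1, OP_DOUBLE, OP_STOP };

/* Runs `code` until OP_STOP and returns the accumulator.
   Assumes `code` contains only the opcodes above. */
long run(const unsigned char *code)
{
    const unsigned char *pc = code;
    long accu = 0;
#if defined(__GNUC__)
    /* GCC extension: &&label yields the address of a label and
       `goto *ptr` jumps to it. The table is indexed by opcode. */
    static void *const jumptable[] = { &&do_push1, &&do_double, &&do_stop };
#define NEXT goto *jumptable[*pc++]
    assert(code != NULL);
    NEXT;
do_push1:  accu += 1; NEXT;
do_double: accu *= 2; NEXT;
do_stop:   return accu;
#undef NEXT
#else
    /* Portable fallback: a switch inside a loop. */
    assert(code != NULL);
    for (;;) {
        switch (*pc++) {
        case OP_PUSH1:  accu += 1; break;
        case OP_DOUBLE: accu *= 2; break;
        default:        return accu;   /* OP_STOP */
        }
    }
#endif
}
```

With the GCC branch, each handler ends with its own indirect jump instead of returning to a shared dispatch point at the top of the loop.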
* Re: [Caml-list] Timing Ocaml
2002-06-10 6:24 ` Chris Hecker
@ 2002-06-10 12:02 ` Dmitry Bely
2002-06-10 12:50 ` Remi VANICAT
0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Bely @ 2002-06-10 12:02 UTC (permalink / raw)
To: caml-list
Chris Hecker <checker@d6.com> writes:
>>* The GNU C compiler gcc is recommended, as the bytecode
>> interpreter takes advantage of gcc-specific features to enhance
>> performance.
>>What is the nature of these optimizations?
>
> GCC lets you take the address of a label. You can see in
> byterun/interp.c that it uses a jump table instead of a switch when
> you're using GCC.
>
> At least, that's what it looks like.
I would rather say that gcc lets you force register allocation for
specific variables, while MSVC always ignores the "register" specifier.
#if defined(__GNUC__) && !defined(DEBUG)
[...]
#ifdef __i386__
#define PC_REG asm("%esi")
#define SP_REG asm("%edi")
#define ACCU_REG
#endif
[...]
#endif
/* The interpreter itself */
value interprete(code_t prog, asize_t prog_size)
{
#ifdef PC_REG
register code_t pc PC_REG;
register value * sp SP_REG;
register value accu ACCU_REG;
#else
register code_t pc;
register value * sp;
register value accu;
#endif
At the same time, MSVC has a very good optimizer, so it is very strange that
two explicit register variables lead to a 30% performance gain...
Hope to hear from you soon,
Dmitry
* Re: [Caml-list] Timing Ocaml
2002-06-10 12:02 ` Dmitry Bely
@ 2002-06-10 12:50 ` Remi VANICAT
2002-06-10 14:19 ` Lionel Fourquaux
0 siblings, 1 reply; 13+ messages in thread
From: Remi VANICAT @ 2002-06-10 12:50 UTC (permalink / raw)
To: caml-list
Dmitry Bely <dbely@mail.ru> writes:
> Chris Hecker <checker@d6.com> writes:
>
> >>* The GNU C compiler gcc is recommended, as the bytecode
> >> interpreter takes advantage of gcc-specific features to enhance
> >> performance.
> >>What is the nature of these optimizations?
> >
> > GCC lets you take the address of a label. You can see in
> > byterun/interp.c that it uses a jump table instead of a switch when
> > you're using GCC.
> >
> > At least, that's what it looks like.
>
> I would rather say that gcc lets you force register allocation for
> specific variables, while MSVC always ignores the "register" specifier.
>
> #if defined(__GNUC__) && !defined(DEBUG)
> [...]
> #ifdef __i386__
> #define PC_REG asm("%esi")
> #define SP_REG asm("%edi")
> #define ACCU_REG
> #endif
> [...]
> #endif
Well, it seems that threaded code also depends on being compiled with
gcc:
#if defined(__GNUC__) && __GNUC__ >= 2 && !defined(DEBUG) && !defined (SHRINKED_GNUC)
#define THREADED_CODE
#endif
So both the register assignment and the threaded code can yield a large
speedup.
--
Rémi Vanicat
vanicat@labri.u-bordeaux.fr
http://dept-info.labri.u-bordeaux.fr/~vanicat
* RE: [Caml-list] Timing Ocaml
2002-06-10 12:50 ` Remi VANICAT
@ 2002-06-10 14:19 ` Lionel Fourquaux
0 siblings, 0 replies; 13+ messages in thread
From: Lionel Fourquaux @ 2002-06-10 14:19 UTC (permalink / raw)
To: caml-list
> From: owner-caml-list@pauillac.inria.fr [mailto:owner-caml-list@pauillac.inria.fr] On Behalf Of Remi VANICAT
> Sent: Monday, June 10, 2002 2:50 PM
> To: caml-list@inria.fr
> Subject: Re: [Caml-list] Timing Ocaml
>
> Dmitry Bely <dbely@mail.ru> writes:
>
> > Chris Hecker <checker@d6.com> writes:
> >
> > >>* The GNU C compiler gcc is recommended, as the bytecode
> > >> interpreter takes advantage of gcc-specific features to enhance
> > >> performance.
> > >>What is the nature of these optimizations?
> > >
> > > GCC lets you take the address of a label. You can see in
> > > byterun/interp.c that it uses a jump table instead of a switch when
> > > you're using GCC.
> > >
> > > At least, that's what it looks like.
> >
> > I would rather say that gcc lets you force register allocation for
> > specific variables, while MSVC always ignores the "register" specifier.
No, that's not the problem. MSVC is usually very good at
register allocation.
> >
> > #if defined(__GNUC__) && !defined(DEBUG)
> > [...]
> > #ifdef __i386__
> > #define PC_REG asm("%esi")
> > #define SP_REG asm("%edi")
> > #define ACCU_REG
> > #endif
> > [...]
> > #endif
>
> Well, it seems that threaded code also depends on being compiled with
> gcc:
>
> #if defined(__GNUC__) && __GNUC__ >= 2 && !defined(DEBUG) && !defined (SHRINKED_GNUC)
> #define THREADED_CODE
> #endif
>
> So both the register assignment and the threaded code can yield a large
> speedup.
If you look at the generated code, you can see that MSVC uses
registers very efficiently, and that the difference comes only from the
threaded code. Without it, the MSVC build is forced to do two nearly
successive jumps per instruction, and I think that this causes pipeline
problems on modern processors.
If you check that the bytecode ops are valid before execution,
and if you use __assume(0) as the default case, you can gain about 10%
in execution speed, but the two successive jumps are still there.
I don't know what MSVC 7 does, but I'd be interested to find out.
--
Lionel Fourquaux
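The validate-then-`__assume(0)` idea can be sketched as follows. This is only an illustration under stated assumptions: the opcodes, `validate`, `run_checked` and the `UNREACHABLE` macro are hypothetical names, not code from the OCaml runtime. `__assume(0)` is the MSVC intrinsic mentioned above; `__builtin_unreachable()` is used here as the GCC counterpart. Whether a given compiler then drops the range check in front of the jump table is up to its optimizer.

```c
#include <assert.h>
#include <stddef.h>

/* Tell the optimizer that a point in the code can never be reached. */
#if defined(_MSC_VER)
#define UNREACHABLE() __assume(0)
#elif defined(__GNUC__)
#define UNREACHABLE() __builtin_unreachable()
#else
#define UNREACHABLE() ((void)0)
#endif

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_INC, OP_DEC, OP_HALT, OP_COUNT };

/* Validation pass: every opcode is known and the program ends with
   OP_HALT, so straight-line execution always terminates in bounds. */
static int validate(const unsigned char *code, size_t len)
{
    size_t i;
    if (len == 0 || code[len - 1] != OP_HALT)
        return 0;
    for (i = 0; i < len; i++)
        if (code[i] >= OP_COUNT)
            return 0;
    return 1;
}

/* Returns 1 and stores the accumulator in *result, or returns 0 if
   the program is rejected by validate(). */
int run_checked(const unsigned char *code, size_t len, long *result)
{
    const unsigned char *pc = code;
    long accu = 0;

    assert(result != NULL);
    if (!validate(code, len))
        return 0;
    for (;;) {
        switch (*pc++) {
        case OP_INC:  accu++; break;
        case OP_DEC:  accu--; break;
        case OP_HALT: *result = accu; return 1;
        default:      UNREACHABLE();   /* excluded by validate() */
        }
    }
}
```

The validation pass is what makes the unreachable hint safe: without it, an unknown opcode would lead to undefined behaviour.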
* Re: [Caml-list] Timing Ocaml
2002-06-10 5:35 [Caml-list] Timing Ocaml Blair Zajac
2002-06-10 6:24 ` Chris Hecker
@ 2002-06-10 15:01 ` Xavier Leroy
2002-06-10 16:29 ` Dmitry Bely
2002-06-10 18:19 ` Blair Zajac
1 sibling, 2 replies; 13+ messages in thread
From: Xavier Leroy @ 2002-06-10 15:01 UTC (permalink / raw)
To: Blair Zajac; +Cc: Caml Mailing List
> Reading that the bytecode interpreter for Ocaml runs 2/3 as fast
> when compiled with VC 6 compared to gcc, has anybody done any
> timing comparisons with VisualStudio.Net, Intel C++ 5.x or
> Intel C++ 6.0?
As others mentioned, the reason why gcc does a better job on the Caml
bytecode interpreter is not that gcc generates better code all by
itself (it doesn't), but that it supports "computed gotos" as a C
language extension. The bytecode interpreter takes advantage of this
feature by replacing opcodes with the addresses of the code fragments that
execute them, saving a significant amount of time in the bytecode
interpretation loop.
Microsoft's C compilers don't support this extension, and I doubt
Intel's compilers do, at least under Windows. (Although I seem to
remember that Intel's compiler for Linux implements gcc extensions.)
Someone else mentioned the explicit register declarations in the
bytecode interpreter. This is another gcc-specific extension, but
actually the bytecode interpreter uses them to work around the poor
register allocation performed by gcc (it fails to guess correctly
which local variables of the bytecode interpreter are most critical
and should end up in registers). So, it's really a gcc feature used
to work around a gcc deficiency :-) Other C compilers might actually
get the registers right by themselves.
- Xavier Leroy
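The step of "replacing opcodes with the addresses of the code fragments that execute them" can be illustrated with a small sketch. It is not the code from interp.c: the opcodes, `MAX_CODE` and `run_threaded` are made up for the example, and the program is assumed to be well formed (each ADD or MUL is followed by one operand, and the last instruction is OP_STOP). Under GCC, a one-time pass rewrites the program into an array of label addresses, after which dispatch is a single `goto **pc++` with no table lookup and no range check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; OP_ADD and OP_MUL take one inline operand. */
enum { OP_ADD, OP_MUL, OP_STOP };
enum { MAX_CODE = 64 };

/* Runs prog[0..n-1] and returns the accumulator. */
long run_threaded(const intptr_t *prog, size_t n)
{
    long accu = 0;
#if defined(__GNUC__)
    static void *const labels[] = { &&op_add, &&op_mul, &&op_stop };
    void *code[MAX_CODE];
    void **pc = code;
    intptr_t last_op = -1;
    size_t i;

    assert(n <= MAX_CODE);
    /* Threading pass: replace each opcode by the address of its
       handler; operands are copied through unchanged. */
    for (i = 0; i < n; i++) {
        last_op = prog[i];
        assert(last_op >= OP_ADD && last_op <= OP_STOP);
        code[i] = labels[last_op];
        if (last_op != OP_STOP) {
            assert(i + 1 < n);
            code[i + 1] = (void *)prog[i + 1];
            i++;
        }
    }
    assert(last_op == OP_STOP);

    goto **pc++;                      /* jump straight to the first handler */
op_add:  accu += (intptr_t)*pc++; goto **pc++;
op_mul:  accu *= (intptr_t)*pc++; goto **pc++;
op_stop: return accu;
#else
    /* Without computed gotos: ordinary switch-based dispatch. */
    const intptr_t *pc = prog;
    (void)n;
    for (;;) {
        switch (*pc++) {
        case OP_ADD: accu += *pc++; break;
        case OP_MUL: accu *= *pc++; break;
        default:     return accu;     /* OP_STOP */
        }
    }
#endif
}
```

For instance, the program `{ OP_ADD, 3, OP_MUL, 4, OP_ADD, 2, OP_STOP }` computes (0 + 3) * 4 + 2. The threading pass costs one walk over the program, which is paid once and not on every executed instruction.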
* Re: [Caml-list] Timing Ocaml
2002-06-10 15:01 ` Xavier Leroy
@ 2002-06-10 16:29 ` Dmitry Bely
2002-06-10 16:49 ` Lionel Fourquaux
2002-06-10 18:19 ` Blair Zajac
1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Bely @ 2002-06-10 16:29 UTC (permalink / raw)
To: caml-list
Xavier Leroy <xavier.leroy@inria.fr> writes:
>> Reading that the bytecode interpreter for Ocaml runs 2/3 as fast
>> when compiled with VC 6 compared to gcc, has anybody done any
>> timing comparisons with VisualStudio.Net, Intel C++ 5.x or
>> Intel C++ 6.0?
>
> As others mentioned, the reason why gcc does a better job on the Caml
> bytecode interpreter is not that gcc generates better code all by
> itself (it doesn't), but that it supports "computed gotos" as a C
> language extension. The bytecode interpreter takes advantage of this
> feature by replacing opcodes with the addresses of the code fragments that
> execute them, saving a significant amount of time in the bytecode
> interpretation loop.
>
> Microsoft's C compilers don't support this extension, and I doubt
> Intel's compilers do, at least under Windows. (Although I seem to
> remember that Intel's compiler for Linux implements gcc extensions.)
Thanks a lot for the explanation. But then why not use inline asm for
MSVC, something like this:
#if defined(__GNUC__) && __GNUC__ >= 2
#define indirect_goto(addr) goto *(addr)
#elif defined(_MSC_VER)
#define indirect_goto(addr) \
  { void* a = addr; __asm jmp dword ptr a; }
#endif
Hope to hear from you soon,
Dmitry
* RE: [Caml-list] Timing Ocaml
2002-06-10 16:29 ` Dmitry Bely
@ 2002-06-10 16:49 ` Lionel Fourquaux
2002-06-11 8:28 ` Dmitry Bely
0 siblings, 1 reply; 13+ messages in thread
From: Lionel Fourquaux @ 2002-06-10 16:49 UTC (permalink / raw)
To: 'Dmitry Bely', caml-list
> From: owner-caml-list@pauillac.inria.fr [mailto:owner-caml-list@pauillac.inria.fr] On Behalf Of Dmitry Bely
> Sent: Monday, June 10, 2002 6:30 PM
> To: caml-list@inria.fr
> Subject: Re: [Caml-list] Timing Ocaml
>
> Thanks a lot for the explanation. But then why not use inline asm for
> MSVC, something like this:
Because any fragment of inline asm disables a lot of
optimisations in MSVC, and you end up with a much slower interpreter.
* Re: [Caml-list] Timing Ocaml
2002-06-10 16:49 ` Lionel Fourquaux
@ 2002-06-11 8:28 ` Dmitry Bely
2002-06-11 9:08 ` Xavier Leroy
2002-06-11 12:52 ` Mattias Waldau
0 siblings, 2 replies; 13+ messages in thread
From: Dmitry Bely @ 2002-06-11 8:28 UTC (permalink / raw)
To: caml-list
"Lionel Fourquaux" <lionel.fourquaux@wanadoo.fr> writes:
>> Thanks a lot for the explanation. But then why not use inline asm for
>> MSVC, something like this:
>
> Because any fragment of inline asm disables a lot of
> optimisations in MSVC, and you end up with a much slower interpreter.
I see... But there is another solution: use a C switch() statement in the
interp main loop, which the MSVC optimizer translates to a jump table (I don't
know if gcc is capable of doing this). A small example:
int f( int i )
{
    int j = 0;
    switch( i ){
    case 1: j = 2; break;
    case 2: j = 4; break;
    case 3: j = 8; break;
    case 4: j = 16; break;
    }
    return j;
}
cl -c -Ox -Fatest.lst test.c
TITLE test.c
.386P
include listing.inc
if @Version gt 510
.model FLAT
else
_TEXT SEGMENT PARA USE32 PUBLIC 'CODE'
_TEXT ENDS
_DATA SEGMENT DWORD USE32 PUBLIC 'DATA'
_DATA ENDS
CONST SEGMENT DWORD USE32 PUBLIC 'CONST'
CONST ENDS
_BSS SEGMENT DWORD USE32 PUBLIC 'BSS'
_BSS ENDS
_TLS SEGMENT DWORD USE32 PUBLIC 'TLS'
_TLS ENDS
FLAT GROUP _DATA, CONST, _BSS
ASSUME CS: FLAT, DS: FLAT, SS: FLAT
endif
PUBLIC _f
_TEXT SEGMENT
_i$ = 8
_f PROC NEAR
; File test.c
; Line 4
mov ecx, DWORD PTR _i$[esp-4]
xor eax, eax
dec ecx
cmp ecx, 3
ja SHORT $L526
jmp DWORD PTR $L536[ecx*4]
$L529:
; Line 5
mov eax, 2
; Line 11
ret 0
$L530:
; Line 6
mov eax, 4
; Line 11
ret 0
$L531:
; Line 7
mov eax, 8
; Line 11
ret 0
$L532:
; Line 8
mov eax, 16 ; 00000010H
$L526:
; Line 11
ret 0
npad 1
$L536:
DD $L529
DD $L530
DD $L531
DD $L532
_f ENDP
_TEXT ENDS
END
Hope to hear from you soon,
Dmitry
* Re: [Caml-list] Timing Ocaml
2002-06-11 8:28 ` Dmitry Bely
@ 2002-06-11 9:08 ` Xavier Leroy
2002-06-11 12:52 ` Mattias Waldau
1 sibling, 0 replies; 13+ messages in thread
From: Xavier Leroy @ 2002-06-11 9:08 UTC (permalink / raw)
To: Dmitry Bely; +Cc: caml-list
> I see... But there is another solution: use a C switch() statement in the
> interp main loop, which the MSVC optimizer translates to a jump table (I don't
> know if gcc is capable of doing this).
Dmitry, don't be naive: of course the bytecode interpreter loop uses
switch() if computed gotos are not available, and of course any C
compiler translates this switch() to a jump table. But the jump table
is still significantly slower than the computed goto trick, since it
involves one extra compare-and-branch and one extra memory load.
This discussion ("efficient bytecode interpreters") is getting
off-topic for caml-list, so please let's stop here. If you're still
curious, the best way to understand the issues at hand is to stare at
the assembly code generated for interp.c :-)
- Xavier Leroy
* RE: [Caml-list] Timing Ocaml
2002-06-11 8:28 ` Dmitry Bely
2002-06-11 9:08 ` Xavier Leroy
@ 2002-06-11 12:52 ` Mattias Waldau
1 sibling, 0 replies; 13+ messages in thread
From: Mattias Waldau @ 2002-06-11 12:52 UTC (permalink / raw)
To: caml-list
SICStus Prolog had the same problem as O'Caml with VC++. They solved it
by first running VC++ to generate assembly code. Then a small Perl script
rearranges the code, and finally they assemble the result using MASM.
This improved performance by 20-30%, and in some cases by 100%
(for very simple bytecode instructions where the switch overhead
is relatively larger, for example the famous naïve reverse).
/mattias
* Re: [Caml-list] Timing Ocaml
2002-06-10 15:01 ` Xavier Leroy
2002-06-10 16:29 ` Dmitry Bely
@ 2002-06-10 18:19 ` Blair Zajac
2002-06-11 9:23 ` Florian Hars
1 sibling, 1 reply; 13+ messages in thread
From: Blair Zajac @ 2002-06-10 18:19 UTC (permalink / raw)
To: Xavier Leroy; +Cc: Caml Mailing List
Xavier Leroy wrote:
>
> > Reading that the bytecode interpreter for Ocaml runs 2/3 as fast
> > when compiled with VC 6 compared to gcc, has anybody done any
> > timing comparisons with VisualStudio.Net, Intel C++ 5.x or
> > Intel C++ 6.0?
>
> As others mentioned, the reason why gcc does a better job on the Caml
> bytecode interpreter is not that gcc generates better code all by
> itself (it doesn't), but that it supports "computed gotos" as a C
> language extension. The bytecode interpreter takes advantage of this
> feature by replacing opcodes with the addresses of the code fragments that
> execute them, saving a significant amount of time in the bytecode
> interpretation loop.
>
> Microsoft's C compilers don't support this extension, and I doubt
> Intel's compilers do, at least under Windows. (Although I seem to
> remember that Intel's compiler for Linux implements gcc extensions.)
>
> Someone else mentioned the explicit register declarations in the
> bytecode interpreter. This is another gcc-specific extension, but
> actually the bytecode interpreter uses them to work around the poor
> register allocation performed by gcc (it fails to guess correctly
> which local variables of the bytecode interpreter are most critical
> and should end up in registers). So, it's really a gcc feature used
> to work around a gcc deficiency :-) Other C compilers might actually
> get the registers right by themselves.
Thanks for the info.
And do you recommend a particular program or set of programs to run
to get a general relative performance number for each compiler, or
does it really matter?
Blair
--
Blair Zajac <blair@orcaware.com>
Web and OS performance plots - http://www.orcaware.com/orca/
Thread overview: 13+ messages
2002-06-10 5:35 [Caml-list] Timing Ocaml Blair Zajac
2002-06-10 6:24 ` Chris Hecker
2002-06-10 12:02 ` Dmitry Bely
2002-06-10 12:50 ` Remi VANICAT
2002-06-10 14:19 ` Lionel Fourquaux
2002-06-10 15:01 ` Xavier Leroy
2002-06-10 16:29 ` Dmitry Bely
2002-06-10 16:49 ` Lionel Fourquaux
2002-06-11 8:28 ` Dmitry Bely
2002-06-11 9:08 ` Xavier Leroy
2002-06-11 12:52 ` Mattias Waldau
2002-06-10 18:19 ` Blair Zajac
2002-06-11 9:23 ` Florian Hars