caml-list - the Caml user's mailing list
* Severe loss of performance due to new signal handling
@ 2006-03-17 18:39 Markus Mottl
  2006-03-17 19:10 ` [Caml-list] " Christophe TROESTLER
  2006-03-20  9:29 ` Xavier Leroy
  0 siblings, 2 replies; 13+ messages in thread
From: Markus Mottl @ 2006-03-17 18:39 UTC
  To: ocaml


Hi,

this report has also been posted to the OCaml bug tracker, but since it is a
surprising observation, it may be good if people on the list know that it
exists without having to search the bug tracker archive.  Maybe some
assembler guru can repeat this result and explain to us what's going on...

----------

It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1 can
lead to a very significant loss of performance (up to several orders of
magnitude!) in code that uses threads and performs I/O (tested on Linux).

The attached file (slow.ml) demonstrates this: it prints a character to
stdout in a for-loop. The uploaded version will take approximately 600ms in
native code to complete this test when redirecting output to /dev/null.  If
you comment out the line containing "module X = Thread" and compile without
thread support, then the test suddenly only takes around 1.5ms, i.e. it runs
400 times faster.

Profiling using oprofile revealed that the function
"caml_process_pending_signals" seems to be responsible for that.  Annotated
assembler output showed that the code was sampled an astonishing number of
times in the instruction "test %eax,%eax" as obviously generated for "if
(async_action != NULL)" in this function. This is really weird, because
everything else seems to be sampled a sensible number of times, but it would
surely explain the timings.

OCaml-3.08.4 does not exhibit any problems of that kind.

----------

Best regards,
Markus

--
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com

[-- Attachment #2: slow.ml --]
[-- Type: application/octet-stream, Size: 195 bytes --]

open Unix
open Printf

module X = Thread

let () =
  let t1 = gettimeofday () in
  for i = 1 to 100000 do
    print_char '.';
  done;
  let t2 = gettimeofday () in
  eprintf "%f\n" (t2 -. t1);




* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-17 18:39 Severe loss of performance due to new signal handling Markus Mottl
@ 2006-03-17 19:10 ` Christophe TROESTLER
  2006-03-20  9:29 ` Xavier Leroy
  1 sibling, 0 replies; 13+ messages in thread
From: Christophe TROESTLER @ 2006-03-17 19:10 UTC
  To: OCaml Mailing List

Hi,

On Fri, 17 Mar 2006, "Markus Mottl" <markus.mottl@gmail.com> wrote:
> 
> Profiling using oprofile revealed that the function
> "caml_process_pending_signals" seems to be responsible for that.

An earlier related thread:
http://caml.inria.fr/pub/ml-archives/caml-list/2006/02/2858f1e4532daae90d5b0762e3fff3cd.en.html

But your code is even more striking!

> OCaml-3.08.4 does not exhibit any problems of that kind.

If somebody who has both OCaml 3.08 and 3.09 on his machine is willing
to spend some time to check whether the same thing happens with the
above mentioned program, that will be appreciated.

Best regards,
ChriS



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-17 18:39 Severe loss of performance due to new signal handling Markus Mottl
  2006-03-17 19:10 ` [Caml-list] " Christophe TROESTLER
@ 2006-03-20  9:29 ` Xavier Leroy
  2006-03-20 10:39   ` Oliver Bandel
                     ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Xavier Leroy @ 2006-03-20  9:29 UTC
  To: Markus Mottl; +Cc: ocaml

 > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
 > can lead to a very significant loss of performance (up to several orders
 > of magnitude!) in code that uses threads and performs I/O (tested on Linux).
 > [...]
 > Maybe some assembler guru can repeat this result and explain to us
 > what's going on...

Short explanation: atomic instructions are dog slow.

Longer explanation:

OCaml 3.09 fixed a number of long-standing bugs in signal handling
that could cause signals to be "lost" (not acted upon).  The fixes,
located mostly in the code that polls for pending signals
(caml_process_pending_signals), rely on an atomic "read-and-clear"
operation, implemented using atomic processor instructions on x86,
x86-64 and PPC.  This makes signal handling correct (no signal can be
lost) but I didn't realize that it has such an impact on performance,
even on a uniprocessor machine.  Thanks for pointing this out.

(To prevent a number of well-meaning but irrelevant posts, keep in
mind that we're using atomic instructions in a single-threaded
program, to get atomicity w.r.t. signals, not w.r.t. concurrent threads.
We don't need the latter kind of atomicity given OCaml's threading model.)
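
The read-and-clear idea can be sketched in portable C11 as follows.  This
is an illustration only, not the code from the OCaml runtime: the names
pending_flag, on_signal, install_handler and poll_once are made up for the
example, and atomic_exchange stands in for the hand-written processor
instructions.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical single pending flag; the handler does nothing but set it. */
atomic_int pending_flag;

void on_signal(int signo)
{
    (void)signo;
    atomic_store(&pending_flag, 1);
}

/* Install on_signal for signo; returns 0 on success, -1 on error. */
int install_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Fetch the flag and reset it to 0 as one indivisible step, so a signal
   arriving during the poll is either seen now or left set for the next
   poll.  Returns nonzero if a signal was pending. */
int poll_once(void)
{
    return atomic_exchange(&pending_flag, 0);
}
```

The atomicity that matters here is between the polling code and a signal
handler interrupting it in the same thread, which is the case described
above.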

Now, you may wonder why the problem appears mainly with threaded
programs.  The reason is that programs linked with the Thread library,
even if they do not create threads, check for signals much more
often, because they enter and leave blocking sections more often.  In
your example, each call to "print_char" needs to lock and unlock the
stdout channel, causing two signal polls each time.

So, it's time to go back to the drawing board.  Fortunately, it
appears that reliable polling of signals is possible without atomic
processor instructions.  Expect a fix in 3.09.2 at the latest, and
probably within a couple of weeks in the CVS.

Regards,

- Xavier Leroy



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20  9:29 ` Xavier Leroy
@ 2006-03-20 10:39   ` Oliver Bandel
  2006-03-20 12:37     ` Gerd Stolpmann
  2006-03-20 16:15   ` Markus Mottl
  2006-03-21  1:33   ` Robert Roessler
  2 siblings, 1 reply; 13+ messages in thread
From: Oliver Bandel @ 2006-03-20 10:39 UTC
  To: caml-list

On Mon, Mar 20, 2006 at 10:29:49AM +0100, Xavier Leroy wrote:
> > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> > can lead to a very significant loss of performance (up to several orders
> > of magnitude!) in code that uses threads and performs I/O (tested on
> > Linux).
> > [...]
> > Maybe some assembler guru can repeat this result and explain to us
> > what's going on...
> 
> Short explanation: atomic instructions are dog slow.
> 
> Longer explanation:
> 
> OCaml 3.09 fixed a number of long-standing bugs in signal handling
> that could cause signals to be "lost" (not acted upon).  The fixes,
[...]
> Now, you may wonder why the problem appears mainly with threaded
> programs.  The reason is that programs linked with the Thread library,
> even if they do not create threads, check for signals much more
> often, because they enter and leave blocking sections more often.  In
> your example, each call to "print_char" needs to lock and unlock the
> stdout channel, causing two signal polls each time.

Is this really necessary? Doing a write to stdout with
locking... if not explicitly wanted?!


> So, it's time to go back to the drawing board.  Fortunately, it
> appears that reliable polling of signals is possible without atomic
> processor instructions.  Expect a fix in 3.09.2 at the latest, and
> probably within a couple of weeks in the CVS.

I'm not clear about what your problem is with lost signals,
but when using signals on Unix/Linux-systems, you can use
UNIX-API, with sigaction/sigprocmask etc. you can do things well,
and with the signal-function which C provides things are bad/worse.
The C-API signal-function signal(3) clears out the signal handler
after a call to it. In the sigaction/sigprocmask/... functions
the handler remains installed.

But if this is what you think about (and how it will be done
on windows or other systems) I don't know, but maybe this is
a hint that matters.

BTW: I saw that in the Unix-module the unix-signalling functions are
     now included... (they were not in older versions of OCaml).

Ciao,
   Oliver



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20 10:39   ` Oliver Bandel
@ 2006-03-20 12:37     ` Gerd Stolpmann
  2006-03-20 13:13       ` Oliver Bandel
  2006-03-20 15:54       ` Xavier Leroy
  0 siblings, 2 replies; 13+ messages in thread
From: Gerd Stolpmann @ 2006-03-20 12:37 UTC
  To: Oliver Bandel; +Cc: caml-list

Am Montag, den 20.03.2006, 11:39 +0100 schrieb Oliver Bandel:
> On Mon, Mar 20, 2006 at 10:29:49AM +0100, Xavier Leroy wrote:
> > > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> > > can lead to a very significant loss of performance (up to several orders
> > > of magnitude!) in code that uses threads and performs I/O (tested on
> > > Linux).
> > > [...]
> > > Maybe some assembler guru can repeat this result and explain to us
> > > what's going on...
> > 
> > Short explanation: atomic instructions are dog slow.
> > 
> > Longer explanation:
> > 
> > OCaml 3.09 fixed a number of long-standing bugs in signal handling
> > that could cause signals to be "lost" (not acted upon).  The fixes,
> 
> I'm not clear about what your problem is with lost signals,
> but when using signals on Unix/Linux-systems, you can use
> UNIX-API, with sigaction/sigprocmask etc. you can do things well,
> and with the signal-function which C provides things are bad/worse.
> The C-API signal-function signal(3) clears out the signal handler
> after a call to it. In the sigaction/sigprocmask/... functions
> the handler remains installed.

The problem is the following: The O'Caml runtime cannot handle signals
immediately because this would break memory management (e.g. imagine a
signal happens when memory has just been allocated but not initialized).
To get around this the signal handler sets just a flag, and the compiler
emits instructions that regularly check this flag at safe points of
execution (i.e. memory is known to be initialised). These instructions
are now atomic in 3.09. In 3.08, you have basically

if "flag is set" then (
  (*)
  "clear flag";
  "call the signal handler function"
)

If another signal happens at (*) it will be lost.

As you mention sigprocmask: Of course, you can block signals before
checking the flag and allow them again after clearing it, but this would
be even _much_ slower than the solution in 3.09, because sigprocmask
needs a context switch to do its work (it is a kernel function).
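
The sigprocmask variant mentioned above would look roughly like the C
sketch below.  It is a hypothetical illustration (flag, on_signal,
install_handler and poll_with_mask are invented names, not runtime code);
its point is that every poll pays for two calls into the kernel, even when
no signal is pending.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical pending flag, set by the handler and cleared by the poll. */
volatile sig_atomic_t flag;

void on_signal(int signo)
{
    (void)signo;
    flag = 1;
}

/* Install on_signal for signo; returns 0 on success, -1 on error. */
int install_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Poll with all signals blocked around the read and the clear.
   Returns 1 if a signal was pending, 0 otherwise. */
int poll_with_mask(void)
{
    sigset_t all, old;
    int was_set;

    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &old);    /* first call into the kernel */
    was_set = flag;
    flag = 0;
    sigprocmask(SIG_SETMASK, &old, NULL);  /* second call into the kernel */
    return was_set;
}
```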

I don't know what Xavier has in mind to solve the problem, but I would
think about reducing the frequency of the atomic check.
This could work as follows:

- Revert the check to the 3.08 solution
- Use the alarm clock timer to regularly call a signal_manager
  function at a certain frequency (i.e. the signal flag is set
  at a certain frequency)
- Only the alarm clock timer signal is left unblocked. The
  other signals are normally blocked.
- In signal_manager, it is checked whether there are other
  pending signals, and if so, their functions are called.

Of course, it is again possible that alarm clock signals are lost, but
this is harmless, because it is a repeatedly emitted signal. The other
signals cannot be lost, but their execution is deferred to the next
alarm clock event.

> But if this is what you think about (and how it will be done
> on windows or other systems) I don't know, but maybe this is
> a hint that matters.
> 
> BTW: I saw that in the Unix-module the unix-signalling functions are
>      now included... (they were not in older versions of OCaml).

They have been included for a long time. New is Thread.sigmask.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20 12:37     ` Gerd Stolpmann
@ 2006-03-20 13:13       ` Oliver Bandel
  2006-03-20 15:54       ` Xavier Leroy
  1 sibling, 0 replies; 13+ messages in thread
From: Oliver Bandel @ 2006-03-20 13:13 UTC
  To: caml-list

On Mon, Mar 20, 2006 at 01:37:39PM +0100, Gerd Stolpmann wrote:
> Am Montag, den 20.03.2006, 11:39 +0100 schrieb Oliver Bandel:
> > On Mon, Mar 20, 2006 at 10:29:49AM +0100, Xavier Leroy wrote:
> > > > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> > > > can lead to a very significant loss of performance (up to several orders
> > > > of magnitude!) in code that uses threads and performs I/O (tested on
> > > > Linux).
> > > > [...]
> > > > Maybe some assembler guru can repeat this result and explain to us
> > > > what's going on...
> > > 
> > > Short explanation: atomic instructions are dog slow.
> > > 
> > > Longer explanation:
> > > 
> > > OCaml 3.09 fixed a number of long-standing bugs in signal handling
> > > that could cause signals to be "lost" (not acted upon).  The fixes,
> > 
> > I'm not clear about what your problem is with lost signals,
> > but when using signals on Unix/Linux-systems, you can use
> > UNIX-API, with sigaction/sigprocmask etc. you can do things well,
> > and with the signal-function which C provides things are bad/worse.
> > The C-API signal-function signal(3) clears out the signal handler
> > after a call to it. In the sigaction/sigprocmask/... functions
> > the handler remains installed.
> 
> The problem is the following: The O'Caml runtime cannot handle signals
> immediately because this would break memory management (e.g. imagine a
> signal happens when memory has just been allocated but not initialized).
> To get around this the signal handler sets just a flag, and the compiler
> emits instructions that regularly check this flag at safe points of
> execution (i.e. memory is known to be initialised). These instructions
> are now atomic in 3.09. In 3.08, you have basically
> 
> if "flag is set" then (
>   (*)
>   "clear flag";
>   "call the signal handler function"
> )
> 
> If another signal happens at (*) it will be lost.

Well, I'm not an OCaml-internals specialist, so I can't say
if this would be necessary...

At first glance it looks like the problem one has when using
signal(3) instead of sigprocmask(), sigaction() and co.



> 
> As you mention sigprocmask: Of course, you can block signals before
> checking the flag and allow them again after clearing it, but this would
> be even _much_ slower than the solution in 3.09, because sigprocmask
> needs a context switch to do its work (it is a kernel function).

Why call such functions often?

You can use sigaction() to handle signals when you want to;
even if signals are blocked, their occurrence will be recorded.
When you want to handle them, you can do so.

It's too long ago to say details here, but if wanted,
I can look for details (not today, but tomorrow I will have some time
to do it).

(The only thing you can't find out with this mechanism is,
 which of the signals came first and which later.)
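
A minimal C sketch of what is meant, assuming a POSIX system: a blocked
signal stays pending and is delivered once the mask is restored.  The
function names (install_usr1, defer_usr1, usr1_is_pending, resume_signals)
and the counter usr1_runs are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Counts how many times the handler actually ran. */
volatile sig_atomic_t usr1_runs;

void on_usr1(int signo)
{
    (void)signo;
    usr1_runs++;
}

/* Install on_usr1 for SIGUSR1; returns 0 on success, -1 on error. */
int install_usr1(void)
{
    struct sigaction sa;
    sa.sa_handler = on_usr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Block SIGUSR1; the previous mask is saved in *old. */
int defer_usr1(sigset_t *old)
{
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &block, old);
}

/* 1 if SIGUSR1 is waiting to be delivered, 0 if not, -1 on error. */
int usr1_is_pending(void)
{
    sigset_t pending;
    if (sigpending(&pending) != 0)
        return -1;
    return sigismember(&pending, SIGUSR1);
}

/* Restore the saved mask; a pending SIGUSR1 is delivered at this point. */
int resume_signals(const sigset_t *old)
{
    return sigprocmask(SIG_SETMASK, old, NULL);
}
```

If SIGUSR1 is raised twice while blocked, the handler runs only once after
resume_signals: ordinary (non-realtime) signals are not queued.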



> 
> I don't know what Xavier has in mind to solve the problem, but I would
> think about reducing the frequency of the atomic check.
> This could work as follows:
> 
> - Revert the check to the 3.08 solution
> - Use the alarm clock timer to regularly call a signal_manager
>   function at a certain frequency (i.e. the signal flag is set
>   at a certain frequency)

Using alarm() is not reliable.


[...]
> > BTW: I saw that in the Unix-module the unix-signalling functions are
> >      now included... (they were not in older versions of OCaml).
> 
> They have been included for a long time. New is Thread.sigmask.

Depends on the definition of "long time" ;-)
When I first came into contact with OCaml, which really is some years ago,
they were not included (I think that was 3.04?).
I didn't look for these functions, and only noticed them while
looking for other things at about 3.08 (?).
So I was astounded. This makes OCaml better suited for
applications in the real world, because C's signal(3) is unreliable.
(When the signal is caught, the handler is deactivated until it is
 re-established - that's the same problem you mentioned above.
 So if a signal comes twice, you lose one. But on the other hand,
 this keeps the system from getting into recursive handler loops, which
 could make it unreliable too. Only with sigprocmask()/sigaction() and so on
 can you do it reliably and cleanly.)

Ciao,
   Oliver



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20 12:37     ` Gerd Stolpmann
  2006-03-20 13:13       ` Oliver Bandel
@ 2006-03-20 15:54       ` Xavier Leroy
  1 sibling, 0 replies; 13+ messages in thread
From: Xavier Leroy @ 2006-03-20 15:54 UTC
  To: Gerd Stolpmann; +Cc: caml-list

 > The problem is the following: [...] In 3.08, you have basically
 >
 > if "flag is set" then (
 >   (*)
 >   "clear flag";
 >   "call the signal handler function"
 > )
 >
 > If another signal happens at (*) it will be lost.

Actually, the problematic code in 3.08 is:

   tmp <- flag;
   (*)
   flag <- 0;
   if (tmp) { process the signal; }

and indeed a signal can be lost (never processed) if it occurs at (*).

The solution I have in mind is to implement exactly the pseudocode you
give above.  If a signal occurs at (*), it is not lost (the signal
handler function will be called just afterwards!), just conflated with
a previous occurrence of that signal, but this is fair game: POSIX
signals have the same behaviour.  (Yes, I'm ignoring the queueing
behaviour of realtime POSIX signals.)

Note however that in 3.09 and in my proposed fix, there is one flag
per signal, which still improves over 3.08 (which had only one shared
flag) and ensures that two occurrences of different signals are not
conflated, again as per POSIX.
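
The difference between the two orderings can be made concrete with a small
C sketch.  This is not runtime code: pending, handled, at_star and the
poll_* functions are hypothetical names, and at_star is a hook that exists
only so that a signal can be raised exactly at the point marked (*).  Both
variants use one flag per signal here, so that only the ordering differs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

#define SKETCH_NSIG 65   /* assumption: large enough for signals 1..64 */

volatile sig_atomic_t pending[SKETCH_NSIG];  /* one flag per signal */
int handled[SKETCH_NSIG];                    /* times each signal was processed */
void (*at_star)(void);                       /* hook called at (*), may be NULL */

void record_signal(int signo) { pending[signo] = 1; }
void process_signal(int signo) { handled[signo]++; }
void deliver_usr1(void) { raise(SIGUSR1); }

/* Install record_signal for signo; returns 0 on success, -1 on error. */
int install_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = record_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* 3.08-style ordering: read, then clear unconditionally.  A signal
   arriving at (*) while tmp is 0 sets the flag, which is then wiped. */
void poll_read_then_clear(int signo)
{
    sig_atomic_t tmp = pending[signo];
    if (at_star) at_star();              /* (*) */
    pending[signo] = 0;
    if (tmp) process_signal(signo);
}

/* Proposed ordering: test, clear, then process.  A signal arriving at (*)
   is merged with the occurrence that is about to be processed. */
void poll_test_then_clear(int signo)
{
    if (pending[signo]) {
        if (at_star) at_star();          /* (*) */
        pending[signo] = 0;
        process_signal(signo);
    }
}
```

With at_star set to deliver_usr1 and no signal pending beforehand,
poll_read_then_clear(SIGUSR1) ends with the flag cleared and nothing
processed.  In poll_test_then_clear the window at (*) is only reached when
an occurrence is already pending, so a second one arriving there is
conflated with it rather than dropped.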

 > I don't know what Xavier has in mind to solve the problem, but I would
 > think about reducing the frequency of the atomic check.

That would be plan C, plan B being making the check even more efficient.
I'd rather not introduce timer signals if at all possible, though,
since these mess up many function calls.

- Xavier Leroy



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20  9:29 ` Xavier Leroy
  2006-03-20 10:39   ` Oliver Bandel
@ 2006-03-20 16:15   ` Markus Mottl
  2006-03-20 16:24     ` Will Farr
  2006-03-21  1:33   ` Robert Roessler
  2 siblings, 1 reply; 13+ messages in thread
From: Markus Mottl @ 2006-03-20 16:15 UTC
  To: Xavier Leroy; +Cc: ocaml

On 3/20/06, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>
> Short explanation: atomic instructions are dog slow.


Thanks for the explanation.  I'd never have guessed that atomic instructions
could be responsible for such a deterioration of performance.

> So, it's time to go back to the drawing board.  Fortunately, it
> appears that reliable polling of signals is possible without atomic
> processor instructions.  Expect a fix in 3.09.2 at the latest, and
> probably within a couple of weeks in the CVS.
>

Great!  Btw., since we are at it, you could also make us really happy by
fixing issue 3906 in the next release, too.  Now that the reason is clear
this should be very straightforward, and would save people who write certain
kinds of threaded code a lot of headaches, because this bug can cause all
sorts of weird problems in long-running applications (freezes, crashes,
execution of random code, etc.), and was extremely hard to trigger and track
down.

Best regards,
Markus

--
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20 16:15   ` Markus Mottl
@ 2006-03-20 16:24     ` Will Farr
  0 siblings, 0 replies; 13+ messages in thread
From: Will Farr @ 2006-03-20 16:24 UTC
  To: ocaml

Hello all,

As an aside, if anyone is interested in techniques for making atomic
transactions fast with low latency, etc, the paper

Atomic heap transactions and fine-grain interrupts by Olin Shivers,
James W. Clark and Roland McGrath:
http://www-static.cc.gatech.edu/~shivers/papers/heap.ps

presents several *neat* hacks to do this efficiently.  I'm sure that
the implementors on the list are already aware of this work, but I
just wanted to point it out as interesting reading for people (like
myself) who think this stuff is neat but don't necessarily have broad
experience with it.

Will



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-20  9:29 ` Xavier Leroy
  2006-03-20 10:39   ` Oliver Bandel
  2006-03-20 16:15   ` Markus Mottl
@ 2006-03-21  1:33   ` Robert Roessler
  2006-03-21  3:11     ` Markus Mottl
  2 siblings, 1 reply; 13+ messages in thread
From: Robert Roessler @ 2006-03-21  1:33 UTC
  To: Caml-list

Xavier Leroy wrote:
>  > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
>  > can lead to a very significant loss of performance (up to several orders
>  > of magnitude!) in code that uses threads and performs I/O (tested on
>  > Linux).
>  > [...]
>  > Maybe some assembler guru can repeat this result and explain to us
>  > what's going on...
> 
> Short explanation: atomic instructions are dog slow.

At the risk of being "irrelevant", I wanted to nail down exactly what 
assertion is being made here: are we talking about directly executing 
in assembly code the relevant x86[-64]/ppc/whatever instructions for 
"read-and-clear", or going through OS-dependent access routines like 
Windows' InterlockedExchange()?

Or: is the source of the dog slow behavior because of OS overhead, or 
is it a low-level issue like memory barriers/cache lines getting 
flushed/something else?
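
One way to look at the first half of this question is to time the two
styles of poll in plain user-space code, with no operating system routine
involved.  The C sketch below is a hypothetical microbenchmark (all names
are invented, and atomic_exchange is used in place of the runtime's
hand-written assembly); the figures depend on processor and compiler, so no
result is claimed here.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

volatile int plain_pending;   /* polled with an ordinary load and store */
atomic_int atomic_pending;    /* polled with an atomic exchange */

/* Test-then-clear poll; returns how many polls found the flag set. */
long poll_plain(long polls)
{
    long hits = 0;
    for (long i = 0; i < polls; i++) {
        if (plain_pending) {
            plain_pending = 0;
            hits++;
        }
    }
    return hits;
}

/* Atomic read-and-clear poll; returns how many polls found the flag set. */
long poll_atomic(long polls)
{
    long hits = 0;
    for (long i = 0; i < polls; i++) {
        if (atomic_exchange(&atomic_pending, 0))
            hits++;
    }
    return hits;
}

/* CPU seconds spent running one of the poll loops. */
double cpu_seconds(long (*poll)(long), long polls)
{
    clock_t start = clock();
    poll(polls);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Print the time taken by each style for the same number of polls. */
void report(long polls)
{
    printf("plain:  %f s\n", cpu_seconds(poll_plain, polls));
    printf("atomic: %f s\n", cpu_seconds(poll_atomic, polls));
}
```

Calling report() with a large count from a main function gives a rough
per-poll comparison on a given machine.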

Robert Roessler
roessler@rftp.com
http://www.rftp.com



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-21  1:33   ` Robert Roessler
@ 2006-03-21  3:11     ` Markus Mottl
  2006-03-21  4:04       ` Brian Hurt
  2006-03-21 12:54       ` Robert Roessler
  0 siblings, 2 replies; 13+ messages in thread
From: Markus Mottl @ 2006-03-21  3:11 UTC
  To: Robert Roessler; +Cc: Caml-list

On 3/20/06, Robert Roessler <roessler@rftp.com> wrote:
>
> At the risk of being "irrelevant", I wanted to nail down exactly what
> assertion is being made here: are we talking about directly executing
> in assembly code the relevant x86[-64]/ppc/whatever instructions for
> "read-and-clear", or going through OS-dependent access routines like
> Windows' InterlockedExchange()?


We are talking of the assembly code.  See file byterun/signals_machdep.h,
which contains the corresponding macros.

Regards,
Markus

--
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-21  3:11     ` Markus Mottl
@ 2006-03-21  4:04       ` Brian Hurt
  2006-03-21 12:54       ` Robert Roessler
  1 sibling, 0 replies; 13+ messages in thread
From: Brian Hurt @ 2006-03-21  4:04 UTC
  To: Markus Mottl; +Cc: Robert Roessler, Caml-list



On Mon, 20 Mar 2006, Markus Mottl wrote:

> On 3/20/06, Robert Roessler <roessler@rftp.com> wrote:
>>
>> At the risk of being "irrelevant", I wanted to nail down exactly what
>> assertion is being made here: are we talking about directly executing
>> in assembly code the relevant x86[-64]/ppc/whatever instructions for
>> "read-and-clear", or going through OS-dependent access routines like
>> Windows' InterlockedExchange()?
>
>
> We are talking of the assembly code.  See file byterun/signals_machdep.h,
> which contains the corresponding macros.

OK, poking around a little bit in byterun, I'm seeing this piece of code:

   for (signal_number = 0; signal_number < NSIG; signal_number++) {
     Read_and_clear(signal_state, caml_pending_signals[signal_number]);
     if (signal_state) caml_execute_signal(signal_number, 0);
   }

with Read_and_clear being defined as:

#if defined(__GNUC__) && defined(__i386__)

#define Read_and_clear(dst,src) \
   asm("xorl %0, %0; xchgl %0, %1" \
       : "=r" (dst), "=m" (src) \
       : "m" (src))


xchgl is the atomic operation (this is always atomic when referencing a 
memory location, regardless of the presence or absence of a lock prefix).

Apropos of nothing, a better definition of that macro would be:

#define Read_and_clear(dst,src) \
    asm volatile ("xchgl	%0, %1" \
        : "=r" (dst), "+m" (src) \
        : "0" (0))

as this gives gcc the choice of how to move 0 into the register (using an
xor will still be a popular choice, but it'll occasionally do a movl
depending upon instruction scheduling choices).

Some more poking around tells me that NSIG is defined on Linux to be 64.

I think the problem is not doing an atomic operation, but doing 64 of 
them.  I'd be inclined to move to a bitset implementation- allowing you 
to replace 64 atomic instructions with 2.

On the x86, you can use the lock bts instruction to set the bit.  Some 
implementation like:

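#include <limits.h>   /* CHAR_BIT */

#if defined(__GNUC__) && defined(__i386__)

     typedef unsigned long sigword_t;

/* Atomically fetch src into dst and reset src to 0. */
#define Read_and_clear(dst,src) \
    asm volatile ("xchgl %0, %1" \
        : "=r" (dst), "+m" (src) \
        : "0" (0))

/* Atomically set bit NR in the bit string starting at sigflags.
   "Ir" keeps immediates within 0..31; larger values go through a
   register, which lets the instruction address bits past the first word. */
#define Set_sigflag(sigflags, NR) \
    asm volatile ("lock; btsl %1, %0" \
        : "+m" (*sigflags) \
        : "Ir" (NR) \
        : "cc")

...

#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))

#define NR_SIGWORDS ((NSIG + SIGWORD_BITS - 1)/SIGWORD_BITS)

   extern sigword_t caml_pending_signals[NR_SIGWORDS];

   for (i = 0; i < NR_SIGWORDS; i++) {
       sigword_t temp;
       int j;

       /* One atomic operation per word instead of one per signal. */
       Read_and_clear(temp, caml_pending_signals[i]);
       for (j = 0; temp != 0; j++) {
           if ((temp & 1ul) != 0) {
               caml_execute_signal((i * SIGWORD_BITS) + j, 0);
           }
           temp >>= 1;
       }
   }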


This is somewhat more code, but i, j, and temp would all end up in 
registers, and it'd be two atomic instructions, not 64.

The x86 assembly code I can dash off from the top of my head.  Similar
bits of assembly can be written for other CPUs- I just have to go dig out
the right books.
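
For CPUs where the assembly is not at hand, the same bitset idea can be
sketched with C11 atomics.  This is only an illustration under assumptions:
SKETCH_NSIG, pending_words, executed, set_sigflag, execute_signal and
process_pending are invented names, atomic_fetch_or and atomic_exchange
replace the x86 instructions above, and execute_signal merely counts where
the real code would run the handler.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NSIG 65   /* assumption: large enough for signals 1..64 */

typedef unsigned long sigword_t;

#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))
#define NR_SIGWORDS ((SKETCH_NSIG + SIGWORD_BITS - 1) / SIGWORD_BITS)

_Atomic sigword_t pending_words[NR_SIGWORDS];  /* one bit per signal */
int executed[SKETCH_NSIG];                     /* stand-in for running handlers */

/* Mark signo as pending; this is what a signal handler would call. */
void set_sigflag(int signo)
{
    atomic_fetch_or(&pending_words[signo / SIGWORD_BITS],
                    (sigword_t)1 << (signo % SIGWORD_BITS));
}

/* Placeholder for the real handler dispatch: just counts executions. */
void execute_signal(int signo)
{
    executed[signo]++;
}

/* Read and clear each word atomically, then walk its set bits. */
void process_pending(void)
{
    for (size_t i = 0; i < NR_SIGWORDS; i++) {
        sigword_t word = atomic_exchange(&pending_words[i], 0);
        for (size_t j = 0; word != 0; j++, word >>= 1) {
            if (word & 1ul)
                execute_signal((int)(i * SIGWORD_BITS + j));
        }
    }
}
```

Setting the same bit twice before a poll results in one execution, which
matches the conflation of repeated signals discussed earlier in the thread.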

Brian



* Re: [Caml-list] Severe loss of performance due to new signal handling
  2006-03-21  3:11     ` Markus Mottl
  2006-03-21  4:04       ` Brian Hurt
@ 2006-03-21 12:54       ` Robert Roessler
  1 sibling, 0 replies; 13+ messages in thread
From: Robert Roessler @ 2006-03-21 12:54 UTC
  To: Caml-list

Markus Mottl wrote:
> On 3/20/06, *Robert Roessler* <roessler@rftp.com> wrote:
> 
>     At the risk of being "irrelevant", I wanted to nail down exactly what
>     assertion is being made here: are we talking about directly executing
>     in assembly code the relevant x86[-64]/ppc/whatever instructions for
>     "read-and-clear", or going through OS-dependent access routines like
>     Windows' InterlockedExchange()?
> 
> 
> We are talking of the assembly code.  See file 
> byterun/signals_machdep.h, which contains the corresponding macros.

Thanks, Markus - in the case you cite (direct instruction use), I was 
hoping for some illumination on this huge cost... reviewing the Intel 
manuals, I note that:

1) there is *no* claim that cache lines are flushed just by doing the xchg

2) in fact, with the Pentium Pro on, the bus LOCK# operation will not 
even happen if the data is cached - everything is left to the cache 
coherency mechanism

3) there *is* mention of processor *cache locking*, but this is still 
just in the context of cache coherency with multiple processors... so 
nothing here is suggesting cache line flushing or anything else that 
sounds horrendously expensive, particularly in the single CPU case

< 8 hours later, back to finish email :) >

Finally, it is interesting that you bring up this file - it appears as 
if the msvc toolchain is no longer supported for doing "correct" (in 
terms of Xavier's "atomicity w.r.t. signals") builds... at least that 
is how I interpret the conditional compilation directives.

Robert Roessler
roessler@rftp.com
http://www.rftp.com


