mailing list of musl libc
* In-progress stuff, week of Sept 1
@ 2014-09-05  1:14 Rich Felker
  2014-09-05  1:57 ` Szabolcs Nagy
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Rich Felker @ 2014-09-05  1:14 UTC (permalink / raw)
  To: musl

We have a lot of open patch and feature requests, some big projects
(C11), and other random stuff still under consideration for this
release cycle, so I'm going to try to summarize what I'm aware of and
what's likely to make it in.

First, things that definitely are going in this release cycle:

- Jens Gustedt's C11 threads: From discussion it's ready, but the last
  iteration of the patches added more gratuitous changes at the same
  time it removed others. Jens is going to be less available for a
  while, so I'm going to finish getting the patches ready to commit,
  but it's going to take a little work going back and reviewing the
  email threads to determine what was the outcome of discussion versus
  what's new, and among what's new, whether it's essential or side
  changes orthogonal to adding C11 threads.

- fgets at EOF: The proposed fix by nsz is right, but I think we
  should also address the issue of the missing lock in the n=1 case.

- dn_expand issues: Potentially serious bugs that need to be fixed
  ASAP. nsz and I are working on them right now.

- nsz's LFS64 fixes: Should be straightforward to review and commit.

- Alexander Monakov's "New static analysis results": These probably
  indicate a few minor bugs which should be easy to fix.

And one more that's iffy:

- Felix Janda's login_tty patch: While on the surface this is a short,
  simple patch, I think more consideration is still needed for what
  happens in the error cases. These legacy functions that lack any
  formal specification, especially for how errors are handled, are
  always a big pain to add.

Finally, here are the non-trivial open items that probably won't make
it into this release:

Jens Gustedt's work on cond var improvements:

Based on our previous discussions, I think the proposed changes are
valid, but I also think they make the code mildly more complex. So
they're probably justified only if we can measure a performance
improvement under at least some usage cases. The ideal way I'd like to
move forward with these is with some tests that could measure the lock
contention from the race between signaling and early waiter exit
(timeout or cancellation).

Alexander Monakov's work on semaphores:

It's all very interesting, but would take considerable further review
to convince me that it's safe, and would also need some benchmarking
to establish whether it's desirable, even if it is safe. I hope
something comes of all the work he's put into it though (either a
positive or negative result about its benefit would be "something",
here), rather than just remaining unevaluated.

Bobby Bingham's work on qsort:

This is also promising, but I've seen mixed claims about performance
relative to the current smoothsort. More data is needed to evaluate
it, I think.

My further fix for cond var wait cancellation:

I have a patch, but it requires setjmp/longjmp and makes typical usage
nontrivially slower. I could go ahead and commit it for now, and fix
the performance regression later, but the clean, free fix depends on
the new cancellation mode I want to add.

My new cancellation mode:

This is actually going to be an almost-trivial patch when I write it,
but I want to have the time to give it the attention it needs, with
the hopes of producing something that can also be a public interface
proposal.

Additional roadmap items not mentioned above, and for which work has
not yet started, are probably going to get pushed back again into the
next release cycle.

Rich


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: In-progress stuff, week of Sept 1
  2014-09-05  1:14 In-progress stuff, week of Sept 1 Rich Felker
@ 2014-09-05  1:57 ` Szabolcs Nagy
  2014-09-05  8:47   ` Rich Felker
  2014-09-05  7:41 ` Jens Gustedt
  2014-09-05  8:49 ` Rich Felker
  2 siblings, 1 reply; 14+ messages in thread
From: Szabolcs Nagy @ 2014-09-05  1:57 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@libc.org> [2014-09-04 21:14:42 -0400]:
> Additional roadmap items not mentioned above, and for which work has
> not yet started, are probably going to get pushed back again into the
> next release cycle.

other issues:

toupper/islower of sharp s

sem_wait/pthread_join cancellation point in non-blocking case

alpine patches: ns_parse apis, __setxid issues



* Re: In-progress stuff, week of Sept 1
  2014-09-05  1:14 In-progress stuff, week of Sept 1 Rich Felker
  2014-09-05  1:57 ` Szabolcs Nagy
@ 2014-09-05  7:41 ` Jens Gustedt
  2014-09-05  8:32   ` Rich Felker
  2014-09-05  8:49 ` Rich Felker
  2 siblings, 1 reply; 14+ messages in thread
From: Jens Gustedt @ 2014-09-05  7:41 UTC (permalink / raw)
  To: musl


Hello,

Am Donnerstag, den 04.09.2014, 21:14 -0400 schrieb Rich Felker:
> First, things that definitely are going in this release cycle:
> 
> - Jens Gustedt's C11 threads: From discussion it's ready, but the last
>   iteration of the patches added more gratuitous changes at the same
>   time it removed others. Jens is going to be less available for a
>   while, so I'm going to finish getting the patches ready to commit,
>   but it's going to take a little work going back and reviewing the
>   email threads to determine what was the outcome of discussion versus
>   what's new, and among what's new, whether it's essential or side
>   changes orthogonal to adding C11 threads.

Sorry if it left that impression. There is only one thing I
remember that is not really sorted out in the last patch series I
sent, the timespec_get things. The other patches should be relatively
clean, no? In particular, the higher numbered patches (I think it was
8 and 9) can just be omitted, if you don't like them.

In any case, just send me requests for diffs or entire files in this
or that version; my git history contains them all.

> Finally, here are the non-trivial open items that probably won't make
> it into this release:
> 
> Jens Gustedt's work on cond var improvements:
> 
> Based on our previous discussions, I think the proposed changes are
> valid, but I also think they make the code mildly more complex. So
> they're probably justified only if we can measure a performance
> improvement under at least some usage cases. The ideal way I'd like to
> move forward with these is with some tests that could measure the lock
> contention from the race between signaling and early waiter exit
> (timeout or cancellation).

I have other points that I delayed for later. Off the top of my head:

 - review atomics

   * This starts by adding some "i" constraints to the assembler. Gcc is
   perfectly capable of optimizing static inline asm that has "ir"
   constraints. In the code a lot of these are used with the constants 0
   or 1; this would shave off the use of a register at all these
   points.

   * Review the necessity of having all of these with "memory". As long
   as the pointer argument is constrained with "m" and not "p", this
   should suffice to encode the data dependency. The only thing that
   we'd have to be careful with is reordering, but I think this could
   be achieved by using `volatile`, instead.

   * Move to gcc's __sync and __atomic builtins. The former have been
   available for a long time, and both are easily testable with macros.
   In the case of __atomic this would make it possible to move the
   atomics used in musl to acquire-release semantics, which may pay off
   for some archs, in particular arm, I think.

   This also should make it easier to port to new architectures in the
   future, as soon as there is a working gcc for it.

 - implement a simple lock feature based on futex that encodes the
   wait counter inside the `int`.

   After the discussion with Rich some weeks ago I gave it a thought,
   and I think this can be simply done by using the parity bit for the
   lock and the other bits for the waiters count. On i386 the parity
   bit (like any other bit) can effectively be accessed atomically with
   "lock ; btsl".

   IIRC, having such a simple lock would also help as a third step of
   the changes to cv that I proposed and that you mention above.

 - __timedwait translates absolute time to relative time, by taking
   the current time with clock_gettime. This could be avoided for
   CLOCK_REALTIME and CLOCK_MONOTONIC when using FUTEX_WAIT_BITSET
   instead of FUTEX_WAIT. That one uses absolute time from the
   start. Since C threads don't leave a choice of the clock they are
   using, this would avoid this code path completely and shave off
   one system call.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: In-progress stuff, week of Sept 1
  2014-09-05  7:41 ` Jens Gustedt
@ 2014-09-05  8:32   ` Rich Felker
  2014-09-05  9:27     ` Jens Gustedt
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2014-09-05  8:32 UTC (permalink / raw)
  To: musl

On Fri, Sep 05, 2014 at 09:41:42AM +0200, Jens Gustedt wrote:
> Hello,
> 
> Am Donnerstag, den 04.09.2014, 21:14 -0400 schrieb Rich Felker:
> > First, things that definitely are going in this release cycle:
> > 
> > - Jens Gustedt's C11 threads: From discussion it's ready, but the last
> >   iteration of the patches added more gratuitous changes at the same
> >   time it removed others. Jens is going to be less available for a
> >   while, so I'm going to finish getting the patches ready to commit,
> >   but it's going to take a little work going back and reviewing the
> >   email threads to determine what was the outcome of discussion versus
> >   what's new, and among what's new, whether it's essential or side
> >   changes orthogonal to adding C11 threads.
> 
> Sorry if it left that impression. There is only one thing I
> remember that is not really sorted out in the last patch series I
> sent, the timespec_get things. The other patches should be relatively
> clean, no? In particular, the higher numbered patches (I think it was
> 8 and 9) can just be omitted, if you don't like them.
> 
> In any case, just send me requests for diffs or entire files in this
> or that version; my git history contains them all.

I think it's mainly the __clock_gettime changes that crept well
outside the scope of doing C11 threads, but I need to look again.
Trying to get some other things committed at the moment.

> I have other points that I delayed for later. Off the top of my head:
> 
>  - review atomics
> 
>    * This starts by adding some "i" constraints to the assembler. Gcc is
>    perfectly capable of optimizing static inline asm that has "ir"
>    constraints. In the code a lot of these are used with the constants 0
>    or 1; this would shave off the use of a register at all these
>    points.

Examples? We already use "ir" for syscalls on some archs, and I had
good results. But I wasn't aware that there were places in the atomic
code where "i" could help.

>    * Review the necessity of having all of these with "memory". As long
>    as the pointer argument is constrained with "m" and not "p", this
>    should suffice to encode the data dependency. The only thing that
>    we'd have to be careful with is reordering, but I think this could
>    be achieved by using `volatile`, instead.

In order for their memory-barrier semantics to be useful, all atomics
also need to be full compiler barriers. Otherwise the compiler could
move loads/stores of data it thinks is unrelated across the asm block,
and thereby across the memory barrier.

>    * Move to gcc's __sync and __atomic builtins. The former have been
>    available for a long time, and both are easily testable with macros.
>    In the case of __atomic this would make it possible to move the
>    atomics used in musl to acquire-release semantics, which may pay off
>    for some archs, in particular arm, I think.
> 
>    This also should make it easier to port to new architectures in the
>    future, as soon as there is a working gcc for it.

This has already been discussed in the past, and it's not a good
option:

- It would severely reduce the set of compilers that can be used to
  build musl.

- It makes it impossible to correctly support archs where different
  atomic strategies might need to be chosen at runtime. This includes
  both arm (see the long threads on what to do about broken grsec
  kernels without the kuser helper page) and sh, and maybe more. Even
  if the compiler could be taught to solve this problem, we'd then be
  bumping the compiler requirement to gcc 5.0 or something, which
  would be completely unacceptable.

- POSIX requires seq_cst semantics for everything anyway, so the
  benefit would only be for C11 sync objects, and only if we write new
  implementations of them using relaxed atomics.

I'd like to pursue an approach with relaxed atomics at some point,
probably with asm still rather than intrinsics, but IMO it's rather
low priority. I'd like to see the whole C11/POSIX alignment matter get
settled first so we're not shooting for a moving target.

>  - implement a simple lock feature based on futex that encodes the
>    wait counter inside the `int`.
> 
>    After the discussion with Rich some weeks ago I gave it a thought,
>    and I think this can be simply done by using the parity bit for the
>    lock and the other bits for the waiters count. On i386 the parity
>    bit (like any other bit) can effectively be accessed atomically with
>    "lock ; btsl".
> 
>    IIRC, having such a simple lock would also help as a third step of
>    the changes to cv that I proposed and that you mention above.

I'm not sure what you mean, but it's a bad idea to have the waiters
count inside the futex int. This is because every time the futex int
changes, you risk FUTEX_WAIT returning with EAGAIN. Under high load of
waiters arriving/leaving, it might be impossible to _ever_ start a
wait; all threads may just spin. The futex int should be something
that doesn't change except for a very small finite number of changes
(like adding a waiter flag, not a waiter count) before you actually
have a chance of successfully getting the lock or whatever.

>  - __timedwait translates absolute time to relative time, by taking
>    the current time with clock_gettime. This could be avoided for
>    CLOCK_REALTIME and CLOCK_MONOTONIC when using FUTEX_WAIT_BITSET
>    instead of FUTEX_WAIT. That one uses absolute time from the
>    start. Since C threads don't leave a choice of the clock they are
>    using, this would avoid this code path completely and shave off
>    one system call.

IIRC I tried to do that at one point but then decided against it. That
was several years back, so it might be worth revisiting now, but keep
in mind clock_gettime is expected to be a vdso call (no syscall) on
first-class archs, so it's not as big an issue as you might think.

Rich



* Re: In-progress stuff, week of Sept 1
  2014-09-05  1:57 ` Szabolcs Nagy
@ 2014-09-05  8:47   ` Rich Felker
  2014-09-05  9:07     ` Alexander Monakov
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2014-09-05  8:47 UTC (permalink / raw)
  To: musl

On Fri, Sep 05, 2014 at 03:57:30AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2014-09-04 21:14:42 -0400]:
> > Additional roadmap items not mentioned above, and for which work has
> > not yet started, are probably going to get pushed back again into the
> > next release cycle.
> 
> other issues:
> 
> toupper/islower of sharp s

Fixed.

> sem_wait/pthread_join cancellation point in non-blocking case

Fixed.

> alpine patches: ns_parse apis, __setxid issues

These might end up waiting... The setxid stuff is hard to fix right.

Rich



* Re: In-progress stuff, week of Sept 1
  2014-09-05  1:14 In-progress stuff, week of Sept 1 Rich Felker
  2014-09-05  1:57 ` Szabolcs Nagy
  2014-09-05  7:41 ` Jens Gustedt
@ 2014-09-05  8:49 ` Rich Felker
  2 siblings, 0 replies; 14+ messages in thread
From: Rich Felker @ 2014-09-05  8:49 UTC (permalink / raw)
  To: musl

On Thu, Sep 04, 2014 at 09:14:42PM -0400, Rich Felker wrote:
> - fgets at EOF: The proposed fix by nsz is right, but I think we
>   should also address the issue of the missing lock in the n=1 case.

Done. The n=1 case turned out to have more observably-wrong properties
than I was aware of at first, and there were similar problems in fread,
fwrite, and fputs, which I've also fixed.

> - dn_expand issues: Potentially serious bugs that need to be fixed
>   ASAP. nsz and I are working on them right now.

Done.

> - nsz's LFS64 fixes: Should be straightforward to review and commit.
> 
> - Alexander Monakov's "New static analysis results": These probably
>   indicate a few minor bugs which should be easy to fix.

Both of these are still pending.

Rich



* Re: In-progress stuff, week of Sept 1
  2014-09-05  8:47   ` Rich Felker
@ 2014-09-05  9:07     ` Alexander Monakov
  2014-09-05 16:02       ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Monakov @ 2014-09-05  9:07 UTC (permalink / raw)
  To: musl

On Fri, 5 Sep 2014, Rich Felker wrote:
> > sem_wait/pthread_join cancellation point in non-blocking case
> 
> Fixed.

On IRC there was an agreement that not testing cancellation unless an actual
syscall is performed is sensible, after I mentioned that cancellation is
similarly missing in glibc sem_wait as well.

The current wording of the standard is not ambiguous to me though, and forbids
such elision.

Rich, would you raise this for explicit consideration with the Austin Group?

Thanks.
Alexander



* Re: In-progress stuff, week of Sept 1
  2014-09-05  8:32   ` Rich Felker
@ 2014-09-05  9:27     ` Jens Gustedt
  2014-09-05 16:45       ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Gustedt @ 2014-09-05  9:27 UTC (permalink / raw)
  To: musl


Am Freitag, den 05.09.2014, 04:32 -0400 schrieb Rich Felker:
> On Fri, Sep 05, 2014 at 09:41:42AM +0200, Jens Gustedt wrote:
> > Hello,
> > 
> > Am Donnerstag, den 04.09.2014, 21:14 -0400 schrieb Rich Felker:
> > > First, things that definitely are going in this release cycle:
> > > 
> > > - Jens Gustedt's C11 threads: From discussion it's ready, but the last
> > >   iteration of the patches added more gratuitous changes at the same
> > >   time it removed others. Jens is going to be less available for a
> > >   while, so I'm going to finish getting the patches ready to commit,
> > >   but it's going to take a little work going back and reviewing the
> > >   email threads to determine what was the outcome of discussion versus
> > >   what's new, and among what's new, whether it's essential or side
> > >   changes orthogonal to adding C11 threads.
> > 
> > Sorry if it left that impression. There is only one thing I
> > remember that is not really sorted out in the last patch series I
> > sent, the timespec_get things. The other patches should be relatively
> > clean, no? In particular, the higher numbered patches (I think it was
> > 8 and 9) can just be omitted, if you don't like them.
> > 
> > In any case, just send me requests for diffs or entire files in this
> > or that version; my git history contains them all.
> 
> I think it's mainly the __clock_gettime changes that crept well
> outside the scope of doing C11 threads, but I need to look again.
> Trying to get some other things committed at the moment.

So tell me anything you need in terms of patches that might ease your
task.

> > I have other points that I delayed for later. Off the top of my head:
> > 
> >  - review atomics
> > 
> >    * This starts by adding some "i" constraints to the assembler. Gcc is
> >    perfectly capable of optimizing static inline asm that has "ir"
> >    constraints. In the code a lot of these are used with the constants 0
> >    or 1; this would shave off the use of a register at all these
> >    points.
> 
> Examples? We already use "ir" for syscalls on some archs, and I had
> good results. But I wasn't aware that there were places in the atomic
> code where "i" could help.

On i386 and friends these are all those that do "and", "or", or
"store". All of these could get "ir", I think. `a_and` has 1 usage in
src/, `a_or` has 4, and `a_store` has 26. For the latter, all but 4 are
with constants.

> >    * Review the necessity of having all of these with "memory". As long
> >    as the pointer argument is constrained with "m" and not "p", this
> >    should suffice to encode the data dependency. The only thing that
> >    we'd have to be careful with is reordering, but I think this could
> >    be achieved by using `volatile`, instead.
> 
> In order for their memory-barrier semantics to be useful, all atomics
> also need to be full compiler barriers. Otherwise the compiler could
> move loads/stores of data it thinks is unrelated across the asm block,
> and thereby across the memory barrier.

Which would be ok if we'd allow for acquire-release semantics. So this
is related to the discussion below.

> >    * Move to gcc's __sync and __atomic builtins. The former have been
> >    available for a long time, and both are easily testable with macros.
> >    In the case of __atomic this would make it possible to move the
> >    atomics used in musl to acquire-release semantics, which may pay off
> >    for some archs, in particular arm, I think.
> > 
> >    This also should make it easier to port to new architectures in the
> >    future, as soon as there is a working gcc for it.

> This has already been discussed in the past, and it's not a good
> option:
> 
> - It would severely reduce the set of compilers that can be used to
>   build musl.

Probably I am too naive about that part. We are using gcc extensions
in many places, so I thought that all compilers that can be used with
musl basically implement gcc's major features.

As these extensions are easily testable, having this as an option
would still be doable.

> - It makes it impossible to correctly support archs where different
>   atomic strategies might need to be chosen at runtime. This includes
>   both arm (see the long threads on what to do about broken grsec
>   kernels without the kuser helper page) and sh, and maybe more. Even
>   if the compiler could be taught to solve this problem, we'd then be
>   bumping the compiler requirement to gcc 5.0 or something, which
>   would be completely unacceptable.

Sure, for such cases this must be easily tunable.

> - POSIX requires seq_cst semantics for everything anyway, so the
>   benefit would only be for C11 sync objects, and only if we write new
>   implementations of them using relaxed atomics.

right

> I'd like to pursue an approach with relaxed atomics at some point,
> probably with asm still rather than intrinsics, but IMO it's rather
> low priority. I'd like to see the whole C11/POSIX alignment matter get
> settled first so we're not shooting for a moving target.

sure, I completely agree

> >  - implement a simple lock feature based on futex that encodes the
> >    wait counter inside the `int`.
> > 
> >    After the discussion with Rich some weeks ago I gave it a thought,
> >    and I think this can be simply done by using the parity bit for the
> >    lock and the other bits for the waiters count. On i386 the parity
> >    bit (like any other bit) can effectively be accessed atomically with
> >    "lock ; btsl".
> > 
> >    IIRC, having such a simple lock would also help as a third step of
> >    the changes to cv that I proposed and that you mention above.
> 
> I'm not sure what you mean,

again, off the top of my head, I remember two places that have ad hoc
locking conventions that we touched in the last weeks: cv and
pthread_once_t. Both encode waiters in a non-optimal way: cv only has
an estimate of "somebody might be waiting", and pthread_once_t uses a
static `waiters` variable for all instances.

> but it's a bad idea to have the waiters
> count inside the futex int. This is because every time the futex int
> changes, you risk FUTEX_WAIT returning with EAGAIN. Under high load of
> waiters arriving/leaving, it might be impossible to _ever_ start a
> wait; all threads may just spin. The futex int should be something
> that doesn't change except for a very small finite number of changes
> (like adding a waiter flag, not a waiter count) before you actually
> have a chance of successfully getting the lock or whatever.

First, I think you overestimate the trouble we can have with that. A
thread that activates the wait count and goes into the futex call must
be active. If the futex_wait is not possible, EAGAIN is in general
returned without descheduling the thread, so the number of threads
spinning around could be reduced to at most the number of cores (check
for EAGAIN, reload the `int` without changing it, and retry). This
might drag on performance in some extreme cases, but having an
accurate wait count might benefit in others.

In any case, for the two use cases I mention above this problem isn't
very relevant, I think. For cv this is the internal lock; there will
be at most one waiting thread on it (it has to hold the mutex) and
perhaps several signaling threads, but it should be rare for it to be
contended by more than two threads at a time. For the current version
the "may-have-waiters" flag may trigger some superfluous wakeups, I
think.

And pthread_once_t is of minor performance importance, anyhow.

But it is a pity that there is no FUTEX_WAIT_OP or something like
that which would allow testing for things other than equality.

> >  - __timedwait translates absolute time to relative time, by taking
> >    the current time with clock_gettime. This could be avoided for
> >    CLOCK_REALTIME and CLOCK_MONOTONIC when using FUTEX_WAIT_BITSET
> >    instead of FUTEX_WAIT. That one uses absolute time from the
> >    start. Since C threads don't leave a choice of the clock they are
> >    using, this would avoid this code path completely and shave off
> >    one system call.
> 
> IIRC I tried to do that at one point but then decided against it. That
> was several years back, so it might be worth revisiting now, but keep
> in mind clock_gettime is expected to be a vdso call (no syscall) on
> first-class archs, so it's not as big an issue as you might think.

Sure, all of this has low priority.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: In-progress stuff, week of Sept 1
  2014-09-05  9:07     ` Alexander Monakov
@ 2014-09-05 16:02       ` Rich Felker
  2014-09-05 17:40         ` Alexander Monakov
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2014-09-05 16:02 UTC (permalink / raw)
  To: musl

On Fri, Sep 05, 2014 at 01:07:35PM +0400, Alexander Monakov wrote:
> On Fri, 5 Sep 2014, Rich Felker wrote:
> > > sem_wait/pthread_join cancellation point in non-blocking case
> > 
> > Fixed.
> 
> On IRC there was an agreement that not testing cancellation unless an actual
> syscall is performed is sensible, after I mentioned that cancellation is
> similarly missing in glibc sem_wait as well.

I wouldn't say an agreement. There was some sentiment to the effect,
but obviously it doesn't conform to the requirements of the standard.
Less obviously, it can lead to unbounded delays acting on cancellation
in cases like producer/consumer uses where data is being produced so
fast that the consumer almost never has to wait.

> The current wording of the standard is not ambiguous to me though, and forbids
> such elision.
> 
> Rich, would you raise this for an explicit consideration with the Austin group?

I'm not sure it's worthwhile, or even advisable, to try to change
this. It's hard to specify exactly when omission of acting on
cancellation would be permitted without referring to terms that aren't
really defined, and I think it would yield less useful behavior for
applications.

On the other hand, it could make a considerable performance difference
to implementations where accessing TLS is costly (e.g. on mips). So I
think it could at least be a topic of further consideration.

Rich



* Re: In-progress stuff, week of Sept 1
  2014-09-05  9:27     ` Jens Gustedt
@ 2014-09-05 16:45       ` Rich Felker
  2014-09-05 17:09         ` Alexander Monakov
  2014-09-06  7:04         ` Jens Gustedt
  0 siblings, 2 replies; 14+ messages in thread
From: Rich Felker @ 2014-09-05 16:45 UTC (permalink / raw)
  To: musl

On Fri, Sep 05, 2014 at 11:27:52AM +0200, Jens Gustedt wrote:
> Am Freitag, den 05.09.2014, 04:32 -0400 schrieb Rich Felker:
> > On Fri, Sep 05, 2014 at 09:41:42AM +0200, Jens Gustedt wrote:
> > > Hello,
> > > 
> > > Am Donnerstag, den 04.09.2014, 21:14 -0400 schrieb Rich Felker:
> > > > First, things that definitely are going in this release cycle:
> > > > 
> > > > - Jens Gustedt's C11 threads: From discussion it's ready, but the last
> > > >   iteration of the patches added more gratuitous changes at the same
> > > >   time it removed others. Jens is going to be less available for a
> > > >   while, so I'm going to finish getting the patches ready to commit,
> > > >   but it's going to take a little work going back and reviewing the
> > > >   email threads to determine what was the outcome of discussion versus
> > > >   what's new, and among what's new, whether it's essential or side
> > > >   changes orthogonal to adding C11 threads.
> > > 
> > > Sorry if it left that impression. There is only one thing I
> > > remember that is not really sorted out in the last patch series I
> > > sent, the timespec_get things. The other patches should be relatively
> > > clean, no? In particular, the higher numbered patches (I think it was
> > > 8 and 9) can just be omitted, if you don't like them.
> > > 
> > > In any case, just send me requests for diffs or entire files in this
> > > or that version; my git history contains them all.
> > 
> > I think it's mainly the __clock_gettime changes that crept well
> > outside the scope of doing C11 threads, but I need to look again.
> > Trying to get some other things committed at the moment.
> 
> So tell me anything you need in terms of patches that might ease your
> task.

If it looks like a big task when I read over it again, I'll see if
there's something you could do that would make it easier. Otherwise
it's probably easier just to do it than adding email latency. Thanks
though.

> > > I have other points that I delayed for later. Off the top of my head:
> > > 
> > >  - review atomics
> > > 
> > >    * This starts by adding some "i" constraints to the assembler. Gcc is
> > >    perfectly capable of optimizing static inline asm that has "ir"
> > >    constraints. In the code a lot of these are used with the constants 0
> > >    or 1; this would shave off the use of a register at all these
> > >    points.
> > 
> > Examples? We already use "ir" for syscalls on some archs, and I had
> > good results. But I wasn't aware that there were places in the atomic
> > code where "i" could help.
> 
> On i386 and friends these are all those that do "and", "or", or
> "store". All of these could get "ir", I think. `a_and` has 1 usage in
> src/, `a_or` has 4, and `a_store` has 26. For the latter, all but 4
> uses are with constants.

Only 2 uses of a_or have constant operands. But you're right that it
could be beneficial for a_store, and for those two uses of a_or. I'm
not opposed to this change.

> > >    * Review the necessity of having all of these with "memory". As long
> > >    as the pointer argument is constrained with "m" and not "p", this
> > >    should suffice to encode the data dependency. The only thing that
> > >    we'd have to be careful with is reordering, but I think this could
> > >    be achieved by using `volatile`, instead.
> > 
> > In order for their memory-barrier semantics to be useful, all atomics
> > also need to be full compiler barriers. Otherwise the compiler could
> > move loads/stores of data it thinks is unrelated across the asm block,
> > and thereby across the memory barrier.
> 
> Which would be ok if we'd allow for acquire-release semantics. So this
> is related to the discussion below.

It's almost never acceptable for the compiler to move loads/stores in
BOTH directions across the atomic, which removing the memory
constraint would allow, as far as I can tell. So I think
acquire-release semantics only give a significant compiler-level
benefit if you use the intrinsics. On the other hand, I suspect
allowing the compiler to reorder loads/stores is utterly useless for
performance inside the synchronization primitives (as opposed to
direct use of atomics by apps, where it might be very useful). The
useful part is reducing or removing the cache synchronization
requirements between cores.

> > >    * Move to gcc's __sync and __atomic builtins. The first are available
> > >    since always, and both are easily testable with macros. In case of
> > >    __atomic this would make it possible to move the atomics used in
> > >    musl to acquire-release semantics, which may pay for some archs, in
> > >    particular arm, I think.
> > > 
> > >    This also should make it easier to port to new architectures in the
> > >    future, as soon as there is a working gcc for it.
> 
> > This has already been discussed in the past, and it's not a good
> > option:
> > 
> > - It would severely reduce the set of compilers that can be used to
> >   build musl.
> 
> Probably I am too naive on that part. We are using gcc extensions in
> many places, so I thought that all compilers that can be used with
> musl basically implement gcc's major features.

While not currently documented, the set of GCC extensions used is
intentionally extremely minimal, and limited to situations where there
is absolutely no other way to represent the needed semantics. The
current code presumably works all the way back to gcc 3.0 (not
tested), works with pcc, and would work with tcc if some issues on
tcc's side (plain C correctness, not gcc compatibility) were fixed.

If I'm not mistaken, the __sync intrinsics require at least gcc 4.2
and perhaps even a later post-GPLv3 version (unacceptable to a
considerable portion of musl's users). I'm not clear whether pcc
supports them too.

> As these extensions are easily testable, having this as an option
> would still be doable.

Options are combinatorics for testing. Unless they actually have a
huge practical benefit, they're a net liability, not a feature. IMO,
this (not a single option, but overall the abundance of them) is the
main reason uClibc is bitrotting today and musl is moving forward at a
very fast pace -- we don't have option combinatorics to maintain,
test, and constantly fight bitrot.

I'm not entirely opposed to optional support for these (e.g. arm
already has to have _runtime_ options for these, which are even worse,
but it's not yet implemented), but for the time being it wouldn't even
be a benefit since we'd be using seq_cst everywhere anyway...

> > I'd like to pursue an approach with relaxed atomics at some point,
> > probably with asm still rather than intrinsics, but IMO it's rather
> > low priority. I'd like to see the whole C11/POSIX alignment matter get
> > settled first so we're not shooting for a moving target.
> 
> sure, I completely agree

OK.

> > >  - implement a simple lock feature based on futex that encodes the
> > >    wait counter inside the `int`.
> > > 
> > >    After the discussion with Rich some weeks ago I gave it a thought,
> > >    and I think this can be simply done by using the parity bit for the
> > >    lock and the other bits for the waiters count. On i386 the parity
> > >    bit (as any other bit) can effectively be accessed atomically with
> > >    "lock ; btsl".
> > > 
> > >    IIRC, having such a simple lock would also help as a third step of
> > >    the changes to cv that I proposed and that you mention above.
> > 
> > I'm not sure what you mean,
> 
> again, off the top of my head, I remember two places with ad-hoc
> locking conventions that we touched in the last weeks: cv and
> pthread_once_t. Both encode waiters in a non-optimal way: cv only has
> an estimate of "somebody might be waiting", pthread_once_t uses a
> static `waiters` variable for all instances.

Yes, this is bad, but pthread_once_t would be fine with a waiters flag
in the futex int. I'll look at improving that later. But since the
number of pthread_once_t objects in the whole program is
finite/bounded, it's really irrelevant to performance.

> > but it's a bad idea to have the waiters
> > count inside the futex int. This is because every time the futex int
> > changes, you risk FUTEX_WAIT returning with EAGAIN. Under high load of
> > waiters arriving/leaving, it might be impossible to _ever_ start a
> > wait; all threads may just spin. The futex int should be something
> > that doesn't change except for a very small finite number of changes
> > (like adding a waiter flag, not a waiter count) before you actually
> > have a chance of successfully getting the lock or whatever.
> 
> First, I think you overestimate the trouble that we can have with
> that.

Actually I don't. It follows from the massive performance problems I
had with the original cv implementation where wait was "write your own
tid then futex wait" and signal was "write your own tid then futex
wake". This is trivially correct and avoids sequence number wrapping
issues like the current pshared implementation, but it had a HUGE
number of spurious wakes: something like 3:1 (spurious:genuine) for
several threads hammering the cv. 

On typical cpus, there's a window of 500+ cycles between the atomic
operation in userspace and rechecking it in kernelspace. That is
plenty of time for another core to cause your futex wait to EAGAIN.

> In any case, for the two use cases I mention above this problem isn't
> very relevant, I think. For cv this is the internal lock; there will
> be at most one "waiting" thread on it (it has to hold the mutex) and
> perhaps several signaling threads, but it should be rare that it is
> contended by more than two threads at a time. For the current version
> the "may-have-waiters" flag may trigger some superfluous wakeups, I
> think.

This is possibly true, but would need some analysis, and thinking
about it now is distracting me from getting done other things that we
want to get done now. ;-)

> And pthread_once_t is of minor performance importance, anyhow.

I don't see how a may-have-waiters approach has any cost for
pthread_once_t unless some of the waiters are cancelled. And that's a
pathological case that's not worth optimizing.

> But it is a pity that there is no FUTEX_WAIT_OP or something like
> that, that would allow testing for other things than equality.

Yes. And as I've noted before, I think it would allow for a very nice
cv implementation that's basically entirely kernel-side and that works
for process-shared cvs. (You could use it to make unlocking the mutex
truly atomic with respect to waiting on the cv futex.)

Rich


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: In-progress stuff, week of Sept 1
  2014-09-05 16:45       ` Rich Felker
@ 2014-09-05 17:09         ` Alexander Monakov
  2014-09-06  7:04         ` Jens Gustedt
  1 sibling, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2014-09-05 17:09 UTC (permalink / raw)
  To: musl

On Fri, 5 Sep 2014, Rich Felker wrote:
> If I'm not mistaken, the __sync intrinsics require at least gcc 4.2
> and perhaps even a later post-GPLv3 version (unacceptable to a
> considerable portion of musl's users). I'm not clear whether pcc
> supports them too.

The earliest release to have __sync built-ins is 4.1, and the earliest release
to have __atomic built-ins is 4.7.

Alexander



* Re: In-progress stuff, week of Sept 1
  2014-09-05 16:02       ` Rich Felker
@ 2014-09-05 17:40         ` Alexander Monakov
  2014-09-05 18:06           ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Monakov @ 2014-09-05 17:40 UTC (permalink / raw)
  To: musl

OK, thanks for explaining your perspective.  Should a corresponding bug
be filed with glibc (or should that wait until we land a full-blown
cancellation point test in libc-test)?

Alexander



* Re: In-progress stuff, week of Sept 1
  2014-09-05 17:40         ` Alexander Monakov
@ 2014-09-05 18:06           ` Rich Felker
  0 siblings, 0 replies; 14+ messages in thread
From: Rich Felker @ 2014-09-05 18:06 UTC (permalink / raw)
  To: musl

On Fri, Sep 05, 2014 at 09:40:06PM +0400, Alexander Monakov wrote:
> OK, thanks for explaining your perspective.  Should a corresponding bug be
> filed with glibc (should that wait until we land a full-blown cancellation
> point test in libc-test?) ?

I'm not sure. glibc has major, systemic problems with their
implementation of cancellation, but there's finally an interest in
fixing them. Probably the question is whether piling on more
cancellation related bug reports would fuel interest in working on
solving the problem, or fuel frustration. And I'm really not sure.

Rich



* Re: In-progress stuff, week of Sept 1
  2014-09-05 16:45       ` Rich Felker
  2014-09-05 17:09         ` Alexander Monakov
@ 2014-09-06  7:04         ` Jens Gustedt
  1 sibling, 0 replies; 14+ messages in thread
From: Jens Gustedt @ 2014-09-06  7:04 UTC (permalink / raw)
  To: musl

Am Freitag, den 05.09.2014, 12:45 -0400 schrieb Rich Felker:
> On Fri, Sep 05, 2014 at 11:27:52AM +0200, Jens Gustedt wrote:
> > again, off the top of my head, I remember two places with ad-hoc
> > locking conventions that we touched in the last weeks: cv and
> > pthread_once_t. Both encode waiters in a non-optimal way: cv only has
> > an estimate of "somebody might be waiting", pthread_once_t uses a
> > static `waiters` variable for all instances.
> 
> Yes, this is bad, but pthread_once_t would be fine with a waiters flag
> in the futex int. I'll look at improving that later. But since the
> number of pthread_once_t objects in the whole program is
> finite/bounded, it's really irrelevant to performance.
> 
> > > but it's a bad idea to have the waiters
> > > count inside the futex int. This is because every time the futex int
> > > changes, you risk FUTEX_WAIT returning with EAGAIN. Under high load of
> > > waiters arriving/leaving, it might be impossible to _ever_ start a
> > > wait; all threads may just spin. The futex int should be something
> > > that doesn't change except for a very small finite number of changes
> > > (like adding a waiter flag, not a waiter count) before you actually
> > > have a chance of successfully getting the lock or whatever.
> > 
> > First, I think you overestimate the trouble that we can have with
> > that.
> 
> Actually I don't. It follows from the massive performance problems I
> had with the original cv implementation where wait was "write your own
> tid then futex wait" and signal was "write your own tid then futex
> wake". This is trivially correct and avoids sequence number wrapping
> issues like the current pshared implementation, but it had a HUGE
> number of spurious wakes: something like 3:1 (spurious:genuine) for
> several threads hammering the cv. 

The situation would be different here. A thread would only update the
waiters part of the `int` once, and then attempt to do the futex wait
in a loop (while EAGAIN) without changing it again.

> On typical cpus, there's a window of 500+ cycles between the atomic
> operation in userspace and rechecking it in kernelspace. That is
> plenty of time for another core to cause your futex wait to EAGAIN.
> 
> > In any case, for the two use cases I mention above this problem isn't
> > very relevant, I think. For cv this is the internal lock; there will
> > be at most one "waiting" thread on it (it has to hold the mutex) and
> > perhaps several signaling threads, but it should be rare that it is
> > contended by more than two threads at a time. For the current version
> > the "may-have-waiters" flag may trigger some superfluous wakeups, I
> > think.
> 
> This is possibly true, but would need some analysis, and thinking
> about it now is distracting me from getting done other things that we
> want to get done now. ;-)
> 
> > And pthread_once_t is of minor performance importance, anyhow.
> 
> I don't see how a may-have-waiters approach has any cost for
> pthread_once_t unless some of the waiters are cancelled. And that's a
> pathological case that's not worth optimizing.

Right, at least here such an approach would help to avoid the ugly
static `waiters` variable.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





end of thread, other threads:[~2014-09-06  7:04 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-05  1:14 In-progress stuff, week of Sept 1 Rich Felker
2014-09-05  1:57 ` Szabolcs Nagy
2014-09-05  8:47   ` Rich Felker
2014-09-05  9:07     ` Alexander Monakov
2014-09-05 16:02       ` Rich Felker
2014-09-05 17:40         ` Alexander Monakov
2014-09-05 18:06           ` Rich Felker
2014-09-05  7:41 ` Jens Gustedt
2014-09-05  8:32   ` Rich Felker
2014-09-05  9:27     ` Jens Gustedt
2014-09-05 16:45       ` Rich Felker
2014-09-05 17:09         ` Alexander Monakov
2014-09-06  7:04         ` Jens Gustedt
2014-09-05  8:49 ` Rich Felker
