From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: bug in pthread_cond_broadcast
Date: Tue, 12 Aug 2014 19:33:10 -0400
Message-ID: <20140812233310.GF12888@brightrain.aerifal.cx>
In-Reply-To: <1407883819.15134.160.camel@eris.loria.fr>

On Wed, Aug 13, 2014 at 12:50:19AM +0200, Jens Gustedt wrote:
> On Tue, 12 Aug 2014 at 17:20 -0400, Rich Felker wrote:
> > On Tue, Aug 12, 2014 at 08:18:59PM +0200, Jens Gustedt wrote:
> > > so yes, the broadcast operation is synchronized with all other
> > > threads, that's the idea of this test
> >
> > OK, thanks for clarifying. I'm still trying to understand where the
> > error in musl's accounting is happening --
>
> I think these are the decrement operations on _c_waiters and
> _c_waiters2 in "unwait". In the case of the test, for pending waiters,
> these happen after other threads know that the condition variable is
> "released" by the main thread and have already entered the next phase.

I think you're saying the problem is this code:

	if (c->_c_waiters2) c->_c_waiters2--;
	else a_dec(&m->_m_waiters);

which is wrongly decrementing a newly-added waiter from the cond var
when the intent was that the waiter should be decremented from the
mutex. Is this correct?

One potential solution I have in mind is to get rid of this complex
waiter accounting by:

1. Having pthread_cond_broadcast set the new-waiter state on the mutex
   when requeuing, so that the next unlock forces a futex wake that
   otherwise might not happen.

2. Having pthread_cond_timedwait set the new-waiter state on the mutex
   after relocking it, either unconditionally (this would be rather
   expensive) or on some condition. One possible condition would be to
   keep a counter in the condvar object for the number of waiters that
   have been requeued, incrementing it by the number requeued at
   broadcast time and decrementing it on each wake.
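
To make that concrete, here is a very rough sketch of the idea. This
is not actual musl code: the structs and names (hypo_*, and the
futex_requeue call mentioned in a comment) are made up for
illustration, and it omits the locking/race handling and futex calls a
real implementation needs.

	#include <stdatomic.h>

	struct hypo_mutex {
		atomic_int lock;      /* futex word: 0 unlocked, 1 locked,
		                         2 locked with waiters ("new-waiter" state) */
	};

	struct hypo_cond {
		atomic_int seq;       /* futex word the waiters sleep on */
		atomic_int requeued;  /* waiters a broadcast has moved onto the mutex */
	};

	/* 1. broadcast: record how many waiters get requeued onto the mutex
	 *    and put the mutex in its contended state, so the owner's next
	 *    unlock performs a futex wake it might otherwise skip. */
	void hypo_broadcast(struct hypo_cond *c, struct hypo_mutex *m, int n)
	{
		atomic_fetch_add(&c->requeued, n);
		atomic_store(&m->lock, 2);
		atomic_fetch_add(&c->seq, 1);
		/* then e.g. futex_requeue(&c->seq, 1, n, &m->lock):
		 * wake one waiter, requeue the rest onto the mutex */
	}

	/* 2. wait return path: after relocking the mutex, a woken waiter
	 *    removes itself from the requeued count and, if others are still
	 *    pending, re-asserts the contended state so that unlock keeps
	 *    waking them. */
	void hypo_unwait(struct hypo_cond *c, struct hypo_mutex *m)
	{
		int r = atomic_load(&c->requeued);
		do {
			if (r <= 0) return;   /* this waiter wasn't requeued */
		} while (!atomic_compare_exchange_weak(&c->requeued, &r, r - 1));
		if (r > 1)
			atomic_store(&m->lock, 2);
	}
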

However, the latter requires accessing the cond var memory in the
return path for wait, and I don't see any good way around this. Maybe
there's a way to use memory on the waiters' stacks?

> > I'd like to find a fix that would be acceptable in the 1.0.x branch
> > and make that fix before possibly re-doing the cond var
> > implementation (a fix which wouldn't be suitable for backporting).
>
> Some thoughts:
>
> Basically, in "unwait" there shouldn't be any reference to c->. No
> pending thread inside timedwait should ever have to access the
> pthread_cond_t; again, it might already be heavily used by other
> threads.

I'm not sure whether I agree with this or not -- as long as destroy is
properly synchronized, I don't think it's inherently a bug to access
c-> here, and I'm not clear yet on how the access could be eliminated.
I think this is a direction we should pursue too, but I'd like to
avoid invasive design changes for the backport to 1.0.x.

Rich