From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: bug in pthread_cond_broadcast
Date: Tue, 12 Aug 2014 19:33:10 -0400
Message-ID: <20140812233310.GF12888@brightrain.aerifal.cx>
In-Reply-To: <1407883819.15134.160.camel@eris.loria.fr>

On Wed, Aug 13, 2014 at 12:50:19AM +0200, Jens Gustedt wrote:
> On Tue, 12 Aug 2014 at 17:20 -0400, Rich Felker wrote:
> > On Tue, Aug 12, 2014 at 08:18:59PM +0200, Jens Gustedt wrote:
> > > so yes, the broadcast operation is synchronized with all other
> > > threads, that's the idea of this test
> >
> > OK, thanks for clarifying. I'm still trying to understand where the
> > error in musl's accounting is happening --
>
> I think these are the decrement operations on _c_waiters and
> _c_waiters2 in "unwait". In the case of the test, for pending waiters,
> these happen after other threads know that the condition variable is
> "released" by the main thread and have already entered the next phase.

I think you're saying the problem is this code:

	if (c->_c_waiters2) c->_c_waiters2--;
	else a_dec(&m->_m_waiters);

which is wrongly decrementing a newly-added waiter from the cond var
when the intent was that the waiter should be decremented from the
mutex. Is this correct?

One potential solution I have in mind is to get rid of this complex
waiter accounting by:

1. Having pthread_cond_broadcast set the new-waiter state on the mutex
   when requeuing, so that the next unlock forces a futex wake that
   otherwise might not happen.

2. Having pthread_cond_timedwait set the new-waiter state on the mutex
   after relocking it, either unconditionally (this would be rather
   expensive) or on some condition. One possible condition would be to
   keep a counter in the condvar object for the number of waiters that
   have been requeued, incrementing it by the number requeued at
   broadcast time and decrementing it on each wake.
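
To make that concrete, here is a very rough sketch of the idea. This
is not actual musl code: the structs and names (hypo_*, and the
futex_requeue call mentioned in a comment) are made up for
illustration, and it omits the locking/race handling and futex calls a
real implementation needs.

	#include <stdatomic.h>

	struct hypo_mutex {
		atomic_int lock;      /* futex word: 0 unlocked, 1 locked,
		                         2 locked with waiters ("new-waiter" state) */
	};

	struct hypo_cond {
		atomic_int seq;       /* futex word the waiters sleep on */
		atomic_int requeued;  /* waiters a broadcast has moved onto the mutex */
	};

	/* 1. broadcast: record how many waiters get requeued onto the mutex
	 *    and put the mutex in its contended state, so the owner's next
	 *    unlock performs a futex wake it might otherwise skip. */
	void hypo_broadcast(struct hypo_cond *c, struct hypo_mutex *m, int n)
	{
		atomic_fetch_add(&c->requeued, n);
		atomic_store(&m->lock, 2);
		atomic_fetch_add(&c->seq, 1);
		/* then e.g. futex_requeue(&c->seq, 1, n, &m->lock):
		 * wake one waiter, requeue the rest onto the mutex */
	}

	/* 2. wait return path: after relocking the mutex, a woken waiter
	 *    removes itself from the requeued count and, if others are still
	 *    pending, re-asserts the contended state so that unlock keeps
	 *    waking them. */
	void hypo_unwait(struct hypo_cond *c, struct hypo_mutex *m)
	{
		int r = atomic_load(&c->requeued);
		do {
			if (r <= 0) return;   /* this waiter wasn't requeued */
		} while (!atomic_compare_exchange_weak(&c->requeued, &r, r - 1));
		if (r > 1)
			atomic_store(&m->lock, 2);
	}
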

However, the latter requires accessing the cond var memory in the
return path for wait, and I don't see any good way around this. Maybe
there's a way to use memory on the waiters' stacks?

> > I'd like to find a fix that would be acceptable in the 1.0.x branch
> > and make that fix before possibly re-doing the cond var
> > implementation (a fix which wouldn't be suitable for backporting).
>
> Some thoughts:
>
> Basically, in "unwait" there shouldn't be any reference to c->. No
> pending thread inside timedwait should ever have to access the
> pthread_cond_t; again, it might already be heavily used by other
> threads.

I'm not sure whether I agree with this or not -- as long as destroy is
properly synchronized, I don't think it's inherently a bug to access
c-> here, and I'm not clear yet on how the access could be eliminated.
I think this is a direction we should pursue too, but I'd like to
avoid invasive design changes for the backport to 1.0.x.

Rich