From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <4ac3289b045087345c64880f01166007@ladd.quanstro.net>
References: <11da45046fa8267e7445128ed00724cd@ladd.quanstro.net>
	<24bb48f61c5eab87a133b82a9ef32474@coraid.com>
	<2808a9fa079bea86380a8d52be67b980@coraid.com>
	<40925e8f64489665bd5bd6ca743400ea@coraid.com>
	<4ac3289b045087345c64880f01166007@ladd.quanstro.net>
Date: Fri, 25 Feb 2011 09:44:42 -0500
Message-ID:
Subject: Re: [9fans] sleep/wakeup bug?
From: Russ Cox
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Cc: erik quanstrom
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: b4bc579e-ead6-11e9-9d60-3106f5b1d025

> in the case of this particular bug, i have at least 40µs
> grace and the change has held up for 12 hrs, where i
> could crash the machine in <5s before.

i don't know about you but if i were shipping code that
depended critically on not losing a race with a fast device,
it would keep me up at night.

what sape said. don't try to paper over this.
also what the sleep and wakeup paper says:

    The code looks trivially correct in retrospect: all access
    to data structures is done under lock, and there is no place
    that things may get out of order. Nonetheless, it took us
    several iterations to arrive at the above implementation,
    because the things that can go wrong are often hard to see.
    We had four earlier implementations that were examined at
    great length and only found faulty when a new, different
    style of device or activity was added to the system.

you can spend a lot of time chasing down bugs because
you are using sleep and wakeup incorrectly, or you can
use it correctly. "fixing" it is not something i would suggest
taking on.

sape also said that you can't dynamically allocate structures
with rendezvous in them, but that's not strictly true. you just
have to be careful, i.e.
use a lock to make sure no cpu is about to find or about to
call wakeup on the structure when you're about to free it.
i believe the locking discipline i described in my original mail
works (use a lock instead of an ilock if there's no interrupt
involved).

i was going to point you at devmnt but devmnt looks buggy
to me. mountmux unlocks m before calling wakeup, which on
first glance looks like a mistake mitigated only by the fact
that mntfree only rarely frees r. or i could be missing something.

it's certainly much easier to use static rendez structures.

as presotto says in the thread richard mentioned, the problem
is not sleep vs wakeup. the responsibility necessarily has to
lie with the calling code, which is finding, looking at, and
possibly modifying the structure containing r before it calls
wakeup.

http://9fans.net/archive/?q=%27test-and-set+problem%27

>> p.s. not relevant to your "only one sleep and one wakeup"
>> constraint, but that last scenario also means that if you
>> are doing repeated sleep + wakeup on a single r, that pending
>> wakeup call left over on cpu2 might not happen until cpu1 has
>> gone back to sleep (a second time).  that is, the first wakeup
>> can wake the second sleep, intending to wake the first sleep.
>> so in general you have to handle the case where sleep
>> wakes up for no good reason.  it doesn't happen all the time,
>> but it does happen.
>
> this is probably a bug lurking in many things.

it's not that hard to get this part right. the simple rule is
that you should call sleep in a loop until the condition
you are waiting for has happened. (keep sleeping until
your dreams come true?) devmnt is a good example in
this case. seeing the loop when you read the code is also
a good reminder that sometimes sleep never runs at all.

russ
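
a rough sketch of the two rules above (sleep in a loop until the
condition holds; hold a lock around the wakeup so nobody can free the
structure underneath it). this is a posix threads analogue, not plan 9
kernel code and not what devmnt does: pthread_cond_wait stands in for
sleep and pthread_cond_signal for wakeup. Req, waitreply, complete and
rundemo are hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* hypothetical dynamically allocated request with an embedded rendezvous */
typedef struct Req Req;
struct Req {
	pthread_mutex_t lk;	/* guards done and reply */
	pthread_cond_t r;	/* stands in for a Rendez (assumption) */
	int done;		/* the condition being waited for */
	int reply;
};

/* waiter: keep sleeping until the condition is actually true */
static int
waitreply(Req *q)
{
	int v;

	pthread_mutex_lock(&q->lk);
	while(!q->done)		/* may loop zero times: sleep never runs */
		pthread_cond_wait(&q->r, &q->lk);
	v = q->reply;
	pthread_mutex_unlock(&q->lk);
	return v;
}

/* waker: set the condition and signal while still holding the lock */
static void*
complete(void *a)
{
	Req *q = a;

	pthread_mutex_lock(&q->lk);
	q->reply = 42;
	q->done = 1;
	pthread_cond_signal(&q->r);
	pthread_mutex_unlock(&q->lk);
	return NULL;
}

/* allocate, wait, and free only once the waker is finished with q */
int
rundemo(void)
{
	Req *q;
	pthread_t t;
	int v;

	q = malloc(sizeof *q);
	assert(q != NULL);
	pthread_mutex_init(&q->lk, NULL);
	pthread_cond_init(&q->r, NULL);
	q->done = 0;
	q->reply = 0;
	if(pthread_create(&t, NULL, complete, q) != 0){
		pthread_cond_destroy(&q->r);
		pthread_mutex_destroy(&q->lk);
		free(q);
		return -1;
	}
	v = waitreply(q);
	pthread_join(t, NULL);	/* waker can no longer touch q */
	pthread_cond_destroy(&q->r);
	pthread_mutex_destroy(&q->lk);
	free(q);
	return v;
}
```

the while loop makes a stray or late wakeup harmless, since the waiter
just rechecks done and goes back to sleep. the join before free plays
the part of "make sure no cpu is about to call wakeup on the structure";
in a kernel that guarantee has to come from the caller's own locking.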