From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <4ac3289b045087345c64880f01166007@ladd.quanstro.net>
References: <11da45046fa8267e7445128ed00724cd@ladd.quanstro.net>
	<AANLkTinuyGgidt_Lz9rQ8VyVtTVdnU2TABmbZG7NWt8g@mail.gmail.com>
	<24bb48f61c5eab87a133b82a9ef32474@coraid.com>
	<AANLkTim7W5rkgP00_M0AprNGpFhYgP=UUtWtOytJDKHO@mail.gmail.com>
	<2808a9fa079bea86380a8d52be67b980@coraid.com>
	<AANLkTi=4_=++Tm2a9Jq9jSzqUSexkW-ZjM-38oD_bS1y@mail.gmail.com>
	<40925e8f64489665bd5bd6ca743400ea@coraid.com>
	<AANLkTi=FabYqOd3ozUEXi9_Ua8S5DujfUjhzCYxPF2TA@mail.gmail.com>
	<4ac3289b045087345c64880f01166007@ladd.quanstro.net>
Date: Fri, 25 Feb 2011 09:44:42 -0500
Message-ID: <AANLkTinc=hesW0k9pa9EU-OxSQgwj3yM+VEEMasmVva+@mail.gmail.com>
Subject: Re: [9fans] sleep/wakeup bug?
From: Russ Cox <rsc@swtch.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Cc: erik quanstrom <quanstro@quanstro.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: b4bc579e-ead6-11e9-9d60-3106f5b1d025

> in the case of this particular bug, i have at least 40µs
> grace and the change has held up for 12 hrs, where i
> could crash the machine in <5s before.

i don't know about you but if i were shipping code that
depended critically on not losing a race with a fast device,
it would keep me up at night.

what sape said.  don't try to paper over this.

also what the sleep and wakeup paper says:

    The code looks trivially correct in retrospect: all access to data
    structures is done under lock, and there is no place that things
    may get out of order. Nonetheless, it took us several iterations
    to arrive at the above implementation, because the things that
    can go wrong are often hard to see. We had four earlier
    implementations that were examined at great length and only
    found faulty when a new, different style of device or activity
    was added to the system.

you can spend a lot of time chasing down bugs because
you are using sleep and wakeup incorrectly, or you can
use it correctly.  "fixing" it is not something i would suggest
taking on.

sape also said that you can't dynamically allocate structures
with rendezvous in them, but that's not strictly true.  you just
have to be careful, i.e. use a lock to make sure no cpu is
about to find or about to call wakeup on the structure
when you're about to free it.  i believe the locking discipline
i described in my original mail works (use a lock instead of an
ilock if there's no interrupt involved).
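
a minimal sketch of one way to read "use a lock to make sure no cpu
is about to find or about to call wakeup", written as a userland
analogue with posix threads, not plan 9 kernel code.  everything
named here (Req, tab, tablk, complete, waitreq, run_free_demo) is
made up for illustration, and a pthread condition variable only
approximates a Rendez.  the point is that the waker finds the
structure and signals it while holding the lock, and the owner
unpublishes it under the same lock before freeing it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

typedef struct Req Req;
struct Req {
	pthread_cond_t r;	/* stands in for the Rendez */
	int done;		/* condition the owner waits for */
};

/* hypothetical table of outstanding requests, guarded by tablk */
static pthread_mutex_t tablk = PTHREAD_MUTEX_INITIALIZER;
static Req *tab[8];

/* waker side: find, modify, and signal, all under tablk. */
int
complete(int i)
{
	Req *q;
	int found = 0;

	pthread_mutex_lock(&tablk);
	q = tab[i];
	if(q != NULL){
		q->done = 1;
		pthread_cond_signal(&q->r);	/* still holding tablk */
		found = 1;
	}
	pthread_mutex_unlock(&tablk);
	return found;
}

/* owner side: publish, wait in a loop, unpublish under tablk, then free. */
int
waitreq(int i)
{
	Req *q = malloc(sizeof *q);

	if(q == NULL)
		return -1;
	pthread_cond_init(&q->r, NULL);
	q->done = 0;

	pthread_mutex_lock(&tablk);
	tab[i] = q;
	while(!q->done)
		pthread_cond_wait(&q->r, &tablk);
	tab[i] = NULL;		/* nobody can find q after this */
	pthread_mutex_unlock(&tablk);

	/* the waker released tablk only after its signal call returned */
	pthread_cond_destroy(&q->r);
	free(q);
	return 1;
}

static void*
completer(void *v)
{
	(void)v;
	while(!complete(0))	/* retry until the request is published */
		sched_yield();
	return NULL;
}

int
run_free_demo(void)
{
	pthread_t t;
	int ok;

	if(pthread_create(&t, NULL, completer, NULL) != 0)
		return -1;
	ok = waitreq(0);
	pthread_join(t, NULL);
	return ok;
}
```

if complete dropped tablk before signalling, waitreq could free q
between the unlock and the signal, which is the shape of mistake the
sketch is meant to rule out.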

i was going to point you at devmnt but devmnt looks buggy to me.
mountmux unlocks m before calling wakeup, which on first
glance looks like a mistake mitigated only by the fact that
mntfree only rarely frees r.  or i could be missing something.
it's certainly much easier to use static rendez structures.

as presotto says in the thread richard mentioned,
the problem is not sleep vs wakeup.  the responsibility
necessarily has to lie with the calling code, which is finding,
looking at, and possibly modifying the structure containing r
before it calls wakeup.
http://9fans.net/archive/?q=%27test-and-set+problem%27

>> p.s. not relevant to your "only one sleep and one wakeup"
>> constraint, but that last scenario also means that if you
>> are doing repeated sleep + wakeup on a single r, that pending
>> wakeup call left over on cpu2 might not happen until cpu1 has
>> gone back to sleep (a second time).  that is, the first wakeup
>> can wake the second sleep, intending to wake the first sleep.
>> so in general you have to handle the case where sleep
>> wakes up for no good reason.  it doesn't happen all the time,
>> but it does happen.
>
> this is probably a bug lurking in many things.

it's not that hard to get this part right.
the simple rule is that you should call sleep in a loop
until the condition you are waiting for has happened.
(keep sleeping until your dreams come true?)
devmnt is a good example in this case.
seeing the loop when you read the code is also a
good reminder that sometimes sleep never runs
at all.
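
a minimal sketch of the loop rule, again as a userland analogue
with posix threads rather than kernel sleep and wakeup.  Waiter,
producer, waitdone, and run_demo are hypothetical names;
pthread_cond_wait plays the part of sleep and pthread_cond_signal
the part of wakeup.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct Waiter Waiter;
struct Waiter {
	pthread_mutex_t lk;
	pthread_cond_t r;	/* stands in for the Rendez */
	int done;		/* the condition being waited for */
};

/* make the condition true under the lock, then wake the waiter. */
static void*
producer(void *v)
{
	Waiter *w = v;

	pthread_mutex_lock(&w->lk);
	w->done = 1;
	pthread_cond_signal(&w->r);
	pthread_mutex_unlock(&w->lk);
	return NULL;
}

/* wait until done is set; returns the value of done that was seen. */
int
waitdone(Waiter *w)
{
	int v;

	pthread_mutex_lock(&w->lk);
	while(!w->done)		/* waking up alone proves nothing: recheck */
		pthread_cond_wait(&w->r, &w->lk);
	v = w->done;
	pthread_mutex_unlock(&w->lk);
	return v;
}

int
run_demo(void)
{
	Waiter w;
	pthread_t t;
	int v;

	pthread_mutex_init(&w.lk, NULL);
	pthread_cond_init(&w.r, NULL);
	w.done = 0;
	if(pthread_create(&t, NULL, producer, &w) != 0)
		return -1;
	v = waitdone(&w);
	pthread_join(t, NULL);
	pthread_cond_destroy(&w.r);
	pthread_mutex_destroy(&w.lk);
	return v;
}
```

run_demo returns 1 whichever thread gets there first.  if the
producer wins, the while test fails on entry and pthread_cond_wait
is never called, which is the "sleep never runs at all" case.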

russ