9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed
* [9fans] ilock funny?
@ 2008-05-29  1:34 erik quanstrom
  2008-05-29 13:40 ` Gorka Guardiola
  2008-05-29 13:59 ` Russ Cox
  0 siblings, 2 replies; 6+ messages in thread
From: erik quanstrom @ 2008-05-29  1:34 UTC (permalink / raw)
  To: 9fans

i've got a panic that appears to be a race in ilock.
but perhaps i'm missing something and it's actually
a h/w problem.

in this situation there are 128 kernel procs that all
increment the same counter with some code
that looks like so:

void
incref(void)
{
	ilock(&somelock);
	someval++;
	iunlock(&somelock);
}

(i realize there are probablly better ways to do this.)
there is a similar function to decrease the value.
other than this, there are no references to somelock.

what i'm seeing is a panic with someval = 5. (gathered
from the fact that someval is stored immediately after
the somelock and is dumped with dumplockmem())
and the panic message:

corrupt ilock &somelock pc=&incref m=0 isilock=1

since the value of isilock is exactly 1, it's hard to imagine
this is a bad memory stick or a wild ptr.

i can't see how this could be unless on very first
reference of the lock there is a race with the looser
evaluateing !l->isilock before that processor can see
the winner setting l->isilock to 1.

if this diagnosis is correct, what is the proper fix?
ken does lock and unlock each lock he uses them in the fs
code before using them.  is this this required for kernel
ilocks, too?

- erik



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] ilock funny?
  2008-05-29  1:34 [9fans] ilock funny? erik quanstrom
@ 2008-05-29 13:40 ` Gorka Guardiola
  2008-05-29 14:01   ` erik quanstrom
  2008-05-29 18:47   ` Gorka Guardiola
  2008-05-29 13:59 ` Russ Cox
  1 sibling, 2 replies; 6+ messages in thread
From: Gorka Guardiola @ 2008-05-29 13:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 29, 2008 at 3:34 AM, erik quanstrom <quanstro@quanstro.net> wrote:
> in this situation there are 128 kernel procs that all
> increment the same counter with some code
> that looks like so:
>
> void
> incref(void)
> {
>        ilock(&somelock);
>        someval++;
>        iunlock(&somelock);
> }
>
> (i realize there are probablly better ways to do this.)
> there is a similar function to decrease the value.
> other than this, there are no references to somelock.
>
> what i'm seeing is a panic with someval = 5. (gathered
> from the fact that someval is stored immediately after
> the somelock and is dumped with dumplockmem())
> and the panic message:

What you say simply cannot happen if your code was correct (the one
you are not showing us :-)).
isilock is a variable set by the lock to tainted as ilock instead of lock.
Having isilock=1 onlys happen After the lock has been acquired by someone.
The lock is checked with a tas() which I assume works because everything
is based on it.
My guess is that you have the lock uninitialized (key is not what it should be),
so key has a bogus value and that is where your problems start.
Zeroing the lock before using it should do the trick.

HTH.
--
- curiosity sKilled the cat



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] ilock funny?
  2008-05-29  1:34 [9fans] ilock funny? erik quanstrom
  2008-05-29 13:40 ` Gorka Guardiola
@ 2008-05-29 13:59 ` Russ Cox
  2008-05-29 14:00   ` erik quanstrom
  1 sibling, 1 reply; 6+ messages in thread
From: Russ Cox @ 2008-05-29 13:59 UTC (permalink / raw)
  To: 9fans

> i can't see how this could be unless on very first
> reference of the lock there is a race with the looser
> evaluateing !l->isilock before that processor can see
> the winner setting l->isilock to 1.
>
> if this diagnosis is correct, what is the proper fix?

take the test out.

this is my fault.  (ironically, i put it in while debugging
some of your code, years ago.)

change:

		/*
		 * Cannot also check l->pc and l->m here because
		 * they might just not be set yet, or the lock might
		 * even have been let go.
		 */
		if(!l->isilock){
			dumplockmem("ilock:", l);
			panic("corrupt ilock %p pc=%luX m=%p isilock=%d",
				l, l->pc, l->m, l->isilock);
		}

to

		/*
		 * Cannot also check l->pc, l->m, or l->isilock here
		 * because they might just not be set yet, or
		 * (for pc and m) the lock might have just been unlocked.
		 */

two out of three isn't bad.  ;-)

russ



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] ilock funny?
  2008-05-29 13:59 ` Russ Cox
@ 2008-05-29 14:00   ` erik quanstrom
  0 siblings, 0 replies; 6+ messages in thread
From: erik quanstrom @ 2008-05-29 14:00 UTC (permalink / raw)
  To: 9fans

> take the test out.
>
> this is my fault.  (ironically, i put it in while debugging
> some of your code, years ago.)

thanks, russ.

(it was just my sneaky way of debugging the kernel. ☺)

- erik




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] ilock funny?
  2008-05-29 13:40 ` Gorka Guardiola
@ 2008-05-29 14:01   ` erik quanstrom
  2008-05-29 18:47   ` Gorka Guardiola
  1 sibling, 0 replies; 6+ messages in thread
From: erik quanstrom @ 2008-05-29 14:01 UTC (permalink / raw)
  To: 9fans

the code i typed in out of haste turns out to be exactly the same
as the code that had this problem, modulo names.

> isilock is a variable set by the lock to tainted as ilock instead of lock.
> Having isilock=1 onlys happen After the lock has been acquired by someone.
> The lock is checked with a tas() which I assume works because everything
> is based on it.

isilock is tested in the branch not holding the ilock.  therefore a winning
cpu can be executing the acquire branch and any number of loosing cpus can be
executing the body of the if statement concurrently.  where is the timing
guarentee that if this happens, the winning cpu executes l->isilock = 1
and the cacheline holding l->isilock has been flushed to the loosing cpu
before the !l->isilock test has been run?

i don't think that one needs to invoke any of the spectacularly odd things
that happen on modern pcs to explain this.  (for example, seperate cores
running at different frequencies.  or, my personal favorite, smm interrupting
things for a couple hundred ms.)

> My guess is that you have the lock uninitialized (key is not what it should be),
> so key has a bogus value and that is where your problems start.
> Zeroing the lock before using it should do the trick.

the lock is in the bss.  it has been zeroed by the linker.  in any event,
i don't think an uninitialized lock explains the behavior.  all we
need is good old-fashioned concurrency.

- erik




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] ilock funny?
  2008-05-29 13:40 ` Gorka Guardiola
  2008-05-29 14:01   ` erik quanstrom
@ 2008-05-29 18:47   ` Gorka Guardiola
  1 sibling, 0 replies; 6+ messages in thread
From: Gorka Guardiola @ 2008-05-29 18:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, May 29, 2008 at 3:40 PM, Gorka Guardiola <paurea@gmail.com> wrote:
> My guess is that you have the lock uninitialized (key is not what it should be),
> so key has a bogus value and that is where your problems start.
> Zeroing the lock before using it should do the trick.
>

Off course not, wrong again. Damn, I need more caffeine...
--
- curiosity sKilled the cat



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2008-05-29 18:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-29  1:34 [9fans] ilock funny? erik quanstrom
2008-05-29 13:40 ` Gorka Guardiola
2008-05-29 14:01   ` erik quanstrom
2008-05-29 18:47   ` Gorka Guardiola
2008-05-29 13:59 ` Russ Cox
2008-05-29 14:00   ` erik quanstrom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).