9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] That deadlock, again
@ 2010-11-18  5:50 Lucio De Re
  2010-11-18  5:53 ` erik quanstrom
  2010-11-18  5:57 ` Lucio De Re
  0 siblings, 2 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18  5:50 UTC (permalink / raw)
  To: 9fans

> one could move:
>
> 	up->qpc = getcallerpc(&q);
>
> in qlock() to before the lock(&q->use), so we can see from where that
> qlock gets called that hangs the exportfs call, or add another magic
> debug pointer (qpctry) to the proc structure and print it in dumpaproc().


Cinap, I tried your debugging code and got an odd panic at boot time.
Consistently:

	panic: kernel fault: no user process pc=0xf01e739e addr=0x000009e8

Having a look with acid, this seems to be caused by an attempt at
setting the debug PC (your "up->qpctry") at a time when "up" has no
value yet.

Strangely, later in the qlock() code "up" is checked and a panic
issued if zero.  I'm missing something here: it is possible to execute
this code

/sys/src/9/port/qlock.c:29,37 (more or less)
	lock(&q->use);
	rwstats.qlock++;
	if(!q->locked) {
		q->locked = 1;
		unlock(&q->use);
		return;
	}

which is immediately followed by

	if(up == 0)
		panic("qlock");

If "up" is nil, but it looks like a bit of a one-way operation.

Anyway, I have moved the assignment to "qpctry" to after "up" is
tested.  Let's see what happens.  I'll have to get back to you once
the system is back up.
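
For the record, a minimal sketch of one safe placement, assuming
cinap's qpctry field in the Proc structure (a sketch, not the
distribution source; the queueing tail follows the stock
/sys/src/9/port/qlock.c).  A nil guard keeps cinap's record-before-spin
intent, whereas moving the assignment below the up test records nothing
for exactly the spin we want to catch:

	void
	qlock(QLock *q)
	{
		Proc *p;

		if(up != nil)			/* early boot: no process yet */
			up->qpctry = getcallerpc(&q);
		lock(&q->use);			/* the taslock we may spin on */
		rwstats.qlock++;
		if(!q->locked) {
			q->locked = 1;
			unlock(&q->use);
			return;
		}
		if(up == 0)
			panic("qlock");
		/* queue up and sleep, as in the stock qlock() */
		p = q->tail;
		if(p == 0)
			q->head = up;
		else
			p->qnext = up;
		q->tail = up;
		up->qnext = 0;
		up->state = Queueing;
		up->qpc = getcallerpc(&q);
		unlock(&q->use);
		sched();
	}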

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18  5:50 [9fans] That deadlock, again Lucio De Re
@ 2010-11-18  5:53 ` erik quanstrom
  2010-11-18  8:11   ` Lucio De Re
  2010-11-18  5:57 ` Lucio De Re
  1 sibling, 1 reply; 39+ messages in thread
From: erik quanstrom @ 2010-11-18  5:53 UTC (permalink / raw)
  To: lucio, 9fans

> Strangely, later in the qlock() code "up" is checked and a panic
> issued if zero.  I'm missing something here: it is possible to execute
> this code

you must be in process context to qlock, because only
processes can sleep.

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18  5:50 [9fans] That deadlock, again Lucio De Re
  2010-11-18  5:53 ` erik quanstrom
@ 2010-11-18  5:57 ` Lucio De Re
  1 sibling, 0 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18  5:57 UTC (permalink / raw)
  To: lucio, 9fans

> Anyway, I have moved the assignment to "qpctry" to after "up" is
> tested.  Let's see what happens.  I'll have to get back to you once
> the system is back up.

The system is working now.  I have to wait for a problem to arise, next.

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18  5:53 ` erik quanstrom
@ 2010-11-18  8:11   ` Lucio De Re
  2010-11-18  8:35     ` cinap_lenrek
  2010-11-18  9:20     ` cinap_lenrek
  0 siblings, 2 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18  8:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Nov 18, 2010 at 12:53:52AM -0500, erik quanstrom wrote:
>
> you must be in process context to qlock, because only
> processes can sleep.
>
There's obviously at least one exception, because otherwise I would not
have got a panic at startup.  Or, for that matter, there would not be
active code ahead of the

/sys/src/9/port/qlock.c:35,36
	if(up == 0)
		panic("qlock");

in qlock().  Or maybe that's where things are going wrong, but I doubt
that the code is mistaken; I know my understanding is inadequate :-)

++L



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18  8:11   ` Lucio De Re
@ 2010-11-18  8:35     ` cinap_lenrek
  2010-11-18  9:20     ` cinap_lenrek
  1 sibling, 0 replies; 39+ messages in thread
From: cinap_lenrek @ 2010-11-18  8:35 UTC (permalink / raw)
  To: lucio, 9fans

was 0xf01e739e really the code that accesses up->qpctry?

--
cinap

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18  8:11   ` Lucio De Re
  2010-11-18  8:35     ` cinap_lenrek
@ 2010-11-18  9:20     ` cinap_lenrek
  2010-11-18 10:48       ` Lucio De Re
  1 sibling, 1 reply; 39+ messages in thread
From: cinap_lenrek @ 2010-11-18  9:20 UTC (permalink / raw)
  To: 9fans

hm...  thinking about it...  does the kernel assume (maybe in early
initialization) that calling qlock() without a proc is ok as long as
it can make sure it will not be held by another proc?

--
cinap

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18  9:20     ` cinap_lenrek
@ 2010-11-18 10:48       ` Lucio De Re
  2010-11-18 15:10         ` erik quanstrom
  0 siblings, 1 reply; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 10:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Nov 18, 2010 at 10:20:33AM +0100, cinap_lenrek@gmx.de wrote:
>
> hm...  thinking about it...  does the kernel assume (maybe in early
> initialization) that calling qlock() without a proc is ok as long as
> it can make sure it will not be held by another proc?
>
That's a question for Bell Labs, I suppose, but that's precisely what I
believe.  There is no other explanation for the panic.  Moving the
up == 0 test earlier will invalidate this assumption and cause the panic
we have already seen.

The issue here is whether there is a situation where qlock() is
intentionally invoked with up == 0 (suggested by the positioning of the
up == 0 test _after_ setting the "locked" condition).  This is improbable,
though, and needs sorting out: whereas the lock can be set with
up == 0 - and cleared as well - the acquisition cannot be made to
_wait_, because then the absence of up will trigger a panic.
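
To spell out the asymmetry, the same excerpt again with annotations
(the comments are mine):

	lock(&q->use);
	rwstats.qlock++;
	if(!q->locked) {
		q->locked = 1;		/* free: acquired fine, even with up == 0 */
		unlock(&q->use);
		return;
	}
	if(up == 0)			/* taken: we would have to queue and sleep, */
		panic("qlock");		/* which is impossible without a process */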

Now, we know that qlock() is called with up == 0; we have seen a panic
generated by such a call.  Will it suffice to locate the invocation
and somehow deal with it, or should we make qlock() more robust and have
it reject a request from a context where up == 0?  Certainly, if qlock()
no longer allows invocations with up == 0, there will be simplifications
in its implementation.  For example, the line

	if(up != nil && up->nlocks.ref)
		print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);

will no longer need the up != nil test.

But I'm convinced there's more here than meets the eye.  Unfortunately,
while I have a Plan 9 distribution at my fingertips, I'm not going to
try to fix this problem in a 9vx environment; I'll wait until I get home
to deal with the native stuff.  But one can speculate...

++L



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 10:48       ` Lucio De Re
@ 2010-11-18 15:10         ` erik quanstrom
  2010-11-18 16:46           ` erik quanstrom
  2010-11-18 18:03           ` Lucio De Re
  0 siblings, 2 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-18 15:10 UTC (permalink / raw)
  To: lucio, 9fans

> 	if(up != nil && up->nlocks.ref)
> 		print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);
>
> will no longer need the up != nil test.

that's just wrong.  if the kernel is qlocking without
an up, there's a bug.  stack dumps are your friend.

but i have a feeling that there is a mistake in your
modification to qlock.  you didn't have this panic
before you modified qlock.

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 15:10         ` erik quanstrom
@ 2010-11-18 16:46           ` erik quanstrom
  2010-11-18 18:01             ` Lucio De Re
  2010-11-18 18:03           ` Lucio De Re
  1 sibling, 1 reply; 39+ messages in thread
From: erik quanstrom @ 2010-11-18 16:46 UTC (permalink / raw)
  To: 9fans

On Thu Nov 18 10:23:20 EST 2010, quanstro@quanstro.net wrote:
> > 	if(up != nil && up->nlocks.ref)
> > 		print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);
> >
> > will no longer need the up != nil test.
>
> that's just wrong.  if the kernel is qlocking without
> an up, there's a bug.  stack dumps are your friend.
>
> but i have a feeling that there is a mistake in your
> modification to qlock.  you didn't have this panic
> before you modified qlock.

and i'm just wrong.  intentionally or not, devsd does
qlock things with no up from sdreset().  ether82598
does too (my fault).

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 16:46           ` erik quanstrom
@ 2010-11-18 18:01             ` Lucio De Re
  2010-11-18 18:29               ` C H Forsyth
  2010-11-18 18:43               ` erik quanstrom
  0 siblings, 2 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 18:01 UTC (permalink / raw)
  To: 9fans

> and i'm just wrong.  intentionally or not, devsd does
> qlock things with no up from sdreset().  ether82598
> does too (my fault).

I suggest you fix ether82598: it is OK to call qlock() and qunlock()
without "up", but only if sure that the qlock() will succeed.  If it
has to wait, it will panic.  Given that, why do the locking at all?
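
A hedged sketch of such a fix; Ctlr, alock and ctlrinit() below are
illustrative names, not the actual ether82598 ones:

	typedef struct Ctlr Ctlr;
	struct Ctlr {
		QLock	alock;	/* illustrative: guards controller setup */
	};

	static void
	ctlrinit(Ctlr *ctlr)
	{
		/* at sdreset()/etherreset() time there is no up, and no
		 * other proc can hold alock, so waiting is impossible;
		 * skip the qlock instead of risking the guaranteed panic */
		if(up != nil)
			qlock(&ctlr->alock);
		/* the shared configuration body, used both at reset time
		 * and at runtime, would run here */
		if(up != nil)
			qunlock(&ctlr->alock);
	}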

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 15:10         ` erik quanstrom
  2010-11-18 16:46           ` erik quanstrom
@ 2010-11-18 18:03           ` Lucio De Re
  1 sibling, 0 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 18:03 UTC (permalink / raw)
  To: 9fans

> but i have a feeling that there is a mistake in your
> modification to qlock.  you didn't have this panic
> before you modified qlock.

qlock() is broken, or at the very least ambivalent.  Someone ought to
put it out of its misery: is it legal, or is it not, to call qlock() in
an up == 0 context?

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 18:29               ` C H Forsyth
@ 2010-11-18 18:23                 ` Lucio De Re
  2010-11-18 18:33                 ` Lucio De Re
  1 sibling, 0 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 18:23 UTC (permalink / raw)
  To: 9fans

> it's to allow the use during reset of a given driver's
> standard functions that normally must qlock, to avoid requiring two copies
> of them, with and without the qlock.
>
> after reset, it's illegal to call qlock without a process (notably
> in an interrupt function), as it previously was.

I'm willing to credit the validity of this, but I believe then that it
ought to be more explicit.  It seems to me that having a situation
where a panic can ensue if a lock is already taken is too risky.  Is
it possible to count the instances of such qlock() invocations in the
present kernel code and find out how common the problem really is?

Or should one simply treat such invocations as innocuous and, if the
lock is taken, just skip queueing a user process when none is
specified?  That sounds positively explosive!

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 18:01             ` Lucio De Re
@ 2010-11-18 18:29               ` C H Forsyth
  2010-11-18 18:23                 ` Lucio De Re
  2010-11-18 18:33                 ` Lucio De Re
  2010-11-18 18:43               ` erik quanstrom
  1 sibling, 2 replies; 39+ messages in thread
From: C H Forsyth @ 2010-11-18 18:29 UTC (permalink / raw)
  To: lucio, 9fans

>If it has to wait, it will panic.  Given that, why do the locking at all?

i assume the intention is along these lines:

it's to allow the use during reset of a given driver's
standard functions that normally must qlock, to avoid requiring two copies
of them, with and without the qlock.

after reset, it's illegal to call qlock without a process (notably
in an interrupt function), as it previously was.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 18:29               ` C H Forsyth
  2010-11-18 18:23                 ` Lucio De Re
@ 2010-11-18 18:33                 ` Lucio De Re
  1 sibling, 0 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 18:33 UTC (permalink / raw)
  To: 9fans

> after reset, it's illegal to call qlock without a process (notably
> in an interrupt function), as it previously was.

That suggests that the (hopefully) few instances of qlock()
invocations that may occur in this space should be burdened with the
need to check for the value of "up" and altogether skip the call if
it's nil.

Mind you, this is not the problem we set out to fix anymore, although
no doubt there is a relationship, however tenuous.

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 18:01             ` Lucio De Re
  2010-11-18 18:29               ` C H Forsyth
@ 2010-11-18 18:43               ` erik quanstrom
  2010-11-18 18:54                 ` erik quanstrom
  2010-11-18 19:01                 ` Lucio De Re
  1 sibling, 2 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-18 18:43 UTC (permalink / raw)
  To: lucio, 9fans

> I suggest you fix ether82598: it is OK to call qlock() and qunlock()
> without "up", but only if sure that the qlock() will succeed.  If it
> has to wait, it will panic.

yes.  that's it.

> >If it has to wait, it will panic.  Given that, why do the locking at all?
>
> i assume the intention is along these lines:
>
> it's to allow the use during reset of a given driver's
> standard functions that normally must qlock, to avoid requiring two copies
> of them, with and without the qlock.
>
> after reset, it's illegal to call qlock without a process (notably
> in an interrupt function), as it previously was.

i think lucio has stated the current restriction more exactly,
but this is closer to the assumed intent.  perhaps we should make
this what qlock actually does with

diff -c /n/dump/2010/1118/sys/src/9/pc/main.c pc/main.c
/n/dump/2010/1118/sys/src/9/pc/main.c:201,206 - pc/main.c:201,207
  		poperror();
  	}
  	kproc("alarm", alarmkproc, 0);
+ 	conf.postdawn = 1;
  	touser(sp);
  }

diff -c /n/dump/2010/1118/sys/src/9/pc/dat.h pc/dat.h
/n/dump/2010/1118/sys/src/9/pc/dat.h:107,112 - pc/dat.h:107,113
  	ulong	ialloc;		/* max interrupt time allocation in bytes */
  	ulong	pipeqsize;	/* size in bytes of pipe queues */
  	int	nuart;		/* number of uart devices */
+ 	int	postdawn;	/* multiprogramming on */
  };

  /*
diff -c /n/dump/2010/1118/sys/src/9/port/qlock.c port/qlock.c
/n/dump/2010/1118/sys/src/9/port/qlock.c:18,23 - port/qlock.c:18,25
  {
  	Proc *p;

+ 	if(up == nil && conf.postdawn)
+ 		panic("qlock: %#p: postdawn up nil\n", getcallerpc(&q));
  	if(m->ilockdepth != 0)
  		print("qlock: %#p: ilockdepth %d\n", getcallerpc(&q), m->ilockdepth);
  	if(up != nil && up->nlocks.ref)

a test kernel i've got does boot with this test.

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 18:43               ` erik quanstrom
@ 2010-11-18 18:54                 ` erik quanstrom
  2010-11-18 19:01                 ` Lucio De Re
  1 sibling, 0 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-18 18:54 UTC (permalink / raw)
  To: 9fans

on second thought, conf.postdawn should be set in schedinit().
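
A minimal sketch of that placement, assuming the shape of schedinit()
in /sys/src/9/port/proc.c (the exact spot inside the function is a
guess):

	void
	schedinit(void)		/* never returns */
	{
		conf.postdawn = 1;	/* scheduling starts here: from now
					 * on, qlock() without a process
					 * context is a bug */
		setlabel(&m->sched);
		/* the rest of schedinit() is unchanged: retire the
		 * calling proc, if any, then enter the scheduler */
		sched();
	}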

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 18:43               ` erik quanstrom
  2010-11-18 18:54                 ` erik quanstrom
@ 2010-11-18 19:01                 ` Lucio De Re
  2010-11-18 19:27                   ` Lucio De Re
  1 sibling, 1 reply; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 19:01 UTC (permalink / raw)
  To: 9fans

> /n/dump/2010/1118/sys/src/9/port/qlock.c:18,23 - port/qlock.c:18,25
>   {
>   	Proc *p;
>
> + 	if(up == nil && conf.postdawn)
> + 		panic("qlock: %#p: postdawn up nil\n", getcallerpc(&q));
>   	if(m->ilockdepth != 0)
>   		print("qlock: %#p: ilockdepth %d\n", getcallerpc(&q), m->ilockdepth);
>   	if(up != nil && up->nlocks.ref)

Yes, this is the type of explicitness I was thinking of.  Note that
you can now drop further tests for up == 0 later in the qlock() code.

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-18 19:01                 ` Lucio De Re
@ 2010-11-18 19:27                   ` Lucio De Re
  0 siblings, 0 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-18 19:27 UTC (permalink / raw)
  To: lucio, 9fans

> Yes, this is the type of explicitness I was thinking of.  Note that
> you can now drop further tests for up == 0 later in the qlock() code.

Hm, spoke too quickly.  The tests on "up" have to remain, sadly.
Sorry about the misleading noise.

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  4:08         ` Lucio De Re
  2010-11-17  4:18           ` erik quanstrom
@ 2010-11-17 14:40           ` Russ Cox
  1 sibling, 0 replies; 39+ messages in thread
From: Russ Cox @ 2010-11-17 14:40 UTC (permalink / raw)
  To: lucio, Fans of the OS Plan 9 from Bell Labs

lock loops are about Locks (spin locks), not QLocks.
the relative ordering of any two calls to qlock and qunlock
is irrelevant.

russ


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  7:03                 ` Lucio De Re
@ 2010-11-17  7:09                   ` erik quanstrom
  0 siblings, 0 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-17  7:09 UTC (permalink / raw)
  To: lucio, 9fans

> #I0tcpack pc f01ff12a dbgpc ...

and what's at that pc?

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  6:48               ` Lucio De Re
@ 2010-11-17  7:03                 ` erik quanstrom
  0 siblings, 0 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-17  7:03 UTC (permalink / raw)
  To: lucio, 9fans

> On Wed, Nov 17, 2010 at 06:33:13AM +0100, cinap_lenrek@gmx.de wrote:
> > sorry for not being clear.  what i ment was that qpc is for the last
> > qlock we succeeded to acquire.  its *not* the one we are spinning on.
> > also, qpc is not set to nil on unlock.
> >
> Ok, so we set qpctry (qpcdbg?) to qpc before changing qpc?  Irrespective
> of whether qpc is set or nil?  And should qunlock() clear qpc for safety,
> or would this just make debugging more difficult?

no.  you're very likely to nil out the new qpc if you do that.
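
A sketch of the interleaving erik means, with illustrative qlocks a
and b taken by one proc:

	qlock(&a);	/* up->qpc = caller of this qlock */
	qlock(&b);	/* up->qpc overwritten: now records the holder of b */
	qunlock(&a);	/* if qunlock() cleared qpc here, the record for b,
			 * which is still held, would be wiped from the dump */
	qunlock(&b);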

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  6:45               ` Lucio De Re
@ 2010-11-17  7:03                 ` Lucio De Re
  2010-11-17  7:09                   ` erik quanstrom
  0 siblings, 1 reply; 39+ messages in thread
From: Lucio De Re @ 2010-11-17  7:03 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Nov 17, 2010 at 08:45:00AM +0200, Lucio De Re wrote:
> ... and from whatever the other proc is that also contributes to this
> jam. I don't have the name right in front of me, but I will post it
> separately.  As far as I know it's always those two that interfere with
> exportfs and usually together, only a short time apart.
>
#I0tcpack pc f01ff12a dbgpc ...

That's the other common factor.

++L



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  5:33             ` cinap_lenrek
@ 2010-11-17  6:48               ` Lucio De Re
  2010-11-17  7:03                 ` erik quanstrom
  0 siblings, 1 reply; 39+ messages in thread
From: Lucio De Re @ 2010-11-17  6:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Nov 17, 2010 at 06:33:13AM +0100, cinap_lenrek@gmx.de wrote:
> sorry for not being clear.  what i meant was that qpc is for the last
> qlock we succeeded in acquiring.  it's *not* the one we are spinning on.
> also, qpc is not set to nil on unlock.
>
Ok, so we set qpctry (qpcdbg?) to qpc before changing qpc?  Irrespective
of whether qpc is set or nil?  And should qunlock() clear qpc for safety,
or would this just make debugging more difficult?

++L



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  5:22             ` cinap_lenrek
@ 2010-11-17  6:45               ` Lucio De Re
  2010-11-17  7:03                 ` Lucio De Re
  0 siblings, 1 reply; 39+ messages in thread
From: Lucio De Re @ 2010-11-17  6:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Nov 17, 2010 at 06:22:33AM +0100, cinap_lenrek@gmx.de wrote:
>
> qpc is just the caller of the last successfully *acquired* qlock.
> what we know is that the exportfs proc spins in the q->use taslock
> called by qlock(), right?  this already seems weird...  q->use is held
> just long enough to test q->locked and manipulate the queue.  also,
> sched() will avoid switching to another proc while we are holding tas
> locks.
>
I think I'm with you, probably not quite to the same depth of understanding.

> i would like to know which qlock the kernel is trying to acquire
> on behalf of exportfs that is also reachable from the etherread4
> code.
>
... and from whatever the other proc is that also contributes to this
jam. I don't have the name right in front of me, but I will post it
separately.  As far as I know it's always those two that interfere with
exportfs and usually together, only a short time apart.

> one could move:
>
> 	up->qpc = getcallerpc(&q);
>
> in qlock() to before the lock(&q->use), so we can see from where that
> qlock gets called that hangs the exportfs call, or add another magic
> debug pointer (qpctry) to the proc structure and print it in dumpaproc().
>
I think I'll do the latter; even though it's more complex, it can be a
useful debugging tool in future.  I wouldn't leave it in the kernel code,
but it would be worth being able to refer to it when the occasion arises.

How do you expect this qpctry to be initialised/set?

++L



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  4:18           ` erik quanstrom
  2010-11-17  4:37             ` Lucio De Re
  2010-11-17  5:22             ` cinap_lenrek
@ 2010-11-17  5:33             ` cinap_lenrek
  2010-11-17  6:48               ` Lucio De Re
  2 siblings, 1 reply; 39+ messages in thread
From: cinap_lenrek @ 2010-11-17  5:33 UTC (permalink / raw)
  To: 9fans

sorry for not being clear.  what i meant was that qpc is for the last
qlock we succeeded in acquiring.  it's *not* the one we are spinning on.
also, qpc is not set to nil on unlock.

--
cinap

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  4:18           ` erik quanstrom
  2010-11-17  4:37             ` Lucio De Re
@ 2010-11-17  5:22             ` cinap_lenrek
  2010-11-17  6:45               ` Lucio De Re
  2010-11-17  5:33             ` cinap_lenrek
  2 siblings, 1 reply; 39+ messages in thread
From: cinap_lenrek @ 2010-11-17  5:22 UTC (permalink / raw)
  To: 9fans

qpc is just the caller of the last successfully *acquired* qlock.
what we know is that the exportfs proc spins in the q->use taslock
called by qlock(), right?  this already seems weird...  q->use is held
just long enough to test q->locked and manipulate the queue.  also,
sched() will avoid switching to another proc while we are holding tas
locks.

i would like to know which qlock the kernel is trying to acquire
on behalf of exportfs that is also reachable from the etherread4
code.

one could move:

	up->qpc = getcallerpc(&q);

in qlock() to before the lock(&q->use), so we can see from where that
qlock gets called that hangs the exportfs call, or add another magic
debug pointer (qpctry) to the proc structure and print it in dumpaproc().

--
cinap

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  4:37             ` Lucio De Re
@ 2010-11-17  4:43               ` erik quanstrom
  0 siblings, 0 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-17  4:43 UTC (permalink / raw)
  To: lucio, 9fans

> Hm, I thought I understood waserror(), but now I'm sure I don't.  What
> condition is waserror() attempting to handle here?

waserror() sets up an entry in the error stack.
if there is a call to error() before poperror(),
then that entry is popped and waserror() returns
1.  it's just like setjmp, or sched().
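
For reference, the usual shape of the idiom, as a generic sketch of
the kernel convention rather than the tcp.c source:

	qlock(s);
	if(waserror()){		/* returns 0 on the way in; returns again,
				 * yielding 1, if error() fires below */
		qunlock(s);
		nexterror();	/* re-raise to the next handler on the stack */
	}
	/* code that may call error() runs here */
	qunlock(s);
	poperror();		/* discard the handler once past the risk */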

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  4:18           ` erik quanstrom
@ 2010-11-17  4:37             ` Lucio De Re
  2010-11-17  4:43               ` erik quanstrom
  2010-11-17  5:22             ` cinap_lenrek
  2010-11-17  5:33             ` cinap_lenrek
  2 siblings, 1 reply; 39+ messages in thread
From: Lucio De Re @ 2010-11-17  4:37 UTC (permalink / raw)
  To: 9fans

>> Now, the qunlock(s) should not precede the qlock(s); this is the first
>> case in this procedure:
>
> it doesn't.  waserror() can't be executed before the code
> following it.  perhaps it could be more carefully written
> as
>
>> >  2095		qlock(s);
>> >  2091		if(waserror()){
>> >  2092			qunlock(s);
>> >  2093			nexterror();
>> >  2094		}
>> >>2096		qunlock(tcp);

Hm, I thought I understood waserror(), but now I'm sure I don't.  What
condition is waserror() attempting to handle here?

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-17  4:08         ` Lucio De Re
@ 2010-11-17  4:18           ` erik quanstrom
  2010-11-17  4:37             ` Lucio De Re
                               ` (2 more replies)
  2010-11-17 14:40           ` Russ Cox
  1 sibling, 3 replies; 39+ messages in thread
From: erik quanstrom @ 2010-11-17  4:18 UTC (permalink / raw)
  To: lucio, 9fans

> > acid: src(0xf0148c8a)
> > /sys/src/9/ip/tcp.c:2096
> >  2091		if(waserror()){
> >  2092			qunlock(s);
> >  2093			nexterror();
> >  2094		}
> >  2095		qlock(s);
> >>2096		qunlock(tcp);
> >  2097
> >  2098		/* fix up window */
> >  2099		seg.wnd <<= tcb->rcv.scale;
> >  2100
> >  2101		/* every input packet in puts off the keep alive time out */
>
> The source actually says (to be pedantic):
>
> 	/* The rest of the input state machine is run with the control block
> 	 * locked and implements the state machine directly out of the RFC.
> 	 * Out-of-band data is ignored - it was always a bad idea.
> 	 */
> 	tcb = (Tcpctl*)s->ptcl;
> 	if(waserror()){
> 		qunlock(s);
> 		nexterror();
> 	}
> 	qlock(s);
> 	qunlock(tcp);
>
> Now, the qunlock(s) should not precede the qlock(s); this is the first
> case in this procedure:

it doesn't.  waserror() can't be executed before the code
following it.  perhaps it could be more carefully written
as

> >  2095		qlock(s);
> >  2091		if(waserror()){
> >  2092			qunlock(s);
> >  2093			nexterror();
> >  2094		}
> >>2096		qunlock(tcp);

but it really wouldn't make any difference.

i'm not completely convinced that tcp's to blame.
and if it is, i think the problem is probably tcp
timers.

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16 18:03       ` lucio
@ 2010-11-17  4:08         ` Lucio De Re
  2010-11-17  4:18           ` erik quanstrom
  2010-11-17 14:40           ` Russ Cox
  0 siblings, 2 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-17  4:08 UTC (permalink / raw)
  To: 9fans

> Well, here is an acid dump; I'll inspect it in detail, but I'm hoping
> someone will beat me to it (not hard at all, I have to confess):
>
> rumble# acid /sys/src/9/pc/9pccpuf
> /sys/src/9/pc/9pccpuf:386 plan 9 boot image
> /sys/lib/acid/port
> /sys/lib/acid/386
>
[ ... ]

This bit looks suspicious to me, but I'm really not an authority and I
may easily be missing something:

> acid: src(0xf0148c8a)
> /sys/src/9/ip/tcp.c:2096
>  2091		if(waserror()){
>  2092			qunlock(s);
>  2093			nexterror();
>  2094		}
>  2095		qlock(s);
>>2096		qunlock(tcp);
>  2097
>  2098		/* fix up window */
>  2099		seg.wnd <<= tcb->rcv.scale;
>  2100
>  2101		/* every input packet in puts off the keep alive time out */

The source actually says (to be pedantic):

	/* The rest of the input state machine is run with the control block
	 * locked and implements the state machine directly out of the RFC.
	 * Out-of-band data is ignored - it was always a bad idea.
	 */
	tcb = (Tcpctl*)s->ptcl;
	if(waserror()){
		qunlock(s);
		nexterror();
	}
	qlock(s);
	qunlock(tcp);

Now, the qunlock(s) should not precede the qlock(s); this is the first
case in this procedure:

/sys/src/9/ip/tcp.c:1941,2424
void
tcpiput(Proto *tcp, Ipifc*, Block *bp)
{
...
}

and unlocking an unlocked item should not be permitted, right?  If s
(the conversation) is returned locked by either of

/sys/src/9/ip/tcp.c:2048
	s = iphtlook(&tpriv->ht, source, seg.source, dest, seg.dest);

or

/sys/src/9/ip/tcp.c:2081
		s = tcpincoming(s, &seg, source, dest, version);

then the qlock(s) is an issue, but I'm sure that is not the case here.

Finally, locking an item ahead of unlocking another is often cause for
a deadlock, although I understand the possible necessity.

	qlock(s);
	qunlock(tcp);

Even though acid points to a different location, I'd be curious to
have the details above expanded on by somebody familiar with the code.

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16 13:53     ` erik quanstrom
@ 2010-11-16 18:03       ` lucio
  2010-11-17  4:08         ` Lucio De Re
  0 siblings, 1 reply; 39+ messages in thread
From: lucio @ 2010-11-16 18:03 UTC (permalink / raw)
  To: 9fans, lucio

> cinap is right, the bug is in the kernel.  we know
> that because it's a lock loop.  that can only happen
> if the kernel screws up.  also, the address is a kernel
> address (starts with 0xf).

Well, here is an acid dump; I'll inspect it in detail, but I'm hoping
someone will beat me to it (not hard at all, I have to confess):

rumble# acid /sys/src/9/pc/9pccpuf
/sys/src/9/pc/9pccpuf:386 plan 9 boot image
/sys/lib/acid/port
/sys/lib/acid/386

# lock 0xf0057d74 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 2052

acid: src(0xf01e736a)
/sys/src/9/port/qlock.c:29
 24			print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);
 25
 26		if(q->use.key == 0x55555555)
 27			panic("qlock: q %#p, key 5*\n", q);
 28		lock(&q->use);
>29		rwstats.qlock++;
 30		if(!q->locked) {
 31			q->locked = 1;
 32			unlock(&q->use);
 33			return;
 34		}

# 61:etherread4 pc f01ef8a0 dbgpc        0   Running (Running) ut 2923 st 0 bss 0 qpc f0148c8a nl 0 nd 0 lpc f0100f6e pri 13

acid: src(0xf01ef8a0)
/sys/src/9/port/taslock.c:96
 91		lockstats.glare++;
 92		for(;;){
 93			lockstats.inglare++;
 94			i = 0;
 95			while(l->key){
>96				if(conf.nmach < 2 && up && up->edf && (up->edf->flags & Admitted)){
 97					/*
 98					 * Priority inversion, yield on a uniprocessor; on a
 99					 * multiprocessor, the other processor will unlock
 100					 */
 101					print("inversion %#p pc %#lux proc %lud held by pc %#lux proc %lud\n",
acid: src(0xf0148c8a)
/sys/src/9/ip/tcp.c:2096
 2091		if(waserror()){
 2092			qunlock(s);
 2093			nexterror();
 2094		}
 2095		qlock(s);
>2096		qunlock(tcp);
 2097
 2098		/* fix up window */
 2099		seg.wnd <<= tcb->rcv.scale;
 2100
 2101		/* every input packet in puts off the keep alive time out */
acid: src(0xf0100f6e)
/sys/src/9/pc/cga.c:112
 107				return;
 108		}
 109		else
 110			lock(&cgascreenlock);
 111
>112		while(n-- > 0)
 113			cgascreenputc(*s++);
 114
 115		unlock(&cgascreenlock);
 116	}
 117

# 2052:  exportfs pc f01e7377 dbgpc     94ad    Pwrite (Ready) ut 55 st 270 bss 40000 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

acid: src(0xf01e7377)
/sys/src/9/port/qlock.c:30
 25
 26		if(q->use.key == 0x55555555)
 27			panic("qlock: q %#p, key 5*\n", q);
 28		lock(&q->use);
 29		rwstats.qlock++;
>30		if(!q->locked) {
 31			q->locked = 1;
 32			unlock(&q->use);
 33			return;
 34		}
 35		if(up == 0)
acid: src(0xf0145b62)
/sys/src/9/ip/tcp.c:704
 699	tcpgo(Tcppriv *priv, Tcptimer *t)
 700	{
 701		if(t == nil || t->start == 0)
 702			return;
 703
>704		qlock(&priv->tl);
 705		t->count = t->start;
 706		timerstate(priv, t, TcptimerON);
 707		qunlock(&priv->tl);
 708	}
 709
acid: src(0xf01e2c60)
/sys/src/9/port/proc.c:345
 340	queueproc(Schedq *rq, Proc *p)
 341	{
 342		int pri;
 343
 344		pri = rq - runq;
>345		lock(runq);
 346		p->priority = pri;
 347		p->rnext = 0;
 348		if(rq->tail)
 349			rq->tail->rnext = p;
 350		else
acid:




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  5:03   ` lucio
  2010-11-16  5:11     ` cinap_lenrek
@ 2010-11-16 13:53     ` erik quanstrom
  2010-11-16 18:03       ` lucio
  1 sibling, 1 reply; 39+ messages in thread
From: erik quanstrom @ 2010-11-16 13:53 UTC (permalink / raw)
  To: 9fans

> I tried acid, but I'm just not familiar enough with it to make it
> work.  I tried
>
> 	rumble% acid 2052 /bin/exportfs
> 	/bin/exportfs:386 plan 9 executable
> 	/sys/lib/acid/port
> 	/sys/lib/acid/386
> 	acid: src(0xf01e7377)
> 	no source for ?file?

cinap is right, the bug is in the kernel.  we know
that because it's a lock loop.  that can only happen
if the kernel screws up.  also, the address is a kernel
address (starts with 0xf).

if you didn't know the lock loop message was from
the kernel, you could grep for it

; cd /sys/src
; g '"lock (%#?[luxp]+ )loop' .
./9/port/taslock.c:61: 	print("lock %#p loop key %#lux pc %#lux held by pc %#lux proc %lud\n",

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  5:28         ` cinap_lenrek
@ 2010-11-16  6:47           ` Lucio De Re
  0 siblings, 0 replies; 39+ messages in thread
From: Lucio De Re @ 2010-11-16  6:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Nov 16, 2010 at 06:28:07AM +0100, cinap_lenrek@gmx.de wrote:
>
> if your kernel image is uncompressed and is unstriped, you can
> just load it with acid:
>
> acid /386/9pc
>
> if you build it yourself, then there should be such a kernel in /sys/src/9/pc
>
OK, will try this evening; I specifically left the machines on to do so,
as I'm very keen to help put this issue to rest.  And to get to know the
kernel more intimately.  Pity Nemo isn't updating his commentary.

++L



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  5:18       ` lucio
@ 2010-11-16  5:28         ` cinap_lenrek
  2010-11-16  6:47           ` Lucio De Re
  0 siblings, 1 reply; 39+ messages in thread
From: cinap_lenrek @ 2010-11-16  5:28 UTC (permalink / raw)
  To: 9fans

you almost had it with your acid approach :)

if your kernel image is uncompressed and unstripped, you can
just load it with acid:

acid /386/9pc

if you build it yourself, then there should be such a kernel in /sys/src/9/pc

--
cinap

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  5:11     ` cinap_lenrek
@ 2010-11-16  5:18       ` lucio
  2010-11-16  5:28         ` cinap_lenrek
  0 siblings, 1 reply; 39+ messages in thread
From: lucio @ 2010-11-16  5:18 UTC (permalink / raw)
  To: 9fans, lucio

> the pc's printed by the lock loop message are kernel code. you
> have to load a debug kernel into acid.

Thanks, I'll do some digging in the Acid document(s), try to
familiarise myself with the details...

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  5:03   ` lucio
@ 2010-11-16  5:11     ` cinap_lenrek
  2010-11-16  5:18       ` lucio
  2010-11-16 13:53     ` erik quanstrom
  1 sibling, 1 reply; 39+ messages in thread
From: cinap_lenrek @ 2010-11-16  5:11 UTC (permalink / raw)
  To: 9fans

the pc's printed by the lock loop message are kernel code. you
have to load a debug kernel into acid.

--
cinap

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  4:40 ` erik quanstrom
@ 2010-11-16  5:03   ` lucio
  2010-11-16  5:11     ` cinap_lenrek
  2010-11-16 13:53     ` erik quanstrom
  0 siblings, 2 replies; 39+ messages in thread
From: lucio @ 2010-11-16  5:03 UTC (permalink / raw)
  To: 9fans, lucio

> i assume you've fixed this?  (not yet fixed on sources.)

Yes, before I did that the errors occurred much more frequently;
there's definitely something in that, but as Russ points out, the fix
prevents panics and I have yet to see a panic.

I have a suspicion that we're looking at the wrong problem, but I
don't know how to approach it.

I tried acid, but I'm just not familiar enough with it to make it
work.  I tried

	rumble% acid 2052 /bin/exportfs
	/bin/exportfs:386 plan 9 executable
	/sys/lib/acid/port
	/sys/lib/acid/386
	acid: src(0xf01e7377)
	no source for ?file?

and that's where I have to leave it for the time being.  But
suggestions are welcome, I'll be back here at the end of the day (it's
07:00 now, I expect to be back after 19:00).

++L




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [9fans] That deadlock, again
  2010-11-16  4:21 lucio
@ 2010-11-16  4:40 ` erik quanstrom
  2010-11-16  5:03   ` lucio
  0 siblings, 1 reply; 39+ messages in thread
From: erik quanstrom @ 2010-11-16  4:40 UTC (permalink / raw)
  To: 9fans

On Mon Nov 15 23:23:12 EST 2010, lucio@proxima.alt.za wrote:
> Regarding the "deadlock" report that I occasionally see on my CPU
> server console, I won't bore anyone with PC addresses or anything like
> that, but I will recommend something I believe to be a possible
> trigger: the failure always seems to occur within "exportfs", which in
> this case is used exclusively to run stats(1) remotely from my
> workstation.  So the recommendation is that somebody like Erik, who is
> infinitely more clued up than I am in the kernel arcana, should run one
> or more stats sessions into a cpu server (I happen to be running
> fossil, so maybe Erik won't see this) and see if he can also trigger
> this behaviour.  I'm hoping that it is not platform specific.
>
> Right now, I'm short of skills as well as a serial console :-(

i run stats all the time.  i've never seen a lock loop caused by stats.

exportfs gets blamed all the time for the sins of others.  possible
culprits are the tcp/ip stack and the kernel devices that stats accesses
and of course, the channel code itself.

it would be a good idea for you to track down all the pcs involved
and send them along.  i can't think of another way of narrowing down
the list of potential suspects.  not all of our usual suspects have an
alibi.

i assume you've fixed this?  (not yet fixed on sources.)

/n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
  				/*
  				 * mh->mount->to == c, so start at mh->mount->next
  				 */
+ 				f = nil;
  				rlock(&mh->lock);
+ 				if(mh->mount)
  				for(f = mh->mount->next; f; f = f->next)
  					if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
  						break;

- erik



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [9fans] That deadlock, again
@ 2010-11-16  4:21 lucio
  2010-11-16  4:40 ` erik quanstrom
  0 siblings, 1 reply; 39+ messages in thread
From: lucio @ 2010-11-16  4:21 UTC (permalink / raw)
  To: 9fans, lucio

Regarding the "deadlock" report that I occasionally see on my CPU
server console, I won't bore anyone with PC addresses or anything like
that, but I will recommend something I believe to be a possible
trigger: the failure always seems to occur within "exportfs", which in
this case is used exclusively to run stats(1) remotely from my
workstation.  So the recommendation is that somebody like Erik, who is
infinitely more clued up than I am in the kernel arcana, should run one
or more stats sessions into a cpu server (I happen to be running
fossil, so maybe Erik won't see this) and see if he can also trigger
this behaviour.  I'm hoping that it is not platform specific.

Right now, I'm short of skills as well as a serial console :-(

++L

PS: here is a kmesg from the server:

	Plan 9
	E820: 00000000 0009fc00 memory
	E820: 0009fc00 000a0000 reserved
	E820: 000e0000 00100000 reserved
	E820: 00100000 47740000 memory
	E820: 47740000 47750000 acpi reclaim
	E820: 47750000 47800000 acpi nvs
	126 holes free
	00018000 0009f000 552960
	00468000 0642b000 100413440
	100966400 bytes free
	cpu0: 2599MHz GenuineIntel PentiumIV/Xeon (cpuid: AX 0x0F29 DX 0xBFEBFBFF)
	ELCR: 0E28
	#l0: i82557: 100Mbps port 0xDC00 irq 11: 00111104e0b6
	1143M memory: 100M kernel data, 1043M user, 1668M swap
	root is from (tcp, local)[local!#S/sdC0/fossil]: time...
	venti...2010/1115 17:36:16 venti: conf.../boot/venti: mem 31,972,556 bcmem 63,945,112 icmem 95,917,670...httpd tcp!127.1!8000...init...icache 95,917,670 bytes = 1,498,714 entries; 16 scache
	sync...announce tcp!127.1!17034...serving.
	fossil(#S/sdC0/fossil)...fsys: dialing venti at tcp!127.1!17034
	version...time...

	init: starting /bin/rc

which also supplies:

lock 0xf09d8980 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 2052
 17: #I0tcpack pc f01ff12a dbgpc        0   Running (Running) ut 530 st 0 bss 0 qpc f014583c nl 0 nd 0 lpc f01e2cc8 pri 13
2052:  exportfs pc f01efc9f dbgpc     94ad    Pwrite (Ready) ut 43 st 209 bss 40000 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

and, a bit later:

lock 0xf0057d74 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 2052
 61:etherread4 pc f01ef8a0 dbgpc        0   Running (Running) ut 2923 st 0 bss 0 qpc f0148c8a nl 0 nd 0 lpc f0100f6e pri 13
2052:  exportfs pc f01e7377 dbgpc     94ad    Pwrite (Ready) ut 55 st 270 bss 40000 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

to my surprise.




^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2010-11-18 19:27 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-18  5:50 [9fans] That deadlock, again Lucio De Re
2010-11-18  5:53 ` erik quanstrom
2010-11-18  8:11   ` Lucio De Re
2010-11-18  8:35     ` cinap_lenrek
2010-11-18  9:20     ` cinap_lenrek
2010-11-18 10:48       ` Lucio De Re
2010-11-18 15:10         ` erik quanstrom
2010-11-18 16:46           ` erik quanstrom
2010-11-18 18:01             ` Lucio De Re
2010-11-18 18:29               ` C H Forsyth
2010-11-18 18:23                 ` Lucio De Re
2010-11-18 18:33                 ` Lucio De Re
2010-11-18 18:43               ` erik quanstrom
2010-11-18 18:54                 ` erik quanstrom
2010-11-18 19:01                 ` Lucio De Re
2010-11-18 19:27                   ` Lucio De Re
2010-11-18 18:03           ` Lucio De Re
2010-11-18  5:57 ` Lucio De Re
  -- strict thread matches above, loose matches on Subject: below --
2010-11-16  4:21 lucio
2010-11-16  4:40 ` erik quanstrom
2010-11-16  5:03   ` lucio
2010-11-16  5:11     ` cinap_lenrek
2010-11-16  5:18       ` lucio
2010-11-16  5:28         ` cinap_lenrek
2010-11-16  6:47           ` Lucio De Re
2010-11-16 13:53     ` erik quanstrom
2010-11-16 18:03       ` lucio
2010-11-17  4:08         ` Lucio De Re
2010-11-17  4:18           ` erik quanstrom
2010-11-17  4:37             ` Lucio De Re
2010-11-17  4:43               ` erik quanstrom
2010-11-17  5:22             ` cinap_lenrek
2010-11-17  6:45               ` Lucio De Re
2010-11-17  7:03                 ` Lucio De Re
2010-11-17  7:09                   ` erik quanstrom
2010-11-17  5:33             ` cinap_lenrek
2010-11-17  6:48               ` Lucio De Re
2010-11-17  7:03                 ` erik quanstrom
2010-11-17 14:40           ` Russ Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).