* [9fans] lock diagnostics on Fossil server
@ 2010-10-24 12:20 Lucio De Re
2010-10-25 13:20 ` erik quanstrom
0 siblings, 1 reply; 13+ messages in thread
From: Lucio De Re @ 2010-10-24 12:20 UTC (permalink / raw)
To: 9fans
I keep getting errors in this fashion (I'm afraid I have no serial
console, so this is manually copied stuff):
lock 0xf045c390 loop key 0xdeaddead pc 0xf017728a held by pc 0xf017728a proc 100
117: timesync pc f01ef88c dbgpc 916f Open (Running) ut 22 st 173 bss 1b000 qpc f01c733d nl 0 nd 0 lpc f0100f6e pri 19
100: listen pc f0100583 dbgpc 4a6a Open (Ready) ut 165 st 1094 bss 29000 qpc f01e6bac nl 1 nd 0 lpc f01e2cb4 pri 10
The report is much less frequent now; it occurred very frequently when
I had a couple of CPU sessions running on it. The most obvious trigger
seems to have been exportfs. I eventually turned off the stats report
I had running from the workstation, and since then I have had a single
report. While stats was running, reports seemed to coincide with full
load as reported by stats.
I've seen #I0tcpack appear in earlier reports and etherread4 beside the
more frequent exportfs.
Any idea what I ought to be looking out for? I have recompiled the
9pccpuf kernel from up to date sources (as up to date as replica can
make them). I was tempted to make local copies of /sys/src/9, but then I
thought I'd have to make sure the libraries are up to date too, and that
was going to be a bit of a mission. I checked the libraries, though,
and the sizes and dates all seem to correspond with "sources", so I can
only presume there's a gremlin somewhere.
++L
PS: There's a marginal chance that dns is involved. My Internet
connectivity isn't very robust and I note that dns gets pretty overwhelmed
when things aren't good. It's hard to pinpoint the exact circumstances:
I don't have an active console that I can observe all the time for
unexpected reports.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [9fans] lock diagnostics on Fossil server
2010-10-24 12:20 [9fans] lock diagnostics on Fossil server Lucio De Re
@ 2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-25 13:20 UTC (permalink / raw)
To: lucio, 9fans
> I had a couple of CPU sessions running on it. The most obvious trigger
> seems to have been exportfs, I eventually turned off the stats report
sounds familiar. this patch needs to be applied to the kernel:
/n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
/*
* mh->mount->to == c, so start at mh->mount->next
*/
+ f = nil;
rlock(&mh->lock);
+ if(mh->mount)
for(f = mh->mount->next; f; f = f->next)
if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
break;
please apply. if the problem persists after the fix, use acid to print the
source of the pcs and qpcs of the procs involved and send that offline.
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-25 13:20 ` erik quanstrom
@ 2010-10-25 14:35 ` Lucio De Re
2010-10-26 2:01 ` cinap_lenrek
2010-10-26 14:28 ` Russ Cox
2 siblings, 0 replies; 13+ messages in thread
From: Lucio De Re @ 2010-10-25 14:35 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Mon, Oct 25, 2010 at 09:20:51AM -0400, erik quanstrom wrote:
>
> sounds familiar. this patch needs to be applied to the kernel:
>
> /n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
> /*
> * mh->mount->to == c, so start at mh->mount->next
> */
> + f = nil;
> rlock(&mh->lock);
> + if(mh->mount)
> for(f = mh->mount->next; f; f = f->next)
> if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
> break;
>
> please apply. if the problem persists after the fix, use acid to print the
> source of the pcs and qpcs of the procs involved and send that offline.
>
Will do. I was looking for that patch; you have posted it to 9fans
before. Somehow, a quick search did not reveal what I was hoping to find.
I'll come back with some results soon.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
@ 2010-10-26 2:01 ` cinap_lenrek
2010-10-26 12:44 ` erik quanstrom
2010-10-26 14:28 ` Russ Cox
2 siblings, 1 reply; 13+ messages in thread
From: cinap_lenrek @ 2010-10-26 2:01 UTC (permalink / raw)
To: 9fans
hm... wouldn't it just crash if mh->mount is nil?
--
cinap
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 2:01 ` cinap_lenrek
@ 2010-10-26 12:44 ` erik quanstrom
2010-10-26 13:45 ` Lucio De Re
0 siblings, 1 reply; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 12:44 UTC (permalink / raw)
To: 9fans
On Mon Oct 25 22:03:53 EDT 2010, cinap_lenrek@gmx.de wrote:
> hm... wouldn't it just crash if mh->mount is nil?
>
perhaps you are reading the diff backwards? it used to
crash when mh->mount was nil, leading to a lock loop.
i added the test to see that mh->mount != nil after the
rlock on mh->lock is acquired. otherwise, there is a race
with unmount which can make mh->mount nil while we're running.
our newer faster processors with more cores were making
this event likely enough that a few recipes would crash
the machine within 5 minutes.
- erik
---
/n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
/*
* mh->mount->to == c, so start at mh->mount->next
*/
+ f = nil;
rlock(&mh->lock);
+ if(mh->mount)
for(f = mh->mount->next; f; f = f->next)
if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
break;
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 12:44 ` erik quanstrom
@ 2010-10-26 13:45 ` Lucio De Re
2010-10-26 14:31 ` erik quanstrom
0 siblings, 1 reply; 13+ messages in thread
From: Lucio De Re @ 2010-10-26 13:45 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Tue, Oct 26, 2010 at 08:44:37AM -0400, erik quanstrom wrote:
> On Mon Oct 25 22:03:53 EDT 2010, cinap_lenrek@gmx.de wrote:
>
> > hm... wouldn't it just crash if mh->mount is nil?
> >
>
> perhaps you are reading the diff backwards? it used to
> crash when mh->mount was nil, leading to a lock loop.
> i added the test to see that mh->mount != nil after the
> rlock on mh->lock is acquired. otherwise, there is a race
> with unmount which can make mh->mount nil while we're running.
>
I was hoping you'd follow up on that; I needed a seed message and my
mailbox has recently overflowed :-(
I'm curious what you call "crash" in this case, and I think Cinap is too.
Basically, what exactly happens when a nil pointer is dereferenced in
the kernel? How does the kernel survive and slip into a locked situation?
Yes, yes, I know I could read the sources, but that's a skill I'm a
little short on.
> our newer faster processors with more cores were making
> this event likely enough that a few recipes would crash
> the machine within 5 minutes.
>
I really appreciate the fix, it certainly had the desired effect.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 13:45 ` Lucio De Re
@ 2010-10-26 14:31 ` erik quanstrom
0 siblings, 0 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 14:31 UTC (permalink / raw)
To: lucio, 9fans
> I was hoping you'd follow up on that; I needed a seed message and my
> mailbox has recently overflowed :-(
>
> I'm curious what you call "crash" in this case, and I think Cinap is too.
> Basically, what exactly happens when a nil pointer is dereferenced in
> the kernel? How does the kernel survive and slip into a locked situation?
good point. usually you get a panic message, registers and a stack
dump. i may be recalling the problem incorrectly.
> I really appreciate the fix, it certainly had the desired effect.
well, that's good.
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
2010-10-26 2:01 ` cinap_lenrek
@ 2010-10-26 14:28 ` Russ Cox
2010-10-26 14:48 ` erik quanstrom
2010-10-26 16:27 ` Lucio De Re
2 siblings, 2 replies; 13+ messages in thread
From: Russ Cox @ 2010-10-26 14:28 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
> sounds familiar. this patch needs to be applied to the kernel:
Like Lucio and Cinap, I am skeptical that this is the fix.
It's a real bug and a correct fix, as we've discussed before,
but if the kernel loses this race I believe it will crash dereferencing nil.
Lucio showed a kernel that was very much still running.
... unless the Plan 9 kernel has changed since I last worked on it
and now kills only the current process when a bad kernel memory
access happens (this is what Linux does, but I think that's
very questionable behavior).
Russ
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 14:28 ` Russ Cox
@ 2010-10-26 14:48 ` erik quanstrom
2010-10-26 16:27 ` Lucio De Re
1 sibling, 0 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 14:48 UTC (permalink / raw)
To: 9fans
> It's a real bug and a correct fix, as we've discussed before,
> but if the kernel loses this race I believe it will crash dereferencing nil.
> Lucio showed a kernel that was very much still running.
you are correct. i was confused.
the bug reported looks like a missing waserror().
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 14:28 ` Russ Cox
2010-10-26 14:48 ` erik quanstrom
@ 2010-10-26 16:27 ` Lucio De Re
2010-10-26 17:01 ` erik quanstrom
1 sibling, 1 reply; 13+ messages in thread
From: Lucio De Re @ 2010-10-26 16:27 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Tue, Oct 26, 2010 at 07:28:57AM -0700, Russ Cox wrote:
>
> Like Lucio and Cinap, I am skeptical that this is the fix.
>
> It's a real bug and a correct fix, as we've discussed before,
> but if the kernel loses this race I believe it will crash dereferencing nil.
> Lucio showed a kernel that was very much still running.
>
And a very busy one, at that, because while I had stats(1) running,
it showed load at max. I may not remember correctly, but I think there
were lots of context switches as well, and load was saturating.
I can re-create the problem if anybody wants me to help diagnose it.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 16:27 ` Lucio De Re
@ 2010-10-26 17:01 ` erik quanstrom
2010-10-27 3:03 ` lucio
0 siblings, 1 reply; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 17:01 UTC (permalink / raw)
To: lucio, 9fans
> I can re-create the problem if anybody wants me to help diagnose it.
please do.
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 17:01 ` erik quanstrom
@ 2010-10-27 3:03 ` lucio
2010-10-27 4:14 ` erik quanstrom
0 siblings, 1 reply; 13+ messages in thread
From: lucio @ 2010-10-27 3:03 UTC (permalink / raw)
To: 9fans
>> I can re-create the problem if anybody wants me to help diagnose it.
>
> please do.
>
Looks like I don't need to: I left the machines running last night and
I note two more instances this morning, using the patched kernel. So
the problem is much less common now, but still present. That is
positively weird.
I thought I posted a request for some help debugging the kernel from
the diagnostics; I wonder if only Erik got my message. If anyone can
give me some suggestions, I'll have time this evening or, more likely,
early tomorrow morning.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-27 3:03 ` lucio
@ 2010-10-27 4:14 ` erik quanstrom
0 siblings, 0 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-27 4:14 UTC (permalink / raw)
To: 9fans
> Looks like I don't need to: I left the machines running last night and
> I note two more instances this morning, using the patched kernel. So
> the problem is much less common now, but still present. That is
> positively weird.
>
> I thought I posted a request for some help debugging the kernel from
> the diagnostics, I wonder if only Erik got my message? If anyone can
> give me some suggestions, I'll have time this evening or more likely
> early tomorrow morning.
the key is to use acid on your kernel to print out the
pcs in the lock loop diag.
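[Editor's note: that step can look like the session sketched below. The kernel path and the addresses (taken from the lock loop report earlier in the thread) are illustrative; src() comes from acid's standard port library.]

```
% acid /386/9pccpuf
acid: src(0xf017728a)	// pc holding the lock in the loop report
acid: src(0xf01c733d)	// qpc of proc 117 (timesync)
acid: src(0xf01e6bac)	// qpc of proc 100 (listen)
```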
- erik
end of thread, other threads:[~2010-10-27 4:14 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-24 12:20 [9fans] lock diagnostics on Fossil server Lucio De Re
2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
2010-10-26 2:01 ` cinap_lenrek
2010-10-26 12:44 ` erik quanstrom
2010-10-26 13:45 ` Lucio De Re
2010-10-26 14:31 ` erik quanstrom
2010-10-26 14:28 ` Russ Cox
2010-10-26 14:48 ` erik quanstrom
2010-10-26 16:27 ` Lucio De Re
2010-10-26 17:01 ` erik quanstrom
2010-10-27 3:03 ` lucio
2010-10-27 4:14 ` erik quanstrom