* [9fans] lock diagnostics on Fossil server
@ 2010-10-24 12:20 Lucio De Re
2010-10-25 13:20 ` erik quanstrom
0 siblings, 1 reply; 13+ messages in thread
From: Lucio De Re @ 2010-10-24 12:20 UTC (permalink / raw)
To: 9fans
I keep getting errors in this fashion (I'm afraid I have no serial
console, so this is manually copied stuff):
lock 0xf045c390 loop key 0xdeaddead pc 0xf017728a held by pc 0xf017728a proc 100
117: timesync pc f01ef88c dbgpc 916f Open (Running) ut 22 st 173 bss 1b000 qpc f01c733d nl 0 nd 0 lpc f0100f6e pri 19
100: listen pc f0100583 dbgpc 4a6a Open (Ready) ut 165 st 1094 bss 29000 qpc f01e6bac nl 1 nd 0 lpc f01e2cb4 pri 10
The report is much less frequent now, it occurred very frequently when
I had a couple of CPU sessions running on it. The most obvious trigger
seems to have been exportfs, I eventually turned off the stats report
I had running from the workstation and since then I have had a single
report. While stats was running, reports seemed to coincide with full
load as reported by stats.
I've seen #I0tcpack appear in earlier reports and etherread4 beside the
more frequent exportfs.
Any idea what I ought to be looking out for? I have recompiled the
9pccpuf kernel from up to date sources (as up to date as replica can
make them). I was tempted to make local copies of /sys/src/9, but then I
thought I'd have to make sure the libraries are up to date too, and that
was going to be a bit of a mission. I checked the libraries, though,
and the sizes and dates all seem to correspond with "sources" so I can
only presume there's a gremlin somewhere.
++L
PS: There's a marginal chance that dns is involved. My Internet
connectivity isn't very robust and I note that dns gets pretty overwhelmed
when things aren't good. It's hard to pinpoint the exact circumstances:
I don't have an active console that I can observe all the time for
unexpected reports.
* Re: [9fans] lock diagnostics on Fossil server
2010-10-24 12:20 [9fans] lock diagnostics on Fossil server Lucio De Re
@ 2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-25 13:20 UTC (permalink / raw)
To: lucio, 9fans
> I had a couple of CPU sessions running on it. The most obvious trigger
> seems to have been exportfs, I eventually turned off the stats report
sounds familiar. this patch needs to be applied to the kernel:
/n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
      /*
       * mh->mount->to == c, so start at mh->mount->next
       */
+     f = nil;
      rlock(&mh->lock);
+     if(mh->mount)
      for(f = mh->mount->next; f; f = f->next)
          if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
              break;
please apply. if the problem persists after the fix, use acid to print the
source of the pcs and qpcs of the procs involved and send that offline.
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-25 13:20 ` erik quanstrom
@ 2010-10-25 14:35 ` Lucio De Re
2010-10-26 2:01 ` cinap_lenrek
2010-10-26 14:28 ` Russ Cox
2 siblings, 0 replies; 13+ messages in thread
From: Lucio De Re @ 2010-10-25 14:35 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Mon, Oct 25, 2010 at 09:20:51AM -0400, erik quanstrom wrote:
>
> sounds familiar. this patch needs to be applied to the kernel:
>
> /n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
>       /*
>        * mh->mount->to == c, so start at mh->mount->next
>        */
> +     f = nil;
>       rlock(&mh->lock);
> +     if(mh->mount)
>       for(f = mh->mount->next; f; f = f->next)
>           if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
>               break;
>
> please apply. if the problem persists after the fix, use acid to print the
> source of the pcs and qpcs of the procs involved and send that offline.
>
Will do. I was looking for that patch, you have posted it to 9fans
before. Somehow, a quick search did not reveal what I was hoping to find.
I'll come back with some results soon.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
@ 2010-10-26 2:01 ` cinap_lenrek
2010-10-26 12:44 ` erik quanstrom
2010-10-26 14:28 ` Russ Cox
2 siblings, 1 reply; 13+ messages in thread
From: cinap_lenrek @ 2010-10-26 2:01 UTC (permalink / raw)
To: 9fans
hm... wouldnt it just crash if mh->mount is nil?
--
cinap
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 2:01 ` cinap_lenrek
@ 2010-10-26 12:44 ` erik quanstrom
2010-10-26 13:45 ` Lucio De Re
0 siblings, 1 reply; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 12:44 UTC (permalink / raw)
To: 9fans
On Mon Oct 25 22:03:53 EDT 2010, cinap_lenrek@gmx.de wrote:
> hm... wouldnt it just crash if mh->mount is nil?
>
perhaps you are reading the diff backwards? it used to
crash when mh->mount was nil, leading to a lock loop.
i added the test to see that mh->mount != nil after the
rlock on mh->lock is acquired. otherwise, there is a race
with unmount which can make mh->mount nil while we're running.
our newer faster processors with more cores were making
this event likely enough that a few recipes would crash
the machine within 5 minutes.
- erik
---
/n/sources/plan9//sys/src/9/port/chan.c:1012,1018 - chan.c:1012,1020
      /*
       * mh->mount->to == c, so start at mh->mount->next
       */
+     f = nil;
      rlock(&mh->lock);
+     if(mh->mount)
      for(f = mh->mount->next; f; f = f->next)
          if((wq = ewalk(f->to, nil, names+nhave, ntry)) != nil)
              break;
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 12:44 ` erik quanstrom
@ 2010-10-26 13:45 ` Lucio De Re
2010-10-26 14:31 ` erik quanstrom
0 siblings, 1 reply; 13+ messages in thread
From: Lucio De Re @ 2010-10-26 13:45 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Tue, Oct 26, 2010 at 08:44:37AM -0400, erik quanstrom wrote:
> On Mon Oct 25 22:03:53 EDT 2010, cinap_lenrek@gmx.de wrote:
>
> > hm... wouldnt it just crash if mh->mount is nil?
> >
>
> perhaps you are reading the diff backwards? it used to
> crash when mh->mount was nil, leading to a lock loop.
> i added the test to see that mh->mount != nil after the
> rlock on mh->lock is acquired. otherwise, there is a race
> with unmount which can make mh->mount nil while we're running.
>
I was hoping you'd follow up on that, I needed a seed message and my
mailbox has recently overflowed :-(
I'm curious what you call "crash" in this case and I think Cinap is too.
Basically, exactly what happens in the situation when a nil pointer is
dereferenced in the kernel? How does the kernel survive and slip into
a locked situation?
Yes, yes, I know I could read the sources, but that's a skill I'm a
little short on.
> our newer faster processors with more cores were making
> this event likely enough that a few recipes would crash
> the machine within 5 minutes.
>
I really appreciate the fix, it certainly had the desired effect.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-25 13:20 ` erik quanstrom
2010-10-25 14:35 ` Lucio De Re
2010-10-26 2:01 ` cinap_lenrek
@ 2010-10-26 14:28 ` Russ Cox
2010-10-26 14:48 ` erik quanstrom
2010-10-26 16:27 ` Lucio De Re
2 siblings, 2 replies; 13+ messages in thread
From: Russ Cox @ 2010-10-26 14:28 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
> sounds familiar. this patch needs to be applied to the kernel:
Like Lucio and Cinap, I am skeptical that this is the fix.
It's a real bug and a correct fix, as we've discussed before,
but if the kernel loses this race I believe it will crash dereferencing nil.
Lucio showed a kernel that was very much still running.
... unless the Plan 9 kernel has changed since I last worked on it
and now kills only the current process when a bad kernel memory
access happens (this is what Linux does, but I think that's
very questionable behavior).
Russ
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 13:45 ` Lucio De Re
@ 2010-10-26 14:31 ` erik quanstrom
0 siblings, 0 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 14:31 UTC (permalink / raw)
To: lucio, 9fans
> I was hoping you'd follow up on that, I needed a seed message and my
> mailbox has recently overflowed :-(
>
> I'm curious what you call "crash" in this case and I think Cinap is too.
> Basically, exactly what happens in the situation when a nil pointer is
> dereferenced in the kernel? How does the kernel survive and slip into
> a locked situation?
good point. usually you get a panic message, registers and a stack
dump. i may be recalling the problem incorrectly.
> I really appreciate the fix, it certainly had the desired effect.
well, that's good.
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 14:28 ` Russ Cox
@ 2010-10-26 14:48 ` erik quanstrom
2010-10-26 16:27 ` Lucio De Re
1 sibling, 0 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 14:48 UTC (permalink / raw)
To: 9fans
> It's a real bug and a correct fix, as we've discussed before,
> but if the kernel loses this race I believe it will crash dereferencing nil.
> Lucio showed a kernel that was very much still running.
you are correct. i was confused.
the bug reported looks like a missing waserror().
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 14:28 ` Russ Cox
2010-10-26 14:48 ` erik quanstrom
@ 2010-10-26 16:27 ` Lucio De Re
2010-10-26 17:01 ` erik quanstrom
1 sibling, 1 reply; 13+ messages in thread
From: Lucio De Re @ 2010-10-26 16:27 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On Tue, Oct 26, 2010 at 07:28:57AM -0700, Russ Cox wrote:
>
> Like Lucio and Cinap, I am skeptical that this is the fix.
>
> It's a real bug and a correct fix, as we've discussed before,
> but if the kernel loses this race I believe it will crash dereferencing nil.
> Lucio showed a kernel that was very much still running.
>
And a very busy one, at that, because while I had stats(1) running,
it showed load at max. I may not remember correctly, but I think there
were lots of context switches as well, but load was saturating.
I can re-create the problem if anybody wants me to help diagnose it.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 16:27 ` Lucio De Re
@ 2010-10-26 17:01 ` erik quanstrom
2010-10-27 3:03 ` lucio
0 siblings, 1 reply; 13+ messages in thread
From: erik quanstrom @ 2010-10-26 17:01 UTC (permalink / raw)
To: lucio, 9fans
> I can re-create the problem if anybody wants me to help diagnose it.
please do.
- erik
* Re: [9fans] lock diagnostics on Fossil server
2010-10-26 17:01 ` erik quanstrom
@ 2010-10-27 3:03 ` lucio
2010-10-27 4:14 ` erik quanstrom
0 siblings, 1 reply; 13+ messages in thread
From: lucio @ 2010-10-27 3:03 UTC (permalink / raw)
To: 9fans
>> I can re-create the problem if anybody wants me to help diagnose it.
>
> please do.
>
Looks like I don't need to: I left the machines running last night and
I note two more instances this morning, using the patched kernel. So
the problem is much less common now, but still present. That is
positively weird.
I thought I posted a request for some help debugging the kernel from
the diagnostics, I wonder if only Erik got my message? If anyone can
give me some suggestions, I'll have time this evening or more likely
early tomorrow morning.
++L
* Re: [9fans] lock diagnostics on Fossil server
2010-10-27 3:03 ` lucio
@ 2010-10-27 4:14 ` erik quanstrom
0 siblings, 0 replies; 13+ messages in thread
From: erik quanstrom @ 2010-10-27 4:14 UTC (permalink / raw)
To: 9fans
> Looks like I don't need to: I left the machines running last night and
> I note two more instances this morning, using the patched kernel. So
> the problem is much less common now, but still present. That is
> positively weird.
>
> I thought I posted a request for some help debugging the kernel from
> the diagnostics, I wonder if only Erik got my message? If anyone can
> give me some suggestions, I'll have time this evening or more likely
> early tomorrow morning.
the key is to use acid on your kernel to print out the
pcs in the lock loop diag.
- erik