9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] 9pcfs mballoc bug info
@ 2001-02-05 18:43 jmk
  0 siblings, 0 replies; 7+ messages in thread
From: jmk @ 2001-02-05 18:43 UTC
  To: 9fans

For a long time we've had trouble rebooting our fileserver when the network is
busy; 'busy' seems to mean lots of machines try to attach at once. There is a
panic, sometimes it's the mbfree panic, but mostly just a page fault due to a
nil pointer somewhere.

When it happened again last week I looked at some of the faults and, through a
tortuous path including a number of leaps of faith, started looking at the
locking in getchan and in il.c in general. It just looks bogus, but I haven't
gotten back to it to convince myself one way or the other.

--jim



* Re: [9fans] 9pcfs mballoc bug info
@ 2001-02-06  1:01 nemo
  0 siblings, 0 replies; 7+ messages in thread
From: nemo @ 2001-02-06  1:01 UTC
  To: 9fans


When I changed the code a bit to say in the 905 txstart:

if (!FREE)
	mbfree();

that was exactly what happened: the second caller of mbfree()
(the second path through the 905 driver) got a nil pointer as
the message buffer in the ring and panicked; a page fault
followed. Not a surprise, though.

The ring descriptors were different, but the message buffer
was the same. Only need to find out who does that...





* Re: [9fans] 9pcfs mballoc bug info
@ 2001-02-06  0:45 nemo
  0 siblings, 0 replies; 7+ messages in thread
From: nemo @ 2001-02-06  0:45 UTC
  To: 9fans


The bug happens only when the net is *sloooow* to send the
first packet (seems to happen w/ the arp requests being
sent). I can reproduce the bug quite easily by starting
a fs kernel and attaching to it from a different building
on our campus (yes, our network is slow :-( ).

I think that what happens is that a message buffer gets
linked twice in the transmit ring, perhaps due to a timeout
and retransmission.
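
A minimal sketch of that hypothesis, not the driver's actual code: if the
same buffer ends up linked into two ring slots, the per-descriptor nil check
quoted elsewhere in this thread passes for both slots, and the buffer is freed
twice. Every name here (Pd, Msgbuf, Ntd, linkmb, reclaim, fakefree) is a
hypothetical stand-in chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical message buffer: only counts how often it was freed. */
typedef struct Msgbuf Msgbuf;
struct Msgbuf {
	int	nfree;
};

/* Hypothetical transmit descriptor: only the field discussed here. */
typedef struct Pd Pd;
struct Pd {
	Msgbuf	*mb;
};

enum { Ntd = 4 };	/* ring size, chosen arbitrarily */

/* Stand-in for mbfree(): it only counts the frees. */
static void
fakefree(Msgbuf *mb)
{
	mb->nfree++;
}

/* Link mb into slot i, as a transmit (or a retransmit) might. */
void
linkmb(Pd *ring, int i, Msgbuf *mb)
{
	ring[i % Ntd].mb = mb;
}

/*
 * Reclaim every descriptor using the per-descriptor nil check.
 * Returns the number of frees performed.
 */
int
reclaim(Pd *ring)
{
	int i, n;

	n = 0;
	for(i = 0; i < Ntd; i++){
		if(ring[i].mb != NULL){
			fakefree(ring[i].mb);
			ring[i].mb = NULL;
			n++;
		}
	}
	return n;
}
```

Linking one buffer into slots 0 and 1 and then calling reclaim() frees it
twice: clearing pd->mb in one descriptor says nothing about the other
descriptor that still points at the same buffer.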

Nevertheless, I had to get a fs up quickly, so I stopped
debugging for now to install one (within the same building,
hence working ;-) ).

My current problem is that although the box has two 4G IDE
disks, the fs kernel thinks the file system is full even
before unpacking the plan9.9gz package.

In case somebody has a hint about what I am doing wrong,
my config string for the device was ch0fh2.

Perhaps I'm just too sleepy...



* Re: [9fans] 9pcfs mballoc bug info
  2001-02-05 19:06 jmk
@ 2001-02-05 20:23 ` Francisco J Ballesteros
  0 siblings, 0 replies; 7+ messages in thread
From: Francisco J Ballesteros @ 2001-02-05 20:23 UTC
  To: 9fans



jmk@plan9.bell-labs.com wrote:
>
> are we looking at the same driver? in the one i have there is
> one call to mbfree in the 905 section:
>
>                 if(pd->mb != nil){
>                         mbfree(pd->mb);
>                         pd->mb = nil;
>                 }

Yes, and that call to mbfree() was the one made twice.
My first silly try was to put the mbfree inside an if,
to do it only if the mb was not already FREE.
The second caller of mbfree for the same pd->mb would then
find mb to be nil, and panic.

There are two different pd's with
the same pd->mb, and both are calling mbfree. My silly try
was an attempt to confirm that this was the case.
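
One way to confirm that two descriptors share a buffer, without waiting for
the second free, is to scan the ring for aliased pointers. This is a sketch
under assumptions: Pd and Msgbuf are cut-down stand-ins, and ringalias is a
hypothetical helper, not a function in the driver.

```c
#include <stddef.h>

/* Hypothetical message buffer; its contents do not matter here. */
typedef struct Msgbuf Msgbuf;
struct Msgbuf {
	int	dummy;
};

/* Hypothetical transmit descriptor: only the field discussed here. */
typedef struct Pd Pd;
struct Pd {
	Msgbuf	*mb;
};

/*
 * Look for two descriptors in ring[0..n-1] that hold the same
 * non-nil mb. On a hit, store the two indices in *a and *b and
 * return 1; otherwise return 0 and leave *a and *b untouched.
 */
int
ringalias(Pd *ring, int n, int *a, int *b)
{
	int i, j;

	for(i = 0; i < n; i++){
		if(ring[i].mb == NULL)
			continue;
		for(j = i+1; j < n; j++){
			if(ring[j].mb == ring[i].mb){
				*a = i;
				*b = j;
				return 1;
			}
		}
	}
	return 0;
}
```

Called right after a buffer is linked into the ring, a check like this would
report the aliasing at the point where it is introduced rather than where it
later blows up. The scan is quadratic in the ring size, which is probably
acceptable for a debugging aid on a small ring.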



* Re: [9fans] 9pcfs mballoc bug info
@ 2001-02-05 19:06 jmk
  2001-02-05 20:23 ` Francisco J Ballesteros
  0 siblings, 1 reply; 7+ messages in thread
From: jmk @ 2001-02-05 19:06 UTC
  To: 9fans

nemo@gsyc.escet.urjc.es:
> When I changed the code a bit to say in the 905 txstart:
>
> if (!FREE)
> 	mbfree();
>
> that was exactly what happened: the second caller of mbfree()
> (the second path through the 905 driver) got a nil pointer as
> the message buffer in the ring and panicked; a page fault
> followed. Not a surprise, though.
>
> The ring descriptors were different, but the message buffer
> was the same. Only need to find out who does that...


are we looking at the same driver? in the one i have there is
one call to mbfree in the 905 section:

		if(pd->mb != nil){
			mbfree(pd->mb);
			pd->mb = nil;
		}




* [9fans] 9pcfs mballoc bug info
@ 2001-02-05 18:04 nemo
  2001-02-05 17:56 ` Boyd Roberts
  0 siblings, 1 reply; 7+ messages in thread
From: nemo @ 2001-02-05 18:04 UTC
  To: 9fans

Hi,

	Didn't fix it yet, but in my case, txstart905() (plan9pc/etherlnk3.c)
calls mbfree()  twice with the same buffer. Only transmit() and interrupt()
(same file) call txstart905; and both routines hold the ilock on ctlr->wlock.

Looking at the code, that should not happen, but it does.

If I find out what's going on, I'll let you know.





* Re: [9fans] 9pcfs mballoc bug info
  2001-02-05 18:04 nemo
@ 2001-02-05 17:56 ` Boyd Roberts
  0 siblings, 0 replies; 7+ messages in thread
From: Boyd Roberts @ 2001-02-05 17:56 UTC
  To: 9fans

> Didn't fix it yet, but in my case, txstart905() (plan9pc/etherlnk3.c)
> calls mbfree()  twice with the same buffer. Only transmit() and interrupt()
> (same file) call txstart905; and both routines hold the ilock on ctlr->wlock.

these dup frees are rarely easy to track down.  i'd add some code
before the free to detect it and print some debug, before the panic
leaves you with just a hex stack backtrace.  unfortunately this
may change the timing and the problem may go away.
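
The kind of check described above might look roughly like the following. It
is a sketch under assumptions, not the file server's mbfree(): the Msgbuf
fields, mbfreechk and the MBFREE macro are all hypothetical names introduced
for illustration, and output goes to stderr rather than the kernel console.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for the file server's message buffer. */
typedef struct Msgbuf Msgbuf;
struct Msgbuf {
	int	isfree;		/* set once the buffer has been freed */
	const char	*freedby;	/* function that freed it first */
	int	freedline;	/* line of that first free */
};

/*
 * Checked free: record who frees the buffer, and report instead of
 * freeing again. Returns 0 on a normal free, -1 on a nil or
 * already-free buffer.
 */
int
mbfreechk(Msgbuf *mb, const char *who, int line)
{
	if(mb == NULL){
		fprintf(stderr, "mbfreechk: nil mb from %s:%d\n", who, line);
		return -1;
	}
	if(mb->isfree){
		fprintf(stderr, "mbfreechk: dup free by %s:%d, first freed by %s:%d\n",
			who, line, mb->freedby, mb->freedline);
		return -1;
	}
	mb->isfree = 1;
	mb->freedby = who;
	mb->freedline = line;
	return 0;
}

/* Convenience wrapper that records the calling function and line. */
#define MBFREE(mb)	mbfreechk((mb), __func__, __LINE__)
```

Recording the first caller is the useful part: when the second free arrives,
the report names both sites, which is more than a hex stack backtrace gives.
As noted above, the extra work may shift the timing enough to hide the bug.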




