[9fans] kernel memory allocator got confused?

9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed

* [9fans] kernel memory allocator got confused?
@ 2007-11-21 13:02 Steve Simon
  2007-11-21 13:22 ` erik quanstrom
  2007-11-22 16:14 ` Russ Cox
  0 siblings, 2 replies; 6+ messages in thread
From: Steve Simon @ 2007-11-21 13:02 UTC (permalink / raw)
  To: 9fans

Ok, out of my depth here but I was developing a prog under ape, I wanted to see how
it got to a given line so, being lazy I put "*(char *)0 = 0;" in my code.

This worked the first time but then mk started dieing,
	then rc died,
	then I had the joy of several programs reading the console
	then rio went and took the remaining shells with it

In fact anything that allocated memory seemed to died.

I discovered that cat and con would still run so I was able
to cat the contents of /dev/kprint into a file and
use con to halt fossil.

my kernel is fairly recent though I haven't rebuilt it for
a couple of months. This machine is a do-everything box:
venti/fossil/cpu (net listeners)/auth/terminal.
please ignore the complaint about the filesystem 'other' -
I removed the disk.

Just aheads-up really.

-Steve



E820: 00000000 0009f400 memory
E820: 0009f400 000a0000 reServed
E820: 000f0000 00100000 rEserved
E820: 00100000 3ffeb000 Memory
E820: 3ffeb000 3ffef000 aCpi reclaim
E820: 3ffef000 3ffff000 reserved
E820: 3ffff000 40000000 acpi nvs
E820: fec00000 fec10000 reserved
E820: fee00000 fee01000 reserved
E820: fff80000 100000000 reserved
126 holes free
00017000 0009f000 557056
0036b000 10000000 264851456
265408512 bytes free
cpu0: 1607MHz GenuineIntel P4 (cpuid: AX 0x0F0A DX 0x3FEBFBFF)
ELCR: 1A20
apm0: configured cbase 000f0000 off 00006824
#l0: i82557: 100Mbps port 0xD400 irq 5: 0002b3a256a6
sd53c8xx: SYM53C875 rev. 0x26 intr=11 command=0017
#U/usb0: uhci: port 0xB400 irq 9
#U/usb1: uhci: port 0xB000 irq 11
1024M memory: 256M kernel data, 767M user, 1392M swap
root is from (tcp, local)[local!#S/sd00/fossil]: 
user[none]: steve
sd53c8xx: bios scntl3(00) stest2(00)
time...
venti...2007/1120 12:36:58 venti: conf...httpd tcp!127.1!8000...init...sync...announce tcp!127.1!17034...serving.
fossil(#S/sd00/fossil)...fsOpen: open /dev/fs/other: '/dev/fs/other' does not exist
fsys other open -c 12800: fsOpen: open /dev/fs/other: '/dev/fs/other' does not exist
version...time...

init: starting /bin/rc
4 keys read
secstore from (floppy, secstore, none)[secstore]: 
secstore password: 
secstore
aux/cifs: 146 long share names too long for RAP (>13 chars)
23795 cda: checked 8223 page table entries
23849 mk: checked 53 page table entries
23851 mk: checked 50 page table entries
23853 mk: checked 53 page table entries
6308 rc: checked 48 page table entries
mem user overflow
pool sbrkmem block 435a70
hdr 0a110c09 0014b220 0000c847 0000c847 002e002e 002e002e
tail 00000000 00000000 00000000 00000000 00000000 00000000 | efd88dbe 0014b220
user data 00 00 00 00  00 00 00 00 | fe d1 f0 fa  00 00 00 00
panic: pool panic
47 rio: checked 1190 page table entries
rio 47: suicide: sys: trap: fault read addr=0x0 pc=0x00028fe0
init: rc exit status: rio 47: sys: trap: fault read addr=0x0 pc=0x00028fe0
23857 maild: checked 45 page table entries
rc: note: sys: trap: fault write addr=0xfffffff9 pc=0x0000d6a8
maild 23857: suicide: sys: trap: fault write addr=0xfffffff9 pc=0x0000d6a8

init: starting /bin/rc
larch% larch% sat: '/bin/sat' file does not exist
larch% i: '/bin/i' file does not exist
larch% i: '/bin/i' file does not exist
assert failed: (*t)->magic == FREE_MAGIC
23868 rc: checked 48 page table entries
rc: note: sys: trap: fault read addr=0x0 pc=0x0000fbcc
rc 23868: suicide: sys: trap: fault read addr=0x0 pc=0x0000fbcc
larch% i: '/bin/i' file does not exist
assert failed: (*t)->magic == FREE_MAGIC
23870 rc: checked 48 page table entries
rc: note: sys: trap: fault read addr=0x0 pc=0x0000fbcc
rc 23870: suicide: sys: trap: fault read addr=0x0 pc=0x0000fbcc
23863 rc: checked 48 page table entries
rc: note: sys: trap: fault read addr=0x657669 pc=0x00011301
rc 23863: suicide: sys: trap: fault read addr=0x657669 pc=0x00011301
init: rc exit status: rc 23863: sys: trap: fault read addr=0x657669 pc=0x00011301

etc.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] kernel memory allocator got confused?
  2007-11-21 13:02 [9fans] kernel memory allocator got confused? Steve Simon
@ 2007-11-21 13:22 ` erik quanstrom
  2007-11-22 16:14 ` Russ Cox
  1 sibling, 0 replies; 6+ messages in thread
From: erik quanstrom @ 2007-11-21 13:22 UTC (permalink / raw)
  To: 9fans

the problem appears to be detected here:

/sys/src/libc/port/pool.c:833: 					printblock(p, b, "mem user overflow");

so either your memory has gone wonky or the kernel is writing past the
end of the buffer.  

this must be in the kernel where the overflow is occuring.
it appears the reason you don't panic on the spot is because of this:

/sys/src/9/port/alloc.c:28: 	.flags=	POOL_TOLERANCE,

if you're feeling brave, you could recompile without POOL_TOLERANCE
and try to recreate the problem.  it'd be nice to know who's screwing up.
but .. that's not too much fun when you have an all-in-wonder
box.  maybe make a copy of fossil's stuff before trying.

- erik

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] kernel memory allocator got confused?
  2007-11-21 13:02 [9fans] kernel memory allocator got confused? Steve Simon
  2007-11-21 13:22 ` erik quanstrom
@ 2007-11-22 16:14 ` Russ Cox
  2007-11-22 17:20   ` erik quanstrom
  2007-11-22 23:04   ` Steve Simon
  1 sibling, 2 replies; 6+ messages in thread
From: Russ Cox @ 2007-11-22 16:14 UTC (permalink / raw)
  To: 9fans

steve simon:
> aux/cifs: 146 long share names too long for RAP (>13 chars)
> 23795 cda: checked 8223 page table entries
> 23849 mk: checked 53 page table entries
> 23851 mk: checked 50 page table entries
> 23853 mk: checked 53 page table entries
> 6308 rc: checked 48 page table entries
> mem user overflow
> pool sbrkmem block 435a70
be 8d d8 ef

> hdr 0a110c09 0014b220 0000c847 0000c847 002e002e 002e002e
> tail 00000000 00000000 00000000 00000000 00000000 00000000 | efd88dbe 0014b220
> user data 00 00 00 00  00 00 00 00 | fe d1 f0 fa  00 00 00 00
> panic: pool panic
> 47 rio: checked 1190 page table entries
> rio 47: suicide: sys: trap: fault read addr=0x0 pc=0x00028fe0
> init: rc exit status: rio 47: sys: trap: fault read addr=0x0 pc=0x00028fe0
> 23857 maild: checked 45 page table entries
> rc: note: sys: trap: fault write addr=0xfffffff9 pc=0x0000d6a8
> maild 23857: suicide: sys: trap: fault write addr=0xfffffff9 pc=0x0000d6a8
> 
> init: starting /bin/rc
> larch% larch% sat: '/bin/sat' file does not exist
> larch% i: '/bin/i' file does not exist
> larch% i: '/bin/i' file does not exist
> assert failed: (*t)->magic == FREE_MAGIC
> 23868 rc: checked 48 page table entries
> rc: note: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> rc 23868: suicide: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> larch% i: '/bin/i' file does not exist
> assert failed: (*t)->magic == FREE_MAGIC
> 23870 rc: checked 48 page table entries
> rc: note: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> rc 23870: suicide: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> 23863 rc: checked 48 page table entries
> rc: note: sys: trap: fault read addr=0x657669 pc=0x00011301
> rc 23863: suicide: sys: trap: fault read addr=0x657669 pc=0x00011301
> init: rc exit status: rc 23863: sys: trap: fault read addr=0x657669 pc=0x00011301

there aren't many hard numbers above.

the original malloc panic in rio is interesting, though.  the block in question
is 1.3MB, with 36 kB of extra unused space beyond what was asked for.
so you'd have to write far beyond the end to really cause significant
corruption.  also, it's a rune buffer (002e 002e 002e 002e is ....)
allocated at /sys/src/cmd/rio/wind.c:1639.  the header and tail are
intact, and the end of user data, marked with the |, has not been
reached.  the text hasn't gotten that far (it's all zeros to the left of the |).
the bytes to the right of the | are supposed to be fe f1 f0 fa (fe fi fo fum)
but the f1 has turned into a d1 - it lost its 0x20 bit.
since rio isn't the kind of program that goes around flipping bits in memory, 
i wonder if your memory is on the fritz.  

it would be more useful if the assert said what (*t)->magic was (my fault).
i wonder if it too was something with just a bit wrong, indicating that
the cached rc text image in the kernel had lost a bit too.

obviously it's possible that the kernel screwed up in some mysterious way.
but in this instance, i'm more inclined to suspect memory problems.

erik quanstrom:
> the problem appears to be detected here:
> 
> /sys/src/libc/port/pool.c:833: 					printblock(p, b, "mem user overflow");
> 
> so either your memory has gone wonky or the kernel is writing past the
> end of the buffer.  
> 
> this must be in the kernel where the overflow is occuring.

no.  those are all user programs dying, including the very first dump.
malloc prints "panic: pool panic" even when running in user programs.
if the kernel had panicked, it would have stopped running (this isn't linux!).

the message says it was the block at 435a70, which is a user address.

POOL_TOLERANCE would not have printed "panic: pool panic".
it would have just done the "mem user overflow" block dump and
continued.

POOL_TOLERANCE only tolerates overflowing the block by a
single extra zero byte, to diagnose the common mistake of

	x = malloc(strlen(s));
	strcpy(x, s);

but be able to continue executing.  it would have tolerated:

	# user data 00 00 00 00  00 00 00 00 | 00 f1 f0 fa  00 00 00 00

but that's not what happened.

russ

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] kernel memory allocator got confused?
  2007-11-22 16:14 ` Russ Cox
@ 2007-11-22 17:20   ` erik quanstrom
  2007-11-22 23:04   ` Steve Simon
  1 sibling, 0 replies; 6+ messages in thread
From: erik quanstrom @ 2007-11-22 17:20 UTC (permalink / raw)
  To: 9fans

> no.  those are all user programs dying, including the very first dump.
> malloc prints "panic: pool panic" even when running in user programs.
> if the kernel had panicked, it would have stopped running (this isn't linux!).
> 

i have a machine the periodicly panics and keeps running.
plan 9 does try to stop but sometimes is waylaid.

- erik


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] kernel memory allocator got confused?
  2007-11-22 16:14 ` Russ Cox
  2007-11-22 17:20   ` erik quanstrom
@ 2007-11-22 23:04   ` Steve Simon
  2007-11-22 23:09     ` Uriel
  1 sibling, 1 reply; 6+ messages in thread
From: Steve Simon @ 2007-11-22 23:04 UTC (permalink / raw)
  To: 9fans

> ...i'm more inclined to suspect memory problems.

Thanks Russ for the very detailed analysis.

I will start by reseating the RAM and if that fails
I will petition my boss for a new machine (bout time really).

-Steve


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] kernel memory allocator got confused?
  2007-11-22 23:04   ` Steve Simon
@ 2007-11-22 23:09     ` Uriel
  0 siblings, 0 replies; 6+ messages in thread
From: Uriel @ 2007-11-22 23:09 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I would recommend running memtest first to find out if it is actually
a memory issue. Otherwise you might just mask the issue and think that
moving the memory around
made it go away when it actually is something else but hard to trigger.

If memtest finds errors, then you know with some certainty that there
was no bug.

uriel

On Nov 23, 2007 12:04 AM, Steve Simon <steve@quintile.net> wrote:
> > ...i'm more inclined to suspect memory problems.
>
> Thanks Russ for the very detailed analysis.
>
> I will start by reseating the RAM and if that fails
> I will petition my boss for a new machine (bout time really).
>
> -Steve
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-11-22 23:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-21 13:02 [9fans] kernel memory allocator got confused? Steve Simon
2007-11-21 13:22 ` erik quanstrom
2007-11-22 16:14 ` Russ Cox
2007-11-22 17:20   ` erik quanstrom
2007-11-22 23:04   ` Steve Simon
2007-11-22 23:09     ` Uriel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).