From mboxrd@z Thu Jan  1 00:00:00 1970
To: 9fans@cse.psu.edu
Subject: Re: [9fans] kernel memory allocator got confused?
From: "Russ Cox" <rsc@swtch.com>
Date: Thu, 22 Nov 2007 11:14:58 -0500
In-Reply-To: <d81908eb510fb7c7281f68d2486a7e9b@quintile.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Message-Id: <20071122161448.F3C991E8C22@holo.morphisms.net>
Topicbox-Message-UUID: 085fa8b4-ead3-11e9-9d60-3106f5b1d025

steve simon:
> aux/cifs: 146 long share names too long for RAP (>13 chars)
> 23795 cda: checked 8223 page table entries
> 23849 mk: checked 53 page table entries
> 23851 mk: checked 50 page table entries
> 23853 mk: checked 53 page table entries
> 6308 rc: checked 48 page table entries
> mem user overflow
> pool sbrkmem block 435a70
be 8d d8 ef

> hdr 0a110c09 0014b220 0000c847 0000c847 002e002e 002e002e
> tail 00000000 00000000 00000000 00000000 00000000 00000000 | efd88dbe 0014b220
> user data 00 00 00 00  00 00 00 00 | fe d1 f0 fa  00 00 00 00
> panic: pool panic
> 47 rio: checked 1190 page table entries
> rio 47: suicide: sys: trap: fault read addr=0x0 pc=0x00028fe0
> init: rc exit status: rio 47: sys: trap: fault read addr=0x0 pc=0x00028fe0
> 23857 maild: checked 45 page table entries
> rc: note: sys: trap: fault write addr=0xfffffff9 pc=0x0000d6a8
> maild 23857: suicide: sys: trap: fault write addr=0xfffffff9 pc=0x0000d6a8
> 
> init: starting /bin/rc
> larch% larch% sat: '/bin/sat' file does not exist
> larch% i: '/bin/i' file does not exist
> larch% i: '/bin/i' file does not exist
> assert failed: (*t)->magic == FREE_MAGIC
> 23868 rc: checked 48 page table entries
> rc: note: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> rc 23868: suicide: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> larch% i: '/bin/i' file does not exist
> assert failed: (*t)->magic == FREE_MAGIC
> 23870 rc: checked 48 page table entries
> rc: note: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> rc 23870: suicide: sys: trap: fault read addr=0x0 pc=0x0000fbcc
> 23863 rc: checked 48 page table entries
> rc: note: sys: trap: fault read addr=0x657669 pc=0x00011301
> rc 23863: suicide: sys: trap: fault read addr=0x657669 pc=0x00011301
> init: rc exit status: rc 23863: sys: trap: fault read addr=0x657669 pc=0x00011301

there aren't many hard numbers above.

the original malloc panic in rio is interesting, though.  the block in question
is 1.3 MB, with 36 kB of extra unused space beyond what was asked for.
so you'd have to write far beyond the end to really cause significant
corruption.  also, it's a rune buffer (002e 002e 002e 002e is ....)
allocated at /sys/src/cmd/rio/wind.c:1639.  the header and tail are
intact, and the end of user data, marked with the |, has not been
reached.  the text hasn't gotten that far (it's all zeros to the left of the |).
the bytes to the right of the | are supposed to be fe f1 f0 fa (fe fi fo fum)
but the f1 has turned into a d1 - it lost its 0x20 bit.
since rio isn't the kind of program that goes around flipping bits in memory, 
i wonder if your memory is on the fritz.  

it would be more useful if the assert said what (*t)->magic was (my fault).
i wonder if it too was something with just a bit wrong, indicating that
the cached rc text image in the kernel had lost a bit too.

obviously it's possible that the kernel screwed up in some mysterious way.
but in this instance, i'm more inclined to suspect memory problems.

erik quanstrom:
> the problem appears to be detected here:
> 
> /sys/src/libc/port/pool.c:833: 					printblock(p, b, "mem user overflow");
> 
> so either your memory has gone wonky or the kernel is writing past the
> end of the buffer.  
> 
> this must be in the kernel where the overflow is occurring.

no.  those are all user programs dying, including the very first dump.
malloc prints "panic: pool panic" even when running in user programs.
if the kernel had panicked, it would have stopped running (this isn't linux!).

the message says it was the block at 435a70, which is a user address.

POOL_TOLERANCE would not have printed "panic: pool panic".
it would have just done the "mem user overflow" block dump and
continued.

POOL_TOLERANCE only tolerates overflowing the block by a
single extra zero byte, to diagnose the common mistake of

	x = malloc(strlen(s));
	strcpy(x, s);

while still being able to continue executing.  it would have tolerated:

	# user data 00 00 00 00  00 00 00 00 | 00 f1 f0 fa  00 00 00 00

but that's not what happened.

russ