From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <293d086081456bfda9777d45dfa18ba1@plan9.bell-labs.com>
To: 9fans@cse.psu.edu
From: "Russ Cox" <rsc@plan9.bell-labs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] exception/interrupt 14 bug
Date: Mon, 13 May 2002 15:58:25 -0400
Topicbox-Message-UUID: 902d74c8-eaca-11e9-9e20-41e7f4b1d025

The short version is that I think this bug is fixed; at
least, I think the bug that cause the two reports with
register dumps are fixed.  I don't know about the others
that were mentioned without full panic messages, but it's
worth trying again.  Thanks to Marc Rocas for reporting the
first instance of the bug and to Kevin Adams of VMware for
reporting the fix five days ago (though I didn't realize it
then).

The long version is the story of the actual bug, just
because it's fun.

Here's the first transcript I got of the bug in action:

PBS... Plan 9 from Bell Labs
ELCR: 0E00
apm ax=f000 cx=f000 dx=40 di=ffff ebx=57c7 esi=-1
using fd0!dos!plan9.ini
dev A0 port 1F0 config 045A capabilities 2F00 mwdma 0007 udma 0407
dev A0 port 170 config 80A0 capabilities 0E00 mwdam 0000

found 9PCFLOP .GZ attr 0x0 start 0xf0 len 1162928
.....gz..... 1162928 -> 637473+992156+63892 = 1693521
entry:80100020
FLAGS=10286 TRAP=e ECODE=0 PC=8001c50b
AX d3c04c57	BX 000006e0	CX 80317718	DX 00000000
SI 8003b580	DI 8003b55c	BP 0000000a
CS 0010	DS 0008	ES 0008	FS 0008	GS 0008
CR0 80000011	CR2 d3c04c6b	CR3 000c000
panic: exception/interrupt 14

After printing the entry line and before jumping to the
loaded kernel, 9load calls _sddetach to turn off any devices
that might otherwise interrupt the nascent kernel.

static void
_sddetach(void)
{
	SDev *sdev;

	for(sdev = sdlist; sdev != nil; sdev = sdev->next){
		if(sdev->enabled == 0)
			continue;
		if(sdev->ifc->disable)
			sdev->ifc->disable(sdev);
		sdev->enabled = 0;
	}
}

Unfortunately, sddetach has died because sdev->ifc does not
contain &sdataifc as expected, but rather d3c04c57 (AX).  We
know that sdev is the right pointer because it's in CX and
booting another 9load with some debugging prints in malloc
shows that the first sdev always ends up at 80317718.

The interesting question now is what d3c04c57 is.  It's not
a pointer or a small integer.  It really looks like random
noise; it could conceivably be a clock tick count except
that the clock runs at 100Hz and is initialized to zero.
Still, it's distinctive enough to look for.  It's not in the
9load binary nor in 9pcflop.gz, but it _is_ in the
uncompressed 9pcflop, at location 125159 (hex).  The
kernel size is (decimal) 637473+992156+63892, so
that puts it in the data segment.  The kernel file layout
looks like:

	00000  exec header
	00020  text
	9ba41  data

while the layout in memory will eventually look like

	80100000  exec header
	80100020  text
	8019c000  data

with the data shifted to begin on the next page boundary
after the end of the text.  That means our magic value
d3c04c57 should have ended up at 80225718, but we
found it at 80317718 instead.  The identical page alignment
of the expected and actual locations of the value is very
suspicious.  Their difference is f2000, which doesn't ring
any bells.

Meanwhile, we have this puzzle.

	if(gunzip((uchar*)PADDR(entry)-sizeof(Exec), sizeof(Exec)+text+data,
	     (uchar*)b->bp, b->wp-b->bp) < sizeof(Exec)+text+data) {
		print("error uncompressing kernel\n");
		return FAIL;
	}

	/* relocate data to start at page boundary */
	memmove((void*)PGROUND(entry+text), (void*)(entry+text), data);

	print("entry: %lux\n", entry);
	warp9(PADDR(entry));

The magic value came into existence in memory as a result
of the gunzip call above, but it would have been in an unshifted
text segment, so it would be at 80225159 rather than 80225718.
The memmove is the only way the data could have been shifted
into the right alignment, but how did it end up at 80317718?

But first, a different story.  Five days ago Kevin Adams of
VMware pointed out that our protected mode entry code
looked like:

           0x1000:0x0066 | 0f 01 16 6d 01                 LGDT  (0x16d)
           0x1000:0x006b | b8 01 00                       MOV   %ax,0x1
           0x1000:0x006e | 00 00                          ADD   0x0(%bx+%si),%al
           0x1000:0x0070 | 0f 01 f0                       LMSW  %cr0,%ax
           0x1000:0x0073 | eb 00                          JMP   0x0 ;0x75

the weird part being the ADD instruction.  Of course, this
is because the mov %eax, 0x1 that precedes it was
assembled with a 32-bit immediate rather than a 16-bit
immediate.

This is a fairly new bug.  I introduced it while fiddling with the
VMware port a few months ago.  The effect is to increment
the byte at (BX+SI), whatever that may be.  Maybe, just maybe,
some bit in a page table is getting whacked just right to double
map one page at both 80225000 and 80317000.

I put a check in warp9 to look for page table corruption
and sent out a new boot disk, but the check didn't fire;
the crash remained the same.  So the page tables are okay.

Of course, if I could do decimal/hex conversion a little
better in my head I would have noticed that f2000 (the difference
twixt 80225000 and 80317000) is almost exactly the size of
the data segment itself: 992156 = f239c.

If the memmove call being executed were really

	memmove((void*)PGROUND(entry+text+data), (void*)(entry+text), data);

then that would break things in exactly the manner we
observed.  But clearly the memmove isn't that, and further
it works on most people's machines.

The disassembly looks perfectly fine:

bootpass+0x210 0x80015798	MOVL	entry+0x28(SP),DX
bootpass+0x214 0x8001579c	MOVL	data+0x24(SP),CX
bootpass+0x218 0x800157a0	MOVL	text+0x20(SP),BX
bootpass+0x21c 0x800157a4	MOVL	CX,BP
bootpass+0x21e 0x800157a6	ADDL	BX,CX
bootpass+0x220 0x800157a8	ADDL	$0x20,CX
bootpass+0x223 0x800157ab	CMPL	CX,AX
bootpass+0x225 0x800157ad	JCC	bootpass+0x23d(SB)
	...
bootpass+0x23d 0x800157c5	MOVL	BX,AX
bootpass+0x23f 0x800157c7	ADDL	DX,AX
bootpass+0x241 0x800157c9	ADDL	$0xfff,AX
bootpass+0x246 0x800157ce	ANDL	$0xfffff000,AX
bootpass+0x24b 0x800157d3	MOVL	AX,0x0(SP)
bootpass+0x24e 0x800157d6	LEA	0x0(BX)(DX*1),AX
bootpass+0x251 0x800157d9	MOVL	AX,0x4(SP)
bootpass+0x255 0x800157dd	MOVL	BP,0x8(SP)
bootpass+0x259 0x800157e1	CALL	memmove(SB)

The first bit is doing the comparison of the gunzip return
value, and then we set the arguments for memmove.

But wait.  We don't have to guess at what byte got incremented
by the other bug.  The last thing before the jump to protected
mode is the BIOS call to initialize APM.  The registers upon
return were saved for the kernel, and even printed in the
transcript:

	apm ax=f000 cx=f000 dx=40 di=ffff ebx=57c7 esi=-1

And there it is.  BX+SI = 57c6, and the data segment
at the time of the increment is 1000, making the byte
157c6 aka 80157c6.  The MOVL BX,AX at 80157c5 gets
assembled as 89 D8; after the second byte is incremented
it becomes 89 D9, which is MOVL  BX, CX.

So in fact, the memmove call that was being executed was

	memmove((void*)PGROUND(entry+text+data+sizeof(Exec)),
		(void*)(entry+text), data);

which ends up being close enough.

The fix is in /sys/src/boot/pc/l.s.  In the section marked

/*
 * 	goto protected mode
 */

change

	MOVL	$1, AX

into

	LWI(1, rAX)

I put up a fixed boot disk, although the distribution archive
still contains a bad 9load.  Hopefully we can get a new
distribution archive together soon.  That takes a bit longer
to test.

Russ