[9fans] Too many checkpages() diagnostics ...

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] Too many checkpages() diagnostics ...
@ 2014-05-26 23:14 Lyndon Nerenberg
  2014-05-26 23:41 ` Steve Simon
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-26 23:14 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

For the last couple of days I have been plagued by many many diagnostics from checkpages(), in conjunction with things like:

  rc: note: sys: trap: fault read addr=0x0 pc=0x000101c4
  rc 50675: suicide: sys: trap: fault read addr=0x0 pc=0x000101c4

The kernel print buffer holds corresponding entries like:

  coral# 10618 dns: checked 136 page table entries
  dns 10618: suicide: sys: trap: fault write addr=0x0 pc=0x00015cea
  26591 rfcmirror: checked 270 page table entries
  37326 rc: checked 51 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  47773 rc: checked 57 page table entries
  50675 rc: checked 53 page table entries
  coral# rm '#s/dns'
  coral# ndb/dns -r
  coral# 55270 rfcmirror: checked 146 page table entries
  55270 rfcmirror: checked 146 page table entries
  66218 rfcmirror: checked 146 page table entries
  70615 rfcmirror: checked 62 page table entries
  70615 rfcmirror: checked 62 page table entries
  70644 tcp567: checked 39 page table entries
  70644 tcp567: checked 39 page table entries
  71354 rfcmirror: checked 46 page table entries
  71354 rfcmirror: checked 46 page table entries

Yes, these were two different events.  These just happened to be what I captured for later reference.  Three events, really; the 'rc' complaints are from me running 'mk' in various source trees.

I have always seen these 'checked nnn page table entries' messages, but for the last couple of days they are everywhere.  And processes are failing hand-over-fist.  Forking processes in rc seems to be a sure-fire way to provoke this.  I cannot get through a 'mk' of any significant piece of software, and /n/sources/contrib/lyndon/rfcmirror is very good at borking things, too.

Is anyone else seeing this?  I'm running bleeding edge labs code, compiled from a pull from this afternoon.  (And I have been running very up-to-date labs pulls all the way along.)

This is all running in a Parallels VM on a Mac, the same VM I have been using as a terminal for several years.  What changed was switching over to a CPU kernel.  The VM has 1GB of RAM now, but was quite happy running 9pcf (vs 9pccpuf now) in 256 MB, and that terminal kernel ran the same suite of commands just fine.  (This is objtype=386.)

--lyndon


* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
@ 2014-05-26 23:41 ` Steve Simon
  2014-05-26 23:44   ` Lyndon Nerenberg
  2014-05-27 10:43   ` Charles Forsyth
  2014-05-27  6:46 ` lucio
  2014-05-27 13:36 ` erik quanstrom
  2 siblings, 2 replies; 12+ messages in thread
From: Steve Simon @ 2014-05-26 23:41 UTC (permalink / raw)
  To: 9fans

I have to ask, when you rebuilt everything, you did rebuild
9pccpuf as well, didn't you? I.e. it's not the lack of the new
nsec() system call biting you, is it?

-Steve




* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:41 ` Steve Simon
@ 2014-05-26 23:44   ` Lyndon Nerenberg
  2014-05-26 23:47     ` Steve Simon
  2014-05-27 10:43   ` Charles Forsyth
  1 sibling, 1 reply; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-26 23:44 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs



On May 26, 2014, at 4:41 PM, Steve Simon <steve@quintile.net> wrote:

> I have to ask, when you rebuilt everything, you did rebuild
> 9pccpuf as well, didn't you? I.e. it's not the lack of the new
> nsec() system call biting you, is it?

No, I carefully did the Macarena around that mess ;-)

--lyndon



* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:44   ` Lyndon Nerenberg
@ 2014-05-26 23:47     ` Steve Simon
  2014-05-27  0:27       ` Lyndon Nerenberg
  0 siblings, 1 reply; 12+ messages in thread
From: Steve Simon @ 2014-05-26 23:47 UTC (permalink / raw)
  To: 9fans

Ok,

Just thought I would ask, 9pccpuf is not built by the labs
so you would need to rebuild it by hand.

worth a try.

-Steve




* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:47     ` Steve Simon
@ 2014-05-27  0:27       ` Lyndon Nerenberg
  2014-05-27 13:08         ` Anthony Sorace
  0 siblings, 1 reply; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-27  0:27 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On May 26, 2014, at 4:47 PM, Steve Simon <steve@quintile.net> wrote:

> Just thought I would ask, 9pccpuf is not built by the labs
> so you would need to rebuild it by hand.

It's not rebuilt, which is a shame, since I'm pretty sure this must be the kernel they run on their file servers.

If not, I would really like to see what they *are* running.

I recall there used to be a mk target that would rebuild all the kernel configs, i.e. everything in CONFLIST.  It would be nice if that came back.

--lyndon


* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
  2014-05-26 23:41 ` Steve Simon
@ 2014-05-27  6:46 ` lucio
  2014-05-27  6:57   ` lucio
  2014-05-27 13:36 ` erik quanstrom
  2 siblings, 1 reply; 12+ messages in thread
From: lucio @ 2014-05-27  6:46 UTC (permalink / raw)
  To: 9fans

> Is anyone else seeing this?  I'm running bleeding edge labs code,
> compiled from a pull from this afternoon.  (And I have been running
> very up-to-date labs pulls all the way along.)

The dns failures occur this side too, once, sometimes a few times a
day.  The responsible party is the auth/cpu/file server, where I run
the network's DNS service, especially for the foreign hosts (NetBSD -
practically idle - and UBUNTU in various guises).  It is not running
the new kernel with NSEC, yet (it's got NANOTIME, instead, same thing,
really).

I'm tempted to suggest that the UBUNTU machines are tickling a bug in
dns, as they seem to spend a lot of time reaching into the Internet,
much more than I'm comfortable with.

L.

* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-27  6:46 ` lucio
@ 2014-05-27  6:57   ` lucio
  2014-05-27 20:52     ` Lyndon Nerenberg
  0 siblings, 1 reply; 12+ messages in thread
From: lucio @ 2014-05-27  6:57 UTC (permalink / raw)
  To: 9fans

> The dns failures occur this side too, once, sometimes a few times a
> day.

It was more frequent when there was a duplicate entry in
/lib/ndb/kestell (happens to be the description of my local network);
it's improved since I fixed that.  There may still be some trouble in
the database, but I could not spot any errors.

It's unfortunate that ndb/dns isn't (for me) the industrial-strength
utility I normally expect from Plan 9.  But it's too vast and too
ambitious for me to debug, even if just to make it more strict about
its configuration.

L.

* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:41 ` Steve Simon
  2014-05-26 23:44   ` Lyndon Nerenberg
@ 2014-05-27 10:43   ` Charles Forsyth
  1 sibling, 0 replies; 12+ messages in thread
From: Charles Forsyth @ 2014-05-27 10:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 27 May 2014 00:41, Steve Simon <steve@quintile.net> wrote:

> it's not the lack of the new
> nsec() system call biting you, is it?
>

that wouldn't lead to checkpages faults, which appear when processes trap
on bad addresses.
i'd suspect an inconsistency between the source (eg, paging or lock data
structures) and existing object files.
It could be that some other structure has changed (for instance Block
acquired a magic value a few months ago).
Ordinarily I'd expect that to be caught by the -T compilation option and
loader checks, but perhaps those aren't on by default.
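
To make the kind of mismatch described above concrete, here is a minimal sketch in standard C (not Plan 9 kernel code). The struct names `BlockOld` and `BlockNew`, their fields, and the helper `rpmoved` are hypothetical; they only mimic a structure that gains a leading magic field. An object file compiled against the old layout would keep using the old offsets, so it would read or write the wrong bytes of a structure built by freshly compiled code.

```c
#include <stddef.h>

/* Hypothetical layout before the change (assumption, for illustration). */
typedef struct BlockOld BlockOld;
struct BlockOld {
	unsigned char *rp;	/* read pointer */
	unsigned char *wp;	/* write pointer */
};

/* Hypothetical layout after a magic field is added at the front. */
typedef struct BlockNew BlockNew;
struct BlockNew {
	unsigned long magic;	/* new field shifts everything after it */
	unsigned char *rp;
	unsigned char *wp;
};

/*
 * Returns nonzero if code compiled against the old layout would
 * find rp at a different offset than code compiled against the new one.
 */
int
rpmoved(void)
{
	return offsetof(BlockOld, rp) != offsetof(BlockNew, rp);
}
```

If `rpmoved()` is nonzero, a stale object file and a fresh one disagree about where `rp` lives, which is the sort of inconsistency that rebuilding everything from clean would remove.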

* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-27  0:27       ` Lyndon Nerenberg
@ 2014-05-27 13:08         ` Anthony Sorace
  2014-05-27 22:56           ` Lyndon Nerenberg
  0 siblings, 1 reply; 12+ messages in thread
From: Anthony Sorace @ 2014-05-27 13:08 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> I recall there used to be a mk target that would rebuild all the kernel configs, i.e. everything in CONFLIST.  It would be nice if that came back.

I believe 'mk all' in /sys/src/9/<whatever> will still do this.




* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
  2014-05-26 23:41 ` Steve Simon
  2014-05-27  6:46 ` lucio
@ 2014-05-27 13:36 ` erik quanstrom
  2 siblings, 0 replies; 12+ messages in thread
From: erik quanstrom @ 2014-05-27 13:36 UTC (permalink / raw)
  To: 9fans

On Mon May 26 19:16:22 EDT 2014, lyndon@orthanc.ca wrote:

> For the last couple of days I have been plagued by many many diagnostics from checkpages(), in conjunction with things like:
>
>   rc: note: sys: trap: fault read addr=0x0 pc=0x000101c4
>   rc 50675: suicide: sys: trap: fault read addr=0x0 pc=0x000101c4

acid says that this is an abort.

; acid /n/sources/plan9/386/bin/rc
/n/sources/plan9/386/bin/rc:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid; src(0x000101c4)
/sys/src/libc/9sys/abort.c:6
 1	#include <u.h>
 2	#include <libc.h>
 3	void
 4	abort(void)
 5	{
>6		while(*(int*)0)
 7			;
 8	}

the problem is without a backtrace, there are a few too many possibilities.
if the abort is legit, these would be good candidates
- notifyf (plan9.c)
- _vsaop (not very likely)
- assert:
io.c:101: 			assert(b->fd == -1 || b->bufp > b->buf);
pcmd.c:24: 	assert(f != nil);


but ...

> The kernel print buffer holds corresponding entries like:
>
>   coral# 10618 dns: checked 136 page table entries
>   dns 10618: suicide: sys: trap: fault write addr=0x0 pc=0x00015cea

/sys/src/libc/port/pool.c:974
 969		return a;
 970	}
 971
 972	/* poolallocl: attempt to allocate block to hold dsize user bytes; assumes lock held */
 973	static void*
>974	poolallocl(Pool *p, ulong dsize)
 975	{
 976		ulong bsize;
 977		Free *fb;
 978		Alloc *ab;
 979
acid; asm(0x00015cea)
poolallocl 0x00015cea	SUBL	$0x1c,SP
poolallocl+0x3 0x00015ced	MOVL	dsize+0x4(FP),DX
poolallocl+0x7 0x00015cf1	CMPL	DX,$0x80000000
poolallocl+0xd 0x00015cf7	JCS	poolallocl+0x22(SB)

this one doesn't make any sense, unless the stack ptr is smashed.

>   26591 rfcmirror: checked 270 page table entries
>   37326 rc: checked 51 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   47773 rc: checked 57 page table entries
>   50675 rc: checked 53 page table entries

ah.  this is starting to make some sense.  remember above, there was
an abort in notifyf?  that was if the trap depth got too deep.  the problem
is we would need to see 33 events for pid 47773, but we don't.
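
One way to check that count against a captured kernel print buffer is sketched below in standard C. The function name `countchecks` is hypothetical, and the line format is assumed to be exactly the "<pid> <name>: checked <n> page table entries" shape quoted above; lines prefixed by a shell prompt (such as "coral# ...") are deliberately not matched by this sketch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count lines of the form
 *	<pid> <name>: checked <n> page table entries
 * for one pid in a newline-separated log.  Lines that do not start
 * with a number (prompts, suicide notes) are skipped.
 */
int
countchecks(const char *log, int pid)
{
	char line[256], name[64];
	const char *s, *e;
	size_t len;
	int p, entries, n;

	n = 0;
	for(s = log; *s != '\0'; s = e){
		/* find the end of the current line */
		e = strchr(s, '\n');
		len = (e != NULL) ? (size_t)(e - s) : strlen(s);
		e = s + len;
		if(*e == '\n')
			e++;
		/* copy it out, truncating overlong lines */
		if(len >= sizeof line)
			len = sizeof line - 1;
		memcpy(line, s, len);
		line[len] = '\0';
		if(sscanf(line, "%d %63[^:]: checked %d", &p, name, &entries) == 3 && p == pid)
			n++;
	}
	return n;
}
```

Run over the log quoted above, this would report 11 lines for pid 47773, well short of the 33 events mentioned.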

i had a very similar problem under vbox on osx, and the solution
was to use gorka's ancient fix, which basically avoids clearing PTEs
which do not have the PteP bit set.  there are substantial differences
between the pc and nix kernels here.

so for example mmuptefree() looks fishy to me since it clears
pages not present.  but i'm not sure.

- erik

the applied patch is /n/atom/patch/applied/vboxmmu

; diff -c mmu.c.orig mmu.c
mmu.c.orig:87,93 - mmu.c:87,93
  }

  void
- mmuflushtlb(uintmem)
+ xmmuflushtlb(uintmem)
  {

  	m->tlbpurge++;
mmu.c.orig:98,104 - mmu.c:98,122
  	putcr3(m->pml4->pa);
  }

+ /* hack for vbox */
  void
+ mmuflushtlb(uintmem)
+ {
+ 	int i;
+ 	PTE *pte;
+
+ 	m->tlbpurge++;
+ 	if(m->pml4->daddr){
+ 		pte = UINT2PTR(m->pml4->va);
+ 		for(i = 0; i < m->pml4->daddr; i++)
+ 			if(pte[i] & PteP)
+ 				pte[i] = 0;
+ 		m->pml4->daddr = 0;
+ 	}
+ 	putcr3(m->pml4->pa);
+ }
+
+ void
  mmuflush(void)
  {
  	Mpl pl;
mmu.c.orig:259,264 - mmu.c:277,283
  void
  mmuswitch(Proc* proc)
  {
+ 	int i;
  	PTE *pte;
  	Page *page;
  	Mpl pl;
mmu.c.orig:270,276 - mmu.c:289,300
  	}

  	if(m->pml4->daddr){
- 		memset(UINT2PTR(m->pml4->va), 0, m->pml4->daddr*sizeof(PTE));
+ 		/* hack for vbox */
+ //		memset(UINT2PTR(m->pml4->va), 0, m->pml4->daddr*sizeof(PTE));
+ 		pte = UINT2PTR(m->pml4->va);
+ 		for(i = 0; i < m->pml4->daddr; i++)
+ 			if(pte[i] & PteP)
+ 				pte[i] = 0;
  		m->pml4->daddr = 0;
  	}
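
The core of the patch is the loop that replaces the memset. As a standalone illustration in standard C, outside the kernel, the sketch below applies the same guarded clear to a plain array. The `PTE` typedef as a 64-bit integer and the function name `clearpresent` are assumptions for this sketch; `PteP` is taken to be bit 0, the x86 "present" bit.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t PTE;	/* assumption: 64-bit page table entries */

enum {
	PteP = 1<<0,	/* x86 "present" bit */
};

/*
 * Zero only the entries that have the present bit set, leaving
 * not-present entries untouched (memset would zero all of them).
 * Returns the number of entries cleared.
 */
size_t
clearpresent(PTE *pte, size_t n)
{
	size_t i, cleared;

	cleared = 0;
	for(i = 0; i < n; i++)
		if(pte[i] & PteP){
			pte[i] = 0;
			cleared++;
		}
	return cleared;
}
```

The only behavioural difference from the original memset is that entries without `PteP` keep whatever value they had.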




* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-27  6:57   ` lucio
@ 2014-05-27 20:52     ` Lyndon Nerenberg
  0 siblings, 0 replies; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-27 20:52 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On May 26, 2014, at 11:57 PM, lucio@proxima.alt.za wrote:

> It was more frequent when there was a duplicate entry in
> /lib/ndb/kestell (happens to be the description of my local network),
> it's improved since I fixed that.  There may still be some trouble in
> the database, but I could not spot any errors.

For me, ndb/dns rarely trips the diagnostic.  In almost every case I've examined, the diagnostic is triggered when something is in the process of forking.  The rfcmirror/idmirror scripts trip it with all the cp commands they run (although it's rc itself that's breaking).  mk blows up when doing the fork/exec of the build recipe commands.  I'm trying to collect a set of stack traces to see if I can find a pattern.

--lyndon


* Re: [9fans] Too many checkpages() diagnostics ...
  2014-05-27 13:08         ` Anthony Sorace
@ 2014-05-27 22:56           ` Lyndon Nerenberg
  0 siblings, 0 replies; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-27 22:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On May 27, 2014, at 6:08 AM, Anthony Sorace <a@9srv.net> wrote:

> I believe 'mk all' in /sys/src/9/<whatever> will still do this.

So there is.  (And 'installall'.)  Sorry for not seeing this :-P


end of thread, other threads:[~2014-05-27 22:56 UTC | newest]

Thread overview: 12+ messages
2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
2014-05-26 23:41 ` Steve Simon
2014-05-26 23:44   ` Lyndon Nerenberg
2014-05-26 23:47     ` Steve Simon
2014-05-27  0:27       ` Lyndon Nerenberg
2014-05-27 13:08         ` Anthony Sorace
2014-05-27 22:56           ` Lyndon Nerenberg
2014-05-27 10:43   ` Charles Forsyth
2014-05-27  6:46 ` lucio
2014-05-27  6:57   ` lucio
2014-05-27 20:52     ` Lyndon Nerenberg
2014-05-27 13:36 ` erik quanstrom
