* [9fans] Too many checkpages() diagnostics ...
@ 2014-05-26 23:14 Lyndon Nerenberg
2014-05-26 23:41 ` Steve Simon
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-26 23:14 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
For the last couple of days I have been plagued by many many diagnostics from checkpages(), in conjunction with things like:
rc: note: sys: trap: fault read addr=0x0 pc=0x000101c4
rc 50675: suicide: sys: trap: fault read addr=0x0 pc=0x000101c4
The kernel print buffer holds corresponding entries like:
coral# 10618 dns: checked 136 page table entries
dns 10618: suicide: sys: trap: fault write addr=0x0 pc=0x00015cea
26591 rfcmirror: checked 270 page table entries
37326 rc: checked 51 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
47773 rc: checked 57 page table entries
50675 rc: checked 53 page table entries
coral# rm '#s/dns'
coral# ndb/dns -r
coral# 55270 rfcmirror: checked 146 page table entries
55270 rfcmirror: checked 146 page table entries
66218 rfcmirror: checked 146 page table entries
70615 rfcmirror: checked 62 page table entries
70615 rfcmirror: checked 62 page table entries
70644 tcp567: checked 39 page table entries
70644 tcp567: checked 39 page table entries
71354 rfcmirror: checked 46 page table entries
71354 rfcmirror: checked 46 page table entries
Yes, these were two different events. These just happened to be what I captured for later reference. Three events, really; the 'rc' complaints are from me running 'mk' in various source trees.
I have always seen these 'checked nnn page table entries' messages, but for the last couple of days they are everywhere. And processes are failing hand-over-fist. Forking processes in rc seems to be a sure-fire way to provoke this. I cannot get through a 'mk' of any significant piece of software, and /n/sources/contrib/lyndon/rfcmirror is very good at borking things, too.
Is anyone else seeing this? I'm running bleeding edge labs code, compiled from a pull from this afternoon. (And I have been running very up-to-date labs pulls all the way along.)
This is all running in a Parallels VM on a Mac, the same VM I have been using as a terminal for several years. What changed was switching over to a CPU kernel. The VM has 1GB of RAM now, but was quite happy running 9pcf (vs 9pccpuf now) in 256 MB, and that terminal kernel ran the same suite of commands just fine. (This is objtype=386.)
--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
@ 2014-05-26 23:41 ` Steve Simon
2014-05-26 23:44 ` Lyndon Nerenberg
2014-05-27 10:43 ` Charles Forsyth
2014-05-27 6:46 ` lucio
2014-05-27 13:36 ` erik quanstrom
2 siblings, 2 replies; 12+ messages in thread
From: Steve Simon @ 2014-05-26 23:41 UTC (permalink / raw)
To: 9fans
I have to ask, when you rebuilt everything, you did rebuild
9pccpuf as well, didn't you? i.e. it's not the lack of the new
nsec() system call biting you, is it?
-Steve
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:41 ` Steve Simon
@ 2014-05-26 23:44 ` Lyndon Nerenberg
2014-05-26 23:47 ` Steve Simon
2014-05-27 10:43 ` Charles Forsyth
1 sibling, 1 reply; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-26 23:44 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On May 26, 2014, at 4:41 PM, Steve Simon <steve@quintile.net> wrote:
> I have to ask, when you rebuilt everything, you did rebuild
> 9pccpuf as well, didn't you? i.e. it's not the lack of the new
> nsec() system call biting you, is it?
No, I carefully did the Macarena around that mess ;-)
--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:44 ` Lyndon Nerenberg
@ 2014-05-26 23:47 ` Steve Simon
2014-05-27 0:27 ` Lyndon Nerenberg
0 siblings, 1 reply; 12+ messages in thread
From: Steve Simon @ 2014-05-26 23:47 UTC (permalink / raw)
To: 9fans
Ok,
Just thought I would ask, 9pccpuf is not built by the labs
so you would need to rebuild it by hand.
worth a try.
-Steve
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:47 ` Steve Simon
@ 2014-05-27 0:27 ` Lyndon Nerenberg
2014-05-27 13:08 ` Anthony Sorace
0 siblings, 1 reply; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-27 0:27 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On May 26, 2014, at 4:47 PM, Steve Simon <steve@quintile.net> wrote:
> Just thought I would ask, 9pccpuf is not built by the labs
> so you would need to rebuild it by hand.
It's not rebuilt, which is a shame, since I'm pretty sure this must be the kernel they run on their file servers.
If not, I would really like to see what they *are* running.
I recall there used to be a mk target that would rebuild all the kernel configs, i.e. everything in CONFLIST. It would be nice if that came back.
--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
2014-05-26 23:41 ` Steve Simon
@ 2014-05-27 6:46 ` lucio
2014-05-27 6:57 ` lucio
2014-05-27 13:36 ` erik quanstrom
2 siblings, 1 reply; 12+ messages in thread
From: lucio @ 2014-05-27 6:46 UTC (permalink / raw)
To: 9fans
> Is anyone else seeing this? I'm running bleeding edge labs code,
> compiled from a pull from this afternoon. (And I have been running
> very up-to-date labs pulls all the way along.)
The dns failures occur this side too, once, sometimes a few times a
day. The responsible party is the auth/cpu/file server, where I run
the network's DNS service, especially for the foreign hosts (NetBSD -
practically idle - and UBUNTU in various guises). It is not running
the new kernel with NSEC, yet (it's got NANOTIME, instead, same thing,
really).
I'm tempted to suggest that the UBUNTU machines are tickling a bug in
dns, as they seem to spend a lot of time reaching into the Internet,
much more than I'm comfortable with.
L.
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-27 6:46 ` lucio
@ 2014-05-27 6:57 ` lucio
2014-05-27 20:52 ` Lyndon Nerenberg
0 siblings, 1 reply; 12+ messages in thread
From: lucio @ 2014-05-27 6:57 UTC (permalink / raw)
To: 9fans
> The dns failures occur this side too, once, sometimes a few times a
> day.
It was more frequent when there was a duplicate entry in
/lib/ndb/kestell (happens to be the description of my local network),
it's improved since I fixed that. There may still be some trouble in
the database, but I could not spot any errors.
It's unfortunate that ndb/dns isn't (for me) the industrial-strength
utility I normally expect from Plan 9. But it's too vast and too
ambitious for me to debug, even if just to make it more strict about
its configuration.
L.
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:41 ` Steve Simon
2014-05-26 23:44 ` Lyndon Nerenberg
@ 2014-05-27 10:43 ` Charles Forsyth
1 sibling, 0 replies; 12+ messages in thread
From: Charles Forsyth @ 2014-05-27 10:43 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On 27 May 2014 00:41, Steve Simon <steve@quintile.net> wrote:
> it's not the lack of the new
> nsec() system call biting you, is it?
>
that wouldn't lead to checkpages faults, which appear when processes trap
on bad addresses.
i'd suspect an inconsistency between the source (eg, paging or lock data
structures) and existing object files.
It could be that some other structure has changed (for instance Block
acquired a magic value a few months ago).
Ordinarily I'd expect that to be caught by the -T compilation option and
loader checks, but perhaps those aren't on by default.
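
To illustrate the kind of inconsistency suspected here, the following is a minimal sketch in portable C. The two structures and their field names are hypothetical and are not the real Block definition; they only show how adding a leading field (such as a magic value) moves every later field, so an object file compiled against the old layout would read the wrong bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts for illustration only; not the real Plan 9 Block. */
typedef struct OldBlock OldBlock;
struct OldBlock {
	unsigned char *rp;	/* first unread byte */
	unsigned char *wp;	/* first unwritten byte */
};

typedef struct NewBlock NewBlock;
struct NewBlock {
	unsigned long magic;	/* newly added field; shifts everything after it */
	unsigned char *rp;
	unsigned char *wp;
};

/* Bytes by which code built against OldBlock would miss rp in a NewBlock. */
long
rpskew(void)
{
	long skew;

	skew = (long)offsetof(NewBlock, rp) - (long)offsetof(OldBlock, rp);
	assert(skew >= 0);	/* a field added in front can only move rp later */
	return skew;
}
```

A non-zero result from rpskew() is the situation a stale object file would be in: it still addresses rp at the old offset and so picks up the magic value instead.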
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-27 0:27 ` Lyndon Nerenberg
@ 2014-05-27 13:08 ` Anthony Sorace
2014-05-27 22:56 ` Lyndon Nerenberg
0 siblings, 1 reply; 12+ messages in thread
From: Anthony Sorace @ 2014-05-27 13:08 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
> I recall there used to be a mk target that would rebuild all the kernel configs, i.e. everything in CONFLIST. It would be nice if that came back.
I believe 'mk all' in /sys/src/9/<whatever> will still do this.
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
2014-05-26 23:41 ` Steve Simon
2014-05-27 6:46 ` lucio
@ 2014-05-27 13:36 ` erik quanstrom
2 siblings, 0 replies; 12+ messages in thread
From: erik quanstrom @ 2014-05-27 13:36 UTC (permalink / raw)
To: 9fans
On Mon May 26 19:16:22 EDT 2014, lyndon@orthanc.ca wrote:
> For the last couple of days I have been plagued by many many diagnostics from checkpages(), in conjunction with things like:
>
> rc: note: sys: trap: fault read addr=0x0 pc=0x000101c4
> rc 50675: suicide: sys: trap: fault read addr=0x0 pc=0x000101c4
acid says that this is an abort.
; acid /n/sources/plan9/386/bin/rc
/n/sources/plan9/386/bin/rc:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid; src(0x000101c4)
/sys/src/libc/9sys/abort.c:6
1 #include <u.h>
2 #include <libc.h>
3 void
4 abort(void)
5 {
>6 while(*(int*)0)
7 ;
8 }
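
The pc handed to src() comes straight out of the suicide note. As a rough sketch, assuming notes always carry addr= and pc= in the form quoted in this thread, the two values could be pulled out like this (parsefault is a made-up helper name, not part of any Plan 9 library):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the addr= and pc= values from a fault note.
 * Returns 1 on success, 0 if either field is missing or empty.
 */
int
parsefault(const char *note, unsigned long *addr, unsigned long *pc)
{
	const char *a, *p;
	char *end;

	assert(note != NULL && addr != NULL && pc != NULL);
	a = strstr(note, "addr=");
	p = strstr(note, " pc=");
	if(a == NULL || p == NULL)
		return 0;
	a += 5;			/* skip "addr=" */
	*addr = strtoul(a, &end, 16);	/* base 16 accepts the 0x prefix */
	if(end == a)
		return 0;
	p += 4;			/* skip " pc=" */
	*pc = strtoul(p, &end, 16);
	if(end == p)
		return 0;
	return 1;
}
```

An addr of 0 together with a pc inside abort() is the signature of the deliberate nil dereference shown in the listing.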
the problem is without a backtrace, there are a few too many possibilities.
if the abort is legit, these would be good candidates
- notifyf (plan9.c)
- _vsaop (not very likely)
- assert:
io.c:101: assert(b->fd == -1 || b->bufp > b->buf);
pcmd.c:24: assert(f != nil);
but ...
> The kernel print buffer holds corresponding entries like:
>
> coral# 10618 dns: checked 136 page table entries
> dns 10618: suicide: sys: trap: fault write addr=0x0 pc=0x00015cea
/sys/src/libc/port/pool.c:974
969 return a;
970 }
971
972 /* poolallocl: attempt to allocate block to hold dsize user bytes; assumes lock held */
973 static void*
>974 poolallocl(Pool *p, ulong dsize)
975 {
976 ulong bsize;
977 Free *fb;
978 Alloc *ab;
979
acid; asm(0x00015cea)
poolallocl 0x00015cea SUBL $0x1c,SP
poolallocl+0x3 0x00015ced MOVL dsize+0x4(FP),DX
poolallocl+0x7 0x00015cf1 CMPL DX,$0x80000000
poolallocl+0xd 0x00015cf7 JCS poolallocl+0x22(SB)
this one doesn't make any sense, unless the stack ptr is smashed.
> 26591 rfcmirror: checked 270 page table entries
> 37326 rc: checked 51 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 50675 rc: checked 53 page table entries
ah. this is starting to make some sense. remember above, there was
an abort in notifyf? that was if the trap depth got too deep. the problem
is we would need to see 33 events for pid 47773, but we don't.
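
Counting the events per pid by eye is error-prone. A small sketch in portable C that tallies the "checked N page table entries" lines for one pid follows; countchecked is a hypothetical helper, and the line format (including an optional shell prompt ending in "# ") is assumed from the excerpts quoted in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count lines of the form "PID name: checked N page table entries"
 * for the given pid.  A leading prompt such as "coral# " is skipped.
 */
int
countchecked(const char *log, long pid)
{
	const char *line, *nl;
	char buf[256], name[64], *p;
	long lpid;
	int n, count;
	size_t len;

	assert(log != NULL);
	count = 0;
	for(line = log; *line != '\0'; line = (*nl != '\0' ? nl+1 : nl)){
		nl = strchr(line, '\n');
		if(nl == NULL)
			nl = line + strlen(line);
		len = (size_t)(nl - line);
		if(len >= sizeof buf)
			len = sizeof buf - 1;	/* truncate overlong lines */
		memcpy(buf, line, len);
		buf[len] = '\0';
		p = strstr(buf, "# ");	/* skip a shell prompt, if any */
		p = (p != NULL) ? p+2 : buf;
		if(sscanf(p, "%ld %63[^:]: checked %d page table entries",
		    &lpid, name, &n) == 3 && lpid == pid)
			count++;
	}
	return count;
}
```

On the excerpt quoted above this gives 11 for pid 47773, well short of the 33 events mentioned.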
i had a very similar problem under vbox on osx, and the solution
was to use gorka's ancient fix, which basically avoids clearing PTEs
which do not have the PteP bit set. there are substantial differences
between the pc and nix kernels here.
so for example mmuptefree() looks fishy to me since it clears
pages not present. but i'm not sure.
- erik
the applied patch is /n/atom/patch/applied/vboxmmu
; diff -c mmu.c.orig mmu.c
mmu.c.orig:87,93 - mmu.c:87,93
}
void
- mmuflushtlb(uintmem)
+ xmmuflushtlb(uintmem)
{
m->tlbpurge++;
mmu.c.orig:98,104 - mmu.c:98,122
putcr3(m->pml4->pa);
}
+ /* hack for vbox */
void
+ mmuflushtlb(uintmem)
+ {
+ int i;
+ PTE *pte;
+
+ m->tlbpurge++;
+ if(m->pml4->daddr){
+ pte = UINT2PTR(m->pml4->va);
+ for(i = 0; i < m->pml4->daddr; i++)
+ if(pte[i] & PteP)
+ pte[i] = 0;
+ m->pml4->daddr = 0;
+ }
+ putcr3(m->pml4->pa);
+ }
+
+ void
mmuflush(void)
{
Mpl pl;
mmu.c.orig:259,264 - mmu.c:277,283
void
mmuswitch(Proc* proc)
{
+ int i;
PTE *pte;
Page *page;
Mpl pl;
mmu.c.orig:270,276 - mmu.c:289,300
}
if(m->pml4->daddr){
- memset(UINT2PTR(m->pml4->va), 0, m->pml4->daddr*sizeof(PTE));
+ /* hack for vbox */
+ // memset(UINT2PTR(m->pml4->va), 0, m->pml4->daddr*sizeof(PTE));
+ pte = UINT2PTR(m->pml4->va);
+ for(i = 0; i < m->pml4->daddr; i++)
+ if(pte[i] & PteP)
+ pte[i] = 0;
m->pml4->daddr = 0;
}
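
For readers who do not have the nix tree handy, the effect of the change can be sketched in user space. This is only a model under stated assumptions: PTE is taken to be a 64-bit word and PteP to be bit 0 (the x86 present bit); clearall and clearpresent are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t PTE;
enum { PteP = 1<<0 };	/* present bit; assumed to be bit 0 as on x86 */

/* Original behaviour: zero the first n entries unconditionally. */
void
clearall(PTE *pte, size_t n)
{
	assert(pte != NULL);
	memset(pte, 0, n*sizeof(PTE));
}

/*
 * Patched behaviour: zero only the entries marked present and leave
 * the rest untouched.  Returns the number of entries cleared.
 */
size_t
clearpresent(PTE *pte, size_t n)
{
	size_t i, cleared;

	assert(pte != NULL || n == 0);
	cleared = 0;
	for(i = 0; i < n; i++)
		if(pte[i] & PteP){
			pte[i] = 0;
			cleared++;
		}
	return cleared;
}
```

The only observable difference between the two is what happens to entries that are non-zero but not present: clearall wipes them, clearpresent leaves them alone, which is the behaviour the vbox hack relies on.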
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-27 6:57 ` lucio
@ 2014-05-27 20:52 ` Lyndon Nerenberg
0 siblings, 0 replies; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-27 20:52 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On May 26, 2014, at 11:57 PM, lucio@proxima.alt.za wrote:
> It was more frequent when there was a duplicate entry in
> /lib/ndb/kestell (happens to be the description of my local network),
> it's improved since I fixed that. There may still be some trouble in
> the database, but I could not spot any errors.
For me, ndb/dns rarely trips the diagnostic. In almost every case I've examined, the diagnostic is triggered when something is in the process of forking. The rfcmirror/idmirror scripts trip it with all the cp commands they run (although it's rc itself that's breaking). mk blows up when doing the fork/exec of the build recipe commands. I'm trying to collect a set of stack traces to see if I can find a pattern.
--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
2014-05-27 13:08 ` Anthony Sorace
@ 2014-05-27 22:56 ` Lyndon Nerenberg
0 siblings, 0 replies; 12+ messages in thread
From: Lyndon Nerenberg @ 2014-05-27 22:56 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
On May 27, 2014, at 6:08 AM, Anthony Sorace <a@9srv.net> wrote:
> I believe 'mk all' in /sys/src/9/<whatever> will still do this.
So there is. (And 'installall'.) Sorry for not seeing this :-P
Thread overview: 12+ messages
2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
2014-05-26 23:41 ` Steve Simon
2014-05-26 23:44 ` Lyndon Nerenberg
2014-05-26 23:47 ` Steve Simon
2014-05-27 0:27 ` Lyndon Nerenberg
2014-05-27 13:08 ` Anthony Sorace
2014-05-27 22:56 ` Lyndon Nerenberg
2014-05-27 10:43 ` Charles Forsyth
2014-05-27 6:46 ` lucio
2014-05-27 6:57 ` lucio
2014-05-27 20:52 ` Lyndon Nerenberg
2014-05-27 13:36 ` erik quanstrom