* [9fans] Too many checkpages() diagnostics ...
From: Lyndon Nerenberg @ 2014-05-26 23:14 UTC
To: Fans of the OS Plan 9 from Bell Labs

For the last couple of days I have been plagued by many many diagnostics from checkpages(), in conjunction with things like:

	rc: note: sys: trap: fault read addr=0x0 pc=0x000101c4
	rc 50675: suicide: sys: trap: fault read addr=0x0 pc=0x000101c4

The kernel print buffer holds corresponding entries like:

	coral# 10618 dns: checked 136 page table entries
	dns 10618: suicide: sys: trap: fault write addr=0x0 pc=0x00015cea
	26591 rfcmirror: checked 270 page table entries
	37326 rc: checked 51 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	47773 rc: checked 57 page table entries
	50675 rc: checked 53 page table entries
	coral# rm '#s/dns'
	coral# ndb/dns -r
	coral# 55270 rfcmirror: checked 146 page table entries
	55270 rfcmirror: checked 146 page table entries
	66218 rfcmirror: checked 146 page table entries
	70615 rfcmirror: checked 62 page table entries
	70615 rfcmirror: checked 62 page table entries
	70644 tcp567: checked 39 page table entries
	70644 tcp567: checked 39 page table entries
	71354 rfcmirror: checked 46 page table entries
	71354 rfcmirror: checked 46 page table entries

Yes, these were two different events. These just happened to be what I captured for later reference. Three events, really; the 'rc' complaints are from me running 'mk' in various source trees.

I have always seen these 'checked nnn page table entries' messages, but for the last couple of days they are everywhere. And processes are failing hand-over-fist. Forking processes in rc seems to be a sure-fire way to provoke this. I cannot get through a 'mk' of any significant piece of software, and /n/sources/contrib/lyndon/rfcmirror is very good at borking things, too.

Is anyone else seeing this? I'm running bleeding edge labs code, compiled from a pull from this afternoon. (And I have been running very up-to-date labs pulls all the way along.)

This is all running in a Parallels VM on a Mac, the same VM I have been using as a terminal for several years. What changed was switching over to a CPU kernel. The VM has 1GB of RAM now, but was quite happy running 9pcf (vs 9pccpuf now) in 256 MB, and that terminal kernel ran the same suite of commands just fine. (This is objtype=386.)

--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
From: Steve Simon @ 2014-05-26 23:41 UTC
To: 9fans

I have to ask, when you rebuilt everything, you did rebuild
9pccpuf as well, didn't you? i.e. it's not the lack of the new
nsec() system call biting you, is it?

-Steve
* Re: [9fans] Too many checkpages() diagnostics ...
From: Lyndon Nerenberg @ 2014-05-26 23:44 UTC
To: Fans of the OS Plan 9 from Bell Labs

On May 26, 2014, at 4:41 PM, Steve Simon <steve@quintile.net> wrote:

> I have to ask, when you rebuilt everything, you did rebuild
> 9pccpuf as well, didn't you? i.e. it's not the lack of the new
> nsec() system call biting you, is it?

No, I carefully did the Macarena around that mess ;-)

--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
From: Steve Simon @ 2014-05-26 23:47 UTC
To: 9fans

Ok,

Just thought I would ask; 9pccpuf is not built by the labs,
so you would need to rebuild it by hand.

Worth a try.

-Steve
* Re: [9fans] Too many checkpages() diagnostics ...
From: Lyndon Nerenberg @ 2014-05-27 0:27 UTC
To: Fans of the OS Plan 9 from Bell Labs

On May 26, 2014, at 4:47 PM, Steve Simon <steve@quintile.net> wrote:

> Just thought I would ask, 9pccpuf is not built by the labs
> so you would need to rebuild it by hand.

It's not rebuilt, which is a shame, since I'm pretty sure this must be the kernel they run on their file servers. If not, I would really like to see what they *are* running.

I recall there used to be a mk target that would rebuild all the kernel configs. I.e. everything in CONFLIST. It would be nice if that came back.

--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
From: Anthony Sorace @ 2014-05-27 13:08 UTC
To: Fans of the OS Plan 9 from Bell Labs

> I recall there used to be a mk target that would rebuild all the kernel configs. I.e. everything in CONFLIST. It would be nice if that came back.

I believe 'mk all' in /sys/src/9/<whatever> will still do this.
* Re: [9fans] Too many checkpages() diagnostics ...
From: Lyndon Nerenberg @ 2014-05-27 22:56 UTC
To: Fans of the OS Plan 9 from Bell Labs

On May 27, 2014, at 6:08 AM, Anthony Sorace <a@9srv.net> wrote:

> I believe 'mk all' in /sys/src/9/<whatever> will still do this.

So there is. (And 'installall'.) Sorry for not seeing this :-P
* Re: [9fans] Too many checkpages() diagnostics ...
From: Charles Forsyth @ 2014-05-27 10:43 UTC
To: Fans of the OS Plan 9 from Bell Labs

On 27 May 2014 00:41, Steve Simon <steve@quintile.net> wrote:

> it's not the lack of the new
> nsec() system call biting you, is it?

that wouldn't lead to checkpages faults, which appear when processes trap on bad addresses. i'd suspect an inconsistency between the source (eg, paging or lock data structures) and existing object files. It could be that some other structure has changed (for instance Block acquired a magic value a few months ago). Ordinarily I'd expect that to be caught by the -T compilation option and loader checks, but perhaps those aren't on by default.
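To make the "stale object file" suspicion concrete, here is a minimal sketch of what goes wrong when code compiled against an old structure layout touches memory laid out by the new one. The two structs are hypothetical stand-ins, not the real Plan 9 Block, and putting the magic value first is purely an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts for illustration only; not the real Plan 9 Block. */
typedef struct OldBlock OldBlock;
struct OldBlock {
	OldBlock *next;
	unsigned char *rp;	/* read pointer */
	unsigned char *wp;	/* write pointer */
};

typedef struct NewBlock NewBlock;
struct NewBlock {
	unsigned long magic;	/* assumed new leading field */
	NewBlock *next;
	unsigned char *rp;
	unsigned char *wp;
};

/*
 * Mimic an object file that was compiled against OldBlock: it fetches
 * "rp" at the old offset, even though the memory is now a NewBlock.
 */
unsigned char*
stale_read_rp(NewBlock *b)
{
	unsigned char *raw = (unsigned char*)b;
	unsigned char *rp;

	assert(b != NULL);
	memcpy(&rp, raw + offsetof(OldBlock, rp), sizeof rp);
	return rp;
}
```

With these layouts the stale reader picks up whatever now lives at the old offset (here the next pointer) instead of rp, which is the kind of silent mismatch that type signatures checked at link time are meant to catch.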
* Re: [9fans] Too many checkpages() diagnostics ...
From: lucio @ 2014-05-27 6:46 UTC
To: 9fans

> Is anyone else seeing this? I'm running bleeding edge labs code,
> compiled from a pull from this afternoon. (And I have been running
> very up-to-date labs pulls all the way along.)

The dns failures occur this side too, once, sometimes a few times a day. The responsible party is the auth/cpu/file server, where I run the network's DNS service, especially for the foreign hosts (NetBSD - practically idle - and UBUNTU in various guises). It is not running the new kernel with NSEC, yet (it's got NANOTIME, instead, same thing, really).

I'm tempted to suggest that the UBUNTU machines are tickling a bug in dns, as they seem to spend a lot of time reaching into the Internet, much more than I'm comfortable with.

L.
* Re: [9fans] Too many checkpages() diagnostics ...
From: lucio @ 2014-05-27 6:57 UTC
To: 9fans

> The dns failures occur this side too, once, sometimes a few times a
> day.

It was more frequent when there was a duplicate entry in /lib/ndb/kestell (happens to be the description of my local network), it's improved since I fixed that. There may still be some trouble in the database, but I could not spot any errors.

It's unfortunate that ndb/dns isn't (for me) the industrial-strength utility I normally expect from Plan 9. But it's too vast and too ambitious for me to debug, even if just to make it more strict about its configuration.

L.
* Re: [9fans] Too many checkpages() diagnostics ...
From: Lyndon Nerenberg @ 2014-05-27 20:52 UTC
To: Fans of the OS Plan 9 from Bell Labs

On May 26, 2014, at 11:57 PM, lucio@proxima.alt.za wrote:

> It was more frequent when there was a duplicate entry in
> /lib/ndb/kestell (happens to be the description of my local network),
> it's improved since I fixed that. There may still be some trouble in
> the database, but I could not spot any errors.

For me, ndb/dns rarely trips the diagnostic. In almost every case I've examined, the diagnostic is triggered when something is in the process of forking. The rfcmirror/idmirror scripts trip it with all the cp commands they run (although it's rc itself that's breaking). mk blows up when doing the fork/exec of the build recipe commands.

I'm trying to collect a set of stack traces to see if I can find a pattern.

--lyndon
* Re: [9fans] Too many checkpages() diagnostics ...
From: erik quanstrom @ 2014-05-27 13:36 UTC
To: 9fans

On Mon May 26 19:16:22 EDT 2014, lyndon@orthanc.ca wrote:

> For the last couple of days I have been plagued by many many diagnostics from checkpages(), in conjunction with things like:
>
> rc: note: sys: trap: fault read addr=0x0 pc=0x000101c4
> rc 50675: suicide: sys: trap: fault read addr=0x0 pc=0x000101c4

acid says that this is an abort.

	; acid /n/sources/plan9/386/bin/rc
	/n/sources/plan9/386/bin/rc:386 plan 9 executable
	/sys/lib/acid/port
	/sys/lib/acid/386
	acid; src(0x000101c4)
	/sys/src/libc/9sys/abort.c:6
	 1	#include <u.h>
	 2	#include <libc.h>
	 3	void
	 4	abort(void)
	 5	{
	>6		while(*(int*)0)
	 7			;
	 8	}

the problem is without a backtrace, there are a few too many possibilities. if the abort is legit, these would be good candidates

- notifyf (plan9.c)
- _vsaop (not very likely)
- assert:
	io.c:101: assert(b->fd == -1 || b->bufp > b->buf);
	pcmd.c:24: assert(f != nil);

but ...

> The kernel print buffer holds corresponding entries like:
>
> coral# 10618 dns: checked 136 page table entries
> dns 10618: suicide: sys: trap: fault write addr=0x0 pc=0x00015cea

	/sys/src/libc/port/pool.c:974
	 969		return a;
	 970	}
	 971
	 972	/* poolallocl: attempt to allocate block to hold dsize user bytes; assumes lock held */
	 973	static void*
	>974	poolallocl(Pool *p, ulong dsize)
	 975	{
	 976		ulong bsize;
	 977		Free *fb;
	 978		Alloc *ab;
	 979
	acid; asm(0x00015cea)
	poolallocl 0x00015cea	SUBL	$0x1c,SP
	poolallocl+0x3 0x00015ced	MOVL	dsize+0x4(FP),DX
	poolallocl+0x7 0x00015cf1	CMPL	DX,$0x80000000
	poolallocl+0xd 0x00015cf7	JCS	poolallocl+0x22(SB)

this one doesn't make any sense, unless the stack ptr is smashed.
> 26591 rfcmirror: checked 270 page table entries
> 37326 rc: checked 51 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 47773 rc: checked 57 page table entries
> 50675 rc: checked 53 page table entries

ah. this is starting to make some sense. remember above, there was an abort in notifyf? that was if the trap depth got too deep. the problem is we would need to see 33 events for pid 47773, but we don't.

i had a very similar problem under vbox on osx, and the solution was to use gorka's ancient fix, which basically avoids clearing PTEs which do not have the PteP bit set.

there are substantial differences between the pc and nix kernels here. so for example mmuptefree() looks fishy to me since it clears pages not present. but i'm not sure.
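One way to check an event count like that against a captured console log is to tally the "checked N page table entries" lines per pid. The following is only a sketch: the function name count_checked is made up here, and the line format is assumed from the log excerpt quoted in this thread rather than taken from the kernel source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count lines of the form
 *	<pid> <name>: checked <n> page table entries
 * in text that belong to the given pid. Lines that do not match
 * (shell prompts, suicide notes, blank lines) are ignored.
 */
int
count_checked(const char *text, int pid)
{
	int count = 0;

	assert(text != NULL);
	while(*text != '\0'){
		char line[256], name[32];
		int p, n;
		size_t len = strcspn(text, "\n");
		size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;

		/* work on a private copy so sscanf never runs past this line */
		memcpy(line, text, copy);
		line[copy] = '\0';
		if(sscanf(line, "%d %31[^:]: checked %d", &p, name, &n) == 3 && p == pid)
			count++;
		text += len;
		if(*text == '\n')
			text++;
	}
	return count;
}
```

Feeding it the saved kernel print buffer and a pid gives the number of checkpages() reports for that process, which can then be compared with whatever trap limit rc enforces.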
- erik

the applied patch is /n/atom/patch/applied/vboxmmu

; diff -c mmu.c.orig mmu.c
mmu.c.orig:87,93 - mmu.c:87,93
  }
  
  void
- mmuflushtlb(uintmem)
+ xmmuflushtlb(uintmem)
  {
  
  	m->tlbpurge++;
mmu.c.orig:98,104 - mmu.c:98,122
  	putcr3(m->pml4->pa);
  }
  
+ /* hack for vbox */
  void
+ mmuflushtlb(uintmem)
+ {
+ 	int i;
+ 	PTE *pte;
+ 
+ 	m->tlbpurge++;
+ 	if(m->pml4->daddr){
+ 		pte = UINT2PTR(m->pml4->va);
+ 		for(i = 0; i < m->pml4->daddr; i++)
+ 			if(pte[i] & PteP)
+ 				pte[i] = 0;
+ 		m->pml4->daddr = 0;
+ 	}
+ 	putcr3(m->pml4->pa);
+ }
+ 
+ void
  mmuflush(void)
  {
  	Mpl pl;
mmu.c.orig:259,264 - mmu.c:277,283
  void
  mmuswitch(Proc* proc)
  {
+ 	int i;
  	PTE *pte;
  	Page *page;
  	Mpl pl;
mmu.c.orig:270,276 - mmu.c:289,300
  	}
  
  	if(m->pml4->daddr){
- 		memset(UINT2PTR(m->pml4->va), 0, m->pml4->daddr*sizeof(PTE));
+ 		/* hack for vbox */
+ 		// memset(UINT2PTR(m->pml4->va), 0, m->pml4->daddr*sizeof(PTE));
+ 		pte = UINT2PTR(m->pml4->va);
+ 		for(i = 0; i < m->pml4->daddr; i++)
+ 			if(pte[i] & PteP)
+ 				pte[i] = 0;
  		m->pml4->daddr = 0;
  	}
  
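For readers without the nix tree to hand, the core of the patch can be tried out in user space. This is a standalone sketch, not kernel code: the PTE type, the helper name clear_present, and the choice of bit 0 for PteP are assumptions made here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t PTE;

enum { PteP = 1 << 0 };	/* present bit; assumed to be bit 0, as on x86 */

/*
 * Zero only those of the first n entries that have the present bit set,
 * leaving non-present entries exactly as they were. This replaces a
 * blanket memset over the same range. Returns how many were cleared.
 */
size_t
clear_present(PTE *pte, size_t n)
{
	size_t i, cleared = 0;

	assert(pte != NULL || n == 0);
	for(i = 0; i < n; i++)
		if(pte[i] & PteP){
			pte[i] = 0;
			cleared++;
		}
	return cleared;
}
```

The visible difference from the memset is that an entry that is non-zero but not marked present keeps its contents, so the loop writes to fewer table slots than the blanket clear did.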
Thread overview: 12+ messages

2014-05-26 23:14 [9fans] Too many checkpages() diagnostics Lyndon Nerenberg
2014-05-26 23:41 ` Steve Simon
2014-05-26 23:44   ` Lyndon Nerenberg
2014-05-26 23:47     ` Steve Simon
2014-05-27  0:27       ` Lyndon Nerenberg
2014-05-27 13:08         ` Anthony Sorace
2014-05-27 22:56           ` Lyndon Nerenberg
2014-05-27 10:43   ` Charles Forsyth
2014-05-27  6:46 ` lucio
2014-05-27  6:57   ` lucio
2014-05-27 20:52     ` Lyndon Nerenberg
2014-05-27 13:36 ` erik quanstrom