From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1cc43661a51e16266fb8b944cd1827ec@swtch.com>
To: 9fans@cse.psu.edu
From: "Russ Cox"
Date: Thu, 9 Mar 2006 01:54:52 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] web site being down & reliability
Topicbox-Message-UUID: 100e425c-ead1-11e9-9d60-3106f5b1d025

To the people on IRC who were grumbling this morning: if it seriously impacts your quality of life when plan9.bell-labs.com is off the net for a few hours, you should stop reading this email, unsubscribe from 9fans, delete your IRC client, and find something more healthy to occupy your time.

That being said, a status report of sorts.

The outside internet gateway machine (aka plan9.bell-labs.com) gets hammered with more network traffic than perhaps any other Plan 9 machine in the world, both volume and variety of traffic. Problems that have been latent for years often manifest on that machine. So it goes down more often than your own terminals or cpu servers might.

It's still up the vast majority of the time. I now run a script that checks that it can fetch the main web page every five minutes, and in the last 38 days the web site has been missing for six hours: five minutes on seven separate occasions on February 16-17, fifteen minutes on March 4, and two and a half hours this morning. That's more than I'd like, but it was up 99.4% of the time.

A regular old panic would have only been five or ten minutes of outage this morning instead of two and a half hours (couldn't reboot until Jim got to work today and manually power-cycled the machine). The outside machine was running some fixes to the pc mmu code that I was testing before pushing out. I don't think they caused the wedge (they're just some splhi/splx around possibly sensitive code), as the machine had been up for ten days since I booted the new kernel.

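For anyone who wants a similar check on their own machines: the five-minute fetch test mentioned above can be sketched roughly as follows. This is not the script actually in use, which the message does not show. It is a minimal Python sketch, and the URL, the 30-second timeout, and the names site_is_up, uptime_percent, and monitor are all assumptions made for illustration.

```python
import http.client
import time
import urllib.request

URL = "http://plan9.bell-labs.com/"  # assumed URL for "the main web page"
INTERVAL = 5 * 60                    # five minutes, as in the message


def site_is_up(url, timeout=30, opener=urllib.request.urlopen):
    """Return True if the page can be fetched with an HTTP 200 reply."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def uptime_percent(results):
    """Percentage of successful checks in a list of booleans."""
    if not results:
        raise ValueError("no checks recorded")
    return 100.0 * sum(results) / len(results)


def monitor(url=URL, interval=INTERVAL, checks=None,
            sleep=time.sleep, opener=urllib.request.urlopen):
    """Poll url every interval seconds and return the recorded results.

    With checks=None the loop never ends; pass a number to stop after
    that many probes.
    """
    results = []
    while checks is None or len(results) < checks:
        results.append(site_is_up(url, opener=opener))
        sleep(interval)
    return results
```

Calling monitor(checks=288) would cover one day at this interval, and uptime_percent on the returned list gives the fraction of probes that succeeded. A probe this coarse can only resolve outages to the nearest five minutes, which is presumably why short outages show up as five-minute units.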
The mmu splhi changes attempt to solve a problem with page faults happening inside putmmu when it accesses the VPT. I believe that if an interrupt happens in putmmu, the process should be rescheduled correctly and putmmu should pick up where it left off without problem. In practice, every page fault we saw happened only after the process had been rescheduled during putmmu, so I made the processor go splhi. I don't fully understand what's going on, but I've looked and looked. The problem only seems to manifest itself when the gateway machine is being really heavily pounded on, like when Google is crawling the new web trees.

We recently fixed a bug that caused problems if machines with large memories had been running for a very long time and finally ran out of (executable) image cache entries. Imagereclaim would have a lot of work to do and would eventually get interrupted while holding a critical lock (palloc), and then you'd get an endless run of lock loops with no hope of recovery. This is fixed twice over: processes holding palloc can no longer be rescheduled, and imagereclaim stops after reclaiming 1000 images.

There appears to be a slow memory leak somewhere in the kernel. The downtime on March 4 was not because the gateway machine crashed but because the internal machine that hands the gateway web files had run out of memory. We still haven't found this bug, nor have we tried very hard to track it down. We've seen other machines panic with this too, all once the machine has been up a long time (pids in the tens of millions).

Geoff Collyer has been seeing all kinds of weird memory faults on his two machines, but he's using ECC RAM so we think that the hardware should be okay. This has been going on since before the pc mmu changes, so it's hard to imagine what could be going wrong in Plan 9 itself.

If other people are seeing weird behavior, do let us know.

Thanks.
Russ