From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1cc43661a51e16266fb8b944cd1827ec@swtch.com>
To: 9fans@cse.psu.edu
From: "Russ Cox" <rsc@swtch.com>
Date: Thu,  9 Mar 2006 01:54:52 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] web site being down & reliability
Topicbox-Message-UUID: 100e425c-ead1-11e9-9d60-3106f5b1d025

To the people on IRC who were grumbling this morning:
if it seriously impacts your quality of life when
plan9.bell-labs.com is off the net for a few hours, you
should stop reading this email, unsubscribe from 9fans,
delete your IRC client, and find something more healthy to
occupy your time.


That being said, a status report of sorts.

The outside internet gateway machine (aka plan9.bell-labs.com)
gets hammered with more network traffic than perhaps any
other Plan 9 machine in the world, in both volume and
variety.  Problems that have been latent for years often
manifest on that machine.  So it goes down more often than
your own terminals or cpu servers might.

It's still up the vast majority of the time.  I now run a
script that checks that it can fetch the main web page every
five minutes, and in the last 38 days the web site has been
missing for six hours: five minutes on seven separate
occasions on February 16-17, fifteen minutes on March 4, and
two and a half hours this morning.  That's more than I'd
like, but it was up 99.4% of the time.  A regular old panic
would have meant only five or ten minutes of outage this
morning instead of two and a half hours (we couldn't reboot
until Jim got to work today and manually power-cycled
the machine).
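
For anyone who wants the same kind of number for their own
machine, here is a minimal sketch in Python of a five-minute
check and the uptime arithmetic.  It is not the script
described above; page_is_up, summarize, and the sample data
are made up for illustration, and a failed check is assumed
to count as one whole polling interval of downtime.

```python
from urllib.request import urlopen

POLL_MINUTES = 5  # polling interval mentioned in the text


def page_is_up(url, timeout=30):
    """Return True if an http(s) URL can be fetched with a 2xx status.

    Hypothetical helper: any network or HTTP error counts as "down".
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def summarize(polls, poll_minutes=POLL_MINUTES):
    """Turn a list of check results into (downtime_minutes, uptime_percent).

    polls holds one boolean per check, True meaning the fetch worked.
    Assumption: each failed check is charged a full polling interval.
    """
    if not polls:
        raise ValueError("no polls recorded")
    failed = sum(1 for ok in polls if not ok)
    downtime = failed * poll_minutes
    uptime = 100.0 * (len(polls) - failed) / len(polls)
    return downtime, uptime


if __name__ == "__main__":
    # Made-up sample: one day of checks (288 polls) with two failures.
    sample = [True] * 286 + [False] * 2
    downtime, uptime = summarize(sample)
    print(f"{downtime} minutes down, {uptime:.2f}% up")
```

Run from cron every five minutes, page_is_up would supply
one result per check and summarize would turn the log into
a total.  With polling this coarse, an outage shorter than
five minutes can be missed entirely or charged as a full
five minutes.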

The outside machine was running some fixes to the pc mmu
code that I was testing before pushing out.  I don't think
they caused the wedge (they're just some splhi/splx around
possibly sensitive code), as the machine had been up for ten
days since I booted the new kernel.

The mmu splhi changes attempt to solve a problem with page
faults happening inside putmmu when it accesses the VPT.  I
believe that if an interrupt happens in putmmu, the process
should be rescheduled correctly and putmmu should pick up
where it left off without problem.  In practice, every page
fault we saw happened only after the process had been
rescheduled during putmmu, so I made the processor go splhi.
I don't fully understand what's going on, but I've looked
and looked.  The problem only seems to manifest itself when
the gateway machine is being really heavily pounded on, like
when Google is crawling the new web trees.

We recently fixed a bug that caused problems if machines
with large memories had been running for a very long time
and finally ran out of (executable) image cache entries.
Imagereclaim would have a lot of work to do and would
eventually get interrupted holding a critical lock (palloc),
and then you'd get an endless run of lock loops with no hope
of recovery.  This is fixed twice over: processes holding
palloc can no longer be rescheduled, and imagereclaim stops
after reclaiming 1000 images.
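
The second half of that fix, capping the work done per pass
so the lock is never held for long, can be sketched as
follows.  This is a toy model in Python, not the kernel
code (which is C): ImageCache, its fields, and the plain
mutex standing in for palloc are all hypothetical, and the
"no rescheduling while holding palloc" half of the fix is
not modeled at all.

```python
import threading

RECLAIM_LIMIT = 1000  # per-pass cap, the figure quoted in the text


class ImageCache:
    """Toy stand-in for the image cache; not the kernel's data structure."""

    def __init__(self, images):
        self.lock = threading.Lock()  # plays the role of palloc
        self.idle = list(images)      # entries eligible for reclaim
        self.free = []                # entries already reclaimed

    def reclaim(self, limit=RECLAIM_LIMIT):
        """Reclaim at most `limit` idle entries, then drop the lock.

        Returns the number reclaimed.  Bounding the loop bounds how
        long the lock is held, however much work is outstanding.
        """
        with self.lock:
            n = 0
            while self.idle and n < limit:
                self.free.append(self.idle.pop())
                n += 1
            return n


if __name__ == "__main__":
    cache = ImageCache(range(2500))  # made-up backlog of 2500 entries
    while True:
        done = cache.reclaim()
        if done == 0:
            break
        print("reclaimed", done)
```

A large backlog is then worked off over several short
passes instead of one long one, so other users of the lock
get a turn between passes.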

There appears to be a slow memory leak somewhere in
the kernel.  The down time on March 4 was not because
the gateway machine crashed but because the internal
machine that hands the gateway web files had run out of
memory.  We still haven't found this bug, nor have we tried
very hard to track it down.  We've seen other machines
panic with this too, all once the machine has been up a
long time (pids in the tens of millions).

Geoff Collyer has been seeing all kinds of weird memory
faults on his two machines, but he's using ECC RAM so we
think that the hardware should be okay.  This has been
going on since before the pc mmu changes, so those aren't
to blame, and it's hard to imagine what could be going
wrong in Plan 9 itself.
If other people are seeing weird behavior, do let us know.

Thanks.
Russ