9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] web site being down & reliability
From: Russ Cox @ 2006-03-09  6:54 UTC
  To: 9fans

To the people on IRC who were grumbling this morning:
if it seriously impacts your quality of life when
plan9.bell-labs.com is off the net for a few hours, you
should stop reading this email, unsubscribe from 9fans,
delete your IRC client, and find something more healthy to
occupy your time.


That being said, a status report of sorts.

The outside internet gateway machine (aka plan9.bell-labs.com)
gets hammered with more network traffic than perhaps any
other Plan 9 machine in the world, both volume and variety
of traffic.  Problems that have been latent for years often
manifest on that machine.  So it goes down more often than
your own terminals or cpu servers might.

It's still up the vast majority of the time.  I now run a
script that checks that it can fetch the main web page every
five minutes, and in the last 38 days the web site has been
missing for six hours: five minutes on seven separate
occasions on February 16-17, fifteen minutes on March 4, and
two and a half hours this morning.  That's more than I'd
like, but it was up 99.4% of the time.  A regular old panic
would have only been five or ten minutes of outage this
morning instead of two and a half hours (couldn't reboot
until Jim got to work today and manually power-cycled
the machine).

The outside machine was running some fixes to the pc mmu
code that I was testing before pushing out.  I don't think
they caused the wedge (they're just some splhi/splx around
possibly sensitive code), as the machine had been up for ten
days since I booted the new kernel.

The mmu splhi changes attempt to solve a problem with page
faults happening inside putmmu when it accesses the VPT.  I
believe that if an interrupt happens in putmmu, the process
should be rescheduled correctly and putmmu should pick up
where it left off without problem.  In practice, every page
fault we saw happened only after the process had been
rescheduled during putmmu, so I made the processor go splhi.
I don't fully understand what's going on, but I've looked
and looked.  The problem only seems to manifest itself when
the gateway machine is being really heavily pounded on, like
when Google is crawling the new web trees.
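
In outline, the change amounts to the sketch below.  The real putmmu
in /sys/src/9/pc/mmu.c does considerably more than this; the single
vpt[] store is only illustrative.

#include	"u.h"
#include	"../port/lib.h"
#include	"mem.h"
#include	"dat.h"
#include	"fns.h"

/*
 * Outline only: raise the priority level so the process cannot
 * be rescheduled while the VPT is being touched.  The single
 * vpt[] store stands in for the real page-table work.
 */
void
putmmu(ulong va, ulong pa, Page *pg)
{
	int s;

	s = splhi();			/* no interrupts, so no reschedule mid-update */
	vpt[VPTX(va)] = pa|PTEVALID;	/* illustrative only */
	splx(s);			/* restore the previous priority level */
	USED(pg);
}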

We recently fixed a bug that caused problems if machines
with large memories had been running for a very long time
and finally ran out of (executable) image cache entries.
Imagereclaim would have a lot of work to do and would
eventually get interrupted holding a critical lock (palloc),
and then you'd get an endless run of lock loops with no hope
of recovery.  This is fixed twice over: processes holding
palloc can no longer be rescheduled, and imagereclaim stops
after reclaiming 1000 images.
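
Roughly, the second of those fixes has the shape below.  The list
walk and the field names are simplified stand-ins for the real port
kernel code, and the first fix (no rescheduling while palloc is
held) is not shown.

enum {
	Maxreclaim	= 1000,		/* give up after this much work */
};

static void
imagereclaim(void)
{
	int n;
	Page *p;

	n = 0;
	lock(&palloc);
	for(p = palloc.head; p != nil; p = p->next){
		if(p->ref == 0 && p->image != nil){
			uncachepage(p);		/* detach the page from its image */
			if(++n >= Maxreclaim)
				break;		/* bound the time spent holding palloc */
		}
	}
	unlock(&palloc);
}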

There appears to be a slow memory leak somewhere in
the kernel.  The down time on March 4 was not because
the gateway machine crashed but because the internal
machine that hands the gateway its web files had run out of
memory.  We still haven't found this bug, nor have we tried
very hard to track it down.  We've seen other machines
panic with this too, all once the machine has been up a
long time (pids in the tens of millions).

Geoff Collyer has been seeing all kinds of weird memory
faults on his two machines, but he's using ECC RAM so we
think that the hardware should be okay.  This has been
going on since before the pc mmu changes, so it's hard
to imagine what could be going wrong in Plan 9 itself.
If other people are seeing weird behavior, do let us know.

Thanks.
Russ




* [9fans] 9p flush
From: rog @ 2006-03-09 18:30 UTC
  To: 9fans

i've been implementing some generic mechanisms for serving 9p.

from flush(5):

          The server must answer the flush message immediately.

which i interpret to mean "no other messages should be processed in
the meantime".

how important is this?

the problem is that in a multithreaded environment, the thread
performing a request must be notified that it has been flushed, and
given a chance to try and abort its current action.

if it happens that the action has already completed, then it should
be allowed to reply before the flush reply, thus letting the client
know correctly about the state change.  the problem is that this means
that there can be an arbitrary delay before the flush is replied to
(although i'm still ensuring that flush messages are answered in
order).
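
to make that concrete, the shape of what i'm doing is roughly this
(the types and the two helpers are invented for the example):

typedef struct Req Req;
struct Req {
	int	tag;		/* tag of the original T-message */
	int	done;		/* its R-message has already gone out */
	int	flushing;	/* a Tflush for it is outstanding */
	int	flushtag;	/* tag of that Tflush */
};

void	replyflush(int tag);		/* invented: writes an Rflush */
void	interruptworker(Req*);		/* invented: asks the worker to abort */

/* called when a Tflush arrives for the request old */
void
tflush(Req *old, int flushtag)
{
	if(old == nil || old->done){
		/* nothing in flight: flush straight away */
		replyflush(flushtag);
		return;
	}
	old->flushing = 1;
	old->flushtag = flushtag;
	interruptworker(old);
	/*
	 * the worker finishes or aborts, sends its own R-message first,
	 * and only then is replyflush(old->flushtag) called, so the
	 * client hears about any state change before the Rflush.
	 */
}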

does doing things this way break any aspect of the protocol?  if it
does not, then might i suggest that that line of the manual page be
changed to:

	The server must answer the flush message as soon as possible.




* Re: [9fans] 9p flush
From: jmk @ 2006-03-09 18:40 UTC
  To: 9fans

When implementing 9P2000 in the old fileserver and fossil, there
were many places the protocol was not specific enough about what
to do. Most were cleared up. I dealt with this exact issue in fossil,
and it was decided that the correct behavior is to reply to the
original operation if it has completed and then reply to the flush.
See /sys/src/cmd/fossil/9proc.c:^msgFlush for the lowdown.

--jim


On Thu Mar  9 13:30:43 EST 2006, rog@vitanuova.com wrote:
> i've been implementing some generic mechanisms for serving 9p.
>
> from flush(5):
>
>           The server must answer the flush message immediately.
>
> which i interpret to mean "no other messages should be processed in
> the meantime".
>
> how important is this?
>
> the problem is that in a multithreaded environment, the thread
> performing a request must be notified that it has been flushed, and
> given a chance to try and abort its current action.
>
> if it happens that the action has already completed, then it should
> be allowed to reply before the flush reply, thus letting the client
> know correctly about the state change.  the problem is that this means
> that there can be an arbitrary delay before the flush is replied to
> (although i'm still ensuring that flush messages are answered in
> order).
>
> does doing things this way break any aspect of the protocol?  if it
> does not, then might i suggest that that line of the manual page be
> changed to:
>
> 	The server must answer the flush message as soon as possible.



* Re: [9fans] 9p flush
From: Russ Cox @ 2006-03-09 18:55 UTC
  To: 9fans

The only requirement is that flush is handled in a timely fashion.
It can't block indefinitely and it shouldn't block on anything
it doesn't have to.  For example, in the worm file server,
if you flush a read, the file server doesn't interrupt the read
but it does mark the read as flushed and reply to the flush
immediately.  It will then not reply to the read once the worm
data arrives.  It would be as correct, though more frustrating
to users typing DEL, if the flush just waited on the worm data.
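
In outline (all the names below are invented for the example):

typedef struct Wread Wread;
struct Wread {
	int	tag;		/* tag of the pending Tread */
	int	flushed;	/* client flushed it; drop the data later */
};

void	replyflush(int tag);			/* invented: writes the Rflush */
void	replyread(int tag, void*, long);	/* invented: writes the Rread */

/* Tflush arrives while the worm read is still outstanding */
void
flushread(Wread *r, int flushtag)
{
	r->flushed = 1;
	replyflush(flushtag);	/* answer the flush immediately */
}

/* the worm finally delivers the block */
void
readdone(Wread *r, void *data, long n)
{
	if(r->flushed)
		return;		/* the client gave up; throw the data away */
	replyread(r->tag, data, n);
}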

It's not the case that you can't reply to other messages
between receiving a Tflush and sending the Rflush.
That would be unreasonable since there might be
R-messages in flight that would appear to have been
sent between the two events.

The ultimate reason for all this is that if you have a process
blocked on a 9P message and it gets interrupted, the kernel
sends a Tflush and cannot let the process handle the interrupt
until the Rflush is received.  So it shouldn't take arbitrarily long,
and the faster you can reply the better.

The kernel must wait for the Rflush to find out whether the
op is going to complete successfully.  For example, if you
flush a Twalk, it's important to know whether the Rwalk
comes back or not.  If you get the Rwalk before the Rflush,
then the walk succeeded and the fid has moved (or a new fid was
created).  If you get the Rflush with no Rwalk, you know the
Rwalk isn't coming and that the walk never happened.
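
From the client's side the decision looks roughly like this; readmsg
stands in for however the client pulls the next R-message off the
connection.

#include <u.h>
#include <libc.h>
#include <fcall.h>

void	readmsg(Fcall*);	/* stand-in: reads the next R-message */

int
walkhappened(int walktag, int flushtag)
{
	Fcall f;

	for(;;){
		readmsg(&f);
		if(f.tag == walktag)
			return f.type == Rwalk;	/* Rwalk: fid moved; Rerror: it didn't */
		if(f.tag == flushtag && f.type == Rflush)
			return 0;	/* no Rwalk is coming; the walk never happened */
		/* replies with other tags belong to other outstanding requests */
	}
}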

You can wait for the Rwhatever to go out before you respond
to the Rflush if you want, as long as it's not going to block
indefinitely.  This is what lib9p will do for you if you don't
give a flush handler in the Srv structure.
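
If you do provide one, it has roughly this shape (see 9p(2) for the
exact obligations; this assumes the old request has not already been
responded to elsewhere):

#include <u.h>
#include <libc.h>
#include <fcall.h>
#include <thread.h>
#include <9p.h>

/* r->oldreq is the request being flushed */
static void
fsflush(Req *r)
{
	if(r->oldreq != nil)
		respond(r->oldreq, "interrupted");	/* answer the old request first */
	respond(r, nil);				/* then complete the Tflush */
}

Srv fs = {
	.flush=	fsflush,
	/* .attach, .read, .write, etc. elided */
};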

It is more important to get the semantics right than the timing.

Russ




* Re: [9fans] 9p flush
From: Ronald G Minnich @ 2006-03-09 18:59 UTC
  To: Fans of the OS Plan 9 from Bell Labs

rog@vitanuova.com wrote:

> flush is such fun.
>

ah, but so glad we have it.

NFS sure could have used such a thing.

ron



* Re: [9fans] 9p flush
From: rog @ 2006-03-09 19:03 UTC
  To: 9fans

> It is more important to get the semantics right than the timing.

that's good.

my particular scenario is that i'm writing a 9p server that is also a
9p client.  if a particular request is flushed, then the process
dealing with that request needs to be able to send a flush request down the
client 9p connection, and wait for a reply in order to decide how to
reply to the original flush request.

as far as i understand it, this should be fine.
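
roughly this, with all the helper names invented:

typedef struct Preq Preq;
struct Preq {
	int	tag;	/* tag on the server (upstream) side */
	int	ctag;	/* tag used on the client (downstream) side */
	int	done;	/* the downstream reply has already arrived */
};

void	sendtflush(int ctag);		/* invented: flush the downstream request */
void	waitrflush(int ctag);		/* invented: wait for the downstream Rflush */
void	replyoriginal(Preq*);		/* invented: forward the completed reply upstream */
void	replyflush(int flushtag);	/* invented: send our own Rflush */

/* a Tflush arrives for one of the requests we have proxied */
void
proxyflush(Preq *p, int flushtag)
{
	sendtflush(p->ctag);
	waitrflush(p->ctag);		/* either the real reply arrived by now or it never will */
	if(p->done)
		replyoriginal(p);	/* the downstream request completed: pass that on first */
	replyflush(flushtag);		/* then answer the flush we were sent */
}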

flush is such fun.




* Re: [9fans] web site being down & reliability
From: Russ Cox @ 2006-03-09 21:37 UTC
  To: 9fans

> There appears to be a slow memory leak somewhere in
> the kernel.  The down time on March 4 was not because
> the gateway machine crashed but because the internal
> machine that hands the gateway its web files had run out of
> memory.  We still haven't found this bug, nor have we tried
> very hard to track it down.  We've seen other machines
> panic with this too, all once the machine has been up a
> long time (pids in the tens of millions).

I found this today using the new kmem(1) command.
Approximately 32 bytes per attach.  Sources fixed but
no new kernels yet.

Russ



