9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] clunk clunk
@ 2006-01-03 19:28 Bruce Ellis
  2006-01-03 19:38 ` Sascha Retzki
  2006-01-03 19:46 ` Russ Cox
  0 siblings, 2 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-03 19:28 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Why do I have a few hundred processes that are exiting
but can't find the door (thanks joe walsh)?

When a process exits it closes its fds, so it sends a
Tclunk ...but if the process it sends it to is exiting it can't
respond.   It's in the same position.  Call it "deadly embrace".

Think about it.
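
A toy sketch of the embrace in Plan 9 C (ordinary pipes standing in
for 9P; a hypothetical illustration, not anyone's actual test
programs): each process acts as the other's "server", but at exit it
sends its own clunk request first and waits for the reply before it
will service anything, so both block forever.

#include <u.h>
#include <libc.h>

/*
 * Toy sketch only.  Pipe a carries the parent's pretend Tclunk to the
 * child; pipe b carries the child's to the parent.  Neither side ever
 * reads the request aimed at it, so both reads block forever.
 */
static void
dying(int fd, char *who)
{
	char r;

	fprint(2, "%s: sending Tclunk, waiting for Rclunk\n", who);
	write(fd, "T", 1);	/* my clunk request to the peer */
	read(fd, &r, 1);	/* the peer is stuck in the same read */
	fprint(2, "%s: got Rclunk (never printed)\n", who);
}

void
main(void)
{
	int a[2], b[2];

	if(pipe(a) < 0 || pipe(b) < 0)
		sysfatal("pipe: %r");
	switch(rfork(RFPROC|RFFDG)){
	case -1:
		sysfatal("rfork: %r");
	case 0:
		dying(b[0], "child");	/* child's request travels on pipe b */
		exits(nil);
	default:
		dying(a[1], "parent");	/* parent's request travels on pipe a */
		exits(nil);
	}
}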

I fixed it in Inferno years ago.

Nobody checks the return value of close() - I haven't grepped
around to see if someone does, but Rclunk or Rerror are silly
to wait for, particularly if they aren't gonna happen.

There is no close() in Inferno so the garbage collector is responsible.

The code is in /usr/inferno at the labs.  Maybe glenda needs it.

Opinions?

brucee



* Re: [9fans] clunk clunk
  2006-01-03 19:28 [9fans] clunk clunk Bruce Ellis
@ 2006-01-03 19:38 ` Sascha Retzki
  2006-01-03 19:46 ` Russ Cox
  1 sibling, 0 replies; 24+ messages in thread
From: Sascha Retzki @ 2006-01-03 19:38 UTC (permalink / raw)
  To: 9fans

On Wed, Jan 04, 2006 at 06:28:48AM +1100, Bruce Ellis wrote:
> When a process exits it closes it's fds, so it sends a
> Tclunk ...but if the process it sends it to is exiting it can't
> respond.   It's in the same position.  Call it "deadly embrace".

Definitely not a nice situation. But from my (oh $god knows) limited
understanding of the big picture, when does that happen? File server
shutdowns? Wouldn't it be much nicer to have a "don't accept new
connections, wait X seconds for the clients to clunk, and if time runs
out, just clunk" function? No other situation that I can think of
would result in what you described.


> There is no close() in Inferno so the garbage collector is responsible.

I don't know Limbo well, but I don't see how it could work without
close() at all. Otherwise, see my rave above.



* Re: [9fans] clunk clunk
  2006-01-03 19:28 [9fans] clunk clunk Bruce Ellis
  2006-01-03 19:38 ` Sascha Retzki
@ 2006-01-03 19:46 ` Russ Cox
  2006-01-03 20:07   ` Bruce Ellis
  1 sibling, 1 reply; 24+ messages in thread
From: Russ Cox @ 2006-01-03 19:46 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> When a process exits it closes it's fds, so it sends a
> Tclunk ...but if the process it sends it to is exiting it can't
> respond.   It's in the same position.  Call it "deadly embrace".
>
> Think about it.

This only happens if two file servers have mounted each other,
which creates many other possibilities for deadlock too.

Usually file servers are careful to dissociate from the name
spaces in which they mount themselves, so that they don't
access their own files and cause ref count problems.
This has the added effect that future servers that get mounted
into the name space don't end up mounted in the first server's
name space.  So these kinds of loops basically don't happen.
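
For concreteness, a minimal sketch of that convention (invented names,
not taken from any particular server): the serving process takes a
private copy of the name space before the mount happens, so neither
its own service nor anything mounted later in the parent's name space
is visible to it.

#include <u.h>
#include <libc.h>

void serveloop(int fd);	/* assumed: a 9P service loop reading requests from fd */

void
main(void)
{
	int p[2];

	if(pipe(p) < 0)
		sysfatal("pipe: %r");

	switch(rfork(RFPROC|RFNAMEG|RFFDG|RFNOTEG)){
	case -1:
		sysfatal("rfork: %r");
	case 0:
		/* server side: RFNAMEG gave us a private copy of the name
		 * space, so later mounts in the parent (including this one)
		 * never show up here and we never serve ourselves. */
		close(p[1]);
		serveloop(p[0]);
		exits(nil);
	default:
		/* client side: mount the server into this name space */
		close(p[0]);
		if(mount(p[1], -1, "/mnt/example", MREPL, "") < 0)
			sysfatal("mount: %r");
		break;
	}
	/* ... the rest of the program uses /mnt/example ... */
	exits(nil);
}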

What processes do you have running that are in this deadly
embrace?  Standard programs or ones you wrote?  If the
former, which ones?  Are you sure they're in Tclunk?

There is one exception in Plan 9: upas/fs and plumber have
each other mounted, so that plumber can send around
references to upas/fs's files.  It sometimes happens that
they end up sticking around just because of the circular
ref count, if somehow the session ends without a hangup
note being sent to the note group.  Even in this case, though,
the Tclunk thing doesn't happen, because plumber doesn't
keep any of upas/fs's files open.  It could possibly happen
if the plumber managed to get killed in the middle of walking
one of the upas paths during a stat, but that wouldn't happen
hundreds of times on a single system.

Instead of making us read through the Inferno code,
why not tell us what you did to fix it?  A separate kproc
to run all clunks?  Close all the non-devmnt chans first?

Russ



* Re: [9fans] clunk clunk
  2006-01-03 19:46 ` Russ Cox
@ 2006-01-03 20:07   ` Bruce Ellis
  2006-01-03 20:24     ` Russ Cox
  0 siblings, 1 reply; 24+ messages in thread
From: Bruce Ellis @ 2006-01-03 20:07 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Bullshit.  It happens all the time.  Rio+Plumber is a simple example.

brucee

On 1/4/06, Russ Cox <rsc@swtch.com> wrote:
> > When a process exits it closes it's fds, so it sends a
> > Tclunk ...but if the process it sends it to is exiting it can't
> > respond.   It's in the same position.  Call it "deadly embrace".
> >
> > Think about it.
>
> This only happens if two file servers have mounted each other,
> which creates many other possibilities for deadlock too.
>
> Usually file servers are careful to dissociate from the name
> spaces in which they mount themselves, so that they don't
> access their own files and cause ref count problems.
> This has the added effect that future servers that get mounted
> into the name space don't end up mounted in the first server's
> name space.  So these kind of loops basically don't happen.
>
> What processes do you have running that are in this deadly
> embrace?  Standard programs or ones you wrote?  If the
> former, which ones?  Are you sure they're in Tclunk?
>
> There is one exception in Plan 9: upas/fs and plumber have
> each other mounted, so that plumber can send around
> references to upas/fs's files.  It sometimes happens that
> they end up sticking around just because of the circular
> ref count, if somehow the session ends without a hangup
> note being sent to the note group.  Even in this case, though,
> the Tclunk thing doesn't happen, because plumber doesn't
> keep any of upas/fs's files open.  It could possibly happen
> if the plumber managed to get killed in the middle of walking
> one of the upas paths during a stat, but that wouldn't happen
> hundreds of times on a single system.
>
> Instead of making us read through the Inferno code,
> why not tell us what you did to fix it?  A separate kproc
> to run all clunks?  Close all the non-devmnt chans first?
>
> Russ
>



* Re: [9fans] clunk clunk
  2006-01-03 20:07   ` Bruce Ellis
@ 2006-01-03 20:24     ` Russ Cox
  2006-01-03 21:33       ` Bruce Ellis
  0 siblings, 1 reply; 24+ messages in thread
From: Russ Cox @ 2006-01-03 20:24 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Bullshit.  It happens all the time.  Rio+Plumber is a simple example.

In the standard profile, as set up by /sys/lib/newuser
or demonstrated in /usr/glenda/lib/profile,
plumber is run before rio starts, so rio has
plumber mounted but not vice versa:

% grep plumb /proc/^`{ps|grep rio|awk '{print $2}'}^/ns
/proc/86746/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
/proc/86768/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
/proc/86769/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
/proc/86770/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
/proc/86771/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
/proc/86772/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
/proc/86773/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
% grep rio /proc/^`{ps|grep plumb|awk '{print $2}'}^/ns
%

If you want to actually cause a loop, then you could run
plumber in your riostart (as in rio -i riostart) file.  But there
are plenty of ways to shoot yourself in the foot with
ref count loops if you really want to go there.  A much
more direct one is to run a file server that mounts
itself into its own name space.

Russ



* Re: [9fans] clunk clunk
  2006-01-03 20:24     ` Russ Cox
@ 2006-01-03 21:33       ` Bruce Ellis
  2006-01-03 21:47         ` jmk
  0 siblings, 1 reply; 24+ messages in thread
From: Bruce Ellis @ 2006-01-03 21:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I can do anything I like, Russ.  A fileserver is not the size of a fridge.
It's often a user process.  If you don't think this is a problem, well,
live with it.  I don't like rebooting a machine that's been running for
months because of this "bug".

Mother Mary may believe your explanation but it is a BUG which
I squashed.

brucee

On 1/4/06, Russ Cox <rsc@swtch.com> wrote:
> > Bullshit.  It happens all the time.  Rio+Plumber is a simple example.
>
> In the standard profile, as set up by /sys/lib/newuser
> or demonstrated in /usr/glenda/lib/profile,
> plumber is run before rio starts, so rio has
> plumber mounted but not vice versa:
>
> % grep plumb /proc/^`{ps|grep rio|awk '{print $2}'}^/ns
> /proc/86746/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> /proc/86768/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> /proc/86769/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> /proc/86770/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> /proc/86771/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> /proc/86772/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> /proc/86773/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> % grep rio /proc/^`{ps|grep plumb|awk '{print $2}'}^/ns
> %
>
> If you want to actually cause a loop, then you could run
> plumber in your riostart (as in rio -i riostart) file.  But there
> are plenty of ways to shoot yourself in the foot with
> ref count loops if you really want to go there.  A much
> more direct one is to run a file server that mounts
> itself into its own name space.
>
> Russ
>



* Re: [9fans] clunk clunk
  2006-01-03 21:33       ` Bruce Ellis
@ 2006-01-03 21:47         ` jmk
  2006-01-03 22:09           ` Bruce Ellis
  0 siblings, 1 reply; 24+ messages in thread
From: jmk @ 2006-01-03 21:47 UTC (permalink / raw)
  To: 9fans

Instead of engaging in handbags, why don't
you demonstrate the problem and how to fix it?

--jim

On Tue Jan  3 16:34:12 EST 2006, bruce.ellis@gmail.com wrote:
> I can do anything I like Russ.  A fileserver is not the size of a fridge.
> It's often a user process.  If you don't think this is a problem well
> live with it.  I don't like rebooting a machine that's been running for
> months because of this "bug".
> 
> Mother Mary may believe your explanation but it is a BUG which
> I squashed.
> 
> brucee
> 
> On 1/4/06, Russ Cox <rsc@swtch.com> wrote:
> > > Bullshit.  It happens all the time.  Rio+Plumber is a simple example.
> >
> > In the standard profile, as set up by /sys/lib/newuser
> > or demonstrated in /usr/glenda/lib/profile,
> > plumber is run before rio starts, so rio has
> > plumber mounted but not vice versa:
> >
> > % grep plumb /proc/^`{ps|grep rio|awk '{print $2}'}^/ns
> > /proc/86746/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > /proc/86768/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > /proc/86769/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > /proc/86770/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > /proc/86771/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > /proc/86772/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > /proc/86773/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > % grep rio /proc/^`{ps|grep plumb|awk '{print $2}'}^/ns
> > %
> >
> > If you want to actually cause a loop, then you could run
> > plumber in your riostart (as in rio -i riostart) file.  But there
> > are plenty of ways to shoot yourself in the foot with
> > ref count loops if you really want to go there.  A much
> > more direct one is to run a file server that mounts
> > itself into its own name space.
> >
> > Russ
> >



* Re: [9fans] clunk clunk
  2006-01-03 21:47         ` jmk
@ 2006-01-03 22:09           ` Bruce Ellis
  2006-01-03 22:14             ` jmk
  2006-01-03 22:36             ` Russ Cox
  0 siblings, 2 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-03 22:09 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

check /usr/inferno on bootes - it's solved.

demonstrate it?  how many ways.

is a simple "ps" good enough?

brucee

On 1/4/06, jmk@plan9.bell-labs.com <jmk@plan9.bell-labs.com> wrote:
> Instead of engaging in handbags, why don't
> you demonstrate the problem and how to fix it?
>
> --jim
>
> On Tue Jan  3 16:34:12 EST 2006, bruce.ellis@gmail.com wrote:
> > I can do anything I like Russ.  A fileserver is not the size of a fridge.
> > It's often a user process.  If you don't think this is a problem well
> > live with it.  I don't like rebooting a machine that's been running for
> > months because of this "bug".
> >
> > Mother Mary may believe your explanation but it is a BUG which
> > I squashed.
> >
> > brucee
> >
> > On 1/4/06, Russ Cox <rsc@swtch.com> wrote:
> > > > Bullshit.  It happens all the time.  Rio+Plumber is a simple example.
> > >
> > > In the standard profile, as set up by /sys/lib/newuser
> > > or demonstrated in /usr/glenda/lib/profile,
> > > plumber is run before rio starts, so rio has
> > > plumber mounted but not vice versa:
> > >
> > > % grep plumb /proc/^`{ps|grep rio|awk '{print $2}'}^/ns
> > > /proc/86746/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > /proc/86768/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > /proc/86769/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > /proc/86770/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > /proc/86771/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > /proc/86772/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > /proc/86773/ns:mount  #s/plumb.rsc.86763 /mnt/plumb
> > > % grep rio /proc/^`{ps|grep plumb|awk '{print $2}'}^/ns
> > > %
> > >
> > > If you want to actually cause a loop, then you could run
> > > plumber in your riostart (as in rio -i riostart) file.  But there
> > > are plenty of ways to shoot yourself in the foot with
> > > ref count loops if you really want to go there.  A much
> > > more direct one is to run a file server that mounts
> > > itself into its own name space.
> > >
> > > Russ
> > >
>



* Re: [9fans] clunk clunk
  2006-01-03 22:09           ` Bruce Ellis
@ 2006-01-03 22:14             ` jmk
  2006-01-03 22:16               ` Bruce Ellis
  2006-01-03 22:36             ` Russ Cox
  1 sibling, 1 reply; 24+ messages in thread
From: jmk @ 2006-01-03 22:14 UTC (permalink / raw)
  To: 9fans

There is no bootes and there is no /usr/inferno
on any of our fileservers.

Still waiting.

--jim

On Tue Jan  3 17:10:17 EST 2006, bruce.ellis@gmail.com wrote:
> check /usr/inferno on bootes - it's solved.
> 
> demonstrate it?  how many ways.
> 
> is a simple "ps" good enough?
> 
> brucee
> 
> On 1/4/06, jmk@plan9.bell-labs.com <jmk@plan9.bell-labs.com> wrote:
> > Instead of engaging in handbags, why don't
> > you demonstrate the problem and how to fix it?
> >
> > --jim



* Re: [9fans] clunk clunk
  2006-01-03 22:14             ` jmk
@ 2006-01-03 22:16               ` Bruce Ellis
  0 siblings, 0 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-03 22:16 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

i'll send you the code.

On 1/4/06, jmk@plan9.bell-labs.com <jmk@plan9.bell-labs.com> wrote:
> There is no bootes and there is no /usr/inferno
> on any of our fileservers.
>
> Still waiting.
>
> --jim
>
> On Tue Jan  3 17:10:17 EST 2006, bruce.ellis@gmail.com wrote:
> > check /usr/inferno on bootes - it's solved.
> >
> > demonstrate it?  how many ways.
> >
> > is a simple "ps" good enough?
> >
> > brucee
> >
> > On 1/4/06, jmk@plan9.bell-labs.com <jmk@plan9.bell-labs.com> wrote:
> > > Instead of engaging in handbags, why don't
> > > you demonstrate the problem and how to fix it?
> > >
> > > --jim
>



* Re: [9fans] clunk clunk
  2006-01-03 22:09           ` Bruce Ellis
  2006-01-03 22:14             ` jmk
@ 2006-01-03 22:36             ` Russ Cox
  2006-01-03 22:45               ` Bruce Ellis
  1 sibling, 1 reply; 24+ messages in thread
From: Russ Cox @ 2006-01-03 22:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> check /usr/inferno on bootes - it's solved.

This is much less helpful than actually summarizing
the method of solution.  I asked in my original reply what
you'd done, and you still haven't told us.  I can think of
a handful of solutions.

> demonstrate it?  how many ways.
> is a simple "ps" good enough?

I believe the custom is to post a program (or pair of programs)
that tickles the actual problem.

Don't mind me, though.  I'm just trying to understand the
problem and what the suggested solution is.  Apparently
that's not the right approach.

Russ



* Re: [9fans] clunk clunk
  2006-01-03 22:36             ` Russ Cox
@ 2006-01-03 22:45               ` Bruce Ellis
  2006-01-04 12:15                 ` Russ Cox
  0 siblings, 1 reply; 24+ messages in thread
From: Bruce Ellis @ 2006-01-03 22:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Of course .. the code will be sent to jmk. And it's simple
to write two short programs that demonstrate the BUG.
<name deleted> and <name deleted> at 1127 agreed
it should be fixed.

But of course Russ is always right (in his own lunchbox).

brucee

On 1/4/06, Russ Cox <rsc@swtch.com> wrote:
> > check /usr/inferno on bootes - it's solved.
>
> This is much less helpful than actually summarizing
> the method of solution.  I asked in my original reply what
> you'd done, and you still haven't told us.  I can think of
> a handful of solutions,
>
> > demonstrate it?  how many ways.
> > is a simple "ps" good enough?
>
> I believe the custom is to post a program (or pair of programs)
> that tickle the actual problem.
>
> Don't mind me, though.  I'm just trying to understand the
> problem and what the suggested solution is.  Apparently
> that's not the right approach.
>
> Russ
>



* Re: [9fans] clunk clunk
  2006-01-03 22:45               ` Bruce Ellis
@ 2006-01-04 12:15                 ` Russ Cox
  2006-01-04 12:25                   ` Bruce Ellis
  0 siblings, 1 reply; 24+ messages in thread
From: Russ Cox @ 2006-01-04 12:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I apologize if I give the impression of thinking I'm always
right.  I don't think I'm always right.  In fact, I am
frequently wrong.

What I do think is that in order to maintain the Plan 9
software, it would be helpful if I understood the problem
and proposed solution before I start editing code.  Too
often when I don't fully understand what's going on, bugs
end up going in with the fixes.  There was a good instance
of that over the weekend -- a subtle bug introduced into
acme a few years ago (thanks to Arvindht Tamilmani for
pointing it out).

You pointed out a scenario where two processes can deadlock.
That much I understood.  There were two things I didn't
understand.

The first was how common the problem actually was.  You made
it sound like it happens all the time, and my understanding
of the problem is that it shouldn't, hence my discussion of
the usual file server conventions.  You then said that
rio+plumber was an example, which would certainly be a
common case, except that as I understand the problem,
rio+plumber doesn't suffer from it.

Other cyclic dependency loops are even more common (as I
understand them).  Maybe your solution would fix those too.

The second thing I didn't understand was what solution you
propose.  Saying "go do like Inferno does" isn't really
helpful by itself.  In just one or two sentences, you could
explain the solution enough to know where in the Inferno
kernel to look.

I did look at cclose, pexit, and closefgrp in the Vita Nuova
Inferno, and I didn't see anything that looked significantly
different from Plan 9.  While writing this message, I
checked the archived /usr/inferno trees on emelie and they
looked the same too.  I don't see anything there that would
protect against the problem you described.  Perhaps I'm
looking in the wrong place, perhaps I don't understand what
the code is doing, or perhaps I don't even understand what
problem it is you were describing.

I'm just trying to understand your suggestion.

Thanks for your help.
Russ



* Re: [9fans] clunk clunk
  2006-01-04 12:15                 ` Russ Cox
@ 2006-01-04 12:25                   ` Bruce Ellis
  2006-01-04 15:36                     ` Ronald G Minnich
  2006-01-05  9:36                     ` Francisco J Ballesteros
  0 siblings, 2 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-04 12:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

too long for me to read.  jmk has the code.

On 1/4/06, Russ Cox <rsc@swtch.com> wrote:
> I apologize if I give the impression of thinking I'm always
> right.  I don't think I'm always right.  In fact, I am
> frequently wrong.
>
> What I do think is that in order to maintain the Plan 9
> software, it would be helpful if I understood the problem
> and proposed solution before I start editing code.  Too
> often when I don't fully understand what's going on, bugs
> end up going in with the fixes.  There was a good instance
> of that over the weekend -- a subtle bug introduced into
> acme a few years ago (thanks to Arvindht Tamilmani for
> pointing it out).
>
> You pointed out a scenario where two processes can deadlock.
> That much I understood.  There were two things I didn't
> understand.
>
> The first was how common the problem actually was.  You made
> it sound like it happens all the time, and my understanding
> of the problem is that it shouldn't, hence my discussion of
> the usual file server conventions.  You then said that
> rio+plumber was an example, which would certainly be a
> common case, except that as I understand the problem,
> rio+plumber doesn't suffer from it.
>
> Other cyclic dependency loops are even more common (as I
> understand them).  Maybe your solution would fix those too.
>
> The second thing I didn't understand was what solution you
> propose.  Saying "go do like Inferno does" isn't really
> helpful by itself.  In just one or two sentences, you could
> explain the solution enough to know where in the Inferno
> kernel to look.
>
> I did look at cclose, pexit, and closefgrp in the Vita Nuova
> inferno, and I didn't see anything that looked significantly
> different from Plan 9.  While writing this message, I
> checked the archived /usr/inferno trees on emelie and they
> looked the same too.  I don't see anything there that would
> protect against the problem you described.  Perhaps I'm
> looking in the wrong place, perhaps I don't understand what
> the code is doing, or perhaps I don't even understand what
> problem it is you were describing.
>
> I'm just trying to understand your suggestion.
>
> Thanks for your help.
> Russ
>



* Re: [9fans] clunk clunk
  2006-01-04 12:25                   ` Bruce Ellis
@ 2006-01-04 15:36                     ` Ronald G Minnich
  2006-01-04 15:41                       ` Bruce Ellis
  2006-01-05  9:36                     ` Francisco J Ballesteros
  1 sibling, 1 reply; 24+ messages in thread
From: Ronald G Minnich @ 2006-01-04 15:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Bruce Ellis wrote:
> too long for me to read.  jmk has the code.

I look forward to seeing it, since this discussion confused me almost as 
much as last year's threads discussion.

Near as I can tell, what I've learned is that if you set up a deadlock, 
you'll get a deadlock. Or did I miss something there too?

thanks

ron



* Re: [9fans] clunk clunk
  2006-01-04 15:36                     ` Ronald G Minnich
@ 2006-01-04 15:41                       ` Bruce Ellis
  0 siblings, 0 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-04 15:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

close enough.  we will sort it out.

brucee

On 1/5/06, Ronald G Minnich <rminnich@lanl.gov> wrote:
> Bruce Ellis wrote:
> > too long for me to read.  jmk has the code.
>
> I look forward to seeing it, since this discussion confused me almost as
> much as last year's threads discussion.
>
> Near as I can tell, what I've learned is that if you set up a deadlock,
> you'll get a deadlock. Or did I miss something there too?
>
> thanks
>
> ron
>



* Re: [9fans] clunk clunk
  2006-01-04 12:25                   ` Bruce Ellis
  2006-01-04 15:36                     ` Ronald G Minnich
@ 2006-01-05  9:36                     ` Francisco J Ballesteros
  2006-01-05  9:39                       ` Russ Cox
  1 sibling, 1 reply; 24+ messages in thread
From: Francisco J Ballesteros @ 2006-01-05  9:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Could we all see that code as well?
thanks in any case

On 1/4/06, Bruce Ellis <bruce.ellis@gmail.com> wrote:
> too long for me to read.  jmk has the code.
>


* Re: [9fans] clunk clunk
  2006-01-05  9:36                     ` Francisco J Ballesteros
@ 2006-01-05  9:39                       ` Russ Cox
  2006-01-05 12:00                         ` Bruce Ellis
  2006-01-06 21:34                         ` Dave Eckhardt
  0 siblings, 2 replies; 24+ messages in thread
From: Russ Cox @ 2006-01-05  9:39 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Could we all see that code as well?

The code looks at c->dev in cclose to see if the
chan is from devmnt.  If so, cclose places the chan on
a queue rather than calling devtab[c->dev]->close()
and chanfree() directly.  A pool of worker processes
tend the queue, like in exportfs, calling close() and
chanfree() themselves.
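
A sketch of that shape in kernel-style C (illustrative only: the
queue, the worker pool, and the c->dev indexing follow the wording
above, not the actual Inferno or Plan 9 source, and the real field and
function names may differ):

/*
 * Deferred clunking, as described above.  clunkq, haveclunk, clunkproc
 * and clunkinit are invented names; QLock, Rendez, devtab, chanfree,
 * sleep/wakeup and kproc are the usual kernel pieces.
 */
static struct
{
	QLock	lk;
	Chan	*head;
	Chan	*tail;
	Rendez	r;
} clunkq;

static int
haveclunk(void*)
{
	return clunkq.head != nil;
}

static void
clunkproc(void*)
{
	Chan *c;

	for(;;){
		qlock(&clunkq.lk);
		c = clunkq.head;
		if(c == nil){
			qunlock(&clunkq.lk);
			sleep(&clunkq.r, haveclunk, nil);
			continue;
		}
		clunkq.head = c->next;
		qunlock(&clunkq.lk);

		devtab[c->dev]->close(c);	/* may block waiting for Rclunk */
		chanfree(c);
	}
}

void
cclose(Chan *c)
{
	if(decref(c) != 0)
		return;

	if(devtab[c->dev]->dc != 'M'){		/* not devmnt: clunk in place */
		devtab[c->dev]->close(c);
		chanfree(c);
		return;
	}

	/* devmnt chan: hand it off so an exiting process never blocks here */
	qlock(&clunkq.lk);
	c->next = nil;
	if(clunkq.head == nil)
		clunkq.head = c;
	else
		clunkq.tail->next = c;
	clunkq.tail = c;
	qunlock(&clunkq.lk);
	wakeup(&clunkq.r);
}

void
clunkinit(void)
{
	int i;

	for(i = 0; i < 4; i++)			/* small pool of workers */
		kproc("clunkproc", clunkproc, nil);
}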

Russ



* Re: [9fans] clunk clunk
  2006-01-05  9:39                       ` Russ Cox
@ 2006-01-05 12:00                         ` Bruce Ellis
  2006-01-05 12:36                           ` Charles Forsyth
  2006-01-05 15:26                           ` Francisco J Ballesteros
  2006-01-06 21:34                         ` Dave Eckhardt
  1 sibling, 2 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-05 12:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

thanks.  very precise.

On 1/5/06, Russ Cox <rsc@swtch.com> wrote:
> > Could we all see that code as well?
>
> The code looks at c->dev in cclose to see if the
> chan is from devmnt.  If so, cclose places the chan on
> a queue rather than calling devtab[c->dev]->close()
> and chanfree() directly.  A pool of worker processes
> tend the queue, like in exportfs, calling close() and
> chanfree() themselves.
>
> Russ



* Re: [9fans] clunk clunk
  2006-01-05 12:00                         ` Bruce Ellis
@ 2006-01-05 12:36                           ` Charles Forsyth
  2006-01-05 15:26                           ` Francisco J Ballesteros
  1 sibling, 0 replies; 24+ messages in thread
From: Charles Forsyth @ 2006-01-05 12:36 UTC (permalink / raw)
  To: 9fans

with that change, can't the following fail from time to time:

	chmod +al events
	echo first >>events
	echo second >>events

because the first clunk can be sent after the second open attempt;
similarly for exclusive-use devices.
it could avoid asynchronous clunk where QTEXCL is set, but that doesn't
help exclusive-use devices (although there aren't many of the latter)
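
One way to plug that particular hole, staying within the
deferred-clunk sketch earlier in the thread (again a guess, with
invented names): decide per chan whether the clunk may be deferred,
and clunk exclusive-use files in place.

/*
 * Hypothetical refinement: only devmnt chans that are not on
 * exclusive-use files (QTEXCL) go to the deferred-clunk queue; the
 * rest are clunked synchronously, so a following open cannot race
 * ahead of the clunk.  Exclusive-use devices are not covered, as
 * noted above.
 */
static int
candefer(Chan *c)
{
	if(devtab[c->dev]->dc != 'M')	/* only devmnt chans are deferred */
		return 0;
	if(c->qid.type & QTEXCL)	/* keep open/clunk ordering for these */
		return 0;
	return 1;
}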




* Re: [9fans] clunk clunk
  2006-01-05 12:00                         ` Bruce Ellis
  2006-01-05 12:36                           ` Charles Forsyth
@ 2006-01-05 15:26                           ` Francisco J Ballesteros
  1 sibling, 0 replies; 24+ messages in thread
From: Francisco J Ballesteros @ 2006-01-05 15:26 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

In Plan B (the one with the changed kernel) I had a kernel
process doing cclose for channels that were on gone volumes.
Perhaps that was the same thing. I also noticed that the list
of "closing chans" could grow a lot.

But, isn´t the problem the failure to declare a connection
as broken, even when it is? [e.g., when its server process is
dead or broken].

For example, while we were using IL, the connections
were nicely collected and declared broken, however, when
we switched to tcp, some endpoints stayed for a very long
time.

In short, wouldn't it be better to officially "break" an
already broken connection than to use a pool of closing
processes? [NB: I used just one process, not a pool, but
I think the problem and workaround remains the same].

Note that if a connection is broken, there is no need to
send clunks for chans going through it.
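
A hypothetical sketch of that alternative (the "broken" flag and the
closemntchan helper are invented; Mnt stands for whatever state the
kernel keeps per mount): if the transport is already known to be dead,
skip the Tclunk and just free the local state.

/*
 * Invented helper: m is the mount that c goes through.  If the
 * connection behind m has been declared broken, there is nothing to
 * wait for, so drop the chan locally instead of sending Tclunk.
 */
void
closemntchan(Chan *c, Mnt *m)
{
	if(m == nil || m->broken){
		chanfree(c);		/* server gone: nothing to wait for */
		return;
	}
	devtab[c->dev]->close(c);	/* normal Tclunk/Rclunk exchange */
	chanfree(c);
}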

What do you think?
Thanks

On 1/5/06, Bruce Ellis <bruce.ellis@gmail.com> wrote:
> thanks.  very precise.
>
> On 1/5/06, Russ Cox <rsc@swtch.com> wrote:
> > > Could we all see that code as well?
> >
> > The code looks at c->dev in cclose to see if the
> > chan is from devmnt.  If so, cclose places the chan on
> > a queue rather than calling devtab[c->dev]->close()
> > and chanfree() directly.  A pool of worker processes
> > tend the queue, like in exportfs, calling close() and
> > chanfree() themselves.
> >
> > Russ
>
>


* Re: [9fans] clunk clunk
  2006-01-05  9:39                       ` Russ Cox
  2006-01-05 12:00                         ` Bruce Ellis
@ 2006-01-06 21:34                         ` Dave Eckhardt
  2006-01-06 21:57                           ` Bruce Ellis
  2006-01-06 22:00                           ` Francisco J Ballesteros
  1 sibling, 2 replies; 24+ messages in thread
From: Dave Eckhardt @ 2006-01-06 21:34 UTC (permalink / raw)
  To: 9fans

> The code looks at c->dev in cclose to see if the
> chan is from devmnt.  If so, cclose places the chan on
> a queue rather than calling devtab[c->dev]->close()
> and chanfree() directly.  A pool of worker processes
> tend the queue, like in exportfs, calling close() and
> chanfree() themselves.

Will that work if there is a cycle with more edges than
the number of worker processes?

Dave Eckhardt



* Re: [9fans] clunk clunk
  2006-01-06 21:34                         ` Dave Eckhardt
@ 2006-01-06 21:57                           ` Bruce Ellis
  2006-01-06 22:00                           ` Francisco J Ballesteros
  1 sibling, 0 replies; 24+ messages in thread
From: Bruce Ellis @ 2006-01-06 21:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

a diligent team is working on this - it does seem to work
flawlessly on froggie but charles' example demonstrates
a race which i think is easily corrected.

working to the better good (i don't know what that means),

brucee

On 1/7/06, Dave Eckhardt <davide+p9@cs.cmu.edu> wrote:
> > The code looks at c->dev in cclose to see if the
> > chan is from devmnt.  If so, cclose places the chan on
> > a queue rather than calling devtab[c->dev]->close()
> > and chanfree() directly.  A pool of worker processes
> > tend the queue, like in exportfs, calling close() and
> > chanfree() themselves.
>
> Will that work if there is a cycle with more edges than
> the number of worker processes?
>
> Dave Eckhardt
>



* Re: [9fans] clunk clunk
  2006-01-06 21:34                         ` Dave Eckhardt
  2006-01-06 21:57                           ` Bruce Ellis
@ 2006-01-06 22:00                           ` Francisco J Ballesteros
  1 sibling, 0 replies; 24+ messages in thread
From: Francisco J Ballesteros @ 2006-01-06 22:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

It works. At least, in the Plan B version, which has just one
worker, it works. It should work with more workers as well.
What happens is that more chans pile up in the wait queue,
waiting to be clunked. When enough time passes and the
connection is broken (because of a reboot of the machine
with the server process, or whatever), the queue will
continue to be drained (with N workers you would just get
fewer draining processes).


On 1/6/06, Dave Eckhardt <davide+p9@cs.cmu.edu> wrote:
> > The code looks at c->dev in cclose to see if the
> > chan is from devmnt.  If so, cclose places the chan on
> > a queue rather than calling devtab[c->dev]->close()
> > and chanfree() directly.  A pool of worker processes
> > tend the queue, like in exportfs, calling close() and
> > chanfree() themselves.
>
> Will that work if there is a cycle with more edges than
> the number of worker processes?
>
> Dave Eckhardt
>
>


end of thread

Thread overview: 24+ messages
2006-01-03 19:28 [9fans] clunk clunk Bruce Ellis
2006-01-03 19:38 ` Sascha Retzki
2006-01-03 19:46 ` Russ Cox
2006-01-03 20:07   ` Bruce Ellis
2006-01-03 20:24     ` Russ Cox
2006-01-03 21:33       ` Bruce Ellis
2006-01-03 21:47         ` jmk
2006-01-03 22:09           ` Bruce Ellis
2006-01-03 22:14             ` jmk
2006-01-03 22:16               ` Bruce Ellis
2006-01-03 22:36             ` Russ Cox
2006-01-03 22:45               ` Bruce Ellis
2006-01-04 12:15                 ` Russ Cox
2006-01-04 12:25                   ` Bruce Ellis
2006-01-04 15:36                     ` Ronald G Minnich
2006-01-04 15:41                       ` Bruce Ellis
2006-01-05  9:36                     ` Francisco J Ballesteros
2006-01-05  9:39                       ` Russ Cox
2006-01-05 12:00                         ` Bruce Ellis
2006-01-05 12:36                           ` Charles Forsyth
2006-01-05 15:26                           ` Francisco J Ballesteros
2006-01-06 21:34                         ` Dave Eckhardt
2006-01-06 21:57                           ` Bruce Ellis
2006-01-06 22:00                           ` Francisco J Ballesteros
