Another bug when suspending pipelines

zsh-workers
 help / color / mirror / code / Atom feed

* Another bug when suspending pipelines
       [not found] <CGME20160916123306eucas1p1f4349b727ebdfb64b4947c9d210457e1@eucas1p1.samsung.com>
@ 2016-09-16 12:33 ` Peter Stephenson
       [not found]   ` <CGME20160916164752eucas1p1add32b4825e126def7b3317c07731342@eucas1p1.samsung.com>
  2016-09-16 18:38   ` Bart Schaefer
  0 siblings, 2 replies; 5+ messages in thread
From: Peter Stephenson @ 2016-09-16 12:33 UTC (permalink / raw)
  To: Zsh Hackers' List

To check my code for the other pipeline suspending problem, I came up
with a command designed to check different cases, in particular where
the left of the pipeline was still running:

  (sleep 5; print foo) | { sleep 5; read bar; print $bar; }

Suspending this sometimes doesn't work: the ^Z is delayed and gets
relayed to the parent shell.  The "sometimes" suggests this is a race
--- note this is the same oldish machine that was showing up the second
variant of the problems I've just been looking at, so this might not be
widely visible (don't know yet).

Sometimes even if it suspends, it stops again when you fg it, then
the second fg allows it to complete.

I'm still 90% convinced there's no or very limited interaction with my
recent patches, however.  I can see the same ranges of behaviours both
with and without.  The problem seems the same in 5.2, as well, despite
what I originally thought, sporadic enough to be hard to be sure.

I have no handle on what aspect of this is problematic, but I don't
see anything about the shell code above that suggests this is a
particularly hairy case.

(It would be nice to have a set of test cases for this code, even if we
run them by hand.)

pws

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Another bug when suspending pipelines
       [not found]   ` <CGME20160916164752eucas1p1add32b4825e126def7b3317c07731342@eucas1p1.samsung.com>
@ 2016-09-16 16:47     ` Peter Stephenson
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Stephenson @ 2016-09-16 16:47 UTC (permalink / raw)
  To: Zsh Hackers' List

On Fri, 16 Sep 2016 13:33:02 +0100
Peter Stephenson <p.stephenson@samsung.com> wrote:
> To check my code for the other pipeline suspending problem, I came up
> with a command designed to check different cases, in particular where
> the left of the pipeline was still running:
> 
>   (sleep 5; print foo) | { sleep 5; read bar; print $bar; }
> 
> Suspending this sometimes doesn't work: the ^Z is delayed and gets
> relayed to the parent shell.

I've had trouble getting this to happen again, so it looks like it's
quite an obscure race.  It's not necessarily directly related to
list_pipe's Wirkung, in either English, German or Old Norse.

> Sometimes even if it suspends, it stops again when you fg it, then
> the second fg allows it to complete.

This is much easier to reproduce, and it looks like it's related,
serendipitously, to the condition I fixed earlier today --- the forked
zsh can be stopped too many times because it's in the same process
groups as the other stuff.

I've now run out of reasons why it should be in the same process group
at all, so this extends the other fix until we find out why...

pws

diff --git a/Src/exec.c b/Src/exec.c
index 0755d55..9a7234e 100644
--- a/Src/exec.c
+++ b/Src/exec.c
@@ -1710,39 +1710,23 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 			break;
 		    }
 		    else {
-			Job pjn = jobtab + list_pipe_job;
 			close(synch[0]);
 			entersubsh(ESUB_ASYNC);
 			/*
 			 * At this point, we used to attach this process
 			 * to the process group of list_pipe_job (the
 			 * new superjob) any time that was still available.
-			 * That caused problems when the fork happened
-			 * early enough that the subjob is in that
-			 * process group, since then we will stop this
-			 * job when we signal the subjob, and we have
-			 * no way here to know that we shouldn't also
-			 * send STOP to itself, resulting in two stops.
-			 * So don't do that if the original
-			 * list_pipe_job has exited.
+			 * That caused problems in at least two
+			 * cases because this forked shell was then
+			 * suspended with the right hand side of the
+			 * pipeline, and the SIGSTOP below suspended
+			 * it a second time when it was continued.
 			 *
-			 * The choice here needs to match the assumption
-			 * made when handling SUBLEADER above that the
-			 * process group is our own PID.  I'm not sure
-			 * if there's a potential race for that.  But
-			 * setting up a new process group if the
-			 * superjob is still functioning seems the wrong
-			 * thing to do.
+			 * It's therefore not clear entirely why you'd ever
+			 * do anything other than the following, but no
+			 * doubt we'll find out...
 			 */
-			if (pjn->procs &&
-			    (pjn->stat & STAT_INUSE) &&
-			    !(pjn->stat & STAT_DONE)) {
-			    if (setpgrp(0L, mypgrp = jobtab[list_pipe_job].gleader)
-				== -1) {
-				setpgrp(0L, mypgrp = getpid());
-			    }
-			} else
-			    setpgrp(0L, mypgrp = getpid());
+			setpgrp(0L, mypgrp = getpid());
 			close(synch[1]);
 			kill(getpid(), SIGSTOP);
 			list_pipe = 0;


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Another bug when suspending pipelines
  2016-09-16 12:33 ` Another bug when suspending pipelines Peter Stephenson
       [not found]   ` <CGME20160916164752eucas1p1add32b4825e126def7b3317c07731342@eucas1p1.samsung.com>
@ 2016-09-16 18:38   ` Bart Schaefer
  2016-09-16 20:52     ` Peter Stephenson
  1 sibling, 1 reply; 5+ messages in thread
From: Bart Schaefer @ 2016-09-16 18:38 UTC (permalink / raw)
  To: Zsh Hackers' List

On Sep 16,  1:33pm, Peter Stephenson wrote:
} Subject: Another bug when suspending pipelines
}
} To check my code for the other pipeline suspending problem, I came up
} with a command designed to check different cases, in particular where
} the left of the pipeline was still running:
} 
}   (sleep 5; print foo) | { sleep 5; read bar; print $bar; }
} 
} Suspending this sometimes doesn't work: the ^Z is delayed and gets
} relayed to the parent shell.

In this construct, the "read" is going to be running in the current
zsh; that is:

% (sleep 5; print foo) | { sleep 5; read bar; print GOT: $bar; }

should print "GOT: foo" even after ^Z/fg, whereas

% (sleep 5; print foo) | { sleep 5; read bar }; print GOT: $bar

should print "GOT: foo" if it runs un-interrupted, but should print
nothing if ^Z/fg.

Therefore the current (parent) shell is required to handle the TSTP in
this case.

} I have no handle on what aspect of this is problematic, but I don't
} see anything about the shell code above that suggests this is a
} particularly hairy case.

Let's tweak this slightly to be

  (sleep 6; print foo) | { sleep 7; read bar; print $bar }

just for clarity in the following description.

I believe what should be happening is:

* The "sleep 7" is initially the tty group leader, because it is the
  foreground job in the parent shell.  "sleep 6" and its subshell
  parent will not be in this process group because they were started
  before "sleep 7" existed; instead they are in the process group of
  the original parent.
* If the keyboard signal is generated during "sleep 7", sleep gets a
  SIGTSTP, is stopped, and the parent shell is notified by SIGCHLD.
  The parent reads the wait() status and finds that the foregroup job
  has stopped.
* The parent responds by stopping all the other jobs in the current
  pipeline.  It forks the brace expression into a new process and
  stops that one as well (or maybe that one stops itself, I don't
  recall the exact sequence of events here).
* The parent reclaims the tty leadership and returns to the prompt.

You can see this happen by adding "print $sysparams[pid]" on both
sides of the "sleep 7"; you will get a different PID before and after
the ^Z/fg.

So now we have two different process groups:  The "sleep 7" all by
itself, and the parent shell, whose group includes two more jobs,
the subshell and the rest of the brace expression.

The parent has to manage all three of these jobs, because it can't
hand off the already-forked "sleep 7" to the newly-forked brace job.
It has to arrange that the brace job can be sent a SIGCONT on "fg"
but not actually do anything until the "sleep 7" has finished; I
believe that's handled by the "synch" pipes and hidden reads; in any
case, it works.

On "fg" the parent:
* makes "sleep 7" the tty group leader again
* sends SIGCONT everywhere
* and waits for "sleep 7" to exit.

If I'm correct about the synch pipe, the tail of the brace expression
is blocked waiting to read that.  When "sleep 7" finally exits, the
parent:

* makes the tail of the brace expression into a process group which
  becomes the tty leader, and
* writes a byte on the synch pipe to wake it up.

And now we're finally back to the same state we'd have started in if
the brace expression had instead been a subshell, and the parent may
proceed with simply waiting for children.

"Particularly hairy"?

When the forked brace expression stops, the parent is going to be
notified about that again, possibly by SIGCHLD.  If it mis-handles
that, it could be the source of your observed "the ^Z is delayed"
and consequently cause an incorrect second pass through "stopping
all other jobs in the current pipeline".  But I can't reproduce
the double-stop to confirm that.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Another bug when suspending pipelines
  2016-09-16 18:38   ` Bart Schaefer
@ 2016-09-16 20:52     ` Peter Stephenson
  2016-09-16 22:25       ` Bart Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Stephenson @ 2016-09-16 20:52 UTC (permalink / raw)
  To: Zsh Hackers' List

On Fri, 16 Sep 2016 11:38:01 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> * The parent responds by stopping all the other jobs in the current
>   pipeline.  It forks the brace expression into a new process and
>   stops that one as well (or maybe that one stops itself, I don't
>   recall the exact sequence of events here).

This is where the problem was.  The forked shell is supposed to stop
itself, but because it was in the same process group it was also being
stopped by the parent shell.

> The parent has to manage all three of these jobs, because it can't
> hand off the already-forked "sleep 7" to the newly-forked brace job.
> It has to arrange that the brace job can be sent a SIGCONT on "fg"
> but not actually do anything until the "sleep 7" has finished; I
> believe that's handled by the "synch" pipes and hidden reads; in any
> case, it works.

The parent shell "simply" waits for the non-shell processes that were
forked directly from the parent shell on the right of the pipeline and
restarts the forked shell to run the rest (which may, of course, include
other external processes) when those have finished.  The synch just
ensures syncronisation at the start, though I'm not sure what good it's
doing.

> On "fg" the parent:
> * makes "sleep 7" the tty group leader again
> * sends SIGCONT everywhere

except the forked subshell

> * and waits for "sleep 7" to exit.
?
> If I'm correct about the synch pipe, the tail of the brace expression
> is blocked waiting to read that.

No, it's stopped itself waiting to be resstarted.

> * makes the tail of the brace expression into a process group which
>   becomes the tty leader, and
> * writes a byte on the synch pipe to wake it up.

This all happened in one go when the shell was forked.  (Its process
group is the subject matter here.)

pws

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Another bug when suspending pipelines
  2016-09-16 20:52     ` Peter Stephenson
@ 2016-09-16 22:25       ` Bart Schaefer
  0 siblings, 0 replies; 5+ messages in thread
From: Bart Schaefer @ 2016-09-16 22:25 UTC (permalink / raw)
  To: Zsh Hackers' List

On Sep 16,  9:52pm, Peter Stephenson wrote:
} Subject: Re: Another bug when suspending pipelines
}
} On Fri, 16 Sep 2016 11:38:01 -0700
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > The parent has to manage all three of these jobs, because it can't
} > hand off the already-forked "sleep 7" to the newly-forked brace job.
} > It has to arrange that the brace job can be sent a SIGCONT on "fg"
} > but not actually do anything until the "sleep 7" has finished; I
} > believe that's handled by the "synch" pipes and hidden reads; in any
} > case, it works.
} 
} The parent shell "simply" waits for the non-shell processes that were
} forked directly from the parent shell on the right of the pipeline and
} restarts the forked shell to run the rest (which may, of course, include
} other external processes) when those have finished.

Hm.  That means that if you *backgound* the "sleep" and bother looking
with "ps", you will see a process that is in T state while the shell
claims it is running.

I suppose that's worked fine for the last lo these many years, but does
it not mean that an external "kill -CONT" can cause the expected order
of events to be violated?

} > * makes the tail of the brace expression into a process group which
} >   becomes the tty leader, and
} > * writes a byte on the synch pipe to wake it up.
} 
} This all happened in one go when the shell was forked.

Maybe it shouldn't.  Ugh.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-09-16 22:24 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20160916123306eucas1p1f4349b727ebdfb64b4947c9d210457e1@eucas1p1.samsung.com>
2016-09-16 12:33 ` Another bug when suspending pipelines Peter Stephenson
     [not found]   ` <CGME20160916164752eucas1p1add32b4825e126def7b3317c07731342@eucas1p1.samsung.com>
2016-09-16 16:47     ` Peter Stephenson
2016-09-16 18:38   ` Bart Schaefer
2016-09-16 20:52     ` Peter Stephenson
2016-09-16 22:25       ` Bart Schaefer

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).