On Thu, Nov 3, 2022 at 4:10 PM Bart Schaefer wrote: > > We send a SIGCONT to the process group for the forked job, > which wakes up both the subjob and the superjob, but when we get the > SIGCHLD for the subjob we decide that its superjob is running not > because we just restarted it, but because it is still waiting to be > stopped from the previous SIGTSTP. I find we have to take this all the way back to workers/50105 and the "unkillable pipeline". The earlier example given was this: { /bin/sleep 10 ; /bin/sleep 20; } | { /bin/sleep 30 ; /bin/sleep 40; } I "fixed" this with workers/50134 by avoiding a reset of the process group leader when the job is not in the current shell, but it turns out that's not actually the problem. In the above example, two subshells are forked for either side of the pipe and "/bin/sleep 30" becomes the foreground job. Upon interrupting this, "/bin/sleep 40" begins to execute and the right-hand subshell immediately exits, because only the exit value of that sleep matters to the parent. The patch in 50134 makes "/bin/sleep 40" its own group leader in that case, allowing further keyboard-generated signals to reach it. Without 50134, the sleep effectively becomes an orphan; the parent is waiting, but for the wrong job, and the job that is running is immune to signals. There are two problems with this. First, arguably the "sleep 40" should not start at all, because the pipeline was interrupted. (This is in fact what happens in "sh" emulation thanks to workers/47794.) Second, in the "ls | less" function example from this thread, upon ^Z it's wrong to set the group leader to the newly-forked subshell because that results in the aforementioned race condition -- upon "fg", "ls" keeps getting stopped before "less" can get started, which causes the "oops, subjob is still stopped, better stop the superjob" confusion. I believe the reason interrupting "sleep 30" doesn't abort the whole pipeline is because the { ... } expression is in an interactive shell, so the two sleeps get interpreted as separately "jobbable". Running the whole thing in a non-interactive shell (zsh -fs) works fine. In fact % { /bin/sleep 10 ; /bin/sleep 20; } | { /bin/sleep 30 ; /bin/sleep 40; } ^Z % fg ^C % also works fine and does end the pipeline, but if you "fg" without immediate ^C, once "sleep 40" starts running we're right back in the invulnerable orphan state. There are two fixes needed here. One, never set the group leader to a process that's about to exit. Sadly, as far as I can tell, the parent doesn't know whether that's going to happen, and attempting to check with kill(gleader, 0) or similar just leads to a different race condition. Two, stop the whole { ... } construct if any job within it gets interrupted. In both cases 50134 should then be reverted. The attached patch seems to satisfactorily implement the second fix. There's an anecdote I may have mentioned before wherein a mechanic is called upon to fix a piece of machinery. When the job is finished he presents a bill for $1000. "What was the problem?" "Oh, a bolt had come out, I just replaced it." "What? $1000 for a single bolt?" "Sorry, let me correct that." The mechanic takes back his invoice, scribbles a few words, and hands it back. It now reads: "New bolt: $1. Finding out where to put it: $999" This patch is a $1000 bolt. Some no-op whitespace and comments included. Unfortunately, that leaves us with another example in 50105 still unfixed: ( zsh -fc 'print a few words; /bin/sleep 10' ) | { head -n 1 } Here we end up with the current shell as the foreground job because "head" has exited, and as an interactive shell it is ignoring signals. It needs to bring the left-hand-side into the foreground so it can be killed. I haven't tracked that down (nor figured out why 50134 works around it, except to believe it's two bugs canceling each other). Also, yes, I'm on vacation, and it's been raining all day here.