* Bug related to stdin/always/jobcontrol @ 2016-09-02 18:39 Christian Neukirchen 2016-09-05 15:04 ` Christian Neukirchen 2016-09-05 15:42 ` Peter Stephenson 0 siblings, 2 replies; 20+ messages in thread From: Christian Neukirchen @ 2016-09-02 18:39 UTC (permalink / raw) To: zsh-workers Hi, Stripped down test case for a mysterious loss of child: zsh 5.2 (x86_64-unknown-linux-gnu) zsh-5.2-0-gc86c20a VIM - Vi IMproved 7.4 (2013 Aug 10, compiled Aug 29 2016 13:06:04) Included patches: 1-2207 zsh -f juno% v() { { vim - } always { true } } juno% ls | v ^Z zsh: running v juno% jobs -p [1] 4421 running v juno% fg fg: no current job juno% fg %1 fg: %1: no such job juno% kill %1 kill: kill %1 failed: no such process juno% echo ${jobstates} suspended::4421=running juno% ^D Vim: Caught deadly signal HUP ... Everything works ok when - always is not used - stdin is not used Thanks, -- Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-02 18:39 Bug related to stdin/always/jobcontrol Christian Neukirchen @ 2016-09-05 15:04 ` Christian Neukirchen 2016-09-05 15:42 ` Peter Stephenson 1 sibling, 0 replies; 20+ messages in thread From: Christian Neukirchen @ 2016-09-05 15:04 UTC (permalink / raw) To: zsh-workers Christian Neukirchen <chneukirchen@gmail.com> writes: > Hi, > > Stripped down test case for a mysterious loss of child: > > zsh 5.2 (x86_64-unknown-linux-gnu) > zsh-5.2-0-gc86c20a > VIM - Vi IMproved 7.4 (2013 Aug 10, compiled Aug 29 2016 13:06:04) > Included patches: 1-2207 > > zsh -f > juno% v() { { vim - } always { true } } > juno% ls | v > ^Z > zsh: running v > juno% jobs -p > [1] 4421 running v > juno% fg > fg: no current job > juno% fg %1 > fg: %1: no such job > juno% kill %1 > kill: kill %1 failed: no such process > juno% echo ${jobstates} > suspended::4421=running > juno% ^D > Vim: Caught deadly signal HUP > ... > > Everything works ok when > - always is not used > - stdin is not used > > Thanks, For the record, this is also broken on 5.0.7 (Linux 3.10.42-1-lts), but works on 4.3.17 (3.2.0-4-amd64). -- Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-02 18:39 Bug related to stdin/always/jobcontrol Christian Neukirchen 2016-09-05 15:04 ` Christian Neukirchen @ 2016-09-05 15:42 ` Peter Stephenson [not found] ` <CGME20160914173110eucas1p1f0567e319ae439b975b19f4e55676fed@eucas1p1.samsung.com> 1 sibling, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-05 15:42 UTC (permalink / raw) Cc: zsh-workers On Fri, 2 Sep 2016 20:39:39 +0200 Christian Neukirchen <chneukirchen@gmail.com> wrote: > Stripped down test case for a mysterious loss of child: > > zsh 5.2 (x86_64-unknown-linux-gnu) > zsh-5.2-0-gc86c20a > VIM - Vi IMproved 7.4 (2013 Aug 10, compiled Aug 29 2016 13:06:04) > Included patches: 1-2207 > > zsh -f > juno% v() { { vim - } always { true } } > juno% ls | v > ^Z > zsh: running v > juno% fg > fg: no current job It looks like the job state is quite seriously challenged. I think the relevance of the always and stdin is probably that there's something for the job to fix up after it finishes, so the vim process can't be fully detached from the subshell it's in. When fg is run, the state of the job you can see has flags 0x2012: #define STAT_STOPPED (0x0002) /* all procs stopped or exited */ #define STAT_LOCKED (0x0010) /* shell is finished creating this job, */ /* may be deleted from job table */ #define STAT_SUBLEADER (0x2000) /* is super-job, but leader is sub-shell */ This doesn't have STAT_INUSE, so it's not clear "jobs" should be reporting it at all. Adding STAT_INUSE and fg'ing doesn't help; it does attempt to bring the command to the foreground but then gets stuck and stays stuck even if the vim command is killed from elsewhere. pws 27402 25329 7 16:28 pts/1 00:00:01 ./zsh pws 27423 27402 0 16:29 pts/1 00:00:00 vim - pws 27425 27402 0 16:29 pts/1 00:00:00 ./zsh That first ./zsh is the parent shell; I guess 27425 is the subshell process. job 3 in the job table appears to be valid and is in use but is marked as STAT_NOPRINT. It has status 0x173: #define STAT_CHANGED (0x0001) /* status changed and not reported */ #define STAT_STOPPED (0x0002) /* all procs stopped or exited */ #define STAT_LOCKED (0x0010) /* shell is finished creating this job, */ /* may be deleted from job table */ #define STAT_NOPRINT (0x0020) /* job was killed internally, */ /* we don't want to show that */ #define STAT_INUSE (0x0040) /* this job entry is in use */ #define STAT_SUBJOB (0x0100) /* job is a subjob */ I'm guessing what should happen is something like foregrounding the SUBLEADER job should also foreground the SUBJOB job. But that doesn't tell us whether a job that's not INUSE should show up, or why the INUSE flag is missing in the first place. pws ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <CGME20160914173110eucas1p1f0567e319ae439b975b19f4e55676fed@eucas1p1.samsung.com>]
* Re: Bug related to stdin/always/jobcontrol [not found] ` <CGME20160914173110eucas1p1f0567e319ae439b975b19f4e55676fed@eucas1p1.samsung.com> @ 2016-09-14 17:31 ` Peter Stephenson 2016-09-14 21:35 ` Peter Stephenson 0 siblings, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-14 17:31 UTC (permalink / raw) To: zsh-workers On Mon, 5 Sep 2016 16:42:07 +0100 Peter Stephenson <p.stephenson@samsung.com> wrote: > > juno% v() { { vim - } always { true } } > > juno% ls | v > > ^Z > > zsh: running v > > juno% fg > > fg: no current job > > It looks like the job state is quite seriously challenged. I think I might be getting close with the attached patch, but not quite far enough so I know I'm not doing something fundamentally screwy. With it, I get as far as continuing the process, but vim gets stopped again because it can't write to the terminal. Hence I'm suspecting something's still amiss in process group handling. I never understood process group handling. In case anyone does, as you'll see if you try it out, you get the vim process and the fix-up subprocess zsh (as below) in a process group which is the PID of an exited process. So first guess is we need to create a new process group, or something. That exited process is what was originally SUPERJOB, which is the ls which has exited (I think: I haven't trapped it in the act but this seems to be the only thing that makes sense). The newly forked zsh is the result of the ^Z and so needs to be in charge of the vim process (and, in the zsh model, subjob) --- again, I think, because that's basically why we do the fork, so we can still bring the right hand side of the pipeline which was originally in the current shell into the foreground. > When fg is run, the state of the job you can see has flags 0x2012: > > #define STAT_STOPPED (0x0002) /* all procs stopped or exited */ > #define STAT_LOCKED (0x0010) /* shell is finished creating this job, */ > /* may be deleted from job table */ > #define STAT_SUBLEADER (0x2000) /* is super-job, but leader is sub-shell */ > > This doesn't have STAT_INUSE, so it's not clear "jobs" should be > reporting it at all. Adding STAT_INUSE and fg'ing doesn't help; it does > attempt to bring the command to the foreground but then gets stuck and > stays stuck even if the vim command is killed from elsewhere. > > pws 27402 25329 7 16:28 pts/1 00:00:01 ./zsh > pws 27423 27402 0 16:29 pts/1 00:00:00 vim - > pws 27425 27402 0 16:29 pts/1 00:00:00 ./zsh > > That first ./zsh is the parent shell; I guess 27425 is the subshell > process. This appears to be the case. I've added the INUSE back in, because the job doesn't seem to be any use to person nor best without it. I've then made it look for a SUBJOB of which it can become SUPERJOB, and fixed up the associations. This also assumes I've worked out the difference between being a SUBLEADER and a SUPERJOB. Note: I'm not sure if we need to search for an existing SUPERJOB first because I'm not sure in what circumstances we can get to the point we have. The final hunk is an obvious typo --- "other" is always a job (possibly sounds better in French). Now fg is trying to do something looks plausible, but with the effect noted above. pws diff --git a/Src/exec.c b/Src/exec.c index cfd633a..2638b01 100644 --- a/Src/exec.c +++ b/Src/exec.c @@ -1570,6 +1570,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) if (nowait) { if(!pline_level) { + int jobsub; struct process *pn, *qn; curjob = newjob; @@ -1593,7 +1594,23 @@ execpline(Estate state, wordcode slcode, int how, int last1) } jn->stat &= ~(STAT_DONE | STAT_NOPRINT); - jn->stat |= STAT_STOPPED | STAT_CHANGED | STAT_LOCKED; + jn->stat |= STAT_STOPPED | STAT_CHANGED | STAT_LOCKED | + STAT_INUSE; + /* + * Pick up any subjob that's still lying around + * as it's now our responsibility. + * If we find it we're a regular SUPERJOB + * instead of a mere SUBLEADER. + */ + for (jobsub = 1; jobsub <= maxjob; jobsub++) { + Job jnsub = jobtab + jobsub; + if (jnsub->stat & STAT_SUBJOB) { + jn->other = jobsub; + jn->stat |= STAT_SUPERJOB; + jn->stat &= ~STAT_SUBLEADER; + jnsub->other = newjob; + } + } printjob(jn, !!isset(LONGLISTJOBS), 1); } else if (newjob != list_pipe_job) @@ -1668,7 +1685,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) jobtab[list_pipe_job].other = newjob; jobtab[list_pipe_job].stat |= STAT_SUPERJOB; jn->stat |= STAT_SUBJOB | STAT_NOPRINT; - jn->other = pid; + jn->other = list_pipe_job; } if ((list_pipe || last1) && hasprocs(list_pipe_job)) killpg(jobtab[list_pipe_job].gleader, SIGSTOP); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-14 17:31 ` Peter Stephenson @ 2016-09-14 21:35 ` Peter Stephenson 2016-09-15 3:24 ` Bart Schaefer 2016-09-22 11:59 ` Daniel Shahaf 0 siblings, 2 replies; 20+ messages in thread From: Peter Stephenson @ 2016-09-14 21:35 UTC (permalink / raw) To: zsh-workers On Wed, 14 Sep 2016 18:31:05 +0100 Peter Stephenson <p.stephenson@samsung.com> wrote: > I think I might be getting close with the attached patch... first > guess is we need to create a new process group, or something. The subjob needs to keep its process group because it's too late to change it; I found out what was missing, it needs process group leader assigning from that of list_pipe_job (obviously, right?) Seems a bit weird this never got noticed before, but it's hard to see how it can be wrong since this is the way the pipeline always works, regardless of whether the ls process or equivalent has exited. I've tried a few bits of job control with multiple different list-pipe-style constructs lying around and they still seem to work. Looks like the pgrp for the new superjob is irrelevant as it's kept out of the way till the last minute, so no changes there. I was shooting myself in the foot wth the change for "other". It turns out "other" means other job if this is the superjob, and other process if this is the subjob (obviously, right?) Je est veritablement un autre. I've made the claiming of the subjob safe by marking it as orphaned when the original superjob, the ls process in the example, exits. So if that doesn't happen this doesn't kick in, so this shouldn't be stomping on anything that already works. I present the solution for verification. If accepted, I will be demanding a certifcate of some sort. But there's still one thing I don't understand. When Scooby... er... sorry, force of habit... it's this chunk which is for SIGCONT when we're dealing with a superjob in killjb(): for (pn = jn->procs; pn->next; pn = pn->next) if (kill(pn->pid, sig) == -1 && errno != ESRCH) err = -1; The exit test is pn->next, so we skip the last process in the superjob, which is a bit weird. Certainly, changing that to the more obvious pn causes the shell process acting as superjob in this case to be started too early. However, I don't know what happens in more normal cases. I've left it alone. pws diff --git a/Src/exec.c b/Src/exec.c index cfd633a..21270c8 100644 --- a/Src/exec.c +++ b/Src/exec.c @@ -1570,6 +1570,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) if (nowait) { if(!pline_level) { + int jobsub; struct process *pn, *qn; curjob = newjob; @@ -1582,6 +1583,20 @@ execpline(Estate state, wordcode slcode, int how, int last1) if (!jn->procs->next || lpforked == 2) { jn->gleader = list_pipe_pid; jn->stat |= STAT_SUBLEADER; + /* + * Pick up any subjob that's still lying around + * as it's now our responsibility. + * If we find it we're a SUPERJOB. + */ + for (jobsub = 1; jobsub <= maxjob; jobsub++) { + Job jnsub = jobtab + jobsub; + if (jnsub->stat & STAT_SUBJOB_ORPHANED) { + jn->other = jobsub; + jn->stat |= STAT_SUPERJOB; + jnsub->stat &= ~STAT_SUBJOB_ORPHANED; + jnsub->other = list_pipe_pid; + } + } } for (pn = jobtab[jn->other].procs; pn; pn = pn->next) if (WIFSTOPPED(pn->status)) @@ -1593,7 +1608,8 @@ execpline(Estate state, wordcode slcode, int how, int last1) } jn->stat &= ~(STAT_DONE | STAT_NOPRINT); - jn->stat |= STAT_STOPPED | STAT_CHANGED | STAT_LOCKED; + jn->stat |= STAT_STOPPED | STAT_CHANGED | STAT_LOCKED | + STAT_INUSE; printjob(jn, !!isset(LONGLISTJOBS), 1); } else if (newjob != list_pipe_job) @@ -1669,6 +1685,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) jobtab[list_pipe_job].stat |= STAT_SUPERJOB; jn->stat |= STAT_SUBJOB | STAT_NOPRINT; jn->other = pid; + jn->gleader = jobtab[list_pipe_job].gleader; } if ((list_pipe || last1) && hasprocs(list_pipe_job)) killpg(jobtab[list_pipe_job].gleader, SIGSTOP); diff --git a/Src/jobs.c b/Src/jobs.c index 6bc3616..9284c71 100644 --- a/Src/jobs.c +++ b/Src/jobs.c @@ -128,7 +128,7 @@ makerunning(Job jn) Process pn; jn->stat &= ~STAT_STOPPED; - for (pn = jn->procs; pn; pn = pn->next) + for (pn = jn->procs; pn; pn = pn->next) { #if 0 if (WIFSTOPPED(pn->status) && (!(jn->stat & STAT_SUPERJOB) || pn->next)) @@ -136,6 +136,7 @@ makerunning(Job jn) #endif if (WIFSTOPPED(pn->status)) pn->status = SP_RUNNING; + } if (jn->stat & STAT_SUPERJOB) makerunning(jobtab + jn->other); @@ -236,7 +237,7 @@ handle_sub(int job, int fg) if ((sj->stat & STAT_DONE) || (!sj->procs && !sj->auxprocs)) { struct process *p; - for (p = sj->procs; p; p = p->next) + for (p = sj->procs; p; p = p->next) { if (WIFSIGNALED(p->status)) { if (jn->gleader != mypgrp && jn->procs->next) killpg(jn->gleader, WTERMSIG(p->status)); @@ -246,6 +247,7 @@ handle_sub(int job, int fg) kill(sj->other, WTERMSIG(p->status)); break; } + } if (!p) { int cp; @@ -1316,6 +1318,11 @@ deletejob(Job jn, int disowning) attachtty(mypgrp); adjustwinsize(0); } + if (jn->stat & STAT_SUPERJOB) { + Job jno = jobtab + jn->other; + if (jno->stat & STAT_SUBJOB) + jno->stat |= STAT_SUBJOB_ORPHANED; + } freejob(jn, 1); } diff --git a/Src/zsh.h b/Src/zsh.h index 2dc5e7e..bb8ce13 100644 --- a/Src/zsh.h +++ b/Src/zsh.h @@ -983,7 +983,8 @@ struct jobfile { struct job { pid_t gleader; /* process group leader of this job */ - pid_t other; /* subjob id or subshell pid */ + pid_t other; /* subjob id (SUPERJOB) + * or subshell pid (SUBJOB) */ int stat; /* see STATs below */ char *pwd; /* current working dir of shell when * * this job was spawned */ @@ -1015,6 +1016,8 @@ struct job { #define STAT_SUBLEADER (0x2000) /* is super-job, but leader is sub-shell */ #define STAT_BUILTIN (0x4000) /* job at tail of pipeline is a builtin */ +#define STAT_SUBJOB_ORPHANED (0x8000) + /* STAT_SUBJOB with STAT_SUPERJOB exited */ #define SP_RUNNING -1 /* fake status for jobs currently running */ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-14 21:35 ` Peter Stephenson @ 2016-09-15 3:24 ` Bart Schaefer 2016-09-15 8:27 ` Peter Stephenson 2016-09-22 11:59 ` Daniel Shahaf 1 sibling, 1 reply; 20+ messages in thread From: Bart Schaefer @ 2016-09-15 3:24 UTC (permalink / raw) To: zsh-workers On Sep 14, 10:35pm, Peter Stephenson wrote: } Subject: Re: Bug related to stdin/always/jobcontrol } } I present the solution for verification. If accepted, I will be } demanding a certifcate of some sort. Thank you for digging through this. I was expecting to need to dive into it, and my schedule the next several days would not have been amenable. Your explanation makes sense and the code looks sane, I think you can claim your certificate. } SIGCONT when we're dealing with a superjob in killjb(): } } for (pn = jn->procs; pn->next; pn = pn->next) } if (kill(pn->pid, sig) == -1 && errno != ESRCH) } err = -1; } } The exit test is pn->next, so we skip the last process in the superjob, } which is a bit weird. I think this has to do with fork-to-the-left on pipelines, i.e., in the "more normal case" the last process in the superjob is either the current shell or the process group leader. The loop leaves pn pointing to the last process, and then the following bit -- if (!jobtab[jn->other].procs && pn) if (kill(pn->pid, sig) == -1 && errno != ESRCH) err = -1; -- sends that process the signal, except not when there are "other" procs, so I don't quite grok that part. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-15 3:24 ` Bart Schaefer @ 2016-09-15 8:27 ` Peter Stephenson 2016-09-15 10:33 ` Peter Stephenson 0 siblings, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-15 8:27 UTC (permalink / raw) To: zsh-workers On Wed, 14 Sep 2016 20:24:50 -0700 Bart Schaefer <schaefer@brasslantern.com> wrote: > I think this has to do with fork-to-the-left on pipelines, i.e., in > the "more normal case" the last process in the superjob is either > the current shell or the process group leader. The loop leaves pn > pointing to the last process, and then the following bit -- > > if (!jobtab[jn->other].procs && pn) > if (kill(pn->pid, sig) == -1 && errno != ESRCH) > err = -1; > > -- sends that process the signal, except not when there are "other" > procs, so I don't quite grok that part. Ah, I think that *is* the sort of thing I'm looking at (as well as in the normal superjob case). If that's the right process for this (which is the bit I didn't really undersand), i.e. what at least at one point in its life cycle was list_pipe_pid when a subjob was associated with it, then the presence of processes associated with the "other" job (the subjob) means we should kick those off instead. There are then efforts to ensure this gets restarted when the subjob has exited (that's the bit that sprang back to life when I realised "other" was a process when used within the subjob structure, and it's this process it's pointing to). So I think this does make sense, depsite being a bit hairy and underdocumented. I'll add some tentative notes (though they might be closer to eight- or ninetative). pws ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-15 8:27 ` Peter Stephenson @ 2016-09-15 10:33 ` Peter Stephenson 2016-09-15 17:40 ` Peter Stephenson 0 siblings, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-15 10:33 UTC (permalink / raw) To: zsh-workers Still issues here --- I tried the last patch on another machine and it's OK up to the point where you resume and then exit the job, at which point vim exits cleanly but the new superjob (the forked zsh) is continued but suspends itself again. Sending it SIGCONT causes it to exit normally, so this looks like a race. pws ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-15 10:33 ` Peter Stephenson @ 2016-09-15 17:40 ` Peter Stephenson 2016-09-16 1:08 ` Bart Schaefer 0 siblings, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-15 17:40 UTC (permalink / raw) To: zsh-workers On Thu, 15 Sep 2016 11:33:15 +0100 Peter Stephenson <p.stephenson@samsung.com> wrote: > Still issues here --- I tried the last patch on another machine and it's > OK up to the point where you resume and then exit the job, at which > point vim exits cleanly but the new superjob (the forked zsh) is > continued but suspends itself again. Sending it SIGCONT causes it to > exit normally, so this looks like a race. I'm a bit stuck on this. The race is related to the fact that when the new SUPERJOB takes over it's actually in the same process group as the SUBJOB it's taking over, which is different from it's own PID, list_pipe_pid, that the code currently assumes is the pgrp (and is indeed the pgrp in the cases where everything works --- I've again verified the code on a different machine here and get the impression faster machines work better). However, fixing that up doesn't help --- apparently attaching it to the TTY after the SUBJOB is finished works OK, sending it SIGCONT seems to be OK, but then it always gets SIGSTOP --- and sending it SIGCONT from the keyboard allows it to finish. So I have no idea why it's being stopped after the first SIGCONT. (I tried putting it from the main shell into a new process group of its own; that succeeds, but still doesn't help.) Stepping through the forked zsh that was stopped doesn't suggest anything, either, as it just executes the remainder of the shell code in the v function. However, I should have another look at the point where it gets the pukka SIGSTOP (the one that keeps it out of the way until the SUBJOB has finisehd) to make sure I'm not missing something there. I think the code is enough better that I may submit it and wonder what to do with the remaining problem later, or let Bart have a go when he gets some time. pws ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-15 17:40 ` Peter Stephenson @ 2016-09-16 1:08 ` Bart Schaefer 2016-09-16 12:02 ` Peter Stephenson 0 siblings, 1 reply; 20+ messages in thread From: Bart Schaefer @ 2016-09-16 1:08 UTC (permalink / raw) To: zsh-workers On Sep 15, 6:40pm, Peter Stephenson wrote: } Subject: Re: Bug related to stdin/always/jobcontrol } } The race is related to the fact that when the new SUPERJOB takes over } it's actually in the same process group as the SUBJOB it's taking over, } which is different from it's own PID, list_pipe_pid [...] } } However, fixing that up doesn't help --- apparently attaching it to the } TTY after the SUBJOB is finished works OK, sending it SIGCONT seems to } be OK, but then it always gets SIGSTOP Is it definitely getting SIGSTOP and not SIGTSTP ? The operating system will never[*] automatically send a true STOP signal because it's un-catchable. If a STOP is being received, it's probably being sent from the "for (; !nowait;)" loop in execpline(), either at line 1674 [ killpg(jobtab[list_pipe_job].gleader, SIGSTOP); ] or at line 1688 [ kill(getpid(), SIGSTOP); ]. [*] To my knowledge, anyway. } I think the code is enough better that I may submit it and wonder what } to do with the remaining problem later, or let Bart have a go when he } gets some time. That seems reasonable. This isn't any worse than it ever was, so it shouldn't hold up the release. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-16 1:08 ` Bart Schaefer @ 2016-09-16 12:02 ` Peter Stephenson 0 siblings, 0 replies; 20+ messages in thread From: Peter Stephenson @ 2016-09-16 12:02 UTC (permalink / raw) To: zsh-workers On Thu, 15 Sep 2016 18:08:55 -0700 Bart Schaefer <schaefer@brasslantern.com> wrote: > On Sep 15, 6:40pm, Peter Stephenson wrote: > } Subject: Re: Bug related to stdin/always/jobcontrol > } > } The race is related to the fact that when the new SUPERJOB takes over > } it's actually in the same process group as the SUBJOB it's taking over, > } which is different from it's own PID, list_pipe_pid [...] > } > } However, fixing that up doesn't help --- apparently attaching it to the > } TTY after the SUBJOB is finished works OK, sending it SIGCONT seems to > } be OK, but then it always gets SIGSTOP > > Is it definitely getting SIGSTOP and not SIGTSTP ? > > The operating system will never[*] automatically send a true STOP signal > because it's un-catchable. If a STOP is being received, it's probably > being sent from the "for (; !nowait;)" loop in execpline(), either at > line 1674 [ killpg(jobtab[list_pipe_job].gleader, SIGSTOP); ] or at > line 1688 [ kill(getpid(), SIGSTOP); ]. I've found the double SIGSTOP --- it was getting *both* of the above. In the case in question, the newly forked child zsh was picking up the pgrp of the vim process, so when we sent SIGSTOP to that pgrp it got stopped, too. When it got restarted, it immediately stopped itself. There's some code in the child to decide on the pgrp. It looks like it needs to have its own pgrp if the old superjob has already gone, the basis of the problems we are seeing here. I'm not sure if this leaves some more races but, again, it looks like it's an improvement. The rest of the patch is cosmetic. The bad news is there are other problems with this logic that have crept in since 5.2 --- this appears to be unrelated, or at least not made any worse by this change, so I'll start a separate thread. pws diff --git a/Src/exec.c b/Src/exec.c index 21270c8..0755d55 100644 --- a/Src/exec.c +++ b/Src/exec.c @@ -1652,7 +1652,13 @@ execpline(Estate state, wordcode slcode, int how, int last1) int synch[2]; struct timeval bgtime; + /* + * A pipeline with the shell handling the right + * hand side was stopped. We'll fork to allow + * it to continue. + */ if (pipe(synch) < 0 || (pid = zfork(&bgtime)) == -1) { + /* Failure */ if (pid < 0) { close(synch[0]); close(synch[1]); @@ -1666,6 +1672,18 @@ execpline(Estate state, wordcode slcode, int how, int last1) thisjob = newjob; } else if (pid) { + /* + * Parent: job control is here. If the job + * started for the RHS of the pipeline is still + * around, then its a SUBJOB and the job for + * earlier parts of the pipeeline is its SUPERJOB. + * The newly forked shell isn't recorded as a + * separate job here, just as list_pipe_pid. + * If the superjob exits (it may already have + * done so, see child branch below), we'll use + * list_pipe_pid to form the basis of a + * replacement job --- see SUBLEADER code above. + */ char dummy; lpforked = @@ -1692,9 +1710,33 @@ execpline(Estate state, wordcode slcode, int how, int last1) break; } else { + Job pjn = jobtab + list_pipe_job; close(synch[0]); entersubsh(ESUB_ASYNC); - if (jobtab[list_pipe_job].procs) { + /* + * At this point, we used to attach this process + * to the process group of list_pipe_job (the + * new superjob) any time that was still available. + * That caused problems when the fork happened + * early enough that the subjob is in that + * process group, since then we will stop this + * job when we signal the subjob, and we have + * no way here to know that we shouldn't also + * send STOP to itself, resulting in two stops. + * So don't do that if the original + * list_pipe_job has exited. + * + * The choice here needs to match the assumption + * made when handling SUBLEADER above that the + * process group is our own PID. I'm not sure + * if there's a potential race for that. But + * setting up a new process group if the + * superjob is still functioning seems the wrong + * thing to do. + */ + if (pjn->procs && + (pjn->stat & STAT_INUSE) && + !(pjn->stat & STAT_DONE)) { if (setpgrp(0L, mypgrp = jobtab[list_pipe_job].gleader) == -1) { setpgrp(0L, mypgrp = getpid()); diff --git a/Src/jobs.c b/Src/jobs.c index 9284c71..d1b98ac 100644 --- a/Src/jobs.c +++ b/Src/jobs.c @@ -232,11 +232,12 @@ super_job(int sub) static int handle_sub(int job, int fg) { + /* job: superjob; sj: subjob. */ Job jn = jobtab + job, sj = jobtab + jn->other; if ((sj->stat & STAT_DONE) || (!sj->procs && !sj->auxprocs)) { struct process *p; - + for (p = sj->procs; p; p = p->next) { if (WIFSIGNALED(p->status)) { if (jn->gleader != mypgrp && jn->procs->next) diff --git a/Src/signals.c b/Src/signals.c index 2eefc07..30dde71 100644 --- a/Src/signals.c +++ b/Src/signals.c @@ -723,7 +723,7 @@ killjb(Job jn, int sig) { Process pn; int err = 0; - + if (jobbing) { if (jn->stat & STAT_SUPERJOB) { if (sig == SIGCONT) { @@ -731,11 +731,21 @@ killjb(Job jn, int sig) if (killpg(pn->pid, sig) == -1) if (kill(pn->pid, sig) == -1 && errno != ESRCH) err = -1; - + + /* + * Note this does not kill the last process, + * which is assumed to be the one controlling the + * subjob, i.e. the forked zsh that was originally + * list_pipe_pid... + */ for (pn = jn->procs; pn->next; pn = pn->next) if (kill(pn->pid, sig) == -1 && errno != ESRCH) err = -1; + /* + * ...we only continue that once the external processes + * currently associated with the subjob are finished. + */ if (!jobtab[jn->other].procs && pn) if (kill(pn->pid, sig) == -1 && errno != ESRCH) err = -1; @@ -744,7 +754,7 @@ killjb(Job jn, int sig) } if (killpg(jobtab[jn->other].gleader, sig) == -1 && errno != ESRCH) err = -1; - + if (killpg(jn->gleader, sig) == -1 && errno != ESRCH) err = -1; ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-14 21:35 ` Peter Stephenson 2016-09-15 3:24 ` Bart Schaefer @ 2016-09-22 11:59 ` Daniel Shahaf 2016-09-22 16:38 ` Bart Schaefer 2016-09-25 15:16 ` Bug related to stdin/always/jobcontrol Peter Stephenson 1 sibling, 2 replies; 20+ messages in thread From: Daniel Shahaf @ 2016-09-22 11:59 UTC (permalink / raw) To: Peter Stephenson; +Cc: zsh-workers Peter Stephenson wrote on Wed, Sep 14, 2016 at 22:35:53 +0100: > On Wed, 14 Sep 2016 18:31:05 +0100 > Peter Stephenson <p.stephenson@samsung.com> wrote: > > I think I might be getting close with the attached patch... first > > guess is we need to create a new process group, or something. > > The subjob needs to keep its process group because it's too late to > change it; I found out what was missing, it needs process group leader > assigning from that of list_pipe_job (obviously, right?) Seems a bit > weird this never got noticed before, but it's hard to see how it can be > wrong since this is the way the pipeline always works, regardless of > whether the ls process or equivalent has exited. I've tried a few bits > of job control with multiple different list-pipe-style constructs lying > around and they still seem to work. This (39331) broke fg'ing a function for me: % print $ZSH_PATCHLEVEL zsh-5.2-dev-1-401-g4e51079 % f() { $EDITOR } % f <press ^Z in the editor> zsh: suspended f % fg [1] + continued f zsh: suspended (tty output) f After the 'fg' I get another prompt immediately, instead of being returend to $EDITOR. Cheers, Daniel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-22 11:59 ` Daniel Shahaf @ 2016-09-22 16:38 ` Bart Schaefer 2016-09-23 1:18 ` Bart Schaefer 2016-09-25 15:16 ` Bug related to stdin/always/jobcontrol Peter Stephenson 1 sibling, 1 reply; 20+ messages in thread From: Bart Schaefer @ 2016-09-22 16:38 UTC (permalink / raw) To: zsh-workers On Sep 22, 11:59am, Daniel Shahaf wrote: } } % f() { $EDITOR } } % f } <press ^Z in the editor> } zsh: suspended f } % fg } [1] + continued f } zsh: suspended (tty output) f } } After the 'fg' I get another prompt immediately, instead of being } returend to $EDITOR. Worse that that, signals go to the parent shell: torch% fg [1] + continued f zsh: suspended (tty output) f torch% kill -9 %1 Vim: Caught deadly signal HUP Vim: Finished. zsh: killed Src/zsh -f ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-22 16:38 ` Bart Schaefer @ 2016-09-23 1:18 ` Bart Schaefer 2016-09-23 6:06 ` Daniel Shahaf 2016-09-25 14:39 ` Peter Stephenson 0 siblings, 2 replies; 20+ messages in thread From: Bart Schaefer @ 2016-09-23 1:18 UTC (permalink / raw) To: zsh-workers On Sep 22, 9:38am, Bart Schaefer wrote: } Subject: Re: Bug related to stdin/always/jobcontrol } } On Sep 22, 11:59am, Daniel Shahaf wrote: } } } } % f() { $EDITOR } } } % f } } <press ^Z in the editor> } } zsh: suspended f } } % fg } } [1] + continued f } } zsh: suspended (tty output) f } } Worse that that, signals go to the parent shell: } } torch% kill -9 %1 } Vim: Caught deadly signal HUP } Vim: Finished. } zsh: killed Src/zsh -f The signal is being sent from here (signals.c): 755 if (killpg(jobtab[jn->other].gleader, sig) == -1 && errno != ESRCH) The problem is that jobtab[jn->other].gleader is 0, so killpg kills the current process. Obviously the group leader should never be zero, but I haven't figured out how to track down where it's [not] being set. Probably this is related to the shell function not properly attaching to the tty when foregrounded. However, in attempting to figure it out, I found some sixteen-year-old code that is clearly wrong: diff --git a/Src/exec.c b/Src/exec.c index d924148..bf97b5c 100644 --- a/Src/exec.c +++ b/Src/exec.c @@ -1702,7 +1702,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) jobtab[list_pipe_job].other = newjob; jobtab[list_pipe_job].stat |= STAT_SUPERJOB; jn->stat |= STAT_SUBJOB | STAT_NOPRINT; - jn->other = pid; + jn->other = list_pipe_job; jn->gleader = jobtab[list_pipe_job].gleader; } if ((list_pipe || last1) && hasprocs(list_pipe_job)) jn->other is a job table index, not a process ID. However, (a) I don't know if list_pip_job is the right value (although a few lines earlier list_pipe_pid = pid is assigned, so it would seem to makes sense) and (b) that this has never caused a wild pointer dereference in all this time seems to indicate that this code is never actually reached. I don't have any more time to look at this now; hoping this information is helpful to PWS if he has a chance to look when he wakes up on Friday. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-23 1:18 ` Bart Schaefer @ 2016-09-23 6:06 ` Daniel Shahaf 2016-09-25 14:39 ` Peter Stephenson 1 sibling, 0 replies; 20+ messages in thread From: Daniel Shahaf @ 2016-09-23 6:06 UTC (permalink / raw) To: zsh-workers Bart Schaefer wrote on Thu, Sep 22, 2016 at 18:18:47 -0700: > However, in attempting to figure it out, I found some sixteen-year-old > code that is clearly wrong: > > diff --git a/Src/exec.c b/Src/exec.c > index d924148..bf97b5c 100644 > --- a/Src/exec.c > +++ b/Src/exec.c > @@ -1702,7 +1702,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) > jobtab[list_pipe_job].other = newjob; > jobtab[list_pipe_job].stat |= STAT_SUPERJOB; > jn->stat |= STAT_SUBJOB | STAT_NOPRINT; > - jn->other = pid; > + jn->other = list_pipe_job; > jn->gleader = jobtab[list_pipe_job].gleader; > } > if ((list_pipe || last1) && hasprocs(list_pipe_job)) > > jn->other is a job table index, not a process ID. However, (a) I don't > know if list_pip_job is the right value (although a few lines earlier > list_pipe_pid = pid is assigned, so it would seem to makes sense) and > (b) that this has never caused a wild pointer dereference in all this > time seems to indicate that this code is never actually reached. The code does get reached: if you do . % f() vim % f ^Z . at a new shell, it executes the line you changed. Perhaps nothing reads the value assigned to jn->other, though? > I don't have any more time to look at this now; hoping this information > is helpful to PWS if he has a chance to look when he wakes up on Friday. > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-23 1:18 ` Bart Schaefer 2016-09-23 6:06 ` Daniel Shahaf @ 2016-09-25 14:39 ` Peter Stephenson 2016-09-25 22:54 ` struct job.other (was Re: Bug related to stdin/always/jobcontrol) Bart Schaefer 1 sibling, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-25 14:39 UTC (permalink / raw) To: zsh-workers On Thu, 22 Sep 2016 18:18:47 -0700 Bart Schaefer <schaefer@brasslantern.com> wrote: > However, in attempting to figure it out, I found some sixteen-year-old > code that is clearly wrong: > > diff --git a/Src/exec.c b/Src/exec.c > index d924148..bf97b5c 100644 > --- a/Src/exec.c > +++ b/Src/exec.c > @@ -1702,7 +1702,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) > jobtab[list_pipe_job].other = newjob; > jobtab[list_pipe_job].stat |= STAT_SUPERJOB; > jn->stat |= STAT_SUBJOB | STAT_NOPRINT; > - jn->other = pid; > + jn->other = list_pipe_job; > jn->gleader = jobtab[list_pipe_job].gleader; > } > if ((list_pipe || last1) && hasprocs(list_pipe_job)) > > jn->other is a job table index, not a process ID. The code's correct as it stands (modulo problems setting the values it's using) --- read my previous explanations and the updated comment in zsh.h about "other" which now explicitly states this is a pid if the job is a SUBJOB. "other" could perhaps be a union to indicate this. pws ^ permalink raw reply [flat|nested] 20+ messages in thread
* struct job.other (was Re: Bug related to stdin/always/jobcontrol) 2016-09-25 14:39 ` Peter Stephenson @ 2016-09-25 22:54 ` Bart Schaefer 2016-09-26 11:20 ` Peter Stephenson 0 siblings, 1 reply; 20+ messages in thread From: Bart Schaefer @ 2016-09-25 22:54 UTC (permalink / raw) To: zsh-workers On Sep 25, 3:39pm, Peter Stephenson wrote: } } > jn->other is a job table index, not a process ID. } } The code's correct as it stands (modulo problems setting the values it's } using) -- read my previous explanations and the updated comment in zsh.h OK, thanks ... how about this: diff --git a/Src/exec.c b/Src/exec.c index 4e89340..c27c41c 100644 --- a/Src/exec.c +++ b/Src/exec.c @@ -1702,7 +1702,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) jobtab[list_pipe_job].other = newjob; jobtab[list_pipe_job].stat |= STAT_SUPERJOB; jn->stat |= STAT_SUBJOB | STAT_NOPRINT; - jn->other = pid; + jn->other = list_pipe_pid; /* see zsh.h */ if (hasprocs(list_pipe_job)) jn->gleader = jobtab[list_pipe_job].gleader; } ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: struct job.other (was Re: Bug related to stdin/always/jobcontrol) 2016-09-25 22:54 ` struct job.other (was Re: Bug related to stdin/always/jobcontrol) Bart Schaefer @ 2016-09-26 11:20 ` Peter Stephenson 2016-09-26 16:33 ` Bart Schaefer 0 siblings, 1 reply; 20+ messages in thread From: Peter Stephenson @ 2016-09-26 11:20 UTC (permalink / raw) To: zsh-workers On Sun, 25 Sep 2016 15:54:36 -0700 Bart Schaefer <schaefer@brasslantern.com> wrote: > On Sep 25, 3:39pm, Peter Stephenson wrote: > } > } > jn->other is a job table index, not a process ID. > } > } The code's correct as it stands (modulo problems setting the values it's > } using) -- read my previous explanations and the updated comment in zsh.h > > OK, thanks ... how about this: > > diff --git a/Src/exec.c b/Src/exec.c > index 4e89340..c27c41c 100644 > --- a/Src/exec.c > +++ b/Src/exec.c > @@ -1702,7 +1702,7 @@ execpline(Estate state, wordcode slcode, int how, int last1) > jobtab[list_pipe_job].other = newjob; > jobtab[list_pipe_job].stat |= STAT_SUPERJOB; > jn->stat |= STAT_SUBJOB | STAT_NOPRINT; > - jn->other = pid; > + jn->other = list_pipe_pid; /* see zsh.h */ > if (hasprocs(list_pipe_job)) > jn->gleader = jobtab[list_pipe_job].gleader; > } > That's OK; I assume there's no race where list_pipe_pid is set to something else since signals are blocked. Same thing happens at line 1597, which is the code I added for the new case with "ls | func". pws ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: struct job.other (was Re: Bug related to stdin/always/jobcontrol) 2016-09-26 11:20 ` Peter Stephenson @ 2016-09-26 16:33 ` Bart Schaefer 0 siblings, 0 replies; 20+ messages in thread From: Bart Schaefer @ 2016-09-26 16:33 UTC (permalink / raw) To: zsh-workers On Sep 26, 12:20pm, Peter Stephenson wrote: } Subject: Re: struct job.other (was Re: Bug related to stdin/always/jobcont } } On Sun, 25 Sep 2016 15:54:36 -0700 } Bart Schaefer <schaefer@brasslantern.com> wrote: } > - jn->other = pid; } > + jn->other = list_pipe_pid; /* see zsh.h */ } } That's OK; I assume there's no race where list_pipe_pid is set to } something else since signals are blocked. If there is such a race, then list_pipe_job is also going to be wrong in all of the surrounding lines, so I think the above is at least not more broken. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Bug related to stdin/always/jobcontrol 2016-09-22 11:59 ` Daniel Shahaf 2016-09-22 16:38 ` Bart Schaefer @ 2016-09-25 15:16 ` Peter Stephenson 1 sibling, 0 replies; 20+ messages in thread From: Peter Stephenson @ 2016-09-25 15:16 UTC (permalink / raw) To: zsh-workers On Thu, 22 Sep 2016 11:59:21 +0000 Daniel Shahaf <d.s@daniel.shahaf.name> wrote: > This (39331) broke fg'ing a function for me: > > % print $ZSH_PATCHLEVEL > zsh-5.2-dev-1-401-g4e51079 > % f() { $EDITOR } > % f > <press ^Z in the editor> > zsh: suspended f > % fg > [1] + continued f > zsh: suspended (tty output) f > > After the 'fg' I get another prompt immediately, instead of being > returend to $EDITOR. I think this is a fairly straightforward additional race fix, luckily, at least to get us to the point where known problems are fixed. In the cases I was looking at before, the processes associated with the SUPERJOB already existed, and there was no further opportunity of fixing up the pgrp for the SUBJOB. The fact those processes already existed also determined the pgrp for the SUBJOB, which is what's being set here that caused the case above to be mishandled. In this case, there's no process associated with the SUPERJOB yet, but that's OK: everything gets handled properly when there is one. I think the gleader of the SUBJOB doesn't need any further special handling here (everything seems to work, anyway). Process handling and SUPERJOB creation are murkily separate, so I don't think this is (necessarily) buggy. There could well be more aspects as I certainly haven't got my mind round everything --- in particular, I don't know for sure that the fact the shell has registered processes associated with the superjob is actually tied to the subjob's pgrp, which is what this test assumes, just that that appears to be so in the cases I've tried. pws diff --git a/Src/exec.c b/Src/exec.c index d924148..a5086c3 100644 --- a/Src/exec.c +++ b/Src/exec.c @@ -1703,7 +1703,8 @@ execpline(Estate state, wordcode slcode, int how, int last1) jobtab[list_pipe_job].stat |= STAT_SUPERJOB; jn->stat |= STAT_SUBJOB | STAT_NOPRINT; jn->other = pid; - jn->gleader = jobtab[list_pipe_job].gleader; + if (hasprocs(list_pipe_job)) + jn->gleader = jobtab[list_pipe_job].gleader; } if ((list_pipe || last1) && hasprocs(list_pipe_job)) killpg(jobtab[list_pipe_job].gleader, SIGSTOP); ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2016-09-26 16:33 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-09-02 18:39 Bug related to stdin/always/jobcontrol Christian Neukirchen 2016-09-05 15:04 ` Christian Neukirchen 2016-09-05 15:42 ` Peter Stephenson [not found] ` <CGME20160914173110eucas1p1f0567e319ae439b975b19f4e55676fed@eucas1p1.samsung.com> 2016-09-14 17:31 ` Peter Stephenson 2016-09-14 21:35 ` Peter Stephenson 2016-09-15 3:24 ` Bart Schaefer 2016-09-15 8:27 ` Peter Stephenson 2016-09-15 10:33 ` Peter Stephenson 2016-09-15 17:40 ` Peter Stephenson 2016-09-16 1:08 ` Bart Schaefer 2016-09-16 12:02 ` Peter Stephenson 2016-09-22 11:59 ` Daniel Shahaf 2016-09-22 16:38 ` Bart Schaefer 2016-09-23 1:18 ` Bart Schaefer 2016-09-23 6:06 ` Daniel Shahaf 2016-09-25 14:39 ` Peter Stephenson 2016-09-25 22:54 ` struct job.other (was Re: Bug related to stdin/always/jobcontrol) Bart Schaefer 2016-09-26 11:20 ` Peter Stephenson 2016-09-26 16:33 ` Bart Schaefer 2016-09-25 15:16 ` Bug related to stdin/always/jobcontrol Peter Stephenson
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).