* zsh hangs sometimes continued.
From: VAN VLIERBERGHE Stef @ 2011-03-31 18:32 UTC
To: zsh-workers; +Cc: LORANG Geert

For about 2 years we have been suffering from the same bug that was
reported in: http://www.zsh.org/mla/users/2008/msg00432.html.

After adding more and more debug info to the zsh-4.3.10 sources I
figured out that the problem is findproc() returning the pid of a
terminated process. Geert (Cc) then pointed out that this problem
matches the description in msg00432.html above perfectly; however, the
fix made at that time was insufficient in our case.

The previous fix was to stop looking in jobs that were in status
STAT_DONE, i.e. jobs that do not contain any process in status
SP_RUNNING:

    +	/*
    +	 * We are only interested in jobs with processes still
    +	 * marked as live.  Careful in case there's an identical
    +	 * process number in a job we haven't quite got around
    +	 * to deleting.
    +	 */
    +	if (jobtab[i].stat & STAT_DONE)
    +	    continue;
    +
     	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
     	     pn; pn = pn->next)
     	    if (pn->pid == pid) {

However, this does not prevent findproc() from returning a process that
is no longer running. In our case there was a job containing 2
processes, one of them running and one of them terminated. In that case
the job is not "STAT_DONE", but the for loop above still happily returns
the pid of the process that was already terminated, leading to the same
deadlock situation as in the original description.

So I added a condition to check that the returned pid is still running:

    for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
         pn; pn = pn->next)
        if ((pn->pid == pid) && (pn->status == SP_RUNNING)
            /*
               Additional condition required to avoid INC035 :
               When a job contains two pids, one terminated pid and one
               running pid, then the condition above
               jobtab[i].stat & STAT_DONE will not stop these pids from
               being candidates for the findproc result (which is
               supposed to be a RUNNING pid), and if the terminated pid
               is an identical process number for the pid identifying
               the running process we are trying to find (after pid
               number wrapping), then we need to avoid returning the
               terminated pid, otherwise the shell would block and wait
               forever for the termination of the process which pid we
               were supposed to return in a different job.
            */
            ) {

We had 2 scripts that suffered from this problem; the simplest one did
something like:

    cat <file> | uniq | while read LINE
    do
        quite a bit of fork-exec
    done

From the traces I understood that the cat terminates well before the
uniq, and inside the loop (a different job) a new process was created
that had the same pid as the cat. But the cat's job was not complete
(because of the uniq), and hence the findproc code above concluded that
the cat had died a second time (called from zhandler handling SIGCHLD).

Obviously this problem was not easy to reproduce, because it depended a
lot on all the parallel fork activity needed to make the pid numbers
advance. Executing a "while true; do usleep 100; done" significantly
increased the frequency of the script getting stuck (usually after 1..3
hours), but with the fix above the script has now run in a loop for over
2 days, so the fix looks promising.

We would appreciate it if this fix could be improved (if needed) and
validated/integrated.

Note: The problem was submitted to RedHat via HP, so you have probably
received the script and the input file before (it is very large, so I
don't attach it here). Anyway, now that you understand the problem I
guess it is not very difficult to reproduce it systematically: if the
cat just echoes its pid to a file and terminates, then the loop only
needs to wait until it has forked a child with the same pid and then
break, which should trigger the bug as well.

____

This message and any files transmitted with it are legally privileged
and intended for the sole use of the individual(s) or entity to whom
they are addressed. If you are not the intended recipient, please notify
the sender by reply and delete the message and any attachments from your
system. Any unauthorised use or disclosure of the content of this
message is strictly prohibited and may be unlawful.

Nothing in this e-mail message amounts to a contractual or legal
commitment on the part of EUROCONTROL, unless it is confirmed by
appropriately signed hard copy.

Any views expressed in this message are those of the sender.

* Re: zsh hangs sometimes continued.
From: Peter Stephenson @ 2011-04-01 9:28 UTC
To: VAN VLIERBERGHE Stef, zsh-workers; +Cc: LORANG Geert

On Thu, 31 Mar 2011 20:32:58 +0200
VAN VLIERBERGHE Stef <stef.van-vlierberghe@eurocontrol.int> wrote:
> After adding more and more debug info to the zsh-4.3.10 sources I
> figured out that the problem is findproc() returning the pid of
> a terminated process.

Thanks for the investigation and the explanation. I agree the uses of
findproc() all appear to assume the process is still marked as running,
since they are all about to update it. Here's the patch I'll apply.

Index: Src/jobs.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/jobs.c,v
retrieving revision 1.79
diff -p -u -r1.79 jobs.c
--- Src/jobs.c	22 Aug 2010 20:08:57 -0000	1.79
+++ Src/jobs.c	1 Apr 2011 09:09:54 -0000
@@ -173,11 +173,28 @@ findproc(pid_t pid, Job *jptr, Process *
 
 	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
 	     pn; pn = pn->next)
-	    if (pn->pid == pid) {
+	{
+	    /*
+	     * Make sure we match a process that's still running.
+	     *
+	     * When a job contains two pids, one terminated pid and one
+	     * running pid, then the condition (jobtab[i].stat &
+	     * STAT_DONE) will not stop these pids from being candidates
+	     * for the findproc result (which is supposed to be a
+	     * RUNNING pid), and if the terminated pid is an identical
+	     * process number for the pid identifying the running
+	     * process we are trying to find (after pid number
+	     * wrapping), then we need to avoid returning the terminated
+	     * pid, otherwise the shell would block and wait forever for
+	     * the termination of the process which pid we were supposed
+	     * to return in a different job.
+	     */
+	    if (pn->pid == pid && pn->status == SP_RUNNING) {
 		*pptr = pn;
 		*jptr = jobtab + i;
 		return 1;
 	    }
+	}
     }
 
     return 0;

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK

Member of the CSR plc group of companies. CSR plc registered in England
and Wales, registered number 4187346, registered office Churchill House,
Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom

* Re: zsh hangs sometimes continued.
From: Bart Schaefer @ 2011-04-01 14:18 UTC
To: zsh-workers

On Apr 1, 10:28am, Peter Stephenson wrote:
} Subject: Re: zsh hangs sometimes continued.
}
} On Thu, 31 Mar 2011 20:32:58 +0200
} VAN VLIERBERGHE Stef <stef.van-vlierberghe@eurocontrol.int> wrote:
} > After adding more and more debug info to the zsh-4.3.10 sources I
} > figured out that the problem is findproc() returning the pid of
} > a terminated process.
}
} Thanks for the investigation and the explanation.  I agree the uses of
} findproc() all appear to assume the process is still marked as running,
} since they are all about to update it.  Here's the patch I'll apply.

Reading the explanation left me wondering whether there's an additional
problem in the event that the shell is spawning processes fast enough
for the PID of a terminated-but-not-yet-reaped job to have been re-used
in a new still-running job?

* RE: zsh hangs sometimes continued.
From: VAN VLIERBERGHE Stef @ 2011-04-01 21:33 UTC
To: Bart Schaefer, zsh-workers

Thanks for fixing, Peter.

Bart, the kernel will not re-use a pid until the parent process collects
the state change by a wait system call (zsh uses wait4()), so it is
impossible for a process with the same pid to exist when the wait
returns. As long as the parent process (usually inside the zhandler
catching the SIGCHLD) updates the job table and pid state before the
process forks a new child, there is no additional problem, I believe.

A new process with the same pid could appear before the job table update
is complete, but it won't be a child of this same parent process.

* Re: zsh hangs sometimes continued.
From: Phil Pennock @ 2011-04-01 21:48 UTC
To: zsh-workers

On 2011-04-01 at 07:18 -0700, Bart Schaefer wrote:
> Reading the explanation left me wondering whether there's an additional
> problem in the event that the shell is spawning processes fast enough
> for the PID of a terminated-but-not-yet-reaped job to have been re-used
> in a new still-running job?

Is this an April Fool's joke?  [suffering flu, can't tell]

If it's not yet reaped, then it's a zombie and still using a slot in the
process table and the pid can't have been reused.  Surely.

* Re: zsh hangs sometimes continued.
From: Bart Schaefer @ 2011-04-01 23:40 UTC
To: zsh-workers

On Apr 1,  5:48pm, Phil Pennock wrote:
} Subject: Re: zsh hangs sometimes continued.
}
} On 2011-04-01 at 07:18 -0700, Bart Schaefer wrote:
} > Reading the explanation left me wondering whether there's an additional
} > problem in the event that the shell is spawning processes fast enough
} > for the PID of a terminated-but-not-yet-reaped job to have been re-used
} > in a new still-running job?
}
} Is this an April Fool's joke?  [suffering flu, can't tell]

No, I'm just using the word "reaped" too loosely.

My thought was that a job in a pipeline or the like might still have a
slot in *zsh's* job table (if the rest of the pipeline had not exited)
even though the corresponding process was already gone (and had been
waited for in the SIGCHLD handler).

Hence zsh could end up thinking it needed to do a synchronous wait for
the old job. If the pid was never re-used, that wait would fail
immediately; but if the pid was re-used, it would instead hang while the
new process with the same pid was still running (and then maybe fail to
mark the newer one as done in its actual job slot).

On further reflection, though, either this must already have been
impossible, or the test in PWS's patch would cover it as well.