zsh hangs sometimes continued.

* zsh hangs sometimes continued.
       [not found] <1301593035.6016.ezmlm@zsh.org>
@ 2011-03-31 18:32 ` VAN VLIERBERGHE Stef
  2011-04-01  9:28   ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: VAN VLIERBERGHE Stef @ 2011-03-31 18:32 UTC (permalink / raw)
  To: zsh-workers; +Cc: LORANG Geert

For about two years we have been suffering from the same bug that was
reported in http://www.zsh.org/mla/users/2008/msg00432.html.

After adding more and more debug info to the zsh-4.3.10 sources I
figured out that the problem is findproc() matching the pid of
a terminated process.

Geert (Cc) then pointed out that this problem matches perfectly with the
description in the msg00432.html above,
however the fix made at that time was insufficient in our case.

The previous fix was to stop looking in jobs that were in status
STAT_DONE, i.e. jobs that do not contain any process in status
SP_RUNNING:

+	/*
+	 * We are only interested in jobs with processes still
+	 * marked as live.  Careful in case there's an identical
+	 * process number in a job we haven't quite got around
+	 * to deleting.
+	 */
+	if (jobtab[i].stat & STAT_DONE)
+	    continue;
+
 	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
 	     pn; pn = pn->next)
 	    if (pn->pid == pid) {

However, this does not prevent findproc from returning a process that is
no longer running. In our case there was a job containing 2 processes,
one of them running and one of them terminated. In that case the job is
not "STAT_DONE", but the for loop above still happily returns the
process that was already terminated, leading to the same deadlock
situation as in the original description.

So I added a condition to check that the returned pid is still running :

	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
	     pn; pn = pn->next)
	    if ((pn->pid == pid)
		&& (pn->status == SP_RUNNING)
		/*
		 * Additional condition required to avoid INC035: when a
		 * job contains two pids, one terminated pid and one
		 * running pid, then the condition above
		 * jobtab[i].stat & STAT_DONE will not stop these pids
		 * from being candidates for the findproc result (which
		 * is supposed to be a RUNNING pid), and if the
		 * terminated pid is an identical process number for the
		 * pid identifying the running process we are trying to
		 * find (after pid number wrapping), then we need to
		 * avoid returning the terminated pid, otherwise the
		 * shell would block and wait forever for the
		 * termination of the process which pid we were supposed
		 * to return in a different job.
		 */
		) {
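
The difference between the two lookups can be sketched with a toy job
table. This is only a model under stated assumptions: the struct layouts,
the constant values and the names find_by_pid_old(), find_by_pid_fixed()
and stale_match_demo() are invented for illustration and are not the
definitions used in Src/jobs.c or Src/zsh.h.

```c
#include <stddef.h>
#include <sys/types.h>

/* Simplified stand-ins for zsh's job table (assumed, not the real ones). */
enum { SP_RUNNING = -1 };      /* assumed marker: process not yet waited for */
enum { STAT_DONE = 1 << 0 };   /* assumed flag: all processes in job finished */

struct process {
    struct process *next;
    pid_t pid;
    int status;                /* SP_RUNNING, or an exit status once waited for */
};

struct job {
    int stat;
    struct process *procs;
};

/* Lookup with only the earlier fix: finished jobs are skipped, nothing more. */
struct process *find_by_pid_old(struct job *tab, size_t njobs, pid_t pid)
{
    for (size_t i = 0; i < njobs; i++) {
        if (tab[i].stat & STAT_DONE)
            continue;
        for (struct process *pn = tab[i].procs; pn; pn = pn->next)
            if (pn->pid == pid)
                return pn;
    }
    return NULL;
}

/* Lookup with the extra per-process test proposed in this message. */
struct process *find_by_pid_fixed(struct job *tab, size_t njobs, pid_t pid)
{
    for (size_t i = 0; i < njobs; i++) {
        if (tab[i].stat & STAT_DONE)
            continue;
        for (struct process *pn = tab[i].procs; pn; pn = pn->next)
            if (pn->pid == pid && pn->status == SP_RUNNING)
                return pn;
    }
    return NULL;
}

/*
 * Job 0 is "cat | uniq": cat (pid 100) has exited, uniq (pid 101) is
 * still running, so the job is not STAT_DONE.  Job 1 holds a new child
 * that was given the recycled pid 100.
 * Returns 1 if the old lookup hits the dead cat entry while the fixed
 * lookup hits the live child.
 */
int stale_match_demo(void)
{
    struct process uniq  = { NULL,  101, SP_RUNNING };
    struct process cat   = { &uniq, 100, 0 };
    struct process child = { NULL,  100, SP_RUNNING };
    struct job tab[2] = { { 0, &cat }, { 0, &child } };

    return find_by_pid_old(tab, 2, 100) == &cat
        && find_by_pid_fixed(tab, 2, 100) == &child;
}
```

In this model, a SIGCHLD for pid 100 handled with the old lookup would
update the cat entry a second time and leave the entry in job 1 marked
as running, which is the kind of state in which the shell keeps waiting.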

We had 2 scripts that suffered from this problem, the simplest one did
something like :

# $file is a placeholder for the (large) input file
cat "$file" | uniq | while read LINE
do
  : # quite a bit of fork-exec
done

From the traces I understood that the cat terminates well before the
uniq, and that inside the loop (a different job) a new process was
created that had the same pid as the cat. The cat's job was not yet
complete (because of the uniq), and hence the findproc code above,
called from zhandler handling SIGCHLD, concluded that the cat had died
a second time.

Obviously this problem was not easy to reproduce, because it depended a
lot on all the parallel fork activity needed to make the pid numbers
advance. Executing a "while true do usleep 100 done" loop significantly
increased the frequency of the script getting stuck (usually after 1..3
hours), but with the fix above the script has now run in a loop for over
2 days, so the fix looks promising.

We would appreciate it if this fix could be improved (if needed) and
validated/integrated.

Note: The problem was submitted to RedHat via HP, so you have probably
received the script and the input file before (it is very large, so I
don't attach it here). Anyway, now that you understand the problem I
guess it is not very difficult to reproduce it systematically: if the
cat just echoes its pid to a file and terminates, then the loop only
needs to wait until it has forked a child with the same pid and then
break, which should trigger the bug as well.
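
The "wait until a child gets the same pid" half of that recipe can be
sketched as below. This only illustrates pid wrapping in a standalone
C program; it does not exercise zsh's job table, and the function name
fork_until_pid() and its parameters are hypothetical. How many forks are
needed depends on the system's pid limit and on other fork activity.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork and reap short-lived children until one of them is assigned the
 * pid `target`, or until `max_attempts` forks have been made.
 * Returns 1 if the pid came round again, 0 if the limit was reached,
 * -1 if fork() failed.
 */
int fork_until_pid(pid_t target, long max_attempts)
{
    for (long n = 0; n < max_attempts; n++) {
        pid_t child = fork();

        if (child < 0)
            return -1;
        if (child == 0)
            _exit(0);                    /* child: exit straight away */

        /* Reap the child so that its pid becomes reusable later. */
        (void)waitpid(child, NULL, 0);

        if (child == target)
            return 1;
    }
    return 0;
}
```

Here `target` would be the pid that the terminated cat wrote to the
file. A pid that still belongs to a live process (for example the
caller's own) can never be handed out, so for such a target the
function never returns 1.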

____

This message and any files transmitted with it are legally privileged and intended for the sole use of the individual(s) or entity to whom they are addressed. If you are not the intended recipient, please notify the sender by reply and delete the message and any attachments from your system. Any unauthorised use or disclosure of the content of this message is strictly prohibited and may be unlawful.

Nothing in this e-mail message amounts to a contractual or legal commitment on the part of EUROCONTROL, unless it is confirmed by appropriately signed hard copy.

Any views expressed in this message are those of the sender.

* Re: zsh hangs sometimes continued.
  2011-03-31 18:32 ` zsh hangs sometimes continued VAN VLIERBERGHE Stef
@ 2011-04-01  9:28   ` Peter Stephenson
  2011-04-01 14:18     ` Bart Schaefer
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2011-04-01  9:28 UTC (permalink / raw)
  To: VAN VLIERBERGHE Stef, zsh-workers; +Cc: LORANG Geert

On Thu, 31 Mar 2011 20:32:58 +0200
VAN VLIERBERGHE Stef <stef.van-vlierberghe@eurocontrol.int> wrote:
> After adding more and more debug info to the zsh-4.3.10 sources I
> figured out that the problem is in the findjob returning the pid of
> a terminated process.

Thanks for the investigation and the explanation.  I agree the uses of
findproc() all appear to assume the process is still marked as running,
since they are all about to update it.  Here's the patch I'll apply.

Index: Src/jobs.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/jobs.c,v
retrieving revision 1.79
diff -p -u -r1.79 jobs.c
--- Src/jobs.c	22 Aug 2010 20:08:57 -0000	1.79
+++ Src/jobs.c	1 Apr 2011 09:09:54 -0000
@@ -173,11 +173,28 @@ findproc(pid_t pid, Job *jptr, Process *
 
 	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
 	     pn; pn = pn->next)
-	    if (pn->pid == pid) {
+	{
+	    /*
+	     * Make sure we match a process that's still running.
+	     *
+	     * When a job contains two pids, one terminated pid and one
+	     * running pid, then the condition (jobtab[i].stat &
+	     * STAT_DONE) will not stop these pids from being candidates
+	     * for the findproc result (which is supposed to be a
+	     * RUNNING pid), and if the terminated pid is an identical
+	     * process number for the pid identifying the running
+	     * process we are trying to find (after pid number
+	     * wrapping), then we need to avoid returning the terminated
+	     * pid, otherwise the shell would block and wait forever for
+	     * the termination of the process which pid we were supposed
+	     * to return in a different job.
+	     */
+	    if (pn->pid == pid && pn->status == SP_RUNNING) {
 		*pptr = pn;
 		*jptr = jobtab + i;
 		return 1;
 	    }
+	}
     }
 
     return 0;

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK



* Re: zsh hangs sometimes continued.
  2011-04-01  9:28   ` Peter Stephenson
@ 2011-04-01 14:18     ` Bart Schaefer
  2011-04-01 21:33       ` VAN VLIERBERGHE Stef
  2011-04-01 21:48       ` Phil Pennock
  0 siblings, 2 replies; 6+ messages in thread
From: Bart Schaefer @ 2011-04-01 14:18 UTC (permalink / raw)
  To: zsh-workers

On Apr 1, 10:28am, Peter Stephenson wrote:
} Subject: Re: zsh hangs sometimes continued.
}
} On Thu, 31 Mar 2011 20:32:58 +0200
} VAN VLIERBERGHE Stef <stef.van-vlierberghe@eurocontrol.int> wrote:
} > After adding more and more debug info to the zsh-4.3.10 sources I
} > figured out that the problem is in the findjob returning the pid of
} > a terminated process.
} 
} Thanks for the investigation and the explanation.  I agree the uses of
} findproc() all appear to assume the process is still marked as running,
} since they are all about to update it.  Here's the patch I'll apply.

Reading the explanation left me wondering whether there's an additional
problem in the event that the shell is spawning processes fast enough
for the PID of a terminated-but-not-yet-reaped job to have been re-used
in a new still-running job?

* RE: zsh hangs sometimes continued.
  2011-04-01 14:18     ` Bart Schaefer
@ 2011-04-01 21:33       ` VAN VLIERBERGHE Stef
  2011-04-01 21:48       ` Phil Pennock
  1 sibling, 0 replies; 6+ messages in thread
From: VAN VLIERBERGHE Stef @ 2011-04-01 21:33 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers

Thanks for fixing, Peter.

Bart, the kernel will not re-use a pid until the parent process collects
the state change by a wait system call (zsh uses wait4()) so it is
impossible for a process with the same pid to exist when the wait
returns. As long as the parent process (usually inside the zhandler
catching the SIGCHLD) updates the job table and pid state before the
process forks a new child there is no additional problem I believe. A
new process with the same pid could appear before the job table update
is complete, but it won't be a child of this same parent process.
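
The point that an exited child keeps its pid reserved until the parent
waits for it can be checked with a small sketch. It assumes a POSIX
system that provides waitid() with WNOWAIT; the function name
zombie_keeps_pid() is made up for this example, and it uses waitid()
and waitpid() rather than the wait4() call mentioned above.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child that has exited, but has not been waited for,
 * can still be addressed by its pid (so the pid is still reserved),
 * 0 if it cannot, and -1 on error.
 */
int zombie_keeps_pid(void)
{
    siginfo_t info;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);

    /* Block until the child has exited, but leave it unreaped (WNOWAIT). */
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) < 0)
        return -1;

    /* Signal 0 delivers nothing; it only tests whether the pid exists. */
    int exists = (kill(child, 0) == 0);

    /* Collect the status; only after this can the pid be handed out again. */
    if (waitpid(child, NULL, 0) != child)
        return -1;

    return exists;
}
```

The window that matters for the bug is therefore after the wait call
has returned: from then on the pid is free, while the shell's own job
table may still hold an entry carrying that number.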


* Re: zsh hangs sometimes continued.
  2011-04-01 14:18     ` Bart Schaefer
  2011-04-01 21:33       ` VAN VLIERBERGHE Stef
@ 2011-04-01 21:48       ` Phil Pennock
  2011-04-01 23:40         ` Bart Schaefer
  1 sibling, 1 reply; 6+ messages in thread
From: Phil Pennock @ 2011-04-01 21:48 UTC (permalink / raw)
  To: zsh-workers

On 2011-04-01 at 07:18 -0700, Bart Schaefer wrote:
> Reading the explanation left me wondering whether there's an additional
> problem in the event that the shell is spawning processes fast enough
> for the PID of a terminated-but-not-yet-reaped job to have been re-used
> in a new still-running job?

Is this an April Fool's joke?  [suffering flu, can't tell]

If it's not yet reaped, then it's a zombie and still using a slot in the
process table and the pid can't have been reused.  Surely.


* Re: zsh hangs sometimes continued.
  2011-04-01 21:48       ` Phil Pennock
@ 2011-04-01 23:40         ` Bart Schaefer
  0 siblings, 0 replies; 6+ messages in thread
From: Bart Schaefer @ 2011-04-01 23:40 UTC (permalink / raw)
  To: zsh-workers

On Apr 1,  5:48pm, Phil Pennock wrote:
} Subject: Re: zsh hangs sometimes continued.
}
} On 2011-04-01 at 07:18 -0700, Bart Schaefer wrote:
} > Reading the explanation left me wondering whether there's an additional
} > problem in the event that the shell is spawning processes fast enough
} > for the PID of a terminated-but-not-yet-reaped job to have been re-used
} > in a new still-running job?
} 
} Is this an April Fool's joke?  [suffering flu, can't tell]

No, I'm just using the word "reaped" too loosely.

My thought was that a job in a pipeline or the like might still have a
slot in *zsh's* job table (if the rest of the pipeline had not exited)
even though the corresponding process was already gone (and had been
waited-for in the SIGCHLD handler).  Hence zsh could end up thinking
it needed to do a synchronous wait for the old job which, if the pid
was never re-used, would immediately fail, but if the pid was reused
would instead hang while the new one with the same pid was still
running (and then maybe fail to mark the newer one as done in its
actual job slot).

On further reflection, though, either this must already have been
impossible, or the test in PWS's patch would cover it as well.
