From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-29478-mason-zsh=primenet.com.au@zsh.org>
Received: (qmail 7533 invoked by alias); 13 Jun 2011 14:38:21 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List <zsh-workers.zsh.org>
List-Post: <mailto:zsh-workers@zsh.org>
List-Help: <mailto:zsh-workers-help@zsh.org>
X-Seq: 29478
Received: (qmail 238 invoked from network); 13 Jun 2011 14:38:09 -0000
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE
	autolearn=ham version=3.3.1
Received-SPF: none (ns1.primenet.com.au: domain at closedmail.com does not designate permitted sender hosts)
From: Bart Schaefer <schaefer@brasslantern.com>
Message-id: <110613073748.ZM2701@torch.brasslantern.com>
Date: Mon, 13 Jun 2011 07:37:48 -0700
In-reply-to: <20110613120747.2f018471@pwslap01u.europe.root.pri>
Comments: In reply to Peter Stephenson <Peter.Stephenson@csr.com>
 "Re: killing suspended jobs makes zsh hang after 47d1215" (Jun 13, 12:07pm)
References: <86aadnwtl2.fsf@gmail.com>
	<110612072211.ZM26399@torch.brasslantern.com>
	<110612075958.ZM27334@torch.brasslantern.com>	<8662oaha3g.fsf@gmail.com>
	<110612185339.ZM28551@torch.brasslantern.com>
	<20110613120747.2f018471@pwslap01u.europe.root.pri>
X-Mailer: OpenZMail Classic (0.9.2 24April2005)
To: <zsh-workers@zsh.org>
Subject: Re: killing suspended jobs makes zsh hang after 47d1215
MIME-version: 1.0
Content-type: text/plain; charset=us-ascii

On Jun 13, 12:07pm, Peter Stephenson wrote:
} Subject: Re: killing suspended jobs makes zsh hang after 47d1215
}
} It may be we need to distinguish the callers.  The original bug was when
} a process that had exited, that was part of a job that had not yet
} terminated, was being used inappropriately (see zsh-workers/28965).  In
} this case it appears that under similar circumstances we need the
} terminated job.  What is it in the current case that means we need the
} process number even though the process has exited?

In the original bug (28965 and long ago 12814) we have two "procs"
with the same PID, one (call it X) exited but not yet removed from
the job table and the other one (Y) still running.  When the state
of Y changes we want to update the job table for Y, but because we
search linearly by zsh job number we find X first and improperly
return that.  In this scenario we do not want to find a job that
isn't running, because we may confuse it with a job that IS running.
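
To make the failure concrete, here is a minimal sketch of the lookup
problem.  It is a toy model, not zsh's actual job table: the types and
the name find_first_by_pid() are hypothetical and stand in for the
first-match behavior described above.

```c
#include <stddef.h>

/* Hypothetical, simplified model; not zsh's real job/process structs. */
enum pstate { P_RUNNING, P_EXITED };

struct proc {
    int pid;
    enum pstate state;
};

struct job {
    struct proc *procs;   /* processes belonging to this job */
    size_t nprocs;
};

/*
 * Naive lookup: walk the table in job-number order and return the
 * first proc whose PID matches, regardless of its state.
 */
static struct proc *
find_first_by_pid(struct job *tab, size_t njobs, int pid)
{
    for (size_t j = 0; j < njobs; j++)
        for (size_t i = 0; i < tab[j].nprocs; i++)
            if (tab[j].procs[i].pid == pid)
                return &tab[j].procs[i];
    return NULL;
}
```

With X (exited) in a lower-numbered job and Y (running) in a higher
one, both carrying the same PID, this lookup hands back X, so a status
change meant for Y would be recorded against the wrong entry.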

In the new bug we have no duplicate PIDs but we have a job whose state
is changing twice.  In one case (users/16092) it was stopped (so we
update the state to say it's not running) and then killed behind our
back (so the state changed again without first becoming running).  In
another case (workers/29475) the job really did exit, but we haven't
removed it from the job table because it is part of a pipeline which
has not yet finished.  We need to find it in order to finally clean
it up.
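
The stopped-then-killed case can be sketched the same way.  Again this
is a toy model under stated assumptions: find_running_only() and
update_state() are hypothetical names, and the lookup simply skips
anything not marked running, which is my reading of the behavior that
causes the second state change to be lost.

```c
#include <stddef.h>

/* Hypothetical, simplified model; not zsh's real process struct. */
enum pstate { P_RUNNING, P_STOPPED, P_EXITED };

struct proc {
    int pid;
    enum pstate state;
};

/* Lookup that ignores any proc not currently marked running. */
static struct proc *
find_running_only(struct proc *procs, size_t n, int pid)
{
    for (size_t i = 0; i < n; i++)
        if (procs[i].pid == pid && procs[i].state == P_RUNNING)
            return &procs[i];
    return NULL;
}

/*
 * Record a state change reported for pid.  Returns 1 if the change
 * was recorded, 0 if the lookup found no proc to update.
 */
static int
update_state(struct proc *procs, size_t n, int pid, enum pstate new_state)
{
    struct proc *p = find_running_only(procs, n, pid);
    if (p == NULL)
        return 0;
    p->state = new_state;
    return 1;
}
```

In this model the first update (running to stopped) succeeds, but the
second (stopped to exited) finds nothing, so the entry stays "stopped"
forever and anything waiting on it never sees it finish.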

The original bug (duplicate PIDs) is obviously far rarer than the new
one, and it's pretty unfortunate that we've broken a common case to
fix one that almost never happens.

It's possible that differentiating at the call to wait_for_processes()
between whether we're called from a signal handler or whether we are
called from bin_fg() might resolve the deadlock, but I don't think
that's the correct fix.  In the original bug we deadlock when called
from a signal handler, and in the new bug we deadlock later because
we did not update the job state correctly when called from a signal
handler (we skipped the job because it's "not running" even though
its state is changing a second time).

Changing the way bin_fg() works might detect the latter mistake after
the fact, but it will leave a bug in which the job status is not
correctly reflected in the output of "jobs".  (*Maybe* that's only a
problem in the WIFSTOPPED() case and combining 29472 with a flag
passed in through wait_for_processes() would suffice.)

In the 28965 case we might be able to fix it by having findproc()
continue to scan the table for running jobs any time it encounters
one that matches but is not running, as long as it eventually does
return the first one it found if there are no others.  That may be
a lot of overhead for a case that almost never happens.
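
Roughly, the scan I have in mind would look like the sketch below.  It
is illustrative only: the types and find_prefer_running() are
hypothetical, and it does not reproduce the real findproc() signature
or the real job table layout.

```c
#include <stddef.h>

/* Hypothetical, simplified model; not zsh's real job/process structs. */
enum pstate { P_RUNNING, P_STOPPED, P_EXITED };

struct proc {
    int pid;
    enum pstate state;
};

struct job {
    struct proc *procs;   /* processes belonging to this job */
    size_t nprocs;
};

/*
 * Prefer a running proc with the given PID.  If none is running,
 * fall back to the first non-running match seen, so that a stopped
 * or already-exited entry can still be found and updated.
 */
static struct proc *
find_prefer_running(struct job *tab, size_t njobs, int pid)
{
    struct proc *fallback = NULL;

    for (size_t j = 0; j < njobs; j++) {
        for (size_t i = 0; i < tab[j].nprocs; i++) {
            struct proc *p = &tab[j].procs[i];
            if (p->pid != pid)
                continue;
            if (p->state == P_RUNNING)
                return p;          /* live match wins immediately */
            if (fallback == NULL)
                fallback = p;      /* remember first stale match */
        }
    }
    return fallback;
}
```

The cost is that a lookup for a non-running PID always walks the whole
table, which is the overhead mentioned above; a lookup that hits a
running proc still returns as soon as it finds it.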