From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 7533 invoked by alias); 13 Jun 2011 14:38:21 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List
List-Post:
List-Help:
X-Seq: 29478
Received: (qmail 238 invoked from network); 13 Jun 2011 14:38:09 -0000
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on f.primenet.com.au
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received-SPF: none (ns1.primenet.com.au: domain at closedmail.com does not designate permitted sender hosts)
From: Bart Schaefer
Message-id: <110613073748.ZM2701@torch.brasslantern.com>
Date: Mon, 13 Jun 2011 07:37:48 -0700
In-reply-to: <20110613120747.2f018471@pwslap01u.europe.root.pri>
Comments: In reply to Peter Stephenson "Re: killing suspended jobs makes zsh hang after 47d1215" (Jun 13, 12:07pm)
References: <86aadnwtl2.fsf@gmail.com>
 <110612072211.ZM26399@torch.brasslantern.com>
 <110612075958.ZM27334@torch.brasslantern.com>
 <8662oaha3g.fsf@gmail.com>
 <110612185339.ZM28551@torch.brasslantern.com>
 <20110613120747.2f018471@pwslap01u.europe.root.pri>
X-Mailer: OpenZMail Classic (0.9.2 24April2005)
To:
Subject: Re: killing suspended jobs makes zsh hang after 47d1215
MIME-version: 1.0
Content-type: text/plain; charset=us-ascii

On Jun 13, 12:07pm, Peter Stephenson wrote:
} Subject: Re: killing suspended jobs makes zsh hang after 47d1215
}
} It may be we need to distinguish the callers.  The original bug was when
} a process that had exited, that was part of a job that had not yet
} terminated, was being used inappropriately (see zsh-workers/28965).  In
} this case it appears that under similar circumstances we need the
} terminated job.  What is it in the current case that means we need the
} process number even though the process has exited?

In the original bug (28965 and long ago 12814) we have two "procs" with the same PID, one (call it X) exited but not yet removed from the job table and the other one (Y) still running. When the state of Y changes we want to update the job table for Y, but because we search linearly by zsh job number we find X first and improperly return that. In this scenario we do not want to find a job that isn't running, because we may confuse it with a job that IS running.

In the new bug we have no duplicate PIDs but we have a job whose state is changing twice. In one case (users/16092) it was stopped (so we update the state to say it's not running) and then killed behind our back (so the state changed again without first becoming running). In another case (workers/29475) the job really did exit, but we haven't removed it from the job table because it is part of a pipeline which has not yet finished. We need to find it in order to finally clean it up.

The first bug is obviously a whole lot more rare than the second, and it's pretty unfortunate that we've broken a common case to fix one that almost never happens.

It's possible that differentiating at the call to wait_for_processes() between whether we're called from a signal handler or whether we are called from bin_fg() might resolve the deadlock, but I don't think that's the correct fix. In the original bug we deadlock when called from a signal handler, and in the new bug we deadlock later because we did not update the job state correctly when called from a signal handler (we skipped the job because it's "not running" even though its state is changing a second time). Changing the way bin_fg() works might detect the latter mistake after the fact, but it will leave a bug that the job status is not being correctly reflected in the output of "jobs". (*Maybe* that's only a problem in the WIFSTOPPED() case and combining 29472 with a flag passed in through wait_for_processes() would suffice.)
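
To make the two failure modes concrete, here is a toy model of the lookup, not the real code from jobs.c. The struct, the enum and both function names are made up for illustration; the only thing taken from the discussion above is the behavior: one policy returns the first PID match regardless of state, the other considers only entries still marked running.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of a job-table entry; NOT zsh's real struct process. */
enum demo_state { DEMO_RUNNING, DEMO_STOPPED, DEMO_EXITED };

struct demo_proc {
    pid_t pid;
    enum demo_state state;
};

/* Policy A: plain linear search, the first PID match wins. */
struct demo_proc *
demo_first_match(struct demo_proc *table, size_t n, pid_t pid)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].pid == pid)
            return &table[i];
    return NULL;
}

/* Policy B: only entries still marked running are eligible. */
struct demo_proc *
demo_running_only(struct demo_proc *table, size_t n, pid_t pid)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].pid == pid && table[i].state == DEMO_RUNNING)
            return &table[i];
    return NULL;
}
```

With a table holding X (exited) ahead of Y (running) under the same PID, policy A hands back X, which is the 28965 problem. With a single stopped entry that is then killed, policy B finds nothing, so the second state change is never recorded, which is the new problem.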
In the 28965 case we might be able to fix it by having findproc() continue to scan the table for running jobs any time it encounters one that matches but is not running, as long as it eventually does return the first one it found if there are no others. That may be a lot of overhead for a case that almost never happens.
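
A rough sketch of what that scan could look like, using a flat array as a stand-in for the job table. Everything here (toy_proc, toy_findproc, the state enum) is hypothetical and much simpler than the real findproc(), which walks jobs and their process lists; it is only meant to show the "prefer running, fall back to the first match" rule.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the job-table structures. */
enum toy_state { TOY_RUNNING, TOY_STOPPED, TOY_EXITED };

struct toy_proc {
    pid_t pid;
    enum toy_state state;
};

/*
 * Return a running entry with this PID if one exists.  Otherwise return
 * the first non-running entry that matched, or NULL if nothing matched.
 */
struct toy_proc *
toy_findproc(struct toy_proc *table, size_t n, pid_t pid)
{
    struct toy_proc *fallback = NULL;

    for (size_t i = 0; i < n; i++) {
        if (table[i].pid != pid)
            continue;
        if (table[i].state == TOY_RUNNING)
            return &table[i];       /* a live match always wins */
        if (fallback == NULL)
            fallback = &table[i];   /* remember the first dead match */
    }
    return fallback;
}
```

In this sketch the extra cost is paid only when the first match is not running: the loop then has to walk the rest of the table before it can return the fallback.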