From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-36124-mason-zsh=primenet.com.au@zsh.org>
Received: (qmail 24568 invoked by alias); 12 Aug 2015 09:44:04 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List <zsh-workers.zsh.org>
List-Post: <mailto:zsh-workers@zsh.org>
List-Help: <mailto:zsh-workers-help@zsh.org>
X-Seq: 36124
Received: (qmail 23108 invoked from network); 12 Aug 2015 09:44:01 -0000
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_HI,
	RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_PASS autolearn=ham
	autolearn_force=no version=3.4.0
X-AuditID: cbfec7f5-f794b6d000001495-2f-55cb155a5aaa
Date: Wed, 12 Aug 2015 10:43:51 +0100
From: Peter Stephenson <p.stephenson@samsung.com>
To: zsh-workers@zsh.org
Subject: Re: 5.0.8 regression when waiting for suspended jobs
Message-id: <20150812104351.65a4cbea@pwslap01u.europe.root.pri>
In-reply-to: <150811165655.ZM31504@torch.brasslantern.com>
References: <87wpxhk970.fsf@gmail.com>
 <150730123904.ZM11774@torch.brasslantern.com> <87si84k9uf.fsf@gmail.com>
 <150731085638.ZM15733@torch.brasslantern.com>
 <150811165655.ZM31504@torch.brasslantern.com>
Organization: Samsung Cambridge Solution Centre
X-Mailer: Claws Mail 3.7.9 (GTK+ 2.22.0; i386-redhat-linux-gnu)
MIME-version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7bit

On Tue, 11 Aug 2015 16:56:55 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Jul 31,  8:56am, Bart Schaefer wrote:
> I still only suspect what changed to make 5.0.8 different from 5.0.7 in
> this regard, but here's what's going on:

> - "wait $!" -
> } zsh-5.0.7
> }  - "wait $!" blocks (looping on repeated wait3() nonzero)
> } zsh-5.0.8
> }  - "wait $!" loops but also printing status every time
> 
> bin_fg() calls waitforpid() which discovers the job is stopped and goes
> into a loop calling kill(pid, SIGCONT) to try to get the job to run
> again.  In the 5.0.8 case, each time this happens the job briefly wakes
> up, gets stopped with SIGTTIN, thus causes another SIGCHLD to go to the
> parent zsh, which then prints the "suspended" message and loops right
> back to kill(pid, SIGCONT) again.
> 
> All of this is exactly the same as in 5.0.7 except that because of the
> SIGCONT change in workers/35032 we notice the stopped -> continued ->
> stopped again status change and therefore print the new status even
> though it's actually the same as the last time we printed the status,
> because we skipped printing the "continued" status.  Or so I surmise.

So presumably the right thing to do is to note that the job was stopped
again immediately, possibly warn the user, and not try to continue it
again without further user action?  Is that easy?  Can we pin down
"immediately" well enough?  Clearly there's a race in the real world
where the programme could get SIGTTIN at any time, but in the general
case (i.e. where a background process got SIGTTIN while the foreground
was doing something unrelated) you clearly *don't* want to continue it
every time.

In that case the difference between 5.0.7 and 5.0.8 becomes
basically moot (it's different but in a sane fashion).
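
To make the "continue once, then give up" idea concrete, here is a rough
sketch in Python rather than zsh's C.  It is not the zsh code: wait_once()
and demo() are made-up names, and the child stops itself with SIGSTOP as a
stand-in for SIGTTIN, since a real SIGTTIN needs a controlling terminal.
"Immediately" is approximated here as "the next status change after our
single SIGCONT is another stop", which may well be too crude for the shell.

```python
import os
import signal


def wait_once(pid):
    """Wait for pid, sending SIGCONT at most once if it is found stopped.

    Returns ("exited", code), ("signaled", sig) or ("stopped", sig).
    """
    continued = False
    while True:
        _, status = os.waitpid(pid, os.WUNTRACED)
        if os.WIFSTOPPED(status):
            if continued:
                # Stopped again after our single SIGCONT: report it and
                # leave any further action to the user.
                return ("stopped", os.WSTOPSIG(status))
            os.kill(pid, signal.SIGCONT)
            continued = True
            continue
        if os.WIFEXITED(status):
            return ("exited", os.WEXITSTATUS(status))
        return ("signaled", os.WTERMSIG(status))


def demo(stops):
    """Fork a child that stops itself `stops` times, then exits with 0."""
    pid = os.fork()
    if pid == 0:
        try:
            for _ in range(stops):
                os.kill(os.getpid(), signal.SIGSTOP)  # stand-in for SIGTTIN
        finally:
            os._exit(0)
    result = wait_once(pid)
    if result[0] == "stopped":
        # Clean up the still-stopped child so the demo leaves nothing behind.
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    return result
```

With demo(1) the single SIGCONT is enough and the child exits; with
demo(2) the child stops a second time and the wait returns instead of
looping.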

Do we even understand what the loop with SIGCONT is doing for us?  Under
what circumstances would this help?  Some (other sort of) race where
something else (what?  Not zsh and not the process that's suspended)
takes a while to get going, so the SIGCONT only succeeds after a few
attempts?
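
For reference, a toy model of the loop as described in the quoted
analysis, again in Python and again not the zsh source:
spin_on_stopped_child() is a hypothetical name, the loop is capped at
`rounds` so that it terminates, and SIGSTOP stands in for SIGTTIN.  It
only illustrates that, against a child which stops as soon as it runs,
every SIGCONT produces exactly one more stop report and nothing else.

```python
import os
import signal


def spin_on_stopped_child(rounds):
    """Send up to `rounds` SIGCONTs to a self-stopping child; count stop reports."""
    pid = os.fork()
    if pid == 0:
        # Child: stop again every time it is continued, like a background
        # job that hits SIGTTIN as soon as it gets to run.
        try:
            while True:
                os.kill(os.getpid(), signal.SIGSTOP)
        finally:
            os._exit(0)
    reports = 0
    for _ in range(rounds):
        _, status = os.waitpid(pid, os.WUNTRACED)
        if not os.WIFSTOPPED(status):
            return reports  # child is gone and already reaped
        reports += 1  # the point where the shell would print "suspended"
        os.kill(pid, signal.SIGCONT)
    # The child is still alive (stopped or about to stop); clean it up.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return reports
```

In this model the number of reports always equals the number of SIGCONTs
sent, so retrying never gets the job any further.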

> - "wait %1" -
> 
> bin_fg() calls zwaitjob() which does NOT do kill(pid, SIGCONT) instead
> simply blocking forever waiting for a SIGCHLD that will never arrive.

Hmm... I can't think of a good reason from the user's point of view why
this should behave differently.  It just seems confusing.  It's
certainly not documented as a zsh feature, is it?
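
The reason a plain blocking wait can hang is easy to show outside zsh.
The sketch below is only an analogue in Python, using waitpid() rather
than zsh's SIGCHLD handling; stop_reported_only_once() is a made-up name
and SIGSTOP stands in for SIGTTIN.  Once the "stopped" status has been
collected, a second non-blocking poll should find nothing pending, so a
blocking wait at that point has nothing to wake it up until somebody
else continues or kills the job.

```python
import os
import signal


def stop_reported_only_once():
    """Return the pid that a second, non-blocking poll of a stopped child reports."""
    pid = os.fork()
    if pid == 0:
        try:
            os.kill(os.getpid(), signal.SIGSTOP)  # stand-in for SIGTTIN
        finally:
            os._exit(0)
    # The first wait consumes the "stopped" report, as the shell did when
    # it first noticed that the job was suspended.
    _, status = os.waitpid(pid, os.WUNTRACED)
    assert os.WIFSTOPPED(status)
    # Poll again without blocking: a pid of 0 means no status change is
    # pending for the child.
    polled_pid, _ = os.waitpid(pid, os.WNOHANG | os.WUNTRACED)
    # Clean up the stopped child.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return polled_pid
```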

> - "wait" -
> 
> bin_fg() goes into a loop calling zwaitjob() on every entry in the job
> table; i.e., identical to "wait %1" repeated for every job number.

In which case I think the same reaction arises.

pws