From: Peter Stephenson <p.stephenson@samsung.com>
To: zsh-workers@zsh.org
Subject: Re: Bug related to stdin/always/jobcontrol
Date: Fri, 16 Sep 2016 13:02:37 +0100
Message-id: <20160916130237.04f1ef72@pwslap01u.europe.root.pri>
In-reply-to: <160915180855.ZM3928@torch.brasslantern.com>
Organization: Samsung Cambridge Solution Centre
List-Id: Zsh Workers List
X-Seq: 39359
References: <87r392jgd0.fsf@juno.home.vuxu.org>
 <20160905164207.4630643b@pwslap01u.europe.root.pri>
 <20160914183105.69862fa9@pwslap01u.europe.root.pri>
 <20160914223553.3173c8ca@ntlworld.com>
 <160914202450.ZM32479@torch.brasslantern.com>
 <20160915092721.3d669bc9@pwslap01u.europe.root.pri>
 <20160915113315.1655019c@pwslap01u.europe.root.pri>
 <20160915184040.5fc1b9d0@pwslap01u.europe.root.pri>
 <160915180855.ZM3928@torch.brasslantern.com>

On Thu, 15 Sep 2016 18:08:55 -0700
Bart Schaefer wrote:
> On Sep 15, 6:40pm, Peter Stephenson wrote:
> } Subject: Re: Bug related to stdin/always/jobcontrol
> }
> } The race is related to the fact that when the new SUPERJOB takes over
> } it's actually in the same process group as the SUBJOB it's taking over,
> } which is different from its own PID, list_pipe_pid [...]
> }
> } However, fixing that up doesn't help --- apparently attaching it to the
> } TTY after the SUBJOB is finished works OK, sending it SIGCONT seems to
> } be OK, but then it always gets SIGSTOP
>
> Is it definitely getting SIGSTOP and not SIGTSTP ?
>
> The operating system will never[*] automatically send a true STOP signal
> because it's un-catchable.  If a STOP is being received, it's probably
> being sent from the "for (; !nowait;)" loop in execpline(), either at
> line 1674 [ killpg(jobtab[list_pipe_job].gleader, SIGSTOP); ] or at
> line 1688 [ kill(getpid(), SIGSTOP); ].

I've found the double SIGSTOP --- it was getting *both* of the above.

In the case in question, the newly forked child zsh was picking up the
pgrp of the vim process, so when we sent SIGSTOP to that pgrp it got
stopped, too.  When it was restarted, it immediately stopped itself.

There's some code in the child to decide on the pgrp.  It looks like it
needs to set up its own pgrp if the old superjob has already gone, which
is the basis of the problems we are seeing here.
I'm not sure if this leaves some more races but, again, it looks like
it's an improvement.  The rest of the patch is cosmetic.

The bad news is there are other problems with this logic that have
crept in since 5.2 --- these appear to be unrelated, or at least not
made any worse by this change, so I'll start a separate thread.

pws

diff --git a/Src/exec.c b/Src/exec.c
index 21270c8..0755d55 100644
--- a/Src/exec.c
+++ b/Src/exec.c
@@ -1652,7 +1652,13 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 		    int synch[2];
 		    struct timeval bgtime;
 
+		    /*
+		     * A pipeline with the shell handling the right
+		     * hand side was stopped.  We'll fork to allow
+		     * it to continue.
+		     */
 		    if (pipe(synch) < 0 || (pid = zfork(&bgtime)) == -1) {
+			/* Failure */
 			if (pid < 0) {
 			    close(synch[0]);
 			    close(synch[1]);
@@ -1666,6 +1672,18 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 			thisjob = newjob;
 		    }
 		    else if (pid) {
+			/*
+			 * Parent: job control is here.  If the job
+			 * started for the RHS of the pipeline is still
+			 * around, then it's a SUBJOB and the job for
+			 * earlier parts of the pipeline is its SUPERJOB.
+			 * The newly forked shell isn't recorded as a
+			 * separate job here, just as list_pipe_pid.
+			 * If the superjob exits (it may already have
+			 * done so, see child branch below), we'll use
+			 * list_pipe_pid to form the basis of a
+			 * replacement job --- see SUBLEADER code above.
+			 */
 			char dummy;
 
 			lpforked =
@@ -1692,9 +1710,33 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 			    break;
 		    }
 		    else {
+			Job pjn = jobtab + list_pipe_job;
 			close(synch[0]);
 			entersubsh(ESUB_ASYNC);
-			if (jobtab[list_pipe_job].procs) {
+			/*
+			 * At this point, we used to attach this process
+			 * to the process group of list_pipe_job (the
+			 * new superjob) any time that was still available.
+			 * That caused problems when the fork happened
+			 * early enough that the subjob is in that
+			 * process group, since then we will stop this
+			 * job when we signal the subjob, and we have
+			 * no way here to know that we shouldn't also
+			 * send STOP to ourselves, resulting in two stops.
+			 * So don't do that if the original
+			 * list_pipe_job has exited.
+			 *
+			 * The choice here needs to match the assumption
+			 * made when handling SUBLEADER above that the
+			 * process group is our own PID.  I'm not sure
+			 * if there's a potential race for that.  But
+			 * setting up a new process group if the
+			 * superjob is still functioning seems the wrong
+			 * thing to do.
+			 */
+			if (pjn->procs &&
+			    (pjn->stat & STAT_INUSE) &&
+			    !(pjn->stat & STAT_DONE)) {
 			    if (setpgrp(0L, mypgrp = jobtab[list_pipe_job].gleader)
 				== -1) {
 				setpgrp(0L, mypgrp = getpid());
diff --git a/Src/jobs.c b/Src/jobs.c
index 9284c71..d1b98ac 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -232,11 +232,12 @@ super_job(int sub)
 static int
 handle_sub(int job, int fg)
 {
+    /* job: superjob; sj: subjob. */
     Job jn = jobtab + job, sj = jobtab + jn->other;
 
     if ((sj->stat & STAT_DONE) || (!sj->procs && !sj->auxprocs)) {
	struct process *p;
-
+
	for (p = sj->procs; p; p = p->next) {
	    if (WIFSIGNALED(p->status)) {
		if (jn->gleader != mypgrp && jn->procs->next)
diff --git a/Src/signals.c b/Src/signals.c
index 2eefc07..30dde71 100644
--- a/Src/signals.c
+++ b/Src/signals.c
@@ -723,7 +723,7 @@ killjb(Job jn, int sig)
 {
     Process pn;
     int err = 0;
-
+
     if (jobbing) {
	if (jn->stat & STAT_SUPERJOB) {
	    if (sig == SIGCONT) {
@@ -731,11 +731,21 @@ killjb(Job jn, int sig)
		    if (killpg(pn->pid, sig) == -1)
			if (kill(pn->pid, sig) == -1 && errno != ESRCH)
			    err = -1;
-
+
+		/*
+		 * Note this does not kill the last process,
+		 * which is assumed to be the one controlling the
+		 * subjob, i.e. the forked zsh that was originally
+		 * list_pipe_pid...
+		 */
		for (pn = jn->procs; pn->next; pn = pn->next)
		    if (kill(pn->pid, sig) == -1 && errno != ESRCH)
			err = -1;
 
+		/*
+		 * ...we only continue that once the external processes
+		 * currently associated with the subjob are finished.
+		 */
		if (!jobtab[jn->other].procs && pn)
		    if (kill(pn->pid, sig) == -1 && errno != ESRCH)
			err = -1;
@@ -744,7 +754,7 @@ killjb(Job jn, int sig)
	    }
	    if (killpg(jobtab[jn->other].gleader, sig) == -1 && errno != ESRCH)
		err = -1;
-
+
	    if (killpg(jn->gleader, sig) == -1 && errno != ESRCH)
		err = -1;