zsh-workers
 help / color / mirror / code / Atom feed
* problem with 'ls | less' shell function
@ 2022-10-17  9:17 Thomas Klausner
  2022-10-17 14:50 ` Mikael Magnusson
  2022-10-19  9:33 ` Jun T
  0 siblings, 2 replies; 28+ messages in thread
From: Thomas Klausner @ 2022-10-17  9:17 UTC (permalink / raw)
  To: zsh-workers

Hi!

I recently noticed a problem in zsh 5.9 (as built from pkgsrc) on
NetBSD 9.99.100. Since I didn't notice it before it could be related
to a change in NetBSD (I'm following the latest version), but I've
been told that the issue can be reproduced on Ubuntu 19.04 and FreeBSD
13.1 too; but not in zsh 5.8.1, nor in most other shells though.

The discussion on the NetBSD mailing list can be read in this thread:
https://mail-index.netbsd.org/current-users/2022/10/12/msg043076.html
but I'll summarize the issue I see in zsh here.

I have a shell function I've been using for ages:

dir() { ls -al "$@" | less; }

Recently, when I tried suspending this with CTRL-Z and then resuming
it with 'fg', I get:

$ dir
(CTRL-Z)
zsh: done       ls -al "$@" | 
zsh: suspended  
$ fg
[1]  + done       ls -al "$@" | 
       continued  
zsh: done                    ls -al "$@" | 
zsh: suspended (tty output)  
zsh: done                    ls -al "$@" | 
zsh: suspended (tty output)  

The same thing works in NetBSD's ksh:

$ fg
ls -al "$@" | less 
(CTRL-Z)
[1] + Done                 ls -al "$@" |
      Stopped              less 

or in bash

$ fg
ls -al "$@" | less
(CTRL-Z)

[1]+  Stopped                 ls -al "$@" | less

If I use '/bin/ls' in the shell function instead of 'ls', it works
fine.

Any ideas what the issue could be?

Cheers,
 Thomas

(I'll try to send this without subscribing first, since this seems to
be the address intended for bug reports.)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-17  9:17 problem with 'ls | less' shell function Thomas Klausner
@ 2022-10-17 14:50 ` Mikael Magnusson
  2022-10-17 15:36   ` Thomas Klausner
  2022-10-19  9:33 ` Jun T
  1 sibling, 1 reply; 28+ messages in thread
From: Mikael Magnusson @ 2022-10-17 14:50 UTC (permalink / raw)
  To: Thomas Klausner; +Cc: zsh-workers

On 10/17/22, Thomas Klausner <wiz@gatalith.at> wrote:
> Hi!
>
> I recently noticed a problem in zsh 5.9 (as built from pkgsrc) on
> NetBSD 9.99.100. Since I didn't notice it before it could be related
> to a change in NetBSD (I'm following the latest version), but I've
> been told that the issue can be reproduced on Ubuntu 19.04 and FreeBSD
> 13.1 too; but not in zsh 5.8.1, nor in most other shells though.
>
> The discussion on the NetBSD mailing list can be read in this thread:
> https://mail-index.netbsd.org/current-users/2022/10/12/msg043076.html
> but I'll summarize the issue I see in zsh here.
>
> I have a shell function I've been using for ages:
>
> dir() { ls -al "$@" | less; }
>
> Recently, when I tried suspending this with CTRL-Z and then resuming
> it with 'fg', I get:
>
> $ dir
> (CTRL-Z)
> zsh: done       ls -al "$@" |
> zsh: suspended
> $ fg
> [1]  + done       ls -al "$@" |
>        continued
> zsh: done                    ls -al "$@" |
> zsh: suspended (tty output)
> zsh: done                    ls -al "$@" |
> zsh: suspended (tty output)
>
> The same thing works in NetBSD's ksh:
>
> $ fg
> ls -al "$@" | less
> (CTRL-Z)
> [1] + Done                 ls -al "$@" |
>       Stopped              less
>
> or in bash
>
> $ fg
> ls -al "$@" | less
> (CTRL-Z)
>
> [1]+  Stopped                 ls -al "$@" | less
>
> If I use '/bin/ls' in the shell function instead of 'ls', it works
> fine.
>
> Any ideas what the issue could be?

The last bit implies that 'ls' is an alias or function, can you check
the output of 'which ls'?

-- 
Mikael Magnusson


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-17 14:50 ` Mikael Magnusson
@ 2022-10-17 15:36   ` Thomas Klausner
  2022-10-19  8:26     ` zsugabubus
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Klausner @ 2022-10-17 15:36 UTC (permalink / raw)
  To: Mikael Magnusson; +Cc: zsh-workers

On Mon, Oct 17, 2022 at 04:50:20PM +0200, Mikael Magnusson wrote:
> On 10/17/22, Thomas Klausner <wiz@gatalith.at> wrote:
> > Hi!
> >
> > I recently noticed a problem in zsh 5.9 (as built from pkgsrc) on
> > NetBSD 9.99.100. Since I didn't notice it before it could be related
> > to a change in NetBSD (I'm following the latest version), but I've
> > been told that the issue can be reproduced on Ubuntu 19.04 and FreeBSD
> > 13.1 too; but not in zsh 5.8.1, nor in most other shells though.
> >
> > The discussion on the NetBSD mailing list can be read in this thread:
> > https://mail-index.netbsd.org/current-users/2022/10/12/msg043076.html
> > but I'll summarize the issue I see in zsh here.
> >
> > I have a shell function I've been using for ages:
> >
> > dir() { ls -al "$@" | less; }
> >
> > Recently, when I tried suspending this with CTRL-Z and then resuming
> > it with 'fg', I get:
> >
> > $ dir
> > (CTRL-Z)
> > zsh: done       ls -al "$@" |
> > zsh: suspended
> > $ fg
> > [1]  + done       ls -al "$@" |
> >        continued
> > zsh: done                    ls -al "$@" |
> > zsh: suspended (tty output)
> > zsh: done                    ls -al "$@" |
> > zsh: suspended (tty output)
> >
> > The same thing works in NetBSD's ksh:
> >
> > $ fg
> > ls -al "$@" | less
> > (CTRL-Z)
> > [1] + Done                 ls -al "$@" |
> >       Stopped              less
> >
> > or in bash
> >
> > $ fg
> > ls -al "$@" | less
> > (CTRL-Z)
> >
> > [1]+  Stopped                 ls -al "$@" | less
> >
> > If I use '/bin/ls' in the shell function instead of 'ls', it works
> > fine.
> >
> > Any ideas what the issue could be?
> 
> The last bit implies that 'ls' is an alias or function, can you check
> the output of 'which ls'?

No, it isn't:

$ which ls
/bin/ls
$ type ls
ls is /bin/ls

Compare to

$ type ll
ll is an alias for ls -al
$ type dir
dir is a shell function from /home/wiz/.zshrc

But I take back the statement that it works with /bin/ls in the dir
function definition, it shows the same broken behaviour. Not sure why
I thought it worked before, sorry for that red herring.
 Thomas


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-17 15:36   ` Thomas Klausner
@ 2022-10-19  8:26     ` zsugabubus
  2022-10-19 10:04       ` Jun T
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: zsugabubus @ 2022-10-19  8:26 UTC (permalink / raw)
  To: Thomas Klausner; +Cc: zsh-workers

On Mon, Oct 17, 2022 at 05:36:20PM +0200, Thomas Klausner wrote:
> On Mon, Oct 17, 2022 at 04:50:20PM +0200, Mikael Magnusson wrote:
> > On 10/17/22, Thomas Klausner <wiz@gatalith.at> wrote:
> > > Hi!
> > >
> > > I recently noticed a problem in zsh 5.9 (as built from pkgsrc) on
> > > NetBSD 9.99.100. Since I didn't notice it before it could be related
> > > to a change in NetBSD (I'm following the latest version), but I've
> > > been told that the issue can be reproduced on Ubuntu 19.04 and FreeBSD
> > > 13.1 too; but not in zsh 5.8.1, nor in most other shells though.
> > >
> > > The discussion on the NetBSD mailing list can be read in this thread:
> > > https://mail-index.netbsd.org/current-users/2022/10/12/msg043076.html
> > > but I'll summarize the issue I see in zsh here.
> > >
> > > I have a shell function I've been using for ages:
> > >
> > > dir() { ls -al "$@" | less; }
> > >
> > > Recently, when I tried suspending this with CTRL-Z and then resuming
> > > it with 'fg', I get:
> > >
> > > $ dir
> > > (CTRL-Z)
> > > zsh: done       ls -al "$@" |
> > > zsh: suspended
> > > $ fg
> > > [1]  + done       ls -al "$@" |
> > >        continued
> > > zsh: done                    ls -al "$@" |
> > > zsh: suspended (tty output)
> > > zsh: done                    ls -al "$@" |
> > > zsh: suspended (tty output)
> > >
> > > The same thing works in NetBSD's ksh:
> > >
> > > $ fg
> > > ls -al "$@" | less
> > > (CTRL-Z)
> > > [1] + Done                 ls -al "$@" |
> > >       Stopped              less
> > >
> > > or in bash
> > >
> > > $ fg
> > > ls -al "$@" | less
> > > (CTRL-Z)
> > >
> > > [1]+  Stopped                 ls -al "$@" | less
> > >
> > > If I use '/bin/ls' in the shell function instead of 'ls', it works
> > > fine.
> > >
> > > Any ideas what the issue could be?
> > 
> > The last bit implies that 'ls' is an alias or function, can you check
> > the output of 'which ls'?
> 
> No, it isn't:
> 
> $ which ls
> /bin/ls
> $ type ls
> ls is /bin/ls
> 
> Compare to
> 
> $ type ll
> ll is an alias for ls -al
> $ type dir
> dir is a shell function from /home/wiz/.zshrc
> 
> But I take back the statement that it works with /bin/ls in the dir
> function definition, it shows the same broken behaviour. Not sure why
> I thought it worked before, sorry for that red herring.
>  Thomas
> 

I think what happens is that zsh fails to correctly set the foreground
process group in `fg`. `less` is not in the foreground pgrp that's why
it immediately gets suspended by SIGTTIN after it receives SIGCONT.

In this example it is more obvious that `sleep` is not in the
foreground pgrp (confirm with ps):

$ f(){echo|sleep 60}
$ f
^Z
$ fg
^Z^Z^C^C^C^C^C^\

15bf8ace168a86d0fae90b10e9f706baddd4c0bf is the first bad commit
50134: Tweak process group handling to prevent unkillable pipelines

-- 
zsugabubus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-17  9:17 problem with 'ls | less' shell function Thomas Klausner
  2022-10-17 14:50 ` Mikael Magnusson
@ 2022-10-19  9:33 ` Jun T
  2022-10-19 10:01   ` Jun T
  1 sibling, 1 reply; 28+ messages in thread
From: Jun T @ 2022-10-19  9:33 UTC (permalink / raw)
  To: zsh-workers; +Cc: Thomas Klausner

Is there anyone here who can reproduce this problem?

> 2022/10/17 18:17, Thomas Klausner <wiz@gatalith.at> wrote:
> 
> I recently noticed a problem in zsh 5.9 (as built from pkgsrc) on
> NetBSD 9.99.100. Since I didn't notice it before it could be related
> to a change in NetBSD (I'm following the latest version),

I can't (and will not be able to) test on NetBSD 9.99.
I tested on NetBSD 9.3, but couldn't reproduce the problem.

> but I've
> been told that the issue can be reproduced on Ubuntu 19.04 and FreeBSD
> 13.1 too;

I also tested on FreeBSD 13.1, but ^Z/fg worked fine.
I don't have Ubuntu-19.04, but any of LTS (18.04/20.04/22.04) works fine.
Does the problem occur every time you suspend a pipeline, or just
occasionally? (I tested only a few times)

I'm testing on virtual machines for which I allocate only two CPU cores.
Usually two cores are enough to reproduce this kind of problems (probably
related with a kind of race conditions), but haven't tested with more cores.

> but not in zsh 5.8.1,

Can you try 'git bisect'?
Or at least checkout some specific commits (the one just before the
suspicious commits) and test them?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-19  9:33 ` Jun T
@ 2022-10-19 10:01   ` Jun T
  0 siblings, 0 replies; 28+ messages in thread
From: Jun T @ 2022-10-19 10:01 UTC (permalink / raw)
  To: zsh-workers; +Cc: Thomas Klausner

Sorry, please disregard my following post;

> 2022/10/19 18:33, Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
> Is there anyone here who can reproduce this problem?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-19  8:26     ` zsugabubus
@ 2022-10-19 10:04       ` Jun T
  2022-10-19 13:36         ` Jun. T
  2022-10-20  0:10       ` Bart Schaefer
  2022-10-21  5:45       ` Jun T
  2 siblings, 1 reply; 28+ messages in thread
From: Jun T @ 2022-10-19 10:04 UTC (permalink / raw)
  To: zsh-workers; +Cc: Thomas Klausner

Sorry, I didn't notice this post:

> 2022/10/19 17:26, zsugabubus@national.shitposting.agency wrote:
> 
> $ f(){echo|sleep 60}
> $ f
> ^Z
> $ fg
> ^Z^Z^C^C^C^C^C^\

Thnaks.
Now I understand that we need to use function to reproduce the problem.
% ls | less      # works fine
but
% dir () { ls | less }
% dir     # this has the problem

> 15bf8ace168a86d0fae90b10e9f706baddd4c0bf is the first bad commit
> 50134: Tweak process group handling to prevent unkillable pipelines

Thanks again.

> I think what happens is that zsh fails to correctly set the foreground
> process group in `fg`. `less` is not in the foreground pgrp that's why
> it immediately gets suspended by SIGTTIN after it receives SIGCONT.


Yes, and I guess it is related with the code in wait_for_processes(),
signals.c, lines 540-557.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-19 10:04       ` Jun T
@ 2022-10-19 13:36         ` Jun. T
  0 siblings, 0 replies; 28+ messages in thread
From: Jun. T @ 2022-10-19 13:36 UTC (permalink / raw)
  To: zsh-workers


> 2022/10/19 19:04, Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
> Yes, and I guess it is related with the code in wait_for_processes(),
> signals.c, lines 540-557.

No, sorry again. This is not related.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-19  8:26     ` zsugabubus
  2022-10-19 10:04       ` Jun T
@ 2022-10-20  0:10       ` Bart Schaefer
  2022-10-20 16:26         ` Peter Stephenson
  2022-10-21  5:45       ` Jun T
  2 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-10-20  0:10 UTC (permalink / raw)
  To: zsh-workers

On Wed, Oct 19, 2022 at 1:28 AM <zsugabubus@national.shitposting.agency> wrote:
>
> I think what happens is that zsh fails to correctly set the foreground
> process group in `fg`. `less` is not in the foreground pgrp that's why
> it immediately gets suspended by SIGTTIN after it receives SIGCONT.

First sentence partly wrong, second sentence right.

When a shell function is suspended with ^Z, the parent shell forks
again to create a process for the shell function, but doesn't change
the process group of the pipeline.  When "fg" brings the shell
function process back, that process is made group leader, but it then
needs to reset the group leader again to the pid of its own foreground
job, which it does not ... or, conversely, upon suspend the parent
shell needs to avoid changing the group leader of the new process.

At this point I'm not sure which of those needs to happen or where.
It's all in or around the call to addproc() at exec.c line 1734.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-20  0:10       ` Bart Schaefer
@ 2022-10-20 16:26         ` Peter Stephenson
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Stephenson @ 2022-10-20 16:26 UTC (permalink / raw)
  To: zsh-workers


> On 20/10/2022 01:10 Bart Schaefer <schaefer@brasslantern.com> wrote:
> 
>  
> On Wed, Oct 19, 2022 at 1:28 AM <zsugabubus@national.shitposting.agency> wrote:
> >
> > I think what happens is that zsh fails to correctly set the foreground
> > process group in `fg`. `less` is not in the foreground pgrp that's why
> > it immediately gets suspended by SIGTTIN after it receives SIGCONT.
> 
> First sentence partly wrong, second sentence right.
> 
> When a shell function is suspended with ^Z, the parent shell forks
> again to create a process for the shell function, but doesn't change
> the process group of the pipeline.  When "fg" brings the shell
> function process back, that process is made group leader, but it then
> needs to reset the group leader again to the pid of its own foreground
> job, which it does not ... or, conversely, upon suspend the parent
> shell needs to avoid changing the group leader of the new process.
> 
> At this point I'm not sure which of those needs to happen or where.
> It's all in or around the call to addproc() at exec.c line 1734.

I'd be quite surprised if you didn't need to be looking at
this comment at line 1872.  This isn't necessarily the same thing
but must surely be related...

			/*
			 * At this point, we used to attach this process
			 * to the process group of list_pipe_job (the
			 * new superjob) any time that was still available.
			 * That caused problems in at least two
			 * cases because this forked shell was then
			 * suspended with the right hand side of the
			 * pipeline, and the SIGSTOP below suspended
			 * it a second time when it was continued.
			 *
			 * It's therefore not clear entirely why you'd ever
			 * do anything other than the following, but no
			 * doubt we'll find out...
			 */

pws


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-19  8:26     ` zsugabubus
  2022-10-19 10:04       ` Jun T
  2022-10-20  0:10       ` Bart Schaefer
@ 2022-10-21  5:45       ` Jun T
  2022-10-21  7:40         ` Jun T
  2 siblings, 1 reply; 28+ messages in thread
From: Jun T @ 2022-10-21  5:45 UTC (permalink / raw)
  To: zsh-workers


> 2022/10/19 17:26, zsugabubus@national.shitposting.agency wrote:
> 
> 15bf8ace168a86d0fae90b10e9f706baddd4c0bf is the first bad commit
> 50134: Tweak process group handling to prevent unkillable pipelines

This is for solving the problem that pipelines like the following
can't be interrupted by ^C:

zsh% { /bin/sleep 10 ; /bin/sleep 11; echo done 1; } |
 { /bin/sleep 12 ; /bin/sleep 13; echo done 2; }

After the commit ^C works. But if we suspend/resume it by ^Z/fg,
the pipeline never finishes. We get the output 'done 2', but
not 'done 1', nor the next prompt (^C works). After we get 'done 2',
if we run 'ps -Oppid,pgid,tpgid' in another terminal:

    PID    PPID    PGID   TPGID S TTY          TIME COMMAND
   2447    ....    2447    3644 S pts/0    00:00:00 zsh
   3644    2447    3644    3644 R pts/0    00:00:37 zsh
   3645    3644    3644    3644 Z pts/0    00:00:00 [sleep] <defunct>

2447 is the main zsh, 3644 is the subshell for the left hand side
of the pipeline, and 3645 is 'sleep 10'.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-21  5:45       ` Jun T
@ 2022-10-21  7:40         ` Jun T
  2022-10-21 21:22           ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Jun T @ 2022-10-21  7:40 UTC (permalink / raw)
  To: zsh-workers


> 2022/10/21 14:45, Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> 
> 2447 is the main zsh, 3644 is the subshell for the left hand side
> of the pipeline, and 3645 is 'sleep 10'.

And the subshell is using 100% of a CPU core, in the loop
            for (; !nowait;) {        

in function execpline(), mostly lines 1779-1797 in exec.c.
The subshell have missed the SIGCHLD?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-21  7:40         ` Jun T
@ 2022-10-21 21:22           ` Bart Schaefer
  2022-10-21 21:24             ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-10-21 21:22 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Fri, Oct 21, 2022 at 12:41 AM Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
>
> The subshell have missed the SIGCHLD?

The subshell will never get the SIGCHLD, because it is a sibling of
the pipeline, not its parent.  It's up to the original shell to assign
the correct process group, which means the new subshell for the
function has to be in the same process group as the foreground job at
the time the STOP/TSTP was received.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-21 21:22           ` Bart Schaefer
@ 2022-10-21 21:24             ` Bart Schaefer
  2022-10-22 23:22               ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-10-21 21:24 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Fri, Oct 21, 2022 at 2:22 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> which means the new subshell for the
> function has to be in the same process group as the foreground job at
> the time the STOP/TSTP was received.

Or more accurately ... the parent needs to assign the tty pgroup to
the foreground job, and the forked child has to wait for that process
to exit and then grab the pgroup (or be assigned it by the parent).
There's a lot of juggling here.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-21 21:24             ` Bart Schaefer
@ 2022-10-22 23:22               ` Bart Schaefer
  2022-10-22 23:43                 ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-10-22 23:22 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

OK, perhaps we're getting somewhere.  Still working with suspend of
dir() { ls -lrRt | less }

In bin_fg(), line 2564, we
attachtty(jobtab[jobtab[job].other].gleader)) which is using the
correct PID.  We then enter waitjobs() which calls waitonejob() for
that PID (so far so good).  However, that returns immediately, because
at that point the pipeline gets SIGTTOU -- which is confusing, because
it should be attached to the tty.  In any case, waitjobs() then falls
through to waiting for the function subshell, which is the also the
right thing.

It's a bit hard to follow here because sometimes the "other" field of
the jobtab struct is an offset into jobtab, and sometimes its an
actual PID, but the parent shell's internal record-keeping seems to be
correct.  However, it must be the case that something is sending
SIGCONT to "less" before the attachtty() has a chance to ... actually
change the process group?  If I stop at the point where the SIGCONT is
sent, and call getprp() from the debugger, the process group still
belongs to the parent shell despite the apparently successful call to
attachtty().


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-22 23:22               ` Bart Schaefer
@ 2022-10-22 23:43                 ` Bart Schaefer
  2022-11-03 23:10                   ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-10-22 23:43 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Sat, Oct 22, 2022 at 4:22 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> However, it must be the case that something is sending
> SIGCONT to "less" before the attachtty() has a chance to ... actually
> change the process group?  If I stop at the point where the SIGCONT is
> sent, and call getprp() from the debugger, the process group still
> belongs to the parent shell despite the apparently successful call to
> attachtty().

If I instrument attachtty():
[1]  + continued  ls -lrRt |
zsh: ep = 0 and last_attached_pgrp = 276698 in 276698
zsh: last_attached_pgrp now 276716 in 276698, gettygrp = 276716
zsh: suspended (tty output)  ls -lrRt |
zsh: suspended (tty output)  ls -lrRt |

So as far as zsh knows, the attachtty() was successful, but if the
debugger is telling the truth, something is resetting the tty pgrp
again without attachtty() being called.

I'm out of time to chase this today.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: problem with 'ls | less' shell function
  2022-10-22 23:43                 ` Bart Schaefer
@ 2022-11-03 23:10                   ` Bart Schaefer
  2022-11-04  6:09                     ` [PATCH] " Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-11-03 23:10 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Sat, Oct 22, 2022 at 4:43 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> So as far as zsh knows, the attachtty() was successful, but if the
> debugger is telling the truth, something is resetting the tty pgrp
> again without attachtty() being called.

A little further ... I instrumented jobs.c:makerunning() --

% dir() { ls -lrRt | less }
% dir
^Z
% fg
makerunning: PID 79731 job 1 jn->stat: 9427
makerunning: PID 79731 job 2 jn->stat: 370
[1]  + continued  ls -lrRt |
makerunning: PID 79731 job 1 jn->stat: 9425
makerunning: PID 79731 job 2 jn->stat: 368
zsh: suspended (tty output)  ls -lrRt |
zsh: suspended (tty output)  ls -lrRt |

With an anonymous function instead of a named one, this doesn't happen
and "fg" works:

% () { ls -lrRt | less }
^Z
% fg
makerunning: PID 79731 job 3 jn->stat: 82
[3]  - continued  ls -lrRt |
makerunning: PID 79731 job 3 jn->stat: 80

9427 is CHANGED|STOPPED|LOCKED|INUSE|SUPERJOB|CURSH|SUBLEADER
370 is STOPPED|LOCKED|NOPRINT|INUSE|SUBJOB
82 is STOPPED|LOCKED|INUSE

So we can remove at least one makerunning(), I think from bin_fg().
But I then instrumented update_job() where "continued..." is printed,
and:

% fg
[1]  + suspended  ls -lrRt |
makerunning: PID 687 job 1 jn->stat: 9427
makerunning: PID 687 job 2 jn->stat: 370
zsh: wait_for_processes after SIGCHLD
zsh: update_job signals.c L559
zsh: update_job signals.c L559
update_job: stopping gleader
zsh: suspended (tty output)  ls -lrRt |
zsh: suspended (tty output)  ls -lrRt |

Aha.  So there's a race condition of sorts, just one that we always
lose.  We send a SIGCONT to the process group for the forked job,
which wakes up both the subjob and the superjob, but when we get the
SIGCHLD for the subjob we decide that its superjob is running not
because we just restarted it, but because it is still waiting to be
stopped from the previous SIGTSTP.  I haven't worked out if this means
there's an order of signal delivery issue or just that we botched some
internal state.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] Re: problem with 'ls | less' shell function
  2022-11-03 23:10                   ` Bart Schaefer
@ 2022-11-04  6:09                     ` Bart Schaefer
  2022-11-04 15:10                       ` [PATCH] " Jun T
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-11-04  6:09 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 4001 bytes --]

On Thu, Nov 3, 2022 at 4:10 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> We send a SIGCONT to the process group for the forked job,
> which wakes up both the subjob and the superjob, but when we get the
> SIGCHLD for the subjob we decide that its superjob is running not
> because we just restarted it, but because it is still waiting to be
> stopped from the previous SIGTSTP.

I find we have to take this all the way back to workers/50105 and the
"unkillable pipeline".  The earlier example given was this:

{ /bin/sleep 10 ; /bin/sleep 20; } | { /bin/sleep 30 ; /bin/sleep 40; }

I "fixed" this with workers/50134 by avoiding a reset of the process
group leader when the job is not in the current shell, but it turns
out that's not actually the problem.  In the above example, two
subshells are forked for either side of the pipe and "/bin/sleep 30"
becomes the foreground job.  Upon interrupting this, "/bin/sleep 40"
begins to execute and the right-hand subshell immediately exits,
because only the exit value of that sleep matters to the parent.  The
patch in 50134 makes "/bin/sleep 40" its own group leader in that
case, allowing further keyboard-generated signals to reach it.
Without 50134, the sleep effectively becomes an orphan; the parent is
waiting, but for the wrong job, and the job that is running is immune
to signals.

There are two problems with this.  First, arguably the "sleep 40"
should not start at all, because the pipeline was interrupted.  (This
is in fact what happens in "sh" emulation thanks to workers/47794.)
Second, in the "ls | less" function example from this thread, upon ^Z
it's wrong to set the group leader to the newly-forked subshell
because that results in the aforementioned race condition -- upon
"fg", "ls" keeps getting stopped before "less" can get started, which
causes the "oops, subjob is still stopped, better stop the superjob"
confusion.

I believe the reason interrupting "sleep 30" doesn't abort the whole
pipeline is because the { ... } expression is in an interactive shell,
so the two sleeps get interpreted as separately "jobbable".  Running
the whole thing in a non-interactive shell (zsh -fs) works fine.  In
fact

% { /bin/sleep 10 ; /bin/sleep 20; } | { /bin/sleep 30 ; /bin/sleep 40; }
^Z
% fg
^C
%

also works fine and does end the pipeline, but if you "fg" without
immediate ^C, once "sleep 40" starts running we're right back in the
invulnerable orphan state.

There are two fixes needed here.  One, never set the group leader to a
process that's about to exit.  Sadly, as far as I can tell, the parent
doesn't know whether that's going to happen, and attempting to check
with kill(gleader, 0) or similar just leads to a different race
condition.  Two, stop the whole { ... } construct if any job within it
gets interrupted.  In both cases 50134 should then be reverted.

The attached patch seems to satisfactorily implement the second fix.

There's an anecdote I may have mentioned before wherein a mechanic is
called upon to fix a piece of machinery.  When the job is finished he
presents a bill for $1000.  "What was the problem?"  "Oh, a bolt had
come out, I just replaced it."  "What?  $1000 for a single bolt?"
"Sorry, let me correct that."  The mechanic takes back his invoice,
scribbles a few words, and hands it back.  It now reads:  "New bolt:
$1.  Finding out where to put it: $999"

This patch is a $1000 bolt.  Some no-op whitespace and comments included.

Unfortunately, that leaves us with  another example in 50105 still unfixed:

( zsh -fc 'print a few words; /bin/sleep 10' ) | { head -n 1 }

Here we end up with the current shell as the foreground job because
"head" has exited, and as an interactive shell it is ignoring signals.
It needs to bring the left-hand-side into the foreground so it can be
killed.  I haven't tracked that down (nor figured out why 50134 works
around it, except to believe it's two bugs canceling each other).

Also, yes, I'm on vacation, and it's been raining all day here.

[-- Attachment #2: pipejobs.txt --]
[-- Type: text/plain, Size: 1644 bytes --]

diff --git a/Src/jobs.c b/Src/jobs.c
index 707374297..2c4f41547 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -566,6 +566,12 @@ update_job(Job jn)
 		     * when the job is finally deleted.
 		     */
 		    jn->stat |= STAT_ATTACH;
+		    /*
+		     * If we're in shell jobs on the right side of a pipeline
+		     * we should treat it like a job in the current shell.
+		     */
+		    if (inforeground == 2)
+			inforeground = 1;
 		}
 		/* If we have `foo|while true; (( x++ )); done', and hit
 		 * ^C, we have to stop the loop, too. */
@@ -1488,10 +1494,7 @@ addproc(pid_t pid, char *text, int aux, struct timeval *bgtime,
 	 * set it for that, too.
 	 */
 	if (gleader != -1) {
-	    if (jobtab[thisjob].stat & STAT_CURSH)
-		jobtab[thisjob].gleader = gleader;
-	    else
-		jobtab[thisjob].gleader = pid;
+	    jobtab[thisjob].gleader = gleader;
 	    if (list_pipe_job_used != -1)
 		jobtab[list_pipe_job_used].gleader = gleader;
 	    /*
@@ -1500,7 +1503,7 @@ addproc(pid_t pid, char *text, int aux, struct timeval *bgtime,
 	     */
 	    last_attached_pgrp = gleader;
 	} else if (!jobtab[thisjob].gleader)
-		jobtab[thisjob].gleader = pid;
+	    jobtab[thisjob].gleader = pid;
 	/* attach this process to end of process list of current job */
 	pnlist = &jobtab[thisjob].procs;
     }
@@ -2506,6 +2509,7 @@ bin_fg(char *name, char **argv, Options ops, int func)
 		jobtab[job].stat &= ~STAT_CURSH;
 	    }
 	    if ((stopped = (jobtab[job].stat & STAT_STOPPED))) {
+		/* WIFCONTINUED will makerunning() again at killjb() */
 		makerunning(jobtab + job);
 		if (func == BIN_BG) {
 		    /* Set $! to indicate this was backgrounded */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-04  6:09                     ` [PATCH] " Bart Schaefer
@ 2022-11-04 15:10                       ` Jun T
  2022-11-05  0:09                         ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Jun T @ 2022-11-04 15:10 UTC (permalink / raw)
  To: zsh-workers


> 2022/11/04 15:09, Bart Schaefer <schaefer@brasslantern.com> wrote:
> 
> Unfortunately, that leaves us with  another example in 50105 still unfixed:
> 
> ( zsh -fc 'print a few words; /bin/sleep 10' ) | { head -n 1 }

The main zsh (zsh0, pid=pgid=100) forks two times;
the 1st one exec 'zsh -fc' (zsh1, pid=101), the 2nd one 'head' (pid=102).
Both are in the process group pgid=101, and the group becomes foreground.
But it seems these two are in different jobs (because of { head } ?).
Then zsh1 exec 'sleep 100' (without forking). Now 'sleep 100' has pid=101
and pgid=101.
When 'head' exits, zsh0 calls update_job(). In this function:

548         /* is this job in the foreground of an interactive shell? */
549         if (mypgrp != pgrp && inforeground &&
550             (jn->gleader == pgrp ||
551              (pgrp > 1 &&                             
552               (kill(-pgrp, 0) == -1 && errno == ESRCH)))) {

mypgrp=100, pgrp=101, jn->gleader=101, and the if(...) holds, and the line

568                     jn->stat |= STAT_ATTACH;    

is executed. Since { head } is a separate job and head has exited, the
job will be removed, and attachtty(mypgrp=100) will be called when the
job is removed. Then pgid=101 (=sleep 100) lose tty. 

But why just "jn->gleader == pgrp" is enough to assume that the foreground
job has finished? If the line 550 above is replaced by

550             (

then ^C works for
[1] ( zsh -fc 'print a few words; /bin/sleep 10' ) | { head -n 1 }

But now we need two ^C's to kill
[2] { sleep 10; sleep 20; } | { sleep 30; sleep 40; }

Another code that looks suspicious to me is (also in jobs.c)

478         if (pn->pid == jn->gleader)    /* if this process is process group leader */
479             status = pn->status;
480     }

The 'status' is later used at (line numbers after the patch pipejobs.txt):

641     if (inforeground == 2 && isset(MONITOR) && WIFSIGNALED(status)) {

When 'sleep 30' is killed, pn->pid=(pid of sleep 30), but it is not equal
to jn->gleader (= probably 'sleep 10'?). Then if() at line 641 does not hold,
and 'sleep 40' will be started.

If the line 478 is replaced by

478         if (WIFSIGNALED(pn->status) || pn->pid == jn->gleader)

then [2] can be killed by a single ^C. But I guess it will have bad side
effects. And ^Z/fg (still) does not work for [2].

Sorry, I have no time now to investigate further.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-04 15:10                       ` [PATCH] " Jun T
@ 2022-11-05  0:09                         ` Bart Schaefer
  2022-11-06 19:12                           ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-11-05  0:09 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 1267 bytes --]

On Fri, Nov 4, 2022 at 8:12 AM Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
>
> 548         /* is this job in the foreground of an interactive shell? */
> 549         if (mypgrp != pgrp && inforeground &&
> 550             (jn->gleader == pgrp ||
> 551              (pgrp > 1 &&
> 552               (kill(-pgrp, 0) == -1 && errno == ESRCH)))) {
>
> But why just "jn->gleader == pgrp" is enough to assume that the foreground
> job has finished?
> [...]
> Another code that looks suspicious to me is (also in jobs.c)
>
> 478         if (pn->pid == jn->gleader)    /* if this process is process group leader */
> 479             status = pn->status;
> 480     }
>
> [...]
> If the line 478 is replaced by
>
> 478         if (WIFSIGNALED(pn->status) || pn->pid == jn->gleader)
>
> then [2] can be killed by a single ^C.

Thank you!  This is the clue I needed.

On line 476 is

476             signalled = WIFSIGNALED(pn->status);

If we change line 550 to be

550             ((jn->gleader == pgrp && signalled) ||

then all three examples start working.

Second patch attached, replaces the one from workers/50862.  Tests
pass, which is not surprising since all of this depends on signaling
interactive shells, but I have no idea how to write a regression test
for that.

[-- Attachment #2: pipejobs2.txt --]
[-- Type: text/plain, Size: 2422 bytes --]

diff --git a/Src/jobs.c b/Src/jobs.c
index 707374297..76c762ee5 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -544,16 +544,14 @@ update_job(Job jn)
 
     if (isset(MONITOR)) {
 	pid_t pgrp = gettygrp();           /* get process group of tty      */
+	int deadpgrp = (mypgrp != pgrp && inforeground && pgrp > 1 &&
+			kill(-pgrp, 0) == -1 && errno == ESRCH);
 
 	/* is this job in the foreground of an interactive shell? */
 	if (mypgrp != pgrp && inforeground &&
-	    (jn->gleader == pgrp ||
-	     (pgrp > 1 &&
-	      (kill(-pgrp, 0) == -1 && errno == ESRCH)))) {
+	    ((jn->gleader == pgrp && signalled) || deadpgrp)) {
 	    if (list_pipe) {
-		if (somestopped || (pgrp > 1 &&
-				    kill(-pgrp, 0) == -1 &&
-				    errno == ESRCH)) {
+		if (somestopped || deadpgrp) {
 		    attachtty(mypgrp);
 		    /* check window size and adjust if necessary */
 		    adjustwinsize(0);
@@ -566,6 +564,12 @@ update_job(Job jn)
 		     * when the job is finally deleted.
 		     */
 		    jn->stat |= STAT_ATTACH;
+		    /*
+		     * If we're in shell jobs on the right side of a pipeline
+		     * we should treat it like a job in the current shell.
+		     */
+		    if (inforeground == 2)
+			inforeground = 1;
 		}
 		/* If we have `foo|while true; (( x++ )); done', and hit
 		 * ^C, we have to stop the loop, too. */
@@ -1488,10 +1492,7 @@ addproc(pid_t pid, char *text, int aux, struct timeval *bgtime,
 	 * set it for that, too.
 	 */
 	if (gleader != -1) {
-	    if (jobtab[thisjob].stat & STAT_CURSH)
-		jobtab[thisjob].gleader = gleader;
-	    else
-		jobtab[thisjob].gleader = pid;
+	    jobtab[thisjob].gleader = gleader;
 	    if (list_pipe_job_used != -1)
 		jobtab[list_pipe_job_used].gleader = gleader;
 	    /*
@@ -1500,7 +1501,7 @@ addproc(pid_t pid, char *text, int aux, struct timeval *bgtime,
 	     */
 	    last_attached_pgrp = gleader;
 	} else if (!jobtab[thisjob].gleader)
-		jobtab[thisjob].gleader = pid;
+	    jobtab[thisjob].gleader = pid;
 	/* attach this process to end of process list of current job */
 	pnlist = &jobtab[thisjob].procs;
     }
@@ -2506,6 +2507,7 @@ bin_fg(char *name, char **argv, Options ops, int func)
 		jobtab[job].stat &= ~STAT_CURSH;
 	    }
 	    if ((stopped = (jobtab[job].stat & STAT_STOPPED))) {
+		/* WIFCONTINUED will makerunning() again at killjb() */
 		makerunning(jobtab + job);
 		if (func == BIN_BG) {
 		    /* Set $! to indicate this was backgrounded */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-05  0:09                         ` Bart Schaefer
@ 2022-11-06 19:12                           ` Bart Schaefer
  2022-11-07  8:43                             ` Jun T
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-11-06 19:12 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Fri, Nov 4, 2022 at 5:09 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> Second patch attached, replaces the one from workers/50862.

This "works for me" so I'm going to push it to encourage wider testing.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-06 19:12                           ` Bart Schaefer
@ 2022-11-07  8:43                             ` Jun T
  2022-11-07 19:33                               ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Jun T @ 2022-11-07  8:43 UTC (permalink / raw)
  To: zsh-workers


> 2022/11/07 4:12, Bart Schaefer <schaefer@brasslantern.com> wrote:
> 
> On Fri, Nov 4, 2022 at 5:09 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>> 
>> Second patch attached, replaces the one from workers/50862.
> 
> This "works for me" so I'm going to push it to encourage wider testing.

[1] ( zsh -fc 'print a few words; /bin/sleep 10'; )
[2] { sleep 10; sleep 11; } | { sleep 20; sleep 21; }
[3] dir () { ls | less; }; dir

Now both ^C and ^Z/fg fork for [1], ^C works for [2],
and ^Z/fg works for [3] ({3] can't be killed by ^C but it's OK).

But ^Z/fg still doesn't work for [2]; the pipeline never finishes.
This was not working in zsh-5.8.1, zsh-5.9 and zsh-5.9.1-test (before
worker/50874), so this is not a regression.


worker/50803⁩
> 2022/10/22 6:22, Bart Schaefer <schaefer@brasslantern.com> wrote:
> 
> On Fri, Oct 21, 2022 at 12:41 AM Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
>> 
>> The subshell have missed the SIGCHLD?
> 
> The subshell will never get the SIGCHLD, because it is a sibling of
> the pipeline, not its parent.

Do you mean the subshell for the right hand side of the pipe?

In the case of [2], the main zsh (zsh0) always fork() a subshell (zsh1)
for the left hand side of the pipe, and zsh1 fork/exec 'sleep 10'.
By hitting ^Z, zsh0 fork() another subshell (zsh2) for the right hand
side of the pipe, but it seems zsh2 is not causing the problem.
After 'fg', zsh1 (not zsh2) is using 100% of a CPU core, and
when 'sleep 10' exits it is left as "defunct" (not waited for).

Strace output suggests that zsh1 does not receive SIGCHLD when
'sleep 10' finishes. Maybe SIGCHLD is blocked?? (I'm not sure).

zsh1 is repeatedly calling hasprocs(list_pid_job) at line 1790 in
exec.c:

1789                 if (!updated &&
1790                     list_pipe_job && hasprocs(list_pipe_job) &&
1791                     !(jobtab[list_pipe_job].stat & STAT_STOPPED)) {
1792                     int q = queue_signal_level();
1793                     child_unblock();
1794                     child_block();
1795                     dont_queue_signals();
1796                     restore_queue_signals(q);
1797                 }

When zsh1 is executing this code, thisjob=2 and list_pipe_job=1,
and hasprocs(list_pipe_job) returns 0 because jobtab[1].procs is NULL.
I guess 'sleep 10' is in thisjob.

If I blindly replace the line 1790 by

1790                     list_pipe_job &&

then 'sleep 10' does not left as "defunct" and the pipeline finishes when
all the sleeps exit. But zsh1 still uses 100% of CPU before 'sleep 10'
exits (replace 'sleep 10' by 'sleep 30' to see this more easily),
so this is obviously not the correct solution.

# I haven't yet understood (and maybe will never understand) the
# list_pipe_xxxx.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-07  8:43                             ` Jun T
@ 2022-11-07 19:33                               ` Bart Schaefer
  2022-11-07 19:44                                 ` Roman Perepelitsa
  2022-11-08  0:44                                 ` Bart Schaefer
  0 siblings, 2 replies; 28+ messages in thread
From: Bart Schaefer @ 2022-11-07 19:33 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Mon, Nov 7, 2022 at 12:44 AM Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
>
> [1] ( zsh -fc 'print a few words; /bin/sleep 10'; )

Missing the | { head -n -1 } here, but I think still confirmed working?

> [2] { sleep 10; sleep 11; } | { sleep 20; sleep 21; }
> [3] dir () { ls | less; }; dir
>
> Now both ^C and ^Z/fg fork for [1], ^C works for [2],
> and ^Z/fg works for [3] ({3] can't be killed by ^C but it's OK).
>
> But ^Z/fg still doesn't work for [2]; the pipeline never finishes.
[...]
> In the case of [2], the main zsh (zsh0) always fork() a subshell (zsh1)
> for the left hand side of the pipe, and zsh1 fork/exec 'sleep 10'.
> By hitting ^Z, zsh0 fork() another subshell (zsh2) for the right hand
> side of the pipe, but it seems zsh2 is not causing the problem.

Hmm.  If I examine "pstree" I find:

zsh(43919)---zsh(43978)---sleep(43980)

43978 is your "zsh2" and is busy-waiting with 100% CPU at

#0  0x000055ca5bc4c042 in execpline (state=0x7ffc311ce770, slcode=4098, how=2,
    last1=0) at exec.c:1789
#1  0x000055ca5bc4abc3 in execlist (state=0x7ffc311ce770, dont_change_job=1,
    exiting=1) at exec.c:1444
#2  0x000055ca5bc477d7 in execcursh (state=0x7ffc311ce770, do_exec=1)
    at exec.c:450

43919 is "zsh1" and is here:

#0  0x00007fcadf04045c in __GI___sigsuspend (set=0x7ffc311cd9b0)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
#1  0x000055ca5bcbf050 in signal_suspend (sig=17, wait_cmd=0) at signals.c:393
#2  0x000055ca5bc7c31a in zwaitjob (job=1, wait_cmd=0) at jobs.c:1629

> After 'fg', zsh1 (not zsh2) is using 100% of a CPU core, and
> when 'sleep 10' exits it is left as "defunct" (not waited for).

pstree for that zombie sleep shows it's a child of 43919 (zsh1) and
hasn't been cleaned up yet because zsh1 is waiting on zsh2.

zsh2 believes it's going to get signaled for the sleep, but it won't.

> > 2022/10/22 6:22, Bart Schaefer <schaefer@brasslantern.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 12:41 AM Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> >>
> >> The subshell have missed the SIGCHLD?
> >
> > The subshell will never get the SIGCHLD, because it is a sibling of
> > the pipeline, not its parent.
>
> Do you mean the subshell for the right hand side of the pipe?

Yes, that is what I mean.  The zombie sleep and zsh2 are siblings.

> Strace output suggests that zsh1 does not receive SIGCHLD when
> 'sleep 10' finishes. Maybe SIGCHLD is blocked?? (I'm not sure).

#2  0x000055ca5bc7c31a in zwaitjob (job=1, wait_cmd=0) at jobs.c:1629

job=1 is zsh2:

(gdb) p jobtab[1].procs[0]
$2 = {next = 0x55ca5dca7690, pid = 43978,

waitjob() in zsh1 will never return because zsh2 is in an infinite
loop waiting for a child that doesn't exist.

> zsh1 is repeatedly calling hasprocs(list_pid_job) at line 1790 in
> exec.c:

This is actually zsh2.  As you noted, hasprocs() returns zero, so we
fall through to

1902            else if (subsh && jn->stat & STAT_STOPPED)
1903                thisjob = newjob;
1904            else
1905                break;

The condition at 1902 is true because jn->stat refers to the zombie
sleep, which WAS stopped when zsh2 was forked (but is not actually
even a job of this new subshell).  So we go around the loop again at

1778            for (; !nowait;) {

When zsh2 was forked during "sleep 20" , we should have hit this

1899                break;

so we must have re-entered at line 1778 after starting the "sleep 21"
and are now incorrectly waiting for the left side of the pipeline.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-07 19:33                               ` Bart Schaefer
@ 2022-11-07 19:44                                 ` Roman Perepelitsa
  2022-11-08  0:46                                   ` Bart Schaefer
  2022-11-08  0:44                                 ` Bart Schaefer
  1 sibling, 1 reply; 28+ messages in thread
From: Roman Perepelitsa @ 2022-11-07 19:44 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Jun T, zsh-workers

On Mon, Nov 7, 2022 at 8:33 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> On Mon, Nov 7, 2022 at 12:44 AM Jun T <takimoto-j@kba.biglobe.ne.jp> wrote:
> >
> > [1] ( zsh -fc 'print a few words; /bin/sleep 10'; )

The old bug that's been bothering me might be related. This hangs:

    zsh -fic '( setopt monitor; true | true )'

A few more details here: https://www.zsh.org/mla/workers//2020/msg00471.html

Do you think your patch will fix this?

Roman.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-07 19:33                               ` Bart Schaefer
  2022-11-07 19:44                                 ` Roman Perepelitsa
@ 2022-11-08  0:44                                 ` Bart Schaefer
  2022-11-08  5:03                                   ` Bart Schaefer
  1 sibling, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-11-08  0:44 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Mon, Nov 7, 2022 at 11:33 AM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> > [2] { sleep 10; sleep 11; } | { sleep 20; sleep 21; }
[...]
> Hmm.  If I examine "pstree" I find:
>
> zsh(43919)---zsh(43978)---sleep(43980)
>
> 43978 is your "zsh2" and is busy-waiting with 100% CPU at

OK, this just gets weirder.  There are different results depending on
whether you hit ^Z during the first "sleep 10", during the second
"sleep 11", or during "sleep 21".  There are also different results if
you change it to

{ sleep 5; sleep 6; } | { sleep 20; sleep 21; }

and hit ^Z during "sleep 20".  So some of what I said in workers/50901
doesn't always apply, though the location of the busy-loop is always
the same.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-07 19:44                                 ` Roman Perepelitsa
@ 2022-11-08  0:46                                   ` Bart Schaefer
  0 siblings, 0 replies; 28+ messages in thread
From: Bart Schaefer @ 2022-11-08  0:46 UTC (permalink / raw)
  To: Roman Perepelitsa; +Cc: Jun T, zsh-workers

On Mon, Nov 7, 2022 at 11:44 AM Roman Perepelitsa
<roman.perepelitsa@gmail.com> wrote:
>
>     zsh -fic '( setopt monitor; true | true )'
>
> Do you think your patch will fix this?

I don't the the patch I've already pushed will fix that, no.  IIRC
there's a race condition with the above example because the first
"true" exits before the pipeline to the second "true" has even been
set up.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-08  0:44                                 ` Bart Schaefer
@ 2022-11-08  5:03                                   ` Bart Schaefer
  2022-11-09  5:03                                     ` Bart Schaefer
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Schaefer @ 2022-11-08  5:03 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

On Mon, Nov 7, 2022 at 4:44 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> the location of the busy-loop is always
> the same.

I'm sure someone is going to explain to me why the following is wrong.

If we're in the subshell and we just came back from being suspended,
why would we not resume the current job?

What happens with the patch below is that in the right side of the
pipeline, after the new subshell is created upon ^Z, "fg" resumes the
previous foreground job (the earlier job in the right-side brace
construct) and the new subshell remains stopped until that job
finishes, at which point it is resumed and proceeds with the next job
in the braces.

There are still some oddball race conditions.  Depending on exactly
when the ^Z is sent, and particularly after the left-side job has
finished, the resumed right-side can become unkillable again.  But
that doesn't always happen.

diff --git a/Src/exec.c b/Src/exec.c
index c8bcf4ee5..b0f42ae67 100644
--- a/Src/exec.c
+++ b/Src/exec.c
@@ -1899,8 +1899,12 @@ execpline(Estate state, wordcode slcode, int
how, int last1)
                        break;
                    }
                }
-               else if (subsh && jn->stat & STAT_STOPPED)
-                   thisjob = newjob;
+               else if (subsh && jn->stat & STAT_STOPPED) {
+                   if (thisjob == newjob)
+                       makerunning(jn);
+                   else
+                       thisjob = newjob;
+               }
                else
                    break;
            }


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] problem with 'ls | less' shell function
  2022-11-08  5:03                                   ` Bart Schaefer
@ 2022-11-09  5:03                                     ` Bart Schaefer
  0 siblings, 0 replies; 28+ messages in thread
From: Bart Schaefer @ 2022-11-09  5:03 UTC (permalink / raw)
  To: Jun T; +Cc: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 1196 bytes --]

On Mon, Nov 7, 2022 at 9:03 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> I'm sure someone is going to explain to me why the following is wrong.

Nobody has, yet.  However, after thinking for a while about this:

> What happens with the patch below is that in the right side of the
> pipeline, after the new subshell is created upon ^Z, "fg" resumes the
> previous foreground job (the earlier job in the right-side brace
> construct) and the new subshell remains stopped until that job
> finishes, at which point it is resumed and proceeds with the next job
> in the braces.

... it occurred to me that it might be accomplishing a superset of
this from 50874:

+                   /*
+                    * If we're in shell jobs on the right side of a pipeline
+                    * we should treat it like a job in the current shell.
+                    */
+                   if (inforeground == 2)
+                       inforeground = 1;

... and indeed, if I remove that bit of 50874 I'm still able to do
^Z/fg and ^C in all the previously hanging or re-suspending examples.
I like that there doesn't seem to be as much of a magic bullet as that
seemed to be.

Patch attached.

[-- Attachment #2: pipejobs3.txt --]
[-- Type: text/plain, Size: 993 bytes --]

diff --git a/Src/exec.c b/Src/exec.c
index 2422dae91..d4e681887 100644
--- a/Src/exec.c
+++ b/Src/exec.c
@@ -1899,8 +1899,12 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 			break;
 		    }
 		}
-		else if (subsh && jn->stat & STAT_STOPPED)
-		    thisjob = newjob;
+		else if (subsh && jn->stat & STAT_STOPPED) {
+		    if (thisjob == newjob)
+			makerunning(jn);
+		    else
+			thisjob = newjob;
+		}
 		else
 		    break;
 	    }
diff --git a/Src/jobs.c b/Src/jobs.c
index 76c762ee5..4863962b9 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -564,12 +564,6 @@ update_job(Job jn)
 		     * when the job is finally deleted.
 		     */
 		    jn->stat |= STAT_ATTACH;
-		    /*
-		     * If we're in shell jobs on the right side of a pipeline
-		     * we should treat it like a job in the current shell.
-		     */
-		    if (inforeground == 2)
-			inforeground = 1;
 		}
 		/* If we have `foo|while true; (( x++ )); done', and hit
 		 * ^C, we have to stop the loop, too. */

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-11-09  5:04 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-17  9:17 problem with 'ls | less' shell function Thomas Klausner
2022-10-17 14:50 ` Mikael Magnusson
2022-10-17 15:36   ` Thomas Klausner
2022-10-19  8:26     ` zsugabubus
2022-10-19 10:04       ` Jun T
2022-10-19 13:36         ` Jun. T
2022-10-20  0:10       ` Bart Schaefer
2022-10-20 16:26         ` Peter Stephenson
2022-10-21  5:45       ` Jun T
2022-10-21  7:40         ` Jun T
2022-10-21 21:22           ` Bart Schaefer
2022-10-21 21:24             ` Bart Schaefer
2022-10-22 23:22               ` Bart Schaefer
2022-10-22 23:43                 ` Bart Schaefer
2022-11-03 23:10                   ` Bart Schaefer
2022-11-04  6:09                     ` [PATCH] " Bart Schaefer
2022-11-04 15:10                       ` [PATCH] " Jun T
2022-11-05  0:09                         ` Bart Schaefer
2022-11-06 19:12                           ` Bart Schaefer
2022-11-07  8:43                             ` Jun T
2022-11-07 19:33                               ` Bart Schaefer
2022-11-07 19:44                                 ` Roman Perepelitsa
2022-11-08  0:46                                   ` Bart Schaefer
2022-11-08  0:44                                 ` Bart Schaefer
2022-11-08  5:03                                   ` Bart Schaefer
2022-11-09  5:03                                     ` Bart Schaefer
2022-10-19  9:33 ` Jun T
2022-10-19 10:01   ` Jun T

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).