bug in zsh wait builtin

zsh-workers
 help / color / mirror / code / Atom feed

* bug in zsh wait builtin - rhbz#1150541
@ 2014-10-21  7:53 Tim Speetjens
  2014-10-21 20:02 ` Peter Stephenson
  0 siblings, 1 reply; 17+ messages in thread
From: Tim Speetjens @ 2014-10-21  7:53 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]

Dear zsh workers,

I'ld like to report a bug originally filed in
https://bugzilla.redhat.com/show_bug.cgi?id=1150541 which is still present
in the latest version, 5.0.7

Title:
zsh wait builtin shows an error and doesn't propagate exit code for a
finished child process

Description of problem:
When a child process is started, and finished before a call to wait, an
error is signaled. Also the exit code for wait is 1, where it should be the
exit code of the child process

Version-Release number of selected component:
zsh-4.3.10-7 (rhel6, x86_64)
zsh-5.0.2-7 (rhel7, x86_64)
zsh 5.0.7 (f21 rawhide, x86_64)
zsh 5.0.7 from source

How reproducible:
100%

Steps to Reproduce:
Run the following script with zsh (or with zsh in ksh emulation mode):
#!/bin/ksh
sh -c "echo done" &
pid1=$!
sleep 1
wait $pid1
echo "rc: $?"

sh -c "exit 3" &
pid2=$!
sleep 1
wait $pid2
echo "rc: $?"

Actual results:
$ zsh test.sh
done
test.sh:wait:5: pid 6156 is not a child of this shell
rc: 1
test.sh:wait:11: pid 6159 is not a child of this shell
rc: 1


Expected results (also seen when running the script with bash or mksh):
$ zsh test.sh
done
rc: 0
rc: 3


Kind regards
Tim Speetjens

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-21  7:53 bug in zsh wait builtin - rhbz#1150541 Tim Speetjens
@ 2014-10-21 20:02 ` Peter Stephenson
  2014-10-22  6:55   ` Bart Schaefer
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Stephenson @ 2014-10-21 20:02 UTC (permalink / raw)
  To: zsh-workers

On Tue, 21 Oct 2014 09:53:33 +0200
Tim Speetjens <tim.speetjens@gmail.com> wrote:
> I'ld like to report a bug originally filed in
> https://bugzilla.redhat.com/show_bug.cgi?id=1150541 which is still present
> in the latest version, 5.0.7
> 
> Title:
> zsh wait builtin shows an error and doesn't propagate exit code for a
> finished child process

There's an explanatory note in the latest POSIX standard about this,
quoted below.  It seems that the shell is basically required to remember
all background processes indefinitely (up to not very helpful get out
clauses).  As a baseline, CHILD_MAX here is 1024.  This probably needs
to be a special hash.

On most implementations, wait is a shell built-in. If it is called in a
subshell or separate utility execution environment, such as one of the
following:

    (wait)
    nohup wait ...
    find . -exec wait ... \;

it returns immediately because there are no known process IDs to wait
for in those environments.

Historical implementations of interactive shells have discarded the exit
status of terminated background processes before each shell
prompt. Therefore, the status of background processes was usually lost
unless it terminated while wait was waiting for it. This could be a
serious problem when a job that was expected to run for a long time
actually terminated quickly with a syntax or initialization error
because the exit status returned was usually zero if the requested
process ID was not found. This volume of POSIX.1-2008 requires the
implementation to keep the status of terminated jobs available until the
status is requested, so that scripts like:

    j1&
    p1=$!
    j2&
    wait $p1
    echo Job 1 exited with status $?
    wait $!
    echo Job 2 exited with status $?

work without losing status on any of the jobs. The shell is allowed to
discard the status of any process if it determines that the application
cannot get the process ID for that process from the shell. It is also
required to remember only {CHILD_MAX} number of processes in this
way. Since the only way to get the process ID from the shell is by using
the '!' shell parameter, the shell is allowed to discard the status of
an asynchronous list if "$!" was not referenced before another
asynchronous list was started. (This means that the shell only has to
keep the status of the last asynchronous list started if the application
did not reference "$!". If the implementation of the shell is smart
enough to determine that a reference to "$!" was not saved anywhere that
the application can retrieve it later, it can use this information to
trim the list of saved information. Note also that a successful call to
wait with no operands discards the exit status of all asynchronous
lists.)

If the exit status of wait is greater than 128, there is no way for the
application to know if the waited-for process exited with that value or
was killed by a signal. Since most utilities exit with small values,
there is seldom any ambiguity. Even in the ambiguous cases, most
applications just need to know that the asynchronous job failed; it does
not matter whether it detected an error and failed or was killed and did
not complete its job normally.

pws

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-21 20:02 ` Peter Stephenson
@ 2014-10-22  6:55   ` Bart Schaefer
       [not found]     ` <CAO7vJOjrb=N3xuTJVSb7U8mdXtexYp8nN4YaoknfUb3fofU2zg@mail.gmail.com>
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Bart Schaefer @ 2014-10-22  6:55 UTC (permalink / raw)
  To: zsh-workers

On Oct 21,  9:02pm, Peter Stephenson wrote:
}
} There's an explanatory note in the latest POSIX standard about this,
} quoted below.  It seems that the shell is basically required to remember
} all background processes indefinitely (up to not very helpful get out
} clauses).  As a baseline, CHILD_MAX here is 1024.  This probably needs
} to be a special hash.

Since you've bothered to look this up ... does it go on to say what the
shell is supposed to do if PIDs roll over so that a new background job
gets the same $! as some previous one?  Is "kill" supposed to work the
same way?  (Do we need to send an inquiry to austin-group?  If so, you
will have to do it, my ability to post there has been messed up for a
long time.)

Note also that this is partly handled by the POSIX_JOBS option:

     When the option is set, it becomes possible to use the wait
     builtin to wait for the last job started in the background (as
     given by $!) even if that job has already exited.  This works even
     if the option is turned on temporarily around the use of the wait
     builtin.

I would say that any further change made for this should also be under
the auspices (so to speak) of POSIX_JOBS.

} Historical implementations of interactive shells have discarded the exit
} status of terminated background processes before each shell prompt.

Precisely.  This is not a "bug," as zsh's implementation predates 2008.
It's a POSIX non-conformity, but zsh has quite a few of those remaining,
I think.

-- 
Barton E. Schaefer

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
       [not found]     ` <CAO7vJOjrb=N3xuTJVSb7U8mdXtexYp8nN4YaoknfUb3fofU2zg@mail.gmail.com>
@ 2014-10-22 15:48       ` Bart Schaefer
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Schaefer @ 2014-10-22 15:48 UTC (permalink / raw)
  To: zsh-workers, zsh-workers; +Cc: Tim Speetjens

On Oct 22, 10:42am, Tim Speetjens wrote:
}
} Are there any plans, to make zsh behave more like other shells, with
} regard to this? Both bash and mksh do what one would expect: propagate
} the return code, even if the job finished meanwhile.

I would say there are "intentions" rather than "plans".  There must be
something there already for { setopt POSIX_JOBS; foo & wait $! } so it
may be possible to generalize.

I'm still curious about PID rollover and whether "kill $!" is supposed
to avoid signalling the wrong job.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-22  6:55   ` Bart Schaefer
       [not found]     ` <CAO7vJOjrb=N3xuTJVSb7U8mdXtexYp8nN4YaoknfUb3fofU2zg@mail.gmail.com>
@ 2014-10-22 18:32     ` Chet Ramey
  2014-10-23  8:32     ` Peter Stephenson
  2 siblings, 0 replies; 17+ messages in thread
From: Chet Ramey @ 2014-10-22 18:32 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers; +Cc: chet.ramey

On 10/22/14, 2:55 AM, Bart Schaefer wrote:
> On Oct 21,  9:02pm, Peter Stephenson wrote:
> }
> } There's an explanatory note in the latest POSIX standard about this,
> } quoted below.  It seems that the shell is basically required to remember
> } all background processes indefinitely (up to not very helpful get out
> } clauses).  As a baseline, CHILD_MAX here is 1024.  This probably needs
> } to be a special hash.
> 
> Since you've bothered to look this up ... does it go on to say what the
> shell is supposed to do if PIDs roll over so that a new background job
> gets the same $! as some previous one?  Is "kill" supposed to work the
> same way?  (Do we need to send an inquiry to austin-group?  If so, you
> will have to do it, my ability to post there has been messed up for a
> long time.)

The implicit assumption in the Posix spec is that the pid space is large:
large enough that PIDs won't roll over and collide before you've gone
through CHILD_MAX children.  In practice, it happens surprisingly often.

The way I do it is to add the PIDs of exited jobs to a list, and prune it
if the list gets longer than the current child_max.  I check after fork()
and remove pids from the list if they recycle.

`wait' knows how to look through this list for child PID statuses.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-22  6:55   ` Bart Schaefer
       [not found]     ` <CAO7vJOjrb=N3xuTJVSb7U8mdXtexYp8nN4YaoknfUb3fofU2zg@mail.gmail.com>
  2014-10-22 18:32     ` Chet Ramey
@ 2014-10-23  8:32     ` Peter Stephenson
  2014-10-24  4:50       ` Bart Schaefer
  2 siblings, 1 reply; 17+ messages in thread
From: Peter Stephenson @ 2014-10-23  8:32 UTC (permalink / raw)
  To: zsh-workers

On Tue, 21 Oct 2014 23:55:42 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Oct 21,  9:02pm, Peter Stephenson wrote:
> }
> } There's an explanatory note in the latest POSIX standard about this,
> } quoted below.  It seems that the shell is basically required to remember
> } all background processes indefinitely (up to not very helpful get out
> } clauses).  As a baseline, CHILD_MAX here is 1024.  This probably needs
> } to be a special hash.
> 
> Since you've bothered to look this up ... does it go on to say what the
> shell is supposed to do if PIDs roll over so that a new background job
> gets the same $! as some previous one?

No, and as Chet says this appears to be basically "caveat usator".  The
user needs to be careful / lucky enough to perform their "wait" before
the numbers come round again.

>  Is "kill" supposed to work the same way?

There's no indication kill needs to have this.  Presumably this is
because for kill you don't need to have a sensible exit status, just a
reasonable likelihood the job is dead (or wedged in some state where
that signal doesn't work, but that's an entirely different problem).

> Note also that this is partly handled by the POSIX_JOBS option:
> 
>      When the option is set, it becomes possible to use the wait
>      builtin to wait for the last job started in the background (as
>      given by $!) even if that job has already exited.  This works even
>      if the option is turned on temporarily around the use of the wait
>      builtin.
> 
> I would say that any further change made for this should also be under
> the auspices (so to speak) of POSIX_JOBS.

That would already cover the cases in the "bug" report, in fact.

I'm not really sure why we wouldn't just implement this particular
feature generally, despite the current status.  Is there any reason why
you'd *want* "wait" to give you an error (which isn't a particularly
useful message) owing to a race condition you can't control?  POSIX_JOBS
was originally designed for things where the behaviour was clearly
incompatible.

pws

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-23  8:32     ` Peter Stephenson
@ 2014-10-24  4:50       ` Bart Schaefer
  2014-10-24  8:04         ` Tim Speetjens
  2014-10-25 19:08         ` Peter Stephenson
  0 siblings, 2 replies; 17+ messages in thread
From: Bart Schaefer @ 2014-10-24  4:50 UTC (permalink / raw)
  To: zsh-workers

On Oct 23,  9:32am, Peter Stephenson wrote:
}
} On Tue, 21 Oct 2014 23:55:42 -0700
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} >  Is "kill" supposed to work the same way?
} 
} There's no indication kill needs to have this.  Presumably this is
} because for kill you don't need to have a sensible exit status, just a
} reasonable likelihood the job is dead (or wedged in some state where
} that signal doesn't work, but that's an entirely different problem).

My implied point was that both commands accept either job identifiers
(%3, %?sleep?) or PIDs and presumably should act the same way for the
same child process regardless of how it was identified; or else PIDs
are something entirely different than job identifiers and the rules
are different.  But for "wait" treat PIDs magically while "kill" does
not, seems inconsistent.

} > Note also that this is partly handled by the POSIX_JOBS option:
} > 
} >      When the option is set, it becomes possible to use the wait
} >      builtin to wait for the last job started in the background (as
} >      given by $!) even if that job has already exited.  This works even
} >      if the option is turned on temporarily around the use of the wait
} >      builtin.
} > 
} > I would say that any further change made for this should also be under
} > the auspices (so to speak) of POSIX_JOBS.
} 
} That would already cover the cases in the "bug" report, in fact.

I don't think it would, because the report starts two background jobs
and then waits for the one started first.  The current implementation
only allows the most recent $! to be waited for after it exits.

} I'm not really sure why we wouldn't just implement this particular
} feature generally, despite the current status.  Is there any reason why
} you'd *want* "wait" to give you an error (which isn't a particularly
} useful message) owing to a race condition you can't control?

There are a lot of error messages that a script probably doesn't want
but an interactive user might.  Why do you want "wait %3" to report
"%3: no such job"?  If nobody wants it, why did it take us 25 years
to figure that out?

Maybe there should just be an option to never remove job table entries
until "wait" is explicitly called (and MONITOR + NOTIFY constitutes
an implicit wait).  Then even $pipestatus could be updated at wait-time
for backgrounded pipelines.  (I'm not seriously suggesting that, just
running this train out to the last station.  Keeping job table entries
around would solve the storage problem, but job table entries get made
for things like brace expressions for which "wait" makes no sense, and
managing pipestatus is a bear we only recently wrestled.)

OK, I'm done making up odd metaphors now.  Carry on.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-24  4:50       ` Bart Schaefer
@ 2014-10-24  8:04         ` Tim Speetjens
  2014-10-25 19:08         ` Peter Stephenson
  1 sibling, 0 replies; 17+ messages in thread
From: Tim Speetjens @ 2014-10-24  8:04 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers

> } >
> } > I would say that any further change made for this should also be under
> } > the auspices (so to speak) of POSIX_JOBS.
> }
> } That would already cover the cases in the "bug" report, in fact.
>
> I don't think it would, because the report starts two background jobs
> and then waits for the one started first.  The current implementation
> only allows the most recent $! to be waited for after it exits.

This is indeed the case, and I revised the bz report accordingly.
Also, I should have mentioned
https://bugzilla.redhat.com/show_bug.cgi?id=1150554 rather than the
one in $subject.
(1150541 is for zsh 4.3.10, which doesn't seem to have POSIXJOBS)

>
> } I'm not really sure why we wouldn't just implement this particular
> } feature generally, despite the current status.  Is there any reason why
> } you'd *want* "wait" to give you an error (which isn't a particularly
> } useful message) owing to a race condition you can't control?
>
> There are a lot of error messages that a script probably doesn't want
> but an interactive user might.  Why do you want "wait %3" to report
> "%3: no such job"?  If nobody wants it, why did it take us 25 years
> to figure that out?
I believe that reporting an error for a non-existing job like this
makes sense, it just doesn't for pids obtained through $!.

Kind regards
Tim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-24  4:50       ` Bart Schaefer
  2014-10-24  8:04         ` Tim Speetjens
@ 2014-10-25 19:08         ` Peter Stephenson
  2014-10-25 21:54           ` Bart Schaefer
  2014-10-25 22:28           ` Bart Schaefer
  1 sibling, 2 replies; 17+ messages in thread
From: Peter Stephenson @ 2014-10-25 19:08 UTC (permalink / raw)
  To: zsh-workers

On Thu, 23 Oct 2014 21:50:41 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Oct 23,  9:32am, Peter Stephenson wrote:
> }
> } On Tue, 21 Oct 2014 23:55:42 -0700
> } Bart Schaefer <schaefer@brasslantern.com> wrote:
> } >  Is "kill" supposed to work the same way?
> } 
> } There's no indication kill needs to have this.  Presumably this is
> } because for kill you don't need to have a sensible exit status, just a
> } reasonable likelihood the job is dead (or wedged in some state where
> } that signal doesn't work, but that's an entirely different problem).
> 
> My implied point was that both commands accept either job identifiers
> (%3, %?sleep?) or PIDs and presumably should act the same way for the
> same child process regardless of how it was identified; or else PIDs
> are something entirely different than job identifiers and the rules
> are different.  But for "wait" treat PIDs magically while "kill" does
> not, seems inconsistent.

In any case there appears to be no call for kill to do this.

> } > Note also that this is partly handled by the POSIX_JOBS option:
> } > 
> } >      When the option is set, it becomes possible to use the wait
> } >      builtin to wait for the last job started in the background (as
> } >      given by $!) even if that job has already exited.  This works even
> } >      if the option is turned on temporarily around the use of the wait
> } >      builtin.
> } > 
> } > I would say that any further change made for this should also be under
> } > the auspices (so to speak) of POSIX_JOBS.
> } 
> } That would already cover the cases in the "bug" report, in fact.
> 
> I don't think it would, because the report starts two background jobs
> and then waits for the one started first.  The current implementation
> only allows the most recent $! to be waited for after it exits.

I missed that.
 
> } I'm not really sure why we wouldn't just implement this particular
> } feature generally, despite the current status.  Is there any reason why
> } you'd *want* "wait" to give you an error (which isn't a particularly
> } useful message) owing to a race condition you can't control?
> 
> There are a lot of error messages that a script probably doesn't want
> but an interactive user might.  Why do you want "wait %3" to report
> "%3: no such job"?  If nobody wants it, why did it take us 25 years
> to figure that out?

The point I was making wasn't that errors were necessarily bad, if they
could detect something useful, but that the error wasn't detecting
anything particularly useful as there was an intrinsic race, and was
thereby claiming by incorrect guesswork that the PID wasn't a child of
the shell (and trying to claim that's OK because it no longer is is the
sort of argument that will annoy users, who want things to work without
long justifications).  It's incorrect no matter how long it's been
there.

In any case I don't think using wait in this fashion is useful
interactively.  It's replaced by job control.  If you turn off job
control to rely on a system designed for scripts that therefore has no
notifications you are shooting yourself in the foot.  If you use the
scripting-style mechanism you accept the consequences.

(Not sure it's even worth arguing about, but it's Saturday and I'm no
longer on git duty...)

Here's an implementation.  I've given it the obvious finger test, but
there may be some more stressful tests we could apply.  Note I'm only
recording background PIDs, since the user can't explicitly wait for
foreground PIDs; it's possible I've missed a case where something can
be in the background but that would suggest the job is wrongly recorded.

One piece of unfinished business: I think lastpid_status can go, but the
logic associated with it is rather different from what I just
implemented so I'd like some further thoughts over whether there's a
case the latter doesn't cover.  As I've written it lastpid_status is not
an optimisation because you need to remove it from the list anyway ---
you can only wait for a given PID once, after which returning an error
is the correct behaviour.

Note it's not useful to store a job, as opposed to a process, in this
fashion, because we reuse jobs immediately, which is standard shell
behaviour (it would be annoying for job numbers to increment rapidly).
That fits with the fact this isn't really designed for use with job
control.  I've tried to clarify the point in the manual entry for
"wait".

diff --git a/Doc/Zsh/builtins.yo b/Doc/Zsh/builtins.yo
index 46f40cc..edc335e 100644
--- a/Doc/Zsh/builtins.yo
+++ b/Doc/Zsh/builtins.yo
@@ -2059,6 +2059,22 @@ then all currently active child processes are waited for.
 Each var(job) can be either a job specification or the process ID
 of a job in the job table.
 The exit status from this command is that of the job waited for.
+
+It is possible to wait for recent processes (specified by process ID,
+not by job) that were running in the background even if the process has
+exited.  Typically the process ID will be recorded by capturing the
+value of the variable tt($!) immediately after the process has been
+started.  There is a limit on the number of process IDs remembered by
+the shell; this is given by the value of the system configuration
+parameter tt(CHILD_MAX).  When this limit is reached, older process IDs
+are discarded, least recently started processes first.
+
+Note there is no protection against the process ID wrapping, i.e. if the
+wait is not executed soon enough there is a chance the process waited
+for is the wrong one.  A conflict implies both process IDs have been
+generated by the shell, as other processes are not recorded, and that
+the user is potentially interested in both, so this problem is intrinsic
+to process IDs.
 )
 findex(whence)
 item(tt(whence) [ tt(-vcwfpams) ] var(name) ...)(
diff --git a/Doc/Zsh/options.yo b/Doc/Zsh/options.yo
index 068a253..452b258 100644
--- a/Doc/Zsh/options.yo
+++ b/Doc/Zsh/options.yo
@@ -1434,10 +1434,10 @@ shell is saved for output within a subshell (for example, within a
 pipeline).  When the option is set, the output of tt(jobs) is empty
 until a job is started within the subshell.
 
-When the option is set, it becomes possible to use the tt(wait) builtin to
-wait for the last job started in the background (as given by tt($!)) even
-if that job has already exited.  This works even if the option is turned
-on temporarily around the use of the tt(wait) builtin.
+In previous versions of the shell, it was necessary to enable
+tt(POSIX_JOBS) in order for the builtin command tt(wait) to return the
+status of background jobs that had already exited.  This is no longer
+the case.
 )
 enditem()
 
diff --git a/Src/jobs.c b/Src/jobs.c
index bd95afb..ec76c4f 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -1940,6 +1940,122 @@ maybeshrinkjobtab(void)
     unqueue_signals();
 }
 
+/*
+ * Definitions for the background process stuff recorded below.
+ * This would be more efficient as a hash, but
+ * - that's quite heavyweight for something not needed very often
+ * - we need some kind of ordering as POSIX allows us to limit
+ *   the size of the list to the value of _SC_CHILD_MAX and clearly
+ *   we want to clear the oldest first
+ * - cases with a long list of background jobs where the user doesn't
+ *   wait for a large number, and then does wait for one (the only
+ *   inefficient case) are rare
+ * - in the context of waiting for an external process, looping
+ *   over a list isn't so very inefficient.
+ * Enough excuses already.
+ */
+
+/* Data in the link list, a key (process ID) / value (exit status) pair. */
+struct bgstatus {
+    pid_t pid;
+    int status;
+};
+typedef struct bgstatus *Bgstatus;
+/* The list of those entries */
+LinkList bgstatus_list;
+/* Count of entries.  Reaches value of _SC_CHILD_MAX and stops. */
+long bgstatus_count;
+
+/*
+ * Remove and free a bgstatus entry.
+ */
+static void rembgstatus(LinkNode node)
+{
+    zfree(remnode(bgstatus_list, node), sizeof(struct bgstatus));
+    bgstatus_count--;
+}
+
+/*
+ * Record the status of a background process that exited so we
+ * can execute the builtin wait for it.
+ *
+ * We can't execute the wait builtin for something that exited in the
+ * foreground as it's not visible to the user, so don't bother recording.
+ */
+
+/**/
+void
+addbgstatus(pid_t pid, int status)
+{
+    static long child_max;
+    Bgstatus bgstatus_entry;
+
+    if (!child_max) {
+#ifdef _SC_CHILD_MAX
+	child_max = sysconf(_SC_CHILD_MAX);
+	if (!child_max) /* paranoia */
+#endif
+	{
+	    /* Be inventive */
+	    child_max = 1024L;
+	}
+    }
+
+    if (!bgstatus_list) {
+	bgstatus_list = znewlinklist();
+	/*
+	 * We're not always robust about memory failures, but
+	 * this is pretty deep in the shell basics to be failing owing
+	 * to memory, and a failure to wait is reported loudly, so test
+	 * and fail silently here.
+	 */
+	if (!bgstatus_list)
+	    return;
+    }
+    if (bgstatus_count == child_max) {
+	/* Overflow.  List is in order, remove first */
+	rembgstatus(firstnode(bgstatus_list));
+    }
+    bgstatus_entry = (Bgstatus)zalloc(sizeof(*bgstatus_entry));
+    if (!bgstatus_entry) {
+	/* See note above */
+	return;
+    }
+    bgstatus_entry->pid = pid;
+    bgstatus_entry->status = status;
+    if (!zaddlinknode(bgstatus_list, bgstatus_entry)) {
+	zfree(bgstatus_entry, sizeof(*bgstatus_entry));
+	return;
+    }
+    bgstatus_count++;
+}
+
+/*
+ * See if pid has a recorded exit status.
+ * Note we make no guarantee that the PIDs haven't wrapped, so this
+ * may not be the right process.
+ *
+ * This is only used by wait, which must only work on each
+ * pid once, so we need to remove the entry if we find it.
+ */
+
+static int getbgstatus(pid_t pid)
+{
+    LinkNode node;
+    Bgstatus bgstatus_entry;
+
+    if (!bgstatus_list)
+	return -1;
+    for (node = firstnode(bgstatus_list); node; incnode(node)) {
+	bgstatus_entry = (Bgstatus)getdata(node);
+	if (bgstatus_entry->pid == pid) {
+	    int status = bgstatus_entry->status;
+	    rembgstatus(node);
+	    return status;
+	}
+    }
+    return -1;
+}
 
 /* bg, disown, fg, jobs, wait: most of the job control commands are     *
  * here.  They all take the same type of argument.  Exception: wait can *
@@ -2085,10 +2201,24 @@ bin_fg(char *name, char **argv, Options ops, int func)
 		}
 		if (retval == 0)
 		    retval = lastval2;
-	    } else if (isset(POSIXJOBS) &&
-		       pid == lastpid && lastpid_status >= 0L) {
+	    } else if (pid == lastpid && lastpid_status >= 0L) {
+		/*
+		 * Historical note: this used to be covered by
+		 * isset(POSIXJOBS), but reporting that the last
+		 * PID to exist isn't a child of the shell is not
+		 * obviously useful behaviour.
+		 */
 		retval = (int)lastpid_status;
-	    } else {
+		/*
+		 * We can't wait for a PID twice so ensure it's
+		 * not on the list, either.
+		 *
+		 * TODO: We could optimise this because pid must be at
+		 * the end of the list, if present, but I think we now
+		 * can get rid of lastpid_status anyway.
+ 		 */
+		(void)getbgstatus(pid);
+	    } else if ((retval = getbgstatus(pid)) < 0) {
 		zwarnnam(name, "pid %d is not a child of this shell", pid);
 		/* presumably lastval2 doesn't tell us a heck of a lot? */
 		retval = 1;
diff --git a/Src/linklist.c b/Src/linklist.c
index 1e364fb..3aa8125 100644
--- a/Src/linklist.c
+++ b/Src/linklist.c
@@ -118,6 +118,8 @@ znewlinklist(void)
     LinkList list;
 
     list = (LinkList) zalloc(sizeof *list);
+    if (!list)
+	return NULL;
     list->list.first = NULL;
     list->list.last = &list->node;
     list->list.flags = 0;
@@ -152,6 +154,8 @@ zinsertlinknode(LinkList list, LinkNode node, void *dat)
 
     tmp = node->next;
     node->next = new = (LinkNode) zalloc(sizeof *tmp);
+    if (!new)
+	return NULL;
     new->prev = node;
     new->dat = dat;
     new->next = tmp;
diff --git a/Src/signals.c b/Src/signals.c
index 2df69f9..7b212e6 100644
--- a/Src/signals.c
+++ b/Src/signals.c
@@ -530,6 +530,12 @@ wait_for_processes(void)
 	 */
 	if (pn != NULL && pid == lastpid && lastpid_status != -1L)
 	    lastpid_status = lastval2;
+	/*
+	 * Accumulate a list of older jobs (the above is basically an
+	 * optimisation for the last job.
+	 */
+	if (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) && jn - jobtab != thisjob)
+	    addbgstatus(pid, (int)lastval2);
     }
 }
 
 
-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-25 19:08         ` Peter Stephenson
@ 2014-10-25 21:54           ` Bart Schaefer
  2014-10-25 22:28           ` Bart Schaefer
  1 sibling, 0 replies; 17+ messages in thread
From: Bart Schaefer @ 2014-10-25 21:54 UTC (permalink / raw)
  To: zsh-workers

On Oct 25,  8:08pm, Peter Stephenson wrote:
}
} Here's an implementation.  I've given it the obvious finger test, but
} there may be some more stressful tests we could apply.  Note I'm only
} recording background PIDs, since the user can't explicitly wait for
} foreground PIDs; it's possible I've missed a case where something can
} be in the background but that would suggest the job is wrongly recorded.

I was going to ask a bunch of questions about this but in looking at the
patch I realized that you determine foreground-ness at the time the job
exits, not at the time it's started, so this sounds fine.

Incidentally, when you ^Z the foreground job, the value of $! is not
updated, so the only way to use wait is by job identifier (which is
equivalent to using fg).  $! finally does get updated when the job is
continued with bg.  (This describes the pre-33531-patch state, but I
don't think the patch alters it.)

} One piece of unfinished business: I think lastpid_status can go, but
} the logic associated with it is rather different from what I just
} implemented so I'd like some further thoughts

I can't think of a case where lastpid_status would not be the same as
getbgstatus(lastpid) ... I suppose you could throw in a DPUTS() to
confirm that, just as a sanity check, but otherwise I concur that you
can do away with lastpid_status.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-25 19:08         ` Peter Stephenson
  2014-10-25 21:54           ` Bart Schaefer
@ 2014-10-25 22:28           ` Bart Schaefer
  2014-10-25 22:32             ` Bart Schaefer
  1 sibling, 1 reply; 17+ messages in thread
From: Bart Schaefer @ 2014-10-25 22:28 UTC (permalink / raw)
  To: zsh-workers

On Oct 25,  8:08pm, Peter Stephenson wrote:
}
} Here's an implementation.  I've given it the obvious finger test, but
} there may be some more stressful tests we could apply.

I've run "make check" on this patch twice now and gotten different
numbers of silent failure each time -- 8 the first time (with 3 core
files) and 7 the second time (2 core files) so it's probably a race
condition.

In all core dumps the "jn" pointer is NULL in wait_for_processes().  Here
are two sample backtraces, though they're not very helpful.

#0  0x080b7085 in wait_for_processes () at ../../zsh-5.0/Src/signals.c:537
537             if (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) && jn - jobtab !=
thisjob)
(gdb) p jn
$1 = 0x0
(gdb) where
#0  0x080b7085 in wait_for_processes () at ../../zsh-5.0/Src/signals.c:537
#1  0x080b726c in zhandler (sig=17) at ../../zsh-5.0/Src/signals.c:600
#2  <signal handler called>
#3  0x0086e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#4  0x008afe8c in sigsuspend () from /lib/tls/libc.so.6
#5  0x080b6dba in signal_suspend (sig=17, wait_cmd=0)
    at ../../zsh-5.0/Src/signals.c:375
#6  0x08084ea2 in waitforpid (pid=27019, wait_cmd=0)
    at ../../zsh-5.0/Src/jobs.c:1402
#7  0x0806982f in getoutput (cmd=0xb7daf7c2 "diff \"$@\"", qt=1)
    at ../../zsh-5.0/Src/exec.c:3862


(gdb) where
#0  0x080b7085 in wait_for_processes () at ../../zsh-5.0/Src/signals.c:537
#1  0x080b726c in zhandler (sig=17) at ../../zsh-5.0/Src/signals.c:600
#2  <signal handler called>
#3  0x0086e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#4  0x008afe8c in sigsuspend () from /lib/tls/libc.so.6
#5  0x080b6dba in signal_suspend (sig=17, wait_cmd=0)
    at ../../zsh-5.0/Src/signals.c:375
#6  0x08085199 in zwaitjob (job=6, wait_cmd=0) at ../../zsh-5.0/Src/jobs.c:1454
#7  0x08085372 in waitjobs () at ../../zsh-5.0/Src/jobs.c:1499
#8  0x080637f2 in execpline (state=0xbffb4380, slcode=4098, how=18, last1=0)
    at ../../zsh-5.0/Src/exec.c:1561
#9  0x08062cf6 in execlist (state=0xbffb4380, dont_change_job=1, exiting=0)
    at ../../zsh-5.0/Src/exec.c:1268
#10 0x0806272e in execode (p=0xb7d43a30, dont_change_job=1, exiting=0, 
    context=0x813a2f7 "eval") at ../../zsh-5.0/Src/exec.c:1074
#11 0x0805b5a2 in eval (argv=0xb7d43770) at ../../zsh-5.0/Src/builtin.c:5043
#12 0x0805bae1 in bin_eval (nam=0xb7d436e0 "eval", argv=0xb7d43770, 
    ops=0xbffb4470, func=14) at ../../zsh-5.0/Src/builtin.c:5208


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-25 22:28           ` Bart Schaefer
@ 2014-10-25 22:32             ` Bart Schaefer
  2014-10-25 23:04               ` Peter Stephenson
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Schaefer @ 2014-10-25 22:32 UTC (permalink / raw)
  To: zsh-workers

On Oct 25,  3:28pm, Bart Schaefer wrote:
}
} #0  0x080b7085 in wait_for_processes () at ../../zsh-5.0/Src/signals.c:537
} 537             if (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) && jn - jobtab !=
} thisjob)
} (gdb) p jn
} $1 = 0x0

Add "jn != NULL &&" to the front of that condition and all tests pass.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-25 22:32             ` Bart Schaefer
@ 2014-10-25 23:04               ` Peter Stephenson
  2014-10-25 23:17                 ` Peter Stephenson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Stephenson @ 2014-10-25 23:04 UTC (permalink / raw)
  To: zsh-workers

On Sat, 25 Oct 2014 15:32:31 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Oct 25,  3:28pm, Bart Schaefer wrote:
> }
> } #0  0x080b7085 in wait_for_processes () at ../../zsh-5.0/Src/signals.c:537
> } 537             if (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) && jn - jobtab !=
> } thisjob)
> } (gdb) p jn
> } $1 = 0x0
> 
> Add "jn != NULL &&" to the front of that condition and all tests pass.

Yes, I was dozily wondering about that when I sent the patch but didn't
get around to checking.

I'm now wondering whether we need

	if (!jn || (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) &&
		    jn - jobtab != thisjob))
	    addbgstatus(pid, (int)lastval2);

i.e. if there's *no* job we should remember the PID because we don't
have enough information to say we don't need to remember it.  That's
probably safe --- we've already had a succesful return value from wait
or one of its relatives so it's a child of the shell --- but there might
be cases where it's inefficient.  I suppose I ought to check what
happens when job control is off.

> Incidentally, when you ^Z the foreground job, the value of $! is not
> updated, so the only way to use wait is by job identifier (which is
> equivalent to using fg).  $! finally does get updated when the job is
> continued with bg.  (This describes the pre-33531-patch state, but I
> don't think the patch alters it.)

Yes, this isn't going to interact particularly usefully with the patch,
but I'm not really expecting it to be particularly useful when you're in
an interactive shell with job control anyway.

pws

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-25 23:04               ` Peter Stephenson
@ 2014-10-25 23:17                 ` Peter Stephenson
  2014-10-26 19:01                   ` Peter Stephenson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Stephenson @ 2014-10-25 23:17 UTC (permalink / raw)
  To: zsh-workers

On Sun, 26 Oct 2014 00:04:48 +0100
Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> I'm now wondering whether we need
> 
> 	if (!jn || (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) &&
> 		    jn - jobtab != thisjob))
> 	    addbgstatus(pid, (int)lastval2);
> 
> i.e. if there's *no* job we should remember the PID because we don't
> have enough information to say we don't need to remember it.  That's
> probably safe --- we've already had a succesful return value from wait
> or one of its relatives so it's a child of the shell --- but there might
> be cases where it's inefficient.  I suppose I ought to check what
> happens when job control is off.

It's typically OK with job control off, there is a job at that point...

I found the places where this (!jn) triggers by adding error output in
this case and running the test suite...

./A02alias.ztst: starting.
This test hangs the shell when it fails...
*** /tmp/zsh.ztst.err.4092	Sun Oct 26 00:10:01 2014
--- /tmp/zsh.ztst.terr.4092	Sun Oct 26 00:10:01 2014
***************
*** 0 ****
--- 1 ----
+ pid 4118 with no job
Test ./A02alias.ztst failed: error output differs from expected as shown above for:
  print -u $ZTST_fd 'This test hangs the shell when it fails...'
  alias cat='LC_ALL=C cat'
  cat <(echo foo | cat)

So happens with <(....).

./C04funcdef.ztst: starting.
*** /tmp/zsh.ztst.err.12383	Sun Oct 26 00:10:30 2014
--- /tmp/zsh.ztst.terr.12383	Sun Oct 26 00:10:30 2014
***************
*** 0 ****
--- 1 ----
+ pid 12454 with no job
Test ./C04funcdef.ztst failed: error output differs from expected as shown above for:
  () (cat $1 $2) <(print process expanded) =(print expanded to file)
Was testing: Process substitution with anonymous functions

And =(...), not surprisingly.

./D03procsubst.ztst: starting.
*** /tmp/zsh.ztst.err.12707	Sun Oct 26 00:10:31 2014
--- /tmp/zsh.ztst.terr.12707	Sun Oct 26 00:10:31 2014
***************
*** 0 ****
--- 1,2 ----
+ pid 12713 with no job
+ pid 12714 with no job
Test ./D03procsubst.ztst failed: error output differs from expected as shown above for:
  paste <(cut -f1 FILE1) <(cut -f3 FILE2)
Was testing: <(...) substitution

And again.

./V08zpty.ztst: starting.
*** /tmp/zsh.ztst.err.15038	Sun Oct 26 00:10:44 2014
--- /tmp/zsh.ztst.terr.15038	Sun Oct 26 00:10:44 2014
***************
*** 0 ****
--- 1 ----
+ pid 15042 with no job
Test ./V08zpty.ztst failed: error output differs from expected as shown above for:
  zpty cat cat
  zpty -w cat a line of text
  var=
  zpty -r cat var && print -r -- ${var%%$'\r\n'}
  zpty -d cat
Was testing: zpty with a process that does not set up the terminal: internal write

So happens with processes forked by zpty.  I get messages in the
completion tests but no errors which I guess are also zpty.

So it looks like these cases aren't relevant and we can ignore the case
of no job and still pick up all the cases we do need.

pws


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-25 23:17                 ` Peter Stephenson
@ 2014-10-26 19:01                   ` Peter Stephenson
  2014-10-26 20:41                     ` Bart Schaefer
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Stephenson @ 2014-10-26 19:01 UTC (permalink / raw)
  To: zsh-workers

Here's a test.  The tortuous explanation points out that we don't
absolutely guarantee on every run of the test that it's testing the
feature in question, but statistically it will be tested more often than
not.  So Axel will spot failures even if no one else does :-).

The changes in quotes earlier are just to deconfuse Emacs shell mode.

> Word not found in the Dictionary and Encyclopedia.

So what is the opposite of confuse?

pws

diff --git a/Test/A05execution.ztst b/Test/A05execution.ztst
index ca97f4f..838cd74 100644
--- a/Test/A05execution.ztst
+++ b/Test/A05execution.ztst
@@ -190,9 +190,9 @@
                    print "${pipestatus[@]}")
 	ZTST_hashmark
   done | sort | uniq -c | sed 's/^ *//'
-0:Check whether `$pipestatus[]' behaves.
+0:Check whether '$pipestatus[]' behaves.
 >2048 2 1 0
-F:This test checks for a bug in `$pipestatus[]' handling.  If it breaks then
+F:This test checks for a bug in '$pipestatus[]' handling.  If it breaks then
 F:the bug is still there or it reappeared. See workers-29973 for details.
 
   { setopt MONITOR } 2>/dev/null
@@ -244,3 +244,29 @@ F:anonymous function, and a descriptor leak when backgrounding a pipeline
 >autoload_redir () {
 >	print Autoloaded ksh style
 >} > autoload.log
+
+# This tests that we record the status of processes that have already exited
+# for when we wait for them.
+#
+# Actually, we don't guarantee here that the jobs have already exited, but
+# the order of the waits means it's highly likely we do need to a recall
+# previous status, barring accidents which shouldn't happen very often.  In
+# other words, we relying on the test working repeatedly rather than just
+# once.  The monitor option is irrelevant to the logic, so we'll make
+# our job easier by turning it off.
+  unsetopt monitor
+  (exit 1) &
+  one=$!
+  (exit 2) &
+  two=$!
+  (exit 3) &
+  three=$!
+  wait $three
+  print $?
+  wait $two
+  print $?
+  wait $one
+1:The status of recently exited background jobs is recorded
+>3
+>2
+


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-26 19:01                   ` Peter Stephenson
@ 2014-10-26 20:41                     ` Bart Schaefer
  2014-10-26 21:22                       ` Peter Stephenson
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Schaefer @ 2014-10-26 20:41 UTC (permalink / raw)
  To: zsh-workers

On Oct 26,  7:01pm, Peter Stephenson wrote:
} Subject: Re: bug in zsh wait builtin - rhbz#1150541
}
} Here's a test.
} 
} +# once.  The monitor option is irrelevant to the logic, so we'll make
} +# our job easier by turning it off.
} +  unsetopt monitor

That throws errors in some automated build environments.  Look for other
places in A05 that fiddle with it.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bug in zsh wait builtin - rhbz#1150541
  2014-10-26 20:41                     ` Bart Schaefer
@ 2014-10-26 21:22                       ` Peter Stephenson
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Stephenson @ 2014-10-26 21:22 UTC (permalink / raw)
  To: zsh-workers

On Sun, 26 Oct 2014 13:41:00 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Oct 26,  7:01pm, Peter Stephenson wrote:
> } Subject: Re: bug in zsh wait builtin - rhbz#1150541
> }
> } Here's a test.
> } 
> } +# once.  The monitor option is irrelevant to the logic, so we'll make
> } +# our job easier by turning it off.
> } +  unsetopt monitor
> 
> That throws errors in some automated build environments.  Look for other
> places in A05 that fiddle with it.

Looks like we need this.

diff --git a/Test/A05execution.ztst b/Test/A05execution.ztst
index 589815f..042b2d0 100644
--- a/Test/A05execution.ztst
+++ b/Test/A05execution.ztst
@@ -254,7 +254,7 @@ F:anonymous function, and a descriptor leak when backgrounding a pipeline
 # other words, we rely on the test working repeatedly rather than just
 # once.  The monitor option is irrelevant to the logic, so we'll make
 # our job easier by turning it off.
-  unsetopt monitor
+  { unsetopt MONITOR } 2>/dev/null
   (exit 1) &
   one=$!
   (exit 2) &

pws


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-10-26 21:22 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-21  7:53 bug in zsh wait builtin - rhbz#1150541 Tim Speetjens
2014-10-21 20:02 ` Peter Stephenson
2014-10-22  6:55   ` Bart Schaefer
     [not found]     ` <CAO7vJOjrb=N3xuTJVSb7U8mdXtexYp8nN4YaoknfUb3fofU2zg@mail.gmail.com>
2014-10-22 15:48       ` Bart Schaefer
2014-10-22 18:32     ` Chet Ramey
2014-10-23  8:32     ` Peter Stephenson
2014-10-24  4:50       ` Bart Schaefer
2014-10-24  8:04         ` Tim Speetjens
2014-10-25 19:08         ` Peter Stephenson
2014-10-25 21:54           ` Bart Schaefer
2014-10-25 22:28           ` Bart Schaefer
2014-10-25 22:32             ` Bart Schaefer
2014-10-25 23:04               ` Peter Stephenson
2014-10-25 23:17                 ` Peter Stephenson
2014-10-26 19:01                   ` Peter Stephenson
2014-10-26 20:41                     ` Bart Schaefer
2014-10-26 21:22                       ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).