Re: [9fans] wait hang

9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed

* Re: [9fans] wait hang
@ 2006-12-18 20:00 erik quanstrom
  0 siblings, 0 replies; 8+ messages in thread
From: erik quanstrom @ 2006-12-18 20:00 UTC (permalink / raw)
  To: rsc, 9fans

On Mon Dec 18 14:27:31 EST 2006, rsc@swtch.com wrote:
> I'm not really convinced this part of rc ever worked right to
> begin with, so this is no big loss.

you're right.  this feature of rc doesn't work as expected.  and your example 
illustrates this well.  for example if we try 20 wait records instead of 128:

	; sleep 1000000 &
	; pids=()
	; for(i in `{seq 20}){
		  echo foo >/dev/null &
		  pids=($pids $apid)
		}
	; for(p in $pids){
			echo $p
		  	wait $p
		}
	4627
	[hang]

since 20 is much less than 128, this can't be the kernel.  i think the
problem is in the discarded wait messages here: rc/plan9.c/^Waitfor

int
Waitfor(int pid, int)
{
	thread *p;
	Waitmsg *w;
	char errbuf[ERRMAX];

	while((w = wait()) != nil){
		if(w->pid==pid){
			setstatus(w->msg);
			free(w);
			return 0;
		}
		for(p = runq->ret;p;p = p->ret)
			if(p->pid==w->pid){
				p->pid=-1;
				strcpy(p->status, w->msg);
			}
		free(w);
	}

	errstr(errbuf, sizeof errbuf);
	if(strcmp(errbuf, "interrupted")==0) return -1;
	return 0;
}

byron's rc solves this problem by maintaining it's own waitlist in $apids.

	; for(i in 1 2 3 4) { echo fu>/dev/null&}
	31427
	31428
	31429
	31430
	; echo $apids
	31427 31428 31429 31430
	; echo $apids
	31427 31428 31429 31430
	; wait $apids(1)
	; echo $apids
	31428 31429 31430
	; wait $apids(3)
	; echo $apids
	
	; wait 31428
	;

i would think the right way to go may be to either remove
"wait $pid" from rc's vocabulary or do something like byron's rc.

- erik


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [9fans] wait hang
@ 2006-12-20  3:55 erik quanstrom
  0 siblings, 0 replies; 8+ messages in thread
From: erik quanstrom @ 2006-12-20  3:55 UTC (permalink / raw)
  To: rsc, quanstro, 9fans

there's a version of rc on sources (/n/sources/contrib/quanstro/rc-wait.tbz)
that keeps careful track of waits so "wait $pid" should be dependable,
at least in the basic approach, i did minimal minimal testing of the details.
there is somewhat verbose information printed along these lines:

	; sleep & sleep & sleep & sleep
	waitqadd(772665)
	waitqadd(772666)
	waitqadd(772667)
	waitqadd(772668)
	waitfor 772668
	  w->pid 772668
	waitqrm(772668)
	Xreturn i=0; p 205256, i=1
	broken! wait 772665
	waitfor 772665
	  w->pid 772667
	  w->pid 772666
	  w->pid 772665
	waitqrm(772665)
	Xreturn i=0; p 205256, i=1
	broken! wait 772666
	waitfor 772666
	Xreturn i=0; p 205256, i=1
	broken! 

this output can be supressed by redefining the wdebug macro in exec.h.

the previous case of
	; sleep $longtime &
	; echo&echo&echo&
	; wait $apid

should illustrate the difference.

- erik

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [9fans] wait hang
@ 2006-12-18 20:45 erik quanstrom
  0 siblings, 0 replies; 8+ messages in thread
From: erik quanstrom @ 2006-12-18 20:45 UTC (permalink / raw)
  To: rsc, quanstro, 9fans

if that scan through the runq fails (it always will, since runq->ret 
is never set). then the wait message will be lost by Waitfor unless the
first wait message returned from the kernel happens to be the one we're
looking for.

perhaps what i wrote wasn't clear.

it's not so bad dropping the kernel's wait message, but in order for "wait $pid"
to work as advertised, Xasync would need to add $apid to a list and Waitfor
would need to remove it.  this way, if the waitfor list is nil, we just complain
there's nothing to wait for instead of hanging.  if it's not nil, then if the wait
completed, we return otherwise we find the right member of the list and note
that it completed.

the list scan that's there i'm pretty sure is for non-Xasync processes only.  p->ret
is referenced in Xreturn and Xstart and Waitfor only.

- erik

On Mon Dec 18 15:06:38 EST 2006, rsc@swtch.com wrote:
> > since 20 is much less than 128, this can't be the kernel.  i think the
> > problem is in the discarded wait messages here: rc/plan9.c/^Waitfor
> 
> Where do you see discarded wait messages?
> 
> > int
> > Waitfor(int pid, int)
> > {
> >         thread *p;
> >         Waitmsg *w;
> >         char errbuf[ERRMAX];
> >
> >         while((w = wait()) != nil){
> >                 if(w->pid==pid){
> >                         setstatus(w->msg);
> >                         free(w);
> >                         return 0;
> >                 }
> >                 for(p = runq->ret;p;p = p->ret)
> >                         if(p->pid==w->pid){
> >                                 p->pid=-1;
> >                                 strcpy(p->status, w->msg);
> >                         }
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <db450a2ec40aeb8284b66f00407cb1cb@coraid.com>]

* Re: [9fans] wait hang
       [not found] <db450a2ec40aeb8284b66f00407cb1cb@coraid.com>
@ 2006-12-18 20:06 ` Russ Cox
  0 siblings, 0 replies; 8+ messages in thread
From: Russ Cox @ 2006-12-18 20:06 UTC (permalink / raw)
  To: erik quanstrom; +Cc: 9fans

> since 20 is much less than 128, this can't be the kernel.  i think the
> problem is in the discarded wait messages here: rc/plan9.c/^Waitfor

Where do you see discarded wait messages?

> int
> Waitfor(int pid, int)
> {
>         thread *p;
>         Waitmsg *w;
>         char errbuf[ERRMAX];
>
>         while((w = wait()) != nil){
>                 if(w->pid==pid){
>                         setstatus(w->msg);
>                         free(w);
>                         return 0;
>                 }
>                 for(p = runq->ret;p;p = p->ret)
>                         if(p->pid==w->pid){
>                                 p->pid=-1;
>                                 strcpy(p->status, w->msg);
>                         }

This loop is trying to save the information about the
wait message.  But it doesn't seem to get consulted
anywhere, and I have never understood why that particular
linked list is the one to be looking through.

Russ


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [9fans] wait hang
@ 2006-12-18 16:01 erik quanstrom
  2006-12-18 16:40 ` Russ Cox
  0 siblings, 1 reply; 8+ messages in thread
From: erik quanstrom @ 2006-12-18 16:01 UTC (permalink / raw)
  To: 9fans

the kernel fix does indeed keep
	; for(i in `{seq 1 $n}){>/dev/null echo fu&}
	; for(i in `{seq 1 $n}){>/dev/null echo fu&}
from hanging.

however, i think the fix to rc might be in error.  if i do this

	#!/bin/rc
	cmd1&
	cmd2&
	wait

shouldn't that wait for both processes?  for example, on an older
rc, from this script

	#/bin/rc
	date
	sleep 15 &
	sleep 30 &
	wait
	date

i get

	Mon Dec 18 10:57:43 EST 2006
	Mon Dec 18 10:58:13 EST 2006

whereas with the rc from friday, i get

	; 8.out
	broken! date ; sleep 15 & sleep 30 & wait ; date
	Mon Dec 18 11:00:33 EST 2006
	Mon Dec 18 11:00:33 EST 2006

- erik


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [9fans] wait hang
  2006-12-18 16:01 erik quanstrom
@ 2006-12-18 16:40 ` Russ Cox
  2006-12-18 19:05   ` erik quanstrom
  0 siblings, 1 reply; 8+ messages in thread
From: Russ Cox @ 2006-12-18 16:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> the kernel fix does indeed keep
>         ; for(i in `{seq 1 $n}){>/dev/null echo fu&}
>         ; for(i in `{seq 1 $n}){>/dev/null echo fu&}
> from hanging.
>
> however, i think the fix to rc might be in error.  if i do this
>
>         #!/bin/rc
>         cmd1&
>         cmd2&
>         wait
>
> shouldn't that wait for both processes?  for example, on an older
> rc, from this script
>
>         #/bin/rc
>         date
>         sleep 15 &
>         sleep 30 &
>         wait
>         date
>
> i get
>
>         Mon Dec 18 10:57:43 EST 2006
>         Mon Dec 18 10:58:13 EST 2006
>
> whereas with the rc from friday, i get
>
>         ; 8.out
>         broken! date ; sleep 15 & sleep 30 & wait ; date
>         Mon Dec 18 11:00:33 EST 2006
>         Mon Dec 18 11:00:33 EST 2006

You're right.  I forgot about "wait".
I reverted the change to rc.

Note that if you add "wait" to your hanging example
it will not necessarily wait for all of the echos, only
the first 128 or so.

Russ


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [9fans] wait hang
  2006-12-18 16:40 ` Russ Cox
@ 2006-12-18 19:05   ` erik quanstrom
  2006-12-18 19:23     ` Russ Cox
  0 siblings, 1 reply; 8+ messages in thread
From: erik quanstrom @ 2006-12-18 19:05 UTC (permalink / raw)
  To: 9fans

On Mon Dec 18 11:49:41 EST 2006, rsc@swtch.com wrote:
> You're right.  I forgot about "wait".
> I reverted the change to rc.
> 
> Note that if you add "wait" to your hanging example
> it will not necessarily wait for all of the echos, only
> the first 128 or so.
> 
> Russ

if i'm reading pwait correctly, it will actually wait for all if them because nwait
is now properly incremented and decremented, but the wait message will be
discarded if the waitlist is >= 128 when the child exits.

do i have this wrong?

- erik

from port/proc.c

1195 	lock(&up->exl);
1196 	if(up->nchild == 0 && up->waitq == 0) {
1197 		unlock(&up->exl);
1198 		error(Enochild);
1199 	}
1200 	unlock(&up->exl);
1201 
1202 	sleep(&up->waitr, haswaitq, up);
1203 
1204 	lock(&up->exl);
1205 	wq = up->waitq;
1206 	up->waitq = wq->next;
1207 	up->nwait--;
1208 	unlock(&up->exl);
1209 
1210 	qunlock(&up->qwaitr);
1211 	poperror();
1212 
1213 	if(w)
1214 		memmove(w, &wq->w, sizeof(Waitmsg));


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [9fans] wait hang
  2006-12-18 19:05   ` erik quanstrom
@ 2006-12-18 19:23     ` Russ Cox
  0 siblings, 0 replies; 8+ messages in thread
From: Russ Cox @ 2006-12-18 19:23 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

You're right again.  The "wait" command will work correctly --
it won't return until all children have finished.  What might not
work is waiting for a particular pid.  That is, if you do something like

sleep 1000000 &
pids=()
for(i in `{seq 200}){
  echo foo >/dev/null &
  pids=($pids $apid)
}
for(p in $pids)
  wait $p

then since the kernel tosses away some of the wait records,
the first wait $p for a tossed-away record will just loop calling
wait and never getting the expected record.  It will not give up
because there is still the one child (sleep 1000000) left.

I'm not really convinced this part of rc ever worked right to
begin with, so this is no big loss.

Plain "wait", with no argument, does work correctly now.

Russ

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-12-20  3:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-18 20:00 [9fans] wait hang erik quanstrom
  -- strict thread matches above, loose matches on Subject: below --
2006-12-20  3:55 erik quanstrom
2006-12-18 20:45 erik quanstrom
     [not found] <db450a2ec40aeb8284b66f00407cb1cb@coraid.com>
2006-12-18 20:06 ` Russ Cox
2006-12-18 16:01 erik quanstrom
2006-12-18 16:40 ` Russ Cox
2006-12-18 19:05   ` erik quanstrom
2006-12-18 19:23     ` Russ Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).