From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 23197 invoked from network); 27 Jun 1999 08:41:28 -0000
Received: from sunsite.auc.dk (130.225.51.30)
  by ns1.primenet.com.au with SMTP; 27 Jun 1999 08:41:28 -0000
Received: (qmail 6360 invoked by alias); 27 Jun 1999 08:41:20 -0000
Mailing-List: contact zsh-workers-help@sunsite.auc.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 6872
Received: (qmail 6353 invoked from network); 27 Jun 1999 08:41:17 -0000
From: "Bart Schaefer"
Message-Id: <990627084112.ZM9488@candle.brasslantern.com>
Date: Sun, 27 Jun 1999 08:41:12 +0000
X-Mailer: Z-Mail (5.0.0 30July97)
To: zsh-workers@sunsite.auc.dk
Subject: Final (?) info on signals/crashes when suspending "mutt" function
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

[I sent this once before but it seems to have vanished. Sorry if it shows
up twice.]

Jump to the end for the big news that may finally get this fixed. I've been
writing this message incrementally between debugging passes, so you might as
well get the whole play-by-play.

Recall that Jos Backus reported that suspending the function

    mutt () {
        command mutt "$@"
        echotc rs
    }

causes zsh to behave badly. Sven has sent several patches but none of them
have completely fixed the problem.

Attempting to debug this, I've been running gdb on zsh. I reproduced the
problem but so far I'm only able to break at the point at which the SIGSTOP
is received, so I'm not sure who is sending that signal -- however, the
parent zsh received first SIGSTOP and *then* SIGTSTP when I hit ^Z, which is
very suspicious.

Because I was in gdb (attached to a PID from another xterm) I was able to
make zsh continue after each signal (so zsh's xterm never got hung).
Continuing through the second (TSTP) signal, I ended up with this:

    zagzig% mutt () {
    function> command mutt "$@"
    function> echotc rs
    function> }
    zagzig% mutt
    zsh: suspended (signal)  mutt
    zagzig% pstree $$
    zsh-+-mutt
        `-pstree
    zagzig% fg
    [1]  - trace trap (core dumped)  mutt

Simultaneously in the gdb terminal, the parent zsh got a SIGSEGV because it
tried to strcmp() a bad job table entry. Here's the stack trace:

    (gdb) where
    #0  strcmp (p1=0x0, p2=0x80bfe70 "/usr/src/local/zsh/zsh-3.0.6-pre")
        at ../sysdeps/generic/strcmp.c:36
    #1  0x804ba8b in bin_fg (name=0x80c25d8 "fg", argv=0x80c2770,
        ops=0xbffff1a8 "", func=2) at builtin.c:629
    #2  0x804a8c3 in execbuiltin (args=0x80c2710, bn=0x80b0ea0)
        at builtin.c:186
    #3  0x805d7d3 in execcmd (cmd=0x80c26f0, input=0, output=0, how=2,
        last1=2) at exec.c:1779
    #4  0x805af5e in execpline2 (pline=0x80c2740, how=2, input=0, output=0,
        last1=0) at exec.c:912
    #5  0x805a5b0 in execpline (l=0x80c26d8, how=2, last1=0) at exec.c:739
    #6  0x805a183 in execlist (list=0x80c2750, dont_change_job=0, exiting=0)
        at exec.c:612
    #7  0x806bee0 in loop (toplevel=1, justonce=0) at init.c:143
    #8  0x806bbe4 in main (argc=2, argv=0xbffff6ec) at init.c:75
    (gdb) up
    #1  0x804ba8b in bin_fg (name=0x80c25d8 "fg", argv=0x80c2770,
        ops=0xbffff1a8 "", func=2) at builtin.c:629
    629             if (strcmp(jobtab[job].pwd, pwd)) {
    (gdb) p job
    $1 = 1
    (gdb) p jobtab[1]
    $3 = {gleader = 0, other = 0, stat = 0, pwd = 0x0, procs = 0x0,
      filelist = 0x0, stty_in_env = 0, ty = 0x0}
    (gdb) p jobtab[0]
    $4 = {gleader = 0, other = 0, stat = 0, pwd = 0x0, procs = 0x0,
      filelist = 0x0, stty_in_env = 0, ty = 0x0}
    (gdb) p curjob
    $5 = 2

Somewhere zsh has completely lost track of two (?) jobs, and failed to reset
curjob to -1.
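
As an illustration only: the immediate SIGSEGV in the trace above comes from handing a NULL pwd to strcmp(). Below is a minimal sketch of a guarded comparison, assuming a hypothetical helper named job_pwd_differs() that does not exist in zsh. It is not a proposed patch. It would only avoid the crash at the strcmp() call, and would not explain why curjob still refers to a job after the table entries have been cleared.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of zsh): report whether a job's saved
 * working directory differs from the current one.
 *
 * Returns 1 if both strings are present and differ, 0 otherwise.
 * A NULL on either side is treated as "nothing to compare", which
 * avoids passing a null pointer to strcmp().
 */
int job_pwd_differs(const char *job_pwd, const char *cur_pwd)
{
    if (job_pwd == NULL || cur_pwd == NULL)
        return 0;
    return strcmp(job_pwd, cur_pwd) != 0;
}
```

With the values from the gdb session (pwd = 0x0 in the job entry), such a helper would return 0 rather than dereference the null pointer.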
Now, oddly, if I change the function to be:

    mutt() {
        cd /tmp
        command mutt "$@"
        echotc rs
    }

I still get the SIGSTOP followed by the SIGTSTP, but now zsh is able to
correctly "fg" the job:

    zagzig% mutt () {
        cd /tmp
        command mutt "$@"
        echotc rs
    }
    zagzig% mutt
    zsh: suspended (signal)  mutt
    (pwd now: /tmp)
    zagzig% cd -
    /usr/src/local/zsh/zsh-3.0.6-pre
    zagzig% fg
    [1]  - continued  mutt
    zsh: suspended (signal)  mutt
    zagzig% fg
    [1]  - continued  mutt

The extra builtin has caused something different to happen. Following the
second "fg" I quit mutt with "q" -- and now zsh is hung, blocked in
sigsuspend() called from waitjob(); but that may be a side effect of gdb.

The strange thing is, I can't tell where the heck that SIGSTOP is coming
from. I've even tried putting in debug print statements around places where
zsh performs a kill() or killpg(), and I don't get any output! Is some other
process (mutt itself?) sending a SIGSTOP to the process group?

YES! That's IT! MUTT is calling kill(0, SIGSTOP) and blowing its parent zsh
out of the water! Confirmed by changing "command" to "strace" in the function
above. Mutt expects to be the process group leader, but is not.

So that pretty much tears it. There is no way short of forking a "watcher"
subshell for EVERY external process to handle both:

(1) badly-behaved programs whose exit status does not reveal that they died
    from a signal, and
(2) badly-behaved programs that send uncatchable signals to their entire
    process group even when they are not the group leader.

The failure in case (1) is far less catastrophic than case (2), so I think
the right solution is to back off to the behavior from patch 6707 (that is,
scrap 6819 and most of 6824, but 6848 and 6850 are orthogonal and good).

I don't know, however, if that's directly related to the bogus curjob value
and "fg" crash noted above. Probably so, but ...

-- 
Bart Schaefer                             Brass Lantern Enterprises
http://www.well.com/user/barts            http://www.brasslantern.com