From: Bart Schaefer
Message-Id: <150806223906.ZM17762@torch.brasslantern.com>
Date: Thu, 6 Aug 2015 22:39:06 -0700
Comments: In reply to Mathias Fredriksson
        "Re: Deadlock when receiving kill-signal from child process" (Aug 7, 3:45am)
To: zsh-workers@zsh.org
Subject: Re: Deadlock when receiving kill-signal from child process
X-Seq: 36010

On Aug 7, 3:45am, Mathias Fredriksson wrote:
}
} Sadly I can't be of much assistance here, I believe you can't call
} pthread mutexes from a signal handler

If true, that would mean either (a) you can't use stdio functions from
a signal handler, because this implementation of stdio is using pthread
mutexes, or (b) stdio is not allowed to use pthread mutexes, so this
implementation is broken.

} but that isn't what's happening
} here? If I understand correctly a signal is received while a mutex
} lock is (being) acquired.

Specifically, ferror() is attempting to acquire a mutex lock when the
signal arrives, and then the handler calls fputc() which tries to
acquire the same lock, and:  clunk.

} I'm not quite sure I understand what these changes do

queue_signals() increments a counter that is tested in the signal
handler.  If the counter is nonzero, the handler does nothing but make
a note in a static array that the signal was received.  If too many
signals arrive before there is an opportunity to empty the array, the
excess are simply dropped.

(The default is to be able to queue 128 signals before starting to lose
some -- if you're expecting the shell to handle massive numbers of
signals arriving in quick succession, you're using the wrong tool for
your job.  Further, this is another reason I'm reluctant to adopt the
WINCH approach for signals in general.)

dont_queue_signals() forces the counter to zero and calls all the shell
function handlers for the signals in the static array, in the order the
signals were recorded.  restore_queue_signals() sets the counter back
to some previous state (obtained from queue_signal_level()).

unqueue_signals() decrements the counter and calls all the shell
functions if (when) it reaches zero.
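To make the counter scheme described above concrete, here is a minimal
sketch in C.  It is not the zsh source: every name in it (sketch_handler,
run_trap, the sketch_* functions, MAX_QUEUED) is hypothetical, and
run_trap() merely counts calls where the shell would run a trap function.
It assumes a single-threaded process and ignores races that a real
implementation would have to close.

```c
#include <signal.h>

#define MAX_QUEUED 128  /* the default queue size mentioned above */

/* All names below are hypothetical; this is not the zsh implementation. */
static volatile sig_atomic_t queue_level;         /* nesting depth of queue requests */
static volatile sig_atomic_t queued[MAX_QUEUED];  /* signals noted while queueing */
static volatile sig_atomic_t queue_len;
static volatile sig_atomic_t traps_run;           /* stand-in for running a shell trap */
static volatile sig_atomic_t last_trap;

static void run_trap(int sig)
{
    last_trap = sig;
    traps_run++;
}

/* Signal handler: while the counter is nonzero, only record the signal. */
static void sketch_handler(int sig)
{
    if (queue_level > 0) {
        if (queue_len < MAX_QUEUED)
            queued[queue_len++] = sig;   /* anything past the limit is dropped */
        return;
    }
    run_trap(sig);
}

/* Run the traps for everything recorded, oldest first. */
static void drain_queue(void)
{
    for (int i = 0; i < queue_len; i++)
        run_trap(queued[i]);
    queue_len = 0;
}

static void sketch_queue_signals(void) { queue_level++; }

/* Leave one level of nesting; drain only when the outermost level ends. */
static void sketch_unqueue_signals(void)
{
    if (queue_level > 0 && --queue_level == 0)
        drain_queue();
}

static int sketch_queue_signal_level(void) { return queue_level; }

/* Unwind every level at once and run the recorded traps immediately. */
static void sketch_dont_queue_signals(void)
{
    queue_level = 0;
    drain_queue();
}

static void sketch_restore_queue_signals(int level) { queue_level = level; }
```

With two nested sketch_queue_signals() calls in effect, a signal that
reaches sketch_handler() is only recorded; the first
sketch_unqueue_signals() leaves it recorded, and the second one runs the
trap.  sketch_dont_queue_signals() runs it right away regardless of depth,
which is why it is only appropriate where handlers are known to be safe.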
Using the counter means that you can have arbitrarily deep nesting of
queue_signals().  Using dont_queue_signals() implies that it is known
to be safe to run the handlers at that time, because it unwinds all
those levels of nesting at once.  So anytime I find myself using
dont_queue_signals() I worry that there is a calling-scope reason to
avoid signal handlers.  On the other hand, if it's safe to run shell
code at all (e.g., runshfunc()), it should also be safe to run signal
traps.

} but at least
} this last patch made it a lot harder for me to have zsh lock up. I had
} to leave my script running in a while true; do ...; done loop
} (eventually, 30sec-10min it would hit a lock).
}
} #0  0x00007fff8abfe72a in __sigsuspend ()
} #1  0x0000000107b59287 in signal_suspend ()
} #2  0x0000000107b30671 in zwaitjob ()

Sigh.  That looks like another case of waiting for a child that does
not exist.  Unfortunately this time the call stack beyond here does not
hint at where that job was started.  Might need another strace that
corresponds to this backtrace.

} This just seems like the same mutex stuff again:
}
} #0  0x00007fff8abfe166 in __psynch_mutexwait ()
} #1  0x00007fff8e4b578a in _pthread_mutex_lock ()
} #2  0x00007fff82ce5750 in fputc ()
} #3  0x0000000102c20cd5 in zputs ()
} #4  0x0000000102c20b3c in mb_niceformat ()
} #5  0x0000000102c201cd in zwarning ()
} #6  0x0000000102c20376 in zwarn ()
} #7  0x0000000102c143da in wait_for_processes ()
} #8  0x0000000102c140a6 in zhandler ()
} #9  <signal handler called>
} #10 0x00007fff8abfe72a in __sigsuspend ()

The stack must go further than this?  What called __sigsuspend()?

} This also looks vaguely familiar but might as well post it:
}
} #0  0x00007fff8abf95da in syscall_thread_switch ()
} #1  0x00007fff853a982d in _OSSpinLockLockSlow ()
} #2  0x00007fff896d771b in szone_malloc_should_clear ()

Yeah, this is a signal arriving during memory allocation.
Looks like it came from here:

} #14 0x00007fff896d98d6 in szone_free_definite_size ()
} #15 0x0000000101ca5874 in execlist ()

So we definitely need queue_signals() somewhere down in the guts of
execlist().  Which means rejiggering some of the stuff from previous
patches, so I think I'd better back up and send a new patch against
the git master.

What next concerns me is being sure that signals are unqueued often
enough that interrupts still work, etc.

} Bonus NO_TRAPS_ASYNC:
}
} #0  0x00007fff8abfe72a in __sigsuspend ()
} #1  0x0000000107509287 in signal_suspend ()
} #11 0x00000001075090b0 in zhandler ()
} #12 <signal handler called>
} #13 0x00007fff8abfe97a in write$NOCANCEL ()
} #14 0x00007fff82ceb9ed in _swrite ()
} #15 0x00007fff82ce44a7 in __sflush ()
} #16 0x00007fff82ce43f5 in fflush ()
} #17 0x0000000107515376 in zwarn ()
} #18 0x00000001075093da in wait_for_processes ()
} #19 0x00000001075090a6 in zhandler ()
} #20 <signal handler called>
} #21 0x00007fff8abffa12 in sigprocmask ()

This trace must also go further?  The part shown is calling a handler
during fflush() in a previous handler, but the trace doesn't go back
far enough to see where the first handler was called.
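For comparison, the execlist() case (a signal landing inside free())
can also be illustrated with plain POSIX signal blocking.  This is a
different technique from the counter-based queue_signals(): the kernel
keeps the signal pending until the mask is restored.  The sketch below
is an illustration only, not a proposed patch; count_handler() and
free_with_signal_blocked() are hypothetical names, raise() stands in for
a signal arriving from a child, and a single-threaded process is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stdlib.h>
#include <string.h>

static volatile sig_atomic_t deliveries;   /* bumped each time the handler runs */

/* Hypothetical handler: does nothing but count. */
static void count_handler(int sig)
{
    (void)sig;
    deliveries++;
}

/*
 * Hypothetical helper: free p with SIGUSR1 blocked.  A SIGUSR1 raised
 * inside the region stays pending and is delivered only when the old
 * mask is restored.  Returns the number of handler runs observed inside
 * the blocked region (expected to be 0), or -1 on error.
 */
static int free_with_signal_blocked(void *p)
{
    struct sigaction sa;
    sigset_t block, saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    int before = deliveries;
    raise(SIGUSR1);                  /* stands in for a signal arriving here */
    free(p);                         /* the allocator is never interrupted */
    int inside = deliveries - before;

    /* Restoring the mask lets the pending SIGUSR1 through. */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0)
        return -1;
    return inside;
}
```

If SIGUSR1 was not already blocked by the caller, the helper returns 0
and the handler has run once by the time it returns.  The tradeoff is the
same one raised above about unqueueing often enough: the longer the
blocked region, the longer interrupts go unanswered.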