From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-36010-mason-zsh=primenet.com.au@zsh.org>
Received: (qmail 11723 invoked by alias); 7 Aug 2015 05:39:15 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List <zsh-workers.zsh.org>
List-Post: <mailto:zsh-workers@zsh.org>
List-Help: <mailto:zsh-workers-help@zsh.org>
X-Seq: 36010
Received: (qmail 22568 invoked from network); 7 Aug 2015 05:39:12 -0000
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_LOW,
	RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=ham autolearn_force=no
	version=3.4.0
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20130820;
        h=x-gm-message-state:from:message-id:date:in-reply-to:comments
         :references:to:subject:mime-version:content-type;
        bh=o3MQJ/5WvFmWJjwNfbgTWM2AOr46uhRs5oypq2ScGEQ=;
        b=AH22O3fzM7g73rrfFcuK+IQzqcEKICCmls9aqRG8x6WF37f6u50onqjrFi3aeZwo++
         c6+a8Kmg47aqzQorpLp1t3DHAII59lSMcRa/Wl2qf8gVYIuzc1sJRO4H0lKAFde8zA81
         qbV83/Wk34grCyhsdu97HWR8aWOVDAGMZFmgzoBCZ5w6HGblOJhvQtBPe03Tz2LFmfxo
         TV7hzyD2wMAFcINtFDUZyDIY+cOe2ayHPhAJQMBnkIPdAlQS69FLuHu5bNiGQJ2s3FyO
         77ZnFpmlYDGJw8M+3nbWKhZcc25qxyt73b9XF88H4fxo/wY1PgRo0mC0Q9majE3JuTBO
         k2YA==
X-Gm-Message-State: ALoCoQl4IdZxnQOapDWeY8+D103gSMpJGIF6SLH8HcpbX3nIzHkaOOpAyPg1ZvxdMpW/+1zHiRMs
X-Received: by 10.60.45.178 with SMTP id o18mr5112411oem.48.1438925950087;
        Thu, 06 Aug 2015 22:39:10 -0700 (PDT)
From: Bart Schaefer <schaefer@brasslantern.com>
Message-Id: <150806223906.ZM17762@torch.brasslantern.com>
Date: Thu, 6 Aug 2015 22:39:06 -0700
In-Reply-To: <CA+=GgY5BAkLceSV_Hav_mhna7M7aTLHH-WqV2EHmDRrMdqi16A@mail.gmail.com>
Comments: In reply to Mathias Fredriksson <mafredri@gmail.com>
        "Re: Deadlock when receiving kill-signal from child process" (Aug  7,  3:45am)
References: <CA+=GgY7mHkyK4NJQ6m7y-HpVPKOuKx3-bkJqRHriKzZ662_iwA@mail.gmail.com> 
	<150803085228.ZM24837@torch.brasslantern.com> 
	<CA+=GgY5iZfgUag_V1jqmCv4=PUGBmaV2cNWTDjSO4DAZ+zm-iQ@mail.gmail.com> 
	<150803135818.ZM24977@torch.brasslantern.com> 
	<CA+=GgY7uGzCYEKLBzqrt=ct6q72WFC5w1jMB5RDNe60J-wUz=Q@mail.gmail.com> 
	<150804235400.ZM9958@torch.brasslantern.com> 
	<CA+=GgY5826pmKzn=WHUKCVWOEfRbP=sCN873arRsYHvGgEUb7A@mail.gmail.com> 
	<150805085258.ZM17673@torch.brasslantern.com> 
	<CA+=GgY5F-ZD7us6C6-N+3NevKPmVvcJ+Zm7MYp4yDQTZhxY6kg@mail.gmail.com> 
	<150805115249.ZM7158@torch.brasslantern.com> 
	<CA+=GgY73SZ9NkSS74sOyF_VtaCU=Np8fixq1PbG4QmcgbcfVjQ@mail.gmail.com> 
	<150805132014.ZM7746@torch.brasslantern.com> 
	<CA+=GgY7cg9aDJxN--wcX=+MSD9dd2To3fSm9ecsd=tF1zg=zbQ@mail.gmail.com> 
	<150805220656.ZM18545@torch.brasslantern.com> 
	<CA+=GgY5xEojBVRpKKZp-svZStp-s4=N10Xz2n0=s8SKjUpbNwg@mail.gmail.com> 
	<150806085451.ZM402@torch.brasslantern.com> 
	<CA+=GgY5BAkLceSV_Hav_mhna7M7aTLHH-WqV2EHmDRrMdqi16A@mail.gmail.com>
X-Mailer: OpenZMail Classic (0.9.2 24April2005)
To: zsh-workers@zsh.org
Subject: Re: Deadlock when receiving kill-signal from child process
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Aug 7,  3:45am, Mathias Fredriksson wrote:
}
} Sadly I can't be of much assistance here, I believe you can't call
} pthread mutexes from a signal handler

If true, that would mean either (a) you can't use stdio functions from
a signal handler, because this implementation of stdio is using pthread
mutexes, or (b) stdio is not allowed to use pthread mutexes, so this 
implementation is broken.

} but that isn't what's happening
} here? If I understand correctly a signal is received while a mutex
} lock is (being) acquired.

Specifically ferror() is attempting to acquire a mutex lock when the
signal arrives, and then the handler calls fputc() which tries to
acquire the same lock, and: clunk.
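
To make the failure mode concrete, here is a minimal sketch of the same
shape of problem.  It assumes nothing about the real stdio internals:
the atomic_flag named stream_lock is a hypothetical stand-in for the
mutex that ferror() and fputc() share, and on_signal() stands in for
the handler.  The handler only *tries* to take the lock, so the demo
reports the conflict instead of hanging the way a blocking lock would.

```c
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the lock stdio keeps inside a FILE. */
static atomic_flag stream_lock = ATOMIC_FLAG_INIT;

/* 1 if the handler got the lock, 0 if it found it held, -1 if not run. */
static volatile sig_atomic_t handler_got_lock = -1;

/* Returns nonzero if the lock was free and is now held by the caller. */
static int stream_try_lock(void)
{
    return !atomic_flag_test_and_set(&stream_lock);
}

static void stream_unlock(void)
{
    atomic_flag_clear(&stream_lock);
}

/* Plays the role of the handler that ends up in fputc(). */
void on_signal(int sig)
{
    (void)sig;
    if (stream_try_lock()) {
        handler_got_lock = 1;
        stream_unlock();
    } else {
        /* A blocking lock would wait here forever: the holder is the
         * very code this handler interrupted, so it can never release. */
        handler_got_lock = 0;
    }
}

/* Plays the role of ferror(): hold the lock, then get interrupted. */
int signal_while_locked(void)
{
    signal(SIGUSR1, on_signal);
    while (!stream_try_lock())
        ;                       /* acquire the lock */
    raise(SIGUSR1);             /* signal arrives while the lock is held */
    stream_unlock();
    return handler_got_lock;    /* 0: the handler could not get the lock */
}
```

Calling signal_while_locked() returns 0 because the handler finds the
lock already taken by the code it interrupted.  With a blocking lock in
place of the try, that is exactly the "clunk" above.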

} I'm not quite sure I understand what these changes do

queue_signals() increments a counter that is tested in the signal
handler.  If the counter is nonzero, the handler does nothing but
make note in a static array that the signal was received.  If too
many signals arrive before there is an opportunity to empty the
array, the excess are simply dropped.  (The default is to be able
to queue 128 signals before starting to lose some -- if you're
expecting the shell to handle massive numbers of signals arriving
in quick succession, you're using the wrong tool for your job.
Further, this is another reason I'm reluctant to adopt the WINCH
approach for signals in general.)

dont_queue_signals() forces the counter to zero and calls all the
shell function handlers for the signals in the static array, in the
order the signals were recorded.

restore_queue_signals() sets the counter back to some previous state
(obtained from queue_signal_level()).

unqueue_signals() decrements the counter and, if (when) it reaches zero,
calls the shell function handlers for all the queued signals.

Using the counter means that you can have arbitrarily deep nesting of
queue_signals().  Using dont_queue_signals() implies that it is known
to be safe to run the handlers at that time, because it unwinds all
those levels of nesting at once.
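
The counter scheme described above can be sketched roughly as follows.
This is a simplified illustration, not the code in the zsh sources: the
function names are borrowed from the description, but the bodies, the
ring buffer, run_trap(), sketch_handler(), handled_count and
last_handled are all assumptions made for the example.  A real
implementation would also need to block signals while it manipulates
the queue.

```c
#include <signal.h>

/* Sized after the default mentioned above.  This ring buffer keeps one
 * slot empty, so it actually holds MAX_QUEUE - 1 entries. */
#define MAX_QUEUE 128

static volatile sig_atomic_t queueing = 0;        /* nesting counter */
static volatile sig_atomic_t queue_front = 0;
static volatile sig_atomic_t queue_rear = 0;
static volatile sig_atomic_t queued[MAX_QUEUE];   /* recorded signals */

/* Hypothetical stand-ins for "run the shell function trap". */
static volatile sig_atomic_t handled_count = 0;
static volatile sig_atomic_t last_handled = 0;

static void run_trap(int sig)
{
    handled_count++;
    last_handled = sig;
}

/* The signal handler: while queueing is on, only take a note. */
void sketch_handler(int sig)
{
    if (queueing) {
        int next = (queue_rear + 1) % MAX_QUEUE;
        if (next != queue_front) {      /* queue full: drop the signal */
            queued[queue_rear] = sig;
            queue_rear = next;
        }
        return;
    }
    run_trap(sig);
}

/* Run the recorded signals in the order they arrived. */
static void run_queued(void)
{
    while (queue_front != queue_rear) {
        int sig = queued[queue_front];
        queue_front = (queue_front + 1) % MAX_QUEUE;
        run_trap(sig);
    }
}

void queue_signals(void)
{
    queueing++;
}

void unqueue_signals(void)
{
    if (queueing > 0 && --queueing == 0)
        run_queued();
}

void dont_queue_signals(void)
{
    queueing = 0;               /* unwind every level of nesting at once */
    run_queued();
}

int queue_signal_level(void)
{
    return queueing;
}

void restore_queue_signals(int level)
{
    queueing = level;
}
```

With sketch_handler() installed, a signal raised between two nested
queue_signals() calls and their matching unqueue_signals() calls is
only handled when the outermost unqueue_signals() brings the counter
back to zero, whereas dont_queue_signals() handles it immediately
regardless of depth.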

So anytime I find myself using dont_queue_signals() I worry that some
calling scope has a reason to avoid signal handlers.  On the other hand,
if it's safe to run shell code at all (e.g., runshfunc()), it should
also be safe to run signal traps.

} but at least
} this last patch made it a lot harder for me to have zsh lock up. I had
} to leave my script running in a while true; do ...; done loop
} (eventually, 30sec-10min it would hit a lock).
} 
} #0  0x00007fff8abfe72a in __sigsuspend ()
} #1  0x0000000107b59287 in signal_suspend ()
} #2  0x0000000107b30671 in zwaitjob ()

Sigh.  That looks like another case of waiting for a child that does
not exist.  Unfortunately this time the call stack beyond here does
not hint at where that job was started.  Might need another strace
that corresponds to this backtrace.

} This just seems like the same mutex stuff again:
} 
} #0  0x00007fff8abfe166 in __psynch_mutexwait ()
} #1  0x00007fff8e4b578a in _pthread_mutex_lock ()
} #2  0x00007fff82ce5750 in fputc ()
} #3  0x0000000102c20cd5 in zputs ()
} #4  0x0000000102c20b3c in mb_niceformat ()
} #5  0x0000000102c201cd in zwarning ()
} #6  0x0000000102c20376 in zwarn ()
} #7  0x0000000102c143da in wait_for_processes ()
} #8  0x0000000102c140a6 in zhandler ()
} #9  <signal handler called>
} #10 0x00007fff8abfe72a in __sigsuspend ()

The stack must go further than this?  What called __sigsuspend() ?


} This also looks vaguely familiar but might as well post it:
} 
} #0  0x00007fff8abf95da in syscall_thread_switch ()
} #1  0x00007fff853a982d in _OSSpinLockLockSlow ()
} #2  0x00007fff896d771b in szone_malloc_should_clear ()

Yeah, this is a signal arriving during memory allocation.  Looks like
it came from here:

} #14 0x00007fff896d98d6 in szone_free_definite_size ()
} #15 0x0000000101ca5874 in execlist ()

So we definitely need queue_signals() somewhere down in the guts of
execlist().  Which means rejiggering some of the stuff from previous
patches, so I think I'd better back up and send a new patch against
the git master.

What next concerns me is being sure that signals are unqueued often
enough that interrupts still work, etc.


} Bonus NO_TRAPS_ASYNC:
} 
} #0  0x00007fff8abfe72a in __sigsuspend ()
} #1  0x0000000107509287 in signal_suspend ()
} #11 0x00000001075090b0 in zhandler ()
} #12 <signal handler called>
} #13 0x00007fff8abfe97a in write$NOCANCEL ()
} #14 0x00007fff82ceb9ed in _swrite ()
} #15 0x00007fff82ce44a7 in __sflush ()
} #16 0x00007fff82ce43f5 in fflush ()
} #17 0x0000000107515376 in zwarn ()
} #18 0x00000001075093da in wait_for_processes ()
} #19 0x00000001075090a6 in zhandler ()
} #20 <signal handler called>
} #21 0x00007fff8abffa12 in sigprocmask ()

This trace must also go further?  The part shown is calling a handler
during fflush() in a previous handler, but the trace doesn't
go back far enough to see where the first handler was called.