Date: Wed, 29 May 2024 09:15:34 -0400
From: Rich Felker
To: Pablo Correa Gomez
Cc: musl@lists.openwall.com
Message-ID: <20240529131533.GH10433@brightrain.aerifal.cx>
Subject: Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK

On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> Hi everybody,
>
> I am responsible for musl CI in GNOME's GLib, and we have recently
> bumped into a crash that I have been unable to resolve.
>
> https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
> introduced in GLib an alternate stack by setting the signal action
> SA_ONSTACK if available. However, the tests that were introduced, and
> that pass in most other libc's (there's CI for a lot more than just
> glibc and musl) crash in my alpine linux edge installation with SIGSEGV
> (stack trace below) while doing: kill (getpid(), SIGHUP)
>
> I have verified that not adding SA_ONSTACK fixes the crash.
> Would anybody have some pointers of what could possibly be going wrong?
> If anybody is really interested, the public issue is
> https://gitlab.gnome.org/GNOME/glib/-/issues/3315
>
> Stack trace
> ------------
>
> Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> ../arch/x86_64/syscall_arch.h:21
> warning: 21 ./arch/x86_64/syscall_arch.h: No such file or directory
> (gdb) bt
> #0 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> ../arch/x86_64/syscall_arch.h:21
> #1 kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
> #2 0x0000555555556e96 in test_signal (signum=signum@entry=1) at
> .../glib/tests/unix.c:534
> #3 0x0000555555557200 in test_signal_alternate_stack (signal=1) at
> .../glib/tests/unix.c:590
> #4 0x00007ffff7e8f364 in test_case_run (path=<optimized out>,
> test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack",
> tc=0x55555555db60) at ../glib/gtestutils.c:2988
> #5 g_test_run_suite_internal (suite=suite@entry=0x55555555da70,
> path=path@entry=0x0) at ../glib/gtestutils.c:3090
> #6 0x00007ffff7e8f2db in g_test_run_suite_internal
> (suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at
> .../glib/gtestutils.c:3109
> #7 0x00007ffff7e8f2db in g_test_run_suite_internal
> (suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at
> .../glib/gtestutils.c:3109
> #8 0x00007ffff7e8f86a in g_test_run_suite
> (suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
> #9 0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
> #10 0x00005555555561f7 in main (argc=<optimized out>,
> argv=<optimized out>) at ../glib/tests/unix.c:910

Can you get a disassembly and register dump at the point of crash?

My best guess is that this is a simple stack overflow. There's really
not any other plausible reason for a segfault in kill(). The only
operations that touch memory in it are (on my build at least) a push
to realign the stack, and a call to __syscall_ret.

I'm not sure if the crashing code is running on the signal stack or
main stack, but here's a thought: is it possible the CI machines are
running on a cpu/kernel with some monster AVX512 or whatever extension
enabled, with a register file that doesn't fit in MINSIGSTKSZ? If so,
using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
defined) to allocate the alt stack should mitigate the problem. If
doing this, it should probably be allocated by mmap or malloc, since
in principle it could be too large for the caller's stack.

It's also possible that the kernel may have some weird behavior
deciding if the task is already "running on the alt stack" when the
alt stack is embedded in the normal stack like this. Just getting rid
of that might be worth trying. If so, whether the problem manifests
could be subject to timing of signal delivery (although I would not
expect that for synchronously generated signals like here).

Rich
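
For illustration, here is a minimal sketch of the sysconf(_SC_MINSIGSTKSZ) suggestion above. It is not the GLib test code or anything from musl. The names alt_stack_size, on_sighup, altstack_selftest and the HANDLER_HEADROOM value are hypothetical, picked only for this example. It sizes the alt stack at run time (falling back to SIGSTKSZ when _SC_MINSIGSTKSZ is not defined), allocates it with mmap rather than on the caller's stack, and then sends SIGHUP to itself with SA_ONSTACK set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: extra room for the handler's own frames; not a documented value. */
#define HANDLER_HEADROOM (64 * 1024)

static void *alt_base;
static size_t alt_size;
static volatile sig_atomic_t got_signal;
static volatile sig_atomic_t ran_on_altstack;

/* Pick a size at run time so a large signal frame (e.g. wide vector state) fits. */
static size_t alt_stack_size(void)
{
    size_t size = SIGSTKSZ;
#ifdef _SC_MINSIGSTKSZ
    long min = sysconf(_SC_MINSIGSTKSZ);
    if (min > 0 && (size_t)min > size)
        size = (size_t)min;
#endif
    return size + HANDLER_HEADROOM;
}

/* Record that the handler ran and whether its frame lies inside the alt stack. */
static void on_sighup(int sig)
{
    char probe;
    uintptr_t p = (uintptr_t)&probe;
    uintptr_t lo = (uintptr_t)alt_base;

    (void)sig;
    got_signal = 1;
    ran_on_altstack = (p >= lo && p < lo + alt_size);
}

/* Returns 0 if SIGHUP was handled on the mmap'd alt stack, -1 otherwise. */
int altstack_selftest(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_size = alt_stack_size();
    assert(alt_size >= (size_t)MINSIGSTKSZ);

    /* Allocate off the caller's stack, since the size may be large. */
    alt_base = mmap(NULL, alt_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (alt_base == MAP_FAILED)
        return -1;

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_base;
    ss.ss_size = alt_size;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGHUP, &sa, NULL) != 0)
        return -1;

    /* Same operation as the failing test: signal ourselves. */
    if (kill(getpid(), SIGHUP) != 0)
        return -1;

    return (got_signal && ran_on_altstack) ? 0 : -1;
}
```

Calling altstack_selftest() from main and checking for a 0 return is enough to try it. If a variant with a fixed-size alt stack embedded in a local array crashes on the same machine while this one passes, that would point toward the size or placement of the alt stack rather than kill() itself.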