From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <musl-return-20970-ml=inbox.vuxu.org@lists.openwall.com>
Received: from second.openwall.net (second.openwall.net [193.110.157.125])
	by inbox.vuxu.org (Postfix) with SMTP id 3CD72214EA
	for <ml@inbox.vuxu.org>; Wed, 29 May 2024 15:15:24 +0200 (CEST)
Received: (qmail 10091 invoked by uid 550); 29 May 2024 13:15:20 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 10044 invoked from network); 29 May 2024 13:15:19 -0000
Date: Wed, 29 May 2024 09:15:34 -0400
From: Rich Felker <dalias@libc.org>
To: Pablo Correa Gomez <pabloyoyoista@postmarketos.org>
Cc: musl@lists.openwall.com
Message-ID: <20240529131533.GH10433@brightrain.aerifal.cx>
References: <d8475b607b0c728b9133846c4faa469f9e4cad16.camel@postmarketos.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <d8475b607b0c728b9133846c4faa469f9e4cad16.camel@postmarketos.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK

On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> Hi everybody,
> 
> I am responsible for musl CI in GNOME's GLib, and we have recently
> bumped into a crash that I have been unable to resolve. 
> 
> https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
> introduced in GLib an alternate stack by setting the signal action
> SA_ONSTACK if available. However, the tests that were introduced, and
> that pass in most other libc's (there's CI for a lot more than just
> glibc and musl) crash in my alpine linux edge installation with SIGSEGV
> (stack trace below) while doing: kill (getpid(), SIGHUP)
> 
> I have verified that not adding SA_ONSTACK fixes the crash. Would
> anybody have some pointers of what could possibly be going wrong? If
> anybody is really interested, the public issue is
> https://gitlab.gnome.org/GNOME/glib/-/issues/3315
> 
> Stack trace
> ------------
> 
> Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> ../arch/x86_64/syscall_arch.h:21
> warning: 21     ./arch/x86_64/syscall_arch.h: No such file or directory
> (gdb) bt
> #0  0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> ../arch/x86_64/syscall_arch.h:21
> #1  kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
> #2  0x0000555555556e96 in test_signal (signum=signum@entry=1) at
> .../glib/tests/unix.c:534
> #3  0x0000555555557200 in test_signal_alternate_stack (signal=1) at
> .../glib/tests/unix.c:590
> #4  0x00007ffff7e8f364 in test_case_run (path=<optimized out>,
> test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack",
> tc=0x55555555db60) at ../glib/gtestutils.c:2988
> #5  g_test_run_suite_internal (suite=suite@entry=0x55555555da70,
> path=path@entry=0x0) at ../glib/gtestutils.c:3090
> #6  0x00007ffff7e8f2db in g_test_run_suite_internal
> (suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at
> .../glib/gtestutils.c:3109
> #7  0x00007ffff7e8f2db in g_test_run_suite_internal
> (suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at
> .../glib/gtestutils.c:3109
> #8  0x00007ffff7e8f86a in g_test_run_suite
> (suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
> #9  0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
> #10 0x00005555555561f7 in main (argc=<optimized out>, argv=<optimized
> out>) at ../glib/tests/unix.c:910

Can you get a disassembly and register dump at the point of crash? My
best guess is that this is a simple stack overflow. There's really not
any other plausible reason for a segfault in kill(). The only
operations that touch memory in it are (on my build at least) a push
to realign the stack, and a call to __syscall_ret.

I'm not sure if the crashing code is running on the signal stack or
main stack, but here's a thought: is it possible the CI machines are
running on a CPU/kernel with some monster AVX-512 (or similar)
extension enabled, whose register file doesn't fit in MINSIGSTKSZ? If
so, using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ
being defined) to size the alt stack should mitigate the problem. If
doing this, the alt stack should probably be allocated by mmap or
malloc, since in principle it could be too large for the caller's
stack.
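
A minimal sketch of that sizing and allocation, not the GLib code:
altstack_size() and install_altstack() are hypothetical helper names,
and doubling the sysconf() value for headroom is my own assumption
rather than anything the kernel or libc requires.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: choose a size for the alternate signal stack. */
static size_t altstack_size(void)
{
    size_t size = SIGSTKSZ;
#ifdef _SC_MINSIGSTKSZ
    /* Ask at runtime how large a signal frame can be on this CPU/kernel. */
    long min = sysconf(_SC_MINSIGSTKSZ);
    /* Assumption: 2x the minimum leaves room for the handler's own frames. */
    if (min > 0 && (size_t)min * 2 > size)
        size = (size_t)min * 2;
#endif
    return size;
}

/* Hypothetical helper: mmap an alternate stack and install it.
 * Returns 0 on success, -1 on failure (errno set by the failing call). */
static int install_altstack(stack_t *out)
{
    size_t size = altstack_size();
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    stack_t ss = { .ss_sp = mem, .ss_size = size, .ss_flags = 0 };
    if (sigaltstack(&ss, NULL) != 0) {
        munmap(mem, size);
        return -1;
    }
    if (out)
        *out = ss;
    return 0;
}
```

The mapping lives outside the caller's stack, so its size is not
limited by how much room the caller's frame has left.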

It's also possible that the kernel has some weird behavior when
deciding if the task is already "running on the alt stack" when the
alt stack is embedded in the normal stack like this. Just getting rid
of the embedding might be worth trying. If that is the cause, whether
the problem manifests could be subject to timing of signal delivery
(although I would not expect that for synchronously generated signals
like here).
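
One way to check which stack the handler actually runs on, as a rough
diagnostic sketch and not the GLib test: the alt stack is heap
allocated instead of embedded in the normal stack, and the handler
compares the address of a local variable against the alt stack range.
The names on_sighup() and check_handler_stack() and the 64 KiB size
are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran;
static volatile sig_atomic_t handler_on_altstack;
static uintptr_t alt_lo, alt_hi; /* set before the handler is installed */

static void on_sighup(int sig)
{
    int probe; /* lives on whatever stack the handler is running on */
    uintptr_t here = (uintptr_t)&probe;

    (void)sig;
    handler_on_altstack = (here >= alt_lo && here < alt_hi);
    handler_ran = 1;
}

/* Hypothetical driver: returns 1 if the handler ran on the alt stack,
 * 0 if it ran on some other stack, -1 on setup failure. */
static int check_handler_stack(void)
{
    size_t size = 64 * 1024; /* assumption: arbitrary, generous size */
    void *mem = malloc(size); /* kept allocated while installed */
    if (!mem)
        return -1;
    alt_lo = (uintptr_t)mem;
    alt_hi = alt_lo + size;

    stack_t ss = { .ss_sp = mem, .ss_size = size, .ss_flags = 0 };
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    struct sigaction sa = { 0 };
    sa.sa_handler = on_sighup;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGHUP, &sa, NULL) != 0)
        return -1;

    /* In a single-threaded process with SIGHUP unblocked, the signal
     * is delivered before kill() returns. */
    if (kill(getpid(), SIGHUP) != 0)
        return -1;

    return handler_ran ? (int)handler_on_altstack : -1;
}
```

If this reports 1 with a heap-allocated stack but the embedded variant
still crashes, that would point at the embedding rather than at the
stack size.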

Rich