* [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Pablo Correa Gomez @ 2024-05-29 12:04 UTC
To: musl
Hi everybody,

I am responsible for the musl CI in GNOME's GLib, and we have recently
run into a crash that I have been unable to resolve.

https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
introduced an alternate signal stack in GLib by setting the SA_ONSTACK
signal action flag when it is available. However, the tests that were
introduced, which pass with most other libcs (there is CI for a lot more
than just glibc and musl), crash on my Alpine Linux edge installation
with SIGSEGV (stack trace below) while doing: kill (getpid(), SIGHUP)

I have verified that not adding SA_ONSTACK fixes the crash. Would
anybody have some pointers on what could be going wrong? If anybody is
really interested, the public issue is
https://gitlab.gnome.org/GNOME/glib/-/issues/3315

Stack trace
------------
Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
./arch/x86_64/syscall_arch.h:21
warning: 21 ./arch/x86_64/syscall_arch.h: No such file or directory
(gdb) bt
#0 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
./arch/x86_64/syscall_arch.h:21
#1 kill (pid=17483, sig=sig@entry=1) at src/signal/kill.c:6
#2 0x0000555555556e96 in test_signal (signum=signum@entry=1) at
../glib/tests/unix.c:534
#3 0x0000555555557200 in test_signal_alternate_stack (signal=1) at
../glib/tests/unix.c:590
#4 0x00007ffff7e8f364 in test_case_run (path=<optimized out>,
test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack",
tc=0x55555555db60) at ../glib/gtestutils.c:2988
#5 g_test_run_suite_internal (suite=suite@entry=0x55555555da70,
path=path@entry=0x0) at ../glib/gtestutils.c:3090
#6 0x00007ffff7e8f2db in g_test_run_suite_internal
(suite=suite@entry=0x7ffff7ffee20, path=path@entry=0x0) at
../glib/gtestutils.c:3109
#7 0x00007ffff7e8f2db in g_test_run_suite_internal
(suite=suite@entry=0x7ffff7ffede0, path=path@entry=0x0) at
../glib/gtestutils.c:3109
#8 0x00007ffff7e8f86a in g_test_run_suite
(suite=suite@entry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
#9 0x00007ffff7e8f8ea in g_test_run () at ../glib/gtestutils.c:2275
#10 0x00005555555561f7 in main (argc=<optimized out>, argv=<optimized
out>) at ../glib/tests/unix.c:910
Best and thanks for your time,
Pablo.
* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Rich Felker @ 2024-05-29 13:15 UTC
To: Pablo Correa Gomez; +Cc: musl
Can you get a disassembly and register dump at the point of crash? My
best guess is that this is a simple stack overflow. There's really not
any other plausible reason for a segfault in kill(). The only
operations that touch memory in it are (on my build at least) a push
to realign the stack, and a call to __syscall_ret.
I'm not sure if the crashing code is running on the signal stack or
main stack, but here's a thought: is it possible the CI machines are
running on a CPU/kernel with some monster AVX512 or whatever extension
enabled, with a register file that doesn't fit in MINSIGSTKSZ? If so,
using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
defined) to allocate the alt stack should mitigate the problem. If
doing this, it should probably be allocated by mmap or malloc, since
in principle it could be too large for the caller's stack.
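
A minimal sketch of that suggestion, under stated assumptions: the helper names (altstack_size, install_altstack, raise_sighup_on_altstack) and the HANDLER_HEADROOM value are hypothetical and are not taken from GLib or musl. It sizes the alternate stack from sysconf(_SC_MINSIGSTKSZ) when that is defined, maps it with mmap instead of placing it on the caller's stack, and then repeats the kill (getpid(), SIGHUP) step from the failing test.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: room for the handler's own frames on top of the kernel's
 * signal frame; 16 KiB is an arbitrary illustrative value. */
#define HANDLER_HEADROOM (16 * 1024)

static uintptr_t alt_base;  /* start of the mapped alternate stack */
static size_t alt_size;     /* its size in bytes */
static volatile sig_atomic_t ran_on_altstack;

/* Prefer the runtime minimum when the libc exposes it. */
static size_t altstack_size(void)
{
    size_t size = SIGSTKSZ;
#ifdef _SC_MINSIGSTKSZ
    long min = sysconf(_SC_MINSIGSTKSZ);
    if (min > 0 && (size_t)min + HANDLER_HEADROOM > size)
        size = (size_t)min + HANDLER_HEADROOM;
#endif
    return size;
}

/* Map the alternate stack rather than carving it out of the caller's stack. */
static int install_altstack(void)
{
    stack_t ss;
    void *mem;

    alt_size = altstack_size();
    mem = mmap(NULL, alt_size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    alt_base = (uintptr_t)mem;
    ss.ss_sp = mem;
    ss.ss_size = alt_size;
    ss.ss_flags = 0;
    return sigaltstack(&ss, NULL);
}

static void on_sighup(int sig)
{
    char marker; /* lives on whichever stack the handler runs on */
    uintptr_t p = (uintptr_t)&marker;

    (void)sig;
    ran_on_altstack = (p >= alt_base && p < alt_base + alt_size);
}

/* Returns 1 if SIGHUP was handled on the alternate stack, 0 if it was
 * handled elsewhere, -1 on error. */
static int raise_sighup_on_altstack(void)
{
    struct sigaction sa;

    if (install_altstack() != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGHUP, &sa, NULL) != 0)
        return -1;
    ran_on_altstack = 0;
    if (kill(getpid(), SIGHUP) != 0)
        return -1;
    return ran_on_altstack;
}
```

A single-threaded caller would expect raise_sighup_on_altstack() to return 1. If the stack were still too small for the CPU's register state, the process would instead be killed by SIGSEGV as in the report.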
It's also possible that the kernel may have some weird behavior
deciding if the task is already "running on the alt stack" when the
alt stack is embedded in the normal stack like this. Just getting rid
of that might be worth trying. If so, whether the problem manifests
could be subject to timing of signal delivery (although I would not
expect that for synchronously generated signals like here).
Rich
* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Pablo Correa Gomez @ 2024-05-30 10:17 UTC
To: Rich Felker; +Cc: musl
Hi Rich, thanks a lot for your reply.
On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> Can you get a disassembly and register dump at the point of crash?
(gdb) layout asm
0x7ffff7fa96f9 <kill+7> movslq %esi,%rsi
0x7ffff7fa96fc <kill+10> mov $0x3e,%eax
0x7ffff7fa9701 <kill+15> syscall
>0x7ffff7fa9703 <kill+17> mov %rax,%rdi
0x7ffff7fa9706 <kill+20> call 0x7ffff7f7afb7 <__syscall_ret>
0x7ffff7fa970b <kill+25> add $0x8,%rsp
0x7ffff7fa970f <kill+29> ret
0x7ffff7fa9710 <killpg> test %edi,%edi
0x7ffff7fa9712 <killpg+2> js 0x7ffff7fa971b <killpg+11>
0x7ffff7fa9714 <killpg+4> neg %edi
0x7ffff7fa9716 <killpg+6> jmp 0x7ffff7fa96f2 <kill>
0x7ffff7fa971b <killpg+11> sub $0x8,%rsp
0x7ffff7fa971f <killpg+15> call 0x7ffff7f78bae <__errno_location>
0x7ffff7fa9724 <killpg+20> movl $0x16,(%rax)
0x7ffff7fa972a <killpg+26> mov $0xffffffff,%eax
0x7ffff7fa972f <killpg+31> add $0x8,%rsp
0x7ffff7fa9733 <killpg+35> ret
0x7ffff7fa9734 <psiginfo> mov (%rdi),%edi
0x7ffff7fa9736 <psiginfo+2> jmp 0x7ffff7fa973b <psignal>
0x7ffff7fa973b <psignal> push %r15
0x7ffff7fa973d <psignal+2> push %r14
0x7ffff7fa973f <psignal+4> push %r13
0x7ffff7fa9741 <psignal+6> lea 0x51938(%rip),%r13 # 0x7ffff7ffb080 <__stderr_FILE>
0x7ffff7fa9748 <psignal+13> push %r12
0x7ffff7fa974a <psignal+15> xor %r12d,%r12d
0x7ffff7fa974d <psignal+18> push %rbp
0x7ffff7fa974e <psignal+19> push %rbx
0x7ffff7fa974f <psignal+20> mov %rsi,%rbx
0x7ffff7fa9752 <psignal+23> sub $0x18,%rsp
0x7ffff7fa9756 <psignal+27> call 0x7ffff7fb5780 <strsignal>
(gdb) info registers
rax 0x0 0
rbx 0x7ffff7f55c30 140737353440304
rcx 0x7ffff7fa9703 140737353783043
rdx 0x0 0
rsi 0x1 1
rdi 0x525e 21086
rbp 0x1 0x1
rsp 0x7fffffffd5d0 0x7fffffffd5d0
r8 0x0 0
r9 0x80 128
r10 0x8 8
r11 0x202 514
r12 0x7ffff7ffdb5c 140737354128220
r13 0x1 1
r14 0x7fffffffd6d0 140737488344784
r15 0x7fffffffd6f0 140737488344816
rip 0x7ffff7fa9703 0x7ffff7fa9703 <kill+17>
eflags 0x202 [ IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
fs_base 0x7ffff7ffdb28 140737354128168
gs_base 0x0 0
Does this tell you anything?
> I'm not sure if the crashing code is running on the signal stack or
> main stack, but here's a thought: is it possible the CI machines are
> running on a cpu/kernel with some monster AVX512 or whatever
> extension
> enabled with register file that doesn't fit in MINSIGSTKSZ?
That might be the case. It would explain why I could not reproduce it
on the 9-year-old laptop I was using last month, but can reproduce it
now on a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P.
> If so,
> using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
> defined) to allocate the alt stack should mitigate the problem. If
> doing this, it should probably be allocated by mmap or malloc, since
> in principle it could be too large for the caller's stack.
>
I'll forward this to the maintainers, let's see if we can come up with
a solution. Thanks a lot for your feedback!
* Re: [musl] Crash in kill(..., SIGHUP) when using SA_ONSTACK
From: Markus Wichmann @ 2024-05-30 11:51 UTC
To: musl; +Cc: Pablo Correa Gomez
On Thu, May 30, 2024 at 12:17:59PM +0200, Pablo Correa Gomez wrote:
> On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> > On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> > > Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> > > 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > > ../arch/x86_64/syscall_arch.h:21
> (gdb) layout asm
>
> 0x7ffff7fa96f9 <kill+7> movslq %esi,%rsi
> 0x7ffff7fa96fc <kill+10> mov $0x3e,%eax
> 0x7ffff7fa9701 <kill+15> syscall
> >0x7ffff7fa9703 <kill+17> mov %rax,%rdi
> [...]
> Does this tell you anything?
>
It tells me that Rich's reasoning was correct. I'll explain further
down.
> > I'm not sure if the crashing code is running on the signal stack or
> > main stack, but here's a thought: is it possible the CI machines are
> > running on a cpu/kernel with some monster AVX512 or whatever
> > extension
> > enabled with register file that doesn't fit in MINSIGSTKSZ?
>
> That might be the case. Would explain why I could not reproduce in my
> 9-year old laptop I was running last month, but I can reproduce it now
> in a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P
>
That is exactly what the program is doing, according to the link you
provided in the OP.
> > It's also possible that the kernel may have some weird behavior
> > deciding if the task is already "running on the alt stack" when the
> > alt stack is embedded in the normal stack like this. Just getting rid
> > of that might be worth trying. If so, whether the problem manifests
> > could be subject to timing of signal delivery (although I would not
> > expect that for synchronously generated signals like here).
> >
Thankfully, we needn't speculate, as Linux is open source. The function
get_sigframe() will determine if the thread is currently executing on
the signal stack. It does that by determining that the sp is between
stack base and stack top. If that isn't the case, it will allocate a red
zone, else it will start at the top of the altstack. It will then try to
allocate a full frame. If that doesn't work (because it was already on
an altstack that overflowed, or it tried to enter an altstack that is
too small), then it will log the message "overflowed sigaltstack",
which you might find in dmesg, before returning a bogus address.
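
The range check and the overflow case can be illustrated with a simplified user-space model. This is only a sketch of the decision described above, not the kernel source: struct altstack, on_altstack(), place_frame() and BOGUS_FRAME are hypothetical names, the frame size is passed in as a parameter, and the red zone and FPU state alignment are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the "bogus address" returned on overflow (assumption). */
#define BOGUS_FRAME UINTPTR_MAX

struct altstack {
    uintptr_t base; /* lowest address of the alternate stack */
    size_t size;    /* its size in bytes */
};

/* True if sp lies within the alternate stack (above base, up to its top). */
static bool on_altstack(const struct altstack *as, uintptr_t sp)
{
    return sp > as->base && sp - as->base <= as->size;
}

/* Model of choosing where the signal frame goes for an SA_ONSTACK handler. */
static uintptr_t place_frame(const struct altstack *as, uintptr_t sp,
                             size_t frame_size)
{
    /* Not yet on the alternate stack: enter it at its top. */
    if (!on_altstack(as, sp))
        sp = as->base + as->size;

    /* Try to allocate the full frame below sp. */
    if (frame_size > sp)
        return BOGUS_FRAME;
    sp -= frame_size;

    /* The frame ran off the bottom: the altstack is too small or overflowed. */
    if (!on_altstack(as, sp))
        return BOGUS_FRAME;
    return sp;
}
```

In this model, a 2048-byte altstack with a 3000-byte frame yields BOGUS_FRAME, which corresponds to the case where writing the frame fails and the task gets SIGSEGV.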
Due to the bogus address, all calls to unsafe_put_user() in
x64_setup_rt_frame() will fail, and it will return EFAULT. This error
will be reported to signal_setup_done() and it will call
force_sigsegv(), which then reports a SIGSEGV at the "current" IP. Since
this all happens during a syscall, the current IP is the one directly
following the syscall instruction.
> > Rich
>
Ciao,
Markus